How to Create A K-Armed Bandit
Stationary Bandit
For a deeper look at the theory behind this, I strongly recommend reading the textbook; it's free if you google it and will do a much better job of explaining what is involved. This tutorial keeps things at a high level and only covers the coding side of it.
The way I approached this was similar to my neural-net-from-scratch approach: I went object-based. Thus the first step is to actually initialize the bandit. To do that, we first determine the number of actions (arms) required, which we can let the user define. Next, because this is stationary, we want to define a list of means and variances for the reward distributions of all the arms. The textbook goes with a normal distribution, so that is what I will be using as well. So our `__init__` function will look something like:
```python
def __init__(self, num_actions=5, default_mean=[0, 1], default_variance=[0, 0.1]):
    self.num_actions = num_actions
    # One mean and spread per action (sigma is used as the standard deviation below)
    self.mu = np.random.uniform(low=default_mean[0], high=default_mean[1], size=num_actions)
    self.sigma = np.random.uniform(low=default_variance[0], high=default_variance[1], size=num_actions)
    self.current_rewards = [0]*num_actions
    self.current_number_of_actions = [0]*num_actions
```
Here you can see we have defaults for all the values, which of course the user can change. We then initialize all our necessary variables. Since we need to keep track of the number of times each action was taken, as well as the reward estimate for each action, it's a good idea to initialize arrays of zeros for both when the object is created.
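As a quick sanity check (just a hypothetical snippet, assuming the `__init__` above lives in the `Machine` class we build below), you can construct the object and peek at what got initialized:

```python
machine = Machine(num_actions=3)

print(machine.mu)                         # e.g. [0.42 0.91 0.17] -- one mean per action
print(machine.sigma)                      # e.g. [0.05 0.08 0.02] -- one spread per action
print(machine.current_rewards)            # [0, 0, 0]
print(machine.current_number_of_actions)  # [0, 0, 0]
```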
The next step is the reward function. This is essentially a random number drawn from a normal distribution, and for each action we have already defined a mean and variance at initialization, which makes the function easy:
```python
def reward(self, action):
    # Draw a single reward from that action's normal distribution
    # (no size argument, so this returns a plain scalar rather than a length-1 array)
    return np.random.normal(self.mu[action], self.sigma[action])
```
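If you want to convince yourself the reward function behaves sensibly, a quick check (just a sketch, not part of the tutorial's code) is to sample one action many times and compare the empirical mean against the mean we stored for it:

```python
machine = Machine(num_actions=5)

# Sample action 0 a lot of times; the sample mean should land near machine.mu[0]
samples = [machine.reward(0) for _ in range(10_000)]
print(np.mean(samples), machine.mu[0])
```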
The step after this is the actual 'meat and potatoes' of the machine: the function that takes an action. When taking an action, essentially what you are doing is incrementing the count for the action you took, getting a reward for that action, and then updating the running average reward for that action, which is defined incrementally as:
- $Q_{n+1} = Q_n + \dfrac{1}{n}\left[R_n - Q_n\right]$
- $Q_n$ is the current reward estimate for that action, $R_n$ is the reward just received, and $n$ is the number of times that action has been taken
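If it's not obvious why that update is just the ordinary sample average written incrementally, a quick derivation (the standard incremental-mean argument, using nothing beyond the definitions above) goes:

- $Q_{n+1} = \dfrac{1}{n}\sum_{i=1}^{n} R_i = \dfrac{1}{n}\left[R_n + (n-1)Q_n\right] = Q_n + \dfrac{1}{n}\left[R_n - Q_n\right]$

since $Q_n$ is the average of the first $n-1$ rewards, so $(n-1)Q_n$ is their sum. The nice part is that we never have to store the individual rewards, only the current estimate and the count.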
Then all we need to do is add the second term, $\dfrac{1}{n}\left[R_n - Q_n\right]$, to the current reward estimate we are already keeping track of, making the final code:
```python
def try_action(self, action):
    self.current_number_of_actions[action] += 1       # n
    reward = self.reward(action)                       # R_n
    step = 1/self.current_number_of_actions[action]    # 1/n
    error = reward - self.current_rewards[action]      # R_n - Q_n
    self.current_rewards[action] += step*error         # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)
```
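If the incremental update still feels like magic, here's a tiny standalone check (not part of the bandit class, purely illustrative) that it tracks the plain running mean:

```python
rewards = np.random.normal(0.5, 0.1, size=1000)

q, n = 0.0, 0
for r in rewards:
    n += 1
    q += (1/n) * (r - q)        # the same update try_action performs

print(q, np.mean(rewards))      # the two agree up to floating-point error
```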
Now for the part where we actually train the machine and see results. Of course, you'll need matplotlib to plot them. Essentially what we are going to do is run the machine 1,000 or 10,000 times using the $\epsilon$-greedy method. What that means is: a fraction $1 - \epsilon$ of the time we choose the action with the highest current reward estimate, and the remaining $\epsilon$ of the time we pick a random action. This choice can be made with numpy's uniform random number generator, since every value it returns is equally likely. So all we need to check is whether the number returned from the random number generator is greater or less than epsilon, and pick an action accordingly. The loop (if only taking an action) will look like this:
```python
machine = Machine(num_actions=10)
epsilon = 0.1
steps = 10_000

for _ in range(steps):
    x = np.random.uniform()
    if x > epsilon:
        # Greedy: pick the action with the highest current reward estimate
        action = machine.current_rewards.index(max(machine.current_rewards))
    else:
        # Explore: pick a random action
        action = np.random.randint(machine.num_actions)
    machine.try_action(action)
```
However, since we do want to plot this, we also keep track of the average of the current reward estimates at every time step, leaving us with:
```python
machine = Machine(num_actions=10)
epsilon = 0.1
steps = 10_000
average_reward = []

for _ in range(steps):
    x = np.random.uniform()
    if x > epsilon:
        action = machine.current_rewards.index(max(machine.current_rewards))
    else:
        action = np.random.randint(machine.num_actions)
    machine.try_action(action)
    # Track the mean of the reward estimates across all actions at this step
    average_reward.append(sum(machine.current_rewards)/machine.num_actions)
```
Congratulations, we are now done; all we have left is to plot what we found, which is an easy one-liner:
```python
plt.plot(range(steps), average_reward)
```
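If you want the plot to be a little more readable, a slightly fuller (and entirely optional) version just adds labels and shows the figure:

```python
plt.plot(range(steps), average_reward)
plt.xlabel("Step")
plt.ylabel("Average reward estimate")
plt.title("epsilon-greedy, epsilon = 0.1")
plt.show()
```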
Putting it all together
```python
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm


class Machine():
    def __init__(self, num_actions=5, default_mean=[0, 1], default_variance=[0, 0.1]):
        self.num_actions = num_actions
        # One mean and spread per action (sigma is used as the standard deviation below)
        self.mu = np.random.uniform(low=default_mean[0], high=default_mean[1], size=num_actions)
        self.sigma = np.random.uniform(low=default_variance[0], high=default_variance[1], size=num_actions)
        self.current_rewards = [0]*num_actions
        self.current_number_of_actions = [0]*num_actions

    def try_action(self, action):
        self.current_number_of_actions[action] += 1       # n
        reward = self.reward(action)                       # R_n
        step = 1/self.current_number_of_actions[action]    # 1/n
        error = reward - self.current_rewards[action]      # R_n - Q_n
        self.current_rewards[action] += step*error         # Q_{n+1}

    def reward(self, action):
        # Returns a plain scalar (no size argument) so the reward estimates stay floats
        return np.random.normal(self.mu[action], self.sigma[action])


machine = Machine(num_actions=10)
epsilon = 0.1
steps = 10_000
average_reward = []

for _ in tqdm(range(steps)):
    x = np.random.uniform()
    if x > epsilon:
        # Greedy: pick the action with the highest current reward estimate
        action = machine.current_rewards.index(max(machine.current_rewards))
    else:
        # Explore: pick a random action
        action = np.random.randint(machine.num_actions)
    machine.try_action(action)
    average_reward.append(sum(machine.current_rewards)/machine.num_actions)

plt.plot(range(steps), average_reward)
plt.show()
```
Part 2: Non-stationary bandits, and UCB action selection for the stationary case, coming soon 😳 (too lazy to do it all in one sitting)