Deep Q Learning Basketball Agent

Deep Q Learning

Deep Q Learning is a method that uses a neural network to estimate the Q function, $Q(S_t, A_t)$. This function takes in a state, $S_t$, which represents the current playing field. In the case of the basketball agent it consisted of the following:

The function also takes in an action $A_t$, which represents the action the agent takes, in this case one of the following:

The function $Q$ then outputs a numeric value representing how good the action is given the state. The higher the value, the more desirable the action.
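As a tiny illustration of how that output gets used (the action names and numbers below are made up for illustration, not the agent's real action set), picking a move just means taking the argmax over the network's outputs for the current state:

import numpy as np

actions = ["dribble", "pass", "shoot"]    # hypothetical actions, for illustration only
q_values = np.array([0.8, -0.2, 1.5])     # made-up Q-value estimates for one state

best_action = actions[int(np.argmax(q_values))]   # picks "shoot", the highest-valued action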

The neural network "learns" an estimate of the true $Q$ function through exploration and rewards. For example, if we take an action $a$ in state $s$ and get a negative reward, we want our $Q$ function estimator to lower its value for that action in that state, so we are less likely to take it again. Similarly, a positive reward should push the estimate up.
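Concretely, the standard way to do this (this is the temporal-difference loss from the DQN paper, stated here for reference rather than anything specific to my code) is to nudge the network's output toward a target built from the observed reward and the estimated value of the next state:

$$ L(\theta) = \Big( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \Big)^2 $$

where $s'$ is the next state, $\gamma \in [0, 1)$ is a discount factor, and $\theta$ are the network weights. A negative reward pulls $Q(s, a)$ down, a positive one pushes it up. (The full DQN paper computes the target with a separate, periodically updated copy of the network; I omit that detail here.)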

The way to actually train the network is in fact quite simple. First we build up a set of transitions (state, action, reward, next state) as training data. This can be done with epsilon-greedy exploration: mostly follow the network's current best guess, but take a random action with probability epsilon, and record the states, actions, and rewards along the way. Then we use this backlog of transitions to actually train the network. More concretely:

Algorithm

D = [] # initialize a buffer of (state, action, reward, next state) transitions
while not converged:
    for _ in range(number_of_collection_steps):
        s = env.current_state
        e = random.random()
        if e < epsilon:
            a = env.get_random_action() # explore
        else:
            a = argmax(network.forward_propagate(s)) # exploit: pick the action with the highest Q-value estimate
        s_prime, r = env.get_new_state_and_reward(a) # apply the chosen action
        D.append((s, a, r, s_prime))
    train_network(D) # backpropagation on the squared TD-error loss
    D = []
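To make train_network concrete, here is a minimal sketch of what that update could look like. It uses PyTorch and a small fully connected network; the framework, architecture, and hyperparameters are my assumptions for illustration rather than what the repo necessarily uses, and it skips terminal-state handling and the separate target network from the full DQN paper.

import random
import torch
import torch.nn as nn

GAMMA = 0.99       # discount factor (assumed value)
BATCH_SIZE = 32    # minibatch size (assumed value)

class QNetwork(nn.Module):
    # Small fully connected network mapping a state vector to one Q-value per action.
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, x):
        return self.layers(x)

def train_network(network, optimizer, D):
    # One pass of minibatch updates over the collected transitions D = [(s, a, r, s_prime), ...].
    loss_fn = nn.MSELoss()
    random.shuffle(D)
    for i in range(0, len(D), BATCH_SIZE):
        batch = D[i:i + BATCH_SIZE]
        s = torch.tensor([t[0] for t in batch], dtype=torch.float32)
        a = torch.tensor([t[1] for t in batch], dtype=torch.int64)
        r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
        s_prime = torch.tensor([t[3] for t in batch], dtype=torch.float32)

        # Current estimate Q(s, a) for the actions that were actually taken.
        q_values = network(s).gather(1, a.unsqueeze(1)).squeeze(1)

        # TD target: r + gamma * max_a' Q(s', a'), held fixed so gradients only flow through Q(s, a).
        with torch.no_grad():
            target = r + GAMMA * network(s_prime).max(dim=1).values

        loss = loss_fn(q_values, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In the collection loop above you would build network = QNetwork(state_dim, num_actions) and optimizer = torch.optim.Adam(network.parameters()) once up front, then call train_network(network, optimizer, D) in place of the bare train_network(D).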

    

To see how I did it, check out the GitHub repo.

The easy part was in fact writing the code that trained the network. The hard part was writing the environment. When writing your own, you must be extremely careful about bugs: if a bug gives your agent an unintended advantage, the agent will exploit it every time and you will not get good results. For example, I had a bug that essentially awarded the agent a free throw on any foul. This caused the agent to simply spam dribbling until it got fouled, since the likelihood of a foul was quite high. The result was not desirable, since in real life that is neither what would happen nor an ideal way to play. After fixing the bug, however, the agent was forced to find new ways to play.

 

This blog is just meant to be an overview. I urge you to read the original DQN paper, as well as look at the code in the GitHub repo, to get a more concrete look at the approach.