Continuing from the previous post (Learning to use OpenAI Gym), I explore how to create a neural network policy that can train our cart to balance the pole for longer than the simple hard-coded policy. Even with a simple network, we should see an improvement, measured by the number of consecutive steps that the pole stays vertical.

Neural Network Policy

Use the import statements and plotting functions from the previous post

# size of the observation space: cart position, cart velocity, pole angle, pole angular velocity
num_inputs = 4

If you change the parameters of the model, make sure to clear the Keras session (for example with keras.backend.clear_session()) before retraining the model.

model = keras.models.Sequential([

    Model: "sequential"
    Layer (type)        Output Shape    Param #
    =================================================================
    dense (Dense)       (None, 20)      100
    dense_1 (Dense)     (None, 5)       105
    dense_2 (Dense)     (None, 1)       6
    Total params: 211
    Trainable params: 211
    Non-trainable params: 0
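
As a quick sanity check on the summary, the parameter counts can be reproduced by hand. The sketch below assumes every layer is a fully connected Dense layer with a bias, and reads the layer widths (4 inputs, then 20, 5, and 1 units) off the summary; the helper name dense_param_count is made up for this illustration.

```python
# Layer widths read off the summary: 4 inputs -> 20 -> 5 -> 1.
layer_sizes = [4, 20, 5, 1]


def dense_param_count(n_in, n_out):
    """Hypothetical helper: one weight per input/output pair, plus one bias per output unit."""
    return n_in * n_out + n_out


# Count the parameters of each consecutive pair of layers.
per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer)  # [100, 105, 6]
print(total)      # 211
```

The three values match the Param # column, and their sum matches the total.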

The input to our model is the environment’s observation, obs = [x,x,x,x], and the output determines the action that the cart will take (either 0 or 1). Since there are only two possible actions, we only need one output neuron with the sigmoid activation function: it outputs the probability of one action (here, going left), and the probability of the other action is one minus that value. If there were more than two actions, each action would get its own output neuron and we would use the softmax activation function.
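
To make the sigmoid versus softmax point concrete, here is a minimal standard-library sketch. It is not the Keras implementation; the function names and the example logits are assumptions chosen for illustration. It shows that one sigmoid output is enough for two actions, and that a two-way softmax gives the same probability.

```python
import math


def sigmoid(z):
    """Squash a single logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


# Two actions: one sigmoid output is enough.
z = 0.3  # hypothetical logit from the output neuron
left_prob = sigmoid(z)
right_prob = 1.0 - left_prob

# The same two-action case written as a softmax over logits [z, 0].
two_way = softmax([z, 0.0])

# Three hypothetical actions: one output per action, softmax over all of them.
three_way = softmax([1.0, 2.0, 3.0])

print(left_prob, right_prob)
print(two_way)
print(three_way, sum(three_way))
```

With more than two actions there is no single "one minus" shortcut, which is why each action needs its own output.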

In the code below, we decide our action by comparing a randomly generated number with the model’s output. For example, let’s say the randomly generated number is 0.40 and the probability of going left generated by the model is 0.50. We have a statement below that compares these two values. Since 0.40 > 0.50 is a false statement, we convert our boolean to the integer 0 and our action is to go left.
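
The comparison described above can be tried in isolation. This is a small sketch using only the standard library instead of the model; choose_action is a made-up name, and 0.8 is an arbitrary probability picked to show that the rule selects "left" about as often as the model says it should.

```python
import random


def choose_action(p, left_prob):
    """Return 0 (left) when p <= left_prob, otherwise 1 (right)."""
    return int(p > left_prob)


# The worked example from the text: 0.40 > 0.50 is False, so the action is 0.
example_action = choose_action(0.40, 0.50)

# Repeat the draw many times with a fixed (assumed) left probability.
rng = random.Random(0)
left_prob = 0.8
n_draws = 10_000
actions = [choose_action(rng.random(), left_prob) for _ in range(n_draws)]
fraction_left = actions.count(0) / n_draws

print(example_action)
print(fraction_left)  # close to left_prob
```

Because the random number is uniform on [0, 1), the action is 0 with probability left_prob and 1 otherwise, so the policy still explores the less likely action some of the time.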

We make our decisions based on this random probability because we want to find the right balance between exploring new actions and exploiting actions that we already know work well. Imagine you go to a cafe and randomly select a coffee. If you like it, the probability that you order it again next time you go is increased. However, you shouldn’t increase that probability to 100% because there might be other coffees to try that are even better than the first.

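def render_policy(model, max_steps=200, seed=42):
  frames = []
  env = gym.make("CartPole-v1", render_mode="rgb_array")
  # seed NumPy and the environment so the run is reproducible
  np.random.seed(seed)
  # reset() returns (observation, info); wrap the observation as a batch of one
  obs, info = env.reset(seed=seed)
  obs = np.array([obs])
  # keep track of how many consecutive steps the pole stays vertical
  totals = 0
  for step in range(max_steps):
    # store the current frame for the animation
    frames.append(env.render())
    # the model outputs the probability of going left (action 0)
    left_prob = model.predict(obs)
    # generate a random number in [0, 1)
    p = np.random.rand()
    # turn the boolean into an integer (False: 0 = left, True: 1 = right)
    action = int(p > left_prob[0, 0])
    print("p", p)
    print("left_prob", left_prob)
    print("action", action)
    # step() returns (observation, reward, terminated, truncated, info)
    obs, reward, terminated, truncated, info = env.step(action)
    obs = np.array([obs])
    totals += reward
    # stop as soon as the episode ends
    if terminated or truncated:
      break
  env.close()
  return frames, totals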

For our cart-pole environment, we can ignore past observations and actions because at each step we can see the environment’s full state.

For example, if the environment only revealed the cart’s position and not its velocity, you would have to consider both past and current observations in order to estimate the current velocity. We do not have to worry about this in our case.
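
To illustrate the hidden-velocity case, here is a rough sketch of how velocity could be estimated from two consecutive positions with a finite difference. The positions are made-up numbers, estimate_velocity is a hypothetical helper, and the 0.02 second step is an assumption about the simulator's time step.

```python
TAU = 0.02  # assumed time between two observations, in seconds


def estimate_velocity(prev_x, curr_x, dt=TAU):
    """Finite-difference estimate: change in position divided by elapsed time."""
    return (curr_x - prev_x) / dt


# Made-up cart positions from three consecutive steps.
positions = [0.00, 0.01, 0.03]

# Each estimate needs the previous position as well as the current one.
velocities = [
    estimate_velocity(prev_x, curr_x)
    for prev_x, curr_x in zip(positions, positions[1:])
]

print(velocities)
```

Note that three positions give only two velocity estimates: the first observation alone says nothing about how fast the cart is moving, which is exactly why a position-only environment would force the policy to remember the past.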

Let’s see how well a randomly initialized policy network performs:

frames,totals = render_policy(model)
    1/1 [==============================] - 0s 19ms/step
    p 0.3745401188473625
    left_prob [[0.49724284]]
    action 0
    1/1 [==============================] - 0s 20ms/step
    p 0.9507143064099162
    left_prob [[0.50142497]]
    action 1
    1/1 [==============================] - 0s 19ms/step
    p 0.7319939418114051
    left_prob [[0.4971607]]
    action 1
The network here is able to keep the pole vertical for 51 consecutive steps, which is an improvement over the previous hard-coded policy! Woo!
