How to use an RNN as a policy


I am working through Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn and TensorFlow".

I found a great example of reinforcement learning in it.

In this example the author uses a simple neural network as the policy:

import numpy as np
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 4
n_hidden = 4
n_outputs = 1

learning_rate = 0.01

initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

hidden = fully_connected(X, n_hidden, activation_fn=tf.nn.elu, weights_initializer=initializer)
logits = fully_connected(hidden, n_outputs, activation_fn=None)
outputs = tf.nn.sigmoid(logits)  # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
# sample 1 of the 2 actions according to those probabilities
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)  # target: pretend the sampled action was the best one

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)

There the output is 1 of 2 actions.
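
For context, this is roughly how the book then trains with those gradient placeholders, as I understand it: play some games, discount and normalize the rewards, scale each step's gradients by that step's score, and feed the averaged gradients back in. A simplified sketch (the hyperparameters are mine, the environment is the book's CartPole-v0):

import gym

# assumed hyperparameters, not part of the graph above
n_iterations = 250
n_games_per_update = 10
n_max_steps = 1000
discount_rate = 0.95

def discount_rewards(rewards, discount_rate):
    # cumulative discounted sum, computed backwards over one game
    discounted = np.empty(len(rewards))
    cumulative = 0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    # normalize the discounted rewards across all games of one update
    all_discounted = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    return [(discounted - flat.mean()) / flat.std() for discounted in all_discounted]

env = gym.make("CartPole-v0")
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for iteration in range(n_iterations):
        all_rewards, all_gradients = [], []
        for game in range(n_games_per_update):
            current_rewards, current_gradients = [], []
            obs = env.reset()
            for step in range(n_max_steps):
                # run the policy and record the gradients at every step
                action_val, gradients_val = sess.run(
                    [action, gradients],
                    feed_dict={X: obs.reshape(1, n_inputs)})
                obs, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)
        # scale each step's gradient by its normalized score, then average
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict = {}
        for var_index, placeholder in enumerate(gradient_placeholders):
            mean_gradients = np.mean(
                [reward * all_gradients[game_index][step][var_index]
                 for game_index, rewards in enumerate(all_rewards)
                 for step, reward in enumerate(rewards)],
                axis=0)
            feed_dict[placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)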

Based on this example I am trying to build my own policy, with 3 output neurons and, most importantly, a recurrent neural network. I want to pick 1 of 3 actions.

My code so far:

n_inputs = 2  # 2 features at each time step
n_steps = 24  # time steps per sequence (matches the example X below)
n_neurons = 30
n_outputs = 3

learning_rate = 0.01
initializer = tf.contrib.layers.variance_scaling_initializer()

# example of X below
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) 

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)  # final RNN state -> 3 logits
out_softmax = tf.nn.softmax(logits)  # probabilities of the 3 actions

# sample 1 of the 3 actions according to those probabilities:
action = tf.multinomial(tf.log(out_softmax), num_samples=1)
y = tf.to_float(action)

# gradients
#
# ....
#
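
For the elided part I am assuming the same pattern as in the book's example, just with the sigmoid cross-entropy swapped for a sparse softmax cross-entropy and the sampled action used as the target (this adaptation is mine, not from the book):

# sampled action acts as the "correct" label for the policy gradient trick
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=action[:, 0], logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)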

If my input X is like:

#(X size = (1, 24, 2))
X = [[0,1,2,... 23], [0,1,2,... 23]]

I get this error:

Check failed: NDIMS == new_sizes.size() (2 vs. 1)

Process finished with exit code -1073740791 (0xC0000409)

Why? I was sure the output would be a single value like y = 0, 1 or 2 (action 0, 1 or 2).

Maybe I am misunderstanding something? Could you help?

On the other hand, if my X is like:

# (X size = (n, 24, 2), n >= 2), e.g.:
X =[[[0,1,2,... 23], [0,1,2,... 23]],
    [[0,1,2,... 23], [0,1,2,... 23]],
    [[0,1,2,... 23], [0,1,2,... 23]],
    [[0,1,2,... 23], [0,1,2,... 23]]]

my policy works fine.
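
For reference, this is how I build the inputs with NumPy (just my own sanity check; the (4, 24, 2) batch is the case that works):

import numpy as np

seq = np.stack([np.arange(24), np.arange(24)], axis=-1)  # one sequence: shape (24, 2)
X_batch = np.stack([seq, seq, seq, seq])  # shape (4, 24, 2) -- this case works
X_single = seq.reshape(1, 24, 2)          # shape (1, 24, 2) -- a single sample, still rank 3
print(X_batch.shape, X_single.shape)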

This turned into a long post; I hope it describes the problem well enough. If not, please let me know!

tensorflow
recurrent-neural-network
reinforcement-learning


