I am working through Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn and TensorFlow".
It contains a nice example of reinforcement learning.
In this example the author uses a simple neural network as the policy:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 4
n_hidden = 4
n_outputs = 1
learning_rate = 0.01
initializer = tf.contrib.layers.variance_scaling_initializer()
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = fully_connected(X, n_hidden, activation_fn=tf.nn.elu, weights_initializer=initializer)
logits = fully_connected(hidden, n_outputs, activation_fn=None)
outputs = tf.nn.sigmoid(logits) # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
# sample a single action (0 or 1) from the two probabilities
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)
Here the output is 1 of 2 actions.
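To make the sampling step concrete, here is a minimal pure-Python sketch of what I understand the lines around `tf.multinomial` to do for a single observation. The helper names `sigmoid` and `sample_binary_action` are my own and do not come from the book, and the logit value is made up for illustration.

```python
import math
import random

def sigmoid(z):
    """Logistic function, the same role as tf.nn.sigmoid for one value."""
    return 1.0 / (1.0 + math.exp(-z))

def sample_binary_action(logit, rng):
    """Hypothetical helper: mimic the sampling step for one observation."""
    p_left = sigmoid(logit)           # probability of action 0 (left)
    probs = [p_left, 1.0 - p_left]    # same layout as p_left_and_right
    # Draw one action index according to the two probabilities.
    action = rng.choices([0, 1], weights=probs, k=1)[0]
    # Target probability of "left" for the chosen action: 1.0 if action is 0, else 0.0.
    y = 1.0 - float(action)
    return action, y

rng = random.Random(0)                # fixed seed only to make the run repeatable
action, y = sample_binary_action(0.3, rng)
print(action, y)
```

The point is that a single sigmoid output is enough here only because there are exactly two actions.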
Based on this example I am trying to build my own policy, with 3 output neurons and, most importantly, a recurrent neural network. The policy should choose 1 of 3 actions.
My code at the moment:
n_inputs = 2 # two input values per time step
n_steps = 10 # each input sequence has 10 time steps
n_neurons = 30
n_outputs = 3
learning_rate = 0.01
initializer = tf.contrib.layers.variance_scaling_initializer()
# example of X below
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
out_softmax = tf.nn.softmax(logits) # I'm expecting 3 probabilities, one per action
# the action is sampled as 1 of the 3 actions, so:
action = tf.multinomial(tf.log(out_softmax), num_samples=1)
y = tf.to_float(action)
# gradients
#
# ....
#
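For comparison, this is a small pure-Python sketch of what I expect the softmax and sampling lines to produce for one observation with three actions. The function names `softmax` and `sample_action` and the logit values are my own assumptions for illustration; the real graph uses `tf.nn.softmax` and `tf.multinomial` on a whole batch.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Hypothetical helper: draw one action index from the softmax probabilities."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs

rng = random.Random(42)               # fixed seed only to make the run repeatable
action, probs = sample_action([0.5, -1.0, 2.0], rng)
print(action, probs)
```

Unlike the two-action case, the sampled index is already the action (0, 1 or 2), so I would not expect to need the `1. - action` flip here.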
If my input looks like this:
# X size = (1, 24, 2)
X = [[0,1,2,... 23], [0,1,2,... 23]]
I get this error:
Check failed: NDIMS == new_sizes.size() (2 vs. 1)
Process finished with exit code -1073740791 (0xC0000409)
Why? I was sure that the output should be a single value such as y = 0 (or 1 or 2), meaning action 0, 1 or 2.
Maybe I am misunderstanding something? Could you help?
Of course, if my X looks like this:
# X size = (n, 24, 2), n >= 2
# e.g.:
X =[[[0,1,2,... 23], [0,1,2,... 23]],
[[0,1,2,... 23], [0,1,2,... 23]],
[[0,1,2,... 23], [0,1,2,... 23]],
[[0,1,2,... 23], [0,1,2,... 23]]]
Then my policy works fine.
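One thing I can check without TensorFlow is the shape of the nested lists themselves. The sketch below assumes the two inputs are laid out literally as written above (two rows of 24 values each) and uses a hypothetical helper `nested_shape` of my own. The final transposition is only one possible way to get a `(1, 24, 2)` batch and is an assumption, not something I have confirmed fixes the error.

```python
def nested_shape(data):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)

n_steps = 24
# The single input as written: two rows of 24 values.
single = [list(range(n_steps)), list(range(n_steps))]
# The batched input as written: four copies of the same layout.
batch = [single for _ in range(4)]

print(nested_shape(single))        # (2, 24)
print(nested_shape(batch))         # (4, 2, 24)

# Assumption: swap to (n_steps, n_inputs) and wrap in a batch of one.
single_batch = [[list(pair) for pair in zip(*single)]]
print(nested_shape(single_batch))  # (1, 24, 2)
```

If the real arrays match this layout, the single input has only two dimensions, while the placeholder is declared with three (`[None, n_steps, n_inputs]`).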
This was a long post; I hope it describes the problem well enough. If not, please let me know!