First Steps with TensorFlow and Deep Learning
In this post I’ll discuss neural networks and TensorFlow basics. In particular, I haven’t found many TensorFlow examples that start out with simpler training models that still make use of neural networks. If you want a "hello world" level introduction to TensorFlow, you can find, well, something like that here.
My plan is to build this into a series that will discuss some of the details involved in what is now commonly called Deep Learning, including convolutional neural networks, recurrent neural networks, etc., but you won’t find any of that in this first post. If you want an introduction to Deep Learning to hold you over in the meantime, I can recommend:

Lectures by Geoffrey Hinton, Andrew Ng, and Yann LeCun.

This course from Udacity.

This online book and tutorial by Michael Nielsen.
My goal with this and future posts is to address gaps or obstacles I encountered getting going with both TensorFlow and Deep Learning in general, and to provide some commentary on TensorFlow’s design from a fairly neutral standpoint. On the way I’ll provide some simple examples of both very high and moderately low level code in both Python and Clojure for reference and contrast.
As a side note: I think my preference for Clojure is starting to become painfully obvious in my Python these days, so even when I stick to Python the functional style will come through strongly anyways. Set expectations accordingly.
Why I’m Interested in Deep Learning
Remote sensing involves a lot of machine learning, and I’ve worked with neural networks off and on in my research. In particular, a classic use of neural networks is for classification tasks when you can’t rely on assumptions of normality in your training data and need a nonparametric technique. For example, building size and morphology attributes caused issues when I implemented a segment-level classification technique on LiDAR data for my Master’s thesis research. For that reason I used neural networks to classify segments as building/nonbuilding and to classify buildings by likely land use.
I didn’t pursue neural networks much further because at the time (well, in Geography, where we always seem to be several years behind the state of the art in Machine Learning) neural networks were not considered to be particularly strong methods, and the general consensus was that using one or two hidden layers was pretty much the most interesting thing you could do with them. Even Machine Learning researchers in Computer Science who specialized in Neural Networks were kind of looked on as a fringe group over a decade ago, but that’s changed significantly given several machine learning and AI milestones that have come out of Deep Learning.
The most promising aspect of Deep Learning for me is the possibility of removing many of the "magic touch of the analyst" steps of feature extraction, model selection, manual data transformations, etc. that make machine learning models traditionally difficult to generalize. I tried to take a small step towards improvements of this process with my dissertation research by finding features that would generalize between scenes and using those to automatically classify urban areas in images. I used those cross-scene features to provide labels for spectra extracted through an unsupervised process. I’m struck by the fact that deep learning tasks in image recognition take this process back one step further and provide a means to learn those features at multiple orders and levels of abstraction. To see what I’m referring to, have a look at some of the papers on deep convolutional networks.
Moreover, Deep Learning is proving powerful when combined with other systems, as a perceptual component in reinforcement learning and combined with tree search in AlphaGo. While I recognize what a huge milestone beating top human players at Go is, I find the nearness to human models of learning explored with the Atari paper and the mixture of deep and reinforcement learning components in the Go paper the most compelling.
I won’t deny that in many ways Deep Learning is being overhyped and it’s certainly being cargo culted incessantly, but there really is something to all the fuss and it’s worth exploring.
How TensorFlow Works and Why
I found TensorFlow quite jarring to work with at first. The primary culprit is the mixture of layers of abstraction. For example, you model neural networks (at least at the TensorFlow API level) with matrix multiplication:
weights = tf.Variable(tf.truncated_normal([n1, n2]))
biases = tf.Variable(tf.zeros([n2]))
lyr_out = tf.nn.relu(tf.matmul(data, weights) + biases)
This is similar to what you’d do directly against numpy or core.matrix if you were cooking up a neural network model from scratch (well, with some linear algebra ingredients in your pantry):
sigmoid = lambda x: 1. / (1. + np.exp(-x))
weights = np.random.normal(size=(100, 10))
biases = np.zeros(10)
sigmoid(np.dot(data, weights) + biases)
Or:
(let [weights (->> (repeatedly 1000 rand)
                   (reshape [100 10]))
      biases (zero-array [] [10])]
  (tanh (+ (dot data weights) biases)))
Ok, so relu, tanh, and sigmoid are not the same activation function, but you get the idea. The main thing is that this differs sharply from the typical sklearn abstraction level in Python (modified from an example on the test branch):
clf = MLPClassifier(algorithm='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)
Despite this closeness to the processing model, parameters that appear as magical, such as lbfgs, do make an appearance, albeit in a more verbose form:
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(pred, tf_train_labels)
cost_fn = tf.reduce_mean(cross_entropy)
opt = tf.train.GradientDescentOptimizer(lrate).minimize(cost_fn)
Moreover, I didn’t define the derivative in terms of the activation function output for use with the chain rule in my back propagation model. TensorFlow handles that for you. (The softmax_cross_entropy_with_logits isn’t quite so magical and mysterious either, but we’ll save that for a future post.)
So why this mixture of abstraction levels? It’s true that you can wrap this entire thing in your own sklearn-like layer of abstraction. I do so in a limited way for the logic function learning example I’ll be discussing, for instance. But TensorFlow aims to provide access that’s close to its data flow graph model. In all of the above steps, nothing has happened yet. The composition of functions, declaration of variables, etc. all implies a data flow graph, but I have not yet executed that data flow graph or trained a network, whereas in the numpy and core.matrix examples I’m computing values, i.e. already conducting feed forward steps.
Let’s break down into bullets a few important details of TensorFlow’s design:

function composition implies a graph of processing steps

in fact, operations, variables, constants, etc. are all added as nodes in the data flow graph.

nothing happens until you initialize and then run a session.

variables you defined will mutate while the session runs for processing steps (i.e. back propagation) that are implied but not explicitly stated.

there are heterogeneous processing nodes in the data flow graph (e.g. gpu and cpu friendly components)

this also means you encounter exceptions from functions seemingly at random for e.g. not using 32-bit floating point values.

it knows enough to give you some stuff for free (like differentiation for back propagation)

it has high level options that reflect the state of the art in terms of regularization, activation functions, etc.
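The "nothing happens until you run a session" bullet is worth internalizing, so here is a deliberately tiny sketch of deferred graph execution in plain Python (a hypothetical toy of my own, not how TensorFlow is actually implemented):

```python
class Node:
    """A graph node: constructing one records an operation but computes nothing."""
    def __init__(self, fn, *inputs):
        self.fn = fn          # the operation to apply when the graph runs
        self.inputs = inputs  # upstream nodes, i.e. the graph edges

    def run(self):
        # Only here does any computation happen: evaluate upstream
        # nodes recursively, then apply this node's operation.
        return self.fn(*(node.run() for node in self.inputs))

const = lambda v: Node(lambda: v)
add = lambda a, b: Node(lambda x, y: x + y, a, b)
mul = lambda a, b: Node(lambda x, y: x * y, a, b)

# Composing nodes only *describes* the computation...
graph = add(mul(const(2), const(3)), const(4))
# ...nothing is evaluated until we explicitly run it.
graph.run()  # -> 10
```

The same separation between describing a computation and executing it is what lets TensorFlow analyze the graph (to differentiate it, place nodes on devices, etc.) before anything runs.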
So TensorFlow is tailored to a specific way of doing things. It provides a very opinionated (and powerful) way of structuring deep learning workflows. It’s frequently more verbose and less evident than I would care for it to be, but now that I understand its design I find working with it to be fairly straightforward.
Learning Binary Logical Operators
I opted to build an example neural network training scenario with binary logical operators because (a) they’re very easy functions to understand, (b) it’s fairly simple to reason about what goes on in a neural network that learns them, (c) XOR demonstrates the capacity of neural networks to model nonlinear features and (d) NAND means computational power. It’s likely you already know about (c) and (d), but if you don’t, there’s a good and short treatment in this Udacity course.
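On point (d): NAND is functionally complete, so every other operator here can be assembled from it alone. A quick illustrative sketch:

```python
def nand(a, b):
    return not (a and b)

# Classic constructions using only NAND:
def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))

def xor(a, b):
    # the standard four-NAND construction
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

# Check every operator against its truth table
for a in (False, True):
    for b in (False, True):
        assert and_(a, b) == (a and b)
        assert or_(a, b) == (a or b)
        assert xor(a, b) == (a != b)
```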
Basic Neural Network models in Clojure
Carin Meier provides a good introduction to working with neural networks, building from linear algebra constructs with Clojure and core.matrix and not getting too bogged down in the elementary details. If it’s hard to make sense of what’s going on in my treatment, I highly recommend reading her blog post first. I’ll be using her k9 library as a starting point to introduce just enough TensorFlow/Deep Learning constructs into the traditional neural net model.
Here’s how you construct a neural network in k9:
(def nnet (construct-network 2 3 2))
Now we’re going to generate data to label outputs NAND and AND using one-hot encoded vectors (don’t pay any attention to the logical values as int hacks behind the curtain):
(defn nand-gen []
  (let [n1 (rand-int 2)
        n2 (rand-int 2)
        and? (if (= n1 n2 1) 1 0)
        nand? (if (zero? and?) 1 0)]
    [[n1 n2] [and? nand?]]))
Let’s get some training data:
(def train-samples (repeatedly 1000 nand-gen))
And use it to train the network (this will take a little while; k9 is not intended to be an optimized library):
(def trained-network
  (train-epochs 500 nnet train-samples 0.5))
We end up with a network that can model AND or NAND fairly well.
Note: You may require a different learning rate, more epochs, etc.
(ff [1 1] trained-network)
; [0.9988687446324157 1.0871193576451767E-6]
(ff [1 0] trained-network)
; [1.1711233357211965E-5 0.9987813583560773]
Some of the differences in what we’ll be doing with TensorFlow are changes to the actual neural net model. For one, TensorFlow is meant to be GPU optimized, so exponentials and implied exponentials (i.e. in tanh) are no good. We’ll switch to using relu, or a rectified linear activation model. It would look something like this:
(defn relu
  [x]
  (max 0.0 x))

(defn drelu
  [y]
  (cond (< y 0.0) 0
        (> y 0.0) 1
        (zero? y) (throw (Exception. "should have used LRelu!"))))
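For contrast, the same pair in numpy terms (a sketch of mine; here I side-step the exception by adopting the common convention that the derivative at exactly 0 is just 0):

```python
import numpy as np

def relu(x):
    # elementwise max(0, x)
    return np.maximum(0.0, x)

def drelu(y):
    # 0 for negative inputs, 1 for positive; at exactly 0 we pick 0
    # (a common subgradient convention) instead of raising
    return (np.asarray(y) > 0).astype(np.float64)

relu(np.array([-2.0, 0.0, 3.0]))   # -> array([0., 0., 3.])
drelu(np.array([-2.0, 0.0, 3.0]))  # -> array([0., 0., 1.])
```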
There are big advantages in deep networks to using relu. There’s also some introduced complexity and a whole set of tools for handling the fact that you can differentiate relu arbitrarily close to 0 but not at 0, and that neurons can essentially go dead. We’ll skip them for now. One impact: for the final layer, we would use a function like softmax for the response. This would (a) guarantee outputs to be between 0 and 1 and (b) guarantee the output vector to sum to 1.0 (meaning outputs are proper probabilities).
(defn softmax [x]
  (/ (exp x) (reduce + (exp x))))
So if we have a really excited neuron, or other strange things happen, we still get good probabilities in the output vector:
(softmax [8 0.23 0.12])
; [0.999200193878278 4.2187557805360395E-4 3.7793054366837335E-4]
(softmax [2 -4 0.0001])
;[0.8788677886715844 0.0021784954441716386 0.11895371588424401]
Note that the softmax definition I give above is not matrix friendly. You’ll need to control the dimension over which you sum, as with this numpy example, to accommodate matrix variables:
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
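To see what the axis argument buys you, apply it to a matrix of logits with one sample per column (illustrative values of my own):

```python
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

# Two samples' logits, one sample per column
logits = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
probs = softmax(logits)
# Each column now sums to (numerically) 1.0, giving per-sample probabilities
assert np.allclose(probs.sum(axis=0), 1.0)
```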
Building Neural Networks with TensorFlow
With some of those model differences out of the way, let’s look at the actual means by which you construct and traverse a data flow graph to handle a neural network training workflow in TensorFlow. We’re going to start out with some analogous training data generators:
xor_fn = np.vectorize(lambda x, y: x != y)
nand_fn = np.vectorize(lambda x, y: not (x and y))
and_fn = np.vectorize(lambda x, y: x and y)
or_fn = np.vectorize(lambda x, y: x or y)
def sample_logic_fn(f, samples=10000):
    """Takes a vectorized function that takes two binary input values and
    generates labeled data for use in training or evaluating classifiers.
    """
    inputs = np.random.randint(2, size=(samples, 2))
    outputs = f(inputs[:, 0], inputs[:, 1])
    labels = np.transpose(np.vstack((outputs, outputs == False)))
    return inputs.astype(np.float32), labels.astype(np.float32)
We can generate some training or test data simply with:
train_data, train_labels = sample_logic_fn(xor_fn)
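A quick sanity check on what comes back (the assertions here are mine, just to illustrate the shapes and the one-hot property):

```python
import numpy as np

xor_fn = np.vectorize(lambda x, y: x != y)

def sample_logic_fn(f, samples=10000):
    inputs = np.random.randint(2, size=(samples, 2))
    outputs = f(inputs[:, 0], inputs[:, 1])
    labels = np.transpose(np.vstack((outputs, outputs == False)))
    return inputs.astype(np.float32), labels.astype(np.float32)

train_data, train_labels = sample_logic_fn(xor_fn)

assert train_data.shape == (10000, 2)
assert train_labels.shape == (10000, 2)
# Every label row is one-hot: exactly one of the two columns is 1
assert np.all(train_labels.sum(axis=1) == 1.0)
# Column 0 is hot exactly when the two inputs differ (XOR is true)
assert np.all(train_labels[:, 0] == (train_data[:, 0] != train_data[:, 1]))
```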
Variables and function calls create nodes in the TensorFlow graph, and need to be invoked in the context of the graph, i.e. with:
graph = tf.Graph()
with graph.as_default():
    tf_train_data = tf.Variable(train_data)
    tf_train_labels = tf.Variable(train_labels)
The calls to Variable can wrap numpy arrays (which is what I’m doing here). You can also provide tf.placeholder instead and populate that with values later. You’ll see this done really commonly with a feed_dict for e.g. subsets of the training data with stochastic gradient descent (SGD), often called batch training or minibatch training. I’m just wrapping the entire training dataset since it’s fairly small and the network I’m using is not deep. It keeps things simpler.
Now I’ll note that the verbosity of the TensorFlow processing chain led me to pull a couple of functions out from the typical example steps you’ll see in the various Jupyter notebooks going around. It’s important for me personally to break out some of the dimension munging and matrix multiplication steps to reduce the cognitive load of the sequence of operations that defines the nodes in the data flow graph.
The first set of steps I’ve pulled out constructs the variables we need that represent the layer weights and biases. I wanted to wrap and parameterize this function so that I could provide basic details about the structure of the neural network and have it generate the nodes for me:
def nn_model(n_input, n_classes, hiddens=[]):
    layer_map = []
    connections = [n_input] + hiddens + [n_classes]
    for n1, n2 in zip(connections[:-1], connections[1:]):
        weights = tf.Variable(tf.truncated_normal([n1, n2]))
        biases = tf.Variable(tf.zeros([n2]))
        layer_map.append((weights, biases))
    return layer_map
If that zip form looks weird to you and you’re a Clojurist, replace it in your head with something like the following to make sense of it:
(let [lyrs [100 50 10]]
  (mapv vector (butlast lyrs) (rest lyrs)))
;[[100 50][50 10]]
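Spelled out with concrete layer sizes, the zip pairing yields one (rows, columns) dimension pair per weight matrix between consecutive layers:

```python
connections = [100, 50, 10]  # input layer, one hidden layer, output layer
pairs = list(zip(connections[:-1], connections[1:]))
pairs  # -> [(100, 50), (50, 10)]
```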
I.e. I’m handling all the dimensions for matrix variables of weights for the matrix multiplication between layers. That kind of a representation lets me use a reduce-like composition between layer maps:
def feed_forward(layer_map, data):
    accum = data
    for l in layer_map[:-1]:
        accum = tf.nn.relu(tf.matmul(accum, l[0]) + l[1])
    return tf.matmul(accum, layer_map[-1][0]) + layer_map[-1][1]
Each intermediate activation goes through relu, but the final output doesn’t have an activation function applied to it… yet.
Now any of these TensorFlow variables need to be constructed in the context of a data flow graph, so the functions have to be called in that context. I don’t love this level of implicit coupling with the environment, but I understand why it’s there, and the with blocks do provide a decent way of structuring it. With the two function definitions above, we can create our neural network layers and define the data flow graph which will feed data forward through it.
model = nn_model(input_dims, n_labels, hiddens=hiddens)
pred = feed_forward(model, tf_train_data)
The next step is to determine the cost function that will be minimized by the optimizer as well as the optimizer that will be used for back propagation:
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(pred, tf_train_labels)
cost_fn = tf.reduce_mean(cross_entropy)
opt = tf.train.GradientDescentOptimizer(lrate).minimize(cost_fn)
train_pred = tf.nn.softmax(pred)
I won’t dive into softmax_cross_entropy_with_logits too much, but I will say that it is tied to our softmax function on the output and the relu activation layers. cross_entropy is a measure of the distance between the vector of probabilities output by the neural network and your one-hot encoded labels. Basically it knows the outputs from the network must be converted to probabilities (i.e. with softmax) to be compared for correspondence with the labels. However, when we back propagate through the graph and update weights, we need to use the distance scale that corresponds to the activation of the neurons, which is the output from relu (not a probability). We essentially cancel softmax for this case by using a log distance. You can read more about cross entropy here.
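In numpy terms, the combined operation amounts to something like the following (an illustrative sketch of mine, not TensorFlow’s numerically hardened implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(logits, one_hot):
    # distance between the predicted probability vector and the one-hot label
    return -np.sum(one_hot * np.log(softmax(logits)))

label = np.array([1.0, 0.0])
low = cross_entropy(np.array([5.0, 0.0]), label)   # confident and correct: small
high = cross_entropy(np.array([0.0, 5.0]), label)  # confident and wrong: large
assert low < high
```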
Each update of the weights, i.e. each training epoch, occurs when we run our model in a TensorFlow session:
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    for step in range(n_steps):
        _, cost_y, preds = session.run([opt, cost_fn, train_pred])
I have other logic written in to calculate the accuracy and print the progress every 100 epochs, but this is the part that actually runs the graph we built in the other with block. Note that opt, cost_fn, and train_pred are each variables we defined above (in the other with block). The list of arguments provided to session.run is referred to as fetches in the TensorFlow nomenclature. Every node referred to in fetches gets evaluated when run is called.
So in our case:

The optimization Operation runs and returns None (which is why we chuck the output into _).

The cost_fn and train_pred Tensors will each return values, which we retrieve as cost_y and preds.
We can look at the value of cost_y, and it’s good to see it constantly decreasing. If it bounces around a lot it’s a good hint that you either need to adjust your learning rate or make sure the nodes you’ve arranged in your data flow graph actually work together as expected.
I also use a basic accuracy measure (the mean rate of correct predictions):
def accuracy(preds, labels):
    return np.mean(np.argmax(preds, 1) == np.argmax(labels, 1))
The argmax calls in numpy, invoked this way (across axis 1), basically work this way: if my label is [0.0, 1.0], then the highest probability should correspond with the 1.0, so [0.1, 0.9] is a match, and as we’d apply a threshold to this it would actually be an exact correspondence. [0.6, 0.4] is a fail. This accuracy function has nothing to do with the optimization that occurs when the TensorFlow graph is evaluated; it’s just there for human readable output.
Running the entire thing, we can see some quick convergence for logic functions:
Cost function value as of step 0: 0.833859
Accuracy for training dataset: 0.244
Validation accuracy estimate: 0.736
Cost function value as of step 200: 0.002133
Accuracy for training dataset: 1.000
Validation accuracy estimate: 1.000
Cost function value as of step 400: 0.001000
Accuracy for training dataset: 1.000
Validation accuracy estimate: 1.000
If you want to run the code, you can get it here. I’ll be updating the repo over time with more examples as I continue this series of blog posts.