In this post I’ll discuss neural networks and TensorFlow basics. In particular, I haven’t found many TensorFlow examples that start out with simpler training models that still make use of neural networks. If you want a “hello world” level introduction to TensorFlow, you can find, well, something like that here.

My plan is to build this into a series that will discuss some of the details involved in what is now commonly called Deep Learning, including convolutional neural networks, recurrent neural networks, etc., but you won’t find any of that in this first post. If you want an introduction to Deep Learning to hold you over in the meantime, I can recommend:

Lectures by Geoffrey Hinton, Andrew Ng, and Yann LeCun.

This course from Udacity.

This online book and tutorial by Michael Nielsen.

My goal with this and future posts is to address gaps or obstacles I encountered getting going with both TensorFlow and Deep Learning in general, and to provide some commentary on TensorFlow’s design from a fairly neutral standpoint. On the way I’ll provide some simple examples of both very high and moderately low level code in both Python and Clojure for reference and contrast.

As a side note: I think my preference for Clojure is starting to become painfully obvious in my Python these days, so even when I stick to Python the functional style will come through strongly anyways. Set expectations accordingly.

# Why I’m Interested in Deep Learning

Remote sensing involves a lot of machine learning, and I’ve worked with neural networks off and on in my research. In particular, a classic use of neural networks is for classification tasks when you can’t rely on assumptions of normality in your training data and need a nonparametric technique. For example, building size and morphology attributes caused issues when I implemented a segment level classification technique on LiDAR data for my Master’s thesis research. For this reason, I used neural networks to classify segments as building/non-building and to classify buildings by likely land use.

I didn’t pursue neural networks much farther because at the time (well, in Geography, where we seem to always be several years behind the state of the art in Machine Learning) neural networks were not considered to be particularly strong methods, and the general consensus was that using one or two hidden layers was pretty much the most interesting thing you could do with them. Even Machine Learning researchers in Computer Science who specialized in Neural Networks were kind of looked on as a fringe group over a decade ago, but that’s changed significantly given several machine learning and AI milestones that have come out of Deep Learning.

The most promising aspect of Deep Learning for me is the possibility of removing many of the “magic touch of the analyst” steps of feature extraction, model selection, manual data transformations, etc. that make machine learning models traditionally difficult to generalize. I tried to take a small step towards improvements of this process with my dissertation research by finding features that would generalize between scenes and using those to automatically classify urban areas in images. I used those cross-scene features to provide labels for spectra extracted through an unsupervised process. I’m struck by the fact that deep learning tasks in image recognition take this process back one step further and provide a means to learn those features at multiple orders and levels of abstraction. To see what I’m referring to, see some of the papers on deep convolutional networks.

Moreover, Deep Learning is proving powerful when combined with other systems, as a perceptual component in reinforcement learning and combined with tree search in AlphaGo. While I recognize the huge milestone beating top human players at Go is, I find the nearness to human models of learning explored with the Atari paper and the mixture of deep and reinforcement learning components in the Go paper the most compelling.

I won’t deny that in many ways Deep Learning is being overhyped and it’s certainly being cargo culted incessantly, but there really is something to all the fuss and it’s worth exploring.

# How TensorFlow Works and Why

I found TensorFlow quite jarring to work with at first. The primary culprit is the mixture of layers of abstraction. For example, you model neural networks (at least at the TensorFlow API level) with matrix multiplication:

```
weights = tf.Variable(tf.truncated_normal([n1, n2]))
biases = tf.Variable(tf.zeros([n2]))
lyr_out = tf.nn.relu(tf.matmul(data, weights) + biases)
```

This is similar to what you’d do directly against `numpy` or `core.matrix` if you were cooking up a neural network model from scratch (well, with some linear algebra ingredients in your pantry):

```
import numpy as np

sigmoid = lambda x: 1. / (1. + np.exp(-x))
weights = np.random.normal(size=(100, 10))  # 100 inputs, 10 outputs
biases = np.zeros(10)
sigmoid(np.dot(data, weights) + biases)     # data: array of shape (n_samples, 100)
```

Or:

```
(let [weights (-> (repeatedly 1000 rand)
                  (reshape [100 10]))
      biases (zero-array [] [10])]
  (tanh (+ (dot data weights) biases)))
```

Ok, so `relu`, `tanh`, and `sigmoid` are not the same activation function, but you get the idea. The main thing is that this differs sharply from the typical `sklearn` abstraction level in Python (modified from an example on the test branch):

```
clf = MLPClassifier(algorithm='l-bfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)
```

Despite this closeness to the processing model, parameters that seem as magical as `l-bfgs` do make an appearance, albeit in a more verbose form:

```
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(pred, tf_train_labels)
cost_fn = tf.reduce_mean(cross_entropy)
opt = tf.train.GradientDescentOptimizer(lrate).minimize(cost_fn)
```

Moreover, I didn’t define the derivative in terms of the activation function output for use with the chain rule in my back propagation model. TensorFlow handles that for you. (The `softmax_cross_entropy_with_logits` isn’t quite so magical and mysterious either, but we’ll save that for a future post).

So why this mixture of abstraction levels? It’s true that you can wrap this entire thing in your own `sklearn`-like layer of abstraction. I do so in a limited way for the logic function learning example I’ll be discussing, for instance. But TensorFlow aims to provide access that’s close to its data flow graph model. In all of the above steps, **nothing has happened yet**. The composition of functions, declaration of variables, etc. all implies a data flow graph, but I have not yet executed that data flow graph or trained a network, whereas in the `numpy` and `core.matrix` examples I’m computing values, i.e. already conducting feed forward steps.
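The deferred execution described above can be imitated in a few lines of plain Python. This is only a toy sketch to show the idea of building a graph first and evaluating it later; the `Node`, `constant`, and `run` names are made up for illustration and have nothing to do with how TensorFlow is actually implemented.

```python
# A toy deferred-evaluation graph: an illustration only, not TensorFlow's implementation.

class Node:
    """A graph node that records an operation and its inputs without computing anything."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def __add__(self, other):
        return Node(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Node(lambda a, b: a * b, self, other)


def constant(value):
    """Wrap a plain Python value as a node with no inputs."""
    return Node(lambda: value)


def run(node):
    """Evaluate a node by first evaluating everything it depends on."""
    return node.fn(*(run(i) for i in node.inputs))


x = constant(3.0)
w = constant(2.0)
b = constant(1.0)

y = x * w + b      # adds a multiply node and an add node; no arithmetic has run yet
result = run(y)    # only now are the multiply and the add executed
print(result)      # 7.0
```

Composing `x * w + b` returns another `Node` rather than a number, which is roughly the situation you are in with TensorFlow before a session runs.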

Let’s break down into bullets a few important details of TensorFlow’s design:

- Function composition implies a graph of processing steps.
- In fact, operations, variables, constants, etc. are all added as nodes in the data graph.
- Nothing happens until you initialize and then run a session.
- Variables you defined will mutate while the session runs, for processing steps (i.e. back propagation) that are implied but not explicitly stated.
- There are heterogeneous processing nodes in the data flow graph (e.g. GPU and CPU friendly components).
- This also means you encounter exceptions from functions seemingly at random, e.g. for not using 32-bit floating point values.
- It knows enough to give you some stuff for free (like differentiation for back propagation).
- It has high level options that reflect the state of the art in terms of regularization, activation functions, etc.

So TensorFlow is tailored to a specific way of doing things. It provides a very opinionated (and powerful) way of structuring deep learning workflows. It’s frequently more verbose and less evident than I would care for it to be, but now that I understand its design I find working with it to be fairly straightforward.

# Learning Binary Logical Operators

I opted to build an example neural network training scenario with binary logical operators because (a) they’re very easy functions to understand, (b) it’s fairly simple to reason about what goes on in a neural network that learns them, (c) XOR demonstrates the capacity of neural networks to model nonlinear features and (d) NAND means computational power. It’s likely you already know about (c) and (d), but if you don’t, there’s a good and short treatment in this Udacity course.
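As a quick illustration of the NAND point, here is a small sketch that builds NOT, AND, OR, and XOR out of nothing but NAND. The function names are my own for this example; the constructions themselves are the standard gate identities.

```python
def nand(a, b):
    """NAND on 0/1 integers."""
    return int(not (a and b))


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor_(a, b):
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))


# Print the truth table: a, b, AND, OR, XOR
for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_(a, b), or_(a, b), xor_(a, b))
```

If a network can learn NAND, then in principle stacking enough copies of it gets you any other boolean function, which is why it is an interesting target even though it looks trivial.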

## Basic Neural Network models in Clojure

Carin Meier provides a good introduction to working with neural networks, building from linear algebra constructs with Clojure and `core.matrix` and not getting too bogged down in the elementary details. If it’s hard to make sense of what’s going on in my treatment, I highly recommend reading her blog post first. I’ll be using her k9 library as a starting point to introduce just enough TensorFlow/Deep Learning constructs into the traditional neural net model.

Here’s how you construct a neural network in k9:

```
(def nnet (construct-network 2 3 2))
```

Now we’re going to generate data to label outputs NAND and AND using one-hot encoded vectors (don’t pay any attention to the logical values as int hacks behind the curtain):

```
(defn nand-gen []
  (let [n1 (rand-int 2)
        n2 (rand-int 2)
        and? (if (= n1 n2 1) 1 0)
        nand? (if (zero? and?) 1 0)]
    [[n1 n2] [and? nand?]]))
```

Let’s get some training data:

```
(def train-samples (repeatedly 1000 nand-gen))
```

And use it to train the network (will take a little while, k9 is not intended to be an optimized library):

```
(def trained-network
  (train-epochs 500 nnet train-samples 0.5))
```

We end up with a network that can model AND or NAND fairly well.

**Note**: You may require a different learning rate, more epochs, etc.

```
(ff [1 1] trained-network)
; [0.9988687446324157 -1.0871193576451767E-6]
(ff [1 0] trained-network)
; [1.1711233357211965E-5 0.9987813583560773]
```

Some of the differences in what we’ll be doing with TensorFlow are changes to the actual neural net model. For one, TensorFlow is meant to be GPU optimized, so exponentials and implied exponentials (i.e. in `tanh`) are no good. We’ll switch to using `relu`, or a rectified linear activation model. It would look something like this:

```
(defn relu
  [x]
  (max 0.0 x))

(defn d-relu
  [y]
  (cond (< y 0.0) 0
        (> y 0.0) 1
        (zero? y) (throw (Exception. "should have used LRelu!"))))
```

There are big advantages in deep networks to using `relu`. There’s also some introduced complexity and a whole set of tools for handling the fact that you can differentiate `relu` arbitrarily close to 0 but not at 0 and that neurons can essentially go dead. We’ll skip them for now. One impact: for the final layer, we would use a function like `softmax` for the response. This would (a) guarantee outputs to be between 0 and 1 and (b) guarantee the output vector to sum to 1.0 (meaning outputs are proper probabilities).

```
(defn soft-max [x]
  (/ (exp x) (reduce + (exp x))))
```

So if we have a really excited neuron, or other strange things happen, we still get good probabilities in the output vector:

```
(soft-max [8 0.23 0.12])
; [0.999200193878278 4.2187557805360395E-4 3.7793054366837335E-4]
(soft-max [2 -4 0.0001])
;[0.8788677886715844 0.0021784954441716386 0.11895371588424401]
```

Note that the `soft-max` definition I give above is not matrix friendly. You’ll need to control the dimension over which you sum, as with this numpy example, to accommodate matrix variables:

```
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
```

## Building Neural Networks with TensorFlow

With some of those model differences out of the way, let’s look at the actual means by which you construct and traverse a data flow graph to handle a neural network training workflow in TensorFlow. We’re going to start out with some analogous training data generators:

```
import numpy as np

xor_fn = np.vectorize(lambda x, y: x != y)
nand_fn = np.vectorize(lambda x, y: not (x and y))
and_fn = np.vectorize(lambda x, y: x and y)
or_fn = np.vectorize(lambda x, y: x or y)


def sample_logic_fn(f, samples=10000):
    """ Takes a vectorized function that takes two binary input values and
    generates labeled data for use in training or evaluating classifiers.
    """
    inputs = np.random.randint(2, size=(samples, 2))
    outputs = f(inputs[:,0], inputs[:,1])
    # column 0 is 1.0 where f is true, column 1 is 1.0 where f is false
    labels = np.transpose(np.vstack((outputs, outputs==False)))
    return inputs.astype(np.float32), labels.astype(np.float32)
```

We can generate some training or test data simply with:

```
train_data, train_labels = sample_logic_fn(xor_fn)
```
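To make the label layout concrete, here is a NumPy-free sketch that enumerates the four possible inputs instead of sampling them. `one_hot_logic_table` is a hypothetical helper written for this illustration (it is not part of the code in the repo); it follows the same convention as `sample_logic_fn`, with the first label column meaning "true" and the second meaning "false".

```python
def one_hot_logic_table(f):
    """Return (inputs, labels) for all four binary input pairs.

    Labels are [1.0, 0.0] when f is true and [0.0, 1.0] when it is false.
    """
    rows = [(x, y) for x in (0, 1) for y in (0, 1)]
    inputs = [[float(x), float(y)] for x, y in rows]
    labels = [[1.0, 0.0] if f(x, y) else [0.0, 1.0] for x, y in rows]
    return inputs, labels


xor_inputs, xor_labels = one_hot_logic_table(lambda x, y: x != y)
for row, label in zip(xor_inputs, xor_labels):
    print(row, label)
```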

Variables and function calls create nodes in the TensorFlow graph, and need to be invoked in the context of the graph, i.e. with:

```
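graph = tf.Graph()
with graph.as_default():
    # trainable=False keeps the optimizer from treating the data as parameters
    tf_train_data = tf.Variable(train_data, trainable=False)
    tf_train_labels = tf.Variable(train_labels, trainable=False)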
```

The calls to `Variable` can wrap numpy arrays (which is what I’m doing here). You can also provide `tf.placeholder` instead and populate that with values later. You’ll see this done really commonly with a `feed_dict` for e.g. subsets of the training data with stochastic gradient descent (SGD), often called batch training or minibatch training. I’m just wrapping the entire training dataset since it’s fairly small and the network I’m using is not deep. It keeps things simpler.
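For reference, the minibatch idea mentioned above mostly comes down to slicing the training data into shuffled chunks. Here is a rough sketch in plain Python; `minibatches` is a hypothetical helper, not something from TensorFlow or from the code in this post. In a placeholder-based setup, each yielded pair is what you would hand to the session through `feed_dict`.

```python
import random


def minibatches(data, labels, batch_size, seed=0):
    """Yield (data, labels) chunks in shuffled order, keeping rows paired up."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [data[i] for i in idx], [labels[i] for i in idx]


# Tiny XOR-style dataset: each label is the XOR of its input pair.
data = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 1], [1, 1]]
labels = [[x ^ y] for x, y in data]

batches = list(minibatches(data, labels, batch_size=4))
for batch_data, batch_labels in batches:
    print(len(batch_data), batch_data, batch_labels)
```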

Now I’ll note that the verbosity of the TensorFlow processing chain led me to pull a couple of functions out from the typical example steps you’ll see in the various Jupyter notebooks going around. It’s important for me personally to break out some of the dimension munging and matrix multiplication steps to reduce the cognitive load of the sequence of operations that defines the nodes in the data flow graph.

The first set of steps I’ve pulled out constructs the variables we need that represent the layer weights and biases. I wanted to wrap and parameterize this function so that I could provide basic details about the structure of the neural network and have it generate the nodes for me:

```
def nn_model(n_input, n_classes, hiddens=[]):
    layer_map = []
    connections = [n_input] + hiddens + [n_classes]
    for n1, n2 in zip(connections[:-1], connections[1:]):
        weights = tf.Variable(tf.truncated_normal([n1, n2]))
        biases = tf.Variable(tf.zeros([n2]))
        layer_map.append((weights, biases))
    return layer_map
```

If that `zip` form looks weird to you and you’re a Clojurist, replace it in your head with something like the following to make sense of it:

```
(let [lyrs [100 50 10]]
  (mapv vector (butlast lyrs) (rest lyrs)))
;[[100 50][50 10]]
```

I.e. I’m handling all the dimensions for matrix variables of weights for the matrix multiplication between layers. That kind of a representation lets me use a reduce-like composition between layer maps:

```
def feed_forward(layer_map, data):
    accum = data
    for l in layer_map[:-1]:
        accum = tf.nn.relu(tf.matmul(accum, l[0]) + l[1])
    return tf.matmul(accum, layer_map[-1][0]) + layer_map[-1][1]
```

Each intermediate activation goes through `relu`, but the final output doesn’t have an activation function applied to it… yet.
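To see the same feed forward structure with actual numbers, here is a plain Python sketch that mirrors it using nested lists instead of tensors. The names `affine`, `feed_forward_py`, and `xor_layers` are made up for this example, and the weights are picked by hand rather than learned; they are just one set of values for which a single `relu` hidden layer represents XOR.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def affine(v, weights, biases):
    """Row vector times matrix plus bias, i.e. v @ weights + biases."""
    return [sum(v[i] * weights[i][j] for i in range(len(v))) + biases[j]
            for j in range(len(biases))]


def feed_forward_py(layer_map, v):
    """relu on every layer except the last, which returns raw outputs."""
    for weights, biases in layer_map[:-1]:
        v = relu(affine(v, weights, biases))
    weights, biases = layer_map[-1]
    return affine(v, weights, biases)


# Hand-picked (not learned) weights:
#   h1 = relu(x + y), h2 = relu(x + y - 1)
#   outputs = [h1 - 2*h2, 1 - (h1 - 2*h2)], laid out as [true, false]
xor_layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0], [-2.0, 2.0]], [0.0, 1.0]),
]

for pair in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(pair, feed_forward_py(xor_layers, pair))
```

Without the `relu` between the two layers, the whole thing would collapse into a single linear map, which cannot separate XOR.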

Now any of these TensorFlow variables need to be constructed in the context of a data flow graph, so the functions have to be called in that context. I don’t love this level of implicit coupling with the environment, but I understand why it’s there and the `with` blocks do provide a decent way of structuring it. With the two function definitions above, we can create our neural network layer and define the data flow graph which will feed data forward through it.

```
model = nn_model(input_dims, n_labels, hiddens=hiddens)
pred = feed_forward(model, tf_train_data)
```

The next step is to determine the cost function that will be minimized by the optimizer as well as the optimizer that will be used for back propagation:

```
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(pred, tf_train_labels)
cost_fn = tf.reduce_mean(cross_entropy)
opt = tf.train.GradientDescentOptimizer(lrate).minimize(cost_fn)
train_pred = tf.nn.softmax(pred)
```

I won’t dive into `softmax_cross_entropy_with_logits` too much, but I will say that it is tied to our `softmax` function on the output and the `relu` activation layers. `cross_entropy` is a measure of the distance between the vector of probabilities output by the neural network and your one hot encoded labels. Basically it knows the outputs from the network must be converted to probabilities (i.e. with `softmax`) to be compared for correspondence with the labels. However, when we back propagate through the graph and update weights, we need to use the distance scale that corresponds to the activation of the neurons, which is the output from `relu` (not a probability). We essentially cancel `softmax` for this case by using a log distance. You can read more about cross entropy here.
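A small worked example may help here. The sketch below computes softmax and cross entropy by hand for a single sample, in plain Python, and is my own illustration rather than what TensorFlow does internally. It also shows the well known result that the gradient of softmax followed by cross entropy, taken with respect to the raw outputs (the logits), is just the probabilities minus the labels, which is one way to read the "cancelling" described above.

```python
import math


def softmax(logits):
    # Subtracting the max does not change the result and avoids overflow in exp.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Negative log probability assigned to the true class."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot))


logits = [2.0, -1.0]   # raw network outputs for one sample
label = [1.0, 0.0]     # one hot encoded label

probs = softmax(logits)
loss = cross_entropy(probs, label)

# Gradient of the loss with respect to the logits: probabilities minus labels.
grad = [p - t for p, t in zip(probs, label)]
print(probs, loss, grad)
```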

Each update of the weights, i.e. each training epoch, occurs when we run our model in a TensorFlow session:

```
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    for step in range(n_steps):
        _, cost_y, preds = session.run([opt, cost_fn, train_pred])
```

I have other logic written in to calculate the accuracy and print the progress every 100 epochs, but this is the part that actually runs the graph we built in the other `with` block. Note that `opt`, `cost_fn`, and `train_pred` are each variables we defined above (in the other `with` block). The list of arguments provided to `session.run` is referred to as `fetches` in the TensorFlow nomenclature. Every node referred to in fetches gets evaluated when `run` is called. So in our case:

- The optimization *Operation* runs and returns `None` (why we chuck the output into `_`).
- The `cost_fn` and `train_pred` *Tensor* nodes will each return values, which we retrieve as `cost_y` and `preds`.

We can look at the value of `cost_y`, and it’s good to see it constantly decreasing. If it bounces around a lot it’s a good hint that you either need to adjust your learning rate or make sure the nodes you’ve arranged in your data flow graph actually work together as expected.
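The learning rate effect is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on the one-parameter cost `w**2`; the `descend` helper and the two rates are made up for illustration and have no connection to the network in this post.

```python
def descend(lrate, steps=5, w=1.0):
    """Gradient descent on cost(w) = w**2, whose gradient is 2*w."""
    costs = []
    for _ in range(steps):
        costs.append(w ** 2)
        w = w - lrate * 2 * w
    return costs


steady = descend(0.1)   # each step moves toward the minimum, so the cost shrinks
bouncy = descend(1.1)   # each step overshoots the minimum, so the cost grows

print(steady)
print(bouncy)
```

With the larger rate every update jumps past the minimum to the other side and lands further away than it started, which is the kind of behavior a bouncing or growing `cost_y` hints at.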

I also use a basic `accuracy` measure (the fraction of predictions that match their labels):

```
def accuracy(preds, labels):
    return np.mean(np.argmax(preds,1) == np.argmax(labels,1))
```

The `argmax` calls in `numpy`, invoked this way (across axis 1), basically work like this: if my label is `[0.0, 1.0]`, then the highest probability should correspond with the `1.0`, so `[0.1, 0.9]` is a match, and as we’d apply a threshold to this it would actually be an exact correspondence. `[0.6, 0.4]` is a fail. This accuracy function has nothing to do with the optimization that occurs when the TensorFlow graph is evaluated - it’s just there for human readable output.
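Here is the same comparison spelled out without `numpy`, using the example rows from the paragraph above. `argmax` and `accuracy_py` are stand-in names for this sketch.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_py(preds, labels):
    """Fraction of rows where the most probable class matches the label."""
    hits = [argmax(p) == argmax(l) for p, l in zip(preds, labels)]
    return sum(hits) / len(hits)


preds = [[0.1, 0.9], [0.6, 0.4], [0.8, 0.2]]
labels = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

score = accuracy_py(preds, labels)   # the first and third rows match, the second fails
print(score)
```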

Running the entire thing, we can see some quick convergence for logic functions:

```
Cost function value as of step 0: 0.833859
Accuracy for training dataset: 0.244
Validation accuracy estimate: 0.736
Cost function value as of step 200: 0.002133
Accuracy for training dataset: 1.000
Validation accuracy estimate: 1.000
Cost function value as of step 400: 0.001000
Accuracy for training dataset: 1.000
Validation accuracy estimate: 1.000
```

If you want to run the code, you can get it here. I’ll be updating the repo over time with more examples as I continue this series of blog posts.