February 22, 2016

First Steps with TensorFlow and Deep Learning

  1. Why I’m Interested in Deep Learning
  2. How TensorFlow Works and Why
  3. Learning Binary Logical Operators
    1. Basic Neural Network models in Clojure
    2. Building Neural Networks with TensorFlow

In this post I’ll discuss neural networks and TensorFlow basics. In particular, I haven’t found many TensorFlow examples that start out with simpler training models that still make use of neural networks. If you want a "hello world" level introduction to TensorFlow, you can find, well, something like that here.

My plan is to build this into a series that will discuss some of the details involved in what is now commonly called Deep Learning, including convolutional neural networks, recurrent neural networks, etc., but you won’t find any of that in this first post. If you want an introduction to Deep Learning to hold you over in the meantime, I can recommend:

My goal with this and future posts is to address gaps or obstacles I encountered getting going with both TensorFlow and Deep Learning in general, and to provide some commentary on TensorFlow’s design from a fairly neutral standpoint. Along the way I’ll provide simple code examples at both very high and moderately low levels of abstraction, in both Python and Clojure, for reference and contrast.

As a side note: I think my preference for Clojure is starting to become painfully obvious in my Python these days, so even when I stick to Python the functional style will come through strongly anyway. Set expectations accordingly.

Why I’m Interested in Deep Learning

Remote sensing involves a lot of machine learning, and I’ve worked with neural networks off and on in my research. In particular, a classic use of neural networks is for classification tasks where you can’t rely on assumptions of normality in your training data and need a nonparametric technique. For example, building size and morphology attributes caused issues when I implemented a segment-level classification technique on LiDAR data for my Master’s thesis research. For that reason I used neural networks to classify segments as building/non-building and to classify buildings by likely land use.

I didn’t pursue neural networks much further because at the time (well, in Geography, where we always seem to be several years behind the state of the art in Machine Learning) neural networks were not considered particularly strong methods, and the general consensus was that using one or two hidden layers was pretty much the most interesting thing you could do with them. Even Machine Learning researchers in Computer Science who specialized in neural networks were looked on as something of a fringe group over a decade ago, but that’s changed significantly given the machine learning and AI milestones that have come out of Deep Learning.

The most promising aspect of Deep Learning for me is the possibility of removing many of the "magic touch of the analyst" steps of feature extraction, model selection, manual data transformations, etc. that make machine learning models traditionally difficult to generalize. I tried to take a small step towards improving this process with my dissertation research by finding features that would generalize between scenes and using those to automatically classify urban areas in images. I used those cross-scene features to provide labels for spectra extracted through an unsupervised process. I’m struck by the fact that deep learning approaches to image recognition push this process back one step further and provide a means to learn those features themselves, at multiple levels of abstraction. To see what I’m referring to, see some of the papers on deep convolutional networks, for example:

Moreover, Deep Learning is proving powerful when combined with other systems, as a perceptual component in reinforcement learning and combined with tree search in AlphaGo. While I recognize what a huge milestone beating top human players at Go is, I find the nearness to human models of learning explored in the Atari paper and the mixture of deep and reinforcement learning components in the Go paper the most compelling.

I won’t deny that in many ways Deep Learning is being overhyped and it’s certainly being cargo culted incessantly, but there really is something to all the fuss and it’s worth exploring.

How TensorFlow Works and Why

I found TensorFlow quite jarring to work with at first. The primary culprit is the mixture of layers of abstraction. For example, you model neural networks (at least at the TensorFlow API level) with matrix multiplication:

weights = tf.Variable(tf.truncated_normal([n1, n2]))
biases = tf.Variable(tf.zeros([n2]))
lyr_out = tf.nn.relu(tf.matmul(data, weights) + biases)

This is similar to what you’d do directly against numpy or core.matrix if you were cooking up a neural network model from scratch (well, with some linear algebra ingredients in your pantry):

sigmoid = lambda x: 1. / (1. + np.exp(-x))
weights = np.random.normal(size=(100, 10))
biases = np.zeros(10)
sigmoid(np.dot(data, weights) + biases)

Or:

(let [weights (-> (repeatedly 1000 rand)
                  (reshape [100 10]))
      biases (zero-array [] [10])]
  (tanh (+ (dot data weights) biases)))

Ok, so relu, tanh, and sigmoid are not the same activation function, but you get the idea. The main thing is that this differs sharply from the typical sklearn abstraction level in Python (modified from an example on the test branch):

clf = MLPClassifier(algorithm='l-bfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)

Despite this closeness to the underlying processing model, parameters that appear magical, such as l-bfgs, still make an appearance in TensorFlow, albeit in a more verbose form:

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(pred, tf_train_labels)
cost_fn = tf.reduce_mean(cross_entropy)
opt = tf.train.GradientDescentOptimizer(lrate).minimize(cost_fn)

Moreover, I didn’t define the derivative in terms of the activation function output for use with the chain rule in my back propagation model. TensorFlow handles that for you. (The softmax_cross_entropy_with_logits isn’t quite so magical and mysterious either, but we’ll save that for a future post).
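
To see how little work that is on your end, here’s a minimal sketch reusing the cost_fn node from the snippet above and the weights variable from earlier. TensorFlow’s tf.gradients builds the backpropagation nodes for you; nothing is computed at this point, it just extends the graph:

# Sketch only: assumes cost_fn and weights are nodes in the current graph.
# tf.gradients adds the derivative computations to the graph; nothing runs yet.
grad_weights = tf.gradients(cost_fn, [weights])[0]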

So why this mixture of abstraction levels? It’s true that you can wrap this entire thing in your own sklearn-like layer of abstraction. I do so in a limited way for the logic function learning example I’ll be discussing, for instance. But TensorFlow aims to provide access that’s close to its data flow graph model. In all of the above steps, nothing has happened yet. The composition of functions, declaration of variables, etc. all implies a data flow graph, but I have not yet executed that data flow graph or trained a network, whereas in the numpy and core.matrix examples I’m computing values, i.e. already conducting feed forward steps.

Let’s break down into bullets a few important details of TensorFlow’s design:

  • function composition implies a graph of processing steps

  • in fact, operations, variables, constants, etc. are all added as nodes in the data flow graph.

  • nothing happens until you initialize and then run a session (see the small sketch after this list).

  • variables you defined will mutate as the session runs, driven by processing steps (e.g. back propagation) that are implied but not explicitly stated.

  • there are heterogeneous processing nodes in the data flow graph (e.g. GPU- and CPU-friendly components)

  • this also means you can hit exceptions from functions seemingly at random, e.g. for not using 32-bit floating point values.

  • it knows enough to give you some stuff for free (like differentiation for back propagation)

  • it has high level options that reflect the state of the art in terms of regularization, activation functions, etc.
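
To make the "nothing happens until you run a session" point concrete, here’s a tiny throwaway sketch of my own (not part of the logic gate example later in this post):

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    a = tf.constant(2.0)
    b = tf.constant(3.0)
    c = a * b  # adds a multiplication node to the graph; no arithmetic happens here

with tf.Session(graph=graph) as session:
    print(session.run(c))  # only now does the graph execute and produce 6.0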

So TensorFlow is tailored to a specific way of doing things. It provides a very opinionated (and powerful) way of structuring deep learning workflows. It’s frequently more verbose and less transparent than I would care for it to be, but now that I understand its design I find working with it to be fairly straightforward.

Learning Binary Logical Operators

I opted to build an example neural network training scenario with binary logical operators because (a) they’re very easy functions to understand, (b) it’s fairly simple to reason about what goes on in a neural network that learns them, (c) XOR demonstrates the capacity of neural networks to model nonlinear features and (d) NAND means computational power. It’s likely you already know about (c) and (d), but if you don’t, there’s a good and short treatment in this Udacity course.
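
As a quick aside on what "(d) NAND means computational power" is getting at: NAND is functionally complete, so every other binary logic gate can be composed from it. A throwaway sketch (mine, nothing to do with the neural network code that follows):

# Every binary logic gate can be built from NAND alone.
nand = lambda a, b: not (a and b)
not_ = lambda a: nand(a, a)
and_ = lambda a, b: not_(nand(a, b))
or_  = lambda a, b: nand(not_(a), not_(b))
xor  = lambda a, b: and_(or_(a, b), nand(a, b))

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [False, True, True, False]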

Basic Neural Network models in Clojure

Carin Meier provides a good introduction to working with neural networks, building from linear algebra constructs with Clojure and core.matrix and not getting too bogged down in the elementary details. If it’s hard to make sense of what’s going on in my treatment, I highly recommend reading her blog post first. I’ll be using her k9 library as a starting point to introduce just enough TensorFlow/Deep Learning constructs into the traditional neural net model.

Here’s how you construct a neural network in k9:

(def nnet (construct-network 2 3 2))

Now we’re going to generate data to label outputs NAND and AND using one-hot encoded vectors (don’t pay any attention to the logical values as int hacks behind the curtain):

(defn nand-gen []
  (let [n1 (rand-int 2)
        n2 (rand-int 2)
        and? (if (= n1 n2 1) 1 0)
        nand? (if (zero? and?) 1 0) ]
    [[n1 n2][and? nand?]]))

Let’s get some training data:

(def train-samples (repeatedly 1000 nand-gen))

And use it to train the network (will take a little while, k9 is not intended to be an optimized library):

(def trained-network
  (train-epochs 500 nnet train-samples 0.5))

We end up with a network that models AND and NAND fairly well.

Note: You may require a different learning rate, more epochs, etc.

(ff [1 1] trained-network)
; [0.9988687446324157 -1.0871193576451767E-6]
(ff [1 0] trained-network)
; [1.1711233357211965E-5 0.9987813583560773]

Some of the differences in what we’ll be doing with TensorFlow are changes to the actual neural net model. For one, TensorFlow is meant to be GPU optimized, so exponentials and implied exponentials (e.g. in tanh) are no good. We’ll switch to using relu, a rectified linear activation. It would look something like this:

(defn relu
  [x]
  (max 0.0 x))

(defn d-relu
  [y]
  (cond (< y 0.0) 0
        (> y 0.0) 1
        (zero? y) (throw (Exception. "should have used LRelu!"))))

There are big advantages in deep networks to using relu. There’s also some introduced complexity and a whole set of tools for handling the fact that you can differentiate relu arbitrarily close to 0 but not at 0, and that neurons can essentially go dead. We’ll skip those for now. One impact: for the final layer, we would use a function like softmax for the response. This would (a) guarantee outputs to be between 0 and 1 and (b) guarantee the output vector sums to 1.0 (meaning outputs are proper probabilities).

(defn soft-max [x]
  (/ (exp x) (reduce + (exp x))))

So if we have a really excited neuron, or other strange things happen, we still get good probabilities in the output vector:

(soft-max [8 0.23 0.12])
; [0.999200193878278 4.2187557805360395E-4 3.7793054366837335E-4]
(soft-max [2 -4 0.0001])
;[0.8788677886715844 0.0021784954441716386 0.11895371588424401]

Note that the soft-max definition I give above is not matrix friendly. You’ll need to control the dimension over which you sum, as in this numpy example, to accommodate matrix variables:

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
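
As a quick sanity check of my own (not part of the original workflow): with samples laid out as columns and this axis=0 softmax, each column of the result sums to 1, i.e. each column is a proper probability vector.

import numpy as np

def softmax(x):
    # same definition as above: sum over axis 0, i.e. samples as columns
    return np.exp(x) / np.sum(np.exp(x), axis=0)

batch = np.array([[8.0,  2.0],
                  [0.23, -4.0],
                  [0.12, 0.0001]])
print(softmax(batch).sum(axis=0))  # [ 1.  1.]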

Building Neural Networks with TensorFlow

With some of those model differences out of the way, let’s look at the actual means by which you construct and traverse a data flow graph to handle a neural network training workflow in TensorFlow. We’re going to start out with some analogous training data generators:

xor_fn = np.vectorize(lambda x, y: x != y)
nand_fn = np.vectorize(lambda x, y: not (x and y))
and_fn = np.vectorize(lambda x, y: x and y)
or_fn = np.vectorize(lambda x, y: x or y)

def sample_logic_fn(f, samples=10000):
    """ Takes a vectorized function that takes two binary input values and
    generates labeled data for use in training or evaluating classifiers.
    """
    inputs = np.random.randint(2, size=(samples, 2))
    outputs = f(inputs[:,0], inputs[:,1])
    labels = np.transpose(np.vstack((outputs, outputs==False)))
    return inputs.astype(np.float32), labels.astype(np.float32)

We can generate some training or test data simply with:

train_data, train_labels = sample_logic_fn(xor_fn)
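
For reference, here’s a quick look at what that hands back (my own check, reusing sample_logic_fn and xor_fn from above; the exact rows vary since the inputs are drawn at random):

print(train_data.shape, train_labels.shape)  # (10000, 2) (10000, 2)
print(train_data[0], train_labels[0])        # e.g. [ 0.  1.] [ 1.  0.] for XOR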

Variables and function calls create nodes in the TensorFlow graph, and need to be invoked in the context of the graph, i.e. with:

graph = tf.Graph()
with graph.as_default():
    tf_train_data = tf.Variable(train_data)
    tf_train_labels = tf.Variable(train_labels)

The calls to Variable can wrap numpy arrays (which is what I’m doing here). You can also provide tf.placeholder instead and populate that with values later. You’ll see this done really commonly with a feed_dict for, e.g., subsets of the training data with stochastic gradient descent (SGD), often called batch or minibatch training. I’m just wrapping the entire training dataset since it’s fairly small and the network I’m using is not deep. It keeps things simpler.
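
For completeness, the placeholder pattern looks roughly like this. This is a sketch of the common minibatch SGD setup, not the code I’m using in this post, and batch_size is a made-up parameter:

batch_size = 128

graph = tf.Graph()
with graph.as_default():
    # Placeholders reserve a spot in the graph; the actual values arrive at run time.
    tf_batch_data = tf.placeholder(tf.float32, shape=(batch_size, 2))
    tf_batch_labels = tf.placeholder(tf.float32, shape=(batch_size, 2))
    # ... build the rest of the graph against tf_batch_data / tf_batch_labels ...

# Then each training step feeds in a fresh slice of the training data:
#   offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
#   feed = {tf_batch_data: train_data[offset:offset + batch_size, :],
#           tf_batch_labels: train_labels[offset:offset + batch_size, :]}
#   session.run([opt, cost_fn, train_pred], feed_dict=feed)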

Now I’ll note that the verbosity of the TensorFlow processing chain led me to pull a couple of functions out of the typical example steps you’ll see in the various Jupyter notebooks going around. It’s important for me personally to break out some of the dimension munging and matrix multiplication steps to reduce the cognitive load of the sequence of operations that defines the nodes in the data flow graph.

The first set of steps I’ve pulled out constructs the variables we need that represent the layer weights and biases. I wanted to wrap and parameterize this function so that I could provide basic details about the structure of the neural network and have it generate the nodes for me:

def nn_model(n_input, n_classes, hiddens=[]):
    layer_map = []
    connections = [n_input] + hiddens + [n_classes]
    for n1, n2 in zip(connections[:-1], connections[1:]):
        weights = tf.Variable(tf.truncated_normal([n1, n2]))
        biases = tf.Variable(tf.zeros([n2]))
        layer_map.append((weights, biases))
    return layer_map

If that zip form looks weird to you and you’re a Clojurist, replace it in your head with something like the following to make sense of it:

(let [lyrs [100 50 10]]
  (mapv vector (butlast lyrs) (rest lyrs)))
;[[100 50][50 10]]

I.e. I’m handling all the dimensions for matrix variables of weights for the matrix multiplication between layers. That kind of a representation lets me use a reduce-like composition between layer maps:

def feed_forward(layer_map, data):
    accum = data
    for l in layer_map[:-1]:
        accum = tf.nn.relu(tf.matmul(accum, l[0]) + l[1])
    return tf.matmul(accum, layer_map[-1][0]) + layer_map[-1][1]

Each intermediate activation goes through relu, but the final output doesn’t have an activation function applied to it... yet.

Now, any of these TensorFlow variables need to be constructed in the context of a data flow graph, so the functions have to be called in that context. I don’t love this level of implicit coupling with the environment, but I understand why it’s there, and the with blocks do provide a decent way of structuring it. With the two function definitions above, we can create our neural network layers and define the data flow graph that will feed data forward through them.

model = nn_model(input_dims, n_labels, hiddens=hiddens)
pred = feed_forward(model, tf_train_data)

The next step is to determine the cost function that will be minimized by the optimizer as well as the optimizer that will be used for back propagation:

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(pred, tf_train_labels)
cost_fn = tf.reduce_mean(cross_entropy)
opt = tf.train.GradientDescentOptimizer(lrate).minimize(cost_fn)
train_pred = tf.nn.softmax(pred)

I won’t dive into softmax_cross_entropy_with_logits too much, but I will say that it is tied to our softmax function on the output and the relu activation layers. cross_entropy is a measure of the distance between the vector of probabilities output by the neural network and your one-hot encoded labels. Basically it knows the outputs from the network must be converted to probabilities (i.e. with softmax) to be compared for correspondence with the labels. However, when we back propagate through the graph and update weights, we need to use the distance scale that corresponds to the activation of the neurons, which is the output from relu (not a probability). We essentially cancel the softmax for this purpose by using a log distance. You can read more about cross entropy here.
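
If it helps to see cross entropy outside of TensorFlow, here’s a small numeric illustration of my own, assuming a softmax output vector and a one-hot label:

import numpy as np

# D(p, l) = -sum(l * log(p)): small when the predicted probability mass sits on
# the labeled class, large when it doesn't.
def cross_entropy(probs, one_hot):
    return -np.sum(one_hot * np.log(probs))

print(cross_entropy(np.array([0.9, 0.1]), np.array([1.0, 0.0])))  # ~0.105
print(cross_entropy(np.array([0.1, 0.9]), np.array([1.0, 0.0])))  # ~2.303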

Each update of the weights, i.e. each training epoch, occurs when we run our model in a TensorFlow session:

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    for step in range(n_steps):
        _, cost_y, preds = session.run([opt, cost_fn, train_pred])

I have other logic written in to calculate the accuracy and print the progress every 100 epochs, but this is the part that actually runs the graph we built in the other with block. Note that opt, cost_fn, and train_pred are each variables we defined above (in the other with block). The list of arguments provided to session.run is referred to as fetches in the TensorFlow nomenclature. Every node referred to in fetches gets evaluated when run is called. So in our case:

  • The optimization Operation runs and returns None (why we chuck the output into _).

  • The cost_fn and train_pred Tensors will each return values, which we retrieve as cost_y and preds.

We can look at the value of cost_y, and it’s good to see it steadily decreasing. If it bounces around a lot, that’s a good hint that you either need to adjust your learning rate or make sure the nodes you’ve arranged in your data flow graph actually work together as expected.

I also use a basic accuracy measure (the fraction of predictions whose argmax matches the label’s):

def accuracy(preds, labels):
    return np.mean(np.argmax(preds,1) == np.argmax(labels,1))

The argmax calls in numpy, invoked across axis 1, basically work like this: if my label is [0.0, 1.0], then the highest probability should correspond with the 1.0, so [0.1, 0.9] is a match, and since we’d apply a threshold to this it would actually be an exact correspondence. [0.6, 0.4] is a fail. This accuracy function has nothing to do with the optimization that occurs when the TensorFlow graph is evaluated - it’s just there for human readable output.
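
Using the same toy values as a quick check of my own (reusing the accuracy function above):

import numpy as np

preds = np.array([[0.1, 0.9],   # argmax matches the label below
                  [0.6, 0.4]])  # argmax does not
labels = np.array([[0.0, 1.0],
                   [0.0, 1.0]])
print(accuracy(preds, labels))  # 0.5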

Running the entire thing, we can see some quick convergence for logic functions:

Cost function value as of step 0: 0.833859
Accuracy for training dataset: 0.244
Validation accuracy estimate: 0.736
Cost function value as of step 200: 0.002133
Accuracy for training dataset: 1.000
Validation accuracy estimate: 1.000
Cost function value as of step 400: 0.001000
Accuracy for training dataset: 1.000
Validation accuracy estimate: 1.000

If you want to run the code, you can get it here. I’ll be updating the repo over time with more examples as I continue this series of blog posts.

Tags: clojure deep learning TensorFlow neural networks python machine learning