I’m currently taking a deep learning course, which used learning the XOR function as its first example of feedforward networks. The XOR function has the following truth table

| $$x$$ | $$y$$ | $$x \oplus y$$ |
|-------|-------|----------------|
| 0     | 0     | 0              |
| 0     | 1     | 1              |
| 1     | 0     | 1              |
| 1     | 1     | 0              |

which, when graphed, is not linearly separable (the 1s cannot be separated from the 0s by drawing a single line):

XOR is not linearly separable

So if a linear model won't work, we need a nonlinear one. We can get one by applying the ReLU (rectified linear unit) activation function to the outputs of our neurons. ReLU is defined as

$$\operatorname{ReLU}(x) = \max(0, x)$$

and graphed below.

The ReLU function

However, since ReLU has a sharp corner, it is not differentiable at $$x = 0$$, so gradient-based learning methods won't work as well. So we use the softplus function instead, which is a softened version of ReLU, defined as

$$\zeta(x) = \ln(1 + e^x)$$

and shown below.

The softplus function
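To see how close the two functions are, here is a minimal NumPy sketch of both (the definitions above, not anything from the Keras internals):

```python
import numpy as np

def relu(x):
    # max(0, x), applied elementwise
    return np.maximum(0, x)

def softplus(x):
    # ln(1 + e^x), a smooth approximation of ReLU
    return np.log1p(np.exp(x))

xs = np.array([-2.0, 0.0, 2.0])
print(relu(xs))      # [0. 0. 2.]
print(softplus(xs))  # close to relu(xs) away from 0, but smooth at 0
```

For large $$|x|$$ the two agree closely; near zero, softplus rounds off the corner (its value at 0 is $$\ln 2 \approx 0.693$$ rather than 0).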

We begin with the usual imports:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

Then we define the inputs and expected outputs of the neural network:

inputs = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])

xor_outputs = np.array([0, 1, 1, 0])

Next, we define and compile the neural network. Note that I had to increase the learning rate from the default value.

XOR = Sequential()
XOR.add(Dense(2, activation='softplus', input_dim=2))
XOR.add(Dense(1, activation='sigmoid'))

# Make the model learn faster (take bigger steps) than by default.
sgd = SGD(lr=0.1)
XOR.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

This defines the network

The XOR network

where the hidden layer activation function is softplus and the output layer activation function is the traditional sigmoid, which outputs a number between 0 and 1 indicating the probability that the output is a logical 1. Note that Keras does not require us to explicitly form the input layer.
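To make the structure concrete, here is a NumPy sketch of the forward pass this network computes. The weights here are randomly initialized placeholders just to illustrate the shapes Keras manages internally (2 inputs, 2 hidden units, 1 output); they are not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder weights: 2 inputs -> 2 hidden units -> 1 output.
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

def forward(x):
    hidden = softplus(x @ W1 + b1)    # Dense(2, activation='softplus')
    return sigmoid(hidden @ W2 + b2)  # Dense(1, activation='sigmoid')

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(forward(inputs).shape)  # (4, 1): one probability per input pair
```

Training amounts to adjusting `W1`, `b1`, `W2`, and `b2` so these probabilities match the XOR truth table.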

Now we actually train the network.

XOR.fit(inputs, xor_outputs, epochs=5000, verbose=0)
cost, acc = XOR.evaluate(inputs, xor_outputs, verbose=0)
print(f'cost: {cost}, acc: {acc * 100}%')

which outputs

cost: 0.007737404201179743, acc: 100.0%

Training the network on other boolean functions works exactly the same way; the only difference is using a different output array.
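For example (these target arrays are my own illustration, not from the course), learning AND or OR just means swapping in the matching truth-table column and calling `fit` on the same model:

```python
import numpy as np

inputs = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])

# Only the target array changes; the network definition is identical.
and_outputs = np.array([0, 0, 0, 1])
or_outputs  = np.array([0, 1, 1, 1])

# e.g. model.fit(inputs, and_outputs, epochs=5000, verbose=0)
```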

This was my first experience with a neural network, so here are some things that I learned for your amusement:

  • I originally expected this model to train very quickly because the problem was so small, so I only used 10-20 training epochs and got absolutely garbage results. Here, I’m using 5000 training epochs.
  • I had to increase the learning rate to train in a reasonable amount of time.
  • One should not blindly upgrade TensorFlow without reading the release notes. All in all, I spent more time trying to install compatible versions of TensorFlow and CUDA than I did getting even this simple neural network to work correctly.
  • Even small models like this use quite a lot of GPU memory.

Note that boolean functions are bad functions for neural networks to learn. This is because their domains and ranges are discrete and (typically) small: learning such a function takes more time and space than simply listing its truth table.