Differentiable Neural Computers (DNCs) - Nature article thoughts
The addition of working memory to artificial neural networks (ANNs) is an obvious upgrade when we compare ANNs to the von Neumann CPU architecture, and one that came to the fore in the RAM (Reasoning, Attention, Memory) workshop at NIPS last year (which was packed to the rafters). Obvious and easy to build, however, are two different things.
The recent paper from Google DeepMind on Differentiable Neural Computers, or DNCs, represents another significant step in the journey to add working memory to ANNs, so it's worth taking a more in-depth look at it.
The Times They Are a-Changin'
In homage to the newly minted Nobel Prize winner Bob Dylan, we borrow the title of his 1964 classic to draw attention to the striking changes reshaping the neural network landscape at present.
Movidius has been acquired by Intel. Their tagline, "Visual Sensing for the Internet of Things", gives a clue to their focus: the VPU, or Vision Processing Unit, which can execute both TensorFlow and Caffe neural network models.
Intel themselves are preparing the Linux kernel for new x86 instructions, dedicated to running neural networks on CPUs as opposed to GPUs (Intel has been lagging Nvidia in this area for a long time).
Nvidia are still the clear hardware leader in terms of adoption and public performance - they have bet big on deep learning and at GTC 2016 this year it was the cornerstone of the conference - from the DGX-1 to the CUDA 8 / cuDNN 5.x releases.
Finally we know that Google has their own TPUs (Tensor Processing Units) but not much about them, or how they measure up to GPUs or CPUs.
Simply put, hardware is morphing to run larger neural networks more efficiently while using less power. Every major software company now has links to academic institutions and is actively working to apply deep learning / neural computing to its platforms and products.
This level of hardware and software activity in the field of neural computing is completely unprecedented, and shows no sign of abating. How then does the DNC paper play into all of this activity, if at all?
Differentiable Neural Computers
What, then, are DNCs? Breaking down the paper, we get the following key points:
At its core, a DNC is "just" a recurrent neural network (RNN).
This RNN, however, is allowed to read, write and update locations in a memory M. M has N rows, each holding a vector of size W, so M is an N*W matrix.
The DNC uses differentiable attention to decide where to read from, write to or update existing rows in memory. This is a key point, as it enables well-understood learning algorithms such as Stochastic Gradient Descent (SGD) to be used to train the DNC.
The memory bank M is associative - the implementation uses cosine similarity so that partial as well as exact matches are supported.
There is another data structure (named L) which is separate from the memory M. L provides temporal context by recording the order in which memory locations were written, so that a read can later step forwards or backwards through that sequence. In effect, "L" behaves like a linked list over the rows of "M", although the paper implements it as an N*N matrix of soft links rather than as literal pointers.
Lastly, I find it intriguing to see the references to cognitive computing / biological plausibility in the paper (not common in this space - a hangover of the connectionism vs computationalism debate of the 1990s) - multiple references to similarities between the DNC and the hippocampus, or how synapses can encode temporal context and links.
The following image is taken from the DeepMind blog post and clearly shows the RNN, the read and write heads, the N*W memory (M) and the linked list (L) that encodes temporal associations between the rows of M.
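The memory operations in the key points above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not DeepMind's implementation: the function names (content_weights, read_memory, write_memory) and the parameters beta and eps are made up for this post, and the real DNC combines content lookup with further addressing mechanisms that are left out here. What it does show is why everything is differentiable: a "location" is never a hard index but a soft weighting over all N rows.

```python
import numpy as np

def content_weights(M, key, beta=1.0, eps=1e-8):
    """Soft, content-based addressing over the N rows of M (shape N x W).

    Cosine similarity between `key` and each row is sharpened by `beta`
    and passed through a softmax, so partial matches still get weight.
    `beta` and `eps` are illustrative parameters, not taken from the paper.
    """
    sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + eps)
    z = beta * sim
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()           # non-negative weights that sum to 1

def read_memory(M, w):
    """Differentiable read: a weighted average of the rows of M."""
    return w @ M

def write_memory(M, w, erase, add):
    """Differentiable write: partially erase, then add, scaled by w."""
    return M * (1.0 - np.outer(w, erase)) + np.outer(w, add)

# Hypothetical toy memory with N = 4 rows of width W = 3.
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# A noisy, partial key still attends mostly to the closest row.
w = content_weights(M, np.array([0.9, 0.1, 0.0]), beta=10.0)
print("read weights:", w)
print("read vector: ", read_memory(M, w))
```

Because the read is a weighted sum and the write is a weighted erase-and-add, gradients flow through the weights back into whatever produced the key, which is what lets SGD train the controller.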
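To make the temporal structure L more concrete, here is a minimal NumPy sketch of how such links could be maintained. It follows the update rule as I read it in the paper, where L is an N*N matrix and a "precedence" vector p tracks which row was written most recently; please check it against the paper before relying on it. The function name update_links and the variable names are my own.

```python
import numpy as np

def update_links(L, p, w):
    """One write step for the temporal links.

    L : (N, N) matrix; L[i, j] near 1 means row i was written right after row j.
    p : (N,) precedence vector; how strongly each row was the last one written.
    w : (N,) write weighting for this step (non-negative, sums to at most 1).
    """
    # Links touching the rows being written now are weakened...
    decay = 1.0 - w[:, None] - w[None, :]
    # ...and new links are added from the previously written rows to these rows.
    L_new = decay * L + np.outer(w, p)
    np.fill_diagonal(L_new, 0.0)         # a row never links to itself
    p_new = (1.0 - w.sum()) * p + w
    return L_new, p_new

# Toy example with N = 3: write to row 0, then row 2, then row 1.
N = 3
L, p = np.zeros((N, N)), np.zeros(N)
for row in (0, 2, 1):
    L, p = update_links(L, p, np.eye(N)[row])

last_read = np.eye(N)[0]       # suppose the previous read was at row 0
forward = L @ last_read        # weighting over "what was written next"
backward = L.T @ np.eye(N)[1]  # weighting over "what was written before row 1"
print(forward, backward)
```

With hard one-hot writes this behaves exactly like following pointers in a linked list; with soft write weightings the same arithmetic gives a blurred, but differentiable, version of the same idea.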
What about Memory Networks from Weston et al?
Weston et al at Facebook have also been working hard in this space. The diagram below is from their June 2016 arXiv paper, the latest in a line of work on memory networks going back to 2014. The memory component was perhaps inspired or motivated by earlier work on WSABIE.
The Nature paper does a better job of expounding the generality of the DeepMind solution (covering document analysis and understanding, dialogue, graph planning etc.), but this does not necessarily mean that the approach itself is better.
Impact and Relevance
In my opinion, DNCs / RAM represent the single biggest advance in recurrent architectures since LSTM. The addition of memory, coupled with a well-defined mechanism to differentiate and thus train over it, clearly improves the ability of RNNs to perform more complex tasks such as question answering and planning, as evidenced in the paper by the bAbI dataset and London Underground tasks.
Business applications can make very significant use of DNCs and architectures like them. The ability to plan, or to arrive at a better understanding of large documents, has big implications for decision support systems, data analytics, project management and information retrieval. It is not difficult to imagine a DNC plugin for Elasticsearch or Solr, for example, or a DNC edition of Microsoft Project Server.
Couple that software support with burgeoning native CPU instruction set support for tensor-centric operations, ongoing GPU improvements and TPUs, and the future for neural computing looks set to grow brighter and brighter.
SHRDLU from Winograd is widely regarded as a high point in AI, reached in 1972 and not substantially bettered or replicated since then (Liang, ICML 2015 slides 100 - 105). Do the Mini-SHRDLU block puzzle experiments referenced on page three of the Nature article point to the next substantial research area for DeepMind - to improve on SHRDLU performance from 1972?