- Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal)
- Torch – provides a Matlab-like environment for state-of-the-art machine learning algorithms in lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu)
- Pylearn2 – Pylearn2 is a library designed to make machine learning research easy.
- Blocks – A Theano framework for training neural networks
- Tensorflow – TensorFlow™ is an open source software library for numerical computation using data flow graphs.
- MXNet – MXNet is a deep learning framework designed for both efficiency and flexibility.
- Caffe -Caffe is a deep learning framework made with expression, speed, and modularity in mind.Caffe is a deep learning framework made with expression, speed, and modularity in mind.
- Lasagne – Lasagne is a lightweight library to build and train neural networks in Theano.
- Keras– A theano based deep learning library.
- Deep Learning Tutorials – examples of how to
*do*Deep Learning with Theano (from LISA lab at University of Montreal) - Chainer – A GPU based Neural Network Framework
- DeepLearnToolbox – A Matlab toolbox for Deep Learning (from Rasmus Berg Palm)
- Cuda-Convnet – A fast C++/CUDA implementation of convolutional (or more generally, feed-forward) neural networks. It can model arbitrary layer connectivity and network depth. Any directed acyclic graph of layers will do. Training is done using the back-propagation algorithm.
- Deep Belief Networks. Matlab code for learning Deep Belief Networks (from Ruslan Salakhutdinov).
- RNNLM– Tomas Mikolov’s Recurrent Neural Network based Language models Toolkit.
- RNNLIB-RNNLIB is a recurrent neural network library for sequence learning problems. Applicable to most types of spatiotemporal data, it has proven particularly effective for speech and handwriting recognition.
- matrbm. Simplified version of Ruslan Salakhutdinov’s code, by Andrej Karpathy (Matlab).
- deeplearning4j– Deeplearning4J is an Apache 2.0-licensed, open-source, distributed neural net library written in Java and Scala.
- Estimating Partition Functions of RBM’s. Matlab code for estimating partition functions of Restricted Boltzmann Machines using Annealed Importance Sampling (from Ruslan Salakhutdinov).
- Learning Deep Boltzmann Machines Matlab code for training and fine-tuning Deep Boltzmann Machines (from Ruslan Salakhutdinov).
- The LUSH programming language and development environment, which is used @ NYU for deep convolutional networks
- Eblearn.lsh is a LUSH-based machine learning library for doing Energy-Based Learning. It includes code for “Predictive Sparse Decomposition” and other sparse auto-encoder methods for unsupervised learning. Koray Kavukcuoglu provides Eblearn code for several deep learning papers on thispage.
- deepmat– Deepmat, Matlab based deep learning algorithms.
- MShadow – MShadow is a lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA. The goal of mshadow is to support efficient, device invariant and simple tensor library for machine learning project that aims for both simplicity and performance. Supports CPU/GPU/Multi-GPU and distributed system.
- CXXNET – CXXNET is fast, concise, distributed deep learning framework based on MShadow. It is a lightweight and easy extensible C++/CUDA neural network toolkit with friendly Python/Matlab interface for training and prediction.
- Nengo-Nengo is a graphical and scripting based software package for simulating large-scale neural systems.
- Eblearn is a C++ machine learning library with a BSD license for energy-based learning, convolutional networks, vision/recognition applications, etc. EBLearn is primarily maintained by Pierre Sermanet at NYU.
- cudamat is a GPU-based matrix library for Python. Example code for training Neural Networks and Restricted Boltzmann Machines is included.
- Gnumpy is a Python module that interfaces in a way almost identical to numpy, but does its computations on your computer’s GPU. It runs on top of cudamat.
- The CUV Library (github link) is a C++ framework with python bindings for easy use of Nvidia CUDA functions on matrices. It contains an RBM implementation, as well as annealed importance sampling code and code to calculate the partition function exactly (from AIS lab at University of Bonn).
- 3-way factored RBM and mcRBM is python code calling CUDAMat to train models of natural images (from Marc’Aurelio Ranzato).
- Matlab code for training conditional RBMs/DBNs and factored conditional RBMs (from Graham Taylor).
- mPoT is python code using CUDAMat and gnumpy to train models of natural images (from Marc’Aurelio Ranzato).
- neuralnetworks is a java based gpu library for deep learning algorithms.
- ConvNet is a matlab based convolutional neural network toolbox.
- Elektronn is a deep learning toolkit that makes powerful neural networks accessible to scientists outside the machine learning community.
- OpenNN is an open source class library written in C++ programming language which implements neural networks, a main area of deep learning research.
- NeuralDesigner is an innovative deep learning tool for predictive analytics.
- Theano Generalized Hebbian Learning.

Recent developments allow CUDA programs to transparently compile and run at full speed on x86 architectures. This advance makes CUDA a viable programming model for *all* application development, just like OpenMP. The PGI CUDA C/C++ compiler for x86 (from the Portland Group Inc.) is the reason for this recent change in mindset. It is the first native CUDA compiler that can transparently create a binary that will run on an x86 processor. No GPU is required. As a result, programmers now have the ability to use a single source tree of CUDA code to reach those customers who own CUDA-enabled GPUs as or who use x86-based systems.

Figure 1 illustrates the options and target platforms that are currently available to build and run CUDA applications. The various products are discussed next.

Aside from the new CUDA-x86 compiler, the other products require developer or customer intervention to run CUDA on multiple backends. For example:

- nvcc: The freely downloadable nvcc compiler from NVIDIA creates both host and device code. With the use of the
`__device__`

and`__host__`

specifiers, a developer can use C++ Thrust functions to run on both host and CUDA-enabled devices. This x86 pathway is represented by the dotted line in Figure 1, as the programmer must explicitly specify use of the host processor. In addition, developers must explicitly check whether a GPU is present and use this information to select the memory space in which the data will reside (that is, GPU or host). The Thrust API also allows CUDA codes to be transparently compiled to run on different backends. The Thrust documentation shows how to use OpenMP to run a Monte Carlo simulation on x86. Note that Thrust is not optimized to create efficient OpenMP code. - gpuocelot provides a dynamic compilation framework to run CUDA binaries on various backends such as x86, AMD GPUs, and an x86-based PTX emulator. The emulator alone is a valuable tool for finding hot spots and bottlenecks in CUDA codes. The gpuocelot website claims that it “allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs at full speed without recompilation.” I recommend this project even though it is challenging to use. As it matures, Ocelot will provide a pathway for customers to run CUDA binaries on various backends.
- MCUDA is an academic project that translates CUDA to C. It is not currently maintained, but the papers are interesting reading. A follow-up project (FCUDA) provides a CUDA to FPGA translation capability.
- SWAN provides a CUDA-to-OpenCL translation capability. The authors note that Swan is “not a drop in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls rewritten.” Still, it is an interesting project to bridge the gap between CUDA and OpenCL.

The CUDA-x86 compiler is the first to provide a seamless pathway to create a multi-platform application.

### Why It Matters

Using CUDA for all application development may seem like a radical concept to many readers, but in fact, it is the natural extension of the emerging CPU/GPU paradigm of high-speed computing. One of the key benefits of CUDA is that it uses C/C++ and can be adopted easily and it runs on 300+ million GPUs and now all x86 chips. If this still feels like an edgy practice, this video presentation might be helpful.

CUDA works well now at its principal task — massively parallel computation — as demonstrated by the variety and number of projects that achieve 100x or greater performance in the NVIDIA showcase. See Figure 2.