Try 1: CUDA Development without a GPU
What if you want to run the code on your machine but you don’t have a GPU? Or maybe you want to try things out before firing up your AWS instance? Here I show you a way to run CUDA code without a GPU.
Note: this only works on Linux; there may be alternatives for Mac or Windows.
Ocelot lets you run CUDA programs on NVIDIA GPUs, AMD GPUs and x86-CPUs without recompilation. Here we’ll take advantage of the latter to run our code using our CPU.
You’ll need to install the following packages:
- C++ Compiler (GCC)
- Lex Lexer Generator (Flex)
- YACC Parser Generator (Bison)
And these libraries:
- GLEW (optional for GL interop)
- GL (for NVIDIA GPU Devices)
With Arch Linux, this should go something like this:
pacman -S gcc flex bison scons boost glew
On Ubuntu it should be similar (sudo apt-get install flex bison g++ scons libboost-all-dev). If you don’t know the name of a package, search for it with ‘apt-cache search package_name’.
You should probably install LLVM too. It’s not mandatory, but I think Ocelot runs faster with it.
pacman -S llvm clang
And of course you’ll need to install CUDA and the OpenCL headers. You can do it manually or using your distro’s package manager (on Ubuntu I believe the package is called nvidia-cuda-toolkit):
pacman -S cuda libcl opencl-nvidia
One last dependency is Hydrazine. Fetch the source code:
svn checkout http://hydrazine.googlecode.com/svn/trunk/ hydrazine
Or if you’re like me and prefer Git:
git svn clone -s http://hydrazine.googlecode.com/svn/ hydrazine
And install it like this (you might need to install automake if you don’t have it already):
sudo make install
Now we can finally install Ocelot. This is where it gets a bit messy. Fetch the Ocelot source code:
svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot
Or with Git (warning, this will take a while, the whole repo is about 1.9 GB):
git svn clone -s http://gpuocelot.googlecode.com/svn/ gpuocelot
Now go to the ocelot directory:
cd gpuocelot/ocelot
And install Ocelot with:
sudo ./build.py --install
Sadly, the last command probably failed. Here is how I fixed the problems.
hydrazine headers not found
You could fix this by adding an include flag; I just added a symbolic link to the hydrazine code we downloaded previously:
ln -s /path/to/hydrazine/hydrazine
Make sure you link to the second hydrazine directory (inside this directory you’ll find directories like implementation and interface). You should do this in the ocelot directory where you’re running the build.py script (gpuocelot/ocelot).
For any error that looks like this:
llvm/Target/TargetData.h: No such file or directory
Just edit the source code and replace the include with this header:
llvm/IR/DataLayout.h
The LLVM project moved (and renamed) the file.
LLVM IR folder “missing”
Similarly, files referenced by Ocelot from the “IR” package were moved (LLVM 3.2-5 on Arch Linux). If you get an error about llvm/IR/LLVMContext.h missing, edit the files that include the old paths and change the includes at the top of each one to use the new llvm/IR/ prefix (for example, llvm/LLVMContext.h becomes llvm/IR/LLVMContext.h).
The next problem I ran into was:
.release_build/ocelot/ptxgrammar.hpp:351:14: error: 'PTXLexer' is not a member of 'parser'
Go ahead, open the ‘.release_build/ocelot/ptxgrammar.hpp’ file and just comment line 355:
/* int yyparse (parser::PTXLexer& lexer, parser::PTXParser::State& state); */
That should fix the error.
boost libraries not found
On up-to-date Arch Linux boxes, it will complain about not finding boost libraries ‘boost_system-mt’, ‘boost_filesystem-mt’, ‘boost_thread-mt’.
I had to edit two files and simply remove the trailing -mt from the library names.
Finish the installation
After those fixes everything should work.
Whew! That wasn’t fun. Hopefully with the help of this guide it won’t be too painful.
To finish the installation, run the build script again:
sudo ./build.py --install
And you can check the library was installed correctly by running:
OcelotConfig -l
It should return -locelot. If it didn’t, check your LD_LIBRARY_PATH. On my machine, Ocelot was installed under /usr/local/lib so I just added this to my LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
Here’s the link to the installation instructions.
Running the code with Ocelot
We’re finally ready to enjoy the fruits of our hard work. We need to do two things:
Ocelot configuration file
Add a file called configure.ocelot to your project (in the same directory as our Makefile and student_func.cu files), and copy this:
You can check this guide for more information about these settings.
Compile with the Ocelot library
And lastly, a small change to our Makefile. Append this to the GCC_OPTS:
GCC_OPTS=-O3 -Wall -Wextra -m64 `OcelotConfig -l`
And change the student target so it uses g++ and not nvcc:
student: compare main.o student_func.o Makefile
g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GCC_OPTS)
I just replaced ‘nvcc’ with ‘g++’ and ‘NVCC_OPTS’ with ‘GCC_OPTS’.
And that’s it!
I forked the github repo and added these changes in case you want to take a look.
I found this guide helpful, it might have some additional details for installing things under ubuntu and/or manually.
Note for Debian users
I successfully installed Ocelot under Debian Squeeze following the steps above, except that I needed to download LLVM from upstream, as indicated in the Ubuntu guide above.
Other than that, after fixing some includes as indicated (replacing ‘TargetData.h’ with ‘IR/DataLayout.h’, or adding ‘/IR/’ to some includes), it just compiled.
To build the student project, I needed to replace -m64 by -m32 to fit my architecture, and to make the other indicated changes.
Here are my makefile diffs:
$ git diff Makefile
diff --git a/HW1/student/Makefile b/HW1/student/Makefile
index b6df3a4..55480af 100755
@@ -22,7 +22,8 @@ OPENCV_INCLUDEPATH=/usr/include
OPENCV_LIBS=-lopencv_core -lopencv_imgproc -lopencv_highgui
# On Macs the default install locations are below #
@@ -36,12 +37,12 @@ CUDA_INCLUDEPATH=/usr/local/cuda-5.0/include
-NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m64
+NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m32
-GCC_OPTS=-O3 -Wall -Wextra -m64
+GCC_OPTS=-O3 -Wall -Wextra -m32 `OcelotConfig -l` -I /usr/include/i386-linux-gnu
student: compare main.o student_func.o Makefile
- $(NVCC) -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS)
+ g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GCC_OPTS)
main.o: main.cpp timer.h utils.h HW1.cpp
g++ -c main.cpp $(GCC_OPTS) -I $(CUDA_INCLUDEPATH) -I $(OPENCV_LIBPATH)
I’m using cuda toolkit 4.2.
I don’t know why, but it was necessary to add /usr/lib/gcc/i486-linux-gnu/4.4 to the PATH for nvcc to work:
export PATH=$PATH:/usr/lib/gcc/i486-linux-gnu/4.4
Eclipse CUDA plugin
This is probably for another entry, but I used this guide to integrate CUDA into Eclipse Indigo.
The plugin is the University of Bayreuth’s Eclipse Toolchain for the CUDA compiler.
Try 2: Running CUDA Code Natively on x86 Processors
A new compiler from PGI makes it possible to use the same CUDA code on x86 processors, Nvidia chips, or both
Previously, we focused on Fermi and the architectural changes that significantly broadened the types of applications that map well to GPGPU computing while preserving the performance of software written for previous generations of CUDA-enabled GPUs. This article addresses the mindset that CUDA is a language only for GPU-based applications.
Recent developments allow CUDA programs to transparently compile and run at full speed on x86 architectures. This advance makes CUDA a viable programming model for all application development, just like OpenMP. The PGI CUDA C/C++ compiler for x86 (from the Portland Group Inc.) is the reason for this recent change in mindset. It is the first native CUDA compiler that can transparently create a binary that will run on an x86 processor. No GPU is required. As a result, programmers now have the ability to use a single source tree of CUDA code to reach both customers who own CUDA-enabled GPUs and those who use x86-based systems.
Figure 1 illustrates the options and target platforms that are currently available to build and run CUDA applications. The various products are discussed next.
Figure 1: The various options for compiling and running a CUDA program.
Aside from the new CUDA-x86 compiler, the other products require developer or customer intervention to run CUDA on multiple backends. For example:
- nvcc: The freely downloadable nvcc compiler from NVIDIA creates both host and device code. With the use of the __host__ specifiers, a developer can use C++ Thrust functions to run on both the host and CUDA-enabled devices. This x86 pathway is represented by the dotted line in Figure 1, as the programmer must explicitly specify use of the host processor. In addition, developers must explicitly check whether a GPU is present and use this information to select the memory space in which the data will reside (that is, GPU or host). The Thrust API also allows CUDA codes to be transparently compiled to run on different backends. The Thrust documentation shows how to use OpenMP to run a Monte Carlo simulation on x86. Note that Thrust is not optimized to create efficient OpenMP code.
- gpuocelot provides a dynamic compilation framework to run CUDA binaries on various backends such as x86, AMD GPUs, and an x86-based PTX emulator. The emulator alone is a valuable tool for finding hot spots and bottlenecks in CUDA codes. The gpuocelot website claims that it “allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs at full speed without recompilation.” I recommend this project even though it is challenging to use. As it matures, Ocelot will provide a pathway for customers to run CUDA binaries on various backends.
- MCUDA is an academic project that translates CUDA to C. It is not currently maintained, but the papers are interesting reading. A follow-up project (FCUDA) provides a CUDA to FPGA translation capability.
- SWAN provides a CUDA-to-OpenCL translation capability. The authors note that Swan is “not a drop in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls rewritten.” Still, it is an interesting project to bridge the gap between CUDA and OpenCL.
The CUDA-x86 compiler is the first to provide a seamless pathway to create a multi-platform application.
Why It Matters
Using CUDA for all application development may seem like a radical concept to many readers, but in fact it is the natural extension of the emerging CPU/GPU paradigm of high-speed computing. One of the key benefits of CUDA is that it uses C/C++, can be adopted easily, and runs on 300+ million GPUs, and now on all x86 chips as well. If this still feels like an edgy practice, this video presentation might be helpful.
CUDA works well now at its principal task — massively parallel computation — as demonstrated by the variety and number of projects that achieve 100x or greater performance in the NVIDIA showcase. See Figure 2.
Figure 2: All top 100 CUDA apps attain speedups in excess of 100x.
PGI CUDA-x86: CUDA Programming for Multi-core CPUs
The NVIDIA CUDA architecture was developed to enable offloading of compute-intensive kernels to GPUs. Through API function calls and language extensions, CUDA gives developers control over mapping of general-purpose compute kernels to GPUs, and over placement and movement of data between host memory and GPU memory. CUDA is supported on x86 and x64 (64-bit x86) systems running Linux, Windows or MacOS and that include an NVIDIA CUDA-enabled GPU. First introduced in 2007, CUDA is the most popular GPGPU parallel programming model with an estimated user-base of over 100,000 developers worldwide.
Let’s review the hardware around which the CUDA programming model was designed. Figure 1 below shows an abstraction of a multi-core x64+GPU platform focused on computing, with the graphics functionality stripped out. The key to the performance potential of the NVIDIA GPU is the large number of thread processors, up to 512 of them in a Fermi-class GPU. They’re organized into up to 16 multi-processors, each of which has 32 thread processors. Each thread processor has registers along with integer and floating point functional units; the thread processors within a multiprocessor run in SIMD mode. Fermi peak single-precision performance is about 1.4 TFLOPS and peak double-precision is about 550 GFLOPS.
Figure 1: NVIDIA Fermi-class GPU Accelerator
The GPU has a large (up to 6GB) high bandwidth long latency device main memory. Each multi-processor has a small 64KB local shared memory that functions as both a hardware data cache and a software-managed data cache, and has a large register file.
The GPU has two levels of parallelism, SIMD within a multiprocessor, and parallel across multiprocessors. In addition, there is another very important level of concurrency: the thread processors support extremely fast multithread context switching to tolerate the long latency to device main memory. If a given thread stalls waiting for a device memory access, it is swapped out and another ready thread is swapped in and starts executing within a few cycles.
What kind of algorithms run well on this architecture?
- Massive parallelism—is needed to effectively use hundreds of thread processors and provide enough slack parallelism for the fast multi-threading to effectively tolerate device memory latency and maximize device memory bandwidth utilization.
- Regular parallelism—is needed for GPU hardware and firmware that is optimized for the regular parallelism found in graphics kernels; these correspond roughly to rectangular iteration spaces (think tightly nested loops).
- Limited synchronization—thread processors within a multi-processor can synchronize quickly enough to enable coordinated vector operations like reductions, but there is virtually no ability to synchronize across multi-processors.
- Locality—is needed to enable use of the hardware or user-managed data caches to minimize accesses to device memory.
This sounds a lot like a nest of parallel loops. So, NVIDIA defined the CUDA programming model to enable efficient mapping of general-purpose compute-intensive loop nests onto the GPU hardware. Specifically, a 1K x 1K matrix multiply loop that looks as follows on the host:
for (i = 0; i < 1024; ++i)
    for (k = 0; k < 1024; ++k)
        for (j = 0; j < 1024; ++j)
            c[i][j] += a[i][k]*b[k][j];
can be rewritten in its most basic form in CUDA C as:
cudaMalloc( &ap, memsizeA );
cudaMalloc( &bp, memsizeB );
cudaMalloc( &cp, memsizeC );
cudaMemcpy( ap, a, memsizeA, cudaMemcpyHostToDevice );
cudaMemcpy( bp, b, memsizeB, cudaMemcpyHostToDevice );
c_mmul_kernel<<< dim3(64,64), dim3(16,16) >>>( ap, bp, cp, 1024 );
cudaMemcpy( c, cp, memsizeC, cudaMemcpyDeviceToHost );

__global__ void c_mmul_kernel( float* a, float* b, float* c, int n )
{
    int i = blockIdx.y*16 + threadIdx.y;
    int j = blockIdx.x*16 + threadIdx.x;
    float sum = 0.0f;
    for( int k = 0; k < n; ++k )
        sum += a[n*i+k] * b[n*k+j];
    c[n*i+j] = sum;
}
The triply-nested matrix multiply loop becomes a single dot-product loop, split out to a self-contained kernel function. The two outer loops are abstracted away in the launch of the kernel on the GPU. Conceptually, the over one million 1024-length dot-products it takes to perform the matrix multiply are all launched simultaneously on the GPU. The CUDA programmer structures fine-grain parallel tasks, in this case dot-product operations, as CUDA threads, organizes the threads into rectangular thread blocks with 32 to 1024 threads each, and organizes the thread-blocks into a rectangular grid. Each thread-block is assigned to a CUDA GPU multi-processor, and the threads within a thread-block are executed by the thread-processors within that multiprocessor.
The programmer also manages the memory hierarchy on the GPU, moving data from the host to device memory, from variables in device memory to variables in shared memory, or to variables that the user intends to be assigned to registers.
PGI CUDA C/C++ for Multi-core x64
The PGI CUDA C/C++ compiler for multi-core x64 platforms will allow developers to compile and optimize CUDA applications to run on x64-based workstations, servers and clusters with or without an NVIDIA GPU accelerator. Is it possible to compile CUDA C efficiently for multi-core processors? CUDA C is simply a parallel programming model and language. While it was designed with the structure required for efficient GPU programming, it also can be compiled for efficient execution on multi-core x64.
Looking at a multicore x64 CPU, we see features very similar to those of the NVIDIA GPU. We have MIMD parallelism across the cores, typically 4 cores today, but up to 12 on some chips and up to 48 on a single motherboard. We have SIMD parallelism in the AVX or SSE instructions. So it’s the same set of features, except that CPUs are optimized with deep cache memory hierarchies for memory latency, whereas the GPU is optimized for memory bandwidth. Mapping the CUDA parallelism onto the CPU parallelism seems straightforward from basic principles.
Consider the process the CUDA programmer uses to convert existing serial or parallel programs to CUDA C, as outlined above. Many aspects of this process can simply be reversed by the compiler:
- Reconstitute parallel/vector loop nests from the CUDA C chevron syntax
- Where possible, remove or replace programmer-inserted __syncthreads() calls by appropriate mechanisms on the CPU
In effect, the PGI CUDA C/C++ compiler will process CUDA C as a native parallel programming language for mapping to multi-core x64 CPUs. CUDA thread blocks will be mapped to processor cores to effect multi-core execution, and CUDA thread-level parallelism will be mapped to the SSE or AVX SIMD units as shown in Figure 2 below. All existing PGI x64 optimizations for Intel and AMD CPUs will be applied to CUDA C/C++ host code—SIMD/AVX vectorization, inter-procedural analysis and optimizations, auto-parallelization for multi-core, OpenMP extensions support, etc.
Figure 2: Mapping CUDA to GPUs versus Multi-core CPUs
Initially, PGI CUDA C/C++ will target the CUDA 3.1 runtime API. There are no current plans to implement the CUDA driver API. The definition of warpSize may be changed (probably to 1 in optimizing versions of the compiler); correctly implementing warp-synchronous programming would either require implicit synchronization after each memory access, or require the compiler to prove that such synchronization is not needed. It’s much more natural to require programmers to use the value of warpSize to determine how many threads are running in SIMD mode.
What kind of performance can you expect from CUDA C programs running on multi-core CPUs? There are many determining factors. Typical CUDA C programs perform many explicit operations and optimizations that are not necessary when programming multi-core CPUs using OpenMP or threads-based programming:
- Explicit movement of data from host main memory to CUDA device memory
- Data copies from arrays in CUDA device memory to temporary arrays in multi-processor shared memory
- Synchronization of SIMT thread processors to ensure shared memory coherency
- Manual unrolling of loops
In many cases, the PGI CUDA C compiler will remove explicit synchronization of the thread processors if it can determine it’s safe to split loops in which synchronization calls occur. Manual unrolling of loops will not typically hurt performance on x64, and may help in some cases. However, explicit movement of data from host memory to “device” copies will still occur, and explicit movement of data from device copies to temporary arrays in shared memory will still occur; these operations are pure overhead on a multi-core processor.
It will be easy to write CUDA programs that run really well on the GPU but not so well on a CPU. We can’t guarantee high performance if you’ve tightly hand-tuned your kernel code for the GPU. As with OpenCL, we’re making the language portable, and many programs will port and run well; but there is no guarantee of general performance portability.
PGI Unified Binary for Multi-core x64 and NVIDIA GPUs
In later releases, in addition to multi-core execution, the PGI CUDA C/C++ compiler will support execution of device kernels on NVIDIA CUDA-enabled GPUs. PGI Unified Binary technology will enable developers to build one binary that will use NVIDIA GPUs when present or default to using multi-core x64 if no GPU is present.
Figure 3: PGI Unified Binary for NVIDIA GPUs and Multi-core CPUs
It’s important to clarify that the PGI CUDA C/C++ compiler for multi-core does not split work between the CPU and GPU; it executes device kernels in multi-core mode on the CPU. Even with the PGI Unified Binary feature, the device kernels will execute either on the GPU or on the multi-core CPU, since the data will have been allocated in one memory or the other. PGI CUDA C/C++ is also not intended as a replacement for OpenMP or other parallel programming models for CPUs. It is a feature of the PGI compilers that will enable CUDA programs to run on either CPUs or GPUs, and will give developers the option of a uniform manycore parallel programming model for applications where it’s needed and appropriate. It will ensure CUDA C programs are portable to virtually any multi-core x64 processor-based HPC system.
The PGI compiler will implement the NVIDIA CUDA C language and closely track the evolution of CUDA C moving forward. The implementation will proceed in phases:
- Prototype demonstration at SC10 in New Orleans (November 2010).
- First production release in Q2 2011 with most CUDA C functionality. This will not be a performance release; it will use multi-core parallelism across threads in a single thread block, in the same way as PGI CUDA Fortran emulation mode, but will not exploit parallelism across thread blocks.
- Performance release in Q3 2011 leveraging multi-core and SSE/AVX to implement low-overhead native parallel/SIMD execution; this will use a single core to execute all the threads in a single thread block, in SIMD mode where possible, and use multi-core parallelism across the thread blocks.
- Unification release in Q4 2011 that supports PGI Unified Binary technology to create binaries that use NVIDIA GPU accelerators when present, or run on multi-core CPUs if no GPU is present.
The necessary elements of the NVIDIA CUDA toolkit needed to compile and execute CUDA C/C++ programs (header files, for example) will be bundled with the PGI compiler. Finally, the same optimizations and features implemented for CUDA C/C++ for multi-core will also be supported in CUDA Fortran, offering interoperability and a uniform programming model across both languages.
How It Works
In CUDA-x86, thread blocks are mapped to x86 processor cores. Thread-level parallelism is mapped to SSE (Streaming SIMD Extensions) or AVX SIMD units as shown below. (AVX is an extension of SSE to 256-bit operation). PGI indicates that:
- The size of a warp (that is, the group of threads that execute together in SIMD fashion) will be different from the typical 32 threads per warp on a GPU. For x86 computing, a warp might be the size of the SIMD units on the x86 core (either four or eight threads), or one thread per warp when SIMD execution is not utilized.
- In many cases, the PGI CUDA C compiler removes explicit synchronization of the thread processors when the compiler can determine it is safe to split loops.
- CUDA considers the GPU as a separate device from the host processors. CUDA x86 maintains this memory model, which means that data movement between the host and device memory spaces still consumes application runtime. As shown in the device bandwidth SDK example below, a modern Xeon processor can transfer data to a CUDA-x86 “device” at about 4GB/sec. All CUDA x86 pointers reside in the x86 memory space, so programmers can use conditional compilation to directly access memory without requiring data transfers when running on multicore processors.
Trying Out the Compiler
The PGI installation process is fairly straightforward:
- Register and download the latest version from PGI
- Extract the tarfile at the location of your choice and follow the instructions in INSTALL.txt.
- Under Linux, this basically requires running ./install as superuser and answering a few straightforward questions.
- Note that you should answer “yes” to the installation of CUDA even if you have a GPU version of CUDA already installed on your system. The PGI x86 version will not conflict with the GPU version. Otherwise, the PGI compiler will not understand files with the .cu file extension.
- Create the license.dat file.
At this point, you have a 15-day license for the PGI compilers.
Set up the environment to build with the PGI tools as discussed in the installation guide. Following are the commands for bash under Linux:
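The exact commands aren’t reproduced here; a typical setup might look like the following, where the install prefix (/opt/pgi) and the release directory name are assumptions you should adjust to your own installation:

```shell
# Assumed install prefix and release directory; adjust to your installation.
export PGI=/opt/pgi
export PATH=$PGI/linux86-64/2011/bin:$PATH
export LM_LICENSE_FILE=$PGI/license.dat
```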
Copy the PGI NVIDIA SDK samples to a convenient location and build them:
This is the output of deviceQuery on an Intel Xeon E5560 processor:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
"DEVICE EMULATION MODE"
CUDA Driver Version: 99.99
CUDA Runtime Version: 99.99
CUDA Capability Major revision number: 9998
CUDA Capability Minor revision number: 9998
Total amount of global memory: 128000000 bytes
Number of multiprocessors: 1
Number of cores: 0
Total amount of constant memory: 1021585952 bytes
Total amount of shared memory per block: 1021586048 bytes
Total number of registers available per block: 1021585904
Warp size: 1
Maximum number of threads per block: 1021585920
Maximum sizes of each dimension of a block: 32767 x 2 x 0
Maximum sizes of each dimension of a grid: 1021586032 x 32767 x 1021586048
Maximum memory pitch: 4206313 bytes
Texture alignment: 1021585952 bytes
Clock rate: 0.00 GHz
Concurrent copy and execution: Yes
limit on kernels: Yes
Support host page-locked memory mapping: Yes
Compute mode: Unknown
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 99.99, CUDA Runtime Version = 99.99, NumDevs = 1, Device = DEVICE EMULATION MODE
Press <Enter> to Quit...
The output of bandwidthTest shows that device transfers work as expected:
Device 0: DEVICE EMULATION MODE
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes)   Bandwidth(MB/s)
[bandwidthTest] - Test results:
Press <Enter> to Quit...
As with NVIDIA’s nvcc compiler, it is easy to use the PGI pgCC compiler to build an executable from a CUDA source file. As an example, copy the arrayReversal_multiblock_fast.cu code from Part 3 of this series. To compile and run it under Linux, type: