Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘OpenCL’ Category

Deep Learning Software/ Framework links

Posted by Hemprasad Y. Badgujar on July 15, 2016

  1. Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal)
  2. Torch – provides a Matlab-like environment for state-of-the-art machine learning algorithms in lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu)
  3. Pylearn2 – Pylearn2 is a library designed to make machine learning research easy.
  4. Blocks – A Theano framework for training neural networks
  5. Tensorflow – TensorFlow™ is an open source software library for numerical computation using data flow graphs.
  6. MXNet – MXNet is a deep learning framework designed for both efficiency and flexibility.
  7. Caffe -Caffe is a deep learning framework made with expression, speed, and modularity in mind.Caffe is a deep learning framework made with expression, speed, and modularity in mind.
  8. Lasagne – Lasagne is a lightweight library to build and train neural networks in Theano.
  9. Keras– A theano based deep learning library.
  10. Deep Learning Tutorials – examples of how to do Deep Learning with Theano (from LISA lab at University of Montreal)
  11. Chainer – A GPU based Neural Network Framework
  12. DeepLearnToolbox – A Matlab toolbox for Deep Learning (from Rasmus Berg Palm)
  13. Cuda-Convnet – A fast C++/CUDA implementation of convolutional (or more generally, feed-forward) neural networks. It can model arbitrary layer connectivity and network depth. Any directed acyclic graph of layers will do. Training is done using the back-propagation algorithm.
  14. Deep Belief Networks. Matlab code for learning Deep Belief Networks (from Ruslan Salakhutdinov).
  15. RNNLM– Tomas Mikolov’s Recurrent Neural Network based Language models Toolkit.
  16. RNNLIB-RNNLIB is a recurrent neural network library for sequence learning problems. Applicable to most types of spatiotemporal data, it has proven particularly effective for speech and handwriting recognition.
  17. matrbm. Simplified version of Ruslan Salakhutdinov’s code, by Andrej Karpathy (Matlab).
  18. deeplearning4j– Deeplearning4J is an Apache 2.0-licensed, open-source, distributed neural net library written in Java and Scala.
  19. Estimating Partition Functions of RBM’s. Matlab code for estimating partition functions of Restricted Boltzmann Machines using Annealed Importance Sampling (from Ruslan Salakhutdinov).
  20. Learning Deep Boltzmann Machines Matlab code for training and fine-tuning Deep Boltzmann Machines (from Ruslan Salakhutdinov).
  21. The LUSH programming language and development environment, which is used @ NYU for deep convolutional networks
  22. Eblearn.lsh is a LUSH-based machine learning library for doing Energy-Based Learning. It includes code for “Predictive Sparse Decomposition” and other sparse auto-encoder methods for unsupervised learning. Koray Kavukcuoglu provides Eblearn code for several deep learning papers on thispage.
  23. deepmat– Deepmat, Matlab based deep learning algorithms.
  24. MShadow – MShadow is a lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA. The goal of mshadow is to support efficient, device invariant and simple tensor library for machine learning project that aims for both simplicity and performance. Supports CPU/GPU/Multi-GPU and distributed system.
  25. CXXNET – CXXNET is fast, concise, distributed deep learning framework based on MShadow. It is a lightweight and easy extensible C++/CUDA neural network toolkit with friendly Python/Matlab interface for training and prediction.
  26. Nengo-Nengo is a graphical and scripting based software package for simulating large-scale neural systems.
  27. Eblearn is a C++ machine learning library with a BSD license for energy-based learning, convolutional networks, vision/recognition applications, etc. EBLearn is primarily maintained by Pierre Sermanet at NYU.
  28. cudamat is a GPU-based matrix library for Python. Example code for training Neural Networks and Restricted Boltzmann Machines is included.
  29. Gnumpy is a Python module that interfaces in a way almost identical to numpy, but does its computations on your computer’s GPU. It runs on top of cudamat.
  30. The CUV Library (github link) is a C++ framework with python bindings for easy use of Nvidia CUDA functions on matrices. It contains an RBM implementation, as well as annealed importance sampling code and code to calculate the partition function exactly (from AIS lab at University of Bonn).
  31. 3-way factored RBM and mcRBM is python code calling CUDAMat to train models of natural images (from Marc’Aurelio Ranzato).
  32. Matlab code for training conditional RBMs/DBNs and factored conditional RBMs (from Graham Taylor).
  33. mPoT is python code using CUDAMat and gnumpy to train models of natural images (from Marc’Aurelio Ranzato).
  34. neuralnetworks is a java based gpu library for deep learning algorithms.
  35. ConvNet is a matlab based convolutional neural network toolbox.
  36. Elektronn is a deep learning toolkit that makes powerful neural networks accessible to scientists outside the machine learning community.
  37. OpenNN is an open source class library written in C++ programming language which implements neural networks, a main area of deep learning research.
  38. NeuralDesigner  is an innovative deep learning tool for predictive analytics.
  39. Theano Generalized Hebbian Learning.

Posted in C, Computing Technology, CUDA, Deep Learning, GPU (CUDA), JAVA, OpenCL, PARALLEL, PHP, Project Related | Leave a Comment »

Setting Global C++ Include Paths in Visual Studio 2012 (and 2011, and 2010)

Posted by Hemprasad Y. Badgujar on May 14, 2014

Setting Global C++ Include Paths in Visual Studio 2012 (and 2011, and 2010)

Starting with Visual Studio 2010, Microsoft decided to make life hard on C++ developers.  System-wide include path settings used to be accessed through Tools | Options | Projects and Solutions | VC++ Directories.  However, that option is gone:


Instead, the system-wide include paths are now located within the ‘Properties’ interface.  To access it, select View | Property Manager.  No dialog will appear yet. Instead, the Property Manager appears as a tab along with the Solution Explorer:


Note:  The Property Manager won’t contain anything unless a solution is loaded.

Now, expand one of your projects, then expand Debug | Win32 or Release | Win32:


Right click Microsoft.Cpp.Win32.user and select Properties:


This brings up the Microsoft.Cpp.Win32.User Property Pages dialog, which should look familiar enough:


Alternate Access

The properties can be accessed directly as an XML file by editing%LOCALAPPDATA%\Microsoft\MSBuild\v4.0\Microsoft.Cpp.Win32.user.props



Posted in Apps Development, Computer Network & Security, Computer Softwares, Computer Vision, CUDA, Installation, OpenCL, OpenCV | Tagged: , , , , | Leave a Comment »

Getting Started with CUDA

Posted by Hemprasad Y. Badgujar on March 4, 2013

What are the capabilities of Nvidia’s CUDA running on the GPU and how does it compare to CPU performance? I bought a GeForce 9800GT and set about finding out, starting off by installing the CUDA drivers, toolkit and SDK from the Cuda Zone.

The first thing I noticed was that on my Vista64 machine the sample projects had been installed to:

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\projects

which is read only. Rather than fight with Vista’s UAC I copied everything into the C:\CUDA directory. To build the solution in VS2008 on my Vista 64 machine all I needed to do was switch the platform to x64, ignore the warning:


Command line warning D9035 : option 'Wp64' has been deprecated and will be removed in a future release


and everything was fine. The SDK’s sample template conveniently included both a gold (CPU) implementation of a function and a GPU implementation. An initial run of the template project showed that only the GPU section was timed. Since the reason to use CUDA is performance and I wanted a comparison, the first modification I made was to put a timer around the CPU implementation:


cutilCheckError( cutStartTimer( timer));
computeGold( reference, h_idata, num_threads);  // reference solution
cutilCheckError( cutStopTimer( timer));


and raced them – but the results weren’t too inspiring:


GPU Processing time: 84.362747 (ms)
CPU Processing time: 0.001257 (ms)


The CPU solution wasn’t even threaded. I remembered the question of a student at the Stanford CUDA lecture on YouTube:


Q: Since there’s overhead in moving the data to the GPU how do you decide when it’s worthwhile?

A: Generally speaking it makes the most sense for large problems with high data intensity where you have to do multiple calculations per data element. 

Hmm, the template code only processed 128 bytes with 32 threads so I had paid the setup costs and then not sent enough data to the GPU – no wonder the CPU was faster. So I needed to increase the data set, but there’s a problem with that since the provided kernel code assumes the entire data set will fit in shared memory and binds the size of the data to the thread count. There needed to be some changes. But you can’t just increase the number of threads or you’ll get:


cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <>, line 88 : invalid configuration argument.


First step was to find out what resources were available on the GPU, then I’d need to work out how to get at those resources. Running the SDK Device Query told me how much global and shared memory was available as well as how many threads I could use:


Device 0: "GeForce 9800 GT"
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 1073741824 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.50 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)


Some interesting numbers there, since the GeForce can perform both a FMUL (2 flops) and a FADD (1 flop) per clock, per processor, we can calculate the maximum theoretical Gflops attainable is 1.5 GHz * 112 * (2 + 1) = 504 Gflops. By way of comparison, the E8400 in my test machine has a peak of 24 Gflops according to Intel’s data sheet:




But back to the problem of pushing more data through.  A few problems:

1) The data size needs to be uncoupled from the thread count which means a change to the GRID count from this:


// setup execution parameters
dim3  grid( 1, 1, 1);
dim3  threads( num_threads, 1, 1);


to something more like this:


cThreadsPerBlock = 64;
cBlocksPerGridx = 1024;
cBlocksPerGridy = 1024;

cData = cThreadsPerBlock * cBlocksPerGridx * cBlocksPerGridy;

dim3  grid ( cBlocksPerGridx, cBlocksPerGridy, 1); 
dim3  block( cThreadsPerBlock, 1, 1);


where the counts of Blocks Per Grid in the x and y directions would need to be data derived. To simplify the example I’ve done it backwards and set the data size based on thread and block breakdown. These grid and block variables are then be passed to GPU using the triple angle bracket <<< >>> notation:


testKernel<<< grid, block, shared_mem_size >>>( d_idata, d_odata);


which is the same as:


testKernel<<< grid, 64, shared_mem_size >>> ( d_idata, d_odata);


because the passed argument is converted to a CUDA dim3 type which “is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.” from the programming guide.

Specifying a shared_mem_size on the kernel call as above allows you to specify the size at runtime. You can then pick up a reference to the memory in the kernel code with:


extern  __shared__  float sdata[];


Alternatively if you know the size at compilation time you can also declare the shared memory inside the kernel like this:


__shared__ float sdata[256];


Which would mean the kernel call would be just be:


testKernel<<< grid, 64 >>> ( d_idata, d_odata);


2) The kernel code must loop through the grid. Calculate the thread id, block id and then global id to figure where in the global data we are up to. Pass the size of the data(int len) since num_threads is no longer coupled with the data length.  The __umul24 in the code provides increased performance but comes with a warning: “Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but __mul24 and __umul24 provide signed and unsigned 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures however, __[u]mul24 will be slower than 32-bit integer multiplication”.


__global__ void
testKernel( float* g_idata, float* g_odata, int len) 
  // shared memory
  // the size is determined by the host application
  extern  __shared__  float sdata[];

  // thread id
  const unsigned int tid = threadIdx.x;
  // block id
  const unsigned int bid = __umul24(gridDim.x, blockIdx.y) + blockIdx.x ;
  // global memory id
  const unsigned int gid = tid + __umul24(blockDim.x, bid);

  const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z);


3) The kernel needs to read from global memory and then synchronise across threads, this causes the threads across warps to sync and thus presents a consistent shared memory picture. So now thread 0 can read from SDATA(1) and will see the data which thread 1 loaded. A call to __syncthreads() is only needed when the count of threads per block exceed the warpSize because as mentioned in the performance optimisation whitepaper: “Instructions are SIMD synchronous within a warp”. Of course every call has a cost and the programming guide states that “throughput for __syncthreads is 8 operations per clock cycle in the case where no thread has to wait for any other threads.”

None of this is important in the sample template code because there is no communication between threads, thus no need for shared memory or thread syncing – a situation in which registers would normally be used but in this case shared memory has presumably been used by Nvidia for example purposes.


const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z); 
SDATA(tid) = g_idata[tid];
if (cThreadsPerBlock > warpSize) __syncthreads();


At this point I had revised the template to time the CPU for comparison, remove the size restrictions to allow a decent amount of data to be pushed through and was ready to attempt to answer the question – given the overhead of pushing the data to the GPU, when it is worth doing so? Running the code gave some unexpected answers. Keeping the thread count constant I varied the cBlocksPerGridy to yield various data sizes:



The GPU and CPU seemed to take the same amount of time with different data loads but the GPU was hampered by a constant overhead of 80ms, the exact same difference I noted when only 128 bytes were trialled in the very first instance before any modification.  Where was the time going? Some sort of setup cost?  Also how much was being taken in the kernel and how much in the data transfer? I needed more fine grained data to see what was going on.

I had modified the supplied SDK template code in a minimal way in order to measure CPU vs GPU performance and found that for the simple test code (1 float multiplication) that the E8400 CPU with a claimed 24 Gflops was handily out performing a GPU with a theoretical max 504 Gflops. Where was all the time going? Was the kernel the culprit, the memory copy or something else? I started out by trying to reuse the


cutilCheckError( cutStartTimer( timer));


timing method already in the template. Looking into the CUDA source in SDK\common\src\stopwatch_win.cpp showed that on Windows it was using the QueryPerformanceFrequency method which uses the highest possible resolution hardware timer … on the CPU. Using it to measure GPU performance is problematic because timing the GPU using a CPU timer requires the GPU and the CPU to be synchronised with:




and ruins the timing information. To measure times on the GPU I needed to use GPU based timing on stream 0 using events:

cudaEventRecord(start, 0);

So I created an array of start and stop events, broke the GPU processes into 5 steps and timed everything. The 5 GPU processes were:

1) Alloc: Host to Device – The allocation of memory on the device for the input array which needed to be copied over from the host.

2) Copy: Host to Device – Copying the input array from the host onto the device. Data size divided by time taken here would give bandwidth.

3) Alloc: Device to Host – The allocation of memory on the device for the output array where the result would be stored before being copied back to the host.

4) Compute – Running the actual kernel, reading from the input array, processing and writing results to the output array.

5) Copy: Device to Host – Copying the output array back to the host.

I also retained my CPU timing to measure the amount of time it took for the GPU to do everything and get the answer back onto the host – that way I’d have a 1:1 comparison against the CPU version. That gives one more thing to measure, how does the sum of the GPU times compare to the overall CPU time?

6) Sync with CPU – CPU time minus sum of GPU times indicates how long it takes to sync the two.

Set up 5 GPU timers to get a breakdown of where the GPU was spending time and keep the 2 CPU timers for the original comparison:


// GPU timers - used to time GPU streams
int cGpuTimer = 5;

cudaEvent_t* rgGpuTimer_start = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);
cudaEvent_t* rgGpuTimer_stop = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);

for (int i=0;i<cGpuTimer;i++)
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_start[i] ) );
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_stop[i] ) );


and wrap all the GPU calls with timing calls:


cutilCheckError( cutStartTimer( rgTimer[0]));

  // Alloc: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[0], 0 ) );
  float* d_idata;
  cutilSafeCall( cudaMalloc( (void**) &d_idata, global_mem_size));
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[0], 0 ) );

  // Copy: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[1], 0 ) );
  cutilSafeCall( cudaMemcpy( d_idata, h_idata, global_mem_size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[1], 0 ) );

  // Alloc: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[2], 0 ) );
  float* d_odata;
  cutilSafeCall( cudaMalloc( (void**) &d_odata, global_mem_size)); // The pad won't be read back
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[2], 0 ) );

  // Compute
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[3], 0 ) );
  dim3  gridDim ( cBlocksPerGridx, cBlocksPerGridy, 1);
  dim3  blockDim( cThreadsPerBlock, 1, 1);

  testKernel<<< gridDim, blockDim, shared_mem_size >>>( d_idata, d_odata, cData);

  cutilCheckMsg("Kernel execution failed");
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[3], 0 ) );

  // Copy: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[4], 0 ) );
  cutilSafeCall( cudaMemcpy( h_odata, d_odata, global_mem_size, cudaMemcpyDeviceToHost) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[4], 0 ) );

cudaThreadSynchronize(); // Block until memory copy is done to ensure accurate timing

cutilCheckError( cutStopTimer( rgTimer[0]));


With this code in place I was ready to find out where the extra 80ms that the GPU took compared to the CPU was coming from and how much time each of the GPU tasks took. First a baseline comparison to verify that the code was still the same and gave the same numbers.

So here’s the graph from before on the left, and here’s the new graph, which should be identical, on the right:




Wow! What’s happened here? All the CPU times are the same, as expected, but the GPU has suddenly closed the gap and now takes only a few ms extra – the 80ms gap has vanished. A diff of the two versions shows that the only change to the code is the addition of GPU timing – and that turns out to be why the GPU suddenly sped up. Directly after setting the device, sending a wakeup call to the GPU like this:


if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
    cutilDeviceInit(argc, argv);
    cudaSetDevice( cutGetMaxGflopsDeviceId() );

    cudaEvent_t wakeGPU;
    cutilSafeCall( cudaEventCreate( &wakeGPU) );


means that 80ms vanishes from the timed loop later in the code. Note that the variable is scoped so it isn’t used. Is the GeForce like a person – goes faster when it knows it is being watched?!  Or is this some wakeup from a power saving mode, I’m not sure.  This is the only extra code needed to cut 80ms from the timing which shows how tricky it is to time accurately on the ms scale, the slightest change can have a significant effect. It is always advisable to run tests on large volumes of data with a lot of loops to drown out one-off costs like this where possible.  While on the topic of getting accurate performance readings note that all timing should be done on release code, particularly timing breakdowns as the SDK/common/cutil_readme.txt file states:


“These macros are compiled out in release builds and so they will not affect performance. Note that in debug mode they call cudaThreadSynchronize() to ensure that kernel execution has completed, which can affect performance.” 

Well now that the extra 80ms has been eliminated what does our new GPU timing code show us about how the GPU spends its time? Here’s a chart showing the breakdown for a 16MB sample:



The majority of the time, and this holds for the other data sizes, is taken copying data back and forth. So experimentally it seems that the overhead in moving the data back and forth is quite significant. Of the 24.8ms required in total to process 16MB, 21.9ms were spent copying data. The actual processing takes almost no time.  Running a variety of input sizes and timing each one can tell us what kind of bandwidth we are typically getting as shown in the table below where times are in ms:

Copy: Host to Device MB/s Copy: Device to Host MB/s
16MB 9.0 1771.9 11.8 1359.3
32MB 16.3 1966.0 22.2 1442.8
64MB 30.6 2093.9 49.8 1285.4
128MB 58.2 2198.2 83.9 1526.4
256MB 114.9 2228.7 171.4 1493.4

We wanted to find how where the GPU was spending its time and now discovered that most of the time is in moving data back and forth.  Can we now answer the question of where the GPU outperforms the CPU? Is 2GB/s the expected throughput? Well Nvidia provides a tool in the SDK to answer that – the “Bandwidth Test”. Running it through the provided GUI tool yields the following results:


Running on......
      device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               2152.6

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1919.2

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               48507.8


So we can see for 32MB, performance is roughly in line with the template results so that’s case closed … or is it? Two things give cause for concern:

1) PCIe 2.0 is theoretically capable of 500 MB/s per lane and with a x16 slot there are 16 lanes. So throughput should be up around 8GB/s, not the 2GB/s observed.

2) What exactly does “Host to Device Bandwidth for Pageable memory” in the bandwidth test results mean? Pageable memory?

So I found out that the bulk of the time was in data copying, first confirmed that the speeds observed were similar to those given in the Nvidia test suite and then raised new questions about whether we were getting everything out of the hardware given 2GB/s observed and 8GB/s theoretical. So now I need to confirm that my hardware really is PCIe 2.0 x16 and figure out what pageable memory is.

I’d added GPU based timing to my template code and found out that most of the time was spent copying data back and forth between the host and the device. The “Bandwidth Test” in the SDK gave roughly similar results although it mentioned something about pageable memory. But the big problem was the theoretical performance of PCIe 2.0 x16 far exceeded what I was seeing. So the first step was to confirm that both my graphics card and my motherboard supported and were using PCIe 2.0 x16. To do this I used CPU-Z and GPU-Z, with the following results:




So after confirming the hardware should have been capable of better speeds I took another look at the BandwidthTest. Running with the –help switch reveals several options:


C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --help
Usage:  bandwidthTest [OPTION]...
Test the bandwidth for device to host, host to device, and device to device transfers

Example:  measure the bandwidth of device to host pinned memory copies in the range 1024 Bytes
          to 102400 Bytes in 1024 Byte increments
./bandwidthTest --memory=pinned --mode=range --start=1024 --end=102400 --increment=1024 --dtoh

--help  Display this help menu
--csv   Print results as a CSV
--device=[deviceno]     Specify the device device to be used
  all - compute cumulative bandwidth on all the devices
  0,1,2,...,n - Specify any particular device to be used
--memory=[MEMMODE]      Specify which memory mode to use
  pageable - pageable memory
  pinned   - non-pageable system memory
--mode=[MODE]   Specify the mode to use
  quick - performs a quick measurement
  range - measures a user-specified range of values
  shmoo - performs an intense shmoo of a large range of values
--htod  Measure host to device transfers
--dtoh  Measure device to host transfers
--dtod  Measure device to device transfers
--wc    Allocate pinned memory as write-combined
--cputiming     Force CPU-based timing always
Range mode options
--start=[SIZE]  Starting transfer size in bytes
--end=[SIZE]    Ending transfer size in bytes
--increment=[SIZE]      Increment size in bytes


Particularly of interest is the “pinned” memory mode. Let’s try that:


C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5256.9
Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4891.6
Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 48498.6


and we see that this mode vastly improves the maximum throughput. Not sure why Nvidia didn’t make it the default option. Speeds are now up to 5GB/s. A short investigation of the code reveals that the timing isn’t quite analogous to the testing we are doing in the template code:


56: // defines, project
57: #define MEMCOPY_ITERATIONS  10


as the bandwidthTest copies the same memory 10 times in a row as compared to the single copy we are doing. So we expect our performance to lag slightly behind this 5GB/s. Conveniently, all the code needed to use pinned memory is provided in the bandwidthTest, so putting it into a few wrapper functions called freeHost, mallocHost and memCpy yields:


//  Memory functions to switch between pinned and pageable memory as required

freeHost(void* h_mem, memoryMode memMode)
    if( PINNED == memMode ) {
        return cudaFreeHost(h_mem);
    else {
    return cudaSuccess;

mallocHost(void** h_mem ,uint memSize, memoryMode memMode, bool wc)
    if( PINNED == memMode ) {
#if CUDART_VERSION >= 2020
        return cudaHostAlloc( h_mem, memSize, (wc) ? cudaHostAllocWriteCombined : 0 );
        if (wc) {printf("Write-Combined unavailable on CUDART_VERSION less than 2020, running is: %d", CUDART_VERSION);
        return cudaMallocHost( h_mem, memSize );
    else { // PAGEABLE memory mode
        *h_mem = malloc( memSize );

    return cudaSuccess;

memCpy(void* sink, void* source, uint memSize, cudaMemcpyKind direction, memoryMode memMode)
    if( PINNED == memMode ) {
        return cudaMemcpyAsync( sink, source, memSize, direction, 0);
    else {
        return cudaMemcpy( sink, source, memSize, direction);


These functions take the same parameters as the existing functions with the addition of memory mode and for mallocHost whether or not the memory is Write Combined. Changing the allocation, copying and freeing over to these new functions allow use of pinned memory. Running the same test set shows that now the time is much more evenly spread between tasks:




and running the new numbers on the throughput we get:

Copy: Host to Device MB/s Copy: Device to Host MB/s
16MB 3.2 5026.7 3.3 4878.0
32MB 6.1 5242.5 6.5 4891.5
64MB 12.2 5251.1 13.1 4871.7
128MB 24.4 5247.6 26.2 4894.1
256MB 48.9 5239.0 52.3 4894.7

So now the throughput approaches the theoretical limit and matches the best the bandwidthTest provides. The total times are down significantly and the GPU is faster on all tested sizes. The 256MB trial runs in 30% less time down from 340ms to 236ms.



The next challenge is to find where else time is lost. The pie charts show that most of the time is still spent in allocation and copying with very little in compute time so there’s no need to look at the kernel. We’ve already probably cut most of the time we can from the copying so that leaves allocation. A good idea would probably be to allocate the memory once and then use it over and over for multiple kernel executions, an intensive process like the kind Nvidia suggests are best suited for CUDA. But what if the code needs to be as shown, one kernel being run on one large set of data and then returning to another application? This is the kind of flow seen in Matlab MEX files where CUDA is used – Matlab passes the data through the C/C++ MEX file, which runs up a CUDA program gets the result and then returns to Matlab. Could parallel memory copies and allocations speed things up in this situation?

So we’ve switched the code over to use pinned memory in preference to pageable and attained the desired speedup in memory operations from 2GB/s to about 5GB/s. Theoretically PCIe 2.0 x16 should be able to hit 8GB/s and I don’t know why we aren’t able to achieve speeds closer to this number. If anyone knows please leave a comment or e-mail me. From here the next thing to investigate to get more throughput in the single kernel scenario is parallel allocations and copies.

Posted in Artificial Intelligence, Computer Languages, Computing Technology, CUDA, Game Development, GPU (CUDA), GPU Accelareted, Image Processing, Neural Network, OpenCL, PARALLEL, Simulation, Virtualization | Leave a Comment »

Octree Textures on the GPU

Posted by Hemprasad Y. Badgujar on February 24, 2013

Octree Textures on the GPU

Sylvain Lefebvre

Samuel Hornus

Fabrice Neyret

Texture mapping is a very effective and efficient technique for enriching the appearance of polygonal models with details. Textures store not only color information, but also normals for bump mapping and various shading attributes to create appealing surface effects. However, texture mapping usually requires parameterizing a mesh by associating a 2D texture coordinate with every mesh vertex. Distortions and seams are often introduced by this difficult process, especially on complex meshes.

The 2D parameterization can be avoided by defining the texture inside a volume enclosing the object. Debry et al. (2002) and Benson and Davis (2002) have shown how 3D hierarchical data structures, named octree textures, can be used to efficiently store color information along a mesh surface without texture coordinates. This approach has two advantages. First, color is stored only where the surface intersects the volume, thus reducing memory requirements. Figures 37-1 and 37-2 illustrate this idea. Second, the surface is regularly sampled, and the resulting texture does not suffer from any distortions. In addition to mesh painting, any application that requires storing information on a complex surface can benefit from this approach.

37_octree_01.jpgFigure 37-1 An Octree Texture Surrounding a 3D Model

37_octree_02.jpgFigure 37-2 Unparameterized Mesh Textures with an Octree Texture

This chapter details how to implement octree textures on today’s GPUs. The octree is directly stored in texture memory and accessed from a fragment program. We discuss the trade-offs between performance, storage efficiency, and rendering quality. After explaining our implementation in Section 37.1, we demonstrate it on two different interactive applications:

  • A surface-painting application (Section 37.2). In particular, we discuss the different possibilities for filtering the resulting texture (Section 37.2.3). We also show how a texture defined in an octree can be converted into a standard texture, possibly at runtime (Section 37.2.4).
  • A nonphysical simulation of liquid flowing along a surface (Section 37.3). The simulation runs entirely on the GPU.

37.1 A GPU-Accelerated Hierarchical Structure: The N3-Tree

37.1.1 Definition

An octree is a regular hierarchical data structure. The first node of the tree, the root, is a cube. Each node has either eight children or no children. The eight children form a 2x2x2 regular subdivision of the parent node. A node with children is called an internal node. A node without children is called a leaf. Figure 37-3 shows an octree surrounding a 3D model where the nodes that have the bunny’s surface inside them have been refined and empty nodes have been left as leaves.

37_octree_03.jpgFigure 37-3 An Octree Surrounding a 3D Model

In an octree, the resolution in each dimension increases by two at each subdivision level. Thus, to reach a resolution of 256x256x256, eight levels are required (28= 256). Depending on the application, one might prefer to divide each edge by an arbitrary number N rather than 2. We therefore define a more generic structure called an N3 tree. In an N3-tree, each node has N 3 children. The octree is an N3-tree with N = 2. A larger value of N reduces the tree depth required to reach a given resolution, but it tends to waste memory because the surface is less closely matched by the tree.

37.1.2 Implementation

To implement a hierarchical tree on a GPU, we need to define how to store the structure in texture memory and how to access the structure from a fragment program.

A simple approach to implement an octree on a CPU is to use pointers to link the tree nodes together. Each internal node contains an array of pointers to its children. A child can be another internal node or a leaf. A leaf contains only a data field.

Our implementation on the GPU follows a similar approach. Pointers simply become indices within a texture. They are encoded as RGB values. The content of the leaves is directly stored as an RGB value within the parent node’s array of pointers. We use the alpha channel to distinguish between a pointer to a child and the content of a leaf. Our approach relies on dependent texture lookups (or texture indirections). This requires the hardware to support an arbitrary number of dependent texture lookups, which is the case for GeForce FX and GeForce 6 Series GPUs.

The following sections detail our GPU implementation of the N3-tree. For clarity, the figures illustrate the 2D equivalent of an octree (a quadtree).


We store the tree in an 8-bit RGBA 3D texture called the indirection pool. Each “pixel” of the indirection pool is called a cell.

The indirection pool is subdivided into indirection grids. An indirection grid is a cube of NxNxN cells (a 2x2x2 grid for an octree). Each node of the tree is represented by an indirection grid. It corresponds to the array of pointers in the CPU implementation described earlier.

A cell of an indirection grid can be empty or can contain one of the following:

  • Data, if the corresponding child is a leaf
  • The index of an indirection grid, if the corresponding child is another internal node

Figure 37-4 illustrates our tree storage representation.

37_octree_04.jpgFigure 37-4 Storage in Texture Memory (2D Case)

We note S = Su Sv Sw as the number of indirection grids stored in the indirection pool and R= (N x Su ) x (N x Sv ) x (N x Sw ) as the resolution in cells of the indirection pool.

Data values and indices of children are both stored as RGB triples. The alpha channel is used as a flag to determine the cell content (alpha = 1 indicates data; alpha = 0.5 indicates index; alpha = 0 indicates empty cell). The root of the tree is always stored at (0, 0, 0) within the indirection pool.

Accessing the Structure: Tree Lookup

Once the tree is stored in texture memory, we need to access it from a fragment program. As with standard 3D textures, the tree defines a texture within the unit cube. We want to retrieve the value stored in the tree at a point M U220A.GIF [0, 1]3. The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached.

Let I D be the index of the indirection grid of the node visited at depth D. The tree lookup is initialized with I 0= (0, 0, 0), which corresponds to the tree root. When we are at depth D, we know the index I D of the current node’s indirection grid. We now explain how we retrieve I D+1 from I D .

The lookup point M is inside the node visited at depth D. To decide what to do next, we need to read from the indirection grid ID the value stored at the location corresponding to M. To do so, we need to compute the coordinates of M within the node.

At depth D, a complete tree produces a regular grid of resolution N D N D N D within the unit cube. We call this grid the depth-D grid. Each node of the tree at depth D corresponds to a cell of this grid. In particular, M is within the cell corresponding to the node visited at depth D. The coordinates of M within this cell are given by frac(M x N D ). We use these coordinates to read the value from the indirection grid I D . The lookup coordinates within the indirection pool are thus computed as:


We then retrieve the RGBA value stored at P in the indirection pool. Depending on the alpha value, either we will return the RGB color if the child is a leaf, or we will interpret the RGB values as the index of the child’s indirection grid (I D+1) and continue to the next tree depth. Figure 37-5 summarizes this entire process for the 2D case (quadtree).

37_octree_05.jpgFigure 37-5 Example of a Tree Lookup

The lookup ends when a leaf is reached. In practice, our fragment program also stops after a fixed number of texture lookups: on most hardware, it is only possible to implement loop statements with a fixed number of iterations (however, early exit is possible on GeForce 6 Series GPUs). The application is in charge of limiting the tree depth with respect to the maximum number of texture lookups done within the fragment program. The complete tree lookup code is shown in Listing 37-1.

Example 37-1. The Tree Lookup Cg Code

float4 tree_lookup(uniform sampler3D IndirPool, // Indirection Pool

  uniform float3 invS, // 1 / S

  uniform float N,

  float3 M) // Lookup coordinates


  float4 I = float4(0.0, 0.0, 0.0, 0.0);

  float3 MND = M;

  for (float i=0; i// fixed # of iterations

    float3 P;

    // compute lookup coords. within current node

    P = (MND + floor(0.5 + * 255.0)) * invS;

    // access indirection pool

    if (I.w < 0.9)                   // already in a leaf?

      I =(float4)tex3D(IndirPool,P);// no, continue to next depth

 #ifdef DYN_BRANCHING // early exit if hardware supports dynamic branching

    if (I.w > 0.9)    // a leaf has been reached



    if (I.w < 0.1) // empty cell


    // compute pos within next depth grid

    MND = MND * N;


  return (I);


Further Optimizations

In our tree lookup algorithm, as we explained earlier, the computation of P requires a frac instruction. In our implementation, however, as shown Listing 37-1, we actually avoid computing the frac by relying on the cyclic behavior of the texture units (repeat mode). We leave the detailed explanations as an appendix, located on the book’s CD.

We compute P as


where D D is an integer within the range [0, S[.

We store D D instead of directly storing the I D values. Please refer to the appendix on the CD for the code to compute D D .

Encoding Indices

The indirection pool is an 8-bit 3D RGBA texture. This means that we can encode only 256 different values per channel. This gives us an addressing space of 24 bits (3 indices of 8 bits), which makes it possible to encode octrees large enough for most applications.

Within a fragment program, a texture lookup into an 8-bit texture returns a value mapped between [0,1]. However, we need to encode integers. Using a floating-point texture to do so would require more memory and would reduce performance. Instead, we map values between [0,1] with a fixed precision of 1/255 and simply multiply the floating-point value by 255 to obtain an integer. Note that on hardware without fixed-precision registers, we need to compute floor(0.5 + 255 * v) to avoid rounding errors.

37.2 Application 1: Painting on Meshes

In this section we use the GPU-accelerated octree structure presented in the previous section to create a surface-painting application. Thanks to the octree, the mesh does not need to be parameterized. This is especially useful with complex meshes such as trees, hairy monsters, or characters.

The user will be able to paint on the mesh using a 3D brush, similar to the brush used in 2D painting applications. In this example, the painting resolution is homogeneous along the surface, although multiresolution painting would be an easy extension if desired.

37.2.1 Creating the Octree

We start by computing the bounding box of the object to be painted. The object is then rescaled such that its largest dimension is mapped between [0,1]. The same scaling is applied to the three dimensions because we want the painting resolution to be the same in every dimension. After this process, the mesh fits entirely within the unit box.

The user specifies the desired resolution of the painting. This determines the depth of the leaves of the octree that contain colors. For instance, if the user selects a resolution of 5123, the leaves containing colors will be at depth 9.

The tree is created by subdividing the nodes intersecting the surface until all the leaves either are empty or are at the selected depth (color leaves). To check whether a tree node intersects the geometry, we rely on the box defining the boundary of the node. This process is depicted in Figure 37-6. We use the algorithm shown in Listing 37-2.

37_octree_06a.jpgFigure 37-6 Building an Octree Around a Mesh Surface

This algorithm uses our GPU octree texture API. The links between nodes (indices in the indirection grids) are set up by the createChild() call. The values stored in tree leaves are set up by calling setChildAsEmpty() and setChildColor(), which also set the appropriate alpha value.

Example 37-2. Recursive Algorithm for Octree Creation

void createNode(depth, polygons, box)

  for all children (i, j, k) within (N, N, N)

     if (depth + 1 == painting depth)       // painting depth reached?

        setChildColor(i, j, k, white)       // child is at depth+1


        childbox = computeSubBox(i, j, k, box)

        if (childbox intersect polygons)

           child = createChild(i, j, k)

           // recurse

           createNode(depth + 1, polygons, childbox)


          setChildAsEmpty(i, j, k)

37.2.2 Painting

In our application, the painting tool is drawn as a small sphere moving along the surface of the mesh. This sphere is defined by a painting center P center and a painting radius P radius. The behavior of the brush is similar to that of brushes in 2D painting tools.

When the user paints, the leaf nodes intersecting the painting tool are updated. The new color is computed as a weighted sum of the previous color and the painting color. The weight is such that the painting opacity decreases as the distance from P center increases.

To minimize the amount of data to be sent to the GPU as painting occurs, only the modified leaves are updated in texture memory. This corresponds to a partial update of the indirection pool texture (under OpenGL, we use glTexSubImage3D). The modifications are tracked on a copy of the tree stored in CPU memory.

37.2.3 Rendering

To render the textured mesh, we need to access the octree from the fragment program, using the tree lookup defined in Section 37.1.2.

The untransformed coordinates of the vertices are stored as 3D texture coordinates. These 3D texture coordinates are interpolated during the rasterization of the triangles. Therefore, within the fragment program, we know the 3D point of the mesh surface being projected in the fragment. By using these coordinates as texture coordinates for the tree lookup, we retrieve the color stored in the octree texture.

However, this produces the equivalent of a standard texture lookup in “nearest” mode. Linear interpolation and mipmapping are often mandatory for high-quality results. In the following section, we discuss how to implement these techniques for octree textures.

Linear Interpolation

Linear interpolation of the texture can be obtained by extending the standard 2D linear interpolation. Because the octree texture is a volume texture, eight samples are required for linear interpolation, as shown in Figure 37-7.

37_octree_07.jpgFigure 37-7 Linear Interpolation Using Eight Samples

However, we store information only where the surface intersects the volume. Some of the samples involved in the 3D linear interpolation are not on the surface and have no associated color information. Consider a sample at coordinates (ijk) within the maximum depth grid (recall that the depth D grid is the regular grid produced by a complete octree at depth D). The seven other samples involved in the 3D linear interpolation are at coordinates (i+1, jk), (ij+1, k), (ijk+1), (ij+1, k+1), (i+1, jk+1), (i+1, j+1, k), and (i+1, j+1, k+1). However, some of these samples may not be included in the tree, because they are too far from the surface. This leads to rendering artifacts, as shown in Figure 37-8.

37_octree_08.jpgFigure 37-8 Fixing Artifacts Caused by Straightforward Linear Interpolation

We remove these artifacts by modifying the tree creation process. We make sure that all of the samples necessary for linear interpolation are included in the tree. This can be done easily by enlarging the box used to check whether a tree node intersects the geometry. The box is built in such a way that it includes the previous samples in each dimension. Indeed, the sample at (ijk) must be added if one of the previous samples (for example, the one at (i-1, j-1, k-1)) is in the tree. This is illustrated in Figure 37-9.

37_octree_09.jpgFigure 37-9 Modifying the Tree Creation to Remove Linear Interpolation Artifacts

In our demo, we use the same depth for all color leaves. Of course, the octree structure makes it possible to store color information at different depths. However, doing so complicates linear interpolation. For more details, refer to Benson and Davis 2002.


When a textured mesh becomes small on the screen, multiple samples of the texture fall into the same pixel. Without a proper filtering algorithm, this leads to aliasing. Most GPUs implement the mipmapping algorithm on standard 2D textures. We extend this algorithm to our GPU octree textures.

We define the mipmap levels as follows. The finest level (level 0) corresponds to the leaves of the initial octree. A coarser level is built from the previous one by merging the leaves in their parent node. The node color is set to the average color of its leaves, and the leaves are suppressed, as shown in Figure 37-10. The octree depth is therefore reduced by one at each mipmapping level. The coarsest level has only one root node, containing the average color of all the leaves of the initial tree.

37_octree_10.jpgFigure 37-10 An Example of a Tree with Mipmapping

Storing one tree per mipmapping level would be expensive. Instead, we create a second 3D texture, called the LOD pool. The LOD pool has one cell per indirection grid of the indirection pool (see Figure 37-10, bottom row). Its resolution is thus S u S v S w (see “Storage” in Section 37.1.2). Each node of the initial tree becomes a leaf at a given mipmapping level. The LOD pool stores the color taken by the nodes when they are used as leaves in a mipmapping level.

To texture the mesh at a specific mipmapping level, we stop the tree lookup at the corresponding depth and look up the node’s average color in the LOD pool. The appropriate mipmapping level can be computed within the fragment program using partial-derivative instructions.

37.2.4 Converting the Octree Texture to a Standard 2D Texture

Our ultimate goal is to use octree textures as a replacement for 2D textures, thus completely removing the need for a 2D parameterization. However, the octree texture requires explicit programming of the texture filtering. This leads to long fragment programs. On recent GPUs, performance is still high enough for texture-authoring applications, where a single object is displayed. But for applications displaying complex scenes, such as games or simulators, rendering performance may be too low. Moreover, GPUs are extremely efficient at displaying filtered standard 2D texture maps.

Being able to convert an octree texture into a standard 2D texture is therefore important. We would like to perform this conversion dynamically: this makes it possible to select the best representation at runtime. For example, an object near the viewpoint would use the linearly interpolated octree texture and switch to the corresponding filtered standard 2D texture when it moves farther away. The advantage is that filtering of the 2D texture is natively handled by the GPU. Thus, the extra cost of the octree texture is incurred only when details are visible.

In the following discussion, we assume that the mesh is already parameterized. We describe how we create a 2D texture map from an octree texture.

To produce the 2D texture map, we render the triangles using their 2D (uv) coordinates instead of their 3D (xyz) coordinates. The triangles are textured with the octree texture, using the 3D coordinates of the mesh vertices as texture coordinates for the tree lookup. The result is shown in Figure 37-11.

37fig11.jpgFigure 37-11 Converting the Octree into a Standard 2D Texture

However, this approach produces artifacts. When the 2D texture is applied to the mesh with filtering, the background color bleeds inside the texture. This happens because samples outside of the 2D triangles are used by the linear interpolation for texture filtering. It is not sufficient to add only a few border pixels: more and more pixels outside of the triangles are used by coarser mipmapping levels. These artifacts are shown in Figure 37-12.

37_octree_12.jpgFigure 37-12 Artifacts Resulting from Straightforward Conversion

To suppress these artifacts, we compute a new texture in which the colors are extrapolated outside of the 2D triangles. To do so, we use a simplified GPU variant of the extrapolation method known as push-pull. This method has been used for the same purpose in Sander et al. 2001.

We first render the 2D texture map as described previously. The background is set with an alpha value of 0. The triangles are rendered using an alpha value of 1. We then ask the GPU to automatically generate the mipmapping levels of the texture. Then we collapse all the mipmapping levels into one texture, interpreting the alpha value as a transparency coefficient. This is done with the Cg code shown in Listing 37-3.

Finally, new mipmapping levels are generated by the GPU from this new texture. Figures 37-13 and 37-14 show the result of this process.

37_octree_13.jpgFigure 37-13 Color Extrapolation

37_octree_14.jpgFigure 37-14 Artifacts Removed Due to Color Extrapolation

Example 37-3. Color Extrapolation Cg Code

PixelOut main(V2FI IN,

  uniform sampler2D Tex) // texture with mipmapping levels


  PixelOut OUT;

  float4 res = float4(0.0, 0.0, 0.0, 0.0);

  float alpha = 0.0;

  // start with coarsest level

  float sz = TEX_SIZE;

   // for all mipmapping levels

   for (float i=0.0; i<=TEX_SIZE_LOG2; i+=1.0)


      // texture lookup at this level

      float2 MIP = float2(sz/TEX_SIZE, 0.0);

      float4 c = (float4)tex2D(Tex, IN.TCoord0, MIP.xy, MIP.yx);

      // blend with previous

      res = c + res * (1.0 - c.w);

      // go to finer level

      sz /= 2.0;


   // done - return normalized color (alpha == 1)

   OUT.COL = float4(,1);

   return OUT;


37.3 Application 2: Surface Simulation

We have seen with the previous application that octree structures are useful for storing color information along a mesh surface. But octree structures on GPUs are also useful for simulation purposes. In this section, we present how we use an octree structure on the GPU to simulate liquid flowing along a mesh.

We do not go through the details of the simulation itself, because that is beyond the scope of this chapter. We concentrate instead on how we use the octree to make available all the information required by the simulation.

The simulation is done by a cellular automaton residing on the surface of the object. To perform the simulation, we need to attach a 2D density map to the mesh surface. The next simulation step is computed by updating the value of each pixel with respect to the density of its neighbors. This is done by rendering into the next density map using the previous density map and neighboring information as input.

Because physical simulation is very sensitive to distortions, using a standard 2D parameterization to associate the mesh surface to the density map would not produce good results in general. Moreover, computation power could be wasted if some parts of the 2D density map were not used. Therefore, we use an octree to avoid the parameterization.

The first step is to create an octree around the mesh surface (see Section 37.2.1). We do not directly store density within the octree: the density needs to be updated using a render-to-texture operation during the simulation and should therefore be stored in a 2D texture map. Instead of density, each leaf of the octree contains the index of a pixel within the 2D density map. Recall that the leaves of the octree store three 8-bit values (in RGB channels). To be able to use a density map larger than 256×256, we combine the values of the blue and green channels to form a 16-bit index.

During simulation, we also need to access the density of the neighbors. A set of 2D RGB textures, called neighbor textures, is used to encode neighboring information. Let I be an index stored within a leaf L of the octree. Let Dmap be the density map and N a neighbor texture. The Cg call tex2D(Dmap,I) returns the density associated with leaf L. The call tex2D(N,I)gives the index within the density map corresponding to a neighbor (in 3D space) of the leaf L. Therefore, tex2D(Dmap, tex2D(N,I)) gives us the density of the neighbor of L.

To encode the full 3D neighborhood information, 26 textures would be required (a leaf of the tree can have up to 26 neighbors in 3D). However, fewer neighbors are required in practice. Because the octree is built around a 2D surface, the average number of neighbors is likely to be closer to 9.

Once these textures have been created, the simulation can run on the density map. Rendering is done by texturing the mesh with the density map. The octree is used to retrieve the density stored in a given location of the mesh surface. Results of the simulation are shown in Figure 37-15. The user can interactively add liquid on the surface. Videos are available on the book’s CD.

37_octree_15.jpgFigure 37-15 Liquid Flowing Along Mesh Surfaces

37.4 Conclusion

We have presented a complete GPU implementation of octree textures. These structures offer an efficient and convenient way of storing undistorted data along a mesh surface. This can be color data, as in the mesh-painting application, or data for dynamic texture simulation, as in the flowing liquid simulation. Rendering can be done efficiently on modern hardware, and we have provided solutions for filtering to avoid texture aliasing. Nevertheless, because 2D texture maps are preferable in some situations, we have shown how an octree texture can be dynamically converted into a 2D texture without artifacts.

Octrees are very generic data structures, widely used in computer science. They are a convenient way of storing information on unparameterized meshes, and more generally in space. Many other applications, such as volume rendering, can benefit from their hardware implementation.

We hope that you will discover many further uses for and improvements to the techniques presented in this chapter! Please see for updates of the source code and additional information.

37.5 References

Benson, D., and J. Davis. 2002. “Octree Textures.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 785–790.

Debry, D., J. Gibbs, D. Petty, and N. Robins. 2002. “Painting and Rendering Textures on Unparameterized Models.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 763–768.

Sander, P., J. Snyder, S. Gortler, and H. Hoppe. 2001. “Texture Mapping Progressive Meshes.” In Proceedings of SIGGRAPH 2001, pp. 409–416.


Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

NVIDIA makes no warranty or representation that the techniques described herein are free from any Intellectual Property claims. The reader assumes all risk of any such claims based on his or her use of these techniques.

The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact:

        U.S. Corporate and Government Sales
(800) 382-3419

For sales outside of the U.S., please contact:

        International Sales

Visit Addison-Wesley on the Web:

Library of Congress Cataloging-in-Publication Data

GPU gems 2 : programming techniques for high-performance graphics and general-purpose
computation / edited by Matt Pharr ; Randima Fernando, series editor.
p. cm.
Includes bibliographical references and index.
ISBN 0-321-33559-7 (hardcover : alk. paper)
1. Computer graphics. 2. Real-time programming. I. Pharr, Matt. II. Fernando, Randima.

T385.G688 2005

GeForce™ and NVIDIA Quadro® are trademarks or registered trademarks of NVIDIA Corporation.

Nalu, Timbury, and Clear Sailing images © 2004 NVIDIA Corporation.

mental images and mental ray are trademarks or registered trademarks of mental images, GmbH.

Copyright © 2005 by NVIDIA Corporation.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher. Printed in the United States of America. Published simultaneously in Canada.

For information on obtaining permission for use of material from this work, please submit a written request to:

       Pearson Education, Inc.
Rights and Contracts Department
One Lake Street
Upper Saddle River, NJ 07458

Text printed in the United States on recycled paper at Quebecor World Taunton in Taunton, Massachusetts.

Second printing, April 2005

Posted in CLOUD, CLUSTER, Computer Softwares, Computing Technology, CUDA, Graphics Cards, OpenCL, OpenGL, PARALLEL | 1 Comment »


Posted by Hemprasad Y. Badgujar on October 1, 2012


“The OpenCL specifications” by the Khronos Group

Format: PDF
File Size: 3.3MB
Digital: 377 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 15 November 2011 (version 1.2, revision 15)
OpenCL-version: 1.2

As a specifications-document, you cannot expect a nice piece of prose, but most of the knowledge you need to know is in it. There are certainly some gaps (especially in clear explanation), but every version is getting better. When studying other sources, always have this document with you as a reference. I printed it as two pages per side (A4).

Read chapters 1 to 3, and leave the rest as a reference. Other books explain the long, long lists of language specifications in a nicer form. Also the most you’ll better learn by doing.

“The OpenCL Programming Book” by Fixstars

Two version available, the 1.0 version and the 1.2 version. To start with the 1.2:

Format: PDF
File Size: 3.2MB
Pages: 325
Price: USD 19.50
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Aki Asahara, Satoshi Miki, Jeongdo Son and Satoshi Miki. Satoru Tagawa (translator)
Published Date: January 2012
OpenCL-version: 1.2


Format: PDF
File Size: 3.49MB
Pages: 246
Price: free
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki
Published Date: 31 March 2010
OpenCL-version: 1.0

1.0-version: It seems to be translated from Japanese to English, but except some small typos and spelling errors the book is very easy to read. The book explains the chapters you could skip in Khronos’ specifications-document, but certainly is not complete since it discusses OpenCL 1.0 and has a focus on the basics. The parts where that build up a program step-by-step is a bit annoying to read, because they repeat the whole program again while only a few lines have changed. The book would be more like 180-200 pages if written more compact.

1.2-version: Thicker, more up to date and a promise there are less translation-errors.

Full review later.

Heterogeneous Computing with OpenCL by Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa

Format: print 
Pages: 400 (approx.)
Price: USD 69.95
Publisher: Morgan Kaufmann
Authors: Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa
Published Date: Sept 2011
OpenCL-version: 1.1

This is where we all chose OpenCL for: hybrid processors. And this book dives into that world completely, so we actually learn a lot new stuff about the advantages of having a GPU on your lap.

Full review later.

OpenCL in Action by Matthew Scarpino

Format: PDF and/or print
File Size: 8.1MB
Pages: 475 (approx.)
Price: USD 47.99 (e-book). USD 59.99 (p-book + e-book)
Publisher: Addison-Wesley Professional
Authors: Matthew Scarpino
Published Date: non-final version updated regularly, target November 2011
OpenCL-version: 1.1

Just like the above book, “OpenCL in Action” is work-in-progress and you can read along. Matthew Scarpino also wrote SWT/JFace In Action and Programming the Cell-processor, has a profession in Linux and has much experience in IT. The book seems to target an audience who want a more practical guide to learn OpenCL.

Full review later.

“OpenCL Programming Guide” by Aaftab Munshi, Benedict Gaster, Timothy G. Mattson and Dan Ginsburg

Format: PDF and/or print
File Size: ??MB
Pages: 648
Price: USD 35.19 (e-book), USD 43.99 (print), USD 59.39 (bundle)
Publisher: Addison-Wesley Professional
Authors: Aaftab Munshi (Apple, Khronos Group), Benedict Gaster (AMD), Timothy G. Mattson, Dan Ginsburg
Published Date: August 2011
OpenCL-version: 1.1
Homepage: and

Aaftab Munshi is also responsible for the OpenCL-specifications, so he probably knows where he’s talking about.

The 648 pages it is quite bigger than the targeted 480. Currently this is a very good replacement for Fixstars’ book. Disadvantage is that sending the printed book overseas (not USA/Canada) is much too expensive and people from the Eurasian continent, Africa and Latin America should just print it locally – looking into that to find better options.

Full review later.

“Programming Massively Parallel Processors” by David B. Kirk and Wen-mei W. Hwu

Format: Acid-free paper book
Pages: 258 pages
Price: USD 46.40
Publisher: Morgan Kaufmann
Authors: David B. Kirk (NVIDIA) and Wen-mei W. Hwu (University of Illinois)
Published Date: 28 January 2010
OpenCL-version: 1.0

The book claims to discuss both OpenCL and CUDA, but actually there was a chapter added after most of the book was written and the focus is strong towards NVIDIA hardware. It is a nice book for people who need to learn to program CUDA-only software/hardware and don’t want a book that’s too hard to understand. There are assignments at the end of each chapter and important subjects are explained to the bottom, so you don’t need to have a hard time with those assignments. After you read the book you have learned that initialisation of OpenCL programs is tedious and know a lot about optimising kernels for NVIDIA GPUs.

It is not good for people interested in OpenCL-compliant architectures from AMD, ARM and IBM besides NVIDIA’s. It is one of the best resources to understand NVIDIA architectures from a view of a GPGPU-programmer.

Posted in Computer Languages, CUDA, GPU (CUDA), Open CL, OpenCL, PARALLEL, Project Related | Leave a Comment »

Extracts from a Personal Diary

dedicated to the life of a silent girl who eventually learnt to open up

Num3ri v 2.0

I miei numeri - seconda versione


Just another site

Algunos Intereses de Abraham Zamudio Chauca

Matematica, Linux , Programacion Serial , Programacion Paralela (CPU - GPU) , Cluster de Computadores , Software Cientifico




A great site

Travel tips

Travel tips

Experience the real life.....!!!

Shurwaat achi honi chahiye ...

Ronzii's Blog

Just your average geek's blog

Karan Jitendra Thakkar

Everything I think. Everything I do. Right here.


News About Tech, Money and Innovation

Chetan Solanki

Helpful to u, if u need it.....


Explorer of Research #HEMBAD


Explorer of Research #HEMBAD


A great site


This is My Space so Dont Mess With IT !!

%d bloggers like this: