Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘Graphics Cards’ Category

Computer Vision Algorithm Implementations

Posted by Hemprasad Y. Badgujar on May 6, 2014


Participate in Reproducible Research

General Image Processing

OpenCV
(C/C++ code, BSD lic) Image manipulation, matrix manipulation, transforms
Torch3Vision
(C/C++ code, BSD lic) Basic image processing, matrix manipulation and feature extraction algorithms: rotation, flip, photometric normalisations (Histogram Equalization, Multiscale Retinex, Self-Quotient Image or Gross-Brajovic), edge detection, 2D DCT, 2D FFT, 2D Gabor, PCA to do Eigen-Faces, LDA to do Fisher-Faces. Various metrics (Euclidean, Mahalanobis, ChiSquare, NormalizeCorrelation, TangentDistance, …)
ImLab
(C/C++ code, MIT lic) A Free Experimental System for Image Processing (loading, transforms, filters, histogram, morphology, …)
CIMG
(C/C++ code, GPL and LGPL lic) CImg Library is an open source C++ toolkit for image processing
Generic Image Library (GIL) - boost integration
(C/C++ code, MIT lic) Adobe open source C++ Generic Image Library (GIL)
SimpleCV a kinder, gentler machine vision library
(python code, MIT lic) SimpleCV is a Python interface to several powerful open source computer vision libraries in a single convenient package
PCL, The Point Cloud Library
(C/C++ code, BSD lic) The Point Cloud Library (or PCL) is a large scale, open project for point cloud processing. The PCL framework contains numerous state-of-the art algorithms including filtering, feature estimation, surface reconstruction, registration, model fitting and segmentation.
Population, imaging library in C++ for processing, analysing, modelling and visualising
(C/C++ code, CeCill lic) Population is an open-source imaging library in C++ for processing, analysing, modelling and visualising including more than 200 algorithms designed by V. Tariel.
qcv
(C/C++ code, LGPL 3) A computer vision framework based on Qt and OpenCV that provides an easy-to-use interface to display, analyze and run computer vision algorithms. The library is provided with multiple application examples including stereo, SURF, Sobel and Hough transform.
Machine Vision Toolbox
(MATLAB/C, LGPL lic) image processing, segmentation, blob/line/point features, multiview geometry, camera models, colorimetry.
BoofCV
(Java code, Apache lic) BoofCV is an open source Java library for real-time computer vision and robotics applications. BoofCV is organized into several packages: image processing, features, geometric vision, calibration, visualize, and IO.
Simd
(C++ code, MIT lic) Simd is a free open-source C++ library of high-performance image processing algorithms. The algorithms are optimized using SIMD CPU extensions such as SSE2, SSSE3, SSE4.2 and AVX2.
Free but not open source – ArrayFire (formerly LibJacket) is a matrix library for CUDA
(CUDA/C++, free lic) ArrayFire offers hundreds of general matrix and image processing functions, all running on the GPU. The syntax is very Matlab-like, with the goal of offering easy porting of Matlab code to C++/ArrayFire.

Image Acquisition, Decoding & encoding

FFMPEG
(C/C++ code, LGPL or GPL lic) Record, convert and stream audio and video (supports many codecs)
OpenCV
(C/C++ code, BSD lic) PNG, JPEG,… images, avi video files, USB webcam,…
Torch3Vision
(C/C++ code, BSD lic) Video file decoding/encoding (ffmpeg integration), image capture from a frame grabber or from USB, Sony pan/tilt/zoom camera control using VISCA interface
lib VLC
(C/C++ code, GPL lic) Used by VLC player: record, convert and stream audio and video
Live555
(C/C++ code, LGPL lic) RTSP streams
ImageMagick
(C/C++ code, GPL lic) Loading & saving DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PhotoCD, PNG, Postscript, SVG, TIFF, and more
DevIL
(C/C++ code, LGPL lic) Loading & saving various image formats
FreeImage
(C/C++ code, GPL & FPL lic) PNG, BMP, JPEG, TIFF loading
VideoMan
(C/C++ code, LGPL lic) VideoMan is trying to make the image capturing process from cameras, video files or image sequences easier.

Segmentation

OpenCV
(C/C++ code, BSD lic) Pyramid image segmentation
Branch-and-Mincut
(C/C++ code, Microsoft Research Lic) Branch-and-Mincut Algorithm for Image Segmentation
Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo

Machine Learning

Torch
(C/C++ code, BSD lic) Gradient machines ( multi-layered perceptrons, radial basis functions, mixtures of experts, convolutional networks and even time-delay neural networks), Support vector machines, Ensemble models (bagging, adaboost), Non-parametric models (K-nearest-neighbors, Parzen regression and Parzen density estimator), distributions (Kmeans, Gaussian mixture models, hidden Markov models, input-output hidden Markov models, and Bayes classifier), speech recognition tools

Object Detection

OpenCV
(C/C++ code, BSD lic) Viola-jones face detection (Haar features)
Torch3Vision
(C/C++ code, BSD lic) MLP & cascade of Haar-like classifiers face detection
Hough Forests
(C/C++ code, Microsoft Research Lic) Class-Specific Hough Forests for Object Detection
Efficient Subwindow Object Detection
(C/C++ code, Apache Lic) Christoph Lampert “Efficient Subwindow” algorithms for Object Detection
INRIA Object Detection and Localization Toolkit
(C/C++ code, Custom Lic) Histograms of Oriented Gradients library for Object Detection

Object Category Labelling

Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo
Multi-label optimization
(C/C++/MATLAB code) The gco-v3.0 library is for optimizing multi-label energies. It supports energies with any combination of unary, pairwise, and label cost terms.

Optical flow

OpenCV
(C/C++ code, BSD lic) Horn & Schunck algorithm, Lucas & Kanade algorithm, Lucas-Kanade optical flow in pyramids, block matching.
GPU-KLT+FLOW
(C/C++/OpenGL/Cg code, LGPL) Gain-Adaptive KLT Tracking and TV-L1 optical flow on the GPU.
RLOF
(C/C++/Matlab code, Custom Lic.) The RLOF library provides GPU/CPU implementations of optical flow and feature tracking methods.

Features Extraction & Matching

SIFT by R. Hess
(C/C++ code, GPL lic) SIFT feature extraction & RANSAC matching
OpenSURF
(C/C++ code) SURF feature extraction algorithm (a kind of fast SIFT)
ASIFT (from IPOL)
(C/C++ code, Ecole Polytechnique and ENS Cachan for commercial Lic) Affine SIFT (ASIFT)
VLFeat (formerly Sift++)
(C/C++ code) SIFT, MSER, k-means, hierarchical k-means, agglomerative information bottleneck, and quick shift
SiftGPU
A GPU Implementation of Scale Invariant Feature Transform (SIFT)
Groupsac
(C/C++ code, GPL lic) An enhanced version of RANSAC that considers the correlation between data points

Nearest Neighbors matching

FLANN
(C/C++ code, BSD lic) Approximate Nearest Neighbors (Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration)
ANN
(C/C++ code, LGPL lic) Approximate Nearest Neighbor Searching

Tracking

OpenCV
(C/C++ code, BSD lic) Kalman, Condensation, CAMSHIFT, Mean shift, Snakes
KLT: An Implementation of the Kanade-Lucas-Tomasi Feature Tracker
(C/C++ code, public domain) Kanade-Lucas-Tomasi Feature Tracker
GPU_KLT
(C/C++/OpenGL/Cg code, ) A GPU-based Implementation of the Kanade-Lucas-Tomasi Feature Tracker
GPU-KLT+FLOW
(C/C++/OpenGL/Cg code, LGPL) Gain-Adaptive KLT Tracking and TV-L1 optical flow on the GPU
On-line boosting trackers
(C/C++, LGPL) On-line boosting tracker, semi-supervised tracker, beyond semi-supervised tracker
Single Camera background subtraction tracking
(C/C++, LGPL) Background subtraction based tracking algorithm using OpenCV.
Multi-camera tracking
(C/C++, LGPL) Multi-camera particle filter tracking algorithm using OpenCV and Intel IPP.

Simultaneous localization and mapping

Real-Time SLAM – SceneLib
(C/C++ code, LGPL lic) Real-time vision-based SLAM with a single camera
PTAM
(C/C++ code, Isis Innovation Limited lic) Parallel Tracking and Mapping for Small AR Workspaces
GTSAM
(C/C++ code, BSD lic) GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices

Camera Calibration & constraint

OpenCV
(C/C++ code, BSD lic) Chessboard calibration, calibration with rig or pattern
Geometric camera constraint – Minimal Problems in Computer Vision
Minimal problems in computer vision arise when computing geometrical models from image data. They often lead to solving systems of algebraic equations.
Camera Calibration Toolbox for Matlab
(Matlab toolbox) Camera Calibration Toolbox for Matlab by Jean-Yves Bouguet (C implementation in OpenCV)

Multi-View Reconstruction

Bundle Adjustment – SBA
(C/C++ code, GPL lic) A Generic Sparse Bundle Adjustment Package Based on the Levenberg-Marquardt Algorithm
Bundle Adjustment – SSBA
(C/C++ code, LGPL lic) Simple Sparse Bundle Adjustment (SSBA)

Stereo

Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo
LIBELAS: Library for Efficient LArge-scale Stereo Matching
(C/C++ code) Disparity maps, stereo

Structure from motion

Bundler
(C/C++ code, GPL lic) A structure-from-motion system for unordered image collections
Patch-based Multi-view Stereo Software (Windows version)
(C/C++ code, GPL lic) A multi-view stereo software that takes a set of images and camera parameters, then reconstructs 3D structure of an object or a scene visible in the images
libmv – work in progress
(C/C++ code, MIT lic) A structure from motion library
Multicore Bundle Adjustment
(C/C++/GPU code, GPL3 lic) Design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems.
openMVG
(C/C++/GPU code, MPL2 lic) OpenMVG (“open Multiple View Geometry”) is a library for computer-vision scientists, especially targeted at the Multiple View Geometry community. It is designed to provide easy access to classical problem solvers in Multiple View Geometry and to solve them accurately.

Visual odometry

LIBVISO2: Library for VISual Odometry 2
(C/C++ code, Matlab, GPL lic) LIBVISO2 is a very fast cross-platform (Linux, Windows) C++ library with MATLAB wrappers for computing the 6-DOF motion of a moving mono/stereo camera.

Posted in Apps Development, C, Computer Hardware, Computer Network & Security, CUDA, Game Development, GPU (CUDA), GPU Accelareted, Graphics Cards, Image Processing, OpenCV, PARALLEL, Simulation, Virtualization | 3 Comments »

CUDA Thread Execution Model

Posted by Hemprasad Y. Badgujar on July 22, 2013


CUDA Thread Execution Model

 

Grid of Thread Blocks

In a previous article, I gave an introduction to programming with CUDA. Now I’d like to go into a little more depth about the CUDA thread execution model and the architecture of a CUDA-enabled GPU. I assume that the reader has basic knowledge about CUDA and already knows how to set up a project that uses the CUDA runtime API. If you don’t know how to set up a project with CUDA, you can refer to my previous article: Introduction to CUDA.

 

 

GPU Architecture

To understand the thread execution model for modern GPUs, we must first make an analysis of the GPU compute architecture. In this article I will focus on the Fermi compute architecture found in modern GPUs (for example, the GTX 580).

Overview of the Fermi Architecture

A Fermi GPU consists of 512 CUDA cores. These 512 CUDA cores are split across 16 Streaming Multiprocessors (SMs), each SM consisting of 32 CUDA cores. The GPU has six 64-bit memory partitions supporting up to 6 GB of GDDR5 DRAM memory.

Fermi Architecture

Each streaming multiprocessor (SM) has 32 CUDA cores. Each CUDA core consists of an integer arithmetic logic unit (ALU) and a floating point unit (FPU).

Fermi Streaming Multiprocessor (SM)

The SM has 16 load/store units allowing source and destination addresses to be calculated for sixteen threads per clock.

Each SM also has four Special Function Units (SFUs) that execute transcendental instructions such as sine, cosine, reciprocal, and square root.

CUDA Threads

Now that we’ve seen the specific architecture of a Fermi GPU, let’s analyze the more general CUDA thread execution model.

Each kernel function is executed in a grid of threads. This grid is divided into blocks also known as thread blocks and each block is further divided into threads.

CUDA Execution Model

In the image above we see that this example grid is divided into nine thread blocks (3×3); each thread block consists of 9 threads (3×3), for a total of 81 threads for the kernel grid.

This image only shows a 2-dimensional grid, but if the graphics device supports compute capability 2.0, then the grid of thread blocks can actually be partitioned into 1, 2 or 3 dimensions; if the device only supports compute capability 1.x, then the grid can be partitioned into 1 or 2 dimensions (in that case, the 3rd dimension should always be set to 1).

The thread block is partitioned into individual threads and for all compute capabilities, threads can be partitioned into 1, 2, or 3 dimensions. The maximum number of threads that can be assigned to a thread block is 512 for devices with compute capability 1.x and 1024 threads for devices that support compute capability 2.0.

Compute Capability
Technical Specifications                                    1.0–1.3    2.0
Maximum dimensionality of a grid of thread blocks           2          3
Maximum x-, y-, or z-dimension of a grid of thread blocks   65535      65535
Maximum dimensionality of a thread block                    3          3
Maximum x- or y-dimension of a block                        512        1024
Maximum z-dimension of a block                              64         64
Maximum number of threads per block                         512        1024

The number of blocks within a grid can be determined within a kernel by using the built-in variable gridDim, and the number of threads within a block can be determined by using the built-in variable blockDim.

A thread block is uniquely identified in a kernel function by using the built-in variable blockIdx, and a thread within a block is uniquely identified in a kernel function by using the built-in variable threadIdx.

The built-in variables gridDim, blockDim, blockIdx, and threadIdx are each 3-component structs with members x, y, z.

With a 1-D kernel, the unique thread ID within a block is simply the x component of the threadIdx variable:

threadId = threadIdx.x

and the unique block ID within a grid is the x component of the blockIdx variable:

blockId = blockIdx.x

To determine the unique thread ID in a 2-D block, you would use the following formula:

threadId = ( threadIdx.y * blockDim.x ) + threadIdx.x

and to determine the unique block ID within a 2-D grid, you would use the following formula:

blockId = ( blockIdx.y * gridDim.x ) + blockIdx.x
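The same arithmetic can be written directly in a kernel. The sketch below is only illustrative (the kernel name WhoAmI and the output buffer are hypothetical, not part of the original example):

__global__ void WhoAmI( unsigned int* out )
{
    // Unique thread ID within this (2-D) block.
    unsigned int threadId2D = ( threadIdx.y * blockDim.x ) + threadIdx.x;
    // Unique block ID within the (2-D) grid.
    unsigned int blockId2D  = ( blockIdx.y * gridDim.x ) + blockIdx.x;
    // Globally unique thread ID across the whole grid.
    unsigned int globalId   = ( blockId2D * ( blockDim.x * blockDim.y ) ) + threadId2D;
    out[globalId] = globalId; // store the ID so the result is observable on the host
}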

I’ll leave it as an exercise for the reader to determine the formula to compute the unique thread ID and block ID in a 3D grid.

Matrix Addition Example

Let’s take a look at an example kernel that one might execute.

Let’s assume we want to implement a kernel function that adds two matrices and stores the result in a 3rd.

The general formula for matrix addition is:

C(i,j) = A(i,j) + B(i,j)

That is, the sum of matrix A and matrix B is computed component by component.

Let’s first write the host version of this method that we would execute on the CPU.

MatrixAdd.cpp
void MatrixAddHost( float* C, float* A, float* B, unsigned int matrixRank )
{
    for( unsigned int j = 0; j < matrixRank; ++j )
    {
        for ( unsigned int i = 0; i < matrixRank; ++i )
        {
            unsigned int index = ( j * matrixRank ) + i;
            C[index] = A[index] + B[index];
        }
    }
}

This is a pretty standard method that loops through the rows and columns of a matrix, adds the components, and stores the result in a third matrix. Now let’s see how we might implement this as a kernel that executes on the GPU using CUDA.

First, we need to think of the problem domain. In this case, the domain is trivial: it is the components of a matrix. Since we are operating on 2-D arrays, it seems reasonable to split our domain into two dimensions; one for the rows, and another for the columns of the matrices.

We will assume that we are working on square matrices. This simplifies the problem, but mathematically matrix addition only requires that the two matrices have the same number of rows and columns; it does not require that the matrices be square.

Since we know that a kernel is limited to 512 threads/block with compute capability 1.x and 1024 threads/block with compute capability 2.0, then we know we can split our job into square thread blocks each consisting of 16×16 threads (256 threads per block) with compute capability 1.x and 32×32 threads (1024 threads per block) with compute capability 2.0.
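If you want to pick the block edge at run time rather than hard-coding it, one option is to query the device. This is only a sketch using the runtime API (the choice of 16 vs. 32 simply mirrors the limits quoted above):

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop, 0 );                     // query device 0
unsigned int blockEdge = ( prop.major >= 2 ) ? 32 : 16;  // 32x32 blocks on 2.0+, 16x16 on 1.x
dim3 blockDim( blockEdge, blockEdge, 1 );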

For simplicity, I will assume compute capability 1.x for the remainder of this tutorial.

If we limit the size of our matrix to no larger than 16×16, then we only need a single block to compute the matrix sum and our kernel execution configuration might look something like this:

main.cpp
dim3 gridDim( 1, 1, 1 );
dim3 blockDim( matrixRank, matrixRank, 1 );
MatrixAddDevice<<<gridDim, blockDim>>>( C, A, B, matrixRank );

In this simple case, the kernel grid consists of only a single block with matrixRank × matrixRank threads.

However, if we want to sum matrices with more than 512 components, then we must split our problem domain into smaller groups that can be processed in multiple blocks.

Let’s assume that we want to limit our blocks to execute in 16×16 (256) threads. We can determine the number of blocks that will be required to operate on the entire array by dividing the size of the matrix dimension by the block edge (16) and rounding up to the nearest whole number:

blocks = ceil( matrixRank / 16 )

And we can determine the number of threads per block by dividing the size of the matrix dimension by the number of blocks and rounding up to the nearest whole number:

threads = ceil( matrixRank / blocks )

So for example, for a 4×4 matrix, we would get

blocks = ceil( 4 / 16 ) = 1

and the number of threads is computed as:

threads = ceil( 4 / 1 ) = 4

resulting in a 1×1 grid of 4×4 thread blocks for a total of 16 threads.

For another example, a 512×512 matrix, we would get:

blocks = ceil( 512 / 16 ) = 32

and the number of threads is computed as:

threads = ceil( 512 / 32 ) = 16

resulting in a 32×32 grid of 16×16 thread blocks for a total of 262,144 threads.

The host code to setup the kernel granularity might look like this:

main.cpp
size_t blocks = ceilf( matrixRank / 16.0f );
dim3 gridDim( blocks, blocks, 1 );
size_t threads = ceilf( matrixRank / (float)blocks );
dim3 blockDim( threads, threads, 1 );
 
MatrixAddDevice<<< gridDim, blockDim >>>( C, A, B, matrixRank );

You may have noticed that if the size of the matrix does not fit nicely into equally divisible blocks, then we may get more threads than are needed to process the array. It is not possible to configure a grid of thread blocks where one block contains fewer threads than the others.

One way to solve this is to execute multiple kernels – one that handles all the equally divisible blocks, and a second kernel invocation that handles the partial block. The other solution is simply to ignore any threads that fall outside our problem domain, which is generally easier (and more efficient) than invoking multiple kernels (though this should be verified by profiling).

The Matrix Addition Kernel Function

On the device, the kernel function is executed by one thread for every element in the problem domain (the matrix elements). We can use the built-in variables gridDim, blockDim, blockIdx, and threadIdx to identify the current matrix element that the current kernel is operating on.

If we assume we have a 9×9 matrix and we split the problem domain into 3×3 blocks each consisting of 3×3 threads as shown in the CUDA Grid below, then we could compute the ith column and the jth row of the matrix with the following formulas:

i = ( blockIdx.x * blockDim.x ) + threadIdx.x
j = ( blockIdx.y * blockDim.y ) + threadIdx.y

So for thread (0,0) of block (1,1) of our 9×9 matrix, we would get:

i = ( 1 * 3 ) + 0 = 3

for the column and:

j = ( 1 * 3 ) + 0 = 3

for the row.

The index into the 1-D buffer that stores the matrix is then computed as:

index = ( j * 9 ) + i

and substituting gives:

index = ( 3 * 9 ) + 3 = 30

which is the correct element in the matrix. This solution assumes we are accessing the matrix in row-major order.

CUDA Grid Example

Let’s see how we might implement this in the kernel.

MatrixAdd.cu
__global__ void MatrixAddDevice( float* C, float* A, float* B, unsigned int matrixRank )
{
    unsigned int column = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    unsigned int row    = ( blockDim.y * blockIdx.y ) + threadIdx.y;
 
    unsigned int index = ( matrixRank * row ) + column;
    if ( index < matrixRank * matrixRank ) // prevent reading/writing array out-of-bounds.
    {
        C[index] = A[index] + B[index];
    }
}

On lines 3 and 4 we compute the column and row of the matrix we are operating on using the formulas shown earlier.

On line 6, the 1-D index into the matrix array is computed based on the size of a single dimension of the square matrix.

We must be careful that we don’t try to read or write outside the bounds of the matrix. This might happen if the size of the matrix does not fit nicely into the size of the CUDA grid (in the case of matrices whose size is not evenly divisible by 16). To protect the read and write operations, on line 7 we check that the computed index does not exceed the size of our array.

Thread Synchronization

CUDA provides a synchronization barrier for all threads in a block through the __syncthreads() function. A practical example of thread synchronization will be shown in a later article about optimizing a CUDA kernel, but for now it’s only important that you know this functionality exists.
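As a small illustration of the kind of pattern that needs this barrier, the sketch below stages data in shared memory and reverses it within a block. This is not from the original article; it assumes 1-D blocks of exactly 256 threads:

__global__ void ReverseBlock( float* data )
{
    __shared__ float tile[256];                      // one element per thread in the block
    unsigned int i = ( blockIdx.x * blockDim.x ) + threadIdx.x;

    tile[threadIdx.x] = data[i];
    __syncthreads();                                 // wait until every thread has written its element

    data[i] = tile[ blockDim.x - 1 - threadIdx.x ];  // safe: the whole tile is populated
}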

Thread synchronization is only possible across all threads in a block, but not across all threads running in the grid. By not allowing threads across blocks to be synchronized, CUDA enables multiple blocks to be executed on other streaming multiprocessors (SMs) in any order. The queue of blocks can be distributed to any SM without having to wait for blocks on another SM to complete. This allows CUDA-enabled applications to scale across devices that have more SMs at their disposal, executing more blocks concurrently than devices with fewer SMs.

Thread synchronization follows strict synchronization rules. All threads in a block must hit the synchronization point, or none of them must hit it.

Given the following code block:

sample.cu
if ( threadID % 2 == 0 )
{
    __syncthreads();
}
else
{
    __syncthreads();
}

the threads in a block will wait indefinitely for each other, because the two occurrences of __syncthreads are considered separate synchronization points, and all threads of the same block must hit the same synchronization point, or all of them must not hit it.
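A minimal way to fix the example above (again just a sketch, reusing the hypothetical threadID variable) is to move the barrier out of the divergent branches so that every thread in the block reaches the same synchronization point:

if ( threadID % 2 == 0 )
{
    // work for even threads
}
else
{
    // work for odd threads
}
__syncthreads(); // single barrier reached by all threads of the block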

Thread Assignment

When a kernel is invoked, the CUDA runtime will distribute the blocks across the SMs on the device. A maximum of 8 blocks (regardless of platform) will be assigned to each SM as long as there are enough resources (registers, shared memory, and threads) to execute all the blocks. In the case where there are not enough resources on the SM, the CUDA runtime will automatically assign fewer blocks per SM until the resource usage is below the maximum per SM.

The total number of blocks that can be executed concurrently is dependent on the device. In the case of the Fermi architecture discussed earlier, the 16 SMs can each concurrently handle 8 blocks for a total of 128 blocks executing concurrently on the device.

Because the Fermi architecture supports compute capability 2.0, we can create thread blocks consisting of at most 1024 threads, so a Fermi device can technically have 131,072 threads resident in its SMs for execution. This does not mean that the device executes 131,072 instructions simultaneously every clock tick. In order to understand how the blocks are actually executed on the device, we must look one step further to see how the threads of a block are actually scheduled on the SMs.

Thread Scheduling

When a block is assigned to an SM, it is further divided into groups of 32 threads called warps. Warp scheduling is different depending on the platform, but if we take a look at the Fermi architecture, we see that a single SM consists of 32 CUDA cores (or streaming processors) – two groups of 16 per SM.

Each SM in the Fermi architecture (see Fermi architecture image above) features two warp schedulers allowing two warps to be issued and executed concurrently. Fermi’s dual-warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four special function units (SFU’s).

Most instructions can be dual-issued; two integer instructions, two floating point instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently.

Fermi – Dual Warp Scheduler

You might be wondering why it would be useful to schedule 8 blocks of a maximum of 1024 threads if the SM only has 32 SPs. The answer is that each instruction of a kernel may require more than a few clock cycles to execute (for example, an instruction to read from global memory will require multiple clock cycles). Any instruction that requires multiple clock cycles to execute incurs latency. The latency of long-running instructions can be hidden by executing instructions from other warps while waiting for the result of the previous warp. This technique of filling the latency of expensive operations with work from other threads is often called latency hiding.

Thread Divergence

It is reasonable to imagine that your CUDA program contains flow-control statements like if-then-else, switch, while loops, or for loops. Whenever you introduce these flow-control statements in your code, you also introduce the possibility of thread divergence. It is important to be aware of the consequences of thread divergence and also to understand how you can minimize its negative impact.

Thread divergence occurs when some threads in a warp follow a different execution path than others. Let’s take the following code block as an example:

test.cu
__global__ void TestDivergence( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    float value = 0.0f;
 
    if ( threadIdx.x % 2 == 0 )
    {
        // Threads executing PathA are active while threads
        // executing PathB are inactive.
        value = PathA( src );
    }
    else
    {
        // Threads executing PathB are active while threads
        // executing PathA are inactive.
        value = PathB( src );
    }
    // Threads converge here again and execute in parallel.
    dst[index] = value;
}

Then our flow control and thread divergence would look something like this:

Thread Divergence

As you can see from this example, the even-numbered threads in each block will execute PathA while the odd-numbered threads in the block will execute PathB. This is pretty much the worst-case scenario for a simple divergence example.

PathA and PathB cannot be executed concurrently on all threads because their execution paths are different. Only the threads that follow exactly the same execution path can run concurrently, so the total running time of the warp is the sum of the execution times of both PathA and PathB.

In this example, the threads in the warp that execute PathA are activated while the condition is true and all the other threads are deactivated. Then, in another pass, the threads for which the condition is false are activated to execute PathB and the other threads are deactivated. This means that resolving the condition requires two passes to be executed for a single warp.

The overhead of having the warp execute both PathA and PathB can be eliminated if the programmer takes careful consideration when writing the kernel. If possible, all threads of a block (since warps can’t span thread blocks) should execute the same execution path. This way you guarantee that all threads in a warp will execute the same execution path and there will be no thread divergence within a block.
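One common way to apply this advice (only a sketch; PathA and PathB are the same hypothetical device functions used in the example above) is to branch on the warp index instead of the individual thread index, so every thread of a given warp takes the same path:

__global__ void TestNoDivergence( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;

    // warpSize is 32 on current hardware; all threads of the same warp
    // compute the same warp index and therefore take the same branch.
    if ( ( threadIdx.x / warpSize ) % 2 == 0 )
    {
        dst[index] = PathA( src );
    }
    else
    {
        dst[index] = PathB( src );
    }
}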

Exercise

If a device supports compute capability 1.3 then it can have blocks with a maximum of 512 threads/block and 8 blocks/SM can be scheduled concurrently. Each SM can schedule groups of 32-thread units called warps. The maximum number of resident warps per SM in a device that supports compute capability 1.3 is 32 and the maximum number of resident threads per SM is 1024.

Q. What would be the ideal block granularity to compute the product of two 2-D matrices of size 1024 x 1024?

  1. 4×4?
  2. 8×8?
  3. 16×16?
  4. or 32×32?

A. To answer this question, let’s analyze each choice and give pros and cons for each one.

4×4: If we decide to split our domain into 4×4 thread blocks, then we have 16 threads per block. In order to fully occupy the SM that can support 1024 threads per SM, we would need 1024/16 = 64 blocks but the SM can only schedule 8 blocks/SM so each SM would be scheduled with 8 blocks each having 16 threads which is 128 threads/SM. When divided into warps, we only have 4 warps scheduled per SM out of a total of 32 which gives only 12.5% occupancy.

8×8: We have the same problem here as we did with the 4×4 thread block granularity, except not as severe. With 8×8 thread blocks, we get 64 threads per block. For an SM that can support 1024 threads per SM, we would need 1024/64 = 16 blocks, but since we are limited to 8 blocks maximum per SM, we can only execute 8×64 = 512 threads/SM. When split into warps of 32 threads each, we get 512/32 = 16 warps scheduled per SM from a possible total of 32 warps. This gives only 50% occupancy.

16×16: A 16×16 thread block gives 256 threads/block. With a maximum thread limit per SM of 1024, we get 1024/256 = 4 blocks/SM. This is within the 8 block limit so 4 blocks, each of 256 threads can be scheduled on one SM. With 4 blocks each with 256 threads, we get a total of 1024 threads. The threads are further split into warps of 32 threads each for a total of 32 warps. Since the device can support 32 warps/SM we have achieved 100% occupancy.

32×32: This is not an option, since a 32×32 thread block produces a single block with 1024 threads. As stated earlier, we are limited to 512 threads per block with compute capability 1.3, so our kernel wouldn’t even launch.

So the best choice for this problem domain would be to invoke a kernel with a block size of 16×16.
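The arithmetic behind the 16×16 answer can also be written out directly; this is only a back-of-the-envelope sketch with the compute capability 1.3 limits hard-coded:

const unsigned int threadsPerBlock = 16 * 16;                          // 256
const unsigned int maxThreadsPerSM = 1024;                             // compute capability 1.3
const unsigned int maxBlocksPerSM  = 8;                                // per-SM block limit
const unsigned int maxWarpsPerSM   = 32;

unsigned int blocksPerSM   = maxThreadsPerSM / threadsPerBlock;        // 4, within the 8-block limit
unsigned int residentWarps = ( blocksPerSM * threadsPerBlock ) / 32;   // 32 warps
float        occupancy     = (float)residentWarps / maxWarpsPerSM;     // 1.0, i.e. 100% occupancy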

Conclusion

In this article, I discussed the architecture of a CUDA enabled GPU, in particular the Fermi architecture. I also showed how a kernel function is scheduled on the GPU and how the warp scheduler executes instructions from different warps in order to minimize the amount of noticeable latency between kernel instructions.

 


Posted in Computer Languages, Computing Technology, CUDA, Game Development, GPU (CUDA), Graphics Cards, PARALLEL | Leave a Comment »

UPDATED! AMD Announces FX-9590 and FX-9370: Return of the GHz Race

Posted by Hemprasad Y. Badgujar on July 18, 2013


AMD Announces FX-9590 and FX-9370: Return of the GHz Race

Today at E3 AMD announced their latest CPUs, the FX-9590 and FX-9370. Similar to what we’re seeing with Richland vs. Trinity, AMD is incrementing the series number to 9000 while sticking with the existing Piledriver Vishera architecture. These chips are the result of tuning and binning on GlobalFoundries’ 32nm SOI process, though the latest jump from the existing FX-8350 is nonetheless quite impressive.

The FX-8350 had a base clock of 4.0GHz with a maximum Turbo Core clock of 4.2GHz; the FX-9590 in contrast has a maximum Turbo clock of 5GHz and the FX-9370 tops out at 4.7GHz. We’ve asked AMD for details on the base clocks for the new parts, but so far have not yet received a response; we’re also missing details on TDP, cache size, etc. but those will likely be the same as the FX-8350/8320 (at least for everything but TDP).

Posted in Computer Hardwares, Computing Technology, Graphics Cards, PARALLEL | 1 Comment »

Introduction to CUDA 5.0

Posted by Hemprasad Y. Badgujar on March 3, 2013


Introduction to CUDA 5.0

CUDA 5

In this article, I will introduce the reader to CUDA 5.0. I will briefly talk about the architecture of the Kepler GPU (Graphics Processing Unit) and I will show you how you can take advantage of the many CUDA (Compute Unified Device Architecture) cores in the GPU to create massively parallel programs.

 

List of Figures

Figure 1. Floating Point Operations per Second
Figure 2. Memory Bandwidth
Figure 3. Kepler GK110 Die
Figure 4. Kepler Architecture
Figure 5. Kepler Streaming Multiprocessor (SMX)
Figure 6. Warp Scheduler
Figure 7. Dynamic Parallelism
Figure 8. Dynamic Parallelism
Figure 9. Hyper-Q
Figure 10. Grid Management Unit
Figure 11. GPUDirect
Figure 12. Control Panel
Figure 13. System Manager
Figure 14. Device Manager
Figure 15. Command Prompt
Figure 16. Device Query
Figure 17. New Project Dialog
Figure 18. Cuda Execution Model
Figure 19. CUDA Grid Example
Figure 20. Warp Scheduler
Figure 21. Thread Divergence
Figure 22. CUDA Memory Model
Figure 23. Matrix Multiply – Global Memory
Figure 24. Tiles
Figure 25. Matrix Multiply – Tiles
Figure 26. CUDA Occupancy Calculator

List of Tables

Table 1. Threading Compute Capability
Table 2. Memory Compute Capability
Table 3. Properties of Memory Types

Introduction

Using the power of the NVIDIA GPU, CUDA allows the programmer to create highly parallel applications that can perform hundreds of times faster than an equivalent program that is written to run on the CPU alone. The NVIDIA CUDA Toolkit provides several APIs for integrating a CUDA program into your C and C++ applications.

CUDA supports a heterogeneous programming environment where parts of the application code are written for the CPU and other parts of the application are written to execute on the GPU. The application is compiled into a single executable that can run on both devices simultaneously.

In a CUDA-intensive application, the CPU is used to allocate CUDA memory buffers, execute CUDA kernels, and retrieve and analyze the result of running a kernel on the GPU. The GPU is used to process large amounts of data in parallel or to execute a simulation that can easily be split into a large grid, where each grid cell executes a part of the simulation in parallel.
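That host-side workflow typically looks something like the sketch below (error checking omitted; MyKernel, hostIn, hostOut and N are hypothetical placeholders, not part of this article's examples):

// Allocate device buffers and copy the input to the GPU.
float *devIn = 0, *devOut = 0;
cudaMalloc( (void**)&devIn,  N * sizeof(float) );
cudaMalloc( (void**)&devOut, N * sizeof(float) );
cudaMemcpy( devIn, hostIn, N * sizeof(float), cudaMemcpyHostToDevice );

// Launch the kernel with enough 256-thread blocks to cover N elements.
MyKernel<<< ( N + 255 ) / 256, 256 >>>( devOut, devIn, N );

// Copy the result back and release the device buffers.
cudaMemcpy( hostOut, devOut, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree( devIn );
cudaFree( devOut );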

The NVIDIA GPU consists of hundreds (even thousands) of CUDA cores that can work in parallel to operate on extremely large datasets in a very short time. For this reason, the NVIDIA GPU is much more suited to work in a highly parallel nature than the CPU.

The image below shows the computing power of the GPU and how it compares to the CPU. The vertical axis shows the theoretical GFLOP/s (Giga Floating Point Operations per Second). The horizontal axis shows the advances in technology over the years[1].

Figure 1. Floating Point Operations Per Second

As can be seen from the image, the latest GPU from NVIDIA (the GTX 680 at the time of this writing) can theoretically perform 3 trillion (10^12) floating point operations per second (or 3 teraFLOPS)[1].

The GPU is also capable of transferring large amounts of data through its memory bus. The image below shows the memory bandwidth in GB/s of the latest NVIDIA GPU compared to the latest desktop CPUs from Intel[1].

Figure 2. Memory Bandwidth

In this article, I will introduce the latest GPU architecture from NVIDIA: Kepler. I will also introduce the CUDA threading model and demonstrate how you can execute a CUDA kernel in a C++ application. I will also introduce the CUDA memory model and I will show how you can optimize your CUDA application by making use of shared memory.

Kepler Architecture

Kepler is the name given to the latest line of desktop GPUs from NVIDIA. It is currently NVIDIA’s flagship GPU replacing the Fermi architecture.

The Kepler GPU consists of 7.1 billion transistors[2], making it the fastest and most complex microprocessor ever built.

Figure 3. Kepler GK110 Die

Despite its huge transistor count, the Kepler GPU is much more power efficient than its predecessor, delivering up to 3x the performance per watt of the Fermi architecture[2].

The Kepler GPU was designed to be the highest performing GPU in the world. The Kepler GK110 consists of 15 SMX (streaming multiprocessor) units and six 64-bit memory controllers[2] as shown in the image below.

Figure 4. Kepler Architecture

If we zoom into a single SMX unit, we see that each SMX unit consists of 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Figure 5. Kepler Streaming Multiprocessor (SMX)

The 192 single-precision CUDA cores each contain a single-precision floating-point unit (FPU) as well as a 32-bit integer arithmetic logic unit (ALU).

Each SMX supports 64 KB of shared memory, and 48 KB of read-only data cache. The shared memory and the data cache are accessible to all threads executing on the same streaming multiprocessor. Access to these memory areas is highly optimized and should be favored over accessing memory in global DRAM.

The SMX will schedule 32 threads in a group called a warp. Using compute capability 3.5, the GK110 GPU can schedule 64 warps per SMX for a total of 2,048 threads that can be resident in a single SMX at a time (not all threads will be active at the same time as we will see in the section describing the threading model).

Each SMX has four warp schedulers and eight instruction dispatch units (two dispatch units per warp scheduler) allowing four warps to be issued and executed concurrently on the streaming multiprocessor[2].

Figure 6. Warp Scheduler

Dynamic Parallelism

The GK110 GPU supports a feature called Dynamic Parallelism. Dynamic Parallelism allows the GPU to create new work for itself by creating new kernels as they are needed without the intervention of the CPU.

Figure 7. Dynamic Parallelism

As can be seen from the image, on the left, the Fermi GPU requires the CPU to execute kernels on the GPU. On the right side of the image, the Kepler GPU is capable of launching kernels from within a kernel itself. No intervention from the CPU is required.

This allows the GPU kernel to be more adaptive to dynamic branching and recursive algorithms which has some impact on the way we can implement certain functions on the GPU (such as Ray Tracing, Path Tracing and other rasterization techniques).

Dynamic Parallelism also allows the programmer to better load-balance their GPU-based application. Threads can be dynamically launched based on the amount of work that needs to be performed in a particular region of the grid domain. In this case, the initial compute grid can be very coarse and the kernel can dynamically refine the grid size depending on the amount of work that needs to be performed.

Figure 8. Dynamic Parallelism

As can be seen from the image, the grid on the left is too coarse to produce an accurate simulation. The grid in the center is too fine, and many kernels are not performing any actual work. In the image on the right we see that, using dynamic parallelism, the grid can be dynamically refined to produce just the right balance between granularity and workload.
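In code, a device-side launch looks much like a host-side launch. The sketch below is purely illustrative (RefineCell, cells, error, and the 0.1f threshold are hypothetical); it requires a compute capability 3.5 device and compiling with relocatable device code (-rdc=true):

__global__ void RefineCell( float* cell )
{
    // Hypothetical child kernel that does additional work on one cell.
}

__global__ void ParentKernel( float* cells, float* error )
{
    unsigned int i = ( blockDim.x * blockIdx.x ) + threadIdx.x;

    if ( error[i] > 0.1f )
    {
        // Launch additional work for this cell directly from the device,
        // without returning control to the CPU.
        RefineCell<<< 1, 64 >>>( &cells[i] );
    }
}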

Hyper-Q

The Fermi architecture relied on a single hardware work queue to schedule work from multiple streams. This resulted in false intra-stream dependencies requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed[2].

The Kepler GK110 resolves this false intra-stream dependency with the introduction of the Hyper-Q feature. Hyper-Q increases the total number of hardware work queues to 32, compared to the single work queue of the Fermi architecture.

Figure 9. Hyper-Q

CUDA applications that utilize multiple streams will immediately benefit from the multiple hardware work queues offered by the Hyper-Q feature. These stream-intensive applications can see a potential increase in performance of up to 32x[2].
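To take advantage of this, work must actually be issued in separate streams. The following is a minimal sketch (KernelA, KernelB, dataA, dataB, and the grid/block launch configurations are hypothetical placeholders):

cudaStream_t streamA, streamB;
cudaStreamCreate( &streamA );
cudaStreamCreate( &streamB );

// Independent kernels issued into independent streams can be scheduled
// through separate hardware work queues on GK110.
KernelA<<< grid, block, 0, streamA >>>( dataA );
KernelB<<< grid, block, 0, streamB >>>( dataB );

cudaStreamSynchronize( streamA );
cudaStreamSynchronize( streamB );
cudaStreamDestroy( streamA );
cudaStreamDestroy( streamB );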

Grid Management Unit

In order to facilitate the Dynamic Parallelism feature introduced in the GK110 GPU, a new Grid Management Unit (GMU) needed to be designed. In the previous Fermi architecture, grids were passed to the CUDA Work Distributor (CWD) directly from the stream queue. Since it is now possible to launch more kernels directly from a running CUDA kernel, a bi-directional communication link is required from the SMX to the CWD via the GMU.

Figure 10. Grid Management Unit

NVIDIA GPUDirect

The Kepler GK110 supports the Remote Direct Memory Access (RDMA) feature in NVIDIA GPUDirect[2]. GPUDirect allows data to be transferred directly from one GPU to another via third-party devices such as InfiniBand (IB) adapters, Network Interface Cards (NICs), and solid-state drives (SSDs).

Figure 11. GPUDirect

Getting Started with CUDA

In this article, I will use Visual Studio 2010 to create a CUDA enabled application. The settings and configurations for Visual Studio 2008 will be similar and you should be able to follow along even if you have not yet upgraded to VS2010.

System Requirements

Before you can run a CUDA program, you must make sure that your system meets the minimum requirements.

  • CUDA-capable GPU
  • Microsoft Windows XP, Vista, 7, or 8 or Windows Server 2003 or 2008
  • NVIDIA CUDA Toolkit
  • Microsoft Visual Studio 2008 or 2010 or a corresponding version of Microsoft Visual C++ Express

Verify your GPU

To verify you have a CUDA enabled GPU first check the graphics device you have installed.

  1. Open the Control Panel from the Start Menu.
    Figure 12. Control Panel

  2. Double-Click the System applet to open the System Control Panel.
  3. In Windows XP, click on the Hardware tab then click the Device Manager button. In Windows 7 click the Device Manager link. 
    Figure 13. System Manager

  4. In the Device Manager window that appears, expand the Display Adapters node in the device tree.
    Figure 14. Device Manager

    If your device is listed at https://developer.nvidia.com/cuda-gpus then you have a CUDA-capable GPU.

Install CUDA

Download and install the latest NVIDIA CUDA Toolkit. The CUDA Toolkit is available at https://developer.nvidia.com/cuda-downloads.

At the time of this writing, the latest version of the CUDA toolkit is CUDA 5.0 Production Release.

The CUDA Toolkit contains the drivers and tools needed to create, build and run a CUDA application, as well as libraries, header files, CUDA samples source code, and other resources[3].

By default, the CUDA toolkit is installed to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v#.#, where #.# refers to the CUDA version you have installed. For the CUDA 5.0 toolkit, the complete path to the CUDA installation will be C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0.

The installation will include the following directories:

  • bin: This folder contains the CUDA compiler and runtime libraries (DLLs)
  • include: The C header files that are needed to compile your CUDA programs.
  • lib: The library files that are needed to link your CUDA programs.
  • doc: This directory contains the documentation for the CUDA Toolkit such as the CUDA C Programming Guide, the CUDA C Best Practices Guide and the documentation for the different CUDA libraries that are available in the Toolkit.

The CUDA Samples contain sample source code and projects for Visual Studio 2008 and Visual Studio 2010. On Windows XP, the samples can be found in C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\CUDA Samples\v#.# and for Windows Vista, Windows 7, and Windows Server 2008, the samples can be found at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.# where #.# is the installed CUDA version.

Verify the Installation

Before you start creating a CUDA application, it is important to verify that your CUDA installation is working correctly.

  1. Open a Command Prompt window by going to Start Menu > All Programs > Accessories > Command Prompt
    Figure 15. Command Prompt

  2. In the Command Prompt window type:
    nvcc -V

    You should see something similar to what is shown in the Command Prompt screenshot above. The output may differ slightly depending on the version of the CUDA Toolkit you installed but you should not get an error.

Run Compiled Sample

The CUDA Toolkit comes with both the source code and compiled executables for the Toolkit samples. On Windows XP the compiled samples can be found at C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\CUDA Samples\v#.#\bin\win32\Release\ and on Windows 7, Windows 8, Windows Server 2003, and Windows Server 2008 the compiled samples can be found at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.#\bin\win32\Release. On a 64-bit version of Windows, you can replace win32 with win64 to run the 64-bit version of the samples.

Try to run the deviceQuery sample in a Command Prompt window. You should see some output similar to the following image:

Figure 16. deviceQuery

Of course the output generated on your system will be different than this (unless you also have a GeForce GT 330M mobile GPU). The important thing is that your device(s) is (are) found and the device information is displayed without any errors.
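If you prefer to verify the installation from your own code, a minimal deviceQuery-style check with the runtime API might look like this (only a sketch, not the actual SDK sample):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount( &count );          // how many CUDA-capable devices are visible
    for ( int i = 0; i < count; ++i )
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, i );
        std::printf( "Device %d: %s (compute capability %d.%d)\n",
                     i, prop.name, prop.major, prop.minor );
    }
    return 0;
}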

Creating your First Project

For this article, I will create a CUDA application using Microsoft Visual Studio 2010. If you are still using Microsoft Visual Studio 2008 the steps will be very similar and you should still be able to follow along.

Open your Visual Studio IDE and create a new project.

As of CUDA Toolkit 5.0, Visual Studio project templates are available that can be used to quickly create a project that is ready for creating a CUDA enabled application. Prior to CUDA Toolkit 5.0, Visual Studio project templates were only available when you installed NVIDIA Nsight Visual Studio Edition.

In the New Project dialog box, select NVIDIA > CUDA from the Installed Templates pane. In the right pane, select the CUDA 5.0 Runtime template.

Figure 17. New Project Dialog

Give your project a meaningful name such as “CUDATemplate” or something similar.

Click OK to create a new project.

This will create a new Visual Studio C++ project with a single CUDA source file called kernel.cu.

You should be able to compile and run this sample already at this point to confirm it is working. You should get the following output:

{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55}

If you got any errors or something went wrong, then you should check that you do have a CUDA enabled GPU and that you installed the CUDA Toolkit prior to installing Visual Studio 2010. Follow the steps in the previous sections again and make sure you did everything correctly.

Using the Visual Studio project template for the CUDA 5.0 Runtime will automatically configure the build settings necessary to compile a CUDA enabled application. If you want to know how to add the configuration necessary to build CUDA source files to an existing C/C++ project, then you can refer to my previous article titled Introduction to CUDA that I wrote last year. That article focuses on CUDA 4.0 using Visual Studio 2008, but the steps are almost identical for CUDA 5.0 using Visual Studio 2010.

Threading Model

The CUDA threading model describes how a kernel is executed on the GPU.

CUDA Threads

Each kernel function is executed in a grid of threads. This grid is divided into blocks also known as thread blocks and each block is further divided into threads.

Figure 18. CUDA Execution Model

In the image above we see that this example grid is divided into nine thread blocks (3×3); each thread block consists of 9 threads (3×3), for a total of 81 threads for the kernel grid.

This image only shows a 2-dimensional grid, but if the graphics device supports compute capability 2.0 or higher, then the grid of thread blocks can actually be partitioned into 1, 2 or 3 dimensions; if the device only supports compute capability 1.x, then the grid can be partitioned into 1 or 2 dimensions (in that case, the 3rd dimension should always be set to 1).

The thread block is partitioned into individual threads and for all compute capabilities, threads in a block can be partitioned into 1, 2, or 3 dimensions. The maximum number of threads that can be assigned to a thread block is 512 for devices with compute capability 1.x and 1024 threads for devices that support compute capability 2.0 and higher.

Table 1. Threading Compute Capability
Technical Specifications                                1.0    1.1    1.2    1.3    2.x    3.0     3.5
Maximum dimensionality of a grid of thread blocks       2      2      2      2      3      3       3
Maximum x-dimension of a grid of thread blocks          65535  65535  65535  65535  65535  2^31-1  2^31-1
Maximum y- or z-dimension of a grid of thread blocks    65535 (all compute capabilities)
Maximum dimensionality of a thread block                3 (all compute capabilities)
Maximum x- or y-dimension of a block                    512    512    512    512    1024   1024    1024
Maximum z-dimension of a block                          64 (all compute capabilities)
Maximum number of threads per block                     512    512    512    512    1024   1024    1024
Warp size                                               32 (all compute capabilities)
Maximum number of resident blocks per multiprocessor    8      8      8      8      8      16      16
Maximum number of resident warps per multiprocessor     24     24     32     32     48     64      64
Maximum number of resident threads per multiprocessor   768    768    1024   1024   1536   2048    2048

The number of blocks within a grid can be determined within a kernel by using the built-in variable gridDim, and the number of threads within a block can be determined by using the built-in variable blockDim.

A thread block is uniquely identified in a kernel function by using the built-in variable blockIdx, and a thread within a block is uniquely identified in a kernel function by using the built-in variable threadIdx.

The built-in variables gridDim, blockDim, blockIdx, and threadIdx are each 3-component structs with members x, y, z.

With a 1-D kernel, the unique thread ID within a block is simply the x component of the threadIdx variable:

threadId = threadIdx.x

and the unique block ID within a grid is the x component of the blockIdx variable:

blockId = blockIdx.x

To determine the unique thread ID in a 2-D block, you would use the following formula:

threadId = ( threadIdx.y * blockDim.x ) + threadIdx.x

and to determine the unique block ID within a 2-D grid, you would use the following formula:

blockId = ( blockIdx.y * gridDim.x ) + blockIdx.x

I’ll leave it as an exercise for the reader to determine the formula to compute the unique thread ID and block ID in a 3D grid.

Matrix Addition Example

Let’s take a look at an example kernel that one might execute.

Let’s assume we want to implement a kernel function that adds two matrices and stores the result in a 3rd.

The general formula for matrix addition is:

C(i,j) = A(i,j) + B(i,j)

That is, the sum of matrix A and matrix B is computed component by component.

Let’s first write the host version of this method that we would execute on the CPU.

MatrixAdd.cpp
void MatrixAddHost( float* C, float* A, float* B, unsigned int matrixDim )
{
    for( unsigned int j = 0; j < matrixDim; ++j )
    {
        for ( unsigned int i = 0; i < matrixDim; ++i )
        {
            unsigned int index = ( j * matrixDim) + i;
            C[index] = A[index] + B[index];
        }
    }
}

This is a pretty standard method that loops through the rows and columns of a matrix, adds the components, and stores the result in a third matrix. Now let’s see how we might implement this as a kernel that executes on the GPU using CUDA.

First, we need to think of the problem domain. In this case, the domain is trivial: it is the components of a matrix. Since we are operating on 2-D arrays, it seems reasonable to split our domain into two dimensions; one for the rows, and another for the columns of the matrices.

We will assume that we are working on square matrices. This simplifies the problem, but mathematically matrix addition only requires that the two matrices have the same number of rows and columns; it does not require that the matrices be square.

Since we know that a kernel is limited to 512 threads/block with compute capability 1.x and 1024 threads/block with compute capability 2.x and 3.x, then we know we can split our job into square thread blocks each consisting of 16×16 threads (256 threads per block) with compute capability 1.x and 32×32 threads (1024 threads per block) with compute capability 2.x and 3.x.

If we limit the size of our matrix to no larger than 16×16, then we only need a single block to compute the matrix sum and our kernel execution configuration might look something like this:

main.cpp
dim3 gridDim( 1, 1, 1 );
dim3 blockDim( matrixDim, matrixDim, 1 );
MatrixAddDevice<<<gridDim, blockDim>>>( C, A, B, matrixDim );

In this simple case, the kernel grid consists of only a single block with matrixDim × matrixDim threads.

However, if we want to sum matrices with more than 512 components, then we must split our problem domain into smaller groups that can be processed in multiple blocks.

Let’s assume that we want to limit our blocks to execute in 16×16 (256) threads. We can determine the number of blocks that will be required to operate on the entire array by dividing the size of the matrix dimension by the block edge (16) and rounding up to the nearest whole number:

blocks = ceil( matrixDim / 16 )

And we can determine the number of threads per block by dividing the size of the matrix dimension by the number of blocks and rounding up to the nearest whole number:

threads = ceil( matrixDim / blocks )

So for example, for a 4×4 matrix, we would get

blocks = ceil( 4 / 16 ) = 1

and the number of threads is computed as:

threads = ceil( 4 / 1 ) = 4

resulting in a 1×1 grid of 4×4 thread blocks for a total of 16 threads.

For another example, a 512×512 matrix, we would get:

blocks = ceil( 512 / 16 ) = 32

and the number of threads is computed as:

threads = ceil( 512 / 32 ) = 16

resulting in a 32×32 grid of 16×16 thread blocks for a total of 262,144 threads.

The host code to setup the kernel granularity might look like this:

main.cpp
size_t blocks = ceilf( matrixDim / 16.0f );
dim3 gridDim( blocks, blocks, 1 );
size_t threads = ceilf( matrixDim / (float)blocks );
dim3 blockDim( threads, threads, 1 );
MatrixAddDevice<<< gridDim, blockDim >>>( C, A, B, matrixDim );

You may have noticed that if the size of the matrix does not fit nicely into equally divisible blocks, then we may get more threads than are needed to process the array. It is not possible to configure a grid of thread blocks where one block contains fewer threads than the others.

One way to solve this is to execute multiple kernels – one that handles all the equally divisible blocks, and a second kernel invocation that handles the partial block. The other solution is simply to ignore any threads that fall outside our problem domain, which is generally easier (and more efficient) than invoking multiple kernels (though this should be verified by profiling).

The Matrix Addition Kernel Function

On the device, the kernel function is executed by one thread for every element in the problem domain (the matrix elements). We can use the built-in variables gridDim, blockDim, blockIdx, and threadIdx to identify the current matrix element that the current kernel is operating on.

If we assume we have a 9×9 matrix and we split the problem domain into 3×3 blocks each consisting of 3×3 threads as shown in the CUDA Grid below, then we could compute the ith column and the jth row of the matrix with the following formulas:

i = ( blockIdx.x * blockDim.x ) + threadIdx.x
j = ( blockIdx.y * blockDim.y ) + threadIdx.y

So for thread (0,0) of block (1,1) of our 9×9 matrix, we would get:

i = ( 1 * 3 ) + 0 = 3

for the column and:

j = ( 1 * 3 ) + 0 = 3

for the row.

The index into the 1-D buffer that stores the matrix is then computed as:

index = ( j * 9 ) + i

and substituting gives:

index = ( 3 * 9 ) + 3 = 30

which is the correct element in the matrix. This solution assumes we are accessing the matrix in row-major order.

Figure 19. CUDA Grid Example

Let’s see how we might implement this in the kernel.

MatrixAdd.cu
__global__ void MatrixAddDevice( float* C, float* A, float* B, unsigned int matrixDim )
{
    unsigned int column = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    unsigned int row    = ( blockDim.y * blockIdx.y ) + threadIdx.y;

    unsigned int index = ( matrixDim * row ) + column;
    if ( index < matrixDim * matrixDim ) // prevent reading/writing array out-of-bounds.
    {
        C[index] = A[index] + B[index];
    }
}

The kernel function is defined using the __global__ declaration specifier. This specifier is used to identify a function that should execute on the device. Optionally you can also specify host functions with the __host__ declaration specifier within a CUDA source file but this is implied if no specifier is applied to the function declaration.

On lines 3 and 4 we compute the column and row of the matrix we are operating on using the formulas shown earlier.

On line 6, the 1-D index into the matrix array is computed based on the size of a single dimension of the square matrix.

We must be careful that we don’t try to read or write out of the bounds of the matrix. This might happen if the size of the matrix does not fit nicely into the size of the CUDA grid (in the case of matrices whose size is not evenly divisible by 16). To protect the read and write operations, on line 7 we must check that the computed index does not exceed the size of our array.

Thread Synchronization

CUDA provides a synchronization barrier for all threads in a block through the __syncthreads() method. A practical example of thread synchronization will be shown in a later article about optimizing a CUDA kernel, but for now it’s only important that you know this functionality exists.

Thread synchronization is only possible across threads in the same block, not across all threads running in the grid. By not allowing threads across blocks to be synchronized, CUDA enables blocks to be executed on the streaming multiprocessors (SMs) in any order. The queue of blocks can be distributed to any SM without having to wait for blocks on another SM to complete. This allows CUDA enabled applications to scale across platforms that have more SMs at their disposal, executing more blocks concurrently than platforms with fewer SMs.

Thread synchronization follows strict rules: all threads in a block must reach the synchronization point, or none of them must reach it.

Given the following code block:

sample.cu
if ( threadID % 2 == 0 )
{
    __syncthreads();
}
else
{
    __syncthreads();
}

This code will cause the threads in a block to wait indefinitely for each other because the two occurrences of __syncthreads are considered separate synchronization points, and all threads of the same block must hit the same synchronization point, or none of them must hit it.
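A common way to avoid this deadlock is to keep the divergent work inside the branch but move the barrier outside of it, so that every thread in the block reaches the same __syncthreads() call. A minimal sketch (not taken from the article) is shown below:

float value = 0.0f;
if ( threadIdx.x % 2 == 0 )
{
    value = 1.0f;   // even threads do their branch-specific work (no barrier in here)
}
else
{
    value = 2.0f;   // odd threads do theirs
}
__syncthreads();    // a single barrier that every thread in the block reaches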

Thread Assignment

When a kernel is invoked, the CUDA runtime will distribute the blocks across the SMs on the device. With compute capability 1.x and 2.x a maximum of 8 blocks will be assigned to each SM, and with compute capability 3.x a maximum of 16 blocks will be assigned to each SM, as long as there are enough resources (registers, shared memory, and threads) to execute all of the blocks. In the case where there are not enough resources on the SM, the CUDA runtime will automatically assign fewer blocks per SM until the resource usage is below the maximum per SM.

The total number of blocks that can be executed concurrently is dependent on the device. In the case of the Fermi architecture, a total of 16 SMs can each concurrently handle 8 blocks, for a total of 128 blocks executing concurrently on the device. Kepler devices can handle 16 thread blocks per SMX, for a total of 240 thread blocks that can execute concurrently on a single device.

Both the Fermi and Kepler architecture support thread blocks consisting of at most 1024 threads. The Fermi device can support a maximum of 48 warps per SM. The Kepler architecture increases the amount of resident warps per SMX to 64.

The Fermi device can support a maximum of 1,536 resident threads (32×48) per SM. Kepler supports 2,048 threads per SMX (32×64). With 15 SMX units, the Kepler GPU can have a total of 30,720 resident threads on the device. This does not mean that every clock tick the device executes 30,720 instructions simultaneously (there are only 2,880 CUDA Cores on the GK110 device). In order to understand how the blocks are actually executed on the device, we must look one step further to see how the threads of a block are actually scheduled on the SMs.

Thread Scheduling

When a block is assigned to an SMX, it is further divided into groups of 32 threads called a warp. Warp scheduling is different depending on the platform, but if we take a look at the Kepler architecture, we see that a single SMX consists of 192 CUDA cores (a CUDA core is also sometimes referred to as a streaming processor or SP for short).

Each SMX in the Kepler architecture features four warp schedulers allowing four warps to be issued and executed concurrently. Kepler’s quad-warp scheduler selects four warps and issues two independent instructions from each warp every cycle[2].

Figure 20. Warp Scheduler

You might be wondering why it would be useful to schedule 16 blocks of a maximum of 1024 threads if the SMX only has 192 CUDA cores. The answer is that each instruction of a kernel may require more than a few clock cycles to execute (for example, an instruction to read from global memory will require multiple clock cycles). Any instruction that requires multiple clock cycles to execute incurs latency. The latency of long-running instructions can be hidden by executing instructions from other warps while waiting for the result of the previous warp. This technique of filling the latency of expensive operations with work from other threads is often called latency hiding.

Thread Divergence

It is reasonable to imagine that your CUDA program contains flow-control statements like if-then-else, switch, while loops, or for loops. Whenever you introduce these flow-control statements in your code, you also introduce the possibility of thread divergence. It is important to be aware of the consequence of thread divergence and also to understand how you can minimize the negative impact of divergence.

Thread divergence occurs when some threads in a warp follow a different execution path than others. Let’s take the following code block as an example:

test.cu
__global__ void TestDivergence( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    float value = 0.0f;
    if ( threadIdx.x % 2 == 0 )
    {
        // Threads executing PathA are active while threads
        // executing PathB are inactive.
        value = PathA( src );
    }
    else
    {
        // Threads executing PathB are active while threads
        // executing PathA are inactive.
        value = PathB( src );
    }
    // Threads converge here again and execute in parallel.
    dst[index] = value;
}

Then our flow control and thread divergence would look something like this:

Figure 21. Thread Divergence

As you can see from this example, the even numbered threads in each block will execute PathA while the odd numbered threads in the block will execute PathB. This is pretty much the worst-case scenario for a simple divergence example.

Both PathA and PathB cannot be executed concurrently on all threads because their execution paths are different. Only the threads that execute the exact same execution path can run concurrently so the total running time of the warp is the sum of the execution time of both PathA and PathB.

In this example, the threads in the warp for which the condition is true are activated to execute PathA and all the other threads are deactivated. Then, in another pass, the threads for which the condition is false are activated to execute PathB and the other threads are deactivated. This means that resolving this condition requires 2 passes to be executed for a single warp.

The overhead of having the warp execute both PathA and PathB can be eliminated if the programmer takes careful consideration when writing the kernel. If possible, all threads of a block (since warps can’t span thread blocks) should execute the same execution path. This way you guarantee that all threads in a warp will execute the same execution path and there will be no thread divergence within a block.
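One way to do this (a sketch that reuses the placeholder PathA and PathB functions from the example above, not code from the article) is to branch on the warp index rather than the thread index, so that all 32 threads of a warp take the same path:

__global__ void TestDivergenceWarpAligned( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    float value = 0.0f;
    // Branch on the warp index instead of the thread index so that
    // every thread of a given warp follows the same execution path.
    if ( ( threadIdx.x / warpSize ) % 2 == 0 )
    {
        value = PathA( src );
    }
    else
    {
        value = PathB( src );
    }
    dst[index] = value;
}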

Memory Model

There are several different types of memory that your CUDA application has access to. For each different memory type there are tradeoffs that must be considered when designing the algorithm for your CUDA kernel.

Global memory has a very large address space, but the latency to access this memory type is very high. Shared memory has a very low access latency but its address space is small compared to global memory. In order to make proper decisions regarding where to place data and when, you must understand the differences between these memory types and how these decisions will affect the performance of your kernel.

In the next sections, I will describe the different memory types and show examples of using different memory to improve the performance of your kernel.

CUDA Memory Types

Every CUDA enabled GPU provides several different types of memory. These different types of memory each have different properties such as access latency, address space, scope, and lifetime.

The different types of memory are register, shared, local, global, and constant memory.

On devices with compute capability 1.x, there are 2 locations where memory can possibly reside: cache memory and device memory.

The cache memory is considered “on-chip” and accesses to the cache are very fast. Shared memory and cached constant memory are stored in cache memory on devices that support compute capability 1.x.

The device memory is considered “off-chip” and accesses to device memory are about ~100x slower than accessing cached memory. Global memory, local memory and (uncached) constant memory are stored in device memory.

On devices that support compute capability 2.x, there is an additional memory bank that is stored with each streaming multiprocessor. This is considered L1-cache and although the address space is relatively small, its access latency is very low.

Figure 22. CUDA Memory Model

In the following sections I will describe each type and when it is best to use that memory type.

Register

Scalar variables that are declared in the scope of a kernel function and are not decorated with any attribute are stored in register memory by default. Register memory access is very fast, but the number of registers that are available per block is limited.

Arrays that are declared in the kernel function are also stored in register memory but only if access to the array elements are performed using constant indexes (meaning the index that is being used to access an element in the array is not a variable and thus the index can be determined at compile-time). It is currently not possible to perform random access to register variables.

Register variables are private to the thread. Threads in the same block will get private versions of each register variable. Register variables only exist as long as the thread exists. Once the thread finishes execution, a register variable can no longer be accessed, so the variable must be initialized in each invocation of the kernel function. This might seem obvious because the scope of the variable is within the kernel function, but this is not true for all variables declared in the kernel function, as we will see with shared memory.

Variables declared in register memory can be both read and written inside the kernel. Reads and writes to register memory do not need to be synchronized.

Local

Any variable that can’t fit into the register space allowed for the kernel will spill over into local memory. Local memory has the same access latency as global memory (that is to say, slow). Accesses to local memory are cached only on GPUs with compute capability 2.x or higher[4].

Like registers, local memory is private to the thread. Each thread must initialize the contents of a variable stored in local memory before it should be used. You cannot rely on another thread (even in the same block) to initialize local memory because it is private to the thread.

Variables in local memory have the lifetime of the thread. Once the thread is finished executing, the local variable is no longer accessible.

There is no attribute you can use to force a variable into local memory, but the compiler will automatically place variable declarations in local memory under the following conditions:

  • Arrays that are accessed with run-time indexes. That is, the compiler can’t determine the indices at compile time.
  • Large structures or arrays that would consume too much register space.
  • Any variable declared that exceeds the number of registers for that kernel (this is called register-spilling).

The only way that you can determine if the compiler has put some function scope variables in local memory is by manual inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep option). Local variables will be declared using the .local mnemonic and loaded using the ld.local mnemonic and stored with the st.local mnemonic.

Variables in local memory can be both read and written within the kernel and access to local memory does not need to be synchronized.
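As a simple illustration, the following sketch (the names are illustrative, not from the article) declares an array that is indexed with a value known only at run time, which the compiler will typically place in local memory. Compiling the file with nvcc -ptx and searching the output for the .local, ld.local, and st.local mnemonics confirms where the array ended up:

__global__ void LocalMemoryExample( float* out, const int* indices, unsigned int n )
{
    // Indexed with a run-time value below, so this array is typically
    // placed in local memory rather than registers.
    float scratch[32];

    unsigned int tid = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    if ( tid >= n ) return;

    for ( unsigned int i = 0; i < 32; ++i )
    {
        scratch[i] = (float)i;
    }

    // The index comes from global memory and cannot be resolved at compile time.
    out[tid] = scratch[ indices[tid] & 31 ];
}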

Shared

Variables that are decorated with the __shared__ attribute are stored in shared memory. Accessing shared memory is very fast (~100 times faster than global memory) although each streaming multiprocessor has a limited amount of shared memory address space.

Shared memory must be declared within the scope of the kernel function but has a lifetime of the block (as opposed to register or local memory, which have a lifetime of the thread). When a block has finished executing, the shared memory that was defined in the kernel can no longer be accessed.

Shared memory can be both read from and written to within the kernel. Modification of shared memory must be synchronized unless you guarantee that each thread will only access memory that will not be read from or written to by other threads in the block. Block synchronization is achieved using the __syncthreads() barrier function inside the kernel function.

Since access to shared memory is faster than accessing global memory, it is more efficient to copy global memory to shared memory to be used within the kernel but only if the number of accesses to global memory can be reduced within the block (as we’ll see with the matrix multiply example that I will show later).
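For reference, the following sketch (illustrative names, assuming a one-dimensional block of at most 256 threads) shows a statically sized shared memory buffer together with the synchronization described above; the comment at the end shows the dynamically sized form, where the size in bytes is passed as the third kernel launch parameter:

__global__ void ReverseBlock( float* out, const float* in )
{
    // Statically sized shared memory: the size is known at compile time.
    __shared__ float tile[256];

    unsigned int globalIdx = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    tile[threadIdx.x] = in[globalIdx];

    __syncthreads(); // make every thread's write visible to the whole block

    // Each thread reads an element written by a different thread,
    // which is why the barrier above is required.
    out[globalIdx] = tile[ blockDim.x - 1 - threadIdx.x ];
}

// Dynamically sized shared memory is declared with "extern" and no size,
// and the size in bytes is supplied as the third launch parameter:
//
//     extern __shared__ float dynamicTile[];
//     ...
//     SomeKernel<<< gridDim, blockDim, bytes >>>( ... );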

Global

Variables that are decorated with the __device__ attribute and are declared in global scope (outside of the scope of the kernel function) are stored in global memory. The access latency to global memory is very high (~100 times slower than shared memory) but there is much more global memory than shared memory (up to 6GB but the actual size is different across graphics cards even of the same compute capability).

Unlike register, local, and shared memory, global memory can be read from and written to by the host process using the C function cudaMemcpy.

Global memory has a lifetime of the application and is accessible to all threads of all kernels. One must take care when reading from and writing to global memory because thread execution cannot be synchronized across different blocks. The only way to ensure access to global memory is synchronized is by invoking separate kernel invocations (splitting the problem into different kernels and synchronizing on the host between kernel invocations).

Global memory is declared on the host process using cudaMalloc and freed in the host process using cudaFree. Pointers to global memory can be passed to a kernel function as parameters to the kernel (as we will see in the example later).
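Putting these pieces together, a minimal host-side sketch (the function name is hypothetical) that allocates global memory, copies the input matrices to the device, launches the MatrixAddDevice kernel shown earlier, copies the result back, and frees the buffers might look like this:

#include <cuda_runtime.h>

void AddMatricesOnDevice( float* hC, const float* hA, const float* hB, unsigned int matrixDim )
{
    size_t bytes = matrixDim * matrixDim * sizeof(float);
    float *dA = 0, *dB = 0, *dC = 0;

    cudaMalloc( (void**)&dA, bytes );  // allocate global memory on the device
    cudaMalloc( (void**)&dB, bytes );
    cudaMalloc( (void**)&dC, bytes );

    cudaMemcpy( dA, hA, bytes, cudaMemcpyHostToDevice ); // host -> device
    cudaMemcpy( dB, hB, bytes, cudaMemcpyHostToDevice );

    dim3 blockDim( 16, 16, 1 );
    dim3 gridDim( ( matrixDim + 15 ) / 16, ( matrixDim + 15 ) / 16, 1 ); // round up
    MatrixAddDevice<<< gridDim, blockDim >>>( dC, dA, dB, matrixDim );

    cudaMemcpy( hC, dC, bytes, cudaMemcpyDeviceToHost ); // device -> host

    cudaFree( dA );
    cudaFree( dB );
    cudaFree( dC );
}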

Reads from global memory are cached only on devices that support compute capability 2.x or higher[4], but any write to global memory will invalidate the cache, thus eliminating the benefit of the cache. Access to global memory on devices that support compute capability 1.x is not cached.

It is a bit of an art-form to reduce the number of accesses to global memory from within a kernel by using blocks of shared memory because the access latency to shared memory is about 100 times faster than accessing global memory. Later, I will show an example of how we can reduce the global memory access using shared memory.

Constant

Variables that are decorated with the __constant__ attribute are declared in constant memory. Like global variables, constant variables must be declared in global scope (outside the scope of any kernel function). Constant variables share the same memory banks as global memory (device memory) but unlike global memory, there is only a limited amount of constant memory that can be declared (64KB on all compute capabilities).

Access latency to constant memory is considerably faster than global memory because constant memory is cached, but unlike global memory, constant memory cannot be written to from within the kernel. This allows constant memory caching to work because we are guaranteed that the values in constant memory will not be changed, and therefore will not become invalidated, during the execution of a kernel.

Constant memory can be written to by the host process using the cudaMemcpyToSymbol function and read from using the cudaMemcpyFromSymbol function. As far as I can tell, it is not possible to dynamically allocate storage for constant memory (the size of constant memory buffers must be statically declared and determined at compile time).

Like global memory, constant memory has a lifetime of the application. It can be accessed by all threads of all kernels and the value will not change across kernel invocations unless explicitly modified by the host process.
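A minimal sketch (with illustrative names) of declaring a constant memory buffer, reading it from a kernel, and filling it from the host with cudaMemcpyToSymbol could look like this:

__constant__ float kCoefficients[64]; // statically sized, declared in global scope

__global__ void ScaleByCoefficient( float* data, unsigned int n )
{
    unsigned int i = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    if ( i < n )
    {
        data[i] *= kCoefficients[i % 64]; // read-only access from the kernel
    }
}

void UploadCoefficients( const float* hostCoefficients )
{
    // Copy 64 floats from host memory into the constant memory symbol.
    cudaMemcpyToSymbol( kCoefficients, hostCoefficients, 64 * sizeof(float) );
}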

Properties of Memory

The amount of memory that is available to the CUDA application is (in most cases) specific to the compute capability of the device. For each compute capability, the size restriction of each type of memory (except global memory) is defined in the table below. The application programmer is encouraged to query the device properties in the application using the cudaGetDeviceProperties method.
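For example, a small sketch (the function name is illustrative) that queries some of the limits discussed here for device 0 could look like this:

#include <cstdio>
#include <cuda_runtime.h>

void PrintDeviceLimits()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );

    printf( "Device                  : %s\n",    prop.name );
    printf( "Compute capability      : %d.%d\n", prop.major, prop.minor );
    printf( "Registers per block     : %d\n",    prop.regsPerBlock );
    printf( "Shared memory per block : %zu bytes\n", prop.sharedMemPerBlock );
    printf( "Constant memory         : %zu bytes\n", prop.totalConstMem );
    printf( "Max threads per block   : %d\n",    prop.maxThreadsPerBlock );
}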

Table 2. Memory Compute Capability
Technical Specifications               | 1.0 – 1.3 | 2.x – 3.0 | 3.5
Number of 32-bit registers per thread  | 128       | 63        | 255
Maximum amount of shared memory per SM | 16 KB     | 48 KB     | 48 KB
Amount of local memory per thread      | 16 KB     | 512 KB    | 512 KB
Constant memory size                   | 64 KB     | 64 KB     | 64 KB

The following table summarizes the different memory types and the properties of those types.

Table 3. Properties of Memory Types
Memory   | Located | Cached            | Access                  | Scope       | Lifetime
Register | cache   | n/a               | Host: None, Kernel: R/W | thread      | thread
Local    | device  | 1.x: No, 2.x: Yes | Host: None, Kernel: R/W | thread      | thread
Shared   | cache   | n/a               | Host: None, Kernel: R/W | block       | block
Global   | device  | 1.x: No, 2.x: Yes | Host: R/W, Kernel: R/W  | application | application
Constant | device  | Yes               | Host: R/W, Kernel: R    | application | application

Pointers to Memory

You can use pointers to memory in a kernel but you must be aware that the pointer type does not determine where the memory is located.

For example, the following code declares a pointer to constant memory and a pointer to global memory. You should be aware that only the pointer variable is constant – not what it points to.

test.cu
__constant__ float* constPtr;
__device__ float* globalPtr;
__global__ void KernelFunction(void)
{
    // Assign the pointer to global memory to a pointer to constant memory.
    // This will not compile because the pointer is constant and you can't change
    // what a const-pointer points to in the kernel.
    constPtr = globalPtr;
    // This will compile because what the const pointer points to is not
    // necessarily const (if it is, you'll probably get a runtime error).
    *constPtr = *globalPtr;
}

Since you can’t dynamically allocate constant memory, this example would not be very useful anyway.

Be careful when using pointers like this. It is a best-practice rule to ensure that a declared pointer only points to one type of memory (so a single pointer declaration will only point to global memory and another pointer declaration will only point to shared memory).

Minimize Global Memory Access

Since access latency is much higher for global memory than it is for shared memory, it should be our objective to minimize accesses to global memory in favor of shared memory. This doesn’t mean that every access to data in global memory should first be copied into a variable in shared (or register) memory. Obviously we will not benefit from the low latency shared memory access if our algorithm only needs to make a single access to global memory. But it happens in some cases that multiple threads in the same block will all read from the same location in global memory. If this is the case, then we can speed-up our algorithm by first allowing each thread in a block to copy one part of the global memory into a shared memory buffer and then allowing all of the threads in a block to access all elements in that shared memory buffer.

To demonstrate this, I will show several flavors of the classic matrix multiply example. The first example I will show is the standard implementation of the matrix multiply using only global memory access. Then, I will show an optimized version of the algorithm that uses shared memory to reduce the number of accesses to global memory for threads of the same block.

Matrix Multiply using Global Memory

This version of the matrix multiply algorithm is the easiest to understand; however, it is also a very naive approach.

MatrixMultiply.cu
1   __global__ void MatrixMultiplyKernel_GlobalMem( float* C, const float* A, const float* B, unsigned int matrixDim )
2   {
3       // Compute the row index
4       unsigned int i = ( blockDim.y * blockIdx.y ) + threadIdx.y;
5       // Compute the column index
6       unsigned int j = ( blockDim.x * blockIdx.x ) + threadIdx.x;
7
8       unsigned int index = ( i * matrixDim ) + j;
9       float sum = 0.0f;
10      for ( unsigned int k = 0; k < matrixDim; ++k )
11      {
12          sum += A[i * matrixDim + k] * B[k * matrixDim + j];
13      }
14      C[index] = sum;
15  }

The parameters A, B, and C all point to buffers of global memory.

The first step is to figure out which row (i) and which column (j) we are operating on for this kernel.

On line 10, we loop through all of the elements of row i of matrix A and the column j of matrix B and compute the summed product of corresponding entries (the dot product of row i and column j). A visual aid of this algorithm is shown below.

Figure 23. Matrix Multiply – Global Memory

If we analyze this algorithm, we may notice that the same row elements of matrix A are being accessed for every resulting row element of matrix C and all the column elements of matrix B are being accessed for every resulting column element of matrix C. If we say that the resulting matrix C is N x M elements, then each element of matrix A is being accessed M times and each element of matrix B is being accessed N times. That seems pretty wasteful to me.

Matrix Multiply using Shared Memory

What if we could reduce the number of times the elements of matrix A and B are accessed to just 1? Well, depending on the size of our matrix, we could just store the contents of matrix A and matrix B into shared memory buffers then just compute the resulting matrix C from those buffers instead. This might work with small matrices (remember that shared memory is local to a single block and with compute capability 1.3, we are limited to matrices of about 20 x 20 because we are limited to 512 threads that can be assigned to a single block).

But what if we had larger matrices to multiply? If we can find a way to split the problem into “phases” then we could simply load each “phase” into shared memory, process that “phase”, then load the next “phase” and process that one until we have exhausted the entire domain.

This technique of splitting our problem domain into phases is called “tiling” named because of the way we can visualize the technique as equal sized tiles that represent our problem domain.

Figure 24. Tiles

For this particular problem, the best partitioning of the problem domain is actually the same as partitioning of the grid of threads that are used to compute the result.

If we split our grid into blocks of 16 x 16 threads (which I showed previously in the section about CUDA thread execution to be a good granularity for this problem) then we can create two buffers in shared memory that are the same size as a single thread block in our kernel grid, one that holds a “tile” of matrix A, and other to store a “tile” of matrix B.

Let’s see how this might look:

Figure 25. Matrix Multiply – Tiles

So the idea is simple: each thread block defines a pair of shared memory buffers that are used to “cache” a “tile” of data from matrix A and matrix B. Since the “tile” is the same size as the thread block, we can just let each thread in the thread block load a single element from matrix A into one of the shared memory buffers and a single element from matrix B into the other. Using this technique, we can reduce the number of global memory accesses to matrixDim / BLOCK_SIZE per thread (where BLOCK_SIZE is the size of the thread block and shared memory buffer in a single dimension).

But will this work? We only have access to 16 KB (16,384 bytes) of shared memory per streaming multiprocessor for devices of compute capability 1.x. If our BLOCK_SIZE is 16 then we need 16×16 = 256 floating point values (4 bytes each) per shared memory buffer. So the size in bytes of each shared memory buffer is:

256 * 4 = 1,024 bytes

And we need 2 buffers, so we will need 2,048 bytes of shared memory per block. If you remember from the previous article about the CUDA thread execution model, thread blocks of size 16 x 16 will allow 4 resident blocks to be scheduled per streaming multiprocessor. So 4 blocks each requiring 2,048 bytes gives a total requirement of 8,192 bytes of shared memory, which is 50% of the available shared memory per streaming multiprocessor. So this tiling strategy will work.
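This arithmetic can also be checked directly in code; the following host-side sketch (my own addition, not from the article) verifies at compile time that four resident blocks fit within the 16 KB of shared memory:

constexpr size_t BLOCK_SIZE     = 16;
constexpr size_t bytesPerBuffer = BLOCK_SIZE * BLOCK_SIZE * sizeof(float); // 16 x 16 x 4 = 1,024 bytes
constexpr size_t bytesPerBlock  = 2 * bytesPerBuffer;                      // sA + sB = 2,048 bytes

static_assert( 4 * bytesPerBlock <= 16 * 1024,
               "4 resident blocks must fit in the 16 KB of shared memory per SM" );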

So let’s see how we might implement this in the kernel.

MatrixMultiply.cu
1   #define BLOCK_SIZE 16
2
3   __global__ void MatrixMultiplyKernel_SharedMem( float* C, const float* A, const float* B, unsigned int matrixDim )
4   {
5       unsigned int tx = threadIdx.x;
6       unsigned int ty = threadIdx.y;
7       unsigned int bx = blockIdx.x;
8       unsigned int by = blockIdx.y;
9
10      // Allocate share memory to store the matrix data in tiles
11      __shared__ float sA[BLOCK_SIZE][BLOCK_SIZE];
12      __shared__ float sB[BLOCK_SIZE][BLOCK_SIZE];
13
14      // Compute the column index
15      unsigned int j = ( blockDim.x * bx ) + tx;
16
17      // Compute the row index
18      unsigned int i = ( blockDim.y * by ) + ty;
19
20      unsigned int index = ( i * matrixDim ) + j;
21      float sum = 0.0f;
22
23      // Loop through the tiles of the input matrices
24      // in separate phases of size BLOCK_SIZE
25      for( unsigned int phase = 0; phase < matrixDim/BLOCK_SIZE; ++phase )
26      {
27          // Allow each thread in the block to populate the shared memory
28          sA[ty][tx] = A[i * matrixDim + (phase * BLOCK_SIZE + tx)];
29          sB[ty][tx] = B[(phase * BLOCK_SIZE + ty) * matrixDim + j];
30          __syncthreads();
31
32          for( unsigned int k = 0; k < BLOCK_SIZE; ++k )
33          {
34              sum += sA[ty][k] * sB[k][tx];
35          }
36          __syncthreads();
37      }
38      C[index] = sum;
39  }

On lines 5-8, we just store some “shorthand” versions of the thread and block indexes in private thread variables (these are stored in registers).

On lines 11 and 12, the two shared memory buffers are declared to store enough values that each thread in the thread block can store a single entry in the arrays.

On line 15, the column index is computed and stored in the register variable j, and on line 18, the row index is computed and stored in the register variable i.

On lines 20 and 21, the 1-D index into the result matrix C is computed and the running sum of the products is initialized in the float variable sum.

On line 25, we will loop over the “tiles” (called phases here) of matrix A and matrix B. You should note that this algorithm assumes the size of the matrix is evenly divisible by the size of the thread block.

On lines 28 and 29 is where the magic happens. Since the shared memory is accessible to every thread in the block, we can let every thread in the block copy 1 element from matrix A and one element from matrix B into the shared memory blocks.

Before we can access the data in the shared memory blocks, we must ensure that all threads in the entire block have had a chance to write their data. To do that we need to synchronize the execution of all the threads in the block by calling the __syncthreads() method.

Then the for loop on line 32 will loop through the elements of shared memory and sum the products.

Before we leave this loop and start filling the next “tile” into shared memory, we must ensure that all threads are finished with the shared memory buffers. To do that, we must execute __syncthreads() again on line 36.

This will repeat until all phases (or tiles) of the matrix have been processed.

Once all phases are complete, then the value stored in sum will contain the final result and it is written to the destination matrix C.

Running the global memory version of the matrix multiply on my laptop with a 512 x 512 matrix runs in about 45 milliseconds. Running the shared memory version on the same matrix completes in about 15 milliseconds (including copying memory from host to device and copying the result back to host memory). That is a 3x speed-up!
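For anyone who wants to reproduce a measurement like this, a minimal host-side timing sketch using CUDA events (assuming the kernel and the device pointers dA, dB, and dC from this article are in scope) could look like this:

#include <cstdio>
#include <cuda_runtime.h>

void TimeSharedMemMultiply( float* dC, const float* dA, const float* dB, unsigned int matrixDim )
{
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    dim3 blockDim( 16, 16, 1 );
    dim3 gridDim( matrixDim / 16, matrixDim / 16, 1 ); // assumes matrixDim is divisible by 16

    cudaEventRecord( start );
    MatrixMultiplyKernel_SharedMem<<< gridDim, blockDim >>>( dC, dA, dB, matrixDim );
    cudaEventRecord( stop );
    cudaEventSynchronize( stop );

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "Shared memory matrix multiply: %f ms\n", ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
}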

Resources as a Limiting Constraint

It is entirely possible to allocate more shared memory per block than 2,048 bytes, but the block scheduler will reduce the number of blocks scheduled on a streaming multiprocessor until the shared memory requirements are met. If you want to allocate all 16 KB of shared memory in a single block, then only a single block will be resident in the streaming multiprocessor at any given moment which will reduce the occupancy of the streaming multiprocessor to 25% (for a 16 x 16 thread block on compute capability 1.x).

This reduced thread occupancy is not ideal, but it is conceivable to imagine that a single block might have this requirement. In most cases the GPU will still out-perform the CPU if the benefit of using the low-latency memory is fully realized.

This is also true for the number of registers that can be allocated per block. If each thread of a kernel requires 32 32-bit registers and the thread block consists of 16 x 16 threads, then the maximum number of blocks that can be active in a streaming multiprocessor on a device with compute capability 1.3 is 2, because the maximum number of 32-bit registers that can be used at any moment in time is 16,384.

So the number of 32-bit registers used per block is 256 × 32 = 8,192, and the streaming multiprocessor can accommodate a maximum of 16,384 / 8,192 = 2 blocks.

CUDA GPU Occupancy Calculator

Since version 4.1, the CUDA Toolkit comes with a tool called the CUDA GPU Occupancy Calculator. This tool is a Microsoft Excel file that can be used to compute the maximum thread occupancy of the streaming multiprocessor given a set of limiting constraints (threads per block, registers per thread, and shared memory (bytes) per block). This tool is provided in the following folder:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.X\tools

Figure 26. CUDA Occupancy Calculator

The CUDA Occupancy Calculator allows you to compute the best thread granularity for your thread blocks given a specific compute capability and resource constraints.

You can refer to the second worksheet titled “Help” to learn how to use the CUDA GPU Occupancy Calculator.

Exercises

Q1. Would the MatrixAddDevice kernel function shown in this article benefit from the use of shared memory? Explain your answer.

A1. No, it would not benefit from the use of shared memory because each matrix element is only accessed once. You would still need to access each matrix component to store it in shared memory, only to require an access from shared memory to read it again. In this case, storing the data in shared memory will only increase the time to execute the kernel because more load/store operations will need to be performed.

Q2. In almost all of the examples shown here, I decided to use a 16×16 thread granularity for the thread blocks. Can you explain why this is a good choice for thread granularity on devices of compute capability (you can assume that register use and shared memory allocation are within the limits in each case):

  1. 1.3?
  2. 2.0?
  3. 3.0?

A2. To answer this question, let’s take a look at each individual compute capability.

a. For Compute Capability 1.3 threads are split into groups of 32 threads called warps. The maximum number of warps/SM is 32. If we create a 16×16 thread block, then we have a total of 256 threads/block. Each block will be split into 8 warps to be scheduled on the SM. Since we know that the maximum number of warps/SM for devices with compute capability 1.3 is 32, then 4 thread blocks will be scheduled on each SM. Each SM can support up to 8 resident blocks per SM and 4 is still within our limit. Also with a maximum resident thread limit of 1024 threads and we exactly meet this requirement (4×256) so we also achieve 100% thread occupancy on the SM! So yes, a 16×16 thread block is a good choice for devices with compute capability 1.3.

b. For devices with compute capability 2.0 threads are also split into groups of 32 threads called warps. In this case, the maximum number of warps/SM is 48. Again, we have 256 threads per block which are split into 8 warps to be scheduled on the SM then 6 thread blocks will be scheduled per SM (48/8). 6 blocks is within the 8 block limit, so we haven’t exceeded the block limit. And with a maximum resident thread limit of 1536 threads, we exactly meet this requirement (6×256) so we also achieve a 100% thread occupancy on the SM! So yes, a 16×16 thread block is also a good choice for devices with compute capability 2.0.

c. For devices with compute capability 3.0 the threads are also split into groups of 32 threads called warps. So again, each block will be split into 8 warps. The maximum number of warps that can be active on an SM is 64. This allows for 8 thread blocks to be scheduled per SM. This is within the limit of 16 blocks/SM and again exactly matches the maximum of 2,048 threads (8×256) that can be scheduled for each SM, so we also achieve 100% thread occupancy. So yes, a 16×16 thread block is also a good choice for devices with compute capability 3.0 (and consequently this is also true for devices of compute capability 3.5).

Q3. Assuming we have a block of 256 threads each, what is the maximum amount of shared memory we can use per block and still maintain 100% thread occupancy for devices of compute capability (assume the register count is not a limiting resource):

  1. 1.3?
  2. 2.0?
  3. 3.0?

a. In the previous exercise we already established that with thread blocks of 256 threads, we will have 4 resident blocks per SM. Since devices of compute capability 1.3 have a maximum of 16 KB (16,384 bytes) of shared memory, each block can use a maximum of 4,096 bytes (16,384/4) of shared memory while still maintaining 100% thread occupancy.

b. In the previous exercise we saw that we could schedule 6 blocks of 256 threads. Devices of compute capability 2.0 have a maximum of 48 KB (49,152 bytes) of shared memory per SM. This means that we can allocate a maximum of 8,192 bytes (49,152/6) of shared memory while still maintaining 100% thread occupancy.

c. In the previous exercise we saw that we could schedule 8 blocks of 256 threads to get 100% thread occupancy. Devices with compute capability 3.0 also have a maximum of 48 KB (49,152 bytes) of shared memory per SM. In this case, we can only allocate 6,144 bytes (49,152/8) of shared memory per block while still maintaining 100% thread occupancy.

Q4. In the case (c) above, what would happen if we created thread blocks of 1024 threads? Would we still have 100% thread occupancy? How much shared memory could we allocate per thread block and maintain 100% thread occupancy? Explain your answer.

Q5. Answer question (3) and (4) again but this time compute the number of registers you have available per thread while still maintaining 100% thread occupancy. In this case, you can assume that shared memory is not a limiting resource.

Hint: To answer Q5 correctly, you must also take the register allocation granularity and unit size into consideration. For compute capability 1.3, the register allocation granularity is at the block level and the register allocation unit size is 512. For compute capability 2.x register allocation granularity is at the warp level and the register allocation unit size is 64. For compute capability 3.x, the register allocation granularity is at the warp level and the register allocation unit size is 256.

References

1. NVIDIA Corporation (2012, October). CUDA C Programming Guide. (PG-02829-001_v5.0). USA. Available from: http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf. Accessed: October 2012.

2. NVIDIA Corporation (2012, October). NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110. (V1.0). USA. Available from: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. Accessed: October 2012.

3. NVIDIA Corporation (2012, October). NVIDIA CUDA Getting Started Guide For Microsoft Windows. (DU-05349-001_v5.0). USA. Available from: http://developer.download.nvidia.com/compute/cuda/5_0/rel/docs/CUDA_Getting_Started_Guide_For_Microsoft_Windows.pdf. Accessed: October 2012.

4. NVIDIA Corporation (2012, October). CUDA C Best Practices Guide. (DG-05603-001_v5.0). USA. Available from: http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf. Accessed: October 2012.

5. Kirk, David B. and Hwu, Wen-mei W. (2010). Programming Massively Parallel Processors. 1st. ed. Burlington, MA 01803, USA: Morgan Kaufmann Publishers.

Posted in Artificial Intelligence, C, Computer Hardwares, Computer Languages, CUDA, Game Development, GPU (CUDA), GPU Accelareted, Graphics Cards, Image Processing, Neural Network, PARALLEL, Research Menu

Octree Textures on the GPU

Posted by Hemprasad Y. Badgujar on February 24, 2013


Octree Textures on the GPU

Sylvain Lefebvre
GRAVIR/IMAG – INRIA

Samuel Hornus
GRAVIR/IMAG – INRIA

Fabrice Neyret
GRAVIR/IMAG – INRIA

Texture mapping is a very effective and efficient technique for enriching the appearance of polygonal models with details. Textures store not only color information, but also normals for bump mapping and various shading attributes to create appealing surface effects. However, texture mapping usually requires parameterizing a mesh by associating a 2D texture coordinate with every mesh vertex. Distortions and seams are often introduced by this difficult process, especially on complex meshes.

The 2D parameterization can be avoided by defining the texture inside a volume enclosing the object. Debry et al. (2002) and Benson and Davis (2002) have shown how 3D hierarchical data structures, named octree textures, can be used to efficiently store color information along a mesh surface without texture coordinates. This approach has two advantages. First, color is stored only where the surface intersects the volume, thus reducing memory requirements. Figures 37-1 and 37-2 illustrate this idea. Second, the surface is regularly sampled, and the resulting texture does not suffer from any distortions. In addition to mesh painting, any application that requires storing information on a complex surface can benefit from this approach.

Figure 37-1 An Octree Texture Surrounding a 3D Model

Figure 37-2 Unparameterized Mesh Textures with an Octree Texture

This chapter details how to implement octree textures on today’s GPUs. The octree is directly stored in texture memory and accessed from a fragment program. We discuss the trade-offs between performance, storage efficiency, and rendering quality. After explaining our implementation in Section 37.1, we demonstrate it on two different interactive applications:

  • A surface-painting application (Section 37.2). In particular, we discuss the different possibilities for filtering the resulting texture (Section 37.2.3). We also show how a texture defined in an octree can be converted into a standard texture, possibly at runtime (Section 37.2.4).
  • A nonphysical simulation of liquid flowing along a surface (Section 37.3). The simulation runs entirely on the GPU.

37.1 A GPU-Accelerated Hierarchical Structure: The N³-Tree

37.1.1 Definition

An octree is a regular hierarchical data structure. The first node of the tree, the root, is a cube. Each node has either eight children or no children. The eight children form a 2x2x2 regular subdivision of the parent node. A node with children is called an internal node. A node without children is called a leaf. Figure 37-3 shows an octree surrounding a 3D model where the nodes that have the bunny’s surface inside them have been refined and empty nodes have been left as leaves.

Figure 37-3 An Octree Surrounding a 3D Model

In an octree, the resolution in each dimension increases by two at each subdivision level. Thus, to reach a resolution of 256x256x256, eight levels are required (2⁸ = 256). Depending on the application, one might prefer to divide each edge by an arbitrary number N rather than 2. We therefore define a more generic structure called an N³-tree. In an N³-tree, each node has N³ children. The octree is an N³-tree with N = 2. A larger value of N reduces the tree depth required to reach a given resolution, but it tends to waste memory because the surface is less closely matched by the tree.

37.1.2 Implementation

To implement a hierarchical tree on a GPU, we need to define how to store the structure in texture memory and how to access the structure from a fragment program.

A simple approach to implement an octree on a CPU is to use pointers to link the tree nodes together. Each internal node contains an array of pointers to its children. A child can be another internal node or a leaf. A leaf contains only a data field.
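In C++, such a pointer-based node might be sketched as follows (the type and field names are illustrative, not from the chapter):

struct OctreeNode
{
    bool        isLeaf;
    float       color[3];    // data field, valid only for leaves
    OctreeNode* children[8]; // valid only for internal nodes (2x2x2 subdivision)
};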

Our implementation on the GPU follows a similar approach. Pointers simply become indices within a texture. They are encoded as RGB values. The content of the leaves is directly stored as an RGB value within the parent node’s array of pointers. We use the alpha channel to distinguish between a pointer to a child and the content of a leaf. Our approach relies on dependent texture lookups (or texture indirections). This requires the hardware to support an arbitrary number of dependent texture lookups, which is the case for GeForce FX and GeForce 6 Series GPUs.

The following sections detail our GPU implementation of the N³-tree. For clarity, the figures illustrate the 2D equivalent of an octree (a quadtree).

Storage

We store the tree in an 8-bit RGBA 3D texture called the indirection pool. Each “pixel” of the indirection pool is called a cell.

The indirection pool is subdivided into indirection grids. An indirection grid is a cube of NxNxN cells (a 2x2x2 grid for an octree). Each node of the tree is represented by an indirection grid. It corresponds to the array of pointers in the CPU implementation described earlier.

A cell of an indirection grid can be empty or can contain one of the following:

  • Data, if the corresponding child is a leaf
  • The index of an indirection grid, if the corresponding child is another internal node

Figure 37-4 illustrates our tree storage representation.

Figure 37-4 Storage in Texture Memory (2D Case)

We note S = S_u x S_v x S_w as the number of indirection grids stored in the indirection pool and R = (N x S_u) x (N x S_v) x (N x S_w) as the resolution in cells of the indirection pool.

Data values and indices of children are both stored as RGB triples. The alpha channel is used as a flag to determine the cell content (alpha = 1 indicates data; alpha = 0.5 indicates index; alpha = 0 indicates empty cell). The root of the tree is always stored at (0, 0, 0) within the indirection pool.

Accessing the Structure: Tree Lookup

Once the tree is stored in texture memory, we need to access it from a fragment program. As with standard 3D textures, the tree defines a texture within the unit cube. We want to retrieve the value stored in the tree at a point M ∈ [0, 1]³. The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached.

Let I_D be the index of the indirection grid of the node visited at depth D. The tree lookup is initialized with I_0 = (0, 0, 0), which corresponds to the tree root. When we are at depth D, we know the index I_D of the current node’s indirection grid. We now explain how we retrieve I_(D+1) from I_D.

The lookup point M is inside the node visited at depth D. To decide what to do next, we need to read from the indirection grid I_D the value stored at the location corresponding to M. To do so, we need to compute the coordinates of M within the node.

At depth D, a complete tree produces a regular grid of resolution N^D x N^D x N^D within the unit cube. We call this grid the depth-D grid. Each node of the tree at depth D corresponds to a cell of this grid. In particular, M is within the cell corresponding to the node visited at depth D. The coordinates of M within this cell are given by frac(M x N^D). We use these coordinates to read the value from the indirection grid I_D. The lookup coordinates within the indirection pool are thus computed as:

P = ( I_D + frac( M x N^D ) ) / S

We then retrieve the RGBA value stored at P in the indirection pool. Depending on the alpha value, either we will return the RGB color if the child is a leaf, or we will interpret the RGB values as the index of the child’s indirection grid (I D+1) and continue to the next tree depth. Figure 37-5 summarizes this entire process for the 2D case (quadtree).

Figure 37-5 Example of a Tree Lookup

The lookup ends when a leaf is reached. In practice, our fragment program also stops after a fixed number of texture lookups: on most hardware, it is only possible to implement loop statements with a fixed number of iterations (however, early exit is possible on GeForce 6 Series GPUs). The application is in charge of limiting the tree depth with respect to the maximum number of texture lookups done within the fragment program. The complete tree lookup code is shown in Listing 37-1.

Example 37-1. The Tree Lookup Cg Code

float4 tree_lookup(uniform sampler3D IndirPool, // Indirection Pool

  uniform float3 invS, // 1 / S

  uniform float N,

  float3 M) // Lookup coordinates

{

  float4 I = float4(0.0, 0.0, 0.0, 0.0);

  float3 MND = M;

  for (float i = 0; i < TREE_MAX_DEPTH; i += 1.0) { // fixed # of iterations (TREE_MAX_DEPTH: assumed name; the original loop bound was lost in extraction)

    float3 P;

    // compute lookup coords. within current node

    P = (MND + floor(0.5 + I.xyz * 255.0)) * invS;

    // access indirection pool

    if (I.w < 0.9)                   // already in a leaf?

      I =(float4)tex3D(IndirPool,P);// no, continue to next depth

 #ifdef DYN_BRANCHING // early exit if hardware supports dynamic branching

    if (I.w > 0.9)    // a leaf has been reached

      break;

#endif

    if (I.w < 0.1) // empty cell

      discard;

    // compute pos within next depth grid

    MND = MND * N;

  }

  return (I);

}

Further Optimizations

In our tree lookup algorithm, as we explained earlier, the computation of P requires a frac instruction. In our implementation, however, as shown in Listing 37-1, we actually avoid computing the frac by relying on the cyclic behavior of the texture units (repeat mode). We leave the detailed explanations as an appendix, located on the book’s CD.

We compute P as:

P = ( M x N^D + D_D ) / S

where D_D is an integer within the range [0, S[.

We store D_D instead of directly storing the I_D values. Please refer to the appendix on the CD for the code to compute D_D.

Encoding Indices

The indirection pool is an 8-bit 3D RGBA texture. This means that we can encode only 256 different values per channel. This gives us an addressing space of 24 bits (3 indices of 8 bits), which makes it possible to encode octrees large enough for most applications.

Within a fragment program, a texture lookup into an 8-bit texture returns a value mapped between [0,1]. However, we need to encode integers. Using a floating-point texture to do so would require more memory and would reduce performance. Instead, we map values between [0,1] with a fixed precision of 1/255 and simply multiply the floating-point value by 255 to obtain an integer. Note that on hardware without fixed-precision registers, we need to compute floor(0.5 + 255 * v) to avoid rounding errors.

37.2 Application 1: Painting on Meshes

In this section we use the GPU-accelerated octree structure presented in the previous section to create a surface-painting application. Thanks to the octree, the mesh does not need to be parameterized. This is especially useful with complex meshes such as trees, hairy monsters, or characters.

The user will be able to paint on the mesh using a 3D brush, similar to the brush used in 2D painting applications. In this example, the painting resolution is homogeneous along the surface, although multiresolution painting would be an easy extension if desired.

37.2.1 Creating the Octree

We start by computing the bounding box of the object to be painted. The object is then rescaled such that its largest dimension is mapped between [0,1]. The same scaling is applied to the three dimensions because we want the painting resolution to be the same in every dimension. After this process, the mesh fits entirely within the unit box.

The user specifies the desired resolution of the painting. This determines the depth of the leaves of the octree that contain colors. For instance, if the user selects a resolution of 512³, the leaves containing colors will be at depth 9.

The tree is created by subdividing the nodes intersecting the surface until all the leaves either are empty or are at the selected depth (color leaves). To check whether a tree node intersects the geometry, we rely on the box defining the boundary of the node. This process is depicted in Figure 37-6. We use the algorithm shown in Listing 37-2.

Figure 37-6 Building an Octree Around a Mesh Surface

This algorithm uses our GPU octree texture API. The links between nodes (indices in the indirection grids) are set up by the createChild() call. The values stored in tree leaves are set up by calling setChildAsEmpty() and setChildColor(), which also set the appropriate alpha value.

Example 37-2. Recursive Algorithm for Octree Creation

void createNode(depth, polygons, box)

  for all children (i, j, k) within (N, N, N)

     if (depth + 1 == painting depth)       // painting depth reached?

        setChildColor(i, j, k, white)       // child is at depth+1

     else

        childbox = computeSubBox(i, j, k, box)

        if (childbox intersect polygons)

           child = createChild(i, j, k)

           // recurse

           createNode(depth + 1, polygons, childbox)

        else

          setChildAsEmpty(i, j, k)

37.2.2 Painting

In our application, the painting tool is drawn as a small sphere moving along the surface of the mesh. This sphere is defined by a painting center P_center and a painting radius P_radius. The behavior of the brush is similar to that of brushes in 2D painting tools.

When the user paints, the leaf nodes intersecting the painting tool are updated. The new color is computed as a weighted sum of the previous color and the painting color. The weight is such that the painting opacity decreases as the distance from P_center increases.
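A CPU-side sketch of this per-leaf update could look like the function below (the exact falloff function is an assumption on my part; the chapter only states that the opacity decreases with the distance to the center):

void PaintLeaf( float leafColor[3], const float brushColor[3],
                float distanceToCenter, float paintRadius, float brushOpacity )
{
    // Opacity decreases as the distance from the painting center increases.
    float falloff = 1.0f - ( distanceToCenter / paintRadius );
    if ( falloff < 0.0f ) falloff = 0.0f;
    float w = brushOpacity * falloff;

    // Weighted sum of the previous color and the painting color.
    for ( int i = 0; i < 3; ++i )
    {
        leafColor[i] = ( 1.0f - w ) * leafColor[i] + w * brushColor[i];
    }
}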

To minimize the amount of data to be sent to the GPU as painting occurs, only the modified leaves are updated in texture memory. This corresponds to a partial update of the indirection pool texture (under OpenGL, we use glTexSubImage3D). The modifications are tracked on a copy of the tree stored in CPU memory.

37.2.3 Rendering

To render the textured mesh, we need to access the octree from the fragment program, using the tree lookup defined in Section 37.1.2.

The untransformed coordinates of the vertices are stored as 3D texture coordinates. These 3D texture coordinates are interpolated during the rasterization of the triangles. Therefore, within the fragment program, we know the 3D point of the mesh surface being projected in the fragment. By using these coordinates as texture coordinates for the tree lookup, we retrieve the color stored in the octree texture.

However, this produces the equivalent of a standard texture lookup in “nearest” mode. Linear interpolation and mipmapping are often mandatory for high-quality results. In the following section, we discuss how to implement these techniques for octree textures.

Linear Interpolation

Linear interpolation of the texture can be obtained by extending the standard 2D linear interpolation. Because the octree texture is a volume texture, eight samples are required for linear interpolation, as shown in Figure 37-7.

Figure 37-7 Linear Interpolation Using Eight Samples

However, we store information only where the surface intersects the volume. Some of the samples involved in the 3D linear interpolation are not on the surface and have no associated color information. Consider a sample at coordinates (i, j, k) within the maximum depth grid (recall that the depth-D grid is the regular grid produced by a complete octree at depth D). The seven other samples involved in the 3D linear interpolation are at coordinates (i+1, j, k), (i, j+1, k), (i, j, k+1), (i, j+1, k+1), (i+1, j, k+1), (i+1, j+1, k), and (i+1, j+1, k+1). However, some of these samples may not be included in the tree, because they are too far from the surface. This leads to rendering artifacts, as shown in Figure 37-8.

Figure 37-8 Fixing Artifacts Caused by Straightforward Linear Interpolation

We remove these artifacts by modifying the tree creation process. We make sure that all of the samples necessary for linear interpolation are included in the tree. This can be done easily by enlarging the box used to check whether a tree node intersects the geometry. The box is built in such a way that it includes the previous samples in each dimension. Indeed, the sample at (i, j, k) must be added if one of the previous samples (for example, the one at (i-1, j-1, k-1)) is in the tree. This is illustrated in Figure 37-9.

Figure 37-9 Modifying the Tree Creation to Remove Linear Interpolation Artifacts

In our demo, we use the same depth for all color leaves. Of course, the octree structure makes it possible to store color information at different depths. However, doing so complicates linear interpolation. For more details, refer to Benson and Davis 2002.

Mipmapping

When a textured mesh becomes small on the screen, multiple samples of the texture fall into the same pixel. Without a proper filtering algorithm, this leads to aliasing. Most GPUs implement the mipmapping algorithm on standard 2D textures. We extend this algorithm to our GPU octree textures.

We define the mipmap levels as follows. The finest level (level 0) corresponds to the leaves of the initial octree. A coarser level is built from the previous one by merging the leaves in their parent node. The node color is set to the average color of its leaves, and the leaves are suppressed, as shown in Figure 37-10. The octree depth is therefore reduced by one at each mipmapping level. The coarsest level has only one root node, containing the average color of all the leaves of the initial tree.

Figure 37-10 An Example of a Tree with Mipmapping

Storing one tree per mipmapping level would be expensive. Instead, we create a second 3D texture, called the LOD pool. The LOD pool has one cell per indirection grid of the indirection pool (see Figure 37-10, bottom row). Its resolution is thus S_u x S_v x S_w (see “Storage” in Section 37.1.2). Each node of the initial tree becomes a leaf at a given mipmapping level. The LOD pool stores the color taken by the nodes when they are used as leaves in a mipmapping level.

To texture the mesh at a specific mipmapping level, we stop the tree lookup at the corresponding depth and look up the node’s average color in the LOD pool. The appropriate mipmapping level can be computed within the fragment program using partial-derivative instructions.

37.2.4 Converting the Octree Texture to a Standard 2D Texture

Our ultimate goal is to use octree textures as a replacement for 2D textures, thus completely removing the need for a 2D parameterization. However, the octree texture requires explicit programming of the texture filtering. This leads to long fragment programs. On recent GPUs, performance is still high enough for texture-authoring applications, where a single object is displayed. But for applications displaying complex scenes, such as games or simulators, rendering performance may be too low. Moreover, GPUs are extremely efficient at displaying filtered standard 2D texture maps.

Being able to convert an octree texture into a standard 2D texture is therefore important. We would like to perform this conversion dynamically: this makes it possible to select the best representation at runtime. For example, an object near the viewpoint would use the linearly interpolated octree texture and switch to the corresponding filtered standard 2D texture when it moves farther away. The advantage is that filtering of the 2D texture is natively handled by the GPU. Thus, the extra cost of the octree texture is incurred only when details are visible.

In the following discussion, we assume that the mesh is already parameterized. We describe how we create a 2D texture map from an octree texture.

To produce the 2D texture map, we render the triangles using their 2D (u, v) coordinates instead of their 3D (x, y, z) coordinates. The triangles are textured with the octree texture, using the 3D coordinates of the mesh vertices as texture coordinates for the tree lookup. The result is shown in Figure 37-11.

Figure 37-11 Converting the Octree into a Standard 2D Texture
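On the application side, the baking pass amounts to swapping the roles of the two coordinate sets when submitting the triangles. A minimal sketch using legacy OpenGL immediate mode is shown below; it assumes an orthographic projection over the unit square and the octree-lookup fragment program already bound, and the Vertex structure is illustrative.

#include <GL/gl.h>

struct Vertex { float x, y, z, u, v; };

// Renders the mesh into the 2D texture domain: (u, v) is used as the vertex
// position, while (x, y, z) is passed as the texture coordinate used by the
// octree lookup in the fragment program.
void bakeOctreeToTexture(const Vertex* verts, int vertexCount)
{
  glBegin(GL_TRIANGLES);
  for (int i = 0; i < vertexCount; ++i) {
    glTexCoord3f(verts[i].x, verts[i].y, verts[i].z);  // octree lookup coordinate
    glVertex2f(verts[i].u, verts[i].v);                // position in texture space
  }
  glEnd();
}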

However, this approach produces artifacts. When the 2D texture is applied to the mesh with filtering, the background color bleeds inside the texture. This happens because samples outside of the 2D triangles are used by the linear interpolation for texture filtering. It is not sufficient to add only a few border pixels: more and more pixels outside of the triangles are used by coarser mipmapping levels. These artifacts are shown in Figure 37-12.

Figure 37-12 Artifacts Resulting from Straightforward Conversion

To suppress these artifacts, we compute a new texture in which the colors are extrapolated outside of the 2D triangles. To do so, we use a simplified GPU variant of the extrapolation method known as push-pull. This method has been used for the same purpose in Sander et al. 2001.

We first render the 2D texture map as described previously. The background is set with an alpha value of 0. The triangles are rendered using an alpha value of 1. We then ask the GPU to automatically generate the mipmapping levels of the texture. Then we collapse all the mipmapping levels into one texture, interpreting the alpha value as a transparency coefficient. This is done with the Cg code shown in Listing 37-3.

Finally, new mipmapping levels are generated by the GPU from this new texture. Figures 37-13 and 37-14 show the result of this process.

Figure 37-13 Color Extrapolation

Figure 37-14 Artifacts Removed Due to Color Extrapolation

Example 37-3. Color Extrapolation Cg Code

PixelOut main(V2FI IN,
              uniform sampler2D Tex) // texture with mipmapping levels
{
  PixelOut OUT;
  float4 res = float4(0.0, 0.0, 0.0, 0.0);
  float alpha = 0.0;
  // start with coarsest level
  float sz = TEX_SIZE;
  // for all mipmapping levels
  for (float i = 0.0; i <= TEX_SIZE_LOG2; i += 1.0)
  {
    // texture lookup at this level
    float2 MIP = float2(sz / TEX_SIZE, 0.0);
    float4 c = (float4)tex2D(Tex, IN.TCoord0, MIP.xy, MIP.yx);
    // blend with previous
    res = c + res * (1.0 - c.w);
    // go to finer level
    sz /= 2.0;
  }
  // done - return normalized color (alpha == 1)
  OUT.COL = float4(res.xyz / res.w, 1.0);
  return OUT;
}

37.3 Application 2: Surface Simulation

We have seen with the previous application that octree structures are useful for storing color information along a mesh surface. But octree structures on GPUs are also useful for simulation purposes. In this section, we present how we use an octree structure on the GPU to simulate liquid flowing along a mesh.

We do not go through the details of the simulation itself, because that is beyond the scope of this chapter. We concentrate instead on how we use the octree to make available all the information required by the simulation.

The simulation is done by a cellular automaton residing on the surface of the object. To perform the simulation, we need to attach a 2D density map to the mesh surface. The next simulation step is computed by updating the value of each pixel with respect to the density of its neighbors. This is done by rendering into the next density map using the previous density map and neighboring information as input.

Because physical simulation is very sensitive to distortions, using a standard 2D parameterization to associate the mesh surface to the density map would not produce good results in general. Moreover, computation power could be wasted if some parts of the 2D density map were not used. Therefore, we use an octree to avoid the parameterization.

The first step is to create an octree around the mesh surface (see Section 37.2.1). We do not directly store density within the octree: the density needs to be updated using a render-to-texture operation during the simulation and should therefore be stored in a 2D texture map. Instead of density, each leaf of the octree contains the index of a pixel within the 2D density map. Recall that the leaves of the octree store three 8-bit values (in RGB channels). To be able to use a density map larger than 256×256, we combine the values of the blue and green channels to form a 16-bit index.
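The chapter does not spell out the packing itself; as a hedged illustration only, one natural encoding is to split a linear pixel index across two channels at octree-construction time. The names and layout below are assumptions, not the chapter's code.

#include <cstdint>

struct LeafRGB { uint8_t r, g, b; };   // the three 8-bit values stored in a leaf

// One possible encoding: the 16-bit linear index of the density-map pixel is
// split into a high byte (green) and a low byte (blue). The red channel is
// left free in this sketch.
LeafRGB encodeDensityIndex(int x, int y, int mapWidth)
{
  const uint16_t index = static_cast<uint16_t>(y * mapWidth + x);
  LeafRGB leaf;
  leaf.r = 0;
  leaf.g = static_cast<uint8_t>(index >> 8);
  leaf.b = static_cast<uint8_t>(index & 0xFF);
  return leaf;
}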

During simulation, we also need to access the density of the neighbors. A set of 2D RGB textures, called neighbor textures, is used to encode neighboring information. Let I be an index stored within a leaf L of the octree. Let Dmap be the density map and N a neighbor texture. The Cg call tex2D(Dmap, I) returns the density associated with leaf L. The call tex2D(N, I) gives the index within the density map corresponding to a neighbor (in 3D space) of the leaf L. Therefore, tex2D(Dmap, tex2D(N, I)) gives us the density of the neighbor of L.

To encode the full 3D neighborhood information, 26 textures would be required (a leaf of the tree can have up to 26 neighbors in 3D). However, fewer neighbors are required in practice. Because the octree is built around a 2D surface, the average number of neighbors is likely to be closer to 9.
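Putting these lookups together, one simulation step gathers, for each density-map pixel, its own value and the values reached through the neighbor textures. The following CPU-style C++ reference of that gather is only a sketch: the GPU version expresses the same two-level indirection with the tex2D calls quoted above, and the relaxation rule used here is a placeholder, since the chapter does not detail the automaton.

#include <cstddef>
#include <vector>

// density[i]      : value stored at pixel i of the density map
// neighbors[k][i] : density-map index of the k-th 3D neighbor of the leaf
//                   mapped to pixel i (one array per neighbor texture)
std::vector<float> simulationStep(const std::vector<float>& density,
                                  const std::vector<std::vector<int>>& neighbors)
{
  std::vector<float> next(density.size());
  for (std::size_t i = 0; i < density.size(); ++i) {
    float sum = 0.0f;
    for (const auto& table : neighbors)
      sum += density[table[i]];        // tex2D(Dmap, tex2D(N, I)) on the GPU
    const float average =
        neighbors.empty() ? density[i] : sum / static_cast<float>(neighbors.size());
    next[i] = 0.5f * (density[i] + average);   // placeholder relaxation rule
  }
  return next;
}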

Once these textures have been created, the simulation can run on the density map. Rendering is done by texturing the mesh with the density map. The octree is used to retrieve the density stored in a given location of the mesh surface. Results of the simulation are shown in Figure 37-15. The user can interactively add liquid on the surface. Videos are available on the book’s CD.

Figure 37-15 Liquid Flowing Along Mesh Surfaces

37.4 Conclusion

We have presented a complete GPU implementation of octree textures. These structures offer an efficient and convenient way of storing undistorted data along a mesh surface. This can be color data, as in the mesh-painting application, or data for dynamic texture simulation, as in the flowing liquid simulation. Rendering can be done efficiently on modern hardware, and we have provided solutions for filtering to avoid texture aliasing. Nevertheless, because 2D texture maps are preferable in some situations, we have shown how an octree texture can be dynamically converted into a 2D texture without artifacts.

Octrees are very generic data structures, widely used in computer science. They are a convenient way of storing information on unparameterized meshes, and more generally in space. Many other applications, such as volume rendering, can benefit from their hardware implementation.

We hope that you will discover many further uses for and improvements to the techniques presented in this chapter! Please see http://www.aracknea.net/octreetextures for updates of the source code and additional information.

37.5 References

Benson, D., and J. Davis. 2002. “Octree Textures.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 785–790.

Debry, D., J. Gibbs, D. Petty, and N. Robins. 2002. “Painting and Rendering Textures on Unparameterized Models.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 763–768.

Sander, P., J. Snyder, S. Gortler, and H. Hoppe. 2001. “Texture Mapping Progressive Meshes.” In Proceedings of SIGGRAPH 2001, pp. 409–416.


Copyright

Excerpted from GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, edited by Matt Pharr (Randima Fernando, series editor), Addison-Wesley, 2005, ISBN 0-321-33559-7. Copyright © 2005 by NVIDIA Corporation. All rights reserved.

Posted in CLOUD, CLUSTER, Computer Softwares, Computing Technology, CUDA, Graphics Cards, OpenCL, OpenGL, PARALLEL | 1 Comment »

Graphics Card Hierarchy Chart

Posted by Hemprasad Y. Badgujar on September 2, 2012


Graphics Card Hierarchy Chart

What about this other card that’s not on the list? How do I know if it’s a good deal or not?

This will happen. In fact, it’s guaranteed to happen, because inventory levels and prices change quickly. So how do you know if that card you’ve got your eye on is a good buy in its price range?

Here is a resource to help you judge whether a card is a good buy. The graphics card hierarchy chart groups graphics cards with similar overall performance levels into tiers. The top tier contains the highest-performing cards available, and performance decreases as you go down the tiers from there.

You can use this hierarchy to compare the pricing between two cards, to see which one is a better deal, and also to determine if an upgrade is worthwhile. I don’t recommend upgrading your graphics card unless the replacement card is at least three tiers higher. Otherwise, the upgrade is more of a sidegrade, and you may not notice a worthwhile difference in performance.

At the request of readers, I have added mobile graphics and integrated chipsets to the hierarchy chart. I want to make it clear that there is very little performance data available for these graphics solutions. While the discrete video cards in the chart are placed in tiers based on a lot of information, many of the mobile and integrated devices in the chart are guesstimates based on their specifications. At worst, I don’t think they’re more than one tier away from their actual performance, but this is something to keep in mind when considering mobile graphics chipsets.

Graphics Card Hierarchy Chart
GeForce Radeon Intel
Discrete: GTX 690
Discrete: GTX 590, GTX 680 Discrete: HD 6990, HD 7970 GHz Ed.
Discrete: GTX 670 Discrete: HD 7970
Discrete: GTX 580 Discrete: HD 5970, HD 7870, HD 7950
Discrete: GTX 295, GTX 480, GTX 570
Go (mobile): 680M
Discrete: HD 4870 X2, HD 6970, HD 7850
Mobility: 7970M
Discrete: GTX 470, GTX 560 Ti, GTX 560 Ti 448 Core Discrete: HD 4850 X2, HD 5870, HD 6950
Mobility: 7950M
Discrete: GTX 560
Go (mobile): 580M
Discrete: HD 5850, HD 6870
Mobility: 6990M
Discrete: 9800 GX2, GTX 285, GTX 460 256-bit, GTX 465
Go (mobile): 675M
Discrete: HD 6850
Mobility: 6900M
Discrete: GTX 260, GTX 275, GTX 280, GTX 460 192-bit, GTX 460 SE, GTX 550 Ti, GTX 560 SE
Go (mobile): 570M, 670M
Discrete: HD 4870, HD 5770, HD 4890, HD 5830, HD 6770,  HD 6790, HD 7770
Mobility: HD 5870, 6800M
Discrete: 8800 Ultra, 9800 GTX, 9800 GTX+, GTS 250, GTS 450
Go (mobile): 560M, 660M
Discrete: HD 3870 X2, HD 4850, HD 5750, HD 6750, HD 7750
Mobility: HD 4850, HD 5850, 7870M
Discrete: 8800 GTX, 8800 GTS 512 MB, GT 545 (GDDR5)
Go (mobile): GTX 280M, GTX 285M, 555M (GDDR5)
Discrete: HD 4770
Mobility: HD 4860, 7770M, 7850M
Discrete: 8800 GT 512 MB, 9800 GT, GT 545 (DDR3), GT 640 (DDR3)
Go (mobile): 9800M GTX, GTX 260M (112), GTS 360M (GDDR5), 555M (DDR3)
Discrete: HD 4830, HD 5670, HD 6670 (GDDR5)
Mobility: HD 5770, HD 5750, 6600M/6700M (GDDR5), 7750M
Discrete: 8800 GTS 640 MB, 9600 GT, GT 240 (GDDR5)
Go (mobile): 9800M GTS, GTX 160M
Discrete: HD 2900 XT, HD 3870, HD 5570 (GDDR5), HD 6570 (GDDR5)
Mobility: 6500M (GDDR5), 6600M/6700M (DDR3), 7730M
Discrete: 8800 GS, 9600 GSO, GT 240 (DDR3)
Go (mobile): GTX 260M (96), GTS 150M, GTS 360M (DDR3)
Discrete: HD 3850 512 MB, HD 4670, HD 5570 (DDR3), HD 6570 (DDR3), HD 6670 (DDR3)
Mobility: HD 3870, HD 5730, HD 5650, 6500M (DDR3)
Discrete: 8800 GT 256 MB, 8800 GTS 320MB, GT 440 GDDR5
Go (mobile): 8800M
Discrete: HD 2900 PRO, HD 3850 256 MB, 5550 (GDDR5)
Mobility: HD 3850
Discrete: 7950 GX2, GT 440 DDR3 Discrete: X1950 XTX, HD 4650 (DDR3), 5550 (DDR3)
Discrete: 7800 GTX 512, 7900 GTO, 7900 GTX, GT 430, GT 530
Go (mobile): 550M
Discrete: X1900 XT, X1950 XT, X1900 XTX
Discrete: 7800 GTX, 7900 GT, 7950 G, GT 220 (DDR3)
Go (mobile): 525M, 540M
Discrete: X1800 XT, X1900 AIW, X1900 GT, X1950 PRO, HD 2900 GT, HD 5550 (DDR2)
Discrete: 7800 GT, 7900 GS, 8600 GTS, 9500 GT (GDDR3), GT 220 (DDR2)
Go (mobile): 7950 GTX
Discrete: X1800 XL, X1950 GT, HD 4650 (DDR2), HD 6450
Mobility: X1800 XT, HD 4650, HD 5165, 6400M
Integrated: 6620G, 6550D
Discrete: 6800 Ultra, 7600 GT, 7800 GS, 8600 GS, 8600 GT (GDDR3), 9500 GT (DDR2)
Go (mobile): 7800 GTX, 7900 GTX
Discrete: X800 XT (& PE), X850 XT (& PE), X1650 XT, X1800 GTO, HD 2600 XT, HD 3650 (DDR3), HD 3670
Mobility: X1900, 3670
Integrated: 6520G, 6530D
Integrated: Intel HD Graphics 4000
Discrete: 6800 GT, 6800 GS (PCIe), 8600 GT (DDR2), GT 520
Go (mobile): 7800, Go 7900 GS, 520M, 520MX
Discrete: X800 XL, X800 GTO2/GTO16, HD 2600 PRO, HD 3650 (DDR2),
Mobility: X800 XT, HD 2600 XT, 3650
Integrated: 6410D, 6480G
Discrete: 6800 GS (AGP)
Go (mobile): 6800 Ultra, 7600 GT, 8600M GT, 8700M GT, 410M
Discrete: X800 GTO 256 MB, X800 PRO,X850 PRO, X1650 GT
Mobility: HD 2600
Integrated: 6370D, 6380G
Discrete: 6800, 7300 GT GDDR3, 7600 GS, 8600M GS
Go (mobile): 6800, 7700
Discrete: X800, X800 GTO 128 MB, X1600 XT, X1650 PRO
Mobility: X1800, HD 5145, HD 5470 (GDDR5), HD 5450,
Discrete: 6600 GT, 6800LE, 6800 XT, 7300 GT (DDR2), 8500 GT, 9400 GT
Go (mobile): 7600 (128-bit)
Discrete: 9800 XT, X700 PRO, X800 GT, X800 SE, X1300 XT, X1600 PRO, HD 2400 XT, HD 4350, HD 4550, HD 5450
Mobility: X800, 3470, HD 5470 (DDR3), HD 5430, 6300M
Integrated: HD 6310, HD 6320
Integrated: Intel HD Graphics 3000
Discrete: FX 5900, FX 5900 Ultra, FX 5950 Ultra, 6600 (128-bit)
Go (mobile): 6800 (128-bit)
Integrated: 9300, 9400
Discrete: 9700, 9700 PRO, 9800, 9800 PRO, X700, X1300 PRO, X1550, HD 2400 PRO
Mobility: X1450, X1600, X1700, 2400 XT, X2500, 3450
Integrated: HD 3200, HD 3300, HD 4200, HD 4250, HD 4290, HD 6250, HD 6290
Discrete: FX 5800 Ultra, FX 5900 XT
Go (mobile): 6600, Go 7600 (64-bit)
Discrete: 9500 PRO, 9600 XT, 9800 PRO (128-bit), X600 XT, X1050 (128-bit)
Mobility: 9800, X700, X1350, X1400, X2300, HD 2400
Integrated: Intel HD Graphics (Core i5-6×1), 2000
Discrete: 4 Ti 4600, 4 Ti 4800, FX 5700 Ultra, 6200, 8300, 8400 G, G 210, G 310
Go (mobile): 315M
Discrete: 9600 PRO, 9800 LE, X600 PRO, HD 2300
Mobility: 9700 (128-bit), X600, X1300
Integrated: Xpress 1250
Integrated: Intel HD Graphics (Core i3 5×0, Core i5-6×0)
Discrete: 4 Ti4200, 4 Ti4400, 4 Ti4800 SE, FX 5600 Ultra, FX 5700, 6600 (64-bit), 7300 GS, 8400M GS, 9300M G, 9300M GS Discrete: 9500, 9550, 9600, X300, X1050 (64-bit)
Mobility: 9600
Integrated: Intel HD Graphics (Pentium G)
Discrete: 3 Ti500, FX 5200 Ultra, FX 5600, FX 5700 LE, 6200 TC, 6600 LE, 7200 GS, 7300 LE
Go (mobile): 5700, 8200M, 9200M GS, 9100
Integrated: 8200, 8300
Discrete: 8500, 9100, 9000 PRO, 9600 LE, X300 SE, X1150
Mobility 9700 (64-bit)
Integrated: GMA X4500
Discrete: 3, 3 Ti200, FX 5200 (128-bit), FX 5500,
Go (mobile): 5600, 6200, 6400, 7200, 7300, 7400 (64-bit)
Discrete: 9000, 9200, 9250
Mobility: 9600 (64-bit), X300
Discrete: FX 5200 (64 bit)
Go (mobile): 7200, 7400 (32-bit)
Integrated: 6100, 6150, 7025, 7050
Discrete: 9200 SE
Integrated: Xpress 200M, Xpress 1000, Xpress 1150
Integrated: GMA X3000, X3100, X3500
Discrete: 2 GTS, 4 MX 440, 2 Ultra, 2 Ti, 2 Ti 200 Discrete: 7500 Integrated: GMA 3000, 3100
Discrete: 256, 2 MX 200, 4 MX 420, 2 MX 400 Discrete: SDR, LE, DDR, 7000, 7200 Integrated: GMA 500, 900, 950
Discrete: Nvidia TNT Discrete: Rage 128 Discrete: Intel 740

Summary

There you have it, folks: the best cards for the money this month. Now all that’s left to do is to find and purchase them.

Don’t worry too much about which brand you choose, because all of the cards out there are close to Nvidia’s and ATI’s reference designs. Just pay attention to price, warranty, and the manufacturer’s reputation for honoring the warranty if something goes wrong.

Posted in Computer Hardwares, Graphics Cards, Mixed | Leave a Comment »

 