
Getting Started with CUDA

Posted by Hemprasad Y. Badgujar on March 4, 2013


What are the capabilities of Nvidia’s CUDA running on the GPU and how does it compare to CPU performance? I bought a GeForce 9800GT and set about finding out, starting off by installing the CUDA drivers, toolkit and SDK from the Cuda Zone.

The first thing I noticed was that on my Vista64 machine the sample projects had been installed to:

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\projects

which is read-only. Rather than fight with Vista's UAC, I copied everything into the C:\CUDA directory. To build the solution in VS2008 on my Vista 64 machine, all I needed to do was switch the platform to x64 and ignore the warning:

 

Command line warning D9035 : option 'Wp64' has been deprecated and will be removed in a future release

 

and everything was fine. The SDK’s sample template conveniently included both a gold (CPU) implementation of a function and a GPU implementation. An initial run of the template project showed that only the GPU section was timed. Since the reason to use CUDA is performance and I wanted a comparison, the first modification I made was to put a timer around the CPU implementation:

 

cutilCheckError( cutStartTimer( timer));
computeGold( reference, h_idata, num_threads);  // reference solution
cutilCheckError( cutStopTimer( timer));
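The elapsed time is then read back from the cutil timer in the same way the template already reports the GPU time; a one-line sketch:

printf( "CPU Processing time: %f (ms)\n", cutGetTimerValue( timer));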

 

and raced them – but the results weren’t too inspiring:

 

GPU Processing time: 84.362747 (ms)
CPU Processing time: 0.001257 (ms)

 

The CPU solution wasn’t even threaded. I remembered the question of a student at the Stanford CUDA lecture on YouTube:

 

Q: Since there’s overhead in moving the data to the GPU how do you decide when it’s worthwhile?

A: Generally speaking it makes the most sense for large problems with high data intensity where you have to do multiple calculations per data element. 

Hmm, the template code only processed 128 bytes with 32 threads, so I had paid the setup costs and then not sent enough data to the GPU; no wonder the CPU was faster. I needed to increase the data set, but there's a problem with that: the provided kernel code assumes the entire data set will fit in shared memory and ties the size of the data to the thread count. Some changes were needed, because you can't just increase the number of threads or you'll get:

 

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <template.cu>, line 88 : invalid configuration argument.

 

The first step was to find out what resources were available on the GPU; then I'd need to work out how to get at those resources. Running the SDK Device Query told me how much global and shared memory was available, as well as how many threads I could use:

 

Device 0: "GeForce 9800 GT"
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 1073741824 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.50 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

 

Some interesting numbers there. Since each of the GeForce's stream processors can issue an FMAD (2 flops) and an FMUL (1 flop) per clock, the maximum theoretical throughput works out to 1.5 GHz * 112 * (2 + 1) = 504 Gflops. By way of comparison, the E8400 in my test machine has a peak of 24 Gflops according to Intel's data sheet (presumably 2 cores * 3.0 GHz * 4 SSE flops per cycle per core):

 

[Image: Intel E8400 data sheet excerpt showing the 24 Gflops peak]

 

But back to the problem of pushing more data through. There were a few problems:

1) The data size needs to be uncoupled from the thread count, which means changing the grid setup from this:

 

// setup execution parameters
dim3  grid( 1, 1, 1);
dim3  threads( num_threads, 1, 1);

 

to something more like this:

 

cThreadsPerBlock = 64;
cBlocksPerGridx = 1024;
cBlocksPerGridy = 1024;

cData = cThreadsPerBlock * cBlocksPerGridx * cBlocksPerGridy;

dim3  grid ( cBlocksPerGridx, cBlocksPerGridy, 1); 
dim3  block( cThreadsPerBlock, 1, 1);

 

where the counts of blocks per grid in the x and y directions would need to be derived from the data size. To simplify the example I've done it backwards and set the data size based on the thread and block breakdown (a sketch of the forward, data-derived version appears after the dim3 note below). These grid and block variables are then passed to the GPU using the triple angle bracket <<< >>> notation:

 

testKernel<<< grid, block, shared_mem_size >>>( d_idata, d_odata);

 

which is the same as:

 

testKernel<<< grid, 64, shared_mem_size >>> ( d_idata, d_odata);

 

because the passed argument is converted to a CUDA dim3 type, which, according to the programming guide, “is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.”
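For the forward, data-derived case mentioned above, a minimal sketch, assuming the element count cData is what is actually given, might be:

// derive the launch configuration from a known element count (cData assumed given)
unsigned int cThreadsPerBlock = 64;
unsigned int cBlocksTotal     = (cData + cThreadsPerBlock - 1) / cThreadsPerBlock;   // round up
unsigned int cBlocksPerGridx  = (cBlocksTotal < 65535) ? cBlocksTotal : 65535;       // per-dimension grid limit from the device query
unsigned int cBlocksPerGridy  = (cBlocksTotal + cBlocksPerGridx - 1) / cBlocksPerGridx;

dim3  grid ( cBlocksPerGridx, cBlocksPerGridy, 1);
dim3  block( cThreadsPerBlock, 1, 1);

The rounding up means the grid can cover slightly more elements than cData, which is why the kernel later takes the data length and guards on the global id.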

Passing a shared_mem_size in the kernel call as above allows the shared memory size to be set at runtime. You can then pick up a reference to the memory in the kernel code with:

 

extern  __shared__  float sdata[];

 

Alternatively, if you know the size at compile time, you can declare the shared memory statically inside the kernel like this:

 

__shared__ float sdata[256];

 

in which case the kernel call would just be:

 

testKernel<<< grid, 64 >>> ( d_idata, d_odata);

 

2) The kernel code must loop through the grid: calculate the thread id, the block id and then the global id to work out where in the global data each thread is. The size of the data (int len) must also be passed in, since num_threads is no longer coupled to the data length. The __umul24 in the code provides increased performance but comes with a warning: “Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but __mul24 and __umul24 provide signed and unsigned 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures however, __[u]mul24 will be slower than 32-bit integer multiplication”.

 

__global__ void
testKernel( float* g_idata, float* g_odata, int len) 
{
  // shared memory
  // the size is determined by the host application
  extern  __shared__  float sdata[];

  // thread id
  const unsigned int tid = threadIdx.x;
  // block id
  const unsigned int bid = __umul24(gridDim.x, blockIdx.y) + blockIdx.x ;
  // global memory id
  const unsigned int gid = tid + __umul24(blockDim.x, bid);

  const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z);

 

3) The kernel needs to read from global memory and then synchronise across threads; this syncs the threads across warps and presents a consistent picture of shared memory, so thread 0 can read from SDATA(1) and will see the data which thread 1 loaded. A call to __syncthreads() is only needed when the count of threads per block exceeds the warpSize because, as mentioned in the performance optimisation whitepaper, “Instructions are SIMD synchronous within a warp”. Of course every call has a cost, and the programming guide states that “throughput for __syncthreads is 8 operations per clock cycle in the case where no thread has to wait for any other threads.”

None of this is important in the sample template code because there is no communication between threads, so there is no need for shared memory or thread syncing; registers would normally be used in such a situation, but shared memory has presumably been used by Nvidia for illustration purposes.

 

const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z); 
SDATA(tid) = g_idata[gid];
if (cThreadsPerBlock > warpSize) __syncthreads();
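Putting those fragments together, a minimal sketch of the complete revised kernel might look like the following. The per-element work is just a single float multiplication as in the template, the SDATA macro above is written here as a plain access to sdata, and the guards on gid are my addition to cope with the grid being rounded up beyond the data length:

__global__ void
testKernel( float* g_idata, float* g_odata, int len)
{
  // shared memory, sized by the host at kernel launch
  extern  __shared__  float sdata[];

  // thread id within the block
  const unsigned int tid = threadIdx.x;
  // block id within the grid
  const unsigned int bid = __umul24(gridDim.x, blockIdx.y) + blockIdx.x;
  // global element id
  const unsigned int gid = tid + __umul24(blockDim.x, bid);

  const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y), blockDim.z);

  // read from global memory into shared memory, padding beyond the data with zeros
  sdata[tid] = (gid < len) ? g_idata[gid] : 0.0f;
  if (cThreadsPerBlock > warpSize) __syncthreads();

  // the "work": a single float multiplication per element
  sdata[tid] = sdata[tid] * (float) cThreadsPerBlock;
  if (cThreadsPerBlock > warpSize) __syncthreads();

  // write the result back out, skipping the padding elements
  if (gid < len) g_odata[gid] = sdata[tid];
}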

 

At this point I had revised the template to time the CPU for comparison and removed the size restrictions so a decent amount of data could be pushed through, and I was ready to attempt to answer the question: given the overhead of pushing the data to the GPU, when is it worth doing so? Running the code gave some unexpected answers. Keeping the thread count constant, I varied cBlocksPerGridy to yield various data sizes:

[Chart: GPU vs CPU processing time across the tested data sizes]

The GPU and CPU times tracked each other across the different data loads, but the GPU was hampered by a constant overhead of about 80ms, the exact same difference I had noted when only 128 bytes were trialled in the very first run before any modification. Where was the time going? Some sort of setup cost? And how much was being taken in the kernel and how much in the data transfer? I needed more fine-grained data to see what was going on.

I had modified the supplied SDK template code in a minimal way in order to measure CPU vs GPU performance and found that for the simple test code (1 float multiplication) the E8400 CPU with a claimed 24 Gflops was handily outperforming a GPU with a theoretical maximum of 504 Gflops. Where was all the time going? Was the kernel the culprit, the memory copy or something else? I started out by trying to reuse the

 

cutilCheckError( cutStartTimer( timer));

 

timing method already in the template. Looking into the CUDA source in SDK\common\src\stopwatch_win.cpp showed that on Windows it was using the QueryPerformanceFrequency method which uses the highest possible resolution hardware timer … on the CPU. Using it to measure GPU performance is problematic because timing the GPU using a CPU timer requires the GPU and the CPU to be synchronised with:

 

cudaThreadSynchronize();

 

and ruins the timing information. To measure times on the GPU I needed to use GPU based timing on stream 0 using events:

cudaEventRecord(start, 0);

So I created an array of start and stop events, broke the GPU processes into 5 steps and timed everything. The 5 GPU processes were:

1) Alloc: Host to Device – The allocation of memory on the device for the input array which needed to be copied over from the host.

2) Copy: Host to Device – Copying the input array from the host onto the device. Data size divided by time taken here would give bandwidth.

3) Alloc: Device to Host – The allocation of memory on the device for the output array where the result would be stored before being copied back to the host.

4) Compute – Running the actual kernel, reading from the input array, processing and writing results to the output array.

5) Copy: Device to Host – Copying the output array back to the host.

I also retained my CPU timing to measure the amount of time it took for the GPU to do everything and get the answer back onto the host – that way I'd have a 1:1 comparison against the CPU version. That gives one more thing to measure: how does the sum of the GPU times compare to the overall CPU time?

6) Sync with CPU – CPU time minus sum of GPU times indicates how long it takes to sync the two.

I set up 5 GPU timers to get a breakdown of where the GPU was spending its time and kept the 2 CPU timers for the original comparison:

 

// GPU timers - used to time GPU streams
int cGpuTimer = 5;

cudaEvent_t* rgGpuTimer_start = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);
cudaEvent_t* rgGpuTimer_stop = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);

for (int i=0;i<cGpuTimer;i++)
{
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_start[i] ) );
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_stop[i] ) );
}

 

and wrap all the GPU calls with timing calls:

 

cutilCheckError( cutStartTimer( rgTimer[0]));

  // Alloc: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[0], 0 ) );
  float* d_idata;
  cutilSafeCall( cudaMalloc( (void**) &d_idata, global_mem_size));
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[0], 0 ) );

  // Copy: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[1], 0 ) );
  cutilSafeCall( cudaMemcpy( d_idata, h_idata, global_mem_size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[1], 0 ) );

  // Alloc: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[2], 0 ) );
  float* d_odata;
  cutilSafeCall( cudaMalloc( (void**) &d_odata, global_mem_size)); // The pad won't be read back
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[2], 0 ) );

  // Compute
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[3], 0 ) );
  dim3  gridDim ( cBlocksPerGridx, cBlocksPerGridy, 1);
  dim3  blockDim( cThreadsPerBlock, 1, 1);

  testKernel<<< gridDim, blockDim, shared_mem_size >>>( d_idata, d_odata, cData);

  cutilCheckMsg("Kernel execution failed");
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[3], 0 ) );

  // Copy: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[4], 0 ) );
  cutilSafeCall( cudaMemcpy( h_odata, d_odata, global_mem_size, cudaMemcpyDeviceToHost) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[4], 0 ) );

cudaThreadSynchronize(); // Block until memory copy is done to ensure accurate timing

cutilCheckError( cutStopTimer( rgTimer[0]));
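The event pairs still have to be converted into milliseconds. A minimal sketch of reading the five stage timings back after the synchronize, using the arrays created above, is:

// convert each start/stop event pair into an elapsed time in milliseconds
for (int i = 0; i < cGpuTimer; i++)
{
    float ms = 0.0f;
    cutilSafeCall( cudaEventElapsedTime( &ms, rgGpuTimer_start[i], rgGpuTimer_stop[i] ) );
    printf( "GPU step %d: %f (ms)\n", i, ms);
}

// clean up the events once the numbers have been collected
for (int i = 0; i < cGpuTimer; i++)
{
    cutilSafeCall( cudaEventDestroy( rgGpuTimer_start[i] ) );
    cutilSafeCall( cudaEventDestroy( rgGpuTimer_stop[i] ) );
}
free( rgGpuTimer_start );
free( rgGpuTimer_stop );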

 

With this code in place I was ready to find out where the extra 80ms that the GPU took compared to the CPU was coming from and how much time each of the GPU tasks took. First a baseline comparison to verify that the code was still the same and gave the same numbers.

So here’s the graph from before on the left, and here’s the new graph, which should be identical, on the right:

[Charts: GPU vs CPU timings from the previous run (left) and from the run with GPU event timing added (right)]

Wow! What’s happened here? All the CPU times are the same, as expected, but the GPU has suddenly closed the gap and now takes only a few ms extra – the 80ms gap has vanished. A diff of the two versions shows that the only change to the code is the addition of GPU timing – and that turns out to be why the GPU suddenly sped up. Directly after setting the device, sending a wakeup call to the GPU like this:

 

if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
    cutilDeviceInit(argc, argv);
else
    cudaSetDevice( cutGetMaxGflopsDeviceId() );

{
    cudaEvent_t wakeGPU;
    cutilSafeCall( cudaEventCreate( &wakeGPU) );
}

 

means that 80ms vanishes from the timed loop later in the code. Note that the variable is scoped so it is never actually used. Is the GeForce like a person – does it go faster when it knows it is being watched?! Or is this some wake-up from a power-saving mode? I'm not sure. This is the only extra code needed to cut 80ms from the timing, which shows how tricky it is to time accurately on the millisecond scale; the slightest change can have a significant effect. It is always advisable to run tests on large volumes of data with plenty of loops to drown out one-off costs like this where possible. While on the topic of getting accurate performance readings, note that all timing should be done on release builds, particularly timing breakdowns, as the SDK/common/cutil_readme.txt file states:

 

“These macros are compiled out in release builds and so they will not affect performance. Note that in debug mode they call cudaThreadSynchronize() to ensure that kernel execution has completed, which can affect performance.” 
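Coming back to why an unused event removes the 80ms: a plausible explanation (my assumption, not something stated in the SDK) is that the CUDA context is created lazily by the first runtime call that touches the device, so that one-off cost is simply paid before the timed section instead of inside it. A sketch of forcing this explicitly right after selecting the device, using the common cudaFree(0) idiom, would be:

cudaSetDevice( cutGetMaxGflopsDeviceId() );

// Warm-up (assumption): any runtime call that touches the device, even cudaFree(0),
// forces the context to be created here rather than inside the timed section later on.
cutilSafeCall( cudaFree(0) );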

Well, now that the extra 80ms has been eliminated, what does our new GPU timing code show us about how the GPU spends its time? Here's a chart showing the breakdown for a 16MB sample:

[Pie chart: breakdown of GPU time by stage for a 16MB sample]

The majority of the time, and this holds for the other data sizes, is taken up copying data back and forth, so experimentally the overhead of moving the data is quite significant. Of the 24.8ms required in total to process 16MB, 21.9ms were spent copying data; the actual processing takes almost no time. Running a variety of input sizes and timing each one tells us what kind of bandwidth we are typically getting, as shown in the table below where times are in ms (for example, 16MB host to device in 9.0ms is roughly 1.8GB/s):

        Copy: Host to Device        Copy: Device to Host
        Time (ms)    MB/s           Time (ms)    MB/s
16MB        9.0     1771.9             11.8     1359.3
32MB       16.3     1966.0             22.2     1442.8
64MB       30.6     2093.9             49.8     1285.4
128MB      58.2     2198.2             83.9     1526.4
256MB     114.9     2228.7            171.4     1493.4

We wanted to find out where the GPU was spending its time and have now discovered that most of it goes on moving data back and forth. Can we now answer the question of where the GPU outperforms the CPU? Is 2GB/s the expected throughput? Well, Nvidia provides a tool in the SDK to answer that – the “Bandwidth Test”. Running it through the provided GUI tool yields the following results:

 

Running on......
      device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               2152.6

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1919.2

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               48507.8

 

So we can see that for 32MB the performance is roughly in line with the template results, so that's case closed… or is it? Two things give cause for concern:

1) PCIe 2.0 is theoretically capable of 500 MB/s per lane and with a x16 slot there are 16 lanes. So throughput should be up around 8GB/s, not the 2GB/s observed.

2) What exactly does “Host to Device Bandwidth for Pageable memory” in the bandwidth test results mean? Pageable memory?

So I had found out that the bulk of the time was spent copying data, confirmed that the speeds observed were similar to those given by Nvidia's own test suite, and raised a new question about whether we were getting everything out of the hardware, given the 2GB/s observed against 8GB/s theoretical. The next steps were to confirm that my hardware really was PCIe 2.0 x16 and to figure out what pageable memory is.

I’d added GPU based timing to my template code and found out that most of the time was spent copying data back and forth between the host and the device. The “Bandwidth Test” in the SDK gave roughly similar results although it mentioned something about pageable memory. But the big problem was the theoretical performance of PCIe 2.0 x16 far exceeded what I was seeing. So the first step was to confirm that both my graphics card and my motherboard supported and were using PCIe 2.0 x16. To do this I used CPU-Z and GPU-Z, with the following results:

 

[Images: CPU-Z and GPU-Z screenshots confirming PCIe 2.0 x16 support]

 

So after confirming that the hardware should have been capable of better speeds, I took another look at the bandwidthTest. Running it with the --help switch reveals several options:

 

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --help
Usage:  bandwidthTest [OPTION]...
Test the bandwidth for device to host, host to device, and device to device transfers

Example:  measure the bandwidth of device to host pinned memory copies in the range 1024 Bytes
          to 102400 Bytes in 1024 Byte increments
./bandwidthTest --memory=pinned --mode=range --start=1024 --end=102400 --increment=1024 --dtoh

Options:
--help  Display this help menu
--csv   Print results as a CSV
--device=[deviceno]     Specify the device device to be used
  all - compute cumulative bandwidth on all the devices
  0,1,2,...,n - Specify any particular device to be used
--memory=[MEMMODE]      Specify which memory mode to use
  pageable - pageable memory
  pinned   - non-pageable system memory
--mode=[MODE]   Specify the mode to use
  quick - performs a quick measurement
  range - measures a user-specified range of values
  shmoo - performs an intense shmoo of a large range of values
--htod  Measure host to device transfers
--dtoh  Measure device to host transfers
--dtod  Measure device to device transfers
--wc    Allocate pinned memory as write-combined
--cputiming     Force CPU-based timing always
Range mode options
--start=[SIZE]  Starting transfer size in bytes
--end=[SIZE]    Ending transfer size in bytes
--increment=[SIZE]      Increment size in bytes

 

Of particular interest is the “pinned” memory mode. Let's try that:

 

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5256.9
Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4891.6
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 48498.6

 

and we see that this mode vastly improves the maximum throughput. Not sure why Nvidia didn’t make it the default option. Speeds are now up to 5GB/s. A short investigation of the code reveals that the timing isn’t quite analogous to the testing we are doing in the template code:

bandwidthTest.cu

 

// defines, project
#define MEMCOPY_ITERATIONS  10

 

as the bandwidthTest copies the same memory 10 times in a row as compared to the single copy we are doing. So we expect our performance to lag slightly behind this 5GB/s. Conveniently, all the code needed to use pinned memory is provided in the bandwidthTest, so putting it into a few wrapper functions called freeHost, mallocHost and memCpy yields:

 

////////////////////////////////////////////////////////////////////////////////
//  Memory functions to switch between pinned and pageable memory as required
////////////////////////////////////////////////////////////////////////////////

cudaError
freeHost(void* h_mem, memoryMode memMode)
{
    if( PINNED == memMode ) {
        return cudaFreeHost(h_mem);
    }
    else {
        free(h_mem);
    }
    return cudaSuccess;
}

cudaError
mallocHost(void** h_mem ,uint memSize, memoryMode memMode, bool wc)
{
    if( PINNED == memMode ) {
#if CUDART_VERSION >= 2020
        return cudaHostAlloc( h_mem, memSize, (wc) ? cudaHostAllocWriteCombined : 0 );
#else
        if (wc) printf("Write-Combined unavailable on CUDART_VERSION less than 2020, running is: %d\n", CUDART_VERSION);
        return cudaMallocHost( h_mem, memSize );
#endif
    }
    else { // PAGEABLE memory mode
        *h_mem = malloc( memSize );
    }

    return cudaSuccess;
}

cudaError
memCpy(void* sink, void* source, uint memSize, cudaMemcpyKind direction, memoryMode memMode)
{
    if( PINNED == memMode ) {
        return cudaMemcpyAsync( sink, source, memSize, direction, 0);
    }
    else {
        return cudaMemcpy( sink, source, memSize, direction);
    }
}
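Wiring these into the template is then mostly a matter of swapping the calls over. A minimal usage sketch, with variable names assumed from the earlier listings (memoryMode and PINNED come across from the bandwidthTest code along with the wrappers), might look like this:

memoryMode memMode = PINNED;          // or PAGEABLE to get the original behaviour back
bool bWriteCombined = false;

// host-side input buffer, now optionally pinned
float* h_idata = NULL;
cutilSafeCall( mallocHost( (void**) &h_idata, global_mem_size, memMode, bWriteCombined ) );

// Copy: Host to Device
cutilSafeCall( memCpy( d_idata, h_idata, global_mem_size, cudaMemcpyHostToDevice, memMode ) );

// ... kernel launch as before ...

// Copy: Device to Host
cutilSafeCall( memCpy( h_odata, d_odata, global_mem_size, cudaMemcpyDeviceToHost, memMode ) );

cudaThreadSynchronize();              // memCpy is asynchronous in pinned mode, so block before touching h_odata

cutilSafeCall( freeHost( h_idata, memMode ) );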

 

These wrapper functions take the same parameters as the calls they replace, with the addition of the memory mode and, for mallocHost, a flag for whether the memory should be write-combined. Changing the allocation, copying and freeing over to these new functions allows the use of pinned memory. Running the same test set shows that the time is now spread much more evenly between the tasks:

[Pie charts: breakdown of GPU time by stage using pinned memory]

and running the new numbers on the throughput we get:

        Copy: Host to Device        Copy: Device to Host
        Time (ms)    MB/s           Time (ms)    MB/s
16MB        3.2     5026.7              3.3     4878.0
32MB        6.1     5242.5              6.5     4891.5
64MB       12.2     5251.1             13.1     4871.7
128MB      24.4     5247.6             26.2     4894.1
256MB      48.9     5239.0             52.3     4894.7

So now the throughput is much closer to the theoretical limit and matches the best that the bandwidthTest provides. The total times are down significantly and the GPU is faster on all the tested sizes; the 256MB trial runs in 30% less time, down from 340ms to 236ms.

[Chart: total GPU vs CPU processing times using pinned memory]

The next challenge is to find where else time is lost. The pie charts show that most of the time is still spent in allocation and copying, with very little in compute, so there's no need to look at the kernel. We've probably already cut most of the time we can from the copying, so that leaves allocation. A good approach would probably be to allocate the memory once and then reuse it over multiple kernel executions, the kind of intensive process Nvidia suggests is best suited to CUDA. But what if the code needs to be as shown, one kernel being run on one large set of data before returning to another application? This is the kind of flow seen in Matlab MEX files where CUDA is used: Matlab passes the data through the C/C++ MEX file, which runs the CUDA code, gets the result and then returns to Matlab. Could parallel memory copies and allocations speed things up in this situation?

So we’ve switched the code over to use pinned memory in preference to pageable and attained the desired speedup in memory operations from 2GB/s to about 5GB/s. Theoretically PCIe 2.0 x16 should be able to hit 8GB/s and I don’t know why we aren’t able to achieve speeds closer to this number. If anyone knows please leave a comment or e-mail me. From here the next thing to investigate to get more throughput in the single kernel scenario is parallel allocations and copies.
