CUDA Vector Addition

Posted by Hemprasad Y. Badgujar on June 14, 2014


This sample shows a minimal conversion of our vector addition CPU code to C for CUDA; consider it a CUDA C 'Hello World'. If you are not yet familiar with basic CUDA concepts, please see section 1 of the Accelerated Computing Guide. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}
int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);
    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }
    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
    int blockSize, gridSize;
    // Number of threads in each thread block
    blockSize = 1024;
    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
    // Sum up vector c and print the result divided by n; this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);
    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}

Changes:

Kernel:
The kernel is the heart of our CUDA code. When a kernel is launched, the number of threads per block (blockDim) and the number of blocks per grid (gridDim) are specified. The total number of threads is blockDim * gridDim. Each thread evaluates one copy of the kernel.

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

Let’s take a look at what makes up this simple kernel.

__global__ void vecAdd(double *a, double *b, double *c, int n)

The __global__ qualifier specifies that this is a CUDA kernel; otherwise, normal C function syntax is used. The kernel must have return type void.

int id = blockIdx.x*blockDim.x+threadIdx.x;

We calculate the thread's global ID using CUDA-supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1. threadIdx is the thread's index inside its associated block, ranging from 0 to blockDim-1. For convenience, blocks and grids can be multi-dimensional; the associated structs contain x, y, and z members.
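For concreteness, here is a hedged sketch of how the same calculation extends to a two-dimensional launch. This is not part of this sample; the kernel name vecAdd2D and the row/col layout are illustrative assumptions.

// Hypothetical 2D variant (illustrative only, not in the original sample):
// each thread computes a (row, col) pair and flattens it to a 1D index.
__global__ void vecAdd2D(double *a, double *b, double *c, int rows, int cols)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < rows && col < cols) {
        int id = row*cols + col;    // flatten 2D coordinates to a 1D index
        c[id] = a[id] + b[id];
    }
}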

if (id < n)

Unless the array length is divisible by our blockDim, the number of threads launched will not equal the number of valid array elements. To avoid overstepping the array, we simply test that the global thread ID is less than the array length. A grid-stride alternative is sketched below.
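One common alternative, shown here as an illustration rather than part of this sample, is a grid-stride loop: each thread strides through the array in steps of the total thread count, so a grid of any size covers any n.

// Grid-stride variant (illustrative sketch, not in the original sample):
// each thread handles every (blockDim.x*gridDim.x)-th element.
__global__ void vecAddStride(double *a, double *b, double *c, int n)
{
    for (int id = blockIdx.x*blockDim.x + threadIdx.x;
         id < n;
         id += blockDim.x*gridDim.x)
        c[id] = a[id] + b[id];
}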

c[id] = a[id] + b[id];

The thread ID is used to index the arrays that reside in global device memory. Each thread loads a value from a and b and writes the sum to c.

Array Pointers:

// Host input vectors
double *h_a;
double *h_b;
// Host output vector
double *h_c;
// Device input vectors
double *d_a;
double *d_b;
// Device output vector
double *d_c;

Because the host CPU and the GPU have separate memory spaces, we must maintain two sets of pointers: one for the host arrays and one for the device arrays. Here we use the h_ and d_ prefixes to differentiate them.
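As an aside, CUDA 6 introduced Unified Memory, which lets one pointer serve both spaces. A minimal sketch, assuming a toolkit and device that support it (not what this sample does):

// Unified Memory sketch (assumes CUDA 6+ and a supporting device):
double *a;
cudaMallocManaged(&a, bytes);   // one pointer, valid on host and device
// ... initialize a on the host, pass a directly to kernels ...
cudaFree(a);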

cudaMalloc:

// Allocate memory for each vector on GPU
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);

Given a pointer, cudaMalloc allocates memory in the device's global memory space and sets the passed pointer to point to this memory.
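For brevity the sample ignores return codes, but every CUDA runtime call returns a cudaError_t that is worth checking. A minimal error-checking sketch:

// Error-checking sketch (not in the original sample):
cudaError_t err = cudaMalloc(&d_a, bytes);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}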

cudaMemcpy to device:

// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

After we initialize the data on the host, we must copy it to the device; to do so we use cudaMemcpy. Calling cudaMemcpy initiates a DMA transfer between host and device.
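DMA transfers are most efficient from pinned (page-locked) host memory. A hedged sketch using cudaMallocHost in place of malloc, as an alternative to what the sample does:

// Pinned host memory sketch (illustrative; the sample uses plain malloc):
double *h_pinned;
cudaMallocHost(&h_pinned, bytes);   // page-locked host allocation
cudaMemcpy(d_a, h_pinned, bytes, cudaMemcpyHostToDevice);
cudaFreeHost(h_pinned);             // released with cudaFreeHost, not free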

Thread mapping:

int blockSize, gridSize;
 
// Number of threads in each thread block
blockSize = 1024;
 
// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);

To launch our kernel we must specify the number of threads per block and the number of blocks in the grid. The maximum value of each depends on the device's compute capability. Normally blockSize is chosen based on the kernel's memory requirements, and gridSize is calculated so that enough threads are launched to cover the kernel's domain.
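With this sample's numbers, n = 100000 and blockSize = 1024 give gridSize = ceil(100000/1024) = 98, i.e. 100352 threads, of which the last 352 fail the id < n test and do no work. The same ceiling is often computed in integer arithmetic:

// Integer form of the same ceiling division (equivalent, avoids the float cast):
gridSize = (n + blockSize - 1) / blockSize;   // 98 for n = 100000, blockSize = 1024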

Launch kernel:

// Execute the kernel
vecAdd<<< gridSize, blockSize >>>(d_a, d_b, d_c, n);

We launch our kernel with a modified C function syntax that lets us specify the grid and block sizes. Kernel launches are non-blocking: control returns to the host immediately, while the kernel runs asynchronously on the device.
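When no blocking copy follows the launch, an explicit synchronization can be used to make the host wait. A sketch, not needed in this sample since the cudaMemcpy below already waits:

// Explicit synchronization sketch:
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
cudaDeviceSynchronize();   // block the host until all preceding device work completes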

cudaMemcpy to host:

// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

cudaMemcpy will block until the kernel is complete and then copy the results from the device back to the host.
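Since sin(i)*sin(i) + cos(i)*cos(i) = 1 for every i, each element of h_c should equal 1 within floating-point error. An optional host-side verification sketch, not in the original:

// Element-wise check (illustrative sketch):
for (i = 0; i < n; i++)
    if (fabs(h_c[i] - 1.0) > 1e-9)
        printf("mismatch at %d: %f\n", i, h_c[i]);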

Release device memory:

// Release device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

As with C's malloc, we need to release our device arrays once we are finished; to do so we use cudaFree.

Compiling:

The CUDA environment must be set up before compiling, so make sure the cudatoolkit module is loaded:

$ module load cudatoolkit
$ nvcc vecAdd.cu -o vecAdd.out

Running:

$ aprun ./vecAdd.out
final result: 1.000000