
Archive for the ‘CUDA TUTORIALS’ Category

CUDA Unified Memory

Posted by Hemprasad Y. Badgujar on October 6, 2015



CUDA is the language of Nvidia GPUs.  To extract maximum performance from a GPU, you'll want to develop your applications in CUDA.

CUDA Toolkit is the primary IDE (integrated development environment) for developing CUDA-enabled applications.  The main roles of the Toolkit IDE are to simplify the software development process, maximize software developer productivity, and provide features that enhance GPU performance.  The Toolkit has been steadily evolving in tandem with GPU hardware and currently sits at Version 6.5.

One of the most important features of CUDA 6.5 is Unified Memory (UM).  (UM was actually first introduced in CUDA 6.0.)  CPU host memory and GPU device memory are physically separate entities, connected by a relatively slow PCIe bus.  Prior to CUDA 6.0, data elements shared by the CPU and GPU required two copies: one in CPU memory and one in GPU memory.  Developers had to allocate memory on the CPU, allocate memory on the GPU, and then copy data from CPU to GPU and from GPU back to CPU.  This dual data management scheme added complexity to programs, created opportunities for software bugs, and forced developers to spend an excessive share of their time and energy on data management tasks.

UM corrects this.  UM creates a memory pool that is shared between CPU and GPU, with a single memory address space and single pointers accessible to both host and device code.  The CUDA driver and runtime libraries automatically handle data transfers between host and device memory, thus relieving developers from the burden of explicitly managing those data transfers.  UM improves performance by automatically providing data locality on the CPU or GPU, wherever it might be required by the application algorithm.  UM also guarantees global coherency of data on host and device, thus reducing the introduction of software bugs.

Let’s explore some sample code that illustrates these concepts.  We won’t concern ourselves with the function of this algorithm; instead, we’ll just focus on the syntax. (Credit to Nvidia for this C/CUDA template example).

Without Unified Memory

#include <string.h>
#include <stdio.h>
struct DataElement
{
  char *name;
  int value;
};
__global__
void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value;
  elem->name[0] = 'd';
  elem->value++;
}
void launch(DataElement *elem) {
  DataElement *d_elem;
  char *d_name;
  int namelen = strlen(elem->name) + 1;
  // Allocate memory on GPU
  cudaMalloc(&d_elem, sizeof(DataElement));
  cudaMalloc(&d_name, namelen);
  // Copy data from CPU to GPU
  cudaMemcpy(d_elem, elem, sizeof(DataElement),
     cudaMemcpyHostToDevice);
  cudaMemcpy(d_name, elem->name, namelen, cudaMemcpyHostToDevice);
  cudaMemcpy(&(d_elem->name), &d_name, sizeof(char*),
     cudaMemcpyHostToDevice);
  // Launch kernel
  Kernel<<< 1, 1 >>>(d_elem);
  // Copy data from GPU to CPU
  cudaMemcpy(&(elem->value), &(d_elem->value), sizeof(int),
     cudaMemcpyDeviceToHost);
  cudaMemcpy(elem->name, d_name, namelen, cudaMemcpyDeviceToHost);
  cudaFree(d_name);
  cudaFree(d_elem);
}
int main(void)
{
  DataElement *e;
  // Allocate memory on CPU
  e = (DataElement*)malloc(sizeof(DataElement));
  e->value = 10;
  // Allocate memory on CPU
  e->name = (char*)malloc(sizeof(char) * (strlen("hello") + 1));
  strcpy(e->name, "hello");
  launch(e);
  printf("On host: name=%s, value=%d\n", e->name, e->value);
  free(e->name);
  free(e);
  cudaDeviceReset();
}

Note these key points:

  • In main(), the struct and its name string are each allocated on the CPU with malloc()
  • In launch(), memory for both is allocated again on the GPU with cudaMalloc()
  • Three cudaMemcpy() calls copy the data from CPU to GPU before the kernel launch
  • The kernel runs on the GPU copy of the data
  • Two more cudaMemcpy() calls copy the results from GPU back to CPU

With Unified Memory 

#include <string.h>
#include <stdio.h>
struct DataElement
{
  char *name;
  int value;
};
__global__
void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value;
  elem->name[0] = 'd';
  elem->value++;
}
void launch(DataElement *elem) {
  // Launch kernel
  Kernel<<< 1, 1 >>>(elem);
  cudaDeviceSynchronize();
}
int main(void)
{
  DataElement *e;
  // Allocate unified memory on CPU and GPU
  cudaMallocManaged((void**)&e, sizeof(DataElement));
  e->value = 10;
  // Allocate unified memory on CPU and GPU
  cudaMallocManaged((void**)&(e->name), sizeof(char) *
     (strlen("hello") + 1) );
  strcpy(e->name, "hello");
  launch(e);
  printf("On host: name=%s, value=%d\n", e->name, e->value);
  cudaFree(e->name);
  cudaFree(e);
  cudaDeviceReset();
}
 

Note these key points:

  • In main(), each allocation is a single cudaMallocManaged() call, visible to both CPU and GPU
  • In launch(), the kernel is launched directly on the host pointer; no explicit copies are needed

With UM, memory is allocated on the CPU and GPU in a single address space and managed with a single pointer.  Note how the malloc()/cudaMalloc() pairs are condensed into single calls to cudaMallocManaged().  Furthermore, the explicit cudaMemcpy() data transfers between CPU and GPU are eliminated, as the CUDA runtime handles these transfers automatically in the background. Collectively these changes simplify code development, code maintenance, and data management.
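Stripped to its essentials, the pattern is: allocate with cudaMallocManaged(), launch the kernel on the same pointer, and call cudaDeviceSynchronize() before the CPU touches the data again (just as launch() does above).  Here is a minimal sketch of that pattern; the increment kernel is our own illustrative example, not part of Nvidia's sample:

#include <stdio.h>

__global__
void increment(int *x) {
  (*x)++;  // runs on the GPU, using the same pointer the host uses
}

int main(void)
{
  int *x;
  // One allocation, one pointer, visible to both host and device
  cudaMallocManaged(&x, sizeof(int));
  *x = 41;                   // write on the CPU
  increment<<< 1, 1 >>>(x);  // modify on the GPU, with no cudaMemcpy
  cudaDeviceSynchronize();   // wait before the CPU reads the result
  printf("x = %d\n", *x);    // prints 42
  cudaFree(x);
  cudaDeviceReset();
  return 0;
}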

As software project managers, we like UM for the productivity enhancements it provides for our software development teams.  It improves software quality, reduces coding time, effort, and cost, and enhances overall performance. As software engineers, we like UM because it lets us focus our time and effort on writing CUDA kernel code, where all the parallel performance comes from, instead of on memory management tasks.  Unified Memory is a major step forward in GPU programming.

Posted in CUDA, CUDA TUTORIALS, GPU (CUDA), PARALLEL

CUDA Random Numbers

Posted by Hemprasad Y. Badgujar on October 3, 2015


CUDA Random Example

In order to use cuRAND, we need to add two include files into our program:

#include <curand.h>
#include <curand_kernel.h>

cuRAND uses a curandState_t type to keep track of the state of the random sequence. The normal C rand function also has a state, but it is global, and hidden from the programmer. This makes rand not thread-safe, but easier to use.

A curandState_t object must be initialized with a call to curand_init which has the following parameters:

  • seed: The seed determines the beginning point of the sequence of random numbers.
  • sequence: The sequence number is another seed-like value. It is used so that, if all cores have the same seed, but different sequence numbers, then they will get different random values.
  • offset: The amount we skip ahead in the random sequence. This can be zero.
  • state: A pointer to the curandState_t object to initialize.

Once we have an initialized curandState_t object, we can get random numbers with the curand function which takes a pointer to a curandState_t object and returns to us a random unsigned integer.
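In a kernel run by many parallel threads, the usual pattern is to give every thread the same seed but a different sequence number, so that each thread draws from its own stream. A minimal sketch of that pattern (the helper name randomDraw is ours, for illustration):

/* illustrative helper, not part of cuRAND */
__device__ unsigned int randomDraw(unsigned int seed) {
  /* compute a unique id for this thread */
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  curandState_t state;
  /* same seed everywhere, but a distinct sequence per thread */
  curand_init(seed, id, 0, &state);
  /* one draw from this thread's private stream */
  return curand(&state);
}

The programs that follow start with the simpler single-thread form and introduce per-thread sequence numbers later.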

The following program uses these functions to generate random numbers:

#include <unistd.h>
#include <stdio.h>

/* we need these includes for CUDA's random number stuff */
#include <curand.h>
#include <curand_kernel.h>

#define MAX 100

/* this GPU kernel function calculates a random number and stores it in the parameter */
__global__ void random(int* result) {
  /* CUDA's random number library uses curandState_t to keep track of the seed value
     we will store a random state for every thread  */
  curandState_t state;

  /* we have to initialize the state */
  curand_init(0, /* the seed controls the sequence of random values that are produced */
              0, /* the sequence number is only important with multiple cores */
              0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
              &state);

  /* curand works like rand - except that it takes a state as a parameter */
  *result = curand(&state) % MAX;
}

int main( ) {
  /* allocate an int on the GPU */
  int* gpu_x;
  cudaMalloc((void**) &gpu_x, sizeof(int));

  /* invoke the kernel to compute a random number */
  random<<<1, 1>>>(gpu_x);

  /* copy the random number back */
  int x;
  cudaMemcpy(&x, gpu_x, sizeof(int), cudaMemcpyDeviceToHost);

  printf("Random number = %d.\n", x);

  /* free the memory we allocated */
  cudaFree(gpu_x);

  return 0;
}

When run, this program produces the exact same random number each time. This is because the seed passed in was 0. In order to get a different random number each time, we can pass in the current time as the seed.


#include <unistd.h>
#include <stdio.h>

/* we need these includes for CUDA's random number stuff */

#include <curand.h>
#include <curand_kernel.h>

/* we also need time() to seed the generator */
#include <time.h>

#define MAX 100

/* this GPU kernel function calculates a random number and stores it in the parameter */
__global__ void random(unsigned int seed, int* result) {
  /* CUDA's random number library uses curandState_t to keep track of the seed value
     we will store a random state for every thread  */
  curandState_t state;

  /* we have to initialize the state */
  curand_init(seed, /* the seed controls the sequence of random values that are produced */
              0, /* the sequence number is only important with multiple cores */
              0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
              &state);

  /* curand works like rand - except that it takes a state as a parameter */
  *result = curand(&state) % MAX;
}

int main( ) {
  /* allocate an int on the GPU */
  int* gpu_x;
  cudaMalloc((void**) &gpu_x, sizeof(int));

  /* invoke the kernel to compute a random number, seeded with the current time */
  random<<<1, 1>>>(time(NULL), gpu_x);

  /* copy the random number back */
  int x;
  cudaMemcpy(&x, gpu_x, sizeof(int), cudaMemcpyDeviceToHost);

  printf("Random number = %d.\n", x);

  /* free the memory we allocated */
  cudaFree(gpu_x);

  return 0;
}

Using Random Numbers Across Cores

If we want to get random numbers in multiple GPU cores, then we would need each core to have its own curandState_t.

If we want each run of the program to produce different sequences of random numbers, then we would need to set the seed to the current time.

However, now we would likely have each core get the same sequence of numbers. This is probably undesirable. To avoid it, we set the sequence parameter to the thread’s ID.

This way, each thread will have a different stream of random numbers, which will also be different each time the program is run.

The following program illustrates this by creating N curandState_t objects, then launching a GPU kernel to get N random numbers from them, in parallel.

#include <unistd.h>
#include <stdio.h>

/* we need these includes for CUDA's random number stuff */
#include <curand.h>
#include <curand_kernel.h>

/* we also need time() to seed the generator */
#include <time.h>

#define N 25

#define MAX 100

/* this GPU kernel function is used to initialize the random states */
__global__ void init(unsigned int seed, curandState_t* states) {

  /* we have to initialize the state */
  curand_init(seed, /* the seed can be the same for each core, here we pass the time in from the CPU */
              blockIdx.x, /* the sequence number should be different for each core (unless you want
                             all cores to get the same sequence of numbers for some reason) */
              0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
              &states[blockIdx.x]);
}

/* this GPU kernel takes an array of states, and an array of ints, and puts a random int into each */
__global__ void randoms(curandState_t* states, unsigned int* numbers) {
  /* curand works like rand - except that it takes a state as a parameter */
  numbers[blockIdx.x] = curand(&states[blockIdx.x]) % MAX;
}

int main( ) {
  /* CUDA's random number library uses curandState_t to keep track of the seed value
     we will store a random state for every thread  */
  curandState_t* states;

  /* allocate space on the GPU for the random states */
  cudaMalloc((void**) &states, N * sizeof(curandState_t));

  /* invoke the GPU to initialize all of the random states */
  init<<<N, 1>>>(time(0), states);

  /* allocate an array of unsigned ints on the CPU and GPU */
  unsigned int cpu_nums[N];
  unsigned int* gpu_nums;
  cudaMalloc((void**) &gpu_nums, N * sizeof(unsigned int));

  /* invoke the kernel to get some random numbers */
  randoms<<<N, 1>>>(states, gpu_nums);

  /* copy the random numbers back */
  cudaMemcpy(cpu_nums, gpu_nums, N * sizeof(unsigned int), cudaMemcpyDeviceToHost);

  /* print them out */
  for (int i = 0; i < N; i++) {
    printf("%u\n", cpu_nums[i]);
  }

  /* free the memory we allocated for the states and numbers */
  cudaFree(states);
  cudaFree(gpu_nums);

  return 0;
}

This program is also the first to use multiple GPU kernel functions.


Random Distributions

In addition to the curand function, which together with modular arithmetic can return random integers from any range we wish, cuRAND provides functions to get floating point numbers from different distributions:

__device__ float curand_uniform (curandState_t *state)

__device__ float curand_normal (curandState_t *state)

curand_uniform returns a random number between 0.0 and 1.0 following a uniform distribution. This means that all floating point numbers in that range are equally likely to be produced.

curand_normal, by contrast, returns a normally distributed float with a mean of 0.0 and a standard deviation of 1.0, so values near 0.0 are the most likely and values far from zero become increasingly rare. Normal distributions are important for modelling many natural phenomena accurately.
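Following the same pattern as the programs above, here is a minimal sketch that draws one sample from each distribution per block; the kernel name distributions and the array names are our own:

#include <stdio.h>
#include <time.h>

#include <curand.h>
#include <curand_kernel.h>

#define N 8

/* illustrative kernel: each block draws one uniform and one normal sample */
__global__ void distributions(unsigned int seed, float* uniform, float* normal) {
  curandState_t state;
  curand_init(seed, blockIdx.x, 0, &state);
  uniform[blockIdx.x] = curand_uniform(&state); /* uniform on (0.0, 1.0] */
  normal[blockIdx.x]  = curand_normal(&state);  /* mean 0.0, std. dev. 1.0 */
}

int main( ) {
  /* allocate arrays of floats on the CPU and GPU */
  float cpu_u[N], cpu_n[N];
  float *gpu_u, *gpu_n;
  cudaMalloc((void**) &gpu_u, N * sizeof(float));
  cudaMalloc((void**) &gpu_n, N * sizeof(float));

  /* draw the samples, one block per sample */
  distributions<<<N, 1>>>(time(NULL), gpu_u, gpu_n);

  /* copy the samples back and print them */
  cudaMemcpy(cpu_u, gpu_u, N * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(cpu_n, gpu_n, N * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < N; i++) {
    printf("uniform = %f, normal = %f\n", cpu_u[i], cpu_n[i]);
  }

  /* free the memory we allocated */
  cudaFree(gpu_u);
  cudaFree(gpu_n);
  return 0;
}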

Posted in CUDA TUTORIALS, GPU (CUDA), PARALLEL

CUDA with Visual Studio Step By Step

Posted by Hemprasad Y. Badgujar on February 15, 2015


CUDA with Visual Studio: the very first program

This post shares how to configure CUDA 5.0 (specifically 5.0.35) with Visual C++ Express 2010 on Windows 7. Some other issues are covered as well, including how to compile a CUDA program, how to measure the runtime of a function or part of the code, how to make Visual C++ and Visual Assist X aware of CUDA C++ code, and finally the answer to the question: "Is it possible to program (write code and compile only) on a non-CUDA machine?"

1. Installation

You need two programs: Visual C++ 2010 Express and CUDA 5 (32-bit or 64-bit, depending on your system). After downloading them, install Visual C++ first, then the CUDA library (choose all the options). There is nothing special about this step.

2. Write your first program

The files of a CUDA program fall into two types: normal C++ source files (*.cpp, *.h, etc.) and CUDA C++ files (*.cu and *.cuh). The CUDA source files must be compiled by NVCC (Nvidia's compiler), and the resulting binary code is then combined with the code from the normal C++ files, which are compiled by the Visual C++ compiler. The problem, then, is how to make this compilation run smoothly. Here are the steps for writing a CUDA program:

+ Open VS C++ 2010 Express.

+ File->New->Project->Empty Project, enter the name of the project, Exp1.

+ In the Solution Explorer tab, add a new source file to your project: choose the C++ File (.cpp) type and name the file main.cu.

[screenshot]

+ Write your code:

/**
 * A matrix multiplication using the cuBLAS library.
 */

#include <cstdlib>
#include <iostream>
#include <string>

#include <time.h>

#include <cublas.h>

typedef float ScalarT;

// Some helper functions //

/**
 * Calculates a 1D index from row-major order to column-major order.
 */
#define index(r,c,rows) (((c)*(rows))+(r))

#define CudaSafeCall( err ) __cudaSafeCall( err, __FILE__, __LINE__ )
inline void __cudaSafeCall( cublasStatus err, const char *file, const int line )
{
  if( err != CUBLAS_STATUS_SUCCESS )
  {
    std::cerr << "CUDA call failed at " << file << ":" << line << std::endl;
    exit( EXIT_FAILURE );
  }
}

#define AllocCheck( err ) __allocCheck( err, __FILE__, __LINE__ )
inline void __allocCheck( void* err, const char *file, const int line )
{
  if( err == 0 )
  {
    std::cerr << "Allocation failed at " << file << ":" << line << std::endl;
    exit( EXIT_FAILURE );
  }
}

void printMat( const ScalarT* const mat, size_t rows, size_t columns, std::string prefix = "Matrix:" )
{
  // Maximum to print
  const size_t max_rows = 5;
  const size_t max_columns = 16;

  std::cout << prefix << std::endl;
  for( size_t r = 0; r < rows && r < max_rows; ++r )
  {
    for( size_t c = 0; c < columns && c < max_columns; ++c )
    {
      std::cout << mat[index(r,c,rows)] << " ";
    }
    std::cout << std::endl;
  }
}

// Main program //
int main( int argc, char** argv )
{
  size_t HA = 4200;
  size_t WA = 23000;
  size_t WB = 1300;
  size_t HB = WA;
  size_t WC = WB;
  size_t HC = HA;

  size_t r, c;

  cudaEvent_t tAllStart, tAllEnd;
  cudaEvent_t tKernelStart, tKernelEnd;
  float time;

  // Prepare host memory and input data //
  ScalarT* A = ( ScalarT* )malloc( HA * WA * sizeof(ScalarT) );
  AllocCheck( A );
  ScalarT* B = ( ScalarT* )malloc( HB * WB * sizeof(ScalarT) );
  AllocCheck( B );
  ScalarT* C = ( ScalarT* )malloc( HC * WC * sizeof(ScalarT) );
  AllocCheck( C );

  for( r = 0; r < HA; r++ )
  {
    for( c = 0; c < WA; c++ )
    {
      A[index(r,c,HA)] = ( ScalarT )index(r,c,HA);
    }
  }

  for( r = 0; r < HB; r++ )
  {
    for( c = 0; c < WB; c++ )
    {
      B[index(r,c,HB)] = ( ScalarT )index(r,c,HB);
    }
  }

  // Initialize cuBLAS //
  cublasStatus status;
  cublasInit();

  // Prepare device memory //
  ScalarT* dev_A;
  ScalarT* dev_B;
  ScalarT* dev_C;

  status = cublasAlloc( HA * WA, sizeof(ScalarT), ( void** )&dev_A );
  CudaSafeCall( status );

  status = cublasAlloc( HB * WB, sizeof(ScalarT), ( void** )&dev_B );
  CudaSafeCall( status );

  status = cublasAlloc( HC * WC, sizeof(ScalarT), ( void** )&dev_C );
  CudaSafeCall( status );

  cudaEventCreate( &tAllStart );
  cudaEventCreate( &tAllEnd );
  cudaEventRecord( tAllStart, 0 );

  status = cublasSetMatrix( HA, WA, sizeof(ScalarT), A, HA, dev_A, HA );
  CudaSafeCall( status );

  status = cublasSetMatrix( HB, WB, sizeof(ScalarT), B, HB, dev_B, HB );
  CudaSafeCall( status );

  // Call cuBLAS function //
  cudaEventCreate( &tKernelStart );
  cudaEventCreate( &tKernelEnd );
  cudaEventRecord( tKernelStart, 0 );

  // Using the cuBLAS v2 constant CUBLAS_OP_N here produces a runtime error,
  // because the legacy cublasSgemm expects a char, so we shadow it with a local char.
  const char CUBLAS_OP_N = 'n'; // 'n' indicates that the matrices are non-transposed.
  cublasSgemm( CUBLAS_OP_N, CUBLAS_OP_N, HA, WB, WA, 1, dev_A, HA, dev_B, HB, 0, dev_C, HC ); // call for float
  // cublasDgemm( CUBLAS_OP_N, CUBLAS_OP_N, HA, WB, WA, 1, dev_A, HA, dev_B, HB, 0, dev_C, HC ); // call for double
  status = cublasGetError();
  CudaSafeCall( status );

  cudaEventRecord( tKernelEnd, 0 );
  cudaEventSynchronize( tKernelEnd );

  cudaEventElapsedTime( &time, tKernelStart, tKernelEnd );
  std::cout << "time (kernel only): " << time << "ms" << std::endl;

  // Load result from device //
  status = cublasGetMatrix( HC, WC, sizeof(ScalarT), dev_C, HC, C, HC );
  CudaSafeCall( status );

  cudaEventRecord( tAllEnd, 0 );
  cudaEventSynchronize( tAllEnd );

  cudaEventElapsedTime( &time, tAllStart, tAllEnd );
  std::cout << "time (incl. data transfer): " << time << "ms" << std::endl;

  // Print result //
  //printMat( A, HA, WA, "\nMatrix A:" );
  //printMat( B, HB, WB, "\nMatrix B:" );
  //printMat( C, HC, WC, "\nMatrix C:" );

  // Free CUDA memory //
  status = cublasFree( dev_A );
  CudaSafeCall( status );

  status = cublasFree( dev_B );
  CudaSafeCall( status );

  status = cublasFree( dev_C );
  CudaSafeCall( status );

  status = cublasShutdown();
  CudaSafeCall( status );

  // Free host memory //
  free( A );
  free( B );
  free( C );

  return EXIT_SUCCESS;
}

+ Configure the project as a CUDA project. In the Solution Explorer, right click on the name of the project and choose Build Customizations; in the dialog that appears, check the CUDA 5.0 option, then click OK.

[screenshots]

+ Right click on the CUDA code file (main.cu in this example) and choose Properties. In the dialog that appears, set the item type to CUDA C/C++, as in the image below:

[screenshot]

+ In the Property Manager tab (View->Property Manager), right click on Microsoft.Cpp.Win32.user, as in the image below, and choose Properties.

[screenshot]

+ Under VC++ Directories, add the paths to the CUDA include folders, reference folders, and library folders, as in the images below (do not close the dialog after this step):

[screenshots]

+ In the Linker tree, choose Input and add the library files needed for CUDA programs as in the image below:

[screenshot]

+ You will be asked to save the configuration (for all CUDA programs); choose Yes. The configuration steps (starting with the Property Manager operations above) are needed only once.
Now you can build your program (use the Release option).

3. Timing measurement

In earlier versions of CUDA (before 5.0) there were two ways to measure the runtime of a program, a function, or a part of the program. But in CUDA 5 (or, to the best of my knowledge, with CUDA 5), there is only one way: using cudaEvent_t.

+ Declaration:

cudaEvent_t tAllStart, tAllEnd;
float time;

+ Start recording time information:

cudaEventCreate(&tAllStart);
cudaEventCreate(&tAllEnd);
cudaEventRecord(tAllStart, 0);

+ Stop recording time information:

cudaEventRecord(tAllEnd, 0);
cudaEventSynchronize(tAllEnd);

+ Get the time and output:

cudaEventElapsedTime(&time, tAllStart, tAllEnd);
std::cout << "time (incl. data transfer): " << time << "ms" << std::endl;
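Putting the fragments together, here is a minimal self-contained sketch; the busyWork kernel is a placeholder of ours, standing in for whatever code you want to time:

#include <cstdio>

__global__ void busyWork(float* x)
{
  *x = *x * 2.0f + 1.0f; // stand-in for the work being timed
}

int main()
{
  float* dev_x;
  cudaMalloc((void**)&dev_x, sizeof(float));

  cudaEvent_t tStart, tEnd;
  float time = 0.0f;
  cudaEventCreate(&tStart);
  cudaEventCreate(&tEnd);

  cudaEventRecord(tStart, 0); // mark the start on stream 0
  busyWork<<<1, 1>>>(dev_x);  // the code being timed
  cudaEventRecord(tEnd, 0);   // mark the end
  cudaEventSynchronize(tEnd); // wait until the end event has completed

  cudaEventElapsedTime(&time, tStart, tEnd);
  std::printf("time: %f ms\n", time);

  cudaEventDestroy(tStart);
  cudaEventDestroy(tEnd);
  cudaFree(dev_x);
  return 0;
}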

4. How to make Visual C++ and Visual Assist X aware of the CUDA source files?

You can get this information from the links below:

+ Link 1

+ Link 2

5. Is it possible to program (write code and compile only) on a non-CUDA machine?

This question is related to my circumstances: I have a CUDA desktop machine at the lab, which can be remotely controlled from my house, so I would like to write and compile the program on my laptop and then copy the program file to the desktop machine to run it. Fortunately, the answer is YES. We can write and compile a CUDA program on a non-CUDA machine. Install the Visual tools first, then the CUDA Toolkit, but do not select the CUDA driver option, since your machine does not have a CUDA device. Follow the same steps on the laptop to get things done.

Posted in CUDA TUTORIALS, GPU (CUDA), PARALLEL

 