
CUDA Random Numbers

Posted by Hemprasad Y. Badgujar on October 3, 2015


CUDA Random Example

In order to use cuRAND, we need to add two include files into our program:

#include <curand.h>
#include <curand_kernel.h>

cuRAND uses a curandState_t type to keep track of the state of the random sequence. The normal C rand function also has a state, but it is global, and hidden from the programmer. This makes rand not thread-safe, but easier to use.

A curandState_t object must be initialized with a call to curand_init which has the following parameters:

  • seed: The seed determines the beginning point of the sequence of random numbers.
  • sequence: The sequence number is another seed-like value. It is used so that, if all cores have the same seed, but different sequence numbers, then they will get different random values.
  • offset: The amount we skip ahead in the random sequence. This can be zero.
  • state: A pointer to the curandState_t object to initialize.

Once we have an initialized curandState_t object, we can get random numbers with the curand function which takes a pointer to a curandState_t object and returns to us a random unsigned integer.

The following program uses these functions to generate random numbers:

#include <unistd.h>
#include <stdio.h>

/* we need these includes for CUDA's random number stuff */
#include <curand.h>
#include <curand_kernel.h>

#define MAX 100

/* this GPU kernel function calculates a random number and stores it in the parameter */
__global__ void random(int* result) {
  /* CUDA's random number library uses curandState_t to keep track of the seed value
     we will store a random state for every thread  */
  curandState_t state;

  /* we have to initialize the state */
  curand_init(0, /* the seed controls the sequence of random values that are produced */
              0, /* the sequence number is only important with multiple cores */
              0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
              &state);

  /* curand works like rand - except that it takes a state as a parameter */
  *result = curand(&state) % MAX;
}

int main( ) {
  /* allocate an int on the GPU */
  int* gpu_x;
  cudaMalloc((void**) &gpu_x, sizeof(int));

  /* launch the kernel to compute a random number on the GPU */
  random<<<1, 1>>>(gpu_x);

  /* copy the random number back */
  int x;
  cudaMemcpy(&x, gpu_x, sizeof(int), cudaMemcpyDeviceToHost);

  printf("Random number = %d.\n", x);

  /* free the memory we allocated */
  cudaFree(gpu_x);

  return 0;
}

When run, this program produces the exact same random number each time. This is because the seed passed in was 0. In order to get a different random number each time, we can pass in the current time as the seed.


#include <unistd.h>
#include <stdio.h>

/* we need these includes for CUDA's random number stuff */

#include <curand.h>
#include <curand_kernel.h>
#include <time.h>

#define MAX 100

/* this GPU kernel function calculates a random number and stores it in the parameter */
__global__ void random(unsigned int seed, int* result) {
  /* CUDA's random number library uses curandState_t to keep track of the seed value
     we will store a random state for every thread  */
  curandState_t state;

  /* we have to initialize the state */
  curand_init(seed, /* the seed controls the sequence of random values that are produced */
              0, /* the sequence number is only important with multiple cores */
              0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
              &state);

  /* curand works like rand - except that it takes a state as a parameter */
  *result = curand(&state) % MAX;
}

int main( ) {
  /* allocate an int on the GPU */
  int* gpu_x;
  cudaMalloc((void**) &gpu_x, sizeof(int));

  /* launch the kernel to compute a random number on the GPU */
  random<<<1, 1>>>(time(NULL), gpu_x);

  /* copy the random number back */
  int x;
  cudaMemcpy(&x, gpu_x, sizeof(int), cudaMemcpyDeviceToHost);

  printf("Random number = %d.\n", x);

  /* free the memory we allocated */
  cudaFree(gpu_x);

  return 0;
}

Using Random Numbers Across Cores

If we want to get random numbers in multiple GPU cores, then we would need each core to have its own curandState_t.

If we want each run of the program to produce different sequences of random numbers, then we would need to set the seed to the current time.

However, now we would likely have each core get the same sequence of numbers. This is probably undesirable. To avoid it, we set the sequence parameter to the thread’s ID.

This way, each thread will have a different stream of random numbers, which will also be different each time the program is run.

The following program illustrates this by creating N curandState_t objects, then launching a GPU kernel to get N random numbers from them, in parallel.

#include <unistd.h>
#include <stdio.h>

/* we need these includes for CUDA's random number stuff */
#include <curand.h>
#include <curand_kernel.h>
#include <time.h>

#define N 25

#define MAX 100

/* this GPU kernel function is used to initialize the random states */
__global__ void init(unsigned int seed, curandState_t* states) {

  /* we have to initialize the state */
  curand_init(seed, /* the seed can be the same for each core, here we pass the time in from the CPU */
              blockIdx.x, /* the sequence number should be different for each core (unless you want all
                             cores to get the same sequence of numbers for some reason - use thread id! */
              0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
              &states[blockIdx.x]);
}

/* this GPU kernel takes an array of states, and an array of ints, and puts a random int into each */
__global__ void randoms(curandState_t* states, unsigned int* numbers) {
  /* curand works like rand - except that it takes a state as a parameter */
  numbers[blockIdx.x] = curand(&states[blockIdx.x]) % MAX;
}

int main( ) {
  /* CUDA's random number library uses curandState_t to keep track of the seed value
     we will store a random state for every thread  */
  curandState_t* states;

  /* allocate space on the GPU for the random states */
  cudaMalloc((void**) &states, N * sizeof(curandState_t));

  /* invoke the GPU to initialize all of the random states */
  init<<<N, 1>>>(time(0), states);

  /* allocate an array of unsigned ints on the CPU and GPU */
  unsigned int cpu_nums[N];
  unsigned int* gpu_nums;
  cudaMalloc((void**) &gpu_nums, N * sizeof(unsigned int));

  /* invoke the kernel to get some random numbers */
  randoms<<<N, 1>>>(states, gpu_nums);

  /* copy the random numbers back */
  cudaMemcpy(cpu_nums, gpu_nums, N * sizeof(unsigned int), cudaMemcpyDeviceToHost);

  /* print them out */
  for (int i = 0; i < N; i++) {
    printf("%u\n", cpu_nums[i]);
  }

  /* free the memory we allocated for the states and numbers */
  cudaFree(states);
  cudaFree(gpu_nums);

  return 0;
}

This program is also the first to use multiple GPU kernel functions.


Random Distributions

In addition to the curand function, which together with modular arithmetic can produce random integers in any range we wish, cuRAND provides functions that return floating-point numbers drawn from different distributions:

__device__ float curand_uniform (curandState_t *state)

__device__ float curand_normal (curandState_t *state)

curand_uniform returns a random number between 0.0 and 1.0 following a uniform distribution. This means that all floating point numbers in that range are equally likely to be produced.

curand_normal returns a random number drawn from a standard normal distribution (mean 0.0, standard deviation 1.0), so values near the mean are the most likely and values far from it become increasingly rare. Normal distributions are important for modelling many natural phenomena accurately.
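
The short program below is a minimal sketch of my own (the kernel name, array names and launch configuration are not from the original post) showing both functions in use; each thread draws one uniform and one normal sample:

#include <stdio.h>
#include <time.h>

#include <curand.h>
#include <curand_kernel.h>

#define N 8

/* each thread initializes its own state, then draws one sample from each distribution */
__global__ void distributions(unsigned int seed, float* uniform, float* normal) {
  curandState_t state;
  curand_init(seed, blockIdx.x, 0, &state);

  uniform[blockIdx.x] = curand_uniform(&state); /* uniformly distributed in (0, 1] */
  normal[blockIdx.x]  = curand_normal(&state);  /* mean 0.0, standard deviation 1.0 */
}

int main( ) {
  float* gpu_uniform;
  float* gpu_normal;
  cudaMalloc((void**) &gpu_uniform, N * sizeof(float));
  cudaMalloc((void**) &gpu_normal, N * sizeof(float));

  distributions<<<N, 1>>>(time(NULL), gpu_uniform, gpu_normal);

  float u[N], n[N];
  cudaMemcpy(u, gpu_uniform, N * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(n, gpu_normal, N * sizeof(float), cudaMemcpyDeviceToHost);

  for (int i = 0; i < N; i++) {
    printf("uniform = %f, normal = %f\n", u[i], n[i]);
  }

  cudaFree(gpu_uniform);
  cudaFree(gpu_normal);

  return 0;
}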


OpenCV CUDA Sample Program

Posted by Hemprasad Y. Badgujar on July 17, 2015


Design considerations

The OpenCV GPU module is written using CUDA, so it benefits from the CUDA ecosystem: a large community, conferences, publications, and many tools and libraries such as NVIDIA NPP, CUFFT, and Thrust.

The GPU module is designed as a host API extension. This design gives the user explicit control over how data is moved between CPU and GPU memory. Although the user has to write some additional code to start using the GPU, this approach is both flexible and allows more efficient computation.

The GPU module includes the class cv::gpu::GpuMat, the primary container for data kept in GPU memory. Its interface is very similar to that of cv::Mat, its CPU counterpart. All GPU functions take GpuMat objects as input and output arguments, which makes it possible to chain several GPU algorithms without downloading data back to the host. The GPU module API is also kept as close as possible to the CPU interface, so developers familiar with OpenCV on the CPU can start using the GPU straight away.
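
As a small illustration of chaining (my own sketch, not from the original article; it assumes the OpenCV 2.x gpu module used below and an image already uploaded to a GpuMat called src), two GPU operations can be applied back to back while the data stays in GPU memory:

// src is a cv::gpu::GpuMat that already holds a grayscale image on the device
cv::gpu::GpuMat blurred, binary;
cv::gpu::GaussianBlur(src, blurred, cv::Size(5, 5), 1.5);             // result stays on the GPU
cv::gpu::threshold(blurred, binary, 128.0, 255.0, CV_THRESH_BINARY);  // still no transfer
cv::Mat result;
binary.download(result);                                              // single download at the end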

Short sample

In the sample below an image is loaded from a PNG file, uploaded to the GPU, thresholded, downloaded, and displayed.

#include <iostream>
#include "opencv2/opencv.hpp"
#include "opencv2/gpu/gpu.hpp"

int main (int argc, char* argv[])
{
    try
    {
        cv::Mat src_host = cv::imread("file.png", CV_LOAD_IMAGE_GRAYSCALE);
        cv::gpu::GpuMat dst, src;
        src.upload(src_host);

        cv::gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);

        cv::Mat result_host;
        dst.download(result_host);
        cv::imshow("Result", result_host);
        cv::waitKey();
    }
    catch(const cv::Exception& ex)
    {
        std::cout << "Error: " << ex.what() << std::endl;
    }
    return 0;
}


CUDA with Visual Studio Step By Step

Posted by Hemprasad Y. Badgujar on February 15, 2015


CUDA with Visual Studio: the very first program

This post describes how to configure CUDA 5.0 (exactly 5.0.35) with Visual C++ Express 2010 on Windows 7. It also covers a few related issues: how to compile a CUDA program, how to measure the runtime of a function or part of the code, how to make Visual C++ and Visual Assist X aware of CUDA C++ code, and finally the answer to the question: is it possible to program (write code and compile only) on a non-CUDA machine?

1. Installation

You need two programs: Visual C++ 2010 Express and CUDA 5 (32-bit or 64-bit, depending on your system). After downloading them, install Visual C++ first, then the CUDA toolkit (choose all the options). There is nothing special about this step.

2. Write your first program

The files of a CUDA program fall into two types: normal C++ source files (*.cpp, *.h, etc.) and CUDA C++ files (*.cu and *.cuh). The CUDA source files must be compiled by NVCC (NVIDIA's compiler), and the resulting binary code is combined with the code from the normal C++ files, which are compiled by the Visual C++ compiler. The problem, then, is how to make this compilation run smoothly. Here are the steps for writing a CUDA program:

+ Open VS C++ 2010 Express.

+ File->New->Project->Empty Project, enter the name of the project, Exp1.

+ In the Solution Explorer tab, add a new source file to your project: choose the C++ File (.cpp) type and name the file main.cu.

[screenshot]

+ Write your code:

/**
* A matrix multiplication using the cuBLAS library.
*/


#include <cstdlib>
#include <iostream>
#include <string>

#include <time.h>

#include <cublas.h>

typedef float ScalarT;

// Some helper functions //
/**
* Calculates 1D index from row-major order to column-major order.
*/
#define index(r,c,rows) (((c)*(rows))+(r))

#define CudaSafeCall( err ) __cudaSafeCall( err, __FILE__, __LINE__ )
inline void __cudaSafeCall( cublasStatus err, const char *file, const int line )
{
if( err != CUBLAS_STATUS_SUCCESS )
{
std::cerr << "CUDA call failed at " << file << ":" << line << std::endl;
exit (EXIT_FAILURE);
}
}

#define AllocCheck( err ) __allocCheck( err, __FILE__, __LINE__ )
inline void __allocCheck( void* err, const char *file, const int line )
{
if( err == 0 )
{
std::cerr << "Allocation failed at " << file << ":" << line << std::endl;
exit (EXIT_FAILURE);
}
}

void printMat( const ScalarT* const mat, size_t rows, size_t columns, std::string prefix = "Matrix:" )
{
// Maximum to print
const size_t max_rows = 5;
const size_t max_columns = 16;

std::cout << prefix << std::endl;
for( size_t r = 0; r < rows && r < max_rows; ++r )
{
for( size_t c = 0; c < columns && c < max_columns; ++c )
{
std::cout << mat[index(r,c,rows)] << " ";
}
std::cout << std::endl;
}
}
// Main program //
int main( int argc, char** argv )
{
size_t HA = 4200;
size_t WA = 23000;
size_t WB = 1300;
size_t HB = WA;
size_t WC = WB;
size_t HC = HA;

size_t r, c;

cudaEvent_t tAllStart, tAllEnd;
cudaEvent_t tKernelStart, tKernelEnd;
float time;

// Prepare host memory and input data //
ScalarT* A = ( ScalarT* )malloc( HA * WA * sizeof(ScalarT) );
AllocCheck( A );
ScalarT* B = ( ScalarT* )malloc( HB * WB * sizeof(ScalarT) );
AllocCheck( B );
ScalarT* C = ( ScalarT* )malloc( HC * WC * sizeof(ScalarT) );
AllocCheck( C );

for( r = 0; r < HA; r++ )
{
for( c = 0; c < WA; c++ )
{
A[index(r,c,HA)] = ( ScalarT )index(r,c,HA);
}
}

for( r = 0; r < HB; r++ )
{
for( c = 0; c < WB; c++ )
{
B[index(r,c,HB)] = ( ScalarT )index(r,c,HB);
}
}

// Initialize cuBLAS //

cublasStatus status;
cublasInit();

// Prepare device memory //
ScalarT* dev_A;
ScalarT* dev_B;
ScalarT* dev_C;

status = cublasAlloc( HA * WA, sizeof(ScalarT), ( void** )&dev_A );
CudaSafeCall( status );

status = cublasAlloc( HB * WB, sizeof(ScalarT), ( void** )&dev_B );
CudaSafeCall( status );

status = cublasAlloc( HC * WC, sizeof(ScalarT), ( void** )&dev_C );
CudaSafeCall( status );

cudaEventCreate(&tAllStart);
cudaEventCreate(&tAllEnd);
cudaEventRecord(tAllStart, 0);

status = cublasSetMatrix( HA, WA, sizeof(ScalarT), A, HA, dev_A, HA );
CudaSafeCall( status );

status = cublasSetMatrix( HB, WB, sizeof(ScalarT), B, HB, dev_B, HB );
CudaSafeCall( status );

// Call cuBLAS function //
cudaEventCreate(&tKernelStart);
cudaEventCreate(&tKernelEnd);
cudaEventRecord(tKernelStart, 0);

// Use of cuBLAS constant CUBLAS_OP_N produces a runtime error!
const char CUBLAS_OP_N = 'n'; // 'n' indicates that the matrices are non-transposed.
cublasSgemm( CUBLAS_OP_N, CUBLAS_OP_N, HA, WB, WA, 1, dev_A, HA, dev_B, HB, 0, dev_C, HC ); // call for float
// cublasDgemm( CUBLAS_OP_N, CUBLAS_OP_N, HA, WB, WA, 1, dev_A, HA, dev_B, HB, 0, dev_C, HC ); // call for double
status = cublasGetError();
CudaSafeCall( status );

cudaEventRecord(tKernelEnd, 0);
cudaEventSynchronize(tKernelEnd);

cudaEventElapsedTime(&time, tKernelStart, tKernelEnd);
std::cout << "time (kernel only): " << time << "ms" << std::endl;

// Load result from device //
status = cublasGetMatrix( HC, WC, sizeof(ScalarT), dev_C, HC, C, HC );
CudaSafeCall( status );

cudaEventRecord(tAllEnd, 0);
cudaEventSynchronize(tAllEnd);

cudaEventElapsedTime(&time, tAllStart, tAllEnd);

std::cout << "time (incl. data transfer): " << time << "ms" << std::endl;

// Print result //
//printMat( A, HA, WA, "\nMatrix A:" );
//printMat( B, HB, WB, "\nMatrix B:" );
//printMat( C, HC, WC, "\nMatrix C:" );

// Free CUDA memory //
status = cublasFree( dev_A );
CudaSafeCall( status );

status = cublasFree( dev_B );
CudaSafeCall( status );

status = cublasFree( dev_C );
CudaSafeCall( status );

status = cublasShutdown();
CudaSafeCall( status );

// Free host memory //
free( A );
free( B );
free( C );

return EXIT_SUCCESS;
}

+ Configure the project as a CUDA project. In the Solution Explorer, right-click on the name of the project and choose Build Customizations; in the dialog that appears, check the CUDA 5.0 option, then click OK.

[screenshots]

+ Right-click on the CUDA code file (main.cu in this example) and choose Properties. In the dialog that appears, set the item type to CUDA C/C++ as in the image below:

[screenshot]

+ In the Property Manager tab (View->Property Manager), right-click on Microsoft.Cpp.Win32.user as in the image below and choose Properties.

[screenshot]

+ In VC++ Directories, add the paths to the CUDA include files, reference folders, and library files, as in the images below (do not close the dialog after this step):

[screenshots]

+ In the Linker tree, choose Input and add the library files needed for CUDA programs as in the image below:

[screenshot]

+ You will be asked to save the configuration (for all CUDA programs); choose Yes. The configuration steps (starting with the operations in the Property Manager above) are needed only once.
Now you can build your program (use the Release configuration).

3. Timing measurement

In earlier versions of CUDA (before 5.0) there were two ways to measure the time of a program, a function, or a part of the program. But in CUDA 5 (to the best of my knowledge), there is only one way: using cudaEvent_t.

+ Declaration:

cudaEvent_t tAllStart, tAllEnd;
float time;

+ Start recording time information:

cudaEventCreate(&tAllStart);
cudaEventCreate(&tAllEnd);
cudaEventRecord(tAllStart, 0);

+ Stop recording time information:

cudaEventRecord(tAllEnd, 0);
cudaEventSynchronize(tAllEnd);

+ Get the time and output:

cudaEventElapsedTime(&time, tAllStart, tAllEnd);
std::cout << “time (incl. data transfer): ” << time << “ms” << std::endl;
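
Putting the pieces together, here is a minimal sketch (the kernel name, launch configuration and arguments are placeholders of my own) that times a single kernel launch; note that the events should also be destroyed once they are no longer needed:

cudaEvent_t start, stop;
float elapsedMs = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
someKernel<<<blocks, threads>>>(args);   // placeholder kernel launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&elapsedMs, start, stop);
std::cout << "kernel time: " << elapsedMs << " ms" << std::endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);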

4. How to make Visual C++ and Visual Assist X aware of the CUDA source files?

You can get this information from the links below:

+ Link 1

+ Link 2

5. Is it possible to program (write code and compile only) on a non CUDA machine?

This question is relevant to my situation because I have a CUDA desktop machine at the lab, which can be remotely controlled from my house, so I would like to write and compile the program on my laptop and then copy the program file to the desktop machine to run it. Fortunately, the answer is YES: we can write and compile a CUDA program on a non-CUDA machine. Install Visual Studio first, then the CUDA toolkit, but do not select the CUDA driver option, since your machine does not have a CUDA device. The same configuration steps described above then apply on the laptop.



How to get started with wxWidgets on Windows

Posted by Hemprasad Y. Badgujar on February 3, 2015


How to get started with wxWidgets on Windows

wxWidgets is a cross-platform GUI library, that is also available for Windows. You can get started with using wxWidgets in a few steps:

  1. Download and install the Windows installer for the current stable release of wxWidgets from its download page. It installs the source and build files under C:\. For example, in C:\wxWidgets-3.0.2\
  2. wxWidgets needs to be built before it can be used with your application. Go to C:\wxWidgets-3.0.2\build\msw and open the .sln file that matches the Visual Studio version you intend to use for your application. For example, I open wx_vc10.sln using Visual Studio 2012.
  3. Choose one of the build types: Debug, Release, DLL Debug or DLL Release and build the solution. The resulting .lib files are placed in C:\wxWidgets-3.0.2\lib\vc_lib
  4. Create a new Visual Studio solution for your C++ application. Remember that it has to be a Win32 Project, not a Win32 Console Project. The difference is that the main function is defined inside wxWidgets and does not need to be defined in your application code.
  5. Add a .cpp file to your solution and copy the Hello World code into it (a minimal version is sketched after this list).
  6. Add C:\wxWidgets-3.0.2\include and C:\wxWidgets-3.0.2\include\msvc as additional include directories to the solution.
  7. Add C:\wxWidgets-3.0.2\lib\vc_lib as additional library directory to the solution.
  8. Build the solution and run it to see an empty wxWidgets window.
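
For reference, here is a minimal sketch of such a Hello World (my own bare-bones version, not the official wxWidgets sample; it simply opens an empty frame):

#include <wx/wx.h>

// Minimal application: creates one empty top-level frame and shows it.
class HelloApp : public wxApp
{
public:
    virtual bool OnInit()
    {
        wxFrame* frame = new wxFrame(NULL, wxID_ANY, "Hello, wxWidgets");
        frame->Show(true);
        return true;
    }
};

// Generates the WinMain/main entry point for us, which is why the project
// must be a Win32 Project rather than a Win32 Console Project.
wxIMPLEMENT_APP(HelloApp);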


Professional ways of tracking GPU memory leakage

Posted by Hemprasad Y. Badgujar on January 25, 2015


Depending on what I am doing and what I need to track, trace, and profile, I utilise all four of the packages above. They also have the added benefit of being: a) free; b) well maintained; c) free; d) regularly updated; e) free.

In case you hadn’t guessed I like the free part:)

Regarding object management, I would recommend an old C++ coding principle: as soon as you create an object, add the line that deletes it; every new should always (eventually) have a delete. That way you know that you are destroying the objects you create. However, it will not save you from orphaned memory blocks, where you change where pointers are pointing, for example:

myclass* firstInstance = new myclass();
myclass* secondInstance = new myclass();
firstInstance = secondInstance;   // the object originally pointed to by firstInstance is now orphaned
delete firstInstance;             // actually deletes the object secondInstance points to
delete secondInstance;            // double delete of the same object: undefined behaviour

You will now have created a small memory leak: the data for the original firstInstance is no longer pointed to by any pointer (and, on top of that, the two delete calls act on the same object). This is very hard to detect when it happens in a large code base, and it is more common than it should be.

Generally, these are the pairings you need to be aware of to ensure you properly dispose of all your objects:

new -> delete
new[] -> delete[]
malloc() -> free() // or you can use realloc(0) instead of free()
calloc() -> free() // or you can use realloc(0) instead of free()
realloc(nonzero) -> free() // or you can use realloc(0) instead of free()

If you are coming from a language with garbage collection to C++ it can take a while to get used to, but it quickly becomes habit:)
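
Since this post is tagged CUDA, it is worth noting that the same pairing discipline applies to cudaMalloc/cudaFree. Below is a minimal RAII-style sketch of my own (assuming C++ and the CUDA runtime API) that makes the pairing automatic, so a device allocation can never be orphaned:

#include <cuda_runtime.h>

// Owns one device allocation; cudaFree is called automatically in the destructor,
// so the cudaMalloc/cudaFree pairing cannot be forgotten.
class DeviceBuffer
{
public:
    explicit DeviceBuffer(size_t bytes) : ptr_(0)
    {
        cudaMalloc(&ptr_, bytes);
    }
    ~DeviceBuffer()
    {
        cudaFree(ptr_);
    }
    void* get() const { return ptr_; }

private:
    // non-copyable: copying would lead to a double cudaFree
    DeviceBuffer(const DeviceBuffer&);
    DeviceBuffer& operator=(const DeviceBuffer&);

    void* ptr_;
};

int main()
{
    DeviceBuffer buf(1024 * sizeof(float));          // freed automatically at end of scope
    cudaMemset(buf.get(), 0, 1024 * sizeof(float));
    return 0;
}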


How to Build a GPU-Accelerated Cluster

Posted by Hemprasad Y. Badgujar on December 22, 2014


Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (“nodes”) connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high performance computing (HPC), NVIDIA GPUs are becoming part of some of the world's most powerful supercomputers and clusters. The most recent top 500 list of the world's fastest supercomputers included nearly 50 supercomputers powered by NVIDIA GPUs, and the current world's fastest supercomputer, Oak Ridge National Laboratory's TITAN, utilizes more than 18,000 NVIDIA Kepler GPUs.

In this post I will take you step by step through the process of designing, deploying, and managing a small research prototype GPU cluster for HPC. I will describe all the components needed for a GPU cluster as well as the complete cluster management software stack. The goal is to build a research prototype GPU cluster using all open source and free software and with minimal hardware cost.

I gave a talk on this topic at GTC 2013 (session S3516 – Building Your Own GPU Research Cluster Using Open Source Software Stack). The slides and a recording are available at that link so please check it out!

There are multiple motivating reasons for building a GPU-based research cluster.

  • Get a feel for production systems and performance estimates;
  • Port your applications to GPUs and distributed computing (using CUDA-aware MPI);
  • Tune GPU and CPU load balancing for your application;
  • Use the cluster as development platform;
  • Early experience means increased readiness;
  • The investment is relatively small for a research prototype cluster

Figure 1 shows the steps to build a small GPU cluster. Let’s look at the process in more detail.

Figure 1: Seven steps to build and test a small research GPU cluster.

1. Choose Your Hardware

There are two steps to choosing the correct hardware.

  1. Node Hardware Details. This is the specification of the machine (node) for your cluster. Each node has the following components.
    • CPU processor from any vendor;
    • A motherboard with the following PCI-express connections:
      • 2x PCIe x16 Gen2/3 connections for Tesla GPUs;
      • 1x PCIe x8 wide for HCI Infiniband card;
    • 2 available network ports;
    • A minimum of 16-24 GB DDR3 RAM. (It is good to have more RAM in the system).
    • A power-supply unit (SMPS) with ample power rating. The total power supply needed includes power taken by the CPU, GPUs and other components in the system.
    • Secondary storage (HDD / SSD) based on your needs.

    GPU boards are wide enough to cover two physically adjacent PCI-e slots, so make sure that the PCIe x16 and x8 slots are physically separated on the motherboard so that you can fit a minimum of 2 PCI-e x16 GPUs and 1 PCIe x8 network card.

  2. Choose the right form factor for GPUs. Once you decide your machine specs you should also decide which model GPUs you would like to consider for your system. The form factor of GPUs is an important consideration. Kepler-based NVIDIA Tesla GPUs are available in two main form factors.
    • Tesla workstation products (C Series) are actively cooled GPU boards (this means they have a fan cooler over the GPU chip) that you can just plug in to your desktop computer in a PCI-e x16 slot. These use either two 6-pin or one 8-pin power supply connector.
    • Server products (M Series) are passively cooled GPUs (no fans) installed in standard servers sold by various OEMs.

    There are three different options for adding GPUs to your cluster:

    • you can buy C-series GPUs and install them in existing workstations or servers with enough space;
    • you can buy workstations from a vendor with C-series GPUs installed; or
    • you can buy servers with M-series GPUs installed.

2. Allocate Space, Power and Cooling

The goal for this step is to assess your physical infrastructure, including space, power and cooling needs, network considerations and storage requirements to ensure optimal system choices with room to grow your cluster in the future. You should make sure that you have enough space, power and cooling for your cluster. Clusters are mainly rack mounted, with multiple machines installed in a vertical rack. Vendors offer many server solutions that minimize the use of rack space.

3. Assembly and Physical Deployment

After deciding the machine configuration and real estate the next step is to physically deploy your cluster. Figure 2 shows the cluster deployment connections. The head node is the external interface to the cluster; it receives all external network connections, processes incoming requests, and assigns work to compute nodes (nodes with GPUs that perform the computation).

In a research prototype cluster you can also use one of the compute nodes as the head node, but routing all traffic through the head node while also making it a compute node is not a good idea for production clusters because of performance and security issues. Production and large clusters mostly have a dedicated node to handle all incoming traffic, while the head node just manages the work distribution for the compute nodes.

Figure 2: Head node and compute node connections.

4. Head Node Installation

I recommend installing the head node with the open source Rocks Linux distribution. Rocks is a customizable, easy and quick way to install nodes. The Rocks installation package includes essential components for clusters, such as MPI. ROCKS head node installation is well-documented in the Rocks user guide, but here is a summary of the steps.

  • Follow the steps in Chapter 3 of the Rocks user guide and do a CD-based installation.
  • Install the NVIDIA drivers and CUDA Toolkit on the head node. (CUDA 5 provides a unified package that contain NVIDIA driver, toolkit and CUDA Samples.) 
  • Install network interconnect drivers (e.g. Infiniband) on the head node. These drivers are available from your interconnect manufacturer.
  • Nagios® Core™ is an open source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go wrong and when they get better. To install, follow the instructions given in the Nagios installation guide.
  • The NRPE Nagios add-on allows you to execute Nagios plugins on remote Linux machines. This allows you to monitor local resources like CPU load and memory usage, which are not usually exposed to external machines, on remote machines using Nagios. Install NRPE following the install guide.

5. Compute Node Installation

After you have completed the head node installation, you will install the compute node software with the help of Rocks and the following steps.

  • On the head node: in a terminal shell run the command:
    > insert-ethers

    Choose “Compute Nodes” as the new node to add.

  • Power on the compute node with the Rocks CD as the first boot device or do a network installation.
  • The compute node will connect to the head node and start the installation.
  • Install the NRPE package as described in the NRPE guide.

6. Management and Monitoring

Once you finish the head node and all compute node installations, your cluster is ready to use! Before you actually start using it to run applications of interest, you should also set up management and monitoring tools on the cluster. These tools are necessary for proper management and monitoring of all resources available in cluster. In this section, I will describe various tools and software packages for GPU management and monitoring.

GPU SYSTEM MANAGEMENT

The NVIDIA System Management Interface (NVIDIA-SMI) is a tool distributed as part of the NVIDIA GPU driver. NVIDIA-SMI provides a variety of GPU system information including

  • thermal monitoring metrics: GPU temperature, chassis inlet/outlet temperatures;
  • system Information: firmware revision, configuration information;
  • system state: fan states, GPU faults, power system fault; ECC errors, etc.

NVIDIA-SMI allows you to configure the compute mode for any device in the system (Reference: CUDA C Programming Guide)

  • Default compute mode: multiple host threads can use the device at the same time.
  • Exclusive-process compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may be current to as many threads as desired within the process that created the context.
  • Exclusive-process-and-thread compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may only be current to one thread at a time.
  • Prohibited compute mode: No CUDA context can be created on the device.

NVIDIA-SMI also allows you to turn ECC (Error Correcting Code memory) mode on and off. The default is ON, but applications that do not need ECC can get higher memory bandwidth by disabling it.

GPU MONITORING WITH THE TESLA DEPLOYMENT KIT

The Tesla Deployment Kit is a collection of tools provided to better manage NVIDIA Tesla™ GPUs. These tools support Linux (32-bit and 64-bit), Windows 7 (64-bit), and Windows Server 2008 R2 (64-bit). The current distribution contains NVIDIA-healthmon and the NVML API.

NVML API

The NVML API is a C-based API which provides programmatic state monitoring and management of NVIDIA GPU devices. The NVML dynamic run-time library ships with the NVIDIA display driver, and the NVML SDK provides headers, stub libraries and sample applications. NVML can be used from Python or Perl (bindings are available) as well as C/C++ or Fortran.
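
As a hedged illustration of the NVML API (a minimal sketch of my own; error handling is abbreviated, device index 0 is assumed, and the program must be linked against the NVML library, e.g. -lnvidia-ml on Linux):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(rc));
        return 1;
    }

    nvmlDevice_t dev;
    char name[NVML_DEVICE_NAME_BUFFER_SIZE];
    unsigned int temp = 0;

    nvmlDeviceGetHandleByIndex(0, &dev);                        /* first GPU in the system */
    nvmlDeviceGetName(dev, name, NVML_DEVICE_NAME_BUFFER_SIZE);
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);

    printf("GPU 0: %s, temperature %u C\n", name, temp);

    nvmlShutdown();
    return 0;
}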

Ganglia is an open-source scalable distributed monitoring system used for clusters and grids with very low per-node overhead and high concurrency. An NVML-based Python module for Ganglia's gmond daemon makes it possible to monitor NVIDIA GPUs from the Ganglia interface.

NVIDIA-HEALTHMON 

This utility provides quick health checking of GPUs in cluster nodes. The tool detects issues and suggests remedies to software and system configuration problems, but it is not a comprehensive hardware diagnostic tool. Features include:

  • basic CUDA and NVML sanity check;
  • diagnosis of GPU failures;
  • check for conflicting drivers;
  • poorly seated GPU detection;
  • check for disconnected power cables;
  • ECC error detection and reporting;
  • bandwidth test;
  • infoROM validation.

7. Run Benchmarks and Applications

Once your cluster is up and running you will want to validate it by running some benchmarks and sample applications. There are various benchmarks and code samples for GPUs and the network as well as applications to run on the entire cluster. For GPUs, you need to run two basic tests.

  1. deviceQuery: This sample code is available with the CUDA Samples included in the CUDA Toolkit installation package. deviceQuery simply enumerates the properties of the CUDA devices present in a node. This is not a benchmark, but successfully running this or any other CUDA sample serves to verify that you have the CUDA driver and toolkit properly installed on the system.
  2. bandwidthTest: This is another of the CUDA Samples included with the Toolkit. This sample measures the cudaMemcpy bandwidth of the GPU across PCI-e as well as internally. You should measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.

To benchmark network performance, you should run the bandwidth and latency tests for your installed MPI distribution. MPI standard installations have standard benchmarks such as /tests/osu_benchmarks-3.1.1. You should consider using an open source CUDA-aware MPI implementation like MVAPICH2, as described in earlier Parallel Forall posts An Introduction to CUDA-Aware MPI and Benchmarking CUDA-Aware MPI.

To benchmark the entire cluster, you should run the LINPACK numerical linear algebra application. The top 500 supercomputers list uses the HPL benchmark to decide the fastest supercomputers on Earth. The CUDA-enabled version of HPL (High-Performance LINPACK) optimized for GPUs is available from NVIDIA on request, and there is a Fermi-optimized version available to all NVIDIA registered developers.

In this post I have provided an overview of the basic steps to build a GPU-accelerated research prototype cluster. For more details on GPU-based clusters and some best practices for production clusters, please refer to Dale Southard's GTC 2013 talk S3249 – Introduction to Deploying, Managing, and Using GPU Clusters.


Install CUDA 6.5 on Ubuntu 14.04

Posted by Hemprasad Y. Badgujar on December 22, 2014


Install build-essential:

$ apt-get update && apt-get install build-essential

Get CUDA installer:

$ wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

Extract CUDA installer:

$ chmod +x cuda_6.5.14_linux_64.run
$ mkdir nvidia_installers
$ ./cuda_6.5.14_linux_64.run -extract=`pwd`/nvidia_installers

Run Nvidia driver installer:

$ cd nvidia_installers
$ ./NVIDIA-Linux-x86_64-340.29.run

At this point it will pop up an 8-bit UI that will ask you to accept a license agreement, and then start installing.

screenshot

At this point, I got an error:

Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or
         improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver
         such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
         device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

         Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log'
         for more information.

After reading this forum post I installed:

$ sudo apt-get install linux-image-extra-virtual

When it prompted me what to do about the grub changes, I chose “choose package maintainers version”.

Reboot:

$ reboot

Disable nouveau

At this point you need to disable nouveau, since it conflicts with the nvidia kernel module.

Open a new file

$ vi /etc/modprobe.d/blacklist-nouveau.conf

and add these lines to it

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

and then save the file.

Disable the Kernel Nouveau:

$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf

Reboot:

$ update-initramfs -u
$ reboot

One more try — this time it works

Get Kernel source:

$ apt-get install linux-source
$ apt-get install linux-headers-3.13.0-37-generic

Rerun Nvidia driver installer:

$ cd nvidia_installers
$ ./NVIDIA-Linux-x86_64-340.29.run

Load nvidia kernel module:

$ modprobe nvidia

Run CUDA + samples installer:

$ ./cuda-linux64-rel-6.5.14-18749181.run
$ ./cuda-samples-linux-6.5.14-18745345.run

Verify CUDA is correctly installed

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery   

You should see the following output:

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GRID K520
Result = PASS

You should reboot the system afterwards and verify the driver installation with the nvidia-settings utility.

Environment Variables

As part of the CUDA environment, you should add the following in the .bashrc file of your home folder.

export CUDA_HOME=/usr/local/cuda-6.5
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64

PATH=${CUDA_HOME}/bin:${PATH}
export PATH

CUDA SDK Samples

Now you can copy the SDK samples into your home directory, and proceed with the build process.

$ cuda-install-samples-6.5.sh ~
$ cd ~/NVIDIA_CUDA-6.5_Samples
$ make

If everything goes well, you should be able to verify your CUDA installation by running the deviceQuery sample in bin/x86_64/linux/release.

Source (http://tleyden.github.io/)


Running CUDA Code Natively on x86 Processors

Posted by Hemprasad Y. Badgujar on December 20, 2014


1 Try: CUDA Development without a GPU

What if you want to run the code on your machine but you don't have a GPU? Or maybe you want to try things out before firing up your AWS instance? Here I show you a way to run CUDA code without a GPU.

Note: this only works on Linux, maybe there are other alternatives for Mac or Windows.

Ocelot lets you run CUDA programs on NVIDIA GPUs, AMD GPUs and x86-CPUs without recompilation. Here we’ll take advantage of the latter to run our code using our CPU.

Dependencies

You’ll need to install the following packages:

  • C++ Compiler (GCC)
  • Lex Lexer Generator (Flex)
  • YACC Parser Generator (Bison)
  • SCons

And these libraries:

  • boost_system
  • boost_filesystem
  • boost_serialization
  • GLEW (optional for GL interop)
  • GL (for NVIDIA GPU Devices)

With Arch Linux, this should go something like this:

pacman -S gcc flex bison scons boost glew

On Ubuntu it should be similar (sudo apt-get install flex bison g++ scons libboost-all-dev). If you don’t know the name of a package, search for it with ‘apt-cache search package_name’.

You should probably install LLVM too, it’s not mandatory, but I think it runs faster with LLVM.

pacman -S llvm clang

And of course you'll need to install CUDA and the OpenCL headers. You can do it manually or using your distro's package manager (for Ubuntu I believe the package is called nvidia-cuda-toolkit):

pacman -S cuda libcl opencl-nvidia

One last dependency is Hydrazine. Fetch the source code:

svn checkout http://hydrazine.googlecode.com/svn/trunk/ hydrazine

Or if you’re like me and prefer Git:

git svn clone -s http://hydrazine.googlecode.com/svn/ hydrazine

And install it like this (you might need to install automake if you don’t have it already):

cd hydrazine
libtoolize
aclocal
autoconf
automake --add-missing
./configure
sudo make install

Installation

Now we can finally install Ocelot. This is where it gets a bit messy. Fetch the Ocelot source code:

svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot

Or with Git (warning, this will take a while, the whole repo is about 1.9 GB):

git svn clone -s http://gpuocelot.googlecode.com/svn/ gpuocelot

Now go to the ocelot directory:

cd gpuocelot/ocelot

And install Ocelot with:

sudo ./build.py --install

Troubleshooting

Sadly, the last command probably failed. This is how I fixed the problems.

Hydrazine headers not found

You could fix this by adding an include flag. I just added a symbolic link to the hydrazine code we downloaded previously:

ln -s /path/to/hydrazine/hydrazine

Make sure you link to the second hydrazine directory (inside this directory you’ll find directories like implementation and interface). You should do this in the ocelot directory where you’re running the build.py script (gpuocelot/ocelot).

LLVM header file not found

For any error that looks like this:

llvm/Target/TargetData.h: No such file or directory

Just edit the source code and replace it with this header:

llvm/DataLayout.h

The LLVM project moved the file.

LLVM IR folder “missing”

Similarly, files referenced by Ocelot from the “IR” package were moved (LLVM 3.2-5 on Arch Linux). If you get an error about LLVM/IR/LLVMContext.h missing, edit the following files:

ocelot/ir/implementation/ExternalFunctionSet.cpp
ocelot/executive/implementation/LLVMModuleManager.cpp
ocelot/executive/implementation/LLVMState.cpp

and replace the includes at the top of each file for LLVM/IR/LLVMContext.h and LLVM/IR/Module.h with LLVM/LLVMContext.h and LLVM/Module.h, respectively.

PTXLexer errors

The next problem I ran into was:

.release_build/ocelot/ptxgrammar.hpp:351:14:error:'PTXLexer' is not a member of 'parser'

Go ahead, open the ‘.release_build/ocelot/ptxgrammar.hpp’ file and just comment line 355:

/* int yyparse (parser::PTXLexer& lexer, parser::PTXParser::State& state); */

That should fix the error.

boost libraries not found

On up-to-date Arch Linux boxes, it will complain about not finding boost libraries ‘boost_system-mt’, ‘boost_filesystem-mt’, ‘boost_thread-mt’.

I had to edit two files:

  • scripts/build_environment.py
  • SConscript

And just remove the trailing -mt from the library names:

  • boost_system
  • boost_filesystem
  • boost_thread

Finish the installation

After those fixes everything should work.

Whew! That wasn’t fun. Hopefully with the help of this guide it won’t be too painful.

To finish the installation, run:

sudo ldconfig

And you can check the library was installed correctly running:

OcelotConfig -l

It should return -locelot. If it didn’t, check your LD_LIBRARY_PATH. On my machine, Ocelot was installed under /usr/local/lib so I just added this to my LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Here’s the link to the installation instructions.

Running the code with Ocelot

We're finally ready to enjoy the fruits of our hard work. We need to do two things:

Ocelot configuration file

Add a file called configure.ocelot to your project (in the same directory as our Makefile and student_func.cu files), and copy this:

{
    ocelot: "ocelot",
    trace: {
        database: "traces/database.trace",
        memoryChecker: {
            enabled: false,
            checkInitialization: false
        },
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: false
        },
        debugger: {
            enabled: false,
            kernelFilter: "",
            alwaysAttach: true
        }
    },
    cuda: {
        implementation: "CudaRuntime",
        tracePath: "trace/CudaAPI.trace"
    },
    executive: {
        devices: [llvm],
        preferredISA: nvidia,
        optimizationLevel: full,
        defaultDeviceID: 0,
        asynchronousKernelLaunch: True,
        port: 2011,
        host: "127.0.0.1",
        workerThreadLimit: 8,
        warpSize: 16
    },
    optimizations: {
        subkernelSize: 10000,
    }
}

You can check this guide for more information about these settings.

Compile with the Ocelot library

And lastly, a small change to our Makefile. Append this to the GCC_OPTS:

GCC_OPTS=-O3 -Wall -Wextra -m64 `OcelotConfig -l`

And change the student target so it uses g++ and not nvcc:

student: compare main.o student_func.o Makefile
    g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GCC_OPTS)

I just replaced ‘nvcc’ with ‘g++’ and ‘NVCC_OPTS’ with ‘GCC_OPTS’.

make clean
make

And that’s it!

I forked the github repo and added these changes in case you want to take a look.

I found this guide helpful, it might have some additional details for installing things under ubuntu and/or manually.

Note for debian users

I successfully installed ocelot under debian squeeze, following the above steps, except that I needed to download llvm from upstream, as indicated in the above guide for ubuntu.

Other than that, after fixing some includes as indicated (Replacing ‘TargetData.h’ by ‘IR/DataLayout.h’, or adding ‘/IR/’ to some includes), it just compiled.

To build the student project, I needed to replace -m64 by -m32 to fit my architecture, and to make the other indicated changes.

Here are my makefile diffs:

$ git diff Makefile
diff --git a/HW1/student/Makefile b/HW1/student/Makefile
index b6df3a4..55480af 100755
--- a/HW1/student/Makefile
+++ b/HW1/student/Makefile
@@ -22,7 +22,8 @@ OPENCV_INCLUDEPATH=/usr/include

 OPENCV_LIBS=-lopencv_core -lopencv_imgproc -lopencv_highgui

-CUDA_INCLUDEPATH=/usr/local/cuda-5.0/include
+#CUDA_INCLUDEPATH=/usr/local/cuda-5.0/include
+CUDA_INCLUDEPATH=/usr/local/cuda/include

 ######################################################
 # On Macs the default install locations are below    #
@@ -36,12 +37,12 @@ CUDA_INCLUDEPATH=/usr/local/cuda-5.0/include
 #CUDA_INCLUDEPATH=/usr/local/cuda/include
 #CUDA_LIBPATH=/usr/local/cuda/lib

-NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m64
+NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m32

-GCC_OPTS=-O3 -Wall -Wextra -m64
+GCC_OPTS=-O3 -Wall -Wextra -m32 `OcelotConfig -l` -I /usr/include/i386-linux-gn

 student: compare main.o student_func.o Makefile
-       $(NVCC) -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) 
+       g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GC

 main.o: main.cpp timer.h utils.h HW1.cpp
        g++ -c main.cpp $(GCC_OPTS) -I $(CUDA_INCLUDEPATH) -I $(OPENCV_LIBPATH)
$

I’m using cuda toolkit 4.2.

I don’t know why, but it was necessary to add /usr/lib/gcc/i486-linux-gnu/4.4 to the PATH for nvcc to work:

export PATH=$PATH:/usr/lib/gcc/i486-linux-gnu/4.4

Eclipse CUDA plugin

This is probably for another entry, but I used this guide to integrate CUDA into Eclipse Indigo.

The plugin is University of Bayreuth’s Eclipse Toolchain for CUDA compiler



2 Try: Running CUDA Code Natively on x86 Processors

We  focused on Fermi and the architectural changes that significantly broadened the types of applications that map well to GPGPU computing yet preserve the application performance of software written for previous generations of CUDA-enabled GPUs. This article addresses the mindset that CUDA is a language for only GPU-based applications.

Recent developments allow CUDA programs to transparently compile and run at full speed on x86 architectures. This advance makes CUDA a viable programming model for all application development, just like OpenMP. The PGI CUDA C/C++ compiler for x86 (from the Portland Group Inc.) is the reason for this recent change in mindset. It is the first native CUDA compiler that can transparently create a binary that will run on an x86 processor. No GPU is required. As a result, programmers now have the ability to use a single source tree of CUDA code to reach both customers who own CUDA-enabled GPUs and those who use x86-based systems.

Figure 1 illustrates the options and target platforms that are currently available to build and run CUDA applications. The various products are discussed next.

Figure 1: The various options for compiling and running a CUDA program.

Aside from the new CUDA-x86 compiler, the other products require developer or customer intervention to run CUDA on multiple backends. For example:

  • nvcc: The freely downloadable nvcc compiler from NVIDIA creates both host and device code. With the use of the __device__ and __host__ specifiers, a developer can use C++ Thrust functions to run on both host and CUDA-enabled devices. This x86 pathway is represented by the dotted line in Figure 1, as the programmer must explicitly specify use of the host processor. In addition, developers must explicitly check whether a GPU is present and use this information to select the memory space in which the data will reside (that is, GPU or host). The Thrust API also allows CUDA codes to be transparently compiled to run on different backends. The Thrust documentation shows how to use OpenMP to run a Monte Carlo simulation on x86. Note that Thrust is not optimized to create efficient OpenMP code. (A minimal sketch of switching Thrust backends follows this list.)
  • gpuocelot provides a dynamic compilation framework to run CUDA binaries on various backends such as x86, AMD GPUs, and an x86-based PTX emulator. The emulator alone is a valuable tool for finding hot spots and bottlenecks in CUDA codes. The gpuocelot website claims that it “allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs at full speed without recompilation.” I recommend this project even though it is challenging to use. As it matures, Ocelot will provide a pathway for customers to run CUDA binaries on various backends.
  • MCUDA is an academic project that translates CUDA to C. It is not currently maintained, but the papers are interesting reading. A follow-up project (FCUDA) provides a CUDA to FPGA translation capability.
  • SWAN provides a CUDA-to-OpenCL translation capability. The authors note that Swan is “not a drop in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls rewritten.” Still, it is an interesting project to bridge the gap between CUDA and OpenCL.
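As a concrete illustration of the Thrust point above, here is a minimal sketch (mine, not from the original article) of Thrust code that runs on the CUDA device backend by default but can be retargeted to an OpenMP host backend at compile time using Thrust's standard backend macro; the file name and vector sizes are arbitrary:

// Build for GPU:  nvcc vecadd_thrust.cu
// Build for x86:  nvcc -Xcompiler -fopenmp \
//                      -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
//                      vecadd_thrust.cu -lgomp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
  thrust::device_vector<float> x(1 << 20), y(1 << 20);
  thrust::sequence(x.begin(), x.end());      // x = 0, 1, 2, ...
  thrust::fill(y.begin(), y.end(), 2.0f);    // y = 2, 2, 2, ...
  // y = x + y, executed on whichever backend "device" maps to
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                    thrust::plus<float>());
  std::printf("y[10] = %g\n", static_cast<float>(y[10]));
  return 0;
}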

The CUDA-x86 compiler is the first to provide a seamless pathway to create a multi-platform application.

Why It Matters

Using CUDA for all application development may seem like a radical concept to many readers, but in fact, it is the natural extension of the emerging CPU/GPU paradigm of high-speed computing. One of the key benefits of CUDA is that it uses C/C++, can be adopted easily, and runs on 300+ million GPUs and now on all x86 chips. If this still feels like an edgy practice, this video presentation might be helpful.

CUDA works well now at its principal task — massively parallel computation — as demonstrated by the variety and number of projects that achieve 100x or greater performance in the NVIDIA showcase. See Figure 2.

Figure 2: All top 100 CUDA apps attain speedups in excess of 100x.

PGI CUDA-x86: CUDA Programming for Multi-core CPUs

Introduction

The NVIDIA CUDA architecture was developed to enable offloading of compute-intensive kernels to GPUs. Through API function calls and language extensions, CUDA gives developers control over the mapping of general-purpose compute kernels to GPUs, and over the placement and movement of data between host memory and GPU memory. CUDA is supported on x86 and x64 (64-bit x86) systems running Linux, Windows, or Mac OS that include an NVIDIA CUDA-enabled GPU. First introduced in 2007, CUDA is the most popular GPGPU parallel programming model, with an estimated user base of over 100,000 developers worldwide.

Let’s review the hardware around which the CUDA programming model was designed. Figure 1 below shows an abstraction of a multi-core x64+GPU platform focused on computing, with the graphics functionality stripped out. The key to the performance potential of the NVIDIA GPU is the large number of thread processors, up to 512 of them in a Fermi-class GPU. They’re organized into up to 16 multi-processors, each of which has 32 thread processors. Each thread processor has registers along with integer and floating point functional units; the thread processors within a multiprocessor run in SIMD mode. Fermi peak single-precision performance is about 1.4 TFLOPS and peak double-precision is about 550 GFLOPS.

Fermi Block Diagram

Figure 1: NVIDIA Fermi-class GPU Accelerator

The GPU has a large (up to 6GB) high bandwidth long latency device main memory. Each multi-processor has a small 64KB local shared memory that functions as both a hardware data cache and a software-managed data cache, and has a large register file.

The GPU has two levels of parallelism, SIMD within a multiprocessor, and parallel across multiprocessors. In addition, there is another very important level of concurrency: the thread processors support extremely fast multithread context switching to tolerate the long latency to device main memory. If a given thread stalls waiting for a device memory access, it is swapped out and another ready thread is swapped in and starts executing within a few cycles.

What kind of algorithms run well on this architecture?

  • Massive parallelism—is needed to effectively use hundreds of thread processors and provide enough slack parallelism for the fast multi-threading to effectively tolerate device memory latency and maximize device memory bandwidth utilization.
  • Regular parallelism—is needed for GPU hardware and firmware that is optimized for the regular parallelism found in graphics kernels; these correspond roughly to rectangular iteration spaces (think tightly nested loops).
  • Limited synchronization—thread processors within a multi-processor can synchronize quickly enough to enable coordinated vector operations like reductions, but there is virtually no ability to synchronize across multi-processors.
  • Locality—is needed to enable use of the hardware or user-managed data caches to minimize accesses to device memory.

This sounds a lot like a nest of parallel loops. So, NVIDIA defined the CUDA programming model to enable efficient mapping of general-purpose compute-intensive loop nests onto the GPU hardware. Specifically, a 1K x 1K matrix multiply loop that looks as follows on the host:

for (i = 0; i < 1024; ++i)
   for (k = 0; k < 1024; ++k)
      for (j = 0; j < 1024; ++j)
         c[i][j] += a[i][k]*b[k][j];

can be rewritten in its most basic form in CUDA C as:

cudaMalloc( &ap, memsizeA );
...
cudaMemcpy( ap, a, memsizeA, cudaMemcpyHostToDevice );
...
dim3 grid(64,64), block(16,16);
c_mmul_kernel<<<grid, block>>>(ap, bp, cp, 1024);
cudaMemcpy( c, cp, memsizeC, cudaMemcpyDeviceToHost );
...

__global__ void c_mmul_kernel(float* a, float* b, float* c, int n)
{
   int i = blockIdx.y*16+threadIdx.y;
   int j = blockIdx.x*16+threadIdx.x;
   for( int k = 0; k < n; ++k )
      c[n*i+j] += a[n*i+k] * b[n*k+j];
}

The triply-nested matrix multiply loop becomes a single dot-product loop, split out to a self-contained kernel function. The two outer loops are abstracted away in the launch of the kernel on the GPU. Conceptually, the over one million 1024-length dot-products it takes to perform the matrix multiply are all launched simultaneously on the GPU. The CUDA programmer structures fine-grain parallel tasks, in this case dot-product operations, as CUDA threads, organizes the threads into rectangular thread blocks with 32 to 1024 threads each, and organizes the thread-blocks into a rectangular grid. Each thread-block is assigned to a CUDA GPU multi-processor, and the threads within a thread-block are executed by the thread-processors within that multiprocessor.

The programmer also manages the memory hierarchy on the GPU, moving data from the host to device memory, from variables in device memory to variables in shared memory, or to variables that the user intends to be assigned to registers.
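To make that shared-memory staging concrete, here is a hedged sketch (mine, not from the original article) of the same matrix-multiply kernel rewritten to copy 16x16 tiles of a and b into shared memory before computing partial dot-products; it assumes, as in the example above, that n is a multiple of the tile size:

#define TILE 16

__global__ void c_mmul_tiled(const float* a, const float* b, float* c, int n)
{
   __shared__ float as[TILE][TILE];
   __shared__ float bs[TILE][TILE];
   int i = blockIdx.y*TILE + threadIdx.y;
   int j = blockIdx.x*TILE + threadIdx.x;
   float sum = 0.0f;
   for (int t = 0; t < n; t += TILE) {
      as[threadIdx.y][threadIdx.x] = a[n*i + (t + threadIdx.x)];
      bs[threadIdx.y][threadIdx.x] = b[n*(t + threadIdx.y) + j];
      __syncthreads();                       /* tile fully loaded */
      for (int k = 0; k < TILE; ++k)
         sum += as[threadIdx.y][k] * bs[k][threadIdx.x];
      __syncthreads();                       /* done with this tile */
   }
   c[n*i + j] += sum;
}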

PGI CUDA C/C++ for Multi-core x64

The PGI CUDA C/C++ compiler for multi-core x64 platforms will allow developers to compile and optimize CUDA applications to run on x64-based workstations, servers and clusters with or without an NVIDIA GPU accelerator. Is it possible to compile CUDA C efficiently for multi-core processors? CUDA C is simply a parallel programming model and language. While it was designed with the structure required for efficient GPU programming, it also can be compiled for efficient execution on multi-core x64.

Looking at a multicore x64 CPU, we see features very much like those on the NVIDIA GPU. We have MIMD parallelism across the cores, typically 4 cores, but there are up to 12 on some chips today and up to 48 on a single motherboard. We have SIMD parallelism in the AVX or SSE instructions. So it’s the same set of features, except that CPUs are optimized with deep cache memory hierarchies for memory latency, whereas the GPU is optimized for memory bandwidth. Mapping the CUDA parallelism onto the CPU parallelism seems straightforward from basic principles.

Consider the process the CUDA programmer uses to convert existing serial or parallel programs to CUDA C, as outlined above. Many aspects of this process can simply be reversed by the compiler:

  • Reconstitute parallel/vector loop nests from the CUDA C chevron syntax
  • Where possible, remove or replace programmer-inserted __syncthreads() calls by appropriate mechanisms on the CPU

In effect, the PGI CUDA C/C++ compiler will process CUDA C as a native parallel programming language for mapping to multi-core x64 CPUs. CUDA thread blocks will be mapped to processor cores to effect multi-core execution, and CUDA thread-level parallelism will be mapped to the SSE or AVX SIMD units as shown in Figure 2 below. All existing PGI x64 optimizations for Intel and AMD CPUs will be applied to CUDA C/C++ host code—SIMD/AVX vectorization, inter-procedural analysis and optimizations, auto-parallelization for multi-core, OpenMP extensions support, etc.

Multi-core Mapping

Figure 2: Mapping CUDA to GPUs versus Multi-core CPUs
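To make the mapping in Figure 2 concrete, here is a rough sketch (mine, and deliberately simplified; it is not PGI's actual generated code) of the shape a kernel launch takes after the compiler reconstitutes the loop nests: the grid of thread blocks becomes a parallel loop over cores, and the threads within a block become inner loops that can be vectorized:

/* Illustrative only: thread blocks -> multi-core loop, threads -> SIMD loops */
void c_mmul_kernel_x86(float *a, float *b, float *c, int n,
                       int gridx, int gridy, int blockx, int blocky)
{
   #pragma omp parallel for collapse(2)       /* thread blocks across cores */
   for (int by = 0; by < gridy; ++by)
      for (int bx = 0; bx < gridx; ++bx)
         for (int ty = 0; ty < blocky; ++ty)
            for (int tx = 0; tx < blockx; ++tx) {   /* threads -> SIMD lanes */
               int i = by*blocky + ty;
               int j = bx*blockx + tx;
               for (int k = 0; k < n; ++k)
                  c[n*i + j] += a[n*i + k] * b[n*k + j];
            }
}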

Initially, PGI CUDA C/C++ will target the CUDA 3.1 runtime API. There are no current plans to implement the CUDA driver API. The definition of warpSize may be changed (probably to 1 in optimizing versions of the compiler); correctly implementing warp-synchronous programming would either require implicit synchronization after each memory access, or would require the compiler to prove that such synchronization is not required. It’s much more natural to require programmers to use the value of warpSize to determine how many threads are running in SIMD mode.
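For example, a kernel can read the built-in warpSize variable instead of assuming 32, so the same code stays correct whether a warp is 32 threads (GPU) or far fewer (CUDA-x86). A trivial sketch, mine rather than the article's:

__global__ void report_warps(int *warps_per_block)
{
   /* one thread per block reports how many warps the block contains */
   if (threadIdx.x == 0)
      *warps_per_block = (blockDim.x + warpSize - 1) / warpSize;
}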

What kind of performance can you expect from CUDA C programs running on multi-core CPUs? There are many determining factors. Typical CUDA C programs perform many explicit operations and optimizations that are not necessary when programming multi-core CPUs using OpenMP or threads-based programming:

  • Explicit movement of data from host main memory to CUDA device memory
  • Data copies from arrays in CUDA device memory to temporary arrays in multi-processor shared memory
  • Synchronization of SIMT thread processors to ensure shared memory coherency
  • Manual unrolling of loops

In many cases, the PGI CUDA C compiler will remove explicit synchronization of the thread processors if it can determine it’s safe to split loops in which synchronization calls occur. Manual unrolling of loops will not typically hurt performance on x64, and may help in some cases. However, explicit movement of data from host memory to “device” copies will still occur, and explicit movement of data from device copies to temporary arrays in shared memory will still occur; these operations are pure overhead on a multi-core processor.

It will be easy to write CUDA programs that run really well on the GPU but don’t run so well on a CPU. We can’t guarantee high performance if you have tightly hand-tuned your kernel code for the GPU. As with OpenCL, we’re making the language portable, and many programs will port and run well; but there is no guarantee of general performance portability.

PGI Unified Binary for Multi-core x64 and NVIDIA GPUs

In later releases, in addition to multi-core execution, the PGI CUDA C/C++ compiler will support execution of device kernels on NVIDIA CUDA-enabled GPUs. PGI Unified Binary technology will enable developers to build one binary that will use NVIDIA GPUs when present or default to using multi-core x64 if no GPU is present.

PGI Unified Binary

Figure 3: PGI Unified Binary for NVIDIA GPUs and Multi-core CPUs

Conclusion

It’s important to clarify that the PGI CUDA C/C++ compiler for multi-core does not split work between the CPU and GPU; it executes device kernels in multi-core mode on the CPU. Even with the PGI Unified Binary feature, the device kernels will execute either on the GPU or on the multi-core, since the data will have been allocated in one memory or the other. PGI CUDA C/C++ also is not intended as a replacement for OpenMP or other parallel programming models for CPUs. It is a feature of the PGI compilers that will enable CUDA programs to run on either CPUs or GPUs, and will give developers the option of a uniform manycore parallel programming model for applications where it’s needed and appropriate. It will ensure CUDA C programs are portable to virtually any multi-core x64 processor-based HPC system.

The PGI compiler will implement the NVIDIA CUDA C language and closely track the evolution of CUDA C moving forward. The implementation will proceed in phases:

  • Prototype demonstration at SC10 in New Orleans (November 2010).
  • First production release in Q2 2011 with most CUDA C functionality. This will not be a performance release; it will use multi-core parallelism across threads in a single thread block, in the same way as PGI CUDA Fortran emulation mode, but will not exploit parallelism across thread blocks.
  • Performance release in Q3 2011 leveraging multi-core and SSE/AVX to implement low-overhead native parallel/SIMD execution; this will use a single core to execute all the threads in a single thread block, in SIMD mode where possible, and use multi-core parallelism across the thread blocks.
  • Unification release in Q4 2011 that supports PGI Unified Binary technology to create binaries that use NVIDIA GPU accelerators when present, or run on multi-core CPUs if no GPU is present.

The necessary elements of the NVIDIA CUDA toolkit needed to compile and execute CUDA C/C++ programs (header files, for example) will be bundled with the PGI compiler. Finally, the same optimizations and features implemented for CUDA C/C++ for multi-core will also be supported in CUDA Fortran, offering interoperability and a uniform programming model across both languages.

How It Works

In CUDA-x86, thread blocks are mapped to x86 processor cores. Thread-level parallelism is mapped to SSE (Streaming SIMD Extensions) or AVX SIMD units as shown below. (AVX is an extension of SSE to 256-bit operation). PGI indicates that:

  • The size of a warp (that is, the group of threads executed together in SIMD fashion) will be different from the typical 32 threads per warp on a GPU. For x86 computing, a warp might be the size of the SIMD unit on the x86 core (either four or eight threads), or one thread per warp when SIMD execution is not utilized.
  • In many cases, the PGI CUDA C compiler removes explicit synchronization of the thread processors when the compiler can determine it is safe to split loops.
  • CUDA considers the GPU as a separate device from the host processors. CUDA-x86 maintains this memory model, which means that data movement between the host and device memory spaces still consumes application runtime. As shown in the device bandwidth SDK example below, a modern Xeon processor can transfer data to a CUDA-x86 “device” at about 4GB/sec. All CUDA-x86 pointers reside in the x86 memory space, so programmers can use conditional compilation to directly access memory without requiring data transfers when running on multicore processors (a minimal sketch follows this list).
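Here is a minimal sketch of that conditional-compilation idea; BUILD_FOR_X86 is a hypothetical user-defined macro (not a PGI flag), and the surrounding code is illustrative:

#include <cuda_runtime.h>
#include <stddef.h>

void run(float *host_data, size_t n)
{
   float *dev_data;
#ifdef BUILD_FOR_X86
   dev_data = host_data;                       /* no transfer: same memory */
#else
   cudaMalloc((void**)&dev_data, n * sizeof(float));
   cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
#endif
   /* ... launch kernels on dev_data ... */
#ifndef BUILD_FOR_X86
   cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
   cudaFree(dev_data);
#endif
}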

Trying Out the Compiler

The PGI installation process is fairly straightforward:

  1. Register and download the latest version from PGI
  2. Extract the tarfile at the location of your choice and follow the instructions in INSTALL.txt.
    • Under Linux, this basically requires running the file ./install as superuser and answering a few straightforward questions.
    • Note that you should answer “yes” to the installation of CUDA even if you have a GPU version of CUDA already installed on your system. The PGI x86 version will not conflict with the GPU version. Otherwise, the PGI compiler will not understand files with the .cu file extension.
  3. Create the license.dat file.

At this point, you have a 15-day license for the PGI compilers.

Set up the environment to build with the PGI tools as discussed in the installation guide. The following are the commands for bash under Linux:

PGI=/opt/pgi; export PGI
MANPATH=$MANPATH:$PGI/linux86-64/11.5/man; export MANPATH
LM_LICENSE_FILE=$PGI/license.dat; export LM_LICENSE_FILE
PATH=$PGI/linux86-64/11.5/bin:$PATH; export PATH

Copy the PGI NVIDIA SDK samples to a convenient location and build them:

cp -r /opt/pgi/linux86-64/2011/cuda/cudaX86SDK  .
cd cudaX86SDK ;
make

This is the output of deviceQuery on an Intel Xeon E5560 processor:

CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "DEVICE EMULATION MODE"
  CUDA Driver Version:                           99.99
  CUDA Runtime Version:                          99.99
  CUDA Capability Major revision number:         9998
  CUDA Capability Minor revision number:         9998
  Total amount of global memory:                 128000000 bytes
  Number of multiprocessors:                     1
  Number of cores:                               0
  Total amount of constant memory:               1021585952 bytes
  Total amount of shared memory per block:       1021586048 bytes
  Total number of registers available per block: 1021585904
  Warp size:                                     1
  Maximum number of threads per block:           1021585920
  Maximum sizes of each dimension of a block:    32767 x 2 x 0
  Maximum sizes of each dimension of a grid:     1021586032 x 32767 x 1021586048
  Maximum memory pitch:                          4206313 bytes
  Texture alignment:                             1021585952 bytes
  Clock rate:                                    0.00 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Unknown
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 99.99, CUDA Runtime Version = 99.99, NumDevs = 1, Device = DEVICE EMULATION MODE
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------

The output of bandwidthTest shows that device transfers work as expected:

Running on...
 Device 0: DEVICE EMULATION MODE
 Quick Mode
 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         4152.5
 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         4257.0
 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         8459.2
[bandwidthTest] - Test results:
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------

As with NVIDIA’s nvcc compiler, it is easy to use the PGI pgCC compiler to build an executable from a CUDA source file. As an example, copy the arrayReversal_multiblock_fast.cu code from Part 3 of this series. To compile and run it under Linux, type:

pgCC arrayReversal_multiblock_fast.cu
./a.out
Correct!

Posted in Computer Network & Security, Computer Softwares, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, PARALLEL | Tagged: | Leave a Comment »

Running MPI/GPU program

Posted by Hemprasad Y. Badgujar on December 19, 2014


GPUs provide the ability to perform mathematical operations at a fraction of the cost, and with higher performance, than the current generation of processors. FutureGrid provides the ability to test such an infrastructure as part of its Delta cluster. Here, we provide a step-by-step guide on how to run a parallel matrix multiplication program using Intel MPI and CUDA on the Delta machines. The MPI framework distributes the work among compute nodes, each of which uses CUDA to execute its share of the workload. The complete parallel matrix multiplication code using MPI/CUDA, which has already been tested on the Delta cluster, is provided as an attachment.

Source Code Package

MPI code: pmm_mpi.c

#include <mpi.h>

void invoke_cuda_vecadd();

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */
    invoke_cuda_vecadd();                   /* call the CUDA code */
    MPI_Finalize();
    return 0;
}
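The pmm_mpi.c driver above only initializes MPI and calls the CUDA routine on every rank; the actual row distribution lives in the attached full program, which is not reproduced here. As a rough, hypothetical sketch of that idea (gpu_dgemm_rows stands in for a CUDA wrapper along the lines of the one below), a rank-based split of the rows of A might look like this:

#include <mpi.h>
#include <stdlib.h>

/* hypothetical CUDA wrapper: multiplies nrows rows of A by B on the GPU */
void gpu_dgemm_rows(const double *a_rows, const double *b, double *c_rows,
                    int nrows, int n);

void distributed_mm(double *a, double *b, double *c, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = n / size;                     /* assume n divisible by size */
    double *a_local = (double *)malloc(rows * n * sizeof(double));
    double *c_local = (double *)malloc(rows * n * sizeof(double));

    /* hand each rank its slice of A; broadcast all of B */
    MPI_Scatter(a, rows * n, MPI_DOUBLE, a_local, rows * n, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(b, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    gpu_dgemm_rows(a_local, b, c_local, rows, n);   /* CUDA does the math */

    /* rank 0 collects the partial C blocks */
    MPI_Gather(c_local, rows * n, MPI_DOUBLE, c, rows * n, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    free(a_local);
    free(c_local);
}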

CUDA code: dgemm_cuda.cu

#include <stdio.h>

__global__ void cuda_vecadd(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

extern "C" void invoke_cuda_vecadd()
{
    int hostarray1[10], hostarray2[10], hostarray3[10];
    int *devarray1, *devarray2, *devarray3;

    for (int i = 0; i < 10; i++) {          /* sample input data */
        hostarray1[i] = i;
        hostarray2[i] = 2 * i;
    }

    cudaMalloc((void**) &devarray1, sizeof(int)*10);
    cudaMalloc((void**) &devarray2, sizeof(int)*10);
    cudaMalloc((void**) &devarray3, sizeof(int)*10);
    cudaMemcpy(devarray1, hostarray1, sizeof(int)*10, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, hostarray2, sizeof(int)*10, cudaMemcpyHostToDevice);
    cuda_vecadd<<<1, 10>>>(devarray1, devarray2, devarray3);
    cudaMemcpy(hostarray3, devarray3, sizeof(int)*10, cudaMemcpyDeviceToHost);
    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}

Note: Mixing MPI and CUDA code may cause problems during linking because of the difference between C and C++ name mangling and calling conventions. Wrapping invoke_cuda_vecadd in extern “C” instructs nvcc (a wrapper around a C++ compiler) to make that function callable from the C code.

Compiling the MPI/CUDA program:

Load the Modules
> module load IntelMPI   # load Intel MPI
> module load Intel      # load icc
> module load cuda       # load CUDA tools

This will load Intel MPI, the Intel compiler, and the CUDA tools. Next, compile the code with:

> nvcc -c dgemm_cuda.cu -o dgemm_cuda.o
> mpiicc -c pmm_mpi.c -o pmm_mpi.o
> mpiicc -o mpicuda pmm_mpi.o dgemm_cuda.o -lcudart -lcublas -L /opt/cuda/lib64 -I /opt/cuda/include

Note: The CUDA compiler nvcc is used only to compile the CUDA source file, while the Intel MPI compiler wrapper mpiicc is used to compile the C code and to do the linking.
Setting Up and Submitting MPI Jobs:

1. qsub -I -l nodes=4 -q delta        # get 4 nodes from FG
2. uniq /var/spool/torque/aux/399286.i136 > gpu_nodes_list       #create machine file list
3. module load IntelMPI                # load Intel MPI
4. module load Intel                     # load icc
5. module load cuda                     # load cuda tools
6. mpdboot -r ssh -f gpu_nodes_list -n 4  # will start an mpd ring on 4 nodes including local host
7. mpiexec -l -machinefile gpu_nodes_list -n 4 ./mpicuda 10000 1 4  # run mpi program using 4 nodes

Comparison between four implementations of sequential matrix multiplication on Delta:

Posted in Mixed | Tagged: , , | Leave a Comment »

 