Something More for Research

Explorer of Research #HEMBAD

HOW TO MIX MPI AND CUDA IN A SINGLE PROGRAM

Posted by Hemprasad Y. Badgujar on December 19, 2014


MPI is a well-known programming model for Distributed Memory Computing. If you have access to GPU resources, MPI can be used to distribute tasks to computers, each of which can use their CPU and also GPU to process the distributed task.

My toy problem in hand is to use  a mix of MPI and CUDA to handle traditional sparse-matrix vector multiplication. The program can be structured as:

Each node uses both CPU and GPU resources
Each node uses both CPU and GPU resources
  1. Read a sparse matrix from from disk, and split it into sub-matrices.
  2. Use MPI to distribute the sub-matrices to processes.
  3. Each process would call a CUDA kernel to handle the multiplication. The result of multiplication would be copied back to each computer memory.
  4. Use MPI to gather results from each of the processes, and re-form the final matrix.

One of the options is to put both MPI and CUDA code in a single file,spaghetti.cu. This program can be compiled using nvcc, which internally uses gcc/g++ to compile your C/C++ code, and linked to your MPI library:

1
nvcc -I/usr/mpi/gcc/openmpi-1.4.6/include -L/usr/mpi/gcc/openmpi-1.4.6/lib64 -lmpi spaghetti.cu -o program

The downside is it might end up being a plate of spaghetti, if you have some seriously long program.

Another cleaner option is to have MPI and CUDA code separate in two files: main.c and multiply.cu respectively. These two files can be compiled using mpicc, and nvcc respectively into object files (.o) and combined into a single executable file using mpicc. This second option is an opposite compilation of the above, using mpicc, meaning that you have to link to your CUDA library.

1
2
3
4
module load openmpi cuda #(optional) load modules on your node
mpicc -c main.c -o main.o
nvcc -arch=sm_20 -c multiply.cu -o multiply.o
mpicc main.o multiply.o -lcudart -L/apps/CUDA/cuda-5.0/lib64/ -o program

And finally, you can request two processes and two GPUs to test your program on the cluster using PBS script like:

1
2
#PBS -l nodes=2:ppn=2:gpus=2
mpiexec -np 2 ./program

The main.c, containing the call to CUDA file, would look like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include "mpi.h"
int main(int argc, char *argv[])
{
/* It's important to put this call at the begining of the program, after variable declarations. */
MPI_Init(argc, argv);
/* Get the number of MPI processes and the rank of this process. */
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
// ==== Call function 'call_me_maybe' from CUDA file multiply.cu: ==========
call_me_maybe();
/* ... */
}

And in multiply.cu, define call_me_maybe() with the ‘extern‘ keyword to make it accessible from main.c (without additional #include …)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
/* multiply.cu */
#include <cuda.h>
#include <cuda_runtime.h>
 __global__ void __multiply__ ()
 {
 }
 extern "C" void call_me_maybe()
{
     /* ... Load CPU data into GPU buffers  */
     __multiply__ <<< ...block configuration... >>> (x, y);
     /* ... Transfer data from GPU to CPU */
}

 

Mixing MPI and CUDA

Mixing MPI (C) and CUDA (C++) code requires some care during linking because of differences between the C and C++ calling conventions and runtimes. A helpful overview of the issues can be found at How to Mix C and C++.

One option is to compile and link all source files with a C++ compiler, which will enforce additional restrictions on C code. Alternatively, if you wish to compile your MPI/C code with a C compiler and call CUDA kernels from within an MPI task, you can wrap the appropriate CUDA-compiled functions with the extern keyword, as in the following example.

These two source files can be compiled and linked with both a C and C++ compiler into a single executable on Oscar using:

$ module load mvapich2 cuda
$ mpicc -c main.c -o main.o
$ nvcc -c multiply.cu -o multiply.o
$ mpicc main.o multiply.o -lcudart

The CUDA/C++ compiler nvcc is used only to compile the CUDA source file, and the MPI C compiler mpicc is user to compile the C code and to perform the linking.

01. /* multiply.cu */
02. 
03. #include 
04. #include 
05. 
06. __global__ void __multiply__ (const float *a, float *b)
07. {
08. const int i = threadIdx.x + blockIdx.x * blockDim.x;
09.     b[i] *= a[i];
10. }
11. 
12. extern "C" void launch_multiply(const float *a, const *b)
13. {
14.     /* ... load CPU data into GPU buffers a_gpu and b_gpu */
15. 
16.     __multiply__ <<< ...block configuration... >>> (a_gpu, b_gpu);
17. 
18.     safecall(cudaThreadSynchronize());
19.     safecall(cudaGetLastError());
20. 
21.     /* ... transfer data from GPU to CPU */

Note the use of extern "C" around the function launch_multiply, which instructs the C++ compiler (nvcc in this case) to make that function callable from the C runtime. The following C code shows how the function could be called from an MPI task.

01. /* main.c */
02. 
03. #include 
04. 
05. void launch_multiply(const float *a, float *b);
06. 
07. int main (int argc, char **argv)
08. {
09.        int rank, nprocs;
10.     MPI_Init (&argc, &argv);
11.     MPI_Comm_rank (MPI_COMM_WORLD, &rank);
12.     MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
13. 
14.     /* ... prepare arrays a and b */
15. 
16.     launch_multiply (a, b);
17. 
18.     MPI_Finalize();
19.        return 1;
20. }
Advertisements

One Response to “HOW TO MIX MPI AND CUDA IN A SINGLE PROGRAM”

  1. Karishma Bansole said

    Sir, are this program run only on Oscar cluster?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Extracts from a Personal Diary

dedicated to the life of a silent girl who eventually learnt to open up

Num3ri v 2.0

I miei numeri - seconda versione

ThuyDX

Just another WordPress.com site

Algunos Intereses de Abraham Zamudio Chauca

Matematica, Linux , Programacion Serial , Programacion Paralela (CPU - GPU) , Cluster de Computadores , Software Cientifico

josephdung

thoughts...

Tech_Raj

A great WordPress.com site

Travel tips

Travel tips

Experience the real life.....!!!

Shurwaat achi honi chahiye ...

Ronzii's Blog

Just your average geek's blog

Karan Jitendra Thakkar

Everything I think. Everything I do. Right here.

VentureBeat

News About Tech, Money and Innovation

Chetan Solanki

Helpful to u, if u need it.....

ScreenCrush

Explorer of Research #HEMBAD

managedCUDA

Explorer of Research #HEMBAD

siddheshsathe

A great WordPress.com site

Ari's

This is My Space so Dont Mess With IT !!

%d bloggers like this: