Something More for Research

Explorer of Research #HEMBAD

Running MPI/GPU program

Posted by Hemprasad Y. Badgujar on December 19, 2014


GPUs can perform many mathematical operations at a fraction of the cost of, and with higher throughput than, current-generation CPUs. FutureGrid makes such an infrastructure available for testing as part of its Delta cluster. Here we provide a step-by-step guide to running a parallel matrix multiplication program using Intel MPI and CUDA on the Delta machines. MPI distributes the work among the compute nodes, each of which uses CUDA to execute its share of the workload. The complete parallel matrix multiplication code using MPI/CUDA, which has already been tested on the Delta cluster, is provided as an attachment.
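To make the unit of work concrete, here is a minimal CPU sketch of the computation a single process could be responsible for. It assumes a row-band decomposition, in which each process computes a contiguous band of rows of C = A * B. That decomposition, the name matmul_rows, and the row-major layout are assumptions for illustration, not taken from the attached code.

```c
#include <assert.h>
#include <stddef.h>

/* Compute rows [row_begin, row_end) of C = A * B for n x n matrices
 * stored in row-major order. Under the assumed row-band decomposition,
 * each MPI process would run this (or a GPU equivalent) on its own band.
 * matmul_rows is a hypothetical helper name. */
void matmul_rows(const double *A, const double *B, double *C,
                 size_t n, size_t row_begin, size_t row_end)
{
    assert(row_begin <= row_end && row_end <= n);
    for (size_t i = row_begin; i < row_end; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

Rows outside the requested band are left untouched, so several processes can fill disjoint bands of the same result independently.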

Source Code Package

MPI code: pmm_mpi.c

#include <mpi.h>

void invoke_cuda_vecadd(void);  /* defined in dgemm_cuda.cu */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */
    invoke_cuda_vecadd();                   /* the cuda code */
    MPI_Finalize();
    return 0;
}
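The listing above obtains rank and size but does not yet use them. One common way to use them, sketched here under the assumption that the work is split by rows, is to give each process a contiguous block of rows. The helper name block_range is hypothetical and does not come from the attached package.

```c
#include <assert.h>

/* Split n rows among `size` processes as evenly as possible.
 * On return, process `rank` owns rows [*begin, *end).
 * The first (n % size) ranks receive one extra row.
 * block_range is a hypothetical helper, not part of MPI. */
void block_range(int n, int rank, int size, int *begin, int *end)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    int base = n / size;    /* rows every rank gets */
    int extra = n % size;   /* leftover rows, handed to the low ranks */
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}
```

In pmm_mpi.c this could be called right after MPI_Comm_size, with the resulting range passed on to the CUDA side. For 10 rows on 4 processes the ranges are [0,3), [3,6), [6,8) and [8,10).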

CUDA code: dgemm_cuda.cu

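#include <stdio.h>

#define N 10  /* number of elements; one thread per element */

__global__ void cuda_vecadd(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

extern "C" void invoke_cuda_vecadd()
{
    int hostarray1[N], hostarray2[N], hostarray3[N];
    int *devarray1, *devarray2, *devarray3;

    /* fill the inputs with arbitrary sample values */
    for (int i = 0; i < N; i++) {
        hostarray1[i] = i;
        hostarray2[i] = 2 * i;
    }

    cudaMalloc((void**) &devarray1, sizeof(int)*N);
    cudaMalloc((void**) &devarray2, sizeof(int)*N);
    cudaMalloc((void**) &devarray3, sizeof(int)*N);
    cudaMemcpy(devarray1, hostarray1, sizeof(int)*N, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, hostarray2, sizeof(int)*N, cudaMemcpyHostToDevice);
    cuda_vecadd<<<1, N>>>(devarray1, devarray2, devarray3);  /* 1 block of N threads */
    cudaMemcpy(hostarray3, devarray3, sizeof(int)*N, cudaMemcpyDeviceToHost);
    printf("hostarray3[%d] = %d\n", N - 1, hostarray3[N - 1]);  /* show one result */
    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}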
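The kernel launch above uses a single block of 10 threads, which only works for very small arrays. For larger inputs the launch needs several blocks, and the usual way to pick the count is a ceiling division. The following is a minimal host-side sketch in plain C. The name blocks_for is hypothetical, and a block size such as 256 is an assumption, not a value from the original code.

```c
#include <assert.h>

/* Number of thread blocks needed so that blocks * threads_per_block >= n.
 * blocks_for is a hypothetical helper name. */
unsigned int blocks_for(unsigned int n, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    /* ceiling division written so that it cannot overflow */
    return n / threads_per_block + (n % threads_per_block != 0 ? 1u : 0u);
}
```

When n is not a multiple of the block size, the last block contains threads whose index is past the end of the arrays. The kernel as shown has no length parameter, so it would need one plus a guard such as if (index < n) before it could be launched this way.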

Note: Mixing MPI and CUDA code may cause problems during linking because C and C++ use different linkage conventions (C++ mangles function names). nvcc compiles host code as C++, so invoke_cuda_vecadd is declared extern "C", which instructs the compiler to make that function callable from C code.
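A common way to keep the two sides in agreement is a small shared header that both pmm_mpi.c and dgemm_cuda.cu include. The sketch below assumes a header file named invoke_cuda.h, which is a hypothetical name and not part of the attached package.

```c
/* invoke_cuda.h (hypothetical file name) */
#ifndef INVOKE_CUDA_H
#define INVOKE_CUDA_H

#ifdef __cplusplus
extern "C" {    /* only seen when compiled as C++, e.g. by nvcc */
#endif

/* Implemented in dgemm_cuda.cu, called from pmm_mpi.c. */
void invoke_cuda_vecadd(void);

#ifdef __cplusplus
}
#endif

#endif /* INVOKE_CUDA_H */
```

The C compiler sees a plain declaration, while nvcc sees the same declaration wrapped in extern "C", so both object files refer to the same unmangled symbol name.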

Compiling the MPI/CUDA program:

Load the Modules
> module load IntelMPI # load Intel MPI
> module load Intel # load icc
> module load cuda # load cuda tools
This loads Intel MPI, the Intel compiler, and the CUDA tools. Next, compile the code with

> nvcc -c dgemm_cuda.cu -o dgemm_cuda.o
> mpiicc -c pmm_mpi.c -o pmm_mpi.o
> mpiicc -o mpicuda pmm_mpi.o dgemm_cuda.o -lcudart -lcublas -L /opt/cuda/lib64 -I /opt/cuda/include

Note: The CUDA compiler nvcc is used only to compile the CUDA source file; the Intel MPI compiler wrapper mpiicc compiles the C code and does the linking.

Setting Up and Submitting MPI Jobs:

1. qsub -I -l nodes=4 -q delta        # request 4 nodes from FutureGrid (interactive job)
2. uniq /var/spool/torque/aux/399286.i136 > gpu_nodes_list       # create machine file list (399286.i136 is the job ID; substitute your own)
3. module load IntelMPI                # load Intel MPI
4. module load Intel                     # load icc
5. module load cuda                     # load cuda tools
6. mpdboot -r ssh -f gpu_nodes_list -n 4  # will start an mpd ring on 4 nodes including local host
7. mpiexec -l -machinefile gpu_nodes_list -n 4 ./mpicuda 10000 1 4  # run mpi program using 4 nodes

[Figure: Comparison between four implementations of sequential matrix multiplication on Delta]
