Archive for December 19th, 2014

Parallel Code: Maximizing your Performance Potential

Posted by Hemprasad Y. Badgujar on December 19, 2014


No matter what the purpose of your application is, one thing is certain. You want to get the most bang for your buck. You see research papers being published and presented making claims of tremendous speed increases by running algorithms on the GPU (e.g. NVIDIA Tesla), in a cluster, or on a hardware accelerator (such as the Xeon Phi or Cell BE). These architectures allow for massively parallel execution of code that, if done properly, can yield lofty performance gains.

Unlike most aspects of programming, the actual writing of the programs is (relatively) simple. Most hardware accelerators support (or are very similar to) C-based programming languages, which makes hitting the ground running with parallel coding an achievable task. While mastering the development of massively parallel code is an entirely different matter, with a basic understanding of the principles behind efficient parallel code, one can obtain substantial performance increases compared to traditional programming and serial execution of the same algorithms.

In order to ensure that you’re getting the most bang for your buck in terms of performance increases, you need to be aware of the bottlenecks associated with coprocessor/GPU programming. Fortunately for you, I’m here to make this an easier task. By simply avoiding these programming “No-No’s” you can optimize the performance of your algorithm without having to spend hundreds of hours learning about every nook and cranny of the architecture of your choice. This series will discuss and demystify these performance-robbing bottlenecks, and provide simple ways to make these a non-factor in your application.

Parallel Thread Management – Topic #1

First and foremost, the most important aspect of parallel programming is the proper management of threads. A thread is the smallest sequence of programmed instructions that can be managed independently by a scheduler. Your application’s threads must be kept busy (not waiting) and non-divergent. Properly scheduling and directing threads is imperative to avoid wasting precious computing time.
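
To make the divergence point concrete, here is a minimal CUDA sketch of my own (not from the post itself): the first kernel branches on even/odd thread index, so threads within the same warp take different paths and those paths execute serially; the second branches per warp, so every warp stays on a single path.

__global__ void divergent_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)                      /* even/odd threads diverge inside a warp */
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }
}

__global__ void coherent_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / warpSize) % 2 == 0)         /* branch varies per warp, not per thread */
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }
}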


Running MPI/GPU program

Posted by Hemprasad Y. Badgujar on December 19, 2014


GPUs can perform mathematical operations at a fraction of the cost, and with higher performance, than the current generation of processors. FutureGrid provides the ability to test such an infrastructure as part of its Delta cluster. Here, we provide a step-by-step guide on how to run a parallel matrix multiplication program using Intel MPI and CUDA on Delta machines. The MPI framework distributes the work among compute nodes, each of which uses CUDA to execute its share of the workload. We also provide, as an attachment, the complete parallel matrix multiplication code using MPI/CUDA that has already been tested on the Delta cluster.

Source Code Package

MPI code: pmm_mpi.c

#include <mpi.h>

void invoke_cuda_vecadd();

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* get number of processes */
    invoke_cuda_vecadd();                    /* call the CUDA code */
    MPI_Finalize();
    return 0;
}

CUDA code: dgemm_cuda.cu

#include <stdio.h>

__global__ void cuda_vecadd(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

extern "C" void invoke_cuda_vecadd()
{
    int hostarray1[10], hostarray2[10], hostarray3[10];   /* host arrays (filled with real data in practice) */
    int *devarray1, *devarray2, *devarray3;

    /* allocate device buffers and copy the inputs to the GPU */
    cudaMalloc((void**) &devarray1, sizeof(int)*10);
    cudaMalloc((void**) &devarray2, sizeof(int)*10);
    cudaMalloc((void**) &devarray3, sizeof(int)*10);
    cudaMemcpy(devarray1, hostarray1, sizeof(int)*10, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, hostarray2, sizeof(int)*10, cudaMemcpyHostToDevice);

    /* launch one block of 10 threads, then copy the result back to the host */
    cuda_vecadd<<<1, 10>>>(devarray1, devarray2, devarray3);
    cudaMemcpy(hostarray3, devarray3, sizeof(int)*10, cudaMemcpyDeviceToHost);

    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}

Note: Mixing MPI and CUDA code may cause problems during linking because of the difference between the C and C++ calling conventions. Wrapping invoke_cuda_vecadd in extern "C" instructs nvcc (which compiles as C++) to give that function C linkage so it can be called from the C code.
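
If you would rather not repeat the prototype by hand in the C file, one common pattern is a small shared header guarded by __cplusplus, so the same declaration compiles as C under mpiicc and as C++ under nvcc. The header below is only a sketch of that pattern; the file name is mine and is not part of the attached code:

/* cuda_kernels.h -- hypothetical shared header, not part of the attached code */
#ifndef CUDA_KERNELS_H
#define CUDA_KERNELS_H

#ifdef __cplusplus
extern "C" {
#endif

void invoke_cuda_vecadd(void);   /* implemented in dgemm_cuda.cu */

#ifdef __cplusplus
}
#endif

#endif /* CUDA_KERNELS_H */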

Compiling the MPI/CUDA program:

Load the modules:

> module load IntelMPI   # load Intel MPI
> module load Intel      # load icc
> module load cuda       # load cuda tools

This will load Intel MPI, the Intel compiler, and the CUDA tools. Next, compile the code with:

> nvcc -c dgemm_cuda.cu -o dgemm_cuda.o
> mpiicc -c pmm_mpi.c -o pmm_mpi.o
> mpiicc -o mpicuda pmm_mpi.o dgemm_cuda.o -lcudart -lcublas -L /opt/cuda/lib64 -I /opt/cuda/include

Note: The CUDA compiler nvcc is used only to compile the CUDA source file, and the Intel MPI compiler mpiicc is used to compile the C code and perform the linking.
Setting Up and Submitting MPI Jobs:

1. qsub -I -l nodes=4 -q delta        # get 4 nodes from FG
2. uniq /var/spool/torque/aux/399286.i136 > gpu_nodes_list       #create machine file list
3. module load IntelMPI                # load Intel MPI
4. module load Intel                     # load icc
5. module load cuda                     # load cuda tools
6. mpdboot -r ssh -f gpu_nodes_list -n 4  # will start an mpd ring on 4 nodes including local host
7. mpiexec -l -machinefile gpu_nodes_list -n 4 ./mpicuda 10000 1 4  # run mpi program using 4 nodes

Comparison between four implementations of sequential matrix multiplication on Delta (figure not reproduced here).


HOW TO MIX MPI AND CUDA IN A SINGLE PROGRAM

Posted by Hemprasad Y. Badgujar on December 19, 2014


MPI is a well-known programming model for distributed-memory computing. If you have access to GPU resources, MPI can be used to distribute tasks to computers, each of which can use its CPU and GPU to process its share of the work.

My toy problem at hand is to use a mix of MPI and CUDA to handle a traditional sparse-matrix vector multiplication. Each node uses both CPU and GPU resources. The program can be structured as follows (a rough MPI skeleton of steps 2-4 is sketched after the list):
  1. Read a sparse matrix from disk, and split it into sub-matrices.
  2. Use MPI to distribute the sub-matrices to processes.
  3. Each process calls a CUDA kernel to handle the multiplication. The result of the multiplication is copied back to that node's host memory.
  4. Use MPI to gather results from each of the processes, and re-form the final matrix.
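
The skeleton below is my own simplified illustration of steps 2-4, not the attached code. To keep MPI_Scatter/MPI_Gather straightforward it treats the matrix as dense row blocks (a real sparse version would distribute CSR arrays instead), and spmv_on_gpu() is a hypothetical stand-in for the CUDA wrapper each rank would call:

#include <stdlib.h>
#include "mpi.h"

/* hypothetical CUDA wrapper: y_part = rows * x, where rows is nrows x ncols */
void spmv_on_gpu(const double *rows, const double *x,
                 double *y_part, int nrows, int ncols);

int main(int argc, char *argv[])
{
    int rank, nprocs;
    int n = 1024;                              /* assume n is divisible by nprocs */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int local_rows = n / nprocs;
    double *A = NULL, *y = NULL;
    double *x       = malloc(n * sizeof(double));
    double *A_local = malloc((size_t)local_rows * n * sizeof(double));
    double *y_local = malloc(local_rows * sizeof(double));

    if (rank == 0) {                           /* step 1: root reads/fills the full matrix */
        A = malloc((size_t)n * n * sizeof(double));
        y = malloc(n * sizeof(double));
        /* ... read A and x from disk ... */
    }

    /* step 2: distribute row blocks and broadcast the input vector */
    MPI_Scatter(A, local_rows * n, MPI_DOUBLE,
                A_local, local_rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* step 3: each rank multiplies its block on the GPU */
    spmv_on_gpu(A_local, x, y_local, local_rows, n);

    /* step 4: gather the partial results back on the root */
    MPI_Gather(y_local, local_rows, MPI_DOUBLE,
               y, local_rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}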

One of the options is to put both MPI and CUDA code in a single file, spaghetti.cu. This program can be compiled using nvcc, which internally uses gcc/g++ to compile your C/C++ code, and linked against your MPI library:

nvcc -I/usr/mpi/gcc/openmpi-1.4.6/include -L/usr/mpi/gcc/openmpi-1.4.6/lib64 -lmpi spaghetti.cu -o program

The downside is it might end up being a plate of spaghetti, if you have some seriously long program.

Another, cleaner option is to keep the MPI and CUDA code in two separate files: main.c and multiply.cu, respectively. These two files can be compiled with mpicc and nvcc, respectively, into object files (.o) and then combined into a single executable using mpicc. This is the mirror image of the approach above: since mpicc performs the final link, you have to link against the CUDA runtime library explicitly.

module load openmpi cuda #(optional) load modules on your node
mpicc -c main.c -o main.o
nvcc -arch=sm_20 -c multiply.cu -o multiply.o
mpicc main.o multiply.o -lcudart -L/apps/CUDA/cuda-5.0/lib64/ -o program

And finally, you can request two processes and two GPUs to test your program on the cluster using PBS script like:

#PBS -l nodes=2:ppn=2:gpus=2
mpiexec -np 2 ./program

The main.c file, containing the call into the CUDA code, would look like:

#include "mpi.h"

void call_me_maybe(void);   /* defined in multiply.cu */

int main(int argc, char *argv[])
{
    int myRank, numProcs;

    /* It's important to put this call at the beginning of the program,
       after variable declarations. */
    MPI_Init(&argc, &argv);

    /* Get the number of MPI processes and the rank of this process. */
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    /* Call function 'call_me_maybe' from CUDA file multiply.cu */
    call_me_maybe();

    /* ... */

    MPI_Finalize();
    return 0;
}

And in multiply.cu, define call_me_maybe() with extern "C" to make it callable from main.c (without an additional #include):

/* multiply.cu */
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__ ()
{
}

extern "C" void call_me_maybe()
{
    /* ... load CPU data into GPU buffers */
    __multiply__ <<< ...block configuration... >>> (x, y);
    /* ... transfer data from GPU to CPU */
}

 

Mixing MPI and CUDA

Mixing MPI (C) and CUDA (C++) code requires some care during linking because of differences between the C and C++ calling conventions and runtimes. A helpful overview of the issues can be found at How to Mix C and C++.

One option is to compile and link all source files with a C++ compiler, which will enforce additional restrictions on C code. Alternatively, if you wish to compile your MPI/C code with a C compiler and call CUDA kernels from within an MPI task, you can wrap the appropriate CUDA-compiled functions with extern "C", as in the following example.

The two source files shown below can be compiled with the C and C++ compilers, respectively, and linked into a single executable on Oscar using:

$ module load mvapich2 cuda
$ mpicc -c main.c -o main.o
$ nvcc -c multiply.cu -o multiply.o
$ mpicc main.o multiply.o -lcudart

The CUDA/C++ compiler nvcc is used only to compile the CUDA source file, and the MPI C compiler mpicc is used to compile the C code and to perform the linking.

/* multiply.cu */

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__ (const float *a, float *b)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(const float *a, float *b)
{
    /* ... load CPU data into GPU buffers a_gpu and b_gpu */

    __multiply__ <<< ...block configuration... >>> (a_gpu, b_gpu);

    /* safecall() is an error-checking wrapper assumed to be defined elsewhere */
    safecall(cudaThreadSynchronize());
    safecall(cudaGetLastError());

    /* ... transfer data from GPU to CPU */
}
Note the use of extern "C" around the function launch_multiply, which instructs the C++ compiler (nvcc in this case) to make that function callable from the C runtime. The following C code shows how the function could be called from an MPI task.

/* main.c */

#include <mpi.h>

void launch_multiply(const float *a, float *b);

int main (int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &nprocs);

    /* ... prepare arrays a and b */

    launch_multiply (a, b);

    MPI_Finalize();
    return 0;
}


 