Something More for Research

Explorer of Research #HEMBAD

CUDA Unified Memory

Posted by Hemprasad Y. Badgujar on October 6, 2015


CUDA is Nvidia’s parallel computing platform and programming model for its GPUs.  To extract maximum performance from Nvidia GPUs, you’ll want to develop applications in CUDA.

The CUDA Toolkit is the primary development environment for building CUDA-enabled applications.  It is not an IDE in the strict sense; it bundles the nvcc compiler, GPU-accelerated libraries, and debugging and profiling tools.  The main roles of the Toolkit are to simplify the software development process, maximize software developer productivity, and provide features that enhance GPU performance.  The Toolkit has been steadily evolving in tandem with GPU hardware; the version discussed here is 6.5.

One of the most important features of CUDA 6.5 is Unified Memory (UM), which was first introduced in CUDA v.6.0.  CPU host memory and GPU device memory are physically separate entities, connected by a relatively slow PCIe bus.  Prior to v.6.0, data shared between CPU and GPU required two copies: one in CPU memory and one in GPU memory.  Developers had to allocate memory on the CPU, allocate memory on the GPU, and then copy data from CPU to GPU and from GPU to CPU.  This dual data management scheme added complexity to programs, created opportunities for software bugs, and pulled time and energy toward data management tasks.

UM corrects this.  UM creates a memory pool that is shared between CPU and GPU, with a single memory address space and single pointers accessible to both host and device code.  The CUDA driver and runtime libraries automatically migrate data between host and device memory, thus relieving developers from the burden of explicitly managing those transfers.  UM can help performance by moving data to whichever processor is accessing it, although carefully hand-tuned explicit copies may still be faster.  UM also keeps the host and device views of the data coherent at kernel launch and synchronization points, which removes a whole class of stale-copy bugs; the host should call cudaDeviceSynchronize() before touching managed data that a kernel has written.
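To make the single-pointer idea concrete, here is a minimal sketch of our own (it is not part of the Nvidia template discussed later).  The kernel addOne and the helper sumAfterIncrement are hypothetical names chosen for illustration.  The host fills a managed array, the GPU increments it through the same pointer, and the host reads the result back after synchronizing.

```cuda
#include <cuda_runtime.h>

// Each thread adds 1 to one element of the array.
__global__ void addOne(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

// Hypothetical helper: fills a managed array with 0..n-1 on the host,
// increments every element on the GPU, and returns the sum computed on
// the host.  Returns -1 if a CUDA call fails.
long long sumAfterIncrement(int n) {
  int *data = nullptr;
  if (cudaMallocManaged((void **)&data, n * sizeof(int)) != cudaSuccess)
    return -1;

  for (int i = 0; i < n; ++i) data[i] = i;   // host writes through the pointer

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  addOne<<<blocks, threads>>>(data, n);      // device uses the same pointer

  // Wait for the kernel to finish before the host touches the data again.
  if (cudaDeviceSynchronize() != cudaSuccess) {
    cudaFree(data);
    return -1;
  }

  long long sum = 0;
  for (int i = 0; i < n; ++i) sum += data[i]; // host reads the GPU's results
  cudaFree(data);
  return sum;
}
```

Calling sumAfterIncrement(1000) should return 500500 on a machine with a CUDA-capable GPU: the values 0..999 sum to 499500, and each of the 1000 elements gains 1.  Note that there is no cudaMemcpy() anywhere in the sketch.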

Let’s explore some sample code that illustrates these concepts.  We won’t concern ourselves with the function of this algorithm; instead, we’ll just focus on the syntax. (Credit to Nvidia for this C/CUDA template example).

Without Unified Memory

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
struct DataElement
{
  char *name;
  int value;
};
__global__
void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value);
  elem->name[0] = 'd';
  elem->value++;
}
void launch(DataElement *elem) {
  DataElement *d_elem;
  char *d_name;
  int namelen = strlen(elem->name) + 1;
  // Allocate memory on GPU
  cudaMalloc(&d_elem, sizeof(DataElement));
  cudaMalloc(&d_name, namelen);
  // Copy data from CPU to GPU
  cudaMemcpy(d_elem, elem, sizeof(DataElement),
     cudaMemcpyHostToDevice);
  cudaMemcpy(d_name, elem->name, namelen, cudaMemcpyHostToDevice);
  // Patch the device struct so its name pointer refers to the device string
  cudaMemcpy(&(d_elem->name), &d_name, sizeof(char*),
     cudaMemcpyHostToDevice);
  // Launch kernel
  Kernel<<< 1, 1 >>>(d_elem);
  // Copy data from GPU to CPU
  cudaMemcpy(&(elem->value), &(d_elem->value), sizeof(int),
     cudaMemcpyDeviceToHost);
  cudaMemcpy(elem->name, d_name, namelen, cudaMemcpyDeviceToHost);
  cudaFree(d_name);
  cudaFree(d_elem);
}
int main(void)
{
  DataElement *e;
  // Allocate memory on CPU
  e = (DataElement*)malloc(sizeof(DataElement));
  e->value = 10;
  // Allocate memory on CPU
  e->name = (char*)malloc(sizeof(char) * (strlen("hello") + 1));
  strcpy(e->name, "hello");
  launch(e);
  printf("On host: name=%s, value=%d\n", e->name, e->value);
  free(e->name);
  free(e);
  cudaDeviceReset();
}

Note these key points:

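  • main(): Allocate memory on CPU with malloc() (first the struct, then the name string)
  • launch(): Allocate memory on GPU with cudaMalloc() (d_elem and d_name)
  • launch(): Copy data from CPU to GPU (three cudaMemcpy() calls with cudaMemcpyHostToDevice; the third points d_elem->name at d_name)
  • launch(): Run kernel (Kernel<<< 1, 1 >>>(d_elem))
  • launch(): Copy data from GPU to CPU (two cudaMemcpy() calls with cudaMemcpyDeviceToHost)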
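Every one of those steps can fail, and the template ignores the return codes for brevity.  As a hedged sketch of our own, here is the same allocate / copy / launch / copy-back sequence on a plain int array with each cudaError_t checked.  The names doubleAll and doubleOnDevice are hypothetical and do not come from the Nvidia template.

```cuda
#include <cuda_runtime.h>

// Each thread doubles one element of the device array.
__global__ void doubleAll(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2;
}

// Hypothetical helper: doubles host[0..n) on the GPU using explicit copies.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t doubleOnDevice(int *host, int n) {
  int *dev = nullptr;
  size_t bytes = n * sizeof(int);

  // Allocate memory on GPU
  cudaError_t err = cudaMalloc((void **)&dev, bytes);
  if (err != cudaSuccess) return err;

  // Copy data from CPU to GPU
  err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

  // Launch kernel; cudaGetLastError() reports launch failures
  if (err == cudaSuccess) {
    int threads = 256;
    doubleAll<<<(n + threads - 1) / threads, threads>>>(dev, n);
    err = cudaGetLastError();
  }

  // Copy data from GPU to CPU; this cudaMemcpy() waits for the kernel
  if (err == cudaSuccess)
    err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

  cudaFree(dev);
  return err;
}
```

For example, passing an array holding {1, 2, 3, 4} with n = 4 should leave it holding {2, 4, 6, 8} and return cudaSuccess.  Even for this trivial case the host has to track two pointers and two copy directions.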

With Unified Memory 

#include <string.h>
#include <stdio.h>
struct DataElement
{
  char *name;
  int value;
};
__global__
void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value);
  elem->name[0] = 'd';
  elem->value++;
}
void launch(DataElement *elem) {
  // Launch kernel
  Kernel<<< 1, 1 >>>(elem);
  cudaDeviceSynchronize();
}
int main(void)
{
  DataElement *e;
  // Allocate unified memory on CPU and GPU
  cudaMallocManaged((void**)&e, sizeof(DataElement));
  e->value = 10;
  // Allocate unified memory on CPU and GPU
  cudaMallocManaged((void**)&(e->name), sizeof(char) *
     (strlen("hello") + 1) );
  strcpy(e->name, "hello");
  launch(e);
  printf("On host: name=%s, value=%d\n", e->name, e->value);
  cudaFree(e->name);
  cudaFree(e);
  cudaDeviceReset();
}
 

Note these key points:

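  • main(): Allocate unified memory on CPU and GPU with cudaMallocManaged() (first the struct, then the name string)
  • launch(): Run kernel (Kernel<<< 1, 1 >>>(elem)), then call cudaDeviceSynchronize() so the host waits before reading the results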

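With UM, memory is allocated in a single address space visible to both CPU and GPU and managed with a single pointer.  Note how the malloc() and cudaMalloc() calls are condensed into single calls to cudaMallocManaged(), and how free() and cudaFree() collapse into cudaFree() alone.  Furthermore, explicit cudaMemcpy() data transfers between CPU and GPU are eliminated, as the CUDA runtime handles these transfers automatically in the background.  The one new obligation is the cudaDeviceSynchronize() call in launch(): without a blocking cudaMemcpy() to wait on, the host must synchronize explicitly before it reads what the kernel wrote.  Collectively these changes simplify code development, code maintenance, and data management.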
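In C++ code the cudaMallocManaged() call can be pushed one level further down, so that ordinary new and delete hand out managed memory.  The sketch below is our own illustration of that idea, not part of the template above; the names Managed, Counter, bump, and bumpOnce are hypothetical.

```cuda
#include <cstddef>
#include <new>
#include <cuda_runtime.h>

// Base class: objects created with `new` are placed in managed memory.
struct Managed {
  void *operator new(size_t len) {
    void *ptr = nullptr;
    if (cudaMallocManaged(&ptr, len) != cudaSuccess) throw std::bad_alloc();
    return ptr;
  }
  void operator delete(void *ptr) {
    cudaDeviceSynchronize();   // make sure no kernel is still using the object
    cudaFree(ptr);
  }
};

// Any type derived from Managed can be passed to a kernel by pointer.
struct Counter : public Managed {
  int value;
};

__global__ void bump(Counter *c) { c->value++; }

// Hypothetical helper: creates a Counter, increments it once on the GPU,
// and returns the value as seen by the host.
int bumpOnce(int start) {
  Counter *c = new Counter;    // allocated through cudaMallocManaged()
  c->value = start;
  bump<<<1, 1>>>(c);
  cudaDeviceSynchronize();     // wait before the host reads the result
  int result = c->value;
  delete c;
  return result;
}
```

With this in place, bumpOnce(10) should return 11, matching the value++ in the template kernel.  The trade-off is that every object of a derived type now goes through the CUDA allocator, which is heavier than malloc(), so this pattern suits a few long-lived objects better than many small ones.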

As software project managers, we like UM for the productivity enhancements it provides for our software development teams.  It reduces coding time, effort and cost, and removes a common source of memory-management bugs.  As software engineers, we like UM because of reduced coding effort and the fact that we can focus time and effort on writing CUDA kernel code, where all the parallel performance comes from, instead of spending time on memory management tasks.  Unified Memory is a major step forward in GPU programming.
