
Parallel Code: Maximizing your Performance Potential

Posted by Hemprasad Y. Badgujar on December 19, 2014

No matter what the purpose of your application is, one thing is certain. You want to get the most bang for your buck. You see research papers being published and presented making claims of tremendous speed increases by running algorithms on the GPU (e.g. NVIDIA Tesla), in a cluster, or on a hardware accelerator (such as the Xeon Phi or Cell BE). These architectures allow for massively parallel execution of code that, if done properly, can yield lofty performance gains.

Unlike most aspects of programming, the actual writing of the programs is (relatively) simple. Most hardware accelerators support (or are very similar to) C-based programming languages. This makes hitting the ground running with parallel coding an achievable task. While mastering the development of massively parallel code is an entirely different matter, with a basic understanding of the principles behind efficient parallel code, one can obtain substantial performance increases compared to traditional programming and serial execution of the same algorithms.

In order to ensure that you’re getting the most bang for your buck in terms of performance increases, you need to be aware of the bottlenecks associated with coprocessor/GPU programming. Fortunately for you, I’m here to make this an easier task. By simply avoiding these programming “No-No’s” you can optimize the performance of your algorithm without having to spend hundreds of hours learning about every nook and cranny of the architecture of your choice. This series will discuss and demystify these performance-robbing bottlenecks, and provide simple ways to make these a non-factor in your application.

Parallel Thread Management – Topic #1

First and foremost, the most important thing with regard to parallel programming is the proper management of threads. Threads are the smallest sequence of programmed instructions that are able to be utilized by an operating system scheduler. Your application’s threads must be kept busy (not waiting) and non-divergent. Properly scheduling and directing threads is imperative to avoid wasting precious computing time.

CUDA Parallel Thread Management

Regardless of the environment or architecture you are using, one thing is certain: you must properly manage the threads running in your application to optimize performance. This post will discuss how to get the most out of your threads in a CUDA application.

CUDA Threads

CUDA threads utilize block and thread IDs to determine what data to compute. Block IDs can be 1D or 2D. Thread IDs can be 1D, 2D, or 3D. Utilizing multidimensional threads and blocks greatly simplifies memory addressing when performing operations on multidimensional data (a very common occurrence in image processing, for example). You, the programmer, declare the block's dimensionality (1D, 2D, or 3D) and its size in threads (up to 512 threads per block on early devices; 1024 on compute capability 2.0 and later). Within a block, all of the threads are capable of sharing data and synchronizing. The image below depicts the CUDA grid/block/thread structure.
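To make this concrete, here is a minimal sketch of how a kernel combines its block and thread IDs into a global data index. The kernel name, array, and launch configuration are illustrative, not from the original post:

```cuda
// Illustrative kernel: each thread handles one element of a 1D array.
__global__ void scaleArray(float *data, int n, float factor)
{
    // Unique global index built from the block ID, block size, and thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                  // guard threads that fall past the end
        data[idx] *= factor;
}

// Host-side launch: 256 threads per block, enough blocks to cover all n elements
// scaleArray<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
```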

Depiction of the threads, blocks and grids during a CUDA execution.

So, assuming you’ve got your kernel up and running, how do you properly manipulate the threads running in your application? For starters, declaring the proper block/grid size (and dimension) is paramount. The appropriate size for these parameters is hardware- and device-dependent and must be fiddled with through trial and error. There’s not really a “General Rule” for determining the values for these parameters outside of really knowing the data in your application and the limitations of your hardware. Let’s say for now, that your block and grid sizes are sufficient.

With appropriate block and grid sizes/dimensions, there are two keys to optimizing your application’s performance: thread communications and thread paths.

Thread Communications

When the threads (in the same block) in your application need to communicate or share data, there are two methods that CUDA provides: shared memory and __syncthreads().  The __syncthreads() command effectively creates a barrier in the code execution where all of the threads in a given block will wait until all of the threads in that block have reached the synchronization point. This is especially useful for ensuring that computation data is written to memory before other threads read it. Improper use of this command, however, can create deadlock conditions and cause your application to hang. Deadlocks are literally a show stopper since they will cause your application to stop dead in its tracks.

Maximizing the use of shared memory will be discussed in much greater detail in a later post. Effectively utilizing shared memory is an absolute necessity for a high-performance application. Shared memory is hundreds of times faster than global memory. A common method of scheduling computations on a device that maximizes the use of shared memory is, at a high level, relatively simple:

  1. Partition the data into smaller subsets that fit into shared memory
  2. Load the subset from global memory to shared memory
  3. Call __syncthreads() to synchronize the threads
  4. Perform the computation on the data subset via shared memory
  5. Copy the results from shared memory to global memory

Structuring your code in this fashion will pay great dividends.
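As a sketch, the five steps above might look like the following kernel. The tile size and the squaring operation are assumptions made for illustration:

```cuda
#define TILE 256

__global__ void tiledSquare(float *out, const float *in)
{
    __shared__ float tile[TILE];                 // 1. a subset that fits in shared memory
    int idx = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = in[idx];                 // 2. load global -> shared

    __syncthreads();                             // 3. wait until the whole tile is loaded

    float value = tile[threadIdx.x] * tile[threadIdx.x];   // 4. compute from shared memory

    out[idx] = value;                            // 5. write results back to global memory
}
```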

Thread Paths

The other aspect of managing threads is controlling the paths of your threads. In nearly every application, it is almost impossible to structure code without branches (e.g. if/else conditions). Threads in the same block that execute different pieces of code (different execution paths) as a result of branch conditions are said to be divergent. When threads within the same warp have different execution paths, their execution must be serialized. Since all threads in a warp (a group of 32 threads) execute the same instruction stream, if any thread enters the body of an IF condition (or if-then-else, for loop, etc.), every thread in that warp steps through that section of code. This occurs even for the threads that are not actually executing it (because the branch condition was not met)! If half of the threads in a given warp evaluate a branch condition as true, the utilization of the execution units is only 50%, meaning that half of the threads are effectively DOING NOTHING! The actual performance impact depends on the size and frequency of these divergent branch conditions.

Divergence can be avoided when a branch condition is a function of the thread ID. An example of code that would likely produce divergence:
if(threadIdx.x > 4) { //your code }

This divergence is a result of the branch granularity being less than the warp size. By making the branch granularity a whole multiple of the warp size (instead of less than the warp size), this divergence can be completely eliminated:
if(threadIdx.x/WARP_SIZE > 4) { //your code }

Optimizing in the Real World

I know what you’re thinking: “All of this is great information, Justin, but how can I check my code for deadlocks and divergent branches?” Easy – go through every line of code in the entire application with a fine-toothed comb.

Well, that doesn’t sound easy. Fortunately, the NVIDIA CUDA profiler provides a very robust means for identifying these problems in your code. There are visual and text-based versions of the profiler – I’ll be discussing the text version. From the command line, the values of four environment variables can be set:

  • CUDA_PROFILE (set to 1 or 0 to enable/disable the profiler)
  • CUDA_PROFILE_LOG (set to the name of the log file that will contain the profiler results)
  • CUDA_PROFILE_CSV (set to 1 or 0 to enable or disable a CSV version of the log)
  • Most importantly: CUDA_PROFILE_CONFIG (set to the name of a config file listing up to four event types to be profiled)

The CUDA profiler only supports four types of events being profiled at a time. Later posts will discuss the other event types of the profiler, but with regard to managing threads, two event types are essential to profile:

  • branch (number of branch events taken by threads)
  • divergent_branch (number of divergent branches within a warp)
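Putting it together, a profiling session from the command line might look like this (the log and config file names are arbitrary choices):

```shell
# Enable the profiler and choose where results are written
export CUDA_PROFILE=1
export CUDA_PROFILE_LOG=cuda_profile.log
export CUDA_PROFILE_CSV=0

# List the events to record (up to four) in a config file
cat > cuda_profile.cfg <<'EOF'
branch
divergent_branch
EOF
export CUDA_PROFILE_CONFIG=cuda_profile.cfg

# Now run the application as usual; the counts appear in cuda_profile.log
```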

Screenshot of NVIDIA CUDA Profiler: Analysis mode showing divergent branches

With these set, the profiler will output the number of branches and divergent branches that are encountered when executing the application, which provides invaluable insight as to which portions of code are degrading performance. Using this information, you can tell if any of the branch conditions are causing threads to diverge. In addition to the events that were chosen to be profiled, the profiler can also output the total execution time on both the CPU and GPU for the application/kernel, which can be used to gauge performance when tweaking code. Additional functions of the CUDA profiler will be discussed throughout the next several posts.

More information on the NVIDIA CUDA profiler, including information about the visual profiler, can be found in the Profiler User’s Guide.

Host/Device Transfers and Data Movement – Topic #2

Transferring data between the host and device is a very costly move. It is not uncommon to have code making multiple transactions between the host and device without the programmer’s knowledge. Cleverly structuring code can save tons of processing time! On top of that, it is imperative to understand the cost of these host/device transfers. In some cases, it may be more beneficial to run certain algorithms or pieces of code on the host due to the costly transfer time associated with farming data out to the device.

CUDA Host/Device Transfers and Data Movement

Profiling Your CUDA Code for Timing Data

In a standard CUDA application, several steps typically occur:

  1. Allocate memory on the device
  2. Copy data from host to device
  3. Perform some calculations
  4. Copy data from device to host
  5. Free the allocated device memory
  6. Rinse and Repeat

In the above list, steps 2 and 4 are an absolute necessity in every CUDA application, but they are also HUGE performance robbers. These transfers are the slowest portion of data movement involved in any aspect of GPU computing. The actual transfer speed (bandwidth) is dependent on the type of hardware you’re using, but regardless of this point, it is still the slowest. The example code below illustrates this point:

int main()
{
    const unsigned int X = 1048576;                // 2^20 elements
    const unsigned int bytes = X * sizeof(int);
    int *hostArray = (int*)malloc(bytes);
    int *deviceArray;

    cudaMalloc((void**)&deviceArray, bytes);

    // Copy the data to the device and straight back again
    cudaMemcpy(deviceArray, hostArray, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(hostArray, deviceArray, bytes, cudaMemcpyDeviceToHost);

    cudaFree(deviceArray);
    free(hostArray);
    return 0;
}
In this example, there are no operations being run on the device. The data is simply copied from the host to the device and back. To profile this code, it simply needs to be compiled with nvcc and then run with nvprof (nvprof is new in CUDA 5 – the older command-line profiler can still be used in earlier versions of CUDA):

$ nvcc profileExample.cu -o profileExample

$ nvprof ./profileExample
======== NVPROF is profiling profileExample...
======== Command: ./profileExample
======== Profiling result:
Time(%)     Time  Calls      Avg      Min      Max Name
  50.08 718.11us      1 718.11us 718.11us 718.11us [CUDA memcpy DtoH]
  49.92 715.94us      1 715.94us 715.94us 715.94us [CUDA memcpy HtoD]

On my desktop I run a GTX 680 graphics card. As you can see from the above results, a simple copy operation to/from the GPU takes in excess of 715 microseconds each way (a lifetime in terms of computation time). In complex applications with larger amounts of data going back and forth between the host and device many times, this can result in significant time being wasted on these transfers.

Alternative Profiling Options Using Timers

In addition to the nvprof profiler, any CPU timer can be used to measure the elapsed time of a CUDA call/function or kernel execution. It is important to note, if you’re using a CPU timer to measure the timing performance of a portion (or all) of your application, that many of the CUDA functions are asynchronous. This means that the function returns control to the associated thread prior to completing all of its work. If you’re using a CPU timer, you must synchronize the CPU thread associated with the timer with the device by calling cudaDeviceSynchronize() immediately before starting and stopping the CPU timer. This blocks the CPU thread until all the CUDA calls issued by that thread have been completed. CUDA also provides its own method for timing using events. The following example code snippet illustrates how to use the CUDA event timers to profile your code:

cudaEvent_t startTime, stopTime;
float time;

cudaEventCreate(&startTime);
cudaEventCreate(&stopTime);

cudaEventRecord(startTime, 0);               // enqueue the start event in stream 0
kernel<<<gridSize, blockSize>>>(d_data);     // the work being timed (placeholder kernel)
cudaEventRecord(stopTime, 0);                // enqueue the stop event in stream 0

cudaEventSynchronize(stopTime);              // wait for the stop event to be recorded
cudaEventElapsedTime(&time, startTime, stopTime);

cudaEventDestroy(startTime);
cudaEventDestroy(stopTime);
In this example, the cudaEventRecord() function call places the startTime and stopTime events into the default execution stream, 0. The device records a timestamp for each event when it reaches that event in the execution stream. cudaEventElapsedTime() simply returns the time in milliseconds (with roughly 0.5 us resolution) between the events.

Importance of Data Transfers in CUDA Applications

Analyzing these timing results can prove to be hugely beneficial in determining which portions of your application are the most expensive in terms of time. While there are a number of factors that can make one portion of code more expensive in terms of time, a good way to increase the performance of your application is to minimize the host/device transfers.

The peak theoretical bandwidth between device memory and the device processor is significantly higher than the peak theoretical bandwidth between the host memory and device memory. Therefore, in order to get the most bang for your buck in your application, you really need to minimize these host<->device data transfers. Many programmers are unaware of the high overhead associated with these transfers and by intelligently reducing or eliminating them, you can see very large gains in performance. Try performing a ‘before and after’ type test with your code. If you have multiple transfers occurring throughout your application, try reducing this number and observe the results.

The next post in this series will identify effective ways to optimize your code and avoid numerous transfers between the host and device. Utilizing pinned/mapped memory, asynchronous transfers, and overlapping transfers with computations can yield lofty performance gains if you have many host/device transfers occurring in your application.

More information about nvprof can be located at NVIDIA’s Developer Zone:
CUDA Toolkit Documentation – Profiler User’s Guide


Optimize CUDA Host/Device Transfers

In every single CUDA application (well any useful ones, that is) there is at the very least one host-to-device transfer and one device-to-host transfer. More complicated applications often have many transfers between the host and device. In CUDA programming, this is one of the most expensive operations in terms of timing.

So, if these host/device data transfers are so costly, how do you avoid them? Well, you can’t. But what you can do is minimize the number of transfers between host and device in your application, and mask their impact on the performance of your application.

First, any intermediate data structures that are used within your kernel should always be allocated and destroyed solely on the device. This removes the need to map these structures to host memory and removes the need to transfer this data between the host and device.

If your application has multiple host/device transfers, every effort should be made to batch these transfers into one large transfer. I like to think of this as if you were carrying groceries. Why make multiple trips out to the car when you can load up your arms and do it all at once? Most GPUs support transfer speeds between 5GB/sec and 11GB/sec.
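The groceries idea in code: a sketch assuming two equally sized host arrays that can be packed into one buffer (all names are illustrative):

```cuda
// Instead of two separate trips to the device...
//   cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
//   cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
// ...pack both arrays into one buffer and make a single, larger transfer:
float *h_packed = (float*)malloc(2 * bytes);
float *d_packed;
memcpy(h_packed,                h_a, bytes);   // first array
memcpy((char*)h_packed + bytes, h_b, bytes);   // second array

cudaMalloc((void**)&d_packed, 2 * bytes);
cudaMemcpy(d_packed, h_packed, 2 * bytes, cudaMemcpyHostToDevice);
```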

For situations where there is no way around transferring data between host and device, more advanced techniques can be employed to lessen the impact on your application: pinned (also known as page-locked, or mapped) memory and asynchronous transfers.

Pinned Memory

The cudaHostAlloc() function allows you to allocate host memory that the device can read from and write to directly. This allocated memory is called pinned memory. Pinned memory transfers attain the highest bandwidth between the host and device. During execution, a block that requires host data only needs to wait for a small portion of the data to be transferred (when operating through pinned memory). Typical host-to-device copies make all blocks wait until all of the data associated with the copy operation is transferred. Keep in mind, however, that pinning too much memory can degrade overall system performance by reducing the amount of memory available to the system for paging operations. How much memory you can safely pin differs from system to system, so definitely experiment with this to find the optimal amount.
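A minimal sketch of using pinned memory (error checking omitted; the size and names are illustrative):

```cuda
float *h_pinned, *d_data;
const size_t bytes = 1 << 20;

// Allocate page-locked (pinned) host memory instead of using malloc()
cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
cudaMalloc((void**)&d_data, bytes);

// Transfers from pinned memory attain the highest host/device bandwidth
cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

// Pinned memory must be released with cudaFreeHost(), not free()
cudaFreeHost(h_pinned);
cudaFree(d_data);
```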

Asynchronous Transfers

Standard host/device transfers are known as blocking transfers. Control of the main thread is returned only after the data transfer is complete. The cudaMemcpyAsync() function is effectively a non-blocking version of the standard cudaMemcpy(). When executing an asynchronous transfer via cudaMemcpyAsync(), control is returned immediately to the main thread. If you’re not jumping up and down with excitement after hearing that, you should be!

Asynchronous transfers require pinned memory and make use of CUDA streams. In CUDA, streams are essentially sequences of operations that are performed in order on the device. Creating multiple streams is a bit more of an advanced CUDA technique, but one that must be learned if you want the most bang for your buck. With multiple streams in a single application, operations within separate streams can be overlapped, providing a great way to mask the host/device transfer time. Let’s look at an example of how using multiple streams can benefit you and your application:

// Illustrative: both operations use the default stream, 0
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
kernel<<<gridSize, blockSize>>>(d_data);

Here, both the transfer and kernel are using the default stream, 0. During execution, the kernel will not be launched until the entire copy operation is complete and control has been returned back to the main thread. This is because both the kernel and memory copy are part of the same stream. Now, let’s look at the code using multiple streams:

// Illustrative: the copy and the kernel are issued into separate streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream1);
kernel<<<gridSize, blockSize, 0, stream2>>>(d_other);

By defining two new streams, we are able to make use of concurrent copy and compute. The memory copy is executing in one stream while the kernel is off in another stream, asynchronous from one another. An important note is to make sure that your device supports concurrent copy and execute before you put this in all of your code. This can be done via the deviceOverlap field of the cudaDeviceProp structure.

While this is an advanced technique, if your data can be broken into chunks and transferred in various stages, you can launch multiple kernel instances to operate on each chunk of data as it arrives on the device. Doing so will almost completely mask the transfer time between the host and device.
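A sketch of that chunked pipeline, assuming the data divides evenly among the streams (the kernel and variable names are placeholders):

```cuda
const int nStreams = 4;
const int chunk = n / nStreams;            // assume n divides evenly
cudaStream_t streams[4];

for (int i = 0; i < nStreams; i++)
    cudaStreamCreate(&streams[i]);

// Each stream copies its chunk and then launches a kernel on it; the copy in
// one stream overlaps with kernels already running in the others.
for (int i = 0; i < nStreams; i++) {
    int offset = i * chunk;
    cudaMemcpyAsync(d_data + offset, h_pinned + offset,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    myKernel<<<chunk / 256, 256, 0, streams[i]>>>(d_data + offset);
}

cudaDeviceSynchronize();                   // wait for all streams to finish
for (int i = 0; i < nStreams; i++)
    cudaStreamDestroy(streams[i]);
```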

So, armed with the knowledge of streams, asynchronous transfers, and pinned memory, you now have some insight on how to squeeze out some more performance from your application. My next post will discuss how to efficiently make use of the available memory types accessible to you within your GPU application.

Cache and Shared Memory Optimizations – Topic #3

In addition to managing the threads running in your application, properly utilizing the various memory types available on your device is paramount to ensuring that you’re squeezing every drop of performance from your application. Shared memory, local memory, and register memory all have their advantages and disadvantages and need to be used very carefully to avoid wasting valuable clock cycles. Phenomena such as bank conflicts, memory spilling (too much data being placed in registers and spilling into local memory),  improper loop unrolling, as well as the amount of shared memory, all play pivotal roles in obtaining the greatest performance.

GPU Memory Types – Performance Comparison

In terms of speed, if all the various types of device memory were to race, here’s how the race would turn out:

  • 1st place: Register file
  • 2nd place: Shared Memory
  • 3rd place: Constant Memory
  • 4th place: Texture Memory
  • Tie for last place: Local Memory and Global Memory

Looking at the above list, it would seem that to have the best performance we’d only want to use register file, shared memory, and constant memory. In a simple world I’d agree with that statement. However, there are many more factors associated with choosing the best form of memory for various portions of your application.

Memory Features

The only two types of memory that actually reside on the GPU chip are register and shared memory. Local, Global, Constant, and Texture memory all reside off chip. Local, Constant, and Texture are all cached.

While it would seem that the fastest memory is the best, the other two characteristics of the memory that dictate how that type of memory should be utilized are the scope and lifetime of the memory:

  • Data stored in register memory is visible only to the thread that wrote it and lasts only for the lifetime of that thread.
  • Local memory has the same scope rules as register memory, but performs slower.
  • Data stored in shared memory is visible to all threads within that block and lasts for the duration of the block. This is invaluable because this type of memory allows for threads to communicate and share data between one another.
  • Data stored in global memory is visible to all threads within the application (including the host), and lasts for the duration of the host allocation.
  • Constant and texture memory won’t be used here because they are beneficial for only very specific types of applications. Constant memory is used for data that will not change over the course of a kernel execution and is read-only. Using constant rather than global memory can reduce the required memory bandwidth; however, this performance gain can only be realized when a warp of threads reads the same location. Similar to constant memory, texture memory is another variety of read-only memory on the device. When all reads in a warp are physically adjacent, using texture memory can reduce memory traffic and increase performance compared to global memory.
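In code, the memory space of a variable follows from how it is declared. A quick reference sketch (all names are made up for the example):

```cuda
__constant__ float coeffs[16];            // constant memory: read-only during the kernel

__global__ void memorySpaces(float *globalData)   // pointer arguments address global memory
{
    int i = threadIdx.x;                  // small automatic variable: register
    float scratch[64];                    // large automatic array: typically local memory
    __shared__ float tile[128];           // shared memory: visible to the whole block

    scratch[0] = globalData[i];
    tile[i] = scratch[0] + coeffs[0];
    __syncthreads();
    globalData[i] = tile[i];
}
```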

How to Choose Memory Type

Knowing how and when to use each type of memory goes a long way towards optimizing the performance of your application. More often than not, it is best to make use of shared memory due to the fact that threads within the same block utilizing shared memory can communicate. Combined with its excellent performance, this makes shared memory a good ‘all around’ choice when used properly. In some cases however, it may be better to make use of the other types of available memory.

Shared Memory

A common problem arises when memory is shared: with all memory available to all threads, there will be many threads accessing the data simultaneously. To alleviate this potential bottleneck, shared memory is divided into 32 logical banks. Successive sections of memory are assigned to successive banks (see Figure 1).

Diagram of NVIDIA Kepler GPU architecture Shared Memory and L1 Cache Memory

Some facts about shared memory:

  • The total size of shared memory may be set to 16KB, 32KB or 48KB (with the remaining amount automatically used for L1 Cache) as shown in Figure 1. Shared memory defaults to 48KB (with 16KB remaining for L1 Cache).
  • With the Kepler architecture, each bank has a bandwidth of 64 bits per clock cycle. The older Fermi architecture was clocked differently, but effectively offered half this bandwidth.
  • There are 32 threads in a warp and exactly 32 shared memory banks. Because each bank services only one request per cycle, multiple simultaneous accesses to the same bank will result in what is known as a bank conflict. This will be discussed further in the next post.
  • GPUs section memory banks into 32-bit words (4 bytes). The Kepler architecture introduced the option to increase banks to 8 bytes using cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte). This can help avoid bank conflicts when accessing double precision data.

When there are no bank conflicts present, shared memory performance is comparable to register memory. Use it properly and shared memory will be lightning fast.

Register Memory

In most cases, accessing a register consumes zero extra clock cycles per instruction. However, delays can occur due to read-after-write dependencies and register memory bank conflicts. The latency of read-after-write dependencies is roughly 24 clock cycles. For newer CUDA devices that have 32 cores per multiprocessor, it may take up to 768 threads (24 cycles × 32 cores) to completely hide this latency.

In addition to the read after write latency, register pressure can severely detract from the performance of the application. Register pressure occurs when there are not enough registers available for a given task. When this occurs, the data is “spilled over” using local memory. See the following posts for further details.

Local Memory

Local memory is not a physical type of memory, but an abstraction of global memory. Its scope is local to the thread and it resides off-chip, which makes it as expensive to access as global memory. Local memory is used only to hold automatic variables. The compiler makes use of local memory when it determines that there is not enough register space to hold the variable. Automatic variables that are large structures or arrays are also typically placed in local memory.


All in all, for most applications my recommendation is definitely to try to make use of shared memory wherever possible. It is the most versatile and easy-to-use type of memory. Shared memory allows communication between threads within a block, which can make optimizing code much easier for beginner to intermediate programmers. The other types of memory all have their place in CUDA applications, but for the general case, shared memory is the way to go.


So now that you know a little bit about each of the various types of memory available to you in your GPU applications, you’re ready to learn how to efficiently use them. The next post will discuss how you can optimize the use of the various types of memory throughout your application.


GPU Shared Memory Performance Optimization

Think for a moment: global memory is up to 150x slower than some of the other types of device memory available. If you could reduce the number of global memory accesses needed by your application, then you’d realize a significant performance increase (especially if your application performs the same operations in a loop or things of that nature). The easiest way to obtain this performance gain is to coalesce your memory accesses to global memory. The number of concurrent global memory accesses of the threads in a given warp is equal to the number of cache lines needed to service all of the threads of the warp. So how do you coalesce your accesses you ask? There are many ways.

The simplest way to coalesce your memory accesses is to have the N-th thread in a warp access the N-th word in a cache line. If the threads in a warp are accessing adjacent 4-byte words (float, for example), a single cache line (and therefore, a single coalesced transaction) will service that memory access. Even if some words of the cache line are not requested by any thread in the warp (e.g., several of the threads access the same word, or some of the threads don’t participate in the access), all data in the cache line is fetched anyways. This results in a single global memory access (see Figure 1).

Diagram of NVIDIA Kepler Aligned Memory Accesses

If sequential threads in a warp access sequential memory locations, but the memory locations are not aligned with the cache lines (overlapping), two 128-byte (L1) cache lines will be requested. This results in 128 bytes of additional memory being fetched even though it is not needed (see the red blocks in Figure 2). Fortunately, memory allocated via cudaMalloc() is guaranteed to be aligned to at least 256 bytes. Choosing intelligent thread block sizes (typically multiples of the warp size) facilitates memory accesses by warps that are aligned to cache lines, meaning fewer memory accesses are needed. Let your mind wander for a moment as to what would happen to the memory locations that are accessed by the 2nd, 3rd, 4th, etc. thread blocks if the thread block size was not a multiple of warp size. Not good.

Diagram of NVIDIA Kepler Mis-Aligned Memory Accesses

So what happens if your memory accesses are misaligned? Let’s take a look. Below is a simple kernel that demonstrates aligned and misaligned accesses.

__global__ void misalignedCopy(float *outputData, float *inputData, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    outputData[xid] = inputData[xid];
}
In the code example above, data is copied from the array inputData to the array outputData. Both of these arrays exist in global memory. The kernel here is executed within a loop in host code that varies the offset between 0 and 32. Here, global memory accesses with 0 offset, or with offsets that are multiples of 32 words, result in a single cache line transaction. When the offset is not a multiple of 32 words, two L1 cache lines are loaded per warp. This results in roughly 80% of the memory throughput achieved compared to the case with no offsets.

Another technique, similar to coalescing, is known as striding. Strided memory accesses will be discussed in the next post.

Shared Memory Bank Conflicts

If your application is making use of shared memory, you’d expect to see increased performance compared to an implementation using only global memory. Because it is on-chip, shared memory has a much higher bandwidth and lower latency than global memory. But this speed increase requires that your application have no bank conflicts between threads.

In order to actually achieve the high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (also known as banks) that can be accessed simultaneously. This means any memory load/store of N memory addresses that spans N distinct memory banks can be serviced simultaneously (see Figure 3). In terms of performance gain, this means that the memory exhibits an effective bandwidth that is N times as high as that of a single memory module.

Diagram of NVIDIA Kepler Shared Memory Banks Parallel Accesses

The problem however, lies in situations where multiple addresses of a memory request map to the same memory bank. When this occurs (a bank conflict), the accesses are serialized, reducing the effective bandwidth. A memory request that has bank conflicts is split into as many separate conflict-free requests as necessary, which greatly reduces the performance of the application (by a factor that’s equal to the number of separate memory requests). As shown in Figure 4, serialized shared memory accesses can take much longer.

Diagram of NVIDIA Kepler Shared Memory Banks Serialized Accesses

The only exception is the case of shared memory broadcasts. These occur when all threads in a warp access the same location in shared memory. In this case, a bank conflict does not occur.
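A sketch of a conflict-free broadcast, where every thread in the warp reads the same shared memory word (the kernel name is illustrative):

```cuda
__global__ void broadcastScale(float *data)
{
    __shared__ float scale;

    if (threadIdx.x == 0)
        scale = 2.0f;             // one thread writes the value
    __syncthreads();

    // Every thread reads the same word: this is a broadcast,
    // not a 32-way bank conflict.
    data[threadIdx.x] *= scale;
}
```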


It really cannot be stressed enough to make as much use of shared memory as possible in your application. In my next post I will provide an example that illustrates just how much faster shared memory is compared to global memory, as well as the impacts with regards to performance that result when reads to global memory are coalesced and bank conflicts are removed. In addition, I will discuss strided memory accesses, and provide some additional insight into the optimization techniques for the other types of available memory.


Avoiding GPU Memory Performance Bottlenecks

So all this advice is great and all, but I’m sure you’re wondering “What actually is strided memory access?” The following example will illustrate this phenomenon and outline its effect on the effective bandwidth:

__global__ void strideExample(float *outputData, float *inputData, int stride)
{
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}
In the above code, launched with a stride of 2, threads within a warp access data words in memory with a stride of 2. This leads to a load of two L1 cache lines per warp. The actual accessing of the memory is shown below.

Diagram of NVIDIA Kepler Strided Memory Accesses

Accesses with a stride of 2 result in a 50% load/store efficiency (shown above), since half of the elements involved in the transaction are not used (becoming wasted bandwidth). As the stride increases, the effective bandwidth decreases until there is a single cache line for each of the threads in a warp (wow, that’s a lot of lost performance!).

Strided accesses can debilitate the performance of even the most optimized algorithms. For large strides, the effective bandwidth is poor regardless of architecture or compute capability version. Intuitively, this makes sense: when concurrent threads simultaneously access memory addresses that are far apart in physical memory, the accesses cannot be combined. For accesses with a stride greater than 1, you should avoid global memory if you wish to realize any sort of performance gain from your application. In cases where you are stuck with strided memory accesses, you must ensure that as much data as possible is used from each cache line that is fetched.

So, if I haven’t made it clear enough: if you can avoid global memory, you should. In my personal experience programming with CUDA, you really can’t go wrong if you intelligently make use of shared memory. With the exception of bank conflicts (discussed in Shared Memory Optimization), you don’t suffer the painful penalties that accompany global memory usage when warps perform non-sequential or misaligned accesses in shared memory.
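As one concrete sketch of this advice, the classic tiled matrix transpose stages data through shared memory so that both the reads from and the writes to global memory are coalesced, while a padding column sidesteps bank conflicts. The kernel name and tile size here are my own choices for illustration:

```cuda
#define TILE_DIM 32

// Transpose a width x width matrix of floats. Both the global-memory
// reads and the global-memory writes are coalesced; the +1 padding
// column shifts each row of the tile into a different bank, avoiding
// bank conflicts when the tile is read back column-wise.
__global__ void transposeShared(float *out, const float *in, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < width)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap the block indices so the write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < width && y < width)
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```

A naive transpose writing `out[x * width + y]` directly would turn one side of the copy into a large-stride access of exactly the kind shown above; the shared memory tile converts it into two coalesced passes.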

For those of us who are more advanced, if you can make use of registers without register pressure or read-after-write dependencies, you should. I briefly discussed register memory in previous posts, but feel that it warrants a bit more discussion here.

Shared memory allows communications between threads, which is very convenient. However, for those of us looking to squeeze out every last drop of performance from our applications, you really need to make use of registers when you can. Think of it this way – shared memory is kind of the “jack of all trades” memory. It’s suitable for “most” applications and operations, but for register operations (without read-after-write issues) there is no comparison. Typically, register access consumes zero extra clock cycles per instruction. While this lack of processing latency makes register memory very appealing, read-after-write dependencies have a latency of roughly 24 clock cycles. When such a dependency appears in a loop of code, this latency will add up very quickly.

The only other downside of register memory is register pressure, which occurs when there simply are not enough registers for a given task. Although every multiprocessor in a GPU contains thousands of 32-bit registers, these get partitioned amongst the concurrent threads. You can cap the number of registers the compiler allocates per thread via the nvcc command line option --maxrregcount.
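Besides the compiler flag, CUDA also lets you express the register budget per kernel with the __launch_bounds__ qualifier. A hypothetical sketch (the kernel itself is just a placeholder):

```cuda
// The first argument promises the compiler that this kernel will
// never be launched with more than 256 threads per block; the second
// asks it to budget registers so that at least 4 blocks can be
// resident per multiprocessor. Under pressure, the compiler will
// spill registers to local memory rather than reduce occupancy.
__global__ void __launch_bounds__(256, 4)
scaleExample(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = alpha * data[i];
}
```

The per-kernel qualifier is usually preferable to a global --maxrregcount, since different kernels in the same program rarely want the same register cap.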

To summarize, when you’re developing your algorithms and applications you really need to be aware of how you’re making use of memory:

  • Global memory is great for beginner programmers, as it drastically simplifies coding for those who aren’t yet skilled or experienced in CUDA programming, but performance will be lower.
  • If you don’t need to squeeze out every drop of performance, shared memory can take you to where you need to be. The ease of thread-to-thread communication within a block makes many algorithms easier to code and implement, making shared memory a very attractive option.
  • Register memory is the fastest, but a little more tricky. There are hard limits to what you can do with register memory, but if what your algorithm requires fits inside those confines, then definitely make use of registers.
  • Very specific types of applications can really benefit from using texture and local memory, but if you’re in the market for those types of memory, you probably wouldn’t be reading this blog in the first place.

The next portion of this blog will step away from the memory aspect of performance optimization and into optimizing configurations and the art of keeping all the multiprocessors on your device busy throughout the execution of your kernel.


All in all, utilizing devices like the NVIDIA GPU, Cell BE or Intel Xeon Phi to increase the performance of your application doesn’t have to be a daunting task. Over the next several posts, this blog will outline and identify effective techniques to make troubleshooting the performance leaks of your application an easy matter. Each of these common bottlenecks will be discussed in detail in an effort to provide programmers insight into how to make use of all the resources that many popular architectures provide.

