Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘Image Processing’ Category

Databases for Multi-camera, Network Camera, E-Surveillance

Posted by Hemprasad Y. Badgujar on February 18, 2016


Multi-view, Multi-Class Dataset: pedestrians, cars and buses

This dataset consists of 23 minutes and 57 seconds of synchronized frames taken at 25fps from 6 different calibrated DV cameras.
One camera was placed about 2 m above the ground, two others were located at first-floor height, and the rest at second-floor height, to cover an area of 22 m x 22 m.
The sequence was recorded at the EPFL university campus where there is a road with a bus stop, parking slots for cars and a pedestrian crossing.

Download

Ground truth images
Ground truth annotations

References

The dataset on this page has been used for our multiview object pose estimation algorithm described in the following paper:

G. Roig, X. Boix, H. Ben Shitrit and P. Fua, Conditional Random Fields for Multi-Camera Object Detection, ICCV 2011.

Multi-camera pedestrians video

“EPFL” data set: Multi-camera Pedestrian Videos

If you use these sequences to produce people tracking results, please cite one of the references below.

On this page you can download a few multi-camera sequences that we acquired for developing and testing our people detection and tracking framework. All of the sequences feature several synchronised video streams filming the same area under different angles. All cameras are located about 2 meters from the ground. All pedestrians on the sequences are members of our laboratory, so there is no privacy issue. For the Basketball sequence, we received consent from the team.

Laboratory sequences

These sequences were shot inside our laboratory by 4 cameras. Four people (respectively six, in the second sequence) enter the room one after another and walk around for 2 1/2 minutes. The frame rate is 25 fps and the videos are encoded using the MPEG-4 codec.

[Camera 0] [Camera 1] [Camera 2] [Camera 3]

Calibration file for the 4 people indoor sequence.

[Camera 0] [Camera 1] [Camera 2] [Camera 3]

Calibration file for the 6 people indoor sequence.

Campus sequences

These two sequences called campus were shot outside on our campus with 3 DV cameras. Up to four people are simultaneously walking in front of them. The white line on the screenshots shows the limits of the area that we defined to obtain our tracking results. The frame rate is 25 fps and the videos are encoded using Indeo 5 codec.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2]
[Seq.2, cam. 0] [Seq.2, cam. 1] [Seq.2, cam. 2]

Calibration file for the two above outdoor scenes.

Terrace sequences

The sequences below, called terrace, were shot outside our building on a terrace. Up to 7 people move around in front of 4 DV cameras for around 3 1/2 minutes. The frame rate is 25 fps and the videos are encoded using the Indeo 5 codec.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]
[Seq.2, cam. 0] [Seq.2, cam. 1] [Seq.2, cam. 2] [Seq.2, cam. 3]

Calibration file for the terrace scene.

Passageway sequence

This sequence dubbed passageway was filmed in an underground passageway to a train station. It was acquired with 4 DV cameras at 25 fps, and is encoded with Indeo 5. It is a rather difficult sequence due to the poor lighting.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]

Calibration file for the passageway scene.

Basketball sequence

This sequence was filmed at a training session of a local basketball team. It was acquired with 4 DV cameras at 25 fps, and is encoded with Indeo 5.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]

Calibration file for the basketball scene.

Camera calibration

POM (the Probabilistic Occupancy Map method from the references below) only needs a simple calibration consisting of two homographies per camera view, which project the ground plane in top view to the ground plane in camera views and to the head plane in camera views (a plane parallel to the ground plane but located 1.75 m higher). Therefore, the calibration files given above consist of 2 homographies per camera. In degenerate cases where the camera is located inside the head plane, the head plane projects to a horizontal line in the camera image. When this happens, we do not provide a homography for the head plane; instead, we give the height of the line onto which the head plane projects. This is expressed as a percentage of the image height, starting from the top.

The homographies given in the calibration files project points in the camera views to their corresponding location on the top view of the ground plane, that is

H * X_image = X_topview .
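To make the relation above concrete, here is a minimal C++ sketch that applies a 3 x 3 homography to an image point in homogeneous coordinates and divides by the last component. The name apply_homography and the row-major array layout are assumptions made for illustration; the on-disk layout of the calibration files is not described here, so check it before loading values.

```cpp
#include <array>
#include <cmath>

// Row-major 3x3 homography (assumed layout, not taken from the calibration files).
using Homography = std::array<double, 9>;

// Maps an image point (u, v) to top-view coordinates (x, y), i.e. H * X_image = X_topview.
// Returns false when the homogeneous scale w is (almost) zero, meaning the
// point maps to infinity and has no finite top-view location.
bool apply_homography(const Homography& H, double u, double v,
                      double& x, double& y) {
    const double xh = H[0] * u + H[1] * v + H[2];
    const double yh = H[3] * u + H[4] * v + H[5];
    const double w  = H[6] * u + H[7] * v + H[8];
    if (std::fabs(w) < 1e-12) return false;
    x = xh / w;  // normalise homogeneous coordinates
    y = yh / w;
    return true;
}
```

Mapping in the other direction (top view to image) would use the inverse matrix, which this sketch does not compute.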

We have also computed the camera calibration using the Tsai calibration toolkit for some of our sequences. We also make them available for download. They consist of an XML file per camera view, containing the standard Tsai calibration parameters. Note that the image size used for calibration might differ from the size of the video sequences. In this case, the image coordinates obtained with the calibration should be normalized to the size of the video.

Ground truth

We have created ground truth data for some of the video sequences presented above, by locating and identifying the people in some frames at a regular interval.

To use these ground truth files, you must rely on the same calibration with the exact same parameters that we used when generating the data. We call top view the rectangular area of the ground plane in which we perform tracking.

This area is of dimensions tv_width x tv_height and has top left coordinate (tv_origin_x, tv_origin_y). Besides, we call grid our discretization of the top view area into grid_width x grid_height cells. An example is illustrated by the figure below, in which the grid has dimensions 5 x 4.

People’s positions in the ground truth are expressed in discrete grid coordinates. In order to be projected into the images with homographies or the Tsai calibration, these grid coordinates need to be translated into top view coordinates. We provide below a simple C++ function (it uses reference parameters for its outputs) that performs this translation. This function takes the following parameters:

  • pos : the person position coming from the ground truth file
  • grid_width, grid_height : the grid dimension
  • tv_origin_x, tv_origin_y : the top left corner of the top view
  • tv_width, tv_height : the top view dimension
  • tv_x, tv_y : the top view coordinates, i.e. the output of the function

  void grid_to_tv(int pos, int grid_width, int grid_height,
                  float tv_origin_x, float tv_origin_y, float tv_width,
                  float tv_height, float &tv_x, float &tv_y) {
    tv_x = ( (pos % grid_width) + 0.5 ) * (tv_width / grid_width) + tv_origin_x;
    tv_y = ( (pos / grid_width) + 0.5 ) * (tv_height / grid_height) + tv_origin_y;
  }

The table below summarizes the aforementioned parameters for the ground truth files we provide. Note that the ground truth for the terrace sequence has been generated with the Tsai calibration provided in the table. You will need to use this one to get a proper bounding box alignment.

Ground truth        | Grid dimensions | Top view origin | Top view dimensions | Calibration
6-people laboratory | 56 x 56         | (0, 0)          | 358 x 360           | file
terrace, seq. 1     | 30 x 44         | (-500, -1,500)  | 7,500 x 11,000      | file (Tsai)
passageway, seq. 1  | 40 x 99         | (0, 38.48)      | 155 x 381           | file
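As a worked example, the sketch below bundles one row of the table into a struct and repeats the grid-to-top-view arithmetic on it. The struct TopViewGrid and the inverse helper cell_index are hypothetical names added for illustration; the inverse returns -2 for points outside the grid, mirroring the out-of-grid marker used in the ground truth files. The terrace values are read with the commas as thousands separators (origin (-500, -1500), size 7500 x 11000), which is an assumption.

```cpp
#include <cmath>

// One row of the parameter table (hypothetical container, not part of the dataset tools).
struct TopViewGrid {
    int grid_width, grid_height;
    float origin_x, origin_y;
    float width, height;
};

// Same arithmetic as grid_to_tv: centre of grid cell `pos` in top-view coordinates.
void cell_center(const TopViewGrid& g, int pos, float& tv_x, float& tv_y) {
    tv_x = ((pos % g.grid_width) + 0.5f) * (g.width / g.grid_width) + g.origin_x;
    tv_y = ((pos / g.grid_width) + 0.5f) * (g.height / g.grid_height) + g.origin_y;
}

// Hypothetical inverse: grid cell containing a top-view point, or -2 if outside the grid.
int cell_index(const TopViewGrid& g, float tv_x, float tv_y) {
    const float cell_w = g.width / g.grid_width;
    const float cell_h = g.height / g.grid_height;
    const int col = static_cast<int>(std::floor((tv_x - g.origin_x) / cell_w));
    const int row = static_cast<int>(std::floor((tv_y - g.origin_y) / cell_h));
    if (col < 0 || col >= g.grid_width || row < 0 || row >= g.grid_height) return -2;
    return row * g.grid_width + col;
}
```

With the terrace row, each cell is 250 x 250 top-view units, so cell 0 has its centre at (-375, -1375).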

The format of the ground truth file is the following:

  <number of frames> <number of people> <grid width> <grid height> <step size> <first frame> <last frame>
  <pos> <pos> <pos> ...
  <pos> <pos> <pos> ...
  ...

where <number of frames> is the total number of frames, <number of people> is the number of people for which we have produced a ground truth, <grid width> and <grid height> are the ground plane grid dimensions, <step size> is the frame interval between two ground truth labels (i.e. if set to 25, then there is a label once every 25 frames), and <first frame> and <last frame> are the first and last frames for which a label has been entered.

After the header, every line represents the positions of people at a given frame. <pos> is the position of a person in the grid. It is normally an integer >= 0, but can be -1 if undefined (i.e. no label has been produced for this frame) or -2 if the person is currently out of the grid.

References

Multiple Object Tracking using K-Shortest Paths Optimization

Jérôme Berclaz, François Fleuret, Engin Türetken, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence
2011
pdf | show bibtex

Multi-Camera People Tracking with a Probabilistic Occupancy Map

François Fleuret, Jérôme Berclaz, Richard Lengagne, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence
pdf | show bibtex

MuHAVi: Multicamera Human Action Video Data

including selected action sequences with

MAS: Manually Annotated Silhouette Data

for the evaluation of human action recognition methods

Figure 1. The top view of the configuration of 8 cameras used to capture the actions in the blue action zone (which is marked with white tapes on the scene floor).

camera symbol | camera name
V1            | Camera_1
V2            | Camera_2
V3            | Camera_3
V4            | Camera_4
V5            | Camera_5
V6            | Camera_6
V7            | Camera_7
V8            | Camera_8

Table 1. Camera view names appearing in the MuHAVi data folders and the corresponding symbols used in Fig. 1.

 

In the table below, you can click on the links to download the data (JPG images) for the corresponding action.

Important: We noted that some earlier versions of MS Internet Explorer could not download files over 2GB in size, so we recommend using alternative browsers such as Firefox or Chrome.

Each tar file contains 7 folders corresponding to 7 actors (Person1 to Person7) each of which contains 8 folders corresponding to 8 cameras (Camera_1 to Camera_8). Image frames corresponding to every combination of action/actor/camera are named with image frame numbers starting from 00000001.jpg for simplicity. The video frame rate is 25 frames per second and the resolution of image frames (except for Camera_8) is 720 x 576 Pixels (columns x rows). The image resolution is 704 x 576 for Camera_8.
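The folder layout described above maps directly to file paths. The following C++ sketch builds the path of one frame and looks up the frame width per camera; it assumes the Person folders sit directly under the directory where an action's tar file was extracted (the `root` argument is a hypothetical location), and the function names are made up for illustration.

```cpp
#include <cstdio>
#include <string>

// Builds "<root>/Person<actor>/Camera_<camera>/<frame as 8 digits>.jpg".
// actor is 1..7, camera is 1..8, frame numbering starts at 1 (00000001.jpg).
std::string muhavi_frame_path(const std::string& root, int actor, int camera, int frame) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "/Person%d/Camera_%d/%08d.jpg", actor, camera, frame);
    return root + buf;
}

// Frame width in pixels: 704 for Camera_8, 720 for all other cameras (height is 576 for all).
int muhavi_frame_width(int camera) {
    return camera == 8 ? 704 : 720;
}

// Time offset of a frame in seconds at 25 frames per second (frame 1 is at t = 0).
double muhavi_frame_time(int frame) {
    return (frame - 1) / 25.0;
}
```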

action class | action name       | size
C1           | WalkTurnBack      | 2.6GB
C2           | RunStop           | 2.5GB
C3           | Punch             | 3.0GB
C4           | Kick              | 3.4GB
C5           | ShotGunCollapse   | 4.3GB
C6           | PullHeavyObject   | 4.5GB
C7           | PickupThrowObject | 3.0GB
C8           | WalkFall          | 3.9GB
C9           | LookInCar         | 4.6GB
C10          | CrawlOnKnees      | 3.4GB
C11          | WaveArms          | 2.2GB
C12          | DrawGraffiti      | 2.7GB
C13          | JumpOverFence     | 4.4GB
C14          | DrunkWalk         | 4.0GB
C15          | ClimbLadder       | 2.1GB
C16          | SmashObject       | 3.3GB
C17          | JumpOverGap       | 2.6GB

MIT Trajectory Data Set – Multiple Camera Views

Download

The MIT trajectory data set is for research on activity analysis in multiple single-camera views, using the trajectories of objects as features. Object tracking is based on background subtraction using an adaptive Gaussian mixture model. There are four camera views in total. Trajectories in different camera views have been synchronized. The data can be downloaded from the following link:

MIT trajectory data set

Background image

Reference

Please cite as:

X. Wang, K. Tieu and E. Grimson, Correspondence‐Free Activity Analysis and Scene Modeling in Multiple Camera Views, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 32, pp. 56-71, 2010.

Details

The MIT traffic data set is for research on activity analysis and crowded scenes. It includes a traffic video sequence 90 minutes long, recorded by a stationary camera. The size of the scene is 720 by 480. It is divided into 20 clips and can be downloaded from the following links.

Ground Truth

In order to evaluate the performance of human detection on this data set, ground truth for pedestrians has been manually labeled in some sampled frames. It can be downloaded below. A readme file provides instructions on how to use it.
Ground truth of pedestrians

References

  1. Unsupervised Activity Perception in Crowded and Complicated scenes Using Hierarchical Bayesian Models
    X. Wang, X. Ma and E. Grimson
    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 31, pp. 539-555, 2009
  2. Automatic Adaptation of a Generic Pedestrian Detector to a Specific Traffic Scene
    M. Wang and X. Wang
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011

Description

This dataset is presented in our CVPR 2015 paper,
Linjie Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification, In Computer Vision and Pattern Recognition (CVPR), 2015. PDF

The Comprehensive Cars (CompCars) dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models. There are a total of 136,726 images capturing the entire cars and 27,618 images capturing the car parts. The full car images are labeled with bounding boxes and viewpoints. Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car. The surveillance-nature data contains 50,000 car images captured in the front view. Please refer to our paper for the details.

The dataset is well prepared for the following computer vision tasks:

  • Fine-grained classification
  • Attribute prediction
  • Car model verification

The train/test subsets of these tasks introduced in our paper are included in the dataset. Researchers are also welcome to utilize it for any other tasks such as image ranking, multi-task learning, and 3D reconstruction.

Note

  1. You need to complete the release agreement form to download the dataset. Please see below.
  2. The CompCars database is available for non-commercial research purposes only.
  3. All images of the CompCars database are obtained from the Internet which are not property of MMLAB, The Chinese University of Hong Kong. The MMLAB is not responsible for the content nor the meaning of these images.
  4. You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the images and any portion of derived data.
  5. You agree not to further copy, publish or distribute any portion of the CompCars database. Except, for internal use at a single site within the same organization it is allowed to make copies of the database.
  6. The MMLAB reserves the right to terminate your access to the database at any time.
  7. All submitted papers or any publicly available text using the CompCars database must cite the following paper:
    Linjie Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification, In Computer Vision and Pattern Recognition (CVPR), 2015.

Download instructions

Download the CompCars dataset Release Agreement, read it carefully, and complete it appropriately. Note that the agreement should be signed by a full-time staff member (that is, student is not acceptable). Then, please scan the signed agreement and send it to Mr. Linjie Yang (yl012(at)ie.cuhk.edu.hk) and cc to Chen Change Loy (ccloy(at)ie.cuhk.edu.hk). We will verify your request and contact you on how to download the database.

Stanford Cars Dataset

Overview

The Cars dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, where each class has been split roughly 50-50. Classes are typically at the level of Make, Model, Year, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe.

Download

Training images can be downloaded here.
Testing images can be downloaded here.
A devkit, including class labels for training images and bounding boxes for all images, can be downloaded here.
If you’re interested in the BMW-10 dataset, you can get that here.

Update: For ease of development, a tar of all images is available here and all bounding boxes and labels for both training and test are available here. If you were using the evaluation server before (which is still running), you can use test annotations here to evaluate yourself without using the server.

Evaluation

An evaluation server has been set up here. Instructions for the submission format are included in the devkit. This dataset was featured as part of FGComp 2013, and competition results are directly comparable to results obtained from evaluating on images here.

Citation

If you use this dataset, please cite the following paper:

3D Object Representations for Fine-Grained Categorization
Jonathan Krause, Michael Stark, Jia Deng, Li Fei-Fei
4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13). Sydney, Australia. Dec. 8, 2013.
[pdf]   [BibTex]   [slides]

Note that the dataset, as released, has 196 categories, one less than in the paper, as it has been cleaned up slightly since publication. Numbers should be more or less comparable, though.

The HDA dataset is a multi-camera high-resolution image sequence dataset for research on high-definition surveillance. 18 cameras (including VGA, HD and Full HD resolution) were recorded simultaneously during 30 minutes in a typical indoor office scenario at a busy hour (lunch time) involving more than 80 persons. In the current release (v1.1), 13 cameras have been fully labeled.

 

The venue spans three floors of the Institute for Systems and Robotics (ISR-Lisbon) facilities. The following pictures show the placement of the cameras. The 18 recorded cameras are identified with a small red circle. The 13 cameras with a coloured view field have been fully labeled in the current release (v1.1).

 

Each frame is labeled with the bounding boxes tightly adjusted to the visible body of the persons, the unique identification of each person, and flag bits indicating occlusion and crowd:

  • The bounding box is drawn so that it completely and tightly encloses the person.
  • If the person is occluded by something (except image boundaries), the bounding box is drawn by estimating the whole body extent.
  • People partially outside the image boundaries have their bounding boxes cropped to image limits. Partially occluded people and people partially outside the image boundaries are marked as ‘occluded’.
  • A unique ID is associated to each person, e.g., ‘person01’. In case of identity doubt, the special ID ‘personUnk’ is used.
  • Groups of people that are impossible to label individually are labelled collectively as ‘crowd’. People in front of a ‘crowd’ area are labeled normally.

The following figures show examples of labeled frames: (a) an unoccluded person; (b) two occluded people; (c) a crowd with three people in front.

 

Data formats:

For each camera we provide the .jpg frames sequentially numbered and a .txt file containing the annotations according to the “video bounding box” (vbb) format defined in the Caltech Pedestrian Detection Database. Also on this site there are tools to visualise the annotations overlapped on the image frames.

 

Some statistics:

Labeled Sequences: 13

Number of Frames: 75207

Number of Bounding Boxes: 64028

Number of Persons: 85

 

Repository of Results:

We maintain a public repository of re-identification results on this dataset. Send us your CMC curve to be uploaded (alex at isr ist utl pt).
Click here to see the full list and detailed experiments.

Posted in Computer Network & Security, Computer Research, Computer Vision, Image Processing, Multimedia

Bilateral Filtering

Posted by Hemprasad Y. Badgujar on September 14, 2015


Popular Filters

When smoothing or blurring images (the most common goal of smoothing is to reduce noise), we can use a variety of filters. Linear filters are easy to implement and fairly fast; the most used ones are the homogeneous filter and the Gaussian filter. The median filter is also very popular, although it is a non-linear filter.

A linear filter outputs a pixel value g(i, j) that is determined as a weighted sum of input pixel values f(i+k, j+l):

g(i, j) = SUM over (k, l) of [ f(i+k, j+l) h(k, l) ]

where h(k, l) is called the kernel, which is nothing more than the coefficients of the filter.

The homogeneous filter is the simplest filter: each output pixel is the mean of its kernel neighbors (all of them contribute with equal weights). Its kernel K looks like:

[Image: homogeneous kernel K, a matrix of ones scaled by 1 / (kernel width x kernel height)]

The Gaussian filter simply uses a kernel with different weights in both the x and y directions: pixels located in the middle have bigger weights, and the weights decrease with distance from the neighborhood center, so pixels located on the sides have smaller weights. Its kernel K is something like this (for a 5 x 5 kernel):

[Image: example 5 x 5 Gaussian kernel]

The median filter replaces each pixel’s value with the median of its neighboring pixels. This method is great when dealing with “salt and pepper” noise.
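To see why the median copes well with salt and pepper noise, here is a small 1-D C++ sketch (the 2-D case works the same way over a square neighborhood). The function name and the clamped border handling are assumptions for illustration.

```cpp
#include <algorithm>
#include <vector>

// 1-D median filter with window radius r (window size 2r + 1).
// Indices outside the signal are clamped to the nearest valid sample.
std::vector<int> median_filter_1d(const std::vector<int>& in, int r) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(in.size());
    std::vector<int> window;
    for (int i = 0; i < n; ++i) {
        window.clear();
        for (int k = -r; k <= r; ++k) {
            const int idx = std::min(std::max(i + k, 0), n - 1);
            window.push_back(in[idx]);
        }
        // Partial sort: after this call, window[r] holds the median.
        std::nth_element(window.begin(), window.begin() + r, window.end());
        out[i] = window[r];
    }
    return out;
}
```

An isolated outlier (a 255 or a 0 among values of 10) never reaches the middle of the sorted window, so it is discarded instead of being averaged into its neighbors.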

Bilateral Filter

By using any of the three filters above to smooth an image, we not only remove noise but also smooth edges, which makes edges less sharp or even makes them disappear. To solve this problem, we can use a filter called the bilateral filter, an advanced version of the Gaussian filter. It introduces a second weight that represents how close (or similar) two pixels are to one another in value; by considering both weights, the bilateral filter can keep edges sharp while blurring the image.
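The two weights can be sketched in a few lines of C++ on a 1-D signal. This is an illustration of the idea, not OpenCV's implementation: the function name, the parameter names and the choice to skip samples outside the signal are assumptions.

```cpp
#include <cmath>
#include <vector>

// 1-D bilateral filter. Each neighbour j of sample i gets the weight
//   exp(-0.5 * ((j - i) / sigma_space)^2) * exp(-0.5 * ((in[j] - in[i]) / sigma_range)^2)
// i.e. a spatial Gaussian times a Gaussian on the difference in value.
std::vector<double> bilateral_1d(const std::vector<double>& in, int r,
                                 double sigma_space, double sigma_range) {
    const int n = static_cast<int>(in.size());
    std::vector<double> out(in.size());
    for (int i = 0; i < n; ++i) {
        double sum = 0.0, norm = 0.0;
        for (int k = -r; k <= r; ++k) {
            const int j = i + k;
            if (j < 0 || j >= n) continue;  // skip samples outside the signal
            const double ds = k / sigma_space;
            const double dr = (in[j] - in[i]) / sigma_range;
            const double w = std::exp(-0.5 * (ds * ds + dr * dr));
            sum += w * in[j];
            norm += w;
        }
        out[i] = sum / norm;  // norm >= 1 because k = 0 always contributes weight 1
    }
    return out;
}
```

On a noisy step such as {0, 2, 0, 2, 100, 98, 100, 98}, neighbors on the far side of the step differ by about 100 in value, so their range weight is practically zero and the step stays sharp while the small fluctuations on each side are averaged out.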

Let me show you the process using this image, which has a sharp edge.

[Image: sample image with a sharp edge]

Say we are smoothing this image (we can see noise in the image), and we are now dealing with the pixel at the middle of the blue rectangle.

[Images: Gaussian kernel (left) and bilateral filter kernel (right)]

The left picture above is a Gaussian kernel, and the right picture is the bilateral filter kernel, which considers both weights.

We can also see the difference between Gaussian filter and Bilateral filter by these pictures:

Say we have an original image with noise like this:

[Image: original noisy image]

With the Gaussian filter, the image is smoother than before, but the edge is no longer sharp: a slope appears between the white and black pixels.

[Image: result of the Gaussian filter]

With the bilateral filter, however, the image is smoother and the edge stays sharp as well.

[Image: result of the bilateral filter]

OpenCV code

It is super easy to apply these kinds of filters in OpenCV:

#include <opencv2/imgproc.hpp>
using namespace cv;

// kernel_length must be odd for GaussianBlur and medianBlur
// Homogeneous blur:
blur(image, dstHomo, Size(kernel_length, kernel_length), Point(-1,-1));
// Gaussian blur:
GaussianBlur(image, dstGaus, Size(kernel_length, kernel_length), 0, 0);
// Median blur:
medianBlur(image, dstMed, kernel_length);
// Bilateral blur (arguments: diameter, sigmaColor, sigmaSpace):
bilateralFilter(image, dstBila, kernel_length, kernel_length*2, kernel_length/2);

For each function, you can find more details in the OpenCV documentation.

Test Images

Glad to use my favorite Van Gogh image :

[Image: Van Gogh test image]

 

From left to right: Homogeneous blur, Gaussian blur, Median blur, Bilateral blur.

(click image to view full-size version :p )

kernel length = 3:

homo3 Gaussian3 Median3 Bilateral3

kernel length = 9:

homo9 Gaussian9 Median9 Bilateral9
kernel length = 15:

homo15 Gaussian15 Median15 Bilateral15

kernel length = 23:

homo23 Gaussian23 Median23 Bilateral23
kernel length = 31:

homo31 Gaussian31 Median31 Bilateral31
kernel length = 49:

homo49 Gaussian49 Median49 Bilateral49
kernel length = 99:

homo99 Gaussian99 Median99 Bilateral99

Posted in C, Image / Video Filters, Image Processing, OpenCV, OpenCV Tutorial

Posted by Hemprasad Y. Badgujar on December 11, 2014


Cloud scaling, Part 1: Build a compute node or small cluster application and scale with HPC

Leveraging warehouse-scale computing as needed

Discover methods and tools to build a compute node and small cluster application that can scale with on-demand high-performance computing (HPC) by leveraging the cloud. This series takes an in-depth look at how to address unique challenges while tapping and leveraging the efficiency of warehouse-scale on-demand HPC. The approach allows the architect to build locally for expected workload and to spill over into on-demand cloud HPC for peak loads. Part 1 focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Exotic HPC architectures with custom-scaled processor cores and shared memory interconnection networks are being rapidly replaced by on-demand clusters that leverage off-the-shelf general purpose vector coprocessors, converged Ethernet at 40 Gbit/s per link or more, and multicore headless servers. These new HPC on-demand cloud resources resemble what has been called warehouse-scale computing, where each node is homogeneous and headless and the focus is on total cost of ownership and power use efficiency overall. However, HPC has unique requirements that go beyond social networks, web search, and other typical warehouse-scale computing solutions. This article focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Moving to high-performance computing

The TOP500 and Green500 supercomputers (see Resources) since 1994 are more often not custom designs, but rather designed and integrated with off-the-shelf headless servers, converged Ethernet or InfiniBand clustering, and general-purpose graphics processing unit (GP-GPU) coprocessors that aren’t for graphics but rather for single program, multiple data (SPMD) workloads. The trend in high-performance computing (HPC) away from exotic custom processor and memory interconnection design to off-the-shelf—warehouse-scale computing—is based on the need to control total cost of ownership, increase power efficiency, and balance operational expenditure (OpEx) and capital expenditure (CapEx) for both start-up and established HPC operations. This means that you can build your own small cluster with similar methods and use HPC warehouse-scale resources on-demand when you need them.

The famous 3D torus interconnection that Cray and others used may never fully go away (today, the TOP500 is one-third massively parallel processors [MPPs] and two-thirds cluster architecture for top performers), but focus on efficiency and new OpEx metrics like Green500 Floating Point Operations (FLOPs)/Watt are driving HPC and keeping architecture focused on clusters. Furthermore, many applications of interest today are data driven (for example, digital video analytics), so many systems not only need traditional sequential high performance storage for HPC checkpoints (saved state of a long-running job) but more random access to structured (database) and unstructured (files) large data sets. Big data access is a common need of traditional warehouse-scale computing for cloud services as well as current and emergent HPC workloads. So, warehouse-scale computing is not HPC, but HPC applications can leverage data center-inspired technology for cloud HPC on demand, if designed to do so from the start.

Power to computing

Power to computing can be measured in terms of a typical performance metric per Watt, for example FLOPS/Watt for computing or I/O operations per second/Watt for I/O. Furthermore, any computing facility can be seen as a plant for converting Watts into computational results, and a gross measure of good plant design is power use efficiency (PUE), which is simply the ratio of total facility power to the power delivered to computing equipment. A good value today is 1.2 or less. Reasons for higher PUEs include inefficient cooling methods, administrative overhead, and the lack of purpose-built facilities compared to cloud data centers (see Resources for a link to more information).

Changes in scalable computing architecture focus over time include:

  • Early focus on a fast single processor (uniprocessor) to push the stored-program arithmetic logic unit central processor to the highest clock rates and instruction throughput possible:
    • John von Neumann, Alan Turing, Robert Noyce (founder of Intel), Ted Hoff (Intel universal processor proponent), along with Gordon Moore, saw initial scaling as a challenge of scaling digital logic and clocking a processor as fast as possible.
    • Up to at least 1984 (and maybe longer), the general rule was “the processor makes the computer.”
    • Cray Computer designs vector processors (X-MP, Y-MP) and distributed memory multiprocessors interconnected by a six-way interconnect 3D torus for custom MPP machines. But this is unique to the supercomputing world.
    • IBM’s focus early on was scalable mainframes and fast uniprocessors until the announcement of the IBM® Blue Gene® architecture in 1999 using a multicore IBM® POWER® architecture system-on-a-chip design and a 3D torus interconnection. The current TOP500 includes many Blue Gene systems, which have often occupied the LINPACK-measured TOP500 number one spot.
  • More recently since 1994, HPC is evolving to a few custom MPP and mostly off-the-shelf clusters, using both custom interconnections (for example, Blue Gene and Cray) and off-the-shelf converged Ethernet (10G, 40G) and InfiniBand:
    • The TOP500 has become dominated by clusters, which comprise the majority of top-performing HPC solutions (two-thirds) today.
    • As shown in the TOP500 chart by architecture since 1994, clusters and MPP dominate today (compared to single instruction, multiple data [SIMD] vector; fast uniprocessors; symmetric multiprocessing [SMP] shared memory; and other, more obscure architectures).
    • John Gage at Sun Microsystems (now Oracle) stated that “the network is the computer,” referring to distributed systems and the Internet, but low-latency networks in clusters likewise become core to scaling.
    • Coprocessors interfaced to cluster nodes via memory-mapped I/O, including GP-GPU and even hybrid field-programmable gate array (FPGA) processors, are used to accelerate specific computing workloads on each cluster node.
  • Warehouse-scale computing and the cloud emerge with focus on MapReduce and what HPC would call embarrassingly parallel applications:
    • The TOP500 is measured with LINPACK and FLOPs and so is not focused on cost of operations (for example, FLOPs/Watt) or data access. Memory access is critical, but storage access is not so critical, except for job checkpoints (so a job can be restarted, if needed).
    • Many data-driven applications have emerged in the new millennium, including social networks, Internet search, global geographical information systems, and analytics associated with more than a billion Internet users. This is not HPC in the traditional sense but warehouse-computing operating at a massive scale.
    • Luiz André Barroso states that “the data center is the computer,” a second shift away from processor-focused design. The data center is highly focused on OpEx as well as CapEx, and so is a better fit for HPC where FLOPs/Watt and data access matter. These Google data centers have a PUE less than 1.2—a measure of total facility power consumed divided by power used for computation. (Most computing enterprises have had a PUE of 2.0 or higher, so, 1.2 is very low indeed. See Resources for more information.)
    • Amazon launched Amazon Elastic Compute Cloud (Amazon EC2), which is best suited to web services but has some scalable and at least high-throughput computing features (see Resources).
  • On-demand cloud HPC services expand, with an emphasis on clusters, storage, coprocessors and elastic scaling:
    • Many private and public HPC clusters occupy TOP500, running Linux® and using common open source tools, such that users can build and scale applications on small clusters but migrate to the cloud for on-demand large job handling. Companies like Penguin Computing, which features Penguin On-Demand, leverage off-the-shelf clusters (InfiniBand and converged 10G/40G Ethernet), Intel or AMD multicore headless nodes, GP-GPU coprocessors, and scalable redundant array of independent disks (RAID) storage.
    • IBM Platform computing provides IBM xSeries® and zSeries® computing on demand with workload management tools and features.
    • Numerous universities and start-up companies leverage HPC on demand with cloud services or off-the-shelf clusters to complement their own private services. Two that I know well are the University of Alaska Arctic Region Supercomputing Center (ARSC) Pacman (Penguin Computing) and the University of Colorado JANUS cluster supercomputer. A common Red Hat Enterprise Linux (RHEL) open source workload tool set and open architecture allow for migration of applications from private to public cloud HPC systems.

Figure 1 shows the TOP500 move to clusters and MPP since the mid-1990s.

Figure 1. TOP500 evolution to clusters and MPP since 1994

The cloud HPC on-demand approach requires well-defined off-the-shelf clustering, compute nodes, and tolerance for WAN latency to transfer workload. As such, these systems are not likely to overtake the top spots in the TOP500, but they are likely to occupy the Green500 and provide efficient scaling for many workloads. Off-the-shelf clusters of this kind now make up the majority of the TOP500.

High-definition digital video computer vision: a scalable HPC case study

Most of us deal with compressed digital video, often in Moving Picture Experts Group (MPEG) 4 format, and don’t think about the scale of even a high-definition (HD) web cam in terms of the data rates and processing needed to apply simple image processing analysis. Digital cinema workflow and post-production experts know the challenges well. They deal with individual 4K frames (roughly 4,000 pixels wide, or more than 8 megapixels each) or much higher resolutions. These frames might be compressed, but they are not compressed over time in groups of pictures as MPEG does, and the compression is often lossless rather than lossy.
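To get a feel for the data rates involved, the arithmetic can be sketched in a few lines. The frame sizes, the 3 bytes per pixel (24-bit RGB), and the frame rates below are my assumptions for illustration and are not taken from any particular camera.

```python
def raw_video_rate_mb_per_s(width, height, bytes_per_pixel, fps):
    """Uncompressed video data rate in megabytes (10^6 bytes) per second."""
    return width * height * bytes_per_pixel * fps / 1e6

# Assumed formats: 1080p HD at 30 frames/s and 4K (4096x2160) at 24 frames/s.
hd_1080p = raw_video_rate_mb_per_s(1920, 1080, 3, 30)
cinema_4k = raw_video_rate_mb_per_s(4096, 2160, 3, 24)
```

Under these assumptions, a single uncompressed HD stream needs roughly 187 MB/s and a 4K stream roughly 637 MB/s, before any processing is applied.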

To start to understand an HPC problem that involves FLOPs, uncompressed data, and tools that can be used for scale-up, let’s look at a simple edge-finder transform. The transform-example.zip includes Open Computer Vision (OpenCV) algorithms to transform a real-time web cam stream into a Sobel or Canny edge view in real time. See Figure 2.
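For readers without OpenCV at hand, the Sobel operator itself can be sketched in plain Python. This is not the code from transform-example.zip; it is a minimal illustration that uses the standard 3×3 Sobel kernels, approximates the gradient magnitude as |Gx| + |Gy|, and leaves border pixels at zero (both simplifications are my choices).

```python
def sobel_magnitude(image):
    """Return |Gx| + |Gy| for each interior pixel of a grayscale image.

    `image` is a list of rows of integers; border pixels are left at 0.
    """
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    p = image[r + dr - 1][c + dc - 1]
                    gx += kx[dr][dc] * p
                    gy += ky[dr][dc] * p
            out[r][c] = abs(gx) + abs(gy)
    return out

# Tiny test frame: dark left half, bright right half, so one vertical edge.
frame = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_magnitude(frame)
```

Every output pixel needs 18 multiply-accumulate operations and depends only on its 3×3 neighborhood, which is why the transform is both costly at HD resolutions and easy to split across cores.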

Figure 2. HD video Canny edge transform

Leveraging cloud HPC for video analytics allows for deployment of more intelligent smart phone applications. Perhaps phone processors will someday be able to handle real-time HD digital video facial recognition, but in the meantime, cloud HPC can help. Likewise, data that originates in data centers, like geographic information systems (GIS) data, needs intensive processing for analytics to segment scenes, create point clouds of 3D data from stereo vision, and recognize targets of interest (such as well-known landmarks).

Augmented reality and video analytics

Video analytics involves collection of structured (database) information from unstructured video (files) and video streams; facial recognition is one example. Much of the early focus has been on security and automation of surveillance, but applications are growing fast and now include more social uses, such as facial analysis that captures and records a person’s expression and mood (while shopping, for instance) rather than identifying the person. This technology can be coupled with augmented reality, whereby the analytics are used to update a scene with helpful information (such as navigation data). Video data can be compressed and uplinked to warehouse-scale data centers for processing, so that the analytics can be collected and information that is not available on the user’s smart phone can be provided in return. The image processing is compute intensive, involves big data storage, and is likely to be a scaling challenge (see Resources for a link to more information).

Sometimes, when digital video is collected in the field, the data must be brought to the computational resources; but if possible, digital video should be moved only when necessary, to avoid the cost of encoding to compress and decoding to decompress for viewing. Specialized coprocessors known as codecs (coder/decoders) are designed to decode without software, and coprocessors that render graphics (GPUs) exist, but to date, no CV coprocessors are widely available. In late 2012, Khronos announced an initiative to define hardware acceleration for OpenCV, but work has only just begun (see Resources). So, to date, CV remains more of an HPC application that has had attention primarily from digital cinema, but this is changing rapidly based on interest in CV on mobile devices and in the cloud.

Although all of us imagine CV to be implemented on mobile robotics, in our heads-up displays for intelligent transportation, and on visors (like Google Goggles that are now available) for personal use, it’s not clear that all of the processing must be done on the embedded devices or that it should be even if it could. The reason is data: Without access to correlated data center data, CV information has less value. For example, how much value is there in knowing where you are without more mapping and GIS data to help you with where you want to go next? Real-time CV and video analytics are making progress, but they face many challenges, including huge storage requirements, high network bit rates for transport, and significant processing demands for interpretation. Whether the processing is done by cloud HPC clusters or embedded systems, it’s clear that concurrency and parallel processing will play a huge role. Try running a simple Hough linear transform on the 12-megapixel cactus photo I took, and you’ll see why HPC might be needed just to segment a scene at 60 frames/s.
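To see where the cost of the Hough linear transform comes from, here is a minimal sketch of the voting step in plain Python. It is an illustration only, not the code I ran on the cactus photo: the 1-degree angle resolution and integer rounding of rho are my assumptions, and the worst-case figure assumes every pixel is an edge pixel, which real images never reach.

```python
import math

def hough_lines(points, n_theta=180):
    """Vote in (rho, theta) space for each edge point.

    Returns a dict mapping (rho, theta_index) to a vote count, where
    theta = pi * theta_index / n_theta and rho = x*cos(theta) + y*sin(theta).
    """
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return acc

# Ten edge points on the horizontal line y = 5.
points = [(x, 5) for x in range(10)]
acc = hough_lines(points)
votes_for_line = acc[(5, 90)]      # rho = 5 at theta = 90 degrees

# Upper bound on votes per second for 12-megapixel frames at 60 frames/s.
worst_case_votes_per_s = 12_000_000 * 180 * 60
```

Every edge point casts one vote per angle, so the work grows with the number of edge pixels times the angular resolution. The upper bound here is about 1.3 × 10^11 votes per second, which gives a sense of why a single core struggles.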

The challenge of making algorithms parallel

HPC with both clusters and MPP requires coding methods that employ many threads of execution on each multicore node, use a message-passing interface (MPI), and apply basic methods to map data and code to processing resources and collect results. For digital video, the mapping can be simple if done at the frame level. Mapping within a frame is more difficult but still manageable; the extra work lies in segmenting frames and stitching the results back together.

The power of MapReduce

The MapReduce concept is generally associated with Google and the open source Hadoop project (from Apache Software Foundation), but any parallel computation must employ this concept to obtain speed-up, whether done at a node or cluster level with Java™ technology or at a thread level for a nonuniform memory access (NUMA) shared memory. For applications like digital video analytics, the mapping is data intensive, so it makes sense to move the function to the data (in the mapping stage), but either way, the data to be processed must be mapped and processed and the results combined. A clever mapping avoids data dependencies and the need for synchronization as much as possible. In the case of image processing, for CV, the mapping could be within a frame, at the frame level, or by groups of pictures (see Resources).
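The frame-level mapping described above can be sketched with Python’s standard library. This is a toy illustration of the map and reduce stages rather than Hadoop or MPI code; the per-frame function, the threshold, and the synthetic frames are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_bright_pixels(frame, threshold=128):
    """Map step: an independent result for one frame (no shared state)."""
    return sum(1 for row in frame for p in row if p >= threshold)

def map_reduce_frames(frames, workers=4):
    """Map a function over frames in a worker pool, then combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_frame = list(pool.map(count_bright_pixels, frames))   # map
    return reduce(lambda a, b: a + b, per_frame, 0)               # reduce

# Six synthetic 3x4 frames with uniform brightness 0, 40, 80, 120, 160, 200.
frames = [[[i * 40] * 4 for _ in range(3)] for i in range(6)]
total = map_reduce_frames(frames)
```

Because each frame is processed without reference to any other, the map stage needs no synchronization, and only the final reduction combines results.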

Key tools for designing cluster scaling applications for cloud HPC on demand include the following:

  • Threading is the way in which a single application (or Linux process), which has one address space on one cluster node, can be designed to use all processor cores on that node. Most often, this is done with Portable Operating System Interface for UNIX® (POSIX) Pthreads or with a library like OpenMP, which abstracts the low-level details of POSIX threading. I find POSIX threading to be fairly simple and typically write Pthread code as can be seen in the hpc_cloud_grid.tar.gz example. This example maps threads over the number space for prime number searching.
  • MPI is a library that can be linked into a cluster parallel application to assist with mapping of processing to each node, synchronization, and reduction of results. Although you can use MPI to implement MapReduce, unlike Hadoop, it typically moves data (in messages) to program functions running on each node (rather than moving code to the data). In the final video analytics article in this series, I will provide a thread and MPI cluster-scalable version of the capture-transform code. Here, I provide the simple code for a single thread and node to serve as a reference. Run it and Linux dstat at the same time to monitor CPU, I/O, and storage use. It is a resource-intensive program that computes Sobel and Canny transforms on a 2560×1920-pixel image. It should run on any Linux system with OpenCV and a web cam.
  • Vector SIMD and SPMD processing can be accomplished on Intel and AMD nodes with a switch to enable during compilation or, with more work, by creation of transform kernels in CUDA or OpenCL for off-load to a GPU or GP-GPU coprocessor.
  • OpenCV is highly useful for video analytics, as it includes not only convenient image capture, handling, and display functions but also most of the best image processing transforms used in CV.
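The idea of mapping threads over a number space for a prime search, mentioned in the threading item above, can be sketched as follows. This is not the hpc_cloud_grid.tar.gz code, which uses Pthreads in C; it is a Python illustration of the partitioning only, and the function names are my own. Note that in standard CPython the global interpreter lock keeps CPU-bound threads from running in parallel, so this shows the mapping rather than the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    """Trial division; adequate for a small illustration."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(lo, hi):
    """Count primes in the half-open range [lo, hi)."""
    return sum(1 for n in range(lo, hi) if is_prime(n))

def parallel_prime_count(limit, workers=4):
    """Split [0, limit) into one contiguous sub-range per worker and sum the counts."""
    step = -(-limit // workers)                      # ceiling division
    ranges = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: count_primes(*r), ranges))

primes_below_1000 = parallel_prime_count(1000)
```

Each worker owns a disjoint sub-range, so the only shared step is the final sum, which is the same map-then-reduce shape that MPI applies across nodes.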

The future of on-demand cloud HPC

This article makes an argument for cloud HPC. The goal here is to acquaint you with the idea and some of the challenging, yet compelling applications (like CV) as well as to introduce you to methods for programming applications that can scale on clusters and MPP machines. In future articles, I will take the CV example further and adapt it for not only threading but also for MPI so that we can examine how well it scales on cloud HPC (in my case, at ARSC on Pacman or JANUS). My research involves comparison of tightly coupled CV coprocessors (that I am building using an Altera Stratix IV FPGA I call a computer vision processing unit [CVPU]). I am comparing this to what I can achieve with CV on ARSC for the purpose of understanding whether environmental sensing and GIS data are best processed like graphics, with a coprocessor, or on a cluster or perhaps with a combination of the two. The goals for this research are lofty. In the case of CVPU, the CV/graphics Turing-like test I imagine is one in which the scene that the CVPU parses can then be sent to a GPU for rendering. Ideally, the parsed/rendered image would be indistinguishable from the true digital video stream. When rendered scenes and the ability to analyze them reach a common level of fidelity, augmented reality, perceptual computing, and video analytics will have amazing power to transform our lives.

Cloud scaling, Part 2: Tour high-performance cloud system design advances

Learn how to leverage co-processing, nonvolatile memory, interconnection, and storage

Breakthrough device technology requires the system designer to re-think operating and application software design in order to realize the potential benefits of closing the access gap or pushing processing into the I/O path with coprocessors. Explore and consider how the latest memory, compute, and interconnection devices and subsystems can affect your scalable, data-centric, high-performance cloud computing system design. Breakthroughs in device technology can be leveraged for transition between compute-centric and the more balanced data-centric compute architectures.

The author examines storage-class memory and demonstrates how to fill the long-standing performance gap between RAM and spinning disk storage; details the use of I/O bus coprocessors (for processing closer to data); explains how to employ InfiniBand to build low-cost, high performance interconnection networks; and discusses scalable storage for unstructured data.

Computing systems engineering has historically been dominated by scaling processors and dynamic RAM (DRAM) interfaces to working memory, leaving a huge gap between data-driven and computational algorithms (see Resources). Interest in data-centric computing is growing rapidly, along with novel system design software and hardware devices to support data transformation with large data sets.

The data focus in software is no surprise given applications of interest today, such as video analytics, sensor networks, social networking, computer vision and augmented reality, intelligent transportation, machine-to-machine systems, and big data initiatives like IBM’s Smarter Planet and Smarter Cities.

The current wave of excitement is about collecting, processing, transforming, and mining the big data sets:

  • The data focus is leading toward new device-level breakthroughs in nonvolatile memory (storage-class memory, SCM) which brings big data closer to processing.
  • At the same time, input/output coprocessors are bringing processing closer to the data.
  • Finally, low-latency, high-bandwidth off-the-shelf interconnections like InfiniBand are allowing researchers to quickly build 3D torus and fat-tree clusters that used to be limited to the most exotic and expensive custom high-performance computing (HPC) designs.

Yet, the systems software and even system design often remain influenced by out-of-date bottlenecks and thinking. For example, consider threading and multiprogramming. The whole idea came about because of slow disk drive access; what else can a program do when waiting on data but run another one? Sure, we have redundant array of independent disks (RAID) scaling and NAND flash solid-state disks (SSDs), but as noted by IBM Almaden Research, the time scale differences of the access time gap are massive in human terms.

The access time gap between a CPU, RAM, and storage can be measured in terms of typical performance for each device, but perhaps the gap is more readily understood when put into human terms (as IBM Almaden has done for illustrative purposes).

If a typical CPU operation is similar to what a human can do in seconds, then RAM access at 100 times more latency is much like taking a few minutes to access information. However, by the same comparison, disk access at 100,000 times more latency compared to RAM is on the order of months (100 days). (See Figure 1.)
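The human-scale comparison is easy to check with a few lines of arithmetic. The two latency factors come from the paragraph above; treating one CPU operation as exactly one second of human time is my simplifying assumption.

```python
CPU_OP_SECONDS = 1.0      # assumption: one CPU operation ~ one human second
RAM_FACTOR = 100          # RAM latency relative to a CPU operation (from the text)
DISK_FACTOR = 100_000     # disk latency relative to RAM (from the text)

ram_seconds = CPU_OP_SECONDS * RAM_FACTOR
disk_seconds = ram_seconds * DISK_FACTOR

ram_minutes = ram_seconds / 60          # a little under 2 minutes
disk_days = disk_seconds / 86_400       # 86,400 seconds per day
```

On this scale, a RAM access takes under two minutes and a disk access takes about 116 days, which matches the "few minutes" and "months (100 days)" orders of magnitude above.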

Figure 1. The data access gap

Many experienced computer engineers have not really thought hard about the 100 to 200 random I/O operations per second (IOPS) that mark the mechanical boundary for a disk drive. (Sure, sequential access is as high as hundreds of megabytes per second, but random access remains what it was more than 50 years ago, with the same 15K RPM seek and rotate access latency.)
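The 100 to 200 IOPS boundary follows from the mechanics. As a rough sketch, a random access costs an average seek plus half a rotation; the seek times below are illustrative assumptions of mine, not measurements of any particular drive.

```python
def random_iops(rpm, avg_seek_ms):
    """Estimate random IOPS as 1 / (average seek + average rotational latency)."""
    rotation_ms = 60_000 / rpm                    # time for one full rotation
    avg_rotational_latency_ms = rotation_ms / 2   # on average, half a rotation
    return 1000 / (avg_seek_ms + avg_rotational_latency_ms)

# A 15K RPM drive rotates once every 4 ms, so the average rotational latency is 2 ms.
fast_seek = random_iops(rpm=15_000, avg_seek_ms=3.0)   # assumed short average seek
slow_seek = random_iops(rpm=15_000, avg_seek_ms=8.0)   # assumed long average seek
```

With these assumed seek times, the estimate spans 100 to 200 random operations per second, and no amount of controller or bus speed changes it because the limit is mechanical.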

Finally, as Almaden notes, tape is therefore glacially slow. So, why do we bother? For the capacity, of course. But how can we get processing to the data or data to the processing more efficiently?

Look again at Figure 1. Improvements to NAND flash memory for use in mobile devices and more recently SSDs have helped to close the gap; however, it is widely believed that NAND flash device technology will be pushed to its limits fairly quickly, as noted by numerous system researchers (see Resources). The transistor floating gate technology used is already at scaling limits and pushing it farther is leading to lower reliability, so although it has been a stop-gap for data-centric computing, it is likely not the solution.

Instead, several new nonvolatile RAM (NVRAM) device technologies are likely solutions, including:

  • Phase change RAM (PCRAM): This memory uses a heating element to turn a class of materials known as chalcogenides into either a crystallized or amorphous glass state, thereby storing two states that can be programmed and read, with state retained even when no power is applied. PCRAM appears to show the most promise in the near term for M-type synchronous nonvolatile memory (NVM).
  • Resistive RAM (RRAM): Most often described as a circuit that is unlike a capacitor, inductor, or resistor, RRAM provides a unique relationship between current and voltage unlike other well-known devices that store charge or magnetic energy or provide linear resistance to current flow. Materials with these properties, called memristors, have been tested for many decades, but engineers usually avoid them because of their nonlinear properties and the lack of application for them. IEEE fellow Leon Chua describes them in “Memristor: The Missing Circuit Element.” A memristor’s behavior can be summarized as follows: Current flow in one direction causes electrical resistance to increase and in the opposite direction resistance decreases, but the memristor retains the last resistance it had when flow is re-started. As such, it can store a nonvolatile state, be programmed, and the state read. For details and even some controversy on what is and is not a memristor, see Resources.
  • Spin transfer torque RAM (STT-RAM): A current passed through a magnetic layer can produce a spin-polarized current that, when directed into a magnetic layer, can change its orientation via angular momentum. This behavior can be used to excite oscillations and flip the orientation of nanometer-scale magnetic devices. The main drawback is the high current needed to flip the orientation.

Consult the many excellent entries in Resources for more in-depth information on each device technology.

From a systems perspective, as these devices evolve, where they can be used and how well each might fill the access gap depends on the device’s:

  • Cost
  • Scalability (device integration size must be smaller than a transistor to beat flash; less than 20 nanometers)
  • Latency to program and read
  • Device reliability
  • Perhaps most importantly, durability (how often it can be programmed and erased before it becomes unreliable).

Based on these device performance considerations, IBM has divided SCM into two main classes:

  • S-type: Asynchronous access via an I/O controller. Threading or multiprogramming is used to hide the I/O latency to the device.
  • M-type: Synchronous access via a memory controller. Think about this as wait-states for RAM access in which a CPU core stalls.

Further, NAND SSD would be considered fast storage, accessed via a block-oriented storage controller (much higher I/O rates but similar bandwidth to a spinning disk drive).

It may seem like the elimination of asynchronous I/O for data processing (except, of course, for archive access or cluster scaling) might be a cure-all for data-centric processing. In some sense it is, but systems designers and software developers will have to change habits. The need for I/O latency hiding will largely go away on each node in a system, but it won’t go away completely. Clusters built from InfiniBand deal with node-to-node data-transfer latency with Message Passing Interface or MapReduce schemes and enjoy similar performance to this envisioned SCM node except when booting or when node data exceeds node working RAM size.

So, for scaling purposes, cluster interconnection and I/O latency hiding among nodes in the cluster is still required.

Moving processing closer to data with coprocessors

Faster access to big data is ideal and looks promising, but some applications will always benefit from the alternative approach of moving processing closer to data interfaces. Many examples exist, such as graphics (graphics processing units, GPUs), network processors, protocol-offload engines like the TCP/IP Offload Engine, RAID on chip, encryption coprocessors, and more recently, the idea of computer vision coprocessors. My research involves computer vision and graphics coprocessors, both at scale in clusters and embedded. I am working on what I call a computer vision processing unit, comparing several coprocessors that became more widely pursued with the 2012 announcement of OpenVX by Khronos (see Resources).

In the embedded world, such a method might be described as an intelligent sensor or smart camera, methods in which preliminary processing of raw data is provided by the sensor interface and an embedded logic device or microprocessor, perhaps even a multicore system on a chip (SoC).

In the scalable world, this most often involves use of a coprocessor bus or channel adapter (like PCI Express, PCIe, and Ethernet or InfiniBand); it provides data processing between the data source (network side) and the node I/O controller (host side).

Whether processing should be done in the I/O path or on a CPU core, and which is more efficient, has always been a topic of hot debate. GPUs and network processors serve as an existence proof that coprocessors can be useful, although their popularity waxes and wanes as coprocessor technology advances relative to general-purpose processors. So, let’s take a quick look at some of the methods:

  • Vector processing for single program, multiple data: Provided today by GPUs, general-purpose GPUs (GP-GPUs), and application processing units (APUs), the idea is that data can be transformed on its way to an output device like a display or sent to a GP-GPU/APU and transformed on a round trip from the host. “General purpose” implies more sophisticated features like double-precision arithmetic compared to single precision only for graphics-specific processing.
  • Many core: Traditional many-core coprocessor cards (see Resources) are available from various vendors. The idea is to lower cost and power consumption by using simpler, yet numerous cores on the I/O bus, with round-trip offloading of processing to the cards for a more capable but power-hungry and costly full-scale multicore host. Typically, the many-core coprocessor might have an order of magnitude more cores than the host and often includes gigabit or 10G Ethernet and other types of network interfaces.
  • I/O bus field-programmable gate arrays (FPGAs): FPGA cards, most often used to prototype a new coprocessor in the early stages of development, can perhaps be used as a solution for low-volume coprocessors as well.
  • Embedded SoCs: A multicore solution can be used in an I/O device to create an intelligent device like a stereo ranging or time-of-flight camera.
  • Interface FPGA/configurable programmable logic devices: A digital logic state machine can provide buffering and continuous transformation of I/O data, such as digital video encoding.

Let’s look at an example based on offload and I/O path. Data transformation has obvious value for applications like the decoding of MPEG4 digital video, in which a GPU coprocessor sits in the path between the player and a display. Figure 2 shows this arrangement for the Linux® MPlayer, which uses the Video Decode and Presentation API for Unix (VDPAU) software interface to NVIDIA MPEG decoding on the GPU.

Figure 2. Simple video decode offload example

Likewise, any data processing or transformation that can be done in-bound or out-bound from a CPU host may have value, especially if the coprocessor can provide processing at a lower cost, with greater efficiency, or with lower power consumption based on purpose-built processors compared to general-purpose CPUs.

To start to understand a GP-GPU compared to a multicore coprocessor approach, try downloading the two examples of a point spread function to sharpen the edges on an image (threaded transform example) compared with the GPU transform example. Both provide the same 320×240-pixel transformation, but in one case, the Compute Unified Device Architecture (CUDA) C code provided requires a GPU or GP-GPU coprocessor and, in the other case, either a multicore host or a many-core (for example, MICA) coprocessor.
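If you want to see the kind of per-pixel work both downloads perform, here is a minimal edge-sharpening sketch in plain Python. It is not the threaded or CUDA example code: the 3×3 kernel weights are a common sharpening choice that I am assuming for illustration, and border pixels are simply copied.

```python
# Assumed sharpening kernel: weights sum to 1, so flat regions are unchanged.
SHARPEN = [[0, -1, 0],
           [-1, 5, -1],
           [0, -1, 0]]

def sharpen(image):
    """Apply the 3x3 kernel to interior pixels of a grayscale image (0..255)."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]          # border pixels are copied unchanged
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = sum(SHARPEN[i][j] * image[r + i - 1][c + j - 1]
                      for i in range(3) for j in range(3))
            out[r][c] = max(0, min(255, acc))   # clamp to the valid pixel range
    return out

flat = [[100] * 4 for _ in range(4)]
step = [[50, 50, 150, 150] for _ in range(4)]
sharpened_flat = sharpen(flat)
sharpened_step = sharpen(step)
```

Each output pixel depends only on its own neighborhood in the input, so the loop body can be handed to one host thread per band of rows or to one GPU thread per pixel without any synchronization.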

So which is better?

Neither approach is clearly better, mostly because the NVRAM solutions have not yet been made widely available (except as expensive battery-backed DRAM or as S-type SCM from IBM Texas Memory Systems Division) and moving processing into the I/O data path has traditionally involved less friendly programming. Both are changing, though: Coprocessors are adopting higher-level languages like the Open Computing Language (OpenCL) in which code written for multicore hosts runs equally well on Intel MICA or Altera Stratix IV/V architectures.

Likewise, all of the major computer systems companies are working feverishly to release SCM products, with PCRAM the most likely to be available first. My advice is to assume that both will be with us for some time and operating systems and applications must be able to deal with both. The memristor, or RRAM, includes a vision that resembles Isaac Asimov’s fictional positronic brain in which memory and processing are fully integrated as they are in a human neural system but with metallic materials. The concept of fully integrated NVM and processing is generally referred to as processing in memory (PIM) or neuromorphic processing (see Resources). Scalable NVM integrated processing holds extreme promise for biologically inspired intelligent systems similar to the human visual cortex, for example. Pushing toward the goal of integrated NVM, with PIM from both sides, is probably a good approach, so I plan to keep up with and keep working on systems that employ both methods—coprocessors and NVM. Nature has clearly favored direct, low-level, full integration of PIM at scale for intelligent systems.

Scaling nodes with InfiniBand interconnection

System designers always have to consider the trade-off between scaling up each node in a system and scaling out a solution that uses networking or more richly interconnected clustering to scale processing, I/O, and data storage. At some point, scaling the memory, processing, and storage a single node can integrate hits a practical limit in terms of cost, power efficiency, and size. It is also often more convenient from a reliability, availability, and servicing perspective to spread capability over multiple nodes so that if one needs repair or upgrade, others can continue to provide service with load sharing.

Figure 3 shows a typical InfiniBand 3D torus interconnection.

Figure 3. Example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)

In Figure 3, the 4x4x4 shown is for the San Diego Supercomputer Center (SDSC) Gordon supercomputer, as documented by Mellanox, which uses a 36-port InfiniBand switch to connect nodes to each other and to storage I/O.
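The wraparound connectivity of a 3D torus is simple to express in code. The sketch below treats each position in the 4x4x4 torus as one switch and computes its six neighbors. The even split of 1152 nodes across 64 switch positions is my inference from the figure caption, not something the Mellanox documentation cited here states.

```python
def torus_neighbors(x, y, z, dim=4):
    """Six wraparound neighbors of the switch at (x, y, z) in a dim^3 torus."""
    return [
        ((x + 1) % dim, y, z), ((x - 1) % dim, y, z),
        (x, (y + 1) % dim, z), (x, (y - 1) % dim, z),
        (x, y, (z + 1) % dim), (x, y, (z - 1) % dim),
    ]

switch_positions = 4 ** 3                      # 64 positions in a 4x4x4 torus
nodes_per_switch = 1152 // switch_positions    # assumes nodes are spread evenly
corner = torus_neighbors(0, 0, 0)
```

Because the edges wrap around, even the corner position has six neighbors, and under the even-split assumption each position would serve 18 nodes.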

InfiniBand, Converged Enhanced Ethernet (CEE) with iSCSI, and Fibre Channel are the most often used scalable storage interfaces for access to big data. This storage area network (SAN) scaling for RAID arrays is used to host distributed, scalable file systems like Ceph, Lustre, Apache Hadoop, or the IBM General Parallel File System (GPFS). Use of CEE and InfiniBand for storage access using the OpenFabrics Alliance SCSI Remote Direct Memory Access (RDMA) Protocol and iSCSI Extensions for RDMA is a natural fit for SAN storage integrated with an InfiniBand cluster. Storage is viewed more as a distributed archive of unstructured data that is searched or mined and loaded into node NVRAM for cluster processing. Higher-level data-centric cluster processing methods like Hadoop MapReduce can also be used to bring code (software) to the data at each node. These are big-data topics that I describe more in the last part of this four-part series.

The future of data-centric scaling

This article makes an argument for systems design and architecture that move processors closer to data-generating and -consuming devices, as well as simplification of memory hierarchy to include fewer levels, leveraging lower-latency, scalable NVM devices. This defines a data-centric node design that can be further scaled with low-latency off-the-shelf interconnection networks like InfiniBand. The main challenge with data-centric computing is not instructions-per-second or floating-point-operations-per-second only, but rather IOPS and the overall power efficiency of data processing.

In Part 1 of this series, I uncovered methods and tools to build a compute node and small cluster application that can scale with on-demand HPC by leveraging the cloud. In this article I detailed such high-performance system design advances as co-processing, nonvolatile memory, interconnection, and storage.

In Part 3 in this series I provide more in-depth coverage of a specific data-centric computing application — video analytics. Video analytics includes applications such as facial recognition for security and computer forensics, use of cameras for intelligent transportation monitoring, retail and marketing that involves integration of video (for example, visualizing yourself in a suit you’re considering from a web-based catalog), as well as a wide range of computer vision and augmented reality applications that are being invented daily. Although many of these applications involve embedded computer vision, most also require digital video analysis, transformation, and generation in cloud-based scalable servers. Algorithms like Sobel transformation can be run on typical servers, but algorithms like the generalized Hough transform, facial recognition, image registration, and stereo (point cloud) mapping, for example, require the NVM and coprocessor approaches this article discussed for scaling.

In the last part of the series, I deal with big data issues.

Cloud scaling, Part 3: Explore video analytics in the cloud

Using methods, tools, and system design for video and image analysis, monitoring, and security

Explore and consider methods, tools, and system design for video and image analysis with cloud scaling. As described in earlier articles in this series, video analytics requires a more balanced data-centric compute architecture compared to traditional compute-centric, scalable, high-performance computing. The author examines the use of OpenCV and similar tools for digital video analysis and methods to scale this analysis using cluster and distributed system design.

The use of coprocessors designed for video analytics and the new OpenVX hardware acceleration discussed in previous articles can be applied to the computer vision (CV) examples presented in this article. This new data-centric technology for CV and video analytics requires the system designer to re-think application software and system design to meet demanding requirements, such as real-time monitoring and security for large, public facilities and infrastructure as well as a more entertaining, interactive, and safer world.

Public safety and security

The integration of video analytics in public places is perhaps the best way to ensure public safety, providing digital forensic capabilities to law enforcement and the potential to increase detection of threats and prevention of public safety incidents. At the same time, this need has to be balanced with rights to privacy, which can become a contentious issue if these systems are abused or not well understood. For example, the extension of facial detection, as shown in Figure 1, to facial recognition has obvious identification capability and can be used to track an individual as he or she moves from one public place to another. To many people, facial analytics might be seen as an invasion of privacy, and use of CV and video analytics should adhere to surveillance and privacy rights laws and policies. Any product or service developer might want to start by considering best practices outlined by the Federal Trade Commission (FTC; see Resources).

Digital video standards for encoding video to compress, transport, uncompress, and display it, such as those from the Moving Picture Experts Group (MPEG), have led to a revolution in computing ranging from social networking media and amateur digital cinema to improved training and education. Tools for decoding and consuming digital video are used widely every day, but video analytics also needs tools, such as Open Computer Vision (OpenCV), that encode and analyze uncompressed video frames. One of the readily available and quite capable tools for encoding and decoding of digital video is FFmpeg; for still images, the GNU Image Manipulation Program (GIMP) is quite useful (see Resources for links). With these three basic tools, an open source developer is fully equipped to start exploring computer vision (CV) and video analytics. Before exploring these tools and development methods, however, let’s first define these terms better and consider applications.

The first article in this series, Cloud scaling, Part 1: Build your own and scale with HPC on demand, provided a simple example using OpenCV that implements a Canny edge transformation on continuous real-time video from a Linux® web cam. This is an example of a CV application that you could use as a first step in segmenting an image. In general, CV applications involve acquisition, digital image formats for pixels (picture elements that represent points of illumination), images and sequences of them (movies), processing and transformation, segmentation, recognition, and ultimately scene descriptions. The best way to understand what CV encompasses is to look at examples. Figure 1 shows face and facial feature detection analysis using OpenCV. Note that in this simple example, using the Haar Cascade method (a machine learning algorithm) for detection analysis, the algorithm best detects faces and eyes that are not occluded (for example, my youngest son’s face is turned to the side) or shadowed and when the subject is not squinting. This is perhaps one of the most important observations that can be made regarding CV: It’s not a trivial problem. Researchers in this field often note that although much progress has been made since its advent more than 50 years ago, most applications still can’t match the scene segmentation and recognition performance of a 2-year-old child, especially when the ability to generalize and perform recognition in a wide range of conditions (lighting, size variation, orientation and context) is considered.

Figure 1. Using OpenCV for facial recognition

Image showing facial recognition analysis

To help you understand the analytical methods used in CV, I have created a small test set of images from the Anchorage, Alaska area that is available for download. The images have been processed using GIMP and OpenCV. I developed C/C++ code to use the OpenCV application programming interface with a Linux web cam, precaptured images, or MPEG movies. The use of CV to understand video content (sequences of images), either in real time or from precaptured databases of image sequences, is typically referred to as video analytics.

Defining video analytics

Video analytics is broadly defined as analysis of digital video content from cameras (typically visible light, but it could be from other parts of the spectrum, such as infrared) or stored sequences of images. Video analytics involves several disciplines but at least includes:

  • Image acquisition and encoding. As a sequence of images or groups of compressed images. This stage of video analytics can be complex, including photometer (camera) technology, analog decoding, digital formats for arrays of light samples (pixels) in frames and sequences, and methods of compressing and decompressing this data.
  • CV. The inverse of graphical rendering, where acquired scenes are converted into descriptions compared to rendering a scene from a description. Most often, CV assumes that this process of using a computer to “see” should operate wherever humans do, which often distinguishes it from machine vision. The goal of seeing like a human does most often means that CV solutions employ machine learning.
  • Machine vision. Again, the inverse of rendering but most often in a well-controlled environment for the purpose of process control—for example, inspecting printed circuit boards or fabricated parts to make sure they are geometrically correct within tolerances.
  • Image processing. A broad application of digital signal processing methods to samples from photometers and radiometers (detectors that measure electromagnetic radiation) to understand the properties of an observation target.
  • Machine learning. Algorithms developed based on the refinement of the algorithm through training data, whereby the algorithm improves performance and generalizes when tested with new data.
  • Real-time and interactive systems. Systems that require response by a deadline relative to a request for service or at least a quality of service that meets SLAs with customers or users of the services.
  • Storage, networking, database, and computing. All required to process digital data used in video analytics, but a subtle, yet important distinction is that this is an inherently data-centric compute problem, as was discussed in Part 2 of this series.

Video analytics, therefore, is broader in scope than CV and is a system design problem that might include mobile elements like a smart phone (for example, Google Goggles) and cloud-based services for the CV aspects of the overall system. For example, IBM has developed a video analytics system known as the video correlation and analysis suite (VCAS), for which the IBM Travel and Transportation Solution Brief, Smarter Safety and Security Solution for Rail [PDF], is available; it is a good example of a system design concept. Detailed focus on each system design discipline involved in a video analytics solution is beyond the scope of this article, but many pointers to more information for system designers are available in Resources. The rest of this article focuses on CV processing examples and applications.

Basic structure of video analytics applications

You can break the architecture of cloud-based video analytics systems down into two major segments: embedded intelligent sensors (such as smart phones, tablets with a camera, or customized smart cameras) and cloud-based processing for analytics that can’t be directly computed on the embedded device. Why break the architecture into two segments rather than solve the problem entirely in the smart embedded device? Embedding CV in transportation, smart phones, and products is not always practical. Even when embedding a smart camera makes sense, the compressed video or scene description is often back-hauled to a cloud-based video analytics system just to offload the resource-limited embedded device. Perhaps more important than resource limitations, though, is that video transported to the cloud for analysis allows for correlation with larger data sets and annotation with up-to-date global information for augmented reality (AR) returned to the devices.

The smart camera devices for applications like gesture and facial expression recognition must be embedded. However, more intelligent inference to identify people and objects and fully parse scenes is likely to require scalable data-centric systems that can be more efficiently scaled in a data center. Furthermore, data processing acceleration at scale ranging from the Khronos OpenVX CV acceleration standards to the latest MPEG standards and feature-recognition databases are key to moving forward with improved video analytics, and two-segment cloud plus smart camera solutions allow for rapid upgrades.

With sufficient data-centric computing capability leveraging the cloud and smart cameras, the dream of inverse rendering can perhaps be realized where, in the ultimate “Turing-like” test that can be demonstrated for CV, scene parsing and re-rendered display and direct video would be indistinguishable for a remote viewer. This is essentially done now in digital cinema with photorealistic rendering, but this rendering is nowhere close to real time or interactive.

Video analytics apps: Individual scenarios

Killer applications for CV and video analytics are being thought of every day, some perhaps years from realization because of computing requirements or implementation cost. Nevertheless, here is a list of interesting applications:

  • AR views of scenes for improved understanding. If you have ever looked at, for example, a landing plane and thought, I wish I could see the cockpit view with instrumentation, this is perhaps possible. I worked in Space Shuttle mission control long ago, where a large development team meticulously re-created a view of the avionics for ground controllers that shadowed what astronauts could see—all graphical, but imaging fusion of both video and graphics to annotate and re-create scenes with meta-data. A much simplified example is presented here in concept to show how an aircraft observed via a tablet computer camera could be annotated with attitude and altitude estimation data (see the example in this article).
  • Skeletal transformations to track the movement and estimate the intent and trajectory of an animal that might jump onto a highway. See the example in this article.
  • Fully autonomous or mostly autonomous vehicles with human supervisory control only. Think of the steps between today’s cruise control and tomorrow’s full autonomous car. Cars that can parallel park themselves today are a great example of this stepwise development.
  • Beyond face detection to reliable recognition and, perhaps more importantly, for expression feedback. Is the driver of a semiautonomous vehicle aggravated, worried, surprised?
  • Virtual shopping (AR to try products). Shoppers can see themselves in that new suit.
  • Signage that interacts with viewers. This is based on expressions, likes and dislikes, and data that the individual has made public.
  • Two-way television and interactive digital cinema. Entertainment for which viewers can influence the experience, almost as if they were actors in the content.
  • Interactive telemedicine. This is available any time with experts from anywhere in the world.

I make no attempt in this article to provide an exhaustive list of applications, but I explore more by looking closely at both AR (annotated views of the world through a camera and display—think heads-up displays such as fighter pilots have) and skeletal transformations for interactive tracking. To learn more beyond these two case studies and for more in-depth application-specific uses of CV and video analytics in medicine, transportation safety, security and surveillance, mapping and remote sensing, and an ever-increasing list of system automation that includes video content analysis, consult the many entries in Resources. The tools available can help anyone with computer engineering skills get started. You can also download a larger set of test images as well as all OpenCV code I developed for this article.

Example: Augmented reality

Real-time video analytics can change the face of reality by augmenting the view a consumer has with a smart phone held up to products or our view of the world (for example, while driving a vehicle) and can allow for a much more interactive experience for users for everything from movies to television, shopping, and travel to how we work. In AR, the ideal solution provides seamless transition from scenes captured with digital video to scenes generated by rendering for a user in real time, mixing both digital video and graphics in an AR view for the user. Poorly designed AR systems distract a user from normal visual cues, but a well-designed AR system can increase overall situation awareness, fusing metrics with visual cues (think fighter pilot heads-up displays).

The use of CV and video analytics in intelligent transportation systems has significant value for safety improvement, and perhaps eventually CV may be the key technology for self-driving vehicles. This appears to be the case based on the U.S. Defense Advanced Research Projects Agency challenge and the Google car, although use of the full spectrum with forward-looking infrared and instrumentation in addition to CV has made autonomous vehicles possible. Another potentially significant application is air traffic safety, especially for airports to detect and prevent runway incursion scenarios. The imagined AR view of an aircraft on final approach at Ted Stevens airport in Anchorage shows a Hough linear transform that might be used to segment and estimate aircraft attitude and altitude visually, as shown in Figure 2. Runway incursion safety is of high interest to the U.S. Federal Aviation Administration (FAA), and statistics for these events can be found in Resources.

Figure 2. AR display example

Image showing an example of video augmentation

For intelligent transportation, drivers will most likely want to participate even as systems become more intelligent, so a balance of automation and human participation and intervention should be kept in mind (for autonomous or semiautonomous vehicles).
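The Hough linear transform mentioned above can be sketched in a few lines. The article does not show its own implementation, so the following is only a minimal illustration under stated assumptions: the struct `HoughLine` and the function `strongestLine` are hypothetical names, the edge pixels are assumed to come from an earlier edge detector and to lie inside the image, and the accumulator uses 1-degree and 1-pixel resolution.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// A line in normal form: rho = x * cos(theta) + y * sin(theta).
struct HoughLine {
    int theta_deg;  // angle of the line normal, 0..179 degrees
    int rho;        // signed distance from the origin, in pixels
    int votes;      // number of edge pixels that voted for this line
};

// Hypothetical helper: every edge pixel (x, y) votes for all lines that
// pass through it; the accumulator cell with the most votes wins.
HoughLine strongestLine(const std::vector<std::pair<int, int> >& edge_pixels,
                        int width, int height) {
    assert(width > 0 && height > 0);
    const double kPi = 3.14159265358979323846;
    const int max_rho = static_cast<int>(
        std::ceil(std::sqrt(static_cast<double>(width) * width +
                            static_cast<double>(height) * height)));
    const int rho_bins = 2 * max_rho + 1;  // rho ranges over [-max_rho, max_rho]
    std::vector<int> acc(180 * rho_bins, 0);

    for (std::size_t i = 0; i < edge_pixels.size(); ++i) {
        const int x = edge_pixels[i].first;
        const int y = edge_pixels[i].second;
        for (int t = 0; t < 180; ++t) {
            const double theta = t * kPi / 180.0;
            const double r = x * std::cos(theta) + y * std::sin(theta);
            const int rho = static_cast<int>(std::floor(r + 0.5));
            ++acc[t * rho_bins + (rho + max_rho)];
        }
    }

    HoughLine best = {0, 0, 0};
    for (int t = 0; t < 180; ++t) {
        for (int r = 0; r < rho_bins; ++r) {
            const int v = acc[t * rho_bins + r];
            if (v > best.votes) {
                best.theta_deg = t;
                best.rho = r - max_rho;
                best.votes = v;
            }
        }
    }
    return best;
}
```

For a horizontal row of edge pixels at y = 5, the winning cell has rho = 5 and a normal angle within a degree of 90, because neighbouring angles collect the same votes at this resolution. A real attitude estimate would need more than the single strongest line, for example one line per wing edge.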

Skeletal transformation examples: Tracking movement for interactive systems

Skeletal transformations are useful for applications like gesture recognition or gait analysis of humans or animals: any application where the motion of a body’s skeleton (rigid members) must be tracked can benefit from a skeletal transformation. Most often, this transformation is applied to bodies or limbs in motion, which further enables the use of background elimination for foreground tracking. However, it can still be applied to a single snapshot, as shown in Figure 3, where a picture of a moose is first converted to a gray map, then a threshold binary image, and finally the medial distance is found for each contiguous region and thinned to a single pixel, leaving just the skeletal structure of each object. Notice that the ears on the moose are back, an indication of the animal’s intent (higher-resolution skeletal transformation might be able to detect this as well as the gait of the animal).

Figure 3. Skeletal transformation of a moose

Image showing an example of a skeletal transformation

Skeletal transformations can certainly be useful in tracking animals that might cross highways or charge a hiker, but the transformation has also become of high interest for gesture recognition in entertainment, such as in the Microsoft® Kinect® software developer kit (SDK). Gesture recognition can be used for entertainment but also has many practical purposes, such as automatic sign language recognition (not yet available as a product but a concept in research). Certainly skeletal transformation CV can analyze the human gait for diagnostic or therapeutic purposes in medicine or to capture human movement for animation in digital cinema.
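The threshold and medial-distance steps of the moose example can be illustrated without OpenCV. This is a rough sketch and not the code used for Figure 3: `medialDistance` is a hypothetical name, the distance is the city-block distance computed with the classic two-pass scan (an assumption, since the article does not say which metric it uses), and the final thinning to a one-pixel skeleton is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: threshold a row-major 8-bit grayscale image and
// return, for every foreground pixel, its city-block distance to the
// nearest background pixel. Background pixels get 0, and everything
// outside the image is treated as background.
std::vector<int> medialDistance(const std::vector<unsigned char>& gray,
                                int width, int height,
                                unsigned char threshold) {
    assert(width > 0 && height > 0);
    assert(gray.size() == static_cast<std::size_t>(width) * height);
    const int kInf = width + height;  // larger than any possible distance
    std::vector<int> dist(gray.size());

    // Step 1: threshold to a binary image (foreground = kInf, background = 0).
    for (std::size_t i = 0; i < gray.size(); ++i)
        dist[i] = gray[i] > threshold ? kInf : 0;

    // Step 2: forward pass, propagating distances from the top and left.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int& d = dist[y * width + x];
            if (d == 0) continue;
            const int up = (y > 0) ? dist[(y - 1) * width + x] : 0;
            const int left = (x > 0) ? dist[y * width + x - 1] : 0;
            d = std::min(d, std::min(up, left) + 1);
        }
    }

    // Step 3: backward pass, propagating from the bottom and right.
    for (int y = height - 1; y >= 0; --y) {
        for (int x = width - 1; x >= 0; --x) {
            int& d = dist[y * width + x];
            if (d == 0) continue;
            const int down = (y < height - 1) ? dist[(y + 1) * width + x] : 0;
            const int right = (x < width - 1) ? dist[y * width + x + 1] : 0;
            d = std::min(d, std::min(down, right) + 1);
        }
    }
    return dist;
}
```

The skeleton of a region lies along the ridge of largest distance values. For a bar three pixels thick, the middle row gets distance 2 and the outer rows distance 1, so the ridge is the centre line.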

Skeletal transformations are widely used in gesture-recognition systems for entertainment. Creative and Intel have teamed up to create an SDK for Windows® called the Creative* Interactive Gesture Camera Developer Kit (see Resources for a link) that uses a time-of-flight light detection and ranging sensor, camera, and stereo microphone. This SDK is similar to the Kinect SDK but intended for early access for developers to build gesture-recognition applications for the device. The SDK is amazingly affordable and could become the basis for some breakthrough consumer devices now that it is in the hands of a broad development community. To get started, you can purchase the device from Intel, and then download the Intel® Perceptual Computing SDK. The demo images are included as an example along with numerous additional SDK examples to help developers understand what the device can do. You can use the finger tracking example shown in Figure 4 right away just by installing the SDK for Microsoft Visual Studio® and running the Gesture Viewer sample.

Figure 4. Skeletal transformation using the Intel Perceptual Computing SDK and Creative Interactive Gesture Camera Developer Kit

Image showing a skeletal and blob transformation of a hand

The future of video analytics

This article makes an argument for the use of video analytics primarily to improve public safety; for entertainment purposes, social networking, telemedicine, and medical augmented diagnostics; and to envision products and services as a consumer. Machine vision has quietly helped automate industry and process control for years, but CV and video analytics in the cloud now show promise for providing vision-based automation in the everyday world, where the environment is not well controlled. This will be a challenge both in terms of algorithms for image processing and machine learning as well as data-centric computer architectures discussed in this series. The challenges for high-performance video analytics (in terms of receiver operating characteristics and throughput) should not be underestimated, but with careful development, this rapidly growing technology promises a wide range of new products and even human vision system prosthetics for those with sight impairments or loss of vision. Based on the value of vision to humans, no doubt this is also fundamental to intelligent computing systems.

Downloads

Description Name Size
OpenCV Video Analytics Examples va-opencv-examples.zip 600KB
Simple images for use with OpenCV example-images.zip 6474KB

Resources

Learn

Get products and technologies

Downloads

Description Name Size
GPU accelerated image transform sharpenCUDA.zip 644KB
Grid threaded comparison hpc_dm_cloud_grid.zip 1.08MB
Simple image for transform benchmark Cactus-320×240-pixel.ppm.zip 206KB

Downloads

Description Name Size
Continuous HD digital camera transform example transform-example.zip 123KB
Grid threaded prime generator benchmark hpc_cloud_grid.tar.gz 3KB
High-resolution image for transform benchmark Cactus-12mpixel.zip 12288KB

Posted in Apps Development, CLOUD, Computer Languages, Computer Software, Computer Vision, GPU (CUDA), GPU Accelareted, Image Processing, OpenCV, PARALLEL, Project Related, Video

cuSVM for CUDA 6.0 and Matlab x64

Posted by Hemprasad Y. Badgujar on October 13, 2014



This page shows how to build cuSVM, a GPU-accelerated SVM library that uses a dense data format. The library was written by Austin Carpenter. The procedure uses CUDA 6.0, MATLAB x64 and Visual Studio 2012. The code and project files were modified so that the library compiles and links; many steps were taken from http://www.parallelcoding.com/2012/02/09/cusvm-in-visual-studio-2010-with-cuda-4-0/

Modifications:

  1. Added MATLAB variables:
    1. cuSVMTrainIter – contains the number of iterations the solver performed
    2. cuSVMTrainObj – contains the final objective function value after training
  2. In file cuSVMSolver.cu, lines 869-874, all calls to cudaMemcpyToSymbol were changed, because the CUDA runtime library no longer accepts the symbol name as a string (a change introduced in CUDA 5.0 that also applies to CUDA 6.0) – http://stackoverflow.com/questions/12947914/error-in-cudamemcpytosymbol-using-cuda-5
    before the change:
    mxCUDA_SAFE_CALL(cudaMemcpyToSymbol("taumin", &h_taumin, sizeof(float) ));
    after the change:
    mxCUDA_SAFE_CALL(cudaMemcpyToSymbol(taumin, &h_taumin, sizeof(float) ));
  3. In functions FindBI, FindBJ and FindStoppingJ, the way the reduction in shared memory is done was changed (http://stackoverflow.com/questions/6510427/cuda-finding-max-using-reduction-error)
  4. The kernel cache size is limited to 400MB; if you want a bigger cache, modify line 24 of cuSVMSolver.cu:
    #define KERNEL_CACHE_SIZE (400*1024*1024)

 

Build Procedure

Download the preconfigured cuSVM Visual Studio 2010 solution with LibSVM and a MATLAB script for classification

All the steps described below have already been applied; you only have to check that all paths are set correctly and that your GPU compute capability is set properly.

My setup:

  • Windows 7 x64
  • Visual Studio 2012
  • CUDA 6.0
  • MATLAB R2014a
  • the code was tested on a Quadro 5000 and a GeForce GTX 580

Prerequisites:

Determine paths:

  1. MATLAB include path; mine is "D:\Program Files\MATLAB\R2014a\extern\include" (MATLAB was installed on drive D:\)
  2. MATLAB library path: "D:\Program Files\MATLAB\R2014a\extern\lib\win64\microsoft"
  3. CUDA toolkit include path: "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include"
  4. GPU compute capability; mine is 1.2 for the GeForce GT 330M (compute_12,sm_12) and 3.0 for the GeForce GTX 690 (compute_30,sm_30)
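The Code Generation value follows a fixed pattern derived from the compute capability, as the two examples in the list show. The small helper below only illustrates that pattern; `codeGeneration` is a hypothetical name and is not part of cuSVM or the CUDA toolkit.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build the "compute_XY,sm_XY" pair from a compute
// capability X.Y, for example 3.0 -> "compute_30,sm_30".
std::string codeGeneration(int major, int minor) {
    assert(major >= 1 && minor >= 0 && minor <= 9);
    const std::string cc = std::to_string(major) + std::to_string(minor);
    return "compute_" + cc + ",sm_" + cc;
}
```

Calling `codeGeneration(1, 2)` gives the string used for the GeForce GT 330M above, and `codeGeneration(3, 0)` the one for the GeForce GTX 690.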

 

Changes made in the project properties (the same steps apply to both projects, cuSVMPredict and cuSVMTrain):

  1. Open the solution in VS 2010
  2. Right click on the project (cuSVMTrain or cuSVMPredict) and choose "Build Customizations …"; make sure that "CUDA 5.0(.targets, .props)" is checked
  3. Right click on cuSVMTrain and choose project "Properties"
    1. Expand "Configuration Properties"
      1. General->Target Extension: .mexw64
      2. General->Configuration Type: Dynamic Library (.dll)
    2. Expand C/C++
      1. General->Additional Include Directories: $(SolutionDir)inc\;D:\Program Files\MATLAB\R2014a\extern\include;$(CudaToolkitIncludeDir);%(AdditionalIncludeDirectories)
    3. Expand CUDA C/C++
      1. Common->Additional Include Directories: $(SolutionDir)inc\;D:\Program Files\MATLAB\R2014a\extern\include;$(CudaToolkitIncludeDir);%(AdditionalIncludeDirectories)
      2. Common->Target Machine Platform: 64-bit (--machine 64)
      3. Device->Code Generation: compute_30,sm_30 – this depends on your GPU compute capability
    4. Expand Linker
      1. General->Additional Library Directories: %(AdditionalLibraryDirectories); $(CudaToolkitLibDir); D:\Program Files\MATLAB\R2014a\extern\lib\win64\microsoft
      2. Input->Additional Dependencies: cuda.lib;cublas.lib;libmex.lib;libmat.lib;libmx.lib;cudart.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)
      3. Input->Module Definition File: TrainModule.def (for the cuSVMTrain project; for cuSVMPredict set PredictModule.def)
    5. Expand Build Events
      1. Post-Build Event->Command Line (each command on a separate line):
        echo copy "$(CudaToolkitBinDir)\cudart*.dll" "$(OutDir)"
        copy "$(CudaToolkitBinDir)\cudart*.dll" "$(OutDir)"

Finally, check whether the build configuration is "Release" or "Debug".

 

How to use cuSVM

The zip package contains two folders:

  • cuSVM – Visual Studio 2012 solution
  • cuSVMmatlab – contains:
    1. libsvm,
    2. compiled cuSVMTrain.mexw64 and cuSVMPredict.mexw64 in the Lib folder,
    3. sample datasets in the data folder,
    4. the MATLAB script cuSVMTest.m

  1. Build cuSVM in Release or Debug mode (important: check your GPU compute capability).
  2. Copy cuSVMTrain.mexw64 and cuSVMPredict.mexw64 to the Lib folder.
  3. Add the Lib folder to the MATLAB search path.
  4. If you want to classify a dataset, open the m-file.

 

 

 

Posted in Computing Technology, GPU (CUDA), GPU Accelareted, Image Processing

Integral Histogram for fast HoG feature calculation

Posted by Hemprasad Y. Badgujar on October 12, 2014


Histograms of Oriented Gradients (HOG) features in combination with a support vector machine have been successfully used for object detection (most popularly pedestrian detection).
An integral histogram representation can be used for fast calculation of Histograms of Oriented Gradients over arbitrary rectangular regions of the image. The idea of an integral histogram is analogous to that of an integral image, used by Viola and Jones for fast calculation of Haar features for face detection. Mathematically,

$$IH(x, y, b) = \sum_{x' \le x} \sum_{y' \le y} h(x', y', b)$$

where b represents the bin number of the histogram and h(x', y', b) is the contribution of pixel (x', y') to bin b (its gradient magnitude if its orientation falls in bin b, and zero otherwise). This way, the calculation of HOG over any arbitrary rectangle in the image requires just 4 * bins array references. For more details on the integral histogram representation, please refer to

Integral Histogram
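The idea can be shown without OpenCV. The class below is a minimal sketch, not the code of this post: `IntegralHistogram` and its members are hypothetical names, and the per-pixel bin index and vote are assumed to be computed beforehand. It keeps one summed-area table per bin with an extra zero row and column, which is the same layout cvIntegral() produces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class IntegralHistogram {
public:
    // bin_of[i] is the bin index of pixel i and weight[i] its vote
    // (both row-major, width * height entries).
    IntegralHistogram(const std::vector<int>& bin_of,
                      const std::vector<double>& weight,
                      int width, int height, int bins)
        : stride_(width + 1), rows_(height + 1), bins_(bins),
          sums_(static_cast<std::size_t>(bins) * stride_ * rows_, 0.0) {
        assert(width > 0 && height > 0 && bins > 0);
        assert(bin_of.size() == static_cast<std::size_t>(width) * height);
        assert(weight.size() == bin_of.size());
        for (int b = 0; b < bins; ++b) {
            for (int y = 1; y <= height; ++y) {
                for (int x = 1; x <= width; ++x) {
                    const std::size_t i =
                        static_cast<std::size_t>(y - 1) * width + (x - 1);
                    const double v = (bin_of[i] == b) ? weight[i] : 0.0;
                    // Standard summed-area recurrence.
                    sums_[index(b, x, y)] = v + sums_[index(b, x - 1, y)]
                                              + sums_[index(b, x, y - 1)]
                                              - sums_[index(b, x - 1, y - 1)];
                }
            }
        }
    }

    // Histogram of the rectangle [x, x + w) x [y, y + h):
    // four table lookups per bin, independent of the rectangle size.
    std::vector<double> query(int x, int y, int w, int h) const {
        assert(x >= 0 && y >= 0 && w >= 0 && h >= 0);
        assert(x + w < stride_ && y + h < rows_);
        std::vector<double> hist(bins_);
        for (int b = 0; b < bins_; ++b) {
            hist[b] = sums_[index(b, x + w, y + h)] - sums_[index(b, x + w, y)]
                    - sums_[index(b, x, y + h)] + sums_[index(b, x, y)];
        }
        return hist;
    }

private:
    std::size_t index(int b, int x, int y) const {
        return (static_cast<std::size_t>(b) * rows_ + y) * stride_ + x;
    }
    int stride_, rows_, bins_;
    std::vector<double> sums_;
};
```

Building the tables costs one pass over the image per bin; after that, each rectangle query costs 4 * bins lookups, which is where the speed-up for sliding-window detection comes from.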

The following demonstrates how such an integral histogram can be calculated from an image and used for the calculation of HOG features using the OpenCV computer vision library:

/*Function to calculate the integral histogram*/

IplImage** calculateIntegralHOG(IplImage* in)
{
    /*Convert the input image to grayscale*/
    IplImage* img_gray = cvCreateImage(cvGetSize(in), IPL_DEPTH_8U, 1);
    cvCvtColor(in, img_gray, CV_BGR2GRAY);
    cvEqualizeHist(img_gray, img_gray);

    /* Calculate the derivatives of the grayscale image in the x and y directions using a Sobel operator and obtain 2 gradient images for the x and y directions. doSobel() is a helper that is not shown here; the code below reads its output as 32-bit float images */
    IplImage *xsobel, *ysobel;
    xsobel = doSobel(img_gray, 1, 0, 3);
    ysobel = doSobel(img_gray, 0, 1, 3);
    cvReleaseImage(&img_gray);

    /* Create an array of 9 images (9 because I assume bin size 20 degrees and unsigned gradient (180/20 = 9)), one for each bin, which will have zeroes for all pixels, except for the pixels in the original image for which the gradient values correspond to the particular bin. These will be referred to as bin images. These bin images will be then used to calculate the integral histogram, which will quicken the calculation of HOG descriptors */
    IplImage** bins = (IplImage**) malloc(9 * sizeof(IplImage*));
    for (int i = 0; i < 9; i++) {
        bins[i] = cvCreateImage(cvGetSize(in), IPL_DEPTH_32F, 1);
        cvSetZero(bins[i]);
    }

    /* Create an array of 9 images (note the dimensions of the image, the cvIntegral() function requires the size to be that), to store the integral images calculated from the above bin images. These 9 integral images together constitute the integral histogram */
    IplImage** integrals = (IplImage**) malloc(9 * sizeof(IplImage*));
    for (int i = 0; i < 9; i++) {
        integrals[i] = cvCreateImage(cvSize(in->width + 1, in->height + 1), IPL_DEPTH_64F, 1);
    }

    /* Calculate the bin images. The magnitude and orientation of the gradient at each pixel is calculated using the xsobel and ysobel images. {Magnitude = sqrt(sq(xsobel) + sq(ysobel)), gradient = atan(ysobel/xsobel)}. Then according to the orientation of the gradient, the value of the corresponding pixel in the corresponding image is set */
    int x, y;
    float temp_gradient, temp_magnitude;
    for (y = 0; y < in->height; y++) {

        /* ptr1 and ptr2 point to beginning of the current row in the xsobel and ysobel images respectively. ptrs[i] point to the beginning of the current rows in the bin images. ptrs is a local array, so nothing is allocated (or leaked) per row */
        float* ptr1 = (float*) (xsobel->imageData + y * (xsobel->widthStep));
        float* ptr2 = (float*) (ysobel->imageData + y * (ysobel->widthStep));
        float* ptrs[9];
        for (int i = 0; i < 9; i++) {
            ptrs[i] = (float*) (bins[i]->imageData + y * (bins[i]->widthStep));
        }

        /*For every pixel in a row gradient orientation and magnitude are calculated and corresponding values set for the bin images. */
        for (x = 0; x < in->width; x++) {

            /* if the xsobel derivative is zero for a pixel, a small value is added to it, to avoid division by zero. atan returns values in radians, which on being converted to degrees, correspond to values between -90 and 90 degrees. 90 is added to each orientation, to shift the orientation values range from {-90-90} to {0-180}. This is just a matter of convention. {-90-90} values can also be used for the calculation. */
            if (ptr1[x] == 0) {
                temp_gradient = ((atan(ptr2[x] / (ptr1[x] + 0.00001))) * (180 / CV_PI)) + 90;
            }
            else {
                temp_gradient = ((atan(ptr2[x] / ptr1[x])) * (180 / CV_PI)) + 90;
            }
            temp_magnitude = sqrt((ptr1[x] * ptr1[x]) + (ptr2[x] * ptr2[x]));

            /*The bin image is selected according to the gradient values: bin 0 for values up to 20 degrees, bin 1 up to 40, and so on, with everything above 160 in bin 8. The corresponding pixel value is made equal to the gradient magnitude at that pixel in the corresponding bin image */
            int bin = 0;
            while (bin < 8 && temp_gradient > 20 * (bin + 1)) {
                bin++;
            }
            ptrs[bin][x] = temp_magnitude;
        }
    }

    cvReleaseImage(&xsobel);
    cvReleaseImage(&ysobel);

    /*Integral images for each of the bin images are calculated*/
    for (int i = 0; i < 9; i++) {
        cvIntegral(bins[i], integrals[i]);
    }

    /*The bin images and the array holding them are no longer needed*/
    for (int i = 0; i < 9; i++) {
        cvReleaseImage(&bins[i]);
    }
    free(bins);

    /*The function returns an array of 9 images which constitute the integral histogram. The caller has to release the 9 images and free the array*/
    return (integrals);
}
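The special case for a zero x derivative can be avoided with atan2. The sketch below is an alternative, not the code of this post: `orientationBin` is a hypothetical name, and the folding step is chosen so that it follows the same convention as above (orientation shifted into 0 to 180 degrees, bin 0 for values up to 20 degrees, bin 8 for values above 160).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: map a gradient (gx, gy) to one of 9 unsigned
// orientation bins of 20 degrees each.
int orientationBin(double gx, double gy) {
    const double kPi = 3.14159265358979323846;
    // atan2 is defined for gx == 0; the result is in (-180, 180] degrees.
    double angle = std::atan2(gy, gx) * 180.0 / kPi;
    // Unsigned gradient: fold opposite directions together into [-90, 90].
    if (angle > 90.0) angle -= 180.0;
    else if (angle < -90.0) angle += 180.0;
    angle += 90.0;  // shift to [0, 180]
    int bin = 0;
    while (bin < 8 && angle > 20.0 * (bin + 1)) ++bin;
    assert(bin >= 0 && bin <= 8);
    return bin;
}
```

A purely horizontal gradient (gx = 1, gy = 0) lands in bin 4, and the opposite diagonals (1, 1) and (-1, -1) share bin 6, which is what "unsigned" means here. One quirk of this convention is that a purely vertical gradient sits exactly at the wrap-around between bin 0 and bin 8.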

The following demonstrates how the integral histogram calculated using the above function can be used to calculate the histogram of oriented gradients for any rectangular region in the image:

/* The following function takes as input the rectangular cell for which the histogram of oriented gradients has to be calculated, a matrix hog_cell of dimensions 1×9 to store the bin values for the histogram, the integral histogram, and the normalization scheme to be used. No normalization is done if normalization = -1 */

void calculateHOG_rect(CvRect cell, CvMat* hog_cell,
                       IplImage** integrals, int normalization) {

    /* Calculate the bin values for each of the bin of the histogram one by one */
    for (int i = 0; i < 9; i++) {

        /* Corner values of the rectangle in the integral image of bin i: a = top-left, b = bottom-right, c = top-right, d = bottom-left */
        float a = ((double*) (integrals[i]->imageData + (cell.y)
                * (integrals[i]->widthStep)))[cell.x];
        float b = ((double*) (integrals[i]->imageData + (cell.y + cell.height)
                * (integrals[i]->widthStep)))[cell.x + cell.width];
        float c = ((double*) (integrals[i]->imageData + (cell.y)
                * (integrals[i]->widthStep)))[cell.x + cell.width];
        float d = ((double*) (integrals[i]->imageData + (cell.y + cell.height)
                * (integrals[i]->widthStep)))[cell.x];

        ((float*) hog_cell->data.fl)[i] = (a + b) - (c + d);
    }

    /*Normalize the matrix*/
    if (normalization != -1) {
        cvNormalize(hog_cell, hog_cell, 1, 0, normalization);
    }
}
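The functions further down pass normalization = 4 by default, which is the value of CV_L2 in the OpenCV C API. As a rough illustration of what that scheme does to a single cell histogram, here is a plain C++ sketch; `normalizeL2` is a hypothetical name and the function only mirrors the L2 case with a target norm of 1.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scale a histogram in place so that its Euclidean (L2) norm becomes 1.
void normalizeL2(std::vector<float>& hist) {
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < hist.size(); ++i)
        sum_sq += static_cast<double>(hist[i]) * hist[i];
    const double norm = std::sqrt(sum_sq);
    assert(norm >= 0.0);
    if (norm == 0.0) return;  // an all-zero histogram is left unchanged
    for (std::size_t i = 0; i < hist.size(); ++i)
        hist[i] = static_cast<float>(hist[i] / norm);
}
```

For example, the two-bin histogram (3, 4) becomes (0.6, 0.8). Normalizing makes the descriptor less sensitive to overall contrast, since doubling every gradient magnitude leaves the result unchanged.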

The code below shows how the HOG features for pedestrian detection can be obtained using the above framework, producing the feature matrix on which an SVM can be trained for pedestrian detection:

/*This function takes in the path and names of
64x128 pixel images, the size of the cell to be
used for calculation of hog features(which should
be 8x8 pixels, some modifications will have to be 
done in the code for a different cell size, which
could be easily done once the reader understands
how the code works), a default block size of 2x2
cells has been considered and the window size
parameter should be 64x128 pixels (appropriate
modifications can be easily done for other say
64x80 pixel window size). All the training images
are expected to be stored at the same location and
the names of all the images are expected to be in
sequential order like a1.jpg, a2.jpg, a3.jpg ..
and so on or a(1).jpg, a(2).jpg, a(3).jpg ... The
explanation of all the parameters below will make
clear the usage of the function. The synopsis of
the function is as follows :

prefix : it should be the path of the images, along
with the prefix in the image name for
example if the present working directory is
/home/saurabh/hog/ and the images are in
/home/saurabh/hog/images/positive/ and are
named like pos1.jpg, pos2.jpg, pos3.jpg ....,
then the prefix parameter would be
"images/positive/pos" or if the images are
named like pos(1).jpg, pos(2).jpg,
pos(3).jpg ... instead, the prefix parameter
would be "images/positive/pos("

suffix : it is the part of the name of the image
files after the number for example for the
above examples it would be ".jpg" or ").jpg"

cell   : it should be CvSize(8,8), appropriate changes
need to be made for other cell sizes

window : it should be CvSize(64,128), appropriate
changes need to be made for other window sizes

number_samples : it should be equal to the number of
training images, for example if the
training images are pos1.jpg, pos2.jpg
..... pos1216.jpg, then it should be
1216

start_index : it should be the start index of the images'
names for example for the above case it
should be 1 or if the images were named
like pos1000.jpg, pos1001.jpg, pos1002.jpg
.... pos2216.jpg, then it should be 1000

end_index : it should be the end index of the images'
name for example for the above cases it
should be 1216 or 2216

savexml   : if you want to store the extracted features,
then you can pass to it the name of an xml
file to which they should be saved

normalization : the normalization scheme to be used for
computing the hog features, any of the
opencv schemes could be passed or -1
could be passed if no normalization is
to be done */

CvMat* train_64x128(char *prefix, char *suffix, CvSize cell,
CvSize window, int number_samples, int start_index,
int end_index, char *savexml = NULL, int canny = 0,
int block = 1, int normalization = 4) 
{

char filename[50] = "", number[8];
int prefix_length;
prefix_length = strlen(prefix);
int bins = 9;

/* A default block size of 2x2 cells is considered */

int block_width = 2, block_height = 2;

/* Calculation of the length of a feature vector for
an image (64x128 pixels)*/

int feature_vector_length;
feature_vector_length = (((window.width -
cell.width * block_width)/ cell.width) + 1) *
(((window.height - cell.height * block_height)
/ cell.height) + 1) * 36;

/* Matrix to store the feature vectors for
all(number_samples) the training samples */

CvMat* training = cvCreateMat(number_samples,
feature_vector_length, CV_32FC1);

CvMat row;
CvMat* img_feature_vector;
IplImage** integrals;
int i = 0, j = 0;

printf("Beginning to extract HoG features from positive images\n");

strcat(filename, prefix);

/* Loop to calculate hog features for each
image one by one */

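for (i = start_index; i <= end_index; i++)
{
    /* Build the file name: prefix + index + suffix. snprintf is used here in place of the cvtInt() helper, which is not shown */
    snprintf(number, sizeof(number), "%d", i);
    strcat(filename, number);
    strcat(filename, suffix);
    IplImage* img = cvLoadImage(filename);

    /* Calculation of the integral histogram for
    fast calculation of hog features*/

    integrals = calculateIntegralHOG(img);
    cvGetRow(training, &row, j);
    img_feature_vector
        = calculateHOG_window(integrals, cvRect(0, 0,
        window.width, window.height), normalization);
    cvCopy(img_feature_vector, &row);
    j++;
    printf("%s\n", filename);

    /* Cut filename back to the prefix for the next image */
    filename[prefix_length] = '\0';

    /* Release what was allocated for this image */
    for (int k = 0; k < 9; k++)
    {
        cvReleaseImage(&integrals[k]);
    }
    free(integrals);
    cvReleaseImage(&img);
}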
if (savexml != NULL) 
{
cvSave(savexml, training);
}

return training;
}
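The feature vector length computed in train_64x128() can be checked by hand. The helper below repeats the same arithmetic in isolation; `hogFeatureLength` is a hypothetical name, and the constant 36 is written out as 2 x 2 cells per block times 9 bins.

```cpp
#include <cassert>

// Number of HOG values for one detection window. The formula implies that
// blocks advance by one cell at a time, and each block contributes
// block_w * block_h cells of `bins` values.
int hogFeatureLength(int win_w, int win_h, int cell_w, int cell_h,
                     int block_w = 2, int block_h = 2, int bins = 9) {
    assert(cell_w > 0 && cell_h > 0);
    assert(win_w >= cell_w * block_w && win_h >= cell_h * block_h);
    const int blocks_x = (win_w - cell_w * block_w) / cell_w + 1;
    const int blocks_y = (win_h - cell_h * block_h) / cell_h + 1;
    return blocks_x * blocks_y * block_w * block_h * bins;
}
```

For the 64x128 window with 8x8 cells this gives 7 x 15 blocks of 36 values, that is 3780 values per image; a 64x80 window would give 7 x 9 x 36 = 2268.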

/* This function is almost the same as
train_64x128(...), except the fact that it can
take as input images of bigger sizes and
generate multiple samples out of a single
image.

It takes 2 more parameters than
train_64x128(...), horizontal_scans and
vertical_scans to determine how many samples
are to be generated from the image. It
generates horizontal_scans x vertical_scans
number of samples. The meaning of the rest of
the parameters is the same.

For example for a window size of
64x128 pixels, if a 320x240 pixel image is
given as input with horizontal_scans = 5 and
vertical_scans = 2, then it will generate 10
samples by considering windows in the image
with (x,y,width,height) as (0,0,64,128),
(64,0,64,128), (128,0,64,128), .....,
(0,112,64,128), (64,112,64,128) .....
(256,112,64,128)

The function takes non-overlapping windows
from the image except the last row and last
column, which could overlap with the second
last row or second last column. So the values
of horizontal_scans and vertical_scans passed
should be such that it is possible to perform
that many scans in a non-overlapping fashion
on the given image. For example horizontal_scans
= 5 and vertical_scans = 3 cannot be passed for
a 320x240 pixel image as that many vertical scans
are not possible for an image of height 240
pixels and window of height 128 pixels. */

CvMat* train_large(char *prefix, char *suffix,
CvSize cell, CvSize window, int number_images,
int horizontal_scans, int vertical_scans,
int start_index, int end_index,
char *savexml = NULL, int normalization = 4)
{
char filename[50] = "", number[8];
int prefix_length;
prefix_length = strlen(prefix);
int bins = 9;

/* A default block size of 2x2 cells is considered */

int block_width = 2, block_height = 2;

/* Calculation of the length of a feature vector for
an image (64x128 pixels)*/

int feature_vector_length;
feature_vector_length = (((window.width -
cell.width * block_width) / cell.width) + 1) *
(((window.height - cell.height * block_height)
/ cell.height) + 1) * 36;

/* Matrix to store the feature vectors for
all(number_samples) the training samples */

CvMat* training = cvCreateMat(number_images
* horizontal_scans * vertical_scans,
feature_vector_length, CV_32FC1);

CvMat row;
CvMat* img_feature_vector;
IplImage** integrals;
int i = 0, j = 0;
strcat(filename, prefix);

printf("Beginning to extract HoG features from negative images\n");

/* Loop to calculate hog features for each
image one by one */

for (i = start_index; i <= end_index; i++) 
{
cvtInt(number, i);
strcat(filename, number);
strcat(filename, suffix);
IplImage* img = cvLoadImage(filename);
integrals = calculateIntegralHOG(img);
for (int l = 0; l < vertical_scans - 1; l++)
{
for (int k = 0; k < horizontal_scans - 1; k++)
{
cvGetRow(training, &row, j);
img_feature_vector = calculateHOG_window(
integrals, cvRect(window.width * k,
window.height * l, window.width,
window.height), normalization);

cvCopy(img_feature_vector, &row);
j++;
}

cvGetRow(training, &row, j);

img_feature_vector = calculateHOG_window(
integrals, cvRect(img->width - window.width,
window.height * l, window.width,
window.height), normalization);

cvCopy(img_feature_vector, &row);
j++;
}

for (int k = 0; k < horizontal_scans - 1; k++)
{
cvGetRow(training, &row, j);

img_feature_vector = calculateHOG_window(
integrals, cvRect(window.width * k,
img->height - window.height, window.width,
window.height), normalization);

cvCopy(img_feature_vector, &row);
j++;
}
cvGetRow(training, &row, j);

img_feature_vector = calculateHOG_window(integrals,
cvRect(img->width - window.width, img->height -
window.height, window.width, window.height),
normalization);

cvCopy(img_feature_vector, &row);
j++;

printf("%s\n", filename);
filename[prefix_length] = '\0';
for (int k = 0; k < 9; k++)
{
cvReleaseImage(&integrals[k]);
}

cvReleaseImage(&img);

}

printf("%d negative samples created \n",
training->rows);

if (savexml != NULL)
{
cvSave(savexml, training);
printf("Negative samples saved as %s\n",
savexml);
}

return training;

}
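
The window layout used by train_large(...) can be reproduced without OpenCV. This is a sketch under assumptions, not the listing's code: Window and scanWindows are hypothetical names, and the function only enumerates the rectangles in the same order as the loops above, with the last column and last row snapped to the image border.

```cpp
#include <cassert>
#include <vector>

struct Window { int x, y, width, height; };

// Enumerate the sample windows in the order train_large visits them:
// all but the last row/column are placed edge to edge, the last
// row/column is aligned with the image border and may overlap.
std::vector<Window> scanWindows(int img_w, int img_h, int win_w, int win_h,
                                int horizontal_scans, int vertical_scans)
{
    assert(horizontal_scans >= 1 && vertical_scans >= 1);
    assert(win_w <= img_w && win_h <= img_h);
    // The edge-to-edge windows must still fit inside the image.
    assert((horizontal_scans - 1) * win_w <= img_w);
    assert((vertical_scans - 1) * win_h <= img_h);

    std::vector<Window> out;
    for (int l = 0; l < vertical_scans; ++l) {
        int y = (l < vertical_scans - 1) ? win_h * l : img_h - win_h;
        for (int k = 0; k < horizontal_scans; ++k) {
            int x = (k < horizontal_scans - 1) ? win_w * k : img_w - win_w;
            out.push_back({x, y, win_w, win_h});
        }
    }
    return out;
}
```

For the 320x240 example in the comment above, scanWindows(320, 240, 64, 128, 5, 2) yields 10 windows, from (0,0,64,128) to (256,112,64,128); asking for 3 vertical scans trips the last assertion because two edge-to-edge rows already need 256 pixels.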


/* This function trains a linear support vector
machine for object classification. The synopsis is
as follows :

pos_mat : pointer to CvMat containing hog feature
vectors for positive samples. This may be
NULL if the feature vectors are to be read
from an xml file

neg_mat : pointer to CvMat containing hog feature
vectors for negative samples. This may be
NULL if the feature vectors are to be read
from an xml file

savexml : The name of the xml file to which the learnt
svm model should be saved

pos_file: The name of the xml file from which feature
vectors for positive samples are to be read.
It may be NULL if feature vectors are passed
as pos_mat

neg_file: The name of the xml file from which feature
vectors for negative samples are to be read.
It may be NULL if feature vectors are passed
as neg_mat*/


void trainSVM(CvMat* pos_mat, CvMat* neg_mat, char *savexml,
char *pos_file = NULL, char *neg_file = NULL) 
{


/* Read the feature vectors for positive samples */
if (pos_file != NULL) 
{
printf("positive loading...\n");
pos_mat = (CvMat*) cvLoad(pos_file);
printf("positive loaded\n");
}

/* Read the feature vectors for negative samples */
if (neg_file != NULL)
{
neg_mat = (CvMat*) cvLoad(neg_file);
printf("negative loaded\n");
}

int n_positive, n_negative;
n_positive = pos_mat->rows;
n_negative = neg_mat->rows;
int feature_vector_length = pos_mat->cols;
int total_samples;
total_samples = n_positive + n_negative;

CvMat* trainData = cvCreateMat(total_samples,
feature_vector_length, CV_32FC1);

CvMat* trainClasses = cvCreateMat(total_samples,
1, CV_32FC1 );

CvMat trainData1, trainData2, trainClasses1,
trainClasses2;

printf("Number of positive Samples : %d\n",
pos_mat->rows);

/*Copy the positive feature vectors to training
data*/

cvGetRows(trainData, &trainData1, 0, n_positive);
cvCopy(pos_mat, &trainData1);
cvReleaseMat(&pos_mat);

/*Copy the negative feature vectors to training
data*/

cvGetRows(trainData, &trainData2, n_positive,
total_samples);

cvCopy(neg_mat, &trainData2);
cvReleaseMat(&neg_mat);

printf("Number of negative Samples : %d\n",
trainData2.rows);

/*Form the training classes for positive and
negative samples. Positive samples belong to class
1 and negative samples belong to class 2 */

cvGetRows(trainClasses, &trainClasses1, 0, n_positive);
cvSet(&trainClasses1, cvScalar(1));

cvGetRows(trainClasses, &trainClasses2, n_positive,
total_samples);

cvSet(&trainClasses2, cvScalar(2));


/* Train a linear support vector machine to learn from
the training data. The parameters may be played and
experimented with to see their effects */

CvSVM svm(trainData, trainClasses, 0, 0,
CvSVMParams(CvSVM::C_SVC, CvSVM::LINEAR, 0, 0, 0, 2,
0, 0, 0, cvTermCriteria(CV_TERMCRIT_EPS,0, 0.01)));

printf("SVM Training Complete!!\n");

/*Save the learnt model*/

if (savexml != NULL) {
svm.save(savexml);
}
cvReleaseMat(&trainClasses);
cvReleaseMat(&trainData);

}
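
The way trainSVM(...) stacks its samples is easy to lose among the cvGetRows calls: positives occupy the first n_positive rows with class 1, negatives the remaining rows with class 2. A minimal sketch of that label layout, using std::vector instead of CvMat (makeLabels is a hypothetical name, not part of the listing):

```cpp
#include <cassert>
#include <vector>

// Build the class column that matches the stacked training matrix:
// rows [0, n_positive) are class 1, rows [n_positive, total) are class 2.
std::vector<float> makeLabels(int n_positive, int n_negative)
{
    assert(n_positive >= 0 && n_negative >= 0);
    std::vector<float> labels(n_positive, 1.0f);
    labels.insert(labels.end(), n_negative, 2.0f);
    return labels;
}
```

The row order of the labels has to match the row order of the feature matrix, which is why the listing copies the positives first in both places.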

 

Posted in CUDA, GPU Accelareted, Image Processing, OpenCV, OpenCV Tutorial, PARALLEL

Computer Vision Algorithm Implementations

Posted by Hemprasad Y. Badgujar on May 6, 2014


Participate in Reproducible Research

General Image Processing

OpenCV
(C/C++ code, BSD lic) Image manipulation, matrix manipulation, transforms
Torch3Vision
(C/C++ code, BSD lic) Basic image processing, matrix manipulation and feature extraction algorithms: rotation, flip, photometric normalisations (Histogram Equalization, Multiscale Retinex, Self-Quotient Image or Gross-Brajovic), edge detection, 2D DCT, 2D FFT, 2D Gabor, PCA to do Eigen-Faces, LDA to do Fisher-Faces. Various metrics (Euclidean, Mahalanobis, ChiSquare, NormalizeCorrelation, TangentDistance, …)
ImLab
(C/C++ code, MIT lic) A Free Experimental System for Image Processing (loading, transforms, filters, histogram, morphology, …)
CIMG
(C/C++ code, GPL and LGPL lic) CImg Library is an open source C++ toolkit for image processing
Generic Image Library (GIL), Boost integration
(C/C++ code, MIT lic) Adobe open source C++ Generic Image Library (GIL)
SimpleCV a kinder, gentler machine vision library
(python code, MIT lic) SimpleCV is a Python interface to several powerful open source computer vision libraries in a single convenient package
PCL, The Point Cloud Library
(C/C++ code, BSD lic) The Point Cloud Library (or PCL) is a large scale, open project for point cloud processing. The PCL framework contains numerous state-of-the-art algorithms including filtering, feature estimation, surface reconstruction, registration, model fitting and segmentation.
Population, imaging library in C++ for processing, analysing, modelling and visualising
(C/C++ code, CeCill lic) Population is an open-source imaging library in C++ for processing, analysing, modelling and visualising including more than 200 algorithms designed by V. Tariel.
qcv
(C/C++ code, LGPL 3) A computer vision framework based on Qt and OpenCV that provides an easy to use interface to display, analyze and run computer vision algorithms. The library is provided with multiple application examples including stereo, SURF, Sobel and Hough transform.
Machine Vision Toolbox
(MATLAB/C, LGPL lic) image processing, segmentation, blob/line/point features, multiview geometry, camera models, colorimetry.
BoofCV
(Java code, Apache lic) BoofCV is an open source Java library for real-time computer vision and robotics applications. BoofCV is organized into several packages: image processing, features, geometric vision, calibration, visualize, and IO.
Simd
(C++ code, MIT lic) Simd is a free open source library in C++. It includes high performance image processing algorithms. The algorithms are optimized using SIMD CPU extensions such as SSE2, SSSE3, SSE4.2 and AVX2.
Free but not open source – ArrayFire (formerly LibJacket) is a matrix library for CUDA
(CUDA/C++, free lic) ArrayFire offers hundreds of general matrix and image processing functions, all running on the GPU. The syntax is very Matlab-like, with the goal of offering easy porting of Matlab code to C++/ArrayFire.

Image Acquisition, Decoding & encoding

FFMPEG
(C/C++ code, LGPL or GPL lic) Record, convert and stream audio and video (many codecs)
OpenCV
(C/C++ code, BSD lic) PNG, JPEG,… images, avi video files, USB webcam,…
Torch3Vision
(C/C++ code, BSD lic) Video file decoding/encoding (ffmpeg integration), image capture from a frame grabber or from USB, Sony pan/tilt/zoom camera control using VISCA interface
lib VLC
(C/C++ code, GPL lic) Used by VLC player: record, convert and stream audio and video
Live555
(C/C++ code, LGPL lic) RTSP streams
ImageMagick
(C/C++ code, GPL lic) Loading & saving DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PhotoCD, PNG, Postscript, SVG, TIFF, and more
DevIL
(C/C++ code, LGPL lic) Loading & saving various image formats
FreeImage
(C/C++ code, GPL & FPL lic) PNG, BMP, JPEG, TIFF loading
VideoMan
(C/C++ code, LGPL lic) VideoMan is trying to make the image capturing process from cameras, video files or image sequences easier.

Segmentation

OpenCV
(C/C++ code, BSD lic) Pyramid image segmentation
Branch-and-Mincut
(C/C++ code, Microsoft Research Lic) Branch-and-Mincut Algorithm for Image Segmentation
Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo

Machine Learning

Torch
(C/C++ code, BSD lic) Gradient machines ( multi-layered perceptrons, radial basis functions, mixtures of experts, convolutional networks and even time-delay neural networks), Support vector machines, Ensemble models (bagging, adaboost), Non-parametric models (K-nearest-neighbors, Parzen regression and Parzen density estimator), distributions (Kmeans, Gaussian mixture models, hidden Markov models, input-output hidden Markov models, and Bayes classifier), speech recognition tools

Object Detection

OpenCV
(C/C++ code, BSD lic) Viola-jones face detection (Haar features)
Torch3Vision
(C/C++ code, BSD lic) MLP & cascade of Haar-like classifiers face detection
Hough Forests
(C/C++ code, Microsoft Research Lic) Class-Specific Hough Forests for Object Detection
Efficient Subwindow Object Detection
(C/C++ code, Apache Lic) Christoph Lampert “Efficient Subwindow” algorithms for Object Detection
INRIA Object Detection and Localization Toolkit
(C/C++ code, Custom Lic) Histograms of Oriented Gradients library for Object Detection

Object Category Labelling

Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo
Multi-label optimization
(C/C++/MATLAB code) The gco-v3.0 library is for optimizing multi-label energies. It supports energies with any combination of unary, pairwise, and label cost terms.

Optical flow

OpenCV
(C/C++ code, BSD lic) Horn & Schunck algorithm, Lucas & Kanade algorithm, Lucas-Kanade optical flow in pyramids, block matching.
GPU-KLT+FLOW
(C/C++/OpenGL/Cg code, LGPL) Gain-Adaptive KLT Tracking and TV-L1 optical flow on the GPU.
RLOF
(C/C++/Matlab code, Custom Lic.) The RLOF library provides GPU / CPU implementation of Optical Flow and Feature Tracking method.

Features Extraction & Matching

SIFT by R. Hess
(C/C++ code, GPL lic) SIFT feature extraction & RANSAC matching
OpenSURF
(C/C++ code) SURF feature extraction algorithm (kind of fast SIFT)
ASIFT (from IPOL)
(C/C++ code, Ecole Polytechnique and ENS Cachan for commercial Lic) Affine SIFT (ASIFT)
VLFeat (formerly Sift++)
(C/C++ code) SIFT, MSER, k-means, hierarchical k-means, agglomerative information bottleneck, and quick shift
SiftGPU
A GPU Implementation of Scale Invariant Feature Transform (SIFT)
Groupsac
(C/C++ code, GPL lic) An enhanced version of RANSAC that considers the correlation between data points

Nearest Neighbors matching

FLANN
(C/C++ code, BSD lic) Approximate Nearest Neighbors (Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration)
ANN
(C/C++ code, LGPL lic) Approximate Nearest Neighbor Searching

Tracking

OpenCV
(C/C++ code, BSD lic) Kalman, Condensation, CAMSHIFT, Mean shift, Snakes
KLT: An Implementation of the Kanade-Lucas-Tomasi Feature Tracker
(C/C++ code, public domain) Kanade-Lucas-Tomasi Feature Tracker
GPU_KLT
(C/C++/OpenGL/Cg code) A GPU-based Implementation of the Kanade-Lucas-Tomasi Feature Tracker
GPU-KLT+FLOW
(C/C++/OpenGL/Cg code, LGPL) Gain-Adaptive KLT Tracking and TV-L1 optical flow on the GPU
On-line boosting trackers
(C/C++, LGPL) On-line boosting tracker, semi-supervised tracker, beyond semi-supervised tracker
Single Camera background subtraction tracking
(C/C++, LGPL) Background subtraction based tracking algorithm using OpenCV.
Multi-camera tracking
(C/C++, LGPL) Multi-camera particle filter tracking algorithm using OpenCV and Intel IPP.

Simultaneous localization and mapping

Real-Time SLAM – SceneLib
(C/C++ code, LGPL lic) Real-time vision-based SLAM with a single camera
PTAM
(C/C++ code, Isis Innovation Limited lic) Parallel Tracking and Mapping for Small AR Workspaces
GTSAM
(C/C++ code, BSD lic) GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices

Camera Calibration & constraint

OpenCV
(C/C++ code, BSD lic) Chessboard calibration, calibration with rig or pattern
Geometric camera constraint – Minimal Problems in Computer Vision
Minimal problems in computer vision arise when computing geometrical models from image data. They often lead to solving systems of algebraic equations.
Camera Calibration Toolbox for Matlab
(Matlab toolbox) Camera Calibration Toolbox for Matlab by Jean-Yves Bouguet (C implementation in OpenCV)

Multi-View Reconstruction

Bundle Adjustment – SBA
(C/C++ code, GPL lic) A Generic Sparse Bundle Adjustment Package Based on the Levenberg-Marquardt Algorithm
Bundle Adjustment – SSBA
(C/C++ code, LGPL lic) Simple Sparse Bundle Adjustment (SSBA)

Stereo

Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo
LIBELAS: Library for Efficient LArge-scale Stereo Matching
(C/C++ code) Disparity maps, stereo

Structure from motion

Bundler
(C/C++ code, GPL lic) A structure-from-motion system for unordered image collections
Patch-based Multi-view Stereo Software (Windows version)
(C/C++ code, GPL lic) A multi-view stereo software that takes a set of images and camera parameters, then reconstructs 3D structure of an object or a scene visible in the images
libmv – work in progress
(C/C++ code, MIT lic) A structure from motion library
Multicore Bundle Adjustment
(C/C++/GPU code, GPL3 lic) Design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems.
openMVG
(C/C++/GPU code, MPL2 lic) OpenMVG (“open Multiple View Geometry”) is a library for computer-vision scientists, especially targeted to the Multiple View Geometry community. It is designed to provide easy access to the classical problem solvers in Multiple View Geometry and to solve them accurately.

Visual odometry

LIBVISO2: Library for VISual Odometry 2
(C/C++ code, Matlab, GPL lic) Libviso 2 is a very fast cross-platform (Linux, Windows) C++ library with MATLAB wrappers for computing the 6 DOF motion of a moving mono/stereo camera.

Posted in Apps Development, C, Computer Hardware, Computer Network & Security, CUDA, Game Development, GPU (CUDA), GPU Accelareted, Graphics Cards, Image Processing, OpenCV, PARALLEL, Simulation, Virtualization

Research skills

Posted by Hemprasad Y. Badgujar on March 26, 2013



Writing papers, giving research talks, and writing research proposals are key skills, but they aren’t easy. This page describes how I approach each of these three challenges, in the hope that they may be useful to you.

You probably have ideas of your own. Can you share them with other readers of this page? Here’s a Wiki page that anyone can edit, where you can add your comments on the presentations below, your own thoughts and suggestions, and pointers to other material you have found useful.

 

How to give a good research talk

“How to give a good research talk”, Simon Peyton Jones, John Launchbury, John Hughes, SIGPLAN Notices 28(11), Nov 1993.

Since we wrote the paper quite a few people have written with constructive comments. Nick Nethercote also has a useful 2-page guide about giving a talk.

 

How to write a good research paper

“How to write a good research paper”, Simon Peyton Jones.

I gave this talk at the Technical University of Vienna in October 2004. There isn’t a paper about it, only the talk:

 

How to write a good research proposal

“How to write a good research proposal”, Simon Peyton Jones and Alan Bundy.

Other resources

Finally, here are some pointers to other advice I have found useful, though Google will find you a lot more besides.

 

Posted in Artificial Intelligence, Entertainment, Image Processing, Journals & Conferences, Neural Network, Placement, Project Related, Simulation

CUDA Open Source Projects

Posted by Hemprasad Y. Badgujar on March 4, 2013



While searching for projects to learn from and develop with, and prompted by requests on the NVidia forums, I have put together a list of free and open source research and projects that use CUDA. If you have one to add, or updates to anything here, please let me know.

GNURadio Software defined radio. A hardware/software combination that does baseband signal processing in software. Experiments were carried out to integrate CUDA into this mix.
MediaCoder A transcoding application for videos with a strong focus on mobile players. Some operations (de-interlacing, scaling, encoding) have been CUDA accelerated.
Bullet Physics simulation that has started to include CUDA, but it is not fully capable yet. Perhaps some CUDA genius will add to it?
Thrust (included in Release 4.0) Excellent Library!! A Parallel Template Library for CUDA. Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity.
Pycuda A module which allows access to the complete range of CUDA functionality in Python, including seamless numpy integration, OpenGL interoperability and lots more. Released under the MIT/X consortium license.
FOLKI-GPU An optical-flow estimation, implemented on CUDA
Flam4 CUDA A CUDA accelerated renderer for fractal frames. Sample videos here and here. Use other tools like Apophysis 2.0 to generate the parameter files (.flame files). A new and ongoing approach to port fractal frame rendering to CUDA is described here.
CUJ2K A CUDA accelerated JPEG 2000 encoder. Command line tool and C/C++ library. This is student work with excellent documentation. Notable speedup achieved only for large files.
Ocelot A Binary Translation Framework for PTX
Msieve A library for factoring large integers, as in RSA-size numbers. The polynomial selection phase of the general number field sieve has a great deal of CUDA code, and the speedup over a CPU is enormous (10-50x)
PFAC An open library for exact string matching performed on GPUs.
cuSVM A CUDA implementation of Support Vector Classification and Regression.
multisvm In this project, it is described how a naïve implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput.
gpuminer Parallel Data Mining on Graphics Processors
Cmatch Performs exact string matching for a set of query sequences and achieves a speedup of as much as 35x on a recent GPU over the equivalent CPU-bound version.
R+GPU A popular Open Source solution for Statistical Analysis

Posted in Apps Development, Artificial Intelligence, Computer Languages, CUDA, GPU (CUDA), GPU Accelareted, Image Processing, Neural Network, Open CL, OpenMP, PARALLEL, Project Related, Simulation, Virtualization

Getting Started with CUDA

Posted by Hemprasad Y. Badgujar on March 4, 2013


What are the capabilities of Nvidia’s CUDA running on the GPU and how does it compare to CPU performance? I bought a GeForce 9800GT and set about finding out, starting off by installing the CUDA drivers, toolkit and SDK from the Cuda Zone.

The first thing I noticed was that on my Vista64 machine the sample projects had been installed to:

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\projects

which is read only. Rather than fight with Vista’s UAC I copied everything into the C:\CUDA directory. To build the solution in VS2008 on my Vista 64 machine all I needed to do was switch the platform to x64 and ignore the warning:

 

Command line warning D9035 : option 'Wp64' has been deprecated and will be removed in a future release

 

and everything was fine. The SDK’s sample template conveniently included both a gold (CPU) implementation of a function and a GPU implementation. An initial run of the template project showed that only the GPU section was timed. Since the reason to use CUDA is performance and I wanted a comparison, the first modification I made was to put a timer around the CPU implementation:

 

cutilCheckError( cutStartTimer( timer));
computeGold( reference, h_idata, num_threads);  // reference solution
cutilCheckError( cutStopTimer( timer));

 

and raced them – but the results weren’t too inspiring:

 

GPU Processing time: 84.362747 (ms)
CPU Processing time: 0.001257 (ms)

 

The CPU solution wasn’t even threaded. I remembered the question of a student at the Stanford CUDA lecture on YouTube:

 

Q: Since there’s overhead in moving the data to the GPU how do you decide when it’s worthwhile?

A: Generally speaking it makes the most sense for large problems with high data intensity where you have to do multiple calculations per data element. 

Hmm, the template code only processed 128 bytes with 32 threads so I had paid the setup costs and then not sent enough data to the GPU – no wonder the CPU was faster. So I needed to increase the data set, but there’s a problem with that since the provided kernel code assumes the entire data set will fit in shared memory and binds the size of the data to the thread count. There needed to be some changes. But you can’t just increase the number of threads or you’ll get:

 

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <template.cu>, line 88 : invalid configuration argument.

 

First step was to find out what resources were available on the GPU, then I’d need to work out how to get at those resources. Running the SDK Device Query told me how much global and shared memory was available as well as how many threads I could use:

 

Device 0: "GeForce 9800 GT"
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 1073741824 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.50 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

 

Some interesting numbers there. Since the GeForce can issue both a multiply-add (MAD, 2 flops) and a multiply (MUL, 1 flop) per clock, per processor, the maximum theoretical throughput is 1.5 GHz * 112 * (2 + 1) = 504 Gflops. By way of comparison, the E8400 in my test machine has a peak of 24 Gflops according to Intel’s data sheet:

 

[Image: Intel E8400 data sheet]
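
The peak-throughput arithmetic above is just clock rate times core count times flops per clock. A minimal sketch in plain C++ follows; peakGflops is a hypothetical helper, and the 4 flops per clock per core used for the E8400 in the note below is my assumption about how a 24 Gflops figure is reached (3.0 GHz, 2 cores), not something taken from the data sheet.

```cpp
#include <cassert>

// Theoretical peak throughput in Gflops:
// clock rate (GHz) x number of cores x flops issued per clock per core.
double peakGflops(double clock_ghz, int cores, int flops_per_clock)
{
    assert(clock_ghz > 0.0 && cores > 0 && flops_per_clock > 0);
    return clock_ghz * cores * flops_per_clock;
}
```

peakGflops(1.5, 112, 3) reproduces the 504 Gflops quoted for the 9800 GT, and peakGflops(3.0, 2, 4) gives 24 under the assumption stated above. Both are ceilings; real code rarely gets close to them.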

 

But back to the problem of pushing more data through.  A few problems:

1) The data size needs to be uncoupled from the thread count which means a change to the GRID count from this:

 

// setup execution parameters
dim3  grid( 1, 1, 1);
dim3  threads( num_threads, 1, 1);

 

to something more like this:

 

const int cThreadsPerBlock = 64;
const int cBlocksPerGridx = 1024;
const int cBlocksPerGridy = 1024;

// total number of data elements: one element per thread
const int cData = cThreadsPerBlock * cBlocksPerGridx * cBlocksPerGridy;

dim3  grid ( cBlocksPerGridx, cBlocksPerGridy, 1); 
dim3  block( cThreadsPerBlock, 1, 1);

 

where the counts of blocks per grid in the x and y directions would need to be derived from the data. To simplify the example I’ve done it backwards and set the data size based on the thread and block breakdown. These grid and block variables are then passed to the GPU using the triple angle bracket <<< >>> notation:

 

testKernel<<< grid, block, shared_mem_size >>>( d_idata, d_odata);

 

which is the same as:

 

testKernel<<< grid, 64, shared_mem_size >>> ( d_idata, d_odata);

 

because the passed argument is converted to a CUDA dim3 type which “is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.” from the programming guide.
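
Going the other way, from a data size to a grid, is mostly ceiling division. The sketch below is plain host-side C++ rather than CUDA; GridDims and deriveGrid are hypothetical names, and the 65535 limit per grid dimension is taken from the device query output shown earlier.

```cpp
#include <cassert>

struct GridDims { unsigned int x, y; };

// Derive a 2D grid that provides at least one thread per data element.
// The grid may overshoot, so a kernel launched with it would still need
// to compare its global id against the data length.
GridDims deriveGrid(unsigned long long n_elements,
                    unsigned int threads_per_block,
                    unsigned int max_grid_dim = 65535)
{
    assert(n_elements > 0 && threads_per_block > 0 && max_grid_dim > 0);
    // Blocks needed, rounded up.
    unsigned long long blocks =
        (n_elements + threads_per_block - 1) / threads_per_block;
    // Fill the x dimension first, then spill into y.
    unsigned long long gx = blocks < max_grid_dim ? blocks : max_grid_dim;
    unsigned long long gy = (blocks + gx - 1) / gx;
    assert(gy <= max_grid_dim);
    return { static_cast<unsigned int>(gx), static_cast<unsigned int>(gy) };
}
```

For the 64 * 1024 * 1024 elements used above with 64 threads per block, this yields a 65535 x 17 grid rather than 1024 x 1024. Both cover the data; the square layout simply makes the arithmetic easier to follow.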

Specifying a shared_mem_size on the kernel call as above allows you to specify the size at runtime. You can then pick up a reference to the memory in the kernel code with:

 

extern  __shared__  float sdata[];

 

Alternatively if you know the size at compilation time you can also declare the shared memory inside the kernel like this:

 

__shared__ float sdata[256];

 

Which would mean the kernel call would just be:

 

testKernel<<< grid, 64 >>> ( d_idata, d_odata);

 

2) The kernel code must loop through the grid. Calculate the thread id, block id and then global id to figure out where in the global data we are up to. Pass the size of the data (int len) since num_threads is no longer coupled with the data length.  The __umul24 in the code provides increased performance but comes with a warning: “Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but __mul24 and __umul24 provide signed and unsigned 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures however, __[u]mul24 will be slower than 32-bit integer multiplication”.

 

__global__ void
testKernel( float* g_idata, float* g_odata, int len) 
{
  // shared memory
  // the size is determined by the host application
  extern  __shared__  float sdata[];

  // thread id
  const unsigned int tid = threadIdx.x;
  // block id
  const unsigned int bid = __umul24(gridDim.x, blockIdx.y) + blockIdx.x ;
  // global memory id
  const unsigned int gid = tid + __umul24(blockDim.x, bid);

  const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z);

 

3) The kernel needs to read from global memory and then synchronise across threads. This causes the threads across warps to sync and thus presents a consistent picture of shared memory, so thread 0 can now read from SDATA(1) and see the data which thread 1 loaded. A call to __syncthreads() is only needed when the count of threads per block exceeds the warpSize because, as mentioned in the performance optimisation whitepaper, “Instructions are SIMD synchronous within a warp”. Of course every call has a cost, and the programming guide states that “throughput for __syncthreads is 8 operations per clock cycle in the case where no thread has to wait for any other threads.”

None of this is important in the sample template code because there is no communication between threads, thus no need for shared memory or thread syncing – a situation in which registers would normally be used but in this case shared memory has presumably been used by Nvidia for example purposes.

 

const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z);
// each thread loads its own element, indexed by the global id computed above
SDATA(tid) = g_idata[gid];
if (cThreadsPerBlock > warpSize) __syncthreads();
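
The block id and global id arithmetic can be sanity-checked on the host before touching the GPU. The following is a plain C++ emulation, not CUDA: coversExactlyOnce is a hypothetical helper that walks a simulated 2D grid of 1D blocks and applies the same formulas as the kernel (with ordinary multiplication standing in for __umul24).

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of the kernel's index arithmetic.
// Returns true when every element in [0, len) is touched by exactly one
// simulated thread of a grid_x x grid_y grid of block_x-wide blocks.
bool coversExactlyOnce(unsigned grid_x, unsigned grid_y,
                       unsigned block_x, unsigned len)
{
    assert(grid_x > 0 && grid_y > 0 && block_x > 0);
    std::vector<int> visits(len, 0);
    for (unsigned by = 0; by < grid_y; ++by)
        for (unsigned bx = 0; bx < grid_x; ++bx)
            for (unsigned tx = 0; tx < block_x; ++tx) {
                unsigned bid = grid_x * by + bx;    // gridDim.x * blockIdx.y + blockIdx.x
                unsigned gid = tx + block_x * bid;  // threadIdx.x + blockDim.x * bid
                if (gid < len) ++visits[gid];       // guard: the grid may overshoot len
            }
    for (unsigned i = 0; i < len; ++i)
        if (visits[i] != 1) return false;
    return true;
}
```

A 4 x 2 grid of 64-thread blocks covers 500 elements exactly once, while a single block does not, which is the situation the original template was in before the grid was enlarged.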

 

At this point I had revised the template to time the CPU for comparison, removed the size restrictions to allow a decent amount of data to be pushed through and was ready to attempt to answer the question – given the overhead of pushing the data to the GPU, when is it worth doing so? Running the code gave some unexpected answers. Keeping the thread count constant I varied the cBlocksPerGridy to yield various data sizes:

 

 

The GPU and CPU seemed to take the same amount of time with different data loads but the GPU was hampered by a constant overhead of 80ms, the exact same difference I noted when only 128 bytes were trialled in the very first instance before any modification.  Where was the time going? Some sort of setup cost?  Also how much was being taken in the kernel and how much in the data transfer? I needed more fine grained data to see what was going on.

I had modified the supplied SDK template code in a minimal way in order to measure CPU vs GPU performance and found that for the simple test code (1 float multiplication) the E8400 CPU with a claimed 24 Gflops was handily outperforming a GPU with a theoretical max of 504 Gflops. Where was all the time going? Was the kernel the culprit, the memory copy or something else? I started out by trying to reuse the

 

cutilCheckError( cutStartTimer( timer));

 

timing method already in the template. Looking into the CUDA source in SDK\common\src\stopwatch_win.cpp showed that on Windows it was using the QueryPerformanceFrequency method which uses the highest possible resolution hardware timer … on the CPU. Using it to measure GPU performance is problematic because timing the GPU using a CPU timer requires the GPU and the CPU to be synchronised with:

 

cudaThreadSynchronize();

 

and ruins the timing information. To measure times on the GPU I needed to use GPU based timing on stream 0 using events:

cudaEventRecord(start, 0);

So I created an array of start and stop events, broke the GPU processes into 5 steps and timed everything. The 5 GPU processes were:

1) Alloc: Host to Device – The allocation of memory on the device for the input array which needed to be copied over from the host.

2) Copy: Host to Device – Copying the input array from the host onto the device. Data size divided by time taken here would give bandwidth.

3) Alloc: Device to Host – The allocation of memory on the device for the output array where the result would be stored before being copied back to the host.

4) Compute – Running the actual kernel, reading from the input array, processing and writing results to the output array.

5) Copy: Device to Host – Copying the output array back to the host.

I also retained my CPU timing to measure the amount of time it took for the GPU to do everything and get the answer back onto the host – that way I’d have a 1:1 comparison against the CPU version. That gives one more thing to measure, how does the sum of the GPU times compare to the overall CPU time?

6) Sync with CPU – CPU time minus sum of GPU times indicates how long it takes to sync the two.
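
The bookkeeping behind step 6 can be sketched on the host side. This is only an illustration: the function names `totalGpuMs` and `syncOverheadMs` are made up for this sketch, and in the real program the per-stage times would come from `cudaEventElapsedTime` on each start/stop event pair.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sum of the per-stage GPU timings (alloc, copy, alloc, compute, copy), in ms.
double totalGpuMs(const std::vector<double>& stageMs)
{
    return std::accumulate(stageMs.begin(), stageMs.end(), 0.0);
}

// Step 6: whatever the CPU timer saw beyond the GPU stages is attributed to
// synchronising the two (plus any other host-side overhead).
double syncOverheadMs(double cpuWallMs, const std::vector<double>& stageMs)
{
    return cpuWallMs - totalGpuMs(stageMs);
}
```

With made-up stage times of 1.0, 2.0, 0.5, 0.5 and 3.0 ms and a CPU wall time of 10.0 ms, the sync figure would come out at 3.0 ms.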

I set up 5 GPU timers to get a breakdown of where the GPU was spending time and kept the 2 CPU timers for the original comparison:

 

// GPU timers - used to time GPU streams
int cGpuTimer = 5;

cudaEvent_t* rgGpuTimer_start = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);
cudaEvent_t* rgGpuTimer_stop = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);

for (int i=0;i<cGpuTimer;i++)
{
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_start[i] ) );
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_stop[i] ) );
}

 

and wrap all the GPU calls with timing calls:

 

cutilCheckError( cutStartTimer( rgTimer[0]));

  // Alloc: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[0], 0 ) );
  float* d_idata;
  cutilSafeCall( cudaMalloc( (void**) &d_idata, global_mem_size));
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[0], 0 ) );

  // Copy: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[1], 0 ) );
  cutilSafeCall( cudaMemcpy( d_idata, h_idata, global_mem_size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[1], 0 ) );

  // Alloc: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[2], 0 ) );
  float* d_odata;
  cutilSafeCall( cudaMalloc( (void**) &d_odata, global_mem_size)); // The pad won't be read back
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[2], 0 ) );

  // Compute
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[3], 0 ) );
  dim3  gridDim ( cBlocksPerGridx, cBlocksPerGridy, 1);
  dim3  blockDim( cThreadsPerBlock, 1, 1);

  testKernel<<< gridDim, blockDim, shared_mem_size >>>( d_idata, d_odata, cData);

  cutilCheckMsg("Kernel execution failed");
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[3], 0 ) );

  // Copy: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[4], 0 ) );
  cutilSafeCall( cudaMemcpy( h_odata, d_odata, global_mem_size, cudaMemcpyDeviceToHost) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[4], 0 ) );

cudaThreadSynchronize(); // Block until memory copy is done to ensure accurate timing

cutilCheckError( cutStopTimer( rgTimer[0]));

 

With this code in place I was ready to find out where the extra 80ms that the GPU took compared to the CPU was coming from and how much time each of the GPU tasks took. First a baseline comparison to verify that the code was still the same and gave the same numbers.

So here’s the graph from before on the left, and here’s the new graph, which should be identical, on the right:

 

 

 

Wow! What’s happened here? All the CPU times are the same, as expected, but the GPU has suddenly closed the gap and now takes only a few ms extra – the 80ms gap has vanished. A diff of the two versions shows that the only change to the code is the addition of GPU timing – and that turns out to be why the GPU suddenly sped up. Directly after setting the device, sending a wakeup call to the GPU like this:

 

if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
    cutilDeviceInit(argc, argv);
else
    cudaSetDevice( cutGetMaxGflopsDeviceId() );

{
    cudaEvent_t wakeGPU;
    cutilSafeCall( cudaEventCreate( &wakeGPU) );
}

 

means that 80ms vanishes from the timed loop later in the code. Note that the variable is scoped so that it is never used afterwards. Is the GeForce like a person that goes faster when it knows it is being watched?! Or is this some wakeup from a power saving mode? I'm not sure, although the likeliest explanation is that the CUDA runtime initialises lazily: the first runtime call that needs the device pays a one-off setup cost, and here the event creation absorbs it before the timed section starts. This is the only extra code needed to cut 80ms from the timing, which shows how tricky it is to time accurately on the ms scale: the slightest change can have a significant effect. It is always advisable to run tests on large volumes of data with a lot of loops to drown out one-off costs like this where possible. While on the topic of getting accurate performance readings, note that all timing should be done on release code, particularly timing breakdowns, as the SDK/common/cutil_readme.txt file states:

 

“These macros are compiled out in release builds and so they will not affect performance. Note that in debug mode they call cudaThreadSynchronize() to ensure that kernel execution has completed, which can affect performance.” 
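
One way to follow the "lots of loops" advice is to run the work a few times untimed before measuring, then average over many runs. The helper below is a minimal host-side sketch using a CPU clock; `averageMs` is a name I made up, and for GPU work the callable would have to synchronise with the device before returning or the numbers would be meaningless.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` a few times untimed (warm-up), then returns the mean wall-clock
// time in milliseconds over `iterations` timed runs.
double averageMs(const std::function<void()>& work, int warmupRuns, int iterations)
{
    for (int i = 0; i < warmupRuns; ++i)
        work();  // absorbs one-off start-up costs such as lazy initialisation

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

The warm-up runs play the same role as the wakeup event above: they pay the setup cost outside the timed region.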

Well now that the extra 80ms has been eliminated what does our new GPU timing code show us about how the GPU spends its time? Here’s a chart showing the breakdown for a 16MB sample:

 

 

The majority of the time, and this holds for the other data sizes, is taken copying data back and forth. So experimentally it seems that the overhead in moving the data back and forth is quite significant. Of the 24.8ms required in total to process 16MB, 21.9ms were spent copying data. The actual processing takes almost no time.  Running a variety of input sizes and timing each one can tell us what kind of bandwidth we are typically getting as shown in the table below where times are in ms:

Size     Copy: Host to Device (ms)   MB/s      Copy: Device to Host (ms)   MB/s
16MB     9.0                         1771.9    11.8                        1359.3
32MB     16.3                        1966.0    22.2                        1442.8
64MB     30.6                        2093.9    49.8                        1285.4
128MB    58.2                        2198.2    83.9                        1526.4
256MB    114.9                       2228.7    171.4                       1493.4
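
The MB/s columns are just data size divided by time taken. A small sketch of that arithmetic, with function names of my own choosing:

```cpp
#include <cassert>
#include <cmath>

// Throughput in MB/s for `sizeMB` megabytes moved in `timeMs` milliseconds.
double bandwidthMBps(double sizeMB, double timeMs)
{
    return sizeMB * 1000.0 / timeMs;
}

// Fraction (0..1) of the total run that was spent copying.
double copyShare(double copyMs, double totalMs)
{
    return copyMs / totalMs;
}
```

Plugging in the rounded 16MB host-to-device time of 9.0ms gives about 1778 MB/s, close to the 1771.9 in the table (the gap comes from the time being rounded to one decimal place). Likewise, 21.9ms of copying out of 24.8ms total is roughly 88% of the run.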

We wanted to find out where the GPU was spending its time and discovered that most of it goes into moving data back and forth. Can we now answer the question of where the GPU outperforms the CPU? Is 2GB/s the expected throughput? Well, Nvidia provides a tool in the SDK to answer that: the “Bandwidth Test”. Running it through the provided GUI tool yields the following results:

 

Running on......
      device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               2152.6

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1919.2

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               48507.8

 

So we can see for 32MB, performance is roughly in line with the template results so that’s case closed … or is it? Two things give cause for concern:

1) PCIe 2.0 is theoretically capable of 500 MB/s per lane and with a x16 slot there are 16 lanes. So throughput should be up around 8GB/s, not the 2GB/s observed.

2) What exactly does “Host to Device Bandwidth for Pageable memory” in the bandwidth test results mean? Pageable memory?

So I found out that the bulk of the time was in data copying, first confirmed that the speeds observed were similar to those given in the Nvidia test suite and then raised new questions about whether we were getting everything out of the hardware given 2GB/s observed and 8GB/s theoretical. So now I need to confirm that my hardware really is PCIe 2.0 x16 and figure out what pageable memory is.
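
To put numbers on the gap between observed and theoretical throughput, here is a small sketch of the arithmetic. The function names are mine, and the 500 MB/s per lane figure is the one quoted above.

```cpp
#include <cassert>

// Theoretical one-way PCIe bandwidth in MB/s: per-lane rate times lane count.
constexpr double pcieTheoreticalMBps(double perLaneMBps, int lanes)
{
    return perLaneMBps * lanes;
}

// Measured throughput as a fraction (0..1) of the theoretical figure.
constexpr double busEfficiency(double measuredMBps, double theoreticalMBps)
{
    return measuredMBps / theoreticalMBps;
}

// PCIe 2.0 x16: 16 lanes at 500 MB/s each.
static_assert(pcieTheoreticalMBps(500.0, 16) == 8000.0, "PCIe 2.0 x16");
```

The 2152.6 MB/s reported by the bandwidth test for pageable host-to-device copies works out to roughly 27% of the 8000 MB/s theoretical figure.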

I’d added GPU based timing to my template code and found out that most of the time was spent copying data back and forth between the host and the device. The “Bandwidth Test” in the SDK gave roughly similar results although it mentioned something about pageable memory. But the big problem was the theoretical performance of PCIe 2.0 x16 far exceeded what I was seeing. So the first step was to confirm that both my graphics card and my motherboard supported and were using PCIe 2.0 x16. To do this I used CPU-Z and GPU-Z, with the following results:

 

[Image: CPU-Z and GPU-Z screenshots]

 

So after confirming the hardware should have been capable of better speeds I took another look at the BandwidthTest. Running with the --help switch reveals several options:

 

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --help
Usage:  bandwidthTest [OPTION]...
Test the bandwidth for device to host, host to device, and device to device transfers

Example:  measure the bandwidth of device to host pinned memory copies in the range 1024 Bytes
          to 102400 Bytes in 1024 Byte increments
./bandwidthTest --memory=pinned --mode=range --start=1024 --end=102400 --increment=1024 --dtoh

Options:
--help  Display this help menu
--csv   Print results as a CSV
--device=[deviceno]     Specify the device device to be used
  all - compute cumulative bandwidth on all the devices
  0,1,2,...,n - Specify any particular device to be used
--memory=[MEMMODE]      Specify which memory mode to use
  pageable - pageable memory
  pinned   - non-pageable system memory
--mode=[MODE]   Specify the mode to use
  quick - performs a quick measurement
  range - measures a user-specified range of values
  shmoo - performs an intense shmoo of a large range of values
--htod  Measure host to device transfers
--dtoh  Measure device to host transfers
--dtod  Measure device to device transfers
--wc    Allocate pinned memory as write-combined
--cputiming     Force CPU-based timing always
Range mode options
--start=[SIZE]  Starting transfer size in bytes
--end=[SIZE]    Ending transfer size in bytes
--increment=[SIZE]      Increment size in bytes

 

Particularly of interest is the “pinned” memory mode. Let’s try that:

 

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5256.9
Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4891.6
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 48498.6

 

and we see that this mode vastly improves the maximum throughput. Not sure why Nvidia didn’t make it the default option. Speeds are now up to 5GB/s. A short investigation of the code reveals that the timing isn’t quite analogous to the testing we are doing in the template code:

bandwidthTest.cu

 

// defines, project
#define MEMCOPY_ITERATIONS  10

 

as the bandwidthTest copies the same memory 10 times in a row as compared to the single copy we are doing. So we expect our performance to lag slightly behind this 5GB/s. Conveniently, all the code needed to use pinned memory is provided in the bandwidthTest, so putting it into a few wrapper functions called freeHost, mallocHost and memCpy yields:

 

////////////////////////////////////////////////////////////////////////////////
//  Memory functions to switch between pinned and pageable memory as required
////////////////////////////////////////////////////////////////////////////////

cudaError
freeHost(void* h_mem, memoryMode memMode)
{
    if( PINNED == memMode ) {
        return cudaFreeHost(h_mem);
    }
    else {
        free(h_mem);
    }
    return cudaSuccess;
}

cudaError
mallocHost(void** h_mem ,uint memSize, memoryMode memMode, bool wc)
{
    if( PINNED == memMode ) {
#if CUDART_VERSION >= 2020
        return cudaHostAlloc( h_mem, memSize, (wc) ? cudaHostAllocWriteCombined : 0 );
#else
        if (wc) {
            printf("Write-Combined unavailable on CUDART_VERSION less than 2020, running is: %d\n", CUDART_VERSION);
        }
        return cudaMallocHost( h_mem, memSize );
#endif
    }
    else { // PAGEABLE memory mode
        *h_mem = malloc( memSize );
    }

    return cudaSuccess;
}

cudaError
memCpy(void* sink, void* source, uint memSize, cudaMemcpyKind direction, memoryMode memMode)
{
    if( PINNED == memMode ) {
        return cudaMemcpyAsync( sink, source, memSize, direction, 0);
    }
    else {
        return cudaMemcpy( sink, source, memSize, direction);
    }
}

 

These functions take the same parameters as the existing functions, with the addition of the memory mode and, for mallocHost, whether or not the memory is write-combined. Changing the allocation, copying and freeing over to these new functions allows the use of pinned memory. Running the same test set shows that the time is now much more evenly spread between tasks:

 

 

 

and running the new numbers on the throughput we get:

Size     Copy: Host to Device (ms)   MB/s      Copy: Device to Host (ms)   MB/s
16MB     3.2                         5026.7    3.3                         4878.0
32MB     6.1                         5242.5    6.5                         4891.5
64MB     12.2                        5251.1    13.1                        4871.7
128MB    24.4                        5247.6    26.2                        4894.1
256MB    48.9                        5239.0    52.3                        4894.7

So now the throughput is much closer to the theoretical limit and matches the best the bandwidthTest provides. The total times are down significantly and the GPU is faster on all tested sizes. The 256MB trial runs in 30% less time, down from 340ms to 236ms.
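
For anyone wanting to check the improvement figures, the arithmetic is sketched below. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Percentage by which `afterMs` is shorter than `beforeMs`.
double percentReduction(double beforeMs, double afterMs)
{
    return (beforeMs - afterMs) / beforeMs * 100.0;
}

// How many times larger one throughput figure is than another.
double speedup(double fasterMBps, double slowerMBps)
{
    return fasterMBps / slowerMBps;
}
```

Going from 340ms to 236ms is a reduction of about 30.6%, and the 256MB host-to-device copy rate of 5239.0 MB/s with pinned memory is about 2.35 times the 2228.7 MB/s measured with pageable memory.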

 

 

The next challenge is to find where else time is lost. The pie charts show that most of the time is still spent in allocation and copying, with very little in compute time, so there’s no need to look at the kernel. We’ve probably already cut most of the time we can from the copying, so that leaves allocation. A good idea would probably be to allocate the memory once and then use it over and over for multiple kernel executions, an intensive process of the kind Nvidia suggests is best suited to CUDA. But what if the code needs to be as shown, one kernel being run on one large set of data and then returning to another application? This is the kind of flow seen in Matlab MEX files where CUDA is used: Matlab passes the data through the C/C++ MEX file, which runs up a CUDA program, gets the result and then returns to Matlab. Could parallel memory copies and allocations speed things up in this situation?

So we’ve switched the code over to use pinned memory in preference to pageable and attained the desired speedup in memory operations from 2GB/s to about 5GB/s. Theoretically PCIe 2.0 x16 should be able to hit 8GB/s and I don’t know why we aren’t able to achieve speeds closer to this number. If anyone knows please leave a comment or e-mail me. From here the next thing to investigate to get more throughput in the single kernel scenario is parallel allocations and copies.


Introduction to CUDA 5.0

Posted by Hemprasad Y. Badgujar on March 3, 2013


In this article, I will introduce the reader to CUDA 5.0. I will briefly talk about the architecture of the Kepler GPU (Graphics Processing Unit) and I will show you how you can take advantage of the many CUDA (Compute Unified Device Architecture) cores in the GPU to create massively parallel programs.

 

List of Figures

Figure 1. Floating Point Operations per Second
Figure 2. Memory Bandwidth
Figure 3. Kepler GK110 Die
Figure 4. Kepler Architecture
Figure 5. Kepler Streaming Multiprocessor (SMX)
Figure 6. Warp Scheduler
Figure 7. Dynamic Parallelism
Figure 8. Dynamic Parallelism
Figure 9. Hyper-Q
Figure 10. Grid Management Unit
Figure 11. GPUDirect
Figure 12. Control Panel
Figure 13. System Manager
Figure 14. Device Manager
Figure 15. Command Prompt
Figure 16. Device Query
Figure 17. New Project Dialog
Figure 18. Cuda Execution Model
Figure 19. CUDA Grid Example
Figure 20. Warp Scheduler
Figure 21. Thread Divergence
Figure 22. CUDA Memory Model
Figure 23. Matrix Multiply – Global Memory
Figure 24. Tiles
Figure 25. Matrix Multiply – Tiles
Figure 26. CUDA Occupancy Calculator

List of Tables

Table 1. Threading Compute Capability
Table 2. Memory Compute Capability
Table 3. Properties of Memory Types

Introduction

Using the power of the NVIDIA GPU, CUDA allows the programmer to create highly parallel applications that can perform hundreds of times faster than an equivalent program that is written to run on the CPU alone. The NVIDIA CUDA Toolkit provides several APIs for integrating a CUDA program into your C and C++ applications.

CUDA supports a heterogeneous programming environment where parts of the application code are written for the CPU and other parts of the application are written to execute on the GPU. The application is compiled into a single executable that can run on both devices simultaneously.

In a CUDA intensive application, the CPU is used to allocate CUDA memory buffers, execute CUDA kernels and retrieve and analyze the result of running a kernel on the GPU. The GPU is used to synchronously process large amounts of data or to execute a simulation that can easily be split into a large grid where each grid executes a part of the simulation in parallel.

The NVIDIA GPU consists of hundreds (even thousands) of CUDA cores that can work in parallel to operate on extremely large datasets in a very short time. For this reason, the NVIDIA GPU is much more suited to work in a highly parallel nature than the CPU.

The image below shows the computing power of the GPU and how it compares to the CPU. The vertical axis shows the theoretical GFLOP/s (Giga Floating Point Operations per Second). The horizontal axis shows the advances in technology over the years[1].

Figure 1. Floating Point Operations Per Second

As can be seen from the image, the latest GPU from NVIDIA (the GTX 680 at the time of this writing) can theoretically perform 3 trillion (3 × 10^12) floating point operations per second (or 3 teraFLOPS)[1].

The GPU is also capable of moving large amounts of data to and from its own on-board memory. The image below shows the memory bandwidth in GB/s of the latest NVIDIA GPU compared to the latest desktop CPUs from Intel[1].

Figure 2. Memory Bandwidth

In this article, I will introduce the latest GPU architecture from NVIDIA: Kepler. I will also introduce the CUDA threading model and demonstrate how you can execute a CUDA kernel in a C++ application. I will also introduce the CUDA memory model and I will show how you can optimize your CUDA application by making use of shared memory.

Kepler Architecture

Kepler is the name given to the latest line of desktop GPUs from NVIDIA. It is currently NVIDIA’s flagship GPU replacing the Fermi architecture.

The Kepler GK110 GPU consists of 7.1 billion transistors[2], making it, according to NVIDIA, the fastest and most complex microprocessor built at the time.

Figure 3. Kepler GK110 Die

Despite its huge transistor count, the Kepler GPU is much more power efficient than its predecessor, delivering up to 3x the performance per watt of the Fermi architecture[2].

The Kepler GPU was designed to be the highest performing GPU in the world. The Kepler GK110 consists of 15 SMX (streaming multiprocessor) units and six 64-bit memory controllers[2] as shown in the image below.

Figure 4. Kepler Architecture

If we zoom into a single SMX unit, we see that each SMX unit consists of 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Figure 5. Kepler Streaming Multiprocessor (SMX)

The 192 single-precision CUDA cores each contain a single-precision floating-point unit (FPU) as well as a 32-bit integer arithmetic logic unit (ALU).

Each SMX supports 64 KB of shared memory, and 48 KB of read-only data cache. The shared memory and the data cache are accessible to all threads executing on the same streaming multiprocessor. Access to these memory areas is highly optimized and should be favored over accessing memory in global DRAM.

The SMX will schedule 32 threads in a group called a warp. Using compute capability 3.5, the GK110 GPU can schedule 64 warps per SMX for a total of 2,048 threads that can be resident in a single SMX at a time (not all threads will be active at the same time as we will see in the section describing the threading model).
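
The figures quoted so far multiply out as follows. This is a sketch that assumes a fully enabled GK110 with all 15 SMX units active (shipping products may have fewer); the constant and function names are my own.

```cpp
#include <cassert>

// Figures quoted in the text for the GK110.
constexpr int kSmxCount       = 15;   // SMX units on a fully enabled GK110
constexpr int kCoresPerSmx    = 192;  // single-precision CUDA cores per SMX
constexpr int kWarpSize       = 32;   // threads per warp
constexpr int kMaxWarpsPerSmx = 64;   // resident warps per SMX (compute capability 3.5)

constexpr int totalCudaCores()        { return kSmxCount * kCoresPerSmx; }
constexpr int residentThreadsPerSmx() { return kMaxWarpsPerSmx * kWarpSize; }
constexpr int residentThreadsPerGpu() { return kSmxCount * residentThreadsPerSmx(); }

// 64 warps of 32 threads gives the 2,048 resident threads per SMX quoted above.
static_assert(residentThreadsPerSmx() == 2048, "64 warps * 32 threads");
```

Under those assumptions the chip has 2,880 single-precision cores and can keep up to 30,720 threads resident at once.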

Each SMX has four warp schedulers and eight instruction dispatch units (two dispatch units per warp scheduler) allowing four warps to be issued and executed concurrently on the streaming multiprocessor[2].

Figure 6. Warp Scheduler

Dynamic Parallelism

The GK110 GPU supports a feature called Dynamic Parallelism. Dynamic Parallelism allows the GPU to create new work for itself by creating new kernels as they are needed without the intervention of the CPU.

Figure 7. Dynamic Parallelism

As can be seen from the image, on the left, the Fermi GPU requires the CPU to execute kernels on the GPU. On the right side of the image, the Kepler GPU is capable of launching kernels from within a kernel itself. No intervention from the CPU is required.

This allows the GPU kernel to be more adaptive to dynamic branching and recursive algorithms, which has some impact on the way we can implement certain functions on the GPU (such as ray tracing, path tracing and other rendering techniques).

Dynamic Parallelism also allows the programmer to better load-balance their GPU based application. Threads can be dynamically launched based on the amount of work that needs to be performed in a particular region of the grid domain. In this case, the initial compute grid can be very coarse and the kernel can dynamically refine the grid size depending on the amount of work that needs to be performed.

Figure 8. Dynamic Parallelism

As can be seen from the image, the left grid granularity is too coarse to produce an accurate simulation. The grid in the center is too fine and many kernels are not performing any actual work. On the right image we see that using dynamic parallelism, the grid can be dynamically refined to produce just the right balance between granularity and workload.

Hyper-Q

The Fermi architecture relied on a single hardware work queue to schedule work from multiple streams. This resulted in false intra-stream dependencies requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed[2].

The Kepler GK110 resolves this false intra-stream dependency with the introduction of the Hyper-Q feature. Hyper-Q increases the total number of hardware work-queues to 32 compared to the single work-queue of the Fermi architecture.

Figure 9. Hyper-Q

CUDA applications that utilize multiple streams will immediately benefit from the multiple hardware work queues offered by the Hyper-Q feature. These stream intensive applications can see a potential increase in performance of up to 32x[2].

Grid Management Unit

In order to facilitate the Dynamic Parallelism feature introduced in the GK110 GPU, a new Grid Management Unit (GMU) needed to be designed. In the previous Fermi architecture, grids were passed to the CUDA Work Distributor (CWD) directly from the stream queue. Since it is now possible to launch more kernels directly from a running CUDA kernel, a bi-directional communication link is required from the SMX to the CWD via the GMU.

Figure 10. Grid Management Unit

NVIDIA GPUDirect

The Kepler GK110 supports the Remote Direct Memory Access (RDMA) feature in NVIDIA GPUDirect[2]. GPUDirect allows data to be transferred directly from one GPU to another via 3rd-party devices such as InfiniBand (IB) adapters, Network Interface Cards (NIC), and solid-state drives (SSD).

Figure 11. GPUDirect

Getting Started with CUDA

In this article, I will use Visual Studio 2010 to create a CUDA enabled application. The settings and configurations for Visual Studio 2008 will be similar and you should be able to follow along even if you have not yet upgraded to VS2010.

System Requirements

Before you can run a CUDA program, you must make sure that your system meets the minimum requirements.

  • CUDA-capable GPU
  • Microsoft Windows XP, Vista, 7, or 8 or Windows Server 2003 or 2008
  • NVIDIA CUDA Toolkit
  • Microsoft Visual Studio 2008 or 2010 or a corresponding version of Microsoft Visual C++ Express

Verify your GPU

To verify you have a CUDA enabled GPU first check the graphics device you have installed.

  1. Open the Control Panel from the Start Menu.
    Figure 12. Control Panel

  2. Double-click the System applet to open the System Control Panel.
  3. In Windows XP, click on the Hardware tab then click the Device Manager button. In Windows 7 click the Device Manager link.
    Figure 13. System Manager

  4. In the Device Manager window that appears, expand the Display Adapters node in the device tree.
    Figure 14. Device Manager

    If your device is listed at https://developer.nvidia.com/cuda-gpus then you have a CUDA-capable GPU.

Install CUDA

Download and install the latest NVIDIA CUDA Toolkit. The CUDA Toolkit is available at https://developer.nvidia.com/cuda-downloads.

At the time of this writing, the latest version of the CUDA toolkit is CUDA 5.0 Production Release.

The CUDA Toolkit contains the drivers and tools needed to create, build and run a CUDA application, as well as libraries, header files, CUDA samples source code and other resources[3].

By default, the CUDA toolkit is installed to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v#.#, where #.# refers to the CUDA version you have installed. For the CUDA 5.0 toolkit, the complete path to the CUDA installation will be C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0.

The installation will include the following directories:

  • bin: This folder contains the CUDA compiler and runtime libraries (DLLs)
  • include: The C header files that are needed to compile your CUDA programs.
  • lib: The library files that are needed to link your CUDA programs.
  • doc: This directory contains the documentation for the CUDA Toolkit such as the CUDA C Programming Guide, the CUDA C Best Practices Guide and the documentation for the different CUDA libraries that are available in the Toolkit.

The CUDA Samples contain sample source code and projects for Visual Studio 2008 and Visual Studio 2010. On Windows XP, the samples can be found in C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\CUDA Samples\v#.# and for Windows Vista, Windows 7, and Windows Server 2008, the samples can be found at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.# where #.# is the installed CUDA version.

Verify the Installation

Before you start creating a CUDA application, it is important to verify that your CUDA installation is working correctly.

  1. Open a Command Prompt window by going to Start Menu > All Programs > Accessories > Command Prompt
    Figure 15. Command Prompt

  2. In the Command Prompt window type:
    nvcc -V

    You should see something similar to what is shown in the Command Prompt screenshot above. The output may differ slightly depending on the version of the CUDA Toolkit you installed but you should not get an error.

Run Compiled Sample

The CUDA Toolkit comes with both the source code and compiled executables for the Toolkit samples. On Windows XP the compiled samples can be found at C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\CUDA Samples\v#.#\bin\win32\Release\ and on Windows 7, Windows 8, Windows Server 2003, and Windows Server 2008 the compiled samples can be found at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.#\bin\win32\Release. On a 64-bit version of Windows, you can replace win32 with win64 to run the 64-bit version of the samples.

Try to run the deviceQuery sample in a Command Prompt window. You should see some output similar to the following image:

Figure 16. deviceQuery

Of course the output generated on your system will be different than this (unless you also have a GeForce GT 330M mobile GPU). The important thing is that your device or devices are found and the device information is displayed without any errors.

Creating your First Project

For this article, I will create a CUDA application using Microsoft Visual Studio 2010. If you are still using Microsoft Visual Studio 2008 the steps will be very similar and you should still be able to follow along.

Open your Visual Studio IDE and create a new project.

As of CUDA Toolkit 5.0, Visual Studio project templates are available that can be used to quickly create a project that is ready for creating a CUDA enabled application. Previous to CUDA Toolkit 5.0, Visual Studio project templates were only available when you installed NVIDIA Nsight Visual Studio Edition.

In the New Project dialog box, select NVIDIA > CUDA from the Installed Templates pane. In the right pane, select the CUDA 5.0 Runtime template.

Figure 17. New Project Dialog

Give your project a meaningful name such as “CUDATemplate” or something similar.

Click OK to create a new project.

This will create a new Visual Studio C++ project with a single CUDA source file called kernel.cu.

You should be able to compile and run this sample already at this point to confirm it is working. You should get the following output:

{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55}

If you got any errors or something went wrong, then you should check that you do have a CUDA enabled GPU and that you installed the CUDA Toolkit prior to installing Visual Studio 2010. Follow the steps in the previous sections again and make sure you did everything correctly.

Using the Visual Studio project template for the CUDA 5.0 Runtime will automatically configure the build settings necessary to compile a CUDA enabled application. If you want to know how to add the configuration necessary to build CUDA source files to an existing C/C++ project, then you can refer to my previous article titled Introduction to CUDA that I wrote last year. That article focuses on CUDA 4.0 using Visual Studio 2008 but the steps are almost identical for CUDA 5.0 using Visual Studio 2010.

Threading Model

The CUDA threading model describes how a kernel is executed on the GPU.

CUDA Threads

Each kernel function is executed in a grid of threads. This grid is divided into blocks also known as thread blocks and each block is further divided into threads.

Figure 18. CUDA Execution Model

In the image above we see that this example grid is divided into nine thread blocks (3×3), each thread block consists of 9 threads (3×3) for a total of 81 threads for the kernel grid.
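
The thread count for a launch is simply the number of blocks in the grid times the number of threads in each block. A host-side sketch of that arithmetic, where `Dim3` is a stand-in I made up to mirror CUDA's `dim3` so the code compiles without the toolkit:

```cpp
#include <cassert>

// Minimal host-side stand-in for CUDA's dim3 (hypothetical, for illustration).
struct Dim3 { unsigned x, y, z; };

// Number of elements covered by a set of dimensions.
unsigned count(const Dim3& d)
{
    return d.x * d.y * d.z;
}

// Total threads launched = blocks in the grid * threads in each block.
unsigned totalThreads(const Dim3& grid, const Dim3& block)
{
    return count(grid) * count(block);
}
```

For the 3×3 grid of 3×3 blocks in the figure, `totalThreads(Dim3{3, 3, 1}, Dim3{3, 3, 1})` gives the 81 threads mentioned above.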

This image only shows a 2-dimensional grid. If the graphics device supports compute capability 2.0 or higher, the grid of thread blocks can be partitioned into 1, 2 or 3 dimensions; if the device only supports compute capability 1.x, the grid can be partitioned into 1 or 2 dimensions (in which case the 3rd dimension should always be set to 1).

The thread block is partitioned into individual threads and for all compute capabilities, threads in a block can be partitioned into 1, 2, or 3 dimensions. The maximum number of threads that can be assigned to a thread block is 512 for devices with compute capability 1.x and 1024 threads for devices that support compute capability 2.0 and higher.

Table 1. Threading Compute Capability

Technical Specifications                                1.0     1.1     1.2     1.3     2.x     3.0     3.5
Maximum dimensionality of a grid of thread blocks.      2       2       2       2       3       3       3
Maximum x-dimension of a grid of thread blocks.         65535   65535   65535   65535   65535   2^31-1  2^31-1
Maximum y- or z-dimension of a grid of thread blocks.   65535   65535   65535   65535   65535   65535   65535
Maximum dimensionality of a thread block.               3       3       3       3       3       3       3
Maximum x- or y-dimension of a block.                   512     512     512     512     1024    1024    1024
Maximum z-dimension of a block.                         64      64      64      64      64      64      64
Maximum number of threads per block.                    512     512     512     512     1024    1024    1024
Warp size.                                              32      32      32      32      32      32      32
Maximum number of resident blocks per multiprocessor.   8       8       8       8       8       16      16
Maximum number of resident warps per multiprocessor.    24      24      32      32      48      64      64
Maximum number of resident threads per multiprocessor.  768     768     1024    1024    1536    2048    2048

The number of blocks within a grid can be determined within a kernel by using the built-in variable gridDim, and the number of threads within a block can be determined by using the built-in variable blockDim.

A thread block is uniquely identified in a kernel function by using the built-in variable blockIdx, and a thread within a block is uniquely identified in a kernel function by using the built-in variable threadIdx.

The built-in variables gridDim, blockDim, blockIdx, and threadIdx are each 3-component structs with members x, y, z.

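With a 1-D kernel, the unique thread ID within a block is the same as the x component of the threadIdx variable:

$$\text{threadID} = \text{threadIdx.x}$$

and the unique block ID within a grid is the same as the x component of the blockIdx variable:

$$\text{blockID} = \text{blockIdx.x}$$

To determine the unique thread ID in a 2-D block, you would use the following formula:

$$\text{threadID} = (\text{threadIdx.y} \times \text{blockDim.x}) + \text{threadIdx.x}$$

and to determine the unique block ID within a 2-D grid, you would use the following formula:

$$\text{blockID} = (\text{blockIdx.y} \times \text{gridDim.x}) + \text{blockIdx.x}$$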

I’ll leave it as an exercise for the reader to determine the formula to compute the unique thread ID and block ID in a 3D grid.
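The 2-D flattening rule described above can be checked on the host without a GPU. The following is a minimal sketch in plain C++, not CUDA code: the `Dim3` struct and the two function names are hypothetical stand-ins for the built-in variables, chosen only for this illustration.

```cpp
#include <cassert>

// Host-side stand-in for CUDA's built-in 3-component index types.
// (Hypothetical name; in a real kernel these are threadIdx, blockDim, etc.)
struct Dim3 { unsigned int x, y, z; };

// Unique thread ID within a 2-D block: rows of blockDim.x threads laid end to end.
unsigned int FlatThreadId2D( Dim3 threadIdx, Dim3 blockDim )
{
    assert( threadIdx.x < blockDim.x && threadIdx.y < blockDim.y );
    return ( threadIdx.y * blockDim.x ) + threadIdx.x;
}

// Unique block ID within a 2-D grid, using the same row-by-row layout.
unsigned int FlatBlockId2D( Dim3 blockIdx, Dim3 gridDim )
{
    assert( blockIdx.x < gridDim.x && blockIdx.y < gridDim.y );
    return ( blockIdx.y * gridDim.x ) + blockIdx.x;
}
```

For the 3×3 block from Figure 18, thread (2,1) maps to ID 5 and thread (2,2) maps to ID 8, so the nine threads receive the IDs 0 through 8 exactly once.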

Matrix Addition Example

Let’s take a look at an example kernel that one might execute.

Let’s assume we want to implement a kernel function that adds two matrices and stores the result in a 3rd.

The general formula for matrix addition is:

$$C_{ij} = A_{ij} + B_{ij}$$

That is, each component of the sum of matrix A and matrix B is the sum of the corresponding components of matrix A and matrix B.

Let’s first write the host version of this method that we would execute on the CPU.

MatrixAdd.cpp
void MatrixAddHost( float* C, float* A, float* B, unsigned int matrixDim )
{
    for( unsigned int j = 0; j < matrixDim; ++j )
    {
        for ( unsigned int i = 0; i < matrixDim; ++i )
        {
            unsigned int index = ( j * matrixDim) + i;
            C[index] = A[index] + B[index];
        }
    }
}

This is a pretty standard method that loops through the rows and columns of a matrix and adds the components and stores the results in a 3rd. Now let’s see how we might execute this kernel on the GPU using CUDA.

First, we need to think of the problem domain. In this case, the domain is trivial: it is the components of a matrix. Since we are operating on 2-D arrays, it seems reasonable to split our domain into two dimensions: one for the rows, and another for the columns of the matrices.

We will assume that we are working on square matrices. This simplifies the problem. Mathematically, matrix addition only requires that the two matrices have the same number of rows and columns; it does not require the matrices to be square.

Since we know that a thread block is limited to 512 threads with compute capability 1.x and 1024 threads with compute capability 2.x and 3.x, we can split our job into square thread blocks each consisting of 16×16 threads (256 threads per block) with compute capability 1.x and 32×32 threads (1024 threads per block) with compute capability 2.x and 3.x.

If we limit the size of our matrix to no larger than 16×16, then we only need a single block to compute the matrix sum and our kernel execution configuration might look something like this:

main.cpp
dim3 gridDim( 1, 1, 1 );
dim3 blockDim( matrixDim, matrixDim, 1 );
MatrixAddDevice<<<gridDim, blockDim>>>( C, A, B, matrixDim );

In this simple case, the kernel grid consists of only a single block with matrixDim × matrixDim threads.

However, if we want to sum matrices with more components than a single block can hold (512 threads with compute capability 1.x), then we must split our problem domain into smaller groups that can be processed in multiple blocks.

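Let’s assume that we want to limit our blocks to 16×16 (256) threads. We can determine the number of blocks required along each dimension of the grid by dividing the matrix dimension by the maximum number of threads per block dimension (16) and rounding up to the nearest whole number:

$$\text{blocks} = \left\lceil \frac{\text{matrixDim}}{16} \right\rceil$$

And we can determine the number of threads per block dimension by dividing the matrix dimension by the number of blocks and rounding up to the nearest whole number:

$$\text{threads} = \left\lceil \frac{\text{matrixDim}}{\text{blocks}} \right\rceil$$

So, for example, for a 4×4 matrix we would get:

$$\text{blocks} = \left\lceil \frac{4}{16} \right\rceil = 1$$

and the number of threads is computed as:

$$\text{threads} = \left\lceil \frac{4}{1} \right\rceil = 4$$

resulting in a 1×1 grid of 4×4 thread blocks for a total of 16 threads.

As another example, for a 512×512 matrix we would get:

$$\text{blocks} = \left\lceil \frac{512}{16} \right\rceil = 32$$

and the number of threads is computed as:

$$\text{threads} = \left\lceil \frac{512}{32} \right\rceil = 16$$

resulting in a 32×32 grid of 16×16 thread blocks for a total of 262,144 threads.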

The host code to setup the kernel granularity might look like this:

main.cpp
size_t blocks = ceilf( matrixDim / 16.0f );
dim3 gridDim( blocks, blocks, 1 );
size_t threads = ceilf( matrixDim / (float)blocks );
dim3 blockDim( threads, threads, 1 );
MatrixAddDevice<<< gridDim, blockDim >>>( C, A, B, matrixDim );

You may have noticed that if the size of the matrix does not divide evenly into blocks, then we may launch more threads than are needed to process the array. It is not possible to configure a grid of thread blocks in which one block contains fewer threads than the others. One way to solve this is to execute multiple kernels: one that handles all the equally divisible blocks, and a second kernel invocation that handles the partial block. The other solution is simply to ignore any threads that fall outside of our problem domain, which is generally easier (and likely more efficient) than invoking multiple kernels, although this should be profiled to be proven.
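To make the surplus threads concrete, here is a minimal host-side sketch of the same granularity calculation in plain C++. It uses integer ceiling division instead of `ceilf`; the `LaunchConfig` struct and both function names are assumptions made for this illustration and are not part of the CUDA API.

```cpp
#include <cassert>

// Hypothetical helper type: the grid and block edge lengths for a square launch.
struct LaunchConfig
{
    unsigned int blocksPerSide;   // blocks along x and along y
    unsigned int threadsPerSide;  // threads along x and along y within a block
};

// Integer ceiling division; avoids the round trip through float and ceilf().
unsigned int DivideRoundUp( unsigned int numerator, unsigned int denominator )
{
    assert( denominator > 0 );
    return ( numerator + denominator - 1 ) / denominator;
}

// Mirrors the host code above: at most maxThreadsPerSide threads per block edge.
LaunchConfig ComputeLaunchConfig( unsigned int matrixDim, unsigned int maxThreadsPerSide )
{
    assert( matrixDim > 0 && maxThreadsPerSide > 0 );
    LaunchConfig config;
    config.blocksPerSide  = DivideRoundUp( matrixDim, maxThreadsPerSide );
    config.threadsPerSide = DivideRoundUp( matrixDim, config.blocksPerSide );
    return config;
}
```

For a 100×100 matrix with a limit of 16 threads per side, this gives 7 blocks of 15 threads per side, or 105 threads along each dimension. The 5 extra threads per row and per column are the ones the kernel has to ignore.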

The Matrix Addition Kernel Function

On the device, one instance of the kernel function is executed for every thread in the problem domain (the matrix elements). We can use the built-in variables gridDim, blockDim, blockIdx, and threadIdx to identify the matrix element that the current thread is operating on.

If we assume we have a 9×9 matrix and we split the problem domain into 3×3 blocks each consisting of 3×3 threads as shown in the CUDA Grid below, then we could compute the ith column and the jth row of the matrix with the following formulas:

$$i = (\text{blockDim.x} \times \text{blockIdx.x}) + \text{threadIdx.x}$$

$$j = (\text{blockDim.y} \times \text{blockIdx.y}) + \text{threadIdx.y}$$

So for thread (0,0) of block (1,1) of our 9×9 matrix, we would get:

$$i = (3 \times 1) + 0 = 3$$

for the column and:

$$j = (3 \times 1) + 0 = 3$$

for the row.

The index into the 1-D buffer that stores the matrix is then computed as:

$$\text{index} = (\text{matrixDim} \times j) + i$$

and substituting gives:

$$\text{index} = (9 \times 3) + 3 = 30$$

which is the correct element in the matrix. This solution assumes we are accessing the matrix in row-major order.

Figure 19. CUDA Grid Example

Let’s see how we might implement this in the kernel.

MatrixAdd.cu
__global__ void MatrixAddDevice( float* C, float* A, float* B, unsigned int matrixDim )
{
    unsigned int column = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    unsigned int row    = ( blockDim.y * blockIdx.y ) + threadIdx.y;

    unsigned int index = ( matrixDim * row ) + column;
    if ( index < matrixDim * matrixDim ) // prevent reading/writing array out-of-bounds.
    {
        C[index] = A[index] + B[index];
    }
}

The kernel function is defined using the __global__ declaration specifier. This specifier identifies a function that is called from the host and executes on the device. Optionally, you can also mark host functions with the __host__ declaration specifier within a CUDA source file, but this is implied if no specifier is applied to the function declaration.

On lines 3 and 4, we compute the column and row of the matrix we are operating on using the formulas shown earlier.

On line 6, the 1-D index into the matrix array is computed based on the size of a single dimension of the square matrix.

We must be careful that we don’t try to read or write out of the bounds of the matrix. This might happen if the size of the matrix does not fit nicely into the size of the CUDA grid (in the case of matrices whose size is not evenly divisible by 16). To protect the read and write operations, on line 7 we check that the computed index does not exceed the size of our array.
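The effect of the guard can be tried out on the CPU. The sketch below is a plain C++ emulation, not CUDA: four nested loops stand in for the grid of blocks and the threads in each block, and the function name is hypothetical. Note that it uses a slightly stricter guard than the kernel above, testing the row and the column separately, so that a surplus thread does no work at all instead of recomputing an element that belongs to another row. That variation is my assumption for the illustration, not the kernel shown in this article.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of the matrix addition kernel. Returns the number of
// emulated threads that fell outside the matrix and were skipped.
unsigned int MatrixAddEmulated( std::vector<float>& C, const std::vector<float>& A,
                                const std::vector<float>& B, unsigned int matrixDim,
                                unsigned int blocksPerSide, unsigned int threadsPerSide )
{
    assert( blocksPerSide * threadsPerSide >= matrixDim ); // the grid must cover the matrix
    unsigned int skipped = 0;
    for ( unsigned int by = 0; by < blocksPerSide; ++by )
    for ( unsigned int bx = 0; bx < blocksPerSide; ++bx )
    for ( unsigned int ty = 0; ty < threadsPerSide; ++ty )
    for ( unsigned int tx = 0; tx < threadsPerSide; ++tx )
    {
        // Same formulas as the kernel, with threadsPerSide playing the role of blockDim.
        unsigned int column = ( threadsPerSide * bx ) + tx;
        unsigned int row    = ( threadsPerSide * by ) + ty;
        if ( row < matrixDim && column < matrixDim )
        {
            unsigned int index = ( matrixDim * row ) + column;
            C[index] = A[index] + B[index];
        }
        else
        {
            ++skipped; // this thread lies outside the problem domain
        }
    }
    return skipped;
}
```

For a 5×5 matrix covered by a 2×2 grid of 3×3 blocks, 36 threads are emulated, 25 of them do useful work, and 11 are skipped.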

Thread Synchronization

CUDA provides a synchronization barrier for all threads in a block through the __syncthreads() function. A practical example of thread synchronization will be shown in a later article about optimizing a CUDA kernel, but for now it’s only important that you know this functionality exists.

Thread synchronization is only possible across all threads in a block, not across all threads running in the grid. By not allowing threads across blocks to be synchronized, CUDA enables blocks to be executed on any streaming multiprocessor (SM) in any order. The queue of blocks can be distributed to any SM without having to wait for blocks from another SM to complete. This allows CUDA-enabled applications to scale across platforms: a device with more SMs at its disposal executes more blocks concurrently than a device with fewer SMs.

Thread synchronization follows strict rules. Either all threads in a block must hit the synchronization point, or none of them may hit it.

The following code block:

sample.cu
if ( threadID % 2 == 0 )
{
    __syncthreads();
}
else
{
    __syncthreads();
}

is likely to cause the threads in a block to wait indefinitely for each other, because the two occurrences of __syncthreads() are considered separate synchronization points, and all threads of the same block must hit the same synchronization point or none of them may hit it.

Thread Assignment

When a kernel is invoked, the CUDA runtime will distribute the blocks across the SMs on the device. With compute capability 1.x and 2.x, a maximum of 8 blocks will be assigned to each SM, and with compute capability 3.x, a maximum of 16 blocks will be assigned to each SM, as long as there are enough resources (registers, shared memory, and threads) to execute all the blocks. If there are not enough resources on the SM, the CUDA runtime will automatically assign fewer blocks per SM until the resource usage is below the maximum per SM.
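The way these limits interact can be approximated with a few lines of host code. This is a deliberately simplified model in plain C++ and not how the runtime is actually implemented: it only considers the block limit, the resident thread limit, and shared memory, and it leaves register usage out entirely. The `SmLimits` struct and the function name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits; the values come from the compute capability tables.
struct SmLimits
{
    unsigned int maxResidentBlocks;   // 8 for compute capability 1.x and 2.x, 16 for 3.x
    unsigned int maxResidentThreads;  // e.g. 1024 (1.2/1.3), 1536 (2.x), 2048 (3.x)
    unsigned int sharedMemoryBytes;   // e.g. 16 KB (1.x), 48 KB (2.x and 3.x)
};

// Simplified model: the tightest of the three limits wins. Register pressure,
// which also limits residency on real hardware, is deliberately left out.
unsigned int ResidentBlocksPerSm( const SmLimits& sm, unsigned int threadsPerBlock,
                                  unsigned int sharedBytesPerBlock )
{
    assert( threadsPerBlock > 0 );
    unsigned int blocks = std::min( sm.maxResidentBlocks,
                                    sm.maxResidentThreads / threadsPerBlock );
    if ( sharedBytesPerBlock > 0 )
    {
        blocks = std::min( blocks, sm.sharedMemoryBytes / sharedBytesPerBlock );
    }
    return blocks;
}
```

Under this model, 16×16 blocks (256 threads) on a compute capability 2.x SM are limited to 6 resident blocks by the 1,536 thread limit rather than by the 8 block limit, while 64-thread blocks on a 3.x SM reach the full 16 blocks.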

The total number of blocks that can be executed concurrently depends on the device. On a Fermi device with 16 SMs, each SM can handle 8 blocks, for a total of 128 blocks executing concurrently on the device. A Kepler device can handle 16 thread blocks per SMX; with 15 SMX units, that is a total of 240 thread blocks that can execute concurrently on a single device.

Both the Fermi and Kepler architectures support thread blocks consisting of at most 1024 threads. The Fermi device can support a maximum of 48 warps per SM. The Kepler architecture increases the number of resident warps per SMX to 64.

The Fermi device can support a maximum of 1,536 resident threads (32×48) per SM. Kepler supports 2,048 threads per SMX (32×64). With 15 SMX units, the Kepler GPU can have a total of 30,720 resident threads on the device. This does not mean that on every clock tick the device is executing 30,720 instructions simultaneously (there are only 2,880 CUDA cores on the GK110 device). In order to understand how the blocks are actually executed on the device, we must look one step further to see how the threads of a block are actually scheduled on the SMs.

Thread Scheduling

When a block is assigned to an SMX, it is further divided into groups of 32 threads called a warp. Warp scheduling is different depending on the platform, but if we take a look at the Kepler architecture, we see that a single SMX consists of 192 CUDA cores (a CUDA core is also sometimes referred to as a streaming processor, or SP for short).

Each SMX in the Kepler architecture features four warp schedulers allowing four warps to be issued and executed concurrently. Kepler’s quad-warp scheduler selects four warps and issues two independent instructions from each warp every cycle[2].

Figure 20. Warp Scheduler

You might be wondering why it would be useful to keep up to 16 blocks (and up to 2,048 threads) resident on an SMX that only has 192 CUDA cores. The answer is that each instruction of a kernel may require more than a few clock cycles to execute (for example, an instruction to read from global memory will require multiple clock cycles). Any instruction that requires multiple clock cycles to execute incurs latency. The latency of long-running instructions can be hidden by executing instructions from other warps while waiting for the result of the previous warp. This technique of filling the latency of expensive operations with work from other threads is often called latency hiding.

Thread Divergence

It is reasonable to imagine that your CUDA program contains flow-control statements like if-then-else, switch, while loops, or for loops. Whenever you introduce these flow-control statements in your code, you also introduce the possibility of thread divergence. It is important to be aware of the consequences of thread divergence and also to understand how you can minimize its negative impact.

Thread divergence occurs when some threads in a warp follow a different execution path than others. Let’s take the following code block as an example:

test.cu
__global__ void TestDivergence( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    float value = 0.0f;
    if ( threadIdx.x % 2 == 0 )
    {
        // Threads executing PathA are active while threads
        // executing PathB are inactive.
        value = PathA( src );
    }
    else
    {
        // Threads executing PathB are active while threads
        // executing PathA are inactive.
        value = PathB( src );
    }
    // Threads converge here again and execute in parallel.
    dst[index] = value;
}

Then our flow control and thread divergence would look something like this:

Figure 21. Thread Divergence

As you can see from this example, the even-numbered threads in each block will execute PathA while the odd-numbered threads in the block will execute PathB. This is pretty much the worst-case scenario for a simple divergence example.

PathA and PathB cannot be executed concurrently on all threads because their execution paths are different. Only the threads that follow the exact same execution path can run concurrently, so the total running time of the warp is the sum of the execution times of both PathA and PathB.

In this example, the threads in the warp that execute PathA are activated if the condition is true and all the other threads are deactivated. Then, in another pass, the threads that execute PathB are activated if the condition is false and the other threads are deactivated. This means that resolving this condition requires 2 passes to be executed for a single warp.

The overhead of having the warp execute both PathA and PathB can be eliminated if the programmer writes the kernel carefully. If possible, all threads of a block (since warps can’t span thread blocks) should follow the same execution path. This way you guarantee that all threads in a warp follow the same execution path and there will be no thread divergence within a block.
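Since divergence is only a cost inside a warp, a branch condition that is uniform across each group of 32 consecutive threads avoids the second pass even when different warps of the same block take different paths. The following plain C++ sketch counts how many distinct paths the threads of each warp would take for two example conditions. It runs on the host, assumes a 1-D block, and all names in it are hypothetical.

```cpp
#include <cassert>
#include <set>

const unsigned int WARP_SIZE = 32;

// Two example branch conditions, written as plain functions of the thread index.
bool BranchOnThreadParity( unsigned int threadIdxX ) { return threadIdxX % 2 == 0; }
bool BranchOnWarpParity( unsigned int threadIdxX )   { return ( threadIdxX / WARP_SIZE ) % 2 == 0; }

// Returns the largest number of distinct paths taken inside any one warp of a
// 1-D block: 1 means no warp diverges, 2 means some warp must run both paths.
unsigned int MaxPathsPerWarp( bool (*condition)( unsigned int ), unsigned int threadsPerBlock )
{
    unsigned int worst = 0;
    for ( unsigned int warpStart = 0; warpStart < threadsPerBlock; warpStart += WARP_SIZE )
    {
        std::set<bool> paths; // outcomes of the condition seen in this warp
        for ( unsigned int t = warpStart; t < warpStart + WARP_SIZE && t < threadsPerBlock; ++t )
        {
            paths.insert( condition( t ) );
        }
        if ( paths.size() > worst )
        {
            worst = static_cast<unsigned int>( paths.size() );
        }
    }
    return worst;
}
```

With 256 threads per block, the parity condition from the example above yields 2 paths in every warp, while the condition based on the warp index yields 1, even though half of the block still runs PathA and the other half PathB.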

Memory Model

There are several different types of memory that your CUDA application has access to. For each different memory type there are tradeoffs that must be considered when designing the algorithm for your CUDA kernel.

Global memory has a very large address space, but the latency to access this memory type is very high. Shared memory has a very low access latency, but its address space is small compared to global memory. In order to make proper decisions regarding where to place data and when, you must understand the differences between these memory types and how these decisions will affect the performance of your kernel.

In the next sections, I will describe the different memory types and show examples of using different memory to improve the performance of your kernel.

CUDA Memory Types

Every CUDA enabled GPU provides several different types of memory. These different types of memory each have different properties such as access latency, address space, scope, and lifetime.

The different types of memory are register, shared, local, global, and constant memory.

On devices with compute capability 1.x, there are 2 locations where memory can possibly reside: cache memory and device memory.

The cache memory is considered “on-chip” and accesses to the cache are very fast. Shared memory and cached constant memory are stored in cache memory on devices that support compute capability 1.x.

The device memory is considered “off-chip” and accesses to device memory are about 100x slower than accesses to cached memory. Global memory, local memory, and (uncached) constant memory are stored in device memory.

On devices that support compute capability 2.x, there is an additional memory bank that is stored with each streaming multiprocessor. This is considered L1 cache, and although the address space is relatively small, its access latency is very low.

Figure 22. CUDA Memory Model

In the following sections I will describe each type and when it is best to use that memory type.

Register

Scalar variables that are declared in the scope of a kernel function and are not decorated with any attribute are stored in register memory by default. Register memory access is very fast, but the number of registers that are available per block is limited.

Arrays that are declared in the kernel function are also stored in register memory, but only if the array elements are accessed using constant indexes (meaning the index used to access an element in the array is not a variable and can therefore be determined at compile time). It is currently not possible to perform random access to register variables.

Register variables are private to the thread. Threads in the same block will get private versions of each register variable. Register variables only exist as long as the thread exists. Once the thread finishes execution, a register variable cannot be accessed again. Each invocation of the kernel function must initialize the variable each time it is invoked. This might seem obvious because the scope of the variable is within the kernel function, but this is not true for all variables declared in the kernel function, as we will see with shared memory.

Variables declared in register memory can be both read and written inside the kernel. Reads and writes to register memory do not need to be synchronized.

Local

Any variable that can’t fit into the register space allowed for the kernel will spill over into local memory. Local memory has the same access latency as global memory (that is to say, slow). Accesses to local memory are cached only on GPUs with compute capability 2.x or higher[4].

Like registers, local memory is private to the thread. Each thread must initialize the contents of a variable stored in local memory before it is used. You cannot rely on another thread (even in the same block) to initialize local memory because it is private to the thread.

Variables in local memory have the lifetime of the thread. Once the thread is finished executing, the local variable is no longer accessible.

There is no attribute that you can use to explicitly place a variable in local memory, but the compiler will automatically put variable declarations in local memory under the following conditions:

  • Arrays that are accessed with run-time indexes. That is, the compiler can’t determine the indices at compile time.
  • Large structures or arrays that would consume too much register space.
  • Any variable declared that exceeds the number of registers for that kernel (this is called register-spilling).

One way to determine whether the compiler has put some function scope variables in local memory is by manual inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep option). Local variables will be declared using the .local mnemonic, loaded using the ld.local mnemonic, and stored with the st.local mnemonic.

Variables in local memory can be both read and written within the kernel and access to local memory does not need to be synchronized.

Shared

Variables that are decorated with the __shared__ attribute are stored in shared memory. Accessing shared memory is very fast (~100 times faster than global memory) although each streaming multiprocessor has a limited amount of shared memory address space.

Shared memory must be declared within the scope of the kernel function but has a lifetime of the block (as opposed to register or local memory, which have a lifetime of the thread). When a block has finished executing, the shared memory that was defined in the kernel cannot be accessed.

Shared memory can be both read from and written to within the kernel. Modification of shared memory must be synchronized unless you guarantee that each thread will only access memory that will not be read from or written to by other threads in the block. Block synchronization is achieved using the __syncthreads() barrier function inside the kernel function.

Since access to shared memory is faster than accessing global memory, it is more efficient to copy global memory to shared memory to be used within the kernel but only if the number of accesses to global memory can be reduced within the block (as we’ll see with the matrix multiply example that I will show later).

Global

Variables that are decorated with the __device__ attribute and are declared in global scope (outside of the scope of the kernel function) are stored in global memory. The access latency to global memory is very high (~100 times slower than shared memory) but there is much more global memory than shared memory (up to 6GB but the actual size is different across graphics cards even of the same compute capability).

Unlike register, local, and shared memory, global memory can be read from and written to by the host using the C function cudaMemcpy.

Global memory has a lifetime of the application and is accessible to all threads of all kernels. One must take care when reading from and writing to global memory because thread execution cannot be synchronized across different blocks. The only way to ensure access to global memory is synchronized is by using separate kernel invocations (splitting the problem into different kernels and synchronizing on the host between kernel invocations).

Global memory is allocated by the host process using cudaMalloc and freed by the host process using cudaFree. Pointers to global memory can be passed to a kernel function as parameters to the kernel (as we will see in the example later).

Reads from global memory are cached only on devices that support compute capability 2.x or higher[4], but any write to global memory will invalidate the cache, thus eliminating the benefit of the cache. Access to global memory on devices that support compute capability 1.x is not cached.

It is a bit of an art-form to reduce the number of accesses to global memory from within a kernel by using blocks of shared memory because the access latency to shared memory is about 100 times faster than accessing global memory. Later, I will show an example of how we can reduce the global memory access using shared memory.

Constant

Variables that are decorated with the __constant__ attribute are declared in constant memory. Like global variables, constant variables must be declared in global scope (outside the scope of any kernel function). Constant variables share the same memory banks as global memory (device memory) but unlike global memory, there is only a limited amount of constant memory that can be declared (64KB on all compute capabilities).

Access to constant memory is considerably faster than access to global memory because constant memory is cached, but unlike global memory, constant memory cannot be written to from within the kernel. This allows constant memory caching to work because we are guaranteed that the values in constant memory will not be changed and therefore will not become invalidated during the execution of a kernel.

Constant memory can be written to by the host process using the cudaMemcpyToSymbol function and read from using the cudaMemcpyFromSymbol function. As far as I can tell, it is not possible to dynamically allocate storage for constant memory (the size of constant memory buffers must be statically declared and determined at compile time).

Like global memory, constant memory has a lifetime of the application. It can be accessed by all threads of all kernels and the value will not change across kernel invocations unless explicitly modified by the host process.

Properties of Memory

The amount of memory that is available to the CUDA application is (in most cases) specific to the compute capability of the device. For each compute capability, the size restrictions of each type of memory (except global memory) are defined in the table below. The application programmer is encouraged to query the device properties in the application using the cudaGetDeviceProperties function.

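Table 2. Memory Compute Capability

| Technical Specifications | 1.0 | 1.1 | 1.2 | 1.3 | 2.x | 3.0 | 3.5 |
|---|---|---|---|---|---|---|---|
| Number of 32-bit registers per thread | 128 | 128 | 128 | 128 | 63 | 63 | 255 |
| Maximum amount of shared memory per SM | 16 KB | 16 KB | 16 KB | 16 KB | 48 KB | 48 KB | 48 KB |
| Amount of local memory per thread | 16 KB | 16 KB | 16 KB | 16 KB | 512 KB | 512 KB | 512 KB |
| Constant memory size | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB |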

The following table summarizes the different memory types and the properties of those types.

Table 3. Properties of Memory Types

| Memory | Located | Cached | Access | Scope | Lifetime |
|---|---|---|---|---|---|
| Register | cache | n/a | Host: None; Kernel: R/W | thread | thread |
| Local | device | 1.x: No; 2.x: Yes | Host: None; Kernel: R/W | thread | thread |
| Shared | cache | n/a | Host: None; Kernel: R/W | block | block |
| Global | device | 1.x: No; 2.x: Yes | Host: R/W; Kernel: R/W | application | application |
| Constant | device | Yes | Host: R/W; Kernel: R | application | application |

Pointers to Memory

You can use pointers to memory in a kernel but you must be aware that the pointer type does not determine where the memory is located.

For example, the following code declares a pointer to constant memory and a pointer to global memory. You should be aware that only the pointer variable is constant – not what it points to.

test.cu
__constant__ float* constPtr;
__device__ float* globalPtr;
__global__ void KernelFunction(void)
{
    // Assign the pointer to global memory to a pointer to constant memory.
    // This will not compile because the pointer is constant and you can't change
    // what a const-pointer points to in the kernel.
    constPtr = globalPtr;
    // This will compile because what the const pointer points to is not
    // necessarily const (if it is, you'll probably get a runtime error).
    *constPtr = *globalPtr;
}

Since you can’t dynamically allocate constant memory, this example would not be very useful anyway.

Be careful when using pointers like this. It is a best-practice rule to ensure that a declared pointer only points to one type of memory (so a single pointer declaration will only point to global memory and another pointer declaration will only point to shared memory).

Minimize Global Memory Access

Since access latency is much higher for global memory than it is for shared memory, it should be our objective to minimize accesses to global memory in favor of shared memory. This doesn’t mean that every access to data in global memory should first be copied into a variable in shared (or register) memory. Obviously we will not benefit from the low latency shared memory access if our algorithm only needs to make a single access to global memory. But it happens in some cases that multiple threads in the same block will all read from the same location in global memory. If this is the case, then we can speed-up our algorithm by first allowing each thread in a block to copy one part of the global memory into a shared memory buffer and then allowing all of the threads in a block to access all elements in that shared memory buffer.

To demonstrate this, I will show several flavors of the classic matrix multiply example. The first example I will show is the standard implementation of the matrix multiply using only global memory access. Then, I will show an optimized version of the algorithm that uses shared memory to reduce the number of accesses to global memory for threads of the same block.

Matrix Multiply using Global Memory

This version of the matrix multiply algorithm is the easiest to understand; however, it is also a very naive approach.

MatrixMultiply.cu
__global__ void MatrixMultiplyKernel_GlobalMem( float* C, const float* A, const float* B, unsigned int matrixDim )
{
    // Compute the row index
    unsigned int i = ( blockDim.y * blockIdx.y ) + threadIdx.y;
    // Compute the column index
    unsigned int j = ( blockDim.x * blockIdx.x ) + threadIdx.x;

    unsigned int index = ( i * matrixDim ) + j;
    float sum = 0.0f;
    for ( unsigned int k = 0; k < matrixDim; ++k )
    {
        sum += A[i * matrixDim + k] * B[k * matrixDim + j];
    }
    C[index] = sum;
}

The parameters A, B, and C all point to buffers of global memory.

The first step is to figure out which row (i) and which column (j) we are operating on for this kernel.

On line 10, we loop through all of the elements of row i of matrix A and column j of matrix B and compute the summed product of corresponding entries (the dot product of row i and column j). A visual aid of this algorithm is shown below.

Figure 23. Matrix Multiply – Global Memory

If we analyze this algorithm, we may notice that the same row elements of matrix A are being accessed for every resulting row element of matrix C, and all the column elements of matrix B are being accessed for every resulting column element of matrix C. If we say that the resulting matrix C is N x M elements, then each element of matrix A is being accessed M times and each element of matrix B is being accessed N times. That seems pretty wasteful to me.

Matrix Multiply using Shared Memory

What if we could reduce the number of times the elements of matrix A and B are accessed to just 1? Well, depending on the size of our matrix, we could just store the contents of matrix A and matrix B in shared memory buffers and then compute the resulting matrix C from those buffers instead. This might work with small matrices (remember that shared memory is local to a single block, and with compute capability 1.3 we are limited to matrices of about 22 x 22 because at most 512 threads can be assigned to a single block).

But what if we had larger matrices to multiply? If we can find a way to split the problem into “phases” then we could simply load each “phase” into shared memory, process that “phase”, then load the next “phase” and process that one until we have exhausted the entire domain.

This technique of splitting our problem domain into phases is called “tiling” named because of the way we can visualize the technique as equal sized tiles that represent our problem domain.

Figure 24. Tiles

For this particular problem, the best partitioning of the problem domain is actually the same as partitioning of the grid of threads that are used to compute the result.

If we split our grid into blocks of 16 x 16 threads (which I showed previously in the section about CUDA thread execution to be a good granularity for this problem), then we can create two buffers in shared memory that are the same size as a single thread block in our kernel grid: one that holds a “tile” of matrix A, and another that holds a “tile” of matrix B.

Let’s see how this might look:

Matrix Multiply - TilesFigure 25. Matrix Multiply – Tiles

So the idea is simple: each thread block defines a pair of shared memory buffers that are used to “cache” a “tile” of data from matrix A and matrix B. Since the “tile” is the same size as the thread block, we can just let each thread in the thread block load a single element from matrix A into one of the shared memory buffers and a single element from matrix B into the other. Using this technique, we can reduce the number of global memory accesses to matrixDim / BLOCK_SIZE per thread for each input matrix (where BLOCK_SIZE is the size of the thread block and shared memory buffer in a single dimension).
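The phase structure can be tried out on the CPU before it is written as a kernel. The following is a minimal host-side C++ sketch, not the article's implementation: the function name tiledMultiply and the constant TILE (standing in for BLOCK_SIZE, with a small value chosen only for illustration) are assumptions, and the loops over by, bx, ty and tx emulate what the block and thread indexes would do in parallel on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for BLOCK_SIZE; the value 4 is an assumption for illustration.
constexpr std::size_t TILE = 4;

// Multiplies two square row-major matrices (dim x dim) tile by tile.
// Assumes dim is evenly divisible by TILE.
std::vector<float> tiledMultiply(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::size_t dim)
{
    assert(dim % TILE == 0);
    assert(A.size() == dim * dim && B.size() == dim * dim);

    std::vector<float> C(dim * dim, 0.0f);
    float sA[TILE][TILE];  // plays the role of the shared tile of A
    float sB[TILE][TILE];  // plays the role of the shared tile of B

    for (std::size_t by = 0; by < dim / TILE; ++by)                // block row
    for (std::size_t bx = 0; bx < dim / TILE; ++bx)                // block column
    for (std::size_t phase = 0; phase < dim / TILE; ++phase)       // one tile pair per phase
    {
        // Load step: every (ty, tx) "thread" copies one element of A and one of B.
        for (std::size_t ty = 0; ty < TILE; ++ty)
        {
            for (std::size_t tx = 0; tx < TILE; ++tx)
            {
                const std::size_t row = by * TILE + ty;
                const std::size_t col = bx * TILE + tx;
                sA[ty][tx] = A[row * dim + (phase * TILE + tx)];
                sB[ty][tx] = B[(phase * TILE + ty) * dim + col];
            }
        }
        // Compute step: every "thread" adds this phase's partial dot product
        // to its element of C (a kernel would keep this in a register instead).
        for (std::size_t ty = 0; ty < TILE; ++ty)
        {
            for (std::size_t tx = 0; tx < TILE; ++tx)
            {
                float partial = 0.0f;
                for (std::size_t k = 0; k < TILE; ++k)
                {
                    partial += sA[ty][k] * sB[k][tx];
                }
                C[(by * TILE + ty) * dim + (bx * TILE + tx)] += partial;
            }
        }
    }
    return C;
}
```

Each tile element is fetched from the large arrays once per phase and then reused TILE times from the small buffers, which is the saving the tiling technique is after.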

But will this work? We only have access to 16 KB (16,384 bytes) of shared memory per streaming multiprocessor for devices of compute capability 1.x. If our BLOCK_SIZE is 16 then we need 16 x 16 = 256 floating point values (4 bytes each) per shared memory buffer. So the size in bytes of each shared memory buffer is:

16 x 16 x 4 bytes = 1,024 bytes

And we need 2 buffers, so we will need 2,048 bytes of shared memory per block. If you remember from the previous article about the CUDA thread execution model, thread blocks of size 16 x 16 will allow 4 resident blocks to be scheduled per streaming multiprocessor. So 4 blocks each requiring 2,048 bytes gives a total requirement of 8,192 bytes (8 KB) of shared memory, which is 50% of the available shared memory per streaming multiprocessor. So this tiling strategy will work.
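The same budget check can be written down as a few lines of host-side C++. This is only a sketch of the arithmetic above: the function names are hypothetical, and it assumes a 4-byte float and ignores any shared memory the runtime may reserve for itself.

```cpp
#include <cassert>
#include <cstddef>

// Shared memory used by one thread block of the tiled kernel:
// two square float tiles of blockSize x blockSize elements.
constexpr std::size_t sharedBytesPerBlock(std::size_t blockSize)
{
    return 2 * blockSize * blockSize * sizeof(float);
}

// Shared memory needed on one streaming multiprocessor when
// residentBlocks blocks are scheduled on it at the same time.
constexpr std::size_t sharedBytesPerSM(std::size_t blockSize,
                                       std::size_t residentBlocks)
{
    return residentBlocks * sharedBytesPerBlock(blockSize);
}

// True if the tiles of all resident blocks fit in smSharedBytes.
bool tilingFits(std::size_t blockSize,
                std::size_t residentBlocks,
                std::size_t smSharedBytes)
{
    assert(blockSize > 0 && residentBlocks > 0);
    return sharedBytesPerSM(blockSize, residentBlocks) <= smSharedBytes;
}
```

With a block size of 16 and 4 resident blocks this gives 2,048 bytes per block and 8,192 bytes per streaming multiprocessor, so tilingFits(16, 4, 16384) holds. Doubling the block size to 32 would need 32,768 bytes for 4 resident blocks and would no longer fit in 16 KB.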

So let’s see how we might implement this in the kernel.

MatrixMultiply.cu
#define BLOCK_SIZE 16
__global__ void MatrixMultiplyKernel_SharedMem( float* C, const float* A, const float* B, unsigned int matrixDim )
{
    unsigned int tx = threadIdx.x;
    unsigned int ty = threadIdx.y;
    unsigned int bx = blockIdx.x;
    unsigned int by = blockIdx.y;
    // Allocate shared memory to store the matrix data in tiles
    __shared__ float sA[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float sB[BLOCK_SIZE][BLOCK_SIZE];
    // Compute the column index
    unsigned int j = ( blockDim.x * bx ) + tx;
    // Compute the row index
    unsigned int i = ( blockDim.y * by ) + ty;
    unsigned int index = ( i * matrixDim ) + j;
    float sum = 0.0f;
    // Loop through the tiles of the input matrices
    // in separate phases of size BLOCK_SIZE
    for( unsigned int phase = 0; phase < matrixDim/BLOCK_SIZE; ++phase )
    {
        // Allow each thread in the block to populate the shared memory
        sA[ty][tx] = A[i * matrixDim + (phase * BLOCK_SIZE + tx)];
        sB[ty][tx] = B[(phase * BLOCK_SIZE + ty) * matrixDim + j];
        __syncthreads();
        for( unsigned int k = 0; k < BLOCK_SIZE; ++k )
        {
            sum += sA[ty][k] * sB[k][tx];
        }
        __syncthreads();
    }
    C[index] = sum;   
}

First, we store “shorthand” versions of the thread and block indexes (tx, ty, bx and by) in private thread variables (these are stored in registers).

Next, the two shared memory buffers sA and sB are declared with enough room for each thread in the thread block to store a single entry in each array.

The index of the column is then computed and stored in another register variable j, and the index of the row is computed and stored in register variable i.

After that, the 1-D index into the result matrix C is computed, and the float variable sum that will accumulate the sum of the products is initialized to zero.

The outer for loop iterates over the “tiles” (called phases here) of matrix A and matrix B. You should note that this algorithm assumes the size of the matrix is evenly divisible by the size of the thread block.
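One common way to satisfy the divisibility assumption for arbitrary sizes is to pad the matrices with zeros on the host before copying them to the device. This is a swapped-in technique, not something the original kernel does, and the helper names paddedDim and padMatrix below are hypothetical. Zero padding leaves the top-left matrixDim x matrixDim part of the product unchanged because the extra terms in every dot product are zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rounds dim up to the next multiple of blockSize.
std::size_t paddedDim(std::size_t dim, std::size_t blockSize)
{
    assert(blockSize > 0);
    return ((dim + blockSize - 1) / blockSize) * blockSize;
}

// Copies a dim x dim row-major matrix into the top-left corner of a
// zero-filled square matrix whose size is a multiple of blockSize.
std::vector<float> padMatrix(const std::vector<float>& src,
                             std::size_t dim,
                             std::size_t blockSize)
{
    assert(src.size() == dim * dim);
    const std::size_t padded = paddedDim(dim, blockSize);
    std::vector<float> dst(padded * padded, 0.0f);
    for (std::size_t row = 0; row < dim; ++row)
    {
        for (std::size_t col = 0; col < dim; ++col)
        {
            dst[row * padded + col] = src[row * dim + col];
        }
    }
    return dst;
}
```

For example, a 500 x 500 matrix with a block size of 16 would be padded to 512 x 512, and only the top-left 500 x 500 part of the result would be copied back. The alternative is to add bounds checks inside the kernel, which avoids the extra memory at the cost of more branching.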

The two assignments to sA[ty][tx] and sB[ty][tx] inside the loop are where the magic happens. Since the shared memory is accessible to every thread in the block, we can let every thread in the block copy one element from matrix A and one element from matrix B into the shared memory buffers.

Before we can access the data in the shared memory buffers, we must ensure that all threads in the entire block have had a chance to write their data. To do that we need to synchronize the execution of all the threads in the block by calling the __syncthreads() function.

Then the inner for loop over k will loop through the elements of shared memory and sum the products.

Before we leave this loop and start filling the next “tile” into shared memory, we must ensure that all threads are finished with the shared memory buffers. To do that, we must execute __syncthreads() again at the end of each phase.

This will repeat until all phases (or tiles) of the matrix have been processed.

Once all phases are complete, then the value stored in sum will contain the final result and it is written to the destination matrix C.

Running the global memory version of the matrix multiply on my laptop with a 512 x 512 matrix takes about 45 milliseconds. Running the shared memory version on the same matrix completes in about 15 milliseconds (including copying memory from host to device and copying the result back to host memory). That is roughly 3 times faster!

Resources as a Limiting Constraint

It is entirely possible to allocate more shared memory per block than 2,048 bytes, but the block scheduler will reduce the number of blocks scheduled on a streaming multiprocessor until the shared memory requirements are met. If you want to allocate all 16 KB of shared memory in a single block, then only a single block will be resident in the streaming multiprocessor at any given moment, which will reduce the occupancy of the streaming multiprocessor to 25% (for a 16 x 16 thread block on compute capability 1.3).

This reduced thread occupancy is not ideal, but it is conceivable that a single block might have this requirement. In most cases the GPU will still out-perform the CPU if the benefit of using the low-latency memory is fully realized.

This is also true for the number of registers that can be allocated per block. If a single kernel declares 32 32-bit variables that must be stored in registers and the thread block consists of 16 x 16 threads, then the maximum number of blocks that can be active in a streaming multiprocessor on a device with compute capability 1.3 is 2 because the maximum number of 32-bit registers that can be used at any moment in time is 16,384.

The number of 32-bit registers per block is 32 x 256 = 8,192, so the streaming multiprocessor can accommodate a maximum of 16,384 / 8,192 = 2 blocks.
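The register limit can be expressed as a one-line calculation. The sketch below is host-side C++ with a hypothetical function name, and it deliberately ignores the register allocation granularity and unit size, so it should be read as an upper bound rather than an exact figure.

```cpp
#include <cassert>

// Upper bound on the number of resident blocks per streaming
// multiprocessor when registers are the limiting resource.
// Ignores allocation granularity (an assumption of this sketch).
unsigned blocksLimitedByRegisters(unsigned registersPerThread,
                                  unsigned threadsPerBlock,
                                  unsigned registersPerSM)
{
    assert(registersPerThread > 0 && threadsPerBlock > 0);
    const unsigned registersPerBlock = registersPerThread * threadsPerBlock;
    return registersPerSM / registersPerBlock;  // integer division rounds down
}
```

With 32 registers per thread, 256 threads per block and 16,384 registers per streaming multiprocessor, this returns 2. Halving the register use to 16 per thread would allow 4 blocks.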

CUDA GPU Occupancy Calculator

Since version 4.1, the CUDA Toolkit comes with a tool called the CUDA GPU Occupancy Calculator. This tool is a Microsoft Excel file that can be used to compute the maximum thread occupancy of the streaming multiprocessor given a set of limiting constraints (threads per block, registers per thread, and shared memory (bytes) per block). This tool is provided in the following folder:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.X\tools

Figure 26. CUDA Occupancy Calculator

The CUDA Occupancy Calculator allows you to compute the best thread granularity for your thread blocks given a specific compute capability and resource constraints.

You can refer to the second worksheet titled “Help” to learn how to use the CUDA GPU Occupancy Calculator.

Exercises

Q1. Would the MatrixAddDevice kernel function shown in this article benefit from the use of shared memory? Explain your answer.

A1. No, it would not benefit from the use of shared memory because each matrix element is only accessed once. You would still need to read each matrix element from global memory to store it in shared memory, only to read it once more from shared memory. In this case, storing the data in shared memory will only increase the time to execute the kernel because more load/store operations will need to be performed.

Q2. In almost all of the examples shown here, I decided to use a 16×16 thread granularity for the thread blocks. Can you explain why this is a good choice for thread granularity on devices of compute capability (you can assume that register use and shared memory allocation are within the limits in each case):

  1. 1.3?
  2. 2.0?
  3. 3.0?

A2. To answer this question, let’s take a look at each individual compute capability.

a. For devices with compute capability 1.3, threads are split into groups of 32 threads called warps. The maximum number of warps/SM is 32. If we create a 16×16 thread block, then we have a total of 256 threads/block. Each block will be split into 8 warps to be scheduled on the SM. Since the maximum number of warps/SM for devices with compute capability 1.3 is 32, 4 thread blocks will be scheduled on each SM (32/8). Each SM can support up to 8 resident blocks, and 4 is still within that limit. The maximum number of resident threads is 1024, and we exactly meet this limit (4×256), so we also achieve 100% thread occupancy on the SM! So yes, a 16×16 thread block is a good choice for devices with compute capability 1.3.

b. For devices with compute capability 2.0, threads are also split into groups of 32 threads called warps. In this case, the maximum number of warps/SM is 48. Again, we have 256 threads per block which are split into 8 warps to be scheduled on the SM, so 6 thread blocks will be scheduled per SM (48/8). 6 blocks is within the 8 block limit, so we haven’t exceeded the block limit. The maximum number of resident threads is 1536, and we exactly meet this limit (6×256), so we also achieve 100% thread occupancy on the SM! So yes, a 16×16 thread block is also a good choice for devices with compute capability 2.0.

c. For devices with compute capability 3.0, the threads are also split into groups of 32 threads called warps. So again, each block will be split into 8 warps. The maximum number of warps that can be active in an SM is 64. This allows for 8 thread blocks to be scheduled per SM (64/8). This is within the limit of 16 blocks/SM and again exactly matches the maximum of 2048 threads (8×256) that can be scheduled for each SM, so we also achieve 100% thread occupancy. So yes, a 16×16 thread block is also a good choice for devices with compute capability 3.0 (and consequently this is also true for devices of compute capability 3.5).
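The reasoning in the three cases above follows the same pattern, so it can be captured in a small host-side C++ sketch. The struct SMLimits and the function names are hypothetical, the limits passed in are the ones quoted in the answers above, and the sketch assumes registers and shared memory are not limiting.

```cpp
#include <cassert>

constexpr unsigned kWarpSize = 32;  // threads per warp

// Per-SM limits for one compute capability (values supplied by the caller).
struct SMLimits {
    unsigned maxWarps;   // maximum resident warps per SM
    unsigned maxBlocks;  // maximum resident blocks per SM
};

// Number of warps one block is split into (rounded up).
unsigned warpsPerBlock(unsigned threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Resident blocks per SM: limited by the warp limit and the block limit.
unsigned residentBlocks(unsigned threadsPerBlock, const SMLimits& limits)
{
    const unsigned byWarps = limits.maxWarps / warpsPerBlock(threadsPerBlock);
    return byWarps < limits.maxBlocks ? byWarps : limits.maxBlocks;
}

// Thread occupancy: resident warps divided by the maximum resident warps.
double occupancy(unsigned threadsPerBlock, const SMLimits& limits)
{
    const unsigned warps = residentBlocks(threadsPerBlock, limits)
                         * warpsPerBlock(threadsPerBlock);
    return static_cast<double>(warps) / static_cast<double>(limits.maxWarps);
}
```

Using SMLimits{32, 8} for compute capability 1.3, SMLimits{48, 8} for 2.0 and SMLimits{64, 16} for 3.0, a 256-thread block gives 4, 6 and 8 resident blocks respectively and an occupancy of 1.0 in each case. By contrast, an 8×8 block (64 threads) on compute capability 1.3 runs into the 8 block limit and only reaches an occupancy of 0.5.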

Q3. Assuming we have thread blocks of 256 threads each, what is the maximum amount of shared memory we can use per block and still maintain 100% thread occupancy for devices of compute capability (assume the register count is not a limiting resource):

  1. 1.3?
  2. 2.0?
  3. 3.0?

A3. Again, let’s look at each compute capability in turn.

a. In the previous exercise we already established that with thread blocks of 256 threads, we will have 4 resident blocks per SM. Since devices of compute capability 1.3 have a maximum of 16 KB (16,384 bytes) of shared memory per SM, each block can use a maximum of 4,096 bytes (16,384/4) of shared memory while still maintaining 100% thread occupancy.

b. In the previous exercise we saw that we could schedule 6 blocks of 256 threads. Devices of compute capability 2.0 have a maximum of 48 KB (49,152 bytes) of shared memory per SM. This means that we can allocate a maximum of 8,192 bytes (49,152/6) of shared memory per block while still maintaining 100% thread occupancy.

c. In the previous exercise we saw that we could schedule 8 blocks of 256 threads to get 100% thread occupancy. Devices with compute capability 3.0 also have a maximum of 48 KB (49,152 bytes) of shared memory per SM. In this case, we can only allocate 6,144 bytes (49,152/8) of shared memory per block while still maintaining 100% thread occupancy.
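The three answers above share one division, sketched here in host-side C++. The function name is hypothetical, and the sketch ignores any shared memory allocation granularity, so treat the results as the upper bounds used in the answers rather than exact hardware behaviour.

```cpp
#include <cassert>

// Largest shared memory allocation per block (in bytes) that still lets
// residentBlocks blocks share one SM's shared memory.
unsigned maxSharedBytesPerBlock(unsigned sharedBytesPerSM,
                                unsigned residentBlocks)
{
    assert(residentBlocks > 0);
    return sharedBytesPerSM / residentBlocks;  // integer division rounds down
}
```

For example, maxSharedBytesPerBlock(16384, 4) gives 4,096 bytes, maxSharedBytesPerBlock(49152, 6) gives 8,192 bytes and maxSharedBytesPerBlock(49152, 8) gives 6,144 bytes.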

Q4. In the case (c) above, what would happen if we created thread blocks of 1024 threads? Would we still have 100% thread occupancy? How much shared memory could we allocate per thread block and maintain 100% thread occupancy? Explain your answer.

Q5. Answer questions (3) and (4) again but this time compute the number of registers you have available per thread while still maintaining 100% thread occupancy. In this case, you can assume that shared memory is not a limiting resource.

Hint: To answer Q5 correctly, you must also take the register allocation granularity and unit size into consideration. For compute capability 1.3, the register allocation granularity is at the block level and the register allocation unit size is 512. For compute capability 2.x register allocation granularity is at the warp level and the register allocation unit size is 64. For compute capability 3.x, the register allocation granularity is at the warp level and the register allocation unit size is 256.

References

1. NVIDIA Corporation (2012, October). CUDA C Programming Guide. (PG-02829-001_v5.0). USA. Available from: http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf. Accessed: October 2012.

2. NVIDIA Corporation (2012, October). NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110. (V1.0). USA. Available from: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. Accessed: October 2012.

3. NVIDIA Corporation (2012, October). NVIDIA CUDA Getting Started Guide For Microsoft Windows. (DU-05349-001_v5.0). USA. Available from: http://developer.download.nvidia.com/compute/cuda/5_0/rel/docs/CUDA_Getting_Started_Guide_For_Microsoft_Windows.pdf. Accessed: October 2012.

4. NVIDIA Corporation (2012, October). CUDA C Best Practices Guide. (DG-05603-001_v5.0). USA. Available from: http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf. Accessed: October 2012.

5. Kirk, David B. and Hwu, Wen-mei W. (2010). Programming Massively Parallel Processors. 1st. ed. Burlington, MA 01803, USA: Morgan Kaufmann Publishers.

Posted in Artificial Intelligence, C, Computer Hardwares, Computer Languages, CUDA, Game Development, GPU (CUDA), GPU Accelareted, Graphics Cards, Image Processing, Neural Network, PARALLEL, Research Menu

 