Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘GPU Accelareted’ Category

Databases for Multi-camera , Network Camera , E-Surveillace

Posted by Hemprasad Y. Badgujar on February 18, 2016

Multi-view, Multi-Class Dataset: pedestrians, cars and buses

This dataset consists of 23 minutes and 57 seconds of synchronized frames taken at 25fps from 6 different calibrated DV cameras.
One camera was placed about 2m high of the ground, two others where located on a first floor high, and the rest on a second floor to cover an area of 22m x 22m.
The sequence was recorded at the EPFL university campus where there is a road with a bus stop, parking slots for cars and a pedestrian crossing.


Ground truth images
Ground truth annotations


The dataset on this page has been used for our multiview object pose estimation algorithm described in the following paper:

G. Roig, X. Boix, H. Ben Shitrit and P. Fua Conditional Random Fields for Multi-Camera Object Detection, ICCV11.

Multi-camera pedestrians video

“EPFL” data set: Multi-camera Pedestrian Videos

people tracking
results, please cite one of the references below.

On this page you can download a few multi-camera sequences that we acquired for developing and testing our people detection and tracking framework. All of the sequences feature several synchronised video streams filming the same area under different angles. All cameras are located about 2 meters from the ground. All pedestrians on the sequences are members of our laboratory, so there is no privacy issue. For the Basketball sequence, we received consent from the team.

Laboratory sequences

These sequences were shot inside our laboratory by 4 cameras. Four (respectively six) people are sequentially entering the room and walking around for 2 1/2 minutes. The frame rate is 25 fps and the videos are encoded using MPEG-4 codec.

[Camera 0] [Camera 1] [Camera 2] [Camera 3]

Calibration file for the 4 people indoor sequence.

[Camera 0] [Camera 1] [Camera 2] [Camera 3]

Calibration file for the 6 people indoor sequence.

Campus sequences

These two sequences called campus were shot outside on our campus with 3 DV cameras. Up to four people are simultaneously walking in front of them. The white line on the screenshots shows the limits of the area that we defined to obtain our tracking results. The frame rate is 25 fps and the videos are encoded using Indeo 5 codec.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2]
[Seq.2, cam. 0] [Seq.2, cam. 1] [Seq.2, cam. 2]

Calibration file for the two above outdoor scenes.

Terrace sequences

The sequences below, called terrace, were shot outside our building on a terrace. Up to 7 people evolve in front of 4 DV cameras, for around 3 1/2 minutes. The frame rate is 25 fps and the videos are encoded using Indeo 5 codec.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]
[Seq.2, cam. 0] [Seq.2, cam. 1] [Seq.2, cam. 2] [Seq.1, cam. 3]

Calibration file for the terrace scene.

Passageway sequence

This sequence dubbed passageway was filmed in an underground passageway to a train station. It was acquired with 4 DV cameras at 25 fps, and is encoded with Indeo 5. It is a rather difficult sequence due to the poor lighting.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]

Calibration file for the passageway scene.

Basketball sequence

This sequence was filmed at a training session of a local basketball team. It was acquired with 4 DV cameras at 25 fps, and is encoded with Indeo 5.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]

Calibration file for the basketball scene.

Camera calibration

POM only needs a simple calibration consisting of two homographies per camera view, which project the ground plane in top view to the ground plane in camera views and to the head plane in camera views (a plane parallel to the ground plane but located 1.75 m higher). Therefore, the calibration files given above consist of 2 homographies per camera. In degenerate cases where the camera is located inside the head plane, this one will project to a horizontal line in the camera image. When this happens, we do not provide a homography for the head plane, but instead we give the height of the line in which the head plane will project. This is expressed in percentage of the image height, starting from the top.

The homographies given in the calibration files project points in the camera views to their corresponding location on the top view of the ground plane, that is

H * X_image = X_topview .

We have also computed the camera calibration using the Tsai calibration toolkit for some of our sequences. We also make them available for download. They consist of an XML file per camera view, containing the standard Tsai calibration parameters. Note that the image size used for calibration might differ from the size of the video sequences. In this case, the image coordinates obtained with the calibration should be normalized to the size of the video.

Ground truth

We have created a ground truth data for some of the video sequences presented above, by locating and identifying the people in some frames at a regular interval.

To use these ground truth files, you must rely on the same calibration with the exact same parameters that we used when generating the data. We call top view the rectangular area of the ground plane in which we perform tracking.

This area is of dimensions tv_width x tv_height and has top left coordinate (tv_origin_x, tv_origin_y). Besides, we call grid our discretization of the top view area into grid_width x grid_height cells. An example is illustrated by the figure below, in which the grid has dimensions 5 x 4.

The people’s position in the ground truth are expressed in discrete grid coordinates. In order to be projected into the images with homographies or the Tsai calibration, these grid coordinates need to be translated into top view coordinates. We provide below a simple C function that performs this translation. This function takes the following parameters:

  • pos : the person position coming from the ground truth file
  • grid_width, grid_height : the grid dimension
  • tv_origin_x, tv_origin_y : the top left corner of the top view
  • tv_width, tv_height : the top view dimension
  • tv_x, tv_y : the top view coordinates, i.e. the output of the function
  void grid_to_tv(int pos, int grid_width, int grid_height,                  float tv_origin_x, float tv_origin_y, float tv_width,                  float tv_height, float &tv_x, float &tv_y) {     tv_x = ( (pos % grid_width) + 0.5 ) * (tv_width / grid_width) + tv_origin_x;    tv_y = ( (pos / grid_width) + 0.5 ) * (tv_height / grid_height) + tv_origin_y;  }

The table below summarizes the aforementionned parameters for the ground truth files we provide. Note that the ground truth for the terrace sequence has been generated with the Tsai calibration provided in the table. You will need to use this one to get a proper bounding box alignment.

Ground Truth Grid dimensions Top view origin Top view dimensions Calibration
6-people laboratory 56 x 56 (0 , 0) 358 x 360 file
terrace, seq. 1 30 x 44 (-500 , -1,500) 7,500 x 11,000 file (Tsai)
passageway, seq. 1 40 x 99 (0 , 38.48) 155 x 381 file

The format of the ground truth file is the following:

 1 <number of frames>  <number of people>  <grid width>  <grid height>  <step size>  <first frame>  <last frame> <pos> <pos> <pos> ... <pos> <pos> <pos> ... . . .

where <number of frames> is the total number of frames, <number of people> is the number of people for which we have produced a ground truth, <grid width> and <grid height>are the ground plane grid dimensions, <step size> is the frame interval between two ground truth labels (i.e. if set to 25, then there is a label once every 25 frames), and <first frame> and <last frame> are the first and last frames for which a label has been entered.

After the header, every line represents the positions of people at a given frame. <pos> is the position of a person in the grid. It is normally a integer >= 0, but can be -1 if undefined (i.e. no label has been produced for this frame) or -2 if the person is currently out of the grid.


Multiple Object Tracking using K-Shortest Paths Optimization

Jérôme Berclaz, François Fleuret, Engin Türetken, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence
pdf | show bibtex

Multi-Camera People Tracking with a Probabilistic Occupancy Map

François Fleuret, Jérôme Berclaz, Richard Lengagne, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence
pdf | show bibtex

MuHAVi: Multicamera Human Action Video Data

including selected action sequences with

MAS: Manually Annotated Silhouette Data

for the evaluation of human action recognition methods

Figure 1. The top view of the configuration of 8 cameras used to capture the actions in the blue action zone (which is marked with white tapes on the scene floor).

camera symbol

camera name

V1 Camera_1
V2 Camera_2
V3 Camera_3
V4 Camera_4
V5 Camera_5
V6 Camera_6
V7 Camera_7
V8 Camera_8

Table 1. Camera view names appearing in the MuHAVi data folders and the corresponding symbols used in Fig. 1.


On the table below, you can click on the links to download the data (JPG images) for the corresponding action

Important: We noted that some earlier versions of that earlier versions of MS Internet Explorer could not download files over 2GB size, so we recomment to use alternative browsers such as Firefox or Chrome.

Each tar file contains 7 folders corresponding to 7 actors (Person1 to Person7) each of which contains 8 folders corresponding to 8 cameras (Camera_1 to Camera_8). Image frames corresponding to every combination of action/actor/camera are named with image frame numbers starting from 00000001.jpg for simplicity. The video frame rate is 25 frames per second and the resolution of image frames (except for Camera_8) is 720 x 576 Pixels (columns x rows). The image resolution is 704 x 576 for Camera_8.

action class

action name

C1 WalkTurnBack 2.6GB
C2 RunStop 2.5GB
C3 Punch 3.0GB
C4 Kick 3.4GB
C5 ShotGunCollapse 4.3GB
C6 PullHeavyObject 4.5GB
C7 PickupThrowObject 3.0GB
C8 WalkFall 3.9GB
C9 LookInCar 4.6GB
C10 CrawlOnKnees 3.4GB
C11 WaveArms 2.2GB
C12 DrawGraffiti 2.7GB
C13 JumpOverFence 4.4GB
C14 DrunkWalk 4.0GB
C15 ClimbLadder 2.1GB
C16 SmashObject 3.3GB
C17 JumpOverGap 2.6GB

MIT Trajectory Data Set – Multiple Camera Views


MIT trajectory data set is for the research of activity analysis in multiple single camera view using the trajectories of objects as features. Object tracking is based on background subtraction using a Adaptive Gaussian Mixture model. There are totally four camera views. Trajectories in different camera views have been synchronized. The data can be downloaded from the following link,

MIT trajectory data set

Background image


Please cite as:

X. Wang, K. Tieu and E. Grimson, Correspondence‐Free Activity Analysis and Scene Modeling in Multiple Camera Views, IEEE Transactions on Pattern Analysis and Machine Intelligence(PAMI), Vol. 32, pp. 56-71, 2010..


MIT traffic data set is for research on activity analysis and crowded scenes. It includes a traffic video sequence of 90 minutes long. It is recorded by a stationary camera. The size of the scene is 720 by 480. It is divided into 20 clips and can be downloaded from the following links.

Ground Truth

In order to evaluate the performance of human detection on this data set, ground truth of pedestrians of some sampled frames are manually labeled. It can be downloaded below. A readme file provides the instructions of how to use it.
Ground truth of pedestrians


  1. Unsupervised Activity Perception in Crowded and Complicated scenes Using Hierarchical Bayesian Models
    X. Wang, X. Ma and E. Grimson
    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 31, pp. 539-555, 2009
  2. Automatic Adaptation of a Generic Pedestrian Detector to a Specific Traffic Scene
    M. Wang and X. Wang
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011


This dataset is presented in our CVPR 2015 paper,
Linjie Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification, In Computer Vision and Pattern Recognition (CVPR), 2015. PDF

The Comprehensive Cars (CompCars) dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models. There are a total of 136,726 images capturing the entire cars and 27,618 images capturing the car parts. The full car images are labeled with bounding boxes and viewpoints. Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car. The surveillance-nature data contains 50,000 car images captured in the front view. Please refer to our paper for the details.

The dataset is well prepared for the following computer vision tasks:

  • Fine-grained classification
  • Attribute prediction
  • Car model verification

The train/test subsets of these tasks introduced in our paper are included in the dataset. Researchers are also welcome to utilize it for any other tasks such as image ranking, multi-task learning, and 3D reconstruction.


  1. You need to complete the release agreement form to download the dataset. Please see below.
  2. The CompCars database is available for non-commercial research purposes only.
  3. All images of the CompCars database are obtained from the Internet which are not property of MMLAB, The Chinese University of Hong Kong. The MMLAB is not responsible for the content nor the meaning of these images.
  4. You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the images and any portion of derived data.
  5. You agree not to further copy, publish or distribute any portion of the CompCars database. Except, for internal use at a single site within the same organization it is allowed to make copies of the database.
  6. The MMLAB reserves the right to terminate your access to the database at any time.
  7. All submitted papers or any publicly available text using the CompCars database must cite the following paper:
    Linjie Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification, In Computer Vision and Pattern Recognition (CVPR), 2015.

Download instructions

Download the CompCars dataset Release Agreement, read it carefully, and complete it appropriately. Note that the agreement should be signed by a full-time staff member (that is, student is not acceptable). Then, please scan the signed agreement and send it to Mr. Linjie Yang (yl012(at) and cc to Chen Change Loy (ccloy(at) We will verify your request and contact you on how to download the database.

Stanford Cars Dataset


       The Cars dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, where each class has been split roughly in a 50-50 split. Classes are typically at the level of Make, Model, Year, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe.


       Training images can be downloaded here.
Testing images can be downloaded here.
A devkit, including class labels for training images and bounding boxes for all images, can be downloaded here.
If you’re interested in the BMW-10 dataset, you can get that here.

Update: For ease of development, a tar of all images is available here and all bounding boxes and labels for both training and test are available here. If you were using the evaluation server before (which is still running), you can use test annotations here to evaluate yourself without using the server.


       An evaluation server has been set up here. Instructions for the submission format are included in the devkit. This dataset was featured as part of FGComp 2013, and competition results are directly comparable to results obtained from evaluating on images here.


       If you use this dataset, please cite the following paper:

3D Object Representations for Fine-Grained Categorization
Jonathan Krause, Michael Stark, Jia Deng, Li Fei-Fei
4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13). Sydney, Australia. Dec. 8, 2013.
[pdf]   [BibTex]   [slides]

Note that the dataset, as released, has 196 categories, one less than in the paper, as it has been cleaned up slightly since publication. Numbers should be more or less comparable, though.

The HDA dataset is a multi-camera high-resolution image sequence dataset for research on high-definition surveillance. 18 cameras (including VGA, HD and Full HD resolution) were recorded simultaneously during 30 minutes in a typical indoor office scenario at a busy hour (lunch time) involving more than 80 persons. In the current release (v1.1), 13 cameras have been fully labeled.


The venue spans three floors of the Institute for Systems and Robotics (ISR-Lisbon) facilities. The following pictures show the placement of the cameras. The 18 recorded cameras are identified with a small red circle. The 13 cameras with a coloured view field have been fully labeled in the current release (v1.1).


Each frame is labeled with the bounding boxes tightly adjusted to the visible body of the persons, the unique identification of each person, and flag bits indicating occlusion and crowd:

  • The bounding box is drawn so that it completely and tightly encloses the person.
  • If the person is occluded by something (except image boundaries), the bounding box is drawn by estimating the whole body extent.
  • People partially outside the image boundaries have their BB’s cropped to image limits. Partially occluded people and people partially outside the image boundaries are marked as ‘occluded’.
  • A unique ID is associated to each person, e.g., ‘person01’. In case of identity doubt, the special ID ‘personUnk’ is used.
  • Groups of people that are impossible to label individually are labelled collectively as ‘crowd’. People in front of a ’crowd’ area are labeled normally.

The following figures show examples of labeled frames: (a) an unoccluded person; (b) two occluded people; (c) a crowd with three people in front.


Data formats:

For each camera we provide the .jpg frames sequentially numbered and a .txt file containing the annotations according to the “video bounding box” (vbb) format defined in the Caltech Pedestrian Detection Database. Also on this site there are tools to visualise the annotations overlapped on the image frames.


Some statistics:

Labeled Sequences: 13

Number of Frames: 75207

Number of Bounding Boxes: 64028

Number of Persons: 85


Repository of Results:

We maintain a public repository of re-identification results in this dataset. Send us your CMC curve to be uploaded  (alex at isr ist utl pt).
Click here to see the full list and detailed experiments.

MANUAL_c_l_e_a_n cam60

Posted in Computer Network & Security, Computer Research, Computer Vision, Image Processing, Multimedia | Leave a Comment »

Bilateral Filtering

Posted by Hemprasad Y. Badgujar on September 14, 2015

Popular Filters

When smoothing or blurring images (the most popular goal of smoothing is to reduce noise), we can use diverse linear filters, because linear filters are easy to achieve, and are kind of fast, the most used ones are Homogeneous filter, Gaussian filter, Median filter, et al.

When performing a linear filter, we do nothing but output pixel’s value g(i,j)  which is determined as a weighted sum of input pixel values f(i+k, j+l):

g(i, j)=SUM[f(i+k, j+l) h(k, l)];

in which, h(k, l)) is called the kernel, which is nothing more than the coefficients of the filter.

Homogeneous filter is the most simple filter, each output pixel is the mean of its kernel neighbors ( all of them contribute with equal weights), and its kernel K looks like:


 Gaussian filter is nothing but using different-weight-kernel, in both x and y direction, pixels located in the middle would have bigger weight, and the weights decrease with distance from the neighborhood center, so pixels located on sides have smaller weight, its kernel K is something like (when kernel is 5*5):


Median filter is something that replace each pixel’s value with the median of its neighboring pixels. This method is great when dealing with “salt and pepper noise“.

Bilateral Filter

By using all the three above filters to smooth image, we not only dissolve noise, but also smooth edges, which make edges less sharper, even disappear. To solve this problem, we can use a filter called bilateral filter, which is an advanced version of Gaussian filter, it introduces another weight that represents how two pixels can be close (or similar) to one another in value, and by considering both weights in image,  Bilateral filter can keep edges sharp while blurring image.

Let me show you the process by using this image which have sharp edge.



Say we are smoothing this image (we can see noise in the image), and now we are dealing with the pixel at middle of the blue rect.

22   23

Left-above picture is a Gaussian kernel, and right-above picture is Bilateral filter kernel, which considered both weight.

We can also see the difference between Gaussian filter and Bilateral filter by these pictures:

Say we have an original image with noise like this



By using Gaussian filter, the image is smoother than before, but we can see the edge is no longer sharp, a slope appeared between white and black pixels.



However, by using Bilateral filter, the image is smoother, the edge is sharp, as well.


OpenCV code

It is super easy to make these kind of filters in OpenCV:

1 //Homogeneous blur:
2 blur(image, dstHomo, Size(kernel_length, kernel_length), Point(-1,-1));
3 //Gaussian blur:
4 GaussianBlur(image, dstGaus, Size(kernel_length, kernel_length), 0, 0);
5 //Median blur:
6 medianBlur(image, dstMed, kernel_length);
7 //Bilateral blur:
8 bilateralFilter(image, dstBila, kernel_length, kernel_length*2, kernel_length/2);

and for each function, you can find more details in OpenCV Documentation

Test Images

Glad to use my favorite Van Gogh image :



From left to right: Homogeneous blur, Gaussian blur, Median blur, Bilateral blur.

(click iamge to view full size version :p )

kernel length = 3:

homo3 Gaussian3 Median3 Bilateral3

kernel length = 9:

homo9 Gaussian9 Median9 Bilateral9
kernel length = 15:

homo15 Gaussian15 Median15 Bilateral15

kernel length = 23:

homo23 Gaussian23 Median23 Bilateral23
kernel length = 31:

homo31 Gaussian31 Median31 Bilateral31
kernel length = 49:

homo49 Gaussian49 Median49 Bilateral49
kernel length = 99:

homo99 Gaussian99 Median99 Bilateral99

Trackback URL.

Posted in C, Image / Video Filters, Image Processing, OpenCV, OpenCV, OpenCV Tutorial | Leave a Comment »

How to run CUDA 6.5 in Emulation Mode

Posted by Hemprasad Y. Badgujar on December 20, 2014

How to run CUDA in Emulation Mode

Some beginners feel a little bit dejected when they find that their systems donotcontainGPUs to learn andworkwithCUDA. In this blog post, I shall include the step by step process of installingandexecutingCUDA programs in emulation mode on a system with no GPU installed in it.It is mentioned here thatyouwill not be able to gain any performance advantage expected out of a GPU (obviously). Instead, the performance will be worse than a CPU implementation. However, emulation mode provides an excellent tool to compile and debugyourCUDA codes for more advanced purposes.Please note that I performed the following steps for a Dell Xeon with Windows 7 (32-bit) system.1. Acquire and install Microsoft Visual Studio 2008 on your system.

2. Access the CUDA Toolkit Archives  page and select CUDA Toolkit 6.0 / 6.5 version. (It is the last version that came with emulation mode. Emulation mode was discontinued in later versions.)

3. Download and install the following on your machine:-

  • Developer Drivers for Win8/win7 X64  – (Select the one as required for your machine.)
  • CUDA Toolkit
  • CUDA SDK Code Samples
  • CUBLAS and CUFFT (If required)

4. The next step is to check whether the sample codes run properly on the system or not. This will ensure that there is nothing missing from the required installations. Browse the nVIDIA GPU Computing SDK using the windows start bar or by using the following path in your My Computer address bar:-
As per your working Platform
“C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Release”
“C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win64\Release”

(Also note that the ProgramData folder is by default set to “Hidden” attribute. It will be good if you unhide theis folder as it will be frequently utilized later on as you progress with your CUDA learning spells.)

5. Run the “deviceQuery” program and it should output something similar as shown in Fig. 1. Upon visual inspection of the output data, it can be seen that “there is no GPU device found” however the test has PASSED. This means that all the required installations for CUDA in emulation mode has been completed and now we can proceed with writing, compiling and executing CUDA programs in emulation mode.

Figure 1. Successful Rxecution of deviceQuery.exe ** Demo Example only

6. Open Visual Studio and create a new Win32 console project. Let’s name it “HelloCUDAEmuWorld”. Remember to select the “EMPTY PROJECT” option in Application Settings. Now Right Click on “Source Files” in the project tree and add new C++ code item. Remember to include the extension “.cu” instead of “.cpp”. Let’s name this item as “”. (If you forget the file extension, it can always be renamed via the project tree on the left).

7. Include the CUDA include, lib and  bin paths to MS Visual Studio. They were located at “C:\CUDA” in my system.

The next steps need to be performed for every new CUDA project when created.

8. Right Click on the project and select Custom Build Rules. Check the Custom Build Rules v6.0.0 option if available. Otherwise, click on Find Existing… and navigate to “C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common” and select Cuda.rules. This will add the build rules for CUDA v6.0to VS 2012.

9. Right click on the project and select Properties. Navigate to Configuration Properties –> Linker –> Input. Type in cudart.lib in the Additional Dependencies text bar and click Okay. Now we are ready to compile and run our first ever CUDA program in emulation mode. But first we need to activate the emulation  mode for .cu files.

10. Once again  Right click on the project and select Properties. Navigate to Configuration Properties –> CUDA Build Rule v6.0.0 –> General. Set Emulation Mode from No to Yes in the right hand column of the opened window. Click Okay.

11. Type in the following in the code editor and build and compile the project. And there it is. Your first ever CUDA program, in Emulation Mode. Something to brag about among friends.

int main(void)
return 0;

I hope this effort would not go in vain and offer some help to anyone whois tied upregarding this issue. Do contact if there is any queryregarding the above procedure.Source (

Posted in Computer Vision, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, Image / Video Filters, My Research Related, OpenCV, PARALLEL, Project Related | Leave a Comment »

Running CUDA Code Natively on x86 Processors

Posted by Hemprasad Y. Badgujar on December 20, 2014

1 Try : CUDA Development without GPU

If you want to run the code on your machine but you don’t have a GPU? Or maybe you want to try things out before firing up your AWS instance? Here I show you a way to run the CUDA code without a GPU.

Note: this only works on Linux, maybe there are other alternatives for Mac or Windows.

Ocelot lets you run CUDA programs on NVIDIA GPUs, AMD GPUs and x86-CPUs without recompilation. Here we’ll take advantage of the latter to run our code using our CPU.


You’ll need to install the following packages:

  • C++ Compiler (GCC)
  • Lex Lexer Generator (Flex)
  • YACC Parser Generator (Bison)
  • SCons

And these libraries:

  • boost_system
  • boost_filesystem
  • boost_serialization
  • GLEW (optional for GL interop)
  • GL (for NVIDIA GPU Devices)

With Arch Linux, this should go something like this:

pacman -S gcc flex bison scons boost glew

On Ubuntu it should be similar (sudo apt-get install flex bison g++ scons libboost-all-dev). If you don’t know the name of a package, search for it with ‘apt-cache search package_name’.

You should probably install LLVM too, it’s not mandatory, but I think it runs faster with LLVM.

pacman -S llvm clang

And of course you’ll need to install CUDA and the OpenCL headers. You can do it manually or using your distro’s package manager (for ubuntu I belive the package is called nvidia-cuda-toolkit):

pacman -S cuda libcl opencl-nvidia

One last dependency is Hydrazine. Fetch the source code:

svn checkout hydrazine

Or if you’re like me and prefer Git:

git svn clone -s hydrazine

And install it like this (you might need to install automake if you don’t have it already):

cd hydrazine
automake --add-missing
sudo make install


Now we can finally install Ocelot. This is where it gets a bit messy. Fetch the Ocelot source code:

svn checkout gpuocelot

Or with Git (warning, this will take a while, the whole repo is about 1.9 GB):

git svn clone -s gpuocelot

Now go to the ocelot directory:

cd gpuocelot/ocelot

And install Ocelot with:

sudo ./ --install


Sadly, the last command probably failed. This is how I fixed the problems.

Hydrazine headers not found

You could fix this adding an include flag. I just added a logical link to the hydrazine code we downloaded previously:

ln -s /path/to/hydrazine/hydrazine

Make sure you link to the second hydrazine directory (inside this directory you’ll find directories like implementation and interface). You should do this in the ocelot directory where you’re running the script (gpuocelot/ocelot).

LLVM header file not found

For any error that looks like this:

llvm/Target/TargetData.h: No such file or directory

Just edit the source code and replace it with this header:


The LLVM project moved the file.

LLVM IR folder “missing”

Similarly, files referenced by Ocelot from the “IR” package were moved (LLVM 3.2-5 on Arch Linux). If you get an error about LLVM/IR/LLVMContext.h missing, edit the following files:


and replace the includes at the top of each file for LLVM/IR/LLVMContext.h and LLVM/IR/Module.h with LLVM/LLVMContext.h and LLVM/Module.h, respectively.

PTXLexer errors

The next problem I ran into was:

.release_build/ocelot/ptxgrammar.hpp:351:14:error:'PTXLexer' is not a member of 'parser'

Go ahead, open the ‘.release_build/ocelot/ptxgrammar.hpp’ file and just comment line 355:

/* int yyparse (parser::PTXLexer& lexer, parser::PTXParser::State& state); */

That should fix the error.

boost libraries not found

On up-to-date Arch Linux boxes, it will complain about not finding boost libraries ‘boost_system-mt’, ‘boost_filesystem-mt’, ‘boost_thread-mt’.

I had to edit two files:

  • scripts/
  • SConscript

And just remove the trailing -mt from the library names:

  • boost_system
  • boost_filesystem
  • boost_thread

Finish the installation

After those fixes everything should work.

Whew! That wasn’t fun. Hopefully with the help of this guide it won’t be too painful.

To finish the installation, run:

sudo ldconfig

And you can check the library was installed correctly running:

OcelotConfig -l

It should return -locelot. If it didn’t, check your LD_LIBRARY_PATH. On my machine, Ocelot was installed under /usr/local/lib so I just added this to my LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Here’s the link to the installation instructions.

Running the code with Ocelot

We’re finally ready enjoy the fruits of our hard work. We need to do two things:

Ocelot configuration file

Add a file called configure.ocelot to your project (in the same directory as our Makefile and files), and copy this:

    ocelot: "ocelot",
    trace: {
        database: "traces/database.trace",
        memoryChecker: {
            enabled: false,
            checkInitialization: false
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: false
        debugger: {
            enabled: false,
            kernelFilter: "",
            alwaysAttach: true
    cuda: {
        implementation: "CudaRuntime",
        tracePath: "trace/CudaAPI.trace"
    executive: {
        devices: [llvm],
        preferredISA: nvidia,
        optimizationLevel: full,
        defaultDeviceID: 0,
        asynchronousKernelLaunch: True,
        port: 2011,
        host: "",
        workerThreadLimit: 8,
        warpSize: 16
    optimizations: {
        subkernelSize: 10000,

You can check this guide for more information about these settings.

Compile with the Ocelot library

And lastly, a small change to our Makefile. Append this to the GCC_OPTS:

GCC_OPTS=-O3 -Wall -Wextra -m64 `OcelotConfig -l`

And change the student target so it uses g++ and not nvcc:

student: compare main.o student_func.o Makefile
    g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GCC_OPTS)

I just replaced ‘nvcc’ with ‘g++’ and ‘NVCC_OPTS’ with ‘GCC_OPTS’.

make clean

And that’s it!

I forked the github repo and added these changes in case you want to take a look.

I found this guide helpful, it might have some additional details for installing things under ubuntu and/or manually.

Note for debian users

I successfully installed ocelot under debian squeeze, following the above steps, except that I needed to download llvm from upstream, as indicated in the above guide for ubuntu.

Other than that, after fixing some includes as indicated (Replacing ‘TargetData.h’ by ‘IR/DataLayout.h’, or adding ‘/IR/’ to some includes), it just compiled.

To build the student project, I needed to replace -m64 by -m32 to fit my architecture, and to make the other indicated changes.

Here are my makefile diffs:

$ git diff Makefile
diff --git a/HW1/student/Makefile b/HW1/student/Makefile
index b6df3a4..55480af 100755
--- a/HW1/student/Makefile
+++ b/HW1/student/Makefile
@@ -22,7 +22,8 @@ OPENCV_INCLUDEPATH=/usr/include

 OPENCV_LIBS=-lopencv_core -lopencv_imgproc -lopencv_highgui


 # On Macs the default install locations are below    #
@@ -36,12 +37,12 @@ CUDA_INCLUDEPATH=/usr/local/cuda-5.0/include

-NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m64
+NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m32

-GCC_OPTS=-O3 -Wall -Wextra -m64
+GCC_OPTS=-O3 -Wall -Wextra -m32 `OcelotConfig -l` -I /usr/include/i386-linux-gn

 student: compare main.o student_func.o Makefile
-       $(NVCC) -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) 
+       g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GC

 main.o: main.cpp timer.h utils.h HW1.cpp
        g++ -c main.cpp $(GCC_OPTS) -I $(CUDA_INCLUDEPATH) -I $(OPENCV_LIBPATH)

I’m using cuda toolkit 4.2.

I don’t know why, but it was necessary to add /usr/lib/gcc/i486-linux-gnu/4.4 to the PATH for nvcc to work:

export PATH=$PATH:/usr/lib/gcc/i486-linux-gnu/4.4

Eclipse CUDA plugin

This is probably for another entry, but I used this guide to integrate CUDA into Eclipse Indigo.

The plugin is University of Bayreuth’s Eclipse Toolchain for CUDA compiler

2 Try :Running CUDA Code Natively on x86 Processors

We  focused on Fermi and the architectural changes that significantly broadened the types of applications that map well to GPGPU computing yet preserve the application performance of software written for previous generations of CUDA-enabled GPUs. This article addresses the mindset that CUDA is a language for only GPU-based applications.

Recent developments allow CUDA programs to transparently compile and run at full speed on x86 architectures. This advance makes CUDA a viable programming model for all application development, just like OpenMP. The PGI CUDA C/C++ compiler for x86 (from the Portland Group Inc.) is the reason for this recent change in mindset. It is the first native CUDA compiler that can transparently create a binary that will run on an x86 processor. No GPU is required. As a result, programmers now have the ability to use a single source tree of CUDA code to reach those customers who own CUDA-enabled GPUs as or who use x86-based systems.

Figure 1 illustrates the options and target platforms that are currently available to build and run CUDA applications. The various products are discussed next.

Figure 1: The various options for compiling and running a CUDA program.

Aside from the new CUDA-x86 compiler, the other products require developer or customer intervention to run CUDA on multiple backends. For example:

  • nvcc: The freely downloadable nvcc compiler from NVIDIA creates both host and device code. With the use of the __device__ and __host__ specifiers, a developer can use C++ Thrust functions to run on both host and CUDA-enabled devices. This x86 pathway is represented by the dotted line in Figure 1, as the programmer must explicitly specify use of the host processor. In addition, developers must explicitly check whether a GPU is present and use this information to select the memory space in which the data will reside (that is, GPU or host). The Thrust API also allows CUDA codes to be transparently compiled to run on different backends. The Thrust documentation shows how to use OpenMP to run a Monte Carlo simulation on x86. Note that Thrust is not optimized to create efficient OpenMP code.
  • gpuocelot provides a dynamic compilation framework to run CUDA binaries on various backends such as x86, AMD GPUs, and an x86-based PTX emulator. The emulator alone is a valuable tool for finding hot spots and bottlenecks in CUDA codes. The gpuocelot website claims that it “allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs at full speed without recompilation.” I recommend this project even though it is challenging to use. As it matures, Ocelot will provide a pathway for customers to run CUDA binaries on various backends.
  • MCUDA is an academic project that translates CUDA to C. It is not currently maintained, but the papers are interesting reading. A follow-up project (FCUDA) provides a CUDA to FPGA translation capability.
  • SWAN provides a CUDA-to-OpenCL translation capability. The authors note that Swan is “not a drop in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls rewritten.” Still, it is an interesting project to bridge the gap between CUDA and OpenCL.

The CUDA-x86 compiler is the first to provide a seamless pathway to create a multi-platform application.

Why It Matters

Using CUDA for all application development may seem like a radical concept to many readers, but in fact, it is the natural extension of the emerging CPU/GPU paradigm of high-speed computing. One of the key benefits of CUDA is that it uses C/C++ and can be adopted easily and it runs on 300+ million GPUs and now all x86 chips. If this still feels like an edgy practice, this video presentation might be helpful.

CUDA works well now at its principal task — massively parallel computation — as demonstrated by the variety and number of projects that achieve 100x or greater performance in the NVIDIA showcase. See Figure 2.

Figure 2: All top 100 CUDA apps attain speedups in excess of 100x.

PGI CUDA-x86: CUDA Programming for Multi-core CPUs


The NVIDIA CUDA architecture was developed to enable offloading of compute-intensive kernels to GPUs. Through API function calls and language extensions, CUDA gives developers control over mapping of general-purpose compute kernels to GPUs, and over placement and movement of data between host memory and GPU memory. CUDA is supported on x86 and x64 (64-bit x86) systems running Linux, Windows or MacOS and that include an NVIDIA CUDA-enabled GPU. First introduced in 2007, CUDA is the most popular GPGPU parallel programming model with an estimated user-base of over 100,000 developers worldwide.

Let’s review the hardware around which the CUDA programming model was designed. Figure 1 below shows an abstraction of a multi-core x64+GPU platform focused on computing, with the graphics functionality stripped out. The key to the performance potential of the NVIDIA GPU is the large number of thread processors, up to 512 of them in a Fermi-class GPU. They’re organized into up to 16 multi-processors, each of which has 32 thread processors. Each thread processor has registers along with integer and floating point functional units; the thread processors within a multiprocessor run in SIMD mode. Fermi peak single-precision performance is about 1.4 TFLOPS and peak double-precision is about 550 GFLOPS.

Fermi Block Diagram

Figure 1: NVIDIA Fermi-class GPU Accelerator

The GPU has a large (up to 6GB) high bandwidth long latency device main memory. Each multi-processor has a small 64KB local shared memory that functions as both a hardware data cache and a software-managed data cache, and has a large register file.

The GPU has two levels of parallelism, SIMD within a multiprocessor, and parallel across multiprocessors. In addition, there is another very important level of concurrency: the thread processors support extremely fast multithread context switching to tolerate the long latency to device main memory. If a given thread stalls waiting for a device memory access, it is swapped out and another ready thread is swapped in and starts executing within a few cycles.

What kind of algorithms run well on this architecture?

  • Massive parallelism—is needed to effectively use hundreds of thread processors and provide enough slack parallelism for the fast multi-threading to effectively tolerate device memory latency and maximize device memory bandwidth utilization.
  • Regular parallelism—is needed for GPU hardware and firmware that is optimized for the regular parallelism found in graphics kernels; these correspond roughly to rectangular iteration spaces (think tightly nested loops).
  • Limited synchronization—thread processors within a multi-processor can synchronize quickly enough to enable coordinated vector operations like reductions, but there is virtually no ability to synchronize across multi-processors.
  • Locality—is needed to enable use of the hardware or user-managed data caches to minimize accesses to device memory.

This sounds a lot like a nest of parallel loops. So, NVIDIA defined the CUDA programming model to enable efficient mapping of general-purpose compute-intensive loop nests onto the GPU hardware. Specifically, a 1K x 1K matrix multiply loop that looks as follows on the host:

for (i = 0; i < 1024; ++i)
   for (k = 0; k < 1024; ++k)
      for (j = 0; j < 1024; ++j)
         c[i][j] =+= a[i][k]*b[k][j]; 

can be rewritten in its most basic form in CUDA C as:

cudaMalloc( &ap, memsizeA );
cudaMemcpy( ap, a, memsizeA, cudaMemcpyHostToDevice );
c_mmul_kernel <<<(64,64),(16,16)>>>(ap, bp, cp, 1024);
cudaMemcpy( c, cp, memsizeC, cudaMemcpyDeviceToHost );
__global__ void c_mmul_kernel(float* a, float* b, float* c, n)
   int i = blockIdx.y*16+threadIdx.y;
   int j = blockIdx.x*16+threadIdx.x;
   for( int k = 0; k < n; ++k )_
      c[n*i+j] += a[n*i+k] * b[n*k+j];

The triply-nested matrix multiply loop becomes a single dot-product loop, split out to a self-contained kernel function. The two outer loops are abstracted away in the launch of the kernel on the GPU. Conceptually, the over one million 1024-length dot-products it takes to perform the matrix multiply are all launched simultaneously on the GPU. The CUDA programmer structures fine-grain parallel tasks, in this case dot-product operations, as CUDA threads, organizes the threads into rectangular thread blocks with 32 to 1024 threads each, and organizes the thread-blocks into a rectangular grid. Each thread-block is assigned to a CUDA GPU multi-processor, and the threads within a thread-block are executed by the thread-processors within that multiprocessor.

The programmer also manages the memory hierarchy on the GPU, moving data from the host to device memory, from variables in device memory to variables in shared memory, or to variables that the user intends to be assigned to registers.

PGI CUDA C/C++ for Multi-core x64

The PGI CUDA C/C++ compiler for multi-core x64 platforms will allow developers to compile and optimize CUDA applications to run on x64-based workstations, servers and clusters with or without an NVIDIA GPU accelerator. Is it possible to compile CUDA C efficiently for multi-core processors? CUDA C is simply a parallel programming model and language. While it was designed with the structure required for efficient GPU programming, it also can be compiled for efficient execution on multi-core x64.

Looking at a multicore x64 CPU, we see features very like what we have on the NVIDIA GPU. We have MIMD parallelism across the cores, typically 4 cores but we know there are up to 12 on some chips today and up to 48 on a single motherboard. We have SIMD parallelism in the AVX or SSE instructions. So it’s the same set of features, excepting that CPUs are optimized with deep cache memory hierarchies for memory latency, whereas the GPU is optimized for memory bandwidth. Mapping the CUDA parallelism onto the CPU parallelism seems straightforward from basic principles.

Consider the process the CUDA programmer uses to convert existing serial or parallel programs to CUDA C, as outlined above. Many aspects of this process can simply be reversed by the compiler:

  • Reconstitute parallel/vector loop nests from the CUDA C chevron syntax
  • Where possible, remove or replace programmer-inserted __syncthreads() calls by appropriate mechanisms on the CPU

In effect, the PGI CUDA C/C++ compiler will process CUDA C as a native parallel programming language for mapping to multi-core x64 CPUs. CUDA thread blocks will be mapped to processor cores to effect multi-core execution, and CUDA thread-level parallelism will be mapped to the SSE or AVX SIMD units as shown in Figure 2 below. All existing PGI x64 optimizations for Intel and AMD CPUs will be applied to CUDA C/C++ host code—SIMD/AVX vectorization, inter-procedural analysis and optimizations, auto-parallelization for multi-core, OpenMP extensions support, etc.

Multi-core Mapping

Figure 2: Mapping CUDA to GPUs versus Multi-core CPUs

Initially, PGI CUDA C/C++ will target the CUDA 3.1 runtime API. There are no current plans to implement the CUDA driver API. The definition of warpSize may be changed (probably to 1 in optimizing versions of the compiler); correctly implementing warp-synchronous programming would either require implicit synchronization after each memory access, or would require the compiler to prove that such synchronization is not required. It’s much more natural to require programmers to use the value of warpSize to determine how many threads are running in SIMD mode.

What kind of performance can you expect from CUDA C programs running on multi-core CPUs? There are many determining factors. Typical CUDA C programs perform many explicit operations and optimizations that are not necessary when programming multi-core CPUs using OpenMP or threads-based programming:

  • Explicit movement of data from host main memory to CUDA device memory
  • Data copies from arrays in CUDA device memory to temporary arrays in multi-processor shared memory
  • Synchronization of SIMT thread processors to ensure shared memory coherency
  • Manual unrolling of loops

In many cases, the PGI CUDA C compiler will remove explicit synchronization of the thread processors if it can determine it’s safe to split loops in which synchronization calls occur. Manual unrolling of loops will not typically hurt performance on x64, and may help in some cases. However, explicit movement of data from host memory to “device” copies will still occur, and explicit movement of data from device copies to temporary arrays in shared memory will still occur; these operations are pure overhead on a multi-core processor.

It will be easy to write CUDA programs that run really well on the GPU and don’t run so well on a CPU. We can’t guarantee high performance, if you’ve gone and tightly hand-tuned your kernel code. As with OpenCL, we’re making the language portable, and many programs will port and run well; but there is no guarantee of general performance portability.

PGI Unified Binary for Multi-core x64 and NVIDIA GPUs

In later releases, in addition to multi-core execution, the PGI CUDA C/C++ compiler will support execution of device kernels on NVIDIA CUDA-enabled GPUs. PGI Unified Binary technology will enable developers to build one binary that will use NVIDIA GPUs when present or default to using multi-core x64 if no GPU is present.

PGI Unified Binary

Figure 3: PGI Unified Binary for NVIDIA GPUs and Multi-core CPUs


It’s important to clarify that the PGI CUDA C/C++ compiler for multi-core does not split work between the CPU and GPU; it executes device kernels in multi-core mode on the CPU. Even with the PGI Unified Binary feature, the device kernels will execute either on the GPU or on the multi-core, since the data will have been allocated in one memory or the other. PGI CUDA C/C++ also is not intended to as a replacement for OpenMP or other parallel programming models for CPUs. It is a feature of the PGI compilers that will enable CUDA programs to run on either CPUs or GPUs, and will give developers the option of a uniform manycore parallel programming model for applications where it’s needed and appropriate. It will ensure CUDA C programs are portable to virtually any multi-core x64 processor-based HPC system.

The PGI compiler will implement the NVIDIA CUDA C language and closely track the evolution of CUDA C moving forward. The implementation will proceed in phases:

  • Prototype demonstration at SC10 in New Orleans (November 2010).
  • First production release in Q2 2011 with most CUDA C functionality. This will not be a performance release; it will use multi-core parallelism across threads in a single thread block, in the same way as PGI CUDA Fortran emulation mode, but will not exploit parallelism across thread blocks.
  • Performance release in Q3 2011 leveraging multi-core and SSE/AVX to implement low-overhead native parallel/SIMD execution; this will use a single core to execute all the threads in a single thread block, in SIMD mode where possible, and use multi-core parallelism across the thread blocks.
  • Unification release in Q4 2011 that supports PGI Unified Binary technology to create binaries that use NVIDIA GPU accelerators when present, or run on multi-core CPUs if no GPU is present.

The necessary elements of the NVIDIA CUDA toolkit needed to compile and execute CUDA C/C++ programs (header files, for example) will be bundled with the PGI compiler. Finally, the same optimizations and features implemented for CUDA C/C++ for multi-core will also be supported in CUDA Fortran, offering interoperability and a uniform programming model across both languages.

How It Works

In CUDA-x86, thread blocks are mapped to x86 processor cores. Thread-level parallelism is mapped to SSE (Streaming SIMD Extensions) or AVX SIMD units as shown below. (AVX is an extension of SSE to 256-bit operation). PGI indicates that:

  • The size of a warp (that is, the basic unit of code to be run) will be different than the typical 32 threads per warp for a GPU. For x86 computing, a warp might be the size of the SIMD units on the x86 core (either four or eight threads) or one thread per warp when SIMD execution is not utilized.
  • In many cases, the PGI CUDA C compiler removes explicit synchronization of the thread processors when the compiler can determine it is safe to split loops.
  • CUDA considers the GPU as a separate device from the host processors. CUDA x86 maintains this memory model, which means that data movement between the host and device memory spaces still consumes application runtime. As shown in the device bandwidth SDK example below, a modern Xeon processor can transfer data to a CUDA-x86 “device” at about 4GB/sec. All CUDA x86 pointers reside in the x86 memory space, so programmers can use conditional compilation to directly access memory without requiring data transfers when running on multicore processors.

Trying Out the Compiler

The PGI installation process is fairly straightforward:

  1. Register and download the latest version from PGI
  2. Extract the tarfile at the location of your choice and follow the instructions in INSTALL.txt.
    • Under Linux, this basically requires running the file ./install as superuser and answering a few straight-forward questions.
    • Note that you should answer “yes” to the installation of CUDA even if you have a GPU version of CUDA already installed on your system. The PGI x86 version will not conflict with the GPU version. Otherwise, the PGI compiler will not understand files with the .cu file extension.
  3. Create the license.dat file.

At this point, you have a 15-day license for the PGI compilers.

Setup the environment to build with the PGI tools as discussed in the installation guide. Following are the commands for bash under Linux:

PGI=/opt/pgi; export PGI
MANPATH=$MANPATH:$PGI/linux86-64/11.5/man; export MANPATH
PATH=$PGI/linux86-64/11.5/bin:$PATH; export PATH

Copy the PGI NVIDIA SDK samples to a convenient location and build them:

cp –r /opt/pgi/linux86-64/2011/cuda/cudaX86SDK  .
cd cudaX86SDK ;

This is the output of deviceQuery on an Intel Xeon e5560 processor:

CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
  CUDA Driver Version:                           99.99
  CUDA Runtime Version:                          99.99
  CUDA Capability Major revision number:         9998
  CUDA Capability Minor revision number:         9998
  Total amount of global memory:                 128000000 bytes
  Number of multiprocessors:                     1
  Number of cores:                               0
  Total amount of constant memory:               1021585952 bytes
  Total amount of shared memory per block:       1021586048 bytes
  Total number of registers available per block: 1021585904
  Warp size:                                     1
  Maximum number of threads per block:           1021585920
  Maximum sizes of each dimension of a block:    32767 x 2 x 0
  Maximum sizes of each dimension of a grid:     1021586032 x 32767 x 1021586048
  Maximum memory pitch:                          4206313 bytes
  Texture alignment:                             1021585952 bytes
  Clock rate:                                    0.00 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Unknown
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 99.99, CUDA Runtime Version = 99.99, NumDevs = 1, Device = DEVICE EMULATION MODE
Press <Enter> to Quit...

The output of bandwidthTest shows that device transfers work as expected:

Running on...
 Quick Mode
 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         4152.5
 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         4257.0
 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         8459.2
[bandwidthTest] - Test results:
Press <Enter> to Quit...

As with NVIDIA’s nvcc compiler, it is easy to use the PGI pgCC compiler to build an executable from a CUDA source file. As an example, copy the code from Part 3 of this series. To compile and run it under Linux, type:


Posted in Computer Network & Security, Computer Softwares, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, PARALLEL | Tagged: | Leave a Comment »

Parallel Code: Maximizing your Performance Potential

Posted by Hemprasad Y. Badgujar on December 19, 2014

No matter what the purpose of your application is, one thing is certain. You want to get the most bang for your buck. You see research papers being published and presented making claims of tremendous speed increases by running algorithms on the GPU (e.g. NVIDIA Tesla), in a cluster, or on a hardware accelerator (such as the Xeon Phi or Cell BE). These architectures allow for massively parallel execution of code that, if done properly, can yield lofty performance gains.

Unlike most aspects of programming, the actual writing of the programs is (relatively) simple. Most hardware accelerators support (or are very similar to) C based programming languages. This makes hitting the ground running with parallel coding an actually doable task. While mastering the development of massively parallel code is an entirely different matter, with a basic understanding of the principles behind efficient, parallel code, one can obtain substantial performance increases compared to traditional programming and serial execution of the same algorithms.

In order to ensure that you’re getting the most bang for your buck in terms of performance increases, you need to be aware of the bottlenecks associated with coprocessor/GPU programming. Fortunately for you, I’m here to make this an easier task. By simply avoiding these programming “No-No’s” you can optimize the performance of your algorithm without having to spend hundreds of hours learning about every nook and cranny of the architecture of your choice. This series will discuss and demystify these performance-robbing bottlenecks, and provide simple ways to make these a non-factor in your application.

Parallel Thread Management – Topic #1

First and foremost, the most important thing with regard to parallel programming is the proper management of threads. Threads are the smallest sequence of programmed instructions that are able to be utilized by an operating system scheduler. Your application’s threads must be kept busy (not waiting) and non-divergent. Properly scheduling and directing threads is imperative to avoid wasting precious computing time.
Read the rest of this entry »

Posted in Computer Hardwares, Computer Languages, Computing Technology, GPU (CUDA), GPU Accelareted, My Research Related, PARALLEL, Research Menu | Tagged: | Leave a Comment »


Posted by Hemprasad Y. Badgujar on December 19, 2014

MPI is a well-known programming model for Distributed Memory Computing. If you have access to GPU resources, MPI can be used to distribute tasks to computers, each of which can use their CPU and also GPU to process the distributed task.

My toy problem in hand is to use  a mix of MPI and CUDA to handle traditional sparse-matrix vector multiplication. The program can be structured as:

Each node uses both CPU and GPU resources
Each node uses both CPU and GPU resources
  1. Read a sparse matrix from from disk, and split it into sub-matrices.
  2. Use MPI to distribute the sub-matrices to processes.
  3. Each process would call a CUDA kernel to handle the multiplication. The result of multiplication would be copied back to each computer memory.
  4. Use MPI to gather results from each of the processes, and re-form the final matrix.

One of the options is to put both MPI and CUDA code in a single file, This program can be compiled using nvcc, which internally uses gcc/g++ to compile your C/C++ code, and linked to your MPI library:

nvcc -I/usr/mpi/gcc/openmpi-1.4.6/include -L/usr/mpi/gcc/openmpi-1.4.6/lib64 -lmpi -o program

The downside is it might end up being a plate of spaghetti, if you have some seriously long program.

Another cleaner option is to have MPI and CUDA code separate in two files: main.c and respectively. These two files can be compiled using mpicc, and nvcc respectively into object files (.o) and combined into a single executable file using mpicc. This second option is an opposite compilation of the above, using mpicc, meaning that you have to link to your CUDA library.

module load openmpi cuda #(optional) load modules on your node
mpicc -c main.c -o main.o
nvcc -arch=sm_20 -c -o multiply.o
mpicc main.o multiply.o -lcudart -L/apps/CUDA/cuda-5.0/lib64/ -o program

And finally, you can request two processes and two GPUs to test your program on the cluster using PBS script like:

#PBS -l nodes=2:ppn=2:gpus=2
mpiexec -np 2 ./program

The main.c, containing the call to CUDA file, would look like:

#include "mpi.h"
int main(int argc, char *argv[])
/* It's important to put this call at the begining of the program, after variable declarations. */
MPI_Init(argc, argv);
/* Get the number of MPI processes and the rank of this process. */
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
// ==== Call function 'call_me_maybe' from CUDA file ==========
/* ... */

And in, define call_me_maybe() with the ‘extern‘ keyword to make it accessible from main.c (without additional #include …)

/* */
#include <cuda.h>
#include <cuda_runtime.h>
 __global__ void __multiply__ ()
 extern "C" void call_me_maybe()
     /* ... Load CPU data into GPU buffers  */
     __multiply__ <<< ...block configuration... >>> (x, y);
     /* ... Transfer data from GPU to CPU */


Mixing MPI and CUDA

Mixing MPI (C) and CUDA (C++) code requires some care during linking because of differences between the C and C++ calling conventions and runtimes. A helpful overview of the issues can be found at How to Mix C and C++.

One option is to compile and link all source files with a C++ compiler, which will enforce additional restrictions on C code. Alternatively, if you wish to compile your MPI/C code with a C compiler and call CUDA kernels from within an MPI task, you can wrap the appropriate CUDA-compiled functions with the extern keyword, as in the following example.

These two source files can be compiled and linked with both a C and C++ compiler into a single executable on Oscar using:

$ module load mvapich2 cuda
$ mpicc -c main.c -o main.o
$ nvcc -c -o multiply.o
$ mpicc main.o multiply.o -lcudart

The CUDA/C++ compiler nvcc is used only to compile the CUDA source file, and the MPI C compiler mpicc is user to compile the C code and to perform the linking.

01. /* */
03. #include 
04. #include 
06. __global__ void __multiply__ (const float *a, float *b)
07. {
08. const int i = threadIdx.x + blockIdx.x * blockDim.x;
09.     b[i] *= a[i];
10. }
12. extern "C" void launch_multiply(const float *a, const *b)
13. {
14.     /* ... load CPU data into GPU buffers a_gpu and b_gpu */
16.     __multiply__ <<< ...block configuration... >>> (a_gpu, b_gpu);
18.     safecall(cudaThreadSynchronize());
19.     safecall(cudaGetLastError());
21.     /* ... transfer data from GPU to CPU */

Note the use of extern "C" around the function launch_multiply, which instructs the C++ compiler (nvcc in this case) to make that function callable from the C runtime. The following C code shows how the function could be called from an MPI task.

01. /* main.c */
03. #include 
05. void launch_multiply(const float *a, float *b);
07. int main (int argc, char **argv)
08. {
09.        int rank, nprocs;
10.     MPI_Init (&argc, &argv);
11.     MPI_Comm_rank (MPI_COMM_WORLD, &rank);
12.     MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
14.     /* ... prepare arrays a and b */
16.     launch_multiply (a, b);
18.     MPI_Finalize();
19.        return 1;
20. }

Posted in CLUSTER, Computer Hardware, Computer Softwares, Computer Vision, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, GRID, Open CL, OpenMP, PARALLEL | Tagged: , , | 1 Comment »

Posted by Hemprasad Y. Badgujar on December 11, 2014

Cloud scaling, Part 1: Build a compute node or small cluster application and scale with HPC

Leveraging warehouse-scale computing as needed

Discover methods and tools to build a compute node and small cluster application that can scale with on-demand high-performance computing (HPC) by leveraging the cloud. This series takes an in-depth look at how to address unique challenges while tapping and leveraging the efficiency of warehouse-scale on-demand HPC. The approach allows the architect to build locally for expected workload and to spill over into on-demand cloud HPC for peak loads. Part 1 focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Exotic HPC architectures with custom-scaled processor cores and shared memory interconnection networks are being rapidly replaced by on-demand clusters that leverage off-the-shelf general purpose vector coprocessors, converged Ethernet at 40 Gbit/s per link or more, and multicore headless servers. These new HPC on-demand cloud resources resemble what has been called warehouse-scale computing, where each node is homogeneous and headless and the focus is on total cost of ownership and power use efficiency overall. However, HPC has unique requirements that go beyond social networks, web search, and other typical warehouse-scale computing solutions. This article focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Moving to high-performance computing

The TOP500 and Green500 supercomputers (see Resources) since 1994 are more often not custom designs, but rather designed and integrated with off-the-shelf headless servers, converged Ethernet or InfiniBand clustering, and general-purpose graphics processing unit (GP-GPU) coprocessors that aren’t for graphics but rather for single program, multiple data (SPMD) workloads. The trend in high-performance computing (HPC) away from exotic custom processor and memory interconnection design to off-the-shelf—warehouse-scale computing—is based on the need to control total cost of ownership, increase power efficiency, and balance operational expenditure (OpEx) and capital expenditure (CapEx) for both start-up and established HPC operations. This means that you can build your own small cluster with similar methods and use HPC warehouse-scale resources on-demand when you need them.

The famous 3D torus interconnection that Cray and others used may never fully go away (today, the TOP500 is one-third massively parallel processors [MPPs] and two-thirds cluster architecture for top performers), but focus on efficiency and new OpEx metrics like Green500 Floating Point Operations (FLOPs)/Watt are driving HPC and keeping architecture focused on clusters. Furthermore, many applications of interest today are data driven (for example, digital video analytics), so many systems not only need traditional sequential high performance storage for HPC checkpoints (saved state of a long-running job) but more random access to structured (database) and unstructured (files) large data sets. Big data access is a common need of traditional warehouse-scale computing for cloud services as well as current and emergent HPC workloads. So, warehouse-scale computing is not HPC, but HPC applications can leverage data center-inspired technology for cloud HPC on demand, if designed to do so from the start.

Power to computing

Power to computing can be measured in terms of a typical performance metric per Watt—for example, FLOPS/Watt or input/output per second/Watt for computing and I/O, respectively. Furthermore, any computing facility can be seen as a plant for converting Watts into computational results, and a gross measure of good plant design is power use efficiency (PUE), which is simply the ratio of total facility power over that delivered to computing equipment. A good value today is 1.2 or less. One reason for higher PUEs is inefficient cooling methods, administrative overhead, and lack of purpose-built facilities compared to cloud data centers (see Resources for a link to more information).

Changes in scalable computing architecture focus over time include:

  • Early focus on a fast single processor (uniprocessor) to push the stored-program arithmetic logic unit central processor to the highest clock rates and instruction throughput possible:
    • John von Neumann, Alan Turing, Robert Noyce (founder of Intel), Ted Hoff (Intel universal processor proponent), along with Gordon Moore see initial scaling as a challenge to scaling digital logic and clock a processor as fast as possible.
    • Up to at least 1984 (and maybe longer), the general rule was “the processor makes the computer.”
    • Cray Computer designs vector processors (X-MP, Y-MP) and distributed memory multiprocessors interconnected by a six-way interconnect 3D torus for custom MPP machines. But this is unique to the supercomputing world.
    • IBM’s focus early on was scalable mainframes and fast uniprocessors until the announcement of the IBM® Blue Gene® architecture in 1999 using a multicore IBM® POWER® architecture system-on-a-chip design and a 3D torus interconnection. The current TOP500 includes many Blue Gene systems, which have often occupied the LINPACK-measured TOP500 number one spot.
  • More recently since 1994, HPC is evolving to a few custom MPP and mostly off-the-shelf clusters, using both custom interconnections (for example, Blue Gene and Cray) and off-the-shelf converged Ethernet (10G, 40G) and InfiniBand:
    • The TOP500 has become dominated by clusters, which comprise the majority of top-performing HPC solutions (two-thirds) today.
    • As shown in the TOP500 chart by architecture since 1994, clusters and MPP dominate today (compared to single instruction, multiple data [SIMD] vector; fast uniprocessors; symmetric multiprocessing [SMP] shared memory; and other, more obscure architectures).
    • John Gage at Sun Microsystems (now Oracle) stated that “the network is the computer,” referring to distributed systems and the Internet, but low-latency networks in clusters likewise become core to scaling.
    • Coprocessors interfaced to cluster nodes via memory-mapped I/O, including GP-GPU and even hybrid field-programmable gate array (FPGA) processors, are used to accelerate specific computing workloads on each cluster node.
  • Warehouse-scale computing and the cloud emerge with focus on MapReduce and what HPC would call embarrassingly parallel applications:
    • The TOP500 is measured with LINPACK and FLOPs and so is not focused on cost of operations (for example, FLOPs/Watt) or data access. Memory access is critical, but storage access is not so critical, except for job checkpoints (so a job can be restarted, if needed).
    • Many data-driven applications have emerged in the new millennium, including social networks, Internet search, global geographical information systems, and analytics associated with more than a billion Internet users. This is not HPC in the traditional sense but warehouse-computing operating at a massive scale.
    • Luiz André Barroso states that “the data center is the computer,” a second shift away from processor-focused design. The data center is highly focused on OpEx as well as CapEx, and so is a better fit for HPC where FLOPs/Watt and data access matter. These Google data centers have a PUE less than 1.2—a measure of total facility power consumed divided by power used for computation. (Most computing enterprises have had a PUE of 2.0 or higher, so, 1.2 is very low indeed. See Resources for more information.)
    • Amazon launched Amazon Elastic Compute Cloud (Amazon EC2), which is best suited to web services but has some scalable and at least high-throughput computing features (see Resources).
  • On-demand cloud HPC services expand, with an emphasis on clusters, storage, coprocessors and elastic scaling:
    • Many private and public HPC clusters occupy TOP500, running Linux® and using common open source tools, such that users can build and scale applications on small clusters but migrate to the cloud for on-demand large job handling. Companies like Penguin Computing, which features Penguin On-Demand, leverage off-the-shelf clusters (InfiniBand and converged 10G/40G Ethernet), Intel or AMD multicore headless nodes, GP-GPU coprocessors, and scalable redundant array of independent disks (RAID) storage.
    • IBM Platform computing provides IBM xSeries® and zSeries® computing on demand with workload management tools and features.
    • Numerous universities and start-up companies leverage HPC on demand with cloud services or off-the-shelf clusters to complement their own private services. Two that I know well are the University of Alaska Arctic Region Supercomputing Center (ARSC) Pacman (Penguin Computing) and the University of Colorado JANUS cluster supercomputer. A common Red Hat Enterprise Linux (RHEL) open source workload tool set and open architecture allow for migration of applications from private to public cloud HPC systems.

Figure 1 shows the TOP500 move to clusters and MPP since the mid-1990s.

Figure 1. TOP500 evolution to clusters and MPP since 1994

Image showing the evolution to clustersThe cloud HPC on-demand approach requires well-defined off-the-shelf clustering, compute nodes, and tolerance for WAN latency to transfer workload. As such, these systems are not likely to overtake top spots in the TOP500, but they are likely to occupy the Green500 and provide efficient scaling for many workloads and now comprise the majority of the Top500.

High-definition digital video computer vision: a scalable HPC case study

Most of us deal with compressed digital video, often in Motion Picture Experts Group (MPEG) 4 format, and don’t think of the scale of even a high-definition (HD) web cam in terms of data rates and processing to apply simple image processing analysis. Digital cinema workflow and post-production experts know the challenges well. They deal with 4K data (roughly 4-megapixel) individual frames or much higher resolution. These frames might be compressed, but they are not compressed over time in groups of pictures like MPEG does and are often lossless compression rather than lossy.

To start to understand an HPC problem that involves FLOPs, uncompressed data, and tools that can be used for scale-up, let’s look at a simple edge-finder transform. The includes Open Computer Vision (OpenCV) algorithms to transform a real-time web cam stream into a Sobel or Canny edge view in real time. See Figure 2.

Figure 2. HD video Canny edge transform

Image showing a Canny edge transformLeveraging cloud HPC for video analytics allows for deployment of more intelligent smart phone applications. Perhaps phone processors will someday be able to handle real-time HD digital video facial recognition, but in the mean time, cloud HPC can help. Likewise, data that originates in data centers, like geographic information systems (GIS) data, needs intensive processing for analytics to segment scenes, create point clouds of 3D data from stereo vision, and recognize targets of interest (such as well-known landmarks).

Augmented reality and video analytics

Video analytics involves collection of structured (database) information from unstructured video (files) and video streams—for example, facial recognition. Much of the early focus has been on security and automation of surveillance, but applications are growing fast and are being used now for more social applications, e.g. facial recognition, perhaps not to identify a person but to capture and record their facial expression and mood (while shopping). This technology can be coupled with augmented reality, whereby the analytics are used to update a scene with helpful information (such as navigation data). Video data can be compressed and uplinked to warehouse-scale data centers for processing so that the analytics can be collected and information provided in return not available on a user’s smart phone. The image processing is compute intensive and involves big data storage, and likely a scaling challenge (see Resources for a link to more information).

Sometimes, when digital video is collected in the field, the data must be brought to the computational resources; but if possible, digital video should only be moved when necessary to avoid encoding to compress and decoding to decompress for viewing. Specialized coprocessors known as codecs (coder/decoder) are designed to decode without software and coprocessors to render graphics (GPUs) exist, but to date, no CV coprocessors are widely available. Khronos has announced an initiative to define hardware acceleration for OpenCV in late 2012, but work has only just begun (see Resources). So, to date, CV remains more of an HPC application that has had attention primarily from digital cinema, but this is changing rapidly based on interest in CV on mobiles and in the Cloud.

Although all of us imagine CV to be implemented on mobile robotics, in our heads-up displays for intelligent transportation, and on visors (like Google Goggles that are now available) for personal use, it’s not clear that all of the processing must be done on the embedded devices or that it should be even if it could. The reason is data: Without access to correlated data center data, CV information has less value. For example, how much value is there in knowing where your are without more mapping and GIS data to help you with where you want to go next? Real-time CV and video analytics are making progress, but they face many challenges, including huge storage requirements, high network bit rates for transport, and significant processing demands for interpretation. Whether the processing is done by cloud HPC clusters or embedded systems, it’s clear that concurrency and parallel processing will play a huge role. Try running a simple Hough linear transform on the 12-megapixel cactus photo I took, and you’ll see why HPC might be needed just to segment a scene at 60 frames/s.

The challenge of making algorithms parallel

HPC with both clusters and MPP requires coding methods to employ many thread of execution on each multicore node and to use message-passing interfaces (MPIs) and basic methods to map data and code to process resources and collect results. For digital video, the mapping can be simple if done at a frame level. Within a frame is more difficult but still not bad other than the steps of segmenting and restitching frames together.

The power of MapReduce

The MapReduce concept is generally associated with Google and the open source Hadoop project (from Apache Software Foundation), but any parallel computation must employ this concept to obtain speed-up, whether done at a node or cluster level with Java™ technology or at a thread level for a nonuniform memory access (NUMA) shared memory. For applications like digital video analytics, the mapping is data intensive, so it makes sense to move the function to the data (in the mapping stage), but either way, the data to be processed must be mapped and processed and the results combined. A clever mapping avoids data dependencies and the need for synchronization as much as possible. In the case of image processing, for CV, the mapping could be within a frame, at the frame level, or by groups of pictures (see Resources).

Key tools for designing cluster scaling applications for cloud HPC on demand include the following:

  • Threading is the way in which a single application (or Linux process) is one address space on one cluster node and can be designed to use all processor cores on that node. Most often, this is done with Portable Operating System Interface for UNIX® (POSIX) Pthreads or with a library like OpenMP, which abstracts the low-level details of POSIX threading. I find POSIX threading to be fairly simple and typically write Pthread code as can be seen in the hpc_cloud_grid.tar.gz example. This example maps threads to the over-number space for prime number searching.
  • MPI is a library that can be linked into a cluster parallel application to assist with mapping of processing to each node, synchronization, and reduction of results. Although you can use MPI to implement MapReduce, unlike Hadoop, it typically moves data (in messages) to program functions running on each node (rather than moving code to the data). In the final video analytics article in this series, I will provide a thread and MPI cluster-scalable version of the capture-transform code. Here, I provide the simple code for a single thread and node to serve as a reference. Run it and Linux dstat at the same time to monitor CPU, I/O, and storage use. It is a resource-intensive program that computes Sobel and Canny transforms on a 2560×1920-pixel image. It should run on any Linux system with OpenCV and a web cam.
  • Vector SIMD and SPMD processing can be accomplished on Intel and AMD nodes with a switch to enable during compilation or, with more work, by creation of transform kernels in CUDA or OpenCL for off-load to a GPU or GP-GPU coprocessor.
  • OpenCV is highly useful for video analytics, as it includes not only convenient image capture, handling, and display functions but also most of the best image processing transforms used in CV.

The future of on-demand cloud HPC

This articles makes an argument for cloud HPC. The goal here is to acquaint you with the idea and some of the challenging, yet compelling applications (like CV) as well as to introduce you to methods for programming applications that can scale on clusters and MPP machines. In future articles, I will take the CV example further and adapt it for not only threading but also for MPI so that we can examine how well it scales on cloud HPC (in my case, at ARSC on Pacman or JANUS). My research involves comparison of tightly coupled CV coprocessors (that I am building using an Altera Stratix IV FPGA I call a computer vision processing unit [CVPU]). I am comparing this to what I can achieve with CV on ARSC for the purpose of understanding whether environmental sensing and GIS data are best processed like graphics, with a coprocessor, or on a cluster or perhaps with a combination of the two. The goals for this research are lofty. In the case of CVPU, the CV/graphics Turing-like test I imagine is one in which the scene that the CVPU parses can then be sent to a GPU for rendering. Ideally, the parsed/rendered image would be indistinguishable from the true digital video stream. When rendered scenes and the ability to analyze them reaches a common level of fidelity, augmented reality, perceptual computing, and video analytics will have amazing power to transform our lives.

Cloud scaling, Part 2: Tour high-performance cloud system design advances

Learn how to leverage co-processing, nonvolatile memory, interconnection, and storage

Breakthrough device technology requires the system designer to re-think operating and application software design in order to realize the potential benefits of closing the access gap or pushing processing into the I/O path with coprocessors. Explore and consider how the latest memory, compute, and interconnection devices and subsystems can affect your scalable, data-centric, high-performance cloud computing system design. Breakthroughs in device technology can be leveraged for transition between compute-centric and the more balanced data-centric compute architectures.

The author examines storage-class memory and demonstrates how to fill the long-standing performance gap between RAM and spinning disk storage; details the use of I/O bus coprocessors (for processing closer to data); explains how to employ InfiniBand to build low-cost, high performance interconnection networks; and discusses scalable storage for unstructured data.

Computing systems engineering has historically been dominated by scaling processors and dynamic RAM (DRAM) interfaces to working memory, leaving a huge gap between data-driven and computational algorithms (see Resources). Interest in data-centric computing is growing rapidly, along with novel system design software and hardware devices to support data transformation with large data sets.

The data focus in software is no surprise given applications of interest today, such as video analytics, sensor networks, social networking, computer vision and augmented reality, intelligent transportation, machine-to-machine systems, and big data initiatives like IBM’s Smarter Planet and Smarter Cities.

The current wave of excitement is about collecting, processing, transforming, and mining the big data sets:

  • The data focus is leading toward new device-level breakthroughs in nonvolatile memory (storage-class memory, SCM) which brings big data closer to processing.
  • At the same time, input/output coprocessors are bringing processing closer to the data.
  • Finally, low-latency, high-bandwidth off-the-shelf interconnections like InfiniBand are allowing researchers to quickly build 3D torus and fat-tree clusters that used to be limited to the most exotic and expensive custom high-performance computing (HPC) designs.

Yet, the systems software and even system design often remain influenced by out-of-date bottlenecks and thinking. For example, consider threading and multiprogramming. The whole idea came about because of slow disk drive access; what else can a program do when waiting on data but run another one. Sure, we have redundant array of independent disks (RAID) scaling and NAND flash solid-state disks (SSDs), but as noted by IBM Almaden Research, the time scale differences of the access time gap are massive in human terms.

The access time gap between a CPU, RAM, and storage can be measured in terms of typical performance for each device, but perhaps the gap is more readily understood when put into human terms (as IBM Almaden has done for illustrative purposes).

If a typical CPU operation is similar to what a human can do in seconds, then RAM access at 100 times more latency is much like taking a few minutes to access information. However, by the same comparison, disk access at 100,000 times more latency compared to RAM is on the order of months (100 days). (See Figure 1.)

Figure 1. The data access gap

Image showing the data access gapMany experienced computer engineers have not really thought hard about the 100 to 200 random I/O operations per second (IOPS) — it is the mechanical boundary for a disk drive. (Sure, sequential access is as high as hundreds of megabytes per second, but random access remains what it was more than 50 years ago, with the same 15K RPM seek and rotate access latency.)

Finally, as Almaden notes, tape is therefore glacially slow. So, why do we bother? For the capacity, of course. But how can we get processing to the data or data to the processing more efficiently?

Look again at Figure 1. Improvements to NAND flash memory for use in mobile devices and more recently SSD has helped to close the gap; however, it is widely believed that NAND flash device technology will be pushed to its limits fairly quickly, as noted by numerous system researchers (see Resources). The transistor floating gate technology used is already at scaling limits and pushing it farther is leading to lower reliability, so although it has been a stop-gap for data-centric computing, it is likely not the solution.

Instead, several new nonvolatile RAM (NVRAM) device technologies are likely solutions, including:

  • Phase change RAM (PCRAM): This memory uses a heating element to turn a class of materials known as chalcogenides into either a crystallized or amorphous glass state, thereby storing two states that can be programmed and read, with state retained even when no power is applied. PCRAM appears to show the most promise in the near term for M-type synchronous nonvolatile memory (NVM).
  • Resistive RAM (RRAM): Most often described as a circuit that is unlike a capacitor, inductor, or resistor, RRAM provides a unique relationship between current and voltage unlike other well-known devices that store charge or magnetic energy or provide linear resistance to current flow. Materials with properties called memristors have been tested for many decades but engineers usually avoid them because of their nonlinear properties and the lack of application for them. IEEE fellow Leon Chua describes them in “Memristor: The Missing Circuit Element.” A memristor’s behavior can be summarized as follows: Current flow in one direction causes electrical resistance to increase and in the opposite direction resistance decreases, but the memristor retains the last resistance it had when flow is re-started. As such, it can store a nonvolatile state, be programmed, and the state read. For details and even some controversy on what is and is not a memristor, seeResources.
  • Spin transfer torque RAM (STT-RAM): A current passed through a magnetic layer can produce a spin-polarized current that, when directed into a magnetic layer, can change its orientation via angular momentum. This behavior can be used to excite oscillations and flip the orientation of nanometer-scale magnetic devices. The main drawback is the high current needed to flip the orientation.

Consult the many excellent entries in Resources for more in-depth information on each device technology.

From a systems perspective, as these devices evolve, where they can be used and how well each might fill the access gap depends on the device’s:

  • Cost
  • Scalability (device integration size must be smaller than a transistor to beat flash; less than 20 nanometers)
  • Latency to program and read
  • Device reliability
  • Perhaps most importantly, durability (how often it can be programmed and erased before it becomes unreliable).

Based on these device performance considerations, IBM has divided SCM into two main classes:

  • S-type: Asynchronous access via an I/O controller. Threading or multiprogramming is used to hide the I/O latency to the device.
  • M-type: Synchronous access via a memory controller. Think about this as wait-states for RAM access in which a CPU core stalls.

Further, NAND SSD would be considered fast storage, accessed via a block-oriented storage controller (much higher I/O rates but similar bandwidth to a spinning disk drive).

It may seem like the elimination of asynchronous I/O for data processing (except, of course, for archive access or cluster scaling) might be a cure-all for data-centric processing. In some sense it is, but systems designers and software developers will have to change habits. The need for I/O latency hiding will largely go away on each node in a system, but it won’t go away completely. Clusters built from InfiniBand deal with node-to-node data-transfer latency with Message Passing Interface or MapReduce schemes and enjoy similar performance to this envisioned SCM node except when booting or when node data exceeds node working RAM size.

So, for scaling purposes, cluster interconnection and I/O latency hiding among nodes in the cluster is still required.

Moving processing closer to data with coprocessors

Faster access to big data is ideal and looks promising, but some applications will always benefit from the alternative approach of moving processing closer to data interfaces. Many examples exist, such as graphics (graphics processing units, GPUs), network processors, protocol-offload engines like the TCP/IP Offload Engine, RAID on chip, encryption coprocessors, and more recently, the idea of computer vision coprocessors. My research involves computer vision and graphics coprocessors, both at scale in clusters and embedded. I am working on what I call a computer vision processing unit, comparing several coprocessors that became more widely pursued with the 2012 announcement of OpenVX by Khronos (see Resources).

In the embedded world, such a method might be described as an intelligent sensor or smart camera, methods in which preliminary processing of raw data is provided by the sensor interface and an embedded logic device or microprocessor, perhaps even a multicore system on a chip (SoC).

In the scalable world, this most often involves use of a coprocessor bus or channel adapter (like PCI Express, PCIe, and Ethernet or InfiniBand); it provides data processing between the data source (network side) and the node I/O controller (host side).

Whether processing should be done or is more efficient when done in the I/O path or on a CPU core has always been a topic of hot debate, but based on an existence proof (GPUs and network processors), clearly they can be useful, waxing and waning in popularity based on coprocessor technology compared to processor. So, let’s take a quick look at some of the methods:

Vector processing for single program, multiple data
Provided today by GPUs, general-purpose GPUs (GP-GPUs), and application processing units (APUs), the idea is that data can be transformed on its way to an output device like a display or sent to a GP-GPU/APU and transformed on a round trip from the host. “General purpose” implies more sophisticated features like double-precision arithmetic compared to single precision only for graphics-specific processing.
Many core
Traditional many-core coprocessor cards (see Resources) are available from various vendors. The idea is to lower cost and power consumption by using simpler, yet numerous cores on the I/O bus, with round-trip offloading of processing to the cards for a more capable but power-hungry and costly full-scale multicore host. Typically, the many-core coprocessor might have an order of magnitude more cores than the host and often includes gigabit or 10G Ethernet and other types of network interfaces.
I/O bus field-programmable gate arrays (FPGAs)
FPGA cards, most often used to prototype a new coprocessor in the early stages of development, can perhaps used as a solution for low-volume coprocessors as well.
Embedded SoCs
A multicore solution can be used in an I/O device to create an intelligent device like a stereo ranging or time-of-flight camera.
Interface FPGA/configurable programmable logic devices
A digital logic state machine can provide buffering and continuous transformation of I/O data, such as digital video encoding.

Let’s look at an example based on offload and I/O path. Data transformation has obvious value for applications like the decoding of MPEG4 digital video, consisting of a GPU coprocessor in the path between the player and a display as shown in Figure 2 for the Linux® MPlayer video decoder and presentation acceleration unit (VDPAU) software interface to NVIDIA MPEG decoding on the GPU.

Figure 2. Simple video decode offload example

Image showing an example of a simple video decode offloadLikewise, any data processing or transformation that can be done in-bound or out-bound from a CPU host may have value, especially if the coprocessor can provide processing at a lower cost with great efficiency or with lower power consumption based on purpose-built processors compared to general-purpose CPUs.

To start to understand a GP-GPU compared to a multicore coprocessor approach, try downloading the two examples of a point spread function to sharpen the edges on an image (threaded transform example) compared with the GPU transform example. Both provide the same 320×240-pixel transformation, but in one case, the Compute Unified Device Architecture (CUDA) C code provided requires a GPU or GP-GPU coprocessor and, in the other case, either a multicore host or a many-core (for example, MICA) coprocessor.

So which is better?

Neither approach is clearly better, mostly because the NVRAM solutions have not yet been made widely available (except as expensive battery-backed DRAM or as S-type SCM from IBM Texas Memory Systems Division) and moving processing into the I/O data path has traditionally involved less friendly programming. Both are changing, though: Coprocessors are adopting higher-level languages like the Open Compute Language (OpenCL) in which code written for multicore hosts runs equally well on Intel MICA or Altera Startix IV/V architectures.

Likewise, all of the major computer systems companies are working feverishly to release SCM products, with PCRAM the most likely to be available first. My advice is to assume that both will be with us for some time and operating systems and applications must be able to deal with both. The memristor, or RRAM, includes a vision that resembles Isaac Asimov’s fictional positronic brain in which memory and processing are fully integrated as they are in a human neural system but with metallic materials. The concept of fully integrated NVM and processing is generally referred to as processing in memory (PIM) or neuromorphic processing (see Resources). Scalable NVM integrated processing holds extreme promise for biologically inspired intelligent systems similar to the human visual cortex, for example. Pushing toward the goal of integrated NVM, with PIM from both sides, is probably a good approach, so I plan to keep up with and keep working on systems that employ both methods—coprocessors and NVM. Nature has clearly favored direct, low-level, full integration of PIM at scale for intelligent systems.

Scaling nodes with Infiniband interconnection

System designers always have to consider the trade-off between scaling up each node in a system and scaling out a solution that uses networking or more richly interconnected clustering to scale processing, I/O, and data storage. At some point, scaling the memory, processing, and storage a single node can integrate hits a practical limit in terms of cost, power efficiency, and size. It is also often more convenient from a reliability, availability, and servicing perspective to spread capability over multiple nodes so that if one needs repair or upgrade, others can continue to provide service with load sharing.

Figure 3 shows a typical InfiniBand 3D torus interconnection.

Figure 3. Example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)

Image showing an example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)In Figure 3, the 4x4x4 shown is for the San Diego Supercomputing Center (SDSC) Gordon supercomputer, as documented by Mellanox, which uses a 36-port InfiniBand switch to connect nodes to each other and to storage I/O.

InfiniBand, Converged Enhanced Ethernet iSCSI (CEE), or Fibre Channel is the most often used scalable storage interface for access to big data. This storage area network (SAN) scaling for RAID arrays is used to host distributed, scalable file systems like Ceph, Lustre, Apache Hadoop, or the IBM General Parallel File System (GPFS). Use of CEE and InfiniBand for storage access using the Open Fabric Alliance SCSI Remote Direct Memory Access (RDMA) Protocol and iSCSI Extensions for RDMA is a natural fit for SAN storage integrated with an InfiniBand cluster. Storage is viewed more as a distributed archive of unstructured data that is searched or mined and loaded into node NVRAM for cluster processing. Higher-level data-centric cluster processing methods like Hadoop MapReduce can also be used to bring code (software) to the data at each node. These topics are big-data-related topics that I describe more in the last part of this four-part series.

The future of data-centric scaling

This articles makes an argument for systems design and architecture that move processors closer to data-generating and -consuming devices, as well as simplification of memory hierarchy to include fewer levels, leveraging lower-latency, scalable NVM devices. This defines a data-centric node design that can be further scaled with low-latency off-the-shelf interconnection networks like InfiniBand. The main challenge with data-centric computing is not instructions-per-second or floating-point-operations-per-second only, but rather IOPS and the overall power efficiency of data processing.

In Part 1 of this series, I uncovered methods and tools to build a compute node and small cluster application that can scale with on-demand HPC by leveraging the cloud. In this article I detailed such high-performance system design advances as co-processing, nonvolatile memory, interconnection, and storage.

In Part 3 in this series I provide more in-depth coverage of a specific data-centric computing application — video analytics. Video analytics includes applications such as facial recognition for security and computer forensics, use of cameras for intelligent transportation monitoring, retail and marketing that involves integration of video (for example, visualizing yourself in a suit you’re considering from a web-based catalog), as well as a wide range of computer vision and augmented reality applications that are being invented daily. Although many of these applications involve embedded computer vision, most also require digital video analysis, transformation, and generation in cloud-based scalable servers. Algorithms like Sobel transformation can be run on typical servers, but algorithms like the generalized Hough transform, facial recognition, image registration, and stereo (point cloud) mapping, for example, require the NVM and coprocessor approaches this article discussed for scaling.

In the last part of the series, I deal with big data issues.

Cloud scaling, Part 3: Explore video analytics in the cloud

Using methods, tools, and system design for video and image analysis, monitoring, and security

Explore and consider methods, tools, and system design for video and image analysis with cloud scaling. As described in earlier articles in this series, video analytics requires a more balanced data-centric compute architecture compared to traditional compute-centric, scalable, high-performance computing. The author examines the use of OpenCV and similar tools for digital video analysis and methods to scale this analysis using cluster and distributed system design.

The use of coprocessors designed for video analytics and the new OpenVX hardware acceleration discussed in previous articles can be applied to the computer vision (CV) examples presented in this article. This new data-centric technology for CV and video analytics requires the system designer to re-think application software and system design to meet demanding requirements, such as real-time monitoring and security for large, public facilities and infrastructure as well as a more entertaining, interactive, and safer world.

Public safety and security

The integration of video analytics in public places is perhaps the best way to ensure public safety, providing digital forensic capabilities to law enforcement and the potential to increase detection of threats and prevention of public safety incidents. At the same time, this need has to be balanced with rights to privacy, which can become a contentious issue if these systems are abused or not well understood. For example, the extension of facial detection, as shown in Figure 1, to facial recognition has obvious identification capability and can be used to track an individual as he or she moves from one public place to another. To many people, facial analytics might be seen an invasion of privacy, and use of CV and video analytics should adhere to surveillance and privacy rights laws and policies, to be sure—any product or service developer might want to start by considering best practices outlined by the Federal Trade Commission (FTC; see Resources).

Digital video using standards such as that from Motion Picture Experts Group (MPEG) for encoding video to compress, transport, uncompress, and display it has led to a revolution in computing ranging from social networking media and amateur digital cinema to improved training and education. Tools for decoding and consuming digital video are widely used by all every day, but tools to encode and analyze uncompressed video frames are needed for video analytics, such as Open Computer Vision (OpenCV). One of the readily available and quite capable tools for encoding and decoding of digital video is FFmpeg; for still images, GNU Image Processing (GIMP) is quite useful (see Resources for links). With these three basic tools, an open source developer is fully equipped to start exploring computer vision (CV) and video analytics. Before exploring these tools and development methods, however, let’s first define these terms better and consider applications.

The first article in this series, Cloud scaling, Part 1: Build your own and scale with HPC on demand, provided a simple example using OpenCV that implements a Canny edge transformation on continuous real-time video from a Linux® web cam. This is an example of a CV application that you could use as a first step in segmenting an image. In general, CV applications involve acquisition, digital image formats for pixels (picture elements that represent points of illumination), images and sequences of them (movies), processing and transformation, segmentation, recognition, and ultimately scene descriptions. The best way to understand what CV encompasses is to look at examples. Figure 1 shows face and facial feature detection analysis using OpenCV. Note that in this simple example, using the Haar Cascade method (a machine learning algorithm) for detection analysis, the algorithm best detects faces and eyes that are not occluded (for example, my youngest son’s face is turned to the side) or shadowed and when the subject is not squinting. This is perhaps one of the most important observations that can be made regarding CV: It’s not a trivial problem. Researchers in this field often note that although much progress has been made since its advent more than 50 years ago, most applications still can’t match the scene segmentation and recognition performance of a 2-year-old child, especially when the ability to generalize and perform recognition in a wide range of conditions (lighting, size variation, orientation and context) is considered.

Figure 1. Using OpenCV for facial recognition

Image showing facial recognition analysisTo help you understand the analytical methods used in CV, I have created a small test set of images from the Anchorage, Alaska area that isavailable for download. The images have been processed using GIMP and OpenCV. I developed C/C++ code to use the OpenCV application programming interface with a Linux web cam, precaptured images, or MPEG movies. The use of CV to understand video content (sequences of images), either in real time or from precaptured databases of image sequences, is typically referred to as video analytics.

Defining video analytics

Video analytics is broadly defined as analysis of digital video content from cameras (typically visible light, but it could be from other parts of the spectrum, such as infrared) or stored sequences of images. Video analytics involves several disciplines but at least includes:

  • Image acquisition and encoding. As a sequence of images or groups of compressed images. This stage of video analytics can be complex, including photometer (camera) technology, analog decoding, digital formats for arrays of light samples (pixels) in frames and sequences, and methods of compressing and decompressing this data.
  • CV. The inverse of graphical rendering, where acquired scenes are converted into descriptions compared to rendering a scene from a description. Most often, CV assumes that this process of using a computer to “see” should operate wherever humans do, which often distinguishes it from machine vision. The goal of seeing like a human does most often means that CV solutions employ machine learning.
  • Machine vision. Again, the inverse of rendering but most often in a well-controlled environment for the purpose of process control—for example, inspecting printed circuit boards or fabricated parts to make sure they are geometrically correct within tolerances.
  • Image processing. A broad application of digital signal processing methods to samples from photometers and radiometers (detectors that measure electromagnetic radiation) to understand the properties of an observation target.
  • Machine learning. Algorithms developed based on the refinement of the algorithm through training data, whereby the algorithm improves performance and generalizes when tested with new data.
  • Real-time and interactive systems. Systems that require response by a deadline relative to a request for service or at least a quality of service that meets SLAs with customers or users of the services.
  • Storage, networking, database, and computing. All required to process digital data used in video analytics, but a subtle, yet important distinction is that this is an inherently data-centric compute problem, as was discussed in Part 2 of this series.

Video analytics, therefore, is broader in scope than CV and is a system design problem that might include mobile elements like a smart phone (for example, Google Goggles) and cloud-based services for the CV aspects of the overall system. For example, IBM has developed a video analytics system known as the video correlation and analysis suite (VCAS), for which the IBM Travel and Transportation Solution BriefSmarter Safety and Security Solution for Rail [PDF] is available; it is a good example of a system design concept. Detailed focus on each system design discipline involved in a video analytics solution is beyond the scope of this article, but many pointers to more information for system designers are available in Resources. The rest of this article focuses on CV processing examples and applications.

Basic structure of video analytics applications

You can break the architecture of cloud-based video analytics systems down into two major segments: embedded intelligent sensors (such as smart phones, tablets with a camera, or customized smart cameras) and cloud-based processing for analytics that can’t be directly computed on the embedded device. Why break the architecture into two segments compared to fully solving in the smart embedded device? Embedding CV in transportation, smart phones, and products is not always practical. Even when embedding a smart camera is smart, so often, the compressed video or scene description may be back-hauled to a cloud-based video analytics system, just to offload the resource-limited embedded device. Perhaps more important, though, than resource limitations is that video transported to the cloud for analysis allows for correlation with larger data sets and annotation with up-to-date global information for augmented reality (AR) returned to the devices.

The smart camera devices for applications like gesture and facial expression recognition must be embedded. However, more intelligent inference to identify people and objects and fully parse scenes is likely to require scalable data-centric systems that can be more efficiently scaled in a data center. Furthermore, data processing acceleration at scale ranging from the Khronos OpenVX CV acceleration standards to the latest MPEG standards and feature-recognition databases are key to moving forward with improved video analytics, and two-segment cloud plus smart camera solutions allow for rapid upgrades.

With sufficient data-centric computing capability leveraging the cloud and smart cameras, the dream of inverse rendering can perhaps be realized where, in the ultimate “Turing-like” test that can be demonstrated for CV, scene parsing and re-rendered display and direct video would be indistinguishable for a remote viewer. This is essentially done now in digital cinema with photorealistic rendering, but this rendering is nowhere close to real time or interactive.

Video analytics apps: Individual scenarios

Killer applications for video analytics are being thought of every day for CV and video analytics, some perhaps years from realization because of computing requirements or implementation cost. Nevertheless, here is a list of interesting applications:

  • AR views of scenes for improved understanding. If you have ever looked at, for example, a landing plane and thought, I wish I could see the cockpit view with instrumentation, this is perhaps possible. I worked in Space Shuttle mission control long ago, where a large development team meticulously re-created a view of the avionics for ground controllers that shadowed what astronauts could see—all graphical, but imaging fusion of both video and graphics to annotate and re-create scenes with meta-data. A much simplified example is presented here in concept to show how an aircraft observed via a tablet computer camera could be annotated with attitude and altitude estimation data (see the example in this article).
  • Skeletal transformations to track the movement and estimate the intent and trajectory of an animal that might jump onto a highway. See the example in this article.
  • Fully autonomous or mostly autonomous vehicles with human supervisory control only. Think of the steps between today’s cruise control and tomorrow’s full autonomous car. Cars that can parallel park themselves today are a great example of this stepwise development.
  • Beyond face detection to reliable recognition and, perhaps more importantly, for expression feedback. Is the driver of a semiautonomous vehicle aggravated, worried, surprised?
  • Virtual shopping (AR to try products). Shoppers can see themselves in that new suit.
  • Signage that interacts with viewers. This is based on expressions, likes and dislikes, and data that the individual has made public.
  • Two-way television and interactive digital cinema. Entertainment for which viewers can influence the experience, almost as if they were actors in the content.
  • Interactive telemedicine. This is available any time with experts from anywhere in the world.

I make no attempt in this article to provide an exhaustive list of applications, but I explore more by looking closely at both AR (annotated views of the world through a camera and display—think heads-up displays such as fighter pilots have) and skeletal transformations for interactive tracking. To learn more beyond these two case studies and for more in-depth application-specific uses of CV and video analytics in medicine, transportation safety, security and surveillance, mapping and remote sensing, and an ever-increasing list of system automation that includes video content analysis, consult the many entries in Resources. The tools available can help anyone with computer engineering skills get started. You can also download a larger set of test images as well as all OpenCV code I developed for this article.

Example: Augmented reality

Real-time video analytics can change the face of reality by augmenting the view a consumer has with a smart phone held up to products or our view of the world (for example, while driving a vehicle) and can allow for a much more interactive experience for users for everything from movies to television, shopping, and travel to how we work. In AR, the ideal solution provides seamless transition from scenes captured with digital video to scenes generated by rendering for a user in real time, mixing both digital video and graphics in an AR view for the user. Poorly designed AR systems distract a user from normal visual cues, but a well-designed AR system can increase overall situation awareness, fusing metrics with visual cues (think fighter pilot heads-up displays).

The use of CV and video analytics in intelligent transportation systems has significant value for safety improvement, and perhaps eventually CV may be the key technology for self-driving vehicles. This appears to be the case based on the U.S. Defense Advanced Research Projects Agency challenge and the Google car, although use of the full spectrum with forward-looking infrared and instrumentation in addition to CV has made autonomous vehicles possible. Another potentially significant application is air traffic safety, especially for airports to detect and prevent runway incursion scenarios. The imagined AR view of an aircraft on final approach at Ted Stevens airport in Anchorage shows a Hough linear transform that might be used to segment and estimate aircraft attitude and altitude visually, as shown in Figure 2. Runway incursion safety is of high interest to the U.S. Federal Aviation Administration (FAA), and statistics for these events can be found in Resources.

Figure 2. AR display example

Image showing an example of video augmentationFor intelligent transportation, drivers will most likely want to participate even as systems become more intelligent, so a balance of automation and human participation and intervention should be kept in mind (for autonomous or semiautonomous vehicles).

Skeletal transformation examples: Tracking movement for interactive systems

Skeletal transformations are useful for applications like gesture recognition or gate analysis of humans or animals—any application where the motion of a body’s skeleton (rigid members) must be tracked can benefit from a skeletal transformation. Most often, this transformation is applied to bodies or limbs in motion, which further enables the use of background elimination for foreground tracking. However, it can still be applied to a single snapshot, as shown in Figure 3, where a picture of a moose is first converted to a gray map, then a threshold binary image, and finally the medial distance is found for each contiguous region and thinned to a single pixel, leaving just the skeletal structure of each object. Notice that the ears on the moose are back—an indication of the animal’s intent (higher-resolution skeletal transformation might be able to detect this as well as the gait of the animal).

Figure 3. Skeletal transformation of a moose

Image showing an example of a skeletal transformationSkeletal transformations can certainly be useful in tracking animals that might cross highways or charge a hiker, but the transformation has also become of high interest for gesture recognition in entertainment, such as in the Microsoft® Kinect® software developer kit (SDK). Gesture recognition can be used for entertainment but also has many practical purposes, such as automatic sign language recognition—not yet available as a product but a concept in research. Certainly skeletal transformation CV can analyze the human gait for diagnostic or therapeutic purposes in medicine or to capture human movement for animation in digital cinema.

Skeletal transformations are widely used in gesture-recognition systems for entertainment. Creative and Intel have teamed up to create an SDK for Windows® called the Creative* Interactive Gesture Camera Developer Kit (see Resources for a link) that uses a time-of-flight light detection and ranging sensor, camera, and stereo microphone. This SDK is similar to the Kinect SDK but intended for early access for developers to build gesture-recognition applications for the device. The SDK is amazingly affordable and could become the basis from some breakthrough consumer devices now that it is in the hands of a broad development community. To get started, you can purchase the device from Intel, and then download the Intel® Perceptual Computing SDK. The demo images are included as an example along with numerous additional SDK examples to help developers understand what the device can do. You can use the finger tracking example shown in Figure 4 right away just by installing the SDK for Microsoft Visual Studio® and running the Gesture Viewer sample.

Figure 4. Skeletal transformation using the Intel Perceptual Computing SDK and Creative Interactive Gesture Camera Developer Kit

Image showing a skeletal and blob transformation of a hand


The future of video analytics

This article makes an argument for the use of video analytics primarily to improve public safety; for entertainment purposes, social networking, telemedicine, and medical augmented diagnostics; and to envision products and services as a consumer. Machine vision has quietly helped automate industry and process control for years, but CV and video analytics in the cloud now show promise for providing vision-based automation in the everyday world, where the environment is not well controlled. This will be a challenge both in terms of algorithms for image processing and machine learning as well as data-centric computer architectures discussed in this series. The challenges for high-performance video analytics (in terms of receiver operating characteristics and throughput) should not be underestimated, but with careful development, this rapidly growing technology promises a wide range of new products and even human vision system prosthetics for those with sign impairments or loss of vision. Based on the value of vision to humans, no doubt this is also fundamental to intelligent computing systems.


Description Name Size
OpenCV Video Analytics Examples 600KB
Simple images for use with OpenCV 6474KB



Get products and technologies


Description Name Size
GPU accelerated image transform 644KB
Grid threaded comparison 1.08MB
Simple image for transform benchmark Cactus-320× 206KB




Description Name Size
Continuous HD digital camera transform example 123KB
Grid threaded prime generator benchmark hpc_cloud_grid.tar.gz 3KB
High-resolution image for transform benchmark 12288KB



Posted in Apps Development, CLOUD, Computer Languages, Computer Software, Computer Vision, GPU (CUDA), GPU Accelareted, Image Processing, OpenCV, OpenCV, PARALLEL, Project Related, Video | Leave a Comment »

cuSVM for CUDA 6.0 and Matlab x64

Posted by Hemprasad Y. Badgujar on October 13, 2014

cuSVM for CUDA 6.0 and Matlab x64

This page shows how to build cuSVM, GPU accelerated SVM with dense format. Library has been written by AUSTIN CARPENTER. The procedure use CUDA 6.0, MATLAB x64 and Visual Studio 2012. The code and project files were modified in order to compile and link library, many steps were taken from


  1. Addmatlab variables:
    1. cuSVMTrainIter – contains number of iteration the solver does
    2. cuSVMTrainObj –  contains the final objective function value after the trainning
  2. In file lines 869-874 all calls of cudaMemcpyToSymbol was changed, because of changes made in CUDA 6.0 runtime library –
    before the change:
    mxCUDA_SAFE_CALL(cudaMemcpyToSymbol(„taumin”, &h_taumin, sizeof(float) ));
    after the change:
    mxCUDA_SAFE_CALL(cudaMemcpyToSymbol(taumin, &h_taumin, sizeof(float) ));
  3. In functions FindBI, FindBJ, FindStoppingJ – change the way reduction in shared memory was done (
  4. The kernel cache size is constrained to 400MB, if you want bigger cache you can modify line 24
    #define KERNEL_CACHE_SIZE (400*1024*1024)


Build Procedure

Download preconfigure cuSVM Visual Studio 2010 solution with LibSVM and matlab scritp for classification

All steps describe below are done, you have to check if all paths are set correctly and yours GPU computational capability is set properly.

My setup:

  • windows 7 x64
  • visual studio 2012
  • CUDA 6.0
  • Matlab R2014a
  • the code was tested on Quadro 5000 and Geforce GTX 580


Determine paths:

  1. Matlab include path, mine is „D:\Program Files\MATLAB\R2014a\extern\include” (Matlab was installed on drive d:\)
  2. Matlab library path: „D:\Program Files\MATLAB\R2014a\extern\lib\win64\microsoft”
  3. CUDA toolkit include path: „C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include”
  4. GPU compute capability, mine is 1.2 in case of GeForce GT 330M(compute_12,sm_12), and 3.0 in case GeForce GTX 690 (compute_30,sm_30)


Changes made in projects properties (the same steps are for both projects: cuSVMPredict, cuSVMTrain):

  1. Open solution in VS 2010
  2. Right click on project (cuSVMTrain or cuSVMPredict)  and choose „Build Customizations …”, make sure that „CUDA 5.0(.targets, .props)” is checked
  3. Right click oncuSVMTrain and choose project „Properties”
    1. Expand „Configuration Properties”
      1. General->Target Extension: .mexw64
      2. General->Configuration Type: Dynamic Library (.dll)
    2. Expand c/c++-
      1. General->Additional Include Directories: $(SolutionDir)inc\;D:\Program Files\MATLAB\R2014a\extern\include;$(CudaToolkitIncludeDir);%(AdditionalIncludeDirectories)
    3. ExpandCUDA C/C++
      1. Common->Additional Include Directories: $(SolutionDir)inc\;D:\Program Files\MATLAB\R2014a\extern\include;$(CudaToolkitIncludeDir);%(AdditionalIncludeDirectories)
      2. Common->Target Machine Platform: 64-bit (–machine 64)
      3. Device->Code Generation: compute_30,sm_30– this depends on your GPU compute capability
    4. Expand Linker
      1. General->Additional Library Directories: %(AdditionalLibraryDirectories); $(CudaToolkitLibDir); D:\Program Files\MATLAB\R2014a\extern\lib\win64\microsoft
      2. Input->Additional Dependencies: cuda.lib;cublas.lib;libmex.lib;libmat.lib;libmx.lib;cudart.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)
      3. Input->Module Definition File: TrainModule.def (for cuSVMTrain project, for cuSVMPredict set PredictModule.def)
    5. Expand Build Events
      1. Post-Build Event->Command Line:
        echo copy „$(CudaToolkitBinDir)\cudart*.dll” „$(OutDir)”
        copy „$(CudaToolkitBinDir)\cudart*.dll” „$(OutDir)”
        each command in separate line

Eventually you can check if it is „Release” or „Debug” build.


How to use cuSVM

The zip package contains two folders:

  • cuSVM – Visual Studio 2012 solution
  • cuSVMmatlab – contains:
  1. libsvm,
  2. compile cuSVMTrain.mexw64 and cuSVMPredict.mexw64 in Libfolder,
  3. sample datasets in datafolder
  4. matlab script cuSVMTest.m
  1. Build cuSVM in Release or Debug mode – important check your GPU compute capability
  2. Copy cuSVMTrain.mexw64 and cuSVMPredict.mexw64 to Lib folder
  3. Add Libfolder matlab search path.
  4. If you want classify some dataset open  mfile.




Posted in Computing Technology, GPU (CUDA), GPU Accelareted, Image Processing | Tagged: , | Leave a Comment »

Integral Histogram for fast HoG feature calculation

Posted by Hemprasad Y. Badgujar on October 12, 2014

Histograms of Oriented Gradients or HOG features in combination with a support vector machine have been successfully used for object Detection (most popularly pedestrian detection).
An Integral Histogram representation can be used for fast calculation of Histograms of Oriented Gradients over arbitrary rectangular regions of the image. The idea of an integral histogram is analogous to that of an integral image, used by viola and jones for fast calculation of haar features for face detection. Mathematically,

where b represents the bin number of the histogram. This way the calculation of hog over any arbitrary rectangle in the image requires just 4*bins number of array references. For more details on integral histogram representation, please refer,

Integral Histogram

The following demonstrates how such integral histogram can be calculated from an image and used for the calculation of hog features using the opencv computer vision library :

/*Function to calculate the integral histogram*/

IplImage** calculateIntegralHOG(IplImage* in)


/*Convert the input image to grayscale*/

IplImage* img_gray = cvCreateImage(cvGetSize(in), IPL_DEPTH_8U,1);
cvCvtColor(in, img_gray, CV_BGR2GRAY);

/* Calculate the derivates of the grayscale image in the x and y directions using a sobel operator and obtain 2 gradient images for the x and y directions*/

IplImage *xsobel, *ysobel;
xsobel = doSobel(img_gray, 1, 0, 3);
ysobel = doSobel(img_gray, 0, 1, 3);

/* Create an array of 9 images (9 because I assume bin size 20 degrees and unsigned gradient ( 180/20 = 9), one for each bin which will have zeroes for all pixels, except for the pixels in the original image for which the gradient values correspond to the particular bin. These will be referred to as bin images. These bin images will be then used to calculate the integral histogram, which will quicken the calculation of HOG descriptors */

IplImage** bins = (IplImage**) malloc(9 * sizeof(IplImage*));
for (int i = 0; i < 9 ; i++) {
bins[i] = cvCreateImage(cvGetSize(in), IPL_DEPTH_32F,1);

/* Create an array of 9 images ( note the dimensions of the image, the cvIntegral() function requires the size to be that), to store the integral images calculated from the above bin images. These 9 integral images together constitute the integral histogram */

IplImage** integrals = (IplImage**) malloc(9 * sizeof(IplImage*)); for (int i = 0; i < 9 ; i++) {
integrals[i] = cvCreateImage(cvSize(in->width + 1, in->height + 1),

/* Calculate the bin images. The magnitude and orientation of the gradient at each pixel is calculated using the xsobel and ysobel images.{Magnitude = sqrt(sq(xsobel) + sq(ysobel) ), gradient = itan (ysobel/xsobel) }. Then according to the orientation of the gradient, the value of the corresponding pixel in the corresponding image is set */

int x, y;
float temp_gradient, temp_magnitude;
for (y = 0; y <in->height; y++) {

/* ptr1 and ptr2 point to beginning of the current row in the xsobel and ysobel images respectively. ptrs[i] point to the beginning of the current rows in the bin images */

float* ptr1 = (float*) (xsobel->imageData + y * (xsobel->widthStep));
float* ptr2 = (float*) (ysobel->imageData + y * (ysobel->widthStep));
float** ptrs = (float**) malloc(9 * sizeof(float*));
for (int i = 0; i < 9 ;i++){
ptrs[i] = (float*) (bins[i]->imageData + y * (bins[i]->widthStep));

/*For every pixel in a row gradient orientation and magnitude are calculated and corresponding values set for the bin images. */

for (x = 0; x <in->width; x++) {

/* if the xsobel derivative is zero for a pixel, a small value is added to it, to avoid division by zero. atan returns values in radians, which on being converted to degrees, correspond to values between -90 and 90 degrees. 90 is added to each orientation, to shift the orientation values range from {-90-90} to {0-180}. This is just a matter of convention. {-90-90} values can also be used for the calculation. */

if (ptr1[x] == 0){
temp_gradient = ((atan(ptr2[x] / (ptr1[x] + 0.00001))) * (180/ PI)) + 90;
temp_gradient = ((atan(ptr2[x] / ptr1[x])) * (180 / PI)) + 90;
temp_magnitude = sqrt((ptr1[x] * ptr1[x]) + (ptr2[x] * ptr2[x]));

/*The bin image is selected according to the gradient values. The corresponding pixel value is made equal to the gradient magnitude at that pixel in the corresponding bin image */

if (temp_gradient <= 20) {
ptrs[0][x] = temp_magnitude;
else if (temp_gradient <= 40) {
ptrs[1][x] = temp_magnitude;
else if (temp_gradient <= 60) {
ptrs[2][x] = temp_magnitude;
else if (temp_gradient <= 80) {
ptrs[3][x] = temp_magnitude;
else if (temp_gradient <= 100) {
ptrs[4][x] = temp_magnitude;
else if (temp_gradient <= 120) {
ptrs[5][x] = temp_magnitude;
else if (temp_gradient <= 140) {
ptrs[6][x] = temp_magnitude;
else if (temp_gradient <= 160) {
ptrs[7][x] = temp_magnitude;
else {
ptrs[8][x] = temp_magnitude;


/*Integral images for each of the bin images are calculated*/

for (int i = 0; i <9 ; i++){
cvIntegral(bins[i], integrals[i]);

for (int i = 0; i <9 ; i++){

/*The function returns an array of 9 images which consitute the integral histogram*/

return (integrals);


The following demonstrates how the integral histogram calculated using the above function can be used to calculate the histogram of oriented gradients for any rectangular region in the image:

/* The following function takes as input the rectangular cell for which the histogram of oriented gradients has to be calculated, a matrix hog_cell of dimensions 1×9 to store the bin values for the histogram, the integral histogram, and the normalization scheme to be used. No normalization is done if normalization = -1 */

void calculateHOG_rect(CvRect cell, CvMat* hog_cell,
IplImage** integrals, int normalization) {

/* Calculate the bin values for each of the bin of the histogram one by one */

for (int i = 0; i < 9 ; i++){

float a =((double*)(integrals[i]->imageData + (cell.y)
* (integrals[i]->widthStep)))[cell.x];
float b = ((double*) (integrals[i]->imageData + (cell.y + cell.height)
* (integrals[i]->widthStep)))[cell.x + cell.width];
float c = ((double*) (integrals[i]->imageData + (cell.y)
* (integrals[i]->widthStep)))[cell.x + cell.width];
float d = ((double*) (integrals[i]->imageData + (cell.y + cell.height)
* (integrals[i]->widthStep)))[cell.x];

((float*) hog_cell->data.fl)[i] = (a + b) – (c + d);


/*Normalize the matrix*/
if (normalization != -1){
cvNormalize(hog_cell, hog_cell, 1, 0, normalization);


I had described how the HOG features for pedestrian detection can be obtained using the above framework and how SVM can be trained for such features for pedestrian detection

/*This function takes in a the path and names of
64x128 pixel images, the size of the cell to be
used for calculation of hog features(which should
be 8x8 pixels, some modifications will have to be 
done in the code for a different cell size, which
could be easily done once the reader understands
how the code works), a default block size of 2x2
cells has been considered and the window size
parameter should be 64x128 pixels (appropriate
modifications can be easily done for other say
64x80 pixel window size). All the training images
are expected to be stored at the same location and
the names of all the images are expected to be in
sequential order like a1.jpg, a2.jpg, a3.jpg ..
and so on or a(1).jpg, a(2).jpg, a(3).jpg ... The
explanation of all the parameters below will make
clear the usage of the function. The synopsis of
the function is as follows :

prefix : it should be the path of the images, along
with the prefix in the image name for
example if the present working directory is
/home/saurabh/hog/ and the images are in
/home/saurabh/hog/images/positive/ and are
named like pos1.jpg, pos2.jpg, pos3.jpg ....,
then the prefix parameter would be
"images/positive/pos" or if the images are
named like pos(1).jpg, pos(2).jpg,
pos(3).jpg ... instead, the prefix parameter
would be "images/positive/pos("

suffix : it is the part of the name of the image
files after the number for example for the
above examples it would be ".jpg" or ").jpg"

cell   : it should be CvSize(8,8), appropriate changes
need to be made for other cell sizes

window : it should be CvSize(64,128), appropriate
changes need to be made for other window sizes

number_samples : it should be equal to the number of
training images, for example if the
training images are pos1.jpg, pos2.jpg
..... pos1216.jpg, then it should be

start_index : it should be the start index of the images'
names for example for the above case it
should be 1 or if the images were named
like pos1000.jpg, pos1001.jpg, pos1002.jpg
.... pos2216.jpg, then it should be 1000

end_index : it should be the end index of the images'
name for example for the above cases it
should be 1216 or 2216

savexml   : if you want to store the extracted features,
then you can pass to it the name of an xml
file to which they should be saved

normalization : the normalization scheme to be used for
computing the hog features, any of the
opencv schemes could be passed or -1
could be passed if no normalization is
to be done */

CvMat* train_64x128(char *prefix, char *suffix, CvSize cell,
CvSize window, int number_samples, int start_index,
int end_index, char *savexml = NULL, int canny = 0,
int block = 1, int normalization = 4) 

char filename[50] = "", number[8];
int prefix_length;
prefix_length = strlen(prefix);
int bins = 9;

/* A default block size of 2x2 cells is considered */

int block_width = 2, block_height = 2;

/* Calculation of the length of a feature vector for
an image (64x128 pixels)*/

int feature_vector_length;
feature_vector_length = (((window.width -
cell.width * block_width)/ cell.width) + 1) *
(((window.height - cell.height * block_height)
/ cell.height) + 1) * 36;

/* Matrix to store the feature vectors for
all(number_samples) the training samples */

CvMat* training = cvCreateMat(number_samples,
feature_vector_length, CV_32FC1);

CvMat row;
CvMat* img_feature_vector;
IplImage** integrals;
int i = 0, j = 0;

printf("Beginning to extract HoG features from
positive images\n");

strcat(filename, prefix);

/* Loop to calculate hog features for each
image one by one */

for (i = start_index; i <= end_index; i++) 
cvtInt(number, i);
strcat(filename, number);
strcat(filename, suffix);
IplImage* img = cvLoadImage(filename);

/* Calculation of the integral histogram for
fast calculation of hog features*/

integrals = calculateIntegralHOG(img);
cvGetRow(training, &row, j);
= calculateHOG_window(integrals, cvRect(0, 0,
window.width, window.height), normalization);
cvCopy(img_feature_vector, &row);
printf("%s\n", filename);
filename[prefix_length] = '';
for (int k = 0; k < 9; k++) 
if (savexml != NULL) 
cvSave(savexml, training);

return training;

/* This function is almost the same as
train_64x128(...), except the fact that it can
take as input images of bigger sizes and
generate multiple samples out of a single

It takes 2 more parameters than
train_64x128(...), horizontal_scans and
vertical_scans to determine how many samples
are to be generated from the image. It
generates horizontal_scans x vertical_scans
number of samples. The meaning of rest of the
parameters is same.

For example for a window size of
64x128 pixels, if a 320x240 pixel image is
given input with horizontal_scans = 5 and
vertical scans = 2, then it will generate to
samples by considering windows in the image
with (x,y,width,height) as (0,0,64,128),
(64,0,64,128), (128,0,64,128), .....,
(0,112,64,128), (64,112,64,128) .....

The function takes non-overlapping windows
from the image except the last row and last
column, which could overlap with the second
last row or second last column. So the values
of horizontal_scans and vertical_scans passed
should be such that it is possible to perform
that many scans in a non-overlapping fashion
on the given image. For example horizontal_scans
= 5 and vertical_scans = 3 cannot be passed for
a 320x240 pixel image as that many vertical scans
are not possible for an image of height 240
pixels and window of height 128 pixels. */

CvMat* train_large(char *prefix, char *suffix,
CvSize cell, CvSize window, int number_images,
int horizontal_scans, int vertical_scans,
int start_index, int end_index,
char *savexml = NULL, int normalization = 4)
char filename[50] = "", number[8];
int prefix_length;
prefix_length = strlen(prefix);
int bins = 9;

/* A default block size of 2x2 cells is considered */

int block_width = 2, block_height = 2;

/* Calculation of the length of a feature vector for
an image (64x128 pixels)*/

int feature_vector_length;
feature_vector_length = (((window.width -
cell.width * block_width) / cell.width) + 1) *
(((window.height - cell.height * block_height)
/ cell.height) + 1) * 36;

/* Matrix to store the feature vectors for
all(number_samples) the training samples */

CvMat* training = cvCreateMat(number_images
* horizontal_scans * vertical_scans,
feature_vector_length, CV_32FC1);

CvMat row;
CvMat* img_feature_vector;
IplImage** integrals;
int i = 0, j = 0;
strcat(filename, prefix);

printf("Beginning to extract HoG features
from negative images\n");

/* Loop to calculate hog features for each
image one by one */

for (i = start_index; i <= end_index; i++) 
cvtInt(number, i);
strcat(filename, number);
strcat(filename, suffix);
IplImage* img = cvLoadImage(filename);
integrals = calculateIntegralHOG(img);
for (int l = 0; l < vertical_scans - 1; l++)
for (int k = 0; k < horizontal_scans - 1; k++)
cvGetRow(training, &row, j);
img_feature_vector = calculateHOG_window(
integrals, cvRect(window.width * k,
window.height * l, window.width,
window.height), normalization);

cvCopy(img_feature_vector, &row);

cvGetRow(training, &row, j);

img_feature_vector = calculateHOG_window(
integrals, cvRect(img->width - window.width,
window.height * l, window.width,
window.height), normalization);

cvCopy(img_feature_vector, &row);

for (int k = 0; k < horizontal_scans - 1; k++)
cvGetRow(training, &row, j);

img_feature_vector = calculateHOG_window(
integrals, cvRect(window.width * k,
img->height - window.height, window.width,
window.height), normalization);

cvCopy(img_feature_vector, &row);
cvGetRow(training, &row, j);

img_feature_vector = calculateHOG_window(integrals,
cvRect(img->width - window.width, img->height -
window.height, window.width, window.height),

cvCopy(img_feature_vector, &row);

printf("%s\n", filename);
filename[prefix_length] = '';
for (int k = 0; k < 9; k++)



printf("%d negative samples created \n",

if (savexml != NULL)
cvSave(savexml, training);
printf("Negative samples saved as %s\n",

return training;


/* This function trains a linear support vector
machine for object classification. The synopsis is
as follows :

pos_mat : pointer to CvMat containing hog feature
vectors for positive samples. This may be
NULL if the feature vectors are to be read
from an xml file

neg_mat : pointer to CvMat containing hog feature
vectors for negative samples. This may be
NULL if the feature vectors are to be read
from an xml file

savexml : The name of the xml file to which the learnt
svm model should be saved

pos_file: The name of the xml file from which feature
vectors for positive samples are to be read.
It may be NULL if feature vectors are passed
as pos_mat

neg_file: The name of the xml file from which feature
vectors for negative samples are to be read.
It may be NULL if feature vectors are passed
as neg_mat*/

void trainSVM(CvMat* pos_mat, CvMat* neg_mat, char *savexml,
char *pos_file = NULL, char *neg_file = NULL) 

/* Read the feature vectors for positive samples */
if (pos_file != NULL) 
printf("positive loading...\n");
pos_mat = (CvMat*) cvLoad(pos_file);
printf("positive loaded\n");

/* Read the feature vectors for negative samples */
if (neg_file != NULL)
neg_mat = (CvMat*) cvLoad(neg_file);
printf("negative loaded\n");

int n_positive, n_negative;
n_positive = pos_mat->rows;
n_negative = neg_mat->rows;
int feature_vector_length = pos_mat->cols;
int total_samples;
total_samples = n_positive + n_negative;

CvMat* trainData = cvCreateMat(total_samples,
feature_vector_length, CV_32FC1);

CvMat* trainClasses = cvCreateMat(total_samples,
1, CV_32FC1 );

CvMat trainData1, trainData2, trainClasses1,

printf("Number of positive Samples : %d\n",

/*Copy the positive feature vectors to training

cvGetRows(trainData, &trainData1, 0, n_positive);
cvCopy(pos_mat, &trainData1);

/*Copy the negative feature vectors to training

cvGetRows(trainData, &trainData2, n_positive,

cvCopy(neg_mat, &trainData2);

printf("Number of negative Samples : %d\n",

/*Form the training classes for positive and
negative samples. Positive samples belong to class
1 and negative samples belong to class 2 */

cvGetRows(trainClasses, &trainClasses1, 0, n_positive);
cvSet(&trainClasses1, cvScalar(1));

cvGetRows(trainClasses, &trainClasses2, n_positive,

cvSet(&trainClasses2, cvScalar(2));

/* Train a linear support vector machine to learn from
the training data. The parameters may played and
experimented with to see their effects*/

CvSVM svm(trainData, trainClasses, 0, 0,
CvSVMParams(CvSVM::C_SVC, CvSVM::LINEAR, 0, 0, 0, 2,
0, 0, 0, cvTermCriteria(CV_TERMCRIT_EPS,0, 0.01)));

printf("SVM Training Complete!!\n");

/*Save the learnt model*/

if (savexml != NULL) {;



Posted in CUDA, GPU Accelareted, Image Processing, OpenCV, OpenCV, OpenCV Tutorial, PARALLEL | Tagged: , , , , | 1 Comment »

OpenCV installation for Ubuntu 12.04

Posted by Hemprasad Y. Badgujar on May 11, 2014


OpenCV (open source computer vision) is released under a BSD license and hence it’s free for both academic and commercial use. It has C++, C, Python and Java interfaces and supports Ubuntu Linux. OpenCV was designed for computational efficiency and with a strong focus on real-time applications.

OpenCV is the most popular and advanced code library for Computer Vision related applications today, spanning from many very basic tasks (capture and pre-processing of image data) to high-level algorithms (feature extraction, motion tracking, machine learning). It is free software and provides a rich API in C, C++, Java and Python. Other wrappers are available. The library itself is platform-independent and often used for real-time image processing and computer vision.


Many people are having problem with installing OpenCV even from Ubuntu Software Centre. Here a simple .sh script file get all dependancy files from internet and compile the source finally install opencv on your system. So that users can easily write their CV files from C,C++, and Python

Step 1

Download the latest from or Copy the following script to gedit and save as

version="$(wget -q -O - | egrep -m1 -o '\"[0-9](\.[0-9])+' | cut -c2-)"
 echo "Installing OpenCV" $version
 mkdir OpenCV
cd OpenCV
echo "Removing any pre-installed ffmpeg and x264"
sudo apt-get -qq remove ffmpeg x264 libx264-dev
echo "Installing Dependenices"

sudo apt-get -qq install libopencv-dev build-essential checkinstall cmake pkg-config yasm libjpeg-dev libjasper-dev libavcodec-dev libavformat-dev libswscale-dev libdc1394-22-dev libxine-dev libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev libv4l-dev python-dev python-numpy libtbb-dev libqt4-dev libgtk2.0-dev libfaac-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev libtheora-dev libvorbis-dev libxvidcore-dev x264 v4l-utils ffmpeg

sudo apt-get install libgstreamer0.10-0 libgstreamer0.10-dev gstreamer0.10-tools gstreamer0.10-plugins-base libgstreamer-plugins-base0.10-dev gstreamer0.10-plugins-good gstreamer0.10-plugins-ugly gstreamer0.10-plugins-bad gstreamer0.10-ffmpeg

sudo apt-get install vlc vlc-dbg vlc-data libvlccore5 libvlc5 libvlccore-dev libvlc-dev tbb-examples libtbb-doc libtbb2 libtbb-dev libxine1-bin libxine1-ffmpeg libxine-dev 

sudo apt-get install libmysqlcppconn-dev
echo "Downloading OpenCV" $version wget -O OpenCV-$$version/opencv-"$version".zip/download echo "Installing OpenCV" $version unzip OpenCV-$ cd opencv-$version mkdir build cd build cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_TBB=ON -D BUILD_NEW_PYTHON_SUPPORT=ON -D WITH_V4L=ON -D INSTALL_C_EXAMPLES=ON -D INSTALL_PYTHON_EXAMPLES=ON -D BUILD_EXAMPLES=ON -D WITH_QT=ON -D WITH_OPENGL=ON .. make -j 4 sudo checkinstall sudo sh -c 'echo "/usr/local/lib" > /etc/' sudo ldconfig echo "OpenCV" $version "ready to be used"

Step 2

Open terminal.

   1 $ chmod +x
   2 $ ./

This will complete opencv installation

Running OpenCV



Loading an image in Python

Toggle line numbers

   1 from import *
   3 img = LoadImage("/home/USER/Pictures/python.jpg")
   4 NamedWindow("opencv")
   5 ShowImage("opencv",img)
   6 WaitKey(0)


   1 $ python


in C

Loading an image file in C

Toggle line numbers

   1 #include
   2 #include<opencv2/highgui/highgui.hpp>
   4 int main()
   5 {
   6     IplImage* img = cvLoadImage("/home/USER/Pictures/python.jpg",CV_LOAD_IMAGE_COLOR);
   7     cvNamedWindow("opencvtest",CV_WINDOW_AUTOSIZE);
   8     cvShowImage("opencvtest",img);
   9     cvWaitKey(0);
  10     cvReleaseImage(&img);
  11     return 0;
  12 }


To compile C program, Let’s assume the file is opencvtest.c

   1 $ gcc -ggdb `pkg-config --cflags opencv` -o `basename opencvtest.c .c` opencvtest.c `pkg-config --libs opencv`
   2 $ ./opencvtest


In C++


Loading an image file in C++

Toggle line numbers

   1 #include<opencv2/highgui/highgui.hpp>
   2 using namespace cv;
   4 int main()
   5 {
   7     Mat img = imread("/home/USER/Pictures/python.jpg",CV_LOAD_IMAGE_COLOR);
   8     imshow("opencvtest",img);
   9     waitKey(0);
  11     return 0;
  12 }


to compile in C++

   1 $ g++ -ggdb `pkg-config --cflags opencv` -o `basename opencvtest.cpp .cpp` opencvtest.cpp `pkg-config --libs opencv`
   2 $ ./opencvtest


Note: Always include OpenCV header files in C and C++ as

   1 #include "opencv2/core/core_c.h"
   2 #include "opencv2/core/core.hpp"
   3 #include "opencv2/flann/miniflann.hpp"
   4 #include "opencv2/imgproc/imgproc_c.h"
   5 #include "opencv2/imgproc/imgproc.hpp"
   6 #include "opencv2/video/video.hpp"
   7 #include "opencv2/features2d/features2d.hpp"
   8 #include "opencv2/objdetect/objdetect.hpp"
   9 #include "opencv2/calib3d/calib3d.hpp"
  10 #include "opencv2/ml/ml.hpp"
  11 #include "opencv2/highgui/highgui_c.h"
  12 #include "opencv2/highgui/highgui.hpp"
  13 #include "opencv2/contrib/contrib.hpp"


A bash script to compile opencv programs.Making a Bash Script to Compile OpenCV:

It’s kind of boring typing all this stuff. So, A bash file to compile OpenCV programs. Name it and keep it in your home directory.

   1 #!/bin/bash
   2 echo "compiling $1"
   3 if [[ $1 == *.c ]]
   4 then
   5 gcc -ggdb `pkg-config --cflags opencv` -o `basename $1 .c` $1 `pkg-config --libs opencv`;
   6 elif [[ $1 == *.cpp ]]
   7 then
   8 g++ -ggdb `pkg-config --cflags opencv` -o `basename $1 .cpp` $1 `pkg-config --libs opencv`;
   9 else
  10 echo "Please compile only .c or .cpp files"
  11 fi
  12 echo "Output file => ${1%.*}"


Add an alias in .bashrc or .bash_aliases

   1 $ alias opencv="~/"
   2 $ opencv opencvtest.c
   3 $ ./opencvtest


Posted in Computer Vision, GPU Accelareted, Installation, Linux OS, Linux OS, OpenCV, OpenCV, Operating Systems, PARALLEL, UNIX OS | Leave a Comment »

Extracts from a Personal Diary

dedicated to the life of a silent girl who eventually learnt to open up

Num3ri v 2.0

I miei numeri - seconda versione


Just another site

Algunos Intereses de Abraham Zamudio Chauca

Matematica, Linux , Programacion Serial , Programacion Paralela (CPU - GPU) , Cluster de Computadores , Software Cientifico




A great site

Travel tips

Travel tips

Experience the real life.....!!!

Shurwaat achi honi chahiye ...

Ronzii's Blog

Just your average geek's blog

Karan Jitendra Thakkar

Everything I think. Everything I do. Right here.

Chetan Solanki

Helpful to u, if u need it.....


Explorer of Research #HEMBAD


Explorer of Research #HEMBAD


A great site


This is My Space so Dont Mess With IT !!

%d bloggers like this: