Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘Project Related’ Category

Project Related (CUDA /Open CL & DirectX / OpenGL)

Deep Learning Software/ Framework links

Posted by Hemprasad Y. Badgujar on July 15, 2016

  1. Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal)
  2. Torch – provides a Matlab-like environment for state-of-the-art machine learning algorithms in lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu)
  3. Pylearn2 – Pylearn2 is a library designed to make machine learning research easy.
  4. Blocks – A Theano framework for training neural networks
  5. Tensorflow – TensorFlow™ is an open source software library for numerical computation using data flow graphs.
  6. MXNet – MXNet is a deep learning framework designed for both efficiency and flexibility.
  7. Caffe -Caffe is a deep learning framework made with expression, speed, and modularity in mind.Caffe is a deep learning framework made with expression, speed, and modularity in mind.
  8. Lasagne – Lasagne is a lightweight library to build and train neural networks in Theano.
  9. Keras– A theano based deep learning library.
  10. Deep Learning Tutorials – examples of how to do Deep Learning with Theano (from LISA lab at University of Montreal)
  11. Chainer – A GPU based Neural Network Framework
  12. DeepLearnToolbox – A Matlab toolbox for Deep Learning (from Rasmus Berg Palm)
  13. Cuda-Convnet – A fast C++/CUDA implementation of convolutional (or more generally, feed-forward) neural networks. It can model arbitrary layer connectivity and network depth. Any directed acyclic graph of layers will do. Training is done using the back-propagation algorithm.
  14. Deep Belief Networks. Matlab code for learning Deep Belief Networks (from Ruslan Salakhutdinov).
  15. RNNLM– Tomas Mikolov’s Recurrent Neural Network based Language models Toolkit.
  16. RNNLIB-RNNLIB is a recurrent neural network library for sequence learning problems. Applicable to most types of spatiotemporal data, it has proven particularly effective for speech and handwriting recognition.
  17. matrbm. Simplified version of Ruslan Salakhutdinov’s code, by Andrej Karpathy (Matlab).
  18. deeplearning4j– Deeplearning4J is an Apache 2.0-licensed, open-source, distributed neural net library written in Java and Scala.
  19. Estimating Partition Functions of RBM’s. Matlab code for estimating partition functions of Restricted Boltzmann Machines using Annealed Importance Sampling (from Ruslan Salakhutdinov).
  20. Learning Deep Boltzmann Machines Matlab code for training and fine-tuning Deep Boltzmann Machines (from Ruslan Salakhutdinov).
  21. The LUSH programming language and development environment, which is used @ NYU for deep convolutional networks
  22. Eblearn.lsh is a LUSH-based machine learning library for doing Energy-Based Learning. It includes code for “Predictive Sparse Decomposition” and other sparse auto-encoder methods for unsupervised learning. Koray Kavukcuoglu provides Eblearn code for several deep learning papers on thispage.
  23. deepmat– Deepmat, Matlab based deep learning algorithms.
  24. MShadow – MShadow is a lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA. The goal of mshadow is to support efficient, device invariant and simple tensor library for machine learning project that aims for both simplicity and performance. Supports CPU/GPU/Multi-GPU and distributed system.
  25. CXXNET – CXXNET is fast, concise, distributed deep learning framework based on MShadow. It is a lightweight and easy extensible C++/CUDA neural network toolkit with friendly Python/Matlab interface for training and prediction.
  26. Nengo-Nengo is a graphical and scripting based software package for simulating large-scale neural systems.
  27. Eblearn is a C++ machine learning library with a BSD license for energy-based learning, convolutional networks, vision/recognition applications, etc. EBLearn is primarily maintained by Pierre Sermanet at NYU.
  28. cudamat is a GPU-based matrix library for Python. Example code for training Neural Networks and Restricted Boltzmann Machines is included.
  29. Gnumpy is a Python module that interfaces in a way almost identical to numpy, but does its computations on your computer’s GPU. It runs on top of cudamat.
  30. The CUV Library (github link) is a C++ framework with python bindings for easy use of Nvidia CUDA functions on matrices. It contains an RBM implementation, as well as annealed importance sampling code and code to calculate the partition function exactly (from AIS lab at University of Bonn).
  31. 3-way factored RBM and mcRBM is python code calling CUDAMat to train models of natural images (from Marc’Aurelio Ranzato).
  32. Matlab code for training conditional RBMs/DBNs and factored conditional RBMs (from Graham Taylor).
  33. mPoT is python code using CUDAMat and gnumpy to train models of natural images (from Marc’Aurelio Ranzato).
  34. neuralnetworks is a java based gpu library for deep learning algorithms.
  35. ConvNet is a matlab based convolutional neural network toolbox.
  36. Elektronn is a deep learning toolkit that makes powerful neural networks accessible to scientists outside the machine learning community.
  37. OpenNN is an open source class library written in C++ programming language which implements neural networks, a main area of deep learning research.
  38. NeuralDesigner  is an innovative deep learning tool for predictive analytics.
  39. Theano Generalized Hebbian Learning.

Posted in C, Computing Technology, CUDA, Deep Learning, GPU (CUDA), JAVA, OpenCL, PARALLEL, PHP, Project Related | Leave a Comment »

Databases for Multi-camera , Network Camera , E-Surveillace

Posted by Hemprasad Y. Badgujar on February 18, 2016

Multi-view, Multi-Class Dataset: pedestrians, cars and buses

This dataset consists of 23 minutes and 57 seconds of synchronized frames taken at 25fps from 6 different calibrated DV cameras.
One camera was placed about 2m high of the ground, two others where located on a first floor high, and the rest on a second floor to cover an area of 22m x 22m.
The sequence was recorded at the EPFL university campus where there is a road with a bus stop, parking slots for cars and a pedestrian crossing.


Ground truth images
Ground truth annotations


The dataset on this page has been used for our multiview object pose estimation algorithm described in the following paper:

G. Roig, X. Boix, H. Ben Shitrit and P. Fua Conditional Random Fields for Multi-Camera Object Detection, ICCV11.

Multi-camera pedestrians video

“EPFL” data set: Multi-camera Pedestrian Videos

people tracking
results, please cite one of the references below.

On this page you can download a few multi-camera sequences that we acquired for developing and testing our people detection and tracking framework. All of the sequences feature several synchronised video streams filming the same area under different angles. All cameras are located about 2 meters from the ground. All pedestrians on the sequences are members of our laboratory, so there is no privacy issue. For the Basketball sequence, we received consent from the team.

Laboratory sequences

These sequences were shot inside our laboratory by 4 cameras. Four (respectively six) people are sequentially entering the room and walking around for 2 1/2 minutes. The frame rate is 25 fps and the videos are encoded using MPEG-4 codec.

[Camera 0] [Camera 1] [Camera 2] [Camera 3]

Calibration file for the 4 people indoor sequence.

[Camera 0] [Camera 1] [Camera 2] [Camera 3]

Calibration file for the 6 people indoor sequence.

Campus sequences

These two sequences called campus were shot outside on our campus with 3 DV cameras. Up to four people are simultaneously walking in front of them. The white line on the screenshots shows the limits of the area that we defined to obtain our tracking results. The frame rate is 25 fps and the videos are encoded using Indeo 5 codec.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2]
[Seq.2, cam. 0] [Seq.2, cam. 1] [Seq.2, cam. 2]

Calibration file for the two above outdoor scenes.

Terrace sequences

The sequences below, called terrace, were shot outside our building on a terrace. Up to 7 people evolve in front of 4 DV cameras, for around 3 1/2 minutes. The frame rate is 25 fps and the videos are encoded using Indeo 5 codec.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]
[Seq.2, cam. 0] [Seq.2, cam. 1] [Seq.2, cam. 2] [Seq.1, cam. 3]

Calibration file for the terrace scene.

Passageway sequence

This sequence dubbed passageway was filmed in an underground passageway to a train station. It was acquired with 4 DV cameras at 25 fps, and is encoded with Indeo 5. It is a rather difficult sequence due to the poor lighting.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]

Calibration file for the passageway scene.

Basketball sequence

This sequence was filmed at a training session of a local basketball team. It was acquired with 4 DV cameras at 25 fps, and is encoded with Indeo 5.

[Seq.1, cam. 0] [Seq.1, cam. 1] [Seq.1, cam. 2] [Seq.1, cam. 3]

Calibration file for the basketball scene.

Camera calibration

POM only needs a simple calibration consisting of two homographies per camera view, which project the ground plane in top view to the ground plane in camera views and to the head plane in camera views (a plane parallel to the ground plane but located 1.75 m higher). Therefore, the calibration files given above consist of 2 homographies per camera. In degenerate cases where the camera is located inside the head plane, this one will project to a horizontal line in the camera image. When this happens, we do not provide a homography for the head plane, but instead we give the height of the line in which the head plane will project. This is expressed in percentage of the image height, starting from the top.

The homographies given in the calibration files project points in the camera views to their corresponding location on the top view of the ground plane, that is

H * X_image = X_topview .

We have also computed the camera calibration using the Tsai calibration toolkit for some of our sequences. We also make them available for download. They consist of an XML file per camera view, containing the standard Tsai calibration parameters. Note that the image size used for calibration might differ from the size of the video sequences. In this case, the image coordinates obtained with the calibration should be normalized to the size of the video.

Ground truth

We have created a ground truth data for some of the video sequences presented above, by locating and identifying the people in some frames at a regular interval.

To use these ground truth files, you must rely on the same calibration with the exact same parameters that we used when generating the data. We call top view the rectangular area of the ground plane in which we perform tracking.

This area is of dimensions tv_width x tv_height and has top left coordinate (tv_origin_x, tv_origin_y). Besides, we call grid our discretization of the top view area into grid_width x grid_height cells. An example is illustrated by the figure below, in which the grid has dimensions 5 x 4.

The people’s position in the ground truth are expressed in discrete grid coordinates. In order to be projected into the images with homographies or the Tsai calibration, these grid coordinates need to be translated into top view coordinates. We provide below a simple C function that performs this translation. This function takes the following parameters:

  • pos : the person position coming from the ground truth file
  • grid_width, grid_height : the grid dimension
  • tv_origin_x, tv_origin_y : the top left corner of the top view
  • tv_width, tv_height : the top view dimension
  • tv_x, tv_y : the top view coordinates, i.e. the output of the function
  void grid_to_tv(int pos, int grid_width, int grid_height,                  float tv_origin_x, float tv_origin_y, float tv_width,                  float tv_height, float &tv_x, float &tv_y) {     tv_x = ( (pos % grid_width) + 0.5 ) * (tv_width / grid_width) + tv_origin_x;    tv_y = ( (pos / grid_width) + 0.5 ) * (tv_height / grid_height) + tv_origin_y;  }

The table below summarizes the aforementionned parameters for the ground truth files we provide. Note that the ground truth for the terrace sequence has been generated with the Tsai calibration provided in the table. You will need to use this one to get a proper bounding box alignment.

Ground Truth Grid dimensions Top view origin Top view dimensions Calibration
6-people laboratory 56 x 56 (0 , 0) 358 x 360 file
terrace, seq. 1 30 x 44 (-500 , -1,500) 7,500 x 11,000 file (Tsai)
passageway, seq. 1 40 x 99 (0 , 38.48) 155 x 381 file

The format of the ground truth file is the following:

 1 <number of frames>  <number of people>  <grid width>  <grid height>  <step size>  <first frame>  <last frame> <pos> <pos> <pos> ... <pos> <pos> <pos> ... . . .

where <number of frames> is the total number of frames, <number of people> is the number of people for which we have produced a ground truth, <grid width> and <grid height>are the ground plane grid dimensions, <step size> is the frame interval between two ground truth labels (i.e. if set to 25, then there is a label once every 25 frames), and <first frame> and <last frame> are the first and last frames for which a label has been entered.

After the header, every line represents the positions of people at a given frame. <pos> is the position of a person in the grid. It is normally a integer >= 0, but can be -1 if undefined (i.e. no label has been produced for this frame) or -2 if the person is currently out of the grid.


Multiple Object Tracking using K-Shortest Paths Optimization

Jérôme Berclaz, François Fleuret, Engin Türetken, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence
pdf | show bibtex

Multi-Camera People Tracking with a Probabilistic Occupancy Map

François Fleuret, Jérôme Berclaz, Richard Lengagne, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence
pdf | show bibtex

MuHAVi: Multicamera Human Action Video Data

including selected action sequences with

MAS: Manually Annotated Silhouette Data

for the evaluation of human action recognition methods

Figure 1. The top view of the configuration of 8 cameras used to capture the actions in the blue action zone (which is marked with white tapes on the scene floor).

camera symbol

camera name

V1 Camera_1
V2 Camera_2
V3 Camera_3
V4 Camera_4
V5 Camera_5
V6 Camera_6
V7 Camera_7
V8 Camera_8

Table 1. Camera view names appearing in the MuHAVi data folders and the corresponding symbols used in Fig. 1.


On the table below, you can click on the links to download the data (JPG images) for the corresponding action

Important: We noted that some earlier versions of that earlier versions of MS Internet Explorer could not download files over 2GB size, so we recomment to use alternative browsers such as Firefox or Chrome.

Each tar file contains 7 folders corresponding to 7 actors (Person1 to Person7) each of which contains 8 folders corresponding to 8 cameras (Camera_1 to Camera_8). Image frames corresponding to every combination of action/actor/camera are named with image frame numbers starting from 00000001.jpg for simplicity. The video frame rate is 25 frames per second and the resolution of image frames (except for Camera_8) is 720 x 576 Pixels (columns x rows). The image resolution is 704 x 576 for Camera_8.

action class

action name

C1 WalkTurnBack 2.6GB
C2 RunStop 2.5GB
C3 Punch 3.0GB
C4 Kick 3.4GB
C5 ShotGunCollapse 4.3GB
C6 PullHeavyObject 4.5GB
C7 PickupThrowObject 3.0GB
C8 WalkFall 3.9GB
C9 LookInCar 4.6GB
C10 CrawlOnKnees 3.4GB
C11 WaveArms 2.2GB
C12 DrawGraffiti 2.7GB
C13 JumpOverFence 4.4GB
C14 DrunkWalk 4.0GB
C15 ClimbLadder 2.1GB
C16 SmashObject 3.3GB
C17 JumpOverGap 2.6GB

MIT Trajectory Data Set – Multiple Camera Views


MIT trajectory data set is for the research of activity analysis in multiple single camera view using the trajectories of objects as features. Object tracking is based on background subtraction using a Adaptive Gaussian Mixture model. There are totally four camera views. Trajectories in different camera views have been synchronized. The data can be downloaded from the following link,

MIT trajectory data set

Background image


Please cite as:

X. Wang, K. Tieu and E. Grimson, Correspondence‐Free Activity Analysis and Scene Modeling in Multiple Camera Views, IEEE Transactions on Pattern Analysis and Machine Intelligence(PAMI), Vol. 32, pp. 56-71, 2010..


MIT traffic data set is for research on activity analysis and crowded scenes. It includes a traffic video sequence of 90 minutes long. It is recorded by a stationary camera. The size of the scene is 720 by 480. It is divided into 20 clips and can be downloaded from the following links.

Ground Truth

In order to evaluate the performance of human detection on this data set, ground truth of pedestrians of some sampled frames are manually labeled. It can be downloaded below. A readme file provides the instructions of how to use it.
Ground truth of pedestrians


  1. Unsupervised Activity Perception in Crowded and Complicated scenes Using Hierarchical Bayesian Models
    X. Wang, X. Ma and E. Grimson
    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 31, pp. 539-555, 2009
  2. Automatic Adaptation of a Generic Pedestrian Detector to a Specific Traffic Scene
    M. Wang and X. Wang
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011


This dataset is presented in our CVPR 2015 paper,
Linjie Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification, In Computer Vision and Pattern Recognition (CVPR), 2015. PDF

The Comprehensive Cars (CompCars) dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models. There are a total of 136,726 images capturing the entire cars and 27,618 images capturing the car parts. The full car images are labeled with bounding boxes and viewpoints. Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car. The surveillance-nature data contains 50,000 car images captured in the front view. Please refer to our paper for the details.

The dataset is well prepared for the following computer vision tasks:

  • Fine-grained classification
  • Attribute prediction
  • Car model verification

The train/test subsets of these tasks introduced in our paper are included in the dataset. Researchers are also welcome to utilize it for any other tasks such as image ranking, multi-task learning, and 3D reconstruction.


  1. You need to complete the release agreement form to download the dataset. Please see below.
  2. The CompCars database is available for non-commercial research purposes only.
  3. All images of the CompCars database are obtained from the Internet which are not property of MMLAB, The Chinese University of Hong Kong. The MMLAB is not responsible for the content nor the meaning of these images.
  4. You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the images and any portion of derived data.
  5. You agree not to further copy, publish or distribute any portion of the CompCars database. Except, for internal use at a single site within the same organization it is allowed to make copies of the database.
  6. The MMLAB reserves the right to terminate your access to the database at any time.
  7. All submitted papers or any publicly available text using the CompCars database must cite the following paper:
    Linjie Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification, In Computer Vision and Pattern Recognition (CVPR), 2015.

Download instructions

Download the CompCars dataset Release Agreement, read it carefully, and complete it appropriately. Note that the agreement should be signed by a full-time staff member (that is, student is not acceptable). Then, please scan the signed agreement and send it to Mr. Linjie Yang (yl012(at) and cc to Chen Change Loy (ccloy(at) We will verify your request and contact you on how to download the database.

Stanford Cars Dataset


       The Cars dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, where each class has been split roughly in a 50-50 split. Classes are typically at the level of Make, Model, Year, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe.


       Training images can be downloaded here.
Testing images can be downloaded here.
A devkit, including class labels for training images and bounding boxes for all images, can be downloaded here.
If you’re interested in the BMW-10 dataset, you can get that here.

Update: For ease of development, a tar of all images is available here and all bounding boxes and labels for both training and test are available here. If you were using the evaluation server before (which is still running), you can use test annotations here to evaluate yourself without using the server.


       An evaluation server has been set up here. Instructions for the submission format are included in the devkit. This dataset was featured as part of FGComp 2013, and competition results are directly comparable to results obtained from evaluating on images here.


       If you use this dataset, please cite the following paper:

3D Object Representations for Fine-Grained Categorization
Jonathan Krause, Michael Stark, Jia Deng, Li Fei-Fei
4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13). Sydney, Australia. Dec. 8, 2013.
[pdf]   [BibTex]   [slides]

Note that the dataset, as released, has 196 categories, one less than in the paper, as it has been cleaned up slightly since publication. Numbers should be more or less comparable, though.

The HDA dataset is a multi-camera high-resolution image sequence dataset for research on high-definition surveillance. 18 cameras (including VGA, HD and Full HD resolution) were recorded simultaneously during 30 minutes in a typical indoor office scenario at a busy hour (lunch time) involving more than 80 persons. In the current release (v1.1), 13 cameras have been fully labeled.


The venue spans three floors of the Institute for Systems and Robotics (ISR-Lisbon) facilities. The following pictures show the placement of the cameras. The 18 recorded cameras are identified with a small red circle. The 13 cameras with a coloured view field have been fully labeled in the current release (v1.1).


Each frame is labeled with the bounding boxes tightly adjusted to the visible body of the persons, the unique identification of each person, and flag bits indicating occlusion and crowd:

  • The bounding box is drawn so that it completely and tightly encloses the person.
  • If the person is occluded by something (except image boundaries), the bounding box is drawn by estimating the whole body extent.
  • People partially outside the image boundaries have their BB’s cropped to image limits. Partially occluded people and people partially outside the image boundaries are marked as ‘occluded’.
  • A unique ID is associated to each person, e.g., ‘person01’. In case of identity doubt, the special ID ‘personUnk’ is used.
  • Groups of people that are impossible to label individually are labelled collectively as ‘crowd’. People in front of a ’crowd’ area are labeled normally.

The following figures show examples of labeled frames: (a) an unoccluded person; (b) two occluded people; (c) a crowd with three people in front.


Data formats:

For each camera we provide the .jpg frames sequentially numbered and a .txt file containing the annotations according to the “video bounding box” (vbb) format defined in the Caltech Pedestrian Detection Database. Also on this site there are tools to visualise the annotations overlapped on the image frames.


Some statistics:

Labeled Sequences: 13

Number of Frames: 75207

Number of Bounding Boxes: 64028

Number of Persons: 85


Repository of Results:

We maintain a public repository of re-identification results in this dataset. Send us your CMC curve to be uploaded  (alex at isr ist utl pt).
Click here to see the full list and detailed experiments.

MANUAL_c_l_e_a_n cam60

Posted in Computer Network & Security, Computer Research, Computer Vision, Image Processing, Multimedia | Leave a Comment »

Bilateral Filtering

Posted by Hemprasad Y. Badgujar on September 14, 2015

Popular Filters

When smoothing or blurring images (the most popular goal of smoothing is to reduce noise), we can use diverse linear filters, because linear filters are easy to achieve, and are kind of fast, the most used ones are Homogeneous filter, Gaussian filter, Median filter, et al.

When performing a linear filter, we do nothing but output pixel’s value g(i,j)  which is determined as a weighted sum of input pixel values f(i+k, j+l):

g(i, j)=SUM[f(i+k, j+l) h(k, l)];

in which, h(k, l)) is called the kernel, which is nothing more than the coefficients of the filter.

Homogeneous filter is the most simple filter, each output pixel is the mean of its kernel neighbors ( all of them contribute with equal weights), and its kernel K looks like:


 Gaussian filter is nothing but using different-weight-kernel, in both x and y direction, pixels located in the middle would have bigger weight, and the weights decrease with distance from the neighborhood center, so pixels located on sides have smaller weight, its kernel K is something like (when kernel is 5*5):


Median filter is something that replace each pixel’s value with the median of its neighboring pixels. This method is great when dealing with “salt and pepper noise“.

Bilateral Filter

By using all the three above filters to smooth image, we not only dissolve noise, but also smooth edges, which make edges less sharper, even disappear. To solve this problem, we can use a filter called bilateral filter, which is an advanced version of Gaussian filter, it introduces another weight that represents how two pixels can be close (or similar) to one another in value, and by considering both weights in image,  Bilateral filter can keep edges sharp while blurring image.

Let me show you the process by using this image which have sharp edge.



Say we are smoothing this image (we can see noise in the image), and now we are dealing with the pixel at middle of the blue rect.

22   23

Left-above picture is a Gaussian kernel, and right-above picture is Bilateral filter kernel, which considered both weight.

We can also see the difference between Gaussian filter and Bilateral filter by these pictures:

Say we have an original image with noise like this



By using Gaussian filter, the image is smoother than before, but we can see the edge is no longer sharp, a slope appeared between white and black pixels.



However, by using Bilateral filter, the image is smoother, the edge is sharp, as well.


OpenCV code

It is super easy to make these kind of filters in OpenCV:

1 //Homogeneous blur:
2 blur(image, dstHomo, Size(kernel_length, kernel_length), Point(-1,-1));
3 //Gaussian blur:
4 GaussianBlur(image, dstGaus, Size(kernel_length, kernel_length), 0, 0);
5 //Median blur:
6 medianBlur(image, dstMed, kernel_length);
7 //Bilateral blur:
8 bilateralFilter(image, dstBila, kernel_length, kernel_length*2, kernel_length/2);

and for each function, you can find more details in OpenCV Documentation

Test Images

Glad to use my favorite Van Gogh image :



From left to right: Homogeneous blur, Gaussian blur, Median blur, Bilateral blur.

(click iamge to view full size version :p )

kernel length = 3:

homo3 Gaussian3 Median3 Bilateral3

kernel length = 9:

homo9 Gaussian9 Median9 Bilateral9
kernel length = 15:

homo15 Gaussian15 Median15 Bilateral15

kernel length = 23:

homo23 Gaussian23 Median23 Bilateral23
kernel length = 31:

homo31 Gaussian31 Median31 Bilateral31
kernel length = 49:

homo49 Gaussian49 Median49 Bilateral49
kernel length = 99:

homo99 Gaussian99 Median99 Bilateral99

Trackback URL.

Posted in C, Image / Video Filters, Image Processing, OpenCV, OpenCV, OpenCV Tutorial | Leave a Comment »

Open road databases for lane tracking and vehicle detection

Posted by Hemprasad Y. Badgujar on May 16, 2015

“free” for a researcher willing to test his own algorithms of lane tracking or vehicle detection.

Although it is quite easy to find webpages with huge databases with images of vehicles, it is not so easy to find sites where there are videos of the road ahead captured with a camera installed in a vehicle.

We finally found some, which include in some cases the original videos and the videos with the overimposed detection of vehicles, pedestrians and things like that.

Here you are the list with the links and a short description of the owner:

Thanks to the researchers that share their databases! You support the whole research community with your effort!!

Posted in Computer Vision, OpenCV, Project Related | Tagged: , , , , , | Leave a Comment »

build Tesseract 3.03 with Visual Studio 2013

Posted by Hemprasad Y. Badgujar on March 17, 2015

Compiling Tesseract 3.02.02 with Visual C++ 2008 (Express) is covered by the documentation whereas compiling Tesseract 3.03 isn’t covered at all, though.

Unfortunately newer versions of Tesseract also require a new version of Leptonica, a C library for image processing and image analysis applications, which in turn requires new versions of zlib, libpng, libtiff, libjpeg and giflib. Tesseract provides pre-compiled versions of Leptonica, which prevents you from having to collect and set up projects for all of these libraries in Visual Studio, which can be a tedious task.

Yesterday I found a project on GitHub that includes a Visual Studio solution file for all dependencies required to compile Tesseract 3.03: charlesw/tesseract-vs2012. While following the build instructions there, I stumpled over several build errors, which I could easily resolve by removing a definition. The necessary change is in my fork of the repository mentioned above.

This is a write-up of all steps that are required to compile Tesseract 3.03 with Visual Studio 2013.


  1. Install Git.
  2. Install SVN. There are many versions of SVN. You can, for example, install the binary package from SlickSVN for free.
  3. Install Visual Studio 2013 for Windows Desktop (the Express version will be enough). You don’t need the optional features except for “Microsoft Foundation Classes for C++”.

Building the dependencies

  1. Create a directory where you want to compile Tesseract. In this document, I’ll assume it’s C:\Tesseract-Build\.
  2. Open a CMD prompt and change to that directory.
    cd \Tesseract-Build\
  3. Clone the dependencies repository from GitHub.
    git clone
  4. Open the “VS 2013 Developer Command Prompt”. (It can be found in the Start Menu.)
  5. Change to the newly cloned repository.
    cd \Tesseract-Build\tesseract-vs2012
  6. Build the dependencies
    msbuild build.proj
  7. You can close the “VS 2013 Developer Command Prompt”.

Building Tesseract

  1. Re-open the first command prompt and ensure it’s still in C:\Tesseract-Build\.
  2. Get the latest source from SVN.
    svn checkout
  3. Change to the newly checked-out repository.
    cd tesseract-ocr
  4. Apply the patch provided in tesseract-vs2013.
    svn patch ..\tesseract-vs2012\vs2013+64bit_support.patch
  5. Copy both directories in C:\Tesseract-Build\tesseract-vs2012\release\ toC:\Tesseract-Build\. Now you should have
    • C:\Tesseract-Build\include\
    • C:\Tesseract-Build\lib\
  6. Open C:\Tesseract-Build\tesseract-ocr\vs2013\tesseract.sln with Visual Studio 2013.
  7. Press F7 on your keyboard. Both libtesseract303 and tesseract should compile without errors.

The Visual Studio solution file contains configurations for dynamic and static compilation as well as debugging and release configurations for both 32-Bit and 64-Bit. Select whichever configuration you need and recompile with F7.

You can find the compiled binaries in C:\Tesseract-Build\tesseract-ocr\vs2013\bin\.

Posted in Computer Softwares, Installation, Mixed, OpenCV, Project Related | Tagged: , , , , | Leave a Comment »

Research Writing Up & Publishing

Posted by Hemprasad Y. Badgujar on February 5, 2015

Research Writing Up & Publishing

The Sense of Style: The Thinking Person’s Guide to Writing in the 21st Century
– Steven Pinker, Harvard University
Writing your thesis – Champion et al.
How to read a scientific paper.
Top Ten Tips for doing your PhD – only 10?
Advice on Research and writing
Writing tips – covering a wide range of issues, from abbreviations, to punctuation, to writing style.
Guide to Grammar and Writing by Charles Darling
How to have a bad career in Research/Academia by David A. Patterson.
How to Write a Master’s Thesis in Computer Science by William D. Shoaff
Writing and Presenting Your Thesis or Dissertation by S. Joseph Levine, Ph.D.
How To Write A Dissertation or Bedtime Reading For People Who Do Not Have Time To Sleep
List of links on being a graduate student
Notes On The PhD Degree by D. Comer.
On Being A Scientist: Responsible Conduct In Research by NATIONAL ACADEMY OF SCIENCES.
You and your research.
Library notes for Engineering Researchers.
PhD Thesis Structure and Content by Christopher Clack.
Discussion on Ph.D. thesis proposals in computing science by H. Lauer.
Tips for a PhD and here.
Guide for writing a funding proposal by J. Levine.
How to publish in top journals.
How to Write Publishable Papers and here is a list of journals (mostly non-IT).
Networking on the Network – A Guide to Professional Skills for PhD Students by Phil Agre.
PhD writing links
Your PHd Thesis: How to Plan, Draft, Revise and Edit Your Thesis – A book by Brewer et al.,

Posted in Documentations, Journals & Conferences, My Research Related, Research Menu | Tagged: , , , , , , , , , , , , , , | Leave a Comment »

Posting Source Code in WordPress

Posted by Hemprasad Y. Badgujar on January 9, 2015

While doesn’t allow you to use potentially dangerous code on your blog, there is a way to post source code for viewing. We have created a shortcode you can wrap around source code that preserves its formatting and even provides syntax highlighting for certain languages, like so:

#button {
    font-weight: bold;
    border: 2px solid #fff;

To accomplish the above, just wrap your code in these tags:

** [*code language="css"]
your code here
[/code*] **

delete “*”

The language (or lang) parameter controls how the code is syntax highlighted. The following languages are supported:

  • actionscript3
  • bash
  • clojure
  • coldfusion
  • cpp
  • csharp
  • css
  • delphi
  • erlang
  • fsharp
  • diff
  • groovy
  • html
  • javascript
  • java
  • javafx
  • matlab (keywords only)
  • objc
  • perl
  • php
  • text
  • powershell
  • python
  • r
  • ruby
  • scala
  • sql
  • vb
  • xml

If the language parameter is not set, it will default to “text” (no syntax highlighting).

Code in between the source code tags will automatically be encoded for display, you don’t need to worry about HTML entities or anything.

Configuration Parameters

The shortcodes also accept a variety of configuration parameters that you may use to customize the output. All are completely optional.

  • autolinks (true/false) — Makes all URLs in your posted code clickable. Defaults to true.
  • collapse (true/false) — If true, the code box will be collapsed when the page loads, requiring the visitor to click to expand it. Good for large code posts. Defaults to false.
  • firstline (number) — Use this to change what number the line numbering starts at. It defaults to 1.
  • gutter (true/false) — If false, the line numbering on the left side will be hidden. Defaults to true.
  • highlight (comma-seperated list of numbers) — You can list the line numbers you want to be highlighted. For example “4,7,19”.
  • htmlscript (true/false) — If true, any HTML/XML in your code will be highlighted. This is useful when you are mixing code into HTML, such as PHP inside of HTML. Defaults to false and will only work with certain code languages.
  • light (true/false) — If true, the gutter (line numbering) and toolbar (see below) will be hidden. This is helpful when posting only one or two lines of code. Defaults to false.
  • padlinenumbers (true/false/integer) — Allows you to control the line number padding. truewill result in automatic padding, false will result in no padding, and entering a number will force a specific amount of padding.
  • title (string) — Set a label for your code block. Can be useful when combined with thecollapse parameter.

Here’s some examples of the above parameters in action:

<div class="line number1 index0 alt2"><code class="htmlscript plain"><!</code><code class="htmlscript keyword">DOCTYPE</code> <code class="htmlscript plain">html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "<a href=""></a>"></code></div>
<div class="line number2 index1 alt1"><code class="htmlscript plain"><</code><code class="htmlscript keyword">html</code> <code class="htmlscript color1">xmlns</code><code class="htmlscript plain">=</code><code class="htmlscript string">"<a href=""></a>"</code> <code class="htmlscript color1">xml:lang</code><code class="htmlscript plain">=</code><code class="htmlscript string">"en"</code> <code class="htmlscript color1">lang</code><code class="htmlscript plain">=</code><code class="htmlscript string">"en"</code><code class="htmlscript plain">></code></div>
<div class="line number3 index2 alt2"><code class="htmlscript plain"><</code><code class="htmlscript keyword">head</code><code class="htmlscript plain">></code></div>
<div class="line number4 index3 alt1"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">meta</code> <code class="htmlscript color1">http-equiv</code><code class="htmlscript plain">=</code><code class="htmlscript string">"Content-Type"</code> <code class="htmlscript color1">content</code><code class="htmlscript plain">=</code><code class="htmlscript string">"text/html; charset=UTF-8"</code> <code class="htmlscript plain">/></code></div>
<div class="line number5 index4 alt2"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">title</code><code class="htmlscript plain">> Code Example</</code><code class="htmlscript keyword">title</code><code class="htmlscript plain">></code></div>
<div class="line number6 index5 alt1"><code class="htmlscript plain"></</code><code class="htmlscript keyword">head</code><code class="htmlscript plain">></code></div>
<div class="line number7 index6 alt2"><code class="htmlscript plain"><</code><code class="htmlscript keyword">body</code><code class="htmlscript plain">></code></div>
<div class="line number8 index7 alt1"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">h1</code><code class="htmlscript plain">> Code Example</</code><code class="htmlscript keyword">h1</code><code class="htmlscript plain">></code></div>
<div class="line number9 index8 alt2"></div>
<div class="line number10 index9 alt1"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">p</code><code class="php plain">></code><code class="php script"><?php</code> <code class="php functions">echo</code> <code class="php string">'Hello World!'</code><code class="php plain">; </code><code class="php script">?></code><code class="htmlscript plain"></</code><code class="htmlscript keyword">p</code><code class="htmlscript plain">></code></div>
<div class="line number11 index10 alt2"></div>
<div class="line number12 index11 alt1 highlighted"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">p</code><code class="htmlscript plain">>This line is highlighted.</</code><code class="htmlscript keyword">p</code><code class="htmlscript plain">></code></div>
<div class="line number13 index12 alt2"></div>
<div class="line number14 index13 alt1"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">p</code><code class="htmlscript plain">>This line is very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long.</</code><code class="htmlscript keyword">p</code><code class="htmlscript plain">></code></div>
<div class="line number15 index14 alt2"></div>
<div class="line number16 index15 alt1"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">div</code> <code class="htmlscript color1">class</code><code class="htmlscript plain">=</code><code class="htmlscript string">"foobar"</code><code class="htmlscript plain">></code></div>
<div class="line number17 index16 alt2"><code class="htmlscript spaces">        </code><code class="htmlscript plain">This    is  an</code></div>
<div class="line number18 index17 alt1"><code class="htmlscript spaces">        </code><code class="htmlscript plain">example of  smart</code></div>
<div class="line number19 index18 alt2"><code class="htmlscript spaces">        </code><code class="htmlscript plain">tabs.</code></div>
<div class="line number20 index19 alt1"><code class="htmlscript spaces">    </code><code class="htmlscript plain"></</code><code class="htmlscript keyword">div</code><code class="htmlscript plain">></code></div>
<div class="line number21 index20 alt2"></div>
<div class="line number22 index21 alt1"><code class="htmlscript spaces">    </code><code class="htmlscript plain"><</code><code class="htmlscript keyword">p</code><code class="htmlscript plain">><</code><code class="htmlscript keyword">a</code> <code class="htmlscript color1">href</code><code class="htmlscript plain">=</code><code class="htmlscript string">"<a href=""></a>"</code><code class="htmlscript plain">></</code><code class="htmlscript keyword">a</code><code class="htmlscript plain">></</code><code class="htmlscript keyword">p</code><code class="htmlscript plain">></code></div>
<div class="line number23 index22 alt2"><code class="htmlscript plain"></</code><code class="htmlscript keyword">body</code><code class="htmlscript plain">></code></div>
<div class="line number24 index23 alt1"><code class="htmlscript plain"></</code><code class="htmlscript keyword">html</code><code class="htmlscript plain">></code></div>



Posted in Project Related | Leave a Comment »

How to run CUDA 6.5 in Emulation Mode

Posted by Hemprasad Y. Badgujar on December 20, 2014

How to run CUDA in Emulation Mode

Some beginners feel a little bit dejected when they find that their systems donotcontainGPUs to learn andworkwithCUDA. In this blog post, I shall include the step by step process of installingandexecutingCUDA programs in emulation mode on a system with no GPU installed in it.It is mentioned here thatyouwill not be able to gain any performance advantage expected out of a GPU (obviously). Instead, the performance will be worse than a CPU implementation. However, emulation mode provides an excellent tool to compile and debugyourCUDA codes for more advanced purposes.Please note that I performed the following steps for a Dell Xeon with Windows 7 (32-bit) system.1. Acquire and install Microsoft Visual Studio 2008 on your system.

2. Access the CUDA Toolkit Archives  page and select CUDA Toolkit 6.0 / 6.5 version. (It is the last version that came with emulation mode. Emulation mode was discontinued in later versions.)

3. Download and install the following on your machine:-

  • Developer Drivers for Win8/win7 X64  – (Select the one as required for your machine.)
  • CUDA Toolkit
  • CUDA SDK Code Samples
  • CUBLAS and CUFFT (If required)

4. The next step is to check whether the sample codes run properly on the system or not. This will ensure that there is nothing missing from the required installations. Browse the nVIDIA GPU Computing SDK using the windows start bar or by using the following path in your My Computer address bar:-
As per your working Platform
“C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Release”
“C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win64\Release”

(Also note that the ProgramData folder is by default set to “Hidden” attribute. It will be good if you unhide theis folder as it will be frequently utilized later on as you progress with your CUDA learning spells.)

5. Run the “deviceQuery” program and it should output something similar as shown in Fig. 1. Upon visual inspection of the output data, it can be seen that “there is no GPU device found” however the test has PASSED. This means that all the required installations for CUDA in emulation mode has been completed and now we can proceed with writing, compiling and executing CUDA programs in emulation mode.

Figure 1. Successful Rxecution of deviceQuery.exe ** Demo Example only

6. Open Visual Studio and create a new Win32 console project. Let’s name it “HelloCUDAEmuWorld”. Remember to select the “EMPTY PROJECT” option in Application Settings. Now Right Click on “Source Files” in the project tree and add new C++ code item. Remember to include the extension “.cu” instead of “.cpp”. Let’s name this item as “”. (If you forget the file extension, it can always be renamed via the project tree on the left).

7. Include the CUDA include, lib and  bin paths to MS Visual Studio. They were located at “C:\CUDA” in my system.

The next steps need to be performed for every new CUDA project when created.

8. Right Click on the project and select Custom Build Rules. Check the Custom Build Rules v6.0.0 option if available. Otherwise, click on Find Existing… and navigate to “C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common” and select Cuda.rules. This will add the build rules for CUDA v6.0to VS 2012.

9. Right click on the project and select Properties. Navigate to Configuration Properties –> Linker –> Input. Type in cudart.lib in the Additional Dependencies text bar and click Okay. Now we are ready to compile and run our first ever CUDA program in emulation mode. But first we need to activate the emulation  mode for .cu files.

10. Once again  Right click on the project and select Properties. Navigate to Configuration Properties –> CUDA Build Rule v6.0.0 –> General. Set Emulation Mode from No to Yes in the right hand column of the opened window. Click Okay.

11. Type in the following in the code editor and build and compile the project. And there it is. Your first ever CUDA program, in Emulation Mode. Something to brag about among friends.

int main(void)
return 0;

I hope this effort would not go in vain and offer some help to anyone whois tied upregarding this issue. Do contact if there is any queryregarding the above procedure.Source (

Posted in Computer Vision, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, Image / Video Filters, My Research Related, OpenCV, PARALLEL, Project Related | Leave a Comment »

Running CUDA Code Natively on x86 Processors

Posted by Hemprasad Y. Badgujar on December 20, 2014

1 Try : CUDA Development without GPU

If you want to run the code on your machine but you don’t have a GPU? Or maybe you want to try things out before firing up your AWS instance? Here I show you a way to run the CUDA code without a GPU.

Note: this only works on Linux, maybe there are other alternatives for Mac or Windows.

Ocelot lets you run CUDA programs on NVIDIA GPUs, AMD GPUs and x86-CPUs without recompilation. Here we’ll take advantage of the latter to run our code using our CPU.


You’ll need to install the following packages:

  • C++ Compiler (GCC)
  • Lex Lexer Generator (Flex)
  • YACC Parser Generator (Bison)
  • SCons

And these libraries:

  • boost_system
  • boost_filesystem
  • boost_serialization
  • GLEW (optional for GL interop)
  • GL (for NVIDIA GPU Devices)

With Arch Linux, this should go something like this:

pacman -S gcc flex bison scons boost glew

On Ubuntu it should be similar (sudo apt-get install flex bison g++ scons libboost-all-dev). If you don’t know the name of a package, search for it with ‘apt-cache search package_name’.

You should probably install LLVM too, it’s not mandatory, but I think it runs faster with LLVM.

pacman -S llvm clang

And of course you’ll need to install CUDA and the OpenCL headers. You can do it manually or using your distro’s package manager (for ubuntu I belive the package is called nvidia-cuda-toolkit):

pacman -S cuda libcl opencl-nvidia

One last dependency is Hydrazine. Fetch the source code:

svn checkout hydrazine

Or if you’re like me and prefer Git:

git svn clone -s hydrazine

And install it like this (you might need to install automake if you don’t have it already):

cd hydrazine
automake --add-missing
sudo make install


Now we can finally install Ocelot. This is where it gets a bit messy. Fetch the Ocelot source code:

svn checkout gpuocelot

Or with Git (warning, this will take a while, the whole repo is about 1.9 GB):

git svn clone -s gpuocelot

Now go to the ocelot directory:

cd gpuocelot/ocelot

And install Ocelot with:

sudo ./ --install


Sadly, the last command probably failed. This is how I fixed the problems.

Hydrazine headers not found

You could fix this adding an include flag. I just added a logical link to the hydrazine code we downloaded previously:

ln -s /path/to/hydrazine/hydrazine

Make sure you link to the second hydrazine directory (inside this directory you’ll find directories like implementation and interface). You should do this in the ocelot directory where you’re running the script (gpuocelot/ocelot).

LLVM header file not found

For any error that looks like this:

llvm/Target/TargetData.h: No such file or directory

Just edit the source code and replace it with this header:


The LLVM project moved the file.

LLVM IR folder “missing”

Similarly, files referenced by Ocelot from the “IR” package were moved (LLVM 3.2-5 on Arch Linux). If you get an error about LLVM/IR/LLVMContext.h missing, edit the following files:


and replace the includes at the top of each file for LLVM/IR/LLVMContext.h and LLVM/IR/Module.h with LLVM/LLVMContext.h and LLVM/Module.h, respectively.

PTXLexer errors

The next problem I ran into was:

.release_build/ocelot/ptxgrammar.hpp:351:14:error:'PTXLexer' is not a member of 'parser'

Go ahead, open the ‘.release_build/ocelot/ptxgrammar.hpp’ file and just comment line 355:

/* int yyparse (parser::PTXLexer& lexer, parser::PTXParser::State& state); */

That should fix the error.

boost libraries not found

On up-to-date Arch Linux boxes, it will complain about not finding boost libraries ‘boost_system-mt’, ‘boost_filesystem-mt’, ‘boost_thread-mt’.

I had to edit two files:

  • scripts/
  • SConscript

And just remove the trailing -mt from the library names:

  • boost_system
  • boost_filesystem
  • boost_thread

Finish the installation

After those fixes everything should work.

Whew! That wasn’t fun. Hopefully with the help of this guide it won’t be too painful.

To finish the installation, run:

sudo ldconfig

And you can check the library was installed correctly running:

OcelotConfig -l

It should return -locelot. If it didn’t, check your LD_LIBRARY_PATH. On my machine, Ocelot was installed under /usr/local/lib so I just added this to my LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Here’s the link to the installation instructions.

Running the code with Ocelot

We’re finally ready enjoy the fruits of our hard work. We need to do two things:

Ocelot configuration file

Add a file called configure.ocelot to your project (in the same directory as our Makefile and files), and copy this:

    ocelot: "ocelot",
    trace: {
        database: "traces/database.trace",
        memoryChecker: {
            enabled: false,
            checkInitialization: false
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: false
        debugger: {
            enabled: false,
            kernelFilter: "",
            alwaysAttach: true
    cuda: {
        implementation: "CudaRuntime",
        tracePath: "trace/CudaAPI.trace"
    executive: {
        devices: [llvm],
        preferredISA: nvidia,
        optimizationLevel: full,
        defaultDeviceID: 0,
        asynchronousKernelLaunch: True,
        port: 2011,
        host: "",
        workerThreadLimit: 8,
        warpSize: 16
    optimizations: {
        subkernelSize: 10000,

You can check this guide for more information about these settings.

Compile with the Ocelot library

And lastly, a small change to our Makefile. Append this to the GCC_OPTS:

GCC_OPTS=-O3 -Wall -Wextra -m64 `OcelotConfig -l`

And change the student target so it uses g++ and not nvcc:

student: compare main.o student_func.o Makefile
    g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GCC_OPTS)

I just replaced ‘nvcc’ with ‘g++’ and ‘NVCC_OPTS’ with ‘GCC_OPTS’.

make clean

And that’s it!

I forked the github repo and added these changes in case you want to take a look.

I found this guide helpful, it might have some additional details for installing things under ubuntu and/or manually.

Note for debian users

I successfully installed ocelot under debian squeeze, following the above steps, except that I needed to download llvm from upstream, as indicated in the above guide for ubuntu.

Other than that, after fixing some includes as indicated (Replacing ‘TargetData.h’ by ‘IR/DataLayout.h’, or adding ‘/IR/’ to some includes), it just compiled.

To build the student project, I needed to replace -m64 by -m32 to fit my architecture, and to make the other indicated changes.

Here are my makefile diffs:

$ git diff Makefile
diff --git a/HW1/student/Makefile b/HW1/student/Makefile
index b6df3a4..55480af 100755
--- a/HW1/student/Makefile
+++ b/HW1/student/Makefile
@@ -22,7 +22,8 @@ OPENCV_INCLUDEPATH=/usr/include

 OPENCV_LIBS=-lopencv_core -lopencv_imgproc -lopencv_highgui


 # On Macs the default install locations are below    #
@@ -36,12 +37,12 @@ CUDA_INCLUDEPATH=/usr/local/cuda-5.0/include

-NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m64
+NVCC_OPTS=-O3 -arch=sm_20 -Xcompiler -Wall -Xcompiler -Wextra -m32

-GCC_OPTS=-O3 -Wall -Wextra -m64
+GCC_OPTS=-O3 -Wall -Wextra -m32 `OcelotConfig -l` -I /usr/include/i386-linux-gn

 student: compare main.o student_func.o Makefile
-       $(NVCC) -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) 
+       g++ -o hw main.o student_func.o -L $(OPENCV_LIBPATH) $(OPENCV_LIBS) $(GC

 main.o: main.cpp timer.h utils.h HW1.cpp
        g++ -c main.cpp $(GCC_OPTS) -I $(CUDA_INCLUDEPATH) -I $(OPENCV_LIBPATH)

I’m using cuda toolkit 4.2.

I don’t know why, but it was necessary to add /usr/lib/gcc/i486-linux-gnu/4.4 to the PATH for nvcc to work:

export PATH=$PATH:/usr/lib/gcc/i486-linux-gnu/4.4

Eclipse CUDA plugin

This is probably for another entry, but I used this guide to integrate CUDA into Eclipse Indigo.

The plugin is University of Bayreuth’s Eclipse Toolchain for CUDA compiler

2 Try :Running CUDA Code Natively on x86 Processors

We  focused on Fermi and the architectural changes that significantly broadened the types of applications that map well to GPGPU computing yet preserve the application performance of software written for previous generations of CUDA-enabled GPUs. This article addresses the mindset that CUDA is a language for only GPU-based applications.

Recent developments allow CUDA programs to transparently compile and run at full speed on x86 architectures. This advance makes CUDA a viable programming model for all application development, just like OpenMP. The PGI CUDA C/C++ compiler for x86 (from the Portland Group Inc.) is the reason for this recent change in mindset. It is the first native CUDA compiler that can transparently create a binary that will run on an x86 processor. No GPU is required. As a result, programmers now have the ability to use a single source tree of CUDA code to reach those customers who own CUDA-enabled GPUs as or who use x86-based systems.

Figure 1 illustrates the options and target platforms that are currently available to build and run CUDA applications. The various products are discussed next.

Figure 1: The various options for compiling and running a CUDA program.

Aside from the new CUDA-x86 compiler, the other products require developer or customer intervention to run CUDA on multiple backends. For example:

  • nvcc: The freely downloadable nvcc compiler from NVIDIA creates both host and device code. With the use of the __device__ and __host__ specifiers, a developer can use C++ Thrust functions to run on both host and CUDA-enabled devices. This x86 pathway is represented by the dotted line in Figure 1, as the programmer must explicitly specify use of the host processor. In addition, developers must explicitly check whether a GPU is present and use this information to select the memory space in which the data will reside (that is, GPU or host). The Thrust API also allows CUDA codes to be transparently compiled to run on different backends. The Thrust documentation shows how to use OpenMP to run a Monte Carlo simulation on x86. Note that Thrust is not optimized to create efficient OpenMP code.
  • gpuocelot provides a dynamic compilation framework to run CUDA binaries on various backends such as x86, AMD GPUs, and an x86-based PTX emulator. The emulator alone is a valuable tool for finding hot spots and bottlenecks in CUDA codes. The gpuocelot website claims that it “allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs at full speed without recompilation.” I recommend this project even though it is challenging to use. As it matures, Ocelot will provide a pathway for customers to run CUDA binaries on various backends.
  • MCUDA is an academic project that translates CUDA to C. It is not currently maintained, but the papers are interesting reading. A follow-up project (FCUDA) provides a CUDA to FPGA translation capability.
  • SWAN provides a CUDA-to-OpenCL translation capability. The authors note that Swan is “not a drop in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls rewritten.” Still, it is an interesting project to bridge the gap between CUDA and OpenCL.

The CUDA-x86 compiler is the first to provide a seamless pathway to create a multi-platform application.

Why It Matters

Using CUDA for all application development may seem like a radical concept to many readers, but in fact, it is the natural extension of the emerging CPU/GPU paradigm of high-speed computing. One of the key benefits of CUDA is that it uses C/C++ and can be adopted easily and it runs on 300+ million GPUs and now all x86 chips. If this still feels like an edgy practice, this video presentation might be helpful.

CUDA works well now at its principal task — massively parallel computation — as demonstrated by the variety and number of projects that achieve 100x or greater performance in the NVIDIA showcase. See Figure 2.

Figure 2: All top 100 CUDA apps attain speedups in excess of 100x.

PGI CUDA-x86: CUDA Programming for Multi-core CPUs


The NVIDIA CUDA architecture was developed to enable offloading of compute-intensive kernels to GPUs. Through API function calls and language extensions, CUDA gives developers control over mapping of general-purpose compute kernels to GPUs, and over placement and movement of data between host memory and GPU memory. CUDA is supported on x86 and x64 (64-bit x86) systems running Linux, Windows or MacOS and that include an NVIDIA CUDA-enabled GPU. First introduced in 2007, CUDA is the most popular GPGPU parallel programming model with an estimated user-base of over 100,000 developers worldwide.

Let’s review the hardware around which the CUDA programming model was designed. Figure 1 below shows an abstraction of a multi-core x64+GPU platform focused on computing, with the graphics functionality stripped out. The key to the performance potential of the NVIDIA GPU is the large number of thread processors, up to 512 of them in a Fermi-class GPU. They’re organized into up to 16 multi-processors, each of which has 32 thread processors. Each thread processor has registers along with integer and floating point functional units; the thread processors within a multiprocessor run in SIMD mode. Fermi peak single-precision performance is about 1.4 TFLOPS and peak double-precision is about 550 GFLOPS.

Fermi Block Diagram

Figure 1: NVIDIA Fermi-class GPU Accelerator

The GPU has a large (up to 6GB) high bandwidth long latency device main memory. Each multi-processor has a small 64KB local shared memory that functions as both a hardware data cache and a software-managed data cache, and has a large register file.

The GPU has two levels of parallelism, SIMD within a multiprocessor, and parallel across multiprocessors. In addition, there is another very important level of concurrency: the thread processors support extremely fast multithread context switching to tolerate the long latency to device main memory. If a given thread stalls waiting for a device memory access, it is swapped out and another ready thread is swapped in and starts executing within a few cycles.

What kind of algorithms run well on this architecture?

  • Massive parallelism—is needed to effectively use hundreds of thread processors and provide enough slack parallelism for the fast multi-threading to effectively tolerate device memory latency and maximize device memory bandwidth utilization.
  • Regular parallelism—is needed for GPU hardware and firmware that is optimized for the regular parallelism found in graphics kernels; these correspond roughly to rectangular iteration spaces (think tightly nested loops).
  • Limited synchronization—thread processors within a multi-processor can synchronize quickly enough to enable coordinated vector operations like reductions, but there is virtually no ability to synchronize across multi-processors.
  • Locality—is needed to enable use of the hardware or user-managed data caches to minimize accesses to device memory.

This sounds a lot like a nest of parallel loops. So, NVIDIA defined the CUDA programming model to enable efficient mapping of general-purpose compute-intensive loop nests onto the GPU hardware. Specifically, a 1K x 1K matrix multiply loop that looks as follows on the host:

for (i = 0; i < 1024; ++i)
   for (k = 0; k < 1024; ++k)
      for (j = 0; j < 1024; ++j)
         c[i][j] =+= a[i][k]*b[k][j]; 

can be rewritten in its most basic form in CUDA C as:

cudaMalloc( &ap, memsizeA );
cudaMemcpy( ap, a, memsizeA, cudaMemcpyHostToDevice );
c_mmul_kernel <<<(64,64),(16,16)>>>(ap, bp, cp, 1024);
cudaMemcpy( c, cp, memsizeC, cudaMemcpyDeviceToHost );
__global__ void c_mmul_kernel(float* a, float* b, float* c, n)
   int i = blockIdx.y*16+threadIdx.y;
   int j = blockIdx.x*16+threadIdx.x;
   for( int k = 0; k < n; ++k )_
      c[n*i+j] += a[n*i+k] * b[n*k+j];

The triply-nested matrix multiply loop becomes a single dot-product loop, split out to a self-contained kernel function. The two outer loops are abstracted away in the launch of the kernel on the GPU. Conceptually, the over one million 1024-length dot-products it takes to perform the matrix multiply are all launched simultaneously on the GPU. The CUDA programmer structures fine-grain parallel tasks, in this case dot-product operations, as CUDA threads, organizes the threads into rectangular thread blocks with 32 to 1024 threads each, and organizes the thread-blocks into a rectangular grid. Each thread-block is assigned to a CUDA GPU multi-processor, and the threads within a thread-block are executed by the thread-processors within that multiprocessor.

The programmer also manages the memory hierarchy on the GPU, moving data from the host to device memory, from variables in device memory to variables in shared memory, or to variables that the user intends to be assigned to registers.

PGI CUDA C/C++ for Multi-core x64

The PGI CUDA C/C++ compiler for multi-core x64 platforms will allow developers to compile and optimize CUDA applications to run on x64-based workstations, servers and clusters with or without an NVIDIA GPU accelerator. Is it possible to compile CUDA C efficiently for multi-core processors? CUDA C is simply a parallel programming model and language. While it was designed with the structure required for efficient GPU programming, it also can be compiled for efficient execution on multi-core x64.

Looking at a multicore x64 CPU, we see features very like what we have on the NVIDIA GPU. We have MIMD parallelism across the cores, typically 4 cores but we know there are up to 12 on some chips today and up to 48 on a single motherboard. We have SIMD parallelism in the AVX or SSE instructions. So it’s the same set of features, excepting that CPUs are optimized with deep cache memory hierarchies for memory latency, whereas the GPU is optimized for memory bandwidth. Mapping the CUDA parallelism onto the CPU parallelism seems straightforward from basic principles.

Consider the process the CUDA programmer uses to convert existing serial or parallel programs to CUDA C, as outlined above. Many aspects of this process can simply be reversed by the compiler:

  • Reconstitute parallel/vector loop nests from the CUDA C chevron syntax
  • Where possible, remove or replace programmer-inserted __syncthreads() calls by appropriate mechanisms on the CPU

In effect, the PGI CUDA C/C++ compiler will process CUDA C as a native parallel programming language for mapping to multi-core x64 CPUs. CUDA thread blocks will be mapped to processor cores to effect multi-core execution, and CUDA thread-level parallelism will be mapped to the SSE or AVX SIMD units as shown in Figure 2 below. All existing PGI x64 optimizations for Intel and AMD CPUs will be applied to CUDA C/C++ host code—SIMD/AVX vectorization, inter-procedural analysis and optimizations, auto-parallelization for multi-core, OpenMP extensions support, etc.

Multi-core Mapping

Figure 2: Mapping CUDA to GPUs versus Multi-core CPUs

Initially, PGI CUDA C/C++ will target the CUDA 3.1 runtime API. There are no current plans to implement the CUDA driver API. The definition of warpSize may be changed (probably to 1 in optimizing versions of the compiler); correctly implementing warp-synchronous programming would either require implicit synchronization after each memory access, or would require the compiler to prove that such synchronization is not required. It’s much more natural to require programmers to use the value of warpSize to determine how many threads are running in SIMD mode.

What kind of performance can you expect from CUDA C programs running on multi-core CPUs? There are many determining factors. Typical CUDA C programs perform many explicit operations and optimizations that are not necessary when programming multi-core CPUs using OpenMP or threads-based programming:

  • Explicit movement of data from host main memory to CUDA device memory
  • Data copies from arrays in CUDA device memory to temporary arrays in multi-processor shared memory
  • Synchronization of SIMT thread processors to ensure shared memory coherency
  • Manual unrolling of loops

In many cases, the PGI CUDA C compiler will remove explicit synchronization of the thread processors if it can determine it’s safe to split loops in which synchronization calls occur. Manual unrolling of loops will not typically hurt performance on x64, and may help in some cases. However, explicit movement of data from host memory to “device” copies will still occur, and explicit movement of data from device copies to temporary arrays in shared memory will still occur; these operations are pure overhead on a multi-core processor.

It will be easy to write CUDA programs that run really well on the GPU and don’t run so well on a CPU. We can’t guarantee high performance, if you’ve gone and tightly hand-tuned your kernel code. As with OpenCL, we’re making the language portable, and many programs will port and run well; but there is no guarantee of general performance portability.

PGI Unified Binary for Multi-core x64 and NVIDIA GPUs

In later releases, in addition to multi-core execution, the PGI CUDA C/C++ compiler will support execution of device kernels on NVIDIA CUDA-enabled GPUs. PGI Unified Binary technology will enable developers to build one binary that will use NVIDIA GPUs when present or default to using multi-core x64 if no GPU is present.

PGI Unified Binary

Figure 3: PGI Unified Binary for NVIDIA GPUs and Multi-core CPUs


It’s important to clarify that the PGI CUDA C/C++ compiler for multi-core does not split work between the CPU and GPU; it executes device kernels in multi-core mode on the CPU. Even with the PGI Unified Binary feature, the device kernels will execute either on the GPU or on the multi-core, since the data will have been allocated in one memory or the other. PGI CUDA C/C++ also is not intended to as a replacement for OpenMP or other parallel programming models for CPUs. It is a feature of the PGI compilers that will enable CUDA programs to run on either CPUs or GPUs, and will give developers the option of a uniform manycore parallel programming model for applications where it’s needed and appropriate. It will ensure CUDA C programs are portable to virtually any multi-core x64 processor-based HPC system.

The PGI compiler will implement the NVIDIA CUDA C language and closely track the evolution of CUDA C moving forward. The implementation will proceed in phases:

  • Prototype demonstration at SC10 in New Orleans (November 2010).
  • First production release in Q2 2011 with most CUDA C functionality. This will not be a performance release; it will use multi-core parallelism across threads in a single thread block, in the same way as PGI CUDA Fortran emulation mode, but will not exploit parallelism across thread blocks.
  • Performance release in Q3 2011 leveraging multi-core and SSE/AVX to implement low-overhead native parallel/SIMD execution; this will use a single core to execute all the threads in a single thread block, in SIMD mode where possible, and use multi-core parallelism across the thread blocks.
  • Unification release in Q4 2011 that supports PGI Unified Binary technology to create binaries that use NVIDIA GPU accelerators when present, or run on multi-core CPUs if no GPU is present.

The necessary elements of the NVIDIA CUDA toolkit needed to compile and execute CUDA C/C++ programs (header files, for example) will be bundled with the PGI compiler. Finally, the same optimizations and features implemented for CUDA C/C++ for multi-core will also be supported in CUDA Fortran, offering interoperability and a uniform programming model across both languages.

How It Works

In CUDA-x86, thread blocks are mapped to x86 processor cores. Thread-level parallelism is mapped to SSE (Streaming SIMD Extensions) or AVX SIMD units as shown below. (AVX is an extension of SSE to 256-bit operation). PGI indicates that:

  • The size of a warp (that is, the basic unit of code to be run) will be different than the typical 32 threads per warp for a GPU. For x86 computing, a warp might be the size of the SIMD units on the x86 core (either four or eight threads) or one thread per warp when SIMD execution is not utilized.
  • In many cases, the PGI CUDA C compiler removes explicit synchronization of the thread processors when the compiler can determine it is safe to split loops.
  • CUDA considers the GPU as a separate device from the host processors. CUDA x86 maintains this memory model, which means that data movement between the host and device memory spaces still consumes application runtime. As shown in the device bandwidth SDK example below, a modern Xeon processor can transfer data to a CUDA-x86 “device” at about 4GB/sec. All CUDA x86 pointers reside in the x86 memory space, so programmers can use conditional compilation to directly access memory without requiring data transfers when running on multicore processors.

Trying Out the Compiler

The PGI installation process is fairly straightforward:

  1. Register and download the latest version from PGI
  2. Extract the tarfile at the location of your choice and follow the instructions in INSTALL.txt.
    • Under Linux, this basically requires running the file ./install as superuser and answering a few straight-forward questions.
    • Note that you should answer “yes” to the installation of CUDA even if you have a GPU version of CUDA already installed on your system. The PGI x86 version will not conflict with the GPU version. Otherwise, the PGI compiler will not understand files with the .cu file extension.
  3. Create the license.dat file.

At this point, you have a 15-day license for the PGI compilers.

Setup the environment to build with the PGI tools as discussed in the installation guide. Following are the commands for bash under Linux:

PGI=/opt/pgi; export PGI
MANPATH=$MANPATH:$PGI/linux86-64/11.5/man; export MANPATH
PATH=$PGI/linux86-64/11.5/bin:$PATH; export PATH

Copy the PGI NVIDIA SDK samples to a convenient location and build them:

cp –r /opt/pgi/linux86-64/2011/cuda/cudaX86SDK  .
cd cudaX86SDK ;

This is the output of deviceQuery on an Intel Xeon e5560 processor:

CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
  CUDA Driver Version:                           99.99
  CUDA Runtime Version:                          99.99
  CUDA Capability Major revision number:         9998
  CUDA Capability Minor revision number:         9998
  Total amount of global memory:                 128000000 bytes
  Number of multiprocessors:                     1
  Number of cores:                               0
  Total amount of constant memory:               1021585952 bytes
  Total amount of shared memory per block:       1021586048 bytes
  Total number of registers available per block: 1021585904
  Warp size:                                     1
  Maximum number of threads per block:           1021585920
  Maximum sizes of each dimension of a block:    32767 x 2 x 0
  Maximum sizes of each dimension of a grid:     1021586032 x 32767 x 1021586048
  Maximum memory pitch:                          4206313 bytes
  Texture alignment:                             1021585952 bytes
  Clock rate:                                    0.00 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Unknown
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 99.99, CUDA Runtime Version = 99.99, NumDevs = 1, Device = DEVICE EMULATION MODE
Press <Enter> to Quit...

The output of bandwidthTest shows that device transfers work as expected:

Running on...
 Quick Mode
 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         4152.5
 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         4257.0
 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         8459.2
[bandwidthTest] - Test results:
Press <Enter> to Quit...

As with NVIDIA’s nvcc compiler, it is easy to use the PGI pgCC compiler to build an executable from a CUDA source file. As an example, copy the code from Part 3 of this series. To compile and run it under Linux, type:


Posted in Computer Network & Security, Computer Softwares, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, PARALLEL | Tagged: | Leave a Comment »

Parallel Code: Maximizing your Performance Potential

Posted by Hemprasad Y. Badgujar on December 19, 2014

No matter what the purpose of your application is, one thing is certain. You want to get the most bang for your buck. You see research papers being published and presented making claims of tremendous speed increases by running algorithms on the GPU (e.g. NVIDIA Tesla), in a cluster, or on a hardware accelerator (such as the Xeon Phi or Cell BE). These architectures allow for massively parallel execution of code that, if done properly, can yield lofty performance gains.

Unlike most aspects of programming, the actual writing of the programs is (relatively) simple. Most hardware accelerators support (or are very similar to) C based programming languages. This makes hitting the ground running with parallel coding an actually doable task. While mastering the development of massively parallel code is an entirely different matter, with a basic understanding of the principles behind efficient, parallel code, one can obtain substantial performance increases compared to traditional programming and serial execution of the same algorithms.

In order to ensure that you’re getting the most bang for your buck in terms of performance increases, you need to be aware of the bottlenecks associated with coprocessor/GPU programming. Fortunately for you, I’m here to make this an easier task. By simply avoiding these programming “No-No’s” you can optimize the performance of your algorithm without having to spend hundreds of hours learning about every nook and cranny of the architecture of your choice. This series will discuss and demystify these performance-robbing bottlenecks, and provide simple ways to make these a non-factor in your application.

Parallel Thread Management – Topic #1

First and foremost, the most important thing with regard to parallel programming is the proper management of threads. Threads are the smallest sequence of programmed instructions that are able to be utilized by an operating system scheduler. Your application’s threads must be kept busy (not waiting) and non-divergent. Properly scheduling and directing threads is imperative to avoid wasting precious computing time.
Read the rest of this entry »

Posted in Computer Hardwares, Computer Languages, Computing Technology, GPU (CUDA), GPU Accelareted, My Research Related, PARALLEL, Research Menu | Tagged: | Leave a Comment »

Extracts from a Personal Diary

dedicated to the life of a silent girl who eventually learnt to open up

Num3ri v 2.0

I miei numeri - seconda versione


Just another site

Algunos Intereses de Abraham Zamudio Chauca

Matematica, Linux , Programacion Serial , Programacion Paralela (CPU - GPU) , Cluster de Computadores , Software Cientifico




A great site

Travel tips

Travel tips

Experience the real life.....!!!

Shurwaat achi honi chahiye ...

Ronzii's Blog

Just your average geek's blog

Karan Jitendra Thakkar

Everything I think. Everything I do. Right here.

Chetan Solanki

Helpful to u, if u need it.....


Explorer of Research #HEMBAD


Explorer of Research #HEMBAD


A great site


This is My Space so Dont Mess With IT !!

%d bloggers like this: