Building a Beowulf cluster with Ubuntu

Posted by Hemprasad Y. Badgujar on December 25, 2014

The beowulf cluster article on Wikipedia describes the Beowulf cluster as follows:

“A Beowulf cluster is a group of what are normally identical, commercially available computers, which are running a Free and Open Source Software (FOSS), Unix-like operating system, such as BSD, GNU/Linux, or Solaris. They are networked into a small TCP/IP LAN, and have libraries and programs installed which allow processing to be shared among them.” – Wikipedia, Beowulf cluster, 28 February 2011.

This means a Beowulf cluster can easily be built with “off the shelf” computers running GNU/Linux in a simple home network. So building a Beowulf-like cluster is within reach if you already have a small TCP/IP LAN at home with desktop computers running Ubuntu Linux (or any other Linux distribution).

There are many ways to install and configure a cluster. There is OSCAR(1), which allows any user, regardless of experience, to easily install a Beowulf type cluster on supported Linux distributions. It installs and configures all required software according to user input.

There is also the NPACI Rocks toolkit(2), which incorporates the latest Red Hat distribution and cluster-specific software. Rocks addresses the difficulties of deploying manageable clusters. Rocks makes clusters easy to deploy, manage, upgrade and scale.

Both of the aforementioned toolkits for deploying clusters were made to be easy to use and require minimal expertise from the user. But the purpose of this tutorial is to explain how to manually build a Beowulf-like cluster. The toolkits mentioned above do most of the installing and configuring for you, rendering the learning experience moot. So it would not make much sense to use any of these toolkits if you want to learn the basics of how a cluster works. This tutorial therefore explains how to build a cluster by manually installing and configuring the required tools. I assume that you have some basic knowledge of Linux-based operating systems and know your way around the command line. I have tried, however, to make this as easy as possible to follow. Keep in mind that this is new territory for me as well, and there is a good chance that this tutorial shows methods that may not be the best.

I myself started off with the clustering tutorial from SCFBio which gives a great explanation on how to build a simple Beowulf cluster.(3) It describes the prerequisites for building a Beowulf cluster and why these are needed.


  • What’s a Beowulf Cluster, exactly?
  • Building a virtual Beowulf Cluster
  • Building the actual cluster
  • Configuring the Nodes
    • Add the nodes to the hosts file
    • Defining a user for running MPI jobs
    • Install and setup the Network File System
    • Setup passwordless SSH for communication between nodes
    • Setting up the process manager
      • Setting up Hydra
      • Setting up MPD
  • Running jobs on the cluster
    • Running MPICH2 example applications on the cluster
    • Running bioinformatics tools on the cluster
  • Credits
  • References

What’s a Beowulf Cluster, exactly?

The typical setup of a beowulf cluster

The definition I cited before is not very complete. The book “Engineering a Beowulf-style Compute Cluster”(4) by Robert G. Brown gives a more detailed answer to this question (if you’re serious about this, this book is a must read). According to this book, there is an accepted definition of a beowulf cluster. This book describes the true beowulf as a cluster of computers interconnected with a network with the following characteristics:

  1. The nodes are dedicated to the beowulf cluster.
  2. The network on which the nodes reside are dedicated to the beowulf cluster.
  3. The nodes are Mass Market Commercial-Off-The-Shelf (M2COTS) computers.
  4. The network is also a COTS entity.
  5. The nodes all run open source software.
  6. The resulting cluster is used for High Performance Computing (HPC).

Building a virtual Beowulf Cluster

It is not a bad idea to start by building a virtual cluster using virtualization software like VirtualBox. I simply used my laptop running Ubuntu as the master node, and created two virtual compute nodes running Ubuntu Server Edition in VirtualBox. A virtual cluster allows you to build and test the cluster without needing extra hardware. However, this method is only meant for testing and is not suited if you want increased performance.

When it comes to configuring the nodes for the cluster, building a virtual cluster is practically the same as building a cluster with actual machines. The difference is that you don’t have to worry about the hardware as much. You do have to properly configure the virtual network interfaces of the virtual nodes. They need to be configured in a way that the master node (e.g. the computer on which the virtual nodes are running) has network access to the virtual nodes, and vice versa.

Building the actual cluster

It is good practice to first build and test a virtual cluster as described above. If you have some spare computers and network parts lying around, you can use those to build the actual cluster. The nodes (the computers that are part of the cluster) and the network hardware are the usual kind available to the general public (beowulf requirement 3 and 4). In this tutorial we’ll use the Ubuntu operating system to power the machines and open source software to allow for distributed parallel computing (beowulf requirement 5). We’ll test the cluster with cluster-specific versions of bioinformatics tools that perform some sort of heavy calculations (beowulf requirement 6).

The cluster consists of the following hardware parts:

  • Network
  • Server / Head / Master Node (common names for the same machine)
  • Compute Nodes
  • Gateway

All nodes (including the master node) run the following software: the Ubuntu operating system, an SSH server, and MPICH2 (the MPI implementation). In addition, the master node runs an NFS server and the compute nodes run the NFS client tools; the installation of all of these is covered in the steps below.

I will not focus on setting up the network (parts) in this tutorial. I assume that all nodes are part of the same private network and that they are properly connected.

Configuring the Nodes

Some configurations need to be made to the nodes. I’ll walk you through them one by one.

Add the nodes to the hosts file

It is easier if the nodes can be accessed with their host name rather than their IP address. It will also make things a lot easier later on. To do this, add the nodes to the hosts file of all nodes.(8) (9) All nodes should have a static local IP address set. I won’t go into details here as this is outside the scope of this tutorial. For this tutorial I assume that all nodes are already properly configured to have a static local IP address.

Edit the hosts file (sudo vim /etc/hosts) like below and remember that you need to do this for all nodes (the IP addresses below are examples; use the static addresses of your own nodes),	localhost	master	node1	node2	node3

Make sure it doesn’t look like this:	localhost	master	node1	node2	node3

nor like this:	localhost	master	master	node1	node2	node3

Otherwise other nodes will try to connect to localhost when trying to reach the master node.

Once saved, you can use the host names to connect to the other nodes,

$ ping -c 3 master
PING master ( 56(84) bytes of data.
64 bytes from master ( icmp_req=1 ttl=64 time=0.606 ms
64 bytes from master ( icmp_req=2 ttl=64 time=0.552 ms
64 bytes from master ( icmp_req=3 ttl=64 time=0.549 ms

--- master ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.549/0.569/0.606/0.026 ms

Try this with different nodes on different nodes. You should get a response similar to the above.

In this tutorial, master is used as the master node. Once the cluster has been set up, the master node will be used to spawn jobs on the cluster. The compute nodes are node1 to node3 and will thus execute the jobs.

Defining a user for running MPI jobs

Several tutorials explain that all nodes need a separate user for running MPI jobs.(8) (9) (6) I haven’t found a clear explanation to why this is necessary, but there could be several reasons:

  1. There’s no need to remember different user names and passwords if all nodes use the same username and password.
  2. MPICH2 can use SSH for communication between nodes. Passwordless login with the use of authorized keys only works if the username matches the one set for passwordless login. You don’t have to worry about this if all nodes use the same username.
  3. The NFS directory can be made accessible for the MPI users only. The MPI users all need to have the same user ID for this to work.
  4. The separate user might require special permissions.

The command below creates a new user with username “mpiuser” and user ID 999. Giving a user ID below 1000 prevents the user from showing up in the login screen for desktop versions of Ubuntu. It is important that all MPI users have the same username and user ID. The user IDs for the MPI users need to be the same because we give access to the MPI user on the NFS directory later. Permissions on NFS directories are checked with user IDs. Create the user like this,

$ sudo adduser mpiuser --uid 999

You may use a different user ID (as long as it is the same for all MPI users). Enter a password for the user when prompted. It’s recommended to give the same password on all nodes so you have to remember just one password. The above command should also create a new directory /home/mpiuser. This is the home directory for user mpiuser and we will use it to execute jobs on the cluster.

Install and setup the Network File System

Files and programs used for MPI jobs (jobs that are run in parallel on the cluster) need to be available to all nodes, so we give all nodes access to a part of the file system on the master node. Network File System (NFS) enables you to mount part of a remote file system so you can access it as if it is a local directory. To install NFS, run the following command on the master node:

master:~$ sudo apt-get install nfs-kernel-server

And in order to make it possible to mount a Network File System on the compute nodes, the nfs-common package needs to be installed on all compute nodes:

$ sudo apt-get install nfs-common

We will use NFS to share the MPI user’s home directory (i.e. /home/mpiuser) with the compute nodes. It is important that this directory is owned by the MPI user so that all MPI users can access this directory. But since we created this home directory with the adduser command earlier, it is already owned by the MPI user,

master:~$ ls -l /home/ | grep mpiuser
drwxr-xr-x   7 mpiuser mpiuser  4096 May 11 15:47 mpiuser

If you use a different directory that is not currently owned by the MPI user, you must change its ownership as follows,

master:~$ sudo chown mpiuser:mpiuser /path/to/shared/dir

Now we share the /home/mpiuser directory of the master node with all other nodes. For this the file /etc/exports on the master node needs to be edited. Add the following line to this file,

/home/mpiuser *(rw,sync,no_subtree_check)
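The `*` wildcard in the line above exports the directory to any host that can reach the server. As a hedged alternative (assuming your LAN uses the 192.168.1.* range mentioned later in this tutorial), you can restrict the export to your own subnet:

```
/home/mpiuser,sync,no_subtree_check)
```

Adjust the subnet to match your network before using this variant.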

You can read the man page to learn more about the exports file (man exports). After the first install you may need to restart the NFS daemon:

master:~$ sudo service nfs-kernel-server restart

This also exports the directories listed in /etc/exports. Whenever /etc/exports is modified in the future, run the following command to re-export the listed directories:

master:~$ sudo exportfs -a

The /home/mpiuser directory should now be shared through NFS. In order to test this, you can run the following command from a compute node:

$ showmount -e master

In this case this should print the path /home/mpiuser. All data files and programs that will be used for running an MPI job must be placed in this directory on the master node. The other nodes will then be able to access these files through NFS.

The firewall is enabled by default on Ubuntu and will block access when a client tries to reach an NFS shared directory. So you need to add a rule with UFW (a tool for managing the firewall) to allow access from a specific subnet. If the IP addresses in your network have the format 192.168.1.*, then is the subnet. Run the following command to allow incoming access from a specific subnet,

master:~$ sudo ufw allow from

You need to run this on the master node and replace “” by the subnet used in your own network.

You should then be able to mount master:/home/mpiuser on the compute nodes. Run the following commands to test this,

node1:~$ sudo mount master:/home/mpiuser /home/mpiuser
node2:~$ sudo mount master:/home/mpiuser /home/mpiuser
node3:~$ sudo mount master:/home/mpiuser /home/mpiuser

If this fails or hangs, restart the compute node and try again. If the above command runs without a problem, you should test whether /home/mpiuser on any compute node actually has the content from /home/mpiuser of the master node. You can test this by creating a file in master:/home/mpiuser and checking if that same file appears in node*:/home/mpiuser (where node* is any compute node).

If mounting the NFS shared directory works, we can make it so that the master:/home/mpiuser directory is automatically mounted when the compute nodes are booted. For this the file /etc/fstab needs to be edited. Add the following line to the fstab file of all compute nodes,

master:/home/mpiuser /home/mpiuser nfs
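The minimal line above works, but fstab entries have six fields; a more explicit variant (the options shown are common defaults, not requirements from this tutorial) would be:

```
master:/home/mpiuser /home/mpiuser nfs defaults 0 0
```

The final two fields disable dump backups and fsck ordering, neither of which applies to an NFS mount.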

Again, read the man page of fstab if you want to know the details (man fstab). Reboot the compute nodes and list the contents of the/home/mpiuser directory on each compute node to check if you have access to the data on the master node,

$ ls /home/mpiuser

This should list the files from the /home/mpiuser directory of the master node. If it doesn’t immediately, wait a few seconds and try again. It might take some time for the system to initialize the connection with the master node.

Setup passwordless SSH for communication between nodes

For the cluster to work, the master node needs to be able to communicate with the compute nodes, and vice versa.(8) Secure Shell (SSH) is usually used for secure remote access between computers. By setting up passwordless SSH between the nodes, the master node is able to run commands on the compute nodes. This is needed to run the MPI daemons on the available compute nodes.

First install the SSH server on all nodes:

$ sudo apt-get install ssh

Now we need to generate an SSH key for all MPI users on all nodes. The SSH key is by default created in the user’s home directory. Remember that in our case the MPI user’s home directory (i.e. /home/mpiuser) is actually the same directory for all nodes: /home/mpiuser on the master node. So if we generate an SSH key for the MPI user on one of the nodes, all nodes will automatically have an SSH key. Let’s generate an SSH key for the MPI user on the master node (but any node should be fine),

$ su mpiuser
$ ssh-keygen

When asked for a passphrase, leave it empty (hence passwordless SSH).

When done, all nodes should have an SSH key (the same key, actually). The master node needs to be able to automatically login to the compute nodes. To enable this, the public SSH key of the master node needs to be added to the list of authorized keys (usually the file ~/.ssh/authorized_keys) of all compute nodes. But this is easy, since all SSH key data is stored in one location: /home/mpiuser/.ssh/ on the master node. So instead of having to copy master’s public SSH key to all compute nodes separately, we just have to copy it to master’s own authorized_keys file. There is a command to push the public SSH key of the currently logged in user to another computer. Run the following command on the master node as user “mpiuser”,

mpiuser@master:~$ ssh-copy-id localhost

Master’s own public SSH key should now be copied to /home/mpiuser/.ssh/authorized_keys. But since /home/mpiuser/ (and everything under it) is shared with all nodes via NFS, all nodes should now have master’s public SSH key in the list of known hosts. This means that we should now be able to login on the compute nodes from the master node without having to enter a password,

mpiuser@master:~$ ssh node1
mpiuser@node1:~$ echo $HOSTNAME

You should now be logged in on node1 via SSH. Make sure you’re able to login to the other nodes as well.
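To confirm passwordless login on all compute nodes in one go, a small helper like the following can be used (my own sketch; the node names are the ones assumed in this tutorial). BatchMode makes ssh fail immediately instead of prompting for a password when key-based login is not working:

```shell
# check_nodes: try a key-based login on each given node and report the result.
check_nodes() {
  for node in "$@"; do
    if ssh -o BatchMode=yes "$node" true 2>/dev/null; then
      echo "$node: ok"
    else
      echo "$node: passwordless SSH failed"
    fi
  done
}
# usage: check_nodes node1 node2 node3
```

Run it as mpiuser on the master node; every node should report “ok”.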

Setting up the process manager

In this section I’ll walk you through the installation of MPICH and configuring the process manager. The process manager is needed to spawn and manage parallel jobs on the cluster. The MPICH wiki explains this nicely:

“Process managers are basically external (typically distributed) agents that spawn and manage parallel jobs. These process managers communicate with MPICH processes using a predefined interface called as PMI (process management interface). Since the interface is (informally) standardized within MPICH and its derivatives, you can use any process manager from MPICH or its derivatives with any MPI application built with MPICH or any of its derivatives, as long as they follow the same wire protocol.” – Frequently Asked Questions – Mpich.

The process manager is included with the MPICH package, so start by installing MPICH on all nodes with,

$ sudo apt-get install mpich2

MPD was the traditional default process manager for MPICH up to the 1.2.x release series. Starting with the 1.3.x series, Hydra is the default process manager.(10) So depending on the version of MPICH you are using, you should either use MPD or Hydra for process management. You can check the MPICH version by running mpich2version in the terminal. Then follow either the steps for MPD or Hydra in the following subsections.

Setting up Hydra

This section explains how to configure the Hydra process manager and is for users of MPICH 1.3.x series and up. In order to setup Hydra, we need to create one file on the master node. This file contains all the host names of the compute nodes.(11) You can create this file anywhere you want, but for simplicity we create it in the MPI user’s home directory,

mpiuser@master:~$ cd ~
mpiuser@master:~$ touch hosts

In order to be able to send out jobs to the other nodes in the network, add the host names of all compute nodes to the hosts file,

node1
node2
node3
You may choose to include master in this file, which would mean that the master node would also act as a compute node. The hosts file only needs to be present on the node that will be used to start jobs on the cluster, usually the master node. But because the home directory is shared among all nodes, all nodes will have the hosts file. For more details about setting up Hydra see this page: Using the Hydra Process Manager.
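A side note on the hosts file format: Hydra also accepts an optional process count per host using a `hostname:count` syntax, which is useful when nodes have different numbers of cores (the node names and counts below are just examples):

```
node1:2
node2:2
node3:4
```

With such a file, Hydra will place up to the given number of processes on each node before moving on to the next.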

Setting up MPD

This section explains how to configure the MPD process manager and is for users of MPICH 1.2.x series and down. Before we can start any parallel jobs with MPD, we need to create two files in the home directory of the MPI user. Make sure you’re logged in as the MPI user and create the following two files in the home directory,

mpiuser@master:~$ cd ~
mpiuser@master:~$ touch mpd.hosts
mpiuser@master:~$ touch .mpd.conf

In order to be able to send out jobs to the other nodes in the network, add the host names of all compute nodes to the mpd.hosts file,

node1
node2
node3
You may choose to include master in this file, which would mean that the master node would also act as a compute node. The mpd.hosts file only needs to be present on the node that will be used to start jobs on the cluster, usually the master node. But because the home directory is shared among all nodes, all nodes will have the mpd.hosts file.

The configuration file .mpd.conf (mind the dot at the beginning of the file name) must be accessible to the MPI user only (in fact, MPD refuses to work if you don’t do this),

mpiuser@master:~$ chmod 600 .mpd.conf

Then add a line with a secret passphrase to the configuration file,

secretword=random_text_here

The secretword can be set to any random passphrase. You may want to use a random password generator to generate one.

All nodes need to have the .mpd.conf file in the home directory of mpiuser with the same passphrase. But this is automatically the case since/home/mpiuser is shared through NFS.

The nodes should now be configured correctly. Run the following command on the master node to start the mpd deamon on all nodes,

mpiuser@master:~$ mpdboot -n 3

Replace “3” by the number of nodes on which the mpd daemon should run; note that mpdboot’s -n option includes the local machine in its count. If this was successful, all nodes should now be running the mpd daemon. Run the following command to check if all nodes entered the ring (and are thus running the mpd daemon),

mpiuser@master:~$ mpdtrace -l

This command should display a list of all nodes that entered the ring. Nodes listed here are running the mpd daemon and are ready to accept MPI jobs. This means that your cluster is now set up and ready to rock!

Running jobs on the cluster

Running MPICH2 example applications on the cluster

The MPICH2 package comes with a few example applications that you can run on your cluster. To obtain these examples, download the MPICH2 source package from the MPICH website and extract the archive to a directory. The directory to where you extracted the MPICH2 package should contain an “examples” directory. This directory contains the source codes of the example applications. You need to compile these yourself.

$ sudo apt-get build-dep mpich2
$ wget
$ tar -xvzf mpich2-1.4.1.tar.gz
$ cd mpich2-1.4.1/
$ ./configure
$ make
$ cd examples/

The example application cpi is compiled by default, so you can find the executable in the “examples” directory. Optionally you can build the other examples as well,

$ make hellow
$ make pmandel

Once compiled, place the executables of the examples somewhere inside the /home/mpiuser directory on the master node. It’s common practice to place executables in a “bin” directory, so create the directory /home/mpiuser/bin and place the executables in this directory. The executables should now be available on all nodes.

We’re going to run an MPI job using the example application cpi. Make sure you’re logged in as the MPI user on the master node,

$ su mpiuser

And run the job like this,

When using MPD:

mpiuser@master:~$ mpiexec -n 3 /home/mpiuser/bin/cpi

When using Hydra:

mpiuser@master:~$ mpiexec -f hosts -n 3 /home/mpiuser/bin/cpi

Replace “3” by the number of processes you want the job to run with. When using Hydra, the -f switch should point to the file containing the host names. When using MPD, it’s important that you use the absolute path to the executable in the above command, because only then does MPD know where to look for the executable on the compute nodes. The absolute path used should thus be correct for all nodes. But since /home/mpiuser is the NFS shared directory, all nodes have access to this path and the files within it.
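Since the MPI user’s home directory is shared over NFS, a convenience worth noting (my own suggestion, not part of the original setup) is to prepend the shared bin directory to PATH once; because ~/.bashrc lives on the share, the change takes effect on every node:

```shell
# prepend the shared bin directory (path from this tutorial) to PATH
PATH="$HOME/bin:$PATH"
export PATH
# to persist it for future logins:
#   echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
```

Note that with MPD you should still pass the absolute path on the mpiexec command line, as explained above.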

The example application cpi is useful for testing because it shows on which nodes each sub process is running and how long it took to run the job. This application is however not useful to test performance, because it is a very small application which takes only a few milliseconds to run. If you look at the source, you’ll find that a 25-digit reference value of pi is hard coded into the program; it is used to compute the error of the approximation.

Running bioinformatics tools on the cluster

By running actual bioinformatics tools you can give your cluster a more realistic test run. There are several parallel implementations of bioinformatics tools that are based on MPI. There are two that I currently know of: mpiBLAST and ClustalW-MPI.

It would be nice to test mpiBLAST, but because of a compilation issue, I was not able to do so. After some asking around at the mpiBLAST-Users mailing list, I got an answer:

“That problem is caused by a change in GCC version 4.4.X. We don’t have a fix to give out for the issue as yet, but switching to 4.3.X or lower should solve the issue for the time being.”(7)

Basically, I’m using a newer version of the GCC compiler which fails to build mpiBLAST. In order to compile it, I’d have to use an older version. But instructing mpicc to use GCC 4.3 instead requires that MPICH2 be compiled with GCC 4.3. Instead of going through that trouble, I’ve decided to give ClustalW-MPI a try instead.

The MPI implementation of ClustalW is fairly outdated, but it’s good enough to perform a test run on your cluster. Download the source from the website, extract the package, and compile the source. Copy the resulting executable to the /home/mpiuser/bin directory on the master node. Use for example Entrez to search for some DNA/protein sequences and put these in a single FASTA file (the NCBI website can do that for you). Create several FASTA files with multiple sequences to test with. Copy the multi-sequence FASTA files to a data directory inside the shared directory (e.g. /home/mpiuser/data). Then run a job like this,

When using MPD:

mpiuser@master:~$ mpiexec -n 3 /home/mpiuser/bin/clustalw-mpi /home/mpiuser/data/seq_tyrosine.fasta

When using Hydra:

mpiuser@master:~$ mpiexec -f hosts -n 3 /home/mpiuser/bin/clustalw-mpi /home/mpiuser/data/seq_tyrosine.fasta

and let the cluster do the work. Again, notice that we must use absolute paths. You can check if the nodes are actually doing anything by logging into the nodes (ssh node*) and running the top command. This should display a list of running processes, with the processes using the most CPU at the top. In this case, you should see the process clustalw-mpi near the top of the list.


Credits

Thanks to Reza Azimi for mentioning the nfs-common package.


References

  1. OpenClusterGroup. OSCAR.
  2. Philip M. Papadopoulos, Mason J. Katz, and Greg Bruno. NPACI Rocks: Tools and Techniques for Easily Deploying Manageable Linux Clusters. October 2001, Cluster 2001: IEEE International Conference on Cluster Computing.
  3. Supercomputing Facility for Bioinformatics & Computational Biology, IIT Delhi. Clustering Tutorial.
  4. Robert G. Brown. Engineering a Beowulf-style Compute Cluster. 2004. Duke University Physics Department.
  5. Pavan Balaji, et al. MPICH2 User’s Guide, Version 1.3.2. 2011. Mathematics and Computer Science Division, Argonne National Laboratory.
  6. Kerry D. Wong. A Simple Beowulf Cluster.
  7. mpiBLAST-Users: unimplemented: inlining failed in call to ‘int fprintf(FILE*, const char*, …)’
  8. Ubuntu Wiki. Setting Up an MPICH2 Cluster in Ubuntu.
  9. Building a Beowulf Cluster in just 13 steps.
  10. Frequently Asked Questions – Mpich.
  11. Using the Hydra Process Manager – Mpich.


Parallel Code: Maximizing your Performance Potential

Posted by Hemprasad Y. Badgujar on December 19, 2014

No matter what the purpose of your application is, one thing is certain. You want to get the most bang for your buck. You see research papers being published and presented making claims of tremendous speed increases by running algorithms on the GPU (e.g. NVIDIA Tesla), in a cluster, or on a hardware accelerator (such as the Xeon Phi or Cell BE). These architectures allow for massively parallel execution of code that, if done properly, can yield lofty performance gains.

Unlike most aspects of programming, the actual writing of the programs is (relatively) simple. Most hardware accelerators support (or are very similar to) C based programming languages. This makes hitting the ground running with parallel coding an actually doable task. While mastering the development of massively parallel code is an entirely different matter, with a basic understanding of the principles behind efficient, parallel code, one can obtain substantial performance increases compared to traditional programming and serial execution of the same algorithms.

In order to ensure that you’re getting the most bang for your buck in terms of performance increases, you need to be aware of the bottlenecks associated with coprocessor/GPU programming. Fortunately for you, I’m here to make this an easier task. By simply avoiding these programming “No-No’s” you can optimize the performance of your algorithm without having to spend hundreds of hours learning about every nook and cranny of the architecture of your choice. This series will discuss and demystify these performance-robbing bottlenecks, and provide simple ways to make these a non-factor in your application.

Parallel Thread Management – Topic #1

First and foremost, the most important thing with regard to parallel programming is the proper management of threads. Threads are the smallest sequence of programmed instructions that are able to be utilized by an operating system scheduler. Your application’s threads must be kept busy (not waiting) and non-divergent. Properly scheduling and directing threads is imperative to avoid wasting precious computing time.


Computer Vision Algorithm Implementations

Posted by Hemprasad Y. Badgujar on May 6, 2014

Participate in Reproducible Research

General Image Processing

(C/C++ code, BSD lic) Image manipulation, matrix manipulation, transforms
(C/C++ code, BSD lic) Basic image processing, matrix manipulation and feature extraction algorithms: rotation, flip, photometric normalisations (Histogram Equalization, Multiscale Retinex, Self-Quotient Image or Gross-Brajovic), edge detection, 2D DCT, 2D FFT, 2D Gabor, PCA to do Eigen-Faces, LDA to do Fisher-Faces. Various metrics (Euclidean, Mahalanobis, ChiSquare, NormalizeCorrelation, TangentDistance, …)
(C/C++ code, MIT lic) A Free Experimental System for Image Processing (loading, transforms, filters, histogram, morphology, …)
(C/C++ code, GPL and LGPL lic) CImg Library is an open source C++ toolkit for image processing
Generic Image Library (GIL), boost integration
(C/C++ code, MIT lic) Adobe open source C++ Generic Image Library (GIL)
SimpleCV a kinder, gentler machine vision library
(python code, MIT lic) SimpleCV is a Python interface to several powerful open source computer vision libraries in a single convenient package
PCL, The Point Cloud Library
(C/C++ code, BSD lic) The Point Cloud Library (or PCL) is a large scale, open project for point cloud processing. The PCL framework contains numerous state-of-the art algorithms including filtering, feature estimation, surface reconstruction, registration, model fitting and segmentation.
Population, imaging library in C++ for processing, analysing, modelling and visualising
(C/C++ code, CeCill lic) Population is an open-source imaging library in C++ for processing, analysing, modelling and visualising including more than 200 algorithms designed by V. Tariel.
(C/C++ code, LGPL 3) A computer vision framework based on Qt and OpenCV that provides an easy to use interface to display, analyze and run computer vision algorithms. The library is provided with multiple application examples including stereo, SURF, Sobel and Hough transform.
Machine Vision Toolbox
(MATLAB/C, LGPL lic) image processing, segmentation, blob/line/point features, multiview geometry, camera models, colorimetry.
(Java code, Apache lic) BoofCV is an open source Java library for real-time computer vision and robotics applications. BoofCV is organized into several packages: image processing, features, geometric vision, calibration, visualize, and IO.
(C++ code, MIT lic) Simd is free open source library in C++. It includes high performance image processing algorithms. The algorithms are optimized with using of SIMD CPU extensions such as SSE2, SSSE3, SSE4.2 and AVX2.
Free but not open source – ArrayFire (formerly LibJacket) is a matrix library for CUDA
(CUDA/C++, free lic) ArrayFire offers hundreds of general matrix and image processing functions, all running on the GPU. The syntax is very Matlab-like, with the goal of offering easy porting of Matlab code to C++/ArrayFire.

Image Acquisition, Decoding & Encoding

(C/C++ code, LGPL or GPL lic) Record, convert and stream audio and video (lots of codecs)
(C/C++ code, BSD lic) PNG, JPEG,… images, avi video files, USB webcam,…
(C/C++ code, BSD lic) Video file decoding/encoding (ffmpeg integration), image capture from a frame grabber or from USB, Sony pan/tilt/zoom camera control using VISCA interface
lib VLC
(C/C++ code, GPL lic) Used by VLC player: record, convert and stream audio and video
(C/C++ code, LGPL lic) RTSP streams
(C/C++ code, GPL lic) Loading & saving DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PhotoCD, PNG, Postscript, SVG, TIFF, and more
(C/C++ code, LGPL lic) Loading & saving various image format
(C/C++ code, GPL & FPL lic) PNG, BMP, JPEG, TIFF loading
(C/C++ code, LGPL lic) VideoMan is trying to make the image capturing process from cameras, video files or image sequences easier.


(C/C++ code, BSD lic) Pyramid image segmentation
(C/C++ code, Microsoft Research Lic) Branch-and-Mincut Algorithm for Image Segmentation
Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo

Machine Learning

(C/C++ code, BSD lic) Gradient machines ( multi-layered perceptrons, radial basis functions, mixtures of experts, convolutional networks and even time-delay neural networks), Support vector machines, Ensemble models (bagging, adaboost), Non-parametric models (K-nearest-neighbors, Parzen regression and Parzen density estimator), distributions (Kmeans, Gaussian mixture models, hidden Markov models, input-output hidden Markov models, and Bayes classifier), speech recognition tools

Object Detection

(C/C++ code, BSD lic) Viola-Jones face detection (Haar features)
(C/C++ code, BSD lic) MLP & cascade of Haar-like classifiers face detection
Hough Forests
(C/C++ code, Microsoft Research Lic) Class-Specific Hough Forests for Object Detection
Efficient Subwindow Object Detection
(C/C++ code, Apache Lic) Christoph Lampert “Efficient Subwindow” algorithms for Object Detection
INRIA Object Detection and Localization Toolkit
(C/C++ code, Custom Lic) Histograms of Oriented Gradients library for Object Detection

Object Category Labelling

Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo
Multi-label optimization
(C/C++/MATLAB code) The gco-v3.0 library is for optimizing multi-label energies. It supports energies with any combination of unary, pairwise, and label cost terms.

Optical flow

(C/C++ code, BSD lic) Horn & Schunck algorithm, Lucas & Kanade algorithm, Lucas-Kanade optical flow in pyramids, block matching.
(C/C++/OpenGL/Cg code, LGPL) Gain-Adaptive KLT Tracking and TV-L1 optical flow on the GPU.
(C/C++/Matlab code, Custom Lic.) The RLOF library provides GPU / CPU implementation of Optical Flow and Feature Tracking method.

Features Extraction & Matching

SIFT by R. Hess
(C/C++ code, GPL lic) SIFT feature extraction & RANSAC matching
(C/C++ code) SURF feature extraction algorithm (kind of fast SIFT)
(C/C++ code, Ecole Polytechnique and ENS Cachan for commercial Lic) Affine SIFT (ASIFT)
VLFeat (formerly Sift++)
(C/C++ code) SIFT, MSER, k-means, hierarchical k-means, agglomerative information bottleneck, and quick shift
A GPU Implementation of Scale Invariant Feature Transform (SIFT)
(C/C++ code, GPL lic) An enhanced version of RANSAC that considers the correlation between data points

Nearest Neighbors matching

(C/C++ code, BSD lic) Approximate Nearest Neighbors (Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration)
(C/C++ code, LGPL lic) Approximate Nearest Neighbor Searching


(C/C++ code, BSD lic) Kalman, Condensation, CAMSHIFT, Mean shift, Snakes
KLT: An Implementation of the Kanade-Lucas-Tomasi Feature Tracker
(C/C++ code, public domain) Kanade-Lucas-Tomasi Feature Tracker
(C/C++/OpenGL/Cg code) A GPU-based Implementation of the Kanade-Lucas-Tomasi Feature Tracker
(C/C++/OpenGL/Cg code, LGPL) Gain-Adaptive KLT Tracking and TV-L1 optical flow on the GPU
On-line boosting trackers
(C/C++, LGPL) On-line boosting tracker, semi-supervised tracker, beyond semi-supervised tracker
Single Camera background subtraction tracking
(C/C++, LGPL) Background subtraction based tracking algorithm using OpenCV.
Multi-camera tracking
(C/C++, LGPL) Multi-camera particle filter tracking algorithm using OpenCV and Intel IPP.

Simultaneous localization and mapping

Real-Time SLAM – SceneLib
(C/C++ code, LGPL lic) Real-time vision-based SLAM with a single camera
(C/C++ code, Isis Innovation Limited lic) Parallel Tracking and Mapping for Small AR Workspaces
(C/C++ code, BSD lic) GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices

Camera Calibration & constraint

(C/C++ code, BSD lic) Chessboard calibration, calibration with rig or pattern
Geometric camera constraint – Minimal Problems in Computer Vision
Minimal problems in computer vision arise when computing geometrical models from image data. They often lead to solving systems of algebraic equations.
Camera Calibration Toolbox for Matlab
(Matlab toolbox) Camera Calibration Toolbox for Matlab by Jean-Yves Bouguet (C implementation in OpenCV)

Multi-View Reconstruction

Bundle Adjustment – SBA
(C/C++ code, GPL lic) A Generic Sparse Bundle Adjustment Package Based on the Levenberg-Marquardt Algorithm
Bundle Adjustment – SSBA
(C/C++ code, LGPL lic) Simple Sparse Bundle Adjustment (SSBA)


Efficiently solving multi-label MRFs (Readme)
(C/C++ code) Segmentation, object category labelling, stereo
LIBELAS: Library for Efficient LArge-scale Stereo Matching
(C/C++ code) Disparity maps, stereo

Structure from motion

(C/C++ code, GPL lic) A structure-from-motion system for unordered image collections
Patch-based Multi-view Stereo Software (Windows version)
(C/C++ code, GPL lic) A multi-view stereo software that takes a set of images and camera parameters, then reconstructs 3D structure of an object or a scene visible in the images
libmv – work in progress
(C/C++ code, MIT lic) A structure from motion library
Multicore Bundle Adjustment
(C/C++/GPU code, GPL3 lic) Design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems.
(C/C++/GPU code, MPL2 lic) OpenMVG ("open Multiple View Geometry") is a library for computer-vision scientists, especially targeted at the Multiple View Geometry community. It is designed to provide easy access to the classical problem solvers in Multiple View Geometry and to solve them accurately.

Visual odometry

LIBVISO2: Library for VISual Odometry 2
(C/C++ code, Matlab, GPL lic) Libviso 2 is a very fast cross-platform (Linux, Windows) C++ library with MATLAB wrappers for computing the 6 DOF motion of a moving mono/stereo camera.

Posted in Apps Development, C, Computer Hardware, Computer Network & Security, CUDA, Game Development, GPU (CUDA), GPU Accelareted, Graphics Cards, Image Processing, OpenCV, PARALLEL, Simulation, Virtualization

How to Mirror a Hard Drive

Posted by Hemprasad Y. Badgujar on February 10, 2014

How to Mirror a Hard Drive

From music libraries and family photos to important documents and operating files, losing a hard drive can be a logistical and emotional nightmare. In a few simple steps, however, you can mirror a hard drive, creating a backup copy identical to your main drive. Mirroring a hard drive will protect your important files in the event of a computer virus or hardware failure.


  1. Decide whether you want to create the mirrored drive on your existing hard drive or on a separate, external drive.

  2. Download or purchase the necessary software. There are many inexpensive software solutions that streamline the process down to just a few clicks. Choose a program that provides reliable backup and disaster recovery of systems, applications, settings and personal files.

  3. If you want to create the mirror on your existing hard drive, you will first need to partition it. Right-click the "Computer" icon on the desktop and choose "Manage".

  4. In the Management Console, select "Disk Management" in the left pane. It will display all disks. Click the appropriate disk to create a partition and follow the prompts.

  5. After creating the partition, format it so it is ready to receive data. (When you create a partition with the Disk Management tool, it usually offers to format the new partition automatically or to leave it unformatted.)

  6. Load and run your mirroring software. The software provides a user-friendly interface that will guide you through the installation and set-up process.

  7. Choose the drive you want to mirror from the list of available drives. In most cases, you will want to select the C drive.

  8. Select a location to save your mirror copy. This location will be either your newly partitioned hard drive or an external drive.

  9. Click the "Start" button to begin the mirroring process. Depending on the size of your files and the write speed of your hard drive, the mirroring process may take anywhere from a few minutes to several hours.

  10. Test the new drive to ensure your files were successfully copied. If you mirrored your entire drive, you can test the new drive by removing the old drive. Your computer should successfully boot up from the new drive.
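On Linux, the same result can be sketched from the command line with dd. This is a minimal example, not a recommendation of a specific tool; the device names /dev/sda (source) and /dev/sdb (destination) are placeholders (verify yours with lsblk first, because dd silently overwrites the destination):

```shell
# Clone one drive to another, block for block.
SRC=/dev/sda   # placeholder: your source drive
DST=/dev/sdb   # placeholder: your destination drive

# bs=64K: copy in 64 KiB chunks; conv=noerror,sync: keep going past read
# errors, padding failed blocks; status=progress: show a running byte count.
sudo dd if="$SRC" of="$DST" bs=64K conv=noerror,sync status=progress

# Verify the mirror (sizes permitting) by comparing the two drives.
sudo cmp "$SRC" "$DST" && echo "Mirror verified"
```

The same invocation works on file-backed images, which is a safe way to rehearse it before pointing it at real hardware.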

Posted in Computer Hardwares, Computer Network & Security

Five free, dead-easy IP traffic monitoring tools

Posted by Hemprasad Y. Badgujar on January 30, 2014

When you really need to know what’s going on with your network, give one of these free monitors a try.

Monitoring your network can be a real pain. First and foremost, what tool should you use? Everyone you ask will give you a different answer. Each answer will reflect a different set of requirements and, in some cases, fill completely different needs. Here are the five network monitors I prefer, based on two criteria: They’re free (as in cost) and easy to use. You might not agree with the choices, but at the price point, you’d be hard pressed to find better solutions.

1: Wireshark

Wireshark (Figure A) has always been my go-to monitor. When most other monitors fail to find what I want, Wireshark doesn’t let me down. Wireshark is a cross-platform analyzer that does deep inspection of hundreds of protocols. It does live capture and capture save (for offline browsing), which can be viewed in GUI or tty mode. Wireshark also does VoIP analysis and can read/write many capture formats (tcpdump, Pcap NG, Catapult DCT2000, Cisco Secure IDS iplog, Microsoft Network Monitor, and many more).

Figure A

2: Angry IP Scanner

Angry IP Scanner (Figure B) is one of the easiest to use of all the IP scanners. It has a user-friendly GUI that can scan IP addresses (and their ports) in any range. Angry IP Scanner is cross platform and doesn't require installation, so you can use it as a portable scanner. It can gather NetBIOS information, scan favorite IP address ranges, detect web servers, use customizable openers, and much more. This little scanner makes use of multiple threads, so it's fairly fast. Source code is available on the download page.

Figure B

3: Zenmap

Zenmap (Figure C) is a graphical front end to the cross-platform Nmap tool. Nmap can scan huge networks, is portable, free, and well documented. It’s one of the most powerful IP traffic monitors, but that power comes with a price: complexity. Zenmap takes Nmap and makes it more accessible to users who prefer to avoid the command line. That does not mean Zenmap is the easiest of the lot. You still need to use some commands. But Zenmap offers a powerful wizard-like tool to help you through the process.

Figure C

4: Colasoft Capsa Free

If you’re an admin used to more Windows-like tools, Capsa Free (Figure D) might be the perfect tool for you. There are actually two versions of Capsa: paid and free. The free version should be enough in most cases. It provides an easy-to-use dashboard you can use to create various types of captures. Capsa Free also offers plenty of alarm configurations so you can be alerted when something occurs. And it can capture more than 300 network protocols, so you won’t be missing out on anything with this free tool.

Figure D

5: EtherApe

EtherApe is a Linux-only tool modeled after the classic etherman monitor. It's unique in that it offers an easy-to-use mapping of IP traffic on your network. It does this in real time and gives you a clear picture of your overall network traffic. You can create filters (using pcap syntax) to make reading the map easier. As you can see in Figure E, a busy network can get rather challenging to read. EtherApe displays both nodes and links in the color of the most-used protocol, so it's easier to take a quick glance, even on a busy network.

Figure E

More tools?

A lot of network monitoring tools are out there, and some of them do more auditing than the tools listed here. But when you really need to know what's going on with your network, one of the above tools will do a great job.

Have you used any of these tools? What other free scanners have you tried?

Posted in CLOUD, Computer Hardwares, Computer Languages, Computer Network & Security, Computer Softwares, Computing Technology, User Authentication

Directory of Open Source Broadcasting Projects

Posted by Hemprasad Y. Badgujar on January 3, 2014

Directory of Open Source Broadcasting Projects

Audio production

Graphics / CG


Open Source Hardware

Radio Automation



Video Play-out

Video production

Video Transcoding

Posted in Animation, Apps Development, Computer Games, Computer Languages, Computer Network & Security, Computer Softwares, Game Development, Network Devices, Research Menu

CUDA Thread Execution Model

Posted by Hemprasad Y. Badgujar on July 22, 2013

CUDA Thread Execution Model


Grid of Thread Blocks

In a previous article, I gave an introduction to programming with CUDA. Now I'd like to go into a little bit more depth about the CUDA thread execution model and the architecture of a CUDA enabled GPU. I assume that the reader has basic knowledge about CUDA and already knows how to set up a project that uses the CUDA runtime API. If you don't know how to set up a project with CUDA, you can refer to my previous article: Introduction to CUDA.



GPU Architecture

To understand the thread execution model for modern GPUs, we must first make an analysis of the GPU compute architecture. In this article I will focus on the Fermi compute architecture found in modern GPUs (e.g., the GTX 580).

Overview of the Fermi Architecture

A Fermi GPU consists of 512 CUDA cores. These 512 CUDA cores are split across 16 Streaming Multiprocessors (SMs), each SM consisting of 32 CUDA cores. The GPU has six 64-bit memory partitions supporting up to 6 GB of GDDR5 DRAM memory.

Fermi Architecture

Each streaming multiprocessor (SM) has 32 CUDA cores. Each CUDA core consists of an integer arithmetic logic unit (ALU) and a floating point unit (FPU).

Fermi Streaming Multiprocessor (SM)

The SM has 16 load/store units allowing source and destination addresses to be calculated for sixteen threads per clock.

Each SM also has four Special Function Units (SFU) that execute transcendental instructions such as sin, cosine, reciprocal, and square root.

CUDA Threads

Now that we’ve seen the specific architecture of a Fermi GPU, let’s analyze the more general CUDA thread execution model.

Each kernel function is executed in a grid of threads. This grid is divided into blocks also known as thread blocks and each block is further divided into threads.

CUDA Execution Model

In the image above we see that this example grid is divided into nine thread blocks (3×3), each thread block consisting of nine threads (3×3), for a total of 81 threads for the kernel grid.

This image shows only a 2-dimensional grid. If the graphics device supports compute capability 2.0, the grid of thread blocks can be partitioned into 1, 2, or 3 dimensions; if the device supports compute capability 1.x, thread blocks can be partitioned into 1 or 2 dimensions (in that case, the 3rd dimension should always be set to 1).

The thread block is partitioned into individual threads and for all compute capabilities, threads can be partitioned into 1, 2, or 3 dimensions. The maximum number of threads that can be assigned to a thread block is 512 for devices with compute capability 1.x and 1024 threads for devices that support compute capability 2.0.

Compute Capability
Technical Specifications                                     1.0    1.1    1.2    1.3    2.0
Maximum dimensionality of a grid of thread blocks              2      2      2      2      3
Maximum x-, y-, or z-dimension of a grid of thread blocks  65535  65535  65535  65535  65535
Maximum dimensionality of a thread block                       3      3      3      3      3
Maximum x- or y-dimension of a block                         512    512    512    512   1024
Maximum z-dimension of a block                                64     64     64     64     64
Maximum number of threads per block                          512    512    512    512   1024

The number of blocks within a grid can be determined within a kernel by using the built-in variable gridDim, and the number of threads within a block can be determined by using the built-in variable blockDim.

A thread block is uniquely identified in a kernel function by using the built-in variable blockIdx, and a thread within a block is uniquely identified by using the built-in variable threadIdx.

The built-in variables gridDim, blockDim, blockIdx, and threadIdx are each 3-component structs with members x, y, and z.

With a 1-D kernel, the unique thread ID within a block is simply the x component of the threadIdx variable:

threadID = threadIdx.x;

and the unique block ID within a grid is the x component of the blockIdx variable:

blockID = blockIdx.x;

To determine the unique thread ID in a 2-D block, you would use the following formula:

threadID = ( threadIdx.y * blockDim.x ) + threadIdx.x;

and to determine the unique block ID within a 2-D grid, you would use the following formula:

blockID = ( blockIdx.y * gridDim.x ) + blockIdx.x;

I'll leave it as an exercise for the reader to determine the formulas to compute the unique thread ID and block ID in a 3-D grid.

Matrix Addition Example

Let’s take a look at an example kernel that one might execute.

Let’s assume we want to implement a kernel function that adds two matrices and stores the result in a 3rd.

The general formula for matrix addition is:

C(i,j) = A(i,j) + B(i,j)

That is, the sum of matrix A and matrix B is the element-wise sum of the components of matrix A and matrix B.

Let’s first write the host version of this method that we would execute on the CPU.

void MatrixAddHost( float* C, float* A, float* B, unsigned int matrixRank )
{
    for ( unsigned int j = 0; j < matrixRank; ++j )
    {
        for ( unsigned int i = 0; i < matrixRank; ++i )
        {
            unsigned int index = ( j * matrixRank ) + i;
            C[index] = A[index] + B[index];
        }
    }
}
This is a pretty standard method that loops through the rows and columns of a matrix and adds the components and stores the results in a 3rd. Now let’s see how we might execute this kernel on the GPU using CUDA.

First, we need to think of the problem domain. In this case, the domain is trivial: it is the components of a matrix. Since we are operating on 2-D arrays, it seems reasonable to split our domain into two dimensions; one for the rows, and another for the columns of the matrices.

We will assume that we are working on square matrices. This simplifies the problem but mathematically matrix addition only requires that the two matrices have the same number of rows and columns but does not have the requirement that the matrices must be square.

Since we know that a kernel is limited to 512 threads/block with compute capability 1.x and 1024 threads/block with compute capability 2.0, we can split our job into square thread blocks of 16×16 threads (256 threads per block) with compute capability 1.x, or 32×32 threads (1024 threads per block) with compute capability 2.0.

For simplicity, I will assume compute capability 1.x for the remainder of this tutorial.

If we limit the size of our matrix to no larger than 16×16, then we only need a single block to compute the matrix sum and our kernel execution configuration might look something like this:

dim3 gridDim( 1, 1, 1 );
dim3 blockDim( matrixRank, matrixRank, 1 );
MatrixAddDevice<<<gridDim, blockDim>>>( C, A, B, matrixRank );

In this simple case, the kernel grid consists of only a single block with matrixRank × matrixRank threads.

However, if we want to sum matrices with more than 512 components (the per-block thread limit for compute capability 1.x), then we must split our problem domain into smaller groups that can be processed in multiple blocks.

Let's assume that we want to limit our blocks to execute in 16×16 (256) threads. We can determine the number of blocks per grid dimension by dividing the size of the matrix dimension by the maximum number of threads per block dimension and rounding up to the nearest whole number:

blocks = ceil( matrixRank / 16 )

And we can determine the number of threads per block dimension by dividing the size of the matrix dimension by the number of blocks and rounding up to the nearest whole number:

threads = ceil( matrixRank / blocks )

So for example, for a 4×4 matrix, we would get:

blocks = ceil( 4 / 16 ) = 1

and the number of threads is computed as:

threads = ceil( 4 / 1 ) = 4

resulting in a 1×1 grid of 4×4 thread blocks for a total of 16 threads.

For another example, a 512×512 matrix, we would get:

blocks = ceil( 512 / 16 ) = 32

and the number of threads is computed as:

threads = ceil( 512 / 32 ) = 16

resulting in a 32×32 grid of 16×16 thread blocks for a total of 262,144 threads.

The host code to setup the kernel granularity might look like this:

size_t blocks = ceilf( matrixRank / 16.0f );
dim3 gridDim( blocks, blocks, 1 );
size_t threads = ceilf( matrixRank / (float)blocks );
dim3 blockDim( threads, threads, 1 );
MatrixAddDevice<<< gridDim, blockDim >>>( C, A, B, matrixRank );

You may have noticed that if the size of the matrix does not fit nicely into equally divisible blocks, we may get more threads than are needed to process the array. It is not possible to configure a grid of thread blocks in which one block contains fewer threads than the others. One way to solve this is to execute multiple kernels: one that handles all the equally divisible blocks, and a second kernel invocation that handles the partial block. The other solution is simply to ignore any threads that fall outside our problem domain, which is generally easier (and more efficient) than invoking multiple kernels (though this should be profiled to be proven).

The Matrix Addition Kernel Function

On the device, one instance of the kernel function is executed for every thread in the problem domain (the matrix elements). We can use the built-in variables gridDim, blockDim, blockIdx, and threadIdx to identify the matrix element that the current thread is operating on.

If we assume we have a 9×9 matrix and we split the problem domain into 3×3 blocks each consisting of 3×3 threads as shown in the CUDA Grid below, then we could compute the ith column and the jth row of the matrix with the following formulas:

i = ( blockIdx.x * blockDim.x ) + threadIdx.x;
j = ( blockIdx.y * blockDim.y ) + threadIdx.y;

So for thread (0,0) of block (1,1) of our 9×9 matrix, we would get:

i = ( 1 * 3 ) + 0 = 3

for the column and:

j = ( 1 * 3 ) + 0 = 3

for the row.

The index into the 1-D buffer that stores the matrix is then computed as:

index = ( matrixRank * j ) + i

and substituting gives:

index = ( 9 * 3 ) + 3 = 30

which is the correct element in the matrix. This solution assumes we are accessing the matrix in row-major order.

CUDA Grid Example

Let’s see how we might implement this in the kernel.
__global__ void MatrixAddDevice( float* C, float* A, float* B, unsigned int matrixRank )
{
    unsigned int column = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    unsigned int row    = ( blockDim.y * blockIdx.y ) + threadIdx.y;

    unsigned int index = ( matrixRank * row ) + column;
    if ( column < matrixRank && row < matrixRank ) // prevent reading/writing out-of-bounds
    {
        C[index] = A[index] + B[index];
    }
}

On lines 3 and 4 we compute the column and row of the matrix element we are operating on, using the formulas shown earlier.

On line 6, the 1-D index into the matrix array is computed based on the size of a single dimension of the square matrix.

We must be careful that we don't try to read or write outside the bounds of the matrix. This can happen when the size of the matrix does not fit nicely into the size of the CUDA grid (matrices whose size is not evenly divisible by 16). To protect the read and write operations, on line 7 we check that the computed column and row both lie inside the matrix. (Note that checking only that index is smaller than matrixRank * matrixRank would not be enough: a thread whose column falls outside the matrix can still produce an index that lands inside the buffer, just at the wrong element.)

Thread Synchronization

CUDA provides a synchronization barrier for all threads in a block through the __syncthreads() method. A practical example of thread synchronization will be shown in a later article about optimizing a CUDA kernel, but for now it's only important that you know this functionality exists.

Thread synchronization is only possible across all threads in a block, but not across all threads running in the grid. By not allowing threads across blocks to be synchronized, CUDA enables blocks to be executed on any streaming multiprocessor (SM) in any order. The queue of blocks can be distributed to any SM without having to wait for blocks on another SM to complete. This allows CUDA enabled applications to scale across platforms: a device with more SMs executes more blocks concurrently than a platform with fewer SMs.

Thread synchronization follows strict rules: either all threads in a block reach the synchronization point, or none of them do.

Given the following code block:

if ( threadID % 2 == 0 )
{
    __syncthreads();
}
else
{
    __syncthreads();
}

the threads in a block will wait indefinitely for each other, because the two occurrences of __syncthreads are considered separate synchronization points, and all threads of the same block must hit the same synchronization point, or none of them must hit it.

Thread Assignment

When a kernel is invoked, the CUDA runtime will distribute the blocks across the SMs on the device. A maximum of 8 blocks (regardless of platform) will be assigned to each SM, as long as there are enough resources (registers, shared memory, and threads) to execute all the blocks. Where there are not enough resources on the SM, the CUDA runtime will automatically assign fewer blocks per SM until the resource usage is below the maximum per SM.

The total number of blocks that can be executed concurrently is dependent on the device. In the case of the Fermi architecture discussed earlier, each of the 16 SMs can concurrently handle 8 blocks, for a total of 128 blocks executing concurrently on the device.

Because the Fermi architecture supports compute capability 2.0, we can create thread blocks consisting of at most 1024 threads, so a Fermi device can technically have 131,072 threads (16 SMs × 8 blocks × 1024 threads) resident in the SMs for execution. This does not mean that on every clock tick the device executes 131,072 instructions simultaneously. To understand how the blocks are actually executed on the device, we must look one step further, at how the threads of a block are scheduled on the SMs.

Thread Scheduling

When a block is assigned to an SM, it is further divided into groups of 32 threads called warps. Warp scheduling differs between platforms, but if we take a look at the Fermi architecture, we see that a single SM consists of 32 CUDA cores (or streaming processors), in two groups of 16 per SM.

Each SM in the Fermi architecture (see the Fermi architecture image above) features two warp schedulers, allowing two warps to be issued and executed concurrently. Fermi's dual-warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four special function units (SFUs).

Most instructions can be dual-issued; two integer instructions, two floating point instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently.

Fermi Dual Warp Scheduler

You might be wondering why it would be useful to schedule up to 8 blocks of a maximum of 1024 threads each if the SM only has 32 SPs. The answer is that each instruction of a kernel may require more than a few clock cycles to execute (for example, an instruction that reads from global memory will require multiple clock cycles). Any instruction that requires multiple clock cycles to execute incurs latency. The latency of long-running instructions can be hidden by executing instructions from other warps while waiting for the result of the previous warp. This technique of filling the latency of expensive operations with work from other threads is often called latency hiding.

Thread Divergence

It is reasonable to imagine that your CUDA program contains flow-control statements like if-then-else, switch, while loops, or for loops. Whenever you introduce these flow-control statements in your code, you also introduce the possibility of thread divergence. It is important to be aware of the consequences of thread divergence and to understand how you can minimize its negative impact.

Thread divergence occurs when some threads in a warp follow a different execution path than others. Let’s take the following code block as an example:
__global__ void TestDivergence( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    float value = 0.0f;
    if ( threadIdx.x % 2 == 0 )
    {
        // Threads executing PathA are active while threads
        // executing PathB are inactive.
        value = PathA( src );
    }
    else
    {
        // Threads executing PathB are active while threads
        // executing PathA are inactive.
        value = PathB( src );
    }
    // Threads converge here again and execute in parallel.
    dst[index] = value;
}

Then our flow control and thread divergence would look something like this:

Thread Divergence

As you can see from this example, the even-numbered threads in each block will execute PathA while the odd-numbered threads in the block will execute PathB. This is pretty much the worst-case scenario for a simple divergence example.

Both PathA and PathB cannot be executed concurrently on all threads because their execution paths are different. Only the threads that execute the exact same execution path can run concurrently so the total running time of the warp is the sum of the execution time of both PathA and PathB.

In this example, the threads in the warp for which the condition is true are activated and execute PathA while all the other threads are deactivated. Then, in another pass, the threads for which the condition is false are activated and execute PathB while the other threads are deactivated. This means that resolving the condition requires two passes through a single warp.

The overhead of having the warp execute both PathA and PathB can be eliminated if the programmer takes careful consideration when writing the kernel. If possible, all threads of a block (since warps can’t span thread blocks) should execute the same execution path. This way you guarantee that all threads in a warp will execute the same execution path and there will be no thread divergence within a block.


If a device supports compute capability 1.3 then it can have blocks with a maximum of 512 threads/block and 8 blocks/SM can be scheduled concurrently. Each SM can schedule groups of 32-thread units called warps. The maximum number of resident warps per SM in a device that supports compute capability 1.3 is 32 and the maximum number of resident threads per SM is 1024.

Q. What would be the ideal block granularity to compute the product of two 2-D matrices of size 1024 x 1024?

  1. 4×4?
  2. 8×8?
  3. 16×16?
  4. or 32×32?

A. To answer this question, let’s analyze each choice and give pros and cons for each one.

4×4: If we decide to split our domain into 4×4 thread blocks, then we have 16 threads per block. To fully occupy an SM that supports 1024 threads, we would need 1024/16 = 64 blocks, but the SM can only schedule 8 blocks, so each SM would be scheduled with 8 blocks of 16 threads each, which is 128 threads/SM. When divided into warps, we only have 4 warps scheduled per SM out of a possible 32, which gives only 12.5% occupancy.

8×8: We have the same problem here as we did with the 4×4 thread block granularity, except not as severe. With 8×8 thread blocks, we get 64 threads per block. For an SM that supports 1024 threads, we would need 1024/64 = 16 blocks, but since we are limited to 8 blocks per SM, we can only execute 8 × 64 = 512 threads/SM. When split into warps of 32 threads each, we get 512/32 = 16 warps scheduled per SM out of a possible 32 warps. This gives only 50% occupancy.

16×16: A 16×16 thread block gives 256 threads/block. With a maximum thread limit per SM of 1024, we get 1024/256 = 4 blocks/SM. This is within the 8 block limit so 4 blocks, each of 256 threads can be scheduled on one SM. With 4 blocks each with 256 threads, we get a total of 1024 threads. The threads are further split into warps of 32 threads each for a total of 32 warps. Since the device can support 32 warps/SM we have achieved 100% occupancy.

32×32: This option is not viable: a 32×32 thread block produces a single block of 1024 threads. As stated earlier, we are limited to 512 threads per block with compute capability 1.3, so the kernel wouldn’t even run.

So the best choice for this problem domain would be to invoke the kernel with a block size of 16×16.
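The analysis above can be checked mechanically. The following sketch (Python for convenience; the limit constants are the compute capability 1.3 values quoted earlier) reproduces the occupancy figures for all four choices:

```python
# Occupancy check for compute capability 1.3 (limits taken from the text).
MAX_THREADS_PER_BLOCK = 512
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_SM = 1024
MAX_WARPS_PER_SM = 32
WARP_SIZE = 32

def occupancy(block_dim):
    """Occupancy for a square block_dim x block_dim thread block."""
    threads_per_block = block_dim * block_dim
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        return 0.0  # the kernel will not even launch
    blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)
    warps = (blocks * threads_per_block) // WARP_SIZE
    return warps / MAX_WARPS_PER_SM

for dim in (4, 8, 16, 32):
    print(f"{dim}x{dim}: {occupancy(dim):.1%}")
```

Running it prints 12.5%, 50.0%, 100.0%, and 0.0% respectively, matching the discussion above.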


In this article, I discussed the architecture of a CUDA enabled GPU, in particular the Fermi architecture. I also showed how a kernel function is scheduled on the GPU and how the warp scheduler executes instructions from different warps in order to minimize the amount of noticeable latency between kernel instructions.


Posted on November 15, 2011 by 

Posted in Computer Languages, Computing Technology, CUDA, Game Development, GPU (CUDA), Graphics Cards, PARALLEL | Leave a Comment »

UPDATED! AMD Announces FX-9590 and FX-9370: Return of the GHz Race

Posted by Hemprasad Y. Badgujar on July 18, 2013

AMD Announces FX-9590 and FX-9370: Return of the GHz Race

Today at E3 AMD announced their latest CPUs, the FX-9590 and FX-9370. Similar to what we’re seeing with Richland vs. Trinity, AMD is incrementing the series number to 9000 while sticking with the existing Piledriver Vishera architecture. These chips are the result of tuning and binning on GlobalFoundries’ 32nm SOI process, yet the clock speed jump over the existing FX-8350 is quite impressive.

The FX-8350 had a base clock of 4.0GHz with a maximum Turbo Core clock of 4.2GHz; the FX-9590 in contrast has a maximum Turbo clock of 5GHz and the FX-9370 tops out at 4.7GHz. We’ve asked AMD for details on the base clocks for the new parts, but so far have not yet received a response; we’re also missing details on TDP, cache size, etc. but those will likely be the same as the FX-8350/8320 (at least for everything but TDP).

Posted in Computer Hardwares, Computing Technology, Graphics Cards, PARALLEL | 1 Comment »

CPU vs GPU performance

Posted by Hemprasad Y. Badgujar on July 18, 2013

Charting the performance of GPUs and CPUs has to be quite a compromise, since actual performance depends heavily on the suitability of the chip to a particular problem or algorithm, among many other specifics. The simplest method is to plot theoretical peak performance over time; I chose to show it for single and double precision for NVIDIA GPUs and Intel CPUs.

In the past, I have used the graph in the CUDA C Programming Guide, but that is frequently out of date, I have no control of the formatting, I have to settle for a screenshot instead of vector output, and, until I did my own research, I wasn’t sure if it was biased.

Below is Michael Galloy’s attempt (click to enlarge).

CPU vs. GPU performance

by Michael Galloy

Posted in CLOUD, Computer Hardwares, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, Open CL, PARALLEL | Leave a Comment »

Remote Desktop Protocol (RDP)

Posted by Hemprasad Y. Badgujar on March 4, 2013

Remote Desktop Protocol (RDP)

In May 1997, Microsoft began developing a protocol for exchanges between terminal servers and their Windows OS clients. This protocol is called RDP (remote desktop protocol) and is based on International Telecommunication Union (ITU) standards. In particular, RDP is based on the standards of the T.120 protocol family, especially on T.125 Multipoint Communication Service-Protocol Specification (MCS) and T.128 Application Sharing. RDP is also strongly aligned with communication mechanisms that were already in use for data exchange under Microsoft NetMeeting.

Any device can be a client as long as it has an output medium, a mouse, and a keyboard. It also needs to be able to communicate over the network using RDP. Further intelligence is not needed on the client side. The currently officially available Microsoft RDP clients support only Windows CE, the 32-bit and 64-bit Windows operating systems, an ActiveX control element for Microsoft Internet Explorer, and Apple Mac OS X. There is also an open-source RDP client available that works excellently under Linux.

Figure 1: RDP protocol integration under Windows Server 2003.


RDP Architecture

The RDP protocol allows communication via up to 64,000 channels. The screen is transmitted as a raster graphic (bitmap) from the server to the client or terminal. The client transmits the keyboard and mouse interactions to the server. The communication is extremely asymmetric. Most data is transmitted from the server to the client.

RDP was originally designed to support different network topologies. In its current state, it can be executed only via TCP/IP networks and is internally divided into several layers. The reason for this, at the lowest level, is that the T.120 protocol family, on which RDP is based, was optimized in accordance with some rather complex specifications of the ISO model, mostly grade-of-service mechanisms. Because these cannot be mapped to the TCP/IP protocol, an X.224-compatible adaptation layer handles mapping the specified service primitives of the ISO layer to the service primitives of the TCP/IP protocol.

Basically, only four service primitives are needed, three of which are for connection administration: connection request, connection confirmation, and disconnection request. Connection and disconnection come from the client. When the server ends the connection, the client is not specially notified. In this case, the client implementation needs to address the defined behavior through corresponding exception handling. The fourth service primitive handles data transmission.

The layer above this one provides multicast services. Multicast services allow both point-to-point and point-to-multipoint connections. This is the only way to implement functionality with several endpoints, for example, remote control.

A special security layer comprises all encryption and signature services. It keeps unauthorized users from monitoring the RDP connection and prevents the transmitted data stream from being modified. The RC4 algorithm by RSA Inc. is used for encryption. A signature that consists of a combination of the MD5 and SHA-1 algorithms prevents data manipulation. Additionally, the security layer manages the transmission of the user authentication and the relevant licenses.

The truly relevant layer handles the transmission of mouse and keyboard input and display output. This mechanism is relatively complex. It negotiates the operations used during connection, and it manages the caching information of some data, which reduces the network load significantly.

Over the course of its development, the RDP protocol was further adapted to Windows NT, Windows 2000, Windows XP, Windows Server 2003, and their applications. Many expansions of the protocol relate to this specific type of environment. The original program transmits data over the RDP protocol stack to the TCP/IP protocol stack. Through the layer model described earlier, the data is directed to a channel, encrypted, divided into predefined sections, adapted to the network protocol, addressed, and sent on its way. On the receiving end, this process occurs in reverse, making the data available to the target program. Licensing modalities are monitored here and encryption methods selected. Additional services manage the protocol-specific adjustments in user mode.

Windows Server 2003 Terminal Services is independent of the RDP protocol. The protocol is basically exchangeable; Terminal Services provides a flexible platform for multi-user mode. This platform allows other manufacturers to develop alternative protocols that use Terminal Services functions.

For performance reasons, a kernel driver provides the remaining runtime environment to generate the RDP data stream. The kernel portion is subdivided into a section for the transport protocol (TCP/IP) and a section for the session-specific communication protocol (RDP). The latter is executed by the Termdd driver, which transfers mouse and keyboard input from the remote clients to the kernel.

RDP Capabilities

A terminal server does not know which type of client will contact it next. Therefore, all parameters that characterize the client and describe its functions need to be transmitted when connecting. A set of predefined capabilities is used for this purpose.

Knowledge of client capabilities allows a terminal server to respond flexibly to a client’s requirements. The capabilities are divided into several groups:


  • General abilities: client platform and operating system used, protocol version, and data compression supported.
  • Video: bitmaps, desktop size, preferred color depth, supported color depths, and bitmap compression.
  • Character commands: support for local character operations, for example, text output or administration for overlaying and redrawing panels.
  • Bitmap cache: temporary or persistent caching of frequently used bitmaps on the client.
  • Color table: supports a palette for drawing pixels in bitmaps.
  • Panel activation: controls the active application window when viewed singly versus in the context of a complete desktop.
  • Remote control: supports remote administration, allowing a client to be controlled from a remote location.
  • Cursor: defines mouse-cursor color properties.

Some of these properties are vital to the performance of the RDP protocol and warrant a closer look.

Graphics Display Under RDP

If you view a user interface or a dynamic Windows application on a remote client, it is reminiscent of a digital movie. The action might be a bit abstract, and the cutover as application windows change or start is very abrupt. Displaying a graphical user interface on a remote client mainly requires the transmission of huge amounts of data. If modified screen sections were always transferred as a complete set from server to client, the data would easily total from several hundred kilobytes to more than a megabyte. For this reason, starting or ending a full-screen application is particularly expensive in terms of the required network bandwidth.

Therefore, optimizing the RDP protocol and its handling of graphical elements has been given the utmost attention and effort. The most efficient compression algorithms and administration mechanisms have been used, supporting almost all graphics operations of both a Windows desktop and Windows-based applications. The intended result is to keep computational effort and required network bandwidth at the lowest possible level.

The RDP graphics driver receives the graphics commands from the applications via the graphics device interface (GDI or GDI+). The application instructs the GDI(+) where and what to draw. The GDI(+) forwards the instructions to the RDP driver. Nevertheless, the GDI(+) does not recognize that the graphics elements are not output on a local screen. The RDP driver transmits that data to a remote client. This is exactly when optimization must occur. Most graphics operations focus on producing raster images (bitmaps), drawing graphics primitives, and displaying text characters.

A bitmap is a rectangle consisting of pixels of different colors, thus creating a specific pattern. For instance, an application symbol (icon) is such a bitmap. The display of a static bitmap can be accelerated using compression when transmitting. Compression is possible at low color depth through a simple run-length encoding scheme that transmits only one color value if several pixels of the same color appear in a row. The rates of compression can also be influenced by the applications used and by the design of the graphical user interface. For example, using many colors and horizontal color transitions is very critical. This is why the decoration of panels in a remote client session is much simpler than on the console. The total number of colors used, as defined under bitmap properties, is also very important in the selection of the compression algorithm and the resulting compression rate.

Color palettes can further optimize bitmap use. A table containing color values is created and transmitted to the client. If individual pixels in a bitmap need coloration, the position coordinates in the table are transmitted, not the color value. The amount of data is always smaller when the color depth is high but the number of simultaneously used colors is relatively low. Thus, an individual color value requires up to three bytes, whereas the color position in a table with a maximum of 256 entries needs only one byte.
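The savings from a color table can be checked with back-of-the-envelope arithmetic. In this Python sketch the 640 x 480 screen size is an assumed example, not a figure from the text:

```python
# Raw 24-bit color values versus 1-byte indices into a 256-entry color
# table (the 640 x 480 screen size here is an illustrative assumption).
width, height = 640, 480
pixels = width * height

truecolor_bytes = pixels * 3               # up to 3 bytes per pixel
palette_bytes = 256 * 3 + pixels * 1       # the table itself + 1 byte per pixel

print(truecolor_bytes)                              # 921600
print(palette_bytes)                                # 307968
print(round(truecolor_bytes / palette_bytes, 2))    # 2.99, roughly a 3x reduction
```

Even after paying for the 768-byte table, indexed pixels cut the transmitted volume to about a third, provided the image uses at most 256 simultaneous colors.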

Even more problematic are animated images, that is, animated bitmaps. They result in significantly higher transfer rates on the network, even if they are very small (for example, animated or blinking mouse cursor). In this instance, we need another mechanism to limit the data volume: caching.

How can the output of graphics be realized, for example, lines or rectangles that make up many elements of the user interface? The easiest solution is to draw a line by transmitting each and every pixel. A much more efficient method is a command that defines the start and endpoint, thickness, and color of a line. The rest can be calculated by the output device. Naturally, this feature also uses much more complex commands, such as redrawing a window after it has been overlapped. Here, caching also plays a key role in temporarily storing data of all panel elements and contents.

Individual letters in character strings are managed by a special kind of bitmap, the glyph. In principle, the output of characters using glyphs is much easier than the transmittal of letter bitmaps. A command requests only the output of a glyph at a set screen position. However, the number of different fonts and font sizes of the current font sets is quite problematic, as is the initial transport of the glyphs to the client.


We already mentioned another mechanism to reduce the amount of data transmitted. It is called caching, or temporary storage. In this process, the client reserves memory that houses frequently used image fragments that can be displayed again without any network transmission taking place. Some character operations do not work at all without caching, due to technical reasons.

Caching functions are not realized solely through the client’s main memory. If local hard drives exist, they can permit so-called persistent caching, in which the cached data is still available even after a client reboot. In principle, special network components can store RDP data and thus provide a global caching space on a LAN segment.

RDP supports the following caches:

  • Bitmap cache: Cache for different kinds of bitmaps. Size and number of these caches is determined upon connection.
  • Font cache: Cache for glyphs (character bitmaps). The cache size must be sufficient for storing all characters of a defined character set.
  • Desktop cache: Cache for a desktop screenshot. A character command stores or outputs this special bitmap.
  • Cursor cache: Cache for mouse cursors that need special handling. Special graphics algorithms ensure that the correct mouse cursor is displayed on the desktop without generating a lot of network traffic and overloading the local processor resources. Drag-and-drop operations with the mouse therefore do not require the transmission of new graphical elements, but only the transmission of the new coordinates.
  • Text cache: Cache for frequently used character strings and the corresponding formatting information. Glyphs from the font cache are used to generate a complete character string.

Virtual Channels and Mirroring

An RDP client or application can use a virtual channel to transmit specific information. Virtual channels thus help add functions that are not yet specified in the RDP protocol. They represent a platform that future developments can be based on without having to modify the communication methods between a terminal server and its clients. With Windows Server 2003 and RDP 5.2, virtual channels are used for joint client and server clipboards and for redirecting print jobs to local client printers.

Basically, the RDP protocol allows distribution of data streams from one source to many destinations without having to send the data separately. An application can therefore be mirrored on another client. Even the point of input can be transferred from one user to another.

RDP Protocol Features

The features of the RDP protocol play a key role in the wide acceptance of terminal servers. RDP 5.0, deployed under Windows 2000, already set the direction for the future. The most important RDP features follow:

  • 256-color (8-bit) support
  • Remote monitoring
  • 56-bit and 128-bit encryption options
  • Improved compression and caching to reduce network traffic
  • Option to restore the connection to an existing user session
  • Connection to printers and the clipboard on the client

RDP 5.1 was established as an integral part of Windows XP when it was launched. Overall, this version’s improved and new properties strengthened Terminal Services significantly. However, without a suitable server, Windows XP and RDP 5.1 could not exploit the full potential the new technology offered.

So which new properties did RDP 5.1 offer? The following list provides an overview:

  • Supports up to 16 million colors (24 bit)
  • Provides screen resolutions from 640 x 480 up to 1600 x 1200 pixels
  • Maps predefined, local key combinations to the user session
  • Connects to the client’s local file system
  • Provides improved support of client and network printer integration
  • Transmits audio data to the client. The required bandwidth can be controlled. MIDI data integration is not supported.
  • Redirects serial ports from the client to the user session on the terminal server
  • Permits use of the console session from a remote client. All administrative tasks can be performed without physically accessing the peripheral server devices.
  • Requests time-zone information from the client and adjusts the user session accordingly.
  • Supports smart cards if a corresponding reading device was installed on the client
  • Optimizes bandwidth settings to support different network scenarios (intranet, modem, etc.)

Windows Server 2003 RDP 5.2 has all properties of its predecessor with a few minor additions, such as automatic reconnection of ended sessions.

Required Network Bandwidth

It is very important to know how much bandwidth the RDP protocol uses on the network. Unfortunately, there is no universal answer to this question because it is influenced by so many external conditions, such as:

  • Color depth and screen resolution set and supported on the client
  • Encryption and compression
  • Desktop backgrounds allowed or prohibited, window contents displayed while dragging, animated menus and windows, designs, or bitmap caching
  • Client and server platform capacity
  • Type and combination of applications used
  • User behavior

RDP is usable at a bandwidth of 20 kilobits per second (Kbps), and most efficient at 50 Kbps. If there is sufficient bandwidth, the RDP protocol can use up to 500 Kbps. For instance, it uses more bandwidth if a program produces large graphics quickly in series, and if it is limited only by the output channel’s capacity. This situation occurs when computer games, videos, animations, or certain benchmark programs are launched via an RDP client. Therefore, these programs are normally not suitable for running on terminal servers.

Only comprehensive tests can give us more detailed information on the bandwidth required in a target environment. Just be prepared for some surprises! The RDP protocol does not always behave as expected. For instance, it is not necessarily true that increasing the supported color depth from 8 bits to 16 bits inevitably results in poorer performance. An 8-bit color space was supported in the past, and for compatibility reasons, the algorithms used then are still available. Modern algorithms are generally faster and optimized to a larger color space. So using more colors could actually improve performance.

Client and server platforms play a key role, too. Network bandwidth can be reduced through compression. However, if the client is too slow to process the received data stream and simultaneously decrypt it, the sessions will not run smoothly. Therefore, optimizing the transmission rate by prohibiting desktop backgrounds or animated menus is of little use.

Figure 2 shows a great example of a measurement with limited validity. Six RDP sessions, each configured differently, were executed in sequence on one client. The bandwidth required by a 16-bit program for graphical load tests was measured for each session. During the measurement, both computer platforms had exclusive use of a 100-megabit network. Each session had its own user account in order to prevent mutual influence. The 16-bit test program was run within each user session in the 16-bit emulation environment.

Figure 2: Result of a bandwidth test with several RDP sessions, all configured differently, on the system monitor.

The results are very similar, and peak loads are all between 25 and 35 kilobytes per second (KBps). This is even more surprising when we review the individual RDP session parameters, as shown in the following table:

We see that the different RDP session configurations resulted mainly in different run times of the test program. The run times differed by up to 50 percent, depending on the parameters selected. However, the values for the network peak loads did not differ very much. It is rather noteworthy, though, that the highest peak load occurred during session 9 configured with only 256 colors (8 bit). In this case, the display speed on the client had the highest value, which correlated with the corresponding demands of the network.

In contrast, Figure 3 shows a normal RDP session whose parameters match the ones used in session ID 40 from Table 1. Again, both computer platforms had exclusive use of a 100-megabit network without bandwidth limitations. The user of the measured session first started the Paint program and opened a large graphic. The user then started WordPad, opened an existing document, and started adding to it. The application windows were continuously dragged on the desktop. Short-term network load peaks of more than 10 Kbps were observed only when a new application was opened and the window needed to be transmitted for the first time. The processor load was also moderate during these actions, which could be attributed to the fact that the user was the only one on the terminal server.

Table 1: Configuration of the Individual RDP Sessions for the Bandwidth Test

Session ID | Resolution | Color Depth | Bitmap Caching | Network Optimization
7          | 1024 x 768 | 16 bit      | Yes            | LAN (10 Mbps)
8          | 640 x 480  | 16 bit      | Yes            | LAN (10 Mbps)
9          | 1024 x 768 | 8 bit       | Yes            | LAN (10 Mbps)
10         | 640 x 480  | 8 bit       | Yes            | LAN (10 Mbps)
11         | 1024 x 768 | 16 bit      | No             | Modem (28.8 Kbps)
12         | 640 x 480  | 16 bit      | No             | Modem (28.8 Kbps)

The average bandwidth measured was 2.6 kilobytes per second, which corresponds to approximately 20 kilobits per second. If the session were run over a 56-Kbps modem, the bandwidth would definitely suffice. The only difference observed was a short delay in the initial construction of the application windows. Optimizing the transmission rate in the client settings would further reduce the bandwidth required. This scenario would result in relatively smooth operation even on very slow network connections.

Figure 3: An RDP session with normal user actions is measured. The thick line represents the corresponding network bandwidth in Kbps; the thin line displays the processor load.


User satisfaction in a terminal-server environment is even more influenced by network delays (latency) than by nominally available bandwidth. For instance, a wide area network (WAN) is usually unsuitable for a terminal server, despite its extremely high bandwidth. Signal transmission from user input to client response can take up to one second, because the signal has a long way to travel: from mouse or keyboard to the server, eliciting a graphical response from the corresponding user session, which is then transmitted back to the client.

In addition, not every keystroke made on the RDP client is immediately sent to the server. An input buffer first collects user input to send it as a package. This can also lead to obvious delays in the response behavior of application programs. However, the problem can be resolved by modifying the parameters for buffering RDP data streams.

If you have ever tried to run a highly interactive application program smoothly, you know it can be rather problematic. Users often also insist on a certain grade of service in terms of maximum response times and screen change times. Therefore, latency in production terminal server environments should be less than 100 milliseconds (ms). At 150 to 200 ms, delays are very obvious, and at 300 to 500 ms, users normally no longer tolerate system behavior.

A simple method to measure latency between two network nodes (for example, RDP client and terminal server) is the ping command. This command invokes a service program for measuring connections to a remote computer. Through echo request and echo reply packets, ping determines whether a certain IP node (that is, a certain computer system) within a network is operational.

A typical ping command contains a number of parameters that further specify the command’s behavior. Checking latency between RDP clients and terminal server would look like this:

ping -n 20 -l 200 <terminal server>


-t : Continuously sends ping signals to the computer specified until interrupted (used instead of -n, not together with it).

-n count : Sends the number of echo requests defined. Default = 4.

-l length : Sends echo packets with the amount of data specified by length. Default = 32 bytes, maximum value = 65,527 bytes. The typical length of an RDP package is between 50 bytes and 1000 bytes.
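Where ICMP is blocked, a rough alternative is to time a TCP connection to the server's RDP port. The following Python sketch is our own illustration, not part of any RDP tooling; the host name is a placeholder and 3389 is the standard RDP port.

```python
# A rough TCP round-trip check as an alternative to ping (useful where
# ICMP is blocked). Host and port below are placeholders; 3389 is the
# standard RDP port.
import socket
import time

def tcp_rtt_ms(host, port, attempts=5, timeout=2.0):
    """Return the average time in ms to open a TCP connection to host:port."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection established; close it immediately
        samples.append((time.perf_counter() - start) * 1000.0)
    return sum(samples) / len(samples)

# Example (placeholder host name):
# print(f"avg connect RTT: {tcp_rtt_ms('terminal-server', 3389):.1f} ms")
```

Note that a TCP handshake includes more than pure wire latency, so treat the result as an upper bound comparable to the 100 ms guideline above rather than an exact measurement.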

RDP and the Network Monitor

The network monitor is a system tool included in Windows Server 2003. It is used for complex operations on the network, especially for network analysis and troubleshooting. The network monitor or an agent connected to it collects all the data on a network adapter.

Both Windows Server 2003 and Microsoft Systems Management Server include the network monitor. At first glance, both tools appear to be identical. However, the operating system variant has some functional limitations. Only the full-fledged Systems Management Server network monitor can receive data packages from other network monitor agents, which allows a more global view of the full system. This is also valid for the network connections between two remote computers. The network monitor delivered with the terminal server can log only the connections of the local network adapter(s).

Note: The network monitor is not included in the default Windows Server 2003 installation. It can be installed by adjusting the installation options, or subsequently via the dialog box under Control Panel\Add or Remove Programs\Add/Remove Windows Components.

Figure 4: The network monitor logs the data traffic between a terminal server and an RDP client.

Network Monitor Functions

The network monitor allows you to define capture filters so that only certain network data is stored for analysis. The filter options can be based on source and destination addresses of the network adapter, source and destination addresses of the protocol, and on pattern matching in the network data stream. As soon as a capture is complete, further filters can restrict the display or limit it to certain protocols.

To simplify the analysis, the collected binary network data streams are interpreted based on their protocol, edited, and displayed. Protocol headings are clearly separated from usable data. They are easy to read and, with a little practice, to understand.

If problems originate in the network during a terminal server session, the network monitor captures all the corresponding information. If you save the collected data in a file, you can analyze it later with the help of the network monitor. Sending this file to a remote specialist allows remote troubleshooting, too.

Caution: The network monitor is a powerful tool for collecting and displaying data streams in corporate networks. It is also an excellent means of targeting, capturing, and displaying unencrypted data. This also applies to unencrypted passwords. Therefore, an administrator should understand how a network monitor works in his or her corporate environment. The Microsoft Systems Management Server network monitor can trace all other network monitors operating on the network.

Network monitor agents greatly simplify terminal server management. All you need is an available network monitor and a basic understanding of the communication mechanisms to the different clients. With network monitor agents, you can also monitor data transmission between terminal servers and domain controllers, and between file and print servers. As you see, this is a valuable tool, especially for solving broadcast problems in complex, structured networks using routers and switches.

Analysis of the RDP Data Stream

The following system structure allows easy measurement of an RDP data stream:

A terminal server called WTSUS1 is waiting for a connection request from an RDP client.

The user of a client platform called STETBACH and an installed RDP client connects to the terminal server.

The network monitor is launched on the administrator console of the terminal server. Data collection begins just before connecting to the client. Now, the RDP session can be easily monitored in the network monitor. After the user of the RDP client has logged on and performed several standard actions on the desktop of the terminal server session, data collection can be stopped. You can now analyze the data by displaying it in a special window by pressing the F12 key. Alternatively, you can save the data in a file and reload and analyze it at a later time.

You will see a summary of the individual packages at the top of the analysis window of the network monitor. The protocol information recognized appears in the middle, and the raw data is displayed in hexadecimal form at the bottom. The summary includes an item number, the time the collection process started, the hardware’s source and target address, the protocol, a description, and the network’s source and target address. Because the network monitor does not have a module for recognition of RDP, all packages are labeled with the TCP protocol.

The first three frames of the RDP sequence contain the TCP connection setup (item numbers 1 to 3 in Figure 5). If you examine the first two rows in detail, you can recognize the TCP three-way handshake by the set synchronization bits (the S flag in the TCP protocol information in the middle pane). Frames two and three also carry the acknowledgement flag A for the previous frame. Frames four and five (item numbers 4 and 5) establish an ISO-over-TCP/IP connection according to Internet standard RFC 1006. This is necessary because the T.120 protocol family, on which RDP is based, requires an X.224-compatible transport protocol.
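The ISO-over-TCP/IP layer negotiated in frames four and five is easy to spot in the raw hexadecimal data because RFC 1006 prefixes every packet with a fixed 4-byte TPKT header: a version byte (always 3), a reserved byte, and a big-endian total length. As a minimal sketch (the function name and the example payload are illustrative, not taken from the capture), such a header could be parsed like this:

```python
import struct

def parse_tpkt_header(data: bytes) -> dict:
    """Parse the 4-byte TPKT header defined in RFC 1006.

    Every ISO-over-TCP/IP packet starts with:
      byte 0:   version (always 3)
      byte 1:   reserved (0)
      bytes 2-3: total packet length, big-endian, header included
    """
    if len(data) < 4:
        raise ValueError("need at least 4 bytes for a TPKT header")
    version, _reserved, length = struct.unpack("!BBH", data[:4])
    if version != 3:
        raise ValueError(f"not a TPKT packet (version={version})")
    return {"version": version, "length": length, "payload": data[4:length]}

# Hypothetical example: a TPKT header announcing a 19-byte packet
packet = bytes([0x03, 0x00, 0x00, 0x13]) + b"\x0e" * 15
print(parse_tpkt_header(packet)["length"])  # 19
```

Checking for the leading 03 00 byte pair is also a quick way to locate the start of each ISO packet in the network monitor's hex pane.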

Figure 5: The network monitor analyzes an RDP data stream.

Frame six (item 6 in Figure 5) initializes the T.120 connection and transmits the first parameters for the RDP protocol. The client's name is clearly visible here (starting at byte offset D2, the string reads STETBACH). The client's resolution of 800 x 600 pixels is a bit harder to spot: it appears at offset C2 as the hexadecimal values 0320 and 0258 with the bytes swapped, i.e. in little-endian order. In the following frames, encryption is negotiated, the client user's logon name is sent, and the logon screen is transmitted.
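The byte swapping mentioned above is simply little-endian encoding: each 16-bit value is stored least significant byte first, so 800 (hex 0320) appears on the wire as 20 03 and 600 (hex 0258) as 58 02. A short sketch of the decoding:

```python
import struct

# The four resolution bytes as they appear in the hex pane at offset C2:
# 20 03 58 02 -> two little-endian 16-bit values
raw = bytes([0x20, 0x03, 0x58, 0x02])
width, height = struct.unpack("<HH", raw)
print(width, height)  # 800 600
```

The `<` format prefix tells `struct` to interpret the bytes as little-endian, which reverses the byte order you see when reading the hex dump left to right.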

If you are willing to spend some time on this, you can decipher the basic mechanisms of the RDP protocol as described earlier. This can be very helpful for diagnosing errors and solving problems in network communication.

Posted in Computer Network & Security, Network Devices, Research Menu, User Authentication | 11 Comments »
