Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘CLUSTER’ Category

Building a Beowulf cluster with Ubuntu

Posted by Hemprasad Y. Badgujar on December 25, 2014

The Wikipedia article on the Beowulf cluster describes it as follows:

“A Beowulf cluster is a group of what are normally identical, commercially available computers, which are running a Free and Open Source Software (FOSS), Unix-like operating system, such as BSD, GNU/Linux, or Solaris. They are networked into a small TCP/IP LAN, and have libraries and programs installed which allow processing to be shared among them.” – Wikipedia, Beowulf cluster, 28 February 2011.

This means a Beowulf cluster can be easily built with “off the shelf” computers running GNU/Linux in a simple home network. So building a Beowulf-like cluster is within reach if you already have a small TCP/IP LAN at home with desktop computers running Ubuntu Linux (or any other Linux distribution).

There are many ways to install and configure a cluster. There is OSCAR(1), which allows any user, regardless of experience, to easily install a Beowulf type cluster on supported Linux distributions. It installs and configures all required software according to user input.

There is also the NPACI Rocks toolkit(2), which incorporates the latest Red Hat distribution and cluster-specific software. Rocks addresses the difficulties of deploying manageable clusters. Rocks makes clusters easy to deploy, manage, upgrade and scale.

Both of the aforementioned toolkits for deploying clusters were made to be easy to use and require minimal expertise from the user. But the purpose of this tutorial is to explain how to manually build a Beowulf-like cluster. Basically, the toolkits mentioned above do most of the installing and configuring for you, rendering the learning experience moot. So it would not make much sense to use any of these toolkits if you want to learn the basics of how a cluster works. This tutorial therefore explains how to build a cluster by manually installing and configuring the required tools. In this tutorial I assume that you have some basic knowledge of Linux-based operating systems and know your way around the command line. I tried, however, to make this as easy as possible to follow. Keep in mind that this is new territory for me as well, and there’s a good chance that this tutorial shows methods that may not be the best.

I myself started off with the clustering tutorial from SCFBio which gives a great explanation on how to build a simple Beowulf cluster.(3) It describes the prerequisites for building a Beowulf cluster and why these are needed.


  • What’s a Beowulf Cluster, exactly?
  • Building a virtual Beowulf Cluster
  • Building the actual cluster
  • Configuring the Nodes
    • Add the nodes to the hosts file
    • Defining a user for running MPI jobs
    • Install and setup the Network File System
    • Setup passwordless SSH for communication between nodes
    • Setting up the process manager
      • Setting up Hydra
      • Setting up MPD
  • Running jobs on the cluster
    • Running MPICH2 example applications on the cluster
    • Running bioinformatics tools on the cluster
  • Credits
  • References

What’s a Beowulf Cluster, exactly?

The typical setup of a beowulf cluster

The definition I cited before is not very complete. The book “Engineering a Beowulf-style Compute Cluster”(4) by Robert G. Brown gives a more detailed answer to this question (if you’re serious about this, this book is a must read). According to this book, there is an accepted definition of a beowulf cluster. This book describes the true beowulf as a cluster of computers interconnected with a network with the following characteristics:

  1. The nodes are dedicated to the beowulf cluster.
  2. The network on which the nodes reside is dedicated to the beowulf cluster.
  3. The nodes are Mass Market Commercial-Off-The-Shelf (M2COTS) computers.
  4. The network is also a COTS entity.
  5. The nodes all run open source software.
  6. The resulting cluster is used for High Performance Computing (HPC).

Building a virtual Beowulf Cluster

It is not a bad idea to start by building a virtual cluster using virtualization software like VirtualBox. I simply used my laptop running Ubuntu as the master node, and two virtual computing nodes running Ubuntu Server Edition were created in VirtualBox. The virtual cluster allows you to build and test the cluster without the need for the extra hardware. However, this method is only meant for testing and not suited if you want increased performance.

When it comes to configuring the nodes for the cluster, building a virtual cluster is practically the same as building a cluster with actual machines. The difference is that you don’t have to worry about the hardware as much. You do have to properly configure the virtual network interfaces of the virtual nodes. They need to be configured in a way that the master node (i.e. the computer on which the virtual nodes are running) has network access to the virtual nodes, and vice versa.

Building the actual cluster

It is good practice to first build and test a virtual cluster as described above. If you have some spare computers and network parts lying around, you can use those to build the actual cluster. The nodes (the computers that are part of the cluster) and the network hardware are the usual kind available to the general public (beowulf requirements 3 and 4). In this tutorial we’ll use the Ubuntu operating system to power the machines and open source software to allow for distributed parallel computing (beowulf requirement 5). We’ll test the cluster with cluster-specific versions of bioinformatics tools that perform some sort of heavy calculations (beowulf requirement 6).

The cluster consists of the following hardware parts:

  • Network
  • Server / Head / Master Node (common names for the same machine)
  • Compute Nodes
  • Gateway

All nodes (including the master node) run the following software:

I will not focus on setting up the network (parts) in this tutorial. I assume that all nodes are part of the same private network and that they are properly connected.

Configuring the Nodes

Some configurations need to be made to the nodes. I’ll walk you through them one by one.

Add the nodes to the hosts file

It is easier if the nodes can be accessed with their host name rather than their IP address. It will also make things a lot easier later on. To do this, add the nodes to the hosts file of all nodes.(8) (9) All nodes should have a static local IP address set. I won’t go into details here as this is outside the scope of this tutorial. For this tutorial I assume that all nodes are already properly configured to have a static local IP address.

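Edit the hosts file (sudo vim /etc/hosts) so that it has one line per host, each giving an IP address followed by the host name, for localhost, master, node1, node2 and node3. Remember that you need to do this on all nodes.

Make sure that master is not mapped to a loopback address (Ubuntu by default adds a 127.0.1.1 entry for the machine’s own host name), neither as its only entry nor as an extra entry next to the one with its LAN address.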

Otherwise other nodes will try to connect to localhost when trying to reach the master node.
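To make the loopback pitfall concrete, here is a minimal sketch of a check you could run against a hosts file. The function name loopback_names and the 192.168.1.x addresses in the sample file are assumptions made for illustration; they are not values from this tutorial, so substitute the static addresses of your own nodes.

```shell
#!/usr/bin/env bash
# Sketch: flag cluster host names that a hosts file maps to a loopback address.
# The 192.168.1.x addresses below are made-up examples.

# Usage: loopback_names HOSTS_FILE NAME...
# Prints every NAME that appears on a line whose address starts with 127.
loopback_names() {
    local file=$1; shift
    local name
    for name in "$@"; do
        awk -v n="$name" '
            /^[ \t]*#/ { next }                      # skip comment lines
            $1 ~ /^127\./ {                          # loopback address
                for (i = 2; i <= NF; i++) if ($i == n) { print n; exit }
            }
        ' "$file"
    done
}

# Demo on a temporary sample file laid out the way the tutorial intends.
sample=$(mktemp)
cat > "$sample" <<'EOF'
127.0.0.1	localhost
192.168.1.6	master
192.168.1.7	node1
192.168.1.8	node2
192.168.1.9	node3
EOF

bad=$(loopback_names "$sample" master node1 node2 node3)
if [ -z "$bad" ]; then
    echo "hosts file looks fine"
else
    echo "mapped to loopback: $bad"
fi
rm -f "$sample"
```

On a real node you would call it as loopback_names /etc/hosts master node1 node2 node3; any name it prints needs its loopback entry removed.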

Once saved, you can use the host names to connect to the other nodes,

$ ping -c 3 master
PING master ( 56(84) bytes of data.
64 bytes from master ( icmp_req=1 ttl=64 time=0.606 ms
64 bytes from master ( icmp_req=2 ttl=64 time=0.552 ms
64 bytes from master ( icmp_req=3 ttl=64 time=0.549 ms

--- master ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.549/0.569/0.606/0.026 ms

Repeat this from several nodes, pinging a different host name each time. You should get a response similar to the above.

In this tutorial, master is used as the master node. Once the cluster has been set up, the master node will be used to start (spawn) jobs on the cluster. The compute nodes are node1 to node3 and will thus execute the jobs.

Defining a user for running MPI jobs

Several tutorials explain that all nodes need a separate user for running MPI jobs.(8) (9) (6) I haven’t found a clear explanation of why this is necessary, but there could be several reasons:

  1. There’s no need to remember different user names and passwords if all nodes use the same username and password.
  2. MPICH2 can use SSH for communication between nodes. Passwordless login with the use of authorized keys only works if the username matches the one set for passwordless login. You don’t have to worry about this if all nodes use the same username.
  3. The NFS directory can be made accessible for the MPI users only. The MPI users all need to have the same user ID for this to work.
  4. The separate user might require special permissions.

The command below creates a new user with username “mpiuser” and user ID 999. Giving a user ID below 1000 prevents the user from showing up in the login screen for desktop versions of Ubuntu. It is important that all MPI users have the same username and user ID. The user IDs for the MPI users need to be the same because we give access to the MPI user on the NFS directory later. Permissions on NFS directories are checked with user IDs. Create the user like this,

$ sudo adduser mpiuser --uid 999

You may use a different user ID (as long as it is the same for all MPI users). Enter a password for the user when prompted. It’s recommended to give the same password on all nodes so you have to remember just one password. The above command should also create a new directory /home/mpiuser. This is the home directory for user mpiuser and we will use it to execute jobs on the cluster.
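Since the NFS permissions later depend on the numeric user ID, it can help to confirm it on every node after creating the user. The quickest way is id -u mpiuser. The sketch below does the same lookup on passwd-style text, which makes it easy to try without touching a real account; the helper name uid_of and the sample entry are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Usage: uid_of USER < passwd-style input
# Prints the numeric UID (third colon-separated field) of USER, if present.
uid_of() {
    awk -F: -v u="$1" '$1 == u { print $3; exit }'
}

# On a real node you would feed it the account database:
#   getent passwd | uid_of mpiuser
# Demo with a sample entry:
printf 'mpiuser:x:999:999:,,,:/home/mpiuser:/bin/bash\n' | uid_of mpiuser   # prints 999
```

The number printed should be identical on the master node and on every compute node.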

Install and setup the Network File System

Files and programs used for MPI jobs (jobs that are run in parallel on the cluster) need to be available to all nodes, so we give all nodes access to a part of the file system on the master node. Network File System (NFS) enables you to mount part of a remote file system so you can access it as if it is a local directory. To install NFS, run the following command on the master node:

master:~$ sudo apt-get install nfs-kernel-server

And in order to make it possible to mount a Network File System on the compute nodes, the nfs-common package needs to be installed on all compute nodes:

$ sudo apt-get install nfs-common

We will use NFS to share the MPI user’s home directory (i.e. /home/mpiuser) with the compute nodes. It is important that this directory is owned by the MPI user so that all MPI users can access this directory. But since we created this home directory with the adduser command earlier, it is already owned by the MPI user,

master:~$ ls -l /home/ | grep mpiuser
drwxr-xr-x   7 mpiuser mpiuser  4096 May 11 15:47 mpiuser

If you use a different directory that is not currently owned by the MPI user, you must change its ownership as follows,

master:~$ sudo chown mpiuser:mpiuser /path/to/shared/dir

Now we share the /home/mpiuser directory of the master node with all other nodes. For this the file /etc/exports on the master node needs to be edited. Add the following line to this file,

/home/mpiuser *(rw,sync,no_subtree_check)

You can read the man page to learn more about the exports file (man exports). After the first install you may need to restart the NFS daemon:

master:~$ sudo service nfs-kernel-server restart

This also exports the directories listed in /etc/exports. In the future when the /etc/exports file is modified, you need to run the following command to export the directories listed in /etc/exports:

master:~$ sudo exportfs -a

The /home/mpiuser directory should now be shared through NFS. In order to test this, you can run the following command from a compute node:

$ showmount -e master

In this case this should print the path /home/mpiuser. All data files and programs that will be used for running an MPI job must be placed in this directory on the master node. The other nodes will then be able to access these files through NFS.

The firewall is by default enabled on Ubuntu. The firewall will block access when a client tries to access an NFS shared directory. So you need to add a rule with UFW (a tool for managing the firewall) to allow access from a specific subnet. If the IP addresses in your network have the format 192.168.1.*, then is the subnet. Run the following command to allow incoming access from a specific subnet,

master:~$ sudo ufw allow from

You need to run this on the master node and replace “” by the subnet for your network.
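As a sketch of how the subnet value can be worked out: assuming a 255.255.255.0 netmask (addresses of the form a.b.c.*), the network in CIDR notation is the first three octets followed by .0/24. The helper name subnet24 and the sample address are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Derive the /24 network in CIDR notation from any address inside it.
# Assumes a 255.255.255.0 netmask; other netmasks need a different calculation.
subnet24() {
    local ip=$1
    echo "${ip%.*}.0/24"    # drop the last octet, append .0/24
}

subnet24 192.168.1.7    # prints 192.168.1.0/24

# The result would then be passed to UFW on the master node:
#   sudo ufw allow from "$(subnet24 192.168.1.7)"
```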

You should then be able to mount master:/home/mpiuser on the compute nodes. Run the following commands to test this,

node1:~$ sudo mount master:/home/mpiuser /home/mpiuser
node2:~$ sudo mount master:/home/mpiuser /home/mpiuser
node3:~$ sudo mount master:/home/mpiuser /home/mpiuser

If this fails or hangs, restart the compute node and try again. If the above command runs without a problem, you should test whether /home/mpiuser on any compute node actually has the content from /home/mpiuser of the master node. You can test this by creating a file in master:/home/mpiuser and checking whether that same file appears in node*:/home/mpiuser (where node* is any compute node).

If mounting the NFS shared directory works, we can make it so that the master:/home/mpiuser directory is automatically mounted when the compute nodes are booted. For this the file /etc/fstab needs to be edited. Add the following line to the fstab file of all compute nodes,

master:/home/mpiuser /home/mpiuser nfs

Again, read the man page of fstab if you want to know the details (man fstab). Reboot the compute nodes and list the contents of the /home/mpiuser directory on each compute node to check if you have access to the data on the master node,

$ ls /home/mpiuser

This should list the files from the /home/mpiuser directory of the master node. If it doesn’t immediately, wait a few seconds and try again. It might take some time for the system to initialize the connection with the master node.
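An empty listing can also mean that the share simply is not mounted. A minimal sketch for telling the two cases apart is to look the mount point up in the kernel’s mount table; the helper name nfs_source is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Usage: nfs_source MOUNTPOINT < /proc/mounts-style input
# Prints the remote export mounted at MOUNTPOINT when its type is nfs or nfs4.
# Returns non-zero when no such mount is found.
nfs_source() {
    awk -v m="$1" '
        $2 == m && $3 ~ /^nfs4?$/ { print $1; found = 1 }
        END { exit !found }
    '
}

# On a compute node:
#   nfs_source /home/mpiuser < /proc/mounts
# Demo with a sample mount table line:
printf 'master:/home/mpiuser /home/mpiuser nfs4 rw,relatime 0 0\n' | nfs_source /home/mpiuser
```

If the command prints master:/home/mpiuser, the directory really comes from the master node; if it prints nothing, the listing you see is the local directory.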

Setup passwordless SSH for communication between nodes

For the cluster to work, the master node needs to be able to communicate with the compute nodes, and vice versa.(8) Secure Shell (SSH) is usually used for secure remote access between computers. By setting up passwordless SSH between the nodes, the master node is able to run commands on the compute nodes. This is needed to run the MPI daemons on the available compute nodes.

First install the SSH server on all nodes:

$ sudo apt-get install ssh

Now we need to generate an SSH key for all MPI users on all nodes. The SSH key is by default created in the user’s home directory. Remember that in our case the MPI user’s home directory (i.e. /home/mpiuser) is actually the same directory for all nodes: /home/mpiuser on the master node. So if we generate an SSH key for the MPI user on one of the nodes, all nodes will automatically have an SSH key. Let’s generate an SSH key for the MPI user on the master node (but any node should be fine),

$ su mpiuser
$ ssh-keygen

When asked for a passphrase, leave it empty (hence passwordless SSH).

When done, all nodes should have an SSH key (the same key actually). The master node needs to be able to automatically login to the compute nodes. To enable this, the public SSH key of the master node needs to be added to the list of authorized keys (usually the file ~/.ssh/authorized_keys) of all compute nodes. But this is easy, since all SSH key data is stored in one location: /home/mpiuser/.ssh/ on the master node. So instead of having to copy master’s public SSH key to all compute nodes separately, we just have to copy it to master’s own authorized_keys file. There is a command to push the public SSH key of the currently logged in user to another computer. Run the following command on the master node as user “mpiuser”,

mpiuser@master:~$ ssh-copy-id localhost

Master’s own public SSH key should now be copied to /home/mpiuser/.ssh/authorized_keys. But since /home/mpiuser/ (and everything under it) is shared with all nodes via NFS, all nodes should now have master’s public SSH key in their list of authorized keys. This means that we should now be able to login on the compute nodes from the master node without having to enter a password,

mpiuser@master:~$ ssh node1
mpiuser@node1:~$ echo $HOSTNAME

You should now be logged in on node1 via SSH. Make sure you’re able to login to the other nodes as well.
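Checking every node by hand gets tedious, so here is a minimal sketch that tries them all in one go. The function name check_nodes and the overridable SSH variable are assumptions made for illustration; BatchMode=yes and ConnectTimeout are regular OpenSSH client options, and BatchMode makes ssh fail rather than prompt for a password.

```shell
#!/usr/bin/env bash
# Sketch: confirm that key-based login works for every compute node.
# SSH can be overridden (e.g. for a dry run); it defaults to the real client.
SSH=${SSH:-ssh}

# Usage: check_nodes NODE...
# Returns non-zero if at least one node cannot be reached without a password.
check_nodes() {
    local node failed=0
    for node in "$@"; do
        # Run the no-op command "true" remotely; stdin is detached so the loop is not consumed.
        if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$node" true </dev/null; then
            echo "$node: ok"
        else
            echo "$node: passwordless login failed"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on the master node as mpiuser:
#   check_nodes node1 node2 node3
```

The first connection to each node may still ask you to confirm its host key, so log in once by hand before relying on the batch check.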

Setting up the process manager

In this section I’ll walk you through the installation of MPICH and configuring the process manager. The process manager is needed to spawn and manage parallel jobs on the cluster. The MPICH wiki explains this nicely:

“Process managers are basically external (typically distributed) agents that spawn and manage parallel jobs. These process managers communicate with MPICH processes using a predefined interface called as PMI (process management interface). Since the interface is (informally) standardized within MPICH and its derivatives, you can use any process manager from MPICH or its derivatives with any MPI application built with MPICH or any of its derivatives, as long as they follow the same wire protocol.” – Frequently Asked Questions – Mpich.

The process manager is included with the MPICH package, so start by installing MPICH on all nodes with,

$ sudo apt-get install mpich2

MPD was the traditional default process manager for MPICH up to the 1.2.x release series. Starting with the 1.3.x series, Hydra is the default process manager.(10) So depending on the version of MPICH you are using, you should either use MPD or Hydra for process management. You can check the MPICH version by running mpich2version in the terminal. Then follow either the steps for MPD or Hydra in the following subsections.

Setting up Hydra

This section explains how to configure the Hydra process manager and is for users of MPICH 1.3.x series and up. In order to set up Hydra, we need to create one file on the master node. This file contains all the host names of the compute nodes.(11) You can create this file anywhere you want, but for simplicity we create it in the MPI user’s home directory,

mpiuser@master:~$ cd ~
mpiuser@master:~$ touch hosts

In order to be able to send out jobs to the other nodes in the network, add the host names of all compute nodes to the hosts file,

node1
node2
node3


You may choose to include master in this file, which would mean that the master node would also act as a compute node. The hosts file only needs to be present on the node that will be used to start jobs on the cluster, usually the master node. But because the home directory is shared among all nodes, all nodes will have the hosts file. For more details about setting up Hydra see this page: Using the Hydra Process Manager.
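Hydra’s host file also accepts an optional process count after each host name, written as host:count. As a minimal sketch, assuming you want the same count on every node, the file can be generated like this; the function name write_hosts and the count of 2 are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Write a Hydra host file with one "host:slots" entry per line.
# Usage: write_hosts FILE SLOTS HOST...
write_hosts() {
    local file=$1 slots=$2; shift 2
    local host
    : > "$file"                       # create or truncate the file
    for host in "$@"; do
        echo "$host:$slots" >> "$file"
    done
}

# Example: two processes per compute node (an assumed core count).
#   write_hosts ~/hosts 2 node1 node2 node3
```

With plain host names and no count, as in the file above, Hydra places one process per entry in turn.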

Setting up MPD

This section explains how to configure the MPD process manager and is for users of MPICH 1.2.x series and down. Before we can start any parallel jobs with MPD, we need to create two files in the home directory of the MPI user. Make sure you’re logged in as the MPI user and create the following two files in the home directory,

mpiuser@master:~$ cd ~
mpiuser@master:~$ touch mpd.hosts
mpiuser@master:~$ touch .mpd.conf

In order to be able to send out jobs to the other nodes in the network, add the host names of all compute nodes to the mpd.hosts file,

node1
node2
node3


You may choose to include master in this file, which would mean that the master node would also act as a compute node. The mpd.hosts file only needs to be present on the node that will be used to start jobs on the cluster, usually the master node. But because the home directory is shared among all nodes, all nodes will have the mpd.hosts file.

The configuration file .mpd.conf (mind the dot at the beginning of the file name) must be accessible to the MPI user only (in fact, MPD refuses to work if you don’t do this),

mpiuser@master:~$ chmod 600 .mpd.conf

Then add a line with a secret passphrase to the configuration file,

secretword=<your passphrase>


The secretword can be set to any random passphrase. You may want to use a random password generator to generate a passphrase.

All nodes need to have the .mpd.conf file in the home directory of mpiuser with the same passphrase. But this is automatically the case since /home/mpiuser is shared through NFS.

The nodes should now be configured correctly. Run the following command on the master node to start the mpd daemon on all nodes,

mpiuser@master:~$ mpdboot -n 3

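Replace “3” by the number of mpd daemons to start. The daemon that mpdboot starts on the local machine counts towards this number, so a ring of the master node plus three compute nodes needs -n 4. If this was successful, all nodes should now be running the mpd daemon. Run the following command to check if all nodes entered the ring (and are thus running the mpd daemon),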

mpiuser@master:~$ mpdtrace -l

This command should display a list of all nodes that entered the ring. Nodes listed here are running the mpd daemon and are ready to accept MPI jobs. This means that your cluster is now set up and ready to rock!

Running jobs on the cluster

Running MPICH2 example applications on the cluster

The MPICH2 package comes with a few example applications that you can run on your cluster. To obtain these examples, download the MPICH2 source package from the MPICH website and extract the archive to a directory. The directory to where you extracted the MPICH2 package should contain an “examples” directory. This directory contains the source codes of the example applications. You need to compile these yourself.

$ sudo apt-get build-dep mpich2
$ wget
$ tar -xvzf mpich2-1.4.1.tar.gz
$ cd mpich2-1.4.1/
$ ./configure
$ make
$ cd examples/

The example application cpi is compiled by default, so you can find the executable in the “examples” directory. Optionally you can build the other examples as well,

$ make hellow
$ make pmandel

Once compiled, place the executables of the examples somewhere inside the /home/mpiuser directory on the master node. It’s common practice to place executables in a “bin” directory, so create the directory /home/mpiuser/bin and place the executables in this directory. The executables should now be available on all nodes.

We’re going to run an MPI job using the example application cpi. Make sure you’re logged in as the MPI user on the master node,

$ su mpiuser

And run the job like this,

When using MPD:

mpiuser@master:~$ mpiexec -n 3 /home/mpiuser/bin/cpi

When using Hydra:

mpiuser@master:~$ mpiexec -f hosts -n 3 /home/mpiuser/bin/cpi

Replace “3” by the number of nodes on which you want to run the job. When using Hydra, the -f switch should point to the file containing the host names. When using MPD, it’s important that you use the absolute path to the executable in the above command, because only then MPD knows where to look for the executable on the compute nodes. The absolute path used should thus be correct for all nodes. But since /home/mpiuser is the NFS shared directory, all nodes have access to this path and the files within it.
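Because a relative path is an easy mistake to make, a small wrapper can refuse to launch unless the program is given as an absolute path. This is a minimal sketch for the Hydra form of the command; the function name run_job and the overridable MPIEXEC variable are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: launch an MPI job only if the program is an absolute path to an executable.
# MPIEXEC can be overridden; it defaults to mpiexec from the MPICH package.
MPIEXEC=${MPIEXEC:-mpiexec}

# Usage: run_job HOSTS_FILE NPROCS /absolute/path/to/program [ARGS...]
run_job() {
    local hosts=$1 nprocs=$2 prog=$3; shift 3
    case $prog in
        /*) ;;                                                     # absolute path, fine
        *)  echo "error: $prog is not an absolute path" >&2; return 2 ;;
    esac
    [ -x "$prog" ] || { echo "error: $prog is not executable" >&2; return 2; }
    "$MPIEXEC" -f "$hosts" -n "$nprocs" "$prog" "$@"
}

# Usage on the master node as mpiuser:
#   run_job hosts 3 /home/mpiuser/bin/cpi
```

The executable check only looks at the node where the job is started, which is enough here because every node sees the same files through NFS.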

The example application cpi is useful for testing because it shows on which nodes each sub process is running and how long it took to run the job. This application is however not useful to test performance because this is a very small application which takes only a few milliseconds to run. The value of pi that is hard coded into the source is only a reference: cpi approximates pi by numerical integration and reports the error against that reference.

Running bioinformatics tools on the cluster

By running actual bioinformatics tools you can give your cluster a more realistic test run. There are several parallel implementations of bioinformatics tools that are based on MPI. There are two that I currently know of:

  • mpiBLAST
  • ClustalW-MPI

It would be nice to test mpiBLAST, but because of a compilation issue, I was not able to do so. After some asking around at the mpiBLAST-Users mailing list, I got an answer:

“That problem is caused by a change in GCC version 4.4.X. We don’t have a fix to give out for the issue as yet, but switching to 4.3.X or lower should solve the issue for the time being.”(7)

Basically, I’m using a newer version of the GCC compiler which fails to build mpiBLAST. In order to compile it, I’d have to use an older version. But to instruct mpicc to use GCC 4.3 instead, requires that MPICH2 be compiled with GCC 4.3. Instead of going through that trouble, I’ve decided to give ClustalW-MPI a try instead.

The MPI implementation of ClustalW is fairly outdated, but it’s good enough to perform a test run on your cluster. Download the source from the website, extract the package, and compile the source. Copy the resulting executable to the /home/mpiuser/bin directory on the master node. Use for example Entrez to search for some DNA/protein sequences and put these in a single FASTA file (the NCBI website can do that for you). Create several FASTA files with multiple sequences to test with. Copy the multi-sequence FASTA files to a data directory inside the shared directory (e.g. /home/mpiuser/data). Then run a job like this,

When using MPD:

mpiuser@master:~$ mpiexec -n 3 /home/mpiuser/bin/clustalw-mpi /home/mpiuser/data/seq_tyrosine.fasta

When using Hydra:

mpiuser@master:~$ mpiexec -f hosts -n 3 /home/mpiuser/bin/clustalw-mpi /home/mpiuser/data/seq_tyrosine.fasta

and let the cluster do the work. Again, notice that we must use absolute paths. You can check if the nodes are actually doing anything by logging into the nodes (ssh node*) and running the top command. This should display a list of running processes with the processes using the most CPU on the top. In this case, you should see the process clustalw-mpi somewhere along the top.
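Instead of logging into each node and watching top, a quick count of the matching processes per node gives the same reassurance. This is a minimal sketch that assumes passwordless SSH is working and that pgrep is installed on the nodes; the function name count_procs and the overridable SSH variable are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print how many processes with a given name run on each node.
# SSH can be overridden (e.g. for a dry run); it defaults to the real client.
SSH=${SSH:-ssh}

# Usage: count_procs PROCESS_NAME NODE...
count_procs() {
    local proc=$1; shift
    local node
    for node in "$@"; do
        # pgrep -c prints the number of matching processes (0 if there are none).
        echo "$node: $("$SSH" "$node" pgrep -c "$proc" </dev/null)"
    done
}

# Usage on the master node while a job is running:
#   count_procs clustalw-mpi node1 node2 node3
```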


Credits

Thanks to Reza Azimi for mentioning the nfs-common package.


References

  1. OpenClusterGroup. OSCAR.
  2. Philip M. Papadopoulos, Mason J. Katz, and Greg Bruno. NPACI Rocks: Tools and Techniques for Easily Deploying Manageable Linux Clusters. October 2001, Cluster 2001: IEEE International Conference on Cluster Computing.
  3. Supercomputing Facility for Bioinformatics & Computational Biology, IIT Delhi. Clustering Tutorial.
  4. Robert G. Brown. Engineering a Beowulf-style Compute Cluster. 2004. Duke University Physics Department.
  5. Pavan Balaji, et al. MPICH2 User’s Guide, Version 1.3.2. 2011. Mathematics and Computer Science Division, Argonne National Laboratory.
  6. Kerry D. Wong. A Simple Beowulf Cluster.
  7. mpiBLAST-Users: unimplemented: inlining failed in call to ‘int fprintf(FILE*, const char*, …)’
  8. Ubuntu Wiki. Setting Up an MPICH2 Cluster in Ubuntu.
  9. Building a Beowulf Cluster in just 13 steps.
  10. Frequently Asked Questions – Mpich.
  11. Using the Hydra Process Manager – Mpich.


How to Build a GPU-Accelerated Cluster

Posted by Hemprasad Y. Badgujar on December 22, 2014

Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (“nodes”) connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high performance computing (HPC), NVIDIA GPUs are becoming part of some of the world’s most powerful supercomputers and clusters. The most recent Top 500 list of the world’s fastest supercomputers included nearly 50 supercomputers powered by NVIDIA GPUs, and the current world’s fastest supercomputer, Oak Ridge National Laboratory’s Titan, utilizes more than 18,000 NVIDIA Kepler GPUs.

In this post I will take you step by step through the process of designing, deploying, and managing a small research prototype GPU cluster for HPC. I will describe all the components needed for a GPU cluster as well as the complete cluster management software stack. The goal is to build a research prototype GPU cluster using all open source and free software and with minimal hardware cost.

I gave a talk on this topic at GTC 2013 (session S3516 – Building Your Own GPU Research Cluster Using Open Source Software Stack). The slides and a recording are available at that link, so please check them out!

There are multiple motivating reasons for building a GPU-based research cluster.

  • Get a feel for production systems and performance estimates;
  • Port your applications to GPUs and distributed computing (using CUDA-aware MPI);
  • Tune GPU and CPU load balancing for your application;
  • Use the cluster as development platform;
  • Early experience means increased readiness;
  • The investment is relatively small for a research prototype cluster

Figure 1 shows the steps to build a small GPU cluster. Let’s look at the process in more detail.

Figure 1: Seven steps to build and test a small research GPU cluster.

1. Choose Your Hardware

There are two steps to choosing the correct hardware.

  1. Node Hardware Details. This is the specification of the machine (node) for your cluster. Each node has the following components.
    • CPU processor from any vendor;
    • A motherboard with the following PCI-express connections:
      • 2x PCIe x16 Gen2/3 connections for Tesla GPUs;
      • 1x PCIe x8 wide for an HCA (InfiniBand) card;
    • 2 available network ports;
    • A minimum of 16-24 GB DDR3 RAM. (It is good to have more RAM in the system).
    • A power-supply unit (SMPS) with ample power rating. The total power supply needed includes power taken by the CPU, GPUs and other components in the system.
    • Secondary storage (HDD / SSD) based on your needs.

    GPU boards are wide enough to cover two physically adjacent PCI-e slots, so make sure that the PCIe x16 and x8 slots are physically separated on the motherboard so that you can fit a minimum of 2 PCI-e x16 GPUs and 1 PCIe x8 network card.

  2. Choose the right form factor for GPUs. Once you decide your machine specs you should also decide which model GPUs you would like to consider for your system. The form factor of GPUs is an important consideration. Kepler-based NVIDIA Tesla GPUs are available in two main form factors.
    • Tesla workstation products (C Series) are actively cooled GPU boards (this means they have a fan cooler over the GPU chip) that you can just plug in to your desktop computer in a PCI-e x16 slot. These use either two 6-pin or one 8-pin power supply connector.
    • Server products (M Series) are passively cooled GPUs (no fans) installed in standard servers sold by various OEMs.

    There are three different options for adding GPUs to your cluster:

    • you can buy C-series GPUs and install them in existing workstations or servers with enough space;
    • you can buy workstations from a vendor with C-series GPUs installed; or
    • you can buy servers with M-series GPUs installed.
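For the power-supply rating mentioned in the node hardware list, a rough budget is the sum of the component power draws plus some headroom. The sketch below does that arithmetic; the function name psu_watts, the 20% headroom and all wattage figures are assumptions made for illustration, so use the numbers from your own component data sheets.

```shell
#!/usr/bin/env bash
# Usage: psu_watts HEADROOM_PERCENT WATTS...
# Sums the component power draws and adds the safety margin (integer math, rounded up).
psu_watts() {
    local headroom=$1; shift
    local total=0 w
    for w in "$@"; do
        total=$((total + w))
    done
    echo $(( (total * (100 + headroom) + 99) / 100 ))
}

# Assumed figures: one 130 W CPU, two 235 W GPUs, 100 W for board, RAM, disks and fans.
psu_watts 20 130 235 235 100    # prints 840
```

In this assumed configuration the components add up to 700 W, so a supply rated for at least 840 W would leave the chosen margin.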

2. Allocate Space, Power and Cooling

The goal for this step is to assess your physical infrastructure, including space, power and cooling needs, network considerations and storage requirements to ensure optimal system choices with room to grow your cluster in the future. You should make sure that you have enough space, power and cooling for your cluster. Clusters are mainly rack mounted, with multiple machines installed in a vertical rack. Vendors offer many server solutions that minimize the use of rack space.

3. Assembly and Physical Deployment

After deciding the machine configuration and real estate the next step is to physically deploy your cluster. Figure 2 shows the cluster deployment connections. The head node is the external interface to the cluster; it receives all external network connections, processes incoming requests, and assigns work to compute nodes (nodes with GPUs that perform the computation).

In a research prototype cluster you can also use one of the compute nodes as the head node, but routing all traffic through the head node and also making it a compute node is not a good idea for production clusters because of performance and security issues. Production and large clusters mostly have a dedicated node to handle all incoming traffic while the head node just manages the work distribution for the compute nodes.

Figure 2: Head node and compute node connections.

4. Head Node Installation

I recommend installing the head node with the open source Rocks Linux distribution. Rocks is a customizable, easy and quick way to install nodes. The Rocks installation package includes essential components for clusters, such as MPI. Rocks head node installation is well documented in the Rocks user guide, but here is a summary of the steps.

  • Follow the steps in Chapter 3 of the Rocks user guide and do a CD-based installation.
  • Install the NVIDIA drivers and CUDA Toolkit on the head node. (CUDA 5 provides a unified package that contains the NVIDIA driver, toolkit and CUDA Samples.)
  • Install network interconnect drivers (e.g. Infiniband) on the head node. These drivers are available from your interconnect manufacturer.
  • Nagios® Core™ is an open source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go wrong and when they get better. To install, follow the instructions given in the Nagios installation guide.
  • The NRPE Nagios add-on allows you to execute Nagios plugins on remote Linux machines. This allows you to monitor local resources like CPU load and memory usage, which are not usually exposed to external machines, on remote machines using Nagios. Install NRPE following the install guide.

5. Compute Node Installation

After you have completed the head node installation, you will install the compute node software with the help of Rocks and the following steps.

  • On the head node: in a terminal shell run the command:
    > insert-ethers

    Choose “Compute Nodes” as the new node to add.

  • Power on the compute node with the Rocks CD as the first boot device or do a network installation.
  • The compute node will connect to the head node and start the installation.
  • Install the NRPE package as described in the NRPE guide.

6. Management and Monitoring

Once you finish the head node and all compute node installations, your cluster is ready to use! Before you actually start using it to run applications of interest, you should also set up management and monitoring tools on the cluster. These tools are necessary for proper management and monitoring of all resources available in the cluster. In this section, I will describe various tools and software packages for GPU management and monitoring.


The NVIDIA System Management Interface (NVIDIA-SMI) is a tool distributed as part of the NVIDIA GPU driver. NVIDIA-SMI provides a variety of GPU system information including

  • thermal monitoring metrics: GPU temperature, chassis inlet/outlet temperatures;
  • system information: firmware revision, configuration information;
  • system state: fan states, GPU faults, power system faults, ECC errors, etc.

NVIDIA-SMI allows you to configure the compute mode for any device in the system (reference: CUDA C Programming Guide):

  • Default compute mode: multiple host threads can use the device at the same time.
  • Exclusive-process compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may be current to as many threads as desired within the process that created the context.
  • Exclusive-process-and-thread compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may only be current to one thread at a time.
  • Prohibited compute mode: No CUDA context can be created on the device.

NVIDIA-SMI also allows you to turn ECC (Error Correcting Code memory) mode on and off. The default is ON, but applications that do not need ECC can get higher memory bandwidth by disabling it.


The Tesla Deployment Kit is a collection of tools provided to better manage NVIDIA Tesla™ GPUs. These tools support Linux (32-bit and 64-bit), Windows 7 (64-bit), and Windows Server 2008 R2 (64-bit). The current distribution contains NVIDIA-healthmon and the NVML API.


The NVML API is a C-based API which provides programmatic state monitoring and management of NVIDIA GPU devices. The NVML dynamic run-time library ships with the NVIDIA display driver, and the NVML SDK provides headers, stub libraries and sample applications. NVML can be used from Python or Perl (bindings are available) as well as C/C++ or Fortran.

Ganglia is an open-source, scalable, distributed monitoring system used for clusters and grids, with very low per-node overhead and high concurrency. An NVML-based Python module for gmond, the Ganglia monitoring daemon, lets you monitor NVIDIA GPUs in the Ganglia interface.


NVIDIA-healthmon provides quick health checking of GPUs in cluster nodes. The tool detects issues and suggests remedies to software and system configuration problems, but it is not a comprehensive hardware diagnostic tool. Features include:

  • basic CUDA and NVML sanity check;
  • diagnosis of GPU failures;
  • check for conflicting drivers;
  • poorly seated GPU detection;
  • check for disconnected power cables;
  • ECC error detection and reporting;
  • bandwidth test;
  • infoROM validation.

7. Run Benchmarks and Applications

Once your cluster is up and running you will want to validate it by running some benchmarks and sample applications. There are various benchmarks and code samples for GPUs and the network as well as applications to run on the entire cluster. For GPUs, you need to run two basic tests.

  1. deviceQuery: This sample code is available with the CUDA Samples included in the CUDA Toolkit installation package. deviceQuery simply enumerates the properties of the CUDA devices present in a node. This is not a benchmark, but successfully running this or any other CUDA sample serves to verify that you have the CUDA driver and toolkit properly installed on the system.
  2. bandwidthTest: This is another of the CUDA Samples included with the Toolkit. This sample measures the cudaMemcpy bandwidth of the GPU across PCI-e as well as internally. You should measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.

To benchmark network performance, you should run the bandwidth and latency tests for your installed MPI distribution. MPI standard installations have standard benchmarks such as /tests/osu_benchmarks-3.1.1. You should consider using an open source CUDA-aware MPI implementation like MVAPICH2, as described in earlier Parallel Forall posts An Introduction to CUDA-Aware MPI and Benchmarking CUDA-Aware MPI.

To benchmark the entire cluster, you should run the LINPACK numerical linear algebra application. The TOP500 supercomputer list uses the HPL benchmark to rank the fastest supercomputers on Earth. The CUDA-enabled version of HPL (High-Performance LINPACK) optimized for GPUs is available from NVIDIA on request, and there is a Fermi-optimized version available to all NVIDIA registered developers.

In this post I have provided an overview of the basic steps to build a GPU-accelerated research prototype cluster. For more details on GPU-based clusters and some of the best practices for production clusters, please refer to Dale Southard’s GTC 2013 talk S3249 – Introduction to Deploying, Managing, and Using GPU Clusters.

Posted in CLOUD, CLUSTER, Computer Vision, Computing Technology, CUDA, GPU (CUDA), GRID, Linux OS, Mixed, Multimedia, PARALLEL


Posted by Hemprasad Y. Badgujar on December 19, 2014

MPI is a well-known programming model for distributed-memory computing. If you have access to GPU resources, MPI can be used to distribute tasks to computers, each of which can use both its CPU and its GPU to process its share of the work.

My toy problem at hand is to use a mix of MPI and CUDA to handle traditional sparse matrix-vector multiplication. The program can be structured as:

Each node uses both CPU and GPU resources
  1. Read a sparse matrix from disk, and split it into sub-matrices.
  2. Use MPI to distribute the sub-matrices to processes.
  3. Each process calls a CUDA kernel to handle the multiplication. The result of the multiplication is copied back to that computer's host memory.
  4. Use MPI to gather the results from each of the processes, and assemble the final result.
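
The splitting in step 1 and the per-process multiplication in step 3 can be sketched on the host in plain C. This is only a CPU reference, not the CUDA kernel itself. It assumes the matrix is stored in CSR (compressed sparse row) format and is split into contiguous blocks of rows; the names row_range and csr_spmv_rows are illustrative and do not come from the original program.

```c
#include <assert.h>

/* Rows [*begin, *end) owned by `rank` when n_rows rows are split as evenly
   as possible among n_procs processes (the first n_rows % n_procs ranks
   receive one extra row). */
void row_range(int n_rows, int n_procs, int rank, int *begin, int *end)
{
    assert(n_rows >= 0 && n_procs > 0 && rank >= 0 && rank < n_procs);
    int base = n_rows / n_procs;
    int extra = n_rows % n_procs;
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}

/* CPU reference for the work of one process: multiply rows [begin, end) of
   a CSR matrix by the dense vector x. y receives end - begin values. */
void csr_spmv_rows(const int *row_ptr, const int *col_idx, const double *val,
                   const double *x, int begin, int end, double *y)
{
    for (int r = begin; r < end; ++r) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[r - begin] = sum;
    }
}
```

A reference like this is mainly useful for checking the values that come back from the GPU before the results are gathered in step 4.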

One option is to put both MPI and CUDA code in a single file. This program can be compiled using nvcc, which internally uses gcc/g++ to compile your C/C++ code, and linked to your MPI library:

nvcc -I/usr/mpi/gcc/openmpi-1.4.6/include -L/usr/mpi/gcc/openmpi-1.4.6/lib64 -lmpi -o program

The downside is that a seriously long program might end up as a plate of spaghetti.

Another, cleaner option is to keep the MPI and CUDA code in two separate files: main.c and multiply.cu. These two files can be compiled into object files (.o) using mpicc and nvcc respectively, and combined into a single executable using mpicc. This is the reverse of the first option: because the final link is done by mpicc rather than nvcc, you have to link against the CUDA runtime library explicitly.

module load openmpi cuda #(optional) load modules on your node
mpicc -c main.c -o main.o
nvcc -arch=sm_20 -c multiply.cu -o multiply.o
mpicc main.o multiply.o -lcudart -L/apps/CUDA/cuda-5.0/lib64/ -o program

And finally, you can request two processes and two GPUs to test your program on the cluster using a PBS script like:

#PBS -l nodes=2:ppn=2:gpus=2
mpiexec -np 2 ./program

The main.c file, containing the call into the CUDA file, would look like:

#include "mpi.h"

void call_me_maybe(void);   /* defined in the CUDA file */

int main(int argc, char *argv[])
{
        int myRank, numProcs;
        /* It's important to put this call at the beginning of the program, after variable declarations. */
        MPI_Init(&argc, &argv);
        /* Get the number of MPI processes and the rank of this process. */
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
        // ==== Call function 'call_me_maybe' from CUDA file ==========
        call_me_maybe();
        /* ... */
        MPI_Finalize();
        return 0;
}

And in multiply.cu, define call_me_maybe() as extern "C" to make it callable from main.c (without an additional #include …):

/* multiply.cu */
#include <cuda.h>
#include <cuda_runtime.h>

 __global__ void __multiply__ (/* ... kernel parameters ... */)
 {
     /* ... kernel body ... */
 }

 extern "C" void call_me_maybe()
 {
     /* ... Load CPU data into GPU buffers  */
     __multiply__ <<< ...block configuration... >>> (x, y);
     /* ... Transfer data from GPU to CPU */
 }

Mixing MPI and CUDA

Mixing MPI (C) and CUDA (C++) code requires some care during linking because of differences between the C and C++ calling conventions and runtimes. A helpful overview of the issues can be found at How to Mix C and C++.

One option is to compile and link all source files with a C++ compiler, which will enforce additional restrictions on C code. Alternatively, if you wish to compile your MPI/C code with a C compiler and call CUDA kernels from within an MPI task, you can wrap the appropriate CUDA-compiled functions with the extern keyword, as in the following example.

The two source files shown below (multiply.cu and main.c) can be compiled and linked with both a C and a C++ compiler into a single executable on Oscar using:

$ module load mvapich2 cuda
$ mpicc -c main.c -o main.o
$ nvcc -c multiply.cu -o multiply.o
$ mpicc main.o multiply.o -lcudart

The CUDA/C++ compiler nvcc is used only to compile the CUDA source file, and the MPI C compiler mpicc is used to compile the C code and to perform the linking.

/* multiply.cu */

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__ (const float *a, float *b)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(const float *a, float *b)
{
    /* ... load CPU data into GPU buffers a_gpu and b_gpu */

    __multiply__ <<< ...block configuration... >>> (a_gpu, b_gpu);

    /* safecall is an error-checking wrapper assumed to be defined elsewhere;
       cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize */
    safecall(cudaDeviceSynchronize());
    safecall(cudaGetLastError());

    /* ... transfer data from GPU to CPU */
}

Note the use of extern "C" around the function launch_multiply, which instructs the C++ compiler (nvcc in this case) to make that function callable from the C runtime. The following C code shows how the function could be called from an MPI task.

/* main.c */

#include <mpi.h>

void launch_multiply(const float *a, float *b);

int main (int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &nprocs);

    /* ... prepare arrays a and b */

    launch_multiply (a, b);

    MPI_Finalize();
    return 0;
}

Posted in CLUSTER, Computer Hardware, Computer Softwares, Computer Vision, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, GRID, Open CL, OpenMP, PARALLEL

Remote Desktop Connection in Windows 7

Posted by Hemprasad Y. Badgujar on March 4, 2013

Remote Desktop Connection in Windows 7

Remote Desktop Connection, a utility included in all versions of Windows 7, allows you to use a laptop or home computer to remotely control the Windows-based desktop computer in your on-campus office or lab. When using Remote Desktop Connection from a laptop on a wireless network (including Purdue’s AirLink network and free public WiFi networks in coffee shops, hotels, etc.) or a home computer on a broadband Internet connection, it’s as if you’re sitting at the desk in your office using your computer’s keyboard and mouse — even if you’re two buildings, two miles, or two continents away.

By remotely accessing an ECN-supported desktop computer and refraining from storing your Purdue files locally on your laptop or home computer, your data remains safely stored in your home directory on ECN’s network servers — which receive daily backups.

  • If you’re using Windows XP Professional rather than Windows 7, please see Remote Desktop Connection in Windows XP instead.
  • If you have a Macintosh desktop at home or a Mac laptop but have a Windows-based desktop computer in your office, Microsoft also provides a free Mac version of Remote Desktop Connection; please see Remote Desktop Connection in Mac OS X. (The instructions on the page you’re reading now focus on the Windows 7 version.)

You’ll want to follow these instructions on your laptop and/or home computer, not on the on-campus desktop computer!

When connecting from off-campus, please don’t miss step #6! Connecting first to Purdue’s Virtual Private Network is required.

Who can use Remote Desktop Connection?

A remote-controlled computer can be used by only one person at a time. As such, it is recommended for use only by those who do not share the same office computer with other people. A graduate student may use Remote Desktop Connection with the permission of his or her supervisor.


Creating a Remote Desktop shortcut

1. Getting started on your Windows 7-based laptop or home computer.

On your laptop or home computer, click on the Start menu, navigate to All Programs, then to Accessories, and then launch “Remote Desktop Connection.”

Windows 7 Start menu

Remote Desktop Connection dialog

2. Computer address.

2A. In the “Computer” field, enter the IP number of the desktop computer in your office. It will look similar to the following:

where both xxx and yyy are a specific number between 1 and 255. No two computers have the same full number; please obtain this number from ECN.

You may either skip to step #6 (to connect to the remote computer immediately) or proceed with step #2B (to set program options and create a shortcut for future use).

2B. Then click on the “Options” button. The window will expand to show several tabs, each with various program settings.

Experience tab

3. The “Experience” tab.

This step is optional. These settings might help improve your remote connection’s performance.

3A. Click on the “Experience” tab.

3B. Click the menu beneath “Choose your connection speed to optimize performance” and select one of the following:

  • For most public WiFi services or home DSL connections, try “Low-speed broadband (256 Kbps – 2 Mbps)”.
  • For home cable modem connections, try “High-speed broadband (2 Mbps – 10 Mbps)”.

General tab

4. The “General” tab.

4A. In the “User name” field, type your Purdue Career Account username.

Leave the “Allow me to save credentials” box unchecked.

4B.  Click on the “Save As” button to proceed to the next step. The “Save As” dialog will appear.

5. Saving your shortcut file.

In this step, you’ll create a shortcut file which you will later begin using routinely to launch a remote control session to your office PC.  You may save this shortcut wherever you prefer; we suggest saving a copy to your desktop.

5A. In the “Save As” dialog, click on the “Desktop” icon in the left-hand column. This will set the “Save in” location to the desktop.

5B. In the “File name” field, type a name that you’ll recognize.  We suggest something like the following:

Remote Desktop to my office PC

If you’ll be creating shortcuts to multiple remote computers (say, one for each person who uses a shared home computer, each pointing to his or her unique office PC), you could enter a more specific name, e.g.:

Remote Desktop to John's office PC
Remote Desktop to arms3403pc1

5C. Click the “Save” button.

The new shortcut file will be created on the desktop.

5D. (This step is optional.) If you’d like the shortcut to appear in more places, this would be a good time to make copies of it.  You could drag the icon from the desktop to the Start button, for example, to place a copy of the shortcut in your Start menu.

Connecting to the desktop computer in your office

These instructions assume that your computer is connected to the Internet, either wirelessly or via a broadband connection (e.g. cable modem or DSL).

6. Connect to Purdue’s Virtual Private Network. When using a computer off-campus, this step is required. Establish a connection to Purdue’s Virtual Private Network. For a description of this service, please see ITaP’s VPN “Getting Started” page.

7. Starting the remote connection.

7A. If you saved the icon to the desktop in step #5, locate it there and double-click the icon now.

Alternatively, repeat steps #1 and #2A, and then click the “Connect” button.

Your laptop or home computer will connect via the Internet to your desktop computer in your office.

Remote computer identity

8. Remote computer verification.

You might see a dialog (like the one shown at right) noting that the remote computer’s identity cannot be verified.

8A. You may optionally enable (place a check mark in) the “Don’t ask me again for connections to this computer” box. When the password prompt appears, enter your Purdue Career Account password.

8B. Then click the “Yes” button.

9. Password prompt.

A password prompt will appear. Because you are connecting to an ECN-supported PC which is a member of an Active Directory domain, you might need to do a couple of extra steps.

If the remote computer is running Windows 7, the login prompt will look like the one on the left in the illustration, below:

Enter your credentials dialog

9A. If the dialog appears as above, click the “Use another account” button.

9B. Enter your username as follows, substituting your own Purdue Career Account username:


9C. Enter your Purdue Career Account password.

9D. Then click the “OK” button.

Your office computer’s desktop will appear. If you had left programs running and/or files open on your office computer, they’ll appear now, just as they were.  If you had logged out of Windows before you left your office, your ECN-supported office computer will go through the typical startup process, finishing with the Message of the Day window — just as when you’re in the office.

Now, while your remote connection is open, when you type or use your mouse, it’ll be like using the keyboard and mouse at your office computer.

Minimizing and/or disconnecting

10. Using the top-central tool bar.

While connected to the remote computer, a toolbar appears at the top of your screen like the one shown here:

Remote Desktop toolbar

10A. If you need to access a file or program on your local computer (the laptop or home computer you’re using), click the minimize button on the top-central tool bar.  Remote Desktop Connection will stay running (as will all programs you have open on your office PC);  restore it by clicking its button on the task bar (at the bottom of your screen, usually).

10B. When you’re ready to disconnect from your office PC, you may end the session one of these ways:

  • Click on the “X” button at the right edge of the top-central toolbar.  This will end the remote session but leave files and programs open and running on your office PC.
  • Or, as shown in the illustration below, click on the (remote computer’s) Start menu and select “Log off.”  This will close all open files and programs on your office PC and also end the remote session.

Log off

Posted in CLOUD, CLUSTER, Computer Network & Security, Computing Technology, PARALLEL

Octree Textures on the GPU

Posted by Hemprasad Y. Badgujar on February 24, 2013

Octree Textures on the GPU

Sylvain Lefebvre

Samuel Hornus

Fabrice Neyret

Texture mapping is a very effective and efficient technique for enriching the appearance of polygonal models with details. Textures store not only color information, but also normals for bump mapping and various shading attributes to create appealing surface effects. However, texture mapping usually requires parameterizing a mesh by associating a 2D texture coordinate with every mesh vertex. Distortions and seams are often introduced by this difficult process, especially on complex meshes.

The 2D parameterization can be avoided by defining the texture inside a volume enclosing the object. Debry et al. (2002) and Benson and Davis (2002) have shown how 3D hierarchical data structures, named octree textures, can be used to efficiently store color information along a mesh surface without texture coordinates. This approach has two advantages. First, color is stored only where the surface intersects the volume, thus reducing memory requirements. Figures 37-1 and 37-2 illustrate this idea. Second, the surface is regularly sampled, and the resulting texture does not suffer from any distortions. In addition to mesh painting, any application that requires storing information on a complex surface can benefit from this approach.

Figure 37-1 An Octree Texture Surrounding a 3D Model

Figure 37-2 Unparameterized Mesh Textures with an Octree Texture

This chapter details how to implement octree textures on today’s GPUs. The octree is directly stored in texture memory and accessed from a fragment program. We discuss the trade-offs between performance, storage efficiency, and rendering quality. After explaining our implementation in Section 37.1, we demonstrate it on two different interactive applications:

  • A surface-painting application (Section 37.2). In particular, we discuss the different possibilities for filtering the resulting texture (Section 37.2.3). We also show how a texture defined in an octree can be converted into a standard texture, possibly at runtime (Section 37.2.4).
  • A nonphysical simulation of liquid flowing along a surface (Section 37.3). The simulation runs entirely on the GPU.

37.1 A GPU-Accelerated Hierarchical Structure: The $N^3$-Tree

37.1.1 Definition

An octree is a regular hierarchical data structure. The first node of the tree, the root, is a cube. Each node has either eight children or no children. The eight children form a 2x2x2 regular subdivision of the parent node. A node with children is called an internal node. A node without children is called a leaf. Figure 37-3 shows an octree surrounding a 3D model where the nodes that have the bunny’s surface inside them have been refined and empty nodes have been left as leaves.

Figure 37-3 An Octree Surrounding a 3D Model

In an octree, the resolution in each dimension doubles at each subdivision level. Thus, to reach a resolution of 256x256x256, eight levels are required ($2^8 = 256$). Depending on the application, one might prefer to divide each edge by an arbitrary number $N$ rather than 2. We therefore define a more generic structure called an $N^3$-tree. In an $N^3$-tree, each node has $N^3$ children. The octree is an $N^3$-tree with $N = 2$. A larger value of $N$ reduces the tree depth required to reach a given resolution, but it tends to waste memory because the surface is less closely matched by the tree.
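
The relation between subdivision factor, resolution and depth can be checked with a few lines of C. This is a minimal sketch; the function name n3_tree_depth is illustrative and not part of the chapter's code.

```c
#include <assert.h>

/* Smallest depth D such that n^D >= resolution, i.e. the number of
   subdivision levels needed to reach `resolution` cells per edge when
   every level divides each edge by n. */
int n3_tree_depth(int n, int resolution)
{
    assert(n >= 2 && resolution >= 1);
    int depth = 0;
    long long cells = 1;          /* cells per edge at the current depth: n^depth */
    while (cells < resolution) {
        cells *= n;
        ++depth;
    }
    return depth;
}
```

For example, n3_tree_depth(2, 256) gives the eight levels quoted above, while the same resolution with N = 4 needs only four levels.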

37.1.2 Implementation

To implement a hierarchical tree on a GPU, we need to define how to store the structure in texture memory and how to access the structure from a fragment program.

A simple approach to implement an octree on a CPU is to use pointers to link the tree nodes together. Each internal node contains an array of pointers to its children. A child can be another internal node or a leaf. A leaf contains only a data field.

Our implementation on the GPU follows a similar approach. Pointers simply become indices within a texture. They are encoded as RGB values. The content of the leaves is directly stored as an RGB value within the parent node’s array of pointers. We use the alpha channel to distinguish between a pointer to a child and the content of a leaf. Our approach relies on dependent texture lookups (or texture indirections). This requires the hardware to support an arbitrary number of dependent texture lookups, which is the case for GeForce FX and GeForce 6 Series GPUs.

The following sections detail our GPU implementation of the $N^3$-tree. For clarity, the figures illustrate the 2D equivalent of an octree (a quadtree).


We store the tree in an 8-bit RGBA 3D texture called the indirection pool. Each “pixel” of the indirection pool is called a cell.

The indirection pool is subdivided into indirection grids. An indirection grid is a cube of NxNxN cells (a 2x2x2 grid for an octree). Each node of the tree is represented by an indirection grid. It corresponds to the array of pointers in the CPU implementation described earlier.

A cell of an indirection grid can be empty or can contain one of the following:

  • Data, if the corresponding child is a leaf
  • The index of an indirection grid, if the corresponding child is another internal node

Figure 37-4 illustrates our tree storage representation.

Figure 37-4 Storage in Texture Memory (2D Case)

We write $S = S_u \times S_v \times S_w$ for the number of indirection grids stored in the indirection pool and $R = (N \times S_u) \times (N \times S_v) \times (N \times S_w)$ for the resolution in cells of the indirection pool.

Data values and indices of children are both stored as RGB triples. The alpha channel is used as a flag to determine the cell content (alpha = 1 indicates data; alpha = 0.5 indicates index; alpha = 0 indicates empty cell). The root of the tree is always stored at (0, 0, 0) within the indirection pool.
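
As a CPU-side illustration of this layout, the sketch below models one cell as four 8-bit channels and classifies it by its alpha value. The type and function names are hypothetical. The thresholds (alpha above 0.9 for data, below 0.1 for empty) mirror the comparisons used in the chapter's fragment program and are expressed here on the 0..255 scale.

```c
#include <assert.h>

/* One cell of the indirection pool: an 8-bit RGBA value. */
typedef struct { unsigned char r, g, b, a; } Cell;

typedef enum { CELL_EMPTY, CELL_INDEX, CELL_DATA } CellKind;

/* Classify a cell from its alpha channel:
   alpha = 1 (255) means data, alpha = 0.5 means index, alpha = 0 means empty. */
CellKind cell_kind(Cell c)
{
    if (c.a >= 230) return CELL_DATA;    /* a / 255 > 0.9 */
    if (c.a <= 25)  return CELL_EMPTY;   /* a / 255 < 0.1 */
    return CELL_INDEX;
}

/* Resolution of the indirection pool in cells:
   R = (N * Su) x (N * Sv) x (N * Sw). */
void pool_resolution(int n, int su, int sv, int sw, int res[3])
{
    assert(n >= 2 && su >= 1 && sv >= 1 && sw >= 1);
    res[0] = n * su;
    res[1] = n * sv;
    res[2] = n * sw;
}
```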

Accessing the Structure: Tree Lookup

Once the tree is stored in texture memory, we need to access it from a fragment program. As with standard 3D textures, the tree defines a texture within the unit cube. We want to retrieve the value stored in the tree at a point $M \in [0, 1]^3$. The tree lookup starts from the root and successively visits the nodes containing the point $M$ until a leaf is reached.

Let $I_D$ be the index of the indirection grid of the node visited at depth $D$. The tree lookup is initialized with $I_0 = (0, 0, 0)$, which corresponds to the tree root. When we are at depth $D$, we know the index $I_D$ of the current node’s indirection grid. We now explain how we retrieve $I_{D+1}$ from $I_D$.

The lookup point $M$ is inside the node visited at depth $D$. To decide what to do next, we need to read from the indirection grid $I_D$ the value stored at the location corresponding to $M$. To do so, we need to compute the coordinates of $M$ within the node.

At depth $D$, a complete tree produces a regular grid of resolution $N^D \times N^D \times N^D$ within the unit cube. We call this grid the depth-$D$ grid. Each node of the tree at depth $D$ corresponds to a cell of this grid. In particular, $M$ is within the cell corresponding to the node visited at depth $D$. The coordinates of $M$ within this cell are given by $\mathrm{frac}(M \times N^D)$. We use these coordinates to read the value from the indirection grid $I_D$. The lookup coordinates within the indirection pool are thus computed as:

$$P = \frac{I_D + \mathrm{frac}(M \times N^D)}{S}$$

We then retrieve the RGBA value stored at $P$ in the indirection pool. Depending on the alpha value, either we will return the RGB color if the child is a leaf, or we will interpret the RGB values as the index of the child’s indirection grid ($I_{D+1}$) and continue to the next tree depth. Figure 37-5 summarizes this entire process for the 2D case (quadtree).

Figure 37-5 Example of a Tree Lookup
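
The lookup procedure can also be written as a CPU reference in C, which is convenient for testing a tree before uploading it. The sketch below is an assumption-laden illustration, not the chapter's code: the Pool structure, its x-fastest memory layout and the return codes are hypothetical, nearest-cell addressing replaces the texture unit, and each index cell stores the grid index $I_D$ directly.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned char r, g, b, a; } Cell;   /* 8-bit RGBA cell */

/* Hypothetical CPU copy of the indirection pool. */
typedef struct {
    const Cell *cells;   /* (n*su) * (n*sv) * (n*sw) cells, x varies fastest */
    int n;               /* subdivision factor N (2 for an octree)          */
    int su, sv, sw;      /* number of indirection grids along each axis     */
} Pool;

enum { LOOKUP_EMPTY = 0, LOOKUP_LEAF = 1, LOOKUP_TOO_DEEP = 2 };

/* Fractional part of a non-negative value. */
static double frac_nonneg(double x) { return x - (double)(long long)x; }

/* Walk the tree from the root towards the point m in [0,1)^3. */
int tree_lookup_cpu(const Pool *p, const double m[3], int max_depth, Cell *out)
{
    int index[3] = {0, 0, 0};            /* I_0: the root grid */
    double scale = 1.0;                  /* N^D */
    int width = p->n * p->su, height = p->n * p->sv;

    for (int depth = 0; depth < max_depth; ++depth) {
        int t[3];                        /* cell coordinates inside the pool */
        assert(index[0] < p->su && index[1] < p->sv && index[2] < p->sw);
        for (int a = 0; a < 3; ++a) {
            assert(m[a] >= 0.0 && m[a] < 1.0);
            int local = (int)(frac_nonneg(m[a] * scale) * p->n);
            if (local >= p->n) local = p->n - 1;       /* rounding guard */
            t[a] = index[a] * p->n + local;            /* (I_D + frac) * N */
        }
        Cell c = p->cells[((size_t)t[2] * height + t[1]) * width + t[0]];
        if (c.a >= 230) { *out = c; return LOOKUP_LEAF; }   /* alpha = 1   */
        if (c.a <= 25) return LOOKUP_EMPTY;                 /* alpha = 0   */
        index[0] = c.r; index[1] = c.g; index[2] = c.b;     /* alpha = 0.5 */
        scale *= p->n;
    }
    return LOOKUP_TOO_DEEP;
}
```

The max_depth argument plays the same role as the fixed iteration count of a fragment program: the caller is responsible for keeping the tree shallower than that bound.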

The lookup ends when a leaf is reached. In practice, our fragment program also stops after a fixed number of texture lookups: on most hardware, it is only possible to implement loop statements with a fixed number of iterations (however, early exit is possible on GeForce 6 Series GPUs). The application is in charge of limiting the tree depth with respect to the maximum number of texture lookups done within the fragment program. The complete tree lookup code is shown in Example 37-1.

Example 37-1. The Tree Lookup Cg Code

float4 tree_lookup(uniform sampler3D IndirPool, // Indirection Pool
                   uniform float3 invS,         // 1 / S
                   uniform float N,
                   float3 M)                    // Lookup coordinates
{
  float4 I = float4(0.0, 0.0, 0.0, 0.0);
  float3 MND = M;
  // MAX_DEPTH: assumed compile-time constant bounding the tree depth
  for (float i = 0; i < MAX_DEPTH; i++) { // fixed # of iterations
    float3 P;
    // compute lookup coords. within current node
    P = (MND + floor(0.5 + I.xyz * 255.0)) * invS;
    // access indirection pool
    if (I.w < 0.9)                     // already in a leaf?
      I = (float4)tex3D(IndirPool, P); // no, continue to next depth
#ifdef DYN_BRANCHING // early exit if hardware supports dynamic branching
    if (I.w > 0.9)    // a leaf has been reached
      break;
#endif
    if (I.w < 0.1) // empty cell
      discard;
    // compute pos within next depth grid
    MND = MND * N;
  }
  return (I);
}


Further Optimizations

In our tree lookup algorithm, as we explained earlier, the computation of $P$ requires a frac instruction. In our implementation, however, as shown in Example 37-1, we actually avoid computing the frac by relying on the cyclic behavior of the texture units (repeat mode). We leave the detailed explanations as an appendix, located on the book’s CD.

We compute $P$ as

$$P = \frac{M \times N^D + \Delta_D}{S}$$

where $\Delta_D$ is an integer within the range $[0, S[$.

We store $\Delta_D$ instead of directly storing the $I_D$ values. Please refer to the appendix on the CD for the code to compute $\Delta_D$.

Encoding Indices

The indirection pool is an 8-bit 3D RGBA texture. This means that we can encode only 256 different values per channel. This gives us an addressing space of 24 bits (3 indices of 8 bits), which makes it possible to encode octrees large enough for most applications.

Within a fragment program, a texture lookup into an 8-bit texture returns a value mapped between [0,1]. However, we need to encode integers. Using a floating-point texture to do so would require more memory and would reduce performance. Instead, we map values between [0,1] with a fixed precision of 1/255 and simply multiply the floating-point value by 255 to obtain an integer. Note that on hardware without fixed-precision registers, we need to compute floor(0.5 + 255 * v) to avoid rounding errors.
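
The encoding can be illustrated on the CPU with a pair of small C helpers. This is a sketch with hypothetical names (encode_index, decode_index); it only mirrors the arithmetic described above and says nothing about how a particular GPU rounds.

```c
#include <assert.h>

/* Value in [0,1] that an 8-bit texture channel holding the integer i
   (0..255) returns to the fragment program. */
float encode_index(int i)
{
    assert(i >= 0 && i <= 255);
    return (float)i / 255.0f;
}

/* Recover the integer from a normalized channel value v in [0,1],
   rounding to nearest as in floor(0.5 + 255 * v). */
int decode_index(float v)
{
    assert(v >= 0.0f && v <= 1.0f);
    return (int)(0.5f + 255.0f * v);   /* truncation equals floor for values >= 0 */
}
```

With three such channels, a cell can address 256 x 256 x 256 indirection grids, which is the 24-bit addressing space mentioned above.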

37.2 Application 1: Painting on Meshes

In this section we use the GPU-accelerated octree structure presented in the previous section to create a surface-painting application. Thanks to the octree, the mesh does not need to be parameterized. This is especially useful with complex meshes such as trees, hairy monsters, or characters.

The user will be able to paint on the mesh using a 3D brush, similar to the brush used in 2D painting applications. In this example, the painting resolution is homogeneous along the surface, although multiresolution painting would be an easy extension if desired.

37.2.1 Creating the Octree

We start by computing the bounding box of the object to be painted. The object is then rescaled such that its largest dimension is mapped between [0,1]. The same scaling is applied to the three dimensions because we want the painting resolution to be the same in every dimension. After this process, the mesh fits entirely within the unit box.

The user specifies the desired resolution of the painting. This determines the depth of the leaves of the octree that contain colors. For instance, if the user selects a resolution of 512³, the leaves containing colors will be at depth 9.
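The relation between resolution and leaf depth can be checked with a few lines of arithmetic. The sketch below assumes indirection grids of size N × N × N with N = 2 by default (which is what the 512³ example implies); the helper name `leaf_depth` is hypothetical.

```python
def leaf_depth(resolution: int, n: int = 2) -> int:
    """Depth at which a complete tree with n*n*n children per node reaches
    `resolution` cells along each axis (assumes resolution is a power of n)."""
    depth, size = 0, 1
    while size < resolution:
        size *= n
        depth += 1
    if size != resolution:
        raise ValueError("resolution must be a power of n")
    return depth


print(leaf_depth(512))  # 9, because 2**9 == 512
```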

The tree is created by subdividing the nodes intersecting the surface until all the leaves either are empty or are at the selected depth (color leaves). To check whether a tree node intersects the geometry, we rely on the box defining the boundary of the node. This process is depicted in Figure 37-6. We use the algorithm shown in Listing 37-2.

Figure 37-6 Building an Octree Around a Mesh Surface

This algorithm uses our GPU octree texture API. The links between nodes (indices in the indirection grids) are set up by the createChild() call. The values stored in tree leaves are set up by calling setChildAsEmpty() and setChildColor(), which also set the appropriate alpha value.

Example 37-2. Recursive Algorithm for Octree Creation

void createNode(depth, polygons, box)
{
  for all children (i, j, k) within (N, N, N)
  {
    if (depth + 1 == painting depth)       // painting depth reached?
      setChildColor(i, j, k, white)        // child is at depth+1
    else
    {
      childbox = computeSubBox(i, j, k, box)
      if (childbox intersect polygons)
      {
        child = createChild(i, j, k)
        // recurse
        createNode(depth + 1, polygons, childbox)
      }
      else
        setChildAsEmpty(i, j, k)
    }
  }
}
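For readers who want to experiment with the recursion outside the GPU octree texture API, here is a CPU-only sketch in Python. It rests on several assumptions that are not in the chapter: the surface is approximated by a list of sample points instead of polygons, the tree is a nested dictionary keyed by (i, j, k), and the helper names are hypothetical.

```python
WHITE = (255, 255, 255)
EMPTY = None


def compute_sub_box(i, j, k, box, n):
    """Return the (lo, hi) corners of child (i, j, k) inside box."""
    lo, hi = box
    step = [(hi[a] - lo[a]) / n for a in range(3)]
    child_lo = tuple(lo[a] + idx * step[a] for a, idx in enumerate((i, j, k)))
    child_hi = tuple(child_lo[a] + step[a] for a in range(3))
    return child_lo, child_hi


def intersects(box, points):
    """Assumption: the surface is a point cloud; a box intersects it if it holds a point."""
    lo, hi = box
    return any(all(lo[a] <= p[a] <= hi[a] for a in range(3)) for p in points)


def create_node(depth, points, box, painting_depth, n=2):
    """Mirror of the recursive listing: subdivide until the painting depth is reached."""
    node = {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if depth + 1 == painting_depth:
                    node[(i, j, k)] = WHITE          # color leaf at depth + 1
                else:
                    child_box = compute_sub_box(i, j, k, box, n)
                    if intersects(child_box, points):
                        node[(i, j, k)] = create_node(
                            depth + 1, points, child_box, painting_depth, n)
                    else:
                        node[(i, j, k)] = EMPTY      # empty leaf
    return node


unit_box = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
tree = create_node(0, [(0.1, 0.1, 0.1)], unit_box, painting_depth=2)
```

With a single sample point near the origin, only child (0, 0, 0) of the root is subdivided; the other seven children are marked empty.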

37.2.2 Painting

In our application, the painting tool is drawn as a small sphere moving along the surface of the mesh. This sphere is defined by a painting center P_center and a painting radius P_radius. The behavior of the brush is similar to that of brushes in 2D painting tools.

When the user paints, the leaf nodes intersecting the painting tool are updated. The new color is computed as a weighted sum of the previous color and the painting color. The weight is such that the painting opacity decreases as the distance from P_center increases.
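The weighted sum can be sketched as follows. The chapter does not state which falloff function is used, so the linear falloff below is purely an assumption for illustration, as are the function names.

```python
import math


def brush_weight(leaf_center, p_center, p_radius):
    """Opacity of the brush at a leaf: 1 at the painting center, 0 at the radius.

    Assumption: linear falloff; the original application may use another profile.
    """
    d = math.dist(leaf_center, p_center)
    return max(0.0, 1.0 - d / p_radius)


def paint(old_color, paint_color, weight):
    """Weighted sum of the previous color and the painting color."""
    return tuple((1.0 - weight) * o + weight * p
                 for o, p in zip(old_color, paint_color))


w = brush_weight((0.5, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)   # halfway to the edge
new_color = paint((1.0, 1.0, 1.0), (0.0, 0.0, 0.0), w)
```

Leaves outside the sphere get a weight of 0 and therefore keep their previous color.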

To minimize the amount of data to be sent to the GPU as painting occurs, only the modified leaves are updated in texture memory. This corresponds to a partial update of the indirection pool texture (under OpenGL, we use glTexSubImage3D). The modifications are tracked on a copy of the tree stored in CPU memory.

37.2.3 Rendering

To render the textured mesh, we need to access the octree from the fragment program, using the tree lookup defined in Section 37.1.2.

The untransformed coordinates of the vertices are stored as 3D texture coordinates. These 3D texture coordinates are interpolated during the rasterization of the triangles. Therefore, within the fragment program, we know the 3D point of the mesh surface being projected in the fragment. By using these coordinates as texture coordinates for the tree lookup, we retrieve the color stored in the octree texture.

However, this produces the equivalent of a standard texture lookup in “nearest” mode. Linear interpolation and mipmapping are often mandatory for high-quality results. In the following section, we discuss how to implement these techniques for octree textures.

Linear Interpolation

Linear interpolation of the texture can be obtained by extending the standard 2D linear interpolation. Because the octree texture is a volume texture, eight samples are required for linear interpolation, as shown in Figure 37-7.

Figure 37-7 Linear Interpolation Using Eight Samples

However, we store information only where the surface intersects the volume. Some of the samples involved in the 3D linear interpolation are not on the surface and have no associated color information. Consider a sample at coordinates (i, j, k) within the maximum depth grid (recall that the depth D grid is the regular grid produced by a complete octree at depth D). The seven other samples involved in the 3D linear interpolation are at coordinates (i+1, j, k), (i, j+1, k), (i, j, k+1), (i, j+1, k+1), (i+1, j, k+1), (i+1, j+1, k), and (i+1, j+1, k+1). However, some of these samples may not be included in the tree, because they are too far from the surface. This leads to rendering artifacts, as shown in Figure 37-8.
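To make the role of the eight samples concrete, here is a small CPU sketch of 3D linear interpolation on one scalar channel. The dictionary layout and the function name `trilinear` are assumptions made for this illustration; in the fragment program the eight values would come from eight tree lookups.

```python
def trilinear(samples, fx, fy, fz):
    """Interpolate eight samples keyed by offsets (di, dj, dk) in {0, 1}.

    fx, fy, fz are the fractional positions in [0, 1] between sample
    (i, j, k) and sample (i+1, j+1, k+1).
    """
    result = 0.0
    for (di, dj, dk), value in samples.items():
        wx = fx if di else 1.0 - fx
        wy = fy if dj else 1.0 - fy
        wz = fz if dk else 1.0 - fz
        result += wx * wy * wz * value
    return result


offsets = [(di, dj, dk) for di in (0, 1) for dj in (0, 1) for dk in (0, 1)]

# All eight samples present and white: the result stays white everywhere.
full = {o: 1.0 for o in offsets}

# One sample missing from the tree (read back as 0): the result darkens.
holed = dict(full)
holed[(1, 1, 1)] = 0.0

print(trilinear(full, 0.5, 0.5, 0.5))   # 1.0
print(trilinear(holed, 0.5, 0.5, 0.5))  # 0.875
```

The second call shows the kind of darkening that a sample absent from the tree introduces into the interpolated color.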

Figure 37-8 Fixing Artifacts Caused by Straightforward Linear Interpolation

We remove these artifacts by modifying the tree creation process. We make sure that all of the samples necessary for linear interpolation are included in the tree. This can be done easily by enlarging the box used to check whether a tree node intersects the geometry. The box is built in such a way that it includes the previous samples in each dimension. Indeed, the sample at (i, j, k) must be added if one of the previous samples (for example, the one at (i-1, j-1, k-1)) is in the tree. This is illustrated in Figure 37-9.

Figure 37-9 Modifying the Tree Creation to Remove Linear Interpolation Artifacts

In our demo, we use the same depth for all color leaves. Of course, the octree structure makes it possible to store color information at different depths. However, doing so complicates linear interpolation. For more details, refer to Benson and Davis 2002.


Mipmapping

When a textured mesh becomes small on the screen, multiple samples of the texture fall into the same pixel. Without a proper filtering algorithm, this leads to aliasing. Most GPUs implement the mipmapping algorithm on standard 2D textures. We extend this algorithm to our GPU octree textures.

We define the mipmap levels as follows. The finest level (level 0) corresponds to the leaves of the initial octree. A coarser level is built from the previous one by merging the leaves in their parent node. The node color is set to the average color of its leaves, and the leaves are suppressed, as shown in Figure 37-10. The octree depth is therefore reduced by one at each mipmapping level. The coarsest level has only one root node, containing the average color of all the leaves of the initial tree.

Figure 37-10 An Example of a Tree with Mipmapping

Storing one tree per mipmapping level would be expensive. Instead, we create a second 3D texture, called the LOD pool. The LOD pool has one cell per indirection grid of the indirection pool (see Figure 37-10, bottom row). Its resolution is thus S_u × S_v × S_w (see “Storage” in Section 37.1.2). Each node of the initial tree becomes a leaf at a given mipmapping level. The LOD pool stores the color taken by the nodes when they are used as leaves in a mipmapping level.

To texture the mesh at a specific mipmapping level, we stop the tree lookup at the corresponding depth and look up the node’s average color in the LOD pool. The appropriate mipmapping level can be computed within the fragment program using partial-derivative instructions.
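The colors stored in the LOD pool can be computed bottom-up on the CPU copy of the tree. The sketch below is one possible way to do it, under stated assumptions: internal nodes are Python lists of children, color leaves are RGB tuples, empty leaves are `None` and are skipped in the average, and the function names are hypothetical. It carries a running sum and a leaf count so that the root ends up with the average of all color leaves.

```python
def color_sum(node):
    """Return (sum of leaf colors, number of color leaves) below node."""
    if node is None:                      # empty leaf: contributes nothing
        return (0.0, 0.0, 0.0), 0
    if isinstance(node, tuple):           # color leaf
        return tuple(float(c) for c in node), 1
    total, count = [0.0, 0.0, 0.0], 0
    for child in node:                    # internal node: accumulate children
        s, n = color_sum(child)
        total = [t + x for t, x in zip(total, s)]
        count += n
    return tuple(total), count


def lod_color(node):
    """Average color a node takes when it is used as a leaf of a coarser level."""
    s, n = color_sum(node)
    return tuple(x / n for x in s) if n else None


tree = [
    (1.0, 0.0, 0.0),
    None,
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
print(lod_color(tree))  # (0.5, 0.0, 0.0)
```

Stopping the tree lookup at a given depth and reading `lod_color` of the node reached corresponds to sampling one mipmapping level.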

37.2.4 Converting the Octree Texture to a Standard 2D Texture

Our ultimate goal is to use octree textures as a replacement for 2D textures, thus completely removing the need for a 2D parameterization. However, the octree texture requires explicit programming of the texture filtering. This leads to long fragment programs. On recent GPUs, performance is still high enough for texture-authoring applications, where a single object is displayed. But for applications displaying complex scenes, such as games or simulators, rendering performance may be too low. Moreover, GPUs are extremely efficient at displaying filtered standard 2D texture maps.

Being able to convert an octree texture into a standard 2D texture is therefore important. We would like to perform this conversion dynamically: this makes it possible to select the best representation at runtime. For example, an object near the viewpoint would use the linearly interpolated octree texture and switch to the corresponding filtered standard 2D texture when it moves farther away. The advantage is that filtering of the 2D texture is natively handled by the GPU. Thus, the extra cost of the octree texture is incurred only when details are visible.

In the following discussion, we assume that the mesh is already parameterized. We describe how we create a 2D texture map from an octree texture.

To produce the 2D texture map, we render the triangles using their 2D (uv) coordinates instead of their 3D (xyz) coordinates. The triangles are textured with the octree texture, using the 3D coordinates of the mesh vertices as texture coordinates for the tree lookup. The result is shown in Figure 37-11.

Figure 37-11 Converting the Octree into a Standard 2D Texture

However, this approach produces artifacts. When the 2D texture is applied to the mesh with filtering, the background color bleeds inside the texture. This happens because samples outside of the 2D triangles are used by the linear interpolation for texture filtering. It is not sufficient to add only a few border pixels: more and more pixels outside of the triangles are used by coarser mipmapping levels. These artifacts are shown in Figure 37-12.

Figure 37-12 Artifacts Resulting from Straightforward Conversion

To suppress these artifacts, we compute a new texture in which the colors are extrapolated outside of the 2D triangles. To do so, we use a simplified GPU variant of the extrapolation method known as push-pull. This method has been used for the same purpose in Sander et al. 2001.

We first render the 2D texture map as described previously. The background is set with an alpha value of 0. The triangles are rendered using an alpha value of 1. We then ask the GPU to automatically generate the mipmapping levels of the texture. Then we collapse all the mipmapping levels into one texture, interpreting the alpha value as a transparency coefficient. This is done with the Cg code shown in Listing 37-3.

Finally, new mipmapping levels are generated by the GPU from this new texture. Figures 37-13 and 37-14 show the result of this process.
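The collapse step can be followed by hand on a single texel. The sketch below takes the RGBA values that one texel location has at each mipmapping level, ordered from coarsest to finest, and blends them the same way as the Cg code in Listing 37-3. It assumes the background was cleared to black with alpha 0, so that the automatically generated levels hold colors already weighted by coverage; the function name `collapse` is hypothetical.

```python
def collapse(samples_coarse_to_fine):
    """Blend RGBA samples of one texel, coarsest level first, then normalize."""
    res = (0.0, 0.0, 0.0, 0.0)
    for c in samples_coarse_to_fine:
        # Finer level goes on top; what it leaves uncovered (1 - alpha)
        # is filled with the result accumulated from coarser levels.
        res = tuple(ci + ri * (1.0 - c[3]) for ci, ri in zip(c, res))
    r, g, b, a = res
    if a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)       # no level carries any color here
    return (r / a, g / a, b / a, 1.0)     # normalized color, alpha == 1


# Texel inside a red triangle: the finest level wins.
inside = [(0.25, 0.0, 0.0, 0.25), (0.5, 0.0, 0.0, 0.5), (1.0, 0.0, 0.0, 1.0)]

# Texel outside every triangle: only the coarsest level has coverage,
# so the red color is extrapolated into the background.
outside = [(0.25, 0.0, 0.0, 0.25), (0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]

print(collapse(inside))   # (1.0, 0.0, 0.0, 1.0)
print(collapse(outside))  # (1.0, 0.0, 0.0, 1.0)
```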

Figure 37-13 Color Extrapolation

Figure 37-14 Artifacts Removed Due to Color Extrapolation

Example 37-3. Color Extrapolation Cg Code

PixelOut main(V2FI IN,
  uniform sampler2D Tex) // texture with mipmapping levels
{
  PixelOut OUT;
  float4 res = float4(0.0, 0.0, 0.0, 0.0);
  float alpha = 0.0;

  // start with coarsest level
  float sz = TEX_SIZE;

  // for all mipmapping levels
  for (float i=0.0; i<=TEX_SIZE_LOG2; i+=1.0)
  {
    // texture lookup at this level
    // (the derivatives passed to tex2D select the mipmapping level)
    float2 MIP = float2(sz/TEX_SIZE, 0.0);
    float4 c = (float4)tex2D(Tex, IN.TCoord0, MIP.xy, MIP.yx);

    // blend with previous
    res = c + res * (1.0 - c.w);

    // go to finer level
    sz /= 2.0;
  }

  // done - return normalized color (alpha == 1)
  OUT.COL = float4(res.xyz/res.w,1);
  return OUT;
}


37.3 Application 2: Surface Simulation

We have seen with the previous application that octree structures are useful for storing color information along a mesh surface. But octree structures on GPUs are also useful for simulation purposes. In this section, we present how we use an octree structure on the GPU to simulate liquid flowing along a mesh.

We do not go through the details of the simulation itself, because that is beyond the scope of this chapter. We concentrate instead on how we use the octree to make available all the information required by the simulation.

The simulation is done by a cellular automaton residing on the surface of the object. To perform the simulation, we need to attach a 2D density map to the mesh surface. The next simulation step is computed by updating the value of each pixel with respect to the density of its neighbors. This is done by rendering into the next density map using the previous density map and neighboring information as input.

Because physical simulation is very sensitive to distortions, using a standard 2D parameterization to associate the mesh surface to the density map would not produce good results in general. Moreover, computation power could be wasted if some parts of the 2D density map were not used. Therefore, we use an octree to avoid the parameterization.

The first step is to create an octree around the mesh surface (see Section 37.2.1). We do not directly store density within the octree: the density needs to be updated using a render-to-texture operation during the simulation and should therefore be stored in a 2D texture map. Instead of density, each leaf of the octree contains the index of a pixel within the 2D density map. Recall that the leaves of the octree store three 8-bit values (in RGB channels). To be able to use a density map larger than 256×256, we combine the values of the blue and green channels to form a 16-bit index.
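Combining two 8-bit channels into a 16-bit index is plain byte arithmetic. The chapter does not say which channel holds the high byte, so the choice below (green high, blue low) is an assumption, and the function names are hypothetical.

```python
def pack_index(index):
    """Split a 16-bit index into (green, blue) bytes; green is the high byte here."""
    if not 0 <= index < (1 << 16):
        raise ValueError("index must fit in 16 bits")
    return index >> 8, index & 0xFF


def unpack_index(green, blue):
    """Rebuild the 16-bit index from the two 8-bit channel values."""
    return (green << 8) | blue


green, blue = pack_index(40000)
assert unpack_index(green, blue) == 40000
```

In a fragment program the same reconstruction would be written with multiplications, for example green * 256 + blue on the decoded integer values.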

During simulation, we also need to access the density of the neighbors. A set of 2D RGB textures, called neighbor textures, is used to encode neighboring information. Let I be an index stored within a leaf L of the octree. Let Dmap be the density map and N a neighbor texture. The Cg call tex2D(Dmap, I) returns the density associated with leaf L. The call tex2D(N, I) gives the index within the density map corresponding to a neighbor (in 3D space) of the leaf L. Therefore, tex2D(Dmap, tex2D(N, I)) gives us the density of the neighbor of L.
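The double indirection can be mimicked on the CPU with dictionaries standing in for textures. This is only an illustration of the lookup pattern: the pixel coordinates, density values, and function name are made up for the example.

```python
def neighbor_density(dmap, neighbor_tex, index):
    """CPU stand-in for tex2D(Dmap, tex2D(N, I)).

    dmap maps a pixel index to a density; neighbor_tex maps a pixel index
    to the pixel index of one neighbor of the same leaf.
    """
    return dmap[neighbor_tex[index]]


# Two leaves whose density pixels are (0, 0) and (1, 0), neighbors of each other.
dmap = {(0, 0): 0.2, (1, 0): 0.8}
neighbor = {(0, 0): (1, 0), (1, 0): (0, 0)}

own = dmap[(0, 0)]                                  # density of leaf L
other = neighbor_density(dmap, neighbor, (0, 0))    # density of its neighbor
```

One such neighbor texture is needed per neighbor direction that the simulation reads.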

To encode the full 3D neighborhood information, 26 textures would be required (a leaf of the tree can have up to 26 neighbors in 3D). However, fewer neighbors are required in practice. Because the octree is built around a 2D surface, the average number of neighbors is likely to be closer to 9.

Once these textures have been created, the simulation can run on the density map. Rendering is done by texturing the mesh with the density map. The octree is used to retrieve the density stored in a given location of the mesh surface. Results of the simulation are shown in Figure 37-15. The user can interactively add liquid on the surface. Videos are available on the book’s CD.

Figure 37-15 Liquid Flowing Along Mesh Surfaces

37.4 Conclusion

We have presented a complete GPU implementation of octree textures. These structures offer an efficient and convenient way of storing undistorted data along a mesh surface. This can be color data, as in the mesh-painting application, or data for dynamic texture simulation, as in the flowing liquid simulation. Rendering can be done efficiently on modern hardware, and we have provided solutions for filtering to avoid texture aliasing. Nevertheless, because 2D texture maps are preferable in some situations, we have shown how an octree texture can be dynamically converted into a 2D texture without artifacts.

Octrees are very generic data structures, widely used in computer science. They are a convenient way of storing information on unparameterized meshes, and more generally in space. Many other applications, such as volume rendering, can benefit from their hardware implementation.

We hope that you will discover many further uses for and improvements to the techniques presented in this chapter! Please see for updates of the source code and additional information.

37.5 References

Benson, D., and J. Davis. 2002. “Octree Textures.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 785–790.

Debry, D., J. Gibbs, D. Petty, and N. Robins. 2002. “Painting and Rendering Textures on Unparameterized Models.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 763–768.

Sander, P., J. Snyder, S. Gortler, and H. Hoppe. 2001. “Texture Mapping Progressive Meshes.” In Proceedings of SIGGRAPH 2001, pp. 409–416.


Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

NVIDIA makes no warranty or representation that the techniques described herein are free from any Intellectual Property claims. The reader assumes all risk of any such claims based on his or her use of these techniques.


Library of Congress Cataloging-in-Publication Data

GPU gems 2 : programming techniques for high-performance graphics and general-purpose
computation / edited by Matt Pharr ; Randima Fernando, series editor.
p. cm.
Includes bibliographical references and index.
ISBN 0-321-33559-7 (hardcover : alk. paper)
1. Computer graphics. 2. Real-time programming. I. Pharr, Matt. II. Fernando, Randima.

T385.G688 2005

GeForce™ and NVIDIA Quadro® are trademarks or registered trademarks of NVIDIA Corporation.

Nalu, Timbury, and Clear Sailing images © 2004 NVIDIA Corporation.

mental images and mental ray are trademarks or registered trademarks of mental images, GmbH.

Copyright © 2005 by NVIDIA Corporation.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher. Printed in the United States of America. Published simultaneously in Canada.

For information on obtaining permission for use of material from this work, please submit a written request to:

       Pearson Education, Inc.
Rights and Contracts Department
One Lake Street
Upper Saddle River, NJ 07458

Text printed in the United States on recycled paper at Quebecor World Taunton in Taunton, Massachusetts.

Second printing, April 2005

Posted in CLOUD, CLUSTER, Computer Softwares, Computing Technology, CUDA, Graphics Cards, OpenCL, OpenGL, PARALLEL
