
Installing CUDA 6 on Ubuntu 12.04

Posted by Hemprasad Y. Badgujar on April 18, 2014


Set Up a New Ubuntu 12.04

  • Clean Install
  • DO NOT install NVIDIA drivers at this stage (the runfile installer below provides the driver)
  • You can manually update via terminal by running:
    sudo apt-get update
    sudo apt-get upgrade
    

    Additionally you can run:

    sudo apt-get dist-upgrade
  • Enable root login:
     sudo passwd root
     sudo sh -c 'echo "greeter-show-manual-login=true" >> /etc/lightdm/lightdm.conf'

           Root won’t show up as a user, but “Login” will, which is how you manually log in with users not shown in the greeter.

           Reboot, and you should then be able to log in as root.

  • Download the NVIDIA CUDA Toolkit.

 

Pre-installation Actions

Some actions must be taken before the CUDA Toolkit and Driver can be installed on Linux:

  • Verify the system has a CUDA-capable GPU.
  • Verify the system is running a supported version of Linux.
  • Verify the system has gcc installed.
  • Download the NVIDIA CUDA Toolkit.
Note: You can override the install-time prerequisite checks by running the installer with the -override flag. Remember that the prerequisites will still be required to use the NVIDIA CUDA Toolkit.

Verify You Have a CUDA-Capable GPU

To verify that your GPU is CUDA-capable, go to your distribution’s equivalent of System Properties, or, from the command line, enter:

lspci | grep -i nvidia

If you do not see any settings, update the PCI hardware database that Linux maintains by entering update-pciids (generally found in /sbin) at the command line and rerun the previous lspci command.
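For example (update-pciids typically needs network access and may require root):

sudo /sbin/update-pciids
lspci | grep -i nvidia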

If your graphics card is from NVIDIA and it is listed in http://developer.nvidia.com/cuda-gpus, your GPU is CUDA-capable.

The Release Notes for the CUDA Toolkit also contain a list of supported products.

 Verify You Have a Supported Version of Linux

The CUDA Development Tools are only supported on some specific distributions of Linux. These are listed in the CUDA Toolkit release notes.

To determine which distribution and release number you’re running, type the following at the command line:

uname -m && cat /etc/*release

You should see output similar to the following, modified for your particular system:

i386 Red Hat Enterprise Linux WS release 4 (Nahant Update 6)

The i386 line indicates you are running on a 32-bit system. On 64-bit systems running in 64-bit mode, this line will generally read: x86_64. The second line gives the version number of the operating system.

Verify the System Has gcc Installed

The gcc compiler and toolchain generally are installed as part of the Linux installation, and in most cases the version of gcc installed with a supported version of Linux will work correctly.

To verify the version of gcc installed on your system, type the following on the command line:

gcc --version

If an error message displays, you need to install the development tools from your Linux distribution or obtain a version of gcc and its accompanying toolchain from the Web.

For ARMv7 cross development, a suitable cross compiler is required. For example, performing the following on Ubuntu 12.04:

sudo apt-get install g++-4.6-arm-linux-gnueabihf

will install the gcc 4.6 cross compiler on your system, which will be used by nvcc. Please refer to the NVCC manual on how to use nvcc to cross-compile to the ARMv7 architecture.

Choose an Installation Method

The CUDA Toolkit can be installed using either of two different installation mechanisms: distribution-specific packages, or a distribution-independent package. The distribution-independent package has the advantage of working across a wider set of Linux distributions, but does not update the distribution’s native package management system. The distribution-specific packages interface with the distribution’s native package management system. It is recommended to use the distribution-specific packages, where possible.

Note: Distribution-specific packages and repositories are not provided for Red Hat 5 and Ubuntu 10.04. For those two Linux distributions, the stand-alone installer must be used.
Note: Standalone installers are not provided for the ARMv7 release. For both native ARMv7 as well as cross development, the toolkit must be installed using the distribution-specific installer.

Download the NVIDIA CUDA Toolkit

The NVIDIA CUDA Toolkit is available at http://developer.nvidia.com/cuda-downloads.

Choose the platform you are using and download the NVIDIA CUDA Toolkit

The CUDA Toolkit contains the CUDA driver and tools needed to create, build and run a CUDA application as well as libraries, header files, CUDA samples source code, and other resources.

Download Verification

The download can be verified by comparing the MD5 checksum posted at http://developer.nvidia.com/cuda-downloads/checksums with that of the downloaded file. If the checksums differ, the downloaded file is corrupt and needs to be downloaded again.

To calculate the MD5 checksum of the downloaded file, run the following:

$ md5sum <file>
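For example, for the CUDA 6 runfile (use the actual name of the file you downloaded, and compare the printed hash against the posted checksum):

$ md5sum cuda_6.0.37_linux_*.run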

Runfile Installation

 

This section describes the installation and configuration of CUDA when using the standalone installer.

 

Pre-installation Setup

Before the stand-alone installation can be run, perform the pre-installation actions.

 

Prerequisites

If you have already installed a standalone CUDA driver and desire to keep using it, you need to make sure it meets the minimum version requirement for the toolkit. This requirement can be found in the CUDA Toolkit release notes. With many distributions, the driver version number can be found in the graphical interface menus under Applications > System Tools > NVIDIA X Server Settings. On the command line, the driver version number can be found by running /usr/bin/nvidia-settings.

The package manager installations (RPM/DEB packages) and the stand-alone installer installations (.run file) are incompatible. See below about how to uninstall any previous RPM/DEB installation.

 

Copy the cuda_6.0.37_linux_*.run file to root’s home folder for easy access.

Contents

The standalone installer can install any combination of the NVIDIA Driver (that includes the CUDA Driver), the CUDA Toolkit, or the CUDA Samples. If needed, each individual installer can be extracted by using the -extract=/absolute/path/to/extract/location/. The extraction path must be an absolute path.

The CUDA Toolkit installation includes a read-only copy of the CUDA Samples. The read-only copy is used to create a writable copy of the CUDA Samples at some other location at any point in time. To create this writable copy, use the cuda-install-samples-6.0.sh script provided with the toolkit. It is equivalent to installing the CUDA Samples with the standalone installer.

Extra Libraries

If you wish to build all the samples, including those with graphical rather than command-line interfaces, additional system libraries or headers may be required. While every Linux distribution is slightly different with respect to package names and package installation procedures, the libraries and headers most likely to be necessary are OpenGL (e.g., Mesa), GLU, GLUT, and X11 (including Xi, Xmu, and GLX).

On Ubuntu, those can be installed as follows:

sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libgl1-mesa-dri libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

sudo apt-get install libwxgtk2.8-0 libwxbase2.8-0 wx-common libglu1-mesa libgl1-mesa-glx zlib1g bzip2 gpsd gpsd-clients xcalib libportaudio2

Interaction with Nouveau

Proprietary Video Driver

The built-in nouveau video driver in Ubuntu is incompatible with the CUDA Toolkit, and you have to replace it with the proprietary NVIDIA driver.

$ sudo apt-get remove --purge xserver-xorg-video-nouveau
The Nouveau drivers may be installed into your root filesystem (initramfs) and may cause the Display Driver installation to fail. To fix the situation, the initramfs image must be rebuilt with:

sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
sudo dracut /boot/initramfs-$(uname -r).img $(uname -r)

If Grub2 is used as the bootloader, the rdblacklist=nouveau nouveau.modeset=0 options must be added at the end of the GRUB_CMDLINE_LINUX entry in /etc/default/grub. Then, the Grub configuration must be regenerated by running:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
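For reference, after editing, the entry in /etc/default/grub might look like this (a sketch; keep whatever options are already present on your system):

GRUB_CMDLINE_LINUX="quiet splash rdblacklist=nouveau nouveau.modeset=0"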

Once this is done, the machine must be rebooted and the installation attempted again.

 

Graphical Interface Shutdown

Exit the GUI if you are in a GUI environment by pressing Ctrl-Alt-Backspace. Some distributions require you to press this sequence twice in a row; others have disabled it altogether in favor of a command such as sudo service lightdm stop. Still others require changing the system runlevel using a command such as /sbin/init 3. Consult your distribution’s documentation to find out how to properly exit the GUI. This step is only required in the event that you want to install the NVIDIA Display Driver included in the standalone installer.

 

NVIDIA Driver RPM/Deb package uninstallation

If you want to install the NVIDIA Display Driver included in the standalone installer, any previous driver installed through RPM or DEB packages MUST be uninstalled first. Such an installation may be part of the default installation of your Linux distribution, or it could have been installed as part of the package installation described in the previous section. To uninstall a DEB package, use sudo apt-get --purge remove package_name or equivalent. To uninstall an RPM package, use sudo yum remove package_name or equivalent.
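For example, on Ubuntu you might first list any installed NVIDIA driver packages and then purge them (nvidia-current is only an illustrative package name; check what is actually installed on your machine):

$ dpkg -l | grep -i nvidia
$ sudo apt-get --purge remove nvidia-current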

 

Installation

To install any combination of the driver, toolkit, and the samples, simply execute the .run script. The installation of the driver requires the script to be run with root privileges. Depending on the target location, the toolkit and samples installations may also require root privileges.

Shut down the graphical interface

Ubuntu uses LightDM, so you need to stop this service:

$ sudo service lightdm stop

Press Alt+F1 to switch to a text terminal (from within X, use Ctrl+Alt+F1) and log in there.

Run the Installer

Go to (using cd) the directory where you have the CUDA installer (a file with *.run extension) and type the following:

$ sudo chmod +x *.run
$ sudo ./*.run

By default, the toolkit and the samples will install under /usr/local/cuda-6.0 and $(HOME)/NVIDIA_CUDA-6.0_Samples, respectively. In addition, a symbolic link is created from /usr/local/cuda to /usr/local/cuda-6.0. The symbolic link is created in order for existing projects to automatically make use of the newly installed CUDA Toolkit.

If the target system includes both an integrated GPU (iGPU) and a discrete GPU (dGPU), the --no-opengl-libs option must be used. Otherwise, the OpenGL library used by the graphics driver of the iGPU will be overwritten and the GUI will not work. In addition, the xorg.conf update at the end of the installation must be declined.
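On such a hybrid system the installer invocation might look like this (hypothetical filename; use the runfile you downloaded):

$ sudo ./cuda_6.0.37_linux_64.run --no-opengl-libs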

Note: Installing Mesa may overwrite the /usr/lib/libGL.so that was previously installed by the NVIDIA driver, so a reinstallation of the NVIDIA driver might be required after installing these libraries.

 

Environment Setup

The PATH variable needs to include /usr/local/cuda-6.0/bin

The LD_LIBRARY_PATH variable needs to contain /usr/local/cuda-6.0/lib on a 32-bit system, and /usr/local/cuda-6.0/lib64 on a 64-bit system

  • To change the environment variables for 32-bit operating systems:

    $ export PATH=/usr/local/cuda-6.0/bin:$PATH
    $ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH
  • To change the environment variables for 64-bit operating systems:

    $ export PATH=/usr/local/cuda-6.0/bin:$PATH
    $ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64:$LD_LIBRARY_PATH
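To make these variables persistent across sessions, one common approach (an assumption about your shell setup, not something the installer requires) is to append the exports to ~/.bashrc; the 64-bit case is shown:

$ echo 'export PATH=/usr/local/cuda-6.0/bin:$PATH' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc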

Verifications

Check that the device files /dev/nvidia* exist and have the correct (0666) file permissions. These files are used by the CUDA Driver to communicate with the kernel-mode portion of the NVIDIA Driver. Applications that use the NVIDIA driver, such as a CUDA application or the X server (if any), will normally automatically create these files if they are missing using the setuid nvidia-modprobe tool that is bundled with the NVIDIA Driver. Some systems disallow setuid binaries, however, so if these files do not exist, you can create them manually either by running the command nvidia-smi as root at boot time or by using a startup script such as the one below:

#!/bin/bash

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255

else
  exit 1
fi

/sbin/modprobe nvidia-uvm

if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`

  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi

 

Graphical Interface Restart

Restart the GUI environment using the command startx, init 5, sudo service lightdm start, or the equivalent command on your system.

 

 

Post-installation Actions

Some actions must be taken after installing the CUDA Toolkit and Driver before they can be completely used:

  • Set up environment variables.
  • Install a writable copy of the CUDA Samples.
  • Verify the installation.

Environment Setup

Set the PATH and LD_LIBRARY_PATH variables as described in the Environment Setup section above.

(Optional) Install Writable Samples

In order to modify, compile, and run the samples, the samples must be installed with write permissions. A convenience installation script is provided:

$ cuda-install-samples-6.0.sh <dir>

This script is installed with the cuda-samples-60 package. The cuda-samples-60 package installs only a read-only copy in /usr/local/cuda-6.0/samples.
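For example, to place a writable copy of the samples in your home directory:

$ cuda-install-samples-6.0.sh ~
$ cd ~/NVIDIA_CUDA-6.0_Samples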

 Verify the Installation

Before continuing, it is important to verify that the CUDA toolkit can find and communicate correctly with the CUDA-capable hardware. To do this, you need to compile and run some of the included sample programs.

Note: Ensure the PATH and LD_LIBRARY_PATH variables are set correctly.

Verify the Driver Version

If you installed the driver, verify that the correct version of it is installed.

This can be done through your System Properties (or equivalent) or by executing the command

cat /proc/driver/nvidia/version

Note that this command will not work on an iGPU/dGPU system.

Compiling the Examples

The version of the CUDA Toolkit can be checked by running nvcc -V in a terminal window. The nvcc command runs the compiler driver that compiles CUDA programs. It calls the gcc compiler for C code and the NVIDIA PTX compiler for the CUDA code.

The NVIDIA CUDA Toolkit includes sample programs in source form. You should compile them by changing to ~/NVIDIA_CUDA-6.0_Samples and typing make. The resulting binaries will be placed under ~/NVIDIA_CUDA-6.0_Samples/bin.
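For example:

$ cd ~/NVIDIA_CUDA-6.0_Samples
$ make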

Running the Binaries

After compilation, find and run deviceQuery under ~/NVIDIA_CUDA-6.0_Samples. If the CUDA software is installed and configured correctly, the output for deviceQuery should look similar to that shown in Figure 1.
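For example (the exact sub-path under the samples directory can vary by toolkit version and architecture, so this is a sketch):

$ cd ~/NVIDIA_CUDA-6.0_Samples
$ ./bin/x86_64/linux/release/deviceQuery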

Figure 1. Valid Results from deviceQuery CUDA Sample


 

The exact appearance and the output lines might be different on your system. The important outcomes are that a device was found (the first highlighted line), that the device matches the one on your system (the second highlighted line), and that the test passed (the final highlighted line).

If a CUDA-capable device and the CUDA Driver are installed but deviceQuery reports that no CUDA-capable devices are present, this likely means that the /dev/nvidia* files are missing or have the wrong permissions.

On systems where SELinux is enabled, you might need to temporarily disable this security feature to run deviceQuery. To do this, type:

# setenforce 0

from the command line as the superuser.

Running the bandwidthTest program ensures that the system and the CUDA-capable device are able to communicate correctly. Its output is shown in Figure 2.

Figure 2. Valid Results from bandwidthTest CUDA Sample


 

Note that the measurements for your CUDA-capable device description will vary from system to system. The important point is that you obtain measurements, and that the second-to-last line (in Figure 2) confirms that all necessary tests passed.

Should the tests not pass, make sure you have a CUDA-capable NVIDIA GPU on your system and make sure it is properly installed.

If you run into difficulties with the link step (such as libraries not being found), consult the Linux Release Notes found in the doc folder in the CUDA Samples directory.

Additional Considerations

Now that you have CUDA-capable hardware and the NVIDIA CUDA Toolkit installed, you can examine and enjoy the numerous included programs. To begin using CUDA to accelerate the performance of your own applications, consult the CUDA C Programming Guide, located in /usr/local/cuda-6.0/doc.

A number of helpful development tools are included in the CUDA Toolkit to assist you as you develop your CUDA programs, such as NVIDIA® Nsight™ Eclipse Edition, NVIDIA Visual Profiler, cuda-gdb, and cuda-memcheck.

For technical support on programming questions, consult and participate in the developer forums at http://developer.nvidia.com/cuda/.

 


Google Cloud SQL is now Generally Available

Posted by Hemprasad Y. Badgujar on February 19, 2014


Google Cloud SQL is now Generally Available
Cloud SQL
Google Cloud SQL is a fully managed MySQL service hosted on Google Cloud Platform, providing a database backbone for applications running on Google App Engine or Google Compute Engine. Cloud SQL is now Generally Available and features encryption of customer data, a 99.95% uptime SLA, and support for databases up to 500GB in size. Read more or get started
NEWS & ANNOUNCEMENTS
Big Data processing with the Google Cloud Storage Connector for Hadoop
Compute Engine VMs provide a fast and reliable way to run Apache Hadoop. The new Cloud Storage connector for Hadoop lets you focus on your data processing logic instead of managing a cluster and file system. Read more or get started
Google Cloud Platform provides support for HIPAA Covered Entities
To serve developers who want to build healthcare-related applications that comply with the Health Insurance Portability and Accountability Act (HIPAA), we’re announcing support for Business Associates Agreements. Read more or contact us
Join us on March 25th for Google Cloud Platform Live
Urs Hölzle, Senior Vice President for Technical Infrastructure, and the engineering leadership of Google Cloud Platform will be hosting a day-long developer summit. We’ll announce a number of new products, features and updates to Google Cloud Platform and showcase the investments we’re making in building the world’s best public cloud. Join us in San Francisco, New York, Seattle, or on YouTube to catch the action. Read more
TIPS & TRICKS
Running Docker on Compute Engine
Get Programmatic Access to your Billing Data With the New Billing API
Performance advantages of the new Google Cloud Storage connector for Hadoop
Large Akka cluster on Google Compute Engine
Analyzing detailed web analytics data with Google Analytics and Google BigQuery
BigQuery, Python, Pandas and R for data science
PHP App Engine apps and file system concepts
A better way to explore and learn on Github
Cheers,
The Google Cloud Platform Team


Free Screen Recording Softwares For Windows

Posted by Hemprasad Y. Badgujar on February 5, 2014


Free programs at times come with drawbacks or limitations, and watermarks are common in many free screen recording programs. However, the following programs have been tested to have no watermarks, and they export to file formats recognizable by most video editing software.

1. Ezvid

Ezvid is a screen recorder program that comes with an in-built video editor where you can split your recordings and add text in between two clips, creating a slideshow effect. There is no way to export the video you recorded. However, you can upload the video to YouTube through the program itself.

Ezvid Program

For gamers, there’s an option to enable ‘Gaming mode’ where it records the windowed mode of your game. The program comes with a few music clips. However, if you decide to have no music, it will be replaced with a ‘Silent machine’, which sounds like a small computer fan. You can add in other pictures and video clips, as well as add in your voice, after you’re done editing your clip.

2. BlueBerry FlashBack Express Recorder

BB (short for BlueBerry) FlashBack Express Recorder lets you use your webcam to record yourself while recording the activities that are happening on your desktop. After you have stopped recording, it creates an FBR file which can be edited with its packaged video editor.

Flashback Express

If you did not enable your webcam, you can skip the video editor program and export it to the AVI file format right away. Otherwise, you can use the software to position and resize your webcam box before exporting it. Although it requires you to register (for a free account) after 30 days of usage, it still provides you with all of its functions before you register.

3. Screenr

Screenr is an interesting way to share a screencast (recording of your screen) online without installing a program on your computer; it requires Java to work. You select an area on your screen which you want to record (max: 5 minutes). All recordings are saved into your account.

Screenr Webpage

After recording you’ll be given a link which you can share. You can also export your video to MP4 or upload it on YouTube.

Screenr also provides a bookmarklet so you can record without going to the website. Register with your Facebook, Twitter, Google, LinkedIn, Yahoo or Windows Live account in order to use Screenr for free.

4. Rylstim Screen Recorder

Rylstim just records your screen after you hit the ‘Start Record’ button. This will be useful for people who do not want to configure anything and just want a basic recorder. This program does not record sound from input devices like a microphone.

Rylstim Program

The only options available determine if you want to show your left or right mouse button clicks in the video. If you enable the mouse click options, a red ripple will appear at your cursor when you left click and a green ripple appears for right clicks. This mouse click ripple effect will only be visible when you view your recording.

5. CamStudio

CamStudio comes equipped with many options to tweak the way it records. There are options to enable or disable your mouse cursor, record sounds from programs or a microphone (or have no sound at all) and the option to enable custom screen annotations.

CamStudio Program

You can also choose to record a particular area on your screen or a program window so that the rest of your desktop isn’t showing on the recording. It can record at different frame rates; for example, 1 FPS (frames per second) to create a time-lapse video effect, or 30 FPS for a smooth video.

More!

Webinaria

Webinaria is another easy-to-use screen recording software with basic options for your recording needs. It can record your entire screen, a program window, or a custom selection. You can choose from 3 frame rate options: 5, 10 and 15 FPS (frames per second).

Webinaria Program

Videos are produced in AVI file format. If you were using Google Chrome before running Webinaria, Google Chrome will be detected as the program window for recording.

DVDVideoSoft Free Screen Video Recorder

This program has a simple user interface carrying 9 icons. The first four are for screen capture, the next four handle screen recording, and the last icon opens the program’s options. It also exports its video to an AVI file format. What’s good about this program is its automatic file naming options.

DVD Video Soft

It gives you options to include the specific date and time in the filename of your recorded video. Users who need to keep track of many screen recordings back to back will find this automatic file naming option useful.

Krut Computer Recorder

Krut does not require installation but uses Java to run. After downloading, you’ll have a folder where you have to run the ‘KRUT.jar’ file to get started. The capture area and recording frames per second can be set by the user.

Krut Computer Recorder

There is an option called ‘Follow Mouse’ where it captures the area around your mouse, wherever it moves to. When using the ‘Follow Mouse’ feature, you can enable preview mode to see the captured area as you record. This program outputs three types of files: the WAV file contains only the recorded audio, while of the two MOV files, one has no audio and the other has both the audio and video of the recording.


Directory of Open Source Broadcasting Projects

Posted by Hemprasad Y. Badgujar on January 3, 2014


Directory of Open Source Broadcasting Projects

Audio production

Graphics / CG

Newsroom

Open Source Hardware

Radio Automation

Recording

Streaming

Video Play-out

Video production

Video Transcoding


How to Install Apache Tomcat

Posted by Hemprasad Y. Badgujar on December 2, 2013


How to Install Apache Tomcat

(on Windows, Mac, Ubuntu)

 

Get Started with Java Servlet Programming

This tutorial can be completed in a 3-hour session.

This installation and configuration guide is applicable to Tomcat 7/8, and possibly the earlier versions.

Take note that Tomcat 8 requires JDK 1.7. It will NOT work with JDK 1.6.

1.  Introduction

1.1  Web Application (Webapp)

A web application (or webapp), unlike a standalone application, runs over the Internet. Examples of webapps are Google, Amazon, eBay, Facebook and Twitter.

A webapp is typically a 3-tier (or multi-tier) client-server database application run over the Internet, as illustrated in the diagram below. It comprises five components:

  1. HTTP Server: E.g., Apache HTTP Server, Apache Tomcat Server, Microsoft Internet Information Server (IIS), nginx, Google Web Server (GWS), and others.
  2. HTTP Client (or Web Browser): E.g., Internet Explorer (MSIE), FireFox, Chrome, Safari, and others.
  3. Database: E.g., Open-source MySQL, Apache Derby, mSQL, SQLite, PostgreSQL, OpenOffice’s Base; Commercial Oracle, IBM DB2, SAP SyBase, MS SQL Server, MS Access; and others.
  4. Client-Side Programs: could be written in HTML Form, JavaScript, VBScript, Flash, and others.
  5. Server-Side Programs: could be written in Java Servlet/JSP, ASP, PHP, Perl, Python, CGI, and others.

The typical use case is:

  1. A user, via a web browser (HTTP client), issues a URL request to an HTTP server to start a webapp.
  2. A client-side program (such as an HTML form) is loaded into client’s browser.
  3. The user fills up the query criteria in the form.
  4. The client-side program sends the query parameters to a server-side program.
  5. The server-side program receives the query parameters, queries the database and returns the query result to the client.
  6. The client-side program displays the query result on the browser.
  7. The process repeats.

1.2  Hypertext Transfer Protocol (HTTP)

  • HTTP is an application-layer protocol that runs over TCP/IP. IP provides support for routing and addressing (via a unique IP address for each machine on the Internet), while TCP supports multiplexing via 64K ports, from port number 0 to 65535. The default port number assigned to HTTP is TCP port 80.
  • HTTP is an asynchronous request-response application-layer protocol. A client sends a request message to the server. The server then returns a response message to the client.
  • HTTP is a pull protocol: a client pulls a page from the server (instead of the server pushing pages to the clients).
  • The syntax of the message is defined in the HTTP specification.

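As an illustration, a minimal HTTP GET request and a possible response might look like the following (a sketch of the message format only; actual headers vary by server):

GET /index.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 40

<html><body><h1>Hello</h1></body></html>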

1.3  Apache Tomcat HTTP Server

Apache Tomcat is a Java-capable HTTP server, which can execute special Java programs known as Java Servlets and JavaServer Pages (JSP). It is the official Reference Implementation (RI) for the Java Servlet and JavaServer Pages (JSP) technologies. Tomcat is an open-source project of the “Apache Software Foundation” (which also provides the widely used, open-source, industrial-strength Apache HTTP Server). The mother site for Tomcat is http://tomcat.apache.org. Alternatively, you can find Tomcat via the Apache mother site @ http://www.apache.org.

Tomcat was originally written by James Duncan Davidson (then working at Sun) in 1998, based on an earlier Sun server called Java Web Server (JWS). It began at version 3.0, after the JSWDK 2.1 release that it replaced. Sun subsequently made Tomcat open-source and gave it to Apache.

The various Tomcat releases are:

  1. Tomcat 3.x (1999): RI for Servlet 2.2 and JSP 1.1.
  2. Tomcat 4.x (2001): RI for Servlet 2.3 and JSP 1.2.
  3. Tomcat 5.x (2002): RI for Servlet 2.4 and JSP 2.0.
  4. Tomcat 6.x (2006): RI for Servlet 2.5 and JSP 2.1.
  5. Tomcat 7.x (2010): RI for Servlet 3.0, JSP 2.2 and EL 2.2.
  6. Tomcat 8.x (2013): RI for Servlet 3.1, JSP 2.3, EL 3.0 and Java WebSocket 1.0.

Tomcat is an HTTP application that runs over TCP/IP. In other words, the Tomcat server runs on a specific TCP port at a specific IP address. The default TCP port number for the HTTP protocol is 80, which is used for the production HTTP server. For a test HTTP server, you can choose any unused port number between 1024 and 65535; port numbers 1-1023 are reserved.

2.  How to Install Tomcat 7 and Get Started with Java Servlet Programming

2.1  STEP 1: Download and Install Tomcat

For Windows

  1. Goto http://tomcat.apache.org ⇒ Downloads ⇒ Tomcat 8.0 ⇒ “8.0.{xx}” (where {xx} is the latest upgrade number) ⇒ Binary Distributions ⇒ Core ⇒ “zip” package (e.g., “apache-tomcat-8.0.{xx}.zip“, about 8 MB).
  2. UNZIP into a directory of your choice. DO NOT unzip onto the Desktop (because its path is hard to locate). I suggest using “d:\myproject“. Tomcat will be unzipped into directory “d:\myproject\apache-tomcat-8.0.{xx}“. For ease of use, we shall shorten and rename this directory to “d:\myproject\tomcat“. Take note of Your Tomcat Installed Directory. Hereafter, I shall refer to the Tomcat installed directory as <TOMCAT_HOME> (or <CATALINA_HOME> – “Catalina” is the codename for Tomcat 5 and above).

(Advanced) A better approach is to keep the original directory name, such as apache-tomcat-8.0.{xx}, but create a symlink called tomcat via command “mklink /D tomcat apache-tomcat-8.0.{xx}“. Symlink is available in Windows Vista/7/8 only.

For Mac

  1. Goto http://tomcat.apache.org ⇒ Download ⇒ Tomcat 8.0 ⇒ “8.0.{xx}” (where {xx} denotes the latest upgrade number) ⇒ Binary distribution ⇒ Core ⇒ “tar.gz” package (e.g., “apache-tomcat-8.0.{xx}.tar.gz“, about 8 MB).
  2. To install Tomcat:
    1. Goto “~/Downloads“, double-click the downloaded tarball (e.g., “apache-tomcat-8.0.{xx}.tar.gz“) to expand it into a folder (e.g., “apache-tomcat-8.0.{xx}“).
    2. Move the extracted folder (e.g., “apache-tomcat-8.0.{xx}“) to “/Applications“.
    3. Rename the folder to “tomcat”, for ease of use. Take note of Your Tomcat Installed Directory. Hereafter, I shall refer to the Tomcat installed directory as <TOMCAT_HOME> (or <CATALINA_HOME> – “Catalina” is the codename for Tomcat 5 and above).

For Ubuntu

Read “How to Install Tomcat 7 on Ubuntu“. You need to switch between these two articles.

For academic learning, I recommend “zip” (or “tar.gz“) version, as you could simply delete the entire directory when Tomcat is no longer needed (without running any un-installer). You are free to move or rename the Tomcat’s installed directory. You can install (unzip) multiple copies of Tomcat in the same machine. For production, it is easier to use the installer to properly configure the Tomcat.

Tomcat’s Directories

Take a quick look at the Tomcat installed directory. It contains the following sub-directories:

  • bin: contains the binaries; and startup script (startup.bat for Windows and startup.sh for Unix and Mac), shutdown script (shutdown.bat for Windows and shutdown.sh for Unix and Mac), and other binaries and scripts.
  • conf: contains the system-wide configuration files, such as server.xml, web.xml, context.xml, and tomcat-users.xml.
  • lib: contains the Tomcat’s system-wide JAR files, accessible by all webapps. You could also place an external JAR file (such as the MySQL JDBC Driver) here.
  • logs: contains Tomcat’s log files. You may need to check for error messages here.
  • webapps: contains the webapps to be deployed. You can also place the WAR (Webapp Archive) file for deployment here.
  • work: Tomcat’s working directory used by JSP, for JSP-to-Servlet conversion.
  • temp: Temporary files.

2.2  STEP 2: Create an Environment Variable JAVA_HOME

For Windows

You need to create an environment variable called “JAVA_HOME” and set it to your JDK installed directory.

  1. First, take note of your JDK installed directory. The default is “c:\Program Files\Java\jdk1.7.0_{xx}“, where {xx} is the latest upgrade number. It is important to verify your JDK installed directory, via the “Computer”, before you proceed further.
  2. Start a CMD shell, and issue the command “set JAVA_HOME” to check if variable JAVA_HOME has been set:
    > set JAVA_HOME
    Environment variable JAVA_HOME not defined

    If JAVA_HOME is set, check if it is set to your JDK installed directory correctly. Otherwise, goto next step.

  3. To set the environment variable JAVA_HOME in Windows 2000/XP/Vista/7/8: Push “Start” button ⇒ Control Panel ⇒ System ⇒ (Vista/7/8) Advanced system settings ⇒ Switch to “Advanced” tab ⇒ Environment Variables ⇒ System Variables ⇒ “New” (or “Edit” for modification) ⇒ In “Variable Name”, enter “JAVA_HOME” ⇒ In “Variable Value”, enter your JDK installed directory (e.g., “c:\Program Files\Java\jdk1.7.0_{xx}“).
  4. To verify, RE-START a CMD shell (need to refresh the environment) and issue:
    > set JAVA_HOME
    JAVA_HOME=c:\Program Files\Java\jdk1.7.0_{xx}   <== Verify that this is YOUR JDK installed directory

For Mac

Skip this step. No need to do anything.

2.3  STEP 3: Configure Tomcat Server

The Tomcat configuration files are located in the “conf” sub-directory of your Tomcat installed directory, e.g. “d:\myproject\tomcat\conf” (for Windows) or “/Applications/tomcat/conf” (for Mac). There are 4 configuration XML files:

  1. server.xml
  2. web.xml
  3. context.xml
  4. tomcat-users.xml

Make a BACKUP of the configuration files before you proceed.

Step 3(a) “conf\server.xml” – Set the TCP Port Number

Use a programming text editor (e.g., NotePad++, TextPad for Windows; or gEdit, jEdit, TextEdit for Mac) to open the configuration file “server.xml“, under the “conf” sub-directory of Tomcat installed directory.

The default TCP port number configured in Tomcat is 8080, you may choose any number between 1024 and 65535, which is not used by an existing application. We shall choose 9999 in this article. (For production server, you should use port 80, which is pre-assigned to HTTP server as the default port number.)

Locate the following lines, and change port="8080" to port="9999".

<!-- A "Connector" represents an endpoint by which requests are received
      and responses are returned. Documentation at :
      Java HTTP Connector: /docs/config/http.html (blocking & non-blocking)
      Java AJP  Connector: /docs/config/ajp.html
      APR (HTTP/AJP) Connector: /docs/apr.html
      Define a non-SSL HTTP/1.1 Connector on port 8080
-->
<Connector port="9999" protocol="HTTP/1.1"
      connectionTimeout="20000"
      redirectPort="8443" />
Step 3(b) “conf\web.xml” – Enabling Directory Listing

Again, use a programming text editor to open the configuration file “web.xml“, under the “conf” sub-directory of Tomcat installed directory.

We shall enable directory listing by changing “listings” from “false” to “true” for the “default” servlet. This is handy for test system, but not for production system for security reasons.

Locate the following lines and change from “false” to “true“.

<!-- The default servlet for all web applications, that serves static     -->
<!-- resources.  It processes all requests that are not mapped to other   -->
<!-- servlets with servlet mappings.                                      -->
<servlet>
  <servlet-name>default</servlet-name>
  <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
  <init-param>
    <param-name>debug</param-name>
    <param-value>0</param-value>
  </init-param>
  <init-param>
    <param-name>listings</param-name>
    <param-value>true</param-value>
  </init-param>
  <load-on-startup>1</load-on-startup>
</servlet>
Step 3(c) “conf\context.xml” – Enabling Automatic Reload

We shall add the attribute reloadable="true" to the <Context> element to enable automatic reload after code changes. Again, this is handy for test system but not for production, due to the overhead of detecting changes.

Locate the <Context> start element, and change it to <Context reloadable="true">.

<Context reloadable="true">
   ......
</Context>
Step 3(d) (Optional) “conf\tomcat-users.xml”

Enable Tomcat’s manager by adding the following lines inside the <tomcat-users> element:

<tomcat-users>
  <role rolename="manager-gui"/>
  <user username="manager" password="xxxx" roles="manager-gui"/>
</tomcat-users>

This enables the manager GUI app for managing Tomcat server.

2.4  STEP 4: Start Tomcat Server

The Tomcat’s executable programs and scripts are kept in the “bin” sub-directory of the Tomcat installed directory, e.g., “d:\myproject\tomcat\bin” (for Windows) or “/Applications/tomcat/bin” (for Mac).

Step 4(a) Start Server

For Windows

Launch a CMD shell. Set the current directory to “<TOMCAT_HOME>\bin“, and run “startup.bat” as follows:

// Change the current directory to Tomcat's "bin"
// Assume that Tomcat is installed in "d:\myproject\tomcat"
> d:                           // Change the current drive
d:\> cd \myproject\tomcat\bin  // Change Directory to YOUR Tomcat's "bin" directory

// Start Tomcat Server
D:\myproject\tomcat\bin> startup

For Mac

I assume that Tomcat is installed in “/Applications/tomcat“. To start the Tomcat server, open a new “Terminal” and issue:

// Change current directory to Tomcat's binary directory
$ cd /Applications/tomcat/bin

// Start tomcat server
$ ./catalina.sh run

A new Tomcat console window appears. Study the messages on the console. Look out for Tomcat’s port number (double-check that Tomcat is running on port 9999). Future error messages will be sent to this console. System.out.println() issued by your Java servlets will also be sent to this console.

......
......
xxx xx, xxxx x:xx:xx xx org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-9999"]
xxx xx, xxxx x:xx:xx xx org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-8009"]
xxx xx, xxxx x:xx:xx xx org.apache.catalina.startup.Catalina start
INFO: Server startup in 2477 ms

(Skip Unless …) Cannot Start Tomcat: Read “How to Debug“.

Step 4(b) Start a Client to Access the Server

Start a browser (as HTTP client). Issue URL “http://localhost:9999” to access the Tomcat server’s welcome page. The hostname “localhost” (with IP address of 127.0.0.1) is meant for local loop-back testing inside the same machine. For users on the other machines over the net, they have to use the server’s IP address or DNS domain name or hostname in the format of “http://serverHostnameOrIPAddress:9999“.
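You can also confirm from a terminal that the server is listening, assuming curl is available on your system:

$ curl -I http://localhost:9999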

Try issuing URL http://localhost:9999/examples to view the servlet and JSP examples. Try running some of the servlet examples.

(Optional) Try issuing URL http://localhost:9999/manager/html to run the Tomcat Web Manager. Enter the username and password configured earlier in tomcat-users.xml.

Step 4(c) Shutdown Server

For Windows

You can shutdown the tomcat server by either:

  1. Press ctrl-c on the Tomcat console; or
  2. Run “<TOMCAT_HOME>\bin\shutdown.bat” script:
    // Change the current directory to Tomcat's "bin"
    > d:                           // Change the current drive
    d:\> cd \myproject\tomcat\bin  // Change Directory to YOUR Tomcat's "bin" directory
    
    // Shutdown the server
    d:\myproject\tomcat\bin> shutdown

For Mac

To shutdown the Tomcat server:

  1. Press control-c (NOT command-c); or
  2. Run the “<TOMCAT_HOME>/bin/shutdown.sh” script. Open a new “Terminal” and issue:
    // Change current directory to Tomcat's bin directory
    $ cd /Applications/tomcat/bin
    
    // Shutdown the server
    $ ./shutdown.sh

WARNING: You MUST properly shutdown the Tomcat. DO NOT kill the cat by pushing the window’s “CLOSE” button.

2.5  STEP 5: Develop and Deploy a WebApp

Step 5(a) Create the Directory Structure for your WebApp

First of all, choose a name for your webapp. Let’s call it “hello“. Goto Tomcat’s “webapps” sub-directory. Create the following directory structure for your webapp “hello” (as illustrated):

  1. Under Tomcat’s “webapps“, create your webapp root directory “hello” (i.e., “<TOMCAT_HOME>\webapps\hello“).
  2. Under “hello“, create a sub-directory “WEB-INF” (case sensitive, a “dash” not an underscore) (i.e., “<TOMCAT_HOME>\webapps\hello\WEB-INF“).
  3. Under “WEB-INF“, create a sub-sub-directory “classes” (case sensitive, plural) (i.e., “<TOMCAT_HOME>\webapps\hello\WEB-INF\classes“).

You need to keep your web resources (e.g., HTMLs, CSSs, images, scripts, servlets, JSPs) in the proper directories:

  • “hello“: This is called the context root (or document base directory) of your webapp. You should keep all your HTML files and resources visible to the web users (e.g., HTMLs, CSSs, images, scripts, JSPs) under this context root.
  • hello\WEB-INF“: This directory, although under the context root, is not visible to the web users. This is where you keep your application’s web descriptor file “web.xml“.
  • hello\WEB-INF\classes“: This is where you keep all the Java classes such as servlet class-files.

You should RE-START your Tomcat server to pick up the hello webapp. Check the Tomcat’s console to confirm that the “hello” application has been properly deployed:

......
INFO: Deploying web application directory D:\myproject\tomcat\webapps\hello
......

You can issue the following URL to access the web application “hello“:

http://localhost:9999/hello

You should see the directory listing of the directory “<TOMCAT_HOME>\webapps\hello“, which shall be empty (provided you have enabled directory listing in web.xml earlier).

Step 5(b) Write a Welcome Page

Create the following HTML page and save as “HelloHome.html” in your application’s root directory “hello“.

<html>
  <head><title>My Home Page</title></head>
  <body>
    <h1>My Name is so and so. This is my HOME.</h1>
  </body>
</html>

You can browse this page by issuing this URL:

http://localhost:9999/hello/HelloHome.html

Alternatively, you can issue a URL to your web application root “hello“:

http://localhost:9999/hello

The server will return the directory listing of your base directory. You can then click on “HelloHome.html“.

Rename “HelloHome.html” to “index.html“, and issue a directory request again:

http://localhost:9999/hello

Now, the server will redirect the directory request to “index.html“, if the root directory contains an “index.html“, instead of serving the directory listing.

You can check out the home page of your peers by issuing:

http://YourPeerHostnameOrIPAddress:9999/hello
http://YourPeerHostnameOrIPAddress:9999/hello/HelloHome.html
http://YourPeerHostnameOrIPAddress:9999/hello/index.html

with a valid “YourPeerHostnameOrIPAddress“, provided that your peer has started his tomcat server and his firewall does not block your access. You can use command such as “ipconfig“, “winipcfg“, “ping” to find the IP address.
(Skip Unless…) The likely errors are “Unable to Connect”, “Internet Explorer cannot display the web page“, and “404 File Not Found”. Read “How to Debug” section.

2.6  STEP 6: Write a “Hello-world” Java Servlet

A servlet is a Java program that runs inside a Java-capable HTTP server, such as Apache Tomcat. A web user invokes a servlet by issuing an appropriate URL from a web browser (HTTP client).

Before you proceed, I shall assume that you are familiar with Java programming and have installed the following:

  1. JDK (Read “How to install JDK and Get Started“).
  2. A programming text editor, such as TextPad or Notepad++ (Read “Programming Text Editor“); or a Java IDE such as Eclipse or NetBeans (Read “How to Install Eclipse” or “How to Install NetBeans“).
Step 6(a) Install Servlet API Library

Before we can write our first servlet, we need to install the Servlet API. The Servlet API is not part of the JDK or Java SE (it belongs to Java EE). Tomcat provides a copy of the Servlet API.

For Windows

COPY the Tomcat’s Servlet API jar-file located at “<TOMCAT_HOME>\lib\servlet-api.jar“, (e.g., “d:\myproject\tomcat\lib\servlet-api.jar“) into JDK’s extension directory at “<JAVA_HOME>\jre\lib\ext“, (e.g., “c:\Program Files\Java\jdk1.7.0\jre\lib\ext“).

For Mac

COPY the Servlet API jar-file (“servlet-api.jar“) from “/Applications/tomcat/lib” to the JDK’s extension directory at “/Library/Java/Extension“.

(For Advanced Users Only) Alternatively, you could include the Servlet API jar-file in the CLASSPATH environment variable, or in the javac/java command-line option -cp <classpaths>.

Step 6(b) Write a “Hello-world” Java Servlet

A Java servlet is a Java program that runs inside a HTTP server. A web user invokes a servlet by issuing a URL from a browser (or HTTP client).

In this example, we are going to write a Java servlet called HelloServlet, which says “Hello, world!”. We will then write a configuration such that web users can invoke this servlet by issuing URL http://hostname:port/hello/sayhello from their browser, as illustrated:

Write the following source code, called “HelloServlet.java“, and save it under your application’s “classes” directory (i.e., “<TOMCAT_HOME>\webapps\hello\WEB-INF\classes\HelloServlet.java“). This servlet says “Hello”, echoes some request information, and prints a random number upon each request.

// To save as "<TOMCAT_HOME>\webapps\hello\WEB-INF\classes\HelloServlet.java"
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloServlet extends HttpServlet {
   @Override
   public void doGet(HttpServletRequest request, HttpServletResponse response)
         throws IOException, ServletException {

      // Set the response MIME type of the response message
      response.setContentType("text/html");
      // Allocate a output writer to write the response message into the network socket
      PrintWriter out = response.getWriter();

      // Write the response message, in an HTML page
      try {
         out.println("<html>");
         out.println("<head><title>Hello, World</title></head>");
         out.println("<body>");
         out.println("<h1>Hello, world!</h1>");  // says Hello
         // Echo client's request information
         out.println("<p>Request URI: " + request.getRequestURI() + "</p>");
         out.println("<p>Protocol: " + request.getProtocol() + "</p>");
         out.println("<p>PathInfo: " + request.getPathInfo() + "</p>");
         out.println("<p>Remote Address: " + request.getRemoteAddr() + "</p>");
         // Generate a random number upon each request
         out.println("<p>A Random Number: <strong>" + Math.random() + "</strong></p>");
         out.println("</body></html>");
      } finally {
         out.close();  // Always close the output writer
      }
   }
}

Compile the source “HelloServlet.java” into “HelloServlet.class“:

> cd [Path-to-the-source-file]
> javac HelloServlet.java

(Skip Unless…) Read “Common Errors in Compiling Java Servlet“.
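If you did not copy servlet-api.jar into the JDK extension directory in Step 6(a), you can instead point the compiler at Tomcat’s copy via the -cp option (a sketch; adjust the Tomcat path to your installation):

// For Windows
> javac -cp .;d:\myproject\tomcat\lib\servlet-api.jar HelloServlet.java

// For Mac
$ javac -cp .:/Applications/tomcat/lib/servlet-api.jar HelloServlet.java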

Step 6(c) Configure Servlet’s Request URL in “webapps\hello\WEB-INF\web.xml”

A web user invokes a servlet, which is kept in the web server, by issuing a request URL from the browser. We need to configure this request URL for our HelloServlet.

Create the following configuration file called “web.xml“, and save it under “webapps\hello\WEB-INF” (i.e., “<TOMCAT_HOME>\webapps\hello\WEB-INF\web.xml“).

<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app version="3.0"
  xmlns="http://java.sun.com/xml/ns/javaee"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd">

   <!-- To save as "hello\WEB-INF\web.xml" -->

   <servlet>
      <servlet-name>HelloWorld</servlet-name>
      <servlet-class>HelloServlet</servlet-class>
   </servlet>

   <!-- Note: All <servlet> elements MUST be grouped together and
         placed IN FRONT of the <servlet-mapping> elements -->

   <servlet-mapping>
      <servlet-name>HelloWorld</servlet-name>
      <url-pattern>/sayhello</url-pattern>
   </servlet-mapping>
</web-app>

In the above configuration, a servlet having a class file “HelloServlet.class” is mapped to request URL “/sayhello” (via an arbitrary servlet-name “HelloWorld“), under this web application “hello“. In other words, the complete request URL for this servlet is “http://hostname:port/hello/sayhello“.

This configuration file, saved under your webapp “hello“, is applicable only to this particular webapp “hello“.

RESTART your Tomcat server to refresh the “web.xml” file.

IMPORTANT: For EACH servlet, you need to write a pair of <servlet> and <servlet-mapping> elements with a common but arbitrary <servlet-name>. Take note that all the <servlet> elements MUST be grouped together and placed IN FRONT of the <servlet-mapping> elements.

Step 6(d) Invoke the Servlet

To run this servlet, start a browser, and issue the request URL configured earlier:

http://localhost:9999/hello/sayhello

You shall see the output of the servlet displayed in your web browser.

Refresh the browser, and you shall see a new random number upon each refresh. In other words, the doGet() method of the servlet runs once per request.

Try “View Source” to look at the output received by the web users. Take note that the web users receive only the output of the servlet (generated via the out.println() statements). They have no access to the servlet programs (which may contain confidential information).

<html>
<head><title>Hello, World</title></head>
<body>
<h1>Hello, world!</h1>
<p>Request URI: /hello/sayhello</p>
<p>Protocol: HTTP/1.1</p>
<p>PathInfo: null</p>
<p>Remote Address: 127.0.0.1</p>
<p>A Random Number: <strong>0.3523682325749493</strong></p>
</body>
</html>

(Skip Unless…) The likely errors are “404 File Not Found” and “500 Internal Server Error”. Read “How to debug” Section.

2.7  STEP 7: Write a Database Servlet

This section assumes that you are familiar with “Java database programming” and “MySQL database server”. Otherwise, read “Java Database Program” and “How to Install MySQL 5 and Get Started“.

Step 7(a) Setup a Database on MySQL

Start your MySQL server. Take note of the server’s port number. I shall assume that the MySQL server is running on port 8888 (whereas the Tomcat is running on port 9999).

// For Windows
> d:
> cd \myproject\mysql\bin
> mysqld --console

// For Mac
$ cd /usr/local/mysql/bin
$ sudo ./mysqld_safe --console

Start a MySQL client. I shall assume that there is a user called “myuser” with password “xxxx“.

// For Windows
> d:
> cd \myproject\mysql\bin
> mysql -u myuser -p

// For Mac
$ cd /usr/local/mysql/bin
$ ./mysql -u myuser -p

Run the following SQL statements to create a database called “ebookshop“, with a table called “books” with 5 columns: id, title, author, price, qty.

create database if not exists ebookshop;

use ebookshop;

drop table if exists books;
create table books (
   id     int,
   title  varchar(50),
   author varchar(50),
   price  float,
   qty    int,
   primary key (id));

insert into books values (1001, 'Java for dummies', 'Tan Ah Teck', 11.11, 11);
insert into books values (1002, 'More Java for dummies', 'Tan Ah Teck', 22.22, 22);
insert into books values (1003, 'More Java for more dummies', 'Mohammad Ali', 33.33, 33);
insert into books values (1004, 'A Cup of Java', 'Kumar', 55.55, 55);
insert into books values (1005, 'A Teaspoon of Java', 'Kevin Jones', 66.66, 66);

select * from books;
Step 7(b) Install MySQL JDBC Driver

You need to download MySQL JDBC driver if you have not done so. Read “Installing the MySQL JDBC Driver“.

(For Advanced Users Only) You could also place the MySQL driver jar-file “mysql-connector-java-5.1.{xx}-bin.jar” in Tomcat’s “lib” directory.

Step 7(c) Write a Client-side HTML Form

Let’s write an HTML script to create a query form with 3 checkboxes and a submit button, as illustrated below.  Save the HTML file as “querybook.html” in your application root directory “<TOMCAT_HOME>\webapps\hello”.


<html>
<head>
  <title>Yet Another Bookshop</title>
</head>
<body>
  <h2>Yet Another Bookshop</h2>
  <form method="get" action="http://localhost:9999/hello/query">
    <b>Choose an author:</b>
    <input type="checkbox" name="author" value="Tan Ah Teck">Ah Teck
    <input type="checkbox" name="author" value="Mohammad Ali">Ali
    <input type="checkbox" name="author" value="Kumar">Kumar
    <input type="submit" value="Search">
  </form>
</body>
</html>

You can browse the HTML page by issuing the following URL:

http://localhost:9999/hello/querybook.html

Check a box (e.g., “Tan Ah Teck”) and click the “Search” button.  An HTTP GET request will be issued to the URL specified in the <form>‘s “action” attribute.  Observe the URL of the HTTP GET request:

http://localhost:9999/hello/query?author=Tan+Ah+Teck

The request consists of two parts: a URL corresponding to the “action” attribute of the <form> tag, and the “name=value” pair extracted from the <input> tag, separated by a '?'. Take note that blanks are replaced by '+' (or %20), because blanks are not allowed in the URL.

If you check two boxes (e.g., “Tan Ah Teck” and “Mohammad Ali”), you will get this URL, which has two “name=value” pairs separated by an '&'.

http://localhost:9999/hello/query?author=Tan+Ah+Teck&author=Mohammad+Ali

You are expected to get an error “404 File Not Found”, as you have yet to write the server-side program.

Step 7(d) Write the Server-side Database Query Servlet

The next step is to write a Java servlet, which responds to the client’s request by querying the database and returning the query results.

// To save as "<TOMCAT_HOME>\webapps\hello\WEB-INF\classes\QueryServlet.java".
import java.io.*;
import java.sql.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class QueryServlet extends HttpServlet {  // JDK 6 and above only

   // The doGet() runs once per HTTP GET request to this servlet.
   @Override
   public void doGet(HttpServletRequest request, HttpServletResponse response)
               throws ServletException, IOException {
      // Set the MIME type for the response message
      response.setContentType("text/html");
      // Get an output writer to write the response message into the network socket
      PrintWriter out = response.getWriter();

      Connection conn = null;
      Statement stmt = null;
      try {
         // Step 1: Allocate a database Connection object
         conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:8888/ebookshop", "myuser", "xxxx"); // <== Check!
            // database-URL(hostname, port, default database), username, password

         // Step 2: Allocate a Statement object within the Connection
         stmt = conn.createStatement();

         // Step 3: Execute a SQL SELECT query
         String sqlStr = "select * from books where author = "
              + "'" + request.getParameter("author") + "'"
              + " and qty > 0 order by price desc";

         // Print an HTML page as the output of the query
         out.println("<html><head><title>Query Response</title></head><body>");
         out.println("<h3>Thank you for your query.</h3>");
         out.println("<p>You query is: " + sqlStr + "</p>"); // Echo for debugging
         ResultSet rset = stmt.executeQuery(sqlStr);  // Send the query to the server

         // Step 4: Process the query result set
         int count = 0;
         while(rset.next()) {
            // Print a paragraph <p>...</p> for each record
            out.println("<p>" + rset.getString("author")
                 + ", " + rset.getString("title")
                 + ", $" + rset.getDouble("price") + "</p>");
            count++;
         }
         out.println("<p>==== " + count + " records found =====</p>");
         out.println("</body></html>");
     } catch (SQLException ex) {
        ex.printStackTrace();
     } finally {
        out.close();  // Close the output writer
        try {
           // Step 5: Close the resources
           if (stmt != null) stmt.close();
           if (conn != null) conn.close();
        } catch (SQLException ex) {
           ex.printStackTrace();
        }
     }
   }
}

Compile the source “QueryServlet.java” into “QueryServlet.class“.
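
One possible compile command, run from the "classes" directory (assuming the JDK's javac is on your PATH, and that the Servlet API jar is named "servlet-api.jar" as in Tomcat 6/7 – the jar name varies across Tomcat versions):

javac -cp .;<TOMCAT_HOME>\lib\servlet-api.jar QueryServlet.java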

Step 7(e) Configure the Request URL for the Servlet

Open the configuration file "web.xml" of your application "hello" that you have created earlier for the HelloServlet, i.e., "<TOMCAT_HOME>\webapps\hello\WEB-INF\web.xml". Add the <servlet> and <servlet-mapping> entries for "UserQuery" shown below, at the locations indicated.

<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app version="3.0"
  xmlns="http://java.sun.com/xml/ns/javaee"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd">

   <!-- To save as "hello\WEB-INF\web.xml" -->

   <servlet>
      <servlet-name>HelloWorld</servlet-name>
      <servlet-class>HelloServlet</servlet-class>
   </servlet>

   <servlet>
      <servlet-name>UserQuery</servlet-name>
      <servlet-class>QueryServlet</servlet-class>
   </servlet>

   <!-- Note: All <servlet> elements MUST be grouped together and
         placed IN FRONT of the <servlet-mapping> elements -->

   <servlet-mapping>
      <servlet-name>HelloWorld</servlet-name>
      <url-pattern>/sayhello</url-pattern>
   </servlet-mapping>

   <servlet-mapping>
      <servlet-name>UserQuery</servlet-name>
      <url-pattern>/query</url-pattern>
   </servlet-mapping>
</web-app>

The above lines configure the following URL to invoke QueryServlet:

http://localhost:9999/hello/query

Step 7(f) Invoke the Servlet from the Client-Side Form

Issue the following URL to browse the HTML form "querybook.html" that you have created earlier:

http://localhost:9999/hello/querybook.html

Select an author (e.g., “Tan Ah Teck”) and click the submit button, which activates the following URL coded in the <form>‘s “action” attribute, together with the name=value pair:

http://localhost:9999/hello/query?author=Tan+Ah+Teck

This URL “/query” triggers QueryServlet. The QueryServlet retrieves the name=value pair of “author=Tan+Ah+Teck“. Inside the QueryServlet, the method request.getParameter("author") returns “Tan Ah Teck“, which is inserted into the SQL SELECT command to query the database. The processed query result is then written to the client as an HTML document.
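
Two caveats in QueryServlet are worth noting: request.getParameter("author") returns only the first value even when two boxes are checked, and concatenating user input into the SQL string is open to SQL injection. A minimal sketch (my addition, not part of the original tutorial) of how Step 3 could instead use request.getParameterValues() and a java.sql.PreparedStatement, reusing the conn and out variables already in scope:

// Hypothetical replacement for Step 3 inside doGet():
// handle every checked author and bind values safely.
String[] authors = request.getParameterValues("author");  // null if none checked
if (authors != null) {
   PreparedStatement pstmt = conn.prepareStatement(
         "select * from books where author = ? and qty > 0 order by price desc");
   for (String author : authors) {
      pstmt.setString(1, author);            // safe parameter binding
      ResultSet rset = pstmt.executeQuery();
      while (rset.next()) {
         out.println("<p>" + rset.getString("author")
               + ", " + rset.getString("title")
               + ", $" + rset.getDouble("price") + "</p>");
      }
      rset.close();
   }
   pstmt.close();
}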
(Skip Unless…) The likely errors are “404 File Not Found” and “500 Internal Server Error”. Read “How to debug” Section.

2.8  (Advanced) Deploying Servlet using @WebServlet (Servlet 3.0 on Tomcat 7)

Servlet 3.0, which is supported by Tomcat 7, introduces the @WebServlet annotation, which greatly simplifies the deployment of servlets. You no longer need to write the deployment descriptor in "web.xml". Instead, you can use the @WebServlet annotation to specify the URL mapping.

For example, let us write a new servlet called AnotherHelloServlet.java, by modifying the HelloServlet.java written earlier, with a URL mapping of "/sayhi".

// To save as "<TOMCAT_HOME>\webapps\hello\WEB-INF\classes\AnotherHelloServlet.java"
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import javax.servlet.annotation.*;

@WebServlet("/sayhi")
public class AnotherHelloServlet extends HttpServlet {
   @Override
   public void doGet(HttpServletRequest request, HttpServletResponse response)
         throws IOException, ServletException {

      // Set the response MIME type
      response.setContentType("text/html;charset=UTF-8");
      // Allocate an output writer to write the response message into the network socket
      PrintWriter out = response.getWriter();

      // Write the response message, in an HTML page
      try {
         out.println("<html>");
         out.println("<head><title>Hello, World</title></head>");
         out.println("<body>");
         out.println("<h1>Hello world, again!</h1>");  // says Hello
         // Echo client's request information
         out.println("<p>Request URI: " + request.getRequestURI() + "</p>");
         out.println("<p>Protocol: " + request.getProtocol() + "</p>");
         out.println("<p>PathInfo: " + request.getPathInfo() + "</p>");
         out.println("<p>Remote Address: " + request.getRemoteAddr() + "</p>");
         // Generate a random number upon each request
         out.println("<p>A Random Number: <strong>" + Math.random() + "</strong></p>");
         out.println("</body></html>");
      } finally {
         out.close();  // Always close the output writer
      }
   }
}

The annotation @WebServlet("/sayhi") declares the URL mapping for this servlet, i.e., http://localhost:9999/hello/sayhi. There is no need to provide any more configuration in "web.xml"!
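
For reference, the single-value form above is shorthand for the annotation's urlPatterns attribute; a hypothetical equivalent using named attributes would be:

@WebServlet(name = "AnotherHello", urlPatterns = {"/sayhi"})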

3.  How to Debug?

“Everything that can possibly go wrong will go wrong.” The most important thing to do is to find the ERROR MESSAGE!!!

Always…
  1. Refresh your browser using Ctrl-F5 (instead of the refresh button or simply F5) to get a fresh copy rather than one from the cache.
  2. You may re-start your Tomcat server. You may also re-start your browser to clear the cache.
  3. Check your spelling! Always assume that all programs are case-sensitive. Don't type; copy and paste if possible!
  4. and MOST IMPORTANTLY – Find the ERROR MESSAGE!!!
    1. Check the Error Messages on Tomcat’s Console. Most of the error messages have a few screens of lines. You need to scroll up slowly from the last line to look for the FIRST LINE of the error messages.
    2. Check the Tomcat’s log files, located at “<TOMCAT_HOME>\logs“. The “catalina.yyyy-mm-dd.log” shows the Tomcat’s startup messages. Also check the “localhost.yyyy-mm-dd.log“.
  5. If things were running fine until the lightning strikes, ask yourself “What have I changed?”
Cannot Start Tomcat – Tomcat’s Console Flashes and Disappears
  1. Try running the script “configtest.bat” (for Windows) or “./configtest.sh” (for Mac/Linux) to check your configuration files.
  2. Check the Tomcat’s log files for error messages. The log files are located at “<TOMCAT_HOME>\logs“. The “catalina.{yyyy-mm-dd}.log” shows the Tomcat’s startup messages. Also check the “localhost.{yyyy-mm-dd}.log“.
  3. If the error messages indicate that another Tomcat instance is running (java.net.BindException: Address already in use: JVM_Bind), kill the Tomcat process (see below); or try running the “shutdown” script at Tomcat’s bin (For Windows, simply double-click the “shutdown.bat” or issue “shutdown” from CMD. For Mac, issue “./shutdown.sh” from Terminal.)
  4. If the error messages indicate that another application is running on the Tomcat’s port numbers, then you need to change the Tomcat’s port number in server.xml. You can issue command “netstat -an” to check the status of all the ports.
  5. Start Tomcat in debugging mode by running "catalina debug" (or "./catalina.sh debug") and type "run" at the "jdb" prompt. Look for the error messages.
Locating/Killing Tomcat’s Process
  • In Windows, start "Task Manager"; Tomcat runs as a process named "java.exe". You may need to kill the process.
  • In Mac, start "Activity Monitor". Select "All Processes" and look for "java".
  • In Linux/Mac, you may issue “ps aux | grep tomcat” to locate the Tomcat process. Note down the process ID (pid). You can kill the Tomcat process via “kill -9 pid“.
(Firefox) Unable to Connect
(IE) Internet Explorer cannot display the webpage
(Chrome) Oops! Google Chrome could not connect to …
(Safari) Safari can’t connect to the server

Cause: You are simply not connecting to your Tomcat.

Solution:

  1. Check if your Tomcat server has been started?
  2. Check the hostname and port number, separated by a colon ':', of your URL (http://localhost:9999/...).
Error 404 File Not Found

Cause: You have connected to your Tomcat, but the Tomcat server cannot find the HTML file or servlet that you requested.

Solution:

  1. Check your spelling! The path is case-sensitive!
  2. For HTML file with URL http://localhost:9999/xxxx/filename.html:
    1. Open Tomcat’s “webapps” directory, check if sub-directory “xxxx” exists. It is case-sensitive.
    2. Open the “xxxx” directory, check if “filename.html” exists.
  3. For Servlet with URL http://localhost:9999/xxxx/servletURL:
    1. Check the Tomcat's console for error messages. Your application cannot be deployed if you make a mistake in editing "web.xml", which triggers many error messages.
    2. Check the Tomcat console to make sure that your application has been deployed.
    3. Open Tomcat’s “webapps” directory, check if sub-directory “xxxx” exists.
    4. Open the “xxxx” directory, check if sub-sub-directory “WEB-INF” (uppercase with a dash) exists.
    5. Open the “WEB-INF“, check if sub-sub-sub directory “classes” (lowercase, plural) exists.
    6. Open the configuration file “WEB-INF\web.xml“:
      1. Check that servletURL is defined in a <servlet-mapping> tag. Take note of the name in <servlet-name> tag.
      2. Based on the name noted, look for the matching <servlet-class> tag. Take note of the ServletClassname.
      3. Open “WEB-INF\classes“, check if “ServletClassname.class” that you noted exists (Note: It is “.class“, and NOT “.java“. You need to compile the “.java” to get the “.class“.)
Error 500 Internal Server Error

Error 500 should have triggered many error messages in the Tomcat's console. Go to the Tomcat's console and find the error message. The error message spans tens of lines; you need to scroll up slowly to look for the first line of the error message. It should tell you the cause of this error, e.g., SQL syntax error, wrong user/password, etc.

For database servlet, you may check the error messages at “Common Errors in JDBC Programming“.

  • For “No suitable driver found” (Windows) or NullPointerException (Mac and Linux): Read Step 7(b) again, again, and again.
More Errors

Try searching “Common Error Messages“.

REFERENCES & RESOURCES

  1. Apache Tomcat mother site @ http://tomcat.apache.org.
  2. Apache Tomcat Documentation @ “<TOMCAT_HOME>\webapps\docs“.
  3. "How to Install MySQL and Get Started".
  4. "Introduction to Java Database (JDBC) Programming".
  5. Jason Brittain, Ian F. Darwin, "Tomcat: The Definitive Guide", 2nd ed., O'Reilly, 2007.

Posted in Apps Development, Computer Languages, Computer Softwares, Computing Technology, Installation | Leave a Comment »

Pictures from a developer’s life

Posted by Hemprasad Y. Badgujar on November 30, 2013


Pictures from a developer’s life

A friend sent me this link that was composed from other links from other blogs. The origin is here and another here.

Sorry for the brute copy-paste, but it's much easier to read as a whole – you can check the links for more…

When I show the boss that I have finally fixed this bug

When the project manager enters the office

When I’m deploying code to production

When I try to fix a bug at 3 in the morning

When my regex returned exactly what I expected

When a friend of mine asks me to fix his website built with Joomla

When I’m told that the module on which I have worked all the week will never be used

When the code that I have not tested on dev works perfectly in production

When the sales people announce they have sold our product to the customer

When I apply a new CSS for the first time

When sysadmin finally gives us the root access

When I launch my script for the first time after several hours of development

When I go off for the weekend while everyone else is still trying to fix bugs

When the app goes into beta and the first bug reports arrive

When the boss is looking for someone to urgently fix a difficult bug

When a thing that worked on Friday no longer works on Monday

When I return to development of my code that wasn’t commented

When a bug goes unnoticed during a presentation

When a newbie suggests adding a new feature to the project

When the boss announces a bonus if the project is completed before the deadline

When I realize that I have been blocked for two hours because of a forgotten semicolon

When asked to lend a hand on a Friday afternoon

When the project manager suddenly looks on my screen

When the client tries to click on the mockups

When customer wants to change specification 2 days before pushing to production

When my script finally worked

When I am asked to continue the work of a newbie colleague

When I’m told that my code is broken in production

When I find a solution without searching Google

When the intern tells me that “the tests are for those who can not program”

When I manage to replace 200 lines of the algorithm by only 10 lines

Posted in .Net Platform, Animation, Apps Development, Computer Languages, Entertainment | Leave a Comment »

CUDA Thread Execution Model

Posted by Hemprasad Y. Badgujar on July 22, 2013


CUDA Thread Execution Model

 

Grid of Thread Blocks

In a previous article, I gave an introduction to programming with CUDA. Now I'd like to go into a little bit more depth about the CUDA thread execution model and the architecture of a CUDA enabled GPU. I assume that the reader has basic knowledge about CUDA and already knows how to set up a project that uses the CUDA runtime API. If you don't know how to set up a project with CUDA, you can refer to my previous article: Introduction to CUDA.

 

 

GPU Architecture

To understand the thread execution model for modern GPUs, we must first analyze the GPU compute architecture. In this article I will focus on the Fermi compute architecture found in modern GPUs (e.g., the GTX 580).

Overview of the Fermi Architecture

A Fermi GPU consists of 512 CUDA cores, split across 16 Streaming Multiprocessors (SMs), each SM consisting of 32 CUDA cores. The GPU has six 64-bit memory partitions supporting up to 6 GB of GDDR5 DRAM memory.

Fermi Architecture

Each streaming multiprocessor (SM) has 32 CUDA cores. Each CUDA core consists of an integer arithmetic logic unit (ALU) and a floating-point unit (FPU).

Fermi Streaming Multiprocessor (SM)

The SM has 16 load/store units allowing source and destination addresses to be calculated for sixteen threads per clock.

Each SM also has four Special Function Units (SFU) that execute transcendental instructions such as sine, cosine, reciprocal, and square root.

CUDA Threads

Now that we’ve seen the specific architecture of a Fermi GPU, let’s analyze the more general CUDA thread execution model.

Each kernel function is executed in a grid of threads. This grid is divided into blocks also known as thread blocks and each block is further divided into threads.

CUDA Execution Model

In the image above we see that this example grid is divided into nine thread blocks (3×3), each thread block consists of 9 threads (3×3) for a total of 81 threads for the kernel grid.

This image only shows a 2-dimensional grid, but if the graphics device supports compute capability 2.0, then the grid of thread blocks can actually be partitioned into 1, 2, or 3 dimensions; if the device supports only compute capability 1.x, then the grid can be partitioned into 1 or 2 dimensions (in that case, the 3rd dimension should always be set to 1).

The thread block is partitioned into individual threads and for all compute capabilities, threads can be partitioned into 1, 2, or 3 dimensions. The maximum number of threads that can be assigned to a thread block is 512 for devices with compute capability 1.x and 1024 threads for devices that support compute capability 2.0.

                                                           Compute Capability
Technical Specifications                                   1.0 - 1.3    2.0
---------------------------------------------------------------------------
Maximum dimensionality of a grid of thread blocks              2          3
Maximum x-, y-, or z-dimension of a grid of thread blocks        65535
Maximum dimensionality of a thread block                          3
Maximum x- or y-dimension of a block                          512       1024
Maximum z-dimension of a block                                    64
Maximum number of threads per block                           512       1024

The number of blocks within a grid can be determined within a kernel by using the built-in variable gridDim, and the number of threads within a block can be determined by using the built-in variable blockDim.

A thread block is uniquely identified in a kernel function by using the built-in variable blockIdx, and a thread within a block is uniquely identified by using the built-in variable threadIdx.

The built-in variables gridDim, blockDim, blockIdx, and threadIdx are each 3-component structs with members x, y, and z.

With a 1-D kernel, the unique thread ID within a block is simply the x component of the threadIdx variable:

threadID = threadIdx.x

and the unique block ID within a grid is the x component of the blockIdx variable:

blockID = blockIdx.x

To determine the unique thread ID in a 2-D block, you would use the following formula:

threadID = ( blockDim.x * threadIdx.y ) + threadIdx.x

and to determine the unique block ID within a 2-D grid, you would use the following formula:

blockID = ( gridDim.x * blockIdx.y ) + blockIdx.x

I’ll leave it as an exercise for the reader to determine the formula to compute the unique thread ID and block ID in a 3D grid.

Matrix Addition Example

Let’s take a look at an example kernel that one might execute.

Let’s assume we want to implement a kernel function that adds two matrices and stores the result in a 3rd.

The general formula for matrix addition is:

C(i,j) = A(i,j) + B(i,j)

That is, each component of the sum matrix C is the sum of the corresponding components of matrix A and matrix B.

Let’s first write the host version of this method that we would execute on the CPU.

MatrixAdd.cpp
void MatrixAddHost( float* C, float* A, float* B, unsigned int matrixRank )
{
    for( unsigned int j = 0; j < matrixRank; ++j )
    {
        for ( unsigned int i = 0; i < matrixRank; ++i )
        {
            unsigned int index = ( j * matrixRank ) + i;
            C[index] = A[index] + B[index];
        }
    }
}

This is a pretty standard method that loops through the rows and columns of a matrix, adds the corresponding components, and stores the results in a third matrix. Now let's see how we might execute this computation on the GPU using CUDA.

First, we need to think of the problem domain. In this case, the domain is trivial: it is the components of a matrix. Since we are operating on 2-D arrays, it seems reasonable to split our domain into two dimensions; one for the rows, and another for the columns of the matrices.

We will assume that we are working on square matrices. This simplifies the problem, but mathematically matrix addition only requires that the two matrices have the same number of rows and columns; there is no requirement that the matrices be square.

Since we know that a kernel is limited to 512 threads/block with compute capability 1.x and 1024 threads/block with compute capability 2.0, then we know we can split our job into square thread blocks each consisting of 16×16 threads (256 threads per block) with compute capability 1.x and 32×32 threads (1024 threads per block) with compute capability 2.0.

For simplicity, I will assume compute capability 1.x for the remainder of this tutorial.

If we limit the size of our matrix to no larger than 16×16, then we only need a single block to compute the matrix sum and our kernel execution configuration might look something like this:

main.cpp
dim3 gridDim( 1, 1, 1 );
dim3 blockDim( matrixRank, matrixRank, 1 );
MatrixAddDevice<<<gridDim, blockDim>>>( C, A, B, matrixRank );

In this simple case, the kernel grid consists of only a single block with matrixRank x matrixRank threads.

However, if we want to sum matrices larger than 512 components, then we must split our problem domain into smaller groups that can be processed in multiple blocks.

Let's assume that we want to limit our blocks to 16×16 (256) threads. We can determine the number of blocks along each grid dimension by dividing the size of the matrix dimension by the block dimension and rounding up to the nearest whole number:

blocks = ceil( matrixRank / 16 )

And we can determine the number of threads along each block dimension by dividing the size of the matrix dimension by the number of blocks and rounding up to the nearest whole number:

threads = ceil( matrixRank / blocks )

So for example, for a 4×4 matrix, we would get:

blocks = ceil( 4 / 16 ) = 1

and the number of threads is computed as:

threads = ceil( 4 / 1 ) = 4

resulting in a 1×1 grid of 4×4 thread blocks for a total of 16 threads.

As another example, for a 512×512 matrix, we would get:

blocks = ceil( 512 / 16 ) = 32

and the number of threads is computed as:

threads = ceil( 512 / 32 ) = 16

resulting in a 32×32 grid of 16×16 thread blocks for a total of 262,144 threads.

The host code to setup the kernel granularity might look like this:

main.cpp
size_t blocks = ceilf( matrixRank / 16.0f );
dim3 gridDim( blocks, blocks, 1 );
size_t threads = ceilf( matrixRank / (float)blocks );
dim3 blockDim( threads, threads, 1 );
 
MatrixAddDevice<<< gridDim, blockDim >>>( C, A, B, matrixRank );
You may have noticed that if the size of the matrix does not divide evenly into blocks, we may get more threads than are needed to process the array. It is not possible to configure a grid of thread blocks in which one block contains fewer threads than the others. One way to solve this is to execute multiple kernels – one that handles all the evenly divisible blocks, and a second kernel invocation that handles the partial block. The other solution is simply to ignore any threads that fall outside of our problem domain, which is generally easier (and more efficient) than invoking multiple kernels (though this should be profiled to be proven).

The Matrix Addition Kernel Function

On the device, the kernel function is executed by one thread for every element in the problem domain (the matrix elements). We can use the built-in variables gridDim, blockDim, blockIdx, and threadIdx to identify the matrix element that the current thread is operating on.

If we assume we have a 9×9 matrix and we split the problem domain into 3×3 blocks each consisting of 3×3 threads as shown in the CUDA Grid below, then we could compute the ith column and the jth row of the matrix with the following formulas:

i = ( blockDim.x * blockIdx.x ) + threadIdx.x
j = ( blockDim.y * blockIdx.y ) + threadIdx.y

So for thread (0,0) of block (1,1) of our 9×9 matrix, we would get:

i = ( 3 * 1 ) + 0 = 3

for the column and:

j = ( 3 * 1 ) + 0 = 3

for the row.

The index into the 1-D buffer that stores the matrix is then computed as:

index = ( matrixRank * j ) + i

and substituting gives:

index = ( 9 * 3 ) + 3 = 30

Which is the correct element in the matrix. This solution assumes we are accessing the matrix in row-major order.

CUDA Grid Example

Let’s see how we might implement this in the kernel.

MatrixAdd.cu
__global__ void MatrixAddDevice( float* C, float* A, float* B, unsigned int matrixRank )
{
    unsigned int column = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    unsigned int row    = ( blockDim.y * blockIdx.y ) + threadIdx.y;
 
    unsigned int index = ( matrixRank * row ) + column;
    if ( index < matrixRank * matrixRank ) // prevent reading/writing array out-of-bounds.
    {
        C[index] = A[index] + B[index];
    }
}

The first two statements compute the column and row of the matrix we are operating on, using the formulas shown earlier.

Next, the 1-D index into the matrix array is computed based on the size of a single dimension of the square matrix.

We must be careful that we don't try to read or write out of the bounds of the matrix. This might happen if the size of the matrix does not fit nicely into the size of the CUDA grid (in the case of matrices whose size is not evenly divisible by 16). To protect the read and write operations, the if statement checks that the computed index does not exceed the size of our array.

Thread Synchronization

CUDA provides a synchronization barrier for all threads in a block through the __syncthreads() method. A practical example of thread synchronization will be shown in a later article about optimizing a CUDA kernel, but for now it's only important that you know this functionality exists.

Thread synchronization is only possible across all threads in a block, not across all threads running in the grid. By not allowing threads across blocks to be synchronized, CUDA enables blocks to be executed on any streaming multiprocessor (SM) in any order. The queue of blocks can be distributed to any SM without having to wait for blocks on another SM to complete. This allows CUDA-enabled applications to scale: a platform with more SMs at its disposal can execute more blocks concurrently than a platform with fewer SMs.

Thread synchronization follows strict synchronization rules. All threads in a block must hit the synchronization point, or none of them must hit it.

Given the following code block:

sample.cu
if ( threadID % 2 == 0 )
{
    __syncthreads();
}
else
{
    __syncthreads();
}

will cause the threads in a block to wait indefinitely for each other, because the two occurrences of __syncthreads are considered separate synchronization points, and all threads of the same block must hit the same synchronization point, or none of them must hit it.
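
A minimal sketch of the safe pattern (my addition, not from the original article): hoist the barrier out of the divergent branches so that every thread in the block reaches the same __syncthreads() call:

if ( threadID % 2 == 0 )
{
    // ... work for even threads (no barrier in here) ...
}
else
{
    // ... work for odd threads (no barrier in here) ...
}
__syncthreads();  // a single barrier that all threads in the block reach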

Thread Assignment

When a kernel is invoked, the CUDA runtime will distribute the blocks across the SMs on the device. A maximum of 8 blocks (regardless of platform) will be assigned to each SM as long as there are enough resources (registers, shared memory, and threads) to execute all the blocks. Where there are not enough resources on the SM, the CUDA runtime will automatically assign fewer blocks per SM until the resource usage is below the maximum per SM.

The total number of blocks that can be executed concurrently is dependent on the device. In the case of the Fermi architecture discussed earlier, each of the 16 SMs can concurrently handle 8 blocks, for a total of 128 blocks executing concurrently on the device.

Because the Fermi architecture supports compute capability 2.0, we can create thread blocks consisting of at most 1024 threads, so a Fermi device can technically have 131,072 threads (16 SMs × 8 blocks × 1024 threads) resident in the SMs for execution. This does not mean that on every clock tick the device is executing 131,072 instructions simultaneously. In order to understand how the blocks are actually executed on the device, we must look one step further to see how the threads of a block are actually scheduled on the SMs.

Thread Scheduling

When a block is assigned to an SM, it is further divided into groups of 32 threads called warps. Warp scheduling differs depending on the platform, but if we take a look at the Fermi architecture, we see that a single SM consists of 32 CUDA cores (or streaming processors) – two groups of 16 per SM.

Each SM in the Fermi architecture (see Fermi architecture image above) features two warp schedulers allowing two warps to be issued and executed concurrently. Fermi’s dual-warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four special function units (SFU’s).

Most instructions can be dual-issued; two integer instructions, two floating point instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently.

Fermi – Dual Warp Scheduler

You might be wondering why it would be useful to schedule up to 8 blocks of a maximum of 1024 threads each if the SM only has 32 SPs. The answer is that each instruction of a kernel may require more than a few clock cycles to execute (for example, an instruction to read from global memory will require multiple clock cycles). Any instruction that requires multiple clock cycles to execute incurs latency. The latency of long-running instructions can be hidden by executing instructions from other warps while waiting for the result of the previous warp. This technique of filling the latency of expensive operations with work from other threads is often called latency hiding.

Thread Divergence

It is reasonable to imagine that your CUDA program contains flow-control statements like if-then-else, switch, while loops, or for loops. Whenever you introduce these flow-control statements in your code, you also introduce the possibility of thread divergence. It is important to be aware of the consequences of thread divergence and also to understand how you can minimize its negative impact.

Thread divergence occurs when some threads in a warp follow a different execution path than others. Let’s take the following code block as an example:

test.cu
__global__ void TestDivergence( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    float value = 0.0f;
 
    if ( threadIdx.x % 2 == 0 )
    {
        // Threads executing PathA are active while threads
        // executing PathB are inactive.
        value = PathA( src );
    }
    else
    {
        // Threads executing PathB are active while threads
        // executing PathA are inactive.
        value = PathB( src );
    }
    // Threads converge here again and execute in parallel.
    dst[index] = value;
}

Then our flow control and thread divergence would look something like this:

Thread Divergence

As you can see from this example, the even-numbered threads in each block will execute PathA while the odd-numbered threads in the block will execute PathB. This is pretty much the worst-case scenario for a simple divergence example.

PathA and PathB cannot be executed concurrently on all threads because their execution paths are different. Only the threads that follow the exact same execution path can run concurrently, so the total running time of the warp is the sum of the execution times of both PathA and PathB.

In this example, the threads in the warp that execute PathA are activated (those for which the condition is true) and all the other threads are deactivated. Then, in another pass, the threads that execute PathB (those for which the condition is false) are activated and the other threads are deactivated. This means that resolving the condition requires two passes to be executed for a single warp.

The overhead of having the warp execute both PathA and PathB can be eliminated if the programmer takes careful consideration when writing the kernel. If possible, all threads of a block (since warps can't span thread blocks) should execute the same execution path. This way you guarantee that all threads in a warp will execute the same execution path and there will be no thread divergence within a block. A sketch of one way to arrange this follows.
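
For illustration (my sketch, not from the original article), one arrangement is to branch on the warp index instead of the thread parity, so the condition is uniform within each warp; this reuses the hypothetical PathA/PathB device functions from the example above and the built-in warpSize variable:

__global__ void TestNoDivergence( float* dst, float* src )
{
    unsigned int index = ( blockDim.x * blockIdx.x ) + threadIdx.x;
    float value = 0.0f;

    if ( ( threadIdx.x / warpSize ) % 2 == 0 )  // uniform within each warp
    {
        value = PathA( src );  // even-numbered warps all take PathA
    }
    else
    {
        value = PathB( src );  // odd-numbered warps all take PathB
    }
    dst[index] = value;       // no divergence occurs within any warp
}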

Exercise

If a device supports compute capability 1.3 then it can have blocks with a maximum of 512 threads/block and 8 blocks/SM can be scheduled concurrently. Each SM can schedule groups of 32-thread units called warps. The maximum number of resident warps per SM in a device that supports compute capability 1.3 is 32 and the maximum number of resident threads per SM is 1024.

Q. What would be the ideal block granularity to compute the product of two 2-D matrices of size 1024 x 1024?

  1. 4×4?
  2. 8×8?
  3. 16×16?
  4. or 32×32?

A. To answer this question, let’s analyze each choice and give pros and cons for each one.

4×4: If we decide to split our domain into 4×4 thread blocks, then we have 16 threads per block. In order to fully occupy the SM that can support 1024 threads per SM, we would need 1024/16 = 64 blocks but the SM can only schedule 8 blocks/SM so each SM would be scheduled with 8 blocks each having 16 threads which is 128 threads/SM. When divided into warps, we only have 4 warps scheduled per SM out of a total of 32 which gives only 12.5% occupancy.

8×8: We have the same problem here as we did with the 4×4 thread block granularity, except not as severe. With 8×8 thread blocks, we get 64 threads per block. For an SM that can support 1024 threads per SM, we would need 1024/64 = 16 blocks, but since we are limited to 8 blocks maximum per SM, we can only execute 8×64 = 512 threads/SM. When split into warps of 32 threads each, we get 512/32 = 16 warps scheduled per SM from a possible total of 32 warps. This gives only 50% occupancy.

16×16: A 16×16 thread block gives 256 threads/block. With a maximum thread limit per SM of 1024, we get 1024/256 = 4 blocks/SM. This is within the 8 block limit so 4 blocks, each of 256 threads can be scheduled on one SM. With 4 blocks each with 256 threads, we get a total of 1024 threads. The threads are further split into warps of 32 threads each for a total of 32 warps. Since the device can support 32 warps/SM we have achieved 100% occupancy.

32×32: This option is not even an option since a 32×32 thread block produces a single block with 1024 threads. As stated earlier, we are limited to 512 threads per block with compute capability 1.3 so our kernel wouldn’t even run.

So the best choice for this problem domain would be to invoke the kernel with a block size of 16×16; a possible launch configuration is sketched below.
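
A host-side sketch of that configuration for 1024×1024 matrices (MatrixMulDevice, C, A, and B are hypothetical names, assuming device pointers allocated elsewhere):

dim3 blockDim( 16, 16, 1 );               // 256 threads per block
dim3 gridDim( 1024 / 16, 1024 / 16, 1 );  // a 64x64 grid of blocks
MatrixMulDevice<<< gridDim, blockDim >>>( C, A, B, 1024 );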

Conclusion

In this article, I discussed the architecture of a CUDA enabled GPU, in particular the Fermi architecture. I also showed how a kernel function is scheduled on the GPU and how the warp scheduler executes instructions from different warps in order to minimize the amount of noticeable latency between kernel instructions.

 


Posted in Computer Languages, Computing Technology, CUDA, Game Development, GPU (CUDA), Graphics Cards, PARALLEL | Leave a Comment »

CUDA Open Source Projects

Posted by Hemprasad Y. Badgujar on March 4, 2013


CUDA Open Source Projects

In searching for projects to use for learning and developing with, plus requests from the NVidia forums, I have put together a list of free and open-source research projects that use CUDA. Please let me know if you have one to add, or updates to anything here.

GNURadio Software defined radio. A hardware/software combination that does baseband signal processing in software. Experiments were carried out to integrate CUDA into this mix.
MediaCoder A transcoding application for videos with a strong focus on mobile players. Some operations (de-interlacing, scaling, encoding) have been CUDA accelerated.
Bullet Physics simulation that has started to include CUDA, but it is not fully capable yet. Perhaps some CUDA genius will add to it?
Thrust (included in Release 4.0) Excellent Library!! A Parallel Template Library for CUDA. Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity.
Pycuda A module which allows access to the complete range of CUDA functionality in Python, including seamless numpy integration, OpenGL interoperability and lots more. Released under the MIT/X consortium license.
FOLKI-GPU An optical-flow estimation, implemented on CUDA
Flam4 CUDA A CUDA accelerated renderer for fractal flames. Sample videos here and here. Use other tools like Apophysis 2.0 to generate the parameter files (.flame files). A new and ongoing approach to port fractal flame rendering to CUDA is described here.
CUJ2K A CUDA accelerated JPEG 2000 encoder. Command line tool and C/C++ library. This is student work with excellent documentation. Notable speedup achieved only for large files.
Ocelot A Binary Translation Framework for PTX
Msieve A library for factoring large integers, as in RSA-size numbers. The polynomial selection phase of the general number field sieve has a great deal of CUDA code, and the speedup over a CPU is enormous (10-50x).
PFAC An open library for exact string matching performed on GPUs.
cuSVM A CUDA implementation of Support Vector Classification and Regression.
multisvm Describes how a naïve implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput.
gpuminer Parallel Data Mining on Graphics Processors
Cmatch Performs exact string matching for a set of query sequences, achieving a speedup of as much as 35x on a recent GPU over the equivalent CPU-bound version.
R+GPU A popular Open Source solution for Statistical Analysis

Posted in Apps Development, Artificial Intelligence, Computer Languages, CUDA, GPU (CUDA), GPU Accelareted, Image Processing, Neural Network, Open CL, OpenMP, PARALLEL, Project Related, Simulation, Virtualization | 1 Comment »

Getting Started with CUDA

Posted by Hemprasad Y. Badgujar on March 4, 2013


What are the capabilities of Nvidia’s CUDA running on the GPU and how does it compare to CPU performance? I bought a GeForce 9800GT and set about finding out, starting off by installing the CUDA drivers, toolkit and SDK from the Cuda Zone.

The first thing I noticed was that on my Vista64 machine the sample projects had been installed to:

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\projects

which is read-only. Rather than fight with Vista's UAC, I copied everything into the C:\CUDA directory. To build the solution in VS2008 on my Vista 64 machine, all I needed to do was switch the platform to x64 and ignore the warning:

 

Command line warning D9035 : option 'Wp64' has been deprecated and will be removed in a future release

 

and everything was fine. The SDK’s sample template conveniently included both a gold (CPU) implementation of a function and a GPU implementation. An initial run of the template project showed that only the GPU section was timed. Since the reason to use CUDA is performance and I wanted a comparison, the first modification I made was to put a timer around the CPU implementation:

 

cutilCheckError( cutStartTimer( timer));
computeGold( reference, h_idata, num_threads);  // reference solution
cutilCheckError( cutStopTimer( timer));

 

and raced them – but the results weren’t too inspiring:

 

GPU Processing time: 84.362747 (ms)
CPU Processing time: 0.001257 (ms)

 

The CPU solution wasn’t even threaded. I remembered the question of a student at the Stanford CUDA lecture on YouTube:

 

Q: Since there’s overhead in moving the data to the GPU how do you decide when it’s worthwhile?

A: Generally speaking it makes the most sense for large problems with high data intensity where you have to do multiple calculations per data element. 

Hmm, the template code only processed 128 bytes with 32 threads, so I had paid the setup costs and then not sent enough data to the GPU – no wonder the CPU was faster. So I needed to increase the data set, but there's a problem with that, since the provided kernel code assumes the entire data set will fit in shared memory and binds the size of the data to the thread count. There needed to be some changes. But you can't just increase the number of threads or you'll get:

 

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <template.cu>, line 88 : invalid configuration argument.

 

First step was to find out what resources were available on the GPU, then I’d need to work out how to get at those resources. Running the SDK Device Query told me how much global and shared memory was available as well as how many threads I could use:

 

Device 0: "GeForce 9800 GT"
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 1073741824 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.50 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

 

Some interesting numbers there. Since the GeForce can perform both an FMUL (2 flops) and an FADD (1 flop) per clock, per processor, the maximum theoretical throughput attainable is 1.5 GHz * 112 * (2 + 1) = 504 Gflops. By way of comparison, the E8400 in my test machine has a peak of 24 Gflops according to Intel's data sheet:

 

Intel_E8400

 

But back to the problem of pushing more data through.  A few problems:

1) The data size needs to be uncoupled from the thread count which means a change to the GRID count from this:

 

// setup execution parameters
dim3  grid( 1, 1, 1);
dim3  threads( num_threads, 1, 1);

 

to something more like this:

 

cThreadsPerBlock = 64;
cBlocksPerGridx = 1024;
cBlocksPerGridy = 1024;

cData = cThreadsPerBlock * cBlocksPerGridx * cBlocksPerGridy;

dim3  grid ( cBlocksPerGridx, cBlocksPerGridy, 1); 
dim3  block( cThreadsPerBlock, 1, 1);

 

where the counts of blocks per grid in the x and y directions would need to be derived from the data. To simplify the example I've done it backwards and set the data size based on the thread and block breakdown. These grid and block variables are then passed to the GPU using the triple angle bracket <<< >>> notation:

 

testKernel<<< grid, block, shared_mem_size >>>( d_idata, d_odata);

 

which is the same as:

 

testKernel<<< grid, 64, shared_mem_size >>> ( d_idata, d_odata);

 

because the passed argument is converted to a CUDA dim3 type which “is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.” from the programming guide.
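
For instance, both declarations below describe the same launch dimensions:

dim3 block( 64 );         // equivalent to dim3 block( 64, 1, 1 )
dim3 grid( 1024, 1024 );  // equivalent to dim3 grid( 1024, 1024, 1 )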

Specifying a shared_mem_size on the kernel call as above allows you to specify the size at runtime. You can then pick up a reference to the memory in the kernel code with:

 

extern  __shared__  float sdata[];

 

Alternatively if you know the size at compilation time you can also declare the shared memory inside the kernel like this:

 

__shared__ float sdata[256];

 

This would mean the kernel call would just be:

 

testKernel<<< grid, 64 >>> ( d_idata, d_odata);

 

2) The kernel code must loop through the grid. Calculate the thread id, block id and then global id to figure where in the global data we are up to. Pass the size of the data(int len) since num_threads is no longer coupled with the data length.  The __umul24 in the code provides increased performance but comes with a warning: “Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but __mul24 and __umul24 provide signed and unsigned 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures however, __[u]mul24 will be slower than 32-bit integer multiplication”.

 

__global__ void
testKernel( float* g_idata, float* g_odata, int len) 
{
  // shared memory
  // the size is determined by the host application
  extern  __shared__  float sdata[];

  // thread id
  const unsigned int tid = threadIdx.x;
  // block id
  const unsigned int bid = __umul24(gridDim.x, blockIdx.y) + blockIdx.x ;
  // global memory id
  const unsigned int gid = tid + __umul24(blockDim.x, bid);

  const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z);

 

3) The kernel needs to read from global memory and then synchronise across threads; this syncs the threads across warps and presents a consistent shared-memory picture. Now thread 0 can read from SDATA(1) and will see the data which thread 1 loaded. A call to __syncthreads() is only needed when the count of threads per block exceeds the warpSize because, as mentioned in the performance optimisation whitepaper, "Instructions are SIMD synchronous within a warp". Of course every call has a cost, and the programming guide states that "throughput for __syncthreads is 8 operations per clock cycle in the case where no thread has to wait for any other threads."

None of this is important in the sample template code because there is no communication between threads, thus no need for shared memory or thread syncing – a situation in which registers would normally be used but in this case shared memory has presumably been used by Nvidia for example purposes.

 

const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z); 
SDATA(tid) = g_idata[tid];
if (cThreadsPerBlock > warpSize) __syncthreads();

 

At this point I had revised the template to time the CPU for comparison, removed the size restrictions to allow a decent amount of data to be pushed through, and was ready to attempt to answer the question – given the overhead of pushing the data to the GPU, when is it worth doing so? Running the code gave some unexpected answers. Keeping the thread count constant, I varied the cBlocksPerGridy to yield various data sizes:

 

 

The GPU and CPU seemed to take the same amount of time with different data loads, but the GPU was hampered by a constant overhead of 80ms – the exact same difference I noted when only 128 bytes were trialled in the very first instance before any modification. Where was the time going? Some sort of setup cost? Also, how much was being taken in the kernel and how much in the data transfer? I needed more fine-grained data to see what was going on.

I had modified the supplied SDK template code in a minimal way in order to measure CPU vs GPU performance and found that, for the simple test code (1 float multiplication), the E8400 CPU with a claimed 24 Gflops was handily outperforming a GPU with a theoretical max of 504 Gflops. Where was all the time going? Was the kernel the culprit, the memory copy, or something else? I started out by trying to reuse the

 

cutilCheckError( cutStartTimer( timer));

 

timing method already in the template. Looking into the CUDA source in SDK\common\src\stopwatch_win.cpp showed that on Windows it was using the QueryPerformanceFrequency method which uses the highest possible resolution hardware timer … on the CPU. Using it to measure GPU performance is problematic because timing the GPU using a CPU timer requires the GPU and the CPU to be synchronised with:

 

cudaThreadSynchronize();

 

and ruins the timing information. To measure times on the GPU I needed to use GPU based timing on stream 0 using events:

cudaEventRecord(start, 0);

So I created an array of start and stop events, broke the GPU processes into 5 steps and timed everything. The 5 GPU processes were:

1) Alloc: Host to Device – The allocation of memory on the device for the input array which needed to be copied over from the host.

2) Copy: Host to Device – Copying the input array from the host onto the device. Data size divided by time taken here would give bandwidth.

3) Alloc: Device to Host – The allocation of memory on the device for the output array where the result would be stored before being copied back to the host.

4) Compute – Running the actual kernel, reading from the input array, processing and writing results to the output array.

5) Copy: Device to Host – Copying the output array back to the host.

I also retained my CPU timing to measure the amount of time it took for the GPU to do everything and get the answer back onto the host – that way I’d have a 1:1 comparison against the CPU version. That gives one more thing to measure, how does the sum of the GPU times compare to the overall CPU time?

6) Sync with CPU – CPU time minus sum of GPU times indicates how long it takes to sync the two.

Set up 5 GPU timers to get a breakdown of where the GPU was spending time and keep the 2 CPU timers for the original comparison:

 

// GPU timers - used to time GPU streams
int cGpuTimer = 5;

cudaEvent_t* rgGpuTimer_start = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);
cudaEvent_t* rgGpuTimer_stop = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);

for (int i=0;i<cGpuTimer;i++)
{
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_start[i] ) );
    cutilSafeCall( cudaEventCreate( &rgGpuTimer_stop[i] ) );
}

 

and wrap all the GPU calls with timing calls:

 

cutilCheckError( cutStartTimer( rgTimer[0]));

  // Alloc: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[0], 0 ) );
  float* d_idata;
  cutilSafeCall( cudaMalloc( (void**) &d_idata, global_mem_size));
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[0], 0 ) );

  // Copy: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[1], 0 ) );
  cutilSafeCall( cudaMemcpy( d_idata, h_idata, global_mem_size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[1], 0 ) );

  // Alloc: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[2], 0 ) );
  float* d_odata;
  cutilSafeCall( cudaMalloc( (void**) &d_odata, global_mem_size)); // The pad won't be read back
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[2], 0 ) );

  // Compute
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[3], 0 ) );
  dim3  gridDim ( cBlocksPerGridx, cBlocksPerGridy, 1);
  dim3  blockDim( cThreadsPerBlock, 1, 1);

  testKernel<<< gridDim, blockDim, shared_mem_size >>>( d_idata, d_odata, cData);

  cutilCheckMsg("Kernel execution failed");
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[3], 0 ) );

  // Copy: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[4], 0 ) );
  cutilSafeCall( cudaMemcpy( h_odata, d_odata, global_mem_size, cudaMemcpyDeviceToHost) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[4], 0 ) );

cudaThreadSynchronize(); // Block until memory copy is done to ensure accurate timing

cutilCheckError( cutStopTimer( rgTimer[0]));
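
The post doesn't show the readback itself; a minimal sketch (my addition) of how the elapsed time for each of the 5 steps might be extracted after the synchronize, using the standard cudaEventElapsedTime call:

for (int i = 0; i < cGpuTimer; i++)
{
    float ms = 0.0f;
    cutilSafeCall( cudaEventSynchronize( rgGpuTimer_stop[i] ) );  // wait for the stop event
    cutilSafeCall( cudaEventElapsedTime( &ms, rgGpuTimer_start[i], rgGpuTimer_stop[i] ) );
    printf( "GPU step %d: %f (ms)\n", i, ms );
}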

 

With this code in place I was ready to find out where the extra 80ms that the GPU took compared to the CPU was coming from and how much time each of the GPU tasks took. First a baseline comparison to verify that the code was still the same and gave the same numbers.

So here’s the graph from before on the left, and here’s the new graph, which should be identical, on the right:

 

 

 

Wow! What’s happened here? All the CPU times are the same, as expected, but the GPU has suddenly closed the gap and now takes only a few ms extra – the 80ms gap has vanished. A diff of the two versions shows that the only change to the code is the addition of GPU timing – and that turns out to be why the GPU suddenly sped up. Directly after setting the device, sending a wakeup call to the GPU like this:

 

if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
    cutilDeviceInit(argc, argv);
else
    cudaSetDevice( cutGetMaxGflopsDeviceId() );

{
    cudaEvent_t wakeGPU;
    cutilSafeCall( cudaEventCreate( &wakeGPU) );
}

 

means that 80ms vanishes from the timed loop later in the code. Note that the variable is scoped so it isn't used. Is the GeForce like a person – does it go faster when it knows it is being watched?! Or is this some wakeup from a power-saving mode? I'm not sure. This is the only extra code needed to cut 80ms from the timing, which shows how tricky it is to time accurately on the ms scale – the slightest change can have a significant effect. It is always advisable to run tests on large volumes of data with a lot of loops to drown out one-off costs like this where possible. While on the topic of getting accurate performance readings, note that all timing should be done on release code, particularly timing breakdowns, as the SDK/common/cutil_readme.txt file states:

 

“These macros are compiled out in release builds and so they will not affect performance. Note that in debug mode they call cudaThreadSynchronize() to ensure that kernel execution has completed, which can affect performance.” 

Well now that the extra 80ms has been eliminated what does our new GPU timing code show us about how the GPU spends its time? Here’s a chart showing the breakdown for a 16MB sample:

 

 

The majority of the time, and this holds for the other data sizes, is taken copying data back and forth. So experimentally it seems that the overhead in moving the data back and forth is quite significant. Of the 24.8ms required in total to process 16MB, 21.9ms were spent copying data. The actual processing takes almost no time. Running a variety of input sizes and timing each one tells us what kind of bandwidth we are typically getting, as shown in the table below, where times are in ms:

        Copy: Host to Device       Copy: Device to Host
        Time (ms)     MB/s         Time (ms)     MB/s
16MB       9.0       1771.9          11.8       1359.3
32MB      16.3       1966.0          22.2       1442.8
64MB      30.6       2093.9          49.8       1285.4
128MB     58.2       2198.2          83.9       1526.4
256MB    114.9       2228.7         171.4       1493.4

We wanted to find out where the GPU was spending its time and have now discovered that most of it is in moving data back and forth. Can we now answer the question of where the GPU outperforms the CPU? Is 2GB/s the expected throughput? Well, Nvidia provides a tool in the SDK to answer that – the "Bandwidth Test". Running it through the provided GUI tool yields the following results:

 

Running on......
      device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               2152.6

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1919.2

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               48507.8

 

So we can see for 32MB, performance is roughly in line with the template results so that’s case closed … or is it? Two things give cause for concern:

1) PCIe 2.0 is theoretically capable of 500 MB/s per lane and with a x16 slot there are 16 lanes. So throughput should be up around 8GB/s, not the 2GB/s observed.

2) What exactly does “Host to Device Bandwidth for Pageable memory” in the bandwidth test results mean? Pageable memory?

So I found out that the bulk of the time was in data copying, first confirmed that the speeds observed were similar to those given in the Nvidia test suite and then raised new questions about whether we were getting everything out of the hardware given 2GB/s observed and 8GB/s theoretical. So now I need to confirm that my hardware really is PCIe 2.0 x16 and figure out what pageable memory is.

I’d added GPU based timing to my template code and found out that most of the time was spent copying data back and forth between the host and the device. The “Bandwidth Test” in the SDK gave roughly similar results although it mentioned something about pageable memory. But the big problem was the theoretical performance of PCIe 2.0 x16 far exceeded what I was seeing. So the first step was to confirm that both my graphics card and my motherboard supported and were using PCIe 2.0 x16. To do this I used CPU-Z and GPU-Z, with the following results:

 

[Screenshots: CPU-Z and GPU-Z confirming PCIe 2.0 x16 on both motherboard and card]

 

So after confirming that the hardware should be capable of better speeds, I took another look at the bandwidthTest. Running it with the --help switch reveals several options:

 

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --help
Usage:  bandwidthTest [OPTION]...
Test the bandwidth for device to host, host to device, and device to device transfers

Example:  measure the bandwidth of device to host pinned memory copies in the range 1024 Bytes
          to 102400 Bytes in 1024 Byte increments
./bandwidthTest --memory=pinned --mode=range --start=1024 --end=102400 --increment=1024 --dtoh

Options:
--help  Display this help menu
--csv   Print results as a CSV
--device=[deviceno]     Specify the device device to be used
  all - compute cumulative bandwidth on all the devices
  0,1,2,...,n - Specify any particular device to be used
--memory=[MEMMODE]      Specify which memory mode to use
  pageable - pageable memory
  pinned   - non-pageable system memory
--mode=[MODE]   Specify the mode to use
  quick - performs a quick measurement
  range - measures a user-specified range of values
  shmoo - performs an intense shmoo of a large range of values
--htod  Measure host to device transfers
--dtoh  Measure device to host transfers
--dtod  Measure device to device transfers
--wc    Allocate pinned memory as write-combined
--cputiming     Force CPU-based timing always
Range mode options
--start=[SIZE]  Starting transfer size in bytes
--end=[SIZE]    Ending transfer size in bytes
--increment=[SIZE]      Increment size in bytes

 

Particularly of interest is the “pinned” memory mode. Let’s try that:

 

C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5256.9
Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4891.6
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 48498.6

 

and we see that this mode vastly improves the maximum throughput: speeds are now up to 5GB/s. It is not obvious why Nvidia didn't make it the default option, though pinned memory is page-locked and cannot be swapped out, so large pinned allocations can starve the rest of the system, which presumably makes pageable the safer default. A short investigation of the code reveals that the timing isn't quite analogous to the testing we are doing in the template code:

bandwidthTest.cu

 

// defines, project
#define MEMCOPY_ITERATIONS  10

 

as the bandwidthTest copies the same memory 10 times in a row, compared to the single copy we are doing, so we expect our performance to lag slightly behind this 5GB/s. Conveniently, all the code needed to use pinned memory is provided in the bandwidthTest, so putting it into a few wrapper functions called freeHost, mallocHost and memCpy yields:

 

////////////////////////////////////////////////////////////////////////////////
//  Memory functions to switch between pinned and pageable memory as required
////////////////////////////////////////////////////////////////////////////////

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

typedef unsigned int uint;
enum memoryMode { PINNED, PAGEABLE };   // as in bandwidthTest.cu

cudaError
freeHost(void* h_mem, memoryMode memMode)
{
    if( PINNED == memMode ) {
        return cudaFreeHost(h_mem);   // pinned memory must be released by the runtime
    }
    else {
        free(h_mem);                  // pageable memory came from plain malloc
    }
    return cudaSuccess;
}

cudaError
mallocHost(void** h_mem, uint memSize, memoryMode memMode, bool wc)
{
    if( PINNED == memMode ) {
#if CUDART_VERSION >= 2020
        // optionally allocate the pinned memory as write-combined
        return cudaHostAlloc( h_mem, memSize, (wc) ? cudaHostAllocWriteCombined : 0 );
#else
        if (wc) {
            printf("Write-Combined unavailable on CUDART_VERSION less than 2020, running is: %d\n", CUDART_VERSION);
        }
        return cudaMallocHost( h_mem, memSize );
#endif
    }
    else { // PAGEABLE memory mode
        *h_mem = malloc( memSize );
    }

    return cudaSuccess;
}

cudaError
memCpy(void* sink, void* source, uint memSize, cudaMemcpyKind direction, memoryMode memMode)
{
    if( PINNED == memMode ) {
        // async copy on stream 0 returns before the transfer completes, so
        // callers must synchronize before stopping any timers
        return cudaMemcpyAsync( sink, source, memSize, direction, 0);
    }
    else {
        return cudaMemcpy( sink, source, memSize, direction);
    }
}
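
For illustration, a hypothetical call sequence showing how the template code might drive these wrappers (the buffer names and the 16MB size are assumptions, not the original code):

unsigned char *h_data = NULL, *d_data = NULL;
uint memSize = 16 * (1 << 20);                        // 16MB sample

mallocHost((void**)&h_data, memSize, PINNED, false);  // pinned, not write-combined
cudaMalloc((void**)&d_data, memSize);

memCpy(d_data, h_data, memSize, cudaMemcpyHostToDevice, PINNED);
cudaThreadSynchronize();   // the async copy must finish before stopping any timer

// ... launch kernel, then memCpy(..., cudaMemcpyDeviceToHost, PINNED) ...

cudaFree(d_data);
freeHost(h_data, PINNED);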

 

These functions take the same parameters as the functions they replace, with the addition of the memory mode and, for mallocHost, whether or not the memory should be write-combined. Switching the allocation, copying and freeing over to these new functions allows use of pinned memory. Running the same test set shows that the time is now much more evenly spread between tasks:

[Pie charts: per-task time breakdown using pinned memory]

and running the new numbers on the throughput we get:

        Copy: Host to Device        Copy: Device to Host
Size    Time (ms)   MB/s            Time (ms)   MB/s
16MB    3.2         5026.7          3.3         4878.0
32MB    6.1         5242.5          6.5         4891.5
64MB    12.2        5251.1          13.1        4871.7
128MB   24.4        5247.6          26.2        4894.1
256MB   48.9        5239.0          52.3        4894.7

So now the throughput matches the best the bandwidthTest provides and is much closer to the theoretical limit. The total times are down significantly and the GPU is faster on all tested sizes; the 256MB trial runs in 30% less time, down from 340ms to 236ms.

[Chart: total processing times with pinned versus pageable memory]

The next challenge is to find where else time is lost. The pie charts show that most of the time is still spent in allocation and copying, with very little in compute, so there's no need to look at the kernel. We've probably already cut most of the time we can from the copying, so that leaves allocation. A good approach would be to allocate the memory once and then reuse it across multiple kernel executions, the kind of intensive processing Nvidia suggests is best suited to CUDA. But what if the code needs to be as shown, one kernel run on one large set of data before returning to another application? This is the flow seen in Matlab MEX files that use CUDA: Matlab passes the data through the C/C++ MEX file, which runs a CUDA program, gets the result and returns it to Matlab. Could parallel memory copies and allocations speed things up in this situation? A sketch of the idea appears below.
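
As a teaser for that investigation, here is a minimal sketch of the idea using CUDA streams. This is an assumption-laden sketch rather than measured code: the buffer names carry over from the earlier example, and whether anything actually overlaps depends on the device (check the deviceOverlap property of cudaDeviceProp).

// Sketch only, not measured: split the buffer in two and issue each half's
// copy (and eventually its kernel) on its own stream so they may overlap.
// h_data must be pinned for the async copies to be truly asynchronous.
cudaStream_t streams[2];
for (int i = 0; i < 2; i++) cudaStreamCreate(&streams[i]);

uint half = memSize / 2;
for (int i = 0; i < 2; i++) {
    cudaMemcpyAsync(d_data + i*half, h_data + i*half, half,
                    cudaMemcpyHostToDevice, streams[i]);
    // kernel launch on streams[i] would go here
}
cudaThreadSynchronize();   // wait for both streams to finish

for (int i = 0; i < 2; i++) cudaStreamDestroy(streams[i]);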

So we've switched the code over to use pinned memory in preference to pageable and attained the desired speedup in memory operations, from 2GB/s to about 5GB/s. Theoretically PCIe 2.0 x16 should be able to hit 8GB/s, and I don't know why we aren't able to achieve speeds closer to that number; if anyone knows, please leave a comment or e-mail me. From here, the next thing to investigate for more throughput in the single-kernel scenario is parallel allocations and copies.

Posted in Artificial Intelligence, Computer Languages, Computing Technology, CUDA, Game Development, GPU (CUDA), GPU Accelareted, Image Processing, Neural Network, OpenCL, PARALLEL, Simulation, Virtualization | Leave a Comment »

GPU Parallel Programming in VS2012 with NVIDIA CUDA

Posted by Hemprasad Y. Badgujar on March 4, 2013


1. Introduction

Here I will share with you my first experience of creating a CUDA-based C++ program on Windows using Visual Studio 2012. CUDA is an acronym for Compute Unified Device Architecture, NVIDIA's general-purpose computing API for their graphics card hardware. The simple program below is taken from NVIDIA's sample code and performs a SAXPY operation (y = a*x + y) on a large vector. Before continuing, you should have installed the required CUDA drivers, toolkit and SDK from here:
http://developer.nvidia.com/cuda/cuda-downloads

Or, if you’d rather choose to install the latest CUDA toolkit, head over here:
http://developer.nvidia.com/cuda/cuda-pre-production

You should also have a working C++ compiler. I am using Visual Studio 2012 on Windows 8 64-bit. Please be advised that CUDA-based applications won’t run unless the appropriate NVIDIA GPU hardware supporting CUDA is present in your system.

2. Setting up Visual Studio 2012

Basically everything should be set up automatically by the installer. However, with the current release of CUDA version 5.0, you might not be able to compile/build your project successfully. This is because nvcc.exe does not yet support the new cl.exe compiler version. If you try to compile any samples from the SDK there will be errors about .targets and .props files not being found. To fix this, you should manually deploy those files according to the instructions in “C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v5.0\extras\visual_studio_integration”.

Those files still need some modifications for a successful compilation. You can download the modified files here: BuildCustomizations.rar. Extract the contents to the folder “C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V110\BuildCustomizations\”.
If you prefer to modify the files manually, follow these instructions carefully:

  1. Copy all the build customization files somewhere
  2. Open “CUDA 5.0.props”. Search for the following lines:

     <CudaClVersion Condition="'$(PlatformToolset)' == 'v90'">2008</CudaClVersion>
     <CudaClVersion Condition="'$(PlatformToolset)' == 'v100'">2010</CudaClVersion>

     and add this new line:

     <CudaClVersion Condition="'$(PlatformToolset)' == 'v110'">2010</CudaClVersion>
  3. Open “CUDA 5.0.targets”. Search for the text “CudaCleanDependsOn” and replace the tag content with these lines:

     <CudaCleanDependsOn>
       $(CudaCompileDependsOn);
       _SelectedFiles;
       CudaFilterSelectedFiles;
       AddCudaCompileMetadata;
       AddCudaLinkMetadata;
       AddCudaCompileDeps;
       AddCudaCompilePropsDeps;
       ValidateCudaBuild;
       ValidateCudaCodeGeneration;
       ComputeCudaCompileOutput;
       PrepareForCudaBuild
     </CudaCleanDependsOn>
  4. In the same file, search for “GenerateRelocatableDeviceCode”. Replace the line with the following:

     GenerateRelocatableDeviceCode="%(CudaCompile.GenerateRelocatableDeviceCode)"

  5. Go down a bit and look for “CodeGeneration”. Replace the line with this:

     CodeGeneration="%(CudaCompile.CodeGenerationValues)"

  6. Again search for “CommandLineTemplate”. It should be somewhere near the end of the file. Replace the line with this:

     CommandLineTemplate="&quot;$(CudaToolkitNvccPath)&quot; %(CudaCompile.BuildCommandLineTemplate) %(CudaCompile.ApiCommandLineTemplate) %(CudaCompile.CleanCommandLineTemplate)" />
  7. Copy all modified files here: “C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V110\BuildCustomizations\”

Also, modify line 90 of the file “host_config.h” located in the folder
“C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\”
by changing the value ‘1600’ to ‘1700’.
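
For reference, the check on that line is an _MSC_VER version guard. From memory it looks roughly like the following (verify against your own copy of the file):

/* host_config.h, around line 90, before the change: */
#if _MSC_VER < 1400 || _MSC_VER > 1600

/* after the change, accepting the VS2012 compiler (cl.exe version 17.00): */
#if _MSC_VER < 1400 || _MSC_VER > 1700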

Note: Remove the ‘x86’ inside the paths if you use the 64-bit CUDA toolkit.

Syntax Highlighting

To have a fancy C++ syntax highlighting feature enabled, follow these steps:

  1. Select the menu “Tools->Options…”. Open “Text Editor” in the tree view on the left, and click on “File Extension”.
  2. Type “cu” in the “Extension” box, set the editor to “Microsoft Visual C++” and click “Add”. Click “OK” on the dialog box.
  3. Restart Visual Studio and your CUDA code should now have syntax highlighting.

3. Creating the App

Make sure you have installed all the required SDKs. If everything is OK, start by creating a simple console project and typing in this code:

Code:

#include <iostream>
#include <algorithm>   // std::max
#include <cmath>       // std::abs
#include <cstdlib>     // malloc, free
using namespace std;

// Kernel: each thread computes one element of y = a*x + y
__global__ void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;

  // Allocate host and device buffers
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  // Every element should now be 2.0f*1.0f + 2.0f = 4.0f
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = std::max(maxError, std::abs(y[i]-4.0f));
  cout << "Max error: " << maxError << endl;

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

Before compiling, make a reference to the CUDA library by specifying its location and name in the project’s properties page:

  1. Navigate to the “Configuration Properties\Linker\General” option
  2. In the “Additional Library Directories” field, add “$(CUDA_PATH)\lib\$(PlatformName)”
  3. Go to the “Configuration Properties\Linker\Input” option
  4. Lastly in the “Additional Dependencies” field, add “cudart.lib”

The code should compile successfully.
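
If everything works, running the program should print “Max error: 0”, since every element of y ends up as 2.0*1.0 + 2.0 = 4.0.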

Read more: http://blog.norture.com/2012/10/gpu-parallel-programming-in-vs2012-with-nvidia-cuda/#ixzz2MVFioDQt

Posted in Apps Development, C, Computer Games, Computer Languages, Computing Technology, Cryptography, CUDA, Game Development, GPU (CUDA), GPU Accelareted, PARALLEL | 2 Comments »

 