Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘CUDA’ Category

Deep Learning Software / Framework Links

Posted by Hemprasad Y. Badgujar on July 15, 2016


  1. Theano – CPU/GPU symbolic expression compiler in Python (from the MILA lab at the University of Montreal)
  2. Torch – provides a Matlab-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu)
  3. Pylearn2 – a library designed to make machine learning research easy.
  4. Blocks – a Theano framework for training neural networks.
  5. TensorFlow – an open source software library for numerical computation using data flow graphs.
  6. MXNet – a deep learning framework designed for both efficiency and flexibility.
  7. Caffe – a deep learning framework made with expression, speed, and modularity in mind.
  8. Lasagne – a lightweight library to build and train neural networks in Theano.
  9. Keras – a Theano-based deep learning library.
  10. Deep Learning Tutorials – examples of how to do deep learning with Theano (from the LISA lab at the University of Montreal)
  11. Chainer – a GPU-based neural network framework.
  12. DeepLearnToolbox – a Matlab toolbox for deep learning (from Rasmus Berg Palm)
  13. Cuda-Convnet – a fast C++/CUDA implementation of convolutional (or more generally, feed-forward) neural networks. It can model arbitrary layer connectivity and network depth. Any directed acyclic graph of layers will do. Training is done using the back-propagation algorithm.
  14. Deep Belief Networks – Matlab code for learning Deep Belief Networks (from Ruslan Salakhutdinov).
  15. RNNLM – Tomas Mikolov’s recurrent neural network based language modeling toolkit.
  16. RNNLIB – a recurrent neural network library for sequence learning problems. Applicable to most types of spatiotemporal data, it has proven particularly effective for speech and handwriting recognition.
  17. matrbm – simplified version of Ruslan Salakhutdinov’s code, by Andrej Karpathy (Matlab).
  18. deeplearning4j – an Apache 2.0-licensed, open-source, distributed neural net library written in Java and Scala.
  19. Estimating Partition Functions of RBMs – Matlab code for estimating partition functions of Restricted Boltzmann Machines using Annealed Importance Sampling (from Ruslan Salakhutdinov).
  20. Learning Deep Boltzmann Machines – Matlab code for training and fine-tuning Deep Boltzmann Machines (from Ruslan Salakhutdinov).
  21. LUSH – the LUSH programming language and development environment, used at NYU for deep convolutional networks.
  22. Eblearn.lsh – a LUSH-based machine learning library for doing energy-based learning. It includes code for “Predictive Sparse Decomposition” and other sparse auto-encoder methods for unsupervised learning. Koray Kavukcuoglu provides Eblearn code for several deep learning papers on this page.
  23. deepmat – Matlab-based deep learning algorithms.
  24. MShadow – a lightweight CPU/GPU matrix/tensor template library in C++/CUDA. The goal of MShadow is to provide an efficient, device-invariant and simple tensor library for machine learning projects that aim for both simplicity and performance. Supports CPU/GPU/multi-GPU and distributed systems.
  25. CXXNET – a fast, concise, distributed deep learning framework based on MShadow. It is a lightweight and easily extensible C++/CUDA neural network toolkit with a friendly Python/Matlab interface for training and prediction.
  26. Nengo – a graphical and scripting based software package for simulating large-scale neural systems.
  27. Eblearn – a C++ machine learning library with a BSD license for energy-based learning, convolutional networks, vision/recognition applications, etc. EBLearn is primarily maintained by Pierre Sermanet at NYU.
  28. cudamat – a GPU-based matrix library for Python. Example code for training neural networks and Restricted Boltzmann Machines is included.
  29. Gnumpy – a Python module that interfaces in a way almost identical to numpy, but does its computations on your computer’s GPU. It runs on top of cudamat.
  30. The CUV Library (github link) – a C++ framework with Python bindings for easy use of NVIDIA CUDA functions on matrices. It contains an RBM implementation, as well as annealed importance sampling code and code to calculate the partition function exactly (from the AIS lab at the University of Bonn).
  31. 3-way factored RBM and mcRBM – Python code calling CUDAMat to train models of natural images (from Marc’Aurelio Ranzato).
  32. Matlab code for training conditional RBMs/DBNs and factored conditional RBMs (from Graham Taylor).
  33. mPoT – Python code using CUDAMat and Gnumpy to train models of natural images (from Marc’Aurelio Ranzato).
  34. neuralnetworks – a Java-based GPU library for deep learning algorithms.
  35. ConvNet – a Matlab-based convolutional neural network toolbox.
  36. Elektronn – a deep learning toolkit that makes powerful neural networks accessible to scientists outside the machine learning community.
  37. OpenNN – an open source class library written in C++ that implements neural networks, a main area of deep learning research.
  38. NeuralDesigner – an innovative deep learning tool for predictive analytics.
  39. Theano Generalized Hebbian Learning.

Posted in C, Computing Technology, CUDA, Deep Learning, GPU (CUDA), JAVA, OpenCL, PARALLEL, PHP, Project Related

CUDA Unified Memory

Posted by Hemprasad Y. Badgujar on October 6, 2015



CUDA is NVIDIA’s parallel computing platform and programming model for its GPUs.  To extract maximum performance from GPUs, you’ll want to develop applications in CUDA.

The CUDA Toolkit is the primary development environment for creating CUDA-enabled applications.  Its main roles are to simplify the software development process, maximize software developer productivity, and provide features that enhance GPU performance.  The Toolkit has been steadily evolving in tandem with GPU hardware and currently sits at version 6.5.

One of the most important features of CUDA 6.5 is Unified Memory (UM).  (UM was actually first introduced in CUDA 6.0.)  CPU host memory and GPU device memory are physically separate entities, connected by a relatively slow PCIe bus.  Prior to 6.0, data elements shared by both CPU and GPU code required two copies – one in CPU memory and one in GPU memory.  Developers had to allocate memory on the CPU, allocate memory on the GPU, and then copy data from CPU to GPU and from GPU to CPU.  This dual data management scheme added complexity to programs, created opportunities for the introduction of software bugs, and forced an excessive focus of time and energy on data management tasks.

UM corrects this.  UM creates a memory pool that is shared between CPU and GPU, with a single memory address space and single pointers accessible to both host and device code.  The CUDA driver and runtime libraries automatically handle data transfers between host and device memory, thus relieving developers from the burden of explicitly managing those data transfers.  UM improves performance by automatically providing data locality on the CPU or GPU, wherever it might be required by the application algorithm.  UM also guarantees global coherency of data on host and device, thus reducing the introduction of software bugs.

Let’s explore some sample code that illustrates these concepts.  We won’t concern ourselves with the function of this algorithm; instead, we’ll just focus on the syntax. (Credit to NVIDIA for this C/CUDA template example.)

Without Unified Memory

#include <string.h>
#include <stdio.h>
struct DataElement
{
  char *name;
  int value;
};
__global__
void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value;
  elem->name[0] = 'd';
  elem->value++;
}
void launch(DataElement *elem) {
  DataElement *d_elem;
  char *d_name;
  int namelen = strlen(elem->name) + 1;
  // Allocate memory on GPU
  cudaMalloc(&d_elem, sizeofDataElement());
  cudaMalloc(&d_name, namelen);
  // Copy data from CPU to GPU
  cudaMemcpy(d_elem, elem, sizeof(DataElement),
     cudaMemcpyHostToDevice);
  cudaMemcpy(d_name, elem->name, namelen, cudaMemcpyHostToDevice);
  cudaMemcpy(&(d_elem->name), &d_name, sizeof(char*),
     cudaMemcpyHostToDevice);
  // Launch kernel
  Kernel<<< 1, 1 >>>(d_elem);
  // Copy data from GPU to CPU
  cudaMemcpy(&(elem->value), &(d_elem->value), sizeof(int),
     cudaMemcpyDeviceToHost);
  cudaMemcpy(elem->name, d_name, namelen, cudaMemcpyDeviceToHost);
  cudaFree(d_name);
  cudaFree(d_elem);
}
int main(void)
{
  DataElement *e;
  // Allocate memory on CPU
  e = (DataElement*)malloc(sizeof(DataElement));
  e->value = 10;
  // Allocate memory on CPU
  e->name = (char*)malloc(sizeof(char) * (strlen("hello") + 1));
  strcpy(e->name, "hello");
  launch(e);
  printf("On host: name=%s, value=%d\n", e->name, e->value);
  free(e->name);
  free(e);
  cudaDeviceReset();
}

Note these key points:

  • malloc(): allocate memory on the CPU
  • cudaMalloc(): allocate memory on the GPU
  • cudaMemcpy(..., cudaMemcpyHostToDevice): copy data from the CPU to the GPU
  • Kernel<<< 1, 1 >>>(d_elem): run the kernel on the GPU
  • cudaMemcpy(..., cudaMemcpyDeviceToHost): copy results back from the GPU to the CPU

With Unified Memory 

#include <string.h>
#include <stdio.h>
struct DataElement
{
  char *name;
  int value;
};
__global__
void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value;
  elem->name[0] = 'd';
  elem->value++;
}
void launch(DataElement *elem) {
  // Launch kernel
  Kernel<<< 1, 1 >>>(elem);
  cudaDeviceSynchronize();
}
int main(void)
{
  DataElement *e;
  // Allocate unified memory on CPU and GPU
  cudaMallocManaged((void**)&e, sizeof(DataElement));
  e->value = 10;
  // Allocate unified memory on CPU and GPU
  cudaMallocManaged((void**)&(e->name), sizeof(char) *
     (strlen("hello") + 1) );
  strcpy(e->name, "hello");
  launch(e);
  printf("On host: name=%s, value=%d\n", e->name, e->value);
  cudaFree(e->name);
  cudaFree(e);
  cudaDeviceReset();
}
 

Note these key points:

  • cudaMallocManaged(): allocate unified memory, accessible from both the CPU and the GPU through a single pointer
  • Kernel<<< 1, 1 >>>(elem) followed by cudaDeviceSynchronize(): run the kernel and wait for it to finish before the host reads the results

With UM, memory is allocated once in a single address space shared by the CPU and GPU and managed with a single pointer.  Note how the separate malloc() and cudaMalloc() calls are condensed into single calls to cudaMallocManaged().  Furthermore, explicit cudaMemcpy() data transfers between CPU and GPU are eliminated, as the CUDA runtime handles these transfers automatically in the background. Collectively these changes simplify code development, code maintenance, and data management.
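
As a further illustration, here is a minimal sketch that applies the same Unified Memory pattern to an array: a single cudaMallocManaged() allocation is initialized on the host, incremented by a kernel on the device, and read back on the host after cudaDeviceSynchronize(). This is my own example rather than part of NVIDIA’s template, and the kernel and variable names (increment, data, n) are purely illustrative.

#include <stdio.h>

__global__
void increment(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;                     // each thread updates one element
}

int main(void)
{
  const int n = 1024;
  int *data;
  // Single allocation, visible to both host and device
  cudaMallocManaged((void**)&data, n * sizeof(int));
  for (int i = 0; i < n; i++) data[i] = i;     // initialize on the host
  // Launch on the device; the runtime migrates the data as needed
  increment<<< (n + 255) / 256, 256 >>>(data, n);
  cudaDeviceSynchronize();                     // wait before reading on the host
  printf("data[0]=%d, data[%d]=%d\n", data[0], n - 1, data[n - 1]);
  cudaFree(data);
  cudaDeviceReset();
  return 0;
}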

As software project managers, we like UM for the productivity enhancements it provides for our software development teams.  It improves software quality, reduces coding time, effort and cost, and enhances overall performance. As software engineers, we like UM because of the reduced coding effort and the fact that we can focus our time and effort on writing CUDA kernel code, where all the parallel performance comes from, instead of spending it on memory management tasks.  Unified Memory is a major step forward in GPU programming.

Posted in CUDA, CUDA TUTORIALS, GPU (CUDA), PARALLEL

Building VTK with Visual Studio 2013

Posted by Hemprasad Y. Badgujar on April 30, 2015


Building VTK5 with Visual Studio

Download

  1. Download VTK 5.10.1 (VTK-5.10.1.zip) and unzip it to C:\VTK-5.10.1.
    http://www.vtk.org/VTK/resources/software.html#previous
    https://github.com/Kitware/VTK/tree/v5.10.1

CMake

  1. Specify the source code directory and the directory where the solution files will be generated.
    • Where is the source code: C:\VTK-5.10.1
    • Where to build the binaries: C:\VTK-5.10.1\build
  2. Press [Configure] and select the target Visual Studio version.
  3. Set the following options.
    • BUILD_SHARED_LIBS ☑ (check)
    • BUILD_TESTING ☐ (uncheck)
    • CMAKE_CONFIGURATION_TYPES Debug;Release
    • CMAKE_INSTALL_PREFIX C:\Program Files\VTK (or C:\Program Files (x86)\VTK)
  4. Press [Add Entry] and add the following setting.
    Name: CMAKE_DEBUG_POSTFIX
    Type: STRING
    Value: -gd

    * This string is appended to the names of the files generated by Debug builds.

  5. Press [Generate] to output the solution files.

Build

  1. Start Visual Studio with administrator privileges and open the VTK solution file (C:\VTK-5.10.1\build\VTK.sln).
    (If Visual Studio is not started with administrator privileges, the INSTALL step will fail.)
  2. Modify the source code as follows.
    • vtkOStreamWrapper.cxx, line 60:

      //VTKOSTREAM_OPERATOR(ostream&);
      vtkOStreamWrapper& vtkOStreamWrapper::operator << (ostream& a) {
        this->ostr << (void *)&a;
        return *this;
      }

    • vtkEnSightGoldBinaryReader.cxx – change the read checks as follows:
      Line 3925:  if (this->IFile->read(result, 80).fail())
      Line 3944:  if (this->IFile->read(dummy, 8).fail())
      Lines 4001, 4025, 4048, 4072, 4095, 4119:  if (this->IFile->read(dummy, 4).fail())
      Line 4008:  if (this->IFile->read((char*)result, sizeof(int)).fail())
      Line 4055:  if (this->IFile->read((char*)result, sizeof(int)*numInts).fail())
      Line 4102:  if (this->IFile->read((char*)result, sizeof(float)*numFloats).fail())
    • Add #include <algorithm> at the indicated line in each of the following files:
      vtkConvexHull2D.cxx (line 31), vtkAdjacencyMatrixToEdgeTable.cxx (line 31), vtkNormalizeMatrixVectors.cxx (line 30), vtkPairwiseExtractHistogram2D.cxx (line 39), vtkControlPointsItem.cxx (line 35), vtkPiecewisePointHandleItem.cxx (line 31), vtkParallelCoordinatesRepresentation.cxx (line 83)
  3. Build VTK (ALL_BUILD).
    1. Set the solution configuration (Debug or Release).
    2. Select the ALL_BUILD project in Solution Explorer.
    3. Choose [Build] > [Build Solution] to build VTK.
  4. Install VTK (INSTALL).
    1. Select the INSTALL project in Solution Explorer.
    2. Choose [Build] > [Project Only] > [Build Only INSTALL] to install VTK. The necessary files are copied to the output destination specified by CMAKE_INSTALL_PREFIX.

Environment Variable

  1. Create an environment variable VTK_ROOT and set it to the VTK path (C:\Program Files\VTK).
  2. Add %VTK_ROOT%\bin; to the Path environment variable.

Building VTK6 with Visual Studio

Download

  1. Download VTK 6.1.0 (VTK-6.1.0.zip) and unzip it to C:\VTK-6.1.0.
    http://www.vtk.org/VTK/resources/software.html#latestcand
    https://github.com/Kitware/VTK/tree/v6.1.0

CMake

  1. Specify the source code directory and the directory where the solution files will be generated.
    • Where is the source code: C:\VTK-6.1.0
    • Where to build the binaries: C:\VTK-6.1.0\build
  2. Press [Configure] and select the target Visual Studio version.
  3. Set the following options.
    • BUILD_SHARED_LIBS ☑ (check)
    • BUILD_TESTING ☐ (uncheck)
    • CMAKE_CONFIGURATION_TYPES Debug;Release
    • CMAKE_INSTALL_PREFIX C:\Program Files\VTK (or C:\Program Files (x86)\VTK)
  4. Press [Add Entry] and add the following setting.
    Name: CMAKE_DEBUG_POSTFIX
    Type: STRING
    Value: -gd

    * This string is appended to the names of the files generated by Debug builds.

  5. Press [Generate] to output the solution files.

Build

  1. Start Visual Studio with administrator privileges and open the VTK solution file (C:\VTK-6.1.0\build\VTK.sln).
    (If Visual Studio is not started with administrator privileges, the INSTALL step will fail.)
  2. Build VTK (ALL_BUILD).
    1. Set the solution configuration (Debug or Release).
    2. Select the ALL_BUILD project in Solution Explorer.
    3. Choose [Build] > [Build Solution] to build VTK.
  3. Install VTK (INSTALL).
    1. Select the INSTALL project in Solution Explorer.
    2. Choose [Build] > [Project Only] > [Build Only INSTALL] to install VTK. The necessary files are copied to the output destination specified by CMAKE_INSTALL_PREFIX.

Environment Variable

  1. Create an environment variable VTK_DIR and set it to the VTK path (C:\Program Files\VTK).
  2. Add %VTK_DIR%\bin; to the Path environment variable.

Building VTK6 + Qt5 with Visual Studio

Download

  1. Download VTK 6.1.0 (VTK-6.1.0.zip) and unzip it to C:\VTK-6.1.0.
    http://www.vtk.org/VTK/resources/software.html#latestcand
    https://github.com/Kitware/VTK/tree/v6.1.0
  2. Download and install Qt 5.4.0 with OpenGL (to C:\Qt).
    http://www.qt.io/download-open-source/#

    • Qt 5.4.0 for Windows 32-bit (VS 2013, OpenGL, 694 MB)
      (qt-opensource-windows-x86-msvc2013_opengl-5.4.0.exe)
    • Qt 5.4.0 for Windows 64-bit (VS 2013, OpenGL, 709 MB)
      (qt-opensource-windows-x86-msvc2013_64_opengl-5.4.0.exe)

CMake

  1. Specify the source code directory and the directory where the solution files will be generated.
    • Where is the source code: C:\VTK-6.1.0
    • Where to build the binaries: C:\VTK-6.1.0\build
  2. Press [Configure] and select the target Visual Studio version.
  3. Set the following options.
    (Checking [Grouped] and [Advanced] makes the entries easier to find.) * For Win32 specify msvc2013_opengl; for x64 specify msvc2013_64_opengl.

    Ungrouped Entries

    • Qt5Core_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Core
    • Qt5Designer_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Designer
    • Qt5Gui_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Gui
    • Qt5Network_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Network
    • Qt5OpenGL_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5OpenGL
    • Qt5Sql_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Sql
    • Qt5WebKit_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5WebKit
    • Qt5WebKitWidgets_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5WebKitWidgets
    • Qt5Widgets_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Widgets
    • Qt5Xml_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Xml

    BUILD

    • BUILD_SHARED_LIBS ☑ (check)
    • BUILD_TESTING ☐ (uncheck)

    CMAKE

    • CMAKE_CONFIGURATION_TYPES Debug;Release
    • CMAKE_INSTALL_PREFIX C:\Program Files\VTK (or C:\Program Files (x86)\VTK)

    Module

    • Module_vtkGUISupportQt ☑ (check)
    • Module_vtkGUISupportQtOpenGL ☑ (check)
    • Module_vtkGUISupportQtSQL ☑ (check)
    • Module_vtkGUISupportQtWebkit ☑ (check)
    • Module_vtkRenderingQt ☑ (check)
    • Module_vtkViewsQt ☑ (check)

    OPENGL

    • OPENGL_gl_LIBRARY opengl
    • OPENGL_glu_LIBRARY glu32

    QT

    • QT_MKSPECS_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/mkspecs/win32-msvc2013
    • QT_QMAKE_EXECUTABLE C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/bin/qmake.exe
    • QT_QTCORE_LIBRARY_DEBUG C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/Qt5Cored.lib
    • QT_QTCORE_LIBRARY_RELEASE C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/Qt5Core.lib

    VTK

    • VTK_Group_Qt ☑ (check)
    • VTK_INSTALL_QT_PLUGIN_DIR ${CMAKE_INSTALL_PREFIX}/${VTK_INSTALL_QT_DIR}
    • VTK_QT_VERSION 5
  4. Press [Add Entry] and add the following settings.
    Name: CMAKE_PREFIX_PATH
    Type: PATH
    Value: C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x64
    (or C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x86)

    * For Visual Studio 2013 the Windows Kits path is 8.1\Lib\winv6.3; for Visual Studio 2012 specify 8.0\Lib\Win8.

    Name: CMAKE_DEBUG_POSTFIX
    Type: STRING
    Value: -gd

    * This string is appended to the names of the files generated by Debug builds.

  5. Press [Generate] to output the solution files.

Build

  1. Start Visual Studio with administrator privileges and open the VTK solution file (C:\VTK-6.1.0\build\VTK.sln).
    (If Visual Studio is not started with administrator privileges, the INSTALL step will fail.)
  2. Build VTK (ALL_BUILD).
    1. Set the solution configuration (Debug or Release).
    2. Select the ALL_BUILD project in Solution Explorer.
    3. Choose [Build] > [Build Solution] to build VTK.
  3. Install VTK (INSTALL).
    1. Select the INSTALL project in Solution Explorer.
    2. Choose [Build] > [Project Only] > [Build Only INSTALL] to install VTK. The necessary files are copied to the output destination specified by CMAKE_INSTALL_PREFIX.

Environment Variable

  1. Create an environment variable VTK_DIR and set it to the VTK path (C:\Program Files\VTK).
  2. Create an environment variable QTDIR and set it to the Qt path (C:\Qt\Qt5.4.0\5.4\msvc2013_64_opengl\ or C:\Qt\Qt5.4.0\5.4\msvc2013_opengl\).
  3. Add ;%VTK_DIR%\bin;%QTDIR%\bin to the Path environment variable.

Posted in CLOUD, Computer Languages, Computer Softwares, Computer Vision, Computing Technology, CUDA, GPU (CUDA), OpenCV

Project Template in Visual Studio

Posted by Hemprasad Y. Badgujar on March 5, 2015


 


Introduction

This article describes the step-by-step process of creating a project template in Visual Studio 2012 and a VSIX installer that deploys the project template. Each step contains an image snapshot that helps the reader stay focused.

Background

A number of predefined project and project item templates are installed when you install Visual Studio. You can use one of the many project templates to create the basic project container and a preliminary set of items for your application, class, control, or library. You can also use one of the many project item templates to create, for example, a Windows Forms application or a Web Forms page to customize as you develop your application.

You can create custom project templates and project item templates and have these templates appear in the New Project and Add New Item dialog boxes. The article describes the complete process of creating and deploying the project template.

Using the Code

Here, I have taken a very simple example which contains nearly no code but this can be extended as per your needs.

Create Project Template

First of all, create the piece (project or item) that resembles the thing you want to have created when starting from the template we are going to build.

Then, export the template (we are going to use the exported template as a shortcut to build our Visual Studio template package):

Visual Studio Project Templates

We are creating a project template here.

Fill all the required details:

A zip file should get created:

Creating Visual Studio Package Project

To use VSIX projects, you need to install the Visual Studio 2012 VSSDK.

Download the Visual Studio 2012 SDK.

You should see the new project template “Visual Studio Package” after installing the SDK.

Select C# as our project template belongs to C#.

Provide details:

Currently, we don’t need unit test project but they are good to have.

In the solution, double-click the manifest, so designer opens.

Fill in all the tabs. The most important is Assets. Here you give the path of our project template (DummyConsoleApplication.zip).

As a verification step, build the solution; you should see a .vsix file being generated after its dependency project:

Installing the Extension

The project template is located under the “Visual C#” node.

Uninstalling the Project Template

References

Posted in .Net Platform, C, Computer Languages, Computer Software, Computer Softwares, Computer Vision, CUDA, GPU (CUDA), Installation, OpenMP, PARALLEL

Professional ways of tracking GPU memory leakage

Posted by Hemprasad Y. Badgujar on January 25, 2015


Depending on what I am doing and what I need to track, trace, and profile, I utilise all 4 packages above. They also have the added benefit of being a: free; b: well maintained; c: free; d: regularly updated; e: free.

In case you hadn’t guessed, I like the free part. :)

In regard to object management, I would recommend an old C++ coding principle: as soon as you create an object, add the line that deletes it; every new should always (eventually) have a matching delete. That way you know that you are destroying the objects you create. However, this will not save you from orphaned-memory-block leaks, where you change where pointers are pointing, for example:

myclass* firstInstance = new myclass();
myclass* secondInstance = new myclass();
firstInstance = secondInstance;
delete firstInstance;
delete secondInstance;

You will now have created a small memory leak where the data for the real firstInstance is no longer pointed to by any pointer. This is very hard to detect when it happens in a large code-base, and more common than it should be.

Generally, these are the pairings you need to be aware of to ensure you properly dispose of all your objects:

new -> delete
new[] -> delete[]
malloc() -> free() // or you can use realloc(0) instead of free()
calloc() -> free() // or you can use realloc(0) instead of free()
realloc(nonzero) -> free() // or you can use realloc(0) instead of free()

If you are coming to C++ from a language with garbage collection, it can take a while to get used to, but it quickly becomes habit. :)
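
One way to make these pairings automatic, and to apply the same discipline to GPU allocations, is RAII. The sketch below is my own illustration (not from the original post): it uses std::unique_ptr for a host object and a small hypothetical DeviceBuffer wrapper whose destructor always calls cudaFree(), so a cudaMalloc() can never end up orphaned the way a raw pointer can.

#include <cuda_runtime.h>
#include <cstdio>
#include <memory>

// Minimal RAII wrapper around a device allocation: the destructor always
// releases the memory, even on early returns or exceptions.
class DeviceBuffer {
public:
  explicit DeviceBuffer(size_t bytes) { cudaMalloc(&ptr_, bytes); }
  ~DeviceBuffer() { cudaFree(ptr_); }
  DeviceBuffer(const DeviceBuffer&) = delete;            // no accidental copies,
  DeviceBuffer& operator=(const DeviceBuffer&) = delete; // which would double-free
  void* get() const { return ptr_; }
private:
  void* ptr_ = nullptr;
};

int main() {
  std::unique_ptr<int> hostValue(new int(42));  // host object, freed automatically
  DeviceBuffer deviceValue(sizeof(int));        // device buffer, freed automatically

  cudaMemcpy(deviceValue.get(), hostValue.get(), sizeof(int), cudaMemcpyHostToDevice);
  std::printf("copied %d to the device\n", *hostValue);
  return 0;  // both allocations are released here, with no explicit delete/cudaFree
}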

Posted in C, Computer Languages, Computer Vision, Computing Technology, CUDA

Step By Step Installing Visual Studio Professional 2012

Posted by Hemprasad Y. Badgujar on January 5, 2015


1. Mount the .iso file and click on the “Setup.exe” file. Agree to the terms and conditions and click the “Next” button.

2. Select the required features from the list and click the “Install” button. It will take around 7.90 GB of space if all features are installed.

3. Setup will create a “System Restore Point” before starting the installation process.

4. Once that is done, it will start the installation process.

5. Partway through, setup will ask you to restart the system. Click the “Restart” button to restart your system.

6. Setup will resume once the system is restarted.

7. The installation will now take some time, around 20-30 minutes.

8. Once setup is completed, you can launch Visual Studio.

Posted in Computer Languages, Computer Network & Security, Computer Softwares, CUDA, GPU (CUDA), Installation, PARALLEL, Windows OS

Building a Beowulf cluster with Ubuntu

Posted by Hemprasad Y. Badgujar on December 25, 2014



The beowulf cluster article on Wikipedia describes the Beowulf cluster as follows:

“A Beowulf cluster is a group of what are normally identical, commercially available computers, which are running a Free and Open Source Software (FOSS), Unix-like operating system, such as BSD, GNU/Linux, or Solaris. They are networked into a small TCP/IP LAN, and have libraries and programs installed which allow processing to be shared among them.” – Wikipedia, Beowulf cluster, 28 February 2011.

This means a Beowulf cluster can be easily built with “off the shelf” computers running GNU/Linux in a simple home network. So building a Beowulf like cluster is within reach if you already have a small TCP/IP LAN at home with desktop computers running Ubuntu Linux (or any other Linux distribution).

There are many ways to install and configure a cluster. There is OSCAR(1), which allows any user, regardless of experience, to easily install a Beowulf type cluster on supported Linux distributions. It installs and configures all required software according to user input.

There is also the NPACI Rocks toolkit(2), which incorporates the latest Red Hat distribution and cluster-specific software. Rocks addresses the difficulties of deploying manageable clusters. Rocks makes clusters easy to deploy, manage, upgrade and scale.

Both of the aforementioned toolkits for deploying clusters were made to be easy to use and require minimal expertise from the user. But the purpose of this tutorial is to explain how to manually build a Beowulf-like cluster. Basically, the toolkits mentioned above do most of the installing and configuring for you, rendering the learning experience moot. So it would not make much sense to use any of these toolkits if you want to learn the basics of how a cluster works. This tutorial therefore explains how to manually build a cluster, by manually installing and configuring the required tools. In this tutorial I assume that you have some basic knowledge of the Linux-based operating system and know your way around the command line. I tried however to make this as easy as possible to follow. Keep in mind that this is new territory for me as well and there’s a good chance that this tutorial shows methods that may not be the best.

I myself started off with the clustering tutorial from SCFBio which gives a great explanation on how to build a simple Beowulf cluster.(3) It describes the prerequisites for building a Beowulf cluster and why these are needed.

Contents

  • What’s a Beowulf Cluster, exactly?
  • Building a virtual Beowulf Cluster
  • Building the actual cluster
  • Configuring the Nodes
    • Add the nodes to the hosts file
    • Defining a user for running MPI jobs
    • Install and setup the Network File System
    • Setup passwordless SSH for communication between nodes
    • Setting up the process manager
      • Setting up Hydra
      • Setting up MPD
  • Running jobs on the cluster
    • Running MPICH2 example applications on the cluster
    • Running bioinformatics tools on the cluster
  • Credits
  • References

What’s a Beowulf Cluster, exactly?

The typical setup of a beowulf cluster

The definition I cited before is not very complete. The book “Engineering a Beowulf-style Compute Cluster”(4) by Robert G. Brown gives a more detailed answer to this question (if you’re serious about this, this book is a must read). According to this book, there is an accepted definition of a beowulf cluster. This book describes the true beowulf as a cluster of computers interconnected with a network with the following characteristics:

  1. The nodes are dedicated to the beowulf cluster.
  2. The network on which the nodes reside are dedicated to the beowulf cluster.
  3. The nodes are Mass Market Commercial-Off-The-Shelf (M2COTS) computers.
  4. The network is also a COTS entity.
  5. The nodes all run open source software.
  6. The resulting cluster is used for High Performance Computing (HPC).

Building a virtual Beowulf Cluster

It is not a bad idea to start by building a virtual cluster using virtualization software like VirtualBox. I simply used my laptop running Ubuntu as the master node, and two virtual computing nodes running Ubuntu Server Edition were created in VirtualBox. The virtual cluster allows you to build and test the cluster without the need for the extra hardware. However, this method is only meant for testing and not suited if you want increased performance.

When it comes to configuring the nodes for the cluster, building a virtual cluster is practically the same as building a cluster with actual machines. The difference is that you don’t have to worry about the hardware as much. You do have to properly configure the virtual network interfaces of the virtual nodes. They need to be configured in a way that the master node (e.g. the computer on which the virtual nodes are running) has network access to the virtual nodes, and vice versa.

Building the actual cluster

It is good practice to first build and test a virtual cluster as described above. If you have some spare computers and network parts lying around, you can use those to build the actual cluster. The nodes (the computers that are part of the cluster) and the network hardware are the usual kind available to the general public (beowulf requirements 3 and 4). In this tutorial we’ll use the Ubuntu operating system to power the machines and open source software to allow for distributed parallel computing (beowulf requirement 5). We’ll test the cluster with cluster-specific versions of bioinformatics tools that perform some sort of heavy calculations (beowulf requirement 6).

The cluster consists of the following hardware parts:

  • Network
  • Server / Head / Master Node (common names for the same machine)
  • Compute Nodes
  • Gateway

All nodes (including the master node) run the following software:

I will not focus on setting up the network (parts) in this tutorial. I assume that all nodes are part of the same private network and that they are properly connected.

Configuring the Nodes

Some configurations need to be made to the nodes. I’ll walk you through them one by one.

Add the nodes to the hosts file

It is easier if the nodes can be accessed with their host name rather than their IP address. It will also make things a lot easier later on. To do this, add the nodes to the hosts file of all nodes.(8) (9) All nodes should have a static local IP address set. I won’t go into details here as this is outside the scope of this tutorial. For this tutorial I assume that all nodes are already properly configured to have a static local IP address.

Edit the hosts file (sudo vim /etc/hosts) like below and remember that you need to do this for all nodes,

127.0.0.1	localhost
192.168.1.6	master
192.168.1.7	node1
192.168.1.8	node2
192.168.1.9	node3

Make sure it doesn’t look like this:

127.0.0.1	localhost
127.0.1.1	master
192.168.1.7	node1
192.168.1.8	node2
192.168.1.9	node3

neither like this:

127.0.0.1	localhost
127.0.1.1	master
192.168.1.6	master
192.168.1.7	node1
192.168.1.8	node2
192.168.1.9	node3

Otherwise other nodes will try to connect to localhost when trying to reach the master node.

Once saved, you can use the host names to connect to the other nodes,

$ ping -c 3 master
PING master (192.168.1.6) 56(84) bytes of data.
64 bytes from master (192.168.1.6): icmp_req=1 ttl=64 time=0.606 ms
64 bytes from master (192.168.1.6): icmp_req=2 ttl=64 time=0.552 ms
64 bytes from master (192.168.1.6): icmp_req=3 ttl=64 time=0.549 ms

--- master ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.549/0.569/0.606/0.026 ms

Try this with different nodes on different nodes. You should get a response similar to the above.

In this tutorial, master is used as the master node. Once the cluster has been set up, the master node will be used to start (spawn) jobs on the cluster. The compute nodes are node1 to node3 and will thus execute the jobs.

Defining a user for running MPI jobs

Several tutorials explain that all nodes need a separate user for running MPI jobs.(8) (9) (6) I haven’t found a clear explanation to why this is necessary, but there could be several reasons:

  1. There’s no need to remember different user names and passwords if all nodes use the same username and password.
  2. MPICH2 can use SSH for communication between nodes. Passwordless login with the use of authorized keys only works if the username matches the one set for passwordless login. You don’t have to worry about this if all nodes use the same username.
  3. The NFS directory can be made accessible for the MPI users only. The MPI users all need to have the same user ID for this to work.
  4. The separate user might require special permissions.

The command below creates a new user with username “mpiuser” and user ID 999. Giving a user ID below 1000 prevents the user from showing up in the login screen for desktop versions of Ubuntu. It is important that all MPI users have the same username and user ID. The user IDs for the MPI users need to be the same because we give access to the MPI user on the NFS directory later. Permissions on NFS directories are checked with user IDs. Create the user like this,

$ sudo adduser mpiuser --uid 999

You may use a different user ID (as long as it is the same for all MPI users). Enter a password for the user when prompted. It’s recommended to give the same password on all nodes so you have to remember just one password. The above command should also create a new directory /home/mpiuser. This is the home directory for user mpiuser and we will use it to execute jobs on the cluster.

Install and setup the Network File System

Files and programs used for MPI jobs (jobs that are run in parallel on the cluster) need to be available to all nodes, so we give all nodes access to a part of the file system on the master node. Network File System (NFS) enables you to mount part of a remote file system so you can access it as if it is a local directory. To install NFS, run the following command on the master node:

master:~$ sudo apt-get install nfs-kernel-server

And in order to make it possible to mount a Network File System on the compute nodes, the nfs-common package needs to be installed on all compute nodes:

$ sudo apt-get install nfs-common

We will use NFS to share the MPI user’s home directory (i.e. /home/mpiuser) with the compute nodes. It is important that this directory is owned by the MPI user so that all MPI users can access this directory. But since we created this home directory with the adduser command earlier, it is already owned by the MPI user,

master:~$ ls -l /home/ | grep mpiuser
drwxr-xr-x   7 mpiuser mpiuser  4096 May 11 15:47 mpiuser

If you use a different directory that is not currently owned by the MPI user, you must change its ownership as follows,

master:~$ sudo chown mpiuser:mpiuser /path/to/shared/dir

Now we share the /home/mpiuser directory of the master node with all other nodes. For this the file /etc/exports on the master node needs to be edited. Add the following line to this file,

/home/mpiuser *(rw,sync,no_subtree_check)

You can read the man page to learn more about the exports file (man exports). After the first install you may need to restart the NFS daemon:

master:~$ sudo service nfs-kernel-server restart

This also exports the directories listed in /etc/exports. In the future, when the /etc/exports file is modified, you need to run the following command to export the directories listed in /etc/exports:

master:~$ sudo exportfs -a

The /home/mpiuser directory should now be shared through NFS. In order to test this, you can run the following command from a compute node:

$ showmount -e master

In this case this should print the path /home/mpiuser. All data files and programs that will be used for running an MPI job must be placed in this directory on the master node. The other nodes will then be able to access these files through NFS.

The firewall is enabled by default on Ubuntu and will block access when a client tries to access an NFS shared directory. So you need to add a rule with UFW (a tool for managing the firewall) to allow access from a specific subnet. If the IP addresses in your network have the format 192.168.1.*, then 192.168.1.0 is the subnet. Run the following command to allow incoming access from a specific subnet,

master:~$ sudo ufw allow from 192.168.1.0/24

You need to run this on the master node and replace “192.168.1.0” by the subnet for your network.

You should then be able to mount master:/home/mpiuser on the compute nodes. Run the following commands to test this,

node1:~$ sudo mount master:/home/mpiuser /home/mpiuser
node2:~$ sudo mount master:/home/mpiuser /home/mpiuser
node3:~$ sudo mount master:/home/mpiuser /home/mpiuser

If this fails or hangs, restart the compute node and try again. If the above command runs without a problem, you should test whether /home/mpiuser on any compute node actually has the content from /home/mpiuser of the master node. You can test this by creating a file in master:/home/mpiuser and checking if that same file appears in node*:/home/mpiuser (where node* is any compute node).

If mounting the NFS shared directory works, we can make it so that the master:/home/mpiuser directory is automatically mounted when the compute nodes are booted. For this the file /etc/fstab needs to be edited. Add the following line to the fstab file of all compute nodes,

master:/home/mpiuser /home/mpiuser nfs

Again, read the man page of fstab if you want to know the details (man fstab). Reboot the compute nodes and list the contents of the/home/mpiuser directory on each compute node to check if you have access to the data on the master node,

$ ls /home/mpiuser

This should list the files from the /home/mpiuser directory of the master node. If it doesn’t immediately, wait a few seconds and try again. It might take some time for the system to initialize the connection with the master node.

Setup passwordless SSH for communication between nodes

For the cluster to work, the master node needs to be able to communicate with the compute nodes, and vice versa.(8) Secure Shell (SSH) is usually used for secure remote access between computers. By setting up passwordless SSH between the nodes, the master node is able to run commands on the compute nodes. This is needed to run the MPI daemons on the available compute nodes.

First install the SSH server on all nodes:

$ sudo apt-get install ssh

Now we need to generate an SSH key for all MPI users on all nodes. The SSH key is by default created in the user’s home directory. Remember that in our case the MPI user’s home directory (i.e. /home/mpiuser) is actually the same directory for all nodes: /home/mpiuser on the master node. So if we generate an SSH key for the MPI user on one of the nodes, all nodes will automatically have an SSH key. Let’s generate an SSH key for the MPI user on the master node (but any node should be fine),

$ su mpiuser
$ ssh-keygen

When asked for a passphrase, leave it empty (hence passwordless SSH).

When done, all nodes should have an SSH key (the same key, actually). The master node needs to be able to automatically login to the compute nodes. To enable this, the public SSH key of the master node needs to be added to the list of known hosts (this is usually the file ~/.ssh/authorized_keys) of all compute nodes. But this is easy, since all SSH key data is stored in one location: /home/mpiuser/.ssh/ on the master node. So instead of having to copy master’s public SSH key to all compute nodes separately, we just have to copy it to master’s own authorized_keys file. There is a command to push the public SSH key of the currently logged in user to another computer. Run the following commands on the master node as user “mpiuser”,

mpiuser@master:~$ ssh-copy-id localhost

Master’s own public SSH key should now be copied to /home/mpiuser/.ssh/authorized_keys. But since /home/mpiuser/ (and everything under it) is shared with all nodes via NFS, all nodes should now have master’s public SSH key in the list of known hosts. This means that we should now be able to login on the compute nodes from the master node without having to enter a password,

mpiuser@master:~$ ssh node1
mpiuser@node1:~$ echo $HOSTNAME
node1

You should now be logged in on node1 via SSH. Make sure you’re able to login to the other nodes as well.

Setting up the process manager

In this section I’ll walk you through the installation of MPICH and configuring the process manager. The process manager is needed to spawn and manage parallel jobs on the cluster. The MPICH wiki explains this nicely:

“Process managers are basically external (typically distributed) agents that spawn and manage parallel jobs. These process managers communicate with MPICH processes using a predefined interface called as PMI (process management interface). Since the interface is (informally) standardized within MPICH and its derivatives, you can use any process manager from MPICH or its derivatives with any MPI application built with MPICH or any of its derivatives, as long as they follow the same wire protocol.” – Frequently Asked Questions – Mpich.

The process manager is included with the MPICH package, so start by installing MPICH on all nodes with,

$ sudo apt-get install mpich2

MPD was the traditional default process manager for MPICH up to the 1.2.x release series. Starting with the 1.3.x series, Hydra is the default process manager.(10) So depending on the version of MPICH you are using, you should use either MPD or Hydra for process management. You can check the MPICH version by running mpich2version in the terminal. Then follow either the steps for MPD or Hydra in the following subsections.

Setting up Hydra

This section explains how to configure the Hydra process manager and is for users of the MPICH 1.3.x series and up. In order to set up Hydra, we need to create one file on the master node. This file contains all the host names of the compute nodes.(11) You can create this file anywhere you want, but for simplicity we create it in the MPI user’s home directory,

mpiuser@master:~$ cd ~
mpiuser@master:~$ touch hosts

In order to be able to send out jobs to the other nodes in the network, add the host names of all compute nodes to the hosts file,

node1
node2
node3

You may choose to include master in this file, which would mean that the master node would also act as a compute node. The hosts file only needs to be present on the node that will be used to start jobs on the cluster, usually the master node. But because the home directory is shared among all nodes, all nodes will have the hosts file. For more details about setting up Hydra see this page: Using the Hydra Process Manager.

Setting up MPD

This section explains how to configure the MPD process manager and is for users of MPICH 1.2.x series and down. Before we can start any parallel jobs with MPD, we need to create two files in the home directory of the MPI user. Make sure you’re logged in as the MPI user and create the following two files in the home directory,

mpiuser@master:~$ cd ~
mpiuser@master:~$ touch mpd.hosts
mpiuser@master:~$ touch .mpd.conf

In order to be able to send out jobs to the other nodes in the network, add the host names of all compute nodes to the mpd.hosts file,

node1
node2
node3

You may choose to include master in this file, which would mean that the master node would also act as a compute node. The mpd.hosts file only needs to be present on the node that will be used to start jobs on the cluster, usually the master node. But because the home directory is shared among all nodes, all nodes will have the mpd.hosts file.

The configuration file .mpd.conf (mind the dot at the beginning of the file name) must be accessible to the MPI user only (in fact, MPD refuses to work if you don’t do this),

mpiuser@master:~$ chmod 600 .mpd.conf

Then add a line with a secret passphrase to the configuration file,

secretword=random_text_here

The secretword can be set to any random passphrase. You may want to use a random password generator to generate a passphrase.

All nodes need to have the .mpd.conf file in the home directory of mpiuser with the same passphrase. But this is automatically the case since /home/mpiuser is shared through NFS.

The nodes should now be configured correctly. Run the following command on the master node to start the mpd daemon on all nodes,

mpiuser@master:~$ mpdboot -n 3

Replace “3” by the number of compute nodes in your cluster. If this was successful, all nodes should now be running the mpd daemon. Run the following command to check if all nodes entered the ring (and are thus running the mpd daemon),

mpiuser@master:~$ mpdtrace -l

This command should display a list of all nodes that entered the ring. Nodes listed here are running the mpd daemon and are ready to accept MPI jobs. This means that your cluster is now set up and ready to rock!

Running jobs on the cluster

Running MPICH2 example applications on the cluster

The MPICH2 package comes with a few example applications that you can run on your cluster. To obtain these examples, download the MPICH2 source package from the MPICH website and extract the archive to a directory. The directory to where you extracted the MPICH2 package should contain an “examples” directory. This directory contains the source codes of the example applications. You need to compile these yourself.

$ sudo apt-get build-dep mpich2
$ wget http://www.mpich.org/static/downloads/1.4.1/mpich2-1.4.1.tar.gz
$ tar -xvzf mpich2-1.4.1.tar.gz
$ cd mpich2-1.4.1/
$ ./configure
$ make
$ cd examples/

The example application cpi is compiled by default, so you can find the executable in the “examples” directory. Optionally you can build the other examples as well,

$ make hellow
$ make pmandel
...

Once compiled, place the executables of the examples somewhere inside the /home/mpiuser directory on the master node. It’s common practice to place executables in a “bin” directory, so create the directory /home/mpiuser/bin and place the executables in this directory. The executables should now be available on all nodes.

We’re going to run an MPI job using the example application cpi. Make sure you’re logged in as the MPI user on the master node,

$ su mpiuser

And run the job like this,

When using MPD:

mpiuser@master:~$ mpiexec -n 3 /home/mpiuser/bin/cpi

When using Hydra:

mpiuser@master:~$ mpiexec -f hosts -n 3 /home/mpiuser/bin/cpi

Replace “3” by the number of nodes on which you want to run the job. When using Hydra, the -f switch should point to the file containing the host names. When using MPD, it’s important that you use the absolute path to the executable in the above command, because only then does MPD know where to look for the executable on the compute nodes. The absolute path used should thus be correct for all nodes. But since /home/mpiuser is the NFS shared directory, all nodes have access to this path and the files within it.

The example application cpi is useful for testing because it shows on which nodes each sub process is running and how long it took to run the job. This application is however not useful to test performance because this is a very small application which takes only a few milliseconds to run. As a matter of fact, I don’t think it actually computes pi. If you look at the source, you’ll find that the value of pi is hard coded into the program.
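
For a test job that really does approximate pi, the following minimal MPI program (my own sketch, not part of the MPICH distribution) numerically integrates 4/(1+x^2) over [0,1] and splits the intervals across the ranks. It can be compiled with something like mpicc pi.c -o /home/mpiuser/bin/pi and run with the same mpiexec commands shown above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, size;
  long i;
  const long n = 1000000;                  /* number of intervals */
  double h, x, local = 0.0, pi = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  h = 1.0 / (double)n;
  for (i = rank; i < n; i += size) {       /* each rank handles every size-th interval */
    x = h * ((double)i + 0.5);
    local += 4.0 / (1.0 + x * x);
  }
  local *= h;

  /* sum the partial results on rank 0 */
  MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("pi is approximately %.16f\n", pi);

  MPI_Finalize();
  return 0;
}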

Running bioinformatics tools on the cluster

By running actual bioinformatics tools you can give your cluster a more realistic test run. There are several parallel implementations of bioinformatics tools that are based on MPI. There are two that I currently know of: mpiBLAST and ClustalW-MPI.

It would be nice to test mpiBLAST, but because of a compilation issue, I was not able to do so. After some asking around at the mpiBLAST-Users mailing list, I got an answer:

“That problem is caused by a change in GCC version 4.4.X. We don’t have a fix to give out for the issue as yet, but switching to 4.3.X or lower should solve the issue for the time being.”(7)

Basically, I’m using a newer version of the GCC compiler which fails to build mpiBLAST. In order to compile it, I’d have to use an older version. But instructing mpicc to use GCC 4.3 instead requires that MPICH2 be compiled with GCC 4.3. Instead of going through that trouble, I’ve decided to give ClustalW-MPI a try instead.

The MPI implementation of ClustalW is fairly outdated, but it’s good enough to perform a test run on your cluster. Download the source from the website, extract the package, and compile the source. Copy the resulting executable to the /home/mpiuser/bin directory on the master node. Use for example Entrez to search for some DNA/protein sequences and put these in a single FASTA file (the NCBI website can do that for you). Create several FASTA files with multiple sequences to test with. Copy the multi-sequence FASTA files to a data directory inside the NFS shared directory (e.g. /home/mpiuser/data). Then run a job like this,

When using MPD:

mpiuser@master:~$ mpiexec -n 3 /home/mpiuser/bin/clustalw-mpi /home/mpiuser/data/seq_tyrosine.fasta

When using Hydra:

mpiuser@master:~$ mpiexec -f hosts -n 3 /home/mpiuser/bin/clustalw-mpi /home/mpiuser/data/seq_tyrosine.fasta

and let the cluster do the work. Again, notice that we must use absolute paths. You can check if the nodes are actually doing anything by logging into the nodes (ssh node*) and running the top command. This should display a list of running processes with the processes using the most CPU on the top. In this case, you should see the process clustalw-mpi somewhere along the top.

Credits

Thanks to Reza Azimi for mentioning the nfs-common package.

References

  1. OpenClusterGroup. OSCAR.
  2. Philip M. Papadopoulos, Mason J. Katz, and Greg Bruno. NPACI Rocks: Tools and Techniques for Easily Deploying Manageable Linux Clusters. October 2001, Cluster 2001: IEEE International Conference on Cluster Computing.
  3. Supercomputing Facility for Bioinformatics & Computational Biology, IIT Delhi. Clustering Tutorial.
  4. Robert G. Brown. Engineering a Beowulf-style Compute Cluster. 2004. Duke University Physics Department.
  5. Pavan Balaji, et all. MPICH2 User’s Guide, Version 1.3.2. 2011. Mathematics and Computer Science Division Argonne National Laboratory.
  6. Kerry D. Wong. A Simple Beowulf Cluster.
  7. mpiBLAST-Users: unimplemented: inlining failed in call to ‘int fprintf(FILE*, const char*, …)’
  8. Ubuntu Wiki. Setting Up an MPICH2 Cluster in Ubuntu.
  9. Linux.com. Building a Beowulf Cluster in just 13 steps.
  10. wiki.mpich.org. Frequently Asked Questions – Mpich.
  11. wiki.mpich.org. Using the Hydra Process Manager – Mpich.

Posted in CLUSTER, Computer Hardware, Computer Hardwares, Computer Languages, Computer Vision, Computing Technology, CUDA, Free Tools, GPU (CUDA), Linux OS, Mixed, My Research Related, Open CL, OpenCV, OpenMP, OPENMPI, PARALLEL

How to Build a GPU-Accelerated Cluster

Posted by Hemprasad Y. Badgujar on December 22, 2014


Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (“nodes”) connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high performance computing (HPC), NVIDIA GPUs are becoming part of some of the world’s most powerful supercomputers and clusters. The most recent Top 500 list of the world’s fastest supercomputers included nearly 50 supercomputers powered by NVIDIA GPUs, and the current world’s fastest supercomputer, Oak Ridge National Laboratory’s TITAN, utilizes more than 18,000 NVIDIA Kepler GPUs.

In this post I will take you step by step through the process of designing, deploying, and managing a small research prototype GPU cluster for HPC. I will describe all the components needed for a GPU cluster as well as the complete cluster management software stack. The goal is to build a research prototype GPU cluster using all open source and free software and with minimal hardware cost.

I gave a talk on this topic at GTC 2013 (session S3516 – Building Your Own GPU Research Cluster Using Open Source Software Stack). The slides and a recording are available at that link so please check it out!

There are multiple motivating reasons for building a GPU-based research cluster.

  • Get a feel for production systems and performance estimates;
  • Port your applications to GPUs and distributed computing (using CUDA-aware MPI; see the sketch after this list);
  • Tune GPU and CPU load balancing for your application;
  • Use the cluster as development platform;
  • Early experience means increased readiness;
  • The investment is relatively small for a research prototype cluster
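
As a concrete illustration of the CUDA-aware MPI point above, here is a minimal sketch of my own (it assumes an MPI library built with CUDA support, such as a CUDA-aware build of MVAPICH2 or Open MPI): a device pointer is passed directly to MPI_Send/MPI_Recv, so the library can move the data between GPUs without an explicit staging copy through host memory. Run it with at least two ranks, e.g. mpiexec -n 2.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;
  const int n = 1 << 20;
  float *d_buf;                            /* buffer in GPU memory */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  cudaMalloc((void**)&d_buf, n * sizeof(float));
  cudaMemset(d_buf, 0, n * sizeof(float));

  if (rank == 0) {
    /* With a CUDA-aware MPI build, the device pointer can be passed directly. */
    MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d floats directly into GPU memory\n", n);
  }

  cudaFree(d_buf);
  MPI_Finalize();
  return 0;
}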

Figure 1 shows the steps to build a small GPU cluster. Let’s look at the process in more detail.

Steps in building GPU based Clusters
Figure 1: Seven steps to build and test a small research GPU cluster.

1. Choose Your Hardware

There are two steps to choosing the correct hardware.

  1. Node Hardware Details. This is the specification of the machine (node) for your cluster. Each node has the following components.
    • CPU processor from any vendor;
    • A motherboard with the following PCI-express connections:
      • 2x PCIe x16 Gen2/3 connections for Tesla GPUs;
      • 1x PCIe x8 connection for an InfiniBand HCA (host channel adapter) card;
    • 2 available network ports;
    • A minimum of 16-24 GB DDR3 RAM. (It is good to have more RAM in the system).
    • A power-supply unit (SMPS) with ample power rating. The total power supply needed includes power taken by the CPU, GPUs and other components in the system.
    • Secondary storage (HDD / SSD) based on your needs.

    GPU boards are wide enough to cover two physically adjacent PCI-e slots, so make sure that the PCIe x16 and x8 slots are physically separated on the motherboard so that you can fit a minimum of 2 PCI-e x16 GPUs and 1 PCIe x8 network card.

  2. Choose the right form factor for GPUs. Once you decide your machine specs, you should also decide which model GPUs you would like to consider for your system. The form factor of GPUs is an important consideration. Kepler-based NVIDIA Tesla GPUs are available in two main form factors.
    • Tesla workstation products (C Series) are actively cooled GPU boards (this means they have a fan cooler over the GPU chip) that you can just plug in to your desktop computer in a PCI-e x16 slot. These use either two 6-pin or one 8-pin power supply connector.
    • Server products (M Series) are passively cooled GPUs (no fans) installed in standard servers sold by various OEMs.

    There are three different options for adding GPUs to your cluster:

    • you can buy C-series GPUs and install them in existing workstations or servers with enough space;
    • you can buy workstations from a vendor with C-series GPUs installed; or
    • you can buy servers with M-series GPUs installed.

2. Allocate Space, Power and Cooling

The goal for this step is to assess your physical infrastructure, including space, power and cooling needs, network considerations and storage requirements to ensure optimal system choices with room to grow your cluster in the future. You should make sure that you have enough space, power and cooling for your cluster. Clusters are mainly rack mounted, with multiple machines installed in a vertical rack. Vendors offer many server solutions that minimize the use of rack space.

3. Assembly and Physical Deployment

After deciding the machine configuration and real estate the next step is to physically deploy your cluster. Figure 2 shows the cluster deployment connections. The head node is the external interface to the cluster; it receives all external network connections, processes incoming requests, and assigns work to compute nodes (nodes with GPUs that perform the computation).

In a research prototype cluster you can also use one of the compute nodes as the head node, but routing all traffic through the head node while also making it a compute node is not a good idea for production clusters because of performance and security issues. Production and large clusters mostly have a dedicated node to handle all incoming traffic, while the head node just manages the work distribution for the compute nodes.

Figure 2: Head node and compute node connections.

4. Head Node Installation

I recommend installing the head node with the open-source Rocks Linux distribution. Rocks is a customizable, quick, and easy way to install nodes. The Rocks installation package includes essential components for clusters, such as MPI. Rocks head node installation is well documented in the Rocks user guide, but here is a summary of the steps.

  • Follow the steps in Chapter 3 of the Rocks user guide and do a CD-based installation.
  • Install the NVIDIA drivers and CUDA Toolkit on the head node. (CUDA 5 provides a unified package that contains the NVIDIA driver, toolkit, and CUDA samples.) A quick verification sketch follows this list.
  • Install network interconnect drivers (e.g. Infiniband) on the head node. These drivers are available from your interconnect manufacturer.
  • Nagios® Core™ is an open source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go wrong and when they get better. To install, follow the instructions given in the Nagios installation guide.
  • The NRPE Nagios add-on allows you to execute Nagios plugins on remote Linux machines. This allows you to monitor local resources like CPU load and memory usage, which are not usually exposed to external machines, on remote machines using Nagios. Install NRPE following the install guide.
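
Before moving on to the compute nodes, it is worth confirming that the driver and toolkit on the head node are actually usable. The following is a minimal sketch; the /usr/local/cuda path assumes the toolkit was installed to its default location.

$ nvidia-smi                          # should list every GPU in the node along with the driver version
$ /usr/local/cuda/bin/nvcc --version  # should report the CUDA compiler release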

5. Compute Node Installation

After you have completed the head node installation, you will install the compute node software with the help of Rocks and the following steps.

  • On the head node: in a terminal shell run the command:
    > insert-ethers

    Choose “Compute Nodes” as the new node to add.

  • Power on the compute node with the Rocks CD as the first boot device or do a network installation.
  • The compute node will connect to the head node and start the installation.
  • Install the NRPE package as described in the NRPE guide. A quick sanity check of the newly added nodes is sketched below.
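
Once the compute nodes finish installing, you can check them from the head node. This is a sketch only: the compute-0-0 name assumes the default Rocks appliance naming, and it relies on the passwordless SSH that Rocks configures between nodes.

$ rocks list host             # newly added compute nodes should appear in the list
$ ssh compute-0-0 nvidia-smi  # confirm the GPUs on a compute node are visible to the driver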

6. Management and Monitoring

Once you finish the head node and all compute node installations, your cluster is ready to use! Before you actually start using it to run applications of interest, you should also set up management and monitoring tools on the cluster. These tools are necessary for proper management and monitoring of all resources available in the cluster. In this section, I will describe various tools and software packages for GPU management and monitoring.

GPU SYSTEM MANAGEMENT

The NVIDIA System Management Interface (NVIDIA-SMI) is a tool distributed as part of the NVIDIA GPU driver. NVIDIA-SMI provides a variety of GPU system information including

  • thermal monitoring metrics: GPU temperature, chassis inlet/outlet temperatures;
  • system information: firmware revision, configuration information;
  • system state: fan states, GPU faults, power system faults, ECC errors, etc.

NVIDIA-SMI allows you to configure the compute mode for any device in the system (Reference: CUDA C Programming Guide):

  • Default compute mode: multiple host threads can use the device at the same time.
  • Exclusive-process compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may be current to as many threads as desired within the process that created the context.
  • Exclusive-process-and-thread compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may only be current to one thread at a time.
  • Prohibited compute mode: No CUDA context can be created on the device.

NVIDIA-SMI also allows you to turn ECC (Error Correcting Code memory) mode on and off. The default is on, but applications that do not need ECC can get higher memory bandwidth by disabling it. A short sketch of these management commands follows.
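
Here is a sketch of these management commands using nvidia-smi. The -i flag selects a GPU index; the exact spelling of the compute-mode argument varies between driver releases (numeric values 0–3 are also accepted), so treat these as illustrative rather than definitive.

$ nvidia-smi -q -d TEMPERATURE,ECC            # query thermal and ECC state for all GPUs
$ sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS   # set the compute mode of GPU 0
$ sudo nvidia-smi -i 0 -e 0                   # disable ECC on GPU 0 (takes effect after the next reboot)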

GPU MONITORING WITH THE TESLA DEPLOYMENT KIT

The Tesla Deployment Kit is a collection of tools provided to better manage NVIDIA Tesla™ GPUs. These tools support Linux (32-bit and 64-bit), Windows 7 (64-bit), and Windows Server 2008 R2 (64-bit). The current distribution contains NVIDIA-healthmon and the NVML API.

NVML API

The NVML API is a C-based API which provides programmatic state monitoring and management of NVIDIA GPU devices. The NVML dynamic run-time library ships with the NVIDIA display driver, and the NVML SDK provides headers, stub libraries and sample applications. NVML can be used from Python or Perl (bindings are available) as well as C/C++ or Fortran.

Ganglia is an open-source, scalable, distributed monitoring system used for clusters and grids, with very low per-node overhead and high concurrency. An NVML-based Python module plugs into the Ganglia gmond daemon to monitor NVIDIA GPUs in the Ganglia interface.

NVIDIA-HEALTHMON 

This utility provides quick health checking of GPUs in cluster nodes. The tool detects issues and suggests remedies for software and system configuration problems, but it is not a comprehensive hardware diagnostic tool. Features include the following (a minimal invocation sketch follows the list):

  • basic CUDA and NVML sanity check;
  • diagnosis of GPU failures;
  • check for conflicting drivers;
  • poorly seated GPU detection;
  • check for disconnected power cables;
  • ECC error detection and reporting;
  • bandwidth test;
  • infoROM validation.
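
A minimal invocation is sketched below, assuming the nvidia-healthmon binary from the Tesla Deployment Kit is on your PATH; in practice you would run it on each compute node, for example from a job prologue script.

$ nvidia-healthmon      # run the default set of health checks on the local GPUs
$ echo $?               # a zero exit status conventionally indicates that all checks passed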

7. Run Benchmarks and Applications

Once your cluster is up and running you will want to validate it by running some benchmarks and sample applications. There are various benchmarks and code samples for GPUs and the network as well as applications to run on the entire cluster. For GPUs, you need to run two basic tests.

  1. deviceQuery: This sample code is available with the CUDA Samples included in the CUDA Toolkit installation package. deviceQuery simply enumerates the properties of the CUDA devices present in a node. This is not a benchmark, but successfully running this or any other CUDA sample verifies that you have the CUDA driver and toolkit properly installed on the system.
  2. bandwidthTest: This is another of the CUDA Samples included with the Toolkit. This sample measures the cudaMemcpy bandwidth of the GPU across PCIe as well as internally. You should measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory. Example invocations of both checks are shown after this list.
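
Assuming the samples that ship with the CUDA Toolkit are built in place under the default install path, the two checks can be run as in the following sketch (adjust the paths if you copied the samples elsewhere):

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery   && make && ./deviceQuery
$ cd /usr/local/cuda/samples/1_Utilities/bandwidthTest && make && ./bandwidthTest --memory=pinned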

To benchmark network performance, you should run the bandwidth and latency tests for your installed MPI distribution. Standard MPI installations ship with benchmarks such as the OSU micro-benchmarks (for example, under /tests/osu_benchmarks-3.1.1 in MVAPICH2). You should consider using an open source CUDA-aware MPI implementation like MVAPICH2, as described in the earlier Parallel Forall posts An Introduction to CUDA-Aware MPI and Benchmarking CUDA-Aware MPI.
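
As a sketch, a two-node latency and bandwidth run with MVAPICH2’s mpirun_rsh launcher looks like the following; node01 and node02 are placeholder hostnames, and the benchmark binaries are assumed to have been built in the current directory.

$ mpirun_rsh -np 2 node01 node02 ./osu_latency   # point-to-point latency
$ mpirun_rsh -np 2 node01 node02 ./osu_bw        # point-to-point bandwidth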

To benchmark the entire cluster, you should run the LINPACK numerical linear algebra application. The top 500 supercomputers list uses the HPL benchmark to decide the fastest supercomputers on Earth. The CUDA-enabled version of HPL (High-Performance LINPACK) optimized for GPUs is available from NVIDIA on request, and there is a Fermi-optimized version available to all NVIDIA registered developers.

In this post I have provided an overview of the basic steps to build a GPU-accelerated research prototype cluster. For more details on GPU-based clusters and some of the best practices for production clusters, please refer to Dale Southard’s GTC 2013 talk S3249 – Introduction to Deploying, Managing, and Using GPU Clusters.

Posted in CLOUD, CLUSTER, Computer Vision, Computing Technology, CUDA, GPU (CUDA), GRID, Linux OS, Mixed, Multimedia, PARALLEL | Tagged: , , | Leave a Comment »

Install CUDA 6.5 on Ubuntu 14.04

Posted by Hemprasad Y. Badgujar on December 22, 2014


Install build-essential:

$ apt-get update && apt-get install build-essential

Get CUDA installer:

$ wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

Extract CUDA installer:

$ chmod +x cuda_6.5.14_linux_64.run
$ mkdir nvidia_installers
$ ./cuda_6.5.14_linux_64.run -extract=`pwd`/nvidia_installers

Run Nvidia driver installer:

$ cd nvidia_installers
$ ./NVIDIA-Linux-x86_64-340.29.run

At this point it will pop up an 8-bit UI that will ask you to accept a license agreement, and then start installing.


At this point, I got an error:

Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or
         improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver
         such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
         device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

         Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log'
         for more information.

After reading this forum post I installed:

$ sudo apt-get install linux-image-extra-virtual

When it prompted me what to do about the GRUB changes, I chose the “package maintainer’s version” option.

Reboot:

$ reboot

Disable nouveau

At this point you need to disable nouveau, since it conflicts with the nvidia kernel module.

Open a new file

$ vi /etc/modprobe.d/blacklist-nouveau.conf

and add these lines to it

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

and then save the file.

Disable the Kernel Nouveau:

$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf

Reboot:

$ update-initramfs -u
$ reboot

One more try — this time it works

Get Kernel source:

$ apt-get install linux-source
$ apt-get install linux-headers-3.13.0-37-generic   # the headers version must match your running kernel (check with uname -r)

Rerun Nvidia driver installer:

$ cd nvidia_installers
$ ./NVIDIA-Linux-x86_64-340.29.run

Load nvidia kernel module:

$ modprobe nvidia
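
Before continuing, you can confirm that the module actually loaded; this optional check is a small sketch using standard tools:

$ lsmod | grep nvidia   # should show the nvidia module
$ dmesg | tail          # look for NVRM messages if the module is missing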

Run CUDA + samples installer:

$ ./cuda-linux64-rel-6.5.14-18749181.run
$ ./cuda-samples-linux-6.5.14-18745345.run

Verify CUDA is correctly installed

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery   

You should see the following output:

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GRID K520
Result = PASS

You should reboot the system afterwards and verify the driver installation with the nvidia-settings utility.

Environment Variables

As part of the CUDA environment, you should add the following to the .bashrc file in your home folder.

export CUDA_HOME=/usr/local/cuda-6.5   # note the hyphen: CUDA 6.5 installs to /usr/local/cuda-6.5 by default
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64

PATH=${CUDA_HOME}/bin:${PATH}
export PATH
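
After editing .bashrc, reload the shell configuration and confirm that the toolkit is picked up from the new paths (assuming the variables above):

$ source ~/.bashrc
$ nvcc --version            # should report release 6.5
$ echo $LD_LIBRARY_PATH     # should include /usr/local/cuda-6.5/lib64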

CUDA SDK Samples

Now you can copy the SDK samples into your home directory, and proceed with the build process.

$ cuda-install-samples-6.5.sh ~
$ cd ~/NVIDIA_CUDA-6.5_Samples
$ make

If everything goes well, you should be able to verify your CUDA installation by running the deviceQuery sample in bin/x86_64/linux/release.

Source (http://tleyden.github.io/)

Posted in Computer Network & Security, Computer Vision, Computing Technology, CUDA | Tagged: , | Leave a Comment »

How to run CUDA 6.5 in Emulation Mode

Posted by Hemprasad Y. Badgujar on December 20, 2014


How to run CUDA in Emulation Mode

Some beginners feel a little bit dejected when they find that their systems do not contain GPUs to learn and work with CUDA. In this blog post, I shall include the step-by-step process of installing and executing CUDA programs in emulation mode on a system with no GPU installed in it. Note that you will not gain any of the performance advantage expected of a GPU (obviously); instead, the performance will be worse than a CPU implementation. However, emulation mode provides an excellent tool to compile and debug your CUDA codes for more advanced purposes. Please note that I performed the following steps on a Dell Xeon system with Windows 7 (32-bit).

1. Acquire and install Microsoft Visual Studio 2008 on your system.

2. Access the CUDA Toolkit Archives  page and select CUDA Toolkit 6.0 / 6.5 version. (It is the last version that came with emulation mode. Emulation mode was discontinued in later versions.)

3. Download and install the following on your machine:-

  • Developer Drivers for Win8/win7 X64  – (Select the one as required for your machine.)
  • CUDA Toolkit
  • CUDA SDK Code Samples
  • CUBLAS and CUFFT (If required)

4. The next step is to check whether the sample codes run properly on the system. This will ensure that nothing is missing from the required installations. Browse to the NVIDIA GPU Computing SDK using the Windows Start menu, or by entering the path appropriate for your platform in the My Computer address bar:
“C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Release”
“C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win64\Release”

(Also note that the ProgramData folder has the “Hidden” attribute set by default. It is a good idea to unhide this folder, as it will be frequently used later on as you progress with your CUDA learning.)

5. Run the “deviceQuery” program; it should output something similar to what is shown in Fig. 1. Upon visual inspection of the output, you can see that there is no GPU device found, yet the test has PASSED. This means that all the required installations for CUDA in emulation mode have been completed, and we can now proceed with writing, compiling, and executing CUDA programs in emulation mode.

Figure 1. Successful execution of deviceQuery.exe (demo example only)

6. Open Visual Studio and create a new Win32 console project. Let’s name it “HelloCUDAEmuWorld”. Remember to select the “EMPTY PROJECT” option in Application Settings. Now Right Click on “Source Files” in the project tree and add new C++ code item. Remember to include the extension “.cu” instead of “.cpp”. Let’s name this item as “HelloCUDAEmuWorld.cu”. (If you forget the file extension, it can always be renamed via the project tree on the left).

7. Add the CUDA include, lib, and bin paths to MS Visual Studio. They were located at “C:\CUDA” on my system.

The next steps need to be performed for every new CUDA project when created.

8. Right click on the project and select Custom Build Rules. Check the Custom Build Rules v6.0.0 option if available. Otherwise, click on Find Existing… and navigate to “C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common” and select Cuda.rules. This will add the build rules for CUDA v6.0 to Visual Studio.

9. Right click on the project and select Properties. Navigate to Configuration Properties –> Linker –> Input. Type cudart.lib in the Additional Dependencies box and click OK. Now we are ready to compile and run our first ever CUDA program in emulation mode. But first we need to activate the emulation mode for .cu files.

10. Once again right click on the project and select Properties. Navigate to Configuration Properties –> CUDA Build Rule v6.0.0 –> General. Set Emulation Mode from No to Yes in the right-hand column of the window. Click OK.

11. Type in the following in the code editor and build and compile the project. And there it is. Your first ever CUDA program, in Emulation Mode. Something to brag about among friends.

int main(void)
{
return 0;
}

I hope this effort will not go in vain and will offer some help to anyone who is tied up with this issue. Do contact me if there is any query regarding the above procedure. Source (http://hpcden.blogspot.in)

Posted in Computer Vision, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, Image / Video Filters, My Research Related, OpenCV, PARALLEL, Project Related | Leave a Comment »

 