Something More for Research

Explorer of Research #HEMBAD

Archive for the ‘CLOUD’ Category

Building VTK with Visual Studio 2013

Posted by Hemprasad Y. Badgujar on April 30, 2015


Building VTK5 with Visual Studio

Download

  1. Download VTK 5.10.1 (VTK-5.10.1.zip) and unzip it. (C:\VTK-5.10.1)
    http://www.vtk.org/VTK/resources/software.html#previous
    https://github.com/Kitware/VTK/tree/v5.10.1

CMake

  1. Specify the source code directory and the destination directory for the generated solution files.
    • Where is the source code: C:\VTK-5.10.1
    • Where to build the binaries: C:\VTK-5.10.1\build
  2. Press [Configure] and select the target Visual Studio version.
  3. Make the following settings.
    • BUILD_SHARED_LIBS ☑ (check)
    • BUILD_TESTING ☐ (uncheck)
    • CMAKE_CONFIGURATION_TYPES Debug;Release
    • CMAKE_INSTALL_PREFIX C:\Program Files\VTK (or C:\Program Files (x86)\VTK)
  4. Press [Add Entry] and add the following setting.
    Name: CMAKE_DEBUG_POSTFIX
    Type: STRING
    Value: -gd
    Description:

    * A suffix appended to the file names of generated Debug build outputs.

  5. Press [Generate] to output the solution file.

Build

  1. Start Visual Studio with administrator privileges and open the VTK solution file (C:\VTK-5.10.1\build\VTK.sln).
    (If Visual Studio is not started with administrator privileges, the INSTALL step will fail.)
  2. Modify the source code as follows. (The vtkEnSightGoldBinaryReader changes switch the stream-read checks to fail() so that they compile under Visual Studio 2013.)
    • vtkOStreamWrapper.cxx
      Line 60

      //VTKOSTREAM_OPERATOR(ostream&);
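      // Replacing the macro with an explicit operator definition works
      // around a compile error under Visual Studio 2013.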
      vtkOStreamWrapper& vtkOStreamWrapper::operator << (ostream& a) {
        this->ostr << (void *)&a;
        return *this;
      }
      
    • vtkEnSightGoldBinaryReader.cxx
      Line 3925

      if (this->IFile->read(result, 80).fail())
      

      Line 3944

      if (this->IFile->read(dummy, 8).fail())
      

      Line 4001

      if (this->IFile->read(dummy, 4).fail())
      

      Line 4008

      if (this->IFile->read((char*)result, sizeof(int)).fail())
      

      Line 4025

      if (this->IFile->read(dummy, 4).fail())
      

      Line 4048

      if (this->IFile->read(dummy, 4).fail())
      

      Line 4055

      if (this->IFile->read((char*)result, sizeof(int)*numInts).fail())
      

      Line 4072

      if (this->IFile->read(dummy, 4).fail())
      

      Line 4095

      if (this->IFile->read(dummy, 4).fail())
      

      Line 4102

      if (this->IFile->read((char*)result, sizeof(float)*numFloats).fail())
      

      Line 4119

      if (this->IFile->read(dummy, 4).fail())
      
    • vtkConvexHull2D.cxx
      Line 31

      #include <algorithm>
      
    • vtkAdjacencyMatrixToEdgeTable.cxx
      Line 31

      #include <algorithm>
      
    • vtkNormalizeMatrixVectors.cxx
      Line 30

      #include <algorithm>
      
    • vtkPairwiseExtractHistogram2D.cxx
      Line 39

      #include <algorithm>
      
    • vtkControlPointsItem.cxx
      Line 35

      #include <algorithm>
      
    • vtkPiecewisePointHandleItem.cxx
      Line 31

      #include <algorithm>
      
    • vtkParallelCoordinatesRepresentation.cxx
      Line 83

      #include <algorithm>
      
  3. Build VTK (ALL_BUILD).
    1. Set the solution configuration (Debug or Release).
    2. Select the ALL_BUILD project in Solution Explorer.
    3. Press [Build] > [Build Solution] to build VTK.
  4. Install VTK (INSTALL).
    1. Select the INSTALL project in Solution Explorer.
    2. Press [Build] > [Project Only] > [Build Only INSTALL]. The necessary files are copied to the destination specified by CMAKE_INSTALL_PREFIX.

Environment Variable

  1. Create an environment variable VTK_ROOT and set it to the VTK path (C:\Program Files\VTK).
  2. Append %VTK_ROOT%\bin; to the Path environment variable.
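
To verify the build and the environment settings, you can compile and run a small VTK program. The following is a minimal sketch (not part of the original instructions), assuming a Visual Studio project configured with the VTK include and library directories under C:\Program Files\VTK; it renders a cone in an interactive window.

    // Minimal VTK 5 smoke test: renders a cone in an interactive window.
    #include "vtkConeSource.h"
    #include "vtkPolyDataMapper.h"
    #include "vtkActor.h"
    #include "vtkRenderer.h"
    #include "vtkRenderWindow.h"
    #include "vtkRenderWindowInteractor.h"

    int main()
    {
      vtkConeSource *cone = vtkConeSource::New();
      cone->SetHeight(3.0);
      cone->SetRadius(1.0);

      vtkPolyDataMapper *mapper = vtkPolyDataMapper::New();
      mapper->SetInputConnection(cone->GetOutputPort());

      vtkActor *actor = vtkActor::New();
      actor->SetMapper(mapper);

      vtkRenderer *renderer = vtkRenderer::New();
      renderer->AddActor(actor);

      vtkRenderWindow *window = vtkRenderWindow::New();
      window->AddRenderer(renderer);
      window->SetSize(300, 300);

      vtkRenderWindowInteractor *interactor = vtkRenderWindowInteractor::New();
      interactor->SetRenderWindow(window);
      window->Render();
      interactor->Start();

      // VTK 5 uses manual reference counting; release objects when done.
      interactor->Delete();
      window->Delete();
      renderer->Delete();
      actor->Delete();
      mapper->Delete();
      cone->Delete();
      return 0;
    }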

Building VTK6 with Visual Studio

Download

  1. Download VTK 6.1.0 (VTK-6.1.0.zip) and unzip it. (C:\VTK-6.1.0)
    http://www.vtk.org/VTK/resources/software.html#latestcand
    https://github.com/Kitware/VTK/tree/v6.1.0

CMake

  1. Specify the source code directory and the destination directory for the generated solution files.
    • Where is the source code: C:\VTK-6.1.0
    • Where to build the binaries: C:\VTK-6.1.0\build
  2. Press [Configure] and select the target Visual Studio version.
  3. Make the following settings.
    • BUILD_SHARED_LIBS ☑ (check)
    • BUILD_TESTING ☐ (uncheck)
    • CMAKE_CONFIGURATION_TYPES Debug;Release
    • CMAKE_INSTALL_PREFIX C:\Program Files\VTK (or C:\Program Files (x86)\VTK)
  4. Press [Add Entry] and add the following setting.
    Name: CMAKE_DEBUG_POSTFIX
    Type: STRING
    Value: -gd
    Description:

    * A suffix appended to the file names of generated Debug build outputs.

  5. Press [Generate] to output the solution file.

Build

  1. Start Visual Studio with administrator privileges and open the VTK solution file (C:\VTK-6.1.0\build\VTK.sln).
    (If Visual Studio is not started with administrator privileges, the INSTALL step will fail.)
  2. Build VTK (ALL_BUILD).
    1. Set the solution configuration (Debug or Release).
    2. Select the ALL_BUILD project in Solution Explorer.
    3. Press [Build] > [Build Solution] to build VTK.
  3. Install VTK (INSTALL).
    1. Select the INSTALL project in Solution Explorer.
    2. Press [Build] > [Project Only] > [Build Only INSTALL]. The necessary files are copied to the destination specified by CMAKE_INSTALL_PREFIX.

Environment Variable

  1. Create an environment variable VTK_DIR and set it to the VTK path (C:\Program Files\VTK).
  2. Append %VTK_DIR%\bin; to the Path environment variable.

Building VTK6 + Qt5 with Visual Studio

Download

  1. Download VTK 6.1.0 (VTK-6.1.0.zip) and unzip it. (C:\VTK-6.1.0)
    http://www.vtk.org/VTK/resources/software.html#latestcand
    https://github.com/Kitware/VTK/tree/v6.1.0
  2. Download and install Qt 5.4.0 with OpenGL. (C:\Qt)
    http://www.qt.io/download-open-source/#

    • Qt 5.4.0 for Windows 32-bit (VS 2013, OpenGL, 694 MB)
      (qt-opensource-windows-x86-msvc2013_opengl-5.4.0.exe)
    • Qt 5.4.0 for Windows 64-bit (VS 2013, OpenGL, 709 MB)
      (qt-opensource-windows-x86-msvc2013_64_opengl-5.4.0.exe)

CMake

  1. Specify the source code directory and the destination directory for the generated solution files.
    • Where is the source code: C:\VTK-6.1.0
    • Where to build the binaries: C:\VTK-6.1.0\build
  2. Press [Configure] and select the target Visual Studio version.
  3. Make the following settings.
    (Checking [Grouped] and [Advanced] makes the entries easier to find.) * For Win32 builds specify msvc2013_opengl; for x64 builds specify msvc2013_64_opengl.

    Ungrouped Entries

    • Qt5Core_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Core
    • Qt5Designer_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Designer
    • Qt5Gui_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Gui
    • Qt5Network_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Network
    • Qt5OpenGL_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5OpenGL
    • Qt5Sql_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Sql
    • Qt5WebKit_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5WebKit
    • Qt5WebKitWidgets_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5WebKitWidgets
    • Qt5Widgets_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Widgets
    • Qt5Xml_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/cmake/Qt5Xml

    BUILD

    • BUILD_SHARED_LIBS ☑ (check)
    • BUILD_TESTING ☐ (uncheck)

    CMAKE

    • CMAKE_CONFIGURATION_TYPES Debug;Release
    • CMAKE_INSTALL_PREFIX C:\Program Files\VTK (or C:\Program Files (x86)\VTK)

    Module

    • Module_vtkGUISupportQt ☑ (check)
    • Module_vtkGUISupportQtOpenGL ☑ (check)
    • Module_vtkGUISupportQtSQL ☑ (check)
    • Module_vtkGUISupportQtWebkit ☑ (check)
    • Module_vtkRenderingQt ☑ (check)
    • Module_vtkViewsQt ☑ (check)

    OPENGL

    • OPENGL_gl_LIBRARY opengl32
    • OPENGL_glu_LIBRARY glu32

    QT

    • QT_MKSPECS_DIR C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/mkspecs/win32-msvc2013
    • QT_QMAKE_EXECUTABLE C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/bin/qmake.exe
    • QT_QTCORE_LIBRARY_DEBUG C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/Qt5Cored.lib
    • QT_QTCORE_LIBRARY C:/Qt/Qt5.4.0/5.4/msvc2013_64_opengl/lib/Qt5Core.lib

    VTK

    • VTK_Group_Qt ☑ (check)
    • VTK_INSTALL_QT_PLUGIN_DIR ${CMAKE_INSTALL_PREFIX}/${VTK_INSTALL_QT_DIR}
    • VTK_QT_VERSION 5
  4. Press [Add Entry] and add the following settings.
    Name: CMAKE_PREFIX_PATH
    Type: PATH
    Value: C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x64
    (or C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x86)
    Description:

    * For Visual Studio 2013, specify the Windows Kits path 8.1\Lib\winv6.3; for Visual Studio 2012, specify 8.0\Lib\win8.

    Name: CMAKE_DEBUG_POSTFIX
    Type: STRING
    Value: -gd
    Description:

    * A suffix appended to the file names of generated Debug build outputs.

  5. Press [Generate] to output the solution file.

Build

  1. Start Visual Studio with administrator privileges and open the VTK solution file (C:\VTK-6.1.0\build\VTK.sln).
    (If Visual Studio is not started with administrator privileges, the INSTALL step will fail.)
  2. Build VTK (ALL_BUILD).
    1. Set the solution configuration (Debug or Release).
    2. Select the ALL_BUILD project in Solution Explorer.
    3. Press [Build] > [Build Solution] to build VTK.
  3. Install VTK (INSTALL).
    1. Select the INSTALL project in Solution Explorer.
    2. Press [Build] > [Project Only] > [Build Only INSTALL]. The necessary files are copied to the destination specified by CMAKE_INSTALL_PREFIX.

Environment Variable

  1. Create an environment variable VTK_DIR and set it to the VTK path (C:\Program Files\VTK).
  2. Create an environment variable QTDIR and set it to the Qt path (C:\Qt\Qt5.4.0\5.4\msvc2013_64_opengl\ or C:\Qt\Qt5.4.0\5.4\msvc2013_opengl\).
  3. Append %VTK_DIR%\bin;%QTDIR%\bin; to the Path environment variable.
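
To verify the Qt-enabled modules, a minimal QVTKWidget application makes a convenient smoke test. This is a sketch assuming the Module_vtkGUISupportQt option above was enabled and the project is configured with both the Qt and VTK include and library paths.

    // Minimal VTK 6 + Qt 5 smoke test: a cone rendered inside a Qt widget.
    #include <QApplication>
    #include <QVTKWidget.h>
    #include <vtkAutoInit.h>
    // Needed when the rendering backend is not initialized by CMake autoinit.
    VTK_MODULE_INIT(vtkRenderingOpenGL);
    VTK_MODULE_INIT(vtkInteractionStyle);
    #include <vtkSmartPointer.h>
    #include <vtkConeSource.h>
    #include <vtkPolyDataMapper.h>
    #include <vtkActor.h>
    #include <vtkRenderer.h>
    #include <vtkRenderWindow.h>

    int main(int argc, char *argv[])
    {
      QApplication app(argc, argv);

      QVTKWidget widget; // hosts a VTK render window inside a Qt widget

      vtkSmartPointer<vtkConeSource> cone =
        vtkSmartPointer<vtkConeSource>::New();
      vtkSmartPointer<vtkPolyDataMapper> mapper =
        vtkSmartPointer<vtkPolyDataMapper>::New();
      mapper->SetInputConnection(cone->GetOutputPort());
      vtkSmartPointer<vtkActor> actor = vtkSmartPointer<vtkActor>::New();
      actor->SetMapper(mapper);
      vtkSmartPointer<vtkRenderer> renderer =
        vtkSmartPointer<vtkRenderer>::New();
      renderer->AddActor(actor);

      widget.GetRenderWindow()->AddRenderer(renderer);
      widget.resize(400, 300);
      widget.show();
      return app.exec();
    }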


How to Build a GPU-Accelerated Cluster

Posted by Hemprasad Y. Badgujar on December 22, 2014


Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (“nodes”) connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high performance computing (HPC), NVIDIA GPUs are becoming part of some of the world’s most powerful supercomputers and clusters. The most recent TOP500 list of the world’s fastest supercomputers included nearly 50 supercomputers powered by NVIDIA GPUs, and the current world’s fastest supercomputer, Oak Ridge National Lab’s TITAN, utilizes more than 18,000 NVIDIA Kepler GPUs.

In this post I will take you step by step through the process of designing, deploying, and managing a small research prototype GPU cluster for HPC. I will describe all the components needed for a GPU cluster as well as the complete cluster management software stack. The goal is to build a research prototype GPU cluster using all open source and free software and with minimal hardware cost.

I gave a talk on this topic at GTC 2013 (session S3516 – Building Your Own GPU Research Cluster Using Open Source Software Stack). The slides and a recording are available at that link so please check it out!

There are multiple motivating reasons for building a GPU-based research cluster.

  • Get a feel for production systems and performance estimates;
  • Port your applications to GPUs and distributed computing (using CUDA-aware MPI);
  • Tune GPU and CPU load balancing for your application;
  • Use the cluster as development platform;
  • Early experience means increased readiness;
  • The investment is relatively small for a research prototype cluster.

Figure 1 shows the steps to build a small GPU cluster. Let’s look at the process in more detail.

Figure 1: Seven steps to build and test a small research GPU cluster.

1. Choose Your Hardware

There are two steps to choosing the correct hardware.

  1. Node Hardware Details. This is the specification of the machine (node) for your cluster. Each node has the following components.
    • CPU processor from any vendor;
    • A motherboard with the following PCI-express connections:
      • 2x PCIe x16 Gen2/3 connections for Tesla GPUs;
      • 1x PCIe x8 connection for an InfiniBand HCA (host channel adapter) card;
    • 2 available network ports;
    • A minimum of 16-24 GB DDR3 RAM. (It is good to have more RAM in the system.)
    • A power-supply unit (SMPS) with an ample power rating. The total power needed includes the power taken by the CPU, the GPUs, and the other components in the system (as a hypothetical example, two 225 W Tesla GPUs plus a 130 W CPU and roughly 100 W of other components already call for a supply rated well above 700 W).
    • Secondary storage (HDD / SSD) based on your needs.

    GPU boards are wide enough to cover two physically adjacent PCI-e slots, so make sure that the PCIe x16 and x8 slots are physically separated on the motherboard so that you can fit a minimum of 2 PCI-e x16 GPUs and 1 PCIe x8 network card.

  2. Choose the right form factor for GPUs. Once you decide your machine specs you should also decide which model GPUs you would like to consider for your system. The form factor of GPUs is an important consideration. Kepler-based NVIDIA Tesla GPUs are available in two main form factors.
    • Tesla workstation products (C Series) are actively cooled GPU boards (this means they have a fan cooler over the GPU chip) that you can just plug in to your desktop computer in a PCI-e x16 slot. These use either two 6-pin or one 8-pin power supply connector.
    • Server products (M Series) are passively cooled GPUs (no fans) installed in standard servers sold by various OEMs.

    There are three different options for adding GPUs to your cluster:

    • you can buy C-series GPUs and install them in existing workstations or servers with enough space;
    • you can buy workstations from a vendor with C-series GPUs installed; or
    • you can buy servers with M-series GPUs installed.

2. Allocate Space, Power and Cooling

The goal for this step is to assess your physical infrastructure, including space, power and cooling needs, network considerations and storage requirements to ensure optimal system choices with room to grow your cluster in the future. You should make sure that you have enough space, power and cooling for your cluster. Clusters are mainly rack mounted, with multiple machines installed in a vertical rack. Vendors offer many server solutions that minimize the use of rack space.

3. Assembly and Physical Deployment

After deciding the machine configuration and real estate the next step is to physically deploy your cluster. Figure 2 shows the cluster deployment connections. The head node is the external interface to the cluster; it receives all external network connections, processes incoming requests, and assigns work to compute nodes (nodes with GPUs that perform the computation).

In a research prototype cluster you can also use one of the compute nodes as a head node, but routing all traffic through the head node while also making it a compute node is not a good idea for production clusters because of performance and security issues. Production and large clusters mostly have a dedicated node to handle all incoming traffic while the head node just manages the work distribution for the compute nodes.

Figure 2: Head node and compute node connections.

4. Head Node Installation

I recommend installing the head node with the open source Rocks Linux distribution. Rocks is a customizable, easy, and quick way to install cluster nodes. The Rocks installation package includes essential components for clusters, such as MPI. Rocks head node installation is well documented in the Rocks user guide, but here is a summary of the steps.

  • Follow the steps in Chapter 3 of the Rocks user guide and do a CD-based installation.
  • Install the NVIDIA drivers and CUDA Toolkit on the head node. (CUDA 5 provides a unified package that contains the NVIDIA driver, toolkit, and CUDA samples.)
  • Install network interconnect drivers (e.g. Infiniband) on the head node. These drivers are available from your interconnect manufacturer.
  • Nagios® Core™ is an open source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go wrong and when they get better. To install, follow the instructions given in the Nagios installation guide.
  • The NRPE Nagios add-on allows you to execute Nagios plugins on remote Linux machines. This allows you to monitor local resources like CPU load and memory usage, which are not usually exposed to external machines, on remote machines using Nagios. Install NRPE following the install guide.

5. Compute Node Installation

After you have completed the head node installation, you will install the compute node software with the help of Rocks and the following steps.

  • On the head node: in a terminal shell run the command:
    > insert-ethers

    Choose “Compute Nodes” as the new node to add.

  • Power on the compute node with the Rocks CD as the first boot device or do a network installation.
  • The compute node will connect to the head node and start the installation.
  • Install the NRPE package as described in the NRPE guide.

6. Management and Monitoring

Once you finish the head node and all compute node installations, your cluster is ready to use! Before you actually start using it to run applications of interest, you should also set up management and monitoring tools on the cluster. These tools are necessary for proper management and monitoring of all resources available in cluster. In this section, I will describe various tools and software packages for GPU management and monitoring.

GPU SYSTEM MANAGEMENT

The NVIDIA System Management Interface (NVIDIA-SMI) is a tool distributed as part of the NVIDIA GPU driver. NVIDIA-SMI provides a variety of GPU system information including

  • thermal monitoring metrics: GPU temperature, chassis inlet/outlet temperatures;
  • system information: firmware revision, configuration information;
  • system state: fan states, GPU faults, power system faults, ECC errors, etc.

NVIDIA-SMI allows you to configure the compute mode for any device in the system (reference: CUDA C Programming Guide):

  • Default compute mode: multiple host threads can use the device at the same time.
  • Exclusive-process compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may be current to as many threads as desired within the process that created the context.
  • Exclusive-process-and-thread compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may only be current to one thread at a time.
  • Prohibited compute mode: No CUDA context can be created on the device.

NVIDIA-SMI also allows you to turn ECC (Error Correcting Code memory) mode on and off. The default is ON, but applications that do not need ECC can get higher memory bandwidth by disabling it.

GPU MONITORING WITH THE TESLA DEPLOYMENT KIT

The Tesla Deployment Kit is a collection of tools provided to better manage NVIDIA Tesla™ GPUs. These tools support Linux (32-bit and 64-bit), Windows 7 (64-bit), and Windows Server 2008 R2 (64-bit). The current distribution contains NVIDIA-healthmon and the NVML API.

NVML API

The NVML API is a C-based API which provides programmatic state monitoring and management of NVIDIA GPU devices. The NVML dynamic run-time library ships with the NVIDIA display driver, and the NVML SDK provides headers, stub libraries and sample applications. NVML can be used from Python or Perl (bindings are available) as well as C/C++ or Fortran.
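
As a concrete illustration, the short C sketch below polls each device's name and core temperature through NVML. It assumes the NVML SDK header (nvml.h) and the driver's NVML library are available to the compiler and linker.

    /* Enumerate GPUs and print core temperatures via NVML. */
    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        unsigned int count, i, temp;
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];

        if (nvmlInit() != NVML_SUCCESS)
            return 1;
        nvmlDeviceGetCount(&count);
        for (i = 0; i < count; i++) {
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetName(dev, name, sizeof(name));
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
            printf("GPU %u (%s): %u C\n", i, name, temp);
        }
        nvmlShutdown(); /* release NVML state */
        return 0;
    }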

Ganglia is an open-source scalable distributed monitoring system used for clusters and grids with very low per-node overhead and high concurrency. An NVML-based Python module for gmond is available for monitoring NVIDIA GPUs in the Ganglia interface.

NVIDIA-HEALTHMON 

This utility provides quick health checking of GPUs in cluster nodes. The tool detects issues and suggests remedies to software and system configuration problems, but it is not a comprehensive hardware diagnostic tool. Features include:

  • basic CUDA and NVML sanity check;
  • diagnosis of GPU failures;
  • check for conflicting drivers;
  • poorly seated GPU detection;
  • check for disconnected power cables;
  • ECC error detection and reporting;
  • bandwidth test;
  • infoROM validation.

7. Run Benchmarks and Applications

Once your cluster is up and running you will want to validate it by running some benchmarks and sample applications. There are various benchmarks and code samples for GPUs and the network as well as applications to run on the entire cluster. For GPUs, you need to run two basic tests.

  1. deviceQuery: This sample code is available with the CUDA samples included in the CUDA Toolkit installation package. deviceQuery simply enumerates the properties of the CUDA devices present in a node (a stripped-down version is sketched after this list). This is not a benchmark, but successfully running this or any other CUDA sample serves to verify that you have the CUDA driver and toolkit properly installed on the system.
  2. bandwidthTest: This is another of the CUDA samples included with the Toolkit. This sample measures the cudaMemcpy bandwidth of the GPU across PCIe as well as internally. You should measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.
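
For reference, a stripped-down device query takes only a few lines against the CUDA runtime API; the sketch below prints a few of the properties the full deviceQuery sample reports.

    // Minimal CUDA device enumeration (compile with: nvcc devquery.cu).
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int n = 0;
        if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
            std::printf("No CUDA devices found\n");
            return 1;
        }
        for (int i = 0; i < n; i++) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, i);
            std::printf("Device %d: %s, compute capability %d.%d, %lu MB global memory\n",
                        i, p.name, p.major, p.minor,
                        (unsigned long)(p.totalGlobalMem >> 20));
        }
        return 0;
    }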

To benchmark network performance, you should run the bandwidth and latency tests for your installed MPI distribution. MPI installations include standard benchmarks, such as /tests/osu_benchmarks-3.1.1. You should consider using an open source CUDA-aware MPI implementation like MVAPICH2, as described in the earlier Parallel Forall posts An Introduction to CUDA-Aware MPI and Benchmarking CUDA-Aware MPI.
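
As a rough stand-in for such benchmarks, the sketch below measures one-way bandwidth between two MPI ranks with a simple ping-pong loop; the message size and repetition count are arbitrary choices, not values from the OSU suite.

    /* MPI ping-pong bandwidth probe; run with: mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    #define NBYTES (1 << 22)   /* 4 MB message */
    #define REPS   100

    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0) /* each iteration is a round trip (two transfers) */
            printf("one-way bandwidth: ~%.1f MB/s\n",
                   (double)NBYTES * REPS / ((t1 - t0) / 2.0) / 1.0e6);
        MPI_Finalize();
        return 0;
    }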

To benchmark the entire cluster, you should run the LINPACK numerical linear algebra application. The TOP500 supercomputers list uses the HPL benchmark to rank the fastest supercomputers on Earth. The CUDA-enabled version of HPL (High-Performance LINPACK) optimized for GPUs is available from NVIDIA on request, and there is a Fermi-optimized version available to all NVIDIA registered developers.

In this post I have provided an overview of the basic steps to build a GPU-accelerated research prototype cluster. For more details on GPU-based clusters and some of the best practices for production clusters, please refer to Dale Southard’s GTC 2013 talk S3249 – Introduction to Deploying, Managing, and Using GPU Clusters.


Cloud scaling, Part 1: Build a compute node or small cluster application and scale with HPC

Posted by Hemprasad Y. Badgujar on December 11, 2014

Leveraging warehouse-scale computing as needed

Discover methods and tools to build a compute node and small cluster application that can scale with on-demand high-performance computing (HPC) by leveraging the cloud. This series takes an in-depth look at how to address unique challenges while tapping and leveraging the efficiency of warehouse-scale on-demand HPC. The approach allows the architect to build locally for expected workload and to spill over into on-demand cloud HPC for peak loads. Part 1 focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Exotic HPC architectures with custom-scaled processor cores and shared memory interconnection networks are being rapidly replaced by on-demand clusters that leverage off-the-shelf general purpose vector coprocessors, converged Ethernet at 40 Gbit/s per link or more, and multicore headless servers. These new HPC on-demand cloud resources resemble what has been called warehouse-scale computing, where each node is homogeneous and headless and the focus is on total cost of ownership and power use efficiency overall. However, HPC has unique requirements that go beyond social networks, web search, and other typical warehouse-scale computing solutions. This article focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Moving to high-performance computing

Since 1994, the TOP500 and Green500 supercomputers (see Resources) have more often than not been built not as custom designs, but rather designed and integrated with off-the-shelf headless servers, converged Ethernet or InfiniBand clustering, and general-purpose graphics processing unit (GP-GPU) coprocessors that aren’t for graphics but rather for single program, multiple data (SPMD) workloads. The trend in high-performance computing (HPC) away from exotic custom processor and memory interconnection design toward off-the-shelf warehouse-scale computing is based on the need to control total cost of ownership, increase power efficiency, and balance operational expenditure (OpEx) and capital expenditure (CapEx) for both start-up and established HPC operations. This means that you can build your own small cluster with similar methods and use HPC warehouse-scale resources on demand when you need them.

The famous 3D torus interconnection that Cray and others used may never fully go away (today, the TOP500 is one-third massively parallel processors [MPPs] and two-thirds cluster architecture for top performers), but focus on efficiency and new OpEx metrics like Green500 Floating Point Operations (FLOPs)/Watt are driving HPC and keeping architecture focused on clusters. Furthermore, many applications of interest today are data driven (for example, digital video analytics), so many systems not only need traditional sequential high performance storage for HPC checkpoints (saved state of a long-running job) but more random access to structured (database) and unstructured (files) large data sets. Big data access is a common need of traditional warehouse-scale computing for cloud services as well as current and emergent HPC workloads. So, warehouse-scale computing is not HPC, but HPC applications can leverage data center-inspired technology for cloud HPC on demand, if designed to do so from the start.

Power to computing

Power to computing can be measured in terms of a typical performance metric per Watt, for example, FLOPS/Watt or input/output operations per second per Watt for computing and I/O, respectively. Furthermore, any computing facility can be seen as a plant for converting Watts into computational results, and a gross measure of good plant design is power usage effectiveness (PUE), which is simply the ratio of total facility power to the power delivered to computing equipment. A good value today is 1.2 or less. Reasons for higher PUEs include inefficient cooling methods, administrative overhead, and a lack of purpose-built facilities compared to cloud data centers (see Resources for a link to more information).

Changes in scalable computing architecture focus over time include:

  • Early focus on a fast single processor (uniprocessor) to push the stored-program arithmetic logic unit central processor to the highest clock rates and instruction throughput possible:
    • John von Neumann, Alan Turing, Robert Noyce (founder of Intel), and Ted Hoff (Intel universal processor proponent), along with Gordon Moore, saw initial scaling as a challenge of scaling digital logic and clocking a processor as fast as possible.
    • Up to at least 1984 (and maybe longer), the general rule was “the processor makes the computer.”
    • Cray Computer designed vector processors (X-MP, Y-MP) and distributed-memory multiprocessors interconnected by a six-way 3D torus for custom MPP machines. But this was unique to the supercomputing world.
    • IBM’s focus early on was scalable mainframes and fast uniprocessors until the announcement of the IBM® Blue Gene® architecture in 1999 using a multicore IBM® POWER® architecture system-on-a-chip design and a 3D torus interconnection. The current TOP500 includes many Blue Gene systems, which have often occupied the LINPACK-measured TOP500 number one spot.
  • More recently since 1994, HPC is evolving to a few custom MPP and mostly off-the-shelf clusters, using both custom interconnections (for example, Blue Gene and Cray) and off-the-shelf converged Ethernet (10G, 40G) and InfiniBand:
    • The TOP500 has become dominated by clusters, which comprise the majority of top-performing HPC solutions (two-thirds) today.
    • As shown in the TOP500 chart by architecture since 1994, clusters and MPP dominate today (compared to single instruction, multiple data [SIMD] vector; fast uniprocessors; symmetric multiprocessing [SMP] shared memory; and other, more obscure architectures).
    • John Gage at Sun Microsystems (now Oracle) stated that “the network is the computer,” referring to distributed systems and the Internet, but low-latency networks in clusters likewise become core to scaling.
    • Coprocessors interfaced to cluster nodes via memory-mapped I/O, including GP-GPU and even hybrid field-programmable gate array (FPGA) processors, are used to accelerate specific computing workloads on each cluster node.
  • Warehouse-scale computing and the cloud emerge with focus on MapReduce and what HPC would call embarrassingly parallel applications:
    • The TOP500 is measured with LINPACK and FLOPs and so is not focused on cost of operations (for example, FLOPs/Watt) or data access. Memory access is critical, but storage access is not so critical, except for job checkpoints (so a job can be restarted, if needed).
    • Many data-driven applications have emerged in the new millennium, including social networks, Internet search, global geographical information systems, and analytics associated with more than a billion Internet users. This is not HPC in the traditional sense but warehouse-computing operating at a massive scale.
    • Luiz André Barroso states that “the data center is the computer,” a second shift away from processor-focused design. The data center is highly focused on OpEx as well as CapEx, and so is a better fit for HPC where FLOPs/Watt and data access matter. These Google data centers have a PUE less than 1.2—a measure of total facility power consumed divided by power used for computation. (Most computing enterprises have had a PUE of 2.0 or higher, so, 1.2 is very low indeed. See Resources for more information.)
    • Amazon launched Amazon Elastic Compute Cloud (Amazon EC2), which is best suited to web services but has some scalable and at least high-throughput computing features (see Resources).
  • On-demand cloud HPC services expand, with an emphasis on clusters, storage, coprocessors and elastic scaling:
    • Many private and public HPC clusters occupy TOP500, running Linux® and using common open source tools, such that users can build and scale applications on small clusters but migrate to the cloud for on-demand large job handling. Companies like Penguin Computing, which features Penguin On-Demand, leverage off-the-shelf clusters (InfiniBand and converged 10G/40G Ethernet), Intel or AMD multicore headless nodes, GP-GPU coprocessors, and scalable redundant array of independent disks (RAID) storage.
    • IBM Platform computing provides IBM xSeries® and zSeries® computing on demand with workload management tools and features.
    • Numerous universities and start-up companies leverage HPC on demand with cloud services or off-the-shelf clusters to complement their own private services. Two that I know well are the University of Alaska Arctic Region Supercomputing Center (ARSC) Pacman (Penguin Computing) and the University of Colorado JANUS cluster supercomputer. A common Red Hat Enterprise Linux (RHEL) open source workload tool set and open architecture allow for migration of applications from private to public cloud HPC systems.

Figure 1 shows the TOP500 move to clusters and MPP since the mid-1990s.

Figure 1. TOP500 evolution to clusters and MPP since 1994

The cloud HPC on-demand approach requires well-defined off-the-shelf clustering, compute nodes, and tolerance for WAN latency to transfer workload. As such, these systems are not likely to overtake top spots in the TOP500, but they are likely to occupy the Green500, provide efficient scaling for many workloads, and now comprise the majority of the TOP500.

High-definition digital video computer vision: a scalable HPC case study

Most of us deal with compressed digital video, often in Motion Picture Experts Group (MPEG) 4 format, and don’t think about the scale of even a high-definition (HD) web cam in terms of data rates and the processing needed to apply simple image analysis. Digital cinema workflow and post-production experts know the challenges well. They deal with 4K (roughly 4-megapixel) individual frames or much higher resolutions. These frames might be compressed, but they are not compressed over time in groups of pictures as MPEG does, and they often use lossless rather than lossy compression.

To start to understand an HPC problem that involves FLOPs, uncompressed data, and tools that can be used for scale-up, let’s look at a simple edge-finder transform. The transform-example.zip includes Open Computer Vision (OpenCV) algorithms to transform a real-time web cam stream into a Sobel or Canny edge view in real time (a sketch of the core loop follows Figure 2). See Figure 2.

Figure 2. HD video Canny edge transform
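
The transform-example.zip itself is not reproduced here, but the core capture-transform-display loop of such an edge viewer is small. The sketch below uses OpenCV's C++ API and assumes device 0 is the web cam; the Canny thresholds are arbitrary starting values.

    // Real-time Canny edge view of a web cam stream (press Esc to quit).
    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::VideoCapture cap(0);              // default web cam
        if (!cap.isOpened()) return 1;
        cv::Mat frame, gray, edges;
        for (;;) {
            cap >> frame;                     // grab a frame
            if (frame.empty()) break;
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
            cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.5); // suppress noise
            cv::Canny(gray, edges, 50, 150);  // hysteresis thresholds
            cv::imshow("canny", edges);
            if (cv::waitKey(1) == 27) break;
        }
        return 0;
    }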

Leveraging cloud HPC for video analytics allows for deployment of more intelligent smart phone applications. Perhaps phone processors will someday be able to handle real-time HD digital video facial recognition, but in the meantime, cloud HPC can help. Likewise, data that originates in data centers, like geographic information systems (GIS) data, needs intensive processing for analytics to segment scenes, create point clouds of 3D data from stereo vision, and recognize targets of interest (such as well-known landmarks).

Augmented reality and video analytics

Video analytics involves collection of structured (database) information from unstructured video (files) and video streams—for example, facial recognition. Much of the early focus has been on security and automation of surveillance, but applications are growing fast and are being used now for more social applications, e.g. facial recognition, perhaps not to identify a person but to capture and record their facial expression and mood (while shopping). This technology can be coupled with augmented reality, whereby the analytics are used to update a scene with helpful information (such as navigation data). Video data can be compressed and uplinked to warehouse-scale data centers for processing so that the analytics can be collected and information provided in return not available on a user’s smart phone. The image processing is compute intensive and involves big data storage, and likely a scaling challenge (see Resources for a link to more information).

Sometimes, when digital video is collected in the field, the data must be brought to the computational resources; but where possible, digital video should only be moved when necessary, to avoid encoding to compress and decoding to decompress for viewing. Specialized coprocessors known as codecs (coder/decoder) are designed to decode video without software, and coprocessors to render graphics (GPUs) exist, but to date, no CV coprocessors are widely available. Khronos announced an initiative to define hardware acceleration for OpenCV in late 2012, but work has only just begun (see Resources). So, to date, CV remains more of an HPC application that has had attention primarily from digital cinema, but this is changing rapidly based on interest in CV on mobile devices and in the cloud.

Although all of us imagine CV implemented on mobile robots, in heads-up displays for intelligent transportation, and on visors (like Google Goggles that are now available) for personal use, it’s not clear that all of the processing must be done on the embedded devices, or that it should be even if it could. The reason is data: Without access to correlated data center data, CV information has less value. For example, how much value is there in knowing where you are without more mapping and GIS data to help you decide where you want to go next? Real-time CV and video analytics are making progress, but they face many challenges, including huge storage requirements, high network bit rates for transport, and significant processing demands for interpretation. Whether the processing is done by cloud HPC clusters or embedded systems, it’s clear that concurrency and parallel processing will play a huge role. Try running a simple Hough linear transform on the 12-megapixel cactus photo I took, and you’ll see why HPC might be needed just to segment a scene at 60 frames/s.

The challenge of making algorithms parallel

HPC with both clusters and MPP requires coding methods that employ many threads of execution on each multicore node and that use the Message Passing Interface (MPI) and basic methods to map data and code to processing resources and collect results. For digital video, the mapping can be simple if done at the frame level. Mapping within a frame is more difficult, but still not bad apart from the steps of segmenting and restitching frames together.

The power of MapReduce

The MapReduce concept is generally associated with Google and the open source Hadoop project (from Apache Software Foundation), but any parallel computation must employ this concept to obtain speed-up, whether done at a node or cluster level with Java™ technology or at a thread level for a nonuniform memory access (NUMA) shared memory. For applications like digital video analytics, the mapping is data intensive, so it makes sense to move the function to the data (in the mapping stage), but either way, the data to be processed must be mapped and processed and the results combined. A clever mapping avoids data dependencies and the need for synchronization as much as possible. In the case of image processing, for CV, the mapping could be within a frame, at the frame level, or by groups of pictures (see Resources).

Key tools for designing cluster scaling applications for cloud HPC on demand include the following:

  • Threading is the way in which a single application (or Linux process) occupies one address space on one cluster node and can be designed to use all processor cores on that node. Most often, this is done with Portable Operating System Interface for UNIX® (POSIX) Pthreads or with a library like OpenMP, which abstracts the low-level details of POSIX threading. I find POSIX threading to be fairly simple and typically write Pthread code, as can be seen in the hpc_cloud_grid.tar.gz example, which maps threads over the number space for prime number searching (a minimal sketch in the same spirit follows this list).
  • MPI is a library that can be linked into a cluster parallel application to assist with mapping of processing to each node, synchronization, and reduction of results. Although you can use MPI to implement MapReduce, unlike Hadoop it typically moves data (in messages) to program functions running on each node (rather than moving code to the data). In the final video analytics article in this series, I will provide a thread- and MPI-based cluster-scalable version of the capture-transform code. Here, I provide the simple code for a single thread and node to serve as a reference. Run it alongside Linux dstat to monitor CPU, I/O, and storage use. It is a resource-intensive program that computes Sobel and Canny transforms on a 2560×1920-pixel image. It should run on any Linux system with OpenCV and a web cam.
  • Vector SIMD and SPMD processing can be accomplished on Intel and AMD nodes with a switch to enable during compilation or, with more work, by creation of transform kernels in CUDA or OpenCL for off-load to a GPU or GP-GPU coprocessor.
  • OpenCV is highly useful for video analytics, as it includes not only convenient image capture, handling, and display functions but also most of the best image processing transforms used in CV.
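
The hpc_cloud_grid.tar.gz example is not reproduced here; the sketch below is a minimal Pthreads program in the same spirit, splitting the number space into per-thread slices and reducing the per-slice prime counts after joining.

    /* Count primes below N by splitting the range across worker threads. */
    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    struct slice { long lo, hi, count; };

    static int is_prime(long n)
    {
        long d;
        if (n < 2) return 0;
        for (d = 2; d * d <= n; d++)
            if (n % d == 0) return 0;
        return 1;
    }

    static void *worker(void *arg)
    {
        struct slice *s = (struct slice *)arg;
        long n;
        for (n = s->lo; n < s->hi; n++)
            s->count += is_prime(n);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct slice s[NTHREADS];
        long chunk = N / NTHREADS, total = 0;
        int i;

        for (i = 0; i < NTHREADS; i++) {
            s[i].lo = (i == 0) ? 2 : i * chunk;
            s[i].hi = (i == NTHREADS - 1) ? N : (i + 1) * chunk;
            s[i].count = 0;
            pthread_create(&tid[i], NULL, worker, &s[i]);
        }
        for (i = 0; i < NTHREADS; i++) { /* join and reduce */
            pthread_join(tid[i], NULL);
            total += s[i].count;
        }
        printf("primes below %d: %ld\n", N, total);
        return 0;
    }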

The future of on-demand cloud HPC

This article makes an argument for cloud HPC. The goal here is to acquaint you with the idea and some of the challenging yet compelling applications (like CV), as well as to introduce you to methods for programming applications that can scale on clusters and MPP machines. In future articles, I will take the CV example further and adapt it not only for threading but also for MPI so that we can examine how well it scales on cloud HPC (in my case, at ARSC on Pacman or JANUS). My research involves comparison of tightly coupled CV coprocessors (which I am building using an Altera Stratix IV FPGA and call a computer vision processing unit [CVPU]). I am comparing this to what I can achieve with CV on ARSC for the purpose of understanding whether environmental sensing and GIS data are best processed like graphics, with a coprocessor, on a cluster, or perhaps with a combination of the two. The goals for this research are lofty. In the case of CVPU, the CV/graphics Turing-like test I imagine is one in which the scene that the CVPU parses can then be sent to a GPU for rendering. Ideally, the parsed/rendered image would be indistinguishable from the true digital video stream. When rendered scenes and the ability to analyze them reach a common level of fidelity, augmented reality, perceptual computing, and video analytics will have amazing power to transform our lives.

Cloud scaling, Part 2: Tour high-performance cloud system design advances

Learn how to leverage co-processing, nonvolatile memory, interconnection, and storage

Breakthrough device technology requires the system designer to re-think operating and application software design in order to realize the potential benefits of closing the access gap or pushing processing into the I/O path with coprocessors. Explore and consider how the latest memory, compute, and interconnection devices and subsystems can affect your scalable, data-centric, high-performance cloud computing system design. Breakthroughs in device technology can be leveraged for transition between compute-centric and the more balanced data-centric compute architectures.

The author examines storage-class memory and demonstrates how to fill the long-standing performance gap between RAM and spinning disk storage; details the use of I/O bus coprocessors (for processing closer to data); explains how to employ InfiniBand to build low-cost, high performance interconnection networks; and discusses scalable storage for unstructured data.

Computing systems engineering has historically been dominated by scaling processors and dynamic RAM (DRAM) interfaces to working memory, leaving a huge gap between data-driven and computational algorithms (see Resources). Interest in data-centric computing is growing rapidly, along with novel system design software and hardware devices to support data transformation with large data sets.

The data focus in software is no surprise given applications of interest today, such as video analytics, sensor networks, social networking, computer vision and augmented reality, intelligent transportation, machine-to-machine systems, and big data initiatives like IBM’s Smarter Planet and Smarter Cities.

The current wave of excitement is about collecting, processing, transforming, and mining the big data sets:

  • The data focus is leading toward new device-level breakthroughs in nonvolatile memory (storage-class memory, SCM) which brings big data closer to processing.
  • At the same time, input/output coprocessors are bringing processing closer to the data.
  • Finally, low-latency, high-bandwidth off-the-shelf interconnections like InfiniBand are allowing researchers to quickly build 3D torus and fat-tree clusters that used to be limited to the most exotic and expensive custom high-performance computing (HPC) designs.

Yet, the systems software and even system design often remain influenced by out-of-date bottlenecks and thinking. For example, consider threading and multiprogramming. The whole idea came about because of slow disk drive access; what else can a program do while waiting on data but run another one? Sure, we have redundant array of independent disks (RAID) scaling and NAND flash solid-state disks (SSDs), but as noted by IBM Almaden Research, the time scale differences of the access time gap are massive in human terms.

The access time gap between a CPU, RAM, and storage can be measured in terms of typical performance for each device, but perhaps the gap is more readily understood when put into human terms (as IBM Almaden has done for illustrative purposes).

If a typical CPU operation is similar to what a human can do in seconds, then RAM access at 100 times more latency is much like taking a few minutes to access information. However, by the same comparison, disk access at 100,000 times more latency compared to RAM is on the order of months (100 days). (See Figure 1.)
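
The arithmetic behind the analogy is simple. Assuming roughly 1 nanosecond per CPU operation, 100 nanoseconds per RAM access, and 10 milliseconds per random disk access, scaling 1 nanosecond up to 1 human second multiplies all three by 10^9: a RAM access becomes 100 seconds (minutes), and a disk access becomes 10^7 seconds, roughly 115 days (months).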

Figure 1. The data access gap

Many experienced computer engineers have not really thought hard about the 100 to 200 random I/O operations per second (IOPS) that are the mechanical boundary for a disk drive. (Sure, sequential access is as high as hundreds of megabytes per second, but random access remains what it was more than 50 years ago, with the same 15K RPM seek and rotate access latency.)

Finally, as Almaden notes, tape is therefore glacially slow. So, why do we bother? For the capacity, of course. But how can we get processing to the data or data to the processing more efficiently?

Look again at Figure 1. Improvements to NAND flash memory for use in mobile devices and more recently in SSDs have helped to close the gap; however, it is widely believed that NAND flash device technology will be pushed to its limits fairly quickly, as noted by numerous system researchers (see Resources). The floating-gate transistor technology used is already at scaling limits, and pushing it farther is leading to lower reliability, so although it has been a stop-gap for data-centric computing, it is likely not the solution.

Instead, several new nonvolatile RAM (NVRAM) device technologies are likely solutions, including:

  • Phase change RAM (PCRAM): This memory uses a heating element to turn a class of materials known as chalcogenides into either a crystallized or amorphous glass state, thereby storing two states that can be programmed and read, with state retained even when no power is applied. PCRAM appears to show the most promise in the near term for M-type synchronous nonvolatile memory (NVM).
  • Resistive RAM (RRAM): Most often described as a circuit element that is unlike a capacitor, inductor, or resistor, RRAM provides a unique relationship between current and voltage, unlike other well-known devices that store charge or magnetic energy or provide linear resistance to current flow. Materials with properties called memristance have been tested for many decades, but engineers usually avoided them because of their nonlinear properties and the lack of application for them. IEEE fellow Leon Chua describes them in “Memristor: The Missing Circuit Element.” A memristor’s behavior can be summarized as follows: Current flow in one direction causes electrical resistance to increase and in the opposite direction resistance decreases, but the memristor retains the last resistance it had when flow is restarted. As such, it can store a nonvolatile state, be programmed, and have the state read. For details and even some controversy on what is and is not a memristor, see Resources.
  • Spin transfer torque RAM (STT-RAM): A current passed through a magnetic layer can produce a spin-polarized current that, when directed into a magnetic layer, can change its orientation via angular momentum. This behavior can be used to excite oscillations and flip the orientation of nanometer-scale magnetic devices. The main drawback is the high current needed to flip the orientation.

Consult the many excellent entries in Resources for more in-depth information on each device technology.

From a systems perspective, as these devices evolve, where they can be used and how well each might fill the access gap depends on the device’s:

  • Cost
  • Scalability (device integration size must be smaller than a transistor to beat flash; less than 20 nanometers)
  • Latency to program and read
  • Device reliability
  • Perhaps most importantly, durability (how often it can be programmed and erased before it becomes unreliable).

Based on these device performance considerations, IBM has divided SCM into two main classes:

  • S-type: Asynchronous access via an I/O controller. Threading or multiprogramming is used to hide the I/O latency to the device.
  • M-type: Synchronous access via a memory controller. Think about this as wait-states for RAM access in which a CPU core stalls.

Further, NAND SSD would be considered fast storage, accessed via a block-oriented storage controller (much higher I/O rates but similar bandwidth to a spinning disk drive).

It may seem like the elimination of asynchronous I/O for data processing (except, of course, for archive access or cluster scaling) might be a cure-all for data-centric processing. In some sense it is, but systems designers and software developers will have to change habits. The need for I/O latency hiding will largely go away on each node in a system, but it won’t go away completely. Clusters built from InfiniBand deal with node-to-node data-transfer latency with Message Passing Interface or MapReduce schemes and enjoy similar performance to this envisioned SCM node except when booting or when node data exceeds node working RAM size.

So, for scaling purposes, cluster interconnection and I/O latency hiding among nodes in the cluster is still required.

Moving processing closer to data with coprocessors

Faster access to big data is ideal and looks promising, but some applications will always benefit from the alternative approach of moving processing closer to data interfaces. Many examples exist, such as graphics (graphics processing units, GPUs), network processors, protocol-offload engines like the TCP/IP Offload Engine, RAID on chip, encryption coprocessors, and more recently, the idea of computer vision coprocessors. My research involves computer vision and graphics coprocessors, both at scale in clusters and embedded. I am working on what I call a computer vision processing unit, comparing several coprocessors that became more widely pursued with the 2012 announcement of OpenVX by Khronos (see Resources).

In the embedded world, such a method might be described as an intelligent sensor or smart camera, methods in which preliminary processing of raw data is provided by the sensor interface and an embedded logic device or microprocessor, perhaps even a multicore system on a chip (SoC).

In the scalable world, this most often involves use of a coprocessor bus or channel adapter (like PCI Express, PCIe, and Ethernet or InfiniBand); it provides data processing between the data source (network side) and the node I/O controller (host side).

Whether processing should be done or is more efficient when done in the I/O path or on a CPU core has always been a topic of hot debate, but based on an existence proof (GPUs and network processors), clearly they can be useful, waxing and waning in popularity based on coprocessor technology compared to processor. So, let’s take a quick look at some of the methods:

Vector processing for single program, multiple data
Provided today by GPUs, general-purpose GPUs (GP-GPUs), and application processing units (APUs), the idea is that data can be transformed on its way to an output device like a display or sent to a GP-GPU/APU and transformed on a round trip from the host. “General purpose” implies more sophisticated features like double-precision arithmetic compared to single precision only for graphics-specific processing.
Many core
Traditional many-core coprocessor cards (see Resources) are available from various vendors. The idea is to lower cost and power consumption by using simpler, yet numerous cores on the I/O bus, with round-trip offloading of processing to the cards for a more capable but power-hungry and costly full-scale multicore host. Typically, the many-core coprocessor might have an order of magnitude more cores than the host and often includes gigabit or 10G Ethernet and other types of network interfaces.
I/O bus field-programmable gate arrays (FPGAs)
FPGA cards, most often used to prototype a new coprocessor in the early stages of development, can perhaps be used as a solution for low-volume coprocessors as well.
Embedded SoCs
A multicore solution can be used in an I/O device to create an intelligent device like a stereo ranging or time-of-flight camera.
Interface FPGA/configurable programmable logic devices
A digital logic state machine can provide buffering and continuous transformation of I/O data, such as digital video encoding.

Let’s look at an example based on offload in the I/O path. Data transformation has obvious value for applications like the decoding of MPEG-4 digital video, with a GPU coprocessor in the path between the player and a display, as shown in Figure 2 for the Linux® MPlayer Video Decode and Presentation API for Unix (VDPAU) software interface to NVIDIA MPEG decoding on the GPU.

Figure 2. Simple video decode offload example

Likewise, any data processing or transformation that can be done in-bound or out-bound from a CPU host may have value, especially if the coprocessor can provide processing at a lower cost, with greater efficiency, or with lower power consumption based on purpose-built processors compared to general-purpose CPUs.

To start to understand a GP-GPU compared to a multicore coprocessor approach, try downloading the two examples of a point spread function to sharpen the edges of an image: the threaded transform example compared with the GPU transform example (the core convolution is sketched below). Both provide the same 320×240-pixel transformation, but in one case, the Compute Unified Device Architecture (CUDA) C code provided requires a GPU or GP-GPU coprocessor and, in the other case, either a multicore host or a many-core (for example, Intel MIC) coprocessor.
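
The downloadable examples are not reproduced here, but the heart of a point-spread-function sharpen is a small convolution. The serial C++ sketch below applies a 3x3 sharpening kernel to an 8-bit grayscale buffer; the threaded and CUDA versions essentially partition the outer row loop across workers.

    // 3x3 sharpen (point spread function) over an 8-bit grayscale image.
    #include <vector>
    #include <algorithm>
    #include <cstdint>

    void sharpen(const std::vector<uint8_t> &in, std::vector<uint8_t> &out,
                 int w, int h)
    {
        // Center-weighted kernel: 9x center minus the 8-neighborhood.
        static const int k[3][3] = { { -1, -1, -1 },
                                     { -1,  9, -1 },
                                     { -1, -1, -1 } };
        out = in; // border pixels are left unmodified
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int acc = 0;
                for (int j = -1; j <= 1; j++)
                    for (int i = -1; i <= 1; i++)
                        acc += k[j + 1][i + 1] * in[(y + j) * w + (x + i)];
                out[y * w + x] = (uint8_t)std::min(255, std::max(0, acc));
            }
        }
    }

    int main()
    {
        const int w = 320, h = 240;
        std::vector<uint8_t> img(w * h, 128), out;
        img[(h / 2) * w + w / 2] = 255; // a single bright pixel
        sharpen(img, out, w, h);
        return 0;
    }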

So which is better?

Neither approach is clearly better, mostly because the NVRAM solutions have not yet been made widely available (except as expensive battery-backed DRAM or as S-type SCM from IBM Texas Memory Systems Division), and moving processing into the I/O data path has traditionally involved less friendly programming. Both are changing, though: Coprocessors are adopting higher-level languages like the Open Compute Language (OpenCL), in which code written for multicore hosts runs equally well on Intel MIC or Altera Stratix IV/V architectures.

Likewise, all of the major computer systems companies are working feverishly to release SCM products, with PCRAM the most likely to be available first. My advice is to assume that both will be with us for some time and operating systems and applications must be able to deal with both. The memristor, or RRAM, includes a vision that resembles Isaac Asimov’s fictional positronic brain in which memory and processing are fully integrated as they are in a human neural system but with metallic materials. The concept of fully integrated NVM and processing is generally referred to as processing in memory (PIM) or neuromorphic processing (see Resources). Scalable NVM integrated processing holds extreme promise for biologically inspired intelligent systems similar to the human visual cortex, for example. Pushing toward the goal of integrated NVM, with PIM from both sides, is probably a good approach, so I plan to keep up with and keep working on systems that employ both methods—coprocessors and NVM. Nature has clearly favored direct, low-level, full integration of PIM at scale for intelligent systems.

Scaling nodes with InfiniBand interconnection

System designers always have to consider the trade-off between scaling up each node in a system and scaling out a solution that uses networking or more richly interconnected clustering to scale processing, I/O, and data storage. At some point, scaling the memory, processing, and storage a single node can integrate hits a practical limit in terms of cost, power efficiency, and size. It is also often more convenient from a reliability, availability, and servicing perspective to spread capability over multiple nodes so that if one needs repair or upgrade, others can continue to provide service with load sharing.

Figure 3 shows a typical InfiniBand 3D torus interconnection.

Figure 3. Example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)

In Figure 3, the 4x4x4 torus shown is for the San Diego Supercomputer Center (SDSC) Gordon supercomputer, as documented by Mellanox, which uses 36-port InfiniBand switches to connect nodes to each other and to storage I/O.

InfiniBand, iSCSI over Converged Enhanced Ethernet (CEE), and Fibre Channel are the most often used scalable storage interfaces for access to big data. This storage area network (SAN) scaling for RAID arrays is used to host distributed, scalable file systems like Ceph, Lustre, the Apache Hadoop Distributed File System, and the IBM General Parallel File System (GPFS). Use of CEE and InfiniBand for storage access through the OpenFabrics Alliance SCSI Remote Direct Memory Access (RDMA) Protocol and iSCSI Extensions for RDMA is a natural fit for SAN storage integrated with an InfiniBand cluster. Storage is viewed more as a distributed archive of unstructured data that is searched or mined and loaded into node NVRAM for cluster processing. Higher-level data-centric cluster processing methods like Hadoop MapReduce can also be used to bring code (software) to the data at each node. These big-data-related topics are described in more detail in the last part of this four-part series.

The future of data-centric scaling

This article makes an argument for systems design and architecture that move processors closer to data-generating and data-consuming devices, as well as simplification of the memory hierarchy to include fewer levels, leveraging low-latency, scalable NVM devices. This defines a data-centric node design that can be further scaled with low-latency, off-the-shelf interconnection networks like InfiniBand. The main challenge with data-centric computing is not just instructions per second or floating-point operations per second, but rather IOPS and the overall power efficiency of data processing.

In Part 1 of this series, I uncovered methods and tools to build a compute node and small cluster application that can scale with on-demand HPC by leveraging the cloud. In this article I detailed such high-performance system design advances as co-processing, nonvolatile memory, interconnection, and storage.

In Part 3 of this series, I provide more in-depth coverage of a specific data-centric computing application: video analytics. Video analytics includes applications such as facial recognition for security and computer forensics, use of cameras for intelligent transportation monitoring, retail and marketing applications that involve integration of video (for example, visualizing yourself in a suit you’re considering from a web-based catalog), as well as a wide range of computer vision and augmented reality applications that are being invented daily. Although many of these applications involve embedded computer vision, most also require digital video analysis, transformation, and generation in cloud-based scalable servers. Algorithms like the Sobel transformation can be run on typical servers, but algorithms like the generalized Hough transform, facial recognition, image registration, and stereo (point cloud) mapping, for example, require the NVM and coprocessor approaches this article discussed for scaling.

In the last part of the series, I deal with big data issues.

Cloud scaling, Part 3: Explore video analytics in the cloud

Using methods, tools, and system design for video and image analysis, monitoring, and security

Explore and consider methods, tools, and system design for video and image analysis with cloud scaling. As described in earlier articles in this series, video analytics requires a more balanced data-centric compute architecture compared to traditional compute-centric, scalable, high-performance computing. The author examines the use of OpenCV and similar tools for digital video analysis and methods to scale this analysis using cluster and distributed system design.

The use of coprocessors designed for video analytics and the new OpenVX hardware acceleration discussed in previous articles can be applied to the computer vision (CV) examples presented in this article. This new data-centric technology for CV and video analytics requires the system designer to rethink application software and system design to meet demanding requirements, such as real-time monitoring and security for large public facilities and infrastructure, as well as a more entertaining, interactive, and safer world.

Public safety and security

The integration of video analytics in public places is perhaps the best way to ensure public safety, providing digital forensic capabilities to law enforcement and the potential to increase detection of threats and prevention of public safety incidents. At the same time, this need has to be balanced with rights to privacy, which can become a contentious issue if these systems are abused or not well understood. For example, the extension of facial detection, as shown in Figure 1, to facial recognition has obvious identification capability and can be used to track an individual as he or she moves from one public place to another. To many people, facial analytics might be seen as an invasion of privacy, and use of CV and video analytics should adhere to surveillance and privacy rights laws and policies. Any product or service developer might want to start by considering the best practices outlined by the Federal Trade Commission (FTC; see Resources).

Digital video, using standards such as those from the Moving Picture Experts Group (MPEG) for encoding video to compress, transport, uncompress, and display it, has led to a revolution in computing ranging from social networking media and amateur digital cinema to improved training and education. Tools for decoding and consuming digital video are in wide daily use, but video analytics also needs tools to encode video and to analyze uncompressed video frames, such as Open Source Computer Vision (OpenCV). One of the readily available and quite capable tools for encoding and decoding digital video is FFmpeg; for still images, the GNU Image Manipulation Program (GIMP) is quite useful (see Resources for links). With these three basic tools, an open source developer is fully equipped to start exploring CV and video analytics. Before exploring these tools and development methods, however, let’s first define these terms better and consider applications.

The first article in this series, Cloud scaling, Part 1: Build your own and scale with HPC on demand, provided a simple example using OpenCV that implements a Canny edge transformation on continuous real-time video from a Linux® web cam. This is an example of a CV application that you could use as a first step in segmenting an image. In general, CV applications involve acquisition; digital image formats for pixels (picture elements that represent points of illumination), images, and sequences of them (movies); processing and transformation; segmentation; recognition; and ultimately scene descriptions. The best way to understand what CV encompasses is to look at examples. Figure 1 shows face and facial feature detection analysis using OpenCV. Note that in this simple example, which uses the Haar Cascade method (a machine learning algorithm) for detection, the algorithm best detects faces and eyes that are not occluded (for example, my youngest son’s face is turned to the side), not shadowed, and not squinting. This is perhaps one of the most important observations that can be made regarding CV: it’s not a trivial problem. Researchers in this field often note that although much progress has been made since its advent more than 50 years ago, most applications still can’t match the scene segmentation and recognition performance of a 2-year-old child, especially when the ability to generalize and perform recognition in a wide range of conditions (lighting, size variation, orientation, and context) is considered.
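
As a rough idea of what sits behind Figure 1, here is a minimal sketch of Haar Cascade face detection with the OpenCV 2.x C++ API. It is not the exact code used to produce the figure, and the input file name is a placeholder.

#include <opencv2/objdetect/objdetect.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

int main() {
    // Stock frontal-face model shipped with OpenCV; adjust the path to your install.
    cv::CascadeClassifier face;
    if (!face.load("haarcascade_frontalface_alt.xml")) return 1;

    cv::Mat img = cv::imread("family.jpg");    // placeholder input image
    if (img.empty()) return 1;

    cv::Mat gray;
    cv::cvtColor(img, gray, CV_BGR2GRAY);
    cv::equalizeHist(gray, gray);              // normalize lighting before detection

    std::vector<cv::Rect> faces;
    face.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(30, 30));

    for (size_t i = 0; i < faces.size(); ++i)  // draw a box around each detection
        cv::rectangle(img, faces[i], cv::Scalar(0, 255, 0), 2);

    cv::imwrite("family-detected.jpg", img);
    return 0;
}

A typical eye detector then runs detectMultiScale again with an eye cascade inside each detected face rectangle, which is why occluded or squinting eyes are missed.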

Figure 1. Using OpenCV for face and facial feature detection

To help you understand the analytical methods used in CV, I have created a small test set of images from the Anchorage, Alaska, area that is available for download. The images have been processed using GIMP and OpenCV. I developed C/C++ code to use the OpenCV application programming interface with a Linux web cam, precaptured images, or MPEG movies. The use of CV to understand video content (sequences of images), either in real time or from precaptured databases of image sequences, is typically referred to as video analytics.
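
For live video, the Part 1 example applied a Canny edge transform to every web cam frame. The sketch below shows the general shape of such a loop in the OpenCV 2.x C++ API; this is a reconstruction with assumed blur and threshold parameters, not the article’s exact code.

#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

int main() {
    cv::VideoCapture cap(0);                 // open the default Linux web cam
    if (!cap.isOpened()) return 1;

    cv::Mat frame, gray, edges;
    for (;;) {
        cap >> frame;                        // grab the next frame
        if (frame.empty()) break;
        cv::cvtColor(frame, gray, CV_BGR2GRAY);
        cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.5);  // suppress noise first
        cv::Canny(gray, edges, 50, 150);     // low/high hysteresis thresholds
        cv::imshow("edges", edges);
        if (cv::waitKey(30) >= 0) break;     // any key quits
    }
    return 0;
}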

Defining video analytics

Video analytics is broadly defined as analysis of digital video content from cameras (typically visible light, but it could be from other parts of the spectrum, such as infrared) or stored sequences of images. Video analytics involves several disciplines but at least includes:

  • Image acquisition and encoding. Capture as a sequence of images or groups of compressed images. This stage of video analytics can be complex, including photometer (camera) technology, analog decoding, digital formats for arrays of light samples (pixels) in frames and sequences, and methods of compressing and decompressing this data.
  • CV. The inverse of graphical rendering: acquired scenes are converted into descriptions, rather than a scene being rendered from a description. Most often, CV assumes that this process of using a computer to “see” should operate wherever humans do, which often distinguishes it from machine vision. The goal of seeing as a human does most often means that CV solutions employ machine learning.
  • Machine vision. Again, the inverse of rendering, but most often in a well-controlled environment for the purpose of process control (for example, inspecting printed circuit boards or fabricated parts to make sure they are geometrically correct within tolerances).
  • Image processing. A broad application of digital signal processing methods to samples from photometers and radiometers (detectors that measure electromagnetic radiation) to understand the properties of an observation target.
  • Machine learning. Algorithms refined through training data, such that their performance improves and generalizes when they are tested with new data.
  • Real-time and interactive systems. Systems that must respond by a deadline relative to a request for service, or at least with a quality of service that meets service-level agreements (SLAs) with customers or users of the services.
  • Storage, networking, database, and computing. All are required to process digital data used in video analytics, but a subtle yet important distinction is that this is an inherently data-centric compute problem, as was discussed in Part 2 of this series.

Video analytics, therefore, is broader in scope than CV and is a system design problem that might include mobile elements like a smart phone (for example, Google Goggles) and cloud-based services for the CV aspects of the overall system. For example, IBM has developed a video analytics system known as the video correlation and analysis suite (VCAS), for which the IBM Travel and Transportation solution brief Smarter Safety and Security Solution for Rail [PDF] is available; it is a good example of a system design concept. Detailed focus on each system design discipline involved in a video analytics solution is beyond the scope of this article, but many pointers to more information for system designers are available in Resources. The rest of this article focuses on CV processing examples and applications.

Basic structure of video analytics applications

You can break the architecture of cloud-based video analytics systems down into two major segments: embedded intelligent sensors (such as smart phones, tablets with a camera, or customized smart cameras) and cloud-based processing for analytics that can’t be directly computed on the embedded device. Why break the architecture into two segments rather than solving the whole problem in the smart embedded device? Embedding all CV processing in transportation systems, smart phones, and products is not always practical. Even when a smart camera is embedded, the compressed video or scene description is often back-hauled to a cloud-based video analytics system simply to offload the resource-limited embedded device. Perhaps more important than resource limitations, though, is that video transported to the cloud for analysis allows for correlation with larger data sets and annotation with up-to-date global information for augmented reality (AR) returned to the devices.

The smart camera devices for applications like gesture and facial expression recognition must be embedded. However, more intelligent inference to identify people and objects and fully parse scenes is likely to require scalable data-centric systems that can be more efficiently scaled in a data center. Furthermore, data processing acceleration at scale ranging from the Khronos OpenVX CV acceleration standards to the latest MPEG standards and feature-recognition databases are key to moving forward with improved video analytics, and two-segment cloud plus smart camera solutions allow for rapid upgrades.

With sufficient data-centric computing capability leveraging the cloud and smart cameras, the dream of inverse rendering can perhaps be realized: in an ultimate “Turing-like” test for CV, a scene that is parsed and then re-rendered for display would be indistinguishable from direct video to a remote viewer. This is essentially done now in digital cinema with photorealistic rendering, but that rendering is nowhere close to real time or interactive.

Video analytics apps: Individual scenarios

Killer applications for CV and video analytics are being thought of every day, some perhaps years from realization because of computing requirements or implementation cost. Nevertheless, here is a list of interesting applications:

  • AR views of scenes for improved understanding. If you have ever looked at, for example, a landing plane and thought, I wish I could see the cockpit view with instrumentation, this is perhaps possible. I worked in Space Shuttle mission control long ago, where a large development team meticulously re-created a view of the avionics for ground controllers that shadowed what astronauts could see, all of it graphical; but imagine fusion of both video and graphics to annotate and re-create scenes with metadata. A much simplified example is presented in concept in this article to show how an aircraft observed via a tablet computer camera could be annotated with attitude and altitude estimation data (see the example in this article).
  • Skeletal transformations to track the movement and estimate the intent and trajectory of an animal that might jump onto a highway. See the example in this article.
  • Fully autonomous or mostly autonomous vehicles with human supervisory control only. Think of the steps between today’s cruise control and tomorrow’s full autonomous car. Cars that can parallel park themselves today are a great example of this stepwise development.
  • Beyond face detection to reliable recognition and, perhaps more importantly, for expression feedback. Is the driver of a semiautonomous vehicle aggravated, worried, surprised?
  • Virtual shopping (AR to try products). Shoppers can see themselves in that new suit.
  • Signage that interacts with viewers. This is based on expressions, likes and dislikes, and data that the individual has made public.
  • Two-way television and interactive digital cinema. Entertainment for which viewers can influence the experience, almost as if they were actors in the content.
  • Interactive telemedicine. This is available any time with experts from anywhere in the world.

I make no attempt in this article to provide an exhaustive list of applications, but I explore more by looking closely at both AR (annotated views of the world through a camera and display—think heads-up displays such as fighter pilots have) and skeletal transformations for interactive tracking. To learn more beyond these two case studies and for more in-depth application-specific uses of CV and video analytics in medicine, transportation safety, security and surveillance, mapping and remote sensing, and an ever-increasing list of system automation that includes video content analysis, consult the many entries in Resources. The tools available can help anyone with computer engineering skills get started. You can also download a larger set of test images as well as all OpenCV code I developed for this article.

Example: Augmented reality

Real-time video analytics can change the face of reality by augmenting the view a consumer has with a smart phone held up to products or our view of the world (for example, while driving a vehicle), and it can allow for a much more interactive experience in everything from movies and television to shopping, travel, and how we work. In AR, the ideal solution provides a seamless transition from scenes captured with digital video to scenes generated by rendering for a user in real time, mixing both digital video and graphics in an AR view for the user. Poorly designed AR systems distract a user from normal visual cues, but a well-designed AR system can increase overall situation awareness, fusing metrics with visual cues (think fighter pilot heads-up displays).

The use of CV and video analytics in intelligent transportation systems has significant value for safety improvement, and CV may eventually be the key technology for self-driving vehicles. This appears to be the case based on the U.S. Defense Advanced Research Projects Agency challenge and the Google car, although use of the full spectrum, with forward-looking infrared and instrumentation in addition to CV, has made autonomous vehicles possible. Another potentially significant application is air traffic safety, especially for airports, to detect and prevent runway incursion scenarios. The imagined AR view of an aircraft on final approach at Ted Stevens airport in Anchorage shows a Hough linear transform that might be used to segment and estimate aircraft attitude and altitude visually, as shown in Figure 2. Runway incursion safety is of high interest to the U.S. Federal Aviation Administration (FAA), and statistics for these events can be found in Resources.
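
The sketch below shows the kind of Hough step that might sit behind Figure 2, using the probabilistic Hough transform in the OpenCV 2.x C++ API. The file name and thresholds are assumptions, and the attitude and altitude estimation from the resulting line segments is left out.

#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("approach.jpg");   // placeholder frame of the aircraft
    if (img.empty()) return 1;

    cv::Mat gray, edges;
    cv::cvtColor(img, gray, CV_BGR2GRAY);
    cv::Canny(gray, edges, 60, 180);            // edge map feeds the Hough step

    std::vector<cv::Vec4i> lines;               // each entry is x1, y1, x2, y2
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180, 80, 30, 10);

    for (size_t i = 0; i < lines.size(); ++i)   // overlay the detected segments
        cv::line(img, cv::Point(lines[i][0], lines[i][1]),
                 cv::Point(lines[i][2], lines[i][3]), cv::Scalar(0, 0, 255), 2);

    cv::imwrite("approach-hough.jpg", img);
    return 0;
}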

Figure 2. AR display example

For intelligent transportation, drivers will most likely want to participate even as systems become more intelligent, so a balance of automation and human participation and intervention should be kept in mind for autonomous or semiautonomous vehicles.

Skeletal transformation examples: Tracking movement for interactive systems

Skeletal transformations are useful for applications like gesture recognition or gait analysis of humans or animals: any application where the motion of a body’s skeleton (rigid members) must be tracked can benefit from a skeletal transformation. Most often, this transformation is applied to bodies or limbs in motion, which further enables the use of background elimination for foreground tracking. However, it can still be applied to a single snapshot, as shown in Figure 3, where a picture of a moose is first converted to a gray map, then a threshold binary image; finally, the medial distance is found for each contiguous region and thinned to a single pixel, leaving just the skeletal structure of each object. Notice that the ears on the moose are back, an indication of the animal’s intent (a higher-resolution skeletal transformation might be able to detect this as well as the gait of the animal).
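
The sketch below shows one common way to compute such a skeleton with the OpenCV 2.x C++ API: iterative morphological thinning. It is a stand-in for the medial-distance approach described above, and the input file name and threshold value are assumptions.

#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

int main() {
    cv::Mat img = cv::imread("moose.jpg", 0);   // load directly as a gray map
    if (img.empty()) return 1;

    cv::Mat bin;
    cv::threshold(img, bin, 128, 255, cv::THRESH_BINARY);   // threshold binary image

    cv::Mat skel = cv::Mat::zeros(bin.size(), CV_8UC1);
    cv::Mat element = cv::getStructuringElement(cv::MORPH_CROSS, cv::Size(3, 3));
    cv::Mat eroded, opened, ridge;
    for (;;) {
        cv::erode(bin, eroded, element);
        cv::dilate(eroded, opened, element);    // morphological opening
        cv::subtract(bin, opened, ridge);       // ridge pixels lost by the opening
        cv::bitwise_or(skel, ridge, skel);      // accumulate them into the skeleton
        eroded.copyTo(bin);
        if (cv::countNonZero(bin) == 0) break;  // image fully eroded: done
    }

    cv::imwrite("moose-skeleton.png", skel);
    return 0;
}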

Figure 3. Skeletal transformation of a moose

Skeletal transformations can certainly be useful in tracking animals that might cross highways or charge a hiker, but the transformation has also become of high interest for gesture recognition in entertainment, such as in the Microsoft® Kinect® software developer kit (SDK). Gesture recognition can be used for entertainment but also has many practical purposes, such as automatic sign language recognition (not yet available as a product but a concept in research). Certainly, skeletal transformation CV can analyze the human gait for diagnostic or therapeutic purposes in medicine or capture human movement for animation in digital cinema.

Skeletal transformations are widely used in gesture-recognition systems for entertainment. Creative and Intel have teamed up to create an SDK for Windows® called the Creative* Interactive Gesture Camera Developer Kit (see Resources for a link) that uses a time-of-flight light detection and ranging sensor, camera, and stereo microphone. This SDK is similar to the Kinect SDK but intended as early access for developers to build gesture-recognition applications for the device. The SDK is amazingly affordable and could become the basis for some breakthrough consumer devices now that it is in the hands of a broad development community. To get started, you can purchase the device from Intel and then download the Intel® Perceptual Computing SDK. Demo images are included along with numerous additional SDK examples to help developers understand what the device can do. You can use the finger-tracking example shown in Figure 4 right away just by installing the SDK for Microsoft Visual Studio® and running the Gesture Viewer sample.

Figure 4. Skeletal transformation using the Intel Perceptual Computing SDK and Creative Interactive Gesture Camera Developer Kit


The future of video analytics

This article makes an argument for the use of video analytics primarily to improve public safety; for entertainment, social networking, telemedicine, and medical augmented diagnostics; and to help consumers envision products and services. Machine vision has quietly helped automate industry and process control for years, but CV and video analytics in the cloud now show promise for providing vision-based automation in the everyday world, where the environment is not well controlled. This will be a challenge both in terms of algorithms for image processing and machine learning and in terms of the data-centric computer architectures discussed in this series. The challenges for high-performance video analytics (in terms of receiver operating characteristics and throughput) should not be underestimated, but with careful development, this rapidly growing technology promises a wide range of new products and even human vision system prosthetics for those with sight impairments or loss of vision. Based on the value of vision to humans, no doubt it is also fundamental to intelligent computing systems.

Downloads

  • OpenCV Video Analytics Examples: va-opencv-examples.zip (600KB)
  • Simple images for use with OpenCV: example-images.zip (6474KB)
  • GPU accelerated image transform: sharpenCUDA.zip (644KB)
  • Grid threaded comparison: hpc_dm_cloud_grid.zip (1.08MB)
  • Simple image for transform benchmark: Cactus-320×240-pixel.ppm.zip (206KB)
  • Continuous HD digital camera transform example: transform-example.zip (123KB)
  • Grid threaded prime generator benchmark: hpc_cloud_grid.tar.gz (3KB)
  • High-resolution image for transform benchmark: Cactus-12mpixel.zip (12288KB)
Posted in Apps Development, CLOUD, Computer Languages, Computer Software, Computer Vision, GPU (CUDA), GPU Accelareted, Image Processing, OpenCV, PARALLEL, Project Related, Video | Leave a Comment »

Architectures of Mobile Cloud Computing

Posted by Hemprasad Y. Badgujar on August 30, 2014


“Mobile Cloud Computing at its simplest, refers to an infrastructure where both the data storage and the data processing happen outside of the mobile device. Mobile cloud applications move the computing power and data storage away from mobile phones and into the cloud, bringing applications and mobile computing to not just smartphone users but a much broader range of mobile subscribers”.

From the concept of MCC, the general architecture of MCC is shown in Figure 1. Mobile devices are connected to the mobile networks via base stations (e.g., a base transceiver station (BTS), access point, or satellite) that establish and control the connections (air links) and functional interfaces between the networks and mobile devices. Mobile users’ requests and information (e.g., ID and location) are transmitted to the central processors that are connected to servers providing mobile network services. Here, mobile network operators can provide services to mobile users as AAA (authentication, authorization, and accounting) based on the home agent (HA) and subscribers’ data stored in databases. After that, the subscribers’ requests are delivered to a cloud through the Internet. In the cloud, cloud controllers process the requests to provide mobile users with the corresponding cloud services. These services are developed with the concepts of utility computing, virtualization, and service-oriented architecture (e.g., web, application, and database servers).

http://onlinelibrary.wiley.com/doi/10.1002/wcm.1203/abstract

Posted in CLOUD, Computer Network & Security, Computer Software, Computing Technology | Tagged: , | Leave a Comment »

Five free, dead-easy IP traffic monitoring tools

Posted by Hemprasad Y. Badgujar on January 30, 2014


When you really need to know what’s going on with your network, give one of these free monitors a try.

Monitoring your network can be a real pain. First and foremost, what tool should you use? Everyone you ask will give you a different answer. Each answer will reflect a different set of requirements and, in some cases, fill completely different needs. Here are the five network monitors I prefer, based on two criteria: They’re free (as in cost) and easy to use. You might not agree with the choices, but at the price point, you’d be hard pressed to find better solutions.

1: Wireshark

Wireshark (Figure A) has always been my go-to monitor. When most other monitors fail to find what I want, Wireshark doesn’t let me down. Wireshark is a cross-platform analyzer that does deep inspection of hundreds of protocols. It does live capture and capture save (for offline browsing), which can be viewed in GUI or tty mode. Wireshark also does VoIP analysis and can read/write many capture formats (tcpdump, Pcap NG, Catapult DCT2000, Cisco Secure IDS iplog, Microsoft Network Monitor, and many more).

Figure A

2: Angry IP Scanner

Angry IP Scanner (Figure B) is one of the easiest to use of all the IP scanners. It has a user-friendly GUI that can scan IP addresses (and their ports) in any range. Angry IP Scanner is cross-platform and doesn’t require installation, so you can use it as a portable scanner. It can get NetBIOS information, save favorite IP address ranges, detect web servers, use customizable openers, and much more. This little scanner makes use of multithreading, so it’s going to be fairly fast. Source code is available on the download page.

Figure B

3: Zenmap

Zenmap (Figure C) is a graphical front end to the cross-platform Nmap tool. Nmap can scan huge networks, is portable, free, and well documented. It’s one of the most powerful IP traffic monitors, but that power comes with a price: complexity. Zenmap takes Nmap and makes it more accessible to users who prefer to avoid the command line. That does not mean Zenmap is the easiest of the lot. You still need to use some commands. But Zenmap offers a powerful wizard-like tool to help you through the process.

Figure C

4: Colasoft Capsa Free

If you’re an admin used to more Windows-like tools, Capsa Free (Figure D) might be the perfect tool for you. There are actually two versions of Capsa: paid and free. The free version should be enough in most cases. It provides an easy-to-use dashboard you can use to create various types of captures. Capsa Free also offers plenty of alarm configurations so you can be alerted when something occurs. And it can capture more than 300 network protocols, so you won’t be missing out on anything with this free tool.

Figure D

5: EtherApe

EtherApe is a Linux-only tool modeled after the classic etherman monitor. It’s unique in that it offers an easy-to-use mapping of IP traffic on your network. It does this in real time and gives you a clear picture of the overall look of your network traffic. You can create filters (using pcap syntax, for example tcp port 80) to make reading the map easier. As you can see in Figure E, a busy network can get rather challenging to read. EtherApe colors each node and link by its most-used protocol, so it’s easier to take a quick glance, even on a busy network.

Figure E

More tools?

A lot of network monitoring tools are out there, and some of them do more auditing than the tools listed here. But when you really need to know what’s going on with your network, one of the above tools will do a great job.

Have you used any of these tools? What other free scanners have you tried?

Posted in CLOUD, Computer Hardwares, Computer Languages, Computer Network & Security, Computer Softwares, Computing Technology, User Authentication | Leave a Comment »

25 Apache Performance Tuning Tips

Posted by Hemprasad Y. Badgujar on November 29, 2013


We all know and love Apache. It’s great: it allows us to run websites on the Internet with minimal configuration and administration.

However, this same flexibility and lack of tuning is typically what leads Apache to become a memory hog. Using these easy-to-understand tips, you can gain a significant performance boost from Apache.

Apache Specifics

1. Remove unused modules – save memory by not loading modules that you do not need, including but not limited to mod_php, mod_ruby, mod_perl, etc.

2. Use mod_disk_cache NOT mod_mem_cache – mod_mem_cache will not share its cache amongst different apache processes, which results in high memory usage with little performance gain since on an active server, mod_mem_cache will rarely serve the same page twice in the same apache process.

3. Configure mod_disk_cache with a flat hierarchy – ensure that you are using CacheDirLength=2 and CacheDirLevels=1 to ensure htcacheclean will not take forever when cleaning up your cache directory.

4. Setup appropriate Expires, Etag, and Cache-Control Headers – to utilize your cache, you must tell it when a file expires, otherwise your client will not experience the caching benefits.

5. Put Cache on separate disk – place your cache on a separate physical disk for fastest access without slowing down other processes.

6. Use Piped Logging instead of direct logging – logging directly to a file causes issues when you want to rotate the log file: Apache must be restarted to use the next log file, which causes significant slowness for your users during the restart, particularly if you are using Passenger or some other app loader.

7. Log to a different disk than disk serving pages – put your logs on physically different disks than the files you are serving.

8. Utilize mod_gzip/mod_deflate – gzip your content before sending it, and the client will ungzip it upon receipt. This minimizes the size of file transfers and generally helps all user experience.

9. Turn HostnameLookups Off – stop doing expensive DNS lookups.  You will rarely ever need them and when you do, you can look them up after the fact.

10. Avoid using hostname in configs – if you have HostnameLookups off, this will prevent you from having to wait for the DNS resolve of the hostnames in your configs, use IP addresses instead.

11. Use Persistent Connections – set KeepAlive On and then set KeepAliveTimeout and MaxKeepAliveRequests (see the configuration sketch after this list). KeepAliveTimeout is how long Apache will wait for the next request, and MaxKeepAliveRequests is the maximum number of requests per client before the connection is reset. This prevents the client from having to reconnect between each request.

12. Do Not set KeepAliveTimeout too high – if you have more requests than apache children, this setting can starve your pool of available clients.

13. Disable .htaccess – i.e., AllowOverride None. This will prevent Apache from having to check for a .htaccess file on each request.

14. Allow symlinks – i.e. Options +FollowSymLinks -SymLinksIfOwnerMatch.  Otherwise, apache will make a separate call on each filename to ensure it is not a symlink.

15. Set ExtendedStatus Off – Although very useful, the ExtendedStatus will produce several system calls for each request to gather statistics.  Better to utilize for a set time period in order to benchmark, then turn back off.

16. Avoid Wildcards in DirectoryIndex – use a specific DirectoryIndex, i.e. index.html or index.php, not index

17. Increase Swappiness – particularly on single site hosts this will increase performance.  On linux systems increase /proc/sys/vm/swappiness to at least 60 if not greater.  This will try to load as many files as possible into the memory cache for faster access.

18. Increase Write Buffer Size – increase your write buffer size for tcp/ip buffers.  On linux systems increase /proc/sys/net/core/wmem_max and /proc/sys/net/core/wmem_default. If your pages fit within this buffer, apache will complete a process in one call to the tcp/ip buffer.

19. Increase Max Open Files – if you are handling high loads increase the number of allowed open files.  On linux, increase /proc/sys/fs/file-max and run ulimit -H -n 4096.
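
Pulling several of the Apache-specific tips together, here is a minimal httpd.conf sketch. It assumes Apache 2.x with mod_cache, mod_disk_cache, and mod_expires loaded; the directive names are standard, but the values are illustrative starting points to benchmark against your own load, not settings taken from this article.

# Tips 9 and 11-12: cheap wins in the core server
HostnameLookups Off
KeepAlive On
KeepAliveTimeout 3
MaxKeepAliveRequests 100

# Tips 2-3: disk-backed cache with a flat directory layout
CacheEnable disk /
CacheRoot /var/cache/apache2
CacheDirLength 2
CacheDirLevels 1

# Tip 4: expiration headers so the cache is actually used
ExpiresActive On
ExpiresByType image/png "access plus 1 month"
ExpiresByType text/css "access plus 1 week"

# Tips 13-14: per-directory settings
<Directory /var/www/html>
    AllowOverride None
    Options +FollowSymLinks -SymLinksIfOwnerMatch
</Directory>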

Application Specifics

20. Setup Frontend proxy for images and stylesheets – allow your main web servers to process the application while images and stylesheets are served from frontend webservers

21. Use mod_passenger for rails – mod_passenger is able to share memory and resources amongst several processes, allowing for faster spawning of new application instances.  It will also monitor these processes and remove them when they are unnecessary.

22. Turn off safe_mode for php – it will use about 50-70% of your script time checking against these safe directives. Instead, configure open_basedir properly and utilize modules such as mod_itk.

23. Don’t use threaded mpm with mod_php – look at using mod_itk, mod_php tends to segfault with threaded mpm.

24. Flush buffers early for pre-render – it takes a relatively long time to create a web page on the backend, flush your buffer prior to page completion to send a partial page to the client, so it can start rendering.  A good place to do this is right after the HEAD section – so that the browser can start fetching other objects.

25. Use a cache for frequently accessed data – memcached is great for frequently used data and sessions. It will speed up your Apache render time, as databases are slow.

Posted in CLOUD, Computing Technology, Installation | Leave a Comment »

CPU vs GPU performance

Posted by Hemprasad Y. Badgujar on July 18, 2013


It is hard to compare the performance of GPUs and CPUs in a single chart. Any such chart has to be quite a compromise, since actual performance depends heavily on the suitability of the chip to a particular problem/algorithm, among many other specifics. The simplest method is to plot theoretical peak performance over time; I chose to show it for single and double precision for NVIDIA GPUs and Intel CPUs.
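
As a reminder of how such theoretical peaks are computed (the numbers below are illustrative values for parts of that era, not values taken from the chart):

peak FLOPS = cores × clock (Hz) × FLOPs per core per cycle

For example, an NVIDIA Tesla K20X (2688 CUDA cores × 0.732 GHz × 2 FLOPs per cycle for fused multiply-add) works out to roughly 3.9 single-precision TFLOPS, while an Intel Xeon E5-2670 (8 cores × 2.6 GHz × 16 single-precision FLOPs per cycle with AVX) works out to roughly 0.33 TFLOPS, which is why such plots show an order-of-magnitude gap in theoretical peak.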

In the past, I have used the graph in the CUDA C Programming Guide, but that graph is frequently out of date; I have no control over the formatting; I have to settle for a screenshot instead of vector output; and, until I did my own research, I wasn’t sure whether it was biased.

Below is Michael Galloy’s attempt (click to enlarge).

CPU vs. GPU performance

by Michael Galloy

Posted in CLOUD, Computer Hardwares, Computing Technology, CUDA, GPU (CUDA), GPU Accelareted, Open CL, PARALLEL | Leave a Comment »

Remote Desktop Connection in Windows 7

Posted by Hemprasad Y. Badgujar on March 4, 2013


Remote Desktop Connection in Windows 7

Remote Desktop Connection, a utility included in all versions of Windows 7, allows you to use a laptop or home computer to remotely control the Windows-based desktop computer in your on-campus office or lab. When using Remote Desktop Connection from a laptop on a wireless network (including Purdue’s AirLink network and free public WiFi networks in coffee shops, hotels, etc.) or a home computer on a broadband Internet connection, it’s as if you’re sitting at the desk in your office using your computer’s keyboard and mouse — even if you’re two buildings, two miles, or two continents away.

By remotely accessing an ECN-supported desktop computer and refraining from storing your Purdue files locally on your laptop or home computer, your data remains safely stored in your home directory on ECN’s network servers — which receive daily backups.

  • If you’re using Windows XP Professional rather than Windows 7, please see Remote Desktop Connection in Windows XP instead.
  • If you have a Macintosh desktop at home or a Mac laptop but have a Windows-based desktop computer in your office, Microsoft also provides a free Mac version of Remote Desktop Connection; please see Remote Desktop Connection in Mac OS X. (The instructions on the page you’re reading now focus on the Windows 7 version.)

You’ll want to follow these instructions on your laptop and/or home computer, not on the on-campus desktop computer!

When connecting from off-campus, please don’t miss step #6! Connecting first to Purdue’s Virtual Private Network is required.


Who can use Remote Desktop Connection?

A remote-controlled computer can be used by only one person at a time. As such, it is recommended for use only by those who do not share the same office computer with other people. A graduate student may use Remote Desktop Connection with the permission of his or her supervisor.

 


Creating a Remote Desktop shortcut

1. Getting started on your Windows 7-based laptop or home computer.

On your laptop or home computer, click on the Start menu, navigate to All Programs, then to Accessories, and then launch “Remote Desktop Connection.”

Windows 7 Start menu


Remote Desktop Connection dialog

2. Computer address.

2A. In the “Computer” field, enter the IP number of the desktop computer in your office. It will look similar to the following:

128.46.xxx.yyy

where both xxx and yyy are a specific number between 1 and 255. No two computers have the same full number; please obtain this number from ECN.

You may either skip to step #6 (to connect to the remote computer immediately) or proceed with step #2B (to set program options and create a shortcut for future use).

2B. Then click on the “Options” button. The window will expand to show several tabs, each with various program settings.


Experience tab

3. The “Experience” tab.

This step is optional. These settings might help improve your remote connection’s performance.

3A. Click on the “Experience” tab.

3B. Click the menu beneath “Choose your connection speed to optimize performance” and select one of the following:

  • For most public WiFi services or home DSL connections, try “Low-speed broadband (256 Kbps – 2 Mbps)”.
  • For home cable modem connections, try “High-speed broadband (2 Mbps – 10 Mbps)”.

General tab

4. The “General” tab.

4A. In the “User name” field, type your Purdue Career Account username.

Leave the “Allow me to save credentials” box unchecked.

4B.  Click on the “Save As” button to proceed to the next step. The “Save As” dialog will appear.


5. Saving your shortcut file.

In this step, you’ll create a shortcut file which you will later begin using routinely to launch a remote control session to your office PC.  You may save this shortcut wherever you prefer; we suggest saving a copy to your desktop.

5A. In the “Save As” dialog, click on the “Desktop” icon in the left-hand column. This will set the “Save in” location to the desktop.

5B. In the “File name” field, type a name that you’ll recognize.  We suggest something like the following:

Remote Desktop to my office PC

If you’ll be creating shortcuts to multiple remote computers (say, one for each person who uses a shared home computer, each pointing to his or her unique office PC), you could enter a more specific name, e.g.:

Remote Desktop to John's office PC
Remote Desktop to arms3403pc1

5C. Click the “Save” button.

The new shortcut file will be created on the desktop.

5D. (This step is optional.) If you’d like the shortcut to appear in more places, this would be a good time to make copies of it.  You could drag the icon from the desktop to the Start button, for example, to place a copy of the shortcut in your Start menu.


Connecting to the desktop computer in your office

These instructions assume that your computer is connected to the Internet, either wirelessly or via a broadband connection (e.g. cable modem or DSL).

6. Connect to Purdue’s Virtual Private Network. When using a computer off-campus, this step is required. Establish a connection to Purdue’s Virtual Private Network (https://webvpn.purdue.edu). For a description of this service, please see ITaP’s VPN “Getting Started” page.

7. Starting the remote connection.

7A. If you saved the icon to the desktop in step #5, locate it there and double-click the icon now.

Alternately, repeat steps #1 and #2A, and then click the “Connect” button.

Your laptop or home computer will connect via the Internet to your desktop computer in your office.


8. Remote computer verification.

You might see a dialog (like the one shown at right) noting that the remote computer’s identity cannot be verified.

8A. You may optionally enable (place a check mark in) the “Don’t ask me again for connections to this computer” box.

8B. Then click the “Yes” button.


9. Password prompt.

A password prompt will appear. Because you are connecting to an ECN-supported PC that is a member of an Active Directory domain, you might need to do a couple of extra steps.

If the remote computer is running Windows 7, the login prompt will look like the one on the left in the illustration, below:

Enter your credentials dialog

9A. If the dialog appears as above, click the “Use another account” button.

9B. Enter your username as follows, substituting your own Purdue Career Account username:

ecn\username

9C. Enter your Purdue Career Account password.

9D. Then click the “OK” button.

Your office computer’s desktop will appear. If you had left programs running and/or files open on your office computer, they’ll appear now, just as they were.  If you had logged out of Windows before you left your office, your ECN-supported office computer will go through the typical startup process, finishing with the Message of the Day window — just as when you’re in the office.

Now, while your remote connection is open, when you type or use your mouse, it’ll be like using the keyboard and mouse at your office computer.


Minimizing and/or disconnecting

10. Using the top-central tool bar.

While connected to the remote computer, a toolbar appears at the top of your screen like the one shown here:

Remote Desktop toolbar

10A. If you need to access a file or program on your local computer (the laptop or home computer you’re using), click the minimize button on the top-central tool bar.  Remote Desktop Connection will stay running (as will all programs you have open on your office PC);  restore it by clicking its button on the task bar (at the bottom of your screen, usually).

10B. When you’re ready to disconnect from your office PC, you may end the session one of these ways:

  • Click on the “X” button at the right edge of the top-central toolbar.  This will end the remote session but leave files and programs open and running on your office PC.
  • Or, as shown in the illustration below, click on the (remote computer’s) Start menu and select “Log off.”  This will close all open files and programs on your office PC and also end the remote session.

Log off

Posted in CLOUD, CLUSTER, Computer Network & Security, Computing Technology, PARALLEL | 1 Comment »

Make changes in Cloudsim and run it

Posted by Hemprasad Y. Badgujar on February 25, 2013


Make changes in Cloudsim and run it

Step 1
Open NetBeans (any version greater than 5.0) and go to File > New Project.
Step 2
Select the “Java” folder, then select the first option, “Java Application,” and press Next.
Step 3
Give the project any name you wish, uncheck “Create Main Class,” and press Next.
Step 4
Your project has now been created, as shown.
Step 5
Copy the “org” folder in “cloudsim-2.1.1\examples” and paste it into the NetBeans source folder, as shown: go to the source folder, right-click, and select Paste.
Step 6
Likewise, copy the “org” folder in “cloudsim-2.1.1\source” and paste it into the NetBeans source folder: go to the source folder, right-click, and select Paste.
Step 7
Now you can simply run the examples.

Posted in CLOUD, Computing Technology | Leave a Comment »

Cloudsim installation

Posted by Hemprasad Y. Badgujar on February 25, 2013


Cloudsim installation

Step 1
Open NetBeans (any version greater than 5.0) and go to File > New Project.

Step 2
Select the “Java” folder, then select the first option, “Java Application,” and press Next.

Step 3
Give the project any name you wish, uncheck “Create Main Class,” and press Next.

Step 4
Your project has now been created, as shown.

Step 5
Go to Libraries, right-click on it, and in the menu that appears, click “Add JAR/Folder.”

Step 6
Browse to the CloudSim folder that you extracted from the zip file, go to “cloudsim-2.1.1\jars”, and select “cloudsim-2.1.1.jar”.

Step 7
Copy the “org” folder in “cloudsim-2.1.1\examples” and paste it into the NetBeans source folder, as shown: go to the source folder, right-click, and select Paste.

Step 8
To run an example, go to org.cloudbus.cloudsim.examples under the source folder, select any example, right-click on it, and select the “Run File” option. The output will be displayed in the Output window at the bottom.

To run example code from the source code: steps 1, 2, 3, and 4 remain the same. In step 5, copy the “org” folder from “cloudsim-2.1.1\source” and paste it into the source folder of your NetBeans project, just as you copied the examples folder.

Posted in CLOUD, Computing Technology | Leave a Comment »

 