Something More for Research

Explorer of Research #HEMBAD

Posted by Hemprasad Y. Badgujar on December 11, 2014

Cloud scaling, Part 1: Build a compute node or small cluster application and scale with HPC

Leveraging warehouse-scale computing as needed

Discover methods and tools to build a compute node and small cluster application that can scale with on-demand high-performance computing (HPC) by leveraging the cloud. This series takes an in-depth look at how to address unique challenges while tapping and leveraging the efficiency of warehouse-scale on-demand HPC. The approach allows the architect to build locally for expected workload and to spill over into on-demand cloud HPC for peak loads. Part 1 focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Exotic HPC architectures with custom-scaled processor cores and shared memory interconnection networks are being rapidly replaced by on-demand clusters that leverage off-the-shelf general purpose vector coprocessors, converged Ethernet at 40 Gbit/s per link or more, and multicore headless servers. These new HPC on-demand cloud resources resemble what has been called warehouse-scale computing, where each node is homogeneous and headless and the focus is on total cost of ownership and power use efficiency overall. However, HPC has unique requirements that go beyond social networks, web search, and other typical warehouse-scale computing solutions. This article focuses on what the system builder and HPC application developer can do to most efficiently scale your system and application.

Moving to high-performance computing

Since 1994, the TOP500 and Green500 supercomputers (see Resources) have increasingly been not custom designs but systems designed and integrated with off-the-shelf headless servers, converged Ethernet or InfiniBand clustering, and general-purpose graphics processing unit (GP-GPU) coprocessors that are used not for graphics but for single program, multiple data (SPMD) workloads. The trend in high-performance computing (HPC) away from exotic custom processor and memory interconnection design and toward off-the-shelf, warehouse-scale computing is based on the need to control total cost of ownership, increase power efficiency, and balance operational expenditure (OpEx) and capital expenditure (CapEx) for both start-up and established HPC operations. This means that you can build your own small cluster with similar methods and use HPC warehouse-scale resources on demand when you need them.

The famous 3D torus interconnection that Cray and others used may never fully go away (today, the TOP500 is one-third massively parallel processors [MPPs] and two-thirds cluster architecture for top performers), but focus on efficiency and new OpEx metrics like Green500 Floating Point Operations (FLOPs)/Watt are driving HPC and keeping architecture focused on clusters. Furthermore, many applications of interest today are data driven (for example, digital video analytics), so many systems not only need traditional sequential high performance storage for HPC checkpoints (saved state of a long-running job) but more random access to structured (database) and unstructured (files) large data sets. Big data access is a common need of traditional warehouse-scale computing for cloud services as well as current and emergent HPC workloads. So, warehouse-scale computing is not HPC, but HPC applications can leverage data center-inspired technology for cloud HPC on demand, if designed to do so from the start.

Power to computing

Power to computing can be measured in terms of a typical performance metric per Watt: for example, FLOPS/Watt for computing or I/O operations per second/Watt for I/O. Furthermore, any computing facility can be seen as a plant for converting Watts into computational results, and a gross measure of good plant design is power usage effectiveness (PUE), which is simply the ratio of total facility power to the power delivered to computing equipment. A good value today is 1.2 or less. Reasons for higher PUEs include inefficient cooling methods, administrative overhead, and a lack of purpose-built facilities compared to cloud data centers (see Resources for a link to more information).
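To make the two metrics concrete, here is a minimal sketch in Python. The function names and the facility numbers are hypothetical, chosen only to illustrate the ratios described above; they are not measurements from any real data center.

```python
def pue(total_facility_watts, it_equipment_watts):
    """PUE: total facility power divided by power delivered to computing equipment."""
    if it_equipment_watts <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_watts / it_equipment_watts


def flops_per_watt(sustained_flops, watts):
    """Performance per Watt for a compute workload."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return sustained_flops / watts


# Hypothetical facility: 1.0 MW reaches the computing equipment,
# and another 0.2 MW goes to cooling, lighting, and other overhead.
facility = pue(total_facility_watts=1_200_000, it_equipment_watts=1_000_000)
print(f"PUE = {facility:.2f}")

# Hypothetical node: 2 TFLOPS sustained while drawing 1 kW.
print(f"{flops_per_watt(2e12, 1000.0) / 1e9:.1f} GFLOPS/Watt")
```

A PUE of 2.0 under this definition means that for every Watt delivered to computing equipment, another Watt is spent on overhead.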

Changes in scalable computing architecture focus over time include:

  • Early focus on a fast single processor (uniprocessor) to push the stored-program arithmetic logic unit central processor to the highest clock rates and instruction throughput possible:
    • John von Neumann, Alan Turing, Robert Noyce (co-founder of Intel), Ted Hoff (Intel universal processor proponent), and Gordon Moore see initial scaling as a challenge of scaling digital logic and clocking a processor as fast as possible.
    • Up to at least 1984 (and maybe longer), the general rule was “the processor makes the computer.”
    • Cray Research designs vector processors (X-MP, Y-MP) and distributed-memory multiprocessors interconnected by a six-way 3D torus for custom MPP machines, but this is unique to the supercomputing world.
    • IBM’s focus early on was scalable mainframes and fast uniprocessors until the announcement of the IBM® Blue Gene® architecture in 1999 using a multicore IBM® POWER® architecture system-on-a-chip design and a 3D torus interconnection. The current TOP500 includes many Blue Gene systems, which have often occupied the LINPACK-measured TOP500 number one spot.
  • More recently, since 1994, HPC has been evolving toward a few custom MPP systems and mostly off-the-shelf clusters, using both custom interconnections (for example, Blue Gene and Cray) and off-the-shelf converged Ethernet (10G, 40G) and InfiniBand:
    • The TOP500 has become dominated by clusters, which comprise the majority of top-performing HPC solutions (two-thirds) today.
    • As shown in the TOP500 chart by architecture since 1994, clusters and MPP dominate today (compared to single instruction, multiple data [SIMD] vector; fast uniprocessors; symmetric multiprocessing [SMP] shared memory; and other, more obscure architectures).
    • John Gage at Sun Microsystems (now Oracle) stated that “the network is the computer,” referring to distributed systems and the Internet, but low-latency networks in clusters likewise become core to scaling.
    • Coprocessors interfaced to cluster nodes via memory-mapped I/O, including GP-GPU and even hybrid field-programmable gate array (FPGA) processors, are used to accelerate specific computing workloads on each cluster node.
  • Warehouse-scale computing and the cloud emerge with focus on MapReduce and what HPC would call embarrassingly parallel applications:
    • The TOP500 is measured with LINPACK and FLOPs and so is not focused on cost of operations (for example, FLOPs/Watt) or data access. Memory access is critical, but storage access is not so critical, except for job checkpoints (so a job can be restarted, if needed).
    • Many data-driven applications have emerged in the new millennium, including social networks, Internet search, global geographical information systems, and analytics associated with more than a billion Internet users. This is not HPC in the traditional sense but warehouse-computing operating at a massive scale.
    • Luiz André Barroso states that “the data center is the computer,” a second shift away from processor-focused design. The data center is highly focused on OpEx as well as CapEx, and so is a better fit for HPC where FLOPs/Watt and data access matter. These Google data centers have a PUE less than 1.2—a measure of total facility power consumed divided by power used for computation. (Most computing enterprises have had a PUE of 2.0 or higher, so, 1.2 is very low indeed. See Resources for more information.)
    • Amazon launched Amazon Elastic Compute Cloud (Amazon EC2), which is best suited to web services but has some scalable and at least high-throughput computing features (see Resources).
  • On-demand cloud HPC services expand, with an emphasis on clusters, storage, coprocessors and elastic scaling:
    • Many private and public HPC clusters occupy the TOP500, running Linux® and using common open source tools, such that users can build and scale applications on small clusters but migrate to the cloud for on-demand large job handling. Companies like Penguin Computing, which features Penguin On-Demand, leverage off-the-shelf clusters (InfiniBand and converged 10G/40G Ethernet), Intel or AMD multicore headless nodes, GP-GPU coprocessors, and scalable redundant array of independent disks (RAID) storage.
    • IBM Platform Computing provides IBM xSeries® and zSeries® computing on demand with workload management tools and features.
    • Numerous universities and start-up companies leverage HPC on demand with cloud services or off-the-shelf clusters to complement their own private services. Two that I know well are the University of Alaska Arctic Region Supercomputing Center (ARSC) Pacman (Penguin Computing) and the University of Colorado JANUS cluster supercomputer. A common Red Hat Enterprise Linux (RHEL) open source workload tool set and open architecture allow for migration of applications from private to public cloud HPC systems.

Figure 1 shows the TOP500 move to clusters and MPP since the mid-1990s.

Figure 1. TOP500 evolution to clusters and MPP since 1994

The cloud HPC on-demand approach requires well-defined off-the-shelf clustering, compute nodes, and tolerance for WAN latency to transfer workload. As such, these systems are not likely to overtake the top spots in the TOP500, but they are likely to occupy the Green500 and provide efficient scaling for many workloads, and off-the-shelf clusters now comprise the majority of the TOP500.

High-definition digital video computer vision: a scalable HPC case study

Most of us deal with compressed digital video, often in Moving Picture Experts Group (MPEG) 4 format, and don’t think of the scale of even a high-definition (HD) web cam in terms of data rates and the processing needed to apply simple image processing analysis. Digital cinema workflow and post-production experts know the challenges well. They deal with individual 4K frames (roughly 4,000 pixels wide, or about 8 megapixels) or much higher resolution. These frames might be compressed, but they are not compressed over time in groups of pictures as MPEG does, and they often use lossless rather than lossy compression.
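A quick back-of-the-envelope calculation shows the scale involved. The sketch below assumes uncompressed 24-bit color (3 bytes per pixel) and 30 frames/s for a 1920×1080 HD source; those parameters are assumptions for illustration, not figures from the article.

```python
def raw_video_rate_bytes_per_s(width, height, bytes_per_pixel, fps):
    """Uncompressed data rate of a video stream in bytes per second."""
    return width * height * bytes_per_pixel * fps


# Assumed HD web cam parameters: 1920x1080, 24-bit color, 30 frames/s.
hd_rate = raw_video_rate_bytes_per_s(1920, 1080, 3, 30)
print(f"HD uncompressed: {hd_rate / 1e6:.1f} MB/s")        # 186.6 MB/s
print(f"One hour of it:  {hd_rate * 3600 / 1e9:.1f} GB")   # 671.8 GB
```

Even a single uncompressed HD stream under these assumptions approaches the sequential bandwidth of a disk drive, before any analysis is applied to the pixels.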

To start to understand an HPC problem that involves FLOPs, uncompressed data, and tools that can be used for scale-up, let’s look at a simple edge-finder transform. The example includes Open Source Computer Vision (OpenCV) algorithms to transform a web cam stream into a Sobel or Canny edge view in real time. See Figure 2.
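To show what a Sobel edge transform computes per pixel, here is a minimal pure-Python sketch. It is a stand-in for illustration only, not the OpenCV-based example: it works on a small grayscale image stored as a list of rows, leaves border pixels at zero, and approximates the gradient magnitude as |gx| + |gy| (a common shortcut, assumed here).

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient kernel
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient kernel


def sobel_magnitude(image):
    """Return |gx| + |gy| for each interior pixel; border pixels stay 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = abs(gx) + abs(gy)
    return out


# A 5x6 test image with a vertical step edge between columns 2 and 3.
step = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_magnitude(step)
for row in edges:
    print(row)
```

Each output pixel needs 18 multiply-accumulate operations, so the cost grows directly with frame size and frame rate, which is why the transform is a useful small benchmark for scaling.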

Figure 2. HD video Canny edge transform

Leveraging cloud HPC for video analytics allows for deployment of more intelligent smart phone applications. Perhaps phone processors will someday be able to handle real-time HD digital video facial recognition, but in the meantime, cloud HPC can help. Likewise, data that originates in data centers, like geographic information systems (GIS) data, needs intensive processing for analytics to segment scenes, create point clouds of 3D data from stereo vision, and recognize targets of interest (such as well-known landmarks).

Augmented reality and video analytics

Video analytics involves collection of structured (database) information from unstructured video (files) and video streams, for example, facial recognition. Much of the early focus has been on security and automation of surveillance, but applications are growing fast and now include more social uses, such as facial recognition intended not to identify a person but to capture and record facial expression and mood (while shopping, for example). This technology can be coupled with augmented reality, whereby the analytics are used to update a scene with helpful information (such as navigation data). Video data can be compressed and uplinked to warehouse-scale data centers for processing so that the analytics can be collected and information that is not available on a user’s smart phone can be provided in return. The image processing is compute intensive, involves big data storage, and is likely a scaling challenge (see Resources for a link to more information).

Sometimes, when digital video is collected in the field, the data must be brought to the computational resources; but if possible, digital video should be moved only when necessary, to avoid encoding to compress it and decoding to decompress it for viewing. Specialized coprocessors known as codecs (coder/decoders) are designed to decode without software, and coprocessors to render graphics (GPUs) exist, but to date, no CV coprocessors are widely available. In late 2012, Khronos announced an initiative to define hardware acceleration for OpenCV, but work has only just begun (see Resources). So, to date, CV remains more of an HPC application that has had attention primarily from digital cinema, but this is changing rapidly based on interest in CV on mobile devices and in the cloud.

Although all of us imagine CV to be implemented on mobile robotics, in our heads-up displays for intelligent transportation, and on visors (like Google Goggles that are now available) for personal use, it’s not clear that all of the processing must be done on the embedded devices or that it should be even if it could. The reason is data: Without access to correlated data center data, CV information has less value. For example, how much value is there in knowing where you are without more mapping and GIS data to help you with where you want to go next? Real-time CV and video analytics are making progress, but they face many challenges, including huge storage requirements, high network bit rates for transport, and significant processing demands for interpretation. Whether the processing is done by cloud HPC clusters or embedded systems, it’s clear that concurrency and parallel processing will play a huge role. Try running a simple Hough linear transform on the 12-megapixel cactus photo I took, and you’ll see why HPC might be needed just to segment a scene at 60 frames/s.

The challenge of making algorithms parallel

HPC with both clusters and MPP requires coding methods that employ many threads of execution on each multicore node and that use the Message Passing Interface (MPI) and basic methods to map data and code to processing resources and collect results. For digital video, the mapping can be simple if done at a frame level. Mapping within a frame is more difficult but still manageable; the extra work is in segmenting each frame and restitching the pieces together.
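The within-frame case can be sketched as follows. This is a minimal illustration in Python, assuming a 3×3 neighborhood transform: the frame is cut into horizontal strips, each strip carries one extra "halo" row above and below so its edge pixels see their true neighbors, and the halo rows are dropped when the strips are stitched back together. The transform itself and all function names are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(rows):
    """Toy 3x3-neighborhood operation (sum of absolute differences); borders stay 0."""
    height, width = len(rows), len(rows[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = (abs(rows[y][x + 1] - rows[y][x - 1])
                         + abs(rows[y + 1][x] - rows[y - 1][x]))
    return out


def split_with_halo(frame, parts):
    """Cut a frame into row strips, each padded with one halo row where available."""
    height = len(frame)
    strips = []
    for i in range(parts):
        first, last = i * height // parts, (i + 1) * height // parts
        lo, hi = max(first - 1, 0), min(last + 1, height)
        # (padded rows, number of halo rows to skip, number of rows to keep)
        strips.append((frame[lo:hi], first - lo, last - first))
    return strips


def process_strip(strip):
    rows, skip, keep = strip
    return transform(rows)[skip:skip + keep]


def parallel_transform(frame, parts=4):
    """Map strips to workers, then restitch the results in order."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        pieces = list(pool.map(process_strip, split_with_halo(frame, parts)))
    return [row for piece in pieces for row in piece]


frame = [[(x * x + 3 * x * y + y) % 17 for x in range(10)] for y in range(8)]
print(parallel_transform(frame, 4) == transform(frame))
```

The halo rows are the price of splitting inside a frame: they are read by two workers, whereas frame-level mapping needs no overlap at all.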

The power of MapReduce

The MapReduce concept is generally associated with Google and the open source Hadoop project (from Apache Software Foundation), but any parallel computation must employ this concept to obtain speed-up, whether done at a node or cluster level with Java™ technology or at a thread level for a nonuniform memory access (NUMA) shared memory. For applications like digital video analytics, the mapping is data intensive, so it makes sense to move the function to the data (in the mapping stage), but either way, the data to be processed must be mapped and processed and the results combined. A clever mapping avoids data dependencies and the need for synchronization as much as possible. In the case of image processing, for CV, the mapping could be within a frame, at the frame level, or by groups of pictures (see Resources).
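As a minimal sketch of the map and combine steps at the frame level, the Python below maps each frame to a partial result and then reduces the partial results into one. The analysis performed (counting bright and dark pixels with a threshold of 128) is a hypothetical placeholder for a real transform, and the function names are illustrative only.

```python
from collections import Counter
from functools import reduce


def map_frame(frame):
    """Map step: one frame in, one partial result out (no dependency on other frames)."""
    return Counter("bright" if pixel >= 128 else "dark"
                   for row in frame for pixel in row)


def combine(left, right):
    """Reduce step: merging is associative, so partial results can be combined in any grouping."""
    return left + right


def analyze(frames):
    return reduce(combine, map(map_frame, frames), Counter())


frames = [
    [[0, 200], [255, 10]],     # frame 1: two bright, two dark pixels
    [[130, 130], [0, 0]],      # frame 2: two bright, two dark pixels
]
totals = analyze(frames)
print(totals["bright"], totals["dark"])
```

Because `map_frame` touches only its own frame, the map step needs no synchronization; only the small partial results have to travel to wherever the reduction runs.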

Key tools for designing cluster scaling applications for cloud HPC on demand include the following:

  • Threading is the way in which a single application (or Linux process), which has one address space on one cluster node, can be designed to use all processor cores on that node. Most often, this is done with Portable Operating System Interface for UNIX® (POSIX) Pthreads or with a library like OpenMP, which abstracts the low-level details of POSIX threading. I find POSIX threading to be fairly simple and typically write Pthread code as can be seen in the hpc_cloud_grid.tar.gz example. This example maps threads onto the overall number space for prime number searching.
  • MPI is a library that can be linked into a cluster parallel application to assist with mapping of processing to each node, synchronization, and reduction of results. Although you can use MPI to implement MapReduce, unlike Hadoop, it typically moves data (in messages) to program functions running on each node (rather than moving code to the data). In the final video analytics article in this series, I will provide a thread and MPI cluster-scalable version of the capture-transform code. Here, I provide the simple code for a single thread and node to serve as a reference. Run it and Linux dstat at the same time to monitor CPU, I/O, and storage use. It is a resource-intensive program that computes Sobel and Canny transforms on a 2560×1920-pixel image. It should run on any Linux system with OpenCV and a web cam.
  • Vector SIMD and SPMD processing can be accomplished on Intel and AMD nodes by enabling a compiler switch or, with more work, by creating transform kernels in CUDA or OpenCL for off-load to a GPU or GP-GPU coprocessor.
  • OpenCV is highly useful for video analytics, as it includes not only convenient image capture, handling, and display functions but also most of the best image processing transforms used in CV.
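The idea of mapping a number space to threads for prime searching can be sketched as below. This is not the Pthread code from hpc_cloud_grid.tar.gz; it is a hypothetical Python illustration of the partitioning only. In standard CPython builds, threads do not run this CPU-bound loop on multiple cores at once, so treat it as a picture of the mapping rather than a speed-up demonstration.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division, adequate for a small illustration."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def count_primes_in(bounds):
    """Work done by one thread: count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def count_primes(limit, workers=4):
    """Split [0, limit) into contiguous chunks, one per worker, and sum the counts."""
    if limit <= 0:
        return 0
    step = -(-limit // workers)  # ceiling division
    chunks = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes_in, chunks))


print(count_primes(100))   # 25 primes below 100
```

Equal-sized chunks are the simplest mapping but not a balanced one, because testing large numbers costs more than testing small ones; interleaving the numbers among workers is one alternative.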

The future of on-demand cloud HPC

This article makes an argument for cloud HPC. The goal here is to acquaint you with the idea and some of the challenging yet compelling applications (like CV) as well as to introduce you to methods for programming applications that can scale on clusters and MPP machines. In future articles, I will take the CV example further and adapt it not only for threading but also for MPI so that we can examine how well it scales on cloud HPC (in my case, at ARSC on Pacman or JANUS). My research involves comparison of tightly coupled CV coprocessors (that I am building using an Altera Stratix IV FPGA I call a computer vision processing unit [CVPU]). I am comparing this to what I can achieve with CV on ARSC for the purpose of understanding whether environmental sensing and GIS data are best processed like graphics, with a coprocessor, or on a cluster or perhaps with a combination of the two. The goals for this research are lofty. In the case of CVPU, the CV/graphics Turing-like test I imagine is one in which the scene that the CVPU parses can then be sent to a GPU for rendering. Ideally, the parsed/rendered image would be indistinguishable from the true digital video stream. When rendered scenes and the ability to analyze them reach a common level of fidelity, augmented reality, perceptual computing, and video analytics will have amazing power to transform our lives.

Cloud scaling, Part 2: Tour high-performance cloud system design advances

Learn how to leverage co-processing, nonvolatile memory, interconnection, and storage

Breakthrough device technology requires the system designer to re-think operating and application software design in order to realize the potential benefits of closing the access gap or pushing processing into the I/O path with coprocessors. Explore and consider how the latest memory, compute, and interconnection devices and subsystems can affect your scalable, data-centric, high-performance cloud computing system design. Breakthroughs in device technology can be leveraged for transition between compute-centric and the more balanced data-centric compute architectures.

The author examines storage-class memory and demonstrates how to fill the long-standing performance gap between RAM and spinning disk storage; details the use of I/O bus coprocessors (for processing closer to data); explains how to employ InfiniBand to build low-cost, high performance interconnection networks; and discusses scalable storage for unstructured data.

Computing systems engineering has historically been dominated by scaling processors and dynamic RAM (DRAM) interfaces to working memory, leaving a huge gap between data-driven and computational algorithms (see Resources). Interest in data-centric computing is growing rapidly, along with novel system design software and hardware devices to support data transformation with large data sets.

The data focus in software is no surprise given applications of interest today, such as video analytics, sensor networks, social networking, computer vision and augmented reality, intelligent transportation, machine-to-machine systems, and big data initiatives like IBM’s Smarter Planet and Smarter Cities.

The current wave of excitement is about collecting, processing, transforming, and mining the big data sets:

  • The data focus is leading toward new device-level breakthroughs in nonvolatile memory (storage-class memory, SCM), which bring big data closer to processing.
  • At the same time, input/output coprocessors are bringing processing closer to the data.
  • Finally, low-latency, high-bandwidth off-the-shelf interconnections like InfiniBand are allowing researchers to quickly build 3D torus and fat-tree clusters that used to be limited to the most exotic and expensive custom high-performance computing (HPC) designs.

Yet, the systems software and even system design often remain influenced by out-of-date bottlenecks and thinking. For example, consider threading and multiprogramming. The whole idea came about because of slow disk drive access: what else can a program do when waiting on data but run another one? Sure, we have redundant array of independent disks (RAID) scaling and NAND flash solid-state disks (SSDs), but as noted by IBM Almaden Research, the time scale differences of the access time gap are massive in human terms.

The access time gap between a CPU, RAM, and storage can be measured in terms of typical performance for each device, but perhaps the gap is more readily understood when put into human terms (as IBM Almaden has done for illustrative purposes).

If a typical CPU operation is similar to what a human can do in seconds, then RAM access at 100 times more latency is much like taking a few minutes to access information. However, by the same comparison, disk access at 100,000 times more latency compared to RAM is on the order of months (100 days). (See Figure 1.)
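The arithmetic behind this comparison is easy to check. The sketch below takes one second as the human-scale stand-in for a CPU operation (an assumption for illustration) and applies the two ratios quoted above.

```python
def human_scale(cpu_seconds=1.0, ram_factor=100, disk_factor=100_000):
    """Scale latencies: RAM is ram_factor x CPU, disk is disk_factor x RAM."""
    ram_seconds = cpu_seconds * ram_factor
    disk_seconds = ram_seconds * disk_factor
    return ram_seconds, disk_seconds


ram_seconds, disk_seconds = human_scale()
print(f"RAM access:  {ram_seconds / 60:.1f} minutes")
print(f"Disk access: {disk_seconds / 86_400:.0f} days")
```

With these ratios, a disk access lands at roughly 116 days, which matches the "order of months" in the text.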

Figure 1. The data access gap

Many experienced computer engineers have not really thought hard about the 100 to 200 random I/O operations per second (IOPS) that form the mechanical boundary for a disk drive. (Sure, sequential access is as high as hundreds of megabytes per second, but random access remains what it was more than 50 years ago, with the same 15K RPM seek and rotate access latency.)
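The mechanical limit follows from simple arithmetic: on average a random access waits for one seek plus half a rotation. The sketch below assumes a 3.5 ms average seek time, which is a hypothetical figure chosen for illustration; the rotational term follows directly from the spindle speed.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def random_iops(rpm, avg_seek_ms):
    """Approximate random IOPS when each access costs one seek plus half a rotation."""
    return 1000.0 / (avg_seek_ms + rotational_latency_ms(rpm))


print(f"Rotational latency at 15K RPM: {rotational_latency_ms(15_000):.1f} ms")
print(f"Random IOPS (3.5 ms seek assumed): {random_iops(15_000, 3.5):.0f}")
```

Under these assumptions a 15K RPM drive tops out near 180 random IOPS, inside the 100 to 200 range, and no amount of interface bandwidth changes that.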

Finally, as Almaden notes, tape is therefore glacially slow. So, why do we bother? For the capacity, of course. But how can we get processing to the data or data to the processing more efficiently?

Look again at Figure 1. Improvements to NAND flash memory for use in mobile devices and more recently in SSDs have helped to close the gap; however, it is widely believed that NAND flash device technology will be pushed to its limits fairly quickly, as noted by numerous system researchers (see Resources). The transistor floating gate technology used is already at scaling limits, and pushing it farther is leading to lower reliability, so although it has been a stop-gap for data-centric computing, it is likely not the solution.

Instead, several new nonvolatile RAM (NVRAM) device technologies are likely solutions, including:

  • Phase change RAM (PCRAM): This memory uses a heating element to turn a class of materials known as chalcogenides into either a crystallized or amorphous glass state, thereby storing two states that can be programmed and read, with state retained even when no power is applied. PCRAM appears to show the most promise in the near term for M-type synchronous nonvolatile memory (NVM).
  • Resistive RAM (RRAM): Most often described as a circuit element that is unlike a capacitor, inductor, or resistor, RRAM provides a unique relationship between current and voltage, unlike other well-known devices that store charge or magnetic energy or provide linear resistance to current flow. Materials with the properties of memristors have been tested for many decades, but engineers usually avoid them because of their nonlinear properties and the lack of application for them. IEEE Fellow Leon Chua describes them in “Memristor: The Missing Circuit Element.” A memristor’s behavior can be summarized as follows: Current flow in one direction causes electrical resistance to increase, and flow in the opposite direction causes resistance to decrease, but the memristor retains the last resistance it had when flow is restarted. As such, it can store a nonvolatile state, be programmed, and have its state read. For details and even some controversy on what is and is not a memristor, see Resources.
  • Spin transfer torque RAM (STT-RAM): A current passed through a magnetic layer can produce a spin-polarized current that, when directed into a magnetic layer, can change its orientation via angular momentum. This behavior can be used to excite oscillations and flip the orientation of nanometer-scale magnetic devices. The main drawback is the high current needed to flip the orientation.

Consult the many excellent entries in Resources for more in-depth information on each device technology.

From a systems perspective, as these devices evolve, where they can be used and how well each might fill the access gap depends on the device’s:

  • Cost
  • Scalability (device integration size must be smaller than a transistor to beat flash; less than 20 nanometers)
  • Latency to program and read
  • Device reliability
  • Perhaps most importantly, durability (how often it can be programmed and erased before it becomes unreliable).

Based on these device performance considerations, IBM has divided SCM into two main classes:

  • S-type: Asynchronous access via an I/O controller. Threading or multiprogramming is used to hide the I/O latency to the device.
  • M-type: Synchronous access via a memory controller. Think about this as wait-states for RAM access in which a CPU core stalls.

Further, NAND SSD would be considered fast storage, accessed via a block-oriented storage controller (much higher I/O rates but similar bandwidth to a spinning disk drive).

It may seem like the elimination of asynchronous I/O for data processing (except, of course, for archive access or cluster scaling) might be a cure-all for data-centric processing. In some sense it is, but systems designers and software developers will have to change habits. The need for I/O latency hiding will largely go away on each node in a system, but it won’t go away completely. Clusters built from InfiniBand deal with node-to-node data-transfer latency with Message Passing Interface or MapReduce schemes and enjoy similar performance to this envisioned SCM node except when booting or when node data exceeds node working RAM size.

So, for scaling purposes, cluster interconnection and I/O latency hiding among nodes in the cluster are still required.

Moving processing closer to data with coprocessors

Faster access to big data is ideal and looks promising, but some applications will always benefit from the alternative approach of moving processing closer to data interfaces. Many examples exist, such as graphics (graphics processing units, GPUs), network processors, protocol-offload engines like the TCP/IP Offload Engine, RAID on chip, encryption coprocessors, and more recently, the idea of computer vision coprocessors. My research involves computer vision and graphics coprocessors, both at scale in clusters and embedded. I am working on what I call a computer vision processing unit, comparing several coprocessors that became more widely pursued with the 2012 announcement of OpenVX by Khronos (see Resources).

In the embedded world, such a method might be described as an intelligent sensor or smart camera, methods in which preliminary processing of raw data is provided by the sensor interface and an embedded logic device or microprocessor, perhaps even a multicore system on a chip (SoC).

In the scalable world, this most often involves use of a coprocessor bus or channel adapter (like PCI Express [PCIe], Ethernet, or InfiniBand) that provides data processing between the data source (network side) and the node I/O controller (host side).

Whether processing should be done, or is more efficient when done, in the I/O path or on a CPU core has always been a topic of hot debate. Based on an existence proof (GPUs and network processors), coprocessors can clearly be useful, although they wax and wane in popularity as coprocessor technology advances relative to general-purpose processors. So, let’s take a quick look at some of the methods:

  • Vector processing for single program, multiple data: Provided today by GPUs, general-purpose GPUs (GP-GPUs), and accelerated processing units (APUs), the idea is that data can be transformed on its way to an output device like a display or sent to a GP-GPU/APU and transformed on a round trip from the host. “General purpose” implies more sophisticated features like double-precision arithmetic compared to single precision only for graphics-specific processing.
  • Many core: Traditional many-core coprocessor cards (see Resources) are available from various vendors. The idea is to lower cost and power consumption by using simpler, yet numerous cores on the I/O bus, with round-trip offloading of processing to the cards from a more capable but power-hungry and costly full-scale multicore host. Typically, the many-core coprocessor might have an order of magnitude more cores than the host and often includes gigabit or 10G Ethernet and other types of network interfaces.
  • I/O bus field-programmable gate arrays (FPGAs): FPGA cards, most often used to prototype a new coprocessor in the early stages of development, can perhaps be used as a solution for low-volume coprocessors as well.
  • Embedded SoCs: A multicore solution can be used in an I/O device to create an intelligent device like a stereo ranging or time-of-flight camera.
  • Interface FPGA/configurable programmable logic devices: A digital logic state machine can provide buffering and continuous transformation of I/O data, such as digital video encoding.

Let’s look at an example based on offload and the I/O path. Data transformation has obvious value for applications like the decoding of MPEG-4 digital video, where a GPU coprocessor sits in the path between the player and a display, as shown in Figure 2 for the Linux® MPlayer Video Decode and Presentation API for Unix (VDPAU) software interface to NVIDIA MPEG decoding on the GPU.

Figure 2. Simple video decode offload example

Likewise, any data processing or transformation that can be done in-bound or out-bound from a CPU host may have value, especially if the coprocessor can provide processing at a lower cost, with greater efficiency, or with lower power consumption based on purpose-built processors compared to general-purpose CPUs.

To start to understand a GP-GPU compared to a multicore coprocessor approach, try downloading the two examples of a point spread function to sharpen the edges on an image (threaded transform example) compared with the GPU transform example. Both provide the same 320×240-pixel transformation, but in one case, the Compute Unified Device Architecture (CUDA) C code provided requires a GPU or GP-GPU coprocessor and, in the other case, either a multicore host or a many-core (for example, MICA) coprocessor.
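For readers without the downloads at hand, the kind of per-pixel work involved can be sketched in a few lines. This is a hypothetical pure-Python illustration, not the threaded or CUDA example code: it assumes a 3×3 sharpening kernel whose center weight is 1 + k and whose eight neighbors each weigh -k/8, so the weights sum to 1 and flat regions are left unchanged. The value of k and the function name are assumptions.

```python
def sharpen(image, k=4.0):
    """Apply a 3x3 sharpening kernel; border pixels are copied through unchanged."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbors = sum(image[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                            if (dy, dx) != (0, 0))
            value = (1.0 + k) * image[y][x] - (k / 8.0) * neighbors
            out[y][x] = max(0, min(255, int(round(value))))  # clamp to 8-bit range
    return out


# A soft vertical edge: dark columns on the left, brighter columns on the right.
soft_edge = [[10, 10, 10, 50, 50, 50] for _ in range(5)]
for row in sharpen(soft_edge):
    print(row)
```

Every output pixel depends only on a fixed neighborhood of input pixels, which is why the same transform maps naturally onto host threads (by rows or strips) or onto one GPU thread per pixel.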

So which is better?

Neither approach is clearly better, mostly because the NVRAM solutions have not yet been made widely available (except as expensive battery-backed DRAM or as S-type SCM from IBM Texas Memory Systems Division) and moving processing into the I/O data path has traditionally involved less friendly programming. Both are changing, though: Coprocessors are adopting higher-level languages like the Open Computing Language (OpenCL) in which code written for multicore hosts runs equally well on Intel MICA or Altera Stratix IV/V architectures.

Likewise, all of the major computer systems companies are working feverishly to release SCM products, with PCRAM the most likely to be available first. My advice is to assume that both will be with us for some time and operating systems and applications must be able to deal with both. The memristor, or RRAM, includes a vision that resembles Isaac Asimov’s fictional positronic brain in which memory and processing are fully integrated as they are in a human neural system but with metallic materials. The concept of fully integrated NVM and processing is generally referred to as processing in memory (PIM) or neuromorphic processing (see Resources). Scalable NVM integrated processing holds extreme promise for biologically inspired intelligent systems similar to the human visual cortex, for example. Pushing toward the goal of integrated NVM, with PIM from both sides, is probably a good approach, so I plan to keep up with and keep working on systems that employ both methods—coprocessors and NVM. Nature has clearly favored direct, low-level, full integration of PIM at scale for intelligent systems.

Scaling nodes with InfiniBand interconnection

System designers always have to consider the trade-off between scaling up each node in a system and scaling out a solution that uses networking or more richly interconnected clustering to scale processing, I/O, and data storage. At some point, scaling the memory, processing, and storage a single node can integrate hits a practical limit in terms of cost, power efficiency, and size. It is also often more convenient from a reliability, availability, and servicing perspective to spread capability over multiple nodes so that if one needs repair or upgrade, others can continue to provide service with load sharing.

Figure 3 shows a typical InfiniBand 3D torus interconnection.

Figure 3. Example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)

Image showing an example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)

In Figure 3, the 4x4x4 torus shown is for the San Diego Supercomputer Center (SDSC) Gordon supercomputer, as documented by Mellanox, which uses 36-port InfiniBand switches to connect nodes to each other and to storage I/O.
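To make the torus arithmetic concrete, here is a small sketch that lists the wraparound neighbors of a switch and counts node ports. The split of each 36-port switch into 18 node-facing ports and 18 inter-switch ports is an assumption, chosen because 64 switches × 18 ports reproduces the 1152 figure in the caption; the source does not state the split. Function names are hypothetical.

```python
def torus_neighbors(coord, dims=(4, 4, 4)):
    """Return the six wraparound neighbors of a switch in a 3D torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # wrap at the torus edge
            neighbors.append(tuple(n))
    return neighbors


def torus_capacity(dims=(4, 4, 4), ports_per_switch=36, node_ports=18):
    """Return (switches, node-facing ports, links per torus direction).

    node_ports=18 is an assumed split, not a figure given in the article.
    """
    switches = dims[0] * dims[1] * dims[2]
    links_per_direction = (ports_per_switch - node_ports) // 6
    return switches, switches * node_ports, links_per_direction
```

Under these assumptions, `torus_capacity()` gives 64 switches, 1152 node-facing ports, and 3 links in each of the six torus directions. The modulo in `torus_neighbors` is what distinguishes a torus from a plain 3D mesh: the switch at `(0, 0, 0)` is one hop from `(3, 0, 0)`.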

InfiniBand, Converged Enhanced Ethernet (CEE) carrying iSCSI, and Fibre Channel are the most often used scalable storage interfaces for access to big data. This storage area network (SAN) scaling for RAID arrays is used to host distributed, scalable file systems like Ceph, Lustre, Apache Hadoop, or the IBM General Parallel File System (GPFS). Use of CEE and InfiniBand for storage access using the OpenFabrics Alliance SCSI Remote Direct Memory Access (RDMA) Protocol and iSCSI Extensions for RDMA is a natural fit for SAN storage integrated with an InfiniBand cluster. Storage is viewed more as a distributed archive of unstructured data that is searched or mined and loaded into node NVRAM for cluster processing. Higher-level data-centric cluster processing methods like Hadoop MapReduce can also be used to bring code (software) to the data at each node. These are big-data-related topics that I describe in more detail in the last part of this four-part series.
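The idea of bringing code to the data can be sketched without any cluster at all. The following is a plain-Python illustration of the map, shuffle, and reduce phases; it does not use the Hadoop API, the "nodes" are simulated by dictionary keys, and all function names and sample records are hypothetical.

```python
from collections import defaultdict


def map_phase(node_records, mapper):
    """Run mapper against each node's local records (code goes to the data)."""
    return {node: [pair for record in records for pair in mapper(record)]
            for node, records in node_records.items()}


def shuffle(mapped):
    """Group the emitted values by key across all nodes."""
    grouped = defaultdict(list)
    for pairs in mapped.values():
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def count_tags(record):
    """Hypothetical mapper: emit (tag, 1) for each whitespace-separated tag."""
    return [(tag, 1) for tag in record.split()]


def sum_counts(key, values):
    """Hypothetical reducer: total the counts emitted for one tag."""
    return sum(values)
```

For example, with `{"node0": ["moose road", "moose"], "node1": ["road aircraft"]}` as input, chaining the three phases with `count_tags` and `sum_counts` yields a count per tag. Only the small (key, value) pairs cross node boundaries in the shuffle; the raw records stay where they are stored.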

The future of data-centric scaling

This article makes an argument for systems design and architecture that move processors closer to data-generating and -consuming devices, as well as simplification of memory hierarchy to include fewer levels, leveraging lower-latency, scalable NVM devices. This defines a data-centric node design that can be further scaled with low-latency off-the-shelf interconnection networks like InfiniBand. The main challenge with data-centric computing is not only instructions per second or floating-point operations per second, but also I/O operations per second (IOPS) and the overall power efficiency of data processing.

In Part 1 of this series, I uncovered methods and tools to build a compute node and small cluster application that can scale with on-demand HPC by leveraging the cloud. In this article I detailed such high-performance system design advances as co-processing, nonvolatile memory, interconnection, and storage.

In Part 3 in this series I provide more in-depth coverage of a specific data-centric computing application — video analytics. Video analytics includes applications such as facial recognition for security and computer forensics, use of cameras for intelligent transportation monitoring, retail and marketing that involves integration of video (for example, visualizing yourself in a suit you’re considering from a web-based catalog), as well as a wide range of computer vision and augmented reality applications that are being invented daily. Although many of these applications involve embedded computer vision, most also require digital video analysis, transformation, and generation in cloud-based scalable servers. Algorithms like Sobel transformation can be run on typical servers, but algorithms like the generalized Hough transform, facial recognition, image registration, and stereo (point cloud) mapping, for example, require the NVM and coprocessor approaches this article discussed for scaling.
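For readers who have not met the Sobel transformation, here is a minimal pure-Python sketch of it, using the standard 3×3 Sobel kernels on a grayscale image stored as a list of rows. The function name and the clamping to 8 bits are choices made for this illustration.

```python
import math

# Standard Sobel kernels for horizontal (GX) and vertical (GY) gradients
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(gray):
    """Return the gradient magnitude image, clamped to 0..255.

    Border pixels are left at 0 because the 3x3 window does not fit there.
    """
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    pixel = gray[y + dy][x + dx]
                    gx += GX[dy + 1][dx + 1] * pixel
                    gy += GY[dy + 1][dx + 1] * pixel
            out[y][x] = min(255, int(math.hypot(gx, gy)))
    return out
```

The work per pixel is a fixed nine multiply-adds with no dependence between pixels, which is why this kind of transform runs acceptably on an ordinary server and also parallelizes trivially.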

In the last part of the series, I deal with big data issues.

Cloud scaling, Part 3: Explore video analytics in the cloud

Using methods, tools, and system design for video and image analysis, monitoring, and security

Explore and consider methods, tools, and system design for video and image analysis with cloud scaling. As described in earlier articles in this series, video analytics requires a more balanced data-centric compute architecture compared to traditional compute-centric, scalable, high-performance computing. The author examines the use of OpenCV and similar tools for digital video analysis and methods to scale this analysis using cluster and distributed system design.

The use of coprocessors designed for video analytics and the new OpenVX hardware acceleration discussed in previous articles can be applied to the computer vision (CV) examples presented in this article. This new data-centric technology for CV and video analytics requires the system designer to re-think application software and system design to meet demanding requirements, such as real-time monitoring and security for large, public facilities and infrastructure as well as a more entertaining, interactive, and safer world.

Public safety and security

The integration of video analytics in public places is perhaps the best way to ensure public safety, providing digital forensic capabilities to law enforcement and the potential to increase detection of threats and prevention of public safety incidents. At the same time, this need has to be balanced with rights to privacy, which can become a contentious issue if these systems are abused or not well understood. For example, the extension of facial detection, as shown in Figure 1, to facial recognition has obvious identification capability and can be used to track an individual as he or she moves from one public place to another. To many people, facial analytics might be seen as an invasion of privacy, and use of CV and video analytics should adhere to surveillance and privacy rights laws and policies. Any product or service developer might want to start by considering best practices outlined by the Federal Trade Commission (FTC; see Resources).

Digital video using standards such as those from the Moving Picture Experts Group (MPEG) for encoding video to compress, transport, uncompress, and display it has led to a revolution in computing ranging from social networking media and amateur digital cinema to improved training and education. Tools for decoding and consuming digital video are in everyday use, but video analytics also needs tools to encode and analyze uncompressed video frames, such as the Open Source Computer Vision library (OpenCV). One of the readily available and quite capable tools for encoding and decoding of digital video is FFmpeg; for still images, the GNU Image Manipulation Program (GIMP) is quite useful (see Resources for links). With these three basic tools (OpenCV, FFmpeg, and GIMP), an open source developer is fully equipped to start exploring computer vision (CV) and video analytics. Before exploring these tools and development methods, however, let’s first define these terms better and consider applications.

The first article in this series, Cloud scaling, Part 1: Build your own and scale with HPC on demand, provided a simple example using OpenCV that implements a Canny edge transformation on continuous real-time video from a Linux® web cam. This is an example of a CV application that you could use as a first step in segmenting an image. In general, CV applications involve acquisition, digital image formats for pixels (picture elements that represent points of illumination), images and sequences of them (movies), processing and transformation, segmentation, recognition, and ultimately scene descriptions. The best way to understand what CV encompasses is to look at examples. Figure 1 shows face and facial feature detection analysis using OpenCV. Note that in this simple example, using the Haar Cascade method (a machine learning algorithm) for detection analysis, the algorithm best detects faces and eyes that are not occluded (for example, my youngest son’s face is turned to the side) or shadowed and when the subject is not squinting. This is perhaps one of the most important observations that can be made regarding CV: It’s not a trivial problem. Researchers in this field often note that although much progress has been made since its advent more than 50 years ago, most applications still can’t match the scene segmentation and recognition performance of a 2-year-old child, especially when the ability to generalize and perform recognition in a wide range of conditions (lighting, size variation, orientation and context) is considered.

Figure 1. Using OpenCV for facial recognition

Image showing facial recognition analysis

To help you understand the analytical methods used in CV, I have created a small test set of images from the Anchorage, Alaska area that is available for download. The images have been processed using GIMP and OpenCV. I developed C/C++ code to use the OpenCV application programming interface with a Linux web cam, precaptured images, or MPEG movies. The use of CV to understand video content (sequences of images), either in real time or from precaptured databases of image sequences, is typically referred to as video analytics.
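The Haar Cascade method used for Figure 1 evaluates a large number of rectangular light and dark features, and such features are usually computed from an integral image (summed-area table) so that any rectangle sum costs four lookups. The sketch below shows that building block only; it is not OpenCV's implementation, and the function names are hypothetical.

```python
def integral_image(gray):
    """Build a summed-area table with one extra row and column of zeros."""
    height, width = len(gray), len(gray[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    return (table[y + h][x + w] - table[y][x + w]
            - table[y + h][x] + table[y][x])


def two_rect_feature(table, x, y, w, h):
    """Haar-like edge feature: top half sum minus bottom half sum."""
    half = h // 2
    return rect_sum(table, x, y, w, half) - rect_sum(table, x, y + half, w, half)
```

A feature like `two_rect_feature` responds strongly where a bright band sits above a dark one (or the reverse). Shadows and occlusion flatten exactly these contrasts, which is consistent with the detection failures noted above.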

Defining video analytics

Video analytics is broadly defined as analysis of digital video content from cameras (typically visible light, but it could be from other parts of the spectrum, such as infrared) or stored sequences of images. Video analytics involves several disciplines but at least includes:

  • Image acquisition and encoding. As a sequence of images or groups of compressed images. This stage of video analytics can be complex, including photometer (camera) technology, analog decoding, digital formats for arrays of light samples (pixels) in frames and sequences, and methods of compressing and decompressing this data.
  • CV. The inverse of graphical rendering, where acquired scenes are converted into descriptions compared to rendering a scene from a description. Most often, CV assumes that this process of using a computer to “see” should operate wherever humans do, which often distinguishes it from machine vision. The goal of seeing like a human does most often means that CV solutions employ machine learning.
  • Machine vision. Again, the inverse of rendering but most often in a well-controlled environment for the purpose of process control—for example, inspecting printed circuit boards or fabricated parts to make sure they are geometrically correct within tolerances.
  • Image processing. A broad application of digital signal processing methods to samples from photometers and radiometers (detectors that measure electromagnetic radiation) to understand the properties of an observation target.
  • Machine learning. Algorithms developed based on the refinement of the algorithm through training data, whereby the algorithm improves performance and generalizes when tested with new data.
  • Real-time and interactive systems. Systems that require response by a deadline relative to a request for service or at least a quality of service that meets SLAs with customers or users of the services.
  • Storage, networking, database, and computing. All required to process digital data used in video analytics, but a subtle, yet important distinction is that this is an inherently data-centric compute problem, as was discussed in Part 2 of this series.

Video analytics, therefore, is broader in scope than CV and is a system design problem that might include mobile elements like a smart phone (for example, Google Goggles) and cloud-based services for the CV aspects of the overall system. For example, IBM has developed a video analytics system known as the video correlation and analysis suite (VCAS), for which the IBM Travel and Transportation Solution Brief, Smarter Safety and Security Solution for Rail [PDF], is available; it is a good example of a system design concept. Detailed focus on each system design discipline involved in a video analytics solution is beyond the scope of this article, but many pointers to more information for system designers are available in Resources. The rest of this article focuses on CV processing examples and applications.

Basic structure of video analytics applications

You can break the architecture of cloud-based video analytics systems down into two major segments: embedded intelligent sensors (such as smart phones, tablets with a camera, or customized smart cameras) and cloud-based processing for analytics that can’t be directly computed on the embedded device. Why break the architecture into two segments rather than solving the whole problem in the smart embedded device? Embedding CV in transportation, smart phones, and products is not always practical. Even when embedding a smart camera makes sense, the compressed video or scene description is often back-hauled to a cloud-based video analytics system, just to offload the resource-limited embedded device. Perhaps more important than resource limitations, though, is that video transported to the cloud for analysis allows for correlation with larger data sets and annotation with up-to-date global information for augmented reality (AR) returned to the devices.

The smart camera devices for applications like gesture and facial expression recognition must be embedded. However, more intelligent inference to identify people and objects and fully parse scenes is likely to require scalable data-centric systems that can be more efficiently scaled in a data center. Furthermore, data processing acceleration at scale ranging from the Khronos OpenVX CV acceleration standards to the latest MPEG standards and feature-recognition databases are key to moving forward with improved video analytics, and two-segment cloud plus smart camera solutions allow for rapid upgrades.

With sufficient data-centric computing capability leveraging the cloud and smart cameras, the dream of inverse rendering can perhaps be realized where, in the ultimate “Turing-like” test that can be demonstrated for CV, scene parsing and re-rendered display and direct video would be indistinguishable for a remote viewer. This is essentially done now in digital cinema with photorealistic rendering, but this rendering is nowhere close to real time or interactive.

Video analytics apps: Individual scenarios

Killer applications for CV and video analytics are being thought of every day, some perhaps years from realization because of computing requirements or implementation cost. Nevertheless, here is a list of interesting applications:

  • AR views of scenes for improved understanding. If you have ever looked at, for example, a landing plane and thought, I wish I could see the cockpit view with instrumentation, this is perhaps possible. I worked in Space Shuttle mission control long ago, where a large development team meticulously re-created a view of the avionics for ground controllers that shadowed what astronauts could see—all graphical, but imaging fusion of both video and graphics to annotate and re-create scenes with meta-data. A much simplified example is presented here in concept to show how an aircraft observed via a tablet computer camera could be annotated with attitude and altitude estimation data (see the example in this article).
  • Skeletal transformations to track the movement and estimate the intent and trajectory of an animal that might jump onto a highway. See the example in this article.
  • Fully autonomous or mostly autonomous vehicles with human supervisory control only. Think of the steps between today’s cruise control and tomorrow’s fully autonomous car. Cars that can parallel park themselves today are a great example of this stepwise development.
  • Beyond face detection to reliable recognition and, perhaps more importantly, for expression feedback. Is the driver of a semiautonomous vehicle aggravated, worried, surprised?
  • Virtual shopping (AR to try products). Shoppers can see themselves in that new suit.
  • Signage that interacts with viewers. This is based on expressions, likes and dislikes, and data that the individual has made public.
  • Two-way television and interactive digital cinema. Entertainment for which viewers can influence the experience, almost as if they were actors in the content.
  • Interactive telemedicine. This is available any time with experts from anywhere in the world.

I make no attempt in this article to provide an exhaustive list of applications, but I explore more by looking closely at both AR (annotated views of the world through a camera and display—think heads-up displays such as fighter pilots have) and skeletal transformations for interactive tracking. To learn more beyond these two case studies and for more in-depth application-specific uses of CV and video analytics in medicine, transportation safety, security and surveillance, mapping and remote sensing, and an ever-increasing list of system automation that includes video content analysis, consult the many entries in Resources. The tools available can help anyone with computer engineering skills get started. You can also download a larger set of test images as well as all OpenCV code I developed for this article.

Example: Augmented reality

Real-time video analytics can change the face of reality by augmenting the view a consumer has with a smart phone held up to products or our view of the world (for example, while driving a vehicle) and can allow for a much more interactive experience for users for everything from movies to television, shopping, and travel to how we work. In AR, the ideal solution provides seamless transition from scenes captured with digital video to scenes generated by rendering for a user in real time, mixing both digital video and graphics in an AR view for the user. Poorly designed AR systems distract a user from normal visual cues, but a well-designed AR system can increase overall situation awareness, fusing metrics with visual cues (think fighter pilot heads-up displays).

The use of CV and video analytics in intelligent transportation systems has significant value for safety improvement, and perhaps eventually CV may be the key technology for self-driving vehicles. This appears to be the case based on the U.S. Defense Advanced Research Projects Agency challenge and the Google car, although use of the full spectrum with forward-looking infrared and instrumentation in addition to CV has made autonomous vehicles possible. Another potentially significant application is air traffic safety, especially for airports to detect and prevent runway incursion scenarios. The imagined AR view of an aircraft on final approach at Ted Stevens airport in Anchorage shows a Hough linear transform that might be used to segment and estimate aircraft attitude and altitude visually, as shown in Figure 2. Runway incursion safety is of high interest to the U.S. Federal Aviation Administration (FAA), and statistics for these events can be found in Resources.
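As a rough illustration of the Hough linear transform, the sketch below lets every edge point vote for all lines that could pass through it, using the normal form rho = x·cos(theta) + y·sin(theta). It is a bare accumulator in pure Python with hypothetical function names; how the resulting lines would be turned into attitude or altitude estimates is not specified in the text and is not attempted here.

```python
import math


def hough_lines(edge_points, width, height, theta_steps=180):
    """Vote in (rho, theta) space for every edge point.

    Returns the accumulator and the rho offset used to index it.
    """
    max_rho = int(math.ceil(math.hypot(width, height)))
    acc = [[0] * theta_steps for _ in range(2 * max_rho + 1)]
    for x, y in edge_points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[rho + max_rho][t] += 1  # shift so negative rho is indexable
    return acc, max_rho


def strongest_line(acc, max_rho, theta_steps=180):
    """Return (rho, theta in degrees, votes) for the fullest accumulator cell."""
    votes, r, t = max((votes, r, t)
                      for r, row in enumerate(acc)
                      for t, votes in enumerate(row))
    return r - max_rho, 180.0 * t / theta_steps, votes
```

The cost is proportional to the number of edge points times the number of angle steps, and the accumulator grows with image size. That is modest for the linear transform but grows quickly for generalized variants with more parameters.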

Figure 2. AR display example

Image showing an example of video augmentation

For intelligent transportation, drivers will most likely want to participate even as systems become more intelligent, so a balance of automation and human participation and intervention should be kept in mind (for autonomous or semiautonomous vehicles).

Skeletal transformation examples: Tracking movement for interactive systems

Skeletal transformations are useful for applications like gesture recognition or gait analysis of humans or animals. Any application where the motion of a body’s skeleton (rigid members) must be tracked can benefit from a skeletal transformation. Most often, this transformation is applied to bodies or limbs in motion, which further enables the use of background elimination for foreground tracking. However, it can still be applied to a single snapshot, as shown in Figure 3, where a picture of a moose is first converted to a gray map, then a threshold binary image, and finally the medial distance is found for each contiguous region and thinned to a single pixel, leaving just the skeletal structure of each object. Notice that the ears on the moose are back, an indication of the animal’s intent (higher-resolution skeletal transformation might be able to detect this as well as the gait of the animal).
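The gray map, threshold, and thinning steps can be sketched in pure Python as follows. This is not the code used for Figure 3: the classic Zhang-Suen thinning algorithm stands in for the medial-distance step, the luma weights and threshold level are assumed values, and the image is assumed to have at least a one-pixel background border.

```python
def to_gray(rgb):
    """Convert rows of (r, g, b) tuples to gray using assumed luma weights."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in rgb]


def threshold(gray, level=128):
    """Binary image: 1 where the pixel is at or above level (assumed 128)."""
    return [[1 if p >= level else 0 for p in row] for row in gray]


def thin(binary):
    """Zhang-Suen thinning; returns a new image, leaving the input untouched."""
    img = [row[:] for row in binary]
    height, width = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(1, height - 1):
                for x in range(1, width - 1):
                    if img[y][x] != 1:
                        continue
                    # Neighbors clockwise from north: P2 .. P9
                    n = [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1],
                         img[y + 1][x + 1], img[y + 1][x], img[y + 1][x - 1],
                         img[y][x - 1], img[y - 1][x - 1]]
                    p2, _, p4, _, p6, _, p8, _ = n
                    count = sum(n)
                    # Number of 0 -> 1 transitions around the ring
                    trans = sum(1 for i in range(8)
                                if n[i] == 0 and n[(i + 1) % 8] == 1)
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= count <= 6 and trans == 1 and ok:
                        to_clear.append((y, x))
            for y, x in to_clear:  # delete only after the full pass
                img[y][x] = 0
            changed = changed or bool(to_clear)
    return img
```

Calling `thin(threshold(to_gray(image)))` reduces each blob to a thin set of connected strokes. Because the loop repeats until a full pass deletes nothing, the run time depends on the thickness of the thickest region, which is one reason higher-resolution skeletal transforms become expensive.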

Figure 3. Skeletal transformation of a moose

Image showing an example of a skeletal transformation

Skeletal transformations can certainly be useful in tracking animals that might cross highways or charge a hiker, but the transformation has also become of high interest for gesture recognition in entertainment, such as in the Microsoft® Kinect® software developer kit (SDK). Gesture recognition can be used for entertainment but also has many practical purposes, such as automatic sign language recognition, which is not yet available as a product but is a concept in research. Certainly skeletal transformation CV can analyze the human gait for diagnostic or therapeutic purposes in medicine or to capture human movement for animation in digital cinema.

Skeletal transformations are widely used in gesture-recognition systems for entertainment. Creative and Intel have teamed up to create an SDK for Windows® called the Creative Interactive Gesture Camera Developer Kit (see Resources for a link) that uses a time-of-flight light detection and ranging sensor, camera, and stereo microphone. This SDK is similar to the Kinect SDK but intended for early access for developers to build gesture-recognition applications for the device. The SDK is amazingly affordable and could become the basis for some breakthrough consumer devices now that it is in the hands of a broad development community. To get started, you can purchase the device from Intel, and then download the Intel® Perceptual Computing SDK. The demo images are included as an example along with numerous additional SDK examples to help developers understand what the device can do. You can use the finger tracking example shown in Figure 4 right away just by installing the SDK for Microsoft Visual Studio® and running the Gesture Viewer sample.

Figure 4. Skeletal transformation using the Intel Perceptual Computing SDK and Creative Interactive Gesture Camera Developer Kit

Image showing a skeletal and blob transformation of a hand


The future of video analytics

This article makes an argument for the use of video analytics primarily to improve public safety; for entertainment purposes, social networking, telemedicine, and medical augmented diagnostics; and to envision products and services as a consumer. Machine vision has quietly helped automate industry and process control for years, but CV and video analytics in the cloud now show promise for providing vision-based automation in the everyday world, where the environment is not well controlled. This will be a challenge both in terms of algorithms for image processing and machine learning as well as data-centric computer architectures discussed in this series. The challenges for high-performance video analytics (in terms of receiver operating characteristics and throughput) should not be underestimated, but with careful development, this rapidly growing technology promises a wide range of new products and even human vision system prosthetics for those with sight impairments or loss of vision. Based on the value of vision to humans, no doubt this is also fundamental to intelligent computing systems.
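Receiver operating characteristics can be made concrete with a few lines of code. The sketch below takes detector confidence scores and ground-truth labels (1 for a true target, 0 otherwise), sweeps the decision threshold, and reports the false-positive and true-positive rates at each setting. The function names and the trapezoidal area calculation are choices made for this illustration, and it assumes at least one positive and one negative label.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule, anchored at the corners."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

A detector that ranks every true target above every non-target has an area of 1.0, while one that scores at random has an area near 0.5. Throughput has to be measured separately, because a detector can trace a good curve and still be too slow for real-time use.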


Description Name Size
OpenCV Video Analytics Examples 600KB
Simple images for use with OpenCV 6474KB





Description Name Size
GPU accelerated image transform 644KB
Grid threaded comparison 1.08MB
Simple image for transform benchmark Cactus-320× 206KB




Description Name Size
Continuous HD digital camera transform example 123KB
Grid threaded prime generator benchmark hpc_cloud_grid.tar.gz 3KB
High-resolution image for transform benchmark 12288KB



