Wednesday, March 6, 2013

Computer Grid Architecture and Performance

Many organizations are now looking for ways to perform compute-intensive tasks at a lower cost. Cloud platforms that support high performance grid computing offer fast provisioning, minimal administration, and flexible instance sizing and capacity, along with innovative third-party support for grid coordination and data management.

You can allocate compute capacity on demand without upfront planning of data center, network, and server infrastructure, and you have access to a broad range of instance types to meet your demands for CPU, memory, local disk, and network connectivity. Infrastructure can run in any of a large number of global regions without the long lead times of contract negotiation and establishing a local presence, enabling faster delivery, especially in emerging markets. You can define a virtual network topology that closely resembles a traditional network you might operate in your own data center, and you can build grids and infrastructure as required, either isolated between business lines or shared for cost optimization. The operational tasks of running a compute grid of many nodes can be fully automated, and elastic compute capacity can be combined with other services to minimize complexity.
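To make the automation point concrete, here is a minimal sketch of one routine grid operation, scaling the engine fleet to match the task queue. The function names and the provision/terminate actions are hypothetical placeholders, not any specific cloud provider's API.

```python
# Hypothetical autoscaling logic for a compute grid: size the engine
# fleet from queue depth, clamped to configured limits.

def desired_engine_count(queued_tasks, tasks_per_engine=10,
                         min_engines=2, max_engines=100):
    # One engine per `tasks_per_engine` queued tasks (ceiling division),
    # never fewer than min_engines or more than max_engines.
    wanted = -(-queued_tasks // tasks_per_engine)
    return max(min_engines, min(max_engines, wanted))

def reconcile(current_engines, queued_tasks):
    # Compare the current fleet to the desired size and return the
    # action an automation loop would take against the provider API.
    target = desired_engine_count(queued_tasks)
    delta = target - current_engines
    if delta > 0:
        return ("provision", delta)
    if delta < 0:
        return ("terminate", -delta)
    return ("noop", 0)

print(reconcile(current_engines=4, queued_tasks=95))  # ('provision', 6)
```

In practice this reconcile step would run on a schedule, reading queue depth from the grid broker and calling the cloud provider's instance APIs.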

High performance computing (HPC) allows end users to solve complex science, engineering, and business problems using applications that require a large amount of computational resources, as well as high throughput and predictable latency networking. Most systems providing HPC platforms are shared among many users, and comprise a significant capital investment to build, tune, and maintain.

Many commercial and open source compute grids use HTTP for communication and can tolerate relatively unreliable networks with variable throughput and latency. However, for ticking risk applications, and in some proprietary compute grids, network latency and bandwidth can be important factors in overall performance. Compute grids typically have hundreds to thousands of individual processes (engines) running on tens or hundreds of machines. For reliable results, engines tend to be deployed in a fixed ratio to compute cores and memory (for example, two virtual cores and 2 GB of memory per engine). The formation of a compute cluster is controlled by a grid “director” or “controller,” and clients of the compute grid submit tasks to engines via a job manager or “broker.” In many grid architectures, data is sent directly between the client and engines, while in others it passes through the grid broker. In some architectures, known as two-tier grids, the director and broker responsibilities are managed by a single component, while in larger three-tier grids, a director may have many brokers, each responsible for a subset of engines.
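The three-tier topology described above can be sketched in a few lines. This is an illustrative model only, with made-up class and method names, not the API of any real grid product: a director routes submitted tasks to its least-loaded broker, and each broker dispatches to an idle engine in its own subset.

```python
from dataclasses import dataclass, field

@dataclass
class Engine:
    engine_id: str
    busy: bool = False

@dataclass
class Broker:
    broker_id: str
    engines: list = field(default_factory=list)

    def dispatch(self, task):
        # Hand the task to the first idle engine in this broker's subset.
        for engine in self.engines:
            if not engine.busy:
                engine.busy = True
                return engine.engine_id
        return None  # no capacity; a real broker would queue the task

@dataclass
class Director:
    brokers: list = field(default_factory=list)

    def submit(self, task):
        # Route the task to the broker with the fewest busy engines;
        # that broker then picks the engine.
        broker = min(self.brokers,
                     key=lambda b: sum(e.busy for e in b.engines))
        return broker.dispatch(task)

# One director, two brokers, two engines per broker.
grid = Director(brokers=[
    Broker("b1", [Engine("b1-e1"), Engine("b1-e2")]),
    Broker("b2", [Engine("b2-e1"), Engine("b2-e2")]),
])
print(grid.submit("price-portfolio"))  # -> b1-e1
```

Collapsing `Director` and `Broker` into a single component gives the two-tier variant; the three-tier split exists so that one director can fan out across many brokers, each owning only a subset of engines.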

The time frame for completing grid calculations tends to be minutes or hours rather than milliseconds. Calculations are partitioned among engines and computed in parallel, and thus lend themselves to a shared-nothing architecture. Communications with the client of the computation tend to tolerate relatively high latency and can be retried on failure.
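The shared-nothing, retry-on-failure pattern can be sketched as follows. A thread pool stands in for the separate engine processes of a real grid, and the workload (summing squares) is a placeholder for an actual calculation; the point is that each partition is computed independently, so a failed partition is simply resubmitted on its own.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_partition(partition):
    # Placeholder workload; a real engine would run a risk or
    # pricing calculation over its slice of the input.
    return sum(x * x for x in partition)

def run_grid_job(data, num_partitions=4, max_retries=3):
    # Round-robin partitioning: partition i gets elements i, i+n, i+2n, ...
    # No state is shared between partitions (shared-nothing).
    pending = {i: data[i::num_partitions] for i in range(num_partitions)}
    results = {}
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        for _ in range(max_retries):
            futures = {i: pool.submit(compute_partition, p)
                       for i, p in pending.items()}
            failed = {}
            for i, future in futures.items():
                try:
                    results[i] = future.result()
                except Exception:
                    failed[i] = pending[i]  # retry only the failed partition
            pending = failed
            if not pending:
                break
    return sum(results.values())

print(run_grid_job(list(range(1000))))  # sum of squares of 0..999
```

Because partitions share nothing, a retry never has to coordinate with the partitions that succeeded, which is what makes this style of computation a good fit for elastic, occasionally-interrupted cloud capacity.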

