Amazon Web Services (AWS)

Generic Topics

AWS HPC Workloads

 AWS offers a wide variety of services and service types; the applicability of a service in a given context is something architects and designers must research, test, and adapt. This post covers the AWS Well-Architected Framework lenses, focusing on HPC examples and how AWS supports HPC workloads.

For high-performance computing, AWS offers specific EC2 instance types designed for highly performant compute loads:

  • these include a family of supported instance types such as M5*, R4*, R5*, X*, *-metal (bare metal), etc.;
  • the maximum transmission unit (MTU) ranges from the standard 1500 bytes up to 9001 bytes (jumbo frames);
  • network transmission speeds of up to 100 Gbps are achievable for high-performance computing;
  • an appropriate instance type (typically an EC2 instance type) and grouping (AZs / VPC) should be used to maximize network throughput for high-performance computing
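
To illustrate the MTU bullet above, here is a small sketch comparing how many frames a transfer needs at standard versus jumbo MTU. The ~40 bytes of IP + TCP header per frame is an illustrative assumption, not an exact figure:

```python
import math

def frames_needed(payload_bytes, mtu, header_bytes=40):
    """Frames required to move a payload, assuming ~40 bytes of
    IP + TCP headers per frame (illustrative, not exact)."""
    per_frame = mtu - header_bytes
    return math.ceil(payload_bytes / per_frame)

one_gib = 1024 ** 3
standard = frames_needed(one_gib, 1500)   # standard Ethernet MTU
jumbo = frames_needed(one_gib, 9001)      # jumbo frames

# jumbo frames need roughly 6x fewer frames (and header overhead)
print(standard, jumbo, round(standard / jumbo, 1))
```

Fewer, larger frames mean less per-packet overhead, which is why jumbo frames matter for HPC-scale data movement.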

With instance type variations:

  • the Elastic Network Adapter (ENA) supports network speeds of up to 100 Gbps
  • the Intel 82599 Virtual Function (VF) interface supports network speeds of up to 10 Gbps
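
To put those bandwidth figures in perspective, a back-of-the-envelope sketch of idealized transfer time, ignoring protocol overhead and latency (the 1 TiB dataset size is hypothetical):

```python
def transfer_seconds(size_gib, link_gbps):
    """Idealized time to move size_gib of data over a link_gbps link,
    ignoring protocol overhead, latency, and congestion."""
    bits = size_gib * 1024 ** 3 * 8
    return bits / (link_gbps * 1e9)

# a hypothetical 1 TiB dataset
print(round(transfer_seconds(1024, 10), 1))    # over a 10 Gbps VF link
print(round(transfer_seconds(1024, 100), 1))   # over a 100 Gbps ENA link
```

At 10 Gbps the transfer takes roughly 15 minutes; at 100 Gbps it drops to under 90 seconds, which is the difference the instance-type choice makes.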
 -----------------------------------------------------------------------------------------------------------------------
 
Cluster placement groups - a cluster placement group is a logical grouping of EC2 instances within a single Availability Zone; cluster placement groups can span peered VPCs in the same region;
  • network traffic between instances in a cluster placement group can use up to 10 Gbps for single-flow traffic
  • traffic from a cluster placement group to on-premises over AWS Direct Connect (DX) is limited to 5 Gbps

High-performance computing lens - cloud platforms can host HPC workloads. The natural ebb and flow and bursting characteristics of HPC workloads make them well suited for pay-as-you-go infrastructure. Terms and practices associated with HPC workloads:
  • vCPUs - correspond to threads / hyper-threads
  • align the procurement model to the workload - use the pay-as-you-go model to support flexible loads, e.g. a peak / stress / high workload runs for a duration, after which resource utilization drops; elasticity is key here and helps meet varying workload requirements
  • AWS ParallelCluster - use it to experiment with the workload and optimize the architecture for performance and cost; use AMIs and EBS snapshots, S3, and CloudFormation templates along with AWS ParallelCluster configuration templates;
  • test real-world workloads - application requirements vary based on algorithm complexity, the mathematical methods applied (finite element methods, extrapolations, moving averages, calculus, predictive analysis, etc.), the size and complexity of the models used, user-interface simulation requirements, and requirements for visual graphics and complex 3-D image rendering; extrapolate the anticipated real-world load for performance tests
    • use Spot Instances - least expensive for non-critical workloads; good for research-oriented workload simulation
  • select the storage solution that best aligns to the requirement; e.g. create a RAID 0 array to achieve higher levels of I/O performance where performance matters more than fault tolerance;
    • for storage solutions with a fixed size, such as EBS volumes or FSx, monitor the amount of storage used versus the overall storage size; automate scaling of storage resources based on a threshold limit;
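
The storage-threshold bullet can be sketched as a simple scaling rule. The function name, threshold, and growth factor below are illustrative, not an AWS API:

```python
def next_capacity_gib(used_gib, total_gib, threshold=0.8, grow_factor=1.5):
    """Return a grown capacity when utilization crosses the threshold,
    otherwise keep the current capacity. All numbers are illustrative."""
    if used_gib / total_gib >= threshold:
        return int(total_gib * grow_factor)
    return total_gib

print(next_capacity_gib(750, 1000))  # 75% used -> keep 1000 GiB
print(next_capacity_gib(850, 1000))  # 85% used -> grow to 1500 GiB
```

In practice the same check would run on a schedule against CloudWatch storage metrics and trigger the resize automatically.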

Scenarios where applicable: genomics, computational chemistry, risk modeling, computer-aided design & mechanics, weather prediction & seismic imaging, machine learning & deep learning, autonomous driving, etc.

Loosely coupled scenarios - entail processing a large number of smaller, independent jobs; any parallelism within a job is typically node-local shared-memory parallelization (SMP); e.g. Monte Carlo simulations, image processing, genomics analysis, and electronic design automation (EDA);

  • compute - driven by the application's memory-to-compute ratio; GPU / FPGA accelerators on EC2 instances can be used
  • network - workload performance is not sensitive to the bandwidth or network latency between instances; hence cluster placement is not necessary
  • storage - driven by data size, I/O (read / write) & data transfers
  • deployment - can be distributed across AZs with no impact on performance; can be run with AWS Batch, AWS ParallelCluster, or a combination of AWS services
Tightly coupled scenarios - parallel, interdependent processes for compute calculations; can involve tens of thousands to millions of inter-linked processes / iterations. Failure of one node usually leads to failure of the entire calculation; application-level checkpoints are therefore recommended to restore the last known simulation state;
  • compute - a homogeneous cluster with similar compute nodes; per-core instance size can be chosen from the available memory-optimized instance types; the largest memory per core, proportionate to the workload, is preferred
  • network - cluster placement groups apply here; the AWS Elastic Fabric Adapter (EFA) can be used to support tightly coupled workloads with high inter-node communication at scale;
  • storage - driven by data size, I/O (read / write) & data transfers
  • deployment - can be deployed via AWS Batch, AWS ParallelCluster configuration / AWS CloudFormation, or EC2 Fleet
The HPC lens covers related scenarios - cluster environments, batch & queue processing, hybrid deployments, and serverless workflows. The following sections cover HPC in the context of the AWS Well-Architected Framework pillars:

-----------------------------------------------------------------------------------------------------------------------
 
Operational excellence - automate cluster operations using AWS ParallelCluster and AWS CloudFormation to automate the creation of workload components and make them configurable; use AWS Batch for scheduled job execution; use Amazon CloudWatch to monitor cluster load metrics and configure notifications; use CI/CD tools such as AWS CodePipeline and AWS CodeDeploy for deployment management, and AWS CodeCommit for version control;

Security - use IAM; define principals, users, groups, and roles; build policies that reference those principals; run HPC workloads autonomously with minimum exposure to resources; protect keys used to access services / workload components, and ensure appropriate key-rotation policies are in place;
  • address data protection & data-loss prevention using storage services such as S3 / EBS / EFS; understand the availability & durability requirements as applied to HPC scenarios
Reliability - HPC applications often require a large number of compute instances simultaneously. Scaling horizontally may require raising AWS service limits before a large workload is deployed, whether to one large cluster or to many smaller clusters at once; service limits often need to be increased from their default values to handle large deployments.
 
Checkpoints - are needed to recover to the latest consistent state in the event of a failure. Failure scenarios include abrupt instance failure, cluster start-up failure, application component (compute / data) crashes, network interruptions, etc. Checkpointing is a common feature built into many HPC applications; it avoids data loss and helps restore the system to the last consistent state.
  • failure tolerance can be improved by deploying to multiple AZs / regions;
  • a trade-off between the reliability and cost pillars is needed, with considerations around clustered placement of compute instances based on latency (tight / loose coupling)
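
A minimal sketch of application-level checkpointing as described above - state is written atomically so a crash mid-write leaves the last consistent checkpoint intact (the file name, state shape, and interval are hypothetical):

```python
import json
import os
import tempfile

# a fresh scratch location for this demo run
CKPT = os.path.join(tempfile.mkdtemp(), "sim.ckpt")

def load_checkpoint():
    """Resume from the last consistent state if a checkpoint exists."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "value": 0.0}

def save_checkpoint(state):
    """Write to a temp file, then rename: os.replace is atomic, so a
    crash mid-write cannot corrupt the previous checkpoint."""
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)

state = load_checkpoint()
while state["step"] < 10:          # a toy 10-step "simulation"
    state["value"] += state["step"]
    state["step"] += 1
    if state["step"] % 5 == 0:     # checkpoint every 5 steps
        save_checkpoint(state)

print(state["step"], state["value"])   # prints: 10 45.0
```

On restart after a mid-run failure, `load_checkpoint` returns the most recent saved state, so only the work since the last checkpoint is repeated.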
------------------------------------------------------------------------------------------------------------------
Instance sizing - compute performance requires appropriate instance sizing to support HPC workloads; an HPC architecture can rely on one or more architectural elements - queued / batch / cluster compute / containers / serverless / event-driven;
  • choose an appropriate instance family for compute-, memory-, or GPU-intensive workloads
  • each instance family offers a range of instance sizes - large / extra-large, etc. - which allows capacity to scale vertically; choose an appropriately large instance type for the HPC workload at hand (tightly / loosely coupled)
  • choose current-generation, Nitro-based instance types for HPC workloads, with enhanced high-speed I/O acceleration; Nitro delivers performance close to bare metal;
  • when choosing underlying hardware, look for:
    • advanced processing features
    • hyper-threading technology
    • advanced vector extensions, processor affinity
    • processor state controls (C-state & P-state) - a choice between reduced latency on one or two active cores versus consistent performance at lower frequencies
  • FSx for Lustre natively integrates with S3; it presents the contents of an S3 bucket as a file system, which helps optimize storage costs
Approaches to scaling resources on AWS:
  • Demand-based approach
  • Buffer-based approach
  • Time-based approach
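
As one example, the buffer-based approach can be sketched as sizing a worker fleet from queue depth rather than instantaneous load (the function name, jobs-per-worker ratio, and fleet bounds are all illustrative):

```python
import math

def buffer_based_workers(queue_depth, jobs_per_worker,
                         min_workers=1, max_workers=100):
    """Buffer-based scaling: size the fleet from the backlog of queued
    work, clamped to fleet bounds. All parameters are illustrative."""
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))

print(buffer_based_workers(0, 25))     # idle queue -> floor of 1
print(buffer_based_workers(230, 25))   # 230 queued jobs -> 10 workers
print(buffer_based_workers(9000, 25))  # capped at max_workers = 100
```

A demand-based approach would instead react to a live metric such as CPU utilization, and a time-based approach would scale on a schedule.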

Networking considerations - tightly coupled applications benefit from the Elastic Fabric Adapter (EFA), a network device that can be attached to an Amazon EC2 instance. EFA provides lower and more consistent latency and higher throughput than the TCP transport traditionally used in cloud-based HPC systems.

  • EFA supports an OS-bypass access model via the Libfabric API - HPC applications communicate directly with the network interface hardware.
  • EFA enhances inter-instance communication and is optimized to work on the existing AWS network infrastructure; this is critical for scaling tightly coupled applications
-----------------------------------------------------------------------------------------------------------------------
DynamoDB scaling
    partition calculations:
        (1) number of partitions by throughput = (RCU / 3000) + (WCU / 1000)
            where RCU = read capacity units & WCU = write capacity units
            e.g. with 2000 RCU & 2000 WCU: (2000 / 3000) + (2000 / 1000) ≈ 2.67
        (2) number of partitions by size = x / 10 GB
            where x = data capacity for storage
        total number of partitions = CEILING(MAX(1, 2))
    a table has a partition key & an optional sort key
        the partition key acts like a primary key, and the sort key is a grouping / ordering key, similar to the map / reduce concept
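
The partition math above translates directly into code; the example capacity figures are hypothetical:

```python
import math

def dynamodb_partitions(rcu, wcu, data_gb):
    """Estimated partition count: by throughput = RCU/3000 + WCU/1000,
    by size = data/10 GB, total = the larger of the two, rounded up."""
    by_throughput = rcu / 3000 + wcu / 1000
    by_size = data_gb / 10
    return math.ceil(max(by_throughput, by_size))

print(dynamodb_partitions(2000, 2000, 8))    # throughput-bound -> 3
print(dynamodb_partitions(1000, 500, 120))   # size-bound -> 12
```

Note that provisioned throughput is divided evenly across partitions, so a hot partition key can throttle even when the table's overall capacity looks sufficient.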
       
DAX - DynamoDB Accelerator - an in-memory cache that sits in front of a DynamoDB table
 
ENI v/s ENA v/s EFA
ENI - Elastic Network Interface -- a virtual network card
  • with an ENI, you get 1 private & 1 public IPv4 address, 1 or more IPv6 addresses, 1 MAC address, a source/destination check flag, and a description
  • used to create low-budget high-availability solutions and networks connecting subnets
  • used by network & security appliances in your VPC
 
ENA - enhanced networking (Elastic Network Adapter) -- uses single root I/O virtualization (SR-IOV) to provide high-performance networking capabilities on supported instance types
  • used for higher I/O performance, higher network bandwidth, and higher packets-per-second (PPS) performance
  • there is no additional charge for enhanced networking
  • choose the Elastic Network Adapter over plain ENIs when you need speeds in the order of 50 to 100 Gbps
  • there is no need to keep adding ENIs; instead, configure ENA or the Intel 82599 VF for high-speed networking
       
EFA - Elastic Fabric Adapter -- a network device that can be attached to an Amazon EC2 instance to accelerate High Performance Computing (HPC) and machine learning applications
  • an EFA attaches to an EC2 instance and provides high-performance inter-node communication
  • use EFA for HPC & machine-learning use cases
-----------------------------------------------------------------------------------------------------------------------
