AWS also provides a way to measure & review customer workloads. The AWS Well-Architected (WA) Tool reviews the current workload state & compares it with the latest AWS best practices. The framework provides a consistent approach to evaluate solutions, identify areas of improvement & assess the state of maturity in comparison with AWS reference architectures.
- Stop guessing capacity requirements - use as much capacity as needed, implement auto-scaling policies
- Test systems at production scale - simulate your test environment on-demand, with resources at scale & at low-cost on-cloud, decommissioned after use
- Automate tests, continuously integrate - automate everything; reduce manual regression effort to speed up time to build & deliver products / solutions on cloud
- Allow for evolutionary architectures - build adaptable architectures; open to build innovative solutions with the available flexibility on cloud
- Drive architectures using data - this approach helps apply fact-based decisions & architecture choices
- Practice, practice & practice - evolve architecture & operations
- what are the business priorities
- what is the worst possible scenario
- what are your legitimate, immovable constraints
- what data is the solution processing
- what type of skills does your team have
- what is the timeline for the product
- perform operations as code - define the entire workload (application components, platform, infrastructure, data, etc.) as portable code, using scripts triggered in response to events
- apply small, incremental & reversible changes - to account for failures, identify & resolve problems in the application / environment, with minimal impact to service consumers
- anticipate failures - identify potential sources of failure, test for failures. Design systems to detect & protect from failure; consider fail-safe design, handle system failures across system components
- learn from operational failure - drive improvements via lessons learnt - shared across teams in the organization
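The "operations as code" principle can be sketched as a minimal, event-triggered remediation handler. This is a hypothetical, Lambda-style sketch - the event shape & action names are simplified stand-ins for illustration, not a real CloudWatch event contract:

```python
import json

# Hypothetical Lambda-style handler: reacts to a (simplified) alarm event by
# deciding a remediation action in code, instead of relying on a manual runbook.
def handle_alarm_event(event: dict) -> dict:
    detail = event.get("detail", {})
    alarm = detail.get("alarmName", "unknown")
    state = detail.get("state", {}).get("value")
    if state == "ALARM":
        # In a real deployment this branch would invoke an automation document
        # or a scaling API; here we just return the decision for inspection.
        return {"alarm": alarm, "action": "run-remediation-runbook"}
    return {"alarm": alarm, "action": "no-op"}

event = {"detail": {"alarmName": "HighCPU", "state": {"value": "ALARM"}}}
print(json.dumps(handle_alarm_event(event)))
```

The point is that the response to an event is versioned, reviewable & testable like any other code.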
- organization - define business objectives & the structure to drive business goals; understand organization priorities - evaluate internal & external customer needs; evaluate security, governance & compliance requirements; drive from leadership; empower team members to act independently when necessary; establish escalation mechanisms to notify of problems; maintain clear & apt communication across the board; encourage innovation & experimentation;
- prepare - preparing for operational readiness - telemetry, operations, deployment risks & ops-readiness; understand workloads & their expected behavior; build systems to provide insight [metrics, logs, events, traces, etc.] supporting easier / faster troubleshooting; integrate distributed tracing [AWS X-Ray], correlate event flows across services; implement application telemetry - track internal state [install CloudWatch agents], use advanced metrics [e.g. API call volume, HTTP status codes, scaling events] for tracking customer behavior; derive insights from gathered & analyzed data; configure the workload to track response time for external dependencies e.g. external services, services over VPN / connected on-premises, etc.; configure transaction traceability; create parallel environment(s) on cloud for testing & experimentation; adopt a fail-fast model, simulate failure(s) during test; plan deployments using CI/CD tools - reduces the job of manual deployment; deploy small, frequent, reversible changes; experiment early to find better solutions;
- operate - achieve business outcomes as measured by the metrics defined; identify the relevant & important KPIs; examine system behavior via metrics - relate observations to target KPIs; e.g. KPIs - response time, # of erroneous transactions, volume of processed transactions, etc.; define KPIs, collect data & measure output against the pre-defined KPIs - these are the three main steps to understand workload health; based on observation & analysis - establish operations metrics & performance baselines; detect anomalies & respond appropriately as necessary; configure the required alerts on occurrences of abnormal events - outside of the configured threshold; develop response mechanisms for events [incident / problem management] - define event priorities, SLAs & escalation mechanisms; create dynamic dashboards for operators to easily identify system health & apply corrective measures as necessary; automated notifications based on the severity of an alert are equally important to define;
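The KPI flow above (define KPI, collect data, measure against a baseline) can be sketched as a simple threshold check - a minimal stand-in for what CloudWatch anomaly detection does with far more sophistication:

```python
from statistics import mean, stdev

def breaches(baseline, observations, k=3.0):
    """Flag observations more than k standard deviations from the baseline mean."""
    m, s = mean(baseline), stdev(baseline)
    return [x for x in observations if abs(x - m) > k * s]

# Illustrative response times (ms) forming the performance baseline
baseline = [120, 118, 125, 122, 119, 121]
print(breaches(baseline, [123, 210, 118]))  # → [210]
```

Any breach would then feed the alerting / escalation mechanism defined above.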
- key services offered by AWS related to Ops:
- AWS Glue / Amazon Athena (with S3) - capture, observe & analyze metrics;
- Amazon CloudWatch + CloudWatch Insights; AWS Health API (health checks); Amazon QuickSight; CloudWatch Logs Insights, etc.;
- CloudWatch anomaly detection & alarms;
- CloudWatch dashboards;
- 3rd party tools integration - Sumo Logic, Splunk, New Relic, Datadog, Loggly, etc.
- evolve - continuous process improvement over time; knowledge management; document system failure occurrence(s) with RCA & fix-path for future use / reference; such incident records can be used as a data feed to automate ticket management (AI); apply learning as process feedback for development; identify opportunities to improve operations - identify drivers for opportunities to help evaluate & prioritize them; maintain a central knowledge repository, shared & visible to all stakeholders; test assumptions & evaluate processes;
- key services - VPC Flow Logs; AWS Glue; Amazon Athena & Amazon QuickSight (analyze flow logs)
Security is a shared responsibility between the client / consumer & the cloud service provider. Identify & create a responsibility matrix (RACI) - call out who is responsible for which component, based on the chosen cloud-service model. Identify the components that fall under AWS' responsibility - understand the detection & response mechanisms offered by AWS for security monitoring, alerts & notifications
- implement strong security foundation - identity management; principle of least privileges & enforcing separation of duties are the key factors to consider;
- AAA - authenticate, authorize & audit; ability to track events across layers - across the workload components in real time - to investigate & resolve security incidents at a swift pace; integrate logs & metrics for traceability & analysis
- automate security best practices - implement reusable security controls defined & managed as code-versioned templates; automate & reuse
- protect data-in-transit & data-at-rest - apply encryption, encoding, tokenization, etc.
- prepare for security events - security & incident response management system; use response simulations & tools to detect failures early & apply mitigation measures
- separate workloads using accounts - enable common guardrails across the organization to support growing workloads; create account(s) each, for different environments - providing strong logical boundaries between workloads that process data of different sensitivity levels, as defined by external compliance requirements (PCI-DSS / HIPAA)
- identity & access management - secure user accounts - admin, root & application users; centrally manage access & govern to support growing workloads [use AWS IAM & AWS Organizations to segregate accounts]
- human identities - manage & control access to developers, admins, operators, consumers; device interactions via web / mobile clients, browsers / command-line tools
- machine identities - workload applications, operational tools & integrating components requiring an identity / service-account access with key management for internal / external account access
- Manage federated identities via SAML 2.0 & SSO; AWS supports identity federation - use Amazon Cognito for federated identity management;
- Use tools such as AWS KMS / AWS Secrets Manager (vault) - to manage & rotate machine identities. IAM roles are another option to manage machine identities, where AWS creates, distributes & rotates temporary credentials via the Instance Metadata Service (IMDS)
- access control policies - create access control policies to limit access, using AWS managed / customer managed policies. Types of policies that can be configured include - resource-based policies, attribute-based access control (ABAC) policies, organization service control policies (SCPs) & session policies. Always follow the principle of least privilege when granting access
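A least-privilege customer-managed policy, in practice, looks like the sketch below - a hypothetical policy granting read-only access to a single S3 prefix (the bucket name & prefix are made up for illustration):

```python
import json

# Minimal least-privilege policy sketch: read-only access to one S3 prefix,
# nothing else. Bucket & prefix names are hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyReports",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",            # bucket itself (for ListBucket)
                "arn:aws:s3:::example-bucket/reports/*",  # objects under one prefix
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Scoping `Action` & `Resource` this narrowly is what "least privilege" means concretely - broaden only when a denied call proves the need.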
- detection controls - use tools & configure alerts to detect & get notified of unexpected behavior in the workload (e.g. AWS CloudTrail, AWS Config, Amazon GuardDuty, AWS Security Hub, etc.); use services such as VPC Flow Logs to detect IP traffic flowing in & out of network interfaces; as always, Amazon CloudWatch Logs is pretty useful for log analysis & deriving insights for anomaly detection & audit trails;
- infrastructure protection - a key part of an information security program. Important to ensure systems & services are protected from unintended / unauthorized access to networks, infra & application components; AWS Outposts can be used to extend AWS services, infrastructure & operating models to data center(s) / co-location space / on-premises facilities;
- network protection - control traffic via network filtering rules at entry; configure allow-listed IP address(es) to limit access; create network layers (public / private) to control access; adopt a 'Zero Trust' model; define inbound & outbound network rules, avoid unintended access; use Transit Gateways to peer larger networks across region(s) / on-premises - ensures traffic remains on the AWS private network; reduce threats from DDoS / SQL injection / cross-site scripting / request forgery - AWS WAF & AWS Shield can inspect & filter HTTP traffic against rules, fronting Amazon API Gateway / Amazon CloudFront or an ALB; the AWS WAF Security Automations framework provides threat-intelligence-driven network anomaly detection
- compute protection - defense in-depth, vulnerability management & reducing the attack surface; use Patch Manager to automate the process of patching managed instances (OS & apps); harden OS & system components, libraries + external service interfaces; use static code analysis tools (e.g. Amazon CodeGuru) to reduce security vulnerabilities at the build stage; integrate security test stages into the build & deploy pipeline - test package(s) by injecting malformed data during tests (e.g. fuzzing techniques); validate software integrity using code signing; automate compute protection - Amazon Inspector, etc.;
- data protection - define data protection controls; identify & protect PII in the workload; use resource tags to identify & segregate AWS services / resources - manage data access & control; establish trust boundaries; define clean lanes for data flowing in & out and for computation, with clear demarcation between those lanes;
- protect data-at-rest using data encoding, encryption, tokenization; use AWS CloudHSM to generate & own encryption / access keys on the cloud; use AWS managed Config rules to automatically check if encryption is enabled on services such as S3, RDS & EBS volumes; S3 Glacier offers 'vault lock' - helps enforce compliance controls (e.g. WORM) on archived data; audit the use of encryption keys regularly - use AWS KMS logs & CloudTrail event logs to ensure the usage of all keys is valid; use AWS Config rules to validate & automate security controls on a continuous basis;
- protect data-in-transit between workload services & external services / components + end-users; implement secure key & certificate management (TLS) - use AWS Certificate Manager (ACM); transmit data via TLS - encrypt data during transmission; build data marshaling, encryption & decryption into application platform / framework services to ease data processing; use Amazon GuardDuty to automate detection of suspicious activity; at the network level, use VPC Flow Logs with Amazon EventBridge to detect abnormal connections;
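Tokenization, one of the data protection techniques mentioned above, can be sketched as follows - a toy in-memory vault standing in for a real secured token store (a production system would use an encrypted, access-controlled service, not a dict):

```python
import secrets

# Minimal tokenization sketch: replace a sensitive value with a random token
# and keep the mapping in a stand-in "vault". Only the vault can detokenize.
class Tokenizer:
    def __init__(self):
        self._vault = {}  # in production: an encrypted, access-controlled store

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)
        self._vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]

t = Tokenizer()
tok = t.tokenize("4111-1111-1111-1111")  # illustrative test card number
assert tok != "4111-1111-1111-1111" and t.detokenize(tok) == "4111-1111-1111-1111"
```

Systems downstream can then process tokens freely - the sensitive value never leaves the vault's trust boundary.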
- incident response - define an incident response management system - based on incident occurrences; define the response on the cloud - document recovery plans, times & objectives; involves educating the incident response team; implement automated controls to detect & classify incident priorities + create corresponding notification mechanisms; develop incident response plans such as playbooks; iterate through incident findings & identify process areas to improve on a continuous basis;
- apply new tags & programmatic changes to adjust permissions / control access;
- consider whether permissions can be scoped down to contain an incident, while reducing friction for stakeholders
- focus on the type of attack rather than the originator:
- random / targeted / persistent attacker
- incident response planning, preparation & an innovation cycle with experiments & game days are all the more important for an appropriate response
- Key Services - AWS IAM, AWS Security Hub, AWS CloudTrail, AWS Artifact, Amazon GuardDuty, AWS Shield, AWS Web Application Firewall (WAF), AWS KMS, AWS Certificate Manager (ACM), AWS SSO, AWS Organizations, AWS STS, AWS MFA, Amazon EventBridge, VPC Flow Logs, S3 Access Analyzer, etc.
- stop guessing capacity - scale horizontally for aggregate workload availability; apply cloud-bursting when required, for workloads on-premise; get rid of idle resources;
- automated failure detection & recovery - based on occurrence of an event breaching threshold; KPIs configured for business value
- test recovery procedures - simulate failures in test environment(s) & validate recovery strategies
- manage change in automation - manage infrastructure changes via automation; automation is the key to reliability;
- Distributed system design, failure prevention, recovery planning - based on recovery time & recovery point objectives (RTO, RPO); improve mean time between failures (MTBF); improve mean time to recovery (MTTR); design considerations - distribute data load across data centers, availability zones & regions to minimize the risk of data loss in the event of failure;
- Change management - use AWS services to monitor workload behavior; configure metrics to notify on failures / threshold breaches;
- Failure management - backup frequency driven by RTO & RPO; achieve fault isolation within defined boundary limits; test for reliability; plan for disaster recovery (DR); mitigate single points of failure in the architecture
- Modern application principles for reliability - simple, secure, resilient, automated, self-contained, interoperable, independent
- apply an appropriate design pattern / tactic to suit the cloud-migration approach - lift-and-shift / re-platform / re-architect / green-field implementation
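The MTBF / MTTR targets above translate directly into availability: steady-state availability = MTBF / (MTBF + MTTR). A quick sketch of the arithmetic (the hour figures are illustrative):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Cutting MTTR from 4h to 1h at the same MTBF raises availability:
print(round(availability(996, 4) * 100, 2))  # → 99.6
print(round(availability(996, 1) * 100, 2))  # → 99.9
```

This is why automation helps reliability so much: it attacks MTTR, which is usually far cheaper to improve than MTBF.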
Focus for reliability
- Limits - service limits, understand the default & requested resource limits; monitor & manage service quotas; accommodate fixed service quotas & constraints; automate quota management;
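Automating quota management can start as simply as a headroom check - a minimal sketch (quota names & numbers are illustrative, not real service quota values) that flags any quota nearing its limit:

```python
def quota_headroom(usage: dict, quotas: dict, threshold: float = 0.8):
    """Return quota names whose usage meets/exceeds the alert threshold (fraction of quota)."""
    return [name for name, used in usage.items()
            if used / quotas[name] >= threshold]

quotas = {"vpc-per-region": 5, "eip-per-region": 5}   # illustrative limits
usage  = {"vpc-per-region": 4, "eip-per-region": 2}   # illustrative usage
print(quota_headroom(usage, quotas))  # → ['vpc-per-region']  (4/5 = 0.8)
```

In a real setup, the usage & quota values would come from Service Quotas / CloudWatch, and a breach would trigger a limit-increase request or an alert.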
- Networking - understand topology, bandwidth & latency; look at ways to connect VPCs together, or via Direct Connect into your data center; understand the physical constraints
- apply a suitable network topology - use AWS Transit Gateway to connect large numbers of VPCs instead of peer-to-peer mesh networks;
- ensure non-overlapping private IP address ranges in all private address spaces where connected;
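Non-overlapping private IP ranges can be verified programmatically before connecting VPCs; a minimal sketch using Python's standard ipaddress module (the CIDR values are illustrative):

```python
from ipaddress import ip_network
from itertools import combinations

def overlapping_pairs(cidrs):
    """Return pairs of CIDR ranges that overlap - these would break routing between connected VPCs."""
    nets = [ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(nets, 2) if a.overlaps(b)]

vpcs = ["10.0.0.0/16", "10.1.0.0/16", "10.0.128.0/17"]
print(overlapping_pairs(vpcs))  # → [('10.0.0.0/16', '10.0.128.0/17')]
```

Running a check like this in CI against your IP allocation plan catches overlap before a peering or Transit Gateway attachment makes it painful to fix.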
- Availability - ensure your application is ready for business use
- implement graceful degradation - reduce / eliminate hard dependencies; use patterns such as circuit breakers for graceful degradation;
- throttle requests - to handle unexpected demand; control & limit retry calls - implement algorithms such as exponential backoff to limit the number of retries & avoid network congestion;
- fail fast & limit queues; set client timeouts; make services stateless where possible;
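Exponential backoff with jitter, mentioned above for limiting retries, can be sketched in a few lines (the "full jitter" variant - each delay is randomized between zero & the capped exponential value, which spreads retrying clients apart):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 0.1, cap: float = 5.0):
    """Full-jitter exponential backoff: delay grows as base * 2^attempt, capped, then randomized."""
    return [random.uniform(0, min(cap, base * (2 ** attempt)))
            for attempt in range(max_retries)]

print(backoff_delays())  # five delays, each within its exponential envelope
```

The cap keeps the worst-case wait bounded; the jitter prevents synchronized retry storms against a recovering dependency.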
4. Performance Efficiency - ability to use compute, storage & data resources efficiently, to meet system requirements as well as balance system efficiency at peak load.
Design principles
- consider surges in workload; plan for surges in demand; consider business expansion objectives at design time; adopt a user-centric rather than technology-centric approach;
- consider evolving technology alternatives on cloud - advanced processing capabilities with patterns such as map-reduce, machine learning, media transcoding & artificial intelligence
- adopt "serverless" architectures where feasible - reduce the operational overhead of server management
- evaluate trade-offs for performance; adopt a data-driven approach with frequent experiments covering different conditions / scenarios
- reduce latency where possible - deliver through regions & the edge; utilize AWS edge locations to reduce network latency;
- automate everything - apply infra-as-code; conduct automated load-tests, stress tests & failure simulation
Best practices:
- Selection - choice of appropriate compute, storage, network & database services to support the application workload; auto-scaling for compute; EBS & S3 for storage; RDS & DynamoDB for database; Route 53, VPC & Direct Connect for network; services such as Amazon DynamoDB help achieve single-digit-millisecond latencies;
- Review - continuously keep abreast of technology updates; review the what's new section of the AWS website; a performance review process + refinements are necessary to tune workload components to deliver the best performance;
- Monitoring - Monitor system performance to identify degradation and remediate internal or external factors, such as the operating system or application load;
- understand five phases of monitoring -- generation, aggregation, real-time processing, storage & analytics
- look for trade-off between these phases
- active v/s passive monitoring (e.g. CloudWatch Events v/s CloudTrail logs)
- Tradeoffs - When architecting solutions, determining tradeoffs enables you to select an optimal approach. Often you can improve performance by trading consistency, durability, and space for time and latency.
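Trading space for time is the classic instance of this tradeoff: caching an expensive lookup spends memory to cut latency on repeat requests. A minimal sketch (the 50 ms sleep is a stand-in for a slow database / network call):

```python
from functools import lru_cache
import time

# Space-for-time tradeoff: cache results of an expensive lookup so repeat
# requests are served from memory instead of paying the slow call again.
@lru_cache(maxsize=1024)
def expensive_lookup(key: str) -> str:
    time.sleep(0.05)  # stand-in for a slow database / network call
    return key.upper()

t0 = time.perf_counter(); expensive_lookup("user-42"); cold = time.perf_counter() - t0
t0 = time.perf_counter(); expensive_lookup("user-42"); warm = time.perf_counter() - t0
print(f"cold={cold:.3f}s warm={warm:.6f}s")  # warm hit skips the 50 ms cost
```

The same reasoning applies at the architecture level - ElastiCache or CloudFront in front of an origin is this pattern, with consistency as the cost being traded.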
How do you select your compute solution?
The optimal compute solution for a workload varies based on application design, usage patterns, and configuration settings. Selecting the wrong compute solution for an architecture can lead to lower performance efficiency.
For container deployments, AWS offers AWS Fargate, a serverless compute engine for containers; with AWS Fargate, Amazon ECS & Amazon EKS workload deployments can be managed; a service mesh such as AWS App Mesh can be used to standardize service-to-service communication.
How do you select your storage solution?
The optimal storage solution for a system varies based on the kind of access method (block, file, or object), patterns of access (random or sequential), required throughput, frequency of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and durability constraints. Well architected systems use multiple storage solutions and enable different features to improve performance and use resources efficiently.
How do you select your database solution?
The optimal database solution for a system varies based on requirements for availability, consistency, partition tolerance, latency, durability, scalability, and query capability. Selecting the wrong database solution and features for a system can lead to lower performance efficiency.
How do you configure your networking solution?
The optimal network solution for a workload varies based on latency, throughput requirements, jitter, and bandwidth. Physical constraints, such as user or on-premises resources, determine location options. These constraints can be offset with edge locations or resource placement.
5. Cost optimization - the ability to avoid or eliminate under-utilized or unused system resources.
Design principles:
- define cost of services on cloud; adopt the consumption model based on the workload (refer my pricing blog)
- measure throughput by business outcomes & the efficiency with which the cloud platform has delivered while adhering to SLAs; analyze the business impact of cost & pricing based on outcomes
- refine areas where expenditure on resources is not necessary; optimal resource utilization helps reduce cost for compute, storage, data & network resources
Best practices:
- Cost-effective resources - use tools such as AWS Cost Explorer, Amazon CloudWatch & AWS Trusted Advisor to budget costs & monitor actual spend by resource utilization
- Cost Explorer, Amazon Athena & Amazon QuickSight help analyze the Cost & Usage Report (CUR); build cost + usage awareness across the organization
- AWS Budgets provides proactive notifications for cost & usage
- Matching supply & demand - use auto-scaling to match the resource demand; ensure queue or buffer are sized aptly to match the demand at runtime;
- Expenditure & usage awareness - establish policies & mechanisms to ensure costs incurred are mapped / attributed to the business objectives achieved; employ checks & balances; implement change control process to remove unused resources and operate with optimum required capacity;
- use AWS Cost Explorer - explore usage costs with granularity; create a Cost & Usage Report (CUR);
- AWS Glue & AWS Athena can be used to prepare data & perform analysis, using SQL to query the data; AWS QuickSight can be used to prepare complex visualizations;
- Optimize over time - apply cost optimization by usage; optimize data transfer costs; use Trusted Advisor; keep a regular watch on the news blog & the what's new section of the AWS website
References:
- AWS Architecture center
- AWS Shared responsibility model
- AWS Cloud Compliance
- AWS Well architected partner program
- AWS Well architected tool
- AWS Well architected homepage
- Amazon Builder's library; AWS Architecture examples
- Operational excellence pillar whitepaper
- Performance efficiency pillar whitepaper
- Reliability pillar whitepaper
- Security pillar whitepaper
- Cost optimization pillar whitepaper