Amazon Web Services (AWS)

Generic Topics

Storage options on AWS cloud

Storage solutions available on AWS are as follows:

  • Simple Storage Service (S3) - durable object storage service on cloud
  • Elastic File System (EFS) - network file storage on cloud
  • Elastic Block Store (EBS) - block storage volumes on EC2 instances
  • EC2 Instance Storage - temporary block storage volumes on EC2 instances
  • Storage Gateway - on-premises storage appliance that integrates with cloud storage
  • Snowball / Snowmobile - physically transport data from on-premises to the cloud

S3 usage patterns:

  1. store & distribute static web content & media; fast-growing websites, data-intensive, user-generated content such as video & photo sharing sites
  2. low-cost scalable solution to store static HTML files, images, videos, client-side scripts
  3. data store for large-scale analytics - financial transaction analysis, click-stream analytics, media transcoding; concurrent access to multiple computing nodes;
  4. durable storage & archival solution - back up & archive critical / cold data on storage drives with assured reliability & replication support;

Amazon Glacier - low-cost storage service (~$0.007 per GB / month); used as an archive store with high durability (eleven 9s, i.e. 99.999999999%); retrieving data from Glacier involves executing a retrieval job - typically takes 3-5 hrs via the API; used by the AWS Storage Gateway virtual tape library;
    • using multi-part upload, a single archive of up to 40 TB can be uploaded in separate parts - independently or in parallel; integrated with S3 via lifecycle management;
  • data is stored synchronously across multiple facilities to ensure reliability & durability;
  • systematic data-integrity checks are built-in, with automatic self-healing;
  • storage is elastic & scalable; one archive can store up to 40TB of data;
  • uses server-side encryption for data-at-rest; data is encrypted with the 256-bit AES standard; customer-managed keys are supported;
  • compliance rules can be set on Glacier vaults for long-term record retention & regulatory compliance;
    • vault lock policies can be used to protect data from changes / deletion e.g. policies such as 'undeletable records' / 'time-based data retention' protect archived data from future edits / accidental deletion
  • exposes REST API interfaces (with Java & .NET SDK support); an alternate access path is the 'lifecycle policy' management available with S3 - to transition data from S3 into the Glacier archive;
  • associated pricing / cost components - storage (per GB per month), data transfer out (per GB per month), requests (per 1000 upload / retrieval requests per month)
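The three cost components above can be combined into a quick bill estimate. A minimal sketch, using the ~$0.007/GB-month storage rate quoted above; the transfer-out and per-request rates below are illustrative placeholders I've assumed, not actual AWS prices - check the AWS pricing page for real numbers.

```python
# Sketch: estimate a monthly Glacier bill from the three cost components
# listed above. Only the storage rate comes from this note; the other two
# rates are assumed placeholders.

STORAGE_RATE_PER_GB = 0.007      # $/GB-month (figure quoted above)
TRANSFER_OUT_PER_GB = 0.09       # $/GB transferred out (assumed placeholder)
REQUEST_RATE_PER_1K = 0.05       # $ per 1,000 upload/retrieval requests (assumed)

def monthly_glacier_cost(stored_gb, transferred_out_gb, requests):
    storage = stored_gb * STORAGE_RATE_PER_GB
    transfer = transferred_out_gb * TRANSFER_OUT_PER_GB
    request_cost = (requests / 1000) * REQUEST_RATE_PER_1K
    return round(storage + transfer + request_cost, 2)

# e.g. 10 TB archived, 100 GB retrieved, 2,000 requests in a month:
print(monthly_glacier_cost(10_000, 100, 2000))  # → 79.1
```

Storage dominates for archive-heavy workloads, which is exactly the access pattern Glacier is priced for.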
Glacier organizes archives into vaults ("Glacier Vaults"):
  1. whatever is written into a vault should be secure - not modified or easily accessible
  2. for this, a "vault lock" can be generated and applied to a Glacier Vault
  3. a vault lock must be completed OR aborted within 24 hrs, failing which it is dissolved automatically
  4. vault lock policies / rules are different from IAM access - IAM only controls access to Glacier vaults; WORM-based vault lock policies can enforce compliance controls & prevent documents from future edits / accidental deletion;
  5. further, the maximum archive size in a Glacier vault is 40 TB; archives are immutable once uploaded
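The ~40 TB archive ceiling falls out of the multi-part upload rules: Glacier (as I understand its API) requires part sizes that are a power-of-two number of MiB, from 1 MiB up to 4 GiB, with at most 10,000 parts per upload. A minimal sketch of picking the smallest valid part size under those assumed constraints:

```python
# Sketch: choose the smallest valid multi-part upload part size for a
# Glacier archive. Constraints assumed: power-of-two MiB part sizes,
# 1 MiB - 4 GiB, max 10,000 parts (hence 4 GiB * 10,000 ≈ 40 TB max).

MiB = 1024 ** 2
MAX_PARTS = 10_000

def choose_part_size(archive_bytes):
    part = 1 * MiB
    while part * MAX_PARTS < archive_bytes:
        part *= 2                      # stay on power-of-two boundaries
        if part > 4096 * MiB:
            raise ValueError("archive exceeds the ~40 TB Glacier limit")
    return part

print(choose_part_size(50 * 1024**3) // MiB)   # 50 GiB archive → 8 MiB parts
```

Smaller parts mean cheaper retries on failure; larger parts mean fewer requests, which matters since requests are billed per 1,000.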

------------------------------------------------------------------------------------------------------------------------

Amazon EFS - elastic & highly available network file system (NFS v4 & v4.1 supported), available as-a-service; NFS v4.1 brings performance & parallelism benefits within a region; storage data (& metadata) is stored across multiple AZs;

  • EFS is a network mount - an encapsulation over NFS, the distributed file system protocol originally created by Sun Microsystems;
    • NFS file systems were traditionally shared across machines via mounted drives;
    • this was the standard way to share files across network drives;
    • note that file transfers over EFS are more expensive when compared to S3 / EBS;
  • access to file systems & objects are controlled via read/write/execute permissions for users & group IDs - similar to standard unix-style;
  • EFS supports multi-AZ distributed file sharing - a file system is available across multiple AZs within a region;
  • used for multi-threaded applications which require substantial levels of aggregate throughput & I/O per second (IOPS);
  • distributed file system design - involves a small latency overhead per I/O operation, gets amortized over large amounts of data; hence EFS is ideal for growing datasets with larger files that need performance & multi-client access;
  • 2 performance modes are available for EFS - general purpose & max I/O; workloads exceeding 7,000 file operations per second suit max I/O performance mode; this is when 100s-1000s of EC2 instances access the file system;
  • EFS is optimized to burst - provide higher levels of throughput for shorter intervals of time; credit system determines when an EFS file system can burst - each file system earns burst credits at a baseline rate, accumulated & used during burst period;
  • EFS stores file copies across multiple AZs in a region - thereby durable & available;
  • storage is scalable - capacity allocated / released dynamically, as needed; 
  • file access can be controlled via security groups, API access controls along with users, groups & permissions;
    • mount targets & tags are referred to as sub-resources;
    • IAM policies can be used to assign & restrict access to cloud file systems;
  • a word of caution when mounting from on-premises systems:
    • needs very fast & stable connectivity such as Direct Connect to on-premises, which adds to the cost factor;
    • NFS is not considered secure, hence a secure tunnel OR VPN connection is needed;
    • an alternative for on-premises transfers is AWS DataSync;
    • related cost comparison: EFS is ~3x costlier than EBS & ~20x costlier than S3; certain NFS features aren't supported on EFS;
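The burst-credit mechanism above can be sketched numerically. A minimal model, assuming the 50 MiB/s-per-TiB baseline rate from the EFS documentation of this era (that figure is not stated in this note - treat it as illustrative):

```python
# Sketch of the EFS burst-credit mechanism: credits accrue at a baseline
# rate proportional to file-system size and are spent whenever throughput
# exceeds that baseline. Baseline rate is an assumed illustrative figure.

BASELINE_MIBPS_PER_TIB = 50.0   # assumed, not quoted in this note

def simulate_credits(size_tib, usage_mibps, seconds, credits_mib=0.0):
    """Return remaining burst credits (MiB) after `seconds` of steady usage."""
    baseline = size_tib * BASELINE_MIBPS_PER_TIB
    credits_mib += (baseline - usage_mibps) * seconds  # earn when idle, spend when bursting
    return max(credits_mib, 0.0)

# A 1 TiB file system idling for an hour banks credits...
print(simulate_credits(1.0, 0.0, 3600))  # → 180000.0 MiB of credit earned
```

This is why EFS suits spiky workloads: quiet periods bank credits that fund short, high-throughput bursts later.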

------------------------------------------------------------------------------------------------------------------------

Amazon EBS - elastic block store volumes for block storage; EBS volumes are network attached storage volumes - persisted independent of the lifecycle of an EC2 instance; multiple EBS volumes can be attached to a single EC2 instance similar to attaching external hard-drives; EBS is useful to create snapshots, backup your EC2 instance volume and restore the same across different availability zone(s) OR region(s);

  • volume size ranges from 1 GiB to 16 TiB - depending on the volume type;
  • using snapshot & backup, volumes can be restored on a different instance across AZ / regions - volumes are hence useful for multi-regional expansion;
  • EBS volumes come in 2 families - SSD (solid-state drive) & HDD (hard-disk drive); SSD suits transactional workloads while HDD suits throughput-intensive workloads e.g. data warehouses, big data processing, etc.
  • for high-throughput workloads, HDD per-volume throughput can range between 250-500 MiB/s; max IOPS across HDD & SSD is capped at 65,000 IOPS per instance, with max throughput per instance up to 1,250 MiB/s
    • SSD-gp2 bursts up to 3,000 IOPS for extended periods, with a baseline performance of 3 IOPS/GiB and a max of 10,000 IOPS; gp2 volumes range 1 GiB-16 TiB; the throughput limit is flat for volumes <= 170 GiB; for > 170 GiB, the limit increases at 768 KiB/s per GiB up to a max of 160 MiB/s;
      • suited for transactional workloads;
      • volume that balances price & performance for variety of transactional workloads;
    • SSD-io1 delivers high performance for I/O-intensive workloads with small I/O sizes; range 4 GiB-16 TiB; up to 20,000 IOPS per volume can be configured; the max ratio of provisioned IOPS to requested volume size is 50:1;
      • highest performance SSD volume designed for mission-critical applications;
    • HDD-st1 (throughput optimized) - good for frequently accessed, throughput-intensive workloads; range 500 GiB-16 TiB; can burst up to 250 MiB/s per TiB, with a baseline throughput of 40 MiB/s per TiB and a max throughput of 500 MiB/s per volume;
      • low-cost HDD volume designed for frequently accessed throughput-intensive workloads;
      • suited for big data & data warehouses;
    • Cold HDD-sc1 - lowest cost per GiB of the EBS volume types; good for infrequently accessed workloads with large I/O sizes; range 500 GiB-16 TiB; can burst up to 80 MiB/s per TiB, with a baseline throughput of 12 MiB/s per TiB and a max throughput of 250 MiB/s per volume;
      • lowest-cost HDD volume designed for less frequently accessed workloads; file servers;
    • EBS Magnetic [standard] - previous-generation HDD; suited for infrequently accessed workloads; range 1 GiB-1 TiB; max IOPS per volume 40-200;
  • other devices on EC2 performing network I/O operations can impact performance, given EBS volumes are network-attached devices;
  • latest volumes are available as EBS-optimized instances to deliver dedicated throughput between EC2 instances & attached EBS volumes;
  • features include incremental snapshots & point-in-time backups; annual failure rate is b/w 0.1-0.2%;
    • incremental snapshots are versioned; deleting one is a soft delete - blocks still needed by other snapshots are retained;
    • it's possible to restore a volume to the point-in-time of any retained snapshot version;
    • however, once a snapshot is deleted, it's NOT possible to restore to that specific version;
  • ensures data is encrypted at rest, on the volume & in-transit between EC2 instance to volume; encryption keys can be AWS managed (via KMS) or customer-managed;
  • exposes REST API for access, AWS CLI is an alternative way to access EBS volumes; 
  • pricing standpoint - data transferred via EBS volumes across regions is charged;
  • use AWS KMS to encrypt EBS volumes; EBS lifecycle manager can take scheduled snapshots of any EC2 volume, irrespective of instance state --> even terminated instances;
  • check out the RAID options available with EBS; snapshotting your EC2 instance volumes and backing them up to S3 (or similar) is a very good option;
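The gp2 baseline rule above is simple enough to compute directly. A minimal sketch: 3 IOPS per GiB, capped at the 10,000 IOPS figure quoted in this note, and floored at 100 IOPS (a detail from the gp2 docs that this note doesn't quote - treat it as an assumption):

```python
# Sketch of the gp2 baseline-IOPS rule: 3 IOPS/GiB, floored at 100 IOPS
# (assumed from the gp2 docs), capped at the 10,000 IOPS figure above.
# Small volumes can additionally burst to 3,000 IOPS on top of baseline.

def gp2_baseline_iops(size_gib):
    return min(max(3 * size_gib, 100), 10_000)

for size in (8, 100, 334, 5000):
    print(size, "GiB ->", gp2_baseline_iops(size), "IOPS")
# 8 GiB -> 100 IOPS, 100 GiB -> 300 IOPS, 334 GiB -> 1002 IOPS, 5000 GiB -> 10000 IOPS
```

One practical consequence: oversizing a gp2 volume purely to raise its baseline IOPS was a common tuning trick before io1 provisioning was justified.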

AWS Elastic Container Service (ECS) security - concept of "task roles" --> roles allow access on a per-task basis --> S3 / DynamoDB / etc.; container services contain our applications, possibly databases; these integrate with "managed services" on AWS OR outside, hence the need to secure your container services in compliance with "IAM roles";

Amazon EC2 instance storage - temporary block storage; EPHEMERAL by nature, i.e. not persistent; not available on certain EC2 instance types (t-series, c-series, etc.); good for temporary data - buffers, caches, in-memory processing across a load-balanced fleet of EC2 instances; instance volumes are volatile, temporary cache / in-memory stores, which live only while an EC2 instance is active; given instance storage is template-based, the number of available options to create such an EC2 instance is also limited;

Instance store-backed instances cannot be stopped; they can only be rebooted or terminated;

In the event of an instance failure for an instance store-backed instance, all data is LOST; however, additional EBS volumes can be added to such an EC2 instance after creation - by this, an instance backed by instance storage can also be attached to persistent volumes;

While selecting an AMI, we choose based on:

  • region / AZ, operating system, architecture (32 / 64 bit), launch permissions;
  • storage for the root device (root device volume) - can be EPHEMERAL (known as instance storage) OR EBS-backed storage volumes;

 All AMIs are categorized as either backed by Amazon EBS OR backed by instance store;

  • For EBS volumes - root device for an instance launched from an AMI is the Amazon EBS volume created from the EBS snapshot;
  • For instance storage volumes - root device for an instance launched from an AMI is an instance store volume created "from a template stored in Amazon S3";
------------------------------------------------------------------------------------------------------------------------ 

AWS Storage Gateway - connects an on-premises software appliance with cloud-based storage; provides seamless storage integration between the on-premises IT environment & AWS storage infrastructure;

  • provides low-latency performance by maintaining frequently accessed data on-premises while securely storing encrypted data in S3 / Glacier;
  • gateways are virtual machines (VMs) on hypervisors that provide local storage resources, backed by S3 & Glacier; the storage gateway can be downloaded as a virtual image (VM) - installed on a host in your data center or on an EC2 instance;
  • can be created as gateway-cached volumes, gateway stored volumes, gateway-virtual tape library (VTL), mounted via iSCSI interface;
  • typically used as local storage volumes to house on-premises data backed by S3 and migrate to the cloud gradually; it's an easier option to synchronize data from on-premises to the cloud, with options for fail-over & recovery;
  • storage volumes up to 32 TiB can be created & mounted as iSCSI devices from on-premises app servers; gateways configured for cached volumes can support up to 20 volumes with total storage of 150 TiB;
  • each gateway configured for gateway-stored volumes can support up to 12 volumes & a total storage volume of 12 TiB;
    • data written to gateway-stored volumes is asynchronously backed up to S3 in the form of EBS snapshots
  • the gateway allows offline archival onto virtual tape drives (VTL); each VTL can hold up to 1,500 virtual tapes with a maximum aggregate capacity of 150 TiB;
Modes of operation:
  1. File Gateway -- exposed interface via NFS / SMB share drives, to store objects into S3 via mount point;
  2. Volume gateway stored mode <-> Gateway stored volumes --> iSCSI interface ==> async replication of on-prem data into S3;
  3. Volume gateway cached mode <-> Gateway cached volumes --> iSCSI interface ==> frequently accessed data is cached locally on-premises, backed by S3-stored data;
  4. Tape gateway <-> Gateway virtual tape library --> iSCSI interface ==> virtual media exchanger & tape library for use with existing backup software;

------------------------------------------------------------------------------------------------------------------------

AWS Snowball - offline data migration tool, used to migrate very large datasets onto the cloud; the device / appliance is purpose-built for data transfer - looks like a suitcase, easy to carry & transport / ship;

  • available in 80 TB & 50 TB sizes (50 TB only in the US); weighs < 50 pounds; water-resistant, dust-proof;
  • data transfer is performed on high-speed internal network, bypassing the internet; thereby safer; ideal to transfer petabytes of data in & out of AWS cloud securely;
  •  common use cases are - cloud migration, data center decommission, content distribution, etc.
  • parallelization - using multiple Snowball devices per data center - also helps improve data-transfer performance; after import into AWS, data is stored in S3 object storage, conforming to the same availability & reliability;
  • for Snowball, AWS KMS protects the encryption keys used to protect the data on each appliance; 
  • Snowball is HIPAA compliant, so PII data can be moved into & out of AWS;
  • Snowball exposes import/export API for access via snowball client;
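The break-even for Snowball versus transferring over the wire is simple arithmetic. A back-of-the-envelope sketch; the uplink speed and utilization figures are illustrative assumptions, not AWS guidance:

```python
# Sketch: days needed to push a dataset over a sustained network uplink,
# to decide when shipping a Snowball beats the wire. Bandwidth and
# utilization figures are assumed for illustration.

def transfer_days(data_tb, uplink_mbps, utilization=0.8):
    bits = data_tb * 1e12 * 8                      # decimal TB -> bits
    seconds = bits / (uplink_mbps * 1e6 * utilization)
    return seconds / 86_400

# Moving 80 TB over a 100 Mbps link at 80% utilization:
print(round(transfer_days(80, 100), 1))  # → 92.6 days - shipping wins
```

At roughly three months for a single 80 TB device over a 100 Mbps link, even week-long shipping turnaround is decisively faster.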

------------------------------------------------------------------------------------------------------------------------ 

Amazon CloudFront - content-delivery service; speeds up distribution of websites' static & dynamic content while making it globally available across locations; CloudFront also uses routing strategies such as geo-location & geo-proximity to improve access performance;

  • CloudFront caches dynamic HTML / PHP pages, static images, video streams, audio media files, & software downloads;
    • content is cached across multiple layers at each edge location; this way, the same data requested by multiple consumers is served from cache;
  • applications with static web content are well suited for CloudFront;
  • designed for low-latency, high-bandwidth content delivery; routes users to the nearest edge location - to reduce the number of network hops;

------------------------------------------------------------------------------------------------------------------------

Relational Database Service (RDS) - choose a database engine with version & licensing model, a database instance class, and Multi-AZ deployment; being a managed service, advanced-level access to tables & the database host is restricted;

What is a database instance? Supported engines: MS SQL Server, Oracle, MariaDB, MySQL, PostgreSQL and Aurora

  • basic building block of Amazon RDS
  • isolated database environment
  • can contain multiple user-created databases
database instance types:
  • standard database instance type 'm' series
    • general purpose
    • 2-64 vCPUs, 8-256 GiB RAM
    • high performance networking for CPU intensive workload
  • memory optimized instance type 'r' series
    • 2-64 vCPUs, 16-500 GiB RAM
    • high performance networking
    • optimized for high memory intensive workloads
  • burstable performance DB instances type 't' series
    • 1-32 vCPUs, 2-8 GiB RAM
    • good for test setups / low workloads

 ------------------------------------------------------------------------------------------------------------------------

Amazon WorkDocs - the AWS equivalent of Dropbox or Google Drive; secure, fully managed; integrates with AD / SSO; web, mobile & native clients [Mac & Windows, no Linux clients]; HIPAA, PCI & ISO compliant; an SDK is available to integrate WorkDocs;

Amazon Neptune - fully managed graph database;

  • supports open graph APIs for both Gremlin & SPARQL
  • allows storing interrelationships & querying them effectively
  • nodes (vertices) & edges - edges link different nodes
  • concepts: graph traversal, properties / attributes, labels

Amazon ElastiCache - fully managed implementations of 2 in-memory data stores - Redis & Memcached; an in-memory, key-value data store;

  • push-button scalability for in-memory writes & reads; non-persistent store / cache;
  • billed by node size & hours of use;
  • comparing Redis v/s Memcached: Redis is multi-AZ and offers high-end features like replication, availability, advanced data types, ranking / sorting of data sets, pub/sub capabilities, persistence, etc.;
  • Memcached is a simple caching service in comparison and offers simple features only;
  • besides Memcached & Redis, "CloudFront" and "API Gateway" also offer caching; DAX is also a caching service, but ONLY for DynamoDB;
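To make the key-value-with-expiry model concrete, here is a minimal in-process sketch of what ElastiCache provides as a service. A real deployment would talk to a managed Redis or Memcached endpoint via a client library (e.g. redis-py); this local dict is only an illustration of the semantics:

```python
import time

# Minimal sketch of a key-value store with TTL expiry - the core model
# that Redis/Memcached (and hence ElastiCache) expose. Illustration only;
# not how you would actually talk to an ElastiCache endpoint.

class TTLCache:
    def __init__(self):
        self._store = {}                      # key -> (value, expiry timestamp)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.time() + ttl_seconds)

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, expires = item
        if time.time() >= expires:            # lazily evict expired entries
            del self._store[key]
            return default
        return value

cache = TTLCache()
cache.set("session:42", {"user": "alice"}, ttl_seconds=30)
print(cache.get("session:42"))  # → {'user': 'alice'}
```

The "non-persistent" caveat above falls out naturally: everything lives in memory, so a node restart empties the store - which is why ElastiCache fronts a durable database rather than replacing one.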

------------------------------------------------------------------------------------------------------------------------

Amazon FSx for Windows & Amazon FSx for Lustre

Amazon FSx for Windows File Server is built on Windows Server and is used to move Windows-based applications requiring file storage to AWS; used to work with Microsoft services like AD, IIS, SharePoint, etc.; primarily good for Windows Server Message Block (SMB)-based file services - unlike EFS, which is a Network File System (NFS) for Linux / Unix-based EC2 file-server instances;

Lustre - fully managed file system optimized for compute-intensive workloads - HPC, machine learning, electronic design automation (EDA), etc.; Amazon FSx for Lustre can run a file system that processes massive amounts of data at up to 100s of Gbps of throughput, millions of IOPS & sub-millisecond latencies;

 

------------------------------------------------------------------------------------------------------------------------

Amazon Elastic MapReduce (EMR) - consists of a master node, core nodes & task nodes; by default, logs are stored on the master node; log archival to S3 can be configured at 5-minute intervals from the master node;

Elastic MapReduce is used for big data processing - vast amounts of data using open-source tools; e.g. Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi & Presto; used to run analyses on petabytes of data at less than half the cost of traditional on-premises solutions, with performance up to 3x faster than standard Apache Spark;

Amazon Athena - overlaid on S3, based on Presto; queries raw objects as they sit in an S3 bucket; data lives on S3 and can be queried in place; a further use case is combining S3 data via joins / unions with data from other data sources;

Amazon QLDB (Quantum Ledger Database) - an immutable & transparent JOURNAL-AS-A-SERVICE, without having to set up OR maintain an entire blockchain framework; a centralized design - it's not a fully managed blockchain; it's good as an immutable ledger, e.g. a tally book;

Amazon Managed Blockchain - this one is a fully managed, replicated, available, Hyperledger-based blockchain data store; notably, Managed Blockchain is based on QLDB technology, where the primary chain is immutable; Managed Blockchain has the ability to integrate with blockchain nodes external to AWS;

Amazon Timestream - a time-series database; time-series event streams can be stored here; real-time event streaming & basic analytics is a good use case; IoT streaming is another;

Amazon DocumentDB - with MongoDB compatibility; multi-AZ HA, scalable, integrated with KMS, backed up to S3;

Amazon Elasticsearch Service [not ElastiCache] - referred to as ES; search engine + document store; useful for analytics & search; the integrated stack is referred to as ELK = Elasticsearch + Logstash + Kibana; supports a variety of intake sources for storage + search & analytics - Logstash, Kinesis Data Firehose, IoT Greengrass, CloudWatch;

------------------------------------------------------------------------------------------------------------------------  
