Storage solutions available on AWS are as follows:
- Simple Storage Service (S3) - durable object storage service on cloud
- Elastic File System (EFS) - network file storage on cloud
- Elastic Block Store (EBS) - block storage volumes on EC2 instances
- EC2 Instance Storage - temporary block storage volumes on EC2 instances
- Storage Gateway - on-premises storage appliance that integrates with cloud storage
- Snowball / Snowmobile - physically transport data from on-premises to the cloud
S3 usage patterns:
- store & distribute static web content & media; fast-growing websites, data-intensive, user-generated content such as video & photo sharing sites
- low-cost scalable solution to store static HTML files, images, videos, client-side scripts
- data store for large-scale analytics - financial transaction analysis, click-stream analytics, media transcoding; concurrent access to multiple computing nodes;
- Amazon S3 Glacier - durable, low-cost storage & archival solution; back up & archive critical / cold data with ensured reliability & replication support;
- using multipart upload, archives of up to 40 TB can be uploaded in separate parts - independently or in parallel; integrated with S3 via lifecycle management;
- data is stored synchronously across multiple facilities to ensure reliability & durability;
- systematic integrity checks are built in, with automatic self-healing;
- storage is elastic & scalable; a single archive can store up to 40 TB of data;
- uses server-side encryption for data-at-rest; the block cipher used is 256-bit AES; customer-managed keys are supported;
- compliance rules can be set on Glacier vaults for long-term record retention & regulatory compliance;
- vault lock policies can be used to protect data from changes / deletion, e.g. policies such as 'undeletable records' / 'time-based data retention' protect archived data from future edits / accidental deletion
- exposes REST API interfaces [with Java & .NET SDK support]; an alternative access path is the 'lifecycle policy' management available with S3 - to transition data from S3 into a Glacier archive;
- associated pricing / cost components - storage (per GB per month), data transfer out (per GB per month), requests (per 1,000 upload / retrieval requests per month)
- whatever is written into the vault should be secure - not modifiable or easily accessible
- so a "vault lock" can be generated and applied to a Glacier vault
- a vault lock must be completed OR aborted within "24 hrs", failing which the in-progress lock expires automatically
- vault lock policies OR rules are different from IAM access controls - IAM policies only govern access to Glacier vaults; WORM-based vault lock policies enforce compliance controls & protect documents from future edits / accidental deletion;
- further, the maximum archive size in a Glacier vault is 40 TB; an archive is immutable once uploaded
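The vault lock notes above can be sketched in code. Below is a minimal, hedged example of a time-based retention policy: it denies glacier:DeleteArchive until an archive is a year old, using the documented glacier:ArchiveAgeInDays condition key. The vault name is a placeholder; the actual two-step lock call needs boto3 & credentials and is shown only as a comment.

```python
import json

VAULT_NAME = "compliance-archive"  # hypothetical vault name

def build_retention_policy(retention_days: int) -> dict:
    """Vault lock policy enforcing time-based retention:
    deny DeleteArchive until the archive is retention_days old."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "deny-early-delete",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "glacier:DeleteArchive",
            "Resource": f"arn:aws:glacier:*:*:vaults/{VAULT_NAME}",
            "Condition": {
                "NumericLessThan": {
                    "glacier:ArchiveAgeInDays": str(retention_days)
                }
            }
        }]
    }

policy = build_retention_policy(365)
print(json.dumps(policy, indent=2))

# Two-step lock (sketch; requires boto3 and credentials):
#   lock_id = glacier.initiate_vault_lock(
#       vaultName=VAULT_NAME,
#       policy={"Policy": json.dumps(policy)})["lockId"]
#   glacier.complete_vault_lock(vaultName=VAULT_NAME, lockId=lock_id)
# If not completed within 24 hrs, the in-progress lock expires.
```

Note the Deny applies to everyone ("Principal": "*"), which is what makes the records effectively undeletable during the retention window, regardless of IAM permissions.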
------------------------------------------------------------------------------------------------------------------------
Amazon EFS - elastic & highly available network file system (NFS v4.0 & v4.1 supported), available as-a-service; NFS v4.1 brings performance & parallelism benefits within a region; storage data (& metadata) is stored across multiple AZs;
- EFS is a network mount over a managed implementation of NFS - the distributed file system protocol originally created by Sun Microsystems;
- NFS file shares were earlier mounted as network drives, shared across zones;
- this was the standard way to share files across a network;
- note that storage on EFS is costlier / more expensive when compared to S3 / EBS;
- access to file systems & objects are controlled via read/write/execute permissions for users & group IDs - similar to standard unix-style;
- EFS supports multi-AZ distributed file sharing - a file system is available across all AZs within a region (not across regions);
- used for multi-threaded applications which require substantial levels of aggregate throughput & I/O per second (IOPS);
- distributed file system design - involves a small latency overhead per I/O operation, gets amortized over large amounts of data; hence EFS is ideal for growing datasets with larger files that need performance & multi-client access;
- 2 performance modes are available for EFS - general purpose & max I/O; workloads exceeding 7,000 file operations per second suit max I/O performance mode - typically when 100s-1000s of EC2 instances access the file system;
- EFS is optimized to burst - provide higher levels of throughput for shorter intervals of time; credit system determines when an EFS file system can burst - each file system earns burst credits at a baseline rate, accumulated & used during burst period;
- EFS stores file copies across multiple AZs in a region - thereby durable & available;
- storage is scalable - capacity allocated / released dynamically, as needed;
- file access can be controlled via security groups, API access controls along with users, groups & permissions;
- mount targets & tags are referred to as sub-resources;
- IAM policies can be used to assign & restrict access to cloud file systems;
- A word of caution when mounted from on-premises systems:
- needs very fast & stable connectivity (like Direct Connect) to on-premises, hence adds to the cost factor;
- NFS is not considered secure on its own, hence needs a secure tunnel OR VPN connection;
- an alternative for moving data from on-premises is AWS DataSync;
- related cost comparison: EFS is ~3x costlier than EBS & ~20x costlier than S3; certain NFS features aren't supported on EFS;
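The burst-credit model described above can be made concrete with a little arithmetic. This is a sketch of AWS's published bursting-throughput figures (50 MiB/s baseline & 100 MiB/s burst per TiB stored); treat the constants as illustrative, since the model is generation-specific.

```python
# EFS Bursting mode: credits accrue at the baseline rate and are
# spent whenever throughput exceeds baseline.
BASELINE_MIBPS_PER_TIB = 50.0   # published baseline rate
BURST_MIBPS_PER_TIB = 100.0     # published burst ceiling

def baseline_throughput(size_tib: float) -> float:
    """Baseline throughput (MiB/s) for a file system of size_tib."""
    return size_tib * BASELINE_MIBPS_PER_TIB

def hours_of_burst(size_tib: float, credit_balance_mib: float) -> float:
    """How long an accumulated credit balance sustains a full burst."""
    # Credits burn at (burst - baseline) MiB/s while bursting.
    burn_rate = (BURST_MIBPS_PER_TIB - BASELINE_MIBPS_PER_TIB) * size_tib
    return credit_balance_mib / burn_rate / 3600

# A 1 TiB file system holding a 2.1 TiB credit balance:
print(baseline_throughput(1.0))            # 50.0 MiB/s baseline
print(hours_of_burst(1.0, 2.1 * 1024**2))  # ~12 hours of burst
```

This is why small file systems with bursty workloads can stall: a tiny file system earns credits slowly, so once the balance is spent it is pinned to its (small) baseline.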
------------------------------------------------------------------------------------------------------------------------
Amazon EBS - elastic block store volumes for block storage; EBS volumes are network attached storage volumes - persisted independent of the lifecycle of an EC2 instance; multiple EBS volumes can be attached to a single EC2 instance similar to attaching external hard-drives; EBS is useful to create snapshots, backup your EC2 instance volume and restore the same across different availability zone(s) OR region(s);
- volume size ranges from 1 GiB to 16 TiB - depending on the volume type;
- using snapshot & backup, volumes can be restored on a different instance across AZ / regions - volumes are hence useful for multi-regional expansion;
- EBS volume types come in 2 families - SSD (solid-state drive) & HDD (hard-disk drive); SSD suits transactional workloads while HDD suits throughput-intensive workloads e.g. data warehousing, big data processing, etc.
- for high-throughput workloads, HDD per-volume throughput can range between 250-500 MiB/s; max IOPS per instance (across HDD & SSD) is capped at 65,000, with max throughput per instance up to 1,250 MiB/s
- SSD-gp2 bursts up to 3,000 IOPS for extended periods, with a 3 IOPS/GiB baseline performance; max 10,000 IOPS; gp2 can range 1 GiB-16 TiB; throughput limit is 128 MiB/s for volumes <= 170 GiB; for > 170 GiB, the limit increases at 768 KiB/s per GiB to a max of 160 MiB/s;
- suited for transactional workloads;
- volume that balances price & performance for variety of transactional workloads;
- SSD-io1 delivers high performance for I/O-intensive workloads with small I/O sizes; range - 4 GiB-16 TiB; up to 20,000 IOPS per volume can be configured; the ratio of provisioned IOPS to volume size (GiB) is capped at 50:1;
- highest performance SSD volume designed for mission-critical applications;
- HDD-st1 (throughput optimized) - good for frequently accessed, throughput-intensive workloads; range 500 GiB-16 TiB; can burst up to 250 MiB/s per TiB, with baseline throughput of 40 MiB/s per TiB and max throughput of 500 MiB/s per volume;
- low-cost HDD volume designed for frequently accessed, throughput-intensive workloads;
- suited for big data & data warehouses;
- Cold HDD-sc1 - lowest cost per GiB of the EBS volume types; good for infrequently accessed workloads with large I/O sizes; range 500 GiB-16 TiB; can burst up to 80 MiB/s per TiB, with baseline throughput of 12 MiB/s per TiB and max throughput of 250 MiB/s per volume;
- lowest-cost HDD volume designed for less frequently accessed workloads; file servers;
- EBS Magnetic [standard] - previous generation HDD; suited for infrequently accessed workloads; range 1 GiB-1 TiB; max IOPS per volume 40-200;
- other devices on EC2 performing network I/O operations can impact performance, given EBS volumes are network-attached devices;
- newer instance types are available as EBS-optimized instances, delivering dedicated throughput between EC2 instances & their attached EBS volumes;
- features include incremental snapshots - point-in-time backups; annual failure rate is between 0.1-0.2%;
- incremental snapshots act as versions - a volume can be restored from any specific remaining snapshot;
- deleting a snapshot removes only the data unique to it, so the other snapshots still restore correctly;
- however, once a snapshot is deleted, it's NOT possible to restore to that specific version;
- ensures data is encrypted at rest, on the volume & in-transit between EC2 instance to volume; encryption keys can be AWS managed (via KMS) or customer-managed;
- exposes REST API for access, AWS CLI is an alternative way to access EBS volumes;
- pricing standpoint - data transferred via EBS volumes across regions is charged;
- Use AWS KMS to encrypt EBS volumes; EBS lifecycle manager can take scheduled snapshots of any EC2 volume, irrespective of the state --> even terminated instances;
Check out the RAID options available with EBS; it's also a very good option to snapshot your EC2 instance volumes and back them up on S3 OR similar;
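The gp2 figures quoted above reduce to two small formulas, sketched here using the numbers as stated in these notes (current AWS limits may differ):

```python
def gp2_baseline_iops(size_gib: int) -> int:
    # 3 IOPS per GiB, floor of 100 IOPS, ceiling of 10,000 IOPS
    # (figures as quoted in these notes)
    return min(10_000, max(100, 3 * size_gib))

def gp2_throughput_mibps(size_gib: int) -> float:
    # 128 MiB/s up to 170 GiB; beyond that the limit grows at
    # 768 KiB/s per GiB, capped at 160 MiB/s
    if size_gib <= 170:
        return 128.0
    return min(160.0, size_gib * 768 / 1024)

print(gp2_baseline_iops(100))     # 300  (baseline for a 100 GiB volume)
print(gp2_throughput_mibps(214))  # 160.0 (cap reached around 214 GiB)
```

A useful consequence: a volume must be at least ~3,334 GiB before its baseline (3 IOPS/GiB) matches the 10,000 IOPS ceiling, which is why small gp2 volumes rely on the 3,000 IOPS burst.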
Amazon EC2 instance storage - temporary block storage; EPHEMERAL by nature - not persistent storage; not available on certain EC2 instance types (e.g. t-series & certain c-series); good for temporary data - buffers, caches, in-memory processing across a load-balanced fleet of EC2 instances; instance store volumes are volatile, temporary cache / in-memory stores, which live only while the EC2 instance is active; given instance storage is tied to the instance type's template, the number of options available to create such an EC2 instance is also limited;
Instance store-backed instances cannot be stopped - they can only be rebooted OR terminated;
In the event of an instance failure on an instance store-backed instance, all data is LOST; however, additional EBS volumes can be attached to such an EC2 instance after launch; this way, an EC2 instance backed by instance storage can still be attached to a persistent volume;
While selecting an AMI, we choose based on:
- region / AZ, operating system, architecture (32 / 64 bit), launch permissions;
- storage for the root device (root device volume) - can be EPHEMERAL (in which case it's known as instance storage) OR an EBS-backed storage volume;
All AMIs are categorized as either backed by Amazon EBS OR backed by instance store;
- For EBS volumes - root device for an instance launched from an AMI is the Amazon EBS volume created from the EBS snapshot;
- For instance storage volumes - root device for an instance launched from an AMI is an instance store volume created "from a template stored in Amazon S3";
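The EBS-backed vs instance store-backed distinction above surfaces as the RootDeviceType field of DescribeImages. A hedged sketch of classifying AMIs by that field - the sample records are illustrative, not real AMIs:

```python
def split_by_root_device(images):
    """Partition DescribeImages-style records by root device type."""
    ebs_backed = [i for i in images if i["RootDeviceType"] == "ebs"]
    instance_store = [i for i in images
                      if i["RootDeviceType"] == "instance-store"]
    return ebs_backed, instance_store

sample = [
    {"ImageId": "ami-11111111", "RootDeviceType": "ebs"},
    {"ImageId": "ami-22222222", "RootDeviceType": "instance-store"},
]
ebs, istore = split_by_root_device(sample)
print([i["ImageId"] for i in ebs])  # ['ami-11111111']

# With boto3 the same field comes from:
#   ec2 = boto3.client("ec2")
#   ec2.describe_images(Owners=["self"])["Images"][n]["RootDeviceType"]
```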
AWS Storage Gateway - connects an on-premises software appliance with cloud-based storage; provides seamless storage integration between the on-premises IT environment & AWS storage infrastructure;
- provides low-latency performance by maintaining frequently accessed data on-premises while securely storing encrypted data in S3 / Glacier;
- the gateway is a virtual machine (VM) running on a hypervisor, providing local storage resources backed by S3 & Glacier; it can be downloaded as a virtual image (VM) & installed on a host in the data center, or run on an EC2 instance;
- can be created as gateway-cached volumes, gateway stored volumes, gateway-virtual tape library (VTL), mounted via iSCSI interface;
- Typically used as local storage volumes to house on-premises data backed by S3 and migrate to cloud gradually; an easier option to synchronize data from on-premises to cloud, with options for fail-over & recovery;
- storage volumes up to 32 TiB can be created & mounted as iSCSI devices from on-premises app servers; gateways configured for cached volumes can support up to 20 volumes with total storage volumes of 150 TiB;
- each gateway configured for gateway-stored volumes can support up to 12 volumes & total storage volume of 12 TiB;
- data written to gateway-stored volumes is asynchronously backed up to S3 in the form of EBS snapshots
- the gateway also allows offline archival onto a virtual tape library (VTL); each VTL can hold up to 1,500 virtual tapes with a maximum aggregate capacity of 150 TiB;
- File Gateway -- exposed interface via NFS / SMB share drives, to store objects into S3 via mount point;
- Volume gateway stored mode <-> Gateway stored volumes --> iSCSI interface ==> async replication of on-prem data into S3;
- Volume gateway cached mode <-> Gateway cached volumes --> iSCSI interface ==> frequently accessed data is cached locally on-premises, backed by S3-stored data;
- Tape gateway <-> Gateway virtual tape library --> iSCSI interface ==> virtual media exchanger & tape library for use with existing backup software;
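The per-gateway limits quoted above can be turned into a quick capacity sanity check. A sketch using the figures as stated in these notes (cached mode: 20 volumes / 150 TiB aggregate; stored mode: 12 volumes / 12 TiB) - these limits are generation-specific and have since been raised, so treat the constants as illustrative:

```python
# Per-gateway limits as quoted in these notes (assumptions; dated).
LIMITS = {
    "cached": {"max_volumes": 20, "max_total_tib": 150},
    "stored": {"max_volumes": 12, "max_total_tib": 12},
}

def fits_gateway(mode: str, volume_sizes_tib: list) -> bool:
    """Check a proposed volume layout against the quoted limits."""
    lim = LIMITS[mode]
    return (len(volume_sizes_tib) <= lim["max_volumes"]
            and sum(volume_sizes_tib) <= lim["max_total_tib"])

print(fits_gateway("cached", [32, 32, 32]))  # True  (3 vols, 96 TiB)
print(fits_gateway("stored", [4, 4, 4, 4]))  # False (16 TiB > 12 TiB)
```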
------------------------------------------------------------------------------------------------------------------------
AWS Snowball - offline data migration appliance, used to migrate very heavy data chunks onto the cloud; the device is purpose-built for data transfer - looks like a suitcase, easy to carry & transport / ship;
- available in 80 TB & 50 TB sizes (50 TB only in US); weighs < 50 pounds; water-resistant, dust-proof;
- data transfer is performed on high-speed internal network, bypassing the internet; thereby safer; ideal to transfer petabytes of data in & out of AWS cloud securely;
- common use cases are - cloud migration, data center decommission, content distribution, etc.
- parallelization - using multiple Snowball devices per data center - also helps improve data-transfer performance; after import into AWS, data is stored in S3 object storage - conforming to its usual availability & reliability;
- for Snowball, AWS KMS protects the encryption keys used to protect the data on each appliance;
- being HIPAA eligible, PHI / PII data can be moved into & out of AWS
- Snowball exposes import/export API for access via snowball client;
------------------------------------------------------------------------------------------------------------------------
Amazon CloudFront - content-delivery service; speeds up distribution of websites' static & dynamic content, making it globally available across edge locations; CloudFront serves each request from the nearest edge location to improve access performance;
- CloudFront caches static content - images, video streams, audio media files & software downloads - and accelerates delivery of dynamic HTML / PHP pages;
- cache content across multiple layers at each edge location; by this way, same data requested by multiple consumers are served via cache;
- application with static web-content is well suited for CloudFront;
- designed for low-latency, high-bandwidth content delivery; routes users to the nearest edge location to reduce the number of network hops;
------------------------------------------------------------------------------------------------------------------------
Relational Database Service (RDS) - managed database service; choose a database engine & version, licensing model, DB instance class & Multi-AZ deployment; being a managed service, low-level / OS access to the database host is restricted;
what is a database instance..?
- basic building block of Amazon RDS
- isolated database environment
- can contain multiple user-created databases
- supported engines: MS SQL Server, Oracle, MariaDB, MySQL, PostgreSQL & Aurora
- standard DB instance type - 'm' series
- general purpose
- 2-64 vCPUs, 8-256 GiB RAM
- high-performance networking for CPU-intensive workloads
- memory optimized instance type 'r' series
- 2-64 vCPUs, 16-500 GiB RAM
- high performance networking
- optimized for memory-intensive workloads
- burstable performance DB instance type - 't' series
- 1-8 vCPUs, 0.5-32 GiB RAM
- good for test setups / low workloads
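The knobs above (engine, instance class, Multi-AZ) map directly onto the parameters of the RDS CreateDBInstance API. A hedged sketch - the identifier and credentials are placeholders, and the boto3 call itself is shown only as a comment:

```python
# Parameters mirroring the concepts above; values are placeholders.
params = {
    "DBInstanceIdentifier": "demo-db",     # placeholder name
    "Engine": "postgres",                  # one of the engines listed above
    "DBInstanceClass": "db.m5.large",      # 'm' series general purpose
    "MultiAZ": True,                       # standby replica in a second AZ
    "AllocatedStorage": 20,                # GiB
    "MasterUsername": "admin",
    "MasterUserPassword": "change-me",     # placeholder credential
}
print(sorted(params))

# With boto3:
#   rds = boto3.client("rds")
#   rds.create_db_instance(**params)
```

MultiAZ=True is what provisions the synchronous standby in another AZ; it does not change which instance class you pay for, it doubles it.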
------------------------------------------------------------------------------------------------------------------------
Amazon WorkDocs - the AWS equivalent of Dropbox or Google Drive; secure, fully managed; integrates with AD / SSO; web, mobile & native clients [Mac & Windows; no Linux clients]; HIPAA, PCI & ISO compliant; an SDK is available to integrate WorkDocs;
Amazon Neptune - fully managed graph database;
- supports open graph APIs for both Gremlin & SPARQL
- allows storing interrelationships & querying them in an effective manner
- nodes (or vertices) & edges; edges link different nodes
- graph traversal, property / attributes, label
Amazon ElastiCache - fully managed implementations of 2 in-memory data stores - Redis & Memcached; in-memory, key-value store;
- push-button scalability for in-memory writes & reads; non-persistent store / cache;
- billed by node size & hours of use; memcached & redis are the 2 offerings:
- comparing Redis vs Memcached - Redis is multi-AZ & offers high-end features like replication, availability, advanced data types, ranking / sorting of data sets, pub/sub capabilities, persistence, etc.;
- Memcached is a simple caching service in comparison and offers simple features only;
- surprisingly, other than Memcached & Redis, CloudFront and API Gateway also offer caching services; DAX is also a caching service, ONLY for DynamoDB;
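The typical usage pattern behind ElastiCache is cache-aside: check the cache first, fall back to the database on a miss, then populate the cache. A minimal sketch with a plain dict standing in for Redis/Memcached (the data is illustrative):

```python
cache: dict = {}
DB = {"user:1": {"name": "Ada"}}  # stand-in for the backing database

def get_user(key: str):
    """Cache-aside read: cache first, database on a miss."""
    if key in cache:              # cache hit: no database round-trip
        return cache[key]
    value = DB.get(key)           # cache miss: read from the database
    if value is not None:
        cache[key] = value        # populate for subsequent reads
    return value

get_user("user:1")                # first read: miss, loads from DB
print("user:1" in cache)          # True: second read would be a hit
```

With a real Redis client the dict operations become GET/SET (usually with a TTL), but the control flow is identical.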
------------------------------------------------------------------------------------------------------------------------
Amazon FSx for Windows & Amazon FSx for Lustre
Amazon FSx for Windows is built on Windows Server, used to move Windows-based applications requiring file storage to AWS; works with Microsoft services like AD, IIS, SharePoint, etc.; primarily good for Windows Server Message Block (SMB)-based file services - unlike EFS, which is a Network File System (NFS) for Linux / Unix based EC2 file server instances;
Amazon FSx for Lustre - fully managed file system optimized for compute-intensive workloads - HPC, machine learning, electronic design automation (EDA), etc.; can run a file system that processes massive amounts of data, with up to 100s of Gbps of throughput, millions of IOPS & sub-millisecond latencies;
------------------------------------------------------------------------------------------------------------------------
Amazon Elastic MapReduce (EMR) - consists of a master node, core nodes & task nodes; by default, logs are stored on the master node; log archival to S3 can be configured at 5-minute intervals from the master node;
EMR is basically used for big-data processing - vast amounts of data using open-source tools, e.g. Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi & Presto; used to run data analysis on petabytes of data at less than half the cost of traditional on-premises solutions, with a runtime over 3x faster than standard Apache Spark;
Amazon Athena - serverless query layer over S3, based on Presto; queries raw objects as they sit in an S3 bucket using standard SQL; data lives on S3 & can be queried in place, without loading it elsewhere; federated queries additionally allow joins / unions with data from other data sources;
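An Athena interaction is mostly just a SQL string pointed at objects in S3. A hedged sketch - the table, database & bucket names are placeholders, and the boto3 call is shown only as a comment:

```python
# Plain SQL against an external table mapped over S3 objects.
query = """
SELECT status, COUNT(*) AS hits
FROM access_logs          -- external table over s3://my-bucket/logs/
GROUP BY status
ORDER BY hits DESC
"""
print(query.strip().splitlines()[0])

# With boto3:
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=query,
#       QueryExecutionContext={"Database": "weblogs"},
#       ResultConfiguration={"OutputLocation": "s3://my-bucket/results/"})
```

Billing is per data scanned, so partitioning and columnar formats (Parquet/ORC) matter more than the query text itself.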
Amazon QLDB (Quantum Ledger Database) - immutable & transparent JOURNAL AS A SERVICE with a centralized design, without having to set up OR maintain an entire blockchain framework; it's not a blockchain - it's good as an immutable ledger, e.g. for book-keeping software like Tally;
Amazon Managed Blockchain - fully managed, replicated, highly available Hyperledger-based blockchain data store; notably, Managed Blockchain's ordering service is built on QLDB technology, keeping the chain's history immutable; it also has the ability to integrate with blockchain nodes external to AWS;
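The property that makes a ledger like QLDB verifiable is an append-only, hash-chained journal: each entry carries the hash of the previous one, so any edit to history breaks every later hash. A conceptual sketch (not QLDB's actual wire format):

```python
import hashlib, json

def append(journal: list, record: dict) -> None:
    """Append a record, chaining it to the previous entry's hash."""
    prev = journal[-1]["hash"] if journal else ""
    payload = json.dumps(record, sort_keys=True) + prev
    journal.append({"record": record,
                    "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(journal: list) -> bool:
    """Recompute the chain; any tampered entry invalidates it."""
    prev = ""
    for entry in journal:
        payload = json.dumps(entry["record"], sort_keys=True) + prev
        if hashlib.sha256(payload.encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

j: list = []
append(j, {"tx": 1, "amount": 100})
append(j, {"tx": 2, "amount": -40})
print(verify(j))                    # True
j[0]["record"]["amount"] = 999      # tamper with history...
print(verify(j))                    # False: the chain detects it
```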
Amazon Timestream - time-series database; time-series event streams can be stored here; real-time event streaming & basic analytics is a good use case; IoT telemetry streaming is another;
Amazon DocumentDB - with MongoDB compatibility; multi-AZ HA, scalable, integrated with KMS, backed up to S3;
Amazon Elasticsearch Service [not ElastiCache] - referred to as ES; search engine + document store; useful for analytics & search; the integrated stack is referred to as ELK = Elasticsearch + Logstash + Kibana; supports a variety of ingestion paths for storage + search & analytics - Logstash, Kinesis Firehose, IoT Greengrass, CloudWatch;
------------------------------------------------------------------------------------------------------------------------