Object storage basics - Simple Storage Service (S3) can store objects up to 5 TB in size; it was one of the earliest services available on the AWS public cloud;
- Highly durable & consistent, data is replicated into multiple devices across multiple facilities
- Object attributes
- key = name of the object
- value = object payload
- version id = identifies a specific version of the object when versioning is enabled
- bucket name + key + version id uniquely identify an S3 object
Buckets - referred to using ARN (Amazon Resource Name)
- 100 buckets by default
- buckets can be configured with sub-resources (within S3)
- namespaces are universal; path style URL v/s virtual OR bucket based URL [path style URL will be deprecated]
- virtual or bucket based URL can optionally contain region name in the namespace URL
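The two URL styles above can be sketched as simple string templates (bucket / key names here are hypothetical examples):

```python
def path_style_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    # Path-style: the bucket name appears in the URL path; slated for deprecation.
    return f"https://s3.{region}.amazonaws.com/{bucket}/{key}"

def virtual_hosted_url(bucket: str, key: str, region: str = "") -> str:
    # Virtual-hosted style: the bucket name is part of the hostname;
    # the region may optionally appear in the endpoint.
    host = f"{bucket}.s3.{region}.amazonaws.com" if region else f"{bucket}.s3.amazonaws.com"
    return f"https://{host}/{key}"

print(path_style_url("my-bucket", "logs/app.log"))
# https://s3.us-east-1.amazonaws.com/my-bucket/logs/app.log
print(virtual_hosted_url("my-bucket", "logs/app.log", "eu-west-1"))
# https://my-bucket.s3.eu-west-1.amazonaws.com/logs/app.log
```

Since path-style URLs are being deprecated, new code should prefer the virtual-hosted form.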
Pricing - using Intelligent-Tiering can help reduce cost; its frequent-access tier costs about the same as S3 Standard, and it automatically moves objects to Glacier / Deep Archive tiers when access patterns allow, thereby saving cost;
------------------------------------------------------------------------------------------------------------------------
S3 security - is applicable at object level, bucket level and user level (IAM);
- S3 offers versioning and ability to delete / restore a given object version;
- with versioning, object lifecycle can be managed on S3 - to store / delete objects;
- optionally MFA can be enabled for additional security;
- S3 also supports cross-region replication
- S3 Encryption at Rest
- SSE: AES-256 with keys generated by AWS (SSE-S3) OR a customer-provided key (SSE-C)
- SSE-KMS: keys generated & managed through the AWS KMS key management service
- encrypt data from client side before uploading to S3
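The server-side options map to request headers that S3 recognizes on upload; a minimal sketch of the header selection only (not a full upload):

```python
def sse_headers(mode: str, kms_key_id: str = "") -> dict:
    # Map an encryption choice to the request header(s) S3 expects.
    if mode == "SSE-S3":
        return {"x-amz-server-side-encryption": "AES256"}
    if mode == "SSE-KMS":
        headers = {"x-amz-server-side-encryption": "aws:kms"}
        if kms_key_id:  # hypothetical KMS key id supplied by the caller
            headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
        return headers
    if mode == "SSE-C":
        # SSE-C additionally requires the customer key & its MD5 on every request;
        # only the algorithm header is sketched here.
        return {"x-amz-server-side-encryption-customer-algorithm": "AES256"}
    raise ValueError(f"unknown mode: {mode}")

print(sse_headers("SSE-S3"))  # {'x-amz-server-side-encryption': 'AES256'}
```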
Bucket policies - by default, access to all objects is private: only the resource owner, the account that created the resource, has access
- policies are categorized as resource policies & user policies
- resource policies are applied to S3 resources & user policies are applied to IAM users in your account
Resource based access policies - 2 types
- access control lists (ACLs) can grant access to AWS accounts & pre-defined groups
- control bucket-level or object-level access grants via XML schema configuration
- can choose resource-level access within a list of grants
- classified as legacy
- bucket policies grant permissions on buckets & objects to AWS accounts AND IAM USERS
- classified as NEW; configured via JSON
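A minimal bucket policy in the JSON form described above (bucket name & account ID are hypothetical examples), granting another AWS account read access:

```python
import json

# Hypothetical bucket & account; the Principal grants account 111122223333 read access.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCrossAccountRead",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-example-bucket/*",
    }],
}
print(json.dumps(bucket_policy, indent=2))
```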
Choice of resource based policies v/s ROLES
- when you assume a role, you give up your original permissions and take on the permissions assigned to the new role
- with resource based policies, the user retains their current access to existing objects while also gaining the access granted by the policy
- resource based policies are supported with S3, SQS & SNS
User policies - applied to users, groups OR roles using AWS IAM
- applied via AWS IAM service, and not using S3 console
- expressed using JSON
- directly applied to users, hence no anonymous access
- with user policies, following elements come into play
- Principal: account to allow/deny access to actions & resources
- Effect: whether to Allow or Deny when the user requests an action
- Action: list of permissions to allow/deny
- Resource: bucket or object to which the policies apply to, specified in ARN (Amazon resource name)
- SID: not required for S3; a general description of the policy statement
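The elements above can be seen in a minimal user policy sketch (bucket name hypothetical); note there is no Principal element, since the policy is attached directly to an IAM user / group / role:

```python
import json

# Hypothetical IAM user policy: list the bucket & read its objects.
user_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowListAndRead",          # optional for S3, descriptive only
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": [
            "arn:aws:s3:::my-example-bucket",    # bucket-level action (ListBucket)
            "arn:aws:s3:::my-example-bucket/*",  # object-level action (GetObject)
        ],
    }],
}
print(json.dumps(user_policy, indent=2))
```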
S3 Object locks - are used to store objects using a WORM model
WORM = write once, read many model; prevents objects from being deleted or modified for a fixed amount of time OR indefinitely;
Object lock modes
- Governance mode: users can't overwrite, delete or alter an object version without special permissions
- Compliance mode: a protected object version cannot be overwritten OR deleted by any user, including the root user of the account; the object retention can't be modified / shortened
- related terms --> retention period & legal hold;
------------------------------------------------------------------------------------------------------------------------
S3 integration with CloudWatch - CloudWatch logs event details for S3 objects, EC2 instances, VPC flow logs, etc. --> CloudWatch event monitoring can be enabled on many AWS services;
CloudWatch captures a record of each event moving in & out of S3 (and similar services); CloudWatch Events can be used to monitor actions on S3 objects and trigger appropriate corrective actions;
With CloudTrail, API calls made against S3 can be monitored; this is useful to analyze when and by whom an API call was made - to troubleshoot root causes, and to detect & protect applications on AWS; CloudTrail logs API calls in JSON format, whereas CloudWatch uses a flat format; think of CloudTrail as similar to APIC or APIGEE logs
CloudTrail operates at the API level, capturing the API operations related to the bucket OR object; S3 server access logging operates at the level of object operations - for example, requests to access an object;
------------------------------------------------------------------------------------------------------------------------
Encryption for data-at-rest
server side encryption - S3 encrypts the data before writing it to disk and decrypts data before reading from the disk, options:
- SSE-S3 --> server side encryption with S3 managed keys, AES-256 encryption, master key rotated automatically
- SSE-KMS --> server side encryption with KMS-managed keys (AES-256), managed by AWS KMS; creation & control of master keys + data keys stays in our control; AWS operators do not have access to the keys needed to decrypt the data; controlling access to master keys is the job of the account owner / administrator;
- SSE-C --> server side encryption with customer managed keys
client side encryption - data is encrypted before being uploaded into S3
- in the case of client side encryption, client encrypts the data locally
- another option is for the customer to use AWS KMS or services to manage the keys
- data uploaded to S3 is already encrypted
Object versioning - multiple versions of an object increase the storage cost: each retained version is billed as a full object, so cost is a multiple of the number of versions; versioning & snapshots should therefore be used with careful consideration of cost;
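The cost multiplication can be sketched as simple arithmetic:

```python
def versioned_storage_gb(object_size_gb: float, num_versions: int) -> float:
    # Every retained version is billed as a full object of the same size.
    return object_size_gb * num_versions

# e.g. a 2 GB object overwritten 5 times with versioning on keeps 5 full copies
print(versioned_storage_gb(2, 5))  # 10
```

Lifecycle rules that expire noncurrent versions are the usual way to cap this growth.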
when versioning is enabled and encryption is then turned on --> only new versions of objects are created encrypted; older existing versions all remain unencrypted;
------------------------------------------------------------------------------------------------------------------------
Object replication - during replication, objects are encrypted in transit over SSL; cross account replication is possible; storage class & owner can be changed post copy;
same region replication - use cases - log aggregation, data sovereignty, replication between AWS accounts
cross region replication - use cases - compliance, latency, disaster recovery;
In order for replication to work- versioning should be enabled on source & destination buckets
- must have permissions to replicate
- AWS account should have read access to source bucket
- if object lock is enabled on the source, target should also have object lock enabled
- can only replicate to a single target bucket
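Assuming the prerequisites above are met, a replication rule is configured roughly in this shape (a sketch of the structure boto3's `put_bucket_replication` accepts; the role ARN & bucket names are hypothetical):

```python
# Minimal replication configuration sketch; all ARNs/names are hypothetical.
replication_config = {
    "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
    "Rules": [{
        "ID": "replicate-everything",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {},  # empty filter = apply the rule to all objects
        "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
        "DeleteMarkerReplication": {"Status": "Disabled"},
    }],
}
print(replication_config["Rules"][0]["ID"])  # replicate-everything
```

The Destination block is also where a different storage class or owner override would be specified.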
What is replicated
- objects along with object metadata & tags are replicated
- unencrypted objects are replicated
- lock retention info is replicated
What is not replicated
- existing objects in the buckets before setting up replication are not replicated
- bucket sub-level resources such as lifecycle mgmt, etc. are not replicated
- objects encrypted using "CUSTOMER PROVIDED KEYS" (SSE-C) are not replicated
- objects encrypted using KMS are not replicated by default
- objects that are replicated by another rule are not re-replicated
- objects in s3 GLACIER OR DEEP ARCHIVE are not replicated
S3 RTC (replication time control)
- configure replication time control to replicate objects within a specified time frame
- has an SLA of 99.99% to complete replication within minutes
- Cloudwatch events can be setup to monitor SLA breach for replication
- without S3 RTC, replication is usually asynchronous
- takes several hours to replicate
S3 Glacier & Deep Archive
- a 30-day minimum applies before objects can AUTO-transition from S3 Standard into S3-IA / S3-1Zone-IA; lifecycle rules can also transition objects into S3-Intelligent-Tiering, S3-Glacier & S3-Deep Archive storage classes;
- objects < 128 KB cannot directly AUTO-transition to S3-IA / S3-1Zone-IA / S3-Intelligent-Tiering;
- for every record stored in glacier / deep archive, 40KB of extra storage is added for object name, index details & metadata; This increases the storage cost for a large number of smaller files;
- for transition into S3-IA, the object should be at least 128 KB; otherwise the transition into IA won't occur;
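The 40 KB per-object overhead can dominate for small files, as a quick calculation shows:

```python
def glacier_overhead_ratio(object_size_kb: float, overhead_kb: float = 40) -> float:
    # Fraction of billed Glacier storage that is pure index/metadata overhead,
    # given the ~40 KB added per archived object (per the notes above).
    return overhead_kb / (object_size_kb + overhead_kb)

# For 10 KB objects, 80% of the billed storage is overhead:
print(round(glacier_overhead_ratio(10), 2))  # 0.8
# For 10 MB objects, the overhead is negligible:
print(round(glacier_overhead_ratio(10 * 1024), 4))  # 0.0039
```

This is why aggregating many small files into larger archives before moving them to Glacier / Deep Archive is a common cost optimization.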
------------------------------------------------------------------------------------------------------------------------
S3 performance - S3 uses read-after-write consistency model, so object written will immediately be available for read access. NO DELAYS;
- by default, S3 serves up to 3,500 requests / second for PUT / POST / DELETE / COPY calls, per prefix
- by default, S3 serves up to 5,500 requests / second for GET / HEAD calls, per prefix
- in order to scale horizontally for higher performance, increase the number of prefixes
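The per-prefix limits scale linearly with the number of prefixes; a quick sketch:

```python
def max_requests_per_second(prefixes: int, writes: bool) -> int:
    # Per-prefix baselines from the notes above:
    # 3,500/s for PUT/POST/DELETE/COPY, 5,500/s for GET/HEAD.
    per_prefix = 3500 if writes else 5500
    return prefixes * per_prefix

# Spreading reads across 4 prefixes raises the ceiling to 22,000 GETs/s:
print(max_requests_per_second(4, writes=False))  # 22000
```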
Minimize latency for performance - regional considerations - S3 bucket placed near to end users / in that region;
For a newly uploaded object, S3 responds to a 'GET' with the object right away (read-after-write); updates & deletes are however eventually consistent - they complete locally first and are then replicated to other locations, converging to an eventually consistent state;
KMS request rates - server side encryption / decryption has a limit; quota is region specific; quota increase for KMS is not allowed;
Transfer acceleration --> used to reduce data-transfer latency when the S3 bucket is far from the end user's region; other options to bring content closer to users / cache frequently accessed S3 content:
- use CDN [content distribution network] such as CloudFront
- using cache solutions such as elastic cache
- use AWS Elemental MediaStore --> to cache video content
- use Geo-proximity routing policies to route requests to the users' closest location
- transfer acceleration is used to reduce latency & improve rendering speed
- cross-region replication does not use transfer acceleration
Multi-part upload [PUT] / download [GET] - introduces parallelism;
multi-part uploads are recommended for objects > 100 MB; a large object is split into multiple chunks that are uploaded / downloaded in parallel, improving performance; even without introducing multiple prefixes OR scaling the number of EC2 instances, this parallelism gives similar results;
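The chunking arithmetic can be sketched as follows (S3's documented multipart limits: parts of 5 MiB to 5 GiB, at most 10,000 parts, with only the last part allowed to be smaller):

```python
import math

def plan_multipart(total_bytes: int, part_size: int = 100 * 1024 * 1024):
    # Return (part_number, part_bytes) pairs for a multipart upload plan.
    assert 5 * 1024 * 1024 <= part_size <= 5 * 1024 ** 3, "part size out of S3 range"
    parts = math.ceil(total_bytes / part_size)
    assert parts <= 10_000, "increase part_size for very large objects"
    return [(i + 1, min(part_size, total_bytes - i * part_size))
            for i in range(parts)]

# A 250 MiB object with 100 MiB parts -> 3 parts of 100, 100 and 50 MiB:
print(plan_multipart(250 * 1024 * 1024))
```

Each part can then be transferred by an independent worker, which is where the parallelism comes from.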
CloudFront CDN - data in the CDN expires by default after 24 hrs (TTL / time to live); CloudFront can front-end OR point to:
- an EC2 instance
- an S3 bucket
- Route 53 endpoint
- Elastic load balancer (ELB)
- OR A COMPLETELY DIFFERENT EXTERNAL SYSTEM
Geo restriction is supported with CloudFront; the CDN exists at an Edge location, and a "Distribution" is a collection of Edge locations; CloudFront CDN supports caching both static content & dynamic content (dynamic content meaning pages generated by Java / Groovy / JSP / etc.); there are 2 types of distribution:
- web-distribution: for websites
- RTMP: for media streaming
Transfer Acceleration - enables fast, easy & secure transfer of files over long distances between the client & the S3 bucket; leverages CloudFront's globally distributed edge locations; as data arrives at an edge location, it is routed to the S3 bucket over an optimized network path; used to optimize PUTs / GETs / LISTs; Inventory buckets - created in the same region as the S3 bucket;
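The accelerate endpoint simply swaps the regular hostname for a dedicated one; a sketch (bucket name hypothetical, and acceleration must already be enabled on the bucket, which also requires a DNS-compliant bucket name):

```python
def accelerate_url(bucket: str, key: str) -> str:
    # Transfer Acceleration uses the dedicated s3-accelerate endpoint;
    # requests to it are routed to the nearest CloudFront edge location.
    return f"https://{bucket}.s3-accelerate.amazonaws.com/{key}"

print(accelerate_url("my-bucket", "video.mp4"))
# https://my-bucket.s3-accelerate.amazonaws.com/video.mp4
```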
Simple notification service (SNS) - supports pub/sub - distribution pattern; notification is published with a "topic" on the channel; messages are further received by the subscriber to the "topic";
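The pub/sub distribution pattern itself can be illustrated with a toy in-memory topic bus (this is NOT the SNS API, just the pattern it implements):

```python
from collections import defaultdict

class MiniTopicBus:
    """Toy sketch of topic-based pub/sub: publishers and subscribers
    are decoupled; every subscriber to a topic receives each message."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # Register a callback to receive messages published on `topic`.
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Fan the message out to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(message)

bus = MiniTopicBus()
received = []
bus.subscribe("s3-events", received.append)
bus.publish("s3-events", "ObjectCreated:Put")
print(received)  # ['ObjectCreated:Put']
```

In AWS the same roles are played by an SNS topic, its subscriptions (SQS queues, Lambda functions, email, etc.), and the publishing service such as S3.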
Elastic Map Reduce - managed Hadoop framework for processing huge amounts of data; supports Apache Spark, HBase, Presto & Flink; contains a master node, core nodes & task nodes - the core nodes host HDFS;
------------------------------------------------------------------------------------------------------------------------
S3 analytics - S3 supports analytics via data lake [Athena, Redshift spectrum, Quicksight]; IoT streaming with Kinesis Firehose writing to S3 buckets - is another option; ML & AI storage, Rekognition, Lex, MXNet, storage class analysis, S3 management analytics;
More Nifty S3 tricks
- provides data transfer acceleration using CloudFront in reverse;
- "requester pays" billing - the requester, rather than the bucket owner, pays for access;
- "TAGS" are always useful for costing, billing, security, ALSO FOR document classification / ARCHIVED objects;
- static web hosting, static content, media / content simple & massively scalable
- "BitTorrent" support
------------------------------------------------------------------------------------------------------------------------