AWS Made Simple and Fun 2
About The Author
Microservices design: Splitting a monolith into microservices
  Use case: Microservices design
    Scenario
    Out of scope (so we don't lose focus)
    Services
    Solution step by step
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Securing Microservices with AWS Cognito
  Use case: Securing Access to Microservices
    Scenario
    Services
    Solution step by step
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Securing Access to S3
  Use case: Securing access to content stored in S3
    Scenario
    Services
    Solution step by step
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
7 Must-Do Security Best Practices for your AWS Account
  Use case: That AWS account where you run your whole company but you never bothered to improve security
    Scenario
    Services
    Solution
      Create a password policy
      Create IAM users
      Add MFA to every user
      Enable logging of all account actions
      Set up a distribution list as your email address
      Enable GuardDuty
      Create a Budget
    Discussion
Quick overview of ECS
  Use Case: Deploying Containerized Applications
    AWS Service: Elastic Container Service
  Best Practices
Step-by-Step instructions to migrate a Node.js app from EC2 to ECS
  Use case: Transforming an app on EC2 to a scalable app on ECS
    Scenario
    Services
    Solution step by step
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
CI/CD Pipeline with AWS Code*
  Use case: CI/CD Pipeline in AWS
    Scenario
    Services
    Solution step by step
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Kubernetes on AWS - Basics and Best Practices
  Use case: Containerized microservices on Kubernetes on AWS
    Scenario
    Services
    Solution
    Discussion
      Basic building blocks of Kubernetes
      Benefits of Kubernetes
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Step-By-Step Instructions To Deploy A Node.Js App To Kubernetes on EKS
  Use case: Deploying a Node.js app on Kubernetes on EKS
    Scenario
    Services
    Solution step by step
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Handling Data at Scale with DynamoDB
  Use case: Storing and Querying User Profile Data at Scale
    Scenario
    Services
      Key features of DynamoDB
    Solution
  Best Practices
DynamoDB Database Design
  Use case: DynamoDB Database Design
    Scenario
    Services
    Designing the solution
      Final Solution
      Access Patterns
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Using SQS to Throttle Database Writes
  Use case: Throttling Database Writes with SQS
    Scenario
    Services
    Solution step by step
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Transactions in DynamoDB
  Use case: Transactions in DynamoDB
    Scenario
    Services
    Solution
    Solution explanation
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Serverless web app in AWS with Lambda and DynamoDB
  Use case: Serverless web app (Lambda + DynamoDB)
    Scenario
    Services
    Solution
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
20 Advanced Tips for Lambda
  Use Case: Efficient Serverless Compute
    AWS Service: AWS Lambda
      How it works
      Fine details
  Best Practices
Secure access to RDS and Secrets Manager from a Lambda function
  Use case: Secure access to RDS and Secrets Manager from a Lambda function
    Scenario
    Services
    Solution
      What the solution looks like
      How to build the solution
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Monitor and Protect Serverless Endpoints With API Gateway and WAF
  Use case: Monitor and Protect Serverless Endpoints Easily and Cost-Effectively
    Scenario
    Services
    Solution
  Best Practices
Using X-Ray for Observability in Event-Driven Architectures
  Use case: Observability in Event-Driven Architectures Using AWS X-Ray
    AWS Service: AWS X-Ray
      Without X-Ray
      With X-Ray
  Best Practices
    How to set up AWS X-Ray for a Node.js app
    Additional tips
Serverless, event-driven pipeline with Lambda and S3
  Use case: Serverless, event-driven image compressing pipeline with AWS Lambda and S3
    Scenario
    Services
    Solution
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Real-time data processing pipeline with Kinesis and Lambda
  Use case: Building a real-time data processing pipeline with Kinesis and Lambda
    Scenario
    Services
    Solution
      How to send data to a Kinesis Data Stream in JavaScript
      How to process the data and store it to S3 with a Lambda function in JavaScript
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Complex, multi-step workflow with AWS Step Functions
  Use case: Complex, multi-step image processing workflow with AWS Step Functions
    Scenario
    Services
    Solution
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
Using Aurora for your MySQL or Postgres database
  Use Case: Managed Relational Database
    AWS Service: Amazon Aurora
  Best Practices
Session Manager: An easier and safer way to SSH into your EC2 instances
  Use Case: Connecting to an instance using SSH
    AWS Service: Session Manager
      Benefits of using Session Manager
  Best Practices
Using SNS to Decouple Components
  Use Case: Using SNS to decouple components
    AWS Service: Amazon SNS
      What you can do with SNS
  Best Practices
Self-healing, Single-instance Environment with AWS EC2
  Use case: Self-healing environment that doesn't need to scale
    Scenario
    Services
    Solution
    Discussion
  Best Practices
    Operational Excellence
    Security
    Reliability
    Performance Efficiency
    Cost Optimization
EBS: Volume types and automated backups with DLM
  Use Case: Understanding EBS and automating EBS backups
    AWS Service: EBS and DLM
      EBS basics
      EBS Volume types
  Best Practices
    Automating Snapshots with Data Lifecycle Manager
AWS Organizations and Control Tower
  Use Case: Managing Multiple AWS Accounts
    AWS Service: Organizations and Control Tower
      Benefits of using Organizations
      Example account structure
  Best Practices
Introduction
This book is meant to serve as a guide to different AWS solutions,
explaining how to implement them, the reasoning behind the decisions, and
best practices to take the solution to the next level. It was written for devs,
tech leads, cloud/devops engineers and software experts in general, who
have a basic to intermediate understanding of AWS and want to take that
understanding to an advanced level, one solution at a time.
This book is not meant to help you pass certification exams, and it is not
meant as a repository of production-grade solutions you can copy-paste.
The goal is to help you develop and improve your understanding of
these solutions, from both an implementation and an architectural
perspective.
Most chapters walk you through a complete solution; some others serve just as an introduction to a topic, with only a brief explanation of an AWS service and some best practices for it.
Microservices design: Splitting a monolith into microservices
Use case: Microservices design
Scenario
As the app has grown, we've noticed that content delivery becomes a
bottleneck during normal operations. Additionally, changes in the course
directory resulted in some bugs in progress tracking. To deal with these
issues, we decided to split the app into three microservices: Course
Catalog, Content Delivery, and Progress Tracking.
Services
Solution explanation
There's a lot to say about microservices (heck, I just wrote 3000 words on
the topic), but the main point is that you don't need microservices (for 99%
of apps).
When do you not need microservices? When the domain is not that complex. In that case, use regular services, where the only split is in the behavior (i.e. backend code). Or stick with a monolith; Facebook does that, and it works pretty well, at a size we can only dream of.
By the way, here's what a user viewing a course looks like before the split:
1. The user sends a login request with their credentials to the
monolithic application.
2. The application validates the credentials and, if valid, generates an
authentication token for the user.
3. The user sends a request to view a course, including the
authentication token in the request header.
4. The application checks the authentication token and retrieves the
course details from the Courses table in DynamoDB.
5. The application retrieves the course content metadata from the
Content table in DynamoDB, including the S3 object key.
6. Using the S3 object key, the application generates a pre-signed URL for the course content from Amazon S3 (there's a sketch of this step right after the list).
7. The application responds with the course details and the
pre-signed URL for the course content.
8. The user's browser displays the course details and loads the
course content using the pre-signed URL.
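Step 6 is the only step that isn't obvious in code, so here's a minimal sketch of it, assuming the AWS SDK for JavaScript v3; the bucket name and the one-hour expiry are placeholders, not values from the platform:
JavaScript
// Minimal sketch of step 6: generating a pre-signed URL for course content
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

async function getContentUrl(objectKey) {
  const command = new GetObjectCommand({
    Bucket: "courses-content-bucket", // hypothetical bucket name
    Key: objectKey,
  });
  // The URL expires after an hour, so it can't be shared forever
  return getSignedUrl(s3, command, { expiresIn: 3600 });
}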
Best Practices
Operational Excellence
● Least privilege: It's not enough to not write code that accesses another service's data; you should also enforce that via IAM permissions. Each microservice should use its own IAM role, one that grants access to its own DynamoDB table only, not to * (there's an example policy right after this list).
● Zero trust: The idea is to not trust agents inside a network, but
instead authenticate at every stage. Exposing your services
through API Gateway gives you an easy way to do this. Yes, you
should do this even when exposing them to other services.
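Here's what "not *" looks like in practice — a sketch of the policy attached to, say, the Course Catalog service's role. The table name, region and account ID are placeholders, and the action list is an assumption; trim it to what the service actually does:
JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:Query",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/Courses"
    }
  ]
}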
Reliability
Performance Efficiency
● Rightsize ECS tasks: Now that you split your monolith, it's time to
check the resource usage of each microservice, and fine-tune
them independently.
Cost Optimization
Securing Microservices with AWS Cognito
Use case: Securing Access to Microservices
Scenario
We're going to continue working on the app from the previous chapter. As a
reminder, we have an online learning platform that has been split into three
microservices: Course Catalog, Content Delivery, and Progress Tracking.
The Course Catalog microservice is responsible for maintaining the list of
available courses and providing course details to users. To ensure that only
authenticated users can browse the catalog, we need to implement a
secure access mechanism for this microservice. We'll dive a bit into
frontend code here, and I'll assume it's a React.js app.
Services
JavaScript
import { useState } from "react";

// Configuration for Cognito and API Gateway (replace the placeholder values)
const config = {
  region: "your_aws_region",
  cognito: {
    userPoolId: "your_cognito_user_pool_id",
    appClientId: "your_cognito_app_client_id",
  },
  apiGateway: {
    apiUrl: "your_api_gateway_url",
  },
};

// Helper that calls the API through API Gateway. The fetch call wasn't part
// of the original fragment; this assumes a standard fetch-based
// implementation with the Cognito token already present in `headers`.
const callApi = async (path, method, headers) => {
  const options = {
    method,
    headers,
  };
  const response = await fetch(`${config.apiGateway.apiUrl}${path}`, options);
  const data = await response.json();
  if (!response.ok) {
    throw new Error(data.message || "Error calling API");
  }
  return data;
};

function App() {
  const [username, setUsername] = useState("");
  const [password, setPassword] = useState("");
  // ... (rest of the component: log in against Cognito, then call the API with the token)
}
Solution explanation
Discussion
There's one caveat to our auth solution: the Content Delivery microservice
returns the URL to an S3 object, and (as things are right now), that object
needs to be public. That means only an authenticated user (i.e. paying
customer) can get the URL, but once they have it they're free to share it
with anyone. Securing access to content served through S3 is the topic of the next chapter.
One more thing about Cognito: If your app users needed AWS permissions,
for example to write to an S3 bucket or read from a DynamoDB table, you'd
need to set up an Identity Pool that's connected to your User Pool.
Best Practices
Operational Excellence
Security
● Enable MFA in Cognito User Pool: You can offer your users the
option of adding MFA to their login.
Reliability
Performance Efficiency
Cost Optimization
Securing Access to S3
Use case: Securing access to content stored in S3
Scenario
In our online learning platform that we've been building in the previous 2
chapters, we have three microservices: Course Catalog, Content Delivery,
and Progress Tracking. The Content Delivery service is responsible for
providing access to course materials such as videos, quizzes, and
assignments. These files are stored in Amazon S3, but they are currently
publicly accessible. We need to secure access to these files so that only
authenticated users of our app can access them.
Services
FYI, here's how it works right now:
1. The user clicks View Content, and the frontend sends a request with the auth data to the Content Delivery endpoint in API Gateway.
2. API Gateway calls the Cognito authorizer, and Cognito approves.
3. API Gateway forwards the request to the Content Delivery microservice.
4. The Content Delivery microservice reads the S3 URL of the requested video from the DynamoDB table and returns that URL.
5. The URL is public (which is a problem).
JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudfront:ListPublicKeys",
        "cloudfront:GetPublicKey"
      ],
      "Resource": "*"
    }
  ]
}

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudfront.amazonaws.com"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::simple-aws-courses-content/*",
      "Condition": {
        "StringEquals": {
          "aws:SourceArn": "arn:aws:cloudfront::ACCOUNT_ID:distribution/DISTRIBUTION_ID"
        }
      }
    }
  ]
}
JavaScript
let cachedKeys;
// ... (the rest of this function was not included here)

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "cognito-idp:ListUsers",
        "cognito-idp:GetUser"
      ],
      "Resource": "arn:aws:cognito-idp:<REGION>:<ACCOUNT_ID>:userpool/<USER_POOL_ID>"
    }
  ]
}
Discussion
These first three chapters dealt with the same scenario, focusing on
different aspects. We designed our microservices, secured them, and
secured the content. We found problems, we fixed them. We made
mistakes (well, I did), we fixed them.
Best Practices
Operational Excellence
Security
Reliability
Performance Efficiency
7 Must-Do Security Best Practices for your AWS Account
Use case: That AWS account where you run your whole company but you never bothered to improve security
Scenario
You created your AWS account and started building. You got your MVP
running, got some paying customers, and things are going well. You always
said you'd come back to improve security, but figuring out how is harder
than you thought, and there's always a new feature to add to your product.
So you keep saying you'll get on it next week, and the next, and the next...
Services
● AWS Billing: Where you see how much money you're spending.
Solution
Check out each of these best practices, and if you haven't applied them, do
it now. It's going to take less than 2 hours, I promise. Keep in mind these
are only applicable if you're using a single AWS account, not an
Organization.
Create a password policy
Passwords can be brute forced. No news there. Yet some people still think they're safe with a 10-digit password that's all numbers. Set up a password policy to prevent IAM users from entering insecure passwords, with a minimum of 16 characters. Here's how:
1. Log in to your AWS account and go to the IAM console.
2. Choose Account settings on the left.
3. In the Password policy section, choose Change password policy.
4. Check your options and click Save.
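If you'd rather script it, here's the equivalent as a minimal CLI sketch; the specific requirement flags are assumptions, so match them to the options you picked in the console:
Shell
aws iam update-account-password-policy \
  --minimum-password-length 16 \
  --require-uppercase-characters \
  --require-lowercase-characters \
  --require-numbers \
  --require-symbols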
Create IAM users
The root user can do everything. If you lose access to it, you're done. So don't share it, and don't even use it! Instead, create an IAM User for every person that needs access to AWS, and only use the root user in case of emergencies. You'll be tempted to give Administrator permissions to everyone. Instead of that, check out these example policies by AWS and these policies by asecurecloud, and take 5 minutes to build something a bit more restrictive.
(tip for Organizations: Don't use IAM, instead use IAM Identity Center to let
one user access all your accounts)
Here's how to set up users:
1. Log in to your AWS account and go to the IAM console.
2. If you decide to set up your own IAM policies (or use the ones
from AWS or Asecurecloud), click Policies on the left and click
Create policy.
3. Click on the JSON tab, paste your JSON. Click Next: Tags.
4. If you want, enter a tag for your policy. Click Next: Review.
5. Enter a Name and a Description. Click Create policy. Repeat this
for all the policies you want to create.
6. Next we're going to create groups for our users. You'll want to
group them by roles or access levels, for example you could have
a Developers group and an Infrastructure group. To do that, click
User groups on the left and click Create group.
7. Enter the name, scroll down, and select the policies you want to
add (you can use the search box to filter them). Click Create
group.
8. Now we're ready for our users. Click Users on the left and click
Add user.
9. Enter the user name. If they need access to the visual console
(the one where you're doing this), check Enable console access.
Most people will need it.
10. If you checked Enable console access, choose either an auto-generated password or enter your own custom password. In either case, check Users must create a new password at the next sign-in. Click Next.
11. Select the group or groups (you can pick more than one) to which you want to add this user. If you pick more than one group, the user's permissions will be the combination of all of them. Click Next and click Create user.
12. Copy the Console sign-in URL and send it, together with the password, to the person this user is for. You can also click the button to Email sign-in instructions.
13. If the user needs programmatic access, on the list of users click
on that user, click on the Security credentials tab, scroll down to
Access keys and click Create access key.
14. Select your use case, read the alternatives presented by AWS,
check that you understand and click Next.
15. Enter a Description and click Create access key.
16. Repeat the user creation process for as many people as you
want to give access to your AWS account. Remember, one user
per person.
(tip for Organizations: do this for all IAM Identity Center users and for the
root of each account)
Enable logging of all account actions
Logs are a record of what happens in a system, right? There's this service called CloudTrail that records every action taken in your AWS account. Launched an EC2 instance? It's logged here. Changed a security group? Logged here. Logged in with your user from an unknown IP range? Logged here, and you can get notified (more on that next). Event history is automatically enabled for the past 90 days.
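Event history only goes back 90 days, though. To keep these logs longer, create a trail that delivers them to an S3 bucket. Here's a minimal CLI sketch; the trail and bucket names are placeholders, and the bucket needs a policy that lets CloudTrail write to it:
Shell
aws cloudtrail create-trail \
  --name account-activity-trail \
  --s3-bucket-name my-cloudtrail-logs-bucket \
  --is-multi-region-trail
aws cloudtrail start-logging --name account-activity-trail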
(tip for Organizations: You'll want to collect the logs from all accounts and
move them to a single account. I'll update this with a guide for that in the
coming weeks)
Set up a distribution list as your email address
You know that email address you used to create the account? Probably the CEO's or the CTO's. Turns out AWS sends useful info there. What happens if that person is not available? Create a distribution list (an address that automatically forwards to many addresses) and add the relevant people there. Here's how:
1. In Google Workspace, go to Directory → Groups.
2. Click on Create Group and enter name, email and description.
Click Next.
3. For Access type you probably want to set Restricted, but any will
do, so long as External can “contact group owners” (i.e. send
emails to that address). Click Create Group.
4. Back on AWS, log in with the root user and go to Account
Settings.
5. Change your contact information to a group you just created, and
click Save.
6. Warning: If you change the root email address to that of a group, anyone with access to the group can recover the password. This is actually a good idea, but be careful about who you add to that group.
Enable GuardDuty
GuardDuty basically scans your CloudTrail events and warns you when
there's unusual activity. Cool, right? It's $4 per million events scanned. And
you need to enable it in all regions.
Here's how to enable it for a region (you should repeat this for all regions):
1. Go to the GuardDuty console. Click Get started and click Enable.
That's it! Let's explore it a bit.
2. On the left, go to Settings and click Generate sample findings.
3. Go back to Findings on the menu on the left, and check out the
[SAMPLE] findings generated. That's what you can expect from
GuardDuty (though hopefully not that many!)
4. So, do I need to check GuardDuty every day to see if there's
something new? Absolutely not. Let's set it up to notify you on an
SNS topic to which you can subscribe a phone number, an email
address, etc (I recommend one of the email addresses for a
distribution list, created on the recommendation above):
5. First we'll create the SNS topic. Go to the SNS console, click
Topics on the left and Create topic.
6. For the Type select Standard. Enter a name such as
GuardDutyFindings, and click Create topic.
7. Next we'll subscribe our email or phone to the topic. Click Create subscription, for Protocol select Email, enter the email address of one of the distribution lists created above (or create a new one for this), and click Create subscription.
8. Finally we'll need an EventBridge rule to post GuardDuty findings
to the SNS topic. To do that, go to the EventBridge console, click
Rules on the left and click Create rule.
9. Enter a name such as GuardDutyToSNS, leave event bus as
“default” and Rule type as Rule with an event pattern. Click Next.
10. For Event source, choose AWS events. For Creation method,
choose Use pattern form. For Event source, choose AWS
services. For AWS service, choose GuardDuty. For Event Type,
choose GuardDuty Finding. Click Next.
11. For Target types, choose AWS service. On Select a target,
choose SNS topic, and for Topic, choose the name of the SNS
topic you created 5 steps ago. Open Additional settings.
12. In the Additional settings section, for Configure target input,
choose Input transformer and click Configure input transformer.
13. Scroll down to the Target input transformer section, and for
Input path, paste the following code:
14. { "severity": "$.detail.severity", "Finding_ID": "$.detail.id", "Finding_Type": "$.detail.type", "region": "$.region", "Finding_description": "$.detail.description" }
15. Scroll down. For Template, paste the following code (tune it if
you want):
16. "Heads up! You have a severity <severity> GuardDuty finding of type <Finding_Type> in the <region> region." "Finding Description:" "<Finding_description>. " "Check it out on the GuardDuty console: https://console.aws.amazon.com/guardduty/home?region=<region>#/findings?search=id%3D<Finding_ID>"
17. Click Confirm. Click Next. Click Next. Click Create rule. To test
it, re-generate the sample findings from step 2 and you should get
an email for each.
18. Go grab a glass of water, a cup of tea or a beer. That was a
long one!
Create a Budget
There's no way to limit your billing on AWS, but there is a way to get
notified if your current spending or forecasted spending for the month
exceeds a threshold. If you're spending $200/month and suddenly you're
on track to spending $500 for that month, you'd like to know ASAP so you
can delete those EC2 instances that you forgot you launched!
I'd give you a step by step, but I'll do you one better: Here's an in-console
tutorial by AWS. Tip: Set a number above your typical spending (including
peaks). You don't want to be deleting 10 emails a month, because you'll get
used to ignoring them.
Discussion
Do I need to do all of that? Yes. It's easy (with this guide), it's either super
cheap or free, and it's important. You know it's important, or you wouldn't
have read this far.
Do I need to do all of that right now? Now that is a good question. The
answer is no, you don't. But you've been postponing this for how long? Tell
you what, let's compromise: Take each best practice, create a Jira ticket (or
Trello, Todoist or whatever you're using to manage tasks), paste the step
by step there, and schedule it for some time in this sprint or the next. At
least now you know what to do and how to do it.
Quick overview of ECS
Use Case: Deploying Containerized Applications
tl;dr: It's a container orchestrator (like Kubernetes) but done by AWS. You
take a Docker app, set parameters like CPU and memory, set up an EC2
Auto Scaling Group or use Fargate (serverless, you pay per use), and ECS
handles launching everything and keeping it running. Here's a better
explanation of what's involved:
Best Practices
● ECS is free. You only pay for the EC2 instances or Fargate
capacity and the Load Balancers. In contrast, an EKS cluster costs
$72/month.
● Fargate is awesome for unpredictable loads, and scales extremely
fast (you still have to wait for the container to start). Plus, you can
use savings plans!
● In contrast, EC2 is cheaper, but when scaling you need to wait for
the EC2 instance to start.
● Rough cost comparison for the same example workload: Fargate $85/month vs. EC2 $53/month (with t4g.medium instances).
● GitHub Actions
● GitLab
● Jenkins
Step-by-Step instructions to migrate a
Node.js app from EC2 to ECS
Use case: Transforming an app on EC2 to a
scalable app on ECS
Scenario
You have this cool app you wrote in Node.js. You're a great developer, but
you started out with 0 knowledge of AWS. At first you just launched an EC2
instance, SSH'd there and deployed the app. It works, but it doesn't scale.
You read the previous chapter and understood the basic concepts of ECS,
but you don't know how to go from your app on EC2 to your app on an ECS
cluster.
Services
● Create a Dockerfile.
In your app's root directory, create a file named "Dockerfile" (no file
extension). Use the following as a starting point, adjust as needed.
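Here's a minimal sketch for a typical Node.js app; the base image and entry point (app.js) are assumptions, and the port matches the task definition below:
Dockerfile
# Base image, app files, dependencies, exposed port, start command
FROM node:18-alpine
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["node", "app.js"]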
● Create the Task Definition.
Create a CloudFormation template named "ecs-task-definition.yaml" with the Task Definition and the IAM role it needs, then deploy it as a stack from the CloudFormation console.
YAML
Resources:
  ECSTaskRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: ECRReadOnlyAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ecr:GetAuthorizationToken
                Resource: "*"
              - Effect: Allow
                Action:
                  - ecr:BatchCheckLayerAvailability
                  - ecr:GetDownloadUrlForLayer
                  - ecr:GetRepositoryPolicy
                  - ecr:DescribeRepositories
                  - ecr:ListImages
                  - ecr:DescribeImages
                  - ecr:BatchGetImage
                Resource: !Sub "arn:aws:ecr:${AWS::Region}:${AWS::AccountId}:repository/cool-nodejs-app"
              # The execution role also needs to write the container logs
              - Effect: Allow
                Action:
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: !GetAtt CloudWatchLogsGroup.Arn
  # Log group the container logs will go to (referenced in the task definition below)
  CloudWatchLogsGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /ecs/cool-nodejs-app
      RetentionInDays: 30
  CoolNodejsAppTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: cool-nodejs-app
      # The same role is reused as task role and execution role here
      TaskRoleArn: !Ref ECSTaskRole
      ExecutionRoleArn: !Ref ECSTaskRole
      RequiresCompatibilities:
        - FARGATE
      NetworkMode: awsvpc
      Cpu: '256'
      Memory: '512'
      ContainerDefinitions:
        - Name: cool-nodejs-app
          # The image is referenced by its repository URI, not its ARN
          Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/cool-nodejs-app:latest"
          PortMappings:
            - ContainerPort: 3000
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref CloudWatchLogsGroup
              awslogs-region: !Sub ${AWS::Region}
              awslogs-stream-prefix: ecs
● Create the ECS Cluster and Service.
You'll need an existing VPC for this (you can use the default one).
Update your CloudFormation template ("ecs-task-definition.yaml")
to include the ECS Cluster, the Service, and the necessary
resources for networking and load balancing. Replace {VPCID}
with the ID of your VPC, and {SubnetIDs} with one or more
subnet IDs. Then, on the Console, go to CloudFormation and
update the existing stack with the modified template.
YAML
Resources:
  # ... (Existing Task Definition and related resources)
  CoolNodejsAppService:
    Type: AWS::ECS::Service
    # The service can only register targets once the listener exists
    DependsOn: AppLoadBalancerListener
    Properties:
      ServiceName: cool-nodejs-app-service
      Cluster: !Ref CoolNodejsAppCluster
      TaskDefinition: !Ref CoolNodejsAppTaskDefinition
      DesiredCount: 2
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          Subnets:
            - {SubnetIDs}
      LoadBalancers:
        - TargetGroupArn: !Ref AppTargetGroup
          ContainerName: cool-nodejs-app
          ContainerPort: 3000
  CoolNodejsAppCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: cool-nodejs-app-cluster
  AppLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: app-load-balancer
      Scheme: internet-facing
      Type: application
      IpAddressType: ipv4
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: '60'
      Subnets:
        - {SubnetIDs}
      SecurityGroups:
        - !Ref AppLoadBalancerSecurityGroup
  # Listener that forwards incoming HTTP traffic to the target group
  # (without one, the load balancer never routes any traffic)
  AppLoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref AppLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref AppTargetGroup
  AppLoadBalancerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: app-load-balancer-security-group
      VpcId: {VPCID}
      GroupDescription: Security group for the app load balancer
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
  AppTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: app-target-group
      Port: 3000
      Protocol: HTTP
      TargetType: ip
      VpcId: {VPCID}
      HealthCheckEnabled: true
      HealthCheckIntervalSeconds: 30
      HealthCheckPath: /healthcheck
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 2
YAML
Resources:
  # ... (Task Definition, ECS cluster and the other existing resources)
  AppScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 10
      MinCapacity: 2
      # Application Auto Scaling expects the service name here, not the ARN that !Ref returns
      ResourceId: !Sub "service/${CoolNodejsAppCluster}/${CoolNodejsAppService.Name}"
      RoleARN: !Sub "arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService"
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
  AppScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: app-cpu-scaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref AppScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        TargetValue: 50
        ScaleInCooldown: 300
        ScaleOutCooldown: 300
● Test the new app.
After the CloudFormation stack update is complete, go to the ECS
console, click on the cool-nodejs-app-cluster cluster, and
you'll find the cool-nodejs-app-service service. The service
will launch tasks based on your task definition and desired count,
so you should see 2 tasks. To test the app, go to the EC2 console,
look for the Load Balancers option on the left, click on the load
balancer named app-load-balancer and find the DNS name.
Paste the name in your browser or use curl or Postman. If we
got it right, you should see the same output as when running the
app locally. Congrats!
Solution explanation
1. Install Docker
ECS runs containers. Docker is the tool we use to containerize our
app. Containers package an app and its dependencies together,
ensuring consistent environments across different stages and
platforms. We needed this either way for ECS, but we get the
extra benefit of being able to run it locally or on ECS without any
extra effort.
2. Create a Dockerfile
A Dockerfile is a script that tells Docker how to build a Docker
image. We specified the base image, copied our app's files,
installed dependencies, exposed the app's port, and defined the
command to start the app. Writing this when starting development
is pretty easy, but doing it for an existing app is harder.
3. Build the Docker image and test it locally
The degree to which you know how your software behaves is the
degree to which you've tested it. So, we wrote the Dockerfile, built
the image, and tested it!
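The build-and-test loop from steps 2 and 3 looks something like this (assuming the app listens on port 3000, like the task definition above declares):
Shell
docker build -t cool-nodejs-app .
docker run -p 3000:3000 cool-nodejs-app
# In another terminal, check that it responds just like the non-dockerized app
curl http://localhost:3000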
Discussion
● We did it! Took a single EC2 instance that wouldn't scale, and
made it scalable, reliable (if a task fails, ECS launches a new one)
and highly available (if you chose at least 2 subnets in different
AZs). And it wasn't that hard.
● Real life could be harder though. This solution works under the
assumption that the app stores no state in the same instance. Any
session data (or any data that needs to be shared across
instances, no matter how temporary or persistent it is) should be
stored in a separate storage, such as a database. DynamoDB is
great for session data, even if you use a relational database like
RDS or Aurora for the rest of the data.
● Most of the people I've seen using a single EC2 instance and a
relational database have the database running in the same
instance. You should move it to RDS or Aurora. This is a separate
step from moving the app to ECS.
● Local environments are easy, until you have 3 devs using an old
Mac, 2 using an M1 or M2 Mac, 2 on Windows and a lone guy
running an obscure Linux distro (it's the same guy who argues vi is
better than VS Code). Docker fixes that.
● Why ECS and not a plain EC2 Auto Scaling Group? For one
service, it's pretty much the same effort. For multiple services,
ECS abstracts away a LOT of complexities.
● Why ECS and not Kubernetes? It's simpler. That's the whole
reason. There's another chapter on doing the same thing for
Kubernetes, you'll see the difference there.
Best Practices
Operational Excellence
● Use a CI/CD pipeline: I ran all of this manually, but you should add a pipeline. After you've created the infrastructure, all the pipeline needs to do is build the docker image with docker build, tag it with docker tag and push it with docker push (there's a sketch right after this list).
● Use an IAM Role for the pipeline: Of course you don't want to let
anyone write to your ECR registry. The CI/CD pipeline will need to
authenticate. You can either do this with long-lived credentials (not
great but it works), or by letting the pipeline assume an IAM Role.
The details depend on the tool you use, but try to do it.
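For reference, the whole pipeline body boils down to something like this; the region, account ID and repository name are placeholders:
Shell
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com
docker build -t cool-nodejs-app .
docker tag cool-nodejs-app:latest 111122223333.dkr.ecr.us-east-1.amazonaws.com/cool-nodejs-app:latest
docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/cool-nodejs-app:latest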
Security
● Task IAM Role: Assign an IAM role to each ECS task (do it at the
Task Definition), so it has permissions to interact with other AWS
services. We actually did this in our solution, so the tasks could
access ECR!
● Enable Network Isolation: I told you to use the default VPC for
now. For a real use case you should use a dedicated VPC (they're
free!), and put tasks in private subnets.
Reliability
Performance Efficiency
● Use a queue for writes: Before, neither your app nor your
database scaled. Now, your app scales really well, but your
database still doesn't scale. A sudden surge of users no longer
brings down your app layer, but the consequent surge of write
requests can bring down the database. To protect from this, add all
writes to a queue and have another service consume from the
queue at a max rate. There's a chapter on this coming up.
● Use Caching: If you're accessing the same data many times, you
can probably cache it. This will also protect your database from
bursts of reads.
Cost Optimization
CI/CD Pipeline with AWS Code*
Use case: CI/CD Pipeline in AWS
Scenario
Note: We're building this 100% in AWS, even using CodeCommit to store
our git repos. You're probably more familiar with GitHub, GitLab or
Bitbucket, but I wanted to show you how AWS does it.
Services
YAML
version: 0.2
phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws --version
      # aws ecr get-login is gone in AWS CLI v2; get-login-password replaces it
      - AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
      - REPOSITORY_URI=$(aws ecr describe-repositories --repository-names $ECR_REPOSITORY --query 'repositories[0].repositoryUri' --output text)
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker buildx build --platform=linux/amd64 -t $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION .
  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker image...
      - docker push $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
      - docker tag $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION $REPOSITORY_URI:latest
      - docker push $REPOSITORY_URI:latest
      - echo Writing image definitions file...
      - printf '[{"name":"%s","imageUri":"%s"}]' $CONTAINER_NAME $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION > imagedefinitions.json
artifacts:
  files: imagedefinitions.json
  discard-paths: yes
JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:InitiateLayerUpload",
        "ecr:PutImage",
        "ecr:UploadLayerPart"
      ],
      "Resource": "arn:aws:ecr:us-east-1:your-account-id:repository/your-ecr-registry"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:your-account-id:*"
    }
  ]
}
Discussion
So, if I can use GitHub Actions, why are you even writing about this?
Well, GitHub Actions does make CI/CD as code much easier. We could
have defined this whole pipeline as code with a CloudFormation template
of around 250 lines, but that's not entirely trivial. Here's a post on how
the whole thing works. By the way, if you want to use GitHub Actions
because you want to be cloud agnostic, let me tell you: you've got much
bigger problems to deal with than a CI/CD pipeline. And if your goal is
avoiding vendor lock-in, trading lock-in with AWS (which you already
have) for lock-in with GitHub doesn't solve that.
Operational Excellence
Security
Performance Efficiency
Cost Optimization
Scenario
You're working on a new app, and you've decided to break it down into
multiple services (could be microservices or regular services). This way,
each service can scale independently, and you can deploy each one
separately. However, you're worried that deploying, connecting and
scaling each service separately will be a lot of work.
Services
Solution
7. Finally, we clean up
Don't forget this step!
Discussion
1. AWS CLI: It's the Command Line Interface tool for AWS.
We're not using it directly, but we need it for eksctl to
work.
2. Kubectl: It's the Command Line Interface tool for
Kubernetes.
5. We took a sample app from here, downloaded the file with the
curl command, and deployed that app to our cluster with the
kubectl apply command. These YAML files specify what app will
run, and how it will be exposed.
Benefits of Kubernetes
Best Practices
In my experience, the bestest best practice is deciding whether you actually
need Kubernetes or not, and not picking it blindly. If you do need it, pay
attention to the best practices that follow. Keep in mind that this is NOT an
exhaustive list.
Operational Excellence
● Use CI/CD: Once you have everything in YAMLs, you could just
apply them from a pipeline on every change.
Security
Reliability
Performance Efficiency
Cost Optimization
● Use Savings Plans: You're paying for the EKS cluster (which is
the Kubernetes control plane) and for the capacity that you're
using (either EC2 or Fargate). Set up Savings Plans for that
capacity.
Step-By-Step Instructions To Deploy A
Node.Js App To Kubernetes on EKS
Use case: Deploying a Node.js app on Kubernetes
on EKS
Scenario
You have a cool app you wrote in Node.js. You have a pretty good handle
on ECS, but the powers that be have decided that you need to use
Kubernetes. You understand the basic building blocks of Kubernetes and
EKS, and have drawn the parallels with ECS. But you're still not sure how
to go from code to app deployed in EKS.
Services
Note: The first steps of installing Docker and dockerizing the app are the
same as the previous chapter on deploying the app to ECS. I'm adding
them in case you didn't read it, but if you followed them already, feel free to
start from the 6th step, right after pushing the Docker image to ECR.
● Create a Dockerfile.
In your app's root directory, create a file named "Dockerfile" (no file
extension). Use the following as a starting point, adjust as needed.
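Here's a minimal sketch for a typical Node.js app. The Node version, entrypoint (index.js) and port 3000 are assumptions, swap in your own:
Unset
FROM node:18-alpine
WORKDIR /usr/src/app
# Install dependencies first, so Docker can cache this layer
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
# The app listens on port 3000 (the same port the Kubernetes manifests use)
EXPOSE 3000
CMD ["node", "index.js"]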
After installing the AWS CLI, make sure to configure it with your
AWS credentials.
Unset
Resources:
  EKSCluster:
    Type: AWS::EKS::Cluster
    Properties:
      Name: cool-nodejs-app-eks-cluster
      RoleArn: !GetAtt EKSClusterRole.Arn
      ResourcesVpcConfig:
        SubnetIds: [{SubnetIDs}]
        EndpointPrivateAccess: true
        EndpointPublicAccess: true
      Version: '1.22'
  EKSClusterRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: eks.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
        - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
  EKSFargateProfile:
    Type: AWS::EKS::FargateProfile
    Properties:
      ClusterName: !Ref EKSCluster
      FargateProfileName: cool-nodejs-app-fargate-profile
      PodExecutionRoleArn: !GetAtt FargatePodExecutionRole.Arn
      Subnets: [{SubnetIDs}]
      Selectors:
        - Namespace: {Namespace}
  FargatePodExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: 'eks-fargate-pods.amazonaws.com'
            Action: 'sts:AssumeRole'
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSFargatePodExecutionRolePolicy
Unset
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cool-nodejs-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cool-nodejs-app
  template:
    metadata:
      labels:
        app: cool-nodejs-app
    spec:
      containers:
        - name: cool-nodejs-app
          image: {AWSAccountId}.dkr.ecr.{AWSRegion}.amazonaws.com/cool-nodejs-app:latest
          ports:
            - containerPort: 3000
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 250m
              memory: 256Mi
      serviceAccountName: fargate-pod-execution-role
---
apiVersion: v1
kind: Service
metadata:
  name: cool-nodejs-app
spec:
  selector:
    app: cool-nodejs-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
  type: LoadBalancer
Unset
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: cool-nodejs-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cool-nodejs-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
● Test the new app.
To test the app, find the LoadBalancer's external IP or hostname
by looking at the "EXTERNAL-IP" field in the output of the
kubectl get services command. Paste the IP or hostname
in your browser or use curl or Postman. If everything is set up
correctly, you should see the same output as when running the
app locally. Congrats!
Solution explanation
● Create a Dockerfile.
We're using the same Dockerfile as in the previous chapter. In case you didn't
read it, a Dockerfile is a script that tells Docker how to build a
Docker image. We specified the base image, copied our app's
files, installed dependencies, exposed the app's port, and defined
the command to start the app. Writing this when starting
development is pretty easy, but doing it for an existing app is
harder.
You've probably noticed that we're not creating Pods. A Pod is one
instance of our app executing, so that's what we're really after.
However, just like with ECS where we don't create Tasks directly,
we don't create Pods directly in Kubernetes: the Deployment
creates and manages them for us.
● Set up auto-scaling for the Kubernetes Deployment.
We're changing the deployment so it includes the necessary logic
to auto-scale. That is, to create and destroy Pods as needed to
match the specified metric (in this case CPU usage).
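If you want to sanity-check everything, this is roughly what the commands look like, assuming you saved the manifests as deployment.yaml and hpa.yaml:
Unset
kubectl apply -f deployment.yaml   # creates the Deployment and the Service
kubectl apply -f hpa.yaml          # creates the HorizontalPodAutoscaler
kubectl get pods                   # you should see 2 Pods running
kubectl get hpa                    # shows current vs. target CPU utilization
kubectl get services               # EXTERNAL-IP is the load balancer's address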
Discussion
Text in italics is (my mental representation of) you (the reader) talking or
asking a question, regular text is me talking. That way, it looks more like a
discussion. By the way, if you want to have a real discussion, or want to ask
any questions, feel free to contact me on LinkedIn!
● So, this is me, the reader, asking for clarification about the format?
Exactly!
● That's it?
Yeah, that's it! We did it in ECS in the previous chapter, and now we did it in EKS!
● So, if it wasn't that hard, why have you been warning us
about the complexities of Kubernetes ever since this book started?
About Kubernetes, we've barely scratched the snow that covers the
tip of the iceberg (I hope that's accurate, I've never seen an iceberg).
● Why make us configure eksctl if you weren't going to use it?
Great catch! With eksctl you can create a cluster simply by running
eksctl create cluster --name my-cluster --region
region-code --version 1.25 --vpc-private-subnets
subnet-ExampleID1,subnet-ExampleID2
--without-nodegroup. There are a few more things to create
though, and I figured it would be easier if you just used a
CloudFormation template. Here's a guide.
Best Practices
Operational Excellence
Security
● Use IAM roles: Set up IAM roles for your Fargate profile or for the
EC2 instances, so your resources only have the permissions they
need.
Reliability
Cost Optimization
Scenario
As a software company with millions of users, you need to store and query
user profile data in a scalable and reliable way. You could use a relational
database like MySQL or PostgreSQL, but handling that volume is
expensive, and at some point you'll have scalability problems.
Services
● Scalability: You can set Read Capacity Units and Write Capacity
Units separately. You can also set them to auto-scale, or just pay
per request.
● Availability: A DynamoDB table is highly available within a region.
You can also set it to replicate to other regions, with a Global
Table.
Solution
Let's go over how to set up a DynamoDB table for user profiles, and how to
create, query, update and delete user profiles.
● After that you need to set the primary key for the table. The
PK is a unique identifier for each item in the table, and it is used to
retrieve data from the table. You can choose either a single
attribute (such as the user's email address), or you can use a
composite PK consisting of two attributes (such as the user's
email and a timestamp), where the first one is called partition key
and the second one sort key. It's important to choose a primary
key that will be unique for each user and that will be used to query
the data.
● You may add secondary indexes to your table, which allow you
to query the data in the table using attributes other than the
primary key. You can also do this later.
● Once you have designed the table schema, just create the table
in DynamoDB using the console, the SDK or infrastructure as
code.
● To store user profiles, use the PutItem API. You could also use the
BatchWriteItem API to insert multiple profiles at once.
● To query the profile of a specific user, use the Query API with the
userId as the partition key. You can also use the sort key to further
narrow down the results.
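To make that concrete, here's a hedged sketch of PutItem and Query with the SDK's DocumentClient. The table name and attributes are assumptions, not a prescription:
JavaScript
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, PutCommand, QueryCommand } = require('@aws-sdk/lib-dynamodb');

const db = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function example() {
  // Store a user profile (PutItem)
  await db.send(new PutCommand({
    TableName: 'UserProfiles',
    Item: { userId: 'u#123', createdAt: '2023-04-25T10:00:00', name: 'Ada' }
  }));

  // Query all items for that user (userId is the partition key)
  const { Items } = await db.send(new QueryCommand({
    TableName: 'UserProfiles',
    KeyConditionExpression: 'userId = :u',
    ExpressionAttributeValues: { ':u': 'u#123' }
  }));
  console.log(Items);
}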
2. Use Query, not Scan: Scan reads the entire table, Query uses an
index. Scan should only be used for non-indexed attributes, or to
read all items. Don't mix them up.
3. Don't read the whole item: Read Capacity Units used are based
on the amount of data. Use projection expressions to define which
attributes will be retrieved, and only get the data you need.
4. Always filter and sort based on the sort key: You can filter and
sort based on any attribute. If you do so based on an attribute
that's a sort key, DynamoDB uses the index and you only pay for
the items read. If you use an attribute that's not a key,
DynamoDB reads every item in the partition (or the whole table,
for a Scan) and charges you for every item read, even the ones
your filter discards.
8. Use caching: DynamoDB is usually fast enough (if it's not, use
DAX). However, ElastiCache can be cheaper for data that's
updated infrequently.
13. Monitor and optimize: You're not gonna get it right the first
time (because requirements change). Monitor usage with
CloudWatch, and optimize schema and queries as needed.
Remember secondary indexes.
14. Mind the costs: You're charged per data stored and per
capacity units. For Provisioned mode, one read capacity unit
represents one strongly consistent read per second, or two
eventually consistent reads per second, for an item up to 4 KB in
size; and one write capacity unit represents one write per second
for an item up to 1 KB in size. Optimize frequently. The key here is
to understand how the database will be used and tune it
accordingly (set secondary indexes, attribute projections, etc).
This requires good upfront design and ongoing efforts.
Scenario
We're building an e-commerce app, with DynamoDB for the database. The
problem we're tackling in this chapter is how to structure the data in
DynamoDB.
Here's how our e-commerce app works: A customer visits the website,
browses different products, picks some and places an order. They can
apply a discount code (discounts a % of the total) or gift card (discounts a
fixed amount) and pay for the remaining amount by credit card.
Services
We're going to start from scratch and progressively design the solution.
● 0 - Create a table
If you're working directly on DynamoDB, just create a table with a
PK and SK.
If you're using NoSQL Workbench, use model ECommerce-0.json
● 3 - Get all order details for a given orderId and 4 - Get all
products for a given orderId
Here we're introducing two new entities: Order and OrderItem. The
Order makes sense as a separate entity, so it gets an Order ID
starting with "o#". The OrderItem has no reason to exist without an
order, and won't ever be queried separately, so we don't need an
OrderItemID.
Another big difference is that our Partition Key and Sort Key won't
be the same value. An OrderItem is just the intersection of Order
and Product, so when we're querying all Products for a given
Order, we're going to use PK=orderId and SK=productId, and the
attributes for that combination are going to be the quantity and
price of the OrderItem.
Use model ECommerce-3-4.json
Example query for 3 - Get all order details for a given orderId:
PK="o#12345"
Example query for 4 - Get all products for a given orderId:
PK="o#12345" and SK begins_with "p#" (there's a code sketch of
these queries after this list)
● 5 - Get invoice for a given orderId
We're adding the Invoice entity, which has an Invoice ID starting
with "i#". There's no reason to make the Invoice ID a Partition Key,
the only access pattern we have for Invoices so far is getting the
Invoice for a given Order.
Use model ECommerce-5.json
Example query: PK="o#12345" and SK begins_with "i#"
● 6 - Get all orders for a given productId for a given date range
As you probably know, querying a DynamoDB table on an attribute
that's neither a PK nor an SK is extremely slow and expensive,
since DynamoDB scans every single item in the table. To solve
queries of the type "for a given range", we need to make that
attribute into a Sort Key.
If all we're changing is the SK, then we can use a Local Secondary
Index (LSI). In this particular case, since we don't have a way to
query by Product ID where we can get the Orders of that Product
(i.e. there's no item where productId is the PK and the Order data
is in the Attributes), we're going to need to create a Global
Secondary Index (GSI).
Could we do this directly on our main table? Yes, we could, but
we'd be duplicating the data. Instead, we use a GSI, which
projects the existing data.
Use model ECommerce-6.json
Example query (on GSI1): PK="p#99887" and SK between
"2023-04-25T00:00:00" and "2023-04-25T23:59:00"
Final Solution
Access Patterns
Discussion
Let's do this section Q&A style, where my imaginary version of you asks
questions in italics and I answer them. If the real version of you has any
questions, feel free to contact me on LinkedIn!
Best Practices
Operational Excellence
Security
Performance Efficiency
● Use Query, not Scan: Scan reads the entire table, Query uses an
index. Scan should only be used for non-indexed attributes, or to
read all items. Don't mix them up.
● Use caching: DynamoDB is usually fast enough (if it's not, use
DAX). However, ElastiCache can be cheaper for data that's
updated infrequently.
Cost Optimization
● Don't read the whole item: Read Capacity Units used are based
on the amount of data read. Use projection expressions to define
which attributes will be retrieved, so you only read the data you
need.
● Always filter and sort based on the sort key: You can filter and
sort based on any attribute. If you do so based on an attribute
that's a sort key, DynamoDB uses the index and you only pay for
the items read. If you use an attribute that's not a key,
DynamoDB reads every item in the partition (or the whole table,
for a Scan) and charges you for every item read, even the ones
your filter discards.
● Don't overdo it with secondary indexes: Every time you write to
a table, DynamoDB uses additional Write Capacity Units to update
that table's indexes, which comes at an additional cost. Create the
indexes that you need, but not more.
● Use Reserved Capacity: You can reserve capacity units, just like
you'd reserve instances in RDS.
● Set a TTL: Some data needs to be stored forever, but some data
can be deleted after some time. You can automate this by setting
a TTL on each item.
Scenario
JavaScript
// Assumptions for this sketch: the queue URL lives in an environment
// variable, and order is the object received from the frontend
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

async function sendOrder(order) {
  const params = {
    QueueUrl: process.env.ORDERS_QUEUE_URL,
    MessageBody: JSON.stringify(order)
  };
  try {
    const result = await sqs.sendMessage(params).promise();
    console.log('Order sent to SQS:', result.MessageId);
  } catch (error) {
    console.error('Error sending order to SQS:', error);
  }
}
Also, add this policy to the IAM Role of the function, so it can access SQS.
Don't forget to delete the permissions to access DynamoDB!
Unset
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue"
    }
  ]
}
● Set up SES to notify the customer
1. Open the SES console
2. Click on "Domains" in the left navigation pane
3. Click "Verify a new domain"
4. Follow the on-screen instructions to add the required
DNS records for your domain.
5. Alternatively, click on "Email Addresses" and then click
the "Verify a new email address" button. Enter the email
address you want to verify and click "Verify This Email
Address". Check your inbox and click the link.
JavaScript
// Excerpt of the consumer: save the order to DynamoDB (params elided)
try {
  await dynamoDB.put(params).promise();
  console.log(`Order saved: ${order.orderId}`);
} catch (error) {
  console.error(`Error saving order: ${order.orderId}`, error);
}

// Excerpt: notify the customer via SES (emailParams elided)
try {
  await ses.sendEmail(emailParams).promise();
  console.log(`Email sent: ${order.orderId}`);
} catch (error) {
  console.error(`Error sending email for order: ${order.orderId}`, error);
}
Also, add this policy to the IAM Role of the function, so it can be triggered
by SQS and access DynamoDB and SES:
Unset
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:YOUR_REGION:YOUR_ACCOUNT_ID:table/Orders"
    },
    {
      "Effect": "Allow",
      "Action": "ses:SendEmail",
      "Resource": "*"
    }
  ]
}
Solution explanation
Before, our Orders service would return the result of the order. I put 200 OK
there, but the result could very well be an error. From the user's
perspective, they wait until the order is processed, and they see the result
on the website. From the system's perspective, we're constrained to either
succeed or fail processing the order in the execution limit of the Lambda
function. Actually, we're limited by what the user is expecting: we can't just
show a "loading" icon for 15 minutes.
After the change, the website just shows something like "We're processing
your order, we'll email you when it's ready". That sets a different
expectation! That's important for the system, because now we could
actually have our Lambda function take 15 minutes. It's not just that,
though: with the change as is, if the Order Processing Lambda crashes
mid-execution, the SQS queue will make the order available again as a
message after the visibility timeout expires, and the Lambda service will
invoke our function again with the same order. When the maxReceiveCount
limit is reached, the order can be sent to another queue called a Dead
Letter Queue (DLQ), where we can store failed orders for future reference.
We didn't set up a DLQ here, but it's easy enough, and for small and
medium-sized systems you can set up SNS to send you an email and
resolve the issue manually; the volume shouldn't be that big.
Once the order went through all the steps, failed some, retried, succeeded,
etc, then we notify the user that their order is "ready". This can look
different for different systems, some are just a "we got the money", some
ship physical products, some onboard the user to a complex SaaS. I chose
to do it through email because it's easy and common enough, but you could
use a webhook for example, while keeping the process async.
Best Practices
Operational Excellence
● Monitor and set alarms: You know how to monitor Lambdas. You
can monitor SQS queues as well! An interesting alarm to set here
would be on the number of orders in the queue (the
ApproximateNumberOfMessagesVisible metric), so our customers
don't wait too long for their orders to be processed.
Security
Reliability
● Set visibility timeout: This is the time that SQS waits without
receiving a "success" response, before assuming the message
wasn't processed and making it available again for the next
consumer. Set a reasonable value, and set the same value as a
timeout for your consumer (Order Processing lambda in this case).
Performance Efficiency
Cost Optimization
Scenario
There are more attributes in all entities, but let's ignore them.
Services
The trick here is that we need to read the value of stock and update it
atomically. Atomicity is a property of a set of operations, where that set of
operations can't be divided: it's either applied in full, or not at all. If we just
ran the GetItem and PutItem actions separately, we could have a case
where two customers are buying the last item in stock for that product, our
scalable backend processes both requests simultaneously, and the events
go down like this:
1. Customer123 clicks Buy
2. Customer456 clicks Buy
3. Instance1 receives request from Customer123
4. Instance1 executes GetItem for Product111, receives a stock value
of 1, continues with the purchase
5. Instance2 receives request from Customer456
6. Instance2 executes GetItem for Product111, receives a stock value
of 1, continues with the purchase
7. Instance1 executes PutItem for Product111, sets stock to 0
8. Instance2 executes PutItem for Product111, sets stock to 0
9. Instance1 executes PutItem for Order0046
10. Instance1 receives a success, returns a success to the
frontend.
11. Instance2 executes PutItem for Order0047
12. Instance2 receives a success, returns a success to the
frontend.
The process without transactions
The data doesn't look corrupted, right? Stock for Product111 is 0 (it could
end up being -1, depends on how you write the code), both orders are
created, you received the money for both orders (out of scope for this
issue), and both customers are happily awaiting their product. You go to the
warehouse to dispatch both products, and find that you only have one in
stock. Where did things go wrong?
The problem is that steps 4 and 7 were executed separately, and Instance2
got to read the stock of Product111 (step 6) in between them, and made the
decision to continue with the purchase based on a value that hadn't been
updated yet, but should have. Steps 4 and 7 need to happen atomically, in
a transaction.
First, install the packages from the AWS SDK V3 for JavaScript:
Unset
npm install @aws-sdk/client-dynamodb @aws-sdk/lib-dynamodb
This is the code in Node.js to run the steps as a transaction (you should
add this to the code imaginary you already has for the service):
JavaScript
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, TransactWriteCommand } = require('@aws-sdk/lib-dynamodb');

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// productId, customerId and newOrderId come from the code you already have.
// Note: a transaction can't include two operations on the same item, so the
// stock check rides on the Update as a ConditionExpression
const transactItems = {
  TransactItems: [
    {
      Update: {
        TableName: 'SimpleAwsEcommerce',
        Key: { id: productId },
        // Decrement the stock, but only if there's stock left;
        // otherwise the whole transaction fails
        ConditionExpression: 'stock > :zero',
        UpdateExpression: 'SET stock = stock - :one',
        ExpressionAttributeValues: {
          ':zero': 0,
          ':one': 1
        }
      }
    },
    {
      Put: {
        TableName: 'SimpleAwsEcommerce',
        Item: {
          id: newOrderId,
          customerId: customerId,
          productId: productId
        }
      }
    }
  ]
};

async function executeTransaction() {
  try {
    await client.send(new TransactWriteCommand(transactItems));
    console.log('Transaction succeeded');
  } catch (error) {
    // A ConditionalCheckFailed error here means there was no stock left
    console.error('Transaction failed:', error);
  }
}

executeTransaction();
Solution explanation
Discussion
The whole point of this chapter (a point I've been trying to make for the
past couple of chapters) is that SQL databases shouldn't be your default. I'll make
one concession though: If all your dev team knows is SQL databases, just
go with that unless you have a really strong reason not to.
So far I've shown you that DynamoDB can handle an e-commerce store
just fine, including ACID-compliant transactions. This one's gonna blow
your mind: You can actually query DynamoDB using SQL! Or more
specifically, a SQL-compatible language called PartiQL. Amazon developed
PartiQL as an internal tool, and it was made generally available by AWS. It
can be used on SQL databases, semi-structured data, or NoSQL
databases, so long as the engine supports it.
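For a taste of what that looks like, here's a hedged sketch of a PartiQL query against the table from this chapter (the statement and table name are illustrative):
JavaScript
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ExecuteStatementCommand } = require('@aws-sdk/lib-dynamodb');

const db = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function partiqlExample() {
  // Looks like SQL, runs on DynamoDB. Note that it still needs to
  // target the partition key to avoid a full scan
  const { Items } = await db.send(new ExecuteStatementCommand({
    Statement: `SELECT * FROM "SimpleAwsEcommerce" WHERE id = 'o#12345'`
  }));
  console.log(Items);
}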
With PartiQL you could theoretically change your Postgres database for a
DynamoDB database without rewriting any queries. In reality, you need to
consider all of these points:
I'm not saying there isn't a good reason to change, but I'm going to assume
it's not worth the effort, and you'll have to prove me otherwise. Remember
that replicating the data somewhere else for a different access pattern is a
perfectly valid strategy (in fact, that's exactly how DynamoDB GSIs work).
Best Practices
Operational Excellence
Security
Reliability
Cost Optimization
Scenario
You're building a web application, and you're not sure whether you'll get 1
or 1000 users in your first week. Traffic is not going to be consistent,
because your app depends on trends you can't predict. You think
serverless is a good choice (and you're right!), you know the details about
Lambda, S3, API Gateway and DynamoDB, but you're not sure how
everything fits together in this case.
Services
● Lambda: Our serverless compute layer. You put your code there,
and it automagically runs a new instance for every request, scaling
really fast and really well.
● S3: Very cheap and durable storage. We'll use it to store our
frontend code in this case.
4. Create your DynamoDB table: You designed the data part. Now
create the table, and configure the details such as capacity.
In our example, we'll leave it as On Demand. It's basically the full
serverless mode, it costs over 5x more per request but scales
instantly. We're picking on demand because we don't know our
traffic patterns and because it's simpler, we can always optimize
later, with actual data.
6. Create your API Gateway API: First, create an HTTP API. Then
create the routes. Then create an integration with your Lambda
function. Finally, attach that integration to your routes.
In our example, you should create the routes GET /items/{id}, GET
/items, PUT /items and DELETE /items/{id}.
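If you prefer the CLI over the console, HTTP APIs have a quick-create mode that sets up the API, a $default route and a Lambda integration in one command (the function ARN is a placeholder, and you'd still create the individual routes afterwards):
Unset
aws apigatewayv2 create-api \
  --name items-api \
  --protocol-type HTTP \
  --target arn:aws:lambda:us-east-1:123456789012:function:items-function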
Discussion
● Cost: It's more expensive per request, period. The final bill can
come out cheaper because you have 0 unused capacity (unused
capacity is what you waste when your EC2 instance is using 5% of
the CPU and you're paying for 100%). Unused capacity tends to
decrease a lot as apps grow, because we understand traffic
patterns better and because traffic variations are not as
proportionally big (it's easier to go from 10,000 to 11,000 than from
0 to 1,000, even though the increase is 1,000 in both cases).
Operational Excellence
Security
Reliability
Performance Efficiency
Cost Optimization
● Rightsize your Lambdas (again): Finding the right size can save
you a significant amount of money, on top of saving you a
significant amount of headaches from Lambdas not performing
well.
● Monitor usage: Use CloudWatch to monitor and analyze usage
patterns. Use Cost Explorer to monitor costs and figure out where
your optimization efforts can have the most impact.
How it works
● You create a function and write the code that goes in it.
● You only pay for the time the code was actually running
Obviously, that code runs somewhere. The point is that you don't manage
or care where ('cause it's serverless, you see). Every time a request comes
in, Lambda will either use an available execution environment or start a
new one. That means Lambda scales up and down automatically and
nearly instantly.
Fine details
Best Practices
The most important tip is that you don't need to do everything in this
list, and you don't need to do everything right now. But take your time
to read it, I bet there's at least one thing in there that you should be doing
but aren't.
● Lambdas don't run in a VPC, unless you configure them for that.
You need to do that if you want to access VPC resources, such as
an RDS or Aurora database. The next chapter is about this topic.
● Use environment variables.
● If you need secrets, put them in Secrets Manager and put the
secret name in an environment variable.
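Here's a hedged sketch of that pattern, where SECRET_NAME is a hypothetical environment variable holding the secret's name:
JavaScript
const { SecretsManagerClient, GetSecretValueCommand } = require('@aws-sdk/client-secrets-manager');

const client = new SecretsManagerClient({});
let cachedSecret; // cache the secret across invocations of the same execution environment

async function getSecret() {
  if (!cachedSecret) {
    const { SecretString } = await client.send(
      new GetSecretValueCommand({ SecretId: process.env.SECRET_NAME })
    );
    cachedSecret = JSON.parse(SecretString);
  }
  return cachedSecret;
}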
Scenario
Services
Note: If you're actually facing this scenario, these steps will cause
downtime. I wrote them in this order to make it easier to understand the
final solution, but if you need to fix this specific problem, let me know and I'll
help you.
Architecture diagram of a Lambda function in a VPC, with Secrets Manager and RDS
How to build the solution
4. Now that we've got everything inside the same VPC, we just need
to allow traffic to reach the different components (while
blocking all other traffic).
Security Groups are these really cool firewalls that'll let us do that.
First, create a security group called SecretsManagerSG and
associate it with the VPC Endpoint, another one called LambdaSG
and associate it with the Lambda function, and another one called
DatabaseSG and associate it with the RDS instance.
Next, edit the DatabaseSG to allow inbound traffic to the database
port (5432, 3306, etc) originating from the LambdaSG.
Finally, edit the SecretsManagerSG to allow inbound traffic on port
443 (the Secrets Manager VPC endpoint only receives HTTPS
traffic) originating from the LambdaSG.
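If you're doing this from the CLI, the DatabaseSG rule looks something like this (the security group IDs are placeholders):
Unset
# Allow the Lambda SG to reach the database port
aws ec2 authorize-security-group-ingress \
  --group-id sg-0databasesgid0000 \
  --protocol tcp --port 5432 \
  --source-group sg-0lambdasgid000000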
Unset
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "rds-db:connect"
      ],
      "Resource": [
        "{database-arn}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": [
        "arn:aws:secretsmanager:{region}:{account_id}:secret:{secret name}"
      ]
    }
  ]
}
Discussion
I've deviated a bit from the simplest solution that works, and went for the
simplest solution that is properly configured.
Do we really need all of that? Yes, we do. Here's why:
Best Practices
Operational Excellence
● Use VPC Flow Logs: Enable VPC Flow Logs to monitor network
traffic in your VPC and identify potential security issues.
Reliability
Performance Efficiency
Cost Optimization
● Use one NAT Instance per VPC (for dev): Dev environments
don't need high availability, so instead of multiple NAT Gateways,
consider a single one. If the AZ fails your dev env will fail, but
that's fine. And I'll do you one better: don't pay $32.40/month for a
NAT Gateway, instead set up a t4g.micro EC2 instance as a NAT
Instance and make it self-healing (as we'll see in a future chapter).
Monitor and Protect Serverless Endpoints
With API Gateway and WAF
Use case: Monitor and Protect Serverless
Endpoints Easily and Cost-Effectively
Scenario
Services
● API Gateway: It's a fully managed service that lets you publish APIs.
It supports REST APIs, HTTP APIs and WebSocket APIs.
● AWS WAF: It's a web application firewall that helps protect your APIs
from common web exploits. You can create rules that allow, block, or
count web requests based on conditions that you specify, and use
already-implemented rules.
5. Deploy the API to dev: Choose a stage for your API (e.g. "dev")
and click deploy.
6. Test your API: Send a request (e.g. with Postman) to the API
endpoint and verify that it returns the expected response. Try
sending different types of requests (e.g. GET, POST, DELETE)
and see if they are processed correctly by your Lambda function.
You can find the API endpoint in the API Gateway console.
8. Deploy the API to prod: Once again, choose a stage for your API
("prod" in this case) and click deploy.
1. Metrics are automatically set up, you can view them by following
this guide.
4. In the "Log Level" dropdown menu, pick the log level that you want
to use (e.g. "ERROR", "INFO").
5. In the "Log Group" field, specify the name of the CloudWatch Logs
log group that you want to use. If the log group doesn't exist, it will
be created automatically.
6. Click "Save Changes".
1. Create a WAF web ACL: You can do this using the WAF
console, the AWS CLI, or a CloudFormation template. Choose a
name for your web ACL and specify the IP address ranges
that you want to allow or block (if any).
2. Associate the web ACL with your API Gateway stage: This is
what makes WAF inspect the requests that reach your API.
3. Test your web ACL: You can do this using any tool that can send
HTTP requests to your API Gateway API, such as Postman or
cURL. Try sending different types of requests (e.g. GET, POST,
DELETE) and see if they are allowed or blocked as expected.
4. Configure your WAF rules: Choose a name and a type for your
rule (e.g. SQL injection), and specify the conditions that you want
to match (e.g. specific patterns in the request headers or body).
You can also specify the actions that you want WAF to take when
a request matches the rule (e.g. allow, block, count).
5. Update your web ACL: Add the WAF rules that you created in
step 4 to your web ACL.
6. Test your WAF rules: You can do this using the same tool that you
used in step 3. Try sending requests that should match your rules
(e.g. requests with malicious payloads), and see if they are
allowed or blocked as expected.
7. Monitor and troubleshoot your web ACL: You can do this using
Amazon CloudWatch. WAF publishes metrics for requests that are
allowed or blocked, and you can use this
information to detect and fix issues. You can also set up
CloudWatch alarms to be notified when there are unusual patterns
of requests or when there are errors in your backend service.
Best Practices
Note: Some of these are features, which you may not need for your
particular use case (e.g. if your API is public you don't need authentication).
Just pick the ones you need.
10. Set up API keys: You can require that requests to your API
include an API key. This is different from authentication because
API keys don't expire (unless you set them to) and aren't linked to
a user. Basically API keys are designed so that other developers
use your API. You can also set usage limits per key.
11. Use API Gateway for private APIs: You can put API Gateway
in front of your private APIs as well. What for? Well, things like
CORS are useless for private APIs, but you can certainly make
use of monitoring, logs, request validation or caching. And there's
even a use case for authentication, if you're implementing a zero
trust security model.
13. Trace requests: You can set up AWS X-Ray for API Gateway.
14. Mock responses: If you just need a mock response, you can
have API Gateway generate it directly. You can change that to a
proper response later.
16. You can also use OpenAPI: Here's how to import an OpenAPI
spec into API Gateway.
Without X-Ray
● You add logs to your Lambda functions. Then you add more logs
just in case.
● When you have an issue, you open 10 tabs of CloudWatch Logs
and try to figure out which log entries are related, based on
timestamps and intuition.
● You finally figure it out (days later), and get a detective badge for
the investigative work.
With X-Ray
Best Practices
How to set up AWS X-Ray for a Node.js app
JavaScript
const AWSXRay = require('aws-xray-sdk-core');

// This wraps the AWS SDK with X-Ray. Now you can use
// the AWS object to access the SDK for any other service
// like you usually do, but X-Ray will be monitoring it
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
That's the gist of it. Now X-Ray will collect data from your function.
Remember to also set it up for DynamoDB tables and SNS topics by going
to Advanced settings and enabling AWS X-Ray Tracing.
Additional tips
● Enable sampling. You don't really need to trace all the requests,
10% is usually enough, and much cheaper.
Serverless, event-driven pipeline with
Lambda and S3
Use case: Serverless, event-driven image
compressing pipeline with AWS Lambda and S3
Scenario
Your app allows users to upload images directly to S3, and then displays
them publicly (think social network). The problem? Modern phones take
really good pictures, but the file size is way larger than what you need, and
you're predicting very high storage costs. You figured out that you can write
an algorithm to resize the images to a more acceptable size without
noticeable quality loss!
You don't want to change the app so that users upload their images to an
EC2 instance running that algorithm. You know it won't scale fast enough to
handle peaks in traffic, and it would cost more than S3. You want to
implement image resizing in a scalable and cost-efficient way, without
having to maintain any servers.
Services
6. Test it.
Discussion
This approach resizes images eagerly, expecting that the image will be
shown so many times that resizing it will in most cases save you money (or
the improved user experience is worth the cost). If that's not the case for
you, you could resize lazily (i.e. when an image is requested).
If your image processing results can wait a bit, you'd be better off pushing
the S3 event to an SQS queue and consuming the queue from an Auto
Scaling Group of EC2 instances (or ECS). AWS Batch is also a great
option, if the upload rate is not constant. Overall, serverless scales much
faster but serverful is cheaper.
If you need to do more than just resize the images, you've got two options:
For independent actions, you can send the S3 Event to multiple consumers
using SNS; For a complex sequence of actions, you can use Step
Functions (there's a chapter on that topic coming up).
Best Practices
Operational Excellence
Security
Reliability
Performance Efficiency
Scenario
Services
AWS Kinesis and AWS Lambda are the services that we'll be using to
analyze clickstream data. Kinesis allows for the ingestion of streaming
data, while Lambda is used to process the data streams in real-time.
Solution
JavaScript
// Producer side: send a clickstream event to the Kinesis stream
try {
  await kinesis.putRecord(params).promise();
  console.log(`Data sent to stream: ${data}`);
} catch (err) {
  console.log(err);
}

// Consumer Lambda: records is event.Records from the handler
records.forEach((record) => {
  const payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
  const data = JSON.parse(payload);
  console.log(data);
  processData(data);
  storeData('my-clickstream-data-delivery-stream', data);
});
return {};

// Inside storeData: forward the processed data to the Firehose delivery stream
try {
  await firehose.putRecord(params).promise();
  console.log(`Data stored: ${data}`);
} catch (err) {
  console.log(err);
}
Best Practices
Operational Excellence
Security
● Don't let everyone write to your Kinesis stream: You can use
API Gateway to expose access to your Kinesis stream, and
protect it with Cognito, in a very similar way to how you expose
serverless endpoints.
● Use IAM roles for Lambda to access the Kinesis stream: IAM
roles allow you to grant permissions to Lambda to access the
Kinesis stream while enforcing least privilege access controls.
Reliability
Performance Efficiency
Cost Optimization
Scenario
You are building a social network app. Users will be able to upload images
to an S3 bucket, and you need to first analyze them to detect inappropriate
content. If the image is safe (does not contain inappropriate content), it will
be resized and stored in another S3 bucket. If the image is unsafe
(contains inappropriate content), the user that uploaded it will be notified
via email.
Services
Solution
10. Test the workflow with sample safe and unsafe images.
Discussion
What we're doing here is called orchestrating services. Every task (analyze
the image, send an email, resize the image) is a service, and they need to
interact in a certain order, with a certain logic. There's actually 3 ways to
achieve this:
● Each service calls the next one: This means you're adding on
every service the responsibility of knowing who goes next in the
workflow. You're coupling one service to the next one (and actually
to the previous one as well, for handling rollbacks), and you're
coupling every service to this specific workflow. More than that,
you're adding an additional responsibility to every service. Our
example is not that complex, but in real, complex workflows this
will slow you down a lot.
● Orchestrated services: Every service has a single responsibility
(e.g. resize the image), and some external (centralized) controller
stores and executes all the coordination logic, calling every service
in the right order and passing around the responses. You are here.
Step Functions is our Orchestrator in this case. Our example is
really simple, but the main advantage of orchestrated services is
that you're centralizing the definition of the workflow and making it
easier to implement really complex stuff.
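To make the orchestrated option concrete, here's a hedged sketch of this workflow in Amazon States Language. The Lambda function names and the $.isSafe field are assumptions about how you'd wire it up:
Unset
{
  "Comment": "Analyze an image, then resize it or notify the user",
  "StartAt": "AnalyzeImage",
  "States": {
    "AnalyzeImage": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:{AWSRegion}:{AWSAccountId}:function:AnalyzeImage",
      "Next": "IsImageSafe"
    },
    "IsImageSafe": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.isSafe", "BooleanEquals": true, "Next": "ResizeImage" }
      ],
      "Default": "NotifyUser"
    },
    "ResizeImage": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:{AWSRegion}:{AWSAccountId}:function:ResizeImage",
      "End": true
    },
    "NotifyUser": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:{AWSRegion}:{AWSAccountId}:function:NotifyUser",
      "End": true
    }
  }
}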
Operational Excellence
● Use CI/CD: Don't manually update the code. That's messy enough
in monoliths, but when working with multiple services that are
called in a complex order, it gets outright impossible to manage.
Use a CI/CD pipeline.
Security
● Restrict who can read and write images: Limit what IAM roles
can read from the uploaded images bucket and write to the
processed images bucket. Hint: These should be your Lambdas'
roles.
Reliability
Performance Efficiency
● Send the S3 object ARN, not the image: We're talking about big
images (that's why we compress them!). That's a huge payload for
Step Functions. Instead of sending the image itself, send the S3
object ARN and let each step read the image from S3.
Cost Optimization
I have to mention RDS here, because it does pretty much the same. The
difference is that RDS uses native MySQL or Postgres (or other engines),
while Aurora basically uses a rewrite that's compatible with MySQL or with
Postgres, but optimized for AWS's infrastructure. Aurora is limited to
MySQL up to 8.0 or PostgreSQL up to 14.3, but it has a lot more features.
Best Practices
● Use Aurora whenever you can
● If you can't, then use RDS
● And if you can't even use RDS (e.g. you need a very specific
database engine):
● First, question whether you really need that (you probably
really need it, but it's worth considering)
● Then, use EC2 and manually install everything
● No need to share or rotate SSH keys, just grant and remove IAM
permissions
● No ports open and no bastion hosts (and if you want, not even a
public IP)
● Monitor and alert on session start
● Log the whole session
● Limit available commands
● Finally, start a session from the EC2 Console or from the CLI (for a
better experience install the CLI plugin)
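Starting a session from the CLI is a single command (the instance ID is a placeholder):
Unset
# Requires the Session Manager plugin for the AWS CLI
aws ssm start-session --target i-0123456789abcdef0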
Using SNS to Decouple Components
Use Case: Using SNS to decouple components
Best Practices
● SNS can be used to send SMS messages to phones. It's not the
cheapest option out there, but it's super simple to set up.
Scenario
You've got a simple environment with a single EC2 instance. Maybe you
don't need to scale right now, or you can't because the instance isn't
stateless, meaning you're saving data in the instance's EBS volumes.
When your instance fails, you want it to fix itself automatically, but you don't
want to pay for a load balancer.
Services
● Elastic IP: Just a static IP address that exists separate from any
EC2 instances (meaning it's not created or destroyed with an
instance). It can be attached to an instance, and moved to another
one at will. It's free while attached to a running instance. PS: It's
not a separate service.
Solution
2. Create an IAM instance profile with an IAM Role that allows the
instance to associate the Elastic IP address with itself:
Unset
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssociateElasticIpAddress",
      "Action": [
        "ec2:AssociateAddress"
      ],
      "Effect": "Allow",
      "Resource": "{IpAddressArn}"
    }
  ]
}
Unset
#!/bin/bash -ex
# ${AWS::Region} and ${EIP.AllocationId} are CloudFormation !Sub placeholders
INSTANCEID=$(curl -s -m 60 http://169.254.169.254/latest/meta-data/instance-id)
aws --region ${AWS::Region} ec2 associate-address --instance-id $INSTANCEID --allocation-id ${EIP.AllocationId}
5. The Auto Scaling Group will detect there are 0 instances and will
create a new instance (to match the minimum of 1 instance), and
when that instance starts it will associate the Elastic IP address to
itself. When the instance fails, the Auto Scaling Group detects
there are 0 healthy instances and repeats the process.
Discussion
Why are we even discussing this? Can't we just put a load balancer
there and be done with it? $22/month is not that expensive!
Indeed, it's not that expensive. And an Application Load Balancer has other
benefits, such as easily handling the SSL certificate or integrating with
WAF. However, for an app to scale horizontally you need to remove the
state from it. Any data that needs to be shared across instances is part of
the state of the application. This includes configs, shared files, databases
and session data. For session data you can use sticky sessions (all
requests for the same session go to the same instance), but the rest needs
to be moved to a separate storage (S3, EFS, DynamoDB, RDS, etc).
● Isn't removing the state from the app the better solution?
Yes, it is! And you should have done that in the first place! Unfortunately,
not everyone does that. And if you didn't get it right from the start, changing
that later is a lot of work. Still worth it, and you should still do it! But if you
need a self-healing environment right now, this is the solution (while you
work on removing the state from your app).
We usually call that environment staging. Dev is usually cheap and dirty.
But you still don't want it to fail, since dev hours spent fixing a dev
environment can add up to a lot of money. This is a good solution for a
self-healing dev environment.
Best Practices
If you're in this situation, the best thing you can do is just remove the state
from your app and make it horizontally scalable. I'll keep the tips focused
on this solution though, because I think it's a pretty creative solution that
can be useful in certain situations.
Operational Excellence
Security
Reliability
Performance Efficiency
● Pick the right EBS volume type: If you're using a single EC2
instance for your prod environment, you're likely relying on EBS a
lot, so this likely matters a lot to you. EBS is a topic for a future
chapter.
Cost Optimization
● Turn off your dev env at night: Set the min and max instance
number to 0 when your team shuts off for the night, and back to 1
when they begin work the next day. There's many ways to do this,
such as a Lambda function triggered by EventBridge. Note that
while the Elastic IP address is not associated with a running EC2
instance, you'll be charged $0.005/hour (that's $3.60/month).
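Here's a hedged sketch of that Lambda function. ASG_NAME is a hypothetical environment variable, and you'd schedule it with two EventBridge rules, one for the evening and one for the morning:
JavaScript
const { AutoScalingClient, UpdateAutoScalingGroupCommand } = require('@aws-sdk/client-auto-scaling');

const client = new AutoScalingClient({});

// Scale the dev environment down to 0 instances; a twin function
// (or a parameter in the event) sets it back to 1 in the morning
exports.handler = async () => {
  await client.send(new UpdateAutoScalingGroupCommand({
    AutoScalingGroupName: process.env.ASG_NAME,
    MinSize: 0,
    MaxSize: 0,
    DesiredCapacity: 0
  }));
};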
EBS: Volume types and automated
backups with DLM
Use Case: Understanding EBS and automating EBS
backups
Elastic Block Store is a block-level storage service for EC2 instances. It's a
virtual SSD or HDD that you attach to EC2 instances for persistent storage.
EBS basics
● They're redundant within their Availability Zone, so data loss is less
likely than with a single disk (99.8%-99.9% durability in a year).
● Their lifecycle is separate from that of the EC2 instance. You can
create them, attach them, detach them and delete them on their
own. You can also set up the EC2 instance to delete them when
it's terminated (which is the default for the root volume, and isn't
for non-root volumes).
● An EBS volume can be attached to a maximum of one instance at
a time (except io1/io2 volumes with Multi-Attach enabled). That
means they're not a shared file system; you can use EFS (Linux)
or FSx (Windows) for that.
● General purpose: gp3. SSD that you use for everything. You can
configure size and IOPS separately (unlike the previous gen, gp2).
● For better performance: io2. SSD for things that require more
performance (e.g. databases). You can configure size and IOPS
separately. With Multi-Attach enabled, it can be attached to
multiple instances at the same time.
● Infrequent access: sc1. Slow but really cheap HDD, ideal for
infrequently accessed data. An alternative is S3 Infrequent
Access, which has more durability and is cheaper for storage, but
is slower to access and you're charged for read operations.
Best Practices
● Use gp3 volumes unless you know you need more performance or
have a specific use case. Not sure? Here's how to benchmark.
Also, some performance tips. And if you need extreme
performance, use instance store.
● Migrate gp2 volumes to gp3, it's easy and you save 20% (there's a
one-line command for it after this list).
● Encrypt your EBS volumes.
● If you have data sets with different requirements, use multiple EBS
volumes.
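That gp2-to-gp3 migration is a single command per volume, with no downtime (the volume ID is a placeholder):
Unset
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3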
Unset
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  KmsKeyArn:
    Type: String
    Description: The ARN of the KMS key to use for encrypting cross-Region snapshot copies
  DestinationRegion:
    Type: String
    Description: The destination region to copy the snapshots to
Resources:
  SnapshotPolicy:
    Type: AWS::DLM::LifecyclePolicy
    Properties:
      Description: EBS snapshot policy with cross-Region copy
      State: ENABLED
      ExecutionRoleArn: !Sub arn:aws:iam::${AWS::AccountId}:role/service-role/AWSDataLifecycleManagerDefaultRole
      PolicyDetails:
        ResourceTypes:
          - VOLUME
        TargetTags:
          - Key: Snapshot
            Value: 'true'
        Parameters:
          ExcludeBootVolume: true
        Schedules:
          - Name: DailySnapshot
            CopyTags: true
            CreateRule:
              Interval: 24
              IntervalUnit: HOURS
            RetainRule:
              Count: 7
            CrossRegionCopyRules:
              - Target: !Ref DestinationRegion
                Encrypted: true
                CmkArn: !Ref KmsKeyArn
AWS Organizations and Control Tower
Use Case: Managing Multiple AWS Accounts
Instead of mixing everything into the same AWS account, use multiple
accounts grouped under an AWS Organization.
● Consolidated billing: You only put your credit card details in the root
account, and all AWS bills from all accounts are billed to the root
account
Each AWS Account should serve one single purpose and hold one
workload (one environment for one application, for example the production
environment for App 1). Accounts are grouped into Organizational Units
(OUs).
Example account structure
Best Practices
This is a great way to set up your Organization: