AWS Made Simple and Fun 2


Introduction............................................................................................... 10
About The Author..................................................................................... 11
Microservices design: Splitting a monolith into microservices...........12
Use case: Microservices design.............................................................12
Scenario............................................................................................ 12
Out of scope (so we don't lose focus)................................................12
Services.............................................................................................13
Solution step by step......................................................................... 14
Solution explanation.......................................................................... 18
Discussion......................................................................................... 22
Best Practices........................................................................................ 24
Operational Excellence......................................................................24
Security..............................................................................................25
Reliability........................................................................................... 25
Performance Efficiency......................................................................26
Cost Optimization.............................................................................. 26
Securing Microservices with AWS Cognito........................................... 28
Use case: Securing Access to Microservices........................................ 28
Scenario............................................................................................ 28
Services.............................................................................................28
Solution step by step......................................................................... 29
Solution explanation.......................................................................... 36
Discussion......................................................................................... 39
Best Practices........................................................................................ 40
Operational Excellence......................................................................40
Security..............................................................................................40
Reliability........................................................................................... 41
Performance Efficiency......................................................................41
Cost Optimization.............................................................................. 42
Securing Access to S3............................................................................. 43
Use case: Securing access to content stored in S3...............................43
Scenario............................................................................................ 43
Services.............................................................................................43
Solution step by step......................................................................... 44
Solution explanation.......................................................................... 54
Discussion......................................................................................... 56
Best Practices........................................................................................ 58
Operational Excellence......................................................................58
Security..............................................................................................58
Reliability........................................................................................... 59
Performance Efficiency......................................................................59
Cost Optimization.............................................................................. 60
7 Must-Do Security Best Practices for your AWS Account.................. 61
Use case: That AWS account where you run your whole company but
you never bothered to improve security................................................. 61
Scenario............................................................................................ 61
Services.............................................................................................61
Solution..............................................................................................62
Create a password policy..............................................................62
Create IAM users.......................................................................... 62
Add MFA to every user................................................................. 64
Enable logging of all account actions............................................65
Set up a distribution list as your email address.............................66
Enable GuardDuty........................................................................ 67
Create a Budget............................................................................69
Discussion......................................................................................... 69
Quick overview of ECS.............................................................................71
Use Case: Deploying Containerized Applications..................................71
AWS Service: Elastic Container Service........................................... 71
Best Practices........................................................................................ 72
Step-by-Step instructions to migrate a Node.js app from EC2 to ECS......... 74
Use case: Transforming an app on EC2 to a scalable app on ECS.......74
Scenario............................................................................................ 74
Services.............................................................................................74
Solution step by step......................................................................... 75
Solution explanation.......................................................................... 84
Discussion......................................................................................... 86
Best Practices........................................................................................ 88
Operational Excellence......................................................................88
Security..............................................................................................89
Reliability........................................................................................... 89
Performance Efficiency......................................................................90
Cost Optimization.............................................................................. 91
CI/CD Pipeline with AWS Code*.............................................................. 93
Use case: CI/CD Pipeline in AWS..........................................................93
Scenario............................................................................................ 93
Services.............................................................................................93
Solution step by step......................................................................... 94
Solution explanation........................................................................ 102
Discussion....................................................................................... 103
Best Practices...................................................................................... 106
Operational Excellence....................................................................106
Security............................................................................................106
Reliability......................................................................................... 107
Performance Efficiency....................................................................107
Cost Optimization............................................................................ 107
Kubernetes on AWS - Basics and Best Practices............................... 108
Use case: Containerized microservices on Kubernetes on AWS.........108
Scenario.......................................................................................... 108
Services...........................................................................................108
Solution............................................................................................109
Discussion........................................................................................111
Basic building blocks of Kubernetes........................................... 113
Benefits of Kubernetes................................................................114
Best Practices.......................................................................................115
Operational Excellence.................................................................... 115
Security............................................................................................ 117
Reliability......................................................................................... 117
Performance Efficiency.................................................................... 118
Cost Optimization............................................................................ 118
Step-By-Step Instructions To Deploy A Node.Js App To Kubernetes on
EKS...........................................................................................................119
Use case: Deploying a Node.js app on Kubernetes on EKS................119
Scenario...........................................................................................119
Services........................................................................................... 119
Solution step by step....................................................................... 120
Solution explanation........................................................................ 128
Discussion....................................................................................... 131
Best Practices...................................................................................... 132
Operational Excellence....................................................................132
Security............................................................................................133
Reliability......................................................................................... 134
Performance Efficiency....................................................................135
Cost Optimization............................................................................ 135
Handling Data at Scale with DynamoDB.............................................. 137
Use case: Storing and Querying User Profile Data at Scale................137
Scenario.......................................................................................... 137
Services...........................................................................................137
Key features of DynamoDB........................................................ 137
Solution............................................................................................138
Best Practices...................................................................................... 140
DynamoDB Database Design................................................................ 144
Use case: DynamoDB Database Design............................................. 144
Scenario.......................................................................................... 144
Services...........................................................................................144
Designing the solution..................................................................... 144
Final Solution.............................................................................. 152
Access Patterns.......................................................................... 153
Discussion....................................................................................... 154
Best Practices...................................................................................... 157
Operational Excellence....................................................................157
Security............................................................................................157
Reliability......................................................................................... 158
Performance Efficiency....................................................................158
Cost Optimization............................................................................ 159
Using SQS to Throttle Database Writes................................................161
Use case: Throttling Database Writes with SQS..................................161
Scenario.......................................................................................... 161
Services...........................................................................................162
Solution step by step....................................................................... 163
Solution explanation........................................................................ 169
Discussion....................................................................................... 172
Best Practices...................................................................................... 173
Operational Excellence....................................................................173
Security............................................................................................174
Reliability......................................................................................... 174
Performance Efficiency....................................................................175
Cost Optimization............................................................................ 175
Transactions in DynamoDB................................................................... 175
Use case: Transactions in DynamoDB.................................................176
Scenario.......................................................................................... 176
Services...........................................................................................176
Solution............................................................................................177
Solution explanation........................................................................ 183
Discussion....................................................................................... 184
Best Practices...................................................................................... 186
Operational Excellence....................................................................186
Security............................................................................................186
Reliability......................................................................................... 186
Performance Efficiency....................................................................187
Cost Optimization............................................................................ 187
Serverless web app in AWS with Lambda and DynamoDB................ 188
Use case: Serverless web app (Lambda + DynamoDB)......................188
Scenario.......................................................................................... 188
Services...........................................................................................188
Solution............................................................................................189
Discussion....................................................................................... 190
Best Practices...................................................................................... 192
Operational Excellence....................................................................192
Security............................................................................................192
Reliability......................................................................................... 193
Performance Efficiency....................................................................194
Cost Optimization............................................................................ 195
20 Advanced Tips for Lambda...............................................................198
Use Case: Efficient Serverless Compute............................................. 198
AWS Service: AWS Lambda........................................................... 198
How it works................................................................................198
Fine details..................................................................................199
Best Practices...................................................................................... 199
Secure access to RDS and Secrets Manager from a Lambda function....... 202
Use case: Secure access to RDS and Secrets Manager from a Lambda
function.................................................................................................202
Scenario.......................................................................................... 202
Services...........................................................................................202
Solution............................................................................................203
What the solution looks like........................................................ 203
How to build the solution.............................................................204
Discussion....................................................................................... 206
Best Practices...................................................................................... 209
Operational Excellence....................................................................209
Security............................................................................................210
Reliability......................................................................................... 211
Performance Efficiency....................................................................212
Cost Optimization............................................................................ 213
Monitor and Protect Serverless Endpoints With API Gateway and WAF...... 214
Use case: Monitor and Protect Serverless Endpoints Easily and
Cost-Effectively.................................................................................... 214
Scenario.......................................................................................... 214
Services...........................................................................................214
Solution............................................................................................215
Best Practices...................................................................................... 218
Using X-Ray for Observability in Event-Driven Architectures........... 222
Use case: Observability in Event-Driven Architectures Using AWS X-Ray... 222
AWS Service: AWS X-Ray.............................................................. 222
Without X-Ray.............................................................................222
With X-Ray..................................................................................222
Best Practices...................................................................................... 223
How to set up AWS X-Ray for a Node.js app..............................223
Additional tips..............................................................................225
Serverless, event-driven pipeline with Lambda and S3...................... 226
Use case: Serverless, event-driven image compressing pipeline with
AWS Lambda and S3...........................................................................226
Scenario.......................................................................................... 226
Services...........................................................................................226
Solution............................................................................................227
Discussion....................................................................................... 227
Best Practices...................................................................................... 228
Operational Excellence....................................................................228
Security............................................................................................228
Reliability......................................................................................... 229
Performance Efficiency....................................................................230
Cost Optimization............................................................................ 231
Real-time data processing pipeline with Kinesis and Lambda.......... 232
Use case: Building a real-time data processing pipeline with Kinesis and
Lambda................................................................................................ 232
Scenario.......................................................................................... 232
Services...........................................................................................232
Solution............................................................................................233
How to send data to a Kinesis Data Stream in JavaScript..........234
How to process the data and store it to S3 with a Lambda function
in JavaScript............................................................................... 235
Best Practices...................................................................................... 236
Operational Excellence....................................................................236
Security............................................................................................237
Reliability......................................................................................... 237
Performance Efficiency....................................................................238
Cost Optimization............................................................................ 238
Complex, multi-step workflow with AWS Step Functions.................. 239
Use case: Complex, multi-step image processing workflow with AWS
Step Functions..................................................................................... 239
Scenario.......................................................................................... 239
Services...........................................................................................239
Solution............................................................................................240
Discussion....................................................................................... 241
Best Practices...................................................................................... 243
Operational Excellence....................................................................243
Security............................................................................................243
Reliability......................................................................................... 244
Performance Efficiency....................................................................245
Cost Optimization............................................................................ 246
Using Aurora for your MySQL or Postgres database......................... 247
Use Case: Managed Relational Database........................................... 247
AWS Service: Amazon Aurora.........................................................247
Best Practices...................................................................................... 247
Session Manager: An easier and safer way to SSH into your EC2
instances................................................................................................. 250
Use Case: Connecting to an instance using SSH................................250
AWS Service: Session Manager......................................................250
Benefits of using Session Manager............................................ 250
Best Practices...................................................................................... 251
Using SNS to Decouple Components................................................... 252
Use Case: Using SNS to decouple components..................................252
AWS Service: Amazon SNS............................................................ 252
What you can do with SNS......................................................... 253
Best Practices...................................................................................... 253
Self-healing, Single-instance Environment with AWS EC2................ 255
Use case: Self-healing environment that doesn't need to scale...........255
Scenario.......................................................................................... 255
Services...........................................................................................255
Solution............................................................................................256
Discussion....................................................................................... 257
Best Practices...................................................................................... 258
Operational Excellence....................................................................259
Security............................................................................................259
Reliability......................................................................................... 260
Performance Efficiency....................................................................260
Cost Optimization............................................................................ 260
EBS: Volume types and automated backups with DLM...................... 261
Use Case: Understanding EBS and automating EBS backups........... 261
AWS Service: EBS and DLM...........................................................261
EBS basics..................................................................................261
EBS Volume types...................................................................... 262
Best Practices...................................................................................... 263
Automating Snapshots with Data Lifecycle Manager...................... 264
AWS Organizations and Control Tower................................................ 267
Use Case: Managing Multiple AWS Accounts..................................... 267
AWS Service: Organizations and Control Tower............................. 267
Benefits of using Organizations.................................................. 267
Example account structure......................................................... 268
Best Practices...................................................................................... 268
Introduction
This book is meant to serve as a guide to different AWS solutions,
explaining how to implement them, the reasoning behind the decisions, and
best practices to take the solution to the next level. It was written for devs,
tech leads, cloud/devops engineers and software experts in general, who
have a basic to intermediate understanding of AWS and want to take that
understanding to an advanced level, one solution at a time.

This book is not meant to help you pass certification exams, and it is not
meant as a repository of production-grade solutions you can copy-paste.
The goal is to help you develop and improve your understanding of
these solutions, from both an implementation and an architectural
perspective.

Most chapters contain:


- A use case
- A scenario to give context
- The list of AWS services involved
- Instructions on how to implement the solution (including code
samples)
- An explanation on how the solution works
- A discussion on the problem
- A list of best practices

Some others serve just as an introduction to a topic, and only have a brief
explanation of an AWS service and some best practices for it.

Chapters are grouped by topic similarity; some work as continuations of other chapters, and some may briefly reference other chapters. However, each chapter can stand on its own and be read independently of the others, and reading them in any order shouldn't significantly affect your experience.
About The Author
Hi! I'm Guille Ojeda, author of this book. I'm a former developer, tech lead, cloud engineer, cloud architect and AWS Authorized Instructor. I've worked at startups, agencies and big corps, and had plenty more as clients.

I'm now a full-time content creator and Cloud Architecture Consultant. I've published 2 books, nearly 50 blog posts, and I write a free weekly newsletter called Simple AWS, with over 2600 subscribers. I also run the paid community Simple AWS Community, and I've authored a few courses.
Microservices design: Splitting a monolith
into microservices
Use case: Microservices design

Scenario

We have an online learning platform built as a monolithic application, which enables users to browse and enroll in a variety of courses, access course materials such as videos, quizzes, and assignments, and track their progress throughout the courses. The application is deployed in Amazon ECS as a single service that's scalable and highly available.

As the app has grown, we've noticed that content delivery becomes a
bottleneck during normal operations. Additionally, changes in the course
directory resulted in some bugs in progress tracking. To deal with these
issues, we decided to split the app into three microservices: Course
Catalog, Content Delivery, and Progress Tracking.

Out of scope (so we don't lose focus)


● Authentication/authorization: When I say “users” I mean
authenticated users. We could use Cognito for this.
● User registration and management: Same as above.
● Payments: Since our courses are so awesome, we should charge
for them. We could use a separate microservice that integrates
with a payment processor such as Stripe.
● Caching and CDN: We should use CloudFront to cache the
content, to reduce latency and costs.
● Frontend: Obviously, we need a frontend for our app. We could
say a few interesting things about serving the frontend from S3,
server-side rendering, and comparing with Amplify.
● Database design: Assume our database is properly designed.
● Admin: Someone has to create the courses, upload the content,
course metadata, etc. The operations to do that fall under the
scope of our microservices, but I feared it would grow too
complex, so I cut those features out.

Services

● ECS: Our app is already deployed in ECS as a single ECS Service, we're going to split it into 3 microservices and deploy each as an ECS Service.

● DynamoDB: Our database for this example.

● API Gateway: Used to expose each microservice.

● Elastic Load Balancer: To balance traffic across all the tasks.

● S3: Storage for the content (video files) of the courses.

● ECR: Just a Docker registry.


Final design of the app split into microservices

Solution step by step

1. Identify the microservices


Analyze the monolithic application, focusing on the course catalog,
content delivery, and progress tracking functionalities. Based on
these functionalities, outline the responsibilities for each
microservice:
● Course Catalog: manage courses and their metadata.
● Content Delivery: handle storage and distribution of
course content.
● Progress Tracking: manage user progress through
courses.

2. Define the APIs for each microservice


Design the API endpoints for each microservice (a minimal Express sketch follows the list):
● Course Catalog:
● GET /courses: list all courses
● GET /courses/:id: get a specific course
● Content Delivery:
● GET /content/:id: get a pre-signed URL for a specific course's content
● Progress Tracking:
● GET /progress/:userId: get a user's progress
● PUT /progress/:userId/:courseId: update a user's progress for a specific course
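
To make the contract concrete, here's a minimal sketch of what the Course Catalog endpoints could look like using Express and the AWS SDK v3 DocumentClient. The framework choice, table name, environment variable, and key attribute are assumptions for illustration, not part of the original monolith.

JavaScript

// Minimal sketch of the Course Catalog API (Express + DynamoDB DocumentClient).
const express = require("express");
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const {
  DynamoDBDocumentClient,
  ScanCommand,
  GetCommand,
} = require("@aws-sdk/lib-dynamodb");

const app = express();
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = process.env.COURSE_CATALOG_TABLE; // e.g. "CourseCatalog" (assumed)

// GET /courses: list all courses
app.get("/courses", async (req, res) => {
  const result = await ddb.send(new ScanCommand({ TableName: TABLE }));
  res.json(result.Items);
});

// GET /courses/:id: get a specific course
app.get("/courses/:id", async (req, res) => {
  const result = await ddb.send(
    new GetCommand({ TableName: TABLE, Key: { courseId: req.params.id } })
  );
  if (!result.Item) {
    return res.status(404).json({ message: "Course not found" });
  }
  res.json(result.Item);
});

app.listen(3000);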

3. Create separate repositories and projects for each microservice

Set up individual repositories and Node.js projects for Course Catalog, Content Delivery, and Progress Tracking microservices. Structure the projects using best practices, with separate folders for routes, controllers, and database access code. You know the drill.

4. Separate the code


Refactor the monolithic application code, moving the relevant
functionality for each microservice into its respective project:
● Move the code related to managing courses and their
metadata into the Course Catalog microservice project.
● Move the code related to handling storage and
distribution of course content into the Content Delivery
microservice project.
● Move the code related to managing user progress
through courses into the Progress Tracking microservice
project.
The code may not be as clearly separated as you might want. In
that case, first separate it within the same project, test those
changes, then move the code to the microservices.

5. Separate the data


Create separate Amazon DynamoDB tables for each microservice:
● CourseCatalog: stores course metadata, such as title,
description, and content ID.
● Content: stores content metadata, including content ID,
content type, and S3 object key.
● Progress: stores user progress, with fields for user ID,
course ID, and progress details.
Update the database access code in each microservice to interact
with its specific table.
If you're doing this for a database which already has data, you can
export it to S3, use Glue to filter the data, and then import it back
to DynamoDB.
If the system is live, it gets trickier (there's a replication sketch after this list):
● First, add a timestamp to your data if you don't have one
already.
● Next, create the new tables.
● Then set up DynamoDB Streams to replicate all writes to
the corresponding table.
● Then copy the old data, either with a script or with an S3
export + Glue (don't use the DynamoDB import, it only
works for new tables, write the data manually instead).
Make sure this can handle duplicates.
● Finally, switch over to the new tables.
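
Here's a minimal sketch of the replication step described above: a Lambda function subscribed to the old table's DynamoDB Stream that copies every write into the new table. The table name and environment variable are assumptions, and the stream needs the NEW_IMAGE (or NEW_AND_OLD_IMAGES) view type for NewImage to be available. Full-item puts keep the copy idempotent, which is what lets it tolerate duplicates.

JavaScript

// Sketch: replicate writes from the old monolith table to a new
// per-microservice table, driven by DynamoDB Streams.
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const {
  DynamoDBDocumentClient,
  PutCommand,
  DeleteCommand,
} = require("@aws-sdk/lib-dynamodb");
const { unmarshall } = require("@aws-sdk/util-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const NEW_TABLE = process.env.NEW_TABLE_NAME; // e.g. "CourseCatalog" (assumed)

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === "INSERT" || record.eventName === "MODIFY") {
      // Writing the full item is idempotent, so duplicates are harmless
      const item = unmarshall(record.dynamodb.NewImage);
      await ddb.send(new PutCommand({ TableName: NEW_TABLE, Item: item }));
    } else if (record.eventName === "REMOVE") {
      const key = unmarshall(record.dynamodb.Keys);
      await ddb.send(new DeleteCommand({ TableName: NEW_TABLE, Key: key }));
    }
  }
};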

6. Configure API Gateway


Set up Amazon API Gateway to manage and secure the API
endpoints for each microservice:
● Create an API Gateway resource for each microservice
(Course Catalog, Content Delivery, and Progress
Tracking).
● Apply request validation to ensure incoming requests are
well-formed.

7. Deploy the microservices to ECS


● Write Dockerfiles for each microservice, specifying the
base Node.js image, copying the source code, and
setting the appropriate entry point.
● Build and push the Docker images to Amazon Elastic
Container Registry (ECR) for each microservice.
● Create separate task definitions in Amazon ECS for each
microservice, specifying the required CPU, memory,
environment variables, and the ECR image URLs.
● Create ECS services for each microservice, associating
them with the corresponding task definitions and API
Gateway endpoints.

8. Update the frontend


Modify the frontend code to work with the new
microservices-based architecture. Update API calls to use the
corresponding API Gateway endpoints for Course Catalog,
Content Delivery, and Progress Tracking microservices.

9. Test the new architecture


Thoroughly test the transformed application, covering various user
scenarios such as:
● Logging in and browsing the course catalog.
● Accessing course content using the pre-signed URLs
generated by the Content Delivery microservice.
● Tracking user progress in a course and retrieving
progress information through the Progress Tracking
microservice.
Validate that the application works as expected and meets
performance, security, and reliability requirements.

Solution explanation

1. Identify the microservices


I kind of did that for you, but you should never rush this, take your
time and get it right.
There are two ways to split microservices:
● Vertical slices: Each microservice solves a particular
use case or set of tightly-related use cases. You add
services as you add features, and each user interaction
goes through the minimum possible number of services
(ideally only 1). This means features are an aspect of
decomposition. Code reuse is achieved through shared
libraries.

● Functional services: Each service handles one particular step, integration, state, or thing. System behavior is an emergent property, resulting from combining different services in different ways. Each user interaction invokes multiple services. New features don't need entirely new services. Features are an aspect of integration, not decomposition. Code reuse is often achieved through invoking another service.

I went for vertical slices, because it's simpler to understand and easier to realize for simpler systems. Plus, in this case we didn't even need to deal with service discovery. The drawback is that if your system does 200 different things, it'll need at least 200 services.
By the way, don't even think about migrating from one microservice strategy to the other one. It's much easier to drop the whole system and start from scratch.

2. Define the APIs for each microservice


This is pretty straightforward, since they're the same endpoints
that we had for the monolith. If we were using functional services,
it would get a lot more difficult, since you'd need to understand
what service A needs from service B.
Remember that the only way into a microservice (and that includes
its data) is through the service's API, so design this wisely. Check
out Fowler's post on consumer-driven contracts for some deep
insights.

3. Create separate repositories and projects for each microservice

Basically, keep separate microservices separate.
You could use a monorepo, where the whole codebase is in a single git repository but services are still deployed separately. This works well, but it's a bit harder to pull off.

4. Separate the code


First, refactor as needed until you can copy-paste the
implementation code from your monolith to your services (but don't
copy it just yet). Then test the refactor. Finally, do the
copy-pasting.

5. Separate the data


The difference between a service and a microservice is the
bounded context. Each microservice owns its model, including the
data, and the only way to access that model (and the database
that stores it) is through that service's API.
We could technically implement this without enforcing it, and we
could even enforce it with DynamoDB's field-level permissions.
But we couldn't scale the services independently using a single
table, since capacity is assigned per table.
DynamoDB tables are easy to create and manage (other than
defining the data model). In a relational database we would need
to consider the tradeoff between having to manage (and pay for)
one DB cluster per microservice, or separating per table + DB user
or per database in the same cluster, losing the ability to scale the
data stores independently. Aurora Serverless is a viable option as
well, though it's not cheap for a continuously running database.

6. Configure API Gateway


This is actually a best practice, but I added it as part of the solution
because you're going to need it for authentication.

7. Deploy the microservices to ECS


If you already deployed the monolith, I assume you know how to
deploy the services.

8. Update the frontend


For simplicity, I assume this means updating the frontend code
with the new URLs for your services.
If you were already using the same paths in API Gateway, that's
what you should update instead.

9. Test the new architecture


You build it, you deploy it, you test it, and only then you declare
that it works.
Discussion

There's a lot to say about microservices (heck, I just wrote 3000 words on
the topic), but the main point is that you don't need microservices (for 99%
of apps).

Microservices exist to solve a specific problem: problems in complex domains require complex solutions, which become unmanageable due to the size and complexity of the domain itself. Microservices (when done right) split that complex domain into simpler bounded contexts, thus encapsulating the complexity and reducing the scope of changes (that's why they change independently). They also add complexity to the solution, because now you need to figure out where to draw the boundaries of the contexts, and how the microservices interact with each other, both at the domain level (complex actions that span several microservices) and at the technical level (service discovery, networking, permissions).

So, when do you need microservices? When the reduction in complexity of the domain outweighs the increase in complexity of the solution.

When do you not need microservices? When the domain is not that
complex. In that case, use regular services, where the only split is in the
behavior (i.e. backend code). Or stick with a monolith, Facebook does that
and it works pretty well, at a size we can only dream of.

By the way, here's what a user viewing a course looks like before the split:
1. The user sends a login request with their credentials to the
monolithic application.
2. The application validates the credentials and, if valid, generates an
authentication token for the user.
3. The user sends a request to view a course, including the
authentication token in the request header.
4. The application checks the authentication token and retrieves the
course details from the Courses table in DynamoDB.
5. The application retrieves the course content metadata from the
Content table in DynamoDB, including the S3 object key.
6. Using the S3 object key, the application generates a pre-signed
URL for the course content from Amazon S3.
7. The application responds with the course details and the
pre-signed URL for the course content.
8. The user's browser displays the course details and loads the
course content using the pre-signed URL.

And here's what it looks like after the split:


1. The user sends a login request with their credentials to the
authentication service (not covered in the previous microservices
example).
2. The authentication service validates the credentials and, if valid,
generates an authentication token for the user.
3. The user sends a request to view a course, including the
authentication token in the request header, to the Course Catalog
microservice through API Gateway.
4. The Course Catalog microservice checks the authentication token
and retrieves the course details from its Course Catalog table in
DynamoDB.
5. The Course Catalog microservice responds with the course
details.
6. The user's browser sends a request to access the course content,
including the authentication token in the request header, to the
Content Delivery microservice through API Gateway.
7. The Content Delivery microservice checks the authentication
token and retrieves the course content metadata from its Content
table in DynamoDB, including the S3 object key.
8. Using the S3 object key, the Content Delivery microservice generates a pre-signed URL for the course content from Amazon S3 (a minimal sketch of this step follows the list).
9. The Content Delivery microservice responds with the pre-signed
URL for the course content.
10. The user's browser displays the course details and loads the
course content using the pre-signed URL.
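
Here's a minimal sketch of how the Content Delivery microservice could implement step 8, generating the pre-signed URL with the AWS SDK v3. The bucket name and expiration are assumptions.

JavaScript

// Sketch: generate a time-limited pre-signed URL for a course content object.
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");

const s3 = new S3Client({});

const getContentUrl = async (s3ObjectKey) => {
  const command = new GetObjectCommand({
    Bucket: process.env.CONTENT_BUCKET, // e.g. "simple-aws-course-content" (assumed)
    Key: s3ObjectKey, // taken from the Content table in DynamoDB
  });
  // The URL expires after 1 hour; tune it to how long a viewing session lasts
  return getSignedUrl(s3, command, { expiresIn: 3600 });
};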

Best Practices

Operational Excellence

● Centralized logging: You're basically running 3 apps. Store the logs in the same place, such as CloudWatch Logs (which ECS automatically configures for you).

● Distributed tracing: These three services don't call each other, but in a real microservices app it's a lot more common for that to happen. In those cases, following the trail of calls becomes rather difficult. Use X-Ray to make it a lot simpler; a minimal setup is sketched below.
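
Here's a minimal sketch of what that could look like in a Node.js/Express microservice using the aws-xray-sdk package; the segment name is illustrative. On ECS you'd also run the X-Ray daemon as a sidecar container so the segments actually reach the service.

JavaScript

// Sketch: wrap an Express service with X-Ray and trace outgoing AWS SDK calls.
const AWSXRay = require("aws-xray-sdk");
const AWS = AWSXRay.captureAWS(require("aws-sdk")); // DynamoDB/S3 calls show up in traces
const express = require("express");

const app = express();
app.use(AWSXRay.express.openSegment("course-catalog")); // before the routes

app.get("/courses", async (req, res) => {
  // ...handler code using the traced AWS clients...
  res.json([]);
});

app.use(AWSXRay.express.closeSegment()); // after the routes
app.listen(3000);
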
Security

● Least privilege: It's not enough to not write the code to access
another service's data, you should also enforce it via IAM
permissions. Your microservices should each use a different IAM
role, that lets each access its own DynamoDB table, not *.

● Networking: If a service doesn't need network visibility, it shouldn't have it. Enforce it with security groups.

● Zero trust: The idea is to not trust agents inside a network, but
instead authenticate at every stage. Exposing your services
through API Gateway gives you an easy way to do this. Yes, you
should do this even when exposing them to other services.

Reliability

● Circuit breakers: User calls Service A, Service A calls Service B, Service B fails, the failure cascades, everything fails, your car is suddenly on fire (just go with it), your boss is suddenly on fire (is that a bad thing?), everything is on fire. Circuit breakers act exactly like the electric versions: they prevent a failure in one component from affecting the whole system. I'll let Fowler explain, and there's a minimal sketch after this list.

● Consider different scaling speeds: If Service A depends on Service B, consider that Service B scales independently, which could mean that instances of Service B are not started as soon as Service A gets a request. Service B could be implemented in a different platform (EC2 Auto Scaling vs Lambda), which scales at a different speed. Keep that in mind for service dependencies, and decouple the services when you can.
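
Here's a minimal, illustrative circuit breaker sketch; in practice you'd probably use a library like opossum or handle this at the mesh/proxy level, but the core idea fits in a few lines.

JavaScript

// Sketch: after N consecutive failures, stop calling the downstream service
// for a cool-down period and fail fast instead.
class CircuitBreaker {
  constructor(fn, { failureThreshold = 5, cooldownMs = 30000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openUntil = 0;
  }

  async call(...args) {
    if (Date.now() < this.openUntil) {
      throw new Error("Circuit open: failing fast instead of calling the service");
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0; // a success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openUntil = Date.now() + this.cooldownMs; // open the circuit
      }
      throw err;
    }
  }
}

// Usage: wrap the call from Service A to Service B (URL is a placeholder)
// const breaker = new CircuitBreaker(() => fetch("https://service-b.internal/courses"));
// const response = await breaker.call();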

Performance Efficiency

● Scale services independently: Your microservices are so independent that even their databases are independent! You know what that means? You can scale them at will!

● Rightsize ECS tasks: Now that you split your monolith, it's time to check the resource usage of each microservice, and fine-tune them independently.

● Rightsize DynamoDB tables: Same as above, for the database tables.

Cost Optimization

● Optimize capacity: Determine how much capacity each service needs, and optimize for it. Get a savings plan for the baseline capacity.

● Consider different platforms: Different microservices have different needs. A user-facing microservice might need to scale really fast, at the speed of Fargate or Lambda. A service that only processes asynchronous transactions, such as a payments-processing service, probably doesn't need to scale as fast, and can get away with an Auto Scaling Group of EC2 instances (which is cheaper per compute time). A batch processing service could even use Spot Instances! Every service is independent, so don't limit yourself to what you picked for other services.
Do consider the increased management effort: it's easier (thus cheaper) to manage 10 Lambda functions than to manage 5 Lambda functions, 1 ECS cluster and 2 Auto Scaling Groups.
Securing Microservices with AWS Cognito
Use case: Securing Access to Microservices

Scenario

We're going to continue working on the app from the previous chapter. As a
reminder, we have an online learning platform that has been split into three
microservices: Course Catalog, Content Delivery, and Progress Tracking.
The Course Catalog microservice is responsible for maintaining the list of
available courses and providing course details to users. To ensure that only
authenticated users can browse the catalog, we need to implement a
secure access mechanism for this microservice. We'll dive a bit into
frontend code here, and I'll assume it's a React.js app.

Services

● API Gateway: It's a managed service that makes it easy to create, publish, and manage APIs for your services. In the previous chapter I suggested we should use it to expose our microservices, but I'll start this chapter assuming you didn't do that (because I didn't really give you a strong reason to do it).

● Cognito: It's a user management service that provides authentication, authorization, and user management capabilities. In our case, we use Cognito to manage user authentication and maintain their roles and permissions. Cognito lets us implement sign up and sign in, and when a user tries to access the course catalog, Cognito ensures that they are authenticated.
Secure access to Course Catalog microservice

Solution step by step

● Set up a Cognito User Pool


● Sign in to the AWS Management Console and open the
Cognito console.
● Click "Manage User Pools" and then "Create a user
pool".
● Enter a name for the new user pool, and choose "Review
defaults".
● In the "Attributes" section:
● Choose standard attributes you want to collect
and store for your users (e.g., email, name).
● Set the minimum password length to 12 and
require at least one uppercase letter and one
number.
● In the "MFA and verifications" section:
● Set MFA to "Off" for simplicity, but consider
enabling it for better security in the future.
● Select "Email" as the attribute to be verified.
● In the "App clients" section:
● Click "Add an app client" to create a new
application client that will interact with your user
pool.
● Enter a name for the app client, such as
“SimpleAWSCourses”.
● Uncheck "Generate client secret" as we won't
need it for a web app.
● Under "Allowed OAuth Flows," check
"Authorization code grant" and "Implicit grant."
● Under "Allowed OAuth Scopes," check "email,"
"openid," and "profile."
● Set the "Callback URL(s)" to the app's URL
where users should be redirected after
successful authentication (e.g.,
https://courses.simpleaws.dev/callback).
● Set the "Sign out URL(s)" to the app's URL
where users should be redirected after signing
out (e.g.,
https://courses.simpleaws.dev/signout).
● Save the app client and write down the "App
client ID" for future reference.
● In the "Policies" section:
● Set "Allow users to sign themselves up" to
enable user registration.
● Set "Which attributes do you want to require for
sign-up?" to require email and name.
● Under "Which standard attributes do you want to
allow for sign-in?", select "Email."
● In the "Account recovery" section:
● Set "Which methods do you want to allow for
account recovery?" to "Email only."
● Review all the settings and make any changes as
needed. When finished, click "Create pool."
● Create a group within the User Pool called "User". Click "Groups" in the left-hand navigation pane, and then click "Create group".
● Select the "User" group and check the "Set as default"
checkbox. This will automatically add new users to the
"User" group when they sign up.

● Create an API in API Gateway


● Open the Amazon API Gateway console in the AWS
Management Console.
● Click "Create API" and select "REST API." Then click
"Build."
● Choose "New API" under "Create new API," provide a
name for your API (e.g.,
"SimpleAWSCourseCatalogAPI"), and add an optional
description. Click "Create API."
● Click "Actions" and select "Create Resource." Provide a
resource name (e.g., "CourseCatalog") and a resource
path (e.g., "/courses"). Click "Create Resource."
● With the new resource selected, click "Actions" and
choose "Create Method." Select "GET" from the
dropdown menu that appears. This will create a GET
method for the resource.
● For the GET method's integration type, choose "AWS
Service." Select "ECS" as the AWS Service, choose the
region where the app is deployed, and provide the "ARN"
of the ECS service. Set the "Action Type" to "HTTP
Proxy" and provide the "HTTP Method" as "GET."

● Set up Cognito as the API Gateway authorizer


● In API Gateway, click "Authorizers" in the left-hand
navigation pane, and then click "Create New Authorizer".
● Select "Cognito" as the type and choose your previously
created Cognito User Pool.
● Enter a name for the authorizer.
● Set the "Token Source" as "Authorization" and click
"Create".

● Attach the Cognito authorizer to the API methods


● Click on the method in the API Gateway console.
● In the "Method Request" section, click on "Authorization"
and select the Cognito authorizer you created in step 3
from the "Authorization" dropdown menu.
● Save the changes.
● Deploy the API
● In the API Gateway console, click "Actions" and then
"Deploy API".
● Choose or create a new stage for deployment.
● Note the Invoke URL provided for each method.

● Update the frontend


● Remember that we're using React.js for this example.
● First, you'll need to install these dependencies: npm install amazon-cognito-identity-js aws-sdk
● Then you'll need a file with your configs, and to update your app. Here's an example of the config in a config.js file and a very basic App.js file (with no CSS, and probably missing a few React good practices).

JavaScript

const config = {
  region: "your_aws_region",
  cognito: {
    userPoolId: "your_cognito_user_pool_id",
    appClientId: "your_cognito_app_client_id",
  },
  apiGateway: {
    apiUrl: "your_api_gateway_url",
  },
};

export default config;


JavaScript

import React, { useState } from "react";
import {
  CognitoUserPool,
  CognitoUser,
  AuthenticationDetails,
} from "amazon-cognito-identity-js";
import AWS from "aws-sdk";
import config from "./config";

const userPool = new CognitoUserPool({
  UserPoolId: config.cognito.userPoolId,
  ClientId: config.cognito.appClientId,
});

// Authenticate against the Cognito User Pool and return the JWTs
const authenticateUser = async (username, password) => {
  const user = new CognitoUser({ Username: username, Pool: userPool });
  const authDetails = new AuthenticationDetails({
    Username: username,
    Password: password,
  });

  return new Promise((resolve, reject) => {
    user.authenticateUser(authDetails, {
      onSuccess: (result) => {
        const accessToken = result.getAccessToken().getJwtToken();
        const idToken = result.getIdToken().getJwtToken();
        resolve({ accessToken, idToken });
      },
      onFailure: (err) => {
        reject(err);
      },
    });
  });
};

// Call the API Gateway endpoint, passing the token in the Authorization header
const callApi = async (method, path, accessToken) => {
  const headers = {
    Authorization: `Bearer ${accessToken}`,
  };

  const options = {
    method,
    headers,
  };

  const response = await fetch(`${config.apiGateway.apiUrl}${path}`, options);
  const data = await response.json();

  if (!response.ok) {
    throw new Error(data.message || "Error calling API");
  }

  return data;
};

function App() {
  const [username, setUsername] = useState("");
  const [password, setPassword] = useState("");

  const handleLogin = async () => {
    try {
      const { accessToken } = await authenticateUser(username, password);
      const courses = await callApi("GET", "/courses", accessToken);
      console.log(courses);
    } catch (error) {
      console.error("Error:", error.message);
    }
  };

  return (
    // Your awesome React code that's much better than mine.
    null
  );
}

export default App;

Solution explanation

● Set up a Cognito User Pool


Cognito User Pools store and manage user profiles, and handle registration, authentication, and account recovery. We want to offload all that to Cognito, and we also want to use it to authorize users. Our authorization logic is really simple: if they're a user, they get access. For that, we set the “User” group as the default, so all registered users are in that group. We could set up more groups for different roles, for example “Admin” for administrators, or “Paying Users” for users on a paid plan.

● Create an API in API Gateway


API Gateway is the central component that manages and exposes
your microservice to the frontend. The "CourseCatalog" resource and the "GET" method expose our Course Catalog microservice's functionality, which is integrated through the HTTP proxy integration.
Basically API Gateway receives the requests, runs the authorizer
(which we set up in the next step), runs anything else that needs
running (nothing in this case, but we could set up WAF, Shield,
transform some headers, etc), and then passes the request to our
ECS service. ECS handles the request, returns a response, API
Gateway can do a few things with the response such as transform
headers, and returns the response. We're using API Gateway here
to separate the endpoint from the service, which allows us to
replace the service with something else entirely without changing
the endpoint or how it's invoked.

● Set up Cognito as the API Gateway authorizer


This step ensures that only authenticated and authorized users
can access our Course Catalog. By using the Cognito User Pool
as the authorizer, we're basically saying “if the user is in this
Cognito User Pool, let the request pass”. You can also use a
custom Lambda Authorizer, with a Lambda function that can do
anything you want, such as calling an external service like Okta,
checking a table in DynamoDB, or any complex logic. Using the
Cognito User Pool as an authorizer instead of a custom Lambda
Authorizer is a new-ish feature that makes it easier to implement
this simple logic. A couple of years ago in order to do this you
needed a Lambda that called the Cognito API.

● Attach the Cognito authorizer to the API methods


There are two steps to setting up the Cognito authorizer: creating it (previous step) and attaching it to all relevant endpoints (this step). In our case we're only dealing with one endpoint (GET) in one microservice (CourseCatalog). If you want to secure the whole app, this is where you attach that one authorizer to all endpoints. Don't create multiple copies of the same authorizer: create one and reuse it for all endpoints. Feel free to create different ones when you need them, though; for example, if you wanted to secure the POST endpoint so only "Admin" users can access it, you'd need another Cognito user group and another authorizer that checks against that group.

● Deploy the API


All set, now let's take it for a spin. Deploy it and test it!
By the way, if you make changes to API Gateway, you need to
deploy them before testing it. I've wasted half an hour several
times because I forgot to deploy the changes.

● Update the frontend


I hope I didn't go too overboard with all the frontend code. The
thing is we're adding authentication to the whole app, and that
includes adding a login button and a login form. Most importantly
though, we need to understand how to talk to Cognito to
authenticate a user, and what to pass to our CourseCatalog
endpoint so the authorizer recognizes the user as a logged in,
valid user. That's what I wanted to show. If you write your frontend,
make it prettier than my bare-bones example. If you have frontend
devs in your team, show them that code so they know what
behavior to add.

Discussion

Since we applied security at the API Gateway level, we've decoupled


authentication and authorization from our microservice. This means we can
use the exact same mechanism for the other two microservices: Content
Delivery and Progress Tracking.

It also means we've offloaded the responsibility of authorizing users to API


Gateway. That way our microservices remain focused on their actual task
(which is important for services, makes them much easier to maintain), and
they also remain within their bounded context instead of having to dip into
the shared context of application users (which is important for
microservices, because the bounded context is the key to them, otherwise
we'd be better off with regular services).

There's one caveat to our auth solution: the Content Delivery microservice
returns the URL to an S3 object, and (as things are right now), that object
needs to be public. That means only an authenticated user (i.e. paying
customer) can get the URL, but once they have it they're free to share it
with anyone. Securing access to content served through S3 is going to be
the topic of next week's issue.
One more thing about Cognito: If your app users needed AWS permissions,
for example to write to an S3 bucket or read from a DynamoDB table, you'd
need to set up an Identity Pool that's connected to your User Pool.

Best Practices

Operational Excellence

● Use Infrastructure as Code: That was a lot of clicks! It would be


a lot easier if you had all of this in a CloudFormation template, and
you reused it for every project that requires authentication (which
is 99.99999% of them).

● Implement monitoring and alerting: Set up dashboards and


alarms in CloudWatch to monitor the health and performance of
your APIs, microservices and other components. A good alarm
would be a % of responses being 4XX or 5XX (i.e. errors!).
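If you like to script this kind of thing, here's a minimal sketch of that alarm using the AWS SDK for JavaScript (v2). The API name "courses-api", the region and the 5% threshold are assumptions you'd adjust; you can build the same alarm from the console or CloudFormation.

JavaScript

// Minimal sketch, assuming a REST API named "courses-api" in us-east-1.
// The Average statistic of the 5XXError metric is the server error rate.
const AWS = require("aws-sdk");
const cloudwatch = new AWS.CloudWatch({ region: "us-east-1" });

const createErrorAlarm = async () => {
  await cloudwatch
    .putMetricAlarm({
      AlarmName: "courses-api-5xx-error-rate",
      Namespace: "AWS/ApiGateway",
      MetricName: "5XXError",
      Dimensions: [{ Name: "ApiName", Value: "courses-api" }],
      Statistic: "Average",
      Period: 300,                    // 5-minute windows
      EvaluationPeriods: 1,
      Threshold: 0.05,                // alarm above a 5% error rate
      ComparisonOperator: "GreaterThanThreshold",
      TreatMissingData: "notBreaching",
      // Add AlarmActions (e.g. an SNS topic ARN) to actually get notified.
    })
    .promise();
};

createErrorAlarm().catch(console.error);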

Security

● Enable MFA in Cognito User Pool: You can offer your users the
option of adding MFA to their login.

● Configure AWS WAF with API Gateway: Add an AWS Web


Application Firewall (WAF) to your API to protect it from common
web attacks like SQL injection and cross-site scripting.

● Enable AWS Shield: Shield helps protect your API from


Distributed Denial of Service (DDoS) attacks. Like WAF, you also
enable it in API Gateway.
● Encrypt data in transit: tl;dr: use HTTPS. You can get a free, auto-renewing certificate from AWS Certificate Manager (ACM), and you install it on the Application Load Balancer.

Reliability

● Offer a degraded response: Your microservice can fail, for


whatever reason (internal to the service, a failure in DynamoDB,
etc). Caching helps, but you should also consider a degraded
response such as a static course catalog served from a different
place, like S3. It's not up to date, sure, but it's usually better than a
big and useless error message.

● Consider Disaster Recovery: In AWS jargon, high availability


means an app can withstand the failure of an Availability Zone,
and disaster recovery means it can withstand the failure of an
entire AWS region. There are different strategies for this, but the most
basic one, called Pilot Light, involves replicating all the data to
another region, deploying all the configurations, deploying all the
infrastructure with capacity set to 0, and configuring Route 53 to
send traffic to that region if our primary region fails. We're going to
talk about disaster recovery in a future issue, for now just keep it in
mind, and think whether you really need it (most apps don't).

Performance Efficiency

● Use caching in API Gateway: Enable caching in API Gateway to


reduce the load on your backend services and improve the
response times for your API. You can probably set a relatively long
TTL for this, maybe 1 hour. I mean, how often does your course
catalog change?
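Here's a minimal sketch of what enabling that cache looks like with the AWS SDK for JavaScript (v2). The REST API ID and the "prod" stage name are placeholders, and the 3600-second TTL matches the 1-hour suggestion above; the same settings are available as checkboxes on the stage in the console.

JavaScript

// Minimal sketch: deploy a stage with a cache cluster, then set the TTL.
const AWS = require("aws-sdk");
const apigateway = new AWS.APIGateway({ region: "us-east-1" });

const deployWithCache = async () => {
  // Deploy the API to the "prod" stage with caching enabled
  await apigateway
    .createDeployment({
      restApiId: "<REST_API_ID>",
      stageName: "prod",
      cacheClusterEnabled: true,
      cacheClusterSize: "0.5",   // smallest cache size, in GB
    })
    .promise();

  // Set a 1-hour TTL for all methods on the stage
  await apigateway
    .updateStage({
      restApiId: "<REST_API_ID>",
      stageName: "prod",
      patchOperations: [
        { op: "replace", path: "/*/*/caching/ttlInSeconds", value: "3600" },
      ],
    })
    .promise();
};

deployWithCache().catch(console.error);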

Cost Optimization

● Remember API Gateway usage plans: This one's actually not


relevant to this particular solution, but I felt it was a good
opportunity to throw it in here. You can configure usage plans and
API keys for your APIs in API Gateway, and limit the number of
requests made by your users. This helps to control costs and
prevents abuse of your API. Not what you want for the public API
of CourseCatalog, but it's important and useful for private APIs.
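Here's a minimal sketch of a usage plan with the AWS SDK for JavaScript (v2). The plan name, the limits and the key name are made-up examples.

JavaScript

// Minimal sketch: a usage plan with throttling and a monthly quota,
// plus an API key attached to it.
const AWS = require("aws-sdk");
const apigateway = new AWS.APIGateway({ region: "us-east-1" });

const createUsagePlan = async () => {
  const plan = await apigateway
    .createUsagePlan({
      name: "internal-consumers",
      apiStages: [{ apiId: "<REST_API_ID>", stage: "prod" }],
      throttle: { rateLimit: 50, burstLimit: 100 },   // requests per second
      quota: { limit: 100000, period: "MONTH" },      // requests per month
    })
    .promise();

  const key = await apigateway
    .createApiKey({ name: "partner-key", enabled: true })
    .promise();

  await apigateway
    .createUsagePlanKey({
      usagePlanId: plan.id,
      keyId: key.id,
      keyType: "API_KEY",
    })
    .promise();
};

createUsagePlan().catch(console.error);
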
Securing Access to S3
Use case: Securing access to content stored in S3

Scenario

In our online learning platform that we've been building in the previous 2
chapters, we have three microservices: Course Catalog, Content Delivery,
and Progress Tracking. The Content Delivery service is responsible for
providing access to course materials such as videos, quizzes, and
assignments. These files are stored in Amazon S3, but they are currently
publicly accessible. We need to secure access to these files so that only
authenticated users of our app can access them.

Services

● S3: We have a bucket called simple-aws-courses-content where


we store the videos of the course. The bucket and all objects are
public right now (this is what we're going to fix).

● API Gateway: Last week we set it up to expose our microservices,


including Content Delivery.

● Cognito: It stores the users, gives us sign up and sign in


functionality, and works as an authorizer for API Gateway. We're
going to use it as an authorizer for our content as well.

● CloudFront: A CDN (basically a global cache). It's a good idea to


add it to reduce costs and latency, and it's also going to allow us to
run our authorization code.
● Lambda: A regular Lambda function, run at CloudFront's edge
locations. Called Lambda@Edge.

Complete flow for getting the content

Solution step by step

FYI, here's how it works right now: The user clicks View Content, the
frontend sends a request to the Content Delivery endpoint in API Gateway
with the auth data, API Gateway calls the Cognito authorizer, Cognito
approves, API Gateway forwards the request to the Content Delivery
microservice, the Content Delivery microservice reads the S3 URL of the
requested video from the DynamoDB table, and returns that URL. The URL
is public (which is a problem).

Here's how we fix it:

● Update the IAM permissions for the Content Delivery


microservice
1. Go to the IAM console.
2. Find the IAM role associated with the Content Delivery
microservice (you might need to head over to ECS if you
don't remember the name).
3. Click "Attach policies" and then "Create policy".
4. Add the content below, which grants the permissions the
microservice needs (reading the CloudFront public keys).
5. Name the policy something like
"SimpleAWSCoursesContentAccess", add a description,
and click "Create policy".
6. Attach the new policy to the IAM role.
7. Here's the content of the policy:

Unset

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudfront:ListPublicKeys",
        "cloudfront:GetPublicKey"
      ],
      "Resource": "*"
    }
  ]
}

● Set up a CloudFront distribution


1. Open the CloudFront console.
2. Choose Create distribution.
3. Under Origin, for Origin domain, choose the S3 bucket
simple-aws-courses-content.
4. Use all the default values.
5. At the bottom of the page, choose Create distribution.
6. After CloudFront creates the distribution, the value of the
Status column for the distribution changes from In
Progress to Deployed. This typically takes a few minutes.
7. Write down the domain name that CloudFront assigns to
your distribution. It looks something like
d111111abcdef8.cloudfront.net.

● Make the S3 bucket not public


1. Go to the S3 console.
2. Find the "simple-aws-courses-content" bucket and click
on it.
3. Click on the "Permissions" tab and then on "Block public
access".
4. Turn on the "Block all public access" setting and click
"Save".
5. Remove any existing bucket policies that grant public
access.
6. Add this bucket policy to only allow access from the
CloudFront distribution (replace ACCOUNT_ID with your
Account ID and DISTRIBUTION_ID with your
CloudFront distribution ID):
Unset

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudfront.amazonaws.com"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::simple-aws-courses-content/*",
      "Condition": {
        "StringEquals": {
          "aws:SourceArn": "arn:aws:cloudfront::ACCOUNT_ID:distribution/DISTRIBUTION_ID"
        }
      }
    }
  ]
}

● Create the CloudFront Origin Access Control


1. Go to the CloudFront console.
2. In the navigation pane, choose Origin access.
3. Choose Create control setting.
4. On the Create control setting form, do the following:
1. In the Details pane, enter a Name and a
Description for the origin access control.
2. In the Settings pane, leave the default setting
Sign requests (recommended).
5. Choose S3 from the Origin type dropdown.
6. Click Create.
7. After the OAC is created, write down the Name. You'll
need this for the following step.
8. Go back to the CloudFront console.
9. Choose the distribution that you created earlier, then
choose the Origins tab.
10. Select the S3 origin and click Edit.
11. From the Origin access control dropdown menu,
choose the OAC that you just created.
12. Click Save changes.

● Create a Lambda@Edge function for authorization


1. Go to the Lambda console, click "Create function" and
choose "Author from scratch".
2. Provide a name for the function, such as
"CognitoAuthorizationLambda".
3. Choose the Node.js runtime.
4. In "Function code", put the code below. You'll need the
packages jsonwebtoken and jwk-to-pem (npm install
jsonwebtoken jwk-to-pem), and you'll need to replace
<REGION>, <USER_POOL_ID> and <USER_GROUP> with the
Cognito User Pool's region, its ID, and the user group.
5. In "Execution role", create a new IAM Role and attach it a
policy with the contents below (after the function code).
You'll need to replace <REGION>, <ACCOUNT_ID> and
<USER_POOL_ID> (this last one is the ID of the Cognito
user pool).
6. Click "Create function".

JavaScript

const https = require('https');
const jwt = require('jsonwebtoken');
const jwkToPem = require('jwk-to-pem');

const region = '<REGION>';
const userPoolId = '<USER_POOL_ID>';

// Cognito publishes the public keys (JWKS) it signs its tokens with at a
// well-known URL. We download them once and cache them across invocations.
const jwksUrl = `https://cognito-idp.${region}.amazonaws.com/${userPoolId}/.well-known/jwks.json`;

let cachedKeys;

const fetchJson = (url) =>
  new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => {
          try {
            resolve(JSON.parse(body));
          } catch (err) {
            reject(err);
          }
        });
      })
      .on('error', reject);
  });

const getPublicKeys = async () => {
  if (!cachedKeys) {
    const { keys } = await fetchJson(jwksUrl);
    cachedKeys = keys.reduce((agg, current) => {
      const jwk = { kty: current.kty, n: current.n, e: current.e };
      const key = jwkToPem(jwk);
      agg[current.kid] = { instance: current, key };
      return agg;
    }, {});
  }
  return cachedKeys;
};

const isTokenValid = async (token) => {
  try {
    const publicKeys = await getPublicKeys();
    const tokenSections = (token || '').split('.');
    const headerJSON = Buffer.from(tokenSections[0], 'base64').toString('utf8');
    const { kid } = JSON.parse(headerJSON);

    const key = publicKeys[kid];
    if (key === undefined) {
      throw new Error('Claim made for unknown kid');
    }

    const claim = jwt.verify(token, key.key, { algorithms: ['RS256'] });
    return (
      (claim['cognito:groups'] || []).includes('<USER_GROUP>') &&
      claim.token_use === 'id'
    );
  } catch (error) {
    console.error(error);
    return false;
  }
};

exports.handler = async (event) => {
  const request = event.Records[0].cf.request;
  const headers = request.headers;

  if (headers.authorization && headers.authorization[0].value) {
    const token = headers.authorization[0].value.split(' ')[1];
    const isValid = await isTokenValid(token);
    if (isValid) {
      // Token checks out: let CloudFront continue processing the request.
      return request;
    }
  }

  // Return a 401 Unauthorized response if the token is missing or not valid.
  return {
    status: '401',
    statusDescription: 'Unauthorized',
    body: 'Unauthorized',
    headers: {
      'www-authenticate': [{ key: 'WWW-Authenticate', value: 'Bearer' }],
      'content-type': [{ key: 'Content-Type', value: 'text/plain' }],
    },
  };
};
Unset

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "cognito-idp:ListUsers",
        "cognito-idp:GetUser"
      ],
      "Resource": "arn:aws:cognito-idp:<REGION>:<ACCOUNT_ID>:userpool/<USER_POOL_ID>"
    }
  ]
}

● Add the Lambda@Edge function to the CloudFront


distribution
1. In the CloudFront console, select the distribution you
created earlier.
2. Choose the "Behaviors" tab and edit the existing default
behavior.
3. Under "Lambda Function Associations", choose the
"Viewer Request" event type.
4. Enter the ARN of the Lambda function you created in the
previous step.
5. Click "Save changes" to update the behavior.

● Update the Content Delivery microservice


1. Modify the Content Delivery microservice to return the
CloudFront distribution domain, something like
d111111abcdef8.cloudfront.net, for the requested content.
2. Test and deploy these changes.
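How that change looks depends entirely on how your service is written, but conceptually it's tiny: build the URL from the CloudFront domain instead of the S3 one. A hypothetical sketch in Node.js (the CLOUDFRONT_DOMAIN variable and the buildContentUrl helper are mine, not part of the actual service):

JavaScript

// Hypothetical sketch: return the object key behind the CloudFront domain
// instead of the public S3 URL. CLOUDFRONT_DOMAIN would hold the
// d111111abcdef8.cloudfront.net value you wrote down earlier.
const CLOUDFRONT_DOMAIN = process.env.CLOUDFRONT_DOMAIN;

const buildContentUrl = (objectKey) =>
  `https://${CLOUDFRONT_DOMAIN}/${objectKey}`;

// Example: buildContentUrl("courses/aws-101/lesson-1.mp4")
// -> "https://d111111abcdef8.cloudfront.net/courses/aws-101/lesson-1.mp4"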

● Update the frontend code


1. Update the frontend code to include the auth data in the
content requests, just like we did in last week's issue.

● Test the solution end-to-end


1. Sign in to the app and go to a course.
2. Verify that the course content is displayed correctly and
that the URLs are pointing to the CloudFront distribution
domain (e.g., d111111abcdef8.cloudfront.net).
3. Test accessing the S3 objects directly using their public
URLs to make sure they are no longer accessible.
Solution explanation

● Update the IAM permissions for the Content Delivery


microservice
So far our solution was using minimum permissions, which means
the Content Delivery microservice only had access to what it
needed, and not more. Well, now it needs more access! We're
adding permissions so it can access the (not-created-yet)
CloudFront distribution to generate the pre-signed URLs.

● Set up a CloudFront distribution


CloudFront is a CDN, which means it caches content near the
user.
Without CloudFront: User -----------public internet-----------> S3
bucket.
With CloudFront: User ---public internet---> CloudFront edge
location.
See? It's a shorter path! That's because CloudFront has multiple
edge locations all over the world, and the request goes to the one
nearest the user. That reduces latency for the user.

● Make the S3 bucket not public


With the new solution, users don't need to access the S3 bucket
directly. We want everything to go through CloudFront, so we'll
remove public access to the bucket.
This is basically the same thing as what I should have done in last
week's issue. We add a new layer on top of an existing resource,
now we need to make sure that existing resource is only
accessible through that new layer. In this case, the resource is the
S3 bucket, and the new layer is the CloudFront distribution.

● Create the CloudFront Origin Access Control


In the previous step we restricted access to the S3 bucket. Here
we're giving our CloudFront distribution a sort of “identity” that it
can use to access the S3 bucket. This way, the S3 service can
identify the CloudFront distribution that's trying to access the
bucket, and allow the operation.

● Create a Lambda@Edge function for authorization


Last week I mentioned the old way of doing Cognito authorization
was with a Lambda function. For API Gateway, the much simpler
way is with a Cognito authorizer, just like we did. For CloudFront
we don't have such luxuries, so we ended up having to write the
Lambda function that checks with Cognito whether the auth
headers included in the request belong to an authenticated and
authorized user.

● Add the Lambda@Edge function to the CloudFront


distribution
CloudFront can run Lambda functions when content is requested,
or when a response is returned. They're called Lambda@Edge
because they run at the Edge locations where CloudFront caches
the content, so the user doesn't need to wait for the round-trip
from the edge location to an AWS region. In this case, the user
needs to wait anyways, because our Lambda@Edge accesses the
Cognito user pool, which is in our region. There's no way around
this that I know of.

● Update the Content Delivery microservice


Before, our Content Delivery microservice only returned the public
URLs of the S3 objects. Now, it needs to return the URL of the
CloudFront distribution.
I'm glossing over the details of how Content Delivery knows which object to access. Heck, I'm not even sure why we need the Content Delivery microservice. But let's stay focused.

● Update the frontend code


Before, we were just requesting a public URL, much like after we
first defined our microservices. Now, we need to include the auth
data in the request, just like we did in last week's issue.

● Test the solution end-to-end


Don't just test that it works. Understand how it can fail, and test
that. In this case, a failure would be for non-users to be able to
access the content, which is the very thing we're trying to fix.

Discussion

These first three chapters dealt with the same scenario, focusing on
different aspects. We designed our microservices, secured them, and
secured the content. We found problems, we fixed them. We made
mistakes (well, I did), we fixed them.

This is how cloud architecture is designed and built.


First we understand what's going on, and what needs to happen. That's
why I lead every chapter with the scenario. Otherwise we end up building a
solution that's looking for a problem (which is as common as it is bad,
unfortunately).

We don't do it from scratch though. There are styles and patterns. We


consider them all, evaluate the advantages and tradeoffs, and decide on
one or a few. In this case, we decided that microservices would be the best
way to build this (rather arbitrarily, just because I wanted to talk about
them).

Then we look for problems. Potential functionality issues, unintended


consequences, security issues, etc. We analyze the limits of our
architecture (for example scaling) and how it behaves at those limits (for
example scaling speed). If we want to make our architecture as complete
as possible, we'll look to expand those limits, ensure the system can scale
as fast as possible, all attack vectors are covered, etc. That's
overengineering.

If you want to make your architecture as simple as possible, you need to


understand today's issues and remove things (instead of adding them) until
your architecture only solves those. Keep in mind tomorrow's issues,
consider the probability of them happening and the cost of preparing for
them, and choose between: Solving for them now, making the architecture
easy to change so you can solve them easily in the future, or not doing
anything. For most issues, you don't do anything. For the rest, you make
your architecture easy to change (which is part of evolutionary
architectures).
I think you can guess which one I prefer. Believe it or not, it's actually
harder to make it simpler. The easiest approach to architecture is to pile up
stuff and let someone else live with that (high costs, high complexity, bad
developer experience, etc).

Best Practices

Operational Excellence

● Monitoring: Metrics, metrics everywhere… CloudFront shows you


a metrics dashboard for your distribution (powered by
CloudWatch), and you can check CloudWatch itself as well.

● Monitor the Lambda@Edge function: It's still a Lambda


function, all the @Edge does is say that it runs at CloudFront's
edge locations. Treat it as another piece of code, and monitor it
accordingly.

Security

● Enable HTTPS: You can configure the CloudFront distribution to


use HTTPS. Use Certificate Manager to get a TLS certificate for
your domain, and apply it directly to the CloudFront distribution.

● Make the S3 bucket private: If you add a new layer on top of an


existing resource, make sure malicious actors can't access that
resource directly. In this case, it's not just making it private, but
also setting a bucket policy so only the CloudFront distribution can
access it (remember not to trust anything, even if it's running
inside your AWS account).
● Set up WAF in CloudFront: Just like with API Gateway, we can also set up a WAF Web ACL associated with our CloudFront distribution, to protect it from common exploits.

Reliability

● Configure CloudFront error pages: You can customize error


pages in the CloudFront distribution, so users have a better user
experience when encountering errors. Not actually relevant to our
solution, since CloudFront is only accessed by our JavaScript
code, but I thought I should mention it.

● Enable versioning on the S3 bucket: With versioning enabled,


S3 stores old versions of the content. That way you're protected in
case you accidentally push the wrong version of a video, or
accidentally delete the video from the S3 bucket.
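Enabling it is a single call; here's a minimal sketch with the AWS SDK for JavaScript (v2), using this chapter's bucket name (you can also flip it on in the bucket's Properties tab).

JavaScript

// Minimal sketch: turn on versioning for the course content bucket.
const AWS = require("aws-sdk");
const s3 = new AWS.S3();

const enableVersioning = async () => {
  await s3
    .putBucketVersioning({
      Bucket: "simple-aws-courses-content",
      VersioningConfiguration: { Status: "Enabled" },
    })
    .promise();
};

enableVersioning().catch(console.error);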

Performance Efficiency

● Optimize content caching: We stuck with the defaults, but you


should configure the cache behavior settings in your CloudFront
distribution to balance between content freshness and the reduced
latency of cached content. Consider how often content changes
(not often, in this case!).

● Compress content: Enable automatic content compression in


CloudFront to reduce the size of the content served to users,
which can help improve performance and reduce data transfer
costs.
Cost Optimization

● Configure CloudFront price class: CloudFront offers several


price classes, based on the geographic locations of the edge
locations it uses. Figure out where your users are, and choose the
price class accordingly. Don't pick global “just in case”, remember
that any user can access any location, it's just going to take them
200 ms longer. Analyze the tradeoffs of cost vs user experience
for those users.

● Enable CloudFront access logs: Enable CloudFront access logs


to analyze usage patterns and identify opportunities for cost
optimization, such as adjusting cache settings or updating the
CloudFront price class. Basically, if you want to make a
data-driven decision, this gets you the data you need.

● Consider S3 storage classes and lifecycle policies: Storage


classes make it cheaper to store infrequently accessed content
(and more expensive to serve, but since it's infrequently accessed,
you end up saving money). Lifecycle policies are rules to automate
transitioning objects to storage classes. If you're creating courses
and you find nobody's accessing your course, I'd rather put these
efforts into marketing the course, or retire it if it's too old. But I
figured this was a good opportunity to mention storage classes
and lifecycle policies.
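If you do decide to set one up, here's a minimal sketch of a lifecycle rule with the AWS SDK for JavaScript (v2). The 90-day threshold and the STANDARD_IA storage class are just examples.

JavaScript

// Minimal sketch: move every object to Standard-IA after 90 days.
const AWS = require("aws-sdk");
const s3 = new AWS.S3();

const addLifecycleRule = async () => {
  await s3
    .putBucketLifecycleConfiguration({
      Bucket: "simple-aws-courses-content",
      LifecycleConfiguration: {
        Rules: [
          {
            ID: "move-old-content-to-infrequent-access",
            Status: "Enabled",
            Filter: { Prefix: "" },   // apply to every object
            Transitions: [{ Days: 90, StorageClass: "STANDARD_IA" }],
          },
        ],
      },
    })
    .promise();
};

addLifecycleRule().catch(console.error);
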
7 Must-Do Security Best Practices for your
AWS Account
Use case: That AWS account where you run your
whole company but you never bothered to
improve security

Scenario

You created your AWS account and started building. You got your MVP
running, got some paying customers, and things are going well. You always
said you'd come back to improve security, but figuring out how is harder
than you thought, and there's always a new feature to add to your product.
So you keep saying you'll get on it next week, and the next, and the next...

Today is the day you finally secure your AWS account.

Services

● AWS Identity and Access Management (IAM): This is where you


create users and assign permissions.

● Amazon CloudTrail: Logs everything that happens in your AWS


account.

● Amazon GuardDuty: Analyzes the above logs for suspicious


activity.

● AWS Billing: Where you see how much money you're spending.
Solution

Check out each of these best practices, and if you haven't applied them, do
it now. It's going to take less than 2 hours, I promise. Keep in mind these
are only applicable if you're using a single AWS account, not an
Organization.

Create a password policy

Passwords can be brute forced. No news there. Yet some people still think
they're safe with a 10-digit password that's all numbers. Set up a password
policy to prevent IAM users from entering insecure passwords, with a
minimum of 16 characters. Here's how:
1. Log in to your AWS account and go to the IAM console.
2. Choose Account settings on the left.
3. In the Password policy section, choose Change password policy.
4. Check your options and click Save.
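If you'd rather script it (or bake it into your account setup), here's a minimal sketch with the AWS SDK for JavaScript (v2). It sets the 16-character minimum from above; the other options are my own picks.

JavaScript

// Minimal sketch: set the account-wide IAM password policy.
const AWS = require("aws-sdk");
const iam = new AWS.IAM();

const setPasswordPolicy = async () => {
  await iam
    .updateAccountPasswordPolicy({
      MinimumPasswordLength: 16,
      RequireSymbols: true,
      RequireNumbers: true,
      RequireUppercaseCharacters: true,
      RequireLowercaseCharacters: true,
      AllowUsersToChangePassword: true,
      MaxPasswordAge: 90,           // days
      PasswordReusePrevention: 5,   // remember the last 5 passwords
    })
    .promise();
};

setPasswordPolicy().catch(console.error);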

Create IAM users

The root user can do everything. If you lose access to it, you're done. So
don't share it, and don't even use it! Instead, create an IAM User for every
person that needs access to AWS, and only use the root in case of
emergencies. You'll be tempted to give Administrator permissions to
everyone. Instead of that, check out these example policies by AWS and
these policies by asecurecloud and take 5 minutes to build something a bit
more restrictive.

(tip for Organizations: Don't use IAM, instead use IAM Identity Center to let
one user access all your accounts)
Here's how to set up users:
1. Log in to your AWS account and go to the IAM console.
2. If you decide to set up your own IAM policies (or use the ones
from AWS or Asecurecloud), click Policies on the left and click
Create policy.
3. Click on the JSON tab, paste your JSON. Click Next: Tags.
4. If you want, enter a tag for your policy. Click Next: Review.
5. Enter a Name and a Description. Click Create policy. Repeat this
for all the policies you want to create.
6. Next we're going to create groups for our users. You'll want to
group them by roles or access levels, for example you could have
a Developers group and an Infrastructure group. To do that, click
User groups on the left and click Create group.
7. Enter the name, scroll down, and select the policies you want to
add (you can use the search box to filter them). Click Create
group.
8. Now we're ready for our users. Click Users on the left and click
Add user.
9. Enter the user name. If they need access to the visual console
(the one where you're doing this), check Enable console access.
Most people will need it.
10. If you checked Enable console access, choose either an auto
generated password or enter your own custom password. In either
case, check Users must create a new password at the next
sign-in. Click Next.
11. Select the group or groups (you can pick more than one) to
which you want to add this user. If you pick more than one group,
the user's permissions will be a combination of both groups. Click
Next and click Create user.
12. Copy the Console sign-in URL and send it together with your
password to the person this user is for. You can also click the
button to Email sign-in instructions.
13. If the user needs programmatic access, on the list of users click
on that user, click on the Security credentials tab, scroll down to
Access keys and click Create access key.
14. Select your use case, read the alternatives presented by AWS,
check that you understand and click Next.
15. Enter a Description and click Create access key.
16. Repeat the user creation process for as many people as you
want to give access to your AWS account. Remember, one user
per person.

Add MFA to every user

Multi-Factor Authentication means any agent of evil will need to


compromise two authentication methods to gain access to your account.
For example, your password and your mobile phone (where you install an
app that generates a code needed to sign in). Remember to do this for the
root user as well!!

(tip for Organizations: do this for all IAM Identity Center users and for the
root of each account)

To set it up, see the first part of this blog post.


Enable logging of all account actions

Logs are a record of what happens in a system, right? There's this service
called CloudTrail that records every action that is taken at your AWS
account. Launched an EC2 instance? It's logged here. Changed a security
group? Logged here. Logged in with your user from an unknown IP range?
Logged here, and you can get notified (more on that next). Event history is
automatically enabled for the past 90 days.

(tip for Organizations: You'll want to collect the logs from all accounts and
move them to a single account. I'll update this with a guide for that in the
coming weeks)

Let's see how to keep those logs forever in an S3 bucket:


1. Log in to your AWS account and go to CloudTrail. Click Create
trail.
2. Enter the trail name, such as "all events".
3. For Storage location, choose Create new S3 bucket to create a
bucket.
4. For Log file SSE-KMS encryption, choose Enabled, and on AWS
KMS alias enter a new name for your KMS key.
5. In Additional settings, for Log file validation, choose Enabled.
6. Optionally, choose Enabled in CloudWatch Logs to have
CloudTrail send the logs to CloudWatch Logs.
7. On the Choose log events page, you need to decide what events
to log. I recommend the following:
1. Check Management events and Insights events.
2. In Management events, for API activity, choose both
Read and Write. If you're very short on money choose
Exclude AWS KMS events.
3. In Insights events, choose API call rate and API error
rate.
8. Click Next and click Create trail.

Set up a distribution list as your email address

You know that email address you used to create the account? Probably the
CEO's or the CTO's. Turns out AWS sends useful info there. What happens
if that person is not available? Create a distribution list (which is an address
that is automatically forwarded to many addresses) and add addresses
there. Here's how:
1. On Google Workspace, go to Directory → Groups.
2. Click on Create Group and enter name, email and description.
Click Next.
3. For Access type you probably want to set Restricted, but any will
do, so long as External can “contact group owners” (i.e. send
emails to that address). Click Create Group.
4. Back on AWS, log in with the root user and go to Account
Settings.
5. Change your contact information to a group you just created, and
click Save.
6. Warning: If you change the root email address to that of a group,
anyone with access to the group can recover the password. This is
actually a good idea, but be careful with whom you add to that
group.
Enable GuardDuty

GuardDuty basically scans your CloudTrail events and warns you when
there's unusual activity. Cool, right? It's $4 per million events scanned. And
you need to enable it in all regions.

Here's how to enable it for a region (you should repeat this for all regions):
1. Go to the GuardDuty console. Click Get started and click Enable.
That's it! Let's explore it a bit.
2. On the left, go to Settings and click Generate sample findings.
3. Go back to Findings on the menu on the left, and check out the
[SAMPLE] findings generated. That's what you can expect from
GuardDuty (though hopefully not that many!)
4. So, do I need to check GuardDuty every day to see if there's
something new? Absolutely not. Let's set it up to notify you on an
SNS topic to which you can subscribe a phone number, an email
address, etc (I recommend one of the email addresses for a
distribution list, created on the recommendation above):
5. First we'll create the SNS topic. Go to the SNS console, click
Topics on the left and Create topic.
6. For the Type select Standard. Enter a name such as
GuardDutyFindings, and click Create topic.
7. Next we'll subscribe our email or phone to the topic. Click Create
subscription, for protocol select either email, enter the email
address of one of the distribution lists created above (or create a
new one for this) and click Create subscription.
8. Finally we'll need an EventBridge rule to post GuardDuty findings
to the SNS topic. To do that, go to the EventBridge console, click
Rules on the left and click Create rule.
9. Enter a name such as GuardDutyToSNS, leave event bus as
“default” and Rule type as Rule with an event pattern. Click Next.
10. For Event source, choose AWS events. For Creation method,
choose Use pattern form. For Event source, choose AWS
services. For AWS service, choose GuardDuty. For Event Type,
choose GuardDuty Finding. Click Next.
11. For Target types, choose AWS service. On Select a target,
choose SNS topic, and for Topic, choose the name of the SNS
topic you created 5 steps ago. Open Additional settings.
12. In the Additional settings section, for Configure target input,
choose Input transformer and click Configure input transformer.
13. Scroll down to the Target input transformer section, and for
Input path, paste the following code:
14. { "severity": "$.detail.severity",
"Finding_ID": "$.detail.id", "Finding_Type":
"$.detail.type", "region": "$.region",
"Finding_description": "$.detail.description" }
15. Scroll down. For Template, paste the following code (tune it if
you want):
16. "Heads up! You have a <severity> severity GuardDuty
finding of type <Finding_Type> in the <region> region."
"Finding Description:" "<Finding_description>."
"Check it out on the GuardDuty console:
https://console.aws.amazon.com/guardduty/home?region=<region>#/findings?search=id%3D<Finding_ID>"
17. Click Confirm. Click Next. Click Next. Click Create rule. To test
it, re-generate the sample findings from step 2 and you should get
an email for each.
18. Go grab a glass of water, a cup of tea or a beer. That was a
long one!

Create a Budget

There's no way to limit your billing on AWS, but there is a way to get
notified if your current spending or forecasted spending for the month
exceeds a threshold. If you're spending $200/month and suddenly you're
on track to spending $500 for that month, you'd like to know ASAP so you
can delete those EC2 instances that you forgot you launched!

I'd give you a step by step, but I'll do you one better: Here's an in-console
tutorial by AWS. Tip: Set a number above your typical spending (including
peaks). You don't want to be deleting 10 emails a month, because you'll get
used to ignoring them.
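If you prefer code over the in-console tutorial, here's a minimal sketch with the AWS SDK for JavaScript (v2). The $300 limit, the 80% forecast threshold, the account ID and the email address are all made-up examples.

JavaScript

// Minimal sketch: a monthly cost budget with a forecast-based email alert.
const AWS = require("aws-sdk");
const budgets = new AWS.Budgets({ region: "us-east-1" });

const createMonthlyBudget = async (accountId) => {
  await budgets
    .createBudget({
      AccountId: accountId,
      Budget: {
        BudgetName: "monthly-spend",
        BudgetType: "COST",
        TimeUnit: "MONTHLY",
        BudgetLimit: { Amount: "300", Unit: "USD" },
      },
      NotificationsWithSubscribers: [
        {
          Notification: {
            NotificationType: "FORECASTED",
            ComparisonOperator: "GREATER_THAN",
            Threshold: 80,              // percent of the budget limit
            ThresholdType: "PERCENTAGE",
          },
          Subscribers: [
            { SubscriptionType: "EMAIL", Address: "aws-billing@example.com" },
          ],
        },
      ],
    })
    .promise();
};

createMonthlyBudget("123456789012").catch(console.error);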

Discussion

Do I need to do all of that? Yes. It's easy (with this guide), it's either super
cheap or free, and it's important. You know it's important, or you wouldn't
have read this far.

Do I need to do all of that right now? Now that is a good question. The
answer is no, you don't. But you've been postponing this for how long? Tell
you what, let's compromise: Take each best practice, create a Jira ticket (or
Trello, Todoist or whatever you're using to manage tasks), paste the step
by step there, and schedule it for some time in this sprint or the next. At
least now you know what to do and how to do it.
Quick overview of ECS
Use Case: Deploying Containerized Applications

AWS Service: Elastic Container Service

tl;dr: It's a container orchestrator (like Kubernetes) but done by AWS. You
take a Docker app, set parameters like CPU and memory, set up an EC2
Auto Scaling Group or use Fargate (serverless, you pay per use), and ECS
handles launching everything and keeping it running. Here's a better
explanation of what's involved:

● Cluster: It's a container for everything. It defines where the


capacity comes from (EC2 or Fargate) and contains all the other
resources.

● Task Definition: It's a blueprint for Tasks. A Task is an instance of


a Task Definition (like objects and classes in OOP, this one's the
class). The Task Definition is where you set things like:

● The Docker image


● CPU and memory
● Launch type (EC2 or Fargate)
● Logging configuration
● Volumes
● The IAM role
● Task: An instance of a Task Definition. This is basically your
container running. The ECS Task Scheduler handles creating the
tasks, placing them and creating new ones as needed.

● Service: A grouping of Tasks (with the same Task Definition), with


a Load Balancer in front of them.
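To make the mapping concrete, here's a minimal sketch of those concepts as AWS SDK for JavaScript (v2) calls. The names, image URI, subnet and sizes are placeholders; the next chapter does the real thing with CloudFormation.

JavaScript

// Minimal sketch: Cluster -> Task Definition -> Service.
const AWS = require("aws-sdk");
const ecs = new AWS.ECS({ region: "us-east-1" });

const example = async () => {
  // Cluster: the container for everything
  await ecs.createCluster({ clusterName: "demo-cluster" }).promise();

  // Task Definition: the blueprint (image, CPU, memory, ports...)
  const taskDef = await ecs
    .registerTaskDefinition({
      family: "demo-app",
      requiresCompatibilities: ["FARGATE"],
      networkMode: "awsvpc",
      cpu: "256",
      memory: "512",
      containerDefinitions: [
        {
          name: "demo-app",
          image: "<ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/demo-app:latest",
          portMappings: [{ containerPort: 3000 }],
        },
      ],
    })
    .promise();

  // Service: keeps N Tasks of that Task Definition running
  await ecs
    .createService({
      cluster: "demo-cluster",
      serviceName: "demo-service",
      taskDefinition: taskDef.taskDefinition.taskDefinitionArn,
      desiredCount: 2,
      launchType: "FARGATE",
      networkConfiguration: {
        awsvpcConfiguration: {
          subnets: ["<SUBNET_ID>"],
          assignPublicIp: "ENABLED",
        },
      },
    })
    .promise();
};

example().catch(console.error);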

Best Practices

● ECS is easier than Kubernetes. If you're starting from scratch, don't have anything in k8s yet, and don't plan to move away from AWS, go with ECS.

● If you already have some stuff on Kubernetes or already know how to use it, use EKS (managed Kubernetes cluster in AWS). You can also use Fargate there if you want.

● ECS is free. You only pay for the EC2 instances or Fargate
capacity and the Load Balancers. In contrast, an EKS cluster costs
$72/month.
● Fargate is awesome for unpredictable loads, and scales extremely
fast (you still have to wait for the container to start). Plus, you can
use savings plans!

● In contrast, EC2 is cheaper, but when scaling you need to wait for
the EC2 instance to start.

● Here's the pricing (without savings plans) for 2 tasks running


continuously, each with 1 vCPU and 4 GB of memory:

● Fargate: $85/month.
● EC2: $53/month (with t4g.medium instances).

● And here's a much more complete analysis.

● Here's a reference architecture of microservices deployed on ECS


and exposed with API Gateway.

● And here's how to create a CI/CD pipeline with:

● GitHub Actions
● GitLab
● Jenkins
Step-by-Step instructions to migrate a
Node.js app from EC2 to ECS
Use case: Transforming an app on EC2 to a
scalable app on ECS

Scenario

You have this cool app you wrote in Node.js. You're a great developer, but
you started out with 0 knowledge of AWS. At first you just launched an EC2
instance, SSH'd there and deployed the app. It works, but it doesn't scale.
You read the previous chapter and understood the basic concepts of ECS,
but you don't know how to go from your app on EC2 to your app on an ECS
cluster.

Services

● ECS: A container orchestration service that helps manage Docker


containers on a cluster.

● Elastic Container Registry (ECR): A managed container registry


for storing, managing, and deploying Docker images.

● Fargate: A serverless compute engine for containers that


eliminates the need to manage the underlying EC2 instances.
ECS example with one task

Solution step by step

● Install Docker on your local machine.


Follow the instructions from the official Docker website:
● Windows:
https://docs.docker.com/desktop/windows/install/
● macOS: https://docs.docker.com/desktop/mac/install/
● Linux: https://docs.docker.com/engine/install/

● Create a Dockerfile.
In your app's root directory, create a file named "Dockerfile" (no file
extension). Use the following as a starting point, adjust as needed.
Unset

# Use the official Node.js image as the base image
FROM node:latest

# Set the working directory for the app
WORKDIR /app

# Copy package.json and package-lock.json into the container
COPY package*.json ./

# Install the app's dependencies
RUN npm ci

# Copy the app's source code into the container
COPY . .

# Expose the port your app listens on
EXPOSE 3000

# Start the app
CMD ["npm", "start"]

● Build the Docker image and test it locally.


While in your app's root directory, build the Docker image. Once
the build is complete, start a local container using the new image.
Test the app in your browser or with curl or Postman to ensure
it's working correctly. If it's not, go back and fix the Dockerfile.
Unset

docker build -t cool-nodejs-app .
docker run -p 3000:3000 cool-nodejs-app
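If you'd rather script the check than click around, here's a tiny smoke test for Node 18+ (which has fetch built in). It assumes the app answers on http://localhost:3000/, like the Dockerfile above suggests; change the path if your routes are different.

JavaScript

// Tiny smoke test against the locally running container.
const smokeTest = async () => {
  const response = await fetch("http://localhost:3000/");
  console.log("Status:", response.status);
  console.log("Body:", await response.text());
};

smokeTest().catch((err) => {
  console.error("The container doesn't seem to be answering:", err.message);
  process.exit(1);
});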

● Create the ECR registry.


First, create a new ECR repository using the AWS Management
Console or the AWS CLI.

Unset

aws ecr create-repository --repository-name cool-nodejs-app

● Push the Docker image to ECR.


You'll need to authenticate Docker with your ECR registry; the ECR console shows the exact commands for your account and region if you select the repository and click "View push commands". Then, tag and push the image to ECR, replacing {AWSAccountId} and {AWSRegion} with the appropriate values:

Unset

docker tag cool-nodejs-app:latest


{AWSAccountId}.dkr.ecr.{AWSRegion}.amazonaws.com/cool-n
odejs-app:latest
docker push
{AWSAccountId}.dkr.ecr.{AWSRegion}.amazonaws.com/cool-n
odejs-app:latest

● Create an ECS Task Definition.


We'll use CloudFormation. First, create a file named
"ecs-task-definition.yaml" with the following contents. Then, in the
AWS Console, create a new CloudFormation stack with it.
Unset

Resources:
  ECSTaskRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: ECRReadOnlyAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ecr:GetAuthorizationToken
                Resource: "*"
              - Effect: Allow
                Action:
                  - ecr:BatchCheckLayerAvailability
                  - ecr:GetDownloadUrlForLayer
                  - ecr:GetRepositoryPolicy
                  - ecr:DescribeRepositories
                  - ecr:ListImages
                  - ecr:DescribeImages
                  - ecr:BatchGetImage
                Resource: !Sub "arn:aws:ecr:${AWS::Region}:${AWS::AccountId}:repository/cool-nodejs-app"
              # The execution role also needs to write container logs
              - Effect: Allow
                Action:
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: "arn:aws:logs:*:*:*"

  # Log group referenced by the container's log configuration below
  CloudWatchLogsGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /ecs/cool-nodejs-app
      RetentionInDays: 30

  CoolNodejsAppTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: cool-nodejs-app
      TaskRoleArn: !GetAtt ECSTaskRole.Arn
      ExecutionRoleArn: !GetAtt ECSTaskRole.Arn
      RequiresCompatibilities:
        - FARGATE
      NetworkMode: awsvpc
      Cpu: '256'
      Memory: '512'
      ContainerDefinitions:
        - Name: cool-nodejs-app
          # The image is referenced by its repository URI, not an ARN
          Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/cool-nodejs-app:latest"
          PortMappings:
            - ContainerPort: 3000
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref CloudWatchLogsGroup
              awslogs-region: !Sub ${AWS::Region}
              awslogs-stream-prefix: ecs
● Create the ECS Cluster and Service.
You'll need an existing VPC for this (you can use the default one).
Update your CloudFormation template ("ecs-task-definition.yaml")
to include the ECS Cluster, the Service, and the necessary
resources for networking and load balancing. Replace {VPCID}
with the ID of your VPC, and {SubnetIDs} with one or more
subnet IDs. Then, on the Console, go to CloudFormation and
update the existing stack with the modified template.

Unset

Resources:
  # ... (Existing Task Definition and related resources)

  CoolNodejsAppService:
    Type: AWS::ECS::Service
    # The target group must be attached to the load balancer (through the
    # listener below) before the service can be created
    DependsOn: AppLoadBalancerListener
    Properties:
      ServiceName: cool-nodejs-app-service
      Cluster: !Ref CoolNodejsAppCluster
      TaskDefinition: !Ref CoolNodejsAppTaskDefinition
      DesiredCount: 2
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          SecurityGroups:
            - !Ref AppTaskSecurityGroup
          Subnets:
            - {SubnetIDs}
      LoadBalancers:
        - TargetGroupArn: !Ref AppTargetGroup
          ContainerName: cool-nodejs-app
          ContainerPort: 3000

  CoolNodejsAppCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: cool-nodejs-app-cluster

  AppLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: app-load-balancer
      Scheme: internet-facing
      Type: application
      IpAddressType: ipv4
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: '60'
      Subnets:
        - {SubnetIDs}
      SecurityGroups:
        - !Ref AppLoadBalancerSecurityGroup

  # Forwards HTTP traffic from the load balancer to the target group
  AppLoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref AppLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref AppTargetGroup

  AppLoadBalancerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: app-load-balancer-security-group
      VpcId: {VPCID}
      GroupDescription: Security group for the app load balancer
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  # Lets the load balancer reach the tasks on the app port
  AppTaskSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: app-task-security-group
      VpcId: {VPCID}
      GroupDescription: Security group for the ECS tasks
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 3000
          ToPort: 3000
          SourceSecurityGroupId: !Ref AppLoadBalancerSecurityGroup

  AppTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: app-target-group
      Port: 3000
      Protocol: HTTP
      TargetType: ip
      VpcId: {VPCID}
      HealthCheckEnabled: true
      HealthCheckIntervalSeconds: 30
      HealthCheckPath: /healthcheck
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 2

● Set up auto-scaling for the ECS Service.


Add the following resources to the CloudFormation template to set
up auto-scaling policies based on CPU utilization, and update the
CloudFormation stack with the modified template.

Unset

Resources:
  # ... (Task Definition, ECS cluster and the other existing resources)

  AppScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 10
      MinCapacity: 2
      # ResourceId needs the service name (Ref on an ECS service returns the ARN)
      ResourceId: !Sub "service/${CoolNodejsAppCluster}/${CoolNodejsAppService.Name}"
      RoleARN: !Sub "arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService"
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs

  AppScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: app-cpu-scaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref AppScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        TargetValue: 50
        ScaleInCooldown: 300
        ScaleOutCooldown: 300
● Test the new app.
After the CloudFormation stack update is complete, go to the ECS
console, click on the cool-nodejs-app-cluster cluster, and
you'll find the cool-nodejs-app-service service. The service
will launch tasks based on your task definition and desired count,
so you should see 2 tasks. To test the app, go to the EC2 console,
look for the Load Balancers option on the left, click on the load
balancer named app-load-balancer and find the DNS name.
Paste the name in your browser or use curl or Postman. If we
got it right, you should see the same output as when running the
app locally. Congrats!

Solution explanation

1. Install Docker
ECS runs containers. Docker is the tool we use to containerize our
app. Containers package an app and its dependencies together,
ensuring consistent environments across different stages and
platforms. We needed this either way for ECS, but we get the
extra benefit of being able to run it locally or on ECS without any
extra effort.

2. Create a Dockerfile
A Dockerfile is a script that tells Docker how to build a Docker
image. We specified the base image, copied our app's files,
installed dependencies, exposed the app's port, and defined the
command to start the app. Writing this when starting development
is pretty easy, but doing it for an existing app is harder.
3. Build the Docker image and test it locally
The degree to which you know how your software behaves is the
degree to which you've tested it. So, we wrote the Dockerfile, built
the image, and tested it!

4. Create the ECR registry


Amazon Elastic Container Registry (ECR) is a managed container
registry that stores Docker container images. It's an artifacts repo,
where our artifacts are Docker images. ECR integrates seamlessly
with ECS, allowing us to pull images directly from there without
much hassle.

5. Push the Docker image to ECR


We're building our Docker image and pushing it to the registry so
we can pull it from there.

6. Create an ECS Task Definition


A Task Definition describes what Docker image the tasks should
use, how much CPU and memory a task should have, and other
configs. They're basically task blueprints. We built ours using
CloudFormation, because it makes it more maintainable (and
because it's easier to give you a template than to show you
Console screenshots). By the way, we also included permissions
for the task definition to pull images from the ECR registry.

7. Create the ECS cluster and service


The cluster contains all services and tasks. A Task is an instance
of our app, basically a Docker container plus some ECS wrapping
and configs. Instead of launching tasks manually, we'll create a
Service, which is a grouping of identical tasks. A Service exposes
a single endpoint for the app (using an Application Load Balancer
in this case), and controls the tasks that are behind that endpoint
(including launching the necessary tasks).

8. Set up auto-scaling for the ECS Service


Before this step our ECS Service has 2 tasks, and that's it, no auto
scaling of any kind. Here's where we set the policies that tell the
ECS Service when to create more tasks or destroy existing tasks.
We'll do this based on CPU Utilization, which is the best parameter
in most cases (and the easiest to set up).

9. Test the new app


Alright, now we've got everything deployed. Time to test it! Find
the DNS name of the Load Balancer that corresponds to our
Service, paste that in your browser or curl or Postman, and take
it for a test drive. If it works, give yourself a pat on the back! If it
doesn't, shoot me an email.

Discussion

● We did it! Took a single EC2 instance that wouldn't scale, and
made it scalable, reliable (if a task fails, ECS launches a new one)
and highly available (if you chose at least 2 subnets in different
AZs). And it wasn't that hard.

● Real life could be harder though. This solution works under the
assumption that the app stores no state in the same instance. Any
session data (or any data that needs to be shared across
instances, no matter how temporary or persistent it is) should be
stored in a separate storage, such as a database. DynamoDB is
great for session data, even if you use a relational database like
RDS or Aurora for the rest of the data.

● Most of the people I've seen using a single EC2 instance and a
relational database have the database running in the same
instance. You should move it to RDS or Aurora. This is a separate
step from moving the app to ECS.

● Local environments are easy, until you have 3 devs using an old
Mac, 2 using an M1 or M2 Mac, 2 on Windows and a lone guy
running an obscure Linux distro (it's the same guy who argues vi is
better than VS Code). Docker fixes that.

● Yes, I used Fargate. Did I cheat? Maybe… I went with Fargate so


we could focus on ECS, but you can also do it on an EC2 Auto
Scaling Group.

● Why ECS and not a plain EC2 Auto Scaling Group? For one
service, it's pretty much the same effort. For multiple services,
ECS abstracts away a LOT of complexities.

● Why ECS and not Kubernetes? It's simpler. That's the whole
reason. There's another chapter on doing the same thing for
Kubernetes, you'll see the difference there.
Best Practices

Operational Excellence

● Use a CI/CD pipeline: I ran all of this manually, but you should
add a pipeline. After you've created the infrastructure, all the
pipeline needs to do is build the docker image with docker
build, tag it with docker tag and push it with docker push.

● Use an IAM Role for the pipeline: Of course you don't want to let
anyone write to your ECR registry. The CI/CD pipeline will need to
authenticate. You can either do this with long-lived credentials (not
great but it works), or by letting the pipeline assume an IAM Role.
The details depend on the tool you use, but try to do it.

● Health Checks: Configure health checks for your ECS service and Application Load Balancer to ensure that only healthy tasks are receiving traffic (there's a sketch of a health check endpoint after this list).

● Use Infrastructure as Code: With this issue you're already


halfway there! You've got your ECS cluster, task definition and
service done in CloudFormation!

● Implement Log Aggregation: Set up log aggregation for your


ECS tasks using CloudWatch Logs, Elasticsearch or whatever tool
you prefer. All tasks are the same, logs across tasks should be
aggregated.

● Use blue/green deployments: Blue/green is a strategy that


consists of deploying the new version in parallel with the old one,
testing it, routing traffic to it, monitoring it, and when you're sure it's
working as intended, only then shut down the old version.
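About the health checks bullet above: the target group we created earlier checks /healthcheck, so your app has to answer on that path. A minimal sketch, assuming your app uses Express:

JavaScript

// Minimal sketch of a health check endpoint, assuming Express.
const express = require("express");
const app = express();

app.get("/healthcheck", (req, res) => {
  // Keep this cheap: the load balancer calls it every 30 seconds per target.
  // Optionally check critical dependencies (database, cache) here.
  res.status(200).json({ status: "ok" });
});

app.listen(3000);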

Security

● Store secrets in Secrets Manager: Database credentials, API keys, other sensitive data? Store them in Secrets Manager (see the sketch after this list for how the app reads them).

● Task IAM Role: Assign an IAM role to each ECS task (do it at the
Task Definition), so it has permissions to interact with other AWS
services. We actually did this in our solution, so the tasks could
access ECR!

● Enable Network Isolation: I told you to use the default VPC for
now. For a real use case you should use a dedicated VPC (they're
free!), and put tasks in private subnets.

● Use Security Groups and NACLs: This is about defense in


depth. Basically, protect yourself at multiple levels, including the
network level. That way, if any security level is breached (for
example because of a misconfiguration), you're not completely
exposed.

● Regularly Rotate Secrets: Secrets Manager reduces the


chances of a secret being compromised. But what if it is, and you
don't find out? Rotate passwords and all secrets regularly.
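Here's the Secrets Manager sketch mentioned above, using the AWS SDK for JavaScript (v2). The secret name and its JSON shape are made-up examples, and the task role needs secretsmanager:GetSecretValue on that secret.

JavaScript

// Minimal sketch: read database credentials from Secrets Manager at startup.
const AWS = require("aws-sdk");
const secretsManager = new AWS.SecretsManager({ region: "us-east-1" });

const getDbCredentials = async () => {
  const result = await secretsManager
    .getSecretValue({ SecretId: "cool-nodejs-app/db-credentials" })
    .promise();
  return JSON.parse(result.SecretString);   // e.g. { username, password }
};

getDbCredentials()
  .then((creds) => console.log("Got credentials for:", creds.username))
  .catch(console.error);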

Reliability

● Multi-AZ Deployment: Pick multiple subnets in different AZs and


ECS will deploy your tasks in a highly-available manner.
Remember that for AWS “highly available” means it can quickly
and automatically recover from the failure of one AZ.

● Use Connection Draining: At some point you'll want to kill a task,


but it'll probably be serving requests. Connection draining tells the
LB to not send new requests to that task, but to wait a few
seconds (configurable, use 300) to kill it. That way, the task can
finish processing those requests and the users aren't impacted.

● Set Up Automatic Task Retries: Configure automatic retries for


tasks that fail due to transient errors. This way your tasks can
recover from temporary errors automatically.

Performance Efficiency

● Optimize Task Sizing: Adjust CPU and memory allocations to


match your app requirements. Fixing bad performance by throwing
money at it sounds bad, but some processes and languages are
inherently resource-intensive.

● Use Container Insights: Enable Container Insights in


CloudWatch to monitor, troubleshoot, and optimize ECS tasks.
This gives you valuable insights into the performance of your
containers and helps you identify potential bottlenecks or areas for
optimization.

● Use a queue for writes: Before, neither your app nor your database scaled. Now, your app scales really well, but your database still doesn't scale. A sudden surge of users no longer brings down your app layer, but the consequent surge of write requests can bring down the database. To protect from this, add all writes to a queue and have another service consume from the queue at a max rate. There's a chapter on this coming up, and a quick sketch after this list.

● Use Caching: If you're accessing the same data many times, you
can probably cache it. This will also protect your database from
bursts of reads.

● Optimize Auto-Scaling Policies: Review and adjust scaling


policies regularly, so your services aren't scaling out too late and
scaling in too early.
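And here's the quick sketch for the write queue, using the AWS SDK for JavaScript (v2). The queue URL is a placeholder, and the consumer that drains the queue at a controlled rate is out of scope here.

JavaScript

// Minimal sketch: enqueue writes instead of hitting the database directly.
const AWS = require("aws-sdk");
const sqs = new AWS.SQS({ region: "us-east-1" });

const QUEUE_URL =
  "https://sqs.us-east-1.amazonaws.com/123456789012/writes-queue";

const enqueueWrite = async (payload) => {
  await sqs
    .sendMessage({
      QueueUrl: QUEUE_URL,
      MessageBody: JSON.stringify(payload),
    })
    .promise();
};

// Example: enqueueWrite({ userId: "abc", courseId: "aws-101", progress: 42 });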

Cost Optimization

● Configure a Savings Plan: Get a Savings Plan for Fargate.

● Optimize Auto-Scaling Policies: Review and adjust scaling


policies regularly, so your services aren't scaling out too early and
scaling in too late. Yeah, this is the reverse of the one for
Performance Efficiency. Bottom line is you need to find the sweet
spot.

● Right-size ECS Tasks: You threw money at the performance


problems, and it works better now. Optimize the code when you
can, and right-size again (lower resources this time) to get that
money back. As with auto scaling, find the sweet spot.

● Clean your ECR registries regularly: ECR charges you per GB


stored. We usually push and push and push images, and never
delete them. You won't need images from 3 months ago (and if
you do you can always rebuild them!).

● Use EC2 Auto Scaling Groups: Like I mentioned in the


discussion section, I went with Fargate so we could focus on ECS
without getting distracted by auto scaling the instances. Fargate is
pretty cheap, easy to use, 0 maintenance, and scales fast. An
auto-scaling group of EC2 instances is a bit harder to use (not that
much), requires maintenance efforts, doesn't scale as fast, but it is
significantly cheaper. Pick the right one. The good news is that it's
not hard to migrate, so I'd recommend Fargate for all new apps.
CI/CD Pipeline with AWS Code*
Use case: CI/CD Pipeline in AWS

Scenario

We have an app running in an ECS cluster. We made a change to our


code, and we want to deploy it. Manual deployments are slow and prone to
human error, so we want an entirely automated way to do it: a Continuous
Integration / Continuous Delivery Pipeline (CI/CD pipeline).

Note: We're building this 100% in AWS, even using CodeCommit to store
our git repos. You're probably more familiar with GitHub, GitLab or
Bitbucket, but I wanted to show you how AWS does it.

Services

● AWS CodeCommit: A fully-managed source control service that


hosts Git repositories, allowing you to store and manage your
app's source code. Think GitHub or Bitbucket, but done by AWS.

● AWS CodeBuild: A fully-managed build service that compiles


your app's source code, runs tests, and produces build artifacts.

● AWS CodePipeline: A fully-managed continuous deployment


service that helps you automate your release pipelines. You can
orchestrate various stages, such as source code retrieval, build,
and deployment, which are resolved by other services like
CodeCommit, CodeBuild, and CodeDeploy.
AWS Code* CI/CD Pipeline to ECS

Solution step by step

● Set up a git repo in CodeCommit


1. Install Git (if not already installed):
1. Windows: https://gitforwindows.org/.
2. MacOS: brew install git or
https://git-scm.com/download/mac.
3. Linux: sudo apt-get install git or sudo yum install
git.
2. Open the CodeCommit dashboard in the AWS
Management Console.
3. Click on "Create repository" and configure the repository
settings, such as the name, description, and tags.
4. Go to the IAM console and click on your IAM user.
5. Click on the Security credentials tab, scroll down to
HTTPS Git credentials for AWS CodeCommit and click
Generate credentials.
6. Copy the username and password, or download them as
a CSV.
7. Open your terminal and navigate to the directory
containing your app's source code.
8. Go back to the CodeCommit console and click on your
repository.
9. Copy the git clone command under Step 3: Clone the
repository.
10. Open a terminal in a new directory, paste that
command and run it. When it asks for your username and
password, use the ones you copied or downloaded in
step 6.
11. Change directories to the directory the command just
created, and copy all files and directories of your project
into that directory.
12. Run git add ., git commit -m"Initial
commit" and git push.

● Create a buildspec file


1. Create a file called buildspec.yml with the following
contents. Replace $ECR_REPOSITORY,
$AWS_DEFAULT_REGION, and $CONTAINER_NAME
with the appropriate values for your project.
Unset

version: 0.2

phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws --version
      # Note: `aws ecr get-login` only exists in AWS CLI v1. If your build image
      # ships CLI v2, use `aws ecr get-login-password` piped into `docker login` instead.
      - $(aws ecr get-login --region $AWS_DEFAULT_REGION --no-include-email)
      - REPOSITORY_URI=$(aws ecr describe-repositories --repository-names $ECR_REPOSITORY --query 'repositories[0].repositoryUri' --output text)
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker buildx build --platform=linux/amd64 -t $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION .
  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker image...
      - docker push $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
      - docker tag $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION $REPOSITORY_URI:latest
      - docker push $REPOSITORY_URI:latest
      - echo Writing image definitions file...
      - printf '[{"name":"%s","imageUri":"%s"}]' $CONTAINER_NAME $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION > imagedefinitions.json
artifacts:
  files: imagedefinitions.json
  discard-paths: yes

● Configure the CodeBuild project


1. Open the CodeBuild dashboard in the AWS Management
Console.
2. Click on "Create build project" and configure the project
settings:
1. Project name: Enter a unique name for the build
project, such as SimpleAWSBuilder.
2. Source: Choose "CodeCommit" as the source
provider. Then, select your git repo and the
"master" branch.
3. Configure the environment settings for your build project:
1. Environment image: Select "Managed image."
2. Operating system: Choose "Amazon Linux 2."
3. Runtime(s): Select "Standard" and choose the
latest available image version.
4. Image: Choose the latest one.
5. Check the Privileged checkbox.
6. Service role: Choose New service role and enter
for Role name
"SimpleAWSCodeBuildServiceRole".
4. For Buildspec, just leave the default settings. CodeBuild
will use the buildspec.yml file you created earlier.
5. Click "Create build project" to create your new build
project.

● Give CodeBuild the necessary IAM permissions


1. Go to the IAM console and on the menu on the left click
Roles.
2. Search for the role you just created for CodeBuild (the
name should be SimpleAWSCodeBuildServiceRole) and
click on the name.
3. Click Add permissions and click Attach policies. Click
Create policy.
4. Click the JSON tab and replace the contents with the
contents below. Replace your-account-id with the ID of
your AWS Account, your-ecr-registry with the name of
your ECR registry, and change us-east-1 if you're using a
different region.
5. Click Next (Tags), Next (Review), give your policy a name
such as "SimpleAWSCodeBuildPolicy" and click Create.
6. Go back to the role creation tab (you can close the
current one), click the Refresh button on the right, and
type "SimpleAWSCodeBuildPolicy" in the search box.
7. Click the checkbox on the left of the
SimpleAWSCodeBuildPolicy policy and click Add
permissions.

Unset

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:InitiateLayerUpload",
        "ecr:PutImage",
        "ecr:UploadLayerPart"
      ],
      "Resource": "arn:aws:ecr:us-east-1:your-account-id:repository/your-ecr-registry"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:your-account-id:*"
    }
  ]
}

● Create a pipeline in CodePipeline


1. In the AWS Console go to the CodePipeline dashboard.
2. Click on "Create pipeline."
3. Configure the pipeline settings:
1. Pipeline name: Give your pipeline a unique
name, such as SimpleAWSPipeline.
2. Service role: Leave at "New service role" to
create a new IAM role for your pipeline. You can
change the Role name if you want, or leave it as
it is.
4. Click "Next."
5. Configure the source stage:
1. Source provider: Choose "AWS CodeCommit."
2. Repository name: Select the CodeCommit
repository you created earlier.
3. Branch name: Select the "master" branch.
4. Change detection options: Choose "Amazon
CloudWatch Events (recommended)" to
automatically trigger the pipeline when there's a
new commit.
6. Click "Next."
7. Configure the build stage:
1. Build provider: Choose "AWS CodeBuild".
2. Region: Select your region.
3. Project name: Choose the CodeBuild project you
created earlier.
8. Click "Next."
9. Configure the deploy stage:
1. Deploy provider: Choose "Amazon ECS".
2. Region: Select your region.
3. Cluster name: Select your ECS cluster.
4. Service name: Select your ECS service.
10. Click "Next."
11. Review your pipeline settings and click "Create
pipeline".

● Push a change and check the deployment


1. Make a change to your code, run git add ., git
commit -m"Pipeline test" and git push.
2. Watch CodePipeline to see the pipeline progress from
detecting the change, running the CodeBuild step and
deploying to ECS.
Solution explanation

● Set up a git repo in CodeCommit


I bet you didn't know CodeCommit. It's basically AWS GitHub.
You're not forced to use it, here's how to use GitHub and how to
use Bitbucket.

● Create a buildspec file


The buildspec.yml file is a way to use code to tell CodeBuild what
it needs to do. This one just logs in to ECR (so we can then push
the Docker image), runs docker build and docker push. It's
a little verbose maybe, and there's a few details like using docker
buildx build so we can specify the target platform, but that's
the gist of it.

● Configure the CodeBuild project


This step creates the CodeBuild resource (called project), and
includes stuff like what kind of instances it runs on, what OS, what
IAM Role it's going to use, etc. There's no details on what
CodeBuild needs to do, that's all defined in the buildspec.yml file
from the previous step.

● Give CodeBuild the necessary IAM permissions


CodeBuild is pushing a Docker image to our ECR registry, and it
needs permissions to do so (unless the ECR registry is public,
which we probably don't want). The previous step created the
CodeBuild project with an IAM Role, in this step we're adding an
IAM Policy to that role, to give it those permissions. We're also
adding permissions to publish logs to CloudWatch Logs.

● Create a pipeline in CodePipeline


CodePipeline doesn't actually execute any steps, it just
coordinates them and calls other services to do the real work. Like
a Project Manager! (kidding!!).
The change detection is handled by CodeCommit. We're linking
our CodePipeline pipeline to our CodeCommit repository, and
CodeCommit will publish an event to CloudWatch Events when
there's a change, which CodePipeline is listening for so it can kick
off the pipeline.
The build phase is handled by CodeBuild. CodePipeline just
passes the values and tells it to do its thing.
The deployment phase is handled by ECS itself. It's just a rolling
update: It creates a new task, and once it's working it kills the old
one. We could do much fancier stuff here with CodeDeploy, if we
wanted.

● Push a change and check the deployment


Let's turn the ignition key, and see if it blows up. Push any change,
such as a comment, and see CodePipeline do the magic.

Discussion

Do we need all of that just for a CI/CD Pipeline?


Well, in AWS, yes we do. We're 24 issues in, you know by now that AWS
favors solving problems by combining lots of specific services.
Paraphrasing the Microservices Design chapter, they're like functional
infrastructure microservices, where a working infrastructure is an emergent
property. Quite complex, right? Do you understand why I created this
newsletter?

Can't I just use GitHub Actions?


YES!!! You can! Just don't set long-lived AWS credentials as environment
variables, do this instead. Don't discount CodeDeploy though, everything it
needs to do happens inside AWS, so it's pretty great at doing those things,
like a Blue/Green deployment.

So, if I can use GitHub Actions, why are you even writing about this?
Well, GitHub Actions makes CI/CD as code much easier (we could have
done all of this through a CloudFormation template of like 250 lines). But
it's not entirely trivial. Here's a post on how the whole thing works. By the
way, if you want to use GitHub Actions because you want to be cloud
agnostic, let me tell you you've got much bigger problems to deal with than
a CI/CD Pipeline. And if you want to avoid vendor lock-in, trading lock-in
with AWS (which you already have) for lock-in with GitHub doesn't solve
that.

What is CI/CD anyway, and why do I need it?


Continuous Integration actually means integrating your code with other
devs' code all the time. Technically, the only way to do that is with
Trunk-Based Development (TBD). TBD is “A source-control branching
model, where developers collaborate on code in a single branch called
‘trunk’, resist any pressure to create other long-lived development branches
by employing documented techniques. They therefore avoid merge hell, do
not break the build, and live happily ever after.” (from here). It's the
opposite of any branch-based flow such as GitHub flow. Even if you're
using branches (and technically not doing Continuous Integration), a CI/CD
pipeline is still an extremely useful tool. Maybe it shouldn't be called CI/CD
pipeline?
The CD part has two potential meanings: Continuous Delivery, which
means our software is automatically built and made ready to be shipped to
users (or installed in a prod environment) at the push of a button, or
Continuous Deployment where there's no button and the deployment also
happens automatically. Both are good, Deployment is better.
Why you need it is a very long discussion that I'll summarize like this: When
building software, we're very often very wrong about what to build. For that
reason, we want to shorten feedback cycles, so we can correct course
faster and waste less time and money working on the wrong thing. The
most valuable feedback is the actual user using the product, so we want to
make small changes and get them in front of the user fast and often. The
fastest way to do that without risking human errors is to automate that
process. The tool that runs that process on automatic is called a CI/CD
Pipeline.
By the way, that whole paragraph describes one big technical part of agility,
as considered in the Agile Manifesto.
Best Practices

Operational Excellence

● Set up notifications: You can set up CloudWatch Events for


CodePipeline events, and then maybe use SNS to get an email.
Or use AWS Chatbot to get notified on Slack. Whatever you
choose, stay on top of failed pipeline executions.

● Implement testing stages: Include automated testing stages in


your pipeline to validate code changes and ensure the quality and
stability of your application. Automate these tests, obviously.
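
For example, assuming a Node.js app with an npm test script (that part is up to your project), you could run the tests in the buildspec.yml from this chapter, right before building the image. A failing test returns a non-zero exit code, which fails the build and stops the pipeline:

phases:
  install:
    runtime-versions:
      nodejs: 18
  pre_build:
    commands:
      # ...the existing ECR login commands stay here...
      - npm ci
      - npm test

You can also give tests their own pipeline stage with a separate CodeBuild project, so the build stage stays fast.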

Security

● Limit IAM permissions: Apply the principle of least privilege for


IAM roles and permissions. Grant only the necessary permissions
for CodeBuild and CodePipeline to access resources, and avoid
using overly permissive policies.

● Use temporary credentials: The step by step proposes using


long-lived credentials tied to your IAM User. There's a better way,
with federated identities and temporary credentials.

● Store secrets in Secrets Manager: Manage and secure sensitive


data using AWS Secrets Manager. Integrate Secrets Manager with
CodeBuild and CodePipeline to provide secure access to secrets
during the deployment process. If you're hitting the pull limit on
Docker Hub and need to create an account to increase the pull
limit, this is how you'd store and pass those credentials.
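
Here's what that looks like in a buildspec, as a sketch. The secret name (dockerhub-credentials) and its JSON keys are examples, use whatever you stored in Secrets Manager:

env:
  secrets-manager:
    DOCKERHUB_USER: "dockerhub-credentials:username"
    DOCKERHUB_PASS: "dockerhub-credentials:password"
phases:
  pre_build:
    commands:
      - echo "$DOCKERHUB_PASS" | docker login --username "$DOCKERHUB_USER" --password-stdin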
Reliability

● Retry actions: Configure CodePipeline to retry failed actions.

● Use a deployment strategy that doesn't cause downtime:


We're deploying changes with a rolling update performed by the
ECS service. ECS creates one new task, waits for it to succeed
the health checks (container, LB, etc), kills one old task, and
repeats the process until there are no more old tasks. Blue/Green
is another option, where a new version is deployed while keeping
the old version alive, and traffic is switched to the new version,
and switched back if the new version fails. Take your pick, but
automate it, and automate the rollbacks.

Performance Efficiency

● Optimize build times: Use caching and parallelization in


CodeBuild to speed up build times, reducing pipeline execution
time and improving resource utilization. This also helps developers
avoid switching context while they wait for the build to run.
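
A minimal sketch of build caching, assuming the Node.js buildspec from this chapter and caching enabled on the CodeBuild project (S3 or local cache):

cache:
  paths:
    - '/root/.npm/**/*'

For Docker-heavy builds, also enable the LOCAL_DOCKER_LAYER_CACHE mode in the CodeBuild project settings so image layers are reused between builds.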

Cost Optimization

● Right-size CodeBuild environments: Right-size your build


instances so builds are as fast as needed, but still cheap. Don't
focus too much here though, CodeBuild is pretty cheap.

● Clean up old build artifacts: Implement a lifecycle policy in


Amazon S3 to automatically delete old build artifacts.
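
A sketch of such a lifecycle policy with the AWS CLI. The bucket name and the 30-day expiration are examples; point it at the artifact bucket CodePipeline created for you:

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-codepipeline-artifact-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-artifacts",
        "Status": "Enabled",
        "Filter": { "Prefix": "" },
        "Expiration": { "Days": 30 }
      }
    ]
  }'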
Kubernetes on AWS - Basics and Best
Practices
Use case: Containerized microservices on
Kubernetes on AWS

Scenario

You're working on a new app, and you've decided to break it down into
multiple services (could be microservices or regular services). This way,
each service can scale independently, and you can deploy each one
separately. However, you're worried that deploying, communicating and
scaling each service separately will be a lot of work.

Enter Kubernetes. Kubernetes is a container orchestrator (like ECS), which


means you define your services and the resources they need, and it will
manage deploying them, scaling them, communication (networking,
security) and management (logging, monitoring, etc). You focus on writing
the code for your services, Kubernetes handles the ops side of things.

Services

● EKS (Elastic Kubernetes Service): This is a fully


managed Kubernetes service by AWS. Setting up a k8s cluster is
actually not trivial. EKS gives you a cluster already set up, and
connects to all the other AWS services needed (IAM for
permissions, EC2/Fargate for capacity, etc). Each cluster costs
$72/month, but it's worth the effort it saves you compared to
manually setting it up.

● EC2: EC2 is the infrastructure that runs your containers. You


create EC2 instances and tell your EKS cluster to put your
containers there. Or you can use Fargate.

● Fargate: It's the serverless way of running containers on EKS.


You forget about instances and just pay per use. More expensive
per resource, but allows faster scaling, and it's much simpler to
start.

Solution

1. First, design your services


There's way too much to say about this, but for now let's just say
that you should first know whether you want a monolith or services,
and then understand what each service does. The following steps
don't actually need this, but a real use case does.

2. Then, write the code and Dockerize your services


We'll skip this step and just use a readily-available app:
ecsdemo-nodejs.

3. Next, we'll set up some tools


Run these bash commands

1. AWS CLI:
sudo curl --silent --location -o "awscliv2.zip" "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" && sudo unzip awscliv2.zip && sudo ./aws/install
Check it with aws --version

2. Kubectl:
sudo curl -L -o /usr/local/bin/kubectl "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && sudo chmod +x /usr/local/bin/kubectl
Check it with kubectl version --client

3. eksctl:
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp && sudo mv -v /tmp/eksctl /usr/local/bin
Check it with eksctl version

4. Now we'll use eksctl to create the cluster


Run the following command:
eksctl create cluster --version=1.20
--name=simpleaws --nodes=2 --managed
--region=us-east-1 --node-type t3.medium
--asg-access
Wait 15-20 minutes and check it with kubectl get nodes

5. Next we'll deploy the app


Run the following command:
curl -O https://raw.githubusercontent.com/aws-containers/ecsdemo-nodejs/main/kubernetes/deployment.yaml && kubectl apply -f deployment.yaml

6. Then we expose it through a service


Run the following command:
curl -O https://raw.githubusercontent.com/aws-containers/ecsdemo-nodejs/main/kubernetes/service.yaml && kubectl apply -f service.yaml

7. Finally, we clean up
Don't forget this step!
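
The step by step doesn't spell the commands out, so here they are, assuming the cluster name and region from step 4. Deleting the Service first removes the load balancer it created:

kubectl delete -f service.yaml -f deployment.yaml
eksctl delete cluster --name=simpleaws --region=us-east-1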

Discussion

Here's an explanation of each step:

1. We decided how we want to structure our code. (again, not


needed for this step by step, but you'll need to do this for a real
app)

2. We wrote some Dockerfiles. (again, not needed for this step by


step, but you'll need to do this for a real app)

3. We installed some tools:

1. AWS CLI: It's the Command Line Interface tool for AWS.
We're not using it directly, but we need it for eksctl to
work.
2. Kubectl: It's the Command Line Interface tool for
Kubernetes.

3. eksctl: A Command Line Interface tool specifically made


by AWS for EKS.

4. We used a command from eksctl to create an EKS cluster. After


this command (which should take 15 or 20 minutes to complete)
you'll be able to use kubectl to get the nodes, and if you log in to
the AWS Console and go to the EKS service you'll see a cluster
created there. So far we have a cluster but no services (it's a bit
like having an EC2 instance up and running but not having
installed any app there).

5. We took a sample app from here, downloaded the file with the

curl command, and deployed that app to our cluster with the

kubectl apply command. These YAML files specify what app will

be deployed (in this case the Docker image


brentley/ecsdemo-nodejs:latest) and how (how many replicas,
which ports are used, the update strategy, etc). These YAML files
should remind you a bit of Docker Compose, though they're more
complex because Kubernetes is more powerful. You'll need to
write your own YAMLs for your app.

6. We did the same as in step 5, but for a Service. Deployments in


Kubernetes are not automatically exposed to the internet, you
need to create a Service for that. Think of it like a Load Balancer
to expose all the instances of that service (that's actually one of
the ways to achieve it, but not the only one). You'll need to write
your own YAMLs for your app.

7. We cleaned up, because this is just a sample and we don't want to


be paying over $70/month because we forgot we deployed this.

Basic building blocks of Kubernetes

● Cluster: It's a collection of worker nodes running a container


orchestration system, in this case, Kubernetes. The cluster is
managed by the control plane, which includes components like the
API server, controller manager, and etcd.

● Pod: A pod is the smallest and simplest unit in the Kubernetes


object model. It represents a single instance of a running process
in your cluster. Pods contain one or more containers, storage
resources, and a unique network IP.

● Deployment: A deployment is an object in Kubernetes that


manages a replicated application. It ensures that a specified
number of replicas of your application are running at any given
time. Deployments are responsible for creating, updating, and
rolling back pods.

● Service: A service is a logical abstraction on top of one or more


pods. It defines a policy by which pods can be accessed, either
internally within the cluster or externally. Services can be
accessed by a ClusterIP, NodePort, LoadBalancer, or
ExternalName.
As you can see, Kubernetes is actually pretty complex. That's why I
typically promote ECS: it's simpler. However, Kubernetes has some clear
advantages. You should decide whether they're worth the extra complexity,
for every particular use case.

Benefits of Kubernetes

● Cloud Agnostic: Kubernetes basically runs on anything. You can


use a managed service like AWS EKS, Azure AKS, GCP GKE
(which is actually the best of all three, and it's free), set it up
manually on any kind of VMs, on bare metal servers, your own
laptop, and even on a raspberry pi. It removes vendor lock in from
any cloud provider (which is not such a big deal), and it lets you
run your own servers with a really fantastic platform.

● Open Source: Kubernetes is published under the Apache License


2.0, which allows anyone to make changes and use it
commercially for free. It's actively maintained by the Cloud Native
Computing Foundation (Google actually built the first versions, and
donated the project).

● Very mature: Kubernetes was released in 2014, and has grown


really quickly. Nowadays it has been widely used in small to
enormous production loads for several years. Basically, we're
really sure that it works really well.

● REALLY powerful: Seriously, there's so many things that you can


do with Kubernetes, and a ton of tools around it. That level of
flexibility requires a similar level of complexity though. In my
experience, for most projects it's not worth the added complexity.
However, when it is worth it, it's undeniably the best solution out
there, and I don't see that changing any time soon.

Best Practices
In my experience, the bestest best practice is deciding whether you actually
need Kubernetes or not, and not picking it blindly. If you do need it, pay
attention to the best practices that follow. Keep in mind that this is NOT an
exhaustive list.

Operational Excellence

● Define everything as code: Everything in Kubernetes is some


form of configuration, from what app to deploy to how it's exposed.
Everything can be written in YAML files, and it should be (and
added to version control). Same logic as with Infrastructure as
Code.

● Use CI/CD: Once you have everything in YAMLs, you could just

use kubectl apply -f yourfile.yaml. Don't do that (at least


not for prod). Merge to prod or main and have a pipeline that runs
that same command. That way you always know what version is
deployed.

● Set up a DNS Server: Kubernetes is only aware of the services


that are already running. When you deploy a pod that needs to
access a service, kubernetes can only inject the service's address
if the service is already running at the time of deploying the pod.
That means you'll first need to deploy all services, then all pods.
Or you can set up a DNS Server as a much easier way to handle
this.

● Consider a Service Mesh: Containers need to solve 3 things:


Whatever they're meant to do (this is your code), where they'll live
(this is what Kubernetes solves) and how they'll talk to each other
(you can solve this through a lot of config). If communication
between services is complex, a Service Mesh can solve traffic
management, resiliency, policy, security, strong identity, and
observability. That way, your app is decoupled from these
operational problems and the service mesh moves them out of the
application layer, and down to the infrastructure layer. For AWS
you can use App Mesh, or check this list.

● Use Labels: If you're using Kubernetes, you'll have multiple


services, deployments, pods, volumes, volume claims, ingresses,
etc. That's a lot to manage. Label everything, so you know at a
glance what any component is doing.

● Use Ingress Controllers: Don't set up a Load Balancer for every
service. Instead, use Services for internal visibility, and expose
them through an Ingress Controller (see the sketch after this list).

● Use namespaces: Group up resources in namespaces, to have a


better organization.

● Set up logging and monitoring: You can use CloudWatch Logs,


set up Prometheus, ElasticSearch + LogStash + Kibana, or a
myriad of other options.
● Use an artifact repository: You're already using a Docker registry
for your images. Helm is a registry for kubernetes deployments,
services, and configs. You can download public Helm Charts like
you use public Docker images, or set up your private Helm
repository.
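
Here's a minimal sketch of what the Ingress Controllers item above means in practice. It assumes you already installed an ingress controller (for example the AWS Load Balancer Controller or NGINX), and the service names and paths are just examples:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  rules:
    - http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service        # a regular ClusterIP Service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80

One load balancer, one Ingress, as many services behind it as you need.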

Security

● Test and Scan Container Images: Before deploying to prod, you


can run security tools to scan container images for common
vulnerabilities and exploits, and run tests.

● Run pods with minimal permissions: Pods can run as
privileged (more permissions) or restricted (fewer permissions).
Run them as restricted whenever you can; see the sketch after
this list.

● Use Secrets: If something is supposed to be secret (such as an


API key), store it as a Kubernetes Secret.

● Limit user permissions: Not everyone needs to be a Kubernetes


admin. Use Role-Based Access Controls to restrict what each user
can do.
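
Here's a sketch of the restricted pod idea from the "Run pods with minimal permissions" item above. The names are examples; in practice this goes in the pod template of your Deployment:

spec:
  securityContext:
    runAsNonRoot: true                 # refuse to start if the image runs as root
  containers:
    - name: my-app
      image: my-app:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]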

Reliability

● Do Blue-Green Deployments: Blue-Green deployments is a


deployment strategy where you deploy a new version without
destroying the old one, send traffic to the new version, and once
you're sure (with real data) that the new version works, only then
you destroy the old version.
● Always use Deployments or ReplicaSets: You can launch pods
on their own, or launch a Deployment or ReplicaSet, which will in
turn create the pods. Pods launched on their own are not
recreated on another node if the node fails. So if you need a pod,
just launch a deployment with 1 pod.

● Use a Network Plugin: Kubernetes is not aware of your cloud


provider's Availability Zones. Set up a Network Plugin to fix that.

Performance Efficiency

● Scale the cluster: Kubernetes scales the number of pods. Cluster


Autoscaler scales the number of nodes. Also, check out Karpenter
as a better autoscaler.

Cost Optimization

● Use Savings Plans: You're paying for the EKS cluster (which is
the Kubernetes control plane) and for the capacity that you're
using (either EC2 or Fargate). Set up Savings Plans for that
capacity.
Step-By-Step Instructions To Deploy A
Node.Js App To Kubernetes on EKS
Use case: Deploying a Node.js app on Kubernetes
on EKS

Scenario

You have a cool app you wrote in Node.js. You have a pretty good handle
on ECS, but the powers that be have decided that you need to use
Kubernetes. You understand the basic building blocks of Kubernetes and
EKS, and have drawn the parallels with ECS. But you're still not sure how
to go from code to app deployed in EKS.

Services

● EKS: A managed Kubernetes cluster, which essentially does the


same as ECS but using Kubernetes.

● Elastic Container Registry (ECR): A managed container registry


for storing, managing, and deploying Docker images.

● Fargate: A serverless compute engine for containers that


eliminates the need to manage the underlying EC2 instances.
Solution step by step

Note: The first steps of installing Docker and dockerizing the app are the
same as the previous chapter on deploying the app to ECS. I'm adding
them in case you didn't read it, but if you followed them already, feel free to
start from the 6th step, right after pushing the Docker image to ECR.

● Install Docker on your local machine.


Follow the instructions from the official Docker website:
● Windows:
https://docs.docker.com/desktop/windows/install/
● macOS: https://docs.docker.com/desktop/mac/install/
● Linux: https://docs.docker.com/engine/install/

● Create a Dockerfile.
In your app's root directory, create a file named "Dockerfile" (no file
extension). Use the following as a starting point, adjust as needed.
Unset

# Use the official Node.js image as the base image
FROM node:latest

# Set the working directory for the app
WORKDIR /app

# Copy package.json and package-lock.json into the container
COPY package*.json ./

# Install the app's dependencies
RUN npm ci

# Copy the app's source code into the container
COPY . .

# Expose the port your app listens on
EXPOSE 3000

# Start the app
CMD ["npm", "start"]

● Build the Docker image and test it locally.


While in your app's root directory, run the following commands to
build the Docker image and start a local container using the new
image. Test the app in your browser or with curl or Postman to
ensure it's working correctly. If it's not, go back and fix the
Dockerfile.
Unset

docker build -t cool-nodejs-app .


docker run -p 3000:3000 cool-nodejs-app

● Create the ECR registry.


First, create a new ECR repository using the AWS Console or the
AWS CLI.

Unset

aws ecr create-repository --repository-name cool-nodejs-app

● Push the Docker image to ECR.


The command that creates the ECR registry will output the
instructions in the output to authenticate Docker with your ECR
repository. Follow them. Then, tag and push the image to ECR,
replacing {AWSAccountId} and {AWSRegion} with the
appropriate values:

Unset

docker tag cool-nodejs-app:latest {AWSAccountId}.dkr.ecr.{AWSRegion}.amazonaws.com/cool-nodejs-app:latest
docker push {AWSAccountId}.dkr.ecr.{AWSRegion}.amazonaws.com/cool-nodejs-app:latest

● Install and configure AWS CLI, eksctl, and kubectl.


Follow the instructions from the official sites:
● Install the AWS CLI: https://aws.amazon.com/cli/
● Install kubectl:
https://kubernetes.io/docs/tasks/tools/install-kubectl/
● Install eksctl: https://eksctl.io/introduction/#installation

After installing them, make sure to configure the AWS CLI with
your AWS credentials.

● Create an EKS Cluster.


You'll need an existing VPC for this (you can use the default one).
Create a CloudFormation template named "eks-cluster.yaml" with
the following contents. Replace {SubnetIDs} with one or more
subnet IDs from your VPC, and {Namespace} with the Kubernetes
namespace your app will run in (that's "default" if your manifests
don't specify one). Then, in the AWS Console, create a new
CloudFormation stack using the "eks-cluster.yaml" file as the
template.

Unset

Resources:
  EKSCluster:
    Type: AWS::EKS::Cluster
    Properties:
      Name: cool-nodejs-app-eks-cluster
      RoleArn: !GetAtt EKSClusterRole.Arn
      ResourcesVpcConfig:
        SubnetIds: [{SubnetIDs}]
        EndpointPrivateAccess: true
        EndpointPublicAccess: true
      Version: '1.22'

  EKSClusterRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: eks.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
        - arn:aws:iam::aws:policy/AmazonEKSServicePolicy

  EKSFargateProfile:
    Type: AWS::EKS::FargateProfile
    Properties:
      ClusterName: !Ref EKSCluster
      FargateProfileName: cool-nodejs-app-fargate-profile
      PodExecutionRoleArn: !GetAtt FargatePodExecutionRole.Arn
      Subnets: [{SubnetIDs}]
      Selectors:
        - Namespace: {Namespace}

  FargatePodExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: 'eks-fargate-pods.amazonaws.com'
            Action: 'sts:AssumeRole'
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKS_Fargate_PodExecutionRole_Policy

● Deploy the app to EKS.


Create a Kubernetes manifest file named "eks-deployment.yaml"
with the following contents. Replace {AWSAccountId} and
{AWSRegion} with the appropriate values. Then apply it with
kubectl apply -f eks-deployment.yaml.

Unset

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cool-nodejs-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cool-nodejs-app
  template:
    metadata:
      labels:
        app: cool-nodejs-app
    spec:
      containers:
        - name: cool-nodejs-app
          image: {AWSAccountId}.dkr.ecr.{AWSRegion}.amazonaws.com/cool-nodejs-app:latest
          ports:
            - containerPort: 3000
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 250m
              memory: 256Mi
      serviceAccountName: fargate-pod-execution-role

---

apiVersion: v1
kind: Service
metadata:
  name: cool-nodejs-app
spec:
  selector:
    app: cool-nodejs-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
  type: LoadBalancer

● Set up auto-scaling for the Kubernetes Deployment.


Add the following to your "eks-deployment.yaml" file (as a new
document, after a --- separator) to set up auto-scaling based on
CPU utilization. Then apply the changes with kubectl apply -f
eks-deployment.yaml.

Unset

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: cool-nodejs-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cool-nodejs-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
● Test the new app.
To test the app, find the LoadBalancer's external IP or hostname
by looking at the "EXTERNAL-IP" field in the output of the
kubectl get services command. Paste the IP or hostname
in your browser or use curl or Postman. If everything is set up
correctly, you should see the same output as when running the
app locally. Congrats!

Solution explanation

● Install Docker on your local machine.


Same as last week, Kubernetes runs Docker containers. Docker
also lets us run our app locally without depending on the OS,
which is irrelevant for a simple test, but really important when your
team's computers aren't standardized.

● Create a Dockerfile.
We're using the same Dockerfile as last week. In case you didn't
read it, a Dockerfile is a script that tells Docker how to build a
Docker image. We specified the base image, copied our app's
files, installed dependencies, exposed the app's port, and defined
the command to start the app. Writing this when starting
development is pretty easy, but doing it for an existing app is
harder.

● Build the Docker image and test it locally.


I'm going to repeat one of my favorite phrases in software
development: The degree to which you know how your software
behaves is the degree to which you've tested it.
● Create the ECR registry.
ECR is a managed container registry that stores Docker container
images. It integrates seamlessly with EKS, which is why we're
using it here.

● Push the Docker image to ECR


We build our Docker image, and we push it to the registry so we
can pull it from there.

● Install and configure AWS CLI, eksctl, and kubectl.


More tools that we need. The AWS CLI you already know. kubectl
is a CLI tool used to manage kubernetes, and it's usually the main
way in which you interact with kubernetes. eksctl is a great CLI
tool that follows the spirit of kubectl, but which is designed to deal
with the parts of EKS that are specific to EKS and not general to
all Kubernetes installations. AWS built their own managed
Kubernetes, which is EKS, but they (obviously) couldn't modify
kubectl, so they built a tool to complement it.

● Create an EKS Cluster.


A cluster is a collection of worker nodes running our Kubernetes
apps. The cluster is managed by the control plane, which includes
components like the API server, controller manager, and etcd.
Same concept as the ECS cluster, except that with EKS being a
managed Kubernetes we can peek under the hood.
In our template we're also creating a FargateProfile, which EKS
needs so it can run our apps on Fargate (serverless worker nodes,
which we use here so we can focus on EKS).
● Deploy the app to EKS.
We're creating a YAML file, but this one's not a CloudFormation
template. Kubernetes uses “manifest files” to define its resources
as code, and our eks-deployment.yaml file is one of those. The
key elements are:

● Deployment: Basically, the thing that tells k8s to run our


app. More technically, an object that manages a
replicated application. It ensures that a specified number
of replicas of your application are running at any given
time. Deployments are responsible for creating, updating,
and rolling back pods. Similar to a Task Definition from
ECS.

● Service: Basically, the load balancer of our app. More


technically, a logical abstraction on top of one or more
pods. It defines a policy by which pods can be accessed,
either internally within the cluster or externally. Services
can be accessed by a ClusterIP, NodePort,
LoadBalancer, or ExternalName. Same concept as a
Service in ECS, just some different implementation
details. Note that in this case it's going to be a load
balancer, but it's not the only option.

You've probably noticed that we're not creating pods. A pod is one
instance of our app executing, so that's what we're really after.
However, just like with ECS where we don't create Tasks directly,
we don't create Pods directly in Kubernetes.
● Set up auto-scaling for the Kubernetes Deployment.
We're changing the deployment so it includes the necessary logic
to auto-scale. That is, to create and destroy Pods as needed to
match the specified metric (in this case CPU usage).

● Test the new app.


You know it's going to work. You test it anyway.

Discussion

Text in italics is (my mental representation of) you (the reader) talking or
asking a question, regular text is me talking. That way, it looks more like a
discussion. By the way, if you want to have a real discussion, or want to ask
any questions, feel free to contact me on LinkedIn!

● So, this is me, the reader, asking for clarification about the format?
Exactly!

● That's it?
Yeah, that's it! We did it in ECS last week, and now we did it in EKS!

● I was expecting a lot more work


I kept it Simple, so we got away with doing it in a newsletter issue
instead of me writing a whole book (which is actually an option, by the
way).

● So, if it wasn't that hard, why have you been complaining warning us
about the complexities of Kubernetes ever since this book started?
About Kubernetes, we've barely scratched the snow that covers the
tip of the iceberg (I hope that's accurate, I've never seen an iceberg).
● Why make us configure eksctl if you weren't going to use it?
Great catch! With eksctl you can create a cluster simply by running
eksctl create cluster --name my-cluster --region
region-code --version 1.25 --vpc-private-subnets
subnet-ExampleID1,subnet-ExampleID2
--without-nodegroup. There's a few more things to create
though, and I figured it would be easier if you just used a cfn
template. Here's a guide.

● So, now that I use Kubernetes, am I cloud agnostic?


Your Kubernetes manifest files are 95% cloud agnostic. That is, you
can run the same deployments, services, etc with the exact same
files, but there are a few differences in how every Kubernetes
installation implements them. For example, our service is defined as
type LoadBalancer, and AWS provisions one of its Elastic Load
Balancers for that. If you deployed the same file in GKE, you'd get
Google's load balancer.

Best Practices

Operational Excellence

● Implement health checks: Set up health checks for your


application in the Kubernetes Service manifest to ensure the load
balancer only directs traffic to healthy instances.

● Use Kubernetes readiness and liveness probes: Configure
readiness and liveness probes for your application container to
detect issues and restart unhealthy containers automatically (see
the sketch after this list).
● Monitor application metrics: Use Amazon CloudWatch
Container Insights to get more metrics from your app.

● Set up a Service Mesh: A deployment or two, each running 5


pods, is easy to manage. 10 deployments of (micro)services that
need to talk to one another is 100x more complex. A Service Mesh
solves the networking part of making pods talk to each other,
making it only 50x more complex than a single deployment (it's
actually a big win!)
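
Here's a sketch of the readiness and liveness probes mentioned a few items above, for the cool-nodejs-app container from this chapter. The /healthz path is an assumption, your app needs to actually expose an endpoint like that:

# add under the cool-nodejs-app container in eks-deployment.yaml
readinessProbe:              # only send traffic once the app answers
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:               # restart the container if it stops answering
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 20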

Security

● Use IAM roles: Set up IAM roles for your Fargate profile or for the
EC2 instances, so your resources only have the permissions they
need.

● Restrict network access inside the cluster: Set up Network


Policies to define at the network level what pods can talk to what
pods. Basically, firewalls inside the cluster itself.

● Enable private network access: Use private subnets for your


EKS cluster and configure VPC endpoints for ECR, to ensure that
network traffic between your cluster and the ECR registry stays
within the AWS network.

● Use Secrets: Kubernetes can store secrets in the cluster. Same
as what we've been discussing about Secrets Manager, but in this
case you don't need another AWS service. You can sync them
with Secrets Manager if you want. See the sketch after this list.
● Use RBAC: RBAC means Role-Based Access Controls. It's the
same thing we do with AWS IAM Roles, but for Kubernetes
resources and actions. Don't give everyone admin, use roles for
minimum permissions.
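
A minimal sketch of the Secrets item above. The secret name, key and value are examples:

# Create the secret in the cluster
kubectl create secret generic cool-nodejs-app-secrets \
  --from-literal=API_KEY=supersecretvalue

# Then reference it from the container in eks-deployment.yaml
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: cool-nodejs-app-secrets
        key: API_KEY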

Reliability

● Use Horizontal Pod Autoscaler (HPA): The HPA we defined in
the Kubernetes manifest automatically scales the number of pods
based on CPU utilization, keeping the application responsive
during traffic fluctuations. On Fargate each new pod gets its own
capacity, so scaling pods is all you need. If you run on EC2 node
groups instead, pair the HPA with the Cluster Autoscaler (or
Karpenter) rather than relying on the Auto Scaling Group alone:
the ASG only sees instance-level metrics such as actual CPU
utilization, while the Kubernetes-aware autoscalers see metrics
such as total requested CPU and memory.

● Deploy across multiple Availability Zones (AZs): Ensure that


your application's load balancer and Fargate profiles are
configured to span multiple AZs for increased availability.

● Use rolling updates: Configure your Kubernetes Deployment


manifest to perform rolling updates, ensuring that your application
remains available during updates and new releases. You do this
directly from Kubernetes, like this.
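
A sketch of what pinning the rollout behavior looks like in the Deployment from this chapter; the defaults already do a rolling update, this just makes the settings explicit:

# in the Deployment spec of eks-deployment.yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never drop below the desired replica count
    maxSurge: 1         # add at most one extra pod during the rollout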
Performance Efficiency

● Optimize resource requests and limits: Kubernetes defines


resource requests and limits. The request is the minimum, and it's
guaranteed to the pod (if Kubernetes can't fulfill the request, the
pod fails to launch). From there, a pod can ask for more resources
up to the limit, and they will be allocated if there are resources
available. For example, if you have one node with 2 GB of
memory, you deploy pod1 with request: 1 GB and limit: 1.5 GB,
pod1 will get 1 GB of memory and can request for more up to 1.5
GB. If you then deploy pod2 with request: 1 GB, Kubernetes will fit
both, but now pod1 can't get more memory than its base 1 GB,
because there's no free memory available. Both pod1 and pod2
succeed in launching, because their request can be fulfilled.

● Monitor performance of the app and cluster: Use tools like


AWS X-Ray and CloudWatch to monitor the performance of your
application. Use tools like Prometheus and Grafana to monitor the
performance of the cluster (free resources, etc).

Cost Optimization

● Right-size the pods: As always, avoid over-provisioning


resources.

● Consider Savings Plans: Remember that EKS is priced at


$72/month for the cluster alone, plus the EC2 or Fargate pricing
for the capacity (the worker nodes). Savings Plans apply to that
capacity, just like if EKS wasn't there.
● Use an Ingress Controller: Each Service of type LoadBalancer
creates a new load balancer. Instead of doing that, use regular
Services for internal visibility and a single Ingress Controller per
app (you can run one or multiple apps per cluster). The Ingress
Controller routes all traffic from outside the cluster, with a single
load balancer and one or more routes per service.
Handling Data at Scale with DynamoDB
Use case: Storing and Querying User Profile Data
at Scale

Scenario

As a software company with millions of users, you need to store and query
user profile data in a scalable and reliable way. You could use a relational
database like MySQL or PostgreSQL, but handling that volume is
expensive, and at some point you'll have scalability problems.

Services

Amazon DynamoDB is a fully managed NoSQL database. NoSQL


databases are much more performant than relational databases for simple
queries, and much slower for complex queries and analytics.

Key features of DynamoDB

● Flexible data model: Data is stored as groups of values, called


items. You can retrieve the whole item, or just a few attributes.

● Low latency: DynamoDB can handle millions of requests per


second with single-digit millisecond latency.

● Scalability: You can set Read Capacity Units and Write Capacity
Units separately. You can also set them to auto-scale, or just pay
per request.
● Availability: A DynamoDB table is highly available within a region.
You can also set it to replicate to other regions, with a Global
Table.

● Security: Data is encrypted at rest (natively) and in transit (not


natively). You can set access controls at the field level, and audit
using CloudTrail.

● Streams: DynamoDB streams can trigger behavior in response to


events. This makes building event-driven applications significantly
easier.

Solution

Let's go over how to set up a DynamoDB table for user profiles, and how to
create, query, update and delete user profiles.

● First you need to design the table schema. Wait, schema?


Didn't you say NoSQL? Yeah, but here's the catch: NoSQL doesn't
mean No Schema, it means schema is not enforced by the
database engine. You can put anything in a NoSQL database, but
you shouldn't. You need to plan the schema for your table's use
cases. A good starting point is to map out the data you need to
store for each user, such as their name, email, and any other
relevant information. The next chapter dives a lot deeper into this
topic.

● After that you need to set the primary key for the table. The
PK is a unique identifier for each item in the table, and it is used to
retrieve data from the table. You can choose either a single
attribute (such as the user's email address), or you can use a
composite PK consisting of two attributes (such as the user's
email and a timestamp), where the first one is called partition key
and the second one sort key. It's important to choose a primary
key that will be unique for each user and that will be used to query
the data.

● You may add secondary indexes to your table, which allow you
to query the data in the table using attributes other than the
primary key. You can also do this later.

● Once you have designed the table schema, just create the table
in DynamoDB using the console, the SDK or infrastructure as
code.

● To store user profiles, use the PutItem API. You could also use the
BatchWriteItem API to insert multiple profiles at once.

● To query the profile of a specific user, use the Query API with the
userId as the partition key. You can also use the sort key to further
narrow down the results.

● To update a user profile, use the UpdateItem API to update


specific attributes of a profile without having to rewrite the entire
item. If you do want to rewrite the entire item, use the PutItem API.

● To delete a user profile, use the DeleteItem API.


Best Practices

1. Always query on indexes: When you query on an index,


DynamoDB only reads the items that match the query, and only
charges you for that. When you query on a non-indexed attribute,
DynamoDB scans the entire table and charges you for reading
every single item (it filters them afterwards).

2. Use Query, not Scan: Scan reads the entire table, Query uses an
index. Scan should only be used for non-indexed attributes, or to
read all items. Don't mix them up.

3. Don't read the whole item: Read Capacity Units used are based
on the amount of data. Use projection expressions to define which
attributes will be retrieved, and only get the data you need.

4. Always filter and sort based on the sort key: You can filter and
sort based on any attribute, but only the sort key uses the index.
If you filter or sort on an attribute that isn't the sort key,
DynamoDB still reads (and charges you for) every item that
matches the partition key, and only applies the filter afterwards.
Design your sort keys around how you actually query the data.

5. Set up Local Secondary Indexes: A Local Secondary Index is an


index with the same Partition Key and a different Sort Key. Create
them if you need to filter or sort based on other attributes. For
example, you could create an LSI to filter on the "active" attribute
of your users. LSIs share capacity with the table. There's no extra
pricing, but DynamoDB will use additional write capacity units to
update the relevant indexes.

6. Set up Global Secondary Indexes: A Global Secondary Index is


an index with a different Partition Key and Sort Key, but with the
same data (or just a subset of the data, which saves a ton of
costs). Create them if you need to query based on other attributes.
For example, you could create a GSI on a user's email address,
so you can query based on that attribute. GSIs have separate
capacity from the table. As with LSIs, GSIs don't cost extra per
index, but DynamoDB will use extra write capacity units if you
have indexes.

7. Paginate results: When retrieving large amounts of data, use


pagination to retrieve the data in chunks.

8. Use caching: DynamoDB is usually fast enough (if it's not, use
DAX). However, ElastiCache can be cheaper for data that's
updated infrequently.

9. Mind read consistency: DynamoDB reads are eventually


consistent by default. You can also perform strongly consistent
reads, which cost 2x more (2x the cost in On-Demand mode, 2x
the RCUs in Provisioned mode)

10. Use transactions when needed: Operations are atomic, but if


you need to perform more than one operation atomically, you can
use a transaction. Cost is 2x the regular operation.
11. Use Reserved Capacity: You can reserve capacity units, by
paying upfront or committing to pay monthly, for 1 or 3 years.

12. Prefer Provisioned mode over On-Demand mode:


On-Demand is easier, but over 5x more expensive (without
reserved capacity). Provisioned mode scales pretty fast, try to use
it if your traffic doesn't spike that fast. Also, consider adding an
SQS queue to throttle writes (as you'll see in a future chapter).

13. Monitor and optimize: You're not gonna get it right the first
time (because requirements change). Monitor usage with
CloudWatch, and optimize schema and queries as needed.
Remember secondary indexes.

14. Mind the costs: You're charged per data stored and per
capacity units. For Provisioned mode, one read capacity unit
represents one strongly consistent read per second, or two
eventually consistent reads per second, for an item up to 4 KB in
size; and one write capacity unit represents one write per second
for an item up to 1 KB in size. Optimize frequently. The key here is
to understand how the database will be used and tune it
accordingly (set secondary indexes, attribute projections, etc).
This requires good upfront design and ongoing efforts.

15. Use Standard-IA tables: For most workloads, a standard table


is the best choice. But for workloads that are read infrequently, use
the Standard-IA table class to reduce costs.

16. Back up your data: You can set up scheduled backups, or do


it on-demand.
17. Set a TTL: Some data needs to be stored forever, but some
data can be deleted after some time. You can automate this by
setting a TTL on each item.

18. Design partition keys carefully: Good design is extremely


important in databases, even in NoSQL ones. Pick your partition
key carefully so that load is split across partitions. The next
chapter dives a lot deeper into this topic.

19. Don't be afraid to use multiple databases: DynamoDB is


amazing for repetitive queries with different parameters (e.g.
finding a user by email address), and terrible for complex analytics
(e.g. finding every user that logged in more than once in the past
24 hours). Don't be afraid to use a different database for data or
use cases that don't fit DynamoDB's strengths.
DynamoDB Database Design
Use case: DynamoDB Database Design

Scenario

We're building an e-commerce app, with DynamoDB for the database. The
problem we're tackling in this chapter is how to structure the data in
DynamoDB.

Here's how our e-commerce app works: A customer visits the website,
browses different products, picks some and places an order. They can
apply a discount code (discounts a % of the total) or gift card (discounts a
fixed amount) and pay for the remaining amount by credit card.

Services

● DynamoDB: It's a fully managed NoSQL database. Here's the


trick: NoSQL doesn't mean non-relational or non-ACID-compliant.

Designing the solution

We're going to start from scratch and progressively design the solution.

● First, we're going to understand what we're going to be


saving to the database.
We're focusing on the entities, not the specific fields that an entity
might have, such as Date for the Order. However, in order to
discover all entities, you might want to add some fields. We'll
represent the entities in an Entity Relation Diagram (ERD).
ERD for our e-commerce app

● Next, we'll consider how we're going to access the data.


We need to define our access patterns, since they're going to
determine our structure. This is what the access patterns look like
for our example.
● Get customer for a given customerId
● Get product for a given productId
● Get order for a given orderId
● Get all products for a given orderId
● Get invoice for a given orderId
● Get all orders for a given productId for a given date range
● Get invoice for a given invoiceId
● Get all payments for a given invoiceId
● Get all invoices for a given customerId for a given date
range
● Get all products ordered by a given customerId for a
given date range

● Next, we're going to implement an example of each access


pattern.
This is going to determine how we write our data, and how we
query it.
To follow along this part, you can use NoSQL Workbench and
import the models for every step from this GitHub repo, or just
watch the screenshots.

● 0 - Create a table
If you're working directly on DynamoDB, just create a table with a
PK and SK.
If you're using NoSQL Workbench, use model ECommerce-0.json

● 1 - Get customer for a given customerId


This one is pretty simple. We're just going to store our customers
with PK=customerId and SK=customerId, and retrieve them with a
simple query. Since we're going to add multiple entities to our
table, we'll make sure the Customer ID is "c#" followed by a
number. We're also going to add an attribute called EntityType,
with value "customer".
Use model ECommerce-1.json
Example query: PK="c#12345" and SK="c#12345"
● 2 - Get product for a given productId
Another simple one. We're introducing another entity to our model:
Product. We'll store it in the same way, PK=productId and
SK=productId, but the value for attribute EntityType is going to be
"product". Product ID will start with "p#".
Use model ECommerce-2.json
Example query: PK="p#12345" and SK="p#12345"

● Let's add a bit more data


Use model ECommerce-moredata.json

● 3 - Get all order details for a given orderId and 4 - Get all
products for a given orderId
Here we're introducing two new entities: Order and OrderItem. The
Order makes sense as a separate entity, so it gets an Order ID
starting with "o#". The OrderItem has no reason to exist without an
order, and won't ever be queried separately, so we don't need an
OrderItemID.
Another big difference is that our Partition Key and Sort Key won't
be the same value. An OrderItem is just the intersection of Order
and Product, so when we're querying all Products for a given
Order, we're going to use PK=orderId and SK=productId, and the
attributes for that combination are going to be the quantity and
price of the OrderItem.
Use model ECommerce-3-4.json
Example query for 3 - Get all order details for a given orderId:
PK="o#12345"
Example query for 4 - Get all products for a given orderId:
PK="o#12345" and SK begins_with "p#"
● 5 - Get invoice for a given orderId
We're adding the Invoice entity, which has an Invoice ID starting
with "i#". There's no reason to make the Invoice ID a Partition Key,
the only access pattern we have for Invoices so far is getting the
Invoice for a given Order.
Use model ECommerce-5.json
Example query: PK="o#12345" and SK begins_with "i#"

● 6 - Get all orders for a given productId for a given date range
As you probably know, filtering a DynamoDB table on an attribute that's neither a PK nor an SK is extremely slow and expensive, since DynamoDB has to scan every single item in the table. To solve queries of the type "for a given date range", we need to make that attribute a Sort Key.
If all we're changing is the SK, then we can use a Local Secondary
Index (LSI). In this particular case, since we don't have a way to
query by Product ID where we can get the Orders of that Product
(i.e. there's no item where productId is the PK and the Order data
is in the Attributes), we're going to need to create a Global
Secondary Index (GSI).
Could we do this directly on our main table? Yes, we could, but
we'd be duplicating the data. Instead, we use a GSI, which
projects the existing data.
Use model ECommerce-6.json
Example query (on GSI1): PK="p#99887" and SK between
"2023-04-25T00:00:00" and "2023-04-25T23:59:00"

● 7 - Get invoice for a given invoiceId and 8 - Get all payments for a given invoiceId
We're in the same situation as above: Invoice ID is not a PK, and
we need to query by it. We're going to be adding a new element to
GSI1, where Invoice ID is the PK and SK.
Notice that we're adding the Payments entity, but only as a JSON
object, since we're not going to run any queries on Payments. A
potential query would be how much was paid using gift cards, but
that's not in our list of access patterns. If something like that
comes up in the future, we'll deal with it in the future.
Use model ECommerce-7-8.json
Example query for both access patterns: PK="i#55443" and
SK="i#55443"

● 9 - Get all invoices for a given customerId for a given date range and 10 - Get all products ordered by a given customerId for a given date range
So far, our queries have been "for a given Product/Order/Invoice",
but we haven't done any queries "for a given Customer". Now that
we need Customer ID as a Partition Key, we need to add another
GSI. Each of these two access patterns is going to need a
different Sort Key, but they both have the same structure: "i#date"
and "p#date" respectively. "i#date" means the date of the invoice,
while "p#date" means the date of the order where that product
appears (which may or may not be the same date). This way we
can easily sort by date and grab only a window (a date range),
without having to Scan the table.
Use model ECommerce-9-10.json
Example query for 9 - Get all invoices for a given customerId for a
given date range: PK="c#12345" and SK between "i#2023-04-15"
and "i#2023-04-30"
Example query for 10 - Get all products ordered by a given
customerId for a given date range: PK="c#12345" and SK
between "p#2023-04-15" and "p#2023-04-30"

Final Solution
Access Patterns
Discussion

Let's do this section Q&A style, where my imaginary version of you asks
questions in italics and I answer them. If the real version of you has any
questions, feel free to contact me on LinkedIn!

Why did we use only one table instead of multiple ones?


Because our data is part of the same data set. The concept of Table in
DynamoDB is comparable to the concept of Database in engines like
Postgres or MySQL. If the data is related, and we need to query it together,
it goes into the same DynamoDB Table.
If we were building microservices, each microservice would have its own
Table, because it has its own data.

What are Local Secondary Indexes and Global Secondary Indexes again?
They're data structures that contain a subset of attributes from a table, and
a different Primary Key. Local Secondary Indexes have the same Partition
Key and a different Sort Key, while Global Secondary Indexes have a
different Partition Key and Sort Key. You define the attributes that you want
to project into the index, and DynamoDB copies these attributes into the
index, along with the primary key attributes from the base table. You can
then query or scan the index just as you would query or scan a table.
Queries against the Primary Key are really fast and cheap, and queries that
are not against the Primary Key are really slow and expensive. Indices give
us a different Primary Key for the same attributes, so we can query the
same data in different ways.
If we're mixing entities as PK and SK, why are we separating some
stuff to indices? And why aren't we separating more stuff, or less
stuff?
We're not actually storing new data in indices, we're just indexing the same
data in a different way, by creating a new Primary Key and projecting the
attributes. After step 1 we only had Customer data, then we added support
for the access pattern 2 - Get product for a given productId and we had to
add new data, so it goes into the base table. When we added support for 6
- Get all orders for a given productId for a given date range we didn't add
data, we just needed a new way to query existing data, so we created an
index.

Is this the only possible design? Is it the best design?


No, and I'm not sure, respectively. We could have added Payments as a
separate entity with its ID as Partition Key, instead of as a JSON as part of
the Invoice. Or added Invoices with InvoiceID as Partition Key to the base
table, instead of on a GSI. Both would result in a good design, and while I
prefer it this way, I'm sure someone could reasonably argue for another
way.
You measure solutions on these two characteristics:
- Does it solve all access patterns with indices? If not, add support for all
access patterns.
- Is it the simplest design you can come up with? If not, simplify it.
If the answer is yes for both questions, it's a good design!

How will this design change over time?


Great question! Database design isn't static, because requirements aren't
static. When a new requirement comes up, we need to grow the design to
support it (or just use Scan and pay a ton of money). In this example we
grew the design organically, tackling each requirement one or two at a time,
as if they were coming in every week or few weeks. So, you already know
how to do it! Just remember that new data goes into the base table and
new ways to query existing data need indices (LSI for a new SK, GSI for a
new PK+SK).

If the design depends on the order in which I implement the access patterns, is there anything I can do at the beginning, when I have identified multiple access patterns at the same time?
Yes, there is. Basically, you write down all access patterns, then you
eliminate duplicates, then solve each pattern independently on a notebook
or Excel sheet (in any order that feels natural), then group up
commonalities, and then you simplify the solutions.
I did a bit of it when I grouped up a couple of access patterns and tackled
them in one go, or with one index. However, actually showing the whole
process would need a much more complex example, and we're already
nearing 3000 words on this issue. I'll try to come up with something,
possibly a micro course, if anyone's interested.

Is DynamoDB really this complex?


Well, not really. It's not DynamoDB that's complex, it's data management as
a whole, regardless of where we're storing it. SQL databases seem simpler
because we're more used to them, but in reality you'd have to take that
ERD and normalize it to Boyce-Codd Normal Form (I bet you didn't
remember that one from college!), then start creating indices, then write
complex queries. In DynamoDB all of that is sort of done together with the
data, instead of defined separately as the database structure.
Is DynamoDB more complex? Sometimes, though usually it's simpler (once
you get used to it). Is it harder? Yes, until you get used to it, which is why I
wanted to write about this. Is it faster? For those access patterns, it's much
faster for large datasets. For anything outside those access patterns, it's
much slower for any dataset (until you implement the necessary changes).

Best Practices

Operational Excellence

● Pick the correct read consistency: DynamoDB reads are eventually consistent by default. You can also perform strongly consistent reads, which cost 2x more. There's a short sketch right after this list.

● Use transactions when needed: Operations are atomic, but if you need to perform more than one operation atomically, you can use a transaction. Cost is 2x the regular operations.

● Monitor and optimize: CloudWatch gives some great insights into how DynamoDB is being used. Use this information to optimize your table.
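
As promised, a minimal sketch of a strongly consistent read. The table and key names are assumptions carried over from the design walkthrough:

JavaScript

// Minimal sketch: a read is eventually consistent by default; pass
// ConsistentRead: true (2x the cost) when you need the latest write.
// ConsistentRead is not supported on Global Secondary Indexes.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, GetCommand } = require('@aws-sdk/lib-dynamodb');

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function getCustomerLatest(customerId) {
  const { Item } = await client.send(new GetCommand({
    TableName: 'ECommerce',                    // assumption: your table name may differ
    Key: { PK: customerId, SK: customerId },
    ConsistentRead: true
  }));
  return Item;
}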

Security

● Use IAM Permissions and least privilege: You need to grant permissions for DynamoDB explicitly, using IAM. You can give your IAM Role permissions on only one table, only for some operations, and you can even do it per item or per attribute. Give the minimum permissions needed, not more.
Reliability

● Add an SQS queue to throttle writes: In Provisioned Mode, if you exceed the available Write Capacity Units, your operation will fail. Your backend can retry it, but that increases the response time, and adds even more load to the DynamoDB table. Instead, consider making the write async by pushing all writes to an SQS queue and having a process consume from the SQS queue at a controlled rate. This is especially important when using Lambda functions, since they tend to out-scale DynamoDB Provisioned Mode during big spikes. The next chapter deals with this specific topic.

● Backup your data: You can set up scheduled backups, or do it on-demand. Either way, have backups.

● Consider a Global Table: DynamoDB has a feature called Global Table, which is basically a single, global entity that's backed by regular tables in different regions. It's the best option for any kind of multi-region setup, including disaster recovery.

Performance Efficiency

● Design partition keys carefully: DynamoDB uses multiple nodes behind the scenes, and the partition key is what determines which node stores what element. If you pick the wrong partition key, most requests will go to the same node, and you'll suffer a performance hit. Pick a partition key that's evenly distributed, such as a random ID. Here's a great read on the topic.
● Always query on indexes: When you query on an index,
DynamoDB only reads the items that match the query, and only
charges you for that. When you query on a non-indexed attribute,
DynamoDB scans the entire table and charges you for reading
every single item (it filters them afterwards).

● Use Query, not Scan: Scan reads the entire table, Query uses an
index. Scan should only be used for non-indexed attributes, or to
read all items. Don't mix them up.

● Use caching: DynamoDB is usually fast enough (if it's not, use
DAX). However, ElastiCache can be cheaper for data that's
updated infrequently.

Cost Optimization

● Don't read the whole item: Read Capacity Units used are based
on the amount of data read. Use projection expressions to define
which attributes will be retrieved, so you only read the data you
need.

● Always filter and sort based on the sort key: You can filter and sort based on any attribute. If you do so based on the sort key, DynamoDB uses the index and you only pay for the items it actually reads. If you filter on an attribute that isn't the sort key, DynamoDB first reads every item that matches the rest of the query (or the whole table, if you're using Scan), charges you for all of them, and only applies the filter afterwards.
● Don't overdo it with secondary indexes: Every time you write to
a table, DynamoDB uses additional Write Capacity Units to update
that table's indexes, which comes at an additional cost. Create the
indexes that you need, but not more.

● Use Reserved Capacity: You can reserve capacity units, just like
you'd reserve instances in RDS.

● Prefer Provisioned mode over On-Demand mode: On-Demand is easier, but over 5x more expensive (without reserved capacity). Provisioned mode usually scales fast enough; try to use it if your traffic doesn't spike too fast.

● Consider a Standard-IA table: For most workloads, a standard table is the best choice. But for workloads that are read infrequently, use the Standard-IA table class to reduce costs.

● Set a TTL: Some data needs to be stored forever, but some data
can be deleted after some time. You can automate this by setting
a TTL on each item.

● Consider multiple databases: Each DB engine has strengths and weaknesses. Don't be afraid to use several databases for different use cases.
Using SQS to Throttle Database Writes
Use case: Throttling Database Writes with SQS

Scenario

We're running an e-commerce platform, where people publish products and other people purchase those products. Our backend has some highly scalable microservices running on well-designed Lambdas, and there's a lot of caching involved. Our order processing microservice writes to a DynamoDB table we set up following the previous chapter, using provisioned capacity mode with auto scaling. We did a great job and everything runs smoothly.

Suddenly, someone publishes an ebook titled Node.js on AWS: From Zero to Highly Available and Scalable Hero, and a lot of people rush in to buy it at the same time. Our cache and CDN don't even blink at the traffic, our well-designed Lambdas scale amazingly fast, but our DynamoDB table is suddenly bombarded with writes and the auto scaling can't keep up. Writes are throttled, our order processing Lambda receives ProvisionedThroughputExceededException, and when it retries it just makes everything worse. Things crash. Sales are lost. We recover in the end. How do we make sure it doesn't happen again?

The first option is to change the DynamoDB table to On-demand, which can keep up with Lambda when scaling, but it's over 5x more expensive. The second option is to make sure the table's write capacity isn't exceeded. Let's explore the second option.
Services

● DynamoDB: Our database. We discussed it in the previous 2 chapters. All you need to know for this issue is how provisioned mode works and scales.

● SQS: A fully managed message queuing service that enables you to decouple components. Producers like our order processing microservice post to the queue, the queue stores these messages until they're read, and consumers read from the queue in their own time.

● SES: An email platform, more similar to services like MailChimp than to AWS services. If you're already on AWS and you just need to send emails programmatically, it's easy to set up. If you're not on AWS, need more control, or need to send so many emails that price is a factor, you'll need to do some research. For this example, SES is good enough.
Solution step by step

Remember to replace YOUR_ACCOUNT_ID and YOUR_REGION with your values.

● Create the Orders Queue


1. Go to the SQS console.
2. Click "Create queue"
3. Choose the "FIFO" queue type (not the default Standard)
4. In the "Queue name" field enter "OrdersQueue"
5. Leave the rest as default
6. Click on "Create queue"

● Update the Orders service


We need to update the code of the Orders service so that it sends
the new Order to the Orders Queue, instead of writing to the
Orders table.
This is what the new code looks like:

JavaScript

const AWS = require('aws-sdk');

const sqs = new AWS.SQS();
//FIFO queue URLs end in .fifo
const queueUrl = 'https://sqs.YOUR_REGION.amazonaws.com/YOUR_ACCOUNT_ID/OrdersQueue.fifo';

async function processOrder(order) {
    const params = {
        MessageBody: JSON.stringify(order),
        QueueUrl: queueUrl,
        MessageGroupId: order.customerId,
        MessageDeduplicationId: order.id
    };

    try {
        const result = await sqs.sendMessage(params).promise();
        console.log('Order sent to SQS:', result.MessageId);
    } catch (error) {
        console.error('Error sending order to SQS:', error);
        //Rethrow so the caller can return an error to the client
        throw error;
    }
}

Also, add this policy to the IAM Role of the function, so it can access SQS.
Don't forget to delete the permissions to access DynamoDB!

Unset

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sqs:SendMessage",
"Resource":
"arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue"
}
]
}
● Set up SES to notify the customer
1. Open the SES console
2. Click on "Domains" in the left navigation pane
3. Click "Verify a new domain"
4. Follow the on-screen instructions to add the required
DNS records for your domain.
5. Alternatively, click on "Email Addresses" and then click
the "Verify a new email address" button. Enter the email
address you want to verify and click "Verify This Email
Address". Check your inbox and click the link.

● Set up the Order Processing service


Go to the Lambda console and create a new Lambda function.
Use the following code:

JavaScript

const AWS = require('aws-sdk');


const dynamoDB = new AWS.DynamoDB.DocumentClient();
const ses = new AWS.SES();

exports.handler = async (event) => {


for (const record of event.Records) {
const order = JSON.parse(record.body);
await saveOrderToDynamoDB(order);
await sendEmailNotification(order);
}
};

async function saveOrderToDynamoDB(order) {
    const params = {
        TableName: 'orders',
        Item: order
    };

    try {
        await dynamoDB.put(params).promise();
        console.log(`Order saved: ${order.orderId}`);
    } catch (error) {
        console.error(`Error saving order: ${order.orderId}`, error);
        //Rethrow so the message goes back to the queue and is retried
        //(or ends up in the DLQ), instead of being silently lost
        throw error;
    }
}

async function sendEmailNotification(order) {


const emailParams = {
Source: 'newsletter@simpleaws.dev',
Destination: {
ToAddresses: [order.customerEmail]
},
Message: {
Subject: {
Data: 'Your order is ready'
},
Body: {
Text: {
Data: `Thank you for your order,
${order.customerName}! Your order #${order.orderId} is
now ready.`
}
}
}
};

try {
await ses.sendEmail(emailParams).promise();
console.log(`Email sent: ${order.orderId}`);
} catch (error) {
console.error(`Error sending email for order:
${order.orderId}`, error);
}
}

Also, add this policy to the IAM Role of the function, so it can be triggered
by SQS and access DynamoDB and SES:

Unset

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes"
],
"Resource":
"arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue"
},
{
"Effect": "Allow",
"Action": [
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:DeleteItem"
],
"Resource":
"arn:aws:dynamodb:YOUR_REGION:YOUR_ACCOUNT_ID:table/Ord
ers"
},
{
"Effect": "Allow",
"Action": "ses:SendEmail",
"Resource": "*"
}
]
}

● Make the Orders Queue trigger the Order Processing service


1. In the Lambda console, go to the Order Processing
lambda
2. In the "Function overview" section, click "Add trigger"
3. Click "Select a trigger" and choose "SQS"
4. Select the Orders Queue
5. Set Batch size to 1
6. Make sure that the "Enable trigger" checkbox is checked
7. Click "Add"
● Limit concurrent executions of the Order Processing Lambda
1. In the Lambda console, go to the Order Processing
lambda
2. Scroll down to the "Concurrency" section
3. Click "Edit"
4. In the "Provisioned Concurrency" section, set "Reserved
Concurrency" to 10
5. Click "Save"

Solution explanation

● Create the Orders Queue


This one's pretty self explanatory. There's one important detail
though: You need to create a FIFO queue, not a Standard queue
(default).
There are two types of queues in SQS: Standard queues (the default) are cheaper and nearly infinitely scalable, but they only guarantee at-least-once delivery (meaning you might get duplicates), and order is mostly respected but not guaranteed. FIFO queues are more expensive and don't scale infinitely, but they guarantee ordered, exactly-once processing.
We don't really care about order in this case, but we do care about
duplicates. We could check for duplicates in the code, but since
the OrderProcessing service is horizontally scalable, we could
have two concurrent Lambda invocations processing the same
order, and a race condition where they both write to the database
at the same time, risking data corruption. In that case we'd need to
use transactions.
● Update the Orders service
Out with the code that writes to DynamoDB, in with the code that
writes to SQS! (same with the permissions)
The only interesting aspect here is the value for MessageGroupId.
FIFO queues guarantee order and exactly-once delivery by only
delivering the next message in a MessageGroup after the previous
message has been successfully processed. If we set it to
customerId and a customer makes two orders at the same time,
the second one to come in won't be processed until the first one is
finished processing.
As a side note, we're also setting MessageDeduplicationId to
ensure that if the order gets duplicated upstream, it will be
deduplicated here (i.e. the queue will only keep one message per
unique value of MessageDeduplicationId). I'm purposefully
ignoring how we set the Order ID, so we don't lose focus.

● Set up SES to notify the customer


I don't want to dive too deep here, because SES is not the focus of
this issue. I just added it because it's simple enough to set up, to
send emails using code.

● Set up the Order Processing service


The code to write to DynamoDB is basically the same that you
(theoretically) previously had in the Orders service: It writes the
same data to the same table. The code to hit SES is also rather
simple. What's interesting is that the handler of a Lambda function
triggered by SQS looks different than one triggered by API
Gateway.
● Make the Orders Queue trigger the Order Processing service
There's one interesting value here: Batch size. We could have our Lambda function receive several orders in batches, and process all of them. We could set a batch size of 10 and a concurrent execution limit of 1, instead of a batch size of 1 and a concurrent execution limit of 10. This would be more efficient, since we wouldn't need to wait for and pay for 10 Lambda starts. I didn't do it because I didn't want to lose focus, but you should consider it; there's a sketch of what that handler could look like right after this list.

● Limit concurrent executions of the Order Processing Lambda


This sets a limit on how many lambda invocations are running at
the same time, for this function. There's a caveat though: Your
account has a default limit of 1000 concurrent Lambda executions,
across all Lambda functions. You can increase it by opening a
support ticket, really easy but takes a couple of days to process.
When you set this limit to 10, it's constantly consuming from that
account-wide limit, even when the limited function isn't running.
So, after you set things up like this, if you're processing 0 orders (0
Lambda executions of Order Processing right now), all your other
functions will have an account-wide limit of 990. This doesn't hurt
us at all because we're using a very low limit on only one function,
but it's important to remember if you rely on reserved concurrency
for several functions, and/or with higher limits.
By the way, don't confuse reserved concurrency with provisioned
concurrency. With reserved concurrency you pay the same as
without it, and Lambdas still have cold starts.
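
Here's a minimal sketch of the batched version of the handler, reusing the saveOrderToDynamoDB and sendEmailNotification helpers from the code above. It assumes you raise the trigger's Batch size above 1 and enable "Report batch item failures" on the event source mapping, so SQS only retries the messages that actually failed:

JavaScript

// Minimal sketch of batch processing with partial batch responses.
// Assumes the SQS trigger has Batch size > 1 and "Report batch item failures" enabled.
exports.handler = async (event) => {
  const batchItemFailures = [];
  for (const record of event.Records) {
    try {
      const order = JSON.parse(record.body);
      await saveOrderToDynamoDB(order);   // same helpers as in the code above
      await sendEmailNotification(order);
    } catch (error) {
      console.error('Failed to process message', record.messageId, error);
      //Report only this message as failed; the rest of the batch is deleted
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
};
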
Discussion

Implementation-wise, that wasn't so difficult. There are a few interesting details, but I think I covered them in the solution explanation.
Architecture-wise, there's one big change: We've made our workflow async!
Let me bring the diagram here.

Before, our Orders service would return the result of the order. I put 200 OK
there, but the result could very well be an error. From the user's
perspective, they wait until the order is processed, and they see the result
on the website. From the system's perspective, we're constrained to either
succeed or fail processing the order in the execution limit of the Lambda
function. Actually, we're limited by what the user is expecting: we can't just
show a "loading" icon for 15 minutes.

After the change, the website just shows something like "We're processing
your order, we'll email you when it's ready". That sets a different
expectation! That's important for the system, because now we could
actually have our Lambda function take 15 minutes. It's not just that, though: with the change as is, if the Order Processing Lambda crashes mid-execution, the SQS queue will make the order available again as a message after the visibility timeout expires, and the Lambda service will invoke our function again with the same order. When the maxReceiveCount limit is reached, the order can be sent to another queue called a Dead Letter Queue (DLQ), where we can store failed orders for future reference. We didn't set up a DLQ here, but it's easy enough, and for small and medium-sized systems you can set up SNS to send you an email and resolve the issue manually; the volume shouldn't be that big.

Once the order went through all the steps, failed some, retried, succeeded,
etc, then we notify the user that their order is "ready". This can look
different for different systems, some are just a "we got the money", some
ship physical products, some onboard the user to a complex SaaS. I chose
to do it through email because it's easy and common enough, but you could
use a webhook for example, while keeping the process async.

Best Practices

Operational Excellence

● Monitor and set alarms: You know how to monitor Lambdas. You can monitor SQS queues as well! An interesting alarm to set here would be the number of messages sitting in the queue (the ApproximateNumberOfMessagesVisible metric), so our customers don't wait too long for their orders to be processed.

● Handle errors and retries: Be ready for anything to fail, and architect accordingly. Set up a DLQ, set up notifications (to you and to the user) for when things fail, and above all don't lose/corrupt data.

● Set up tracing: We're complicating things a bit (hopefully for a good reason). We can gain better visibility into that complexity by setting up AWS X-Ray (which is featured in a future chapter).

Security

● Check "Enable server-side encryption": That's all you need to


do for an SQS queue to be encrypted at rest: check that box, and
pick a KMS key. SQS communicates over HTTPS, so you already
have encryption in transit.

● Tighten permissions: The IAM policies in this issue are pretty restrictive, but there's always something to improve.

Reliability

● Set up maxReceiveCount and a DLQ: With a FIFO queue, the next message in a message group won't be available for processing until the previous one is either processed successfully or dropped (to the DLQ, if you set one) after maxReceiveCount attempts. If you don't set these, one corrupted order can block its whole message group. There's an example of the redrive policy right after this list.

● Set visibility timeout: This is how long SQS waits without receiving a "success" (the message being deleted) before assuming the message wasn't processed and making it available again for the next consumer. Set it to at least the timeout of your consumer (the Order Processing Lambda in this case); when Lambda is the consumer, AWS recommends setting it to several times the function timeout.
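
As promised, a minimal sketch of the redrive policy, assuming you created a second FIFO queue called OrdersDLQ.fifo (a hypothetical name; the DLQ for a FIFO queue must itself be FIFO). Conceptually, the policy on OrdersQueue.fifo looks like this; in the console it's the "Dead-letter queue" section when editing the queue:

Unset

{
  "deadLetterTargetArn": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersDLQ.fifo",
  "maxReceiveCount": 5
}
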
Performance Efficiency

● Optimize Lambda function memory: More memory means more money. But it also means faster processing. Going from 30 to 25
seconds won't matter much for a successfully processed order, but
if orders are retried 5 times, now it's 25 seconds we're gaining
instead of 5. Could be worth it, depending on your customers'
expectations.

● Use Batch processing: As discussed earlier, you should consider processing messages in batches.

Cost Optimization

● Provisioned vs. On-demand for DynamoDB: Remember that this could be fixed by using our DynamoDB table in On-demand mode. It's 5x more expensive though. Same goes for relational databases (if we use Aurora, then Aurora Serverless is an option).

● Consider something other than Lambda: In this case, we're trying to get all orders processed relatively fast. If the processing can wait a bit more, an auto scaling group that scales based on the number of messages in the SQS queue can work wonders, for a lot less money.
Transactions in DynamoDB
Use case: Transactions in DynamoDB

Scenario

We're building an e-commerce app with DynamoDB for the database, pretty similar to the one we built for the DynamoDB Database Design chapter. Here's how our database works:
● Customers are stored with a Customer ID starting with c# (for
example c#123) as the PK and SK.
● Products are stored with a Product ID starting with p# (for example
p#123) as the PK and SK, and with an attribute of type number
called 'stock', which contains the available stock.
● Orders are stored with an Order ID starting with o# (for example
o#123) for the PK and the Product ID as the SK.
● When an item is purchased, we need to check that the Product is
in stock, decrease the stock by 1 and create a new Order.
● Payment, shipping and any other concerns are magically handled
by the power of "that's out of scope for this issue" and "it's left as
an exercise for the reader".

There are more attributes in all entities, but let's ignore them.

Services

● DynamoDB: A NoSQL database that supports ACID transactions, just like any SQL-based database.
Solution

The trick here is that we need to read the value of stock and update it
atomically. Atomicity is a property of a set of operations, where that set of
operations can't be divided: it's either applied in full, or not at all. If we just
ran the GetItem and PutItem actions separately, we could have a case
where two customers are buying the last item in stock for that product, our
scalable backend processes both requests simultaneously, and the events
go down like this:
1. Customer123 clicks Buy
2. Customer456 clicks Buy
3. Instance1 receives request from Customer123
4. Instance1 executes GetItem for Product111, receives a stock value
of 1, continues with the purchase
5. Instance2 receives request from Customer456
6. Instance2 executes GetItem for Product111, receives a stock value
of 1, continues with the purchase
7. Instance1 executes PutItem for Product111, sets stock to 0
8. Instance2 executes PutItem for Product111, sets stock to 0
9. Instance1 executes PutItem for Order0046
10. Instance1 receives a success, returns a success to the
frontend.
11. Instance2 executes PutItem for Order0047
12. Instance2 receives a success, returns a success to the
frontend.
The process without transactions

The data doesn't look corrupted, right? Stock for Product111 is 0 (it could
end up being -1, depends on how you write the code), both orders are
created, you received the money for both orders (out of scope for this
issue), and both customers are happily awaiting their product. You go to the
warehouse to dispatch both products, and find that you only have one in
stock. Where did things go wrong?

The problem is that steps 4 and 7 were executed separately, and Instance2
got to read the stock of Product111 (step 6) in between them, and made the
decision to continue with the purchase based on a value that hadn't been
updated yet, but should have. Steps 4 and 7 need to happen atomically, in
a transaction.

First, install the packages from the AWS SDK V3 for JavaScript:

Unset

npm install @aws-sdk/client-dynamodb @aws-sdk/lib-dynamodb

This is the code in Node.js to run the steps as a transaction (you should
add this to the code imaginary you already has for the service):

JavaScript

const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, TransactWriteCommand } = require('@aws-sdk/lib-dynamodb');

const dynamoDBClient = new DynamoDBClient({ region: 'us-east-1' });
const dynamodb = DynamoDBDocumentClient.from(dynamoDBClient);
//The code imaginary you already has

//This is just some filler code to make this example valid.
//Imaginary you should already have this solved
const newOrderId = 'o#123' //Must be unique
const productId = 'p#111' //Comes in the request
const customerId = 'c#123' //Comes in the request

//A transaction can't include two actions on the same item, so the stock
//check goes as a ConditionExpression on the Update itself, instead of a
//separate ConditionCheck on the Product.
const transactItems = {
  TransactItems: [
    {
      Update: {
        TableName: 'SimpleAwsEcommerce',
        //Products are stored with the Product ID as both PK and SK
        Key: { PK: productId, SK: productId },
        ConditionExpression: 'stock > :zero',
        UpdateExpression: 'SET stock = stock - :one',
        ExpressionAttributeValues: {
          ':zero': 0,
          ':one': 1
        }
      }
    },
    {
      Put: {
        TableName: 'SimpleAwsEcommerce',
        //Orders are stored with the Order ID as PK and the Product ID as SK
        Item: {
          PK: newOrderId,
          SK: productId,
          customerId: customerId
        }
      }
    }
  ]
};

const executeTransaction = async () => {
  try {
    const data = await dynamodb.send(new TransactWriteCommand(transactItems));
    console.log('Transaction succeeded:', JSON.stringify(data, null, 2));
  } catch (error) {
    console.error('Transaction failed:', JSON.stringify(error, null, 2));
  }
};

executeTransaction();

//Rest of the code imaginary you already has


Here's how things may happen with these changes, if both customers click
Buy at the same time:

1. Customer123 clicks Buy

2. Customer456 clicks Buy

3. Instance1 receives request from Customer123

4. Instance2 receives request from Customer456

5. Instance1 executes a transaction:

1. Update for Product111 with condition stock > 0: the condition passes (actual value is 1), stock is set to 0

2. Put for Order0046

3. Transaction succeeds, it's committed.

6. Instance1 receives a success, returns a success to the frontend.

7. Instance2 executes a transaction:

1. Update for Product111 with condition stock > 0: the condition fails (actual value is 0)

2. Transaction fails, it's aborted. Nothing is written.

8. Instance2 receives an error, returns an error to the frontend.


The process with transactions

Solution explanation

DynamoDB is so scalable because it's actually a distributed database,


where you're presented with a single resource called Table, but behind the
scenes there's multiple nodes that store the data and process queries.
DynamoDB is highly available (meaning it can continue working if an
Availability Zone goes down) because nodes are distributed across
availability zones, and data is replicated. You don't need to know this to use
DynamoDB, but now that you do, you see that transactions in DynamoDB
are actually distributed transactions.

DynamoDB implements distributed transactions using Two-Phase Commit


(2PC). This strategy is pretty simple: All nodes are requested to evaluate
the transaction to determine whether they're capable of executing it, and
only after all nodes report that they're able to successfully execute their
part, the central controller sends the order to commit the transaction, and
each node does the actual writing, affecting the actual data. For this
reason, all operations done in a DynamoDB transaction consume twice as
much capacity.

Transactions can span multiple tables, but they can't be performed on


indexes. Also, data propagation to Global Secondary Indexes and
DynamoDB Streams happens after the transaction, and isn't part of it.

Transaction isolation (the I in ACID) is achieved through optimistic


concurrency control. This means that multiple transactions can be executed
concurrently, but if DynamoDB detects a conflict, one of the transactions
will be rolled back and the caller will need to retry the transaction.
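
Here's a minimal sketch of what that retry could look like, with exponential backoff and jitter. It assumes you change executeTransaction from the Solution section to rethrow the error instead of only logging it, so the wrapper can see the failure:

JavaScript

// Minimal retry sketch for cancelled transactions, with exponential backoff and jitter.
// Usage: await withRetries(() => executeTransaction());
async function withRetries(fn, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // TransactionCanceledException covers both conflicts and failed conditions.
      // In a real system you'd inspect error.CancellationReasons and only retry
      // TransactionConflict, not ConditionalCheckFailed (retrying "out of stock" is pointless).
      if (error.name !== 'TransactionCanceledException' || attempt === maxAttempts) {
        throw error;
      }
      const backoffMs = Math.random() * 100 * 2 ** attempt; // jitter
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}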

Discussion

The whole point of this issue (which I've been trying to make for the past
couple of weeks) is that SQL databases shouldn't be your default. I'll make
one concession though: If all your dev team knows is SQL databases, just
go with that unless you have a really strong reason not to.

So far I've shown you that DynamoDB can handle an e-commerce store
just fine, including ACID-compliant transactions. This one's gonna blow
your mind: You can actually query DynamoDB using SQL! Or more
specifically, a SQL-compatible language called PartiQL. Amazon developed
PartiQL as an internal tool, and it was made generally available by AWS. It
can be used on SQL databases, semi-structured data, or NoSQL
databases, so long as the engine supports it.

With PartiQL you could theoretically change your Postgres database for a
DynamoDB database without rewriting any queries. In reality, you need to
consider all of these points:

● Why are you even changing? It's not going to be easy.

● How are you going to migrate all the data?

● You need to make sure no queries are triggering a Scan in


DynamoDB, because we know those are slow and very expensive.
You can use an IAM policy to deny full-table Scans.

● Again, why are you even changing?

I'm not saying there isn't a good reason to change, but I'm going to assume
it's not worth the effort, and you'll have to prove me otherwise. Remember
that replicating the data somewhere else for a different access pattern is a
perfectly valid strategy (in fact, that's exactly how DynamoDB GSIs work).
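
To make the PartiQL point concrete before we move on, here's a minimal sketch of a SQL-style query against the table from the previous chapters, using ExecuteStatementCommand. The table and key attribute names are assumptions; also keep in mind that a statement whose WHERE clause doesn't hit the key results in a full table Scan, which is exactly what we want to avoid:

JavaScript

// Minimal sketch: querying DynamoDB with PartiQL.
const { DynamoDBClient, ExecuteStatementCommand } = require('@aws-sdk/client-dynamodb');

const client = new DynamoDBClient({});

async function getCustomerWithSql(customerId) {
  const result = await client.send(new ExecuteStatementCommand({
    Statement: 'SELECT * FROM "SimpleAwsEcommerce" WHERE PK = ? AND SK = ?',
    Parameters: [{ S: customerId }, { S: customerId }]
  }));
  return result.Items; // e.g. getCustomerWithSql('c#123')
}
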
Best Practices

Operational Excellence

● Monitor transaction latencies: Monitor latencies of your


DynamoDB transactions to identify performance bottlenecks and
address them. Use CloudWatch metrics and AWS X-Ray to collect
and analyze performance data.

● Error handling and retries: Handle errors and implement


exponential backoff with jitter for retries in case of transaction
conflicts.

Security

● Fine-grained access control: Assign an IAM Role to your


backend with an IAM Policy that only allows the specific actions
that it needs to perform, only on the specific tables that it needs to
access. You can even do this per record and per attribute. This is
least privilege.

Reliability

● Consider a Global Table: You can make your DynamoDB table


multi-region using a Global Table. Making the rest of your app
multi-region is more complicated than that, but at least the
DynamoDB part is easy.
Performance Efficiency

● Optimize provisioned throughput: If you're using Provisioned


Mode, you'll need to set your Read and Write Capacity Units
appropriately. You can also set them to auto-scale, but it's not
instantaneous. Remember the previous chapter on throttling writes
with an SQS queue.

Cost Optimization

● Optimize transaction sizes: Minimize the number of items and


attributes involved in a transaction to reduce consumed read and
write capacity units. Remember that transactions consume twice
as much capacity, so optimizing the operations in a transaction is
doubly important.
Serverless web app in AWS with Lambda
and DynamoDB
Use case: Serverless web app (Lambda +
DynamoDB)

Scenario

You're building a web application, and you're not sure whether you'll get 1
or 1000 users in your first week. Traffic is not going to be consistent,
because your app depends on trends you can't predict. You think
serverless is a good choice (and you're right!), you know the details about
Lambda, S3, API Gateway and DynamoDB, but you're not sure how
everything fits together in this case.

Services

● Lambda: Our serverless compute layer. You put your code there,
and it automagically runs a new instance for every request, scaling
really fast and really well.

● S3: Very cheap and durable storage. We'll use it to store our
frontend code in this case.

● API Gateway: Exposes endpoints to the world, with security and


lots of good engineering features. We'll use it to expose our
Lambda functions.

● DynamoDB: Managed No-SQL database.


Solution

1. Design your app


Before you start building, it's essential to have a clear
understanding of how your application's different components will
fit together. In a serverless architecture, your application can be
split into multiple smaller functions, each with its own specific task,
which can be executed independently.
So, the first step is to identify the things your app needs to do,
group them into functions, and understand how they talk to each
other. This process is called service design, and the first chapter of
this book dives deeper into it.
In our example, we'll just have a single Lambda function that talks
to DynamoDB.

2. Design your DynamoDB table(s)


No-SQL doesn't mean no structure (it doesn't even mean
non-relational, or non-ACID-compliant). Understand what data
you'll need to store and how you're going to read it. Decide on a
structure, a partition key and sort key, and Global and Local
Secondary Indexes wherever they're needed. Two chapters back,
we discussed this topic in detail.
In our example, we'll just have a single DynamoDB table with
partition key id and no sort key, to keep things simple.

3. Host your static website on S3


The gist of it is that you're hosting your static website's files on an
S3 bucket, and using it as a web server. S3 website endpoints don't support HTTPS, so you'll use CloudFront for that and as a CDN. Here's how to do it.
In our example, the details don't matter much, but it'll look just like
in the tutorial: S3 bucket with the files, domain in Route 53,
CloudFront used to expose the site with an SSL certificate from
ACM.

4. Create your DynamoDB table: You designed the data part. Now
create the table, and configure the details such as capacity.
In our example, we'll leave it as On Demand. It's basically the full
serverless mode, it costs over 5x more per request but scales
instantly. We're picking on demand because we don't know our
traffic patterns and because it's simpler, we can always optimize
later, with actual data.

5. Create your Lambda function: Give it a name, give it an IAM


Role with permissions to access your DynamoDB table, put the
table name in an environment variable, and put your code in.

6. Create your API Gateway API: First, create an HTTP API. Then
create the routes. Then create an integration with your Lambda
function. Finally, attach that integration to your routes.
In our example, you should create the routes GET /items/{id}, GET
/items, PUT /items and DELETE /items/{id}.
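
Here's a minimal sketch of what the single Lambda function behind those routes could look like. It assumes the HTTP API's default payload format version 2.0 (which gives you event.routeKey and event.pathParameters) and a table whose name is passed in a TABLE_NAME environment variable; both are assumptions you'd adapt to your setup:

JavaScript

// Minimal sketch: one Lambda function serving all four HTTP API routes.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, GetCommand, ScanCommand, PutCommand, DeleteCommand } = require('@aws-sdk/lib-dynamodb');

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TableName = process.env.TABLE_NAME;

exports.handler = async (event) => {
  switch (event.routeKey) {
    case 'GET /items/{id}': {
      const { Item } = await client.send(new GetCommand({ TableName, Key: { id: event.pathParameters.id } }));
      return { statusCode: 200, body: JSON.stringify(Item) };
    }
    case 'GET /items': {
      // Scan is fine here because the intent really is "read all items"
      const { Items } = await client.send(new ScanCommand({ TableName }));
      return { statusCode: 200, body: JSON.stringify(Items) };
    }
    case 'PUT /items': {
      const item = JSON.parse(event.body);
      await client.send(new PutCommand({ TableName, Item: item }));
      return { statusCode: 200, body: JSON.stringify(item) };
    }
    case 'DELETE /items/{id}': {
      await client.send(new DeleteCommand({ TableName, Key: { id: event.pathParameters.id } }));
      return { statusCode: 204 };
    }
    default:
      return { statusCode: 404, body: 'Not found' };
  }
};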

Discussion

We're building a standard serverless app, so there's not much to discuss


other than serverless vs serverful. Here are the key points on that:
● Operations: In serverless you don't manage the servers, so
there's way less ops work.

● Scalability: It's pretty easy to make a serverful app scale with an


Auto Scaling Group. But there's always that delay in scaling, even
when using ECS or EKS on Fargate. In serverless there's also a
delay (called cold start), but it's significantly smaller.

● Cost: It's more expensive per request, period. The final bill can
come out cheaper because you have 0 unused capacity (unused
capacity is what you waste when your EC2 instance is using 5% of
the CPU and you're paying for 100%). Unused capacity tends to
decrease a lot as apps grow, because we understand traffic
patterns better and because traffic variations are not as
proportionally big (it's easier to go from 10.000 to 11.000 than from
0 to 1.000, even though the increase is 1.000 in both cases).

● Development speed: AWS taking care of compute removes a lot


of work on our side, which means we can develop faster.

● Developer experience: The infrastructure work that remains is


typically pushed to developers. Serverless developers usually like
this. Non-serverless developers either hate it and only want to
write application code, or want to become serverless developers,
there's no middle ground. Keep this in mind when hiring.

● Optimization: There's usually a lot to optimize around fine


implementation details. I try to give you the tools to do this, but you
should consider hiring a consultant for a couple of hours a week.
Best Practices

Operational Excellence

● Something something Infrastructure as Code: You knew it was


coming. I'll never stop saying it until the AWS web Console is
deprecated (ok, that might be a bit extreme). For serverless
solutions you can check out AWS SAM, which builds on top of
CloudFormation, or the Serverless Framework, a great declarative
option. Or use your regular favorite tool.

● Use asynchronous processing: To improve scalability and


reduce costs, consider using asynchronous processing for things
that don't need to happen in real-time. For example, you can use
Lambda functions triggered by S3 events to process uploaded files
in the background, or you can use SQS to queue up messages for
later processing by Lambda functions.

● Monitor the app: Use CloudWatch to monitor your app's metrics,


logs, and alarms. Set up X-Ray to trace requests through your
application and identify bottlenecks and errors.

● Automate deployment and testing: Basically, use a CI/CD


pipeline. Since there's going to be multiple functions, you'll want a
pipeline for each.

Security

● Set up Authentication: Use API Gateway to set up


authentication, so your API endpoints are not public. You can use
Cognito, or your own custom authorizer. This is the topic of the
second chapter of this book.

● Use WAF: Set up Web Application Firewall in your API Gateway


APIs.

● Encrypt data at rest: Both S3 and DynamoDB encrypt data by


default, so you're probably good with this. AWS manages the
encryption key though, you can change the configuration to use a
key you manage.

● Encrypt data in transit: DynamoDB already encrypts data in


transit. For your static website in S3, set up CloudFront with an
SSL certificate.

● Implement least privilege access: Give your Lambda functions


an IAM Role that lets them access the resources they need, such
as DynamoDB. Only give them the permissions they actually
need, for example give the role read permissions on table1,
instead of giving it * permissions on all DynamoDB.

Reliability

● Implement retries and circuit breakers: There's a lot that can


(and will) go wrong, such as network errors or a service throttling.
To recover from these failures, implement retries in your code (use
randomized exponential backoff). To prevent these failures from
cascading, implement circuit breakers.
● Use Lambda versions: You can create versions of your Lambda
functions, and point API Gateway routes to a specific version. That
way, you can deploy and test a new version without disrupting
your prod environment, and make the switch once you're confident
it works. You can also roll back changes.

● Use canary releases: Versions also allow you to have API


Gateway send a small part of the traffic to the new version, so you
can test it with real data without impacting all of your users (it's
better if it fails for 1% of the users than for 100%).

● Backup data in DynamoDB: Stuff happens, and data can get


lost. To protect from that, use DynamoDB backups. If it does
happen, use point-in-time recovery to restore the data.

Performance Efficiency

● Use caching: To reduce latency and improve performance,


consider adding caches. For example, you can use CloudFront to
cache your static website, and even for API Gateway responses.
DynamoDB usually has a really fast response time, but if it's not
fast enough for you, you can use DAX.

● Optimize database access: Review the previous three chapters


for more information on this.

● Rightsize your Lambdas: You get to pick the amount of memory


a Lambda function has, and the CPU power (and the price!) is tied
to that. Too small, and your Lambda runs slow. Too big, and it gets
super expensive. You can use a profiling tool on your code running
locally to determine how much memory it needs, but often it's
simpler to just try different values until you find the sweet spot. Do
this semi-regularly though, since these requirements can change
as your code evolves.

● Minimize response payload: You can improve response times by


sending less data in the response. If you have multiple use cases
for the data, and some need a lot of it and some just a summary,
it's a good idea to create a new endpoint for the summary, even if
you could solve it by querying the other endpoint and summarizing
the data in the front end.

● Optimize cold starts: Lambda functions actually have a few instances running, and launch more when needed. The time a new Lambda instance takes to launch is called a cold start: it's spent setting up the execution environment and running the code that sits outside the handler function (plus some things AWS needs to do). Anything you put there will run only once per instance, and the values are cached for all future invocations that use that instance. So, put all initialization code there, but try to optimize it so your new Lambda instances don't take too long to start. There's a small sketch right below.
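
A minimal sketch of that split; the client and table name below are just placeholders:

JavaScript

// Everything above the handler runs once per execution environment
// (during the cold start) and is reused by later invocations.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient } = require('@aws-sdk/lib-dynamodb');

// Initialization code: runs once per cold start
const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TableName = process.env.TABLE_NAME;

exports.handler = async (event) => {
  // Handler code: runs on every invocation, reusing the client above
  return { statusCode: 200 };
};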

Cost Optimization

● Rightsize your Lambdas (again): Finding the right size can save
you a significant amount of money, on top of saving you a
significant amount of headaches from Lambdas not performing
well.
● Monitor usage: Use CloudWatch to monitor and analyze usage
patterns. Use Cost Explorer to monitor costs and figure out where
your optimization efforts can have the most impact.

● Use DynamoDB Provisioned mode: It's much cheaper per


operation! You just need to figure out how much capacity you
actually need, and deal with throttling due to insufficient capacity
(while you're waiting for it to scale). More often than not, you
should start with On Demand because it's easier, then consider
moving to Provisioned because once you get some traffic it makes
a significant difference. Not all workloads are suited for
Provisioned mode, but most are.

● Go serverful: Wait, what? We were talking about serverless!


Yeah, but not every workload is well suited for serverless. I
intentionally proposed a scenario that is, but that might not be your
case. Here's the trick though: you don't have to go all in on
serverless or on serverful. You can split your workload, and use
whatever makes more sense for each part. This will mean
increased operational efforts though, because you're effectively
maintaining two different architectures. And if they mingle, there
are scenarios to consider, such as a Lambda function scaling to
1000 concurrent invocations in 10 seconds, which all try to hit a
service in a poor EC2 instance. Just keep it in mind, you're not
married to serverless just because you've been using it for a few
years.
● Use Savings Plans: Commit to a certain usage, do a zero, partial
or total upfront payment, and enjoy great discounts on your
compute resources. Yes, it does work for Lambda.

● Use Provisioned Concurrency: This is like serverful serverless.


You pay to keep a certain number of Lambda instances always
running, but you pay a lower price. Furthermore, it's great for
ensuring a minimum capacity.
20 Advanced Tips for Lambda
Use Case: Efficient Serverless Compute

AWS Service: AWS Lambda

Lambda is a serverless compute service that runs your code in response to


events and automatically manages the underlying compute resources for
you. It's the most basic serverless building block in AWS.

How it works

● You create a function and write the code that goes in it.

● You set up a trigger for that function, such as an HTTP request

● You configure CPU and memory, and give that function an


execution role with IAM permissions

● When the trigger event occurs, an isolated execution starts,


receives the event, runs the code, and returns

● You only pay for the time the code was actually running

Obviously, that code runs somewhere. The point is that you don't manage
or care where ('cause it's serverless, you see). Every time a request comes
in, Lambda will either use an available execution environment or start a
new one. That means Lambda scales up and down automatically and
nearly instantly.
Fine details

● Supported languages are: Node.js, TypeScript, Python, Ruby,


Java, Go, C# and PowerShell. Use a custom runtime for other
languages.

● Lambda functions can be invoked from HTTP requests, in


response to events from other services, or at defined time intervals
(cron jobs).

● Billing is calculated as execution time * assigned memory (GB-seconds), plus a fixed charge per invocation. CPU capacity is tied to memory. There's a quick worked example right after this list.

● Lambdas aren't actually instantaneous, there's a cold start (time to


start the execution environment). Check the tips below for how to
mitigate it.

● Logs are automatically generated and sent to CloudWatch Logs.
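
As promised, a rough worked example of the billing formula. The prices are assumptions (x86, us-east-1, at the time of writing, ignoring the free tier); always check the current Lambda pricing page:

JavaScript

// Rough cost sketch for the billing formula above.
const memoryGB = 0.5;            // 512 MB assigned
const durationSeconds = 0.2;     // 200 ms average execution
const invocations = 1_000_000;   // per month

const gbSeconds = memoryGB * durationSeconds * invocations;   // 100,000 GB-s
const computeCost = gbSeconds * 0.0000166667;                 // ~$1.67 (assumed price per GB-s)
const requestCost = (invocations / 1_000_000) * 0.20;         // $0.20 (assumed price per 1M requests)

console.log(`~$${(computeCost + requestCost).toFixed(2)} per month`); // ~$1.87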

Best Practices
The most important tip is that you don't need to do everything in this
list, and you don't need to do everything right now. But take your time
to read it, I bet there's at least one thing in there that you should be doing
but aren't.

● Lambdas don't run in a VPC, unless you configure them for that.
You need to do that if you want to access VPC resources, such as
an RDS or Aurora database. The next chapter is about this topic.
● Use environment variables.

● If you need secrets, put them in Secrets Manager and put the
secret name in an environment variable.

● As always, grant minimum permissions only.

● Use versioning and aliases, so you can do canary deployments,


blue-green deployments and rollbacks.

● Use Lambda layers to reuse code and libraries.

● For constant traffic, Lambda is more expensive than anything


serverful (e.g. ECS). The benefit of Lambda is in scaling out
extremely fast, and scaling in to 0 (i.e. if there's no traffic you don't
pay).

● Not everything needs to be serverless. Choose the best runtime


for each service.

● Use Lambda Power Tuning to optimize memory and concurrency


settings for better performance and cost efficiency.

● Set provisioned concurrency to guarantee a minimum number of


hot execution environments, so you don't deal with cold starts for
those environments.

● There's an account limit for concurrent Lambda executions. To


ensure a particular function always has available capacity, use
reserved concurrency. It also serves as a maximum concurrency
for that function.

● Use compute savings plans to save money.


● Already using containers? Run your containerized apps in
Lambda. You could also consider ECS Fargate.

● Don't use function URLs; if you want to trigger functions from HTTP(S) requests, use API Gateway instead. Here's a tutorial.

● If you have Lambdas that call other Lambdas, monitoring and tracing is a pain unless you use AWS X-Ray. This is the topic of a future chapter.

● Use SnapStart to improve cold starts by 10x (only in Java, for


now...)

● Code outside the handler only runs on environment initialization,


not on every invocation. Put everything you can there, such as
initializing the SDK.

● Reduce request latency with global accelerators.

● If you're processing data streams from Kinesis or DynamoDB,


configure parallelization factor to process a shard with more than
one simultaneous Lambda invocation.
Secure access to RDS and Secrets
Manager from a Lambda function
Use case: Secure access to RDS and Secrets
Manager from a Lambda function

Scenario

You've deployed a database in RDS and a Lambda function that needs to


talk to that database. You've read the previous chapter, and you learned to
put the database password in Secrets Manager. Then you put the Lambda
in the database VPC, and everything broke. We're going to fix that.

Services

● Lambda: Serverless compute. We won't dive deep, we've done


that already in the previous chapter.

● RDS: Managed Relational Database Service. We'll talk a bit more


about it in a future chapter.

● Secrets Manager: Just a service where you store encrypted


strings (like passwords) and access them securely.

● VPC: A virtual network where your RDS instance is placed. This is


the focus of this chapter.
Solution

Note: If you're actually facing this scenario, these steps will cause
downtime. I wrote them in this order to make it easier to understand the
final solution, but if you need to fix this specific problem, let me know and I'll
help you.

What the solution looks like

Architecture diagram of a Lambda function in a VPC, with Secrets Manager and RDS
How to build the solution

1. First, we're going to “put the Lambda in the VPC”


To do that, go to the Lambda service, choose your Lambda, click
on Configuration, on the left click VPC, and click Edit. Select your
VPC, pick a few subnets, a Security Group (we'll get back to
security groups) and click Save.
Our Lambda function still runs on AWS's shared servers used for
Lambda, but it now has an IP address in that VPC's address
space. This is important because now our Lambda function can
access the RDS instance by sending packets through the VPC
instead of the public internet (faster and more secure).

2. Now we broke internet access for our Lambda!


It turns out Lambdas that are “not in a VPC” actually reside in a
“secret” VPC with internet access, and when we moved our
Lambda to our VPC we broke that. If our Lambda needs to access
the internet, here's how to fix that problem. If all our Lambda
needs to access is other AWS services, don't bother with internet
access, read on.

3. We also broke our Lambda function's access to Secrets


Manager, and we definitely care about that.
To fix it, we're going to add a VPC Interface Endpoint, which is like giving an AWS service a private IP address in our VPC, so we can call that service on that address (we can't use the public endpoint, since we broke internet access).
Go to the VPC console, select "Endpoints" and click "Create
Endpoint". Choose the
"com.amazonaws.{region}.secretsmanager" service name, select
your VPC, the subnets that you picked for the Lambda function,
choose a security group (we'll get back to security groups) and
click Save.

4. Now that we've got everything inside the same VPC, we just need
to allow traffic to reach the different components (while
blocking all other traffic).
Security Groups are these really cool firewalls that'll let us do that.
First, create a security group called SecretsManagerSG and
associate it with the VPC Endpoint, another one called LambdaSG
and associate it with the Lambda function, and another one called
DatabaseSG and associate it with the RDS instance.
Next, edit the DatabaseSG to allow inbound traffic to the database
port (5432, 3306, etc) originating from the LambdaSG.
Finally, edit the SecretsManagerSG to allow inbound traffic over all
protocols and ports originating from the LambdaSG.

5. That's it for the networking part. Now we just need to configure


the proper permissions using IAM Roles.
Go to the IAM service, click “Roles”, and click “Create Role”.
Choose "AWS service" as the trusted entity and select "Lambda"
as the service that will use this role. Click "Next: Permissions",
click "Create policy", and use the following sample policy (replace
the values between {} with your actual values). Save the policy,
save the role with that policy associated, and configure the
Lambda function to use that IAM Role.
Unset

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds-db:connect"
            ],
            "Resource": [
                "{database-arn}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:{region}:{account_id}:secret:{secret name}"
            ]
        }
    ]
}
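
To tie the pieces together, here's a rough sketch of what the Lambda function's code might look like once the networking and permissions above are in place. It assumes a PostgreSQL database and the 'pg' client library (not part of the AWS SDK), and the secret name and its JSON fields are placeholders; adapt them to your setup. Note that the client and connection pool live in the init code, so they're reused across invocations.

JavaScript

const AWS = require('aws-sdk');
const { Pool } = require('pg'); // assumed client library, bundled with the function

const secretsManager = new AWS.SecretsManager();
let pool; // created once per execution environment, reused across invocations

const getPool = async () => {
  if (!pool) {
    // This call goes through the Secrets Manager VPC endpoint we created
    const secret = await secretsManager
      .getSecretValue({ SecretId: 'my-database-secret' }) // placeholder secret name
      .promise();
    const { host, port, username, password, dbname } = JSON.parse(secret.SecretString);
    pool = new Pool({ host, port, user: username, password, database: dbname });
  }
  return pool;
};

exports.handler = async (event) => {
  const db = await getPool();
  // Placeholder query, just to show the connection being used
  const { rows } = await db.query('SELECT NOW()');
  return rows[0];
};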

Discussion

I've deviated a bit from the simplest solution that works, and went for the
simplest solution that is properly configured.
Do we really need all of that? Yes, we do. Here's why:

● Putting the Lambda in the VPC: For performance and security. If you don't put the Lambda in a VPC, it can only connect to your RDS instance through the public internet. On one hand, that's a lot more network hops, which makes it slower. On the other hand, you'd need to open up your database to the internet (networking-wise; the password is still required), which is less secure. The password alone might be enough in most cases, but you don't want to rely on it alone. Adding multiple defense layers (network and authentication in this case) is a strategy called defense in depth.

● Adding a VPC Endpoint: This one's pretty similar, but in reverse.


Secrets Manager is a public service, which means it resides in the
AWS shared VPC (as opposed to residing in a VPC you own, like
an EC2 or RDS instance). If you want to connect to a public
service, you can do it from another public service (e.g. Lambda
when the function is not in a VPC), through the internet, or through
a VPC Endpoint in your VPC. The first one's no longer an option
(the Lambda function is either in your VPC or in the shared VPC, it
can't be in both). You could add a route to a NAT Gateway in your
subnets, so the Lambda function can reach the internet (in fact,
you need to do it if your function needs internet access), but the
traffic would go through the internet (bad for performance and
security). Instead, you can give Secrets Manager a private IP
address in your VPC (that's essentially what a VPC Endpoint
does), and have the connection go through your VPC and AWS's
internal network.

● Security Groups: This is essentially about defense in depth. By


restricting the inbound traffic on the RDS Security Group you're
preventing anyone other than your Lambda function from
connecting to the database. That means both evil hackers from
the internet, and even other resources such as EC2 instances that
are inside your VPC. The reasoning behind this is that if you have
multiple resources (e.g. multiple Lambdas or instances), it's
possible that one of them gets compromised, and you want to limit
what an evil hacker with access to that resource can do. Note that
you're not actually restricting traffic to that specific Lambda
function, but rather to any resource with the LambdaSG security
group.

● IAM Roles: Access to AWS services is forbidden by default, you


need to allow it explicitly. You could add an IAM Role to your
Lambda function with a policy that allows all actions on all
resources, but that would mean if the function is compromised
(e.g. someone steals your GitHub password and pushes their own
code to the function) they'd have full access to your AWS account.
You can't just deny all actions though, the function needs
permissions. So, you figure out the minimum permissions that the
function needs, review and refine that, and craft an IAM policy that
grants only those permissions (this is called minimum
permissions).
In most cases, the process looks something like this: “The function
needs access to secrets manager and RDS" → “It doesn't need
full access, it just needs to read secrets and connect to a
database" → “It only needs to read this specific secret, and
connect to this specific database" → search for the permissions
needed to read a secret and to connect to RDS → write policy.

Best Practices

Operational Excellence

● Use VPC Flow Logs: Enable VPC Flow Logs to monitor network
traffic in your VPC and identify potential security issues.

● Monitor Service Quotas: Specifically, you need to keep an eye


open for the ENIs quota, since a Lambda in a VPC uses one ENI
per subnet. In this case we're only creating one VPC Endpoint
because we're only accessing one service, but if you create
several, there's a low quota for that as well.

● Single responsibility services: I'm building this with just one


Lambda function for simplicity, assuming your system does exactly
one thing. In reality, your system is going to do multiple things,
which should be split across multiple services (and multiple
Lambda functions). I don't want to dive into service design in this chapter, because it would move the focus out of the technical
aspects of Lambda in a VPC. But if you're implementing this for
real, don't put all your code in a single Lambda.
Security

● Configure CloudTrail and GuardDuty: CloudTrail logs every


action on the AWS API, and GuardDuty scans the CloudTrail logs
for any suspicious activity. There's a step by step guide to
configure it in chapter 4.

● Add Resource Policies: Some resources, such as Secrets


Manager secrets, can have policies that determine who can
access them. Think S3 Bucket Policies, these are the same but for
Secrets Manager. Consider who needs access (probably just your
Lambda and the password rotation Lambda) and write a policy
that restricts access.

● Add VPC Endpoint Policies: Same idea as resource policies, but


for the VPC Endpoint (instead of the resource that is accessed
through that endpoint). It's more common to use them when using
VPC Endpoints to access resources that don't support resource
policies, but in this case they add another layer of defense
(defense in depth). Specifically, you're defending against your own
potential mistakes in the IAM Policy of the Lambda function and in
the resource policy of the secret.

● Use Private Subnets: A public subnet is a subnet with a route


to/from the internet in its route table, a private subnet is a subnet
with no route to/from the internet in its route table. Routes to the
internet are ones that point to an Internet Gateway. You can
change a subnet's route table at any time. You're not going to
access your database from the internet, so you should outright
remove that possibility by placing it in a private subnet. You were
already blocking access from the internet in your security group,
but again, we're doing defense in depth (i.e. protecting ourselves
from our own potential mistakes).

● Regularly rotate database passwords: When was the last time


you rotated a password? Probably way too long ago. Don't risk
your data like that. You can even automate this with a Lambda
function.

● Regularly review and update IAM policies: You should write


your policies with minimum permissions, and I trust you'll do your
best. But we all make mistakes. If you're missing a permission,
you'll notice right away, because things won't work. But if you have
one extra permission that you didn't need, you're only going to
notice that on a regular review or on a post-mortem of a security
incident. I recommend the former method.

Reliability

● Create a Disaster Recovery Plan and test it: Your database is


going to fail at some point. Set up regular backups to protect from
that, and test them frequently. Also, consider what happens if the
AZ where your instance is placed goes down, or if the entire
region goes down. "Too much work, if the AZ or region is down I'm
cool with some downtime" is a perfectly valid strategy, but only if
it's a conscious decision.
● Use one NAT Gateway per AZ (for prod): If you want to give
your Lambda functions access to the internet, you'll need NAT. For
the production environment, use one NAT Gateway per AZ, so
when an AZ goes down your Lambdas can still access the
internet. Of course, in this particular scenario this only makes
sense if your RDS database is also highly available.

Performance Efficiency

● Retrieve secrets and start DB connections in the init code:


Lambda functions have initialization code (outside the handler
method) and execution code (inside the handler method). When a
Lambda function is invoked for the first time, it launches a new
instance, executes the init code (this is called a cold start), and
then passes the invocation event to that instance to start the
execution of the handler method. Future invocations use existing
instances, and the init code is not executed again. If there's no
available instance (e.g. all instances are already executing an
invocation, or they died after waiting several minutes with no
invocations), a new instance is created and the init code is run
again.
How does this help? Simple: If you put the code that starts the DB
connection inside the handler, a new connection is started every
time the function is invoked, and terminated when that execution
ends. Put it in the init code (outside the handler), and the database
connection will get reused across invocations, so long as that
Lambda instance stays alive. Same goes for retrieving the secret
from Secrets Manager.
● Cache what you can: For data that is updated infrequently and/or
you can tolerate not getting the most up to date version, you can
add a cache between the Lambda function and the RDS database.
It's cheaper to read the cached response than to re-calculate the
response on every query, and caches often scale better than
relational databases.

● Throttle database queries: Lambda scales like a beast, RDS


doesn't. If your system is prepared for 100 concurrent users and
you suddenly get 200, your Lambda function is going to scale
seamlessly, but 200 queries to your database at the same time is
going to be too much for your RDS instance. Preemptively
increasing the size doesn't help, you'll be paying more per month
and you'll still hit this problem at some point (be it 200 users, 500,
1000, etc). Instead, for reads you can read from a cache (as
explained above), and for writes you can put the write operations
into a queue and have another Lambda read from that queue at a
more controlled rate. We saw how to do this for DynamoDB 3
chapters ago, and the same solution can be applied to RDS.
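
If you want a starting point for that write-throttling idea, here's a minimal sketch: the front-end Lambda drops write operations onto an SQS queue instead of hitting the database directly, and a separate consumer Lambda (not shown) drains the queue at a controlled rate. The queue URL and message shape are placeholders.

JavaScript

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Placeholder queue URL; use the URL of your own write-buffer queue
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/db-writes';

exports.handler = async (event) => {
  // Instead of writing to RDS here, enqueue the operation.
  // A consumer Lambda with limited batch size and concurrency performs the
  // actual INSERT/UPDATE at a rate your RDS instance can handle.
  await sqs
    .sendMessage({
      QueueUrl: QUEUE_URL,
      MessageBody: JSON.stringify({ operation: 'INSERT', payload: event }),
    })
    .promise();

  return { status: 'queued' };
};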

Cost Optimization

● Use one NAT Instance per VPC (for dev): Dev environments
don't need high availability, so instead of multiple NAT Gateways,
consider a single one. If the AZ fails your dev env will fail, but
that's fine. And I'll do you one better: don't pay $32.40/month for a
NAT Gateway, instead set up a t4g.micro EC2 instance as a NAT
Instance and make it self-healing (as we'll see in a future chapter).
Monitor and Protect Serverless Endpoints
With API Gateway and WAF
Use case: Monitor and Protect Serverless
Endpoints Easily and Cost-Effectively

Scenario

You have serverless API endpoints, probably exposed through Lambda


URLs or a basic API Gateway configuration. Things are going well, and you
want to monitor and secure those endpoints in a scalable, low maintenance
and cost-effective way.

Services

● API Gateway: It's a fully managed service that lets you publish APIs.
It supports REST, HTTP, and WebSocket APIs.

● AWS WAF: It's a web application firewall that helps protect your APIs
from common web exploits. You can create rules that allow, block, or
count web requests based on conditions that you specify, and use
already-implemented rules.

● CloudWatch: It's a fully managed monitoring service that provides


visibility into resource and application performance, operational logs,
and events. It allows you to collect, view, and analyze metrics and
logs from various sources in one place.
Solution

First, set up API Gateway properly:

1. Create an API Gateway API: Go to the API Gateway console and click on "Create API". Choose a name and an API type (e.g. REST API), and select "Lambda Function" as the integration type.

2. Create a new resource: Resources are the path segments in the


URL for your API. For example, if you want to create an API for
managing users, you might create a resource called "/users". By
the way, you should properly plan your API endpoints!

3. Create a new method for your resource. A method is an HTTP


verb (e.g. GET, POST, DELETE) that specifies the type of action
you want to perform on the resource. For example, if you want to
retrieve a list of users, you would create a GET method.

4. Configure the integration for your method. This is where you specify the backend service that will process the request. In this case, choose "Lambda Function" as the integration type and pick the appropriate Lambda function from the list (there's a sketch of such a function after these steps).

5. Deploy the API to dev: Choose a stage for your API (e.g. "dev")
and click deploy.

6. Test your API: Send a request (e.g. with Postman) to the API
endpoint and verify that it returns the expected response. Try
sending different types of requests (e.g. GET, POST, DELETE)
and see if they are processed correctly by your Lambda function.
You can find the API endpoint in the API Gateway console.

7. Set up invoke permissions: You need to set up your Lambda function so it can only be invoked from API Gateway. This is controlled by the function's resource-based policy: it should grant lambda:InvokeFunction to the apigateway.amazonaws.com principal, scoped to your API's ARN (the console adds this statement automatically when you configure the integration). Remove any other invoke permissions, such as the one created for a Function URL. Keep in mind, the Function URL that's already set up (if any) will stop working.

8. Deploy the API to prod: Once again, choose a stage for your API
("prod" in this case) and click deploy.

9. Test it again: Make sure it's working!
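
For reference, here's a minimal sketch of what the backing Lambda function for a proxy integration might return. API Gateway expects a statusCode/headers/body shape; the CORS header is only needed if browsers on another domain will call the API (see the CORS best practice below), and the data is a placeholder.

JavaScript

exports.handler = async (event) => {
  // With a Lambda proxy integration, event.pathParameters,
  // event.queryStringParameters and event.body carry the request data
  const users = [{ id: '123', name: 'Alice' }]; // placeholder data

  return {
    statusCode: 200,
    headers: {
      'Content-Type': 'application/json',
      // Only needed for cross-origin browser clients
      'Access-Control-Allow-Origin': '*',
    },
    body: JSON.stringify(users),
  };
};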

Next, set up CloudWatch:

1. Metrics are automatically set up, you can view them by following
this guide.

2. To set up logs, go to the stage you deployed and scroll down to the CloudWatch settings section. (For REST APIs, you also need to set a CloudWatch Logs role ARN in the API Gateway account settings first, if you haven't already.)

3. Select "Enable CloudWatch Logs".

4. In the "Log Level" dropdown menu, pick the log level that you want
to use (e.g. "ERROR", "INFO").

5. In the "Log Group" field, specify the name of the CloudWatch Logs
log group that you want to use. If the log group doesn't exist, it will
be created automatically.
6. Click "Save Changes".

Finally, set up WAF:

1. Create a WAF web ACL: You can do this using the WAF console, the AWS CLI, or a CloudFormation template. Choose a name for your web ACL and, if you want to allow or block specific IP address ranges, create an IP set for them.

2. Associate the web ACL with your API Gateway API: WAF works with REST APIs, and the association is made with a specific stage (e.g. "prod"), either from the WAF console or from the API Gateway stage settings.

3. Test your web ACL: You can do this using any tool that can send HTTP requests to your API Gateway API, such as Postman or cURL. Try sending different types of requests (e.g. GET, POST, DELETE) and see if they are allowed or blocked.

4. Configure your WAF rules: Choose a name and a type for your rule (e.g. SQL injection), and specify the conditions that you want to match (e.g. specific patterns in the request headers or body). You can also specify the action that you want WAF to take when a request matches the rule (allow, block, count).

5. Update your web ACL: Add the WAF rules (or managed rule groups) that you created in step 4 to your web ACL.

6. Test your WAF rules: You can do this using the same tool that you used in step 3. Try sending requests that should match your rules (e.g. requests with malicious payloads), and see if they are allowed or blocked.

7. Monitor and troubleshoot your web ACL: WAF publishes metrics to Amazon CloudWatch, and if you enable WAF logging you can record every request that is allowed or blocked. Use this information to detect and fix issues, and set up CloudWatch alarms to be notified when there are unusual patterns of requests or when there are errors in your backend service.

Best Practices
Note: Some of these are features, which you may not need for your
particular use case (e.g. if your API is public you don't need authentication).
Just pick the ones you need.

1. Design your APIs: A service is used through APIs. API design is akin to UX design for the service. If you don't design them well, you'll have a lot of problems after that 1 service becomes 5 or 10 (and I'm not even talking about microservices). Even if it's just 1, do your frontend self a favor and design your APIs. For RESTful APIs, check the Richardson Maturity Model and Microsoft's guide for API design. If you're using gRPC, go to the source (not the source code!). Same for GraphQL.

2. Choose HTTP, REST or WebSocket: WebSocket is easy to pick


or discard: If you need websockets or are using GraphQL, create a
WebSocket API. If you're building a RESTful API, you have to pick
between HTTP API and REST API. The difference? REST APIs
are the real deal, with a lot of features that HTTP APIs don't have,
such as API Keys, canary deployments, rate limiting, WAF
integration, private endpoints. HTTP APIs are 70% cheaper. Best
practice: If you're starting out, you'll be lucky to pay $10/month for
API Gateway, so you're fine with REST APIs (price is $3.50 per
million requests). Once you have a dozen services and you've
figured out what features you actually need (and which you don't),
picking the right type will save you a good chunk of money. Pick as
early as you can, but it won't break your wallet (or system) if you
pick wrong.

3. Set up your own domain: First, you need to own a domain,


which you can purchase from Route 53 or (usually cheaper) from
other registrars like CloudFlare or GoDaddy. If you purchase it
from somewhere other than Route 53, you can still use Route 53
as your DNS server (optional). Next, create a TLS certificate in
Certificate Manager. Then, set up your domain on API Gateway.
Finally, set up your DNS to route traffic to your API.

4. Use CloudWatch to monitor and troubleshoot APIs: You can


view API metrics, check the logs generated by APIs, and set up
alarms to be notified when something goes wrong. Here's how to
monitor REST, HTTP and WebSocket APIs.

5. Enable authentication: Revisit chapter 2 of this book.

6. Enable request validation: Request Validation allows API


Gateway to check the request payload and reject requests that
don't match your schema. This way, your backend doesn't even
get called (you save $$) if the request is malformed.

7. Enable caching: If your Lambda function is idempotent (meaning


it always returns the same output for the same input), you can
enable caching. This reduces the number of calls to the backend
service (saves $$) and decreases response time. You can also
encrypt the cache.

8. Enable WAF: I already mentioned this above, but it's worth


expanding a bit. First, set up WAF with the baseline rule group and
whatever use case specific rule groups apply to your API. After
that, check out the other managed rule groups, consider whether
you need more rules, check the marketplace for other relevant rule
groups, and consider writing your own. Check out the price, and
consider whether it's worth it for you.

9. Set up CORS: If your API is going to be accessed from a different


domain, you'll need to set up CORS.

10. Set up API keys: You can require that requests to your API
include an API key. This is different from authentication because
API keys don't expire (unless you set them to) and aren't linked to
a user. Basically API keys are designed so that other developers
use your API. You can also set usage limits per key.

11. Use API Gateway for private APIs: You can put API Gateway
in front of your private APIs as well. What for? Well, things like
CORS are useless for private APIs, but you can certainly make
use of monitoring, logs, request validation or caching. And there's
even a use case for authentication, if you're implementing a zero
trust security model.

12. Use API Gateway in a VPC: If you're talking to other AWS


resources from inside a VPC, you don't need an internet
connection. You can set up a VPC Endpoint using PrivateLink.
And you can do that for API Gateway APIs as well. You can even
set endpoint policies.

13. Trace requests: You can set up AWS X-Ray for API Gateway.

14. Mock responses: If you just need a mock response, you can
have API Gateway generate it directly. You can change that to a
proper response later.

15. Transform request and response data: If you need to


interconnect two services that send/receive data in different
formats, you can set up API Gateway to transform the data in the
request and/or the response.

16. You can also use OpenAPI: Here's how to import an OpenAPI
spec into API Gateway.

17. Compress payloads: You can use deflate, gzip or identity


content encodings to compress payloads, and API Gateway
decompresses them.

18. Sell your APIs on AWS Marketplace: AWS Marketplace is a


place where you can buy solutions and deploy them directly to
your AWS accounts. You can also sell your solutions, including
selling access to your APIs.
Using X-Ray for Observability in
Event-Driven Architectures
Use case: Observability in Event-Driven
Architectures Using AWS X-Ray

AWS Service: AWS X-Ray

An event-driven architecture is one where, instead of a service calling


another one directly, it emits an event to which other services are
subscribed. That's great for decoupling, but it makes it much more difficult
to understand what's going on. AWS X-Ray gets data from every part of
your architecture, aggregates it into graphs and shows you the big picture
of what's happening.

Without X-Ray

● You add logs to your Lambda functions. Then you add more logs
just in case.
● When you have an issue, you open 10 tabs of CloudWatch Logs
and try to figure out which log entries are related, based on
timestamps and intuition.
● You finally figure it out (days later), and get a detective badge for
the investigative work.

With X-Ray

● You add AWS X-Ray to all your Lambda functions, DynamoDB


tables, SNS topics, etc.
● When you have an issue, the X-Ray console gives you an
overview of the whole system, and you can zoom in on a specific
event and trace it all the way through your app.
● You don't get the detective badge, because figuring this out was
actually pretty easy.

Best Practices
How to set up AWS X-Ray for a Node.js app

First, install the AWS X-Ray SDK:

Unset

npm install aws-xray-sdk

Then structure your code like this:


JavaScript

// You'll always need this
const AWSXRay = require('aws-xray-sdk');

// This wraps the AWS SDK with X-Ray. Keep using the AWS object to access
// the SDK for any other service like you usually do, and X-Ray will be
// monitoring those calls
const AWS = AWSXRay.captureAWS(require('aws-sdk'));

// Same here, but for outgoing HTTPS requests
AWSXRay.captureHTTPsGlobal(require('https'));

exports.handler = async (event) => {
  // X-Ray gathers data from services in the form of segments. Inside Lambda,
  // the function segment is created for you, so add subsegments to it
  const subsegment = AWSXRay.getSegment().addNewSubsegment('my-work');
  try {
    // Your awesome code goes here
  } catch (error) {
    subsegment.addError(error);
    console.error(error);
    throw error;
  } finally {
    // Gotta close that subsegment!
    subsegment.close();
  }
};

That's the gist of it. Now X-Ray will collect data from your function.
Remember to also set it up for DynamoDB tables and SNS topics by going
to Advanced settings and enabling AWS X-Ray Tracing.
Additional tips

● Enable X-Ray for all relevant components.

● Use segments and subsegments to add context to trace data.

● Use custom attributes to add more context to trace data.

● Use captureFunc (captureAsyncFunc for async functions) to capture errors and add them to trace data (see the sketch after this list).

● Enable sampling. You don't really need to trace all the requests,
10% is usually enough, and much cheaper.
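
Here's a small sketch of the captureAsyncFunc and custom attribute tips, assuming the same aws-xray-sdk setup as above. The subsegment name, annotation key, and event field are made up for the example.

JavaScript

const AWSXRay = require('aws-xray-sdk');

// Wraps an async piece of work in a named subsegment and records errors on it
const traced = (name, event, work) =>
  new Promise((resolve, reject) => {
    AWSXRay.captureAsyncFunc(name, async (subsegment) => {
      // Annotations (custom attributes) are indexed and searchable in the X-Ray console
      subsegment.addAnnotation('orderId', event.orderId);
      try {
        resolve(await work());
      } catch (error) {
        subsegment.addError(error);
        reject(error);
      } finally {
        subsegment.close();
      }
    });
  });

exports.handler = async (event) =>
  traced('process-order', event, async () => {
    // Your awesome code goes here
    return { ok: true };
  });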
Serverless, event-driven pipeline with
Lambda and S3
Use case: Serverless, event-driven image
compressing pipeline with AWS Lambda and S3

Scenario

Your app allows users to upload images directly to S3, and then displays
them publicly (think social network). The problem? Modern phones take
really good pictures, but the file size is way larger than what you need, and
you're predicting very high storage costs. You figured out that you can write
an algorithm to resize the images to a more acceptable size without
noticeable quality loss!

You don't want to change the app so that users upload their images to an
EC2 instance running that algorithm. You know it won't scale fast enough to
handle peaks in traffic, and it would cost more than S3. You want to
implement image resizing in a scalable and cost-efficient way, without
having to maintain any servers.

Services

● Lambda: Serverless compute.

● S3: Storing the images and serving them.

● CloudFront: CDN to cache the images.


Solution

The solution is to trigger a Lambda function when an object is uploaded to


an S3 bucket, run the image resizing code in that function, and have it
upload the image to another bucket, from which images are served
(through CloudFront).

1. Create an S3 bucket for users to upload images.

2. Create a Lambda function that will be triggered when a new object is created in the S3 bucket. This function will run the code to generate a smaller version of the image (there's a sketch of it after these steps). Set up the S3 Event as the trigger.

3. Create an S3 bucket for storing the resized images.

4. Create a CloudFront distribution for the second S3 bucket and


configure it to serve the images.

5. Set up CloudWatch to monitor the pipeline.

6. Test it.
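
Here's a rough sketch of what the resize function from step 2 could look like. It assumes the 'sharp' image library (bundled with the function or provided as a layer) and placeholder bucket names; it also skips images that were already processed, which helps with the idempotency point discussed under Reliability below.

JavaScript

const AWS = require('aws-sdk');
const sharp = require('sharp'); // assumed image library, not part of the AWS SDK

const s3 = new AWS.S3();
const DESTINATION_BUCKET = 'my-resized-images'; // placeholder bucket name

exports.handler = async (event) => {
  for (const record of event.Records) {
    const sourceBucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    // Idempotency: if the resized image already exists, skip the duplicate event
    try {
      await s3.headObject({ Bucket: DESTINATION_BUCKET, Key: key }).promise();
      continue;
    } catch (err) {
      if (err.code !== 'NotFound') throw err;
    }

    const original = await s3.getObject({ Bucket: sourceBucket, Key: key }).promise();
    const resized = await sharp(original.Body).resize({ width: 1080 }).toBuffer();

    await s3
      .putObject({
        Bucket: DESTINATION_BUCKET,
        Key: key,
        Body: resized,
        ContentType: original.ContentType,
      })
      .promise();
  }
};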

Discussion

This approach resizes images eagerly, expecting that the image will be
shown so many times that resizing it will in most cases save you money (or
the improved user experience is worth the cost). If that's not the case for
you, you could resize lazily (i.e. when an image is requested).

If your image processing results can wait a bit, you'd be better off pushing
the S3 event to an SQS queue and consuming the queue from an Auto
Scaling Group of EC2 instances (or ECS). AWS Batch is also a great
option, if the upload rate is not constant. Overall, serverless scales much
faster but serverful is cheaper.

If you need to do more than just resize the images, you've got two options:
For independent actions, you can send the S3 Event to multiple consumers
using SNS; For a complex sequence of actions, you can use Step
Functions (there's a chapter on that topic coming up).

Best Practices

Operational Excellence

● Implement a testing environment.

● Monitor the Lambda function with CloudWatch.

● Set everything up using Infrastructure as Code.

● Use X-Ray to trace the request, as discussed in the previous


chapter.

Security

● Encrypt the data at rest: S3 encrypts all objects automatically, so


no need to do anything.

● Use IAM roles for Lambda to access the S3 bucket: Always


apply minimum permissions.
● Restrict who can upload images: Every time a user needs to upload an image, generate a presigned URL for S3 and use that. Add your own auth to the URL generation process, so only logged in users can upload images (see the sketch after this list).

● Restrict access to the resized images S3 bucket: Set up Origin


Access Control so that users can't access the bucket directly, and
can only access the content through the CloudFront distribution.
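
As referenced above, here's a minimal sketch of the presigned-URL generator for uploads (for example, behind an authenticated API endpoint). The bucket name, key scheme, and expiry are placeholders; run your own auth check before handing out the URL.

JavaScript

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

const UPLOAD_BUCKET = 'my-uploaded-images'; // placeholder bucket name

exports.handler = async (event) => {
  // Your own auth check goes here, so only logged-in users get a URL

  const key = `uploads/${Date.now()}-${event.fileName}`; // placeholder key scheme
  const url = await s3.getSignedUrlPromise('putObject', {
    Bucket: UPLOAD_BUCKET,
    Key: key,
    Expires: 300, // URL is valid for 5 minutes
    ContentType: 'image/jpeg',
  });

  return { uploadUrl: url, key };
};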

Reliability

● Configure retries: S3 Events invokes Lambdas asynchronously.


This means there's an eventually-consistent event queue, and the
function is invoked with an event from the queue. If the function
fails, it's retried up to 2 times (delay 1 min before the first retry, 2
mins before the 2nd retry). The default value is 2, you can lower it
if you want.

● Make your function idempotent: If there's an


eventually-consistent queue, there's duplicate records. Make sure
your function can handle them gracefully, by first checking whether
the resized image already exists in the destination bucket.

● Process unrecoverable failures: Set up a DLQ (a queue where


failure events go), and another Lambda that consumes those
events. Move the failed image to a third bucket (so you don't lose
it when cleaning up the uploads bucket) and log it to a DynamoDB
table for later analysis. Tip: Make sure you set up a DLQ, not an
On failure Destination; DLQs receive the failed response,
Destinations only receive the event that failed.

● Set up an alarm for failures: If you're expecting failures as part


of your regular flow, you shouldn't sound the alarm for a single
failure. Instead, define what's a normal amount of failures, and
alert when the real number goes higher. You can do this easily by
monitoring the length of the DLQ, or you can set up something
more complex.

Performance Efficiency

● Optimize the Lambda functions: Review the chapter about 20


Advanced Tips for Lambda.

● Use CloudFront to serve the images: CloudFront is a CDN.


Basically, it stores the images in a cache near the user (there's lots
of locations around the world), and serves requests from there.

● Consider compressing images before uploading: This one's a


clear tradeoff. On one hand, uploading will be faster. On the other
hand, you'll need to uncompress the images to resize them (and
pay for that extra processing time). Faster uploads for a better
user experience, at a higher cost. Use this if you expect users to
upload from slow networks such as 4G and the rest of the app
works really well. If the rest of the app is slow, start optimizing
there. If users typically upload from a 300 Mbps wifi connection,
they won't even notice the improvement.
Cost Optimization

● Transition infrequently accessed objects to S3 Infrequent


Access: In the scenario section I mentioned social networks. How
often are old images accessed in a social network? You can set a
lifecycle rule to transition objects to S3 Infrequent Access, where
storage is cheaper and reads are more expensive. If you did the
math right, you get a lower average cost. The math: If objects are
accessed less than once a month, it's cheaper. And if you can't
find any obvious patterns, you can use S3 Intelligent-Tiering.

● Set up provisioned concurrency for your Lambdas: If you know you'll have a steady baseline of concurrent executions, provisioned concurrency that's highly utilized works out cheaper per GB-second than on-demand (and it also removes cold starts for that baseline).

● Get a Savings Plan: Savings Plans are upfront commitments


(with optional upfront pay) for compute resources. You'd typically
link them to EC2 or Fargate, but they apply to Lambdas as well!
Real-time data processing pipeline with
Kinesis and Lambda
Use case: Building a real-time data processing
pipeline with Kinesis and Lambda

Scenario

Your business is generating a large amount of clickstream data from your


web or mobile app, and you need to process them in real-time to gain
insights and improve user engagement. You want to do this in a scalable,
secure, and cost-efficient way.

Services

AWS Kinesis and AWS Lambda are the services that we'll be using to
analyze clickstream data. Kinesis allows for the ingestion of streaming
data, while Lambda is used to process the data streams in real-time.
Solution

1. Consider the volume of data and the required throughput.

2. Set up a Kinesis Data Stream to ingest the clickstream data. Configure the number of shards, which determines how much data can be ingested and processed simultaneously, to handle your requirements.

3. Collect the clickstream data and send it to the Kinesis stream.

4. (Optional) Set up sessionization of the clickstream data using


Kinesis Data Analytics.

5. Create a Lambda function to process the clickstream data in


real-time. This function can be triggered by the Kinesis stream,
and it will contain the logic for processing the data, such as
filtering, transforming, or aggregating the data.
6. Send data from the Lambda function to Kinesis Data Firehose, to
store the processed data in S3 or Redshift for further analysis.

7. Set up CloudWatch to monitor the pipeline and troubleshoot any


issues.

8. Test the pipeline to ensure it is processing the data correctly.

How to send data to a Kinesis Data Stream in JavaScript


JavaScript

const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis();

const sendData = async (streamName, data) => {
  const params = {
    Data: JSON.stringify(data),
    // Use a high-cardinality value (like the user ID) so records spread across shards
    PartitionKey: String(data.userId),
    StreamName: streamName
  };

  try {
    await kinesis.putRecord(params).promise();
    console.log(`Data sent to stream: ${JSON.stringify(data)}`);
  } catch (err) {
    console.log(err);
  }
};

const data = { userId: "123", event: "pageview" };

sendData("my-clickstream-data", data);
How to process the data and store it to S3 with a Lambda
function in JavaScript
JavaScript

const AWS = require('aws-sdk');

const firehose = new AWS.Firehose();

exports.handler = async (event) => {
  // Each invocation receives a batch of records from the Kinesis stream
  for (const record of event.Records) {
    // Kinesis record data is base64-encoded
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf-8');
    const data = JSON.parse(payload);
    console.log(data);

    processData(data);
    await storeData('my-clickstream-data-delivery-stream', data);
  }

  return {};
};

const processData = (data) => {
  // add your data processing logic here
};

const storeData = async (deliveryStreamName, data) => {
  const params = {
    DeliveryStreamName: deliveryStreamName,
    Record: {
      Data: JSON.stringify(data)
    }
  };

  try {
    await firehose.putRecord(params).promise();
    console.log(`Data stored: ${JSON.stringify(data)}`);
  } catch (err) {
    console.log(err);
  }
};

Best Practices

Operational Excellence

● Test end-to-end data flow: In addition to testing individual


components, set up a test environment that mimics the production
environment as closely as possible. Use this environment to test
the entire data flow, from data collection to data processing and
storage. This will help you identify and fix any issues with data
format, data loss, or data processing errors.

● Monitor and troubleshoot the pipeline: CloudWatch allows you


to view metrics, check logs, and set up alarms to be notified when
something goes wrong. This helps you ensure the pipeline is
running smoothly and quickly troubleshoot any issues that arise.
● Set up automatic scaling for the Kinesis stream: You can scale your Kinesis stream with CloudWatch and Lambda (there's a sketch after this list), or use On-Demand Mode. This ensures that the pipeline can handle the desired throughput and reduces costs by only using the necessary resources.

● Use Infrastructure as Code for your infrastructure: An IaC tool


allows you to version control your infrastructure, rollback to
previous versions and make it easy to replicate your infrastructure.
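
As an example of the scaling idea above, a Lambda triggered by a CloudWatch alarm could adjust the shard count roughly like this (Provisioned Mode only; the stream name and target count are placeholders, and a real implementation would derive the target from the current shard count and the alarm that fired):

JavaScript

const AWS = require('aws-sdk');
const kinesis = new AWS.Kinesis();

exports.handler = async () => {
  await kinesis
    .updateShardCount({
      StreamName: 'my-clickstream-data', // placeholder stream name
      TargetShardCount: 4,               // placeholder target
      ScalingType: 'UNIFORM_SCALING',
    })
    .promise();
};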

Security

● Don't let everyone write to your Kinesis stream: You can use
API Gateway to expose access to your Kinesis stream, and
protect it with Cognito, in a very similar way to how you expose
serverless endpoints.

● Use IAM roles for Lambda to access the Kinesis stream: IAM
roles allow you to grant permissions to Lambda to access the
Kinesis stream while enforcing least privilege access controls.

Reliability

● Use SNS for notifications: SNS can be used to send notifications


when events occur in the pipeline. This allows you to identify and
address issues that may impact the pipeline's reliability.

● Set up dedicated throughput: Use Kinesis Data Streams


Enhanced Fan-Out to set up consumers with dedicated
throughput.
● Handle duplicate events: Producers may retry on error, and
Lambda consumers automatically retry on error. This can insert
duplicate records into the stream. Write your consumers so they're
idempotent.

Performance Efficiency

● Optimize the Lambda function's processing time: You can do


this by reducing the number of calls to external services, using a
high-performance language runtime, and optimizing the function's
memory usage.

● Use AWS Lambda provisioned concurrency: Provisioned concurrency keeps a configured number of execution environments initialized and ready to respond. Set it to your baseline traffic so those invocations never hit a cold start; requests beyond that capacity still scale with regular on-demand concurrency.

Cost Optimization

● Tune the Kinesis Stream shard count: The shard count determines how much data can be ingested and processed simultaneously, and how much you pay in Provisioned Mode. Consider the volume of data and the required throughput when configuring the stream in Provisioned Mode, or just use On-Demand Mode.

● Optimize the Lambda function's memory: Fine-tune your


Lambda function's configuration of CPU and memory to reduce
your Lambda costs.
Complex, multi-step workflow with AWS
Step Functions
Use case: Complex, multi-step image processing
workflow with AWS Step Functions

Scenario

You are building a social network app. Users will be able to upload images
to an S3 bucket, and you need to first analyze them to detect inappropriate
content. If the image is safe (does not contain inappropriate content), it will
be resized and stored in another S3 bucket. If the image is unsafe
(contains inappropriate content), the user that uploaded it will be notified
via email.

Services

● S3: Storing the uploaded images and the processed images.

● Lambda: Serverless processing.

● Step Functions: It's a serverless orchestration service that lets


you integrate with AWS Lambda functions and other AWS
services. Step Functions is based on state machines and tasks. A
state machine is a workflow. A task is a state in a workflow that
represents a single unit of work that another AWS service
performs. Each step in a workflow is a state.
● Rekognition: A service that analyzes images and identifies
objects, people, text, etc. It can detect inappropriate content
(which is our use case for this scenario), do highly accurate facial
analysis, face comparison, and face search. This is not really the
focus of this chapter.

● SES: An email-sending service by AWS. This is not the focus of


this chapter.

Solution

1. Create an S3 bucket to store the user-uploaded images.

2. Enable S3 Event Notification with EventBridge

3. Create a Lambda function to analyze images using Amazon Rekognition's image moderation API (there's a sketch of it after these steps).

4. Create a Step Functions State Machine to coordinate the image


analysis and processing.

5. In the State Machine, configure a Task to trigger the Lambda


function to analyze the image using Rekognition.

6. Configure a Choice State to analyze the output of the Lambda


function, which can be safe (the image did not contain
inappropriate content) or unsafe (the image did contain
inappropriate content).
7. Configure another Task that executes if the image is unsafe, which
triggers a Lambda function that calls SES to notify the user that
their image was unsafe.

8. Configure another Task that executes if the image is safe, which


triggers a Lambda function that resizes the image and stores it in
an S3 bucket.

9. Create an Amazon EventBridge Rule on the S3 bucket to trigger


the State Machine every time an image is uploaded to the S3
bucket.

10. Test the workflow with sample safe and unsafe images.
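
Here's a rough sketch of the moderation function from step 3. It calls Rekognition's image moderation API against the uploaded object and returns a verdict the Choice state can branch on; the confidence threshold and input field names are placeholders.

JavaScript

const AWS = require('aws-sdk');
const rekognition = new AWS.Rekognition();

exports.handler = async (event) => {
  // The state machine passes the bucket and key of the uploaded image
  const { bucket, key } = event;

  const result = await rekognition
    .detectModerationLabels({
      Image: { S3Object: { Bucket: bucket, Name: key } },
      MinConfidence: 80, // placeholder threshold
    })
    .promise();

  // The Choice state can branch on this field (safe vs unsafe)
  return {
    bucket,
    key,
    safe: result.ModerationLabels.length === 0,
  };
};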

Discussion

What we're doing here is called orchestrating services. Every task (analyze
the image, send an email, resize the image) is a service, and they need to
interact in a certain order, with a certain logic. There's actually 3 ways to
achieve this:

● Each service calls the next one: This means you're adding on
every service the responsibility of knowing who goes next in the
workflow. You're coupling one service to the next one (and actually
to the previous one as well, for handling rollbacks), and you're
coupling every service to this specific workflow. More than that,
you're adding an additional responsibility to every service. Our
example is not that complex, but in real, complex workflows this
will slow you down a lot.
● Orchestrated services: Every service has a single responsibility
(e.g. resize the image), and some external (centralized) controller
stores and executes all the coordination logic, calling every service
in the right order and passing around the responses. You are here.
Step Functions is our Orchestrator in this case. Our example is
really simple, but the main advantage of orchestrated services is
that you're centralizing the definition of the workflow and making it
easier to implement really complex stuff.

● Choreographed services: Every service has a single


responsibility (e.g. resize the image). It subscribes to an event,
and publishes an event when done. Every service only knows "If X
happens, I do Y and post Z", without any knowledge of who
causes X or who's listening for Z. The workflow logic is split across
all services, but not tied to each service's implementation. Instead,
the workflow logic emerges from watching all the services and
understanding the end result of posting a message. The main
advantages are that you don't depend on a centralized
orchestrator and you're vendor agnostic (but that's not as
important as we tend to think). They're not really harder to design
than orchestrated services, the real disadvantage is that they're
harder to keep track of.
Best Practices

Operational Excellence

● Use IaC: Write your workflow as code. You can use


CloudFormation, CDK, SAM, Terraform or any other tool. Creating
the workflow for the first time is easier to do manually, but keeping
track of later changes is ridiculously hard.

● Use CI/CD: Don't manually update the code. That's messy enough
in monoliths, but when working with multiple services that are
called in a complex order, it gets outright impossible to manage.
Use a CI/CD pipeline.

● Automate testing: You're already defining your workflow as code,


and deploying it automatically. Write a few end-to-end tests with
sample images, so you at least cover the happy paths.

● Logging and Monitoring: Set up logging and monitoring for Step


Functions. Also, set up X-Ray.

Security

● Use IAM roles for Lambda: As always, minimum permissions.


Use IAM roles for your Lambda functions. Don't use the same role
for all functions though: The function that just calls Rekognition
doesn't need to write to S3!
● Use IAM roles for Step Functions: Step Functions should also
have minimum permissions. You can achieve that with IAM roles
for Step Functions.

● Restrict who can upload images: Restrict access to the S3


bucket for uploaded images using presigned URLs.

● Restrict who can read and write images: Limit what IAM roles
can read from the uploaded images bucket and write to the
processed images bucket. Hint: These should be your Lambdas'
roles.

● If using CloudFront, use OAC: Origin Access Control lets you


have CloudFront read from a private S3 bucket. That way, users
can only access the images through CloudFront.

Reliability

● Implement error handling and retry logic: Step Functions can


handle errors and retries, for example for Lambda functions. Don't
just design for the happy path, ensure that the workflow can
recover from failures.

● Pick an Async Express workflow: There's sync and async


Express workflows. Sync workflows are called at most once,
Async workflows are called at least once. You're dealing with
async events, where you don't actually control the caller (which is
S3 Events) and its wait and retry logic. Async is the correct choice
here. For reference, Standard workflows (see the Cost
Optimization section) are called exactly once.
● Make your steps idempotent: Idempotency means the final
result is the same whether you call the function once or N times.
Since async workflows are called at least once, you want to make
sure calling the same Lambda twice for the same image doesn't
result in data corruption.

Performance Efficiency

● Use CloudFront to serve the images: CloudFront is a CDN.


Basically, it stores the images in a cache near the user (there's lots
of locations around the world), and serves requests from there.
Faster and cheaper!

● Consider compressing images before uploading: This one's a


clear tradeoff. On one hand, uploading will be faster. On the other
hand, you'll need to uncompress the images to process them (and
pay for that extra processing time). Uncompressing can be added
easily as an additional Task in your Step Functions State Machine.
Faster uploads for a better user experience, at a higher cost. Use
this if you expect users to upload from slow networks such as 4G
and the rest of the app works really well. If the rest of the app is
slow, start optimizing there. If users typically upload from a 300
Mbps wifi connection, they won't even notice the improvement.

● Send the S3 object ARN, not the image: We're talking about big
images (that's why we compress them!). That's a huge payload for
Step Functions. Instead of sending the image itself, send the S3
object ARN and let each step read the image from S3.
Cost Optimization

● Use an Express Workflow: There's Standard and Express


workflows in Step Functions. Standard is for long running, Express
workflows are for high throughput, and are cheaper. Maximum
runtime for Express workflows is 5 minutes, which is more than
enough for our example. If you need more you can nest an
Express workflow inside a Standard one.

● Consider going serverful: If your workflow has a baseline of


constant traffic and processing can wait a few minutes, consider
replacing some Lambda functions for servers (EC2 with auto
scaling, or ECS, possibly with Fargate).
Using Aurora for your MySQL or Postgres
database
Use Case: Managed Relational Database

AWS Service: Amazon Aurora

Aurora is a managed instance with a managed installation of MySQL or


PostgreSQL. Imagine you create an EC2 instance, install MySQL or
Postgres, configure automated backups, etc. Aurora is like an EC2
instance that already comes with all of that. Plus, you can set up a cluster
with one extra click (or a few lines of code), Aurora handles all the config.

I have to mention RDS here, because it does pretty much the same. The
difference is that RDS uses native MySQL or Postgres (or other engines),
while Aurora basically uses a rewrite that's compatible with MySQL or with
Postgres, but optimized for AWS's infrastructure. Aurora is limited to
MySQL up to 8.0 or PostgreSQL up to 14.3, but it has a lot more features.

Aurora comes in serverful and serverless modes. Serverless automatically scales capacity up and down, is approximately 3x more expensive per unit of capacity, and works really well.

Best Practices
● Use Aurora whenever you can
● If you can't, then use RDS
● And if you can't even use RDS (e.g. you need a very specific
database engine):
● First, question whether you really need that (you probably
really need it, but it's worth considering)
● Then, use EC2 and manually install everything

● After you've installed it, create a new AMI with


the database installed, so you can reuse it.

● For production environments, consider a Reserved Instance (you


either pay upfront for 1 or 3 years, or commit to paying monthly for
1 or 3 years; you get a discount between 30% and 60%
approximately)

● Also, remember that in Aurora reserved instances are flexible: if you move to a larger instance size you still benefit from the reservation.

● You can also go serverless:

● Use serverless v2, serverless v1 is fading out


● It's approximately 3x more expensive
● That means, if you're on average using less than 33% of
your DB instance's capacity, serverless will be cheaper
● Plus, you get the benefit of scaling beyond your planned
capacity (with no previous setup and with no intervention)

● If you want high availability, you need a failover replica (another


instance in another AZ).
● On Aurora, replicas act as both read and failover. That means your
failover replica doesn't sit idle, it can be used to offload reads. In
RDS, read replicas and failover replicas are separate.
● This only matters if you need a read replica and high availability.
When you do need both, Aurora cuts your database costs down by
33% vs RDS.
● Always consider whether you actually need a relational database.
NoSQL databases are great, and you skip the ORM. Revisit this
book's chapters about DynamoDB.
● Sometimes some data needs to be queried in a complex manner
using SQL (e.g. for analytics) and other data just needs simple
queries. Having 2 databases is completely acceptable, and if the
dataset is large and SQL queries are very infrequent, then Athena
is a good option.
Session Manager: An easier and safer way
to SSH into your EC2 instances
Use Case: Connecting to an instance using SSH

AWS Service: Session Manager

Session Manager gives you shell access to an EC2 instance without SSH keys or open inbound ports (it can also tunnel regular SSH if you need it). You set it up, grant permissions to IAM or IAM Identity Center users, and they connect with 2 clicks (console) or 1 command (CLI). You can even forward a connection to a database, use the SDK in your apps, and configure logs for the sessions.

Benefits of using Session Manager

● No need to share or rotate SSH keys, just grant and remove IAM
permissions
● No ports open and no bastion hosts (and if you want, not even a
public IP)
● Monitor and alert on session start
● Log the whole session
● Limit available commands

Session Manager is a fully managed AWS Systems Manager capability.


That means it's a part of Systems Manager that you can use on its own.
Best Practices
● You can use it for Linux, macOS and Windows.
● Make sure you meet the requirements: SSM agent installed for
Linux, Windows and macOS, and internet connection or a VPC
Endpoint.
● Then here's how to set it up:

● Create an IAM role with Session Manager permissions


● Attach that role to your instances
● Grant IAM Permissions to users so they can start a
session

● Finally, start a session from the EC2 Console or from the CLI (for a
better experience install the CLI plugin)
Using SNS to Decouple Components
Use Case: Using SNS to decouple components

AWS Service: Amazon SNS

SNS is a managed publish-subscribe service. The whole idea of the


publish-subscribe model is that producers send messages to topics and
consumers subscribe to those topics to receive the messages. In SNS you
create a topic, subscribers subscribe to it, and when a producer publishes a
message to it, SNS automatically delivers the message to all of the
subscribers of that topic. Subscribers can be email addresses, phone numbers (using SMS), HTTP(S) endpoints, mobile application endpoints (push notifications), Lambda functions (directly invoked), SQS queues, or Kinesis Data Firehose delivery streams.

In event-driven architectures, SNS can act as a central hub over which


different components communicate. Publishers don't know anything about
consumers, so those components are decoupled. Also, since multiple
consumers can subscribe to a topic, SNS is used to "fan out" a message
(i.e. get it to many subscribers).
What you can do with SNS

● Filter messages based on attributes


● Use separate topics for different delivery policies
● Set up message retries
● Set up message deduplication (FIFO topics)

Best Practices

● SNS can be used to send SMS messages to phones. It's not the
cheapest option out there, but it's super simple to set up.

● You can trigger Lambda functions when messages are published


to a topic. Great for serverless workflows and event-driven
architectures!
● Use SNS to fan out messages to multiple recipients. For example,
instead of having Service A send a message to Service B and to
Service C, you can send the message to an SNS topic where
Services B, C and maybe others are subscribed to. Every
subscriber gets a copy of the message.

● You can send cross-account messages!

● If some consumers are not guaranteed to be able to handle the


message right away, you should add an SQS queue between the
SNS topic and the consumer. The queue is subscribed to the
topic, and the consumer reads from the queue. Like this: Producer
--> SNS topic --> SQS queue --> Consumer. Why use SNS then?
To fan out the message to multiple consumers (maybe multiple
queues!).
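
To make the fan-out pattern concrete, here's a minimal sketch of the producer side: it publishes one message to a topic, and every subscriber (SQS queues, Lambda functions, etc.) gets a copy. The message attribute is only there to show how subscription filter policies hook in; the topic ARN and message fields are placeholders.

JavaScript

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

const TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:order-events'; // placeholder

exports.handler = async (event) => {
  await sns
    .publish({
      TopicArn: TOPIC_ARN,
      Message: JSON.stringify({ orderId: event.orderId, status: 'created' }),
      // Subscriptions with a filter policy can match on attributes like this one
      MessageAttributes: {
        eventType: { DataType: 'String', StringValue: 'order_created' },
      },
    })
    .promise();
};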
Self-healing, Single-instance Environment
with AWS EC2
Use case: Self-healing environment that doesn't
need to scale

Scenario

You've got a simple environment with a single EC2 instance. Maybe you
don't need to scale right now, or you can't because the instance isn't
stateless, meaning you're saving data in the instance's EBS volumes.
When your instance fails, you want it to fix itself automatically, but you don't
want to pay for a load balancer.

Services

● EC2: Where you create the instance.

● EC2 Auto Scaling: You set up an Auto Scaling Group with a


desired capacity, and it creates or destroys instances to match that
desired capacity. You can set metrics such as average CPU
usage, based on which the desired capacity changes (this is the
scaling part).

● Elastic IP: Just a static IP address that exists separate from any
EC2 instances (meaning it's not created or destroyed with an
instance). It can be attached to an instance, and moved to another
one at will. It's free while attached to a running instance. PS: It's
not a separate service.
Solution

1. Allocate an Elastic IP address.

2. Create an IAM instance profile with an IAM Role that allows the

instance to associate that Elastic IP address. Here's the policy


(replace {IpAddressArn} with the ARN of your Elastic IP
address):

Unset

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AssociateElasticIpAddress",
            "Action": [
                "ec2:AssociateAddress"
            ],
            "Effect": "Allow",
            "Resource": "{IpAddressArn}"
        }
    ]
}

3. Create a Launch Template using the instance profile created in Step 2. Add in the User Data section (the script that runs when an instance launches) a command to associate the address with that instance. Use this template, replacing ${AWS::Region} with your region and ${EIP.AllocationId} with the Allocation ID of the Elastic IP address.
Unset

#!/bin/bash -ex
INSTANCEID=$(curl -s -m 60 http://169.254.169.254/latest/meta-data/instance-id)
aws --region ${AWS::Region} ec2 associate-address --instance-id $INSTANCEID --allocation-id ${EIP.AllocationId}

4. Create an Auto Scaling Group with min 1 instance and max 1


instance, and associate the Launch Template.

5. The Auto Scaling Group will detect there are 0 instances and will
create a new instance (to match the minimum of 1 instance), and
when that instance starts it will associate the Elastic IP address to
itself. When the instance fails, the Auto Scaling Group detects
there are 0 healthy instances and repeats the process.

6. (Optional) Point your DNS record to the Elastic IP address

Discussion

Why are we even discussing this? Can't we just put a load balancer
there and be done with it? $22/month is not that expensive!

Indeed, it's not that expensive. And an Application Load Balancer has other
benefits, such as easily handling the SSL certificate or integrating with
WAF. However, for an app to scale horizontally you need to remove the
state from it. Any data that needs to be shared across instances is part of
the state of the application. This includes configs, shared files, databases
and session data. For session data you can use sticky sessions (all
requests for the same session go to the same instance), but the rest needs
to be moved to a separate storage (S3, EFS, DynamoDB, RDS, etc).

Ok, so I just need to design my compute layer to be stateless! That's


easy.

Yes, it is! And you should have done that in the first place! Unfortunately,
not everyone does that. And if you didn't get it right from the start, changing
that later is a lot of work. Still worth it, and you should still do it! But if you
need a self-healing environment right now, this is the solution (while you
work on removing the state from your app).

You also said something about dev environments. Shouldn't a dev


environment be identical to a prod environment?

We usually call that environment staging. Dev is usually cheap and dirty.
But you still don't want it to fail, since dev hours spent fixing a dev
environment can add up to a lot of money. This is a good solution for a
self-healing dev environment.

Best Practices
If you're in this situation, the best thing you can do is just remove the state
from your app and make it horizontally scalable. I'll keep the tips focused
on this solution though, because I think it's a pretty creative solution that
can be useful in certain situations.
Operational Excellence

● Use Session Manager: Session Manager is a secure way to get a
shell on an EC2 instance without opening SSH ports or managing
SSH keys.

● Mind the service quotas: There's a default limit of 5 Elastic IP
addresses per region per account, and each NAT Gateway consumes 1
Elastic IP address from that quota. You can increase this limit,
but AWS can take a couple of weeks to do it, so request the
increase before you hit the limit.

● Store configs in SSM Parameter Store: If you have
configuration values, there's three ways to set them: hard-code them
in the code (bad idea), write them to the instance (bad idea,
because when the instance fails the new instance won't have
them), or set them up in separate storage that the instance
can read from. SSM Parameter Store is that separate
storage (see the commands after this list).

● Send logs to CloudWatch Logs: Logs stored in the instance will
be lost when the instance fails. Instead, set up the CloudWatch
Agent to send logs to CloudWatch Logs, so you can view them
later.
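
A quick sketch of the SSM tips above, using the AWS CLI (the parameter
name, its value and the instance ID are hypothetical):

Unset

# Store a config value once (e.g. from your pipeline or your laptop)
aws ssm put-parameter --name /myapp/dev/db-url --type SecureString --value "postgres://user:pass@host:5432/db"

# Read it from the instance, at boot or at runtime
aws ssm get-parameter --name /myapp/dev/db-url --with-decryption --query Parameter.Value --output text

# Open a shell on the instance without SSH keys or inbound ports
# (requires the Session Manager plugin locally and the SSM agent on the instance)
aws ssm start-session --target i-0123456789abcdef0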

Security

● Use HTTPS: If I want HTTPS (which I do, always), I normally set
up an SSL certificate in the Application Load Balancer. I can't do
that here, because there is no load balancer. But we should still
use secure connections! One way to do this is to set up an Nginx
reverse proxy on the instance, as sketched below. Keep in mind that
if your SSL certificate only exists inside the instance, a new
instance will need to recreate it.
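
As a sketch of that idea, here's what the Nginx reverse proxy could look like
in the instance's User Data. It assumes Amazon Linux 2023, an app listening
on 127.0.0.1:8080, and certificate files you provision yourself (the paths
are hypothetical); remember to open port 443 in the security group:

Unset

# Install Nginx and drop in a minimal TLS-terminating reverse proxy config
dnf install -y nginx
cat > /etc/nginx/conf.d/app.conf <<'EOF'
server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/pki/tls/certs/app.crt;
    ssl_certificate_key /etc/pki/tls/private/app.key;
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
    }
}
EOF
systemctl enable --now nginx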

Reliability

● Use a secondary EBS volume: If you store everything in the root
EBS volume, you can lose that data when the instance fails.
Instead, use a secondary EBS volume for all your data. In the
User Data section of the Launch Template you can add a line to
automatically attach that EBS volume to the new instance, as
sketched below.
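
Here's a minimal sketch of those User Data lines, assuming a data volume
vol-REPLACE_ME that already exists (and was formatted once) in the same
Availability Zone, and an instance role that allows ec2:AttachVolume:

Unset

# Look up the region and instance ID from instance metadata
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
INSTANCEID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# Attach the existing data volume and wait until the attachment completes
aws --region $REGION ec2 attach-volume --volume-id vol-REPLACE_ME --instance-id $INSTANCEID --device /dev/sdf
aws --region $REGION ec2 wait volume-in-use --volume-ids vol-REPLACE_ME
# On Nitro instances the device shows up as /dev/nvme1n1 rather than /dev/sdf
mkdir -p /data && mount /dev/nvme1n1 /data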

Performance Efficiency

● Pick the right EBS volume type: If you're using a single EC2
instance for your prod environment, you're likely relying on EBS a
lot, so this likely matters a lot to you. EBS is a topic for a future
chapter.

Cost Optimization

● Use Savings Plans: You can share them across instances, so if an
instance fails and a new one is spun up, the new one gets the
benefit immediately.

● Turn off your dev env at night: Set the min and max instance
number to 0 when your team logs off for the night, and back to 1
when they begin work the next day. There's many ways to do this,
such as a Lambda function triggered by EventBridge. Note that
while the Elastic IP address is not associated with a running EC2
instance, you'll be charged $0.005/hour (that's $3.60/month).
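
An alternative that needs no Lambda at all is Auto Scaling scheduled
actions; here's a sketch (the group name is hypothetical and the times
are in UTC):

Unset

# Scale to zero on weekday evenings
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name dev-environment \
  --scheduled-action-name stop-at-night \
  --recurrence "0 20 * * 1-5" \
  --min-size 0 --max-size 0 --desired-capacity 0

# Scale back to one instance on weekday mornings
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name dev-environment \
  --scheduled-action-name start-in-the-morning \
  --recurrence "0 7 * * 1-5" \
  --min-size 1 --max-size 1 --desired-capacity 1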
EBS: Volume types and automated
backups with DLM
Use Case: Understanding EBS and automating EBS
backups

AWS Service: EBS and DLM

Elastic Block Store is a block-level storage service for EC2 instances. It's a
virtual SSD or HDD that you attach to EC2 instances for persistent storage.

EBS basics

● An EC2 instance has a root volume that it boots from (the OS is
in that volume) and can have additional volumes.

● EBS volumes can be resized.

● EBS volumes are zonal resources. They exist in one Availability
Zone, so no high availability.

● They're redundant within that AZ, so data loss is less likely than
with a single disk (99.8%-99.9% durability in a year).

● Their lifecycle is separate from that of the EC2 instance. You can
create them, attach them, detach them and delete them on their
own. You can also set up the EC2 instance to delete them when
it's terminated (which is the default for the root volume, and isn't
for non-root volumes).
● An EBS volume can be attached to a maximum of one instance at
a time (except for io1/io2 volumes with Multi-Attach enabled).
That means they're not a shared file system; use EFS (Linux) or
FSx (Windows) for that.

● EBS volumes can be encrypted with KMS in a transparent manner.

EBS Volume types

● General purpose: gp3. SSD that you use for everything. You can
configure size and IOPS separately (unlike the previous gen, gp2).

● Limits: Size 1 GB to 16 TB, 16,000 IOPS, 1,000 MB/s throughput.
● Price: $0.08/GB-month, $0.005/provisioned IOPS-month
(first 3,000 are free).

● For better performance: io2. SSD for things that require more
performance (e.g. databases). You can configure size and IOPS
separately. With Multi-Attach enabled, it can be attached to
multiple instances at the same time.

● Limits: Size 4 GB to 16 TB, 64,000 IOPS, 1,000 MB/s throughput
(256,000 IOPS and 4,000 MB/s for io2 Block Express).
● Price: $0.125/GB-month, $0.065/provisioned
IOPS-month (no free IOPS)
● Throughput-intensive uses: st1. HDD (yeah, spinning disks) that
performs well for use cases that read contiguous data (e.g. logs),
for half the price.

● Limits: Size 125 GB to 16 TB, 500 IOPS, 500 MB/s throughput.
● Price: $0.045/GB-month

● Infrequent access: sc1. Slow but really cheap HDD, ideal for
infrequently accessed data. An alternative is S3 Infrequent
Access, which has more durability and is cheaper for storage, but
is slower to access and you're charged for read operations.

● Limits: Size 125 GB to 16 TB, 250 IOPS, 250 MB/s throughput.
● Price: $0.015/GB-month
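
To put those prices in perspective: a 500 GB gp3 volume that stays within
the 3,000 free IOPS costs 500 × $0.08 = $40/month, while the same 500 GB on
io2 with 10,000 provisioned IOPS costs 500 × $0.125 + 10,000 × $0.065 =
$62.50 + $650 = $712.50/month. Performance gets expensive fast, which is why
gp3 is the default choice unless you've measured that you need more.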

Best Practices

● Use gp3 volumes unless you know you need more performance or
have a specific use case. Not sure? Here's how to benchmark.
Also, some performance tips. And if you need extreme
performance, use instance store.

● EC2 instances have a cap on EBS performance. If you're
approaching it, use EBS-optimized instances.

● Migrate gp2 volumes to gp3, it's easy and you save 20% (see the commands after this list).
● Encrypt your EBS volumes.

● If you have data sets with different requirements, use multiple EBS
volumes.

● If you're storing important data, remember to back it up with
snapshots. They're basically an incremental backup stored in S3,
and can be encrypted.

● After restoring from a snapshot, the first time you read a
block is slow because the data is lazy loaded from S3.
You can eager load it by initializing the EBS volume, or
you can enable fast snapshot restore (it's pricey).

● Snapshots are regional. If you want to use them for
Disaster Recovery, you need to copy them to your DR
region, either manually or automatically with DLM. If
you're encrypting snapshots, use a multi-Region KMS key.
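
To make two of those tips concrete, here's a hedged CLI sketch (the volume
ID is a placeholder, and us-east-1/us-west-2 are just example source and DR
regions):

Unset

# Migrate a gp2 volume to gp3 in place (no downtime)
aws ec2 modify-volume --volume-id vol-REPLACE_ME --volume-type gp3

# Take an ad-hoc snapshot, wait for it to complete, then copy it to the DR region
SNAPSHOT_ID=$(aws ec2 create-snapshot --volume-id vol-REPLACE_ME \
  --description "manual backup" --query SnapshotId --output text)
aws ec2 wait snapshot-completed --snapshot-ids $SNAPSHOT_ID
aws ec2 copy-snapshot --region us-west-2 --source-region us-east-1 \
  --source-snapshot-id $SNAPSHOT_ID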

Automating Snapshots with Data Lifecycle Manager

Data Lifecycle Manager (DLM) can be used to automate snapshot creation
and copying to another region by creating a snapshot policy, which tells
DLM how often to create the backups and where to store them.

Here's a CloudFormation template to set it up:

Unset

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  KmsKeyArn:
    Type: String
    Description: The ARN of the KMS key to use for encrypting cross-Region snapshot copies
  DestinationRegion:
    Type: String
    Description: The destination region to copy the snapshots to
Resources:
  SnapshotPolicy:
    Type: AWS::DLM::LifecyclePolicy
    Properties:
      Description: EBS snapshot policy with cross-Region copy
      State: ENABLED
      # Assumes the default DLM role already exists (it's created the first
      # time you use DLM in the console); adjust the ARN if yours differs
      ExecutionRoleArn: !Sub arn:aws:iam::${AWS::AccountId}:role/service-role/AWSDataLifecycleManagerDefaultRole
      PolicyDetails:
        ResourceTypes:
          - VOLUME
        TargetTags:
          - Key: Snapshot
            Value: 'true'
        Parameters:
          ExcludeBootVolume: true
        Schedules:
          - Name: DailySnapshot
            CopyTags: true
            CreateRule:
              Interval: 24
              IntervalUnit: HOURS
            RetainRule:
              Count: 7
            CrossRegionCopyRules:
              - Target: !Ref DestinationRegion
                Encrypted: true
                CmkArn: !Ref KmsKeyArn
                CopyTags: true
                # Optional: expire the copies in the DR region after 7 days
                RetainRule:
                  Interval: 7
                  IntervalUnit: DAYS
AWS Organizations and Control Tower
Use Case: Managing Multiple AWS Accounts

AWS Service: Organizations and Control Tower

Instead of mixing everything into the same AWS account, use multiple
accounts grouped under an AWS Organization.

Benefits of using Organizations

● Consolidated billing: You only put your credit card details in the root
account, and all AWS bills from all accounts are billed to the root
account

● Centralized management: You can create new accounts and
manage existing accounts from the Organizations console in the root
account (see the example after this list)

● Improved security using Service Control Policies.
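
For example, creating a new member account from the management (root)
account is a single call (the email and account name are hypothetical):

Unset

aws organizations create-account \
  --email aws+project1-dev@example.com \
  --account-name "Project 1 - Dev"

# Account creation is asynchronous; check on it with:
aws organizations list-create-account-status --states IN_PROGRESS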

Each AWS Account should serve one single purpose and hold one
workload (one environment for one application, for example the production
environment for App 1). Accounts are grouped into Organizational Units
(OUs).
Example account structure

Best Practices
This is a great way to set up your Organization:

● Root account: Create a brand new account. Set everything up using
AWS Control Tower, or do it manually following this guide. Don't use
the root account to deploy any workloads.

● If you already have an AWS account with resources, create a brand
new account to use as the root, and invite your existing account
manually or using Control Tower.

● Create Organizational Units to group your accounts, for example by
project.

● Create the following accounts:

● Log archive: Account that concentrates all logs.
● Security: Used to deploy security workloads and run audits.
● Shared services: Use it to deploy anything that can be shared
across accounts, such as CI/CD.

● Set up one account per environment: Project 1 dev, Project 1 prod,
Project 2 dev, Project 2 prod, etc.

● Set up Service Control Policies as security guardrails for your
accounts. You can do this per account or per Organizational Unit.
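
As a sketch of such a guardrail (assuming SCPs are enabled in your
Organization; the policy content and the OU ID are examples), here's an SCP
that stops member accounts from leaving the Organization:

Unset

aws organizations create-policy \
  --name DenyLeavingTheOrganization \
  --type SERVICE_CONTROL_POLICY \
  --description "Member accounts cannot remove themselves from the Organization" \
  --content '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"organizations:LeaveOrganization","Resource":"*"}]}'

# Attach it to an Organizational Unit using the policy ID returned above
aws organizations attach-policy --policy-id p-examplepolicyid --target-id ou-EXAMPLE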
