AWS re:Invent 2022 - Reduce your operational and infrastructure costs with Amazon ECS (CON308)
In this session, walk through how to reduce operational overhead from the control plane with Amazon ECS. Learn about how to use containers for bin-packing workloads, efficient scaling techniques, cost savings plans, AWS Copilot, blueprints, and AWS Graviton.
Content
0.87 -> - Thank you for joining everyone,
2.13 -> and welcome to today's session.
4.59 -> In case you're in the wrong room,
5.91 -> we are doing CON308 about
reducing your operational
8.4 -> and infrastructure costs with ECS.
10.89 -> My name is Vibhav,
12.21 -> I'm a senior product
manager with Amazon ECS.
14.88 -> I've been with ECS for
about 2 1/2 years now.
18.63 -> Over the course of this time,
19.71 -> I've had an opportunity to interact
21.15 -> with a number of customers
22.967 -> and help them in a
number of different ways,
26.37 -> and also impact the product
right across the data plane,
29.067 -> and the control plane,
scheduler, deployments, whatnot.
33.87 -> I will give a quick heads up.
35.58 -> We have a lot of content to
share with you all today,
37.83 -> so we might not have a ton of
time left at the end for Q&A,
41.13 -> but I promise to stick around.
42.84 -> I'm also around till
Friday here at re:Invent.
44.97 -> So if there are any questions
46.83 -> that you would like to
talk about afterwards,
48.155 -> please do feel free to reach out.
50.91 -> So before we go ahead
and look at the agenda
54.06 -> for today's session,
55.17 -> I wanted to take an opportunity
57 -> to set a quick preamble for this session.
60.15 -> It's adapted from the AWS
Well-Architected Framework,
63.9 -> but in principle, my
pitch here today
66.75 -> is that I want to share
all of the best practices
69.87 -> that I've learned speaking to customers
72.15 -> and our principal engineering team
74.4 -> over the past 2 1/2 years,
76.62 -> and share best practices,
78.39 -> which, when utilized well, can
help you save up to 30% on costs,
82.77 -> or even, in some cases, 73%,
85.41 -> which we've seen in our customers.
87.72 -> So quickly walking through
the agenda for today.
90.3 -> So I'll start with a quick
overview of Amazon ECS
92.88 -> for those who might not be super familiar
94.8 -> or not already using ECS.
97.56 -> After that,
102.15 -> I'll spend a bunch of time talking about
103.77 -> how you can optimize
your workloads with ECS.
106.65 -> I'll be talking across three subsections.
109.26 -> And finally, I'll be
handing it over to Francis,
112.59 -> one of the customers I've
worked very closely with
114.87 -> over the past 2 1/2 years,
117.45 -> who will be sharing his journey
118.56 -> of how they have built
their platform on ECS
122.22 -> and how they've been able
to run their organization
124.98 -> in a super efficient manner.
127.74 -> Quick show of hands before I move on.
130.56 -> How many in the audience
are using ECS today?
135.24 -> Pretty cool. So about 70% of folks.
138.223 -> So for those in the audience
140.61 -> who aren't using ECS today already,
142.95 -> ECS is a fully managed container
orchestration solution.
146.31 -> ECS was introduced about eight years ago
149.01 -> at re:Invent itself.
151.11 -> Since then, we've added a number
of capabilities over time.
155.75 -> When it started, you could use ECS
158.4 -> to run your containerized
applications on EC2.
162.09 -> Later in 2018, we introduced Fargate,
164.511 -> which is a serverless compute
engine for running containers.
167.61 -> And most recently, in 2021,
we launched ECS Anywhere,
171.33 -> which allows you to run
containers on your own
174.39 -> compute running in your own
data centers or at the edge.
179.22 -> There are two key statistics
180.3 -> that I'd love to double click on here.
182.79 -> First, 2.25 billion tasks every week.
186.3 -> So that's the scale at
which customers rely on ECS.
190.5 -> So every week,
192.63 -> customers launch over
2.25 billion tasks on ECS.
196.74 -> And this is distributed
198.93 -> across hundreds of thousands of customers.
201.66 -> The second big stat I
would like to focus on here
204.18 -> is that 65% of all new
AWS containers customers
208.05 -> choose ECS as their first
platform for running containers.
211.41 -> Many of these customers, in
fact, might be new to AWS itself
214.83 -> and often see ECS as their
first intro to the cloud.
221.67 -> Before we move on,
222.81 -> a quick intro to Fargate
224.61 -> for those who might not already
be familiar with Fargate.
227.52 -> So Fargate is a serverless
compute engine for containers.
231.3 -> When you run containers on Fargate,
233.25 -> you don't need to think about
234.69 -> managing the underlying instance,
236.94 -> you don't need to think about
239.16 -> the underlying host operating system,
241.23 -> you don't need to think about patching,
243.03 -> you don't need to think about
things like observability
245.7 -> or agents to gather that
information, local storage,
249.57 -> all of that is taken care of.
253.173 -> The way we think about ECS is
that ECS's value proposition
257.82 -> closely ties with AWS's
value proposition as a whole.
261.72 -> And that really boils down
into two simple things.
264.962 -> By being performant at scale
266.85 -> and providing availability
and reliability,
269.73 -> and giving you a simple interface
271.77 -> to manage and run
applications and containers,
274.59 -> we can help you unlock cost
savings, better availability,
278.13 -> and help you get to production faster,
280.77 -> help you get features out more quickly
283.35 -> so that you can innovate
faster for your customers
286.11 -> in this day and age.
288.21 -> And that's why,
increasingly, we think of ECS
290.94 -> as your platform team.
293.91 -> Our goal is to provide you opinionated
295.77 -> and managed experiences
297.51 -> with the right level of
flexibility, of course,
300.03 -> but with the two key goals
as I identified earlier.
303.54 -> Improving time to production
and reducing operational overhead
307.02 -> so that you can focus
on what really matters,
309.6 -> and your developers can
focus on what really matters.
313.56 -> So with the intro out of the way,
316.77 -> I'd like to spend a couple of minutes
320.25 -> talking about how I'll be going
322.77 -> through the rest of the session.
324.24 -> So I'll be talking through
three key principles here.
328.26 -> First, operating less.
330.3 -> Why?
331.23 -> Because when you operate less,
332.7 -> you have more time to do
other things that matter more.
335.61 -> If you spend less time patching hosts,
337.56 -> managing software that does orchestration,
341.46 -> if you spend less time managing AMIs,
345.45 -> you get more time to focus on innovating,
348.51 -> to focus on improving all the other things
351.39 -> that move the needle for your business.
353.31 -> Secondly, utilizing more efficiently.
356.31 -> So the way I think about it
is the best way to save cost
360.93 -> is to use your compute more efficiently.
363.21 -> So we'll be doing a deep dive
364.74 -> on how you can run your
containers efficiently with ECS
368.22 -> and how that can unlock
cost savings for you.
371.22 -> Finally, we'll be talking through
373.17 -> some infrastructure choices
on AWS that you can make
376.956 -> to reduce your TCO quite substantially.
381.047 -> And when you do all of this,
383.46 -> it can unlock significant
cost savings for you.
386.67 -> So let's start with the first
pillar that I talked through,
389.64 -> operating less.
394.05 -> So customers want to focus on innovation,
398.099 -> and the goal of containers is
to facilitate that innovation.
401.58 -> However, it does take time to
get through towards that end.
406.56 -> You need to build out your application,
408.57 -> and then encode it into containers,
411.45 -> and finally, deploy them into production.
414.67 -> But once that's done, it's not done.
417.33 -> You need to continue to
manage your application,
420.36 -> you need to maintain hosts, patch hosts,
422.73 -> you need to observe, make
continuous improvements.
426.66 -> Now, if you use any other
container orchestration solution,
429.78 -> you not only have to manage all of this,
432.09 -> you also have to manage
the orchestrator itself
434.46 -> because it's not a fully managed solution.
437.88 -> However, when you're using
Amazon ECS and Fargate,
441.42 -> you can focus purely on your application,
444.887 -> which is your code, and
unlock innovation that way.
452.01 -> Before we go into the how,
454.32 -> a quick note
on why we think so.
459.24 -> So increasingly,
460.98 -> we see ECS as a serverless
container orchestrator.
464.19 -> Now what that means is
466.65 -> you don't need to think
about a control plane in ECS.
470.34 -> When you create a cluster in ECS,
472.41 -> it's not a physical entity
that lives in your AWS account.
476.85 -> It's purely a logical namespace
479.88 -> that resides in your account,
481.83 -> which you can use for
distributing applications
484.25 -> in the right manner that works
well for your organization.
487.47 -> There's no charge for the
compute for the control plane
490.83 -> and the control plane is provided
492.21 -> as a fully managed service by ECS,
495.696 -> which runs at immense
scale as we already showed.
499.53 -> And secondly,
501.9 -> Fargate is a serverless
compute engine for containers,
506.67 -> as I said earlier.
507.84 -> You don't need to think about hosts,
509.16 -> you don't need to manage capacity,
510.87 -> you don't need to think about AMIs
513.39 -> or operating system, agent, none of that.
517.89 -> So let's go back and
take a quick closer look
520.53 -> at how ECS manages all of this.
524.43 -> So first, let's start at the build stage.
527.67 -> So while there are a number
of other different ways
530.73 -> that you can deploy to ECS,
532.35 -> including the API, CLI,
534.645 -> you could use CloudFormation,
the console, whatnot.
541.08 -> Increasingly, we think about
higher level interfaces
545.13 -> which customers can use to develop on ECS,
548.85 -> and they take a number
of different shapes.
551.37 -> So the first, being ECS Blueprints.
554.31 -> So these are Terraform-based blueprints
556.44 -> that are available as an
open source project on GitHub
559.623 -> and maintained by AWS,
561.57 -> which allow you to get
started really quickly
564.72 -> with your first deployments
566.76 -> in case you're already using Terraform.
570.13 -> And it's not only a
getting started solution,
572.91 -> it's something that
allows you to get started
575.64 -> with best practices
576.66 -> and continuously evolve
that configuration over time
580.44 -> so that you're architecting
with the best practices
582.69 -> from the get go.
584.46 -> CDK extensions.
587.85 -> Similar to Blueprints for Terraform,
591.66 -> CDK extensions provide higher
level constructs in CDK.
596.85 -> CDK being an interface for
infrastructure as code,
601.83 -> which allows you to
provision infrastructure
604.23 -> in the programming
language of your choice.
606.72 -> Finally, Copilot CLI.
609.166 -> Copilot is an opinionated CLI,
611.79 -> which allows you to run applications
613.92 -> at higher order constructs
615.812 -> so that you don't need
to think about things,
618.66 -> like creating a build pipeline,
looking at infrastructure,
624.12 -> looking at the application spec.
626.334 -> It's almost as simple as saying,
628.477 -> "Here's my container. AWS,
please run it for me."
631.92 -> We'll do a closer look at Copilot CLI
634.62 -> in a couple of minutes.
638.91 -> I'd like to spend a
couple of minutes here.
641.25 -> So traditionally, when you
deploy an application on EC2,
646.856 -> there's a shared responsibility model
648.9 -> that you're signing up for.
650.52 -> There's some things that you manage,
652.5 -> whereas other things that AWS manages.
655.35 -> When you run a traditional EC2 machine,
658.28 -> typically, AWS manages the
virtual machine itself,
662.55 -> as well as the physical server.
664.62 -> However, everything
that's over that layer,
666.78 -> including your application,
the operating system
669.3 -> it's running on, the runtime,
any storage, monitoring,
673.38 -> and logging plugins that you need
deployed on your hosts,
679.23 -> all of those need to be managed by you.
681.93 -> Now, over time, that can
add significant overhead.
686.16 -> This was the reason why
we introduced ECS EC2
690.15 -> back in 2014,
691.804 -> to take some of that
responsibility away from you
696.24 -> so that you can focus more
on the application layer.
699.57 -> So when you use ECS with EC2,
702.401 -> you can use ECS-optimized AMIs,
705.36 -> which provide a managed host,
707.25 -> which provide a super
lightweight Amazon machine image
711.57 -> that you can use to run your hosts,
715.65 -> which includes the container runtime,
717.718 -> storage and logging plugins,
the operating system.
722.49 -> However, this is still in your account,
725.49 -> you still need to manage
726.87 -> everything associated with
patching, and whatnot.
730.74 -> However, with Fargate,
733.53 -> all of that is abstracted entirely by ECS.
736.59 -> You don't interact or see
the host operating system,
740.91 -> you don't see the container runtime,
742.906 -> you can choose to not
worry about monitoring
747.15 -> or logging plugins, all
of that comes fully baked
751.2 -> so that you can focus
purely on your application.
754.26 -> And that is why we are seeing customers
756.78 -> increasingly choose Fargate
758.82 -> as their first way to
deploy containers on ECS
762.9 -> or in AWS in general.
766.74 -> Now, moving on, another key
topic that I want to cover
771.84 -> is the observability part.
774.06 -> ECS brings observability out of the box.
776.79 -> So in 2019, we launched
CloudWatch Container Insights,
780.78 -> which allows you to simply say,
782.827 -> "Hey, ECS, I want observability
baked in for my cluster."
789.24 -> And ECS automatically
provides application metrics
793.62 -> for all applications
running on your cluster
796.62 -> at the granularity that you want.
799.2 -> For logging, ECS provides
FireLens for ECS,
804.15 -> which allows you to send your
logs to many destinations,
807.81 -> depending upon what your
logging platform is.
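As a rough illustration of the two observability features just described, the sketch below builds the request payloads one might pass to the ECS `create_cluster` and `register_task_definition` APIs (for example via boto3). The cluster name, application image, Region, log group, and the Fluent Bit image tag are all assumptions made for this example; only the `containerInsights` setting, the `firelensConfiguration` block, and the `awsfirelens` log driver are the mechanisms the talk refers to.

```python
import json

# Cluster setting that opts the cluster into CloudWatch Container Insights.
cluster_request = {
    "clusterName": "demo-cluster",  # hypothetical name
    "settings": [{"name": "containerInsights", "value": "enabled"}],
}

# Task definition fragment: a FireLens log router plus an app container
# whose logs are handed to the router through the awsfirelens driver.
container_definitions = [
    {
        "name": "log-router",
        # Assumed image reference; check the current AWS for Fluent Bit docs.
        "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
        "essential": True,
        "firelensConfiguration": {"type": "fluentbit"},
    },
    {
        "name": "app",
        "image": "example.com/my-app:latest",  # hypothetical image
        "essential": True,
        "logConfiguration": {
            "logDriver": "awsfirelens",
            # Illustrative Fluent Bit output options; the destination is
            # whatever your logging platform is.
            "options": {
                "Name": "cloudwatch_logs",
                "region": "us-east-1",
                "log_group_name": "/demo/app",
                "log_stream_prefix": "app-",
                "auto_create_group": "true",
            },
        },
    },
]

print(json.dumps(cluster_request, indent=2))
print(json.dumps(container_definitions, indent=2))
```

Switching the logging destination then means changing only the `options` block, not the application container itself.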
811.68 -> However, observability isn't
just about observing metrics
816.36 -> and logs.
817.26 -> Observability is also
about observing costs.
819.69 -> And this can be super, super important
821.61 -> when you're trying to cut down costs,
824.28 -> given the current
macroeconomic environment.
827.34 -> However, as those of you who already
run containers might know,
831.75 -> monitoring costs per container is hard.
834.003 -> If you're launching containers
838.35 -> which last for only a few minutes,
841.29 -> which are sharing resources
across an EC2 host,
845.867 -> which is shared by multiple
different containers,
848.91 -> it can be really hard to keep track
850.77 -> of how much compute is being
utilized by which application,
855.39 -> and allocating that cost
at an application level.
858.621 -> This is a really hard problem.
861.78 -> However, when you use ECS with Fargate,
865.2 -> ECS provides you
task-level cost visibility
868.95 -> out of the box.
870.21 -> So when you run your application with ECS,
874.2 -> you can see how much you're
paying per application,
878.13 -> instead of seeing how much
you're paying per box.
881.82 -> What this means is that by using tags,
885.45 -> you can get to really,
really granular data,
890.37 -> really, really granular cost
data for your application,
893.43 -> which you would not be
able to do otherwise,
895.86 -> even in a shared
multi-tenant environment.
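To make the per-application cost idea concrete, here is a minimal sketch of how task-level usage could be rolled up by a tag. The two prices are placeholders rather than real AWS rates, and the `TaskRun` record and tag names are hypothetical; in practice this data would come from cost allocation tags in your billing reports.

```python
from dataclasses import dataclass

# Placeholder prices, NOT real AWS rates. Look up current Fargate pricing
# for your Region before using numbers like these.
ASSUMED_PRICE_PER_VCPU_HOUR = 0.04
ASSUMED_PRICE_PER_GB_HOUR = 0.004


@dataclass
class TaskRun:
    app_tag: str      # value of a hypothetical "application" tag
    vcpu: float       # vCPU reserved by the task
    memory_gb: float  # memory reserved by the task
    seconds: int      # how long the task ran


def task_cost(run: TaskRun) -> float:
    """Cost of one task: reserved vCPU and memory multiplied by runtime."""
    hours = run.seconds / 3600
    return hours * (
        run.vcpu * ASSUMED_PRICE_PER_VCPU_HOUR
        + run.memory_gb * ASSUMED_PRICE_PER_GB_HOUR
    )


def cost_by_app(runs: list[TaskRun]) -> dict[str, float]:
    """Sum task costs per application tag."""
    totals: dict[str, float] = {}
    for run in runs:
        totals[run.app_tag] = totals.get(run.app_tag, 0.0) + task_cost(run)
    return totals


runs = [
    TaskRun("checkout", vcpu=1, memory_gb=4, seconds=3600),
    TaskRun("checkout", vcpu=1, memory_gb=4, seconds=1800),
    TaskRun("search", vcpu=0.5, memory_gb=1, seconds=3600),
]
print(cost_by_app(runs))
```

Because each task is billed for what it reserves, short-lived tasks need no special handling: they simply contribute a small number of seconds.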
903.728 -> Before we end this section,
905.85 -> I did want to briefly
double click on AWS Copilot,
909.39 -> especially for those of
you who are new to ECS.
913.53 -> So Copilot is built
915.93 -> with well-architected principles in mind.
919.38 -> So what that means is when you launch
921.51 -> your first application
on ECS with Copilot,
924.72 -> you're already well-architected
from the outset.
928.47 -> So your first application itself
930.36 -> uses load balancers efficiently,
932.94 -> it uses health checks,
it it has observability,
937.56 -> all of that built-in.
939.18 -> What's more,
941.288 -> Copilot also makes it easy for you
943.92 -> to set up deployment pipelines,
945.81 -> which make it really fast and easy
947.55 -> for your development
teams to push out code.
951.78 -> Also, Copilot makes it easier for you
954.03 -> to troubleshoot and operate,
especially for developers,
957.99 -> especially for environments
where you have developers
960.72 -> running and managing the entire stack.
964.916 -> Taking a step back before
we move to the next section,
969.15 -> when we think about all of the things
971.97 -> that I spoke about combined,
974.79 -> look at the amount of time
saving that it creates for you
979.35 -> to not think about
managing the control plane,
981.96 -> the data plane, host patching.
985.35 -> All of that time can be
used for driving innovation,
988.26 -> for improving your release velocity,
992.13 -> for spending time on other
things that you can use
996.42 -> to cut down your costs.
999.69 -> With that said,
1000.71 -> let's move on to the next
pillar, utilizing more.
1005.93 -> So I'm sure,
1007.618 -> I hope most of you are
familiar with this image.
1010.91 -> This is the image that
always pops up in my mind
1013.1 -> when I think about running
containers efficiently.
1017 -> Just like in Tetris,
1019.616 -> the goal is to
fill up lines perfectly
1024.064 -> and increase your score.
1026.21 -> When you run containers efficiently
1028.25 -> and utilize your compute
efficiently, you save cost.
1032.75 -> And that's why, as I said,
1034.37 -> the best way to optimize your costs
1036.328 -> is to improve efficiency of
utilization of your compute.
1041.96 -> Before we move on,
1043.52 -> for those of you who aren't
super familiar with ECS,
1046.82 -> I want to take a quick moment
1048.68 -> to talk through some ECS terminology here.
1052.07 -> So not a lot, just three quick terms.
1054.47 -> First, cluster.
1056.45 -> A cluster in ECS is nothing
more than a namespace.
1059.6 -> It's not a physical entity
that resides in your account.
1062.99 -> It's fully managed by ECS.
1064.82 -> You can create as many
clusters as you want.
1067.9 -> An ECS task.
1069.29 -> Task is the smallest
unit of compute in ECS.
1072.23 -> Task is a group of 1 to 10 containers,
1075.454 -> which typically might constitute
1078.23 -> one single unit of an
application or a microservice.
1081.68 -> And finally, services.
1083.686 -> An ECS service typically represents
1087.56 -> a group of running tasks,
1089.33 -> which combine to form a microservice.
1093.782 -> So the service construct in ECS
provides inbuilt resiliency.
1099.26 -> What that means is if a task crashes,
1103.27 -> if a task in a service crashes,
1105.86 -> ECS ensures that it will bring it back up
1108.14 -> so that your application
remains highly available.
1111.17 -> So with that out of the way,
1115.13 -> I'll take a step back to talk
about the three key dimensions
1118.25 -> that you can think about to
improve compute utilization.
1122.18 -> The first one being, size of the task.
1124.85 -> So when you run a container on ECS,
1128.3 -> you need to configure
1129.47 -> how much compute you
need for your container.
1132.2 -> So you might say,
1134.217 -> "I want one vCPU and four
gigabytes of memory."
1139.28 -> But depending upon how you utilize that
1142.79 -> or how much your application needs,
1144.71 -> there might be an
opportunity to cut costs.
1147.77 -> Secondly, number of tasks in a service.
1150.38 -> So how much traffic your application sees
1153.26 -> over the course of a day can vary a lot.
1155.778 -> You can use this to make
sure that your applications
1160.19 -> are always right-sized.
1161.96 -> And finally, compute
capacity in the cluster.
1169.251 -> This represents the
synthesis of everything else.
1172.49 -> So if you have, say, n instances
running in your account,
1176.63 -> you want to ensure
1177.8 -> that those n are being utilized
1179.81 -> to the right extent possible.
1182.45 -> And by doing that, you
can obviously save cost.
1185.72 -> So let's start by looking
at the task level.
1189.35 -> So let's look at this example scenario.
1191.75 -> So if you look at this application,
1194.12 -> it's pretty clear that
it's becoming CPU bound,
1196.85 -> whereas it isn't using
more than 50% memory
1202.46 -> at any point of time.
1204.26 -> So in this scenario,
1205.55 -> you could easily reduce
the amount of memory
1209.45 -> that you've reserved for this container,
1212.48 -> and at the same time,
1214.61 -> increase the amount of
compute you have reserved.
1218.636 -> And by doing that, you
could likely save some cost.
1223.28 -> Right-sizing tasks is important.
1225.71 -> You need to look at historical metrics
1228.62 -> over a reasonable period of time
1230.9 -> in order to make sure
1231.98 -> that you're right-sizing
your task correctly.
1235.04 -> And it's okay to not size your
task correctly on day one.
1239.055 -> This is something you can
correct over the course of time.
1243.538 -> In order to do right-sizing well,
1248.33 -> you can use Container Insights
1249.77 -> that I spoke about earlier,
1251.99 -> which allow you to monitor
1253.46 -> historical compute utilization data
1257.27 -> for each task or application.
1259.04 -> And you can look at the average,
1262.49 -> as well as the peak utilization,
1264.5 -> to make sure that you've
sized your task correctly.
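One way to turn observed peaks into a task size is sketched below. It picks the smallest Fargate CPU and memory combination that covers the measured peak plus some headroom. The size table is an assumption based on the Fargate documentation at the time of writing (larger sizes are omitted), and the 20% headroom and the sample numbers are purely illustrative.

```python
# Fargate CPU units -> allowed memory values in MiB.
# Assumed table; verify against the current Fargate documentation.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def suggest_size(peak_cpu_units, peak_memory_mib, headroom=0.2):
    """Return the smallest (cpu, memory) pair covering peak plus headroom.

    Returns None when nothing in the table is large enough.
    """
    need_cpu = peak_cpu_units * (1 + headroom)
    need_mem = peak_memory_mib * (1 + headroom)
    for cpu in sorted(FARGATE_SIZES):
        if cpu < need_cpu:
            continue
        for mem in FARGATE_SIZES[cpu]:
            if mem >= need_mem:
                return cpu, mem
    return None


# A CPU-bound task like the one in the example: CPU near its 1024-unit
# limit, memory well under half of its 4096 MiB reservation.
print(suggest_size(peak_cpu_units=950, peak_memory_mib=1900))
# A lightly loaded task that can shrink on both dimensions.
print(suggest_size(peak_cpu_units=400, peak_memory_mib=900))
```

The peaks would come from historical data such as Container Insights; using peak rather than average utilization is a conservative choice, and workloads that tolerate brief throttling could use a high percentile instead.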
1269.03 -> With that said, let's move
on to the next big pillar,
1273.29 -> number of tasks in your service.
1275.93 -> So the amount of traffic
that your application serves
1279.32 -> can vary a lot.
1280.58 -> So let's say this application
needed 10 tasks at peak.
1285.35 -> At different times in the day,
1287 -> this application might need 6, 7, 8,
1289.79 -> or even five tasks in some cases.
1292.46 -> In order to respond to that,
1294.53 -> ECS allows you to configure auto-scaling
1299.81 -> to scale your service horizontally
1301.73 -> so that it automatically
responds to traffic.
1304.87 -> So what this means is
1308 -> that when you use auto-scaling for ECS,
1310.547 -> ECS automatically observes
1312.717 -> what is the utilization for your compute.
1316.97 -> And based on the target
metric that you specify,
1320.24 -> ECS scales up or scales down your service
1323.39 -> to ensure that your application
is always right-sized,
1328.31 -> so that your application is available,
1330.59 -> but at the same time,
1332.855 -> you're not running more tasks
1334.43 -> than is absolutely necessary
for your application.
1337.94 -> You can scale your service
based on a variety of metrics.
1341.27 -> It can be ECS metrics,
1343.19 -> such as the CPU and memory utilization,
1345.95 -> it can be metrics shipped out of
the box by other AWS services,
1349.97 -> such as ALB or SQS,
1352.55 -> or it could be custom application
metrics that you generate.
1355.55 -> Depending upon what is
right for your application,
1358.85 -> you can use the metric of your choice
1361.46 -> to auto-scale your application.
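The intuition behind scaling toward a target metric can be sketched with simple proportional arithmetic. This is a simplification for illustration, not the actual algorithm used by service auto scaling, which also involves alarms and cooldown periods; the function name and the minimum and maximum bounds are assumptions.

```python
import math


def desired_task_count(current_tasks, metric_value, target_value,
                       min_tasks=1, max_tasks=10):
    """Estimate how many tasks bring the average metric back to target.

    If 6 tasks average 90% CPU against a 60% target, the same load spread
    over 9 tasks would average roughly 60%.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_tasks * metric_value / target_value)
    # Clamp to the configured bounds of the service.
    return max(min_tasks, min(max_tasks, raw))


print(desired_task_count(6, metric_value=90, target_value=60))   # scale out
print(desired_task_count(10, metric_value=30, target_value=60))  # scale in
```

The same arithmetic applies whether the metric is CPU utilization, requests per target from a load balancer, or queue depth per task; only the meaning of `metric_value` changes.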
1364.43 -> All right.
1365.45 -> With that said, let's move
on to the next big bucket,
1370.43 -> utilization at the cluster level.
1372.83 -> This is fundamentally a hard problem
1374.9 -> because a cluster
typically has 10 instances.
1378.08 -> Let's say a cluster has 10 instances
1380.72 -> and 10 services running on it,
1382.91 -> each service requires a
different number of tasks
1385.73 -> over the course of the day.
1387.35 -> Now, to make sure that
the size of the cluster
1390.59 -> is sufficient to run all
of those 10 services,
1393.446 -> but also not over-provisioned
1397.16 -> is fundamentally a hard problem.
1400.04 -> To simplify this, in 2019,
1402.257 -> ECS introduced the notion
1403.52 -> of what we call capacity providers.
1406.1 -> When you use capacity providers,
1408.29 -> ECS automatically looks
1411.89 -> at not just what your
application needs right now,
1415.04 -> but what is the amount of compute
1417.56 -> that might be required
in the next few minutes.
1422.42 -> And based upon that,
1423.65 -> it automatically right-sizes
your cluster.
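To show why cluster sizing resembles the Tetris picture, here is a small first-fit-decreasing bin-packing sketch. It is a textbook heuristic used for illustration only, not how ECS or capacity providers place tasks, and it considers a single dimension (CPU units) where a real cluster must fit both CPU and memory. The task sizes and instance capacity are made-up numbers.

```python
def instances_needed(task_sizes, instance_capacity):
    """First-fit decreasing: place each task, largest first, on the first
    instance that still has room; open a new instance when none does."""
    free = []  # remaining capacity on each instance opened so far
    for size in sorted(task_sizes, reverse=True):
        if size > instance_capacity:
            raise ValueError("task does not fit on any instance")
        for i, room in enumerate(free):
            if size <= room:
                free[i] = room - size
                break
        else:
            # No existing instance had room for this task.
            free.append(instance_capacity - size)
    return len(free)


tasks = [2048, 1024, 1024, 512, 512, 3072]  # CPU units per task (illustrative)
print(instances_needed(tasks, instance_capacity=4096))
```

With these numbers the six tasks fill two instances exactly, whereas running one task per instance would need six. The gap between those two figures is the waste that good packing removes.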
1428.63 -> However, it is still a hard problem.
1433.043 -> In order to ensure availability,
1435.26 -> you would likely need to
have some spare capacity,
1438.26 -> and it becomes hard.
1441.68 -> And it's not something
that we want our customers
1444.32 -> to really worry about.
1446.15 -> So what if you didn't have
to think about it at all
1449.39 -> and instead could just
think about running compute?
1452.48 -> So with Fargate, you could simply do that.
1455.87 -> You don't need to worry
1456.92 -> about underlying host instances at all.
1459.53 -> You're just running tasks
1461.066 -> on compute that is provided by AWS.
1464.66 -> So what that means is that
you only need to worry
1467.93 -> about the first two dimensions
that we spoke about.
1471.02 -> First, you want to right-size your tasks
1473.72 -> so that you're not paying for compute
1475.73 -> that you aren't utilizing.
1477.17 -> And second, you need to make sure
1478.43 -> your service size is correct.
1480.44 -> As long as those two are fine,
1482.18 -> you will never be over-reserving
and underutilizing capacity
1485.521 -> that you're paying for.
1490.07 -> And this is why we've seen
customers, like Edmunds,
1493.19 -> reduce costs up to 30%
1495.8 -> by switching to ECS
Fargate from EC2 on-demand.
1503.09 -> Quickly moving on to the next big pillar.
1507.62 -> The infrastructure choices you make
1509.75 -> when you run your application
on AWS
1513.56 -> can have pretty strong ramifications
1516.32 -> on what your AWS bill looks
like at the end of the day.
1519.77 -> There are three key choices
that we see to reduce cost.
1525.65 -> First, what type of
hardware you're running.
1528.71 -> Second, what type of
capacity you're running on.
1531.44 -> And finally, making
commitments and reservations
1535.004 -> which allow you to benefit from
commitment-based discounts.
1539.66 -> So first, AWS Graviton is
AWS's own custom silicon
1547.43 -> that we introduced multiple years back,
1550.91 -> which AWS Graviton today
provides the highest performance
1555.08 -> in an instance family,
1557.03 -> 20% lower cost compared
to same size instances
1561.35 -> in the same family, and up to
40% higher cost performance.
1565.55 -> What that means is you
can run applications
1570.65 -> more efficiently at a lower cost.
1574.73 -> ECS is deeply integrated with Graviton.
1578.27 -> You can run ECS tasks on
Graviton with Fargate,
1585.407 -> you can run ECS tasks
on EC2 with Graviton,
1588.32 -> you can simply use Graviton instances
1590.69 -> with an ECS-optimized AMI
1593.523 -> and use that to run your containers.
1596.72 -> And ECS fully supports multi-arch images,
1599.24 -> which are needed to run Graviton.
1601.73 -> And we've seen customers across the board,
1603.95 -> utilize Graviton with ECS
for a number of workloads.
1609.11 -> For everything, from
microservices to caches,
1612.68 -> to media encoding, analytics, whatnot.
1617.99 -> Moving on.
1620.06 -> Spot capacity is spare
capacity in AWS data centers,
1623.75 -> which is made available via Spot.
1627.974 -> EC2 and Fargate Spot is how
you can do more for less.
1632.42 -> It offers up to 90% off on-demand prices,
1634.781 -> on the same host of instance types,
1639.14 -> the same number of SKU types on Fargate.
1644.48 -> What this means is you can pay less
by running applications
1649.16 -> that are interruptible.
1650.78 -> So that's something to keep in mind.
1653.03 -> When you run Spot capacity,
1655.58 -> your capacity can be interrupted
1657.47 -> with a two-minute notice
at any point of time.
1660.23 -> However, when you're using containers,
1663.38 -> containers make it much easier
to run workloads on Spot
1670.7 -> because containers, by definition,
1672.86 -> are prone to interruption
1674.27 -> and are more likely to be
cycled around in any case.
1678.23 -> Quick nuance between EC2 and Fargate Spot.
1680.99 -> So when you use Fargate Spot,
1683.42 -> it provides up to 70% discount
1686.691 -> compared to standard Fargate,
1688.97 -> whereas EC2 Spot might
have higher discounts.
1691.67 -> And secondly, when you use Fargate Spot,
1695.27 -> the underlying instance pool
1697.49 -> that is used for running your tasks
1699.65 -> is automatically diversified.
1701.57 -> So the likelihood of
seeing an interruption
1704.87 -> when you run Fargate Spot
is automatically lowered.
1709.73 -> ECS makes it easier to run Spot
1712.49 -> by handling interruptions automatically.
1714.68 -> So when you run a
long-running application on ECS
1720.181 -> and a Spot termination notice comes in,
1723.62 -> ECS automatically reads
that interruption notice
1727.686 -> and gracefully drains any connections
1730.13 -> to any load balanced applications
1732.233 -> that might still be receiving connections.
1735.05 -> What that means is ECS ensures
1738.02 -> that all outstanding
requests are processed.
1740.87 -> And once that is done,
1741.98 -> ECS tells the container
that it should shut down.
1745.28 -> There are a few load
balancer configurations there
1748.85 -> that you need to be careful of
1751.379 -> when you're running
applications with Spot on ECS.
1756.29 -> However, that is best practice,
1759.56 -> and doing that allows you to benefit
1762.567 -> from the steep discounts
that Spot can offer.
1768.71 -> What is more, when you
use capacity providers,
1771.5 -> you can mix on-demand and Spot capacity
1775.441 -> to improve your availability posture
1781.4 -> without increasing costs substantially.
1784.286 -> Capacity providers work with
both Fargate and Fargate Spot,
1788.27 -> as well as EC2 and EC2 Spot,
1790.31 -> so you can use capacity providers
1792.23 -> to mix both EC2 and EC2 Spot capacity,
1794.904 -> as well as Fargate and
Fargate Spot capacity.
1797.93 -> You can't mix EC2 and Fargate
capacity on ECS today.
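A capacity provider strategy expresses the on-demand and Spot mix with a `base` and a `weight` per provider. The sketch below approximates how a given number of tasks might be divided under such a strategy: the base is satisfied first, then the remainder is spread in proportion to the weights. It is an illustration of the idea, not the ECS scheduler's actual placement logic, and the strategy values are example numbers.

```python
def split_tasks(total_tasks, strategy):
    """Approximate task counts per capacity provider.

    strategy is a list of dicts with 'capacityProvider', 'weight', and an
    optional 'base' (minimum tasks placed on that provider first).
    """
    counts = {s["capacityProvider"]: 0 for s in strategy}
    remaining = total_tasks

    # Step 1: satisfy each base before weights are considered.
    for s in strategy:
        base = min(s.get("base", 0), remaining)
        counts[s["capacityProvider"]] += base
        remaining -= base

    # Step 2: hand out the rest one task at a time to whichever provider
    # is furthest below its weighted share.
    weighted = [s for s in strategy if s["weight"] > 0]
    if remaining and not weighted:
        raise ValueError("at least one provider needs a positive weight")
    placed = {s["capacityProvider"]: 0 for s in weighted}
    for _ in range(remaining):
        pick = min(weighted, key=lambda s: placed[s["capacityProvider"]] / s["weight"])
        placed[pick["capacityProvider"]] += 1
        counts[pick["capacityProvider"]] += 1
    return counts


strategy = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]
print(split_tasks(10, strategy))
```

Here two tasks always stay on on-demand Fargate for availability, and the remaining eight are split one to three between on-demand and Spot.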
1805.4 -> Finally, I do want to make a brief mention
1810.11 -> about commitment-based
discounts with Savings Plans.
1813.59 -> So Savings Plans provide a
flexible discounted pricing model
1819.29 -> based on commitments that you make
1821.84 -> to use over an n-year period.
1824.63 -> Compute Savings Plans apply
to your compute resources
1828.02 -> across Fargate, EC2 and Lambda,
1831.59 -> so it automatically
applies to any applications
1834.14 -> that you run on ECS.
1836.24 -> However, when you use
EC2 Instance Savings Plans,
1838.91 -> those are only available for EC2.
1841.512 -> So that's something you
might want to keep in mind.
1844.55 -> As a quick glance of what the savings
1847.622 -> with Fargate Savings Plan might look like,
1851.21 -> with a three-year commitment,
1852.95 -> you can save roughly 50 to 52% on on-demand costs.
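For a feel of what the percentages quoted in this section mean in practice, the sketch below applies each one to a hypothetical monthly on-demand bill. The baseline figure is invented, the percentages are the "up to" ceilings mentioned in the talk rather than guaranteed rates, and each scenario is computed independently because these discounts do not simply stack.

```python
def discounted(cost, discount):
    """Cost after applying a fractional discount (0.20 means 20% off)."""
    return cost * (1 - discount)


baseline = 1000.0  # hypothetical monthly on-demand spend

scenarios = {
    "on-demand": baseline,
    "graviton (20% lower cost)": discounted(baseline, 0.20),
    "fargate spot (up to 70% off)": discounted(baseline, 0.70),
    "savings plan, 3-year (up to ~52% off)": discounted(baseline, 0.52),
}

for name, cost in scenarios.items():
    print(f"{name:40s} {cost:8.2f}")
```

Actual savings depend on the workload, the Region, and how much of the usage is steady enough to commit to or interruptible enough for Spot.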
1860.15 -> With these out of the way, or
actually just taking a moment,
1865.1 -> we have seen customers, like athenahealth,
1867.56 -> reduce their costs by up to
73% compared to running on Lambda
1872.45 -> by moving their event
processing workload to ECS
1876.23 -> with EC2 Spot.
1878.563 -> This was our motivation
for doing this session
1884.15 -> in the first place.
1885.23 -> Architecting your applications
1887.45 -> to utilize compute
efficiently, to operate less,
1891.44 -> and to use the right set of compute types
1894.17 -> allows you to unlock significant savings.
1898.55 -> With that, I'll hand over to Francis.
1901.64 -> I've worked closely with Francis
1903.11 -> over the last two years
and I've been amazed
1905.69 -> by how he runs an
organization at this scale
1909.235 -> with such a small team.
1911.822 -> Over to you, Francis.
1913.7 -> - Thank you, Vibhav.
1916.37 -> I'm excited to be here
1917.39 -> and talk to you about
how CleverTap uses ECS,
1921.8 -> keeps us agile, and
how we leverage AWS ECS
1925.94 -> to operate infrastructure at scale.
1933.77 -> Oh, sorry, it's going backwards.
1937.94 -> All right, so CleverTap, just
to give you some background,
1940.58 -> is an omnichannel customer engagement
1943.61 -> and user retention platform
powered by TesseractDB,
1949.52 -> the world's first purpose built database
1952.37 -> for user engagement and retention
1955.19 -> that delivers incredible performance,
1958.31 -> and lightning fast data processing,
1960.38 -> and unlimited secure storage capability.
1965.21 -> This enables us to surface and
report on parallel analytics
1969.65 -> and insights into user
behaviors and trends
1973.67 -> via cohorts, funnels, trends,
flows, pivots, and insights.
1981.44 -> And then, based on the
analytical data and insights,
1985.85 -> you can engage users across
messaging touch points,
1991.1 -> such as push, in app, text messaging,
1996.59 -> WhatsApp, app inbox, email,
in real time and offline.
2004.45 -> Essentially, we enable app owners
2007.27 -> to retain and grow their user base.
2011.83 -> And hi, I'm Francis.
2014.23 -> I'm the VP of infrastructure
and security at CleverTap.
2018.28 -> I've been helping keep
lights on for 16 years.
2022.18 -> Been fortunate enough to
do it right out of college.
2025 -> Last eight years, I've been at CleverTap.
2028.3 -> And today, I'm gonna talk to
you about how we do ECS,
2031.81 -> a journey that spans
2037.12 -> essentially running infrastructure
2040.12 -> for about eight years at CleverTap,
2042.1 -> and a story of how we fell
in love with containers,
2044.95 -> and where ECS fits into
this whole picture.
2048.58 -> So from 2012 to 2016,
this is the beginning,
2055.69 -> we used, well, there wasn't a
lot of ECS in the first place.
2060.04 -> So there were AMIs, we bootstrapped them
2065.38 -> for both stateful and stateless workloads.
2068.05 -> What would happen is then
this was a combination
2070.3 -> of user data coming in
2072.16 -> and bootstrapping our
configuration management system
2076.3 -> that would then sort of bring in
2079.84 -> the application runtime environment,
2081.97 -> bring in the application, start it up.
2085.84 -> But there was a problem,
2087.67 -> there were just way too many
moving parts in this system.
2091.12 -> Yeah, something could go wrong.
2093.13 -> Every now and then, it did.
2095.41 -> Things such as package
repositories being offline
2098.95 -> during the bootstrap process.
2100.57 -> And boom, bootstrap failure.
2102.49 -> Holy shit, we're in the
middle of this scale-up,
2105.04 -> and bootstrap failures.
2107.95 -> So there had to be a better way.
2109.54 -> There just had to be a better way.
2112.9 -> AMI was one of them.
2113.733 -> I mean you could burn your AMI,
2118.21 -> including the runtime
dependencies and the application,
2121.3 -> but this meant that for every
build you are making an AMI,
2125.56 -> and that's a lot of AMIs in an account.
2130.96 -> So that's when we came across Docker.
2132.85 -> A way to package runtime dependencies
2136.36 -> along with the application
without the kernel
2139.96 -> and the other things that are required
2141.34 -> to keep the operating system going,
2143.14 -> interact with the hardware.
2145.36 -> And so in, in 2017 we
started using ECS with EC2.
2150.16 -> Sorry, in 2017,
2152.47 -> we went what I think of
as container version 1,
2157.78 -> where we packaged the
application inside a container
2162.79 -> and sort of deployed with
AWS CodeDeploy, no ECS.
2168.37 -> This worked.
2169.51 -> But then we realized
2170.5 -> we were sort of reinventing
2172.06 -> a container orchestration engine.
2174.49 -> And that's when we went all in with ECS.
2181.48 -> But then something interesting happened.
2182.92 -> So we've been on ECS since 2018
2186.43 -> for both stateful and stateless workloads.
2189.85 -> There's just everything
orchestrated with ECS.
2193.99 -> But then something
interesting started to happen.
2196.63 -> Starting mid-June this year,
2200.35 -> we kind of made this
decision to move to Fargate,
2204.52 -> to move all of our stateless
workloads to Fargate.
2207.277 -> And I'm gonna talk to you about Fargate,
2209.2 -> and why we want to do that,
why would we choose to do that.
2214.03 -> So a containerized application
2218.8 -> is surrounded by its runtime dependencies,
2221.89 -> that is the root file system.
2224.35 -> As a team, this is where we
wanna focus our time and money.
2230.83 -> There is no business benefit
2232.39 -> on focusing on patching
operating system packages,
2237.395 -> hardening the operating system.
2239.32 -> But don't get me wrong here,
2240.52 -> I'm not saying you shouldn't be doing it.
2241.87 -> It's super critical.
2244.06 -> Just that there's just
no business benefit.
2247.96 -> Having run security
engineering for a while,
2250.96 -> you have to figure out
how do I balance time
2256.45 -> and focus on doing these repetitive things,
2261.01 -> as opposed to spending time
2262.24 -> doing what truly takes
my business forward.
2266.77 -> So what if, just like we don't
think about the hypervisor?
2274.09 -> When was the last time
2274.923 -> you thought about patching a hypervisor?
2277.295 -> Just like you don't think
about virtualization,
2281.59 -> what if you could make the
operating system go away?
2285.1 -> And that was our motivation for Fargate.
2287.86 -> Fargate makes us forget
the operating system.
2291.55 -> So we can just focus on
what is key on the app,
2297.58 -> on the runtime environment,
hardening and securing it.
2305.362 -> All right.
2307.66 -> The other thing is ECS is great on EC2,
2311.44 -> but you have to manage these resources
2314.71 -> inside of the cluster,
2317.14 -> and resources in the
form of EC2 instances,
2320.8 -> things where your containers run.
2323.17 -> But the thing is your ECS container,
2327.43 -> the resource is decoupled
from the containers.
2332.11 -> So you can find yourself in a state
2335.32 -> where the service is trying to scale-up,
2338.53 -> but you don't have underlying compute
2340.84 -> to feed that scale up.
2343.05 -> In our case, we have custom
alarms and Lambda functions
2348.1 -> all glued together to sort of preempt
2351.16 -> when a service is going to
scale or is likely gonna scale,
2355.15 -> and then feed that capacity in place
2357.52 -> just before the service is gonna scale-up,
2359.83 -> so there are enough resources to run that.
2364.15 -> But why do it?
2366.13 -> I mean, you should be able to say,
2367.637 -> "Hey, ECS, run this container."
2370.75 -> And boom, it should come to life.
2373.69 -> And that's where Fargate comes in.
2374.92 -> And that's why we made
this decision to go all in.
2379.09 -> And we've taken all of
our stateless compute
2381.97 -> and moved it off to Fargate.
2384.75 -> This was like a three-month exercise.
2387.22 -> And I'm gonna tell you
things that we learned,
2390.124 -> not just from the Fargate migration,
2393.01 -> but things from just operating ECS
2396.49 -> over the last four years.
2402.01 -> You would expect sort of cluster events,
2404.32 -> all these things that are
going on inside of a cluster.
2407.2 -> Things, such as containers starting up,
2408.91 -> containers scaling-out,
2410.65 -> containers registering
with a load balancer,
2412.63 -> containers being terminated,
2414.07 -> containers restarting in
a restart loop, I forgot.
2417.76 -> And you expect to see
them in the CloudTrail
2420.61 -> so you can sort of put up
monitoring alerts, alarms,
2423.61 -> that kind of stuff, but you don't.
2426.67 -> That's not how it is.
2427.503 -> And then it kinda struck us, right?
2429.16 -> Just like you don't see
queries from an RDS database
2431.74 -> inside of RDS database in the CloudTrail
2434.8 -> or S3 data events get and
put in the CloudTrail.
2438.7 -> These are events happening
inside of a service.
2441.16 -> So they don't show up in the CloudTrail.
2444.76 -> I wish somebody told me this up front.
2447.209 -> The good news is that you
can send it to EventBridge.
2451.48 -> And then from EventBridge,
2452.62 -> you can take it to whatever
monitoring system you got
2455.53 -> and do what you want, and
what you expect to do with it
2458.63 -> if it showed up in CloudTrail.
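As a rough illustration of routing those events, here is a minimal Python sketch of an EventBridge event pattern for ECS. The `aws.ecs` source and the `ECS Task State Change` detail type are the ones ECS emits; the rule name, the `matches` helper (a simplified subset of EventBridge matching), and the choice to filter on stopped tasks are assumptions made for the example.

```python
import json

# Pattern that selects ECS task state changes where the task has stopped.
STOPPED_TASK_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}


def matches(pattern, event):
    """Simplified local check of an event against a pattern.

    Hypothetical helper covering only exact-value lists and nested
    objects, which is a small subset of what EventBridge supports.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def create_rule(rule_name, pattern):
    """Create the EventBridge rule (needs boto3 and AWS credentials).

    A rule only selects events; a target (for example an SNS topic or a
    Lambda function) still has to be attached with put_targets.
    """
    import boto3  # imported lazily so the sketch runs without boto3

    events = boto3.client("events")
    return events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )


if __name__ == "__main__":
    print(json.dumps(STOPPED_TASK_PATTERN, indent=2))
```

Calling something like `create_rule("ecs-stopped-tasks", STOPPED_TASK_PATTERN)` would register the rule; from there the events can be forwarded to whichever monitoring system is in use.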
2462.73 -> The other thing is tasks
can't be protected.
2465.67 -> So you can essentially take an EC2.
2468.7 -> If you can take an EC2
instance out of a load balancer
2473.32 -> to sort of observe the
application's behavior
2476.71 -> without exposure to traffic.
2480.28 -> If you attempt to do that with ECS,
2483.52 -> both on EC2, as well as
Fargate, the task is destroyed.
2492.597 -> There's like some kernel capabilities
2496.781 -> that don't work on Fargate.
2499.15 -> For most use cases, you
probably wouldn't encounter it.
2504.67 -> But if you do, I wanna tell you up front,
2508.12 -> there are these nuances and
you should go check them out.
2514.69 -> So I told you, it took
like three months to move
2518.29 -> from EC2 ECS to ECS Fargate,
and that's weird, right?
2524.32 -> These are containers,
2525.153 -> they should just move
from one place to another,
2527.53 -> and why does it take so long?
2529.87 -> Turns out, the application
used instance IAM credentials
2535.87 -> when it started making these
calls to other AWS services.
2540.73 -> Now, when you move the
application over to Fargate,
2546.7 -> you no longer control the underlying host,
2552.46 -> and so you cannot attach
an IAM role to it.
2556.3 -> And so Fargate forces you to
sort of use the task IAM role.
2563.65 -> And now suddenly, you have
to go find every place
2565.87 -> that is an AWS client in your
application and change it,
2570.46 -> change it to a Default
Credentials Provider Chain.
2574.18 -> I wish somebody told me this up front,
2577.18 -> I wish this was documented,
and I wish the AWS APIs said,
2582.017 -> "Warning, do not use task IAM
role or instance IAM role.
2589.09 -> Instead, use the Default
Credentials Chain,"
2591.61 -> 'cause that's an interesting one.
2593.29 -> That one attempts to use credentials
2595.6 -> from what is available.
2596.62 -> So if the task role is available,
2598.72 -> it tries to use those credentials.
2600.97 -> And then if it isn't, it'll
fall back to your instance.
2608.17 -> And this allows your application
to run off ECS Fargate,
2613.06 -> as well as EC2.
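To make the fallback order concrete, here is a minimal Python sketch. It is a simplification under stated assumptions: the real SDK default chain checks more sources (shared config files, for example), and `likely_credential_source` is a hypothetical helper written only to illustrate the order described above. The environment variable names are the ones the SDKs read; ECS sets the container credentials variable when a task role is attached.

```python
import os


def likely_credential_source(env=None):
    """Guess which source a default credentials chain would pick.

    Hypothetical, simplified helper: environment keys first, then the
    ECS task role endpoint, then the EC2 instance profile.
    """
    env = os.environ if env is None else env

    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    # Set inside the container when the task has a task IAM role.
    if env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI") or env.get(
        "AWS_CONTAINER_CREDENTIALS_FULL_URI"
    ):
        return "task-role"

    # Nothing else found: the chain falls back to instance metadata.
    return "instance-profile"


def make_s3_client():
    """Build a client with no explicit credentials (needs boto3).

    Passing no keys lets the SDK resolve credentials through its default
    chain, so the same code can run on Fargate and on EC2.
    """
    import boto3  # imported lazily so the sketch runs without boto3

    return boto3.client("s3")


if __name__ == "__main__":
    print(likely_credential_source())
```

The design point is that the application never names a credential source itself, so moving it between EC2 and Fargate should not require touching each client.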
2615.82 -> And this is what sets back,
2617.86 -> like a simple thing that had
to move from EC2 to Fargate,
2623.11 -> it set back by three months.
2627.94 -> You can sort of orchestrate
stateful workloads
2634.57 -> with ECS and EC2 with the
state being persistent
2638.9 -> on an EBS volume, with some
hacks that basically pin service
2647.71 -> to a specific place
where you want it to be.
2651.58 -> But on Fargate, you can only
do persistence with EFS.
2657.13 -> There's just no support
2659.92 -> to be able to do persistence with EBS.
2664.9 -> And so these guys move these containers
2668.47 -> over from EC2 to Fargate.
2673.21 -> And had to spend a lot of time
2675.79 -> trying to figure out what's wrong,
2677.11 -> why isn't it working the way it should.
2680.02 -> And turns out, ECS exec is the answer.
2682.6 -> You can go drop in
2684.13 -> even though you don't control
the underlying EC2 instance
2688.84 -> to see what's going on.
2689.71 -> Just like you would if it was EC2,
2692.52 -> we could SSH, SSM and go
Docker Tail Containers
2697.51 -> to see what's going on.
2698.98 -> So ECS exec definitely a win,
2702.04 -> but needs to be enabled up front.
2706.51 -> And when you're running
stateful workloads on ECS,
2710.5 -> it turns out that if you want to change
2715.21 -> the instance type of the
underlying container,
2718.69 -> you have to go destroy,
terminate it, get rid of it,
2723.4 -> and then bring it back to life
2725.71 -> so that ECS can then register it.
2728.38 -> But if you just sort of update it in place,
2732.49 -> shut it down, bring it back up
as a different instance type,
2736.09 -> up or down, then ECS is bluntly
going to refuse to use it.
2742.84 -> And then having run sort of containers
2745.99 -> for the last four years,
2747.28 -> this is independent of ECS,
here's my takeaways.
2752.59 -> In the beginning,
2753.423 -> we almost like built
these custom containers
2757.783 -> because we had to adapt
like upstream containers
2761.02 -> for our use case.
2763.54 -> Over time, containers became mainstream,
2767.77 -> and we realized that
we are doing it wrong.
2770.14 -> There is almost always a way
2772.81 -> to make the container behave, and act,
2775.63 -> and adapt to your
environment from the outside
2780.52 -> or from these sidecar containers
2783.16 -> that contain configuration files
2785.47 -> that can be mounted into an OEM container.
2788.56 -> Thereby, letting you
not have to worry about
2793.42 -> building these containers,
and build pipelines,
2795.28 -> and things like that.
2801.55 -> The other thing is ECS, like I told you,
2804.78 -> we spend a lot of time
going in and figuring out
2808.24 -> why is this container not doing things?
2810.22 -> Like is it really reading my values?
2812.68 -> Does it understand
2813.97 -> all these environment variables or not?
2816.1 -> And turns out, exec is pretty cool.
2817.87 -> It's also cool if the service is breaking
2820.27 -> and you're in the middle
of trying to fix it.
2823.09 -> But the thing here is that it
needs to be enabled up front.
2829.21 -> It's not something you
can turn on on the fly,
2831.31 -> not some checkbox you can run at run time.
2833.77 -> So you have to sort of just say up front,
2837.04 -> in a task definition,
2840.04 -> that you are kind of enabling ECS exec.
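As a hedged sketch of what "enabled up front" can look like: in the ECS API the switch is the `enableExecuteCommand` flag, which to my knowledge is set on the service (or run-task) request when tasks are launched, and the task role also needs permissions for the SSM messages channel. The helper names, the service values, and the subnet ID below are assumptions for illustration.

```python
def service_request(cluster, service_name, task_definition, subnets):
    """Build create_service arguments with ECS Exec switched on.

    Hypothetical helper. Tasks started without enableExecuteCommand
    cannot be exec'd into later; they have to be replaced.
    """
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": 1,
        "launchType": "FARGATE",
        "enableExecuteCommand": True,
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": list(subnets)}
        },
    }


def exec_request(cluster, task_id, container, command="/bin/sh"):
    """Build execute_command arguments for an interactive shell."""
    return {
        "cluster": cluster,
        "task": task_id,
        "container": container,
        "command": command,
        "interactive": True,
    }


if __name__ == "__main__":
    # With boto3 and credentials available, the request could be sent as:
    #   import boto3
    #   boto3.client("ecs").create_service(**request)
    request = service_request(
        "demo-cluster", "demo-service", "demo-task:1", ["subnet-example"]
    )
    print(request["enableExecuteCommand"])
```

In practice an interactive session is usually opened with the AWS CLI (`aws ecs execute-command --cluster ... --task ... --container ... --interactive --command "/bin/sh"`), which relies on the Session Manager plugin being installed.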
2843.52 -> And then there's the
sidecar deployment model.
2845.71 -> So instead of like packing your container
2849.19 -> into one large monolith.
2850.93 -> For example, imagine an application
2854.65 -> and logging agent along with it, right?
2860.71 -> Your application runs and
there is a logging agent
2862.63 -> that ships all of your
logs to some central system
2866.47 -> so that it's there for
analysis, and things like that.
2869.47 -> Instead of putting these two together
2871.75 -> into one monolith sort of container,
2875.05 -> stick by the one application
2876.765 -> per container deployment model.
2879.76 -> And so in our example,
2881.38 -> that works out to two containers in a task,
2883.96 -> the application itself and
the log monitoring agent.
2888.19 -> And then you can do mount point
2890.71 -> that can be shared between the containers,
2892.54 -> thereby enabling the log forwarding agent
2898 -> to read the application logs
2899.71 -> and ship it to your
central monitoring system.
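The two-container task with a shared mount point can be sketched as a task definition. This is a minimal Python illustration, not CleverTap's actual configuration: the image names, paths, sizes, and the helper functions are all assumptions. The field names (`volumes`, `mountPoints`, `sourceVolume`, `containerPath`, `readOnly`) follow the ECS task definition format.

```python
# One task, two containers, one shared volume for the log files.
TASK_DEFINITION = {
    "family": "app-with-log-sidecar",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "256",
    "memory": "512",
    "volumes": [{"name": "app-logs"}],
    "containerDefinitions": [
        {
            # The application writes its log files to the shared volume.
            "name": "app",
            "image": "example/app:latest",
            "essential": True,
            "mountPoints": [
                {"sourceVolume": "app-logs", "containerPath": "/var/log/app"}
            ],
        },
        {
            # The sidecar only reads the logs and ships them elsewhere.
            "name": "log-forwarder",
            "image": "example/log-forwarder:latest",
            "essential": False,
            "mountPoints": [
                {
                    "sourceVolume": "app-logs",
                    "containerPath": "/logs",
                    "readOnly": True,
                }
            ],
        },
    ],
}


def containers_sharing(task_definition, volume_name):
    """Return the names of containers that mount the given volume."""
    return [
        container["name"]
        for container in task_definition["containerDefinitions"]
        if any(
            mount["sourceVolume"] == volume_name
            for mount in container.get("mountPoints", [])
        )
    ]


def undeclared_volumes(task_definition):
    """Return mounted volume names missing from the volumes list."""
    declared = {volume["name"] for volume in task_definition["volumes"]}
    mounted = {
        mount["sourceVolume"]
        for container in task_definition["containerDefinitions"]
        for mount in container.get("mountPoints", [])
    }
    return mounted - declared


if __name__ == "__main__":
    print(containers_sharing(TASK_DEFINITION, "app-logs"))
```

Keeping the forwarder in its own container means the application image stays unchanged when the logging agent is upgraded or swapped.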
2905.02 -> And then having run
ECS for like the last,
2909.28 -> for the four years.
2911.59 -> Here's what we have, or I
have to tell you as takeaways.
2917.858 -> It can be frustrating to debug
2920.14 -> why containers are
behaving in a specific way.
2922.45 -> If it's just in a random restart loop,
2925.33 -> there's no way to tell what's going on.
2928.06 -> So you can use remote logging drivers
2933.16 -> that take your standard out
2935.08 -> and they ship it to a
remote logging system.
2938.29 -> Suddenly, you have visibility
2940.93 -> into what the standard
out of your container is.
2943.51 -> I mean you could do this on EC2,
2947.53 -> but you can't do this, you
can't go docker log on Fargate.
2951.82 -> So specifically, when going to Fargate,
2955.81 -> this is absolutely critical,
saves so much time.
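One common remote logging driver on ECS is `awslogs`, which sends a container's standard output to CloudWatch Logs. The sketch below is a minimal Python illustration of adding it to a container definition; the session does not say which driver CleverTap uses, so the driver choice, the log group name, the region, and the helper `with_awslogs` are assumptions.

```python
import copy


def with_awslogs(container_definition, log_group, region, stream_prefix):
    """Return a copy of a container definition that logs via awslogs.

    Hypothetical helper. The option keys are the ones the awslogs
    driver reads; the original definition is left untouched.
    """
    updated = copy.deepcopy(container_definition)
    updated["logConfiguration"] = {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
        },
    }
    return updated


if __name__ == "__main__":
    app = {"name": "app", "image": "example/app:latest", "essential": True}
    logged = with_awslogs(app, "/ecs/example-app", "us-east-1", "app")
    print(logged["logConfiguration"]["logDriver"])
```

With a configuration like this in place, the output of a container stuck in a restart loop should still be readable after the task is gone, which matters most on Fargate where there is no host to log in to.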
2960.94 -> Then there is all these events going on.
2963.58 -> And depending on how
large your deployment is,
2966.01 -> how many clusters you have,
how many tasks, services,
2969.818 -> there's just this crazy
stuff going on all the time,
2972.55 -> inside stuff.
2973.93 -> Makes sense if you're doing ECS EC2
2978.58 -> to ship your ECS agent logs
2984.34 -> over to a central monitoring system.
2985.69 -> So that when something behaves weirdly,
2988.36 -> you at least have something to look at,
2990.091 -> 3000 feet view, peeking
in to see what's going on.
2994.81 -> And irrespective of ECS,
EC2, or ECS Fargate,
3001.17 -> the cluster events themselves,
3002.82 -> shipping them off to a
central monitoring system
3005.31 -> enables you to look at,
3006.84 -> like a 3000 feet view
into what's going on.
3011.16 -> And then this is my most
favorite recommendation.
3015.3 -> Clusters are cheap,
they're zero maintenance,
3019.77 -> and they cost zero.
3022.5 -> In our case,
3023.492 -> we literally deployed like
one microservice per cluster.
3029.67 -> It's very simple.
3030.78 -> When you have to explain this
3031.83 -> to new people who you've just hired,
3035.31 -> people getting into your team,
and developers, it's simple.
3040.17 -> It runs one service in one cluster.
3046.95 -> Are you on?
3047.783 -> That's all I got.
3049.8 -> - Yeah.
3050.633 -> Before we close, I did want to go through
3053.43 -> some of the key takeaways
from the session so far.