
Top 5 Jenkins Issues And How To Avoid Them
This workshop discusses Jenkins best practices, including the inner workings of the Jenkins application as they relate to JVM administration, and highlights the top 5 issues we see in CloudBees Support, as well as solutions to resolve them. You will walk away prepared to keep Jenkins running smoothly and keep your CI/CD pipeline humming along.
Content
14.719 -> Hi everyone and thanks for joining us
16.24 -> today for the CloudBees Connect virtual
18.48 -> conference!
19.439 -> Today's workshop covers a lot of topics
21.439 -> from JVM administration
23.279 -> to the top five Jenkins issues we see
25.279 -> here at CloudBees support.
26.72 -> So let's get started! My name is Ryan
30.48 -> Smith
30.96 -> and I'm currently the Global Escalation
32.559 -> Manager for CloudBees.
34 -> I've been with the company for a little
35.2 -> over two years. I'm formerly a Senior
37.36 -> Development Support Engineer within our
38.879 -> support
39.36 -> organization, and I'm heavily focused on
41.68 -> maintaining the performance and
43.36 -> stability of Jenkins.
44.96 -> I've worked with enterprise Java
46.239 -> deployment for a little over a decade,
48.32 -> and I'm joined here today by my friend
50 -> and colleague Michelle Fogwell.
51.92 -> Michelle tell us a little bit about
53.039 -> yourself and kick us off.
55.52 -> Hey everyone, I'm Michelle. I am a current
57.76 -> Senior Developer Support Engineer and
59.6 -> I've been at CloudBees
61.199 -> for about three years now. I started my
63.92 -> career as a programmer
65.36 -> but I turned to support because I'm
66.96 -> really passionate about the customer
68.479 -> experience,
69.76 -> and this gives me a chance to see what
71.439 -> issues customers are facing in the real
73.36 -> world.
74.159 -> As we mentioned we both work at
75.52 -> CloudBees and just to tell you a little
77.36 -> bit more about our company,
78.96 -> we offer enterprise CI/CD products and
81.68 -> services.
82.799 -> We have the largest number of Jenkins
84.64 -> certified engineers and are proud to be
86.88 -> the number one contributor to the
88.479 -> Jenkins project.
90.32 -> Next slide. So what are we going to talk
93.52 -> about today
94.32 -> and how is it going to help you? Jenkins
97.2 -> has changed a lot over the years
99.52 -> and that means so have the best
100.88 -> practices. We're going to make sure to
103.119 -> talk about the latest best practices
105.52 -> and we'll cover JVM tuning and
107.439 -> administration.
108.88 -> We're going to talk about the top five
110.479 -> issues that we as a support organization
113.04 -> see our customers struggle with and
115.119 -> we'll go over solutions for those as
116.719 -> well.
117.68 -> Once we've covered that we want to show
119.439 -> you some real world data to back up what
121.28 -> we're saying
121.92 -> and then answer any questions you might
123.439 -> have. We're hoping that this information
126 -> is going to help you with your JVM
127.52 -> administration skills
129.119 -> and make you care a little more about
130.879 -> the inner workings of the JVM
132.72 -> and topics like garbage collection. This
134.8 -> will help you grow as a Jenkins
136.239 -> administrator
137.04 -> so that you and any development teams
138.8 -> you're working with can have a better
140.08 -> Jenkins experience.
141.68 -> Our goal is to help you get the most out
143.36 -> of Jenkins.
145.12 -> Next slide. So let's talk about Jenkins
148 -> and JVM.
149.36 -> Jenkins and JVM are both impressive
151.519 -> pieces of software.
152.879 -> They run pretty well out of the box, and
154.8 -> if that's all you're doing, they're
155.92 -> pretty simple.
157.04 -> The problem comes when you start to need
158.879 -> customizations.
160.56 -> It starts out with a small change but as
162.8 -> your user base grows organically,
164.72 -> so will the level of complexity. With all
167.44 -> the possible customizations we encourage
169.599 -> you to do
170.08 -> research to make sure you're choosing
171.599 -> what's best for your company, or you can
173.44 -> always come to CloudBees to get the
174.64 -> answers.
175.519 -> The CloudBees support team works through
177.44 -> thousands of tickets every month.
179.04 -> It makes us really good at seeing the
180.4 -> big picture and that helps us point you
182.319 -> in the right direction based on your
183.68 -> requirements. We also offer professional
186.48 -> services to help you set up your CloudBees
188.08 -> Jenkins instance, if that's where
189.76 -> you're at
190.64 -> and just to give you a little spoiler,
192.48 -> one of our customers saw a 3,500%
195.36 -> increase in their Jenkins performance by
197.76 -> implementing a lot of things that we're
199.2 -> going to be talking about today.
201.76 -> Next slide, and I promise it's not
204.4 -> shenanigans and Ryan's here to help me
206.239 -> prove it.
210.84 -> Ryan?
212.879 -> Thanks Michelle. I want to kick off by
214.799 -> giving a brief overview of the
216.72 -> inner workings of Jenkins.
221.12 -> We see thousands of support cases a year
223.12 -> at CloudBees and I find myself repeating
225.12 -> this phrase a lot
226 -> on phone calls with clients. Jenkins is a
228.72 -> Java application.
230.239 -> Mainly because system administrators
232.08 -> often forget that with most Java
233.68 -> applications,
234.799 -> there's a layer of JVM administration
237.04 -> that often becomes overlooked
238.72 -> after deployment. Common responsibilities
241.68 -> of the JVM administrator include
243.28 -> defining an initial heap size, setting
245.519 -> JVM arguments,
246.959 -> but it's not necessarily a set it and
249.2 -> forget it role.
250.72 -> As usage of the application increases so
253.2 -> does the need to monitor
254.56 -> and revisit those initial settings.
258.88 -> You're looking at a high level overview
260.72 -> of JVM architecture.
262.639 -> The underlying mechanisms like class
264.88 -> loader, code compiler,
266.32 -> garbage collector, this is all commonly a
268.56 -> black box for most Java developers.
271.04 -> What you really need to grasp here
272.8 -> without understanding all the moving
274.32 -> pieces
274.88 -> is that the JVM has two primary
277.04 -> functions.
278.32 -> First to allow Java to run an
281.28 -> application on
282.08 -> any device or operating system. That's
284.56 -> the Java "write once, run anywhere"
286.72 -> principle. Second to manage and
289.919 -> optimize program memory. So how is memory
293.04 -> allocated?
295.759 -> This illustration sheds some light on
297.759 -> that black box from a memory allocation
300.32 -> standpoint
301.039 -> and uncovers some common misconceptions.
303.68 -> The most common misconception is that
305.52 -> the memory you allocate to heap space is
307.6 -> all the memory that the JVM is going to
309.12 -> consume. That's simply not true.
311.759 -> There's always memory in use by the
313.52 -> JVM outside of the heap space.
316.24 -> Some of you might be familiar with the
317.759 -> concept of PermGen space in Java 7;
320.479 -> in Java 8 it's now referred to as
322.84 -> Metaspace.
324.16 -> Metaspace is memory the JVM uses to
326.24 -> store class metadata.
327.84 -> It's important to understand that
329.28 -> metaspace is unbounded
330.96 -> by default, so if a class leak is present
333.84 -> you can have an out of memory event.
335.759 -> In addition to metaspace, there's also
337.52 -> Java Native memory.
338.96 -> This is where mechanisms like garbage
340.88 -> collection take place,
342 -> as well as code generation, which is
344.32 -> essentially conversion of bytecode to
346.08 -> native code.
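Since Metaspace is unbounded by default and native memory sits outside the heap, it can help to cap and track those regions explicitly. Here is a minimal sketch, assuming a startup script where the Jenkins JVM options are set; the variable name, sizes, and PID are illustrative placeholders, not values prescribed in this talk:

```bash
# Illustrative JVM options: bound the heap and Metaspace so a class leak
# surfaces as a clear OutOfMemoryError rather than unbounded growth, and
# enable native memory tracking to see usage outside the heap.
JENKINS_JAVA_OPTIONS="-Xmx4g -XX:MaxMetaspaceSize=1g -XX:NativeMemoryTracking=summary"

# Ask the running JVM how memory is split across heap, Metaspace, threads,
# GC, and code cache (1234 is a placeholder for the Jenkins process id):
jcmd 1234 VM.native_memory summary
```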
347.36 -> In short there's a lot more going on
349.52 -> under the covers than you might think. I
352.08 -> want to talk just a little bit about one
353.84 -> very important mechanism of the JVM,
356.56 -> the garbage collection. Garbage
358.56 -> collection in many circles is referred
360.639 -> to as voodoo science.
362.4 -> There's a lot of different garbage
363.68 -> collection algorithms out there that do
365.44 -> things just a little differently, but to
367.52 -> break it down to a simplistic overview,
370.08 -> there are essentially three main steps
372.319 -> to garbage collection.
374 -> First, objects get loaded into heap
376.639 -> memory.
377.84 -> During a garbage collection cycle the
380 -> garbage collector
380.96 -> iterates through all those objects and
382.8 -> determines whether or not they're
384.16 -> reachable.
385.36 -> If they're not they are marked for
387.28 -> collection. That's what's called the
388.8 -> marking phase.
390.4 -> Next during the sweep phase, all the
393.199 -> objects that are marked for removal
394.88 -> are removed from memory. Lastly during
397.919 -> the compaction
398.8 -> phase, all those reference objects are
401.28 -> compacted down, which makes new
403.12 -> allocation
404.08 -> faster. It's important to get a base
406.8 -> understanding of how this works because
408.319 -> I'm going to show you later what
409.919 -> effect this mechanism has on the overall
412.319 -> health of Jenkins.
415.28 -> Traditionally, as system admins, we look at
418.319 -> these three primary metrics when
420.24 -> monitoring a Java application.
422.639 -> Primarily how much CPU is being used and how
425.36 -> much memory.
426.639 -> I refer to these things as macro metrics.
429.28 -> The problem with this is that it's
431.12 -> difficult to forecast
432.479 -> problems before they happen. So I'd like
435.199 -> to suggest the concept of monitoring
437.36 -> micro metrics as a good best practice to
439.84 -> employ in your environment.
441.52 -> So what exactly are micro metrics? Well in
444.72 -> short
445.36 -> some examples of micrometrics are object
447.68 -> creation rate,
448.88 -> which is essentially the rate of objects
450.72 -> that are being loaded into memory,
452.96 -> garbage collection latency, which is how
454.96 -> long a garbage collection cycle is
456.639 -> taking to complete.
458.4 -> For applications like Jenkins with low
460.24 -> latency requirements we want to see that
461.84 -> limited to under
462.72 -> one second. Garbage collection throughput,
465.919 -> which is how much time the application
467.44 -> is not spending in garbage collection.
469.36 -> We like to see this as 99%.
472.639 -> The good news is you can gather these
474.56 -> micro metrics by analyzing things like
476.639 -> garbage collection logs,
478.16 -> thread dumps, and heap dumps. Garbage
480.639 -> collection logging can be enabled by
482.4 -> defining startup parameters for the JVM;
485.039 -> thread dumps and heap dumps can be
486.639 -> collected on the fly by issuing commands
488.8 -> to the JVM.
490.16 -> using tools like jcmd and jmap, which are
493.52 -> provided as part of the Java Development
495.12 -> Kit,
495.84 -> or JDK. This is one reason we recommend
498.96 -> using a JDK instead of a JRE as one of
501.44 -> our best practices.
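As a concrete illustration of gathering those micro metrics, here is a hedged sketch of enabling GC logging at startup and capturing thread and heap dumps with the JDK tools just mentioned. The variable name, file paths, and PID are assumptions, and the logging flags differ between Java 8 and Java 9+ as noted in the comments:

```bash
# GC logging at startup (Java 8 style flags):
JENKINS_JAVA_OPTIONS="-Xloggc:/var/log/jenkins/gc-%t.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M"

# On Java 9+ unified logging replaces the flags above, e.g.:
#   -Xlog:gc*:file=/var/log/jenkins/gc-%t.log:time,uptime:filecount=5,filesize=20M

# Thread and heap dumps on the fly (1234 is a placeholder for the Jenkins PID):
jcmd 1234 Thread.print > /tmp/jenkins-threads.txt
jcmd 1234 GC.heap_dump /tmp/jenkins-heap.hprof
# jmap is an alternative way to take the heap dump:
jmap -dump:live,format=b,file=/tmp/jenkins-heap.hprof 1234
```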
502.96 -> While there are many tools out there for
504.72 -> analyzing this data we recommend
507.039 -> GCeasy, fastThread, and HeapHero.
510.08 -> All of these online tools are free as
513.279 -> long as you're not an enterprise client,
514.88 -> and allow you to upload your logs
516.719 -> directly to their site for analysis.
518.8 -> Here at CloudBees we use an enterprise
520.56 -> version of these tools, which allows us
522.159 -> to keep the data in-house and secure.
525.6 -> Here's a quick example showing the
528.32 -> GCeasy
528.959 -> interface which allows you to upload
530.64 -> your garbage collection logs directly to
532.48 -> their online analyzer.
535.68 -> Remember we talked about how garbage
537.6 -> collection works with the three cycles
539.68 -> marking, sweeping, and compaction? Well
542.16 -> let's turn this graph on its side for an
544.48 -> example.
546.959 -> Note the pattern that this creates when
549.12 -> objects are loaded into memory
550.8 -> and then subsequently get removed. This
553.36 -> might look familiar to you if you've
554.959 -> ever analyzed a garbage collection log
556.8 -> before.
557.68 -> We call this the sawtooth pattern.
561.2 -> The sawtooth pattern is an indicator of
563.68 -> a healthy application.
565.36 -> This means that you have allocated
567.2 -> enough memory to the JVM
568.959 -> so that it has plenty of room for
570.72 -> objects to be loaded,
572.24 -> removed, and the space compacted to make
574.88 -> room for new objects.
578.32 -> In an unhealthy garbage collection log
580.72 -> you might see a pattern that looks
582 -> something like this.
583.6 -> Notice that pretty sawtooth pattern
585.6 -> disappears and there's
586.8 -> now evidence of a memory leak. So why is this
589.76 -> important?
590.56 -> Why should you as a Jenkins
591.839 -> administrator care about this?
593.68 -> Well in short, paying attention to these
596.48 -> micro metrics like garbage collection
598.399 -> logging
599.2 -> will allow you to forecast problems
601.279 -> before they happen.
603.2 -> Here's an example showing how over time
606.399 -> I can see early indications of a problem
608.72 -> before an
609.44 -> out-of-memory event occurs, that takes
611.12 -> down my application,
612.64 -> and ultimately affects my end users. In
615.279 -> this case
615.92 -> I can see the application is suffering
617.68 -> from what appears to be a memory leak
619.76 -> and therefore I could take steps to
621.6 -> notify my users of an upcoming
623.6 -> maintenance window
624.72 -> where I could capture heap dumps to
626.24 -> further analyze the situation.
629.6 -> I had mentioned fast thread earlier as
631.36 -> an option to analyze thread dumps.
633.839 -> Thread dumps can be taken using JDK
635.92 -> tools like jstack,
637.279 -> and once you've captured them, you can
638.959 -> upload the output to the analyzer.
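If you want a repeatable way to do that, here is a small sketch using jstack. Capturing a few dumps a few seconds apart is a common diagnostic habit rather than something prescribed in this talk; the PID and paths are placeholders:

```bash
# Capture three thread dumps, five seconds apart, for later upload to an analyzer.
PID=1234   # placeholder for the Jenkins master's Java process id
for i in 1 2 3; do
  jstack -l "$PID" > "/tmp/threaddump-$(date +%H%M%S).txt"
  sleep 5
done
```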
641.44 -> Anytime you need to understand what
642.959 -> Jenkins is doing at a certain moment,
644.72 -> you can capture a thread dump. A thread
647.519 -> dump analysis looks something like this.
650.32 -> Note the micro metrics we can gather
652.32 -> from this output like total thread count.
655.2 -> In a healthy Jenkins instance we see
657.04 -> thread counts hover anywhere from 200 to
659.519 -> 300.
660.64 -> In this example we see a very high
662.64 -> thread count which indicates we may need
664.8 -> to look deeper into the individual
666.24 -> threads
666.959 -> to understand what exactly is going on.
670.24 -> In this example I was diagnosing a slow
672.88 -> running instance
673.68 -> which was suffering from a performance
675.279 -> issue. Here we see that there are
677.279 -> 54 blocked threads related to the weather
679.68 -> column.
680.72 -> While the weather column is a nice
681.92 -> feature to have in Jenkins on smaller
683.68 -> instances,
684.64 -> we don't recommend running it on larger
686.399 -> masters that may have thousands of jobs
688.32 -> running.
689.04 -> The key here is that I would not have
690.88 -> known the weather column was the root
692.399 -> cause of the reported performance issue
694.64 -> unless I had captured that thread dump
696.48 -> as a way to investigate it.
700 -> I'd like to turn it over to Michelle to
701.519 -> talk about some of the challenges we
703.04 -> often face with Jenkins administration.
708 -> So this video is a great way to
709.92 -> illustrate what happens with Jenkins.
712.32 -> You create a really solid instance
714.399 -> similar to this tire swing
716.16 -> and you start using it. The tire swing is
718.56 -> fun and so your friends want to jump on,
720.8 -> just like when your Jenkins is running
722.72 -> smoothly other teams want to jump aboard.
726.079 -> Nothing bad is going to happen at first
728.639 -> but the more people that you add on
730.399 -> the more you start to feel your instance
732.639 -> slowing down.
734.399 -> If you continue to allow anyone to jump
736.72 -> on board
737.92 -> without making the necessary adjustments,
740.48 -> then you're going to end up holding your
741.76 -> breath,
742.639 -> as it gets slower and slower just
744.399 -> waiting for something to go wrong.
746.48 -> After enough teams jump aboard, you'll
749.279 -> see
749.6 -> that eventually something will go wrong.
754.48 -> When talking about the main issues we
756.399 -> see our customers
757.6 -> have, we can break it down into two
759.839 -> categories.
761.04 -> People and infrastructure. First JVM
764.32 -> skills are often
765.519 -> not a focus when looking at a Jenkins
767.44 -> admin. We find that customers use
769.76 -> out of date best practices and aren't
772.399 -> staying current on their knowledge.
774.32 -> We also see that growth within Jenkins
776.399 -> happens organically.
777.92 -> One team has success with Jenkins and
779.839 -> the other teams want to do it as well,
781.44 -> similar to the tire swing. This is
783.68 -> something a lot of people don't account
785.68 -> for
786.16 -> and becomes an issue, if you aren't
787.6 -> adjusting as you need.
789.36 -> The other thing we see from the people
790.88 -> side is that there can be a lot of
792.32 -> different sources
793.6 -> causing strain and it's hard to keep
795.36 -> track of who's doing what.
797.44 -> The other side of Jenkins admin is the
799.2 -> actual infrastructure that you've set up.
801.68 -> We see a number of disk performance
803.519 -> issues due to a lack of optimization,
806.24 -> we see poorly managed memory and garbage
808.48 -> collection settings,
809.6 -> and these things are often overlooked
811.44 -> when talking about how to have a healthy
813.2 -> Jenkins system.
814.8 -> One thing people don't always realize is
816.72 -> that calling a rest API over and over
818.88 -> can put a strain on the system as well,
821.44 -> and the last item worth mentioning is
823.6 -> that third party plugins aren't always
825.279 -> your friend.
826.079 -> They can add some cool features but not
828 -> all of them are worth the strain they're
829.519 -> going to put on the system.
831.279 -> If you can focus on a few of these
832.959 -> things, you're going to save yourself a
834.48 -> lot of time in the long run.
838.399 -> This is a list for you to look back on
840.24 -> that outlines configuration
842 -> best practices, and I want to touch on a
844.16 -> few of these.
845.36 -> First rotating your build history. Make
847.92 -> sure you're choosing to keep
849.279 -> only the important builds. You don't need
851.6 -> to keep all
852.32 -> information forever. You also want to
855.04 -> make sure
855.76 -> you use a JDK and not a JRE
859.199 -> and NFS versions 3.0 and 4.1
863.6 -> are the recommended versions for
865.04 -> performance. G1
867.279 -> is the garbage collector that we
868.959 -> recommend and when you enable logging, it
871.68 -> can help you know when you need to grow.
874.56 -> You want to explicitly set the heap size
876.88 -> because you need to know
878.079 -> how much your app requires and
880.56 -> monitoring the macro
882 -> and micro metrics like Ryan had
883.839 -> discussed, are really important.
886.48 -> You can also install the Support Core
888.32 -> plugin which allows you to easily
890.32 -> collect these logs.
892.079 -> The last one is a link to the JVM
894.639 -> recommended settings for Jenkins, so I
896.48 -> highly recommend you read over that.
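As a rough illustration of a few items on that list, the sketch below shows an explicitly sized heap and the G1 collector being set as JVM options. The variable name and sizes are assumptions for illustration; size the heap for your own workload, keep it at or below the 16 gig ceiling discussed next, and treat the linked JVM recommended settings as the authoritative reference:

```bash
# Illustrative baseline: explicit, equal initial/max heap plus the G1 collector.
JENKINS_JAVA_OPTIONS="-Xms4g -Xmx4g -XX:+UseG1GC"
```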
900.399 -> Now we're going to talk a
901.839 -> little about scaling horizontally.
904.32 -> If you're not familiar with the term
905.92 -> scaling horizontally,
907.519 -> we mean adding additional resources or
909.839 -> masters to allow for the growing user
912.079 -> base
912.8 -> to continue to use Jenkins as smoothly
915.12 -> as you currently have it.
916.8 -> So how do you know when it's time to
918.079 -> scale? As with most questions
920.72 -> it depends. It mostly has to do with your
923.6 -> Jenkins usage
924.8 -> and growth rate. You want to stay ahead
927.12 -> of the need but you also don't want to
928.72 -> be wasting money with unnecessary
930.639 -> resources.
932.16 -> Jenkins is memory and I/O bound, which
934.8 -> means you'll want to monitor how the
936.639 -> system is doing in those areas.
938 -> We highly recommend that you set this up.
941.519 -> That gives you a really good idea of
942.959 -> when it's time to scale
944.639 -> and you want to avoid huge masters. If
947.199 -> you have thousands of jobs it's time to
949.199 -> split up your masters
950.32 -> into multiple masters. When splitting up
953.12 -> your masters you want to make sure
954.639 -> you're segmenting them
955.839 -> logically. A lot of our customers do this
958.24 -> by teams
959.04 -> and we actually do recommend one team
962 -> per master.
963.68 -> The only thing that doesn't really
964.959 -> depend is that you cannot use more than
967.12 -> a 16 gig heap.
968.88 -> The old-school way of thinking about
970.56 -> Java development is no longer the best
972.56 -> way.
974.24 -> It was common knowledge that the JVM
976.48 -> needed CPU,
977.839 -> RAM, and disk space. When the JVM ran out
981.04 -> of memory you added more.
983.279 -> If you ran out of more to add then you
985.519 -> had to convince your boss
987.279 -> that you needed to buy more RAM. That was
989.839 -> common practice.
991.279 -> The problem is that with larger JVM heap
994.079 -> sizes,
995.12 -> you get larger garbage collection cycles.
998.079 -> Garbage collection cycles are stop the
1000 -> world events.
1001.199 -> This means that no other threads are
1003.04 -> going to execute while garbage
1004.56 -> collection is taking place.
1006.8 -> We've found that in order to keep your
1008.32 -> garbage collection cycles under
1010 -> one second, a 16 gig limitation of heap
1013.199 -> size
1013.68 -> should be observed.
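A lightweight way to watch for long or frequent collections between full log reviews is jstat, which also ships with the JDK. A minimal sketch, with the PID and interval as placeholders:

```bash
# Print heap occupancy percentages plus young/full GC counts and accumulated
# GC time every 5 seconds; sustained growth in old gen occupancy or GC time
# is an early warning sign that pauses are creeping past the one-second goal.
jstat -gcutil 1234 5000
```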
1017.04 -> And now we're going to go into the
1018.32 -> countdown.
1020.16 -> So these are the top five issues
1023.6 -> that we
1024 -> see in Jenkins, but I first want to touch
1026.079 -> on how we came up with these.
1027.919 -> So I already mentioned that Ryan and I's
1029.919 -> team deals with thousands of cases
1031.919 -> and we have access to a lot of diverse
1033.76 -> Jenkins users.
1035.12 -> So we polled Jenkins engineers from
1037.199 -> different industries
1038.24 -> and company sizes and these were the
1040.48 -> recurring themes.
1041.679 -> So let's start the countdown.
1047.6 -> Thanks Michelle! To start off with number
1050.48 -> five, the fifth
1051.44 -> most common issue we run into
1053.2 -> within CloudBees support organization
1055.44 -> is quite simply a lack of architectural
1057.6 -> understanding of what
1058.72 -> Jenkins is and how it works. Not only is
1061.919 -> Jenkins a powerhouse within your
1063.44 -> software development
1064.4 -> delivery management system, like I said
1067.12 -> earlier Jenkins is a Java application. So
1069.44 -> that means
1070 -> it requires JVM administration. In
1072.799 -> addition
1073.52 -> Jenkins doesn't have a traditional
1074.88 -> database associated with it.
1076.799 -> So it relies on its home directory,
1082.64 -> making I/O throughput an important part
1085.039 -> of the overall
1085.84 -> infrastructure. We also find that many
1088.48 -> times
1089.039 -> admins don't know where to go when they
1090.799 -> have a question. Some important parts of
1092.96 -> the infrastructure include networking,
1094.64 -> storage, Linux administration, and Kubernetes
1097.28 -> administration, to name a few.
1099.36 -> As a solution we recommend identifying
1101.84 -> subject matter experts in these areas,
1104 -> making sure that you have thoroughly
1105.76 -> documented your infrastructure,
1107.52 -> and understand where Jenkins fits in as
1110.48 -> a critical piece of your CI/CD pipeline.
1113.44 -> Lastly ensuring that you have a playbook
1115.44 -> for when things go awry
1116.96 -> and you know who to call. Michelle, take
1119.44 -> us to number four.
1122.88 -> All right we often have people reach out
1125.2 -> saying that their Jenkins is broken,
1127.2 -> while also telling us that nothing has
1129.36 -> changed.
1130.48 -> As far as they know nothing has changed
1132.72 -> but Jenkins has
1133.76 -> so many moving pieces that this change
1136.32 -> can be happening in
1137.52 -> a number of places. One reason
1140.4 -> this can happen
1141.28 -> is miscommunication between teams. This
1144.32 -> can make it appear
1145.36 -> that an error came out of nowhere. We see
1148.16 -> this when a
1148.96 -> different team updates a plugin without
1151.52 -> confirming with anyone,
1153.28 -> we also see pipelines being changed
1155.44 -> without proper notification to any
1157.36 -> affected teams.
1158.799 -> This is one of the reasons we recommend
1160.799 -> putting your pipeline code into your
1162.48 -> source code manager as a best practice.
1165.36 -> You could have network or firewall
1167.039 -> changes that break your Jenkins
1168.88 -> and these can be difficult to find.
1171.039 -> Someone could do a security upgrade and
1172.96 -> if you patch the kernel without testing
1174.799 -> with your version of Jenkins,
1176.48 -> that could be another thing that causes
1178.16 -> error without you realizing.
1180.559 -> All of these reasons are some type of
1182.48 -> poor communication between the teams at
1184.48 -> a company,
1185.52 -> so what can you do? We recommend taking a
1188.64 -> scientific approach,
1190.08 -> and what I mean by that is keeping
1192 -> everything consistent
1193.679 -> and only making one change at a time.
1196.32 -> This allows you to know where the error
1198.16 -> is coming from.
1199.44 -> It's also a great idea to keep a log of
1201.6 -> the changes so you can see what's been
1203.36 -> adjusted recently.
1204.96 -> You want to increase the communication
1206.72 -> between teams
1208 -> and ensure that any architectural
1209.84 -> changes are communicated ahead of
1212.159 -> time
1212.64 -> so other teams can be prepared. There are
1215.2 -> plugins you can use to track
1216.72 -> change logs and at the bottom of this
1219.2 -> we've included
1220.159 -> two links to our recommended plugins, the
1222.32 -> job config history plugin
1224.4 -> and the audit trail plug-in. On to number
1226.72 -> three.
1228.88 -> Thanks Michelle! So number three is gonna
1230.88 -> be having inconsistent backups and we
1234.4 -> hear
1235.12 -> a lot that quote "we don't have a backup".
1238.159 -> Surprisingly, one of the top
1239.919 -> issues we deal with is quite frankly a
1241.84 -> lack
1242.24 -> thereof. Most admins have
1244.72 -> deployed Jenkins,
1246.08 -> and set up backups or at least they
1248.159 -> think they did,
1249.12 -> and they never tested it, and it's a
1251.28 -> really tough position to be in when
1252.88 -> catastrophe hits the data center and you
1254.88 -> find out the backups you've been running
1256.48 -> smoothly for the last three years appear
1258.08 -> to be
1258.48 -> zero kilobytes in size. Believe me, I've
1261.44 -> been there
1261.76 -> and it's no fun. Jenkins can be easily
1264.799 -> configured to perform backups through
1266.48 -> the backup plugin,
1267.76 -> but the more important piece here is to
1270 -> understand the ceremony of how to
1272 -> perform a backup and restore
1273.76 -> of the crucial data Jenkins relies on,
1276.32 -> which is located in the Jenkins home
1278 -> location.
1279.44 -> Again practicing that strategy and
1281.52 -> checking on your backups frequently to
1283.28 -> ensure the data is sound,
1284.88 -> is an excellent best practice. Remember
1287.44 -> that there isn't a database associated with Jenkins;
1290.96 -> the Jenkins home location is the data
1294.08 -> point of record.
1295.52 -> This will keep you from having tough
1297.039 -> conversations about not having backups.
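For teams scripting their own backups outside the plugin, here is a minimal sketch of archiving the Jenkins home location with a basic sanity check, so a zero-kilobyte surprise gets caught early. Paths, exclusions, and the size threshold are assumptions; adjust for your instance, and still practice an actual restore periodically:

```bash
JENKINS_HOME=/var/lib/jenkins        # placeholder path
BACKUP_DIR=/backups/jenkins          # placeholder path
STAMP=$(date +%Y%m%d-%H%M%S)

# Archive the home directory; workspaces and caches are rebuildable, so skip them.
tar --exclude="$JENKINS_HOME/workspace" \
    --exclude="$JENKINS_HOME/caches" \
    -czf "$BACKUP_DIR/jenkins-home-$STAMP.tar.gz" "$JENKINS_HOME"

# Verify the archive is readable and not suspiciously small before trusting it.
if tar -tzf "$BACKUP_DIR/jenkins-home-$STAMP.tar.gz" > /dev/null \
   && [ "$(stat -c%s "$BACKUP_DIR/jenkins-home-$STAMP.tar.gz")" -gt 1048576 ]; then
  echo "Backup looks sane"
else
  echo "Backup FAILED verification" >&2
fi
```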
1300.08 -> Take us on to number two, Michelle. So we
1304.4 -> see a lot of teams who use monolithic
1306.799 -> masters.
1308.08 -> This means they keep piling all their
1310.159 -> jobs onto one master
1311.919 -> and that is what we call a Jenkinstein.
1314.48 -> There are a few ways
1315.52 -> you can tell if you have a Jenkinstein,
1318 -> if you have multiple teams
1319.84 -> using the same server, or a misuse of
1322.08 -> resources,
1322.88 -> you might have a Jenkinstein. As we
1325.12 -> talked about earlier
1326.48 -> choosing to follow the old-school way of
1328.64 -> thinking about Java development
1330.88 -> means you probably have a Jenkinstein.
1333.36 -> Quick recap:
1334.559 -> Larger JVM heap sizes mean larger
1336.96 -> garbage collection cycles,
1338.48 -> which brings your Jenkins instance to a
1341.039 -> standstill.
1342.24 -> You want to focus on scaling out to
1344.24 -> avoid this.
1345.6 -> With monolithic masters you have your
1347.679 -> cowboys and cowgirls going around the
1349.6 -> wild west
1350.64 -> downloading their favorite Chuck Norris
1352.4 -> plug-in without thinking about the
1354.159 -> ramifications for the master as a whole.
1357.2 -> It ends up coming down to there being just too
1358.96 -> many cooks in the kitchen.
1360.72 -> When you scale horizontally, you can
1362.64 -> allow people the freedom to mess things
1364.559 -> up
1365.039 -> without a large blast radius when
1367.039 -> Jenkins goes down.
1368.799 -> Having zero downtime isn't a realistic
1371.2 -> expectation,
1372.48 -> so the next best thing would be to limit
1374.48 -> who's affected when one master does go
1376.559 -> down.
1377.6 -> There's a great link here about moving
1379.6 -> to the new way of thinking about
1381.2 -> JVM administration, which I highly
1383.28 -> recommend;
1384.4 -> it includes a lot of best practices.
1387.44 -> If you take nothing else away from this,
1389.52 -> remember to scale
1390.72 -> out and no more than 16 gig heap. Ryan
1394.08 -> bring us home!
1395.36 -> Thanks Michelle! The number one issue we
1397.52 -> see in the support organization
1399.039 -> is that Jenkins becomes slow or
1401.12 -> unresponsive
1402.559 -> and this is the whole reason I walked
1404 -> you through all that JVM and garbage
1405.76 -> collection stuff earlier.
1410.72 -> When we on the performance and stability
1412.88 -> team hear that your Jenkins is running
1414.32 -> slow,
1415.12 -> the first place I go to is the garbage
1417.12 -> collection logs.
1418.48 -> Nine times out of ten with a little
1420.159 -> prescriptive tuning of JVM parameters, we
1422.559 -> can resolve that slowness.
1424.64 -> Most of the time the root cause ends up
1426.32 -> being that the JVM administrator has
1428.4 -> simply neglected the JVM since it first
1431.2 -> got deployed.
1432.48 -> Now the application usage has increased
1434.32 -> to a point where it requires more
1435.679 -> resources
1436.48 -> than you initially gave it. While garbage
1439.52 -> collection is
1440.159 -> often the issue, some other commonalities
1442.48 -> we see on the front lines of support
1444.4 -> tend to be around directory services
1446.24 -> configuration, or maybe running a tier
1448.64 -> three plug-in that has some poorly
1450.159 -> written code.
1451.44 -> Nothing against the Chuck Norris plug-in
1453.36 -> which tells you a Chuck Norris joke on
1455.12 -> demand,
1455.919 -> but it's probably not something I
1457.52 -> recommend running in production.
1459.52 -> The solutions to these issues are simply
1461.6 -> to follow our best practices around JVM
1463.76 -> administration,
1464.96 -> pipeline construction, and to keep an eye
1467.44 -> on those macro
1468.48 -> and micro metrics we talked about
1470.159 -> earlier. By doing these things you've got
1472.4 -> a leg up
1473.039 -> in ensuring you're running a stable and
1475.039 -> performant Jenkins.
1476.96 -> I want to change gears here and show you
1478.72 -> some real world data examples where we
1480.799 -> were able to resolve performance issues
1483.039 -> with some of the best practices that we
1484.64 -> talked about here today.
1491.76 -> The scenario here is that it was
1493.76 -> reported that users were waiting
1495.52 -> several minutes to log into the
1496.96 -> application and UI navigation
1500.24 -> was slow. Now when we look at the
1502.96 -> garbage collection data here, we can see
1505.039 -> the following
1506.08 -> KPIs. Notice that the throughput
1509.52 -> is at 92%. This means that about eight
1513.919 -> percent of the time
1515.12 -> the application is waiting for garbage
1517.12 -> collection.
1518.4 -> Remember earlier I had said we want to
1520.24 -> see this number at or above
1521.44 -> 99%, remember also that
1524.799 -> garbage collection events are stop the
1526.559 -> world events like Michelle
1528.08 -> noted. So during this time no other
1530.48 -> threads are moving
1531.44 -> including login and http requests.
1534.559 -> Ultimately this is causing a bottleneck
1536.48 -> that will eventually render the
1537.76 -> application unusable.
1539.76 -> Also take a look at the max pause GC
1542.48 -> time.
1543.52 -> Do you see that there's 20 second wait
1545.6 -> times there?
1546.96 -> This explains the UI slowness being reported,
1550.159 -> as well as the thread bottlenecks it's
1551.919 -> causing, because
1553.279 -> Jenkins is a Java application that
1554.96 -> requires low latency.
1556.72 -> If I'm waiting on a GC cycle that's
1559.12 -> taking 20 seconds,
1560.72 -> that means that the UI is essentially
1562.64 -> unavailable for me to log into.
1566.4 -> When we dig into this further we notice
1569.44 -> what JVM arguments are in place and we
1572.4 -> found
1572.799 -> several arguments that were forcing the
1574.64 -> G1 GC
1575.84 -> algorithm to work overtime to keep up
1578.48 -> with the constraints of the argument
1579.919 -> limitations.
1581.279 -> Some of the examples of these unwanted
1583.36 -> arguments are listed here.
1585.6 -> Take note that these are very explicit
1589.279 -> arguments that are setting
1591.12 -> values of percentages and
1594.24 -> sizes. It's an
1598.159 -> old-school way of thinking to do this
1599.919 -> for fine-tuning the JVM, in fact I'd call
1602.24 -> this overtuning the JVM.
1604.64 -> As the JDK has matured, there's a more
1607.76 -> keep it
1608.32 -> simple methodology that comes with it,
1610.24 -> that states
1611.76 -> allow the JDK to do what it's intended
1614.24 -> to do
1615.2 -> and stop throwing a wrench into it, when
1617.36 -> it's trying to do what it's
1618.559 -> intended to do. For what it's worth I
1621.12 -> recommended these arguments
1622.799 -> about a year ago, and that just shows you
1625.36 -> that there
1626.24 -> was a time, inside of Java 8, where
1629.919 -> recommending these arguments was
1631.44 -> necessary and
1633.52 -> like I said as the JDK has matured these
1636.559 -> fine-tuning arguments are no longer
1638.84 -> necessary.
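The slide lists the customer's actual flags; the sketch below is only representative of the kind of explicit G1 percentage and size overrides being described, not the exact arguments removed:

```bash
# Before: explicit overrides that force G1 to fight its own heuristics
# (representative examples only, not the customer's actual arguments).
#   -XX:+UseG1GC -Xms16g -Xmx16g
#   -XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=40
#   -XX:InitiatingHeapOccupancyPercent=30
#   -XX:ConcGCThreads=8 -XX:G1HeapRegionSize=32m
#
# After: let the collector do what it's intended to do.
#   -XX:+UseG1GC -Xms16g -Xmx16g
```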
1640.559 -> When we remove those arguments, watch
1643.12 -> what happens.
1644.399 -> The throughput goes to 99%
1647.52 -> and take a look at the number of garbage
1649.52 -> collections that are actually
1651.36 -> taking place here.
1652.88 -> When we compare the data from what we
1654.64 -> were just looking at,
1658.159 -> we saw 41,000 garbage collection cycles
1661.279 -> over a 48-hour period,
1665.039 -> compare that to what we're seeing now
1666.799 -> and we're only seeing 2,800.
1668.88 -> Note that max pause GC time actually went
1671.44 -> down as well, from 20 seconds
1673.36 -> to one spike of a three second interval.
1676.559 -> On this master that we're working
1678.64 -> on at an enterprise financial customer
1681.279 -> with thousands of jobs, this was a huge
1684.159 -> improvement.
1687.52 -> Here's another example of real world
1689.2 -> data from a big shipping company that we
1691.44 -> worked with.
1692.399 -> The scenario here was that it was
1694.48 -> reported that HA
1695.679 -> failovers were occurring daily leading
1697.6 -> to multiple production outages and
1699.279 -> downtime for Jenkins users.
1701.2 -> Now looking at this garbage collection data,
1703.76 -> the following was observed.
1706.159 -> Note that on the left hand side the time
1708.64 -> is in seconds and we see
1710.159 -> several GC pauses that are well above 20
1713.52 -> seconds.
1714.64 -> Well if my HA failover timeout is less
1717.679 -> than 20 seconds,
1719.279 -> then this means that a pause like that is going to
1720.96 -> initiate an HA failover.
1724.24 -> When we take a look at some of those JVM
1726.08 -> arguments that were in there, once again
1727.52 -> we see old arguments that
1729.039 -> at one point in time we recommended.
1732.64 -> Also you'll note that in this
1736 -> chart you'll see that System.gc() method
1738.32 -> calls were taking place, which are often
1740.64 -> found in
1741.44 -> third tier plugins, as there are no
1743.919 -> System.gc() method calls
1745.84 -> inside the Jenkins code base. This method
1748.32 -> call essentially calls the garbage
1750.159 -> collector
1750.88 -> ad hoc and is throwing a wrench into its
1753.919 -> natural cycle of taking a garbage
1755.76 -> collection.
1756.88 -> What we did here was not only remove the
1758.799 -> unwanted JVM arguments
1760.559 -> but we also disabled the explicit GC
1763.279 -> method calls using an additional JVM
1765.2 -> argument.
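The talk doesn't name the exact flag, but the standard HotSpot argument for this is -XX:+DisableExplicitGC, which makes the JVM ignore System.gc() calls; presumably something along these lines is what's meant:

```bash
# Append the flag to the existing (illustrative) option set so explicit
# System.gc() calls from plugins no longer trigger ad hoc collections.
JENKINS_JAVA_OPTIONS="$JENKINS_JAVA_OPTIONS -XX:+DisableExplicitGC"
```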
1768.96 -> What we have here is a 3500 percent
1771.44 -> performance increase when we did that.
1773.679 -> We went from 12-to-23-second GC pauses to an
1776.24 -> average
1777.36 -> GC pause time of 660 milliseconds.
1781.279 -> On an absolutely monolithic master at
1784.799 -> this big shipping company, the
1787.84 -> end users were absolutely raving that
1789.919 -> instead of having to wait for their
1791.279 -> application
1792.08 -> to log in for 20 seconds and walk away
1794.799 -> and get a coffee,
1796 -> now they're coming and logging in and
1798.32 -> it's taking less than half a second.
1800.24 -> This was a huge improvement.
1805.679 -> So I wanted to wrap up by telling you a
1807.6 -> little bit more information
1809.039 -> about CloudBees and how we help
1811.2 -> customers get the most out of Jenkins.
1815.2 -> As a CloudBees customer you're entitled
1817.279 -> to CloudBees support.
1818.559 -> As we mentioned Ryan and I both work on
1820.72 -> the support team and work with a
1822.559 -> fantastic team of Jenkins certified
1824.559 -> engineers.
1825.52 -> We're located all around the world
1827.279 -> allowing us to provide 24/7 support.
1830.08 -> We work with thousands of customers
1831.919 -> allowing us to see that bigger picture
1833.84 -> and help guide you to be successful
1836.08 -> with your Jenkins.
1837.44 -> Our customers have dedicated success
1839.36 -> managers, who will help you drive your CD
1841.76 -> plan forward.
1842.799 -> We also offer online training which you
1845.2 -> get free as a customer
1846.72 -> and our support team assists with
1848.84 -> upgrades.
1850.399 -> I wanted to highlight a little about our
1852.399 -> training.
1853.44 -> Customers have access to the online
1855.36 -> training portal where you can see
1857.44 -> all of our courses. We have courses
1859.679 -> focused on pipelines,
1861.279 -> pipeline advanced, Jenkins
1863.84 -> administration,
1865.039 -> and if you're interested in getting your
1866.559 -> Jenkins certification, we even have a
1868.32 -> study course for that.
1869.919 -> We offer a knowledge base with
1871.44 -> documentation about updated industry
1873.84 -> best practices
1875.12 -> and we have a list of plugins certified
1877.36 -> by CloudBees
1878.399 -> so you have the confidence that you're
1879.919 -> using plugins which have already been
1881.84 -> tested by a Jenkins engineer.
1884.88 -> Ryan do you want to talk about assisted
1886.32 -> updates? Yeah I just want to bring up
1888.88 -> that some of the best feedback we've had
1891.519 -> from our current CloudBees customer
1893.279 -> set, has been around the assisted update
1894.72 -> program.
1895.519 -> If you've been through a Jenkins update
1897.2 -> in the past you know it can cause some
1899.12 -> anxiety with plug-in dependencies
1901.36 -> and making sure backups are in place, and
1903.519 -> verifying build jobs work after the
1905.36 -> update.
1906.08 -> There are also syntax changes that come between
1907.84 -> versions. With the assisted
1909.6 -> update program that you get as a
1912.159 -> CloudBees customer,
1913.44 -> the support team here works with you
1915.12 -> proactively to create a plan of action
1917.519 -> that makes your update run as smoothly as
1918.88 -> possible,
1919.679 -> and if you're a platinum subscriber, we
1921.44 -> offer live assistance during your
1922.96 -> scheduled update.
1927.84 -> I also want to tell you about the
1929.919 -> Jenkins health advisor.
1931.919 -> Whether you are a CloudBees customer or
1934.08 -> an open source user, if you haven't heard
1936.24 -> about Jenkins health advisor yet
1938 -> I strongly recommend that after this
1940.24 -> presentation you go and download
1941.919 -> and install this plugin. Our support team
1944.72 -> is focused around automation
1946.799 -> and making your life easier, and as a
1948.48 -> part of that we wanted to share advisor
1950.88 -> with you. Advisor automatically analyzes
1953.84 -> your Jenkins environment
1955.12 -> and provides you proactive reporting on
1957.519 -> potential issues before they get out of
1959.36 -> hand.
1960.08 -> It emails you with solutions to
1962.08 -> discovered issues
1963.44 -> so you can prioritize them accordingly.
1969.44 -> So if you're ready to get the most out
1970.799 -> of Jenkins contact [email protected]
1972.399 -> to see how we can help you.
1974.72 -> We're looking forward to speaking with
1976.559 -> you and I've also
1978.159 -> included some of the resources list that
1980.32 -> we've talked about here today. I
1981.76 -> encourage you to click through that,
1983.6 -> and finally we're going to go ahead and
1985.519 -> take some live Q&A
1986.799 -> with myself and Michelle. Thank you all
1989.2 -> for joining the conference and I hope
1990.64 -> you learned something today.
1992.48 -> Thanks everyone!
Source: https://www.youtube.com/watch?v=govN7rXOmpc