
Top 5 Jenkins Issues And How To Avoid Them
This workshop discusses Jenkins best practices, including the inner workings of the Jenkins application as they relate to JVM administration, and highlights the top 5 issues we see in CloudBees Support, as well as solutions to resolve them. You will walk away prepared to keep Jenkins running smoothly and keep your CI/CD pipeline humming along.
Content
14.719 -> Hi everyone and thanks for joining us
16.24 -> today for the CloudBees Connect virtual
18.48 -> conference!
19.439 -> Today's workshop covers a lot of topics
21.439 -> from JVM administration
23.279 -> to the top five Jenkins issues we see
25.279 -> here at CloudBees support.
26.72 -> So let's get started! My name is Ryan
30.48 -> Smith
30.96 -> and I'm currently the Global Escalation
32.559 -> Manager for CloudBees.
34 -> I've been with the company for a little
35.2 -> over two years. I'm formerly a Senior
37.36 -> Development Support Engineer within our
38.879 -> support
39.36 -> organization, and I'm heavily focused on
41.68 -> maintaining the performance and
43.36 -> stability of Jenkins.
44.96 -> I've worked with enterprise Java
46.239 -> deployment for a little over a decade,
48.32 -> and I'm joined here today by my friend
50 -> and colleague Michelle Fogwell.
51.92 -> Michelle tell us a little bit about
53.039 -> yourself and kick us off.
55.52 -> Hey everyone, I'm Michelle. I am a current
57.76 -> Senior Developer Support Engineer and
59.6 -> I've been at CloudBees
61.199 -> for about three years now. I started my
63.92 -> career as a programmer
65.36 -> but I turned to support because I'm
66.96 -> really passionate about the customer
68.479 -> experience,
69.76 -> and this gives me a chance to see what
71.439 -> issues customers are facing in the real
73.36 -> world.
74.159 -> As we mentioned we both work at
75.52 -> CloudBees and just to tell you a little
77.36 -> bit more about our company,
78.96 -> we offer enterprise CI/CD products and
81.68 -> services.
82.799 -> We have the largest number of Jenkins
84.64 -> certified engineers and are proud to be
86.88 -> the number one contributor to the
88.479 -> Jenkins project.
90.32 -> Next slide. So what are we going to talk
93.52 -> about today
94.32 -> and how is it going to help you? Jenkins
97.2 -> has changed a lot over the years
99.52 -> and that means so have the best
100.88 -> practices. We're going to make sure to
103.119 -> talk about the latest best practices
105.52 -> and we'll cover JVM tuning and
107.439 -> administration.
108.88 -> We're going to talk about the top five
110.479 -> issues that we as a support organization
113.04 -> see our customers struggle with and
115.119 -> we'll go over solutions for those as
116.719 -> well.
117.68 -> Once we've covered that we want to show
119.439 -> you some real world data to back up what
121.28 -> we're saying
121.92 -> and then answer any questions you might
123.439 -> have. We're hoping that this information
126 -> is going to help you with your JVM
127.52 -> administration skills
129.119 -> and make you care a little more about
130.879 -> the inner workings of the JVM
132.72 -> and topics like garbage collection. This
134.8 -> will help you grow as a Jenkins
136.239 -> administrator
137.04 -> so that you and any development teams
138.8 -> you're working with can have a better
140.08 -> Jenkins experience.
141.68 -> Our goal is to help you get the most out
143.36 -> of Jenkins.
145.12 -> Next slide. So let's talk about Jenkins
148 -> and JVM.
149.36 -> Jenkins and JVM are both impressive
151.519 -> pieces of software.
152.879 -> They run pretty well out of the box, and
154.8 -> if that's all you're doing, they're
155.92 -> pretty simple.
157.04 -> The problem comes when you start to need
158.879 -> customizations.
160.56 -> It starts out with a small change but as
162.8 -> your user base grows organically,
164.72 -> so will the level of complexity. With all
167.44 -> the possible customizations we encourage
169.599 -> you to do
170.08 -> research to make sure you're choosing
171.599 -> what's best for your company, or you can
173.44 -> always come to CloudBees to get the
174.64 -> answers.
175.519 -> The CloudBees support team works through
177.44 -> thousands of tickets every month.
179.04 -> It makes us really good at seeing the
180.4 -> big picture and that helps us point you
182.319 -> in the right direction based on your
183.68 -> requirements. We also offer professional
186.48 -> services to help you set up your CloudBees
188.08 -> Jenkins instance, if that's where
189.76 -> you're at
190.64 -> and just to give you a little spoiler,
192.48 -> one of our customers saw a 3,500%
195.36 -> increase in their Jenkins performance by
197.76 -> implementing a lot of things that we're
199.2 -> going to be talking about today.
201.76 -> Next slide, and I promise it's not
204.4 -> shenanigans and Ryan's here to help me
206.239 -> prove it.
210.84 -> Ryan?
212.879 -> Thanks Michelle. I want to kick off by
214.799 -> giving a brief overview of the
216.72 -> inner workings of Jenkins.
221.12 -> We see thousands of support cases a year
223.12 -> at CloudBees and I find myself repeating
225.12 -> this phrase a lot
226 -> on phone calls with clients. Jenkins is a
228.72 -> Java application.
230.239 -> Mainly because system administrators
232.08 -> often forget that with most Java
233.68 -> applications,
234.799 -> there's a layer of JVM administration
237.04 -> that often becomes overlooked
238.72 -> after deployment. Common responsibilities
241.68 -> of the JVM administrator include
243.28 -> defining an initial heap size, setting
245.519 -> JVM arguments,
246.959 -> but it's not necessarily a set it and
249.2 -> forget it role.
250.72 -> As usage of the application increases so
253.2 -> does the need to monitor
254.56 -> and revisit those initial settings.
258.88 -> You're looking at a high level overview
260.72 -> of JVM architecture.
262.639 -> The underlying mechanisms like class
264.88 -> loader, code compiler,
266.32 -> garbage collector, this is all commonly a
268.56 -> black box for most Java developers.
271.04 -> What you really need to grasp here
272.8 -> without understanding all the moving
274.32 -> pieces
274.88 -> is that the JVM has two primary
277.04 -> functions.
278.32 -> First to allow Java to run an
281.28 -> application on
282.08 -> any device or operating system. That's
284.56 -> the Java "write once, run anywhere"
286.72 -> principle. Second to manage and
289.919 -> optimize program memory. So how is memory
293.04 -> allocated?
295.759 -> This illustration sheds some light on
297.759 -> that black box from a memory allocation
300.32 -> standpoint
301.039 -> and uncovers some common misconceptions.
303.68 -> The most common misconception is that
305.52 -> the memory you allocate to heap space is
307.6 -> all the memory that the JVM is going to
309.12 -> consume. That's simply not true.
311.759 -> There's always memory in use by the
313.52 -> JVM outside of the heap space.
316.24 -> Some of you might be familiar with the
317.759 -> concept of PermGen space in Java 7;
320.479 -> in Java 8 it's now referred to as
322.84 -> Metaspace.
324.16 -> Metaspace is memory the JVM uses to
326.24 -> store class metadata.
327.84 -> It's important to understand that
329.28 -> metaspace is unbounded
330.96 -> by default, so if a class leak is present
333.84 -> you can have an out of memory event.
335.759 -> In addition to metaspace, there's also
337.52 -> Java Native memory.
338.96 -> This is where mechanisms like garbage
340.88 -> collection take place,
342 -> as well as code generation, which is
344.32 -> essentially conversion of bytecode to
346.08 -> native code.
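Since Metaspace is unbounded by default and native memory sits outside the heap, it can help to cap and track those regions explicitly. Here is a minimal sketch, assuming a startup script where the Jenkins JVM options are set; the variable name, sizes, and PID are illustrative placeholders, not values prescribed in this talk:

```bash
# Illustrative JVM options: bound the heap and Metaspace so a class leak
# surfaces as a clear OutOfMemoryError rather than unbounded growth, and
# enable native memory tracking to see usage outside the heap.
JENKINS_JAVA_OPTIONS="-Xmx4g -XX:MaxMetaspaceSize=1g -XX:NativeMemoryTracking=summary"

# Ask the running JVM how memory is split across heap, Metaspace, threads,
# GC, and code cache (1234 is a placeholder for the Jenkins process id):
jcmd 1234 VM.native_memory summary
```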
347.36 -> In short there's a lot more going on
349.52 -> under the covers than you might think. I
352.08 -> want to talk just a little bit about one
353.84 -> very important mechanism of the JVM,
356.56 -> the garbage collection. Garbage
358.56 -> collection in many circles is referred
360.639 -> to as voodoo science.
362.4 -> There's a lot of different garbage
363.68 -> collection algorithms out there that do
365.44 -> things just a little differently, but to
367.52 -> break it down to a simplistic overview,
370.08 -> there are essentially three main steps
372.319 -> to garbage collection.
374 -> First, objects get loaded into heap
376.639 -> memory.
377.84 -> During a garbage collection cycle the
380 -> garbage collector
380.96 -> iterates through all those objects and
382.8 -> determines whether or not they're
384.16 -> reachable.
385.36 -> If they're not they are marked for
387.28 -> collection. That's what's called the
388.8 -> marking phase.
390.4 -> Next during the sweep phase, all the
393.199 -> objects that are marked for removal
394.88 -> are removed from memory. Lastly during
397.919 -> the compaction
398.8 -> phase, all those reference objects are
401.28 -> compacted down, which makes new
403.12 -> allocation
404.08 -> faster. It's important to get a base
406.8 -> understanding of how this works because
408.319 -> I'm going to show you later what
409.919 -> effect this mechanism has on the overall
412.319 -> health of Jenkins.
415.28 -> Traditionally, as system admins, we look at
418.319 -> these three primary metrics when
420.24 -> monitoring a Java application.
422.639 -> Primarily how much CPU is being used and how
425.36 -> much memory.
426.639 -> I refer to these things as macro metrics.
429.28 -> The problem with this is that it's
431.12 -> difficult to forecast
432.479 -> problems before they happen. So I'd like
435.199 -> to suggest the concept of monitoring
437.36 -> micro metrics as a good best practice to
439.84 -> employ in your environment.
441.52 -> So what exactly are micro metrics? Well in
444.72 -> short
445.36 -> some examples of micrometrics are object
447.68 -> creation rate,
448.88 -> which is essentially the rate of objects
450.72 -> that are being loaded into memory,
452.96 -> garbage collection latency, which is how
454.96 -> long a garbage collection cycle is
456.639 -> taking to complete.
458.4 -> For applications like Jenkins with low
460.24 -> latency requirements we want to see that
461.84 -> limited to under
462.72 -> one second. Garbage collection throughput,
465.919 -> which is how much time the application
467.44 -> is not spending in garbage collection.
469.36 -> We like to see this as 99%.
472.639 -> The good news is you can gather these
474.56 -> micro metrics by analyzing things like
476.639 -> garbage collection logs,
478.16 -> thread dumps, and heap dumps. Garbage
480.639 -> collection logging can be enabled by
482.4 -> defining startup parameters for the JVM;
485.039 -> thread dumps and heap dumps can be
486.639 -> collected on the fly by issuing commands
488.8 -> to the JVM.
490.16 -> using tools like jcmd and jmap, which are
493.52 -> provided as part of the Java Development
495.12 -> Kit,
495.84 -> or JDK. This is one reason we recommend
498.96 -> using a JDK instead of a JRE as one of
501.44 -> our best practices.
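As a concrete illustration of gathering those micro metrics, here is a hedged sketch of enabling GC logging at startup and capturing thread and heap dumps with the JDK tools just mentioned. The variable name, file paths, and PID are assumptions, and the logging flags differ between Java 8 and Java 9+ as noted in the comments:

```bash
# GC logging at startup (Java 8 style flags):
JENKINS_JAVA_OPTIONS="-Xloggc:/var/log/jenkins/gc-%t.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M"

# On Java 9+ unified logging replaces the flags above, e.g.:
#   -Xlog:gc*:file=/var/log/jenkins/gc-%t.log:time,uptime:filecount=5,filesize=20M

# Thread and heap dumps on the fly (1234 is a placeholder for the Jenkins PID):
jcmd 1234 Thread.print > /tmp/jenkins-threads.txt
jcmd 1234 GC.heap_dump /tmp/jenkins-heap.hprof
# jmap is an alternative way to take the heap dump:
jmap -dump:live,format=b,file=/tmp/jenkins-heap.hprof 1234
```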
502.96 -> While there are many tools out there for
504.72 -> analyzing this data we recommend
507.039 -> GCeasy, fastThread, and HeapHero.
510.08 -> All of these online tools are free as
513.279 -> long as you're not an enterprise client,
514.88 -> and allow you to upload your logs
516.719 -> directly to their site for analysis.
518.8 -> Here at CloudBees we use an enterprise
520.56 -> version of these tools, which allows us
522.159 -> to keep the data in-house and secure.
525.6 -> Here's a quick example showing the
528.32 -> GCeasy
528.959 -> interface which allows you to upload
530.64 -> your garbage collection logs directly to
532.48 -> their online analyzer.
535.68 -> Remember we talked about how garbage
537.6 -> collection works with the three cycles
539.68 -> marking, sweeping, and compaction? Well
542.16 -> let's turn this graph on its side for an
544.48 -> example.
546.959 -> Note the pattern that this creates when
549.12 -> objects are loaded into memory
550.8 -> and then subsequently get removed. This
553.36 -> might look familiar to you if you've
554.959 -> ever analyzed a garbage collection log
556.8 -> before.
557.68 -> We call this the sawtooth pattern.
561.2 -> The sawtooth pattern is an indicator of
563.68 -> a healthy application.
565.36 -> This means that you have allocated
567.2 -> enough memory to the JVM
568.959 -> so that it has plenty of room for
570.72 -> objects to be loaded,
572.24 -> removed, and the space compacted to make
574.88 -> room for new objects.
578.32 -> In an unhealthy garbage collection log
580.72 -> you might see a pattern that looks
582 -> something like this.
583.6 -> Notice that pretty sawtooth pattern
585.6 -> disappears and there's
586.8 -> now evidence of a memory leak. So why is this
589.76 -> important?
590.56 -> Why should you as a Jenkins
591.839 -> administrator care about this?
593.68 -> Well in short, paying attention to these
596.48 -> micro metrics like garbage collection
598.399 -> logging
599.2 -> will allow you to forecast problems
601.279 -> before they happen.
603.2 -> Here's an example showing how over time
606.399 -> I can see early indications of a problem
608.72 -> before an
609.44 -> out-of-memory event occurs, that takes
611.12 -> down my application,
612.64 -> and ultimately affects my end users. In
615.279 -> this case
615.92 -> I can see the application is suffering
617.68 -> from what appears to be a memory leak
619.76 -> and therefore I could take steps to
621.6 -> notify my users of an upcoming
623.6 -> maintenance window
624.72 -> where I could capture heap dumps to
626.24 -> further analyze the situation.
629.6 -> I had mentioned fast thread earlier as
631.36 -> an option to analyze thread dumps.
633.839 -> Thread dumps can be taken using JDK
635.92 -> tools like jstack,
637.279 -> and once you've captured them, you can
638.959 -> upload the output to the analyzer.
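If you want a repeatable way to do that, here is a small sketch using jstack. Capturing a few dumps a few seconds apart is a common diagnostic habit rather than something prescribed in this talk; the PID and paths are placeholders:

```bash
# Capture three thread dumps, five seconds apart, for later upload to an analyzer.
PID=1234   # placeholder for the Jenkins master's Java process id
for i in 1 2 3; do
  jstack -l "$PID" > "/tmp/threaddump-$(date +%H%M%S).txt"
  sleep 5
done
```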
641.44 -> Anytime you need to understand what
642.959 -> Jenkins is doing at a certain moment,
644.72 -> you can capture a thread dump. A thread
647.519 -> dump analysis looks something like this.
650.32 -> Note the micro metrics we can gather
652.32 -> from this output like total thread count.
655.2 -> In a healthy Jenkins instance we see
657.04 -> thread counts hover anywhere from 200 to
659.519 -> 300.
660.64 -> In this example we see a very high
662.64 -> thread count which indicates we may need
664.8 -> to look deeper into the individual
666.24 -> threads
666.959 -> to understand what exactly is going on.
670.24 -> In this example I was diagnosing a slow
672.88 -> running instance
673.68 -> which was suffering from a performance
675.279 -> issue. Here we see that there are
677.279 -> 54 blocked threads related to the weather
679.68 -> column.
680.72 -> While the weather column is a nice
681.92 -> feature to have in Jenkins on smaller
683.68 -> instances,
684.64 -> we don't recommend running it on larger
686.399 -> masters that may have thousands of jobs
688.32 -> running.
689.04 -> The key here is that I would not have
690.88 -> known the weather column was the root
692.399 -> cause of the reported performance issue
694.64 -> unless I had captured that thread dump
696.48 -> as a way to investigate it.
700 -> I'd like to turn it over to Michelle to
701.519 -> talk about some of the challenges we
703.04 -> often face with Jenkins administration.
708 -> So this video is a great way to
709.92 -> illustrate what happens with Jenkins.
712.32 -> You create a really solid instance
714.399 -> similar to this tire swing
716.16 -> and you start using it. The tire swing is
718.56 -> fun and so your friends want to jump on,
720.8 -> just like when your Jenkins is running
722.72 -> smoothly other teams want to jump aboard.
726.079 -> Nothing bad is going to happen at first
728.639 -> but the more people that you add on
730.399 -> the more you start to feel your instance
732.639 -> slowing down.
734.399 -> If you continue to allow anyone to jump
736.72 -> on board
737.92 -> without making the necessary adjustments,
740.48 -> then you're going to end up holding your
741.76 -> breath,
742.639 -> as it gets slower and slower just
744.399 -> waiting for something to go wrong.
746.48 -> After enough teams jump aboard, you'll
749.279 -> see
749.6 -> that eventually something will go wrong.
754.48 -> When talking about the main issues we
756.399 -> see our customers
757.6 -> have, we can break it down into two
759.839 -> categories.
761.04 -> People and infrastructure. First JVM
764.32 -> skills are often
765.519 -> not a focus when looking at a Jenkins
767.44 -> admin. We find that customers use
769.76 -> out of date best practices and aren't
772.399 -> staying current on their knowledge.
774.32 -> We also see that growth within Jenkins
776.399 -> happens organically.
777.92 -> One team has success with Jenkins and
779.839 -> the other teams want to do it as well,
781.44 -> similar to the tire swing. This is
783.68 -> something a lot of people don't account
785.68 -> for
786.16 -> and becomes an issue, if you aren't
787.6 -> adjusting as you need.
789.36 -> The other thing we see from the people
790.88 -> side is that there can be a lot of
792.32 -> different sources
793.6 -> causing strain and it's hard to keep
795.36 -> track of who's doing what.
797.44 -> The other side of Jenkins admin is the
799.2 -> actual infrastructure that you've set up.
801.68 -> We see a number of disk performance
803.519 -> issues due to a lack of optimization,
806.24 -> we see poorly managed memory and garbage
808.48 -> collection settings,
809.6 -> and these things are often overlooked
811.44 -> when talking about how to have a healthy
813.2 -> Jenkins system.
814.8 -> One thing people don't always realize is
816.72 -> that calling a rest API over and over
818.88 -> can put a strain on the system as well,
821.44 -> and the last item worth mentioning is
823.6 -> that third party plugins aren't always
825.279 -> your friend.
826.079 -> They can add some cool features but not
828 -> all of them are worth the strain they're
829.519 -> going to put on the system.
831.279 -> If you can focus on a few of these
832.959 -> things, you're going to save yourself a
834.48 -> lot of time in the long run.
838.399 -> This is a list for you to look back on
840.24 -> that outlines configuration
842 -> best practices, and I want to touch on a
844.16 -> few of these.
845.36 -> First rotating your build history. Make
847.92 -> sure you're choosing to keep
849.279 -> only the important builds. You don't need
851.6 -> to keep all
852.32 -> information forever. You also want to
855.04 -> make sure
855.76 -> you use a JDK and not a JRE
859.199 -> and NFS versions 3.0 and 4.1
863.6 -> are the recommended versions for
865.04 -> performance. G1
867.279 -> is the garbage collector that we
868.959 -> recommend and when you enable logging, it
871.68 -> can help you know when you need to grow.
874.56 -> You want to explicitly set the heap size
876.88 -> because you need to know
878.079 -> how much your app requires and
880.56 -> monitoring the macro
882 -> and micro metrics like Ryan had
883.839 -> discussed, are really important.
886.48 -> You can also install the Support Core
888.32 -> plugin which allows you to easily
890.32 -> collect these logs.
892.079 -> The last one is a link to the JVM
894.639 -> recommended settings for Jenkins, so I
896.48 -> highly recommend you read over that.
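As a rough illustration of a few items on that list, the sketch below shows an explicitly sized heap and the G1 collector being set as JVM options. The variable name and sizes are assumptions for illustration; size the heap for your own workload, keep it at or below the 16 gig ceiling discussed next, and treat the linked JVM recommended settings as the authoritative reference:

```bash
# Illustrative baseline: explicit, equal initial/max heap plus the G1 collector.
JENKINS_JAVA_OPTIONS="-Xms4g -Xmx4g -XX:+UseG1GC"
```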
900.399 -> Now we're going to talk a
901.839 -> little about scaling horizontally.
904.32 -> If you're not familiar with the term
905.92 -> scaling horizontally,
907.519 -> we mean adding additional resources or
909.839 -> masters to allow for the growing user
912.079 -> base
912.8 -> to continue to use Jenkins as smoothly
915.12 -> as you currently have it.
916.8 -> So how do you know when it's time to
918.079 -> scale? As with most questions
920.72 -> it depends. It mostly has to do with your
923.6 -> Jenkins usage
924.8 -> and growth rate. You want to stay ahead
927.12 -> of the need but you also don't want to
928.72 -> be wasting money with unnecessary
930.639 -> resources.
932.16 -> Jenkins is memory and I/O bound, which
934.8 -> means you'll want to monitor how the
936.639 -> system is doing in those areas.
938 -> We highly recommend that you set this up.
941.519 -> That gives you a really good idea of
942.959 -> when it's time to scale
944.639 -> and you want to avoid huge masters. If
947.199 -> you have thousands of jobs it's time to
949.199 -> split up your masters
950.32 -> into multiple masters. When splitting up
953.12 -> your masters you want to make sure
954.639 -> you're segmenting them
955.839 -> logically. A lot of our customers do this
958.24 -> by teams
959.04 -> and we actually do recommend one team
962 -> per master.
963.68 -> The only thing that doesn't really
964.959 -> depend is that you cannot use more than
967.12 -> a 16 gig heap.
968.88 -> The old-school way of thinking about
970.56 -> Java development is no longer the best
972.56 -> way.
974.24 -> It was common knowledge that the JVM
976.48 -> needed CPU,
977.839 -> RAM, and disk space. When the JVM ran out
981.04 -> of memory you added more.
983.279 -> If you ran out of more to add then you
985.519 -> had to convince your boss
987.279 -> that you needed to buy more RAM. That was
989.839 -> common practice.
991.279 -> The problem is that with larger JVM heap
994.079 -> sizes,
995.12 -> you get larger garbage collection cycles.
998.079 -> Garbage collection cycles are stop the
1000 -> world events.
1001.199 -> This means that no other threads are
1003.04 -> going to execute while garbage
1004.56 -> collection is taking place.
1006.8 -> We've found that in order to keep your
1008.32 -> garbage collection cycles under
1010 -> one second, a 16 gig limitation of heap
1013.199 -> size
1013.68 -> should be observed.
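A lightweight way to watch for long or frequent collections between full log reviews is jstat, which also ships with the JDK. A minimal sketch, with the PID and interval as placeholders:

```bash
# Print heap occupancy percentages plus young/full GC counts and accumulated
# GC time every 5 seconds; sustained growth in old gen occupancy or GC time
# is an early warning sign that pauses are creeping past the one-second goal.
jstat -gcutil 1234 5000
```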
1017.04 -> And now we're going to go into the
1018.32 -> countdown.
1020.16 -> So these are the top five issues
1023.6 -> that we
1024 -> see in Jenkins, but I first want to touch
1026.079 -> on how we came up with these.
1027.919 -> So I already mentioned that Ryan and I's
1029.919 -> team deals with thousands of cases
1031.919 -> and we have access to a lot of diverse
1033.76 -> Jenkins users.
1035.12 -> So we polled Jenkins engineers from
1037.199 -> different industries
1038.24 -> and company sizes and these were the
1040.48 -> recurring themes.
1041.679 -> So let's start the countdown.
1047.6 -> Thanks Michelle! To start off with number
1050.48 -> five, the fifth
1051.44 -> most common issue we run into
1053.2 -> within CloudBees support organization
1055.44 -> is quite simply a lack of architectural
1057.6 -> understanding of what
1058.72 -> Jenkins is and how it works. Not only is
1061.919 -> Jenkins a powerhouse within your
1063.44 -> software development
1064.4 -> delivery management system, like I said
1067.12 -> earlier Jenkins is a Java application. So
1069.44 -> that means
1070 -> it requires JVM administration. In
1072.799 -> addition
1073.52 -> Jenkins doesn't have a traditional
1074.88 -> database associated with it.
1076.799 -> So it relies on its home directory,
1082.64 -> making I/O throughput an important part
1085.039 -> of the overall
1085.84 -> infrastructure. We also find that many
1088.48 -> times
1089.039 -> admins don't know where to go when they
1090.799 -> have a question. Some important parts of
1092.96 -> the infrastructure include networking,
1094.64 -> storage, Linux administration, and Kubernetes
1097.28 -> administration, to name a few.
1099.36 -> As a solution we recommend identifying
1101.84 -> subject matter experts in these areas,
1104 -> making sure that you have thoroughly
1105.76 -> documented your infrastructure,
1107.52 -> and understand where Jenkins fits in as
1110.48 -> a critical piece of your CI/CD pipeline.
1113.44 -> Lastly ensuring that you have a playbook
1115.44 -> for when things go awry
1116.96 -> and you know who to call. Michelle, take
1119.44 -> us to number four.
1122.88 -> All right we often have people reach out
1125.2 -> saying that their Jenkins is broken,
1127.2 -> while also telling us that nothing has
1129.36 -> changed.
1130.48 -> As far as they know nothing has changed
1132.72 -> but Jenkins has
1133.76 -> so many moving pieces that this change
1136.32 -> can be happening in
1137.52 -> a number of places. One reason
1140.4 -> this can happen
1141.28 -> is miscommunication between teams. This
1144.32 -> can make it appear
1145.36 -> that an error came out of nowhere. We see
1148.16 -> this when a
1148.96 -> different team updates a plugin without
1151.52 -> confirming with anyone,
1153.28 -> we also see pipelines being changed
1155.44 -> without proper notification to any
1157.36 -> affected teams.
1158.799 -> This is one of the reasons we recommend
1160.799 -> putting your pipeline code into your
1162.48 -> source code manager as a best practice.
1165.36 -> You could have network or firewall
1167.039 -> changes that break your Jenkins
1168.88 -> and these can be difficult to find.
1171.039 -> Someone could do a security upgrade and
1172.96 -> if you patch the kernel without testing
1174.799 -> with your version of Jenkins,
1176.48 -> that could be another thing that causes
1178.16 -> error without you realizing.
1180.559 -> All of these reasons are some type of
1182.48 -> poor communication between the teams at
1184.48 -> a company,
1185.52 -> so what can you do? We recommend taking a
1188.64 -> scientific approach,
1190.08 -> and what I mean by that is keeping
1192 -> everything consistent
1193.679 -> and only making one change at a time.
1196.32 -> This allows you to know where the error
1198.16 -> is coming from.
1199.44 -> It's also a great idea to keep a log of
1201.6 -> the changes so you can see what's been
1203.36 -> adjusted recently.
1204.96 -> You want to increase the communication
1206.72 -> between teams
1208 -> and ensure that any architectural
1209.84 -> changes are communicated ahead of
1212.159 -> time
1212.64 -> so other teams can be prepared. There are
1215.2 -> plugins you can use to track
1216.72 -> change logs and at the bottom of this
1219.2 -> we've included
1220.159 -> two links to our recommended plugins, the
1222.32 -> job config history plugin
1224.4 -> and the audit trail plug-in. On to number
1226.72 -> three.
1228.88 -> Thanks Michelle! So number three is gonna
1230.88 -> be having inconsistent backups and we
1234.4 -> hear
1235.12 -> a lot that quote "we don't have a backup".
1238.159 -> Surprisingly, one of the top
1239.919 -> issues we deal with is quite frankly a
1241.84 -> lack
1242.24 -> thereof. Most admins have
1244.72 -> deployed Jenkins,
1246.08 -> and set up backups or at least they
1248.159 -> think they did,
1249.12 -> and they never tested it, and it's a
1251.28 -> really tough position to be in when
1252.88 -> catastrophe hits the data center and you
1254.88 -> find out the backups you've been running
1256.48 -> smoothly for the last three years appear
1258.08 -> to be
1258.48 -> zero kilobytes in size. Believe me, I've
1261.44 -> been there
1261.76 -> and it's no fun. Jenkins can be easily
1264.799 -> configured to perform backups through
1266.48 -> the backup plugin,
1267.76 -> but the more important piece here is to
1270 -> understand the ceremony of how to
1272 -> perform a backup and restore
1273.76 -> of the crucial data Jenkins relies on,
1276.32 -> which is located in the Jenkins home
1278 -> location.
1279.44 -> Again practicing that strategy and
1281.52 -> checking on your backups frequently to
1283.28 -> ensure the data is sound,
1284.88 -> is an excellent best practice. Remember
1287.44 -> that there isn't a database associated with Jenkins;
1290.96 -> the Jenkins home location is the data
1294.08 -> point of record.
1295.52 -> This will keep you from having tough
1297.039 -> conversations about not having backups.
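For teams scripting their own backups outside the plugin, here is a minimal sketch of archiving the Jenkins home location with a basic sanity check, so a zero-kilobyte surprise gets caught early. Paths, exclusions, and the size threshold are assumptions; adjust for your instance, and still practice an actual restore periodically:

```bash
JENKINS_HOME=/var/lib/jenkins        # placeholder path
BACKUP_DIR=/backups/jenkins          # placeholder path
STAMP=$(date +%Y%m%d-%H%M%S)

# Archive the home directory; workspaces and caches are rebuildable, so skip them.
tar --exclude="$JENKINS_HOME/workspace" \
    --exclude="$JENKINS_HOME/caches" \
    -czf "$BACKUP_DIR/jenkins-home-$STAMP.tar.gz" "$JENKINS_HOME"

# Verify the archive is readable and not suspiciously small before trusting it.
if tar -tzf "$BACKUP_DIR/jenkins-home-$STAMP.tar.gz" > /dev/null \
   && [ "$(stat -c%s "$BACKUP_DIR/jenkins-home-$STAMP.tar.gz")" -gt 1048576 ]; then
  echo "Backup looks sane"
else
  echo "Backup FAILED verification" >&2
fi
```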
1300.08 -> Take us on to number two, Michelle. So we
1304.4 -> see a lot of teams who use monolithic
1306.799 -> masters.
1308.08 -> This means they keep piling all their
1310.159 -> jobs onto one master
1311.919 -> and that is what we call a Jenkinstein.
1314.48 -> There are a few ways
1315.52 -> you can tell if you have a Jenkinstein,
1318 -> if you have multiple teams
1319.84 -> using the same server, or a misuse of
1322.08 -> resources,
1322.88 -> you might have a Jenkinstein. As we
1325.12 -> talked about earlier
1326.48 -> choosing to follow the old-school way of
1328.64 -> thinking about Java development
1330.88 -> means you probably have a Jenkinstein.
1333.36 -> Quick recap:
1334.559 -> Larger JVM heap sizes mean larger
1336.96 -> garbage collection cycles,
1338.48 -> which brings your Jenkins instance to a
1341.039 -> standstill.
1342.24 -> You want to focus on scaling out to
1344.24 -> avoid this.
1345.6 -> With monolithic masters you have your
1347.679 -> cowboys and cowgirls going around the
1349.6 -> wild west
1350.64 -> downloading their favorite Chuck Norris
1352.4 -> plug-in without thinking about the
1354.159 -> ramifications for the master as a whole.
1357.2 -> It ends up coming down to there being just too
1358.96 -> many cooks in the kitchen.
1360.72 -> When you scale horizontally, you can
1362.64 -> allow people the freedom to mess things
1364.559 -> up
1365.039 -> without a large blast radius when
1367.039 -> Jenkins goes down.
1368.799 -> Having zero downtime isn't a realistic
1371.2 -> expectation,
1372.48 -> so the next best thing would be to limit
1374.48 -> who's affected when one master does go
1376.559 -> down.
1377.6 -> There's a great link here about moving
1379.6 -> to the new way of thinking about
1381.2 -> JVM administration, which I highly
1383.28 -> recommend;
1384.4 -> it includes a lot of best practices.
1387.44 -> If you take nothing else away from this,
1389.52 -> remember to scale
1390.72 -> out and no more than 16 gig heap. Ryan
1394.08 -> bring us home!
1395.36 -> Thanks Michelle! The number one issue we
1397.52 -> see in the support organization
1399.039 -> is that Jenkins becomes slow or
1401.12 -> unresponsive
1402.559 -> and this is the whole reason I walked
1404 -> you through all that JVM and garbage
1405.76 -> collection stuff earlier.
1410.72 -> When we on the performance and stability
1412.88 -> team hear that your Jenkins is running
1414.32 -> slow,
1415.12 -> the first place I go to is the garbage
1417.12 -> collection logs.
1418.48 -> Nine times out of ten with a little
1420.159 -> prescriptive tuning of JVM parameters, we
1422.559 -> can resolve that slowness.
1424.64 -> Most of the time the root cause ends up
1426.32 -> being that the JVM administrator has
1428.4 -> simply neglected the JVM since it first
1431.2 -> got deployed.
1432.48 -> Now the application usage has increased
1434.32 -> to a point where it requires more
1435.679 -> resources
1436.48 -> than you initially gave it. While garbage
1439.52 -> collection is
1440.159 -> often the issue, some other commonalities
1442.48 -> we see on the front lines of support
1444.4 -> tend to be around directory services
1446.24 -> configuration, or maybe running a tier
1448.64 -> three plug-in that has some poorly
1450.159 -> written code.
1451.44 -> Nothing against the Chuck Norris plug-in
1453.36 -> which tells you a Chuck Norris joke on
1455.12 -> demand,
1455.919 -> but it's probably not something I
1457.52 -> recommend running in production.
1459.52 -> The solutions to these issues are simply
1461.6 -> to follow our best practices around JVM
1463.76 -> administration,
1464.96 -> pipeline construction, and to keep an eye
1467.44 -> on those macro
1468.48 -> and micro metrics we talked about
1470.159 -> earlier. By doing these things you've got
1472.4 -> a leg up
1473.039 -> in ensuring you're running a stable and
1475.039 -> performant Jenkins.
1476.96 -> I want to change gears here and show you
1478.72 -> some real world data examples where we
1480.799 -> were able to resolve performance issues
1483.039 -> with some of the best practices that we
1484.64 -> talked about here today.
1491.76 -> The scenario here is that it was
1493.76 -> reported that users were waiting
1495.52 -> several minutes to log into the
1496.96 -> application and UI navigation
1500.24 -> was slow. Now when we look at the
1502.96 -> garbage collection data here, we can see
1505.039 -> the following
1506.08 -> KPIs. Notice that the throughput
1509.52 -> is at 92%. This means that about eight
1513.919 -> percent of the time
1515.12 -> the application is waiting for garbage
1517.12 -> collection.
1518.4 -> Remember earlier I had said we want to
1520.24 -> see this number at or above
1521.44 -> 99%, remember also that
1524.799 -> garbage collection events are stop the
1526.559 -> world events like Michelle
1528.08 -> noted. So during this time no other
1530.48 -> threads are moving
1531.44 -> including login and http requests.
1534.559 -> Ultimately this is causing a bottleneck
1536.48 -> that will eventually render the
1537.76 -> application unusable.
1539.76 -> Also take a look at the max pause GC
1542.48 -> time.
1543.52 -> Do you see that there's 20 second wait
1545.6 -> times there?
1546.96 -> This explains the UI slowness being reported,
1550.159 -> as well as the thread bottlenecks it's
1551.919 -> causing, because
1553.279 -> Jenkins is a Java application that
1554.96 -> requires low latency.
1556.72 -> If I'm waiting on a GC cycle that's
1559.12 -> taking 20 seconds,
1560.72 -> that means that the UI is essentially
1562.64 -> unavailable for me to log into.
1566.4 -> When we dig into this further we notice
1569.44 -> what JVM arguments are in place and we
1572.4 -> found
1572.799 -> several arguments that were forcing the
1574.64 -> G1 GC
1575.84 -> algorithm to work overtime to keep up
1578.48 -> with the constraints of the argument
1579.919 -> limitations.
1581.279 -> Some of the examples of these unwanted
1583.36 -> arguments are listed here.
1585.6 -> Take note that these are very explicit
1589.279 -> arguments that are setting
1591.12 -> values of percentages and
1594.24 -> sizes. It's an
1598.159 -> old-school way of thinking to do this
1599.919 -> for fine-tuning the JVM, in fact I'd call
1602.24 -> this overtuning the JVM.
1604.64 -> As the JDK has matured, there's a more
1607.76 -> keep it
1608.32 -> simple methodology that comes with it,
1610.24 -> that states
1611.76 -> allow the JDK to do what it's intended
1614.24 -> to do
1615.2 -> and stop throwing a wrench into it, when
1617.36 -> it's trying to do what it's
1618.559 -> intended to do. For what it's worth I
1621.12 -> recommended these arguments
1622.799 -> about a year ago, and that just shows you
1625.36 -> that there
1626.24 -> was a time, inside of Java 8, where
1629.919 -> recommending these arguments was
1631.44 -> necessary and
1633.52 -> like I said as the JDK has matured these
1636.559 -> fine-tuning arguments are no longer
1638.84 -> necessary.
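The slide lists the customer's actual flags; the sketch below is only representative of the kind of explicit G1 percentage and size overrides being described, not the exact arguments removed:

```bash
# Before: explicit overrides that force G1 to fight its own heuristics
# (representative examples only, not the customer's actual arguments).
#   -XX:+UseG1GC -Xms16g -Xmx16g
#   -XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=40
#   -XX:InitiatingHeapOccupancyPercent=30
#   -XX:ConcGCThreads=8 -XX:G1HeapRegionSize=32m
#
# After: let the collector do what it's intended to do.
#   -XX:+UseG1GC -Xms16g -Xmx16g
```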
1640.559 -> When we remove those arguments, watch
1643.12 -> what happens.
1644.399 -> The throughput goes to 99%
1647.52 -> and take a look at the number of garbage
1649.52 -> collections that are actually
1651.36 -> taking place here.
1652.88 -> When we compare the data from what we
1654.64 -> were just looking at,
1658.159 -> we saw 41,000 garbage collection cycles
1661.279 -> over a 48-hour period,
1665.039 -> compare that to what we're seeing now
1666.799 -> and we're only seeing 2,800.
1668.88 -> Note that max pause GC time actually went
1671.44 -> down as well, from 20 seconds
1673.36 -> to one spike of a three second interval.
1676.559 -> On this master that we're working
1678.64 -> on at an enterprise financial customer
1681.279 -> with thousands of jobs, this was a huge
1684.159 -> improvement.
1687.52 -> Here's another example of real world
1689.2 -> data from a big shipping company that we
1691.44 -> worked with.
1692.399 -> The scenario here was that it was
1694.48 -> reported that HA
1695.679 -> failovers were occurring daily leading
1697.6 -> to multiple production outages and
1699.279 -> downtime for Jenkins users.
1701.2 -> Now looking at this garbage collection data,
1703.76 -> the following was observed.
1706.159 -> Note that on the left hand side the time
1708.64 -> is in seconds and we see
1710.159 -> several GC pauses that are well above 20
1713.52 -> seconds.
1714.64 -> Well if my HA failover timeout is less
1717.679 -> than 20 seconds,
1719.279 -> then this means that a pause like that is going to
1720.96 -> initiate an HA failover.
1724.24 -> When we take a look at some of those JVM
1726.08 -> arguments that were in there, once again
1727.52 -> we see old arguments that
1729.039 -> at one point in time we recommended.
1732.64 -> Also you'll note that in this
1736 -> chart you'll see that System.gc() method
1738.32 -> calls were taking place, which are often
1740.64 -> found in
1741.44 -> third tier plugins, as there are no
1743.919 -> System.gc() method calls
1745.84 -> inside the Jenkins code base. This method
1748.32 -> call essentially calls the garbage
1750.159 -> collector
1750.88 -> ad hoc and is throwing a wrench into its
1753.919 -> natural cycle of taking a garbage
1755.76 -> collection.
1756.88 -> What we did here was not only remove the
1758.799 -> unwanted JVM arguments
1760.559 -> but we also disabled the explicit GC
1763.279 -> method calls using an additional JVM
1765.2 -> argument.
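The talk doesn't name the exact flag, but the standard HotSpot argument for this is -XX:+DisableExplicitGC, which makes the JVM ignore System.gc() calls; presumably something along these lines is what's meant:

```bash
# Append the flag to the existing (illustrative) option set so explicit
# System.gc() calls from plugins no longer trigger ad hoc collections.
JENKINS_JAVA_OPTIONS="$JENKINS_JAVA_OPTIONS -XX:+DisableExplicitGC"
```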
1768.96 -> What we have here is a 3500 percent
1771.44 -> performance increase when we did that.
1773.679 -> We went from 12-to-23-second GC pauses to an
1776.24 -> average
1777.36 -> GC pause time of 660 milliseconds.
1781.279 -> On an absolutely monolithic master at
1784.799 -> this big shipping company, the
1787.84 -> end users were absolutely raving that
1789.919 -> instead of having to wait for their
1791.279 -> application
1792.08 -> to log in for 20 seconds and walk away
1794.799 -> and get a coffee,
1796 -> now they're coming and logging in and
1798.32 -> it's taking less than half a second.
1800.24 -> This was a huge improvement.
1805.679 -> So I wanted to wrap up by telling you a
1807.6 -> little bit more information
1809.039 -> about CloudBees and how we help
1811.2 -> customers get the most out of Jenkins.
1815.2 -> As a CloudBees customer you're entitled
1817.279 -> to CloudBees support.
1818.559 -> As we mentioned Ryan and I both work on
1820.72 -> the support team and work with a
1822.559 -> fantastic team of Jenkins certified
1824.559 -> engineers.
1825.52 -> We're located all around the world
1827.279 -> allowing us to provide 24/7 support.
1830.08 -> We work with thousands of customers
1831.919 -> allowing us to see that bigger picture
1833.84 -> and help guide you to be successful
1836.08 -> with your Jenkins.
1837.44 -> Our customers have dedicated success
1839.36 -> managers, who will help you drive your CD
1841.76 -> plan forward.
1842.799 -> We also offer online training which you
1845.2 -> get free as a customer
1846.72 -> and our support team assists with
1848.84 -> upgrades.
1850.399 -> I wanted to highlight a little about our
1852.399 -> training.
1853.44 -> Customers have access to the online
1855.36 -> training portal where you can see
1857.44 -> all of our courses. We have courses
1859.679 -> focused on pipelines,
1861.279 -> pipeline advanced, Jenkins
1863.84 -> administration,
1865.039 -> and if you're interested in getting your
1866.559 -> Jenkins certification, we even have a
1868.32 -> study course for that.
1869.919 -> We offer a knowledge base with
1871.44 -> documentation about updated industry
1873.84 -> best practices
1875.12 -> and we have a list of plugins certified
1877.36 -> by CloudBees
1878.399 -> so you have the confidence that you're
1879.919 -> using plugins which have already been
1881.84 -> tested by a Jenkins engineer.
1884.88 -> Ryan do you want to talk about assisted
1886.32 -> updates? Yeah I just want to bring up
1888.88 -> that some of the best feedback we've had
1891.519 -> from our current CloudBees customer
1893.279 -> set, has been around the assisted update
1894.72 -> program.
1895.519 -> If you've been through a Jenkins update
1897.2 -> in the past you know it can cause some
1899.12 -> anxiety with plug-in dependencies
1901.36 -> and making sure backups are in place, and
1903.519 -> verifying build jobs work after the
1905.36 -> update.
1906.08 -> There are also syntax changes that come between
1907.84 -> versions. With the assisted
1909.6 -> update program that you get as a
1912.159 -> CloudBees customer,
1913.44 -> the support team here works with you
1915.12 -> proactively to create a plan of action
1917.519 -> that makes your update run as smoothly as
1918.88 -> possible,
1919.679 -> and if you're a platinum subscriber, we
1921.44 -> offer live assistance during your
1922.96 -> scheduled update.
1927.84 -> I also want to tell you about the
1929.919 -> Jenkins health advisor.
1931.919 -> Whether you are a CloudBees customer or
1934.08 -> an open source user, if you haven't heard
1936.24 -> about Jenkins health advisor yet
1938 -> I strongly recommend that after this
1940.24 -> presentation you go and download
1941.919 -> and install this plugin. Our support team
1944.72 -> is focused around automation
1946.799 -> and making your life easier, and as a
1948.48 -> part of that we wanted to share advisor
1950.88 -> with you. Advisor automatically analyzes
1953.84 -> your Jenkins environment
1955.12 -> and provides you proactive reporting on
1957.519 -> potential issues before they get out of
1959.36 -> hand.
1960.08 -> It emails you with solutions to
1962.08 -> discovered issues
1963.44 -> so you can prioritize them accordingly.
1969.44 -> So if you're ready to get the most out
1970.799 -> of Jenkins contact [email protected]
1972.399 -> to see how we can help you.
1974.72 -> We're looking forward to speaking with
1976.559 -> you and I've also
1978.159 -> included some of the resources list that
1980.32 -> we've talked about here today. I
1981.76 -> encourage you to click through that,
1983.6 -> and finally we're going to go ahead and
1985.519 -> take some live Q&A
1986.799 -> with myself and Michelle. Thank you all
1989.2 -> for joining the conference and I hope
1990.64 -> you learned something today.
1992.48 -> Thanks everyone!
Source: https://www.youtube.com/watch?v=govN7rXOmpc