CI/CD pipeline using dbt, Docker, and Jenkins, Simply Business

This recording is from London dbt Meetup Online on 29 April 2021, hosted by Fishtown Analytics.

In this talk, Ben Balde dives into:
- how to use Docker to host a dbt image shared by users/systems
- how running dbt in a container has made it easier for Simply Business to onboard new users
- how to create a robust Jenkins CI/CD pipeline by building automated dbt tests that run against production data and reduce the risk of broken code reaching the production database

Speaker:
Ben Balde, Data Engineer, Simply Business

Slides:
https://bit.ly/3ejLEYG


Register for Coalesce 2021—The Analytics Engineering Conference (it’s free to attend!): https://bit.ly/Coalesce2021YT


Content

So today I'll be talking about CI/CD pipelines using dbt, Docker, and Jenkins.

A little bit of history from us at Simply Business: we've been using dbt for just over two years now. dbt is pretty much at the center of our mission to build a single source of truth as we expose data to systems and users across the business. Thanks to the hard work of the team, we've significantly reduced the number of times people ask questions like "why is this number different?" or "why doesn't this data match the system?" We believe in empowering users to own and explore their data, and as our user base has been growing, it now includes engineers, analysts, and finance users.

With great growth also comes great responsibility. Today I'll be talking about how we combined dbt and Docker to address the challenges we faced, and how our solution made it easier for users to utilise dbt. We'll also talk about our journey with Jenkins, the problems we faced, and how we implemented a solution which made our deployment process more robust. I'll also touch on what's coming down the line at Simply Business in terms of this project.
One area we identified as needing improvement was user setup. Two years ago our plan was to make the installation process as easy and seamless as possible. If we picture the journey of a new dbt user at Simply Business, they would start by installing dbt locally on their machine. They also have to make sure that all the dependencies required to run dbt, such as Python, are installed correctly. This process is well documented, but as you can imagine, it's sometimes daunting for someone who is not very technical to use the command line to run a bunch of scripts they probably don't even understand. If there is a problem midway through the installation, most of the time they don't know how to fix it, and even then, if someone comes in to help, that person needs to work out whether all the steps were followed and all the dependencies installed. So this whole process takes time just to get the tool up and running.
We also saw situations where users had difficulties upgrading dbt. For example, they didn't know whether to use pip or brew when upgrading, and sometimes they didn't realise they had also upgraded Python to a version that wasn't compatible with dbt, so they had to go through the process of reversing that change in order to use the tool. As you can see, our process can be very smooth if users follow the documentation and know what they're doing, but it can also be a little bumpy. We thought there must be a better way to make the whole process more user-friendly and the tool easier to use. So we sat in our special virtual room looking for an answer, and the answer we came up with was Docker.
So what is Docker? Docker is a tool designed to make it easier to create, deploy, and run applications. An important element of Docker is images: images are immutable files that contain all the necessary parts to run an application, such as source code, libraries, and other dependencies. Then there are containers: containers are running instances of Docker images. One thing to note here is that containers require images to run; they depend on those images and use them to construct a runtime environment for the application. So, as a reminder: images can exist without containers, but a container needs an image to exist.
So what are the benefits of using Docker containers? They are portable: they are available from pretty much anywhere and can quickly be shared. They are reliable: because containers operate in an isolated environment, everything remains pretty much consistent between users. And they are lightweight: they share the host machine's OS kernel, which means, for example, that they use less memory.
Going through our project: our objective was to dockerise dbt by making use of a common image which could be shared between users and systems, and we planned to make use of containers to run dbt in reliable environments. So how did we achieve this?
We started by creating a Docker image which contains the dbt infrastructure. In this image we installed dbt and all the other dependencies, such as dbt helper packages, Python, and, in our case, a Snowflake connector. Once we have the image packaged and ready to be shared, we publish it to AWS, where it can then be accessed by our users.
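As a rough sketch, such an image could be defined with a Dockerfile along these lines (the base image, package choices, and paths are illustrative assumptions, not Simply Business's actual setup):

```dockerfile
# Hypothetical sketch of a shared "dbt infrastructure" image.
FROM python:3.8-slim

# dbt plus the Snowflake adapter (the adapter pulls in the
# Snowflake connector as a dependency).
RUN pip install --no-cache-dir dbt-snowflake

# Helper packages (e.g. dbt-utils) are declared in the project's
# packages.yml and installed with `dbt deps` inside the container.
WORKDIR /usr/app
CMD ["bash"]
```

The image would then be built and pushed to a registry such as AWS ECR, from where users and systems can pull it.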
The benefit of using a Docker image to hold the infrastructure is that, as a team, we can control any future upgrades to dbt, or roll out new features only once they're in a stable position, and we can test all of this in an isolated environment.
This slide goes through the setup of local development. From the previous slide, we've built our dbt infrastructure and it's available in AWS, so now let's focus on the users and how they would work with dbt locally. The process is very simple, as you're going to see. Users start by running a script called pull-dbt. This script connects to AWS and pulls down the latest infrastructure image, which then gets stored locally. Once they have the latest image, they run a start-dbt script, which uses that same image to start a Docker container with dbt and the complete environment. Inside it, users can access the models, the macros, and the seeds, and do pretty much everything they would do outside the container: run the models, test the models, compile the code. But remember, they're operating in an isolated environment.
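A minimal sketch of what those two wrapper scripts might look like (the registry address, image name, and mount path are hypothetical; for illustration the functions print the docker commands they would run, so the sketch works without Docker installed):

```shell
#!/usr/bin/env bash
# Hypothetical ECR registry and image name -- substitute your own.
REGISTRY="123456789012.dkr.ecr.eu-west-1.amazonaws.com"
IMAGE="${REGISTRY}/dbt-infrastructure:latest"

# pull-dbt: fetch the latest shared image (first setup, or on upgrade).
pull_dbt() {
  echo "docker pull ${IMAGE}"
}

# start-dbt: start a container from that image, mounting the local dbt
# project so models, macros, and seeds are visible inside it.
start_dbt() {
  echo "docker run -it --rm -v ${PWD}:/usr/app ${IMAGE} bash"
}

pull_dbt
start_dbt
```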
So what did we achieve by doing this? We've simplified our setup process: for a user to use dbt, all they need to do is run two scripts. One script they run only when they first start using the tool or when an upgrade is required, and the second script they use whenever they need to access dbt.
So, we solved the problem. To summarise: we had an installation process which wasn't very user-friendly, and we solved this by condensing installation down to a single script which installs dbt. Then, by using containers, we made dbt portable, reliable, and lightweight, which covers all the benefits of a Docker container.
Now that we'd managed to make dbt easier to use, we went back to our special virtual room and started to think about what else we could improve, and what we decided to look at was our CI/CD pipeline, which runs on Jenkins. Jenkins is an open-source automation tool used to build, test, and deploy software continuously. And to give everyone a brief description, CI/CD means continuous integration and continuous deployment.
We use Jenkins at Simply Business, so we looked at this pipeline and asked: can we make this better? When reviewing the pipeline, the key thing we found was that sometimes broken code was being deployed to production, even with all the checks we had in place. Some of the reasons we found this was happening: users' tests were executed against a dev database, and users were only required to run tests on the models they changed, which meant that even if those models ran successfully locally, they could still fail when running with their dependencies. The problem with having broken code in production is that it prevents pretty much anyone else from deploying until that broken code is either fixed or reverted.
So, the solution. I'm sure everyone is excited to find out what it was: we introduced automated tests against cloned production data. How did we achieve this?
At the center of our solution is a Python script. What does this script do? It goes through logical steps to determine, first, whether there are changes made by the user and, second, whether those changes will work in production. It starts by comparing all the changes made in the dev branch against the main branch, using a git command called git diff. At this point we get pretty much all the changes within that branch for that user. That's good, but not everything will be relevant for our tests, so what we do here is condense that list down to only model and seed files: in the filter we include a condition which keeps only SQL and CSV files.
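The diff-and-filter step can be sketched in a few lines of Python (file paths here are made up for illustration; in the real script the list would come from running `git diff --name-only` against the main branch):

```python
# Sketch of the CI script's first step: filter a git-diff file list
# down to dbt models (.sql) and seeds (.csv).

def relevant_changes(changed_files):
    """Keep only model and seed files from a changed-file list."""
    return [f for f in changed_files if f.endswith((".sql", ".csv"))]

# Hypothetical output of `git diff --name-only main...HEAD`:
changed = [
    "models/finance/revenue.sql",
    "data/country_codes.csv",
    "README.md",
]
print(relevant_changes(changed))  # README.md is filtered out
```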
So now, great, we've effectively got the list of model and seed files that have changed: they exist in the dev branch but not in master. Once we have that list of changed files and models, we implement some further logic to determine the dependencies. If you remember, the reason we do this is that even if a model runs successfully on its own, it can still fail through its dependencies. To get these we use dbt list, which first looks at the model being changed and then pulls in one parent dependency and one child dependency. At this point you might be asking yourselves: why did we decide on only one dependency either way? The reason is performance. We tried it with multiple levels of dependencies and realised it affects performance, and one level works for us while still doing the testing. One thing we've noticed as well is that sometimes the child model has a dependency on another parent model, and for this we've included some further logic which checks whether that child model depends on another parent; if it does, it gets added to the list, and we use this list later on.
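A sketch of how that selection might be expressed (model names are hypothetical). dbt's graph selectors can limit traversal depth, so "one parent and one child" of a model can be written as `1+model+1`:

```python
# Sketch of the dependency step of the CI script.

def selector(model_name):
    """Selector for a model plus its direct parents and children."""
    return f"1+{model_name}+1"

def dbt_ls_command(model_name):
    # In the real pipeline this command would be executed (e.g. via
    # subprocess) and its output parsed into the models to build/test.
    return ["dbt", "ls", "--select", selector(model_name)]

def add_missing_parents(models, parents_of):
    """If a selected child depends on a parent outside the list, add it."""
    out = list(models)
    for model in models:
        for parent in parents_of.get(model, []):
            if parent not in out:
                out.append(parent)
    return out

print(dbt_ls_command("revenue"))
print(add_missing_parents(["revenue", "revenue_by_region"],
                          {"revenue_by_region": ["fx_rates"]}))
```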
The next step: based on that list of models and dependencies, we've got everything lined up. The script runs a dbt run-operation that creates an empty database, dynamically named after the user's branch. In the example, if we have a git branch called demo_dbt, a clone database called clone_demo_dbt gets created, so users know where to look if they have to debug; it's pretty easy to find. So it first creates the clone database, which is empty, and then it takes the list of objects we built previously and uses it to create a copy from production into that database. One thing we've done here is that after seven days the clone database is automatically dropped, so we don't store unnecessary clone databases. We're pretty lucky at Simply Business that we're using Snowflake, which provides zero-copy clone functionality out of the box, but if you're using another database you can probably achieve this in a different way, maybe by having views on top of the production database.
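To make the clone step concrete, here is a minimal Python sketch of how the statements could be generated (database, schema, and branch names are hypothetical; the talk drives this through a dbt run-operation macro, and the seven-day automatic drop would be handled separately, e.g. by a scheduled task):

```python
# Sketch of the clone step: an empty database named after the git
# branch, then Snowflake zero-copy clones of the needed objects.

def clone_db_name(branch):
    """Derive the clone database name from the git branch name."""
    return "clone_" + branch.lower().replace("-", "_")

def clone_statements(branch, objects, prod_db="prod"):
    db = clone_db_name(branch)
    stmts = [f"create database if not exists {db};"]
    for schema, table in objects:
        stmts.append(f"create schema if not exists {db}.{schema};")
        stmts.append(
            f"create table {db}.{schema}.{table} "
            f"clone {prod_db}.{schema}.{table};"
        )
    return stmts

for stmt in clone_statements("demo_dbt", [("analytics", "revenue")]):
    print(stmt)
```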
The final step of this test process: we use the Python script to generate a list of dbt commands, which are then used to perform a dbt run and a dbt test against the clone database. By doing this, we pretty much make sure that whatever the user is trying to run will also work in production.
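A minimal sketch of that command-generation step (the `ci_clone` target name is a hypothetical dbt profile target pointing at the clone database, and the flag spellings follow the current dbt CLI):

```python
# Sketch of the final step: emit a dbt run followed by a dbt test
# over the selected models, aimed at the clone database target.

def build_commands(models, target="ci_clone"):
    selection = " ".join(models)
    return [
        f"dbt run --select {selection} --target {target}",
        f"dbt test --select {selection} --target {target}",
    ]

for cmd in build_commands(["stg_payments", "revenue"]):
    print(cmd)
```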
Now that I've gone through the steps, putting the whole script together, this is pretty much what you get. Just to recap: we first detect the changes on the development branch, we detect the model dependencies, we clone the database objects, and we perform a dbt run and a dbt test. If successful, the user is then able to deploy to production; if not, the user can go into the clone database, check the data, and try to debug that way.
So what did we achieve by doing this? Fewer manual tests for users. Tests now run against production data, which pretty much means that if they pass at this point, they should run in production without any issues. The risk of broken code reaching production is simply reduced.
So what's next? How does the future look for us? We hope that by using Docker we will be able to split dbt into smaller projects, which will mean the tool performs better and faster, and we can also tailor projects to specific departments. If you can imagine, finance and marketing could each have their own little dbt project which they look after. We're also hoping to develop a full blue-green deployment pipeline at some point: by having two production databases running side by side, for example, we can make deployments run even faster.
Thank you very much, and I'll open the floor for questions at this point.
Awesome, thank you so much, Ben. One from me: I was getting the impression that your organisation is pretty large, because I was reasoning about why it would make sense to solve for people getting ramped up on dbt as quickly as possible, like installing it on their computers, and it totally makes sense when you have a team of that scale. I'm curious what you were doing before. You described the pain briefly, but what was it like prior to completing this work? What did that look like?
Sure. So we had documentation, a pretty much step-by-step guide on what to do once you start at Simply Business if you're required to use dbt, and it's pretty useful: it guides you step by step, and you pretty much have to copy and paste those commands into the terminal. When it works, it's perfect and very straightforward. The problem starts when you don't have something installed on your machine, or you have a version which is not compatible, and then the errors appear. For someone who is not very technical, for example when we started onboarding people in finance and other departments, they just don't know what to do. That's the reason we tried to simplify the process as much as we could. Even though engineers can probably figure out what the error is telling them and what else to check, most other folks just need to ask for help, and that's time-consuming for everyone, because they can't use the tool and we need to figure out what they've done wrong, if they've done anything wrong at all.
It sounds like it makes your life a lot easier. And as someone who came down that same path of being less technical and figuring out the command line, it's definitely something I would have appreciated. We've got another question; I'm going to hand it over to Martina to ask it out loud. Martina, go ahead and ask your question.
Yeah, I was just wondering whether this will also work for Windows users, and whether you have to build another Docker image, or how that would work. Will it work?
It should work with Docker for Windows users. At Simply Business we're not quite there at the moment: only Mac users are using Docker. It's one of the key requirements, so they need to have Docker installed and access to AWS to pull the image down. It is possible, but it depends on your infrastructure and your company. At Simply Business we're not quite there yet, but that's the next step, especially as we're looking to onboard more people. We did ask the question, and the infosec team went away to think about how they would implement this on Windows machines. They said it is quite possible, but they have security concerns which they're going to think about how to resolve.
Thank you. Yeah, that sounds awesome though. All right, thank you.
One last naive question, just because this is my first time at the London meetup: how long has your team, how long has Simply Business been using dbt?

It's been just over two years, so even before I joined the company they were already using dbt. For me, I've been with the company for just over a year, almost a year and a half, and I'd never used dbt before, but I think the tool is amazing, especially because it empowers analysts and other users to own their own data, which frees up time for engineers to focus on other things.

Source: https://www.youtube.com/watch?v=B_EpbM8XBJw