CI/CD pipeline using dbt, Docker, and Jenkins, Simply Business

This recording is from London dbt Meetup Online on 29 April 2021, hosted by Fishtown Analytics.

In this talk, Ben Balde dives into:
- how to use Docker to host a dbt image shared by users/systems
- how running dbt in a container has made it easier for Simply Business to onboard new users
- how to create a robust Jenkins CI/CD pipeline by building automated dbt tests that run against production data and reduce the risk of broken code reaching the production database

Speaker:
Ben Balde, Data Engineer, Simply Business

Slides:
https://bit.ly/3ejLEYG


Register for Coalesce 2021—The Analytics Engineering Conference (it’s free to attend!): https://bit.ly/Coalesce2021YT


Content

So today I'll be talking about CI/CD pipelines using dbt, Docker, and Jenkins.

A little bit of history from us at Simply Business: we've been using dbt for just over two years now. dbt is pretty much at the center of our mission to build a single source of truth as we expose data to systems and users across the business. Thanks to the hard work of the team, we've significantly reduced the number of times people ask questions like "why is this number different?" or "why doesn't this data match the system?" We believe in empowering users to own and explore their data, and as our user base has been growing, it now includes engineers, analysts, and finance users.

With great growth also comes great responsibility. Today I'll be talking about how we combined dbt and Docker to address the challenges we faced, and how our solution made it easier for users to utilise dbt. We'll also talk about our journey with Jenkins, the problems we faced, and how we implemented a solution which made our deployment process more robust. I'll also touch on what's coming down the line at Simply Business in terms of this project.
One area we identified as needing improvement was user setup. Two years ago our plan was to make the installation process as easy and seamless as possible. If we picture the journey of a new dbt user at Simply Business, they would start by installing dbt locally on their machine. They also have to make sure that all the dependencies required to run dbt, such as Python, are installed correctly. This process is well documented, but as you can imagine, it's sometimes daunting for someone who is not very technical to use the command line to run a bunch of scripts they probably don't even understand. If there is a problem midway through the installation, most of the time they don't know how to fix it, and even then, if someone comes in to help, that person needs to work out whether all the steps were followed and all the dependencies installed. So this whole process takes time just to get the tool up and running.
We also saw situations where users had difficulties upgrading dbt. For example, they didn't know whether to use pip or brew when upgrading, and sometimes they didn't realise they had also upgraded Python to a version that wasn't compatible with dbt, so they had to go through the process of reversing that change in order to use the tool. As you can see, our process can be very smooth if users follow the documentation and know what they're doing, but it can also be a little bumpy. We thought there must be a better way to make the whole process more user-friendly and the tool easier to use. So we sat in our special virtual room looking for an answer, and the answer we came up with was Docker.
So what is Docker? Docker is a tool designed to make it easier to create, deploy, and run applications. An important element of Docker is images: images are immutable files that contain all the necessary parts to run an application, such as source code, libraries, and other dependencies. Then there are containers: containers are running instances of Docker images. One thing to note here is that containers require images to run; they depend on those images and use them to construct a runtime environment for the application. So, as a reminder: images can exist without containers, but a container needs an image to exist.
So what are the benefits of using Docker containers? They are portable: they are available from pretty much anywhere and can quickly be shared. They are reliable: because containers operate in an isolated environment, everything remains pretty much consistent between users. And they are lightweight: they share the host machine's OS kernel, which means, for example, that they use less memory.
Going through our project: our objective was to dockerise dbt by making use of a common image which could be shared between users and systems, and we planned to make use of containers to run dbt in reliable environments. So how did we achieve this?
We started by creating a Docker image which contains the dbt infrastructure. In this image we installed dbt and all the other dependencies, such as dbt helper packages, Python, and, in our case, a Snowflake connector. Once we have the image packaged and ready to be shared, we publish it to AWS, where it can then be accessed by our users.
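As a rough sketch, such an image could be defined with a Dockerfile along these lines (the base image, package choices, and paths are illustrative assumptions, not Simply Business's actual setup):

```dockerfile
# Hypothetical sketch of a shared "dbt infrastructure" image.
FROM python:3.8-slim

# dbt plus the Snowflake adapter (the adapter pulls in the
# Snowflake connector as a dependency).
RUN pip install --no-cache-dir dbt-snowflake

# Helper packages (e.g. dbt-utils) are declared in the project's
# packages.yml and installed with `dbt deps` inside the container.
WORKDIR /usr/app
CMD ["bash"]
```

The image would then be built and pushed to a registry such as AWS ECR, from where users and systems can pull it.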
The benefit of using a Docker image to hold the infrastructure is that, as a team, we can control any future upgrades to dbt, or roll out new features only once they're in a stable position, and we can test all of this in an isolated environment.
This slide goes through the setup of local development. From the previous slide, we've built our dbt infrastructure and it's available in AWS, so now let's focus on the users and how they would work with dbt locally. The process is very simple, as you're going to see. Users start by running a script called pull-dbt. This script connects to AWS and pulls down the latest infrastructure image, which then gets stored locally. Once they have the latest image, they run a start-dbt script, which uses that same image to start a Docker container with dbt and the complete environment. Inside it, users can access the models, the macros, and the seeds, and do pretty much everything they would do outside the container: run the models, test the models, compile the code. But remember, they're operating in an isolated environment.
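A minimal sketch of what those two wrapper scripts might look like (the registry address, image name, and mount path are hypothetical; for illustration the functions print the docker commands they would run, so the sketch works without Docker installed):

```shell
#!/usr/bin/env bash
# Hypothetical ECR registry and image name -- substitute your own.
REGISTRY="123456789012.dkr.ecr.eu-west-1.amazonaws.com"
IMAGE="${REGISTRY}/dbt-infrastructure:latest"

# pull-dbt: fetch the latest shared image (first setup, or on upgrade).
pull_dbt() {
  echo "docker pull ${IMAGE}"
}

# start-dbt: start a container from that image, mounting the local dbt
# project so models, macros, and seeds are visible inside it.
start_dbt() {
  echo "docker run -it --rm -v ${PWD}:/usr/app ${IMAGE} bash"
}

pull_dbt
start_dbt
```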
So what did we achieve by doing this? We've simplified our setup process: for a user to use dbt, all they need to do is run two scripts. One script they run only when they first start using the tool or when an upgrade is required, and the second script they use whenever they need to access dbt.
So, we solved the problem. To summarise: we had an installation process which wasn't very user-friendly, and we solved this by condensing installation down to a single script which installs dbt. Then, by using containers, we made dbt portable, reliable, and lightweight, which covers all the benefits of a Docker container.
Now that we'd managed to make dbt easier to use, we went back to our special virtual room and started to think about what else we could improve, and what we decided to look at was our CI/CD pipeline, which runs on Jenkins. Jenkins is an open-source automation tool used to build, test, and deploy software continuously. And to give everyone a brief description, CI/CD means continuous integration and continuous deployment.
We use Jenkins at Simply Business, so we looked at this pipeline and asked: can we make this better? When reviewing the pipeline, the key thing we found was that sometimes broken code was being deployed to production, even with all the checks we had in place. Some of the reasons we found this was happening: users' tests were executed against a dev database, and users were only required to run tests on the models they changed, which meant that even if those models ran successfully locally, they could still fail when running with their dependencies. The problem with having broken code in production is that it prevents pretty much anyone else from deploying until that broken code is either fixed or reverted.
So, the solution. I'm sure everyone is excited to find out what it was: we introduced automated tests against cloned production data. How did we achieve this?
At the center of our solution is a Python script. What does this script do? It goes through logical steps to determine, first, whether there are changes made by the user and, second, whether those changes will work in production. It starts by comparing all the changes made in the dev branch against the main branch, using a git command called git diff. At this point we get pretty much all the changes within that branch for that user. That's good, but not everything will be relevant for our tests, so what we do here is condense that list down to only model and seed files: in the filter we include a condition which keeps only SQL and CSV files.
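The diff-and-filter step can be sketched in a few lines of Python (file paths here are made up for illustration; in the real script the list would come from running `git diff --name-only` against the main branch):

```python
# Sketch of the CI script's first step: filter a git-diff file list
# down to dbt models (.sql) and seeds (.csv).

def relevant_changes(changed_files):
    """Keep only model and seed files from a changed-file list."""
    return [f for f in changed_files if f.endswith((".sql", ".csv"))]

# Hypothetical output of `git diff --name-only main...HEAD`:
changed = [
    "models/finance/revenue.sql",
    "data/country_codes.csv",
    "README.md",
]
print(relevant_changes(changed))  # README.md is filtered out
```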
So now, great, we've effectively got the list of model and seed files that have changed: they exist in the dev branch but not in master. Once we have that list of changed files and models, we implement some further logic to determine the dependencies. If you remember, the reason we do this is that even if a model runs successfully on its own, it can still fail through its dependencies. To get these we use dbt list, which first looks at the model being changed and then pulls in one parent dependency and one child dependency. At this point you might be asking yourselves: why did we decide on only one dependency either way? The reason is performance. We tried it with multiple levels of dependencies and realised it affects performance, and one level works for us while still doing the testing. One thing we've noticed as well is that sometimes the child model has a dependency on another parent model, and for this we've included some further logic which checks whether that child model depends on another parent; if it does, it gets added to the list, and we use this list later on.
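A sketch of how that selection might be expressed (model names are hypothetical). dbt's graph selectors can limit traversal depth, so "one parent and one child" of a model can be written as `1+model+1`:

```python
# Sketch of the dependency step of the CI script.

def selector(model_name):
    """Selector for a model plus its direct parents and children."""
    return f"1+{model_name}+1"

def dbt_ls_command(model_name):
    # In the real pipeline this command would be executed (e.g. via
    # subprocess) and its output parsed into the models to build/test.
    return ["dbt", "ls", "--select", selector(model_name)]

def add_missing_parents(models, parents_of):
    """If a selected child depends on a parent outside the list, add it."""
    out = list(models)
    for model in models:
        for parent in parents_of.get(model, []):
            if parent not in out:
                out.append(parent)
    return out

print(dbt_ls_command("revenue"))
print(add_missing_parents(["revenue", "revenue_by_region"],
                          {"revenue_by_region": ["fx_rates"]}))
```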
The next step: based on that list of models and dependencies, we've got everything lined up. The script runs a dbt run-operation that creates an empty database, dynamically named after the user's branch. In the example, if we have a git branch called demo_dbt, a clone database called clone_demo_dbt gets created, so users know where to look if they have to debug; it's pretty easy to find. So it first creates the clone database, which is empty, and then it takes the list of objects we built previously and uses it to create a copy from production into that database. One thing we've done here is that after seven days the clone database is automatically dropped, so we don't store unnecessary clone databases. We're pretty lucky at Simply Business that we're using Snowflake, which provides zero-copy clone functionality out of the box, but if you're using another database you can probably achieve this in a different way, maybe by having views on top of the production database.
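To make the clone step concrete, here is a minimal Python sketch of how the statements could be generated (database, schema, and branch names are hypothetical; the talk drives this through a dbt run-operation macro, and the seven-day automatic drop would be handled separately, e.g. by a scheduled task):

```python
# Sketch of the clone step: an empty database named after the git
# branch, then Snowflake zero-copy clones of the needed objects.

def clone_db_name(branch):
    """Derive the clone database name from the git branch name."""
    return "clone_" + branch.lower().replace("-", "_")

def clone_statements(branch, objects, prod_db="prod"):
    db = clone_db_name(branch)
    stmts = [f"create database if not exists {db};"]
    for schema, table in objects:
        stmts.append(f"create schema if not exists {db}.{schema};")
        stmts.append(
            f"create table {db}.{schema}.{table} "
            f"clone {prod_db}.{schema}.{table};"
        )
    return stmts

for stmt in clone_statements("demo_dbt", [("analytics", "revenue")]):
    print(stmt)
```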
The final step of this test process: we use the Python script to generate a list of dbt commands, which are then used to perform a dbt run and a dbt test against the clone database. By doing this, we pretty much make sure that whatever the user is trying to run will also work in production.
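A minimal sketch of that command-generation step (the `ci_clone` target name is a hypothetical dbt profile target pointing at the clone database, and the flag spellings follow the current dbt CLI):

```python
# Sketch of the final step: emit a dbt run followed by a dbt test
# over the selected models, aimed at the clone database target.

def build_commands(models, target="ci_clone"):
    selection = " ".join(models)
    return [
        f"dbt run --select {selection} --target {target}",
        f"dbt test --select {selection} --target {target}",
    ]

for cmd in build_commands(["stg_payments", "revenue"]):
    print(cmd)
```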
Now that I've gone through the steps, putting the whole script together, this is pretty much what you get. Just to recap: we first detect the changes on the development branch, we detect the model dependencies, we clone the database objects, and we perform a dbt run and a dbt test. If successful, the user is then able to deploy to production; if not, the user can go into the clone database, check the data, and try to debug that way.
So what did we achieve by doing this? Fewer manual tests for users. Tests now run against production data, which pretty much means that if they pass at this point, they should run in production without any issues. The risk of broken code reaching production is simply reduced.
So what's next? How does the future look for us? We hope that by using Docker we will be able to split dbt into smaller projects, which will mean the tool performs better and faster, and we can also tailor projects to specific departments. If you can imagine, finance and marketing could each have their own little dbt project which they look after. We're also hoping to develop a full blue-green deployment pipeline at some point: by having two production databases running side by side, for example, we can make deployments run even faster.
Thank you very much, and I'll open the floor for questions at this point.
Awesome, thank you so much, Ben. One from me: I was getting the impression that your organisation is pretty large, because I was reasoning about why it would make sense to solve for people getting ramped up on dbt as quickly as possible, like installing it on their computers, and it totally makes sense when you have a team of that scale. I'm curious what you were doing before. You described the pain briefly, but what was it like prior to completing this work? What did that look like?
Sure. So we had documentation, a pretty much step-by-step guide on what to do once you start at Simply Business if you're required to use dbt, and it's pretty useful: it guides you step by step, and you pretty much have to copy and paste those commands into the terminal. When it works, it's perfect and very straightforward. The problem starts when you don't have something installed on your machine, or you have a version which is not compatible, and then the errors appear. For someone who is not very technical, for example when we started onboarding people in finance and other departments, they just don't know what to do. That's the reason we tried to simplify the process as much as we could. Even though engineers can probably figure out what the error is telling them and what else to check, most other folks just need to ask for help, and that's time-consuming for everyone, because they can't use the tool and we need to figure out what they've done wrong, if they've done anything wrong at all.
It sounds like it makes your life a lot easier. And as someone who came down that same path of being less technical and figuring out the command line, it's definitely something I would have appreciated. We've got another question; I'm going to hand it over to Martina to ask it out loud. Martina, go ahead and ask your question.
Yeah, I was just wondering whether this will also work for Windows users, and whether you have to build another Docker image, or how that would work. Will it work?
It should work with Docker for Windows users. At Simply Business we're not quite there at the moment: only Mac users are using Docker. It's one of the key requirements, so they need to have Docker installed and access to AWS to pull the image down. It is possible, but it depends on your infrastructure and your company. At Simply Business we're not quite there yet, but that's the next step, especially as we're looking to onboard more people. We did ask the question, and the infosec team went away to think about how they would implement this on Windows machines. They said it is quite possible, but they have security concerns which they're going to think about how to resolve.
Thank you. Yeah, that sounds awesome though. All right, thank you.
One last naive question, just because this is my first time at the London meetup: how long has your team, how long has Simply Business been using dbt?

It's been just over two years, so even before I joined the company they were already using dbt. For me, I've been with the company for just over a year, almost a year and a half, and I'd never used dbt before, but I think the tool is amazing, especially because it empowers analysts and other users to own their own data, which frees up time for engineers to focus on other things.

Source: https://www.youtube.com/watch?v=B_EpbM8XBJw