Machine Learning for Everybody – Full Course

	Aug 15, 2023

Learn Machine Learning in a way that is accessible to absolute beginners. You will learn the basics of Machine Learning and how to use TensorFlow to implement many different concepts.
✏️ Kylie Ying developed this course. Check out her channel:    / ycubed
⭐️ Code and Resources ⭐️
🔗 Supervised learning (classification/MAGIC): https://colab.research.google.com/dri …
🔗 Supervised learning (regression/bikes): https://colab.research.google.com/dri …
🔗 Unsupervised learning (seeds): https://colab.research.google.com/dri …
🔗 Datasets (note: for the bikes dataset, you may have to open the downloaded csv file and remove special characters)
🔗 MAGIC dataset: https://archive.ics.uci.edu/ml/datase …
🔗 Bikes dataset: https://archive.ics.uci.edu/ml/datase …
🔗 Seeds/wheat dataset: https://archive.ics.uci.edu/ml/datase …
🏗 Google provided a grant to make this course possible.
⭐️ Contents ⭐️
⌨️ (0:00:00) Intro
⌨️ (0:00:58) Data/Colab Intro
⌨️ (0:08:45) Intro to Machine Learning
⌨️ (0:12:26) Features
⌨️ (0:17:23) Classification/Regression
⌨️ (0:19:57) Training Model
⌨️ (0:30:57) Preparing Data
⌨️ (0:44:43) K-Nearest Neighbors
⌨️ (0:52:42) KNN Implementation
⌨️ (1:08:43) Naive Bayes
⌨️ (1:17:30) Naive Bayes Implementation
⌨️ (1:19:22) Logistic Regression
⌨️ (1:27:56) Log Regression Implementation
⌨️ (1:29:13) Support Vector Machine
⌨️ (1:37:54) SVM Implementation
⌨️ (1:39:44) Neural Networks
⌨️ (1:47:57) Tensorflow
⌨️ (1:49:50) Classification NN using Tensorflow
⌨️ (2:10:12) Linear Regression
⌨️ (2:34:54) Lin Regression Implementation
⌨️ (2:57:44) Lin Regression using a Neuron
⌨️ (3:00:15) Regression NN using Tensorflow
⌨️ (3:13:13) K-Means Clustering
⌨️ (3:23:46) Principal Component Analysis
⌨️ (3:33:54) K-Means and PCA Implementations
🎉 Thanks to our Champion and Sponsor supporters:
👾 Raymond Odero
👾 Agustín Kussrow
👾 aldo ferretti
👾 Otis Morgan
👾 DeezMaster
Learn to code for free and get a developer job: https://www.freecodecamp.org
Read hundreds of articles on programming: https://freecodecamp.org/news
                    Content 
                    0 ->  Kylie Ying has worked at many interesting places such as MIT, CERN, and Free Code Camp.
6 ->  She's a physicist, engineer, and basically a genius. And now she's going to teach you
10.88 ->  about machine learning in a way that is accessible to absolute beginners.
15.28 ->  What's up you guys? So welcome to Machine Learning for Everyone. If you are someone who
21.6 ->  is interested in machine learning and you think you are considered as everyone, then this video
27.52 ->  is for you. In this video, we'll talk about supervised and unsupervised learning models,
33.04 ->  we'll go through maybe a little bit of the logic or math behind them, and then we'll also see how
39.2 ->  we can program it on Google Colab. If there are certain things that I have done, and you know,
46.96 ->  you're somebody with more experience than me, please feel free to correct me in the comments
50.96 ->  and we can all as a community learn from this together. So with that, let's just dive right in.
58 ->  Without wasting any time, let's just dive straight into the code and I will be teaching you guys
62.16 ->  concepts as we go. So this here is the UCI machine learning repository. And basically,
71.04 ->  they just have a ton of data sets that we can access. And I found this really cool one called
75.28 ->  the magic gamma telescope data set. So in this data set, if you want to read all this information,
82.56 ->  to summarize what I think is going on, there's this gamma telescope, and we have all
88.32 ->  these high energy particles hitting the telescope. Now there's a camera, there's a detector that
94.24 ->  actually records certain patterns of you know, how this light hits the camera. And we can use
100.4 ->  properties of those patterns in order to predict what type of particle caused that radiation. So
106.64 ->  whether it was a gamma particle, or some other one, like a hadron. Down here, these are all of
114.88 ->  the attributes of those patterns that we collect in the camera. So you can see that there's, you
120 ->  know, some length, width, size, asymmetry, etc. Now we're going to use all these properties to
126.48 ->  help us discriminate the patterns and whether or not they came from a gamma particle or hadron.
133.2 ->  So in order to do this, we're going to come up here, go to the data folder. And you're going
139.52 ->  to click this magic04.data file, and we're going to download that. Now over here, I have a colab
148.24 ->  notebook open. So you go to colab dot research dot google.com, you start a new notebook. And
154.32 ->  I'm just going to call this the magic data set. So actually, I'm going to call this the FreeCodeCamp
163.12 ->  magic example. Okay. So with that, I'm going to first start with some imports. So I will import,
172.24 ->  you know, I always import NumPy, I always import pandas. And I always import matplotlib.
186.08 ->  And then we'll import other things as we go. So yeah,
194.08 ->  we run that in order to run the cell, you can either click this play button here, or you can
199.2 ->  on my computer, it's just shift enter and that will run the cell. And here, I'm just going
204.32 ->  to, you know, let you guys know, okay, this is where I found the data set.
210 ->  So I've copied and pasted this actually, but this is just where I found the data set.
215.2 ->  And in order to import that downloaded file that we got from the computer, we're going to go
220.64 ->  over here to this folder thing. And I am literally just going to drag and drop that file into here.
230.8 ->  Okay. So in order to take a look at, you know, what does this file consist of,
235.84 ->  do we have the labels? Do we not? I mean, we could open it on our computer, but we can also just do
240.96 ->  pandas read CSV. And we can pass in the name of this file.
246.64 ->  And let's see what it returns. So it doesn't seem like we have the label. So let's go back to here.
256.16 ->  I'm just going to make the columns, the column labels, all of these attribute names over here.
263.6 ->  So I'm just going to take these values and make that the column names.
269.12 ->  All right, how do I do that? So basically, I will come back here, and I will create a list called
276.08 ->  cols. And I will type in all of those things: fLength, fWidth, fSize, fConc. And we also have fConc1.
290.56 ->  We have fAsym, fM3Long, fM3Trans, fAlpha. Let's see, we have fDist and class.
309.84 ->  Okay, great. Now in order to label those as these columns down here in our data frame.
316.64 ->  So basically, this command here just reads some CSV file that you pass in. CSV stands for comma
322.88 ->  separated values, and turns that into a pandas data frame object. So now if I pass in a names here,
331.52 ->  then it basically assigns these labels to the columns of this data set. So I'm going to set
338.8 ->  this data frame equal to DF. And then if we call the head, that's just like, give me the first five things.
344.96 ->  Now you'll see that we have labels for all of these. Okay.
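As a rough sketch of the steps described so far (assuming the downloaded file is named magic04.data and sits next to the notebook; the column names come from the dataset's description on the UCI page), the code looks something like this:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Attribute names listed on the UCI page for the MAGIC gamma telescope data
    cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
            "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

    # The raw file has no header row, so attach the column names ourselves
    df = pd.read_csv("magic04.data", names=cols)
    df.head()  # show the first five rows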
352 ->  All right, great. So one thing that you might notice is that over here, the class labels,
357.52 ->  we have G and H. So if I actually go down here, and I do data frame class unique,
367.2 ->  you'll see that I have either G's or H's, and these stand for gammas or hadrons.
371.52 ->  And our computer is not so good at understanding letters, right? Our computer is really good at
377.44 ->  understanding numbers. So what we're going to do is we're going to convert this to zero for G and
383.28 ->  one for H. So here, I'm going to set this equal to this, whether or not that equals G. And then
395.68 ->  I'm just going to say as type int. So what this should do is convert this entire column,
403.36 ->  if it equals G, then this is true. So I guess that would be one. And then if it's H, it would
408.72 ->  be false. So that would be zero, but I'm just converting G and H to one and zero, it doesn't
412.8 ->  really matter. Like, if G is one and H is zero or vice versa. Let me just take a step back right
422.24 ->  now and talk about this data set. So here I have some data frame, and I have all of these different
429.44 ->  values for each entry. Now this is a you know, each of these is one sample, it's one example,
438.24 ->  it's one item in our data set, it's one data point, all of these things are kind of the same
443.2 ->  thing when I mentioned, oh, this is one example, or this is one sample or whatever. Now, each of
449.12 ->  these samples, they have, you know, one quality for each or one value for each of these labels
456.24 ->  up here, and then it has the class. Now what we're going to do in this specific example is try to
461.6 ->  predict for future, you know, samples, whether the class is G for gamma or H for hadron. And
470.8 ->  that is something known as classification. Now, all of these up here, these are known as our features,
480.32 ->  and features are just things that we're going to pass into our model in order to help us predict
485.76 ->  the label, which in this case is the class column. So for you know, sample zero, I have
494.24 ->  10 different features. So I have 10 different values that I can pass into some model.
499.52 ->  And I can spit out, you know, the class the label, and I know the true label here is G. So this is
506.72 ->  this is actually supervised learning. All right. So before I move on, let me just give you a quick
515.44 ->  little crash course on what I just said. This is machine learning for everyone. Well, the first
523.36 ->  question is, what is machine learning? Well, machine learning is a sub domain of computer science
529.76 ->  that focuses on certain algorithms, which might help a computer learn from data, without a
536 ->  programmer being there telling the computer exactly what to do. That's what we call explicit
541.36 ->  programming. So you might have heard of AI and ML and data science, what is the difference between
548.48 ->  all of these. So AI is artificial intelligence. And that's an area of computer science, where the
554.72 ->  goal is to enable computers and machines to perform human like tasks and simulate human behavior.
563.6 ->  Now machine learning is a subset of AI that tries to solve one specific problem and make predictions
571.6 ->  using certain data. And data science is a field that attempts to find patterns and draw insights
579.84 ->  from data. And that might mean we're using machine learning. So all of these fields kind of overlap,
585.84 ->  and all of them might use machine learning. So there are a few types of machine learning.
592.56 ->  The first one is supervised learning. And in supervised learning, we're using labeled inputs.
598.4 ->  So this means whatever input we get, we have a corresponding output label, in order to train
605.36 ->  models and to learn outputs of different new inputs that we might feed our model. So for example,
612.96 ->  I might have these pictures, okay, to a computer, all these pictures are pixels, they're pixels
619.04 ->  with a certain color. Now in supervised learning, all of these inputs have a label associated with
627.44 ->  them, this is the output that we might want the computer to be able to predict. So for example,
632.88 ->  over here, this picture is a cat, this picture is a dog, and this picture is a lizard.
641.6 ->  Now there's also unsupervised learning. And in unsupervised learning, we use unlabeled data
647.84 ->  to learn about patterns in the data. So here are my input data points. Again, they're just
657.92 ->  images, they're just pixels. Well, okay, let's say I have a bunch of these different pictures.
665.76 ->  And what I can do is I can feed all these to my computer. And I might not, you know,
669.92 ->  my computer is not going to be able to say, Oh, this is a cat, dog and lizard in terms of,
674.48 ->  you know, the output. But it might be able to cluster all these pictures, it might say,
679.68 ->  Hey, all of these have something in common. All of these have something in common. And then these
686.08 ->  down here have something in common, that's finding some sort of structure in our unlabeled data.
693.68 ->  And finally, we have reinforcement learning. And reinforcement learning. Well, they usually
700.16 ->  there's an agent that is learning in some sort of interactive environment, based on rewards and
706.48 ->  penalties. So let's think of a dog, we can train our dog, but there's not necessarily, you know,
714.72 ->  any wrong or right output at any given moment, right? Well, let's pretend that dog is a computer.
723.6 ->  Essentially, what we're doing is we're giving rewards to our computer, and telling our computer,
728.24 ->  Hey, this is probably something good that you want to keep doing. Well, computer or agent, that's just terminology.
736.88 ->  But in this class today, we'll be focusing on supervised learning and unsupervised learning
741.76 ->  and learning different models for each of those. Alright, so let's talk about supervised learning
749.12 ->  first. So this is kind of what a machine learning model looks like you have a bunch of inputs
755.12 ->  that are going into some model. And then the model is spitting out an output, which is our prediction.
761.92 ->  So all these inputs, this is what we call the feature vector. Now there are different types
768.4 ->  of features that we can have, we might have qualitative features. And qualitative means
773.92 ->  categorical data, there's either a finite number of categories or groups. So one example of a
781.36 ->  qualitative feature might be gender. And in this case, there's only two here, it's for the sake of
787.44 ->  the example, I know this might be a little bit outdated. Here we have a girl and a boy, there are
793.2 ->  two genders, there are two different categories. That's a piece of qualitative data. Another
799.84 ->  example might be okay, we have, you know, a bunch of different nationalities, maybe a nationality or
805.6 ->  a nation or a location, that might also be an example of categorical data. Now, in both of
813.28 ->  these, there's no inherent order. It's not like, you know, we can rate the US one and France two, Japan
823.2 ->  three, etc. Right? There's not really any inherent order built into either of these categorical
831.84 ->  data sets. That's why we call this nominal data. Now, for nominal data, the way that we want
840.24 ->  to feed it into our computer is using something called one hot encoding. So let's say that, you
846.64 ->  know, I have a data set, some of the items in our data, some of the inputs might be from the US,
853.12 ->  some might be from India, then Canada, then France. Now, how do we get our computer to recognize that
859.2 ->  we have to do something called one hot encoding. And basically, one hot encoding is saying, okay,
864.56 ->  well, if it matches some category, make that a one. And if it doesn't just make that a zero.
871.12 ->  So for example, if your input were from the US, you might have 1, 0, 0, 0. India, you know,
880.16 ->  0, 1, 0, 0. Canada, okay, well, the item representing Canada is one, and then France, the item representing
886.88 ->  France is one. And then you can see that the rest are zeros, that's one hot encoding.
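For reference, here is a minimal sketch of one-hot encoding in pandas, using a made-up "country" column (the column name and values are just for illustration):

    import pandas as pd

    example = pd.DataFrame({"country": ["US", "India", "Canada", "France"]})
    # Each category becomes its own 0/1 column
    one_hot = pd.get_dummies(example["country"], dtype=int)
    print(one_hot)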
894.48 ->  Now, there are also a different type of qualitative feature. So here on the left,
900.48 ->  there are different age groups, there's babies, toddlers, teenagers, young adults,
908.64 ->  adults, and so on, right. And on the right hand side, we might have different ratings. So maybe
915.84 ->  bad, not so good, mediocre, good, and then like, great. Now, these are known as ordinal pieces of
926.16 ->  data, because they have some sort of inherent order, right? Like, being a toddler is a lot closer to
933.6 ->  being a baby than being an elderly person, right? Or good is closer to great than it is to really
941.68 ->  bad. So these have some sort of inherent ordering system. And so for these types of data sets,
948.56 ->  we can actually just mark them from, you know, one to five, or we can just say, hey, for each of these,
954.4 ->  let's give it a number. And this makes sense. Because, like, for example, the thing that I
962.96 ->  just said, how good is closer to great than good is close to not good at all. Well, four is closer
969.76 ->  to five than four is close to one. So this actually kind of makes sense. And it'll make sense for the
974.56 ->  computer as well. Alright, there are also quantitative pieces of data and quantitative
982.96 ->  pieces of data are numerical valued pieces of data. So this could be discrete, which means,
989.04 ->  you know, they might be integers, or it could be continuous, which means all real numbers.
994.16 ->  So for example, the length of something is a quantitative piece of data, it's a quantitative
1000.8 ->  feature, the temperature of something is a quantitative feature. And then maybe how many
1006.56 ->  Easter eggs I collected in my basket, this Easter egg hunt, that is an example of discrete quantitative
1013.68 ->  feature. Okay, so these are continuous. And this over here is discrete. So those are the things
1022.08 ->  that go into our feature vector, those are our features that we're feeding this model, because
1028.4 ->  our computers are really, really good at understanding math, right at understanding numbers,
1034.8 ->  they're not so good at understanding things that humans might be able to understand.
1041.76 ->  Well, what are the types of predictions that our model can output? So in supervised learning,
1049.68 ->  there are some different tasks, there's one classification, and basically classification,
1055.44 ->  just saying, okay, predict discrete classes. And that might mean, you know, this is a hot dog,
1062.8 ->  this is a pizza, and this is ice cream. Okay, so there are three distinct classes and any other
1068.64 ->  pictures of hot dogs, pizza or ice cream, I can put under these labels. Hot dog, pizza, ice cream.
1076.48 ->  Hot dog, pizza, ice cream. This is something known as multi class classification. But there's also
1083.44 ->  binary classification. And binary classification, you might have hot dog, or not hot dog. So there's
1090.64 ->  only two categories that you're working with: something that is something, and something that
1094.24 ->  isn't. That's binary classification. Okay, so yeah, other examples. So if something has positive or negative
1103.68 ->  sentiment, that's binary classification. Maybe you're predicting your pictures of their cats or
1108.96 ->  dogs. That's binary classification. Maybe, you know, you are writing an email filter, and you're
1115.04 ->  trying to figure out if an email spam or not spam. So that's also binary classification.
1121.76 ->  Now for multi class classification, you might have, you know, cat, dog, lizard, dolphin, shark,
1126.96 ->  rabbit, etc. We might have different types of fruits like orange, apple, pear, etc. And then
1133.52 ->  maybe different plant species. But multi class classification just means more than two. Okay,
1139.44 ->  and binary means we're predicting between two things. There's also something called regression
1146.32 ->  when we talk about supervised learning. And this just means we're trying to predict continuous
1151.36 ->  values. So instead of just trying to predict different categories, we're trying to come up
1155.76 ->  with a number that you know, is on some sort of scale. So some examples. So some examples might
1164.4 ->  be the price of Ethereum tomorrow, or it might be okay, what is going to be the temperature?
1171.76 ->  Or it might be what is the price of this house? Right? So these things don't really fit into
1177.44 ->  discrete classes. We're trying to predict a number that's as close to the true value as possible
1183.92 ->  using different features of our data set. So that's exactly what our model looks like in
1191.76 ->  supervised learning. Now let's talk about the model itself. How do we make this model learn?
1199.92 ->  Or how can we tell whether or not it's even learning? So before we talk about the models,
1205.68 ->  let's talk about how can we actually like evaluate these models? Or how can we tell
1210.32 ->  whether something is a good model or bad model? So let's take a look at this data set. So this data
1219.04 ->  set, this is from the Pima Indians diabetes data set. And here we have different
1226.64 ->  number of pregnancies, different glucose levels, blood pressure, skin thickness, insulin, BMI,
1232.64 ->  age, and then the outcome whether or not they have diabetes one for they do zero for they don't.
1237.52 ->  So here, all of these are quantitative features, right, because they're all on some scale.
1248.72 ->  So each row is a different sample in the data. So it's a different example, it's one person's data,
1256.16 ->  and each row represents one person in this data set. Now this column, each column represents a
1264.24 ->  different feature. So this one here is some measure of blood pressure levels. And this one
1271.6 ->  over here, as we mentioned is the output label. So this one is whether or not they have diabetes.
1279.04 ->  And as I mentioned, this is what we would call a feature vector, because these are all of our
1283.76 ->  features in one sample. And this is what's known as the target, or the output for that feature
1293.52 ->  vector. That's what we're trying to predict. And all of these together is our features matrix x.
1302.64 ->  And over here, this is our labels or targets vector y. So I've condensed this to a chocolate
1311.92 ->  bar to kind of talk about some of the other concepts in machine learning. So over here,
1318 ->  we have our x, our features matrix, and over here, this is our label y. So each row of this
1328.16 ->  will be fed into our model, right. And our model will make some sort of prediction. And what we do
1335.2 ->  is we compare that prediction to the actual value of y that we have in our label data set, because
1341.92 ->  that's the whole point of supervised learning is we can compare what our model is outputting to,
1346.96 ->  oh, what is the truth, actually, and then we can go back and we can adjust some things. So the next
1351.92 ->  iteration, we get closer to what the true value is. So that whole process here, the tinkering that,
1361.04 ->  okay, what's the difference? Where did we go wrong? That's what's known as training the model.
1367.68 ->  Alright, so take this whole, you know, chunk right here, do we want to really put our entire
1374.08 ->  chocolate bar into the model to train our model? Not really, right? Because if we did that, then
1382.32 ->  how do we know that our model can do well on new data that we haven't seen? Like, if I were to
1390.24 ->  create a model to predict whether or not someone has diabetes, let's say that I just train all my
1398 ->  data, and I see that all my training data does well, I go to some hospital, I'm like, here's my
1403.12 ->  model. I think you can use this to predict if somebody has diabetes. Do we think that would
1408.56 ->  be effective or not? Probably not, right? Because we haven't assessed how well our model can
1421.04 ->  generalize. Okay, it might do well after you know, our model has seen this data over and over and
1426.88 ->  over again. But what about new data? Can our model handle new data? Well, how do we how do we get our
1434.96 ->  model to assess that? So we actually break up our whole data set that we have into three different
1442.32 ->  types of data sets, we call it the training data set, the validation data set and the testing data
1447.76 ->  set. And you know, you might have 60% here 20% and 20% or 80 10 and 10. It really depends on how
1455.76 ->  much data you have, I think either of those would be acceptable. So what we do is then we feed
1462 ->  the training data set into our model, we come up with, you know, this might be a vector of predictions
1468.96 ->  corresponding with each sample that we put into our model, we figure out, okay, what's the difference
1476.08 ->  between our prediction and the true values, this is something known as loss. Loss is, you know,
1482.88 ->  what's the difference here, in some numerical quantity, of course. And then we make adjustments,
1490.08 ->  and that's what we call training. Okay. So then, once you know, we've made a bunch of adjustments,
1498.48 ->  we can put our validation set through this model. And the validation set is kind of used as a reality
1506 ->  check during or after training to ensure that the model can handle unseen data still. So every
1514.56 ->  single time after we train one iteration, we might stick the validation set in and see, hey, what's
1519.6 ->  the loss there. And then after our training is over, we can assess the validation set and ask,
1525.68 ->  hey, what's the loss there. But one key difference here is that we don't have that training step,
1532.4 ->  this loss never gets fed back into the model, right, that feedback loop is not closed.
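As one generic way to get roughly a 60/20/20 split, here is a sketch using scikit-learn's train_test_split on the data frame from earlier (note that later in this walkthrough the split is actually done with numpy instead):

    from sklearn.model_selection import train_test_split

    # Carve off 40% of the rows, then cut that 40% in half:
    # 60% train, 20% validation, 20% test.
    train, rest = train_test_split(df, test_size=0.4, random_state=0)
    valid, test = train_test_split(rest, test_size=0.5, random_state=0)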
1538.8 ->  Alright, so let's talk about loss really quickly. So here, I have four different types of models,
1545.92 ->  I have some sort of data that's being fed into the model, and then some output. Okay, so this output
1552.96 ->  here is pretty far from you know, this truth that we want. And so this loss is going to be high. In
1562.72 ->  model B, again, this is pretty far from what we want. So this loss is also going to be high,
1567.84 ->  let's give it 1.5. Now this one here, it's pretty close, I mean, maybe not almost, but pretty close
1575.76 ->  to this one. So that might have a loss of 0.5. And then this one here is maybe further than this,
1583.84 ->  but still better than these two. So that loss might be 0.9. Okay, so which of these models
1590.32 ->  performs the best? Well, model C has the smallest loss, so it's probably model C. Okay, now let's
1600.08 ->  take model C. After you know, we've come up with these, all these models, and we've seen, okay, model
1605.68 ->  C is probably the best model. We take model C, and we run our test set through this model. And this
1612.88 ->  test set is used as a final check to see how generalizable that chosen model is. So if I,
1620.72 ->  you know, finish training my diabetes data set, then I could run it through some chunk of the
1625.68 ->  data and I can say, oh, like, this is how we perform on data that it's never seen before at
1631.52 ->  any point during the training process. Okay. And that loss, that's the final reported performance
1639.6 ->  of my test set, or this would be the final reported performance of my model. Okay.
1649.28 ->  So let's talk about this thing called loss, because I think I kind of just glossed over it,
1654.88 ->  right? So loss is the difference between your prediction and the actual, like, label.
1663.2 ->  So this would give a slightly higher loss than this. And this would even give a higher loss,
1670.64 ->  because it's even more off. In computer science, we like formulas, right? We like formulaic ways
1677.6 ->  of describing things. So here are some examples of loss functions and how we can actually come
1683.28 ->  up with numbers. This here is known as L one loss. And basically, L one loss just takes the
1690.16 ->  absolute value of whatever your you know, real value is, whatever the real output label is,
1698.64 ->  subtracts the predicted value, and takes the absolute value of that. Okay. So the absolute
1706.16 ->  value is a function that looks something like this. So the further off you are, the greater your loss is,
1715.52 ->  right in either direction. So if your real value is off from your predicted value by 10,
1722.48 ->  then your loss for that point would be 10. And then this sum here just means, hey,
1727.52 ->  we're taking all the points in our data set. And we're trying to figure out the sum of how far
1733.04 ->  everything is. Now, we also have something called L two loss. So this loss function is quadratic,
1741.6 ->  which means that if it's close, the penalty is very minimal. And if it's off by a lot,
1748.56 ->  then the penalty is much, much higher. Okay. And this instead of the absolute value, we just square
1755.84 ->  the difference between the two. Now, there's also something called binary cross entropy loss.
1766.96 ->  It looks something like this. And this is for binary classification, this might be the
1772.72 ->  loss that we use. So this loss, you know, I'm not going to really go through it too much.
1778.96 ->  But you just need to know that loss decreases as the performance gets better. So there are some
1787.84 ->  other measures of accuracy, or performance, as well. So for example, accuracy, what is accuracy?
1795.44 ->  So let's say that these are pictures that I'm feeding my model, okay. And these predictions
1802.56 ->  might be apple, orange, orange, apple, okay, but the actual is apple, orange, apple, apple. So
1812.24 ->  three of them were correct. And one of them was incorrect. So the accuracy of this model is
1817.68 ->  three quarters or 75%. Alright, coming back to our colab notebook, I'm going to close this a little
1825.6 ->  bit. Again, we've imported stuff up here. And we've already created our data frame right here. And
1833.04 ->  this is this is all of our data. This is what we're going to use to train our models. So down here,
1840.56 ->  again, if we now take a look at our data set, you'll see that our classes are now zeros and ones.
1849.04 ->  So now this is all numerical, which is good, because our computer can now understand that.
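The conversion described a moment ago is roughly this one line (a sketch; in the downloaded file the class labels are the lowercase letters g and h):

    # gamma ("g") becomes 1, hadron ("h") becomes 0
    df["class"] = (df["class"] == "g").astype(int)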
1853.12 ->  Okay. And you know, it would probably be a good idea to maybe kind of plot, hey, do these things
1860.72 ->  have anything to do with the class. So here, I'm going to go through all the labels. So for label
1870.24 ->  in the columns of this data frame. So this just gets me the list. Actually, we have the list,
1875.84 ->  right? It's called cols, so let's just use that, might be less confusing, of everything up to the last
1880.88 ->  thing, which is the class. So I'm going to take all these 10 different features. And I'm going
1886.56 ->  to plot them as a histogram. So basically, if I
1897.04 ->  take that data frame, and I say, okay, for everything where the class is equal to one, so these are all
1905.6 ->  of our gammas, remember, now, for that portion of the data frame, if I look at this label, so now
1915.28 ->  these, okay, what this part here is saying is, inside the data frame, get me everything where
1923.44 ->  the class is equal to one. So that's all all of these would fit into that category, right?
1929.12 ->  And now let's just look at the label column. So the first label would be f length, which would
1934.08 ->  be this column. So this command here is getting me all the different values that belong to class one
1940.48 ->  for this specific label. And that's exactly what I'm going to put into the histogram. And now I'm
1947.2 ->  just going to tell you know, matplotlib make the color blue, make this label this as you know, gamma
1957.04 ->  set alpha, why do I keep doing that, alpha equal to 0.7. So that's just like the transparency.
1963.28 ->  And then I'm going to set density equal to true, so that when we compare it to
1970 ->  the hadrons here, we'll have a baseline for comparing them. Okay, so the density being true
1976.96 ->  just basically normalizes these distributions. So you know, if you have 200 of one type,
1985.36 ->  and then 50 of another type, well, if you drew the histograms, it would be hard to compare because
1992.08 ->  one of them would be a lot bigger than the other, right. But by normalizing them, we kind of are
1997.6 ->  distributing them over how many samples there are. Alright, and then I'm just going to put a title
2004.24 ->  on here and make that the label, the y label. So because it's density, the y label is probability.
2012.8 ->  And the x label is just going to be the label.
2016.32 ->  What is going on. And I'm going to include a legend and PLT dot show just means okay, display
2024.64 ->  the plot. So if I run that, just be up to the last item. So we want a list, right, not just the last
2034.8 ->  item. And now we can see that we're plotting all of these. So here we have the length. Oh, and I
2042.24 ->  made this gamma. So this should be hadron. Okay, so the gammas in blue, the hadrons are in red. So
2051.2 ->  here we can already see that, you know, maybe if the length is smaller, it's probably more likely
2056.56 ->  to be gamma, right. And we can kind of you know, these all look somewhat similar. But here, okay,
2064.32 ->  clearly, if there's more asymmetry, or if you know, this asymmetry measure is larger, then it's
2074.64 ->  probably hadron. Okay, oh, this one's a good one. So f alpha seems like hadrons are pretty evenly
2084.48 ->  distributed. Whereas if this is smaller, it looks like there's more gammas in that area.
2088.96 ->  Okay, so this is kind of the data that we're working with, we can kind of see what's going on.
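A sketch of the histogram loop just described, assuming the df and cols from before:

    import matplotlib.pyplot as plt

    for label in cols[:-1]:  # every feature except the class column
        # Overlay the gamma (class 1) and hadron (class 0) distributions
        plt.hist(df[df["class"] == 1][label], color="blue", label="gamma",
                 alpha=0.7, density=True)
        plt.hist(df[df["class"] == 0][label], color="red", label="hadron",
                 alpha=0.7, density=True)
        plt.title(label)
        plt.ylabel("Probability")
        plt.xlabel(label)
        plt.legend()
        plt.show()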
2095.92 ->  Okay, so the next thing that we're going to do here is we are going to create our train,
2103.12 ->  our validation, and our test data sets. I'm going to set train valid and test to be equal to
2112.88 ->  this. So NumPy dot split, I'm just splitting up the data frame. And if I do this sample,
2120.8 ->  where I'm sampling everything, this will basically shuffle my data. Now, if I want to pass in where
2129.36 ->  exactly I'm splitting my data set, so the first split is going to be maybe at 60%. So I'm going
2138.32 ->  to say 0.6 times the length of this data frame, and then cast that to an integer, that's going
2144.72 ->  to be the first place where you know, I cut it off, and that'll be my training data. Now, if I
2150.56 ->  then go to 0.8, this basically means everything between 60% and 80% of the length of the data
2157.36 ->  set will go towards validation. And then, like everything from 80 to 100, I'm going to pass
2163.76 ->  my test data. So I can run that. And now, if we go up here, and we inspect this data, we'll see that
2172.08 ->  these columns seem to have values in like the 100s, whereas this one is 0.03. Right? So the scale of
2180.48 ->  all these numbers is way off. And sometimes that will affect our results. So one thing that we would want to do
2195.92 ->  is scale these so that they are, you know, so that it's now relative to maybe the mean and the
2206.24 ->  standard deviation of that specific column. I'm going to create a function called scale data set.
2214.4 ->  And I'm going to pass in the data frame. And that's what I'll do for now. Okay, so the x values are
2224.88 ->  going to be, you know, I take the data frame. And let's assume that the columns are going to be,
2234.32 ->  you know, that the label will always be the last thing in the data frame. So what I can do is say
2240 ->  data frame, dot columns all the way up to the last item, and get those values. Now for my y,
2250 ->  well, it's the last column. So I can just do this, I can just index into that last column,
2254.8 ->  and then get those values. Now, I'm actually going to import something known as
2266.64 ->  the StandardScaler from sklearn. So if I come up here, I can go to sklearn dot preprocessing.
2276.08 ->  And I'm going to import StandardScaler, I have to run that cell, I'm going to come back down here.
2284.88 ->  And now I'm going to create a scaler using that StandardScaler.
2290.88 ->  And with the scaler, what I can do is actually just fit and transform x. So here, I can say x
2301.12 ->  is equal to scaler dot fit transform x. So what that's doing is saying, okay, take x and
2311.6 ->  fit the StandardScaler to x, and then transform all those values. And that's
2316.8 ->  going to be our new x. Alright. And then I'm also going to just create, you know, the whole data as
2325.04 ->  one huge 2d NumPy array. And in order to do that, I'm going to call H stack. So H stack is saying,
2333.92 ->  okay, take an array, and another array and horizontally stack them together. That's what
2338.4 ->  the H stands for. So by horizontally stacking them together, just like put them side by side,
2343.44 ->  okay, not on top of each other. So what am I stacking? Well, I have to pass in something
2350 ->  so that it can stack x and y. And now, okay, so NumPy is very particular about dimensions,
2360.4 ->  right? So in this specific case, our x is a two dimensional object, but y is only a one dimensional
2367.12 ->  thing, it's only a vector of values. So in order to now reshape it into a 2d item, we have to call
2375.44 ->  NumPy dot reshape. And we can pass in the dimensions of its reshape. So if I pass in negative
2385.2 ->  one comma one, that just means okay, make this a 2d array, where the negative one just means infer
2391.04 ->  what this dimension value would be, which ends up being the length of y, this would be the
2396.72 ->  same as literally doing this. But the negative one is easier because we're making the computer
2401.44 ->  do the hard work. So if I stack that, I'm going to then return the data x and y. Okay. So one more
2413.12 ->  thing is that if we go into our training data set, okay, again, this is our training data set.
2418.48 ->  And we get the length of the training data set. But where the training data sets class is one,
2428.24 ->  so remember that this is the gammas. And then if we print that, and we do the same thing, but zero,
2439.44 ->  we'll see that, you know, there's around 7000 of the gammas, but only around 4000 of the hadrons.
2449.04 ->  So that might actually become an issue. And instead, what we want to do is we want to oversample
2457.36 ->  our training data set. So that means that we want to increase the number of these values,
2466.2 ->  so that these kind of match better. And surprise, surprise, there is something that we can import
2473.96 ->  that will help us do that. So I'm going to go to from imblearn dot over_sampling. And I'm
2483.16 ->  going to import this RandomOverSampler, run that cell, and come back down here. So I will actually
2491.76 ->  add in this parameter called oversample, and set that to False by default. And if I do want to
2503.64 ->  oversample, then what I'm going to do
2511.24 ->  is create this ros and set it equal to this RandomOverSampler. And then for x and y,
2519.56 ->  I'm just going to say, okay, just fit and resample x and y. And what that's doing is saying, okay,
2526.96 ->  take more of the smaller class. So take the smaller class and keep sampling from there to increase
2535 ->  the size of our data set of that smaller class so that they now match. So if I do this, and I scale
2544.04 ->  data set, and I pass in the training data set where oversample is true. So this let's say this
2553.28 ->  is train and then x train, y train. Oops, what's going on? These should be columns. So basically,
2568.4 ->  what I'm doing now is I'm just saying, okay, what is the length of y train? Okay, now it's
2575.04 ->  14,800, whatever. And now let's take a look at how many of these are type one. So actually,
2585.44 ->  we can just sum that up. And then we'll also see that if we instead switch the label and ask how
2592.72 ->  many of them are the other type, it's the same value. So now these have been evenly, you know,
2599.8 ->  rebalanced. Okay, well, okay. So here, I'm just going to make this the validation data set. And
2611.32 ->  then the next one, I'm going to make this the test data set. Alright, and we're actually going to
2619.88 ->  switch oversample here to false. Now, the reason why I'm switching that to false is because my
2626.28 ->  validation and my test sets are for the purpose of you know, if I have data that I haven't seen yet,
2631.84 ->  how does my model perform on those? And I don't want to oversample for that right now. Like,
2639.68 ->  I don't care about balancing those I'm, I want to know if I have a random set of data that's
2646.56 ->  unlabeled, can I trust my model, right? So that's why I'm not oversampling. I run that. And again,
2656.84 ->  what is going on? Oh, it's because we already have this train. So I have to go come up here and split
2663.12 ->  that data frame again. And now let's run these. Okay. So now we have our data properly formatted.
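Putting this section together, here is a hedged sketch of the split, the scale_dataset function, and the oversampling (details such as variable names may differ slightly from the original notebook):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import RandomOverSampler

    # Shuffle, then cut at 60% and 80%: train / validation / test
    train, valid, test = np.split(df.sample(frac=1),
                                  [int(0.6 * len(df)), int(0.8 * len(df))])

    def scale_dataset(dataframe, oversample=False):
        X = dataframe[dataframe.columns[:-1]].values  # all feature columns
        y = dataframe[dataframe.columns[-1]].values   # last column is the label

        scaler = StandardScaler()
        X = scaler.fit_transform(X)  # zero mean, unit variance per column

        if oversample:
            # Keep resampling the smaller class until both classes match
            ros = RandomOverSampler()
            X, y = ros.fit_resample(X, y)

        # Stack features and label side by side into one 2D array
        data = np.hstack((X, np.reshape(y, (-1, 1))))
        return data, X, y

    train, X_train, y_train = scale_dataset(train, oversample=True)
    valid, X_valid, y_valid = scale_dataset(valid, oversample=False)
    test, X_test, y_test = scale_dataset(test, oversample=False)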
2672.28 ->  And we're going to move on to different models now. And I'm going to tell you guys a little bit
2677.04 ->  about each of these models. And then I'm going to show you how we can do that in our code. So the
2683 ->  first model that we're going to learn about is KNN or K nearest neighbors. Okay, so here, I've
2689.88 ->  already drawn a plot on the y axis, I have the number of kids that a family might have. And then
2697.72 ->  on the x axis, I have their income in terms of 1000s per year. So, you know, if if someone's
2707.4 ->  making 40,000 a year, that's where this would be. And if somebody's making 320, that's where that
2712.36 ->  would be somebody has zero kids, it'd be somewhere along this axis. Somebody has five, it'd be
2718 ->  somewhere over here. Okay. And now I have these plus signs and these minus signs on here. So what
2728.4 ->  I'm going to represent here is the plus sign means that they own a car. And the minus sign is going
2742.48 ->  to represent no car. Okay. So your initial thought should be okay, I think this is binary
2749.8 ->  classification because all of our points all of our samples have labels. So this is a sample with
2760.24 ->  the plus label. And this here is another sample with the minus label. This is an abbreviation for
2773 ->  'with' that I'll use. Alright, so we have this entire data set. And maybe around half the people
2780.76 ->  own a car and maybe around half the people don't own a car. Okay, well, what if I had some new
2789.2 ->  point, let me choose a different color, I'll use this nice green. Well, what if I have a new
2795.4 ->  point over here? So let's say that somebody makes 40,000 a year and has two kids. What do we think
2802.72 ->  that would be? Well, just logically looking at this plot, you might think, okay, it seems like
2812.44 ->  they wouldn't have a car, right? Because that kind of matches the pattern of everybody else around
2817.8 ->  them. So that's a whole concept of this nearest neighbors is you look at, okay, what's around you.
2826.24 ->  And then you're basically like, okay, I'm going to take the label of the majority that's around me.
2831.32 ->  So the first thing that we have to do is we have to define a distance function. And a lot of times
2837.64 ->  in, you know, 2d plots like this, our distance function is something known as Euclidean distance.
2845.28 ->  And Euclidean distance is basically just this straight line distance like this. Okay. So this
2865.48 ->  would be the Euclidean distance, it seems like there's this point, there's this point, there's
2874 ->  that point, etc. So the length of this line, this green line that I just drew, that is what's known
2880.68 ->  as Euclidean distance. If we want to get technical with that, this exact formula is the distance here,
2890.16 ->  let me zoom in. The distance is equal to the square root of (x1 minus x2)
2900.2 ->  squared, plus, extending that square root, the same thing for y, so (y1 minus y2)
2909.16 ->  squared. Okay, so we're basically trying to find the differences
2916.16 ->  between the x's and the y's, then square each of those, sum it up and take the square root. Okay, so I'm
2923.72 ->  going to erase this so it doesn't clutter my drawing. But anyways, now going back to this plot,
2933.24 ->  so here in the nearest neighbor algorithm, we see that there is a K, right? And this K is basically
2943.52 ->  telling us, okay, how many neighbors do we use in order to judge what the label is? So usually,
2949.72 ->  we use a K of maybe, you know, three or five, depends on how big our data set is. But here,
2956.52 ->  I would say, maybe a logical number would be three or five. So let's say that we take K to be equal
2965.36 ->  to three. Okay, well, of this data point that I drew over here, let me use green to highlight this.
2974.64 ->  Okay, so of this data point that I drew over here, it looks like the three closest points are definitely
2980.2 ->  this one, this one. And then this one has a length of four. And this one seems like it'd be a little
2990.36 ->  bit further than four. So actually, this would be these would be our three points. Well, all those
2997.56 ->  points are blue. So chances are, my prediction for this point is going to be blue, it's going to be
3005.92 ->  probably don't have a car. All right, now what if my point is somewhere? What if my point is
3014.84 ->  somewhere over here, let's say that a couple has four kids, and they make 240,000 a year. All right,
3026.12 ->  well, now my closest points are this one, probably a little bit over that one. And then this one,
3034.16 ->  right? Okay, still all pluses. Well, this one is more than likely to be plus. Right? Now,
3045.64 ->  let me get rid of some of these just so that it looks a little bit more clear. All right,
3055.28 ->  let's go through one more. What about a point that might be right here? Okay, let's see. Well,
3066.96 ->  definitely this is the closest, right? This one's also closest. And then it's really close between
3076 ->  the two of these. But if we actually do the mathematics, it seems like if we zoom in,
3082.72 ->  this one is right here. And this one is in between these two. So this one here is actually shorter
3090.84 ->  than this one. And that means that that top one is the one that we're going to take. Now,
3097.92 ->  what is the majority of the points that are close by? Well, we have one plus here, we have one plus
3105.08 ->  here, and we have one minus here, which means that the pluses are the majority. And that means
3112.16 ->  that this label is probably somebody with a car. Okay. So this is how K nearest neighbors would
3124.56 ->  work. It's that simple. And this can be extrapolated to further dimensions to higher dimensions. You
3133.6 ->  know, if you have here, we have two different features, we have the income, and then we have
3139.4 ->  the number of kids. But let's say we have 10 different features, we can expand our distance
3145.92 ->  function so that it includes all 10 of those dimensions, we take the square root of everything,
3151.52 ->  and then we figure out which one is the closest to the point that we desire to classify. Okay. So
3159.48 ->  that's K nearest neighbors. So now we've learned about K nearest neighbors. Let's see how we would
3165.24 ->  be able to do that within our code. So here, I'm going to label the section K nearest neighbors.
3171.08 ->  And we're actually going to use a package from SK learn. So the reason why we, you know, use these
3179.56 ->  packages is so that we don't have to manually code all these things ourselves, because it would
3184.64 ->  be really difficult. And chances are the way that we would code it, either would have bugs,
3188.2 ->  or it'd be really slow, or I don't know a whole bunch of issues. So what we're going to do is
3193.08 ->  hand it off to the pros. From here, I can say, okay, from SK learn, which is this package dot
3200.32 ->  neighbors, I'm going to import KNeighborsClassifier, because we're classifying. Okay,
3207.88 ->  so I run that. And our KNN model is going to be this KNeighborsClassifier. And we can pass in
3218.16 ->  a parameter of how many neighbors, you know, we want to use. So first, let's see what happens if
3223.92 ->  we just use one. So now if I do knn model dot fit, I can pass in my x training set and my
3232.8 ->  y train data. Okay. So that effectively fits this model. And let's get all the predictions. So,
3243.56 ->  I guess, yeah, let's do y predictions. And my y predictions are going to be knn model
3251.88 ->  dot predict. So let's use the test set x test. Okay. Alright, so if I call y predict, you'll see
3264.96 ->  that we have those. But if I get my truth values for that test set, you'll see that this is what
3269.72 ->  they actually are. So just looking at this, we got five out of six of them. Okay, great. So let's
3273.88 ->  actually take a look at something called the classification report that's offered by SK learn.
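Here is a sketch of the KNN code this section walks through, including the classification report that's about to be shown (it assumes the X_train, y_train, X_test, y_test arrays from the earlier scaling step):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report

    # n_neighbors is the "k" discussed above
    knn_model = KNeighborsClassifier(n_neighbors=5)
    knn_model.fit(X_train, y_train)

    # Predict on the held-out test set and summarize precision/recall/f1
    y_pred = knn_model.predict(X_test)
    print(classification_report(y_test, y_pred))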
3279.48 ->  So if I go to from SK learn dot metrics, import classification report, what I can actually do is
3289.72 ->  say, hey, print out this classification report for me. And let's check, you know, I'm giving you the
3297.96 ->  y test and the y prediction. We run this and we see we get this whole entire chart. So I'm going
3304.12 ->  to tell you guys a few things on this chart. Alright, this accuracy is 82%, which is actually
3310.72 ->  pretty good. That's just saying, hey, if we just look at, you know, what each of these new points,
3315.68 ->  what it's closest to, then we actually get an 82% accuracy, which means how many do we get right
3323.36 ->  versus how many total are there. Now, precision is saying, okay, you might see that we have it
3329.96 ->  for class one, or class zero and class one. What precision is saying is, let's go to this Wikipedia
3336.2 ->  diagram over here, because I actually kind of like this diagram. So here, this is our entire data set.
3342.88 ->  And on the left over here, we have everything that we know is positive. So everything that is
3348.16 ->  actually truly positive, that we've labeled positive in our original data set. And over here,
3354.08 ->  this is everything that's truly negative. Now in the circle, we have things that are positive that
3361.08 ->  were labeled positive by our model. On the left here, we have things that are truly positive,
3368.16 ->  because you know, this side is the positive side and this side is the negative side. So these are
3373.12 ->  truly positive. Whereas all these ones out here, well, they should have been positive, but they
3378.84 ->  are labeled as negative. And in here, these are the ones that we've labeled positive, but they're
3384.56 ->  actually negative. And out here, these are truly negative. So precision is saying, okay, out of all
3393 ->  the ones we've labeled as positive, how many of them are true positives? And recall is saying,
3400.4 ->  okay, out of all the ones that we know are truly positive, how many do we actually get right? Okay,
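In symbols, with TP, FP, and FN standing for true positives, false positives, and false negatives, those two definitions are:

    \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}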
3407.16 ->  so going back to this over here, our precision score, so again, precision, out of all the ones
3415.48 ->  that we've labeled as the specific class, how many of them are actually that class, it's 77% and 84%. Now,
3423.88 ->  recall, out of all the ones that are actually this class, how many of those did we get? This
3429.4 ->  is 68% and 89%. Alright, so not too shabby, we can clearly see that the recall and precision for
3438.2 ->  class zero is worse than for class one. Right? So that means it works worse for
3444.08 ->  hadrons than for our gammas. This f1 score over here is kind of a combination of the precision and
3450.08 ->  recall score. So we're actually going to mostly look at this one because we have an unbalanced
3455.52 ->  test data set. So here we have a measure of 72 and 87 or point seven two and point eight seven,
3463 ->  which is not too shabby. All right. Well, what if we, you know, made this three. So we actually see
3484.6 ->  that, okay, so what was it originally with one? We see that our f1 score, you know, was
3484.6 ->  point seven two and then point eight seven. And then our accuracy was 82%. So if I change that to
3490.36 ->  three. Alright, so we've kind of increased zero at the cost of one and then our overall accuracy
3500.44 ->  is 81. So let's actually just make this five. Alright, so you know, again, very similar numbers,
3508.16 ->  we have 82% accuracy, which is pretty decent for a model that's relatively simple. Okay,
3515.36 ->  the next type of model that we're going to talk about is something known as naive Bayes. Now,
3522.88 ->  in order to understand the concepts behind naive Bayes, we have to be able to understand
3528.4 ->  conditional probability and Bayes rule. So let's say I have some sort of data set that's shown in
3535.8 ->  this table right here. People who have COVID are over here in this red row. And people who do not
3543.72 ->  have COVID are down here in this green row. Now, what about the COVID test? Well, people who have
3549.04 ->  tested positive are over here in this column. And people who have tested negative are over here in
3558.36 ->  this column. Okay. Yeah, so basically, our categories are people who have COVID and test positive,
3565.84 ->  people who don't have COVID, but test positive, so a false positive, people who have COVID
3572.8 ->  and test negative, which is a false negative, and people who don't have COVID and test negative,
3578.56 ->  which good means you don't have COVID. Okay, so let's make this slightly more legible. And here,
3588.16 ->  in the margins, I've written down the sums of whatever it's referring to. So this here is the
3595.36 ->  sum of this entire row. And this here might be the sum of this column over here. Okay. So the first
3605.56 ->  question that I have is, what is the probability of having COVID given that you have a positive
3611.56 ->  test? And in probability, we write that out like this. So the probability of COVID given, so this
3621.92 ->  line, that vertical line means given that, you know, some condition, so given a positive test,
3629.36 ->  okay, so what is the probability of having COVID given a positive test? So what this is asking is
3639.44 ->  saying, okay, let's go into this condition. So the condition of having a positive test, that is this
3648.32 ->  slice of the data, right? That means if you're in this slice of data, you have a positive test. So
3653.36 ->  given that we have a positive test, given in this condition, in this circumstance, we have a positive
3659 ->  test. So what's the probability that we have COVID? Well, if we're just using this data, the number
3665.68 ->  of people that have COVID is 531. So I'm gonna say that there's 531 people that have COVID. And then
3675.44 ->  now we divide that by the total number of people that have a positive test, which is 551. Okay,
3684.6 ->  so that's the probability and doing a quick division, we get that this is equal to around
3694.64 ->  96.4%. So according to this data set, which is data that I made up off the top of my head, so it's
3703.24 ->  not actually real COVID data. But according to this data, the probability of having COVID given
3710.76 ->  that you tested positive is 96.4%. Alright, now with that, let's talk about Bayes rule, which is
3722.48 ->  this section here. Let's ignore this bottom part for now. So Bayes rule is asking, okay, what is
3730.44 ->  the probability of some event A happening, given that B happened. So this, we already know has
3738 ->  happened. This is our condition, right? Well, what if we don't have data for that, right? Like, what
3746 ->  if we don't know what the probability of A given B is? Well, Bayes rule is saying, okay, well, you
3751.44 ->  can actually go and calculate it, as long as you have a probability of B given A, the probability
3756.92 ->  of A and the probability of B. Okay. And this is just a mathematical formula for that. Alright,
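That formula is Bayes' rule, written out:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}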
3763.92 ->  so here we have Bayes rule. And let's actually see Bayes rule in action. Let's use it on an example.
3771.32 ->  So here, let's say that we have some disease statistics, okay. So not COVID different disease.
3778.92 ->  And we know that the probability of obtaining a false positive is 0.05 probability of obtaining a
3785.96 ->  false negative is 0.01. And the probability of the disease is 0.1. Okay, what is the probability of
3792.8 ->  the disease given that we got a positive test? Hmm, how do we even go about solving this? So
what do I mean by false positive? What's a different way to rewrite that? A false positive
3806.52 ->  is when you test positive, but you don't actually have the disease. So this here is a probability
3812.96 ->  that you have a positive test given no disease, right? And similarly for the false negative,
3822.48 ->  it's a probability that you test negative given that you actually have the disease. So if I put
3827.6 ->  that into a chart, for example, and this might be my positive and negative tests, and this might
3838.12 ->  be my diseases, disease and no disease. Well, the probability that I test positive, but actually
have no disease, okay, that's 0.05 over here. And then the false negative is up here, the 0.01: so I'm
testing negative, but I actually do have the disease. So the probability that you test
positive given that you don't have the disease, plus the probability that you test negative given that you
don't have the disease, that should sum up to one. Okay, because if you don't have the disease,
then you should have some probability that you're testing positive and some probability that you're
testing negative, and that probability in total should be one. So that means that the probability of testing
negative given no disease should be the complement, the opposite. So it
should be 0.95, because it's one minus whatever this probability is. And then similarly,
up here, this should be 0.99, because the probability that we
test negative given that we have the disease, plus the probability that we test positive given that we have the
disease, should equal one. So this is our probability chart. And now, this probability of disease
being 0.1 just means I have a 10% probability of actually having the disease, right? Like,
3923.2 ->  in the general population, the probability that I have the disease is 0.1. Okay, so what is the
3930 ->  probability that I have the disease given that I got a positive test? Well, remember that we
3937.04 ->  can write this out in terms of Bayes rule, right? So if I use this rule up here, this is the
3943.12 ->  probability of a positive test given that I have the disease times the probability of the disease
3952.88 ->  divided by the probability of the evidence, which is my positive test.
3960 ->  Alright, now let's plug in some numbers for that. The probability of having a positive test given
3965.68 ->  that I have the disease is 0.99. And then the probability that I have the disease is this value
3973.84 ->  over here 0.1. Okay. And then the probability that I have a positive test at all should be okay,
what is the probability that I have a positive test given that I actually have the disease,
times the probability of having the disease. And then the other case, where the probability of me having a
positive test given no disease, times the probability of not actually
4005.52 ->  having a disease. Okay, so I can expand that probability of having a positive test out into
4012 ->  these two different cases, I have a disease, and then I don't. And then what's the probability of
4018.48 ->  having positive tests in either one of those cases. So that expression would become 0.99 times 0.1
4029.52 ->  plus 0.05. So that's the probability that I'm testing positive, but don't have the disease.
And then times the probability that I don't actually have the disease. So that's one minus
0.1; the probability that the population doesn't have the disease is 90%, so 0.9. And if we do that
multiplication, we get an answer of 0.6875, or 68.75%. Okay.
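That same calculation as a few lines of Python, just using the numbers stated above (false positive rate 0.05, false negative rate 0.01, disease probability 0.1):

# Bayes rule: P(disease | +) = P(+ | disease) * P(disease) / P(+)
p_disease = 0.1
p_pos_given_disease = 1 - 0.01       # 0.99, since the false negative rate is 0.01
p_pos_given_no_disease = 0.05        # the false positive rate
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))   # expand P(+) over both cases
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)           # ~0.6875, i.e. 68.75%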
All right, so we can actually expand Bayes rule and apply it to classification. And this is what we call naive
Bayes. So first, a little terminology. So the posterior is this over here, because it's asking,
4084.64 ->  Hey, what is the probability of some class CK? So by CK, I just mean, you know, the different
4092.48 ->  categories, so C for category or class or whatever. So category one might be cats, category two,
4099.36 ->  dogs, category three, lizards, all the way, we have k categories, k is just some number. Okay.
4107.52 ->  So what is the probability of having of this specific sample x, so this is our feature vector
of this one sample. What is the probability of x fitting into category 1, 2, 3, or whatever, right?
So that's what this is asking: what is the probability that, you know, it's actually from
4129.12 ->  this class, given all this evidence that we see the x's. So the likelihood is this quantity over
here, it's saying, okay, well, assume that this
class is class CK, assume that this is the category. Well, what is the likelihood of
4153.76 ->  actually seeing x, all these different features from that category. And then this here is the
4161.28 ->  prior. So like in the entire population of things, what are the probabilities? What is the
4166.88 ->  probability of this class in general? Like if I have, you know, in my entire data set, what is the
4172.64 ->  percentage? What is the chance that this image is a cat? How many cats do I have? Right. And then this
4180.16 ->  down here is called the evidence because what we're trying to do is we're changing our prior,
4187.44 ->  we're creating this new posterior probability built upon the prior by using some sort of evidence,
4194.32 ->  right? And that evidence is a probability of x. So that's some vocab. And this here
is the rule for naive Bayes. Whoa, okay, let's digest that a little bit. Okay. So,
let me use a different color. What is this side of the equation asking? It's asking,
4221.68 ->  what is the probability that we are in some class K, CK, given that, you know, this is my first
4228.32 ->  input, this is my second input, this is, you know, my third, fourth, this is my nth input. So let's
4233.92 ->  say that our classification is, do we play soccer today or not? Okay, and let's say our x's are,
okay, how much wind is there? How much rain is there? And what day of the week is it? So let's
say that it's raining, it's not windy, but it's Wednesday: do we play soccer? Do we not?
4256.08 ->  So let's use Bayes rule on this. So this here
4266.08 ->  is equal to the probability of x one, x two, all these joint probabilities, given class K
4273.84 ->  times the probability of that class, all over the probability of this evidence.
Okay. So what is this fancy symbol over here? This means proportional to.
So just like how our equal sign means 'is equal to', this little squiggly sign means that this is
proportional to. Okay, and this denominator over here, you might notice that it has no impact on
4308.8 ->  the class like this, that number doesn't depend on the class, right? So this is going to be constant
4313.84 ->  for all of our different classes. So what I'm going to do is make things simpler. So I'm just
going to say that this probability of class K given x one, x two, all the way to x n, this is going to be proportional
to the numerator; I don't care about the denominator, because it's the same for every
single class. So this is proportional to the probability of x one, x two, up to x n given class K, times the probability of
that class. Okay. All right. So in naive Bayes, the point of it being naive is that for
this joint probability, we're just assuming that all of these different things
4356.32 ->  are all independent. So in my soccer example, you know, the probability that we're playing soccer,
or the probability that, you know, it's windy, and it's rainy, and it's Wednesday, all these
4370.72 ->  things are independent, we're assuming that they're independent. So that means that I can
4376.8 ->  actually write this part of the equation here as this. So each term in here, I can just multiply
4387.12 ->  all of them together. So the probability of the first feature, given that it's class K,
times the probability of the second feature given class K, all the way up
until, you know, the nth feature given that it's class K. So this expands to
4410.96 ->  all of this. All right, which means that this here is now proportional to the thing that we just
4419.2 ->  expanded times this. So I'm going to write that out. So the probability of that class.
4427.6 ->  And I'm actually going to use this symbol. So what this means is it's a huge multiplication,
4434.56 ->  it means multiply everything to the right of this. So this probability x, given some class K,
but do it for all the i's. So what is i? Okay, we're going to go from the first
x i all the way to the nth. So that means for every single i, we're just multiplying
these probabilities together. And that's where this up here comes from. So to wrap this up
in plain English (oops, this should be a line): basically, what this is saying
4471.6 ->  is a probability that you know, we're in some category, given that we have all these different
4477.52 ->  features is proportional to the probability of that class in general, times the probability of
4484.96 ->  each of those features, given that we're in this one class that we're testing. So the probability
of us playing soccer today, given that it's rainy, not windy, and it's
Wednesday, is proportional to: okay, well, what is the probability that we play soccer
anyways, and then times the probability that it's rainy, given that we're playing soccer,
times the probability that it's not windy, given that we're playing soccer (so how many times
is it not windy when we're playing soccer), and then times the probability that
it's Wednesday, given that we're playing soccer. Okay. So how do we use this in order to make a
classification. So that's where this comes in: our y hat, our predicted y, is going to be equal to
something called the arg max of this expression over here. So okay, if I write this out again,
this means the probability of being in some class CK given all of our evidence. Well, we're going to take the K that maximizes
this expression on the right. That's what arg max means. So K goes from one
through K, so this is however many categories there are; we're going to go through each K. And we're going
4581.2 ->  to solve this expression over here and find the K that makes that the largest. Okay. And remember
that instead of writing this, we now have a formula, thanks to Bayes rule, for helping us
approximate that in terms of something that we actually have evidence for,
something we have the answers for based on our training set. So this principle of going through each of
these and finding whatever class, whatever category, maximizes this expression on the right,
4620.56 ->  this is something known as MAP for short, or maximum a posteriori.
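Written out as a formula, this just restates what was described above, with C_k for the class and x_1 through x_n for the features:

P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)

\hat{y} = \arg\max_{k \in \{1, \dots, K\}} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)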
4632.16 ->  Pick the hypothesis. So pick the K that is the most probable so that we minimize the probability
4640.16 ->  of misclassification. Right. So that is MAP. That is naive Bayes. Back to the notebook. So
4651.76 ->  just like how I imported k nearest neighbor, k neighbors classifier up here for naive Bayes,
4658.8 ->  I can go to SK learn naive Bayes. And I can import Gaussian naive Bayes.
Right. And here I'm going to say my naive Bayes model is equal to this. This is very similar to what we
4672.72 ->  had above. And I'm just going to say with this model, we are going to fit x train and y train.
All right, just like above. So I'm going to set that. And
exactly like above, I'm going to make my prediction. So here, I'm going to instead use my
4706.16 ->  naive Bayes model. And of course, I'm going to run the classification report again. So I'm actually
just going to put these in the same cell. But here we have the new y prediction, and then y test
is still our original test data set. So if I run this, you'll see that we get worse scores, right?
Our precision, our recall, our F1 score, they all look slightly worse for the different
categories. And our total accuracy is still 72%, which is not too shabby, but also not that great.
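For reference, that cell looks roughly like this; this is a sketch, assuming the X_train, y_train, X_test, and y_test splits created earlier in the notebook:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb_model = GaussianNB()
nb_model = nb_model.fit(X_train, y_train)   # fit on the training set, just like KNN above

y_pred = nb_model.predict(X_test)           # predict on the held-out test set
print(classification_report(y_test, y_pred))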
Okay, so let's move on to logistic regression. Here, I've drawn a plot: I have y, so this is my label on one axis. And then this is maybe one of
my features. So let's just say I only have one feature in this case, x zero, right? Well,
4776.72 ->  we see that, you know, I have a few of one class type down here. And we know it's one class type
4784.08 ->  because it's zero. And then we have our other class type one up here. And then we have our
4791.28 ->  y. Okay. So many of you guys are familiar with regression. So let's start there. If I were to
draw a regression line through this, it might look something like this. Right? Well, this
doesn't seem to be a very good model. Like, why would we use this specific line to predict y?
4816.24 ->  Right? It's, it's iffy. Okay. For example, we might say, okay, well, it seems like, you know,
everything from here downwards would be one class type, and from here upwards would be another class type.
But when you look at this, you can just visually tell, okay, like, that line doesn't
make sense. Those dots are not along that line. And the reason is because we
4846.24 ->  are doing classification, not regression. Okay. Well, first of all, let's start here, we know that
this model, if we just use this line, it equals m x plus b,
where b is the y intercept, right? And m is the slope. But when we use a linear regression,
is it actually y hat? No, it's not, right? So when we're working with regression for this classification problem,
what we're actually estimating in our model is a probability, a probability between zero
and one, that it is class zero or class one. So here, let's rewrite this as p equals m x plus b.
4892.72 ->  Okay, well, m x plus b, that can range, you know, from negative infinity to infinity,
4899.44 ->  right? For any for any value of x, it goes from negative infinity to infinity.
4904.16 ->  But probability, we know probably one of the rules of probability is that probability has to stay
4909.04 ->  between zero and one. So how do we fix this? Well, maybe instead of just setting the probability
4917.04 ->  equal to that, we can set the odds equal to this. So by that, I mean, okay, let's do probability
divided by one minus the probability. Okay, so now it becomes this ratio. Now this ratio is allowed to
4930.08 ->  take on infinite values. But there's still one issue here. Let me move this over a bit.
4938.08 ->  The one issue here is that m x plus b, that can still be negative, right? Like if you know,
4944.56 ->  I have a negative slope, if I have a negative b, if I have some negative x's in there, I don't know,
4948.8 ->  but that can be that's allowed to be negative. So how do we fix that? We do that by actually taking
4956.4 ->  the log of the odds. Okay. So now I have the log of you know, some probability divided by one minus
the probability. And now that is on a range of negative infinity to infinity, which is good,
because m x plus b also ranges from negative infinity to infinity. Now how do I solve for P
4980.64 ->  the probability? Well, the first thing I can do is take, you know, I can remove the log by taking
e to the power of whatever is on both sides. So that gives me the probability
4996.48 ->  over the one minus the probability is now equal to e to the m x plus b. Okay. So let's multiply
that out. So the probability is equal to one minus the probability, times e to the m x plus b. So P is equal to
e to the m x plus b, minus P times e to the m x plus b. And now we can move like terms to
one side. So basically, I'm moving this over, so I'm adding P. So now P times one plus e
to the m x plus b is equal to e to the m x plus b, and let me change this parentheses, make it a
5051.44 ->  little bigger. So now my probability can be e to the m x plus b divided by one plus e to the m x plus b.
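Collecting the algebra so far into one line (the same steps, just written out):

\log \frac{p}{1-p} = mx + b \;\Rightarrow\; \frac{p}{1-p} = e^{mx+b} \;\Rightarrow\; p = \frac{e^{mx+b}}{1 + e^{mx+b}}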
5062.72 ->  Okay, well, let me just rewrite this really quickly, I want a numerator of one on top.
Okay, so what I'm going to do is I'm going to multiply the top by e to the negative m x plus b,
and then also the bottom by e to the negative m x plus b, and I'm allowed to do that because
5085.12 ->  this over this is one. So now my probability is equal to one over
5094.64 ->  one plus e to the negative m x plus b. And now why did I rewrite it like that?
5101.84 ->  It's because this is actually a form of a special function, which is called the sigmoid
5107.6 ->  function. And for the sigmoid function, it looks something like this. So s of x sigmoid, you know,
5120.16 ->  that some x is equal to one over one plus e to the negative x. So essentially, what I just did up here
5130.64 ->  is rewrite this in some sigmoid function, where the x value is actually m x plus b.
5138.96 ->  So maybe I'll change this to y just to make that a bit more clear, it doesn't matter what
5142.88 ->  the variable name is. But this is our sigmoid function. And visually, what our sigmoid function
5150.32 ->  looks like is it goes from zero. So this here is zero to one. And it looks something like this
curved s, which I didn't draw too well. Let me try that again. It's hard to draw;
let's see if I can draw this right. Like that. Okay, so it goes in between zero and one.
5179.12 ->  And you might notice that this form fits our shape up here.
Oops, let me draw it sharper. But it fits our shape up there a lot better, right?
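A quick sketch of that function in Python, just to see that it really does squash any input into the range between zero and one (the inputs here are made-up numbers):

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.00005, 0.5, ~0.99995

In logistic regression, z is just m x plus b, so the model's output is always a valid probability.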
5197.44 ->  Alright, so that is what we call logistic regression, we're basically trying to fit our data
to the sigmoid function. Okay. And when we only have, you know,
one feature x, that's what we call simple logistic regression. But then if we have, you know,
5226.24 ->  so that's only x zero, but then if we have x zero, x one, all the way to x n, we call this
5232.64 ->  multiple logistic regression, because there are multiple features that we're considering
5239.36 ->  when we're building our model, logistic regression. So I'm going to put that here.
And again, from sklearn's linear model, we can import logistic regression. All right.
5256.08 ->  And just like how we did above, we can repeat all of this. So here, instead of NB, I'm going to call
5263.28 ->  this log model, or LG logistic regression. I'm going to change this to logistic regression.
5274.32 ->  So I'm just going to use the default logistic regression. But actually, if you look here,
5279.12 ->  you see that you can use different penalties. So right now we're using
an L2 penalty, and L2 is our quadratic penalty. Okay, so that means that for,
5289.68 ->  you know, outliers, it would really penalize that. For all these other things, you know,
5296.08 ->  you can toggle these different parameters, and you might get slightly different results.
If I were building a production-level logistic regression model, then I would want to go and
figure out, you know, what are the best parameters to pass into here,
based on my validation data. But for now, we'll just use this out of the box.
5322.72 ->  So again, I'm going to fit the X train and the Y train. And I'm just going to predict again,
so I can just call this again. And instead of NB, I'm going to use LG. So here, this is decent:
precision of 65%, recall of 71%, an F1 of 68 (or 82 for the other class), and a total accuracy of 77%. Okay, so it performs slightly
better than naive Bayes, but it's still not as good as KNN.
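A sketch of that cell, with the default settings and assuming the same arrays as before:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lg_model = LogisticRegression()             # defaults, including penalty='l2'
lg_model = lg_model.fit(X_train, y_train)

y_pred = lg_model.predict(X_test)
print(classification_report(y_test, y_pred))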
Alright, so the last model for classification that I wanted to talk about is something called support vector machines,
or SVMs for short. So what exactly is an SVM model? I have two different features, x zero and
5371.84 ->  x one on the axes. And then I've told you if it's you know, class zero or class one based on the
5379.52 ->  blue and red labels, my goal is to find some sort of line between these two labels that best divides
5391.28 ->  the data. Alright, so this line is our SVM model. So I call it a line here because in 2d, it's a
5400.56 ->  line, but in 3d, it would be a plane and then you can also have more and more dimensions. So the
5406.16 ->  proper term is actually I want to find the hyperplane that best differentiates these two
5411.6 ->  classes. Let's see a few examples. Okay, so first, between these three lines, let's say A, B, and C,
which one is the best divider of the data, which one has, you know, all the data on one side
5437.76 ->  or the other, or at least if it doesn't, which one divides it the most, right, like which one
has the most defined boundary between the two different groups. So this question should be
pretty straightforward. It should be A, right, because A has a clear, distinct line where, you
know, everything on this side of A is one label, it's negative, and everything on the other side of A
is the other label, it's positive. So what if I still have A, but then what if I had drawn my B
like this, and my C, maybe like this. Sorry, the labels are kind of close together.
But now which one is the best? So I would argue that it's still A, right? And why is it still A?
Because in these other two, look at how close this is to that,
5507.84 ->  to these points. Right? So if I had some new point that I wanted to estimate, okay,
5517.12 ->  say I didn't have A or B. So let's say we're just working with C. Let's say I have some new point
5522.96 ->  that's right here. Or maybe a new point that's right there. Well, it seems like just logically
5530.96 ->  looking at this. I mean, without the boundary, that would probably go under the positives,
5539.6 ->  right? I mean, it's pretty close to that other positive. So one thing that we care about in SVM
5547.52 ->  is something known as the margin. Okay, so not only do we want to separate the two classes really
5556.32 ->  well, we also care about the boundary in between where the points in those classes in our data set
5563.12 ->  are, and the line that we're drawing. So in a line like this, the closest values to this line
5573.28 ->  might be like here. And I'm trying to draw these perpendicular. Right? And so this effectively,
5590 ->  if I switch over to these dotted lines, if I can draw this right. So these effectively
5602.4 ->  are what's known as the margins. Okay, so these both here, these are our margins in our SVMs.
5618.48 ->  And our goal is to maximize those margins. So not only do we want the line that best separates the
5623.04 ->  two different classes, we want the line that has the largest margin. And the data points that lie
on the margin lines, so basically the data points that are helping us define our
5637.52 ->  divider. These are what we call support vectors. Hence the name support vector machines. Okay,
5648.48 ->  so the issue with SVM sometimes is that they're not so robust to outliers. Right? So for example,
5656.48 ->  if I had one outlier, like this up here, that would totally change where I want my support
5665.84 ->  vector to be, even though that might be my only outlier. Okay. So that's just something to keep
in mind when you're working with SVMs: it might not be the best model if there
5678.24 ->  are outliers in your data set. Okay, so another example of SVMs might be, let's say that we have
5685.68 ->  data like this, I'm just going to use a one dimensional data set for this example. Let's
5690.48 ->  say we have a data set that looks like this. Well, our, you know, separators should be
5696.8 ->  perpendicular to this line. But it should be somewhere along this line. So it could be
5702.4 ->  anywhere like this. You might argue, okay, well, there's one here. And then you could also just
5709.12 ->  draw another one over here, right? And then maybe you can have two SVMs. But that's not really how
5713.84 ->  SVMs work. But one thing that we can do is we can create some sort of projection. So I realize here
5721.68 ->  that one thing I forgot to do was to label where zero was. So let's just say zero is here.
Now, what I'm going to do is I'm going to say, okay, I'm going to have x zero and x one.
So x zero is just going to be my original x. But I'm going to make
x one equal to, let's say, x squared. So whatever this is, squared, right? So now, my negatives would be,
5756.88 ->  you know, maybe somewhere here, here, just pretend that it's somewhere up here.
5762.96 ->  Right. And now my pluses might be something like
5770.08 ->  that. And I'm going to run out of space over here. So I'm just going to draw these together,
5776.08 ->  use your imagination. But once I draw it like this, well, it's a lot easier to apply a boundary,
5787.6 ->  right? Now our SVM could be maybe something like this, this. And now you see that we've divided
5795.52 ->  our data set. Now it's separable where one class is this way. And the other class is that way.
5802.8 ->  Okay, so that's known as SVMs. I do highly suggest that, you know, any of these models that we just
5809.36 ->  mentioned, if you're interested in them, do go more in depth mathematically into them. Like how
do we find this hyperplane? Right? I'm not going to go over that in this specific course,
because you're just learning what an SVM is. But it's a good idea to know, oh, okay, this is the
technique behind finding, you know, how exactly you define the hyperplane
5833.04 ->  that we're going to use. So anyways, this transformation that we did down here, this is known
as the kernel trick. So when we go from x to the coordinates x and x squared,
5847.12 ->  what we're doing is we are applying a kernel. So that's why it's called the kernel trick.
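Here's a tiny illustration of that projection with made-up 1D points: adding x squared as a second coordinate makes two classes that overlap on the original line separable in the new space. (In practice, sklearn's SVC handles this kind of thing implicitly through its kernel parameter, for example 'poly' or 'rbf'.)

import numpy as np

x = np.array([-3, -2, 2, 3, -0.5, 0.0, 0.5])   # made-up 1D data
y = np.array([ 1,  1, 1, 1,  0,   0,   0])     # outer points are one class, inner points the other

# project x -> (x0, x1) = (x, x^2); in this new space a horizontal
# line like x1 = 1 cleanly separates the two classes
X_projected = np.column_stack([x, x ** 2])
print(X_projected)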
5853.28 ->  So SVMs are actually really powerful. And you'll see that here. So from sk learn.svm, we are going
5860.16 ->  to import SVC. And SVC is our support vector classifier. So with this, so with our SVM model,
we are going to, you know, create an SVC model. And we are going to, again, fit this to X train and y train;
I could have just copied and pasted this from above, I should have probably
done that. Okay, this one's taking a bit longer. All right. Let's predict using our SVM model. And here,
5903.76 ->  let's see if I can hover over this. Right. So again, you see a lot of these different
5908.88 ->  parameters here that you can go back and change if you were creating a production level model. Okay,
5917.12 ->  but in this specific case, we'll just use it out of the box again. So if I make predictions,
5926.32 ->  you'll note that Wow, the accuracy actually jumps to 87% with the SVM. And even with class zero,
5933.12 ->  there's nothing less than, you know, point eight, which is great. And for class one,
5939.2 ->  I mean, everything's at 0.9, which is higher than anything that we had seen to this point.
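For reference, the SVC cell is again just a few lines; a sketch, assuming the same train/test arrays as before (SVC's default kernel is RBF):

from sklearn.svm import SVC
from sklearn.metrics import classification_report

svm_model = SVC()                  # defaults, used out of the box
svm_model = svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)
print(classification_report(y_test, y_pred))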
5946.64 ->  So so far, we've gone over four different classification models, we've done SVM,
logistic regression, naive Bayes, and KNN. And these are just simple ways on how to implement
them. Each of these, they have different, you know, different hyperparameters that you can
5963.76 ->  go and you can toggle. And you can try to see if that helps later on or not. But for the most part,
5971.92 ->  they perform, they give us around 70 to 80% accuracy. Okay, with SVM being the best. Now,
5980.8 ->  let's see if we can actually beat that using a neural net. Now the final type of model that
5985.44 ->  I wanted to talk about is known as a neural net or neural network. And neural nets look something
5991.84 ->  like this. So you have an input layer, this is where all your features would go. And they have
5998.48 ->  all these arrows pointing to some sort of hidden layer. And then all these arrows point to some
sort of output layer. So what does all this mean? Each of these nodes in here, this is
6010.56 ->  something known as a neuron. Okay, so that's a neuron. In a neural net. These are all of our
6018.16 ->  features that we're inputting into the neural net. So that might be x zero x one all the way through
6023.84 ->  x n. Right. And these are the features that we talked about there, they might be you know,
6028.88 ->  the pregnancy, the BMI, the age, etc. Now all of these get weighted by some value. So they
are multiplied by some w number that applies to that one specific
6044.24 ->  feature. So these two get multiplied. And the sum of all of these goes into that neuron. Okay,
6051.84 ->  so basically, I'm taking w zero times x zero. And then I'm adding x one times w one and then
6058.4 ->  I'm adding you know, x two times w two, etc, all the way to x n times w n. And that's getting
6065.36 ->  input into the neuron. Now I'm also adding this bias term, which just means okay, I might want
6071.2 ->  to shift this by a little bit. So I might add five or I might add 0.1 or I might subtract 100,
6077.2 ->  I don't know. But we're going to add this bias term. And the output of all these things. So
6084.96 ->  the sum of this, this, this and this, go into something known as an activation function,
6091.28 ->  okay. And then after applying this activation function, we get an output. And this is what a
6098.96 ->  neuron would look like. Now a whole network of them would look something like this.
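To make that single-neuron computation concrete, here's a minimal sketch; the inputs, weights, and bias are made-up numbers, not anything from the dataset:

import numpy as np

def neuron(x, w, b, activation):
    # weighted sum of the inputs, plus the bias, passed through an activation function
    return activation(np.dot(w, x) + b)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])     # made-up feature vector
w = np.array([0.5, -0.2, 0.1])    # made-up weights
b = 0.1                           # made-up bias term

print(neuron(x, w, b, sigmoid))   # a single number between 0 and 1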
So I kind of glossed over this activation function. What exactly is that? This is what a neural net
6113.76 ->  looks like if we have all our inputs here. And let's say all of these arrows represent some sort
6118.72 ->  of addition, right? Then what's going on is we're just adding a bunch of times, right? We're adding
some sort of weight times these inputs a bunch of times. And then if we were to go back
6133.84 ->  and factor that all out, then this entire neural net is just a linear combination of these input
6142 ->  layers, which I don't know about you, but that just seems kind of useless, right? Because we could
6147.84 ->  literally just write that out in a formula, why would we need to set up this entire neural network,
6153.28 ->  we wouldn't. So the activation function is introduced, right? So without an activation
6160 ->  function, this just becomes a linear model. An activation function might look something like
6166.88 ->  this. And as you can tell, these are not linear. And the reason why we introduce these is so that
6172.88 ->  our entire model doesn't collapse on itself and become a linear model. So over here, this is
6178.48 ->  something known as a sigmoid function, it runs between zero and one, tanh runs between negative
6184.08 ->  one all the way to one. And this is ReLU, which anything less than zero is zero, and then anything
6190.72 ->  greater than zero is linear. So with these activation functions, every single output of a neuron
is no longer just a linear combination of the inputs, it's some sort of nonlinearly transformed value, which means
that the whole network, you know, doesn't collapse on itself, it doesn't
6212.88 ->  become linear, because we've introduced all these nonlinearities. So this is a training set, the
6219.92 ->  model, the loss, right? And then we do this thing called training, where we have to feed the loss
6225.44 ->  back into the model, and make certain adjustments to the model to improve this predicted output.
6235.2 ->  Let's talk a little bit about the training, what exactly goes on during that step.
6240.72 ->  Let's go back and take a look at our L2 loss function. This is what our L2 loss function
looks like: it's a quadratic formula, right? Well, up here, the error is really, really
6255.84 ->  large. And our goal is to get somewhere down here, where the loss is decreased, right? Because that
6263.2 ->  means that our predicted value is closer to our true value. So that means that we want to go
6270.72 ->  this way. Okay. And thanks to a lot of properties of math, something that we can do is called
gradient descent, in order to follow this slope down this way. This quadratic has
different slopes with respect to different values. Okay, so the loss with respect to some weight
6303.12 ->  w zero, versus w one versus w n, they might all be different. Right? So some way that I kind of
6312.48 ->  think about it is, to what extent is this value contributing to our loss. And we can actually
6318.32 ->  figure that out through some calculus, which we're not going to touch up on in this specific course.
6324.4 ->  But if you want to learn more about neural nets, you should probably also learn some calculus
6329.6 ->  and figure out what exactly back propagation is doing, in order to actually calculate, you know,
6335.36 ->  how much do we have to backstep by. So the thing is here, you might notice that this follows
6341.76 ->  this curve at all of these different points. And the closer we get to the bottom, the smaller
6348.48 ->  this step becomes. Now stick with me here. So my new value, this is what we call a weight update,
6357.84 ->  I'm going to take w zero, and I'm going to set some new value for w zero. And what I'm going to
6364.8 ->  set for that is the old value of w zero, plus some factor, which I'll just call alpha for now,
6373.68 ->  times whatever this arrow is. So that's basically saying, okay, take our old w zero, our old weight,
6383.04 ->  and just decrease it this way. So I guess increase it in this direction, right, like take a step in
this direction. But this alpha here is telling us, okay, don't take a huge step, right,
just in case we're wrong; take a small step in that direction, see if we get any
6398.8 ->  closer. And for those of you who, you know, do want to look more into the mathematics of things,
the reason why I use a plus here is because this here is the negative gradient;
if you were to use the actual gradient, this should be a minus.
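Here's a toy sketch of that update rule on a one-dimensional quadratic loss. The loss, starting weight, and step size are all made up, and the update is written with a minus and the ordinary gradient, which is the same thing as adding the negative gradient as described above:

# minimize an L2-style loss L(w) = (w - 3)**2 with gradient descent
def grad(w):
    return 2 * (w - 3)        # derivative of the loss with respect to w

w = 10.0                      # made-up starting weight
alpha = 0.1                   # learning rate: how big a step we take each time

for _ in range(50):
    w = w - alpha * grad(w)   # step against the gradient

print(w)                      # converges toward 3, where this loss is smallest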
6414.72 ->  Now this alpha is something that we call the learning rate. Okay, and that adjusts how quickly
we're taking steps. And that, you know, will ultimately control
6427.84 ->  how long it takes for our neural net to converge. Or sometimes if you set it too high, it might even
6433.04 ->  diverge. But with all of these weights, so here I have w zero, w one, and then w n. We make the same
6441.84 ->  update to all of them after we calculate the loss, the gradient of the loss with respect to that
6449.84 ->  weight. So that's how back propagation works. And that is everything that's going on here. After we
6457.68 ->  calculate the loss, we're calculating gradients, making adjustments in the model. So we're setting
all the weights to something adjusted slightly. And then we're saying, okay, let's take the training set and run it through the model
6475.12 ->  again, and go through this loop all over again. So for machine learning, we already have seen some
6481.84 ->  libraries that we use, right, we've already seen SK learn. But when we start going into neural
6489.04 ->  networks, this is kind of what we're trying to program. And it's not very fun to try to
6499.92 ->  do this from scratch, because not only will we probably have a lot of bugs, but also probably
6505.76 ->  not going to be fast enough, right? Wouldn't it be great if there are just some, you know,
6510.8 ->  full time professionals that are dedicated to solving this problem, and they could literally
6515.76 ->  just give us their code that's already running really fast? Well, the answer is, yes, that exists.
6523.36 ->  And that's why we use TensorFlow. So TensorFlow makes it really easy to define these models. But
6529.36 ->  we also have enough control over what exactly we're feeding into this model. So for example,
6535.6 ->  this line here is basically saying, Okay, let's create a sequential neural net. So sequential is
6542.64 ->  just, you know, what we've seen here, it just goes one layer to the next. And a dense layer means that
all of them are interconnected. So here, this is interconnected with all of these
6553.36 ->  nodes, and this one's all these, and then this one gets connected to all of the next ones, and so on.
6559.84 ->  So we're going to create 16 dense nodes with relu activation functions. And then we're going
6566.8 ->  to create another layer of 16 dense nodes with relu activation. And then our output layer is going
6574 ->  to be just one node. Okay. And that's how easy it is to define something in TensorFlow. So TensorFlow
6583.2 ->  is an open source library that helps you develop and train your ML models. Let's implement this
6591.2 ->  for a neural net. So we're using a neural net for classification. Now, so our neural net model,
6598.24 ->  we are going to use TensorFlow, and I don't think I imported that up here. So we are going to import
6603.84 ->  that down here. So I'm going to import TensorFlow as TF. And enter. Cool. So my neural net model
6619.28 ->  is going to be, I'm going to use this. So essentially, this is saying layer all these
things that I'm about to pass in. So yeah, a linear stack of layers, layered together as a model.
6635.76 ->  And what that means, nope, not that. So what that means is I can pass in
6642.72 ->  some sort of layer, and I'm just going to use a dense layer.
6646.56 ->  Oops, dot dense. And let's say we have 32 units. Okay, I will also
set the activation as relu. And for the first layer, we have to specify the input shape. So here we have 10,
comma. Alright, so that's our first layer. Now our next layer, I'm just going to have
6679.68 ->  another dense layer of 32 units all using relu. And that's it. So for the final layer, this is
6688.88 ->  just going to be my output layer, it's going to just be one node. And the activation is going to
6695.76 ->  be sigmoid. So if you recall from our logistic regression, what happened there was when we had
6703.12 ->  a sigmoid, it looks something like this, right? So by creating a sigmoid activation on our last layer,
6709.6 ->  we're essentially projecting our predictions to be zero or one, just like in logistic regression.
6717.44 ->  And that's going to help us, you know, we can just round to zero or one and classify that way.
6723.28 ->  Okay. So this is my neural net model. And I'm going to compile this. So in TensorFlow,
6732 ->  we have to compile it. It's really cool, because I can just literally pass in what type of optimizer
I want, and it'll do it. So here, if I go to optimizers, I'm actually going to use Adam.
6744.72 ->  And you'll see that, you know, the learning rate is 0.001. So I'm just going to use that default.
6751.04 ->  So 0.001. And my loss is going to be binary cross entropy. And the metrics that I'm also going to
include on here, so it already will consider loss, but I'm also going to tack on accuracy.
6770.08 ->  So we can actually see that in a plot later on. Alright, so I'm going to run this.
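Putting those pieces together, the model cell looks roughly like this; a sketch based on the narration (32 and 32 relu units, one sigmoid output, Adam at 0.001, binary cross entropy, accuracy), with the input shape of 10 matching the number of features in this dataset:

import tensorflow as tf

nn_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),  # first hidden layer
    tf.keras.layers.Dense(32, activation='relu'),                     # second hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),                   # output squashed to (0, 1)
])

nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss='binary_crossentropy',
                 metrics=['accuracy'])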
6775.6 ->  And one thing that I'm going to also do is I'm going to define these plot definitions. So I'm
6781.76 ->  actually copying and pasting this, I got these from TensorFlow. So if you go on to some TensorFlow
6786.8 ->  tutorial, they actually have these, this like, defined. And that's exactly what I'm doing here.
6793.12 ->  So I'm actually going to move this cell up, run that. So we're basically plotting the loss
over all the different epochs; epochs means, like, training cycles. And we're also going to plot the accuracy over all the epochs.
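Something along these lines; this is a sketch, since the real helpers were copied from a TensorFlow tutorial, and the exact names in the notebook may differ:

import matplotlib.pyplot as plt

def plot_loss(history):
    # loss over all the training epochs, for the training and validation splits
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Binary crossentropy')
    plt.legend()
    plt.grid(True)
    plt.show()

def plot_accuracy(history):
    # accuracy over all the training epochs
    plt.plot(history.history['accuracy'], label='accuracy')
    plt.plot(history.history['val_accuracy'], label='val_accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)
    plt.show()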
6808.96 ->  Alright, so we have our model. And now all that's left is, let's train it. Okay.
6817.2 ->  So I'm going to say history. So TensorFlow is great, because it keeps track of the history
6822.72 ->  of the training, which is why we can go and plot it later on. Now I'm going to set that equal to
6827.68 ->  this neural net model. And fit that with x train, y train, I'm going to make the number of epochs
6839.28 ->  equal to let's say just let's just use 100 for now. And the batch size, I'm going to set equal to,
6846.16 ->  let's say 32. Alright. And the validation split. So what the validation split does, if it's down
6858.16 ->  here somewhere. Okay, so yeah, this validation split is just the fraction of the training data
6863.92 ->  to be used as validation data. So essentially, every single epoch, what's going on is TensorFlow
saying, leave a certain fraction out; if this is 0.2, then leave 20% out. And we're going to test how the
6877.2 ->  model performs on that 20% that we've left out. Okay, so it's basically like our validation data
6882.56 ->  set. But TensorFlow does it on our training data set during the training. So we have now a measure
6888.8 ->  outside of just our validation data set to see, you know, what's going on. So validation split,
6894.64 ->  I'm going to make that 0.2. And we can run this. So if I run that, all right, and I'm actually going
6905.76 ->  to set verbose equal to zero, which means, okay, don't print anything, because printing something
6913.76 ->  for 100 epochs might get kind of annoying. So I'm just going to let it run, let it train,
6919.68 ->  and then we'll see what happens. Cool, so it finished training. And now what I can do is
6931.04 ->  because you know, I've already defined these two functions, I can go ahead and I can plot the loss,
6936.96 ->  oops, loss of that history. And I can also plot the accuracy throughout the training.
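For reference, the training call and the two plots together look roughly like this; a sketch, using the model and helper names from the sketches above:

history = nn_model.fit(X_train, y_train,
                       epochs=100, batch_size=32,
                       validation_split=0.2,   # set aside 20% of the training data and evaluate on it each epoch
                       verbose=0)              # don't print per-epoch progress

plot_loss(history)
plot_accuracy(history)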
6945.2 ->  So this is a little bit ish what we're looking for. We definitely are looking for a steadily
6952.24 ->  decreasing loss and an increasing accuracy. So here we do see that, you know, our validation
6959.12 ->  accuracy improves from around point seven, seven or something all the way up to somewhere around
6967.2 ->  point, maybe eight one. And our loss is decreasing. So this is good. It is expected that the validation
6976.88 ->  loss and accuracy is performing worse than the training loss or accuracy. And that's because
6983.36 ->  our model is training on that data. So it's adapting to that data. Whereas the validation stuff is,
6988.48 ->  you know, stuff that it hasn't seen yet. So, so that's why. So in machine learning, as we saw above,
6995.76 ->  we could change a bunch of the parameters, right? Like I could change this to 64. So now it'd be
7000.16 ->  a row of 64 nodes, and then 32, and then one. So I can change some of these parameters.
7007.68 ->  And a lot of machine learning is trying to find, hey, what do we set these hyper parameters to?
So what I'm actually going to do is I'm going to rewrite this so that we can do something that's
7022.08 ->  known as a grid search. So we can search through an entire space of hey, what happens if, you know,
7028.08 ->  we have 64 nodes and 64 nodes, or 16 nodes and 16 nodes, and so on. And then on top of all that,
7039.2 ->  we can, you know, we can change this learning rate, we can change how many epochs we can change,
7046.64 ->  you know, the batch size, all these things might affect our training. And just for kicks,
7053.04 ->  I'm also going to add what's known as a dropout layer in here. And what dropout is doing is
saying, hey, randomly choose, at this rate, certain nodes, and don't train them in, you know,
7071.12 ->  in a certain iteration. So this helps prevent overfitting. Okay, so I'm actually going to
7079.76 ->  define this as a function called train model, we're going to pass in x train, y train,
7087.92 ->  the number of nodes, the dropout, you know, the probability that we just talked about
7095.76 ->  learning rate. So I'm actually going to say lr batch size. And we can also pass in number epochs,
7107.2 ->  right? I mentioned that as a parameter. So indent this, so it goes under here. And with these two,
7114.32 ->  I'm going to set this equal to number of nodes. And now with the two dropout layers, I'm going
7120.8 ->  to set dropout prob. So now you know, the probability of turning off a node during the training
7128.72 ->  is equal to dropout prob. And I'm going to keep the output layer the same. Now I'm compiling it,
7135.36 ->  but this here is now going to be my learning rate. And I still want binary cross entropy and
7140.48 ->  accuracy. We are actually going to train our model inside of this function. But here we can do the
7152.64 ->  epochs equal epochs, and this is equal to whatever, you know, we're passing in x train,
7159.2 ->  y train belong right here. Okay, so those are getting passed in as well. And finally, at the
7165.28 ->  end, I'm going to return this model and the history of that model. Okay. So now what I'll do
is let's just go through all of these. So let's keep epochs at 100. And now what I can
do is I can say, hey, for the number of nodes in, let's say, 16, 32, and 64, to see what
7193.28 ->  happens for the different dropout probabilities. And I mean, zero would be nothing. Let's use 0.2.
7202.96 ->  Also, to see what happens. You know, for the learning rate in 0.005, 0.001. And you know,
maybe we want to throw 0.1 in there as well. And then for the batch size, let's do 16, 32,
and 64 as well. Actually, let's also throw in 128, and let's get rid of 16. Sorry,
7233.68 ->  so 128 in there. That should be 01. I'm going to record the model and history using this
7244.08 ->  train model here. So we're going to do x train y train, the number of nodes is going to be,
7254.64 ->  you know, the number of nodes that we've defined here, dropout, prob, LR, batch size, and epochs.
7264.24 ->  Okay. And then now we have both the model and the history. And what I'm going to do is again,
7270.48 ->  I want to plot the loss for the history. I'm also going to plot the accuracy.
7279.84 ->  Probably should have done them side by side, that probably would have been easier.
Okay, so what I'm going to do is split this up. And that will be
7294.4 ->  the subplots. So now this is just saying, okay, I want one row and two columns in that row for my
plots. Okay, so I'm going to plot the loss on my first axis. I don't actually know if this is going to
work. Okay, we don't care about the grid. Yeah, let's keep the grid. And then now my other axis.
7329.2 ->  So now on here, I'm going to plot all the accuracies on the second plot.
7340.16 ->  I might have to debug this a bit.
7341.84 ->  We should be able to get rid of that. If we run this, we already have history saved as a variable
7347.68 ->  in here. So if I just run it on this, okay, it has no attribute x label. Oh, I think it's because
it's like set_xlabel or something. Okay, yeah, so it's set_xlabel and set_ylabel instead of just xlabel and ylabel.
7367.68 ->  So let's see if that works. All right, cool. Um, and let's actually make this a bit larger.
7375.44 ->  Okay, so we can actually change the figure size that I'm gonna set. Let's see what happens if I
7379.92 ->  set that to. Oh, that's not the way I wanted it. Okay, so that looks reasonable.
7388.16 ->  And that's just going to be my plot history function. So now I can plot them side by side.
Here, I'm going to plot the history. And what I'm actually going to do first
is print out all these parameters, using an f-string to print out all of this stuff.
So here, I'm printing out how many nodes, the dropout probability,
and the learning rate.
7435.2 ->  And we already know how many you found, so I'm not even going to bother with that.
So once we plot this, let's actually also figure out what the validation
loss is on the validation set that we created all the way back up here.
Alright, so remember, we created three data sets. Let's call evaluate on our model to see what the
loss would be on that validation data set. And I actually want to record,
7473.52 ->  let's say I want to record whatever model has the least validation loss. So
7480.64 ->  first, I'm going to initialize that to infinity so that you know, any model will beat that score.
7485.36 ->  So if I do float infinity, that will set that to infinity. And maybe I'll keep
7493.6 ->  track of the parameters. Actually, it doesn't really matter. I'm just going to keep track of
7498.64 ->  the model. And I'm gonna set that to none. So now down here, if the validation loss is ever
7506.48 ->  less than the least validation loss, then I am going to simply come down here and say,
Hey, the least validation loss is now equal to this validation loss.
7521.6 ->  And the least loss model is whatever this model is that just earned that validation loss. Okay.
7531.84 ->  So we are actually just going to let this run for a while. And then we're going to get our least
loss model after that. So let's just run. All right, and now we wait.
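A sketch of that whole cell, under some assumptions: the hyperparameter lists are my reading of the narration, and X_valid / y_valid are my guesses for the names of the validation split created near the start of the notebook.

import tensorflow as tf

def train_model(X_train, y_train, num_nodes, dropout_prob, lr, batch_size, epochs):
    nn_model = tf.keras.Sequential([
        tf.keras.layers.Dense(num_nodes, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dropout(dropout_prob),   # randomly turn off nodes during training
        tf.keras.layers.Dense(num_nodes, activation='relu'),
        tf.keras.layers.Dropout(dropout_prob),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                     loss='binary_crossentropy', metrics=['accuracy'])
    history = nn_model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                           validation_split=0.2, verbose=0)
    return nn_model, history

least_val_loss = float('inf')   # any model will beat this
least_loss_model = None
epochs = 100

for num_nodes in [16, 32, 64]:
    for dropout_prob in [0, 0.2]:
        for lr in [0.005, 0.001, 0.1]:          # learning rates as narrated above
            for batch_size in [32, 64, 128]:
                print(f"{num_nodes} nodes, dropout {dropout_prob}, lr {lr}, batch size {batch_size}")
                model, history = train_model(X_train, y_train, num_nodes,
                                             dropout_prob, lr, batch_size, epochs)
                plot_loss(history)
                plot_accuracy(history)
                val_loss = model.evaluate(X_valid, y_valid, verbose=0)[0]  # [loss, accuracy]
                if val_loss < least_val_loss:
                    least_val_loss = val_loss
                    least_loss_model = model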
7551.84 ->  All right, so we've finally finished training. And you'll notice that okay, down here, the loss
7572.08 ->  actually gets to like 0.29. The accuracy is around 88%, which is pretty good. So you might be wondering,
okay, why is this accuracy different from that one? Like, these are both 'validation'. So this accuracy here
7586.24 ->  is on the validation data set that we've defined at the beginning, right? And this one here,
this is actually taking 20% of our training set every time during the training,
7595.84 ->  and saying, Okay, how much of it do I get right now? You know, after this one step where I didn't
7601.2 ->  train with any of that. So they're slightly different. And actually, I realized later on
that, you know, probably what I should have done is over here, when we were defining
7614.64 ->  the model fit, instead of the validation split, you can define the validation data.
7620.48 ->  And you can pass in the validation data, I don't know if this is the proper syntax. But
7625.44 ->  that's probably what I should have done. But instead, you know, we'll just stick with what
7629.44 ->  we have here. So you'll see at the end, you know, with the 64 nodes, it seems like this is our best
7636.72 ->  performance 64 nodes with a dropout of 0.2, a learning rate of 0.001, and a batch size of 64.
7645.44 ->  And it does seem like yes, the validation, you know, the fake validation, but the validation
7654 ->  loss is decreasing, and then the accuracy is increasing, which is a good sign. Okay,
7660.24 ->  so finally, what I'm going to do is I'm actually just going to predict. So I'm going to take
7665.04 ->  this model, which we've called our least loss model, I'm going to take this model,
7670.96 ->  and I'm going to predict x test on that. And you'll see that it gives me some values that
7678.16 ->  are really close to zero and some that are really close to one. And that's because we have a sigmoid
7682.16 ->  output. So if I do this, and what I can do is I can cast them. So I'm going to say anything that's
greater than 0.5, set that to one. So let's see what happens if I do this.
7702.4 ->  Oh, okay, so I have to cast that as type. And so now you'll see that it's ones and zeros. And I'm
7709.76 ->  actually going to transform this into a column as well. So here I'm going to Oh, oops, I didn't
mean to do that. Okay, no, I wanted to just reshape it. So now it's one dimensional.
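That thresholding and reshaping step as a sketch, using the least_loss_model and X_test names from above:

import numpy as np

y_pred = least_loss_model.predict(X_test)          # sigmoid outputs between 0 and 1
y_pred = (y_pred > 0.5).astype(int).reshape(-1,)   # cast to 0/1 and flatten to one dimension
print(np.unique(y_pred))                           # just 0s and 1s now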
Okay. And using that, we can actually just rerun the classification report based on this
neural net output. And you'll see that, okay, the accuracy gives us 87%. So it
7744.88 ->  seems like what happened here is the precision on class zero. So the hadrons has increased a bit,
7752.56 ->  but the recall decreased. But the F one score is still at a good point eight one. And for the other
class, it looked like the precision decreased a bit and the recall increased, for an overall F1 score
that's also been increased. I think I interpreted that properly. I mean, we went through all this
7771.44 ->  work and we got a model that performs actually very, very similarly to the SVM model that we
7777.84 ->  had earlier. And the whole point of this exercise was to demonstrate, okay, these are how you can
7783.04 ->  define your models. But it's also to say, hey, maybe, you know, neural nets are very, very
7788.72 ->  powerful, as you can tell. But sometimes, you know, an SVM or some other model might actually be more
7795.84 ->  appropriate. But in this case, I guess it didn't really matter which one we use at the end. An 87%
7804.4 ->  accuracy score is still pretty good. So yeah, let's now move on to regression.
7811.84 ->  We just saw a bunch of different classification models. Now let's shift gears into regression,
7817.04 ->  the other type of supervised learning. If we look at this plot over here, we see a bunch of scattered
7823.28 ->  data points. And here we have our x value for those data points. And then we have the corresponding y
7831.44 ->  value, which is now our label. And when we look at this plot, well, our goal in regression is to find
7840.08 ->  the line of best fit that best models this data. Essentially, we're trying to let's say we're given
7848.16 ->  some new value of x that we don't have in our sample, we're trying to say, okay, what would my
7854.16 ->  prediction for y be for that given x value. So that, you know, might be somewhere around there.
7863.28 ->  I don't know. But remember, in regression that, you know, given certain features,
7868.4 ->  we're trying to predict some continuous numerical value for y.
7872.08 ->  In linear regression, we want to take our data and fit a linear model to this data. So in this case,
7881.2 ->  our linear model might look something along the lines of here. Right. So this here would be
7890.08 ->  considered as maybe our line of best fit. And this line is modeled by the equation, I'm going to write
7901.12 ->  it down here, y equals b zero, plus b one x. Now b zero just means it's this y intercept. So if we
extend this line down here, this value here is b zero, and then b one defines the slope of this line. Okay. All right. So that's the formula
7929.68 ->  for linear regression. And how exactly do we come up with that formula? What are we trying to do
7937.12 ->  with this linear regression? You know, we could just eyeball where the line should be, but humans are
7943.28 ->  not very good at eyeballing certain things like that. I mean, we can get close, but a computer is
7949.28 ->  better at giving us a precise value for b zero and b one. Well, let's introduce the concept of
7957.52 ->  something known as a residual. Okay, so residual, you might also hear this being called the error.
7967.2 ->  And what that means is, let's take some data point in our data set. And we're going to evaluate how
7975.04 ->  far off is our prediction from a data point that we already have. So this here is our y, let's say,
7984 ->  this is 1, 2, 3, 4, 5, 6, 7, 8. So this is y eight, let's call it. You'll see that I use this y i
7995.12 ->  in order to represent, hey, just one of these points. Okay. So this here is y eight, and this here
8003.04 ->  would be the prediction. Oops, this here would be the prediction for y eight, which I've labeled
8010.72 ->  with this hat. Okay, if it has a hat on it, that means, hey, this is my guess, this is
8015.2 ->  my prediction for, you know, this specific value of x. Okay. Now the residual would be this distance
8028.24 ->  here between y eight and y hat eight. So y eight minus y hat eight. All right, because that would
8038.72 ->  give us this here. And I'm just going to take the absolute value of this. Because what if it's below
8044.4 ->  the line, right, then you would get a negative value, but distance can't be negative. So we're
8048.88 ->  just going to put a little absolute value around this quantity.
8055.28 ->  And that gives us the residual or the error. So let me rewrite that. And you know, to generalize
8063.52 ->  to all the points, I'm going to say the residual can be calculated as y i minus y hat of i. Okay.
8072.96 ->  So this just means the distance between some given point, and its prediction, its corresponding
8079.28 ->  prediction on the line. So now, with this residual, this line of best fit is generally trying to
8087.68 ->  decrease these residuals as much as possible. So now that we have some value for the error,
8095.84 ->  our line of best fit is trying to decrease the error as much as possible for all of the different
8100.64 ->  data points. And that might mean, you know, minimizing the sum of all the residuals. So this
8107.84 ->  here, this is the sum symbol. And if I just stick the residual calculation in there,
8116.64 ->  it looks something like that, right. And I'm just going to say, okay, for all of the i's in our
8121.2 ->  data set, so for all the different points, we're going to sum up all the residuals. And I'm going
8127.68 ->  to try to decrease that with my line of best fit. So I'm going to find the B0 and B1, which gives
8133.2 ->  me the lowest value of this. Okay. Now, you know, sometimes in different circumstances,
8141.68 ->  we might attach a squared to that. So we're trying to decrease the sum of the squared residuals.
8149.04 ->  And what that does is it just, you know, adds a higher penalty for
8163.52 ->  points that are further off. So that is linear regression, we're trying to find
8168.64 ->  this equation, some line of best fit that will help us decrease this measure of error
8175.52 ->  with respect to all the data points that we have in our data set, and try to come up with
8179.92 ->  the best prediction for all of them. This is known as simple linear regression.
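Here's a minimal sketch of those ideas on made-up data, using numpy's least-squares fit to find the b0 and b1 that minimize the sum of squared residuals (everything below is illustrative, not the notebook's code):

```python
import numpy as np

# Toy 1-D data set (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.shape)

# Fit y = b0 + b1*x; np.polyfit performs the least-squares minimization
b1, b0 = np.polyfit(x, y, deg=1)

y_hat = b0 + b1 * x                       # predictions on the line
residuals = y - y_hat                     # y_i - y_hat_i
sum_sq_residuals = np.sum(residuals ** 2)

print(f"b0={b0:.3f}, b1={b1:.3f}, sum of squared residuals={sum_sq_residuals:.3f}")
```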
8190.88 ->  And basically, that means, you know, our equation looks something like this. Now, there's also
8199.52 ->  multiple linear regression, which just means that hey, if we have more than one value for x, so like
8212.48 ->  think of our feature vectors, we have multiple values in our x vector, then our predictor might
8218.56 ->  look something more like this. Actually, I'm just going to say etc, plus b n, x n. So now I'm coming
8231.2 ->  up with some coefficient for all of the different x values that I have in my vector. Now you guys
8238.96 ->  might have noticed that I have some assumptions over here. And you might be asking, okay, Kylie,
8243.04 ->  what in the world do these assumptions mean? So let's go over them.
8246.56 ->  The first one is linearity.
8253.84 ->  And what that means is, let's say I have a data set. Okay.
8263.76 ->  Linearity just means, okay, does my data follow a linear pattern? Does y increase as x
8270.96 ->  increases? Or does y decrease as x increases? So if y increases or decreases at a constant
8279.28 ->  rate as x increases, then you're probably looking at something linear. So what's an example of a
8284.72 ->  nonlinear data set? Let's say I had data that might look something like that. Okay. So now just
8292.96 ->  visually judging this, you might say, okay, seems like the line of best fit might actually be some
8298.72 ->  curve like this. Right. And in this case, we don't satisfy that linearity assumption anymore.
8309.68 ->  So with linearity, we basically just want our data set to follow some sort of linear trajectory.
8319.28 ->  And independence, our second assumption
8322.64 ->  just means this point over here, it should have no influence on this point over here,
8330.08 ->  or this point over here, or this point over here. So in other words, all the points,
8336 ->  all the samples in our data set should be independent. Okay, they should not rely on
8343.44 ->  one another, they should not affect one another.
8345.84 ->  Okay, now, normality and homoscedasticity, those are concepts which use this residual. Okay. So if
8357.12 ->  I have a plot that looks something like this, and I have a plot that looks like this. Okay,
8371.12 ->  something like this. And my line of best fit is somewhere here, maybe it's something like that.
8387.2 ->  In order to look at these normality and homoscedasticity assumptions, let's look at
8392 ->  the residual plot. Okay. And what that means is I'm going to keep my same x axis. But instead
8403.44 ->  of plotting now where they are relative to this y, I'm going to plot these errors. So now I'm
8409.36 ->  going to plot y minus y hat like this. Okay. And now you know, this one is slightly positive,
8419.2 ->  so it might be here, this one down here is negative, it might be here. So our residual plot,
8425.84 ->  it's literally just a plot of how you know, the values are distributed around our line of best
8430.08 ->  fit. So it looks like it might, you know, look something like this. Okay. So this might be our
8442.88 ->  residual plot. And what normality means, so our assumptions are normality and homoscedasticity,
8459.28 ->  I might have butchered that spelling, I don't really know. But what normality is saying is,
8465.12 ->  okay, these residuals should be normally distributed. Okay, around this line of best fit,
8472.96 ->  it should follow a normal distribution. And now what homoscedasticity says is, okay, the variance
8481.6 ->  of these points should remain constant throughout. So this spread here should be approximately the
8488.4 ->  same as this spread over here. Now, what's an example of where you know, homoscedasticity is
8495.2 ->  not held? Well, let's say that our original plot actually looks something like this.
8506.48 ->  Okay, so now if we looked at the residuals for that, it might look something
8511.6 ->  like that. And now if we look at this spread of the points, it decreases, right? So now the spread
8523.6 ->  is not constant, which means that homoscedasticity, this assumption would not be fulfilled, and it
8532.56 ->  might not be appropriate to use linear regression. So that's just linear regression. Basically,
8538.56 ->  we have a bunch of data points, we want to predict some y value for those. And we're trying to come
8545.68 ->  up with this line of best fit that best describes, hey, given some value x, what would be my best
8552.64 ->  guess of what y is. So let's move on to how do we evaluate a linear regression model. So the first
8563.04 ->  measure that I'm going to talk about is known as mean absolute error, or MAE
8572.08 ->  for short, okay. And mean absolute error is basically saying, all right, let's take
8579.04 ->  all the errors. So all these residuals that we talked about, let's sum up the distance
8586.08 ->  for all of them, and then take the average. And then that can describe, you know, how far off are
8591.44 ->  we. So the mathematical formula for that would be, okay, let's take all the residuals.
8601.68 ->  Alright, so this is the distance. Actually, let me redraw a plot down here. So
8607.44 ->  suppose I have a data set that looks like this. And here are all my data points, right. And now let's
8621.44 ->  say my line looks something like that. So my mean absolute error would be summing up all of these
8632.32 ->  values. This was a mistake. So summing up all of these, and then dividing by how many data points
8641.6 ->  I have. So what would be all the residuals, it would be y i, right, so every single point,
8648.64 ->  minus y hat i, so the prediction for that on here. And then we're going to sum over
8656.16 ->  all of the different i's in our data set. Right, so i, and then we divide by the number of points
8664.32 ->  we have. So actually, I'm going to rewrite this to make it a little clearer. So i is equal to
8669.12 ->  whatever the first data point is all the way through the nth data point. And then we divide
8673.68 ->  it by n, which is how many points there are. Okay, so this is our measure of MAE. And this is basically
8682.4 ->  telling us, okay, on average, this is the distance between our predicted value and the
8690.48 ->  actual value in our training set. Okay. And MAE is good because it allows us to, you know, when we
8701.36 ->  get this value here, we can literally directly compare it to whatever units the y value is in.
8708.72 ->  So let's say y is, we're talking, you know, the prediction of the price of a house, right, in
8717.92 ->  dollars. Once we calculate the MAE, we can literally say, oh, on average, you know,
8724.72 ->  how much we're off by is literally this many dollars. Okay. So that's the
8734.32 ->  mean absolute error. An evaluation technique that's also closely related to that is called the mean
8740.16 ->  squared error, or MSE for short. Okay. Now, if I take this plot again, and I duplicate it
8753.28 ->  and move it down here, well, the gist of mean squared error is kind of the same, but instead
8759.36 ->  of the absolute value, we're going to square. So now the MSE is something along the lines of,
8766.16 ->  okay, let's sum up something, right, so we're going to sum up all of our errors.
8773.28 ->  So now I'm going to do y i minus y hat i. But instead of absolute valuing them,
8779.12 ->  I'm going to square them all. And then I'm going to divide by n in order to find the mean. So
8785.36 ->  basically, now I'm taking all of these different values, and I'm squaring them first before I add
8793.2 ->  them to one another. And then I divide by n. And the reason why we like using mean squared error
8802.08 ->  is that it helps us punish large errors in the prediction. And later on, MSE might be important
8809.68 ->  because of differentiability, right? So if you're familiar with calculus, a quadratic function
8815.76 ->  is differentiable everywhere, whereas the absolute
8820.72 ->  value function is not differentiable everywhere. But if you don't understand that,
8825.28 ->  don't worry about it, you won't really need it right now. And now one downside of mean squared
8830.56 ->  error is that once I calculate the mean squared error over here, and I go back over to y, and I
8836.24 ->  want to compare the values. Well, it gets a little bit trickier to do that because now my mean squared
8845.36 ->  error is in terms of y squared, right? This is now squared. So instead of just, you know,
8853.28 ->  how many dollars off am I, I'm talking how many dollars squared off am I. And that,
8860.08 ->  you know, to humans, it doesn't really make that much sense. Which is why we have created
8865.44 ->  something known as the root mean squared error. And I'm just going to copy this diagram over here
8873.6 ->  because it's very, very similar to mean squared error. Except now we take a big square root.
8883.28 ->  Okay, so this is our MSE, and we take the square root of that mean squared error. And so now the
8890.64 ->  term in which, you know, we're defining our error is in terms of that dollar sign symbol again.
8897.76 ->  So a pro of root mean squared error is that now we can say, okay, our error according
8903.28 ->  to this metric is this many dollars off from our prediction. Okay, so it's in the same unit,
8910.32 ->  which is one of the pros of root mean squared error. And now finally, there is the coefficient
8917.68 ->  of determination, or r squared. And this is a formula for r squared. So r squared is equal
8923.2 ->  to one minus RSS over TSS. Okay, so what does that mean? Basically, RSS stands for the sum
8936.64 ->  of the squared residuals. So maybe it should be SSR instead, but RSS, the
8943.92 ->  sum of the squared residuals, is equal to: if I take the sum of all the values
8954.8 ->  of y i minus y hat i, and square that, that is my RSS, right, it's the sum of the squared
8964.8 ->  residuals. Now TSS, let me actually use a different color for that.
8970.64 ->  So TSS is the total sum of squares.
8981.04 ->  And what that means is that instead of being with respect to this prediction,
8988.88 ->  we are instead going to
8992.08 ->  take each y value and just subtract the mean of all the y values, and square that.
9000.8 ->  Okay, so if I drew this out,
9013.52 ->  and if this were my
9016 ->  actually, let's use a different color. Let's use green. If this were my predictor,
9024.8 ->  so RSS is giving me this measure here, right? It's giving me some estimate of how far off we are from
9033.04 ->  our regressor that we predicted. Actually, I'm gonna take this one, and I'm gonna take this one,
9041.84 ->  and actually, I'm going to use red for that. Well, TSS, on the other hand, is saying, okay,
9052.64 ->  how far off are these values from the mean. So if we literally didn't do any calculations for the
9059.04 ->  line of best fit, if we just took all the y values and average all of them, and said, hey,
9064.8 ->  this is the average value for every single x value, I'm just going to predict that average value
9070.16 ->  instead, then it's asking, okay, how far off are all these points from that line?
9079.12 ->  Okay, and remember that this square means that we're punishing larger errors, right? So even if
9086.08 ->  they look somewhat close in terms of distance, the further away a few data points are, the
9092.96 ->  larger our total sum of squares is going to be. Sorry, that was my dog. So the total sum of
9099.44 ->  squares is taking all of these values and saying, okay, what is the sum of squares, if I didn't do
9104.96 ->  any regressor, and I literally just calculated the average of all the y values in my data set,
9111.12 ->  and for every single x value, I'm just going to predict that average, which means that okay,
9115.44 ->  like, that means that maybe y and x aren't associated with each other at all. Like the
9120.72 ->  best thing that I can do for any new x value, just predict, hey, this is the average of my data set.
9125.6 ->  And this total sum of squares is saying, okay, well, with respect to that average,
9132.24 ->  what is our error? Right? So up here, the sum of the squared residuals, this is telling us what is
9139.92 ->  our error with respect to this line of best fit? Well, our total sum of squares is
9146.8 ->  saying what is the error with respect to, you know, just the average y value. And if our line
9154.56 ->  of best fit is a better fit than just that average, that means that, you know,
9166.08 ->  this numerator is going to be smaller than this denominator, right?
9172.32 ->  And if our errors in our line of best fit are much smaller, then that means that this ratio
9179.6 ->  of the RSS over TSS is going to be very small, which means that R squared is going to go towards
9186.96 ->  one. And now when R squared is close to one, that's usually a sign that we have a
9194.32 ->  good predictor. It's one of the signs, not the only one. So over here, I also have, you know,
9204.72 ->  that there's this adjusted R squared. And what that does, it just adjusts for the number of terms.
9209.84 ->  So x1, x2, x3, etc. It adjusts for how many extra terms we add, because usually when we,
9217.28 ->  you know, add an extra term, the R squared value will increase because that'll help us predict
9222.48 ->  y some more. But the value for the adjusted R squared will only increase if the new term actually
9228.88 ->  improves this model fit more than expected, you know, by chance. So that's what adjusted
9234 ->  R squared is. You know, it's a bit out of the scope of this one specific course.
9238.16 ->  And now that's linear regression. Basically, I've covered the concept of residuals or errors.
9245.28 ->  And, you know, how do we use that in order to find the line of best fit? And you know,
9251.04 ->  our computer can do all the calculations for us, which is nice. But behind the scenes,
9255.2 ->  it's trying to minimize that error, right? And then we've gone through all the different
9260.4 ->  ways of actually evaluating a linear regression model and the pros and cons of each one.
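Before the example, here's a small sketch tying those four metrics together on toy numbers (illustrative only; scikit-learn's helpers compute the same quantities):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae  = np.mean(np.abs(y_true - y_pred))        # mean absolute error
mse  = np.mean((y_true - y_pred) ** 2)         # mean squared error
rmse = np.sqrt(mse)                            # root mean squared error

rss = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
tss = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2  = 1 - rss / tss                            # coefficient of determination

# sklearn agrees with the hand-rolled versions
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(mae, mse, rmse, r2)
```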
9266.56 ->  So now let's look at an example. So we're still on supervised learning. But now we're just going to
9271.76 ->  talk about regression. So what happens when you don't just want to predict, you know, class 1, 2, 3?
9277.12 ->  What happens if you actually want to predict a certain value? So again, I'm on the UCI machine
9283.84 ->  learning repository. And here I found this data set about bike sharing in Seoul, South Korea.
9295.04 ->  So this data set is predicting rental bike count. And here it's the count of bikes rented at each
9301.52 ->  hour. So what we're going to do, again, you're going to go into the data folder, and you're going
9308.16 ->  to download this CSV file. And we're going to move over to Colab again. And here I'm going to name
9319.52 ->  this FCC bikes regression. I don't remember what I called the last one. But yeah, FCC bikes
9329.68 ->  regression. Now I'm going to import a bunch of the same things that I did earlier. And, you know,
9339.6 ->  I'm going to also continue to import the oversampler and the standard scaler. And then I'm actually
9346.56 ->  also just going to let you guys know that I have a few more things I want to import. So this is a
9352.8 ->  library that lets us copy things. Seaborn is a wrapper over matplotlib. So it also allows us
9359.2 ->  to plot certain things. And then just letting you know that we're also going to be using
9363.28 ->  TensorFlow. Okay, so one more thing that we're also going to be using, we're going to use the
9367.92 ->  sklearn linear model library. Actually, let me make my screen a little bit bigger. So yeah,
9375.6 ->  awesome. Run this and that'll import all the things that we need. So again, I'm just going to,
9385.12 ->  you know, give some credit to where we got this data set. So let me copy and paste this UCI thing.
9398 ->  And I will also give credit to this here.
9406.56 ->  Okay, cool. All right, cool. So this is our data set. And again, it tells us all the different
9414.32 ->  attributes that we have right here. So I'm actually going to go ahead and paste this in here.
9425.28 ->  Feel free to copy and paste this if you want me to read it out loud, so you can type it.
9429.28 ->  It's bike count, hour, temp, humidity, wind, visibility, dew point temp, radiation, rain,
9438.96 ->  snow, and functional, whatever that means. Okay, so I'm going to come over here and import my data
9447.28 ->  by dragging and dropping. All right. Now, one thing that you guys might actually need to do is
9454.8 ->  you might actually have to open up the CSV because there were, at first, a few like forbidden
9461.36 ->  characters in mine, at least. So you might have to get rid of, like, I think there was a degree symbol here,
9466.32 ->  but my computer wasn't recognizing it. So I got rid of that. So you might have to go through
9470.64 ->  and get rid of some of those labels that are incorrect. I'm going to do this. Okay. But
9479.6 ->  after we've done that and we've imported it in here, I'm going to create a data frame from that. So,
9487.04 ->  all right, so now what I can do is I can read that CSV file and I can get the data into here.
9492.56 ->  So, like, data dot CSV. Okay, so now if I call data dot head, you'll see that I have all the
9501.36 ->  various labels, right? And then I have the data in there. So I'm going to from here, I'm actually
9512.08 ->  going to get rid of some of these columns that, you know, I don't really care about. So here,
9517.6 ->  I'm going to, when I type this in, I'm going to drop maybe the date, whether or not it's a
9524.16 ->  holiday, and the various seasons. So I'm just not going to care about these things. Axis equals
9533.04 ->  one means drop it from the columns. So now you'll see that okay, we still have, I mean,
9539.12 ->  I guess you don't really notice it. But if I set the data frame's columns equal to dataset cols,
9545.28 ->  and I look at, you know, the first five things, then you'll see that this is now our data set.
9551.28 ->  It's a lot easier to read. So another thing is, I'm actually going to take
9558.32 ->  df functional, and we're going to convert this. So remember that our computers are not very good
9564.24 ->  at language, we want it to be in zeros and ones. So here, I will convert that.
9570 ->  Well, if this is equal to yes, then that gets mapped as one, and then set type integer. All right.
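Pieced together, the data-loading cells just described might look roughly like this (the file name, the exact original header strings, and the short column names are assumptions based on the narration; the downloaded CSV may first need the special-character cleanup mentioned earlier):

```python
import pandas as pd

# Assumed file name for the downloaded UCI Seoul bike sharing data
df = pd.read_csv("SeoulBikeData.csv")

# Drop the columns we don't care about; axis=1 means drop columns, not rows.
# The exact header strings may differ depending on how the file was cleaned.
df = df.drop(["Date", "Holiday", "Seasons"], axis=1)

# Shorter names for what's left, matching the list read out above
dataset_cols = ["bike_count", "hour", "temp", "humidity", "wind", "visibility",
                "dew_pt_temp", "radiation", "rain", "snow", "functional"]
df.columns = dataset_cols

# Map the functional column from Yes/No text to 1/0 integers
df["functional"] = (df["functional"] == "Yes").astype(int)

df.head()
```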
9581.04 ->  Great. Cool. So the thing is, right now, these bike counts are for whatever hour. So
9588.56 ->  to make this example simpler, I'm just going to index on an hour, and I'm gonna say, okay,
9592.56 ->  we're only going to use that specific hour. So here, let's say
9599.36 ->  this data frame is only going to be the data frame where
9607.68 ->  the hour, let's say, equals 12. Okay, so it's noon. All right. So now you'll see that all the hours are
9617.6 ->  equal to 12. And I'm actually going to now drop that column: hour, axis equals one. Alright,
9631.12 ->  so we run this cell. Okay, so now we got rid of the hour in here. And we just have the bike count,
9638.48 ->  the temperature, humidity, wind, visibility, and yada, yada, yada. Alright, so what I want to do
9645.76 ->  is I'm going to actually plot all of these. So for i in all the columns, so the range, length of
9655.44 ->  whatever the data frame is, and all the columns... because, actually, bike count is
9660.16 ->  my first thing. So what I'm going to do is say for a label in data frame
9666.56 ->  columns, everything after the first thing, so that would give me the temperature and
9670.16 ->  onwards. So these are all my features, right? I'm going to just scatter. So I want to see how that
9679.44 ->  label, how that specific feature, affects the bike count. So I'm going to plot the bike count on
9689.68 ->  the y axis. And I'm going to plot, you know, whatever the specific label is on the x axis.
9695.76 ->  And I'm going to title this, whatever the label is. And, you know, make my y label, the bike count
9706.64 ->  at noon. And the x label as just the label. Okay, now, I guess we don't even need the legend,
9718.08 ->  so just show that plot. All right. So it seems like functional is
9730 ->  not really giving us any utility. Snow, rain... this radiation,
9741.92 ->  you know, is fairly linear. Dew point temperature... visibility and wind don't really seem like they do
9751.04 ->  much. Humidity, kind of maybe like an inverse relationship. But the temperature definitely
9757.2 ->  looks like there's a relationship between that and the number of bikes, right. So what I'm actually
9761.68 ->  going to do is I'm going to drop some of the ones that don't seem like they really matter. So
9766 ->  maybe wind, you know, visibility. Yeah, so I'm going to get rid of wind, visibility, and functional.
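A sketch of the noon filter and the scatter loop just described, assuming the df built above (plot styling is approximate):

```python
import matplotlib.pyplot as plt

# Keep only the rows for noon, then drop the hour column (axis=1 = columns)
df = df[df["hour"] == 12]
df = df.drop(["hour"], axis=1)

# Scatter each remaining feature against the bike count
for label in df.columns[1:]:          # everything after bike_count
    plt.scatter(df[label], df["bike_count"])
    plt.title(label)
    plt.ylabel("Bike Count at Noon")
    plt.xlabel(label)
    plt.show()
```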
9779.28 ->  So now data frame, and I'm going to drop wind, visibility, and functional. All right. And the
9793.76 ->  axis again is the column. So that's one. So if I look at my data set, now, I have just the
9801.2 ->  temperature, the humidity, the dew point temperature, radiation, rain, and snow. So again,
9807.2 ->  what I want to do is I want to split this into my training, my validation and my test data set,
9814.32 ->  just as we talked before. Here, we can use the exact same thing that we just did. And we can say
9822.72 ->  numpy dot split, and sample, you know, the whole sample, and then create our splits
9834 ->  of the data frame. And we're going to do that. But now set this to eight. Okay.
9844.64 ->  So I don't really care about, you know, the full grid, the full array. So I'm just going to
9850.16 ->  use an underscore for that variable. But I will get my training x and y's. And actually, I don't
9859.68 ->  have a function for getting the x and y's. So here, I'm going to write a function defined,
9870.16 ->  get x y. And I'm going to pass in the data frame. And I'm actually going to pass in what the name
9876.88 ->  of the y label is, and what specific x labels I want to look at. So here, if that's none,
9887.04 ->  then I'm just going to get everything from the data set that's
9891.52 ->  not the y label. So here, I'm actually going to make first a deep copy of my data frame.
9900.56 ->  And that basically means I'm just copying everything over. If, like, x labels is none,
9908.88 ->  so if not x labels, then all I'm going to do is say, all right, x is going to be whatever this
9914.56 ->  data frame is. And I'm just going to take all the columns. So c for c in data frame dot columns
9922.96 ->  if c does not equal the y label, right, and I'm going to get the values from that. But if there
9932.24 ->  is the x labels, well, okay, so in order to index only one thing, so like, let's say I pass in only
9940.16 ->  one thing in here, then my data frame is, so let me make a case for that. So if the length of x
9950 ->  labels is equal to one, then what I'm going to do is just say that x is going to be the data frame
9960.32 ->  at just that one label, dot values, and I actually need to reshape to make this 2d.
9968.16 ->  So I'm going to pass in negative one comma one there. Now, otherwise, if I have like a list of
9975.04 ->  specific x labels that I want to use, then I'm actually just going to say x is equal to data
9980 ->  frame of those x labels, dot values. And that should suffice. Alright, so now that's just me
9988.72 ->  extracting x. And in order to get my y, I'm going to do y equals data frame, and then passing the y
9996.16 ->  label. And at the very end, I'm going to say data equals NP dot h stack. So I'm stacking them horizontally
10005.44 ->  one next to each other. And I'll take x and y, and return that. Oh, but this needs to be values.
10014.96 ->  And I'm actually going to reshape this to make it 2d as well so that we can do this h stack.
10019.12 ->  And I will return data x, y. So now I should be able to say, okay, get x, y, and take that data
10030.16 ->  frame. And the y label, so my y label is bike count. And actually, so for the x label, I'm actually
10038.64 ->  going to let's just do like one dimension right now. And earlier, I got rid of the plots, but we
10044.4 ->  had seen that maybe, you know, the temperature dimension does really well. And we might be able
10050.72 ->  to use that to predict why. So I'm going to label this also that, you know, it's just using the
10058.64 ->  temperature. And I am also going to do this again for, oh, this should be train. And this should be
10068.56 ->  validation. And this should be a test. Because oh, that's Val. Right. But here, it should be Val.
10081.92 ->  And this should be test. Alright, so we run this and now we have our training validation and test
10088.64 ->  data sets for just the temperature. So if I look at x train temp, it's literally just the temperature.
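Reconstructed roughly, the split and the get_xy helper might look like this (the 60/20/20 split fractions and variable names are assumptions carried over from the earlier notebooks):

```python
import copy
import numpy as np

# Shuffle, then split into 60% train / 20% validation / 20% test
train, val, test = np.split(df.sample(frac=1),
                            [int(0.6 * len(df)), int(0.8 * len(df))])

def get_xy(dataframe, y_label, x_labels=None):
    dataframe = copy.deepcopy(dataframe)
    if x_labels is None:
        # Use every column except the label
        X = dataframe[[c for c in dataframe.columns if c != y_label]].values
    elif len(x_labels) == 1:
        # A single feature still needs to be a 2-D column vector
        X = dataframe[x_labels[0]].values.reshape(-1, 1)
    else:
        X = dataframe[x_labels].values

    y = dataframe[y_label].values.reshape(-1, 1)
    data = np.hstack((X, y))          # X and y stacked side by side
    return data, X, y

# Temperature-only versions of each split
_, X_train_temp, y_train_temp = get_xy(train, "bike_count", x_labels=["temp"])
_, X_val_temp,   y_val_temp   = get_xy(val,   "bike_count", x_labels=["temp"])
_, X_test_temp,  y_test_temp  = get_xy(test,  "bike_count", x_labels=["temp"])
```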
10096.24 ->  Okay, and I'm doing this first to show you simple linear regression. Alright, so right now I can
10103.04 ->  create a regressor. So I can say the temp regressor here. And then I'm going to, you know, make a
10110.8 ->  linear regression model. And just like before, I can simply fit my x train temp, y train temp
10120 ->  in order to train this linear regression model. Alright, and then I can also, I can print
10129.04 ->  this regressor's coefficients and the intercept. So if I do that, okay, this is the coefficient
10142.16 ->  for whatever the temperature is, and then the x intercept, okay, or the y intercept, sorry. All
10151.04 ->  right. And I can, you know, score, so I can get the r squared score. So I can score x test
10165.92 ->  and y test. All right, so it's an r squared of around point three eight, which is better than
10175.52 ->  zero, which would mean, hey, there's absolutely no association. But it's also not, you know, like,
10182.32 ->  good, it depends on the context. But, you know, the higher that number, the more
10187.52 ->  the two variables are correlated, right? Which here, it's all right. It just means there's
10193.68 ->  maybe some association between the two. But the reason why I want to do this one D was to show
10200.32 ->  you, you know, if we plotted this, this is what it would look like. So if I create a scatterplot,
10207.44 ->  and let's take the training. So this is our data. And then let's make it blue. And then if I
10222.48 ->  also plotted, so something that I can do is say, you know, the x range I'm going to plot
10229.84 ->  is linspace, and this goes from negative 20 to 40, this piece of data. So I'm going to just say,
10236.4 ->  let's take 100 things from there. So I'm going to plot x, and I'm going to take this temp
10247.2 ->  regressor, and predict x with that. Okay, and this label, I'm going to label that
10257.2 ->  the fit. And this color, let's make this red. And let's actually set the line width, so I
10268.8 ->  can change how thick that line is. Okay. Now at the very end, let's create a legend. And let's,
10281.92 ->  all right, let's also create, you know, title, all these things that matter, in some sense. So
10290.24 ->  here, let's just say, this would be the bikes, versus the temperature, right? And the y label
10299.36 ->  would be number of bikes. And the x label would be the temperature. So I actually think that this
10308.4 ->  might cause an error. Yeah. So it's expecting a 2d array. So we actually have to reshape this.
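A sketch of those simple linear regression cells, including the reshape that fixes the 2-D array error (variable names assumed from the get_xy sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

temp_reg = LinearRegression()
temp_reg.fit(X_train_temp, y_train_temp)
print(temp_reg.coef_, temp_reg.intercept_)
print("R^2 on test:", temp_reg.score(X_test_temp, y_test_temp))

# Scatter the training data and overlay the fitted line;
# predict() expects a 2-D array, hence the reshape(-1, 1)
plt.scatter(X_train_temp, y_train_temp, label="Data", color="blue")
x_range = np.linspace(-20, 40, 100).reshape(-1, 1)
plt.plot(x_range, temp_reg.predict(x_range), label="Fit", color="red", linewidth=3)
plt.legend()
plt.title("Bikes vs Temp")
plt.ylabel("Number of bikes")
plt.xlabel("Temp")
plt.show()
```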
10317.92 ->  Okay, there we go. So I just had to make this an array and then reshape it. So it was 2d. Now,
10335.12 ->  we see that, all right, this increases. But again, remember those assumptions that we had about
10340.96 ->  linear regression, like this, I don't really know if this fits those assumptions, right? I just
10346.8 ->  wanted to show you guys though, that like, all right, this is what a line of best fit through this
10352.16 ->  data would look like. Okay. Now, we can do multiple linear regression, right. So I'm going to go ahead
10366.4 ->  and do that as well. Now, if I take my data set, and instead of the labels, it's actually, what's
10378.08 ->  my current data set right now? Alright, so let's just use all of these except for the bike count,
10389.6 ->  right. So I'm going to just say for the x labels, let's just take the data frame's columns and just
10398.4 ->  remove the bike count. So does that work? So this part should be if x labels is none. And then
10410.56 ->  this should work now. Oops, sorry. Okay, so I have, oh, but this here, because it's not just the
10419.04 ->  temperature anymore, we should actually call this, let's say, all, right. So I'm just going to quickly
10428.16 ->  rerun this piece here so that we have our temperature only data set. And now we have our
10433.92 ->  all data set. Okay. And this regressor, I can do the same thing. So I can do the all regressor.
10442 ->  And I'm going to make this the linear regression. And I'm going to fit this to x train all and y
10452.88 ->  train all. Okay. Alright, so let's go ahead and also score this regressor. And let's see how the
10460.96 ->  R squared performs now. So if I test this on the test data set, what happens? Alright, so our R
10470.16 ->  squared seems to improve: it went from about point four to point five two, which is a good sign. Okay.
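And the multiple linear regression version might look roughly like this, reusing the helper and splits sketched above (names are still assumptions):

```python
from sklearn.linear_model import LinearRegression

# All features except the label itself
x_all_labels = [c for c in df.columns if c != "bike_count"]

_, X_train_all, y_train_all = get_xy(train, "bike_count", x_labels=x_all_labels)
_, X_val_all,   y_val_all   = get_xy(val,   "bike_count", x_labels=x_all_labels)
_, X_test_all,  y_test_all  = get_xy(test,  "bike_count", x_labels=x_all_labels)

all_reg = LinearRegression()
all_reg.fit(X_train_all, y_train_all)
print("R^2 on test:", all_reg.score(X_test_all, y_test_all))  # roughly 0.52 in the video
```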
10478.32 ->  And I can't necessarily plot, you know, every single dimension. But this is just
10484.56 ->  to say, okay, this has improved, right? Alright, so one cool thing that you can do with
10489.68 ->  TensorFlow is you can actually do regression, but with a neural net. So here,
10500.08 ->  we already have our training data for just the temperature and, you know, for all the
10508.88 ->  different columns. So I'm not going to bother with splitting up the data again, I'm just going to go
10513.84 ->  ahead and start building the model. So in this linear regression model, typically, you know,
10520.64 ->  it does help if we normalize it. So that's very easy to do with tensorflow, I can just create some
10528.08 ->  normalizer layer. So I'm going to do tensorflow Keras layers, and get the normalization layer.
10537.44 ->  And the input shape for that will just be one because let's just do it again on just the
10543.92 ->  temperature, and the axis I will make none. Now for this temp normalizer, and I should have had
10553.52 ->  an equal sign there. I'm going to adapt this to X train temp, and reshape this to just a single vector.
10566.48 ->  So that should work great. Now with this model, so temp neural net model, what I can do is I can do,
10574.8 ->  you know, tf dot keras dot Sequential. And I'm going to pass in this normalizer layer. And then I'm
10583.76 ->  going to say, hey, just give me one single dense layer with one single unit. And what that's doing
10589.92 ->  is saying, all right, well, one single node just means that it's linear. And if you don't add any
10597.12 ->  sort of activation function to it, the output is also linear. So here, I'm going to have tensorflow
10603.36 ->  Keras layers dot dense. And I'm just going to have one unit. And that's going to be my model. Okay.
10614.48 ->  So with this model, let's compile. And for our optimizer, let's use,
10626.8 ->  let's use Adam again, dot Adam, and we have to pass in the learning rate. So learning rate,
10636.4 ->  and our learning rate, let's do 0.01. Actually, let's make this one 0.1. And the
10646.88 ->  loss, I'm going to do mean squared error. Okay, so we run that we've compiled it, okay, great.
10654.08 ->  And just like before, we can call history. And I'm going to fit this model. So here,
10661.44 ->  if I call fit, I can just fit it, and I'm going to take the x train with the temperature,
10669.28 ->  but reshape it. Y train for the temperature. And I'm going to set verbose equal to zero so
10677.84 ->  that it doesn't, you know, display stuff. I'm actually going to set epochs equal to, let's do
10684.48 ->  1000. And the validation data should be let's pass in the validation data set here
10696.32 ->  as a tuple. And I know I spelled that wrong. So let's just run this.
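Pieced together, those cells might look roughly like this (assuming TensorFlow 2.6+ and the temperature arrays from the get_xy sketch above; the 0.1 learning rate follows the narration's last correction):

```python
import tensorflow as tf

# Normalization layer adapted to the training temperatures
temp_normalizer = tf.keras.layers.Normalization(input_shape=(1,), axis=None)
temp_normalizer.adapt(X_train_temp.reshape(-1))

# One dense unit with no activation is just a linear model
temp_nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(1),
])
temp_nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
                      loss="mean_squared_error")

history = temp_nn_model.fit(X_train_temp, y_train_temp,
                            validation_data=(X_val_temp, y_val_temp),
                            epochs=1000, verbose=0)
```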
10702.8 ->  And up here, I've copied and pasted the plot loss function from before, but changed the y label
10707.76 ->  to MSE, because now we're dealing with mean squared error. And I'm going to plot
10714.16 ->  the loss of this history after it's done. So let's just wait for this to finish training and then
10719.12 ->  plot. Okay, so
10730.32 ->  this actually looks pretty good. We see that the values are converging. So now what I can do is
10736.48 ->  I'm going to go back up and take this plot. And we are going to just run that plot again. So
10747.2 ->  here, instead of this temperature regressor, I'm going to use the neural net regressor.
10756.32 ->  This neural net model.
10757.36 ->  And if I run that, I can see that, you know, this also gives me a linear regressor,
10766.4 ->  you'll notice that this fit is not entirely the same as the one
10771.12 ->  up here. And that's due to the training process of, you know, this neural net. So just two
10778.8 ->  different ways to try to find the best linear regressor. Okay, but here we're using back
10785.28 ->  propagation to train a neural net node, whereas in the other one, they probably are not doing that.
10790.96 ->  Okay, they're probably just trying to actually compute the line of best fit. So, okay, given this,
10799.6 ->  well, we can repeat the exact same exercise with our multiple linear regressions. Okay,
10809.36 ->  but I'm actually going to skip that part. I will leave that as an exercise to the viewer. Okay,
10814.56 ->  so now what would happen if we use a neural net, a real neural net instead of just, you know,
10819.04 ->  one single node in order to predict this. So let's start on that code, we already have our
10824.96 ->  normalizer. So I'm actually going to take the same setup here. But instead of, you know, this
10831.44 ->  one dense layer, I'm going to set this equal to 32 units. And for my activation, I'm going to use
10837.52 ->  ReLU. And now let's duplicate that. And for the final output, I just want one answer. So I just
10846.16 ->  want one unit. And this activation is also going to be ReLU, because I can't ever have less than
10852.08 ->  zero bikes. So I'm just going to set that as ReLU. I'm just going to name this the neural net model.
10857.04 ->  Okay. And at the bottom, I'm going to have this neural
10864.64 ->  net model, I'm going to compile. And I will actually use the same compile settings here. But
10878.64 ->  instead of a learning rate of 0.01, I'll use 0.001. Okay. And I'm going to train this here.
10887.28 ->  So the history is this neural net model. And I'm going to fit that against x train temp, y train
10899.92 ->  temp, and the validation data, I'm going to set this again equal to x val temp, and y val temp.
10914.48 ->  Now, for the verbose, I'm going to set it equal to zero; epochs, let's do 100. And here for the batch
10923.6 ->  size, actually, let's just not do a batch size right now. Let's just try it. Let's see what happens
10928.56 ->  here. And again, we can plot the loss of this history after it's done training. So let's just
10938.32 ->  run this. And that's not what we're supposed to get. So what is going on? Here is sequential,
10946.88 ->  we have our temperature normalizer, which I'm wondering now if we have to redo. Let's
10959.68 ->  do that. Okay, so we do see this decline, it's an interesting curve, but we do see it eventually.
10973.28 ->  So this is our loss, which all right, if decreasing, that's a good sign.
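For reference, a sketch of the neural net version just described (names assumed; the final ReLU is the one she experiments with removing a moment later):

```python
# Small fully-connected net on the temperature alone:
# two hidden layers of 32 ReLU units, then one output unit.
import tensorflow as tf

nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),  # ReLU keeps predictions >= 0
])
nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss="mean_squared_error")

history = nn_model.fit(X_train_temp, y_train_temp,
                       validation_data=(X_val_temp, y_val_temp),
                       epochs=100, verbose=0)
```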
10977.92 ->  And actually, what's interesting is, let's just plot this model again. So here, instead of that.
10984.08 ->  And you'll see that we actually have this like, curve that looks something like this. So actually,
10989.84 ->  what if I got rid of this activation? Let's train this again. And see what happens.
11001.12 ->  Alright, so even when I got rid of that ReLU at the end, it kind of knows, hey, you know, it's
11007.6 ->  not the best model; maybe if we had one more layer in here... these are just things that you have
11016.56 ->  to play around with. When you're, you know, working with machine learning, it's like, you don't really
11021.68 ->  know what the best model is going to be. For example, this also is not brilliant. But I guess
11033.44 ->  it's okay. So my point is, though, that with a neural net, I mean, this is not brilliant, but also
11040.4 ->  there's like no data down here, right? So it's kind of hard for our model to predict. In fact,
11044.96 ->  we probably should have started the prediction somewhere around here. My point, though, is that
11049.44 ->  with this neural net model, you can see that this is no longer a linear predictor, but yet we still
11054.56 ->  get an estimate of the value, right? So let's
11061.6 ->  do that, and repeat this exact same exercise with the multiple inputs. So here,
11073.52 ->  if I now pass in all of the data, so this is my all normalizer,
11080.72 ->  and I should just be able to pass in that. So let's move this to the next cell. Here,
11094.48 ->  I'm going to pass in my all normalizer. And let's compile it. Yeah, those parameters look good.
11102.96 ->  Great. So here with the history, when we're trying to fit this model, instead of temp,
11110.48 ->  we're going to use our larger data set with all the features. And let's just train that.
11122 ->  And of course, we want to plot the loss.
11131.52 ->  Okay, so that's what our loss looks like. So an interesting curve, but it's decreasing.
11137.76 ->  So before we saw that our R squared score was around 0.52. Well, we don't really have
11144.48 ->  that with a neural net anymore. But one thing that we can measure is, hey, what is the mean squared
11149.68 ->  error, right? So if I come down here, and I compare the two mean squared errors,
11159.6 ->  so I can predict x test all, right. So these are my predictions using that linear regressor,
11174.08 ->  well, the multiple linear regressor. So these are my lin predictions, linear regression.
11180.16 ->  Okay. I'm actually going to do that at the bottom. So let me just copy and paste that cell and bring
11192.08 ->  it down here. So now I'm going to calculate the mean squared error for both the linear regressor
11201.76 ->  and the neural net. Okay, so this is my linear and this is my neural net. So if I do my neural net
11211.36 ->  model, and I predict x test all, I get my two, you know, different y predictions. And I can calculate
11223.76 ->  the mean squared error, right? So if I want to get the mean squared error, and I have y prediction
11231.28 ->  and y real, I can do numpy dot square, and then I would need the y prediction minus, you know, the
11239.2 ->  real. So this is basically squaring everything. And this should be a vector. So if I just take
11251.84 ->  this entire thing and take the mean of that, that should give me the MSE. So let's just try that out.
11264.96 ->  And the y real is y test all, right? So that's my mean squared error for the linear regressor.
11272.64 ->  And this is my mean squared error for the neural net. So that's interesting. I will debug this live,
11284.56 ->  I guess. So my guess is that it's probably coming from this normalization layer. Because this input
11294.4 ->  shape is probably just six. And okay, so that works now. And the reason why is because, like,
11313.28 ->  my inputs, for every vector, it's only a one dimensional vector of length six. So I should
11319.04 ->  have just had six comma, which is a tuple, from the start, or rather it's
11326 ->  a tuple containing one element, which is a six. Okay, so it's actually interesting that my neural
11334.08 ->  net results seem like they have a larger mean squared error than my linear regressor.
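A sketch of that comparison, assuming all_reg is the multiple linear regressor from above and nn_model_all is an all-features neural net built the same way as the temperature one (both names are placeholders):

```python
import numpy as np

def mse(y_pred, y_real):
    # Mean squared error computed by hand: square the errors, then average
    return np.mean(np.square(y_pred - y_real))

y_pred_lr = all_reg.predict(X_test_all)        # multiple linear regression
y_pred_nn = nn_model_all.predict(X_test_all)   # all-features neural net

print("Linear regression MSE:", mse(y_pred_lr, y_test_all))
print("Neural net MSE:       ", mse(y_pred_nn, y_test_all))
```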
11340.48 ->  One thing that we can look at is, we can actually plot the real, you know, the actual
11349.84 ->  results versus what the predictions are. So if I say some axes, and I use plt dot axes, and
11361.2 ->  make these equal, then I can scatter the y, you know, the test, so what the actual
11371.28 ->  values are, on the x axis, and then what the predictions are on the y axis. Okay. And I can
11380 ->  label this as the linear regression predictions. Okay, so then let me just label my axes. So the
11390.16 ->  x axis, I'm going to say is the true values. The y axis is going to be my linear regression predictions.
11404.32 ->  Or actually, let's plot. Let's just make this predictions.
11409.28 ->  And then at the end, I'm going to plot. Oh, let's set some limits.
11422.88 ->  Because I think that's like approximately the max number of bikes.
11428.64 ->  So I'm going to set my x limit to this and my y limit to this.
11435.2 ->  So here, I'm going to pass that in here too. And all right, this is what we actually get for our
11446.48 ->  linear regressor. You see that actually, they align quite well, I mean, to some extent. So 2000 is
11454.72 ->  probably too much, 2500 I mean; looks like maybe like 1800 would be enough here for our limits.
11463.36 ->  And I'm actually going to label something else, the neural net predictions.
11472.72 ->  Let's add a legend. So you can see that our neural net for the larger values, it seems like
11482 ->  it's a little bit more spread out. And it seems like we tend to underestimate a little bit down
11488.48 ->  here in this area. Okay. And for some reason, these are way off as well.
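Roughly, that prediction-versus-truth plot could be produced like this (the 1800 axis limit follows the remark above; exact styling in the notebook may differ):

```python
import matplotlib.pyplot as plt

# True bike counts on the x axis, model predictions on the y axis;
# a perfect model would put every point on the y = x diagonal.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test_all, y_pred_lr, label="Lin Reg Preds")
ax.scatter(y_test_all, y_pred_nn, label="NN Preds")
ax.plot([0, 1800], [0, 1800], color="red", linewidth=2)  # y = x reference line
ax.set_xlim(0, 1800)
ax.set_ylim(0, 1800)
ax.set_xlabel("True Values")
ax.set_ylabel("Predictions")
ax.legend()
plt.show()
```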
11497.84 ->  But yeah, so we've basically used a linear regressor and a neural net. Honestly, there are
11504.48 ->  times where a neural net is more appropriate and times where a linear regressor is more appropriate.
11509.12 ->  I think that it just comes with time and trying to figure out, you know, just literally seeing,
11514.72 ->  like, hey, what works better. Like here, a multiple linear regressor might actually work
11519.28 ->  better than a neural net. But for example, with the one dimensional case, a linear regressor would
11525.76 ->  never be able to see this curve. Okay. I mean, I'm not saying this is a great model either, but I'm
11532.88 ->  just saying like, hey, you know, sometimes it might be more appropriate to use something that's not
11539.44 ->  linear. So yeah, I will leave regression at that. Okay, so we just talked about supervised learning.
11549.84 ->  And in supervised learning, we have data, we have a bunch of features for a bunch of
11554.88 ->  different samples. But each of those samples has some sort of label on it, whether that's a number,
11559.76 ->  a category, a class, etc. Right,
11566.16 ->  we were able to use that label in order to try to predict new labels of other points that
11571.84 ->  we haven't seen yet. Well, now let's move on to unsupervised learning. So with unsupervised
11579.52 ->  learning, we have a bunch of unlabeled data. And what can we do with that? You know, can we learn
11585.6 ->  anything from this data? So the first algorithm that we're going to discuss is known as k means
11593.12 ->  clustering. What k means clustering is trying to do is it's trying to compute k clusters from the data.
11605.76 ->  So in this example below, I have a bunch of scattered points. And you'll see that this
11611.36 ->  is x zero and x one on the two axes, which means I'm actually plotting two different features,
11618.08 ->  right of each point, but we don't know what the y label is for those points. And now, just looking
11624.8 ->  at these scattered points, we can kind of see how there are different clusters in the data set,
11631.44 ->  right. So depending on what we pick for k, we might have different clusters. Let's say k equals two,
11640.32 ->  right, then we might pick, okay, this seems like it could be one cluster, but this here is also
11645.44 ->  another cluster. So those might be our two different clusters. If we have k equals three,
11653.12 ->  for example, then okay, this seems like it could be a cluster. This seems like it could be a
11658.16 ->  cluster. And maybe this could be a cluster, right. So we could have three different clusters in the
11663.12 ->  data set. Now, this k here is predefined, if I can spell that correctly, by the person who's running
11673.52 ->  the model. So that would be you. All right. And let's discuss how you know, the computer actually
11682.48 ->  goes through and computes the k clusters. So I'm going to write those steps down here.
11692.64 ->  Now, the first step that happens is we actually choose, well, the computer chooses three random
11701.28 ->  points on this plot to be the centroids. And by centroids, I just mean the centers of the clusters.
11711.84 ->  Okay. So three random points, let's say we're doing k equals three, so we're choosing three
11716.8 ->  random points to be the centroids of the three clusters. If it were two, we'd be choosing two
11721.52 ->  random points. Okay. So maybe the three random points I'm choosing might be
11727.76 ->  here, here, and here. All right. So we have three different points. And the second thing that we do
11744.4 ->  is we actually calculate
11746 ->  the distance for each point to those centroids. So between all the points and the centroid.
11761.36 ->  So basically, I'm saying, all right, this is this distance, this distance, this distance,
11767.6 ->  all of these distances, I'm computing between oops, not those two, between the points, not the
11773.12 ->  centroids themselves. So I'm computing the distances from all of these points to each of the centroids.
11780.08 ->  Okay. And that comes with also assigning those points to the closest centroid.
11794.8 ->  What do I mean by that? So let's take this point here, for example, so I'm computing
11802.4 ->  this distance, this distance, and this distance. And I'm saying, okay, it seems like the red one
11806.96 ->  is the closest. So I'm actually going to put this into the red centroid. So if I do that for
11814.4 ->  all of these points, it seems slightly closer to red, and this one seems slightly closer to red,
11823.28 ->  right? Now for the blue, I actually wouldn't put any blue ones in here, but we would probably
11833.04 ->  actually, that first one is closer to red. And now it seems like the rest of them are probably
11841.2 ->  closer to green. So let's just put all of these into green here, like that. And cool. So now we
11851.44 ->  have, you know, our two, well three technically, centroids. So there's this group here, there's
11858.48 ->  this group here. And then blue is kind of just this group here, it hasn't really touched any
11864.88 ->  of the points yet. So the next step, step three, is we actually go and we recalculate the
11874.56 ->  centroids. So we compute new centroids based on the points that we have in all the clusters.
11884 ->  And by that, I just mean, okay, well, let's take the average of all these points. And where is that
11890.16 ->  new centroid? That's probably going to be somewhere around here, right? The blue one, we don't have
11895.68 ->  any points in there. So we won't touch it, and then the green one, we can put that probably somewhere
11902.8 ->  over here, oops, somewhere over here. Right. So now if I erase all of the previously computed centroids,
11918.24 ->  I can go and I can actually redo step two over here, this calculation.
11925.28 ->  Alright, so I'm going to go back and I'm going to iterate through everything again,
11928.56 ->  and I'm going to recompute my three centroids. So let's see, we're going to take this red point,
11935.2 ->  these are definitely all red, right? This one still looks a bit red. Now,
11943.76 ->  this part, we actually start getting closer to the blues.
11948.16 ->  So this one still seems closer to a blue than a green, this one as well. And I think the rest
11956.8 ->  would belong to green. Okay, so now our three, sorry, our three clusters
11966.4 ->  would be this, this, and then this, right? Those would be
11979.84 ->  the three clusters. So now we go back and we compute
11984.08 ->  the three centroids. So I'm going to get rid of this, this and this. And now where would this
11990.64 ->  red be centered, probably closer, you know, to this point here, this blue might be closer to
11997.68 ->  up here. And then this green would probably be somewhere. It's pretty similar to what we had
12005.52 ->  before. But it seems like it'd be pulled down a bit. So probably somewhere around there for green.
12010.88 ->  All right. And now, again, we go back and we compute the distance between all the points
12020.24 ->  and the centroids. And then we assign them to the closest centroid. Okay. So the reds are all here,
12027.6 ->  it's very clear. Actually, let me just circle that. And
12036 ->  it actually seems like this point is closer to this blue now. So for the blues, it seems like
12043.44 ->  maybe this point looks like it'd be blue too. So all these look like they would be blue now.
12050.16 ->  And the greens would probably be this cluster right here. So we go back, we compute the centroids,
12058 ->  bam. This one probably like almost here, bam. And then the green looks like it would be probably
12070.96 ->  here-ish. Okay. And now we go back and we compute the clusters again.
12081.92 ->  So red, still this. Blue, I would argue, is now this cluster here. And green is this cluster here.
12093.36 ->  Okay, so we go and we recompute the centroids, bam, bam. And, you know, bam. And now if I were
12108.08 ->  to go and assign all the points to clusters again, I would get the exact same thing. Right. And so
12114.4 ->  that's when we know that we can stop iterating between steps two and three is when we've
12119.84 ->  converged on some solution when we've reached some stable point. And so now because none of
12126.56 ->  these points are really changing out of their clusters anymore, we can go back to the user
12130.4 ->  and say, Hey, these are our three clusters. Okay. And this process, something known as
12140.72 ->  expectation maximization. This part where we're assigning the points to the closest centroid,
12153.28 ->  this is our expectation step. And this part where we're computing the new
12161.84 ->  centroids, this is our maximization step. Okay, so that's expectation maximization.
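As a hedged sketch of that loop in practice, scikit-learn's KMeans runs the same assign-then-recompute expectation-maximization iteration internally (the toy data below is made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with three blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k is predefined by us; KMeans iterates assignment and centroid updates
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the final centroids
print(kmeans.labels_[:10])       # cluster assignments for the first few points
```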
12175.04 ->  And we use this in order to compute the centroids, assign all the points to clusters,
12182.72 ->  according to those centroids. And then we're recomputing all that over again, until we reach
12187.52 ->  some stable point where nothing is changing anymore. Alright, so that's our first example
12193.76 ->  of unsupervised learning. And basically, what this is doing is trying to find some structure,
12199.2 ->  some pattern in the data. So if I came up with another point, you know, might be somewhere here,
12205.52 ->  I can say, Oh, it looks like that's closer to if this is a, b, c, it looks like that's closest to
12212.56 ->  cluster B. And so I would probably put it in cluster B. Okay, so we can find some structure
12218.24 ->  in the data based on just how, how the points are scattered relative to one another. Now,
12226.24 ->  the second unsupervised learning technique that I'm going to discuss with you guys is something known as
12230.48 ->  principal component analysis. And the point of principal component analysis is, very often, it's
12237.44 ->  used as a dimensionality reduction technique. So let me write that down. It's used for dimensionality
12247.52 ->  reduction. And what I mean by dimensionality reduction is, if I have a bunch of features like
12255.52 ->  x1 x2 x3 x4, etc. Can I just reduce that down to one dimension that gives me the most information
12263.6 ->  about how all these points are spread relative to one another. And that's what PCA is for. So PCA
12269.52 ->  principal component analysis. Let's say I have some points in the x zero and x one feature space.
12282.8 ->  Okay, so these points might be spread, you know, something like this.
12299.68 ->  Okay. So for example, if this were something to do with housing prices, right,
12308.96 ->  this here, x zero, might be, hey, years since built, right, since the house was built,
12319.6 ->  and x one might be square footage of the house. Alright, so like years since built, I mean, like
12329.92 ->  right now it's been, you know, 22 years since a house in 2000 was built. Now principal component
12336.96 ->  analysis is just saying, alright, let's say we want to build a model, or let's say we want to,
12340.8 ->  you know, display something about our data, but we don't have two axes to show it on. How do we demonstrate that this point is further away from this point than from this point? We can do that using principal component analysis. So
12364.24 ->  take what you know about linear regression and just forget about it for a second. Otherwise,
12367.92 ->  you might get confused. PCA is a way of trying to find the direction in the space with the largest variance. So this principal component, what that means is basically the component, some direction in this space, with the largest variance. Okay, it tells us the most about our data set without needing both of the different dimensions. Like, let's say we have these two different dimensions, and somebody's telling us, hey, you only get one dimension in order to show your data set.
12408.08 ->  What dimension do you want to show us? Okay, so let's say we want to show our data set; what do we do? We want to project our data onto a single dimension.
12420.16 ->  Alright, so that in this case might be a dimension that looks something like
12426.4 ->  this. And you might say, okay, we're not going to talk about linear regression, okay.
12431.68 ->  We don't have a y value. In linear regression, this would be y; here, this is not y, okay, we don't have a label for that. Instead, what we're doing is we're taking the right angle projection of all of these points. That's not very visible, but take this right angle projection onto this line.
12453.04 ->  And what PCA is doing is saying, okay, map all of these points onto this one dimensional space.
12459.52 ->  So the transformed data set would be here. This one's already on the line, so we just put that there. But now this would be our new one dimensional data set. Okay, it's not our prediction or anything, this is our new data set. If somebody came to us and said, you only get one dimension, you only get one number to represent each of these 2D points, what number would you give us? This would be the number that we gave. Okay, in this direction,
12504.16 ->  this is where our points are the most spread out. Right? If I took this plot,
12511.04 ->  and let me actually duplicate this so I don't have to rewrite anything.
12516.32 ->  Or so I don't have to erase and then redraw anything. Let me get rid of some of this stuff.
12527.44 ->  And I just got rid of a point there too. So let me draw that back.
12534.16 ->  Alright, so if this were my original data, what if I had taken this to be the PCA dimension? Okay, well, I then would have points that... let me actually do that in a different color. So if I were to draw a right angle to this for every point, my points would look something
12564.32 ->  like this. And so just intuitively looking at these two different plots, this top one and this one,
12577.44 ->  we can see that the points are squished a little bit closer together. Right? Which means that this is not the direction with the largest variance. The thing about the largest variance is that it will give us the most discrimination between all of these points: the larger the variance, the further spread out these points will likely be. And so that's the dimension that we should project onto. A different way to actually look at that, like, what is the dimension with the largest variance? It also happens to be the dimension that minimizes the residuals. So if we take all the points, and
12625.28 ->  we take the residual from that, the x-y residual. So in linear regression, we were looking only at this residual, the difference between y and y hat, between the prediction and the label. It's not that here. In principal component analysis, we're taking the difference between our current point in two dimensional space and its projected point. Okay, and we're saying, alright, how much distance is there in that projection residual, and we're trying to minimize that for all of these points. So that
12668.72 ->  actually equates to this largest variance dimension, this dimension here, the PCA dimension,
12681.12 ->  you can either look at it as minimizing the projection residuals, so that's the stuff in orange, or as maximizing the variance between the points.
12708.32 ->  Okay. And we're not really going to talk about, you know, the method that we need in order to
12715.28 ->  calculate out the principal components, or like what that projection would be, because you will
12720.96 ->  need to understand linear algebra for that, especially eigenvectors and eigenvalues, which
12726.8 ->  I'm not going to cover in this class. But that's how you would find the principal components. Okay,
12732.08 ->  now, with this two dimensional data set here, sorry, this one dimensional data set, we started from a 2D data set, and we've now boiled it down to one dimension. Well, we can go and take that
12742.16 ->  dimension, and we can do other things with it. Right, we can, like if there were a y label,
12747.68 ->  then we can now show x versus y, rather than x zero and x one in different plots with that y.
12755.04 ->  Now we can just say, oh, this is a principal component. And we're going to plot that with
12758.48 ->  the y. Or for example, if there were 100 different dimensions, and you only wanted to take five of
12764.56 ->  them, well, you could go and you could find the top five PCA dimensions. And that might be a lot
12771.2 ->  more useful to you than 100 different feature vector values. Right. So that's principal component
12778.4 ->  analysis. Again, we're taking, you know, certain data that's unlabeled, and we're trying to make
12785.28 ->  some sort of estimation, like some guess about its structure from that original data set, if we
12793.76 ->  wanted to take, you know, a 3D thing, like a sphere, but we only have a 2D surface to draw it on. Well, what's the best approximation that we can make? Oh, it's a circle. Right? PCA is kind of
12806.08 ->  the same thing. It's saying if we have something with all these different dimensions, but we can't
12810.08 ->  show all of them, how do we boil it down to just one dimension? How do we extract the most
12815.92 ->  information from those multiple dimensions? And that is exactly it: either you minimize the projection residuals, or you maximize the variance. And that is PCA. So we'll go through an example of that.
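(A small NumPy sketch of the linear-algebra idea mentioned above: the first principal component is the eigenvector of the covariance matrix with the largest eigenvalue, and projecting onto it gives the one dimensional data set. The 2D data here is made up.)

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])  # made-up 2D points

    x_centered = x - x.mean(axis=0)
    cov = np.cov(x_centered, rowvar=False)        # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh works on symmetric matrices
    pc1 = eigvecs[:, np.argmax(eigvals)]          # direction with the largest variance

    projected = x_centered @ pc1                  # right-angle projection onto that direction
    print(projected[:5])                          # the new one dimensional data set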
12830.4 ->  Now, finally, let's move on to implementing the unsupervised learning part of this class.
12837.04 ->  Here, again, I'm on the UCI machine learning repository. And I have a seeds data set where,
12844.4 ->  you know, I have a bunch of kernels that belong to three different types of wheat. So there's
12849.44 ->  Kama, Rosa, and Canadian. And the different features that we have access to are geometric parameters of those wheat kernels: the area, perimeter, compactness, length, width, asymmetry, and the length of the kernel groove. Okay, so all of these are real values,
12870.64 ->  which is easy to work with. And what we're going to do is we're going to try to predict,
12876.08 ->  or I guess we're going to try to cluster the different varieties of the wheat.
12881.44 ->  So let's get started. I have a Colab notebook open again. Oh, you're gonna have to go to the data folder and download this, so I'm going to do that, and let's get started. So the first thing to do is to import our seeds data set into our Colab
12904.24 ->  notebook. So I've done that here. Okay, and then we're going to import all the classics again,
12911.92 ->  so pandas. And then I'm also going to import seaborn because I'm going to want that for this specific class. Okay. Great. So now the columns that we have in our seeds data set are the area, the perimeter, the compactness, the length, the width, the asymmetry, the groove length, I mean, I'm just going to call it groove, and then the class, right, the wheat kernel's class. So now we have to import this,
12960.96 ->  I'm going to do that using pandas read_csv. And it's called... seeds data dot csv? So I'm going to turn that into a data frame, and the names are equal to the columns over here. So what happens if I just do that? Oops, what did I call it, seeds_dataset.txt. Alright, so if we actually look at our
12989.12 ->  data frame right now, you'll notice something funky. Okay. And here, you know, we have all the
12996.8 ->  stuff under area, and these are all our numbers with some backslash t. The reason is because we haven't actually told pandas what the separator is, which we can do like this. And this \t, that's just a tab. So in order to ensure that all whitespace gets recognized as a separator, we can actually use a whitespace pattern, so any spaces or tabs are going to get recognized as data separators. So if I run that, now this is a lot better. Okay.
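(A sketch of that load step. The column names follow the list above; the file name seeds_dataset.txt is the UCI download, so adjust it if yours is named differently.)

    import pandas as pd

    cols = ["area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "class"]
    # the raw file has no header row and is tab/space separated,
    # so pass the column names and a whitespace separator
    df = pd.read_csv("seeds_dataset.txt", names=cols, sep=r"\s+")
    print(df.head())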
13034.56 ->  So now let's actually go and like visualize this data. So what I'm actually going to do is plot
13040.72 ->  each of these against one another. So in this case, pretend that we don't have access to the
13046.48 ->  class, right? This class here, I'm just going to use to show you in this example that, hey, we can recover our classes using unsupervised learning. But for this example, in unsupervised learning, we don't actually have access to the class. So I'm going to just try to
13061.44 ->  plot these against one another and see what happens. So for some i in range of the number of columns minus one, because the class is in the columns, and then for j in range from i plus one onwards, so the next thing after i until the end. This will give us basically a grid of all the different combinations. And our x label is going to be columns[i], our y label is going to be columns[j]. So those are our labels up here.
13105.28 ->  And I'm going to use seaborn this time, and I'm going to do a scatter plot of my data. So our x is going to be our x label, our y is going to be our y label, and our data is going to be the data frame that we're passing in. What's interesting here is that we can say hue. And what this will do is say,
13133.52 ->  like if I give this class, it's going to separate the three different classes into three different
13137.92 ->  hues. So now what we're doing is we're basically comparing the area and the perimeter, or the area and the compactness, but we're also visualizing what classes they're in. So let's go ahead and run it.
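(A sketch of that pairwise plotting loop, reusing the cols and df from the load sketch above; the class column is used only for the hue.)

    import seaborn as sns
    import matplotlib.pyplot as plt

    features = cols[:-1]                 # every column except "class"
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            x_label, y_label = features[i], features[j]
            sns.scatterplot(x=x_label, y=y_label, data=df, hue="class")
            plt.show()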
13150.88 ->  Great. So basically, for perimeter and area we get these three groups; for area and compactness, we get these three groups, and so on. So these all kind of look, honestly, somewhat similar. Right? So, wow, look at this one. So this one,
13180.64 ->  we have the compactness and the asymmetry. And it looks like there's not really I mean,
13184.32 ->  it just looks like they're blobs, right? Sure, maybe class three is over here more, but
13190 ->  one and two kind of look like they're on top of each other. Okay. I mean, there are some that
13195.68 ->  might look slightly better in terms of clustering. But let's go through some of the clustering examples that we talked about, and try to implement those. The first thing that we're
13205.92 ->  going to do is just straight up clustering. So what we learned about was k means clustering.
13216.24 ->  So from sklearn, I'm going to import KMeans. Okay. And just for the sake of being able to run any x and any y, I'm just going to say, hey, let's use some x. What's a good one? Maybe perimeter and asymmetry could be a good one. So x could be perimeter, y could be asymmetry.
13247.44 ->  Okay. And for this, the x values, I'm going to just extract those specific values.
13259.84 ->  Alright, well, let's make a k means algorithm, or let's, you know, define this. So k means,
13269.2 ->  and in this specific case, we know that the number of clusters is three. So let's just use that. And
13275.76 ->  I'm going to fit this against this x that I've just defined right here. Right. So if I create this clusters variable, one cool thing is I can actually go to it and say kmeans dot labels, and it'll give me, if I can type correctly, what its predictions for all the clusters are. And our actual... oops, not that. If we go to the data frame,
13312.16 ->  and we get the class, and the values from those, we can actually compare these two and say, hey,
13319.44 ->  like, you know, in general, most of the zeros that it's predicted correspond to the ones, right. And in general, the twos are the twos here. And then this third class, one, okay, that corresponds
13331.36 ->  to three. Now remember, these are separate classes, so the labels, what we actually call them, don't really matter. We could map zero to one, map two to two, and map one to three, and our mapping would do fairly well.
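(A sketch of that 2D k-means fit on perimeter and asymmetry. With three clusters, the predicted labels only line up with the true classes up to a relabeling, which is the mapping point made above.)

    from sklearn.cluster import KMeans

    x = df[["perimeter", "asymmetry"]].values   # df from the load sketch above
    kmeans = KMeans(n_clusters=3).fit(x)

    print(kmeans.labels_[:10])        # cluster ids 0, 1, 2 chosen by k-means
    print(df["class"].values[:10])    # true classes 1, 2, 3; only the grouping matters, not the names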
13343.76 ->  But we can actually visualize this. And in order to do
13350.88 ->  that, I'm going to create this cluster data frame. So I'm going to create a data frame.
13360.24 ->  And I'm going to pass in a horizontally stacked array with x, so my values for x and y. And then
13371.92 ->  the clusters that I have here, but I'm going to reshape them. So it's 2d.
13378.16 ->  Okay. And the columns, the labels for that, are going to be x, y, and class. Okay. So I'm going to go ahead and do that same seaborn scatter plot again, where x is x, y is y. And now,
13403.52 ->  the hue is again the class, and the data is now this cluster data frame. Alright, so this here is my k means, I guess, classes. So k means kind of looks like this. If I come down here and I plot my original data frame, this is my original classes with respect to this specific x and y. And you'll see that, honestly, it doesn't do too poorly. Yeah, I mean, the colors are different, but that's fine.
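(A sketch of that side-by-side check, continuing from the k-means sketch above: one scatter colored by the k-means labels, one colored by the original classes.)

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    cluster_df = pd.DataFrame(
        np.hstack((x, kmeans.labels_.reshape(-1, 1))),
        columns=["x", "y", "class"],
    )
    sns.scatterplot(x="x", y="y", hue="class", data=cluster_df)           # k-means clusters
    plt.show()

    sns.scatterplot(x="perimeter", y="asymmetry", hue="class", data=df)   # original classes
    plt.show()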
13447.36 ->  For the most part, it captures the information of the clusters, right. And now we can do that with
13456 ->  higher dimensions. So with the higher dimensions, if we make x equal to, you know, all the columns,
13465.68 ->  except for the last one, which is our class, we can do the exact same thing.
13471.68 ->  So here, we can
13483.6 ->  predict this. But now, our columns are equal to our data frame columns all the way to the last one.
13495.36 ->  And then with this class, actually, so we can literally just say data frame columns.
13502.08 ->  And we can fit all of this. And now I want to plot the k means classes.
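(A sketch of the same fit on all seven features at once, continuing from the sketches above; any pair of feature columns then gives a 2D view of the higher dimensional clusters.)

    x = df[df.columns[:-1]].values              # every feature, class column dropped
    kmeans = KMeans(n_clusters=3).fit(x)

    cluster_df = pd.DataFrame(
        np.hstack((x, kmeans.labels_.reshape(-1, 1))),
        columns=df.columns,                     # same names, but "class" now holds the cluster ids
    )
    sns.scatterplot(x="perimeter", y="asymmetry", hue="class", data=cluster_df)
    plt.show()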
13511.52 ->  Alright, so that's my clustered and my original. So actually, let me see if I can get these on the same page. So yeah, I mean, pretty similar to what we just saw. But what's
13527.36 ->  actually really cool is, even something like, you know, if we change it. So what's one of the ones where they were, like, on top of each other? Okay, so compactness and asymmetry, this one's messy.
13547.28 ->  Right. So if I come down here, and I say compactness and asymmetry, and I'm trying to do this in 2d,
13558.96 ->  this is what my scatterplot looks like. So this is what my k means is telling me for these two
13565.12 ->  dimensions for compactness and asymmetry, if we just look at those two, these are our three classes,
13572 ->  right? And we know that the original looks something like this. And are these two remotely
13578.24 ->  alike? No. Okay, so now if I come back down here, and I rerun this higher dimensions one,
13585.12 ->  but actually, this clusters, I need to get the labels of the k means again.
13594.56 ->  Okay, so if I rerun this with higher dimensions,
13598.4 ->  well, if we zoom out, and we take a look at these two, sure, the colors are mixed up. But in general,
13605.6 ->  there are the three groups are there, right? This does a much better job at assessing, okay,
13612 ->  what group is what. So, for example, we could relabel the one in the original class to two.
13621.2 ->  And then we could make... sorry, okay, this is kind of confusing. But for example, if this light pink
13628.4 ->  were projected onto this darker pink here, and then this dark one was actually the light pink,
13635.6 ->  and this light one was this dark one, then you kind of see like these correspond to one another,
13641.28 ->  right? Like even these two up here are the same class as all the other ones over here, which are
13646.16 ->  the same in the same color. So you don't want to compare the two colors between the plots,
13651.04 ->  you want to compare which points are in what colors in each of the plots. So that's one cool
13657.68 ->  application. So this is how k means functions: it's basically taking all the data points and saying,
13664.08 ->  All right, where are my clusters given these pieces of data? And then the next thing that we
13670.24 ->  talked about is PCA. So with PCA, we're reducing the dimension; we're mapping all these, like, seven dimensions, I don't know if there are seven, I made that number up, but we're mapping multiple dimensions into a lower number of dimensions. Right. And so let's see how that works.
13690.08 ->  So from sklearn decomposition, I can import PCA, and that will be my PCA model. So if I do PCA n_components, this is how many dimensions you want to map it into. And for this exercise, let's do two. Okay, so now I'm taking the top two dimensions. And my transformed x is going to be PCA dot fit_transform of the same x that I had up here. Okay, so all of the values basically: area,
13726.56 ->  perimeter, compactness, length, width, asymmetry, groove. Okay. So let's run that. And we've
13734.8 ->  transformed it. So let's look at what the shape of x used to be. Okay, so seven was right: I had 210 samples, each seven features long, basically. And now my transformed x
13754.64 ->  is 210 samples, but only of length two, which means that I only have two dimensions now that
13760.08 ->  I'm plotting. And we can actually even take a look at, you know, the first five things.
13767.2 ->  Okay, so now we see each sample is now a two dimensional point in our new dimensions.
13778.88 ->  So what's cool is I can actually scatter these, transformed x column zero against transformed x column one. So I actually have to take the columns here. And if I show that,
13801.92 ->  basically, we've just taken this seven dimensional thing, and we've made it into a two dimensional representation. So that's the point of PCA.
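(A sketch of that reduction step: the 210 x 7 feature matrix from above goes in, a 210 x 2 matrix comes out, and we can scatter the two new columns.)

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    pca = PCA(n_components=2)
    transformed_x = pca.fit_transform(x)    # x: the (210, 7) feature matrix from above

    print(x.shape)                          # (210, 7)
    print(transformed_x.shape)              # (210, 2)

    plt.scatter(transformed_x[:, 0], transformed_x[:, 1])
    plt.show()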
13813.2 ->  And actually, let's go ahead and do the same clustering exercise as we did up here. If I take
13820.8 ->  the k means PCA data frame, let's construct a data frame out of that. And the data frame is going to be an hstack: I'm going to take this transformed x and the clusters, reshaped. So actually, instead of clusters, I'm going to use k means dot labels, and I need to reshape this so it's 2D, so we can do the hstack. And for the columns, I'm going to set this to PCA one, PCA two,
13859.68 ->  and the class. All right. So now if I take this, I can also do the same for the truth.
13868.16 ->  But instead of the k means labels, I want from the data frame the original classes.
13873.2 ->  And I'm just going to take the values from that. And so now I have a data frame for the k means
13880.72 ->  with PCA and then a data frame for the truth with also the PCA. And I can now plot these similarly
13887.2 ->  to how I plotted these up here. So let me actually take these two.
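(A sketch of the two PCA-space data frames described here, continuing from the sketches above, with transformed_x, kmeans, df, np, pd, sns, and plt already defined: one plot colored by the k-means labels, one by the true classes.)

    kmeans_pca_df = pd.DataFrame(
        np.hstack((transformed_x, kmeans.labels_.reshape(-1, 1))),
        columns=["pca1", "pca2", "class"],
    )
    truth_pca_df = pd.DataFrame(
        np.hstack((transformed_x, df["class"].values.reshape(-1, 1))),
        columns=["pca1", "pca2", "class"],
    )

    sns.scatterplot(x="pca1", y="pca2", hue="class", data=kmeans_pca_df)   # k-means clusters in PCA space
    plt.show()
    sns.scatterplot(x="pca1", y="pca2", hue="class", data=truth_pca_df)    # true classes in PCA space
    plt.show()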
13892.32 ->  Instead of the cluster data frame, I want the k means PCA data frame. The hue is still going to be class, but now x and y are going to be the two PCA dimensions. Okay. So these are my two PCA dimensions, and you can see that they're pretty spread
13925.76 ->  out. And then here, I'm going to go to my truth classes. Again, it's PCA one PCA two, but instead
13934.32 ->  of the k means one, this should be the truth PCA data frame. So you can see that in the truth data frame
13942 ->  along these two dimensions, we actually are doing fairly well in terms of separation, right? It does
13949.52 ->  seem like this is slightly more separable than the other like dimensions that we had been looking at
13956.72 ->  up here. So that's a good sign. And up here, you can see that hey, some of these correspond to one
13965.36 ->  another. I mean, for the most part, our unsupervised clustering algorithm is able to spit out what the proper labels are, if you map these specific labels to the different types of kernels. For example, these might all be the Kama kernels, and same here. And then these might all be the Canadian kernels. So it does struggle a little bit where they overlap. But for the most part, our algorithm is able to find the three different categories and do a fairly good job at predicting them without any information from us; we haven't given our
14006.48 ->  algorithm any labels. So that's the gist of unsupervised learning. I hope you guys enjoyed
14012.88 ->  this course. I hope you know, a lot of these examples made sense. If there are certain things
14018.8 ->  that I have done, and you know, you're somebody with more experience than me, please let me know
14024.24 ->  in the comments and we can all as a community learn from this together. So thank you all for watching.
                     
                    
                        Source: https://www.youtube.com/watch?v=i_LwzRVP7bg