
Stanford CS229 Machine Learning I Feature / Model selection, ML Advice I 2022 I Lecture 11
For more information about Stanford’s Artificial Intelligence programs visit: https://stanford.io/ai
To follow along with the course, visit:
https://cs229.stanford.edu/syllabus-s…
Tengyu Ma
Assistant Professor of Computer Science
https://ai.stanford.edu/~tengyuma/
Christopher Ré
Associate Professor of Computer Science
https://cs.stanford.edu/~chrismre/
To view all online courses and programs offered by Stanford, visit: http://online.stanford.edu
Content
...advice. It happens that I write in too small a font, so please feel free to stop me and let me know. As I said, after a few lectures I start to forget about it, so please remind me.

Okay, so I guess...
First, let me briefly review the last lecture, just very quickly. Last lecture we talked about two important concepts: underfitting and overfitting. Here our goal is to make generalization work: we want to generalize to unseen examples.

Last time we talked about two possible reasons why your test error may not be good enough. One possible reason is overfitting. Overfitting means that your training error is actually pretty good, your training loss is pretty small, but your test loss is pretty high. We discussed the possible reasons why you can have overfitting, and two possible reasons are: maybe you have too complex a model (for example, last time we discussed that if you use a 50-degree polynomial for a very, very small dataset where you only have something like four examples, then you may overfit), or maybe you don't have enough data (if you have more and more data, say a million data points, then a 50-degree polynomial wouldn't be a problem).
We also discussed another reason: underfitting. Underfitting is much easier to understand in some sense; it basically just means that you don't have a small enough training loss or training error. Your model is just not powerful enough, so you cannot even fit the training data you have.

In some sense these are two complementary situations: in the underfitting case you probably want to make your model more expressive, and in the overfitting case you may want to make your model less expressive,
or less complex. We use these words, "complex" and "expressive", a lot without a formal definition. We say some models are more complex and some models are less complex; typically you can somehow feel it: a fifth-degree polynomial is probably more complex than a linear model. But if you really want a concrete definition, it becomes a little bit tricky: what is the right complexity measure of a model? Someone asked about that in the last lecture as well, and the answer is that there is no universal measure of model complexity. There are a few complexity measures people often use; they all have their particular strengths, and there is no real formal theory to say which one is better. These are complexity measures that can be theoretically justified in certain cases, but they are not universal. So what are the complexity measures? I'm just listing a few, mostly for your general knowledge.
I guess the most obvious one is how many parameters there are: if you have more parameters, your model might be more complex. This is very intuitive; however, the limitation is that maybe you have a lot of parameters but the effective complexity of the model is very, very low. Maybe all the parameters are very small; in that case you could say the complexity is actually not as big as you thought. To deal with this kind of scaling issue (why would you call a model complex if all its parameters are basically zero, even though it has a million of them?), people consider norms of the parameters.
Norms of parameters are actually typically very good complexity measures for linear models. Before deep learning arose, we used norms as complexity measures a lot, and we still use them in some cases. But these also have limitations. For example, suppose you have a low-norm solution and you add some random noise to the model. Adding the noise makes the norm bigger, but the noise doesn't really change the complexity, because when you take the matrix multiplication you average out the noise to some extent. So there are these kinds of issues as well.

Some other, more modern complexity measures people have considered are, for example, Lipschitzness: whether your model is Lipschitz, or maybe whether your model is smooth enough. Here I'm using the word "smooth" in a relatively informal way: it could mean a bound on the second derivative, or a bound on the third derivative, something like that. If your model is oscillating or fluctuating a lot, maybe that means it is quite complex.

And there are other complexity measures, for example how invariant your model is with respect to, say, certain translations, certain invariances that should hold in your dataset, for example whether your model is invariant to data augmentations.
But in general there is no very established theory on what exactly the right complexity measure is, and sometimes it also depends on the data, as you will see today. For example, take norms: what type of norm are you talking about, L1 norm or L2 norm? Sometimes the L2 norm is the right complexity measure for a certain type of data, and sometimes the L1 norm is the right complexity measure for a certain type of data. So I don't think there is anything super concrete we can say; it's not like I have a fixed suggestion for you to consider. In some sense you should just keep these measures in mind and consider them when you work on your own dataset.
Okay, so now we have discussed complexity measures. In the rest of the lecture I'm going to cover two things. One: once you have some guess about the right complexity measure you are looking for, how do you make that complexity measure small? How do you encourage the model to have small complexity? In some cases this is easy: you can just change how many neurons, or how many hidden variables, there are in a deep network, so you change the number of parameters. But if you want to control the norm, what do you do? That is called regularization, and I'm going to discuss it in the first half of the lecture.

In the second half of the lecture I'm going to talk about some more general ML advice, for example how to tune your hyperparameters. When you do regularization, or when you choose your model complexity, you have a lot of hyperparameters: you are going to choose how many parameters you have, how strong your regularization is, and so on. So how do you tune your hyperparameters, and on what dataset should you tune them? At the end I'll probably spend 30 minutes on an even more applied angle, on ML advice such as how to design an ML system from scratch; there are a lot more considerations in real applications than when you're doing research. For that part I'll use some slides to talk about general ideas on how to design an ML system in practice.

So that's the general introduction of this lecture. I'm going to start with regularization. Any questions so far?
I think we have probably mentioned this informally in previous lectures. By regularization we mostly mean that you add some additional term to your training loss to encourage low-complexity models. For example, we use J(θ) as our training loss, and then you consider the so-called regularized loss, where you add a term λ·R(θ):

J_reg(θ) = J(θ) + λ·R(θ)

Here R(θ) is often called the regularizer, and λ has different names: you could call it the regularization strength, the regularization coefficient, or the regularization parameter; let's call it the regularization strength. This λ is a scalar, and R(θ) is a function of the parameters, whose value changes as θ changes.
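As a minimal sketch of this objective (the quadratic training loss and all names here are my own illustrative choices, not from the lecture), the regularized loss just adds λ·R(θ) on top of J(θ):

```python
import numpy as np

def training_loss(theta, X, y):
    """Plain least-squares training loss J(theta)."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def l2_regularizer(theta):
    """R(theta) = 0.5 * ||theta||_2^2 (the 1/2 is just a convention)."""
    return 0.5 * np.dot(theta, theta)

def regularized_loss(theta, X, y, lam):
    """J_reg(theta) = J(theta) + lambda * R(theta)."""
    return training_loss(theta, X, y) + lam * l2_regularizer(theta)

# With lambda = 0 the two objectives coincide; any positive lambda
# penalizes parameter vectors with a large norm.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
theta = np.array([1.0, 2.0])
```

With λ = 0 you recover the unregularized training loss; increasing λ trades training fit against the norm of θ.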
The goal of this R(θ) is to give additional encouragement to find a model θ such that R(θ) is small. For example, a typical R(θ) is the L2 regularizer; this is probably the most common choice:

R(θ) = (1/2)·‖θ‖₂²

You take the squared L2 norm and multiply by one half. The half doesn't really matter; it's just a convention, because you multiply by λ in front anyway, so whether you have the half or not just changes your choice of λ.

This is called L2 regularization, and in deep learning people also call it weight decay. There is a reason people call it weight decay; I probably won't have time to discuss it today, but the lecture notes have a very short paragraph on it. If you use this regularization, the update rule looks like weight decay: there is one step in the update rule where you decay, i.e., shrink, your parameters by a scalar. But anyway, it's just a name; you can call it weight decay or L2 regularization.
So this is one of the most common regularizers. You can see that if you add this term to your loss function and you minimize, then you are trying to make both the loss small and the L2 norm of your parameters small, and λ in some sense controls the trade-off between these two terms. If you take λ to infinity, then you only focus on the regularization: you only look for a low-norm solution, but maybe you don't fit your data very well. If you take λ to be, for example, literally zero, then you are not using the regularization; you are only fitting your data.

Actually, even when you make λ very, very small, it can still do something. Say λ is 0.0001, very small; this might still do something, because there may be multiple θ such that J(θ) is really, really close to zero, or maybe even literally zero. If you literally make λ zero, then you are just picking one of the solutions where J(θ) is zero, but you don't know which one you pick. As long as you add a little bit of regularization, you are using it as a tie-breaker in some sense: you are finding solutions such that J(θ) is very small, but you use the norm R as a tie-breaker among all the solutions that have very small training loss.
So this is probably the most typical regularization people use. Another one is the following: you can take R(θ) to be the so-called zero "norm" of the parameter,

R(θ) = ‖θ‖₀ = the number of non-zero entries of θ

Actually this is not really a norm; it's just notation. It is literally defined to be the number of non-zeros: you count how many non-zero entries there are in θ, and that's what this notation means. Sometimes people call it the zero norm, but it's actually not a norm; it's just the number of non-zeros in the parameter. And sometimes people call it sparsity, because if you have very few non-zero entries the vector is sparse; otherwise it's dense.

If you add this term, you get a different effect: you are trying to say, "I'm going to find a model such that the number of non-zeros in it is small." This is particularly meaningful for linear models, in the following sense. Suppose you have a linear model θᵀx. What is this? It's really just the sum Σᵢ θᵢxᵢ, for i from 1 to d. Then you can see that if you have s non-zero entries in θ, that means you are using only s of the coordinates of x. So the number of non-zeros in θ is the number of coordinates, or the number of features, of x that you are using.

You can imagine that for some applications you have a lot of coordinates, or features, in your input: you have many different kinds of information, but you don't know which ones you should use to predict. For example, suppose you want to predict the price of a house. You have many different features, but some features may not be that useful. That could be a situation where you should use this as a regularizer, because you want to say: I want to use as few features as possible, but I also want to make sure my training loss is good. You want to find the simplest explanation of the existing data, where "simplest" means using as few features as possible.

And once you find such a θ, i.e., a sparse θ with only a few non-zeros, you are selecting the right features. In some sense, having a sparse model means you are selecting features, because the non-zeros correspond to the features selected by the model. So people often call this feature selection in certain contexts.
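The counting argument above can be made concrete with a tiny sketch (the vectors are made up for illustration): ‖θ‖₀ counts the non-zero entries, and for a linear model θᵀx only those coordinates of x ever enter the prediction.

```python
import numpy as np

def zero_norm(theta):
    """The "zero norm" ||theta||_0: the number of non-zero entries (not a true norm)."""
    return int(np.count_nonzero(theta))

# A sparse linear model: only features 0 and 3 are used.
theta = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
x = np.array([1.0, 5.0, 7.0, 3.0, 9.0])

selected = np.flatnonzero(theta)          # indices of the features the model uses
pred_full = theta @ x                     # theta^T x over all coordinates
pred_selected = theta[selected] @ x[selected]  # same value using only selected features
```

Dropping the unselected coordinates leaves the prediction unchanged, which is exactly the "feature selection" reading of a sparse θ.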
However, you may have realized that this regularizer, the sparsity, is not differentiable as a function of θ. You're just counting how many non-zeros there are. Suppose one entry is exactly zero; if you make an infinitesimal change to that coordinate, you change the value of this function by a whole unit. That's why it's not differentiable: a differentiable function should satisfy that if you change θ by an infinitesimally small amount, the function output changes by a small amount. But here, if the current sparsity is zero and you change θ a little bit, the sparsity becomes one, so you can have infinitesimally small changes that make the regularizer's value change by a large amount. That's why it's not differentiable, and because it's not differentiable you don't have gradients, you don't have derivatives, and that causes a problem in using it. So basically, even though I told you this is a regularizer, in reality nobody uses it exactly in their algorithm, because if you put this term in the objective it has no gradient; how do you optimize it?
Because it's non-differentiable, what you do instead is use a surrogate. This is the typical surrogate; the reason it is a good surrogate is a little bit tricky, but it serves as a usable, almost-everywhere differentiable surrogate for the sparsity of the model: the one norm,

R(θ) = ‖θ‖₁ = Σᵢ |θᵢ|,

the sum of the absolute values of each coordinate.

I won't attempt to give you a very formal justification for why this is a good surrogate for the zero norm. One reason could be that the one norm is at least closer to the zero norm than the two norm is (one is closer to zero than two is). Another reason, not a very solid mathematical reason but just to give you some intuition: suppose you think of θ as a vector in {0, 1}ᵈ, a binary vector. Then indeed ‖θ‖₁ equals ‖θ‖₀. That's probably another intuitive reason why they are somewhat related. But you can see a lot of problems with this argument: why am I assuming θ only takes values 0 and 1? If the entries can be 0 or 2, the two quantities are no longer equal. So I'm not saying this is really a good argument for why they are related; if you really want to show they are related, or that this is a good surrogate, you have to go through much more math.

Any questions so far?
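The binary-vector intuition can be checked numerically (a toy sketch, not from the lecture): on {0, 1} vectors ‖θ‖₁ and ‖θ‖₀ agree, but the equality already fails once an entry other than 0 or 1 appears.

```python
import numpy as np

def l0(theta):
    """Number of non-zero entries ("zero norm")."""
    return int(np.count_nonzero(theta))

def l1(theta):
    """One norm: sum of absolute values."""
    return float(np.sum(np.abs(theta)))

binary = np.array([1.0, 0.0, 1.0, 1.0])   # entries in {0, 1}: l1 == l0
general = np.array([2.0, 0.0, 0.5, 0.0])  # entries outside {0, 1}: l1 != l0
```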
[Student question, partly inaudible: how does this compare to the regularizer with the L2 norm?] The L2 regularizer, you mean? Yes, okay; that's what I'm going to discuss next.
All right, so that's a great question: why do you sometimes want to encourage sparsity (let's think of the one norm as a surrogate for sparsity), and sometimes want to encourage the two norm to be small? Why is one sometimes better than the other?

In some sense the fundamental reason is the following. Another way to think about regularization, instead of just encouraging low complexity, is that regularization can also impose structures, i.e., prior beliefs about θ. This is at least one of the other ways to think of regularization: it also imposes structure. What are the structures? Sometimes you have a prior belief. For example, suppose you have a prior belief, because of domain knowledge, that θ is sparse. In that case you probably should just use R(θ) = ‖θ‖₁ (or the zero norm): you believe your model is sparse, so why not encourage that? When you have this belief, encouraging a small one norm in some sense limits your search space. Before, you were searching over all possible parameters; now you are only searching over low-one-norm parameters. And because you believe the true model has low norm, narrowing the search space this way always helps you: you didn't lose anything, because you know every model you excluded was not going to be the right solution, and the new search space still contains the right model. So why not do it?

That's another interpretation of the regularizer: it can impose an additional prior belief about the structure of the model. If you believe in a small one norm, you should encourage a small one norm; if you believe your true model has a small L2 norm, you should encourage a small L2 norm. And if you go into more of the mathematical theory, the L2 norm typically corresponds to situations where you believe all the features are useful, but you have to use them in combination, using each of the features a little bit. The L1 norm or L0 norm typically corresponds to situations where you believe only a subset of the features is meaningful, and you should discard the other ones, because the others are just there to confuse you, in some sense. So if you believe your model is sparse, you should use the L1 norm; and if you believe your model shouldn't be sparse, then typically people use the L2 norm.
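One concrete way to see the two priors at work is the standard one-dimensional picture (my own sketch, not derived in the lecture): for min over θ of ½(θ − z)² + λ·R(θ), the L2 penalty uniformly shrinks z toward zero, while the L1 penalty soft-thresholds it, setting small inputs exactly to zero. This is the mechanism by which L1 produces sparse solutions.

```python
def prox_l2(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + 0.5*lam*theta**2: uniform shrinkage."""
    return z / (1.0 + lam)

def prox_l1(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + lam*abs(theta): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# A small input is snapped to exactly zero by L1, but only shrunk by L2.
small, big, lam = 0.3, 2.0, 0.5
```

Coordinates whose signal is below the threshold λ are zeroed out by L1 (feature discarded), while L2 keeps every coordinate, just smaller.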
And if you have a linear model, and you use this L1 regularizer on the loss, this is called the lasso. I'm just defining the name here, because I think it's probably useful for you to at least have heard of it. I don't know offhand what the acronym originally stands for, but it has been around for 20 or 30 years, and it is a very, very important algorithm: for a linear model you apply L1-norm regularization, and that's called the lasso. Everyone in machine learning should know it.
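A minimal numpy-only sketch of the lasso objective, min over θ of (1/(2n))·‖Xθ − y‖² + λ‖θ‖₁, solved by coordinate descent with the soft-thresholding update (toy data and all names are my own; in practice you would reach for a library implementation such as scikit-learn's Lasso):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, snapping to exactly zero inside [-t, t]."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||X theta - y||^2 + lam*||theta||_1."""
    n, d = X.shape
    theta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / n
            # Exact minimizer of the one-dimensional subproblem in theta_j.
            theta[j] = soft_threshold(rho, lam) / col_sq[j]
    return theta

# Toy data: y depends only on the first of three features,
# so the lasso should zero out the other two coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 0.01 * rng.normal(size=100)
theta_hat = lasso_cd(X, y, lam=0.1)
```

The recovered coefficient vector is sparse: the irrelevant features get coefficients of exactly zero, which is the feature-selection behavior discussed above.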
Taking a slightly broader perspective: if you think about nonlinear models, like deep learning models, what are the most popular regularizers these days? I think the L1 norm is not used very often; actually it's pretty much never used. I don't know exactly how frequent it is, but probably less than 10 percent of models use an L1 regularizer, maybe even less than that; 10 percent is probably a big overestimate, maybe one percent.

But L2 regularization is almost always used, even if sometimes only a very weak L2 regularization; here I'm talking about deep learning models. Sorry, maybe let me clarify: for linear models, people try almost anything; anything would be reasonable, and you probably should try all of them. You can try the one norm and the two norm, and sometimes you can try different norms which I didn't write down, say a 1.5 norm, something like that. For nonlinear models, for deep learning models, I think the L2 norm is basically something you almost always use, but only with a relatively small λ; people generally don't use a very large λ. I don't know exactly what the reason is (researchers don't really know that much either), but a small L2 regularization is typically useful for deep learning.

And in deep learning I think some other regularizations can also be useful. For example, you can try to regularize the Lipschitzness of the model, and you can use data augmentation, which we probably haven't discussed; I'm going to discuss it in a later lecture. Data augmentation tries to encourage your model to be invariant with respect to translations, cropping, that kind of thing, for images. I think those are pretty much the only regularization techniques in deep learning.
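As a sketch of the data-augmentation idea just mentioned, using horizontal flipping as the invariance (the array shapes and names are my own illustrative choices): you enlarge the training set with transformed copies that keep the same labels, which pushes the trained model toward predictions that are invariant to the transformation.

```python
import numpy as np

def augment_with_flips(images, labels):
    """Return the dataset plus horizontally flipped copies with the same labels.

    Training on the augmented set encourages the learned model to be
    (approximately) invariant to horizontal flips.
    """
    flipped = images[:, :, ::-1]  # flip each H x W image left-to-right
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))

# Toy batch: 4 "images" of shape 8 x 8 with integer labels.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8, 8))
labels = np.array([0, 1, 1, 0])
aug_images, aug_labels = augment_with_flips(images, labels)
```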
1513.679 -> um
1517.54 -> [Applause]
1518.799 -> this kind of
1521.26 -> pertains to my employer yeah what you
1524.059 -> suggest initially using the
1526.159 -> and one day to kind of eliminate the
1529.1 -> features
1530.919 -> that's a very good question so I think
1533.419 -> this kind of algorithm was pretty
1535.82 -> popular in uh um for before deep
1540.08 -> learning era so when you use linear
1541.88 -> models I think using L1 to do a
1544.279 -> selection and then you use L2 I think
1546.2 -> you know I don't know how exactly how
1547.94 -> popular they are but this is definitely
1549.26 -> one algorithm people try to could you
1551.48 -> could try to use
1553.22 -> um in deep learning I think it's
1556.46 -> probably less likely to be useful you
1558.679 -> know but also depends on the situation
1560.419 -> for example if you don't have enough
1561.44 -> data maybe you are more or less in a
1563.779 -> linear model case but you just need the
1565.88 -> nonlinearity to help you a little bit
1567.08 -> maybe then in that case you should still
1569.179 -> mostly use some kind of like more like
1570.62 -> linear model type of approach if you are
1572.72 -> in the the typical deep learning setting
1575.299 -> for example you do for a vision project
1577.4 -> right you have like images right as your
1580.34 -> inputs I think in those cases you
1582.2 -> probably don't want to select your
1583.64 -> features first I think all the inputs
1586.159 -> are useful so you want to use them as
1589.279 -> much as possible and you just want to
1590.84 -> let the neural networks figure out
1593 -> what's the best way to use those inputs
1599.96 -> any other question by the way this
1601.58 -> lecture will be pretty kind of we don't
1603.679 -> have a lot of math right most of the
1604.94 -> things are about um just uh I I don't
1607.58 -> think there's even a theory here
1608.659 -> sometimes they're just experiences
1609.98 -> because especially if you talk about
1613.58 -> um like the modern machine learning in the
1614.9 -> last five years right everything seems
1616.64 -> to change a little bit right so I like I
1619.1 -> cannot say anything with 100%
1620.6 -> guarantee I can only say okay it sounds
1622.7 -> like people are doing this a lot and
1625.34 -> that's that's the best thing I can I can
1626.84 -> tell you in some sense
1629.72 -> um so so feel free to ask any questions
1632.059 -> any other questions
1637.7 -> right so and the next thing I'm going to
1642.2 -> discuss is the so-called implicit
1643.82 -> regularization effect
1645.74 -> and um
1653.059 -> this uh
1656.48 -> this relates more to deep
1658.159 -> learning and so one reason uh that
1660.86 -> people started to think about this is
1662.48 -> that you know I haven't told you what it
1663.679 -> exactly means so one motivation for
1666.02 -> people to start this line of research
1667.34 -> is that people realize that in deep
1669.08 -> learning you don't use a lot of
1670.039 -> regularization techniques right so as I
1671.72 -> said you only use a weak L2
1673.88 -> regularization and often some of
1676.279 -> these other ones but they only help a
1677.96 -> little bit right they can be useful but
1679.64 -> people don't necessarily use them very
1681.38 -> often so why in deep learning you don't
1684.14 -> have to use strong regularization at
1686.419 -> least like you can feel that the
1688.039 -> regularization methods stop mattering
1690.02 -> that much it still matters when you
1692.36 -> really care about the final performance
1693.62 -> you care about 95% versus 97% but you
1696.86 -> don't have to even if you don't use
1697.88 -> regularization sometimes you get
1699.08 -> reasonably good performance so so that's
1701.6 -> why people are especially with
1702.98 -> theoretical researchers people are
1704.779 -> wondering why you don't need to use a
1707.419 -> strong regularizations in deep learning
1709.52 -> and this is particularly mysterious
1711.08 -> because in deep learning people are
1713 -> using over parametrization like we are
1715.52 -> we are in this regime where you have
1717.14 -> more parameters than the number of
1718.94 -> samples recall that in the last lecture
1721.7 -> we have drawn this uh double descent
1723.98 -> thing right where you have this kind of
1725.299 -> things right here is the number of
1726.5 -> parameters
1730.159 -> and this is the test error
1733.58 -> and you know we have kind of discussed
1735.32 -> that this peak might be just something
1736.82 -> about the sub-optimality of the
1738.38 -> algorithm which let's say you don't care
1740.299 -> for the moment but at least you have to
1742.159 -> care about why here
1744.32 -> um why it's go going down you know still
1746.299 -> here right so why when you have so many
1747.919 -> parameters a lot more parameters you can
1750.5 -> still make your model generalize and
1752.299 -> it seems that more and more parameters
1754.279 -> make it look better
1756.32 -> so this overparameterized
1759.02 -> regime is kind of mysterious because you
1761.059 -> don't use strong regularization but you
1762.679 -> can still generalize
1764.48 -> so that was the kind of the motivation
1766.52 -> for people to study this and people
1768.44 -> realized that there's you know even
1770.659 -> though in this regime suppose you don't
1772.1 -> use any explicit regularizer you
1773.72 -> make the Lambda literally zero
1775.7 -> right in this regime still it can
1777.86 -> generalize and the reason it can
1779.059 -> generalize in many cases is because
1781.88 -> um you can still have some implicit
1783.38 -> regularization effect even without an
1785.36 -> explicit regularizer and where does that
1787.58 -> effect come from what kind of thing can
1788.96 -> make that happen the reason is that
1792.2 -> um the optimization process the
1794.24 -> optimization algorithm
1795.98 -> the optimizer can implicitly
1800.36 -> regularize you
1806.539 -> so so why this can happen I think the
1808.88 -> reason is that let me draw kind of like
1810.919 -> an illustrative figure which I
1813.62 -> use pretty often so suppose this
1815.899 -> is the
1817.34 -> let's say this is the parameter and
1819.74 -> suppose this is the loss
1821 -> landscape the loss surface
1824.48 -> so meaning that here is Theta let's say
1826.399 -> is one dimensional
1827.84 -> and because we are in this deep learning
1829.94 -> setting where we have like a non-linear
1831.86 -> models and non-convex loss function so
1834.2 -> maybe a loss function it looks like this
1835.94 -> maybe
1838.159 -> so this is the loss function
1843.32 -> and you have two maybe you have multiple
1845.6 -> Global Minima of your loss function
1847.7 -> right so so this is a global minimum
1849.679 -> this is a global minimum
1851.96 -> and but this but you have multiple
1854.36 -> Global minimum in your loss function
1855.74 -> however
1856.7 -> here I'm talking about the training loss
1858.74 -> right if you really look at the test
1860.179 -> loss you're going to look they will look
1861.86 -> a little bit different the test loss
1863.299 -> would be different from the training
1864.44 -> loss so test loss maybe look like
1866.299 -> something like this
1869.84 -> maybe okay let me draw something I'll
1871.88 -> come to my figure so that
1874.34 -> um
1890.779 -> so this is the training loss now test
1893.24 -> loss probably look like this
1900.74 -> so that means that even though both of
1903.559 -> these two Global Minima are
1906.679 -> um
1907.46 -> are good solutions from the training
1909.26 -> loss perspective one of them is better
1911.539 -> from the test performance
1913.58 -> perspective Right This Global minimum is
1915.44 -> good and Better Than This Global minimum
1917.299 -> because the test performance is better
1919.899 -> so and in some sense like the
1922.88 -> regularization effect is trying to
1924.32 -> choose the the right Global minimum
1927.02 -> right you want the regularization in
1928.22 -> fact to choose the right Global minimum
1929.659 -> so that you can do some tie-breaking
1931.7 -> or you can encourage certain kinds of
1933.02 -> models maybe this model is kind of
1935.179 -> simpler and this model has more
1936.559 -> norm than this model so that's why you
1938.419 -> prefer this one so if you use explicit
1940.399 -> regularization what you do is that you
1942.38 -> you're going to say I'm going to change
1943.82 -> the training loss I'm going to add
1945.08 -> something to prefer this one over this
1947.6 -> one I'm going to reshape the training loss
1949.46 -> right that's what the explicit
1950.779 -> regularization would do but the implicit
1953 -> regularization we'll do is the following
1954.559 -> so if you consider an algorithm that
1957.44 -> optimizes for example suppose you
1959.059 -> run an algorithm which
1960.62 -> is always initialized here this is the
1962.059 -> initialization
1965.299 -> and you do gradient descent so you're
1967.52 -> gonna
1968.24 -> do something like this
1970.52 -> and it converges to this one
1972.44 -> so this algorithm will only converge
1975.38 -> to this one but not this one
1977.12 -> just because you initialize at this spot
1979.34 -> right
1980.48 -> right so that is kind of in some sense a
1982.64 -> preference to converge to this Global
1984.14 -> minimum
1985.76 -> over this Global minimum because your
1987.98 -> algorithm somehow prefers one Global
1990.26 -> minimum over the other just because your
1992.12 -> algorithm has certain specifics
1993.799 -> right so the initialization makes it
1996.14 -> prefer to converge to this one
1998.299 -> and and there could be other kind of
1999.74 -> effects for example if you
2001.779 -> um use you know bigger step sizes maybe
2004 -> you are more likely to converge to this
2005.44 -> one maybe or maybe vice versa you know
2007.48 -> depending on on kind of like the
2009.34 -> situations right so this is a very
2010.659 -> illustrative thing with one dimension
2012.58 -> then you don't really have a lot of like
2014.94 -> flexibility here but if you have a very
2017.38 -> very complex thing then if you run
2019.059 -> different algorithms different algorithm
2020.62 -> will converge to different Global
2022.48 -> minimum and that preference to certain
2025.059 -> type of global minimum is in some sense
2027.159 -> is a regularization effect
2029.74 -> um so so that you don't converge to
2031.12 -> arbitrary Global minimum
2034.96 -> um does it make some sense
2041.08 -> you said so how does like having a large
2043.179 -> number of parameters ensure that it
2045.399 -> initializes at that point
2048.04 -> yeah I I was I was silent on that in
2050.8 -> some sense like I didn't really say why
2052.599 -> the initialization has to be here like
2055.3 -> this is an active area of research so
2057.7 -> what we are sure about is that the
2059.44 -> algorithm could have this effect the
2062.08 -> algorithm could possibly
2064.659 -> um prefer certain kinds of global
2066.04 -> minimum than the others but why it would
2068.2 -> prefer which kind of global minimum we
2070.78 -> don't exactly know for certain kind of
2072.22 -> like toy cases we know but for uh for
2075.399 -> the general cases we don't I'm going to
2076.96 -> show you one cases where we actually can
2079.179 -> say what does the algorithm prefer to do
2081.94 -> but that's very very simple case for
2084.399 -> General case I think the the research
2086.08 -> this is still very open research on
2088.3 -> question
2089.44 -> I saw two other questions here
2091.679 -> it seems like the optimizer is
2094.24 -> keeping a number of parameters it doesn't
2096.28 -> quite necessarily need
2098.14 -> no no here what do you
2101.2 -> mean by the optimizer
2103.3 -> the axis is the value of the parameter
2105.94 -> it's just we only have one parameter
2107.98 -> I'm just drawing the landscape
2109.599 -> of the parameter and I can only draw
2111.64 -> something in one dimension so this
2114.4 -> is the value of the parameter you are
2115.66 -> just tuning this parameter you are
2116.619 -> doing gradient descent and this is the loss
2118.359 -> surface
2119.38 -> so so it does depend on where you
2121.72 -> initialize right so if you initialize at
2124.06 -> different places you're going to
2124.839 -> converge different Global minimum and
2126.4 -> and they may have different
2127.359 -> generalization effects
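As a toy illustration of that point (my own example, not from the lecture — the loss function is made up): a one-dimensional training loss with two global minima, where plain gradient descent ends up at whichever minimum is on the same side as its initialization.

```python
# A made-up 1-D loss with two global minima, at theta = -1 and theta = +2.
# Which one gradient descent reaches depends only on the initialization.
def loss(theta):
    return (theta + 1.0) ** 2 * (theta - 2.0) ** 2

def grad(theta):
    # d/dtheta of the loss above, by the product rule.
    return (2 * (theta + 1.0) * (theta - 2.0) ** 2
            + 2 * (theta + 1.0) ** 2 * (theta - 2.0))

def gradient_descent(theta0, lr=0.01, steps=1000):
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

left = gradient_descent(-2.0)    # initialized left of the barrier: ends near -1
right = gradient_descent(3.0)    # initialized right of the barrier: ends near +2
print(left, right)
```

Both endpoints have zero training loss; if the two minima had different test loss, the initialization alone would have decided how well the model generalizes.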
2130.66 -> different algorithms and then just
2132.579 -> choose the one that has the best
2133.599 -> performance that's pretty much the right
2135.52 -> to do um of course there are some I'm
2137.5 -> going to discuss this you know but uh
2139.54 -> more detail later but but you know
2141.7 -> basically you know
2144.339 -> like you can have some intuition where
2146.44 -> you have you know the theoreticians have
2148.66 -> tried to understand you know what kind
2151 -> of like algorithms can help
2152.56 -> generalization but I think the
2154.359 -> conclusion at least so far is
2156.94 -> that it's very far from conclusive they
2159.16 -> can give you some intuition but they are
2160.66 -> not going to be like predictive you know
2163.3 -> they don't just tell you what to do so
2165.64 -> you still have to try a lot yeah so yeah
2169.18 -> going back to this you know this is just
2170.5 -> one dimension you know another way to
2171.94 -> think about is that if you can think of
2173.2 -> like a two-dimensional question for
2174.64 -> example you are skiing in the in the in
2176.56 -> a ski resort right so your objective is
2179.14 -> basically
2180.3 -> minimizing your like uh you're trying to
2183.099 -> go downhill right that's your objective
2184.9 -> so and this ski resort probably has
2187.599 -> a lot of villages right like where you
2189.52 -> can eventually go home there are
2191.02 -> multiple parking lots right so
2193.119 -> in some sense you are saying
2194.68 -> that you know one of these parking lots
2195.94 -> is great right so one of these
2197.32 -> parking lots is really where your car is so
2200.079 -> you want to go to that one
2201.76 -> um so so diff but but uh but different
2204.339 -> algorithm would lead you to do diff to
2206.56 -> convert to different uh different
2208.06 -> parking lots right so for example
2209.32 -> someone is skiing very fast then
2212.32 -> when you do that you cannot go to
2214 -> those kind of small trails so then it
2215.68 -> leads you to go to one of the parking lots
2217.359 -> and some other one prefers like um a
2221.02 -> wider kind of like trails and then you
2222.88 -> go to the other parking lot so so
2225.64 -> a different algorithm will lead
2227.26 -> you to a different parking lot and
2228.52 -> different parking lots have different
2229.98 -> generalization performance eventually
2233.74 -> so
2235.18 -> um so this is the high level um
2236.619 -> intuition so I'm going to
2240.7 -> um let's see
2242.5 -> I'm going to discuss a concrete case
2244.18 -> which
2245.14 -> um
2245.76 -> which will also be part of a homework
2248.079 -> question so this concrete case uh just
2250.42 -> to give you a concrete sense of how this
2253.359 -> could even be possible so I'm going to
2256.06 -> show you the high level thing and there
2257.859 -> are some mathematical part which will be
2259.42 -> in the homework so this is the linear
2262.06 -> this is a in a linear model
2265.54 -> so interestingly even though this
2267.28 -> implicit regularization effect
2269.02 -> was mostly discovered after deep
2271.42 -> learning uh started to become kind of
2273.52 -> powerful but actually you can still see
2275.74 -> it in linear models and that's how
2277.66 -> researchers start to do research so
2280.54 -> um so so let's say suppose we are just
2282.52 -> in the most vanilla linear model setting
2284.68 -> where you have some n data points
2291.4 -> this is just the the trivial linear
2293.619 -> regression and your loss function is
2296.32 -> something like just the L2 loss the
2298.42 -> mean squared error
2305.859 -> something like this you have a linear
2307.599 -> model but let's say let's make the the
2310.839 -> the one different thing is that we
2313.18 -> assume n is much smaller than d
2316.06 -> so you have very few examples and a very
2318.579 -> high dimension so what is d d is the
2320.859 -> dimension of the data
2323.38 -> and the N is the number of examples I'm
2325.72 -> going to assume n is much smaller
2327.04 -> than d
2328.359 -> so this is over parameterized you have
2330.46 -> multiple Global minimum while you have
2332.619 -> much so first of all you have multiple
2333.94 -> Global minimum
2339.82 -> why because I'm claiming that
2342.82 -> there are many Theta such that
2345.7 -> each Theta satisfies
2350.2 -> y i is equal to Theta transpose x i for
2354.16 -> all i
2357.28 -> why because you know how many equations
2359.26 -> here you have right so so if you want to
2361.78 -> so this is the equation to make training
2363.7 -> loss zero which is means Global minimum
2365.56 -> right so if you have all of this
2367.359 -> equality then it means you are at a
2369.28 -> Global minimum of this training loss and
2371.98 -> why there are multiple status such that
2374.44 -> you can satisfy this that's because you
2377.32 -> can count how many equations they are
2378.76 -> right so there are n equations
2384.7 -> right and D variables
2389.92 -> and these are linear equations right so
2392.079 -> so I guess the linear algebra tells us
2394.24 -> that if you have n equations d
2395.74 -> variables and if n is less than d
2397.9 -> I think
2400.119 -> then you're
2402.7 -> gonna have at least one solution and if
2404.44 -> n is much much smaller than d then you
2406.42 -> have a Subspace of solutions and that's
2408.52 -> called the kernel the null space
2412.24 -> of the matrix anyway you have a Subspace of
2414.579 -> solutions
2416.079 -> um uh for for this kind of linear system
2418.359 -> equations
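A small numpy sketch of this counting argument (my own illustration with made-up sizes): with n = 3 equations and d = 10 unknowns, the zero-loss solutions form a (d − n)-dimensional affine subspace, so shifting any one solution along a null-space direction of X gives another global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 10                        # n equations, d variables, n << d
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# One particular solution with X @ theta = y (lstsq returns the min-norm one).
theta_p = np.linalg.lstsq(X, y, rcond=None)[0]

# Directions that X maps to zero: the last d - n right singular vectors.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[n]                    # one null-space direction of X

# Shifting along a null direction gives a *different* zero-loss solution.
theta_other = theta_p + 5.0 * null_dir
print(np.linalg.norm(X @ theta_p - y), np.linalg.norm(X @ theta_other - y))
```

Both residuals are numerically zero, even though the two parameter vectors are far apart: an entire subspace of global minima.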
2420.04 -> right so
2421.78 -> um and that's why you're gonna have
2423.82 -> multiple Global minima of the training
2425.44 -> loss because the entire Subspace of
2427.119 -> solutions are Global minima of the
2429.46 -> training loss so the question is which one
2432.099 -> you're going to converge to right so
2433.3 -> which one your Optimizer will will
2434.859 -> choose
2436.359 -> so it turns out that you know if you use
2439.359 -> gradient descent with zero initialization
2442.42 -> then you are going to choose the one
2444.28 -> with the minimum L2 norm so here is
2446.98 -> the claim
2448.48 -> so the claim is that
2450.579 -> if you do gradient descent
2453.88 -> with initialization
2455.98 -> Theta is zero
2459.16 -> uh this will converge to
2465.3 -> uh the minimum Norm solution
2474.46 -> so what does the minimum norm solution
2476.68 -> mean formally it means that you converge to
2479.68 -> a solution with the smallest L2 Norm
2483.94 -> among
2485.32 -> those Solutions such that
2487.66 -> those global minimum of the loss
2489.82 -> function
2491.079 -> so
2492.7 -> so with gradient descent you are not only
2494.44 -> just finding a Theta such that the loss
2496.06 -> function is zero right so typically when
2498.339 -> you think about optimization the
2499.48 -> optimization is trying to find a
2501.28 -> solution such that the loss function is
2503.14 -> minimized right that's true you still
2504.94 -> find you definitely find a solution such
2506.5 -> that the loss function is minimized but
2509.14 -> you actually have a tie-
2511.3 -> breaking effect among the solutions such
2513.579 -> that the loss function is
2515.56 -> minimized you actually choose the one
2517.839 -> with the smallest L2 norm
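Here is a small numpy check of that claim (a sketch I wrote, with made-up sizes): gradient descent on the squared loss, initialized at exactly zero, lands on the same point as the pseudo-inverse (minimum-norm) solution, and its iterate stays in the span of the rows of X throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 30                              # overparameterized: d >> n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on (1/2)||X @ theta - y||^2, initialized at zero.
theta = np.zeros(d)
lr = 0.01
for _ in range(5000):
    theta -= lr * (X.T @ (X @ theta - y))

theta_min_norm = np.linalg.pinv(X) @ y    # closed-form minimum-norm solution

print(np.linalg.norm(X @ theta - y))      # training loss is ~0
print(np.linalg.norm(theta - theta_min_norm))  # and GD picked the min-norm point

# Sanity check: every update adds a vector of the form X.T @ (...), so theta
# lies in the row span of X; projecting onto that span leaves it unchanged.
proj = np.linalg.pinv(X) @ X              # orthogonal projector onto row span
print(np.linalg.norm(proj @ theta - theta))
```

So among the whole subspace of zero-loss solutions, this particular optimizer silently picks out the one closest to the origin, which is the implicit regularization effect being described.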
2530.02 -> so I guess you know in some sense
2531.76 -> the kind of intuition is the following
2533.079 -> so I'm going to try to draw this this is
2536.44 -> a little bit
2538.359 -> um I need to try to draw this well
2541.9 -> um
2546.76 -> so suppose you have a
2549.28 -> suppose let's say the intuition is that
2551.32 -> suppose let's say you have n is one
2555.579 -> um and
2558.64 -> say d is three
2561.7 -> so you just have one equation one linear
2563.74 -> equation
2564.88 -> and
2567.22 -> um so and you have like three variables
2569.68 -> so that means that the family of
2572.14 -> solutions is a two dimensional Subspace
2574.78 -> so
2576.339 -> make sure to draw this
2581.44 -> okay
2609.099 -> okay so so here the Subspace I'm drawing
2612.4 -> here is
2613.839 -> this is the family of theta such that
2616.96 -> you satisfy that the loss is zero right
2620.319 -> this is the Subspace
2621.76 -> right so you have a substrate Solutions
2623.5 -> and but which solution you converge to
2625.72 -> that's the question
2627.04 -> it turns out that
2628.9 -> if you
2631.599 -> if you start with let me see what maybe
2634.24 -> I will write here
2636.099 -> um
2636.819 -> it turns out that you're going to find a
2638.2 -> solution such that
2640.3 -> this is the solution that you'll find
2642.22 -> this is the solution
2653.02 -> drawing this is a little bit challenging
2654.64 -> I guess
2656.14 -> how did I do this I think I did this
2664.96 -> so so you consider that you project zero
2667.18 -> to this Subspace right so that you find
2669.16 -> this point this point is the the
2671.68 -> solution with the minimum norm that is
2673.54 -> closest to zero among on the Subspace
2676.06 -> and this is the solution that you you
2678.339 -> will find you're not going to find other
2680.26 -> Solutions with gradient descent
2683.44 -> um with initialization zero
2685.599 -> so basically that's the claim the claim
2687.22 -> is that you're going to find this
2688 -> particular solution but not the other
2689.2 -> Solutions
2690.76 -> and the reason the fundamental
2693.099 -> reason is pretty simple like
2694.119 -> especially if I draw it in this way of
2696.4 -> course if you want to prove it it's a
2697.66 -> little bit more complicated
2700.78 -> um so the reason is really just that
2703.24 -> you start with zero
2705.579 -> this is uh where you start with right we
2707.38 -> need that and you have a property such
2709.96 -> that
2711.04 -> when you how do I
2715.54 -> um
2717.04 -> you need to see this
2719.2 -> so you have a property such that if you
2721.96 -> start with
2723.94 -> if the initialization is zero right and then at any
2728.02 -> time
2731.859 -> so uh your Theta is always in the span
2736.24 -> of all the data points here we
2739.3 -> actually have only one data point
2741.64 -> so
2743.079 -> um so so basically your Theta cannot
2745.06 -> move arbitrarily in any places you only
2748.18 -> have you have a restriction on where the
2750.339 -> setup can go so actually for this
2752.44 -> particular case what happens is really
2754 -> just that you are just moving it along
2755.44 -> this direction
2758.74 -> and here you'll find this point on
2760.839 -> the Subspace and that's what
2762.52 -> gradient descent is doing so gradient
2764.02 -> descent will not do something like this
2765.52 -> will not converge to here it will not
2767.38 -> converge to here it will just go
2768.7 -> directly go to this this mean this
2771.04 -> closest point where the point that is
2773.26 -> closest to zero on the Subspace
2776.68 -> so so this is this is probably a
2779.26 -> property of the optimizers right you can
2780.94 -> imagine you may have some other
2782.74 -> Optimizer suppose you design some crazy
2784.24 -> Optimizer which does this or does this
2786.4 -> then you will converge to a different
2787.599 -> point but if you use gradient descent
2789.339 -> you're going to do this
2799.68 -> you show that gradient descent is doing this
2801.94 -> just by saying that the iterate
2804.4 -> is always in the span of the data
2806.38 -> I think this is uh this is actually
2808.3 -> something we have proved
2810.579 -> in the
2811.9 -> kernel lecture I'm not sure for a
2815.14 -> different purpose you know it's not for
2816.339 -> this purpose remember that in a kernel
2818.319 -> lecture we try to show that your
2820.78 -> parameter is always in a linear
2822.64 -> combination of the data and and then
2824.859 -> there the purpose was that you want to
2826.3 -> represent it by the betas in that
2828.7 -> lecture so it's a different reason but
2831.28 -> it's different on goal but it's the same
2833.92 -> fact right your Theta is always in the
2835.72 -> span of the data
2850.98 -> is defined to be the all the solutions
2853.48 -> that have zero loss so this whole
2856.42 -> span that's my definition of the
2858.819 -> Subspace this is the family of solutions
2860.859 -> that have zero training loss
2864.4 -> so and the question is which one I'm
2866.2 -> gonna converge to like I was
2868.18 -> arguing that you know there are multiple
2870.64 -> Global minima right so this whole span
2872.8 -> is all Global minima all of them are
2874.359 -> Global minima and which one you converge
2875.92 -> to so different algorithms probably
2877.96 -> would converge to different points
2880.24 -> so if you run gradient descent you're
2881.92 -> going to converge to one particular one
2883.54 -> in this span
2890.5 -> right but but this
2892.119 -> um um this this phenomenon also shows up
2894.099 -> in other cases but it's going to be you
2895.78 -> know much more kind of complicated like
2897.7 -> like I think they are only a very
2899.5 -> limited number of situations where we
2901.3 -> can theoretical proof where you converge
2903.579 -> to
2904.599 -> um but but it's almost always the case
2906.28 -> that the optimizer has some preferences
2908.14 -> the optimizer will not converge to
2910.119 -> arbitrary zero training loss solution it
2913.18 -> will converge to one particular zero
2915.16 -> training loss solution and sometimes that
2916.9 -> solution just generalizes much better
2918.52 -> than the other ones
2925.2 -> so um
2928.68 -> so is it only for linear models
2932.68 -> right so only for linear
2935.68 -> models the family of zero loss
2939.22 -> solutions is a subspace right so if you have
2941.56 -> non-linear models then the family of
2943.72 -> solutions
2945.099 -> satisfying this wouldn't be a subspace maybe
2947.2 -> it's a manifold something some other
2948.7 -> weird structure uh right so so in that
2951.76 -> sense this is very special
2958.599 -> the solution that's gonna be in that span
2962.579 -> it's going to be the constrained
2964.54 -> optimization problem that we just saw
2966.46 -> like it's going to constrain
2969.22 -> itself to this while minimizing the norm
2973.02 -> right right so so I didn't show you the
2975.52 -> full proof so this point turns out to be
2977.319 -> the point that you converge to turns out
2978.76 -> to be the minimum norm solution
2981.52 -> and it turns out that you actually just
2983.44 -> going straight at least for this at
2985.18 -> least one case so you know not it's
2987.4 -> actually it's not even always true that
2988.72 -> you are going in a straight line like
2991 -> some but uh but you always go in this
2993.339 -> Subspace
2994.72 -> um so
2996.22 -> I'm answering the question maybe I
2997.78 -> didn't um
2999.599 -> can you prove that yeah you you can
3002.64 -> prove that prove it yeah I think the
3003.96 -> homework question I actually asked that
3005.819 -> you're gonna show this point is
3007.68 -> exactly the minimum norm solution and
3009.06 -> also that you're going to converge to it
3010.859 -> oh okay
3012.3 -> actually you can have a actually you can
3014.52 -> have a pretty concrete
3016.98 -> representation of this point right it's
3018.66 -> really just some inverse some of the
3020.819 -> Matrix times something you can you can
3022.5 -> compute what exactly is and and you can
3024.78 -> show you converge to the point
3026.76 -> um I'm not sure whether the homework
3028.619 -> asks you to so I think the homework has
3030.72 -> to show both
3041.28 -> but we will have a lot of hints you know
3043.079 -> along the way it's not going to be
3044.099 -> just show this that's it
3051.9 -> and maybe for example another just to
3054.359 -> give you a sense on you know what these
3055.859 -> kind of things can change right suppose
3056.94 -> you initialize here then you wouldn't
3059.28 -> converge to here so you probably would
3061.44 -> converge to somewhere here
3063.48 -> and you know so uh if you use
3067.44 -> stochastic gradient descent you probably
3069.18 -> wouldn't converge exactly here either
3070.619 -> you'll probably converge to
3072.059 -> somewhere different so where do
3075.3 -> you exactly converge to it's a
3077.16 -> very hard question we don't really know
3078.839 -> uh the only
3081.78 -> thing we know right now I think formally
3083.88 -> is that this matters if you use
3085.26 -> different algorithms you converge to
3086.579 -> different solutions and different
3088.44 -> solutions generalize differently so you
3090.359 -> have to consider the effect of the
3092.16 -> optimizers
3095.22 -> and going back to this the reason here
3098.22 -> is really so I guess like a this in some
3102.24 -> sense this kind of is trying to explain
3103.5 -> why you can generalize here
3105.72 -> that's because of this implicit
3108.059 -> regularization effect even though you
3110.4 -> don't have regularizers you still
3112.14 -> implicitly regularize the L2 norm so and
3115.02 -> that's why in this regime even though
3117 -> you have a lot of parameters but
3118.2 -> actually you are still implicitly
3119.4 -> regularizing the norm and if you look at
3121.619 -> the norm the norm would look like this
3125.579 -> so this is the norm as you change the
3128.16 -> parameter
3129.24 -> so so basically this is saying that when
3131.46 -> you have a lot of parameters actually
3132.78 -> your real Norm is actually relatively
3134.819 -> small and that's why you can generalize
3137.88 -> so um so so the the the
3141.24 -> reason why you don't generalize in the
3143.28 -> middle is because this minimum norm
3145.5 -> solution is not actually doing well in
3147.78 -> the middle for some other reason so
3150.119 -> um so the norm actually turns out to be
3151.74 -> big there but the norm is very small
3153.24 -> in the overparameterized regime even though
3154.98 -> you use a lot of parameters
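To see that norm curve numerically (my own sketch with random made-up data): for a fixed n, compute the minimum-norm interpolating solution as more feature columns are added. Its norm is largest around d ≈ n and shrinks as the model becomes more overparameterized; since the feature sets are nested, the minimum norm can only go down as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_max = 20, 200
X_full = rng.normal(size=(n, d_max))
y = rng.normal(size=n)

# Min-norm zero-training-loss solution using only the first d features.
norms = []
for d in [20, 50, 100, 200]:
    theta = np.linalg.pinv(X_full[:, :d]) @ y
    norms.append(float(np.linalg.norm(theta)))
print(norms)   # non-increasing: more parameters, smaller norm
```

The d = 20 case is the interpolation threshold (d = n), where the solution norm tends to blow up, matching the peak in the double-descent picture.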
3165.66 -> thank you
3167.339 -> Okay so
3169.44 -> so now let's talk about you know
3172.44 -> um how do you really do um how do you
3175.079 -> really find out what's that you know
3176.16 -> I've told you that we don't know too
3177.48 -> much about you know what how does the
3179.099 -> optimizer uh change things right so we
3181.26 -> also don't know exactly how does the
3183.24 -> model complexity uh uh change things
3185.819 -> right so you only know some intuitions
3187.14 -> right so you know that if you have more
3189.18 -> complexity it turns out to be more
3190.92 -> likely to
3192.599 -> overfit but you don't know exactly what
3194.52 -> is the right complexity right so how do
3195.96 -> you find out the right model the right
3198.26 -> optimization algorithm the right you
3200.64 -> know regularizer all of this where you
3201.96 -> have so many decisions where you
3203.099 -> probably have like 10 decisions you have
3204.359 -> to make in this machine learning
3206.059 -> algorithm so how do you find out what's
3209.04 -> the what's the best thing so so I think
3211.26 -> the typical way is just that you
3213.66 -> use a validation set to
3217.98 -> um to figure out what's the best
3220.44 -> decision
3221.76 -> so
3222.96 -> maybe just to motivate that just briefly
3225.059 -> so the the easiest way to do is that you
3227.88 -> just use a test set right so you have
3230.28 -> some test set and you just you try all
3232.619 -> kind of algorithms all kind of models
3234.24 -> all kind of like a regularization
3236.339 -> strength and you see which one has the
3238.319 -> best performance on the test set
3240.9 -> so that's okay as long as you only use
3243.119 -> one you only use the test set at the end
3245.94 -> right so to try all of this algorithm in
3248.76 -> advance you know you you and then you
3250.44 -> collect some test the site or maybe you
3252.599 -> collect the test set before but you
3254.4 -> never touch it right so that's okay so
3256.38 -> you you so you if you only use the test
3258.839 -> test once then you can use the tester to
3261.72 -> evaluate the performance of all possible
3263.819 -> algorithms all possible kind of like
3265.8 -> models uh you you want to use
3268.859 -> So that works. However, the
3272.64 -> problem is that sometimes
3277.2 -> you want to do this iteratively: you
3279.359 -> look at the test set, see what
3280.619 -> the performance is, and then you go back
3281.94 -> and say, okay, maybe I'll change my model
3284.04 -> size, or maybe I'll change my
3288.3 -> optimizer — maybe I'll
3290.099 -> switch from gradient descent to stochastic
3292.02 -> gradient descent — or maybe I want to add a
3293.76 -> regularization term
3295.5 -> to the loss function.
3300.42 -> If you want to do this iteratively,
3302.16 -> then what I said before is not
3304.619 -> going to work. That's because,
3306.9 -> typically, a test set
3309.96 -> can only be used once:
3312.839 -> if you use it multiple times,
3315.24 -> what happens is that you
3318.3 -> can overfit to the test set —
3321.72 -> your later decisions
3323.64 -> become decisions that are
3327.48 -> overfitting to the test set
3329.16 -> you have seen before. The
3331.859 -> validity of the test set is only
3333.54 -> ensured when you see it only after
3337.26 -> you do the tuning. If you see
3340.14 -> the test set, then you
3342.839 -> do more training, and then you test
3345.42 -> again, the second evaluation on the
3347.22 -> test set is not guaranteed to
3349.619 -> be valid — you may have overfit to
3352.619 -> the test set. Does that make sense?
3355.98 -> I'm trying not to over-complicate
3358.559 -> this — that's why I'm
3360.66 -> using informal words for it — but
3363.839 -> are there any questions? So how do
3366.78 -> we deal with this? The test
3368.16 -> set we can only use once; at least,
3369.78 -> we cannot use it
3371.4 -> interactively — we cannot
3373.2 -> look at the test set, tune, and then
3375.18 -> look at the test set again. One way
3378.119 -> to deal with this is that you
3384.3 -> have a hold-out, or validation, set.
3386.28 -> Basically, you split
3389.099 -> the data
3393 -> into three parts: one part is the training set,
3398.339 -> one part is the validation set,
3403.559 -> and one part is the test set.
3406.92 -> With the test set you have to be
3412.44 -> very careful: you
3414.359 -> shouldn't touch it. The test set
3416.52 -> is only for the very end, when you
3418.859 -> use it to evaluate your final
3420.119 -> performance.
3421.619 -> The validation set, on the other hand, you
3424.92 -> use to tune
3427.079 -> hyperparameters.
3429.72 -> By hyperparameters I mean all
3432.48 -> of the choices
3435.119 -> you are making: for example, the
3437.22 -> batch size, the
3439.619 -> lambda in the regularization, maybe
3442.44 -> the choice of optimizer, the
3444.78 -> number of neurons in your deep
3447.359 -> learning model,
3448.98 -> how long you're going to
3450.42 -> train — all of these decisions
3453.24 -> that you make in
3455.04 -> this process are called
3456.72 -> hyperparameters. You use the
3459.72 -> validation set to tune the hyperparameters,
3461.4 -> and you use the training set to
3462.54 -> tune the real parameters —
3465.72 -> that is, to optimize the parameters.
3468.48 -> Typically,
3470.94 -> the parameters
3473.94 -> are just numerical values in the model
3475.92 -> whose individual
3478.319 -> meanings you don't know, but the
3479.88 -> hyperparameters are the things
3481.38 -> whose meanings you do know —
3483.599 -> batch size, learning rate, step size —
3485.4 -> they all mean something, and
3487.859 -> you use the validation set to
3489.48 -> tune the hyperparameters.
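The three-way split described here can be sketched in a few lines of Python (a minimal sketch: the function name and the 80/10/10 fractions are illustrative choices, not something prescribed in the lecture — with very large datasets the validation and test fractions are often much smaller):

```python
import random

def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Randomly partition a dataset into train / validation / test sets."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = [examples[i] for i in idx[:n_test]]
    val = [examples[i] for i in idx[n_test:n_test + n_val]]
    train = [examples[i] for i in idx[n_test + n_val:]]
    return train, val, test

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the random seed makes the split reproducible, so the same examples stay held out across experiments.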
3491.64 -> So the basic process is
3493.92 -> that you start
3495.54 -> training with some hyperparameters, then you
3498.54 -> validate the performance, and then
3501.72 -> you go back and train again, maybe with
3503.7 -> some other hyperparameters, and you
3505.26 -> repeat this iteration many times.
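As a toy illustration of this loop (a sketch: the one-parameter ridge-style model, the synthetic data, and the candidate lambda values are all made up for the example, not from the lecture), the hyperparameter is chosen on the validation set alone, and the test set is consulted exactly once at the end:

```python
import random

# Toy model: a single constant c, fit with an L2 penalty lam * c^2.
# The closed-form minimizer of sum (y - c)^2 + lam * c^2 is below.
def fit(train, lam):
    return sum(train) / (len(train) + lam)

def mse(data, c):
    return sum((y - c) ** 2 for y in data) / len(data)

rng = random.Random(0)
data = [5.0 + rng.gauss(0, 1) for _ in range(300)]
train, val, test = data[:200], data[200:250], data[250:]

# Iterate: train with each hyperparameter, validate, keep the best.
best_lam = min([0.0, 0.1, 1.0, 10.0, 100.0],
               key=lambda lam: mse(val, fit(train, lam)))

# Touch the test set exactly once, after all tuning is finished.
print(best_lam, round(mse(test, fit(train, best_lam)), 3))
```

The point is structural: the test set never appears inside the `min(...)` selection, only in the final line.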
3507.48 -> After you are done
3509.819 -> with everything and you have found a model
3511.38 -> that you are happy with — by
3513.48 -> "happy with" I mean
3515.7 -> you have found a model that is very good on
3517.38 -> the validation set — then you finally test
3520.44 -> your model on the test set, and that can
3522.96 -> be done only once. In some sense — I'm
3525.54 -> not sure how many of you have taken
3526.799 -> part in Kaggle competitions —
3528.54 -> they are structured exactly like
3529.92 -> this. There is an online platform
3532.26 -> where people release datasets
3534.839 -> and set up
3536.88 -> challenges for people to submit
3538.74 -> machine learning models to solve
3540.059 -> tasks. In a Kaggle
3542.22 -> competition, the
3544.619 -> organizer has a test set that nobody
3546.839 -> can touch at all — this test set
3549.48 -> is used only once, at the very
3552.24 -> end, when they decide who the winner is.
3554.579 -> The
3557.299 -> organizer releases
3559.339 -> the rest — actually, I'm not sure;
3561.599 -> sometimes they give you a division,
3563.28 -> saying this is the validation set and this
3564.66 -> is the training set, and sometimes they just
3565.859 -> release the whole thing
3568.02 -> to you and you
3570.42 -> divide it yourself. Even if
3572.16 -> they release it in one format, you can
3574.559 -> re-divide it however
3576.119 -> you want. Suppose you have
3577.92 -> divided all the training
3579.42 -> examples into these two sets; then you can do
3580.859 -> whatever kind of
3582.78 -> optimization
3585.18 -> you want. And typically
3587.52 -> they do have a
3589.98 -> validation set which is used
3591.66 -> for computing the scores on the
3592.92 -> leaderboard — there's a
3594.299 -> leaderboard which tells you
3595.859 -> how well you are doing against others, at
3598.5 -> least temporarily. So
3601.2 -> the leaderboard evaluates
3602.88 -> on the validation set, but the
3604.5 -> leaderboard may not be exactly the
3607.26 -> same as the final ranking: it's possible
3609.78 -> that in the end you find out that
3611.88 -> somebody who was leading on the
3613.02 -> leaderboard did not, in the very
3614.7 -> final test,
3616.38 -> perform the way
3619.079 -> the validation set suggested.
3621.059 -> But this is the general setup
3623.7 -> that people use.
3626.579 -> Does that make sense?
3628.559 -> One common question
3630.24 -> that people generally ask —
3631.859 -> and that I ask myself as well — is
3633.96 -> how reliable this
3635.94 -> validation set is:
3637.26 -> if you have very
3640.38 -> high performance on the validation set,
3641.88 -> should you trust it?
3644.22 -> On one hand, you shouldn't trust
3646.68 -> it 100%, because if you
3648.72 -> could trust the validation-set
3649.859 -> performance, why would you need a test set?
3651.24 -> The test set is supposed to give
3653.04 -> you the final verdict on success;
3654.9 -> it's the thing
3658.619 -> that guarantees
3659.94 -> to give you the right answer. So the
3662.22 -> validation set —
3664.619 -> you probably shouldn't trust it 100%.
3667.559 -> On the other hand, in prior work —
3669.839 -> people realized this in the last five years,
3671.28 -> and I think there is a sequence of
3673.2 -> papers on it —
3674.76 -> the validation-set performance
3676.26 -> is actually well correlated with the
3678.059 -> test set, so it is a reasonable
3680.64 -> indicator of how good your
3682.319 -> performance on the test set is. There is
3684.119 -> just no theoretical guarantee that
3686.16 -> the two are exactly the same. But
3688.319 -> in most cases, if you don't do
3690.24 -> anything crazy — if you don't
3692.099 -> somehow just memorize the entire
3694.559 -> validation set by building some
3696.72 -> kind of lookup
3699.48 -> table — then typically the
3703.38 -> performance on the validation set is
3705.18 -> very close to the test set. There is a very
3708.96 -> important paper from probably three or
3710.94 -> four years ago
3712.26 -> by Berkeley people: they
3714.299 -> looked at
3715.859 -> maybe 300 Kaggle competitions,
3719.22 -> and they looked at the
3723.66 -> ranking of performance on the
3725.22 -> validation set — on the leaderboard — and
3726.9 -> at how it correlated with
3728.46 -> the final winner, the final
3730.74 -> performance, and they found that they are
3732.18 -> very correlated. This suggests that
3733.859 -> the validation set is actually a pretty
3735.359 -> good indicator for the test set, even though
3737.099 -> it's not guaranteed.
3739.319 -> And in this vein, typical machine
3743.819 -> learning practice is that
3746.28 -> when people publish
3753.599 -> papers, in some sense they
3755.4 -> publish results based on validation
3757.74 -> sets.
3759.119 -> For example, if you look at
3760.92 -> ImageNet performance, the
3765.059 -> so-called test
3766.92 -> performance that people report is
3769.2 -> actually performance on
3770.64 -> a validation set, because that
3772.68 -> so-called test set has been seen so
3774.42 -> many times. I don't know exactly whether
3776.88 -> there's an official name for it
3779.4 -> in the official ImageNet
3781.619 -> dataset, but at least the set on
3783.119 -> which you report your performance
3785.16 -> shouldn't be
3787.68 -> considered a test set, because a test
3789.359 -> set you should use only once, and
3790.98 -> people have actually used it so many
3792.299 -> times — maybe a million times. So,
3795.96 -> abstractly speaking, these days
3797.7 -> when you publish a paper you use a
3798.96 -> validation set; only in a
3800.64 -> Kaggle competition do you use a test set to
3802.5 -> really decide the winner.
3803.94 -> But empirically it sounds like they
3806.339 -> are very close, so that's why
3808.319 -> we don't worry too much about it.
3814.859 -> Any questions?
3818.4 -> Oh — I think
3822.14 -> it's called "Kaggle"; I
3825 -> don't know exactly how to pronounce it. It
3826.859 -> is a platform that hosts a
3829.38 -> lot of competitions, maybe
3831.96 -> a hundred every year or something like
3833.819 -> that. You can submit your model, and
3835.859 -> sometimes there
3837.24 -> is a prize for winning the
3839.46 -> competition.
3842.7 -> Right.
3843.72 -> And by the way, this validation
3845.52 -> set — sometimes people now call it a
3847.2 -> development set as well.
3853.44 -> I don't know how popular that name is,
3856.02 -> but at least if you say "validation
3858.299 -> set", I think everyone will know what
3859.859 -> you're talking about; "development set" I
3861.9 -> think most people would know as well, but
3863.88 -> it's a relatively new term from the last
3865.799 -> five years.
3869.64 -> [Student question, partially inaudible:] about treating the training
3872.28 -> and validation sets as part of a bigger set
3874.619 -> once you've actually decided on what hyper-
3876.24 -> parameters you want to use.
3884.359 -> Right — so how do you do the
3886.859 -> split? The most
3890.28 -> typical way is that you just split
3892.02 -> randomly: you reserve maybe a tenth of
3894.839 -> the dataset as the validation set, maybe 20%,
3897.059 -> depending on how much data you
3899.46 -> have.
3901.2 -> I think what you are probably
3902.7 -> thinking of is so-called cross-
3903.839 -> validation, which does something
3905.76 -> more elaborate: you
3907.5 -> split your dataset into
3909.839 -> multiple splits and run multiple
3911.22 -> experiments on the different splits.
3916.2 -> I'm not going to cover it
3919.5 -> in this lecture, mostly because
3921.24 -> these days, if you have a large enough
3922.98 -> dataset, you typically just do this
3925.02 -> static split, just because it's much
3927.599 -> easier — you don't have to run your
3928.859 -> algorithm multiple times — and this is
3931.26 -> what is done in
3933 -> most large-scale
3935.94 -> machine learning situations.
3938.099 -> But if you only have, say, 100
3940.799 -> examples, then indeed, as you said,
3943.5 -> fixing 20 examples as the
3945.18 -> validation set is a little
3946.68 -> wasteful, so then you have to do
3948.18 -> cross-validation. We
3950.22 -> have a section in the lecture
3952.2 -> notes about cross-validation,
3954 -> with a description of
3955.559 -> the practical procedure;
3958.44 -> if you're
3959.16 -> interested you can read it — it's
3961.14 -> nothing very complicated either.
3963.78 -> thank you
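The k-fold cross-validation idea mentioned above can be sketched as follows (a minimal sketch: the `train_and_score` callback and the predict-the-mean example are assumptions of this illustration, not the exact procedure in the lecture notes):

```python
def k_fold_cv(examples, k, train_and_score):
    """Average the validation score over k folds: each fold is held out
    once while the model is trained on the remaining k-1 folds."""
    n = len(examples)
    scores = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        val = examples[lo:hi]
        train = examples[:lo] + examples[hi:]
        scores.append(train_and_score(train, val))
    return sum(scores) / k

# Example: score a predict-the-mean model by squared error on each fold.
def mean_model_score(train, val):
    c = sum(train) / len(train)
    return sum((y - c) ** 2 for y in val) / len(val)

print(k_fold_cv([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3,
                train_and_score=mean_model_score))  # 6.25
```

This makes the trade-off concrete: every example gets used for validation once, at the cost of running training k times — which is why, as the lecture says, a single static split is preferred at large scale.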
3964.799 -> Okay — so
3966.66 -> I'm going to use the last
3968.76 -> ten minutes to talk about a more
3970.799 -> applied perspective.
3972.96 -> I'm going to use the slides, so I guess
3975.54 -> I'll
4034.16 -> see whether this works.
4038.48 -> Okay, great.
4041.839 -> It's not centered, is it?
4045.68 -> Wait —
4047.78 -> okay, sounds good.
4049.819 -> Okay, good. So in this
4052.819 -> part of the lecture we're talking about
4054.98 -> some ML advice.
4058.64 -> These slides were made by
4060.92 -> our other instructor, Chris Ré, with the
4064.099 -> help of Alex Ratner;
4066.319 -> I'm pretty much just relaying
4069.38 -> what he says in the slides.
4073.88 -> I think the slides used to
4075.619 -> be a little longer than this — I'm
4076.7 -> going to release the longer version as
4077.9 -> well — so I shortened them to only 20
4079.819 -> or 30 minutes.
4081.559 -> Part of the reason is that
4082.88 -> the slides also contain material
4084.02 -> that has been covered
4085.88 -> on the whiteboard, and part of the
4087.98 -> reason is that there are some applied
4089.48 -> parts that I think we don't have
4091.94 -> a lot of time to discuss
4094.4 -> this quarter. But I'm going to
4096.44 -> release the longer slides as well
4098 -> for your reference. This set
4101.54 -> of slides is mostly for somewhat
4104.06 -> more applied
4105.259 -> situations: imagine, for example,
4107.299 -> that you're at a startup
4109.219 -> and you are doing machine learning to
4111.02 -> solve some concrete problem.
4113.6 -> It's a little less like
4114.859 -> research, because you're going
4116.12 -> to see that you have many
4117.319 -> more issues than in a concrete research
4120.14 -> setting. Actually, in research you
4121.64 -> sometimes have this too. The
4122.839 -> most typical research setting is
4125.66 -> that you have a very
4127.16 -> concrete dataset: you know
4128.839 -> the input and output, you know everything,
4130.339 -> there's no flexibility, you
4132.56 -> cannot redefine the problem, and you just
4134.239 -> want to get the best number. That's one
4135.679 -> type of research — I don't think it's
4137.359 -> the most typical one either —
4138.799 -> but it is one type,
4140.779 -> and from there you can have more
4142.46 -> and more flexibility: you can change your
4143.839 -> data, you can rephrase your problem, you
4145.759 -> can figure out what the right problem is.
4146.779 -> And once you really do it in
4149.96 -> industry, it's going to be
4151.699 -> much more complicated.
4153.739 -> Some disclaimers to start with — I
4156.56 -> think this is Chris's disclaimer, and it's
4158.239 -> also mine. There is
4160.58 -> no universal
4162.38 -> ground truth here;
4164.06 -> it's really just
4165.5 -> experience —
4167.42 -> experience from people doing
4169.64 -> this in real life.
4173.359 -> And things change over
4175.58 -> time: what people thought
4177.5 -> was the right thing to do five years ago
4179.12 -> may have changed by now. I'm going to
4182.54 -> go through this a little quickly,
4185.48 -> and I'm going
4187.94 -> to omit some details as well, but feel
4189.92 -> free to stop me.
4192.739 -> Right, so —
4194.9 -> In some sense there are many phases
4198.62 -> to an ML project if you really do it in
4200.42 -> industry. For example, one
4202.4 -> thing you want to decide is whether you
4204.26 -> really need an ML system to begin with:
4206.36 -> some problems are not
4207.86 -> necessarily suitable for ML. I
4210.26 -> don't do
4212.42 -> as much industry work —
4215.54 -> Chris is also an entrepreneur
4218.199 -> besides being
4220.1 -> a professor,
4221.719 -> so he knows a lot about this — but
4223.219 -> even I know that sometimes
4225.199 -> people sell
4228.08 -> their product as an ML system when
4229.58 -> the underlying system is not
4231.739 -> really using much ML. Sometimes you
4233.78 -> don't really need ML. And when
4236.659 -> you do use ML, if it
4238.159 -> doesn't work, what do you do? And
4241.04 -> how do you
4243.02 -> deal with the whole ecosystem?
4245.3 -> As a running example,
4247.699 -> we're going to build a spam detector,
4249.26 -> and the question is how you
4250.88 -> detect spam.
4252.98 -> We'll use this example a lot
4255.62 -> in this part of the course.
4259.1 -> These are the seven steps of an ML system.
4262.1 -> Again, this is a little broader
4264.5 -> than just ML research:
4266.36 -> you're thinking about designing a system
4267.92 -> that can actually work in practice.
4271.64 -> First, acquire data, and look
4274.4 -> at the data; maybe you want
4277.1 -> to create the train /
4278.42 -> development / test sets we discussed.
4281.06 -> Then define a
4283.34 -> specification, which I'll
4284.719 -> discuss — in some sense this says
4286.4 -> that you have to have an evaluation
4288.08 -> metric for your model: in what sense do
4291.14 -> you want your model to
4292.52 -> succeed? Then you build your
4294.5 -> model, and try a bunch of models — maybe
4296.42 -> you'll spend a lot of time on
4298.1 -> step five. Then eventually
4300.08 -> you measure the model's
4301.94 -> performance, not necessarily
4303.32 -> only according to the specification you
4305 -> defined in step four, but maybe
4306.5 -> with other
4308.42 -> measurements, for example speed, training
4310.88 -> time, and so forth. And then eventually
4312.62 -> you repeat, and maybe you have to
4314.12 -> repeat many times.
4316.04 -> so
4317.54 -> I'm going to go through these steps
4318.739 -> relatively quickly — I only have
4320 -> about 15 minutes — but
4323.12 -> if you're interested, you can look at
4324.679 -> the longer slides as well.
4327.56 -> Suppose you want to decide
4329.78 -> what is spam and what is not.
4333.56 -> Ideally you want data sampled from
4335.54 -> the distribution your spam product will be
4337.52 -> run on: you want your
4338.84 -> data to be close
4340.76 -> to the final test data. You
4343.28 -> don't want to collect spam
4345.679 -> data from 30 years ago and use
4348.5 -> it to train something
4349.82 -> that has to work today. But
4352.94 -> this is not always available,
4354.32 -> because you never know what
4356.78 -> spam emails will look like ten years from now, so
4359.12 -> you have to make some sacrifices.
4361.52 -> Sometimes you don't even
4362.6 -> have the features: maybe
4364.88 -> your existing records didn't save
4366.62 -> everything — maybe they just saved the title
4368.36 -> of the email and not the entire
4370.28 -> content — and that limits your
4372.14 -> ability to detect spam. And there
4375.26 -> are many legal issues around looking at the
4376.52 -> data.
4377.719 -> And — this is according to Chris,
4380 -> and I think it's true as well —
4381.8 -> you'll get it wrong on the first try:
4383.239 -> sometimes you'll
4385.58 -> find that the data you collected are
4387.14 -> not the right ones, and you'll have to repeat.
4389.84 -> And after you collect some
4391.58 -> data, you have to look at it.
4393.98 -> This is something we
4395.36 -> don't really teach much in
4397.76 -> this machine learning course — looking at
4399.32 -> your data — because we
4401.12 -> mostly assume that you already have
4402.26 -> your data and have already made the right
4403.58 -> assumptions: you already know your
4404.84 -> data is Gaussian, and then you
4407.239 -> run Gaussian discriminant analysis.
4409.4 -> But we never say how
4411.44 -> you decide whether you can really make
4413 -> the Gaussian
4415.159 -> assumption. In practice you have
4417.5 -> to do that, because you have to
4419.659 -> see whether the data makes sense.
4421.76 -> There are many nuances there: for
4423.56 -> example, sometimes your data are not as
4425.06 -> good as you think — maybe
4427.159 -> the format is not
4429.26 -> right, maybe there are
4430.64 -> outliers,
4431.96 -> and so forth — and only by
4435.14 -> looking at the data can you see
4437.12 -> what's going on.
4438.98 -> Even in research
4440.12 -> I sometimes experience this.
4441.62 -> In one of my projects,
4444.02 -> I think we just used
4446.6 -> the wrong data from day one, in some
4448.219 -> sense — some of the
4449.78 -> data was just corrupted
4451.88 -> by accident — and we were tuning on it
4454.1 -> until, only about
4455.96 -> a month later, we realized
4457.4 -> it. Of course, in research it's
4459.56 -> probably easier to catch —
4461 -> and a month is a long time for us
4462.679 -> to take to detect it —
4464.84 -> you can usually detect this easily, but
4466.64 -> in real-life cases it's
4468.62 -> sometimes even harder: you
4470.36 -> don't necessarily even have the tools to look
4472.159 -> at your data; you may have to
4474.14 -> build tools to look at your data.
4476.56 -> And you need to think
4480.32 -> about different subpopulations — maybe
4481.699 -> spam from .edu addresses versus spam from .com
4484.699 -> addresses — and see what the
4487.52 -> differences are. This will give you a lot
4489.26 -> of intuition about what data you
4491.6 -> should use and what sort of models
4493.52 -> you should use. And do this at every
4496.52 -> stage.
4502.219 -> This is also the reason
4504.56 -> to build tools that let you look at
4506.12 -> data conveniently. If you
4508.28 -> just look at the data once, then
4510.62 -> sure, you can just
4512.6 -> print something out; but if you want
4514.04 -> to look many times, then you should
4515.9 -> have convenient tools, which
4517.88 -> eventually reinforces the habit and
4520.1 -> makes you more likely to look at the
4522.32 -> data. At least in
4523.76 -> research I have also noticed this: if
4526.64 -> the data is very hard to visualize, then
4528.8 -> people are less likely
4530.78 -> to visualize it. So sometimes
4533.96 -> it requires an investment so that you
4536.9 -> have the tooling, and in the
4538.82 -> future there is less
4542.3 -> cost to looking at your
4544.04 -> data. And you should do this at every
4545.96 -> stage, in many cases.
4549.739 -> And this is about domain knowledge:
4554.06 -> sometimes understanding the data
4555.38 -> requires expertise.
4558.08 -> There are
4561.08 -> some examples in the
4562.64 -> slides which I removed just to save
4564.679 -> some time, but in short:
4567.08 -> sometimes — for example, if your
4568.94 -> data is corrupted — only
4571.159 -> experts can tell. For example, with
4573.44 -> medical data, only experts may
4575.12 -> know that the data are corrupted, while
4577.04 -> from a machine learner's
4578.78 -> perspective the data look fine.
4581.54 -> Now let me talk about the train / dev /
4584 -> test split. This of course is
4587.42 -> something important
4589.76 -> to do,
4591.14 -> and in practice it's rather less
4593.9 -> clear-cut
4595.1 -> than in research,
4596.96 -> because in research
4598.28 -> you sometimes already get a split
4600.08 -> in the first place —
4602 -> the data already comes with a
4603.38 -> split — whereas in real life you
4605.42 -> sometimes have to avoid certain kinds of
4607.82 -> leakage.
4607.82 -> leakage uh so for example maybe
4610.34 -> sometimes for example let me take an
4612.98 -> extreme case right so suppose your data
4615.44 -> has reputations so if you have like a
4618.14 -> million data but actually it's just a
4620.239 -> two like every data points is repeated
4623.3 -> twice so essentially you just have 500k
4625.4 -> and but repeat it twice if that's the
4628.219 -> case then you split the data then you're
4630.199 -> going to see some reputations between
4631.64 -> like you some examples in the test will
4633.8 -> also show up in the training exactly the
4635.239 -> same
4636.02 -> That would be a disaster, so you
4638.659 -> have to avoid such situations —
4640.1 -> and this actually happens in
4641.78 -> Kaggle contests.
4645.14 -> In many of them —
4646.64 -> I actually tried some
4649.219 -> of these Kaggle contests at some
4651.26 -> point; that was
4653.42 -> probably three or four years ago,
4656.9 -> maybe six years ago — at that time, in many of
4659.659 -> the contests, if you looked,
4662.12 -> there was always some kind of
4663.98 -> forum for discussions.
4688.239 -> I'm sure this happens even more
4690.739 -> in industry, which I'm less
4692.12 -> familiar with, but even in Kaggle
4694.28 -> contests —
4695.9 -> in many of them, if
4698 -> you look at the forum, always after
4700.28 -> a few weeks
4702.679 -> someone will figure
4704.84 -> out some leakage, just because
4707.179 -> some training examples are very, very
4709.4 -> close to test examples, and they
4711.44 -> use this leakage to improve
4714.32 -> their numbers — some kind of
4715.94 -> weird rule that makes the
4717.739 -> validation performance much better than
4719.54 -> you would have thought — and then everyone has to use
4721.34 -> it. It's kind of interesting: I
4722.78 -> don't know why, but everyone who
4724.4 -> finds this kind of leakage
4725.78 -> somehow posts it in the forum, and
4728.12 -> then — I don't know whether this is
4729.62 -> always true, but in the few cases
4732.02 -> I've seen —
4733.94 -> everyone else has to use this little gadget
4736.28 -> to improve their model's performance,
4737.6 -> because if you don't use it, your model's
4739.52 -> performance is just not as good as
4740.84 -> the others'.
4742.88 -> I don't know whether
4744.98 -> they now have better
4746.54 -> ways to detect this leakage and design
4748.76 -> the competitions better,
4750.679 -> but this is something
4752.06 -> you have to pay attention to in
4754.28 -> practice. Another tricky
4756.739 -> thing is: what is a good split? We
4759.08 -> have discussed whether you should do
4760.34 -> random splits. In research, as I
4762.86 -> said, a random split is pretty much
4764.36 -> the best you can do, because you
4766.159 -> really, literally care about the
4768.02 -> validation performance.
4770.78 -> But the problem is that
4773.12 -> sometimes, in the real
4775.159 -> world,
4780.02 -> the test dataset is not really
4782.12 -> what you care about.
4784.4 -> That's why, when you split,
4786.02 -> you also want to split the train and
4787.94 -> validation sets in a way such that the
4789.8 -> validation set is
4791.3 -> somewhat closer to the real test set.
4793.1 -> let's so I guess this is the situation
4796.58 -> suppose you are thinking
4798.14 -> about stock price prediction.
4800.06 -> So your final goal is to
4802.4 -> predict the price
4803.9 -> in the future,
4805.58 -> which is something you just don't
4807.14 -> have at all. So now
4810.14 -> suppose you have data between
4812.54 -> 2000 and 2020; you have
4816.739 -> these 20 years of data. So how do you do
4818.6 -> the train/validation split? Should
4822.02 -> you just do a random split, or should
4823.88 -> you do something else? One
4826.04 -> possible option is that you should
4828.44 -> probably split it into, for example, 2000 to
4831.679 -> 2015 as the training set, and 2015 to
4835.28 -> 2020 as the validation set. Why would you
4837.8 -> argue that that's a reasonable option?
4842.179 -> Possibly just because the last five
4844.84 -> years are more predictive of the future than the earlier years.
4847.58 -> So, you know, I'm
4849.8 -> not necessarily saying that this
4851.42 -> is the only option or the best
4852.86 -> option, but this is at least something to
4854.36 -> consider. So this is
4856.52 -> kind of different from what we
4859.04 -> do in research, and the reason is
4861.86 -> just that you care
4863.06 -> about the performance in the future,
4864.62 -> which is something you don't have access
4867.08 -> to.
4868.36 -> Right?
4869.96 -> So I guess, in this case,
4875.36 -> the better split is to use
4876.92 -> the first 50 days to predict the last 50
4878.42 -> days.
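The time-ordered split described above can be sketched in a few lines. A minimal sketch, assuming each record carries a year field; the field name, the cutoff year, and the toy price data are illustrative, not from the lecture:

```python
# Hypothetical sketch of a time-based train/validation split for
# stock-price-style data: train on 2000-2014, validate on 2015-2019,
# instead of splitting at random.

def temporal_split(records, cutoff_year=2015):
    """Split time-stamped records so validation is strictly later than training."""
    train = [r for r in records if r["year"] < cutoff_year]
    val = [r for r in records if r["year"] >= cutoff_year]
    return train, val

# Twenty years of made-up data, 2000 through 2019.
records = [{"year": y, "price": 100 + y - 2000} for y in range(2000, 2020)]
train, val = temporal_split(records)
print(len(train), len(val))  # 15 5
```

The point is that the validation set imitates the real deployment condition (predicting the future from the past), which a random split would not.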
4880.64 -> Okay, and create a specification.
4883.76 -> So I think this is mostly
4885.62 -> related to how you
4887.84 -> define what you want to predict and
4890.84 -> what the goal is.
4893.06 -> So, in many cases you
4894.92 -> can use a machine learning model in many different
4897.02 -> ways, depending on what you
4902.36 -> want to use it for. And you also
4903.98 -> care about different
4905.84 -> perspectives. For
4908.179 -> example, what is spam?
4910.1 -> The definition of spam sometimes,
4911.719 -> you know, is different
4914.42 -> between different people. So,
4916.04 -> maybe, do you think an ad
4919.34 -> from, say, Google
4922.159 -> is spam?
4924.14 -> Maybe I always think of it like that, but
4925.82 -> someone else probably prefers to
4927.5 -> receive some ad emails, at a
4930.92 -> very low rate. So you have to
4933.08 -> specify exactly what you want to predict:
4934.699 -> what is really the definition
4935.96 -> of spam? And you don't want to have
4938.56 -> ambiguities, at least from
4941.239 -> the model's perspective, because
4943.28 -> machine learning models don't like
4945.26 -> ambiguities. You really want a
4947.12 -> clear cut: what is spam and what is not.
4949.88 -> And also, what level of
4952.159 -> expertise is needed to understand it?
4953.6 -> Because, if you specify what spam is,
4955.76 -> you can have a definition, but
4957.26 -> if your labelers are not able to
4961.699 -> label the spam according to your
4963.32 -> definition, it's not going to be useful.
4964.46 -> Suppose you have a very, very
4965.78 -> complex definition of spam, and then you
4968.36 -> say, I have this data and I'll ask
4970.82 -> labelers to label it, but the labelers
4972.62 -> cannot execute my definition of spam
4974.78 -> easily. That's going to be another issue.
4978.14 -> And also, with the specification,
4982.58 -> you use the specification
4987.8 -> to define a
4990.38 -> set of examples, because
4991.94 -> eventually, if you just have some kind
4994.34 -> of text description of
4995.9 -> what spam is, that's probably not
4997.58 -> useful. You have to really have a set of
4999.5 -> test examples, and the test examples have
5001.719 -> labels, and that is really your
5003.4 -> definition of spam.
5005.679 -> So, for example, one of the quick and
5008.86 -> dirty tests here is
5011.739 -> whether your definition of spam
5014.98 -> can pass the so-called
5016.6 -> inter-annotator agreement test. Basically,
5019.36 -> what you do is
5020.739 -> write down some
5023.52 -> definition, then take some
5025.84 -> randomly selected examples of emails, and then ask,
5029.8 -> say, three annotators to see whether they
5032.02 -> can agree on which emails are spam or
5034.84 -> not according to the definition you give
5036.52 -> them. And often you don't really get that
5039.52 -> high an agreement; you
5041.38 -> don't get 100% agreement. In many
5043.36 -> cases, people's interpretations
5044.86 -> of the same definition will be
5046.6 -> different.
5047.8 -> And let's say you have 95%
5049.42 -> agreement; I think that's already
5050.739 -> considered to be great.
5052.679 -> If you have 95% agreement, then
5055.96 -> the question becomes whether
5059.14 -> it's meaningful to shoot for an
5062.199 -> accuracy of more than 95%, if the
5064.48 -> annotators
5066.58 -> only agree with each other 95
5069.94 -> percent of the time.
5073.5 -> Actually, sometimes you can do better
5075.34 -> than that, just because humans
5077.38 -> sometimes
5080.8 -> have a less accurate
5082.84 -> interpretation, and sometimes machine learning
5084.58 -> models can do better. But
5086.08 -> typically, you probably shouldn't shoot
5087.82 -> for much higher than 95% in many cases.
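The quick-and-dirty agreement check described here can be computed as average pairwise percent agreement over the three annotators. A minimal sketch with made-up labels (a fuller analysis might use Cohen's or Fleiss' kappa, which correct for chance agreement):

```python
# Hypothetical sketch of an inter-annotator agreement check: average
# pairwise percent agreement over annotators' spam/ham labels.
# The labels below are made up for illustration.

from itertools import combinations

def pairwise_agreement(annotations):
    """annotations: one list of labels per annotator, all the same length."""
    pairs = list(combinations(annotations, 2))
    total = 0.0
    for a, b in pairs:
        matches = sum(1 for x, y in zip(a, b) if x == y)
        total += matches / len(a)
    return total / len(pairs)

ann1 = ["spam", "spam", "ham", "ham", "spam"]
ann2 = ["spam", "ham",  "ham", "ham", "spam"]
ann3 = ["spam", "spam", "ham", "spam", "spam"]
print(pairwise_agreement([ann1, ann2, ann3]))  # about 0.733
```

If this number came out at, say, 0.73 as above, chasing 95% model accuracy against these labels would be questionable; the specification itself would need work first.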
5091.78 -> So
5096.52 -> then you're going to do this iteratively. For example, you
5098.38 -> have to examine the
5099.4 -> specification; you've got to look at
5101.14 -> what the disagreement is and
5102.34 -> why you have disagreement. Maybe that
5103.9 -> means you have to change your
5104.86 -> specification.
5107.56 -> And the last
5109.84 -> question is kind of interesting: do
5111.04 -> you tune the people or the machine?
5111.94 -> Eventually, at some point, if you
5114.1 -> have a lot of disagreement,
5116.38 -> sometimes you have to train the people
5117.4 -> to label things correctly. For
5119.44 -> example, even in the image
5121.54 -> classification problem, we have
5122.98 -> clearly defined labels, right? Dogs, cats.
5125.739 -> But once you go to the more
5129.04 -> fine-grained breeds of dogs,
5131.32 -> some of the labelers don't really
5132.88 -> recognize different breeds of dogs, so
5134.739 -> you have to train the labelers in some
5136.9 -> way. I have a friend who did this
5138.88 -> during his PhD, and basically he had a lot
5141.82 -> of training documents for the
5143.679 -> labelers, you know, Amazon Mechanical Turk workers,
5145.36 -> and they actually had to ask
5147.34 -> the Turkers to pass some exams to be
5150.52 -> able to be a labeler for them.
5153.46 -> So it's actually kind of complicated.
5156.699 -> Okay, so I guess I'll be quick, given
5159.34 -> that we are almost running out of time.
5160.6 -> Then, when you do the modeling (this
5162.82 -> is the more machine learning part), you
5164.98 -> want to implement the simplest possible
5166.84 -> model.
5168.94 -> Keep it simple. And
5171.28 -> sometimes, I think this is the
5173.44 -> key thing: don't get carried away with new
5175.6 -> models; use them to understand the data.
5177.639 -> So sometimes
5183.159 -> the model is not the only angle;
5184.84 -> sometimes you want to use the
5186.159 -> model to understand what the
5188.86 -> problems with the data are, and
5190.719 -> sometimes you can fix the data
5192.639 -> and then the model's
5193.9 -> performance becomes much better.
5196.78 -> So this is a whole loop:
5199.139 -> your bottleneck may not just be
5201.4 -> the model; sometimes it could come
5204.1 -> from some other place, maybe the data,
5205.659 -> maybe the specification, maybe the train/
5207.639 -> test split, and so forth.
5210.699 -> And have some baselines, so that
5213.1 -> you know where you stand,
5215.26 -> and you need to do some ablation
5216.88 -> studies, and so forth.
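The advice above about keeping baselines can be as cheap as a majority-class predictor: if a real model cannot beat it, something is wrong upstream (data, specification, or split). A minimal sketch with toy labels, not from the lecture:

```python
# Hypothetical sketch: a majority-class baseline as a sanity check.
# Any real model should comfortably beat this; if not, look at the
# data, the specification, or the split before blaming the model.

from collections import Counter

def majority_baseline(train_labels):
    """Return a predict function that always outputs the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _x: most_common

def accuracy(predict, examples, labels):
    correct = sum(1 for x, y in zip(examples, labels) if predict(x) == y)
    return correct / len(labels)

train_labels = ["ham", "ham", "ham", "spam"]  # made-up toy labels
predict = majority_baseline(train_labels)

test_x = [None] * 5          # features are ignored by this baseline
test_y = ["ham", "spam", "ham", "ham", "spam"]
print(accuracy(predict, test_x, test_y))  # 0.6
```

Scikit-learn ships this idea as `DummyClassifier`, but the hand-rolled version makes clear how little the baseline actually does.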
5219.4 -> And then, step six: you need
5221.56 -> to monitor the output.
5225.52 -> You have to monitor the output so
5227.26 -> that you don't make mistakes twice,
5228.76 -> and you want to catch new
5230.5 -> mistakes as soon as possible.
5232.659 -> And you want to measure
5234.219 -> different things, including simple things; for
5235.719 -> example,
5239.98 -> there are a bunch of qualities you care about here,
5246.76 -> and so on and so forth.
5249.28 -> And this is probably one thing,
5250 -> one challenge, we are
5251.739 -> really facing these days with machine
5253.36 -> learning models; I would say
5254.679 -> probably one of the most important
5255.88 -> challenges. The reason is that you
5258.58 -> have a distribution shift: your
5260.199 -> training and validation distributions are
5262.78 -> very different from the distribution that
5264.58 -> you will test on eventually.
5266.44 -> Or maybe they are similar, but there
5268.42 -> are some special
5269.38 -> subpopulations that make them different.
5271.659 -> For example, you train on
5273.4 -> San Francisco street views and you test on
5276.28 -> Arizona street views. For example, when
5278.679 -> you build an autonomous driving
5280.6 -> car, you train on some
5282.52 -> street views and you have to test on
5283.9 -> some other places, and that creates a
5285.76 -> distribution shift, and then
5288.28 -> you can have surprises.
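One cheap way to surface the kind of surprise described above is to report accuracy per subpopulation instead of a single aggregate number. A minimal sketch with made-up city tags and predictions (the group names and data are illustrative):

```python
# Hypothetical sketch: break evaluation down by subpopulation (here, city)
# so a shift like "trained on San Francisco, tested on Arizona" shows up
# as a per-group accuracy gap rather than hiding inside the average.

from collections import defaultdict

def per_group_accuracy(groups, preds, labels):
    """Return {group: accuracy} for examples tagged with a group id."""
    hit = defaultdict(int)
    tot = defaultdict(int)
    for g, p, y in zip(groups, preds, labels):
        tot[g] += 1
        hit[g] += int(p == y)
    return {g: hit[g] / tot[g] for g in tot}

groups = ["SF", "SF", "SF", "AZ", "AZ"]  # made-up city tags
preds  = [1, 0, 1, 1, 0]
labels = [1, 0, 1, 0, 1]
print(per_group_accuracy(groups, preds, labels))  # {'SF': 1.0, 'AZ': 0.0}
```

Here the aggregate accuracy is 0.6, which looks tolerable, while the per-group breakdown makes the complete failure on the shifted subpopulation obvious.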
5290.4 -> And there are not so many
5293.26 -> good ways to deal with this,
5295.36 -> except that you have to be
5296.8 -> careful about it, and there are new
5298.42 -> algorithms.
5301.239 -> And this is incredibly hard;
5302.92 -> there are no real solutions in industry, and I
5304.9 -> don't think there are real solutions
5307.96 -> in research either. I think
5310 -> we're going to have one guest lecture,
5311.739 -> by James, about the robustness of
5313.78 -> machine learning models; there is a lot
5315.28 -> of recent work on it.
5317.26 -> But so far, I
5319.12 -> think we definitely have better
5320.56 -> algorithms for being more robust, but I
5324.04 -> think the
5325.48 -> robustness is still not
5327.28 -> ideal.
5329.56 -> Okay, so
5334.239 -> I think I'll just jump to step
5335.92 -> seven: you have to repeat and
5338.199 -> look at the data. I'll release the longer
5340.48 -> version of the slides if you
5341.8 -> are interested in some of the details.
5343 -> Thanks.
Source: https://www.youtube.com/watch?v=NirZnqwYfYU