
Stanford CS229 Machine Learning I Feature / Model selection, ML Advice I 2022 I Lecture 11
For more information about Stanford’s Artificial Intelligence programs visit: https://stanford.io/ai
To follow along with the course, visit:
https://cs229.stanford.edu/syllabus-s…
Tengyu Ma
Assistant Professor of Computer Science
https://ai.stanford.edu/~tengyuma/
Christopher Ré
Associate Professor of Computer Science
https://cs.stanford.edu/~chrismre/
To view all online courses and programs offered by Stanford, visit: http://online.stanford.edu
Content
...advice. It happens that I write in too small a font, so please feel free to stop me and let me know. As I said, after a few lectures I start to forget about it, so please remind me.

Okay, so I guess...
First, let me briefly review the last lecture, just very quickly. Last lecture we talked about two important concepts: underfitting and overfitting. Here our goal is to make generalization work: we want to generalize to unseen examples.

Last time we talked about two possible reasons why your test error may not be good enough. One possible reason is overfitting. Overfitting means that your training error is actually pretty good, your training loss is pretty small, but your test loss is pretty high. We discussed the possible reasons why you can have overfitting, and two possible reasons are: maybe you have too complex a model (for example, last time we discussed that if you use a 50-degree polynomial for a very, very small dataset where you only have something like four examples, then you may overfit), or maybe you don't have enough data (if you have more and more data, say a million data points, then a 50-degree polynomial wouldn't be a problem).
We also discussed another reason: underfitting. Underfitting is much easier to understand in some sense; it basically just means that you don't have a small enough training loss or training error. Your model is just not powerful enough, so you cannot even fit the training data you have.

In some sense these are two complementary situations: in the underfitting case you probably want to make your model more expressive, and in the overfitting case you may want to make your model less expressive,
or less complex. We use these words, "complex" and "expressive", a lot without a formal definition. We say some models are more complex and some models are less complex; typically you can somehow feel it: a fifth-degree polynomial is probably more complex than a linear model. But if you really want a concrete definition, it becomes a little bit tricky: what is the right complexity measure of a model? Someone asked about that in the last lecture as well, and the answer is that there is no universal measure of model complexity. There are a few complexity measures people often use; they all have their particular strengths, and there is no real formal theory to say which one is better. These are complexity measures that can be theoretically justified in certain cases, but they are not universal. So what are the complexity measures? I'm just listing a few, mostly for your general knowledge.
I guess the most obvious one is how many parameters there are: if you have more parameters, your model might be more complex. This is very intuitive; however, the limitation is that maybe you have a lot of parameters but the effective complexity of the model is very, very low. Maybe all the parameters are very small; in that case you could say the complexity is actually not as big as you thought. To deal with this kind of scaling issue (why would you call a model complex if all its parameters are basically zero, even though it has a million of them?), people consider norms of the parameters.
Norms of parameters are actually typically very good complexity measures for linear models. Before deep learning arose, we used norms as complexity measures a lot, and we still use them in some cases. But these also have limitations. For example, suppose you have a low-norm solution and you add some random noise to the model. Adding the noise makes the norm bigger, but the noise doesn't really change the complexity, because when you take the matrix multiplication you average out the noise to some extent. So there are these kinds of issues as well.

Some other, more modern complexity measures people have considered are, for example, Lipschitzness: whether your model is Lipschitz, or maybe whether your model is smooth enough. Here I'm using the word "smooth" in a relatively informal way: it could mean a bound on the second derivative, or a bound on the third derivative, something like that. If your model is oscillating or fluctuating a lot, maybe that means it is quite complex.

And there are other complexity measures, for example how invariant your model is with respect to, say, certain translations, certain invariances that should hold in your dataset, for example whether your model is invariant to data augmentations.
But in general there is no very established theory on what exactly the right complexity measure is, and sometimes it also depends on the data, as you will see today. For example, take norms: what type of norm are you talking about, L1 norm or L2 norm? Sometimes the L2 norm is the right complexity measure for a certain type of data, and sometimes the L1 norm is the right complexity measure for a certain type of data. So I don't think there is anything super concrete we can say; it's not like I have a fixed suggestion for you to consider. In some sense you should just keep these measures in mind and consider them when you work on your own dataset.
Okay, so now we have discussed complexity measures. In the rest of the lecture I'm going to cover two things. One: once you have some guess about the right complexity measure you are looking for, how do you make that complexity measure small? How do you encourage the model to have small complexity? In some cases this is easy: you can just change how many neurons, or how many hidden variables, there are in a deep network, so you change the number of parameters. But if you want to control the norm, what do you do? That is called regularization, and I'm going to discuss it in the first half of the lecture.

In the second half of the lecture I'm going to talk about some more general ML advice, for example how to tune your hyperparameters. When you do regularization, or when you choose your model complexity, you have a lot of hyperparameters: you are going to choose how many parameters you have, how strong your regularization is, and so on. So how do you tune your hyperparameters, and on what dataset should you tune them? At the end I'll probably spend 30 minutes on an even more applied angle, on ML advice such as how to design an ML system from scratch; there are a lot more considerations in real applications than when you're doing research. For that part I'll use some slides to talk about general ideas on how to design an ML system in practice.

So that's the general introduction of this lecture. I'm going to start with regularization. Any questions so far?
I think we have probably mentioned this informally in previous lectures. By regularization we mostly mean that you add some additional term to your training loss to encourage low-complexity models. For example, we use J(θ) as our training loss, and then you consider the so-called regularized loss, where you add a term λ·R(θ):

J_reg(θ) = J(θ) + λ·R(θ)

Here R(θ) is often called the regularizer, and λ has different names: you could call it the regularization strength, the regularization coefficient, or the regularization parameter; let's call it the regularization strength. This λ is a scalar, and R(θ) is a function of the parameters, whose value changes as θ changes.
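As a minimal sketch of this objective (the quadratic training loss and all names here are my own illustrative choices, not from the lecture), the regularized loss just adds λ·R(θ) on top of J(θ):

```python
import numpy as np

def training_loss(theta, X, y):
    """Plain least-squares training loss J(theta)."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def l2_regularizer(theta):
    """R(theta) = 0.5 * ||theta||_2^2 (the 1/2 is just a convention)."""
    return 0.5 * np.dot(theta, theta)

def regularized_loss(theta, X, y, lam):
    """J_reg(theta) = J(theta) + lambda * R(theta)."""
    return training_loss(theta, X, y) + lam * l2_regularizer(theta)

# With lambda = 0 the two objectives coincide; any positive lambda
# penalizes parameter vectors with a large norm.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
theta = np.array([1.0, 2.0])
```

With λ = 0 you recover the unregularized training loss; increasing λ trades training fit against the norm of θ.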
The goal of this R(θ) is to give additional encouragement to find a model θ such that R(θ) is small. For example, a typical R(θ) is the L2 regularizer; this is probably the most common choice:

R(θ) = (1/2)·‖θ‖₂²

You take the squared L2 norm and multiply by one half. The half doesn't really matter; it's just a convention, because you multiply by λ in front anyway, so whether you have the half or not just changes your choice of λ.

This is called L2 regularization, and in deep learning people also call it weight decay. There is a reason people call it weight decay; I probably won't have time to discuss it today, but the lecture notes have a very short paragraph on it. If you use this regularization, the update rule looks like weight decay: there is one step in the update rule where you decay, i.e., shrink, your parameters by a scalar. But anyway, it's just a name; you can call it weight decay or L2 regularization.
So this is one of the most common regularizers. You can see that if you add this term to your loss function and you minimize, then you are trying to make both the loss small and the L2 norm of your parameters small, and λ in some sense controls the trade-off between these two terms. If you take λ to infinity, then you only focus on the regularization: you only look for a low-norm solution, but maybe you don't fit your data very well. If you take λ to be, for example, literally zero, then you are not using the regularization; you are only fitting your data.

Actually, even when you make λ very, very small, it can still do something. Say λ is 0.0001, very small; this might still do something, because there may be multiple θ such that J(θ) is really, really close to zero, or maybe even literally zero. If you literally make λ zero, then you are just picking one of the solutions where J(θ) is zero, but you don't know which one you pick. As long as you add a little bit of regularization, you are using it as a tie-breaker in some sense: you are finding solutions such that J(θ) is very small, but you use the norm R as a tie-breaker among all the solutions that have very small training loss.
So this is probably the most typical regularization people use. Another one is the following: you can take R(θ) to be the so-called zero "norm" of the parameter,

R(θ) = ‖θ‖₀ = the number of non-zero entries of θ

Actually this is not really a norm; it's just notation. It is literally defined to be the number of non-zeros: you count how many non-zero entries there are in θ, and that's what this notation means. Sometimes people call it the zero norm, but it's actually not a norm; it's just the number of non-zeros in the parameter. And sometimes people call it sparsity, because if you have very few non-zero entries the vector is sparse; otherwise it's dense.

If you add this term, you get a different effect: you are trying to say, "I'm going to find a model such that the number of non-zeros in it is small." This is particularly meaningful for linear models, in the following sense. Suppose you have a linear model θᵀx. What is this? It's really just the sum Σᵢ θᵢxᵢ, for i from 1 to d. Then you can see that if you have s non-zero entries in θ, that means you are using only s of the coordinates of x. So the number of non-zeros in θ is the number of coordinates, or the number of features, of x that you are using.

You can imagine that for some applications you have a lot of coordinates, or features, in your input: you have many different kinds of information, but you don't know which ones you should use to predict. For example, suppose you want to predict the price of a house. You have many different features, but some features may not be that useful. That could be a situation where you should use this as a regularizer, because you want to say: I want to use as few features as possible, but I also want to make sure my training loss is good. You want to find the simplest explanation of the existing data, where "simplest" means using as few features as possible.

And once you find such a θ, i.e., a sparse θ with only a few non-zeros, you are selecting the right features. In some sense, having a sparse model means you are selecting features, because the non-zeros correspond to the features selected by the model. So people often call this feature selection in certain contexts.
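The counting argument above can be made concrete with a tiny sketch (the vectors are made up for illustration): ‖θ‖₀ counts the non-zero entries, and for a linear model θᵀx only those coordinates of x ever enter the prediction.

```python
import numpy as np

def zero_norm(theta):
    """The "zero norm" ||theta||_0: the number of non-zero entries (not a true norm)."""
    return int(np.count_nonzero(theta))

# A sparse linear model: only features 0 and 3 are used.
theta = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
x = np.array([1.0, 5.0, 7.0, 3.0, 9.0])

selected = np.flatnonzero(theta)          # indices of the features the model uses
pred_full = theta @ x                     # theta^T x over all coordinates
pred_selected = theta[selected] @ x[selected]  # same value using only selected features
```

Dropping the unselected coordinates leaves the prediction unchanged, which is exactly the "feature selection" reading of a sparse θ.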
However, you may have realized that this regularizer, the sparsity, is not differentiable as a function of θ. You're just counting how many non-zeros there are. Suppose one entry is exactly zero; if you make an infinitesimal change to that coordinate, you change the value of this function by a whole unit. That's why it's not differentiable: a differentiable function should satisfy that if you change θ by an infinitesimally small amount, the function output changes by a small amount. But here, if the current sparsity is zero and you change θ a little bit, the sparsity becomes one, so you can have infinitesimally small changes that make the regularizer's value change by a large amount. That's why it's not differentiable, and because it's not differentiable you don't have gradients, you don't have derivatives, and that causes a problem in using it. So basically, even though I told you this is a regularizer, in reality nobody uses it exactly in their algorithm, because if you put this term in the objective it has no gradient; how do you optimize it?
Because it's non-differentiable, what you do instead is use a surrogate. This is the typical surrogate; the reason it is a good surrogate is a little bit tricky, but it serves as a usable, almost-everywhere differentiable surrogate for the sparsity of the model: the one norm,

R(θ) = ‖θ‖₁ = Σᵢ |θᵢ|,

the sum of the absolute values of each coordinate.

I won't attempt to give you a very formal justification for why this is a good surrogate for the zero norm. One reason could be that the one norm is at least closer to the zero norm than the two norm is (one is closer to zero than two is). Another reason, not a very solid mathematical reason but just to give you some intuition: suppose you think of θ as a vector in {0, 1}ᵈ, a binary vector. Then indeed ‖θ‖₁ equals ‖θ‖₀. That's probably another intuitive reason why they are somewhat related. But you can see a lot of problems with this argument: why am I assuming θ only takes values 0 and 1? If the entries can be 0 or 2, the two quantities are no longer equal. So I'm not saying this is really a good argument for why they are related; if you really want to show they are related, or that this is a good surrogate, you have to go through much more math.

Any questions so far?
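The binary-vector intuition can be checked numerically (a toy sketch, not from the lecture): on {0, 1} vectors ‖θ‖₁ and ‖θ‖₀ agree, but the equality already fails once an entry other than 0 or 1 appears.

```python
import numpy as np

def l0(theta):
    """Number of non-zero entries ("zero norm")."""
    return int(np.count_nonzero(theta))

def l1(theta):
    """One norm: sum of absolute values."""
    return float(np.sum(np.abs(theta)))

binary = np.array([1.0, 0.0, 1.0, 1.0])   # entries in {0, 1}: l1 == l0
general = np.array([2.0, 0.0, 0.5, 0.0])  # entries outside {0, 1}: l1 != l0
```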
[Student question, partly inaudible: how does this compare to the regularizer with the L2 norm?] The L2 regularizer, you mean? Yes, okay; that's what I'm going to discuss next.
All right, so that's a great question: why do you sometimes want to encourage sparsity (let's think of the one norm as a surrogate for sparsity), and sometimes want to encourage the two norm to be small? Why is one sometimes better than the other?

In some sense the fundamental reason is the following. Another way to think about regularization, instead of just encouraging low complexity, is that regularization can also impose structures, i.e., prior beliefs about θ. This is at least one of the other ways to think of regularization: it also imposes structure. What are the structures? Sometimes you have a prior belief. For example, suppose you have a prior belief, because of domain knowledge, that θ is sparse. In that case you probably should just use R(θ) = ‖θ‖₁ (or the zero norm): you believe your model is sparse, so why not encourage that? When you have this belief, encouraging a small one norm in some sense limits your search space. Before, you were searching over all possible parameters; now you are only searching over low-one-norm parameters. And because you believe the true model has low norm, narrowing the search space this way always helps you: you didn't lose anything, because you know every model you excluded was not going to be the right solution, and the new search space still contains the right model. So why not do it?

That's another interpretation of the regularizer: it can impose an additional prior belief about the structure of the model. If you believe in a small one norm, you should encourage a small one norm; if you believe your true model has a small L2 norm, you should encourage a small L2 norm. And if you go into more of the mathematical theory, the L2 norm typically corresponds to situations where you believe all the features are useful, but you have to use them in combination, using each of the features a little bit. The L1 norm or L0 norm typically corresponds to situations where you believe only a subset of the features is meaningful, and you should discard the other ones, because the others are just there to confuse you, in some sense. So if you believe your model is sparse, you should use the L1 norm; and if you believe your model shouldn't be sparse, then typically people use the L2 norm.
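One concrete way to see the two priors at work is the standard one-dimensional picture (my own sketch, not derived in the lecture): for min over θ of ½(θ − z)² + λ·R(θ), the L2 penalty uniformly shrinks z toward zero, while the L1 penalty soft-thresholds it, setting small inputs exactly to zero. This is the mechanism by which L1 produces sparse solutions.

```python
def prox_l2(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + 0.5*lam*theta**2: uniform shrinkage."""
    return z / (1.0 + lam)

def prox_l1(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + lam*abs(theta): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# A small input is snapped to exactly zero by L1, but only shrunk by L2.
small, big, lam = 0.3, 2.0, 0.5
```

Coordinates whose signal is below the threshold λ are zeroed out by L1 (feature discarded), while L2 keeps every coordinate, just smaller.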
And if you have a linear model, and you use this L1 regularizer on the loss, this is called the lasso. I'm just defining the name here, because I think it's probably useful for you to at least have heard of it. I don't know offhand what the acronym originally stands for, but it has been around for 20 or 30 years, and it is a very, very important algorithm: for a linear model you apply L1-norm regularization, and that's called the lasso. Everyone in machine learning should know it.
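A minimal numpy-only sketch of the lasso objective, min over θ of (1/(2n))·‖Xθ − y‖² + λ‖θ‖₁, solved by coordinate descent with the soft-thresholding update (toy data and all names are my own; in practice you would reach for a library implementation such as scikit-learn's Lasso):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, snapping to exactly zero inside [-t, t]."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||X theta - y||^2 + lam*||theta||_1."""
    n, d = X.shape
    theta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / n
            # Exact minimizer of the one-dimensional subproblem in theta_j.
            theta[j] = soft_threshold(rho, lam) / col_sq[j]
    return theta

# Toy data: y depends only on the first of three features,
# so the lasso should zero out the other two coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 0.01 * rng.normal(size=100)
theta_hat = lasso_cd(X, y, lam=0.1)
```

The recovered coefficient vector is sparse: the irrelevant features get coefficients of exactly zero, which is the feature-selection behavior discussed above.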
Taking a slightly broader perspective: if you think about nonlinear models, like deep learning models, what are the most popular regularizers these days? I think the L1 norm is not used very often; actually it's pretty much never used. I don't know exactly how frequent it is, but probably less than 10 percent of models use an L1 regularizer, maybe even less than that; 10 percent is probably a big overestimate, maybe one percent.

But L2 regularization is almost always used, even if sometimes only a very weak L2 regularization; here I'm talking about deep learning models. Sorry, maybe let me clarify: for linear models, people try almost anything; anything would be reasonable, and you probably should try all of them. You can try the one norm and the two norm, and sometimes you can try different norms which I didn't write down, say a 1.5 norm, something like that. For nonlinear models, for deep learning models, I think the L2 norm is basically something you almost always use, but only with a relatively small λ; people generally don't use a very large λ. I don't know exactly what the reason is (researchers don't really know that much either), but a small L2 regularization is typically useful for deep learning.

And in deep learning I think some other regularizations can also be useful. For example, you can try to regularize the Lipschitzness of the model, and you can use data augmentation, which we probably haven't discussed; I'm going to discuss it in a later lecture. Data augmentation tries to encourage your model to be invariant with respect to translations, cropping, that kind of thing, for images. I think those are pretty much the only regularization techniques in deep learning.
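As a sketch of the data-augmentation idea just mentioned, using horizontal flipping as the invariance (the array shapes and names are my own illustrative choices): you enlarge the training set with transformed copies that keep the same labels, which pushes the trained model toward predictions that are invariant to the transformation.

```python
import numpy as np

def augment_with_flips(images, labels):
    """Return the dataset plus horizontally flipped copies with the same labels.

    Training on the augmented set encourages the learned model to be
    (approximately) invariant to horizontal flips.
    """
    flipped = images[:, :, ::-1]  # flip each H x W image left-to-right
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))

# Toy batch: 4 "images" of shape 8 x 8 with integer labels.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8, 8))
labels = np.array([0, 1, 1, 0])
aug_images, aug_labels = augment_with_flips(images, labels)
```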
1513.679 -> um
1517.54 -> [Applause]
1518.799 -> this kind of
1521.26 -> pertains to my employer yeah what you
1524.059 -> suggest initially using the
1526.159 -> and one day to kind of eliminate the
1529.1 -> features
1530.919 -> that's a very good question so I think
1533.419 -> this kind of algorithm was pretty
1535.82 -> popular in uh um for before deep
1540.08 -> learning era so when you use linear
1541.88 -> models I think using L1 to do a
1544.279 -> selection and then you use L2 I think
1546.2 -> you know I don't know how exactly how
1547.94 -> popular they are but this is definitely
1549.26 -> one algorithm people try to could you
1551.48 -> could try to use
1553.22 -> um in deep learning I think it's
1556.46 -> probably less likely to be useful you
1558.679 -> know but also depends on the situation
1560.419 -> for example if you don't have enough
1561.44 -> data maybe you are more or less in a
1563.779 -> linear model case but you just need the
1565.88 -> nonlinearity to help you a little bit
1567.08 -> maybe then in that case you should still
1569.179 -> mostly use some kind of like more like
1570.62 -> linear model type of approach if you are
1572.72 -> in the the typical deep learning setting
1575.299 -> for example you do for a vision project
1577.4 -> right you have like images right as your
1580.34 -> inputs I think in those cases you
1582.2 -> probably don't want to select your
1583.64 -> features first I think all the inputs
1586.159 -> are useful so you want to use them as
1589.279 -> much as possible and you just want to
1590.84 -> let the neural networks figure out
1593 -> what's the best way to use those inputs
1599.96 -> any other question by the way this
1601.58 -> lecture will be pretty kind of we don't
1603.679 -> have a lot of math right most of the
1604.94 -> things are about um just uh I I don't
1607.58 -> think there's even a theory here
1608.659 -> sometimes they're just experiences
1609.98 -> because especially if you talk about
1613.58 -> um like the modern machine learning in the
1614.9 -> last five years right everything seems
1616.64 -> to change a little bit right so I like I
1619.1 -> cannot say anything with 100%
1620.6 -> guarantee I can only say okay it sounds
1622.7 -> like people are doing this a lot and
1625.34 -> that's that's the best thing I can I can
1626.84 -> tell you in some sense
1629.72 -> um so so feel free to ask any questions
1632.059 -> any other questions
1637.7 -> right so and the next thing I'm going to
1642.2 -> discuss is the so-called implicit
1643.82 -> regularization effect
1645.74 -> and um
1653.059 -> this uh
1656.48 -> this relates more to deep
1658.159 -> learning and so one reason uh that
1660.86 -> people started to think about this is
1662.48 -> that you know I haven't told you what it
1663.679 -> exactly means so one motivation for
1666.02 -> people to start this line of research
1667.34 -> is that people realize that in deep
1669.08 -> learning you don't use a lot of
1670.039 -> regularization techniques right so as I
1671.72 -> said you only use a weak L2
1673.88 -> regularization and often some of
1676.279 -> these other ones but they only help a
1677.96 -> little bit right they can be useful but
1679.64 -> people don't necessarily use them very
1681.38 -> often so why in deep learning you don't
1684.14 -> have to use strong regularization at
1686.419 -> least like you can feel that the
1688.039 -> regularization methods stop mattering
1690.02 -> that much it still matters when you
1692.36 -> really care about the final performance
1693.62 -> you care about 95% versus 97% but you
1696.86 -> don't have to even if you don't use
1697.88 -> regularization sometimes you get
1699.08 -> reasonably good performance so so that's
1701.6 -> why people are especially with
1702.98 -> theoretical researchers people are
1704.779 -> wondering why you don't need to use a
1707.419 -> strong regularizations in deep learning
1709.52 -> and this is particularly mysterious
1711.08 -> because in deep learning people are
1713 -> using over parametrization like we are
1715.52 -> we are in this regime where you have
1717.14 -> more parameters than the number of
1718.94 -> samples recall that in the last lecture
1721.7 -> we have drawn this uh double descent
1723.98 -> thing right where you have this kind of
1725.299 -> things right here is the number of
1726.5 -> parameters
1730.159 -> and this is the test error
1733.58 -> and you know we have kind of discussed
1735.32 -> that this peak might be just something
1736.82 -> about the sub-optimality of the
1738.38 -> algorithm which let's say you don't care
1740.299 -> for the moment but at least you have to
1742.159 -> care about why here
1744.32 -> um why it's go going down you know still
1746.299 -> here right so why when you have so many
1747.919 -> parameters a lot more parameters you can
1750.5 -> still make your model generalize and
1752.299 -> it seems that more and more parameters
1754.279 -> make it look better
1756.32 -> so this overparameterized
1759.02 -> regime is kind of mysterious because you
1761.059 -> don't use strong regularization but you
1762.679 -> can still generalize
1764.48 -> so that was the kind of the motivation
1766.52 -> for people to study this and people
1768.44 -> realized that there's you know even
1770.659 -> though in this regime suppose you don't
1772.1 -> use any explicit regularizer you
1773.72 -> make the Lambda literally zero
1775.7 -> right in this regime still it can
1777.86 -> generalize and the reason it can
1779.059 -> generalize in many cases is because
1781.88 -> um you can still have some implicit
1783.38 -> regularization effect even without an
1785.36 -> explicit regularizer and where does that
1787.58 -> effect come from what kind of thing can
1788.96 -> make that happen the reason is that
1792.2 -> um the optimization process the
1794.24 -> optimization algorithm
1795.98 -> the optimizer can implicitly
1800.36 -> regularize you
1806.539 -> so so why this can happen I think the
1808.88 -> reason is that let me draw kind of like
1810.919 -> an illustrative figure which I
1813.62 -> use pretty often so suppose this
1815.899 -> is the
1817.34 -> let's say this is the parameter and
1819.74 -> suppose this is the loss
1821 -> landscape the loss surface
1824.48 -> so meaning that here is Theta let's say
1826.399 -> is one dimensional
1827.84 -> and because we are in this deep learning
1829.94 -> setting where we have like a non-linear
1831.86 -> models and non-convex loss function so
1834.2 -> maybe a loss function it looks like this
1835.94 -> maybe
1838.159 -> so this is the loss function
1843.32 -> and you have two maybe you have multiple
1845.6 -> Global Minima of your loss function
1847.7 -> right so so this is a global minimum
1849.679 -> this is a global minimum
1851.96 -> and but this but you have multiple
1854.36 -> Global minimum in your loss function
1855.74 -> however
1856.7 -> here I'm talking about the training loss
1858.74 -> right if you really look at the test
1860.179 -> loss you're going to look they will look
1861.86 -> a little bit different the test loss
1863.299 -> would be different from the training
1864.44 -> loss so test loss maybe look like
1866.299 -> something like this
1869.84 -> maybe okay let me draw something I'll
1871.88 -> come to my figure so that
1874.34 -> um
1890.779 -> so this is the training loss now test
1893.24 -> loss probably look like this
1900.74 -> so that means that even though both of
1903.559 -> these two Global Minima are
1906.679 -> um
1907.46 -> are good solutions from the training
1909.26 -> loss perspective one of them is better
1911.539 -> from the test performance
1913.58 -> perspective Right This Global minimum is
1915.44 -> good and Better Than This Global minimum
1917.299 -> because the test performance is better
1919.899 -> so and in some sense like the
1922.88 -> regularization effect is trying to
1924.32 -> choose the the right Global minimum
1927.02 -> right you want the regularization in
1928.22 -> fact to choose the right Global minimum
1929.659 -> so that you can do some tie-breaking
1931.7 -> or you can encourage certain kinds of
1933.02 -> models maybe this model is kind of
1935.179 -> simpler and this model has more
1936.559 -> norm than this model so that's why you
1938.419 -> prefer this one so if you use explicit
1940.399 -> regularization what you do is that you
1942.38 -> you're going to say I'm going to change
1943.82 -> the training loss I'm going to add
1945.08 -> something to prefer this one over this
1947.6 -> one I'm going to reshape the training loss
1949.46 -> right that's what the explicit
1950.779 -> regularization would do but the implicit
1953 -> regularization we'll do is the following
1954.559 -> so if you consider an algorithm that
1957.44 -> optimizes for example suppose you
1959.059 -> run an algorithm which
1960.62 -> is always initialized here this is the
1962.059 -> initialization
1965.299 -> and you do gradient descent so you're
1967.52 -> gonna
1968.24 -> do something like this
1970.52 -> and it converges to this one
1972.44 -> so this algorithm will only converge
1975.38 -> to this one but not this one
1977.12 -> just because you initialize at this spot
1979.34 -> right
1980.48 -> right so that is kind of in some sense a
1982.64 -> preference to converge to this Global
1984.14 -> minimum
1985.76 -> over this Global minimum because your
1987.98 -> algorithm somehow prefers one Global
1990.26 -> minimum over the other just because your
1992.12 -> algorithm has certain specifics
1993.799 -> right so the initialization makes it
1996.14 -> prefer to converge to this one
1998.299 -> and and there could be other kind of
1999.74 -> effects for example if you
2001.779 -> um use you know bigger step sizes maybe
2004 -> you are more likely to converge to this
2005.44 -> one maybe or maybe vice versa you know
2007.48 -> depending on on kind of like the
2009.34 -> situations right so this is a very
2010.659 -> illustrative thing with one dimension
2012.58 -> then you don't really have a lot of like
2014.94 -> flexibility here but if you have a very
2017.38 -> very complex thing then if you run
2019.059 -> different algorithms different algorithm
2020.62 -> will converge to different Global
2022.48 -> minimum and that preference to certain
2025.059 -> type of global minimum is in some sense
2027.159 -> is a regularization effect
2029.74 -> um so so that you don't converge to
2031.12 -> arbitrary Global minimum
2034.96 -> um does it make some sense
2041.08 -> you said so how does like having a large
2043.179 -> number of parameters ensure that it
2045.399 -> initializes at that point
2048.04 -> yeah I I was I was silent on that in
2050.8 -> some sense like I didn't really say why
2052.599 -> the initialization has to be here like
2055.3 -> this is an active area of research so
2057.7 -> what we are sure about is that the
2059.44 -> algorithm could have this effect the
2062.08 -> algorithm could possibly
2064.659 -> um prefer certain kinds of global
2066.04 -> minimum than the others but why it would
2068.2 -> prefer which kind of global minimum we
2070.78 -> don't exactly know for certain kind of
2072.22 -> like toy cases we know but for uh for
2075.399 -> the general cases we don't I'm going to
2076.96 -> show you one cases where we actually can
2079.179 -> say what does the algorithm prefer to do
2081.94 -> but that's very very simple case for
2084.399 -> General case I think the the research
2086.08 -> this is still very open research on
2088.3 -> question
2089.44 -> I saw two other questions here
2091.679 -> it seems like the optimizer is
2094.24 -> keeping a number of parameters it doesn't
2096.28 -> quite necessarily need
2098.14 -> no no here what do you
2101.2 -> mean by the optimizer
2103.3 -> the axis is the value of the parameter
2105.94 -> it's just we only have one parameter
2107.98 -> I'm just drawing the landscape
2109.599 -> of the parameter and I can only draw
2111.64 -> something in one dimension so this
2114.4 -> is the value of the parameter you are
2115.66 -> just tuning this parameter you are
2116.619 -> doing gradient descent and this is the loss
2118.359 -> surface
2119.38 -> so so it does depend on where you
2121.72 -> initialize right so if you initialize at
2124.06 -> different places you're going to
2124.839 -> converge different Global minimum and
2126.4 -> and they may have different
2127.359 -> generalization effects
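As a toy illustration of that point (my own example, not from the lecture — the loss function is made up): a one-dimensional training loss with two global minima, where plain gradient descent ends up at whichever minimum is on the same side as its initialization.

```python
# A made-up 1-D loss with two global minima, at theta = -1 and theta = +2.
# Which one gradient descent reaches depends only on the initialization.
def loss(theta):
    return (theta + 1.0) ** 2 * (theta - 2.0) ** 2

def grad(theta):
    # d/dtheta of the loss above, by the product rule.
    return (2 * (theta + 1.0) * (theta - 2.0) ** 2
            + 2 * (theta + 1.0) ** 2 * (theta - 2.0))

def gradient_descent(theta0, lr=0.01, steps=1000):
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

left = gradient_descent(-2.0)    # initialized left of the barrier: ends near -1
right = gradient_descent(3.0)    # initialized right of the barrier: ends near +2
print(left, right)
```

Both endpoints have zero training loss; if the two minima had different test loss, the initialization alone would have decided how well the model generalizes.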
2130.66 -> different algorithms and then just
2132.579 -> choose the one that has the best
2133.599 -> performance that's pretty much the right
2135.52 -> to do um of course there are some I'm
2137.5 -> going to discuss this you know but uh
2139.54 -> more detail later but but you know
2141.7 -> basically you know
2144.339 -> like you can have some intuition where
2146.44 -> you have you know the theoreticians have
2148.66 -> tried to understand you know what kind
2151 -> of like algorithms can help
2152.56 -> generalization but I think the
2154.359 -> conclusion at least so far is
2156.94 -> that it's very far from conclusive they
2159.16 -> can give you some intuition but they are
2160.66 -> not going to be like predictive you know
2163.3 -> they don't just tell you what to do so
2165.64 -> you still have to try a lot yeah so yeah
2169.18 -> going back to this you know this is just
2170.5 -> one dimension you know another way to
2171.94 -> think about is that if you can think of
2173.2 -> like a two-dimensional question for
2174.64 -> example you are skiing in the in the in
2176.56 -> a ski resort right so your objective is
2179.14 -> basically
2180.3 -> minimizing your like uh you're trying to
2183.099 -> go downhill right that's your objective
2184.9 -> so and this ski resort probably has
2187.599 -> a lot of villages right like where you
2189.52 -> can eventually go home there are
2191.02 -> multiple parking lots right so
2193.119 -> in some sense you are saying
2194.68 -> that you know one of these parking lots
2195.94 -> is great right so one of these
2197.32 -> parking lots is really where your car is so
2200.079 -> you want to go to that one
2201.76 -> um so so diff but but uh but different
2204.339 -> algorithm would lead you to do diff to
2206.56 -> convert to different uh different
2208.06 -> parking lots right so for example
2209.32 -> someone is skiing very fast then
2212.32 -> when you do that you cannot go to
2214 -> those kind of small trails so then it
2215.68 -> leads you to go to one of the parking lots
2217.359 -> and some other one prefers like um a
2221.02 -> wider kind of like trails and then you
2222.88 -> go to the other parking lot so so
2225.64 -> a different algorithm will lead
2227.26 -> you to a different parking lot and
2228.52 -> different parking lots have different
2229.98 -> generalization performance eventually
2233.74 -> so
2235.18 -> um so this is the high level um
2236.619 -> intuition so I'm going to
2240.7 -> um let's see
2242.5 -> I'm going to discuss a concrete case
2244.18 -> which
2245.14 -> um
2245.76 -> which will also be part of a homework
2248.079 -> question so this concrete case uh just
2250.42 -> to give you a concrete sense of how this
2253.359 -> could even be possible so I'm going to
2256.06 -> show you the high level thing and there
2257.859 -> are some mathematical part which will be
2259.42 -> in the homework so this is the linear
2262.06 -> this is a in a linear model
2265.54 -> so interestingly even though this
2267.28 -> implicit regularization effect
2269.02 -> was mostly discovered after deep
2271.42 -> learning uh started to become kind of
2273.52 -> powerful but actually you can still see
2275.74 -> it in linear models and that's how
2277.66 -> researchers start to do research so
2280.54 -> um so so let's say suppose we are just
2282.52 -> in the most vanilla linear model setting
2284.68 -> where you have some n data points
2291.4 -> this is just the the trivial linear
2293.619 -> regression and your loss function is
2296.32 -> something like just the L2 loss the
2298.42 -> mean squared error
2305.859 -> something like this you have a linear
2307.599 -> model but let's say let's make the the
2310.839 -> the one different thing is that we
2313.18 -> assume n is much smaller than d
2316.06 -> so you have very few examples and a very
2318.579 -> high dimension so what is d d is the
2320.859 -> dimension of the data
2323.38 -> and the N is the number of examples I'm
2325.72 -> going to assume n is much smaller
2327.04 -> than d
2328.359 -> so this is over parameterized you have
2330.46 -> multiple Global minimum while you have
2332.619 -> much so first of all you have multiple
2333.94 -> Global minimum
2339.82 -> why because I'm claiming that
2342.82 -> there are many Theta such that
2345.7 -> each Theta satisfies
2350.2 -> y i is equal to Theta transpose x i for
2354.16 -> all i
2357.28 -> why because you know how many equations
2359.26 -> here you have right so so if you want to
2361.78 -> so this is the equation to make training
2363.7 -> loss zero which is means Global minimum
2365.56 -> right so if you have all of this
2367.359 -> equality then it means you are at a
2369.28 -> Global minimum of this training loss and
2371.98 -> why there are multiple status such that
2374.44 -> you can satisfy this that's because you
2377.32 -> can count how many equations they are
2378.76 -> right so there are n equations
2384.7 -> right and D variables
2389.92 -> and these are linear equations right so
2392.079 -> so I guess the linear algebra tells us
2394.24 -> that if you have n equations d
2395.74 -> variables and if n is less than d
2397.9 -> I think
2400.119 -> then you're
2402.7 -> gonna have at least one solution and if
2404.44 -> n is much much smaller than d then you
2406.42 -> have a Subspace of solutions and that's
2408.52 -> called the kernel the null space
2412.24 -> of the matrix anyway you have a Subspace of
2414.579 -> solutions
2416.079 -> um uh for for this kind of linear system
2418.359 -> equations
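A small numpy sketch of this counting argument (my own illustration with made-up sizes): with n = 3 equations and d = 10 unknowns, the zero-loss solutions form a (d − n)-dimensional affine subspace, so shifting any one solution along a null-space direction of X gives another global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 10                        # n equations, d variables, n << d
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# One particular solution with X @ theta = y (lstsq returns the min-norm one).
theta_p = np.linalg.lstsq(X, y, rcond=None)[0]

# Directions that X maps to zero: the last d - n right singular vectors.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[n]                    # one null-space direction of X

# Shifting along a null direction gives a *different* zero-loss solution.
theta_other = theta_p + 5.0 * null_dir
print(np.linalg.norm(X @ theta_p - y), np.linalg.norm(X @ theta_other - y))
```

Both residuals are numerically zero, even though the two parameter vectors are far apart: an entire subspace of global minima.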
2420.04 -> right so
2421.78 -> um and that's why you're gonna have
2423.82 -> multiple Global minima of the training
2425.44 -> loss because the entire Subspace of
2427.119 -> solutions are Global minima of the
2429.46 -> training loss so the question is which one
2432.099 -> you're going to converge to right so
2433.3 -> which one your Optimizer will will
2434.859 -> choose
2436.359 -> so it turns out that you know if you use
2439.359 -> gradient descent with zero initialization
2442.42 -> then you are going to choose the one
2444.28 -> with the minimum L2 norm so here is
2446.98 -> the claim
2448.48 -> so the claim is that
2450.579 -> if you do gradient descent
2453.88 -> with initialization
2455.98 -> Theta is zero
2459.16 -> uh this will converge to
2465.3 -> uh the minimum Norm solution
2474.46 -> so what does the minimum norm solution
2476.68 -> mean formally it means that you converge to
2479.68 -> a solution with the smallest L2 Norm
2483.94 -> among
2485.32 -> those Solutions such that
2487.66 -> those global minimum of the loss
2489.82 -> function
2491.079 -> so
2492.7 -> so with gradient descent you are not only
2494.44 -> just finding a Theta such that the loss
2496.06 -> function is zero right so typically when
2498.339 -> you think about optimization the
2499.48 -> optimization is trying to find a
2501.28 -> solution such that the loss function is
2503.14 -> minimized right that's true you still
2504.94 -> find you definitely find a solution such
2506.5 -> that the loss function is minimized but
2509.14 -> you actually have a tie-
2511.3 -> breaking effect among the solutions such
2513.579 -> that the loss function is
2515.56 -> minimized you actually choose the one
2517.839 -> with the smallest L2 norm
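Here is a small numpy check of that claim (a sketch I wrote, with made-up sizes): gradient descent on the squared loss, initialized at exactly zero, lands on the same point as the pseudo-inverse (minimum-norm) solution, and its iterate stays in the span of the rows of X throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 30                              # overparameterized: d >> n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on (1/2)||X @ theta - y||^2, initialized at zero.
theta = np.zeros(d)
lr = 0.01
for _ in range(5000):
    theta -= lr * (X.T @ (X @ theta - y))

theta_min_norm = np.linalg.pinv(X) @ y    # closed-form minimum-norm solution

print(np.linalg.norm(X @ theta - y))      # training loss is ~0
print(np.linalg.norm(theta - theta_min_norm))  # and GD picked the min-norm point

# Sanity check: every update adds a vector of the form X.T @ (...), so theta
# lies in the row span of X; projecting onto that span leaves it unchanged.
proj = np.linalg.pinv(X) @ X              # orthogonal projector onto row span
print(np.linalg.norm(proj @ theta - theta))
```

So among the whole subspace of zero-loss solutions, this particular optimizer silently picks out the one closest to the origin, which is the implicit regularization effect being described.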
2530.02 -> so I guess you know in some sense
2531.76 -> the kind of intuition is the following
2533.079 -> so I'm going to try to draw this this is
2536.44 -> a little bit
2538.359 -> um I need to try to draw this well
2541.9 -> um
2546.76 -> so suppose you have a
2549.28 -> suppose let's say the intuition is that
2551.32 -> suppose let's say you have n is one
2555.579 -> um and
2558.64 -> say d is three
2561.7 -> so you just have one equation one linear
2563.74 -> equation
2564.88 -> and
2567.22 -> um so and you have like three variables
2569.68 -> so that means that the family of
2572.14 -> solutions is a two dimensional Subspace
2574.78 -> so
2576.339 -> make sure to draw this
2581.44 -> okay
2609.099 -> okay so so here the Subspace I'm drawing
2612.4 -> here is
2613.839 -> this is the family of theta such that
2616.96 -> you satisfy that the loss is zero right
2620.319 -> this is the Subspace
2621.76 -> right so you have a substrate Solutions
2623.5 -> and but which solution you converge to
2625.72 -> that's the question
2627.04 -> it turns out that
2628.9 -> if you
2631.599 -> if you start with let me see what maybe
2634.24 -> I will write here
2636.099 -> um
2636.819 -> it turns out that you're going to find a
2638.2 -> solution such that
2640.3 -> this is the solution that you'll find
2642.22 -> this is the solution
2653.02 -> drawing this is a little bit challenging
2654.64 -> I guess
2656.14 -> how did I do this I think I did this
2664.96 -> so so you consider that you project zero
2667.18 -> to this Subspace right so that you find
2669.16 -> this point this point is the the
2671.68 -> solution with the minimum norm that is
2673.54 -> closest to zero among on the Subspace
2676.06 -> and this is the solution that you you
2678.339 -> will find you're not going to find other
2680.26 -> Solutions with gradient descent
2683.44 -> um with initialization zero
2685.599 -> so basically that's the claim the claim
2687.22 -> is that you're going to find this
2688 -> particular solution but not the other
2689.2 -> Solutions
2690.76 -> and the reason the fundamental
2693.099 -> reason is pretty simple like
2694.119 -> especially if I draw it in this way of
2696.4 -> course if you want to prove it it's a
2697.66 -> little bit more complicated
2700.78 -> um so the reason is really just that
2703.24 -> you start with zero
2705.579 -> this is uh where you start with right we
2707.38 -> need that and you have a property such
2709.96 -> that
2711.04 -> when you how do I
2715.54 -> um
2717.04 -> you need to see this
2719.2 -> so you have a property such that if you
2721.96 -> start with
2723.94 -> if the initialization is zero right and then at any
2728.02 -> time
2731.859 -> so uh your Theta is always in the span
2736.24 -> of all the data points here we
2739.3 -> actually have only one data point
2741.64 -> so
2743.079 -> um so so basically your Theta cannot
2745.06 -> move arbitrarily in any places you only
2748.18 -> have you have a restriction on where the
2750.339 -> setup can go so actually for this
2752.44 -> particular case what happens is really
2754 -> just that you are just moving it along
2755.44 -> this direction
2758.74 -> and here you'll find this point on
2760.839 -> the Subspace and that's what
2762.52 -> gradient descent is doing so gradient
2764.02 -> descent will not do something like this
2765.52 -> will not converge to here it will not
2767.38 -> converge to here it will just go
2768.7 -> directly go to this this mean this
2771.04 -> closest point where the point that is
2773.26 -> closest to zero on the Subspace
2776.68 -> so so this is this is probably a
2779.26 -> property of the optimizers right you can
2780.94 -> imagine you may have some other
2782.74 -> Optimizer suppose you design some crazy
2784.24 -> Optimizer which does this or does this
2786.4 -> then you will converge to a different
2787.599 -> point but if you use gradient descent
2789.339 -> you're going to do this
2799.68 -> you show that gradient descent is doing this
2801.94 -> just by saying that the iterate
2804.4 -> is always in the span of the data
2806.38 -> I think this is uh this is actually
2808.3 -> something we have proved
2810.579 -> in the
2811.9 -> kernel lecture I'm not sure for a
2815.14 -> different purpose you know it's not for
2816.339 -> this purpose remember that in a kernel
2818.319 -> lecture we try to show that your
2820.78 -> parameter is always in a linear
2822.64 -> combination of the data and and then
2824.859 -> there the purpose was that you want to
2826.3 -> represent it by the betas in that
2828.7 -> lecture so it's a different reason but
2831.28 -> it's different on goal but it's the same
2833.92 -> fact right your Theta is always in the
2835.72 -> span of the data
2850.98 -> is defined to be the all the solutions
2853.48 -> that have zero loss so this whole
2856.42 -> span that's my definition of the
2858.819 -> Subspace this is the family of solutions
2860.859 -> that have zero training loss
2864.4 -> so and the question is which one I'm
2866.2 -> gonna converge to like I was
2868.18 -> arguing that you know there are multiple
2870.64 -> Global minima right so this whole span
2872.8 -> is all Global minima all of them are
2874.359 -> Global minima and which one you converge
2875.92 -> to so different algorithms probably
2877.96 -> would converge to different points
2880.24 -> so if you run gradient descent you're
2881.92 -> going to converge to one particular one
2883.54 -> in this span
2890.5 -> right but but this
2892.119 -> um um this this phenomenon also shows up
2894.099 -> in other cases but it's going to be you
2895.78 -> know much more kind of complicated like
2897.7 -> like I think they are only a very
2899.5 -> limited number of situations where we
2901.3 -> can theoretical proof where you converge
2903.579 -> to
2904.599 -> um but but it's almost always the case
2906.28 -> that the optimizer has some preferences
2908.14 -> the optimizer will not converge to
2910.119 -> arbitrary zero training loss solution it
2913.18 -> will converge to one particular zero
2915.16 -> training loss solution and sometimes that
2916.9 -> solution just generalizes much better
2918.52 -> than the other ones
2925.2 -> so um
2928.68 -> so is it only for linear models
2932.68 -> right so only for linear
2935.68 -> models the family of zero loss
2939.22 -> solutions is a subspace right so if you have
2941.56 -> non-linear models then the family of
2943.72 -> solutions
2945.099 -> satisfying this wouldn't be a subspace maybe
2947.2 -> it's a manifold something some other
2948.7 -> weird structure uh right so so in that
2951.76 -> sense this is very special
2958.599 -> the solution that's gonna be in that span
2962.579 -> it's going to be the constrained
2964.54 -> optimization problem that we just saw
2966.46 -> like it's going to constrain
2969.22 -> itself to this while minimizing the norm
2973.02 -> right right so so I didn't show you the
2975.52 -> full proof so this point turns out to be
2977.319 -> the point that you converge to turns out
2978.76 -> to be the minimum norm solution
2981.52 -> and it turns out that you actually just
2983.44 -> going straight at least for this at
2985.18 -> least one case so you know not it's
2987.4 -> actually it's not even always true that
2988.72 -> you are going in a straight line like
2991 -> some but uh but you always go in this
2993.339 -> Subspace
2994.72 -> um so
2996.22 -> I'm answering the question maybe I
2997.78 -> didn't um
2999.599 -> can you prove that yeah you you can
3002.64 -> prove that prove it yeah I think the
3003.96 -> homework question I actually asked that
3005.819 -> you're gonna show this point is
3007.68 -> exactly the minimum norm solution and
3009.06 -> also that you're going to converge to it
3010.859 -> oh okay
3012.3 -> actually you can have a actually you can
3014.52 -> have a pretty concrete
3016.98 -> representation of this point right it's
3018.66 -> really just some inverse some of the
3020.819 -> Matrix times something you can you can
3022.5 -> compute what exactly is and and you can
3024.78 -> show you converge to the point
3026.76 -> um I'm not sure whether the homework
3028.619 -> asks you to so I think the homework has
3030.72 -> to show both
3041.28 -> but we will have a lot of hints you know
3043.079 -> along the way it's not going to be
3044.099 -> just show this that's it
3051.9 -> and maybe for example another just to
3054.359 -> give you a sense on you know what these
3055.859 -> kind of things can change right suppose
3056.94 -> you initialize here then you wouldn't
3059.28 -> converge to here so you probably would
3061.44 -> converge to somewhere here
3063.48 -> and you know so uh if you use
3067.44 -> stochastic gradient descent you probably
3069.18 -> wouldn't converge exactly here either
3070.619 -> you'll probably converge to
3072.059 -> somewhere different so where do
3075.3 -> you exactly converge to it's a
3077.16 -> very hard question we don't really know
3078.839 -> uh the only
3081.78 -> thing we know right now I think formally
3083.88 -> is that this matters if you use
3085.26 -> different algorithms you converge to
3086.579 -> different solutions and different
3088.44 -> solutions generalize differently so you
3090.359 -> have to consider the effect of the
3092.16 -> optimizers
3095.22 -> and going back to this the reason here
3098.22 -> is really so I guess like a this in some
3102.24 -> sense this kind of is trying to explain
3103.5 -> why you can generalize here
3105.72 -> that's because of this implicit
3108.059 -> regularization effect even though you
3110.4 -> don't have regularizers you still
3112.14 -> implicitly regularize the L2 norm so and
3115.02 -> that's why in this regime even though
3117 -> you have a lot of parameters but
3118.2 -> actually you are still implicitly
3119.4 -> regularizing the norm and if you look at
3121.619 -> the norm the norm would look like this
3125.579 -> so this is the norm as you change the
3128.16 -> parameter
3129.24 -> so so basically this is saying that when
3131.46 -> you have a lot of parameters actually
3132.78 -> your real Norm is actually relatively
3134.819 -> small and that's why you can generalize
3137.88 -> so um so so the the the
3141.24 -> reason why you don't generalize in the
3143.28 -> middle is because this minimum norm
3145.5 -> solution is not actually doing well in
3147.78 -> the middle for some other reason so
3150.119 -> um so the norm actually turns out to be
3151.74 -> big there but the norm is very small
3153.24 -> in the overparameterized regime even though
3154.98 -> you use a lot of parameters
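To see that norm curve numerically (my own sketch with random made-up data): for a fixed n, compute the minimum-norm interpolating solution as more feature columns are added. Its norm is largest around d ≈ n and shrinks as the model becomes more overparameterized; since the feature sets are nested, the minimum norm can only go down as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_max = 20, 200
X_full = rng.normal(size=(n, d_max))
y = rng.normal(size=n)

# Min-norm zero-training-loss solution using only the first d features.
norms = []
for d in [20, 50, 100, 200]:
    theta = np.linalg.pinv(X_full[:, :d]) @ y
    norms.append(float(np.linalg.norm(theta)))
print(norms)   # non-increasing: more parameters, smaller norm
```

The d = 20 case is the interpolation threshold (d = n), where the solution norm tends to blow up, matching the peak in the double-descent picture.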
3165.66 -> thank you
3167.339 -> Okay so
3169.44 -> so now let's talk about you know
3172.44 -> um how do you really do um how do you
3175.079 -> really find out what's that you know
3176.16 -> I've told you that we don't know too
3177.48 -> much about you know what how does the
3179.099 -> optimizer uh change things right so we
3181.26 -> also don't know exactly how does the
3183.24 -> model complexity uh uh change things
3185.819 -> right so you only know some intuitions
3187.14 -> right so you know that if you have more
3189.18 -> complexity it turns out to be more
3190.92 -> likely to
3192.599 -> overfit but you don't know exactly what
3194.52 -> is the right complexity right so how do
3195.96 -> you find out the right model the right
3198.26 -> optimization algorithm the right you
3200.64 -> know regularizer all of this where you
3201.96 -> have so many decisions where you
3203.099 -> probably have like 10 decisions you have
3204.359 -> to make in this machine learning
3206.059 -> algorithm so how do you find out what's
3209.04 -> the what's the best thing so so I think
3211.26 -> the typical way is just that you
3213.66 -> use a validation set to
3217.98 -> um to figure out what's the best
3220.44 -> decision
3221.76 -> so
3222.96 -> maybe just to motivate that just briefly
3225.059 -> so the the easiest way to do is that you
3227.88 -> just use a test set right so you have
3230.28 -> some test set and you just you try all
3232.619 -> kind of algorithms all kind of models
3234.24 -> all kind of like a regularization
3236.339 -> strength and you see which one has the
3238.319 -> best performance on the test set
3240.9 -> so that's okay as long as you only use
3243.119 -> one you only use the test set at the end
3245.94 -> right so to try all of this algorithm in
3248.76 -> advance you know you you and then you
3250.44 -> collect some test the site or maybe you
3252.599 -> collect the test set before but you
3254.4 -> never touch it right so that's okay so
3256.38 -> you you so you if you only use the test
3258.839 -> test once then you can use the tester to
3261.72 -> evaluate the performance of all possible
3263.819 -> algorithms all possible kind of like
3265.8 -> models uh you you want to use
3268.859 -> So that works. However, the
3272.64 -> problem is that sometimes
3277.2 -> you want to do this iteratively: you
3279.359 -> look at the test set, see what
3280.619 -> the performance is, and then you go back
3281.94 -> and say, okay, maybe I'll change my model
3284.04 -> size, or maybe I'll change my
3288.3 -> optimizer — maybe I'll
3290.099 -> switch from gradient descent to stochastic
3292.02 -> gradient descent — or maybe I want to add a
3293.76 -> regularization term
3295.5 -> to the loss function.
3300.42 -> If you want to do this iteratively,
3302.16 -> then what I said before is not
3304.619 -> going to work. That's because,
3306.9 -> typically, a test set
3309.96 -> can only be used once:
3312.839 -> if you use it multiple times,
3315.24 -> what happens is that you
3318.3 -> can overfit to the test set —
3321.72 -> your later decisions
3323.64 -> become decisions that are
3327.48 -> overfitting to the test set
3329.16 -> you have seen before. The
3331.859 -> validity of the test set is only
3333.54 -> ensured when you see it only after
3337.26 -> you do the tuning. If you see
3340.14 -> the test set, then you
3342.839 -> do more training, and then you test
3345.42 -> again, the second evaluation on the
3347.22 -> test set is not guaranteed to
3349.619 -> be valid — you may have overfit to
3352.619 -> the test set. Does that make sense?
3355.98 -> I'm trying not to over-complicate
3358.559 -> this — that's why I'm
3360.66 -> using informal words for it — but
3363.839 -> are there any questions? So how do
3366.78 -> we deal with this? The test
3368.16 -> set we can only use once; at least,
3369.78 -> we cannot use it
3371.4 -> interactively — we cannot
3373.2 -> look at the test set, tune, and then
3375.18 -> look at the test set again. One way
3378.119 -> to deal with this is that you
3384.3 -> have a hold-out, or validation, set.
3386.28 -> Basically, you split
3389.099 -> the data
3393 -> into three parts: one part is the training set,
3398.339 -> one part is the validation set,
3403.559 -> and one part is the test set.
3406.92 -> With the test set you have to be
3412.44 -> very careful: you
3414.359 -> shouldn't touch it. The test set
3416.52 -> is only for the very end, when you
3418.859 -> use it to evaluate your final
3420.119 -> performance.
3421.619 -> The validation set, on the other hand, you
3424.92 -> use to tune
3427.079 -> hyperparameters.
3429.72 -> By hyperparameters I mean all
3432.48 -> of the choices
3435.119 -> you are making: for example, the
3437.22 -> batch size, the
3439.619 -> lambda in the regularization, maybe
3442.44 -> the choice of optimizer, the
3444.78 -> number of neurons in your deep
3447.359 -> learning model,
3448.98 -> how long you're going to
3450.42 -> train — all of these decisions
3453.24 -> that you make in
3455.04 -> this process are called
3456.72 -> hyperparameters. You use the
3459.72 -> validation set to tune the hyperparameters,
3461.4 -> and you use the training set to
3462.54 -> tune the real parameters —
3465.72 -> that is, to optimize the parameters.
3468.48 -> Typically,
3470.94 -> the parameters
3473.94 -> are just numerical values in the model
3475.92 -> whose individual
3478.319 -> meanings you don't know, but the
3479.88 -> hyperparameters are the things
3481.38 -> whose meanings you do know —
3483.599 -> batch size, learning rate, step size —
3485.4 -> they all mean something, and
3487.859 -> you use the validation set to
3489.48 -> tune the hyperparameters.
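The three-way split described here can be sketched in a few lines of Python (a minimal sketch: the function name and the 80/10/10 fractions are illustrative choices, not something prescribed in the lecture — with very large datasets the validation and test fractions are often much smaller):

```python
import random

def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Randomly partition a dataset into train / validation / test sets."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = [examples[i] for i in idx[:n_test]]
    val = [examples[i] for i in idx[n_test:n_test + n_val]]
    train = [examples[i] for i in idx[n_test + n_val:]]
    return train, val, test

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the random seed makes the split reproducible, so the same examples stay held out across experiments.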
3491.64 -> So the basic process is
3493.92 -> that you start
3495.54 -> training with some hyperparameters, then you
3498.54 -> validate the performance, and then
3501.72 -> you go back and train again, maybe with
3503.7 -> some other hyperparameters, and you
3505.26 -> repeat this iteration many times.
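As a toy illustration of this loop (a sketch: the one-parameter ridge-style model, the synthetic data, and the candidate lambda values are all made up for the example, not from the lecture), the hyperparameter is chosen on the validation set alone, and the test set is consulted exactly once at the end:

```python
import random

# Toy model: a single constant c, fit with an L2 penalty lam * c^2.
# The closed-form minimizer of sum (y - c)^2 + lam * c^2 is below.
def fit(train, lam):
    return sum(train) / (len(train) + lam)

def mse(data, c):
    return sum((y - c) ** 2 for y in data) / len(data)

rng = random.Random(0)
data = [5.0 + rng.gauss(0, 1) for _ in range(300)]
train, val, test = data[:200], data[200:250], data[250:]

# Iterate: train with each hyperparameter, validate, keep the best.
best_lam = min([0.0, 0.1, 1.0, 10.0, 100.0],
               key=lambda lam: mse(val, fit(train, lam)))

# Touch the test set exactly once, after all tuning is finished.
print(best_lam, round(mse(test, fit(train, best_lam)), 3))
```

The point is structural: the test set never appears inside the `min(...)` selection, only in the final line.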
3507.48 -> After you are done
3509.819 -> with everything and you have found a model
3511.38 -> that you are happy with — by
3513.48 -> "happy with" I mean
3515.7 -> you have found a model that is very good on
3517.38 -> the validation set — then you finally test
3520.44 -> your model on the test set, and that can
3522.96 -> be done only once. In some sense — I'm
3525.54 -> not sure how many of you have taken
3526.799 -> part in Kaggle competitions —
3528.54 -> they are structured exactly like
3529.92 -> this. There is an online platform
3532.26 -> where people release datasets
3534.839 -> and set up
3536.88 -> challenges for people to submit
3538.74 -> machine learning models to solve
3540.059 -> tasks. In a Kaggle
3542.22 -> competition, the
3544.619 -> organizer has a test set that nobody
3546.839 -> can touch at all — this test set
3549.48 -> is used only once, at the very
3552.24 -> end, when they decide who the winner is.
3554.579 -> The
3557.299 -> organizer releases
3559.339 -> the rest — actually, I'm not sure;
3561.599 -> sometimes they give you a division,
3563.28 -> saying this is the validation set and this
3564.66 -> is the training set, and sometimes they just
3565.859 -> release the whole thing
3568.02 -> to you and you
3570.42 -> divide it yourself. Even if
3572.16 -> they release it in one format, you can
3574.559 -> re-divide it however
3576.119 -> you want. Suppose you have
3577.92 -> divided all the training
3579.42 -> examples into these two sets; then you can do
3580.859 -> whatever kind of
3582.78 -> optimization
3585.18 -> you want. And typically
3587.52 -> they do have a
3589.98 -> validation set which is used
3591.66 -> for computing the scores on the
3592.92 -> leaderboard — there's a
3594.299 -> leaderboard which tells you
3595.859 -> how well you are doing against others, at
3598.5 -> least temporarily. So
3601.2 -> the leaderboard evaluates
3602.88 -> on the validation set, but the
3604.5 -> leaderboard may not be exactly the
3607.26 -> same as the final ranking: it's possible
3609.78 -> that in the end you find out that
3611.88 -> somebody who was leading on the
3613.02 -> leaderboard did not, in the very
3614.7 -> final test,
3616.38 -> perform the way
3619.079 -> the validation set suggested.
3621.059 -> But this is the general setup
3623.7 -> that people use.
3626.579 -> Does that make sense?
3628.559 -> One common question
3630.24 -> that people generally ask —
3631.859 -> and that I ask myself as well — is
3633.96 -> how reliable this
3635.94 -> validation set is:
3637.26 -> if you have very
3640.38 -> high performance on the validation set,
3641.88 -> should you trust it?
3644.22 -> On one hand, you shouldn't trust
3646.68 -> it 100%, because if you
3648.72 -> could trust the validation-set
3649.859 -> performance, why would you need a test set?
3651.24 -> The test set is supposed to give
3653.04 -> you the final verdict on success;
3654.9 -> it's the thing
3658.619 -> that guarantees
3659.94 -> to give you the right answer. So the
3662.22 -> validation set —
3664.619 -> you probably shouldn't trust it 100%.
3667.559 -> On the other hand, in prior work —
3669.839 -> people realized this in the last five years,
3671.28 -> and I think there is a sequence of
3673.2 -> papers on it —
3674.76 -> the validation-set performance
3676.26 -> is actually well correlated with the
3678.059 -> test set, so it is a reasonable
3680.64 -> indicator of how good your
3682.319 -> performance on the test set is. There is
3684.119 -> just no theoretical guarantee that
3686.16 -> the two are exactly the same. But
3688.319 -> in most cases, if you don't do
3690.24 -> anything crazy — if you don't
3692.099 -> somehow just memorize the entire
3694.559 -> validation set by building some
3696.72 -> kind of lookup
3699.48 -> table — then typically the
3703.38 -> performance on the validation set is
3705.18 -> very close to the test set. There is a very
3708.96 -> important paper from probably three or
3710.94 -> four years ago
3712.26 -> by Berkeley people: they
3714.299 -> looked at
3715.859 -> maybe 300 Kaggle competitions,
3719.22 -> and they looked at the
3723.66 -> ranking of performance on the
3725.22 -> validation set — on the leaderboard — and
3726.9 -> at how it correlated with
3728.46 -> the final winner, the final
3730.74 -> performance, and they found that they are
3732.18 -> very correlated. This suggests that
3733.859 -> the validation set is actually a pretty
3735.359 -> good indicator for the test set, even though
3737.099 -> it's not guaranteed.
3739.319 -> And in this vein, typical machine
3743.819 -> learning practice is that
3746.28 -> when people publish
3753.599 -> papers, in some sense they
3755.4 -> publish results based on validation
3757.74 -> sets.
3759.119 -> For example, if you look at
3760.92 -> ImageNet performance, the
3765.059 -> so-called test
3766.92 -> performance that people report is
3769.2 -> actually performance on
3770.64 -> a validation set, because that
3772.68 -> so-called test set has been seen so
3774.42 -> many times. I don't know exactly whether
3776.88 -> there's an official name for it
3779.4 -> in the official ImageNet
3781.619 -> dataset, but at least the set on
3783.119 -> which you report your performance
3785.16 -> shouldn't be
3787.68 -> considered a test set, because a test
3789.359 -> set you should use only once, and
3790.98 -> people have actually used it so many
3792.299 -> times — maybe a million times. So,
3795.96 -> abstractly speaking, these days
3797.7 -> when you publish a paper you use a
3798.96 -> validation set; only in a
3800.64 -> Kaggle competition do you use a test set to
3802.5 -> really decide the winner.
3803.94 -> But empirically it sounds like they
3806.339 -> are very close, so that's why
3808.319 -> we don't worry too much about it.
3814.859 -> Any questions?
3818.4 -> Oh — I think
3822.14 -> it's called "Kaggle"; I
3825 -> don't know exactly how to pronounce it. It
3826.859 -> is a platform that hosts a
3829.38 -> lot of competitions, maybe
3831.96 -> a hundred every year or something like
3833.819 -> that. You can submit your model, and
3835.859 -> sometimes there
3837.24 -> is a prize for winning the
3839.46 -> competition.
3842.7 -> Right.
3843.72 -> And by the way, this validation
3845.52 -> set — sometimes people now call it a
3847.2 -> development set as well.
3853.44 -> I don't know how popular that name is,
3856.02 -> but at least if you say "validation
3858.299 -> set", I think everyone will know what
3859.859 -> you're talking about; "development set" I
3861.9 -> think most people would know as well, but
3863.88 -> it's a relatively new term from the last
3865.799 -> five years.
3869.64 -> [Student question, partially inaudible:] about treating the training
3872.28 -> and validation sets as part of a bigger set
3874.619 -> once you've actually decided on what hyper-
3876.24 -> parameters you want to use.
3884.359 -> Right — so how do you do the
3886.859 -> split? The most
3890.28 -> typical way is that you just split
3892.02 -> randomly: you reserve maybe a tenth of
3894.839 -> the dataset as the validation set, maybe 20%,
3897.059 -> depending on how much data you
3899.46 -> have.
3901.2 -> I think what you are probably
3902.7 -> thinking of is so-called cross-
3903.839 -> validation, which does something
3905.76 -> more elaborate: you
3907.5 -> split your dataset into
3909.839 -> multiple splits and run multiple
3911.22 -> experiments on the different splits.
3916.2 -> I'm not going to cover it
3919.5 -> in this lecture, mostly because
3921.24 -> these days, if you have a large enough
3922.98 -> dataset, you typically just do this
3925.02 -> static split, just because it's much
3927.599 -> easier — you don't have to run your
3928.859 -> algorithm multiple times — and this is
3931.26 -> what is done in
3933 -> most large-scale
3935.94 -> machine learning situations.
3938.099 -> But if you only have, say, 100
3940.799 -> examples, then indeed, as you said,
3943.5 -> fixing 20 examples as the
3945.18 -> validation set is a little
3946.68 -> wasteful, so then you have to do
3948.18 -> cross-validation. We
3950.22 -> have a section in the lecture
3952.2 -> notes about cross-validation,
3954 -> with a description of
3955.559 -> the practical procedure;
3958.44 -> if you're
3959.16 -> interested you can read it — it's
3961.14 -> nothing very complicated either.
3963.78 -> thank you
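The k-fold cross-validation idea mentioned above can be sketched as follows (a minimal sketch: the `train_and_score` callback and the predict-the-mean example are assumptions of this illustration, not the exact procedure in the lecture notes):

```python
def k_fold_cv(examples, k, train_and_score):
    """Average the validation score over k folds: each fold is held out
    once while the model is trained on the remaining k-1 folds."""
    n = len(examples)
    scores = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        val = examples[lo:hi]
        train = examples[:lo] + examples[hi:]
        scores.append(train_and_score(train, val))
    return sum(scores) / k

# Example: score a predict-the-mean model by squared error on each fold.
def mean_model_score(train, val):
    c = sum(train) / len(train)
    return sum((y - c) ** 2 for y in val) / len(val)

print(k_fold_cv([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3,
                train_and_score=mean_model_score))  # 6.25
```

This makes the trade-off concrete: every example gets used for validation once, at the cost of running training k times — which is why, as the lecture says, a single static split is preferred at large scale.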
3964.799 -> Okay — so
3966.66 -> I'm going to use the last
3968.76 -> ten minutes to talk about a more
3970.799 -> applied perspective.
3972.96 -> I'm going to use the slides, so I guess
3975.54 -> I'll
4034.16 -> see whether this works.
4038.48 -> Okay, great.
4041.839 -> It's not centered, is it?
4045.68 -> Wait —
4047.78 -> okay, sounds good.
4049.819 -> Okay, good. So in this
4052.819 -> part of the lecture we're talking about
4054.98 -> some ML advice.
4058.64 -> These slides were made by
4060.92 -> our other instructor, Chris Ré, with the
4064.099 -> help of Alex Ratner;
4066.319 -> I'm pretty much just relaying
4069.38 -> what he says in the slides.
4073.88 -> I think the slides used to
4075.619 -> be a little longer than this — I'm
4076.7 -> going to release the longer version as
4077.9 -> well — so I shortened them to only 20
4079.819 -> or 30 minutes.
4081.559 -> Part of the reason is that
4082.88 -> the slides also contain material
4084.02 -> that has been covered
4085.88 -> on the whiteboard, and part of the
4087.98 -> reason is that there are some applied
4089.48 -> parts that I think we don't have
4091.94 -> a lot of time to discuss
4094.4 -> this quarter. But I'm going to
4096.44 -> release the longer slides as well
4098 -> for your reference. This set
4101.54 -> of slides is mostly for somewhat
4104.06 -> more applied
4105.259 -> situations: imagine, for example,
4107.299 -> that you're at a startup
4109.219 -> and you are doing machine learning to
4111.02 -> solve some concrete problem.
4113.6 -> It's a little less like
4114.859 -> research, because you're going
4116.12 -> to see that you have many
4117.319 -> more issues than in a concrete research
4120.14 -> setting. Actually, in research you
4121.64 -> sometimes have this too. The
4122.839 -> most typical research setting is
4125.66 -> that you have a very
4127.16 -> concrete dataset: you know
4128.839 -> the input and output, you know everything,
4130.339 -> there's no flexibility, you
4132.56 -> cannot redefine the problem, and you just
4134.239 -> want to get the best number. That's one
4135.679 -> type of research — I don't think it's
4137.359 -> the most typical one either —
4138.799 -> but it is one type,
4140.779 -> and from there you can have more
4142.46 -> and more flexibility: you can change your
4143.839 -> data, you can rephrase your problem, you
4145.759 -> can figure out what the right problem is.
4146.779 -> And once you really do it in
4149.96 -> industry, it's going to be
4151.699 -> much more complicated.
4153.739 -> Some disclaimers to start with — I
4156.56 -> think this is Chris's disclaimer, and it's
4158.239 -> also mine. There is
4160.58 -> no universal
4162.38 -> ground truth here;
4164.06 -> it's really just
4165.5 -> experience —
4167.42 -> experience from people doing
4169.64 -> this in real life.
4173.359 -> And things change over
4175.58 -> time: what people thought
4177.5 -> was the right thing to do five years ago
4179.12 -> may have changed by now. I'm going to
4182.54 -> go through this a little quickly,
4185.48 -> and I'm going
4187.94 -> to omit some details as well, but feel
4189.92 -> free to stop me.
4192.739 -> Right, so —
4194.9 -> In some sense there are many phases
4198.62 -> to an ML project if you really do it in
4200.42 -> industry. For example, one
4202.4 -> thing you want to decide is whether you
4204.26 -> really need an ML system to begin with:
4206.36 -> some problems are not
4207.86 -> necessarily suitable for ML. I
4210.26 -> don't do
4212.42 -> as much industry work —
4215.54 -> Chris is also an entrepreneur
4218.199 -> besides being
4220.1 -> a professor,
4221.719 -> so he knows a lot about this — but
4223.219 -> even I know that sometimes
4225.199 -> people sell
4228.08 -> their product as an ML system when
4229.58 -> the underlying system is not
4231.739 -> really using much ML. Sometimes you
4233.78 -> don't really need ML. And when
4236.659 -> you do use ML, if it
4238.159 -> doesn't work, what do you do? And
4241.04 -> how do you
4243.02 -> deal with the whole ecosystem?
4245.3 -> As a running example,
4247.699 -> we're going to build a spam detector,
4249.26 -> and the question is how you
4250.88 -> detect spam.
4252.98 -> We'll use this example a lot
4255.62 -> in this part of the course.
4259.1 -> These are the seven steps of an ML system.
4262.1 -> Again, this is a little broader
4264.5 -> than just ML research:
4266.36 -> you're thinking about designing a system
4267.92 -> that can actually work in practice.
4271.64 -> First, acquire data, and look
4274.4 -> at the data; maybe you want
4277.1 -> to create the train /
4278.42 -> development / test sets we discussed.
4281.06 -> Then define a
4283.34 -> specification, which I'll
4284.719 -> discuss — in some sense this says
4286.4 -> that you have to have an evaluation
4288.08 -> metric for your model: in what sense do
4291.14 -> you want your model to
4292.52 -> succeed? Then you build your
4294.5 -> model, and try a bunch of models — maybe
4296.42 -> you'll spend a lot of time on
4298.1 -> step five. Then eventually
4300.08 -> you measure the model's
4301.94 -> performance, not necessarily
4303.32 -> only according to the specification you
4305 -> defined in step four, but maybe
4306.5 -> with other
4308.42 -> measurements, for example speed, training
4310.88 -> time, and so forth. And then eventually
4312.62 -> you repeat, and maybe you have to
4314.12 -> repeat many times.
4316.04 -> so
4317.54 -> I'm going to go through these steps
4318.739 -> relatively quickly — I only have
4320 -> about 15 minutes — but
4323.12 -> if you're interested, you can look at
4324.679 -> the longer slides as well.
4327.56 -> Suppose you want to decide
4329.78 -> what is spam and what is not.
4333.56 -> Ideally you want data sampled from
4335.54 -> the distribution your spam product will be
4337.52 -> run on: you want your
4338.84 -> data to be close
4340.76 -> to the final test data. You
4343.28 -> don't want to collect spam
4345.679 -> data from 30 years ago and use
4348.5 -> it to train something
4349.82 -> that has to work today. But
4352.94 -> this is not always available,
4354.32 -> because you never know what
4356.78 -> spam emails will look like ten years from now, so
4359.12 -> you have to make some sacrifices.
4361.52 -> Sometimes you don't even
4362.6 -> have the features: maybe
4364.88 -> your existing records didn't save
4366.62 -> everything — maybe they just saved the title
4368.36 -> of the email and not the entire
4370.28 -> content — and that limits your
4372.14 -> ability to detect spam. And there
4375.26 -> are many legal issues around looking at the
4376.52 -> data.
4377.719 -> And — this is according to Chris,
4380 -> and I think it's true as well —
4381.8 -> you'll get it wrong on the first try:
4383.239 -> sometimes you'll
4385.58 -> find that the data you collected are
4387.14 -> not the right ones, and you'll have to repeat.
4389.84 -> And after you collect some
4391.58 -> data, you have to look at it.
4393.98 -> This is something we
4395.36 -> don't really teach much in
4397.76 -> this machine learning course — looking at
4399.32 -> your data — because we
4401.12 -> mostly assume that you already have
4402.26 -> your data and have already made the right
4403.58 -> assumptions: you already know your
4404.84 -> data is Gaussian, and then you
4407.239 -> run Gaussian discriminant analysis.
4409.4 -> But we never say how
4411.44 -> you decide whether you can really make
4413 -> the Gaussian
4415.159 -> assumption. In practice you have
4417.5 -> to do that, because you have to
4419.659 -> see whether the data makes sense.
4421.76 -> There are many nuances there: for
4423.56 -> example, sometimes your data are not as
4425.06 -> good as you think — maybe
4427.159 -> the format is not
4429.26 -> right, maybe there are
4430.64 -> outliers,
4431.96 -> and so forth — and only by
4435.14 -> looking at the data can you see
4437.12 -> what's going on.
4438.98 -> Even in research
4440.12 -> I sometimes experience this.
4441.62 -> In one of my projects,
4444.02 -> I think we just used
4446.6 -> the wrong data from day one, in some
4448.219 -> sense — some of the
4449.78 -> data was just corrupted
4451.88 -> by accident — and we were tuning on it
4454.1 -> until, only about
4455.96 -> a month later, we realized
4457.4 -> it. Of course, in research it's
4459.56 -> probably easier to catch —
4461 -> and a month is a long time for us
4462.679 -> to take to detect it —
4464.84 -> you can usually detect this easily, but
4466.64 -> in real-life cases it's
4468.62 -> sometimes even harder: you
4470.36 -> don't necessarily even have the tools to look
4472.159 -> at your data; you may have to
4474.14 -> build tools to look at your data.
4476.56 -> And you need to think
4480.32 -> about different subpopulations — maybe
4481.699 -> spam from .edu addresses versus spam from .com
4484.699 -> addresses — and see what the
4487.52 -> differences are. This will give you a lot
4489.26 -> of intuition about what data you
4491.6 -> should use and what sort of models
4493.52 -> you should use. And do this at every
4496.52 -> stage.
4502.219 -> This is also the reason
4504.56 -> to build tools that let you look at
4506.12 -> data conveniently. If you
4508.28 -> just look at the data once, then
4510.62 -> sure, you can just
4512.6 -> print something out; but if you want
4514.04 -> to look many times, then you should
4515.9 -> have convenient tools, which
4517.88 -> eventually reinforces the habit and
4520.1 -> makes you more likely to look at the
4522.32 -> data. At least in
4523.76 -> research I have also noticed this: if
4526.64 -> the data is very hard to visualize, then
4528.8 -> people are less likely
4530.78 -> to visualize it. So sometimes
4533.96 -> it requires an investment so that you
4536.9 -> have the tooling, and in the
4538.82 -> future there is less
4542.3 -> cost to looking at your
4544.04 -> data. And you should do this at every
4545.96 -> stage, in many cases.
4549.739 -> And this is about domain knowledge:
4554.06 -> sometimes understanding the data
4555.38 -> requires expertise.
4558.08 -> There are
4561.08 -> some examples in the
4562.64 -> slides which I removed just to save
4564.679 -> some time, but in short:
4567.08 -> sometimes — for example, if your
4568.94 -> data is corrupted — only
4571.159 -> experts can tell. For example, with
4573.44 -> medical data, only experts may
4575.12 -> know that the data are corrupted, while
4577.04 -> from a machine learner's
4578.78 -> perspective the data look fine.
4581.54 -> Now let me talk about the train / dev /
4584 -> test split. This of course is
4587.42 -> something important
4589.76 -> to do,
4591.14 -> and in practice it's rather less
4593.9 -> clear-cut
4595.1 -> than in research,
4596.96 -> because in research
4598.28 -> you sometimes already get a split
4600.08 -> in the first place —
4602 -> the data already comes with a
4603.38 -> split — whereas in real life you
4605.42 -> sometimes have to avoid certain kinds of
4607.82 -> leakage.
4607.82 -> leakage uh so for example maybe
4610.34 -> sometimes for example let me take an
4612.98 -> extreme case right so suppose your data
4615.44 -> has reputations so if you have like a
4618.14 -> million data but actually it's just a
4620.239 -> two like every data points is repeated
4623.3 -> twice so essentially you just have 500k
4625.4 -> and but repeat it twice if that's the
4628.219 -> case then you split the data then you're
4630.199 -> going to see some reputations between
4631.64 -> like you some examples in the test will
4633.8 -> also show up in the training exactly the
4635.239 -> same
4636.02 -> That would be a disaster, so you
4638.659 -> have to avoid such situations —
4640.1 -> and this actually happens in
4641.78 -> Kaggle contests.
4645.14 -> In many of them —
4646.64 -> I actually tried some
4649.219 -> of these Kaggle contests at some
4651.26 -> point; that was
4653.42 -> probably three or four years ago,
4656.9 -> maybe six years ago — at that time, in many of
4659.659 -> the contests, if you looked,
4662.12 -> there was always some kind of
4663.98 -> forum for discussions.
4688.239 -> I'm sure this happens even more
4690.739 -> in industry, which I'm less
4692.12 -> familiar with, but even in Kaggle
4694.28 -> contests —
4695.9 -> in many of them, if
4698 -> you look at the forum, always after
4700.28 -> a few weeks
4702.679 -> someone will figure
4704.84 -> out some leakage, just because
4707.179 -> some training examples are very, very
4709.4 -> close to test examples, and they
4711.44 -> use this leakage to improve
4714.32 -> their numbers — some kind of
4715.94 -> weird rule that makes the
4717.739 -> validation performance much better than
4719.54 -> you would have thought — and then everyone has to use
4721.34 -> it. It's kind of interesting: I
4722.78 -> don't know why, but everyone who
4724.4 -> finds this kind of leakage
4725.78 -> somehow posts it in the forum, and
4728.12 -> then — I don't know whether this is
4729.62 -> always true, but in the few cases
4732.02 -> I've seen —
4733.94 -> everyone else has to use this little gadget
4736.28 -> to improve their model's performance,
4737.6 -> because if you don't use it, your model's
4739.52 -> performance is just not as good as
4740.84 -> the others'.
4742.88 -> I don't know whether
4744.98 -> they now have better
4746.54 -> ways to detect this leakage and design
4748.76 -> the competitions better,
4750.679 -> but this is something
4752.06 -> you have to pay attention to in
4754.28 -> practice. Another tricky
4756.739 -> thing is: what is a good split? We
4759.08 -> have discussed whether you should do
4760.34 -> random splits. In research, as I
4762.86 -> said, a random split is pretty much
4764.36 -> the best you can do, because you
4766.159 -> really, literally care about the
4768.02 -> validation performance.
4770.78 -> But the problem is that
4773.12 -> sometimes, in the real
4775.159 -> world,
4780.02 -> the test dataset is not really
4782.12 -> what you care about.
4784.4 -> That's why, when you split,
4786.02 -> you also want to split the train and
4787.94 -> validation sets in a way such that the
4789.8 -> validation set is
4791.3 -> somewhat closer to the real test set.
4793.1 -> let's so I guess this is the situation
4796.58 -> suppose you are thinking
4798.14 -> about stock price prediction.
4800.06 -> So your final goal is to
4802.4 -> predict the price
4803.9 -> in the future,
4805.58 -> which is something you just don't
4807.14 -> have at all. So now
4810.14 -> suppose you have data between
4812.54 -> 2000 and 2020; you have
4816.739 -> these 20 years of data. So how do you do
4818.6 -> the train/validation split? Should
4822.02 -> you just do a random split, or should
4823.88 -> you do something else? One
4826.04 -> possible option is that you should
4828.44 -> probably split it into, for example, 2000 to
4831.679 -> 2015 as the training set, and 2015 to
4835.28 -> 2020 as the validation set. Why would you
4837.8 -> argue that that's a reasonable option?
4842.179 -> Possibly just because the last five
4844.84 -> years are more predictive of the future than the earlier years.
4847.58 -> So, you know, I'm
4849.8 -> not necessarily saying that this
4851.42 -> is the only option or the best
4852.86 -> option, but this is at least something to
4854.36 -> consider. So this is
4856.52 -> kind of different from what we
4859.04 -> do in research, and the reason is
4861.86 -> just that you care
4863.06 -> about the performance in the future,
4864.62 -> which is something you don't have access
4867.08 -> to.
4868.36 -> Right?
4869.96 -> So I guess, in this case,
4875.36 -> the better split is to use
4876.92 -> the first 50 days to predict the last 50
4878.42 -> days.
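The time-ordered split described above can be sketched in a few lines. A minimal sketch, assuming each record carries a year field; the field name, the cutoff year, and the toy price data are illustrative, not from the lecture:

```python
# Hypothetical sketch of a time-based train/validation split for
# stock-price-style data: train on 2000-2014, validate on 2015-2019,
# instead of splitting at random.

def temporal_split(records, cutoff_year=2015):
    """Split time-stamped records so validation is strictly later than training."""
    train = [r for r in records if r["year"] < cutoff_year]
    val = [r for r in records if r["year"] >= cutoff_year]
    return train, val

# Twenty years of made-up data, 2000 through 2019.
records = [{"year": y, "price": 100 + y - 2000} for y in range(2000, 2020)]
train, val = temporal_split(records)
print(len(train), len(val))  # 15 5
```

The point is that the validation set imitates the real deployment condition (predicting the future from the past), which a random split would not.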
4880.64 -> Okay, and create a specification.
4883.76 -> So I think this is mostly
4885.62 -> related to how you
4887.84 -> define what you want to predict and
4890.84 -> what the goal is.
4893.06 -> So, in many cases you
4894.92 -> can use a machine learning model in many different
4897.02 -> ways, depending on what you
4902.36 -> want to use it for. And you also
4903.98 -> care about different
4905.84 -> perspectives. For
4908.179 -> example, what is spam?
4910.1 -> The definition of spam sometimes,
4911.719 -> you know, is different
4914.42 -> between different people. So,
4916.04 -> maybe, do you think an ad
4919.34 -> from, say, Google
4922.159 -> is spam?
4924.14 -> Maybe I always think of it like that, but
4925.82 -> someone else probably prefers to
4927.5 -> receive some ad emails, at a
4930.92 -> very low rate. So you have to
4933.08 -> specify exactly what you want to predict:
4934.699 -> what is really the definition
4935.96 -> of spam? And you don't want to have
4938.56 -> ambiguities, at least from
4941.239 -> the model's perspective, because
4943.28 -> machine learning models don't like
4945.26 -> ambiguities. You really want a
4947.12 -> clear cut: what is spam and what is not.
4949.88 -> And also, what level of
4952.159 -> expertise is needed to understand it?
4953.6 -> Because, if you specify what spam is,
4955.76 -> you can have a definition, but
4957.26 -> if your labelers are not able to
4961.699 -> label the spam according to your
4963.32 -> definition, it's not going to be useful.
4964.46 -> Suppose you have a very, very
4965.78 -> complex definition of spam, and then you
4968.36 -> say, I have this data and I'll ask
4970.82 -> labelers to label it, but the labelers
4972.62 -> cannot execute my definition of spam
4974.78 -> easily. That's going to be another issue.
4978.14 -> And also, with the specification,
4982.58 -> you use the specification
4987.8 -> to define a
4990.38 -> set of examples, because
4991.94 -> eventually, if you just have some kind
4994.34 -> of text description of
4995.9 -> what spam is, that's probably not
4997.58 -> useful. You have to really have a set of
4999.5 -> test examples, and the test examples have
5001.719 -> labels, and that is really your
5003.4 -> definition of spam.
5005.679 -> So, for example, one of the quick and
5008.86 -> dirty tests here is
5011.739 -> whether your definition of spam
5014.98 -> can pass the so-called
5016.6 -> inter-annotator agreement test. Basically,
5019.36 -> what you do is
5020.739 -> write down some
5023.52 -> definition, then take some
5025.84 -> randomly selected examples of emails, and then ask,
5029.8 -> say, three annotators to see whether they
5032.02 -> can agree on which emails are spam or
5034.84 -> not according to the definition you give
5036.52 -> them. And often you don't really get that
5039.52 -> high an agreement; you
5041.38 -> don't get 100% agreement. In many
5043.36 -> cases, people's interpretations
5044.86 -> of the same definition will be
5046.6 -> different.
5047.8 -> And let's say you have 95%
5049.42 -> agreement; I think that's already
5050.739 -> considered to be great.
5052.679 -> If you have 95% agreement, then
5055.96 -> the question becomes whether
5059.14 -> it's meaningful to shoot for an
5062.199 -> accuracy of more than 95%, if the
5064.48 -> annotators
5066.58 -> only agree with each other 95
5069.94 -> percent of the time.
5073.5 -> Actually, sometimes you can do better
5075.34 -> than that, just because humans
5077.38 -> sometimes
5080.8 -> have a less accurate
5082.84 -> interpretation, and sometimes machine learning
5084.58 -> models can do better. But
5086.08 -> typically, you probably shouldn't shoot
5087.82 -> for much higher than 95% in many cases.
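The quick-and-dirty agreement check described here can be computed as average pairwise percent agreement over the three annotators. A minimal sketch with made-up labels (a fuller analysis might use Cohen's or Fleiss' kappa, which correct for chance agreement):

```python
# Hypothetical sketch of an inter-annotator agreement check: average
# pairwise percent agreement over annotators' spam/ham labels.
# The labels below are made up for illustration.

from itertools import combinations

def pairwise_agreement(annotations):
    """annotations: one list of labels per annotator, all the same length."""
    pairs = list(combinations(annotations, 2))
    total = 0.0
    for a, b in pairs:
        matches = sum(1 for x, y in zip(a, b) if x == y)
        total += matches / len(a)
    return total / len(pairs)

ann1 = ["spam", "spam", "ham", "ham", "spam"]
ann2 = ["spam", "ham",  "ham", "ham", "spam"]
ann3 = ["spam", "spam", "ham", "spam", "spam"]
print(pairwise_agreement([ann1, ann2, ann3]))  # about 0.733
```

If this number came out at, say, 0.73 as above, chasing 95% model accuracy against these labels would be questionable; the specification itself would need work first.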
5091.78 -> So
5096.52 -> then you're going to do this iteratively. For example, you
5098.38 -> have to examine the
5099.4 -> specification; you've got to look at
5101.14 -> what the disagreement is and
5102.34 -> why you have disagreement. Maybe that
5103.9 -> means you have to change your
5104.86 -> specification.
5107.56 -> And the last
5109.84 -> question is kind of interesting: do
5111.04 -> you tune the people or the machine?
5111.94 -> Eventually, at some point, if you
5114.1 -> have a lot of disagreement,
5116.38 -> sometimes you have to train the people
5117.4 -> to label things correctly. For
5119.44 -> example, even in the image
5121.54 -> classification problem, we have
5122.98 -> clearly defined labels, right? Dogs, cats.
5125.739 -> But once you go to the more
5129.04 -> fine-grained breeds of dogs,
5131.32 -> some of the labelers don't really
5132.88 -> recognize different breeds of dogs, so
5134.739 -> you have to train the labelers in some
5136.9 -> way. I have a friend who did this
5138.88 -> during his PhD, and basically he had a lot
5141.82 -> of training documents for the
5143.679 -> labelers, you know, Amazon Mechanical Turk workers,
5145.36 -> and they actually had to ask
5147.34 -> the Turkers to pass some exams to be
5150.52 -> able to be a labeler for them.
5153.46 -> So it's actually kind of complicated.
5156.699 -> Okay, so I guess I'll be quick, given
5159.34 -> that we are almost running out of time.
5160.6 -> Then, when you do the modeling (this
5162.82 -> is the more machine learning part), you
5164.98 -> want to implement the simplest possible
5166.84 -> model.
5168.94 -> Keep it simple. And
5171.28 -> sometimes, I think this is the
5173.44 -> key thing: don't get carried away with new
5175.6 -> models; use them to understand the data.
5177.639 -> So sometimes
5183.159 -> the model is not the only angle;
5184.84 -> sometimes you want to use the
5186.159 -> model to understand what the
5188.86 -> problems with the data are, and
5190.719 -> sometimes you can fix the data
5192.639 -> and then the model's
5193.9 -> performance becomes much better.
5196.78 -> So this is a whole loop:
5199.139 -> your bottleneck may not just be
5201.4 -> the model; sometimes it could come
5204.1 -> from some other place, maybe the data,
5205.659 -> maybe the specification, maybe the train/
5207.639 -> test split, and so forth.
5210.699 -> And have some baselines, so that
5213.1 -> you know where you stand,
5215.26 -> and you need to do some ablation
5216.88 -> studies, and so forth.
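The advice above about keeping baselines can be as cheap as a majority-class predictor: if a real model cannot beat it, something is wrong upstream (data, specification, or split). A minimal sketch with toy labels, not from the lecture:

```python
# Hypothetical sketch: a majority-class baseline as a sanity check.
# Any real model should comfortably beat this; if not, look at the
# data, the specification, or the split before blaming the model.

from collections import Counter

def majority_baseline(train_labels):
    """Return a predict function that always outputs the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _x: most_common

def accuracy(predict, examples, labels):
    correct = sum(1 for x, y in zip(examples, labels) if predict(x) == y)
    return correct / len(labels)

train_labels = ["ham", "ham", "ham", "spam"]  # made-up toy labels
predict = majority_baseline(train_labels)

test_x = [None] * 5          # features are ignored by this baseline
test_y = ["ham", "spam", "ham", "ham", "spam"]
print(accuracy(predict, test_x, test_y))  # 0.6
```

Scikit-learn ships this idea as `DummyClassifier`, but the hand-rolled version makes clear how little the baseline actually does.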
5219.4 -> And then, step six: you need
5221.56 -> to monitor the output.
5225.52 -> You have to monitor the output so
5227.26 -> that you don't make mistakes twice,
5228.76 -> and you want to catch new
5230.5 -> mistakes as soon as possible.
5232.659 -> And you want to measure
5234.219 -> different things, including simple things; for
5235.719 -> example,
5239.98 -> there are a bunch of qualities you care about here,
5246.76 -> and so on and so forth.
5249.28 -> And this is probably one thing,
5250 -> one challenge, we are
5251.739 -> really facing these days with machine
5253.36 -> learning models; I would say
5254.679 -> probably one of the most important
5255.88 -> challenges. The reason is that you
5258.58 -> have a distribution shift: your
5260.199 -> training and validation distributions are
5262.78 -> very different from the distribution that
5264.58 -> you will test on eventually.
5266.44 -> Or maybe they are similar, but there
5268.42 -> are some special
5269.38 -> subpopulations that make them different.
5271.659 -> For example, you train on
5273.4 -> San Francisco street views and you test on
5276.28 -> Arizona street views. For example, when
5278.679 -> you build an autonomous driving
5280.6 -> car, you train on some
5282.52 -> street views and you have to test on
5283.9 -> some other places, and that creates a
5285.76 -> distribution shift, and then
5288.28 -> you can have surprises.
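One cheap way to surface the kind of surprise described above is to report accuracy per subpopulation instead of a single aggregate number. A minimal sketch with made-up city tags and predictions (the group names and data are illustrative):

```python
# Hypothetical sketch: break evaluation down by subpopulation (here, city)
# so a shift like "trained on San Francisco, tested on Arizona" shows up
# as a per-group accuracy gap rather than hiding inside the average.

from collections import defaultdict

def per_group_accuracy(groups, preds, labels):
    """Return {group: accuracy} for examples tagged with a group id."""
    hit = defaultdict(int)
    tot = defaultdict(int)
    for g, p, y in zip(groups, preds, labels):
        tot[g] += 1
        hit[g] += int(p == y)
    return {g: hit[g] / tot[g] for g in tot}

groups = ["SF", "SF", "SF", "AZ", "AZ"]  # made-up city tags
preds  = [1, 0, 1, 1, 0]
labels = [1, 0, 1, 0, 1]
print(per_group_accuracy(groups, preds, labels))  # {'SF': 1.0, 'AZ': 0.0}
```

Here the aggregate accuracy is 0.6, which looks tolerable, while the per-group breakdown makes the complete failure on the shifted subpopulation obvious.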
5290.4 -> And there are not so many
5293.26 -> good ways to deal with this,
5295.36 -> except that you have to be
5296.8 -> careful about it, and there are new
5298.42 -> algorithms.
5301.239 -> And this is incredibly hard;
5302.92 -> there are no real solutions in industry, and I
5304.9 -> don't think there are real solutions
5307.96 -> in research either. I think
5310 -> we're going to have one guest lecture,
5311.739 -> by James, about the robustness of
5313.78 -> machine learning models; there is a lot
5315.28 -> of recent work on it.
5317.26 -> But so far, I
5319.12 -> think we definitely have better
5320.56 -> algorithms for being more robust, but I
5324.04 -> think the
5325.48 -> robustness is still not
5327.28 -> ideal.
5329.56 -> Okay, so
5334.239 -> I think I'll just jump to step
5335.92 -> seven: you have to repeat and
5338.199 -> look at the data. I'll release the longer
5340.48 -> version of the slides if you
5341.8 -> are interested in some of the details.
5343 -> Thanks.
Source: https://www.youtube.com/watch?v=NirZnqwYfYU