MIT Introduction to Deep Learning


Welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together we're really excited to welcome you to this incredible course. This is a very fast-paced and very intense one week that we're about to go through together, covering the foundations of a field that is itself moving extremely fast and that has been changing rapidly over the eight years we have taught this course at MIT. Over the past decade, in fact even before we started teaching this course, AI and deep learning have been revolutionizing so many advances in so many areas of science, mathematics, physics, and so on. Not that long ago we were facing challenges and problems that we did not think were necessarily solvable in our lifetimes, and AI is now actually solving them beyond human performance. Each year that we teach this course, this lecture in particular gets harder and harder to teach, because for an introductory-level course, lecture number one is the lecture that's supposed to cover the foundations. If you think of any other introductory course, like a 101 course on mathematics or biology, those first lectures don't really change much over time. But we're in a rapidly changing field of AI and deep learning, where even these introductory lectures are changing rapidly. So let me give you an example of how we introduced this course only a few years ago: "Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. Deep learning is revolutionizing so many fields, from robotics to medicine and everything in between. You'll learn the fundamentals of this field and how you can build some of these incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence, and in this class you'll learn how. It has been an honor to speak with you today, and I hope you enjoy the
course." The really surprising thing about that video to me, when we first made it, was how viral it went. Within just a couple of months of us teaching this course a few years ago, that video got over a million views. People were shocked by a few things, but the main one was the realism of AI: the ability to generate content that looks and sounds extremely hyperrealistic. When we created this video for the class only a few years ago, it took us about $10,000 in compute to generate just a roughly one-minute-long video, which if you think about it is an extremely expensive way to produce something like that. And maybe a lot of you are not even that impressed by it today, because you see all of the amazing things that AI and deep learning are producing now. Fast-forward about four years to today: where are we? AI is now generating content with deep learning being so commoditized. Deep learning is at all of our fingertips now, online, in our smartphones, and so on. In fact, we can use deep learning to generate these types of hyperrealistic pieces of media and content entirely from English language, without even coding anymore. Before, we had to actually go in, train these models, and really code them to be able to create that one-minute-long video; today we have models that will do that for us, end to end, directly from English language. We can ask these models to create something the world has never seen before, a photo of an astronaut riding a horse, and these models can imagine those pieces of content entirely from scratch. My personal favorite is how we can now ask these deep learning models to create new types of software, even themselves being software: to ask them, for example, to write a piece of TensorFlow code to train a neural network. We're asking a neural network to write TensorFlow code to train another neural network, and our model can produce examples of functional and usable pieces of code that satisfy this English prompt while walking through each part of the code independently; it's not just producing the code, but actually educating and teaching the user what each of these code blocks is doing. You can see an example here,
and really what I’m trying to show you with all of this is that this is just highlighting how far deep learning has gone even in a couple years since we’ve started teaching this course I mean going back even from before that to eight years ago and the most amazing thing that you’ll see in this course in my opinion is that what we try to do here is to teach you the foundations of all of this how all of these different types of models are created from the ground up and how we can make all of these amazing advances possible so that
you can also do it on your own as well. And like I mentioned in the beginning, this introductory course is getting harder and harder to make every year. I honestly don't know where the field is going to be next year, or even in one or two months from now, just because it's moving so incredibly fast. But what I do know is that what we will share with you in this one-week course is the foundations of all of the technologies we have seen up until this point, which will allow you to create that future for yourselves and to design brand new types of deep learning models using those fundamentals and foundations. So let's get started with all of that and start to figure out how we can actually achieve all of these different pieces and learn all of these different components. We should start by really tackling the foundations from the very beginning and asking ourselves about a term we've all heard; I think all of you have obviously, before you've come

to this class today, heard the term deep learning. But it's important for you to really understand how this concept of deep learning relates to all of the other pieces of science that you've learned about so far. To do that, we have to start from the very beginning and think about what intelligence is at its core, not even artificial intelligence, just intelligence. The way I like to think about it is that intelligence is the ability to process information such that it informs your future decision-making abilities. That's something that we as humans do every single day. Artificial intelligence is simply the ability for us to give computers that same ability: to process information and inform future decisions. Machine learning is simply a subset of artificial intelligence; you should think of machine learning as the science of trying to teach computers how to do that processing of information and decision-making from data. So instead of hard-coding rules into machines and programming them like we used to do in software engineering classes, we're now going to try to do that processing of information, and the informing of future decision-making abilities, directly from data. And then, going one step deeper, deep learning is simply the subset of machine learning that uses neural networks to do that: it uses neural networks to process raw, unprocessed data and allows them to ingest very large datasets and inform future decisions. That's exactly what this class is all about. If I had to summarize this class in just one line, it's about teaching machines how to process data, process information, and inform decision-making abilities from that data, learning from that data. Now, this program is split between two different parts, so you should think of this class as being made up of both technical lectures, of which this is one, and software labs. We'll have several new updates this year, as I mentioned earlier, covering the rapidly changing advances in AI, especially in some of the later lectures. The first lecture today covers the foundations of neural networks themselves, starting with the building block of every single neural network, which is called the perceptron. We'll go through the week and conclude with a series of exciting guest lectures from industry-leading sponsors of the course. And finally, on the software side, after
every lecture you’ll also get software experience and project building experience to be able to take what we teach in lectures and actually deploy them in real code and and actually produce based on the learnings that you find in this lecture and at the very end of the class from the software side you’ll have the ability to participate in a really fun day at the very end which is the project pitch competition it’s kind of like a shark tank style competition of all of the different uh projects from all of you and win some
really awesome prizes so let’s step through that a little bit briefly this is the the syllabus part of the lecture so each day we’ll have dedicated software Labs that will basically mirror all of the technical lectures that we go through just helping you reinforce your learnings and these are coupled with each day again coupled with prizes for the top performing software solutions that are coming up in the class this is going to start with today with lab one and it’s going to be on music generation
so you’re going to learn how to build a neural network that can learn from a bunch of musical songs listen to them and then learn to compose brand new songs in that same genre tomorrow lab two on computer vision you’re going to learn about facial detection systems you’ll build a facial detection system from scratch using uh convolutional neural networks you’ll learn what that means tomorrow and you’ll also learn how to actually debias remove the biases that exist in some of these facial detection systems
which is a huge problem for the state-of-the-art solutions that exist today. And finally, a brand new lab at the end of the course will focus on large language models, where you're actually going to take a multi-billion-parameter large language model and fine-tune it to build an assistive chatbot, and evaluate a set of cognitive abilities ranging from mathematical abilities to scientific reasoning to logical abilities, and so on. And at the very end there will be a final project pitch competition, up to five minutes per team, and all of these are accompanied by great prizes, so there will definitely be a lot of fun to be had throughout the week. There are many resources to help with this class; you'll see them posted here, and you don't need to write them down because all of the slides are already posted online. Please post to Piazza if you have any questions, and of course we have an amazing team helping teach this course this year; you can reach out to any of us if you have any questions, and Piazza is a great place to start. Myself and Ava will be the two main lecturers for this course, Monday through Wednesday especially, and we'll also be hearing some amazing guest lectures in the second half of the course, which you will definitely want to attend because they cover the state-of-the-art side of deep learning going on in industry, outside of academia. And very briefly, I just want to give a huge thanks to all of our sponsors, without whose support this
course, like every year, would not be possible. Okay, so now let's start with the fun stuff, and my favorite part of the course, which is the technical part. Let's start by asking ourselves a question: why do we care about all of this? Why do we care about deep learning? Why did you all come here today to learn and to listen to this course? To understand that, I think we need to go back a little bit and understand how machine learning used to be performed. Machine learning would typically define a set of features; you can think of these as a set of things to look for in an image or in a piece of data. Usually these are hand engineered, so humans would have to define them themselves, and the problem is that they tend to be very brittle in practice, just by the nature of a human defining them. So the key idea of deep learning, and what you're going to learn throughout this entire week, is this paradigm shift of moving away from hand-engineering the features and rules that computers should look for, and instead trying to learn them directly from raw pieces of data: what are the patterns we need to look at in datasets such that, if we look at those patterns, we can make interesting decisions and interesting actions can come out? For example, if we wanted to learn how to detect faces, think about how you would detect faces: if you look at a picture, what are you looking for to detect a face? You're looking for some particular patterns; you're looking
for eyes and noses and ears and when those things are all composed in a certain way you would probably deduce that that’s a face right computers do something very similar so they have to understand what are the patterns that they look for what are the eyes and noses and ears of those pieces of data and then from there actually detect and predict from them so the really interesting thing I think about deep learning is that these foundations for doing exactly what I just mentioned picking out the building blocks picking out the features from raw
pieces of data, and the underlying algorithms themselves, have existed for many, many decades. The question I would ask at this point is: why are we studying this now, and why is all of this blowing up right now and exploding with so many great advances? There are three reasons. Number one, the data available to us today is significantly more pervasive. These models are hungry for data; you're going to learn about this in more detail, but these models are extremely hungry for data, and we're living in a world right now where, quite frankly, data is more abundant than it has ever been in our history. Secondly, these algorithms are massively compute-hungry and massively parallelizable, which means they have greatly benefited from compute hardware that is also capable of being parallelized. The particular name of that hardware is the GPU: GPUs can run parallel streams of information and are particularly amenable to deep learning algorithms, and the abundance of GPUs and that compute hardware has pushed forward what we can do in deep learning. And finally, the last piece is the software: the open-source tools that are used as the foundational building blocks for deploying and building all of these underlying models that you're going to learn about in this course. Those open-source tools have become extremely streamlined, making it extremely easy for all of us to learn about these technologies within an amazing one-week course like this. So let's start now with
understanding, now that we have some of the background, exactly what the fundamental building block of a neural network is. That building block is called a perceptron. Every single neural network is built up of multiple perceptrons, and you're going to learn how those perceptrons, number one, compute information themselves, and how they connect together into much larger, billion-parameter neural networks. The key idea of a perceptron, or, even simpler, think of it as a single neuron: a neural network is composed of many, many neurons, and a perceptron is just one neuron. The idea of a perceptron is actually extremely simple, and I hope that by the end of today this idea and the processing a perceptron does become extremely clear to you. So let's start by talking about the forward propagation of information through a single neuron. Single neurons ingest information, and they can ingest multiple pieces of information; here you can see this neuron taking as input three pieces of information, x1, x2, and xm. We define the set of inputs x1 through xm, and each of these inputs, each of these numbers, is element-wise multiplied by a particular weight, denoted here by w1 through wm. There is a corresponding weight for every single input, and you should think of every weight as being assigned to that input; the weights are part of the neuron itself. You multiply all of these inputs with their weights, and then you add them up. We take that single number after the addition and pass it through what's called a nonlinear activation function to produce the final output, which here we're calling y. Now, what I just said is not entirely correct: I left out one critical piece of information, which is that we also have what you can see here called the bias term. That bias term is what allows your neuron to shift its activation function horizontally along the x-axis. On the right side you can now see a diagram illustrating mathematically the single equation I just talked through conceptually; you can see it written down as one single equation, and we can rewrite it using linear algebra, using vectors and dot products. So let's do that: our inputs are now described by a capital X, which is simply a vector of all of our inputs x1 through xm, and our weights are described by a capital W, which is w1 through wm.
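To make that concrete, here is the single perceptron equation just described, written both as a sum and in the vector form, using w0 for the bias term and g for the nonlinear activation function (this is just a restatement of the prose around it):

    \hat{y} = g\!\left( w_0 + \sum_{i=1}^{m} x_i w_i \right) = g\!\left( w_0 + X^{T} W \right),
    \quad \text{where } X = [x_1, \dots, x_m]^{T} \text{ and } W = [w_1, \dots, w_m]^{T}.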

The output is then obtained by taking the dot product of X and W (that dot product does the element-wise multiplication and then sums all of those products), adding the bias term, which here we're calling w0 (that's the missing piece), and then applying the nonlinearity, which here is denoted g. I've mentioned this nonlinearity, this activation function, a few times, so let's dig into it a little bit more and understand what it is actually doing. I said a couple of things about it: I said it's a nonlinear function. Here you can see one commonly used activation function, called the sigmoid function, on the bottom right-hand side of the screen. The sigmoid function is very commonly used because of its outputs: it takes as input any real number (the x-axis is infinite, plus or minus), but on the y-axis it squashes every input into a number between 0 and 1. So it's a very common choice for things like probability distributions, if you want to convert your answers into probabilities or teach a neuron to learn a probability distribution. In fact, there are many different types of nonlinear activation functions used in neural networks, and here are some common ones. Throughout this presentation, and throughout the entire course, you'll see these little TensorFlow icons on the bottom, which allow you to relate some of the foundational knowledge we're teaching in the lectures to the software labs, and this might provide a good starting point for a lot of the pieces you have to do later on in the software parts of the class. So the sigmoid activation, which we talked about on the last slide, is shown on the left-hand side; it's very popular because of its use for probability distributions, since it squashes everything between 0 and 1. But you can see two other very common types of activation functions in the middle and on the right-hand side as well. The other very common one, probably the most popular activation function now, is on the far right-hand side: it's called the ReLU activation function, or rectified linear unit. It's basically linear everywhere except for a nonlinearity at x equals zero, a kind of step or break in the slope. The benefit is that it's very easy to compute, it still has the nonlinearity that we need (we'll talk about why we need it in one second), and it's very fast: just two linear functions combined piecewise.
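As a minimal sketch of the two activation functions just described, assuming TensorFlow (consistent with the TensorFlow icons mentioned above; the example values here are made up):

    import tensorflow as tf

    z = tf.constant([-6.0, 0.0, 2.5])   # example pre-activation values

    # sigmoid: squashes any real input into a number between 0 and 1
    y_sigmoid = tf.math.sigmoid(z)      # 1 / (1 + exp(-z))

    # ReLU (rectified linear unit): zero for z < 0, linear for z > 0,
    # with its single kink at z = 0
    y_relu = tf.nn.relu(z)              # max(0, z)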
Okay, so now let's talk about why we need a nonlinearity in the first place. Why not just pass all of these inputs through a linear function? The point of the activation function is to introduce nonlinearity in and of itself. What we want is to allow our neural network to deal with nonlinear data, because the world is extremely nonlinear; if you think of real-world datasets, that's just the way they are. If you look at a dataset like this one, green and red points, and I ask you to build a neural network that can separate the green and the red points, we actually need a nonlinear function to do that; we cannot solve this problem with a single line. In fact, if we used linear functions as our activation function, then no matter how big the neural network is, it's still a linear function, because linear functions composed with linear functions are still linear. So no matter how deep or how many parameters your neural network has, the best it could do to separate these green and red points would look like this. But adding nonlinearities allows our neural networks to be smaller by allowing them to be more expressive and capture more complexity in the data, and this makes them much more powerful in the end. So let's understand this with a simple example. Imagine I give you a trained neural network. What does "trained" mean? It means I'm now giving you the weights: not only the inputs, but I'm going to tell you what the weights of this neural network are. So here let's say the bias term w0 is going to be 1, and our weight vector W is going to be [3, -2]. These are
just the weights of your trained neural network; let's worry about how we got those weights in a second. This network has two inputs, x1 and x2. Now, if we want to get the output of this neural network, all we have to do is the same story we talked about before: dot product the inputs with the weights, add the bias, and apply the nonlinearity. Those are the three components you really have to remember as part of this class: dot product, add the bias, apply a nonlinearity. That's the process that keeps repeating over and over again for every single neuron, and after it happens, that neuron is going to output a single number. Now let's take a look at what's inside the nonlinearity: it's simply a weighted combination of those inputs with those weights. If we look at what's inside of g, it's a weighted combination of X and W, added with a bias, and that produces a single number. But in reality, for any input that this model could see, what this really defines is a two-dimensional line, because we have two weights in this model. So we can actually plot that line and see exactly how this neuron separates points on these axes, x1 and x2, the two inputs of this model. We can interpret exactly what this neuron is doing; we can visualize its entire space because we can plot the line that defines it. Here we're plotting where that line equals zero. And in fact, if I give that neuron a new data point, say x1 = -1 and x2 = 2, just an arbitrary point in this two-dimensional space, we can plot that point, and depending on which side of the line it falls on, it tells us what the sign of the answer is going to be, and also what the answer itself is going to be. If we follow the equation written at the top here and plug in -1 and 2, we get 1 - 3 - 4, which equals -6, and when I put that into my nonlinearity g, I get a final output of about 0.002.
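Written out, that computation for the example point is:

    z = w_0 + w_1 x_1 + w_2 x_2 = 1 + (3)(-1) + (-2)(2) = -6,
    \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{6}} \approx 0.002.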
Don't worry too much about the exact number; that's just the output of the sigmoid function. The important point to remember is that the sigmoid actually divides the space into two parts: it squashes everything between 0 and 1, but it implicitly divides it into outputs less than 0.5 and outputs greater than 0.5, depending on whether z, the input to the sigmoid, is less than or greater than zero, in other words on which side of the line you fall; remember, the line is where z equals zero. If you fall on the left side of the line, your output will be less than 0.5, because you're on the negative side of the line; if your input is on the right side of the line, your output will be greater than 0.5. So here we can actually visualize this space, which is called the feature space of the neural network, and we can visualize it in its entirety. We can fully visualize and interpret this neural network; we can understand exactly what it will do for any input that it sees. But of course, this is a very simple neuron. It's not a neural network, it's just one neuron, and even more than that, it's a very simple neuron with only two inputs. In reality, the types of neurons and neural networks you're going to be dealing
with in this course are going to be neurons and neural networks with millions or even billions of these parameters right of these inputs right so here we only have two weights W1 W2 but today’s neural networks have billions of these parameters so drawing these types of plots that you see here obviously becomes a lot more challenging it’s actually not possible but now that we have some of the intuition behind a perceptron let’s start now by building neural networks and seeing how all of this comes
together so let’s revisit that previous diagram of a perceptron now again if there’s only one thing to take away from this lecture right now it’s to remember how a perceptron works that equation of a perceptron is extremely important for every single class that comes after today and there’s only three steps it’s dot product with the inputs add a bias and apply your nonlinearity let’s simplify the diagram a little bit I’ll remove the weight labels from this picture and now you can
assume that if I show a line, every single line has an associated weight that comes with it. I'll also remove the bias term for simplicity; assume that every neuron has that bias term, so I don't need to show it. And now note the result here, which we're calling z: it's just the dot product plus the bias, before the nonlinearity, and it is linear, just a weighted sum of all those pieces, because we have not applied the nonlinearity yet. Our final output is simply g(z), the nonlinear activation function applied to z. Now, if we want to step this up a little bit and have a multi-output function, so not just one output but, say, two outputs, we can just have two neurons in this network. Every neuron sees all of the inputs that came before it, but now the top neuron is going to predict one answer and the bottom neuron will predict its own answer. Importantly, one thing you should really notice here is that each neuron has its own weights, its own lines coming into just that neuron, so they act independently, but they can later communicate if you have another layer. So let's now take this process a step further and think about it more programmatically: what if we wanted to program this neural network ourselves from scratch? Remember the equation I told you; it didn't sound very complex.
it’s take a DOT product add a bias which is a single number and apply nonlinearity let’s see how we would actually Implement something like that so to to define the layer right we’re now going to call this a layer uh which is a collection of neurons right we have to first Define how that information propagates through the network so we can do that by creating a call function here first we’re going to actually Define the weights for that Network right so remember every Network every neuron I should say every neuron has weights and
a bias right so let’s define those first we’re going to create the call function to actually see how we can pass information through that layer right so this is going to take us input and inputs right this is like what we previously called X and it’s the same story that we’ve been seeing this whole class right we’re going to Matrix multiply or take a DOT product of our inputs with our weights we’re going to add a bias and then we’re going to apply a nonlinearity it’s really that simple right we’ve now
created a single layer neural network right so this this line in particular this is the part that allows us to be a powerful neural network maintaining that nonlinearity and the important thing here is to note that modern deep learning toolboxes and libraries already Implement a lot of these for you right so it’s important for you to understand the foundations but in practice all of that layer architecture and all that layer logic is actually implemented in tools like tensorflow and P torch through a dense
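A rough sketch of that from-scratch layer, assuming TensorFlow; the class name, the choice of sigmoid as the nonlinearity, and the initializers here are illustrative, not the exact lab code:

    import tensorflow as tf

    class MyDenseLayer(tf.keras.layers.Layer):
        def __init__(self, input_dim, output_dim):
            super().__init__()
            # every neuron in this layer gets its own weights and its own bias
            self.W = self.add_weight(shape=(input_dim, output_dim),
                                     initializer="random_normal")
            self.b = self.add_weight(shape=(1, output_dim),
                                     initializer="zeros")

        def call(self, inputs):
            # forward propagation: dot product, add bias, apply the nonlinearity
            z = tf.matmul(inputs, self.W) + self.b
            return tf.math.sigmoid(z)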
Here you can see an example of creating and initializing a Dense layer with two neurons, allowing it to be fed an arbitrary set of inputs; here we're seeing these two neurons in a layer being fed three inputs. In code this reduces to one line of TensorFlow, making it extremely easy and convenient for us to use these functions and call them. So now let's look at a single-hidden-layer neural network, where we now have one layer between our inputs and our outputs.
right so we’re slowly and progressively increasing the complexity of our neural network so that we can build up all of these building blocks right this layer in the middle is called a hidden layer right obviously because you don’t directly observe it you don’t directly supervise it right you do observe the two input and output layers but your hidden layer is just kind of a uh a neuron neuron layer that you don’t directly observe right it just gives your network more capacity more learning complexity and since we now have a
transformation function from inputs to Hidden layers and hidden layers to Output we now have a two- layered neural network right which means that we also have two weight matrices right we don’t have just the W1 which we previously had to create this hidden layer but now we also have W2 which does the transformation from hidden layer to Output layer yes what happens nonlinearity in Hidden you have just linear so there’s no it’s not is it a perceptron or not yes so every hidden layer also has an nonlinearity
accompanied with it right and that’s a very important point because if you don’t have that perceptron then it’s just a very large linear function followed by a final nonlinearity at the very end right so you need that cascading and uh you know overlapping application of nonlinearities that occur throughout the network awesome okay so now let’s zoom in look at a single unit in the hidden layer take this one for example let’s call it Z2 right it’s the second neuron in the first layer right it’s the same
perception that we saw before we compute its answer by taking a DOT product of its weights with its inputs adding a bias and then applying a nonlinearity if we took a different hidden nodee like Z3 the one right below it we would compute its answer exactly the same way that we computed Z2 except its weights would be different than the weights of Z2 everything else stays exactly the same it sees the same inputs but of course you know I’m not going to actually show Z3 in this picture and now this picture is getting a little bit messy so let’s
clean things up a little bit more. I'm going to remove all the lines and replace them with these boxes, these symbols, which denote what we call a fully connected layer. These layers denote that everything in the input is connected to everything in the output, and the transformation is exactly as we saw before: dot product, bias, and nonlinearity. Again, in code this is extremely straightforward with the foundations we've built up from the beginning of the class: we can just define two of these Dense layers, our hidden layer on line one with n hidden units, and then our output layer with two output units. (Audience question: does that mean the nonlinearity function must be the same between layers?) The nonlinearity function does not need to be the same in each layer. Often it is, for convenience, but there are cases where you would want it to be different; especially in lecture two you're going to see nonlinearities that differ even within the same layer, let alone across different layers. But unless you have a particular reason, the general convention is that there's no need to make them different. Now let's keep expanding our knowledge a little bit more. If we want to make a deep neural network, not just the neural network we saw on the previous slide, all that "deep" means is that we stack these layers on top of each other, one by one, more and more, creating a hierarchical model, one where the final output is computed by going deeper and deeper into the neural network. And doing this in code follows the exact same story as before: just cascade these TensorFlow layers on top of each other and go deeper into the network. Okay, so this is great, because now we have at least a solid foundational understanding of how to define not only a single neuron but an entire neural network, and at this point you should be able to explain, or at least understand, how information goes from the input through an entire neural network to compute an output.
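Putting those pieces together, here is a minimal sketch of the single-hidden-layer network and a deeper stack using the built-in Dense layer (n_hidden and the layer counts are placeholder choices, not values from the lecture):

    import tensorflow as tf

    n_hidden = 3  # placeholder number of hidden units

    # single-hidden-layer network: inputs -> hidden layer -> 2 outputs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(2),
    ])

    # a deep network simply cascades more layers on top of each other
    deep_model = tf.keras.Sequential([
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(2),
    ])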
let’s look at how we can apply these neural networks to solve a very real problem that uh I’m sure all of you care about so here’s a problem on how we want to build an AI system to learn to answer the following question which is will I pass this class right I’m sure all of you are really worried about this question um so to do this let’s start with a simple input feature model the feature the two features that let’s concern ourselves with are going to be number one how many lectures you attend
and number two how many hours you spend on your final project so let’s look at some of the past years of this class right we can actually observe how different people have uh lived in this space right between how many lectures and how much time You’ spent on your final project and you can actually see every point is a person the color of that point is going to be if they passed or failed the class and you can see and visualize kind of this V this feature space if you will that we talked about before and then we
have you. You fall right here: you're the point (4, 5) in this feature space; you've attended four lectures and you will spend five hours on the final project. You want to build a neural network to determine, given everyone else in the class that it has seen from all of the previous years, what your likelihood is of passing or failing this class. So let's do it; we now have all of the building blocks to solve this problem using a neural network. We have two inputs, the number of lectures you attend and the number of hours you spend on your final project, four and five. We can pass those two inputs to our x1 and x2 variables, which are fed into a single-hidden-layer neural network with three hidden units in the middle, and we can see that the final predicted probability of you passing this class is 0.1, or 10 percent. A very bleak outcome, and not a good one, because the actual probability is 1: attending four out of the five lectures and spending five hours on your final project, you actually live in a part of the feature space that was very positive; it looked like you were going to pass the class. So what happened here? Anyone have any ideas why the neural network got this so terribly wrong? It's not trained, exactly: this neural network is not trained; we haven't shown it any of that data, the green and red data. So you
should really think of neural networks like babies right before they see data they haven’t learned anything there’s no expectation that we should have for them to be able to solve any of these types of problems before we teach them something about the world so let’s teach this neural network something about uh the problem first right and to train it we first need to tell our neural network when it’s making bad decisions right so we need to teach it right really train it to learn exactly like how we as
humans learn, in some ways. We have to inform the neural network when it gets the answer incorrect, so that it can learn how to get the answer correct. So we need a measure of how close the answer is to the ground truth. For example, the actual value for you passing this class was probability 1, 100 percent, but the network predicted a probability of 0.1. We compute what's called a loss: the closer these two things are together, the smaller your loss should be and the more accurate your model should be. Now let's assume we have data not just from one student but from many students; many students have taken this class before, and we can plug all of them into the neural network and show them all to this system. We care not only about how the neural network did on this one prediction, but about how it predicted on all of the different people it has been shown during this training and learning process. So when training the neural network, we want to find a network that minimizes the empirical loss between our predictions and those ground-truth outputs, and we do this on average across all of the different inputs the model has seen. If we look at this problem of binary classification, between yes and no, will I pass the class or will I not, it's a zero-or-one
probability, and we can use what is called the softmax cross entropy function to tell whether this network is getting the answer correct or incorrect. Think of the cross entropy function as an objective function, a loss function that tells our neural network how far apart these two probability distributions are; the output is a probability distribution, and we're trying to determine how bad the neural network's answer is so that we can give it feedback to get a better answer. Now let's suppose that instead of predicting a binary output we want to predict a real-valued output, any number from minus to plus infinity. For example, if you wanted to predict the grade that you get in a class, it doesn't necessarily need to be between 0 and 1, or even between 0 and 100. You would now use a different loss to produce that value, because the outputs are no longer a probability distribution. For example, what you might do here is compute a mean squared error loss between your true grade in the class and the predicted grade. These are two numbers, not necessarily probabilities; you compute their difference, square it to get an absolute distance between the two (the sign doesn't matter), and then you can minimize that quantity.
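As a hedged sketch of the two losses just described, using standard Keras loss objects (the lecture names softmax cross entropy; for a single pass/fail probability, the binary cross-entropy variant shown here is the closest standard choice, and the numbers are made-up examples):

    import tensorflow as tf

    # binary classification ("will I pass the class?"): cross entropy between
    # the ground-truth labels and the predicted probabilities
    y_true = tf.constant([[1.0], [0.0]])   # pass / fail ground truth
    y_pred = tf.constant([[0.1], [0.8]])   # predicted probability of passing
    cross_entropy = tf.keras.losses.BinaryCrossentropy()(y_true, y_pred)

    # real-valued prediction (e.g. the final grade): mean squared error
    grade_true = tf.constant([[92.0], [71.0]])
    grade_pred = tf.constant([[85.0], [78.0]])
    mse = tf.keras.losses.MeanSquaredError()(grade_true, grade_pred)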
Okay, great. So let's put all of this loss information, together with this problem of finding our network, into a unified problem and a unified solution to actually train our neural network. We know we want to find a neural network that will solve this problem on all of this data, on average; that's how we contextualized the problem earlier in the lecture. This effectively means we're trying to find the weights of our neural network: what is this big vector W that we talked about earlier? Compute this vector W for me, based on all of the data that we have seen. Now, the vector W also determines the loss: given a single vector W, we can compute how badly this neural network performs on our data, that is, what the loss is, what the deviation is from the ground truth of where our network should be. Remember that W is just a group of numbers, a very big list of weights, one for every single layer and every single neuron in our neural network; it's just a very big list, or vector, of weights. We want to find that vector based on a lot of data; that is the problem of training a neural network. And remember, our loss function is just a function of our weights. If we have only two weights in our neural network, like we saw earlier on the slide, then we can plot the loss landscape over this two-dimensional space: we have two weights, w1 and w2, and for every configuration or setting of those two weights, our loss has a particular value, which here we're
showing is the height of this graph right so for any W1 and W2 what is the loss and what we want to do is find the lowest point what is the best loss where what are the weights such that our loss will be as good as possible so the smaller the loss the better so we want to find the lowest point in this graph now how do we do that right so the way this works is we start somewhere in this space we don’t know where to start so let’s pick a random place to start right now from that place let’s compute
What’s called the gradient of the landscape at that particular point this is a very local estimate of where is going up basically where where is the slope increasing at my current location right that informs us not only where the slope is increasing but more importantly where the slope is decreasing if I negate the direction if I go in the opposite direction I can actually step down into the landscape and change my weights such that I lower my loss so let’s take a small step just a small step in the opposite direction of
the part that’s going up let’s take a small step going down and we’ll keep repeating this process we’ll compute a new gradient at that new point and then we’ll take another small step and we’ll keep doing this over and over and over again until we converge at what’s called a local minimum right so based on where we started it may not be a global minimum of everywhere in this lost landscape but let’s find ourselves now in a local minimum and we’re guaranteed to actually converge by following this
very simple algorithm at a local minimum so let’s summarize now this algorithm this algorithm is called gradient descent let’s summarize it first in pseudo code and then we’ll look at it in actual code in a second so there’s a few steps first step is we initialize our location somewhere randomly in this weight space right we compute the gradient of of our loss at with respect to our weights okay and then we take a small step in the opposite direction and we keep repeating this in a loop over and over and over
again and we say we keep we keep doing this until convergence right until we stop moving basically and our Network basically finds where it’s supposed to end up we’ll talk about this this uh this small step right so we’re multiplying our gradient by what I keep calling is a small step we’ll talk about that a bit more about a bit more in later part of this this lecture but for now let’s also very quickly show the analogous part in in code as well and it mirrors very nicely right so we’ll
randomly initialize our weight this happens every time you train a neural network you have to randomly initialize the weights and then you have a loop right here showing it without even convergence right we’re just going to keep looping forever where we say okay we’re going to compute the loss at that location compute the gradient so which way is up and then we just negate that gradient multiply it by some what’s called learning rate LR denoted here it’s a small step and then we take a direction in that small
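A minimal sketch of that loop in TensorFlow, using a stand-in loss function purely for illustration (in a real setting the loss would come from your network and data):

    import tensorflow as tf

    # stand-in loss just for illustration: L(w) = sum(w^2), minimized at w = 0
    def compute_loss(w):
        return tf.reduce_sum(w ** 2)

    weights = tf.Variable(tf.random.normal(shape=(2,)))  # random initialization
    lr = 0.01                                            # learning rate: the "small step"

    for _ in range(1000):                                # loop (until convergence in practice)
        with tf.GradientTape() as tape:
            loss = compute_loss(weights)
        grad = tape.gradient(loss, weights)              # which way is "uphill" in the loss landscape
        weights.assign_sub(lr * grad)                    # step in the opposite (downhill) direction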
step so let’s take a deeper look at this term here this is called the gradient right this tells us which way is up in that landscape and this again it tells us even more than that it tells us how is our landscape how is our loss changing as a function of all of our weights but I actually have not told you how to compute this so let’s talk about that process that process is called back propagation we’ll go through this very very briefly and we’ll start with the simplest neural network uh that’s
possible right so we already saw the simplest building block which is a single neuron now let’s build the simplest neural network which is just a one neuron neural network right so it has one hidden neuron it goes from input to Hidden neuron to output and we want to compute the gradient of our loss with respect to this weight W2 okay so I’m highlighting it here so we have two weights let’s compute the gradient first with respect to W2 and that tells us how much does a small change in w 2 affect our loss does our loss go up or down if
we move w2 a little bit in one direction or another? So let's write out this derivative. We can start by applying the chain rule backwards from the loss through the output; specifically, we can decompose this derivative, this gradient, into two parts: we decompose dJ/dw2 into dJ/dy, the derivative with respect to our output, multiplied by dy/dw2. This is just the chain rule from calculus, and it's possible because y depends only on the previous layer. Now suppose we don't want to do this for w2 but for w1. We can use the exact same process, but now it's one step further: we replace w2 with w1, and we need to apply the chain rule yet again, once more, to decompose the problem further, propagating the gradient we computed for w2 back one more step to the weight we're interested in, which in this case is w1.
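Written out, the two chain-rule decompositions just described are (with J the loss, \hat{y} the output, and z_1 here denoting the hidden unit that sits between w_1 and the output):

    \frac{\partial J}{\partial w_2} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2},
    \qquad
    \frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}.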
We keep repeating this process over and over again, propagating these gradients backwards from output to input, to compute what we ultimately want: the derivative of the loss with respect to every weight in the neural network. This tells us how much a small change in every single weight in our network affects the loss: does the loss go up or down if we change this weight a little bit in this direction or that direction? (Audience question: you've used both the terms neuron and perceptron; is there a functional difference?) Neuron and perceptron are the same. Typically people say "neural network," which is why "neuron" for a single unit has also gained popularity, but "perceptron" is the original, formal term; the two terms are identical. Okay, so now we've covered a lot. We've covered the forward propagation of information through a neuron and through a neural network all the way through, and we've covered the backpropagation of information to understand how we should change every single one of those weights in our neural network to improve our loss. That was the backpropagation algorithm. In theory it's actually pretty simple; it's just the chain rule, nothing more than that. And the nice part is that deep learning libraries actually do this for you: they compute backprop for you, so you don't have to implement it yourself, which is very convenient. But even though the theory is not that complicated, it's important to touch on backpropagation
from the practical side now, thinking a little bit toward your own implementations: when you want to implement these neural networks, what are some insights? Optimization of neural networks in practice is a completely different story: it's not straightforward at all, and in practice it's very difficult and usually very computationally intensive to run this backpropagation algorithm. Here's an illustration from a paper that came out a few years ago that attempted to visualize the loss landscape of a very deep neural network. Previously we had that other depiction of how a neural network's landscape might look in two dimensions, but real neural networks are not two-dimensional; they have hundreds or millions or billions of dimensions. What would those loss landscapes look like? You can try some clever techniques to visualize them; this is one paper that attempted to do that, and it turns out they look extremely messy. The important thing is that if you run this algorithm and start in a bad place, then depending on your neural network you may not actually end up at the global solution. Your initialization matters a lot, and you need to traverse these local minima to try to find the global minimum; or, even more than that, you need to construct neural networks whose loss landscapes are much more amenable to optimization than this one. This is a very bad loss landscape, and there are techniques we can apply to our neural networks that smooth out the loss landscape and make it easier to
optimize. So recall the update equation we talked about earlier with gradient descent. There is a parameter in it we didn't discuss: we described it as the little step that you take, a small number that multiplies the direction, which is your gradient. It says: I'm not going to go all the way in this direction, I'll just take a small step. In practice, even setting this value, just one number, can be rather difficult. If we set the learning rate too small, the model can get stuck in local minima: here it starts and gets stuck in this local minimum, converging very slowly even if it doesn't get stuck entirely. If the learning rate is too large, it can overshoot, and in practice it may even diverge and explode, never finding any minimum. Ideally, we want learning rates that are not too small and not too large: large enough to avoid those local minima, but small enough that they won't diverge and will still find their way to the global minimum. Something like this is what you should intuitively have in mind: something that can overshoot the local minima but finds its way into a better minimum and then finally stabilizes there. So how do we actually set these learning rates in practice? What does that process look like? Idea number one is very basic: try a bunch of
different learning rates and see what works, and that's actually not a bad process; in practice it's one of the approaches people use. But let's see if we can do something smarter than that: can we design algorithms that adapt to the landscape? In practice there's no reason the learning rate should be a single fixed number; can we have learning rates that adapt to the model, to the data, to the landscape, to the gradients it is seeing? This means the learning rate may actually increase or decrease as a function of the gradients in the loss function, of how fast we're learning, or of many other signals. There are many different ideas that could be used here, and in fact there are many widely used procedures or methodologies for setting the learning rate. During your labs we encourage you to try out some of these different learning-rate schemes and to play around with the effect of increasing or decreasing your learning rate; you'll see very striking differences. (Audience question: since it's on a closed interval, why not just test every value and find the absolute minimum?) A few things. Number one, it's not a closed space: every weight can range from minus to plus infinity, so even a one-dimensional neural network with just one weight does not live in a closed space. In practice it's even worse than that, because you have billions of dimensions; not only is the space in a single dimension infinite, you now have billions of infinite dimensions, billions of infinite support spaces. So it's not something where you can just enumerate every possible weight configuration the neural network could take and test them all; that's not practical to do even for a very small neural network in
practice. So in your labs you can really try to put all of the information in this picture into practice. This defines your model, number one, right here; it defines your optimizer, which previously we denoted as this gradient descent optimizer and which here we're calling SGD, stochastic gradient descent (we'll talk about that more in a second). Also note that your optimizer, here SGD, could be any of these adaptive optimizers: you can swap them out, and you should swap them out and test different choices here to see the impact of these different methods on your training procedure. You'll gain very valuable intuition for the different insights that come with that as well.
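A small sketch of what swapping the optimizer looks like in TensorFlow (the learning-rate values here are arbitrary examples):

    import tensorflow as tf

    # plain (stochastic) gradient descent with a fixed learning rate
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    # adaptive alternatives that adjust the effective step size during training;
    # try swapping these in during the labs and compare their behavior
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)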
So, to finish this lecture before we move on, I want to talk briefly about tips for training neural networks in practice, focusing on this powerful idea of batching data: not seeing all of your data at once, but batching it. To do this, let's very briefly revisit the gradient descent algorithm. The gradient computation, the backpropagation algorithm I mentioned earlier, is a very computationally expensive operation, and it's even worse because we previously described it in a way where we would have to compute it as a summation over every single data point in our entire dataset. That's how we defined it with the loss function: it's an average over all of our data points, which means we're summing the gradients over all of our data points. In most real-life problems this would be completely infeasible, because our datasets are simply too big, and the models are too big, to compute those gradients on every single iteration. And remember, this isn't a one-time thing: it's every single step; you keep taking small steps, so you keep needing to repeat this process. So instead, let's define a new gradient descent algorithm called SGD, stochastic gradient descent:
So instead, let's define a new version of gradient descent called SGD, stochastic gradient descent. Instead of computing the gradient over the entire dataset, we pick a single training point and compute the gradient on that one point. The nice thing is that this gradient is much easier to compute, since it only needs one point; the downside is that it's very noisy, very stochastic, because it was computed from just that one example. So there's a tradeoff. What's the middle ground? The middle ground is to take not one data point and not the full dataset, but a batch of data, what's called a mini-batch. In practice this could be something like 32 pieces of data, which is a common batch size, and it gives us an estimate of the true gradient: we approximate the gradient by averaging the gradients of those 32 samples. It's still fast, because 32 is much smaller than the size of your entire dataset, and while the estimate is still noisy, that's usually fine in practice because you can iterate much faster. Since the batch size B is normally not that large, think something in the tens or hundreds of samples, it's very quick to compute compared to regular gradient descent, and it's also much more accurate than single-sample stochastic gradient descent. That increase in the accuracy of the gradient estimate lets us converge to a solution significantly faster: it's not only about the per-step speed, the more accurate gradients get us to the solution in fewer steps, which ultimately means we can train much faster and save compute.
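Here is the mini-batch estimate written out, again in my own notation: the true gradient over all n points is approximated by an average over a randomly drawn batch of size B (for example B = 32).

```latex
% Mini-batch estimate of the gradient: average over B randomly sampled points
% instead of all n. B = 1 recovers single-sample SGD; B = n recovers full batch.
\nabla_W J(W) \;\approx\; \frac{1}{B} \sum_{i \in \mathcal{B}}
    \nabla_W \, \mathcal{L}\big(f(x^{(i)}; W),\, y^{(i)}\big),
\qquad |\mathcal{B}| = B \ll n
```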
The other really nice thing about mini-batches is that they allow us to parallelize the computation, a concept we touched on earlier in the class, and here is where it comes in. We can split a batch up: if our batch size is 32, those 32 pieces of data can be sent to different workers, so different parts of the GPU tackle different parts of the data. This lets us achieve even more significant speedups using GPU architectures and GPU hardware.
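To tie the last few ideas together, here is a minimal, self-contained mini-batch SGD loop in plain NumPy, fitting a toy linear model. Everything here (the synthetic data, the batch size of 32, the learning rate, the number of epochs) is an illustrative assumption, not the lab's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y = 3x + 1 plus a little noise (purely illustrative).
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0           # model parameters
lr, batch_size = 0.1, 32  # learning rate and mini-batch size

for epoch in range(10):
    perm = rng.permutation(len(X))            # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # indices of this mini-batch
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        # Gradients of mean squared error, averaged over the mini-batch only.
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

print(f"learned w = {w:.2f}, b = {b:.2f}")    # should land close to 3 and 1
```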
Okay, the final topic I want to talk about before we end this lecture and move on to lecture number two is overfitting. Overfitting is not a deep-learning-specific problem at all; it exists across all of machine learning. The key question it addresses is whether your model is actually capturing the true structure of your data, or whether it is just learning subtle details that are spuriously correlated with your training set. Said differently: we want to build models that learn representations from the training data that still generalize to brand-new, unseen test points. That's the real goal. We want to teach the model something based on a lot of training data, but we don't just want it to do well on that training data; we want it to do well when we deploy it into the real world and it sees things it never saw during training. The concept of overfitting addresses exactly that problem: if your model is doing very well on the training data but badly at test time, it is overfitting to the training data it saw. On the other hand there is also underfitting: on the left-hand side of the figure you can see a model that doesn't fit the data enough, which means you'll get similar performance on the training and test distributions, but both will underperform the actual capabilities of your system. Ideally you want to end up somewhere in the middle, not so complex that you memorize all the nuances of your training data, like on the right, but still performing well on brand-new data, so you're not underfitting either.
To actually address this problem in neural networks, and in machine learning in general, there are a few different techniques you should be aware of, because you'll need to apply them as part of your solutions in the software labs as well. The key concept here is called regularization. Said very simply, regularization is any technique you introduce to discourage your model from learning those nuances of the training data. That's all it is, and as we've seen, it's critical for our models to generalize: what we really care about is not the training data but the test data. The most popular regularization technique for you to understand is a very simple idea called Dropout. Let's revisit the picture of a deep neural network we've been seeing all lecture. With Dropout, during training we randomly set some of the activations, the outputs of the neurons, to zero with some probability. Say that probability is 50%: then for every activation in the network, with probability 50%, before passing that activation on to the next layer we set it to zero and pass on nothing. Effectively, 50% of the neurons are shut off in that forward pass, and information flows only through the other 50%. This idea is extremely powerful, because it not only lowers the capacity of the neural network, it lowers it dynamically: on the next iteration we pick a different random 50% of neurons to drop. The network constantly has to build different pathways from input to output, and it can't rely too heavily on any small set of features in any part of the training data, because it's constantly being forced to find different pathways under these random probabilities. So that's Dropout.
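Here is a minimal sketch of the mechanics, applying a random dropout mask to a vector of activations in NumPy. The 50% rate follows the example in the lecture; the rescaling by 1/(1 - p), so-called inverted dropout, is a standard detail of common implementations that the lecture doesn't spell out, and libraries such as Keras and PyTorch provide this as a built-in Dropout layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Randomly zero each activation with probability p during training.

    Surviving activations are scaled by 1 / (1 - p) so that the expected
    value of the layer's output matches test time, when dropout is off.
    """
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p   # True = keep this neuron
    return activations * mask / (1.0 - p)

a = rng.normal(size=8)      # pretend these are one layer's activations
print(dropout(a, p=0.5))    # roughly half the entries will be zero
```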
The second regularization technique is a notion called early stopping, which is model-agnostic: you can apply it to any type of model as long as you have a held-out test set to evaluate on. We already have a pretty concrete working definition of what it means to overfit: overfitting is when the model starts to perform worse on the test set. That's really all it is. So what if we plot, over the course of training (the x-axis is training time), the loss on both the training set and the test set? In the beginning both losses go down and keep going down, which is excellent because it means the model is getting stronger. Eventually, though, you'll notice that the test loss plateaus and starts to increase. The training loss, on the other hand, has no reason to ever stop going down; training losses generally keep decaying as long as the network has the capacity to keep fitting those differences, and that continues for the rest of training. The important point is the one in the middle of the plot: that's where we need to stop training. It's the happy medium, because beyond that point we start to overfit, the training accuracy keeps improving while the test accuracy gets worse, which is exactly overfitting. To the left of that point we have the opposite problem: we haven't fully used the capacity of the model, and the test accuracy can still improve. This is a very powerful idea, and it's extremely easy to implement in practice, because all you really have to do is monitor the loss over the course of training and pick the model from the point just before the test performance starts to get worse.
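Here is a minimal sketch of early stopping as a plain Python loop with a patience counter. The helpers train_one_epoch and evaluate are hypothetical placeholders for whatever your training and evaluation code looks like, and the patience of 5 is an arbitrary illustrative choice; Keras users get essentially the same behavior from its built-in EarlyStopping callback.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate,
                            max_epochs=100, patience=5):
    """Stop training once the held-out loss stops improving.

    `train_one_epoch(model)` and `evaluate(model)` are assumed to exist:
    the first runs one epoch of training, the second returns a scalar
    loss on held-out data. Both are placeholders, not a real API.
    """
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)

        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)  # remember the best checkpoint
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # held-out loss stopped improving

    return best_model
```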
I'll conclude this lecture by summarizing the three key points we've covered so far. This is a very jam-packed class, the entire week is going to be like this, and today is just the start. So far we've learned the fundamental building blocks of neural networks, starting from a single neuron, also called a perceptron; we've seen how we can stack these units on top of each other to create hierarchical networks, and how we can mathematically optimize those systems; and finally, in the very last part of the class, we covered tips and techniques for actually training and applying these systems in practice. In the next lecture you'll hear from Ava on deep sequence modeling using RNNs, as well as a really new and exciting type of model called the Transformer, which is built on the principle of attention.
