CS229 Lecture Notes
Andrew Ng (slightly updated by TM on June 28, 2019)

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of houses from Portland, Oregon:

    Living area (feet^2)    Price (1000$s)
    2104                    400
    1600                    330
    3000                    540
    ...                     ...

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

To establish notation for future use, we'll use $x^{(i)}$ to denote the "input" variables (living area in this example), also called input features, and $y^{(i)}$ to denote the "output" or target variable that we are trying to predict (price). A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we'll be using to learn, a list of $n$ training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, n\}$, is called a training set. Note that the superscript "$(i)$" in the notation is simply an index into the training set and has nothing to do with exponentiation. We will also use $X$ to denote the space of input values and $Y$ the space of output values; in this example $X = Y = \mathbb{R}$. Our goal is, given a training set, to learn a function $h : X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$. This function $h$ is called a hypothesis. When the target variable is continuous, as in the housing example, we call the learning problem a regression problem; when $y$ can take on only a small number of discrete values, we call it a classification problem.

To perform supervised learning on the housing data, we represent the hypothesis as a linear function of $x$, with the living area $x_1$ and the number of bedrooms $x_2$ as input features. With the convention that we include the intercept term $x_0 = 1$, we can write $h_\theta(x) = \theta^T x$. Given a training set, how do we pick, or learn, the parameters $\theta$? One reasonable method seems to be to make $h(x)$ close to $y$, at least for the training examples we have. We therefore define the cost function

    J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,

and choose $\theta$ to minimize $J(\theta)$. To do so, let's use a search algorithm that starts with some initial guess for $\theta$ and repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimizes $J(\theta)$. Specifically, consider gradient descent, which repeatedly takes a step in the direction of steepest decrease of $J$:

    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta).

(We use the notation "$a := b$" to denote an operation, in a computer program, in which we set the value of a variable $a$ to be equal to the value of $b$. In contrast, we will write "$a = b$" when we are asserting a statement of fact.) For a single training example, working out the partial derivative gives the update rule

    \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.

This is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. The rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $y^{(i)} - h_\theta(x^{(i)})$: if we are encountering a training example on which our prediction nearly matches the actual value of $y^{(i)}$, then there is little need to change the parameters; in contrast, a larger change to the parameters will be made when the prediction has a large error.

There are two ways to modify this rule for a training set of more than one example. The first is batch gradient descent, which on every step sums the errors over the entire training set (the summed quantity is just $\partial J(\theta)/\partial \theta_j$ for the original definition of $J$); it must look at the entire training set before taking a single step, a costly operation if $n$ is large. The second is stochastic gradient descent, which updates the parameters using only a single training example at a time. Often, stochastic gradient descent gets $\theta$ "close" to the minimum much faster than batch gradient descent, so when the training set is large, stochastic gradient descent is often preferred. Note also that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optimum; $J$ is a convex quadratic function, so it is possible to ensure that the parameters will converge to the global minimum rather than a merely local one (assuming the learning rate $\alpha$ is not too large).
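As an illustrative aside (a sketch, not part of the original notes), here are the two variants of the LMS rule in numpy. The function names, the learning rate alpha, and the iteration counts are arbitrary choices for the example; X is assumed to be an n-by-(d+1) design matrix whose first column is the intercept term, and y a length-n vector.

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
        """LMS updates computed over the entire training set on every step."""
        theta = np.zeros(X.shape[1])
        for _ in range(num_iters):
            grad = X.T @ (X @ theta - y)   # gradient of J(theta)
            theta -= alpha * grad
        return theta

    def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
        """LMS updates using one training example at a time."""
        theta = np.zeros(X.shape[1])
        for _ in range(num_epochs):
            for i in range(X.shape[0]):
                error = y[i] - X[i] @ theta
                theta += alpha * error * X[i]
        return theta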
Gradient descent is one way of minimizing $J$. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm: we write $J$ in matrix-vector notation and minimize it by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. (For more details on the matrix calculus involved, see Section 4.3 of "Linear Algebra Review and Reference".) Setting the derivatives to zero, we obtain the normal equations

    X^T X \theta = X^T \vec{y}.

Thus, the value of $\theta$ that minimizes $J(\theta)$ is given in closed form by

    \theta = (X^T X)^{-1} X^T \vec{y}.

Applying this to the Portland housing data, with the number of bedrooms included as one of the input features as well, we obtain $\theta_0 = 89.60$, $\theta_1 = 0.1392$, $\theta_2 = -8.738$ (price in thousands of dollars as a function of living area and number of bedrooms).

When faced with a regression problem, why might linear regression, and in particular the least-squares cost function $J$, be a reasonable choice? In this section we endow the model with a set of probabilistic assumptions, and then show that fitting the parameters by maximum likelihood recovers least-squares regression. Assume that the target variables and the inputs are related via the equation

    y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects (such as features very pertinent to predicting housing price that we'd left out of the regression) or random noise. Let us further assume that the $\epsilon^{(i)}$ are distributed IID according to a Gaussian distribution with mean zero and some variance $\sigma^2$. This implies that the distribution of $y^{(i)}$ given $x^{(i)}$ is $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. (Note that we should not condition on $\theta$, since $\theta$ is not a random variable.) Given the design matrix $X$, which contains all the $x^{(i)}$'s, and $\theta$, the probability of the data is given by $p(\vec{y} \mid X; \theta)$. This quantity is typically viewed as a function of $\vec{y}$ (and perhaps $X$) for a fixed value of $\theta$; when we instead view it as a function of $\theta$, we call it the likelihood $L(\theta)$.

The principle of maximum likelihood says that we should choose $\theta$ so as to make the data as high probability as possible; i.e., we should choose $\theta$ to maximize $L(\theta)$. Instead of maximizing $L(\theta)$ itself, we can also maximize any strictly increasing function of $L(\theta)$; in particular, the derivations are a bit simpler if we instead maximize the log likelihood $\ell(\theta)$. Under the Gaussian noise assumption, maximizing $\ell(\theta)$ gives the same answer as minimizing

    \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,

which we recognize as our original least-squares cost function $J(\theta)$. To summarize: under the preceding probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$, and this is one set of assumptions under which least squares is a reasonable way of choosing our best guess of the parameters $\theta$.
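As an illustration (a sketch, not part of the original notes), the closed-form fit can be computed in a couple of lines of numpy. Solving the linear system is used here instead of forming the matrix inverse explicitly, which is a standard numerical choice rather than anything prescribed by the notes; the example rows below are hypothetical, with bedroom counts made up for the illustration.

    import numpy as np

    def normal_equations(X, y):
        """Least-squares fit in closed form: solve X^T X theta = X^T y."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Hypothetical Portland-style rows: [intercept, living area, #bedrooms] -> price ($1000s)
    X = np.array([[1.0, 2104, 3], [1.0, 1600, 3], [1.0, 3000, 4]])
    y = np.array([400.0, 330.0, 540.0])
    theta = normal_equations(X, y)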
Locally weighted linear regression. As discussed previously, and as shown in the example above, the choice of features matters a great deal. (When we talk about model selection, we'll also see algorithms for automatically choosing a good set of features.) In this section, let's briefly talk about the locally weighted linear regression (LWR) algorithm. In the original (unweighted) linear regression algorithm, to make a prediction at a query point $x$ we would fit $\theta$ to minimize $\sum_i (y^{(i)} - \theta^T x^{(i)})^2$ and output $\theta^T x$. In contrast, locally weighted linear regression fits $\theta$ to minimize

    \sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,

where the $w^{(i)}$'s are non-negative weights. A fairly standard choice for the weights is

    w^{(i)} = \exp\left( - \frac{(x^{(i)} - x)^2}{2\tau^2} \right).

(If $x$ is vector-valued, this is generalized to $w^{(i)} = \exp\left( -(x^{(i)} - x)^T (x^{(i)} - x) / (2\tau^2) \right)$.) Note that the weights depend on the particular point $x$ at which we're trying to evaluate the prediction. Moreover, if $|x^{(i)} - x|$ is small, then $w^{(i)}$ is close to 1; and if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small. Hence, $\theta$ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point $x$: if $w^{(i)}$ is small, then the $(y^{(i)} - \theta^T x^{(i)})^2$ error term will be largely ignored in the fit, whereas if $w^{(i)}$ is large for a particular value of $i$, then in picking $\theta$, we'll try hard to make $(y^{(i)} - \theta^T x^{(i)})^2$ small. The parameter $\tau$ controls how quickly the weight of a training example falls off with distance of its $x^{(i)}$ from the query point $x$; $\tau$ is called the bandwidth parameter. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$'s do not directly have anything to do with Gaussians, and in particular they are not random variables, normally distributed or otherwise.)

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_j$'s), which are fit to the data; once we've fit the $\theta_j$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows with the size of the training set.
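The prediction step can be sketched as follows (illustrative code, not from the notes). The sketch assumes the weighted least-squares problem is solved via its weighted normal equations, with X, y, and x_query following the same conventions as before and tau an arbitrary default.

    import numpy as np

    def lwr_predict(X, y, x_query, tau=1.0):
        """Locally weighted linear regression prediction at a single query point."""
        diffs = X - x_query
        # Gaussian-shaped (but non-probabilistic) weights based on distance to the query point
        w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * tau ** 2))
        W = np.diag(w)
        # Fit theta for this query by solving X^T W X theta = X^T W y
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return x_query @ theta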
Classification and logistic regression. Let's now talk about the classification problem. This is just like the regression problem, except that the values $y$ we now want to predict take on only a small number of discrete values. For instance, given the living area, we might want to predict whether a dwelling is a house or an apartment. For now, we will focus on the binary classification problem, in which $y$ can take on only two values, 0 and 1; $y$ is also called the label for the training example. (Most of what we say here will also generalize to the multiple-class case.)

We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Instead, logistic regression models $p(y \mid x; \theta)$ as

    h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},

where $g$ is the sigmoid function. Given the logistic regression model, how do we fit $\theta$ for it? As before, we endow the model with a set of probabilistic assumptions, namely $y \mid x; \theta \sim \mathrm{Bernoulli}(h_\theta(x))$, and then fit the parameters by maximum likelihood. Assuming the training examples were generated independently, we can write down the likelihood $L(\theta) = p(\vec{y} \mid X; \theta)$, and, as before, it will be easier to maximize the log likelihood $\ell(\theta)$. How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$. (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) For a single training example, this gives the update rule:

    \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.

If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind this? We'll answer this when we get to GLM models.

Digression: the perceptron learning algorithm. Consider modifying the logistic regression method to "force" it to output values that are either 0 or 1 exactly. To do so, it seems natural to change the definition of $g$ to be the threshold function. If we then let $h_\theta(x) = g(\theta^T x)$ with this modified definition of $g$, and if we use the update rule

    \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)},

then we have the perceptron learning algorithm.
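The gradient ascent rule above can be sketched in a few lines of numpy (an illustration, not from the notes); the function and parameter names, the learning rate, and the epoch count are arbitrary.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression_sga(X, y, alpha=0.1, num_epochs=100):
        """Stochastic gradient ascent on the logistic regression log likelihood."""
        theta = np.zeros(X.shape[1])
        for _ in range(num_epochs):
            for i in range(X.shape[0]):
                h = sigmoid(X[i] @ theta)
                # identical in form to the LMS rule, but h is a non-linear function of theta^T x
                theta += alpha * (y[i] - h) * X[i]
        return theta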
Another algorithm for maximizing $\ell(\theta)$: Newton's method. Returning to logistic regression with $g$ being the sigmoid function, let's now talk about a different algorithm for maximizing $\ell(\theta)$. Newton's method gives a way of getting to a point where a function $f(\theta)$ is zero: it performs the update

    \theta := \theta - \frac{f(\theta)}{f'(\theta)}.

In the example pictured in the notes, one more iteration updates $\theta$ to about 1.8, and after a few more iterations we rapidly approach $\theta = 1.3$, the value at which $f(\theta) = 0$. The maxima of $\ell$ correspond to points where its first derivative $\ell'(\theta)$ is zero. So, by letting $f(\theta) = \ell'(\theta)$, we can use the same algorithm to maximize $\ell$, and we obtain the update rule

    \theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}.

(Something to think about: how would this change if we wanted to use Newton's method to minimize rather than maximize a function?) In our logistic regression setting, $\theta$ is vector-valued, so we need to generalize Newton's method to this setting. The generalization (also called the Newton-Raphson method) is

    \theta := \theta - H^{-1} \nabla_\theta \ell(\theta),

where $\nabla_\theta \ell(\theta)$ is, as usual, the vector of partial derivatives of $\ell(\theta)$ with respect to the $\theta_j$'s, and $H$ is the $(d+1)$-by-$(d+1)$ matrix (assuming that we include the intercept term) called the Hessian, whose entries are given by

    H_{jk} = \frac{\partial^2 \ell(\theta)}{\partial \theta_j \, \partial \theta_k}.

Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the optimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting the Hessian; but so long as the number of features is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood $\ell(\theta)$, the resulting method is also called Fisher scoring.
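A vector-valued Newton update for logistic regression can be sketched as follows (illustrative, not from the notes); the Hessian expression used is the standard one for the sigmoid log likelihood, and the iteration count is arbitrary.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression_newton(X, y, num_iters=10):
        """Newton's method for maximizing the logistic regression log likelihood."""
        theta = np.zeros(X.shape[1])
        for _ in range(num_iters):
            h = sigmoid(X @ theta)
            grad = X.T @ (y - h)                       # gradient of ell(theta)
            H = -(X.T * (h * (1 - h))) @ X             # Hessian of ell(theta)
            theta = theta - np.linalg.solve(H, grad)   # theta := theta - H^{-1} grad
        return theta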
Generalized linear models and the exponential family. So far, we've seen a regression example and a classification example. In the regression example we had $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$, and in the classification example $y \mid x; \theta \sim \mathrm{Bernoulli}(\phi)$, for some appropriate definitions of $\mu$ and $\phi$ as functions of $x$ and $\theta$. In this section, we will see that both are special cases of a broader family of models, called generalized linear models (GLMs), and that other models in the GLM family can be derived and applied to other classification and regression problems. (The presentation of the material in this section takes inspiration from Michael I. Jordan, Learning in graphical models, and from McCullagh and Nelder, Generalized Linear Models (2nd ed.).)

To work our way up to GLMs, we begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

    p(y; \eta) = b(y) \exp\left( \eta^T T(y) - a(\eta) \right).

A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions parameterized by $\eta$; as we vary $\eta$, we then get different distributions within this family. The Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean $\phi$, written $\mathrm{Bernoulli}(\phi)$, specifies a distribution over $y \in \{0, 1\}$ such that $p(y = 1; \phi) = \phi$ and $p(y = 0; \phi) = 1 - \phi$. As we vary $\phi$, we obtain Bernoulli distributions with different means, and this class of distributions, the ones obtained by varying $\phi$, is in the exponential family. Constructing a GLM from the Bernoulli recovers logistic regression, and constructing one from the Gaussian recovers ordinary least squares; this is the promised deeper reason why the two update rules took the same form.

The remaining parts of these notes build on this foundation. Part IV, Generative Learning Algorithms: so far, we've mainly been talking about learning algorithms that model $p(y \mid x; \theta)$, the conditional distribution of $y$ given $x$ (for instance, logistic regression modeled $p(y \mid x; \theta)$ as $h_\theta(x) = g(\theta^T x)$, where $g$ is the sigmoid function); generative algorithms instead model $p(x \mid y)$ and $p(y)$, and Section 2.1 of those notes discusses why Gaussian discriminant analysis is like logistic regression. Part V, Support Vector Machines: that set of notes presents the Support Vector Machine (SVM) learning algorithm, among the best (and many believe indeed the best) "off-the-shelf" supervised learning algorithms. A separate set of notes on kernel methods (updated by Tengyu Ma on April 21, 2019) recalls that in our discussion about linear regression we considered the problem of predicting the price of a house (denoted by $y$) from the living area of the house (denoted by $x$), and we fit a linear function of $x$ to the training data; feature maps extend this by fitting a linear function of a feature vector $\phi(x)$. The neural networks notes start small and slowly build up a neural network, step by step, give an overview of neural networks, discuss vectorization, and discuss training neural networks with backpropagation. Finally, in the clustering problem we are given a training set $\{x^{(1)}, \dots, x^{(m)}\}$ and want to group the data into a few cohesive "clusters"; here the $x^{(i)}$'s are given as usual, but no labels $y^{(i)}$ are given, so this is an unsupervised learning problem. A previous set of notes talked about the EM algorithm as applied to fitting a mixture of Gaussians; a later set gives a broader view of the EM algorithm and shows how it can be applied to a larger family of estimation problems.
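As a worked example (a sketch that follows directly from the definitions above, not copied from the notes), here is the Bernoulli written in exponential family form:

    \begin{align*}
    p(y;\phi) &= \phi^{y}(1-\phi)^{1-y} \\
              &= \exp\big( y\log\phi + (1-y)\log(1-\phi) \big) \\
              &= \exp\Big( \big(\log\tfrac{\phi}{1-\phi}\big)\, y + \log(1-\phi) \Big).
    \end{align*}

Taking the natural parameter $\eta = \log\frac{\phi}{1-\phi}$, with $T(y) = y$, $a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$, and $b(y) = 1$, puts the Bernoulli in the form $p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$. Inverting the relationship gives $\phi = 1/(1+e^{-\eta})$, the sigmoid function, which is one way to see why the sigmoid appears in logistic regression.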