
# Linear Regression with One Variable

## Model Representation

## Cost Function

### Cost Function Intuition-1:

### Cost Function Intuition-2:

##### Note:

#### Contour Plots

##### Notes:

## Gradient Descent Algorithm

### Minimizing Gradient Descent

### Gradient Descent Intuition

###### Understanding the derivative term:

###### Understanding the learning rate α:

##### Notes:

### Gradient Descent for Linear Regression

#### Batch Gradient Descent Algorithm



Linear regression is the first learning algorithm we will study. We need to see what the model looks like and, more importantly, what the overall process of supervised learning looks like.
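The hypothesis for one variable is h_θ(x) = θ_{0} + θ_{1}x. As a minimal sketch (in Python, though the course itself uses Octave; the function name is my own):

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 2, the prediction for x = 3 is 1 + 2*3 = 7.
print(hypothesis(1.0, 2.0, 3.0))  # 7.0
```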

Analyzing the cost function using a simplified hypothesis function, by setting

θ_{0} = 0
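For this simplified case, the cost is J(θ_{1}) = (1/2m) Σ (θ_{1}x⁽ⁱ⁾ − y⁽ⁱ⁾)². A quick illustrative sketch (the toy data and function name here are my own, not from the course):

```python
def cost_simplified(theta1, xs, ys):
    """Squared-error cost with theta0 fixed at 0:
    J(theta1) = (1 / (2m)) * sum((theta1 * x - y)^2)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]        # toy training set lying on the line y = x
print(cost_simplified(1.0, xs, ys))  # 0.0 -- a perfect fit
print(cost_simplified(0.0, xs, ys))  # (1 + 4 + 9) / 6, roughly 2.333
```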

Analyzing the cost function using the actual hypothesis function.

- Earlier we saw that when we considered only one parameter **θ_{1}**, we got a bowl-shaped graph of the cost function in the 2-D plane.
- But now, when we consider both **θ_{0}** and **θ_{1}**, we again get a bowl-shaped graph, but in 3-D.

- For the purpose of the illustrations ahead, we are not going to use these sorts of 3-D surfaces to show the cost function J; instead we will use **contour plots / figures**.

- What we want now is an **efficient algorithm / software** for automatically finding the values of **θ_{0}** and **θ_{1}** that minimize the cost function J.
- In more complicated examples with higher-dimensional figures (more parameters), the cost function cannot really be plotted and becomes much harder to visualize, so we need software to find those values.

- Used to minimize the cost function J.
- It is a more general algorithm and is used not only in linear regression but all over the place in machine learning.
- We use gradient descent to minimize other functions as well, not just the cost function J for the linear regression.

- So we have our hypothesis function and we have a way of measuring how well it fits into the data.
- Now we need to estimate the parameters in the hypothesis function, that’s where gradient descent comes in.
- In gradient descent, what we do is look all around 360 degrees and ask, if I had to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in?

- We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.
- The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.
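The update described above is θ_j := θ_j − α · (∂/∂θ_j) J. A tiny illustrative sketch of a single step (using my own toy cost J(θ) = θ², whose derivative is 2θ, purely for illustration):

```python
def gradient_step(theta, alpha, grad):
    """One gradient-descent update: theta := theta - alpha * dJ/dtheta."""
    return theta - alpha * grad

# For J(theta) = theta^2 the derivative is 2 * theta.
theta, alpha = 3.0, 0.1
theta = gradient_step(theta, alpha, 2 * theta)  # 3.0 - 0.1 * 6.0, i.e. about 2.4
print(theta)  # a step downhill, toward the minimum at theta = 0
```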

Example:

- The distance between each ‘star’ in the graph above represents a step determined by our parameter α; a smaller α results in a smaller step and a larger α in a larger step.
- The direction in which the step is taken is determined by the partial derivative of J(θ_{0}, θ_{1}).
- Depending on where one starts on the graph, one could end up at different points; the image above shows two different starting points that end up in two different places.

- We want an intuition about what each of these two terms (the learning rate and the derivative) is doing, and why the entire update makes sense when they are put together. To convey these intuitions, we will try to minimize a function of just one parameter.
- So say we have a cost function J of just one parameter, **θ_{1}**, so that we can have plots which are a little simpler to look at.

**Case-1:** When we start with a **θ_{1}** to the right of the minimum

We will see that after the update, **θ_{1}** moves towards the left to attain the minimum, as the slope will be +ve and **α** is already positive.

**Case-2:** When we start with a **θ_{1}** to the left of the minimum

We will see that after the update, **θ_{1}** moves towards the right to attain the minimum, as the slope will be -ve and **α** is already positive.

- If we are already at the local optimum, the update leaves **θ_{1}** unchanged, because the derivative term is zero and the update becomes **θ_{1}** := **θ_{1}**. So if the parameters are already at a local minimum, one step of gradient descent does absolutely nothing; it doesn’t change the parameters, which is what we want, because it keeps our solution at the local optimum.
- Gradient descent can converge to a local minimum even with the learning rate α held fixed.
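Both cases, and the stay-put behaviour at the optimum, can be checked with a one-parameter sketch (my own toy cost J(θ) = (θ − 2)², not from the course):

```python
def descend(theta, alpha=0.1, steps=100):
    """Run gradient descent on J(theta) = (theta - 2)^2, whose
    derivative is 2 * (theta - 2); the minimum sits at theta = 2."""
    for _ in range(steps):
        theta -= alpha * 2 * (theta - 2)
    return theta

print(round(descend(5.0), 4))   # starts right of the minimum, moves left  -> ~2.0
print(round(descend(-1.0), 4))  # starts left of the minimum, moves right -> ~2.0
print(descend(2.0))             # already at the optimum: the update changes nothing
```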

- Now we put together gradient descent with our cost function, and that will give us an algorithm for linear regression (fitting a straight line to our data).
- To apply gradient descent to minimize our squared error cost function, the key term we need is the derivative term.

- We need to figure out this partial derivative term by plugging the definition of the cost function J into it.
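Carrying out that differentiation with h_θ(x) = θ_{0} + θ_{1}x gives the two derivative terms used in the update:

```latex
\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)
\qquad
\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
```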

**Problem with the Gradient Descent Algorithm:** It is susceptible to local optima.

**Solution:** The cost function for linear regression is always going to be a bowl-shaped, i.e. convex, function, which has no local optima other than the single global optimum.

Algorithm in Action:

![Gradient descent in action](https://raw.githubusercontent.com/astikanand/techblogs/master/5_machine_learning/assets/File%20from%20iOS.gif)

Finally: *The closest-fit line to the training data set*

- The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
- So, this is simply gradient descent on the original cost function J.
- This method looks at every example in the entire training set on every step, and is called batch gradient descent.
- There are other versions of gradient descent that are not batch versions, because they don’t look at the entire training set but instead at small subsets of the training set at a time.
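The batch algorithm above can be sketched end-to-end as follows (a minimal Python sketch with my own function names and toy data; the course itself works in Octave):

```python
def batch_gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    """Batch gradient descent for univariate linear regression.

    Every iteration sweeps the whole training set (hence 'batch') and
    updates theta0, theta1 simultaneously:
        theta0 := theta0 - alpha * (1/m) * sum(h(x) - y)
        theta1 := theta1 - alpha * (1/m) * sum((h(x) - y) * x)
    """
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0   # simultaneous update: both gradients were
        theta1 -= alpha * grad1   # computed from the *old* theta0, theta1
    return theta0, theta1

# Toy data generated from y = 2x + 1; the fit should recover those parameters.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
t0, t1 = batch_gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))  # close to 1.0 and 2.0
```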