This is a series where I’m discussing what I’ve learned in Coursera’s machine learning course taught by Andrew Ng of Stanford University. Why? See Machine Learning, Nanodegrees, and Bitcoin. I’m definitely not going into depth, but just briefly summarizing from a 10,000-foot view.
This is a partial review of week 2.
Multivariate Linear Regression
Last week we looked at hypothesis functions, cost functions, and gradient descent that only considered one feature. But what happens when we don’t have just one variable to consider but multiple features?
For example, rather than the size of a house (the input x) determining the selling price (the output, y), what if we have…
- The size of the house (x1)
- The age of the house (x2)
- The location of the house (x3)
All determining the selling price? We need an equation that takes multiple input variables. This is where multivariate linear regression comes in.
This is a very complicated-sounding term that just means we have multiple variables in our hypothesis functions/cost functions/gradient descent functions. For example, rather than our hypothesis just being one x and one y, we have multiple x’s and thetas being multiplied together.
The hypothesis function h_theta(x) is the sum of each theta multiplied by its corresponding x (and we assume x0 is 1 for reasons I won’t explain here).
h_theta(x) = theta0 + theta1 * x1 + theta2 * x2 + ... etc.
x holds our training data, whereas theta holds the parameters (the weights the algorithm learns). For example, in the above example, x1 would be the actual size of the house in square feet, but theta1 would be the price of the house per square foot. And so on for x2, x3, and all the other features in our data set.
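As a concrete sketch (in Python here, though the course uses Octave/Matlab), the hypothesis is just each theta times its x, summed. The theta and x values below are made up for illustration:

```python
# Minimal sketch of the multivariate hypothesis h_theta(x).
# theta and x values are made-up illustrations, not from the course.
def hypothesis(theta, x):
    """h_theta(x) = theta0*x0 + theta1*x1 + ... (x0 is assumed to be 1)."""
    return sum(t * xi for t, xi in zip(theta, x))

theta = [50.0, 0.1, -2.0]      # base price, price per sq ft, discount per year of age
x = [1.0, 2000.0, 10.0]        # x0 = 1, 2000 sq ft, 10 years old
price = hypothesis(theta, x)   # 50 + 200 - 20 = 230
```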
The really nice thing about multiple features is these can be put into matrices. We can then use linear algebra to calculate them all at once, which works well with Matlab/Octave (remember me saying the purpose of linear algebra in my previous post?).
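To sketch what that looks like (again in Python rather than Octave): if we stack the training examples into a matrix X, one row per house with a leading 1 for x0, then every prediction falls out of a single matrix-vector product X * theta. The numbers are made up:

```python
# Sketch of the vectorized hypothesis: predictions for ALL training
# examples at once via X * theta. Data are made-up illustrations.
def predict_all(X, theta):
    return [sum(t * xi for t, xi in zip(theta, row)) for row in X]

X = [
    [1.0, 1500.0, 30.0],   # house 1: 1500 sq ft, 30 years old
    [1.0, 2000.0, 10.0],   # house 2: 2000 sq ft, 10 years old
]
theta = [50.0, 0.1, -2.0]
predictions = predict_all(X, theta)   # one predicted price per house
```

In Octave/Matlab this whole function collapses to the one-liner `X * theta`, which is exactly why the course leans on linear algebra.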
If you recall, gradient descent is used to find the theta values of our hypothesis function. We now have a more complicated formula that works for multiple variables. That formula is given here on Coursera.
If the features vary widely in their range (for example, the size of a house might be 1,500 or 2,000 square feet, but its age is a much smaller number, such as 10-100 years), gradient descent takes a lot longer (the lopsided ranges stretch out the cost surface, so it has much further to descend to the optimum value). Using feature scaling and mean normalization, we can bring these values a lot closer together.
- Feature scaling: divide each input by the range (highest value minus lowest value)
- Mean normalization: subtract the average from each input.
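The two steps above can be sketched in one small function (Python here rather than the course’s Octave; the feature values are made up):

```python
# Sketch of mean normalization plus feature scaling, as described above:
# subtract the mean, then divide by the range (highest minus lowest).
# The house sizes are made-up illustrations.
def scale_feature(values):
    mean = sum(values) / len(values)
    rng = max(values) - min(values)
    return [(v - mean) / rng for v in values]

sizes = [1500.0, 2000.0, 2500.0]   # sq ft; mean 2000, range 1000
scaled = scale_feature(sizes)       # [-0.5, 0.0, 0.5]
```

After this, every feature lives in roughly the same small interval around zero, which is what lets gradient descent take even steps in every direction.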
Once the input values are a lot closer together, gradient descent can more quickly find the values of theta that are optimized for our hypothesis function.
We run this function over and over in order to find better and better values of theta, with each iteration using the previously found values of theta. The alpha value of the gradient descent function gives our ‘learning rate’ — how big a step we take down after each iteration. If it is too small, gradient descent takes a really long time. However, if it is too large, we might step past the optimum value every time (go down then come back up because we overshot the minimum between two iterations).
So… how do we determine if gradient descent is actually behaving as we expect it to? If we plot or just print the value of J(theta) – the cost function – after each iteration of gradient descent, it should always get smaller. If it ever gets larger, there is a problem with your implementation. Your value of alpha may be too large.
On the other hand, if alpha is too small, it may take a really long time to converge. Pick a larger value of alpha in this case.
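Here is a minimal sketch of that loop (Python instead of the course’s Octave, with made-up data and a made-up alpha). It runs the update repeatedly and records J(theta) after each iteration so you can check it only ever shrinks:

```python
# Sketch of batch gradient descent for linear regression, tracking the
# cost J(theta) each iteration. Data, alpha, and iteration count are
# made-up illustrations; features are assumed already scaled.
def cost(X, y, theta):
    m = len(y)
    total = 0.0
    for row, yi in zip(X, y):
        h = sum(t * xi for t, xi in zip(theta, row))
        total += (h - yi) ** 2
    return total / (2 * m)

def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    history = []
    for _ in range(iterations):
        # errors use the OLD theta, so all thetas update simultaneously
        errors = [sum(t * xi for t, xi in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        theta = [t - alpha / m * sum(e * row[j] for e, row in zip(errors, X))
                 for j, t in enumerate(theta)]
        history.append(cost(X, y, theta))  # J should shrink every iteration
    return theta, history

X = [[1.0, -0.5], [1.0, 0.0], [1.0, 0.5]]   # one scaled feature, x0 = 1
y = [1.0, 2.0, 3.0]                          # made-up targets (y = 2 + 2x)
theta, history = gradient_descent(X, y, [0.0, 0.0], alpha=0.5, iterations=200)
```

If `history` ever goes up instead of down, that is the too-large-alpha symptom described above.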
Sometimes a line does not fit the data well. In these cases, we can change our hypothesis function to fit the data. For example, rather than our function representing a line, maybe the data fits a square root function or cubic function instead.
Say there is just one feature, but the data appears to fit a cubic function. In this case, we can change our hypothesis to use three features built from that one input: x1 (the input itself), x2 (the input squared), and x3 (the input cubed). This is all in an effort to make the resulting hypothesis better fit the data.
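Building those features is mechanical — a sketch (Python rather than Octave, made-up input):

```python
# Sketch: expanding a single input x into polynomial features so a cubic
# hypothesis can be fit with the same linear-regression machinery.
# Note x, x^2, and x^3 have very different ranges, so feature scaling
# matters even more here.
def cubic_features(x):
    return [1.0, x, x ** 2, x ** 3]   # x0 = 1, x1 = x, x2 = x^2, x3 = x^3

row = cubic_features(2.0)             # [1.0, 2.0, 4.0, 8.0]
```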
The course didn’t get into much detail over this, such as when to do this. Seems like magic to me. Nevertheless, it’s another method to get a better prediction of the data. I’m sure I’ll understand it better if I use it in the future.
The Normal Equation
The normal equation is another way of minimizing the cost function (in other words, an alternative to gradient descent). Rather than finding the minimum iteratively, it computes the answer in a single application of an equation (found here on Coursera). We also don’t have to do feature scaling or pick an alpha when using the normal equation. So why wouldn’t we always use it?
It is slower when the number of features is large because it requires finding the inverse of a matrix. In general, when the number of features exceeds 100,000, you may consider using gradient descent rather than the normal equation.
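For the smallest possible case — one feature plus an intercept — the normal equation theta = (XᵀX)⁻¹Xᵀy boils down to solving a 2x2 system, which we can sketch directly (Python, made-up data that lies exactly on a line):

```python
# Sketch of the normal equation for ONE feature plus an intercept:
# solve (X'X) * theta = X'y by inverting the 2x2 matrix directly.
# Data are made-up illustrations, chosen to lie exactly on y = 1 + 2x.
def normal_equation(X, y):
    a00 = sum(row[0] * row[0] for row in X)   # X'X entries
    a01 = sum(row[0] * row[1] for row in X)
    a11 = sum(row[1] * row[1] for row in X)
    b0 = sum(row[0] * yi for row, yi in zip(X, y))   # X'y entries
    b1 = sum(row[1] * yi for row, yi in zip(X, y))
    det = a00 * a11 - a01 * a01   # a zero determinant means non-invertible
    return [(a11 * b0 - a01 * b1) / det,
            (a00 * b1 - a01 * b0) / det]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]
theta = normal_equation(X, y)   # recovers [1.0, 2.0]
```

With many features that inverse is an n-by-n operation — the expensive step the paragraph above warns about.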
In addition, since the normal equation requires taking the inverse, there are some matrices that are non-invertible (they have no inverse). In these cases…
- See if you have duplicated a feature or have two features that are linearly dependent on each other. If so, remove one of them.
- Remove some features.
- Use regularization (to be explained in a future lesson).
If you are anything like me, this amount of math makes you want to vomit. The best way to learn something, in my opinion, is to use it in an application.
Fortunately, we are programming something in Octave/Matlab as part of this week’s lesson. I will go a little into that in the next post (not giving away answers, but explaining some of my insights). This helps reinforce some of the things we have learned so far.