Coursera Machine Learning Review: Introduction

This is a series where I’m discussing what I’ve learned in Coursera’s machine learning course taught by Andrew Ng of Stanford University.  Why?  See Machine Learning, Nanodegrees, and Bitcoin.  I’m definitely not going into depth, just briefly summarizing from a 10,000-foot view.

This is a review of week one.

Defining Machine Learning

  1. First you need a problem to solve, such as playing a game of chess, determining what factors may increase chances of an illness, etc.
  2. You then give the machine practice solving the problem.  Perhaps this is with a dataset, playing the game, or something else.
  3. Finally, you have some way of measuring the machine’s performance.  Either it did well or did not do well, and that determines how it will learn.

Putting all those together, over time, the machine will get either worse or better (depending on its performance) while practicing solving the problem until it comes up with a good enough solution.

There are two types of learning – supervised and unsupervised.

  • Supervised: there is a relationship between the input and output that is known before the algorithm is run.  There are two types:
    • Regression: mapping input to continuous output (a function).
    • Classification: mapping input to a discrete outcome.
    • The course gives an example of housing prices – how much can you sell a house for, given its size?  If you predict the selling price from the house size (a continuous value), that is regression.  If you predict whether the house sells for more or less than the asking price (discrete outcomes), it is classification.
  • Unsupervised: you have no idea what the outcome will look like.  Relationships among the data are noted and shown.  An example might be determining what genes in the human body affect different characteristics such as height, lifespan, etc.
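
To make the regression vs. classification split concrete, here is a small sketch using the housing example.  The numbers and variable names are made up purely for illustration; they are not from the course.

```python
# Hypothetical houses, invented only to illustrate the two kinds of target.
sizes_sqft = [1000, 1500, 2000, 2500]
asking_prices = [210_000, 290_000, 400_000, 520_000]
selling_prices = [200_000, 300_000, 410_000, 500_000]

# Regression target: a continuous value (the selling price itself).
regression_targets = selling_prices

# Classification target: a discrete label (did it sell above asking or not?).
classification_targets = [
    "above" if sold > asked else "not above"
    for sold, asked in zip(selling_prices, asking_prices)
]

for size, price, label in zip(sizes_sqft, regression_targets, classification_targets):
    print(f"{size} sq ft -> regression target: {price}, classification target: {label}")
```

Same houses, same input, but the kind of answer we ask for decides which type of supervised learning it is.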

Woah!  Lots of Math!


Following the types of machine learning, the course dove into a lot of math.  This was somewhat of a shock for me.  In general, I haven’t used much math at the places I’ve worked since I graduated college.  The fact that it is taught so early in this course tells me it just might be important again… which is fine.  I’m surprised how much I remembered.  Thanks math teachers at Oneonta and UAH!

The Hypothesis Function

We use “training data” (using “x” as input and “y” as output) in addition to a learning algorithm in order to produce a hypothesis.  This hypothesis will tell us how to calculate the output based on some input.  This is the ultimate goal (at least from my current understanding) of what we are doing – to get some function that predicts output based on current input.  All the functions that follow are just a means of getting a more accurate hypothesis function.

  • Training data + learning algorithm gives us a hypothesis.
  • h(x) is the hypothesis, and its output is the predicted value of y.
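
As a minimal sketch, assuming the straight-line form used in week one of the course, a hypothesis is just a function of x.  The names theta0 and theta1 stand for the theta values discussed in the cost function section below, and the specific numbers here are hypothetical.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis: the predicted y for a given input x."""
    return theta0 + theta1 * x


# Hypothetical parameter values, chosen only for illustration.
theta0, theta1 = 50_000, 150

# Predicted selling price for a 2000 sq ft house under these made-up values.
predicted_price = hypothesis(theta0, theta1, 2000)
print(predicted_price)
```

The learning algorithm's job is to pick theta0 and theta1 for us; here they are simply typed in by hand.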

Once we have a hypothesis function, we can see how accurate it is by using the cost function.

The Cost Function


How good or bad is the hypothesis?  The cost function (or “mean squared error”) lets us determine this.  If you want to see the formula, you will need to look it up on Coursera or somewhere else online – sorry… it’s going to take me 30 minutes just to figure out how to print out theta symbols, summations, subscripts, etc.!  But I’ll describe it at least.

You take the mean of the squares of the predicted value minus the actual value (based on your data set).  So that is 0.5 times the sum of the squares of each prediction h(x) minus the actual value (y), all divided by the number of data points (defined by the variable m in this formula).  The function takes two variables, theta sub-0 and theta sub-1 in the given formula, which are the parameters of the hypothesis; adjusting them moves the predictions closer to the actual values of y.  We will talk about those in the next section on gradient descent.
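
For anyone who would rather not look it up, here is my reading of that description in LaTeX notation.  It assumes the straight-line hypothesis from week one of the course, and the superscript (i) simply means "the i-th data point".

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,
\qquad h_\theta(x) = \theta_0 + \theta_1 x
```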

If you imagine an x-y chart… (figure: line chart)

We have several points based on our dataset.  We then try to fit one straight line through all those points.  The closer that line matches all the data points, the more accurate our hypothesis is, as it will be more likely to accurately predict other points that we don’t have data for.  A small value of the cost function means there is not much difference between the actual value and the predicted value… so we try to minimize it.
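
Here is a rough sketch of that idea in code, assuming the straight-line hypothesis and the "0.5 times the sum of squares, divided by m" cost described above.  The data points and function names are made up for illustration.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line prediction for input x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Half the mean of the squared differences between predictions and actual values."""
    m = len(xs)
    squared_errors = [
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / (2 * m)


# Hypothetical data that lies exactly on the line y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

print(cost(0, 2, xs, ys))  # this line passes through every point
print(cost(0, 1, xs, ys))  # this line sits below the points, so the cost is larger
```

The line that hugs the points gets the smaller cost, which is exactly what "more accurate hypothesis" means here.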

Gradient Descent

Remember those theta values from the previous section, the ones the cost function uses to measure how accurate our hypothesis function is?  Gradient descent is a way to determine those theta values.  You can find the formulas on Coursera here.

You keep computing new values for theta sub-0 and theta sub-1 until they converge.  Each time the gradient descent formula is applied, it uses the slope of the tangent line (the derivative of the cost function) to determine which direction gives a lower value, where a lower value means you are getting less and less error.  As you converge on that minimum, you are getting the most accurate hypothesis function.
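
For those who prefer code to formulas, here is a rough sketch of that loop, not the course's exact presentation.  It assumes the straight-line hypothesis and the squared-error cost from the previous section; the function name, the data, the starting values of zero, and the learning_rate and steps settings are all my own choices for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.1, steps=5000):
    """Repeatedly nudge theta0 and theta1 in the direction that lowers the cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(steps):
        # Prediction minus actual value for every data point.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Slopes (partial derivatives) of the cost with respect to each theta.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both thetas together, stepping against the slope.
        theta0, theta1 = (
            theta0 - learning_rate * grad0,
            theta1 - learning_rate * grad1,
        )
    return theta0, theta1


# Hypothetical data that lies exactly on the line y = 1 + 2x.
xs = [1, 2, 3]
ys = [3, 5, 7]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

With this made-up data the thetas should settle very close to 1 and 2, the line the points were generated from.  A fixed number of steps stands in for a real convergence check to keep the sketch short.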

Looking at the actual formula and describing what it means is out of the scope of this post.  If you want more information, look at the Coursera course.


I hope you aren’t tired of the math… because the last part of this week is going over linear algebra concepts.  Fortunately for me, I remembered enough of this from college that this part was a breeze.  I actually took the quiz before looking at the content and got a perfect score on it (lucky me!).  The next post will be a summary of that material.  In my opinion, it was not very difficult, and easier than the material in this post.  See you next time!
