Andrew Ng
Supervised learning
Let's start by talking about a few examples of supervised learning problems.
Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:
Living area (feet²)    Price (1000$s)
2104                   400
1600                   330
2400                   369
1416                   232
3000                   540
 ...                    ...
We can plot this data:

[Figure: "housing prices" scatter plot of price (in $1000) versus living area in square feet.]
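As an illustration not in the original notes, here is a minimal Python sketch that stores the five listed examples and produces a scatter plot like the one above; matplotlib is assumed as a plotting dependency.

```python
import matplotlib.pyplot as plt

# The five training examples listed in the table above.
living_area = [2104, 1600, 2400, 1416, 3000]   # square feet
price = [400, 330, 369, 232, 540]              # in $1000s

# Scatter plot of price against living area, mirroring the figure.
plt.scatter(living_area, price)
plt.xlabel("square feet")
plt.ylabel("price (in $1000)")
plt.title("housing prices")
plt.show()
```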
Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
To establish notation for future use, we'll use x^(i) to denote the "input" variables (living area in this example), also called input features, and y^(i) to denote the "output" or target variable that we are trying to predict (price). A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = R.
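To make the notation concrete, here is a small sketch (again, not part of the original notes) in which the training set is a Python list of (x^(i), y^(i)) pairs built from the examples above; the variable names are hypothetical, chosen only to mirror the notation.

```python
# Training set: a list of m pairs (x^(i), y^(i)), using the examples above.
# x^(i) = living area of the i-th house, y^(i) = its price in $1000s.
training_set = [(2104, 400), (1600, 330), (2400, 369), (1416, 232), (3000, 540)]

m = len(training_set)        # number of training examples (m = 5 here)

# The superscript (i) is just an index into the training set;
# in code it simply becomes a list index (0-based rather than 1-based).
x_3, y_3 = training_set[2]   # the third training example, (x^(3), y^(3))
```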
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:
[Diagram: the training set is fed into a learning algorithm, which outputs a hypothesis h; a new input x (living area of house) is passed through h to produce a predicted y (predicted price of house).]
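As a rough sketch of what a hypothesis might look like in code: below, h is taken to be a linear function h(x) = θ0 + θ1·x purely as an illustration of one possible choice; the form and the parameter values are assumptions for this example, not something learned from the data.

```python
# A hypothesis h maps an input x (living area, in sq. ft.) to a predicted
# y (price, in $1000s). The linear form and the parameter values below are
# illustrative placeholders; they have not been fit to the training set.
theta_0 = 50.0    # intercept, in $1000s (hypothetical)
theta_1 = 0.15    # slope, in $1000s per square foot (hypothetical)

def h(x):
    """Return the predicted price (in $1000s) for living area x (sq. ft.)."""
    return theta_0 + theta_1 * x

print(h(2104))    # e.g., prediction for a 2104 sq. ft. house
```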
When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem.