Introduction to regression and classification

Regression and classification are branches of machine learning that predict certain variables from labelled data. Both rest on several premises:

  • We distinguish between data points $x$ and labels $y$. Data points are relatively easy to obtain, whereas labels $y$ are relatively hard to obtain.
  • We consider some parameterized function $\operatorname{predict}(w;x)$ and look for unknown weights $w$ such that it correctly predicts the labels from the samples (data points):

\[\operatorname{predict}(w;x) \approx y.\]

  • We have a labelled dataset with $n$ samples $x_1,\dots,x_n$ and labels $y_1,\dots,y_n$.
  • We use the labelled dataset to train the weights $w$.
  • When an unlabelled sample arrives, we use the prediction function to predict its label.

The MNIST training set contains $n=60000$ grayscale images of handwritten digits. Each image $x_i$ has size $28\times 28$ and was manually labelled by $y_i\in\{0,\dots,9\}$. Once the weights $w$ of a prediction function $\operatorname{predict}(w;x)$ are trained on this dataset, the prediction function can predict which digit appears in images it has never seen before. This is an example where the images $x$ are relatively easy to obtain, but the labels $y$ are hard to obtain because they must be assigned manually.

Regression and classification

The difference between regression and classification is simple:

  • Regression predicts a continuous variable $y$ (such as height based on weight).
  • Classification predicts a variable $y$ with a finite number of states (such as cat/dog/none from images).

The body-mass index is used to measure fitness. It has a simple formula

\[\operatorname{BMI} = \frac{w}{h^2},\]

where $w$ is the body weight (not to be confused with the model weights) and $h$ is the height. If we do not know the formula, we may estimate it from data. We denote the samples by $x=(w,h)$ and the labels by $y=\operatorname{BMI}$. Then regression considers the following data.


The upper index denotes components while the lower index denotes samples. Sometimes it is not necessary to determine the exact BMI value but only whether a person has a healthy weight, which is defined as any BMI value in the interval $[18.5, 25]$. When we assign label $0$ to underweight people, label $1$ to people with normal weight and label $2$ to overweight people, then classification considers the following data.
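The two kinds of labels can be illustrated with a minimal sketch. The measurements below are hypothetical values chosen for illustration and are not the data from the original tables; the helper name `bmi_class` is likewise an assumption.

```python
import numpy as np

# Hypothetical measurements (illustration only): weight in kg, height in m.
weight = np.array([50.0, 70.0, 95.0])
height = np.array([1.80, 1.75, 1.70])

# Each row of X is one sample x_i = (weight_i, height_i).
X = np.column_stack([weight, height])

# Regression labels: the continuous BMI value w / h^2.
y_regression = X[:, 0] / X[:, 1] ** 2


def bmi_class(bmi):
    """Return 0 (underweight), 1 (normal, BMI in [18.5, 25]) or 2 (overweight)."""
    if bmi < 18.5:
        return 0
    if bmi <= 25.0:
        return 1
    return 2


# Classification labels: one of finitely many states.
y_classification = np.array([bmi_class(b) for b in y_regression])
```

The samples `X` are the same in both cases; only the type of label differs.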


Mathematical formulation

Recall that the samples are denoted by $x_i$ and the labels by $y_i$. Given $n$ data points in the dataset, the training procedure finds the weights $w$ by solving

\[\operatorname{minimize}_w\qquad \frac 1n \sum_{i=1}^n\operatorname{loss}\big(y_i, \operatorname{predict}(w;x_i)\big).\]

This minimizes the average discrepancy between labels and predictions. We need to specify the prediction function $\operatorname{predict}$ and the loss function $\operatorname{loss}$. This lecture considers linear predictions

\[\operatorname{predict}(w;x) = w^\top x,\]

while non-linear predictions are considered in the following lecture.
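As a minimal sketch of the training objective above with the linear prediction, the code below evaluates the average loss and minimizes it. The squared loss is an assumption made here for illustration (the text has not fixed a loss function yet), and the data are synthetic. For the squared loss, the minimizer can be obtained with a least-squares solver.

```python
import numpy as np


def predict(w, x):
    # Linear prediction w^T x; works for one sample or a matrix of samples (rows).
    return x @ w


def squared_loss(y, y_hat):
    # Assumed loss function; other choices are possible.
    return (y - y_hat) ** 2


def objective(w, X, y):
    # Average discrepancy between labels and predictions.
    return np.mean(squared_loss(y, predict(w, X)))


# Synthetic data generated from a known linear rule (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

# For the squared loss, minimizing the objective is a least-squares problem.
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the synthetic labels are exactly linear in the samples, `w_fit` should agree with `w_true` up to rounding errors, and `objective(w_fit, X, y)` should be close to zero.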

Linear classifiers:

Observe that

\[w^\top x + b = (w, b)^\top \begin{pmatrix}x \\ 1\end{pmatrix}.\]

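This means that if we append a constant component $1$ to each sample $x_i$, it is sufficient to consider the classifier in the form $w^\top x$ without the bias (shift, intercept) $b$, because the bias becomes the last component of the extended weight vector. This allows for a simpler implementation.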
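The identity can be checked numerically. The following is a minimal sketch with randomly generated samples and arbitrarily chosen values of $w$ and $b$; the names `X_aug` and `w_aug` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # five samples with two components each
w = np.array([0.3, -1.2])     # arbitrary weights
b = 0.7                       # arbitrary bias

# Append a constant 1 to every sample and the bias to the weights.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

with_bias = X @ w + b          # w^T x + b for every sample
without_bias = X_aug @ w_aug   # (w, b)^T (x, 1) for every sample
```

Both vectors contain the same predictions, so code that handles only $w^\top x$ also covers models with an intercept once the samples are extended.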

BONUS: Data transformation

Linear models have many advantages, such as simplicity or guaranteed convergence for optimization methods. Sometimes it is possible to transform non-linear dependences into linear ones. For example, the body-mass index

\[\operatorname{BMI} = \frac{w}{h^2}\]

is equivalent to the linear dependence

\[\log \operatorname{BMI} = \log w - 2\log h\]

in logarithmic variables. We show the same table as for regression but with logarithmic variable values.

(Table with columns $\log x^1$, $\log x^2$ and $\log y$.)

It is not difficult to see the simple linear relation with coefficients $1$ and $-2$, namely $\log y = \log x^1 - 2\log x^2.$
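The linear relation in logarithmic variables can also be recovered from data. The sketch below uses synthetic weights and heights (not the values from the original table) and fits a linear model without intercept to the logarithms by least squares.

```python
import numpy as np

# Synthetic data (illustration only): weight in kg, height in m.
rng = np.random.default_rng(0)
weight = rng.uniform(50.0, 100.0, size=50)
height = rng.uniform(1.5, 2.0, size=50)
bmi = weight / height ** 2

# Transform to logarithmic variables: columns are log x^1 and log x^2.
A = np.column_stack([np.log(weight), np.log(height)])
log_y = np.log(bmi)

# Fit log y = c1 * log x^1 + c2 * log x^2 by least squares.
coef, *_ = np.linalg.lstsq(A, log_y, rcond=None)
```

Since the labels follow the BMI formula exactly, `coef` should be close to the coefficients $1$ and $-2$, whereas a linear model in the original variables $x^1$ and $x^2$ could not represent the dependence exactly.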