Linear Regression

Task Formulation

We are given $N$ data points, where the features are all numeric.

$$D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$$

where for all $n \in \{1, \ldots, N\}$, the features and targets are both $\in \mathbb{R}$.


Finding a function that predicts the target $y$ for a given $x$ is called regression.

Linear Models

Linear Algebra (Background)


Given an input vector $x$ of dimension $d$, a linear model is a function with the following form:
$$h_w(x) = w_0x_0 + w_1x_1 + w_2x_2 + \ldots + w_dx_d$$

where $w_0, \ldots, w_d$ are the parameters/weights and $x_0 = 1$ is a dummy variable.

Shorthand:

$$h_w(x) = w^Tx = \sum_{j=0}^{d} w_jx_j$$

A model is linear as long as it is a linear combination of the weights - the input variables themselves can even be squared.
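As a concrete illustration (the weights and inputs below are made up, not from the notes), evaluating $h_w(x)$ amounts to a dot product once the dummy feature is prepended:

```python
import numpy as np

# Minimal sketch: evaluating h_w(x) = w^T x with the dummy feature x_0 = 1.
w = np.array([0.5, 2.0, -1.0])      # w_0 (bias), w_1, w_2 (illustrative values)
x = np.array([3.0, 4.0])            # raw 2-dimensional input

x_aug = np.concatenate(([1.0], x))  # prepend x_0 = 1
h = w @ x_aug                       # w^T x = 0.5 + 2*3 - 1*4 = 2.5
print(h)
```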

Example



There are infinitely many linear models that could represent the relationship between input x and output.

We want the linear model with the lowest loss on the data - the mean squared error (MSE) is commonly used as the loss.

The mean squared error of $h_w$ on the dataset is

$$J(w) = \frac{1}{N} \sum_{n=1}^{N} \left( h_w(x^{(n)}) - y^{(n)} \right)^2$$
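A minimal sketch of this loss in NumPy (the function name and the assumption that $X$ already contains the dummy column of ones are mine):

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model h_w(x) = w^T x on data (X, y).

    X: (N, d+1) data matrix, assumed to already include the dummy column x_0 = 1.
    y: (N,) target vector.
    """
    residuals = X @ w - y
    return np.mean(residuals ** 2)
```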

Question:

Select the correct statements:


  • Linear regression is used when the **target** is a **categorical** variable
  • In linear regression, we need to find **features** such that they fit to the weights of the linear model.
  • For EXPERIMENT 1, the best linear model will have $w_0 > 0$
  • For EXPERIMENT 1, the best linear model will have $w_1 < 0$
  • None of the above.

Background



Learning via Normal Equations

Viewing input/output as a matrix

Data Matrix
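The data matrix figure is not reproduced in the text; the usual convention, assumed in the rest of these notes, stacks the inputs (with the dummy feature $x_0 = 1$) as rows of $X$ and the targets into a vector $y$:

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times (d+1)}, \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix} \in \mathbb{R}^{N}$$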

The Normal Equation
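The equation itself does not appear in the text above; the standard closed-form solution that minimizes the MSE, given the data matrix $X$ and target vector $y$, is:

$$\hat{w} = (X^TX)^{-1} X^T y$$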

However, the normal equation has its problems:

  • The cost of the normal equation is $O(d^3)$ (for inverting the $d \times d$ matrix $X^TX$)
  • It will not work if $X^TX$ is not invertible
  • It will not work for non-linear models
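A minimal sketch of the normal-equation fit in NumPy (the function name and toy data are illustrative); `np.linalg.solve` is used rather than forming the inverse explicitly, which is cheaper and numerically safer:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w.

    X: (N, d+1) data matrix including the dummy column of ones.
    y: (N,) target vector.
    """
    # Equivalent to w = (X^T X)^{-1} X^T y, but solves the linear
    # system directly instead of computing the inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny illustrative dataset: y = 1 + 2x (noise-free), so we expect w ≈ [1, 2].
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(fit_normal_equation(X, y))   # ~ [1. 2.]
```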

Learning via Gradient Descent

  • Start at some initial $w$
  • Update each $w_j$ with a step in the opposite direction of the gradient:
$$w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}$$
  • The learning rate $\gamma > 0$ is a hyperparameter that determines the step size.
  • Repeat until a termination criterion is satisfied (a sketch of the full loop follows this list).
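A minimal sketch of the loop for linear regression with the MSE loss (the zero initialization, learning rate, and fixed iteration budget are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def gradient_descent(X, y, gamma=0.1, num_iters=1000):
    """Batch gradient descent for linear regression with MSE loss.

    X: (N, d+1) data matrix including the dummy column of ones.
    y: (N,) target vector.
    """
    N, D = X.shape
    w = np.zeros(D)                            # start at some w (here: all zeros)
    for _ in range(num_iters):                 # termination: fixed iteration budget
        grad = (2.0 / N) * X.T @ (X @ w - y)   # gradient of the MSE w.r.t. w
        w = w - gamma * grad                   # step against the gradient
    return w
```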

Gradient Descent Example



Multi-Variable Gradient Descent

GDCM

Both weights have to be updated at the same time; otherwise the first update spoils the result of the second equation (see the snippet below).
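A small sketch of the difference (the loss function and values here are made up for illustration):

```python
# Illustrative coupled loss J(w1, w2) = (w1 + w2 - 3)^2.
def dJ_dw1(w1, w2):
    return 2 * (w1 + w2 - 3)

def dJ_dw2(w1, w2):
    return 2 * (w1 + w2 - 3)

w1, w2, gamma = 0.0, 0.0, 0.1

# Correct: simultaneous update -- both gradients are evaluated at the old
# (w1, w2) before either weight is overwritten.
new_w1 = w1 - gamma * dJ_dw1(w1, w2)
new_w2 = w2 - gamma * dJ_dw2(w1, w2)
w1, w2 = new_w1, new_w2

# Wrong: sequential update -- the call to dJ_dw2 would already see the
# updated w1, which spoils the second update:
#   w1 = w1 - gamma * dJ_dw1(w1, w2)
#   w2 = w2 - gamma * dJ_dw2(w1, w2)   # uses the new w1, not the old one
```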

Gradient Descent - Analysis

Convexity


Theorem: The MSE loss function for a linear regression model is a convex function with respect to the model's parameters.

Definition (linearly dependent features): A feature is linearly dependent on other features if it is a linear combination of the other features, e.g. $x_3 = c_1x_1 + c_2x_2$.

Theorem: If the features in the training data are linearly independent, then the loss function (MSE + linear model) is strictly convex.

And thus,

Theorem: For a linear regression model with an MSE loss function, the Gradient Descent algorithm is guaranteed to converge to a global minimum, provided the learning rate is chosen appropriately. If the features are linearly independent, then the global minimum is unique.
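A short justification (not spelled out above): for $J(w) = \frac{1}{N}\lVert Xw - y \rVert^2$ the Hessian is constant,

$$\nabla^2 J(w) = \frac{2}{N} X^TX \succeq 0,$$

since $v^T X^TX v = \lVert Xv \rVert^2 \geq 0$ for every $v$, which gives convexity. If the features (columns of $X$) are linearly independent, then $Xv = 0$ only for $v = 0$, so the Hessian is positive definite, $J$ is strictly convex, and the global minimum is unique.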

A problem with Gradient Descent

Suppose we have $J(w)$ for a dataset of 2 data points with 2 features and 1 target variable:

$$J(w) = \left(w_1x_1^{(1)} + w_2x_2^{(1)} - y^{(1)}\right)^2 + \left(w_1x_1^{(2)} + w_2x_2^{(2)} - y^{(2)}\right)^2 = (w_1)^2 + (10w_2)^2$$

(after substituting in the data values)

Then its gradient is:

$$\nabla J(w) = \begin{bmatrix} \frac{\partial J(w)}{\partial w_1} \\[4pt] \frac{\partial J(w)}{\partial w_2} \end{bmatrix} = \begin{bmatrix} 2w_1 \\ 200w_2 \end{bmatrix}$$

The gradient descent update is then:

$$w_1 \leftarrow w_1 - 2\gamma w_1$$
$$w_2 \leftarrow w_2 - 200\gamma w_2$$

This causes a problem, which can be seen in the plot of the gradient below:

Gradient

The update to $w_2$ is larger than the update to $w_1$ by a factor of 100, because the same learning rate is applied to both weights. This can cause overshoot in the $w_2$ direction when performing descent.

How to fix this? (the first two options are sketched after this list)

  • Min-max scale the features to be within [0, 1]
  • Standardize the features based on the mean (and standard deviation)
  • Use a different learning rate for each weight
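A minimal NumPy sketch of the first two fixes (the function names are mine); each operates column-wise, i.e. per feature:

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature (column) of X into [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Shift each feature to zero mean and scale to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```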

Other problems in Gradient Descent

  • Slow for large datasets, since the entire dataset must be used to compute the gradient.

  • May get stuck at a local minimum/plateau on non-convex optimization problems.

Cheaper alternatives for the first problem (a sketch follows this list):

  • Use a subset of training examples at a time (mini-batch GD). Cheaper/faster per iteration.

  • Stochastic Gradient Descent (SGD) - select one random training example at a time.
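A minimal sketch of mini-batch gradient descent (the batch size and other hyperparameters are illustrative); setting `batch_size=1` recovers stochastic gradient descent:

```python
import numpy as np

def minibatch_gd(X, y, gamma=0.01, batch_size=32, num_epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression with MSE loss."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_epochs):
        order = rng.permutation(N)                  # shuffle once per epoch
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = (2.0 / len(batch)) * Xb.T @ (Xb @ w - yb)
            w = w - gamma * grad
    return w
```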

Linear Regression Summary

Summary

Feature Transformation

Engine Efficiency

Take a $d$-dimensional feature vector and transform it into an $M$-dimensional feature vector, where usually $M > d$.

$$x \in \mathbb{R}^d \rightarrow \phi(x) \in \mathbb{R}^M$$

Where,

  • The function $\phi$ is called a feature transformation/map
  • The vector $\phi(x)$ is called the transformed features

Examples of feature transformations
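One standard example, assuming a 1-dimensional input mapped to degree-3 polynomial features (so $d = 1$ and $M = 4$ including the dummy feature):

$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \end{bmatrix}, \qquad h_w(x) = w^T\phi(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$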


Using linear regression with feature transformation


Theorem: The convexity of linear regression with MSE is preserved with feature transformation.
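A minimal sketch combining a polynomial feature map with the normal-equation fit from earlier (the function names and toy data are illustrative):

```python
import numpy as np

def poly_features(x, degree=3):
    """Map a 1-D input array to polynomial features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)   # shape (N, degree+1)

# Toy data from a cubic relationship (no noise), so the fit should recover it.
x = np.linspace(-2, 2, 20)
y = 1 - 2 * x + 0.5 * x**3

Phi = poly_features(x, degree=3)                 # transformed data matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)      # linear regression on phi(x)
print(w)                                         # ~ [1, -2, 0, 0.5]
```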

TODO: interpretation of the partial derivatives (how gradient descent uses them); what a minimizer is; MSE/MAE tutorial; LR; accuracy/precision/F1 score.