Linear Regression

Task Formulation

We are given $N$ data points, where the features are all numeric.

$$D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$$

where for all $n \in \{1, \ldots, N\}$, the features and targets are both $\in \mathbb{R}$.


Finding a function that predicts the target $y$ for a given $x$ is called regression.

Linear Models

Linear Algebra (Background)


Given an input vector $x$ of dimension $d$, a linear model is a function with the following form:
$$h_w(x) = w_0x_0 + w_1x_1 + w_2x_2 + \ldots + w_dx_d$$

where $w_0, \ldots, w_d$ are the parameters/weights and $x_0 = 1$ is a dummy variable.

Shorthand:

$$h_w(x) = w^Tx = \sum_{j=0}^{d} w_jx_j$$

A model is linear as long as it is a linear combination of the weights - the input variables themselves can even be squared.
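As a concrete illustration (the weights and inputs below are made up, not from the notes), evaluating $h_w(x)$ amounts to a dot product once the dummy feature is prepended:

```python
import numpy as np

# Minimal sketch: evaluating h_w(x) = w^T x with the dummy feature x_0 = 1.
w = np.array([0.5, 2.0, -1.0])      # w_0 (bias), w_1, w_2 (illustrative values)
x = np.array([3.0, 4.0])            # raw 2-dimensional input

x_aug = np.concatenate(([1.0], x))  # prepend x_0 = 1
h = w @ x_aug                       # w^T x = 0.5 + 2*3 - 1*4 = 2.5
print(h)
```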

Example



There are infinitely many linear models that could represent the relationship between input x and output.

We want the linear model with the lowest loss on the data - the mean squared error (MSE) is commonly used as the loss.

The mean squared error of $h_w$ on the dataset is

$$J(w) = \frac{1}{N} \sum_{n=1}^{N} \left( h_w(x^{(n)}) - y^{(n)} \right)^2$$
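A minimal sketch of this loss in NumPy (the function name and the assumption that $X$ already contains the dummy column of ones are mine):

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model h_w(x) = w^T x on data (X, y).

    X: (N, d+1) data matrix, assumed to already include the dummy column x_0 = 1.
    y: (N,) target vector.
    """
    residuals = X @ w - y
    return np.mean(residuals ** 2)
```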

Question:

Select the correct statements:


  • Linear regression is used when the **target** is a **categorical** variable
  • In linear regression, we need to find **features** such that they fit to the weights of the linear model.
  • For EXPERIMENT 1, the best linear model will have $w_0 > 0$
  • For EXPERIMENT 1, the best linear model will have $w_1 < 0$
  • None of the above.

Background



Learning via Normal Equations

Viewing input/output as a matrix

Data Matrix
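The data matrix figure is not reproduced in the text; the usual convention, assumed in the rest of these notes, stacks the inputs (with the dummy feature $x_0 = 1$) as rows of $X$ and the targets into a vector $y$:

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times (d+1)}, \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix} \in \mathbb{R}^{N}$$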

The Normal Equation
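The equation itself does not appear in the text above; the standard closed-form solution that minimizes the MSE, given the data matrix $X$ and target vector $y$, is:

$$\hat{w} = (X^TX)^{-1} X^T y$$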

However, the normal equation has its problems:

  • The cost of the normal equation is $O(d^3)$ (for inverting the $d \times d$ matrix $X^TX$)
  • It will not work if $X^TX$ is not invertible
  • It will not work for non-linear models
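A minimal sketch of the normal-equation fit in NumPy (the function name and toy data are illustrative); `np.linalg.solve` is used rather than forming the inverse explicitly, which is cheaper and numerically safer:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w.

    X: (N, d+1) data matrix including the dummy column of ones.
    y: (N,) target vector.
    """
    # Equivalent to w = (X^T X)^{-1} X^T y, but solves the linear
    # system directly instead of computing the inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny illustrative dataset: y = 1 + 2x (noise-free), so we expect w ≈ [1, 2].
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(fit_normal_equation(X, y))   # ~ [1. 2.]
```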

Learning via Gradient Descent

  • Start at some initial $w$
  • Update each $w_j$ with a step in the opposite direction of the gradient:
$$w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}$$
  • The learning rate $\gamma > 0$ is a hyperparameter that determines the step size.
  • Repeat until a termination criterion is satisfied (a sketch of the full loop follows this list).
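A minimal sketch of the loop for linear regression with the MSE loss (the zero initialization, learning rate, and fixed iteration budget are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def gradient_descent(X, y, gamma=0.1, num_iters=1000):
    """Batch gradient descent for linear regression with MSE loss.

    X: (N, d+1) data matrix including the dummy column of ones.
    y: (N,) target vector.
    """
    N, D = X.shape
    w = np.zeros(D)                            # start at some w (here: all zeros)
    for _ in range(num_iters):                 # termination: fixed iteration budget
        grad = (2.0 / N) * X.T @ (X @ w - y)   # gradient of the MSE w.r.t. w
        w = w - gamma * grad                   # step against the gradient
    return w
```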

Gradient Descent Example



Multi-Variable Gradient Descent

GDCM

Both weights have to be updated at the same time; otherwise the first update spoils the result of the second equation (see the snippet below).
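A small sketch of the difference (the loss function and values here are made up for illustration):

```python
# Illustrative coupled loss J(w1, w2) = (w1 + w2 - 3)^2.
def dJ_dw1(w1, w2):
    return 2 * (w1 + w2 - 3)

def dJ_dw2(w1, w2):
    return 2 * (w1 + w2 - 3)

w1, w2, gamma = 0.0, 0.0, 0.1

# Correct: simultaneous update -- both gradients are evaluated at the old
# (w1, w2) before either weight is overwritten.
new_w1 = w1 - gamma * dJ_dw1(w1, w2)
new_w2 = w2 - gamma * dJ_dw2(w1, w2)
w1, w2 = new_w1, new_w2

# Wrong: sequential update -- the call to dJ_dw2 would already see the
# updated w1, which spoils the second update:
#   w1 = w1 - gamma * dJ_dw1(w1, w2)
#   w2 = w2 - gamma * dJ_dw2(w1, w2)   # uses the new w1, not the old one
```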

Gradient Descent - Analysis

Convexity


Theorem: The MSE loss function for a linear regression model is a convex function with respect to the model's parameters.

Definition (linearly dependent features): A feature is linearly dependent on other features if it is a linear combination of the other features, e.g. $x_3 = c_1x_1 + c_2x_2$.

Theorem: If the features in the training data are linearly independent, then the loss function (MSE + linear model) is strictly convex.

And thus,

Theorem: For a linear regression model with an MSE loss function, the Gradient Descent algorithm is guaranteed to converge to a global minimum, provided the learning rate is chosen appropriately. If the features are linearly independent, then the global minimum is unique.
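A short justification (not spelled out above): for $J(w) = \frac{1}{N}\lVert Xw - y \rVert^2$ the Hessian is constant,

$$\nabla^2 J(w) = \frac{2}{N} X^TX \succeq 0,$$

since $v^T X^TX v = \lVert Xv \rVert^2 \geq 0$ for every $v$, which gives convexity. If the features (columns of $X$) are linearly independent, then $Xv = 0$ only for $v = 0$, so the Hessian is positive definite, $J$ is strictly convex, and the global minimum is unique.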

A problem with Gradient Descent

Suppose we have $J(w)$ for a dataset of 2 data points with 2 features and 1 target variable:

$$J(w) = \left(w_1x_1^{(1)} + w_2x_2^{(1)} - y^{(1)}\right)^2 + \left(w_1x_1^{(2)} + w_2x_2^{(2)} - y^{(2)}\right)^2 = (w_1)^2 + (10w_2)^2$$

(after substituting in the data values)

Then its gradient is:

$$\nabla J(w) = \begin{bmatrix} \frac{\partial J(w)}{\partial w_1} \\[4pt] \frac{\partial J(w)}{\partial w_2} \end{bmatrix} = \begin{bmatrix} 2w_1 \\ 200w_2 \end{bmatrix}$$

The gradient descent update is then:

$$w_1 \leftarrow w_1 - 2\gamma w_1$$
$$w_2 \leftarrow w_2 - 200\gamma w_2$$

This causes a problem, which can be seen in the plot of the gradient below:

Gradient

The update to $w_2$ is larger than the update to $w_1$ by a factor of 100, because the same learning rate is applied to both weights. This can cause overshoot in the $w_2$ direction when performing descent.

How to fix this? (the first two options are sketched after this list)

  • Min-max scale the features to be within [0, 1]
  • Standardize the features based on the mean (and standard deviation)
  • Use a different learning rate for each weight
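A minimal NumPy sketch of the first two fixes (the function names are mine); each operates column-wise, i.e. per feature:

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature (column) of X into [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Shift each feature to zero mean and scale to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```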

Other problems in Gradient Descent

  • Slow for large datasets, since the entire dataset must be used to compute the gradient.

  • May get stuck at a local minimum/plateau on non-convex optimization problems.

Cheaper alternatives for the first problem (a sketch follows this list):

  • Use a subset of training examples at a time (mini-batch GD). Cheaper/faster per iteration.

  • Stochastic Gradient Descent (SGD) - select one random training example at a time.
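A minimal sketch of mini-batch gradient descent (the batch size and other hyperparameters are illustrative); setting `batch_size=1` recovers stochastic gradient descent:

```python
import numpy as np

def minibatch_gd(X, y, gamma=0.01, batch_size=32, num_epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression with MSE loss."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_epochs):
        order = rng.permutation(N)                  # shuffle once per epoch
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = (2.0 / len(batch)) * Xb.T @ (Xb @ w - yb)
            w = w - gamma * grad
    return w
```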

Linear Regression Summary

Summary

Feature Transformation

Engine Efficiency

Take a $d$-dimensional feature vector and transform it into an $M$-dimensional feature vector, where usually $M > d$.

$$x \in \mathbb{R}^d \rightarrow \phi(x) \in \mathbb{R}^M$$

Where,

  • The function $\phi$ is called a feature transformation/map
  • The vector $\phi(x)$ is called the transformed features

Examples of feature transformations
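One standard example, assuming a 1-dimensional input mapped to degree-3 polynomial features (so $d = 1$ and $M = 4$ including the dummy feature):

$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \end{bmatrix}, \qquad h_w(x) = w^T\phi(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$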


Using linear regression with feature transformation


Theorem: The convexity of linear regression with MSE is preserved with feature transformation.
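A minimal sketch combining a polynomial feature map with the normal-equation fit from earlier (the function names and toy data are illustrative):

```python
import numpy as np

def poly_features(x, degree=3):
    """Map a 1-D input array to polynomial features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)   # shape (N, degree+1)

# Toy data from a cubic relationship (no noise), so the fit should recover it.
x = np.linspace(-2, 2, 20)
y = 1 - 2 * x + 0.5 * x**3

Phi = poly_features(x, degree=3)                 # transformed data matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)      # linear regression on phi(x)
print(w)                                         # ~ [1, -2, 0, 0.5]
```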

TODO: interpretation of the partial derivatives (how gradient descent uses them); what a minimizer is; MSE/MAE tutorial; LR; accuracy/precision/F1 score.