Rather than solve the problem explicitly, construct an agent that learns a function to identify patterns in the data and make decisions.
Supervised Learning
Involves:
Data: A set of input-output pairs that the model learns from
Assume it represents the ground truth
Model: A function that predicts the output based on the input features.
Loss: A function used by a learning algorithm to assess the model's ability to make predictions over a set of data.
A learning algorithm is an algorithm that seeks to minimize the loss of a model over a training dataset. Used in the training phase.
After training, we usually evaluate the model on an unseen test set using various measures. This assesses how well the model generalizes.
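As a concrete illustration (not part of the original notes), here is a minimal sketch of this train/evaluate workflow in Python with scikit-learn; the Iris dataset and the decision tree model are arbitrary choices for the example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative dataset: any set of input-output pairs works here.
X, y = load_iris(return_X_y=True)

# Hold out an unseen test set to measure generalization later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier()        # the model h
model.fit(X_train, y_train)             # training: fit h to the training set

y_pred = model.predict(X_test)          # predictions on unseen data
print(accuracy_score(y_test, y_pred))   # performance measure on the test set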
Tasks in Supervised Learning
Classification
Objective: predict a discrete label - cat or dog?
Output is a categorical value
Cancer prediction, Spam detection
Regression
Objective: predict a continuous numerical value based on input features.
The output variable is a real number
Data formulation
A dataset is represented as:
$(x^{(i)}, y^{(i)})$
$x^{(i)} \in \mathbb{R}^d$ is the input vector of $d$ features of the $i$th data point.
In general, $x^{(i)}$ can be a multi-dimensional array of features.
$y^{(i)}$ is the label for the $i$th data point.
Regression: $y^{(i)} \in \mathbb{R}$
Classification: $y^{(i)} \in \{1, 2, \dots, K\}$ for classification with $K$ classes.
A dataset $D$ with $n$ examples can be represented as:
$$D = \left\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})\right\}$$
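As an illustrative sketch (the numbers are made up), this formulation typically maps to arrays in Python, with one row per data point:

import numpy as np

# n = 4 examples, d = 3 features: X[i] is x^(i), y[i] is y^(i).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.8, 4.8],
    [5.9, 3.0, 5.1],
])
y = np.array([0, 0, 1, 1])  # classification labels, y^(i) in {0, 1}

print(X.shape)  # (n, d) = (4, 3)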
Model
Refers to a function that maps inputs to outputs: $h : X \to Y$. $h(x)$ should best predict the output given $x$.
Performance Measure & Loss
Performance measures and loss functions take in a dataset $D$ and a model $h$ and return a number representing how well $h$ predicts the mapping.
Performance measure: is a metric that evaluates how well the model performs on a task. Used in evaluation/testing.
Loss function: Quantifies the difference between the model's predictions and the true outputs. It is used to guide the model's learning toward the optimal outcome.
Regression
Mean squared error (MSE)
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
Mean absolute error (MAE)
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y^{(i)} - \hat{y}^{(i)}\right|$$
$\hat{y}^{(i)}$ denotes the model's prediction for the $i$th data point.
These metrics can also be used as loss functions.
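As a sketch, both metrics are easy to implement directly with NumPy; the sample values below are illustrative:

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of squared prediction errors.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: average of absolute prediction errors.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375
print(mae(y_true, y_pred))  # 0.5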
Classification
Accuracy
$$\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(\hat{y}^{(i)} = y^{(i)}\right)$$
Confusion Matrix
Measure TPR, TNR, FPR, FNR, etc.
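A minimal sketch of accuracy and confusion-matrix counts with NumPy, assuming binary labels with 1 as the positive class (the example labels are made up):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_pred == y_true)

# Confusion matrix counts for the positive class (label 1).
tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
print(accuracy, tpr, fpr)  # 0.75 0.75 0.25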
Learning Algorithms
A learning algorithm $A$ takes in a training set $D_{\text{train}}$, consisting of input-output pairs, and seeks to find a model $h$ that approximates the true relationship between inputs and outputs. Learning algorithms can use loss functions to guide the search.
Example: Hiring Decision
Decision Tree
A decision tree is a model that outputs a categorical decision or prediction. A decision is reached by sequentially checking the attribute value at each node, starting from the root until a leaf is reached.
There generally exist multiple distinct DTs that are all consistent with the training data.
We prefer COMPACT decision trees, as they're less likely to be consistent with the training data solely due to chance.
Optimal decision trees
Entropy
Entropy quantifies the uncertainty or randomness of a random variable:
$$H(Y) = -\sum_{y} P(Y = y) \log_2 P(Y = y)$$
For binary outcomes with $P(Y = 1) = p$, this reduces to:
$$H(Y) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
Entropy Calculation Example
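For instance (an illustrative calculation, not from the original notes): a fair coin with $p = 0.5$ gives
$$H(Y) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \text{ bit}$$
while a heavily biased coin with $p = 0.9$ gives
$$H(Y) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469 \text{ bits}$$
The more predictable the outcome, the lower the entropy.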
Specific Conditional Entropy
Specific conditional entropy $H(Y \mid X = x)$ is the entropy of a random variable $Y$ given that the discrete random variable $X$ takes the specific value $x$.
Simply fix the value of the discrete random variable, then calculate the entropy as before.
Conditional Entropy
It is simply the expected value of the specific conditional entropy: for each value $x$ of $X$, weight $H(Y \mid X = x)$ by the probability $P(X = x)$ and sum:
$$H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x)$$
Information Gain
Information gain is a measure of the reduction in uncertainty about a particular decision after knowing the value of a specific variable.
$$IG(Y; X) = H(Y) - H(Y \mid X)$$
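A minimal Python sketch tying entropy, conditional entropy, and information gain together; the toy "Experience" data is made up for illustration:

import numpy as np

def entropy(labels):
    # H(Y) = -sum_y P(y) log2 P(y)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # IG(Y; X) = H(Y) - H(Y | X), where H(Y | X) is the expected
    # specific conditional entropy over the values of X.
    h_y_given_x = 0.0
    for value in np.unique(feature):
        mask = feature == value
        h_y_given_x += mask.mean() * entropy(labels[mask])
    return entropy(labels) - h_y_given_x

experience = np.array(["yes", "yes", "no", "no"])  # hypothetical attribute
hired      = np.array([1, 1, 0, 1])                # hypothetical decisions
print(information_gain(experience, hired))  # ~0.311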
Decision Tree Learning
Top-down, greedy, recursive algorithm.
Top-down: Build the decision tree from the root.
Greedy: At each step, choose the locally optimal attribute to split, such as based on immediate information gain, without considering future splits.
Recursive: For each value of the chosen attribute, apply DTL again on the corresponding subset of the data.
Stopping Criteria:
No more data or attributes
All data predicts the same class
DTL - Functions
choose_attribute(attributes, rows)
Accepts a set of attributes (columns) in the data, and the rows containing the data points.
Computes the information gain of each attribute
Returns the attribute with the highest information gain
majority_decision(rows)
Accepts the table rows containing the data points.
Returns the most common decision in the data points.
DTL - Base Case
No data available -> return default
Same classification for all rows -> return that classification
attributes is empty -> Take majority decision
def DTL(rows, attributes, default):
    if rows is empty:
        return default
    if all rows have the same classification:
        return that classification
    if attributes is empty:
        return majority_decision(rows)
    best_attribute = choose_attribute(attributes, rows)
    tree = a new decision tree with root best_attribute
    for each value of best_attribute:
        subrows = {rows with best_attribute = value}
        subtree = DTL(
            subrows,
            attributes - {best_attribute},
            majority_decision(rows)
        )
        add subtree as a branch to tree with label value
    return tree
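Below is a runnable Python sketch of DTL, assuming (for illustration) that each row is a dict and the label is stored under a "decision" key; choose_attribute selects the attribute with the highest information gain, and majority_decision returns the most common label:

from collections import Counter
import math

def entropy(rows, target="decision"):
    # H(Y) over the labels of the given rows.
    counts = Counter(row[target] for row in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def majority_decision(rows, target="decision"):
    # Most common decision among the data points.
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def choose_attribute(attributes, rows, target="decision"):
    # Attribute with the highest information gain IG(Y; X).
    def gain(attr):
        remainder = 0.0
        for value in {row[attr] for row in rows}:
            subset = [row for row in rows if row[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)

def DTL(rows, attributes, default, target="decision"):
    if not rows:
        return default
    classes = {row[target] for row in rows}
    if len(classes) == 1:
        return classes.pop()
    if not attributes:
        return majority_decision(rows, target)
    best = choose_attribute(attributes, rows, target)
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subrows = [row for row in rows if row[best] == value]
        tree[best][value] = DTL(subrows, attributes - {best},
                                majority_decision(rows, target), target)
    return tree

# Tiny hypothetical hiring table:
rows = [
    {"Experience": "yes", "Interview": "good", "decision": "hire"},
    {"Experience": "no",  "Interview": "good", "decision": "hire"},
    {"Experience": "no",  "Interview": "bad",  "decision": "reject"},
]
print(DTL(rows, {"Experience", "Interview"}, default="reject"))
# {'Interview': {'good': 'hire', 'bad': 'reject'}}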
Question:
What is the root node of the DT model based on this data?
[Hiring dataset table with attributes Experience, Interview, and Skills; not reproduced here.]
DT analysis
Theorem (Representational completeness of Decision Tree)
For any finite set of consistent examples with categorical features and a finite set of label classes, there exists a decision tree that is consistent with the examples.
But this is problematic: a tree that is consistent with every example can grow excessively complex and end up fitting noise or outliers in the data - overfitting.
DT pruning
Pruning refers to the process of removing parts of the tree. The goal is to produce a simpler, more robust tree that generalizes better.
Common pruning methods:
Max depth: limits the maximum depth of the DT.
Min samples per leaf: Set a minimum threshold for the number of samples required in each leaf node.
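As a sketch, these two methods correspond directly to the max_depth and min_samples_leaf parameters of scikit-learn's DecisionTreeClassifier; the dataset below is an arbitrary choice for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until it fits the training data, risking overfitting.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: cap the depth and require a minimum number of samples
# per leaf, trading training fit for better generalization.
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                random_state=0).fit(X_train, y_train)

print(full.score(X_test, y_test), pruned.score(X_test, y_test))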