Supervised Learning EP1 - Classical ML Models Recap

17 minute read

Published:

What is Machine Learning?

A concise definition from T. Mitchell: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”. As for what learning is, Herbert A. Simon once defined it as “the ability of a system to improve its performance by executing some process”. Machine learning is a data-driven subject that applies models to data analysis and prediction. Generally, machine learning can be categorized into supervised learning, unsupervised learning, reinforcement learning, semi-supervised learning and active learning. To design a machine learning system, we basically need to do the following tasks:

  1. Formalize the learning task
  2. Collect data
  3. Extract features
  4. Choose class of learning models
  5. Train model
  6. Evaluate model

This blog revisits some classical supervised learning models, including K-nearest neighbours, decision tree, naive Bayes, perceptron, support vector machine and logistic regression.

K-nearest neighbours

Here's how a basic KNN model works in classification tasks: given a new instance $x_{new}$, find its K nearest neighbours and assign $x_{new}$ to the majority class among them, aka majority voting (in regression tasks, return the mean of the target values of the K neighbours).
We usually pick the Euclidean distance to measure the distance between instances. In general, a distance function $d$ should satisfy the following properties: for any instances x, y, z in the sampling set,
1. $d(x, x) = 0;$
2. $d(x, y) = d(y, x);$
3. $d(x, y) + d(y, z) \geq d(x, z);$
We'd also like distances to be non-negative to avoid trouble; the Minkowski distance family (which includes the Euclidean and Manhattan distances) is a natural candidate.
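As a quick illustration, here is a minimal sketch of the Minkowski distance (the function name and signature are my own, not from any library):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance between two vectors; p=2 gives the Euclidean
    distance, p=1 the Manhattan distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
```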
Good to know some characteristics of k-nearest neighbour learning:
1. Instance-based learning or lazy learning: The model just memorizes the training data. Computation is mostly deferred to the classification phase, when there is a test example to be processed. Efficient methods such as kd-trees are usually used to accelerate the computation.
2. Local learner: assumes prediction should be mainly influenced by nearby instances
3. Uniform feature weighting: all features are uniformly weighted in computing distances
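Putting the pieces together, here is a minimal KNN classifier sketch in Python/NumPy (names and defaults are illustrative, not from any particular library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours,
    using the Euclidean distance (Minkowski with p=2)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count labels among neighbours
    return votes.most_common(1)[0][0]                    # majority class
```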


Decision Tree

Usually learning a decision tree involves 3 steps: attribute selection, tree generation and pruning. Classical methods such as ID3 (Quinlan 1986) and C4.5 (Quinlan 1993) take a greedy top-down learning strategy. Starting from the root with the full training set, for each node:
1. Choose the best attribute to be evaluated;
2. Split the node's training set into children according to the value of the chosen attribute, forming one child node per value;
3. Stop splitting a node if it contains examples from a single class, or there are no more attributes to test.

The best attribute is chosen based on information gain (IG). The IG of attribute A on training dataset D is defined as the difference between the entropy $H(D)$ of the dataset and the conditional entropy $H(D|A)$ of D given A: $$ IG(D,A) = H(D)-H(D|A) $$ In information theory, entropy measures the uncertainty of a random variable. Given a discrete random variable $X$ taking $n$ possible values, we have the probability distribution $P$ and entropy $H$: $$ P(X=x_i) = p_i, \ i = 1, 2, ..., n \\ H(X) = - \sum_{i=1}^n p_i \log_2 p_i $$ If we have 2 discrete random variables $(X, Y)$, the conditional entropy $H(Y|X)$ denotes the uncertainty of $Y$ given $X$, defined below: $$ H(Y|X) = \sum_{i=1}^n p_i H(Y|X=x_i) $$ In classification tasks, the entropy $H(D)$ of a set of labelled examples measures its label inhomogeneity, while $H(D|A)$ is the sum of the entropies of the subsets of examples obtained by partitioning on the values of A, weighted by their respective sizes. An attribute with high information gain tends to produce groups that are homogeneous in their labels, thus favouring their classification.
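To make the criterion concrete, here is a small sketch computing $H(D)$ and $IG(D,A)$ for categorical attributes (helper names are mine):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(D) = -sum_i p_i log2 p_i over the class distribution of D."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(attr_values, labels):
    """IG(D, A) = H(D) - H(D|A); H(D|A) is the size-weighted entropy
    of the subsets obtained by partitioning D on the values of A."""
    N = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / N * entropy(subset)
    return entropy(labels) - cond
```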
The information gain criterion tends to prefer attributes with a large number of possible values. Consider an extreme case: the unique ID of each example is an attribute that perfectly splits the data into singletons, but it is of no use on new examples. A measure of such spread is the entropy of the dataset wrt the attribute value instead of the class value: $$ H_A(D) = - \sum_{v \in Values(A)} \frac{\lvert D_v \rvert}{\lvert D \rvert} \log_2 \frac{\lvert D_v \rvert}{\lvert D \rvert} $$ The information gain ratio (IGR) downweights the information gain by this attribute value entropy: $$ IGR(D, A) = \frac{IG(D, A)}{H_A(D)} $$ Pruning is necessary, since a complex tree can easily overfit the training set, and requiring that each leaf contain only examples of a single class can lead to very complex trees. It is possible to accept impure leaves, assigning them the label of the majority of their training examples. **Pre-pruning** decides whether to stop splitting a node even if it contains training examples with different labels, while **post-pruning** learns a full tree and successively prunes it by removing subtrees. Usually a labelled validation set is used for **post-pruning** to improve the performance of the model. Here is the procedure:
1. For each node in the tree: evaluate the performance on the validation set when removing the subtree rooted at it;
2. If all node removals worsen performance, STOP;
3. Choose the node whose removal yields the best performance improvement;
4. Replace the subtree rooted at it with a leaf;
5. Assign to the leaf the majority label of all examples in the subtree;
6. Return to step 1.

Decision trees also apply to continuous-valued attributes by discretizing the continuous values. The discretization threshold can be chosen to maximize the attribute quality criterion (e.g. infogain); see the sketch after the procedure below:
1. Examples are sorted according to their continuous attribute values;
2. For each pair of successive examples having different labels, a candidate threshold is placed at the average of the two attribute values;
3. For each candidate threshold, the infogain achieved by splitting the examples according to it is computed;
4. The threshold producing the highest infogain is used to discretize the attribute.
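A minimal sketch of step 2, generating the candidate thresholds (the function name is mine; pair each candidate with `information_gain` above to pick the best one):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted attribute values whose
    labels differ - the only places a split can improve infogain."""
    pairs = sorted(zip(values, labels))
    return [(a + c) / 2.0
            for (a, la), (c, lc) in zip(pairs, pairs[1:])
            if la != lc]
```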


Naive Bayes

Naive Bayes is a classifier based on Bayes' theorem and a conditional independence assumption. Each input instance $x$ is described by a conjunction of attribute values $(a_1,..., a_m)$. The output class label belongs to a finite label set $Y$. The task is predicting the MAP target value given the instance: $$ \begin{aligned} y^* = argmax_{y_i \in Y}P(y_i|x) & = argmax_{y_i \in Y} \frac{P(a_1,...,a_m|y_i)P(y_i)}{P(a_1,...,a_m)} \\ & = argmax_{y_i \in Y} P(a_1,...,a_m|y_i)P(y_i) \end{aligned} $$ The Naive Bayes classifier learns the joint probability distribution of instances and labels, and then predicts the MAP target value for the new instance. However, the class conditional probabilities $P(a_1,...,a_m|y_i)$ are hard to learn, as the number of terms equals the number of possible instances times the number of target values. The Naive Bayes assumption simplifies this problem by assuming that attribute values are independent of each other given the target value: $$ P(a_1,...,a_m|y_i) = \prod_{j=1}^m P(a_j|y_i) \\ y^* = argmax_{y_i \in Y}\prod_{j=1}^m P(a_j|y_i)P(y_i) $$ The parameters to be learned thus reduce to the number of possible attribute values times the number of possible target values. The priors $P(y_i)$ can be learned as the fraction of training set instances having each target value, aka maximum-likelihood estimation (MLE), while $P(a_j=v_k|y_i=c)$ can be learned as the fraction of times the attribute value $v_k$ was observed in training examples of class $c$. Suppose there are N instances in the training set; we have: $$ \begin{aligned} P(a_j=v_k|y_i=c) &= \frac{\sum_{i=1}^N I(a_j=v_k,y_i=c)}{\sum_{i=1}^N I(y_i=c)}\\ &=\frac{N_{kc}}{N_c} \end{aligned} $$ Considering that an MLE probability could be 0, which would zero out the posterior and lead to bad results, we use Bayesian estimation to add priors over attribute values. Assuming a Dirichlet prior distribution (with parameters $\alpha_{1c},...,\alpha_{kc}$) for the attribute parameters, the posterior distribution is again Dirichlet (by conjugacy), and the smoothed estimate becomes: $$ P(a_j=v_k|y_i=c) = \frac{N_{kc}+\alpha_{kc}}{N_c+\alpha_{c}} $$ where $\alpha_c = \sum_k \alpha_{kc}$.
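Here is a compact sketch of training and prediction with the smoothed estimates (all names are illustrative; `alpha=1` gives Laplace smoothing):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y, alpha=1.0):
    """X: list of tuples of categorical attribute values; y: class labels.
    Returns priors P(c) and a smoothed conditional P(a_j = v | c)."""
    N = len(y)
    class_counts = Counter(y)
    priors = {c: n / N for c, n in class_counts.items()}
    counts = defaultdict(int)   # counts[(j, v, c)] = N_kc for attribute j
    values = defaultdict(set)   # distinct values observed per attribute
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            counts[(j, v, c)] += 1
            values[j].add(v)

    def cond_prob(j, v, c):
        # (N_kc + alpha_kc) / (N_c + alpha_c), with a symmetric prior
        K = len(values[j])
        return (counts[(j, v, c)] + alpha) / (class_counts[c] + alpha * K)

    return priors, cond_prob

def nb_predict(x, priors, cond_prob):
    """argmax over classes of log P(c) + sum_j log P(a_j | c)."""
    scores = {c: math.log(p) + sum(math.log(cond_prob(j, v, c))
                                   for j, v in enumerate(x))
              for c, p in priors.items()}
    return max(scores, key=scores.get)
```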


Perceptron

The perceptron is a linear classifier for solving binary classification problems. For a training dataset $D$, the perceptron learns a hyperplane $\omega x + b=0$ that separates the instances: $$ D = \{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\} \ where\ x_i \in R^n, y_i \in \{-1, +1\}\\ f(x)=sign(\omega x + b) $$ Assume that the dataset is linearly separable, i.e. for all instances $x_i$ with positive label $y_i=+1$, $\omega x_i + b \gt 0$; for all instances $x_i$ with negative label $y_i=-1$, $\omega x_i + b \lt 0$.
To find such a hyperplane, instead of directly minimizing the total number of misclassified instances, the perceptron minimizes the sum of distances from the misclassified instances $x_i \in M$ to the hyperplane. This loss function is continuously differentiable wrt $(\omega,b)$ and can be optimized by stochastic gradient descent (SGD): $$ L(\omega,b)=-\sum_{x_i \in M}y_i(\omega x_i+b) $$ The training procedure (see the sketch below):
1. Initialize $\omega_0$, $b_0$;
2. Pick an instance with label $(x_i,y_i)$;
3. If $y_i(\omega x_i+b) \leq 0$, update $$ \omega = \omega+\eta x_iy_i \\ b = b+\eta y_i $$
4. Loop over steps 2 ~ 3 until there is no misclassified instance.
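A minimal sketch of this training loop (epoch-based rather than random sampling; names and defaults are mine):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron on a linearly separable dataset; labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += eta * yi * xi       # SGD step on L(w, b)
                b += eta * yi
                mistakes += 1
        if mistakes == 0:                # converged: no misclassified instance
            break
    return w, b
```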
Note that the separating hyperplane is not unique: the perceptron can produce different solutions given different initializations $\omega_0$, $b_0$ or a different order of misclassified instances picked during learning.


Support Vector Machine

Linear SVM

The Support Vector Machine (SVM) is a linear classifier that selects the hyperplane maximizing the separation margin between classes (a large margin classifier); its solution depends only on a small subset of training examples, called support vectors.
Consider classifying a linearly separable dataset $Set_{train}$ into 2 classes: $$ Set_{train}: \{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\} \ where\ x_i \in R^n, y_i \in \{-1, +1\} $$ Usually we can find an infinite number of hyperplanes defined by $(\omega, b)$ that separate the data correctly (e.g. using the perceptron), but we can find the optimal hyperplane $(\omega^*, b^*)$ by maximizing the geometric margin $\gamma$: $$ \gamma = \min \gamma_i,\\ \gamma_i = y_i(\frac{\omega}{\lVert \omega \rVert} x_i+ \frac{b}{\lVert \omega \rVert}), i=1,...,N $$ Note that for a given hyperplane $\omega x + b = 0$, the distance between instance $x_i$ and the hyperplane is $\frac{1}{\lVert \omega \rVert} \lvert \omega x_i+ b \rvert$. A large distance denotes high confidence, and the label $y_i$ determines the correctness of the classification of $x_i$.

Finding $(\omega^*, b^*)$ for a hard margin SVM is a constrained optimization problem: $$ \begin{aligned} &\max_{\omega,b} \ \gamma\\ &s.t. \ y_i(\frac{\omega}{\lVert \omega \rVert} x_i+ \frac{b}{\lVert \omega \rVert}) \geq\gamma, i=1,...,N \end{aligned} $$ Substitute the geometric margin with $\gamma=\frac{\check{\gamma}}{\lVert \omega \rVert}$, where $\check{\gamma_i}=y_i(\omega x_i+ b)$ is the functional margin or confidence margin: $$ \begin{aligned} &\max_{\omega,b} \ \frac{\check{\gamma}}{\lVert \omega \rVert}\\ &s.t. \ y_i(\omega x_i+ b) \geq\check{\gamma}, i=1,...,N \end{aligned} $$ Note that there is an infinite number of equivalent formulations of the same hyperplane: $$ \begin{aligned} \omega x_i+ b&=0\\ \alpha(\omega x_i+ b)&=0, \ \forall \alpha \neq0 \end{aligned} $$ The canonical hyperplane is the one whose functional/confidence margin equals 1. We can substitute $\check{\gamma} = 1$, since the problem is equivalent for any $\check{\gamma}$, and convert $\max \frac{1}{\lVert \omega \rVert}$ into the equivalent problem $\min \frac{1}{2} \lVert \omega \rVert ^2$. Now we are dealing with a convex quadratic programming problem (the objective is quadratic, and the points satisfying the constraints form a convex set): $$ \begin{aligned} &\min_{\omega,b} \ \frac{1}{2} \lVert \omega \rVert ^2\\ &s.t. \ y_i(\omega x_i+ b)-1 \geq0, i=1,...,N \end{aligned} $$ After solving this convex optimization problem, we get the hyperplane separating the instances; it exists and is unique for a linearly separable dataset. $$ \omega^*x+b^*=0\\ f(x)=sign(\omega^*x+b^*) $$ $f(x)$ is the decision function for classification. This is the maximum margin method, the learning algorithm of the linear SVM.

Now let's solve the constrained optimization problem. First build the Lagrangian: $$ L(\omega,b,\alpha)=\frac{1}{2}\lVert \omega \rVert^2 - \sum_{i=1}^N \alpha_i y_i(\omega x_i+b)+\sum_{i=1}^N \alpha_i \\ \alpha_i \geq 0, i=1,2,...,N $$ where the $\alpha_i$ are Lagrange multipliers. Considering that we can set $\alpha_i \to +\infty$ for any $\omega,b$ that do not satisfy the constraint $y_i(\omega x_i+ b)-1 \geq0$: $$ \max_{\alpha}L(\omega,b,\alpha)=\begin{cases} \frac{1}{2}\lVert \omega \rVert^2, \text{constraints satisfied}\\ +\infty, \text{otherwise} \end{cases} $$ the original problem is equivalent to the following problem: $$ \min_{\omega,b} \max_{\alpha}L(\omega,b,\alpha) $$ The Lagrangian is minimized wrt $\omega, b$ and maximized wrt $\alpha_i$ (the solution is a saddle point).
Usually we tackle this min-max problem by solving its dual problem: $$ \max_{\alpha} \min_{\omega,b}L(\omega,b,\alpha) $$ First, minimize $L(\omega,b,\alpha)$ wrt $\omega,b$ by setting the partial derivatives to zero: $$ \frac{\partial}{\partial \omega}L(\omega,b,\alpha)=0 \Rightarrow \omega=\sum_{i=1}^N \alpha_i y_i x_i\\ \frac{\partial}{\partial b}L(\omega,b,\alpha)=0 \Rightarrow \sum_{i=1}^N \alpha_i y_i =0 $$ Substituting into the Lagrangian: $$ \begin{aligned} \min_{\omega,b}L(\omega,b,\alpha)=& \frac{1}{2}\lVert \omega \rVert^2 - \sum_{i=1}^N \alpha_i y_i(\omega x_i+b)+\sum_{i=1}^N \alpha_i \\ =&\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j(x_i x_j)-\sum_{i=1}^N \alpha_i y_i ((\sum_{j=1}^N \alpha_j y_j x_j)x_i+b)+\sum_{i=1}^N \alpha_i\\ =&-\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j(x_i x_j)+\sum_{i=1}^N \alpha_i \end{aligned} $$ The dual problem is then: $$ \begin{aligned} \max_{\alpha} \ &-\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j(x_i x_j)+\sum_{i=1}^N \alpha_i \\ s.t. \ &\sum_{i=1}^N \alpha_i y_i=0\\ & \alpha_i \geq 0, i=1,2,...,N \end{aligned} $$ This is still a constrained quadratic optimization problem. The solution $(\omega^*,b^*,\alpha^*)$ of the dual problem coincides with that of the primal problem when the Karush-Kuhn-Tucker (KKT) conditions are satisfied. In this case, the KKT conditions are: $$ \frac{\partial}{\partial \omega}L(\omega,b,\alpha)=0 \\ \alpha_i^* \geq 0, i=1,2,...,N\\ y_i(\omega^* x_i+b^*)-1 \geq 0, i=1,2,...,N\\ \alpha_i^*(y_i(\omega^* x_i+b^*)-1)=0, i=1,2,...,N\\ $$ From the KKT conditions we know that only the points lying on the minimal confidence hyperplane, called support vectors, contribute to the final decision function $f(x)$: $$ y_i(\omega^* x_i+b^*)=1 $$ The other points contribute nothing, since their multipliers must be zero by the KKT conditions (they could be removed from the training set). SVMs are sparse, meaning they typically have few support vectors. Note that the dual formulation is easier to solve (simpler constraints) than the primal formulation. The dual formulation has $N$ variables (the number of training examples), while the primal formulation has $d + 1$ variables (the number of features plus one); depending on the problem, one can choose the primal formulation if it has far fewer variables. The bias $b$ can be computed from the KKT conditions. Given an arbitrary support vector $x_i$, the KKT conditions imply: $$ y_i(\omega^* x_i+b^*)=1\\ b^*=\frac{1-y_i \omega^*x_i}{y_i} $$ For robustness, the bias is usually averaged over all support vectors. We also obtain the dual decision function by substituting $\omega=\sum_{i=1}^N \alpha_i y_i x_i$: $$ f(x)= sign(\omega x + b)=sign (\sum_{i=1}^N \alpha_i y_i (x_i x)+b) $$ The decision function is a linear combination of dot products between training points and the test point, where the dot product denotes similarity. The weights of the combination are $\alpha_i y_i$: a large $\alpha_i$ implies a large contribution towards class $y_i$ (times the similarity).
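In practice one rarely solves the QP by hand. Assuming scikit-learn is available, a linear SVC exposes exactly the dual quantities discussed above (the toy data below is made up; a very large C approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin

print(clf.support_vectors_)        # the x_i with alpha_i > 0
print(clf.dual_coef_)              # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)   # w* and b* recovered from the dual
```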
The SVM described above is called the hard margin SVM and applies to datasets that are strictly linearly separable. For datasets that are linearly separable except for some outliers, we can still use an SVM by adding slack variables to the constraints; this is called the soft margin SVM.


Non-linear SVM

Non-linearly separable problems need higher expressive power (i.e. more complex feature combinations). The non-linear SVM maps input examples into a higher dimensional feature space, where the non-linear problem is transformed into a linear one, and performs linear classification in that space: $$ \Phi: \mathcal{X} \to \mathcal{H} $$ where $\Phi$ is a function mapping each example in the input space $\mathcal{X}$ to a higher dimensional space $\mathcal{H}$. The feature mapping should increase the expressive power of the representation (e.g. by introducing features which are combinations of input features), so that examples are (approximately) linearly separable in the mapped space. The SVM algorithm is applied by simply replacing $x$ with $\Phi(x)$: $$ \begin{aligned} f(x)=& \ \omega \Phi(x)+b \\ =& \sum_{i=1}^N \alpha_i y_i \Phi(x_i) \Phi(x)+b\\ =& \sum_{i=1}^N \alpha_i y_i K(x_i,x)+b \end{aligned} $$ where $K(x_i,x)$ is the kernel function, defined as the dot product of $\Phi(x_i)$ and $\Phi(x)$: $$ K(x_i,x)=\Phi(x_i) \Phi(x) $$ Here are two commonly used kernel functions. 1. Polynomial kernel: $$ K(x,z)=(xz+1)^p $$ 2. Gaussian kernel: $$ K(x,z)=exp(- \frac{\lVert x-z \rVert ^2}{2 \sigma^2}) $$ SVMs also work in regression tasks, but this blog will not go into that.
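A minimal NumPy sketch of these two kernels (parameter names are mine):

```python
import numpy as np

def polynomial_kernel(x, z, p=2):
    """K(x, z) = (x.z + 1)^p"""
    return (np.dot(x, z) + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), a.k.a. the RBF kernel."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))
```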


Logistic Regression

The logistic regression model classifies instances based on conditional probability. Here is the logistic distribution: $$ F(x)=P(X \leq x)=\frac{1}{1+e^{-(x-\mu) \ / \gamma}}\\ f(x)=\frac{e^{-(x-\mu) \ / \gamma}}{\gamma (1+e^{-(x-\mu) \ / \gamma})^2} $$ The binomial logistic regression model: $$ P(Y=1|x)=\frac{exp(\omega x+b)}{1+exp(\omega x+b)}\\ P(Y=0|x)=1-P(Y=1|x) =\frac{1}{1+exp(\omega x+b)} $$ where $x \in R^n$ is the input, $Y \in \{0,1 \}$ is the output, $\omega \in R^n$ is the weight and $b \in R$ is the bias. For convenience, we can simplify the equations as: $$ P(Y=1|x)=\frac{exp(\omega x)}{1+exp(\omega x)}\\ P(Y=0|x)=\frac{1}{1+exp(\omega x)} $$ by setting $\omega = (\omega^{(1)},\omega^{(2)},...,\omega^{(n)},b)^T$, $x=(x^{(1)},x^{(2)},...,x^{(n)},1)^T$. We can tell that as the linear combination approaches positive infinity, $\omega x \to +\infty$, the conditional probability $P(Y=1|x)$ approaches 1, and vice versa.
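With the bias folded into $\omega$ as above, the model is just the sigmoid of a dot product; a minimal sketch:

```python
import numpy as np

def predict_proba(w, x):
    """P(Y=1|x) = exp(w.x) / (1 + exp(w.x)) = sigmoid(w.x);
    w and x both carry the folded-in bias term."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))
```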
We learn the model parameters $\omega, b$ by maximum likelihood estimation (MLE). Set $$ P(Y=1|x)=\sigma(x), P(Y=0|x)=1-\sigma(x) $$ $$ P(y_i|x_i;\omega)= \begin{cases} \sigma(x_i),y_i=1\\ 1-\sigma(x_i),y_i=0 \end{cases}\\ $$ Rewrite it as: $$ P(y_i|x_i;\omega)=\sigma(x_i)^{y_i}(1-\sigma(x_i))^{1-y_i} $$ Likelihood function: $$ \prod_{i=1}^N \sigma(x_i)^{y_i}(1-\sigma(x_i))^{1-y_i} $$ Log-likelihood function: $$ \begin{aligned} L(\omega)=& \sum_{i=1}^N[y_i \log \sigma (x_i)+(1-y_i)\log(1-\sigma(x_i))]\\ =& \sum_{i=1}^N[y_i \log \frac{\sigma(x_i)}{1-\sigma(x_i)}+\log(1-\sigma(x_i))]\\ =&\sum_{i=1}^N[y_i(\omega x_i)-\log(1+exp(\omega x_i))] \end{aligned} $$ Maximize $L(\omega)$ to find $\omega$, e.g. by gradient methods or Newton's method (see the sketch below). The multinomial logistic regression model handles multi-class classification tasks; suppose we have $K$ classes, $Y \in \{ 1,2,...,K \}$: $$ P(Y=k|x)=\frac{exp(\omega_k x)}{1+\sum_{k=1}^{K-1}exp(\omega_k x)}, k = 1,2,...,K-1\\ P(Y=K|x)=\frac{1}{1+\sum_{k=1}^{K-1}exp(\omega_k x)} $$ where $x \in R^{n+1}$, $\omega_k \in R^{n+1}$. Parameter estimation follows the same scheme as in the binomial logistic regression model.
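A minimal gradient-ascent sketch for the binomial case, using the gradient $\nabla L(\omega) = \sum_i (y_i - \sigma(x_i)) x_i$ implied by the last line of the log-likelihood above (the step size and iteration count are arbitrary):

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, epochs=1000):
    """Maximize L(w) by batch gradient ascent.
    X: (N, n+1) matrix with a trailing column of ones; y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # sigma(x_i) for every instance
        w += eta * X.T @ (y - p) / len(y)   # averaged gradient of the log-likelihood
    return w
```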