Supervised Learning EP2 - An introduction to Deep Learning


Deep learning is a subfield of machine learning that involves training artificial neural networks with multiple layers to analyze and solve complex problems. These neural networks are capable of learning hierarchical representations of data, enabling them to extract features and patterns from raw input data. Deep learning has been used in a wide range of applications such as image recognition, natural language processing, and autonomous driving. This blog covers the basics of four different types of DL models and paves the way for further in-depth study of various DNNs.

Multi-layer Perceptron

The Multi-Layer Perceptron (MLP) is a type of feedforward artificial neural network (ANN) that consists of multiple layers of interconnected neurons. The output of each neuron in a layer is passed on to the next layer, and this process continues through all the layers in the architecture. MLPs have shown great success in many artificial intelligence applications, such as image recognition, natural language processing, and speech recognition.
Let's review the single-layer perceptron first. The perceptron model applies a weighted sum to an input instance $x$ and uses a sign function (the activation function) to generate the final classification result. An MLP extends the perceptron by applying different activation functions and feeding the output of each perceptron as the input to the next. Layers between the input layer and the output layer are called hidden layers, and they contain non-linear activation functions.
SLP & MLP
Several commonly used activation functions:

1. Sigmoid
$$
f(x)=\frac{1}{1+\exp(-x)},\qquad f^{\prime}(x)=\frac{\exp(-x)}{(1+\exp(-x))^2}=f(x)(1-f(x))
$$
2. Rectified linear unit (ReLU)
$$
f(x)= \begin{cases} x, & x \geq 0\\[1ex] 0, & x \lt 0 \end{cases}\qquad f^{\prime}(x)=\begin{cases} 1, & x \geq 0\\[1ex] 0, & x \lt 0 \end{cases}
$$
3. Tanh
$$
f(x)=\frac{2}{1+\exp(-2x)}-1,\qquad f^{\prime}(x)=\frac{4 \exp(-2x)}{(1+\exp(-2x))^2}=1-f(x)^2
$$
Activation functions
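To make these concrete, here is a minimal NumPy sketch of the three activation functions and their derivatives (the function names are my own, chosen for illustration, not taken from any particular library):

```python
# Activation functions and their derivatives, matching the formulas above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # f'(x) = f(x)(1 - f(x))

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x >= 0).astype(float)   # 1 for x >= 0, 0 otherwise

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2       # f'(x) = 1 - f(x)^2

x = np.linspace(-3, 3, 7)
print(sigmoid(x), relu(x), tanh(x), sep="\n")
```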
How do we set the number of layers and the number of neurons per layer? So far no single model structure fits all problems; a suitable architecture for a specific task can be found using neural architecture search (NAS). Modern deep neural networks usually have tens or hundreds of layers, even though the Universal Approximation Theorem tells us that a 2-layer net with linear output and some squashing non-linearity in the hidden units can approximate any continuous function over a compact domain to arbitrary accuracy (given enough hidden units!).
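As an illustration of the layered structure described above, here is a minimal MLP sketch in PyTorch; the layer sizes (784-256-128-10) and the batch size are arbitrary assumptions chosen for demonstration:

```python
# A small MLP: two hidden layers with ReLU, softmax output over 10 classes.
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256),  # input layer -> hidden layer 1
    nn.ReLU(),            # non-linear activation in the hidden layer
    nn.Linear(256, 128),  # hidden layer 1 -> hidden layer 2
    nn.ReLU(),
    nn.Linear(128, 10),   # hidden layer 2 -> output layer (10 classes)
)

x = torch.randn(32, 784)              # a batch of 32 flattened 28x28 inputs
logits = mlp(x)                       # shape: (32, 10)
probs = torch.softmax(logits, dim=1)  # class probabilities per input
print(probs.shape)
```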


Convolutional Neural Network

Convolutional neural networks (CNN) are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers. CNNs are known for their ability to extract features from images, and they can be trained to recognize patterns within them. You may have heard of convolution in signal processing. For a 2-D image signal, the convolution operation can be written as: $$ f(x,y)=(g*k)(x,y)=\sum_{m} \sum_{n} g(x-m,y-n) k(m,n) $$ where $*$ is the convolution operation, $g$ is the 2-D image data and $k$ is the convolution kernel. In effect, the kernel is first rotated by 180 degrees, and then the element-wise products with the image patch are summed. In fact, the "convolution" in a CNN is not the same operation as above: it is actually a correlation (cross-correlation), which directly computes the element-wise products between the kernel and each image patch without flipping the kernel.
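To see the difference between true convolution (kernel flipped by 180 degrees) and the cross-correlation that CNN layers actually compute, here is a small SciPy sketch on a toy image; the kernel values are arbitrary:

```python
# True 2-D convolution vs. the cross-correlation used in CNN layers.
import numpy as np
from scipy.signal import convolve2d, correlate2d

g = np.arange(25, dtype=float).reshape(5, 5)   # toy "image"
k = np.array([[1., 0., -1.],
              [2., 0., -2.],
              [1., 0., -1.]])                  # arbitrary 3x3 kernel

conv = convolve2d(g, k, mode="valid")          # flips the kernel first
corr = correlate2d(g, k, mode="valid")         # element-wise product, no flip

# Convolution with k equals correlation with the 180-degree-rotated kernel.
print(np.allclose(conv, correlate2d(g, np.rot90(k, 2), mode="valid")))  # True
```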
In a CNN with multiple layers, neurons in a layer are only connected to a small region of the layer before it, which is so-called local connectivity. One kernel matrix (kernel map) is shared across all patches of an image, namely weight sharing, which allows the network to learn shift-invariant filter kernels (the same feature is recognized regardless of its location in the image, e.g. a cat is detected whether it is in the top left or in the middle of the image) and reduces the number of parameters.
Apart from convolutional layers, the design of CNNs involves operations such as pooling layers, non-linearities and fully connected layers. The convolutional layers apply filters to the input image for feature extraction, the pooling layers reduce the size of the feature maps produced by the convolutional layers and expand the receptive field from local to global, and the fully connected layers generate the prediction based on the features extracted by the CNN. Take Vgg16 as an example.

vgg16 architecture
Given a three-channel (RGB) input image of size 224 $\times$ 224, Vgg16 first pads it to 226 $\times$ 226 (padding, usually filling in zero-valued pixels around the image), then performs convolution with 64 conv kernels (filters) of size 3 $\times$ 3, obtaining 64 feature maps, i.e. 64 output channels of size 224 $\times$ 224, followed by ReLU. Padding keeps the size of the feature map the same as the input image, since without padding the output of a 3 $\times$ 3 convolution would be smaller than 224 $\times$ 224. The same convolution is performed again, then the feature map is shrunk to 112 $\times$ 112 by max pooling, which picks the maximum of each 2 $\times$ 2 pixel patch as the output value and thus halves the side length. Conv + ReLU + max pool is repeated, followed by three fully connected layers and a softmax (predicting the probability of each class). Here's the configuration of the VGG convnets.
vgg16 configuration
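As a sketch of the first block described above (two 3 $\times$ 3 convolutions with padding, each followed by ReLU, then 2 $\times$ 2 max pooling), assuming PyTorch as the framework:

```python
# First VGG-16 block: padding=1 keeps the 224x224 spatial size after each conv.
import torch
import torch.nn as nn

block1 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 3 -> 64 channels, 224x224 kept
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 64 -> 64 channels
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 224x224 -> 112x112
)

x = torch.randn(1, 3, 224, 224)
print(block1(x).shape)   # torch.Size([1, 64, 112, 112])
```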
Now let's calculate the number of parameters used in a convolution layer and in a fully connected layer, to get a better understanding of how a CNN works. A 3 $\times$ 3 conv kernel has 9 parameters. The first convolution layer uses 64 kernels of size 3 $\times$ 3 to process a 3-channel image, which gives 3 $\times$ 3 $\times$ 3 $\times$ 64 parameters (ignoring biases). We can regard each kernel as a cube of size 3 $\times$ 3 $\times$ 3, obtained by multiplying the kernel size by the number of input channels. **Parameter count = (conv kernel size * number of channels of the input feature map) * number of conv kernels**. The feature map produced by the last convolution of VGG-16 is 7 $\times$ 7 $\times$ 512, which is mapped to a 1 $\times$ 4096 vector by the first fully connected layer. How is that done? We effectively employ 4096 cube kernels of size 7 $\times$ 7 $\times$ 512 to convolve the feature map. The number of parameters of the first FC layer is therefore 7 $\times$ 7 $\times$ 512 $\times$ 4096, a huge number!
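A quick sanity check of these counts in plain Python (weights only, biases ignored):

```python
# Parameter counts for the first conv layer and the first FC layer of VGG-16.
first_conv = 3 * 3 * 3 * 64        # kernel size x input channels x number of kernels
first_fc   = 7 * 7 * 512 * 4096    # flattened feature map x output units
print(first_conv)  # 1728
print(first_fc)    # 102760448 -- roughly 102.8 million weights
```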
Vgg chooses smaller conv kernels compared to the larger kernels used in Alexnet (e.g. 11 $\times$ 11), which reduces the number of parameters. Resnet, a famous CNN after Vgg, introduces a short-cut structure that effectively alleviates the degradation problem of deep networks. Batch normalization and regularization operations such as dropout and weight decay are also commonly used in CNNs, and deserve a separate blog post.
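Here is a minimal sketch of a ResNet-style short-cut (residual) block in PyTorch; the channel count and exact layer ordering are illustrative assumptions, not a faithful reproduction of any specific ResNet variant:

```python
# A basic residual block: the input is added back to the conv output (short-cut).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                               # the short-cut path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)           # add the input back before activation

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```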


Recurrent Neural Network

Recurrent Neural Networks (RNN) are designed to analyze sequence data, such as text or time series data. RNNs have shown great success in a wide range of natural language processing tasks, such as speech recognition, machine translation, and text generation. They are also used in many applications related to time series analysis, such as forecasting and trend prediction. An RNN can be represented in a succinct way:
RNN
The hidden state $h_t$ is determined by the current input $x_t$ and $h_{t-1}$, the hidden state at time $t-1$; $f_W$ applies a non-linear transformation with learnable weights to $x_t, h_{t-1}$. The classifier is usually a non-linear operation using softmax, so that the output $y_t$ represents a probability: $$ h_t=tanh(W_x x_t + W_h h_{t-1}) \\ y_t = softmax(W_y h_t) $$ The unfolded graph provides an explicit description of which computations to perform:
RNN(unfold)
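A minimal NumPy sketch of this recurrence, unrolled over a toy sequence (the dimensions and random inputs are placeholders chosen for illustration):

```python
# h_t = tanh(W_x x_t + W_h h_{t-1}),  y_t = softmax(W_y h_t), unrolled over T steps.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_classes, T = 8, 16, 4, 5
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_y = rng.normal(scale=0.1, size=(num_classes, hidden_dim))

h = np.zeros(hidden_dim)                 # initial hidden state h_0
for t in range(T):
    x_t = rng.normal(size=input_dim)     # stand-in for the input at time t
    h = np.tanh(W_x @ x_t + W_h @ h)     # h_t depends on x_t and h_{t-1}
    y_t = softmax(W_y @ h)               # output distribution at time t
    print(t, y_t.round(3))
```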
RNNs are typically trained using backpropagation through time (BPTT): the unfolded network is treated as one big feed-forward network that accepts the whole time series as input. The weight updates are computed for each copy in the unfolded network, then summed or averaged and applied to the RNN weights. The loss function used here is cross entropy, where $y_{t_i}$ is the prediction for class $i, i=1,2,...,n$ (assume we have $n$ classes, so the vector $y_t$ has dimension $n \times 1$) and $\hat y_{t_i}$ is the corresponding one-hot ground-truth label. $$ E(y,\hat y) = \sum_t E(y_t,\hat y_t) \\ E(y_t,\hat y_t) = - \sum_i \hat y_{t_i} log y_{t_i}\\ \frac{\partial e_3}{\partial W_y}=\frac{\partial e_3}{\partial y_3} \frac{\partial y_3}{\partial W_y}=(y_3 - \hat y_3) \otimes h_3 $$ $\otimes$ is the outer product. $$ \begin{aligned} h_t&=tanh(W_x x_t + W_h h_{t-1})\\ \frac{\partial e_3}{\partial W_h} &= \frac{\partial e_3}{\partial y_3} \frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial W_h} \\&=\sum_{k=1}^3 \frac{\partial e_3}{\partial y_3} \frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial h_k}\frac{\partial h_k}{\partial W_h} \end{aligned} $$ The weights $W_h$ are shared, so the partial derivative of $e_3$ with respect to $W_h$ receives contributions through $h_1,h_2,h_3$. If input sequences are comprised of thousands of timesteps, then that many derivatives are required for a single weight update. This can cause gradients to vanish or explode, and thus poor model performance. BPTT also becomes computationally expensive as the number of timesteps increases; as a result, we use truncated BPTT instead (a sketch follows the figure below):

1. Present a sequence of $k_1$ timesteps of input and output pairs to the network.
2. Unroll the network, then calculate and accumulate errors across $k_2$ timesteps.
3. Roll up the network and update the weights.
4. Repeat.
BPTT
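Here is a sketch of truncated BPTT in PyTorch: unroll a window of $k$ timesteps, backpropagate within that window only, then detach the hidden state before the next window. The model, data and window length are toy placeholders, not a recipe from the original post:

```python
# Truncated BPTT: gradients only flow within each k-step window.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)
optim = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

k = 10                                      # truncation window length
x = torch.randn(1, 100, 8)                  # one long sequence of 100 timesteps
targets = torch.randint(0, 4, (1, 100))     # a class label per timestep

h = None
for start in range(0, x.size(1), k):
    chunk = x[:, start:start + k]
    out, h = rnn(chunk, h)                  # unroll k steps from hidden state h
    loss = loss_fn(head(out).flatten(0, 1), targets[:, start:start + k].flatten())
    optim.zero_grad()
    loss.backward()                         # errors accumulated within this window only
    optim.step()
    h = h.detach()                          # roll up: cut the graph at the window boundary
```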
Widely used RNN variants such as LSTM and GRU are not covered here.


Transformer

In recent years, transformers have shown their power in natural language processing. Applications like Google Translate and ChatGPT are based on the transformer architecture, which uses the self-attention mechanism to process sequential data in parallel and capture long-range dependencies.
Transformer
RNNs have poor parallel capability and easily forget long-distance information; self-attention overcomes these shortcomings. The attention mechanism allows the model to focus on specific parts of the input sequence and produce a weighted representation of the entire sequence, which can then be used for classification or prediction. Let's see what happens to the input text.

1. Each input word is turned into a vector using an embedding algorithm.
2. Create 3 vectors from each input embedding (X), i.e. query (Q), key (K) and value (V), by multiplying with three matrices learned during training. $$ Q=XW^q,K=XW^k,V=XW^v $$
3. Calculate an attention score for each word/embedding by the scaled dot product of the query and key matrices, and get the output value matrix of each word, which is then fed to the next layer. $$ Score(Q,K)=\frac{QK^T}{\sqrt{d_k}} \\ Output_v(Q,K,V)=softmax(Score(Q,K))V $$ where Score(Q,K) is the attention score, Output_v(Q,K,V) is the output of the attention layer, and $d_k$ is the dimension of the key vectors. Specifically, the attention score of one word is computed from the dot products between its query and the keys of all words, as displayed below (thanks to the author of this amazing gif):
Self-attention calculation process
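Putting steps 1-3 together, here is a sketch of single-head scaled dot-product self-attention in PyTorch; the sequence length, dimensions and random projection matrices are arbitrary assumptions standing in for learned weights:

```python
# Single-head scaled dot-product self-attention over a toy sequence.
import torch
import torch.nn.functional as F

seq_len, d_model, d_k = 4, 8, 8
X = torch.randn(seq_len, d_model)        # one embedding vector per word

W_q = torch.randn(d_model, d_k)          # learned during training; random here
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # Q = XW^q, K = XW^k, V = XW^v
scores = Q @ K.T / d_k ** 0.5            # scaled dot-product attention scores
weights = F.softmax(scores, dim=-1)      # each row sums to 1
output = weights @ V                     # weighted sum of value vectors
print(output.shape)                      # torch.Size([4, 8])
```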
4. Each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it and is followed by a layer-normalization step.
5. In the original transformer, which is an encoder-decoder structure, the output of the encoders is integrated into the decoder by an encoder-decoder attention layer; a decoder-only structure (GPT) also works perfectly, which reveals the charm of the self-attention mechanism.

To capture more information from the input text, or to give the attention layer multiple "representation subspaces", multi-head attention is normally used instead of plain self-attention. In essence, we train multiple small weight matrices, compute each head's output exactly as in single-head attention, and then concatenate the head outputs and project them to produce the final result. See the image referenced from: http://jalammar.github.io/illustrated-transformer/ . Another blog helps to understand the details of multi-head attention: https://hungsblog.de/en/technology/learnings/visual-explanation-of-multi-head-attention/
Multi-Head Attention
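A short sketch using PyTorch's built-in nn.MultiheadAttention module as a stand-in for the multi-head computation above; the choice of 2 heads and an embedding size of 8 is arbitrary:

```python
# Multi-head self-attention via PyTorch's built-in module (Q = K = V = X).
import torch
import torch.nn as nn

seq_len, d_model, num_heads = 4, 8, 2
X = torch.randn(1, seq_len, d_model)      # (batch, sequence, embedding)

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
out, attn_weights = mha(X, X, X)          # self-attention: queries, keys, values all from X
print(out.shape, attn_weights.shape)      # (1, 4, 8) and (1, 4, 4)
```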
Masked multi-head attention forces the model to see only the sequence to its left and masks the sequence to its right, which prevents cheating while decoding. In practice, we simply add negative infinity to each value in the upper-triangular part of the attention score matrix (excluding the diagonal), since softmax assigns such entries essentially zero weight. The transformer architecture has also shown performance comparable to CNNs in computer vision tasks since the vision transformer (ViT) came out. More details will be added here after further research results appear.
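A small sketch of this causal (look-ahead) mask, assuming PyTorch: fill the part of the score matrix above the diagonal with negative infinity before the softmax:

```python
# Causal mask: positions to the right of each token get -inf scores.
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)                   # raw attention scores
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
weights = F.softmax(masked_scores, dim=-1)
print(weights)   # upper-triangular entries are exactly 0 after softmax
```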