A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.
About me
This is a page not in the main menu.
Published:
Facilitated by the development of deep learning in subfields of computer vision, including single object tracking (SOT), video object detection (VOD), and re-identification (Re-ID), many methods of multi-object tracking (MOT) have emerged, such as various embedding models designed for object locations and track identities [1]. Tracking-by-detection remains an effective paradigm for MOT, since we can always combine off-the-shelf state-of-the-art detectors and Re-ID models to improve the MOT system. The idea is to detect the objects in the current frame, then associate the detected boxes with the tracked boxes from previous frames. In practice, we need a more sophisticated design (e.g., choosing similarity metrics for data association, using a Kalman filter to predict new track locations) and proper strategies (e.g., when to delete or initialize tracks). The Hungarian algorithm is commonly used to match objects once we have the similarity matrix. In this blog, I will try to explain how and why the Hungarian algorithm works from the perspective of linear programming.
The Hungarian algorithm, also known as the Kuhn-Munkres algorithm, is well known for solving the assignment problem: assigning tasks to agents (or jobs to workers) such that the total cost is minimized (or the total profit is maximized), subject to the constraint that each agent is assigned exactly one task and each task is assigned to exactly one agent.
We are given an $n \times n$ cost matrix $C$, where $c_{ij}$ is the cost of assigning agent $i$ to task $j$.
\(\begin{aligned} \text{min} & \sum_{i \in I} \sum_{j \in J} c_{ij}x_{ij} \\ s.t. & \sum_{j \in J}x_{ij} = 1 \text{ for all } i \in I \\ &\sum_{i \in I}x_{ij} = 1 \text{ for all } j \in J \\ & x_{ij} \geq 0 \end{aligned}\)
Note that $x_{ij} = 1$ if agent $i$ is assigned to task $j$, and $x_{ij} = 0$ otherwise.
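As a sanity check of the formulation above, here is a minimal sketch (assuming SciPy is available) that solves the LP relaxation directly; the 3×3 cost matrix is made up for illustration. The LP optimum comes out integral, which is why relaxing $x_{ij} \in \{0,1\}$ to $x_{ij} \geq 0$ is harmless here (the constraint matrix is totally unimodular).

```python
# Solve the assignment LP relaxation with linprog; a toy illustration.
import numpy as np
from scipy.optimize import linprog

C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])   # made-up 3x3 cost matrix
n = C.shape[0]

# Equality constraints: each row (agent) and each column (task) sums to 1.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j x_ij = 1 for agent i
    A_eq[n + i, i::n] = 1.0            # sum_i x_ij = 1 for task i
b_eq = np.ones(2 * n)

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.x.reshape(n, n).round(2))    # a 0/1 permutation matrix
print("optimal cost:", res.fun)        # 5.0 for this toy matrix
```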
The assignment problem is a classic linear programming problem, since both the objective function and the constraints are linear, even though it involves binary decision variables. We could model the problem as a bipartite graph and interpret the algorithm from a graph perspective (finding a perfect matching via augmenting paths), but this blog will not cover that. Let’s try to understand the Hungarian algorithm within the framework of linear programming. Since the primal problem is tricky to deal with directly, let’s go for the dual form of the problem.
The dual form of the Assignment Problem can be written as:
\(\begin{aligned} \text{max} &\sum_{i=1}^n u_i + \sum_{j=1}^n v_j \\ s.t. & u_i+v_j \leq c_{ij}, \ \forall i, j \end{aligned}\)
where $u_i$ is the dual variable associated with agent $i$’s assignment constraint, $v_j$ is the dual variable associated with task $j$’s assignment constraint.
The strong duality theorem of linear programming guarantees that if a feasible and bounded solution exists for the primal linear program, then the dual is feasible (and bounded as well, by weak duality) and attains the same optimal value as the primal [3]. Since the primal constraints ensure that every agent is assigned exactly one task and every task exactly one agent, and the problem is well-posed with the same number of agents and tasks, a feasible primal solution exists.
Given an $n \times n$ cost matrix (an adjacency matrix, in graph-theory terms) $C$, the algorithm is summarized in the 5 steps below:
For details of the $O(n^3)$ implementation, refer to this blog from cp-algorithms, where the dual variables $u, v$ are called potentials. The Hungarian algorithm advances toward the optimum by continuously increasing the sum of $u$ and $v$: \(f= \sum_{i=1}^n u[i] + \sum_{j=1}^n v[j]\)
In the context of MOT, the cost matrix is filled with IoU or similarity scores obtained from Re-ID features between objects in two consecutive frames. When the matrix is non-square, we can add dummy rows or columns filled with zeros and run more iterations.
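To make the association step concrete, here is a minimal sketch assuming axis-aligned boxes in `(x1, y1, x2, y2)` format and SciPy; the track and detection boxes and the 0.3 gating threshold are made-up examples. SciPy's `linear_sum_assignment` also accepts non-square matrices directly, so explicit zero-padding is not needed with this particular implementation.

```python
# Data association for tracking-by-detection: IoU cost + Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

tracks = np.array([[10, 10, 50, 50], [60, 60, 90, 90]], dtype=float)
dets = np.array([[12, 11, 52, 49], [58, 62, 92, 88], [100, 100, 120, 120]], dtype=float)

# Cost = 1 - IoU, so minimizing total cost maximizes total overlap.
cost = np.array([[1.0 - iou(t, d) for d in dets] for t in tracks])
row_ind, col_ind = linear_sum_assignment(cost)   # handles non-square matrices

for r, c in zip(row_ind, col_ind):
    if 1.0 - cost[r, c] > 0.3:                   # gate matches by a minimum IoU
        print(f"track {r} -> detection {c} (IoU={1.0 - cost[r, c]:.2f})")
```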
Reference:
[1] Wang, G., Song, M., & Hwang, J. N. (2022). Recent advances in embedding methods for multi-object tracking: a survey. arXiv preprint arXiv:2205.10766.
[2] Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[3] Matoušek, Jiří, and Bernd Gärtner. Understanding and using linear programming. Vol. 1. Berlin: Springer, 2007.
Published:
In this blog, I’m going to share how I deployed LLMs (1~3B) and ASR models (Whisper tiny & base) on my old smartphone, a Xiaomi 8 (dipper). It was released in 2018; with a broken screen and a depleted battery after several years of heavy usage, it still offers acceptable speed and fluency running daily lightweight apps. So I wondered: if I deploy modern machine learning models, such as LLMs, on the phone, can I run them smoothly? Let’s see.
The hardware configuration: an 8-core CPU with 4 big cores (Cortex-A75) and 4 small cores (Cortex-A55), 64 GB storage, and 6 GB RAM. It has an integrated GPU, which I didn’t find a way to utilize during model inference, so the computation falls entirely on the CPU in this trial. To better leverage the CPU, I uninstalled the Android system (MIUI 12) and ported Ubuntu Touch onto the phone. It is basically a Linux OS that uses the underlying Android framework to control the hardware; it gives me much longer battery life than Android, and more convenience, since I am a rookie in Android development. The models are quantized versions from llamacpp and whispercpp.
To avoid the cumbersome work of building an app with a GUI, I run the models on the command line using the terminal app that ships with Ubuntu Touch, which in this case can be regarded as the same terminal as on Ubuntu desktop. All I need to do is compile the C++ code into an executable that runs on my phone. Since my laptop CPU is x86, the versions of glibc and libstdc++ differ from the libraries on the phone, so I could either compile on the phone directly or cross-compile on my PC with a specific toolchain. I kept all the heavy work on the PC and built my own toolchain using crosstool-NG, a tool dedicated to building toolchains.
Published:
Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling significant advancements in tasks such as language translation, text summarization, and sentiment analysis. However, despite their impressive capabilities, LLMs are not without limitations. One of the most significant challenges facing LLMs today is inference speed: due to the sheer size and complexity of these models, making predictions or extracting information from them can be computationally expensive and time-consuming. Here are several ways to speed up LLMs without upgrading hardware:
Published:
Large Language Models (LLMs) are the brains of various chatbots, and luckily we all have access to some brilliant cheap or even free LLMs now, thanks to the open-source community. It is possible to run LLMs on a PC and keep everything local. This blog presents my solution for building a chatbot running on my PC, with a fully local file storage system and a customized graphical user interface.
Published:
Diffusion probabilistic models, or diffusion models for brevity, were first proposed in [1]. The modified version, Denoising Diffusion Probabilistic Models (DDPM) [2], was popularized due to the excellent high-quality samples it generated, and Latent Diffusion Models (LDM) [3] made diffusion models practical.
As a generative model, a diffusion model fits complex datasets in a computationally tractable way [1]. Specifically, inspired by non-equilibrium statistical physics, a diffusion model first uses a fixed Markov chain to gradually convert one distribution into another, usually by gradually adding noise to the data so that a very complex distribution becomes a simple one, e.g., a normal distribution. It then uses a trainable Markov chain to reconstruct the target distribution from the simple known distribution.
Here an image $x_0 \sim q(x_0)$ is transitioned to a noisy image $x_t$ over $t$ timesteps by applying the Markov diffusion kernel $q(x_t|x_{t-1})$, the forward diffusion kernel. In the reverse process, $x_t$ is converted back to the target image $x_0$ using the reverse diffusion kernel $p_\theta (x_{t-1}|x_t)$, where $\theta$ denotes trainable parameters.
As a generative model, a diffusion model is supposed to maximize the likelihood: \(p_\theta (x_0) = \int p_\theta(x_0,x_1,...,x_T)dx_1dx_2...dx_T\) However, integrating over all latent variables is intractable, so we maximize the Evidence Lower Bound (ELBO) instead.
In diffusion models, $x_0$ is the observed data, denoted by $x$ in the ELBO derivation above; $x_1, x_2,…,x_T$ are the hidden states, denoted by $z$; and $q_\phi(z|x)$ is a Markov chain modeled by Gaussian distributions, so we can drop $\phi$ in later derivations. We can simply rewrite the ELBO as: \(\text{log } p(x) \geq E_{q(x_{1:T}|x_0)} [\text{log } \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}]\) Taking the Markov property $q(x_t|x_{t-1})=q(x_t|x_{t-1},x_0)$ into account, we continue the derivation: \(\begin{aligned} \text{log } p(x) &\geq E_{q(x_{1:T}|x_0)} [\text{log } \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}]\\ &=E_{q(x_{1:T}|x_0)} [\text{log } \frac{p(x_T)\prod^T_{t=1}p_\theta(x_{t-1}|x_t)}{\prod^T_{t=1}q(x_t|x_{t-1})}]\\ &=E_{q(x_{1:T}|x_0)} [\text{log } \frac{p(x_T)p_\theta(x_0|x_1)\prod^T_{t=2}p_\theta(x_{t-1}|x_t)}{q(x_1|x_0)\prod^T_{t=2}q(x_t|x_{t-1},x_0)}]\\ &=E_{q(x_{1:T}|x_0)} [\text{log }\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_1|x_0)}+\text{log } \prod^T_{t=2} \frac{p_\theta(x_{t-1}|x_t)}{\frac{q(x_{t-1}|x_t,x_0)q(x_t|x_0)}{q(x_{t-1}|x_0)}}] \ \ \ \ \ \text{(Bayes rule)}\\ &=E_{q(x_{1:T}|x_0)} [\text{log }\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_1|x_0)}+\text{log }\frac{q(x_1|x_0)}{q(x_T|x_0)} +\text{log } \prod^T_{t=2} \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)}] \\ &=E_{q(x_{1:T}|x_0)} [\text{log }\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_T|x_0)}+\sum^T_{t=2}\text{log } \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)}]\\ &=E_{q(x_{1:T}|x_0)}[\text{log }p_\theta(x_0|x_1)]+E_{q(x_{1:T}|x_0)}[\text{log }\frac{p(x_T)}{q(x_T|x_0)}]+\sum^T_{t=2}E_{q(x_{1:T}|x_0)}[\text{log } \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)}] \\ &=E_{q(x_1|x_0)}[\text{log }p_\theta(x_0|x_1)]+E_{q(x_{T}|x_0)}[\text{log }\frac{p(x_T)}{q(x_T|x_0)}]+\sum^T_{t=2} \underbrace{E_{q(x_t,x_{t-1}|x_0)}[\text{log }\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)}]}_{E_{q(x_t|x_0)}[E_{q(x_{t-1}|x_t,x_0)}\text{log }[\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)}]]}\\ &=\underbrace{E_{q(x_1|x_0)}[\text{log }p_\theta(x_0|x_1)]}_{\text{RV1}}-\underbrace{D_{\text{KL}}(q(x_T|x_0)||p(x_T))}_{\text{RV2}}-\sum^T_{t=2} \underbrace{E_{q(x_t|x_0)}[D_{\text{KL}}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t))]}_{\text{RV3}} \end{aligned}\)
The essence of this derivation is to use Bayes rule and the Markov property, so we have the diffusion kernel written as:
\(q(x_t|x_{t-1},x_0)=\frac{q(x_{t-1}|x_t,x_0)q(x_t|x_0)}{q(x_{t-1}|x_0)}\)
Now let’s look at the interpretation of the ELBO, which consists of three terms:
a. RV1 is a reconstruction term, which can be optimized using a Monte Carlo estimate.
b. RV2 measures the distance between the distribution of the output of the forward process and the prior distribution of the reverse process. Since we assume both are standard Gaussian, RV2 is 0.
c. RV3 measures how close the learned denoising transition $ p_\theta(x_{t-1}|x_t) $ is to the ground-truth denoising transition $ q(x_{t-1}|x_t,x_0) $. RV3 is called the denoising matching term; we minimize it so that the learned denoising step approaches the true distribution.
The optimization cost lies in the sum of RV3 terms. Recall that for any timestep $t$ we have $x_t \sim \mathcal N(x_t;\sqrt{\bar \alpha_t}x_0,(1-\bar \alpha_t)I)$, and according to Bayes rule we have:
\(\begin{aligned} q(x_{t-1}|x_t,x_0)&=\frac{q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)}\\ &=\frac{\mathcal N(x_t;\sqrt{\alpha_t}x_{t-1},(1- \alpha_t)I) \ \mathcal N(x_{t-1};\sqrt{\bar \alpha_{t-1}}x_0,(1-\bar \alpha_{t-1})I)}{\mathcal N(x_t;\sqrt{\bar \alpha_t}x_0,(1-\bar \alpha_t)I)}\\ &\propto\text{exp} \left\{-[\frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{2(1-\alpha_t)}+\frac{(x_{t-1}-\sqrt{\bar \alpha_{t-1}}x_0)^2}{2(1-\bar \alpha_{t-1})}-\frac{(x_t-\sqrt{\bar \alpha_t}x_0)^2}{2(1-\bar \alpha_t)}]\right\} \\ &=\text{exp} \left\{-\frac{1}{2}[\frac{-2\sqrt{\alpha_t}x_tx_{t-1}+\alpha_t x_{t-1}^2}{1-\alpha_t}+\frac{x_{t-1}^2-2\sqrt{\bar \alpha_{t-1}}x_{t-1}x_0}{1-\bar \alpha_{t-1}}+\underbrace{\frac{x_t^2}{1-\alpha_t}+\frac{\bar \alpha_{t-1}x_0^2}{1-\bar \alpha_{t-1}}-\frac{(x_t-\sqrt{\bar \alpha_t}x_0)^2}{1-\bar \alpha_t}}_{C(x_t,x_0)}]\right\} \\ &=\text{exp} \left\{-\frac{1}{2}[(\frac{\alpha_t}{1-\alpha_t}+\frac{1}{1-\bar \alpha_{t-1}})x_{t-1}^2-2(\frac{\sqrt{\alpha_t}x_t}{1-\alpha_t}+\frac{\sqrt{\bar \alpha_{t-1}}x_0}{1-\bar \alpha_{t-1}})x_{t-1}+C(x_t,x_0)]\right\}\\ &=\text{exp} \left\{- \frac{1-\bar \alpha_t}{2(1- \alpha_t)(1-\bar \alpha_{t-1})}[x_{t-1}^2-2\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t+\sqrt{\bar \alpha_{t-1}}(1-\alpha_t)x_0}{1-\bar \alpha_t}x_{t-1}+\frac{\alpha_t(1-\bar\alpha_{t-1})^2x_t^2+\bar\alpha_{t-1}(1-\alpha_t)^2x_0^2+2\sqrt{\bar \alpha_t}(1-\alpha_t)(1-\bar \alpha_{t-1})x_tx_0}{(1-\bar \alpha_t)^2}]\right\}\\ &=\text{exp} \left\{ -\frac{1}{2 \cdot \frac{(1- \alpha_t)(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}}(x_{t-1}-\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t+\sqrt{\bar \alpha_{t-1}}(1-\alpha_t)x_0}{1-\bar \alpha_t})^2 \right\}\\ &=\mathcal N(x_{t-1};\underbrace{\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t+\sqrt{\bar \alpha_{t-1}}(1-\alpha_t)x_0}{1-\bar \alpha_t}}_{\mu_q(x_t,x_0)},\underbrace{\frac{(1- \alpha_t)(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}I}_{\Sigma_q(t)}) \end{aligned}\) Note that the $\alpha$ coefficients are known and frozen at each timestep. For $p_\theta(x_{t-1}|x_t)$ to approach $q(x_{t-1}|x_t,x_0)$, we can set its variance $\Sigma_\theta=\Sigma_q$ and construct its mean $\mu_\theta$ as a function of $x_t$. The mean of $q(x_{t-1}|x_t,x_0)$:
\(\mu_q(x_t,x_0)=\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t+\sqrt{\bar \alpha_{t-1}}(1-\alpha_t)x_0}{1-\bar \alpha_t}\)
Following the form of $\mu_q$, we have:
\(\mu_\theta(x_t,x_0)=\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t+\sqrt{\bar \alpha_{t-1}}(1-\alpha_t) \hat x_\theta(x_t,t)}{1-\bar \alpha_t}\)
where $\hat x_\theta(x_t,t)$ is the prediction of the original data $x_0$ from the noisy data $x_t$ at timestep $t$.
Our goal is to minimize the KL Divergence:
\(\begin{aligned} L &=\underset{\theta}{\text{arg min }} D_{\text{KL}}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t))\\ &=\underset{\theta}{\text{arg min }} D_{\text{KL}}(\mathcal N(x_{t-1};\mu_q,\Sigma_q(t))||\mathcal N(x_{t-1};\mu_\theta,\Sigma_q(t)))\\ &=\underset{\theta}{\text{arg min }} \frac{1}{2}[(\mu_\theta-\mu_q)^T\Sigma_q(t)^{-1}(\mu_\theta-\mu_q)]\\ &=\underset{\theta}{\text{arg min }} \frac{1}{2\sigma_q^2(t)}[||\mu_\theta-\mu_q||_2^2] \end{aligned}\)
Here we rewrite $\Sigma_q(t)$ as $\sigma_q^2(t)$. Substituting $\mu_q$ and $\mu_\theta$:
\(\begin{aligned} L &=\underset{\theta}{\text{arg min }} \frac{1}{2\sigma_q^2(t)}[||\frac{\sqrt{\bar \alpha_{t-1}}(1-\alpha_t)}{1-\bar \alpha_t}(\hat x_\theta(x_t,t)-x_0)||^2_2]\\ &=\underset{\theta}{\text{arg min }}\frac{\bar \alpha_{t-1}(1-\alpha_t)^2}{2\sigma_q^2(t)(1-\bar \alpha_t)^2}[||\hat x_\theta(x_t,t)-x_0||^2_2] \end{aligned}\)
In this way, a neural network is trained to predict the original data $x_0$ from noisy data $x_t$ at any timestep $t$. DDPM [2] takes a step further and writes $x_0$ using reparameterization as:
\(x_0 = \frac{x_t-\sqrt{1-\bar \alpha_t}\epsilon_0}{\sqrt{\bar \alpha_t}}\)
Substituting this into $\mu_q$:
\(\begin{aligned} \mu_q(x_t,x_0)&=\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t+\sqrt{\bar \alpha_{t-1}}(1-\alpha_t)x_0}{1-\bar \alpha_t}\\ &=\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t+(1-\alpha_t)\frac{x_t-\sqrt{1-\bar \alpha_t}\epsilon_0}{\sqrt{\alpha_t}}}{1-\bar \alpha_t}\\ &=(\frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}+\frac{1-\alpha_t}{(1-\bar \alpha_t)\sqrt{\alpha_t}})x_t - \frac{(1-\alpha_t)\sqrt{1-\bar \alpha_t}}{(1-\bar \alpha_t)\sqrt{\alpha_t}}\epsilon_0 \\ &=\frac{1}{\sqrt{\alpha_t}}x_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}\sqrt{\alpha_t}}\epsilon_0 \end{aligned}\)
Accordingly we set $\mu_\theta$ as:
\(\mu_\theta (x_t,t)=\frac{1}{\sqrt{\alpha_t}}x_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}\sqrt{\alpha_t}}\hat \epsilon_\theta(x_t,t)\)
and plug this into our optimization problem:
\(\begin{aligned} L&=\underset{\theta}{\text{arg min }} D_{\text{KL}}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t))\\ &=\underset{\theta}{\text{arg min }} D_{\text{KL}}(\mathcal N(x_{t-1};\mu_q,\Sigma_q(t))||\mathcal N(x_{t-1};\mu_\theta,\Sigma_q(t)))\\ &=\underset{\theta}{\text{arg min }} \frac{1}{2\sigma_q^2(t)}[||\mu_\theta-\mu_q||_2^2]\\ &=\underset{\theta}{\text{arg min }} \frac{1}{2\sigma_q^2(t)}[||\frac{1}{\sqrt{\alpha_t}}x_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}\sqrt{\alpha_t}}\hat \epsilon_\theta(x_t,t)-\frac{1}{\sqrt{\alpha_t}}x_t + \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}\sqrt{\alpha_t}}\epsilon_0||_2^2]\\ &=\underset{\theta}{\text{arg min }} \frac{1}{2\sigma_q^2(t)}[||\frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}\sqrt{\alpha_t}}(\epsilon_0-\hat \epsilon_\theta(x_t,t))||_2^2]\\ &=\underset{\theta}{\text{arg min }} \frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2}{(1-\bar \alpha_t)\alpha_t}[||\epsilon_0-\hat \epsilon_\theta(x_t,t)||_2^2] \end{aligned}\)
Here a neural network $\hat \epsilon_\theta(x_t,t)$ is learned to predict $\epsilon_0 \sim \mathcal N(\epsilon;0,I)$. Mathematically, learning the noise is equivalent to learning the original data, but in practice DDPM [2], which learns to predict the noise, achieves better performance.
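To tie the derivation together, here is a minimal PyTorch sketch of the noise-prediction objective, assuming a hypothetical `model(x_t, t)` network that outputs a tensor shaped like its input, and image batches shaped `(B, C, H, W)`; the linear beta schedule is illustrative rather than the tuned values from [2].

```python
# Sketch of the DDPM training objective: sample x_t in closed form, regress noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

def ddpm_loss(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))        # random timestep per sample
    eps = torch.randn_like(x0)                     # epsilon_0 ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # x_t ~ N(sqrt(ab)*x0, (1-ab)I)
    # Simplified loss from [2]: drop the timestep-dependent weight and
    # regress the injected noise directly.
    return torch.mean((eps - model(x_t, t)) ** 2)
```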
There are other equivalent interpretations related to score-based generative models; refer to [5] for more info.
This section gives some background knowledge to help readers better understand VAEs and diffusion models, and the relationship between the two popular types of generative models. Let’s start with the VAE; for a more detailed illustration, please read VAE. A Variational Autoencoder follows a simple encoder-decoder structure: the encoder $q(z|x)$ learns the distribution over latent variables $z$ given observations $x$, and the decoder $p(x|z)$ generates data in the observation space from the latent variables:
A generalization of the VAE called the Hierarchical Variational Autoencoder (HVAE) extends latent variables to multiple hierarchies. If each transition down the hierarchy is Markovian, we call it a Markovian HVAE (MHVAE), which is effectively a stack of VAEs:
A diffusion model is also a special MHVAE, but its encoder is pre-defined as a linear Gaussian model, and at the final timestep the distribution of the latent is a standard Gaussian. The latent dimension is equal to the data dimension, although improvements have been made using perceptual compression to speed up both training and inference [3].
Reference:
[1] Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International conference on machine learning. PMLR, 2015.
[2] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.
[3] Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[4] Nichol, Alexander Quinn, and Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” International conference on machine learning. PMLR, 2021.
[5] Luo, Calvin. “Understanding diffusion models: A unified perspective.” arXiv preprint arXiv:2208.11970 (2022).
Published:
Variational Autoencoders (VAEs) are a powerful class of generative models used in unsupervised learning, combining the strengths of both neural networks and probabilistic models. They aim to approximate an underlying latent space that represents the distribution of data. To achieve this, VAEs learn to encode high-dimensional input data into a lower dimensional set of latent variables using a variational inference process. By doing so, they can reconstruct the original input while also enabling manipulation and generation of new samples within the learned distribution. This is particularly useful for applications like image synthesis, where the ability to encode complex data and then decode it as new variations of that data is crucial.
An autoencoder is a neural network with an encoder-decoder structure. It aims to learn a compressed representation of the data during encoding (also referred to as feature extraction), and from that representation to generate new data (reconstruct the input) as similar as possible to the original. The formal definition [2]: an autoencoder learns the functions $A: \mathbb R^n \to \mathbb R^p$ (encoder) and $B: \mathbb R^p \to \mathbb R^n$ (decoder) that satisfy \(\underset{A,B}{\text{arg min }} E[\Delta (x,B \circ A(x))],\) where $\Delta$ is the loss function that measures the distance between the output generated from the decoder and the input, $E$ is the expectation, and $B \circ A(x)$ is the autoencoder output.
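As a minimal sketch of this definition in PyTorch: $A$ is the encoder, $B$ the decoder, and $\Delta$ the MSE reconstruction loss; the layer sizes and the random batch are arbitrary assumptions for illustration.

```python
# One training step of a plain autoencoder minimizing E[MSE(x, B(A(x)))].
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # A: R^784 -> R^32
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # B: R^32 -> R^784

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
x = torch.rand(64, 784)                    # dummy batch standing in for real data

opt.zero_grad()
x_hat = decoder(encoder(x))                # B(A(x))
loss = nn.functional.mse_loss(x_hat, x)    # Delta(x, B(A(x))) with MSE as Delta
loss.backward()
opt.step()
```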
Basically, Bayesian inference works this way: first we choose a prior distribution over the unknown parameters or latent variables, then update it to a posterior distribution after seeing the data. Variational Inference (VI) is an approach that obtains the posterior distribution by finding a surrogate distribution approximating the true distribution. Another well-known way to compute a posterior is MCMC, which tries to sample from the posterior and is usually computationally expensive (so slower than VI, but more accurate).
In more complex cases, such as image generation, the model is parameterized with neural networks and is not fully observed. We call such models deep latent variable models (DLVM), and the variables that are not observed latent variables $z$ (e.g., the encoded representation in autoencoders). We assume that the observed variables $x$ can be explained in terms of latent variables $z$: \(p_\theta(x)=\int p_\theta(x,z)dz=\int p_\theta(x|z)p(z)dz\) However, the marginal probability of the data under the model, $p_\theta(x)$, is intractable in DLVMs, since there is no analytic solution or efficient estimator for the integral (it is hard to compute for high-dimensional variables). Note that $p_\theta(x,z)$ is tractable and easy to compute. A tractable marginal likelihood $p_\theta(x)$ leads to a tractable posterior $p_\theta(z|x)$, and vice versa.
Variational inference approximates $p_\theta(z|x)$ and $p_\theta(x)$ using parameterized distributions (e.g., Gaussians) by minimizing an error (e.g., the Kullback-Leibler divergence) between the approximation and the target distribution. The inference is thus converted into an optimization problem. It turns out that minimizing the KL divergence is equivalent to maximizing the ELBO (Evidence Lower Bound); the derivation is given in the next section. One way to maximize the ELBO is to assume independence between the dimensions of $z$ (mean-field theory) and proceed with a closed-form optimization (coordinate ascent variational inference). VAE takes another approach, suitable for large datasets and deep neural networks: the Stochastic Gradient Variational Bayes estimator, and you will see how it works in the next section as well.
From the perspective of functional analysis, both the KL divergence and the ELBO are functionals whose extreme points we want to find; see below for more of the mathematics if you are interested.
Let’s interpret the VAE from the perspective of Bayesian inference. The encoder is an inference model $q_\phi (z|x)$ that obtains the compressed representation, i.e., latent variables $z$, given observed variables $x$. The decoder $p_\theta (x|z)$ decodes data from the $z$ space back to the $x$ space.
A VAE learns stochastic mappings between the observed $x$-space and a latent $z$-space. $\phi$ denotes the parameters of the encoder, and $\theta$ the parameters of the decoder. $\theta$ is optimized to make the generated data more compatible with the distribution of real data; $\phi$ is optimized so that the encoder $q_\phi (z|x)$ approximates the true but intractable posterior $p_\theta (z|x)$ of the generative model:
\(q_\phi (z|x) \approx p_\theta (z|x)\)
We know that the Kullback–Leibler divergence measures the distance between two distributions. Given the PDFs $q(z)$ and $p(z)$ of two continuous distributions, the KL divergence is defined as:
\(D_{KL}(q||p)=\int q(z)log(\frac{q(z)}{p(z)})dz\)
In practice, we choose the evidence lower bound (ELBO) as the optimization objective of variational methods, including VAE. See the following derivation.
\(\begin{aligned} logp_\theta(x)&=E_{q_\phi (z|x)}[logp_\theta(x)]\\ &=E_{q_\phi (z|x)}[log[\frac{p_\theta(x,z)}{p_\theta(z|x)}]]\\ &=E_{q_\phi (z|x)}[log[\frac{p_\theta(x,z)}{q_\phi (z|x)} \frac{q_\phi (z|x)}{p_\theta(z|x)}]]\\ &=E_{q_\phi (z|x)}[log[\frac{p_\theta(x,z)}{q_\phi (z|x)}]]+E_{q_\phi (z|x)}[log[\frac{q_\phi (z|x)}{p_\theta(z|x)}]] \\ &=L_{\theta,\phi}(x)+D_{KL}(q_\phi (z|x)||p_\theta (z|x)) \end{aligned}\)
From the equations above, we know that given $p_\theta (x)$, maximizing the ELBO is equivalent to minimizing the KL divergence, since the KL divergence between $q_\phi (z|x)$ and $p_\theta (z|x)$ is non-negative (equal to 0 if, and only if, the two distributions are equal).
\(\begin{aligned} L_{\theta,\phi}(x)&=logp_\theta(x)-D_{KL}(q_\phi (z|x)||p_\theta (z|x)) \\ &\leq logp_\theta(x) \end{aligned}\)
$L_{\theta,\phi}(x)$ is called the variational lower bound, aka the evidence lower bound (ELBO):
\(L_{\theta,\phi}(x)=E_{q_\phi (z|x)}[logp_\theta(x,z)-log q_\phi (z|x)]\)
We now see why we choose the ELBO over the KL divergence: in the ELBO, we deal with the joint distribution $p_{\theta}(x,z)$ and the inference-model posterior $q_\phi(z|x)$ (which is tractable because we can choose the surrogate model), tactically avoiding the intractable posterior $p_{\theta}(z|x)$ of the generative model. Notably, maximizing the ELBO w.r.t. $\theta$ and $\phi$ will not only minimize the KL divergence of $q_\phi (z|x)$ from $p_\theta (z|x)$, but also maximize $p_\theta(x)$, which means both the inference model (encoder) and the generative model (decoder) get better.
The ELBO can be optimized with stochastic gradient descent (SGD) using unbiased gradients w.r.t. all parameters ($\phi$ and $\theta$). For $\theta$, we have:
\(\begin{aligned} \nabla_\theta L_{\theta,\phi}(x) &= \nabla_\theta E_{q_\phi (z|x)}[logp_\theta(x,z)-log q_\phi (z|x)]\\&=E_{q_\phi (z|x)}[\nabla_\theta(logp_\theta(x,z)-log q_\phi (z|x))]\\ & \simeq \nabla_\theta logp_\theta(x,z) \end{aligned}\)
So we can simply sample $z$ from $q_\phi (z|x)$ and use a Monte Carlo estimator. However, to get the gradient of the ELBO w.r.t. $\phi$, we need the reparameterization trick, which makes $z$ a function of $\phi$, $x$, and a random variable $\epsilon$. We cannot follow the same method as we did for $\theta$: because $q_\phi (z|x)$ is a function of $\phi$, the following equation is not valid:
\(\begin{aligned} \nabla_\phi L_{\theta,\phi}(x) &= \nabla_\phi E_{q_\phi (z|x)}[logp_\theta(x,z)-log q_\phi (z|x)]\\& \neq E_{q_\phi (z|x)}[\nabla_\phi(logp_\theta(x,z)-log q_\phi (z|x))] \end{aligned}\)
After introducing $\epsilon$, the ELBO is rewritten as:
\(\begin{aligned} L_{\theta,\phi}(x) &= E_{q_\phi (z|x)}[logp_\theta(x,z)-log q_\phi (z|x)] \\ &= E_{p(\epsilon)}[logp_\theta(x,z)-log q_\phi (z|x)] \end{aligned}\)
where $z = g(\epsilon, \phi, x)$, and we can obtain $\nabla_\phi L_{\theta,\phi}(x)$ with a Monte Carlo estimator by sampling random noise $\epsilon \sim p(\epsilon)$.
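To make the trick concrete, here is a minimal PyTorch sketch, assuming (my assumptions, not spelled out above) a Gaussian encoder that outputs `mu` and `log_var`, and a standard normal prior $p(z)=\mathcal N(0,I)$:

```python
# Reparameterization: z = g(eps, phi, x) = mu + sigma * eps, differentiable in phi.
import torch

def reparameterize(mu, log_var):
    eps = torch.randn_like(mu)               # eps ~ p(eps) = N(0, I)
    return mu + torch.exp(0.5 * log_var) * eps

# KL(q_phi(z|x) || N(0, I)) has a closed form for Gaussians, so only the
# reconstruction term of the ELBO needs the Monte Carlo estimate above.
def gaussian_kl(mu, log_var):
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```

Because the randomness now enters only through $\epsilon$, gradients flow through `mu` and `log_var` to $\phi$, which is exactly what makes the SGVB estimator work.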
Reference:
[1] Kingma, Diederik Pieter. “Variational inference & deep learning: A new synthesis.” (2017).
[2] Bank, Dor, Noam Koenigstein, and Raja Giryes. “Autoencoders.” Machine learning for data science handbook: data mining and knowledge discovery handbook (2023): 353-374.
[3] Bishop, Christopher M., and Nasser M. Nasrabadi. Pattern recognition and machine learning. Vol. 4. No. 4. New York: springer, 2006.
[4] Calculus of Variations (https://courses.maths.ox.ac.uk/pluginfile.php/29848/mod_resource/content/1/Notes%202022.pdf)
Published:
Deep learning is a subfield of machine learning that involves training artificial neural networks with multiple layers to analyze and solve complex problems. These neural networks are capable of learning hierarchical representations of data, enabling them to extract features and patterns from raw input data. Deep learning has been used in a wide range of applications such as image recognition, natural language processing, and autonomous driving. This blog covers the basics of four different types of DL models, paving the way for further in-depth study of various DNNs.
Published:
The error state is the difference between the true state $x_t$ and the nominal state $x$. The nominal state does not take noise into account; it is just derived from the motion equations. As a result, errors accumulate; these are collected in the error state $\delta x$, and the ESKF estimates them. In parallel with the integration of the nominal state, the ESKF predicts a Gaussian estimate of the error state and injects the mean of the error state back into the nominal state.
When IMU measurement data arrives, we integrate it into the nominal state variable. Since this approach does not consider noise, the result will naturally drift quickly, so we put the error part into the ESKF as an error variable. The ESKF considers the effects of the various noises and biases and gives a Gaussian distribution describing the error state. At the same time, the ESKF, as a Kalman filter, also has a prediction process and a correction process, and the correction process relies on sensor observations other than the IMU. After the correction, the ESKF gives the posterior Gaussian distribution of the error; we then inject this error into the nominal state variable and reset the ESKF to zero, completing one loop.
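To make the loop concrete, here is a high-level sketch of one ESKF cycle; it is pseudocode-style Python under stated assumptions, and `integrate_imu`, `error_state_jacobian`, `observation_model`, and `inject_error` are hypothetical placeholders (a real filter would track quaternions, velocity, and IMU biases inside them).

```python
# One predict/correct/inject/reset cycle of an error-state Kalman filter.
import numpy as np

def eskf_step(nominal, P, imu, measurement=None):
    # 1. Predict: integrate IMU into the nominal state (noise-free motion model).
    nominal = integrate_imu(nominal, imu)
    # 2. Propagate the error-state covariance (error mean stays zero here).
    F, Q = error_state_jacobian(nominal, imu)
    P = F @ P @ F.T + Q
    if measurement is not None:
        # 3. Correct with a non-IMU sensor (e.g. GPS or camera).
        H, R, residual = observation_model(nominal, measurement)
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        delta_x = K @ residual                      # posterior error mean
        P = (np.eye(P.shape[0]) - K @ H) @ P
        # 4. Inject the error mean into the nominal state, then reset it to zero.
        nominal = inject_error(nominal, delta_x)
    return nominal, P
```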
For more derivations and details about the quaternion representation of the rotation part, please read Joan’s paper, Quaternion kinematics for the error-state Kalman filter, which is also the main reference for this blog.
Published:
The Kalman Filter (KF) is a recursive state estimator widely used in sensor/data fusion. It is the optimal Bayesian filter for a linear Gaussian system, since it minimizes the estimated error covariance; it is also known as a Gaussian filter. The KF is not applicable to discrete or hybrid state spaces. This blog helps you understand the Kalman Filter with simple examples and presents the mathematical derivation from two different perspectives.
The Kalman Filter is the optimal filter for a linear Gaussian system, i.e., a linear system in which essentially every variable is Gaussian distributed. The KF can be derived in many ways; this section introduces two derivations to help better understand it.
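As a concrete reference before the derivations, here is a minimal NumPy sketch of one predict-update cycle for a linear Gaussian system $x_k = A x_{k-1} + w$, $z_k = H x_k + v$; the constant-velocity model and noise values below are illustrative assumptions, not taken from the post.

```python
# One Kalman filter cycle: predict with the motion model, update with a measurement.
import numpy as np

def kf_step(x, P, z, A, H, Q, R):
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Example: 1D constant-velocity state [position, velocity], position-only sensor.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 1e-3 * np.eye(2)                       # process noise
R = np.array([[0.25]])                     # measurement noise
x, P = np.zeros(2), np.eye(2)
x, P = kf_step(x, P, z=np.array([1.0]), A=A, H=H, Q=Q, R=R)
```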
Published:
A concise definition from T. Mitchell: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” Regarding what learning is, Herbert A. Simon once defined it as “the ability of a system to improve its performance by executing some process.” Machine learning is a data-driven subject that applies machine learning models to data analysis and prediction. Generally, machine learning can be categorized into supervised learning, unsupervised learning, reinforcement learning, semi-supervised learning, and active learning. To design a machine learning system, we basically need to do the following tasks:
This blog revisits some classical supervised learning models, including K-nearest neighbours, Decision Tree, Naive Bayes, Perceptron, and Support Vector Machine.
Published:
Git is a distributed version control tool that runs on Linux, Unix, Mac, and Windows. With the help of Git, you don’t have to manually record and save copies every time files are changed, or worry about file exchange and protection when working on big projects with others. This post introduces common usage of Git.
Published:
This blog introduces two fascinating topics: object detection and SLAM.
Published:
Large Language Models (LLMs) have surprised us in many ways, such as talking like humans, processing paperwork, serving as a coding copilot, and analyzing data, simply by predicting the next token with a probability model. As LLMs get smarter and lighter, we can leverage models that approach or even surpass the capabilities of ChatGPT on our own devices, which is going to make our lives much easier. This blog shows how to build a chatbot that runs 100% locally and shares my thoughts on how LLMs can help in our daily life.
Published:
This blog post goes through the basics of point cloud data and related projects I have done, including point cloud registration and point cloud segmentation.
Published:
This blog post records the steps to set up the environment before I successfully trained and ran a deep learning model in a Docker container. Docker offers an isolated environment for software and dependencies and, by using container technology, ensures the application runs quickly and reliably from one computing environment to another. See the difference between a Docker container and a virtual machine at https://www.docker.com/resources/what-container/. With Docker installed on the host computer, build the Docker environment following the steps below.
Short description of portfolio item number 1
Short description of portfolio item number 2
Published in Journal 1, 2009
This paper is about the number 1. The number 2 is left for future work.
Recommended citation: Your Name, You. (2009). "Paper Title Number 1." Journal 1. 1(1). http://academicpages.github.io/files/paper1.pdf
Published in Journal 1, 2010
This paper is about the number 2. The number 3 is left for future work.
Recommended citation: Your Name, You. (2010). "Paper Title Number 2." Journal 1. 1(2). http://academicpages.github.io/files/paper2.pdf
Published in Journal 1, 2015
This paper is about the number 3. The number 4 is left for future work.
Recommended citation: Your Name, You. (2015). "Paper Title Number 3." Journal 1. 1(3). http://academicpages.github.io/files/paper3.pdf
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Published:
This is a description of your conference proceedings talk, note the different field in type. You can put anything in this field.
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.