Deploy LLMs on a phone with Arm chips

4 minute read

Published:

In this blog post, I'm going to share how I deployed LLMs (1–3B) and ASR models (Whisper tiny & base) on my old smartphone, a Xiaomi Mi 8 (dipper). It was released in 2018, and despite a broken screen and a worn-out battery after several years of heavy usage, it still runs daily lightweight apps with acceptable speed and fluency. So I wondered: if I deploy modern machine learning models, such as LLMs, on the phone, can I run them smoothly? Let's see.
The hardware configuration: an 8-core CPU with 4 big cores (Cortex-A75) and 4 small cores (Cortex-A55), 64 GB of storage, and 6 GB of RAM. It also has an integrated GPU, but I didn't find a way to use it during model inference, so in this trial all computation falls on the CPU. To better leverage the CPU, I removed the Android system (MIUI 12) and installed Ubuntu Touch on the phone. It is essentially a Linux OS that reuses the underlying Android framework to drive the hardware; it gives me much longer battery life than Android, and it is more convenient for me since I am a rookie in Android development. The models are quantized versions from llama.cpp and whisper.cpp.
To avoid the cumbersome work of building an app with a GUI, I run the models on the command line using the terminal app that ships with Ubuntu Touch, which for this purpose behaves just like the terminal on an Ubuntu desktop. All I need to do is compile the C++ code into an executable that runs on my phone. Since my laptop CPU is x86 and its versions of glibc and libstdc++ differ from the libraries on the phone, I could either compile on the phone directly or cross-compile on my PC with a specific toolchain. I kept all the heavy work on the PC and built my own toolchain with crosstool-NG, a tool designed exactly for building cross-compilation toolchains.
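As a sanity check before building anything big, a tiny probe program can confirm which glibc and libstdc++ a cross-compiled binary actually runs against on the target. This is only a sketch, assuming it is compiled with the crosstool-NG toolchain's g++ and then copied over to the phone:

```cpp
// probe.cpp -- minimal sketch to verify library versions on the target device.
// Assumption: built with the crosstool-NG aarch64 g++ and run on the phone.
#include <cstdio>
#include <gnu/libc-version.h>

int main() {
    // glibc version the binary is actually running against on the phone
    std::printf("glibc runtime version: %s\n", gnu_get_libc_version());
#ifdef __GLIBCXX__
    // libstdc++ header date macro baked in at compile time
    std::printf("libstdc++ (__GLIBCXX__): %d\n", __GLIBCXX__);
#endif
    return 0;
}
```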

Following the official tutorial, I set the glibc and libstdc++ versions according to my phone's configuration, while the GCC version is chosen based on the target code to build; in my case I use GCC 8.5. The libstdc++.so.6.0.25 that comes with GCC 8.5 does not match the libstdc++.so.6.0.21 on my phone; a simple way to address this is to upgrade the old libstdc++ on the phone (just replacing libstdc++.so works). I also tried other approaches, such as building the project in a Docker container from clickable (commonly used for deploying applications on Ubuntu Touch) with an old GCC (5.4); there I had to define intrinsics such as vld1q_s16_x2 manually, since they were only introduced in arm_neon.h in later GCC versions (a sketch of such a fallback is shown below). Anyway, solving the problems that pop up during cross-compiling helped me understand the code and the system better, and I finally got working executables from both llama.cpp and whisper.cpp.

After playing with models of different sizes on the phone, I found that models under 3B (quantized GGUF) strike a balance between speed and quality. I tried the Phi series from Microsoft and StableLM (1.6B, 3B) from Stability AI. Among them, Phi-3 mini (4-bit) is the largest and gives the best responses, but it is quite slow (1 token/s for prompt evaluation and 3 token/s for generation). The smallest one, StableLM 1.6B (4-bit), yields 12 token/s for prompt evaluation and 7 token/s for generation while maintaining good output quality. For daily use, I prefer this StableLM 1.6B, which gives better responses after finetuning while keeping a pleasant speed (10 token/s and 6 token/s).
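Coming back to the arm_neon.h issue from the clickable/GCC 5.4 build: this is roughly what such a fallback for the missing vld1q_s16_x2 intrinsic looks like. It is a minimal sketch, assuming int16x8x2_t and vld1q_s16 are already available; the version guard is illustrative and not taken from llama.cpp itself:

```cpp
#include <arm_neon.h>

// Minimal sketch of a fallback for vld1q_s16_x2, which old GCC's arm_neon.h
// does not provide. The guard below is an assumption; a real build would key
// it off the exact compiler version that lacks the intrinsic.
#if defined(__GNUC__) && (__GNUC__ < 8)
static inline int16x8x2_t vld1q_s16_x2(const int16_t *ptr) {
    int16x8x2_t r;
    r.val[0] = vld1q_s16(ptr);      // first 8 int16 lanes
    r.val[1] = vld1q_s16(ptr + 8);  // next 8 int16 lanes
    return r;
}
#endif
```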
I also built piper from source, in which case piper-phonemize must be built in advance. There are prebuilt binary releases available for aarch64/arm, but none of them were built against the old Linux kernel and glibc versions I have on my phone. Another thread-related bug arose when I naively combined piper with llama.cpp (running piper every time the LLM completed a response); setting the number of threads to 1 when loading the piper model (via SetInterOpNumThreads) avoided the error. With that, a voice assistant is ready to serve.
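For reference, the workaround looks roughly like this when loading the voice model through ONNX Runtime's C++ API. It is a minimal sketch rather than piper's actual code; the function name and model path are placeholders of mine:

```cpp
#include <onnxruntime_cxx_api.h>

// Minimal sketch (not piper's actual code): restrict ONNX Runtime threading
// when loading the voice model, i.e. the workaround described above.
Ort::Session load_voice(Ort::Env &env, const char *model_path) {
    Ort::SessionOptions options;
    options.SetInterOpNumThreads(1);  // the setting mentioned in the text
    options.SetIntraOpNumThreads(1);  // assumption: intra-op threads pinned to 1 as well
    return Ort::Session(env, model_path, options);
}
```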
This project shows that my 6-year-old phone is sufficient to run modern machine learning models after optimization. Next, I will test image processing on the phone by running some well-known computer vision models.