Customize a local chatbot in Streamlit


Large Language Models (LLMs) are the brains of various chatbots, and luckily we all have access to some brilliant cheap or even free LLMs now, thanks to the open-source community. It is possible to run LLMs on a PC and keep everything local. This blog presents my solution for building a chatbot that runs on my PC, with a fully local file storage system and a customized graphical user interface.

My local chatbot

Considering that my 4-year-old laptop has 16 GB of RAM and 4 GB of VRAM, I chose quantized LLMs running on llama.cpp. Both Gradio and Streamlit are sufficient for fast prototyping; I use Streamlit just because I think the app GUI looks better. To build a Streamlit app, one must deal with session state, which stores variables across reruns and can be manipulated by callback functions. For example, we need to keep the LLM in memory while chatting; otherwise, every time the app reruns, the LLM has to be reloaded. Streamlit reruns the script whenever an interaction happens, and loading an LLM takes a few seconds and occupies memory.
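As a minimal illustration of how session state survives reruns (the counter key here is just a toy example, not part of the chatbot):

import streamlit as st

# this script runs again on every interaction, but the stored value persists
if "counter" not in st.session_state:
    st.session_state.counter = 0

if st.button("Click me"):
    st.session_state.counter += 1

st.write(f"Button clicked {st.session_state.counter} times")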

Loading LLMs

I want to choose one model from a list of LLMs, and I may switch between different LLMs while chatting. Normally, loading an LLM with llama-cpp-python (through LangChain's LlamaCpp wrapper) looks like:

from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path=llm_path,
               n_threads=n_threads,
               max_tokens=1024,
               n_ctx=n_ctx,
               n_batch=256,
               n_gpu_layers=n_gpu,                 # number of layers to offload to the GPU
               callback_manager=callback_manager)  # CallbackManager wrapping a streaming handler (see Extra tips)

First, initialize llm in session state, list all of the available LLMs in a select box, and match the selected model to load it:

if "llm" not in st.session_state: st.session_state.llm=None

models_list = ('-','codellama-13b', 'code209k_13b', 'mistral-7b', 'zephyr-7b','solar-10.7b','dolphin_7b')

selected_model = st.selectbox("Select Model to load", models_list,index=0,on_change=_clear_ram)

if st.session_state.llm is None:     
    match selected_model:
        case 'codellama-13b':
            st.session_state.llm = load_llm(CODELLAMA_13b,chat_box,10240,3) 
        case 'mistral-7b':
            st.session_state.llm = load_llm(MISTRAL_7b,chat_box, 2048, 15) 

Here load_llm is a wrapper around LlamaCpp, and _clear_ram is a callback function, invoked when the selection changes, that clears the current LLM from memory:

import gc
import torch

def _clear_ram():
    # drop the loaded model from session state and free CPU RAM and GPU VRAM
    del st.session_state.llm
    gc.collect()
    torch.cuda.empty_cache()

Restore and clear chat history

After setting up the LLM chains, as shown in the last blog, the chatbot is ready to go. Since I want to keep the chat history in case I want to continue the conversation some other day, I save all history in txt files ordered by the time they were generated, and add a select box to choose which chat to continue. To make the chats easier to recognize, I name these txt files after the content of the first question, summarized by the LLM.
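A rough sketch of that naming scheme (the chat_history folder, the summarization prompt, and the llm call are placeholders for my own helpers; it assumes llm is a loaded LlamaCpp object, which can be called directly on a prompt):

import os
import re
from datetime import datetime

HISTORY_DIR = "chat_history"  # hypothetical folder, adjust to your own layout

def save_history(llm, first_question, messages):
    # ask the LLM for a short title of the first question
    title = llm(f"Summarize this question in five words or fewer: {first_question}")
    # keep only filename-safe characters and prepend a timestamp for ordering
    safe_title = re.sub(r"[^\w\- ]", "", title).strip().replace(" ", "_")[:40]
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    os.makedirs(HISTORY_DIR, exist_ok=True)
    path = os.path.join(HISTORY_DIR, f"{stamp}_{safe_title}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(messages))
    return path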

Moreover, facilitated by vector similarity search, I use VectorStoreRetrieverMemory to enhance the chatbot, so it only remembers the pieces of history that are relevant to the input question. Whether or not to remember the chat history is optional: click clear to remove the chat history from memory and save the current memory as history, and click restore to show the history on screen.
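A minimal sketch of wiring such a memory to a local FAISS index (the embedding model, its dimension, and k are assumptions, not necessarily what I use; it needs faiss-cpu and sentence-transformers installed):

import faiss
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import VectorStoreRetrieverMemory
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = faiss.IndexFlatL2(384)  # 384 = embedding dimension of this model
vectorstore = FAISS(embeddings.embed_query, index, InMemoryDocstore({}), {})
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# only the k most similar past exchanges are injected back into the prompt
memory = VectorStoreRetrieverMemory(retriever=retriever, memory_key="history")
memory.save_context({"input": "My name is Bob"}, {"output": "Nice to meet you, Bob"})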


[Image: rag]

Chat over documents

Aided by retrieval-augmented generation (RAG), it is possible for a local chatbot to read documents within an acceptable time. LangChain provides loads of file loaders, but more work needs to be done in practice. For example, one may need preprocessing tools to preserve the formatting of content read from a PDF file, and if the PDF is not a standard PDF but a scanned one consisting of images, extracting the content can be tricky (I use ocrmypdf to first transform the scanned PDF into a regular PDF, then feed it into libraries like PyMuPDF; see the sketch after the code block below). Here's an example of contextual compression in LangChain, which I use to filter relevant context from the retrieved texts when I start a Q&A over documents with the chatbot.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline, EmbeddingsFilter
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.text_splitter import CharacterTextSplitter

retriever = db.as_retriever(search_type="mmr", search_kwargs={'k': retrieve_pieces})
splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76, k=retrieve_pieces)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever)
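As for the scanned-PDF preprocessing mentioned above, here is a rough sketch using ocrmypdf's Python API and PyMuPDF (the file paths are placeholders):

import ocrmypdf
import fitz  # PyMuPDF

# add a text layer to a scanned PDF; skip pages that already contain text
ocrmypdf.ocr("scanned.pdf", "searchable.pdf", skip_text=True)

# now the text can be extracted page by page with PyMuPDF
doc = fitz.open("searchable.pdf")
text = "\n".join(page.get_text() for page in doc)
doc.close()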

Extra tips

If you set a streaming callback when loading the LLM, you will see the model's output appear token by token in the terminal. To get the same streaming effect in a Streamlit app, I modified the callback handler by adding a Streamlit container:

import sys
from typing import Any

from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult

class StreamHandler(BaseCallbackHandler):
    def __init__(self, container, initial_text=""):
        self.container = container
        self.text = initial_text

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # mirror each new token to the terminal and to the Streamlit container
        sys.stdout.write(token)
        sys.stdout.flush()
        self.text += token
        self.container.markdown(self.text)

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        # reset the buffer once a full answer has been generated
        self.text = ""
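For reference, roughly how the handler gets attached (chat_box is the container the answer streams into; passing the resulting callback_manager to LlamaCpp mirrors the loading code above):

import streamlit as st
from langchain.callbacks.manager import CallbackManager

chat_box = st.empty()                          # placeholder the tokens stream into
stream_handler = StreamHandler(chat_box)
callback_manager = CallbackManager([stream_handler])
# callback_manager is then passed to LlamaCpp(..., callback_manager=callback_manager)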

Although I built this for local use, you can easily add online LLMs such as GPT-3.5 or GPT-4; the GUI and file system remain the same. I will add more functions, such as presenting the retrieved pieces of documents before sending them to OpenAI, to save money and gain more control.

DIY Components

I have also combined Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) with the chatbot on my laptop. It works fine, but it is not flexible enough to actually help me in daily life. Next time I will try to run the chatbot on my phone or a Raspberry Pi. Peace.