LLM as a Chatbot Service
You can enable internet search in both the Gradio application and the Discord bot. For Gradio, there is an internet mode option in the control panel. For Discord, you need to specify the --internet option in your prompt. In both cases, you need a Serper API Key, which you can get from serper.dev. By signing up, you will get 2,500 free Google searches, which is sufficient for a long-term test.

The purpose of this repository is to let people use a variety of open-sourced, instruction-following, fine-tuned LLM models as a chatbot service. Because different models behave differently and require differently formatted prompts, I made a simple library, Ping Pong, for model-agnostic conversation and context management. I also made GradioChat, a UI similar in shape to HuggingChat but built entirely in Gradio. These two projects are fully integrated to power this project.
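For the internet search feature, a single Serper-backed lookup boils down to one HTTP call. The following is a minimal sketch, assuming the publicly documented google.serper.dev/search endpoint and X-API-KEY header; the key string is a placeholder, and the request is only built here, not sent:

```python
import json
import urllib.request

# Sketch (assumption: Serper's search endpoint and header name as
# documented at serper.dev). Builds one Google-search request.
def build_serper_request(query: str, api_key: str) -> urllib.request.Request:
    return urllib.request.Request(
        "https://google.serper.dev/search",
        data=json.dumps({"q": query}).encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
    )

req = build_serper_request("open source LLMs", "YOUR SERPER API KEY")
# urllib stores header names in capitalized form, e.g. "X-api-key".
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would consume one of the 2,500 free searches.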
This project has become one of the default frameworks at jarvislabs.ai, a cloud GPU VM provider with some of the cheapest GPU prices. Furthermore, all the weights of the supported popular open-source LLMs are pre-downloaded, so you don't need to waste money and time waiting for hundreds of GBs to download. You can try out any of the supported models in less than 10 minutes.
dstack is an open-source tool that lets you run LLM-based apps in a cloud of your choice with a single command. dstack supports AWS, GCP, Azure, Lambda Cloud, etc. Use the gradio.dstack.yml and discord.dstack.yml configurations to run the Gradio app and Discord bot via dstack. For more details on how to run this repository with dstack, read the official documentation by dstack.

Prerequisites
Note that the code only works with Python >= 3.9 and gradio >= 3.32.0.
$ conda create -n llm-serve python=3.9
$ conda activate llm-serve
Install dependencies.
$ cd LLM-As-Chatbot
$ pip install -r requirements.txt
Run the Gradio application
There is no required parameter to run the Gradio application. However, some small details are worth noting. When --local-files-only
is set, the application won't look up the Hugging Face Hub (remote); instead, it will only use files that are already downloaded and cached.
Hugging Face libraries store downloaded contents under ~/.cache
by default, and this application assumes so. However, if you downloaded the weights to a different location for some reason, you can set the HF_HOME
environment variable. Find out more about the environment variables here
In order to leverage the internet search capability, you need a Serper API Key. You can set it manually in the control panel or via the CLI; when specified in the CLI, it will be injected into the corresponding UI control. If you don't have one yet, get one from serper.dev. By signing up, you will get 2,500 free Google searches, which is sufficient for a long-term test.
$ python app.py --root-path "" \
--local-files-only \
--share \
--debug \
--serper-api-key "YOUR SERPER API KEY"
Prerequisites
Note that the code only works with Python >= 3.9.
$ conda create -n llm-serve python=3.9
$ conda activate llm-serve
Install dependencies.
$ cd LLM-As-Chatbot
$ pip install -r requirements.txt
Run the Discord bot application. Choose one of the modes via --mode-[cpu|mps|8bit|4bit|full-gpu]
. full-gpu
will be chosen by default (full
actually means half
precision - consider this a typo to be fixed later).
The --token
is a required parameter, and you can get it from the Discord Developer Portal. If you have not set up a Discord bot in the Discord Developer Portal yet, follow the "How to Create a Discord Bot Account" section of the freeCodeCamp tutorial to get the token.
The --model-name
is a required parameter, and you can browse the list of supported models in model_cards.json
.
--max-workers
is a parameter that determines how many requests are handled concurrently. It simply sets the value passed to the ThreadPoolExecutor
.
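A rough sketch of what that means in practice (illustrative code, not the project's actual implementation; `handle_request` is a stand-in for the real generation call):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the bot's real model.generate(...) call.
def handle_request(prompt: str) -> str:
    return f"response to: {prompt}"

# --max-workers simply caps how many requests run at once in the pool.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(handle_request, "hello")
    print(future.result())
```

With `max_workers=1`, a second mention of the bot waits until the first generation finishes; raise the value only if your GPU can serve multiple generations at once.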
When --local-files-only
is set, the application won't look up the Hugging Face Hub (remote); instead, it will only use files that are already downloaded and cached.
In order to leverage the internet search capability, you need a Serper API Key. If you don't have one yet, get one from serper.dev. By signing up, you will get 2,500 free Google searches, which is sufficient for a long-term test. Once you have the Serper API Key, you can specify it in the --serper-api-key
option.
Hugging Face libraries store downloaded contents under ~/.cache
by default, and this application assumes so. However, if you downloaded the weights to a different location for some reason, you can set the HF_HOME
environment variable. Find out more about the environment variables here
$ python discord_app.py --token "DISCORD BOT TOKEN" \
--model-name "alpaca-lora-7b" \
--max-workers 1 \
--mode-[cpu|mps|8bit|4bit|full-gpu] \
    --local-files-only \
--serper-api-key "YOUR SERPER API KEY"
Supported Discord Bot commands
There are no slash commands. The only way to interact with the deployed Discord bot is to mention it. However, you can pass some special strings while mentioning the bot.
@bot_name help
: displays a simple help message.
@bot_name model-info
: displays information about the currently selected (deployed) model from model_cards.json
.
@bot_name default-params
: displays the default parameters used in the model's generate
method. That is the GenerationConfig
, which holds parameters such as temperature
, top_p
, and so on.
@bot_name user message --max-new-tokens 512 --temperature 0.9 --top-p 0.75 --do_sample --max-windows 5 --internet
: all parameters except max-windows
dynamically determine the text generation behaviour, as in GenerationConfig
. max-windows
determines how many past conversations to look up as references. The default value is 3
, but you can increase it as the conversation grows longer. --internet
will try to answer your prompt by aggregating information scraped from Google search. To use the --internet
option, you need to specify --serper-api-key
when booting up the program.

Different models might have different strategies to manage context, so if you want to know the exact strategy applied to each model, take a look at the chats
directory. Here are the basic ideas I came up with initially. I found that long prompts eventually slow down the generation process a lot, so prompts should be kept as short and concise as possible. In a previous version, I accumulated all the past conversations, and that didn't go well.
Only the last N
conversations are kept. Think of N
as a hyper-parameter. As an experiment, currently only the past 2-3 conversations are kept for all models.
(intfloat/e5-large-v2
)
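The sliding-window idea above can be sketched as follows (illustrative names, not the project's actual API; `n` mirrors the --max-windows option):

```python
# Keep only the last n (user, assistant) exchanges as the prompt context.
def build_context(history, n=3):
    recent = history[-n:] if n > 0 else []
    return "\n".join(f"User: {u}\nAI: {a}" for u, a in recent)

history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
print(build_context(history, n=2))
```

With `n=2`, only the q3/a3 and q4/a4 exchanges are included, keeping the prompt short regardless of how long the conversation runs.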