Running a Large Language Model Chatbot on your AWS server with a friendly web interface

By Gabriel, 31 Aug 2023, updated 02 Sep 2023

Why and how to run your own large language model on a traditional cloud provider like AWS.

Screenshot: chat with oobabooga text-generation-webui using Meta’s Llama 2 13B-chat model

Before you start

Why run your own LLM?

Large Language Model (LLM) tools like ChatGPT and GitHub Copilot can increase developer productivity. At the same time, they are hosted by third parties, and this has raised concerns in my company about what would happen to our sensitive data if we were to generalise the usage of those tools. In software development, one important concern is whether the details of our implementation (and its potential vulnerabilities) could leak into the responses given to other users of those tools. (ChatGPT and GitHub Copilot state that they may use user-submitted data to improve the product.)

On the other hand, some ChatGPT-equivalent LLMs have been released; a well-known one is LLaMA by Facebook. And there are countless examples on the Internet of people running them efficiently on their personal computers, which is a long way from the expensive and complex GPU-powered servers on Microsoft Azure used to run ChatGPT.

Let me pause here for a second to insist on that important point, well known to subject-matter experts but maybe not to other people, even IT workers: it IS possible to run locally, that is in complete isolation, on a big but affordable machine, an LLM that is almost as good as the current ChatGPT. One just has to download, for free, the “weights” of a model and the software to run it (the inference engine), give it a “prompt” and execute it.

How is this post different from the existing guides to running an LLM locally?

Most of the examples of running your own LLM that I have seen were from developers running it on their own (beefy) machine. Here I want to describe how to run it on an ordinary cloud provider, AWS (but it could be a competitor), and how to add a friendly web interface on top of it so it can be shared with a private group of users (company, association, school).

Choose a model: what are the good ones?

There are a lot of openly available models out there now, and keeping up with all of them has become impossible. I am most familiar with the LLaMA-derived models:

LLaMA-7B: historically the first one I tried, available early this year. Being a 7B model, it is still smallish but really fast to run
LLaMA-13B
Vicuna-13B: a fine-tuned version of LLaMA-13B, and the first one that triggered my interest because of its impressive capabilities. Still good today
LLaMA 2: the latest one to come out, an improvement on the previous ones. Facebook is doubling down on open models.

The good thing is that once you have the architecture in place to run your own LLM, you can switch easily from one model to another and pick the one that offers the best compromise for you. More details below on how to make that work.

What is the infrastructure required?

Traditionally, the inference engine required for those LLMs runs on GPU machines, not CPU machines, i.e. not the computers most commonly available at home, or even in the cloud to run websites. The neural network behind the engine executes loads of massive matrix multiplications, and GPUs, having many more (smaller) cores than CPUs, perform them better. A breakthrough came from the open source community in March 2023, when Georgi Gerganov released llama.cpp: a C/C++ port of the LLaMA inference code that runs on CPU, using a quantization technique (GGML) along the way to reduce the amount of memory and computation required and make it practical. His work has since been further improved by others. It has opened up the space for more applications of LLMs, and it is the approach I am describing in this post.
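To make the idea of quantization more concrete, here is a minimal sketch in Python. It is a simplified illustration only and not the actual GGML format: it assumes one scale factor per block of weights and symmetric 4-bit integers, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_block(block: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map a block of float weights to small signed integers plus one scale."""
    qmax = 2 ** (bits - 1) - 1  # largest integer magnitude, 7 for 4 bits
    max_abs = max(abs(w) for w in block)
    # One float scale is stored per block; avoid dividing by zero on an all-zero block.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return quantized, scale


def dequantize_block(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Illustrative weights, not taken from any real model.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"scale={scale:.4f} max_error={max_error:.4f}")
```

Each weight is then stored in 4 bits instead of 16 or 32, at the price of a small rounding error bounded by half the scale. That trade-off is what makes a 13B model fit in the RAM of an ordinary machine.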

Based on other people’s experience, wanting to run a 13B model at a practical speed, and being mindful of the renting cost, I chose an m5.2xlarge EC2 instance: 8 cores (vCPUs), 32 GB RAM, Spot price ~0.16 USD/hour.
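A back-of-the-envelope calculation suggests why 32 GB of RAM is comfortable for a quantized 13B model. This is only a rough sketch: the figure of 4.5 bits per weight is an assumption (4-bit values plus some per-block overhead), and the estimate ignores the extra memory needed at runtime for the context.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_13B = 13e9  # nominal parameter count of a "13B" model

fp16_gb = model_size_gb(N_PARAMS_13B, 16)   # unquantized 16-bit weights
q4_gb = model_size_gb(N_PARAMS_13B, 4.5)    # assumed ~4.5 bits per weight

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit quantized weights: ~{q4_gb:.1f} GB")
```

Under these assumptions the quantized weights take well under a third of the instance memory, while 16-bit weights would already use most of it.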

It is also possible to run your own LLM on a GPU machine at AWS (SageMaker), but that is not the goal here.

Where to download the resources?

Get the model weights from the popular site huggingface.co
Get the inference engine llama.cpp from GitHub
Get the web admin interface from oobabooga on GitHub

Now let’s get started!

The work

Prerequisites: have an AWS account, and be familiar with the AWS console, with the process of creating EC2 instances (including attaching related resources like EBS storage), and with using ssh.

Create an EBS volume of 200 GB (gp3): the vicuna-13b-ggml-4bit model is ~10 GB and, since I planned to add other models, I thought that 100 GB (the default) was too small; 200 GB should be enough. I don’t want to go too big either, because I may leave that EBS volume live for a couple of weeks (unlike the EC2 instance, which I will kill when not experimenting): 200 GB * 0.096 USD/GB-month ~ 20 USD/month
- stick to the defaults for IOPS and Throughput (because I don’t know better at this stage)
Create an EC2 instance m5.2xlarge
- Amazon Linux 2023 AMI image (for simplicity?…)
- with a new ssh key pair (…)
- Network: create a security group, allow port 22 (for ssh access) and port 7860 (web admin default) traffic from anywhere, and allow HTTPS traffic from the internet (to open up the web admin on that port later)
- EBS volume for the AMI root: gp3, 30 GB, increased from the default 8 GB to the free-tier maximum (staying reasonable), which also leaves me room to copy the weights onto that volume if that gives better performance
Attach the EBS volume to the instance
SSH into the instance
Format and mount the attached volume (see the AWS guide):
- lsblk
- sudo file -s /dev/nvme1n1
- format the volume: sudo mkfs -t ext4 /dev/nvme1n1
- sudo mkdir /data
- sudo mount /dev/nvme1n1 /data/
- cd /data
Install the required packages:
- sudo yum update
- sudo yum install git
- sudo yum groupinstall "Development Tools"
Download the models into a shared folder. You only need one to begin with, but here are the links to the popular ones I tried and found interesting.
- cd /data
- mkdir models ; cd models
- wget -S https://huggingface.co/TheBloke/LLaMa-13B-GGML/resolve/main/llama-13b.ggmlv3.q4_K_M.bin
- wget -S https://huggingface.co/TheBloke/vicuna-13b-v1.3.0-GGML/resolve/main/vicuna-13b-v1.3.0.ggmlv3.q4_K_M.bin
- wget -S https://huggingface.co/TheBloke/LLaMa-7B-GGML/resolve/main/llama-7b.ggmlv3.q4_K_M.bin
- wget -S https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_K_M.bin <- new one
Install text-generation-webui
- cd /data
- wget https://github.com/oobabooga/text-generation-webui/releases/download/installers/oobabooga_linux.zip
- unzip oobabooga_linux.zip
- cd oobabooga_linux
- chmod u+x *.sh
- ./start_linux.sh (the first run is long because it installs the dependencies)
- launch it
Subsequent launches
- cd /data/oobabooga_linux/
- Set some configuration values: export OOBABOOGA_FLAGS="--listen --chat --model-dir /data/models --model vicuna-13b-v1.3.0.ggmlv3.q4_K_M.bin --gradio-auth myusername:mypassword --threads 8". Notes about those values:
  - --listen: make it reachable through the network
  - --chat: use chat mode (more ChatGPT-like, instead of text prediction)
  - --model-dir: pass the folder where the models have been downloaded, which allows us to switch the model used through the web UI after launch
  - --model: set the default model
  - --gradio-auth: add a simplistic authentication (because the app will be reachable from anywhere on the Internet)
  - --threads: number of threads to use (set to 8 to match the number of cores here)
- ./start_linux.sh

Then open the URL in your browser: http://[YOUR-INSTANCE-IP]:7860
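Since switching the default model only means changing one value in OOBABOOGA_FLAGS, the flag string can be generated rather than edited by hand. The helper below is a hypothetical sketch (build_flags is not part of text-generation-webui); it only assembles the flags already used in the steps above, with the same placeholder credentials.

```python
import shlex


def build_flags(
    model: str,
    model_dir: str = "/data/models",
    threads: int = 8,
    auth: str = "myusername:mypassword",  # placeholder credentials, change them
) -> str:
    """Assemble the value of OOBABOOGA_FLAGS for a given default model."""
    args = [
        "--listen",
        "--chat",
        "--model-dir", model_dir,
        "--model", model,
        "--gradio-auth", auth,
        "--threads", str(threads),
    ]
    # shlex.join quotes any argument that would be unsafe in a shell.
    return shlex.join(args)


flags = build_flags("llama-2-13b-chat.ggmlv3.q4_K_M.bin")
print(flags)
```

The printed string can then be pasted into the export command before running ./start_linux.sh.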

Examples of chat:

Prompt: “Implement a regular expression in Python to validate an email address.”

llama-13b.ggmlv3.q4_K_M.bin Response

The simplest regex is this one: [a-zA-Z0-9]+@[a-z]{2,4}
We can write a simple function that does the validation of emails for us and we will use it later on when we create our program to send/receive emails. We can do something like this (this is written in Python):

def validateEmail(emailAddress: str) -> bool:
    emailPattern = re.compile("[a-zA-Z0-9]+@[a-z]{2,4}")
    matched = emailPattern.match(emailAddress)
    if not matched: return False # if the pattern did not match
    return True # everything is fine!

The function will return a Boolean value depending on whether it matches with our pattern or not. If we do this validation before sending an email, we can be sure that there won’t be any errors in your emails and you don't have to [truncated]

vicuna-13b-v1.3.0.ggmlv3.q4_K_M.bin Response

import re

def validate_email(email):
    # Regular expression pattern for validating email addresses
    regex = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

    # Check if the input is a string and not empty
    if type(email) != str or email == "":
        return False

    # Use re.match() to match the pattern in the input string
    match = re.match(regex, email)

    # Return True if the email address is valid (i.e., the match object is not None)
    return match is not None
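To see how the two answers differ in practice, both patterns can be run against a few addresses. This is a small sketch: the patterns are copied from the two responses above, while the sample addresses and the function names are illustrative assumptions.

```python
import re

# Pattern from the llama-13b response (no anchors, no dot handling).
LLAMA_PATTERN = re.compile(r"[a-zA-Z0-9]+@[a-z]{2,4}")
# Pattern from the vicuna-13b response.
VICUNA_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def llama_validate(email: str) -> bool:
    return LLAMA_PATTERN.match(email) is not None


def vicuna_validate(email: str) -> bool:
    return VICUNA_PATTERN.match(email) is not None


samples = [
    "alice@example.com",       # plain valid address
    "first.last@example.com",  # valid address with a dot in the local part
    "bob@ab",                  # no top-level domain
    "not-an-email",            # no @ at all
]

results = {s: (llama_validate(s), vicuna_validate(s)) for s in samples}
for address, (llama_ok, vicuna_ok) in results.items():
    print(f"{address:<24} llama={llama_ok!s:<5} vicuna={vicuna_ok}")
```

The llama-13b pattern rejects a valid address containing a dot before the @ and accepts "bob@ab", whereas the vicuna-13b pattern handles all four samples as one would expect.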

Fine-tuning

You can easily change the model if a new one becomes very popular or if you find one that meets your requirements well. The web admin allows you to change a lot of settings that modify the way the model is run and result in different responses to the same prompt (longer responses, more precise ones, responses that take longer to produce…). I’m not familiar with those.

Productionisation

After completing the steps above to reach this prototype, I spent some time and effort productionising it, making it easier for me to repeat the process. This post is already a bit long, but I can share more later or upon request. Things like:

Creating an AMI image of the root volume of this prototype to make it possible to destroy and restore the server.
Using a Spot instance instead of On-Demand to reduce cost (and because it is OK for this application to have the server killed by AWS when there is high demand)
Using AWS CLI commands such as aws ec2 run-instances, because the command line is faster than multiple clicks in the Console once one knows what to do.
Hosting the app under a constant domain name, rather than the changing EC2-instance-related hostname (using Elastic IP and Route 53)
Setting up an nginx reverse proxy with an SSL certificate to protect the app with HTTPS.
Creating a systemd timer to auto-shutdown the instance when not in use (a bit brutal, but I’m very mindful of the renting cost!)
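The prices quoted in this post (Spot instance at ~0.16 USD/hour, 200 GB of EBS at 0.096 USD/GB-month) can be combined into a rough monthly estimate, which shows why shutting the instance down when idle matters. This is a sketch only: the usage hours are assumptions, and the root volume, data transfer and price fluctuations are ignored.

```python
SPOT_PRICE_PER_HOUR = 0.16      # USD, m5.2xlarge Spot price quoted in this post
EBS_PRICE_PER_GB_MONTH = 0.096  # USD, gp3 price quoted in this post
EBS_SIZE_GB = 200


def monthly_cost(hours_running: float) -> float:
    """Rough monthly cost in USD: compute while running plus always-on storage."""
    compute = hours_running * SPOT_PRICE_PER_HOUR
    storage = EBS_SIZE_GB * EBS_PRICE_PER_GB_MONTH
    return compute + storage


# Assumed usage patterns: occasional, regular, and never shut down (~730 h/month).
for hours in (20, 100, 730):
    print(f"{hours:>4} h/month -> {monthly_cost(hours):.2f} USD")
```

With these numbers the storage is a fixed cost of about 19 USD per month, while the compute part scales directly with the hours the instance stays up.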

Conclusion

Running a model is not very difficult nowadays, thanks to the work and open source releases of many contributors. Models improve continuously (see the difference above between the “deconstructed” response from llama-13b and the real code produced by vicuna-13b). I have been using several of those tools as an assistant while coding for several months now, and it really increases productivity. It is not yet clear to me what other everyday applications one can make of them (there is a lot of hype in the space as well), but I’m excited about the future uses that people will find.

References

llama.cpp - 2023-03-11
- “port of Facebook’s LLaMA model in C/C++” by Georgi Gerganov
oobabooga/text-generation-webui - 2022-12-21
- webui for running large language models locally
Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp - 2023-03-11
Vicuna: An Open-Source Chatbot - 2023-03-30
- open-source LLM optimized for chatbot application
- based on LLaMA, fine-tuned with 70K user-shared ChatGPT conversations
- 90%* ChatGPT Quality - *According to a fun and non-scientific evaluation with GPT-4
LLaMA 2 - 2023-07-18
- key feature: can be used commercially as well as for research
- Available in 7B, 13B and 70B versions
Australian bank Westpac has seen a significant productivity increase for coding tasks in an AI coding experiment - 2023-06-01:
- 46 percent productivity gain
- generative AI tools from Microsoft, Amazon and OpenAI
- 3-4 hours of basic training beforehand to get familiar with the tools
- tasks include extracting and exporting data, creating unit tests and data transformation