How to Use LLMs
How to use LLMs (Large Language Models): how to host them, how to add a frontend to them, and more.
Ollama
- Ollama: https://ollama.com
- Ollama GitHub repo: https://github.com/ollama/ollama
Installation
My system configuration is:
- Ubuntu 22.04
- ISA: x86_64
- GPUs: two NVIDIA GTX 1080 Ti
Install on bare-metal
```
curl -fsSL https://ollama.com/install.sh | sh
```
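After the script finishes, a quick sanity check (assuming the installer registered the ollama systemd service, which it does on systemd-based distros such as Ubuntu):

```
# confirm the CLI is on the PATH
ollama --version

# confirm the background service is running
systemctl status ollama
```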
Manual install instructions
Install with Docker
First, you need to install the NVIDIA Container Toolkit. Since I use Ubuntu, I choose the apt
installation method.
Configure the production repository:
```
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

Optionally, configure the repository to use experimental packages:
```
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
Update the packages list from the repository:
```
sudo apt-get update
```
Install the NVIDIA Container Toolkit packages:
```
sudo apt-get install -y nvidia-container-toolkit
```
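The NVIDIA installation guide includes one more step before restarting Docker: register the NVIDIA runtime with Docker using nvidia-ctk.

```
sudo nvidia-ctk runtime configure --runtime=docker
```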
Restart the Docker daemon:
```
sudo systemctl restart docker
```
Next, check that your Docker service is healthy:
```
sudo systemctl status docker
```
If it shows an error like:

```
Failed to start Docker Application Container Engine.
```
You need to reboot the computer.
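Before starting the Ollama container, it is also worth confirming that containers can see the GPUs at all. The sample workload from the NVIDIA Container Toolkit docs works well here (the ubuntu image is just a convenient base):

```
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

If everything is wired up correctly, this prints the same table as running nvidia-smi on the host.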
Finally, you can run the container:
```
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
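To confirm the server came up (assuming you kept the default port mapping of 11434):

```
curl http://localhost:11434/
# the server should reply with "Ollama is running"
```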
Run the model
To run and chat with Llama 2:
If you installed Ollama on bare metal:
```
ollama run llama2
```
If you installed Ollama via Docker:
```
docker exec -it ollama ollama run llama2
```
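Besides the interactive CLI, Ollama also serves a REST API on port 11434, which is what frontends like Open WebUI talk to. A minimal sketch of a one-off generation request (the prompt is just an example):

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```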
You can switch to other models; Ollama supports the open-source models available in the Ollama library.
Here are some example open-source models that can be downloaded:
Model | Parameters | Size | Download |
---|---|---|---|
Llama 2 | 7B | 3.8GB | ollama run llama2 |
Mistral | 7B | 4.1GB | ollama run mistral |
Dolphin Phi | 2.7B | 1.6GB | ollama run dolphin-phi |
Phi-2 | 2.7B | 1.7GB | ollama run phi |
Neural Chat | 7B | 4.1GB | ollama run neural-chat |
Starling | 7B | 4.1GB | ollama run starling-lm |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b |
Llama 2 70B | 70B | 39GB | ollama run llama2:70b |
Orca Mini | 3B | 1.9GB | ollama run orca-mini |
Vicuna | 7B | 3.8GB | ollama run vicuna |
LLaVA | 7B | 4.5GB | ollama run llava |
Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
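Pulled models also take up disk space; the Ollama CLI can list and remove them (standard ollama subcommands):

```
# show models downloaded locally
ollama list

# remove one you no longer need
ollama rm llama2
```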
Customize a model
Import from GGUF
GGUF (GPT-Generated Unified Format), introduced as a successor to GGML (GPT-Generated Model Language), is a file format used for storing models for inference, especially in the context of language models like GPT (Generative Pre-trained Transformer).
Ollama supports importing GGUF models in the Modelfile:
Create a file named Modelfile, with a FROM instruction with the local filepath to the model you want to import:

```
FROM ./vicuna-33b.Q4_0.gguf
```
Create the model in Ollama
```
ollama create example -f Modelfile
```
Run the model
```
ollama run example
```
Import from PyTorch or Safetensors
See the guide on importing models for more information.
Customize a prompt
Models from the Ollama library can be customized with a prompt. For example, to customize the llama2
model:
```
ollama pull llama2
```
Create a Modelfile:
```
FROM llama2
```
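On its own, FROM llama2 does not change the prompt; the Modelfile example in the Ollama documentation for this "mario" model also adds a sampling parameter and a system prompt. A sketch along those lines, in Modelfile syntax (the temperature value and the wording of the system prompt are illustrative):

```
FROM llama2

# set the temperature (higher is more creative, lower is more coherent)
PARAMETER temperature 1

# set the custom system prompt
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""
```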
Next, create and run the model:
```
ollama create mario -f ./Modelfile
ollama run mario
```
For more examples, see the examples directory. For more information on working with a Modelfile, see the Modelfile documentation.
Open-webui
- GitHub: https://github.com/open-webui/open-webui
- Open WebUI Community: https://openwebui.com
Ollama runs in the terminal, so we can install a web frontend for it. We choose Open WebUI because it is fast and reliable.
Installation
Install Ollama; by default it listens on port 11434.
Then run:
```
docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_API_BASE_URL=http://127.0.0.1:11434/api --name open-webui --restart always ghcr.io/open-webui/open-webui:main
```
Source: Open WebUI: Server Connection Error
Then visit http://localhost:8080 (with --network=host, Open WebUI serves on its default port 8080).
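If the page does not come up, the container logs usually show why, and it is worth checking that the Ollama API itself answers on the host (plain Docker and curl, nothing Open WebUI specific):

```
# follow the Open WebUI container logs
docker logs -f open-webui

# confirm the Ollama API answers on the host
curl http://127.0.0.1:11434/api/tags
```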
BionicGPT
- BionicGPT
- BionicGPT GitHub repo
Install BionicGPT via docker compose
The easiest way to get running with BionicGPT is with the project's docker-compose.yml file. You'll need Docker installed on your machine.
```
mkdir BionicGPT
```
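The remaining steps from the BionicGPT README download the compose file into that directory and bring the stack up. A sketch under the assumption that the docker-compose.yml URL below is still current (check the BionicGPT repo for the authoritative path):

```
cd BionicGPT

# download the project's docker-compose.yml (URL is an assumption; see the BionicGPT README)
curl -O https://raw.githubusercontent.com/bionic-gpt/bionic-gpt/main/docker-compose.yml

# start the stack in the background
docker compose up -d
```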
You must have access to ports 7800 and 7810.
Run the User Interface
You can then access the front end at http://localhost:7800 and you'll be redirected to a registration screen.
The first user to register with BionicGPT will become the system administrator.