Deploying LLMs like Google Gemma
In this tutorial we will run Google Gemma with Ollama so that you can send queries via a REST API.
Ollama empowers users to work with large language models (LLMs) through its library of open-source models and its user-friendly API. This allows users to choose the best LLM for their specific task, whether it's text generation, translation, or code analysis. Ollama also simplifies interaction with different LLMs, making them accessible to a wider audience and fostering a more flexible and efficient LLM experience.
Quick start guide
- Prerequisites
- Starting a VM with cudoctl
- Installing Ollama via SSH
- Using Docker to start an LLM API
Prerequisites
- Create a project and add an SSH key
- Download the CLI tool
Starting a VM with cudoctl
Start a VM with the base image you require; here we will use an image that already has the NVIDIA drivers installed.
You can use the web console to start a VM with the Ubuntu 22.04 + NVIDIA drivers + Docker image, or alternatively use the command-line tool cudoctl.
To use the command-line tool you will need to get an API key from the web console; see here: API key
Then run:
cudoctl init
and enter your API key.
First, search for a VM type to start:
cudoctl search --vcpus 4 --mem 8 --gpus 1
Find an image:
cudoctl search images
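The image list can be long; one simple way to narrow it down to images that ship with the NVIDIA drivers preinstalled is to filter the output (assuming, as with the image used below, that "nvidia" appears in the image name):
cudoctl search images | grep nvidia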
After deciding on the epyc-milan-rtx-a4000 machine type (16GB GPU) in the se-smedjebacken-1 data center and the ubuntu-2204-nvidia-535-docker-v20240214 image, we can start a VM:
cudoctl vm create --id my-ollama --image ubuntu-2204-nvidia-535-docker-v20240214 --machine-type epyc-milan-rtx-a4000 --memory 8 --vcpus 4 --gpus 1 --boot-disk-size 80 --boot-disk-class network --data-center se-smedjebacken-1
Installing Ollama via SSH
Get the IP address of the VM:
cudoctl -json vm get my-ollama | jq '.externalIP'
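To reuse the address in the commands that follow, you can capture it in a shell variable. This is a minimal sketch that assumes the JSON output exposes the address under externalIP, as in the command above:
IP_ADDRESS=$(cudoctl -json vm get my-ollama | jq -r '.externalIP')
echo "$IP_ADDRESS"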
SSH into the VM:
ssh root@<IP_ADDRESS>
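Optionally, confirm the GPU is visible to the NVIDIA driver before installing anything (the image used here ships with the drivers preinstalled, so this should just work):
nvidia-smi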
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Download and run the Google Gemma LLM; once the model has loaded you can enter your prompt:
ollama run gemma:7b
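The install script also sets Ollama up to run as a background service, typically listening on localhost:11434, so you can query the same model over Ollama's REST API from inside the VM. For example, using the standard /api/generate endpoint:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma:7b",
  "prompt": "Write a haiku about GPUs",
  "stream": false
}'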
From the Ollama docs:
Model | Parameters | Size | Download |
---|---|---|---|
Llama 2 | 7B | 3.8GB | ollama run llama2 |
Mistral | 7B | 4.1GB | ollama run mistral |
Dolphin Phi | 2.7B | 1.6GB | ollama run dolphin-phi |
Phi-2 | 2.7B | 1.7GB | ollama run phi |
Neural Chat | 7B | 4.1GB | ollama run neural-chat |
Starling | 7B | 4.1GB | ollama run starling-lm |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b |
Llama 2 70B | 70B | 39GB | ollama run llama2:70b |
Orca Mini | 3B | 1.9GB | ollama run orca-mini |
Vicuna | 7B | 3.8GB | ollama run vicuna |
LLaVA | 7B | 4.5GB | ollama run llava |
Gemma | 2B | 1.4GB | ollama run gemma:2b |
Gemma | 7B | 4.8GB | ollama run gemma:7b |
Using Docker to start an LLM API
If you created a VM in the previous step, delete it by running:
cudoctl vm delete my-ollama
Create a text file with a command to start the Ollama docker container:
start-ollama.txt
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
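Here -v ollama:/root/.ollama stores downloaded models in a named Docker volume so they persist across container restarts, and -p 11434:11434 publishes the Ollama API port on the VM so it can be reached externally at http://<IP_ADDRESS>:11434.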
Create a VM and include --start-script-file start-ollama.txt to attach the start script:
cudoctl vm create --id my-ollama --image ubuntu-2204-nvidia-535-docker-v20240214 \
--machine-type epyc-milan-rtx-a4000 --memory 8 --vcpus 4 --gpus 1 --boot-disk-size 80 \
--boot-disk-class network --data-center se-smedjebacken-1 --start-script-file start-ollama.txt
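Once the VM has booted, you can optionally SSH in and check that the start script launched the container (standard Docker commands; the container name ollama comes from the run command above):
docker ps
docker logs ollama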
Once the VM is running you can curl the API to pull the model you require; here we use gemma:7b:
curl http://<IP_ADDRESS>:11434/api/pull -d '{"name": "gemma:7b"}'
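The pull endpoint streams its progress as JSON. To confirm the model has finished downloading, you can list the models available on the server using the standard /api/tags endpoint:
curl http://<IP_ADDRESS>:11434/api/tags | jq '.models[].name'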
Now it is ready to respond to a prompt:
curl http://<IP_ADDRESS>:11434/api/generate -d '{
"model": "gemma:7b",
"prompt":"Why when you leave water overnight in a glass does it create bubbles in the water ?",
"stream":false
}' | jq '.response'
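Ollama also exposes a chat-style endpoint, /api/chat, that accepts a list of messages instead of a single prompt, which is convenient for multi-turn conversations. A minimal example with the same model:
curl http://<IP_ADDRESS>:11434/api/chat -d '{
  "model": "gemma:7b",
  "messages": [
    {"role": "user", "content": "Explain in one sentence what a GPU is."}
  ],
  "stream": false
}' | jq '.message.content'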
Want to learn more?
You can learn more by contacting us, or you can just get started right away!