Ollama

Ollama is the easiest way to deploy open-source LLMs.

Ollama empowers users to work with large language models (LLMs) through its library of open-source models and its user-friendly API. This allows users to choose the best LLM for their specific task, whether it's text generation, translation, or code analysis. Ollama also simplifies interaction with different LLMs, making them accessible to a wider audience and fostering a more flexible and efficient LLM experience.

Get started

Go to the apps section in the web console and click either the small, medium, or large instance of Ollama. This gives you sensible default settings, but you can fully customise your deployment at the next step.

Customise the deployment

You can simply choose an ID for your App and deploy it, or you can configure the spec of the machine first.

GPU selection

The model(s) you wish to run will determine the amount of VRAM you will need on your GPU.

Ollama supports a wide range of models, listed at ollama.com/library.

Here are some example models that can be downloaded:

| Model              | Parameters | Size  | Download                        |
| ------------------ | ---------- | ----- | ------------------------------- |
| Gemma 3            | 1B         | 815MB | ollama run gemma3:1b            |
| Gemma 3            | 4B         | 3.3GB | ollama run gemma3               |
| Gemma 3            | 12B        | 8.1GB | ollama run gemma3:12b           |
| Gemma 3            | 27B        | 17GB  | ollama run gemma3:27b           |
| QwQ                | 32B        | 20GB  | ollama run qwq                  |
| DeepSeek-R1        | 7B         | 4.7GB | ollama run deepseek-r1          |
| DeepSeek-R1        | 671B       | 404GB | ollama run deepseek-r1:671b     |
| Llama 3.3          | 70B        | 43GB  | ollama run llama3.3             |
| Llama 3.2          | 3B         | 2.0GB | ollama run llama3.2             |
| Llama 3.2          | 1B         | 1.3GB | ollama run llama3.2:1b          |
| Llama 3.2 Vision   | 11B        | 7.9GB | ollama run llama3.2-vision      |
| Llama 3.2 Vision   | 90B        | 55GB  | ollama run llama3.2-vision:90b  |
| Llama 3.1          | 8B         | 4.7GB | ollama run llama3.1             |
| Llama 3.1          | 405B       | 231GB | ollama run llama3.1:405b        |
| Phi 4              | 14B        | 9.1GB | ollama run phi4                 |
| Phi 4 Mini         | 3.8B       | 2.5GB | ollama run phi4-mini            |
| Mistral            | 7B         | 4.1GB | ollama run mistral              |
| Moondream 2        | 1.4B       | 829MB | ollama run moondream            |
| Neural Chat        | 7B         | 4.1GB | ollama run neural-chat          |
| Starling           | 7B         | 4.1GB | ollama run starling-lm          |
| Code Llama         | 7B         | 3.8GB | ollama run codellama            |
| Llama 2 Uncensored | 7B         | 3.8GB | ollama run llama2-uncensored    |
| LLaVA              | 7B         | 4.5GB | ollama run llava                |
| Granite-3.2        | 8B         | 4.9GB | ollama run granite3.2           |

Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
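
Once your VM is running, you can check how much GPU memory is actually free before pulling a large model. A minimal check from inside the VM, assuming an NVIDIA GPU with the standard drivers installed:

    # Report total and free VRAM for each GPU
    nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

Compare the free memory against the sizes in the table above, leaving some headroom for the context window.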

Disk size

The default disk size is set between 100-200GB, which should be enough for most users. However, if you plan to download and compare multiple models, consider increasing your boot disk size.
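
If you later want to see how much of that disk your downloaded models are using, you can check from inside the VM. A quick sketch, assuming the default Ollama model directory of /usr/share/ollama/.ollama/models (this path is an assumption and may differ on your image):

    # Free space on the boot disk
    df -h /

    # Space taken up by downloaded models (adjust the path if your install differs)
    du -sh /usr/share/ollama/.ollama/models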

Using Ollama

When you deploy the VM, you will be shown the VM information page. On the left-hand side there is a pane called 'Metadata'. For Ollama it shows the following:

    
    CUDO_TOKEN  cudo_8c744hxyo2   # your authentication token
    appId       ollama
    port        8080              # the port to access Ollama


To connect, you need your VM's IP address, the port, and the CUDO_TOKEN.
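
A quick way to confirm you can reach the server is the version endpoint. The IP address below (198.145.104.51) is a placeholder; substitute your own VM's IP, port, and token:

    curl --header "Authorization: Bearer cudo_8c744hxyo2" http://198.145.104.51:8080/api/version

This should return a small JSON object containing the Ollama version.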

Pull a model

Use curl from your local machine to pull a model. The list of available models is in the Ollama library. The model needs to fit in your GPU memory and on your VM disk.

Here is an example curl request pulling tinyllama:

    
    curl --header "Authorization: Bearer cudo_8c744hxyo2" http://198.145.104.51:8080/api/pull -d '{"model": "tinyllama"}'
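
The pull endpoint streams JSON status updates while the download runs. Once it finishes, you can confirm the model is present by listing the local models (same placeholder IP and token as above):

    curl --header "Authorization: Bearer cudo_8c744hxyo2" http://198.145.104.51:8080/api/tags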

Test completion

Now try sending a completion using curl. Here we have turned streaming off to make the response more readable.

    
    curl --header "Authorization: Bearer cudo_8c744hxyo2" http://198.145.104.51:8080/api/generate -d '{"model": "tinyllama","prompt": "Why is the sky blue?", "stream": false}'
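
For multi-turn conversations you can use the chat endpoint instead, which takes a list of messages rather than a single prompt. A minimal sketch with the same placeholder values:

    curl --header "Authorization: Bearer cudo_8c744hxyo2" http://198.145.104.51:8080/api/chat -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Why is the sky blue?"}], "stream": false}'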

Continue with curl / REST API

The API has more endpoints, listed below; you can continue using curl or any other REST tool (see the example after this list): API docs

  • Generate a completion
  • Generate a chat completion
  • Create a Model
  • List Local Models
  • Show Model Information
  • Copy a Model
  • Delete a Model
  • Pull a Model
  • Push a Model
  • Generate Embeddings
  • List Running Models
  • Version
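
For example, to inspect a model you have already pulled, the Show Model Information endpoint takes the model name in the request body (recent Ollama versions accept "model" here; older ones use "name"):

    curl --header "Authorization: Bearer cudo_8c744hxyo2" http://198.145.104.51:8080/api/show -d '{"model": "tinyllama"}'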

Using Ollama with OpenAI API

Ollama also supports an OpenAI-compatible API.

Note: OpenAI compatibility is experimental and is subject to major adjustments including breaking changes. For fully-featured access to the Ollama API, see the Ollama Python library, JavaScript library and REST API.

Install the openai SDK:

    
    pip install openai


Get your IP address from the VM info page, and the port and CUDO_TOKEN from the Metadata pane just below:

    
    CUDO_TOKEN  cudo_8c744hxyo2   # your authentication token
    appId       ollama
    port        8080              # the port to access Ollama


Using Python, you can write:

    
    from openai import OpenAI

    ip = "192.0.0.0"                 # your VM's IP address
    port = 8080                      # the port from the Metadata pane
    cudo_token = "cudo_8c744hxyo2"   # your CUDO_TOKEN

    client = OpenAI(
        base_url=f"http://{ip}:{port}/v1/",
        api_key=cudo_token,
    )

    # List the models available on the server
    list_completion = client.models.list()
    print(list_completion.data)


See more here: OpenAI docs
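
Once listing models works, a chat completion goes through the same OpenAI-compatible endpoints. A minimal sketch via curl, using the same placeholder IP and token, and assuming you pulled tinyllama earlier:

    curl http://198.145.104.51:8080/v1/chat/completions --header "Authorization: Bearer cudo_8c744hxyo2" --header "Content-Type: application/json" -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Why is the sky blue?"}]}'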

Want to learn more?

You can learn more by contacting us, or you can just get started right away!