AI GPU benchmarks

Compare the performance of different GPUs for fine-tuning LLMs, along with LLM latency and throughput benchmarks.

Fine-tuning LLMs

We fine-tuned the following models using LitGPT. Each model was run with LoRA and configured with the optimal settings from the LitGPT config hub. The Alpaca 52K dataset was used for fine-tuning. The tables display:

  • Time: total time taken in hours
  • Price: total cost of the GPU time in USD

tiiuae/falcon-7b

GPU Model              Time (hours)   Price (USD)
NVIDIA RTX A6000       7.98           6.3
NVIDIA RTX A5000       8.35           3.68
NVIDIA A40             7.93           6.27
Tesla V100-SXM2-16GB   -              -
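The Price column is simply GPU-hours multiplied by the instance's hourly rate. A minimal sketch (the hourly rate below is a hypothetical example, not a quoted price):

```python
def finetune_cost(hours: float, hourly_rate_usd: float) -> float:
    """Total cost of a fine-tuning run: GPU time (hours) x hourly rate (USD)."""
    return hours * hourly_rate_usd

# Hypothetical example: a 7.98-hour run on a GPU billed at $0.79/hour
print(round(finetune_cost(7.98, 0.79), 2))  # 6.3
```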

LLM latency and throughput

To compare inference latency and token throughput, we used Ollama to serve the LLMs and prompted the models via its REST API. These figures were not taken under load.

Each metric is averaged across 817 prompts from the TruthfulQA dataset.

  • TTFW: Time to first word. This metric is measured from the time the request is sent to the API to the time the first JSON response is received. To use this metric in your own latency estimates, add your network latency.
  • TOK_PS: Tokens per second. This metric is measured using Ollama's internal counters. It includes both prompt and response tokens, together with their evaluation times, and is calculated as:
    TOK_PS = (Prompt Tokens + Response Tokens) / (Load Time + Prompt Evaluation time + Response Evaluation time)
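Ollama's generate endpoint reports nanosecond-resolution timing counters in its final JSON response; assuming the field names below (`prompt_eval_count`, `eval_count`, `load_duration`, `prompt_eval_duration`, `eval_duration`), the TOK_PS formula can be sketched as:

```python
def tok_ps(resp: dict) -> float:
    """Tokens per second from Ollama's timing counters.

    Counts prompt + response tokens; durations are reported in nanoseconds.
    """
    tokens = resp["prompt_eval_count"] + resp["eval_count"]
    seconds = (resp["load_duration"]
               + resp["prompt_eval_duration"]
               + resp["eval_duration"]) / 1e9
    return tokens / seconds

# Made-up example counters: 40 prompt + 200 response tokens over 2.0 s total
sample = {
    "prompt_eval_count": 40,
    "eval_count": 200,
    "load_duration": 200_000_000,         # 0.2 s
    "prompt_eval_duration": 300_000_000,  # 0.3 s
    "eval_duration": 1_500_000_000,       # 1.5 s
}
print(tok_ps(sample))  # 120.0
```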

llama2

GPU Model          Time to first word (seconds)   Tokens per second
NVIDIA RTX A5000   3.41                           112
NVIDIA RTX A6000   3.67                           106
NVIDIA V100        3.58                           107
NVIDIA A40         3.77                           102

AI image processing

To compare the performance of GPUs on common image-processing tasks, we used the Ultralytics YOLOv8 models in medium, large, and extra-large sizes.

Three task types were used: object detection, pose estimation, and instance segmentation.

  • yolov8m, yolov8l and yolov8x used the coco dataset.
  • yolov8m-pose, yolov8l-pose and yolov8x-pose used the coco-pose dataset.
  • yolov8m-seg, yolov8l-seg and yolov8x-seg used the coco dataset.

yolov8m

GPU Model          Frames per second   Inference time (ms)   Price per million frames (USD)
NVIDIA RTX A6000   127                 7.89                  1.73
NVIDIA RTX A5000   121                 8.28                  1.01
NVIDIA A40         132                 7.56                  1.66
NVIDIA V100        52                  19.2                  2.08
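Price per million frames follows directly from throughput and the hourly rate: one million frames at F frames per second takes 1e6/F seconds of GPU time. A minimal sketch (the hourly rate below is a hypothetical example, not a quoted price):

```python
def price_per_million_frames(fps: float, hourly_rate_usd: float) -> float:
    """USD cost to process one million frames at a given throughput."""
    gpu_hours = (1_000_000 / fps) / 3600  # seconds for 1M frames -> hours
    return gpu_hours * hourly_rate_usd

# Hypothetical example: 127 FPS on a GPU billed at $0.79/hour
print(round(price_per_million_frames(127, 0.79), 2))  # 1.73
```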

Rent GPUs today or reserve for future growth

Reserved Cloud

Get the highest-performing H100 GPUs and HGX systems at scale.

On-demand cloud

Get access to the GPU performance you need today, on-demand.

Why CUDO Compute?

Industry demand for HPC resources has grown exponentially, driven by the explosion in ML training, deep learning, and AI inference applications. This growth has made it challenging for organizations to rent GPU resources, or even to purchase powerful data center and workstation GPUs.

Whether your field is data science, machine learning, or any GPU-accelerated high-performance computing, getting started is simple. Start using many of our HPC resources today, or reserve powerful data center GPUs to ensure you have the capacity to empower your developers and delight your customers.

Sign up and get started today with our on-demand GPU instances, or contact us to discuss your requirements.

Deploy high-performance cloud GPUs