AI GPU benchmarks
Compare the performance of different GPUs on LLM fine-tuning, LLM latency and throughput, and AI image processing benchmarks
Fine-tuning LLMs
We fine-tuned the following models using LitGPT. Each model was run with LoRA, configured with the optimal settings from the LitGPT config hub, and fine-tuned on the Alpaca 52K dataset. The tables display:
- Time: total fine-tuning time, in hours
- Price: total cost of the GPU time, in USD
tiiuae/falcon-7b
| GPU Model | Time (hours) | Price (USD) |
|---|---|---|
| NVIDIA RTX A6000 | 7.98 | 6.30 |
| NVIDIA RTX A5000 | 8.35 | 3.68 |
| NVIDIA A40 | 7.93 | 6.27 |
| Tesla V100-SXM2-16GB | - | - |
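Dividing each Price by its Time gives the implied hourly rate for each GPU, which is useful when estimating the cost of your own runs. A minimal sketch in Python; the rates below are derived from the table above, not quoted prices:

```python
# Implied hourly GPU rates from the falcon-7b fine-tuning results above.
# Values are (time in hours, total price in USD) from the table.
results = {
    "NVIDIA RTX A6000": (7.98, 6.30),
    "NVIDIA RTX A5000": (8.35, 3.68),
    "NVIDIA A40": (7.93, 6.27),
}

for gpu, (hours, price) in results.items():
    print(f"{gpu}: ${price / hours:.2f}/hour")
# NVIDIA RTX A6000: $0.79/hour
# NVIDIA RTX A5000: $0.44/hour
# NVIDIA A40: $0.79/hour
```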
LLM latency and throughput
To compare inference latency and token throughput, we used Ollama to serve the LLMs and prompted the models via its REST API. These figures were not taken under load.
Each metric is averaged across 817 prompts from the TruthfulQA benchmark:
- TTFW: Time to first word. This metric is measured from the time the request is sent to the API to the time the first JSON response is received. To use this metric in your own latency estimates, you will need to add your network latency (see the measurement sketch below).
- TOK_PS: Tokens per second. This metric is measured using Ollama's internal counters. It includes the prompt tokens and the prompt evaluation time, and is calculated as:
TOK_PS = (Prompt Tokens + Response Tokens) / (Load Time + Prompt Evaluation Time + Response Evaluation Time)
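The sketch below shows one way to measure both metrics against a local Ollama server via its documented /api/generate streaming endpoint. It is an approximation of the method described above, not the exact benchmark script; the model name and prompt are placeholders.

```python
import json
import time

import requests  # pip install requests

# Assumes an Ollama server on the default port with the model pulled.
url = "http://localhost:11434/api/generate"
payload = {"model": "llama2", "prompt": "What is the capital of France?"}

final = {}
ttfw = None
start = time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        if ttfw is None:
            # TTFW: time from sending the request to the first JSON response.
            ttfw = time.perf_counter() - start
        chunk = json.loads(line)
        if chunk.get("done"):
            final = chunk  # the last chunk carries Ollama's internal counters

# Durations are reported in nanoseconds; counts may be omitted if cached.
tokens = final.get("prompt_eval_count", 0) + final.get("eval_count", 0)
seconds = (final.get("load_duration", 0)
           + final.get("prompt_eval_duration", 0)
           + final.get("eval_duration", 0)) / 1e9
print(f"TTFW: {ttfw:.2f} s  TOK_PS: {tokens / seconds:.0f} tokens/s")
```

As noted above, add your own network round-trip time on top of TTFW when estimating end-to-end latency.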
llama2
| GPU Model | Time to first word (seconds) | Tokens per second |
|---|---|---|
| NVIDIA RTX A5000 | 3.41 | 112 |
| NVIDIA RTX A6000 | 3.67 | 106 |
| NVIDIA V100 | 3.58 | 107 |
| NVIDIA A40 | 3.77 | 102 |
AI image processing
To compare the performance of GPUs at common image processing tasks, we used the Ultralytics YOLOv8 models in medium, large and extra-large sizes.
Three model types were used: object detection, pose estimation and instance segmentation (a short reproduction sketch follows the dataset list below).
- yolov8m, yolov8l and yolov8x used the COCO dataset.
- yolov8m-pose, yolov8l-pose and yolov8x-pose used the COCO-pose dataset.
- yolov8m-seg, yolov8l-seg and yolov8x-seg used the COCO dataset.
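A minimal sketch of how such a run can be reproduced with the Ultralytics Python API. The model and dataset names match the list above, but the exact validation settings behind the published figures are an assumption:

```python
from ultralytics import YOLO  # pip install ultralytics

# Validate the medium detection model on COCO (downloads weights and
# dataset on first run) and read back the per-image pipeline timings.
model = YOLO("yolov8m.pt")
metrics = model.val(data="coco.yaml")

# metrics.speed reports milliseconds per image for each pipeline stage.
inference_ms = metrics.speed["inference"]
print(f"Inference: {inference_ms:.2f} ms/frame -> {1000 / inference_ms:.0f} FPS")
```

The same pattern applies to the pose and segmentation variants, e.g. YOLO("yolov8m-pose.pt") with data="coco-pose.yaml".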
yolov8m
| GPU Model | Frames per second | Inference time (ms) | Price per million frames (USD) |
|---|---|---|---|
| NVIDIA RTX A6000 | 127 | 7.89 | 1.73 |
| NVIDIA RTX A5000 | 121 | 8.28 | 1.01 |
| NVIDIA A40 | 132 | 7.56 | 1.66 |
| NVIDIA V100 | 52 | 19.2 | 2.08 |
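The price column follows from throughput and an hourly GPU rate: price per million frames = hourly rate / (FPS × 3600) × 1,000,000. A quick sanity check in Python, assuming the roughly $0.79/hour RTX A6000 rate implied by the fine-tuning table earlier:

```python
def price_per_million_frames(hourly_rate_usd: float, fps: float) -> float:
    """USD to process one million frames at a given sustained throughput."""
    frames_per_hour = fps * 3600
    return hourly_rate_usd / frames_per_hour * 1_000_000

# NVIDIA RTX A6000 at 127 FPS with an assumed ~$0.79/hour rate
print(f"${price_per_million_frames(0.79, 127):.2f}")  # $1.73, matching the table
```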
Why CUDO Compute?
Industry demand for HPC resources has grown rapidly, driven by the explosion in ML training, deep learning, and AI inference workloads. This growth has made it difficult for organizations to rent GPU resources, or even to buy some of the more powerful data center and workstation GPUs.
Whether your field is data science, machine learning, or any other GPU-accelerated high-performance computing, getting started is simple. Start using many of our HPC resources today, or reserve powerful data center GPUs to ensure you have the capacity to empower your developers and delight your customers.
Sign up and get started today with our on-demand GPU instances, or contact us to discuss your requirements.