AI GPU benchmarks

Compare the performance of different GPUs for fine-tuning LLMs, along with LLM latency and throughput benchmarks.

Fine-tuning LLMs

We fine-tuned the following models using LitGPT. Each model was run with LoRA and configured with the optimal settings from the LitGPT config hub. The Alpaca 52K dataset was used for fine-tuning. The tables display:

  • Time: total time taken in hours
  • Price: total cost of the GPU time in USD

tiiuae/falcon-7b

GPU Model              Time (hours)   Price (USD)
NVIDIA RTX A6000       7.98           6.3
NVIDIA RTX A5000       8.35           3.68
NVIDIA A40             7.93           6.27
Tesla V100-SXM2-16GB   -              -
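The Price column is simply GPU-hours multiplied by the instance's hourly rate. A minimal sketch (the hourly rate below is a hypothetical example, not a quoted price):

```python
def finetune_cost(hours: float, hourly_rate_usd: float) -> float:
    """Total cost of a fine-tuning run: GPU time (hours) x hourly rate (USD)."""
    return hours * hourly_rate_usd

# Hypothetical example: a 7.98-hour run on a GPU billed at $0.79/hour
print(round(finetune_cost(7.98, 0.79), 2))  # 6.3
```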

LLM latency and throughput

To compare inference latency and token throughput, we used Ollama to serve the LLMs and prompted the models via its REST API. These figures were not taken under load.

Each metric is averaged across 817 prompts from the TruthfulQA dataset.

  • TTFW: Time to first word. This metric is measured from the time the request is sent to the API to the time the first JSON response is received. To use this metric in your own latency estimates, add your network latency.
  • TOK_PS: Tokens per second. This metric is measured using Ollama's internal counters. It includes both prompt and response tokens, together with their evaluation times, and is calculated as:
    TOK_PS = (Prompt Tokens + Response Tokens) / (Load Time + Prompt Evaluation time + Response Evaluation time)
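Ollama's generate endpoint reports nanosecond-resolution timing counters in its final JSON response; assuming the field names below (`prompt_eval_count`, `eval_count`, `load_duration`, `prompt_eval_duration`, `eval_duration`), the TOK_PS formula can be sketched as:

```python
def tok_ps(resp: dict) -> float:
    """Tokens per second from Ollama's timing counters.

    Counts prompt + response tokens; durations are reported in nanoseconds.
    """
    tokens = resp["prompt_eval_count"] + resp["eval_count"]
    seconds = (resp["load_duration"]
               + resp["prompt_eval_duration"]
               + resp["eval_duration"]) / 1e9
    return tokens / seconds

# Made-up example counters: 40 prompt + 200 response tokens over 2.0 s total
sample = {
    "prompt_eval_count": 40,
    "eval_count": 200,
    "load_duration": 200_000_000,         # 0.2 s
    "prompt_eval_duration": 300_000_000,  # 0.3 s
    "eval_duration": 1_500_000_000,       # 1.5 s
}
print(tok_ps(sample))  # 120.0
```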

llama2

GPU Model          Time to first word (seconds)   Tokens per second
NVIDIA RTX A5000   3.41                           112
NVIDIA RTX A6000   3.67                           106
NVIDIA V100        3.58                           107
NVIDIA A40         3.77                           102

AI image processing

To compare the performance of GPUs on common image-processing tasks, we used the Ultralytics YOLOv8 models in medium, large, and extra-large sizes.

Three task types were used: object detection, pose estimation, and instance segmentation.

  • yolov8m, yolov8l and yolov8x used the coco dataset.
  • yolov8m-pose, yolov8l-pose and yolov8x-pose used the coco-pose dataset.
  • yolov8m-seg, yolov8l-seg and yolov8x-seg used the coco dataset.

yolov8m

GPU Model          Frames per second   Inference time (ms)   Price per million frames (USD)
NVIDIA RTX A6000   127                 7.89                  1.73
NVIDIA RTX A5000   121                 8.28                  1.01
NVIDIA A40         132                 7.56                  1.66
NVIDIA V100        52                  19.2                  2.08
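Price per million frames follows directly from throughput and the hourly rate: one million frames at F frames per second takes 1e6/F seconds of GPU time. A minimal sketch (the hourly rate below is a hypothetical example, not a quoted price):

```python
def price_per_million_frames(fps: float, hourly_rate_usd: float) -> float:
    """USD cost to process one million frames at a given throughput."""
    gpu_hours = (1_000_000 / fps) / 3600  # seconds for 1M frames -> hours
    return gpu_hours * hourly_rate_usd

# Hypothetical example: 127 FPS on a GPU billed at $0.79/hour
print(round(price_per_million_frames(127, 0.79), 2))  # 1.73
```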

Rent GPUs today or reserve for future growth

Reserved Cloud

Get the highest-performing H100 GPUs and HGX systems at scale.

On-demand cloud

Get access to the GPU performance you need today, on-demand.

Why CUDO Compute?

Industry demand for HPC resources has grown exponentially, driven by the explosion in ML training, deep learning, and AI inference applications. This growth has made it challenging for organizations to rent GPU resources, or even to purchase powerful data center and workstation GPUs.

Whether your field is data science, machine learning, or any GPU-accelerated high-performance computing, getting started is simple. Start using many of our HPC resources today, or reserve powerful data center GPUs to ensure you have the capacity to empower your developers and delight your customers.

Sign up and get started today with our on-demand GPU instances, or contact us to discuss your requirements.

Deploy high-performance cloud GPUs