
Benchmarking Self-Hosted Embedding Models

Vector embeddings power a lot of modern search and retrieval systems. In practice, though, choosing an embedding model is less about leaderboards and more about engineering tradeoffs:

  • How many tokens per minute can I push through it?
  • How much GPU memory does it need?

In this post I will walk through a small benchmark setup for four popular self-hosted embedding models.

Models in this comparison

All models here are self-hosted, available on Hugging Face and loadable via standard Python tooling (see the loading sketch after the list).

1. sentence-transformers/all-MiniLM-L6-v2

A classic small Sentence Transformers model.

  • Dimension: 384
  • Very small on disk and in memory
  • Great as a CPU friendly and GPU friendly baseline

2. BAAI/bge-small-en-v1.5

A retrieval tuned small model from the BGE family.

  • Dimension: 384
  • Optimized for retrieval tasks
  • Good counterpart to MiniLM at the same vector size

3. BAAI/bge-base-en-v1.5

The middle child.

  • Dimension: 768
  • Medium sized model where you expect decent quality without huge hardware costs
  • Makes a nice step between small and large in terms of VRAM usage and throughput

4. BAAI/bge-large-en-v1.5

A strong open source retriever many people use as a default.

  • Dimension: 1024
  • Larger model that will stress your GPU memory more
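
To make the list concrete, here is a minimal sketch of loading each model with the sentence-transformers library and checking its embedding dimension. The model names are the Hugging Face identifiers above; device selection and download behavior are just the library defaults.

    from sentence_transformers import SentenceTransformer

    MODEL_IDS = [
        "sentence-transformers/all-MiniLM-L6-v2",
        "BAAI/bge-small-en-v1.5",
        "BAAI/bge-base-en-v1.5",
        "BAAI/bge-large-en-v1.5",
    ]

    for model_id in MODEL_IDS:
        # Downloads from Hugging Face on first use; uses the GPU automatically if one is available.
        model = SentenceTransformer(model_id)
        # Embedding dimension: 384 for the two small models, 768 for base, 1024 for large.
        print(model_id, model.get_sentence_embedding_dimension())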

Test setup

All benchmarks were run on a single machine with 1x NVIDIA RTX 4070 (12 GB VRAM), 16 CPUs, and 32 GB RAM, using PyTorch and the sentence-transformers library.
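
To answer the GPU memory question from the intro, peak VRAM can be read from PyTorch's CUDA memory counters around a warm-up encode. This is only a sketch: the batch size and texts below are placeholders, not the exact benchmark workload.

    import torch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

    # Reset the counter so only this model's peak allocation is measured.
    torch.cuda.reset_peak_memory_stats()

    # Encode a representative batch; batch size and text length strongly affect the peak.
    texts = ["an example query about embedding model throughput"] * 64
    _ = model.encode(texts, batch_size=64)

    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak VRAM: {peak_gb:.2f} GB")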

For the benchmark I ran a total of 2,000 requests with a concurrency of 5.
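
The harness itself can stay simple: a pool of 5 workers pushes 2,000 single-text requests through the model, and the elapsed wall-clock time gives requests per second. The sketch below assumes the requests call encode() in-process with placeholder texts; the real setup may differ in how requests reach the model.

    import time
    from concurrent.futures import ThreadPoolExecutor
    from sentence_transformers import SentenceTransformer

    N_REQUESTS = 2000
    CONCURRENCY = 5

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")

    def one_request(i: int):
        # Each "request" embeds one short text, mimicking an online query workload.
        return model.encode(f"example query number {i}")

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        # Submit all requests and wait for them to complete.
        list(pool.map(one_request, range(N_REQUESTS)))
    elapsed = time.perf_counter() - start

    print(f"{N_REQUESTS / elapsed:.2f} requests per second")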

The benchmark

The results largely track the embedding dimension of each model, with one notable exception: all-MiniLM-L6-v2 is clearly quicker than bge-small-en-v1.5, even though both produce 384-dimensional vectors.

Model                     Requests per second
all-MiniLM-L6-v2          220.66
BAAI/bge-small-en-v1.5    132.48
BAAI/bge-base-en-v1.5     127.76
BAAI/bge-large-en-v1.5    70.67

Conclusion

Looking at these results, the pattern is very clear: the bigger the embedding dimension, the more you pay in throughput.

  • all-MiniLM-L6-v2 is lightning fast at around 220 req/s and is a great default when you care most about speed and cost.
  • bge-small-en-v1.5 and bge-base-en-v1.5 sit in the middle: they deliver roughly the same throughput as each other despite different vector sizes (384 vs 768), trading some speed for better retrieval quality.
  • bge-large-en-v1.5 is the slowest in this setup, which is exactly what you would expect from a large model with 1024 dimensional vectors.

What this really shows is that there is no single best embedding model. You pick based on constraints:

  • If you are throughput bound or indexing huge volumes, start with all-MiniLM-L6-v2 or bge-small-en-v1.5.
  • If quality starts to matter more than raw speed, move up to bge-base-en-v1.5.
  • Only reach for bge-large-en-v1.5 when you have a concrete recall or quality problem that the smaller models cannot solve.

The next logical step is to put these speed numbers next to quality metrics and index size, so you can decide where you want to sit on the curve between fast and cheap versus slower and smarter. But that is something for a follow-up blog post.

Want semantic search without the complexity? At SearchLayer we continuously test new models and setups to give you the best search experience for your website. Get in touch with us and we will help you find the best solution for your use case.