Vector embeddings power a lot of modern search and retrieval systems. In practice, though, choosing an embedding model is less about leaderboards and more about engineering tradeoffs:
- How many tokens per minute can I push through it?
- How much GPU memory does it need?
In this post I will walk through a small benchmark setup for four popular self-hosted embedding models.
Models in this comparison
All models here are self-hosted and available on Hugging Face or via standard Python tooling (see the loading sketch after the list).
1. sentence-transformers/all-MiniLM-L6-v2
A classic small Sentence Transformers model.
- Dimension: 384
- Very small on disk and in memory
- Great as a CPU friendly and GPU friendly baseline
2. BAAI/bge-small-en-v1.5
A retrieval tuned small model from the BGE family.
- Dimension: 384
- Optimized for retrieval tasks
- Good counterpart to MiniLM at the same vector size
3. BAAI/bge-base-en-v1.5
The middle child.
- Dimension: 768
- Medium sized model where you expect decent quality without huge hardware costs
- Makes a nice step between small and large in terms of VRAM usage and throughput
4. BAAI/bge-large-en-v1.5
A strong open source retriever many people use as a default.
- Dimension: 1024
- Larger model that will stress your GPU memory more
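All four models load through the same sentence-transformers interface. The snippet below is not part of the benchmark itself, just an illustrative way to confirm the vector dimensions listed above; the example sentence is a placeholder.

```python
# Illustrative check of the embedding dimensions listed above.
# The example sentence is a placeholder, not benchmark data.
from sentence_transformers import SentenceTransformer

MODEL_NAMES = [
    "sentence-transformers/all-MiniLM-L6-v2",  # 384-dim
    "BAAI/bge-small-en-v1.5",                  # 384-dim
    "BAAI/bge-base-en-v1.5",                   # 768-dim
    "BAAI/bge-large-en-v1.5",                  # 1024-dim
]

for name in MODEL_NAMES:
    model = SentenceTransformer(name)
    embedding = model.encode("A short placeholder sentence.")
    print(f"{name}: {embedding.shape[0]} dimensions")
```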
Test setup
All benchmarks were run on a single machine with 1x NVIDIA RTX 4070 (12 GB VRAM), 16 CPUs, and 32 GB RAM, using PyTorch and the sentence-transformers library.
For the benchmark I ran a total of 2000 requests with a concurrency of 5.
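The exact benchmark script is not included in this post, but a minimal sketch of the idea looks roughly like this, assuming each request embeds a single short text directly through sentence-transformers. The model name, payload, and thread-pool approach here are illustrative choices, not necessarily the precise setup behind the numbers below.

```python
# Minimal, illustrative benchmark harness: N single-text "requests"
# issued from a small thread pool, reported as requests per second.
import time
from concurrent.futures import ThreadPoolExecutor

from sentence_transformers import SentenceTransformer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # swap in any model above
TOTAL_REQUESTS = 2000
CONCURRENCY = 5

model = SentenceTransformer(MODEL_NAME, device="cuda")
payload = "A short placeholder query to embed."  # stand-in request body

def one_request(_):
    # One "request" = embedding a single text, mimicking an online query.
    return model.encode(payload, convert_to_numpy=True)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(one_request, range(TOTAL_REQUESTS)))
elapsed = time.perf_counter() - start

print(f"{MODEL_NAME}: {TOTAL_REQUESTS / elapsed:.2f} requests/second")
```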
The benchmark
The results largely track the increasing dimension of each model, with one notable gap: all-MiniLM-L6-v2 is clearly quicker than bge-small-en-v1.5, even though both produce 384-dimensional vectors.

| Model | Requests per second |
| --- | --- |
| all-MiniLM-L6-v2 | 220.66 |
| BAAI/bge-small-en-v1.5 | 132.48 |
| BAAI/bge-base-en-v1.5 | 127.76 |
| BAAI/bge-large-en-v1.5 | 70.67 |
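To make these numbers a bit more tangible: at one document per request, the throughput gap translates directly into indexing time. A rough back-of-the-envelope calculation (assuming no batching, which would change the picture considerably, and a hypothetical corpus size):

```python
# Rough indexing-time estimate at one document per request, no batching.
measured_rps = {
    "all-MiniLM-L6-v2": 220.66,
    "BAAI/bge-small-en-v1.5": 132.48,
    "BAAI/bge-base-en-v1.5": 127.76,
    "BAAI/bge-large-en-v1.5": 70.67,
}

num_documents = 1_000_000  # hypothetical corpus size

for model, rps in measured_rps.items():
    hours = num_documents / rps / 3600
    print(f"{model}: ~{hours:.1f} hours to embed {num_documents:,} documents")
```

For this hypothetical corpus, that works out to roughly 1.3 hours for the fastest model versus roughly 4 hours for the slowest.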
Conclusion
Looking at these results, the pattern is very clear: the bigger the embedding dimension, the more you pay in throughput.
- all-MiniLM-L6-v2 is lightning fast at around 220 req/s and is a great default when you care most about speed and cost.
- bge-small-en-v1.5 and bge-base-en-v1.5 sit in the middle, trading some throughput for better retrieval quality, with similar throughput but different vector sizes (384 vs 768).
- bge-large-en-v1.5 is the slowest in this setup, which is exactly what you would expect from a large model with 1024-dimensional vectors.
What this really shows is that there is no single best embedding model. You pick based on constraints:
- If you are throughput bound or indexing huge volumes, start with all-MiniLM-L6-v2 or bge-small-en-v1.5.
- If quality starts to matter more than raw speed, move up to bge-base-en-v1.5.
- Only reach for bge-large-en-v1.5 when you have a concrete recall or quality problem that the smaller models cannot solve.
The next logical step is to put these speed numbers next to quality metrics and index size, so you can decide where you want to sit on the curve between fast and cheap versus slower and smarter. But that is something we can test in a follow-up post.
Want semantic search without the complexity? At SearchLayer we continuously test new models and setups to give you the best search experience for your website. Get in touch with us and we will help you find the best solution for your use case.
