🧪⚙️ Benchmarking LLM RAGs: Python vs .NET (No Spoilers Yet)
February 19, 2026
I've been working on a benchmark that compares LLM-powered RAG pipelines in Python and .NET for candidate matching.
Quick note before we start: this is intentionally a no-spoilers version because I'll share the final conclusions in a meeting first.
Why this benchmark
Most comparisons online focus on framework preference or subjective opinions. I wanted a setup where both stacks run under comparable conditions so we can discuss tradeoffs with data, not vibes.
The repos behind the work
- Resume preprocessor: github.com/maurogioberti/llm-candidate-profiler
- RAG benchmark (multi-language): github.com/maurogioberti/llm-candidate-rag-benchmark-multilang
How the preprocessor works
Before benchmarking retrieval quality, resumes are normalized through a preprocessing pipeline so both stacks start from the same structured input.
- Scans PDF resumes and extracts text content
- Builds LLM prompts to produce structured candidate profiles
- Supports provider switching via configuration (local/cloud)
- Generates JSON outputs that feed the benchmark dataset
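As a rough sketch of that pipeline shape (this is not the repo's actual code; the prompt template, profile keys, and the injected `llm_call` provider are illustrative assumptions):

```python
import json

# Hypothetical prompt template; the real repo's prompts differ.
PROFILE_PROMPT = (
    "Extract a structured candidate profile from the resume below.\n"
    "Return JSON with keys: name, skills, years_experience.\n\n"
    "Resume:\n{resume_text}"
)

def build_prompt(resume_text: str) -> str:
    """Wrap the extracted resume text in the profiling prompt."""
    return PROFILE_PROMPT.format(resume_text=resume_text)

def preprocess(resume_text: str, llm_call) -> dict:
    """Run one resume through the pipeline: prompt -> LLM -> JSON profile.

    `llm_call` is injected so the provider (local or cloud) can be
    switched via configuration without touching the pipeline code.
    """
    raw = llm_call(build_prompt(resume_text))
    return json.loads(raw)

# Demo with a stubbed provider standing in for a real LLM client.
if __name__ == "__main__":
    fake_llm = lambda prompt: (
        '{"name": "A. Candidate", "skills": ["Python"], "years_experience": 5}'
    )
    profile = preprocess("A. Candidate. Python developer, 5 years.", fake_llm)
    print(profile)
```

Injecting the provider as a callable is one simple way to get the local/cloud switching described above, since the JSON outputs stay identical regardless of which backend produced them.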
What is being compared
The benchmark compares two implementations of the same product intent:
- Python pipeline (LangChain-based orchestration)
- .NET pipeline (Microsoft.Extensions.AI-based orchestration)
- Shared embeddings strategy for consistency
- Equivalent datasets and prompt intent across both stacks
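One way to keep the embeddings strategy genuinely shared (the keys and model name below are illustrative, not the benchmark's actual config) is a single config file that both the Python and .NET pipelines read:

```python
import json

# Illustrative shared settings; the benchmark's real keys and values may differ.
SHARED_CONFIG = {
    "embedding_model": "example-embed-model",  # hypothetical model name
    "chunk_size": 512,
    "chunk_overlap": 64,
    "top_k": 5,
}

def write_shared_config(path: str) -> None:
    """Persist one config that both stacks load at startup, so any
    retrieval differences reflect orchestration, not settings drift."""
    with open(path, "w") as f:
        json.dump(SHARED_CONFIG, f, indent=2)
```

Because JSON is trivially readable from both LangChain-based Python and Microsoft.Extensions.AI-based .NET, a single file like this removes one whole class of accidental divergence.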
How the evaluation is designed
This is not a single-run screenshot benchmark. It uses repeatable execution and score aggregation to reduce random variance.
- Multiple evaluation runs per prompt
- Consistent temperature settings for deterministic behavior
- Quality evaluation plus performance/load testing
- Parity checks so differences reflect orchestration, not data drift
If we want fair conclusions, we need parity in inputs, embeddings, prompts, and scoring rules.
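The run-aggregation step can be sketched like this (the scoring values and run count are made up for illustration; the benchmark's actual metrics differ):

```python
from statistics import mean, stdev

def aggregate_runs(scores_per_run: list[float]) -> dict:
    """Collapse repeated evaluation runs for one prompt into a stable
    summary, reducing the influence of any single noisy run."""
    return {
        "runs": len(scores_per_run),
        "mean": mean(scores_per_run),
        "stdev": stdev(scores_per_run) if len(scores_per_run) > 1 else 0.0,
    }

# Example: five runs of the same prompt at a fixed temperature.
summary = aggregate_runs([0.82, 0.85, 0.81, 0.84, 0.83])
print(summary)
```

Reporting the standard deviation alongside the mean also makes it visible when a "win" for one stack is smaller than the run-to-run noise.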
Why this matters
Teams choosing between Python and .NET for AI features usually ask the same thing: Which stack gives better quality, reliability, and operational confidence for real business use?
This work is my attempt to answer that with a practical, reproducible benchmark instead of guesswork.
Results soon, after the meeting.
#RAG #LLM #Python #DotNet #Benchmarking #AIEngineering