Conversational AI: From RAG Prototypes to Domain-Specific Supervision

The Hidden Cost of Embedding

Embedding models don’t get much attention in the cost conversation. The focus is always on the LLM (which model, how many tokens, input vs output pricing). Meanwhile, the embedding API bills you at three separate points in the life of your knowledge base.

The bill nobody talks about

OpenAI’s text-embedding-3-small is cheap per token - around $0.02 per million tokens at the time of writing - which is exactly why the running cost is easy to ignore.

But embeddings aren’t a one-time cost. You pay when you ingest documents. You pay again when you re-ingest after changing your chunking strategy (and you will change it). You pay on every single query, because the user’s question needs embedding before you can search.

For the counselling supervisor project, this meant:

  • Ingestion: 70+ PDFs, chunked and summarised, each summary embedded. Hundreds of chunks.
  • Re-ingestion: Every time I changed chunk sizes, updated the summarisation prompt, or added new documents, the entire corpus got re-embedded.
  • Queries: Every conversation turn that triggered a document search meant embedding the question first. During development, that’s dozens of queries per session. In production, it’s every user interaction.

None of these individual costs were large. Together, they added up to a steady drain that scaled with both corpus size and usage.
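To put rough numbers on it, here’s a back-of-envelope sketch. It assumes text-embedding-3-small’s list price of $0.02 per million tokens at the time of writing, and the chunk counts and token lengths are illustrative guesses, not measurements from the project:

```python
# Back-of-envelope embedding spend. All volume numbers below are
# hypothetical; the price assumes text-embedding-3-small at
# $0.02 per 1M tokens (check current pricing).
PRICE_PER_TOKEN = 0.02 / 1_000_000

chunks = 500             # "hundreds of chunks" from 70+ PDFs
tokens_per_chunk = 300   # average embedded summary length
reingestions = 10        # chunking/prompt changes during development
queries = 5_000          # user questions over some period
tokens_per_query = 30    # average question length

ingest = chunks * tokens_per_chunk * PRICE_PER_TOKEN
reingest = ingest * reingestions
query_cost = queries * tokens_per_query * PRICE_PER_TOKEN

print(f"ingest:     ${ingest:.4f}")
print(f"re-ingests: ${reingest:.4f}")
print(f"queries:    ${query_cost:.4f}")
```

Each line item is pennies, which is the trap: no single number looks worth optimising, but every term grows with corpus size, iteration speed, or traffic.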

What I switched to

Ollama running Gemma locally. The setup is straightforward - install Ollama, pull the model, point the application at localhost instead of the OpenAI API.

The code change was minimal. The embedding function signature stays the same: text in, vector out. The only difference is the HTTP endpoint and the model name.
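As a concrete sketch of that swap - assuming the OpenAI Python SDK on the hosted side and Ollama’s REST endpoint on the local side; the function names are mine, and embeddinggemma stands in for whichever embedding-capable Gemma variant you pull:

```python
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_openai(text: str) -> list[float]:
    # Hosted: per-token billing on every call.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def embed_local(text: str) -> list[float]:
    # Local: Ollama on localhost, no billing. Assumes the model was
    # pulled first, e.g. `ollama pull embeddinggemma`.
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "embeddinggemma", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```

Everything downstream - vector store writes, similarity search - stays untouched, because both functions take text in and return a plain list of floats.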

What changed

Cost: Effectively zero - local compute on hardware I already own. No per-token billing, no API keys, no usage tracking to worry about.

Latency: Slightly slower per embedding on my machine than OpenAI’s API. For ingestion, this barely matters - it’s a batch job that runs occasionally. For query-time embedding, the difference is imperceptible: the single embedding call adds single-digit milliseconds of overhead, lost in the noise next to the time the LLM spends generating a response.
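If you’d rather measure than trust my machine, a micro-benchmark is a few lines; this sketch assumes the embed_local function above:

```python
import time

def mean_latency_ms(embed_fn, text: str, runs: int = 20) -> float:
    embed_fn(text)  # warm-up call so model load time doesn't skew the mean
    start = time.perf_counter()
    for _ in range(runs):
        embed_fn(text)
    return (time.perf_counter() - start) / runs * 1000

query = "What are the requirements for clinical supervision?"
print(f"local: {mean_latency_ms(embed_local, query):.1f} ms")
```

Run the same function against embed_openai and the comparison includes the network round-trip, which is the number that actually matters at query time.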

Quality: For this domain - counselling supervision documents, professional standards, clinical guidelines - retrieval quality was comparable. The documents are formal, well-structured English text. This isn’t a case where you need the absolute best embedding model to distinguish subtle semantic differences. The chunk-then-summarise pipeline does the heavy lifting for retrieval quality, not the embedding model.
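“Comparable” was a judgment call on my part, but it’s cheap to quantify: build an index with each backend, run the same representative queries through both, and measure how much the top-k results overlap. A sketch, where search_openai and search_local are hypothetical helpers returning ranked chunk IDs:

```python
def mean_topk_overlap(queries, search_openai, search_local, k: int = 5) -> float:
    # Average fraction of top-k chunk IDs shared by the two backends.
    scores = []
    for q in queries:
        shared = set(search_openai(q, k)) & set(search_local(q, k))
        scores.append(len(shared) / k)
    return sum(scores) / len(scores)
```

High overlap on the queries your users actually ask is a stronger signal than any leaderboard score for the model.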

Deployment: The trade-off is that you need Ollama running wherever the application runs. For a self-hosted tool on my own infrastructure, that’s not a constraint. For a SaaS product serving thousands of users, it would be.
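One way to soften that constraint is a startup check that fails loudly when Ollama isn’t reachable, rather than surfacing confusing retrieval errors later. A sketch of how you might wire it, not how the project does:

```python
def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    # Ollama's root endpoint returns a plain liveness banner when up.
    try:
        return requests.get(base_url, timeout=2).ok
    except requests.RequestException:
        return False

if not ollama_available():
    raise RuntimeError("Ollama is not running - start it or switch the embedding backend")
```

Note that silently falling back to the hosted API is not a safe default here: embeddings from different models live in different vector spaces, so queries embedded with one model can’t be searched against an index built with the other.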

When it makes sense

Local embeddings work well when:

  • You control the deployment environment
  • Your corpus is in a single language with clear, formal text
  • You’re re-ingesting frequently during development
  • The cost savings matter more than squeezing out the last percentage of retrieval quality
  • You want to run fully offline or air-gapped

They’re a poor fit when:

  • You need the absolute best multilingual embedding quality
  • You’re running on infrastructure where you can’t install Ollama (serverless, edge)
  • Your scale is small enough that the OpenAI bill is negligible
  • You need guaranteed uptime without managing local services

For a domain-specific tool with a well-structured knowledge base and a chunk-then-summarise pipeline handling retrieval quality, the embedding model matters less than you’d think. Retrieval quality on this corpus was comparable between the paid and local models.