Conversational AI: From RAG Prototypes to Domain-Specific Supervision

Tool-Forced RAG: Stopping the LLM From Making Up Clinical Guidelines

The counselling supervisor AI is grounded in 70+ professional documents from BACP, NCPS, and UKCP: supervision frameworks, therapeutic methodologies, ethics codes, and clinical placement guidelines.

Without forced retrieval, the LLM will happily answer questions about these frameworks from its training data. The answers sound authoritative and use the right terminology, but in this domain they are sometimes wrong in ways that affect a trainee’s professional decisions.

The problem

LLMs have absorbed clinical and counselling content from their training data. Ask about BACP supervision requirements and you’ll get a coherent, confident answer. The problem is that the answer reflects whatever version of the guidelines existed in the training data, mixed with content from other frameworks, and synthesised into something that sounds right but may not match the current published standard.

For general conversation, the LLM’s summary is usually close enough. For a tool that counselling trainees rely on for professional guidance, “sounds right” is not good enough. The answer needs to come from the actual BACP document, not from the LLM’s impression of what BACP probably says.

Tool-forced retrieval

The solution is a document search tool that the LLM is instructed to use - not optionally, but mandatorily - for any question touching professional standards, ethics, or clinical guidelines.

The system prompt includes an explicit instruction: when a user asks about BACP, NCPS, UKCP, or any professional body’s framework, you must use the document search tool before answering. Do not answer from general knowledge.

The LLM stops acting as the source of clinical knowledge and instead reads the retrieved documents, synthesises an answer from them, and cites its sources. The knowledge comes from the curated, verified document base.
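
A minimal sketch of the mechanism, assuming a generic tool-calling LLM client; the tool name, schema shape, and loop below are illustrative, not the production code:

```python
# Illustrative sketch of tool-forced retrieval. The llm client,
# search_documents, and message shapes are hypothetical stand-ins.

SYSTEM_PROMPT = """You are a counselling supervision assistant.
When the user asks about BACP, NCPS, UKCP, or any professional body's
framework, you MUST call search_documents before answering.
Do not answer these questions from general knowledge.
Cite the source document for every claim drawn from retrieval."""

SEARCH_TOOL = {
    "name": "search_documents",
    "description": "Search the curated base of professional documents "
                   "(BACP, NCPS, UKCP). Required for any question about "
                   "professional standards, ethics, or clinical guidelines.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def run_turn(llm, search_documents, user_message: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    response = llm(messages, tools=[SEARCH_TOOL])
    while response.tool_calls:                    # model requested a search
        messages.append(response.message)         # keep the tool request
        for call in response.tool_calls:
            chunks = search_documents(call.arguments["query"])
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": chunks})
        response = llm(messages, tools=[SEARCH_TOOL])
    return response.content                       # grounded in the chunks
```

In practice the prompt instruction can be backed up by the API itself: for in-scope turns, most tool-calling APIs let you require a specific tool rather than merely offer it, which turns "must use" from a request into a guarantee.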

Why not just fine-tune?

Fine-tuning would bake the clinical knowledge into the model's weights, but it has two problems for this use case.

Attribution. A fine-tuned model generates text that incorporates the training data, but you can’t trace which specific document informed a particular answer. With RAG, the retrieved chunks are visible - the user can see exactly which BACP guideline the answer draws from.
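
To make that concrete, here is a hedged sketch of how source metadata can travel with each retrieved chunk; the chunk shape and field names are assumptions:

```python
# Sketch: attribution falls out of RAG because each retrieved chunk keeps
# its source metadata. The chunk schema here is an assumption.

def format_context(chunks: list[dict]) -> str:
    # chunks look like {"text": ..., "source": ..., "section": ...}
    return "\n\n".join(
        f"[{i + 1}] {c['source']}, section {c['section']}:\n{c['text']}"
        for i, c in enumerate(chunks)
    )

# The model is instructed to cite [1], [2], ... inline, so every claim in
# the answer traces back to a specific document in the curated base.
```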

Updates. Professional guidelines change. BACP publishes new supervision frameworks. Ethical guidelines get updated. With RAG, you update the document base. With fine-tuning, you retrain the model.

What gets forced

Not every question needs retrieval. “How are you today?” doesn’t require searching the BACP guidelines. The forcing is scoped to specific topics:

  • Professional body requirements (BACP, NCPS, UKCP)
  • Therapeutic approaches and methodologies
  • Ethical guidelines and codes of practice
  • Supervision frameworks and models
  • Clinical placement requirements

The LLM is still free to have a normal conversation about other topics. But anything touching professional standards triggers retrieval before the model can answer.
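
A sketch of one way to implement that scoping; classify() stands in for whatever cheap intent check the router uses (keyword rules or a small classifier), and the topic labels mirror the list above:

```python
# Hypothetical router: decide per turn whether retrieval must be forced.

FORCED_TOPICS = {
    "professional_body",  # BACP, NCPS, UKCP requirements
    "methodology",        # therapeutic approaches and methodologies
    "ethics",             # ethical guidelines and codes of practice
    "supervision",        # supervision frameworks and models
    "placement",          # clinical placement requirements
}

def must_retrieve(message: str, classify) -> bool:
    """classify() is assumed to return one topic label per message."""
    return classify(message) in FORCED_TOPICS

# In-scope turns go out with tool use required (search_documents must be
# called); everything else the model may answer conversationally.
```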

Trainee profiles shape retrieval

Each trainee has a profile: years of experience, therapeutic methodology, governing body, qualifications. This context shapes both the retrieval query and the response framing.

A first-year person-centred student asking about supervision gets different results from an experienced CBT practitioner asking the same question. The retrieval is scoped to relevant documents, and the response is framed at the appropriate level.

This prevents the tool from overwhelming a beginner with advanced clinical concepts or from being too basic for an experienced practitioner. The knowledge base is the same; the presentation is personalised.
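
A sketch of how a profile might feed both the query and the framing, with assumed field names and a deliberately crude experience threshold:

```python
# Assumed profile shape; field names are illustrative.
from dataclasses import dataclass

@dataclass
class TraineeProfile:
    years_experience: int
    methodology: str        # e.g. "person-centred", "CBT"
    governing_body: str     # e.g. "BACP"
    qualifications: list[str]

def build_query(question: str, profile: TraineeProfile) -> str:
    # Bias retrieval toward the trainee's framework. The real ranking
    # logic is more involved; appending terms is the simplest version.
    return f"{question} {profile.governing_body} {profile.methodology}"

def framing_instruction(profile: TraineeProfile) -> str:
    # Added to the system prompt so the answer is pitched at the right level.
    level = "introductory" if profile.years_experience < 2 else "advanced"
    return (f"Frame the answer at an {level} level for a "
            f"{profile.methodology} trainee registered with "
            f"{profile.governing_body}.")
```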

Does it work?

The forced retrieval eliminates the category of errors where the LLM generates confident but inaccurate clinical guidance. When the answer comes from a retrieved document, it's either correct (the document is accurate) or at least traceable (the user can check the source).

The trade-off is latency. Every forced retrieval adds an embedding step, a vector search, and a graph traversal before the LLM can generate. For clinical questions, this takes an extra second or two. For a supervision tool where accuracy matters more than speed, that’s an acceptable trade.
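
Those three steps read naturally as a small pipeline. The component names below are stand-ins for the real embedding model, vector index, and graph store; the timer just makes the added latency visible:

```python
import time

def retrieve(query: str, embed, vector_search, graph_expand) -> list:
    # The three steps that add latency before generation can start.
    # embed / vector_search / graph_expand are stand-ins for the real
    # embedding model, vector index, and graph store.
    t0 = time.perf_counter()
    vec = embed(query)              # 1. embedding step
    hits = vector_search(vec, k=8)  # 2. vector search
    chunks = graph_expand(hits)     # 3. graph traversal for linked context
    print(f"retrieval took {time.perf_counter() - t0:.2f}s")
    return chunks
```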