Conversational AI: From RAG Prototypes to Domain-Specific Supervision
A series of conversational AI systems exploring streaming interfaces, contextual memory, tool-calling, and the progression from vector search through to graph RAG. Culminated in a counselling supervision tool grounded in 70+ professional documents from BACP, NCPS, and UKCP.
Overview
Built iteratively across multiple prototypes - a streaming Slack bot, a Go-based chat API with Shopify tool integration, and a clinical supervision chatbot. Each iteration evolved the retrieval architecture (Pinecone → pgvector → graph RAG), refined memory persistence patterns, and expanded the tool-calling system from hardcoded integrations to a configurable tool builder supporting multiple tool types.
Problem
Off-the-shelf AI chat treats every conversation as stateless and every query as independent. For real use cases, you need persistent context across sessions, domain-specific retrieval that doesn't hallucinate, and extensible tool systems that non-developers can configure. The counselling domain added a further constraint: responses must be grounded in published professional standards, not general LLM knowledge.
Constraints
- Responses must be grounded in source documents, not general LLM knowledge
- Must maintain conversational context across sessions and channels
- Tool system must be extensible without code changes
- Must support multiple therapeutic modalities (CBT, person-centred, integrative, gestalt)
- Solo project across all iterations
Approach
Built three distinct systems, each learning from the last. The Slack bot established streaming responses and channel-aware context with Pinecone. The Go rewrite moved to pgvector, added OpenAI function-calling with Shopify order lookup and 2FA verification, and deployed to AWS Lambda. The counselling supervisor applied graph RAG for relationship-aware retrieval across 70+ professional PDFs, added a configurable tool builder with custom and built-in tool types, and personalised responses using trainee profiles.
Key Decisions
Pinecone to pgvector to graph RAG
Pinecone worked but added operational complexity. pgvector kept vectors alongside relational data. Graph RAG captured relationships between concepts - document-to-document links, semantic tags with confidence scores, tag co-occurrence - producing more coherent multi-hop answers.
Configurable tool builder over hardcoded tools
The first API had hardcoded tools. The Slack bot needed different tools. Building a tool system with custom tools (user-defined via admin panel), built-in tools (document search, API integrations), and execution tracking made the platform extensible without code changes.
Go rewrite for the chat API
Lambda cold starts and memory usage mattered for a customer-facing API. Go's compiled binary, fast startup, and low memory footprint made it a better fit than Node.js for serverless deployment.
Chunk-then-summarise embedding pipeline
Raw PDF chunks contain formatting noise and partial sentences. Summarising each chunk with an LLM before embedding produces cleaner vectors and better retrieval relevance.
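The pipeline can be sketched as a two-stage map - summarise, then embed. This is a minimal illustration: the type names are made up, and `summarise`/`embed` are stand-ins for the real LLM and embedding calls.

```typescript
// Chunk-then-summarise ingestion sketch (illustrative names; summarise
// and embed are injected stand-ins for real LLM/embedding calls).
type Chunk = { id: string; text: string };
type EmbeddedChunk = { id: string; summary: string; vector: number[] };

async function ingestChunks(
  chunks: Chunk[],
  summarise: (text: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>,
): Promise<EmbeddedChunk[]> {
  const out: EmbeddedChunk[] = [];
  for (const chunk of chunks) {
    // Summarising first strips formatting noise and partial sentences,
    // so the vector reflects the chunk's meaning rather than its layout.
    const summary = await summarise(chunk.text);
    out.push({ id: chunk.id, summary, vector: await embed(summary) });
  }
  return out;
}
```

Injecting the two functions keeps the pipeline testable and makes swapping embedding providers (OpenAI to Ollama, as below) a one-line change.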
Local Ollama embeddings over OpenAI API
OpenAI's text-embedding-3-small charges per token across ingestion, re-ingestion, and every query. A local Ollama embedding model eliminated the recurring cost with comparable retrieval quality for this domain-specific corpus.
Per-channel memory in Slack, stateless in API
Slack conversations are collaborative - per-channel context captures the full thread. The customer API is stateless with client-managed history, keeping the Lambda lightweight and the architecture simpler.
Tech Stack
TypeScript Go Next.js pgvector Graph RAG Vercel AI SDK
Result & Impact
- Iterations: 3 systems (Slack bot, Go API, Counselling Supervisor)
- RAG Evolution: Pinecone → pgvector → Graph RAG, OpenAI → Ollama embeddings
- Knowledge Base: 70+ professional documents (BACP, NCPS, UKCP)
- Tool System: Configurable builder with custom and built-in tool types
The progression through three systems provided direct comparison data on retrieval architectures, memory models, and tool-calling patterns. The Go rewrite demonstrated the cold-start and memory advantages of compiled languages for serverless AI. The counselling supervisor proved that tool-forced RAG with graph relationships produces more grounded, contextually appropriate responses than flat vector search. The tool builder made the platform reusable across domains without code changes.
Learnings
- pgvector is good enough for most use cases and dramatically simplifies ops compared to a managed vector DB.
- Graph RAG's advantage shows up most on multi-hop questions. For simple lookups, flat vector search is fine.
- Summarising chunks before embedding significantly improved retrieval relevance over raw text.
- Tool-forcing for domain-specific questions prevents the LLM from confidently making up guidelines.
- Go on Lambda gives sub-second cold starts. The same API in Node.js was noticeably slower to initialise.
- Per-channel memory in Slack is more valuable than per-user - collaborative context matters.
- For domain-specific corpora with well-structured text, local embedding models match hosted API quality at zero marginal cost.
Progression
Three systems, each building on lessons from the last.
1. Streaming Slack Bot (Iris)
The first prototype was a Slack bot called Iris, built with Next.js and the Vercel AI SDK. The goal was to answer internal questions using company documents rather than generic LLM knowledge.
Streaming and context: Messages stream in real time using the Vercel AI SDK. Context handling varies by Slack interaction type - channel mentions get the current message only, thread replies load the full thread history, and DMs load today’s messages (up to 50) for continuity.
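The three context rules above reduce to a small dispatch - a sketch with illustrative types and the 50-message cap from the description:

```typescript
// Per-interaction context selection sketch (type and field names are
// assumptions; the rules mirror the description above).
type Interaction = 'channel_mention' | 'thread_reply' | 'dm';
type Msg = { text: string; ts: number };

function selectContext(kind: Interaction, current: Msg, history: Msg[]): Msg[] {
  switch (kind) {
    case 'channel_mention':
      return [current]; // current message only
    case 'thread_reply':
      return [...history, current]; // full thread history
    case 'dm': {
      const startOfDay = new Date().setHours(0, 0, 0, 0);
      // today's messages only, capped at 50 for continuity
      return [...history.filter(m => m.ts >= startOfDay), current].slice(-50);
    }
  }
}
```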
Conversation memory: A PostgreSQL-backed memory service tracks entities (people, locations, topics), decisions, and preferences extracted from conversations. Memory entries expire after 48 hours by default but are configurable.
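The expiry rule is simple enough to sketch directly - a 48-hour default TTL with an override, over the entry kinds the memory service tracks (field names are illustrative):

```typescript
// Memory expiry sketch: entries default to a 48-hour TTL, configurable
// per call (shape is an assumption based on the description above).
type MemoryEntry = {
  kind: 'entity' | 'decision' | 'preference';
  value: string;
  createdAt: number; // epoch ms
};

const DEFAULT_TTL_MS = 48 * 60 * 60 * 1000;

function activeMemories(
  entries: MemoryEntry[],
  now: number,
  ttlMs: number = DEFAULT_TTL_MS,
): MemoryEntry[] {
  return entries.filter(e => now - e.createdAt < ttlMs);
}
```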
RAG: Started with Pinecone for vector search using OpenAI text-embedding-3-small embeddings at 768 dimensions. Documents were uploaded through an admin dashboard with folder hierarchy, enable/disable controls, and metadata.
CSV processing: Added LLM-based analysis for CSV data - column embeddings, contextual summaries, and support for both comma- and pipe-delimited formats.
2. Go Chat API
Rewrote the chat API in Go for a customer-facing deployment on AWS Lambda. The Node.js version had slow cold starts and high memory usage. The Go binary starts in under a second with a fraction of the memory.
Architecture: Go Fiber framework running on Lambda (ARM64), fronted by API Gateway with x-api-key authentication. Infrastructure managed with AWS CDK - Lambda, API Gateway, DynamoDB, Route53, Secrets Manager, CloudWatch alarms.
pgvector migration: Moved from Pinecone to PostgreSQL with the pgvector extension. Vectors live alongside relational data in the same database. Search uses cosine similarity with configurable thresholds and scope control - customer scope returns only public documents, internal scope includes confidential materials.
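A scoped query along these lines might look as follows - the table and column names are assumptions, but `<=>` is pgvector's cosine-distance operator and the scope filter mirrors the public/confidential split described above:

```typescript
// Scoped pgvector search sketch. Table/column names are illustrative;
// <=> is pgvector's cosine-distance operator, so 1 - distance is
// cosine similarity. $1 is the query embedding parameter.
function buildSearchQuery(scope: 'customer' | 'internal', threshold: number): string {
  const scopeFilter =
    scope === 'customer'
      ? "visibility = 'public'" // customer scope: public documents only
      : "visibility IN ('public', 'confidential')"; // internal scope
  return `
    SELECT id, content, 1 - (embedding <=> $1) AS similarity
    FROM document_chunks
    WHERE ${scopeFilter}
      AND 1 - (embedding <=> $1) >= ${threshold}
    ORDER BY embedding <=> $1
    LIMIT 10`;
}
```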
Tool calling: OpenAI function-calling with multiple tools. Order lookup queries the Shopify GraphQL Admin API. Delivery estimation calculates expected dates based on order contents. Email 2FA verification (via Resend) gates access to order information - codes expire after 10 minutes with a maximum of 3 attempts.
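The 2FA gate's rules - 10-minute expiry, maximum of 3 attempts - are sketched below in TypeScript for illustration (the production system is Go; the structure is an assumption):

```typescript
// 2FA verification sketch: codes expire after 10 minutes, with a
// maximum of 3 attempts before lockout (shape is illustrative).
type TwoFactorCode = { code: string; issuedAt: number; attempts: number };

const CODE_TTL_MS = 10 * 60 * 1000;
const MAX_ATTEMPTS = 3;

function verifyCode(
  stored: TwoFactorCode,
  submitted: string,
  now: number,
): 'ok' | 'expired' | 'locked' | 'wrong' {
  if (now - stored.issuedAt > CODE_TTL_MS) return 'expired';
  if (stored.attempts >= MAX_ATTEMPTS) return 'locked';
  stored.attempts += 1; // count every live attempt, right or wrong
  return stored.code === submitted ? 'ok' : 'wrong';
}
```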
Caching: DynamoDB caches search results keyed by SHA256(scope + query). 12-hour TTL in staging, 24-hour in production. Hit rate tracking with atomic counters.
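The cache key and TTL policy can be sketched with Node's standard crypto module (function names are illustrative; the real implementation is Go):

```typescript
import { createHash } from 'node:crypto';

// Cache key sketch: SHA256 over scope + query, with an environment-
// dependent TTL - 12 hours in staging, 24 in production.
function cacheKey(scope: string, query: string): string {
  return createHash('sha256').update(scope + query).digest('hex');
}

function ttlSeconds(env: 'staging' | 'production'): number {
  return env === 'staging' ? 12 * 3600 : 24 * 3600;
}
```

Hashing scope into the key ensures a customer-scope query can never hit a cached internal-scope result.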
Deployment: CodePipeline CI/CD with separate staging and production environments. Secrets in AWS Secrets Manager, Lambda running inside a VPC with NAT Gateway for outbound access.
3. Counselling Supervisor
The final iteration applied everything learned to a specific domain: clinical supervision training for counsellors. Built with Next.js and the Vercel AI SDK, deployed on Vercel.
Knowledge base: 70+ professional PDFs covering supervision frameworks, therapeutic approaches, ethics codes, and clinical placement guidelines from BACP (British Association for Counselling & Psychotherapy), NCPS, and UKCP. Documents processed through a chunk-then-summarise pipeline - each chunk is summarised by an LLM before embedding, producing cleaner vectors and better retrieval. Embeddings originally used OpenAI’s text-embedding-3-small, later switched to a local Ollama model to eliminate per-token API costs.
Graph RAG: Evolved beyond flat vector search to a knowledge graph structure. Documents are connected through semantic tags (with confidence scores), document-to-document links (with relationship types), and tag co-occurrence analysis. This captures how concepts relate - a question about CBT supervision ethics traverses links to both CBT methodology documents and BACP ethics guidelines, rather than just returning the most similar chunks.
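The traversal step can be sketched as expanding the top vector hits through document links and shared high-confidence tags - a minimal illustration with assumed names and an assumed 0.7 confidence floor:

```typescript
// Relationship-aware expansion sketch: starting from vector-search hits,
// follow document-to-document links and shared high-confidence semantic
// tags (names and the 0.7 threshold are assumptions).
type Doc = { id: string; links: string[]; tags: Record<string, number> };

function expandHits(
  hits: string[],
  docs: Map<string, Doc>,
  minConfidence = 0.7,
): string[] {
  const result = new Set(hits);
  for (const id of hits) {
    const doc = docs.get(id);
    if (!doc) continue;
    doc.links.forEach(l => result.add(l)); // direct document links
    for (const [otherId, other] of docs) {
      // include documents sharing a high-confidence semantic tag
      const shared = Object.entries(doc.tags).some(
        ([tag, conf]) => conf >= minConfidence && (other.tags[tag] ?? 0) >= minConfidence,
      );
      if (shared) result.add(otherId);
    }
  }
  return [...result];
}
```

This is how a CBT-supervision-ethics question can surface both methodology documents and ethics guidelines that a flat similarity search would rank too low.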
Tool builder: An admin dashboard where tools are created and configured without code changes. Custom tools are user-defined with parameters and execution rules. Built-in tools (document search, API integrations) can be enabled or disabled per deployment. All tool executions are tracked with inputs, outputs, errors, and duration.
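A tool definition as the builder might store it - the shape is an assumption, covering the custom/built-in split and execution tracking described above:

```typescript
// Tool configuration sketch (shape is illustrative): custom tools carry
// user-defined parameters and rules; built-in tools toggle per deployment.
type ToolConfig =
  | { kind: 'custom'; name: string; parameters: Record<string, string>; executionRule: string }
  | { kind: 'builtin'; name: 'document_search' | 'api_integration'; enabled: boolean };

// Every execution is tracked with inputs, outputs, errors, and duration.
type ToolExecution = { tool: string; input: unknown; output?: unknown; error?: string; durationMs: number };

function enabledTools(tools: ToolConfig[]): string[] {
  return tools.filter(t => t.kind === 'custom' || t.enabled).map(t => t.name);
}
```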
Personalisation: Each trainee creates a profile with their years of experience, country, supervisory style, qualifications, accreditations, therapeutic methodologies, and governing bodies. This context shapes the system prompt - a first-year person-centred student gets different supervision guidance than an experienced CBT practitioner.
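Profile-driven prompt shaping might look like the sketch below - the fields come from the profile described above, but the wording and the two-year threshold are illustrative:

```typescript
// System-prompt personalisation sketch (wording and experience
// threshold are assumptions; fields follow the trainee profile above).
type TraineeProfile = {
  yearsExperience: number;
  modality: string; // e.g. 'person-centred', 'CBT'
  governingBodies: string[];
};

function buildSystemPrompt(p: TraineeProfile): string {
  const level = p.yearsExperience < 2 ? 'a trainee' : 'an experienced practitioner';
  return (
    `You are a clinical supervision assistant. The user is ${level} ` +
    `working in the ${p.modality} modality, accountable to ${p.governingBodies.join(', ')}. ` +
    `Pitch guidance accordingly and cite the relevant professional standards.`
  );
}
```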
Multi-LLM: Supports OpenAI (GPT-4o, GPT-4o-mini) and Anthropic (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3.5 Haiku), plus local models via Ollama (Gemma 2, Llama 3.x). Different models suit different interaction types - empathetic conversation vs structured clinical reasoning.
Tool-forced retrieval: The document search tool is forced for any question touching professional standards, ethics, or clinical guidelines. This prevents the LLM from confidently generating plausible-sounding clinical advice that doesn’t match published standards.
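The forcing rule reduces to a classifier over the question - sketched here with an illustrative keyword list (with the Vercel AI SDK, the result would map to a `toolChoice` setting on the generation call):

```typescript
// Tool-forcing sketch: questions touching professional standards,
// ethics, or guidelines must route through document search. The
// keyword list is illustrative; production classification could be
// richer (e.g. an LLM-based router).
const FORCED_TOPICS = ['ethic', 'standard', 'guideline', 'bacp', 'ncps', 'ukcp'];

function toolChoiceFor(question: string): 'auto' | 'document_search' {
  const q = question.toLowerCase();
  return FORCED_TOPICS.some(t => q.includes(t)) ? 'document_search' : 'auto';
}
```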