AI Integration - Ollama & RAG
Overview
This guide demonstrates integrating Ollama (local LLM) with RAG (Retrieval-Augmented Generation) to provide intelligent, context-aware explanations for complex domain decisions. The system helps operations staff understand why certain criteria are applied to assessments — without any data leaving the on-premise environment.
Key Components
- Ollama: Local LLM server running Gemma 3:4b for generating natural language explanations
- FastAPI Embedding Service: Python microservice generating text embeddings using `mxbai-embed-large` (1024 dimensions)
- PostgreSQL pgvector: Vector database for storing and searching knowledge rationales via cosine similarity
- NestJS Services: Integration layer connecting all components
Architecture Diagram
```mermaid
graph TD
    A[Client Request] --> B[AssessmentsController]
    B --> C[AssessmentExplainersService]
    C --> D[EmbeddingService]
    D --> E[FastAPI Embedding Server<br/>Port 8001]
    E --> F[mxbai-embed-large Model<br/>1024 dimensions]
    F --> E
    E --> D
    D --> C
    C --> G[PostgreSQL + pgvector<br/>Semantic Search]
    G --> H[Top 3 Similar Documents]
    H --> C
    C --> I[LlmService]
    I --> J[Ollama Server<br/>Port 11434]
    J --> K[Gemma 3:4b Model]
    K --> J
    J --> I
    I --> C
    C --> B
    B --> A
```
Prerequisites
1. Ollama Installation

Install Ollama:

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download
```

Pull the Gemma 3:4b Model:

```bash
ollama pull gemma3:4b
```

Start Ollama Server:

```bash
ollama serve
# Server runs on http://localhost:11434
```

Verify Installation:

```bash
curl http://localhost:11434/api/tags
```
2. FastAPI Embedding Service Setup

Create Python Virtual Environment:

```bash
cd /path/to/embedding-service
python3 -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
```

Install Dependencies:

```bash
pip install fastapi uvicorn sentence-transformers
```

Create embedding_server.py:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Load the embedding model (1024 dimensions)
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')

class EmbedRequest(BaseModel):
    text: str

class EmbedResponse(BaseModel):
    embedding: list[float]

@app.post("/embed", response_model=EmbedResponse)
async def create_embedding(request: EmbedRequest):
    try:
        embedding = model.encode(request.text)
        return {"embedding": embedding.tolist()}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": "mxbai-embed-large-v1"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)
```

Run the Server:

```bash
python embedding_server.py
# Server runs on http://localhost:8001
```

Verify:

```bash
curl -X POST http://localhost:8001/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "assessment criteria for resource prioritization"}'
```
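As a quick sanity check outside of curl, the endpoint can also be exercised from Python. The following is a minimal sketch using only the standard library; `get_embedding` and the injectable `opener` parameter are illustrative names, not part of this project.

```python
import json
import urllib.request

def get_embedding(text: str,
                  url: str = "http://localhost:8001/embed",
                  opener=urllib.request.urlopen) -> list[float]:
    """POST text to the embedding service and return the vector.

    `opener` defaults to urllib's urlopen; it is injectable so the
    function can be exercised without a live server.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request) as response:
        payload = json.loads(response.read())
    return payload["embedding"]
```

With embedding_server.py running, `get_embedding("assessment criteria for resource prioritization")` should return a list of 1024 floats.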
3. PostgreSQL pgvector Extension

Install pgvector:

```sql
-- Connect to your database
\c app_core_db

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify installation
SELECT * FROM pg_extension WHERE extname = 'vector';
```

Create Assessment Rationales Table:

```sql
CREATE TABLE assessment_rationales (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  source VARCHAR(255) NOT NULL,
  embedding vector(1024) NOT NULL, -- 1024 dimensions for mxbai-embed-large
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create index for vector similarity search
CREATE INDEX idx_assessment_rationales_embedding
ON assessment_rationales
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
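When seeding the assessment_rationales table, each embedding must be serialized into pgvector's bracketed text format (e.g. `[0.1,0.2,0.3]`). A tiny illustrative Python helper — the function name is hypothetical, not project code:

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Serialize a float list into pgvector's input format: '[v1,v2,...]'."""
    return "[" + ",".join(str(value) for value in embedding) + "]"

# Illustrative use with a parameterized INSERT (driver-agnostic sketch):
#   INSERT INTO assessment_rationales (content, source, embedding)
#   VALUES ($1, $2, $3::vector)
# with $3 bound to to_pgvector_literal(vector)
```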
Environment Configuration

Add the following to your .env file:

```bash
# Ollama Configuration
AI_SERVER_URL=http://localhost:11434/v1
AI_API_KEY=ollama  # Ollama doesn't require a key, but the SDK needs a non-empty value

# FastAPI Embedding Service
EMBEDDING_SERVICE_URL=http://localhost:8001/embed
```

Add validation to libs/config/src/config.module.ts:

```typescript
validationSchema: Joi.object({
  // ... existing variables ...

  // AI Configuration
  AI_SERVER_URL: Joi.string().required(),
  AI_API_KEY: Joi.string().required(),
  EMBEDDING_SERVICE_URL: Joi.string().optional(),
}),
```
Service Implementation

1. LLM Service (llm.service.ts)

Purpose: Connect to the Ollama server and generate natural language explanations.

Key Features:

- Uses OpenAI-compatible SDK (works with Ollama)
- Configurable model (default: `gemma3:4b`)
- Temperature control for response consistency
- Error handling and structured logging

Example Usage:

```typescript
import { Injectable } from '@nestjs/common';

import { LlmService } from './services/llm.service';

@Injectable()
export class SomeService {
  constructor(private llmService: LlmService) {}

  async getExplanation(): Promise<string> {
    const prompt = `
      You are a domain expert assistant.
      Explain why elevated systolic pressure is a critical assessment criterion.
    `;

    return this.llmService.generateAnswer(prompt);
  }
}
```

Configuration Options:

```typescript
const completion = await this.openai.chat.completions.create({
  model: 'gemma3:4b', // Ollama model name
  temperature: 0.3,   // 0.0-0.5 for factual responses
  max_tokens: 1000,   // Maximum response length
  messages: [...],
});
```
2. Embedding Service (embedding.service.ts)

Purpose: Convert text to vector embeddings for semantic search.

Key Features:

- Calls the FastAPI embedding server via Axios
- Returns 1024-dimensional vectors
- Error handling for service unavailability

Example Usage:

```typescript
import { Injectable } from '@nestjs/common';

import { EmbeddingService } from './services/embedding.service';

@Injectable()
export class SomeService {
  constructor(private embeddingService: EmbeddingService) {}

  async convertTextToVector(): Promise<number[]> {
    const text = 'assessment criteria for resource prioritization';
    const vector = await this.embeddingService.getEmbedding(text);

    console.log(vector.length); // 1024
    return vector;
  }
}
```
3. Assessment Explainers Service (assessment-explainers.service.ts)

Purpose: Implement the RAG pattern to generate context-aware explanations.

RAG Workflow:

- Query Enhancement: Transform the user question into a better search query
- Embedding: Convert the query to a vector using `EmbeddingService`
- Retrieval: Find the top 3 most similar documents from pgvector
- Augmentation: Build a prompt with the retrieved context
- Generation: Use the LLM to generate the final answer

Example Usage:

```typescript
import { Injectable } from '@nestjs/common';

import { AssessmentExplainersService } from './services/assessment-explainers.service';

@Injectable()
export class AssessmentsService {
  constructor(private explainerService: AssessmentExplainersService) {}

  async explainCriteria(reason: string): Promise<string> {
    return this.explainerService.explain(reason);
  }
}
```

RAG Implementation Details:

```typescript
async explain(reason: string): Promise<string> {
  // 1. Query Enhancement
  const searchQuery = `assessment criteria for ${reason}`;

  // 2. Generate Embedding
  const queryVector = await this.embeddingService.getEmbedding(searchQuery);

  // 3. Semantic Search (cosine distance)
  const similarDocs = await this.assessmentRationaleRepo.query(
    `SELECT content, source, embedding <=> $1 AS distance
     FROM assessment_rationales
     ORDER BY distance
     LIMIT 3`,
    [`[${queryVector.join(',')}]`],
  );

  // 4. Build Context
  const context = similarDocs
    .map((doc) => `- ${doc.content} (Source: ${doc.source})`)
    .join('\n');

  // 5. Generate Prompt
  const prompt = `
    You are a domain expert assistant.
    Use the following reference documents:
    ---
    ${context}
    ---
    To answer the question: "Why is '${reason}' an important assessment criterion?"
    Provide a concise answer and clearly cite the sources.
  `;

  // 6. Generate Answer
  return this.llmService.generateAnswer(prompt);
}
```
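For readers who prefer to see the control flow without the NestJS wiring, here is the same six-step pipeline as a dependency-free Python sketch. The embedding service, vector search, and LLM are passed in as callables and stubbed out; all names are illustrative, not project code.

```python
from typing import Callable

def explain(reason: str,
            embed: Callable[[str], list[float]],
            search: Callable[[list[float], int], list[dict]],
            generate: Callable[[str], str]) -> str:
    """RAG pipeline sketch: enhance -> embed -> retrieve -> augment -> generate."""
    # 1. Query enhancement: add domain context to the raw reason
    search_query = f"assessment criteria for {reason}"
    # 2. Embedding: text -> vector (1024 dims in the real system)
    query_vector = embed(search_query)
    # 3. Retrieval: top 3 documents by cosine distance
    docs = search(query_vector, 3)
    # 4. Context augmentation: flatten retrieved docs into bullet lines
    context = "\n".join(f"- {d['content']} (Source: {d['source']})" for d in docs)
    # 5. Prompt construction: ground the question in the retrieved context
    prompt = (
        "You are a domain expert assistant.\n"
        f"Use the following reference documents:\n---\n{context}\n---\n"
        f"To answer the question: \"Why is '{reason}' an important assessment criterion?\"\n"
        "Provide a concise answer and clearly cite the sources."
    )
    # 6. Generation: delegate to the LLM
    return generate(prompt)
```

Passing real implementations for `embed`, `search`, and `generate` would reproduce the NestJS behavior; passing stubs makes the flow unit-testable.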
Module Registration

File: apps/data-consumer-bc/src/modules/assessment/assessment.module.ts

```typescript
import { HttpModule } from '@nestjs/axios';
import { Module } from '@nestjs/common';
import { TypeOrmModule } from '@nestjs/typeorm';

import { CommonModule } from '@lib/common';
import { AppDatabases } from '@lib/common/enum/app-databases.enum';
import { DatabaseModule } from '@lib/database';

import { AssessmentRationale } from './entities/assessment-rationale.entity';
import { EmbeddingService } from './services/embedding.service';
import { LlmService } from './services/llm.service';
import { AssessmentExplainersService } from './services/assessment-explainers.service';

@Module({
  imports: [
    HttpModule, // Required for EmbeddingService
    CommonModule,
    DatabaseModule.registerAsync(AppDatabases.APP_CORE),
    TypeOrmModule.forFeature([AssessmentRationale], AppDatabases.APP_CORE),
  ],
  providers: [
    EmbeddingService,
    LlmService,
    AssessmentExplainersService,
  ],
  exports: [
    EmbeddingService,
    LlmService,
    AssessmentExplainersService,
  ],
})
export class AssessmentModule {}
```
API Endpoints

Explain Assessment Criteria

Endpoint: GET /data-consumer-bc/v1/assessments/explain

Query Parameters:

- `reason` (required): The criterion to explain (e.g., `elevated-blood-pressure`)

Request Example:

```bash
curl -X GET "http://localhost:3001/data-consumer-bc/v1/assessments/explain?reason=elevated-systolic-pressure" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
```

Response Example:

```json
{
  "status": {
    "code": 200000,
    "message": "Request Succeeded"
  },
  "data": {
    "type": "assessment-explanation",
    "attributes": {
      "reason": "elevated-systolic-pressure",
      "explanation": "Elevated systolic pressure is a critical assessment criterion because...",
      "sources": [
        "Clinical Assessment Guidelines v4",
        "Emergency Response Standards"
      ]
    }
  }
}
```
Complete Request Flow: Sequence Diagram

```mermaid
sequenceDiagram
    autonumber
    actor Client as Client/Frontend
    participant API as API Gateway
    participant Auth as AuthGuard (NestJS)
    participant Controller as AssessmentsController
    participant Explainer as AssessmentExplainersService
    participant Embed as EmbeddingService
    participant FastAPI as FastAPI Server (Port 8001)
    participant Model as mxbai-embed-large
    participant DB as PostgreSQL (pgvector)
    participant LLM as LlmService
    participant Ollama as Ollama Server (Port 11434)
    participant Gemma as Gemma 3:4b
    participant Transform as TransformInterceptor

    Note over Client,Transform: Phase 1 — Authentication & Routing
    Client->>+API: GET /assessments/explain?reason=elevated-systolic-pressure
    API->>+Auth: Validate JWT & Permissions
    Auth->>Auth: Verify session, check permission: assessment:view
    Auth-->>-API: Auth OK

    Note over API,Controller: Phase 2 — Controller Entry
    API->>+Controller: Forward request with validated user context
    Controller->>+Explainer: explain("elevated-systolic-pressure")

    Note over Explainer,Model: Phase 3 — Query Enhancement & Embedding
    Explainer->>Explainer: Enhance: "assessment criteria for elevated-systolic-pressure"
    Explainer->>+Embed: getEmbedding(searchQuery)
    Embed->>+FastAPI: POST /embed { "text": "assessment criteria for..." }
    FastAPI->>+Model: Encode text to vector
    Model-->>-FastAPI: Return embedding array (1024 dims)
    FastAPI-->>-Embed: { "embedding": [0.234, -0.567, ...] }
    Embed-->>-Explainer: Return vector

    Note over Explainer,DB: Phase 4 — Semantic Search (RAG Retrieval)
    Explainer->>+DB: SELECT content, source, embedding <=> $1 AS distance<br/>FROM assessment_rationales ORDER BY distance LIMIT 3
    DB->>DB: pgvector cosine similarity search (IVFFlat index)
    DB-->>-Explainer: Top 3 similar documents

    Note over Explainer,Gemma: Phase 5 — Augmentation & Generation
    Explainer->>Explainer: Build context from retrieved docs
    Explainer->>Explainer: Construct RAG prompt with context
    Explainer->>+LLM: generateAnswer(prompt)
    LLM->>+Ollama: POST /v1/chat/completions { model: "gemma3:4b", temperature: 0.3 }
    Ollama->>+Gemma: Process prompt with context
    Gemma-->>-Ollama: Generated explanation
    Ollama-->>-LLM: { choices: [{ message: { content: "..." } }] }
    LLM-->>-Explainer: Return explanation string

    Note over Explainer,Transform: Phase 6 — Response Formatting
    Explainer-->>-Controller: Return explanation
    Controller-->>-Transform: Return response data
    Transform->>Transform: Wrap in JSON:API format
    Transform-->>Client: HTTP 200 OK
```
Step-by-Step Breakdown
Phase 1: Authentication & Authorization

- Client sends a GET request with the `reason` query parameter and a JWT Bearer token
- API Gateway validates the JWT, checks the Redis session, and verifies the `assessment:view` permission
- Extracts the user context (id, roles, permissions)
Phase 3: Query Enhancement & Embedding
```typescript
const searchQuery = `assessment criteria for ${reason}`;
// Example: "assessment criteria for elevated-systolic-pressure"
```

- Transforms the user question into an optimized search query by adding domain context
- `EmbeddingService` sends an HTTP POST to the FastAPI server: `{ "text": "assessment criteria for..." }`
- FastAPI uses the `mxbai-embed-large` Sentence Transformer to encode the text into a 1024-dimensional vector
Phase 4: Semantic Search
```sql
SELECT content, source, embedding <=> $1 AS distance
FROM assessment_rationales
ORDER BY distance
LIMIT 3
```

- Uses pgvector's cosine distance operator `<=>`
- PostgreSQL uses an IVFFlat or HNSW index — searches clusters instead of all records
- Returns the top 3 most semantically similar documents

Example retrieved documents:

```json
[
  {
    "content": "Systolic pressure above 180 mmHg is considered a critical threshold...",
    "source": "Clinical Assessment Guidelines v4",
    "distance": 0.12
  },
  {
    "content": "Clients presenting with chest discomfort alongside elevated pressure...",
    "source": "Emergency Response Standards",
    "distance": 0.18
  },
  {
    "content": "Accurate measurement procedure is required for reliable decision-making...",
    "source": "Assessment Standards Manual",
    "distance": 0.23
  }
]
```
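The distances in the result set come straight from the `<=>` operator: cosine distance, i.e. 1 minus cosine similarity, where smaller means more similar. A brute-force pure-Python equivalent of the query (toy 3-dimensional vectors; the real system uses 1024 dimensions and an IVFFlat index, and all data below is illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; what pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query: list[float], docs: list[dict], k: int = 3) -> list[dict]:
    """Brute-force equivalent of ORDER BY embedding <=> query LIMIT k."""
    return sorted(docs, key=lambda d: cosine_distance(query, d["embedding"]))[:k]

# Toy corpus with made-up 3-dimensional embeddings
docs = [
    {"source": "Clinical Assessment Guidelines v4", "embedding": [0.9, 0.1, 0.0]},
    {"source": "Emergency Response Standards", "embedding": [0.7, 0.3, 0.1]},
    {"source": "Assessment Standards Manual", "embedding": [0.1, 0.9, 0.2]},
    {"source": "Unrelated Facilities Handbook", "embedding": [0.0, 0.1, 0.9]},
]
query = [1.0, 0.2, 0.0]

print([d["source"] for d in top_k(query, docs)])
# ['Clinical Assessment Guidelines v4', 'Emergency Response Standards', 'Assessment Standards Manual']
```

An IVFFlat index approximates this exact ranking by probing only the nearest clusters rather than scanning every row.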
Phase 5: Context Augmentation & LLM Generation

```typescript
const context = similarDocs
  .map((doc) => `- ${doc.content} (Source: ${doc.source})`)
  .join('\n');

const prompt = `
  You are a domain expert assistant.
  Use the following reference documents:
  ---
  ${context}
  ---
  To answer: "Why is '${reason}' an important assessment criterion?"
  Provide a concise, evidence-based answer and cite sources.
`;
```

- The prompt is sent to Ollama with `temperature: 0.3` (factual responses) and `max_tokens: 1000`
- Gemma 3:4b generates an evidence-based explanation grounded in the retrieved documents

Example Response:

Elevated systolic pressure is a critical assessment criterion because:

1. Safety threshold: Readings above 180 mmHg indicate a potential emergency requiring immediate evaluation (Clinical Assessment Guidelines v4).
2. Compound risk: When combined with other indicators (e.g., chest discomfort), it signals cardiovascular involvement requiring escalation (Emergency Response Standards).
3. Measurement accuracy: Correct measurement procedure is essential to ensure decisions are reliable (Assessment Standards Manual).
Related Documentation

- Hybrid AI Architecture (RBS + RAG) — Deterministic routing for high-stakes queries
- Rule-Based Decision Engine — IF-THEN rule engine with safety validation and inventory checks