AI Integration - Ollama & RAG

This guide demonstrates integrating Ollama (local LLM) with RAG (Retrieval-Augmented Generation) to provide intelligent, context-aware explanations for complex domain decisions. The system helps operations staff understand why certain criteria are applied to assessments — without any data leaving the on-premise environment.

  • Ollama: Local LLM server running Gemma 3:4b for generating natural language explanations
  • FastAPI Embedding Service: Python microservice generating text embeddings using mxbai-embed-large (1024 dimensions)
  • PostgreSQL pgvector: Vector database for storing and searching knowledge rationales via cosine similarity
  • NestJS Services: Integration layer connecting all components
graph TD
    A[Client Request] --> B[AssessmentsController]
    B --> C[AssessmentExplainersService]
    C --> D[EmbeddingService]
    D --> E[FastAPI Embedding Server<br/>Port 8001]
    E --> F[mxbai-embed-large Model<br/>1024 dimensions]
    F --> E
    E --> D
    D --> C
    C --> G[PostgreSQL + pgvector<br/>Semantic Search]
    G --> H[Top 3 Similar Documents]
    H --> C
    C --> I[LlmService]
    I --> J[Ollama Server<br/>Port 11434]
    J --> K[Gemma 3:4b Model]
    K --> J
    J --> I
    I --> C
    C --> B
    B --> A

Install Ollama:

# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download

Pull the Gemma 3:4b Model:

ollama pull gemma3:4b

Start Ollama Server:

ollama serve
# Server runs on http://localhost:11434

Verify Installation:

curl http://localhost:11434/api/tags

Create Python Virtual Environment:

cd /path/to/embedding-service
python3 -m venv venv
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows

Install Dependencies:

pip install fastapi uvicorn sentence-transformers

Create embedding_server.py:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Load the embedding model (1024 dimensions)
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')

class EmbedRequest(BaseModel):
    text: str

class EmbedResponse(BaseModel):
    embedding: list[float]

@app.post("/embed", response_model=EmbedResponse)
async def create_embedding(request: EmbedRequest):
    try:
        embedding = model.encode(request.text)
        return {"embedding": embedding.tolist()}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": "mxbai-embed-large-v1"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)

Run the Server:

python embedding_server.py
# Server runs on http://localhost:8001

Verify:

curl -X POST http://localhost:8001/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "assessment criteria for resource prioritization"}'
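
If you want to check the response programmatically rather than by eye, a small validator can confirm the payload has the expected shape. This helper is illustrative (`isValidEmbedding` is not part of the codebase); the 1024-dimension expectation comes from the mxbai-embed-large model above.

```typescript
// Hypothetical sanity-check helper for the /embed response: verifies the
// payload carries a 1024-dimensional numeric vector, matching
// mxbai-embed-large's output size.
function isValidEmbedding(payload: unknown, dims = 1024): boolean {
  if (typeof payload !== 'object' || payload === null) return false;
  const embedding = (payload as { embedding?: unknown }).embedding;
  return (
    Array.isArray(embedding) &&
    embedding.length === dims &&
    embedding.every((x) => typeof x === 'number' && Number.isFinite(x))
  );
}
```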

Install pgvector:

-- Connect to your database
\c app_core_db
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Verify installation
SELECT * FROM pg_extension WHERE extname = 'vector';

Create Assessment Rationales Table:

CREATE TABLE assessment_rationales (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  source VARCHAR(255) NOT NULL,
  embedding vector(1024) NOT NULL, -- 1024 dimensions for mxbai-embed-large
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create index for vector similarity search
CREATE INDEX idx_assessment_rationales_embedding
  ON assessment_rationales
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Add the following to your .env file:

# Ollama Configuration
AI_SERVER_URL=http://localhost:11434/v1
AI_API_KEY=ollama # Ollama doesn't require a key, but the SDK needs a non-empty value
# FastAPI Embedding Service
EMBEDDING_SERVICE_URL=http://localhost:8001/embed

Add validation to libs/config/src/config.module.ts:

validationSchema: Joi.object({
  // ... existing variables ...

  // AI Configuration
  AI_SERVER_URL: Joi.string().required(),
  AI_API_KEY: Joi.string().required(),
  EMBEDDING_SERVICE_URL: Joi.string().optional(),
}),

1. LLM Service (llm.service.ts)

Purpose: Connect to Ollama server and generate natural language explanations.

Key Features:

  • Uses OpenAI-compatible SDK (works with Ollama)
  • Configurable model (default: gemma3:4b)
  • Temperature control for response consistency
  • Error handling and structured logging

Example Usage:

import { Injectable } from '@nestjs/common';
import { LlmService } from './services/llm.service';

@Injectable()
export class SomeService {
  constructor(private llmService: LlmService) {}

  async getExplanation(): Promise<string> {
    const prompt = `
      You are a domain expert assistant.
      Explain why elevated systolic pressure is a critical assessment criterion.
    `;
    return this.llmService.generateAnswer(prompt);
  }
}

Configuration Options:

const completion = await this.openai.chat.completions.create({
  model: 'gemma3:4b',  // Ollama model name
  temperature: 0.3,    // 0.0-0.5 for factual responses
  max_tokens: 1000,    // Maximum response length
  messages: [...],
});

2. Embedding Service (embedding.service.ts)


Purpose: Convert text to vector embeddings for semantic search.

Key Features:

  • Calls FastAPI embedding server via Axios
  • Returns 1024-dimensional vectors
  • Error handling for service unavailability

Example Usage:

import { Injectable } from '@nestjs/common';
import { EmbeddingService } from './services/embedding.service';

@Injectable()
export class SomeService {
  constructor(private embeddingService: EmbeddingService) {}

  async convertTextToVector(): Promise<number[]> {
    const text = 'assessment criteria for resource prioritization';
    const vector = await this.embeddingService.getEmbedding(text);
    console.log(vector.length); // 1024
    return vector;
  }
}

3. Assessment Explainers Service (assessment-explainers.service.ts)


Purpose: Implement the RAG pattern to generate context-aware explanations.

RAG Workflow:

  1. Query Enhancement: Transform user question into a better search query
  2. Embedding: Convert query to vector using EmbeddingService
  3. Retrieval: Find top 3 most similar documents from pgvector
  4. Augmentation: Build prompt with retrieved context
  5. Generation: Use LLM to generate the final answer
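
The five steps can be sketched end-to-end with in-memory stubs. Everything here (`fakeEmbed`, `fakeSearch`, `fakeLlm`) is an illustrative placeholder standing in for the real EmbeddingService, pgvector query, and Ollama call — a runnable sketch of the control flow, not the actual implementation:

```typescript
// Minimal sketch of the five RAG steps with in-memory stubs, runnable
// without Ollama, the FastAPI service, or PostgreSQL.
type Doc = { content: string; source: string };

// Stub embedder: any deterministic text -> vector mapping will do here.
const fakeEmbed = (text: string): number[] => [text.length % 7, text.length % 3];

// Stub retrieval: returns canned documents instead of a pgvector search.
const fakeSearch = (_vector: number[], limit: number): Doc[] =>
  [
    { content: 'Readings above 180 mmHg are a critical threshold.', source: 'Guidelines v4' },
    { content: 'Accurate measurement is required for reliable decisions.', source: 'Standards Manual' },
  ].slice(0, limit);

// Stub LLM: reports how many source citations the prompt contains.
const fakeLlm = (prompt: string): string =>
  `answer grounded in ${prompt.split('Source:').length - 1} sources`;

function explain(reason: string): string {
  // 1. Query Enhancement
  const searchQuery = `assessment criteria for ${reason}`;
  // 2. Embedding
  const queryVector = fakeEmbed(searchQuery);
  // 3. Retrieval (top 3)
  const docs = fakeSearch(queryVector, 3);
  // 4. Augmentation: build context from retrieved docs
  const context = docs.map((d) => `- ${d.content} (Source: ${d.source})`).join('\n');
  // 5. Generation
  return fakeLlm(`Use these documents:\n${context}\nWhy is '${reason}' important?`);
}
```

Swapping the stubs for EmbeddingService, the pgvector query, and LlmService yields the real implementation described in this section.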

Example Usage:

import { Injectable } from '@nestjs/common';
import { AssessmentExplainersService } from './services/assessment-explainers.service';

@Injectable()
export class AssessmentsService {
  constructor(private explainerService: AssessmentExplainersService) {}

  async explainCriteria(reason: string): Promise<string> {
    return this.explainerService.explain(reason);
  }
}

RAG Implementation Details:

async explain(reason: string): Promise<string> {
  // 1. Query Enhancement
  const searchQuery = `assessment criteria for ${reason}`;

  // 2. Generate Embedding
  const queryVector = await this.embeddingService.getEmbedding(searchQuery);

  // 3. Semantic Search (cosine distance)
  const similarDocs = await this.assessmentRationaleRepo.query(
    `SELECT content, source, embedding <=> $1 AS distance
     FROM assessment_rationales
     ORDER BY distance
     LIMIT 3`,
    [`[${queryVector.join(',')}]`],
  );

  // 4. Build Context
  const context = similarDocs
    .map((doc) => `- ${doc.content} (Source: ${doc.source})`)
    .join('\n');

  // 5. Generate Prompt
  const prompt = `
    You are a domain expert assistant.
    Use the following reference documents:
    ---
    ${context}
    ---
    To answer the question: "Why is '${reason}' an important assessment criterion?"
    Provide a concise answer and clearly cite the sources.
  `;

  // 6. Generate Answer
  return this.llmService.generateAnswer(prompt);
}
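
One detail worth calling out from step 3: the query vector is passed to pgvector as a bracketed text literal, which PostgreSQL casts to the vector type. A tiny helper (hypothetical, for illustration only) makes that formatting explicit:

```typescript
// Formats a number[] as a pgvector text literal, e.g. "[0.1,0.2,0.3]".
// This mirrors the `[${queryVector.join(',')}]` expression used as the
// query parameter in the semantic-search step.
function toPgVectorLiteral(vector: number[]): string {
  if (vector.length === 0) throw new Error('vector must not be empty');
  return `[${vector.join(',')}]`;
}
```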

File: apps/data-consumer-bc/src/modules/assessment/assessment.module.ts

import { HttpModule } from '@nestjs/axios';
import { Module } from '@nestjs/common';
import { TypeOrmModule } from '@nestjs/typeorm';
import { CommonModule } from '@lib/common';
import { AppDatabases } from '@lib/common/enum/app-databases.enum';
import { DatabaseModule } from '@lib/database';
import { AssessmentRationale } from './entities/assessment-rationale.entity';
import { EmbeddingService } from './services/embedding.service';
import { LlmService } from './services/llm.service';
import { AssessmentExplainersService } from './services/assessment-explainers.service';
@Module({
  imports: [
    HttpModule, // Required for EmbeddingService
    CommonModule,
    DatabaseModule.registerAsync(AppDatabases.APP_CORE),
    TypeOrmModule.forFeature([AssessmentRationale], AppDatabases.APP_CORE),
  ],
  providers: [
    EmbeddingService,
    LlmService,
    AssessmentExplainersService,
  ],
  exports: [
    EmbeddingService,
    LlmService,
    AssessmentExplainersService,
  ],
})
export class AssessmentModule {}

Endpoint: GET /data-consumer-bc/v1/assessments/explain

Query Parameters:

  • reason (required): The criterion to explain (e.g., elevated-systolic-pressure)

Request Example:

curl -X GET "http://localhost:3001/data-consumer-bc/v1/assessments/explain?reason=elevated-systolic-pressure" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response Example:

{
  "status": {
    "code": 200000,
    "message": "Request Succeeded"
  },
  "data": {
    "type": "assessment-explanation",
    "attributes": {
      "reason": "elevated-systolic-pressure",
      "explanation": "Elevated systolic pressure is a critical assessment criterion because...",
      "sources": [
        "Clinical Assessment Guidelines v4",
        "Emergency Response Standards"
      ]
    }
  }
}
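
The sources array in this response can be derived from the documents retrieved during the RAG step. A sketch of that mapping (`uniqueSources` is an illustrative helper, not part of the codebase):

```typescript
// Hypothetical helper: derives the response's "sources" array from the
// retrieved documents, preserving retrieval order and dropping duplicates
// (several chunks may come from the same source document).
function uniqueSources(docs: { source: string }[]): string[] {
  return [...new Set(docs.map((d) => d.source))];
}
```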

sequenceDiagram
    autonumber
    actor Client as Client/Frontend
    participant API as API Gateway
    participant Auth as AuthGuard (NestJS)
    participant Controller as AssessmentsController
    participant Explainer as AssessmentExplainersService
    participant Embed as EmbeddingService
    participant FastAPI as FastAPI Server (Port 8001)
    participant Model as mxbai-embed-large
    participant DB as PostgreSQL (pgvector)
    participant LLM as LlmService
    participant Ollama as Ollama Server (Port 11434)
    participant Gemma as Gemma 3:4b
    participant Transform as TransformInterceptor

    Note over Client,Transform: Phase 1 — Authentication & Routing
    Client->>+API: GET /assessments/explain?reason=elevated-systolic-pressure
    API->>+Auth: Validate JWT & Permissions
    Auth->>Auth: Verify session, check permission: assessment:view
    Auth-->>-API: Auth OK

    Note over API,Controller: Phase 2 — Controller Entry
    API->>+Controller: Forward request with validated user context
    Controller->>+Explainer: explain("elevated-systolic-pressure")

    Note over Explainer,Model: Phase 3 — Query Enhancement & Embedding
    Explainer->>Explainer: Enhance: "assessment criteria for elevated-systolic-pressure"
    Explainer->>+Embed: getEmbedding(searchQuery)
    Embed->>+FastAPI: POST /embed { "text": "assessment criteria for..." }
    FastAPI->>+Model: Encode text to vector
    Model-->>-FastAPI: Return embedding array (1024 dims)
    FastAPI-->>-Embed: { "embedding": [0.234, -0.567, ...] }
    Embed-->>-Explainer: Return vector

    Note over Explainer,DB: Phase 4 — Semantic Search (RAG Retrieval)
    Explainer->>+DB: SELECT content, source, embedding <=> $1 AS distance<br/>FROM assessment_rationales ORDER BY distance LIMIT 3
    DB->>DB: pgvector cosine similarity search (IVFFlat index)
    DB-->>-Explainer: Top 3 similar documents

    Note over Explainer,Gemma: Phase 5 — Augmentation & Generation
    Explainer->>Explainer: Build context from retrieved docs
    Explainer->>Explainer: Construct RAG prompt with context
    Explainer->>+LLM: generateAnswer(prompt)
    LLM->>+Ollama: POST /v1/chat/completions { model: "gemma3:4b", temperature: 0.3 }
    Ollama->>+Gemma: Process prompt with context
    Gemma-->>-Ollama: Generated explanation
    Ollama-->>-LLM: { choices: [{ message: { content: "..." } }] }
    LLM-->>-Explainer: Return explanation string

    Note over Explainer,Transform: Phase 6 — Response Formatting
    Explainer-->>-Controller: Return explanation
    Controller-->>-Transform: Return response data
    Transform->>Transform: Wrap in JSON:API format

    Transform-->>Client: HTTP 200 OK

  • Client sends GET request with reason query parameter and JWT Bearer token
  • API Gateway validates JWT, checks Redis session, verifies assessment:view permission
  • Extracts user context (id, roles, permissions)
const searchQuery = `assessment criteria for ${reason}`;
// Example: "assessment criteria for elevated-systolic-pressure"
  • Transforms user question into an optimized search query by adding domain context
  • EmbeddingService sends HTTP POST to FastAPI server: { "text": "assessment criteria for..." }
  • FastAPI uses mxbai-embed-large Sentence Transformer to encode text into a 1024-dimensional vector
SELECT content, source, embedding <=> $1 AS distance
FROM assessment_rationales
ORDER BY distance
LIMIT 3
  • Uses pgvector’s cosine distance operator <=>
  • PostgreSQL uses IVFFlat or HNSW index — searches clusters instead of all records
  • Returns top 3 most semantically similar documents
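
For intuition about what the <=> operator returns: cosine distance is 1 minus cosine similarity, so 0 means identical direction and larger values mean less similar. A plain TypeScript equivalent (illustrative only — the real computation runs inside PostgreSQL via pgvector):

```typescript
// Cosine distance matching the semantics of pgvector's `<=>` operator:
// 0 = same direction, 1 = orthogonal, 2 = opposite direction.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```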

Example retrieved documents:

[
  {
    "content": "Systolic pressure above 180 mmHg is considered a critical threshold...",
    "source": "Clinical Assessment Guidelines v4",
    "distance": 0.12
  },
  {
    "content": "Clients presenting with chest discomfort alongside elevated pressure...",
    "source": "Emergency Response Standards",
    "distance": 0.18
  },
  {
    "content": "Accurate measurement procedure is required for reliable decision-making...",
    "source": "Assessment Standards Manual",
    "distance": 0.23
  }
]

Phase 5: Context Augmentation & LLM Generation

const context = similarDocs
  .map((doc) => `- ${doc.content} (Source: ${doc.source})`)
  .join('\n');

const prompt = `
  You are a domain expert assistant.
  Use the following reference documents:
  ---
  ${context}
  ---
  To answer: "Why is '${reason}' an important assessment criterion?"
  Provide a concise, evidence-based answer and cite sources.
`;
  • Prompt sent to Ollama with temperature: 0.3 (factual responses) and max_tokens: 1000
  • Gemma 3:4b generates an evidence-based explanation grounded in the retrieved documents

Example Response:

Elevated systolic pressure is a critical assessment criterion because:

1. Safety threshold: Readings above 180 mmHg indicate a potential emergency
   requiring immediate evaluation (Clinical Assessment Guidelines v4).
2. Compound risk: When combined with other indicators (e.g., chest discomfort),
   it signals cardiovascular involvement requiring escalation (Emergency Response Standards).
3. Measurement accuracy: Correct measurement procedure is essential to
   ensure decisions are reliable (Assessment Standards Manual).