
AI Gateway & LLM Orchestration — Request Routing Architecture

Intelligent model routing: request intake → classification → model selection → guardrails → execution → audit · powering all Kredete AI services

Internal consumers
Prael AI agent
Credit agent
Remittance agent
Card agent
Finance advisor
Compliance engine
Fraud detection
Request metadata
Service ID
Priority level
Latency budget (ms)
Cost tier (economy/premium)
Retry policy
Trace ID
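The metadata fields above could travel with each request as a small immutable record. This is a minimal sketch: the class name, field names, and types are assumptions for illustration, and the retry policy is reduced to a single hypothetical `max_retries` count.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class CostTier(str, Enum):
    ECONOMY = "economy"
    PREMIUM = "premium"


@dataclass(frozen=True)
class RequestMetadata:
    """Hypothetical envelope carrying the metadata fields listed in the diagram."""
    service_id: str
    priority: str            # "P0" .. "P3"
    latency_budget_ms: int
    cost_tier: CostTier
    max_retries: int = 2     # assumed shape of the retry policy
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


meta = RequestMetadata("prael", "P0", 3000, CostTier.PREMIUM)
```

Freezing the dataclass keeps the trace ID and budget fixed as the request moves through classification, routing, and audit.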
Rate limits
Prael: 500/min
Credit: 200/min
Others: 100/min
Burst allowance: 2x
Queue overflow: 429
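One way to read the limits above is a token bucket per service: the bucket refills at the sustained per-minute rate and holds up to 2x that rate for bursts, and a request that finds the bucket empty gets a 429. The sketch below assumes that reading. The class and function names are hypothetical, and the real gateway may interpret the burst allowance differently.

```python
SUSTAINED_PER_MIN = {"prael": 500, "credit": 200}
DEFAULT_PER_MIN = 100      # "Others"
BURST_MULTIPLIER = 2       # "Burst allowance: 2x" (assumed to mean bucket capacity)


class TokenBucket:
    def __init__(self, rate_per_min: int, now: float = 0.0) -> None:
        self.rate_per_sec = rate_per_min / 60.0
        self.capacity = float(rate_per_min * BURST_MULTIPLIER)
        self.tokens = self.capacity          # start full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admit(service: str, buckets: dict, now: float) -> int:
    """Return an HTTP-style status: 200 if admitted, 429 if over the limit."""
    if service not in buckets:
        rate = SUSTAINED_PER_MIN.get(service, DEFAULT_PER_MIN)
        buckets[service] = TokenBucket(rate, now)
    return 200 if buckets[service].allow(now) else 429
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would pass `time.monotonic()`.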
SDK / interface
TypeScript SDK
Python SDK
gRPC (internal)
REST (external)
WebSocket (streaming)
Authentication
Service API key validation
JWT verification
Scope/permission check
IP allowlist
mTLS (service mesh)
Request classification
Complexity: simple/medium/complex
Domain: credit/payments/general
Token estimate (input)
Streaming required?
Tool calling needed?
Priority routing
P0: Real-time user (< 3s)
P1: Background (< 30s)
P2: Batch (< 5min)
P3: Async (queue)
Queue depth monitoring
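The priority classes above can be sketched as a single heap ordered by priority rank, with arrival order breaking ties. The deadline values come from the list; the class name and methods are hypothetical, and deadline enforcement itself is left out.

```python
import heapq
import itertools

# Latency deadline per class in seconds; P3 is queued with no deadline.
DEADLINE_SECONDS = {"P0": 3, "P1": 30, "P2": 300, "P3": None}


class PriorityRouter:
    def __init__(self) -> None:
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker: first in, first out

    def submit(self, request_id: str, priority: str) -> None:
        if priority not in DEADLINE_SECONDS:
            raise ValueError(f"unknown priority: {priority}")
        rank = int(priority[1])             # "P0" -> 0, lower rank is served first
        heapq.heappush(self._heap, (rank, next(self._arrival), request_id))

    def next_request(self):
        """Pop the most urgent request, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def depth(self) -> int:
        """Queue depth, the figure a monitor would watch."""
        return len(self._heap)
```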
Semantic cache check
Query embedding
Similarity search (> 0.95)
TTL validation
Cache hit → instant return
Cache miss → proceed
Cache hit rate: 23%
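The cache check above amounts to a nearest-neighbour lookup with a similarity floor and an expiry. The sketch below uses cosine similarity over plain Python lists and a linear scan; the similarity metric, the class name, and the TTL handling are assumptions, and a production system would more likely use a vector index.

```python
import math

SIMILARITY_THRESHOLD = 0.95   # from the diagram: "Similarity search (> 0.95)"


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.entries = []     # (embedding, response, stored_at)

    def put(self, embedding, response: str, now: float) -> None:
        self.entries.append((embedding, response, now))

    def get(self, embedding, now: float):
        """Return the best unexpired match above the threshold, else None."""
        best, best_score = None, SIMILARITY_THRESHOLD
        for cached_embedding, response, stored_at in self.entries:
            if now - stored_at > self.ttl_seconds:
                continue      # TTL validation: skip stale entries
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best, best_score = response, score
        return best
```

A `None` result corresponds to the "Cache miss → proceed" branch.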
Model routing logic
Domain → model affinity
Complexity → model tier
Cost budget → selection
Latency → model speed
Availability → fallback
A/B experiment routing
Available models
Claude 3.5 Sonnet (reasoning)
Claude 3 Haiku (fast)
GPT-4o (multi-modal)
GPT-4o-mini (economy)
Kredete Credit LLM (custom)
Gemini 1.5 Pro (long context)
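The routing criteria and model list above could combine into an ordered preference list that is filtered by availability. The rules below are purely illustrative assumptions (the diagram does not state which domain or complexity maps to which model); only the model names are taken from the list.

```python
def select_model(domain: str, complexity: str, cost_tier: str, healthy: set):
    """Return the first healthy model from an illustrative preference order."""
    preferences = []
    if domain == "credit":
        preferences.append("Kredete Credit LLM")    # assumed domain affinity
    if complexity == "complex":
        preferences.append("Claude 3.5 Sonnet")     # assumed reasoning tier
    if cost_tier == "economy":
        preferences.append("GPT-4o-mini")           # assumed economy tier
    preferences += ["Claude 3 Haiku", "GPT-4o"]     # assumed general defaults
    for model in preferences:
        if model in healthy:                        # availability -> fallback
            return model
    return None


ALL_MODELS = {
    "Claude 3.5 Sonnet", "Claude 3 Haiku", "GPT-4o",
    "GPT-4o-mini", "Kredete Credit LLM", "Gemini 1.5 Pro",
}
```

Keeping the rules as an ordered list makes the fallback behaviour explicit: an unhealthy first choice simply falls through to the next entry.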
Prompt library
System prompts (versioned)
Few-shot examples
Chain-of-thought templates
Tool definitions (JSON)
Output schemas (structured)
Prompt versioning (GitOps)
Context assembly
RAG context injection
User profile context
Conversation history
Tool results (prior)
Token budget packing
Context window: 128K max
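Token budget packing can be sketched as a greedy fill: take context sections in priority order and keep each one that still fits in the window. The priority numbers and the four-characters-per-token estimate are rough assumptions; a real gateway would count tokens with the target model's tokenizer.

```python
CONTEXT_WINDOW_TOKENS = 128_000   # "Context window: 128K max"


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token), assumed for illustration.
    return max(1, len(text) // 4)


def pack_context(sections, budget: int = CONTEXT_WINDOW_TOKENS):
    """sections: list of (name, text, priority); lower priority number wins.

    Returns (names kept in priority order, tokens used).
    """
    remaining = budget
    kept = []
    for name, text, _priority in sorted(sections, key=lambda s: s[2]):
        cost = estimate_tokens(text)
        if cost <= remaining:
            kept.append(name)
            remaining -= cost
    return kept, budget - remaining


sections = [
    ("history", "x" * 400, 2),   # conversation history
    ("rag", "y" * 800, 1),       # RAG context
    ("profile", "z" * 40, 0),    # user profile
]
```

With a budget of 250 tokens, the profile and RAG sections fit and the history is dropped.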
Input guardrails
Prompt injection detection
Jailbreak attempt block
PII stripping
Topic boundary check
Input length validation
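Two of the input checks above, length validation and PII stripping, lend themselves to a short sketch. The patterns cover only email addresses and card-like digit runs, and the length limit is a hypothetical value; injection and jailbreak detection are out of scope here.

```python
import re

MAX_INPUT_CHARS = 8000   # hypothetical limit, not stated in the diagram

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def strip_pii(text: str) -> str:
    """Replace emails and card-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)


def check_input(text: str):
    """Return (ok, cleaned_text_or_reason)."""
    if not text.strip():
        return False, "empty input"
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    return True, strip_pii(text)
```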
Output guardrails
Hallucination detection
Toxicity filter
PII in response check
Financial advice disclaimer
Brand voice compliance
Factuality checks
Ground truth validation
Citation verification
Numerical accuracy
Date/time validation
Rate/fee verification
Regulatory compliance
No investment advice
Fair lending language
Equal Credit Opportunity
UDAAP compliance
Disclosure insertion
Guardrail metrics
Block rate: 0.8%
Hallucination: 0.3%
PII leak: 0.01%
Latency add: +45ms
Execution engine
Request dispatch
Streaming handler
Token-by-token relay
Timeout enforcement
Retry with backoff
Circuit breaker (open/closed)
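Retry with backoff and the circuit breaker can be sketched together. The breaker below has only the two states named in the diagram and closes again after a cooldown; the failure threshold, cooldown, and backoff parameters are assumed values.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def state(self, now: float) -> str:
        # After the cooldown the breaker closes and the failure count resets.
        if self.opened_at is not None and now - self.opened_at >= self.cooldown_s:
            self.opened_at = None
            self.failures = 0
        return "open" if self.opened_at is not None else "closed"

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now   # trip: reject calls until the cooldown passes


def backoff_delays(retries: int, base_s: float = 0.5, cap_s: float = 8.0):
    """Exponential backoff schedule in seconds, capped at cap_s (no jitter)."""
    return [min(cap_s, base_s * 2 ** attempt) for attempt in range(retries)]
```

A dispatcher would check `state()` before each call, skip the provider while it is open, and sleep for the next delay between retries.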
Tool calling loop
Tool call detection
Parameter validation
Authorization check
Tool execution (sandbox)
Result injection
Multi-step orchestration
Fallback chain
Primary model timeout
→ Fallback model 1
→ Fallback model 2
→ Graceful degradation
→ Cached response
→ Human escalation
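The chain above can be sketched as a loop over model handlers that moves on when one times out, then falls back to a cached response, then to human escalation. The handler signature and the result shape are assumptions, and the graceful-degradation step is omitted because the diagram does not say what it does.

```python
def run_with_fallback(prompt: str, handlers, cached=None) -> dict:
    """handlers: ordered list of (name, callable); each callable takes the prompt.

    A handler signals a timeout by raising TimeoutError.
    """
    for name, call in handlers:
        try:
            return {"source": name, "text": call(prompt)}
        except TimeoutError:
            continue                      # try the next model in the chain
    if cached is not None:
        return {"source": "cache", "text": cached}
    return {"source": "human_escalation", "text": None}


# Hypothetical stand-ins for real model clients.
def always_times_out(prompt: str) -> str:
    raise TimeoutError("primary model timed out")


def echo_model(prompt: str) -> str:
    return "answer: " + prompt
```

For example, `run_with_fallback("hi", [("primary", always_times_out), ("fallback-1", echo_model)])` is served by the second handler.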
Response processing
Structured output parse
JSON schema validation
Token count tracking
Latency measurement
Quality score (auto)
Cache population
Audit database
Request/response pairs
Model used + version
Token counts (in/out)
Latency breakdown
Guardrail decisions
Immutable (append-only)
FinOps cost tracking
Cost per request
Cost per service
Daily/monthly budgets
Anomaly alerting
Optimization suggestions
Chargeback to teams
Quality monitoring
User satisfaction (CSAT)
Task completion rate
Hallucination rate (daily)
Model drift detection
A/B test results
Regression alerting
Reporting
Executive dashboard
Service-level breakdown
Model comparison report
Cost forecast (30d)
Capacity planning
Regulatory report (monthly)
Model health
Claude 3.5: Healthy
GPT-4o: Healthy
Kredete LLM: Healthy
Gemini: Degraded
Token budget (daily)
Used: 4.2M
Budget: 8M
Spend: $420
Auto-throttle at 90%
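As a worked check of the figures above: 4.2M of 8M tokens is 52.5% utilisation, and a 90% threshold on an 8M budget triggers at 7.2M tokens, leaving 3.0M before throttling. The function below is a hypothetical sketch of that arithmetic, using integer math for the threshold.

```python
def budget_status(used_tokens: int, budget_tokens: int, throttle_pct: int = 90) -> dict:
    throttle_at = budget_tokens * throttle_pct // 100
    return {
        "utilisation_pct": 100 * used_tokens / budget_tokens,
        "throttle": used_tokens >= throttle_at,
        "tokens_until_throttle": max(0, throttle_at - used_tokens),
    }


status = budget_status(4_200_000, 8_000_000)
```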
Routing weights
Claude: 45%
GPT-4o: 30%
Kredete: 15%
Other: 10%
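If the routing weights are applied as a traffic split, one simple approach is weighted random selection per request. This sketch assumes that interpretation; the function name is hypothetical, and the weights are the percentages shown above.

```python
import random

ROUTING_WEIGHTS = {"Claude": 45, "GPT-4o": 30, "Kredete": 15, "Other": 10}


def pick_model_family(rng: random.Random) -> str:
    """Draw one model family with probability proportional to its weight."""
    families = list(ROUTING_WEIGHTS)
    weights = list(ROUTING_WEIGHTS.values())
    return rng.choices(families, weights=weights, k=1)[0]


rng = random.Random(0)          # seeded only to make the sketch repeatable
sample = [pick_model_family(rng) for _ in range(2000)]
```

Over many requests the observed shares converge on the configured weights; any single draw is random.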
Latency targets
p50: 1.2s
p95: 3.8s
p99: 8.1s
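To compare measured latencies against these targets, percentiles can be computed with the nearest-rank method. The sketch treats each target as an upper bound, which is an assumption, and the function names are hypothetical.

```python
import math

LATENCY_TARGETS_S = {50: 1.2, 95: 3.8, 99: 8.1}


def percentile(samples, p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def breaches(samples) -> dict:
    """Map each breached percentile to its observed value."""
    observed = {p: percentile(samples, p) for p in LATENCY_TARGETS_S}
    return {p: v for p, v in observed.items() if v > LATENCY_TARGETS_S[p]}
```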
Feature flags
Streaming: ON
Cache: ON
A/B test: ON
New model: Canary