---
title: AI Inference Architecture for Healthcare
emoji: 🧠
colorFrom: indigo
colorTo: green
sdk: static
sdk_version: "1.0.0"
app_file: index.html
pinned: true
tags: [healthcare, ai-inference, mlops, kubernetes, triton-inference-server, fastapi, hipaa-compliance, deep-learning, cloud-architecture, monitoring]
description: Scalable, production-ready AI inference architecture for healthcare and pharma using Triton, FastAPI, and Kubernetes.
---
# AI Inference Architecture for Healthcare

This project provides a scalable, production-ready AI inference architecture designed for healthcare and pharmaceutical applications. It integrates Triton Inference Server, FastAPI, and Kubernetes to support high-throughput model inference.

## 🚀 Key Features

- Modular container-based architecture with FastAPI gateway
- Supports NLP and CV models with optional preprocessing
- Inference via Triton Inference Server using ONNX or TorchScript models
- GitHub Actions-powered CI/CD pipeline to auto-deploy model updates
- Kubernetes-based pod management, autoscaling, and volume mounting
- Full observability stack: Prometheus + Grafana for metrics and monitoring
- Compliant with HIPAA-aligned standards: secure APIs, logging, encryption

## 🧱 Architecture Overview

```
Healthcare/Pharma Clients → FastAPI Gateway → Optional Preprocessor → Triton Pod
       ↓                        ↓                            ↓             ↓
 Model Registry ← GitHub CI/CD Pipeline ← Kubernetes ← Monitoring (Prometheus + Grafana)
```

## ⚙️ Deployment Options

### ▶️ Local (Docker Compose)
```bash
docker compose up --build
```

### ☸️ Kubernetes (Production)
```bash
kubectl apply -f k8s.yaml
kubectl apply -f preprocessor.yaml
kubectl apply -f hpa.yaml
```

## 📦 Model Lifecycle

1. Train model locally or in pipeline (e.g., PyTorch/ONNX)
2. Push model to GitHub repository
3. GitHub Actions CI/CD triggers and pushes model to Model Registry
4. Kubernetes mounts model volume into Triton pod
5. Triton automatically reloads model
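
The publish step of the lifecycle above can be sketched as a small helper that drops an artifact into the Triton-style `/models/<name>/<version>` layout. This is a minimal illustration, not the project's actual CI/CD code; `publish_model` and its arguments are hypothetical names, and the real pipeline is a GitHub Actions workflow.

```python
from pathlib import Path
import shutil

def publish_model(artifact: Path, registry: Path, name: str) -> Path:
    """Copy a trained model artifact into the registry under
    <registry>/<name>/<version>/, picking the next numeric version.
    Illustrative sketch only; the real step runs in CI/CD."""
    model_dir = registry / name
    model_dir.mkdir(parents=True, exist_ok=True)
    # Next version = highest existing numeric directory + 1
    versions = [int(p.name) for p in model_dir.iterdir()
                if p.is_dir() and p.name.isdigit()]
    next_version = max(versions, default=0) + 1
    target = model_dir / str(next_version)
    target.mkdir()
    shutil.copy2(artifact, target / artifact.name)
    return target
```

Because versions are plain numbered directories, Triton's model repository polling can pick up the new version without a pod restart.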

## 🔍 Monitoring and Observability

- Metrics via a Prometheus sidecar scraping port 8002 on the Triton pod
- Grafana dashboards track latency, throughput, and failures
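
Triton exposes metrics in the Prometheus text exposition format on port 8002. A minimal parser for that format looks like the sketch below; it ignores `# HELP`/`# TYPE` lines and label-matching subtleties, and is meant for ad-hoc inspection, not as a replacement for the Prometheus scraper.

```python
def parse_prometheus_metrics(text: str) -> dict:
    """Parse Prometheus text exposition output into {metric_line: value}.
    Minimal sketch: comment lines are skipped, labels stay in the key."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # skip malformed lines
    return metrics
```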

## 🧪 Sample Inference Request
```bash
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"input": "Patient data or image here"}'
```

## Enhancements Based on Peer Technical Review

### Preprocessing Execution Model
The NLP/CV preprocessing stage runs as an **independent Kubernetes microservice** for isolation and scale. The FastAPI Gateway performs **conditional routing**:

- `content_type=image/*` → CV preprocessor → Triton
- `content_type=text/*` → NLP preprocessor → Triton
- Already-normalized inputs → direct to Triton

A lightweight schema-validation step remains in the gateway.
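
The routing rules above reduce to a small dispatch function. The backend names here (`cv-preprocessor`, `nlp-preprocessor`, `triton`) are illustrative placeholders for the actual Kubernetes Service names, and the real gateway would make this decision inside a FastAPI request handler.

```python
def route_request(content_type: str) -> str:
    """Pick the backend for a request based on its Content-Type,
    mirroring the gateway's conditional routing (backend names assumed)."""
    if content_type.startswith("image/"):
        return "cv-preprocessor"   # CV preprocessor -> Triton
    if content_type.startswith("text/"):
        return "nlp-preprocessor"  # NLP preprocessor -> Triton
    # Already-normalized payloads (e.g. JSON tensors) go straight to Triton
    return "triton"
```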

### Model Lifecycle: Versioning, Promotion, Rollback
- Models are versioned under `/models/<name>/<version>` (e.g., `/models/ner/1`).
- CI/CD publishes to **staging**; promotion updates a **release tag** (e.g., `current -> 2`) for Triton to hot-reload.
- **Rollback** re-points the tag to the last known-good (`current -> 1`).
- Supports **blue-green** (two deployments, Service selector switch) and **canary** (a small percentage of traffic routed to a second Triton deployment).
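
Promotion and rollback both amount to re-pointing the release tag. One way to model the tag is a `current` symlink inside the model directory, as sketched below; this is an assumption for illustration, and Triton's actual reload is driven by its model-control/polling mode, not shown here.

```python
from pathlib import Path

def promote(model_dir: Path, version: int) -> None:
    """Re-point the 'current' release tag at a version directory.
    Promotion and rollback are the same operation with different targets.
    Sketch assumes a symlink-based tag."""
    target = model_dir / str(version)
    if not target.is_dir():
        raise FileNotFoundError(f"unknown version: {version}")
    tag = model_dir / "current"
    if tag.is_symlink() or tag.exists():
        tag.unlink()
    tag.symlink_to(target, target_is_directory=True)
```

Rollback is then simply `promote(model_dir, last_known_good)`.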

### Scalability & Resilience
- **HPA** scales Triton pods based on CPU (and can extend to custom latency metrics).
- **Readiness/liveness probes** guard rollouts and enable auto-healing.
- The gateway uses timeouts and retries on transient 5xx errors; if a pod is unready, traffic shifts to healthy pods.
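
The gateway's retry policy can be sketched as a generic wrapper with exponential backoff. `TransientError` stands in for an HTTP 5xx response; in the real gateway this would wrap an HTTP client call with a timeout, which is omitted here.

```python
import time

class TransientError(Exception):
    """Stand-in for a transient 5xx from an unhealthy backend pod."""

def call_with_retry(fn, retries: int = 3, base_delay: float = 0.1):
    """Invoke fn, retrying on TransientError with exponential backoff.
    Illustrative sketch of the gateway's retry-on-5xx behavior."""
    for attempt in range(retries):
        try:
            return fn()
        except TransientError:
            if attempt == retries - 1:
                raise  # retries exhausted; surface the failure
            time.sleep(base_delay * (2 ** attempt))
```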

### Security, Compliance & Audit
- **TLS in transit**; optional mTLS inside the cluster.
- **OAuth2/JWT** at the gateway with per-route scopes.
- **Audit logs** (structured JSON with `request_id`) across the gateway, preprocessors, and Triton; logs ship to ELK/Loki.
- Optional **PHI de-identification** in preprocessors; strict schema validation; data minimization and retention controls aligned to HIPAA/GDPR.
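
A structured JSON audit record with a shared `request_id` is what lets ELK/Loki correlate one request across the gateway, preprocessors, and Triton. A minimal sketch of the record shape (field names here are assumptions, not a fixed schema):

```python
import json
import sys

def audit_log(event: str, request_id: str, **fields) -> str:
    """Emit one structured JSON audit line keyed by request_id.
    Sketch only; a real deployment ships these lines to ELK/Loki
    via a logging agent rather than printing to stderr."""
    record = {"event": event, "request_id": request_id, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line, file=sys.stderr)
    return line
```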

### Data Flow & Validation
- Gateway enforces **MIME/JSON schema** and rejects malformed/unauthorized requests.
- Preprocessors normalize inputs (e.g., tokenize text, resize/normalize images).
- Triton returns prediction JSON; gateway maps to a domain response schema and may **redact** fields per policy.
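
The validate-then-redact flow above can be illustrated with a toy schema check; a production gateway would use full JSON Schema validation instead, and the field names here are hypothetical.

```python
def validate_and_redact(payload: dict, schema: dict, redact: set) -> dict:
    """Reject payloads missing required fields or with wrong types,
    then drop fields named in the redaction policy. Toy sketch;
    production would use JSON Schema at the gateway."""
    for field, ftype in schema.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        if not isinstance(payload[field], ftype):
            raise TypeError(f"bad type for {field}")
    return {k: v for k, v in payload.items() if k not in redact}
```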

📌 See [SECURITY.md](./SECURITY.md) for detailed security, compliance, and audit logging implementation.

📄 See `preprocessor.yaml` for deployment details of the NLP/CV preprocessing microservice.

📄 See `hpa.yaml` for Triton autoscaling configuration.

## 📂 File Reference

- `k8s.yaml` → Triton deployment
- `preprocessor.yaml` → NLP/CV preprocessing service
- `hpa.yaml` → Horizontal Pod Autoscaler for Triton