---
title: AI Inference Architecture for Healthcare
emoji: 🧠
colorFrom: indigo
colorTo: green
sdk: static
sdk_version: "1.0.0"
app_file: index.html
pinned: true
tags: [healthcare, ai-inference, mlops, kubernetes, triton-inference-server, fastapi, hipaa-compliance, deep-learning, cloud-architecture, monitoring]
description: Scalable, production-ready AI inference architecture for healthcare and pharma using Triton, FastAPI, and Kubernetes.
---

# AI Inference Architecture for Healthcare

This project provides a scalable, production-ready AI inference architecture designed for healthcare and pharmaceutical applications. It integrates Triton Inference Server, FastAPI, and Kubernetes to support high-throughput model inference.

## 🚀 Key Features

- Modular container-based architecture with a FastAPI gateway
- Supports NLP and CV models with optional preprocessing
- Inference via Triton Inference Server using ONNX or TorchScript models
- GitHub Actions-powered CI/CD pipeline that auto-deploys model updates
- Kubernetes-based pod management, autoscaling, and volume mounting
- Full observability stack: Prometheus + Grafana for metrics and monitoring
- Aligned with HIPAA requirements: secure APIs, audit logging, encryption

## 🧱 Architecture Overview

```
Healthcare/Pharma Clients → FastAPI Gateway → Optional Preprocessor → Triton Pod
            ↓                      ↓                   ↓                   ↓
     Model Registry  ←  GitHub CI/CD Pipeline  ←  Kubernetes  ←  Monitoring (Prometheus + Grafana)
```

## ⚙️ Deployment Options

### ▶️ Local (Docker Compose)

```bash
docker compose up --build
```

### ☸️ Kubernetes (Production)

```bash
kubectl apply -f k8s.yaml
kubectl apply -f preprocessor.yaml
kubectl apply -f hpa.yaml
```

## 📦 Model Lifecycle

1. Train the model locally or in a pipeline (e.g., PyTorch/ONNX)
2. Push the model to the GitHub repository
3. GitHub Actions CI/CD triggers and pushes the model to the Model Registry
4. Kubernetes mounts the model volume into the Triton pod
5. Triton automatically reloads the model

## 🔍 Monitoring and Observability

- Metrics via a Prometheus sidecar scraping port 8002 on the Triton pod
- Grafana dashboards track latency, throughput, and failures

## 🧪 Sample Inference Request

```bash
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"input": "Patient data or image here"}'
```

## Enhancements Based on Peer Technical Review

### Preprocessing Execution Model

The NLP/CV preprocessing stage runs as an **independent Kubernetes microservice** for isolation and scale. The FastAPI Gateway performs **conditional routing**:

- `content_type=image/*` → CV preprocessor → Triton
- `content_type=text/*` → NLP preprocessor → Triton
- Already-normalized inputs → direct to Triton

A lightweight schema-validation step remains in the gateway.

### Model Lifecycle: Versioning, Promotion, Rollback

- Models are versioned under `/models/<model_name>/<version>` (e.g., `/models/ner/1`).
- CI/CD publishes to **staging**; promotion updates a **release tag** (e.g., `current -> 2`) that Triton hot-reloads.
- **Rollback** re-points the tag to the last known-good version (`current -> 1`).
- Supports **blue-green** deployments (two deployments, Service selector switch) and **canary** releases (a small percentage of traffic routed to a second Triton deployment).

### Scalability & Resilience

- **HPA** scales Triton pods based on CPU (and can be extended with custom latency metrics).
- **Readiness/liveness probes** guard rollouts and enable auto-healing.
- The gateway uses timeouts and retries on transient 5xx responses. If a pod is Unready, traffic shifts to healthy pods.

### Security, Compliance & Audit

- **TLS in transit**; optional mTLS inside the cluster.
- **OAuth2/JWT** at the gateway with per-route scopes.
- **Audit logs** (structured JSON with a `request_id`) across the gateway, preprocessors, and Triton; logs ship to ELK/Loki.
- Optional **PHI de-identification** in the preprocessors; strict schema validation; data minimization and retention controls aligned with HIPAA/GDPR.
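The gateway's conditional routing can be sketched as a small, framework-agnostic decision function (a sketch only — the service names are illustrative, not taken from this repo):

```python
def route_for(content_type: str, normalized: bool = False) -> str:
    """Pick the downstream service for a request, following the gateway rules."""
    if normalized:
        return "triton"            # already-normalized inputs go straight to Triton
    if content_type.startswith("image/"):
        return "cv-preprocessor"   # images are resized/normalized first
    if content_type.startswith("text/"):
        return "nlp-preprocessor"  # text is tokenized first
    raise ValueError(f"unsupported content type: {content_type}")
```

In the actual FastAPI gateway this decision would run inside the `/infer` handler, which then forwards the request body to the chosen service.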
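The tag-based promotion and rollback flow can be sketched as an atomic symlink swap on the model volume (a minimal sketch, assuming a POSIX-filesystem model registry; Triton's model-management API could serve the same purpose):

```python
from pathlib import Path

def set_current(registry: Path, model: str, version: int) -> None:
    """Re-point <registry>/<model>/current at the given version, atomically.

    Promotion and rollback are the same operation: promotion points the tag
    at a newer version, rollback points it back at the last known-good one.
    """
    model_dir = registry / model
    if not (model_dir / str(version)).is_dir():
        raise FileNotFoundError(f"no such version: {model_dir / str(version)}")
    tmp = model_dir / "current.tmp"
    if tmp.is_symlink():
        tmp.unlink()
    tmp.symlink_to(str(version))        # relative link inside the model dir
    tmp.replace(model_dir / "current")  # atomic rename on POSIX
```

For example, `set_current(Path("/models"), "ner", 2)` promotes version 2, and `set_current(Path("/models"), "ner", 1)` rolls back to version 1.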
### Data Flow & Validation

- The gateway enforces **MIME/JSON schema** checks and rejects malformed or unauthorized requests.
- Preprocessors normalize inputs (e.g., tokenize text, resize/normalize images).
- Triton returns prediction JSON; the gateway maps it to a domain response schema and may **redact** fields per policy.

📌 See [SECURITY.md](./SECURITY.md) for detailed security, compliance, and audit logging implementation.
📄 See `preprocessor.yaml` for deployment details of the NLP/CV preprocessing microservice.
📄 See `hpa.yaml` for the Triton autoscaling configuration.

## 📂 File Reference

- `k8s.yaml` → Triton deployment
- `preprocessor.yaml` → NLP/CV preprocessing service
- `hpa.yaml` → Horizontal Pod Autoscaler for Triton
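The gateway-side schema validation described under Data Flow & Validation can be sketched as a plain-Python check (field names follow the sample inference request; the size limit is an illustrative assumption, not from this repo):

```python
MAX_INPUT_BYTES = 1_000_000  # illustrative limit, not from this repo

def validate_infer_request(payload: object) -> dict:
    """Reject malformed requests at the gateway before they reach Triton."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    unknown = set(payload) - {"input"}
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    data = payload.get("input")
    if not isinstance(data, str) or not data.strip():
        raise ValueError("'input' must be a non-empty string")
    if len(data.encode("utf-8")) > MAX_INPUT_BYTES:
        raise ValueError("'input' exceeds the size limit")
    return {"input": data.strip()}
```

Rejecting unknown fields and empty inputs at the gateway keeps malformed payloads (and potential PHI in unexpected fields) from ever reaching the preprocessors or Triton.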