---
title: AI Inference Architecture for Healthcare
emoji: 🧠
colorFrom: indigo
colorTo: green
sdk: static
sdk_version: 1.0.0
app_file: index.html
pinned: true
tags:
  - healthcare
  - ai-inference
  - mlops
  - kubernetes
  - triton-inference-server
  - fastapi
  - hipaa-compliance
  - deep-learning
  - cloud-architecture
  - monitoring
description: >-
  Scalable, production-ready AI inference architecture for healthcare and pharma
  using Triton, FastAPI, and Kubernetes.
---

AI Inference Architecture for Healthcare

This project provides a scalable, production-ready AI inference architecture designed for healthcare and pharmaceutical applications. It integrates Triton Inference Server, FastAPI, and Kubernetes to support high-throughput model inference.

🚀 Key Features

  • Modular container-based architecture with FastAPI gateway
  • Supports NLP and CV models with optional preprocessing
  • Inference via Triton Inference Server using ONNX or TorchScript models
  • GitHub Actions-powered CI/CD pipeline to auto-deploy model updates
  • Kubernetes-based pod management, autoscaling, and volume mounting
  • Full observability stack: Prometheus + Grafana for metrics and monitoring
  • Compliant with HIPAA-aligned standards: secure APIs, logging, encryption

🧱 Architecture Overview

Healthcare/Pharma Clients → FastAPI Gateway → Optional Preprocessor → Triton Pod
       ↓                        ↓                            ↓             ↓
 Model Registry ← GitHub CI/CD Pipeline ← Kubernetes ← Monitoring (Prometheus + Grafana)

⚙️ Deployment Options

▶️ Local (Docker Compose)

docker compose up --build
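
A minimal `docker-compose.yml` sketch of what this command would bring up; service names, build paths, and the Triton image tag are illustrative, and the actual compose file in this repo may differ:

```yaml
services:
  gateway:
    build: ./gateway              # FastAPI gateway (hypothetical path)
    ports:
      - "8000:8000"
    depends_on:
      - triton
  triton:
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # tag is illustrative
    command: tritonserver --model-repository=/models
    volumes:
      - ./models:/models          # ONNX/TorchScript model repository
    ports:
      - "8001:8001"               # gRPC inference
      - "8002:8002"               # metrics endpoint scraped by Prometheus
```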

☸️ Kubernetes (Production)

kubectl apply -f k8s.yaml
kubectl apply -f preprocessor.yaml
kubectl apply -f hpa.yaml
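
A sketch of what `hpa.yaml` might contain, assuming the Triton Deployment in `k8s.yaml` is named `triton`; replica counts and the CPU target are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton            # must match the Deployment name in k8s.yaml
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```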

📦 Model Lifecycle

  1. Train model locally or in pipeline (e.g., PyTorch/ONNX)
  2. Push model to GitHub repository
  3. GitHub Actions CI/CD triggers and pushes model to Model Registry
  4. Kubernetes mounts model volume into Triton pod
  5. Triton automatically reloads model
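
The steps above could be driven by a GitHub Actions workflow along these lines; the workflow name, script paths, and trigger are hypothetical, not the repo's actual pipeline:

```yaml
name: publish-model
on:
  push:
    paths:
      - "models/**"           # trigger only on model changes
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Export to ONNX
        run: python scripts/export_onnx.py      # hypothetical export script
      - name: Push to model registry
        # hypothetical script; copies artifacts into /models/<name>/<version>
        run: ./scripts/push_to_registry.sh
```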

🔍 Monitoring and Observability

  • Metrics via Prometheus sidecar scraping port 8002 on Triton pod
  • Dashboards in Grafana track latency, throughput, failures
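
A minimal Prometheus scrape-config sketch for the setup above (Triton serves its metrics on port 8002 at `/metrics`; the `triton` target name assumes a Service of that name):

```yaml
scrape_configs:
  - job_name: triton
    scrape_interval: 15s
    static_configs:
      - targets: ["triton:8002"]   # Triton metrics endpoint
```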

🧪 Sample Inference Request

curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"input": "Patient data or image here"}'

Enhancements Based on Peer Technical Review

Preprocessing Execution Model

The NLP/CV preprocessing stage runs as an independent Kubernetes microservice for isolation and independent scaling. The FastAPI Gateway performs conditional routing based on content type:

  • content_type=image/* → CV preprocessor → Triton
  • content_type=text/* → NLP preprocessor → Triton
  • Already-normalized inputs → direct to Triton

A lightweight schema-validation step remains in the gateway.
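
The routing rules above can be sketched as a pure function; the preprocessor service names are illustrative:

```python
def choose_route(content_type: str, normalized: bool = False) -> str:
    """Pick the downstream service for a request, mirroring the gateway rules.

    Returns the next hop: 'cv-preprocessor', 'nlp-preprocessor', or
    'triton' for inputs that are already normalized.
    """
    if normalized:
        return "triton"                 # skip preprocessing entirely
    if content_type.startswith("image/"):
        return "cv-preprocessor"        # resize/normalize before inference
    if content_type.startswith("text/"):
        return "nlp-preprocessor"       # tokenize before inference
    raise ValueError(f"unsupported content type: {content_type}")
```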

Model Lifecycle: Versioning, Promotion, Rollback

  • Models are versioned under /models/<name>/<version> (e.g., /models/ner/1).
  • CI/CD publishes to staging; promotion updates a release tag (e.g., current -> 2) for Triton to hot-reload.
  • Rollback re-points the tag to the last known-good (current -> 1).
  • Supports blue‑green (two deployments, Service selector switch) and canary (small % routed to a second Triton deployment).
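
The promote/rollback mechanics can be sketched with an in-memory stand-in for the release tag; the real mechanism (a tag file, symlink, or registry API) is deployment-specific:

```python
class ReleaseTags:
    """Track which model version the 'current' tag points to, per model."""

    def __init__(self):
        # model name -> history of promoted versions (last entry = current)
        self._tags: dict = {}

    def promote(self, model: str, version: int) -> None:
        self._tags.setdefault(model, []).append(version)

    def current(self, model: str) -> int:
        return self._tags[model][-1]

    def rollback(self, model: str) -> int:
        history = self._tags[model]
        if len(history) < 2:
            raise RuntimeError("no previous version to roll back to")
        history.pop()          # re-point the tag to the last known-good version
        return history[-1]
```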

Scalability & Resilience

  • HPA scales Triton pods based on CPU utilization (and can be extended to custom latency metrics).
  • Readiness/Liveness probes guard rollout and enable auto‑healing.
  • The gateway applies request timeouts and retries transient 5xx errors. If a pod becomes unready, traffic shifts to healthy pods.
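
A minimal sketch of the gateway's retry behavior; `TransientError` stands in for an HTTP 5xx from Triton, and the real gateway would use an async HTTP client with per-request timeouts:

```python
import time

class TransientError(Exception):
    """Stand-in for a transient 5xx response from a downstream service."""

def call_with_retries(fn, retries: int = 3, backoff_s: float = 0.1):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except TransientError:
            if attempt == retries - 1:
                raise                    # exhausted retries, surface the error
            time.sleep(backoff_s * 2 ** attempt)
```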

Security, Compliance & Audit

  • TLS in transit; optional mTLS inside cluster.
  • OAuth2/JWT at the gateway with per‑route scopes.
  • Audit logs (structured JSON with request_id) across gateway, preprocessors, and Triton; logs ship to ELK/Loki.
  • Optional PHI de‑identification in preprocessors; strict schema validation; data minimization and retention controls aligned to HIPAA/GDPR.
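
A sketch of one structured audit-log line with a shared `request_id`, as described above; the field names are illustrative, and SECURITY.md defines the actual log schema:

```python
import json
import uuid

def audit_record(service: str, route: str, status: int, request_id=None) -> str:
    """Build one JSON audit-log line tagged with a request_id.

    The same request_id is reused across gateway, preprocessors, and Triton
    so a single request can be traced end to end in ELK/Loki.
    """
    return json.dumps({
        "request_id": request_id or str(uuid.uuid4()),
        "service": service,      # gateway / preprocessor / triton
        "route": route,
        "status": status,
    })
```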

Data Flow & Validation

  • Gateway enforces MIME/JSON schema and rejects malformed/unauthorized requests.
  • Preprocessors normalize inputs (e.g., tokenize text, resize/normalize images).
  • Triton returns prediction JSON; gateway maps to a domain response schema and may redact fields per policy.
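
The gateway's schema check can be sketched as a hand-rolled validator; in practice a FastAPI gateway would more likely use Pydantic models (FastAPI's native validation) or JSON Schema, and the `input` field name mirrors the sample request above:

```python
def validate_infer_request(payload) -> None:
    """Reject malformed /infer payloads before they reach a preprocessor."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    if "input" not in payload:
        raise ValueError("missing required field: input")
    if not isinstance(payload["input"], str) or not payload["input"]:
        raise ValueError("field 'input' must be a non-empty string")
```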

📌 See SECURITY.md for detailed security, compliance, and audit logging implementation.

📄 See preprocessor.yaml for deployment details of the NLP/CV preprocessing microservice.

📄 See hpa.yaml for Triton autoscaling configuration.

📂 File Reference

  • k8s.yaml → Triton deployment
  • preprocessor.yaml → NLP/CV preprocessing service
  • hpa.yaml → Horizontal Pod Autoscaler for Triton