Model Card: Redact-V1 PII Detection Model

This model automatically detects and redacts personally identifiable information (PII) in text. It uses a deep learning architecture implemented in TensorFlow and fine-tuned on a curated dataset.

Overview

The Redact-V1 model is built for PII detection, with applications in data redaction and privacy preservation. It was trained and evaluated on the Redact-V1 dataset; the recorded metrics (loss, accuracy, precision, and recall) are described under Model Details.

Model Details

Model File: final_model.h5
Labels: labels.json

The training performance indicators (loss, accuracy, precision, and recall) have been recorded and can be found in the training performance file. Visualizations of model performance, including confusion matrices and training history, are available in the images folder.
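For readers comparing the recorded precision and recall against a confusion matrix, the sketch below shows the conventional per-class definitions. It is an illustration only: the function name and the counts are hypothetical, and the card does not state how the training pipeline averaged these metrics across classes.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall for one class from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for a single class, not taken from the model's results.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f}")
```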

Supported Classes

The model supports the following PII classes:

People Name
Card Number
Account Number
Social Security Number
Government ID Number
Date of Birth
Password
Tax ID Number
Phone Number
Residential Address
Email Address
IP Number
Passport
Driver License
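A quick sanity check can confirm that a loaded label list covers every class named above. The sketch below assumes the labels are a flat list of strings spelled like the class names in this card; the actual layout and spelling inside labels.json are not documented here, so both the constant and the helper are hypothetical.

```python
from typing import Iterable, List

# Class names copied from this card; labels.json may spell them differently.
SUPPORTED_CLASSES = [
    "People Name",
    "Card Number",
    "Account Number",
    "Social Security Number",
    "Government ID Number",
    "Date of Birth",
    "Password",
    "Tax ID Number",
    "Phone Number",
    "Residential Address",
    "Email Address",
    "IP Number",
    "Passport",
    "Driver License",
]


def missing_classes(loaded_labels: Iterable[str]) -> List[str]:
    """Return the supported classes absent from loaded_labels (case-insensitive)."""
    seen = {str(label).strip().lower() for label in loaded_labels}
    return [name for name in SUPPORTED_CLASSES if name.lower() not in seen]


# Hypothetical partial label list, used only to show the check.
print(missing_classes(["people name", "Email Address"]))
```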

Usage

Below is sample code to load and use the model in a Python environment:

import json
import tensorflow as tf
import tensorflow_hub as hub  # needed to deserialize the hub.KerasLayer in the saved model

# Paths to the model and labels.
MODEL_PATH = r"final_model.h5"
LABELS_PATH = r"labels.json"

def load_labels(labels_file):
    with open(labels_file, 'r', encoding='utf-8') as f:
        return json.load(f)

def main():
    print("Loading model from:", MODEL_PATH)
    model = tf.keras.models.load_model(MODEL_PATH, custom_objects={'KerasLayer': hub.KerasLayer})
    print("Model loaded successfully.")

    labels = load_labels(LABELS_PATH)
    print("Loaded labels:", labels)

    # Sample sentence for testing.
    sample_sentence = "John Doe's account number 1234567890 was flagged for review due to unusual activity."
    print("Sample sentence:", sample_sentence)

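    # Run prediction. The input is wrapped in a string tensor because a bare
    # Python list of strings is not accepted by every Keras version.
    predictions = model.predict(tf.constant([sample_sentence]))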
    print("Predictions:")
    for label, prob in zip(labels, predictions[0]):
        print(f"{label}: {prob:.2f}")

if __name__ == "__main__":
    main()
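The sample above prints one score per label. To turn those scores into a list of detected PII classes, one option is a simple threshold, sketched below. This rests on assumptions the card does not confirm: that the labels are a flat list in the same order as the model's output units, that each score is an independent per-class probability, and that 0.5 is a sensible cutoff. The helper name select_labels and the example scores are hypothetical.

```python
from typing import Dict, Sequence


def select_labels(
    labels: Sequence[str],
    probabilities: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff; tune it on validation data
) -> Dict[str, float]:
    """Return the labels whose score meets the threshold, highest score first."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    scored = {
        label: float(prob)
        for label, prob in zip(labels, probabilities)
        if prob >= threshold
    }
    return dict(sorted(scored.items(), key=lambda item: item[1], reverse=True))


# Made-up scores for illustration only; they are not real model output.
example_labels = ["People Name", "Account Number", "Email Address"]
example_scores = [0.78, 0.91, 0.04]
detected = select_labels(example_labels, example_scores)
print(detected)
```

With real output, the call would look like select_labels(labels, predictions[0]). A lower threshold favors recall, which is usually the safer direction for redaction, at the cost of more false positives.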

Training Data & Source Code

Training Data: The model was trained on the Redact-V1 dataset.
Source Code: The training pipeline and preprocessing code can be reviewed in the NLU-Redact-PII repository.

License

This project is licensed under the Apache-2.0 license.