---
language:
- zh
- en
pipeline_tag: other
# widget:
# - text: "Paraphrase the text:\n\n"
# example_title: "example"
# inference:
# parameters:
# # temperature: 1
# # do_sample: true
# max_new_tokens: 50
---
# Hide-and-Seek Privacy Protection Engine
GitHub Repo / [arXiv Preprint](https://arxiv.org/abs/2309.03057)
## Introduction
Hide-and-Seek is a bilingual (Chinese and English) privacy protection framework made up of two models, [hide](https://huggingface.co/tingxinli/hide-820m) and [seek](https://huggingface.co/tingxinli/seek-820m). The hide model replaces sensitive entities in the user input with other random entities (encryption), while the seek model restores the replaced parts of the output so that they correspond to the original text (decryption). This repository is our community open-source release; both models use [bloom-1.1b](https://huggingface.co/bigscience/bloom-1b1) as the base model and were obtained after vocabulary pruning and fine-tuning.
## Requirements
Since configuring a machine learning environment can be complex and time-consuming, we provide a [colab notebook](https://drive.google.com/file/d/1ZkGegZ_JjPy6k_wWnjaUaqq4QbF9LoWG/view?usp=sharing) for the demo; the required dependencies are listed below for reference. If you run the code in your own environment, you may need to make some adjustments for your hardware.
```shell
# the +cu118 build of torch is served from the PyTorch wheel index, not PyPI
pip install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.35.0
```
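Before loading the models, you may want to confirm that the CUDA build of PyTorch was installed correctly. The snippet below is only an optional sanity check and is not part of the framework itself.
```python
# Optional sanity check: confirm package versions and that a CUDA device is visible.
import torch
import transformers

print("torch:", torch.__version__)                # expected: 2.1.0+cu118
print("transformers:", transformers.__version__)  # expected: 4.35.0
print("CUDA available:", torch.cuda.is_available())
```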
## Quick Start
Below is an example of calling the hide model on its own.
```ipython
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the hide model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')

# The hide model is prompted with this paraphrase template
hide_template = """Paraphrase the text:%s\n\n"""
original_input = "张伟用苹果(iPhone 13)换了一箱好吃的苹果。"
input_text = hide_template % original_input

# Generate the anonymized text and keep only the newly generated tokens
inputs = tokenizer(input_text, return_tensors='pt').to('cuda:0')
pred = model.generate(**inputs, max_length=100)
pred = pred.cpu()[0][len(inputs['input_ids'][0]):]
hide_input = tokenizer.decode(pred, skip_special_tokens=True)
print(hide_input)
# output:
# 李明用华为(Mate 40)换了一箱好吃的橙子。
```
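If you call the hide model repeatedly, the steps above can be wrapped into a small helper. The function below is only an illustrative sketch; the repository ships its own `hide_encrypt` in hideAndSeek.py, which may differ in details such as generation parameters.
```python
# Illustrative helper, not the hideAndSeek.py implementation: wraps the prompt
# template, generation, and decoding shown in the example above.
def hide_text(text, model, tokenizer, max_length=100):
    prompt = "Paraphrase the text:%s\n\n" % text
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    pred = model.generate(**inputs, max_length=max_length)
    pred = pred.cpu()[0][len(inputs['input_ids'][0]):]  # keep only newly generated tokens
    return tokenizer.decode(pred, skip_special_tokens=True)
```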
Below is a complete example of running the Hide-and-Seek framework. Note that the full privacy protection demo requires your own OpenAI API token.
```ipython
# hide_encrypt, seek_decrypt and get_gpt_output are defined in hideAndSeek.py in this repo
from hideAndSeek import *
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the hide and seek models; both use the same tokenizer
tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
hide_model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')
seek_model = AutoModelForCausalLM.from_pretrained("tingxinli/seek-820m").to('cuda:0')
original_input = "华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。"
print('original input:', original_input)
# Step 1 (hide): replace sensitive entities in the input with random substitutes
hide_input = hide_encrypt(original_input, hide_model, tokenizer)
print('hide input:', hide_input)

# Step 2: send only the anonymized text to the remote LLM (requires an OpenAI API token)
prompt = "Translate the following text into English.\n %s\n" % hide_input
hide_output = get_gpt_output(prompt)
print('hide output:', hide_output)

# Step 3 (seek): restore the original entities in the LLM's output
original_output = seek_decrypt(hide_input, hide_output, original_input, seek_model, tokenizer)
print('original output:', original_output)
# output:
# original input: 华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。
# hide input: 迪士尼影业(Disney Studios)著名的作品有《艺术作品1》系列、《艺术作品2》系列、《艺术作品3》系列和《艺术作品4》系列。目前迪士尼未考虑推出《艺术作品1》系列新作。
# hide output: Disney Studios' famous works include the "Artwork 1" series, "Artwork 2" series, "Artwork 3" series, and "Artwork 4" series. Currently, Disney has not considered releasing a new installment in the "Artwork 1" series.
# original output: Warner Bro's famous works include the "Batman" series, "Superman" series, "The Matrix" series, and "The Lord of the Rings" series. Currently, Warner has not considered releasing a new installment in the "Batman" series.
```
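`get_gpt_output` is the only step that talks to a remote service. A minimal version of such a helper, assuming the official `openai` Python client (>=1.0) and an `OPENAI_API_KEY` environment variable, could look like the sketch below; the actual implementation in hideAndSeek.py may differ.
```python
# Minimal sketch of a get_gpt_output-style helper (illustrative, not the repo's code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_gpt_output_sketch(prompt, model_name="gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```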
## Citation
```bibtex
@misc{chen2023hide,
title={Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection},
author={Yu Chen and Tingxin Li and Huiming Liu and Yang Yu},
year={2023},
eprint={2309.03057},
archivePrefix={arXiv},
primaryClass={cs.CR}
}
```