Update README.md
README.md
CHANGED
@@ -1,3 +1,83 @@
---
language:
- zh
- en
pipeline_tag: other
# widget:
# - text: "Paraphrase the text:\n\n"
#   example_title: "example"
# inference:
#   parameters:
#     # temperature: 1
#     # do_sample: true
#     max_new_tokens: 50
---
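
# Hide-and-Seek Privacy Protection Engine
<a href="https://github.com/alohachen/Hide-and-Seek" target="_blank">GitHub Repo</a> / <a href="https://arxiv.org/abs/2309.03057" target="_blank">arXiv Preprint</a>

## Introduction
Hide-and-Seek is a bilingual (Chinese and English) privacy protection framework made up of two models, [hide](https://huggingface.co/tingxinli/hide-820m) and [seek](https://huggingface.co/tingxinli/seek-820m). The hide model replaces sensitive entities in the user's input with other random entities (encryption). The seek model restores the replaced parts of the output so that it corresponds to the original text (decryption). This repository is our community open-source release. Both models use [bloom-1.1b](https://huggingface.co/bigscience/bloom-1b1) as the base model and were obtained by pruning its vocabulary and then fine-tuning.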
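
To make the encrypt/decrypt flow concrete, here is a toy sketch of the round trip. It is an illustration under a strong assumption: a fixed substitution table stands in for the two models, whereas the real hide and seek models are fine-tuned language models. The names `toy_hide`, `toy_seek`, and `table` are hypothetical and do not appear in this repository.

```python
# Toy illustration of the hide -> remote service -> seek round trip.
# Assumption: a fixed lookup table replaces the learned hide/seek models.


def toy_hide(text: str, table: dict[str, str]) -> str:
    """Replace each sensitive entity with its stand-in (the "encryption" step)."""
    for entity, stand_in in table.items():
        text = text.replace(entity, stand_in)
    return text


def toy_seek(text: str, table: dict[str, str]) -> str:
    """Map each stand-in back to the original entity (the "decryption" step)."""
    for entity, stand_in in table.items():
        text = text.replace(stand_in, entity)
    return text


table = {"Warner": "Disney", "Batman": "Artwork 1"}
original = "Warner has not announced a new Batman film."

hidden = toy_hide(original, table)      # the only text a remote service would see
remote_answer = "Summary: " + hidden    # stand-in for the remote model's reply
restored = toy_seek(remote_answer, table)

print(hidden)
print(restored)
```

A lookup table like this stops working as soon as the remote reply rephrases or translates an entity, which is why the framework uses a model for the restoring step.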
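
## Dependencies
Because setting up a machine learning environment is complex and time-consuming, we provide a [Colab notebook](https://drive.google.com/file/d/1ZkGegZ_JjPy6k_wWnjaUaqq4QbF9LoWG/view?usp=sharing) for the demo. The required dependencies are listed below for reference. If you run the code in your own environment, you may need to adjust them for your hardware.
```shell
# CUDA 11.8 builds of torch are served from the PyTorch wheel index, not PyPI
pip install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.35.0
```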
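
If you install into an existing environment, a quick check of the installed versions can save some debugging. The following is a minimal sketch using only the standard library. The helper name `check_versions` and the `EXPECTED` table are hypothetical, and the pins are copied from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands; torch may report a local suffix such as "+cu118".
EXPECTED = {"torch": "2.1.0", "transformers": "4.35.0"}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return one status per package: "ok", "missing", or "found <version>"."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Compare only the public version so "2.1.0+cu118" matches "2.1.0".
        report[name] = "ok" if found.split("+")[0] == wanted else f"found {found}"
    return report


if __name__ == "__main__":
    for name, status in check_versions(EXPECTED).items():
        print(f"{name}: {status}")
```

A version mismatch is not necessarily a problem. The pins are only the versions listed here for reference.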

## Quick Start
Below is an example that calls the hide model on its own.
```ipython
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the hide model and its tokenizer (a CUDA GPU is assumed)
tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')

# Prompt template used for the hide model
hide_template = """<s>Paraphrase the text:%s\n\n"""
original_input = "张伟用苹果(iPhone 13)换了一箱好吃的苹果。"
input_text = hide_template % original_input
inputs = tokenizer(input_text, return_tensors='pt').to('cuda:0')

# max_length counts the prompt tokens as well as the generated ones
pred = model.generate(**inputs, max_length=100)
# Keep only the newly generated tokens, dropping the echoed prompt
pred = pred.cpu()[0][len(inputs['input_ids'][0]):]
hide_input = tokenizer.decode(pred, skip_special_tokens=True)
print(hide_input)

# output:
# 李明用华为(Mate 40)换了一箱好吃的橙子。
```
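
The steps in the example above can be wrapped in a small reusable helper. This is a sketch, not part of the repository: the names `HIDE_TEMPLATE`, `build_hide_prompt`, and `hide_text` are hypothetical, and the default `max_new_tokens=50` is an assumption borrowed from the commented-out widget settings in the front matter.

```python
HIDE_TEMPLATE = "<s>Paraphrase the text:%s\n\n"  # same template as in the example above


def build_hide_prompt(text: str) -> str:
    """Wrap raw user text in the prompt format used for the hide model."""
    return HIDE_TEMPLATE % text


def hide_text(text, model, tokenizer, max_new_tokens=50):
    """Run the hide model on `text` and return only the newly generated string.

    `model` and `tokenizer` are the objects returned by `from_pretrained`.
    """
    # Put the inputs on whatever device the model lives on (CPU or GPU).
    inputs = tokenizer(build_hide_prompt(text), return_tensors="pt").to(model.device)
    # max_new_tokens bounds only the generated part, regardless of prompt length.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the echoed prompt tokens before decoding.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Called as `hide_text(original_input, model, tokenizer)`, it follows the same steps as the snippet above. Because it reads the device from the model, it should also run on a CPU-only machine if the model was never moved to a GPU. Long inputs may need a larger `max_new_tokens`.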

Below is a complete example that calls the full Hide-and-Seek framework. Note that the end-to-end privacy protection demo requires your own OpenAI API token.
```ipython
# see hideAndSeek.py in this repo
from transformers import AutoTokenizer, AutoModelForCausalLM
from hideAndSeek import *

tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
hide_model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')
seek_model = AutoModelForCausalLM.from_pretrained("tingxinli/seek-820m").to('cuda:0')

original_input = "华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。"
print('original input:', original_input)
# Replace sensitive entities locally with the hide model
hide_input = hide_encrypt(original_input, hide_model, tokenizer)
print('hide input:', hide_input)
# Only the anonymized text goes into the prompt for the remote model
prompt = "Translate the following text into English.\n %s\n" % hide_input
hide_output = get_gpt_output(prompt)
print('hide output:', hide_output)
# Restore the original entities in the returned text with the seek model
original_output = seek_decrypt(hide_input, hide_output, original_input, seek_model, tokenizer)
print('original output:', original_output)

# output:
# original input: 华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。
# hide input: 迪士尼影业(Disney Studios)著名的作品有《艺术作品1》系列、《艺术作品2》系列、《艺术作品3》系列和《艺术作品4》系列。目前迪士尼未考虑推出《艺术作品1》系列新作。
# hide output: Disney Studios' famous works include the "Artwork 1" series, "Artwork 2" series, "Artwork 3" series, and "Artwork 4" series. Currently, Disney has not considered releasing a new installment in the "Artwork 1" series.
# original output: Warner Bro's famous works include the "Batman" series, "Superman" series, "The Matrix" series, and "The Lord of the Rings" series. Currently, Warner has not considered releasing a new installment in the "Batman" series.
```

## Citation
```
@misc{chen2023hide,
      title={Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection},
      author={Yu Chen and Tingxin Li and Huiming Liu and Yang Yu},
      year={2023},
      eprint={2309.03057},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
```