---
language:
- zh
- en
pipeline_tag: other
# widget:
# - text: "Paraphrase the text:\n\n"
#   example_title: "example"
# inference:
#   parameters:
#     # temperature: 1
#     # do_sample: true
#     max_new_tokens: 50
---

# Hide-and-Seek Privacy Protection Engine

GitHub Repo / arXiv Preprint

## Introduction

Hide-and-Seek is a bilingual (Chinese and English) privacy protection framework made up of two models, [hide](https://huggingface.co/tingxinli/hide-820m) and [seek](https://huggingface.co/tingxinli/seek-820m). The hide model replaces sensitive entities in the user's input with other, random entities (encryption). The seek model restores the replaced parts of the output so that it corresponds to the original text (decryption). This repository is our open-source community release. Both models use [bloom-1.1b](https://huggingface.co/bigscience/bloom-1b1) as the base model and were obtained from it by vocabulary pruning and fine-tuning.

## Requirements

Setting up a machine learning environment can be complex and time-consuming, so we provide a [colab notebook](https://drive.google.com/file/d/1ZkGegZ_JjPy6k_wWnjaUaqq4QbF9LoWG/view?usp=sharing) for the demo. The required dependencies are listed below for reference. If you run the models in your own environment, you may need to adjust them for your hardware.

```shell
# the +cu118 build is served from the PyTorch wheel index rather than PyPI
pip install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.35.0
```

## Quick Start

The following example calls the hide model on its own.

```ipython
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')

hide_template = """Paraphrase the text:%s\n\n"""
# "Zhang Wei traded an Apple (iPhone 13) for a box of tasty apples."
original_input = "张伟用苹果(iPhone 13)换了一箱好吃的苹果。"
input_text = hide_template % original_input
inputs = tokenizer(input_text, return_tensors='pt').to('cuda:0')
# max_length counts the prompt tokens plus the generated tokens
pred = model.generate(**inputs, max_length=100)
# drop the prompt tokens so that only the generated continuation is decoded
pred = pred.cpu()[0][len(inputs['input_ids'][0]):]
hide_input = tokenizer.decode(pred, skip_special_tokens=True)
print(hide_input)
# output:
# 李明用华为(Mate 40)换了一箱好吃的橙子。
# ("Li Ming traded a Huawei (Mate 40) for a box of tasty oranges.")
```

The following example runs the complete Hide-and-Seek framework. Note that the full privacy protection demo requires your own OpenAI API key.

```ipython
# see hideAndSeek.py in this repo
from transformers import AutoTokenizer, AutoModelForCausalLM
from hideAndSeek import *

tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
hide_model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')
seek_model = AutoModelForCausalLM.from_pretrained("tingxinli/seek-820m").to('cuda:0')

original_input = "华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。"
print('original input:', original_input)
# step 1 (local): replace sensitive entities before anything leaves the machine
hide_input = hide_encrypt(original_input, hide_model, tokenizer)
print('hide input:', hide_input)
# step 2 (remote): only the anonymized text is sent to the OpenAI API
prompt = "Translate the following text into English.\n %s\n" % hide_input
hide_output = get_gpt_output(prompt)
print('hide output:', hide_output)
# step 3 (local): restore the original entities in the remote output
original_output = seek_decrypt(hide_input, hide_output, original_input, seek_model, tokenizer)
print('original output:', original_output)
# output:
# original input: 华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。
# hide input: 迪士尼影业(Disney Studios)著名的作品有《艺术作品1》系列、《艺术作品2》系列、《艺术作品3》系列和《艺术作品4》系列。目前迪士尼未考虑推出《艺术作品1》系列新作。
# hide output: Disney Studios' famous works include the "Artwork 1" series, "Artwork 2" series, "Artwork 3" series, and "Artwork 4" series. Currently, Disney has not considered releasing a new installment in the "Artwork 1" series.
# original output: Warner Bro's famous works include the "Batman" series, "Superman" series, "The Matrix" series, and "The Lord of the Rings" series. Currently, Warner has not considered releasing a new installment in the "Batman" series.
```

## Citation

```
@misc{chen2023hide,
      title={Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection},
      author={Yu Chen and Tingxin Li and Huiming Liu and Yang Yu},
      year={2023},
      eprint={2309.03057},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
```
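
## Appendix: toy illustration of the hide/seek flow

The complete example in Quick Start has three stages: hide locally, query the remote model, then seek locally. The sketch below only illustrates that data flow and runs without a GPU or an API key. It is not how the hide and seek models work: they are fine-tuned language models, whereas this sketch assumes a fixed, hand-written substitution table. Every name in it (`MAPPING`, `toy_hide`, `toy_seek`, `fake_remote_llm`) is hypothetical and does not appear in `hideAndSeek.py`.

```python
# Toy stand-in for the hide -> remote LLM -> seek flow.
# ASSUMPTION: a fixed lookup table replaces the real hide/seek models,
# which are fine-tuned language models and do not use such a table.
MAPPING = {
    "Warner Bros": "Disney Studios",
    "Batman": "Artwork 1",
}


def toy_hide(text: str, mapping: dict) -> str:
    """Local step: swap each sensitive entity for its stand-in."""
    for real, fake in mapping.items():
        text = text.replace(real, fake)
    return text


def toy_seek(text: str, mapping: dict) -> str:
    """Local step: swap each stand-in back to the original entity."""
    for real, fake in mapping.items():
        text = text.replace(fake, real)
    return text


def fake_remote_llm(prompt: str) -> str:
    """Hypothetical placeholder for the remote API call."""
    return "Summary: " + prompt


original_input = "Warner Bros has not announced a new Batman film."
hide_input = toy_hide(original_input, MAPPING)       # stays on the local machine
hide_output = fake_remote_llm(hide_input)            # the only text sent out
original_output = toy_seek(hide_output, MAPPING)     # restored locally

print("hide input:", hide_input)
print("hide output:", hide_output)
print("original output:", original_output)
```

The point of the sketch is where each step runs: the remote call receives `hide_input` and never `original_input`. A plain lookup table would fail as soon as the remote model translates or rephrases an entity, which is the case the translation example in Quick Start shows the seek model handling.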