---
license: mit
---
# UCTopic
This repository contains the code of the UCTopic model and an easy-to-use tool, UCTopicTool, for **Topic Mining**, **Unsupervised Aspect Extraction**, and **Phrase Retrieval**.
Please see our ACL 2022 paper: [UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining](https://arxiv.org/abs/2202.13469).
# Quick Links
- [Overview](#overview)
- [Pretrained Model](#pretrained-model)
- [Getting Started](#getting-started)
- [UCTopic Model](#uctopic-model)
- [UCTopicTool](#uctopictool)
- [Experiments in Paper](#experiments)
- [Requirements](#requirements)
- [Datasets](#datasets)
- [Entity Clustering](#entity-clustering)
- [Topic Mining](#topic-mining)
- [Pretraining](#pretraining)
- [Contact](#contact)
- [Citation](#citation)
# Overview
We propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at a large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to pretraining is positive-pair construction from our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and thereby further improves phrase representations for topics.
# Pretrained Model
Our released model:
| Model | Note|
|:-------------------------------|------|
|[uctopic-base](https://drive.google.com/file/d/1XQzi4E9ctdI373CK5O-pXQyBvOONssp1/view?usp=sharing)| Pretrained UCTopic model based on [LUKE-BASE](https://arxiv.org/abs/2010.01057)
Unzip it to get the `uctopic-base` folder.
# Getting Started
We provide an easy-to-use phrase representation tool based on our UCTopic model. To use the tool, first install the `uctopic` package from PyPI:
```bash
pip install uctopic
```
Or install it directly from our source code:
```bash
python setup.py install
```
## UCTopic Model
After installing the package, you can load our model with just two lines of code:
```python
from uctopic import UCTopic
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')
```
The model will automatically download pre-trained parameters from [HuggingFace's models](https://huggingface.co/models). If you encounter any problems when loading the models directly through HuggingFace's API, you can also download the models manually from the table above and use `model = UCTopic.from_pretrained({PATH TO THE DOWNLOADED MODEL})`.
To get pre-trained **phrase representations**, our model takes the same inputs as [LUKE](https://huggingface.co/docs/transformers/model_doc/luke). Note: please input only **ONE** span at a time; otherwise, performance will decay according to our empirical results.
```python
from uctopic import UCTopicTokenizer, UCTopic
tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')
text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)] # character-based entity span corresponding to "Los Angeles"
inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)
```
`phrase_repr` is the phrase embedding (size `[768]`) of the phrase `Los Angeles`. `outputs` has the same format as the outputs from `LUKE`.
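For example, the returned embedding can be compared with another phrase's embedding using cosine similarity. The snippet below is a minimal sketch that reuses the `tokenizer` and `model` loaded above; the `phrase_embedding` helper and the second example sentence are only illustrative.
```python
import torch
import torch.nn.functional as F

def phrase_embedding(text, span):
    # Encode one sentence with a single character-level span (illustrative helper)
    inputs = tokenizer(text, entity_spans=[span], add_prefix_space=True, return_tensors="pt")
    with torch.no_grad():
        _, phrase_repr = model(**inputs)
    return phrase_repr.squeeze()  # 1-D vector of size 768

emb_a = phrase_embedding("Beyoncé lives in Los Angeles.", (17, 28))     # "Los Angeles"
emb_b = phrase_embedding("She moved to New York last year.", (13, 21))  # "New York"
print(F.cosine_similarity(emb_a, emb_b, dim=0).item())
```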
## UCTopicTool
We provide a tool, `UCTopicTool`, built on `UCTopic` for efficient phrase encoding, topic mining (or unsupervised aspect extraction), and phrase retrieval.
### Initialization
`UCTopicTool` is initialized by giving the `model_name_or_path` and `device`.
```python
from uctopic import UCTopicTool
tool = UCTopicTool('JiachengLi/uctopic-base', device='cuda:0')
```
### Phrase Encoding
Phrases are encoded by our method `UCTopicTool.encode` in batches, which is more efficient than `UCTopic`.
```python
phrases = [["This place is so much bigger than others!", (0, 10)],
["It was totally packed and loud.", (15, 21)],
["Service was on the slower side.", (0, 7)],
["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
["The ingredient weren't really fresh.", (4, 14)]]
embeddings = topic_tool.encode(phrases) # len(embeddings) is equal to len(phrases)
```
**Note**: Each instance in `phrases` contains only one sentence and one span (character-level position) in format `[sentence, span]`.
Arguments for `UCTopicTool.encode` are as follows (a short usage sketch follows the list):
* **phrase** (List) - A list of `[sentence, span]` to be encoded.
* **return_numpy** (bool, *optional*, defaults to `False`) - Return `numpy.array` or `torch.Tensor`.
* **normalize_to_unit** (bool, *optional*, defaults to `True`) - Normalize all embeddings to unit vectors.
* **keepdim** (bool, *optional*, defaults to `True`) - Keep dimension size `[instance_number, hidden_size]`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batch in the model.
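For instance, the optional arguments can be combined to obtain unit-normalized NumPy embeddings encoded in smaller mini-batches. The following is a minimal sketch reusing `tool` and `phrases` from above; the printed hidden size of 768 is taken from the phrase-embedding size noted earlier.
```python
import numpy as np

# Unit-normalized numpy embeddings, encoded in mini-batches of 32
embeddings = tool.encode(phrases, return_numpy=True, normalize_to_unit=True, batch_size=32)
print(embeddings.shape)                    # (len(phrases), 768)
print(np.linalg.norm(embeddings, axis=1))  # each row is (approximately) a unit vector
```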
### Topic Mining and Unsupervised Aspect Extraction
The method `UCTopicTool.topic_mining` can mine topical phrases or conduct aspect extraction from sentences with or without spans.
```python
sentences = ["This place is so much bigger than others!",
"It was totally packed and loud.",
"Service was on the slower side.",
"I ordered 2 mojitos: 1 lime and 1 mango.",
"The ingredient weren't really fresh."]
spans = [[(0, 10)], # This place
[(15, 21), (26, 30)], # packed; loud
[(0, 7)], # Service
[(12, 19), (21, 27), (32, 39)], # mojitos; 1 lime; 1 mango
[(4, 14)]] # ingredient
# len(sentences) is equal to len(spans)
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans, \
n_clusters=[15, 25])
# predict topic for new phrases
phrases = [["The food here is amazing!", (4, 8)],
["Lovely ambiance with live music!", (21, 31)]]
topics = tool.predict_topic(phrases)
```
**Note**: If `spans` is not given, `UCTopicTool` will extract noun phrases by [spaCy](https://spacy.io/).
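For example, the same sentences can be mined without providing spans. A minimal sketch, assuming the spaCy `en_core_web_sm` model is installed (see [Requirements](#requirements)):
```python
# Let the tool extract noun-phrase spans automatically with spaCy
output_data, topic_phrase_dict = tool.topic_mining(sentences, n_clusters=[15, 25])
```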
Arguments for `UCTopicTool.topic_mining` are as follows.
Data arguments:
* **sentences** (List) - A List of sentences for topic mining.
* **spans** (List, *optional*, defaults to `None`) - A list of span lists corresponding to the sentences, e.g., `[[(0, 9), (5, 7)], [(1, 2)]]` with `len(sentences)==len(spans)`. If `None`, phrases are automatically mined from noun chunks.
Clustering arguments:
* **n_clusters** (int or List, *optional*, defaults to `2`) - The number of topics. When `n_clusters` is a list, `n_clusters[0]` and `n_clusters[1]` are the minimum and maximum numbers to search, and `n_clusters[2]` is the search step (defaults to 1 if not provided).
* **metric** (str, *optional*, defaults to `"cosine"`) - The metric used to measure the distance between vectors, either `"cosine"` or `"euclidean"`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batch for phrase encoding.
* **max_iter** (int, *optional*, defaults to `300`) - The maximum number of iterations of k-means.
CCL-finetune arguments:
* **ccl_finetune** (bool, *optional*, defaults to `True`) - Whether to conduct CCL-finetuning in the paper.
* **batch_size_finetune** (int, *optional*, defaults to `8`) - The size of mini-batch for finetuning.
* **max_finetune_num** (int, *optional*, defaults to `100000`) - The maximum number of training instances for finetuning.
* **finetune_step** (int, *optional*, defaults to `2000`) - The number of training steps for finetuning.
* **contrastive_num** (int, *optional*, defaults to `5`) - The number of negatives in contrastive learning.
* **positive_ratio** (float, *optional*, defaults to `0.1`) - The ratio of the most confident instances for finetuning.
* **n_sampling** (int, *optional*, defaults to `10000`) - The number of sampled examples for cluster number confirmation and finetuning. Set to `-1` to use the whole dataset.
* **n_workers** (int, *optional*, defaults to `8`) - The number of workers for preprocessing data.
Returns for `UCTopicTool.topic_mining` are as follows (a short inspection sketch follows the list):
* **output_data** (List) - A list of sentences and corresponding phrases and topic numbers. Each element is `[sentence, [[start1, end1, topic1], [start2, end2, topic2]]]`.
* **topic_phrase_dict** (Dict) - A dictionary of topics and the list of phrases under a topic. The phrases are sorted by their confidence scores. E.g., `{topic: [[phrase1, score1], [phrase2, score2]]}`.
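As an illustration, the returned structures can be inspected directly. A minimal sketch over the `output_data` and `topic_phrase_dict` returned above (exact value types may differ slightly):
```python
# Each element of output_data is [sentence, [[start, end, topic], ...]]
for sentence, span_topics in output_data:
    for start, end, topic in span_topics:
        print(f"{sentence[start:end]!r} -> topic {topic}")

# topic_phrase_dict maps each topic to its phrases sorted by confidence score
for topic, phrase_scores in topic_phrase_dict.items():
    print(topic, [phrase for phrase, score in phrase_scores[:5]])
```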
The method `UCTopicTool.predict_topic` predicts topic ids for new phrases based on your training results from `UCTopicTool.topic_mining`. Its inputs are the same as those of `UCTopicTool.encode`, and it returns a list of topic ids (int).
### Phrase Similarities and Retrieval
The method `UCTopicTool.similarity` computes the cosine similarities between two groups of phrases:
```python
phrases_a = [["This place is so much bigger than others!", (0, 10)],
             ["It was totally packed and loud.", (15, 21)]]
phrases_b = [["Service was on the slower side.", (0, 7)],
             ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
             ["The ingredient weren't really fresh.", (4, 14)]]
similarities = tool.similarity(phrases_a, phrases_b)
```
Arguments for `UCTopicTool.similarity` are as follows:
* **queries** (List) - A list of `[sentence, span]` as queries.
* **keys** (List or `numpy.array`) - A list of `[sentence, span]` as keys or phrase representations (`numpy.array`) from `UCTopicTool.encode`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batch in the model.
`UCTopicTool.similarity` returns a `numpy.array` that contains the similarities between the phrase pairs in the two groups.
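Because `keys` also accepts precomputed representations, a large key set can be encoded once and reused across queries. A minimal sketch using `phrases_a`/`phrases_b` from above; the 2-D query-by-key shape of the result is our assumption.
```python
# Encode the keys once and reuse them for several queries
key_embeddings = tool.encode(phrases_b, return_numpy=True)
similarities = tool.similarity(phrases_a, key_embeddings)
print(similarities.shape)  # assumed (len(phrases_a), len(phrases_b)), e.g., (2, 3)
```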
The methods `UCTopicTool.build_index` and `UCTopicTool.search` are used for phrase retrieval:
```python
phrases = [["This place is so much bigger than others!", (0, 10)],
["It was totally packed and loud.", (15, 21)],
["Service was on the slower side.", (0, 7)],
["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
["The ingredient weren't really fresh.", (4, 14)]]
# query multiple phrases
query1 = [["The food here is amazing!", (4, 8)],
["Lovely ambiance with live music!", (21, 31)]]
# query single phrases
query2 = ["The food here is amazing!", (4, 8)]
tool.build_index(phrases)
results = tool.search(query1, top_k=3)
# or
results = tool.search(query2, top_k=3)
```
We also support [faiss](https://github.com/facebookresearch/faiss), an efficient similarity search library. Just install the package following the [instructions](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md), and `UCTopicTool` will automatically use `faiss` for efficient search.
`UCTopicTool.search` returns the ranked top k phrases for each query.
### Save and Load finetuned UCTopicTool
The methods `UCTopicTool.save` and `UCTopicTool.load` are used to save and load all parameters of `UCTopicTool`.
Save:
```python
tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
# finetune UCTopic with CCL
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans, n_clusters=[15, 25])
tool.save('path/to/your/directory')
```
Load:
```python
tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
tool.load('path/to/your/directory')
```
The loaded parameters will be used by all the methods introduced above (encoding, topic mining, phrase similarity, and retrieval).
# Experiments
In this section, we describe how to reproduce the experiments in our paper.
## Requirements
First, install PyTorch by following the instructions on [the official website](https://pytorch.org). To faithfully reproduce our results, please use version `1.9.0` built for your platform/CUDA version.
Then run the following script to install the remaining dependencies:
```bash
pip install -r requirements.txt
```
Download the `en_core_web_sm` model from spaCy:
```bash
python -m spacy download en_core_web_sm
```
## Datasets
The downstream datasets used in our experiments can be downloaded from [here](https://drive.google.com/file/d/1dVIp9li1Wdh0JgU8slsWm0ObcitbQtSL/view?usp=sharing).
## Entity Clustering
The config file for entity clustering is `clustering/consts.py` and most arguments are self-explanatory. Please set up `--gpu` and `--data_path` before running. The clustering scores will be printed.
Clustering with our pre-trained phrase embeddings:
```bash
python clustering.py --gpu 0
```
Clustering with our pre-trained phrase embeddings and the Cluster-Assisted Contrastive Learning (CCL) proposed in our paper:
```bash
python clustering_ccl_finetune.py --gpu 0
```
## Topic Mining
The config file for topic mining is `topic_modeling/consts.py`.
**Key Argument Table**
| Arguments | Description |
|:-----------------|:-----------:|
| --num_classes |**Min** and **Max** number of classes, e.g., `[5, 15]`. Our model will find the class number by [silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).|
| --sample_num_cluster |Number of sampled phrases to confirm class number.|
| --sample_num_finetune|Number of sampled phrases for CCL finetuning.|
| --contrastive_num|Number of negative classes for CCL finetuning.|
| --finetune_step | CCL finetuning steps (maximum global steps for finetuning).|
**Tips**: Please tune `--batch_size` or `--contrastive_num` for suitable GPU memory usage.
Topic mining with our pre-trained phrase embeddings and the Cluster-Assisted Contrastive Learning (CCL) proposed in our paper:
```bash
python find_topic.py --gpu 0
```
**Outputs**
We output three files under `topic_results` (a loading sketch follows the table):
| File Name | Description |
|:-----------------|:-----------:|
| `merged_phraes_pred_prob.pickle` |A dictionary of phrases with their topic number and prediction probability. The topic of a phrase is merged from all of its mentions. `{phrase: [topic_id, probability]}`, e.g., `{'fair prices': [0, 0.34889686]}`|
| `phrase_instances_pred.json`| A list of all mined phrase mentions. Each element is `[[doc_id, start, end, phrase_mention], topic_id]`.|
| `topics_phrases.json`|A dictionary of topics and corresponding phrases sorted by probability. `{'topic_id': [[phrase1, prob1], [phrase2, prob2]]}`|
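These outputs are standard pickle/JSON files and can be loaded directly; a minimal sketch, assuming the default `topic_results` output directory:
```python
import json
import pickle

# {phrase: [topic_id, probability]}
with open('topic_results/merged_phraes_pred_prob.pickle', 'rb') as f:
    phrase_pred = pickle.load(f)

# [[[doc_id, start, end, phrase_mention], topic_id], ...]
with open('topic_results/phrase_instances_pred.json') as f:
    phrase_instances = json.load(f)

# {'topic_id': [[phrase, prob], ...]}, sorted by probability
with open('topic_results/topics_phrases.json') as f:
    topics_phrases = json.load(f)
```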
## Pretraining
**Data**
For unsupervised pretraining of UCTopic, we use articles and spans with links from English Wikipedia and Wikidata. Our processed dataset can be downloaded from [here](https://drive.google.com/file/d/1wflsmhPI9J0ZA6aVRl2mQjHIE6JIvzAv/view?usp=sharing).
**Training scripts**
We provide example training scripts and our default training parameters for unsupervised training of UCTopic in `run_example.sh`.
```bash
bash run_example.sh
```
Argument descriptions can be found in `pretrain.py`. All the other arguments are standard HuggingFace `transformers` training arguments.
**Convert models**
Our pretrained checkpoints are slightly different from the `uctopic-base` checkpoint. Please refer to `convert_uctopic_parameters.py` to convert them.
# Contact
If you have any questions related to the code or the paper, feel free to email Jiacheng (`[email protected]`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to describe the problem in detail so we can help you better and faster!
# Citation
Please cite our paper if you use UCTopic in your work:
```bibtex
@inproceedings{Li2022UCTopicUC,
title = "{UCT}opic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining",
author = "Li, Jiacheng and
Shang, Jingbo and
McAuley, Julian",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.426",
doi = "10.18653/v1/2022.acl-long.426",
pages = "6159--6169"
}
``` |