---
license: apache-2.0
---

# rrivera1849/LUAR-CRUD

Author Style Representations using [LUAR](https://aclanthology.org/2021.emnlp-main.70.pdf).

This particular model was trained on a subsample of the [Pushshift](https://arxiv.org/abs/2001.08435) dataset: comments published between January 2015 and October 2019 by authors who published at least 100 comments during that period.

## Usage

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rrivera1849/LUAR-CRUD")
model = AutoModel.from_pretrained("rrivera1849/LUAR-CRUD")

# We embed `episodes`: collections of documents presumed to come from the same author.
# NOTE: make sure that `episode_length` is consistent across episodes.
batch_size = 3
episode_length = 16
text = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]
text = [j for i in text for j in i]
tokenized_text = tokenizer(
    text,
    max_length=32,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
# inputs size: (batch_size, episode_length, max_token_length)
tokenized_text["input_ids"] = tokenized_text["input_ids"].reshape(batch_size, episode_length, -1)
tokenized_text["attention_mask"] = tokenized_text["attention_mask"].reshape(batch_size, episode_length, -1)
print(tokenized_text["input_ids"].size())       # torch.Size([3, 16, 32])
print(tokenized_text["attention_mask"].size())  # torch.Size([3, 16, 32])

out = model(**tokenized_text)
print(out.size())  # torch.Size([3, 512])
```
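
Since the model produces one fixed-size embedding per author episode, a common downstream step is comparing authors by cosine similarity. The sketch below uses a random stand-in tensor in place of real model outputs (same `(3, 512)` shape as `out` above) so it runs without downloading the model; it only assumes `torch`, which `transformers` already requires.

```python
import torch
import torch.nn.functional as F

# Stand-in for `out` from the usage example above: one 512-dim
# embedding per author episode (batch_size = 3).
torch.manual_seed(0)
out = torch.randn(3, 512)

# L2-normalize, then take pairwise dot products to get cosine similarity.
normed = F.normalize(out, p=2, dim=1)
similarity = normed @ normed.T  # shape (3, 3); entry [i, j] compares authors i and j

print(similarity.shape)  # torch.Size([3, 3])
# Each embedding is maximally similar to itself.
assert torch.allclose(similarity.diagonal(), torch.ones(3), atol=1e-5)
```

A higher similarity between two episode embeddings suggests the episodes were written by the same author.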

## Citing & Authors

If you find this model helpful, feel free to cite our [publication](https://aclanthology.org/2021.emnlp-main.70.pdf).

```
@inproceedings{uar-emnlp2021,
  author    = {Rafael A. Rivera Soto and Olivia Miano and Juanita Ordonez and Barry Chen and Aleem Khan and Marcus Bishop and Nicholas Andrews},
  title     = {Learning Universal Authorship Representations},
  booktitle = {EMNLP},
  year      = {2021},
}
```

## License

LUAR is distributed under the terms of the Apache License (Version 2.0).

All new contributions must be made under the Apache-2.0 license.