fgrezes committed

Commit 3bf62ed · 1 Parent(s): 799e70a

SciX Categorizer tutorial and updated readme
Files/SciX_Categorizer_id2label.json ADDED
@@ -0,0 +1 @@
+ {"0": "Astronomy", "1": "Heliophysics", "2": "Planetary Science", "3": "Earth Science", "4": "NASA-funded Biophysics", "5": "Other Physics", "6": "Other", "7": "Text Garbage"}
README.md CHANGED
@@ -65,11 +65,13 @@ This model is **cased** (it treats `ads` and `ADS` differently).
  ## astroBERT models
  0. **Base model**: Pretrained model on English language using a masked language modeling (MLM) and next sentence prediction (NSP) objective. It was introduced in [this paper at ADASS 2021](https://arxiv.org/abs/2112.00590) and made public at ADASS 2022.
  1. **NER-DEAL model**: This model adds a token classification head to the base model finetuned on the [DEAL@WIESP2022 named entity recognition](https://ui.adsabs.harvard.edu/WIESP/2022/SharedTasks) task. Must be loaded from the `revision='NER-DEAL'` branch (see tutorial 2).
+ 2. **SciX Categorizer**: This model was finetuned to classify text into one of 8 categories of interest to SciX (Astronomy, Heliophysics, Planetary Science, Earth Science, NASA-funded Biophysics, Other Physics, Other, Text Garbage). Must be loaded from the `revision='SciX-Categorizer'` branch (see tutorial 3).

  ### Tutorials
  0. [generate text embedding (for downstream tasks)](https://nbviewer.org/urls/huggingface.co/adsabs/astroBERT/raw/main/Tutorials/0_Embeddings.ipynb)
  1. [use astroBERT for the Fill-Mask task](https://nbviewer.org/urls/huggingface.co/adsabs/astroBERT/raw/main/Tutorials/1_Fill-Mask.ipynb)
  2. [make NER-DEAL predictions](https://nbviewer.org/urls/huggingface.co/adsabs/astroBERT/raw/main/Tutorials/2_NER_DEAL.ipynb)
+ 3. [categorize texts for SciX](https://nbviewer.org/urls/huggingface.co/adsabs/astroBERT/raw/main/Tutorials/3_SciX_Categorizer.ipynb)


  ### BibTeX
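Both additions point at the `SciX-Categorizer` revision branch. As a condensed sketch of what Tutorial 3 (added below) walks through end to end, using `return_tensors='pt'` for brevity instead of the tutorial's explicit tensor construction (the sample abstract is an illustrative placeholder, not text from the commit):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Model and tokenizer both live on the 'SciX-Categorizer' revision branch.
model = AutoModelForSequenceClassification.from_pretrained('adsabs/astroBERT', revision='SciX-Categorizer')
tokenizer = AutoTokenizer.from_pretrained('adsabs/astroBERT', revision='SciX-Categorizer', do_lower_case=False)

text = 'We measure star formation efficiencies of giant molecular clouds in the Milky Way.'
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors='pt')
with torch.no_grad():
    scores = model(**inputs).logits.sigmoid().squeeze()
print(model.config.id2label[int(scores.argmax())])  # highest-scoring SciX category
```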
Tutorials/3_SciX_Categorizer.ipynb ADDED
@@ -0,0 +1,252 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "fb8ec95f-5740-462d-b650-0ab5900972fe",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Use the trained astroBERT model to make SciX category predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "157b160e-63eb-4fff-b5f2-52a4d1099271",
+ "metadata": {},
+ "source": [
+ "# Tutorial 3 - Using astroBERT to make SciX category predictions\n",
+ "This tutorial shows you how to use a finetuned astroBERT to classify paper abstracts into SciX Categories. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "a117bf9e-9428-4f23-b2ff-2ff8bb7b3f89",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 1 - load the model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "b73baf65-5c16-44db-bcf9-2c57a994dcea",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import AutoModelForSequenceClassification"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "d4df0b34-def3-4b1c-88f6-ff270ca0fca1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pretrained_model_name_or_path = 'adsabs/astroBERT'\n",
+ "revision = 'SciX-Categorizer'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "60968db9-9aa5-4e32-ab38-595dfed87adb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# load model\n",
+ "model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=pretrained_model_name_or_path,\n",
+ " revision=revision\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "63aec77f-4135-45dd-af56-3962a8883793",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{0: 'Astronomy',\n",
+ " 1: 'Heliophysics',\n",
+ " 2: 'Planetary Science',\n",
+ " 3: 'Earth Science',\n",
+ " 4: 'NASA-funded Biophysics',\n",
+ " 5: 'Other Physics',\n",
+ " 6: 'Other',\n",
+ " 7: 'Text Garbage'}"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# check out the categories\n",
+ "model.config.id2label"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "49dfc617-2ee5-4309-b06a-282e6b614d61",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# load some texts and tokenize them\n",
+ "paper_abstracts = ['Star Formation Efficiencies and Lifetimes of Giant Molecular Clouds in the Milky Way. We use a sample of the 13 most luminous WMAP Galactic free-free sources, responsible for 33% of the free-free emission of the Milky Way, to investigate star formation. The sample contains 40 star-forming complexes; we combine this sample with giant molecular cloud (GMC) catalogs in the literature to identify the host GMCs of 32 of the complexes. We estimate the star formation efficiency epsilonGMC and star formation rate per free-fall time epsilonff. We find that epsilonGMC ranges from 0.002 to 0.2, with an ionizing luminosity-weighted average langepsilonGMCrang Q = 0.08, compared to the Galactic average ≈0.005. Turning to the star formation rate per free-fall time, we find values that range up to ɛ_ff ≡ τ _ff \\cdot \\dot{M}_*/M_GMC≈ 1. Weighting by ionizing luminosity, we find an average of langepsilonffrang Q = 0.14-0.24 depending on the estimate of the age of the system. Once again, this is much larger than the Galaxy-wide average value epsilonff = 0.006. We show that the lifetimes of GMCs at the mean mass found in our sample is 27 ± 12 Myr, a bit less than three free-fall times. The GMCs hosting the most luminous clusters are being disrupted by those clusters. Accordingly, we interpret the range in epsilonff as the result of a time-variable star formation rate; the rate of star formation increases with the age of the host molecular cloud, until the stars disrupt the cloud. These results are inconsistent with the notion that the star formation rate in Milky Way GMCs is determined by the properties of supersonic turbulence.',\n",
+ " 'Random Forests. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148-156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.',\n",
+ " 'Galactic Stellar and Substellar Initial Mass Function, We review recent determinations of the present-day mass function (PDMF) and initial mass function (IMF) in various components of the Galaxy-disk, spheroid, young, and globular clusters-and in conditions characteristic of early star formation. As a general feature, the IMF is found to depend weakly on the environment and to be well described by a power-law form for m>~1 Msolar and a lognormal form below, except possibly for early star formation conditions. The disk IMF for single objects has a characteristic mass around mc~0.08 Msolar and a variance in logarithmic mass σ~0.7, whereas the IMF for multiple systems has mc~0.2 Msolar and σ~0.6. The extension of the single MF into the brown dwarf regime is in good agreement with present estimates of L- and T-dwarf densities and yields a disk brown dwarf number density comparable to the stellar one, nBD~n*~0.1 pc-3. The IMF of young clusters is found to be consistent with the disk field IMF, providing the same correction for unresolved binaries, confirming the fact that young star clusters and disk field stars represent the same stellar population. Dynamical effects, yielding depletion of the lowest mass objects, are found to become consequential for ages >~130 Myr. The spheroid IMF relies on much less robust grounds. The large metallicity spread in the local subdwarf photometric sample, in particular, remains puzzling. Recent observations suggest that there is a continuous kinematic shear between the thick-disk population, present in local samples, and the genuine spheroid one. This enables us to derive only an upper limit for the spheroid mass density and IMF. Within all the uncertainties, the latter is found to be similar to the one derived for globular clusters and is well represented also by a lognormal form with a characteristic mass slightly larger than for the disk, mc~0.2-0.3 Msolar, excluding a significant population of brown dwarfs in globular clusters and in the spheroid. The IMF characteristic of early star formation at large redshift remains undetermined, but different observational constraints suggest that it does not extend below ~1 Msolar. These results suggest a characteristic mass for star formation that decreases with time, from conditions prevailing at large redshift to conditions characteristic of the spheroid (or thick disk) to present-day conditions. These conclusions, however, remain speculative, given the large uncertainties in the spheroid and early star IMF determinations. These IMFs allow a reasonably robust determination of the Galactic present-day and initial stellar and brown dwarf contents. They also have important galactic implications beyond the Milky Way in yielding more accurate mass-to-light ratio determinations. The mass-to-light ratios obtained with the disk and the spheroid IMF yield values 1.8-1.4 times smaller than for a Salpeter IMF, respectively, in agreement with various recent dynamical determinations. This general IMF determination is examined in the context of star formation theory. None of the theories based on a Jeans-type mechanism, where fragmentation is due only to gravity, can fulfill all the observational constraints on star formation and predict a large number of substellar objects. 
On the other hand, recent numerical simulations of compressible turbulence, in particular in super-Alfvénic conditions, seem to reproduce both qualitatively and quantitatively the stellar and substellar IMF and thus provide an appealing theoretical foundation. In this picture, star formation is induced by the dissipation of large-scale turbulence to smaller scales through radiative MHD shocks, producing filamentary structures. These shocks produce local nonequilibrium structures with large density contrasts, which collapse eventually in gravitationally bound objects under the combined influence of turbulence and gravity. The concept of a single Jeans mass is replaced by a distribution of local Jeans masses, representative of the lognormal probability density function of the turbulent gas. Objects below the mean thermal Jeans mass still have a possibility to collapse, although with a decreasing probability. The page charges for this Review were partially covered by a generous gift from a PASP supporter.',\n",
+ " ]\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "4517e4cf-d9f3-4ba2-b42d-8ce1ff18128a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import AutoTokenizer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "5df3f5fd-7af2-4696-9399-6c58c60dca1e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=pretrained_model_name_or_path,\n",
+ " revision=revision,\n",
+ " do_lower_case=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "adddc2f7-c576-4013-a324-d05b4e5ab580",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# tokenize the abstracts; keep the full encoding so the attention mask is available later\n",
+ "tokenized_texts = tokenizer(paper_abstracts, max_length=512, truncation=True, padding='max_length', add_special_tokens=True)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "9d9ec93e-2699-41e9-9138-b5194f86d6cb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from torch import no_grad, tensor"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "ce1213c7-18f2-4bba-833d-07fa4481ab83",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# forward call; pass the attention mask so padded positions are ignored\n",
+ "with no_grad():\n",
+ "    predictions = model(input_ids=tensor(tokenized_texts['input_ids']),\n",
+ "                        attention_mask=tensor(tokenized_texts['attention_mask'])).logits.sigmoid().tolist()\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "d62625a1-ae58-4237-b980-eac2a6b7344d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[[0.9995561242103577, 0.00014585199824068695, 0.00038051357842050493, 0.0002408829896012321, 8.467026191283367e-07, 0.00015598340542055666, 9.46967484196648e-05, 6.663293333986076e-06], [7.198489038273692e-05, 3.548982203938067e-05, 0.005394219420850277, 0.9924236536026001, 4.508056463237153e-06, 0.0031560249626636505, 0.005173855926841497, 2.667580520210322e-05], [0.9982689619064331, 0.0003360261907801032, 0.003926889039576054, 0.00021573311823885888, 5.955367328169814e-07, 0.0001352201506961137, 2.13952280319063e-05, 4.8363999667344615e-06]]\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(predictions)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "34e051eb-ec4e-4a68-b4a4-3731f322f34c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "e0e3b0f6-082d-465d-bfbd-7649c97d16a2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "For abstract 0, the model predicts Astronomy with score 0.9996\n",
+ "For abstract 1, the model predicts Earth Science with score 0.9924\n",
+ "For abstract 2, the model predicts Astronomy with score 0.9983\n"
+ ]
+ }
+ ],
+ "source": [
+ "# check out the max prediction score\n",
+ "for i, scores in enumerate(predictions):\n",
+ "    print('For abstract {}, the model predicts {} with score {:.4f}'.format(i, model.config.id2label[np.argmax(scores)], np.max(scores)))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "731bda99-2ad3-4653-b209-d988693476a4",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
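A closing note on reading the scores: the notebook applies a sigmoid rather than a softmax, so the eight values are independent per-category scores rather than a single distribution. Besides the argmax used above, they can also be read in a multi-label fashion. A minimal sketch, assuming a 0.5 cutoff (the threshold is an illustrative choice, not a value from the commit; the `scores` row is the notebook's first prediction, rounded):

```python
def categories_above_threshold(scores, id2label, threshold=0.5):
    # Return every category whose sigmoid score clears the threshold.
    return [id2label[i] for i, s in enumerate(scores) if s >= threshold]

id2label = {0: 'Astronomy', 1: 'Heliophysics', 2: 'Planetary Science', 3: 'Earth Science',
            4: 'NASA-funded Biophysics', 5: 'Other Physics', 6: 'Other', 7: 'Text Garbage'}
scores = [0.9996, 0.0001, 0.0004, 0.0002, 0.0000, 0.0002, 0.0001, 0.0000]
print(categories_above_threshold(scores, id2label))  # ['Astronomy']
```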