Create README.md
---
language: "en"
tags:
- agriculture-domain
- agriculture
widget:
- text: "[MASK] agriculture provides one of the most promising areas for innovation in green and blue infrastructure in cities."
---
# BERT for Agriculture Domain

A BERT-based language model further pre-trained from the checkpoint of [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased).
The training data balances scientific and general works in the agriculture domain, covering knowledge from different areas of agricultural research and practice.

The corpus contains 1.3 million paragraphs from the National Agricultural Library (NAL) of the US government and 4.2 million paragraphs from books and common literature in the **Agriculture Domain**.

The model was trained with the self-supervised masked language modeling (MLM) objective:
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. A sketch of this masking step is shown below.
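As a rough illustration of that masking step, the sketch below uses the `transformers` data collator to select 15% of the tokens of a sample sentence as MLM targets (the sample text is invented for demonstration, and the repository name follows the usage example further down):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Tokenizer of this model (repository name assumed from the usage example below)
tokenizer = AutoTokenizer.from_pretrained("recobo/agriculture-bert-uncased")

# Collator that selects 15% of the tokens as MLM targets (mostly replaced by [MASK])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

example = tokenizer("Cover crops improve soil structure and reduce erosion.")
batch = collator([example])

# Masked input, plus labels that are -100 everywhere except at the masked positions
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])
```

At inference time, the same masked-word prediction is exposed through the fill-mask pipeline: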

```python
from transformers import pipeline

# Fill-mask pipeline using this agriculture-domain checkpoint
fill_mask = pipeline(
    "fill-mask",
    model="recobo/agriculture-bert-uncased",
    tokenizer="recobo/agriculture-bert-uncased"
)
fill_mask("we create [MASK]")
```
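Beyond the fill-mask pipeline, the checkpoint can also be loaded directly, e.g. to extract contextual embeddings or to fine-tune on a downstream agriculture task. A minimal sketch, assuming the same repository name as above:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Repository name assumed from the example above
tokenizer = AutoTokenizer.from_pretrained("recobo/agriculture-bert-uncased")
model = AutoModel.from_pretrained("recobo/agriculture-bert-uncased")

inputs = tokenizer("Crop rotation improves soil fertility.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states into a single sentence embedding
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size); hidden_size is 768 for a BERT-base-sized model
```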