MarcosDib committed on
Commit
bdba148
1 Parent(s): c6cd2eb

Update README.md

Files changed (1)
  1. README.md +12 -22
README.md CHANGED
@@ -23,31 +23,18 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png
 
  Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
 
- This project focuses on a specific problem: creating a Research Financing Products Portfolio (FPP) outside
- of the Union budget, supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). The problem
- description and conceptual model of FPP/MCTI are shown in Figure 1.
-
- ![Model](https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png)
+ The model [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi), part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti), focuses
+ on the task of Text Classification and explores different machine learning strategies to classify a small amount
+ of long, unstructured, and uneven data to find a proper method with good performance. Pre-training and word embedding
+ solutions were used to learn word relationships from other datasets with considerable similarity and larger scale.
+ Then, using the acquired resources, based on the dataset available in the MCTI, transfer learning plus deep learning
+ models were applied to improve the understanding of each sentence.
 
  ## According to the abstract
 
- Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations
- require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification
- and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case study that explores
- different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method
- with good performance. The collected data includes texts of financing opportunities the international R&D funding organizations
- provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by
- the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship
- of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the
- available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence.
- Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a
- Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as
- a successful case of artificial intelligence in a federal government application.
-
- This model focuses on a more specific problem: creating a Research Financing Products Portfolio (FPP) outside of the Union budget,
- supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
- [this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference between english
- and English.
+ Compared to the 81% baseline accuracy rate based on available datasets and the 85% accuracy rate achieved using a
+ Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 93%, according to
+ ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318).
 
  ## Model description
 
@@ -159,6 +146,9 @@ output = model(encoded_input)
 
  ### Limitations and bias
 
+ This model is uncased: it does not make a difference between english
+ and English.
+
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
  predictions:
 
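
The added description (uncased input, pre-trained word embeddings mean-pooled into sentence features, then a classifier) can be sketched in miniature. This is an illustrative toy only, not the actual W2V-CNN model: the embedding table, vocabulary, and the nearest-centroid rule standing in for the CNN classifier are all invented for the example.

```python
# Toy sketch of the described pipeline: uncased tokenization, mean-pooled
# word embeddings as sentence features, and a classifier on top.
# All vectors and class names below are made up for illustration.
from math import sqrt

# Hypothetical 3-dimensional "pre-trained" word embeddings.
EMB = {
    "funding": [0.9, 0.1, 0.0],
    "grant": [0.8, 0.2, 0.1],
    "research": [0.7, 0.3, 0.2],
    "deadline": [0.1, 0.9, 0.0],
    "closed": [0.0, 0.8, 0.3],
    "expired": [0.1, 0.7, 0.4],
}

def embed(text):
    """Uncased mean-pooled sentence vector; out-of-vocabulary words are skipped."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(3)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Tiny labeled "dataset": one centroid per class (stand-in for the trained CNN).
CENTROIDS = {
    "eligible": embed("research grant funding"),
    "not_eligible": embed("deadline closed expired"),
}

def classify(text):
    return max(CENTROIDS, key=lambda c: cosine(embed(text), CENTROIDS[c]))

print(classify("New FUNDING for research"))  # → eligible
```

Because classification runs on `text.lower()`, "FUNDING" and "funding" map to the same vector, which is what the card's "uncased" note means in practice.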