unb-lamfo-nlp-mcti/NLP-ATS-MCTI


igorgavi committed on Dec 16, 2022

Commit 6b232ca • 1 Parent(s): 92aefa9

Update README.md


README.md +73 -50


@@ -53,9 +53,6 @@ model are the already existing and vastly applied BART-Large CNN, Pegasus-XSUM a
 the Sumy Python Library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL and SumyReduction. Each of the
 methods used for text summarization will be described individually in the following sections.
-![architeru](https://github.com/marcosdib/S2Query/Classification_Architecture_model.png)
 ## Methods
 Since there are many methods to choose from in order to perform the ATS task using this model, the following table presents useful information
@@ -77,67 +74,91 @@ its implementation and the article from which it originated.
 | mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)| [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) |
-## Intended uses & limitations
-You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
-be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
-fine-tuned versions of a task that interests you.
-Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
-to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
-generation you should look at model like XXX.
 ### How to use
-You can use this model directly with a pipeline for masked language modeling:
 ```python
->>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
->>> unmasker("Hello I'm a [MASK] model.")
-[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
- 'score': 0.1073106899857521,
- 'token': 4827,
- 'token_str': 'fashion'},
- {'sequence': "[CLS] hello i'm a role model. [SEP]",
- 'score': 0.08774490654468536,
- 'token': 2535,
- 'token_str': 'role'},
- {'sequence': "[CLS] hello i'm a new model. [SEP]",
- 'score': 0.05338378623127937,
- 'token': 2047,
- 'token_str': 'new'},
- {'sequence': "[CLS] hello i'm a super model. [SEP]",
- 'score': 0.04667217284440994,
- 'token': 3565,
- 'token_str': 'super'},
- {'sequence': "[CLS] hello i'm a fine model. [SEP]",
- 'score': 0.027095865458250046,
- 'token': 2986,
- 'token_str': 'fine'}]
-```
-Here is how to use this model to get the features of a given text in PyTorch:
 ```python
-from transformers import BertTokenizer, BertModel
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-model = BertModel.from_pretrained("bert-base-uncased")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='pt')
-output = model(**encoded_input)
 ```
-and in TensorFlow:
 ```python
-from transformers import BertTokenizer, TFBertModel
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-model = TFBertModel.from_pretrained("bert-base-uncased")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='tf')
-output = model(encoded_input)
 ```
 ## Training data
@@ -147,6 +168,8 @@ unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_
 headers).
 ## Training procedure
 ### Preprocessing

+## Limitations
 ### How to use
+The program depends on several libraries, which must be imported before anything else runs. The following lines
+of code are therefore required:
 ```python
+import threading
+from alive_progress import alive_bar
+from datasets import load_dataset
+from bs4 import BeautifulSoup
+import pandas as pd
+import numpy as np
+import shutil
+import regex
+import os
+import re
+import itertools as it
+import more_itertools as mit
+```
+If any of the libraries above is not installed on the user's machine, it must be installed
+from the command line with the command:
 ```python
+pip install [LIBRARY]
 ```
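Before running `pip install`, it can help to see which of the libraries imported above are actually missing. The following is a minimal sketch, not part of the project: the helper name `missing_packages` and the `REQUIRED` mapping are assumptions made for illustration. The mapping is needed because some import names differ from the name passed to `pip install` (for example, `bs4` is installed as `beautifulsoup4`).

```python
import importlib.util

# Import name -> name to pass to `pip install` (they differ for some packages).
# This mapping is an illustrative assumption covering the third-party imports above.
REQUIRED = {
    "alive_progress": "alive-progress",
    "datasets": "datasets",
    "bs4": "beautifulsoup4",
    "pandas": "pandas",
    "numpy": "numpy",
    "regex": "regex",
    "more_itertools": "more-itertools",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of modules that cannot be found in this environment."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a ready-to-copy install command for the missing libraries only.
        print("pip install " + " ".join(missing))
    else:
        print("All required libraries are installed.")
```

The standard-library modules in the import list (`threading`, `shutil`, `os`, `re`, `itertools`) ship with Python and need no installation, so they are left out of the mapping.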
+To run the code on the given corpora, insert the following lines of code. To skip one or more
+corpora, summarizers, or evaluation metrics, comment out the corresponding entries.
 ```python
+if __name__ == "__main__":
+    # Data, Method and Evaluator are the project's own classes; they must be
+    # defined or imported before this block runs.
+
+    # Corpora to summarize; comment out any entry to skip it.
+    corpora = [
+        "mcti_data",
+        "cnn_dailymail",
+        "big_patent",
+        "cnn_corpus_abstractive",
+        "cnn_corpus_extractive",
+        "xsum",
+        "arxiv_pubmed",
+    ]
+    # Summarization methods to apply to each corpus.
+    summarizers = [
+        "SumyRandom",
+        "SumyLuhn",
+        "SumyLsa",
+        "SumyLexRank",
+        "SumyTextRank",
+        "SumySumBasic",
+        "SumyKL",
+        "SumyReduction",
+        "Transformers-facebook/bart-large-cnn",
+        "Transformers-google/pegasus-xsum",
+        "Transformers-csebuetnlp/mT5_multilingual_XLSum",
+    ]
+    # Evaluation metrics to compute for each summarizer.
+    metrics = [
+        "rouge",
+        "gensim",
+        "nltk",
+        "sklearn",
+    ]
+    # Run the summarization methods and the evaluation locally.
+    reader = Data()
+    reader.show_available_databases()
+    for corpus in corpora:
+        data = reader.read_data(corpus, 50)
+        method = Method(data, corpus)
+        method.show_methods()
+        for summarizer in summarizers:
+            df = method.run(summarizer)
+            method.examples_to_csv()
+            evaluator = Evaluator(df, summarizer, corpus)
+            for metric in metrics:
+                evaluator.run(metric)
+            evaluator.metrics_to_csv()
+        evaluator.join_all_results()
 ```
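As an alternative to commenting out entries, the selection of corpora, summarizers and metrics could be driven by a skip list. The following is a minimal sketch under that assumption: `plan_runs` is a hypothetical helper that does not exist in the project, and it only enumerates the combinations that the nested loops above would visit, without calling `Data`, `Method` or `Evaluator`.

```python
from itertools import product


def plan_runs(corpora, summarizers, metrics, skip=()):
    """List every (corpus, summarizer, metric) combination, minus skipped names."""
    skipped = set(skip)

    def keep(names):
        # Preserve the original order while dropping skipped entries.
        return [name for name in names if name not in skipped]

    return list(product(keep(corpora), keep(summarizers), keep(metrics)))


# Example: skip one corpus and one metric without editing the lists themselves.
runs = plan_runs(
    corpora=["mcti_data", "xsum"],
    summarizers=["SumyLuhn", "Transformers-facebook/bart-large-cnn"],
    metrics=["rouge", "nltk"],
    skip={"xsum", "nltk"},
)
for corpus, summarizer, metric in runs:
    print(corpus, summarizer, metric)
```

Printing the planned combinations first gives a quick check of how many runs a configuration implies before any model is loaded.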