# LED Paraphrase Model

This repository contains a LED-based model fine-tuned for paraphrasing on the Quora dataset.

## Model Overview

LED (Longformer Encoder-Decoder) is a Transformer variant designed for tasks that require long input contexts. This model is fine-tuned to generate paraphrases of input sentences, which makes it useful for tasks such as text simplification and query rewriting.

## Model Details

- **Architecture:** `LEDForConditionalGeneration`
- **Dataset:** Quora dataset (subset of 20,000 samples)
- **Training Configuration:**
  - Epochs: 1
  - Batch size: 2
  - Learning rate: 5e-5
  - Max input length: 1024 tokens
  - Max output length: 256 tokens

## Repository Contents

- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `tokenizer_config.json`: Tokenizer configuration
- `generation_config.json`: Default text generation settings
- `merges.txt`: BPE merge rules used by the tokenizer
- `special_tokens_map.json`: Mapping of the tokenizer's special tokens
- `vocab.json`: Tokenizer vocabulary
- `fine-tune-led.ipynb`: Jupyter notebook used to train the model

## Setup

### Install Dependencies

To use the model, install the following dependencies:

- `transformers`
- `datasets`
- `torch`

### Usage

Load the model and tokenizer with the `transformers` library, then use them to generate paraphrases of input text.

## Training Process

### Dataset

The model is trained on the Quora dataset, which consists of pairs of paraphrased questions. A subset of 20,000 samples was used, with 80% allocated for training and 20% for evaluation.

### Preprocessing

Each question pair is tokenized, and the inputs are prepared with attention masks and labels. Input sequences are truncated to 1024 tokens and output sequences to 256 tokens.

### Training

The model is fine-tuned using the `Seq2SeqTrainer` from the `transformers` library with specific training arguments.
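
The effect of this configuration on the length of a training run can be worked out with simple arithmetic. The sketch below uses the figures stated in this README (20,000 samples, an 80/20 split, batch size 2, 1 epoch). The gradient accumulation value of 4 and the helper name `training_schedule` are assumptions for illustration, since the accumulation setting is not stated here.

```python
import math

# Figures stated in this README.
TOTAL_SAMPLES = 20_000
TRAIN_PERCENT = 80
BATCH_SIZE = 2
EPOCHS = 1

# Assumption: the README does not state this value; 4 is illustrative only.
GRAD_ACCUM_STEPS = 4


def training_schedule(total, train_percent, batch_size, accum_steps, epochs):
    """Derive split sizes and step counts from the training configuration."""
    train_samples = total * train_percent // 100
    eval_samples = total - train_samples
    # One forward/backward pass per batch.
    batches_per_epoch = math.ceil(train_samples / batch_size)
    # Weights are updated once every `accum_steps` batches.
    optimizer_steps = math.ceil(batches_per_epoch / accum_steps) * epochs
    return {
        "train_samples": train_samples,
        "eval_samples": eval_samples,
        "batches_per_epoch": batches_per_epoch,
        "effective_batch_size": batch_size * accum_steps,
        "optimizer_steps": optimizer_steps,
    }


schedule = training_schedule(
    TOTAL_SAMPLES, TRAIN_PERCENT, BATCH_SIZE, GRAD_ACCUM_STEPS, EPOCHS
)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

With a different accumulation value only `effective_batch_size` and `optimizer_steps` change; the split sizes and the number of batches per epoch follow directly from the stated configuration.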
Gradient accumulation is used to raise the effective batch size beyond the per-step batch of 2, and an evaluation strategy is configured to monitor the model on the evaluation split.

### Evaluation

The model's performance is evaluated using ROUGE and BLEU metrics:

- **ROUGE:** Measures the n-gram overlap between the generated and reference texts, with an emphasis on recall.
- **BLEU:** Measures the precision of n-grams in the generated text against the reference text.

## Evaluation Results

The model is scored with ROUGE and BLEU on the evaluation split. Higher scores indicate closer n-gram agreement between the generated paraphrases and the reference paraphrases.

## Example Usage

To generate a paraphrase using the trained model:

1. Load the model and tokenizer.
2. Prepare the input text.
3. Generate the paraphrase and decode it to readable text.

## References

- [Hugging Face Transformers](https://github.com/huggingface/transformers): The library used for model implementation and training.
- [Quora Dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs): The dataset used for training the paraphrase model.
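
The n-gram overlap behind the ROUGE and BLEU descriptions in the Evaluation section can be illustrated with a few lines of plain Python. This is a simplified sketch, not the evaluation code from the notebook: full BLEU combines several n-gram orders and applies a brevity penalty, and ROUGE has several variants. The function names and the sample sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of n-gram overlap between two sentences.

    Precision is the BLEU-style view (share of generated n-grams found in
    the reference); recall is the ROUGE-style view (share of reference
    n-grams found in the generated text).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram to its count in the other text.
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical generated paraphrase and reference question.
generated = "how can i learn python"
reference = "how do i learn python fast"

p1, r1 = ngram_overlap(generated, reference, n=1)
p2, r2 = ngram_overlap(generated, reference, n=2)
print(f"unigram precision={p1:.3f} recall={r1:.3f}")
print(f"bigram  precision={p2:.3f} recall={r2:.3f}")
```

Because the generated sentence is shorter than the reference, precision comes out higher than recall, which is why the two metrics are usually reported together.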