t5-small-github-repo-tag-generation

Machine learning model that generates tags for GitHub repositories from their documentation (README.md). It is a fine-tuned version of t5-small, trained on a collection of repositories from Kaggle/vatsalparsaniya/github-repositories-analysis. Although tag prediction is usually formulated as a multi-label classification problem, this model treats it as a text2text generation task (inspiration and reference: fabiochiu/t5-base-tag-generation).
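
To make the text2text framing concrete, here is a minimal sketch of how a tag set can be serialised into a single target string and parsed back. The record layout, the field names and the ", " separator are assumptions for illustration, not the exact format used during fine-tuning.

```python
# Hypothetical training record; field names are assumptions for illustration.
record = {
    "readme": "a cnn based plant disease detector written in python",
    "tags": ["python", "machine learning", "ml", "cnn"],
}


def to_text2text_pair(rec):
    """Turn a (readme, tag list) record into a (source text, target text) pair."""
    source = rec["readme"]
    target = ", ".join(rec["tags"])  # the whole label set becomes one string
    return source, target


def parse_generated(target_text):
    """Recover a tag list from a comma-separated generated string."""
    return [t.strip() for t in target_text.split(",") if t.strip()]


source, target = to_text2text_pair(record)
print(target)                   # python, machine learning, ml, cnn
print(parse_generated(target))  # ['python', 'machine learning', 'ml', 'cnn']
```

In a multi-label classifier the tag vocabulary is fixed in advance; with a generated string, the model can in principle emit any tag it can spell, at the cost of needing postprocessing.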

The Inference API expects cleaned README text; the code for cleaning a README is given below.

Fine-tuning notebook reference: Hugging Face summarization notebook.

How to use the model

Input: GitHub repository URL
Output: Tags

Remarks: Ensure the repository has a README.md.

Installations

pip install transformers torch nltk clean-text beautifulsoup4 markdown requests

Code

Imports

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import re
import nltk
nltk.download('punkt')
from cleantext import clean
from bs4 import BeautifulSoup
from markdown import Markdown
import requests
from io import StringIO
import string

Preprocessing

# Script to convert Markdown to plain text
# Reference : Stackoverflow == https://stackoverflow.com/questions/761824/python-how-to-convert-markdown-formatted-text-to-text

def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# patching Markdown
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False


def unmark(text):
    return __md.convert(text)

def readme_extractor(github_repo_url):
    try:
        
        # Get repo HTML using BeautifulSoup
        html_content = requests.get(github_repo_url).text
        soup = BeautifulSoup(html_content, "html.parser")

        # Get README File URL from Repository
        # The href is a site-relative path that already starts with "/"
        readme_url = "https://github.com" + soup.find("a", {"title": "README.md"}).get("href")

        # Generate raw readme file URL
        # https://github.com/rasbt/python-machine-learning-book/blob/master/README.md   -->   https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/README.md
        readme_raw_url = readme_url.replace("/blob/","/")
        readme_raw_url = readme_raw_url.replace("github.com","raw.githubusercontent.com")
        # Fetch the raw README and strip any embedded HTML before converting Markdown to text
        readme_html_content = requests.get(readme_raw_url).text
        readme_soup = BeautifulSoup(readme_html_content, "html.parser")
        readme_text = readme_soup.get_text()
        documentation_text = unmark(readme_text)
        return documentation_text
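    except Exception:
        # Any failure (network error, missing README link) is reported and mapped to a sentinel string
        print("FAILED : ", github_repo_url)
        return "README_NOT_MARKDOWN"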

def clean_readme(readme):
    text = clean(readme, no_emoji=True)
    # Raw string avoids the invalid "\S" escape warning in recent Python versions
    lst = re.findall(r'http://\S+|https://\S+', text)
    for i in lst:
        text = text.replace(i, '')
    text = "".join([i for i in text if i not in string.punctuation])
    text = text.lower()
    text = text.replace("\n"," ")
    return text
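
The two string-level steps in the preprocessing code above (rewriting a blob URL into a raw URL, and normalising the text) can be tried without network access or third-party packages. The sketch below is a standard-library approximation: `to_raw_readme_url` and `normalise_text` are illustrative names, and `normalise_text` skips the `clean(...)` call from clean-text, so its output will not match `clean_readme` exactly.

```python
import re
import string


def to_raw_readme_url(blob_url):
    """Rewrite a github.com blob URL into its raw.githubusercontent.com form."""
    raw = blob_url.replace("/blob/", "/")
    return raw.replace("github.com", "raw.githubusercontent.com")


def normalise_text(text):
    """Approximate clean_readme with the standard library only (assumption)."""
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.lower().replace("\n", " ")


print(to_raw_readme_url(
    "https://github.com/rasbt/python-machine-learning-book/blob/master/README.md"
))
# https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/README.md

# Made-up README snippet for illustration.
sample = "# Plant Disease Detector\nBuilt with PyTorch, see https://example.com/docs for details!"
print(normalise_text(sample).split())
```

Removing URLs before punctuation matters: once ":" and "/" are stripped, a URL can no longer be recognised by the pattern.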

Postprocess Tags [Removing duplicates]

def post_process_tags(tag_string):
    final_tags = []
    for tag in tag_string.split(","):
        if tag.strip() in final_tags or len(tag.strip()) <= 1:
            continue
        final_tags.append(tag.strip())
    return final_tags
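
As a quick check of the postprocessing step, the sketch below runs the same logic on a made-up generated string containing a duplicate, a one-character fragment and a trailing comma. The input string is invented for illustration and is not real model output.

```python
def post_process_tags(tag_string):
    """Split on commas, trim whitespace, drop duplicates and tags of length <= 1."""
    final_tags = []
    for tag in tag_string.split(","):
        tag = tag.strip()
        if tag in final_tags or len(tag) <= 1:
            continue
        final_tags.append(tag)
    return final_tags


# Invented example input, not actual model output.
raw = "python, machine learning, python, c, ml, cnn,"
print(post_process_tags(raw))  # ['python', 'machine learning', 'ml', 'cnn']
```

Note that the length filter also discards legitimate one-character tags such as "c", and that the duplicate check is case-sensitive.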

Main Function

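# Load the fine-tuned checkpoint and its tokenizer once, outside the function
model_name = "nandakishormpai/t5-small-github-repo-tag-generation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def github_tags_generate(github_repo_url):
    readme = readme_extractor(github_repo_url)
    readme = clean_readme(readme)
    inputs = tokenizer([readme], max_length=1536, truncation=True, return_tensors="pt")
    # Beam search with sampling; output can vary between runs because do_sample=True
    output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                            max_length=128)
    decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    tags = post_process_tags(decoded_output)

    return tags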



github_tags_generate("https://github.com/Enter_Repo_URL")

# github_tags_generate("https://github.com/nandakishormpai/Plant_Disease_Detector")
# ['python', 'machine learning', 'ml', 'cnn']

Dataset Preparation

Of the 1000 repositories in the dataset, only 870 had tags and a README longer than 50 characters. These 870 were selected, and their README.md files were scraped using BeautifulSoup.
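
The selection rule described above (tags present, README longer than 50 characters) can be sketched as a simple predicate. The record layout and field names here are hypothetical; the original dataset columns may be named differently.

```python
MIN_README_CHARS = 50  # threshold stated in the text


def is_usable(repo):
    """Keep a repository only if it has tags and a README above the threshold."""
    return bool(repo.get("tags")) and len(repo.get("readme", "")) > MIN_README_CHARS


# Hypothetical records for illustration.
repos = [
    {"name": "a", "tags": ["python", "ml"], "readme": "x" * 120},
    {"name": "b", "tags": [], "readme": "x" * 300},     # no tags
    {"name": "c", "tags": ["cnn"], "readme": "short"},  # README too short
]

kept = [r["name"] for r in repos if is_usable(r)]
print(kept)  # ['a']
```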

Intended uses & limitations

The output may contain duplicate tags, which must be handled when postprocessing the results; postprocessing code is given above.

Results

The model achieves the following results on the evaluation set:

Loss: 1.8196
Rouge1: 25.0142
Rouge2: 8.1802
Rougel: 22.77
Rougelsum: 22.8017
Gen Len: 19.0
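
For intuition about what the Rouge1 figure measures on tag strings, here is a simplified unigram-overlap F1. It is a sketch only: it tokenises on word characters and applies no stemming, so it will not reproduce the numbers above, which presumably come from a standard ROUGE package and appear to be on a 0-100 scale. The reference and prediction strings are invented.

```python
import re
from collections import Counter


def rouge1_f1(reference, prediction):
    """Simplified ROUGE-1: F1 over overlapping unigrams (no stemming)."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    pred_counts = Counter(re.findall(r"\w+", prediction.lower()))
    overlap = sum((ref_counts & pred_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example strings.
reference = "python, machine learning, ml, cnn"
prediction = "python, deep learning, cnn"
print(round(100 * rouge1_f1(reference, prediction), 2))  # 66.67
```

Because the metric works on words rather than whole tags, "deep learning" earns partial credit against "machine learning" even though the tags differ.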

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 40
mixed_precision_training: Native AMP
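
For anyone reproducing the run, the hyperparameters listed above could be expressed as keyword arguments for `Seq2SeqTrainingArguments` in transformers. The mapping to parameter names below is an assumption based on how such lists are usually generated, and the block deliberately builds only a plain dict so it runs without transformers installed.

```python
# Keyword arguments named after Seq2SeqTrainingArguments parameters.
# The mapping from the listed hyperparameters to these names is an assumption.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 40,
    "fp16": True,  # "Native AMP" mixed precision; requires a CUDA device
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

With transformers installed, these would typically be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`, where the output directory is chosen by the user.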

Framework versions

Transformers 4.26.1
Pytorch 1.13.1+cu116
Datasets 2.10.0
Tokenizers 0.13.2

nandakishormpai/t5-small-github-repo-tag-generation