license: apache-2.0
tags:
  - generated_from_trainer
  - documentation_tag
  - tag_generation
  - github
  - github_tag
  - tagging
  - github_repo
  - summarization
metrics:
  - rouge
model-index:
  - name: t5-small-github-repo-tag-generation
    results: []
widget:
  - text: >-
      susya  plant disease detector ml powered app to assist farmers in crop
      disease detection and alerts product walkthrough  download product apk
      here machine learning python notebook solutions system to detect the
      problem when it arises and warn the farmers disease detection using
      machine learning model enabled through android app which uses flask api
      solution to overcome the problem once it arises remedy is suggested for
      the disease detected by the app using ml model solution that will ensure
      that the problem will never occur in the future again pdf report is
      generated on the disease predicted along with user information pdf can be
      used as a document to be submitted in nearby krishibhavan thereby seeking
      help easily method that will reduce the impact of the dilemma to a
      significant level disease detected news can be sent to other users as a
      notification which contatins userplant and disease this will help other
      farmers take up precautions thereby reducing the impact of the dilemma to
      a significant level considering a region machine learning model multiclass
      image classifier built on pytorch framework using cnn architecture
      currently project detects 17 states of disease in 4 plants  aiming kerala
      state  namely cherry pepper potato and tomato framework  pytorch
      architecture  convolutional neural networks validation accuracy  777 how
      to train upload the python notebook to google colab and run each cell for
      training the model i have included a demo dataset to configure quickly you
      can use this kaggle dataset which is the original one with huge amount of
      pictures how it works the input image dataset is converted to tensor and
      is passed through a cnn model returning an output value corresponding to
      the plant disease input image tensor is passed through four convolutional
      layers and then flattened and inputted to fully connected layers api api
      is built using flask framework and hosted in render the api provides two
      functionalities they are plant disease detection accepts a post request
      with an image in the form of base64 string and returns plant disease and
      remedy notification accepts a post request with plant user and disease 
      which is then pushed as a notification to other users to warn them
      regarding a probable outbreak of disease how to use api has been built on
      this classifier url   user has to send a post request to the given api
      with base64 string of the image to be input python import requests url  
      imgdata  base64 string of image r  requestsposturljson  imageimgdata
      printrtextstrip outputpython diseaseseptoria leaf
      spotplanttomatoremedyremove infected leaves immediatelyfungonil and
      daconil  app download product apk here to run app shell  cd app  flutter
      run to build app shell  cd app  flutter build apk features authentication
      using google oauth user profile page uses camera or device media to get an
      image of the crop preview the image and sends it to api for disease
      detection result page showing detected disease and remedy generates a pdf
      report to saveshare predicted disease details option to send the generated
      result as a notification warning to other users tech stack used python
      pytorch flask flutter firebase contributors nanda kishor m paiml model api
      ajay krishna k v flutter dev api hari krishnan uml model data collection
      antony s johnflutter dev
    example_title: 'Github Cleaned Readme #1'
language:
  - en
pipeline_tag: summarization

t5-small-github-repo-tag-generation

Machine learning model that generates tags for GitHub repositories from their documentation (README.md). It is a version of t5-small fine-tuned on a collection of repositories from Kaggle/vatsalparsaniya/github-repositories-analysis. While tag generation is usually formulated as a multi-label classification problem, this model treats it as a text2text generation task (inspiration and reference: fabiochiu/t5-base-tag-generation).

The Inference API here expects cleaned README text; the code for cleaning the README is given below.

Fine-tuning notebook reference: Hugging Face summarization notebook.

How to use the model

Input: GitHub repo URL
Output: Tags

Remarks: Ensure the repo has a README.md file.

Installations

pip install transformers nltk clean-text beautifulsoup4 markdown requests

Code

Imports

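from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import re
import nltk
from cleantext import clean
from bs4 import BeautifulSoup
from markdown import Markdown
import requests
from io import StringIO
import string

nltk.download('punkt')

# Load the tokenizer and model used by github_tags_generate.
# The Hub ID is assumed from this card's author and model name; adjust it if yours differs.
model_id = "nandakishormpai/t5-small-github-repo-tag-generation"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)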

Preprocessing

# Script to convert Markdown to plain text
# Reference : Stackoverflow == https://stackoverflow.com/questions/761824/python-how-to-convert-markdown-formatted-text-to-text

def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# patching Markdown
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False


def unmark(text):
    return __md.convert(text)

def readme_extractor(github_repo_url):
    try:
        
        # Get repo HTML using BeautifulSoup
        html_content = requests.get(github_repo_url).text
        soup = BeautifulSoup(html_content, "html.parser")

        # Get README File URL from Repository
        readme_url = "https://github.com/" + soup.find("a",{"title":"README.md"}).get("href")

        # Generate raw readme file URL
        # https://github.com/rasbt/python-machine-learning-book/blob/master/README.md   -->   https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/README.md
        readme_raw_url = readme_url.replace("/blob/","/")
        readme_raw_url = readme_raw_url.replace("github.com","raw.githubusercontent.com")
        readme_html_content = requests.get(readme_raw_url ).text
        readme_soup = BeautifulSoup(readme_html_content, "html.parser")
        readme_text = readme_soup.get_text() 
        documentation_text = unmark(readme_text)
        return documentation_text
    except Exception:
        print("FAILED : ",github_repo_url )
        return "README_NOT_MARKDOWN"

def clean_readme(readme):
    text = clean(readme, no_emoji=True)
    lst = re.findall(r'http://\S+|https://\S+', text)
    for i in lst:
        text = text.replace(i, '')
    text = "".join([i for i in text if i not in string.punctuation])
    text = text.lower()
    text = text.replace("\n"," ")
    return text
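The two pure-string steps of the preprocessing can be tried without any network access or third-party packages. The following is a minimal sketch, not the card's own code: `to_raw_url` and `strip_urls_and_punctuation` are hypothetical helper names, and the second function deliberately omits the `cleantext.clean` call, so its output will not match `clean_readme` exactly (for example, emojis are not removed).

```python
import re
import string


def to_raw_url(blob_url: str) -> str:
    """Rewrite a github.com blob URL into its raw.githubusercontent.com form."""
    return (
        blob_url.replace("/blob/", "/", 1)
        .replace("github.com", "raw.githubusercontent.com", 1)
    )


def strip_urls_and_punctuation(text: str) -> str:
    """Drop URLs and ASCII punctuation, lowercase, and flatten newlines.

    Hypothetical stdlib-only stand-in for clean_readme (no cleantext step).
    """
    text = re.sub(r"https?://\S+", "", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.lower().replace("\n", " ")


print(to_raw_url(
    "https://github.com/rasbt/python-machine-learning-book/blob/master/README.md"
))
print(strip_urls_and_punctuation("# Demo\nSee https://example.com for docs!"))
```

Whitespace left behind by removed URLs is not collapsed, which mirrors the behavior of the original cleaning function.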

Postprocess Tags [Removing duplicates]

def post_process_tags(tag_string):
    final_tags = []
    for tag in tag_string.split(","):
        if tag.strip() in final_tags or len(tag.strip()) <= 1:
            continue
        final_tags.append(tag.strip())
    return final_tags
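As a quick check of the postprocessing behavior, the sketch below runs the same logic on a made-up model output. The function is repeated so the snippet runs on its own, and the `raw_output` string is hypothetical, not an actual generation from the model.

```python
def post_process_tags(tag_string):
    """Split on commas, trim whitespace, drop duplicates and 1-character tags."""
    final_tags = []
    for tag in tag_string.split(","):
        tag = tag.strip()
        if len(tag) <= 1 or tag in final_tags:
            continue
        final_tags.append(tag)
    return final_tags


# Hypothetical decoded output containing repeats and a 1-character tag.
raw_output = "python, machine learning, python, ml, c, cnn, ml"
print(post_process_tags(raw_output))
# ['python', 'machine learning', 'ml', 'cnn']
```

The first occurrence of each tag is kept, so the order in which the model generated the tags is preserved.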

Main Function

def github_tags_generate(github_repo_url):
    readme = readme_extractor(github_repo_url)
    readme = clean_readme(readme)
    inputs = tokenizer([readme], max_length=1536, truncation=True, return_tensors="pt")
    output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                            max_length=128)
    decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    tags = post_process_tags(decoded_output)

    return tags



github_tags_generate("https://github.com/Enter_Repo_URL")

# github_tags_generate("https://github.com/nandakishormpai/Plant_Disease_Detector")
# ['python', 'machine learning', 'ml', 'cnn']

Dataset Preparation

Of the 1000 repositories in the dataset, only 870 had tags and a README longer than 50 characters. These 870 were selected, and their README.md files were scraped using BeautifulSoup.
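The selection rule described above can be sketched as a simple predicate. This is an illustration only: the record layout (`repo`, `tags`, `readme` keys) and the sample records are hypothetical and do not reflect the actual column names of the Kaggle dataset.

```python
MIN_README_CHARS = 50  # threshold stated in this card


def keep_example(record: dict) -> bool:
    """Keep a repository only if it has tags and a README longer than the threshold."""
    tags = record.get("tags") or []
    readme = record.get("readme") or ""
    return len(tags) > 0 and len(readme) > MIN_README_CHARS


# Hypothetical records for illustration.
records = [
    {"repo": "a/one", "tags": ["python"], "readme": "x" * 120},
    {"repo": "b/two", "tags": [], "readme": "x" * 120},    # no tags
    {"repo": "c/three", "tags": ["ml"], "readme": "short"},  # README too short
]

kept = [r["repo"] for r in records if keep_example(r)]
print(kept)  # ['a/one']
```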

Intended uses & limitations

The results might contain duplicate tags, which must be handled when postprocessing the output. The postprocessing code is given above.

Results

It achieves the following results on the evaluation set:

  • Loss: 1.8196
  • Rouge1: 25.0142
  • Rouge2: 8.1802
  • Rougel: 22.77
  • Rougelsum: 22.8017
  • Gen Len: 19.0

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 40
  • mixed_precision_training: Native AMP
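To make the `lr_scheduler_type: linear` setting concrete, here is a minimal sketch of a linear decay from the listed learning rate down to zero. It assumes zero warmup steps, which this card does not state, and `total_steps` is a hypothetical value since the number of optimizer updates is not reported.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5) -> float:
    """Learning rate after `step` updates under linear decay to zero (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


total_steps = 4000  # hypothetical; depends on dataset size, batch size, and epochs
for step in (0, 1000, 2000, 4000):
    print(step, linear_lr(step, total_steps))
```

The rate starts at the configured 2e-05, reaches half that value midway through training, and hits zero at the final step.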

Framework versions

  • Transformers 4.26.1
  • Pytorch 1.13.1+cu116
  • Datasets 2.10.0
  • Tokenizers 0.13.2