nandakishormpai committed on
Commit
6e02123
·
1 Parent(s): 7400230

Added necessary codes for pre and post processing data

  1. README.md +149 -46
README.md CHANGED
@@ -2,11 +2,24 @@
2
  license: apache-2.0
3
  tags:
4
  - generated_from_trainer
5
  metrics:
6
  - rouge
7
  model-index:
8
  - name: t5-small-github-repo-tag-generation
9
  results: []
10
  ---
11
 
12
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -14,7 +27,142 @@ should probably proofread and complete it, then remove this comment. -->
14
 
15
  # t5-small-github-repo-tag-generation
16
 
17
- This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on the None dataset.
18
  It achieves the following results on the evaluation set:
19
  - Loss: 1.8196
20
  - Rouge1: 25.0142
@@ -23,19 +171,6 @@ It achieves the following results on the evaluation set:
23
  - Rougelsum: 22.8017
24
  - Gen Len: 19.0
25
 
26
- ## Model description
27
-
28
- More information needed
29
-
30
- ## Intended uses & limitations
31
-
32
- More information needed
33
-
34
- ## Training and evaluation data
35
-
36
- More information needed
37
-
38
- ## Training procedure
39
 
40
  ### Training hyperparameters
41
 
@@ -49,38 +184,6 @@ The following hyperparameters were used during training:
49
  - num_epochs: 40
50
  - mixed_precision_training: Native AMP
51
 
52
- ### Training results
53
-
54
- | Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
55
- |:-------------:|:-----:|:----:|:---------------:|:-------:|:------:|:-------:|:---------:|:-------:|
56
- | 2.8665 | 1.0 | 87 | 2.3813 | 14.8999 | 2.5462 | 13.9887 | 13.8228 | 19.0 |
57
- | 2.4139 | 2.0 | 174 | 2.1669 | 18.1676 | 3.2466 | 16.6849 | 16.6869 | 19.0 |
58
- | 2.2398 | 3.0 | 261 | 2.0699 | 19.2735 | 4.639 | 17.8309 | 17.8346 | 19.0 |
59
- | 2.1557 | 4.0 | 348 | 2.0234 | 20.529 | 4.8054 | 19.0494 | 19.0052 | 19.0 |
60
- | 2.097 | 5.0 | 435 | 1.9936 | 21.4325 | 5.6856 | 19.6801 | 19.6991 | 19.0 |
61
- | 2.06 | 6.0 | 522 | 1.9644 | 21.2128 | 5.6864 | 19.6366 | 19.6068 | 19.0 |
62
- | 2.0136 | 7.0 | 609 | 1.9457 | 21.9194 | 5.9025 | 20.1281 | 20.068 | 19.0 |
63
- | 1.9781 | 8.0 | 696 | 1.9335 | 22.2101 | 6.366 | 20.6869 | 20.6609 | 19.0 |
64
- | 1.9459 | 9.0 | 783 | 1.9192 | 22.8154 | 6.612 | 21.1065 | 21.1091 | 19.0 |
65
- | 1.943 | 10.0 | 870 | 1.9074 | 23.7665 | 7.0722 | 22.033 | 22.0371 | 19.0 |
66
- | 1.9309 | 11.0 | 957 | 1.8946 | 24.0329 | 7.3522 | 22.2798 | 22.2827 | 19.0 |
67
- | 1.9028 | 12.0 | 1044 | 1.8856 | 24.6311 | 7.6312 | 22.6211 | 22.6265 | 19.0 |
68
- | 1.8837 | 13.0 | 1131 | 1.8808 | 23.8252 | 7.2919 | 22.2651 | 22.2359 | 19.0 |
69
- | 1.8606 | 14.0 | 1218 | 1.8751 | 23.875 | 7.6105 | 21.9304 | 21.9311 | 19.0 |
70
- | 1.8386 | 15.0 | 1305 | 1.8661 | 24.5944 | 7.394 | 22.6082 | 22.5901 | 19.0 |
71
- | 1.8313 | 16.0 | 1392 | 1.8598 | 24.6417 | 7.8094 | 22.6391 | 22.6353 | 19.0 |
72
- | 1.821 | 17.0 | 1479 | 1.8568 | 24.6872 | 7.55 | 22.7157 | 22.7871 | 19.0 |
73
- | 1.8092 | 18.0 | 1566 | 1.8508 | 24.6133 | 7.6888 | 22.6948 | 22.7972 | 19.0 |
74
- | 1.8024 | 19.0 | 1653 | 1.8483 | 25.1081 | 7.61 | 22.8889 | 22.8858 | 19.0 |
75
- | 1.7963 | 20.0 | 1740 | 1.8417 | 24.8799 | 7.6186 | 22.9405 | 22.9435 | 19.0 |
76
- | 1.7775 | 21.0 | 1827 | 1.8383 | 25.3856 | 8.0504 | 23.1873 | 23.1687 | 19.0 |
77
- | 1.7782 | 22.0 | 1914 | 1.8366 | 25.3015 | 8.2145 | 23.3779 | 23.3797 | 19.0 |
78
- | 1.7619 | 23.0 | 2001 | 1.8329 | 24.8709 | 7.397 | 22.5124 | 22.5905 | 19.0 |
79
- | 1.7625 | 24.0 | 2088 | 1.8304 | 25.0142 | 8.1525 | 22.9442 | 23.0429 | 19.0 |
80
- | 1.7461 | 25.0 | 2175 | 1.8260 | 25.2686 | 8.3042 | 23.1614 | 23.2863 | 19.0 |
81
- | 1.7433 | 26.0 | 2262 | 1.8228 | 25.4987 | 8.4777 | 23.2049 | 23.2753 | 19.0 |
82
- | 1.7439 | 27.0 | 2349 | 1.8199 | 25.0074 | 8.2618 | 22.799 | 22.8579 | 19.0 |
83
- | 1.7182 | 28.0 | 2436 | 1.8196 | 25.0142 | 8.1802 | 22.77 | 22.8017 | 19.0 |
84
 
85
 
86
  ### Framework versions
 
2
  license: apache-2.0
3
  tags:
4
  - generated_from_trainer
5
+ - documentation_tag
6
+ - tag_generation
7
+ - github
8
+ - github_tag
9
+ - tagging
10
+ - github_repo
11
+ - summarization
12
  metrics:
13
  - rouge
14
  model-index:
15
  - name: t5-small-github-repo-tag-generation
16
  results: []
17
+ widget:
18
+ - text: "susya plant disease detector ml powered app to assist farmers in crop disease detection and alerts product walkthrough download product apk here machine learning python notebook solutions system to detect the problem when it arises and warn the farmers disease detection using machine learning model enabled through android app which uses flask api solution to overcome the problem once it arises remedy is suggested for the disease detected by the app using ml model solution that will ensure that the problem will never occur in the future again pdf report is generated on the disease predicted along with user information pdf can be used as a document to be submitted in nearby krishibhavan thereby seeking help easily method that will reduce the impact of the dilemma to a significant level disease detected news can be sent to other users as a notification which contatins userplant and disease this will help other farmers take up precautions thereby reducing the impact of the dilemma to a significant level considering a region machine learning model multiclass image classifier built on pytorch framework using cnn architecture currently project detects 17 states of disease in 4 plants aiming kerala state namely cherry pepper potato and tomato framework pytorch architecture convolutional neural networks validation accuracy 777 how to train upload the python notebook to google colab and run each cell for training the model i have included a demo dataset to configure quickly you can use this kaggle dataset which is the original one with huge amount of pictures how it works the input image dataset is converted to tensor and is passed through a cnn model returning an output value corresponding to the plant disease input image tensor is passed through four convolutional layers and then flattened and inputted to fully connected layers api api is built using flask framework and hosted in render the api provides two functionalities they are plant disease detection 
accepts a post request with an image in the form of base64 string and returns plant disease and remedy notification accepts a post request with plant user and disease which is then pushed as a notification to other users to warn them regarding a probable outbreak of disease how to use api has been built on this classifier url user has to send a post request to the given api with base64 string of the image to be input python import requests url imgdata base64 string of image r requestsposturljson imageimgdata printrtextstrip outputpython diseaseseptoria leaf spotplanttomatoremedyremove infected leaves immediatelyfungonil and daconil app download product apk here to run app shell cd app flutter run to build app shell cd app flutter build apk features authentication using google oauth user profile page uses camera or device media to get an image of the crop preview the image and sends it to api for disease detection result page showing detected disease and remedy generates a pdf report to saveshare predicted disease details option to send the generated result as a notification warning to other users tech stack used python pytorch flask flutter firebase contributors nanda kishor m paiml model api ajay krishna k v flutter dev api hari krishnan uml model data collection antony s johnflutter dev"
19
+ example_title: 'Github Cleaned Readme #1'
20
+ language:
21
+ - en
22
+ pipeline_tag: summarization
23
  ---
24
 
25
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
27
 
28
  # t5-small-github-repo-tag-generation
29
 
30
+ A machine learning model that generates tags for GitHub repositories from their documentation (README.md). It is a version of [t5-small](https://huggingface.co/t5-small) fine-tuned on a collection of repositories from [Kaggle/vatsalparsaniya/github-repositories-analysis](https://www.kaggle.com/datasets/vatsalparsaniya/github-repositories-analysis). Tag prediction is usually formulated as a multi-label classification problem; this model instead treats _tag generation_ as a text2text generation task (inspiration and reference: [fabiochiu/t5-base-tag-generation](https://huggingface.co/fabiochiu/t5-base-tag-generation)).
31
+ <br><br>
32
+ The Inference API here expects cleaned README text; the code for cleaning the README is given below.
33
+ <br><br>
34
+ Fine-tuning notebook reference: [Hugging Face summarization notebook](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb).
35
+
36
+
37
+ # How to use the model
38
+
39
+ Input : Github Repo URL<br>
40
+ Output : Tags
41
+
42
+ Remark: the repository must contain a README.<b>md</b> file.
43
+ ### Installations
44
+
45
+ ```shell
46
+ pip install transformers torch nltk clean-text beautifulsoup4 markdown requests
47
+ ```
48
+ ### Code
49
+ Imports
50
+ ```python
51
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
52
+ import re
53
+ import nltk
54
+ nltk.download('punkt')
55
+ from cleantext import clean
56
+ from bs4 import BeautifulSoup
57
+ from markdown import Markdown
58
+ import requests
59
+ from io import StringIO
60
+ import string
61
+ ```
62
+
63
+ Preprocessing
64
+ ```python
65
+ # Script to convert Markdown to plain text
66
+ # Reference : Stackoverflow == https://stackoverflow.com/questions/761824/python-how-to-convert-markdown-formatted-text-to-text
67
+
68
+ def unmark_element(element, stream=None):
+     # Recursively collect the text of an ElementTree node, dropping the markup
+     if stream is None:
+         stream = StringIO()
+     if element.text:
+         stream.write(element.text)
+     for sub in element:
+         unmark_element(sub, stream)
+     if element.tail:
+         stream.write(element.tail)
+     return stream.getvalue()
78
+
79
+
80
+ # patching Markdown
81
+ Markdown.output_formats["plain"] = unmark_element
82
+ __md = Markdown(output_format="plain")
83
+ __md.stripTopLevelTags = False
84
+
85
+
86
+ def unmark(text):
+     return __md.convert(text)
88
+
89
+ def readme_extractor(github_repo_url):
+     try:
+         # Get repo HTML using BeautifulSoup
+         html_content = requests.get(github_repo_url).text
+         soup = BeautifulSoup(html_content, "html.parser")
+
+         # Get README file URL from the repository page
+         readme_url = "https://github.com/" + soup.find("a", {"title": "README.md"}).get("href")
+
+         # Generate raw README file URL
+         # https://github.com/rasbt/python-machine-learning-book/blob/master/README.md --> https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/README.md
+         readme_raw_url = readme_url.replace("/blob/", "/")
+         readme_raw_url = readme_raw_url.replace("github.com", "raw.githubusercontent.com")
+
+         readme_html_content = requests.get(readme_raw_url).text
+         readme_soup = BeautifulSoup(readme_html_content, "html.parser")
+         readme_text = readme_soup.get_text()
+         documentation_text = unmark(readme_text)
+         return documentation_text
+     except Exception:
+         # Covers network errors and repository pages without a README.md link
+         print("FAILED : ", github_repo_url)
+         return "README_NOT_MARKDOWN"
112
+
113
+ def clean_readme(readme):
+     text = clean(readme, no_emoji=True)
+     # Remove URLs (the raw string avoids an invalid-escape warning for \S)
+     lst = re.findall(r'http://\S+|https://\S+', text)
+     for i in lst:
+         text = text.replace(i, '')
+     # Drop punctuation, lowercase, and flatten newlines
+     text = "".join([i for i in text if i not in string.punctuation])
+     text = text.lower()
+     text = text.replace("\n", " ")
+     return text
122
+ ```
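Two of the preprocessing steps above, the raw-URL rewrite and the text cleaning, can be tried without network access. The following is a minimal sketch that uses only the standard library. The helper names `blob_to_raw` and `clean_readme_basic` are hypothetical, and the sketch deliberately omits the `clean-text` normalization, so its output may differ from `clean_readme` on emoji or unusual Unicode.

```python
import re
import string


def blob_to_raw(readme_url: str) -> str:
    """Rewrite a github.com blob URL into its raw.githubusercontent.com form."""
    raw_url = readme_url.replace("/blob/", "/")
    return raw_url.replace("github.com", "raw.githubusercontent.com")


def clean_readme_basic(readme: str) -> str:
    """Approximate the cleaning steps: strip URLs and punctuation, lowercase, flatten."""
    text = re.sub(r"https?://\S+", "", readme)  # drop URLs
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.lower().replace("\n", " ")


raw = blob_to_raw(
    "https://github.com/rasbt/python-machine-learning-book/blob/master/README.md"
)
cleaned = clean_readme_basic(
    "# Plant-Disease Detector\nDocs: https://example.com/docs (beta)."
)
print(raw)
print(cleaned)
```

Note that removing punctuation also joins hyphenated words (`Plant-Disease` becomes `plantdisease`), which matches the style of the widget example text in the card metadata.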
123
+ Postprocess Tags [Removing duplicates]
124
+ ```python
125
+ def post_process_tags(tag_string):
+     final_tags = []
+     for tag in tag_string.split(","):
+         tag = tag.strip()
+         # Skip duplicates and fragments of one character or less
+         if tag in final_tags or len(tag) <= 1:
+             continue
+         final_tags.append(tag)
+     return final_tags
132
+ ```
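As a quick check of the deduplication logic, the snippet below repeats `post_process_tags` so that it runs on its own and applies it to a made-up output string. The sample string is an assumption for illustration, not real model output.

```python
def post_process_tags(tag_string):
    """Split a comma-separated tag string, dropping duplicates and 1-char fragments."""
    final_tags = []
    for tag in tag_string.split(","):
        tag = tag.strip()
        if tag in final_tags or len(tag) <= 1:
            continue
        final_tags.append(tag)
    return final_tags


# Hypothetical decoded output with a repeated tag and a stray one-letter tag
sample_output = "python, machine learning, ml, python, cnn, a"
tags = post_process_tags(sample_output)
print(tags)
```

The first occurrence of each tag is kept, so the order in which the model generated the tags is preserved.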
133
+
134
+ Main Function
135
+ ```python
136
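+ # Load the fine-tuned checkpoint once. The Hub ID is assumed from the commit
+ # author and the model name; change it if the checkpoint lives elsewhere.
+ model_name = "nandakishormpai/t5-small-github-repo-tag-generation"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+
+ def github_tags_generate(github_repo_url):
+     readme = readme_extractor(github_repo_url)
+     readme = clean_readme(readme)
+     inputs = tokenizer([readme], max_length=1536, truncation=True, return_tensors="pt")
+     output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
+                             max_length=128)
+     decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
+     tags = post_process_tags(decoded_output)
+
+     return tags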
146
+
147
+
148
+
149
+ github_tags_generate("https://github.com/Enter_Repo_URL")
150
+
151
+ # github_tags_generate("https://github.com/nandakishormpai/Plant_Disease_Detector")
152
+ # ['python', 'machine learning', 'ml', 'cnn']
153
+ ```
154
+
155
+ ## Dataset Preparation
156
+ Of the 1000 repositories in the dataset, only 870 had tags and a README longer than 50 characters. Only these were kept, and the README.md of each was scraped with BeautifulSoup.
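The filtering rule described above can be sketched as a simple predicate. This is an illustration only: the record layout and the field names `tags` and `readme` are hypothetical and are not taken from the Kaggle dataset schema.

```python
MIN_README_CHARS = 50  # threshold stated in the dataset preparation note


def keep_repo(record: dict) -> bool:
    """Keep a repository only if it has tags and a README longer than the threshold."""
    tags = record.get("tags") or []
    readme = record.get("readme") or ""
    return bool(tags) and len(readme) > MIN_README_CHARS


# Hypothetical records standing in for dataset rows
records = [
    {"name": "repo-a", "tags": ["python"], "readme": "x" * 120},
    {"name": "repo-b", "tags": [], "readme": "x" * 120},      # no tags
    {"name": "repo-c", "tags": ["ml"], "readme": "too short"},  # README too short
]
kept = [r["name"] for r in records if keep_repo(r)]
print(kept)
```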
157
+
158
+
159
+ ## Intended uses & limitations
160
+
161
+ The generated output may contain duplicate tags, which must be removed in postprocessing; the postprocessing code is given above.
162
+
163
+
164
+ ## Results
165
+
166
  It achieves the following results on the evaluation set:
167
  - Loss: 1.8196
168
  - Rouge1: 25.0142
 
171
  - Rougelsum: 22.8017
172
  - Gen Len: 19.0
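For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated text and a reference. The sketch below is a simplified version under stated assumptions: whitespace tokenization, no stemming, and a hypothetical function name `rouge1_f1`. It is not the evaluation code used for this model, and it returns a value between 0 and 1, whereas the figures listed here appear to be on a 0 to 100 scale.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate string."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of three candidate tokens match two of four reference tokens
score = rouge1_f1("python machine learning cnn", "python ml cnn")
print(round(score, 4))
```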
173
174
 
175
  ### Training hyperparameters
176
 
 
184
  - num_epochs: 40
185
  - mixed_precision_training: Native AMP
186
 
187
 
188
 
189
  ### Framework versions