FoodDesert
committed on
Upload app.py
app.py
@@ -72,19 +72,19 @@ You can read more about TF-IDF on its [Wikipedia page](https://en.wikipedia.org/
 
 ## How does the tag corrector work?
 
-We collect the tag sets from over 4 million e621 posts, treating the tag set from each image as an individual document.
+We collect the tag sets from over 4 million e621 posts, treating the tag set from each image as an individual document.
 We then randomly replace about 10% of the tags in each document with a randomly selected alias from e621's list of aliases for the tag
 (e.g. "canine" gets replaced with one of {k9,canines,mongrel,cannine,cnaine,feral_canine,anthro_canine}).
 We then train a FastText (https://fasttext.cc/) model on the documents. The result of this training is a function that maps arbitrary words to vectors such that
-the vector for a tag and the vectors for its aliases are all close together (because the model has seen them in similar contexts).
-Since the lists of aliases contain misspellings and rephrasings of tags, the model should be robust to these kinds of problems as long as they are not too dissimilar from the alias lists.
+the vector for a tag and the vectors for its aliases are all close together (because the model has seen them in similar contexts).
+Since the lists of aliases contain misspellings and rephrasings of tags, the model should be robust to these kinds of problems as long as they are not too dissimilar from the alias lists.
 
 To enhance the tag corrector further, we leverage conditional probabilities to refine our predictions.
 Using the same 4 million post dataset, we calculate the conditional probability of each tag given the context of other tags appearing within the same document.
-This is done by creating a co-occurrence matrix from our dataset, which records how frequently each pair of tags appears together across all documents.
+This is done by creating a co-occurrence matrix from our dataset, which records how frequently each pair of tags appears together across all documents.
 By considering the context in which tags are used, we can now not only correct misspellings and rephrasings but also make more contextually relevant suggestions.
 The "similarity weight" slider controls how much weight these conditional probabilities are given vs how much weight the FastText similarity model is given when suggesting replacements for invalid tags.
-A similarity weight slider value of 0 means that only the FastText model's predictions will be used to calculate similarity scores, and a value of 1 means only the conditional probabilities are used (although the FastText model is still used to trim the list of candidates).
+A similarity weight slider value of 0 means that only the FastText model's predictions will be used to calculate similarity scores, and a value of 1 means only the conditional probabilities are used (although the FastText model is still used to trim the list of candidates).
 """
 
 
@@ -236,7 +236,7 @@ def find_similar_tags(test_tags, similarity_weight):
 
 modified_tag_for_search = tag.replace(' ','_')
 similar_words = find_similar_tags.fasttext_small_model.most_similar(modified_tag_for_search, topn = 100)
-result, seen = [], set()
+result, seen = [], set(transformed_tags)
 
 if modified_tag_for_search in find_similar_tags.tag2aliases:
 if tag in find_similar_tags.tag2aliases and "_" in tag: #Implicitly tell the user that they should get rid of the underscore
@@ -287,8 +287,8 @@ def find_similar_artists(new_tags_string, top_n, similarity_weight):
 parsed = parser.parse(new_tags_string)
 # Extract tags from the parsed tree
 new_image_tags = extract_tags(parsed)
-new_image_tags = [tag.replace('_', ' ').strip() for tag in new_image_tags]
-
+new_image_tags = [tag.replace('_', ' ').replace('\\(', '(').replace('\\)', ')').strip() for tag in new_image_tags]
+
 ###unseen_tags = list(set(OrderedDict.fromkeys(new_image_tags)) - set(vectorizer.vocabulary_.keys())) #We may want this line again later. These are the tags that were not used to calculate the artists list.
 unseen_tags_data = find_similar_tags(new_image_tags, similarity_weight)
 
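The docstring in the first hunk describes blending FastText similarity with conditional probabilities taken from a tag co-occurrence matrix, controlled by the "similarity weight" slider. The commit does not show how that blend is computed, so the following is only a minimal sketch under stated assumptions: the helper names (`build_cooccurrence`, `conditional_probability`, `blended_score`) are hypothetical, the conditional probabilities are averaged over the context tags, and the slider is treated as a linear interpolation. The only things taken from the docstring are the endpoints: a weight of 0 uses the embedding similarity alone, and a weight of 1 uses the conditional probability alone.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents):
    """Count how often each tag and each unordered tag pair appears across documents."""
    tag_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        tags = sorted(set(doc))  # count each tag once per document, in canonical order
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))
    return tag_counts, pair_counts


def conditional_probability(candidate, context_tags, tag_counts, pair_counts):
    """Average of P(candidate | t) over the context tags t seen in the data.

    Averaging over context tags is an assumption made for this sketch.
    """
    probs = []
    for t in context_tags:
        if t == candidate or tag_counts[t] == 0:
            continue  # skip the candidate itself and tags never seen in the data
        pair = tuple(sorted((candidate, t)))
        probs.append(pair_counts[pair] / tag_counts[t])
    return sum(probs) / len(probs) if probs else 0.0


def blended_score(embedding_similarity, cond_prob, similarity_weight):
    """Assumed linear interpolation between the two signals.

    similarity_weight = 0 -> embedding similarity only
    similarity_weight = 1 -> conditional probability only
    """
    return (1.0 - similarity_weight) * embedding_similarity + similarity_weight * cond_prob


if __name__ == "__main__":
    docs = [["canine", "fur", "tail"], ["canine", "fur"], ["feline", "fur"]]
    tag_counts, pair_counts = build_cooccurrence(docs)
    p = conditional_probability("canine", ["fur", "tail"], tag_counts, pair_counts)
    # 0.8 stands in for a FastText cosine similarity; it is a made-up value.
    print(blended_score(0.8, p, similarity_weight=0.5))
```

In this sketch a candidate that rarely co-occurs with the other tags in the prompt is pushed down as the weight approaches 1, even when its spelling is close to the invalid tag.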