FoodDesert committed on
Commit 3f3bfef · verified · 1 Parent(s): 179fcb8

Upload app.py

Files changed (1)
  1. app.py +8 -8
app.py CHANGED
@@ -72,19 +72,19 @@ You can read more about TF-IDF on its [Wikipedia page](https://en.wikipedia.org/
 
  ## How does the tag corrector work?
 
- We collect the tag sets from over 4 million e621 posts, treating the tag set from each image as an individual document.
  We then randomly replace about 10% of the tags in each document with a randomly selected alias from e621's list of aliases for the tag
  (e.g. "canine" gets replaced with one of {k9,canines,mongrel,cannine,cnaine,feral_canine,anthro_canine}).
  We then train a FastText (https://fasttext.cc/) model on the documents. The result of this training is a function that maps arbitrary words to vectors such that
- the vector for a tag and the vectors for its aliases are all close together (because the model has seen them in similar contexts).
- Since the lists of aliases contain misspellings and rephrasings of tags, the model should be robust to these kinds of problems as long as they are not too dissimilar from the alias lists.
 
  To enhance the tag corrector further, we leverage conditional probabilities to refine our predictions.
  Using the same 4 million post dataset, we calculate the conditional probability of each tag given the context of other tags appearing within the same document.
- This is done by creating a co-occurrence matrix from our dataset, which records how frequently each pair of tags appears together across all documents.
  By considering the context in which tags are used, we can now not only correct misspellings and rephrasings but also make more contextually relevant suggestions.
  The "similarity weight" slider controls how much weight these conditional probabilities are given vs how much weight the FastText similarity model is given when suggesting replacements for invalid tags.
- A similarity weight slider value of 0 means that only the FastText model's predictions will be used to calculate similarity scores, and a value of 1 means only the conditional probabilities are used (although the FastText model is still used to trim the list of candidates).
  """
 
 
@@ -236,7 +236,7 @@ def find_similar_tags(test_tags, similarity_weight):
 
  modified_tag_for_search = tag.replace(' ','_')
  similar_words = find_similar_tags.fasttext_small_model.most_similar(modified_tag_for_search, topn = 100)
- result, seen = [], set()
 
  if modified_tag_for_search in find_similar_tags.tag2aliases:
  if tag in find_similar_tags.tag2aliases and "_" in tag: #Implicitly tell the user that they should get rid of the underscore
@@ -287,8 +287,8 @@ def find_similar_artists(new_tags_string, top_n, similarity_weight):
  parsed = parser.parse(new_tags_string)
  # Extract tags from the parsed tree
  new_image_tags = extract_tags(parsed)
- new_image_tags = [tag.replace('_', ' ').strip() for tag in new_image_tags]
-
  ###unseen_tags = list(set(OrderedDict.fromkeys(new_image_tags)) - set(vectorizer.vocabulary_.keys())) #We may want this line again later. These are the tags that were not used to calculate the artists list.
  unseen_tags_data = find_similar_tags(new_image_tags, similarity_weight)
 
 
 
  ## How does the tag corrector work?
 
+ We collect the tag sets from over 4 million e621 posts, treating the tag set from each image as an individual document.
  We then randomly replace about 10% of the tags in each document with a randomly selected alias from e621's list of aliases for the tag
  (e.g. "canine" gets replaced with one of {k9,canines,mongrel,cannine,cnaine,feral_canine,anthro_canine}).
  We then train a FastText (https://fasttext.cc/) model on the documents. The result of this training is a function that maps arbitrary words to vectors such that
+ the vector for a tag and the vectors for its aliases are all close together (because the model has seen them in similar contexts).
+ Since the lists of aliases contain misspellings and rephrasings of tags, the model should be robust to these kinds of problems as long as they are not too dissimilar from the alias lists.
 
  To enhance the tag corrector further, we leverage conditional probabilities to refine our predictions.
  Using the same 4 million post dataset, we calculate the conditional probability of each tag given the context of other tags appearing within the same document.
+ This is done by creating a co-occurrence matrix from our dataset, which records how frequently each pair of tags appears together across all documents.
  By considering the context in which tags are used, we can now not only correct misspellings and rephrasings but also make more contextually relevant suggestions.
  The "similarity weight" slider controls how much weight these conditional probabilities are given vs how much weight the FastText similarity model is given when suggesting replacements for invalid tags.
+ A similarity weight slider value of 0 means that only the FastText model's predictions will be used to calculate similarity scores, and a value of 1 means only the conditional probabilities are used (although the FastText model is still used to trim the list of candidates).
  """
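
The conditional-probability refinement and the "similarity weight" slider can be sketched in the same spirit. The code below assumes pairwise co-occurrence counts and a simple linear interpolation between the two scores; the docstring only states the behavior at slider values 0 and 1, so the interpolation in between, the function names, and the sample documents are all assumptions rather than the code in app.py.

```python
from collections import Counter
from itertools import combinations


def build_counts(documents):
    """Count how many documents contain each tag and each ordered tag pair."""
    tag_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        unique = sorted(set(doc))
        tag_counts.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return tag_counts, pair_counts


def conditional_probability(candidate, context_tag, tag_counts, pair_counts):
    """Estimate P(candidate | context_tag) as count(both) / count(context_tag)."""
    if tag_counts[context_tag] == 0:
        return 0.0
    return pair_counts[(candidate, context_tag)] / tag_counts[context_tag]


def blended_score(fasttext_similarity, context_probability, similarity_weight):
    """Assumed linear blend: weight 0 uses only FastText similarity,
    weight 1 uses only the conditional probability."""
    return ((1.0 - similarity_weight) * fasttext_similarity
            + similarity_weight * context_probability)


# Hypothetical tag sets standing in for the real dataset.
documents = [
    ["canine", "solo", "outside"],
    ["canine", "duo", "outside"],
    ["feline", "solo", "inside"],
    ["canine", "solo", "inside"],
]
tag_counts, pair_counts = build_counts(documents)
p = conditional_probability("canine", "outside", tag_counts, pair_counts)
print(p, blended_score(0.8, p, 0.5))
```

How the real implementation combines evidence from several context tags at once is not visible in this diff, so the sketch stops at a single context tag.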
 
 
 
  modified_tag_for_search = tag.replace(' ','_')
  similar_words = find_similar_tags.fasttext_small_model.most_similar(modified_tag_for_search, topn = 100)
+ result, seen = [], set(transformed_tags)
 
  if modified_tag_for_search in find_similar_tags.tag2aliases:
  if tag in find_similar_tags.tag2aliases and "_" in tag: #Implicitly tell the user that they should get rid of the underscore
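
The change from `set()` to `set(transformed_tags)` presumably keeps the corrector from suggesting a tag that is already in the input. The definition of `transformed_tags` is outside this hunk, so the sketch below assumes it is the list of tags already present, and that `similar_words` is a list of `(word, score)` pairs sorted by score. The helper `filter_suggestions` and its `limit` parameter are hypothetical.

```python
def filter_suggestions(similar_words, existing_tags, limit=5):
    """Keep the best-scoring suggestions, skipping tags that are already
    present in the input as well as repeated candidates."""
    # Seeding `seen` with the existing tags mirrors the one-line change above.
    result, seen = [], set(existing_tags)
    for word, score in similar_words:
        if word in seen:
            continue
        seen.add(word)
        result.append((word, score))
        if len(result) == limit:
            break
    return result


# Hypothetical candidates, including a duplicate and an already-used tag.
similar_words = [("canine", 0.95), ("wolf", 0.90), ("canine", 0.85), ("fox", 0.80)]
suggestions = filter_suggestions(similar_words, existing_tags=["canine"])
print(suggestions)  # [('wolf', 0.9), ('fox', 0.8)]
```

With an empty starting set, as in the old line, `canine` would have been offered as a replacement even though it is already among the input tags.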
 
  parsed = parser.parse(new_tags_string)
  # Extract tags from the parsed tree
  new_image_tags = extract_tags(parsed)
+ new_image_tags = [tag.replace('_', ' ').replace('\\(', '(').replace('\\)', ')').strip() for tag in new_image_tags]
+
  ###unseen_tags = list(set(OrderedDict.fromkeys(new_image_tags)) - set(vectorizer.vocabulary_.keys())) #We may want this line again later. These are the tags that were not used to calculate the artists list.
  unseen_tags_data = find_similar_tags(new_image_tags, similarity_weight)
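
The new normalization line can be tried in isolation. The sketch below wraps the same chain of `str.replace` calls in a hypothetical helper, `normalize_tag`, and applies it to made-up input tags. It shows that underscores become spaces, backslash-escaped parentheses are unescaped, and surrounding whitespace is stripped.

```python
def normalize_tag(tag):
    """Apply the same replacements as the updated list comprehension."""
    return (tag.replace('_', ' ')
               .replace('\\(', '(')   # '\\(' is a backslash followed by '('
               .replace('\\)', ')')
               .strip())


# Hypothetical raw tags as they might come out of the parser.
raw_tags = ['long_hair', ' fox \\(species\\) ', 'blue_eyes']
normalized = [normalize_tag(t) for t in raw_tags]
print(normalized)  # ['long hair', 'fox (species)', 'blue eyes']
```

Before this commit only the underscore replacement and `strip()` were applied, so an escaped tag would have kept its backslashes and likely failed to match the known tag list.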