I've been thinking about Related Tags recently: many of the results past the first several can be way off, and I've been looking at how to mitigate that loss of accuracy. I wanted to fully explore the idea before potentially submitting a change on GitHub.
Just for reference, the current sample size for Related Tags is 300, and the top 25 results are returned in the order they are ranked in that sample.
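To make the setup concrete, here is a minimal sketch of sample-based ranking. It is not the site's actual code: the function name `related_tags`, drawing the sample at random, and ranking by raw co-occurrence count are all assumptions made for illustration.

```python
import random
from collections import Counter

def related_tags(posts, sample_size=300, limit=25, rng=random):
    """Rank co-occurring tags from a sample of posts (hypothetical sketch).

    posts: list of tag lists, one per post that already matches the query tag.
    """
    # Assumption: the sample is drawn uniformly at random.
    if len(posts) <= sample_size:
        sample = posts
    else:
        sample = rng.sample(posts, sample_size)
    # Count each tag at most once per post.
    counts = Counter(tag for post in sample for tag in set(post))
    # Assumption: rank by raw count in the sample, highest first.
    return [tag for tag, _ in counts.most_common(limit)]
```

With a small sample, tags near the bottom of the top 25 are often separated by only a post or two, which is where the ranking noise comes from.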
1. Increase the sample size
The following tables illustrate the effect on accuracy. For reference, 1000 simulations were run for each data point.
Accuracy is measured by taking the absolute difference between a tag's rank as determined from the sample and the tag's actual rank, so lower is better.
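As a rough sketch of that measurement, assuming the per-tag differences are averaged over the returned tags (and then over the simulations), it could look like the following. The function name `rank_error` and the choice to skip tags missing from the true ranking are my own assumptions.

```python
def rank_error(sample_ranking, true_ranking):
    """Mean absolute difference between sampled rank and actual rank."""
    # Ranks are 1-based positions in each list.
    true_rank = {tag: i for i, tag in enumerate(true_ranking, start=1)}
    errors = [
        abs(i - true_rank[tag])
        for i, tag in enumerate(sample_ranking, start=1)
        if tag in true_rank  # assumption: skip tags with no known true rank
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

For example, if the sample returns `["b", "a", "c"]` and the true order is `["a", "b", "c"]`, the per-tag differences are 1, 1, and 0, for a mean of 2/3.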
Example
Tag: girls_und_panzer
Accuracy mean
Sample size | ArtTags | CopyTags | CharTags | GenTags |
---|---|---|---|---|
100 | 346.35128 | 40.49696 | 6.38504 | 2.94508 |
200 | 135.95832 | 74.69712 | 4.06152 | 1.99528 |
300 | 71.80496 | 88.05108 | 3.09884 | 1.60708 |
400 | 53.79272 | 87.3518 | 2.6512 | 1.36352 |
500 | 38.25004 | 86.12092 | 2.33996 | 1.22672 |
600 | 33.67356 | 79.93428 | 2.09236 | 1.10228 |
700 | 27.39412 | 73.33748 | 1.9188 | 1.0152 |
800 | 24.6674 | 65.02052 | 1.77448 | 0.94664 |
900 | 21.8714 | 57.18844 | 1.66776 | 0.90916 |
1000 | 19.19132 | 47.91616 | 1.55536 | 0.85092 |
Tag: hong_meiling
Accuracy mean
Sample size | ArtTags | CopyTags | CharTags | GenTags |
---|---|---|---|---|
100 | 698.89072 | 79.3412 | 7.46496 | 3.38508 |
200 | 214.05444 | 119.34272 | 4.00044 | 2.54492 |
300 | 62.76796 | 118.51428 | 3.15696 | 2.12416 |
400 | 40.41652 | 109.331 | 2.7704 | 1.87324 |
500 | 23.62312 | 93.74652 | 2.44128 | 1.72372 |
600 | 19.47424 | 76.87108 | 2.24984 | 1.57416 |
700 | 15.32128 | 59.98752 | 2.04332 | 1.43936 |
800 | 13.2336 | 47.4292 | 1.93004 | 1.3776 |
900 | 11.79732 | 36.52324 | 1.8112 | 1.31532 |
1000 | 10.77936 | 28.91504 | 1.74064 | 1.25048 |
2. Filter results
This doesn't change the rank of the items returned, but it does filter out low-similarity hits, which has the effect of increasing the overall accuracy of the set being returned. The filter cutoff for the following tables is 1%, i.e. with a sample size of 300, a tag must appear on at least 3 of the sampled posts to be counted.
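A minimal sketch of such a cutoff, assuming tags are ranked by how many sampled posts carry them (the function name `filter_related` and its parameters are hypothetical):

```python
from collections import Counter

def filter_related(counts, sample_size, cutoff_percent=1, limit=25):
    """Keep top-ranked tags found on at least cutoff_percent of the sample."""
    # Take the usual top results first, so the ordering is unchanged.
    ranked = counts.most_common(limit)
    # Integer comparison avoids floating-point rounding at the boundary.
    return [
        tag for tag, hits in ranked
        if hits * 100 >= cutoff_percent * sample_size
    ]
```

With a sample size of 300, a tag seen on 3 posts passes and a tag seen on 2 does not, so fewer than 25 results can come back.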
Tag: girls_und_panzer (filtered)
Accuracy mean
Sample size | ArtTags | CopyTags | CharTags | GenTags |
---|---|---|---|---|
100 | 354.42396 | 42.74536 | 6.35968 | 2.93288 |
200 | 73.5458 | 0.9376 | 4.03584 | 1.99076 |
300 | 16.40596 | 0.14872 | 3.17472 | 1.58592 |
400 | 6.00148 | 0.06144 | 2.702 | 1.38088 |
500 | 2.73512 | 0.03216 | 2.36352 | 1.21796 |
600 | 1.527 | 0.01976 | 2.17236 | 1.10428 |
700 | 1.068 | 0.01276 | 1.93584 | 1.03684 |
800 | 0.744 | 0.0082 | 1.81372 | 0.96536 |
900 | 0.5768 | 0.00556 | 1.69012 | 0.89788 |
1000 | 0.46136 | 0.005 | 1.57928 | 0.83592 |
Results returned
Sample size | ArtTags | CopyTags | CharTags | GenTags |
---|---|---|---|---|
100 | 25 | 12.419 | 25 | 25 |
200 | 22.217 | 3.916 | 25 | 25 |
300 | 12.571 | 2.932 | 25 | 25 |
400 | 8.956 | 2.64 | 25 | 25 |
500 | 7.261 | 2.485 | 25 | 25 |
600 | 6.35 | 2.405 | 25 | 25 |
700 | 5.97 | 2.34 | 25 | 25 |
800 | 5.505 | 2.262 | 25 | 25 |
900 | 5.226 | 2.25 | 25 | 25 |
1000 | 5.054 | 2.218 | 25 | 25 |
Tag: hong_meiling (filtered)
Accuracy mean
Sample size | ArtTags | CopyTags | CharTags | GenTags |
---|---|---|---|---|
100 | 697.22828 | 125.4646508458 | 7.08316 | 3.40416 |
200 | 66.7368150365 | 10.0919754236 | 3.9374 | 2.54456 |
300 | 19.6457464706 | 2.2782906885 | 3.19768 | 2.11728 |
400 | 10.7071435088 | 1.0675962413 | 2.66876 | 1.89684 |
500 | 7.3527471395 | 0.6174254831 | 2.40316 | 1.69364 |
600 | 5.7510901997 | 0.399722415 | 2.1666 | 1.56652 |
700 | 4.7773598641 | 0.2910205554 | 2.00616 | 1.45104 |
800 | 3.8342245989 | 0.232428412 | 1.89956 | 1.36888 |
900 | 3.3319742207 | 0.1825095057 | 1.76412 | 1.29984 |
1000 | 3.1496296296 | 0.1371541502 | 1.69744 | 1.25328 |
Results returned
Sample size | ArtTags | CopyTags | CharTags | GenTags |
---|---|---|---|---|
100 | 25 | 16.139 | 25 | 25 |
200 | 21.388 | 5.371 | 25 | 25 |
300 | 13.671 | 3.791 | 25 | 25 |
400 | 10.961 | 3.299 | 25 | 25 |
500 | 9.701 | 3.053 | 25 | 25 |
600 | 8.714 | 2.882 | 25 | 25 |
700 | 8.242 | 2.773 | 25 | 25 |
800 | 7.854 | 2.689 | 25 | 25 |
900 | 7.603 | 2.63 | 25 | 25 |
1000 | 7.425 | 2.53 | 25 | 25 |
Initial Thoughts
It looks like the sample size could be increased a bit more before hitting diminishing returns, say to 400 or 500. I don't know whether the server would be able to handle that additional load, though.
Also, it looks like filtering can achieve notable effects, at least on certain tags and/or tag categories. Plus, it removes tags that may not be as pertinent to the tags being tested.
Those are just my thoughts though... What do others think? Are there other ways to approach this problem? Also, if needed, I can always run different or additional simulations.