Donmai

[Feature Change] Related Tags Calculations

Posted under Bugs & Features

I've been thinking about Related Tags recently: many of the results past the first several can be way off, and I've been looking at how to mitigate that loss of accuracy. I wanted to explore the idea fully before potentially submitting a change on GitHub.

Just for reference, the current sample size for Related Tags is 300, and the top 25 results are returned in the order they are ranked in that sample.
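
To make that concrete, here's a minimal Python sketch of sample-based related tags. This is not Danbooru's actual code: the function name `related_tags`, the representation of posts as sets of tag names, and ranking by plain co-occurrence count are all assumptions made for illustration.

```python
from collections import Counter

def related_tags(sample_posts, query_tag, limit=25):
    """Rank tags by how often they co-occur with query_tag in the sample.

    sample_posts is an iterable of sets of tag names (a hypothetical
    representation; the real site stores posts differently).
    """
    counts = Counter()
    for tags in sample_posts:
        if query_tag in tags:
            counts.update(t for t in tags if t != query_tag)
    # most_common returns (tag, count) pairs, highest count first.
    return [tag for tag, _ in counts.most_common(limit)]

# Tiny made-up sample; a real sample would hold 300 posts.
sample = [
    {"1girl", "solo", "long_hair"},
    {"1girl", "solo", "long_hair"},
    {"1girl", "solo", "smile"},
    {"1girl"},
]
print(related_tags(sample, "1girl", limit=2))
```

With a small sample, tags that are actually rare can land in the top 25 by chance, which is the inaccuracy being measured below.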

1. Increase the sample size

The following tables illustrate the effect on accuracy. For reference, 1000 simulations were run for each data point.

Accuracy is measured as the absolute difference between a tag's rank in the sample and the tag's actual rank, so lower values are better.

Example

The rank of solo for the 1girl tag

  • Sample rank: #3
  • Actual rank: #1

Accuracy = abs(3-1) = 2

So if the average accuracy for a set is 4, then each tag in the sample rank list for that set is off by 4 positions on average.
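
The measurement above can be sketched in Python as follows. The function name `rank_error` and the list-of-tag-names inputs are assumptions for the example, not the script used to produce the tables, and the tag lists are made up.

```python
def rank_error(sample_ranking, actual_ranking):
    """Mean absolute difference between sample rank and actual rank.

    Both arguments are lists of tag names, best first, so ranks start at 1.
    Every tag in sample_ranking is assumed to appear in actual_ranking.
    """
    actual_rank = {tag: i for i, tag in enumerate(actual_ranking, start=1)}
    diffs = [abs(i - actual_rank[tag])
             for i, tag in enumerate(sample_ranking, start=1)]
    if not diffs:
        return 0.0
    return sum(diffs) / len(diffs)

# Made-up rankings: solo is #3 in the sample but #1 in reality.
actual = ["solo", "long_hair", "breasts", "smile"]
sample = ["long_hair", "breasts", "solo"]
print(rank_error(sample, actual))  # mean of the differences 1, 1 and 2
```

Each cell in the tables is this kind of per-tag error averaged over the 1000 simulations for that sample size.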

Tag: girls_und_panzer
Accuracy mean

Sample size  ArtTags    CopyTags  CharTags  GenTags
100          346.35128  40.49696  6.38504   2.94508
200          135.95832  74.69712  4.06152   1.99528
300          71.80496   88.05108  3.09884   1.60708
400          53.79272   87.3518   2.6512    1.36352
500          38.25004   86.12092  2.33996   1.22672
600          33.67356   79.93428  2.09236   1.10228
700          27.39412   73.33748  1.9188    1.0152
800          24.6674    65.02052  1.77448   0.94664
900          21.8714    57.18844  1.66776   0.90916
1000         19.19132   47.91616  1.55536   0.85092

Tag: hong_meiling
Accuracy mean

Sample size  ArtTags    CopyTags   CharTags  GenTags
100          698.89072  79.3412    7.46496   3.38508
200          214.05444  119.34272  4.00044   2.54492
300          62.76796   118.51428  3.15696   2.12416
400          40.41652   109.331    2.7704    1.87324
500          23.62312   93.74652   2.44128   1.72372
600          19.47424   76.87108   2.24984   1.57416
700          15.32128   59.98752   2.04332   1.43936
800          13.2336    47.4292    1.93004   1.3776
900          11.79732   36.52324   1.8112    1.31532
1000         10.77936   28.91504   1.74064   1.25048

2. Filter results

This doesn't change the rank of the items returned, but it does filter out low-similarity hits, which has the effect of increasing the overall accuracy of the set being returned. The filter cutoff for the following tables is 1%, i.e. with a sample size of 300, at least 3 of the sampled posts must have a tag for it to be counted.
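
A minimal Python sketch of such a filter is below. The function name `filter_related`, the `cutoff_percent` parameter, and the tag counts are hypothetical; it only illustrates the cutoff idea and is not the code behind the tables.

```python
def filter_related(counts, sample_size, cutoff_percent=1, limit=25):
    """Drop tags seen in fewer than cutoff_percent of the sampled posts.

    counts maps tag name -> number of sampled posts that have the tag.
    Integer arithmetic avoids float rounding at the cutoff boundary.
    """
    kept = [(tag, n) for tag, n in counts.items()
            if n * 100 >= sample_size * cutoff_percent]
    # Relative order of the surviving tags is unchanged: highest count first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in kept[:limit]]

# Made-up counts from a 300-post sample: the cutoff is 3 posts.
counts = {"tag_a": 41, "tag_b": 19, "tag_c": 2, "tag_d": 3}
print(filter_related(counts, sample_size=300))
```

Because the cutoff scales with the sample size, a larger sample removes more one-off tags, which is why fewer than 25 results can come back.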

Tag: girls_und_panzer (filtered)
Accuracy mean

Sample size  ArtTags    CopyTags  CharTags  GenTags
100          354.42396  42.74536  6.35968   2.93288
200          73.5458    0.9376    4.03584   1.99076
300          16.40596   0.14872   3.17472   1.58592
400          6.00148    0.06144   2.702     1.38088
500          2.73512    0.03216   2.36352   1.21796
600          1.527      0.01976   2.17236   1.10428
700          1.068      0.01276   1.93584   1.03684
800          0.744      0.0082    1.81372   0.96536
900          0.5768     0.00556   1.69012   0.89788
1000         0.46136    0.005     1.57928   0.83592

Results returned

Sample size  ArtTags  CopyTags  CharTags  GenTags
100          25       12.419    25        25
200          22.217   3.916     25        25
300          12.571   2.932     25        25
400          8.956    2.64      25        25
500          7.261    2.485     25        25
600          6.35     2.405     25        25
700          5.97     2.34      25        25
800          5.505    2.262     25        25
900          5.226    2.25      25        25
1000         5.054    2.218     25        25

Tag: hong_meiling (filtered)
Accuracy mean

Rank  Sample size  ArtTags        CopyTags        CharTags  GenTags
1     100          697.22828      125.4646508458  7.08316   3.40416
2     200          66.7368150365  10.0919754236   3.9374    2.54456
3     300          19.6457464706  2.2782906885    3.19768   2.11728
4     400          10.7071435088  1.0675962413    2.66876   1.89684
5     500          7.3527471395   0.6174254831    2.40316   1.69364
6     600          5.7510901997   0.399722415     2.1666    1.56652
7     700          4.7773598641   0.2910205554    2.00616   1.45104
8     800          3.8342245989   0.232428412     1.89956   1.36888
9     900          3.3319742207   0.1825095057    1.76412   1.29984
10    1000         3.1496296296   0.1371541502    1.69744   1.25328

Results returned

Sample size  ArtTags  CopyTags  CharTags  GenTags
100          25       16.139    25        25
200          21.388   5.371     25        25
300          13.671   3.791     25        25
400          10.961   3.299     25        25
500          9.701    3.053     25        25
600          8.714    2.882     25        25
700          8.242    2.773     25        25
800          7.854    2.689     25        25
900          7.603    2.63      25        25
1000         7.425    2.53      25        25

Initial Thoughts

It looks like the sample size could be increased a bit more before hitting diminishing returns, say to 400 or 500. I don't know whether the server would be able to handle that additional load, though.

Also, it looks like filtering can achieve notable effects, at least on certain tags and/or tag categories. Plus, it removes tags that may not be as pertinent to the tags being tested.

Those are just my thoughts though. What do others think? Are there other ways to approach this problem? Also, if needed, I can always run different or additional simulations.
