[Feature Change] Related Tags Calculations

I've been thinking about Related Tags recently: many of the results past the first several returned can be way off, and I've been looking at how to mitigate that loss of accuracy. However, I wanted to fully explore this idea before potentially submitting a change on GitHub.

Just for reference, the current sample size for Related Tags is 300, and the top 25 results are returned in the order they are ranked in that sample.
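To make the setup concrete, here is a minimal sketch of the sampling step as I understand it. It is not the actual site code: the function name, the list-of-tag-lists input, and ranking by raw frequency within the sample are all my own assumptions for illustration.

```python
import random
from collections import Counter

def related_tags(posts, sample_size=300, top_n=25, rng=None):
    """Rank tags by how often they appear in a random sample of posts.

    posts: list of tag lists, one per post matching the queried tag
    (hypothetical input format, assumed for this sketch).
    """
    rng = rng or random.Random()
    # Draw the sample without replacement; use every post if there are too few.
    sample = rng.sample(posts, min(sample_size, len(posts)))
    # Count each tag at most once per post.
    counts = Counter(tag for post in sample for tag in set(post))
    # Return the top_n tags, most frequent in the sample first.
    return [tag for tag, _ in counts.most_common(top_n)]
```

Because the ranking comes from a sample, tags with few hits can land far from their true position, which is the inaccuracy measured below.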

1. Increase the sample size

The following tables illustrate the effect on accuracy. For reference, each data point is the result of 1000 simulations.

Accuracy is measured by taking the absolute difference between a tag's rank as determined from the sample and the tag's actual rank, so lower values are better.

Example

The rank of solo for the 1girls tag

Sample rank: #3
Actual rank: #1

Accuracy = abs(3-1) = 2

So if the average accuracy for a set was 4, then each tag in the sample rank list for that set would be off by 4 ranks on average.
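The measure above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes every tag in the sample ranking also appears in the actual ranking.

```python
def rank_errors(sample_ranking, actual_ranking):
    """Absolute rank difference per tag (ranks are 1-based)."""
    actual_rank = {tag: i for i, tag in enumerate(actual_ranking, start=1)}
    # Assumes every sampled tag is present in actual_ranking.
    return {
        tag: abs(i - actual_rank[tag])
        for i, tag in enumerate(sample_ranking, start=1)
    }

def mean_rank_error(sample_ranking, actual_ranking):
    """Average of the per-tag errors: the 'accuracy mean' of one ranking."""
    errors = rank_errors(sample_ranking, actual_ranking)
    return sum(errors.values()) / len(errors) if errors else 0.0
```

For the example above, a tag ranked #3 in the sample and #1 in reality gets an error of 2. Averaging `mean_rank_error` over many simulated samples would give one cell of the tables below.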

Tag: girls_und_panzer

Accuracy mean

Sample size	ArtTags	CopyTags	CharTags	GenTags
100	346.35128	40.49696	6.38504	2.94508
200	135.95832	74.69712	4.06152	1.99528
300	71.80496	88.05108	3.09884	1.60708
400	53.79272	87.3518	2.6512	1.36352
500	38.25004	86.12092	2.33996	1.22672
600	33.67356	79.93428	2.09236	1.10228
700	27.39412	73.33748	1.9188	1.0152
800	24.6674	65.02052	1.77448	0.94664
900	21.8714	57.18844	1.66776	0.90916
1000	19.19132	47.91616	1.55536	0.85092

Tag: hong_meiling

Accuracy mean

Sample size	ArtTags	CopyTags	CharTags	GenTags
100	698.89072	79.3412	7.46496	3.38508
200	214.05444	119.34272	4.00044	2.54492
300	62.76796	118.51428	3.15696	2.12416
400	40.41652	109.331	2.7704	1.87324
500	23.62312	93.74652	2.44128	1.72372
600	19.47424	76.87108	2.24984	1.57416
700	15.32128	59.98752	2.04332	1.43936
800	13.2336	47.4292	1.93004	1.3776
900	11.79732	36.52324	1.8112	1.31532
1000	10.77936	28.91504	1.74064	1.25048

2. Filter results

This doesn't change the rank of the items returned, but it does filter out low-similarity hits, which has the effect of increasing the overall accuracy of the whole set being returned. The filter cutoff for the following tables is 1%, i.e. with a sample size of 300, a tag must appear on at least 3 of the sampled posts to be counted.
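A rough sketch of that filter, assuming the per-tag hit counts from the sample are already available. The function name and the integer percent parameter are my own choices for illustration, not anything from the site code.

```python
from collections import Counter

def filter_related(counts, sample_size, cutoff_percent=1, top_n=25):
    """Keep only tags that appear on at least cutoff_percent of the sample.

    counts: Counter mapping tag -> number of sampled posts with that tag.
    """
    kept = [
        tag
        for tag, hits in counts.most_common()
        # Integer comparison avoids floating-point rounding at the boundary.
        if hits * 100 >= cutoff_percent * sample_size
    ]
    # Relative order is unchanged; the list may now be shorter than top_n.
    return kept[:top_n]
```

With a sample size of 300, a tag with 3 hits is kept and a tag with 2 hits is dropped. This is why the "Results returned" tables below can show fewer than 25 results.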

Tag: girls_und_panzer (filtered)

Accuracy mean

Sample size	ArtTags	CopyTags	CharTags	GenTags
100	354.42396	42.74536	6.35968	2.93288
200	73.5458	0.9376	4.03584	1.99076
300	16.40596	0.14872	3.17472	1.58592
400	6.00148	0.06144	2.702	1.38088
500	2.73512	0.03216	2.36352	1.21796
600	1.527	0.01976	2.17236	1.10428
700	1.068	0.01276	1.93584	1.03684
800	0.744	0.0082	1.81372	0.96536
900	0.5768	0.00556	1.69012	0.89788
1000	0.46136	0.005	1.57928	0.83592

Results returned

Sample size	ArtTags	CopyTags	CharTags	GenTags
100	25	12.419	25	25
200	22.217	3.916	25	25
300	12.571	2.932	25	25
400	8.956	2.64	25	25
500	7.261	2.485	25	25
600	6.35	2.405	25	25
700	5.97	2.34	25	25
800	5.505	2.262	25	25
900	5.226	2.25	25	25
1000	5.054	2.218	25	25

Tag: hong_meiling (filtered)

Accuracy mean

Sample size	ArtTags	CopyTags	CharTags	GenTags
100	697.22828	125.4646508458	7.08316	3.40416
200	66.7368150365	10.0919754236	3.9374	2.54456
300	19.6457464706	2.2782906885	3.19768	2.11728
400	10.7071435088	1.0675962413	2.66876	1.89684
500	7.3527471395	0.6174254831	2.40316	1.69364
600	5.7510901997	0.399722415	2.1666	1.56652
700	4.7773598641	0.2910205554	2.00616	1.45104
800	3.8342245989	0.232428412	1.89956	1.36888
900	3.3319742207	0.1825095057	1.76412	1.29984
1000	3.1496296296	0.1371541502	1.69744	1.25328

Results returned

Sample size	ArtTags	CopyTags	CharTags	GenTags
100	25	16.139	25	25
200	21.388	5.371	25	25
300	13.671	3.791	25	25
400	10.961	3.299	25	25
500	9.701	3.053	25	25
600	8.714	2.882	25	25
700	8.242	2.773	25	25
800	7.854	2.689	25	25
900	7.603	2.63	25	25
1000	7.425	2.53	25	25

Initial Thoughts

It looks like the sample size could be increased a bit more before hitting diminishing returns, say to 400 or 500. I don't know whether the server would be able to handle that additional load, though.

Also, it looks like filtering can achieve notable effects, at least on certain tags and/or tag categories. Plus, it removes tags that may not be as pertinent to the tags being tested.

Those are just my thoughts, though. What do others think? Are there other ways to approach this problem? Also, if needed, I can always run different/additional simulations.