๐ŸŽ‰ Happy 19th Birthday to Danbooru! ๐ŸŽ‰
Donmai

Upcoming Changes for Upload and Approval Complaints

Posted under General

Saladofstones said:

I'm inclined to think that the quality of a janitor/moderator should be decided on a case by case basis. There isn't a huge amount of staff, and any automatic system is going to cause issues.

Ai-to-Yukai said:

Just throwing this one out here... I noticed I'm able to give User Feedback. Perhaps people could be encouraged to use the User Feedback system and positive feedback could be added to an uploader/approver score?

This complex system is being discussed because the individuals at the top don't want to go case-by-case or rely upon a smaller set of deliberate user feedback. (Not that it would be clear-cut even in such a case - some of the people at the center of this have gotten positive feedback since this thing started as a specific backlash against the complaints raised.)

Shinjidude said:

Hmm, even though I proposed a similar idea above, the more I think of it, the less I like going down this route. If we end up making scoring as complicated as the US tax code, then, like the US tax code, it's going to become full of unintended incentives and loopholes people could (and likely would) exploit.

Modifying the scoring system doesn't seem to be a bad idea, especially if it's going to be used in this way, but the changes need to be simple and transparent. Long ago, before I was a janitor, mods and admins had greater weight given to their votes, but that proved unpopular even with some mods (and might be counterproductive if one of the goals is to be able to demote approvers with less-than-sterling quality metrics). Something simple along those lines, though, would be less gameable, more easily implementable, and less likely to have unintended consequences.

Not particularly relevant to this discussion, but the US Tax Code is written by lobbyists of the people who want those "loopholes" created on purpose for them to fit through, so there should be some difference.

But anyway, complexity in any sort of system, regulatory or not, is generally a result of trying to fix the far more obvious and exploitable problems in a more basic system. The current scoring system has obvious categories of images that get extremely high scores all the time (basically, loli Touhou porn), while other things, even pages of translated, well-illustrated Touhou doujins, get a score of 0. If you set up a system where nobody wants to approve (and by extension, upload) doujins anymore, that's a much worse problem than a complex metric for judging score.

Just throwing out some numbers for posts uploaded between 2015-04-01 and 2015-07-01.

Numbers
  • 82,462 Posts (not including deletions)
    • 7,582 Comics (9% of posts)
    • 74,880 Non-comic posts (91% of posts)
    • 6,037 Explicit posts (7% of posts)
    • 10,386 Questionable posts (13%)
    • 66,039 Safe posts (80%)
    • 5,785 Explicit non-comic posts (7%)
    • 10,133 Questionable non-comic posts (12%)
    • 58,962 Safe non-comic posts (72%)
    • 31% of posts w/ a score less than 3
    • 74% of comic posts w/ a score less than 3
    • 26% of non-comic posts w/ a score less than 3
    • 17% of E-rated non-comic posts w/ a score less than 3
    • 11% of Q-rated non-comic posts w/ a score less than 3
    • 30% of S-rated non-comic posts w/ a score less than 3
    • 66% of E-rated comic posts w/ a score less than 3
    • 75% of Q-rated comic posts w/ a score less than 3
    • 75% of S-rated comic posts w/ a score less than 3
    • 0.1% Banned
  • 50,931 Posts w/o KanColle and Touhou (62% of posts)
    • 5% Comics, 95% Non-comics
    • 9% E-rated, 14% Q-rated, 76% S-rated
    • 87% of comics and 32% of non-comics w/ a score less than 3
    • 18% of E-rated non-comic posts w/ a score less than 3
    • 13% of Q-rated non-comic posts w/ a score less than 3
    • 37% of S-rated non-comic posts w/ a score less than 3
    • 64% of E-rated comic posts w/ a score less than 3
    • 79% of Q-rated comic posts w/ a score less than 3
    • 89% of S-rated comic posts w/ a score less than 3
  • 8,736 Original Posts w/ no other copyrights (11% of posts)
    • 4% Comics, 96% Non-comics
    • 13% E-rated, 21% Q-rated, 66% S-rated
    • 88% of comics and 18% of non-comics w/ a score less than 3
    • 15% of E-rated non-comic posts w/ a score less than 3
    • 10% of Q-rated non-comic posts w/ a score less than 3
    • 21% of S-rated non-comic posts w/ a score less than 3
    • 63% of E-rated comic posts w/ a score less than 3
    • 92% of Q-rated comic posts w/ a score less than 3
    • 92% of S-rated comic posts w/ a score less than 3
  • 13,802 Touhou Posts (17% of posts)
    • 12% Comics, 88% Non-comics
    • 4% E-rated, 9% Q-rated, 87% S-rated
    • 65% of comics and 10% of non-comics w/ a score less than 3
    • 12% of E-rated non-comic posts w/ a score less than 3
    • 4% of Q-rated non-comic posts w/ a score less than 3
    • 10% of S-rated non-comic posts w/ a score less than 3
    • 69% of E-rated comic posts w/ a score less than 3
    • 82% of Q-rated comic posts w/ a score less than 3
    • 65% of S-rated comic posts w/ a score less than 3
  • 17,770 Kantai Collection Posts (22% of posts)
    • 18% Comics, 82% Non-comics
    • 4% E-rated, 10% Q-rated, 86% S-rated
    • 68% of comics and 22% of non-comics w/ a score less than 3
    • 15% of E-rated non-comic posts w/ a score less than 3
    • 9% of Q-rated non-comic posts w/ a score less than 3
    • 24% of S-rated non-comic posts w/ a score less than 3
    • 71% of E-rated comic posts w/ a score less than 3
    • 65% of Q-rated comic posts w/ a score less than 3
    • 68% of S-rated comic posts w/ a score less than 3
Comparisons

Notes: Only active posts were counted for the groupings. Original refers to posts with no copyrights other than original.

Comic and Non-Comic percentages
  • All Posts: 9% comics, 91% non-comics
  • Posts w/o KC & TH: 5% comics, 95% non-comics
  • Original: 4% comics, 96% non-comics
  • Touhou: 12% comics, 88% non-comics
  • KanColle: 18% comics, 82% non-comics
  • Vocaloid: 10% comics, 90% non-comics
  • Deleted Posts: 2% comics, 98% non-comics
Rating Composition
  • All Posts: 7% E-rated, 13% Q-rated, 80% S-rated
  • Posts w/o KC & TH: 9% E-rated, 14% Q-rated, 76% S-rated
  • Original: 13% E-rated, 21% Q-rated, 66% S-rated
  • Touhou: 4% E-rated, 9% Q-rated, 87% S-rated
  • KanColle: 4% E-rated, 10% Q-rated, 86% S-rated
  • Vocaloid: 3% E-rated, 6% Q-rated, 91% S-rated
  • Deleted Posts: 37% E-rated, 16% Q-rated, 47% S-rated
Percent of posts with a score less than 3
  • All Posts: 26% E-rated, 13% Q-rated, 35% S-rated
  • Posts w/o KC & TH: 19% E-rated, 14% Q-rated, 41% S-rated
  • Original: 17% E-rated, 11% Q-rated, 24% S-rated
  • Touhou: 16% E-rated, 6% Q-rated, 17% S-rated
  • KanColle: 19% E-rated, 11% Q-rated, 33% S-rated
  • Vocaloid: 14% E-rated, 24% Q-rated, 28% S-rated
  • Deleted Posts: 58% E-rated, 56% Q-rated, 73% S-rated
Percent of non-comic posts with a score less than 3
  • All Posts: 17% E-rated, 11% Q-rated, 30% S-rated
  • Posts w/o KC & TH: 18% E-rated, 13% Q-rated, 37% S-rated
  • Original: 15% E-rated, 10% Q-rated, 21% S-rated
  • Touhou: 12% E-rated, 4% Q-rated, 10% S-rated
  • KanColle: 15% E-rated, 9% Q-rated, 24% S-rated
  • Vocaloid: 9% E-rated, 15% Q-rated, 20% S-rated
  • Deleted Posts: 58% E-rated, 55% Q-rated, 73% S-rated
Percent of comic posts with a score less than 3
  • All Posts: 66% E-rated, 75% Q-rated, 75% S-rated
  • Posts w/o KC & TH: 64% E-rated, 79% Q-rated, 89% S-rated
  • Original: 63% E-rated, 92% Q-rated, 92% S-rated
  • Touhou: 69% E-rated, 82% Q-rated, 65% S-rated
  • KanColle: 71% E-rated, 65% Q-rated, 68% S-rated
  • Vocaloid: 67% E-rated, 100% Q-rated, 97% S-rated
  • Deleted Posts: 54% E-rated, 100% Q-rated, 90% S-rated

NWSiaCB said:

...
Not particularly relevant to this discussion, but the US Tax Code is written by lobbyists of the people who want those "loopholes" created on purpose for them to fit through, so there should be some difference.

But anyway, complexity in any sort of system, regulatory or not, is generally a result of trying to fix the far more obvious and exploitable problems in a more basic system.
...

This was sort of my point. While you may cynically be correct that in practice the complexities get put in by people meaning to tweak the system to their advantage, the purported purpose of the tax credits and deductions that complicate the system is to allow the government to prop up certain people, industries, and practices.

What we're proposing here is essentially the same, to weaken the power of certain tags (Touhou, Kancolle, etc), and boost others (manga, comic). At that point you have to decide how much a boost is appropriate and deal with potential side effects. Maybe at some point Touhou dies down and the penalty is no longer warranted, maybe the next big thing comes out and it's not treated like other franchises. Maybe it's the artist that prompts an automatic high score and not the copyright.

Then we get into what we were talking about with users having weighted votes. At first blush it sounded like a good idea, but suppose one person only votes on very popular themes and another only votes on obscure themes. If the quality across all the images is comparable, is it fair for the first person to have higher-weighted votes than the second?

My point with this was that in addition to being possibly impractical to impartially define and effectively implement, the more complicated the rules we choose to go with, the more unforeseen side-effects and problems we may accidentally introduce. We need a relatively simple and transparent solution, I think.

So don't rely on scores.

If that really should be the case then, like I said before, I wouldn't demote any of the current janitors. People keep mentioning names but I look at their approvals and compare them to other approvers with similar scores, and I see pretty much the same thing. Comic heavy maybe, some average quality art, but also a lot of good art.

You, as a user of the site, have a conception of what good art is. At this point it's probably different from my conception. In order for the front page to be a common ground there has to be a compromise between everyone. You may think Touhou art is overvalued, but a Touhou fan might want to see mediocre-quality Touhou art that nevertheless has value because it's funny or is part of a story.

If you don't like that, then you have to give up on the idea of the front page being a common ground. If score is an inadequate filter (and I won't disagree that it's biased) then you need to start seeking out specific artists, uploaders, favoriters, etc. It's still an order of magnitude easier to do this on Danbooru versus other sites. If we need better features to do this then we can go down that path.

I do think the moderation staff needs new blood so I will continue recruiting new approvers and it'll be simple enough to review them after a month.

Shinjidude said:

This was sort of my point. While you may cynically be correct that in practice the complexities get put in by people meaning to tweak the system to their advantage, the purported purpose of the tax credits and deductions that complicate the system is to allow the government to prop up certain people, industries, and practices.

That's not cynical, that's literally how it works - there are over 400 changes to the US tax code per year that go through Congress, and actual legislators don't have time to read them before voting for them...

Shinjidude said:

What we're proposing here is essentially the same, to weaken the power of certain tags (Touhou, Kancolle, etc), and boost others (manga, comic). Maybe at some point Touhou dies down and the penalty is no longer warranted, maybe the next big thing comes out and it's not treated like other franchises. Maybe it's the artist that prompts an automatic high score and not the copyright.

I'm not saying that it should be weakening a specific tag's value, I'm saying that the value of a vote changes depending on how many votes a specific image has already received.

If it's based upon the number of votes a given image has already received, the system automatically corrects for the rising or declining popularity of a given copyright. (As was already pointed out, images with the "original" "copyright" tend to have far fewer views and votes even when they are of greater detail and artistic quality than a middling-quality Touhou image.)

albert said:

You, as a user of the site, have a conception of what good art is. At this point it's probably different from my conception. In order for the front page to be a common ground there has to be a compromise between everyone. You may think Touhou art is overvalued, but a Touhou fan might want to see mediocre-quality Touhou art that nevertheless has value because it's funny or is part of a story.

If you don't like that, then you have to give up on the idea of the front page being a common ground. If score is an inadequate filter (and I won't disagree that it's biased) then you need to start seeking out specific artists, uploaders, favoriters, etc. It's still an order of magnitude easier to do this on Danbooru versus other sites. If we need better features to do this then we can go down that path.

Don't get me wrong, I AM a Touhou fan. It's what originally caused me to find this site. Still, it's undeniable that a lot of mediocre, and sometimes quite terrible, Touhou art that I don't want to be part of my fandom gets through.

And I doubt anyone actually just hits Posts to look for something. Even searching for specific characters, alone, outside the really unpopular ones, is generally not worthwhile. The "Comments" page is actually much more useful as a way to find interesting comics and filter out the random bulk porn.

albert said:
If you don't like that, then you have to give up on the idea of the front page being a common ground. If score is an inadequate filter (and I won't disagree that it's biased) then you need to start seeking out specific artists, uploaders, favoriters, etc. It's still an order of magnitude easier to do this on Danbooru versus other sites. If we need better features to do this then we can go down that path.

Thinking about it, is it possible to have a breakdown of score? Right now score is just the net amount of favorites from premium members + upvotes - downvotes. Seeing how many pure upvotes, downvotes, and points from premium members can help get more usable information.

To a degree, seeing what type of user is giving favorites can help. Builders and contributors are generally viewed as having more experience with uploads and as picking good uploads themselves, so that would give a more complete metric to work from.

Just knowing how many upvotes, favorites, and by what type of user, along with downvotes can make scores more able to be a way to indicate quality.

As an aside, I'm not educated about statistics, but is there a way to adjust for the difference in feedback in a popular commodity compared to a more niche one? I'd imagine there is a simple way to do this, given the amount of market research there is out in the world.

Saladofstones said:

As an aside, I'm not educated about statistics, but is there a way to adjust for the difference in feedback in a popular commodity compared to a more niche one? I'd imagine there is a simple way to do this, given the amount of market research there is out in the world.

I imagine there would be a way to create an adjustment to compensate for the bias of a given tag (something like seasonally adjusted job or sales rates). Say the average post has a score of 2, the average Kancolle post has a score of 3, and the average comic post has a score of 1. If you had a post of each type (one generic, one Kancolle, and one comic), each with a score of 2, then the adjusted score would stay 2 for the generic post, be something below 2 for the Kancolle post (it'd need to be adjusted downward), and something above 2 for the comic post (it'd need to be adjusted upward).

The weights would likely not be cheap to calculate, though, due to the number of tags and the fact that they'd constantly be changing. Calculating the adjustment might not be either: since the tag weights would interact, there are more variables than just "season" in this case (what happens for a Kancolle comic?). Tagging a post would also change its score, pushing it down if you added a popular tag and up if you added a less popular one. It's also not clear that all tags should have the same power to affect a post's score.

It might be doable, but I don't think it'd be simple.
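As a rough sketch of the mean-based adjustment described above, one could rescale a raw score by the ratio of the overall average to the tag's average. All numbers here are the hypothetical ones from the example (overall 2, Kancolle 3, comic 1), not real site statistics:

```python
# Sketch of a mean-based score adjustment (hypothetical averages).
# A raw score is rescaled by (overall average / tag average), so tags that
# score above average get pulled down and below-average tags get boosted.

def adjusted_score(score, tag_avg, overall_avg=2.0):
    """Rescale a raw score relative to its tag's average score."""
    return score * (overall_avg / tag_avg)

print(adjusted_score(2, tag_avg=2.0))  # generic post: stays 2.0
print(adjusted_score(2, tag_avg=3.0))  # Kancolle post: drops below 2
print(adjusted_score(2, tag_avg=1.0))  # comic post: boosted to 4.0
```

A multiplicative adjustment is only one choice; an additive offset (score minus the tag's excess over the overall mean) would behave differently for high-scoring posts.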

Just more things to consider. Our end goal is to determine the probability of a user approving a "good" post. A post can be considered "good" using something along the lines of "score normalization", which sounds simple at first - both scores and favorites are counted; the more popular a copyright (or maybe a character?) is, the less its score weighs; comic scores weigh more than others; explicit posts weigh less than safe ones; very recent posts weigh more than older ones. There are three main problems, though:

1) The worst one is determining proper coefficients for the equation. How much is [touhou score]/[girls_und_panzer score]? How much is [rating:e score]/[rating:s score]? The direct way to calculate those is to gather sets of roughly the same quality and estimate the difference, i.e. "explicit posts generally have 1.5x the score of similar safe posts". The thing is, who is going to determine "same quality", and how? Posts from the same artists, by the same uploaders, or with the same approvers tend to have different styles, from sketches to professional art. It actually sounds like a data mining problem, and I only have a basic statistics education, so I'm simply not qualified to do this - we'd probably need someone with a statistics degree to determine those properly. OR, we could go with "common sense", get some coefficients with rough estimations and polish those later via trial and error - but this will surely cause much drama at first.

2) What if a post belongs to multiple copyrights? Does a cross-over between popular and obscure copyrights boost the score? What if both copyrights are popular? If both are obscure? What about original, how do we even treat it, as popular or unpopular? Again, it's easy to assume something, but often it wouldn't be the truth.

3) This is more of a technical difficulty, but still. For example, current order:rank estimations are based on score and age only - it's simple math done with a single query. If tags are taken into account, however, it becomes much more complex: one query to get post data like rating, age and score, then queries to sort out copyright tags and get their post counts, then some complex math to get multipliers depending on the tag combination; all of this repeated for every post. The report will be generated significantly slower and will cause database load; or this new score will be cached on every post edit, making edits slightly slower - and edits happen often, so this "slight" slowdown might snowball into database timeouts in the worst case.

tl;dr it's doable, but that's a lot of work and it's not even guaranteed to work at first, because we lack proper post sets to check if it's working correctly.

If this entire scoring discussion shows anything, it's that no automatic/numeric scheme will be trusted to fully work, and it definitely shouldn't be used to judge a user's credibility (and I certainly won't go out of my way to approve an additional 230 posts to make the statistics "accurate"). I'm afraid nothing short of hand-picking and checking by the admins themselves will do proper justice. Too many wrongs to make it right. Thinking of ways to do it and then also coding it sounds like time that could be better spent on something else, simply put. Sorry to be 'that' guy.
Scoring won't magically stop being botched, and no amount of incentivizing users to vote up/down will change that - putting aside that most users don't even see all the posts between their visits that could potentially get +1s.

Saladofstones said:

Thinking about it, is it possible to have a breakdown of score? Right now score is just the net amount of favorites from premium members + upvotes - downvotes. Seeing how many pure upvotes, downvotes, and points from premium members can help get more usable information.

Danbooru keeps track of upvotes/downvotes separately with up_score and down_score columns (you can view this in the api if you want).

As for score from favorites, that is counted in the score column but not the up_score column. So if you had a post with these values:
score: 4
up_score: 2
down_score: -1
That would mean 2 people upvoted it, 1 downvoted it, and 3 gold+ users favorited it. The last one can be calculated from (score - up_score - down_score).

But I think Danbooru 1 did things differently, so votes made back then probably can't be broken down.
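In code, the breakdown above works out to simple subtraction. (`favorite_count` here is an illustrative helper, not an actual Danbooru function.)

```python
# Deriving the gold+ favorite count from the three columns described above:
# score = upvotes + downvotes + favorites, and down_score is stored as a
# negative number, so favorites fall out by subtraction.

def favorite_count(score, up_score, down_score):
    return score - up_score - down_score

# The example post: score 4, up_score 2, down_score -1.
print(favorite_count(4, 2, -1))  # 3 gold+ users favorited it
```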

Wypatroszony said:

If this entire scoring discussion shows anything, it's that no automatic/numeric scheme will be trusted to fully work, and it definitely shouldn't be used to judge a user's credibility (and I certainly won't go out of my way to approve an additional 230 posts to make the statistics "accurate"). I'm afraid nothing short of hand-picking and checking by the admins themselves will do proper justice.

Sure. Demotion should never be fully automated, and I don't think it was ever intended to be. The report is intended to bring attention to approvers not doing so well. All that score-normalization brainstorming is necessary to make the signaling more accurate and make manual checking easier - in an ideal world, normalized score would rank approvals from "best" to "worst".

Shinjidude said:
It might be doable, but I don't think it'd be simple.

Yeah, by "simple" I don't mean "easy to do" but rather having a straightforward, consistent system.

While full automation is undesirable, being able to apply weighting allows for better adjustment.

It is true that multiple copyrights and an overload of factors can be too much. I often see a weighted/unweighted statistic presented along with the margin of error; if the margin of error were too high, there could just be a special symbol indicating it's not a reliable number, like an asterisk or something.

I agree that it would be a lot of work and load for a minor feature.

Type-kun said:

...
1) The worst one is determining proper coefficients for the equation....

2) What if a post belongs to multiple copyrights?...
...

If it were to be done at all, I think some assumptions would need to be made. Namely, that we assume roughly normal distributions for quality, and normalize based on the means of the post scores for each tag (compared to the overall score distribution). You'd probably also have to apply the weights of all weighted tags to a post that includes multiple (they might work in the same or opposite directions as far as adjustment to the score goes).

That would probably be the simplest and most reasonable way to go about it. I still think that even with a relatively dumb normalization algorithm like that, it would be a computationally expensive proposition for the number of tags and posts we have, and it'd need to be recomputed on a regular basis based on scoring trends.

Also like you said, it may do seemingly strange things depending on the circumstances, and I'm not sure I'd 100% trust it anyway.

I think giving mods a metric to judge themselves and each other by is probably a good idea: it will help them become more selective and conform more closely to what the community seems to agree on quality-wise, make them more accountable for their selections, and provide justification for action when complaints are made. I worry, though, that over-complicating it may introduce more problems than it solves.


Saladofstones said:
As an aside, I'm not educated about statistics, but is there a way to adjust for the difference in feedback in a popular commodity compared to a more niche one? I'd imagine there is a simple way to do this, given the amount of market research there is out in the world.

I'm a bit rusty on the subject, but I'm fairly sure there's work out there in the field of collaborative filtering (think: netflix recommendations) that deals with adjusting/normalizing ratings/scores that users assign to items, e.g. try to compensate for individual user preferences such as liking/disliking particular genres, actors, etc. Might be overkill though -- lots of effort to implement, and we don't even know how well it'll work, if at all. Also we've got favorite/vote mechanics rather than review/rate on a scale, so it'd take some adjustment.

Type-kun said:
Just more things to consider. Our end goal is to determine the probability of user approving a "good" post. Post can be considered "good" using something along the lines of the concept of "score normalization", which sounds simple at first - both scores and favorites are counted; the more popular copyright (or, maybe a character?) is, the less score weighs; comic scores weigh more than others; explicit posts weigh less than safe; very recent posts weigh more than older.

The "score normalization" idea has merit, but I think there's also another (simpler?) approach. Rather than asking "What is the quality of post #x, compared to all other posts, compensating for 'popularity factors' of copyright/explicit-rating/comic-status/etc etc" which is computationally hard, why not ask "What is the quality of post #x, compared to its peers that have the same popularity factors?"
(In case you think the questions look off-topic: post quality + approver --> does user approve good posts?)

So, given post #x, we could say things like:

  • its score is at the nth percentile of all images
    • baseline. As we already know, this suffers from inflation/deflation due to presence/absence of popularity factors
  • oh, it's rating:e? it's at the nth percentile of those
    • this is where you can say things like, its score may be >75% of all images, but only >45% of explicit posts; it's probably mediocre, just sexual content boosting the score there. [I'm guessing roughly at reasonable percentiles; keep in mind there's probably many low scoring posts and a long tail as you get to very high post scores]
  • oh, it's rating:s?
    • the opposite case. helps better evaluate posts that suffered under score inflation of posts with rating:e/rating:q
    • unfortunately something like this isn't possible/practical for copyrights.
  • oh, it's touhou? it's at the nth percentile out of 457k images
    • again, a post that looks to be high-scoring against the general population of posts may turn out mediocre vs its peers
    • there's nothing like rating:s, but we could basically do this for all of the largest copyrights. Rather than having to argue about which copyrights grant a popularity boost, just let the numbers speak for themselves.
  • where sample size permits, we can even evaluate for combinations of factors:

...you get the idea.

You don't try to combine all the numbers for a given post, though. That doesn't actually make much mathematical/statistical sense.

What you want to do is couple the above per-post evaluation with user/approver information.
With that, you can gauge the quality of their uploads/approvals:

  • user X has N0 uploads/approvals, they are on average at the n0th percentile of all images
  • N1 are touhou, and they are on average at the n1th percentile versus their 457k peers
  • N2 are rating:e ...
  • ...
  • you can even track performance over time, calculating on windows of, say, monthly or quarterly

I glossed over what it means when I say "on average".

  • It is not averaging the scores of the posts, then pegging that at some percentile of the population.
  • It is 'averaging' the percentiles of the posts.
    • e.g. for each of the N1 touhou posts for user X, take their individual percentile measure as evaluated against their 457k peers, then 'average'.
  • As for 'average', I think it'd be better to use harmonic mean rather than arithmetic mean. This would reward consistency and reduce outlier effects.
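A toy comparison (assumed percentile values) shows why the harmonic mean rewards consistency: a single low-percentile post drags the harmonic mean down much harder than it drags the arithmetic mean.

```python
# Toy percentiles showing harmonic vs arithmetic mean behavior.
from statistics import harmonic_mean, mean

consistent = [0.70, 0.72, 0.68]  # steadily decent approvals
spiky = [0.95, 0.95, 0.10]       # mostly great, one dud

print(round(mean(consistent), 2), round(harmonic_mean(consistent), 2))  # 0.7 0.7
print(round(mean(spiky), 2), round(harmonic_mean(spiky), 2))            # 0.67 0.25
```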

I think that at a basic level, this would be usable and informative enough, without requiring arcane statistics or too much implementation effort. Rather than trying to compensate for factors, it grants awareness -- like the numbers/statistics NWF Renim was throwing out earlier in the thread. For presentation, represent the percentile scores as horizontal bars, perhaps with red-yellow-green color scale -- quick and easy visual comparison and comprehension to see how the factors come into play.

Compared to a full-blown compensate/normalize approach, this won't require figuring out how much to compensate or normalize. Just keep a score-to-percentile lookup for a (hopefully reasonably-sized) set of popularity factors and combinations thereof, and update it periodically. It doesn't have to be precise granularity: 5% intervals will probably suffice. It can be publicly available - users that care could figure out how their uploads measure up. Running the numbers for users is still going to be a good bit of number crunching, but it shouldn't be too frequent. As part of janitor trials, this might be recomputed fortnightly or monthly at worst. It can also be used as part of the contributor promotion test - though you'd probably want to pre-screen based on coarse criteria first to reduce the number of candidates. It's not really necessary to store or update any additional per-post information, although it may be useful to temporarily cache the numbers when crunching both janitor trial and user promotion reports.
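The horizontal-bar presentation could be prototyped in a few lines; this text-mode version is just a mock-up of the idea (labels and percentile values are made up):

```python
# Text-mode mock-up of per-factor percentile bars for a single post.

def bar(pct, width=20):
    filled = round(pct * width)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {pct:.0%}"

for label, pct in [("all posts", 0.78), ("touhou", 0.55), ("touhou comic", 0.31)]:
    print(f"{label:>12} {bar(pct)}")
```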

Shinjidude said:
If it was to be done at all I think some assumptions would need to be made. Namely that we assume roughly normal distributions for quality, and normalize based on the means of the post scores for that tag.

Eh, I suspect image quality and post scores both have skewed distributions; post scores likely have a long tail. Just a hunch.



I understand your percentile-to-peers approach; it might be a good idea. However, let me clarify something.

So, the post is touhou and rating:e.
Do we only gauge it against its touhou peers, or against rating:e peers?
Or do we gauge the post against all peer groups, storing the lowest/average percentile?
Or do we gauge every post against every possible peer group, then store average percentile per group, and then average them together?

Also, should the groups be set in stone (major copyrights, ratings, comics), or dynamic, based on post count per tag group aka sample size?

Plus, you don't factor in post age. A post uploaded yesterday will rank on the lower side even if it's better than average.

Also, I currently see a problem. How exactly do we determine that post X is in the Nth percentile of touhou posts? As I understand it, we'll need a metric like "1% of touhou posts have a score above 50, 2% above 45", etc., and then check against it? That sounds good, but gathering such metrics is pretty expensive. You essentially have to go over every post in the peer group with a "group by" query and some analytical functions on top. Perhaps even twice - for score and for favorites. It will take a dozen seconds on an average group and could take a minute or more on large ones. Such a calculation cannot be done for every post, so it'll have to be cached, stored somewhere, and updated periodically.
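The per-post lookup itself doesn't have to be expensive, though: if each peer group's scores are sorted once during the periodic refresh, every subsequent score-to-percentile lookup is just a binary search. A sketch, with a toy score list standing in for a real peer group:

```python
# Cheap score-to-percentile lookups: sort each peer group's scores once
# (the expensive, periodically-refreshed step), then each lookup is a
# binary search against the cached sorted list.
from bisect import bisect_left

def build_lookup(peer_scores):
    return sorted(peer_scores)

def percentile(lookup, score):
    """Fraction of the peer group scoring strictly below this score."""
    return bisect_left(lookup, score) / len(lookup)

# Toy scores standing in for e.g. the touhou peer group.
lookup = build_lookup([0, 1, 1, 2, 3, 5, 8, 13, 21, 50])
print(percentile(lookup, 13))  # 0.7 -> beats 70% of its peers
```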

Type-kun said:
So, the post is touhou and rating:e.
Do we only gauge it against its touhou peers, or against rating:e peers? [both]
Or do we gauge the post against all peer groups [yes], storing the lowest/average percentile? [no]
Or do we gauge every post against every possible peer group [yes], then store average percentile per group, and then average them together? [probably not]

You'd gauge against each possible peer group; they're individually informative.
peer group = any combination of popularity factors that meets sample size requirements

I'll use a fuller example: suppose you have a post that is touhou and comic and rating:e.

  • evaluate the 'baseline' (vs all posts)
  • separately evaluate against each of the three: touhou, comic, rating:e
    • no, don't try to average these three numbers together; doing so is meaningless.
  • also evaluate against peers in: touhou comic, touhou rating:e, comic rating:e
    • e.g. touhou comic gives you a way to gauge the interplay of two factors compared to just touhou and comic individually
  • if sample size permits, you could even go beyond just pairs of factors.
    • the peer group touhou comic rating:e (1200+ posts) sure is big enough.
    • probably diminishing returns in utility, though...

In this example, you could in theory evaluate up to 8 different percentiles for the same post.
And again, don't average these percentiles within the post; it doesn't make sense. You're supposed to compare them for information (e.g. a post is >75% of touhou, but only >45% of touhou rating:e), not combine them.
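For illustration, the full set of peer groups for such a post can be enumerated mechanically (a rough Python sketch; the factor names are just the ones from the example above):

```python
from itertools import combinations

# The three popularity factors of the example post; the empty
# combination stands in for the 'baseline' group (all posts).
factors = ["touhou", "comic", "rating:e"]

peer_groups = [
    frozenset(combo)
    for r in range(len(factors) + 1)
    for combo in combinations(factors, r)
]
```

That yields 2^3 = 8 groups: the baseline, three singles, three pairs, and the full triple.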

In fact, I don't think the percentiles even need to be stored persistently for every post. Scores are volatile, so the percentiles would no longer accurately reflect a post's current relative quality once its score changes.

You can discard the percentiles once you're done with your number crunching for janitor trial or user promotion reports.
Here's what the janitor trial code might look like (for contributor promotions, substitute "candidate for promotion" and "uploaded post")

pseudocode
for each approver {
   for each peer group {
      for each approved post in the peer group {
         get score of the post
         look up percentile with respect to peer group
      }
      calculate harmonic mean of percentiles
      store the harmonic mean
      discard percentiles
   }
}

Since each post has just one approver (or uploader), you can discard the per-post percentiles immediately. The only reason you might want to cache them temporarily is if you know that you're going to run both janitor trial and user promotions in succession, so the values could be reused. (actually, do user promotions first -- smaller cache!)
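As a rough Python rendering of that pseudocode; `approved_posts` and `percentile_of` are hypothetical stand-ins for whatever the query layer and the cached score-to-percentile lookup would actually provide:

```python
from statistics import harmonic_mean

def approver_report(approvers, peer_groups, approved_posts, percentile_of):
    """Sketch of the janitor-trial loop above (illustrative only).

    approved_posts(approver, group) -> iterable of post scores in that group
    percentile_of(score, group)     -> percentile as a fraction in (0, 1]
    """
    report = {}
    for approver in approvers:
        for group in peer_groups:
            percentiles = [
                percentile_of(score, group)
                for score in approved_posts(approver, group)
            ]
            if percentiles:
                # harmonic mean punishes low outliers; needs values > 0
                report[(approver, group)] = harmonic_mean(percentiles)
            # the per-post percentiles go out of scope (discarded) here
    return report
```

The per-post percentiles only live inside one loop iteration, matching the "discard percentiles" step of the pseudocode.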

-

Also, should the groups be set in stone (major copyrights, ratings, comics), or dynamic, based on post count per tag group aka sample size?

I think (major copyrights, ratings, comics) is a good starting point as far as popularity factors go. More than that is up for discussion. Sample size needs to be met, of course, but I think the bigger consideration is managing compute cost. The number of possible peer groups goes up as you get to combinations of copyright+rating, copyright+comic, comic+rating, etc., so limiting scope and keeping the number of factors manageable is important. [Incidentally, I don't think copyright1+copyright2 would be useful, so those can probably be ruled out.]

-

Plus, you don't factor in post age. A post uploaded yesterday will be on the lower side even if it's better than average.

Oh, yes, absolutely. New posts shouldn't count; a post probably needs about a week for its score to settle somewhere more stable/representative.

-

How exactly do we determine that post 'X' is in the Nth percentile of touhou posts? As I understand it, we'll need a metric like "1% of touhou posts have a score above 50, 2% above 45", etc., and then check against it? That sounds good, but gathering such metrics is pretty expensive. <snip> it'll have to be cached, stored somewhere and updated periodically.

Yes, this is what I meant when I mentioned a 'score-to-percentile lookup'. Like I said previously, granularity can be 5%, or maybe even coarser; 1% granularity is definitely overkill. In fact, you could even just use quartiles, or a few arbitrary percentiles we decide upon (cf. the user promotion report, which currently gives confidence intervals for score:3+ and score:6+, two arbitrary thresholds). Whatever we find to be a good tradeoff between utility and computation cost.

Now, I suggested discarding the percentiles... but this lookup table is different. You want to hang on to it, not only because you need to reference it repeatedly when crunching janitor trials or promotion reports, but also because it's an informative snapshot of the score distribution for each peer group. You might only update it, say, fortnightly, but I don't expect the overall score distribution of posts in the peer groups to change too drastically in that time frame. So, if I wanted, I could at any point in time take the current score of any post and see how it measures up against the most recent score distribution for the applicable peer groups.
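As a minimal sketch of what building and querying such a lookup table could look like in Python (5% granularity, i.e. 20 buckets; all names here are illustrative, not an actual implementation):

```python
import bisect
from statistics import quantiles

def build_lookup(scores, buckets=20):
    """Cut points dividing a peer group's scores into 20 equal buckets (5% each)."""
    return quantiles(scores, n=buckets, method="inclusive")

def percentile_of(score, cuts):
    """Fraction of the peer group at or below this score, to 5% granularity."""
    return bisect.bisect_right(cuts, score) / (len(cuts) + 1)
```

At 5% granularity the table is just 19 cut points per peer group, so it's cheap to store and to refresh on whatever schedule (e.g. fortnightly) proves sufficient.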

