Donmai

Ratings check thread

Posted under General

feline_lump said:

Partially visible vulva should be Q.

Edit: This is apparently a much more common error than I thought. Just finished fixing a lot of "safe" cameltoes.

Several of the rating:s partially_visible_vulva posts were caused by Black Fox, who sometimes even added the tag and changed the rating to safe at the same time.

@Black_Fox, please check the Rating Guidelines again. pussy_juice and unambiguously portrayed sex are explicit (unless “portrayed in a restrained and tasteful manner”); cameltoe, covered_nipples, partially_visible_vulva and pubic_hair are questionable, unless they’re very minor and not obvious.

For example, post #3159298 and post #3051660 are questionable, post #3161672 is explicit.

Anyone disagree?

Flopsy said:

[…] but I'd probably also have rated post #3161672 Q instead of E. The sex is kinda peripheral in the image and it doesn't show the mechanics of intercourse.

I chose that post as an example because it’s easy to get it wrong. Even if it’s kinda peripheral, I think it still falls under “openly and unambiguously portrayed intercourse”, as the rating guidelines call it.

Could someone else also have a look at these changes? Almost 400 rating changes in less than six hours, with no previous tag edits and no related forum posts, looks pretty suspicious.

Most changes seem to be borderline cases that could go either way, and some are fine by today’s standards, but some seem definitely wrong, such as changing suggestive_fluid and obvious cameltoes to safe and pussy_juice to questionable.

I was still writing it up, but the summary there is that, as part of my Danbooru2017 project (https://www.gwern.net/Danbooru2017 ; announced here: https://danbooru.donmai.us/forum_topics/8276?page=4 ), I've been working on a NN for classifying images by s/q/e. It's an easy place to start and it has the benefit of increasing the size of the SFW subset for future releases. (It could also be used for something like https://open_nsfw.gitlab.io/ which would be funny.)

I've gotten up to 85% accuracy, but further improvements seem to be stalling out, and my hypothesis is that this reflects label noise: borderline cases that are mislabeled, which damages learning & produces a misleading accuracy figure (i.e. accuracy is lowered when the NN predicts the 'wrong' category for a mislabeled image).

So I have been extracting the mistakes from the NN, ranking them by confidence (most mistaken to least mistaken), and manually reviewing them to fix the category as appropriate, using some simple scripting to commit the changes semi-automatically. Source code for the Danbooru2017 preprocessing, training the NN, and doing the label cleaning: https://pastebin.com/PVWF12kc
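The ranking step described above can be sketched roughly as follows. This is not the code from the pastebin link; the function name `rank_suspected_mislabels`, the s/q/e class order, and the shape of the inputs are all assumptions made up for illustration. It only produces a review list and does not edit anything.

```python
import numpy as np

# Assumed class order for this sketch; the real model may order classes differently.
RATINGS = ["s", "q", "e"]

def rank_suspected_mislabels(post_ids, probs, labels):
    """List posts where the prediction disagrees with the current label,
    sorted from most confident disagreement to least.

    post_ids: sequence of post IDs (hypothetical input)
    probs:    (n_posts, 3) array of predicted class probabilities
    labels:   sequence of current label indices into RATINGS
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)   # most likely rating per post
    confidence = probs.max(axis=1)     # probability of that rating
    wrong = np.flatnonzero(predicted != labels)
    # Stable sort on negated confidence: highest confidence first.
    order = wrong[np.argsort(-confidence[wrong], kind="stable")]
    return [
        (post_ids[i], RATINGS[labels[i]], RATINGS[predicted[i]], float(confidence[i]))
        for i in order
    ]
```

Each tuple is (post ID, current rating, predicted rating, confidence). A human reviewer would then work down the list from the top and decide case by case whether the label or the prediction is the one that is wrong.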

What you see there are the ~400 changes from reviewing the 10k images which happen to be in the validation set (the validation set is a set of 10k images held out from training, and used for estimating the final 85% accuracy), focusing on images labeled Q, since my theory is that as Q is the default, it'll tend to be where the most mistakes are as uploaders leave images at the default setting and S->Q or E->Q mistakes would be the least noticeable kind (it would be surprising if there were a lot of S<->E mistakes! people would notice those quickly). Right now I'm computing the predictions on the training dataset of several hundred thousand images.

I admit I have probably made some mistakes. I've read the rating guide several times and checked back while doing my reviewing, but I am still confused about some things. For example, the rating guide says that 'tasteful lingerie' is S but also that 'partial undressing' or ripping is Q, although it seems like most cases of 'undressing' show a lot less skin than lingerie does. What about panty flashing or exposure: does lifting up a skirt count as 'partial undressing' even though no clothing has been physically removed? And when does 'frontal nudity' (Q) cross over into 'blatantly exposed genitals' (E)? There are a lot of cases of a single female character fully nude with visible genitals where both the NN and I are unsure whether it should be Q or E. I thought I was careful about suggestive_fluid since that seemed straightforward enough, but a few were hard to spot and tripped me up, so I probably missed some.

Let's see... for 'suggestive_fluid', I see post #1206473. Is that really 'questionable'? I remember pausing on this one but deciding it was S - it looks like rain to me and there's nothing in it to put drops into a sexual context, no other characters or locations or nudity or anything, it's just rain in a forest. For 'pussy_juice', post #368696/post #2844636/post #2426204/post #1627583/post #1079584 were definitely mistakes on my part & I reverted the rest. I either didn't notice it or forgot any fluid makes it E. I'll be more careful about that. 'camel_toe': post #1038909/post #335835/post #2780486/post #986695/post #584441/post #989128/post #1599216/post #2597816/post #1146862/post #1512925/post #184382/post #685035/post #1536489/post #227508/post #2104755/post #447254/post #443494/post #1271792, often hard to see but still there so I've reverted; but post #1484482/post #1572003/post #530994: I don't think that's a camel_toe, that's a crease in the panties.


I'm no mod, but I'd strongly recommend not running any rating bots on Danbooru itself without bringing it up for discussion here first. That kind of thing will always involve judgment calls (as you yourself seem to be discovering), and an automated process can create a false sense of objectivity. The rating guidelines are only guidelines, and contain clauses that can be interpreted in self-contradictory ways. It's not a good idea to use them as the basis of an algorithm that proceeds to auto-rate every image on the site.

Your invention might actually be useful if it was integrated with the upload process, setting the default rating on each post, but I don't think it should be allowed to overrule ratings consciously made by human users.

Edit: Another potential use would be to let the algorithm generate lists of "potentially misrated" images, which could then be reviewed by human users.

I think you misunderstand. I am definitely not running any automated bots on Danbooru. 85% accuracy is not nearly good enough for fully automated editing*, and even if it was, I would at least ask Albert's permission beforehand like I did for creating Danbooru2017. (Integrating with the upload process is certainly a possible use for it for setting the default, but I would still prefer >>85% accuracy, especially as I understand it is annoying to deploy NNs to servers so I would want a NN which is worth the hassle for Albert.)

Edit: Another potential use would be to let the algorithm generate lists of "potentially misrated" images, which could then be reviewed by human users.

That is precisely what I am doing. All the CNN is doing is providing a list of images sorted by how confused it is by them; the decision to change the label is mine, and the responsibility for the mistakes above is also mine. I use the API, but I could just use the IDs to open up the page in a web browser and edit it there; it'd amount to the same thing - just slower. If you look at the source, you'll see that there's nothing in the `feh` part which would make edits fully automatically; it requires user action. (If you're familiar with Wikipedia terminology, this is equivalent to a "semi-automated" bot: it proposes edits to the user but does not make them unless specifically told to.)

* Well, as is. There's probably a confidence threshold where the false positive rate is acceptable, like >99% probability, but I haven't looked into that yet.
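One way such a threshold might be looked for, as a rough sketch only: on held-out images whose labels have been checked, find the lowest confidence cutoff at which the predictions at or above the cutoff are still right often enough. The function name `pick_confidence_threshold` and the inputs are hypothetical, and "precision of the accepted predictions" is used here as a stand-in for an acceptable false positive rate.

```python
import numpy as np

def pick_confidence_threshold(confidence, is_correct, target_precision=0.99):
    """Return the lowest cutoff c such that, among held-out predictions with
    confidence >= c, the fraction that are correct is at least
    target_precision. Returns None if no cutoff qualifies.

    confidence: predicted-class probability per held-out image
    is_correct: whether that prediction matched the (reviewed) label
    """
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    if confidence.size == 0:
        return None
    order = np.argsort(-confidence, kind="stable")  # most confident first
    sorted_conf = confidence[order]
    hits = np.cumsum(is_correct[order])
    precision = hits / np.arange(1, len(order) + 1)
    # Only cut between distinct confidence values, so tied predictions
    # are either all accepted or all rejected.
    group_end = np.append(sorted_conf[1:] != sorted_conf[:-1], True)
    ok = np.flatnonzero((precision >= target_precision) & group_end)
    if ok.size == 0:
        return None
    return float(sorted_conf[ok[-1]])
```

The cutoff is only as trustworthy as the held-out labels it was tuned on, so it would make sense to compute it after the validation labels have been reviewed, and to re-check it on fresh data before relying on it.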


gwern-bot said:

What you see there are the ~400 changes from reviewing the 10k images which happen to be in the validation set (the validation set is a set of 10k images held out from training, and used for estimating the final 85% accuracy), focusing on images labeled Q, since my theory is that as Q is the default, it'll tend to be where the most mistakes are as uploaders leave images at the default setting and S->Q or E->Q mistakes would be the least noticeable kind (it would be surprising if there were a lot of S<->E mistakes! people would notice those quickly).

There’s no default rating anymore. It was Q before, but nowadays there’s nothing selected by default and the uploader has to pick a rating. Without picking a rating, the upload fails. I fully agree that Q misratings are less noticeable and S↔E misratings are usually fixed pretty quickly.

Let's see... for 'suggestive_fluid', I see post #1206473. Is that really 'questionable'? I remember pausing on this one but deciding it was S - it looks like rain to me and there's nothing in it to put drops into a sexual context, no other characters or locations or nudity or anything, it's just rain in a forest.

It’s either suggestive_fluid and questionable or safe and not suggestive_fluid. If you think it’s safe and not suggestive fluid, you need to remove the tag as well. If you’re not sure, just post here. That’s what this thread is for.

'camel_toe': […]

The one in post #1038909 is so minor that it probably qualifies for a safe rating nowadays.

post #584441 and post #685035 also have rather minor cameltoes but also come with covered_nipples, which is also questionable unless very minor.

Most of the others seem quite right at questionable.

[…] post #447254 […], often hard to see but still there so I've reverted;

That’s not really “hard to see”. I guess you didn’t check all the posts when you made the list because that’s one that was explicit before, not safe. I changed it back to explicit because it goes beyond a cameltoe and it’s pretty much everything shoved right at the camera in full detail, just with a skin color swap. (Anyone who strongly disagrees can change it back again.)

but post #1484482/post #1572003/post #530994: I don't think that's a camel_toe, that's a crease in the panties.

A crease in the panties right there is pretty much the definition of a cameltoe, isn’t it? First one definitely qualifies. Second one is considered minor enough for a safe rating nowadays. Third one might not qualify for cameltoe, but it’s at least questionable anyway. I’d rate it explicit, but others don’t nowadays, I guess.

gwern-bot said:

* Well, as is. There's probably a confidence threshold where the false positive rate is acceptable, like >99% probability, but I haven't looked into that yet.

Even at 99%, with 3M posts on Danbooru, that’s 30k misrated images. I won’t claim that users rating everything manually have better accuracy, though, because they probably don’t. ;-)
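The 30k figure is just the error rate times the post count. A tiny sketch of that arithmetic, assuming (hypothetically) that one accuracy number applies uniformly to every post on the site, which a real classifier would not guarantee:

```python
def expected_misrated(total_posts, accuracy):
    """Expected number of wrong ratings if `accuracy` held uniformly.

    Both arguments are illustrative: 3M posts and 99% come from the
    thread, the uniform-accuracy assumption does not.
    """
    return round(total_posts * (1 - accuracy))

at_99_percent = expected_misrated(3_000_000, 0.99)  # 1% of 3M posts
at_85_percent = expected_misrated(3_000_000, 0.85)  # 15% of 3M posts
```

Under the same assumption, the current 85% accuracy would correspond to roughly fifteen times as many misrated posts as the 99% case.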

Your project sounds quite interesting, but wildly changing the ratings of images your NN has problems with will probably not score you any brownie points, even if you review it manually. Considering the speed at which you’ve been changing those ratings, I doubt you took much time to actually review each image and check its tags.

Some images, especially some older ones, are obviously misrated and fixing those will be appreciated, but if you encounter borderline cases, it’s better to bring them up here.
