Donmai

How to find missing cosplay tags

Posted under General

It's possible using the BigQuery data dump from topic #12774. Not particularly easy, to be fair, but possible. Here's the query, it's effectively a *_(cosplay) -cosplay search:

BigQuery: https://bigquery.cloud.google.com/savedquery/657582419813:dbddc4af3565484c9773d076744061be

SELECT
  CONCAT("post #", STRING(id)),
  CONCAT("http://danbooru.donmai.us/posts/", STRING(id))
FROM [danbooru-data:danbooru.posts]
WHERE
  id     IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE REGEXP_MATCH(tags.name, r'.*_\(cosplay\)')) AND
  id NOT IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE tags.name = 'cosplay') AND
  TRUE

And here are the results: https://docs.google.com/spreadsheets/d/1UXnVGTQ6DUkMh5elzWaEf7luan3MxYXLGIHVb-EzJWw/edit?usp=sharing. Turns out there are a lot, 5300 posts. It will probably take a script to fix all these.

So there are ~6000 *_(cosplay) tags, but only ~150 with more than 50 posts. I suppose implicating those wouldn't be unmanageable. Although I'm not sure how much it would help, I think the long tail of *_(cosplay) tags with only a few posts are where most of the missing cosplay tags come from.

Just finished tagging all of the *_(cosplay) tags with cosplay.

I built my own mass updater a while ago, so things like this are pretty easy, plus it takes in the same format as the bulk_update_requests, i.e. it handles the imply, alias, or update format, although imply and alias are really just a shortcut for certain forms of update.

Edit:

Forgot to make a note of it, but it turned out to be ~5000 posts that ended up getting edited. Since *_(cosplay) is such a special tagging (in that it's acknowledged that every *_(cosplay) should also have cosplay), should some kind of daily or weekly maintenance task be set up to specifically handle this tagging issue ...? I can't see developing a wildcard implication function to handle this solitary issue, yet the amount of untagged posts will just continue to pile up again.

Updated

Here's the converse search, cosplay -*_(cosplay) (i.e. things tagged cosplay but missing a character_(cosplay) tag):

https://bigquery.cloud.google.com/savedquery/657582419813:e5f7943e727445fabef2a72d7260f1dc
https://docs.google.com/spreadsheets/d/1PNJxFCwdvC1L-6eiR0A6mSIJCPGPzMIcGA9IDMn9wto/edit?usp=sharing

SELECT
  CONCAT("post #", STRING(id)) AS post,
  CONCAT("http://danbooru.donmai.us/posts/", STRING(id)) AS url,
  GROUP_CONCAT_UNQUOTED(CASE WHEN tags.category = 4 THEN tags.name END, " ") AS tags
FROM [danbooru-data:danbooru.posts]
WHERE
  id NOT IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE REGEXP_MATCH(tags.name, r'.*_\(cosplay\)')) AND
  id IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE tags.name = 'cosplay') AND
  TRUE
GROUP BY
  id, post, url
ORDER BY
  id DESC

Yeesh, you just had to bump up the difficulty ... :p

Unfortunately, all of those will require manual editing. That is, unless someone can program an AI with pattern recognition capabilities equal to what humans have...

Just a quick question about issue #2753 though... does it only update upon upload, or does it also update on post editing as well...?

It works whenever a post is saved, just before implications are added. So effectively it is like a wildcard implication, except hardcoded instead of something more general. It will affect existing posts, but they will have to be force updated with a dummy edit to make them add the character tags.

evazion said:

...but they will have to be force updated with a dummy edit to make them add the character tags.

Applied my mass updater script to the "character_(cosplay) -> character" situation, so that shouldn't be necessary anymore...

1