Donmai

[bulk] Bad ID/Image Sample

Posted under Tags

mass update bad_id -> bad_id bad_pixiv_id

See topic #13533 for the discussion on this.

This is the first part of a 2-part BUR.

The second part will add the following implications:

imply bad_pixiv_id -> bad_id
imply bad_twitter_id -> bad_id
Edit:
  • (21 Feb 2017) Changed title to include image sample requests

The bulk update request #1074 has been approved.

The bulk update request #1103 has been approved.

Updated by DanbooruBot

BrokenEagle98 said:

...and with a click of his mouse, Type-kun has already cemented his status as the #1 tagger for the year 2017... (ツ)

...and ensured Danbooru will be lagging for the next 20 hours, or so it seems. Oh boy, I thought it would go much faster than that.

Hmm. It seems that the job is restarting periodically, but times out after a while. The only thing I'm worried about is whether the search is restarted anew or not. If it is indeed restarted, then with every new run of the job, it will have to load and skip all posts already tagged with bad_pixiv_id, wasting more and more time. The antecedent should have looked like bad_id -bad_pixiv_id to avoid this, but I can't stop the job once it has started...

Also BrokenEagle98, didn't you limit API writes to 1000 an hour or so? It'll take a while to retag this with bots at this rate.

@albert, please take a look at the server load; it may be better to stop the job at some point.

Type-kun said:

Also BrokenEagle98, didn't you limit API writes to 1000 an hour or so? It'll take a while to retag this with bots at this rate.

As I mentioned in issue #2693...

Neither API writes nor reads are currently being counted... so technically, everybody has an infinite amount of API calls per hour... :/

Regardless, RaisingK raised an issue about the low write limit, so Albert raised the limits for Platinum+ to 5000 writes / 50000 reads.

Type-kun said:

Hmm. It seems that the job is restarting periodically, but times out after a while.

It looks like tag aliases and implications ignore timeouts, but mass updates don't. Probably an oversight.

The only thing I'm worried about is whether the search is restarted anew or not. If it is indeed restarted, then with every new run of the job, it will have to load and skip all posts already tagged with bad_pixiv_id, wasting more and more time. The antecedent should have looked like bad_id -bad_pixiv_id to avoid this, but I can't stop the job once it has started...

I think it is restarted. At least, I don't see anything that tracks where a job failed and resumes from that point.

I don't see an easy way to stop a running job. I think the job can be deleted from the db if it isn't currently running, but I don't think delayed_job has a way to stop a running job. Instead I think you'd have to set a `cancelled` flag somewhere and have the job check that flag periodically.
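To make the flag idea concrete, here is a minimal sketch in Python rather than Ruby. It is not how delayed_job or Danbooru works; `CancellableJob`, `update_post` and the batch size are all hypothetical names chosen for illustration, and a real job would most likely read the flag from the database instead of from memory.

```python
import threading


class CancellableJob:
    """Hypothetical long-running job that checks a cancel flag between batches."""

    def __init__(self, post_ids, batch_size=100):
        self.post_ids = list(post_ids)
        self.batch_size = batch_size
        self.cancelled = threading.Event()  # stands in for a `cancelled` column
        self.processed = []

    def cancel(self):
        # Request cancellation; the job stops at the next batch boundary.
        self.cancelled.set()

    def run(self, update_post):
        """Call update_post(post_id) for every post; return False if cancelled."""
        for start in range(0, len(self.post_ids), self.batch_size):
            if self.cancelled.is_set():
                return False
            for post_id in self.post_ids[start:start + self.batch_size]:
                update_post(post_id)
                self.processed.append(post_id)
        return True
```

The flag is only checked once per batch, so a cancel request takes effect after the current batch finishes rather than immediately. Recording `processed` would also give a restarted job a place to resume from.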

Bots would be slower but that might be a good thing for server load...

I've deployed the MD5 change although I'm not sure how much that'll help.

I've set the attempts on that particular job to 1000 which should disable it for now.

For a big change like this, I would suggest breaking it up into chunks. You can do this by adding an id:1..50000 search condition and splitting the request into multiple changes, one per id range. This is kind of tedious and perhaps should be automated, but it's one way to get around the request repeating all the time.
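As a rough sketch of how the chunking could be automated, the Python below generates one mass update line per id range. The function name `chunked_mass_updates` and the `max_id` value are assumptions for illustration, and the exact request syntax the site accepts for a ranged antecedent is not confirmed here. The example also folds in the -bad_pixiv_id exclusion suggested earlier in the thread.

```python
def chunked_mass_updates(antecedent, consequent, max_id, chunk_size=50000):
    """Return one 'mass update' line per id range, lowest ids first.

    max_id is a hypothetical upper bound on post ids; the last chunk is
    clipped so it never extends past that bound.
    """
    lines = []
    for start in range(1, max_id + 1, chunk_size):
        end = min(start + chunk_size - 1, max_id)
        lines.append(f"mass update {antecedent} id:{start}..{end} -> {consequent}")
    return lines


# Example: split the request from this topic into 50000-post chunks.
for line in chunked_mass_updates("bad_id -bad_pixiv_id", "bad_id bad_pixiv_id", 120000):
    print(line)
```

Each generated line only searches its own id range, so a chunk that times out and restarts repeats at most one range instead of the whole search.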

evazion said:

Bots would be slower but that might be a good thing for server load...

Well, I've started tagging from my end, and it does ~200 posts/minute, so it'll get through ~12K per hour...

Not as fast as the server like you said, but that's not necessarily a bad thing...

Just for reference, I've started from the most recent posts and am working back from there... so someone else could start from the oldest and work forward...

Well, apparently I was wrong about the API limit, as I started getting 429 errors ("Too Many Requests") after ~6500 tag changes.

Still, my user JSON page shows that I have used no API requests, either read or write. So it looks like the restriction is in place and working, but the information about it on the user page is incorrect...?

{"id":23799,"name":"BrokenEagle98","level":32,"inviter_id":13392,"created_at":"2007-12-31T04:13:18.602Z","base_upload_limit":50,"post_update_count":149496,"note_update_count":24233,"post_upload_count":6699,"wiki_page_version_count":1937,"artist_version_count":3896,"artist_commentary_version_count":1436,"pool_version_count":1397,"forum_post_count":1060,"comment_count":1360,"appeal_count":4,"flag_count":30,"positive_feedback_count":10,"neutral_feedback_count":1,"negative_feedback_count":1,"is_banned":false,"can_approve_posts":false,"can_upload_free":true,"is_super_voter":false,"level_string":"Builder","remaining_api_hourly_limit":50000,"remaining_api_hourly_limit_read":50000,"remaining_api_hourly_limit_write":5000}

Can someone else confirm this...? If the above is true, then an issue should be submitted for it, since otherwise I'll have to have my script blindly guess when it's okay to start tagging again... :/

Edit:

Or maybe not... after letting it rest for a couple of minutes and restarting it, it's going at full speed again, so... ¯\_(ツ)_/¯

It looks like mine beats yours, @BrokenEagle98, judging from the dummy edits in your history and lack of them in mine. :)

I got the "429 Too Many Requests" error, too. Throttling is a 421 status, though, so this is a different limit we're hitting...?

RaisingK said:

It looks like mine beats yours, @BrokenEagle98, judging from the dummy edits in your history and lack of them in mine. :)

Ah... I see from your tag history that you're also working from most recent -> older posts, so I switched mine to oldest -> newer posts. That way we won't both be contending over the same group of posts...

Edit:

I've added a 5 min timeout every time it 429's... that way I can take my hand off the wheel and just let my script tag as much as it can per hour...
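For anyone wanting to do the same, a pause-on-429 loop could look roughly like the Python sketch below. This is not the actual script; `send_update` is a hypothetical callable that performs one tag edit and returns the HTTP status code, and the 300-second default simply mirrors the 5 minutes mentioned above.

```python
import time


def tag_with_cooldown(post_ids, send_update, cooldown_seconds=300, sleep=time.sleep):
    """Apply send_update to each post id, pausing whenever the server answers 429.

    send_update(post_id) is a hypothetical callable returning an HTTP status.
    Any status other than 429 is treated as done in this sketch.
    Returns the number of edits completed before each 429, plus the final run.
    """
    edits_between_limits = []
    count = 0
    for post_id in post_ids:
        while True:
            status = send_update(post_id)
            if status != 429:
                break
            # Rate limited: record how far we got, wait, then retry this post.
            edits_between_limits.append(count)
            count = 0
            sleep(cooldown_seconds)
        count += 1
    edits_between_limits.append(count)
    return edits_between_limits
```

The returned list makes it easy to see how many edits went through before each cutoff. The `sleep` parameter is only there so the wait can be swapped out when trying the loop without a real server.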

Edit2:

Since I started keeping track of how many edits I can get in before it 429's, the point where the API write limit cuts me off seems inconsistent. A couple of times, it's been 5000 like it's supposed to be, but once it kept going to 6500, and another time it kept going until 8650... ¯\_(ツ)_/¯

Edit3:

Commented on issue #2821 about the above. Basically, Hijiribe and Sonohara currently track API limits separately...

create implication bad_pixiv_id -> bad_id
create implication bad_twitter_id -> bad_id
create implication bad_tumblr_id -> bad_id
create implication bad_nicoseiga_id -> bad_id
create implication bad_tinami_id -> bad_id
create implication bad_nijie_id -> bad_id

Follow-on request... also, even though Nijie and DeviantArt haven't been done yet (Tinami is only partially done), it's simpler if they're just added now. Can anyone think of any other sites that should be added...?

Edit:
  • (2017-01-06) Added suggested sites from Mikaeri.
  • (2017-01-06) Per discussion on page 2, removed all entries that did not have at least 50 tagged posts
  • (2017-01-09) Added bad nijie id since tag count now exceeds 50
