Donmai

What's our stance on twitter samples?

Posted under General

It seems there are uploads that, when you check the source, aren't the actual full size (i.e. instead of being suffixed with :orig, the URLs are left as either :large or unsuffixed). Only about 50 are tagged under twitter sample, but I very much suspect there are more.

What's our stance on this? Should we treat these the same way we do pixiv samples? I ask because I was thinking of flagging post #2004618 as I was sweeping through yaman's uploads on Twitter earlier today, since it was a lower res upload of the full I uploaded (post #2573012). It seems really unnecessary to keep.

And perhaps another question I could add to this is whether there is a way to automate this, like how RaisingK does md5 mismatches on pixiv? Though I guess the problem is that pages aren't enumerated as they are on pixiv, so if you want to check an md5 against a "group" of images on Twitter you would have to check it against all of them and look for one match, which could be more arduous than I imagine. Plus it raises the question of whether scraping through Twitter like that is possible or whether there's some sort of limitation...
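To make the "check it against all of them and look for one match" idea concrete, here is a minimal sketch. It assumes the images of a tweet have already been downloaded into memory; the helper names `md5_hex` and `find_md5_match` are made up for illustration and are not part of any existing tool.

```python
import hashlib
from typing import Mapping, Optional


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a file's bytes as lowercase hex."""
    return hashlib.md5(data).hexdigest()


def find_md5_match(post_md5: str, tweet_images: Mapping[str, bytes]) -> Optional[str]:
    """Return the URL of the first tweet image whose MD5 equals post_md5.

    tweet_images maps image URL -> downloaded bytes (hypothetical input;
    how the bytes are fetched is left out on purpose).
    """
    wanted = post_md5.lower()
    for url, data in tweet_images.items():
        if md5_hex(data) == wanted:
            return url
    return None  # no image in the tweet matches the post's file
```

Since a tweet only holds a handful of images, the loop itself is cheap; the expensive part is fetching the files, which this sketch does not cover.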

There are most definitely more than just 50, probably just aren't tagged with it. Nowadays you can only grab the sample if you upload it from a middleman that used the sample; Danbooru automatically pulls the largest resolution regardless of the link you enter into the url bar.

Tip: Non-scans on yande.re without a source are generally twitter images.

As I suspected... I was ever so curious as to where they sauced png versions of some of his images from, but I upped them for completeness in addition to the twitter images. I noticed you tagged a few of them with waifu2x though, and that might possibly be it -- since I don't know how else they might have gotten it (unless raw file conversion, in which case would be a major facepalm).

It could be. I don't know how Marqant uploaded that post but whichever way it went it apparently ended up having a smaller resolution than the one I found. IIRC you can't really 'replace' images on Twitter in place as you can on pixiv, right?

CodeKyuubi said:

There are most definitely more than just 50, probably just aren't tagged with it.

I wonder how post #2508287 was uploaded, then? Perhaps you could say it's because he downloaded it and then upped it, but there is post #2507503 and many other posts from this time that were uploaded by him with :large or no suffix.

Is there perhaps different functionality when uploaded with the bookmarklet and using direct image URL, or was there a change made to correct image URL upload recently?

Mikaeri said:

As I suspected... I was ever so curious as to where they sauced png versions of some of his images from, but I upped them for completeness in addition to the twitter images. I noticed you tagged a few of them with waifu2x though, and that might possibly be it -- since I don't know how else they might have gotten it (unless raw file conversion, in which case would be a major facepalm).

It could be. I don't know how Marqant uploaded that post but whichever way it went it apparently ended up having a smaller resolution than the one I found. IIRC you can't really 'replace' images on Twitter in place as you can on pixiv, right?

I think you are right that it would be a pretty arduous task to check all Twitter posts here against Twitter although theoretically it could be a simple task just at mass scale. I lack the technical knowledge to do any of this but the following is a potential scheme:

  • Locate every single active post on Danbooru with a tweet URL as source
  • Check these against Twitter using the API: get the image address, generate the :orig URL, and check whether that's larger
  • Upload that :orig if larger, then change source to the tweet URL, make original its child
  • Tag original as duplicate and twitter_sample

If a source just has the image URL, I'd have the script check if the :orig is larger and upload, make original the child post, last bullet point here too.
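As a rough sketch of the "generating :orig and then checking if that's larger" step: the code below assumes size suffixes are written as a trailing `:name` on the image path, as in the URLs in this thread. The suffix list and the function names are assumptions for illustration, and fetching the image dimensions is left out.

```python
from typing import Tuple
from urllib.parse import urlsplit, urlunsplit

# Size suffixes named in this thread; the real set may be larger (assumption).
KNOWN_SUFFIXES = ("orig", "large", "medium", "small")


def to_orig_url(image_url: str) -> str:
    """Replace any known size suffix with ':orig', or append it if missing."""
    parts = urlsplit(image_url)
    path = parts.path
    base, sep, suffix = path.rpartition(":")
    if sep and suffix in KNOWN_SUFFIXES:
        path = base  # drop ':large', ':small', ...
    return urlunsplit(
        (parts.scheme, parts.netloc, path + ":orig", parts.query, parts.fragment)
    )


def is_larger(candidate: Tuple[int, int], current: Tuple[int, int]) -> bool:
    """True if the candidate (width, height) has more pixels than the current one."""
    return candidate[0] * candidate[1] > current[0] * current[1]
```

For example, `to_orig_url("http://pbs.twimg.com/media/CvG29GWWEAA_Lpv.jpg:large")` gives the same URL ending in `:orig`, and an unsuffixed URL simply gets `:orig` appended.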

Also, Twitter seems to have previously used _normal and _original in file names, or something like that.

I could probably do 1, 2, and 4 easily enough, but I'm a little bit wary of leaving uploading an image purely to an automated script. Not only can the script screw up, but if the source was maliciously changed, it could potentially upload pictures that maybe you didn't intend to upload, such as goatse... Most changes are easy to fix with a reversion or undo, but an upload is a glaring mistake potentially necessitating permanent deletions...

To do #3, I'd probably have the script display a side-by-side comparison of the Danbooru image and the "original" image, then ask for a manual confirmation before proceeding to upload...

Regardless, I'll try to work on having something working by the end of the week...

sweetpeɐ said:

I think you are right that it would be a pretty arduous task to check all Twitter posts here against Twitter although theoretically it could be a simple task just at mass scale. I lack the technical knowledge to do any of this but the following is a potential scheme:

  • Locate every single active post on Danbooru with a tweet URL as source
  • Check these against Twitter using the API: get the image address, generate the :orig URL, and check whether that's larger
  • Upload that :orig if larger, then change source to the tweet URL, make original its child
  • Tag original as duplicate and twitter_sample

If a source just has the image URL, I'd have the script check if the :orig is larger and upload, make original the child post, last bullet point here too.

Also, Twitter seems to have previously used _normal and _original in file names, or something like that.

Sounds suspiciously like the time where someone went through all the md5 mismatches, uploaded the updated ones, then started flagging all the mismatched images for deletion.

I don't need to tell you that it raised a big fuss.

Edit: Iirc, the older images got straight-deleted, and had their favorites and scores migrated to the new image, under the new uploader's name. Understandably, there was a lot of anger over that. I actually looked a bit into this, and it wasn't md5 mismatches, it was between pixiv and twitter versions. Similar, but dissimilar.

CodeKyuubi said:

Sounds suspiciously like the time where someone went through all the md5 mismatches, uploaded the updated ones, then started flagging all the mismatched images for deletion.

I don't need to tell you that it raised a big fuss.

Edit: Iirc, the older images got straight-deleted, and had their favorites and scores migrated to the new image, under the new uploader's name. Understandably, there was a lot of anger over that. I actually looked a bit into this, and it wasn't md5 mismatches, it was between pixiv and twitter versions. Similar, but dissimilar.

That sounds like an incredibly dickish uploader, wonder how long ago this was and what happened in the end. You said got straight-deleted though? As in a mod/admin literally purged the flagged Twitter images from the database altogether or just that they went marked? Apologies for the off-topic curiosity.

Anyways, I'm with 1, 2, and 4 also. But a person should be in charge of uploading the correct originals, not a bot/script. I'm sure a script would automate things much quicker, especially given that you can't replace images in place as you can on pixiv, but if there happens to be a bug or something it could cause unnecessary trouble. Pixiv md5 mismatches are different since sometimes the artist uploads a smaller res of the original, adds censoring, etc... A bunch of things can happen. Not so for Twitter, but still.

EDIT: After seeing some more opinions on the matter, I think automating this would also be acceptable.

Just a couple of quick questions...

1. What should I do with images that 404...?

post #2518161
http://pbs.twimg.com/media/CvG29GWWEAA_Lpv.jpg:orig

Should I tag those posts with bad_id...?

2. What should I do with images with the same dimensions but different filesizes...?

post #2417895
http://pbs.twimg.com/media/Clx0eucUoAAnEJO.jpg:orig

For that one in particular, it looks like someone added the "source" several days after the original post was uploaded.

http://danbooru.donmai.us/post_versions?search%5Bpost_id%5D=2417895

Should I tag those posts with md5_mismatch...?

Thanks.

1. What should I do with images that 404...?

post #2518161
http://pbs.twimg.com/media/CvG29GWWEAA_Lpv.jpg:orig

Should I tag those posts with bad_id...?

I'd say we should have bad_<site>_id tags for every site, move bad_id to bad_pixiv_id, and make bad_id an umbrella tag. Also bad id currently conflates a few different cases: the post was deleted, the post was made private, the post was made followers-only, or the post doesn't exist (the id truly is bad). I think we should distinguish between these cases because follower-only posts are potentially accessible.

2. What should I do with images with the same dimensions but different filesizes...?

post #2417895
http://pbs.twimg.com/media/Clx0eucUoAAnEJO.jpg:orig

For that one in particular, it looks like someone added the "source" several days after the original post was uploaded.

http://danbooru.donmai.us/post_versions?search%5Bpost_id%5D=2417895

Should I tag those posts with md5_mismatch...?

Hmm, the danbooru file doesn't match either the twitter sample or the original:

So I don't know if twitter is the true source there. Visually they look identical, but if you compare the JPEG compression levels the danbooru file is more compressed than the twitter files (quality level 64 versus quality level 85).

So yeah, I'd say tag it as md5 mismatch as an indicator that the source is not the exact file.

BrokenEagle98 said:

I could probably do 1, 2, and 4 easily enough, but I'm a little bit wary of leaving uploading an image purely to an automated script. Not only can the script screw up, but if the source was maliciously changed, it could potentially upload pictures that maybe you didn't intend to upload, such as goatse... Most changes are easy to fix with a reversion or undo, but an upload is a glaring mistake potentially necessitating permanent deletions...

I think @RaisingK is already doing this for pixiv samples under his RazingK account. That seems to be working out, so I don't think automation is such a bad thing. You could do some checks like comparing aspect ratios and running it through IQDB to verify the images are similar.
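The aspect ratio check could look something like the sketch below. The 1% tolerance and the function name are assumptions picked for illustration; the IQDB similarity check is a separate step and is not shown.

```python
def same_aspect_ratio(w1: int, h1: int, w2: int, h2: int, tolerance: float = 0.01) -> bool:
    """Return True if w1:h1 and w2:h2 agree within a relative tolerance.

    Cross-multiplying (w1*h2 vs w2*h1) avoids dividing, and the tolerance
    absorbs the off-by-one rounding that resized samples usually have.
    """
    if min(w1, h1, w2, h2) <= 0:
        raise ValueError("dimensions must be positive")
    a, b = w1 * h2, w2 * h1
    return abs(a - b) <= tolerance * max(a, b)
```

A check like this only rules out obviously wrong candidates (say, a source that was changed to a different picture); two unrelated images can still share an aspect ratio, which is why pairing it with an image similarity lookup makes sense.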

So just a quick status update...

I've finished with source:http://pbs.twimg.com/*

  • Images are tagged with twitter_sample iff there is a corresponding size with matching filesize and MD5 hash
  • Images are tagged with md5_mismatch if there are no corresponding sizes with matching filesizes or MD5 hash
    • That is, there are no matches with large, medium, small, etc.; the script keeps checking sizes until the Twitter image filesize < Danbooru image filesize
  • Images are tagged with bad link when the images 404
    • For any other HTTP errors (e.g. 500, 403), they are stored in a repository for later investigation
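The rules above can be sketched as a small decision function. This is not the actual script: the `Variant` record, the `"ok"` and `"needs_investigation"` outcomes, and the choice to leave a post untagged when it matches the orig size are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    """One Twitter size of an image, as fetched (hypothetical record)."""
    size: str          # e.g. "orig", "large", "medium", "small"
    status: int        # HTTP status code of the fetch
    filesize: int = 0  # bytes; only meaningful when status == 200
    md5: str = ""      # hex digest; only meaningful when status == 200


def classify(post_filesize: int, post_md5: str, variants: List[Variant]) -> str:
    """Decide the outcome for one post; variants must be ordered largest first."""
    for v in variants:
        if v.status == 404:
            return "bad_link"
        if v.status != 200:
            return "needs_investigation"  # e.g. 500, 403: store for later
        if v.filesize == post_filesize and v.md5 == post_md5:
            # Assumption: a match on the orig size means nothing needs tagging.
            return "ok" if v.size == "orig" else "twitter_sample"
        if v.filesize < post_filesize:
            break  # smaller sizes cannot match anymore, stop checking
    return "md5_mismatch"
```

Ordering the variants largest first is what makes the early `break` safe: once a Twitter size is already smaller than the Danbooru file, every remaining size will be smaller too.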

evazion said:

I think @RaisingK is already doing this for pixiv samples under his RazingK account. That seems to be working out, so I don't think automation is such a bad thing. You could do some checks like comparing aspect ratios and running it through IQDB to verify the images are similar.

Yeah, given my above scheme, I could probably immediately do automation for Twitter Samples, but not MD5 mismatches. Does IQDB have an API, and if so, where can I find documentation?

However, I'd probably want to do uploading on a separate account like RaisingK does...

Thoughts?

I'm not sure about http://iqdb.org but you could use Danbooru's IQDB instance. This is the API:

POST /iqdb_queries?url=http://i4.pixiv.net/img-original/img/2016/12/21/02/39/12/60469487_p0.png
POST /iqdb_queries?post_id=2574629

The trouble is that it currently only returns HTML. I could try to fix this though.
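For reference, building those two requests from a script might look roughly like this. It only constructs the request objects and does not send anything; the base URL is taken from the links earlier in this thread, the `iqdb_request` helper is hypothetical, and login/authentication is not covered.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

DANBOORU = "http://danbooru.donmai.us"  # base URL as used elsewhere in this thread


def iqdb_request(*, url: Optional[str] = None, post_id: Optional[int] = None) -> Request:
    """Build (but do not send) a POST /iqdb_queries request for a URL or a post id."""
    if (url is None) == (post_id is None):
        raise ValueError("pass exactly one of url or post_id")
    params = {"url": url} if url is not None else {"post_id": post_id}
    # urlencode percent-escapes the image URL so it survives as a query value.
    return Request(f"{DANBOORU}/iqdb_queries?{urlencode(params)}", method="POST")
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, keeping in mind that the response is HTML for now, so a script would have to parse the page.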

evazion said:

I think @RaisingK is already doing this for pixiv samples under his RazingK account. That seems to be working out, so I don't think automation is such a bad thing.

The decision to upload the full sizes isn't automated. I manually tell my program to re-upload each post.

evazion said:

I'd say we should have bad_<site>_id tags for every site, move bad_id to bad_pixiv_id, and make bad_id an umbrella tag. Also bad id currently conflates a few different cases: the post was deleted, the post was made private, the post was made followers-only, or the post doesn't exist (the id truly is bad). I think we should distinguish between these cases because follower-only posts are potentially accessible.

I'd be fine with the switch to bad pixiv id, but can we just keep it one tag? It's so much easier for automated tagging, and I'm not sure how much of that is apparent from the pixiv API...

RaisingK said:

I'd be fine with the switch to bad pixiv id, but can we just keep it one tag? It's so much easier for automated tagging, and I'm not sure how much of that is apparent from the pixiv API...

I'd be fine with just a singular tag for Pixiv...

Other than that, are there any objections to bad_twitter_id...? If not, I'll retag all of the bad_link posts to bad_twitter_id, and then start tagging with that going forward.

Edit:

Now that I think about it, should we also use <site>_md5_mismatch instead of md5_mismatch (e.g. pixiv_md5_mismatch, twitter_md5_mismatch, etc.)? The one primary reason to do this is that only Builder+ accounts can search with wildcards in the source: metatag.

So if a Platinum- wanted to focus on investigating just the Twitter MD5 mismatches, they'd have to pore over all of the results individually.
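If the per-site tags were adopted, picking the tag from a post's source could be as simple as the sketch below. The domain table and the function name are assumptions for illustration, and only the two sites discussed in this thread are listed.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from source domain to site name; extend as needed.
SITE_BY_DOMAIN = {
    "twimg.com": "twitter",
    "twitter.com": "twitter",
    "pixiv.net": "pixiv",
}


def md5_mismatch_tag(source: str) -> str:
    """Return '<site>_md5_mismatch' for known sites, else plain 'md5_mismatch'."""
    host = (urlsplit(source).hostname or "").lower()
    for domain, site in SITE_BY_DOMAIN.items():
        # Match the domain itself or any subdomain such as pbs.twimg.com.
        if host == domain or host.endswith("." + domain):
            return f"{site}_md5_mismatch"
    return "md5_mismatch"
```

With this, a source such as http://pbs.twimg.com/media/Clx0eucUoAAnEJO.jpg:orig would map to twitter_md5_mismatch, and anything unrecognised falls back to the existing md5_mismatch tag.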

Not from me, go for it.

evazion said:

I'd say we should have bad_<site>_id tags for every site, move bad_id to bad_pixiv_id, and make bad_id an umbrella tag. Also bad id currently conflates a few different cases: the post was deleted, the post was made private, the post was made followers-only, or the post doesn't exist (the id truly is bad). I think we should distinguish between these cases because follower-only posts are potentially accessible.

I agree with this suggestion -- it's a good way to keep track of things. Maybe you can make a BUR later for it?

EDIT: Saw RaisingK's post above and I'm thinking now it's difficult to account for those edge cases. Not sure how many of the uploaders are mypixiv'd with other artists...

BrokenEagle98 said:

Now that I think about it, should we also use <site>_md5_mismatch instead of md5_mismatch (e.g. pixiv_md5_mismatch, twitter_md5_mismatch, etc). The one primary reason to do this is that only Builder+ accounts can search with wildcards in the source: metatag.

So if a Platinum- wanted to focus on investigating just the Twitter MD5 mismatches, they'd have to pore over all of the results individually.

Is there much demand for this, though?
