Donmai

Filesize inconsistencies between pixiv and other sites

Posted under General

Toks said:

For now, I recommend you (and all other uploaders) use the find similar function before uploading each image. It only takes about a second.

All uploaders should make sure they actually compare the images before deciding not to upload, though. "Double uploads" are good when the image here has noticeable compression artifacts or there are other appreciable differences.

Toks said:

For now, I recommend you (and all other uploaders) use the find similar function before uploading each image. It only takes about a second.

Assuming you don't get stuck 50th in the queue, but yes, that is good practice.

Admittedly, I don't upload at super-high rates with stuff from extremely popular artists, but personally I just check the artist's tag. If it's a recent image, it should show up on the first page, at least.

In at least one case, two of Hammer's images (which are visually identical as far as I could tell, though I have to admit my eyesight is not good) were uploaded here by the same person. This is totally unnecessary. Pick one source or the other as the 'official' one and ignore the other. With how often this has happened lately, I think we need a rule, for Hammer at least, to prevent bloating the database with pointless duplicates.

Also echoing what OOZ said. If I know the artist is already on danbooru, I check their stock of images here first before I try to upload anything; "find similar" is the second step I take. Usually I'm pleased to find someone else already got the image, saving me the work of filling out all those tags. :]

Toks said:

For now, I recommend you (and all other uploaders) use the find similar function before uploading each image. It only takes about a second.

But doesn't Danbooru automatically tell you if the post is a duplicate? That happened to me three or four times already.

MagicalAsparagus said:

But doesn't Danbooru automatically tell you if the post is a duplicate? That happened to me three or four times already.

It doesn't check if they're the same visually. It only checks if they're the same byte-for-byte with md5 comparison. It's possible for two images to look identical, but have different md5s, which is the case with the images this thread is talking about.

Toks said:

It doesn't check if they're the same visually. It only checks if they're the same byte-for-byte with md5 comparison. It's possible for two images to look identical, but have different md5s, which is the case with the images this thread is talking about.

Not really. The images in question here are 100% pixel identical - the difference is just the metadata pixiv stripped from the files. If that data is stripped from the other image files as well they have the same md5, see forum #91081. If danbooru would do that and check the resulting md5 those double uploads would vanish into thin air.
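To make the idea concrete, here is a minimal sketch of hashing a JPEG after dropping its metadata segments. It assumes the stripped metadata lives in APP1 (Exif/XMP), APP13 (Photoshop/IPTC) and COM segments; which segments pixiv actually removes is not stated in this thread, and the function names are made up for illustration.

```python
import hashlib
import struct

# Assumption: these marker types hold the metadata that gets stripped.
METADATA_MARKERS = {0xE1, 0xED, 0xFE}  # APP1, APP13, COM


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return the JPEG bytes with the assumed metadata segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("corrupt segment header")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything from here on is image data.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no scan data found")


def stripped_md5(data: bytes) -> str:
    """md5 of the file as it would look without the assumed metadata."""
    return hashlib.md5(strip_jpeg_metadata(data)).hexdigest()
```

Two files that differ only in those segments get different plain md5s but the same stripped_md5. The sketch only handles the common segment layout and does not cover PNG or other formats.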

Schrobby said:

Not really. The images in question here are 100% pixel identical - the difference is just the metadata pixiv stripped from the files. If that data is stripped from the other image files as well they have the same md5, see forum #91081. If danbooru would do that and check the resulting md5 those double uploads would vanish into thin air.

That's what he said.
But there's a problem with what you're proposing.

pixiv version = image
nico version = image + metadata

Case 1:
pixiv uploaded first,
nico uploaded ---> strip metadata and compare MD5 ---> match found

Case 2:
nico uploaded first
pixiv uploaded ---> cannot reconstruct metadata
You would then have to compare images. For example:
Take the closest match from an iqdb search. Strip the metadata from the closest match and then compare MD5.

You'd have to search for matching images (iqdb) for every single pixiv upload, and that's anything but fast.

Schrobby said:

Not really. The images in question here are 100% pixel identical - the difference is just the metadata pixiv stripped from the files. If that data is stripped from the other image files as well they have the same md5, see forum #91081. If danbooru would do that and check the resulting md5 those double uploads would vanish into thin air.

If the metadata is stripped, then the files are not byte-for-byte the same anymore. This is what I just said.

Schrobby said:

Danbooru saves the md5 of all uploaded images. Doing the same with the metadata stripped md5 should be no problem.

If the nico one is uploaded first, then the image on Danbooru is the one with metadata. It does not save the md5 without metadata in this case. If the pixiv one is then uploaded second, it will be the one with stripped metadata, so its md5 is different from the already uploaded one. Your suggestion only works half of the time.

Toks said:

If the nico one is uploaded first, then the image on Danbooru is the one with metadata. It does not save the md5 without metadata in this case. If the pixiv one is then uploaded second, it will be the one with stripped metadata, so its md5 is different from the already uploaded one. Your suggestion only works half of the time.

No. What I was suggesting is danbooru should create and save a stripped md5 for all uploaded pictures. That way you get all doubles, no matter which one is uploaded first.
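A rough sketch of that idea, using an in-memory dict in place of a database column. The class, its method names and toy_strip are hypothetical; toy_strip is only a stand-in for a real metadata stripper and treats everything after b"|META|" as metadata.

```python
import hashlib
from typing import Callable, Dict, Optional


class StrippedHashIndex:
    """Maps the md5 of each upload's metadata-free bytes to its post id."""

    def __init__(self, strip: Callable[[bytes], bytes]) -> None:
        self._strip = strip
        self._posts: Dict[str, int] = {}

    def _key(self, data: bytes) -> str:
        # Every upload is stripped before hashing, whatever its source.
        return hashlib.md5(self._strip(data)).hexdigest()

    def find_duplicate(self, data: bytes) -> Optional[int]:
        """Return the id of an existing post with the same stripped md5."""
        return self._posts.get(self._key(data))

    def add(self, post_id: int, data: bytes) -> None:
        # Keep the first post seen for a given stripped md5.
        self._posts.setdefault(self._key(data), post_id)


def toy_strip(data: bytes) -> bytes:
    # Stand-in stripper (assumption): drop everything after b"|META|".
    return data.split(b"|META|", 1)[0]
```

Because the stripped md5 is computed and stored for every upload, the lookup matches whether the version with metadata or the one without was uploaded first. It says nothing about the cost of the extra column and index on a real posts table.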

Schrobby said:

No. What I was suggesting is danbooru should create and save a stripped md5 for all uploaded pictures. That way you get all doubles, no matter which one is uploaded first.

What you're suggesting isn't practical. It would involve adding a new field for this 'stripped md5', and also indexing it. But adding any new indexes to the posts table is likely to cause serious performance problems.

Not to mention that your suggestion would only help with a small minority of duplicates (jpeg images uploaded to both pixiv + another site).

What you're suggesting isn't practical. It would involve adding a new field for this 'stripped md5', and also indexing it.

That, or adopting the same policy as Pixiv - simply stripping metadata to compare just the images. In this case it's merely a question of how necessary it is to keep these, because any performance issues stemming from maintaining and indexing an additional column in the database are gone.

HaxtonFale said:

That, or adopting the same policy as Pixiv - simply stripping metadata to compare just the images. In this case it's merely a question of how necessary it is to keep these, because any performance issues stemming from maintaining and indexing an additional column in the database are gone.

This wouldn't really do anything unless pixiv is the first upload of an image. It can't calculate the removed metadata from pixiv's stripping.

Log said:

This wouldn't really do anything unless pixiv is the first upload of an image. It can't calculate the removed metadata from pixiv's stripping.

Not if danbooru compares the md5 with its own stripped md5s.

Bumping this for some clarification regarding Twitter and pixiv. As EB noted on forum #89375 (topic #8811),

EB said:

When I know the artist uploads the same images elsewhere, I do try to prioritize those sources as I've seen the difference. Twitter's own image hosting seems to be even worse than Twitpic as I always find JPEG artifacts very noticeable on images posted there. Well, obviously excepting images in the PNG format (always like it when artists decide to use that for their Twitter images).

Usually when there are duplicate posts on pixiv and Twitter, the pixiv post becomes the parent to the Twitter version. This is the case for jpegs, as they're usually more artifacted on Twitter.
However, with more artists uploading PNGs onto their Twitter and subsequently uploading to pixiv, we're running into the same problem that Ars addressed initially; file size and md5, with the Twitter posts being larger in file size. There is currently no consistency regarding parenting the Twitter and pixiv posts and I would like some consensus regarding this.

As an aside, uploading from Twitpic no longer automatically sources correctly. This most probably happened after Twitter saved them in late Oct.

psich said:

As an aside, uploading from Twitpic no longer automatically sources correctly.

It seems twitpic changed their direct image link format from d3j5vwomefv46c.cloudfront.net to dn3pm25xmtlyu.cloudfront.net a few months ago.

I've updated Danbooru to be able to parse either domain for the next version.
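For illustration only, accepting both hosts could look something like the sketch below. This is not Danbooru's actual code; the function name is made up, and only the two hostnames come from this thread.

```python
from urllib.parse import urlparse

# Old and new twitpic direct-image hosts mentioned above.
TWITPIC_IMAGE_HOSTS = {
    "d3j5vwomefv46c.cloudfront.net",
    "dn3pm25xmtlyu.cloudfront.net",
}


def is_twitpic_image_url(url: str) -> bool:
    """True if the URL's host is one of the known twitpic image hosts."""
    host = urlparse(url).hostname  # lowercased, or None if there is no host
    return host in TWITPIC_IMAGE_HOSTS
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives when one of the domains merely appears in a path or query string.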
