Donmai

Image Sample Cleanup Project

Posted under General

BrokenEagle98 said:

Not sure what Danbooru uses to test and then add the :orig modifier, but the following regex is what I use to test for valid Twitter URLs and retrieve only the URL portion without the size modifier.

(https?://pbs\.twimg\.com/media/[^.]+\.(?:png|gif|jpg))(?::(?:orig|large|medium|small))?

You can check the link modifiers for each site in app/logical/downloads/rewrite_strategies, "rewrite" method. Here's the one for twitter - it has the following regexp:

 ^(https?://pbs\.twimg\.com/media/[^:]+)

which should match pretty much any kind of thumbnail. That it let :medium through is quite weird.
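
As a rough illustration of the two patterns quoted above, here is a minimal Python sketch that strips any size modifier and appends :orig. The function names are hypothetical, and using fullmatch for the stricter pattern is an assumption of this sketch; Danbooru's own rewrite strategy is not reproduced here.

```python
import re

# Stricter pattern quoted above: group 1 is the URL without a size modifier.
STRICT = re.compile(
    r"(https?://pbs\.twimg\.com/media/[^.]+\.(?:png|gif|jpg))"
    r"(?::(?:orig|large|medium|small))?"
)

# Looser pattern quoted from the rewrite strategy: everything up to the first ':'
# that follows the media path.
LOOSE = re.compile(r"^(https?://pbs\.twimg\.com/media/[^:]+)")


def to_orig_strict(url):
    """Return the :orig URL, or None if the URL is not a recognised media link."""
    match = STRICT.fullmatch(url)
    return match.group(1) + ":orig" if match else None


def to_orig_loose(url):
    """Same idea with the looser pattern; any trailing ':something' is dropped."""
    match = LOOSE.match(url)
    return match.group(1) + ":orig" if match else None
```

Both helpers map a :medium link and a bare link to the same :orig URL; the strict one additionally rejects unknown extensions or modifiers.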

Type-kun said:

That it let :medium through is quite weird.

Actually, after looking it over again, it looks like I was mistaken... (-‸ლ) I must have confused post #2564324 with http://testbooru.donmai.us/posts/14 when I was testing things out ...?

Anyways, it looks like the Twitter image links are indeed rewritten to use :orig regardless of whether the size addon is there or not. I'm guessing most of the Twitter samples I found are from before the rewrite strategies were placed into effect...?

☆♪ said:

@Mikaeri: Sorry, I screwed up. The Sombra image is actually not downscaled, it's just recompressed. I could see an uploader downscaling in that situation, but I can't find any examples of that having happened.

Hmm, well if it does exist then we'll list it when it happens. I think there's bound to be some example off Artstation, given they allow for absurdly large filesizes.

I don't really see the need for a separate tag if the distinction is just in the way the source addresses things - semantically it's the same tag, no? If anything, bad_id could be renamed to something like source_gone.

Also, relating to worldendDominator's original question, I'm generally against rewriting sources altogether. The source should be where you got the image from, that's what it means, even if that source is no longer available. Linking to a source with a different version of the image is liable to cause confusion, as seen here. If you want to provide an alternate source for context, use a no-bump comment. Obviously not everyone may share my attitude on that, so what does everyone else think? It doesn't seem like there's a clear policy on the site about that; maybe we can decide on one.

I agree on this one. If one is to use a URL to point to a source, it must be to the original image to prevent confusion from automated processes, and to prevent other users from being confused over such things as md5 mismatch, upscaled and downscaled. I'll go ahead and re-edit help:image source to reflect that. Comments are there to provide context, also, so if an alternate source is desired by someone it can always be listed there instead.

@worldendDominator Make note of this. You should also fix the rest of your md5 mismatched upscales: user:worldendDominator upscaled. Put "4chan" if you don't have the original link, and no-bump comment the tumblr sources instead.

BrokenEagle98 said:

Anyways, it looks like the Twitter image links are indeed rewritten to use :orig regardless of whether the size addon is there or not. I'm guessing most of the Twitter samples I found are from before the rewrite strategies were placed into effect...?

They also get through from uploaders unaware that they can use the source field to upload an image, who instead download the sample image to their computer and use "Choose File".

Someone asked me about the nature of artstation samples and which ones should or shouldn't be flagged, but I couldn't come up with a clear response given I've never uploaded from there. The wiki page could use some elaboration or examples.

@sweetpeɐ might be able to say something. From what I recall, however, some "downscaled" or compressed Artstation uploads simply cannot have their original uploaded because they are so incredibly large -- and Danbooru only has a maximum file size of 25 MB for uploads.

@Mikaeri said:

Someone asked me about the nature of artstation samples and which ones should or shouldn't be flagged, but I couldn't come up with a clear response given I've never uploaded from there. The wiki page could use some elaboration or examples.

@sweetpeɐ might be able to say something. From what I recall, however, some "downscaled" or compressed Artstation uploads simply cannot have their original uploaded because they are so incredibly large -- and Danbooru only has a maximum file size of 25 MB for uploads.

This is an incredible misunderstanding. See my thread at topic #13057.

ArtStation, as we know, only has public links for the 'large' versions of their uploads. We can access the original; we just have to change the URL manually. In my email correspondence with ArtStation staff they explained to me that they would face bandwidth problems if they made links for original images public. To be clear, this is NOT an issue of files having massive size, just that ArtStation would rather pay for the bandwidth to serve a 300KB file 10,000 times than the original, which may be 2MB (just as an example). Do the math and it almost seems compelling.

Determining whether you have an ArtStation original or an ArtStation sample can be tricky. Sometimes if you load the /original/, it does not actually work. In this case the /large/ is not a sample but the "original" (I have a small collection of such images here). In most cases a /large/ is a sample, but you can only tell if you attempt and fail to load the /original/. I think sample tagging could be automated like has been done for other sources. If it is, I ask that you please change the source to the direct image URL.

So, to answer directly, when should you tag? If an /original/ and a /large/ are active on the site at the same time, flag the /large/ and parent it to the original. For cases where the /large/ is the largest, no flag should take place unless it violates quality and TOS standards. I'm sorry this source is so needlessly complicated; their site seems so broken and incoherent in many regards, yet it still hosts many nice pictures :s
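
The rule above could be sketched roughly as follows in Python. The helper names are hypothetical, and the check for whether the /original/ actually loads is left to the caller as a boolean, since how a failing /original/ responds is not stated here.

```python
def to_original(url):
    """Swap the first /large/ path segment for /original/ (hypothetical helper)."""
    return url.replace("/large/", "/original/", 1)


def classify(url, original_loads):
    """Classify an ArtStation image URL as 'original', 'sample', or 'unknown'.

    original_loads: whether the /original/ counterpart could be fetched.
    The caller has to determine this; it is an input here, not something
    this sketch checks over the network.
    """
    if "/original/" in url:
        return "original"
    if "/large/" in url:
        # A /large/ is a sample only when a working /original/ exists;
        # otherwise the /large/ is the largest version available.
        return "sample" if original_loads else "original"
    return "unknown"
```

Under this reading, a post would be flagged and parented only when classify returns 'sample'.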


sweetpeɐ said:

This is an incredible misunderstanding. See my thread at topic #13057.

ArtStation, as we know, only has public links for the 'large' versions of their uploads. We can access the original; we just have to change the URL manually. In my email correspondence with ArtStation staff they explained to me that they would face bandwidth problems if they made links for original images public. To be clear, this is NOT an issue of files having massive size, just that ArtStation would rather pay for the bandwidth to serve a 300KB file 10,000 times than the original, which may be 2MB (just as an example). Do the math and it almost seems compelling.

Determining whether you have an ArtStation original or an ArtStation sample can be tricky. Sometimes if you load the /original/, it does not actually work. In this case the /large/ is not a sample but the "original" (I have a small collection of such images here). In most cases a /large/ is a sample, but you can only tell if you attempt and fail to load the /original/. I think sample tagging could be automated like has been done for other sources. If it is, I ask that you please change the source to the direct image URL.

So, to answer directly, when should you tag? If an /original/ and a /large/ are active on the site at the same time, flag the /large/ and parent it to the original. For cases where the /large/ is the largest, no flag should take place unless it violates quality and TOS standards. I'm sorry this source is so needlessly complicated; their site seems so broken and incoherent in many regards, yet it still hosts many nice pictures :s

Huh. I was going off of forum #125933, so I wasn't actually aware of all this. It does sound like it could be automated though. From what I can tell, the tag has been added manually by Shallotte, who also seems to be the sole uploader of all images tagged with artstation sample thus far (along with their relative originals). Definitely sounds like it could be automated though, given there's probably more than just Shallotte's. But does the /original/ image just 404 or time out? Heck, I wonder if such a thing would be a problem in the first place for any running scripts.

Well, it is a fairly new site... they might continue going through more iterations (which may either change whether they host originals or not altogether, or introduce a filesize limit).

Just updated artstation sample with some helpful information as sweetpea has indicated.

@BrokenEagle98 Would it be possible for you to start automating the detection for them sometime in the future? Or might you still be busy with other things? Sorry for bothering.

EDIT: On a related sidenote, I'll also start working on howto: pages for all the major websites.


After doing some preliminary work on Artstation, I noticed that they use gzip encoding for their pictures. I was using the Content-Length attribute in the header to match the filesize, but became suspicious when the very first picture I encountered, which was uploaded only 2 days ago, was an MD5 mismatch. I learned that the header was reporting one size (863466), while the size of the image itself was another (939777, which did match Danbooru).

Image tested:

post #2616867 (http://danbooru.donmai.us/posts/2616867.json)
https://cdna.artstation.com/p/assets/images/images/004/750/650/original/matt-waggle-1920x1080x2-shot-1.jpg

response header
'Server': 'nginx/1.10.2'
'Content-Encoding': 'gzip'
'Last-Modified': 'Fri, 03 Feb 2017 21:46:28 GMT'
'ETag': '"5894fa34-d2cea"'
'Date': 'Fri, 03 Feb 2017 22:41:05 GMT'
'Cache-Control': 'max-age=315360000'
'Connection': 'keep-alive'
'Expires': 'Thu, 31 Dec 2037 23:55:55 GMT'
'Content-Length': '863466'
'Access-Control-Allow-Origin': '*'
'Access-Control-Expose-Headers': 'Accept-Ranges, Content-Length, Range'
'Content-Type': 'image/jpeg'

I also learned that most browser image info viewers use the header size instead of the content size, and so would report the incorrect size in the above scenario. The only way to check the real size then would be to download the image and check its size. I'm sharing the above just in case anyone was using only their browser to validate an image's attributes.
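
To make the mismatch concrete, here is a small Python sketch of why the header size and the file size can differ when Content-Encoding is gzip. The function name and the returned keys are hypothetical, and it works on bytes already in hand rather than fetching anything; it only assumes that the wire body is a gzip stream of the real file.

```python
import gzip
import hashlib


def describe(body_on_wire, content_encoding=""):
    """Compare the transferred size with the decoded size, and hash the decoded bytes."""
    if content_encoding.lower() == "gzip":
        decoded = gzip.decompress(body_on_wire)
    else:
        decoded = body_on_wire
    return {
        "wire_size": len(body_on_wire),  # what Content-Length describes
        "real_size": len(decoded),       # size of the file once saved to disk
        "md5": hashlib.md5(decoded).hexdigest(),  # hash of the saved file
    }


# Usage with stand-in data (not a real image):
payload = b"example image bytes " * 500
info = describe(gzip.compress(payload), "gzip")
print(info["wire_size"], info["real_size"])
```

The size and MD5 to compare against a post are the decoded ones, which is why downloading the file is the reliable check.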

Yeesh that's insane. Yeah, it's more important than ever now to make sure we stop uploading samples from Artstation. I think automating it for Artstation would likely be in the same manner as Twitter, right? Replace /large/ with /original/, use the HTML link as source...

Mikaeri said:

Yeesh that's insane. Yeah, it's more important than ever now to make sure we stop uploading samples from Artstation. I think automating it for Artstation would likely be in the same manner as Twitter, right? Replace /large/ with /original/, use the HTML link as source...

In defense of uploaders, we've only encountered ArtStation in the past year; I'm more worried about the countless bad artist entries being created.

@Mikaeri said:

Just updated artstation sample with some helpful information as sweetpea has indicated.

I must say it's good to see it explained in concrete and accurate terms that others will understand (well... people will find a way to mess this up doubtless). I'll remove my blurb from howto upload and just link to this page and well, correct capitalization for ArtStation 😏.

I think perhaps I confused people with my thread since I have used it more as a log of my findings which you would have to read in order to understand. Some of my original suspicions I later discredited in further posts so perhaps many users didn't read very far into it. Probably though this had little effect since the sample rate is nearly half.

From what I can tell, the tag has been added manually by Shallotte, who also seems to be the sole uploader of all images tagged with artstation sample thus far (along with their relative originals).

This, as should be seen now with the mass updates, was a result of @Shallotte's own personal endeavor, which was very much appreciated at the time. I can see why he seems to have quit at around 70 or so. I gave it the old college try by cross-checking MD5s against /large/ and /original/s and replacing the source with that URL. It was pretty exhausting and tedious work, so I was only able to go back a few months in posts.

EDIT: On a related sidenote, I'll also start working on howto: pages for all the major websites.

PM me if you need help with this.

@Kikimaru said:

In defense of uploaders, we've only encountered ArtStation in the past year; I'm more worried about the countless bad artist entries being created.

Perfectly valid first point.

What do you mean about artist entries, however? The worst I've seen -- and quite a bit of it, unfortunately -- is that users will often include only the ***.artstation.com or the www.artstation.com/*** URL when both should be included. I haven't been seeing direct ArtStation URLs as often in the last few weeks, but I remove them whenever I encounter them.

@Provence said:

Well, why "insane" if you just don't know how to upload from there^^?
I guess it is good to know after some digging from sweetpea, that the source should be changed (or make the bookmark finally work).

I don't even know if we could technically get the bookmark to work because of the case of broken /original/s explained in artstation sample.

BrokenEagle98
ArtStation related posts

Great news to see you've tagged the artstation samples.


mass update duplicate image_sample status:any -> -duplicate


To reflect the position that duplicate is not to be used for samples.

Since @RaisingK uses this tag for samples I'd like to ask, do you have a perspective on this?

EDIT: This bulk update request has been rejected because it was not approved within 60 days.

EDIT: The bulk update request #1076 (forum #126432) has been rejected by @DanbooruBot.

Updated by DanbooruBot

I created a new tag, protected_link, for those posts that I'm unable to access to verify. The examples I've found so far include protected tweets.

Edit:

In addition to the above, I also started using cropped and stitched tags based upon the composition of the image. Both of the prior are verified manually. Additionally, I don't leave an MD5 mismatch comment like normal just because the images are so radically different.

