
An enigma about uploading

Posted under General

Let's say somebody wanted to upload some really great pr0n to get a privileged invite. Since they can't see "unsafe" posts, how can they avoid double posting and thus lowering their chances of an invite?
This is like that one time when somebody asked me if the chicken or egg came first. After a four week argument, we settled on the egg, but that's not related to the conundrum at hand.

Updated by Marbleshoot

First of all, do you mean porn in the sense of real people? Because I don't think this place is really meant for that, unless it's like cosplay or stuff like that.
Secondly, most of the time the system detects double posts and gives you the message "that post already exists". When a duplicate goes undetected, that's usually because it's either smaller or bigger than the original. In that case it won't necessarily be deleted, but kept as a 'parent' or 'child' of the existing post.
Thirdly, I don't consider it a real tip, but: you can start off with uploading 'safe' posts, which you *can* see, and that way lower the chances of a double post. I try to do this, though it's not always easy to tell whether something is safe/questionable/unsafe.
Anyway, good luck!

ninja_nigga said:
I actually have a question that I think is relevant to that.
What exactly lets the system know that the image is a duplicate?

Seeing that this happens to me far too often, I've discovered it's quite simple.

Image size, tags, and I bet even the original filename are factors. If you have an image of Nia from Gurren Lagann that is 800x600 and your tags are "Nia tengen toppa gurren-lagann beach bikini", and there already is an image with those exact tags (more or less) and the exact same size, it's a duplicate.

It happens to me a lot, especially with stuff that's "borderline" safe. I stopped trying to upload questionable stuff because I can't tell if it's already there. The only problem I run into is that there are a lot of perfectly safe images labeled as questionable, and when I do a search to check whether something is already there I always get let down in the end, heh.

So, my suggestion is... upload what you can see until you can see everything, then upload anything... that isn't uploaded already.

IchiMashiPotatos said:
Image size, tags, and I bet even the original filename are factors...

Image size is a factor, but not for the reason you think. Tags and the filename are not. The system performs an md5 hash on the data of every file uploaded. A hash is just a fancy way of deterministically generating a number for any given data. Because the number of possible hash values is huge (2^128) and they're very evenly distributed, it's nearly impossible for two different files to generate the same hash. So if the system sees a file with the same hash value as one it already has, it knows it's a duplicate and rejects it.

The image size is intrinsically part of the data, so changing the image size or file size (which changes quality), converting the format, adding metadata, or editing the image at all will cause it to go undetected as a duplicate. So unless you are scanning something yourself, it is probably best to keep the image exactly as you find it on the author's site.

Tags and filename don't affect the data in any way, so the hash performs as it should. By all means though, be as descriptive as you can when tagging things. In addition to being good practice and making the site better as a whole, it makes it easier to find the duplicates that slip through the cracks. Filename makes no difference. So if you find the same image on *chan or on the author's site, provided it wasn't changed (some 3rd party sites compress the hell out of their pictures), it should still be detected as a duplicate if it is one.
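
To put that in concrete terms, the whole check boils down to something like this (just a sketch in Python, not Danbooru's actual code; the existing_hashes set stands in for whatever lookup the real thing does):

    import hashlib

    def file_md5(path):
        # Hashes the raw bytes of the file: pixel data, metadata, everything.
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    # Stand-in for the hashes the site already has on record.
    existing_hashes = set()

    def is_duplicate(path):
        # Renaming the file or changing its tags doesn't touch the bytes, so
        # the hash stays the same. Resizing, re-saving, or stripping metadata
        # does change the bytes, so the hash changes and the dupe slips through.
        return file_md5(path) in existing_hashes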

Shinjidude said:
Image size is a factor, but not for the reason you think. Tags and the filename are not. The system performs an md5 hash on the data of every file uploaded...

That's a whole lot more complicated than I thought it was... I really had no idea, I guess. Sorry about that.

I did notice... at least I think I saw this happen a few times... that if you try to upload an image that's already there but list more tags than it already has, the new tags get added. I guess that's why I got confused on that part...

IchiMashiPotatos said:
That's a whole lot more complicated than I thought it was...

That's actually the easiest way to do things. Md5 is pre-implemented in most web scripting languages (it's also used a lot for hashing passwords).

What would be nice, but probably infeasible, would be to have the system actually look at the pictures and store a signature for each. That would enable it to automatically flag similar-looking files as potential duplicates (such as when you post a compressed or resized version of an existing picture). Unfortunately that's a hard problem: it can take a lot of processing power and space, and worse yet, it can be very difficult to check without directly comparing each image to every other image.

Shinjidude said:
That's actually the easiest way to do things. Md5 is pre-implemented in most web scripting languages...

Don't they have search engines that can search by color now? I don't mean by tags, but by actually looking at the image like you said. A Google search brings up a few things. I added them to my ebook and I'll read up on them later. Looks like the technology is almost here, though...

Yeah, those search engines search by histogram (which counts how many times each color appears among an image's pixels). It's one of the most common ways to make an image signature, though it's not super-great for finding duplicates. Think of the proportion of white in all the 'monochrome' images.
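
If you're curious, a histogram signature is only a few lines to compute. Something like this (a rough sketch using PIL; the 8-buckets-per-channel quantization is a number I made up):

    from PIL import Image

    def histogram_signature(path, buckets=8):
        # Count how often each (coarsely quantized) color appears in the image.
        img = Image.open(path).convert("RGB")
        counts = [0] * (buckets ** 3)
        for r, g, b in img.getdata():
            # Quantize each 0-255 channel down to `buckets` levels.
            idx = ((r * buckets // 256) * buckets + (g * buckets // 256)) * buckets \
                  + (b * buckets // 256)
            counts[idx] += 1
        total = float(sum(counts))
        # Normalize so image dimensions don't matter.
        return [c / total for c in counts]

    def histogram_distance(h1, h2):
        # Smaller = more alike. A mostly-white monochrome picture ends up
        # "close" to every other mostly-white picture, which is exactly why
        # this isn't great for finding duplicates on its own.
        return sum(abs(a - b) for a, b in zip(h1, h2))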

In any case, that still leaves the problem of comparing one image across all the others. If you don't have a way of fuzzy-matching signatures in constant time, you have to compare each image with every other.

Say we are only talking about checking one new image against the database, as opposed to searching for existing duplicates - even then every upload would need to be checked against 137,687 other posts. If we were searching for existing duplicates it would be 137,687^2 comparisons (roughly 19 billion) unless you did something clever. Those sorts of numbers make the comparisons unscalable for a big system with limited resources.

The technology is developing, and with some of the more sophisticated techniques out there it *might* be possible, but I think it's probably still infeasible for our purposes.

archive.4chan.org has a simple "visual fingerprint" system that works well for finding non-binary duplicates of color images. However, it uses quite a lot of resources, which is probably why danbooru doesn't use it.

IchiMashiPotatos said:
So, my suggestion is... upload what you can see until you can see everything, then upload anything... that isn't uploaded already.

Yes, I suppose that makes sense... but I'm an "all or nothing" type of guy; I wouldn't bother posting things that I wouldn't want to see or save to my own hard drive. So I'll just keep on posting the same way and risk getting that duplicate notification every once in a while.

Shinjidude said:
Hmm, they seem dead right now. Do you know how they calculated / compared their fingerprint? It'd be interesting to see.

archive.4chan.org (aka a4o) is not a domain name but the name of the system. The actual URL it's available at is different, and ATM the NBID (non-binary identical) comparison system is indeed non-functional.

How it works is pretty simple: you scale the image down to 4x4, which gives you a 48-element vector of RGB values (4 * 4 pixels * 3 channels each). By comparing those vectors, you can see how far apart two images are.

It's a very simple system; there are much more sophisticated algorithms (using wavelets, for example). But it works pretty well and has a really good hit ratio for actual dupes (i.e. just rescaled or resaved).
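
In Python with PIL, the idea fits in a few lines (a sketch of the approach, not the actual a4o code):

    from PIL import Image

    def nbid_vector(path):
        # Shrink to 4x4 with averaging: 4 * 4 pixels * 3 channels = 48 numbers.
        img = Image.open(path).convert("RGB").resize((4, 4), Image.ANTIALIAS)
        return [channel for pixel in img.getdata() for channel in pixel]

    def nbid_distance(a, b):
        # Plain Euclidean distance between two 48-element vectors; anything
        # below some tuned threshold gets flagged as a likely duplicate.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5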

葉月 said:
But it works pretty well and has a really good hit ratio for actual dupes (i.e. just rescaled or resaved).

Except for b/w images, where it's not very usable. But for color images it works very well.

Interesting. I've heard of that strategy before. It's like the histogram, but with a bit less color information and a lot more locality-based information. Unfortunately it still depends on vector distances, which means it still needs to directly compare one image to another. I've been looking into LSH (locality-sensitive hashing), which might be able to avoid that, but I still don't quite understand it.

I'd imagine the actual URL to this archive is hush-hush secret or only available to administrators? I won't ask then, but it sounds interesting.

Shinjidude said:
I'd imagine the actual URL to this archive is hush-hush secret or only available to administrators? I won't ask then, but it sounds interesting.

Yeah, it's running on a home DSL; I wouldn't really want to expose it to danbooru. And the NBID subsystem doesn't work anyway.

spaz102 said:
how can they avoid double posting and thus lowering their chances of an invite?

Posting duplicates won't be held against you unless it's obvious you're doing it on purpose, and I haven't seen anyone do that. You don't get penalized for honest mistakes.

albert said:
http://cart.donmai.us/browser/danbooru/trunk/lib/danbooru_image_similarity uses the wavelet algorithm. However, the storage requirements are considerable, and the querying algorithm doesn't map well to relational databases.

Ah, C, how my better familiarity with Java, PHP, & Python has spoiled me. It's an interesting approach, but I can see how it would get heavy data-wise, and needing another table with multiple entries per post is bad too (if I'm skimming the code correctly).
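
For anyone else following along, wavelet signature schemes of that general family boil down to something like this (a heavily stripped-down Python sketch of the technique, not a translation of that C code; the 64x64 size and top-40 cutoff are numbers I made up):

    from PIL import Image

    def haar_2d(grid):
        # Repeated averaging/differencing over rows then columns of a
        # size x size grayscale grid (a plain Haar decomposition).
        n = len(grid)
        m = [row[:] for row in grid]
        while n > 1:
            half = n // 2
            for y in range(n):                      # rows
                row = m[y][:n]
                for i in range(half):
                    m[y][i] = (row[2 * i] + row[2 * i + 1]) / 2.0
                    m[y][half + i] = (row[2 * i] - row[2 * i + 1]) / 2.0
            for x in range(n):                      # columns
                col = [m[y][x] for y in range(n)]
                for i in range(half):
                    m[i][x] = (col[2 * i] + col[2 * i + 1]) / 2.0
                    m[half + i][x] = (col[2 * i] - col[2 * i + 1]) / 2.0
            n = half
        return m

    def wavelet_signature(path, size=64, keep=40):
        img = Image.open(path).convert("L").resize((size, size), Image.ANTIALIAS)
        pixels = list(img.getdata())
        grid = [pixels[y * size:(y + 1) * size] for y in range(size)]
        coeffs = haar_2d(grid)
        ranked = sorted(((abs(coeffs[y][x]), (y, x))
                         for y in range(size) for x in range(size)), reverse=True)
        # Keep only the positions of the largest coefficients; a resized or
        # recompressed copy of the same picture tends to share most of them.
        return set(pos for _, pos in ranked[:keep])

    def signature_overlap(sig_a, sig_b):
        return len(sig_a & sig_b)  # bigger overlap = more alike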

Have you ever considered an approach like the one described here: http://research.microsoft.com/users/misard/papers/civr2007.pdf ? They use a combination of the histogram approach I mentioned and the region-locality thing 葉月 alluded to. Basically, they build a pyramid of histograms, all together making a signature of 384 bytes per image as described.

They then encode those signatures into a set of locality-sensitive hashes, which allows fast, constant-time lookup for a given signature. I don't have all the details figured out yet, but it seems the storage cost would be just 384 bytes for the signature (if it were kept), plus a few hash entries after that for the LSH.
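
I haven't wrapped my head around their exact scheme, but the basic LSH trick, here in the random-hyperplane flavor (my guess, not necessarily what the paper uses), looks roughly like this:

    import random

    SIG_LEN = 384     # dimensions in the per-image signature
    NUM_BITS = 32     # bits per hash key
    NUM_TABLES = 4    # a few independent tables to boost the odds of a match

    rng = random.Random(0)
    # One random hyperplane per output bit, per table.
    planes = [[[rng.gauss(0, 1) for _ in range(SIG_LEN)]
               for _ in range(NUM_BITS)]
              for _ in range(NUM_TABLES)]

    def lsh_keys(signature):
        # Collapse a 384-dimensional signature into a few short bit-string
        # keys. Similar signatures land in the same buckets with high
        # probability, so a lookup becomes a handful of dictionary/index hits
        # instead of a scan over every post in the database.
        keys = []
        for table in planes:
            bits = ["1" if sum(p * s for p, s in zip(plane, signature)) > 0 else "0"
                    for plane in table]
            keys.append("".join(bits))
        return keys

A new upload would then only be compared against posts that share at least one bucket with it, rather than against every post on the site.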

It would be easier to store in the database, since you could add all that data in the same row as the post itself, or at least in a single row in another table (a 1:1 relation). I don't know how big is too big, though, since half a KB or so per image adds up when you consider how many images we have.

I just thought it was an interesting paper, since it got past the scalability problem with looking up a duplicate.
