Danbooru 2 Issues Topic

This topic has been locked.

Well this is bizarre. I tried downloading the file several times and sometimes it gives me a 2.7M PNG and sometimes it gives me a 648K JPEG. It seems random which file I get. No idea what's going on here.

admin@icarus:~
% wget https://68.media.tumblr.com/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png                                                                                                                                
--2017-03-24 03:19:23--  https://68.media.tumblr.com/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png
Resolving 68.media.tumblr.com (68.media.tumblr.com)... 69.147.82.57, 216.115.96.175, 69.147.82.56, ...
Connecting to 68.media.tumblr.com (68.media.tumblr.com)|69.147.82.57|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2844421 (2.7M) [image/png]
Saving to: ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png’

tumblr_okf33cpXvt1v2gw9ko1_1280.png                         100%[==========================================================================================================================================>]   2.71M  1.58MB/s    in 1.7s

2017-03-24 03:19:26 (1.58 MB/s) - ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png’ saved [2844421/2844421]

admin@icarus:~
% wget https://68.media.tumblr.com/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png                                                                                                                                
--2017-03-24 03:19:33--  https://68.media.tumblr.com/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png
Resolving 68.media.tumblr.com (68.media.tumblr.com)... 216.115.96.179, 69.147.82.56, 216.115.96.175, ...
Connecting to 68.media.tumblr.com (68.media.tumblr.com)|216.115.96.179|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 663298 (648K) [image/jpeg]
Saving to: ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png.1’

tumblr_okf33cpXvt1v2gw9ko1_1280.png.1                       100%[==========================================================================================================================================>] 647.75K  1.12MB/s    in 0.6s

2017-03-24 03:19:34 (1.12 MB/s) - ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png.1’ saved [663298/663298]

admin@icarus:~
% identify tumblr_okf33cpXvt1v2gw9ko1_1280.png tumblr_okf33cpXvt1v2gw9ko1_1280.png.1                                                                                                                                                   
tumblr_okf33cpXvt1v2gw9ko1_1280.png PNG 1280x1819 1280x1819+0+0 8-bit sRGB 2.844MB 0.000u 0:00.000
tumblr_okf33cpXvt1v2gw9ko1_1280.png.1 JPEG 1280x1819 1280x1819+0+0 8-bit sRGB 663KB 0.000u 0:00.000

EDIT: it seems to be based on the IP of the server. http://68.media.tumblr.com has 4 IPs:

% drill 68.media.tumblr.com
;; ANSWER SECTION:
68.media.tumblr.com.    23016   IN      CNAME   edge2.gycs.b.yahoodns.net.
edge2.gycs.b.yahoodns.net.      8       IN      A       69.147.82.57
edge2.gycs.b.yahoodns.net.      8       IN      A       216.115.96.175
edge2.gycs.b.yahoodns.net.      8       IN      A       216.115.96.179
edge2.gycs.b.yahoodns.net.      8       IN      A       69.147.82.56

http://69.147.82.57 returns the PNG:

admin@icarus:~
% wget --header "Host: 68.media.tumblr.com" http://69.147.82.57/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png                                                                                                   
--2017-03-24 03:37:04--  http://69.147.82.57/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png
Connecting to 69.147.82.57:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2844421 (2.7M) [image/png]
Saving to: ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png’

tumblr_okf33cpXvt1v2gw9ko1_1280.png                         100%[==========================================================================================================================================>]   2.71M  1.40MB/s    in 1.9s

2017-03-24 03:37:06 (1.40 MB/s) - ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png’ saved [2844421/2844421]

And http://216.115.96.179 returns the JPEG:

% wget --header "Host: 68.media.tumblr.com" http://216.115.96.179/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png                                                                                                 
--2017-03-24 03:37:31--  http://216.115.96.179/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png
Connecting to 216.115.96.179:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 663298 (648K) [image/jpeg]
Saving to: ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png.1’

tumblr_okf33cpXvt1v2gw9ko1_1280.png.1                       100%[==========================================================================================================================================>] 647.75K  1.16MB/s    in 0.5s

2017-03-24 03:37:32 (1.16 MB/s) - ‘tumblr_okf33cpXvt1v2gw9ko1_1280.png.1’ saved [663298/663298]

evazion said:

Well this is bizarre. I tried downloading the file several times and sometimes it gives me a 2.7M PNG and sometimes it gives me a 648K JPEG. It seems random which file I get. No idea what's going on here.

Maybe have a look at post #2651342 and one of its child posts as well? Same behavior there. I believe the deleted parent is the better quality one, actually.

Randeel said:

@BrokenEagle98 did your bot break in some way? It points to :large here post #2639755

Ah, thanks for pointing that out. It encountered a server error (500-599). It's supposed to sleep and try again after a minute, but instead it was just skipping to the next size.

Just for reference, if an HTTP response returns an error code 400-499, my script will try the next available size. I have encountered Twitter images where the ":orig" size 404'd but the ":large" was still available. Tumblr is another site where not all sizes are available. It's sort of built into Artstation.
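The retry behaviour described above could be sketched roughly as follows. This is not the bot's actual code: the function name `fetch_best_size`, the `fetch` callable, and the retry count are all assumptions made for illustration. Only the one-minute wait and the 4xx/5xx split come from the post.

```python
import time

def fetch_best_size(candidates, fetch, max_retries=3, wait_seconds=60, sleep=time.sleep):
    """Return (url, body) for the first candidate size that downloads.

    candidates: URLs ordered from most to least preferred size (hypothetical input).
    fetch: callable taking a URL and returning (status_code, body).
    """
    for url in candidates:
        for _ in range(max_retries):
            status, body = fetch(url)
            if 200 <= status < 300:
                return url, body
            if 500 <= status < 600:
                # Server-side error: wait, then retry the SAME size.
                sleep(wait_seconds)
                continue
            # 4xx (or anything else): this size is unavailable, try the next one.
            break
        else:
            # Every retry hit a 5xx. Stop here rather than silently
            # falling back to a smaller size.
            raise RuntimeError("server kept failing for " + url)
    return None, None
```

The `for ... else` is the part that matters: a run of 5xx responses ends in an error instead of a quiet downgrade to ":large".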

evazion said:

Well this is bizarre. I tried downloading the file several times and sometimes it gives me a 2.7M PNG and sometimes it gives me a 648K JPEG. It seems random which file I get. No idea what's going on here.

@BrokenEagle98, you might want to check how your bot handles Tumblr-sourced images. It also appears to be affected by random image retrieval (post #2669456).

D1ce said:

@BrokenEagle98, you might want to check how your bot handles Tumblr-sourced images. It also appears to be affected by random image retrieval (post #2669456).

Does it affect all servers, or only the 68 one?

On a side note, I've thought of having a developer's thread for a while now for stuff like the above. A while back I asked Type-kun about it and he was okay with it as long as it remains Danbooru related.

Thoughts?

BrokenEagle98 said:

Does it affect all servers, or only the 68 one?

On a side note, I've thought of having a developer's thread for a while now for stuff like the above. A while back I asked Type-kun about it and he was okay with it as long as it remains Danbooru related.

Thoughts?

I don't upload enough from there to say for certain, but the posts that gave me trouble came from the 68 one. I would think that a global catchall script for all Tumblr uploads would be safer, considering that the problem could later occur on their other servers. Would such a check be too impactful on performance?

It's not a question of performance...

There are two unsolvable issues (as far as I'm aware)...

1. Being able to get all IP addresses dynamically.

From what I've read, Tumblr uses a rotating list of servers from Yahoo that can vary from hour to hour. I've been unable to find a method to determine this at run time.

2. Being able to use those IP addresses dynamically.

From initial testing, both with my browser (Chrome 56) and with my script (Python 3.5), I am unable to use the IP address to get the image file. It only works when the hostname is used...??? It doesn't make a lot of sense since the headers I'm sending are exactly the same.

This gets annoying because I would have to manually save the image every time I encounter this issue because direct uploading gives me the inferior version. Can't we disable the automated Tumblr code in the meantime?

BrokenEagle98 said:

2. Being able to use those IP addresses dynamically.

From initial testing, both with my browser (Chrome 56) and with my script (Python 3.5), I am unable to use the IP address to get the image file. It only works when the hostname is used...??? It doesn't make a lot of sense since the headers I'm sending are exactly the same.

Not sure if I understood you correctly, but if you use http://0.1.2.3/file instead of http://example.com/file then you’re not sending the same headers. Nowadays, most webservers support hosting more than one site on the same IP, so you need to send the correct hostname to tell the server which site you want. This often applies even if the server serves only one site. The only way around that is to tell your client to connect to a specific IP but send the correct hostname instead of the IP as part of the request.

If you use curl, you can try the --resolve <host:port:address> option. I expect Python to be able to do it too, but I have no idea how easy that is.
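For Python, a minimal sketch using only the standard library might look like this. The helper name `request_via_ip` is made up for illustration; the URL and IP are the ones from the wget logs earlier in the thread. It sticks to plain HTTP, as those logs do, since over HTTPS the certificate check would likely fail against a bare IP.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

def request_via_ip(url, ip):
    """Build a request that connects to `ip` but asks for the original host."""
    parts = urlsplit(url)
    # Swap only the network location; path and query stay as they were.
    ip_url = urlunsplit((parts.scheme, ip, parts.path, parts.query, parts.fragment))
    # The Host header tells a virtual-hosting server which site is wanted.
    return Request(ip_url, headers={"Host": parts.netloc})

req = request_via_ip(
    "http://68.media.tumblr.com/ff41206113b69a05283784a121a9190a/tumblr_okf33cpXvt1v2gw9ko1_1280.png",
    "69.147.82.57",
)
```

Passing `req` to `urllib.request.urlopen` would then send the request to that IP with the hostname in the Host header, much like the `wget --header` calls above.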

Btw, consider sending HEAD requests to alternate hosts to check the file size and possibly modification date to avoid downloading the same file multiple times.
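A rough sketch of that idea, assuming the server answers HEAD the same way it answers GET (which is not guaranteed). Both function names are hypothetical, and "largest Content-Length wins" is only a heuristic for picking the better file.

```python
import http.client

def head_info(ip, host, path, timeout=10):
    """Send a HEAD request to one IP and return (content_type, content_length)."""
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        # Connect to the IP, but name the real site in the Host header.
        conn.request("HEAD", path, headers={"Host": host})
        resp = conn.getresponse()
        length = resp.getheader("Content-Length")
        return resp.getheader("Content-Type"), int(length) if length else None
    finally:
        conn.close()

def pick_largest(results):
    """results maps ip -> (content_type, content_length); return the ip with the biggest file."""
    known = {ip: info for ip, info in results.items() if info[1] is not None}
    if not known:
        return None
    return max(known, key=lambda ip: known[ip][1])
```

With the sizes from the logs above, `pick_largest` would choose 69.147.82.57 (the 2844421-byte PNG) over 216.115.96.179 (the 663298-byte JPEG).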

I made issue #2938 for the Tumblr bug.

On a side note, I've thought of having a developer's thread for a while now for stuff like the above. A while back I asked Type-kun about it and he was okay with it as long as it remains Danbooru related.

This would be a good idea. This thread is hard to follow as it stands.

The resized versions of post #4087, post #1357 and post #930 are broken; not sure if anything can be done about it.

I think this is related to issue #2500. Images for older posts are slowly being migrated offsite to save disk space.

evazion said:

I think this is related to issue #2500. Images for older posts are slowly being migrated offsite to save disk space.

That may be true. I just came across them while going through old tags I have favorited, so I thought I'd report them to be on the safe side since they are very old.

On another note, after cleaning up all (?) posts I found in solo solo_focus, the search only gives me this error.

PG::QueryCanceled exception raised

    ERROR: canceling statement due to statement timeout
    app/logical/post_sets/post.rb:126:in `posts'
    app/controllers/posts_controller.rb:15:in `index'

kittey said:

Not sure if I understood you correctly, but if you use http://0.1.2.3/file instead of http://example.com/file then you’re not sending the same headers. Nowadays, most webservers support hosting more than one site on the same IP, so you need to send the correct hostname to tell the server which site you want. This often applies even if the server serves only one site. The only way around that is to tell your client to connect to a specific IP but send the correct hostname instead of the IP as part of the request.

Many thanks for the above advice, as it worked correctly when I set the Host field to "68.media.tumblr.com". It was a bit confusing because the requests Python module that I use allows you to inspect the headers of the sent request, and it wasn't setting the Host field, so I thought it was unimportant... Maybe it was doing it and just not reporting it...?

Btw, consider sending HEAD requests to alternate hosts to check the file size and possibly modification date to avoid downloading the same file multiple times.

I've learned the hard way that servers often don't respond in the same manner to HEAD requests as GET requests... :/ Additionally, I've discovered that most of the servers lie about the modification datetime, usually setting it to the moment of the GET request.

Regardless, I'm still left with problem #1, i.e. determining all of the server IPs at runtime. For instance, none of the methods I've used to determine the IPs (NSLOOKUP, http://centralops.net/co/ , python, etc.) has yet returned the faulty IP of 216.115.96.179.

Unbreakable said:

On another note, after cleaning up all (?) posts I found in solo solo_focus, the search only gives me this error.

This is a longstanding problem, discussed in issue #1039. Basically, the problem is that for mutually exclusive tags (tags that independently have a large number of posts, but that when combined together produce few-to-no results), searches become very slow and likely to time out. Some other examples: solo multiple_girls, banned_artist -status:banned, highres lowres, long_hair no_humans.

There is a trick, however: add order:score to the search and like magic it won't time out. The reason why this works is fairly involved, but the basic explanation is that order:score tricks the database into searching using a different strategy than it normally would, which happens to be faster for these searches.

BrokenEagle98 said:

Regardless, I'm still left with problem #1, i.e. determining all of the server IPs at runtime. For instance, none of the methods I've used to determine the IPs (NSLOOKUP, http://centralops.net/co/ , python, etc.) has yet returned the faulty IP of 216.115.96.179.

It seems that their DNS gives you different sets of IPs based on your geographic location. I've tested the lookup on three machines and they each see different sets of IPs. On one machine, all of the IPs it gave me returned the JPEG. So it looks like even downloading the file from each IP isn't guaranteed to find the best file.
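For what it's worth, listing the addresses that one machine's resolver currently sees can be sketched in Python as below. The function name `resolve_ipv4` is an assumption for illustration. As noted above, the result only reflects the local view, so it cannot be relied on to include every server.

```python
import socket

def resolve_ipv4(hostname, getaddrinfo=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses the local resolver reports."""
    infos = getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_ipv4("68.media.tumblr.com")` from machines in different locations would be expected to give different lists, which is the behaviour described here.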