Donmai

Twitter Source Fix Script

Posted under General

(Note: This is old and I'm not really working on it anymore. It just keeps running into DText changes that break its formatting, so it gets bumped whenever I fix that.)

I've gotten a little fed up recently with images whose source is https://pbs.twimg.com/whatever, so you can't get to the original tweet. So, I wrote a script to find posts like that from a given artist, look through all their tweets, and change the sources to the actual tweet URL. As far as I know, there's no way to go backward from a twimg URL to the status it came from, so this is the best I could think of. I figured I would post it here in case anyone else wants to use it.
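To make the matching idea concrete, here is a minimal sketch of how a twimg URL can be reduced to a key that is comparable between a Danbooru source and a tweet's media entity. It reuses the same pattern as IMAGE_URL_RE in the script below; the helper name media_key and the example URLs are made up for illustration.

```python
import re

# Same pattern the script uses; everything after a ':' (e.g. ':large') is dropped.
IMAGE_URL_RE = r'https?://pbs\.twimg\.com/media/([^:]+)'

def media_key(url):
    # Return the media filename from a twimg URL, or None if the URL
    # is not a /media/ URL (hypothetical helper, not part of the script).
    match = re.match(IMAGE_URL_RE, url)
    return match.group(1) if match else None

# Made-up example URLs: both forms reduce to the same key.
print(media_key('https://pbs.twimg.com/media/EXAMPLE123.jpg:large'))
print(media_key('http://pbs.twimg.com/media/EXAMPLE123.jpg'))
```

Because the size suffix and the scheme are ignored, a post sourced from the ":large" variant still lines up with the plain media URL that the API reports.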

Requires Python 3. Shouldn't be super hard to make it work with Python 2, but I didn't bother. The only dependency should be TwitterAPI (it also requires Requests, but the former depends on the latter).

Script
#!/usr/bin/env python3

import requests
from TwitterAPI import TwitterAPI
import re
import json
import itertools


TWITTER_URL_RE = r'https?://twitter\.com/([^/]+)'
IMAGE_URL_RE = r'https?://pbs\.twimg\.com/media/([^:]+)'


def danbooru_api_whatever(method, thing, *args, **kw):
    response = method('https://danbooru.donmai.us/{}.json'.format(thing), *args, **kw)
    response.raise_for_status()
    return response.json()

def danbooru_api_get(thing, params=None):
    p = dict(auth_data['danbooru'])
    if params:
        p.update(params)
    return danbooru_api_whatever(requests.get, thing, params=p)

def danbooru_api_put(thing, data):
    body = dict(auth_data['danbooru'])
    body.update(data)
    return danbooru_api_whatever(requests.put, thing, body)

def get_posts(*tags):
    # An iterator over all posts matching tags, doing pages as
    # necessary. Fetches pages lazily. Makes no attempt to remove
    # duplicates in case a post is added while working.
    params = {
        'tags': ' '.join(tags),
        'page': 1
    }
    for page in iter(lambda: danbooru_api_get('posts', params), []):
        for post in page:
            yield post
        params['page'] += 1


def all_media_from_tweets(screen_name):
    # Iterator over all media entities in tweets from a given screen
    # name. Fetches tweets lazily as necessary.
    tapi = TwitterAPI(**auth_data['twitter'])
    params = {
        'screen_name': screen_name,
        'include_rts': False,
        'trim_user': True,
        'count': 200, # this is the max
    }
    for tweets in iter(lambda: tapi.request('statuses/user_timeline', params).json(), []):
        if not isinstance(tweets, list):
            print('Could not get tweets for {}: {}'.format(screen_name, tweets))
            return
        for tweet in tweets:
            try:
                media = tweet['entities']['media']
            except KeyError:
                # No media in this tweet
                continue
            for entity in media:
                yield (entity['media_url_https'], entity['expanded_url'])
        params['max_id'] = min(tweet['id'] for tweet in tweets) - 1 # max_id is inclusive


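def main(argv):
    # auth_data is read by the danbooru_api_* helpers and by
    # all_media_from_tweets, so it has to live at module level.
    global auth_data

    with open('auth.json', 'r') as f:
        auth_data = json.load(f)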
    if not ('danbooru' in auth_data and 'twitter' in auth_data):
        raise RuntimeError('auth stuff not provided')

    if len(argv) < 2:
        return 'usage: {} artist_name [twitter_username ...]'.format(argv[0])

    artist_url_match = re.match(r'https?://danbooru\.donmai\.us/artists/([^/]+)', argv[1])
    print('Looking up artist...')
    if artist_url_match:
        artist = danbooru_api_get('artists/' + artist_url_match.group(1))
    else:
        artist_name = argv[1].replace(' ', '_')
        for artist in danbooru_api_get('artists', {'search[name]': 'name:' + artist_name}):
            # Look for an exact match
            if artist['name'] == artist_name:
                break
        else: # No break means nothing matched
            return 'No such artist: {!r}'.format(artist_name)

    twitter_usernames = [
        # From command line
        match.group(1) if match else arg for match, arg in (
            (re.match(TWITTER_URL_RE, arg), arg)
            for arg in argv[2:]
        )
    ] or [
        # From artist entry on danbooru
        match.group(1) for match in (
            re.match(TWITTER_URL_RE, url['normalized_url'])
            for url in artist['urls']
        ) if match
    ]
    if not twitter_usernames:
        return 'No twitter username(s) found.'

    posts_needing_update = {
        match.group(1): post['id']
        for post, match in (
            (post, re.match(IMAGE_URL_RE, post['source']))
            for post in itertools.chain(
                get_posts(artist['name'], 'source:https://pbs.twimg.com/'),
                get_posts(artist['name'], 'source:http://pbs.twimg.com/')
            )
            # ~source:https://pbs.twimg.com/ ~source:http://pbs.twimg.com/ doesn't work
            # source:http*://pbs.twimg.com/ works but is technically wrong and might be harder on the database (?)
        )
        # Skip twimg sources that aren't /media/ URLs (e.g. profile
        # images); re.match returns None for those.
        if match
    }
    if posts_needing_update:
        print('Found {} posts needing update.'.format(len(posts_needing_update)))
    else:
        print('No posts with twimg sources found.')
        return

    print('Using twitter username(s):', ', '.join(twitter_usernames))
    for screen_name in twitter_usernames:
        for image_url, tweet_url in all_media_from_tweets(screen_name):
            match = re.match(IMAGE_URL_RE, image_url)
            if not match:
                # It might be a video thumbnail or something
                continue
            media_filename = match.group(1)
            try:
                post_id = posts_needing_update.pop(media_filename)
            except KeyError:
                continue
            tweet_url = tweet_url.replace('http://', 'https://', 1)
            print('Post #{} -> {}'.format(post_id, tweet_url))
            danbooru_api_put('posts/{}'.format(post_id), {
                'post[source]': tweet_url
            })
            if not posts_needing_update:
                print('All sources fixed.')
                return
    # Getting here means some posts couldn't be found. Twitter only
    # returns the most recent 3,200 tweets through the API, so that's
    # probably why. Another possibility is that the post was from a
    # different twitter account that wasn't specified on the command
    # line or present in danbooru's artist entry, or the post on
    # danbooru had the wrong artist tag.
    return 'Could not find sources for {} post(s): {}'.format(
        len(posts_needing_update),
        ', '.join(map(str, sorted(posts_needing_update.values(), reverse=True)))
    )

if __name__ == '__main__':
    import sys
    sys.exit(main(sys.argv))

For the script to work, you'll need a file called auth.json in your working directory like this:

auth.json template
{
    "danbooru": {
        "login": "your danbooru username goes here",
        "api_key": "your danbooru api key goes here"
    },
    "twitter": {
        "consumer_key": "your twitter oauth crap goes here",
        "consumer_secret": "and here",
        "access_token_key": "and here",
        "access_token_secret": "and here"
    }
}

To run it, the bare minimum is to give it the name of the artist tag (or the URL of the artist page) as the first command line argument. Any additional arguments will be interpreted as Twitter usernames (or URLs). If you don't provide any, it'll look up the artist on Danbooru and try to find Twitter accounts from the URLs there.

Examples

Fix posts for mishima_kurone, looking up twitter username(s) automatically:

python3 danbooru-twitter-source-fix.py mishima_kurone

This is equivalent:

python3 danbooru-twitter-source-fix.py https://danbooru.donmai.us/artists/45593 

If you want to specify the Twitter username(s) manually:

python3 danbooru-twitter-source-fix.py some_artist twitter_user alt_twitter

Or:

python3 danbooru-twitter-source-fix.py some_artist https://twitter.com/twitter_user 

I've run it on mishima_kurone, kasu_(return), and caidychen for testing. It seems to work pretty well.

Logs
Looking up artist...
Found 31 posts needing update.
Using twitter username(s): caidychenkd
Post #1926517 -> https://twitter.com/caidychenkd/status/563683407897845760/photo/1
Post #1917221 -> https://twitter.com/caidychenkd/status/560409559651844096/photo/1
Post #1906363 -> https://twitter.com/caidychenkd/status/557132382307115008/photo/1
Post #1906360 -> https://twitter.com/caidychenkd/status/556306153735327745/photo/1
Post #1906356 -> https://twitter.com/caidychenkd/status/555718033235132416/photo/1
Post #1891212 -> https://twitter.com/caidychenkd/status/548122912125759490/photo/1
Post #1882259 -> https://twitter.com/caidychenkd/status/547579848533606400/photo/1
Post #1870682 -> https://twitter.com/caidychenkd/status/543029548879577088/photo/1
Post #1864648 -> https://twitter.com/caidychenkd/status/541019760171827200/photo/1
Post #1860315 -> https://twitter.com/caidychenkd/status/537206267429679104/photo/1
Post #1860444 -> https://twitter.com/caidychenkd/status/523468526166618114/photo/1
Post #1860350 -> https://twitter.com/caidychenkd/status/520171924979056640/photo/1
Post #1860346 -> https://twitter.com/caidychenkd/status/511038212442058752/photo/1
Post #1860339 -> https://twitter.com/caidychenkd/status/509339102110441473/photo/1
Post #1860334 -> https://twitter.com/caidychenkd/status/504613904257777665/photo/1
Post #1860445 -> https://twitter.com/caidychenkd/status/501746650017042432/photo/1
Post #1860328 -> https://twitter.com/caidychenkd/status/499130050919161859/photo/1
Post #1860321 -> https://twitter.com/caidychenkd/status/496082270487183360/photo/1
Post #1860317 -> https://twitter.com/caidychenkd/status/493636937643610114/photo/1
Post #1860409 -> https://twitter.com/caidychenkd/status/473090005322063873/photo/1
Post #1860354 -> https://twitter.com/caidychenkd/status/468614384101519360/photo/1
Post #1706664 -> https://twitter.com/caidychenkd/status/454569768163348480/photo/1
Post #1860407 -> https://twitter.com/caidychenkd/status/431766768705478656/photo/1
Post #1860362 -> https://twitter.com/caidychenkd/status/418705081173692416/photo/1
Post #1860402 -> https://twitter.com/caidychenkd/status/397688070037700608/photo/1
Post #1860371 -> https://twitter.com/caidychenkd/status/391758478814879745/photo/1
Post #1860396 -> https://twitter.com/caidychenkd/status/385621024303104001/photo/1
Post #1860400 -> https://twitter.com/caidychenkd/status/365491412550180865/photo/1
Post #1860391 -> https://twitter.com/caidychenkd/status/339714012239515648/photo/1
Post #1860383 -> https://twitter.com/caidychenkd/status/330974433193914369/photo/1
Post #1860387 -> https://twitter.com/caidychenkd/status/306231217521557504/photo/1
All sources fixed.


Looking up artist...
Found 26 posts needing update.
Using twitter username(s): kasu1923
Post #1931695 -> https://twitter.com/kasu1923/status/567281694185500674/photo/1
Post #1929818 -> https://twitter.com/kasu1923/status/564033031791316992/photo/1
Post #1929821 -> https://twitter.com/kasu1923/status/562914611872419840/photo/1
Could not find sources for 23 post(s): 1832332, 1812079, 1774528, 1758143, 1758139, 1758138, 1758137, 1758130, 1758124, 1758121, 1758119, 1758116, 1758115, 1758113, 1758111, 1758109, 1758108, 1758107, 1758105, 1758100, 1667125, 1667109, 1667108


Looking up artist...
Found 19 posts needing update.
Using twitter username(s): mishima_kurone
Post #1986455 -> https://twitter.com/mishima_kurone/status/589087614063353856/photo/1
Post #1922150 -> https://twitter.com/mishima_kurone/status/564385576766296064/photo/1
Post #1896123 -> https://twitter.com/mishima_kurone/status/552464451400904705/photo/1
Post #1845041 -> https://twitter.com/mishima_kurone/status/532893206770249728/photo/1
Post #1840195 -> https://twitter.com/mishima_kurone/status/530383108273496064/photo/1
Post #1807160 -> https://twitter.com/mishima_kurone/status/515505801260064769/photo/1
Post #1817949 -> https://twitter.com/mishima_kurone/status/511913166964412416/photo/1
Post #1773412 -> https://twitter.com/mishima_kurone/status/502356555610611712/photo/1
Post #1797987 -> https://twitter.com/mishima_kurone/status/478588411750531072/photo/1
Post #1674772 -> https://twitter.com/mishima_kurone/status/461895410223566849/photo/1
Post #1669784 -> https://twitter.com/mishima_kurone/status/459777103563460608/photo/1
Post #1665211 -> https://twitter.com/mishima_kurone/status/457822732520919040/photo/1
Post #1662073 -> https://twitter.com/mishima_kurone/status/455771877378506752/photo/1
Post #1662081 -> https://twitter.com/mishima_kurone/status/451246958850473984/photo/1
Post #1633996 -> https://twitter.com/mishima_kurone/status/442821580809191424/photo/1
Could not find sources for 4 post(s): 1620432, 1427485, 1427484, 1427483

Unfortunately, Twitter only allows retrieving the last 3,200 tweets through the API, so if an artist tweets a lot, older tweets won't be found. (You can see this in the above logs.) If anyone knows of a solution to this issue, I'd like to hear it.

Updates:

  • 5 May 2015 - Basic error handling when getting tweets. Nothing fancy, it just prints Twitter's raw response and skips the username. But it stops the program from crashing and allows it to still work when e.g. an artist has multiple Twitters and one or more is deleted.
  • (5 Feb 2017 - Fixed formatting.)
  • (9 Aug 2018 - Fixed formatting again...)

Possible future improvements:

  • Get around the 3,200 tweet limit somehow. (Help?)
  • Keep a permanent list of the latest tweet accessed for each twitter, so on future runs it could use since_id to avoid fetching a ton of redundant data. This would be useful if one were going to run the script on a semi-regular basis as a sort of maintenance operation. (Actually, though, this would prevent the script from working on posts sourced from old tweets, unless all mappings were cached forever, which would take a prohibitive amount of space.)
  • Remember which posts it has failed to find sources for. (Also mainly useful if used regularly as maintenance. This would save a lot of the redundancy without giving up the ability to fix old posts like the other idea does.)
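The last idea could look something like the sketch below. It is not part of the script: the function names and the failed_posts.json file name are made up, and it assumes posts_needing_update is the filename-to-post-ID dict that the script builds.

```python
import json
import os

def load_failed(path):
    # Return the set of post IDs recorded as unfixable on earlier runs.
    if not os.path.exists(path):
        return set()
    with open(path, 'r') as f:
        return set(json.load(f))

def save_failed(path, failed_ids):
    # JSON has no set type, so store a sorted list.
    with open(path, 'w') as f:
        json.dump(sorted(failed_ids), f)

def drop_known_failures(posts_needing_update, failed_ids):
    # Keep only posts that have not already failed on a previous run.
    return {
        filename: post_id
        for filename, post_id in posts_needing_update.items()
        if post_id not in failed_ids
    }
```

At the end of a run, whatever is left in posts_needing_update would be added to the saved set. A post that failed only because of a missing Twitter username would then never be retried, so the file would need occasional manual pruning.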

Any feedback is welcome.

On a related note, Danbooru has a relatively recent feature that automatically sets the source for uploads to the more useful twitter.com URL rather than the direct twimg.com link. It only works for bookmarklet uploads, since it doesn't know what the twitter.com URL should be for non-bookmarklet uploads.

Toks said:

On a related note, Danbooru has a relatively recent feature that automatically sets the source for uploads to the more useful twitter.com URL rather than the direct twimg.com link. It only works for bookmarklet uploads, since it doesn't know what the twitter.com URL should be for non-bookmarklet uploads.

Seems to be broken again.

Kikimaru said:

Seems to be broken again.

You need to click the bookmarklet while on the twitter.com page. If you click it on the twimg.com page Danbooru has no way of knowing what the twitter.com url should be.

I've been thinking about setting up a cronjob to fix Twitter sources for recent uploads, maybe on a weekly basis or so. Not sure when I'd get around to it, but I wanted to ask the admins a few things before I start:

  • Is it okay for me to do this?
  • Should I make a separate account for it, or just use my normal one?
  • Would it be okay for the script to send messages to users who repeatedly mis-source? (I'd use some heuristic to ignore occasional mistakes and avoid messaging the same user more than e.g. once a month.)
  • Would it be okay to message users who upload twimg-sourced posts for artists without Twitter URLs in their artist profile, asking them to at least fill in that info for the artist? (This should be a rare situation, I think.)
  • Should the script add source_request to posts with twimg sources that it can't automatically fix?
  • Post IDs are monotonic, right? (Specifically: if I'm loading pages in order, and I see a post with a number lower than the highest one I saw last time the script ran, is it guaranteed that I've seen all of the posts since the last run?)

If it works without bugs on recent posts then it would also be nice to fix the existing backlog of all twimg posts too at some point.
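To spell out what the monotonic-ID question is for, here is a rough sketch of the stopping rule. It assumes posts arrive newest first and that IDs only ever increase; new_posts and last_seen_id are names made up for this example.

```python
import itertools

def new_posts(posts, last_seen_id):
    # posts: iterable of post dicts in descending ID order (newest first).
    # Stops at the first post that was already seen on a previous run,
    # which is only safe if IDs are monotonic.
    return itertools.takewhile(lambda post: post['id'] > last_seen_id, posts)

# Made-up IDs: with last_seen_id = 3, only posts 5 and 4 are new.
page = [{'id': 5}, {'id': 4}, {'id': 3}, {'id': 2}]
print([post['id'] for post in new_posts(page, 3)])
```

Since takewhile is lazy, wrapping a lazy page iterator like get_posts this way would also stop further pages from being fetched once an old post shows up.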

☆♪ said:

  • Post IDs are monotonic, right? (Specifically: if I'm loading pages in order, and I see a post with a number lower than the highest one I saw last time the script ran, is it guaranteed that I've seen all of the posts since the last run?)

Yes, they always go up by one every time a post is created.

☆♪ said:

  • Should the script add source_request to posts with twimg sources that it can't automatically fix?

Could be useful for anyone that wants to manually fix them and doesn't want to waste time doing the ones that can be done automatically. Google reverse image search might come in handy.

☆♪ said:

I've been thinking about setting up a cronjob to fix Twitter sources for recent uploads, maybe on a weekly basis or so. Not sure when I'd get around to it, but I wanted to ask the admins a few things before I start:

  • Is it okay for me to do this?

Bots are okay. I would ask you to generate an API key though and use that to authenticate. Even better, create a new account for the bot and make changes with that.

  • Would it be okay for the script to send messages to users who repeatedly mis-source? (I'd use some heuristic to ignore occasional mistakes and avoid messaging the same user more than e.g. once a month.)
  • Would it be okay to message users who upload twimg-sourced posts for artists without Twitter URLs in their artist profile, asking them to at least fill in that info for the artist? (This should be a rare situation, I think.)

A lot of bots already do these things.

  • Should the script add source_request to posts with twimg sources that it can't automatically fix?

I don't consider the lack of a source for a Twitter post to be that huge of an issue, as long as the artist is properly identified. For me the main utility of the original Twitter link is discovering the account to follow them.

  • Post IDs are monotonic, right? (Specifically: if I'm loading pages in order, and I see a post with a number lower than the highest one I saw last time the script ran, is it guaranteed that I've seen all of the posts since the last run?)

Yes.

Finally set it up, under user #465776. It's currently running on posts starting from about a week back. If it behaves for a while, I'll run it all the way back.

For now, I decided not to implement a heuristic for avoiding messaging users too much. I think that once you get in the habit of doing it the right way, you should almost never screw up, and if you do and get a message once every couple of months or something it's not a big deal. If users get a lot of mails from the script, it should be because they upload with broken sources a lot. If people complain about getting too many messages, I can implement a heuristic later. The script does consolidate messages, though, so it will send at most one message per user each time it runs. Also, it ignores posts less than 15 minutes old, so uploaders have time to set up artist entries and whatever else if needed.
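The bot's code isn't posted, so the following is only a sketch of the consolidation and the 15-minute rule under some assumptions: each post is a dict with id, uploader_id and a created_at that has already been parsed into a timezone-aware datetime, and posts_by_uploader is a made-up name.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(minutes=15)

def posts_by_uploader(posts, now):
    # Group badly sourced posts by uploader so each user gets at most
    # one message per run, skipping posts younger than MIN_AGE.
    grouped = defaultdict(list)
    for post in posts:
        if now - post['created_at'] < MIN_AGE:
            continue  # too recent; give the uploader time to fix things
        grouped[post['uploader_id']].append(post['id'])
    return dict(grouped)

# Made-up data: uploader 11's only post is 5 minutes old, so it is skipped.
now = datetime(2015, 5, 5, 12, 0, tzinfo=timezone.utc)
posts = [
    {'id': 1, 'uploader_id': 10, 'created_at': now - timedelta(hours=2)},
    {'id': 2, 'uploader_id': 10, 'created_at': now - timedelta(hours=1)},
    {'id': 3, 'uploader_id': 11, 'created_at': now - timedelta(minutes=5)},
]
print(posts_by_uploader(posts, now))
```

Each key in the result would become one message listing that user's posts.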

I'll probably upload the code eventually. Want to let it run for a bit and see if I have to make changes first.

Also, I'm using page=a12345 and page=b12345, but I don't see them officially documented anywhere. Can someone confirm that those will be sticking around?
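For context, this is roughly how ID-based paging can be used. It is a sketch only, assuming page=a12345 means "posts with an ID above 12345" and that each page holds the lowest matching IDs; fetch_page is a stand-in for whatever performs the request (for example danbooru_api_get from the script above with 'posts' filled in), and posts_after is a made-up name.

```python
def posts_after(fetch_page, last_seen_id):
    # Yield every post with an ID above last_seen_id, oldest first.
    # fetch_page: callable taking a params dict and returning a list of
    # post dicts. Assumes page='a<id>' returns posts with ID > <id>.
    cursor = last_seen_id
    while True:
        page = fetch_page({'page': 'a{}'.format(cursor)})
        if not page:
            return
        # Don't rely on the order within a page; sort it here.
        for post in sorted(page, key=lambda p: p['id']):
            yield post
        cursor = max(post['id'] for post in page)
```

Unlike numbered pages, this doesn't skip or repeat posts when new uploads arrive mid-run, because each request is anchored to an ID rather than an offset.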

Well, the Twitter account I was using got suspended. They didn't tell me what they didn't like about what I was doing, so I don't really know what to do. I can't think of any way to reduce the requests to Twitter without caching a significant amount of data on my own machine.

So I guess this bot is going to stop until I have time to figure something else out.
