(Note: This is old and I'm not really working on it any more. It keeps running into dtext changes that break its formatting, so it gets bumped whenever I fix that.)
I've gotten a little fed up recently with images whose source is https://pbs.twimg.com/whatever, so you can't get to the original tweet. So, I wrote a script to find posts like that from a given artist, look through all their tweets, and change the sources to the actual tweet URL. As far as I know, there's no way to go backward from a twimg URL to the status it came from, so this is the best I could think of. I figured I would post it here in case anyone else wants to use it.
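To make the matching step concrete, here's a minimal sketch of the idea: strip each twimg URL down to its media filename and use that as the key to pair posts with tweets. The regex is the one the script uses; the helper names (media_key, match_sources), the post IDs, and the sample URLs are made up for illustration.

```python
import re

# Same pattern the script uses: capture the filename, stopping before any ":large" style suffix.
IMAGE_URL_RE = r'https?://pbs\.twimg\.com/media/([^:]+)'

def media_key(url):
    """Return the media filename from a twimg URL, or None if it isn't one."""
    match = re.match(IMAGE_URL_RE, url)
    return match.group(1) if match else None

# Hypothetical data: post ID -> current source, and (image URL, tweet URL) pairs from a timeline.
post_sources = {
    101: 'https://pbs.twimg.com/media/EXAMPLE123.jpg:large',
    102: 'http://pbs.twimg.com/media/EXAMPLE456.png:orig',
}
tweet_media = [
    ('https://pbs.twimg.com/media/EXAMPLE123.jpg',
     'https://twitter.com/some_user/status/1/photo/1'),
]

def match_sources(post_sources, tweet_media):
    """Return ({post ID: tweet URL}, [post IDs that no tweet matched])."""
    wanted = {}
    for post_id, source in post_sources.items():
        key = media_key(source)
        if key:
            wanted[key] = post_id
    fixed = {}
    for image_url, tweet_url in tweet_media:
        # Non-twimg URLs give a None key, which is never in wanted.
        post_id = wanted.pop(media_key(image_url), None)
        if post_id is not None:
            fixed[post_id] = tweet_url
    return fixed, sorted(wanted.values())

fixed, missing = match_sources(post_sources, tweet_media)
```

Anything left in missing is a post whose image never showed up in the tweets that were fetched.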
Requires Python 3. Shouldn't be super hard to make it work with Python 2, but I didn't bother. The only dependency should be TwitterAPI (it also requires Requests, but the former depends on the latter).
Script
#!/usr/bin/env python3

import requests
from TwitterAPI import TwitterAPI
import re
import json
import itertools

TWITTER_URL_RE = r'https?://twitter\.com/([^/]+)'
IMAGE_URL_RE = r'https?://pbs\.twimg\.com/media/([^:]+)'

def danbooru_api_whatever(method, thing, *args, **kw):
    response = method('https://danbooru.donmai.us/{}.json'.format(thing), *args, **kw)
    response.raise_for_status()
    return response.json()

def danbooru_api_get(thing, params=None):
    p = dict(auth_data['danbooru'])
    if params:
        p.update(params)
    return danbooru_api_whatever(requests.get, thing, params=p)

def danbooru_api_put(thing, data):
    body = dict(auth_data['danbooru'])
    body.update(data)
    return danbooru_api_whatever(requests.put, thing, body)

def get_posts(*tags):
    # An iterator over all posts matching tags, doing pages as
    # necessary. Fetches pages lazily. Makes no attempt to remove
    # duplicates in case a post is added while working.
    params = {
        'tags': ' '.join(tags),
        'page': 1
    }
    for page in iter(lambda: danbooru_api_get('posts', params), []):
        for post in page:
            yield post
        params['page'] += 1

def all_media_from_tweets(screen_name):
    # Iterator over all media entities in tweets from a given screen
    # name. Fetches tweets lazily as necessary.
    tapi = TwitterAPI(**auth_data['twitter'])
    params = {
        'screen_name': screen_name,
        'include_rts': False,
        'trim_user': True,
        'count': 200, # this is the max
    }
    for tweets in iter(lambda: tapi.request('statuses/user_timeline', params).json(), []):
        if not isinstance(tweets, list):
            print('Could not get tweets for {}: {}'.format(screen_name, tweets))
            return
        for tweet in tweets:
            try:
                media = tweet['entities']['media']
            except KeyError:
                # No media in this tweet
                continue
            for entity in media:
                yield (entity['media_url_https'], entity['expanded_url'])
        params['max_id'] = min(tweet['id'] for tweet in tweets) - 1 # max_id is inclusive

def main(argv):
    with open('auth.json', 'r') as f:
        globals()['auth_data'] = json.load(f)
    if not ('danbooru' in auth_data and 'twitter' in auth_data):
        raise RuntimeError('auth stuff not provided')

    if len(argv) < 2:
        return 'usage: {} artist_name [twitter_username ...]'.format(argv[0])

    artist_url_match = re.match(r'https?://danbooru\.donmai\.us/artists/([^/]+)', argv[1])
    print('Looking up artist...')
    if artist_url_match:
        artist = danbooru_api_get('artists/' + artist_url_match.group(1))
    else:
        artist_name = argv[1].replace(' ', '_')
        for artist in danbooru_api_get('artists', {'search[name]': 'name:' + artist_name}):
            # Look for an exact match
            if artist['name'] == artist_name:
                break
        else:
            # No break means nothing matched
            return 'No such artist: {!r}'.format(artist_name)

    twitter_usernames = [
        # From command line
        match.group(1) if match else arg
        for match, arg in (
            (re.match(TWITTER_URL_RE, arg), arg)
            for arg in argv[2:]
        )
    ] or [
        # From artist entry on danbooru
        match.group(1)
        for match in (
            re.match(TWITTER_URL_RE, url['normalized_url'])
            for url in artist['urls']
        )
        if match
    ]
    if not twitter_usernames:
        return 'No twitter username(s) found.'

    posts_needing_update = {
        re.match(IMAGE_URL_RE, post['source']).group(1): post['id']
        #for post in get_posts(artist['name'], 'source:https://pbs.twimg.com/')
        for post in itertools.chain(
            get_posts(artist['name'], 'source:https://pbs.twimg.com/'),
            get_posts(artist['name'], 'source:http://pbs.twimg.com/')
        )
        # ~source:https://pbs.twimg.com/ ~source:http://pbs.twimg.com/ doesn't work
        # source:http*://pbs.twimg.com/ works but is technically wrong and might be harder on the database (?)
    }
    if posts_needing_update:
        print('Found {} posts needing update.'.format(len(posts_needing_update)))
    else:
        print('No posts with twimg sources found.')
        return

    print('Using twitter username(s):', ', '.join(twitter_usernames))
    for screen_name in twitter_usernames:
        for image_url, tweet_url in all_media_from_tweets(screen_name):
            match = re.match(IMAGE_URL_RE, image_url)
            if not match:
                # It might be a video thumbnail or something
                continue
            media_filename = match.group(1)
            try:
                post_id = posts_needing_update.pop(media_filename)
            except KeyError:
                continue
            tweet_url = tweet_url.replace('http://', 'https://', 1)
            print('Post #{} -> {}'.format(post_id, tweet_url))
            danbooru_api_put('posts/{}'.format(post_id), {
                'post[source]': tweet_url
            })
            if not posts_needing_update:
                print('All sources fixed.')
                return

    # Getting here means some posts couldn't be found. Twitter only
    # returns the most recent 3,200 tweets through the API, so that's
    # probably why. Another possibility is that the post was from a
    # different twitter account that wasn't specified on the command
    # line or present in danbooru's artist entry, or the post on
    # danbooru had the wrong artist tag.
    return 'Could not find sources for {} post(s): {}'.format(
        len(posts_needing_update),
        ', '.join(map(str, sorted(posts_needing_update.values(), reverse=True)))
    )

if __name__ == '__main__':
    import sys
    sys.exit(main(sys.argv))
For the script to work, you'll need a file called auth.json in your working directory like this:
auth.json template
{
    "danbooru": {
        "login": "your danbooru username goes here",
        "api_key": "your danbooru api key goes here"
    },
    "twitter": {
        "consumer_key": "your twitter oauth crap goes here",
        "consumer_secret": "and here",
        "access_token_key": "and here",
        "access_token_secret": "and here"
    }
}
To run it, the bare minimum is to give it the name of the artist tag (or the URL of the artist page) as the first command line argument. Any additional arguments will be interpreted as Twitter usernames (or URLs). If you don't provide any, it'll look up the artist on Danbooru and try to find Twitter accounts from the URLs there.
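The username-or-URL handling described above boils down to roughly the following. This is a simplified sketch, not the exact code from the script: the function names (to_screen_name, parse_args) are made up for illustration, and only the regex is the same.

```python
import re

# Same pattern the script uses to pull a username out of a Twitter URL.
TWITTER_URL_RE = r'https?://twitter\.com/([^/]+)'

def to_screen_name(arg):
    """Accept either a bare username or a twitter.com URL and return the username."""
    match = re.match(TWITTER_URL_RE, arg)
    return match.group(1) if match else arg

def parse_args(argv):
    """Split argv into (artist argument, list of Twitter usernames).

    An empty username list means "look the accounts up on Danbooru instead".
    """
    if len(argv) < 2:
        raise SystemExit('usage: {} artist_name [twitter_username ...]'.format(argv[0]))
    return argv[1], [to_screen_name(arg) for arg in argv[2:]]
```

So passing https://twitter.com/twitter_user and passing twitter_user end up as the same thing.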
Examples
Fix posts for mishima_kurone, looking up twitter username(s) automatically:
python3 danbooru-twitter-source-fix.py mishima_kurone
This is equivalent:
python3 danbooru-twitter-source-fix.py https://danbooru.donmai.us/artists/45593
If you want to specify the Twitter username(s) manually:
python3 danbooru-twitter-source-fix.py some_artist twitter_user alt_twitter
Or:
python3 danbooru-twitter-source-fix.py some_artist https://twitter.com/twitter_user
I've run it on mishima_kurone, kasu_(return), and caidychen for testing. It seems to work pretty well.
Logs
Looking up artist...
Found 31 posts needing update.
Using twitter username(s): caidychenkd
Post #1926517 -> https://twitter.com/caidychenkd/status/563683407897845760/photo/1
Post #1917221 -> https://twitter.com/caidychenkd/status/560409559651844096/photo/1
Post #1906363 -> https://twitter.com/caidychenkd/status/557132382307115008/photo/1
Post #1906360 -> https://twitter.com/caidychenkd/status/556306153735327745/photo/1
Post #1906356 -> https://twitter.com/caidychenkd/status/555718033235132416/photo/1
Post #1891212 -> https://twitter.com/caidychenkd/status/548122912125759490/photo/1
Post #1882259 -> https://twitter.com/caidychenkd/status/547579848533606400/photo/1
Post #1870682 -> https://twitter.com/caidychenkd/status/543029548879577088/photo/1
Post #1864648 -> https://twitter.com/caidychenkd/status/541019760171827200/photo/1
Post #1860315 -> https://twitter.com/caidychenkd/status/537206267429679104/photo/1
Post #1860444 -> https://twitter.com/caidychenkd/status/523468526166618114/photo/1
Post #1860350 -> https://twitter.com/caidychenkd/status/520171924979056640/photo/1
Post #1860346 -> https://twitter.com/caidychenkd/status/511038212442058752/photo/1
Post #1860339 -> https://twitter.com/caidychenkd/status/509339102110441473/photo/1
Post #1860334 -> https://twitter.com/caidychenkd/status/504613904257777665/photo/1
Post #1860445 -> https://twitter.com/caidychenkd/status/501746650017042432/photo/1
Post #1860328 -> https://twitter.com/caidychenkd/status/499130050919161859/photo/1
Post #1860321 -> https://twitter.com/caidychenkd/status/496082270487183360/photo/1
Post #1860317 -> https://twitter.com/caidychenkd/status/493636937643610114/photo/1
Post #1860409 -> https://twitter.com/caidychenkd/status/473090005322063873/photo/1
Post #1860354 -> https://twitter.com/caidychenkd/status/468614384101519360/photo/1
Post #1706664 -> https://twitter.com/caidychenkd/status/454569768163348480/photo/1
Post #1860407 -> https://twitter.com/caidychenkd/status/431766768705478656/photo/1
Post #1860362 -> https://twitter.com/caidychenkd/status/418705081173692416/photo/1
Post #1860402 -> https://twitter.com/caidychenkd/status/397688070037700608/photo/1
Post #1860371 -> https://twitter.com/caidychenkd/status/391758478814879745/photo/1
Post #1860396 -> https://twitter.com/caidychenkd/status/385621024303104001/photo/1
Post #1860400 -> https://twitter.com/caidychenkd/status/365491412550180865/photo/1
Post #1860391 -> https://twitter.com/caidychenkd/status/339714012239515648/photo/1
Post #1860383 -> https://twitter.com/caidychenkd/status/330974433193914369/photo/1
Post #1860387 -> https://twitter.com/caidychenkd/status/306231217521557504/photo/1
All sources fixed.

Looking up artist...
Found 26 posts needing update.
Using twitter username(s): kasu1923
Post #1931695 -> https://twitter.com/kasu1923/status/567281694185500674/photo/1
Post #1929818 -> https://twitter.com/kasu1923/status/564033031791316992/photo/1
Post #1929821 -> https://twitter.com/kasu1923/status/562914611872419840/photo/1
Could not find sources for 23 post(s): 1832332, 1812079, 1774528, 1758143, 1758139, 1758138, 1758137, 1758130, 1758124, 1758121, 1758119, 1758116, 1758115, 1758113, 1758111, 1758109, 1758108, 1758107, 1758105, 1758100, 1667125, 1667109, 1667108

Looking up artist...
Found 19 posts needing update.
Using twitter username(s): mishima_kurone
Post #1986455 -> https://twitter.com/mishima_kurone/status/589087614063353856/photo/1
Post #1922150 -> https://twitter.com/mishima_kurone/status/564385576766296064/photo/1
Post #1896123 -> https://twitter.com/mishima_kurone/status/552464451400904705/photo/1
Post #1845041 -> https://twitter.com/mishima_kurone/status/532893206770249728/photo/1
Post #1840195 -> https://twitter.com/mishima_kurone/status/530383108273496064/photo/1
Post #1807160 -> https://twitter.com/mishima_kurone/status/515505801260064769/photo/1
Post #1817949 -> https://twitter.com/mishima_kurone/status/511913166964412416/photo/1
Post #1773412 -> https://twitter.com/mishima_kurone/status/502356555610611712/photo/1
Post #1797987 -> https://twitter.com/mishima_kurone/status/478588411750531072/photo/1
Post #1674772 -> https://twitter.com/mishima_kurone/status/461895410223566849/photo/1
Post #1669784 -> https://twitter.com/mishima_kurone/status/459777103563460608/photo/1
Post #1665211 -> https://twitter.com/mishima_kurone/status/457822732520919040/photo/1
Post #1662073 -> https://twitter.com/mishima_kurone/status/455771877378506752/photo/1
Post #1662081 -> https://twitter.com/mishima_kurone/status/451246958850473984/photo/1
Post #1633996 -> https://twitter.com/mishima_kurone/status/442821580809191424/photo/1
Could not find sources for 4 post(s): 1620432, 1427485, 1427484, 1427483
Unfortunately, Twitter only allows retrieving the last 3,200 tweets through the API, so if an artist tweets a lot, older tweets won't be found. (You can see this in the above logs.) If anyone knows of a solution to this issue, I'd like to hear it.
Updates:
- 5 May 2015 - Basic error handling when getting tweets. Nothing fancy, it just prints Twitter's raw response and skips the username. But it stops the program from crashing and allows it to still work when e.g. an artist has multiple Twitters and one or more is deleted.
- (5 Feb 2017 - Fixed formatting.)
- (9 Aug 2018 - Fixed formatting again...)
Possible future improvements:
- Get around the 3,200 tweet limit somehow. (Help?)
- Keep a permanent list of the latest tweet accessed for each twitter, so on future runs it could use since_id to avoid fetching a ton of redundant data. This would be useful if one were going to run the script on a semi-regular basis as a sort of maintenance operation. (Actually, though, this would prevent the script from working on posts sourced from old tweets, unless all mappings were cached forever, which would take a prohibitive amount of space.)
- Remember which posts it has failed to find sources for. (Also mainly useful if used regularly as maintenance. This would save a lot of the redundancy without giving up the ability to fix old posts like the other idea does.)
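For the last idea, something like this might be enough. It's only a sketch and not part of the script: the helper names and the failed_posts.json file name are hypothetical, and it assumes the failures are stored as a plain JSON list of post IDs. One catch is that a remembered failure would need to be cleared by hand if the artist entry later gains another Twitter account.

```python
import json
import os

def load_failed(path):
    """Return the set of post IDs previously recorded as unfixable (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, 'r') as f:
        return set(json.load(f))

def save_failed(path, post_ids):
    """Write the failed post IDs back out as a sorted JSON list."""
    with open(path, 'w') as f:
        json.dump(sorted(post_ids), f)

def skip_known_failures(posts_needing_update, failed):
    """Drop already-failed posts from a {media filename: post ID} mapping."""
    return {
        name: post_id
        for name, post_id in posts_needing_update.items()
        if post_id not in failed
    }
```

The script would call load_failed('failed_posts.json') before building its list of posts, filter with skip_known_failures, and at the end add whatever is still left over to the set and save it. If nothing remains after filtering, it could skip the Twitter requests entirely.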
Any feedback is welcome.