And I thought I haven't seen Provence for a while here on Danbooru.
Any news from him?
Posted under General
Just to give everyone an update, I've been working on reproducing BrokenEagle98's post change report. I've realized the way the BigQuery table is structured is not optimal for generating the tag type columns. I would probably have to build a new table that explodes out the added_tags and removed_tags so they could be joined against a tags table. On the one hand that's a lot of work. On the other hand such a table would be far more useful for queries. It's probably the solution I'll work on next.
I may as well post some updates here.
I've started exporting tags to a BigQuery table. This is regenerated from scratch once a week.
I've created a new flattened post_versions table with the following schema:
What this enables is joins to the tags table, but also precise queries about the exact number of tags added/removed by a user, the number of initial tags used by the uploader, etc. This is a lot of data so the export is a slow going process.
Hey, I've been working on this myself. I collected a dump of the entire site (everything available in the JSON API) and imported all of it into BigQuery. Just finished the import of post_versions this morning. Here's the data:
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.artist_commentary_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.artist_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.note_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.post_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.wiki_pages_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.artist_commentaries
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.artists
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.bulk_update_requests
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.comments
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.forum_posts
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.forum_topics
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.notes
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.pools
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.post_appeals
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.post_flags
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.posts
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.tag_aliases
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.tag_implications
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.tags
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.uploads
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.user_feedbacks
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.users
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.wiki_pages
Here's a sample query for post_versions, showing how to search for edits removing tagme and adding 1girl and solo:
SELECT id, post_id, updater_id, rating, tags, added_tags, removed_tags, source FROM [danbooru_versions.post_versions] WHERE TRUE AND REGEXP_MATCH(added_tags, "1girl") AND REGEXP_MATCH(added_tags, "solo") AND REGEXP_MATCH(removed_tags, "tagme") -- AND id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE added_tags_array.name = "1girl") -- AND id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE added_tags_array.name = "solo") -- AND id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE removed_tags_array.name = "tagme")
Haven't played with it much, but my initial impression is: it works, queries are fast, but a little expensive. BigQuery bills you on the amount of data processed by per query, $5 per 1TB, and the above search costs 5-6GB. Or $0.03 per search. Limiting searches by date ranges, or partitioning the table itself by dates may help, haven't experimented with that yet.
Here is the list of everyone who has ever added the day tag to a post.
0 | user | count |
---|---|---|
1 | user_460797 | 1239 |
2 | Inferno | 847 |
3 | iphn | 472 |
4 | KazuyaRazuKazama | 352 |
5 | chodorov | 171 |
6 | Randeel | 61 |
7 | Nitrogen09 | 46 |
8 | demosquid | 33 |
9 | ☆♪ | 26 |
10 | dereyoruk | 24 |
11 | zeparoh | 16 |
12 | Tapper | 16 |
13 | warh | 15 |
14 | Schrobby | 15 |
15 | Lannihan | 14 |
16 | Sacriven | 11 |
17 | hitokage | 10 |
18 | titaniachkt | 10 |
19 | zedk8 | 10 |
20 | user_30290 | 9 |
21 | Apollyon | 8 |
22 | Fenen | 8 |
23 | user_344228 | 8 |
24 | Ars | 7 |
25 | gary25566 | 6 |
26 | 4get | 6 |
27 | CyberWire | 6 |
28 | AliceSoftfan | 6 |
29 | setzer52 | 5 |
30 | buehbueh | 5 |
31 | Deuxsonic | 5 |
32 | Qpax | 4 |
33 | OniTea | 4 |
34 | CodeKyuubi | 4 |
35 | GiantCaveMushroom | 4 |
36 | user_349538 | 4 |
37 | Rampardos | 4 |
38 | RomanticTP | 4 |
39 | dean_exia | 4 |
40 | feline_lump | 3 |
41 | Kazzius | 3 |
42 | Xeano94 | 3 |
43 | Jarlath | 3 |
44 | transientfaith | 3 |
45 | Zansnae793 | 3 |
46 | Stefan86 | 2 |
47 | NCAA_Gundam | 2 |
48 | SteelGolem | 2 |
49 | AkterSnurra | 2 |
50 | Kyuzeth | 2 |
51 | Floater | 2 |
52 | clara_dolls | 2 |
53 | FireSky3 | 2 |
54 | WiftyWafter | 2 |
55 | Gauron1786 | 2 |
56 | albert | 2 |
57 | AlephFish | 2 |
58 | RaisingK | 2 |
59 | Arrei | 2 |
60 | SionJack | 2 |
61 | Megamet | 2 |
62 | lavitzlover | 2 |
63 | rollingstart | 2 |
64 | jfkjfk | 1 |
65 | parasol | 1 |
66 | [KN] | 1 |
67 | RandomKitsune | 1 |
68 | nanami | 1 |
69 | Alignn | 1 |
70 | Zer0Axiom | 1 |
71 | .musouka | 1 |
72 | Mavado | 1 |
73 | Azuretan | 1 |
74 | AngryJellyfish | 1 |
75 | user_115224 | 1 |
76 | Butts. | 1 |
77 | DschingisKhan | 1 |
78 | EB | 1 |
79 | Danaii | 1 |
80 | Kayako | 1 |
81 | Dogenzaka | 1 |
82 | FinderX | 1 |
83 | PowderTrail | 1 |
84 | Bibs | 1 |
85 | cutemi2 | 1 |
86 | Crab_Cake | 1 |
87 | iinitori | 1 |
88 | dep11ra | 1 |
89 | Tsumanne | 1 |
90 | EcheIII | 1 |
91 | KubaAseph | 1 |
92 | Chucu | 1 |
93 | v571866 | 1 |
94 | JukitaChan | 1 |
95 | Yamaro | 1 |
96 | Gachayuri | 1 |
97 | theadonicus | 1 |
98 | Becker260 | 1 |
99 | feldt | 1 |
100 | marumaruko | 1 |
101 | Timmaine | 1 |
102 | lady-jotaro | 1 |
103 | beltman | 1 |
104 | Kitsu~ | 1 |
105 | Tzimisce980 | 1 |
106 | Wrongfire | 1 |
107 | DakuTree | 1 |
108 | TheGoldenDarkness | 1 |
109 | mattiasc02 | 1 |
110 | SunakoKirishiki | 1 |
111 | Herrmobel | 1 |
112 | Mulambo | 1 |
113 | UnChocolate | 1 |
114 | Rise_And_Fall | 1 |
115 | FoolyDooly | 1 |
116 | tapnek | 1 |
117 | henmere | 1 |
118 | mangadaemon | 1 |
119 | EatCongee | 1 |
120 | Amoriderefarfalla | 1 |
121 | Lenz | 1 |
122 | warcry | 1 |
123 | F.I.A | 1 |
124 | Doragonn | 1 |
125 | lady_garegga | 1 |
126 | anon652 | 1 |
127 | zaregoto | 1 |
128 | VinceKillinger | 1 |
129 | rantuyetmai | 1 |
130 | tewitr | 1 |
131 | Garren | 1 |
132 | BeyondReason | 1 |
133 | Shallie | 1 |
134 | peshmi | 1 |
135 | iridescent_slime | 1 |
SELECT u.name AS user, COUNT(v.id) AS count FROM [danbooru_versions.post_versions] AS v JOIN [danbooru_latest.users] AS u ON v.updater_id = u.id WHERE TRUE AND v.id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE added_tags_array.name = "day") GROUP BY user ORDER BY count DESC
Whoah... so do you have to set up billing information with Google before you can do search queries that large? If so, I don't know how I was able to run queries on Allynay's BigQuery table (topic #12774).
Also, now that you have this data, could you provide query examples about how to answer your questions as posted in the above topic, forum #119572? (I'm still an SQL n00b...)
evazion said: (excerpt)
For any given tag:
Who coined this tag?
How has this tag grown over time?
How many people have ever used this tag?
Who are the top taggers for this tag?
What posts has it been added to recently? What posts has it been removed from?For any given user:
How many different posts have they tagged?
What are the top tags that they've added? That they've removed? And to which posts?
I think the last question was partially answered with the JSON data I collected (at least over the last month), but it'd be nice to know how to do it with your data.
Just thought of a question I'd like to add to the above.
The top N people that have ever coined a new tag, additive based upon the current size of the tag. Ex: 1girl would be more of a plus than alternate vehicle. Maybe even break it down by tagtype, as the artist/copyright/character tags are natural creations, whereas some of the general tags can require a bit of creativity and tag gardening.
Updated
@BrokenEagle98 said:
Whoah... so do you have to set up billing information with Google before you can do search queries that large? If so, I don't know how I was able to run queries on Allynay's BigQuery table (topic #12774).
I'm actually not sure about that either. I *think* using the web interface is free and only API usage is billed. But I can't find anything stating that. I just know I was able to run queries before signing up as well. And after signing up nothing has shown up on my usage quota either.
Also if you sign up the first 1TB of queries per month is free. Plus there's a trial where they give you like $300 to spend for free, and I think they don't bill you if you go over unless you give permission. So it's not much problem for personal usage at least.
Also, now that you have this data, could you provide query examples about how to answer your questions as posted in the above topic, forum #119572? (I'm still an SQL n00b...)
Haha, I haven't figured that out myself yet! I spent all week just on dumping the site and getting it all imported. It's the next thing I want to work on. I'm no SQL guru myself though, and BigQuery's flavor of SQL is a little different than standard SQL, so I'm gonna have to play around with it and see what I can figure out.
Updated the tables for September's data. Also made generating the report mostly automated (~6 hours), producing a text file at the end that can be copy/pasted into the forum post edit box.
One idea I'm going to start working on is to do add a delta column to all tables, comparing the rank from the last 30 days with the rank from the prior 30 days (60-30 days ago), just to show who has started doing more/less work in certain areas.
Ver 4.5 -> Ver 4.6
I have been slowing generating reports here: http://isshiki.donmai.us/reports
It's still a work in progress but uploads, tags, notes, artists, and wiki is done or will be done in the next few days.
Some initial feedback based upon the reports available...
1. Would it be possible to use the Danbooru CSS styling so that the colors and any other effects for usernames based upon privileges get generated just like they do on Danbooru?
2. You have a Contrib column on the notes report, and maybe it was meant to be Builder+ instead...? Regardless, if #1 above gets implemented, then the Contrib column would no longer be needed as the CSS styling would reveal user levels and privileges.
3. There are no deletions on the member uploads report, which seems odd since there should be at least a few deletions.
4. There are a ton of deletions on the contributor report, even though that should not be the case.
I believe what's going on for #3 and #4 is that the deletion column is actually the Moderation Queue Bypass column from my uploads report, and it's missing the actual deleted column.
5. The notes report is missing the Totals column.
6. Since you can write these reports using pure HTML instead of using DText, would it be possible to make the columns sortable like I've seen on Wikipedia and other sites?
Updated
Some more feedback for the latest reports added...
1. 100 seems like a high cutoff for wiki pages and artists. 50 would be a better cutoff.
2. This wiki pages is report is broken, as it has 0's for all values.
3. Just noticed that there is not a pools folder. Like above, 50 is a good cutoff for pools.
Would it be possible to link to the specific changes from the total? For instance, if someone wanted to check on a user's artist group changes, make the total changes a hyper link to all those changes? And so on and so forth for other changes. I can understand if some of this do not have a page but for the ones that do it would be nice to have the link handy.
Slick looking stuff so far from what I've had a look at.
@albert - have you really flagged 1526 images over the last 30 days or is there something wrong there?
Also it seems to be possible for approvers to have a -0.0 neg conf/del conf as well as a 0.0 neg conf/del conf, which can't be right. The neg conf case is possible for uploads too.
The tag_implications pages aren't loading anything (it's just a blank page). Is this what is meant to happen?
Some of the thresholds seem too high to be useful - the alias/bulk update ones for instance will probably almost never get anyone other than hillside_moose on there. And the forum topics one only has a spammer.
Otherwise good job.
Updated