Donmai

[Prototype] User Report Ver 6.3

Posted under General

Just to give everyone an update, I've been working on reproducing BrokenEagle98's post change report. I've realized the way the BigQuery table is structured is not optimal for generating the tag type columns. I would probably have to build a new table that explodes out the added_tags and removed_tags so they could be joined against a tags table. On the one hand that's a lot of work. On the other hand such a table would be far more useful for queries. It's probably the solution I'll work on next.

I may as well post some updates here.

I've started exporting tags to a BigQuery table. This is regenerated from scratch once a week.

I've created a new flattened post_versions table with the following schema:

  • version_id
  • version
  • updated_at
  • post_id
  • added_tag
  • removed_tag
  • updater_id
  • updater_ip_addr

What this enables is joins to the tags table, but also precise queries about the exact number of tags added/removed by a user, the number of initial tags used by the uploader, etc. This is a lot of data so the export is a slow going process.

Hey, I've been working on this myself. I collected a dump of the entire site (everything available in the JSON API) and imported all of it into BigQuery. Just finished the import of post_versions this morning. Here's the data:

BigQuery Dump

https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.artist_commentary_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.artist_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.note_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.post_versions
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_versions.wiki_pages_versions

https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.artist_commentaries
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.artists
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.bulk_update_requests
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.comments
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.forum_posts
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.forum_topics
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.notes
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.pools
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.post_appeals
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.post_flags
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.posts
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.tag_aliases
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.tag_implications
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.tags
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.uploads
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.user_feedbacks
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.users
https://bigquery.cloud.google.com/table/turing-zone-143603:danbooru_latest.wiki_pages

Here's a sample query for post_versions, showing how to search for edits removing tagme and adding 1girl and solo:

SELECT
  id, post_id, updater_id, rating, tags, added_tags, removed_tags, source
FROM [danbooru_versions.post_versions]
WHERE
  TRUE
  AND REGEXP_MATCH(added_tags, "1girl")
  AND REGEXP_MATCH(added_tags, "solo")
  AND REGEXP_MATCH(removed_tags, "tagme")
--  AND id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE added_tags_array.name = "1girl")
--  AND id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE added_tags_array.name = "solo")
--  AND id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE removed_tags_array.name = "tagme")

Haven't played with it much, but my initial impression is: it works, queries are fast, but a little expensive. BigQuery bills you on the amount of data processed by per query, $5 per 1TB, and the above search costs 5-6GB. Or $0.03 per search. Limiting searches by date ranges, or partitioning the table itself by dates may help, haven't experimented with that yet.

@chodorov:

Here is the list of everyone who has ever added the day tag to a post.

"day" Taggers
0usercount
1user_4607971239
2Inferno847
3iphn472
4KazuyaRazuKazama352
5chodorov171
6Randeel61
7Nitrogen0946
8demosquid33
9☆♪26
10dereyoruk24
11zeparoh16
12Tapper16
13warh15
14Schrobby15
15Lannihan14
16Sacriven11
17hitokage10
18titaniachkt10
19zedk810
20user_302909
21Apollyon8
22Fenen8
23user_3442288
24Ars7
25gary255666
264get6
27CyberWire6
28AliceSoftfan6
29setzer525
30buehbueh5
31Deuxsonic5
32Qpax4
33OniTea4
34CodeKyuubi4
35GiantCaveMushroom4
36user_3495384
37Rampardos4
38RomanticTP4
39dean_exia4
40feline_lump3
41Kazzius3
42Xeano943
43Jarlath3
44transientfaith3
45Zansnae7933
46Stefan862
47NCAA_Gundam2
48SteelGolem2
49AkterSnurra2
50Kyuzeth2
51Floater2
52clara_dolls2
53FireSky32
54WiftyWafter2
55Gauron17862
56albert2
57AlephFish2
58RaisingK2
59Arrei2
60SionJack2
61Megamet2
62lavitzlover2
63rollingstart2
64jfkjfk1
65parasol1
66[KN]1
67RandomKitsune1
68nanami1
69Alignn1
70Zer0Axiom1
71.musouka1
72Mavado1
73Azuretan1
74AngryJellyfish1
75user_1152241
76Butts.1
77DschingisKhan1
78EB1
79Danaii1
80Kayako1
81Dogenzaka1
82FinderX1
83PowderTrail1
84Bibs1
85cutemi21
86Crab_Cake1
87iinitori1
88dep11ra1
89Tsumanne1
90EcheIII1
91KubaAseph1
92Chucu1
93v5718661
94JukitaChan1
95Yamaro1
96Gachayuri1
97theadonicus1
98Becker2601
99feldt1
100marumaruko1
101Timmaine1
102lady-jotaro1
103beltman1
104Kitsu~1
105Tzimisce9801
106Wrongfire1
107DakuTree1
108TheGoldenDarkness1
109mattiasc021
110SunakoKirishiki1
111Herrmobel1
112Mulambo1
113UnChocolate1
114Rise_And_Fall1
115FoolyDooly1
116tapnek1
117henmere1
118mangadaemon1
119EatCongee1
120Amoriderefarfalla1
121Lenz1
122warcry1
123F.I.A1
124Doragonn1
125lady_garegga1
126anon6521
127zaregoto1
128VinceKillinger1
129rantuyetmai1
130tewitr1
131Garren1
132BeyondReason1
133Shallie1
134peshmi1
135iridescent_slime1
SELECT
  u.name AS user,
  COUNT(v.id) AS count
FROM [danbooru_versions.post_versions] AS v
JOIN [danbooru_latest.users] AS u ON
  v.updater_id = u.id
WHERE
  TRUE
  AND v.id IN (SELECT id FROM [danbooru_versions.post_versions] WHERE added_tags_array.name = "day")
GROUP BY user
ORDER BY count DESC

@evazion

Whoah... so do you have to set up billing information with Google before you can do search queries that large? If so, I don't know how I was able to run queries on Allynay's BigQuery table (topic #12774).

Also, now that you have this data, could you provide query examples about how to answer your questions as posted in the above topic, forum #119572? (I'm still an SQL n00b...)

evazion said: (excerpt)

For any given tag:

Who coined this tag?
How has this tag grown over time?
How many people have ever used this tag?
Who are the top taggers for this tag?
What posts has it been added to recently? What posts has it been removed from?

For any given user:

How many different posts have they tagged?
What are the top tags that they've added? That they've removed? And to which posts?

I think the last question was partially answered with the JSON data I collected (at least over the last month), but it'd be nice to know how to do it with your data.

Edit:

Just thought of a question I'd like to add to the above.

The top N people that have ever coined a new tag, additive based upon the current size of the tag. Ex: 1girl would be more of a plus than alternate vehicle. Maybe even break it down by tagtype, as the artist/copyright/character tags are natural creations, whereas some of the general tags can require a bit of creativity and tag gardening.

Updated

@BrokenEagle98 said:
Whoah... so do you have to set up billing information with Google before you can do search queries that large? If so, I don't know how I was able to run queries on Allynay's BigQuery table (topic #12774).

I'm actually not sure about that either. I *think* using the web interface is free and only API usage is billed. But I can't find anything stating that. I just know I was able to run queries before signing up as well. And after signing up nothing has shown up on my usage quota either.

Also if you sign up the first 1TB of queries per month is free. Plus there's a trial where they give you like $300 to spend for free, and I think they don't bill you if you go over unless you give permission. So it's not much problem for personal usage at least.

Also, now that you have this data, could you provide query examples about how to answer your questions as posted in the above topic, forum #119572? (I'm still an SQL n00b...)

Haha, I haven't figured that out myself yet! I spent all week just on dumping the site and getting it all imported. It's the next thing I want to work on. I'm no SQL guru myself though, and BigQuery's flavor of SQL is a little different than standard SQL, so I'm gonna have to play around with it and see what I can figure out.

Update:

Updated the tables for September's data. Also made generating the report mostly automated (~6 hours), producing a text file at the end that can be copy/pasted into the forum post edit box.

One idea I'm going to start working on is to do add a delta column to all tables, comparing the rank from the last 30 days with the rank from the prior 30 days (60-30 days ago), just to show who has started doing more/less work in certain areas.

Latest Update

Ver 4.5 -> Ver 4.6

  • Added Rank Difference for most tables
    • Difference between current month and prior month
    • Nonactivity in the prior month gets displayed as "None"
    • For the bottom/top upload tables:
      • Not making the 100 upload cutoff the prior month gets displayed as "Null"

Some initial feedback based upon the reports available...

1. Would it be possible to use the Danbooru CSS styling so that the colors and any other effects for usernames based upon privileges get generated just like they do on Danbooru?

  • This may eventually require a logon so that custom CSS settings for the user get pushed to the reports site?
  • Another method would be to have toggleable options to color levels differently, and to add various effects based upon permissions
  • Or just pick a scheme and then have a legend on the page explaining how to determine levels/permissions

2. You have a Contrib column on the notes report, and maybe it was meant to be Builder+ instead...? Regardless, if #1 above gets implemented, then the Contrib column would no longer be needed as the CSS styling would reveal user levels and privileges.
3. There are no deletions on the member uploads report, which seems odd since there should be at least a few deletions.
4. There are a ton of deletions on the contributor report, even though that should not be the case.

I believe what's going on for #3 and #4 is that the deletion column is actually the Moderation Queue Bypass column from my uploads report, and it's missing the actual deleted column.

5. The notes report is missing the Totals column.
6. Since you can write these reports using pure HTML instead of using DText, would it be possible to make the columns sortable like I've seen on Wikipedia and other sites?

Updated

Some more feedback for the latest reports added...

1. The post changes report is missing the gentag column.
2. Like I mentioned prior, the tagger's report would not need a contrib column if effects were added to the user's names.

Some more feedback for the latest reports added...

1. 100 seems like a high cutoff for wiki pages and artists. 50 would be a better cutoff.
2. This wiki pages is report is broken, as it has 0's for all values.
3. Just noticed that there is not a pools folder. Like above, 50 is a good cutoff for pools.

Would it be possible to link to the specific changes from the total? For instance, if someone wanted to check on a user's artist group changes, make the total changes a hyper link to all those changes? And so on and so forth for other changes. I can understand if some of this do not have a page but for the ones that do it would be nice to have the link handy.

Slick looking stuff so far from what I've had a look at.

@albert - have you really flagged 1526 images over the last 30 days or is there something wrong there?

Also it seems to be possible for approvers to have a -0.0 neg conf/del conf as well as a 0.0 neg conf/del conf, which can't be right. The neg conf case is possible for uploads too.

The tag_implications pages aren't loading anything (it's just a blank page). Is this what is meant to happen?

Some of the thresholds seem too high to be useful - the alias/bulk update ones for instance will probably almost never get anyone other than hillside_moose on there. And the forum topics one only has a spammer.

Otherwise good job.

Updated

1 3 4 5 6 7 8 9 10 11 15