Donmai

[Prototype] User Report Ver 6.3

Posted under General

About the Top/Bottom taggers: wouldn't it be better to count only gentags in that category? Char/artist/copyright tags are pretty much given by the artist (except on Twitter), so I would count only gentags there. A bit late to notice this, I know ^^

For now I think this looks pretty complete. I looked through all the categories and can't see anything missing.
Well, except the Approval/Janitor report, but that is already covered in this list: http://danbooru.donmai.us/reports/janitor_trials
So that should also be added in the final version, once there is a "Reports" subitem in the bar above.

Updated

Sacriven said:

Wow, those data are meticulously made. I'm really thankful.

Make sure that you guys don't force yourselves too much.

A dozen or so hours of coding was all it took... after that, I let my computer do all the work :p

RaisingK said:

18 tags/post isn't that bad, is it? Well, at least I'm not as bad as albert. Shame on you, albert. Shame. :p

It's your bot that has that record, not you ;)

Also, I asked an Admin/Mod before uploading the Top/Bottom tables, and this is what they said:

Hah, it turned out rather fun. If I uploaded more, I'd also end up in the bottom listing since I primarily upload comics, where 15-20 tags is good enough. I'd say leave everything intact - collecting data is one thing, deciding what it says is another. Maybe it says there's almost no active contributors who deserve any feedback, even in the bottom part of the list.

Would you be willing to share your code for doing this? Because I'm interested in playing around with this and trying out more rows and different columns. A few things in particular I'm curious about:

  • Average upload score & average favcount.
  • Missing artists: number of uploads with a source but with no artist tag. This indicates people who need to be told to create new artist tags, especially for Pixiv uploads where the artist is easily available.
  • Rating changes: number of uploads where the rating was changed by someone else after initial upload (i.e. rating in latest post version is different than in first version). A high number of these could indicate people who aren't rating their uploads correctly.
  • Copyright/Character/Artist changes: number of uploads where the chartags/copytags/arttags are changed after initial upload. May indicate neglect in correctly identifying and tagging these things.
  • Reparented uploads: Number of uploads that are reparented by someone else after initial upload. May indicate people who upload a lot of dupes.
  • Average tag share: tag count of the initial upload divided by tag count of the current version, averaged across all uploads. Say you add 10 tags to your initial upload, but then later other people come along and add 20 more tags to it. Therefore you contributed only 33% of the work in tagging the image. So this is another measure of tagging thoroughness beyond the basic tags-per-post metric.
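To make the last item concrete, here is a rough sketch of the tag share calculation. It assumes each upload has already been reduced to a pair of tag counts (first version, current version); that input shape and the function names are made up for illustration and are not taken from the actual report code.

```python
from statistics import mean


def tag_share(initial_count, current_count):
    """Tag count of the first version divided by the current tag count."""
    if current_count == 0:
        # Assumption: a post with no tags left contributes a share of 0.
        return 0.0
    # Note: this can exceed 1.0 if tags were removed after the upload.
    return initial_count / current_count


def average_tag_share(uploads):
    """uploads: iterable of (initial_count, current_count) pairs (hypothetical shape)."""
    shares = [tag_share(initial, current) for initial, current in uploads]
    return mean(shares) if shares else 0.0


# The example from the list: 10 tags at upload, 20 more added later by others.
print(f"{tag_share(10, 30):.0%}")
```

With the numbers from the example (10 initial tags, 30 current tags) the share comes out to one third, matching the 33% figure.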

Also, how about posting these as spreadsheets on Google Docs? That way you could post more rows in your tables and it'd be easier for us to sort them by different columns, add custom columns, etc.

evazion said:

Would you be willing to share your code for doing this? Because I'm interested in playing around with this and trying out more rows and different columns. A few things in particular I'm curious about:

Put it on GitHub!

evazion said:

Also, how about posting these as spreadsheets on Google Docs? That way you could post more rows in your tables and it'd be easier for us to sort them by different columns, add custom columns, etc.

Throwing in EtherCalc.

evazion said:

Would you be willing to share your code for doing this?

Working on it... the code cleanliness isn't near the share level yet (i.e. it's embarrassing >__< )... I'll post it on GitHub once the code reaches a certain maturity (<1 week) and I figure GitHub out...

evazion said:

  • Average upload score & average favcount.

I thought about including these as columns, but deliberately decided against it. Comment score is rather benign which is why I included it, but when you start talking about post score or favcount, either average or cumulative, people start getting defensive and wonder if you're trying to demote them as contributors or perhaps assign other penalties...

evazion said: (excerpt)

  • Missing artists... Rating changes ... Copyright/Character/Artist changes... Reparented uploads... Average tag share....

Those are all great ideas... If they get incorporated, the uploader's table might need to get split, as it's pretty wide already... :P

evazion said:

Also, how about posting these as spreadsheets on Google Docs? That way you could post more rows in your tables and it'd be easier for us to sort them by different columns, add custom columns, etc.

I had wondered about other ways to share the data. For now, I've just uploaded the Upload Data.

https://docs.google.com/spreadsheets/d/1NCR-y2hMR3Oxl10wveAvpaV_qx71rQeFVd0HgqY53aU/edit?usp=sharing

I'll upload the rest and share links tomorrow... I'll also take a look at EtherCalc tomorrow...

"Easily accessible"
Well, it is only easily accessible if you know Japanese well enough to create the artist tag, since we use transliteration. Unless the name is already in roman letters, of course. So this is double-edged, and these tables should be as definitive as possible.

This would only make sense if the uploader does not use artist request, since that's what the tag is for. Then we would know that these users don't care about the artist for that upload.

Also, these tables already count the tags that are removed from a user's own uploads.
But if you do tag gardening, it would be nice to see not only how many tags you added but also how many of those were removed afterwards.
Say there is a post that the uploader tagged without breasts. I add that tag, but a third user (or the uploader) removes it afterwards. That kind of removal should be counted in the first table.

And since we are also counting tags/upload: I think we should split that into two categories, one with humans and one with no_humans. Uploads without any humans normally have very few tags, most of the time fewer than ten, for example PokéUploads.

Updated

The following was updated in the first post:

  • Added links to raw data for all versions/categories (Raw Data section under Data section)
  • Changed how data was collected for several categories (frontend vs backend)
    • Several tables have been updated with latest data

Besides the above, I've finished cleaning the code for 4 of 6 main files for this data collection. The other 2 should be finished tomorrow. Then it's GitHub time... :)

Latest Update

Ver 4.3 -> Ver 4.4

  • Removed uploads from post table since uploads now have their own table
  • Added granularity to Add Tags and Remove Tags by splitting them up into GenTags, Chartags, CopyTags, ArtTags, EmptyTags
    • Empty Tags were the same as Tag Errors in the Uploads table (Now renamed in the Upload table as well)

The above made the post table very wide. Take a look and see if the + and - tags should be combined, or maybe just shown differently. However, if you combine the two, you lose the fact that some users primarily focused on adding, some on removing, and some on both.

As an example, the two counts could share one column, shown as something like the following:

+-GenTag
(35552,4259)

This would reduce the width of the table by a bit.

Another option would be to create two tables, splitting the stats between tables.

No changes, but I've been experimenting with the Uploads records and their corresponding Posts records to derive some interesting data. This all started when I heard one user comment on being "sniped". That got me looking at the practice of entering minimal tags in the Uploads interface and filling in the rest afterwards, which some users do to avoid having someone else post the image first.

Each Upload record contains both the original tags entered before hitting the submit button as well as the timestamp for that action. This allows an Upload record to be compared with the first record of the corresponding Post. Unfortunately, only the last 36 hours or so of Uploads data is held before being discarded, but the data could be accumulated once per day so that a longer stretch of data could be collected.

The first data item is what I tongue-in-cheek call a "Snipe". It was counted each time fewer than 10 tags were entered in the Uploads interface and more than 10 tags were added afterward in the Posts interface...

For each "Snipe", the total difference in tags is tallied as well as the total loiter time (the difference in time between the Upload record and the Post record). Besides the Snipes, the total Uploads as well as the number of Upload Errors (e.g. duplicates, invalid filetypes, etc.) are counted. From all of that data, some additional data is derived, such as Tags/Snipe, Loiter/Snipe, and Snipe% (Snipes/Uploads).
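The counting described above could look roughly like the following. This is only a sketch and not the actual collection code: the record shape and field names are invented, and it assumes that failed uploads never count as Snipes but do count toward total Uploads.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class UploadRecord:
    # Hypothetical record shape; field names are not from the real data.
    upload_tags: int        # tags entered in the Uploads interface
    tags_added_after: int   # tags added afterward in the Posts interface
    uploaded_at: datetime   # timestamp of the Upload record
    tagged_at: datetime     # timestamp of the Post record
    error: bool = False     # duplicate, invalid filetype, ...


def is_snipe(rec, threshold=10):
    """Fewer than `threshold` tags at upload, more than `threshold` added later."""
    return (not rec.error
            and rec.upload_tags < threshold
            and rec.tags_added_after > threshold)


def summarize(records):
    """Tally Snipes for one user and derive Tags/Snipe, Loiter/Snipe and Snipe%."""
    snipes = [r for r in records if is_snipe(r)]
    n = len(snipes)
    uploads = len(records)
    snipe_tags = sum(r.tags_added_after for r in snipes)
    loiter = sum((r.tagged_at - r.uploaded_at for r in snipes), timedelta())
    return {
        "snipe_tags": snipe_tags,
        "snipes": n,
        "uploads": uploads,
        "loiter": loiter,
        "upload_errors": sum(r.error for r in records),
        "tags_per_snipe": snipe_tags / n if n else 0.0,
        "loiter_per_snipe": loiter / n if n else timedelta(),
        "snipe_pct": 100 * n / uploads if uploads else 0.0,
    }
```

Loiter times stay as `timedelta` values here, so formatting them as `01h28m55s` would be a separate display step.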

Once all that data is gathered, multiple tables are compiled with multiple orderings based on the derived data. Only users in the Top 50th percentile for both Uploads and Snipes are included in the following tables.
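One possible reading of that filter, sketched below, is "at or above the median for both Uploads and Snipes". That interpretation is an assumption, as are the function names and the dict layout; the user IDs in the sample are made up.

```python
from statistics import median


def top_half(stats):
    """Keep users at or above the median for both uploads and snipes.

    stats: dict of user_id -> {"uploads": int, "snipes": int} (hypothetical shape).
    """
    if not stats:
        return {}
    upload_cut = median(s["uploads"] for s in stats.values())
    snipe_cut = median(s["snipes"] for s in stats.values())
    return {
        uid: s for uid, s in stats.items()
        if s["uploads"] >= upload_cut and s["snipes"] >= snipe_cut
    }


def ranked(stats, key, limit=10):
    """Order users by one column, highest first, and keep the top `limit`."""
    ordered = sorted(stats.items(), key=lambda item: item[1][key], reverse=True)
    return ordered[:limit]


# Made-up sample data, only to show the call shape.
sample = {
    1: {"uploads": 11, "snipes": 5},
    2: {"uploads": 39, "snipes": 21},
    3: {"uploads": 2, "snipes": 0},
    4: {"uploads": 50, "snipes": 1},
}
for user_id, row in ranked(top_half(sample), "snipes"):
    print(user_id, row)
```

Calling `ranked` once per column (Tags/Snipe, Loiter/Snipe, Snipe%, Upload Errors) on the filtered users would give one ordering per table.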

Data

Updated at Fri Sep 2 19:55:23 2016 UTC; Duration: 1.66 days

Ordered by Tags/Snipe

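| Rank | User ID | Snipe Tags | Snipes | All Uploads | Loiter Time | Upload Errors | Tags/Snipe | Loiter/Snipe | Snipe% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 187722 | 253 | 5 | 11 | 01h28m55s | 0 | 50.6 | 17m47s | 45.45 |
| 2 | 460797 | 730 | 21 | 39 | 00h38m39s | 1 | 34.76 | 01m50s | 53.84 |
| 3 | 166417 | 362 | 12 | 21 | 00h23m54s | 0 | 30.16 | 01m59s | 57.14 |
| 4 | 397518 | 530 | 18 | 83 | 02h07m27s | 1 | 29.44 | 07m04s | 21.68 |
| 5 | 49984 | 669 | 23 | 66 | 01h13m51s | 4 | 29.08 | 03m12s | 34.84 |
| 6 | 369231 | 299 | 11 | 23 | 00h40m10s | 4 | 27.18 | 03m39s | 47.82 |
| 7 | 56947 | 80 | 3 | 33 | 00h56m33s | 3 | 26.66 | 18m51s | 9.09 |
| 8 | 454478 | 557 | 24 | 57 | 02h21m30s | 1 | 23.2 | 05m53s | 42.1 |
| 9 | 393097 | 303 | 14 | 14 | 04h28m36s | 0 | 21.64 | 19m11s | 100.0 |
| 10 | 133311 | 43 | 2 | 39 | 00h03m09s | 6 | 21.5 | 01m34s | 5.12 |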

Ordered by Loiter/Snipe

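| Rank | User ID | Snipe Tags | Snipes | All Uploads | Loiter Time | Upload Errors | Tags/Snipe | Loiter/Snipe | Snipe% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 357238 | 54 | 4 | 47 | 01h35m23s | 2 | 13.5 | 23m50s | 8.51 |
| 2 | 393097 | 303 | 14 | 14 | 04h28m36s | 0 | 21.64 | 19m11s | 100.0 |
| 3 | 56947 | 80 | 3 | 33 | 00h56m33s | 3 | 26.66 | 18m51s | 9.09 |
| 4 | 187722 | 253 | 5 | 11 | 01h28m55s | 0 | 50.6 | 17m47s | 45.45 |
| 5 | 397518 | 530 | 18 | 83 | 02h07m27s | 1 | 29.44 | 07m04s | 21.68 |
| 6 | 454478 | 557 | 24 | 57 | 02h21m30s | 1 | 23.2 | 05m53s | 42.1 |
| 7 | 30072 | 393 | 23 | 29 | 01h43m42s | 1 | 17.08 | 04m30s | 79.31 |
| 8 | 369231 | 299 | 11 | 23 | 00h40m10s | 4 | 27.18 | 03m39s | 47.82 |
| 9 | 49984 | 669 | 23 | 66 | 01h13m51s | 4 | 29.08 | 03m12s | 34.84 |
| 10 | 366860 | 251 | 12 | 62 | 00h28m20s | 6 | 20.91 | 02m21s | 19.35 |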

Ordered by Snipe Percentage

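| Rank | User ID | Snipe Tags | Snipes | All Uploads | Loiter Time | Upload Errors | Tags/Snipe | Loiter/Snipe | Snipe% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 393097 | 303 | 14 | 14 | 04h28m36s | 0 | 21.64 | 19m11s | 100.0 |
| 2 | 30072 | 393 | 23 | 29 | 01h43m42s | 1 | 17.08 | 04m30s | 79.31 |
| 3 | 166417 | 362 | 12 | 21 | 00h23m54s | 0 | 30.16 | 01m59s | 57.14 |
| 4 | 460797 | 730 | 21 | 39 | 00h38m39s | 1 | 34.76 | 01m50s | 53.84 |
| 5 | 369231 | 299 | 11 | 23 | 00h40m10s | 4 | 27.18 | 03m39s | 47.82 |
| 6 | 187722 | 253 | 5 | 11 | 01h28m55s | 0 | 50.6 | 17m47s | 45.45 |
| 7 | 454478 | 557 | 24 | 57 | 02h21m30s | 1 | 23.2 | 05m53s | 42.1 |
| 8 | 49984 | 669 | 23 | 66 | 01h13m51s | 4 | 29.08 | 03m12s | 34.84 |
| 9 | 397518 | 530 | 18 | 83 | 02h07m27s | 1 | 29.44 | 07m04s | 21.68 |
| 10 | 366860 | 251 | 12 | 62 | 00h28m20s | 6 | 20.91 | 02m21s | 19.35 |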

Ordered by Upload Errors

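| Rank | User ID | Snipe Tags | Snipes | All Uploads | Loiter Time | Upload Errors | Tags/Snipe | Loiter/Snipe | Snipe% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 366860 | 251 | 12 | 62 | 00h28m20s | 6 | 20.91 | 02m21s | 19.35 |
| 2 | 133311 | 43 | 2 | 39 | 00h03m09s | 6 | 21.5 | 01m34s | 5.12 |
| 3 | 49984 | 669 | 23 | 66 | 01h13m51s | 4 | 29.08 | 03m12s | 34.84 |
| 4 | 369231 | 299 | 11 | 23 | 00h40m10s | 4 | 27.18 | 03m39s | 47.82 |
| 5 | 56947 | 80 | 3 | 33 | 00h56m33s | 3 | 26.66 | 18m51s | 9.09 |
| 6 | 357238 | 54 | 4 | 47 | 01h35m23s | 2 | 13.5 | 23m50s | 8.51 |
| 7 | 454478 | 557 | 24 | 57 | 02h21m30s | 1 | 23.2 | 05m53s | 42.1 |
| 8 | 397518 | 530 | 18 | 83 | 02h07m27s | 1 | 29.44 | 07m04s | 21.68 |
| 9 | 30072 | 393 | 23 | 29 | 01h43m42s | 1 | 17.08 | 04m30s | 79.31 |
| 10 | 460797 | 730 | 21 | 39 | 00h38m39s | 1 | 34.76 | 01m50s | 53.84 |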

Caveats: The above data does not account for tags added from implications yet. It also does not necessarily indicate a "snipe" as known in the community, and may instead be just a preference of using the Posts interface over the Uploads interface. Also, IMO sniping is not bad, but it's about as much fun getting sniped in an upload as getting sniped in an FPS ... >_<

Request for comment, I guess... If the above sounds interesting, I'll continue to gather data as well as finishing up the loose ends. If not, I'll just leave this as the only post just as a demonstration.

Ehhh, what are these tables supposed to express?
I don't really see the use of that, since it was discussed long ago (^-^): sniping but tagging afterwards is ok.
And why not the user name? The User ID is pretty much the same, but way more inconvenient (I had to look mine up and don't want to do that for others).

The tables are more of an interest item than anything else, and as I said don't necessarily indicate sniping behavior. I myself sometimes use the Posts interface because it's more convenient. Even if it is sniping though it's not bad IMO, just like snipers in FPS's aren't bad... just sometimes annoying. It's not bad like uploading a post with under 10 tags and then just leaving it like that.

Also, I left the User ID because (a): that's how I collect the data and is one of those loose ends I was talking about, and (b): the above represents only a day and a half of data and therefore not really significant enough to tie someone's name to yet.
