Page MenuHomePhabricator

Feedback on hive table mediawiki_history by Erik Z
Closed, ResolvedPublic

Description

First a thank you to Dan for the introduction you sent by mail. That was a big help.

Three comments on column names:

  1. There are two sets of columns with '_text' in it. revision_text_... and event_user_text_... The first one is about content, which is also called text in the database scheme and in xml files. The second one is called name in the xml file, and also in the field descriptions of mediawiki_history. Dan also talked about event_user_name in his introduction. So maybe this was an overly greedy global edit?
  1. Several fields are called [something]_latest Explanation is e.g. for events_user_blocks_latest "Current blocks of the user etc" I'd use 'current' instead of 'latest' also in the column names, except for 3)
  1. Overriding 2): I expect few users will be interested in the historical value for e.g. user name, user groups, namespace etc. It's good that it exists, but why not make those historical column names a bit harder to type, instead of the columns that 98% of users will be interested in? And again, while doing so making a better match with names in database. So instead of page_namespace and page_namespace_latest, why not use page_namespace (but now for the current value) and page_namespace_[earlier|before|previous|history] for what used to be, but no longer is.

(tbc, I'll do one post per topic)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I see revision_is_deleted, but how about page_is_deleted?

My understanding was that everything was kept in the database forever, so page deletions merely result in a status update for that page in hive? Am I wrong here, or is it stored differently for revisions and pages?

Building on the previous comment (about page deletions):

I queried for a very small set of titles in one wiki in one namespace, so I could compare title for title between wikistats 1 csv file [1] and hive. I chose namespace Help (=12) in afwiki (South African). Three titles [2] only occurred in csv file (but those were created in Sep 2017) to that's explained: I used snapshot 2017-08. One title [3] occurred in hive but not in csv file. That page says the articles has been deleted Oct 2005. Clearly it's still somewhere in the database, but not in the stub or full archive xml files. So Wikistats 1 doesn't know of it. With hypothetical page_is_deleted it could be filtered.

[1] stat1005:/home/ezachte/wikistats_data/csv/csv_wp/CreatesAF.csv

[2]

https://af.wikipedia.org/w/index.php?title=Hulp:IFA/Italiaans
https://af.wikipedia.org/w/index.php?title=Hulp:IFA/Spaans
https://af.wikipedia.org/w/index.php?title=Hulp:IPA/Spaans

[3] https://af.wikipedia.org/w/index.php?title=Hulp:Gebruik_van_kategorieë ('Help:Use of categories')

[4] This page has been deleted. The deletion, protection, and move log for the page are provided below for reference.
11:25, 6 October 2005 Danielm~afwiki (talk | contribs) deleted page Hulp:Gebruik van kategorieë (Walgelijk taalgebruik)
There is currently no text in this page. You can search for this page title in other pages, search the related logs, or create this page."

[5] stat1005:/mnt/data/xmldatadumps/public/afwiki/20170720/afwiki-20170720-stub-meta-history.xml.gz (last dump before Sep 2017)

Question: with deleted revisions still somewhere in the database, as column revision_is_deleted suggests: should these be shielded from the public once this database is opened for public access?

Revisions are usually deleted (rather than blanked) when there is privacy-sensitive content (e.g. a telephone number).

Putting this question more general: some data are dumped to private xml files, which can't be accessed by the public.
Compare stat1005:/mnt/data/xmldatadumps/public with stat1005:/mnt/data/xmldatadumps/private
Are all data in mediawiki_history OK for public ?

There are columns event_user_is_bot_by_name and user_is_bot_by_name, but not event_user_is_bot or user_is_bot. Wouldn't that make sense to have those as well?

There is page_is_redirect_latest, I imagine it could be very useful to also have a field to which page id or page title the redirect goes. For example for combining pageview counts. Not that Wikistats 1 has such, but still..

Another comparison between Wikistats 1 and 2: this time edit counts per user.
I compared edit counts for users with 5000+ edits on af.wikipedia.org, namespace 0.

After analyzing the first run I realized the query was incomplete so I reran, this time ignoring deleted revisions. Still the spreadsheet also shows outcome of run 1, as it demonstrates how significant that filter on deleted revisions influences the results.

I colormarked where Wikistats 1 tables [1][2][3] and Wikistats 2 hive query are differed by 1 or 2 percent, but only for users who never contributed in 2017. This last criterion guarantees that differences between two reported counts have nothing to do with when the data were exactly collected.

On run 2 with the improved query no lines needed coloring: counts for inactive users always differed less than 1%. That's great.

Note: I can imagine API users will want to include deleted revisions in a count, but let it be an conscious and informed decision.
The results will then match less with Wikistats1, which de-facto doesn't report total edits, but rather total edits which weren't total crap or illegal.

[1] https://stats.wikimedia.org/EN/TablesWikipediaAF.htm#wikipedians
[2] https://stats.wikimedia.org/EN/TablesWikipediaAF.htm#sleeping
[3] https://stats.wikimedia.org/EN/TablesWikipediaAF.htm#bots

mforns triaged this task as Medium priority.Apr 19 2018, 4:50 PM
mforns added a subscriber: Nuria.
mforns subscribed.

Hi @JAllemandou :]
(in grosking)
Can you outline what remains to be done here please?

Only thing remaining here that has not been worked is a field about being a bot or not instead of relying on the groups.
We have user_is_anonymous and user_is_bot_by_name - We could add user_is_bot_by_group.
@Milimetric, good for you?

@JAllemandou what is the difference between user_is_bot_by_name and user_is_bot_by_group/

@Nuria: we set user_is_bot_by_name using a regex: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/user/UserEventBuilder.scala#L27
This represents users having a name that looks like a bot.
user_is_bot_by_group would be the equivalent of WHERE array_contains(user_groups, 'bot'), meaning the user is flagged as bot in mediawiki groups system.

Then I think user_is_flagged_as_bot_in_mediawiki_group (please reword) but something longer and more explicit might be better?

the bot_by_name and bot_by_group terminology is the one we decided to use a while ago - Let's see if others have opinion :)

Closing this one, the remaining work was moved to this other task: T219177

I remember discussing this recently, and the idea we had then was to have a single field, something like bot_detected_by which would be a list of name-regex, group, etc.. We figured this would make queries easier to write and allow the values to be more explicit without making the field name itself longer.