Page MenuHomePhabricator

Decide: start_timestamp for mediawiki history
Closed, ResolvedPublic5 Estimated Story Points

Description

We need to make a decision on how we label the first event in a user or page's timeline. Our current thinking is:

Users

In our understanding, a user timeline is coherent (edits happen AFTER creation). Therefore we think users having no creationTimestamp or a creationTimestamp after the user's firstEditTimetamp should be set to firstEditTimetamp. The formula being: eventTimestamp of create-event = MIN(user_registration, date-of-logging-table-create-event, first-edit). This will set both user_creation_timestamp and event_timestamp for the first event in the user's timeline. We'll keep userCreationTimestamp to be MIN(user_registration, date-of-logging-table-create-event), and will also keep userFirstEditTimestamp.

Pages

Pages can be created as partial restores of an older page, with older revisions. So the first edit on a page could happen before the page's creation date without anything being erroneous. Since this is part of how mediawiki works, we'd like to highlight it in the data and use date-of-logging-table-create-event as the page_creation_timestamp. If such a create event doesn't exist, we leave page_creation_timestamp null. And we populate page_first_edit_timestamp so it can be used instead of creation date where needed. If we find some other data that lets us tell the difference between "old restored first edits" and "edit that created the page", then we can add another field to further clarify. About events, timestamp of a page create event is set to the pageCreationTimestamp, even if it is null, to let analysts know that the page creation is unknown (first edit should however be populated). We might take the MIN(pageCreation, pageFirstEdit) when loading in druid, as druid don't accept NULL timestamps though.

This task can be marked resolved if there's general agreement on the above.

Event Timeline

Ping @Neil_P._Quinn_WMF and @nettrom_WMF - I'll move forward with the suggested implementation this end of week to have it tested next week :)

From reading this, it sounds to me like it'll be possible to identify these events for pages because of the difference between the creation and first edit timestamps. What I'm wondering is if this means that we'll have two fields in mediawiki_page_history? In other words, that we'll have page_creation_timestamp, which reflects the creation timestamp per the logging table, and page_first_edit_timestamp, which has the timestamp of the earliest edit for that page?

It also sounds to me like there won't be a straightforward way to identify this for users, in that there's only the user_creation_timestamp. In that case, can the inferred_from field (in mediawiki_user_history) be populated in a way that reflects how the timestamp was set? Then we'll have a relatively straightforward way to identify these.

By the way, I noticed that there's an inconsistency in the documentation between Mediawiki history and Mediawiki page history. The former defines page_creation_timestamp as "creation timestamp of the page", while the latter defines it as "Timestamp of the page's first revision." I updated the latter to be consistent with the former, partly because this task suggests the former is the correct definition.

Thanks for your comment @nettrom_WMF - I should have explained the plan more thoroughtly.
In the next changes for mediawiki-history, we will add fields for pages and users, ending up in having pageCreationTimestamp and pageFirstEditTimestamp coherent by page_id for each page-event, and similarly for users.
the precise definition of how those values is as follow:

  • pageCreationTimestamp - Timestamp of the page-create event in logging table if it exists, null otherwise.
  • pageFirstEditTimestamp - Timestamp of the oldest revision associated to the page (by page_id), whether in revision or archive table.
  • userCreationTimestamp - oldest from user_registration (in user-table) and user-create event in logging table (if both exist, otherwise the existing one if only one exist, otherwise null).
  • userFirstEditTimestamp - Timestamp of the oldest revision associated to the user (by user_id), whether in revision or archive table.

With those different values available, we now wonder which one to use to set the timestamp for the first-event of a page/user lineage (normally a create event, whether faked or really extracted from logging).

Hopefully with those precision it makes more sense :)

Let's please make sure this clarification/explanation appears on the docs.

@JAllemandou : Thanks for clarifying that, very much appreciated!

The description for how this will be handled for pages looks good to me.

When it comes to users, I would expect userCreationTimestamp to reflect the value of user_registration in the user table in MW. Based on your description I suspect there will be some cases where the event from the logging table will be used instead. In those cases, can the inferred_from field in the mediawiki_user_history table reflect this (e.g. by having a value of "logging" or something like that)?

I did a quick analysis using Spark on user data after @nettrom_WMF comment:

import com.databricks.spark.avro._
// user table data
val u = spark.read.avro("/wmf/data/raw/mediawiki/tables/user/snapshot=2019-03").
        select("wiki_db", "user_id", "user_registration")
// logging table data
val l = spark.read.avro("/wmf/data/raw/mediawiki/tables/logging/snapshot=2019-03").
        where("log_type = 'newusers' and log_user is not null and log_user > 0").
        select("wiki_db", "log_user", "log_timestamp")
// joined data on wiki_db and user_id
val j = u.join(l, u("wiki_db") === l("wiki_db") && u("user_id") === l("log_user")).cache()

// Results:
u.count()
//  187805529
l.count()
//  185002057
j.count()
//  185002056

j.where("user_registration is null").count()
//  456950 (0.25% of joined users)
j.where("user_registration is not null and user_registration = log_timestamp").count()
//  116793680 (63.13% of joined users)
j.where("user_registration is not null and user_registration < log_timestamp").count()
//  67674091 (36.58% of joined users)
j.where("user_registration is not null and user_registration > log_timestamp").count()
//  77335 (0.04% of joined users)

In order to be as explicit and precise as we can, I will add a field for users named userRegistrationTimestamp, and we'll have 3 fields:

  • userRegistrationTimestamp - user_registration from user-table
  • userCreationTimestamp - Timestamp of user-create event in logging table
  • userFirstEditTimestamp - Timestamp of the oldest revision associated to the user (by user_id), whether in revision or archive table.

Then in the event-lineage, we'll put the minimum value of those three.

Just my usual plug to ask to please document in wikitech decision taken.

The user part of this task is in testing with the datasource located at hdfs:///user/joal/wmf/data/wmf/mediawiki/history/snaphsot=2019-03 and hdfs:///user/joal/wmf/data/wmf/mediawiki/user_history/snaphsot=2019-03 (along with a bunch of other changes).

In order to be as explicit and precise as we can, I will add a field for users named userRegistrationTimestamp, and we'll have 3 fields:

  • userRegistrationTimestamp - user_registration from user-table
  • userCreationTimestamp - Timestamp of user-create event in logging table
  • userFirstEditTimestamp - Timestamp of the oldest revision associated to the user (by user_id), whether in revision or archive table.

Then in the event-lineage, we'll put the minimum value of those three.

I agree with the overall philosophy of being very explicit and precise in this dataset, but I do still wonder if it's necessary to provide registration timestamps from both user and logging. It's a big deal that those timestamps differ for 37% of users, but how big are those differences? If it's usually just a matter of seconds, then it doesn't seem necessary to pay the complexity cost.

In T220507#5129134, @Neil_P._Quinn_WMF wrote:

I agree with the overall philosophy of being very explicit and precise in this dataset, but I do still wonder if it's necessary to provide registration timestamps from both user and logging. It's a big deal that those timestamps differ for 37% of users, but how big are those differences? If it's usually just a matter of seconds, then it doesn't seem necessary to pay the complexity cost.

I quickly checked the differences, and they seem to be relevant:

spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03").createOrReplaceTempView("uh")

spark.sql("""
SELECT
  log10diff,
  count(1) as c
FROM (
    SELECT
      CAST(
        LOG10(
          UNIX_TIMESTAMP(user_registration_timestamp) - UNIX_TIMESTAMP(user_creation_timestamp)
        ) AS BIGINT
      ) AS log10diff
    FROM uh
    WHERE TRUE
      AND user_registration_timestamp IS NOT NULL
      AND user_creation_timestamp IS NOT NULL
      AND user_creation_timestamp < user_registration_timestamp
) t
GROUP BY log10diff
ORDER BY log10diff
""").show(100, false)

+---------+-----+                                                               
|log10diff|c    |
+---------+-----+
|0        |1101 |
|1        |10246|
|2        |43588|
|3        |20146|
|4        |9424 |
|5        |13976|
|6        |17127|
|7        |14144|
|8        |670  |
+---------+-----+

In the above, log10diff represents the int part of the log10 of the difference between the creation timestamp and the registration timestamp in seconds. For instance, the row 5 |13976 expresses that 13976 users have a creation date between 10000 and 99999 seconds (between ~2h45 and 27h45).

We have not implemented the proposal defined here for page-create event timestamp definition. I let @Milimetric explain (either here or in sync-up meeting, might be easier face to face).

Quick note that we tried to do what we proposed here but it complicated other parts of the data too much. So we reverted to, for now, the following:

  1. use the first revision timestamp as the "create" event/timestamp for any page, even those that have an explicit create event in the logging table
  2. add a new type of event, called "create-page", to differentiate the explicit page creation events available since 2016

We can explain this in more depth if you need, but just keeping the archives happy here and leaving it open for any questions.

Adding a comment:

Quick note that we tried to do what we proposed here but it complicated other parts of the data too much. So we reverted to, for now, the following:

  1. use the first revision timestamp as the "create" event/timestamp for any page, even those that have an explicit create event in the logging table

There are cases when history-conflicts (due to move or delete mostly) prevent us to use first-edit timestamp and keep single-page-title at one given time coherence. In that case, the pageFirstEditTimestamp value differs from the first create event of the page as we try to keep single-title coherence.

  1. add a new type of event, called "create-page", to differentiate the explicit page creation events available since 2016

We can explain this in more depth if you need, but just keeping the archives happy here and leaving it open for any questions.

Quick note that we tried to do what we proposed here but it complicated other parts of the data too much. So we reverted to, for now, the following:
[...]

  1. add a new type of event, called "create-page", to differentiate the explicit page creation events available since 2016

Thank you for the update! So, now there will separate create and create-page event types for the page event entity? That seems like a sensible solution, although I feel like the meaning would be clearer if the event types were first-revision and create.

well, in most cases the "create" event is actually a create event, just not explicitly logged that way in the logging table because that type of event wasn't added until 2016. So for the vast majority of "normal" pages, their first revision is their creation.

Nuria set the point value for this task to 5.