
[Analytics] Items that contain a sitelink to one of the Wikimedia projects over time
Open, Needs Triage, Public

Description

Purpose

As Wikidata Product Managers, we would like to understand how different segments of Wikidata's data developed over time, so we can inform our projections.

Scope

  • How did the number of Items of the following types develop over time?
    • A) Items that contain a sitelink to one of the Wikimedia projects (e.g. about a notable person)
    • B) Items that are needed to build A (used in A Items for example in a statement or reference; e.g. the non-notable father of that notable person)
    • C) All other Items

Desired output

  • Weekly stats of the number of Items in categories A, B, and C

Acceptance criteria

  • Current numbers for A, B and C to verify the approach
  • Historic weekly data for A, B, and C (for all snapshots that are available)
  • Weekly data process to capture new data
    • Waiting on confirmation to transfer notebook Spark SQL to jobs codebase
  • Public output of data in the form of a table (and ideally diagram)
    • Ready within the already written DAG
    • Diagram not included here as it's blocked by current infrastructure

Information below this point is filled out by the Wikidata Analytics team.

General Planning

Information is filled out by the analytics product manager.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: 3 days
Actual: ~4 days (scope change)

Sub Tasks

Full breakdown of the steps to complete this task:

  • Explore wmf.wikidata_entity
  • Derive method of determining items in Population A
  • Derive method of determining items in Population B
  • Combine Population A and B methods into a single query that allows for the creation of Population C
  • Write and test DAG
  • Write and test DAG jobs
  • Deploy and check DAG

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • wmf.wikidata_entity (weekly entity dump; the source for the populations above)
  • wmf.mediawiki_wikitext_history (discussed below for the historical backfill)

Notes and Questions

Things that came up during the completion of this task, questions to be answered, and follow-up tasks:

  • Note

Event Timeline

@AndrewTavis_WMDE asked me for some thoughts/suggestions here :)

I started typing out a DM reply but decided some of this stuff would be good to have on public record.

> it's not normal that snapshots go back a decade plus, so I'm a bit confused on this

The way that MediaWiki and Wikidata snapshots work – and have to work, due to the nature of the data – is that they are snapshots in time of EVERYTHING at the time of the snapshot generation. This is why even wmf.edits_hourly (or whatever that table is called) can contain counts of edits made in April even though the latest snapshot is '2024-04' – it's indiscriminate of the timestamps associated with any of the data.
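
To make the snapshot semantics concrete, here is a minimal PySpark sketch against wmf.mediawiki_history (assuming its standard snapshot, wiki_db, event_entity, event_type and event_timestamp fields); it is an illustration of the concept, not a query from this task:

```lang=python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single snapshot partition returns edit counts for every month of history,
# so the partition value only says when the copy was taken, not which edits
# it contains. The snapshot value here is illustrative.
spark.sql("""
    SELECT substr(event_timestamp, 1, 7) AS edit_month,
           count(*)                      AS edits
    FROM wmf.mediawiki_history
    WHERE snapshot = '2024-04'
      AND wiki_db = 'wikidatawiki'
      AND event_entity = 'revision'
      AND event_type = 'create'
    GROUP BY substr(event_timestamp, 1, 7)
    ORDER BY edit_month
""").show(200)
```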

I think 3-4 snapshots back is probably a good number of snapshots to keep, because it does enable us to investigate odd discrepancies between snapshots (T355182) – beyond the state change problem. The challenge with this data that you may have come across is that the state of things (whether an edit got deleted or reverted, whether a user is labelled as a bot or not) changes over time, so the same edit or the same user made years ago can be categorized differently from snapshot to snapshot.

Ultimately, any metric that is calculated from data which can change state is going to be subject to drift whenever a static measurement is stored anywhere. We actually run into this problem with the key result for FY23-24 Wiki Experiences Objective 1.1 (Superset dashboard), which aims to increase the number of unreverted (and undeleted) mobile contributions to articles on Wikipedia by 10%.

Throughout March 2024 – when the '2024-02' snapshot was used – the metric for the KR was at 4.7%. Then, when the '2024-03' snapshot was generated (at the beginning of April), the February value of that metric changed to 4.4% – because the state of the edits made in February had changed. The dashboard uses the most recently available snapshot and has no memory of the values of the metric based on previous snapshots. If we were to store a value in a spreadsheet or a report and then, 1+ snapshots later, compare the dashboard to the spreadsheet/report, there would be a discrepancy.

There's no getting around it – it's natural, and folks who work with or look at these metrics need to become comfortable with that concept. There are some things we can do to improve the stability (decrease the snapshot-to-snapshot variability) of the metric, but they won't address the problem entirely. Like, we could (and should) impose "not reverted within the first 48 hours" as opposed to the current "not reverted at the time of the snapshot", but the deletion of edits and whether a user is considered a real editor or a bot are going to change snapshot-to-snapshot, and dealing with those would be extremely painful.
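
As a hedged illustration of that windowed definition (assuming the wmf.mediawiki_history revert fields revision_is_identity_reverted and revision_seconds_to_identity_revert; this is not the exact KR query), the 48-hour rule could look like this:

```lang=python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Count edits that were NOT reverted within 48 hours. Unlike "not reverted at
# the time of the snapshot", this definition stops changing once an edit is
# more than 48 hours old. The snapshot value is illustrative.
spark.sql("""
    SELECT count(*) AS unreverted_within_48h
    FROM wmf.mediawiki_history
    WHERE snapshot = '2024-04'
      AND event_entity = 'revision'
      AND event_type = 'create'
      AND NOT (revision_is_identity_reverted
               AND revision_seconds_to_identity_revert <= 48 * 3600)
""").show()
```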

I won't evaluate the listed metrics but I will recommend asking yourselves the following for each metric:

  • Can we backfill this? Can we re-compute the history of this metric given a snapshot?
  • Are we comfortable re-computing the entire history of this metric with each new snapshot?
  • Will we be reporting this metric anywhere else and would it be a problem if what we reported in the past and what we report in the future differ?
  • Are we comfortable calculating the value of the metric only once and storing that somewhere that we call "source of truth" for measurements of this metric going forward?
    • For example, you calculate the value of metric A for April 2024 (using the March 2024 snapshot) and hold on to that value, because once the March 2024 snapshot is deleted, any re-calculation of metric A for April 2024 using a later snapshot will result in a different value.

Hi @mpopov, thank you for your input!

This confirms what I mentioned already, @AndrewTavis_WMDE: for a similar metric, our legacy systems were set up to re-compute the entire history with each new snapshot. This would be the easiest solution in this case as well. To save computing resources, we could also use only the newest snapshot to add the most recent data points. As a result, the historic data (where no snapshots are available anymore) would follow a slightly different definition, but this seems OK here for now.
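
A possible shape for that append-only approach, assuming the results land in a table like the wmde.wd_item_sitelink_segments_weekly one mentioned further down (the name is used here only for illustration): only snapshots that are still available in wmf.wikidata_entity and not yet present in the output table would be computed.

```lang=python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snapshots that the weekly dump still holds but that have not yet been
# loaded into the (illustrative) output table; only these would be computed,
# leaving the already-stored historic rows untouched.
missing = spark.sql("""
    SELECT DISTINCT snapshot
    FROM wmf.wikidata_entity
    WHERE snapshot NOT IN (
        SELECT snapshot FROM wmde.wd_item_sitelink_segments_weekly
    )
    ORDER BY snapshot
""")
missing.show(truncate=False)
```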

Thanks for all of the information, @mpopov!

I talked this over in my bi-weekly with @JAllemandou, and would like to bring some further context to this particular situation :)

The go-to table for this would be wmf.wikidata_entity, for the following reasons:

  • It has the sitelinks column for Population A above
  • It has the claims column for Population B above

It thus has everything we need for future data for the given task. One change to the output, though, would be the frequency of the DAG: wmf.wikidata_entity is a weekly data dump, so it'd make sense to do a weekly DAG. If we still want to do a monthly job, then the best option would be a DAG that runs on the first Monday of every month (the docs for wmf.wikidata_entity mention the 2020-01-20 snapshot, which was a Monday).
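
A rough sketch of how the three populations could be counted from wmf.wikidata_entity, assuming the documented layout (sitelinks as an array; claims[].mainSnak.dataValue.value holding a JSON string with an "id" field for wikibase-entityid values). Only item references in main statement values are followed here; qualifiers and references would need the same treatment, and the actual notebook/DAG query may differ:

```lang=python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Output column names mirror the counts reported later in this task.
# The snapshot value is illustrative.
segments = spark.sql("""
    WITH with_sitelinks AS (      -- Population A
        SELECT id, claims
        FROM wmf.wikidata_entity
        WHERE snapshot = '2024-04-15'
          AND typ = 'item'
          AND size(sitelinks) > 0
    ),
    linked_from_a AS (            -- items referenced from A's main statements
        SELECT DISTINCT get_json_object(claim.mainSnak.dataValue.value, '$.id') AS linked_id
        FROM with_sitelinks
        LATERAL VIEW explode(claims) c AS claim
        WHERE claim.mainSnak.dataValue.typ = 'wikibase-entityid'
    )
    SELECT
        sum(IF(a.id IS NOT NULL, 1, 0))                          AS items_with_sitelinks,
        sum(IF(a.id IS NULL AND b.linked_id IS NOT NULL, 1, 0))  AS items_items_with_sitelinks_link_to,
        sum(IF(a.id IS NULL AND b.linked_id IS NULL, 1, 0))      AS all_other_items
    FROM wmf.wikidata_entity e
    LEFT JOIN with_sitelinks a ON e.id = a.id
    LEFT JOIN linked_from_a b  ON e.id = b.linked_id
    WHERE e.snapshot = '2024-04-15'
      AND e.typ = 'item'
""")
segments.show()
```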

Now we get to the question of the historical data... This is a situation that cannot be solved via the Data Lake at this time, given the currently available tables/partitions. As mentioned on Mattermost: we currently do not have Wikidata as a partition within wmf.mediawiki_wikitext_history, so we do not have historical versions of Wikidata items with which we'd be able to rebuild the history. The assumption we're making is that the legacy version of these metrics was built using wmf.mediawiki_wikitext_history at a time when Wikidata was still an available partition. Wikidata was removed from the wmf.mediawiki_wikitext_history dump process as of the 2024-02 snapshot – see T357859, where ~12 of the 25 days of dump generation went to the Wikidata XML dump, which was slowing down metrics delivery for WMF Movements Insights.
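
For reference, a metadata-only way to check which wiki_db partitions a given wmf.mediawiki_wikitext_history snapshot still contains (the snapshot value is illustrative):

```lang=python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Lists the partitions of one snapshot without scanning any data;
# wikidatawiki is expected to be absent from snapshots since 2024-02.
spark.sql(
    "SHOW PARTITIONS wmf.mediawiki_wikitext_history PARTITION (snapshot='2024-03')"
).show(1000, truncate=False)
```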

Steps forward on this:

  • I'll begin work on a DAG based on wmf.wikidata_entity, as even if we do get a Wikidata partition within wmf.mediawiki_wikitext_history, it would not be used for recent data updates
    • Are we fine with a weekly DAG?
  • A decision needs to be made on whether WMDE is requesting that Wikidata data again be an output of the wmf.mediawiki_wikitext_history snapshot creation process
    • The preferred solution here would be not to revert the changes from T357859, but rather to make a new job that adds a new partition to the table via the Wikidata XML dump
    • The reason for this is to ensure that WMF Movements Insights can maintain the current speed of delivery
    • @JAllemandou has said that bringing the Wikidata partition back is fine if we need it (again, preferably in the above way)
  • If the request is being made, a new task should be made for it
  • We'd then do what I'd argue would be a separate task, whereby the new wmf.mediawiki_wikitext_history Wikidata partition would be used to recompute the historical populations above

Let me know what your thoughts are on the above!

Another note on this: if we don't expect to need a Wikidata partition of wmf.mediawiki_wikitext_history for other tasks, then we could work directly from the XML dump for the historical backfill. We wouldn't be able to leverage PySpark on the cluster for the querying though, so I worry about how long all of this would take... It looks like PySpark can be used directly on the XML, but then this would be running on my local machine.
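
For completeness, a local-machine sketch of reading such a dump with PySpark via the third-party spark-xml package; the package version and file path are assumptions, and nothing here has been benchmarked:

```lang=python
from pyspark.sql import SparkSession

# Local-machine sketch only; the spark-xml version and dump file name
# below are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("wikidata-xml-dump-exploration")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0")
    .getOrCreate()
)

# Each <page> element becomes one row; the nested <revision> elements end up
# as a nested column, which is what would be needed to rebuild item history.
pages = (
    spark.read.format("xml")
    .option("rowTag", "page")
    .load("wikidatawiki-20240401-pages-meta-history1.xml.bz2")
)

pages.printSchema()
pages.select("title").show(5, truncate=False)
```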

Thank you for digging into this:

> I'll begin work on a DAG based on wmf.wikidata_entity

Sounds good to me! I changed the description accordingly.

> Are we fine with a weekly DAG?

Sure!

About the missing revision history:

  • Did I understand correctly that we do not have any kind of complete edit history for Wikidata on the Data Lake? If so, we will need to find a solution for this, as my assumption is that we will need this kind of information for other use cases as well. What you found out about potential solutions will be helpful. Still, if this needs to be newly implemented in any case, the first step should be finding out what exactly we need. Could you create a separate task for this project?
  • It would help to understand what options this currently leaves us for accessing historic revisions: to my knowledge, the dumps for Wikidata stopped including a complete page history at some point. Can you please check what the current situation is? What other access paths do we still have? Only the live API and MariaDB?

See T363451: Add job to create Wikidata partition to wmf.mediawiki_wikitext_history for the task about bringing back the partition (hopefully via another job). I added a bit about whether we want to maybe turn this job on when WMDE needs historical data. Let me know what you all think on that :)

Ok, so the new numbers after the change in scope, for the most recent (2024-04-15) snapshot, are:

items_with_sitelinks: 32,231,861
items_items_with_sitelinks_link_to: 2,980,388
all_other_items: 72,910,679

For documentation, the numbers for the original Population B definition, for the earliest available (2024-02-26) snapshot, were:

items_with_sitelinks: 31,978,738
linked_to_items_with_sitelinks: 75,221,879
all_other_items: 242,565

Status on the rest of this:

  • The weekly DAG is written and also includes an export to the published datasets repo
    • I've also included the work for T361203 in this
  • We need to confirm the numbers above and the method that generates them
  • I'll then rewrite the DAG job that runs the query
  • For testing, I'll need the table wmde.wd_item_sitelink_segments_weekly to be created in HDFS by an admin, and then we can go into production (a possible shape for the table is sketched after this list)
  • This should all be done by Tuesday/Wednesday evening after I'm back in a few weeks, depending on folks' availability
  • I'll make a new task for the historic data generation process, which will depend on T363451
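
For the table creation request, an illustrative shape for wmde.wd_item_sitelink_segments_weekly (the real DDL is whatever lands in the merge request):

```lang=python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns mirror the three reported counts; partitioning by snapshot keeps
# one row per weekly wmf.wikidata_entity dump. Illustrative only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS wmde.wd_item_sitelink_segments_weekly (
        items_with_sitelinks               BIGINT,
        items_items_with_sitelinks_link_to BIGINT,
        all_other_items                    BIGINT
    )
    PARTITIONED BY (snapshot STRING)
    STORED AS PARQUET
""")
```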

Note that MR#700, which contains the work for this, has been opened :)

Manuel renamed this task from [Analytics] Segments of Wikidata's data over time to [Analytics] Items that contain a sitelink to one of the Wikimedia projects over time. (Tue, Jun 4, 12:23 PM)
AndrewTavis_WMDE changed the task status from Stalled to Open. (Thu, Jun 6, 12:30 PM)

Unstalled as the table has been created :)