Page MenuHomePhabricator

[Analytics] LOD health metrics 2023 for Wikidata as a notebook
Closed, ResolvedPublic

Description

Request

Copying the task for this milestone from the epic task:

Milestone 2: Report for metrics NOT requiring webrequest aggregation

  • First test report (in the format specified in the reporting template)
    • number of active editors (5 edits per month)
    • number of admins (interface-admin)
    • number of bureaucrats
    • number of property creators
    • number of items per active editor
  • The presented information should be ready for "copy & paste" to be used in the respective reports.
  • For the full report we'd also like the % change from the previous year and all data from the previous quarter

Assignee Planning

Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.

Sub Tasks

Full breakdown of the steps to complete this task:

  • Define all tables needed and add them to the ticket
  • Make baseline connections to the given tables above and explore
    • Including this as this is an initial task for me
  • Define the following:
    • Methods used:
      • Total sum in the period
      • Percentage change (yearly)
    • Conditions for each group:
      • How do we do retroactively check if the user had a certain permission at a given time? snapshot?
        • Prior data exists in a preaggregated form, so we'd use that
        • Aside from this the logs can be used to determine user rights/right changes at a given time
      • Are admins and bureaucrats also active? When do they not count anymore?
        • As of now they are still counted as an admin, but we might want to think about this for product metrics
      • How to exactly differentiate property creations and the edits we want?
        • Are we including edits to user pages?
        • As of now including everything
      • How to define an active editor over a quarter? Do we switch to 5 edits/30 days?
        • We need to look into this further, but right now it's on a given day based on 5 edits over the last 30 days
      • Are there conditions to consider for when an item is counted? Like for Wikipedia we might not count stubs?
        • For Wikidata there's no distinction at all, and for Wikipedia we'd also just count them
    • Time frame for aggregation:
      • Current time that notebook is ran
      • When would the notebook normally be ran by in case we're doing a cron job (for later planning)?
        • Ideality would be self service at time of stakeholders choosing
        • Current consideration is Grafana depending on data access constraints
  • Define output and check with stakeholders (@Manuel)
    • Aggregates for each group/value at the end of the time frame in a dataframe
      • Rows:
        • One row per quarter
      • Columns:
        • One column per group/value
        • After each column, one column for yearly retrospective percentage change
    • Are we including plots?
      • For right now no
  • Report the numbers for June 2023, May 2023 and if possible April 2023
    • June 2023: 13213
    • June 2023 (filtering out an ip): 7997
    • May 2023: 141
    • April 2023 (from the 11th): 87

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

  • The active users stats sql should likely include an upper bound so that calculations are not dependent on the time that the script is ran in the morning
  • Create and link a Phabricator task to fix this
  • Open question of "quarterly aggregation" vs. "last month of the quarter"
    • For now we're doing last month of the quarter
    • @AndrewTavis_WMDE is going to check if data is still available for April
    • We'll report the number of unique user_agents for May and June at least just for an overview
    • Focus will then switch to getting this automated with Airflow

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a subscriber: Aklapper. ยท View Herald TranscriptMay 22 2023, 1:59 PM

@Manuel, could I get some feedback on the following point for the planning above:

For now we just want the values at the time that the notebook is ran, but for quarterly reporting/cron job purposes we'd need a date that this would be ran at. 20th of the last month of the quarter? Just checking so I can keep this in mind.

Checking the reporting template for more information on the formatting of the output and will update the ticket once I'm done with that ๐Ÿ˜Š

Hi Andrew, for now, the main goal with this is still testing the different tools that we have available. So for now I would use Jupyter notebooks to test our workflow around that particular tool. So let's assume for now that we are running the reports right after the quarter has ended.

That being said, in the end, it is possible that we might settle with some other tool/solution, e.g. a live dashboard optimized for reporting. So we are basically still playing with bricks here and not building with cement yet. ^^

@Manuel, checking how to apply a definition of active editors in this case. We say that it's five edits in a month and specifically not a 30 day period, which makes this a bit more easy. From the perspective of a quarter (again so I can keep it in mind), would someone be counted as being an active editor if they had been an active editor in any of the months in the quarter?

I'm assuming that this is the case, but also wanted to check to make sure that for periods of more than 30 days we'd check to see if they'd edited five times in 30 days in that period rather than in any given month in that period.

AndrewTavis_WMDE updated the task description. (Show Details)
AndrewTavis_WMDE updated the task description. (Show Details)
AndrewTavis_WMDE updated the task description. (Show Details)
AndrewTavis_WMDE updated the task description. (Show Details)

Great questions, let's discuss this in out session!

Notes for myself:

  • aggregation
  • sanitization

Active editor notes via conversation with @Manuel:

  • There is a standard for active editors: "Registered, non-bot editors who have made five or more edits" is the specific definition. On a daily basis we're using a 30 day rolling average. What this would be on a monthly and quarterly basis is still not defined.

@Manuel, I'm getting the baseline query of all the items on Wikidata working well, but the issue is that I'm getting too many results based on the Wikidata statistics page. I think that the results from mediawiki_page need to be subsetted based on page_content_model, with the general idea being that I'd only query pages with the following content models:

  • wikibase-item
  • wikibase-property
  • wikibase-lexeme

The other options in this column are:

  • wikitext
  • flow-board
  • EntitySchema
  • javascript
  • css
  • Scribunto
  • sanitized-css
  • json

Does the above distinction make sense to you?

Hmmmm and it's still overestimating given a subset on the above mentioned three ๐Ÿค” Currently getting 107,609,517 instead of 103,384,803.

Even with just wikibase-item we're getting 106,507,044 ๐Ÿค” So there's some other kind of subsetting that needs to happen apparently.

Current query:

SELECT 
    COUNT(DISTINCT page_id) AS total_pages

FROM 
    wmf_raw.mediawiki_page
    
WHERE 
    wiki_db = 'wikidatawiki'
    -- wikibase-item, wikibase-property, wikibase-lexeme
    AND page_content_model LIKE 'wikibase%'

Does the above distinction make sense to you?

It may be better to go by page_namespace (we only want the content pages and not the discussion pages).

Also, you could check what role redirects play. It would be great to see how Wikistats generates these metrics.

So in looking at namespaces you mean we'd also subset by page_namespace = 0 for the main namespace?

And beyond that we'd also want page_is_redirect = 0?

Edit: = false as the documentation was wrong and it's a boolean.

I'm down to 103,612,928 with the value on the stats page again being 103,384,803 :)

The namespaces are project specific. You can find the the namespaces and IDs for Wikidata here: https://www.wikidata.org/wiki/Help:Namespaces#Table_of_Namespaces

And beyond that we'd also want page_is_redirect = 0?

I do not know how the official metrics are defined without seeing their SQL queries. Wikidata itself seems to do a split by that variable:
https://phabricator.wikimedia.org/diffusion/ADES/browse/master/src/wikidata/site_stats/sql/select_pages_by_namespace.sql

So for reverse engineering, you could compare the numbers. ;)

We seem to be on the right path! :)

Generally in terms of content though we want wikibase-items, wikibase-properties and wikibase-lexemes for this, correct?

We seem to be on the right path! :)

๐Ÿ™Œ

Accounting for namespaces brings us back up as we need 0 for main content, 120 for properties and 146 for lexemes.

Current query:

SELECT 
    COUNT(DISTINCT page_id) AS total_pages

FROM 
    wmf_raw.mediawiki_page
    
WHERE 
    wiki_db = 'wikidatawiki'
    -- wikibase-item, wikibase-property, wikibase-lexeme
    AND page_content_model LIKE 'wikibase%'
    AND (
        page_namespace = 0
        OR page_namespace = 120 
        OR page_namespace = 146
    )
    AND page_is_redirect = false

Current value: 104,702,674

If this is about the "number of items per active editor" then it's only about Items. If this would have said entities instead, we would also include the other types of entities.

My assumption would be that you can skip the page_content_model LIKE 'wikibase%' for the same result (but with better efficiency).

Thanks for the tip! Ya I was just using that at the start as I wasn't aware of namespaces :)

We're back to the 103,612,928 from before then :) Let me know if there are any other ways of subsetting that you can think of.

Are you looking for the same point in time?

For page_namespace = 0 the following should currently be 103,391,050 if executed on the Wikidata db.

SELECT
  page_namespace AS namespace,
  page_is_redirect AS redirect,
  COUNT(*) AS count
FROM wikidatawiki.page
WHERE page_namespace = 0
OR page_namespace = 1
OR page_namespace = 120
OR page_namespace = 146
OR page_namespace = 640
GROUP BY page_namespace, page_is_redirect

The numbers from above were sourced from the table wmf_raw.mediawiki_page. What are you using to query? I can't access wikidatawiki.page via Presto.

@Manuel, current query is:

SELECT 
    COUNT(DISTINCT page_id) AS total_pages

FROM 
    wmf_raw.mediawiki_page
    
WHERE 
    wiki_db = 'wikidatawiki'
    -- Main namespace for wikibase-items.
    AND page_namespace = 0
    AND page_is_redirect = false

My assumption would be that we need to use page_namespace = 0 and page_is_redirect = False. Using the April snapshot, this gives us 101.785.388 which is very close to what we get on the Grafana data (101.777.563 for 30. April 2023). So I believe this number should be fine.

query = """
SELECT 
  page_namespace AS namespace,
  page_is_redirect AS redirect,
  COUNT(*) AS count
FROM 
    wmf_raw.mediawiki_page
WHERE 
    wiki_db = 'wikidatawiki'
    AND snapshot = '2023-04'
    AND
        (page_namespace = 0
        OR page_namespace = 1
        OR page_namespace = 120
        OR page_namespace = 146
        OR page_namespace = 640)

GROUP BY page_namespace, page_is_redirect
"""
df = wmf.presto.run(commands=query)
df.head()

gives me

	namespace	redirect	count
0	120	False	10977
1	146	False	1077831
2	0	True	4031149
3	0	False	101785388
4	1	True	52
5	146	True	13533
6	1	False	36683
7	640	False	367

Ultimately what needs to happen for the page data query:

  • The notebook needs to be changed over to a process that assumes that it is being ran at the start of the next quarter
  • We need to query information for specific snapshots only and not for all pages
    • This has been causing an overestimate because of the inclusion of pages that are deleted in recent snapshots given that we're querying based on DISTINCT page_id
  • I also need to add the information for one year ago (depending on availability)
    • Generally it's looking like it's not available (admins and pages returned 0)
  • I need to calculate values for active users
    • This would be active users on the last day of the quarter looking retrospectively at the last 30 days
  • Property creators needs to be shifted such that it's those users who have the right to make properties, not those who have made properties
  • I need to make the final table by doing the needed arithmetic

@Manuel, the notebook's generally all ready. I'm not getting data results for one year ago for pages or admins for the current tables (I'm assuming this is the case for other rights metrics as well). We can check the output tomorrow and in that time I'll do some more deep dives into the data to see if I can figure out where more long term data is kept :) Looks like we have a fair number of people in SWE Data now (๐Ÿ™Œ๐Ÿ™Œ), so we can also check in there after a discussion tomorrow.

Note for a task to be done:

  • I need to change the active editors query to be only based on an end date where the start date would be 30 days before. I'm unsure how exactly to do this in Presto.

The following is the current rendition of the quarterly reporting notebook. As the general plan is to shift to Grafana, I'd say that we're fine for now with this file. We can chat about this tomorrow, @Manuel!

Note that the above notebook is just aggregates and the queries used to get them. In those regards I felt like it was ok to share it via Phabricator for documentation of this task.

Hi Andrew, if it's only code and non-sensitive aggregates it seems okay to document our work in progress here for now. We should however work on setting up a proper repository instead to reduce the risk of accidents.

@Manuel, the numbers have been updated in the description for the task :) The discrepancy is pretty dramatic, so I'm unsure if you have some info on why that could be or if we can attribute it to an overall adoption trend. Happy to look a bit more into it to see where the new traffic is coming from :)

I'm looking to see if the info from April is there. Will send it along tomorrow if it's available ๐Ÿ™ƒ

Yes, please look into this. The list of user agents will help us understand.

Manuel renamed this task from Wikidata Quarterly Reporting Milestone 2 Aggregations to Looking into the LOD health metrics for Wikidata, 2023 .Jul 6 2023, 12:50 PM
Manuel renamed this task from Looking into the LOD health metrics for Wikidata, 2023 to [Analytics] LOD health metrics 2023 for Wikidata as a notebook.

We have successfully experimented with a notebook for reporting. We will investigate the discrepancies in the results from different sources at a later time. I am therefore closing this task.

@AndrewTavis_WMDE Could you please move the REST API numbers to T334558: [Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023?