Page MenuHomePhabricator

Create reportupdater reports that execute SDC requests
Closed, ResolvedPublic

Description

These queries: https://phabricator.wikimedia.org/T238878#5692516 should be productionized as reportupdater reports either from data from the replicas or from tables newly scooped hadoop

Assigning to @Milimetric to get this work started, please coordinate with @mpopov

Event Timeline

Yay, I get to work with @mpopov :) Ok, questions:

  • how often should this report be updated?
  • is it exactly that query? This task mentions "queries" plural, just making sure
  • given the confusion about deletion (T238878#5706835), should we also count stuff from the archive table?

Yay, I get to work with @mpopov :)

Aw, I feel likewise! :D

  • how often should this report be updated?

I think for the intended purpose a monthly granularity is fine since the check-ins have in the past been quarterly or every 6mo. Even if the query takes like 35 minutes to run on unsqooped data, would it be okay to schedule it to run daily or weekly?

  • is it exactly that query? This task mentions "queries" plural, just making sure

It's starting to look like the query in T238878#5708511 is the one that should be used?

  • given the confusion about deletion (T238878#5706835), should we also count stuff from the archive table?

I don't think deleted files should be counted, no.


I think the end result should be, ideally, a daily-granularity data source in Turnilo/Superset having:

  • total count of files on Commons
  • total count of files on Commons having structured data (per query in T238878#5708511)

This would enable @Abit & @Ramsey-WMF to track progress of SDC over time in a dashboard as (1) an absolute, and (2) relative % (via post-aggregation in Superset) in Superset (esp. since that also has periodicity like YoY built in, which would be useful for them).

Would have to be careful with the auto aggregation, though. The metrics would need to be specified as, like, longMax instead of longSum.

@Milimetric: do you have a destination in mind for the reports? I guess the MVP is just a CSV in /srv/published-datasets and we can figure out next steps later so this task's scope doesn't blow up, or do y'all have an easy pipeline/process for running reportupdater and ingesting the output into Druid?

Some alternatives: superset can source data from other places than druid and we have couple dashboards on top of some tables in staging. This might not be the best option as reportupdater produces tsvs rather than inserting data into staging again. Druid is good for cubes but this is a simple metric that has 1 dimension so with the tsvs in mind would be easy to have a dashiki dashboard that would look similar to: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os

The dashboard above reports "ratios per browser", if we have 2 metrics:1 ) "total_number_of_files" and another with 2) total_number_of_files_with_sdc_data/total_number_of_files

This dashboard will report ratios of presence of SDC in commons, absolute numbers of files can be reported as well but as both of you know those are not that useful

Bless y'all's hearts for setting this up for us ♥♥♥

mforns triaged this task as High priority.Dec 5 2019, 6:05 PM
mforns moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Let's pause this work as it turns out as there is a parallel effort happening , @Abit to create a ticket for ongoing work

It looks like we are going to have to report this number on the tunning session so taking back my comment above, let's proceed.

@Nuria where will this number be reported in the tuning session?

@Abit: Numbers about SDC will be reported in the platform evolution slides.

@Abit: it's still not entirely clear which query from T238878 @Milimetric should productionize in this ticket.

From my conversation with Kate, it seems like your team wants to use the 7.8M number from the Lua-populated table using the query from T238878#5683048, but there's also an overwhelming support for the query in T238878#5708511 which yields a count of 3M? I've pointed out the problems of missing data and quality in general in the Lua-populated table, so I'm not sure if that's the one you want to go with.

Can you or @matthiasmullie please confirm exactly which query should be used?

@Abit: The queries that report the 7.8 million include , per @matthiasmullie comment both Wikidata Items and Mediainfo items. We can help calculate the percentage of each but from numbers thus far it seems that of those 7.8M items more than half are Wikidata items? (need to get more precise numbers on those)

Please see my comment on https://phabricator.wikimedia.org/T238878#5726624 Seems like the 7.9 million items are from contributions of wikidata alone.

I'm not really sure what number we want to go with, but I can probably help clarify what kind of data is in which db tables (and what numbers derived from those actually mean)

So, it seems we have 2 completely separate definitions of "structured data":

  1. MediaInfo entities created for file pages, with captions and/or statements
  2. Existing "entities" (Wikidata or MediaInfo) pulled in via Lua to enrich (file) pages (e.g. given that we know the artwork, we can pull in author information etc.)

And possibly, since IMO both are a valid definition of "structured data":

  1. A combination of both: files with either a MediaInfo entity and/or other (Wikidata/MediaInfo) entities' information included via Lua

1. MediaInfo entities created for file pages, with captions and/or statements: 3 082 976

We can query for it by counting the amount of 'mediainfo' slots & excluding deleted pages & empty content:

SELECT COUNT(DISTINCT page_id)
# page excludes deleted pages (which are in archive)
FROM page
# joining on page_latest - we only care about most recent
INNER JOIN slots ON slot_revision_id = page_latest
# mediainfo slot must contain actual content
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';

AFAICT, there is no easy way to break up stats about MediaInfo entities more (e.g. "how many have only captions", "how many have X number of statements", ...) - not with a simple query on the raw data anyway.
The entity's actual data is in external store (just like any other wikitext page's content) in the form of a JSON blob (so would have to be deserialized)

2. Existing "entities" (Wikidata or MediaInfo) pulled in via Lua to enrich file pages: 7 936 829

That data lives in wbc_entity_usage, which is a table much like categorylinks and templatelinks.
It does not contain all existing entities (and their labels etc), it's just a place to store the relationship of entities and other pages these entities are used on (via Lua)
See T231952#5717638 & Wikibase/Schema/wbc_entity_usage on mw.org for more details on this table and what data is holds.

Something like this should give us the total amount of file pages that are including Wikidata/MediaInfo data via Lua (overwhelming majority is Wikidata entities - it's not been possible to fetch MediaInfo entities via Lua until a month ago)

SELECT COUNT(DISTINCT page_id)
# page excludes deleted pages (which are in archive)
FROM page
INNER JOIN wbc_entity_usage ON eu_page_id = page_id
# only include file pages and non-sitelink usage
WHERE page_namespace = 6 AND eu_aspect != 'S';

Note that we might want to exclude Wikibase sitelinks usage (recorded in this table with eu_aspect = 'S').
While it's useful linking of data, I'm not sure these should be considered "structured data" usage.
Anyway, their usage is negligible anyway: excluding those (eu_aspect != 'S') returns a result of 7 935 849 (or only 980 files that *only* have a wikidata sitelink)

It is somewhat possible to break down stats further (e.g. how many are including a label vs how many are including statements), but not in too much detail (e.g. L<lang id> would be unreliable to figure out how much it's used in a particular language, because that language could also be covered by L

FYI: there are about ~3M more (non-file) pages with structured data via Lua, mostly category pages.

3. A combination of both: 10 440 129

This is pretty much just a combination of the above numbers, except that there is some overlap: entities could have MediaInfo entities as well as other data from Wikidata (and having MediaInfo entities is likely going to make it easier to fetch other related info, so I'd expect overlap to grow over time)
Currently, there’s an overlap of (only) ~0.58M files (that have both an own MediaInfo entity, and existing Wikidata/MediaInfo usage via Lua)

A simple union for both of the above queries should get us that data:

SELECT COUNT(*) FROM (
	SELECT DISTINCT page_id
	FROM page
	INNER JOIN slots ON slot_revision_id = page_latest
	INNER JOIN content ON slot_content_id = content_id AND content_size > 122
	INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'
	UNION
	SELECT DISTINCT page_id
	FROM page
	INNER JOIN wbc_entity_usage ON eu_page_id = page_id
	WHERE page_namespace = 6
) AS t;
SELECT COUNT(*) FROM (
	SELECT DISTINCT page_id
	FROM page
	INNER JOIN slots ON slot_revision_id = page_latest
	INNER JOIN content ON slot_content_id = content_id AND content_size > 122
	INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'
	UNION
	SELECT DISTINCT page_id
	FROM page
	INNER JOIN wbc_entity_usage ON eu_page_id = page_id
	WHERE page_namespace = 6
) AS t;

Is it just me or will this query be counting some pages twice?

No it shouldn’t count pages more than once.
UNION omits duplicates (UNION ALL doesn’t), so no need to DISTINCT again.

So, it seems we have 2 completely separate definitions of "structured data":

Well, the intent of these numbers is not to measure "structured data usage" nor to define that concept. The intent is to measure the impact of the structure data on commons project per @Abit initial request couple weeks ago.

With that objective in mind I do not see the Wikidata usage on commons to be of relevance as that usage it is not directly tied to the work the team working on "structure data on commons" is been doing. Please do let me know if this is not a correct statement.

As a side note I think the true interesting metric is the rate of growth of the result of the query below. Following rate of growth of usage of structured data on commons growths seems much more relevant to track impact than the absolute numbers we have been talking about to date.

SELECT DISTINCT page_id
	FROM page
	INNER JOIN slots ON slot_revision_id = page_latest
	INNER JOIN content ON slot_content_id = content_id AND content_size > 122
	INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'

So, it seems we have 2 completely separate definitions of "structured data"

There are three completely separate definitions of structured data:

  1. The definition of structured data for the Sloan report
  2. The definition of structured data to be reported in the Platform Evolution tuning session
  3. The definition of structured data for product analytics, such as making dashboards and product development choices

I thought that this ticket was about (3). Separately, Matthias is working on (1) in another ticket. I'm not sure what the status of (2) is.

If this ticket isn't about (3) can you please let me know what you are trying to create a report for?

@Abit: Sorry it was not clear. This below is the request you send to analytics couple weeks ago via e-mail, as I mentioned then we rather work on requests via phab tickets that via e-mail.

"
From: Amanda Bittaker <abittaker@wikimedia.org>
Date: Tue, Nov 19, 2019 at 1:30 AM
Subject: Analytics for structured data elements on Commons
To: Joseph Allemandou <jallemandou@wikimedia.org>
Cc: Ramsey Isler <risler@wikimedia.org>

We're wondering how many file pages on Commons contain at least one structured data element (statement or caption), and we hear you are the person to ask.

Is this something you can help us set up analytics for? If it wouldn't be too much effort, there are other structured data numbers it would be nice to have, but most urgently we just need to be able to regularly check the number of files with structured data. (We have been using the search results, but search indexing is proving unreliable.) "

Matthias is working on (1) in another ticket.

Can you link that ticket so we do not duplicate efforts?

Ah sorry about that, the discussion moved to T238878: Data about how many file pages on Commons contain at least one structured data element after I sent that email, and the clarification of the three different definitions I listed above came from those conversations.

We would still like productionized reports for (3). If that is still possible, I would love to discuss it more :)

Matthias is working on (1) in another ticket.

Can you link that ticket so we do not duplicate efforts?

It may be T238878 but I'm not certain--@matthiasmullie can you please confirm?

We would still like productionized reports for (3). If that is still possible, I would love to discuss it more :)

Please coordinate with Product-Analytics on those.

I found yet another ticket about SDC metrics in which is apparent that there is duplicated work happening. Let's please try to avoid that going forward.
https://phabricator.wikimedia.org/T231952

Ok, seems like some of this confusion is getting cleared up. For my part, here's what I'm planning to do next:

  • Productionize the query currently getting the 3 million or so role_name = mediainfo slots
  • Productionize the query currently getting the 7 million or so entities linked via Lua templates in wbc_entity_usage
  • publish both of these numbers in a single tsv, monthly, updated on the 5th of the month

For the future, I suggest looking at rev_timestamp for the revisions/pages that are of interest and seeing if it makes sense to get some historical trends that way. See T238878#5730630 for some idea of what that might look like.

Change 556741 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/reportupdater-queries@master] Report structured data use for commons

https://gerrit.wikimedia.org/r/556741

Change 556741 merged by Mforns:
[analytics/reportupdater-queries@master] Report structured data use for commons

https://gerrit.wikimedia.org/r/556741

Change 559580 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/reportupdater-queries@master] Remove poorly defined metric

https://gerrit.wikimedia.org/r/559580

Change 559580 merged by Milimetric:
[analytics/reportupdater-queries@master] Remove poorly defined metric

https://gerrit.wikimedia.org/r/559580

Change 562555 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Enable structured-data report

https://gerrit.wikimedia.org/r/562555

Change 562555 merged by Ottomata:
[operations/puppet@production] Enable structured-data report

https://gerrit.wikimedia.org/r/562555