
Data about how many file pages on Commons contain at least one structured data element
Closed, Resolved · Public

Description

@Abit needs data about how many file pages on Commons contain at least one structured data element. Ideally I could check every day, but we'll need to know every couple weeks at least.

We cannot rely solely on the search results for this data because we are still tweaking the search indexing, and the search results have been incorrect several times, in one case wildly so (showing 200,000 captions instead of 1.7 million, iirc).

Event Timeline

Of interest: https://phabricator.wikimedia.org/T190460

The work done by @mpopov (if you are so kind @mpopov, please upload your screenshots):

The wbc_entity_usage table is supposed to hold info on Wikidata usage for pages. For example, here's a random file I added some structured data to a few days ago: https://commons.wikimedia.org/wiki/File:P%C3%B3voa_de_Varzim_-i---i-_(25379025808).jpg

When you look for it in the commonswiki replica, it has a page ID of 68860692. Looking for it in the wbc_entity_usage table we only see that it has a caption in English, which I added at basically the same time as several statements:
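The lookup is presumably something along these lines (a sketch with my own column selection, not necessarily the exact query run; the screenshot below shows the actual result):

SELECT eu_entity_id, eu_aspect
FROM wbc_entity_usage
WHERE eu_page_id = 68860692;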

Screen Shot 2019-11-19 at 5.58.57 PM.png
The eu_aspect column does have other values, like "O" (statements) and "D" (not documented, but from a brief investigation it looks like it's specifically for linking categories on Commons to Wikidata Q-items). There are some records of files with "O" aspects (as the MW page notes, it can refer to a variety of entity usages, but typically it's statements), but then it gets weird, because the language of the label isn't recorded and there's a bunch of seemingly unnecessary info? Take for example the MediaWiki DB data for https://commons.wikimedia.org/wiki/File:Jodrell_Bank_Mark_II_5.jpg

Screen Shot 2019-11-19 at 6.36.04 PM.png
Screen Shot 2019-11-19 at 6.36.33 PM.png
Woof! That's…not great. So, uh, clearly there's something funky going on with the Wikibase client extension? Or maybe that's data that was recorded by an earlier version of the extension before it knew to append language codes to labels? I don't know enough about the nitty-gritty there, so these are just vaguely educated guesses.
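To see which aspect codes are actually in use, a tally over the whole table would be something like this (a sketch with my own alias; not a query that was actually run here):

SELECT eu_aspect, COUNT(*) AS n_usages
FROM wbc_entity_usage
GROUP BY eu_aspect
ORDER BY n_usages DESC;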

So, with the caveat that we're dealing with imperfect data and that some data is definitely missing, here are SOME lower bounds on files which have structured data:

SELECT
  COUNT(DISTINCT eu_page_id) AS n_files_with_statements,
  COUNT(1) AS n_statements
FROM wbc_entity_usage
INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id
WHERE page_namespace = 6
  AND eu_aspect = 'O';

n_files_with_statements: 5 212 680
n_statements: 8 677 888

Files with labels:

SELECT
  COUNT(DISTINCT eu_page_id) AS n_files_with_captions
FROM wbc_entity_usage
INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id
WHERE page_namespace = 6
  AND eu_aspect RLIKE '^L';

n_files_with_captions: 7 857 520

Files with captions and/or statements:

SELECT
  COUNT(DISTINCT eu_page_id) AS n_files_with_structured_data
FROM wbc_entity_usage
INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id
WHERE page_namespace = 6
  AND (eu_aspect RLIKE '^L' OR eu_aspect = 'O');

n_files_with_structured_data: 7 861 887

Total files (at the moment I ran these queries, so we have a record of % w/ structured data):

SELECT COUNT(1) AS n_files_total
FROM page WHERE page_namespace = 6;

n_files_total: 58 698 512

Thanks for reading! Feel free to correct me if I made any mistakes with the queries or my thought process.

Per T220525 it looks like none of the XML dump files that provide content for analytics contain any data about structured data in Commons files.

Proposed changes (which I am not sure were implemented) to the dumps format to include this info are here: https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps#Schema

Dumps are on version 0.10; the linked proposal is for (I think) version 0.11.

Summing up: at this time there is no way to search actual page content in the dumps to get a page's structured data.

I think @Addshore had some information to add here.

@Addshore said there were 2,988,205 as of 9 PST this morning. I got the sense he didn't count using search, but he'll have to confirm.

Last night I ran:

mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT COUNT(DISTINCT rev_page) FROM commonswiki.revision INNER JOIN commonswiki.slots ON revision.rev_id = slots.slot_revision_id WHERE slots.slot_role_id = 2;
+--------------------------+
| COUNT(DISTINCT rev_page) |
+--------------------------+
|                  2988233 |
+--------------------------+
1 row in set (1 min 51.29 sec)

This morning I altered it slightly to try to take deleted pages into account.

mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT COUNT(DISTINCT rev_page)
    -> FROM commonswiki.revision
    -> INNER JOIN commonswiki.slots
    -> ON revision.rev_id = slots.slot_revision_id
    -> WHERE slots.slot_role_id = 2
    -> AND revision.rev_deleted = 0;


+--------------------------+
| COUNT(DISTINCT rev_page) |
+--------------------------+
|                  2990350 |
+--------------------------+
1 row in set (4 min 27.29 sec)

I then also filtered out mediainfo entities that have been emptied (I determined the size threshold by looking up the size of an emptied entity in the DB, for a page that I created and then removed structured data from).

mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT COUNT(DISTINCT rev_page)
    -> FROM commonswiki.revision
    -> INNER JOIN commonswiki.slots
    -> ON revision.rev_id = slots.slot_revision_id
    -> INNER JOIN commonswiki.content
    -> ON slots.slot_content_id = content.content_id
    -> WHERE slots.slot_role_id = 2
    -> AND revision.rev_deleted = 0
    -> AND content.content_size > 122;

+--------------------------+
| COUNT(DISTINCT rev_page) |
+--------------------------+
|                  2990420 |
+--------------------------+
1 row in set (4 min 55.86 sec)

I don't think there will be a more reliable way to check this count, as long as I am interpreting "at least one structured data element" correctly, which I would read as having a caption or statement.

As far as I can tell:
@mpopov's queries (wbc_entity_usage-based) include both MediaInfo items and Wikidata items pulled in via Lua.
@Addshore's queries are strictly counts of MediaInfo items.

@Addshore : disclaimer: I know next to nothing about this, but how are you taking into account that the revision is the last one for the page? That is, a page might have had a structured data item in a prior revision that has since been removed from its most current revision. In your select, *I think* you are counting revisions with slots that contain data, but if that revision is a past one for the page it should not be counted, as we only care about structured data present on the page right now, correct? Hopefully this makes sense: I think you are discounting deleted "slots" but are counting "past" revisions of the page where those slots were "alive" as if they pertained to the count.

So, per my comment above, I think the number of items is actually smaller than the one @Addshore has computed, but wiser folks can correct me if I am wrong.

Here are the missing screenshots:

For the wbc_entity_usage lookup of File:Póvoa de Varzim -i---i- (25379025808).jpg (page ID 68860692) described above:

1.png (198×1 px, 45 KB)

The structured data is missing, despite being added before the caption.

And for the MediaWiki DB data for https://commons.wikimedia.org/wiki/File:Jodrell_Bank_Mark_II_5.jpg:

2.png (216×1 px, 43 KB)

unnamed.png (1×966 px, 324 KB)

I was looking at populateEntityUsage.php (the maintenance script for populating wbc_entity_usage based on the page_props table). So if the entity usage table is populated from the page_props table, it partially explains why the statements for File:Póvoa de Varzim -i---i- (25379025808).jpg aren't showing up: they're not in the page_props table:

Screen Shot 2019-11-22 at 4.30.41 PM.png (190×914 px, 50 KB)

SELECT *
FROM page_props AS pp
LEFT JOIN page ON pp.pp_page = page.page_id
WHERE pp_propname = 'wikibase_item'
--  AND page_namespace = 6 -- returns 0 results
LIMIT 100

This shows that basically only ns:0 (mostly pages listing categories) and ns:14 have the wikibase_item page property.

@daniel @Ladsgroup: hi o/ I'm pinging you because you're listed as the authors on a bunch of the relevant Wikibase code (including that entity usage maintenance script). Can you please help point us at somewhere, anywhere that we can use to figure out how many files on Commons have had labels, depicts, and other statements added?

A different strategy is to use the revision comments to look for how many ns:6 pages have had revisions where the comment included wbset<label, claim>, for example:

SELECT
  page_title, page_namespace, rev_id, IF(rev_comment = '', comment_text, rev_comment) AS revision_comment
FROM revision rev
LEFT JOIN page ON rev.rev_page = page.page_id
LEFT JOIN revision_comment_temp rct ON rev.rev_id = rct.revcomment_rev
LEFT JOIN `comment` ON rct.revcomment_comment_id = `comment`.comment_id
WHERE page_namespace = 6
  AND rev_page = 68860692
  AND (comment_text RLIKE 'wbset(claim|label)' OR rev_comment RLIKE 'wbset(claim|label)')

Screen Shot 2019-11-22 at 5.00.30 PM.png (826×1 px, 564 KB)

Which only looks at additions, not changes/removals, but we can fix that. Anyways, using this method we can count how many files have had structured data added to them as of the end of October 2019 (using Analytics Engineering's MediaWiki History in the Data Lake):

WITH structured_data_additions AS (
    SELECT
        page_id,
        SUM(IF(event_comment RLIKE 'wbsetclaim', 1, 0)) > 0 AS had_claim_added,
        SUM(IF(event_comment RLIKE 'wbsetlabel', 1, 0)) > 0 AS had_label_added
    FROM mediawiki_history
    WHERE snapshot = '2019-10'
      AND wiki_db = 'commonswiki'
      AND event_entity = 'revision'
      AND page_namespace = 6
      AND event_comment RLIKE 'wbset(label|claim)'
      AND NOT revision_is_identity_reverted
    GROUP BY page_id
)
SELECT
    CASE
      WHEN had_claim_added AND had_label_added THEN 'statement(s) and label(s)'
      WHEN had_claim_added AND NOT had_label_added THEN 'just statement(s)'
      WHEN had_label_added AND NOT had_claim_added THEN 'just label(s)'
    END AS structured_data_added,
    COUNT(1) AS n_files
FROM structured_data_additions
GROUP BY
CASE
    WHEN had_claim_added AND had_label_added THEN 'statement(s) and label(s)'
    WHEN had_claim_added AND NOT had_label_added THEN 'just statement(s)'
    WHEN had_label_added AND NOT had_claim_added THEN 'just label(s)'
END;

@Abit @Ramsey-WMF @Mayakp.wiki: this will be of interest to you. The total number of files which have had structured data added to them (and not reverted) before November 2019 is… 1,401,757. This doesn't include claim/label removals, so just a heads up there.

structured_data_added        n_files
just label(s)                1 112 577
just statement(s)              163 200
statement(s) and label(s)      125 980

For a more up-to-date count, here's an equivalent query for the MW replica in MariaDB, but it doesn't include revert status, which is provided in the mediawiki_history data:

SELECT
  CASE WHEN had_claim_added AND had_label_added THEN 'statement(s) and label(s)'
       WHEN had_claim_added AND NOT had_label_added THEN 'just statement(s)'
       WHEN had_label_added AND NOT had_claim_added THEN 'just label(s)'
    END AS structured_data_additions,
  COUNT(1) AS n_files
FROM (
  SELECT
    rev_page,
    SUM(IF(comment_text RLIKE 'wbsetclaim' OR rev_comment RLIKE 'wbsetclaim', 1, 0)) > 0 AS had_claim_added,
    SUM(IF(comment_text RLIKE 'wbsetlabel' OR rev_comment RLIKE 'wbsetlabel', 1, 0)) > 0 AS had_label_added
  FROM revision rev
  LEFT JOIN page ON rev.rev_page = page.page_id
  LEFT JOIN revision_comment_temp rct ON rev.rev_id = rct.revcomment_rev
  LEFT JOIN `comment` ON rct.revcomment_comment_id = `comment`.comment_id
  WHERE page_namespace = 6
    AND (comment_text RLIKE 'wbset(claim|label)' OR rev_comment RLIKE 'wbset(claim|label)')
  GROUP BY rev_page
) AS structured_data_revisions;

(This is taking forever to run so, uh, guess I'll update when it finishes.)

Update: query timed out. Bummer.

So the 'wikibase_item' value for pageprops is for client wikis and not repos. You need to use property keys like wb-claims instead (I queried commons and nothing showed up, meaning it doesn't add the value; it can be due to MCR but I'm not sure). We don't have any keys for the number of labels (or any sort of terms) in the pageprops table because in wikidata that's accessible through the term store (wb_terms/wbt_* tables) or WDQS.
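The check described is presumably something along these lines (a sketch with my own alias; per the above, on commonswiki it returns nothing):

SELECT pp_propname, COUNT(*) AS n_pages
FROM page_props
WHERE pp_propname = 'wb-claims'
GROUP BY pp_propname;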

I don't really know much about how mediawiki_history gets populated, but wbcreateclaim & wbeditentity (in addition to wbsetclaim) could also be used to create (or edit) statements IIRC, so this last set of results is likely incomplete.
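If so, the comment-based queries above could be widened along these lines (a sketch reusing the joins from the earlier query; still a heuristic, since edit summaries are free-form):

SELECT COUNT(DISTINCT rev_page) AS n_files
FROM revision rev
LEFT JOIN page ON rev.rev_page = page.page_id
LEFT JOIN revision_comment_temp rct ON rev.rev_id = rct.revcomment_rev
LEFT JOIN `comment` ON rct.revcomment_comment_id = `comment`.comment_id
WHERE page_namespace = 6
  AND comment_text RLIKE 'wb(setclaim|setlabel|createclaim|editentity)';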

And I believe wbc_entity_usage currently *only* gets populated with M-entities fetched via Lua - any page where the M-entity is not used will not show up there (I believe - could be wrong). So that data too (in addition to containing other, non-MediaInfo related entities) is likely incomplete.

I believe that this query (based on @Addshore's, but more strict about including only the latest revision of pages that have not been archived) is quite accurate (it takes an awfully long time to complete, though).
Did I overlook anything here? Any reason to believe this number is invalid?

SELECT COUNT(DISTINCT page_id)
# page excludes deleted pages (which are in archive)
FROM page
# joining on page_latest - we only care about most recent (not revdeleted) revision
INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
INNER JOIN slots ON slot_revision_id = rev_id
# mediainfo slot must contain actual content
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';
+-------------------------+
| COUNT(DISTINCT page_id) |
+-------------------------+
|                 3004300 |
+-------------------------+
1 row in set (33 min 31.86 sec)

Just passed 3M files!

@Addshore : disclaimer: I know next to nothing about this, but how are you taking into account that the revision is the last one for the page? That is, a page might have had a structured data item in a prior revision that has since been removed from its most current revision. In your select, *I think* you are counting revisions with slots that contain data, but if that revision is a past one for the page it should not be counted, as we only care about structured data present on the page right now, correct? Hopefully this makes sense: I think you are discounting deleted "slots" but are counting "past" revisions of the page where those slots were "alive" as if they pertained to the count.

That is indeed correct! hmmmmm...

Regarding wbc_entity_usage, that is probably not the right way to go, as this will not include all files that currently have media info entities, unless those entities are also used in the wikitext slot via LUA or the property parser function?
Correct me if I am wrong here @Ladsgroup

I believe that this query (based on @Addshore's, but more strict about including only the latest revision of pages that have not been archived) is quite accurate (it takes an awfully long time to complete, though).
Did I overlook anything here? Any reason to believe this number is invalid?

SELECT COUNT(DISTINCT page_id)
# page excludes deleted pages (which are in archive)
FROM page
# joining on page_latest - we only care about most recent (not revdeleted) revision
INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
INNER JOIN slots ON slot_revision_id = rev_id
# mediainfo slot must contain actual content
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';
+-------------------------+
| COUNT(DISTINCT page_id) |
+-------------------------+
|                 3004300 |
+-------------------------+
1 row in set (33 min 31.86 sec)

That looks better to me!

And wow, yes, it does take some time.
I guess that is okay, if we don't need to monitor this too closely.

FWIW that query also looks sensible to me. Unless someone can point out something wrong with it, I think this is probably accurate?

We will be sqooping the slot tables to the cluster so these queries can run in parallel.

Can someone explain why the content table is as large as the revision table? I thought these tables were not used much other than on commons, but the content table is big in dewiki, for example.

Can someone explain why the content table is as large as the revision table? I thought these tables were not used much other than on commons, but the content table is big in dewiki, for example.

Because all content is managed in the content table, including the main slot.
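For intuition: every revision gets at least a main-slot row, so the slots and content tables grow in step with the revision table on every wiki. A sketch of the breakdown by role (aliases are mine):

SELECT role_name, COUNT(*) AS n_slot_rows
FROM slots
INNER JOIN slot_roles ON slot_role_id = role_id
GROUP BY role_name;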

I noticed that three folks used rev_deleted = 0 to mean "revision not deleted", but this field means something completely different (https://www.mediawiki.org/wiki/Manual:Revision_table#rev_deleted). Basically, it's about parts of the revision that have been suppressed from different types of users in the interface (like hiding revision comments). The good news is that revisions in the revision table are by definition not deleted: when a page is deleted, its revisions all get inserted into the archive table and removed from the revision table. So when you did rev_deleted = 0 you were just filtering out revisions that had any kind of suppression applied. I think that's a small number and it didn't seem to add to the confusion above, but I'm mentioning it just in case.
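To gauge how small that number is, one could count revisions with any suppression bits set, along these lines (a sketch, alias mine):

SELECT COUNT(*) AS n_suppressed_revisions
FROM revision
WHERE rev_deleted != 0;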

@Milimetric it wasn't being used to omit deleted pages (see the inline comments about the archive table), but to also exclude those RevDel'ed revisions.
I was thinking it made sense to exclude rev-deleted revisions: if a page only has (SDC) revisions that are partly restricted for most people, then we might not want to count those.
But one could argue that we should include them, and I'm happy to drop that condition from the query! (though the number is likely extremely low, too low to even matter)

I was thinking it made sense to exclude rev-deleted revisions: if a page only has (SDC) revisions that are partly restricted for most people, then we might not want to count those.

If you want to count pages, then you are only looking at the current revision, right? And the current revision of a page is by definition never deleted/suppressed. So no need to check rev_deleted.

By the way, if you find rev_deleted != 0 for the current revision, it's a bug. The deletion flags for the current revisions will be ignored by the storage layer.

If you want to count pages, then you are only looking at the current revision, right? And the current revision of a page is by definition never deleted/suppressed. So no need to check rev_deleted.

So page_latest can never point to a revdel'ed revision? Ok cool, had no idea - might as well skip that condition (and the entire join with the revision table) then, if it's already implicitly so:

SELECT COUNT(DISTINCT page_id)
# page excludes deleted pages (which are in archive)
FROM page
# joining on page_latest - we only care about most recent
INNER JOIN slots ON slot_revision_id = page_latest
# mediainfo slot must contain actual content
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';
+-------------------------+
| COUNT(DISTINCT page_id) |
+-------------------------+
|                 3045691 |
+-------------------------+
1 row in set (7 min 23.17 sec)

So page_latest can never point to a revdel'ed revision?

SHOULD never ;)

@Abit from our conversation last week, it sounds like engineers on SDC are moving forward with this. Is that correct? Can you make sure it has the appropriate priority & is assigned?

After looking at this for a bit with @Ladsgroup and @mpopov (cc @Abit):
A commons page can have data from wikidata and from the wikibase instance on commons (called WikibaseMediaInfo, MediaInfo for short).

wbc_entity_usage will not have any data from structured data on commons that is kept in WikibaseMediaInfo. The data in that table only pertains to wikidata items used by "wiki clients"; "wiki clients" are consumers of wikidata's data, like 'eswiki'. So, to find "how many file pages on commons contain at least 1 structured data element", looking at slots is sufficient. That should be inclusive of all commons files. That number is ~3 million.

Now, there are pages on commons that use, for example, wikidata infoboxes, like: https://commons.wikimedia.org/wiki/Category:Lucas_Werkmeister
The pages using wikidata items on commons can be tracked with wbc_entity_usage, and that number is 7.9 million as of today. Now, to be clear, the 7.9 million pertains entirely to a different set of infrastructure, namely wikidata's. None of those 7.9 million structured data items represent any info in the WikibaseMediaInfo instance.

Ping @matthiasmullie to assess this info

A select for @mpopov to look at:

SELECT COUNT(DISTINCT eu_page_id)
FROM mediawiki_page AS P
JOIN mediawiki_wbc_entity_usage AS W
  ON (W.eu_page_id = P.page_id AND W.wiki_db = P.wiki_db AND W.snapshot = P.snapshot)
WHERE W.wiki_db = 'commonswiki'
  AND W.snapshot = '2019-11'
  AND page_namespace = 6
  AND eu_aspect RLIKE '^L'
  AND page_is_redirect = 0;

Ping @matthiasmullie to assess this info

That is correct; just want to clarify 1 little detail:

wbc_entity_usage will not have any data from structured data on commons that is kept in WikibaseMediaInfo. The data in that table only pertains to wikidata items used by "wiki clients"; "wiki clients" are consumers of wikidata's data, like 'eswiki'.

wbc_entity_usage will have some mediainfo entities, in cases where a file page uses another (or its own) mediainfo entity via Lua.
wbc_entity_usage simply records relationships between pages and entities - that could be wikidata or mediainfo (theoretically even lexeme, I believe) entities (but by and large, it's currently wikidata entities being pulled in via Lua)

Also see T231952#5717638 & T239565#5728117

Indeed it will

mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> select count(*) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';
+----------+
| count(*) |
+----------+
|     3401 |
+----------+
1 row in set (0.00 sec)

but by and large, it's currently wikidata entities being pulled in via Lua

It is exclusively wikidata items at this time, correct?

but by and large, it's currently wikidata entities being pulled in via Lua

It is exclusively wikidata items at this time, correct?

Not based on my query above.
There are 3401 flags showing usage of mediainfo entities.

mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> select count(DISTINCT eu_page_id) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';
+----------------------------+
| count(DISTINCT eu_page_id) |
+----------------------------+
|                       1936 |
+----------------------------+

Apparently there are 1936 pages on commons using data from mediainfo entities via Lua.

I see 7.9 million items in that table used by commons, 1936 of which are mediainfo items. That is 0.02%.

I translated T238878#5708511 to Hive to familiarize myself with it and to get ahead of productionizing it. I got similar numbers first as a sanity check, and then grouped the numbers by the month of the page_latest revision's timestamp. I was wondering if the numbers increase in some nice way that we could report on regardless of the overall total, and there's a fairly clear trend towards more structured data. This is not a clean way to show the trend, because we're just looking at the latest revision rather than all revisions, but maybe it's useful to @Abit as she thinks about this metric:

month      n_files
2019-01     50396
2019-02     90102
2019-03    108367
2019-04    112658
2019-05    140429
2019-06    442834
2019-07    142744
2019-08    399805
2019-09    757923
2019-10    255584
2019-11    531414

Query:

use wmf_raw;
 SELECT concat(substring(rev_timestamp, 1, 4), '-', substring(rev_timestamp, 5, 2)) as month,
        COUNT(DISTINCT page_id)

        -- page excludes deleted pages (which are in archive)
   FROM mediawiki_page
            -- joining on page_latest - we only care about most recent
            INNER JOIN
        mediawiki_slots         ON slot_revision_id = page_latest
            -- mediainfo slot must contain actual content
            INNER JOIN
        mediawiki_content       ON slot_content_id = content_id
                                AND content_size > 122
            INNER JOIN
        mediawiki_slot_roles    ON role_id = slot_role_id
                                AND role_name = 'mediainfo'
            INNER JOIN
        mediawiki_revision      ON rev_id = slot_revision_id

  WHERE mediawiki_page.wiki_db = 'commonswiki'
    AND mediawiki_slots.wiki_db = 'commonswiki'
    AND mediawiki_content.wiki_db = 'commonswiki'
    AND mediawiki_slot_roles.wiki_db = 'commonswiki'
    AND mediawiki_revision.wiki_db = 'commonswiki'
    AND mediawiki_page.snapshot = '2019-11'
    AND mediawiki_slots.snapshot = '2019-11'
    AND mediawiki_content.snapshot = '2019-11'
    AND mediawiki_slot_roles.snapshot = '2019-11'
    AND mediawiki_revision.snapshot = '2019-11'

  GROUP BY substring(rev_timestamp, 1, 6)
  ORDER BY month
;

Also,

By the way, if you find rev_deleted != 0 for the current revision, it's a bug. The deletion flags for the current revisions will be ignored by the storage layer.

I checked this as I was re-reading and there are 2500 such problems:

use wmf_raw;

 select p.wiki_db,
        page_id,
        page_title,
        page_namespace,
        rev_deleted

   from mediawiki_page p
            inner join
        mediawiki_revision r    on r.wiki_db = p.wiki_db
                                and page_id = rev_page
                                and page_latest = rev_id
                                and rev_deleted <> 0

  where p.snapshot = '2019-11'
    and r.snapshot = '2019-11'

  order by p.wiki_db
;

(side-note: queries on these raw tables as imported from mediawiki are slow because these tables are not stored in parquet format)

By the way, if you find rev_deleted != 0 for the current revision, it's a bug. The deletion flags for the current revisions will be ignored by the storage layer.

I checked this as I was re-reading and there are 2500 such problems:

Thanks for following up on this.

I checked the code again, and I was (somewhat) wrong: the current revision can't have its text suppressed, but it can have the user and comment suppressed. The bit mask for DELETED_TEXT is 0x01, so the entry would be bad if rev_deleted is an odd number. I didn't spot any odd values when eyeballing the output you linked.

Special:RevisionDelete prevents the DELETED_TEXT flag from being applied to the current revision. I didn't check whether that's safe against race conditions, or whether it can be circumvented with partial undeletion. But so far, there seems to be nothing bad in the database.

I seem to recall that some code skips the permission check for the current revision, but I can't find any such code right now to check whether it skips the check just for the text, or also for comments and user name. In any case, for the purposes of this task, rev_deleted can be ignored.
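A direct check for the bad case described above (the DELETED_TEXT bit set on a current revision) would be something like this sketch (alias mine):

SELECT COUNT(*) AS n_bad_current_revisions
FROM page
INNER JOIN revision ON rev_id = page_latest
WHERE (rev_deleted & 1) != 0;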

@Addshore @matthiasmullie Can we document on https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage what the "M" in wbc_entity_usage stands for?

See addshore's select above:

select count(*) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';

@Addshore @matthiasmullie Can we document on https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage what the "M" in wbc_entity_usage stands for?

See addshore's select above:

select count(*) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';

M is about the entities, not the aspects; M stands for MediaInfo, which are basically items for SDC.
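For reference, a quick way to see which entity-type prefixes appear in the table (Q for Wikidata items, M for MediaInfo) is a tally like this sketch (aliases mine):

SELECT LEFT(eu_entity_id, 1) AS entity_prefix, COUNT(*) AS n_usages
FROM wbc_entity_usage
GROUP BY entity_prefix;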

@Nuria: what I thought we were doing is productionizing @matthiasmullie's approach from T239565#5728117. This seemed sensible to me, and the change I submitted for T239565 properly comments that we are not measuring MediaInfo entities with the query on wbc_entity_usage. If any of this is wrong or not what others were expecting, I'd like a clear explanation as the multiple threads have become complicated to resolve objectively.

I'd like a clear explanation as the multiple threads have become complicated to resolve objectively.

Indeed.
@Milimetric we just want to have data for commons; for wikidata there are plenty of metrics available that also measure usage from this table, see: https://grafana.wikimedia.org/d/000000176/wikidata-entity-usage-project?orgId=1&fullscreen&panelId=17&from=now-5y&to=now

@Addshore @matthiasmullie Can we document on https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage what the "M" in wbc_entity_usage stands for?

See addshore's select above:

select count(*) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';

Our docs for this can be found at https://doc.wikimedia.org/Wikibase/master/php/topic_usagetracking.html

nettrom_WMF claimed this task.

It's my understanding that per T239565 this is now calculated automatically and published at https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/structured-data/, so I'm closing this as resolved.