Discovery experiment 2: Prototype subset dumps
Closed, Resolved, Public

Description

As a short experiment, we'd like to create a subset of the regular weekly data dumps that excludes scholarly articles (P31=Q13442814) and astronomical objects (P31=Q6999), without coming up with an entirely new system or changing the codebase.
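
To make the exclusion condition concrete, here is a minimal sketch in Python of filtering on P31. It assumes the input is the standard Wikidata JSON dump layout (one entity per line, wrapped in `[` and `]`, with a trailing comma on each entity line). The function names `instance_of_ids`, `keep_entity` and `filter_dump_lines` are made up for illustration and are not part of any existing dump tooling.

```python
import json

# Classes named in the task description.
EXCLUDED_CLASSES = {"Q13442814", "Q6999"}  # scholarly article, astronomical object


def instance_of_ids(entity):
    """Return the set of item IDs the entity is a direct instance of (P31)."""
    ids = set()
    for statement in entity.get("claims", {}).get("P31", []):
        snak = statement.get("mainsnak", {})
        # "novalue"/"somevalue" snaks carry no datavalue, so skip them.
        if snak.get("snaktype") == "value":
            ids.add(snak["datavalue"]["value"]["id"])
    return ids


def keep_entity(entity, excluded=EXCLUDED_CLASSES):
    """True if none of the entity's P31 values is in the excluded set."""
    return instance_of_ids(entity).isdisjoint(excluded)


def filter_dump_lines(lines, excluded=EXCLUDED_CLASSES):
    """Yield the entities to keep from an iterable of JSON dump lines."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # array brackets and blank lines carry no entity
        entity = json.loads(line)
        if keep_entity(entity, excluded):
            yield entity
```

This only looks at direct P31 values and ignores statement rank; whether deprecated statements should count is an open choice the sketch does not settle.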

Event Timeline

Ifrahkhanyaree_WMDE renamed this task from [WIP] Prototype subset dumps to Discovery experiment 2: Prototype subset dumps. Aug 6 2025, 2:44 PM
Ifrahkhanyaree_WMDE updated the task description.

Running a preliminary proof of concept on a dataset of about 500,000 entities reveals that fewer than 0.02% are actually excluded by this condition (471524 out of 471591 are not instances of scholarly articles or astronomical objects, so only 67 are removed). Should we also consider instances of transitive subclasses, that is, direct subclasses of Q13442814 or Q6999 as well as subclasses of those subclasses? The following SPARQL query shows that there are 1577 such subclasses: https://w.wiki/F2fy.
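
As a rough illustration of what "transitive subclasses" would mean for the filter, the sketch below expands a set of root classes into all of their direct and indirect subclasses with a breadth-first walk. It assumes the subclass-of (P279) edges have already been extracted into a plain mapping from a class to its direct subclasses. The function name `transitive_subclasses` and the IDs `QA`, `QB` and `QC` in the usage example are placeholders, not real items.

```python
from collections import deque


def transitive_subclasses(roots, direct_subclasses):
    """Return the roots plus every class reachable through subclass edges.

    direct_subclasses maps a class ID to an iterable of its direct
    subclasses (hypothetical pre-extracted P279 data).
    """
    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for child in direct_subclasses.get(current, ()):
            # The seen-set guards against cycles in the subclass graph.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Usage with made-up subclass edges:
example_edges = {"Q13442814": ["QA"], "QA": ["QB", "QC"]}
expanded = transitive_subclasses({"Q13442814", "Q6999"}, example_edges)
```

The expanded set could then replace the two-item exclusion set when checking an entity's P31 values.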

I ran some Presto queries on the full dump in the analytics cluster; here are the actual entity counts as of 2025-07-28:

| Category | Count | % of total |
| --- | --- | --- |
| Total | 116955797 | 100.00% |
| Humans | 12504270 | 10.69% |
| Subclasses of Humans | 183 | 0.00% |
| Astronomical Objects | 25892 | 0.02% |
| Subclasses of Astronomical Objects | 8387361 | 7.17% |
| Scholarly Articles | 45193358 | 38.64% |
| Subclasses of Scholarly Articles | 362353 | 0.31% |
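
For a quick sense of scale, the counts above can be added up to estimate how much of the dump the exclusion would remove if subclasses were included. This is only a back-of-the-envelope sketch: it assumes the four rows do not overlap, so the sum is an upper bound on the number of excluded entities if some entities fall into more than one row.

```python
# Counts copied from the Presto results as of 2025-07-28.
TOTAL = 116955797

EXCLUDED_COUNTS = {
    "Astronomical Objects": 25892,
    "Subclasses of Astronomical Objects": 8387361,
    "Scholarly Articles": 45193358,
    "Subclasses of Scholarly Articles": 362353,
}

# Summing assumes disjoint rows; any overlap makes the true figure smaller.
excluded_upper_bound = sum(EXCLUDED_COUNTS.values())
remaining_lower_bound = TOTAL - excluded_upper_bound
excluded_share = 100 * excluded_upper_bound / TOTAL

print(f"excluded (at most): {excluded_upper_bound} ({excluded_share:.2f}%)")
print(f"remaining (at least): {remaining_lower_bound}")
```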