
Estimate Wikidata entity accumulation
Closed, Resolved | Public | 2 Estimated Story Points

Description

Produce a quick estimate of how many Wikidata entities (items + properties + labels) have been ingested into WME since it started running. This is intended as a rough metric for understanding how much of the graph we are accumulating over time.

Acceptance criteria

Provide rough estimates of:

  • total unique items ingested
  • total unique properties ingested

Notes: This is not a production metric, just a rough estimate for planning.

Event Timeline

I counted the objects under the relevant paths with the following command:

$ aws s3api list-objects-v2 --bucket <bucket> --prefix <path> | grep -o '\.json' | wc -l

The result:

  • 8,633,532 items (unique QIDs)
  • 3,077 properties (unique PIDs)

This count includes some scholarly QIDs that were ingested before we fixed the filter. I don't know how many; computing that would take considerably longer.

Running the command took around 40 minutes for the QIDs. A sample of a single path (e.g. items/0/00) contained ~34k QIDs, so if this needs to be repeated, extrapolating from one path appears to give a reliable estimate (256 paths * 34k ~= 8.7M, within about 1% of the full count of 8,633,532).
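
For anyone repeating this, the extrapolation can be sketched as a small shell helper. This is only a rough illustration: the function names extrapolate_total and percent_error are made up for this example, and it assumes that QIDs are spread roughly evenly across the 256 paths, which the single sample suggests but does not prove. With the rounded 34k sample, the product is 8,704,000, which is about 0.8% above the full count.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of any existing tooling) for extrapolating
# a total object count from one sampled prefix.

# Multiply the count from one sampled path by the number of sibling paths.
# Assumes objects are distributed roughly evenly across all paths.
extrapolate_total() {
  local sample_count=$1   # objects counted under the sampled path
  local num_paths=$2      # number of sibling paths (256 in the comment above)
  echo $(( sample_count * num_paths ))
}

# Signed relative difference between an estimate and a full count,
# in percent with two decimals.
percent_error() {
  local estimate=$1 actual=$2
  awk -v e="$estimate" -v a="$actual" 'BEGIN { printf "%.2f\n", (e - a) / a * 100 }'
}

# The sample count could come from the same command used for the full count,
# restricted to a single path, for example:
#   aws s3api list-objects-v2 --bucket <bucket> --prefix <path> | grep -o '\.json' | wc -l
estimate=$(extrapolate_total 34000 256)
echo "estimate: $estimate"
echo "difference vs full count: $(percent_error "$estimate" 8633532)%"
```

Sampling two or three paths instead of one and averaging them before multiplying would make the estimate less sensitive to an unusually full or empty path, at the cost of a little more listing time.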