
Capacity planning for Commons Structured Data
Open, Normal, Public · 0 Story Points

Description

We should be planning for expected growth of structured data entries on Commons over the next 3 years. This includes core db servers, external storage, and dumps hosts.

Event Timeline

ArielGlenn triaged this task as Normal priority. Jun 19 2019, 11:28 AM
ArielGlenn created this task.
Restricted Application added a project: Wikidata. Jun 19 2019, 11:28 AM
Restricted Application added a subscriber: Aklapper.

From an email from @MarkTraceur:

Database needs

  • 54 million files on Commons
  • Estimated average of 10-20 statements per file
  • Estimated 1 revision per statement
  • Therefore, (very) roughly 1 billion estimated rows added to revisions table
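The arithmetic behind that rough estimate can be sketched as follows; all inputs are the estimates from the bullets above, not measured figures:

```python
# Back-of-envelope check of the revision-row estimate above.
files = 54_000_000                        # files on Commons
statements_low, statements_high = 10, 20  # estimated statements per file
revisions_per_statement = 1               # estimated revisions per statement

low = files * statements_low * revisions_per_statement
high = files * statements_high * revisions_per_statement
print(f"{low:,} to {high:,} new revision rows")
# 540,000,000 to 1,080,000,000 -- i.e. (very) roughly 1 billion
```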

External storage needs

  • Each file will have its own MediaInfo entity, which will be analogous to Wikidata items
  • So, given Wikidata has about 57 million items, the storage needs should be about the same
    • Obviously that would need to be additional storage, not including the existing Wikitext

Rates

  • We expect multiple bots to run over Commons very shortly after release (within the next few months)
    • Don't anticipate these will be drastically faster than normal bot runs
    • Could see Multichill's bots for examples - I believe he's rate-limited them aggressively
  • There will likely be micro-contributions as well
    • Think Magnus's "Wikidata game" style, likely similar rates
    • Also sanctioned on-wiki machine-aided work (for depicts statements)
  • By the end of the calendar year, we expect at least 5 million files to have structured data
  • We're currently sitting in the low six figures (100-300k)
Yann added a subscriber: Yann. Jun 20 2019, 6:35 AM
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Jun 20 2019, 12:40 PM
ArielGlenn added a subscriber: jcrespo. (Edited) Jun 20 2019, 1:02 PM

@jcrespo I'm adding you too, please remove yourself if you're already covered by other tasks.

@MarkTraceur The number of new revisions to wikidata in a day varies between about 550k and 850k. Of these, only about 12k are new pages in ns 0 (the vast majority of pages), and about 51% of those new revisions on old pages (in the last 3 months) were done by bots.

What that means is a lot of bot activity adding claims to existing entities.

Admittedly, commons will take some time to ramp up to that, but I'd prefer to plan for it sooner rather than later. I definitely don't want us to be in the position of telling people we can't accommodate their edits, and/or throttling them severely.

For external storage, core dbs, and dumps hosts, we'll need to make the appropriate projections.

Do you have a link to an overview of what the various statements per file might be (the 10-20 you mentioned)?

Is there a road map for release you can point us at?

@ArielGlenn here's a *tentative* roadmap that provides a high-level view of the SDC work we have planned for the rest of the calendar year. Anything beyond Dec. 31 is still uncertain at this time. https://docs.google.com/presentation/d/1hdqodLhi9Ym-BtLNyhfHKAnTcPMcLOV6hocNQqwiqLE/edit?usp=sharing

So there are some details on https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth, but I haven't written much about MediaInfo yet.

Details about number of entities: https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth#MediaInfo
Details about number of revisions: https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth#Commons

I haven't put much effort into predicting the revision count for Commons yet, as Wikidata will likely hit the big-int mark before Commons does.
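For context on the "big int mark": a signed 32-bit integer column tops out at 2^31 - 1, so headroom can be roughly estimated from the current maximum revision ID and the daily edit rate. A minimal sketch, where `current_rev_id` is a made-up illustrative value; only the 550k-850k/day range comes from the comments above:

```python
# Hypothetical headroom estimate for a signed 32-bit rev_id column.
INT32_MAX = 2**31 - 1            # 2,147,483,647

current_rev_id = 1_000_000_000   # illustrative value, not a real measurement
revisions_per_day = 850_000      # upper end of the rate quoted above

days_left = (INT32_MAX - current_rev_id) // revisions_per_day
print(f"~{days_left} days (~{days_left / 365:.1f} years) of headroom")
```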

Naturally, the edit rate on Commons is likely to head upward (I'm not sure whether anyone is tracking this yet).
For Wikidata currently we have https://grafana.wikimedia.org/d/000000170/wikidata-edits
Even with an extremely high edit rate I think we should be able to spot most capacity issues quite a while before they happen.

Addshore moved this task from incoming to monitoring on the Wikidata board. Jun 21 2019, 10:56 PM

I keep pretty decent tabs on wikidata growth, because of the dumps. I don't do that for commons entities because I can't even find the proper wikibase tables. I checked the wb_* tables on commonswiki and they all appear to be empty (?!)

I can do some very rough numbers gathering by periodically getting the max slotid and the max revid, which would at least let us track those two trends. You don't by any chance track those two numbers already?

I think we'll have bots soon enough that help with the issue of adding captions for already uploaded files, and then you'll see huge growth in the number of slots. The captions are in a separate slot, right? It would be nice to be able to track the growth in the number of specific slots (depicts, caption, anything else on the short-to-mid-term horizon).

> I keep pretty decent tabs on wikidata growth, because of the dumps. I don't do that for commons entities because I can't even find the proper wikibase tables. I checked the wb_* tables on commonswiki and they all appear to be empty (?!)

So all of the SDoC data just lives in the regular MediaWiki page, revision, etc. tables.

> I can do some very rough numbers gathering by periodically getting the max slotid and the max revid, which would at least let us track those two trends. You don't by any chance track those two numbers already?

We don't have any automatic tracking of these things yet. (see T68025 for some old thoughts)
I'm doing periodic counting and putting the numbers on https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth#MediaInfo for the number of media info entities. The query for that number is:

SELECT COUNT(DISTINCT rev_page)
FROM commonswiki.revision
INNER JOIN commonswiki.slots ON revision.rev_id = slots.slot_revision_id
WHERE slots.slot_role_id = 2;
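A toy reproduction of this query against an in-memory SQLite database, just to show its shape on sample data. The table and column names mirror the MediaWiki revision/slots tables, and `slot_role_id = 2` stands in for the mediainfo role (the real role ID is assigned per wiki):

```python
# Toy sqlite3 version of the distinct-page count above.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
CREATE TABLE slots (slot_revision_id INTEGER, slot_role_id INTEGER);
-- page 10 has two mediainfo revisions, page 11 has one, page 12 has none
INSERT INTO revision VALUES (1, 10), (2, 10), (3, 11), (4, 12);
INSERT INTO slots VALUES (1, 2), (2, 2), (3, 2), (4, 1);
""")
(entities,) = db.execute("""
    SELECT COUNT(DISTINCT rev_page)
    FROM revision
    INNER JOIN slots ON revision.rev_id = slots.slot_revision_id
    WHERE slots.slot_role_id = 2
""").fetchone()
print(entities)  # 2 pages currently carry a mediainfo slot
```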

> I think we'll have bots soon enough that help with the issue of adding captions for already uploaded files, and then you'll see huge growth in the number of slots.

Yup

> The captions are in a separate slot, right?

The MediaInfo entity as a whole is in a different slot. So now Commons pages will have at most 2 slots: one for wikitext, and one for everything MediaInfo.

> It would be nice to be able to track the growth in the number of specific slots (depicts, caption, anything else on the short-to-mid-term horizon).

Might be worth adding this to T68025
Tendril already has a report of # of rows in a table:
https://tendril.wikimedia.org/report/table_status?host=db1081&schema=commonswiki&table=slots&engine=&data=&index=
But it doesn't track this over time, and it also doesn't let us easily look into dimensions of a table, such as the number of rows per slot role in the slots table.
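The kind of over-time, per-slot-role tracking Tendril lacks could look roughly like this. This is a hypothetical sketch: `run_query` is a stand-in for however one would reach a commonswiki replica, and here it just returns canned rows:

```python
# Hypothetical sketch: poll the row count per slot role periodically
# and keep a timestamped series.
import time

def run_query(sql):
    # Stand-in for a real replica query; rows are (slot_role_id, count).
    # The counts below are canned illustrative values.
    return [(1, 54_000_000), (2, 5_500_000)]

def snapshot():
    rows = run_query(
        "SELECT slot_role_id, COUNT(*) FROM slots GROUP BY slot_role_id"
    )
    return {"ts": int(time.time()), "counts": dict(rows)}

series = []              # append one snapshot per polling interval
series.append(snapshot())
print(series[-1]["counts"])
```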

I've commented about this over on the other ticket. Let's see what they say.

As evidenced by https://graphite.wikimedia.org/S/i, we already have 5.5 million images with content in the MediaInfo slot. With two months to go until the end of the year, we can see how low the prediction was compared to the actual number.

I would like to know if there is some work going on to be able to split those tables out of s4 into their own set of servers. My understanding is that it wasn't possible, and that's why the tables were created on s4 (where commonswiki lives) directly.
Sharing the same set of servers with commonswiki means that sooner or later those tables will need to be moved out into their own set of servers if growth of the SDC-related tables continues (as we advised when we were first involved in the conversations about SDC).

Ramsey-WMF added a subscriber: matthiasmullie.

Matthias will look into discrepancies between number of files with mediainfo slots vs. what's indexed in Cirrus.

Abit added a subscriber: Abit. Tue, Nov 12, 6:53 PM

> Matthias will look into discrepancies between number of files with mediainfo slots vs. what's indexed in Cirrus.

It looks like the difference is mostly revisions vs files.

The 5.5+ million number closely matches the number of revisions with a mediainfo slots record:

SELECT COUNT(*) FROM slots WHERE slot_role_id = 2;
(result: 6161688)

That number includes all edits, though (SDC edits as well as regular file page edits once the page got its first structured data).

mediainfo slots grouped by page is closer to the results we get from Cirrus:

SELECT COUNT(DISTINCT rev_page) FROM slots INNER JOIN revision ON slot_revision_id = rev_id WHERE slot_role_id = 2;
(result: 2918327)

(This number also includes pages that once had structured data but whose MediaInfo content has since been deleted.)
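The gap between the two counts can be illustrated on toy data: every edit to a page adds another slots row, so counting rows overstates the number of distinct files with structured data. As before, sqlite3 stands in for the production schema and role ID 2 for the mediainfo role:

```python
# Toy illustration of revision-level vs page-level counts.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
CREATE TABLE slots (slot_revision_id INTEGER, slot_role_id INTEGER);
-- one file (page 10) edited three times after gaining structured data
INSERT INTO revision VALUES (1, 10), (2, 10), (3, 10), (4, 11);
INSERT INTO slots VALUES (1, 2), (2, 2), (3, 2), (4, 2);
""")
(per_revision,) = db.execute(
    "SELECT COUNT(*) FROM slots WHERE slot_role_id = 2").fetchone()
(per_page,) = db.execute("""
    SELECT COUNT(DISTINCT rev_page) FROM slots
    INNER JOIN revision ON slot_revision_id = rev_id
    WHERE slot_role_id = 2""").fetchone()
print(per_revision, per_page)  # 4 revisions, but only 2 distinct pages
```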