Description
We should be planning for expected growth of structured data entries on Commons over the next 3 years. This includes core db servers, external storage, and dumps hosts.
Event Timeline
From an email from @MarkTraceur:
Database needs
- 54 million files on Commons
- Estimated average of 10-20 statements per file
- Estimated 1 revision per statement
- Therefore, (very) roughly 1 billion estimated rows added to the revision table
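A back-of-envelope check of that estimate, using only the figures above (54M files, 10-20 statements per file, 1 revision per statement):
SELECT 54000000 * 10 * 1 AS low_estimate;   -- ~0.54 billion revisions
SELECT 54000000 * 20 * 1 AS high_estimate;  -- ~1.08 billion revisions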
External storage needs
- Each file will have its own MediaInfo entity, which will be analogous to Wikidata items
- So, given Wikidata has about 57 million items, the storage needs should be about the same
- Obviously that would need to be additional storage, not including the existing Wikitext
Rates
- We expect multiple bots to run over Commons very shortly after release (within the next few months)
- Don't anticipate these will be drastically faster than normal bot runs
- Could see Multichill's bots for examples - I believe he's rate-limited them aggressively
- There will likely be micro-contributions as well
- Think Magnus's "Wikidata game" style, likely similar rates
- Also sanctioned on-wiki machine-aided work (for depicts statements)
- By the end of the calendar year, we expect at least 5 million files to have structured data
- We're currently sitting in the low six figures (100-300k)
@jcrespo I'm adding you too, please remove yourself if you're already covered by other tasks.
@MarkTraceur The number of new revisions to Wikidata in a day varies between about 550k and 850k. Of these, only about 12k are new pages in ns 0 (where the vast majority of pages live), and about 51% of the revisions to existing pages (over the last 3 months) were made by bots.
What that means is a lot of bot activity adding claims to existing entities.
Admittedly, commons will take some time to ramp up to that, but I'd prefer to plan for it sooner rather than later. I definitely don't want us to be in the position of telling people we can't accommodate their edits, and/or throttling them severely.
For external storage, core dbs, and dumps hosts, we'll need to make the appropriate projections.
Do you have a link to an overview of what the various statements per file might be (the 10-20 you mentioned)?
Is there a road map for release you can point us at?
@ArielGlenn https://grafana.wikimedia.org/d/000000175/wikidata-datamodel-statements?refresh=30m&panelId=4&fullscreen&orgId=1 <-- average statements per item on Wikidata
Let me find an up to date roadmap for you.
@ArielGlenn here's a *tentative* roadmap that provides a high-level view of the SDC work we have planned for the rest of the calendar year. Anything beyond Dec. 31 is still uncertain at this time. https://docs.google.com/presentation/d/1hdqodLhi9Ym-BtLNyhfHKAnTcPMcLOV6hocNQqwiqLE/edit?usp=sharing
So there are some details on https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth, but I haven't written too much about media info yet.
Details about number of entities: https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth#MediaInfo
Details about number of revisions: https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth#Commons
I didn't bother predicting the revision count for commons too much yet, as wikidata will likely hit the bigint mark before commons.
Naturally the edit rate on commons is likely going to keep heading up (not sure if anyone is tracking this yet).
For Wikidata currently we have https://grafana.wikimedia.org/d/000000170/wikidata-edits
Even with an extremely high edit rate I think we should be able to spot most capacity issues quite a while before they happen.
I keep pretty decent tabs on wikidata growth, because of the dumps. I don't do that for commons entities because I can't even find the proper wikibase tables. I checked the wb_* tables on commonswiki and they all appear to be empty (?!)
I can do some very rough numbers gathering by periodically getting the max slotid and the max revid, which would at least let us track those two trends. You don't by any chance track those two numbers already?
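A minimal sketch of that periodic check, assuming "max slotid" means the highest slot_revision_id in the slots table and "max revid" the highest rev_id in the revision table:
SELECT MAX(slot_revision_id) AS max_slot_rev_id FROM commonswiki.slots;
SELECT MAX(rev_id) AS max_rev_id FROM commonswiki.revision;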
I think we'll have bots soon enough that help with the issue of adding captions for already uploaded files, and then you'll see huge growth in the number of slots. The captions are in a separate slot, right? It would be nice to be able to track the growth in the number of specific slots (depicts, caption, anything else on the short-to-mid-term horizon).
So all of the SDoC stuff just exists in the regular MediaWiki page, revision, etc. tables.
> I can do some very rough numbers gathering by periodically getting the max slotid and the max revid, which would at least let us track those two trends. You don't by any chance track those two numbers already?
We don't have any automatic tracking of these things yet. (see T68025 for some old thoughts)
I'm doing periodic counting and putting the numbers on https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth#MediaInfo for the number of media info entities. The query for that number is:
SELECT COUNT(DISTINCT rev_page) FROM commonswiki.revision INNER JOIN commonswiki.slots ON revision.rev_id = slots.slot_revision_id WHERE slots.slot_role_id = 2;
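A variant of that query, in case the hard-coded role id ever changes, would be to look the role up by name instead (assuming the role is registered as 'mediainfo' in slot_roles, which is presumably what id 2 maps to on commonswiki):
SELECT COUNT(DISTINCT rev_page) FROM commonswiki.revision INNER JOIN commonswiki.slots ON revision.rev_id = slots.slot_revision_id INNER JOIN commonswiki.slot_roles ON slot_roles.role_id = slots.slot_role_id WHERE slot_roles.role_name = 'mediainfo';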
> I think we'll have bots soon enough that help with the issue of adding captions for already uploaded files, and then you'll see huge growth in the number of slots.
Yup
> The captions are in a separate slot, right?
The media info entity as a whole is in a different slot. So now commons pages will have at most 2 slots, 1 for wikitext, and one for everything media info.
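A quick sanity-check sketch of that at-most-two-slots assumption (it should return no rows if the assumption holds):
SELECT slot_revision_id, COUNT(*) AS n_slots FROM commonswiki.slots GROUP BY slot_revision_id HAVING COUNT(*) > 2;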
> It would be nice to be able to track the growth in the number of specific slots (depicts, caption, anything else on the short-to-mid-term horizon).
Might be worth adding this to T68025
Tendril already has a report of # of rows in a table:
https://tendril.wikimedia.org/report/table_status?host=db1081&schema=commonswiki&table=slots&engine=&data=&index=
But it doesn't track this over time, and it also doesn't let us easily look into dimensions of a table, such as # of rows per slot type in the slots table.
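For the per-slot-type dimension, a point-in-time breakdown could be pulled with something like the query below (it would still need to be run and stored periodically to get a trend):
SELECT slot_roles.role_name, COUNT(*) AS slot_rows FROM commonswiki.slots INNER JOIN commonswiki.slot_roles ON slot_roles.role_id = slots.slot_role_id GROUP BY slot_roles.role_name;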
As evidenced by https://graphite.wikimedia.org/S/i we already have 5.5 million images with contents in the MediaInfo slot. With two months to go until the end of the year, we can already see how low the prediction was compared to the actual number.
I would like to know if there is some work going on to be able to split those tables from s4 into their own set of servers. My understanding is that it wasn't possible, and that's why the tables were created directly on s4 (where Commons lives).
Sharing the same set of servers with commonswiki means that, if the growth of the SDC-related tables continues, sooner or later those tables will need to be moved out into their own set of servers (as we advised when we were first involved in the conversations about SDC).
Matthias will look into discrepancies between number of files with mediainfo slots vs. what's indexed in Cirrus.
It looks like the difference is mostly revisions vs files.
The 5.5M+ number closely matches the number of revisions that have a mediainfo slots record:
SELECT COUNT(*) FROM slots WHERE slot_role_id = 2;  -- result: 6161688
That number includes all edits though (SDC edits as well as regular file page edits once the page got its first structured data)
Mediainfo slots grouped by page are closer to the results we get from Cirrus:
SELECT COUNT(DISTINCT rev_page) FROM slots INNER JOIN revision ON slot_revision_id = rev_id WHERE slot_role_id = 2;  -- result: 2918327
(This number also includes pages that once had structured data which has since been deleted.)
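One way to exclude those, as a sketch, is to count only pages whose current revision still carries a mediainfo slot, which should be closer to what Cirrus indexes:
SELECT COUNT(*) AS pages_with_current_mediainfo FROM page INNER JOIN slots ON slots.slot_revision_id = page.page_latest WHERE slots.slot_role_id = 2;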
Following my IRC chat with @ArielGlenn - the revision and slots tables on s4 (commonswiki) are still at reasonable sizes.
We just decreased the size of the revision table when we applied the MCR schema changes there.
As of now, the sizes on disk are:
- revision: 76GB
- slots: 20GB
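For reference, those on-disk sizes can be spot-checked from MySQL itself (approximate, since InnoDB row statistics are estimates):
SELECT table_name, ROUND(data_length / 1024 / 1024 / 1024) AS data_gb, ROUND(index_length / 1024 / 1024 / 1024) AS index_gb FROM information_schema.tables WHERE table_schema = 'commonswiki' AND table_name IN ('revision', 'slots');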
If we can clean up the image table (T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata) it'll give us leeway for the structured data I assume.
The task's original intent was to cover planning "over the next 3 years" starting in 2019. @ArielGlenn is the task still relevant, can it be closed, do we need a new one?
It depends on whether any tables are expected to grow a fair amount in the next three years. @Ladsgroup will have a better handle on that now.
I think even if they grow a lot, with the new set of servers, we still have 6.6TB free (76% free disk space)...I'd be surprised if we grow that much in 3 years.
Commons is now the biggest section, and by far. It used to be so much worse that Wikidata was dwarfed in comparison. The thing is, this has almost nothing to do with SDoC: the biggest tables were templatelinks, categorylinks, externallinks, etc. It has already shrunk back to a much smaller size, and I hope we will get it to a reasonable size by the end of the calendar year, but it's not really related to SDoC TBH.
Thanks for all the responses. Based on the above I think that a regular (yearly?) check-in should still happen but there are no immediate actions for SRE.