
Performance review of Extension:WikimediaEditorTasks
Closed, Resolved · Public

Description

Based on the result of the security review, it was suggested:

Since this extension is adding database writes after a save, I think it would be a good idea to get the Performance-Team to give it a thumbs up before deployment.

Security review: T215048: Security review for the WikimediaEditorTasks extension


Project Information

Description of the tool/project

The extension will provide the following functionality:

Contribution counts by defined type
  • Hook into PageSaveComplete and RollbackComplete events to maintain per-user, per-language counts of edit and rollback events meeting defined criteria
  • Maintain a DB table of counts
  • Maintain a DB table of target counts met per user
  • Expose maintained data through a MediaWiki Action API module
Microcontribution suggestions
  • Generate microcontribution suggestions (for the initial rollout, suggested Wikidata description edits) and serve them through an Action API module

Description of how the tool will be used at WMF

The tool will be used to support the App Editor Tasks project of the Wikimedia Apps team. Its primary consumers will be the official Wikipedia apps (initially, the Android app only).

Dependencies

Wikibase

Working test environment

The extension is available via the wikimediaeditortasks role in MediaWiki-Vagrant.
Additionally, a development and testing environment is set up on Cloud VPS, consisting of wikidata-edit-counts.wmflabs.org and its client wikis, {en,es,fr,he,ru,zh}-edit-counts.wmflabs.org.

Post-deployment

Responsible team: Reading Infrastructure
Primary contact: Michael Holloway

Event Timeline

How will wikimedia_editor_tasks_entity_description_exists be populated, and roughly how large will it be? Can wetede_language values actually reach 255 characters in length? If the table can be huge, then I also wonder whether the wetede_rand index should be on (wetede_language, wetede_description_exists, wetede_rand) instead.
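A minimal sketch of the index change being suggested here, assuming hypothetical index names and that the existing single-column index is called wetede_rand (the actual DDL lives in the extension's schema files):

-- Sketch only: replace the single-column index on wetede_rand with a composite one,
-- so lookups can filter by language and description status before using the random value.
DROP INDEX wetede_rand ON wikimedia_editor_tasks_entity_description_exists;
CREATE INDEX wetede_lang_exists_rand
    ON wikimedia_editor_tasks_entity_description_exists
    (wetede_language, wetede_description_exists, wetede_rand);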

Change 496097 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/WikimediaEditorTasks@master] [WIP] Fix various bugs and make some optimizations

https://gerrit.wikimedia.org/r/496097

This probably should be held back until the maintenance script for loading suggestion data is in place? That seems like the only part that could be tricky performance-wise.

How will wikimedia_editor_tasks_entity_description_exists be populated, and roughly how large will it be? Can wetede_language values actually reach 255 characters in length? If the table can be huge, then I also wonder whether the wetede_rand index should be on (wetede_language, wetede_description_exists, wetede_rand) instead.

wikimedia_editor_tasks_entity_description_exists will be populated by a maintenance script that's currently in review here: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEditorTasks/+/492526/

I wouldn't expect wetede_language values ever to come close to 255 characters in length. It should only ever hold wiki language codes. That could probably be changed to varbinary(32) (following the sites table).
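For illustration, a hedged sketch of that column change, assuming the column stays NOT NULL (the actual change is made in the schema patch linked below):

-- Sketch only: narrow the language-code column from varchar(255) to varbinary(32),
-- matching the width the core sites table uses for site/language codes.
ALTER TABLE wikimedia_editor_tasks_entity_description_exists
    MODIFY wetede_language varbinary(32) NOT NULL;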

In principle, the table should contain as many rows as there are sitelinks from Wikidata items, so (if I have the correct query for this) ~67.75 million.

mholloway-shell@stat1006:~$ analytics-mysql wikidatawiki -e 'select count(ips_row_id) from wb_items_per_site'
+-------------------+
| count(ips_row_id) |
+-------------------+
|          67765257 |
+-------------------+

Change 495892 had a related patch set uploaded (by Krinkle; owner: Michael Holloway):
[mediawiki/extensions/WikimediaEditorTasks@master] Tweak targetPassed queries for performance

https://gerrit.wikimedia.org/r/495892

Change 496097 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/WikimediaEditorTasks@master] Fix various bugs and make some optimizations

https://gerrit.wikimedia.org/r/496097

Change 495892 abandoned by Mholloway:
Tweak targetPassed queries for performance

Reason:
superseded by Ia7454c497e82857aa50922e8b1b9656446c92b97

https://gerrit.wikimedia.org/r/495892

Change 496097 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEditorTasks@master] Fix various bugs and make some optimizations

https://gerrit.wikimedia.org/r/496097

The wetede_rand index seems like it will need to be changed as I mentioned, given the size. Otherwise, there are too many cases that could involve massive scanning.
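To illustrate the concern, here is a hypothetical suggestion lookup of the kind that index serves (the wetede_entity_id column name and the exact query shape are assumptions, not the extension's actual query):

-- With an index on wetede_rand alone, this has to scan a large slice of the table and
-- filter out non-matching languages; with (wetede_language, wetede_description_exists,
-- wetede_rand) it can seek straight to the matching slice and read only a few rows.
SELECT wetede_entity_id
FROM wikimedia_editor_tasks_entity_description_exists
WHERE wetede_language = 'es'
  AND wetede_description_exists = 0
  AND wetede_rand >= 0.73   -- random starting point chosen by the caller
ORDER BY wetede_rand
LIMIT 50;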

Tagging @jcrespo since he might want to look at this.

In principle, the table should contain as many rows as there are sitelinks from Wikidata items, so (if I have the correct query for this) ~67.75 million.

mholloway-shell@stat1006:~$ analytics-mysql wikidatawiki -e 'select count(ips_row_id) from wb_items_per_site'
+-------------------+
| count(ips_row_id) |
+-------------------+
|          67765257 |
+-------------------+

Change 497860 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/WikimediaEditorTasks@master] Performance: DB schema updates

https://gerrit.wikimedia.org/r/497860

I briefly scanned https://www.mediawiki.org/wiki/Extension:WikimediaEditorTasks and its subpages, as well as the code. There are mainly two key words I would like to raise, "wikidata" and "counters", especially because the combination of the two usually leads to contention (Wikidata has the highest write activity on the same items). This is not theoretical: for example, our #1 cause of db-related errors is Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages, and it is not really an error, it is the consequence of the two keywords I mentioned before (plus the fact that it is multiplied per wiki), which makes it a hard problem to solve.

66 million rows is a small table; the size itself does not worry me, but how writes work when updating the table does, especially with Wikidata, where multiple writes to the same item normally happen at the same time. In particular, I would like to know more about how the concurrency model is set up for table updates (for example, it is not clear to me at this time where the data of this extension is to be deployed, per wiki or in a single location, and if in a single location, where; how many languages at first and in the future; the synchronicity of the updates; and what measures have been taken to reduce contention and tolerate failure). That would require a deeper code scan, but please feel free to expand on the details of the data workflow (beyond the description) and the deployment plan.

I don't see any red flags on my general scan; the only worry, performance-wise, is the write contention. varchar(255) is not an issue if most content is 'en', 'es', etc. I have proposed having numeric ids for references to languages/wikis in the future, but that doesn't exist yet, and it is of course not the responsibility of the extension developer.

66 million rows is a small table; the size itself does not worry me, but how writes work when updating the table does, especially with Wikidata, where multiple writes to the same item normally happen at the same time.

It's large enough that poor indexing would be an issue. But with the proposed patch above I think it's fine. In terms of bytes, I don't see it being huge either.

In particular, I would like to know more about how the concurrency model is set up for table updates (for example, it is not clear to me at this time where the data of this extension is to be deployed, per wiki or in a single location, and if in a single location, where; how many languages at first and in the future; the synchronicity of the updates; and what measures have been taken to reduce contention and tolerate failure). That would require a deeper code scan, but please feel free to expand on the details of the data workflow (beyond the description) and the deployment plan.

I believe it would all go in the extension1 cluster in one MySQL database. All the counter updates happen in deferred updates. They should run in AUTO-COMMIT mode, though one thing needs to be fixed for them to actually do that.
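For illustration, a hedged sketch of the kind of single-statement counter upsert that stays cheap when each deferred update runs in its own auto-committed transaction (the counts table and column names here are assumptions, not necessarily the extension's actual schema):

-- Sketch only: one self-contained statement per qualifying edit event. With autocommit
-- enabled, the row lock on this user/key/language counter is held only for the duration
-- of this statement rather than for a longer surrounding transaction.
INSERT INTO wikimedia_editor_tasks_counts (wetc_user, wetc_key_id, wetc_lang, wetc_count)
VALUES (12345, 1, 'es', 1)
ON DUPLICATE KEY UPDATE wetc_count = wetc_count + 1;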

I don't see any red flags on my general scan; the only worry, performance-wise, is the write contention. varchar(255) is not an issue if most content is 'en', 'es', etc. I have proposed having numeric ids for references to languages/wikis in the future, but that doesn't exist yet, and it is of course not the responsibility of the extension developer.

Yeah, I just prefer the field size range being obvious from looking at the schema.

Change 498248 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/WikimediaEditorTasks@master] Use the right DB handles for the deferred updates

https://gerrit.wikimedia.org/r/498248

Change 497860 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEditorTasks@master] Performance: DB schema updates

https://gerrit.wikimedia.org/r/497860

@aaron How are we doing on this? Do you expect this to be complete in time to enable in production on Thursday of this week? It seems like you've already gone over the whole codebase and made tweaks, but I'm not sure whether there's more you had planned.

@aaron How are we doing on this? Do you expect this to be complete in time to enable in production on Thursday of this week? It seems like you've already gone over the whole codebase and made tweaks, but I'm not sure whether there's more you had planned.

If https://gerrit.wikimedia.org/r/498248 is merged, then everything seems fine to me. I can't think of anything else.

Change 498248 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEditorTasks@master] Use the right DB handles for the deferred updates

https://gerrit.wikimedia.org/r/498248

Thanks for the review and the improvements, @aaron!