Security review for Linter extension
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Legoktm
	Oct 18 2016, 8:14 PM

Description

Project Information

Name of tool/project: Extension:Linter
Project home page: https://www.mediawiki.org/wiki/Extension:Linter
Name of team requesting review: Parsing
Primary contact: @Legoktm and @Arlolra
Target date for deployment: Before end of 2016
Link to code repository / patchset: mediawiki/extensions/Linter in Gerrit, additionally the linter code in Parsoid is mostly in lib/wt2html/pp/handlers/linter.js and lib/logger/linter.js
Programming Language(s) Used: PHP and JavaScript

Description of the tool/project

T48705: Parsoid-based wikitext "linting" tool for "buggy" / "deprecated" wikitext usage; keywords: broken wikitext information
Parsoid has a linter that can identify common errors in wikitext (deprecated elements, fostered content, bad image options, etc.). The extension collects them in a database table, and surfaces them to users. It additionally has some small client-side JS that highlights the section of wikitext with the error to make it easier for editors to fix.

Description of how the tool will be used at Wikimedia

Random user saves a page
change-prop asks restbase to re-render the page
- note that the originally edited page might have been a template, and change-prop is re-rendering dependent pages
restbase asks parsoid to re-render the page
parsoid does so and spots a lint error
parsoid sends an API request to MW with a JSON encoded set of warnings, and some metadata about them
- even if there are no lint errors, parsoid will still send an API response to tell MW that any previous warnings may have been fixed

Dependencies

Parsoid (technically it's not a dependency, but the extension won't do anything without Parsoid sending it data)

Has this project been reviewed before?

please link to tasks or wiki pages of previous reviews
Nope.

Working test environment

please link or describe setup process for setting up a test environment
You'll need to set up parsoid, and configure it to use your local wiki. Also set the two lint settings as specified on https://www.mediawiki.org/wiki/Extension:Linter in parsoid's config.yaml.

Install the MW extension like normal (wfLoadExtension). If Parsoid is running on another host, you'll need to configure $wgLinterSubmitterWhitelist, as by default it only allows requests from localhost.

I don't know how to set up the whole changeprop/restbase pipeline, so I would usually go to http://localhost:8000/localhost/v3/page/html/Using_big_template in my browser which has parsoid parse the page and submit lint errors. An easy lint error to test is the usage of <big> tags.

Post-deployment

name of team responsible for tool/project after deployment and primary contact
See above

Security concerns

The API module action=record-lint uses an IP whitelist ($wgLinterSubmitterWhitelist) before accepting any data. Any client that passes the whitelist is considered trusted. If it's not whitelisted, the data is discarded.

The other main concern is accidentally DoSing ourselves during the initial rollout, as we expect that there are a large number of errors that already exist, and would want to do a lot of DB writes. There is also a linterAPISampling setting which can be set in parsoid's config.yaml to only submit lint errors for some pages. We could use this to gradually ramp up the number of requests that are going to the MW API. Additionally, the linting variable can be a list of lint warning categories to enable (e.g. "obsolete-tag"), so we can reduce it that way.

Related Objects
Search...

Status	Assigned	Task
Resolved	Legoktm	T48705 Parsoid-based wikitext "linting" tool for "buggy" / "deprecated" wikitext usage; keywords: broken wikitext information
Resolved	Legoktm	T148609 Review and deploy Linter extension to Wikimedia wikis
Resolved	Bawolff	T148583 Security review for Linter extension

Event Timeline

Legoktm created this task.Oct 18 2016, 8:14 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 18 2016, 8:14 PM

Legoktm added a project: MediaWiki-extensions-Linter.Oct 18 2016, 8:14 PM

Legoktm added a parent task: T148609: Review and deploy Linter extension to Wikimedia wikis.Oct 19 2016, 6:52 AM

Arlolra updated the task description. (Show Details)Oct 19 2016, 7:08 AM

MarcoAurelio subscribed.Oct 19 2016, 1:32 PM

Legoktm mentioned this in T148866: DBA review of Linter extension.Oct 21 2016, 8:36 PM

• dpatrick mentioned this in E348: Security Review: Linter.Oct 25 2016, 8:17 PM

This has been schedule for the week of November 7th. @Legoktm, @Bawolff mentioned that you had some urgency for this. Is next week okay?

In T148583#2742701, @dpatrick wrote:

This has been schedule for the week of November 7th. @Legoktm, @Bawolff mentioned that you had some urgency for this. Is next week okay?

That's fine, yes. My goal is to have it deployed by end of year 2016, but that might not be realistic if there are a lot of deployment freezes like last year. In any case, I will need to make some changes based on the DBA review currently in progress, so a week's worth of time is good :)

• dpatrick moved this task from Incoming to Scheduled on the deprecated-security-team-reviews board.Nov 8 2016, 9:40 PM

Review of 83b1003e6ef of Linter. Note: I only looked at the php extension, I did not look at the restbase code. I assume the restbase code doesn't need to be reviewed.

No security issues :D Everything looks good to me from a security perspective. However, when reviewing this I found some non-security bugs. I've also included some random thoughts here that may or may not make sense.

For the DOS potential you mentioned above - I think it'd be a good idea to start enabling this on smaller wikis before rolling out to enwiki, just to make sure it won't overload anything. However, I imagine you probably were going to do it that way anyways.

Miscellaneous non security comments

Depending on how fast restbase is, in theory there's a possible race condition if restbase inserts lint errors before slaves update the new revision id. This seems pretty unlikely, but perhaps the code should check the master db in the case where the specified revision id during action=record-lint is higher than getLastRevID().
If restbase was really slow, we'd have a lot of old linter records. Does it possibly make sense to include the provided revision in the linter table, and add page_latest = linter_rev as an inner join condition when showing things to the user, to make sure no old entries show up due to latency in rest base recording linter info? On the other hand, that would make for a lot more db write requests in the case where the error stays between edits, so its probably good the way it is for that reason.
Should this use the WikiPageDeletionUpdates hook to remove Linter records when a page gets deleted?
How does the category map cache get cleared? Shouldn't there be at least some TTL on it (e.g. 24 hours)? Otherwise you won't be able to query based on new categories until map falls out of cache, which could be a very long time.
Is linter_id really how users would want us to sort this? (I suppose dba's don't like the idea of denormalizing page_title into the linter table just for the purpose of sorting? However, even just sorting by (namespace,linter_id) seems like it would be a more useful order)
LintErrorsPager - in getFieldNames() - TablePager seems to expect strings not Message Objects. Should explicitly call ->text().
The example url for list=linterrors uses the prefix le instead of the prefix lnt.
The action=record-lint has the field types mixed up with their defaults (i.e. If you don't specify a page, it defaults to the page named "string"). They should also probably be marked mandatory (This obviously doesn't really matter since its internal only).
action=record-lint has poor error handling (e.g. malformed json results in php fatal). Again, doesn't really matter since its internal only.
Given that adding a new category would need code changes anyways (for i18n messages at the very least), is there really much benefit to having the separate db table instead of hard-coding the category -> id mapping in php? Seems like having the db table adds a lot of code complexity without much benefit.
the lntcategory option of list=linterrors is broken because code expects a single value but its actually an array. Additionally if its an array the option name should be plural.
Special:LintErrors should implement getSubpagesForPrefixSearch().

Legoktm mentioned this in T151280: Linter: check the master db in the case where the specified revision id during action=record-lint is higher than getLastRevID().Nov 22 2016, 3:19 AM

Legoktm mentioned this in T151281: Linter: use the WikiPageDeletionUpdates hook to remove Linter records when a page gets deleted.Nov 22 2016, 3:22 AM

Legoktm mentioned this in T151282: Linter: Improve sorting.Nov 22 2016, 3:25 AM

Legoktm mentioned this in T151283: Linter: LintErrorsPager shouldn't rely on implicit Message::__toString().

Legoktm mentioned this in T151284: Linter: example url for list=linterrors uses the prefix le instead of the prefix lnt.

Legoktm mentioned this in T151285: Linter: Fix action=record-lint getAllowedParams().Nov 22 2016, 3:28 AM

Legoktm mentioned this in T151286: Linter: Improve action=record-lint error handling.

In T148583#2813012, @Bawolff wrote:

Review of 83b1003e6ef of Linter. Note: I only looked at the php extension, I did not look at the restbase code. I assume the restbase code doesn't need to be reviewed.

Thanks! To clarify, the rest of the linter code is in Parsoid, and it will be Parsoid that talks to the MW API. That code is already technically deployed, its just disabled behind a config switch.

No security issues :D Everything looks good to me from a security perspective. However, when reviewing this I found some non-security bugs. I've also included some random thoughts here that may or may not make sense.

For the DOS potential you mentioned above - I think it'd be a good idea to start enabling this on smaller wikis before rolling out to enwiki, just to make sure it won't overload anything. However, I imagine you probably were going to do it that way anyways.

Yes :)

Miscellaneous non security comments

Depending on how fast restbase is, in theory there's a possible race condition if restbase inserts lint errors before slaves update the new revision id. This seems pretty unlikely, but perhaps the code should check the master db in the case where the specified revision id during action=record-lint is higher than getLastRevID().

T151280

If restbase was really slow, we'd have a lot of old linter records. Does it possibly make sense to include the provided revision in the linter table, and add page_latest = linter_rev as an inner join condition when showing things to the user, to make sure no old entries show up due to latency in rest base recording linter info? On the other hand, that would make for a lot more db write requests in the case where the error stays between edits, so its probably good the way it is for that reason.

Lint errors are based on the expanded wikitext, so it would need to be based on page_touched on something to account for template updates. In any case, right now I'm not too concerned about providing extremely up to date info, that can be improved later.

Should this use the WikiPageDeletionUpdates hook to remove Linter records when a page gets deleted?

Yes. T151281

How does the category map cache get cleared? Shouldn't there be at least some TTL on it (e.g. 24 hours)? Otherwise you won't be able to query based on new categories until map falls out of cache, which could be a very long time.

It shouldn't need to be cleared? If an id doesn't exist, it should automatically hit the db table and create if necessary. But see below on your suggestion of getting rid of it entirely.

Is linter_id really how users would want us to sort this? (I suppose dba's don't like the idea of denormalizing page_title into the linter table just for the purpose of sorting? However, even just sorting by (namespace,linter_id) seems like it would be a more useful order)

Good point, filed T151282.

LintErrorsPager - in getFieldNames() - TablePager seems to expect strings not Message Objects. Should explicitly call ->text().

T151283

The example url for list=linterrors uses the prefix le instead of the prefix lnt.

T151284

The action=record-lint has the field types mixed up with their defaults (i.e. If you don't specify a page, it defaults to the page named "string"). They should also probably be marked mandatory (This obviously doesn't really matter since its internal only).

T151285

action=record-lint has poor error handling (e.g. malformed json results in php fatal). Again, doesn't really matter since its internal only.

T151286

Given that adding a new category would need code changes anyways (for i18n messages at the very least), is there really much benefit to having the separate db table instead of hard-coding the category -> id mapping in php? Seems like having the db table adds a lot of code complexity without much benefit.

My initial idea was that we could add new lint types to Parsoid, have it start populating the database right away, and then add the i18n messages later on. But I think that's a pretty narrow edge case, and definitely isn't worth this code complexity. T151287

the lntcategory option of list=linterrors is broken because code expects a single value but its actually an array. Additionally if its an array the option name should be plural.

T151288

Special:LintErrors should implement getSubpagesForPrefixSearch().

T151289

Thank you for the in-depth code review :)

It shouldn't need to be cleared? If an id doesn't exist, it should automatically hit the db table and create if necessary. But see below on your suggestion of getting rid of it entirely.

Right. I think when I first read through it, I confused myself and thought getCategories() was coming from the cache, where its really coming from the config.

Legoktm closed this task as Resolved.Dec 1 2016, 9:06 AM

Legoktm claimed this task.

Legoktm reassigned this task from Legoktm to Bawolff.