Page MenuHomePhabricator

Security review for Linter extension
Closed, ResolvedPublic

Description

Project Information

  • Name of tool/project: Extension:Linter
  • Project home page: https://www.mediawiki.org/wiki/Extension:Linter
  • Name of team requesting review: Parsing
  • Primary contact: @Legoktm and @Arlolra
  • Target date for deployment: Before end of 2016
  • Link to code repository / patchset: mediawiki/extensions/Linter in Gerrit, additionally the linter code in Parsoid is mostly in lib/wt2html/pp/handlers/linter.js and lib/logger/linter.js
  • Programming Language(s) Used: PHP and JavaScript

Description of the tool/project

T48705: Parsoid-based wikitext "linting" tool for "buggy" / "deprecated" wikitext usage; keywords: broken wikitext information
Parsoid has a linter that can identify common errors in wikitext (deprecated elements, fostered content, bad image options, etc.). The extension collects them in a database table, and surfaces them to users. It additionally has some small client-side JS that highlights the section of wikitext with the error to make it easier for editors to fix.

Description of how the tool will be used at Wikimedia

  • Random user saves a page
  • change-prop asks restbase to re-render the page
    • note that the originally edited page might have been a template, and change-prop is re-rendering dependent pages
  • restbase asks parsoid to re-render the page
  • parsoid does so and spots a lint error
  • parsoid sends an API request to MW with a JSON encoded set of warnings, and some metadata about them
    • even if there are no lint errors, parsoid will still send an API response to tell MW that any previous warnings may have been fixed

Dependencies

  • Parsoid (technically it's not a dependency, but the extension won't do anything without Parsoid sending it data)

Has this project been reviewed before?

please link to tasks or wiki pages of previous reviews
Nope.

Working test environment

please link or describe setup process for setting up a test environment
You'll need to set up parsoid, and configure it to use your local wiki. Also set the two lint settings as specified on https://www.mediawiki.org/wiki/Extension:Linter in parsoid's config.yaml.

Install the MW extension like normal (wfLoadExtension). If Parsoid is running on another host, you'll need to configure $wgLinterSubmitterWhitelist, as by default it only allows requests from localhost.

I don't know how to set up the whole changeprop/restbase pipeline, so I would usually go to http://localhost:8000/localhost/v3/page/html/Using_big_template in my browser which has parsoid parse the page and submit lint errors. An easy lint error to test is the usage of <big> tags.

Post-deployment

name of team responsible for tool/project after deployment and primary contact
See above

Security concerns

The API module action=record-lint uses an IP whitelist ($wgLinterSubmitterWhitelist) before accepting any data. Any client that passes the whitelist is considered trusted. If it's not whitelisted, the data is discarded.

The other main concern is accidentally DoSing ourselves during the initial rollout, as we expect that there are a large number of errors that already exist, and would want to do a lot of DB writes. There is also a linterAPISampling setting which can be set in parsoid's config.yaml to only submit lint errors for some pages. We could use this to gradually ramp up the number of requests that are going to the MW API. Additionally, the linting variable can be a list of lint warning categories to enable (e.g. "obsolete-tag"), so we can reduce it that way.

Related Objects

Event Timeline

Legoktm created this task.Oct 18 2016, 8:14 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 18 2016, 8:14 PM
Arlolra updated the task description. (Show Details)Oct 19 2016, 7:08 AM

This has been schedule for the week of November 7th. @Legoktm, @Bawolff mentioned that you had some urgency for this. Is next week okay?

This has been schedule for the week of November 7th. @Legoktm, @Bawolff mentioned that you had some urgency for this. Is next week okay?

That's fine, yes. My goal is to have it deployed by end of year 2016, but that might not be realistic if there are a lot of deployment freezes like last year. In any case, I will need to make some changes based on the DBA review currently in progress, so a week's worth of time is good :)

Review of 83b1003e6ef of Linter. Note: I only looked at the php extension, I did not look at the restbase code. I assume the restbase code doesn't need to be reviewed.

No security issues :D Everything looks good to me from a security perspective. However, when reviewing this I found some non-security bugs. I've also included some random thoughts here that may or may not make sense.

For the DOS potential you mentioned above - I think it'd be a good idea to start enabling this on smaller wikis before rolling out to enwiki, just to make sure it won't overload anything. However, I imagine you probably were going to do it that way anyways.

Miscellaneous non security comments

  • Depending on how fast restbase is, in theory there's a possible race condition if restbase inserts lint errors before slaves update the new revision id. This seems pretty unlikely, but perhaps the code should check the master db in the case where the specified revision id during action=record-lint is higher than getLastRevID().
  • If restbase was really slow, we'd have a lot of old linter records. Does it possibly make sense to include the provided revision in the linter table, and add page_latest = linter_rev as an inner join condition when showing things to the user, to make sure no old entries show up due to latency in rest base recording linter info? On the other hand, that would make for a lot more db write requests in the case where the error stays between edits, so its probably good the way it is for that reason.
  • Should this use the WikiPageDeletionUpdates hook to remove Linter records when a page gets deleted?
  • How does the category map cache get cleared? Shouldn't there be at least some TTL on it (e.g. 24 hours)? Otherwise you won't be able to query based on new categories until map falls out of cache, which could be a very long time.
  • Is linter_id really how users would want us to sort this? (I suppose dba's don't like the idea of denormalizing page_title into the linter table just for the purpose of sorting? However, even just sorting by (namespace,linter_id) seems like it would be a more useful order)
  • LintErrorsPager - in getFieldNames() - TablePager seems to expect strings not Message Objects. Should explicitly call ->text().
  • The example url for list=linterrors uses the prefix le instead of the prefix lnt.
  • The action=record-lint has the field types mixed up with their defaults (i.e. If you don't specify a page, it defaults to the page named "string"). They should also probably be marked mandatory (This obviously doesn't really matter since its internal only).
  • action=record-lint has poor error handling (e.g. malformed json results in php fatal). Again, doesn't really matter since its internal only.
  • Given that adding a new category would need code changes anyways (for i18n messages at the very least), is there really much benefit to having the separate db table instead of hard-coding the category -> id mapping in php? Seems like having the db table adds a lot of code complexity without much benefit.
  • the lntcategory option of list=linterrors is broken because code expects a single value but its actually an array. Additionally if its an array the option name should be plural.
  • Special:LintErrors should implement getSubpagesForPrefixSearch().

Review of 83b1003e6ef of Linter. Note: I only looked at the php extension, I did not look at the restbase code. I assume the restbase code doesn't need to be reviewed.

Thanks! To clarify, the rest of the linter code is in Parsoid, and it will be Parsoid that talks to the MW API. That code is already technically deployed, its just disabled behind a config switch.

No security issues :D Everything looks good to me from a security perspective. However, when reviewing this I found some non-security bugs. I've also included some random thoughts here that may or may not make sense.
For the DOS potential you mentioned above - I think it'd be a good idea to start enabling this on smaller wikis before rolling out to enwiki, just to make sure it won't overload anything. However, I imagine you probably were going to do it that way anyways.

Yes :)

Miscellaneous non security comments

  • Depending on how fast restbase is, in theory there's a possible race condition if restbase inserts lint errors before slaves update the new revision id. This seems pretty unlikely, but perhaps the code should check the master db in the case where the specified revision id during action=record-lint is higher than getLastRevID().

T151280

  • If restbase was really slow, we'd have a lot of old linter records. Does it possibly make sense to include the provided revision in the linter table, and add page_latest = linter_rev as an inner join condition when showing things to the user, to make sure no old entries show up due to latency in rest base recording linter info? On the other hand, that would make for a lot more db write requests in the case where the error stays between edits, so its probably good the way it is for that reason.

Lint errors are based on the expanded wikitext, so it would need to be based on page_touched on something to account for template updates. In any case, right now I'm not too concerned about providing extremely up to date info, that can be improved later.

  • Should this use the WikiPageDeletionUpdates hook to remove Linter records when a page gets deleted?

Yes. T151281

  • How does the category map cache get cleared? Shouldn't there be at least some TTL on it (e.g. 24 hours)? Otherwise you won't be able to query based on new categories until map falls out of cache, which could be a very long time.

It shouldn't need to be cleared? If an id doesn't exist, it should automatically hit the db table and create if necessary. But see below on your suggestion of getting rid of it entirely.

  • Is linter_id really how users would want us to sort this? (I suppose dba's don't like the idea of denormalizing page_title into the linter table just for the purpose of sorting? However, even just sorting by (namespace,linter_id) seems like it would be a more useful order)

Good point, filed T151282.

  • LintErrorsPager - in getFieldNames() - TablePager seems to expect strings not Message Objects. Should explicitly call ->text().

T151283

  • The example url for list=linterrors uses the prefix le instead of the prefix lnt.

T151284

  • The action=record-lint has the field types mixed up with their defaults (i.e. If you don't specify a page, it defaults to the page named "string"). They should also probably be marked mandatory (This obviously doesn't really matter since its internal only).

T151285

  • action=record-lint has poor error handling (e.g. malformed json results in php fatal). Again, doesn't really matter since its internal only.

T151286

  • Given that adding a new category would need code changes anyways (for i18n messages at the very least), is there really much benefit to having the separate db table instead of hard-coding the category -> id mapping in php? Seems like having the db table adds a lot of code complexity without much benefit.

My initial idea was that we could add new lint types to Parsoid, have it start populating the database right away, and then add the i18n messages later on. But I think that's a pretty narrow edge case, and definitely isn't worth this code complexity. T151287

  • the lntcategory option of list=linterrors is broken because code expects a single value but its actually an array. Additionally if its an array the option name should be plural.

T151288

  • Special:LintErrors should implement getSubpagesForPrefixSearch().

T151289

Thank you for the in-depth code review :)

It shouldn't need to be cleared? If an id doesn't exist, it should automatically hit the db table and create if necessary. But see below on your suggestion of getting rid of it entirely.

Right. I think when I first read through it, I confused myself and thought getCategories() was coming from the cache, where its really coming from the config.

Legoktm closed this task as Resolved.Dec 1 2016, 9:06 AM
Legoktm claimed this task.
Legoktm reassigned this task from Legoktm to Bawolff.
Legoktm reopened this task as Open.Dec 7 2016, 6:57 PM

@Bawolff, per T148866#2853544 could you confirm that it's okay to replicate the linter database table to labs and make it publicly available?

Bawolff closed this task as Resolved.Dec 7 2016, 7:00 PM

I confirm that it is ok to publically replicate this table to labs.