Security review for the ReadingLists extension
Closed, ResolvedPublic

Description

(add Security-Reviews once all patches are merged)

Project Information

Description of the tool/project

The ReadingLists extension provides an API to store and retrieve private lists of pages, e.g. for a "bookmark" or "read it later" feature. It's written with wikifarms in mind (a list can contain pages from multiple wikis) and aims to support both browsers and mobile devices which have their own permanent storage .
For more information see https://www.mediawiki.org/wiki/Reading/Reading_Lists and T164990: RfC: Reading List service.

Main security-relevant characteristics

  • API only, no user interface
  • all operations (except for the maintenance script for purging old data) including viewing are limited to the current user, enforced in the DAO class
  • opt-in/opt-out API endpoints; no data can be stored for the given user before opt-in, all data destroyed on opt-out

Description of how the tool will be used at WMF

The API will power a REST proxy (to be written; does not require review¹). Initially, the REST proxy will be used to back up the reading lists in the Android app (which already has such a feature, but with no portability). iOS and web will add support later on.
purge.php will be run periodically via cron to prune old deleted data.

Dependencies

None.

Has this project been reviewed before?

No.

Working test environment

Use the vagrant role as described on the extension page.

Post-deployment

Same as pre-deployment.

Notes

  • ¹ The REST proxy will be written as a simple patch to the existing RESTBase codebase (T168972), and won't do much beyond translating REST requests into api.php requests (and requests to the RESTBase summary endpoint, in one case), and composing the responses into a single JSON. It will rely on api.php for security, by forwarding cookies.
Tgr created this task.Aug 25 2017, 2:04 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 25 2017, 2:04 AM
Tgr updated the task description. (Show Details)
Tgr updated the task description. (Show Details)

@Bawolff Anything else you need from us here?

We are trying to get this deployed for end of quarter… will it be possible to review before then? Thanks for the help!

@dpatrick is this on your radar?

Please see my previous comment, just trying to get this in before end of quarter. Thanks!

Tgr updated the task description. (Show Details)Sep 19 2017, 8:44 PM

@dpatrick is this on your radar?

Please see my previous comment, just trying to get this in before end of quarter. Thanks!

Yes. My apologies for the delay. I've added this to our review schedule for next week. @Bawolff will be looking at it.

Bawolff added a subscriber: jcrespo.

Review of a1b8a6cc70e. (Just the ReadingList extension. I haven't reviewed anything RESTBase related or client side related)

Overall: Passes the security review. However I'm concerned about general scalability and also some of the db queries (Several of the queries do Using where; Using temporary; Using filesort). I get the impression that much of this code is assuming specific size limits on the lists, it would be nice if scaling assumptions were more specificly documented.

I would ask that the schema/queries be approved by a DBA prior to this extension being deployed.

Misc comments

None of this is security related, just my general opinion. For general comments you should feel free to ignore any that you disagree with. Anything related to db queries can be ignored if a DBA says its ok.

  • The reverse interwiki prefix thing sort of looks like it should actually be in core.
  • This seems to have many hardcoded assumptions about lists being small (e.g. Reording requiring to submit the order of all items on a list, with the api unworkable if lists have > 1000 entries. Some queries not having LIMIT clauses. Paging using OFFSET clauses instead of range conditions in the WHERE. Some parts not even having paging at all).
    • First of all, it seems kind of odd to design something with such limitations from the get-go. User demand for large things tend to increase over time. They will probably want large lists eventually, and it will be much harder to add support for that later (schema changes, breaking api changes) than it would be to do it from the beginning.
    • Secondly, there is actually no restrictions in place preventing lists from being arbitrarily large. If we're going to have assumptions lists are short, then that has to actually be enforced.
  • No validation of entries (illegal titles such as {{foo}} allowed)
  • No namespace support for titles. Is the assumption that people will only use main namespace pages (Seems like an odd assumption)? Is the assumption that namespaces will be stored in textual form (e.g. page title 'Project:Foo')? In which case, the ?generator=readinglistentries doesn't work properly. Additionally, not all titles can be stored in such a scheme, as there needs to be 255 bytes for the name part of the title plus the space for the namespace part.
  • No validation of project names. Seems like inserting an invalid project should be an error. Its also kind of odd to use domain names. Almost all other cross-wiki extensions use wiki-id's.
  • The interwiki lookup scheme is not robust against people running multiple wikis on a single domain (Not sure if that matters)
  • If there are no local interwiki prefixes, there are a bunch of php warnings generated on ?action=query&generator=readinglistentries
  • I'm getting some "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" warnings as well from the api query modules.
  • Has @jcrespo approved the SQL parts? Some of the queries used are Using where; Using temporary; Using filesort. Since readinglists are meant to be small, this is not as bad as it usually is, but nonetheless that seems unideal, and in the case I looked at, rather minor changes to the schema would fix the query.
  • The fact that ?action=query&meta=readinglists will potentially return a random order for the order field in the event that no specific order is set on list items (ordering by a null field) with no indication that the elements are unordered, is kind of odd. I would normally expect it to default to date added or something.
  • Doesn't properly truncate description, colour, etc fields before inserting into DB. We shouldn't rely on mysql's silent truncation field, and also its important to truncate to prevent dangling unicode characters. Instead of truncating, it may make sense to return an error in the API if field too long.
Tgr added a comment.Sep 25 2017, 10:46 PM

Thanks for the review!

I haven't reviewed anything RESTBase related or client side related

IMO the RESTBase part doesn't require a security review: it's basically just a web proxy that converts between slightly different URL styles.

I get the impression that much of this code is assuming specific size limits on the lists, it would be nice if scaling assumptions were more specificly documented.
(...)

  • This seems to have many hardcoded assumptions about lists being small (e.g. Reording requiring to submit the order of all items on a list, with the api unworkable if lists have > 1000 entries. Some queries not having LIMIT clauses. Paging using OFFSET clauses instead of range conditions in the WHERE. Some parts not even having paging at all).
    • First of all, it seems kind of odd to design something with such limitations from the get-go. User demand for large things tend to increase over time. They will probably want large lists eventually, and it will be much harder to add support for that later (schema changes, breaking api changes) than it would be to do it from the beginning.
    • Secondly, there is actually no restrictions in place preventing lists from being arbitrarily large. If we're going to have assumptions lists are short, then that has to actually be enforced.

Sorry about that; wanted to mention but then I forgot. The relevant task is T168984: Create and enforce a limit for number of Reading Lists and Reading Lists entries for each user. It will be done before the service is exposed to real users.

Internally, the order operations are the only ones with no paging; I should probably fix that. In the API, exposing paging for those would require support for the kind of two-level continuation that you get when combining list/generator and prop modules; I would rather not go there if I don't have to.

I'll fix the offset-based continuation later (T171913: Fix technical debt in ReadingLists extension) but sorting of a list that's too long to handle in one step would be troublesome.

  • No validation of entries (illegal titles such as {{foo}} allowed)

Very theoretically, different projects could have different rules for illegal titles (e.g. by customizing $wgLegalTitleChars) so it's hard to check due to the cross-wiki nature of the project.
In practice the security benefits probably make it worth to ignore that corner case. I'll add a check.

  • No namespace support for titles. Is the assumption that people will only use main namespace pages (Seems like an odd assumption)? Is the assumption that namespaces will be stored in textual form (e.g. page title 'Project:Foo')? In which case, the ?generator=readinglistentries doesn't work properly. Additionally, not all titles can be stored in such a scheme, as there needs to be 255 bytes for the name part of the title plus the space for the namespace part.

I assumed textual form. Apps should be able to generate API requests from the list data they get, and that's difficult with namespace numbers. And the extension cannot turn numbers into namespaces on the fly since it would have to know the namespace config of all the wikis for that. I think I tested the generator with non-mainspace titles and it worked fine but I'll recheck.
Good point about the field length.

  • No validation of project names. Seems like inserting an invalid project should be an error. Its also kind of odd to use domain names. Almost all other cross-wiki extensions use wiki-id's.

Not sure if there is a way to tell what's valid in general. I guess a configurable callback could work.
The domain name is used by the client to construct API calls; clients are generally unable to translate wiki IDs to domain names. Internally, the projects will eventually have their own lookup table to preserve database space; wiki IDs could easily be added through that if there is a use case. (Maybe that should be turned into a more generic have-info-about-all-sites-of-a-wikifarm-in-the-DB thing and moved to core?)

  • The interwiki lookup scheme is not robust against people running multiple wikis on a single domain (Not sure if that matters)

Seemed to much effort to do that robustly when chances are no one will ever use it. The lookup class was made a service so that it is easy to replace with an implementation specific to a local wikifarm.

  • If there are no local interwiki prefixes, there are a bunch of php warnings generated on ?action=query&generator=readinglistentries

Will fix.

  • I'm getting some "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" warnings as well from the api query modules.

That means unclosed connections, but the code does do any kind of low-level DB interaction, so not sure what to do about that...

I would ask that the schema/queries be approved by a DBA prior to this extension being deployed.
(...)

Conditionally. See https://gerrit.wikimedia.org/r/#/c/365987/ and the first half of the discussion on T164990: RfC: Reading List service.
Some testing/improvement has been pushed for later, per T174651: Beta testing of the ReadingLists extension and T171913: Fix technical debt in ReadingLists extension.

Some of the queries used are Using where; Using temporary; Using filesort. Since readinglists are meant to be small, this is not as bad as it usually is, but nonetheless that seems unideal, and in the case I looked at, rather minor changes to the schema would fix the query.

With the sortkeys being in a separate table, I don't think there is a way to avoid filesorts. Due to the (to be implemented) limits, they should happen on fairly small result sets as you say.

  • The fact that ?action=query&meta=readinglists will potentially return a random order for the order field in the event that no specific order is set on list items (ordering by a null field) with no indication that the elements are unordered, is kind of odd. I would normally expect it to default to date added or something.

Yeah, that would be very confusing. I should probably just make sure the order is set on insert, but having a fallback sortfield certainly wouldn't hurt.

  • Doesn't properly truncate description, colour, etc fields before inserting into DB. We shouldn't rely on mysql's silent truncation field, and also its important to truncate to prevent dangling unicode characters. Instead of truncating, it may make sense to return an error in the API if field too long.

Good point, I'll add parameter validation for those.

IMO the RESTBase part doesn't require a security review: it's basically just a web proxy that converts between slightly different URL styles.

That's fair, I just wanted to be clear on what parts I reviewed and what parts I didn't.

Very theoretically, different projects could have different rules for illegal titles (e.g. by customizing $wgLegalTitleChars) so it's hard to check due to the cross-wiki nature of the project.
In practice the security benefits probably make it worth to ignore that corner case. I'll add a check.

It does get complicated though if you do normalization. E.g. Wiktionary with allowing lowercase first letters. In practise it doesn't matter too much if they are not validated as long as you don't directly use Title::makeTitle() anywhere, but it would be nice if possible. (At the moment the code does use Title::makeTitle in the generator. It should be switched to Title::makeTitleSafe() if the db entries aren't validated & normalized before insertion).

Not sure if there is a way to tell what's valid in general. I guess a configurable callback could work.
The domain name is used by the client to construct API calls; clients are generally unable to translate wiki IDs to domain names. Internally, the projects will eventually have their own lookup table to preserve database space; wiki IDs could easily be added through that if there is a use case. (Maybe that should be turned into a more generic have-info-about-all-sites-of-a-wikifarm-in-the-DB thing and moved to core?)

I was thinking do the ReverseInterwikiLookup thing, and reject the project if it can't find the relavent interwiki code. Alternatively, could look at the things in WikiMap::getCanonicalServerInfoForAllWikis()

I'm getting some "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" warnings as well from the api query modules.

That means unclosed connections, but the code does do any kind of low-level DB interaction, so not sure what to do about that...

I think maybe its because $this->dbw = wfGetDB( DB_MASTER ); is called even if its unused. I'm unsure if that opens an actual connection to the database or if the connection is only open once you actual use the db reference.

I would ask that the schema/queries be approved by a DBA prior to this extension being deployed.

Has @jcrespo approved the SQL parts?

Conditionally. See https://gerrit.wikimedia.org/r/#/c/365987/ and the first half of the discussion on T164990: RfC: Reading List service.
Some testing/improvement has been pushed for later, per T174651: Beta testing of the ReadingLists extension and T171913: Fix technical debt in ReadingLists extension.

Some of the queries used are Using where; Using temporary; Using filesort. Since readinglists are meant to be small, this is not as bad as it usually is, but nonetheless that seems unideal, and in the case I looked at, rather minor changes to the schema would fix the query.

With the sortkeys being in a separate table, I don't think there is a way to avoid filesorts. Due to the (to be implemented) limits, they should happen on fairly small result sets as you say.

TBH, I'm not sure i understand the benefits of having a separate table here. I guess its to use a single INSERT instead of many UPDATE queries, but I'm not sure if avoiding doing many UPDATE queries really matters much. In any case, I'm ok with this provided that DBAs are ok with this, but I'd like them to explicitly say they are ok with the extension doing queries involving (small) filesorts and temporary tables, as that's usually frowned upon afaik.

> This seems to have many hardcoded assumptions about lists being small (e.g. Reording requiring to submit the order of all items on a list, with the api unworkable if lists have > 1000 entries. Some queries not having LIMIT clauses. Paging using OFFSET clauses instead of range conditions in the WHERE. Some parts not even having paging at all).

I specifically said I will block any deployment to production if lists get larger than 100 items/rows as there was not allocated hardware for this functionality. I still stand by that statement, no matter if it has been ignored by the development team or not.

> This seems to have many hardcoded assumptions about lists being small (e.g. Reording requiring to submit the order of all items on a list, with the api unworkable if lists have > 1000 entries. Some queries not having LIMIT clauses. Paging using OFFSET clauses instead of range conditions in the WHERE. Some parts not even having paging at all).

I specifically said I will block any deployment to production if lists get larger than 100 items/rows as there was not allocated hardware for this functionality. I still stand by that statement, no matter if it has been ignored by the development team or not.

I mean specificly, are you ok with the extension doing queries like:

[wiki]> explain SELECT rle_id,rle_rl_id,rle_project,rle_title,rle_date_created,rle_date_updated,rle_deleted FROM reading_list_entry LEFT JOIN reading_list_entry_sortkey ON ((rle_id = rles_rle_id)) WHERE rle_rl_id = '1' AND rle_user_id = '1' AND rle_deleted = '0' ORDER BY rles_rle_id,rles_index LIMIT 11 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: reading_list_entry
         type: ref
possible_keys: rle_list_project_title,rle_list_updated,rle_user_project_title
          key: rle_list_project_title
      key_len: 4
          ref: const
         rows: 1
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: reading_list_entry_sortkey
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: wiki.reading_list_entry.rle_id
         rows: 1
        Extra: 
2 rows in set (0.00 sec)

I was under the impression that new code was not allowed to do queries that did Using temporary unless they had a very good reason, but perhaps I am mistaken.

I specifically said I will block any deployment to production if lists get larger than 100 items/rows as there was not allocated hardware for this functionality. I still stand by that statement, no matter if it has been ignored by the development team or not.

Just an FYI: there is a ticket specifically for this limit here T168984: Create and enforce a limit for number of Reading Lists and Reading Lists entries for each user

@jcrespo I don't believe we ignored your concerns, but rather looked into usage numbers of the Reading List feature on Android. See here: https://docs.google.com/document/d/1dZcpGWi8cuuMmXduANYt37pZq6B8Z66_x9dMBNADTFU/edit#heading=h.g882eefm3z9b

To pull out the relevant numbers:

Reading Lists are used by approximately 10.5% of users (730K / 7M active users). 

Median pages per list: 3 
Average pages per list: 17 

Approximate total number of lists: 1,059,354
Approximate number of lists per user: 1.45

Approximately 77% of users who use Reading Lists have one list. Most users have only a single digit number of entries in a list.

Additionally, we have found that there are 2 types of users (and the limits are designed around these use cases):

  1. Users have a low number of items per lists AND have a low number of lists
  2. Some users have one large list (definitely more than 100)

While deciding whether to order new hardware we also thought about the roll out strategy:

  1. Beta Android Q2
  2. Production Android Q2
  3. Beta iOS Q2
  4. Production iOS Q3
  5. Web FY2018-19 Q1

The total MAUs on Android and iOS is around 10M. So this is rather small user-wise. And Web will not be until next FY… and we can get new hardware then.

Given this information, it didn't seem like limits of 100/1000 would be that burdensome to the existing hardware. Is there something you are looking at that is concerning about 1000 vs 100?

I would like to get a handle on real practical limits on what existing hardware can handle and what the effects of 100 vs 1000 would be. Would it be helpful if we measured server load using the Android beta process?

Can we log reads and writes of small vs large lists while it is in beta on Android? We would have a lot of control there and may be able to answer some questions you have about performance. If we can find the bottle necks, we are more than willing to find workable solutions.

Thanks for reading

@jcrespo Hey I wanted to follow up here and make sure you saw my comment… please let me know your thoughts

Not sure what is the question, I already commented my thoughts on the other related ticket.

Tgr moved this task from Backlog to Coming soon on the Reading List Service board.Dec 6 2017, 12:40 AM
Tgr added a comment.Dec 6 2017, 1:30 AM

@Bawolff: in case you do want to look at the RESTBase part after all, it is in the reading_lists_beta branch, with 9ef2b5fd and 32131948 being the main change. I think there are only two aspects which have relevance for security: caching headers (since all data is private) and cookie forwarding (cf. T179553: Cookies should not be forwarded to different domains; the enumeration of known domains in RESTBase config, the network restrictions of the scb cluster, and the static list of projects introduced in a02d6eba are three independent guarantees that the cookies do not get sent to an insecure location).

All the non-security issues mentioned here have been fixed:

  • This seems to have many hardcoded assumptions about lists being small (e.g. Reording requiring to submit the order of all items on a list, with the api unworkable if lists have > 1000 entries. Some queries not having LIMIT clauses. Paging using OFFSET clauses instead of range conditions in the WHERE. Some parts not even having paging at all).
    • First of all, it seems kind of odd to design something with such limitations from the get-go. User demand for large things tend to increase over time. They will probably want large lists eventually, and it will be much harder to add support for that later (schema changes, breaking api changes) than it would be to do it from the beginning.

Post T180402: Remove custom ordering from ReadingLists extension allowing larger lists should not cause any performance problems. (The only operation the complexity of which still scales with limit size is counting the number of lists/items to enforce the limits. That's unavoidable and shouldn't be too bad.)

  • Secondly, there is actually no restrictions in place preventing lists from being arbitrarily large. If we're going to have assumptions lists are short, then that has to actually be enforced.

T168984: Create and enforce a limit for number of Reading Lists and Reading Lists entries for each user has been implemented.

  • No validation of entries (illegal titles such as {{foo}} allowed)

Fixed in https://gerrit.wikimedia.org/r/#/c/390134/.

  • No namespace support for titles. Is the assumption that people will only use main namespace pages (Seems like an odd assumption)? Is the assumption that namespaces will be stored in textual form (e.g. page title 'Project:Foo')? In which case, the ?generator=readinglistentries doesn't work properly. Additionally, not all titles can be stored in such a scheme, as there needs to be 255 bytes for the name part of the title plus the space for the namespace part.

Generator support is fixed in https://gerrit.wikimedia.org/r/#/c/390194/.
Field length is increased in https://gerrit.wikimedia.org/r/#/c/395673/.

  • No validation of project names. Seems like inserting an invalid project should be an error. Its also kind of odd to use domain names. Almost all other cross-wiki extensions use wiki-id's.

Fixed in https://gerrit.wikimedia.org/r/#/c/390134/. Apps need the domain names to know where to send API requests, while the wiki ID is not really useful outside of MediaWiki.

  • If there are no local interwiki prefixes, there are a bunch of php warnings generated on ?action=query&generator=readinglistentries

Fixed in https://gerrit.wikimedia.org/r/#/c/390135/.

  • The fact that ?action=query&meta=readinglists will potentially return a random order for the order field in the event that no specific order is set on list items (ordering by a null field) with no indication that the elements are unordered, is kind of odd. I would normally expect it to default to date added or something.

https://gerrit.wikimedia.org/r/#/c/393925/ adds the primary key as a tiebreaker to every non-unique sortkey.

  • Doesn't properly truncate description, colour, etc fields before inserting into DB. We shouldn't rely on mysql's silent truncation field, and also its important to truncate to prevent dangling unicode characters. Instead of truncating, it may make sense to return an error in the API if field too long.

Fixed in https://gerrit.wikimedia.org/r/#/c/390136/ / https://gerrit.wikimedia.org/r/#/c/390940/.

I think maybe its because $this->dbw = wfGetDB( DB_MASTER ); is called even if its unused.

Fixed in https://gerrit.wikimedia.org/r/#/c/395686/

Some of the queries used are Using where; Using temporary; Using filesort. Since readinglists are meant to be small, this is not as bad as it usually is, but nonetheless that seems unideal, and in the case I looked at, rather minor changes to the schema would fix the query.

Well, it's either that or use lots of indexes. So I switched to the latter. There are still some filesorts but I don't think they can be avoided and they are on small resultsets. T182053 and the commit message for ad1edc63 has the details.

Tgr closed this task as Resolved.Tue, Jan 2, 7:53 PM
Tgr claimed this task.

I'll close this for housekeeping; the actual security review happened in T174126#3633349, and as far as I can see all the non-security concerns have been addressed as well. Please reopen if you think something was left out.