Page MenuHomePhabricator

Security Review of Revscoring
Closed, ResolvedPublic

Description

Overview

There is no textual user input. The only evidence that someone has used the system is that scores will be in the cache or extractions will in process within celery.

Within the context of a request, formatted http://ores.wmflabs.org/scores/enwiki/reverted/647810762/:

  • enwiki is the instance of Wikipedia being queried
  • reverted is te model being used)
  • 647810762 is the rev_id

All models are versioned so that cache invalidation can be carried out when models are updated.

The revscore team has opted not to, themselves, build applications which consume output ofrom the revscore system. The impetus for moving the project into production is to make it available to product teams so that they can build consumer applications. @Legoktm has implemented one such application -- a MediaWiki extension which consumes data from ores and maintains a mirror of the ores cache.

Request Processing

Celery is checked to determine whether submitted revision is currently being evaluated. If not, celery is asked to extract features from data point and perform its machine learning data score.

Celery retrieves revision information using the public MediaWiki API.

  • ores.wmflabs.org.yml, "extractors" section lists API locations
  • ores is not logged in, and, therefore, cannot retrieve deleted revisions
Libraries and Dependencies
  • ores-wikimedia-config - WMF-specific configuration for the project
  • ores - REST web service which handles execution of models using revscoring
  • revscoring - generic library which uses models for revision scoring
  • python-mwapi - thin wrapper of MediaWiki API
  • deltas - generic library which implements diff algorithms based on recent research
  • yamlconf - generic library which implements yaml config file reading and propogates defaults

Subsystems

revscoring - contains command line utilities

See fabfile.py for information on deployment.

Infrastructure

ores-lb-02.ores.eqiad.wmflabs - load balancer (nginx)
ores-web-0[1-2] - web servers, running uwsgi on tcp/8080
ores-worker-0[1-4] - celery nodes; guessing redis runs here as well?
ores-staging-01 - runs all of the parts of larger; used for staging
ores-misc-01 - used for Debian packaging

Event Timeline

csteipp claimed this task.
csteipp raised the priority of this task from to High.
csteipp updated the task description. (Show Details)
csteipp added subscribers: Joe, yuvipanda, mobrovac and 2 others.

https://github.com/wiki-ai/ores and https://github.com/wiki-ai/revscoring (and associated libraries) - of which the following are built by us - https://github.com/halfak/mediawiki-utilities and https://github.com/halfak/Deltas and https://github.com/halfak/yamlconf.

We are planning to build and deploy everything except ores with debian packages. We can mirror those into gerrit if needed.

The production MediaWiki API is going to be calling into labs? And ORES in production might come from github?

I think that the arrow comes from the API to ORES because that's the direction of data flow. Our model files and code are stored on github, so that means we're pulling data from there into the ORES system.

dpatrick set Security to None.
dpatrick added a subscriber: Legoktm.

Here are notes from @csteipp and I for this review:

yamlconf
General Observations
  • Positive
    • Code is simple, well-formatted, and well-documented
    • Generally uniform in purpose
  • Neutral
    • Why does yamlconf have the import_module function? This is not a security-related concern, it just seems to not really be within the purview of this module.
Files

./demo.py

OK

./yamlconf.py

OK
python-mwapi
General Observations
  • Positive
    • Code is straightforward and uniform in purpose
  • Negative
    • Though the underlying HTTP library ([[http://docs.python-requests.org/en/latest/user/advanced/#ssl-cert-verification|requests]]) in use verifies the SSL certificate chain by default, mwapi does not expose a facility for users of the library to specify their own set of CA certificates. This ability should be added prior to the library being considered production-ready for the broader community. Documentation for requests indicates that it bundles a set of CA certificates, based on those shipped with Mozilla products. However, that bundle is only updated when the library itself is updated. This makes it more important for users of mwapi to have control over specification of CA certificate bundle. The requests project recommends use of [[https://certifi.io/en/latest/|certifi]], a library which exists specifically for delivery of a trust database, and may be more frequently updated.
Files

./mwapi/__init__.py

OK

./mwapi/errors.py

OK

./mwapi/session.py

MEDIUM
- Line 73, mwapi does not allow for specification of a CA cert (or group of CA certs) against which the certificate chain of an SSL connection may be verified
- Line 73, in case of a change to the requests library defaults, verify=True should be passed to the request() method

./setup.py

OK
deltas
General Observations
  • Positive
    • Code is well-documented and well-commented
  • Neutral
    • Some inconsistencies in regular expressions, as noted below
Files

./deltas/__init__.py

OK

./deltas/algorithms/__init__.py

OK

./deltas/algorithms/diff_engine.py

OK

./deltas/algorithms/segment_matcher.py

OK

./deltas/algorithms/sequence_matcher.py

OK

./deltas/algorithms/tests/test_from_config.py

OK

./deltas/algorithms/tests/test_segment_matcher.py

OK

./deltas/algorithms/tests/test_sequence_matcher.py

OK

./deltas/apply.py

OK

./deltas/operations.py

OK

./deltas/segmenters/__init__.py

OK

./deltas/segmenters/functions.py

OK

./deltas/segmenters/paragraphs_sentences_and_whitespace.py

OK

./deltas/segmenters/segmenter.py

OK

./deltas/segmenters/segments.py

OK
(hashlib.sha1 used here)

./deltas/segmenters/tests/test_paragraphs_sentences_and_whitespace.py

OK

./deltas/segmenters/tests/test_segments.py

OK

./deltas/tests/diff_and_replay.py

OK

./deltas/tests/diff_sequence.py

OK

./deltas/tests/test_apply.py

OK

./deltas/tests/test_util.py

OK

./deltas/tokenizers/__init__.py

OK

./deltas/tokenizers/tests/test_text_split.py

OK

./deltas/tokenizers/tests/test_wikitext_split.py

OK

./deltas/tokenizers/text_split.py

NOTE
- Line 13, where "whitespace" is defined as r'[\n\r\s]+', `\n` and `\r` are not needed since they are encapsulated by `\s` alone; see https://docs.python.org/3/howto/regex.html

./deltas/tokenizers/token.py

OK

./deltas/tokenizers/tokenizer.py

OK

./deltas/tokenizers/wikitext_split.py

NOTE
- Line 80, same issue with "whitespace" as in text_split.py

./deltas/util.py

OK

./setup.py

OK
revscoring
General Observations

In general, this shouldn't be run suid, or handle untrusted data. Aaron said the plan was to keep running this in labs for now, so that should be fine. If it ever does need to run on untrusted data (i.e., if it's made into a service or something), additional input validation should be implemented.

  • Negative
    • Pickled models could be pgp-signed/-encrypted to protect against modification in non-WMF storage (Github). TrustedPickle would have been suggested, but it appears to be incompatible with Python 3.)
  • Neutral
    • There are several instances where there is no exception handling for operations such as open() in the revscoring/utilities namespace. We assume that it's desired for a failure in those operations to be made apparent to the operator in a raw form.
Files

./examples/extract_all.py

OK
- Uses old mw.api

./examples/extracter.py

OK
- Uses old mw.api

./examples/scorer.py

OK
- Uses old mw.api

./revscoring/__init__.py

OK

./revscoring/datasources/__init__.py

OK

./revscoring/datasources/datasource.py

OK

./revscoring/datasources/diff.py

OK

./revscoring/datasources/page_creation.py

OK

./revscoring/datasources/parent_revision.py

OK

./revscoring/datasources/previous_user_revision.py

OK

./revscoring/datasources/revision.py

OK

./revscoring/datasources/site.py

OK

./revscoring/datasources/types.py

NOTE
- Any increase in readability in the implemented classes would be beneficial

./revscoring/datasources/user.py

OK

./revscoring/datasources/util.py

OK

./revscoring/dependencies/__init__.py

OK

./revscoring/dependencies/context.py

OK

./revscoring/dependencies/dependent.py

OK

./revscoring/dependencies/functions.py

OK

./revscoring/errors.py

OK

./revscoring/extractors/__init__.py

OK

./revscoring/extractors/api.py

OK

./revscoring/extractors/extractor.py

OK

./revscoring/features/__init__.py

OK

./revscoring/features/diff.py

OK

./revscoring/features/feature.py

OK

./revscoring/features/modifiers.py

OK

./revscoring/features/page.py

OK

./revscoring/features/parent_revision.py

OK

./revscoring/features/previous_user_revision.py

OK

./revscoring/features/revision.py

OK

./revscoring/features/user.py

OK

./revscoring/features/util.py

OK

./revscoring/languages/__init__.py

OK

./revscoring/languages/english.py

OK

./revscoring/languages/french.py

OK

./revscoring/languages/hebrew.py

OK

./revscoring/languages/indonesian.py

OK

./revscoring/languages/language.py

OK

./revscoring/languages/meta/__init__.py

OK

./revscoring/languages/meta/contant.py

OK

./revscoring/languages/meta/infonoise.py

OK

./revscoring/languages/meta/regex_extractors.py

OK

./revscoring/languages/meta/token_filter.py

OK

./revscoring/languages/persian.py

OK

./revscoring/languages/portuguese.py

OK

./revscoring/languages/space_delimited/__init__.py

OK

./revscoring/languages/space_delimited/diff.py

OK

./revscoring/languages/space_delimited/parent_revision.py

OK

./revscoring/languages/space_delimited/revision.py

OK

./revscoring/languages/space_delimited/space_delimited.py

OK

./revscoring/languages/space_delimited/util.py

OK

./revscoring/languages/spanish.py

OK

./revscoring/languages/turkish.py

OK

./revscoring/languages/vietnamese.py

OK

./revscoring/revscoring.py

OK
(Main utility entry point)

./revscoring/scorer_models/__init__.py

OK

./revscoring/scorer_models/nb.py

OK

./revscoring/scorer_models/rf.py

OK

./revscoring/scorer_models/scorer_model.py

OK
(Uses io.StringIO)
(Uees pickle.load/pickle.dump)

./revscoring/scorer_models/svc.py

OK

./revscoring/scorer_models/util.py

OK

./revscoring/utilities/__init__.py

OK

./revscoring/utilities/extract_features.py

MEDIUM
- Line 54, should expose ability to specify CA cert bundle for verification of server SSL certs

./revscoring/utilities/model_info.py

OK

./revscoring/utilities/score.py

OK

./revscoring/utilities/train_test.py

OK

./revscoring/utilities/util.py

OK

./setup.py

OK

./utility

OK
ores
General Observations
  • Positive
    • Code is well-documented and well-commented
  • Negative
    • The file ores/utilities/dev_server.py should allow specification of the network interface on which the server will listen.
    • Workers don't seem to be following recommendations of http://docs.celeryproject.org/en/latest/userguide/security.html (they're using pickle as the encoding, no signing). Those should probably run in firejail to compensate.
    • Redis connections should use authentication by default (and in the wmf cluster)
Files

./celery-dev.service

OK

./config/ores-localdev.yaml

OK

./ores/__init__.py

OK

./ores/metrics_collectors/__init__.py

OK

./ores/metrics_collectors/logger.py

OK

./ores/metrics_collectors/metrics_collector.py

OK

./ores/metrics_collectors/null.py

OK

./ores/metrics_collectors/statsd.py

OK

./ores/ores.py

OK

./ores/score_caches/__init__.py

OK

./ores/score_caches/empty.py

OK

./ores/score_caches/lru.py

OK

./ores/score_caches/redis.py

OK

./ores/score_caches/score_cache.py

OK

./ores/score_processors/__init__.py

OK

./ores/score_processors/celery.py

OK

./ores/score_processors/score_processor.py

OK

./ores/score_processors/tests/__init__.py

OK

./ores/score_processors/tests/test_score_processor.py

OK

./ores/score_processors/tests/test_timeout.py

OK

./ores/score_processors/timeout.py

OK

./ores/scoring_contexts/__init__.py

OK

./ores/scoring_contexts/scoring_context.py

OK

./ores/util.py

OK

./ores/utilities/__init__.py

OK

./ores/utilities/celery_worker.py

OK

./ores/utilities/dev_server.py

NOTE
- Perhaps "host" should default to 127.0.0.1, and a command line parameter should be added allowing for specification of another host address to listen on. Additionally, documentation should be updated to not that 0.0.0.0 can be passed to have the server listen on all allocated interfaces.

./ores/utilities/label_reverted.py

OK

./ores/utilities/precached.py

MEDIUM
- If ORES will eventually use SSL in deployment, ensure that uses of `requests` verify server SSL certs against an appropriate CA bundle.
- Line 77, in the method run(), requests.get() is called with verify=False; this should not be run in production in this manner.
- Also in the method run(), there is no validation of the URL before passing it to requests.get(). For defense in depth, the parameters wiki, model, and rev_id should be URL encoded when constructing the url.

./ores/utilities/util.py

OK
- `import_from_path` appears to be copied from yamlconf; why?

./ores/wsgi/__init__.py

OK

./ores/wsgi/responses.py

OK

./ores/wsgi/routes/__init__.py

OK

./ores/wsgi/routes/scores.py

OK

./ores/wsgi/server.py

OK

./ores/wsgi/util.py

OK

./provision.bash

OK

./R/env.R

OK

./R/exploration.R

OK

./R/loader/enwiki_feature_reverts.R

OK

./R/util.R

OK

./setup.py

OK

./utility

OK

./uwsgi-dev.service

OK

Re. adding certs support to mwapi, see https://github.com/mediawiki-utilities/python-mwapi/pull/23

Re. ORES dev_server concerns -- do these matter? It is intended to be use as a development instance. It seems beneficial to me that the development server should listen on all interfaces. We would never use the dev_server in production.

Re. import_from_path in yamlconf, we often specify a few different types of classpaths (that are unambiguous but non-trivial to handle) in configuration files. This is a convenient place to have the utility method. It's not great, but it doesn't seem to be problematic from a security perspective.

I'll be digging into precached and serialization in ores use of celery next.

ggellerman moved this task from Backlog to Radar on the Research board.
ggellerman edited subscribers, added: csteipp; removed: Halfak.

Re. adding certs support to mwapi, see https://github.com/mediawiki-utilities/python-mwapi/pull/23

Re. ORES dev_server concerns -- do these matter? It is intended to be use as a development instance. It seems beneficial to me that the development server should listen on all interfaces. We would never use the dev_server in production.

If it is absolutely only ever going to be used as a development server, then it may be okay to have it listen on all interfaces by default. But I still think it makes sense to allow the listening interface(s) to be configurable. I can envision a case where a user would want to run the service on a machine (like mine) with five or six interfaces, some which are virtual for VMware or VirtualBox, and have the service listen on all but the actual Internet-connected interface.

Re. import_from_path in yamlconf, we often specify a few different types of classpaths (that are unambiguous but non-trivial to handle) in configuration files. This is a convenient place to have the utility method. It's not great, but it doesn't seem to be problematic from a security perspective.

I concur. It is not a security issue.

I've enabled verification in the precached daemon. See https://github.com/wiki-ai/ores/pull/105

So, I've figured out that we can't simply switch away from pickle within celery. See https://github.com/wiki-ai/revscoring/issues/212 I think that we can do a rewrite of a few key parts of the system to make them work with the JSON serializer, but that's a pretty substantial investment. I'd like to pursue the "run in firejail" proposal at least until we can address this.