Security Review of Revscoring
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• csteipp
	Aug 24 2015, 5:44 PM

Description

Overview

There is no textual user input. The only evidence that someone has used the system is that scores will be in the cache or extractions will in process within celery.

Within the context of a request, formatted http://ores.wmflabs.org/scores/enwiki/reverted/647810762/:

enwiki is the instance of Wikipedia being queried
reverted is te model being used)
647810762 is the rev_id

All models are versioned so that cache invalidation can be carried out when models are updated.

The revscore team has opted not to, themselves, build applications which consume output ofrom the revscore system. The impetus for moving the project into production is to make it available to product teams so that they can build consumer applications. @Legoktm has implemented one such application -- a MediaWiki extension which consumes data from ores and maintains a mirror of the ores cache.

Request Processing

Celery is checked to determine whether submitted revision is currently being evaluated. If not, celery is asked to extract features from data point and perform its machine learning data score.

Celery retrieves revision information using the public MediaWiki API.

ores.wmflabs.org.yml, "extractors" section lists API locations
ores is not logged in, and, therefore, cannot retrieve deleted revisions

Libraries and Dependencies

ores-wikimedia-config - WMF-specific configuration for the project
ores - REST web service which handles execution of models using revscoring
revscoring - generic library which uses models for revision scoring
python-mwapi - thin wrapper of MediaWiki API
deltas - generic library which implements diff algorithms based on recent research
yamlconf - generic library which implements yaml config file reading and propogates defaults

Subsystems

revscoring - contains command line utilities

[[https://pythonhosted.org/revscoring/revscoring.utilities.html?highlight=train_test#module-revscoring.utilities.extract_features|extract_features]] - extracts features from revisions, for submission to train_test
[[https://pythonhosted.org/revscoring/revscoring.utilities.html?highlight=train_test#module-revscoring.utilities.train_test|train_test]] - trains and tests models
- train_test writes out a model file in Python pickle format
- ores-wikimedia-config contains these models

See fabfile.py for information on deployment.

Infrastructure

ores-lb-02.ores.eqiad.wmflabs - load balancer (nginx)
ores-web-0[1-2] - web servers, running uwsgi on tcp/8080
ores-worker-0[1-4] - celery nodes; guessing redis runs here as well?
ores-staging-01 - runs all of the parts of larger; used for staging
ores-misc-01 - used for Debian packaging

Related Objects
Search...

Status	Assigned	Task
Resolved	Ladsgroup	T130210 [Epic] Deploy ORES extension as a Beta feature
Resolved	Halfak	T140002 [Epic] Deploy ORES review tool
Resolved	Ladsgroup	T130212 Deploy ORES review tool to wikidatawiki
Invalid	Halfak	T106398 Revscoring tasks from Wikimania discussions
Resolved	Halfak	T106867 [Epic] Deploy Revscoring/ORES service in Prod
Resolved	akosiaris	T117560 New Service Request: ORES
Resolved	Halfak	T110072 Security Review of Revscoring

Event Timeline

• csteipp created this task.Aug 24 2015, 5:44 PM

• csteipp claimed this task.

• csteipp raised the priority of this task from to High.

• csteipp updated the task description. (Show Details)

• csteipp added projects: deprecated-security-team-reviews, Security-Team.

• csteipp added subscribers: Joe, yuvipanda, • mobrovac and 2 others.

Krenair subscribed.Aug 24 2015, 5:47 PM

https://github.com/wiki-ai/ores and https://github.com/wiki-ai/revscoring (and associated libraries) - of which the following are built by us - https://github.com/halfak/mediawiki-utilities and https://github.com/halfak/Deltas and https://github.com/halfak/yamlconf.

We are planning to build and deploy everything except ores with debian packages. We can mirror those into gerrit if needed.

• csteipp moved this task from Incoming to Scheduled on the deprecated-security-team-reviews board.Aug 24 2015, 7:13 PM

• csteipp moved this task from Incoming to Ready on the Security-Team board.

And https://github.com/wiki-ai/ores-wikimedia-config is the config (equivalent of mediawiki-config) for ores.

Revscoring:

ORES (details):

https://etherpad.wikimedia.org/p/revscoring_overview

+1 to the diagrams. :)

The production MediaWiki API is going to be calling into labs? And ORES in production might come from github?

I think that the arrow comes from the API to ORES because that's the direction of data flow. Our model files and code are stored on github, so that means we're pulling data from there into the ORES system.

• dpatrick updated the task description. (Show Details)Oct 14 2015, 10:12 PM

• dpatrick set Security to None.

• dpatrick added a subscriber: Legoktm.

• csteipp moved this task from Ready to In Progress on the Security-Team board.Oct 29 2015, 9:28 PM

Here are notes from @csteipp and I for this review:

yamlconf

General Observations

Positive
- Code is simple, well-formatted, and well-documented
- Generally uniform in purpose
Neutral
- Why does yamlconf have the import_module function? This is not a security-related concern, it just seems to not really be within the purview of this module.

Files

./demo.py

OK

./yamlconf.py

OK

python-mwapi

General Observations

Positive
- Code is straightforward and uniform in purpose
Negative
- Though the underlying HTTP library ([[http://docs.python-requests.org/en/latest/user/advanced/#ssl-cert-verification|requests]]) in use verifies the SSL certificate chain by default, mwapi does not expose a facility for users of the library to specify their own set of CA certificates. This ability should be added prior to the library being considered production-ready for the broader community. Documentation for requests indicates that it bundles a set of CA certificates, based on those shipped with Mozilla products. However, that bundle is only updated when the library itself is updated. This makes it more important for users of mwapi to have control over specification of CA certificate bundle. The requests project recommends use of [[https://certifi.io/en/latest/|certifi]], a library which exists specifically for delivery of a trust database, and may be more frequently updated.

Files

./mwapi/__init__.py

OK

./mwapi/errors.py

OK

./mwapi/session.py

MEDIUM
- Line 73, mwapi does not allow for specification of a CA cert (or group of CA certs) against which the certificate chain of an SSL connection may be verified
- Line 73, in case of a change to the requests library defaults, verify=True should be passed to the request() method

./setup.py

OK

deltas

General Observations

Positive
- Code is well-documented and well-commented
Neutral
- Some inconsistencies in regular expressions, as noted below

Files

./deltas/__init__.py

OK

./deltas/algorithms/__init__.py

OK

./deltas/algorithms/diff_engine.py

OK

./deltas/algorithms/segment_matcher.py

OK

./deltas/algorithms/sequence_matcher.py

OK

./deltas/algorithms/tests/test_from_config.py

OK

./deltas/algorithms/tests/test_segment_matcher.py

OK

./deltas/algorithms/tests/test_sequence_matcher.py

OK

./deltas/apply.py

OK

./deltas/operations.py

OK

./deltas/segmenters/__init__.py

OK

./deltas/segmenters/functions.py

OK

./deltas/segmenters/paragraphs_sentences_and_whitespace.py

OK

./deltas/segmenters/segmenter.py

OK

./deltas/segmenters/segments.py

OK
(hashlib.sha1 used here)

./deltas/segmenters/tests/test_paragraphs_sentences_and_whitespace.py

OK

./deltas/segmenters/tests/test_segments.py

OK

./deltas/tests/diff_and_replay.py

OK

./deltas/tests/diff_sequence.py

OK

./deltas/tests/test_apply.py

OK

./deltas/tests/test_util.py

OK

./deltas/tokenizers/__init__.py

OK

./deltas/tokenizers/tests/test_text_split.py

OK

./deltas/tokenizers/tests/test_wikitext_split.py

OK

./deltas/tokenizers/text_split.py

NOTE
- Line 13, where "whitespace" is defined as r'[\n\r\s]+', `\n` and `\r` are not needed since they are encapsulated by `\s` alone; see https://docs.python.org/3/howto/regex.html

./deltas/tokenizers/token.py

OK

./deltas/tokenizers/tokenizer.py

OK

./deltas/tokenizers/wikitext_split.py

NOTE
- Line 80, same issue with "whitespace" as in text_split.py

./deltas/util.py

OK

./setup.py

OK

revscoring

General Observations

In general, this shouldn't be run suid, or handle untrusted data. Aaron said the plan was to keep running this in labs for now, so that should be fine. If it ever does need to run on untrusted data (i.e., if it's made into a service or something), additional input validation should be implemented.

Negative
- Pickled models could be pgp-signed/-encrypted to protect against modification in non-WMF storage (Github). TrustedPickle would have been suggested, but it appears to be incompatible with Python 3.)

Neutral
- There are several instances where there is no exception handling for operations such as open() in the revscoring/utilities namespace. We assume that it's desired for a failure in those operations to be made apparent to the operator in a raw form.

Files

./examples/extract_all.py

OK
- Uses old mw.api

./examples/extracter.py

OK
- Uses old mw.api

./examples/scorer.py

OK
- Uses old mw.api

./revscoring/__init__.py

OK

./revscoring/datasources/__init__.py

OK

./revscoring/datasources/datasource.py

OK

./revscoring/datasources/diff.py

OK

./revscoring/datasources/page_creation.py

OK

./revscoring/datasources/parent_revision.py

OK

./revscoring/datasources/previous_user_revision.py

OK

./revscoring/datasources/revision.py

OK

./revscoring/datasources/site.py

OK

./revscoring/datasources/types.py

NOTE
- Any increase in readability in the implemented classes would be beneficial

./revscoring/datasources/user.py

OK

./revscoring/datasources/util.py

OK

./revscoring/dependencies/__init__.py

OK

./revscoring/dependencies/context.py

OK

./revscoring/dependencies/dependent.py

OK

./revscoring/dependencies/functions.py

OK

./revscoring/errors.py

OK

./revscoring/extractors/__init__.py

OK

./revscoring/extractors/api.py

OK

./revscoring/extractors/extractor.py

OK

./revscoring/features/__init__.py

OK

./revscoring/features/diff.py

OK

./revscoring/features/feature.py

OK

./revscoring/features/modifiers.py

OK

./revscoring/features/page.py

OK

./revscoring/features/parent_revision.py

OK

./revscoring/features/previous_user_revision.py

OK

./revscoring/features/revision.py

OK

./revscoring/features/user.py

OK

./revscoring/features/util.py

OK

./revscoring/languages/__init__.py

OK

./revscoring/languages/english.py

OK

./revscoring/languages/french.py

OK

./revscoring/languages/hebrew.py

OK

./revscoring/languages/indonesian.py

OK

./revscoring/languages/language.py

OK

./revscoring/languages/meta/__init__.py

OK

./revscoring/languages/meta/contant.py

OK

./revscoring/languages/meta/infonoise.py

OK

./revscoring/languages/meta/regex_extractors.py

OK

./revscoring/languages/meta/token_filter.py

OK

./revscoring/languages/persian.py

OK

./revscoring/languages/portuguese.py

OK

./revscoring/languages/space_delimited/__init__.py

OK

./revscoring/languages/space_delimited/diff.py

OK

./revscoring/languages/space_delimited/parent_revision.py

OK

./revscoring/languages/space_delimited/revision.py

OK

./revscoring/languages/space_delimited/space_delimited.py

OK

./revscoring/languages/space_delimited/util.py

OK

./revscoring/languages/spanish.py

OK

./revscoring/languages/turkish.py

OK

./revscoring/languages/vietnamese.py

OK

./revscoring/revscoring.py

OK
(Main utility entry point)

./revscoring/scorer_models/__init__.py

OK

./revscoring/scorer_models/nb.py

OK

./revscoring/scorer_models/rf.py

OK

./revscoring/scorer_models/scorer_model.py

OK
(Uses io.StringIO)
(Uees pickle.load/pickle.dump)

./revscoring/scorer_models/svc.py

OK

./revscoring/scorer_models/util.py

OK

./revscoring/utilities/__init__.py

OK

./revscoring/utilities/extract_features.py

MEDIUM
- Line 54, should expose ability to specify CA cert bundle for verification of server SSL certs

./revscoring/utilities/model_info.py

OK

./revscoring/utilities/score.py

OK

./revscoring/utilities/train_test.py

OK

./revscoring/utilities/util.py

OK

./setup.py

OK

./utility

OK

ores

General Observations

Positive
- Code is well-documented and well-commented
Negative
- The file ores/utilities/dev_server.py should allow specification of the network interface on which the server will listen.
- Workers don't seem to be following recommendations of http://docs.celeryproject.org/en/latest/userguide/security.html (they're using pickle as the encoding, no signing). Those should probably run in firejail to compensate.
- Redis connections should use authentication by default (and in the wmf cluster)

Files

./celery-dev.service

OK

./config/ores-localdev.yaml

OK

./ores/__init__.py

OK

./ores/metrics_collectors/__init__.py

OK

./ores/metrics_collectors/logger.py

OK

./ores/metrics_collectors/metrics_collector.py

OK

./ores/metrics_collectors/null.py

OK

./ores/metrics_collectors/statsd.py

OK

./ores/ores.py

OK

./ores/score_caches/__init__.py

OK

./ores/score_caches/empty.py

OK

./ores/score_caches/lru.py

OK

./ores/score_caches/redis.py

OK

./ores/score_caches/score_cache.py

OK

./ores/score_processors/__init__.py

OK

./ores/score_processors/celery.py

OK

./ores/score_processors/score_processor.py

OK

./ores/score_processors/tests/__init__.py

OK

./ores/score_processors/tests/test_score_processor.py

OK

./ores/score_processors/tests/test_timeout.py

OK

./ores/score_processors/timeout.py

OK

./ores/scoring_contexts/__init__.py

OK

./ores/scoring_contexts/scoring_context.py

OK

./ores/util.py

OK

./ores/utilities/__init__.py

OK

./ores/utilities/celery_worker.py

OK

./ores/utilities/dev_server.py

NOTE
- Perhaps "host" should default to 127.0.0.1, and a command line parameter should be added allowing for specification of another host address to listen on. Additionally, documentation should be updated to not that 0.0.0.0 can be passed to have the server listen on all allocated interfaces.

./ores/utilities/label_reverted.py

OK

./ores/utilities/precached.py

MEDIUM
- If ORES will eventually use SSL in deployment, ensure that uses of `requests` verify server SSL certs against an appropriate CA bundle.
- Line 77, in the method run(), requests.get() is called with verify=False; this should not be run in production in this manner.
- Also in the method run(), there is no validation of the URL before passing it to requests.get(). For defense in depth, the parameters wiki, model, and rev_id should be URL encoded when constructing the url.

./ores/utilities/util.py

OK
- `import_from_path` appears to be copied from yamlconf; why?

./ores/wsgi/__init__.py

OK

./ores/wsgi/responses.py

OK

./ores/wsgi/routes/__init__.py

OK

./ores/wsgi/routes/scores.py

OK

./ores/wsgi/server.py

OK

./ores/wsgi/util.py

OK

./provision.bash

OK

./R/env.R

OK

./R/exploration.R

OK

./R/loader/enwiki_feature_reverts.R

OK

./R/util.R

OK

./setup.py

OK

./utility

OK

./uwsgi-dev.service

OK

Halfak added projects: Machine-Learning-Team (Active Tasks), Research.Nov 2 2015, 9:24 PM

Re. adding certs support to mwapi, see https://github.com/mediawiki-utilities/python-mwapi/pull/23

Re. ORES dev_server concerns -- do these matter? It is intended to be use as a development instance. It seems beneficial to me that the development server should listen on all interfaces. We would never use the dev_server in production.

Re. import_from_path in yamlconf, we often specify a few different types of classpaths (that are unambiguous but non-trivial to handle) in configuration files. This is a convenient place to have the utility method. It's not great, but it doesn't seem to be problematic from a security perspective.

I'll be digging into precached and serialization in ores use of celery next.

Halfak added a parent task: T117560: New Service Request: ORES.Nov 3 2015, 4:17 PM

• ggellerman reassigned this task from • csteipp to Halfak.Nov 5 2015, 11:38 PM

• ggellerman moved this task from Backlog to Radar on the Research board.

• ggellerman edited subscribers, added: • csteipp; removed: Halfak.

In T110072#1775701, @Halfak wrote:

Re. adding certs support to mwapi, see https://github.com/mediawiki-utilities/python-mwapi/pull/23

Re. ORES dev_server concerns -- do these matter? It is intended to be use as a development instance. It seems beneficial to me that the development server should listen on all interfaces. We would never use the dev_server in production.

If it is absolutely only ever going to be used as a development server, then it may be okay to have it listen on all interfaces by default. But I still think it makes sense to allow the listening interface(s) to be configurable. I can envision a case where a user would want to run the service on a machine (like mine) with five or six interfaces, some which are virtual for VMware or VirtualBox, and have the service listen on all but the actual Internet-connected interface.

Re. import_from_path in yamlconf, we often specify a few different types of classpaths (that are unambiguous but non-trivial to handle) in configuration files. This is a convenient place to have the utility method. It's not great, but it doesn't seem to be problematic from a security perspective.

I concur. It is not a security issue.

I've enabled verification in the precached daemon. See https://github.com/wiki-ai/ores/pull/105

So, I've figured out that we can't simply switch away from pickle within celery. See https://github.com/wiki-ai/revscoring/issues/212 I think that we can do a rewrite of a few key parts of the system to make them work with the JSON serializer, but that's a pretty substantial investment. I'd like to pursue the "run in firejail" proposal at least until we can address this.

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.Nov 20 2015, 6:29 PM

• csteipp moved this task from Scheduled to Waiting/Blocked on the deprecated-security-team-reviews board.Dec 2 2015, 6:20 PM

• csteipp moved this task from In Progress to Waiting on the Security-Team board.Dec 3 2015, 7:24 PM

• csteipp mentioned this in T120922: Security review of the ORES extension.Dec 30 2015, 10:01 PM

Halfak closed this task as Resolved.Jan 21 2016, 3:40 PM

• Phabricator_maintenance removed a subscriber: yuvipanda.Jun 7 2017, 6:47 PM

sbassett moved this task from Waiting to Our Part Is Done on the Security-Team board.Jun 11 2019, 6:03 PM

Restricted Application added a subscriber: jeblad. · View Herald TranscriptJun 11 2019, 6:03 PM

sbassett moved this task from Waiting/Blocked to Done on the deprecated-security-team-reviews board.Jul 9 2019, 8:25 PM

• chasemp removed a project: deprecated-security-team-reviews.Jan 8 2020, 4:14 PM