
Look at overhead of json codec and data passing to feature extraction workers
Closed, Declined · Public

Description

The ORES uWSGI service is responsible for making I/O requests to the MediaWiki API, and passes the resulting data to Celery workers for feature calculation and scoring. To pass the data, it is serialized as JSON and sent over the wire. I believe this data can be several megabytes, and we should investigate how long it takes to serialize and transmit to workers. Communication costs are often a bottleneck in parallel computing.
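As a starting point for the investigation, the round-trip cost of the JSON codec alone can be measured in isolation. This is a minimal sketch, not ORES code: the payload shape and sizes here are assumptions meant to roughly mimic multi-megabyte MediaWiki API responses.

```python
import json
import time

# Hypothetical payload of a few megabytes, standing in for the
# revision data ORES would pass from uWSGI to Celery workers.
payload = {"revisions": [{"rev_id": i, "text": "word " * 5000}
                         for i in range(100)]}

start = time.perf_counter()
encoded = json.dumps(payload)
encode_s = time.perf_counter() - start

start = time.perf_counter()
json.loads(encoded)
decode_s = time.perf_counter() - start

size_mb = len(encoded) / 1e6
print(f"{size_mb:.1f} MB: encode {encode_s * 1000:.2f} ms, "
      f"decode {decode_s * 1000:.2f} ms")
```

Serialization time alone does not capture the transmit cost through the broker, but it gives a lower bound on the per-request codec overhead.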

If this turns out to be a significant bottleneck, there may be alternatives such as shared-memory IPC or making Celery pools local to each server.

Event Timeline

@awight Can you elaborate more? Also is this an investigation or actually doing it?

I was imagining we would investigate mostly by instrumenting with timing metrics. Sorry for the null description, let me fix that now…