Page MenuHomePhabricator

tools.wikihistory seems to be wasting lots of unnecessary cpu resources on toolsdb
Closed, ResolvedPublic

Description

s51512 (tools.wikihistory) is the top CPU user of toolsdb:

root@labsdb1005[information_schema]> SELECT * FROM USER_STATISTICS ORDER BY cpu_time DESC LIMIT 1\G
*************************** 1. row ***************************
                  USER: s51512
     TOTAL_CONNECTIONS: 2634803
CONCURRENT_CONNECTIONS: 0
        CONNECTED_TIME: 70378868
             BUSY_TIME: 27301657.775641054
              CPU_TIME: 22565123.800680105
        BYTES_RECEIVED: 9666784385
            BYTES_SENT: 19169246173
  BINLOG_BYTES_WRITTEN: 3235170216
             ROWS_READ: 18429748528
             ROWS_SENT: 33466060
          ROWS_DELETED: 6515
         ROWS_INSERTED: 10179413
          ROWS_UPDATED: 8192371
       SELECT_COMMANDS: 51657333
       UPDATE_COMMANDS: 17789158
        OTHER_COMMANDS: 2634765
   COMMIT_TRANSACTIONS: 77495286
 ROLLBACK_TRANSACTIONS: 4857285
    DENIED_CONNECTIONS: 0
      LOST_CONNECTIONS: 0
         ACCESS_DENIED: 0
         EMPTY_QUERIES: 19721750
1 row in set (0.00 sec)

taking half of the total resources, in a database that serves hundreds to thousands of accounts:

root@labsdb1005[information_schema]> SELECT sum(cpu_time) FROM USER_STATISTICS;
+-------------------+
| sum(cpu_time)     |
+-------------------+
| 40725292.63999812 |
+-------------------+
1 row in set (0.00 sec)

root@labsdb1005[information_schema]>

This is specially important because as far as I can tell, there is 8 threads doing highly inneficient and easily optimazable queries (Using filesort on a lot of rows):

root@labsdb1005[(none)]> SHOW EXPLAIN FOR 63790240\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dewiki_data
         type: range
possible_keys: queue_sorting
          key: queue_sorting
      key_len: 9
          ref: NULL
         rows: 2451785
        Extra: Using index condition; Using filesort
1 row in set, 1 warning (0.26 sec)

Because of this, CPU is at 100% percent, leaving no resources for other users:

Screenshot_20180220_163002.png (310×2 px, 41 KB)

Please make the queries more efficient, reduce concurrency or ask for dedicated resources to avoid affecting other users. I ended up doing an analysis like this because a user complained of slowdown on toolsdb on IRC.

Event Timeline

bd808 triaged this task as High priority.Feb 21 2018, 5:03 PM
bd808 added a subscriber: bd808.

The tool seems to be running 10 parallel jobs named dewiki_update[1-10] which are using the dewiki_data table as a shared job queue of some type. An example script looks to be infinitely looping on a SELECT * FROM dewiki_data USE KEY(request_prio) WHERE request_prio=1 LIMIT 0,10 query with only a 1 second pause in the case that no results are found.

I can reduce the number ob jobs to ~4 if that is enough.

The tool works the following way:
The user adds a little script code to his common.js:
https://de.wikipedia.org/wiki/Benutzer:APPER/WikiHistory or similar for the english W: https://en.wikipedia.org/wiki/User:Wurgl/WikiHistory.js
This script requests data from some user-database. If that data is outdated or not existent, it adds a record into that database and that record is searched with the polling and processed. Since processing may take some minutes it is not done from the webserver process (before I got access to the tool it tool up to 7 hours for really large pages).

If there is some way of interprocess communication without pollig between the webserver and the servers, show me some example and I implement it.

If I'm understanding the tool correctly from @Wurgl's description and the source code, it has a couple of modes, but the primary use case is a userscript that will show the top authors who have contributed to the history of an article.

The userscript will request this data on each page visit. The request to the server looks to see if there is a pre-computed answer for the current page revision. If there is, the data is retrieved and returned to the caller in a form that can be inserted into the DOM of the page. If there is not, it writes/updates a record in the dewiki_data table (which covers more wikis than just dewiki semi-confusingly) asking for data and then returns a javascript response that will wait 2-60 seconds and ask for the data again.

The actual work of computing the top author data is done by the check_article function. It shells out to a WikiHistory.exe mono application. I can't find any raw source for this binary (or license!), but there is a page at https://de.wikipedia.org/wiki/Benutzer:APPER/WikiHistory/Programm which gives some explanation. It is a form of "wiki blame" and the call from the check_article function makes it produce an XML document describing the authorship:

data/enwiki/9/9/9/en.wikipedia.rev825634950.wha
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<WikiHistory analysis="BasicTextAnalysis" version="1" revision="825634950">
  <User name="MadeYourReadThis" percentage="42.574" />
  <User name="Mlaffs" percentage="29.657" />
  <User name="Rudy2alan" percentage="17.513" />
  <User name="76.27.198.89" percentage="6.918" />
  <User name="Stepheng3" percentage="1.064" />
  <User name="BD2412" percentage="0.968" />
  <User name="Brycehughes" percentage="0.581" />
  <User name="Neutralhomer" percentage="0.484" />
  <User name="192.80.70.51" percentage="0.242" />
  <User name="WillC" percentage="0.000" />
  <User name="Dcgibson55" percentage="0.000" />
  <User name="Jon Kolbert" percentage="0.000" />
  <User name="Ser Amantio di Nicolao" percentage="0.000" />
  <TotalTime sec="0" />
</WikiHistory>

Judging by the way that the PHP code is written, there have been NFS cache problems with reading these files directly after the shell call finishes.

This is all a pretty clever system, but number of queue workers that are running and the use of mysql as a queuing system are problems. Reducing the number of queue runners from 10 to 4 would be a good quick fix.

The next steps would be either to figure out how to optimize the use of the s51512__data.dewiki_data table or to move this queuing function to redis. Moving to redis for the queuing function might be the better long term step. The s51512__data.dewiki_data` could still be used as the source of truth for final output of a run, but the polling step on both the web frontend and the worker pool side could be moved into a series of redis queues. There are tutorials for doing this sort of thing online like https://danielkokott.wordpress.com/2015/02/14/redis-reliable-queue-pattern/. The current s51512__data.dewiki_data is large (5,054,310 rows), heavily indexed (1 primary, 1 unique, and 6 tertiary keys), and currently in high contention for both reads and writes due to the queuing + source of truth double usage. Moving the queuing to redis should be able to help with all of these things.

The new workflow could look something like:

  • initial request comes in
  • query database for the revision
  • if found, process and return result to client
  • if not found, LPUSH a processing request into a redis list including a "callback id"
  • return javascript callback to client

A callback request could look something like:

  • LREM the callback id from a redis results ready list
  • if the LREM returns > 0, query database for results, process, and return to client
  • if LREM == 0, return javascript callback to client

A worker request could look something like:

  • LRANGE from the worker's queue for the next task to perform
  • If task is found
    • run WikiHistory.exe
    • update database
    • RPUSH the callback id into the results ready list
    • LREM the task from the worker's queue
  • Repeat until no tasks are found in worker's queue
  • BRPOPLPUSH to wait for a processing request and save it in the worker queue
  • Repeat from the LRANGE poll step

This reduces the worst case database interaction to 1 query on initial request, 1 insert/update on worker finish, 1 query on callback request success. There is a slight chance for a thundering herd problem where multiple requests for a single revision come in, queue up at the same time, and be processed in parallel. Additional complexity could be added to avoid the duplicate work that this might cause like introducing a redis set to act as a distributed lock for processing any single revision. I would probably not bother with that until actual usage shows it to be a problem.

I actually changed it a little bit. I have 10 queues. When I started to maintain this tool, somewhen in September/October it had 4 queues, but at that time it served only de-wiki (and two very small ones: ndl and als). Now I serve en-wiki too. 9 queues are for different sizes of the page. The change I did already is polling fewer times. The processes for the large files poll now every 15 seconds since the computation takes a long time, so it does not matter if the user has to wait 5 minutes or 5 minutes and 15 seconds. The queue for the small pages now polls every 2 seconds (instead of 1 second). 1 second was not invented by me, that was the value when I started to maintain the tool.

I will reduce that the queues within a few days. All the larger pages are now precomputed, so I can provide the user good data and the missing ones are usually computed from srcratch within a few seconds.

Here on my local machine I have rewritten the whole code in C++ using Qt-Framework. But since it needs the same time (compared to the C#-Part), I stopped that. Well, C++ would need less memory since there is no overhead from mono.

However, I will look at the links you provided and think about them. Although I am writing programs since ages (actually since late/mid 70's) I am relative new to PHP (and to C# too) but I managed to optimize the C# code a lot (as written above). Give me two weeks of time and we both will be happy :-)

This is already implemented. Closing Task