
Very large date ranges can cause the API to fail
Closed, Resolved · Public · BUG REPORT

Description

Attempting to query for very large date ranges using the Toolviews API can fail with a 502 response, for example /api/v1/daily/2018-05-02/2023-05-01.

In my testing, going anywhere beyond a one-year range can be very slow or crash.

For T87001: Provide basic page view metrics for individual tools on toolforge, I can simply set a maximum date range client-side, or make multiple calls and concatenate the data, but I imagine some limit should be imposed in the API itself.
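One client-side workaround is to split a large range into sub-ranges of at most one year and issue one API call per chunk. A minimal sketch of the chunking helper (the 365-day cap and the function name are my own, not part of the API):

```python
from datetime import date, timedelta

def split_range(start: date, end: date, max_days: int = 365):
    """Split [start, end] into consecutive sub-ranges of at most max_days days."""
    chunks = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=max_days - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks

# Each (lo, hi) pair can then be fetched as /api/v1/daily/<lo>/<hi>
# and the JSON payloads concatenated client-side.
```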

Details

Title: database performance improvements
Reference: toolforge-repos/toolviews!9
Author: bd808
Source Branch: work/bd808/T338055-speedup-large-results
Dest Branch: main

Event Timeline

MariaDB [s53734__toolviews_p]> select count(1) from daily_raw_views where request_day >= '2018-05-02' and request_day <= '2023-05-01';
+----------+
| count(1) |
+----------+
|  1196180 |
+----------+
1 row in set (0.412 sec)

MariaDB [s53734__toolviews_p]> ANALYZE SELECT request_day, tool, hits, uniqueiphits FROM daily_raw_views WHERE request_day >= '2018-05-02' and request_day <= '2023-05-01' ORDER BY request_day desc, hits desc\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: daily_raw_views
         type: ALL
possible_keys: request_day
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1007223
       r_rows: 1196180.00
     filtered: 50.00
   r_filtered: 100.00
        Extra: Using where; Using filesort
1 row in set (7.453 sec)

MariaDB [s53734__toolviews_p]> SELECT request_day, tool, hits, uniqueiphits FROM daily_raw_views WHERE request_day >= '2018-05-02' and request_day <= '2023-05-01' ORDER BY request_day desc, hits desc;
...snip...
1196180 rows in set (8.546 sec)

Direct interaction with the database via sql toolsdb actually seems quite performant. I imagine that in the backend app itself there may be resident RAM issues with reading in the ~1.2 million rows, turning them into an array of dicts, and then returning that as a ~1.2-million-element JSON response to the caller.
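A back-of-the-envelope check of the dict overhead alone supports that suspicion. This sketch only measures the shallow size of one representative row dict (the sample values are invented, the exact numbers vary by Python version, and the real string/int objects plus the JSON serialization buffer add more on top):

```python
import sys

# One row as the API might represent it: a dict per row.
row = {"request_day": "2023-05-01", "tool": "admin", "hits": 1234, "uniqueiphits": 567}

# Shallow size of the dict alone (keys and values not included).
per_row = sys.getsizeof(row)
rows = 1_196_180

# Very rough lower bound on resident memory for the list of dicts,
# ignoring the per-row value objects and the JSON output buffer.
print(f"~{per_row * rows / 2**20:.0f} MiB just for the dict shells")
```

Even this lower bound lands well north of 100 MiB, which is uncomfortable inside a 512Mi container once the JSON string is built alongside it.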

Changing the date range from 2018-05-02 - 2023-05-01 to 2022-05-01 - 2023-05-01 (5 years of data down to 1 year) makes some difference in a direct query test, but not a particularly large one. The 5 year query in T338055#8899789 took 8.546 seconds; the 1 year query returns for me in 6.258 seconds despite covering roughly a quarter of the rows (285073 vs 1196180).

If I drop the ORDER BY clause and let the data return in "natural" order, the select time as measured by the database server is 0.420 seconds for the 1 year range and 1.097 seconds for the 5 year range. In both of these cases, as well as the results in T338055#8899789, the measured time is only the time needed for the server to gather the result. There is additional wall clock time needed to stream the results to the client that is proportional to the size of the dataset being returned.

I think it would be interesting to run some comparisons using python client code to see how timings vary if we move the result sorting to the client side rather than asking ToolsDB to do the necessary file sort in its shared RAM.

The "natural" order in this case is request_day ascending, tool ascending, thanks to the primary key of (request_day, tool).
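Moving the sort client-side could look something like this sketch: fetch the rows in primary-key order and reorder them in Python, relying on ISO dates comparing correctly as strings (the sample data here is invented):

```python
# Rows as returned in "natural" (primary key) order: request_day ASC, tool ASC.
rows = [
    {"request_day": "2023-04-30", "tool": "admin", "hits": 10},
    {"request_day": "2023-04-30", "tool": "zppixbot", "hits": 25},
    {"request_day": "2023-05-01", "tool": "admin", "hits": 7},
]

# Reproduce the query's ORDER BY request_day DESC, hits DESC in Python,
# sparing ToolsDB the filesort. ISO 8601 dates sort lexicographically,
# so plain string comparison gives the right date order.
rows.sort(key=lambda r: (r["request_day"], r["hits"]), reverse=True)
```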

That is interesting... I see in the code that the query sorts by date/hits, but the API actually returns the data sorted by date and then alphabetically by tool. So I'm already doing the sorting client-side! That sounds like a bug in the API, then: it sorts by date properly, just not by hits. Regardless, the current behaviour suggests we're safe to remove the hits sorting from the query if that helps performance at all, since clients of the API never see that ordering anyway.

bd808 renamed this task from Very large date ranges can cause the API to timeout to Very large date ranges can cause the API to fail.Jun 8 2023, 10:07 PM
bd808 updated the task description.

The 502 response is not a timeout. Instead, it is caused by uwsgi terminating the response to the nginx ingress abnormally. When this happens I can see something like this in $HOME/uwsgi.log:

DAMN ! worker 3 (pid: 10) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 3 (new pid: 17)

I'm actually not 100% certain what monitoring process is sending SIGKILL to the individual uwsgi worker process, but after some experiments I'm pretty sure that it is triggered by a low free memory state within the container. The webservice has been running with the default upper memory limit of 512Mi. This is apparently just not enough room to process pathologically large requests like asking for five years of daily hit totals for all tools.

Mentioned in SAL (#wikimedia-cloud) [2023-06-08T22:45:45Z] <wm-bot> <bd808> Restart after updating to bb42ab4 and setting limits of cpu=1, mem=2G (T338055)

bd808 changed the task status from Open to In Progress.Jun 8 2023, 10:46 PM
bd808 claimed this task.
bd808 moved this task from Backlog to Doing on the Tool-toolviews board.

With the increased memory limit (2G limit, 1G request) and the tweaks I made to the database interactions, I am currently able to pull a report for the entire 5 year history in just over 30 seconds on a cold cache. This results in a 20M json response (~1.2 million data points) from the API.

There will be some point in the future where things start blowing up again when you query out the entire database, but I think we are probably safe for a while.