
Reply tool has gotten slower?
Closed, Resolved (Public)

Description

I have a feeling that people are increasingly complaining about the reply tool being slow. We should investigate if it's a real problem and fix it if so.

Recent examples:

in my experience it's slower
Yeah, it's noticeably slower sometimes

First of all, it often shows a server error, and yet the reply is saved. (…) The server error most often occurred for me in discussions of submissions to did-you-know. [https://pl.wikipedia.org/wiki/Wikiprojekt:Czy_wiesz/propozycje]

(this one sounds like T249071, which is supposed to be fixed)

Break internet connection after the server received the request, but before the browser received the response… this results in a warning dialog, and the Reply Tool's comment form remains open

(dunno, maybe this one is a bit of a stretch)

(I feel like there were more, but I don't have examples handy)

Event Timeline

Peter has reminded me of the investigation in T270346 and asked whether we can use the same data to check the number of timed-out save attempts. We can! (kind of)

Re-running the query from that task, we get these error counts (from the last 90 days):

save_failure_type | save_failure_message | count
responseUnknown | http | 634
extensionAbuseFilter | abusefilter-warning | 348
extensionCaptcha | captcha | 204
responseUnknown | discussiontools-commentname-notfound | 136
extensionAbuseFilter | abusefilter-disallowed | 132
editConflict | editconflict | 114
extensionSpamBlacklist | spamblacklist | 96
responseUnknown | blocked | 74
responseUnknown | apierror-visualeditor-docserver-http | 55
(everything else) | | 138

That seems like a lot of http errors compared to the old results in that task.

It's also a lot compared to non-DiscussionTools error rates. I won't paste the whole table, but I see only 1158 http – against 39582 abusefilter-warning, 61264 captcha, 6692 spamblacklist… The data is not necessarily comparable (maybe the reply tool is used by different people in different situations than the normal page editors), but it still seems weird.

http covers various error conditions, not just timed-out requests, but it's probably mostly timed-out requests.

I also looked at this from another angle, wondering whether we've broken the fix for T249071 (fixed in April 2020).

And it sure looks like we did… In rEDTO3b31aa669dd1: Create controller.getApi method with defaults (October 2020), that code was refactored, and we wrote defaults.timeout = 0; where the correct thing would be defaults.ajax.timeout = 0;. So the default timeout of 30 seconds is being used, and save attempts that take longer than 30 seconds are cancelled and display an error.
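
For context, mw.Api only treats options under its ajax key as jQuery ajax options (and its own default there is the 30-second timeout), so a top-level timeout property is silently ignored. A rough sketch of the difference, not the actual controller.getApi code:

// Broken refactor: `timeout` at the top level of the mw.Api options is
// not an ajax option, so the default 30-second timeout still applies.
var brokenDefaults = { timeout: 0 };

// Intended behaviour: ajax options live under the `ajax` key, and a
// timeout of 0 disables the client-side timeout entirely.
var fixedDefaults = { ajax: { timeout: 0 } };

var api = new mw.Api( fixedDefaults );
// Save requests made through this api object will now wait as long as
// the server takes, instead of being cancelled after 30 seconds.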

Fixing that should avoid the user-visible errors (but it won't improve the performance). It does clarify the question, though – rather than wondering whether the tool has gotten slower, we can ask why so many save attempts are taking longer than 30 seconds. (And is this a performance regression, or are people now using the reply tool on bigger pages?)

Next steps:

Change 720114 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Remove client-side timeout when saving pages (again)

https://gerrit.wikimedia.org/r/720114

Change 720115 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Log more precise error codes for timeout etc. (and HTTP statuses)

https://gerrit.wikimedia.org/r/720115

Change 720114 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Remove client-side timeout when saving pages (again)

https://gerrit.wikimedia.org/r/720114

Change 720115 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Log more precise error codes for timeout etc. (and HTTP statuses)

https://gerrit.wikimedia.org/r/720115

(After this is deployed, compare the 'http' error rates from last week to next week, to confirm the issue was fixed) [with caveats noted below: T290514#7349230]

Note that it won't be perfectly comparable, since we've made the http error code more granular so we can distinguish subtypes. http in the earlier logs will be comparable to http-[various-numbers] + timeout + abort + parsererror.
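
For reference, this kind of breakdown falls naturally out of jQuery's ajax failure arguments; a rough sketch (not the actual logging code) of how the more granular codes could be derived:

// Rough sketch, not the actual DiscussionTools logging code.
// jQuery passes ( jqXHR, textStatus ) to ajax failure handlers, where
// textStatus is 'timeout', 'abort', 'parsererror' or 'error'.
function getDetailedErrorCode( jqXHR, textStatus ) {
    if ( textStatus === 'timeout' || textStatus === 'abort' || textStatus === 'parsererror' ) {
        return textStatus;
    }
    // Plain HTTP failure: include the status code, e.g. 'http-500'.
    // A status of 0 means the browser never received a response at all
    // (network error, dropped connection, and so on).
    return 'http-' + jqXHR.status;
}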

I tried doing this, and succeeded in profiling it per the instructions, but not really in seeing why it's slow. The output in XHGui leaves a lot to be desired:

image.png (693×2 px, 245 KB)

This is the call graph of everything that happened in the request, with functions that took the most time marked in pink. I am not desperate enough to try to make sense out of this.

You can have a look at the data here: https://performance.wikimedia.org/xhgui/run/callgraph?id=6137c7f5df0198a17a0bbeb5. Once it loads (it takes forever), click on the graph and use the mouse wheel to zoom out, and click and drag to move around. A copy of the output as PDF (in case you can't get it to work, or in case the data gets deleted after some time): F34642298.

Note also that this profile was recorded a week ago, and it wasn't a particularly slow one.

I've also tried profiling on Patchdemo using Excimer (which can be used with other tools to produce more readable output, but which can't easily be used on production wikis). It's not really comparable to the production data, but I did it in case it showed some obvious regression. It did not: the output is almost exactly the same as the last time I did this in April (T279119#6968347).

profile.png (2×2 px, 1 MB)

See the comment in T279119#6968347 for a description of what is shown above.

Other notes:

  • I noticed that the Performance team are working on improvements to save timing: T277788. This will probably benefit us as well.
  • We do not record save_success_timing data for DiscussionTools edits, so we don't even know how slow it is. We should at least fix this (see the sketch below this list).
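
A minimal sketch of what recording that timing could look like (logSaveEvent() is a hypothetical stand-in for however the saveSuccess/saveFailure events are actually logged, and the field name is illustrative, not the real schema):

// Minimal sketch; logSaveEvent() is hypothetical, not a real API.
function saveReply( api, params ) {
    var start = Date.now();
    var promise = api.postWithToken( 'csrf', params );
    promise.done( function () {
        logSaveEvent( 'saveSuccess', { timing: Date.now() - start } );
    } );
    promise.fail( function () {
        logSaveEvent( 'saveFailure', { timing: Date.now() - start } );
    } );
    return promise;
}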

Change 720827 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Record timing for 'saveSuccess' and 'saveFailure' events

https://gerrit.wikimedia.org/r/720827

Oh, you filed a subtask for it, I'll move the patch there.

This is done for now – the next step is to wait until the changes are deployed, then check the logs to confirm that the timeout problem is fixed (T290514#7337387, T290514#7349223).

I can confirm I definitely messed up the logging code, because almost every error is being logged as http-0.

chart.png (781×2 px, 106 KB)

(I re-ran the query and did some analysis in a spreadsheet: https://docs.google.com/spreadsheets/d/1IWqOTzfqddmx7ZppsMA1zYnLtsI8iC_9Gszx2eTqVkQ/edit#gid=395694778)

Change 724399 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724399

Change 724399 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724399

Change 724378 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.1] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724378

Change 724379 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.2] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724379

Change 724378 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.1] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724378

Change 724379 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.2] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724379

Mentioned in SAL (#wikimedia-operations) [2021-09-29T11:17:47Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.2/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724379|Fix almost all errors codes being logged as http-0 (T290514)]] (duration: 01m 09s)

Mentioned in SAL (#wikimedia-operations) [2021-09-29T11:20:43Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.1/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724378|Fix almost all errors codes being logged as http-0 (T290514)]] (duration: 01m 09s)

We are waiting to confirm that we no longer get timeout errors.

The above can be "tested" as soon as Monday, 4 October.

Updated chart: https://docs.google.com/spreadsheets/d/1W_HivFb3RI_3Zk0-NzBHBPhPwOzZNT-d8AF1_X32mXk/edit#gid=395694778

chart (1).png (781×2 px, 112 KB)

…and with changed colors for clarity:

chart (2).png (781×2 px, 106 KB)

Ignore the blue part; that was the mistake in the logging code. We can see that in the past few days there were almost no http errors (indicating either timeouts or network connection issues), and no timeout errors (indicating timeouts). Previously their rate was very high. Therefore I think we can call this resolved.

I forgot to copy the Superset query I was using. For future reference:

select year, month, day, event.save_failure_type, event.save_failure_message, count(*) as count
from editattemptstep
where event.integration='discussiontools' and event.action='saveFailure'
group by year, month, day, event.save_failure_type, event.save_failure_message
order by year, month, day, count desc

Documentation like this is a gift to our current and future selves. Nicely done creating these visualizations and, of course, stemming the flow of errors, @matmarex.