
Reply tool has gotten slower?
Closed, Resolved (Public)

Description

I have a feeling that people are increasingly complaining about the reply tool being slow. We should investigate if it's a real problem and fix it if so.

Recent examples:

in my experience it's slower
Yeah, it's noticeably slower sometimes

First of all, it often shows a server error, and yet the reply is saved. (…) The server error most often occurred for me in discussions of submissions to did-you-know. [https://pl.wikipedia.org/wiki/Wikiprojekt:Czy_wiesz/propozycje]

(this one sounds like T249071, which is supposed to be fixed)

Break internet connection after the server received the request, but before the browser received the response… this results in a warning dialog, and the Reply Tool's comment form remains open

(dunno, maybe this one is a bit of a stretch)

(I feel like there were more, but I don't have examples handy)

Event Timeline

Peter has reminded me of the investigation in T270346 and asked whether we can use the same data to check the number of timed-out save attempts. We can! (kind of)

Re-running the query from that task, we get these error counts (from the last 90 days):

save_failure_type | save_failure_message | count
responseUnknown | http | 634
extensionAbuseFilter | abusefilter-warning | 348
extensionCaptcha | captcha | 204
responseUnknown | discussiontools-commentname-notfound | 136
extensionAbuseFilter | abusefilter-disallowed | 132
editConflict | editconflict | 114
extensionSpamBlacklist | spamblacklist | 96
responseUnknown | blocked | 74
responseUnknown | apierror-visualeditor-docserver-http | 55
(everything else) | | 138

That seems like a lot of http errors compared to the old results in that task.

It's also a lot compared to non-DiscussionTools error rates. I won't paste the whole table, but I see only 1158 http – against 39582 abusefilter-warning, 61264 captcha, 6692 spamblacklist… The data is not necessarily comparable (maybe the reply tool is used by different people in different situations than the normal page editors), but it still seems weird.

http covers various error conditions, not just timed-out requests, but it's probably mostly timed-out requests.

I also looked at this from another angle, wondering whether we've broken the fix for T249071 (fixed in April 2020).

And it sure looks like we did… In rEDTO3b31aa669dd1: Create controller.getApi method with defaults (October 2020), that code was refactored, and we wrote defaults.timeout = 0; where the correct thing would be defaults.ajax.timeout = 0;. So the default timeout of 30 seconds is being used, and save attempts that take longer than 30 seconds are cancelled and display an error.
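
For context, mw.Api only treats options under its ajax key as jQuery ajax options (and its own default there is the 30-second timeout), so a top-level timeout property is silently ignored. A rough sketch of the difference, not the actual controller.getApi code:

// Broken refactor: `timeout` at the top level of the mw.Api options is
// not an ajax option, so the default 30-second timeout still applies.
var brokenDefaults = { timeout: 0 };

// Intended behaviour: ajax options live under the `ajax` key, and a
// timeout of 0 disables the client-side timeout entirely.
var fixedDefaults = { ajax: { timeout: 0 } };

var api = new mw.Api( fixedDefaults );
// Save requests made through this api object will now wait as long as
// the server takes, instead of being cancelled after 30 seconds.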

Fixing that should avoid the user-visible errors (but it won't improve the performance). It does clarify the question, though – rather than wondering whether the tool has gotten slower, we can ask why so many save attempts are taking longer than 30 seconds. (And is this a performance regression, or are people now using the reply tool on bigger pages?)

Next steps:

Change 720114 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Remove client-side timeout when saving pages (again)

https://gerrit.wikimedia.org/r/720114

Change 720115 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Log more precise error codes for timeout etc. (and HTTP statuses)

https://gerrit.wikimedia.org/r/720115

Change 720114 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Remove client-side timeout when saving pages (again)

https://gerrit.wikimedia.org/r/720114

Change 720115 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Log more precise error codes for timeout etc. (and HTTP statuses)

https://gerrit.wikimedia.org/r/720115

(After this is deployed, compare the 'http' error rates from last week to next week, to confirm the issue was fixed) [with caveats noted below: T290514#7349230]

Note that it won't be perfectly comparable, since we've made the http error code more granular so we can distinguish subtypes. http in the earlier logs will be comparable to http-[various-numbers] + timeout + abort + parsererror.
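
For reference, this kind of breakdown falls naturally out of jQuery's ajax failure arguments; a rough sketch (not the actual logging code) of how the more granular codes could be derived:

// Rough sketch, not the actual DiscussionTools logging code.
// jQuery passes ( jqXHR, textStatus ) to ajax failure handlers, where
// textStatus is 'timeout', 'abort', 'parsererror' or 'error'.
function getDetailedErrorCode( jqXHR, textStatus ) {
    if ( textStatus === 'timeout' || textStatus === 'abort' || textStatus === 'parsererror' ) {
        return textStatus;
    }
    // Plain HTTP failure: include the status code, e.g. 'http-500'.
    // A status of 0 means the browser never received a response at all
    // (network error, dropped connection, and so on).
    return 'http-' + jqXHR.status;
}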

I tried doing this, and succeeded in profiling it per the instructions, but not really in seeing why it's slow. The output in XHGui leaves a lot to be desired:

image.png (693×2 px, 245 KB)

This is the call graph of everything that happened in the request, with functions that took the most time marked in pink. I am not desperate enough to try to make sense out of this.

You can have a look at the data here: https://performance.wikimedia.org/xhgui/run/callgraph?id=6137c7f5df0198a17a0bbeb5. Once it loads (it takes forever), click on the graph and use the mouse wheel to zoom out, and click and drag to move around. A copy of the output as PDF (in case you can't get it to work, or in case the data gets deleted after some time): F34642298.

Note also that this profile was recorded a week ago, and it wasn't a particularly slow one.

I've also tried profiling on Patchdemo using Excimer (which can be used with other tools to produce more readable output, but which can't easily be used on production wikis). It's not really comparable to the production data, but I did it in case it showed some obvious regression. It did not: the output is almost exactly the same as the last time I did this in April (T279119#6968347).

profile.png (2×2 px, 1 MB)

See the comment in T279119#6968347 for a description of what is shown above.

Other notes:

  • I noticed that the Performance team are working on improvements to save timing: T277788. This will probably benefit us as well.
  • We do not record save_success_timing data for DiscussionTools edits, so we don't even know how slow it is. We should at least fix this (see the sketch below this list).
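
A minimal sketch of what recording that timing could look like (logSaveEvent() is a hypothetical stand-in for however the saveSuccess/saveFailure events are actually logged, and the field name is illustrative, not the real schema):

// Minimal sketch; logSaveEvent() is hypothetical, not a real API.
function saveReply( api, params ) {
    var start = Date.now();
    var promise = api.postWithToken( 'csrf', params );
    promise.done( function () {
        logSaveEvent( 'saveSuccess', { timing: Date.now() - start } );
    } );
    promise.fail( function () {
        logSaveEvent( 'saveFailure', { timing: Date.now() - start } );
    } );
    return promise;
}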

Change 720827 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Record timing for 'saveSuccess' and 'saveFailure' events

https://gerrit.wikimedia.org/r/720827

Oh, you filed a subtask for it, I'll move the patch there.

This is done for now – the next step is to wait until the changes are deployed, then check the logs to confirm that the timeout problem is fixed (T290514#7337387, T290514#7349223).

I can confirm I definitely messed up the logging code, because almost every error is being logged as http-0.

chart.png (781×2 px, 106 KB)

(I re-ran the query and did some analysis in a spreadsheet: https://docs.google.com/spreadsheets/d/1IWqOTzfqddmx7ZppsMA1zYnLtsI8iC_9Gszx2eTqVkQ/edit#gid=395694778)

Change 724399 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724399

Change 724399 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724399

Change 724378 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.1] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724378

Change 724379 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.2] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724379

Change 724378 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.1] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724378

Change 724379 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.38.0-wmf.2] Fix almost all errors codes being logged as `http-0`

https://gerrit.wikimedia.org/r/724379

Mentioned in SAL (#wikimedia-operations) [2021-09-29T11:17:47Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.2/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724379|Fix almost all errors codes being logged as http-0 (T290514)]] (duration: 01m 09s)

Mentioned in SAL (#wikimedia-operations) [2021-09-29T11:20:43Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.1/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724378|Fix almost all errors codes being logged as http-0 (T290514)]] (duration: 01m 09s)

We are waiting to confirm that we no longer get timeout errors.

The above can be "tested" as soon as Monday, 4 October.

Updated chart: https://docs.google.com/spreadsheets/d/1W_HivFb3RI_3Zk0-NzBHBPhPwOzZNT-d8AF1_X32mXk/edit#gid=395694778

chart (1).png (781×2 px, 112 KB)

…and with changed colors for clarity:

chart (2).png (781×2 px, 106 KB)

Ignore the blue part; that was the mistake in the logging code. We can see that in the past few days there were almost no http errors (indicating either timeouts or network connection issues), and no timeout errors (indicating timeouts). Previously their rate was very high. Therefore I think we can call this resolved.

I forgot to copy the Superset query I was using. For future reference:

select year, month, day, event.save_failure_type, event.save_failure_message, count(*) as count
from editattemptstep
where event.integration='discussiontools' and event.action='saveFailure'
group by year, month, day, event.save_failure_type, event.save_failure_message
order by year, month, day, count desc

Documentation like this is a gift to our current and future selves. Nicely done creating these visualizations and, of course, stemming the flow of errors, @matmarex.