Investigate CX saving/recovery failures
Closed, Resolved · Public · 1 Estimated Story Points

Description

Some logging is in place (see T104059) and one possible cause has been fixed (users getting logged out), but reports of saving or restoration failures are still coming in. This task is for tracking the investigation of causes during the sprint.

Details

	Subject	Repo	Branch	Lines +/-
	Add some extra information to save failure logging	mediawiki/extensions/ContentTranslation	wmf/1.27.0-wmf.7	+14 -1
	Add some extra information to save failure logging	mediawiki/extensions/ContentTranslation	master	+14 -1

Related Objects

Status	Assigned	Task
Resolved	Amire80	T95655 CX Phase 2 deployment issues (tracking)
Open	None	T96003 Publishing and saving errors (tracking)
Resolved	Pginer-WMF	T104059 Failures with auto-save causes users data loss when translating (tracking)
Resolved	Nikerabbit	T116908 Investigate CX saving/recovery failures
Resolved	• santhosh	T120635 Reloading a translation from translationview does not restore the saved content

Event Timeline

Nikerabbit created this task. Oct 28 2015, 12:33 PM

Nikerabbit raised the priority of this task from to Needs Triage.

Nikerabbit updated the task description.

Nikerabbit added projects: ContentTranslation-Release7, LE-CX7-Sprint 3.

Nikerabbit subscribed.

Restricted Application added a subscriber: Aklapper. Oct 28 2015, 12:33 PM

Nikerabbit added a parent task: T104059: Failures with auto-save causes users data loss when translating (tracking). Oct 28 2015, 12:34 PM

Amire80 triaged this task as High priority. Oct 28 2015, 6:41 PM

Amire80 set Security to None.

Amire80 added a project: Essential-Work.

Amire80 moved this task from Backlog to High-priority Bugs on the ContentTranslation-Release7 board.

Some reports:

https://www.mediawiki.org/w/index.php?title=Topic:Smd55pthmhu4usdb&topic_showPostId=srqpy6m1cswhbme7#flow-post-srqpy6m1cswhbme7

https://www.mediawiki.org/w/index.php?title=Topic:Smd55pthmhu4usdb&topic_showPostId=srr8umcdn6fjdcpr#flow-post-srr8umcdn6fjdcpr

He7d3r subscribed. Oct 31 2015, 11:51 AM

Arrbee renamed this task from Investigate CX saving failures to Investigate CX saving/recovery failures. Nov 4 2015, 7:09 AM

Arrbee assigned this task to Nikerabbit. Nov 4 2015, 7:12 AM

• santhosh edited a custom field. Nov 4 2015, 7:13 AM

There are two parts to consider:

Whether autosave is failing, that is, whether the save did not work but the translator assumed that the content was saved.
Whether the saved translations are not being restored properly.

We instrumented EventLogging to capture autosave failures, so we need to monitor the data there. We also have EventLogging for restore failures. This is one source of data to investigate.

Even if the draft content is fetched successfully, there can still be a problem restoring it in the UI while keeping the alignment with the source sections. We have somewhat complex logic to handle orphan translation sections. We have an automated test to check that it works as intended, but the last complaint, about the Latvia article, came when the source article had changed a lot. We don't have many examples of this kind of issue, and there is no reproducible case. If we can reproduce any such case, we will have a lead.

Reproducing the above issue is crucial before proceeding with any solution.

There was a parallel discussion about avoiding source article changes. Note that this discussion rests entirely on the assumption that if the source article changes, parts of the draft translation might get lost. This is yet to be proved.

Now, on whether we can lock a source revision for a translation:

The permanent section ids that Parsoid gives us are not really permanent across revisions (see T116350), because a proper algorithm for that has yet to be implemented. So when the source article changes, we can potentially get new ids, even for sections that did not change.

Even if we save the revision number of the source and request that revision when the translation is resumed, we are not guaranteed to get the source article with the same section ids. This is because Parsoid has to re-render it, and any change in transcluded content (templates, for example) can change the ids. This is less likely than the article itself being edited, but we still need our restore algorithm in place.
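
To make the orphan handling above concrete, here is a minimal sketch of a restore step that tolerates changed section ids. It is not the actual ContentTranslation algorithm; the `SavedSection` type, its fields, and the fall-back-to-position rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SavedSection:
    # All fields are hypothetical; the real draft storage may differ.
    source_id: str   # id of the source section when the draft was saved
    position: int    # index of that source section at save time
    html: str        # translated content


def restore(saved, current_source_ids):
    """Align saved translation sections with the current source sections.

    Returns (matched, orphans). `matched` maps a current source id to its
    saved section. `orphans` lists (index, section) pairs for sections whose
    source id no longer exists, so they can still be shown near their old
    position instead of being dropped.
    """
    current = set(current_source_ids)
    matched = {}
    orphans = []
    for section in saved:
        if section.source_id in current:
            matched[section.source_id] = section
        elif current_source_ids:
            # Id is gone: fall back to the saved position, clamped to range.
            index = min(section.position, len(current_source_ids) - 1)
            orphans.append((index, section))
        else:
            orphans.append((0, section))
    return matched, orphans
```

The point of the sketch is only that a section with an unknown id is kept as an orphan rather than discarded, which is why the restore logic is still needed even if the source revision is pinned.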

Nikerabbit moved this task from Backlog to In Progress on the LE-CX7-Sprint 3 board. Nov 12 2015, 8:17 AM

It is difficult to get insight on this topic. The reasons seem very varied and partially caused by misunderstandings.

Restoration in changed articles is surely one part of this, as Santhosh has already explained above.

In addition I noticed we delete drafts when people publish (by accident? intentionally but forgot?) which obviously prevents people from continuing. Maybe we should stop doing that?

I also noticed that we are uploading hundreds of kilobytes multiple times for big articles. I might add more instrumentation to see whether there is any correlation with size and failures and/or saving time.

Change 254817 had a related patch set uploaded (by Nikerabbit):
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/254817

gerritbot added a project: Patch-For-Review. Nov 23 2015, 9:07 AM

Change 254817 merged by jenkins-bot:
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/254817

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2015-12-08_(1.27.0-wmf.8)). Nov 24 2015, 5:00 AM

I also noticed that we are uploading hundreds of kilobytes multiple times for big articles. I might add more instrumentation to see whether there is any correlation with size and failures and/or saving time.

We can compress the HTML in the browser and decompress it on the PHP side. VE does that when publishing articles. It uses https://github.com/Jacob-Christian-Munch-Andersen/Easy-Deflate
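
As a rough illustration of why compression helps with repetitive HTML payloads, here is a minimal round-trip sketch. It uses Python's standard `zlib` module as a stand-in; the proposal above would do the compression in browser JavaScript and the decompression in PHP, and the sample HTML is made up.

```python
import zlib


def compress_html(html: str) -> bytes:
    """Compress an HTML string with zlib (DEFLATE) at the highest level."""
    return zlib.compress(html.encode("utf-8"), 9)


def decompress_html(payload: bytes) -> str:
    """Reverse compress_html; this is the step the server side would do."""
    return zlib.decompress(payload).decode("utf-8")


# Made-up, highly repetitive markup standing in for a large draft.
html = "<section><p>Riga is the capital of Latvia.</p></section>" * 2000
payload = compress_html(html)

assert decompress_html(payload) == html
print(len(html.encode("utf-8")), "bytes before,", len(payload), "bytes after")
```

Real article HTML is less repetitive than this sample, so the ratio seen in production would be smaller.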

• santhosh mentioned this in T119490: Pass the compressed translated content to ApiContentTranslationPublish. Nov 24 2015, 9:37 AM

Arrbee added a project: ContentTranslation. Nov 25 2015, 6:43 AM

I take it that the problem of the PUBLISH TRANSLATION button staying gray is not part of this bug?

It is not: I found the correct bug for that: https://phabricator.wikimedia.org/T114621. Sorry.

Change 255956 had a related patch set uploaded (by KartikMistry):
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/255956

Change 255956 merged by jenkins-bot:
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/255956

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2015-11-17_(1.27.0-wmf.7)). Nov 30 2015, 5:00 PM

Arrbee edited projects, added LE-CX7-Sprint 4; removed LE-CX7-Sprint 3. Dec 1 2015, 7:29 AM

We have spotted some new issues:

Trying to insert a duplicate row (cause yet unknown)
nosourcerevision is a new error (cause yet unknown)

We still don't have a complete picture of this, and investigations are slow because user reports often lack the information needed to dig deep enough. Some patches are in the pipeline to help with this:

do not delete published drafts
use compression

I feel JavaScript error logging would be handy, but I am not sure what happened to the project aiming to collect those.

Save failures are still way too common (over a thousand per day, although they affect a much smaller number of users and articles).
Publication failures are in tens per day, AbuseFilter being the main cause. While we cannot prevent that, T114621 aims to make handling of those better.

Arrbee moved this task from Backlog to In Progress on the LE-CX7-Sprint 4 board. Dec 2 2015, 6:14 AM

• santhosh added a subtask: T120635: Reloading a translation from translationview does not restore the saved content. Dec 7 2015, 8:28 AM

Amire80 moved this task from Needs Triage to CX7 on the ContentTranslation board. Dec 7 2015, 10:45 AM

Once the compression patch lands, we should see:

a reduction in timeouts overall, because requests are more likely to complete on slow networks
a reduction in payload size for the remaining timeouts, which are caused by intermittent network failures

Santhosh is working on patches for section-level saving, which will further reduce the size of the requests. I think at that point we can safely say that the remaining timeout issues are caused by poor networks and are out of our control.

Santhosh's patch to stop requiring draft parameters should prevent cases where the issue has been misreported by users who have bookmarked or otherwise used a URL without the draft parameter.
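
One way section-level saving could shrink requests is to send only the sections that changed since the last successful save. The sketch below shows that idea under assumptions: the function names, the dictionary layout, and the use of SHA-256 digests are illustrative and are not taken from the actual patches.

```python
import hashlib


def digest(html: str) -> str:
    """Stable fingerprint of a section's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def changed_sections(sections: dict, last_saved: dict) -> dict:
    """Return only the sections whose content differs from the last save.

    `sections` maps section id to current HTML; `last_saved` maps section id
    to the digest recorded at the last successful save (both hypothetical).
    """
    return {
        section_id: html
        for section_id, html in sections.items()
        if last_saved.get(section_id) != digest(html)
    }


def mark_saved(saved_sections: dict, last_saved: dict) -> None:
    """Record digests after the server confirms the save."""
    for section_id, html in saved_sections.items():
        last_saved[section_id] = digest(html)
```

With this shape, an autosave after editing one paragraph uploads that one section rather than the whole draft, so the request size no longer grows with article length.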

The problem for me only occurred while translating an extremely large article. Since then, no problem. But I'd like to echo the encouragement to save already published drafts. People publish incompletely translated articles for a variety of reasons and would like to return to them to fully translate.

With regard to my earlier guesses, we now have some numbers (grouped by source title):

The number of saving timeouts has dropped by almost 60%, from an average of ~5 per day to ~2 per day
The average payload size in timeouts has decreased by almost 80%, from ~150 KiB to ~30 KiB
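
For reference, the percentages follow from the rounded figures quoted above; the underlying averages are approximate, hence "almost". A small check, with a helper name chosen here for illustration:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) * 100 / before


timeouts = percent_reduction(5, 2)     # saving timeouts per day
payload = percent_reduction(150, 30)   # average payload size in KiB
print(timeouts, payload)
```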

• santhosh moved this task from In Progress to Done on the LE-CX7-Sprint 4 board. Dec 22 2015, 6:43 AM

• santhosh closed this task as Resolved. Dec 22 2015, 6:55 AM

• santhosh closed subtask T120635: Reloading a translation from translationview does not restore the saved content as Resolved. Dec 22 2015, 7:12 AM
