Investigate CX saving/recovery failures
Closed, Resolved · Public · 1 Story Points

Description

Some logging is in (see T104059) and one possible cause has been fixed (users getting logged out), but reports of saving or restoration failures are still coming in. This task is for tracking the investigation of causes during the sprint.

Nikerabbit updated the task description.
Nikerabbit raised the priority of this task from to Needs Triage.
Nikerabbit added a subscriber: Nikerabbit.
Restricted Application added a subscriber: Aklapper. Oct 28 2015, 12:33 PM
Amire80 triaged this task as High priority. Oct 28 2015, 6:41 PM
Amire80 set Security to None.
Amire80 added a project: WorkType-Maintenance.
Amire80 moved this task from Backlog to High-priority Bugs on the ContentTranslation-Release7 board.
He7d3r added a subscriber: He7d3r. Oct 31 2015, 11:51 AM
Arrbee renamed this task from Investigate CX saving failures to Investigate CX saving/recovery failures. Nov 4 2015, 7:09 AM
Arrbee assigned this task to Nikerabbit. Nov 4 2015, 7:12 AM
santhosh edited a custom field. Nov 4 2015, 7:13 AM

There are two parts to consider.

  1. Whether there are autosave failures: the save did not work, but the translator assumed that the content was saved.
  2. Whether saved translations are not properly restored.

We instrumented EventLogging to capture autosave failures, so we need to monitor the data there. We also have EventLogging for restore failures. This is one source of data to investigate.

Even if the draft content is fetched successfully, there can still be a problem restoring it in the UI while keeping it aligned with the source sections. We have somewhat complex logic to handle orphan translation sections. We have an automated test to check that it works as intended, but the last complaint, about the Latvia article, came when the source article had changed a lot. We don't have many examples of this kind of issue, and there is no reproducible case. If we can reproduce any such case, we have a lead.

Reproducing the above issue is crucial before proceeding with any solution.

There was a parallel discussion about avoiding source article changes. Note that this discussion is based entirely on the assumption that if the source article changes, parts of the draft translation might get lost. But this is yet to be proved.

Now, as to whether we can lock a source revision for a translation.

The permanent section ids that Parsoid gives us are not "really" permanent across revisions (see T116350), because a proper algorithm for that has yet to be implemented. So when the source article changes, we can potentially get new ids, even for sections that did not change.

Even if we save the revision number of the source and request that revision when the translation is resumed, it is not guaranteed that we will get the source article with the same section ids. This is because Parsoid has to re-render it, and any change in transcluded content (templates, for example) can change the ids. The probability of this is low compared to the probability of the article being edited. So we still need our restore algorithm in place.
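The kind of restore step discussed here can be sketched roughly as follows. This is only an illustration, not the actual ContentTranslation algorithm: the function name `align_draft`, the section ids, and the data shapes are all hypothetical. It pairs saved translation sections with the current source sections by id and sets aside as orphans any saved section whose source id no longer exists.

```python
from typing import Dict, List, Optional, Tuple


def align_draft(
    source_ids: List[str],
    saved: Dict[str, str],
) -> Tuple[List[Tuple[str, Optional[str]]], Dict[str, str]]:
    """Pair each current source section with its saved translation, if any.

    source_ids: section ids of the source article as rendered now.
    saved: saved translation HTML keyed by the source section id it was
           made against (hypothetical storage shape).
    Returns (aligned, orphans): aligned keeps source order; orphans holds
    saved sections whose id is missing from the current source.
    """
    aligned = [(sid, saved.get(sid)) for sid in source_ids]
    current = set(source_ids)
    orphans = {sid: html for sid, html in saved.items() if sid not in current}
    return aligned, orphans


# Made-up ids: "mwBQ" was regenerated as "mwCx" after the source changed.
aligned, orphans = align_draft(
    ["mwAg", "mwBA", "mwCx"],
    {"mwAg": "<p>Ievads</p>", "mwBQ": "<p>Vēsture</p>"},
)
print(aligned)
print(orphans)
```

Orphans are the hard case: the translated text still exists, but no current source section claims it, so the UI has to decide where to place it.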

It is difficult to get insight on this topic. The reasons seem very varied and partially caused by misunderstandings.

Restoration in changed articles is surely one part of this, as Santhosh has already explained above.

In addition I noticed we delete drafts when people publish (by accident? intentionally but forgot?) which obviously prevents people from continuing. Maybe we should stop doing that?

I also noticed that we are uploading hundreds of kilobytes multiple times for big articles. I might add more instrumentation to see whether there is any correlation with size and failures and/or saving time.

Change 254817 had a related patch set uploaded (by Nikerabbit):
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/254817

Change 254817 merged by jenkins-bot:
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/254817

I also noticed that we are uploading hundreds of kilobytes multiple times for big articles. I might add more instrumentation to see whether there is any correlation with size and failures and/or saving time.

We can compress the HTML in the browser and decompress it on the PHP side. VE does that for article publishing. It uses https://github.com/Jacob-Christian-Munch-Andersen/Easy-Deflate
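To give a feel for the idea, here is a minimal round-trip sketch. It uses Python's standard `zlib` and `base64` modules as a stand-in for the browser and PHP sides, so it does not reproduce the Easy-Deflate wire format; the helper names `pack_html` and `unpack_html` and the sample markup are made up for illustration.

```python
import base64
import zlib


def pack_html(html: str) -> str:
    """Deflate the HTML and base64-encode it so it can travel as a text field."""
    raw = html.encode("utf-8")
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")


def unpack_html(packed: str) -> str:
    """Reverse of pack_html: base64-decode, then inflate."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")


# Repetitive markup, as found in large articles, compresses well.
draft = '<section id="s1"><p class="cx-para">Some translated text.</p></section>' * 500
packed = pack_html(draft)
print(len(draft.encode("utf-8")), "->", len(packed))
assert unpack_html(packed) == draft
```

The base64 step costs roughly a third in size on top of the compressed bytes, which is why the gain depends on how repetitive the HTML is.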

I take it that the problem of the PUBLISH TRANSLATION button staying gray is not part of this bug?

It is not: I found the correct bug for that: https://phabricator.wikimedia.org/T114621. Sorry.

Change 255956 had a related patch set uploaded (by KartikMistry):
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/255956

Change 255956 merged by jenkins-bot:
Add some extra information to save failure logging

https://gerrit.wikimedia.org/r/255956

We have spotted some new issues:

  • Trying to insert a duplicate row (cause yet unknown)
  • nosourcerevision is a new error (cause yet unknown)

We still don't have a complete picture of this, and investigations are slow because user reports often lack sufficient information to dig deep enough. Some patches are in the pipeline to help with this:

  • do not delete published drafts
  • use compression

I feel JavaScript error logging would be handy, but I am not sure what happened to the project aiming to collect those.

Save failures are still way too common (over a thousand per day, although affecting a much smaller number of users and articles).
Publication failures are in tens per day, AbuseFilter being the main cause. While we cannot prevent that, T114621 aims to make handling of those better.

Arrbee moved this task from Backlog to In Progress on the LE-CX7-Sprint 4 board. Dec 2 2015, 6:14 AM
Amire80 moved this task from Backlog to CX7 on the ContentTranslation board. Dec 7 2015, 10:45 AM

Once the compression patch lands, we should:

  1. see a reduction in timeouts overall, because requests are more likely to complete on slow networks
  2. see a reduction in payload size for the remaining timeouts, which are caused by intermittent network connections

Santhosh is working on patches for section-level saving, which will further reduce the size of the requests. I think at that point we can safely say that the rest of the timeout issues are caused by poor networks and are out of our control.
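As a rough sketch of why section-level saving shrinks requests, the snippet below sends only the sections whose content changed since the last successful save. It is not taken from the patches mentioned above: the digest-based comparison, the function names, and the section ids are assumptions made for illustration.

```python
import hashlib
from typing import Dict


def digest(html: str) -> str:
    """Stable fingerprint of one section's HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def changed_sections(current: Dict[str, str], last_saved: Dict[str, str]) -> Dict[str, str]:
    """Return only the sections that differ from the last saved state.

    current: section id -> current translation HTML.
    last_saved: section id -> digest recorded at the previous successful save.
    """
    return {
        sid: html
        for sid, html in current.items()
        if last_saved.get(sid) != digest(html)
    }


sections = {"s1": "<p>One</p>", "s2": "<p>Two</p>"}
saved = {sid: digest(html) for sid, html in sections.items()}
sections["s2"] = "<p>Two, edited</p>"
print(changed_sections(sections, saved))  # only "s2" needs to be sent
```

With this shape, the request size follows the size of the edit rather than the size of the article.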

Santhosh's patch to stop requiring draft parameters should prevent cases where the issue has been misreported by users who have bookmarked or otherwise used a URL without the draft parameter.

The problem for me only occurred while translating an extremely large article. Since then, no problem. But I'd like to echo the encouragement to save already published drafts. People publish incompletely translated articles for a variety of reasons and would like to return to them to fully translate.

With regard to my earlier guesses, we now have some numbers (grouped by source title):

  1. The overall number of saving timeouts has dropped by almost 60%, from an average of ~5 per day to ~2 per day
  2. The average payload size in timeouts has decreased by almost 80%, from ~150 KiB to ~30 KiB
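
For reference, the percentages follow from the rounded figures quoted above. A small check, with the helper name `percent_reduction` chosen here for illustration:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from before to after, as a percentage."""
    return (before - after) / before * 100


print(percent_reduction(5, 2))     # saving timeouts per day
print(percent_reduction(150, 30))  # average payload size in KiB
```

The rounded inputs give 60% and 80%, which is consistent with the "almost" figures if the unrounded averages are slightly different.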
santhosh moved this task from In Progress to Done on the LE-CX7-Sprint 4 board. Dec 22 2015, 6:43 AM
santhosh closed this task as Resolved. Dec 22 2015, 6:55 AM