Spike: Investigate Errors Upon First Load in WWT [8 hours]
Closed, ResolvedPublic

Description

As a PM, I want to know whether we can identify why WWT often fails to load on the first attempt, and whether there are potential solutions or workarounds, so that we can offer a more inviting user experience.

Acceptance Criteria:

  • Investigate potential reasons why the WWT tool tends to have API errors upon first load
  • Determine if there are certain types of errors that are more likely to occur upon first load
  • Provide recommendations (if any) that we can implement so that users are less likely to see API errors upon first load

Notes on Behavior:

  • When the user first enables WWT, it is very common for the information bar to load rather quickly and display an error message. If the user then tries again (for example, if they refresh and then click on the WWT link again), the tool often loads correctly the second time.

[Attached screenshot: Screen Shot 2019-09-12 at 12.16.10 PM.png]

Event Timeline

ifried renamed this task from Spike: Investigate Why First Loads Often Errors in WWT to Spike: Investigate Errors Upon First Load in WWT. (Sep 9 2019, 2:11 AM)
ifried updated the task description.
ifried renamed this task from Spike: Investigate Errors Upon First Load in WWT to Spike: Investigate Errors Upon First Load in WWT [8 hours]. (Sep 12 2019, 5:15 PM)
ifried moved this task from Needs Discussion to Up Next (May 20-June 3) on the Community-Tech board.

This happened to me just now while I had the dev tools open. It was the "Requested data is not currently available in WikiWho database. It will be available soon." error, coming from WikiWho. It worked on the second attempt, consistent with the issue reported here. I'm not sure it's always this error, though: this time there was console output, whereas when I encountered the issue before, opening the dev tools afterwards showed nothing (and the console output should still have been there).

Either way, the "not available" error evidently can happen, so perhaps we should check for it and schedule a retry a second or two later?

One other error condition I have seen is where WhoColor identifies a revision as vandalism, returning a 200 response with payload like:

{
  "info": "Requested revision (777478614) is detected as vandalism by WikiWho.",
  "success": false,
  ...
}

Try loading WWT on https://en.wikipedia.org/w/index.php?title=Johnny_Restivo&oldid=777478614.

I believe WikiWho has various heuristics for deciding whether something is vandalism, including whether the revision's size differs from the previous revision by more than 40%. I believe the above revision is reverting some vandalism and is caught by this heuristic.
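Because both failures mentioned so far arrive as 200 responses with success: false, a client has to inspect the payload rather than the HTTP status. The following is only a minimal sketch of that idea, not code from WWT: the function name classifyWhoColorPayload and the category labels are hypothetical, the substring checks are based solely on the two info messages quoted in this thread, and treating any payload without success: false as a success is an assumption.

```javascript
// Hypothetical helper: sort a WhoColor payload into a coarse category.
// The substrings come from the two "info" messages quoted in this task;
// other failure messages may exist and would fall through to 'unknown'.
function classifyWhoColorPayload( payload ) {
	// Assumption: anything not explicitly marked as failed is a success.
	if ( payload.success !== false ) {
		return 'ok';
	}
	const info = typeof payload.info === 'string' ? payload.info : '';
	if ( info.includes( 'detected as vandalism' ) ) {
		// Retrying is unlikely to help here.
		return 'vandalism';
	}
	if ( info.includes( 'not currently available' ) ) {
		// The data "will be available soon", so a retry may succeed.
		return 'refresh';
	}
	return 'unknown';
}
```

Only the 'refresh' category looks like a candidate for an automatic retry; the vandalism case would presumably fail again however many times it is requested.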

Nice find! I wonder what the logic is. I'm guessing here WikiWho wouldn't be capable of providing attribution, because the previous revision was deleted.

At any rate, I hope that a large insertion of vandalism (that isn't deleted/suppressed) will normally provide attribution data, because we'd want to know who wrote the vandalism.

Also, I am not sure what happens when we hit WikiWho API's request limit. From https://api.wikiwho.net/:

Currently, there is a limit of 2000 requests/day for unregistered users, and also a 60 requests/minute limit for all users.

This shouldn't be a problem after T231492 ("Add a tool to proxy API requests to WhoColor API") is resolved.

This issue is proving annoying to work on, because failing on the first load seems to be much less frequent now than it was a couple of weeks ago.

Anyway, PR is ready for review: https://github.com/wikimedia/WhoWroteThat/pull/52

It re-requests up to three times, with a second between each. No UI feedback is given (i.e. the pending state just continues). I haven't been able to find a page that fails in the "refresh" way more than twice, but if that happens the final error state will still be the "refresh" error.

For me, it retries up to 4 times. Possibly an off-by-one error? I don't think this is a major problem.

Otherwise, retry behaviour is as Sam describes. The only feedback is in the JavaScript console, except when the final retry also returns a "refresh" error, in which case the user sees "API Error: Please refresh the page or try again later."

Other error behaviour is as it was implemented in T226760. So, for example, a "refresh" error followed by a "503 Service Unavailable" on the retry will still show "API Error: Please contact us about this issue."

Four retries seems fine from my perspective. What's the interval? Does it degrade or backoff?

The interval is about 1 second as Sam says (although I did not time this accurately).

Depends what you mean by "degrade" and "backoff".

@dom_walden Thanks for that detail. As for backoff, I was thinking that in other retry scenarios I've created, the delay between tries gets longer and longer the more times you try. This is in an effort to help the downstream system recover if it's under strain. I'm assuming we aren't doing that.

"For me, it retries up to 4 times. Possibly an off-by-one error? I don't think this is a major problem."

Yes, I was wrong: it actually says 4 in the code.

"... the delay between tries gets longer and longer the more times you try. This is in an effort to help the downstream system recover if it's under strain. I'm assuming we aren't doing that."

The interval between retries does get longer: it's the retry number in seconds: setTimeout( resolve, 1000 * retry ), so it starts at 1 second and goes up.
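For anyone reading this later, here is a rough sketch of the retry behaviour as described in this thread. It is not the code from the PR. Only the 1000 * retry delay and the limit of 4 come from the comments above; the names fetchWithRetry, isRefreshError, fetchOnce and wait are hypothetical, the "refresh" check is a guess based on the error text quoted earlier, and reading "4" as four re-requests after the initial one is an assumption.

```javascript
// Assumption: 4 re-requests after the initial request.
const MAX_RETRIES = 4;

// Promise-based delay, built on the setTimeout call quoted above.
function sleep( ms ) {
	return new Promise( ( resolve ) => setTimeout( resolve, ms ) );
}

// Hypothetical check for the "refresh"-type failure, based on the
// "not currently available" message quoted earlier in this task.
function isRefreshError( payload ) {
	return payload.success === false &&
		typeof payload.info === 'string' &&
		payload.info.includes( 'not currently available' );
}

// fetchOnce: caller-supplied function returning a Promise of the payload.
// wait: injectable delay function, so the backoff can be tested without
// real timers.
async function fetchWithRetry( fetchOnce, wait = sleep ) {
	let payload = await fetchOnce();
	for ( let retry = 1; retry <= MAX_RETRIES && isRefreshError( payload ); retry++ ) {
		// Linear backoff: 1 second, then 2, then 3, then 4.
		await wait( 1000 * retry );
		payload = await fetchOnce();
	}
	// May still be a "refresh" error if every attempt failed;
	// the caller decides what to show in that case.
	return payload;
}
```

With these numbers, a request that never succeeds waits 1 + 2 + 3 + 4 = 10 seconds in total before the final error is surfaced. Note that the delay grows linearly rather than exponentially.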

@Samwilson Thanks for clarifying that. I'm glad you put the decay in there.

For future reference, here is how anyone else can test error behaviour in the WhoColor API.

I used mitmproxy as an intercepting proxy for the browser, together with a script that randomly modifies the responses from the WhoColor API to simulate error conditions.

  1. After installing mitmproxy (I installed it from Debian's repos), run mitmdump -s random_modify_whocolor.py (using the attached script).
  2. This launches the proxy, which binds (by default) to all interfaces on port 8080.
  3. Change your browser's proxy settings to use mitmproxy (e.g. http://127.0.0.1:8080). I could not get this to work in Firefox because of a weird error.
  4. Use WWT as normal.

You should be able to modify the script to change the probabilities of the various error/success conditions, or add new error/success conditions.

This is very cool. Thanks for sharing @dom_walden!

ifried moved this task from Product sign-off to Done on the Community-Tech (Kanban-Q2-2019-20) board.

We're now seeing huge improvements in the user experience. WWT now rarely fails on first load. Loading takes longer on average, since there are often multiple attempts, but this is highly preferable because the end result is typically a successful load. I'm marking this work as Done.