Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails
Open, High · Public

Description

T106895 exposed a large hole in thumbnail monitoring. Even though this page was filled with broken thumbnails, no SMS or IRC spam was sent by any bots. In general, if that happens, it indicates a pretty serious problem with upload or new thumbnails.

Event Timeline

aaron created this task. Jul 25 2015, 10:31 AM
aaron raised the priority of this task from to Needs Triage.
aaron updated the task description.
aaron added a project: acl*sre-team.
aaron added a subscriber: aaron.
Restricted Application added subscribers: Matanya, Aklapper. Jul 25 2015, 10:31 AM
Peachey88 renamed this task from Monitor https://en.wikipedia.org/wiki/Special:ListFiles for non 200 HTTP statuses in thumbnails to Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails. Jul 25 2015, 10:39 AM
Peachey88 set Security to None.
Peachey88 added a subscriber: MZMcBride.
Restricted Application added a project: Multimedia. Jul 25 2015, 10:39 AM
Restricted Application added a subscriber: Steinsplitter.
Krenair added a subscriber: Krenair.

@mark is this worthy of a catchpoint alert? It seems like it may be a good external sanity check.

chasemp triaged this task as High priority. Jul 27 2015, 6:22 PM

In general, if that happens, it indicates a pretty serious problem with upload or new thumbnails.

There could be edge cases where this happens and it's not a problem, for example if someone did a mass upload of files that could not be rendered. That said, we've fixed most of the broad "file could not be rendered" issues, so that seems fairly unlikely nowadays.

Jdforrester-WMF moved this task from Untriaged to Next up on the Multimedia board. Sep 4 2015, 6:43 PM
Dzahn added a subscriber: Dzahn. Nov 11 2015, 1:28 AM

@mark is this worthy of a catchpoint alert? It seems like it may be a good external sanity check.

@chasemp Is that something that is possible in catchpoint though? As i understand it https://en.wikipedia.org/wiki/Special:ListFiles would always be 200 but we'd have to check if any thumbnails within the page are not being loaded.
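
To illustrate the distinction raised here (the page itself answers 200 while the thumbnails inside it may not), here is a minimal Python sketch that pulls thumbnail URLs out of a page's HTML and probes each one. It is not how Catchpoint works internally. The `/thumb/` URL filter, the function and class names, and the use of plain HEAD requests are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.error
import urllib.request


class ThumbnailCollector(HTMLParser):
    """Collect absolute URLs of <img> tags that look like thumbnails."""

    def __init__(self, base_url, marker="/thumb/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumption: thumbnail URLs contain "/thumb/"
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and self.marker in src:
            # urljoin also resolves protocol-relative URLs ("//upload...")
            self.urls.append(urljoin(self.base_url, src))


def extract_thumbnails(html, base_url):
    """Return the thumbnail URLs referenced by the given HTML."""
    collector = ThumbnailCollector(base_url)
    collector.feed(html)
    return collector.urls


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or None on a network-level failure."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "thumb-check-sketch/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def broken_thumbnails(html, base_url, fetch=fetch_status):
    """Return (url, status) pairs for thumbnails that did not answer 200."""
    results = []
    for url in extract_thumbnails(html, base_url):
        status = fetch(url)
        if status != 200:
            results.append((url, status))
    return results
```

The `fetch` parameter is injectable so the logic can be exercised without network access; a real check would also need rate limiting and a policy for redirects, which this sketch leaves out.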

Catchpoint can request the page and then all child resources and report on their failure. We already do this for the pages we watch via Selenium and Chrome, alerting on 10% or 10 child resource failures (images etc.). I believe we could do something similar in this instance if we want. It does require a sophisticated enough check, though.
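
The "10% or 10 child resource failures" rule can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not Catchpoint's configuration: the function name is hypothetical, and treating every non-200 status as a failure and using inclusive (`>=`) comparisons are assumptions.

```python
def should_alert(statuses, max_failures=10, max_failure_ratio=0.10):
    """Decide whether a page load should alert.

    statuses: HTTP status codes of the page's child resources
              (None for a resource that never answered).
    Alerts when at least max_failures resources failed, or when the
    failed share reaches max_failure_ratio. Both comparisons being
    inclusive is an assumption of this sketch.
    """
    statuses = list(statuses)
    if not statuses:
        return False  # nothing was loaded, so there is nothing to judge
    failures = sum(1 for status in statuses if status != 200)
    return (
        failures >= max_failures
        or failures / len(statuses) >= max_failure_ratio
    )
```

With these defaults, 5 failures out of 100 resources stay quiet, while 1 failure out of 10 (10%) or 10 failures out of 200 both trigger.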

Restricted Application added a project: Commons. Nov 11 2015, 2:26 PM
Steinsplitter moved this task from Incoming to Backlog on the Commons board. Nov 12 2015, 12:11 PM
Dzahn added a comment. Dec 23 2015, 8:10 PM

Sounds good, but it looks like we'd still need an ACK from Mark because this actually costs money.

mark added a comment. Dec 2 2016, 12:10 PM

I'd say, let's set it up and see how much it costs. We can also vary the check frequency.

Restricted Application added a subscriber: Poyekhali. Dec 2 2016, 12:10 PM

@chasemp how would we set up the check and gauge how much it costs?

chasemp added a comment. Edited Jul 24 2017, 2:04 PM

@fgiunchedi it depends on what we want to watch. We already have a number of emulated/Chrome checks that could double as thumbnail canaries. If there is a particular page (or pages) that would demonstrate this failure early, then an additional check makes sense to me. Right now it's more or less all project homepages. One good thing there is that we run both cached and uncached checks.

Cost is such a pain with Catchpoint. If we add a new check, we could get away with an emulated check for child content awareness, and that's 1 point per check per concurrent source (IIRC). For May and June we had an underrun in point usage (meaning our monthly usage was less than projected from the annual usage estimate), which we could play with here to see how it affects ongoing usage. I don't have the numbers from the last contract renewal (contract cost / points = cost per point) at the moment, though.

Check types for reference: [screenshot not preserved]

Then we can alert based on referenced resources failing, like: [screenshot not preserved]

One big red flag here is that it wouldn't differentiate between thumbnails and any other child content. In theory we shouldn't have any mass content failure on our pages anyway. That said, most of this was alerting at one time and did find a few serious externally visible issues; the alerting has since been disabled as too verbose.

One option for this is to use Selenium-style automation to make a more contextually specific thumbnail test, like the one we have that demonstrates a basic login and edit:

// Step - 1
open("https://en.wikipedia.org/w/index.php?title=Special:UserLogin")
setStepName("Open en.wp.o/Special:UserLogin")
type("//*[@id='wpPassword1']", "barpass")
type("//*[@id='wpName1']", "foouser")

// Step - 2
clickAndWait("//*[@id='wpLoginAttempt']")
setStepName("Login to en.wp.o")

// Step - 3
open("https://en.wikipedia.org/w/index.php?title=User:foouser%29/sandbox&action=edit")
type("//*[@id='wpTextbox1']", "${timeEpoch()} Testid: ${testid} Location: ${locationName} cityName: ${cityName} ispName: ${ispName}")
setStepName("Edit user sandbox")

// Step - 4
clickAndWait("//*[@id='wpSave']")
setStepName("Save user edit")

The big downside here is that check complexity probably means failure-mode complexity for the check itself, and /each step/ of the automation costs a single point.

thoughts:

  1. Make a catchpoint-alerts list and get some of the failures we can already see sent to those who want them
  2. Tweak our emulated and Chrome checks for what we think will surface this thumbnail issue (and probably others). I.e. I think most now treat 10 failed child elements, or 10% of child elements failing, as bad.
  3. Figure out which web page makes sense to watch for emergent thumbnail problems and add that check. Then watch the point usage for a bit to see how it will fare.

Chatted with @chasemp about this today, the easiest way forward seems to be setting up an emulated check with thresholds for failure to load content. https://commons.wikimedia.org/wiki/Special:NewFiles is the easiest target as it is full of recent thumbnails that should just work.

The test lives at https://portal.catchpoint.com/ui/Content/Tests/TestDetail.aspx?id=231161 and I've set it to run every 15 min from NA only, for an estimated impact of 2.25% of points. @chasemp does it look good to you?

@fgiunchedi ack, we don't have a ton of checks running w/ concurrency but 15m intervals seem sane. Let's let it ride for a while and check up on ongoing points usage.
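
For watching points usage, the arithmetic behind an interval choice can be sketched as follows. This assumes the "1 point per check per concurrent source" pricing recalled earlier in the thread (itself flagged as IIRC) and a 30-day month; the function names and the budget parameter are hypothetical, and no real contract figures are used.

```python
def monthly_points(interval_minutes, sources=1, points_per_run=1, days=30):
    """Estimate points consumed per month by one check.

    Assumes each run costs points_per_run per concurrent source,
    which follows the pricing recalled in this thread, not a
    documented Catchpoint formula.
    """
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * days * sources * points_per_run


def share_of_budget(points, monthly_budget):
    """Fraction of a (hypothetical) monthly point budget used by a check."""
    return points / monthly_budget
```

Under these assumptions a 15-minute check from a single source runs 96 times a day, or 2880 runs in a 30-day month. Halving the frequency halves the estimate, which is the lever mentioned above for controlling cost.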

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. Aug 3 2017, 8:13 AM
fgiunchedi moved this task from Radar to Backlog on the User-fgiunchedi board. Jan 2 2019, 10:38 AM