
Unconference: logspam - how do we have less / surface it to the right people?
Closed, Resolved · Public

Description

A high volume of production errors in logs makes it harder for deployers (and teams with new code in production) to tell if new code is broken.

I believe that error logs should typically be quiet enough that any real error stands out immediately. To state this another way: Even if an error in production doesn't manifest as breakage for an end-user of the software, it should still be treated as broken code in production because it meaningfully reduces our ability to detect and triage things that do break for users.

How can we improve our logspam situation in both technical and social terms? What are teams already doing about the problem that I'm just not aware of?

https://etherpad.wikimedia.org/p/WMTC19-T238250

Event Timeline

brennen created this task. · Nov 13 2019, 7:42 PM
Restricted Application added a subscriber: Aklapper. · Nov 13 2019, 7:42 PM
brennen updated the task description. · Nov 13 2019, 9:49 PM
hashar updated the task description. · Nov 15 2019, 6:36 PM
brennen added a comment (edited). · Nov 15 2019, 7:53 PM

Notes from etherpad:

Unconference: logspam - how do we have less / surface it to the right
people?

Session leads: Brennen, Tyler

https://phabricator.wikimedia.org/T238250

A high volume of production errors in logs makes it harder for deployers (and teams with new code in production) to tell if new code is broken.  I believe that error logs should typically be quiet enough that any real error stands out immediately. To state this another way: Even if an error in production doesn't manifest as breakage for an end-user of the
software, it should still be treated as broken code in production because it meaningfully reduces our ability to detect and triage things that do break for users.

Questions for discussion

  • Question 1: How can we improve our logspam situation in:
    • Technical terms?
    • Social terms?
  • Question 2: What are teams already doing about the problem that might not be widely known?

Discussion

  • Brennen: Are people watching for errors in production from their code?
  • Volans: if we reduce the noise in logspam it will be more efficient
    • the delta in logs from deployment will be more meaningful
  • Lars: Timo set up the new errors dashboard and Mukunda built Phatality, which makes this easier -- it's still work, and it's still manual work, and it should be automated away
  • Niklas: for translatewiki.net I have a script that pipes the error log to IRC, which is annoying but effective -- every time there is something new, I file a new task (see the first sketch after these notes)
  • Lars: we have to annoy our developers
  • Volans: how many people would like to be notified of logspam?
  • Brennen: in terms of -- you own this code... so?
  • Greg: assuming we have owners, then presumably that's just programming if the error comes from your code -- what happens now is manual -- when there's an issue we use Phatality to report it
  • addshore: manually figuring out which team to add -- sounds like there's a need for owners of specific log channels
  • Timo: we have statsd metrics for some logstash channels, so there's an error count in Grafana -- we have an Icinga alert that spams #-operations (see the statsd sketch after these notes)
  • addshore: coming from the Wikidata side, we'd love to have alerts for our log channels
  • Volans: if it doesn't have too many false positives, the Icinga alert could page
  • Timo: it does have a lot of false positives -- we've got it pretty far down. We have numerous minutes per day where there are 0 fatal errors
  • Brennen: is there social effort needed?
    • Addshore: we try to look at these things, but there should be alarms for their logs
    • Timo: if you're not feeling the pain yourself you're less incentivized to fix it, which falls on RelEng because of the train
      • one of my long-term visions is that every team with patches in the train has a Logstash dashboard that is reviewed after rollout, with an explicit check-in on stability
      • another thing: most of the work and triaging is done by train operators, who are not MediaWiki experts. There is usually someone else more qualified to triage
    • Brennen: Do we modify the train process to add requirements for people with patches going out?
      • Timo: I think we move slowly towards that -- a sign-off idea -- associated teams
      • Addshore: we'll never not be involved in the process
      • Antoine: should we stop doing the train... we arbitrarily cut a branch for all extensions/repos. Every time, we start pushing code that potentially affects every single team
        • what if instead we did it per feature? e.g. one slot for the Growth team, a deployment slot for Wikidata material, etc.
        • Addshore: there's a big dependency tree of things there that makes this hard and isn't easy to resolve
        • Greg: we did this-ish (we had weekly windows for the different teams that wanted them) and we worked really hard to get those teams onto the train
        • Krinkle: the idea is attractive, though difficult by now since teams have different priorities. But we could move to a push model in which teams propose their features for inclusion in the next train, instead of the current "deploy it all"
        • Brennen: if your team has code going out, that's a commitment from your team to fix issues
        • timo: what is a responsible alternative?
        • addshore: with our current dev patterns, you have to deploy when you merge
        • lars: my personal interest is continuous deployment, not the train; with CD we wouldn't be deploying Tuesday through Thursday, it'd be as soon as things are +2'd. We need an automated way to tell if there's a problem so we can roll back automatically.
          • the idea I have: if there's a new log message, we roll back, then file a ticket and assign the ticket to the patch that was just deployed (as a first approximation) -- see the rollback sketch after these notes
          • if it hasn't been reacted to within 3/30/60 minutes, we alert the next level of people who do things, e.g. RelEng starts calling people
        • Timo: I like this model, but what would we roll back to? Everything in Gerrit since the last master branch?
      • lars: Git master should be what is run in production
      • addshore: there are probably lots of cases where that would work and lots of cases where it wouldn't 
      • timo: there are lots of folks deploying things all at once, do we want to revert all those patches?
      • lars: yes.
      • addshore: for wikidata, we deployed something, but it didn't start breaking for some time afterwards
      • timo: new warnings can get introduced by a deployment, user behavior, or a cronjob
  • brennen: lars has a point I want to bring back in -- if there's something we can do in the immediate future -- what's the first small step?
    • dj: you have to remind people that their code is going out
    • brennen: so people just don't know when their code is going out
    • addshore: at the moment you create phab tickets for the logs -- but my email setup means I don't see it
    • brennen: as a deployer, I'm not often sure what the consequences of a particular error are unless it's exploding
    • liw: we have branch.py -- I think that script should email authors and +2ers (see the last sketch after these notes)
    • brennen: an opt-in feature to get emails
    • timo: require a team to sign off on Wednesday and Thursday -- cut at a different time -- Monday evening and Tuesday morning are stressful, as soon as the branch is cut
    • addshore: I think the sign-off thing needs more thought
    • timo: the idea is that we'd be blocked by default
    • liw: all changes that go into the train go through Gerrit -- maybe a convention for who should be notified and in what way
    • brennen: do I know that there's a human watching, in some way, for possible breakage?
    • liw: running the train in a single timezone is hard
    • timo: there are a number of things -- ResourceLoader, for example -- that get cached by Varnish, so they don't blow up the error log, but there is definitely breakage
    • dj: that's similar to UI breakdowns that community folks find and reach out to us about
    • timo: blocking the train isn't necessary as long as you have someone committed to fixing it before the next train
  • brennen: final thoughts?
    • liw: don't make bugs
    • volans: all this will change with CD
    • addshore: automated creation of phab tickets for log messages
  • Idea: have sign-off by teams when doing the train.
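
A minimal sketch of the kind of script Niklas describes for translatewiki.net: follow an error log, collapse each line to a signature, and report signatures that have not been seen before. The log path and the normalisation rules are assumptions, and delivery is left as plain stdout so the output can be piped to whatever IRC relay is already in place.

```
#!/usr/bin/env python3
"""Follow an error log and report error signatures not seen before.

Sketch only: the log path and normalisation rules below are assumptions;
new signatures are printed to stdout so the output can be piped to an
IRC relay (or swapped for a webhook call).
"""
import re
import time

LOG_PATH = "/var/log/mediawiki/error.log"  # hypothetical path


def normalise(line):
    """Collapse variable parts (timestamps, numbers, addresses) so repeats
    of the same error map to a single signature."""
    line = re.sub(r"^\[[^\]]*\]\s*", "", line)      # leading [timestamp]
    line = re.sub(r"0x[0-9a-fA-F]+", "ADDR", line)  # hex addresses
    line = re.sub(r"\b\d+\b", "N", line)            # other numbers
    return line.strip()


def follow(path):
    """Yield lines appended to the file, tail -F style (no rotation handling)."""
    with open(path, "r", errors="replace") as fh:
        fh.seek(0, 2)  # start at the end: only report errors from now on
        while True:
            line = fh.readline()
            if not line:
                time.sleep(1)
                continue
            yield line


def main():
    seen = set()
    for line in follow(LOG_PATH):
        sig = normalise(line)
        if sig and sig not in seen:
            seen.add(sig)
            # A signature we have not seen before: this is the moment to
            # ping IRC and, as Niklas says, file a task for it.
            print("NEW ERROR: " + sig, flush=True)


if __name__ == "__main__":
    main()
```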
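
The statsd/Grafana/Icinga pipeline Timo mentions starts with nothing more than a counter per log channel; a statsd counter is a UDP datagram in the form "name:value|c". A small sketch, with the metric naming scheme and the statsd host as placeholders rather than the production configuration:

```
import socket

STATSD = ("statsd.example.org", 8125)  # placeholder statsd host:port

_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def count_channel_error(channel, n=1):
    """Bump a per-channel error counter in statsd.

    The metric name below is an illustrative scheme, not the one used in
    production; the "name:value|c" payload is the standard statsd counter
    format. Grafana graphs the resulting rate and Icinga alerts on a
    threshold over it.
    """
    payload = "logstash.channel.%s.errors:%d|c" % (channel, n)
    _sock.sendto(payload.encode("utf-8"), STATSD)


# Example: attribute an error line to its channel, then bump its counter.
count_channel_error("wikidata")
count_channel_error("exception", 3)
```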
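
Lars's rollback idea as a control-flow sketch only: diff the set of log signatures before and after a deploy, roll back on anything new, file a task against the deployed patch, and escalate at 3/30/60 minutes if nobody reacts. Every function marked as a stub below is hypothetical and stands in for scap, Logstash, and Phabricator integration that is not shown.

```
import time

ESCALATION_MINUTES = (3, 30, 60)  # from the discussion: 3/30/60 minutes


# --- hypothetical stubs: real versions would call scap, Logstash, Phabricator ---
def deploy(change):
    """Stub: push the change to production."""

def rollback(change):
    """Stub: revert production to the previous state."""

def current_log_signatures():
    """Stub: distinct normalised error messages currently seen in the logs."""
    return set()

def file_task(change, new_signatures):
    """Stub: create a task blaming `change` for `new_signatures`; return its id."""
    return "T000000"

def task_acknowledged(task_id):
    """Stub: has a human picked the task up (claimed, triaged, commented)?"""
    return False

def escalate(task_id, level):
    """Stub: notify the next level of people, e.g. level 3 = RelEng starts calling."""


def deploy_with_watchdog(change, settle_seconds=300):
    before = current_log_signatures()
    deploy(change)
    time.sleep(settle_seconds)             # let real traffic reach the new code
    new = current_log_signatures() - before
    if not new:
        return                             # quiet logs: the deploy stands

    # Any new log message is treated as breakage: undo first, investigate later.
    rollback(change)
    task = file_task(change, new)          # first approximation: blame the deployed patch

    start = time.monotonic()
    for level, minutes in enumerate(ESCALATION_MINUTES, start=1):
        while time.monotonic() - start < minutes * 60:
            if task_acknowledged(task):
                return
            time.sleep(30)
        escalate(task, level)
```

Timo's question about what to roll back to is the hard part this sketch glosses over: the rollback stub assumes there is a single previous known-good state.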
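
liw's branch.py suggestion, sketched independently of the real branch.py: list the authors of commits that are new in the train branch with plain git and email them that their change is riding the train. Looking up the +2ers as well would need the Gerrit API, which is not shown. The repository path, branch names, sender address, and SMTP host are placeholders.

```
import smtplib
import subprocess
from email.message import EmailMessage

REPO = "/srv/mediawiki/core"                # placeholder checkout
OLD_BRANCH = "origin/wmf/previous"          # placeholder: last week's train branch
NEW_BRANCH = "origin/wmf/next"              # placeholder: the branch being cut
SMTP_HOST = "localhost"                     # placeholder mail relay
SENDER = "train-notifications@example.org"  # placeholder sender address


def authors_in_range(repo, old, new):
    """Unique author addresses of commits in `new` but not in `old`."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%aN <%aE>", old + ".." + new],
        check=True, capture_output=True, text=True,
    ).stdout
    return sorted({line.strip() for line in out.splitlines() if line.strip()})


def notify(addresses, branch):
    """Send one email telling authors their change is on the new branch."""
    msg = EmailMessage()
    msg["Subject"] = "Your change is riding the train branch " + branch
    msg["From"] = SENDER
    msg["To"] = ", ".join(addresses)
    msg.set_content(
        "A change you authored is included in the branch about to be deployed.\n"
        "Please keep an eye on the error logs after the rollout."
    )
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    recipients = authors_in_range(REPO, OLD_BRANCH, NEW_BRANCH)
    if recipients:
        notify(recipients, NEW_BRANCH)
```

Brennen's opt-in idea would fit here as a filter on the recipient list.
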
TheDJ added a subscriber: TheDJ. · Nov 15 2019, 9:41 PM

@brennen: Thank you for proposing and/or hosting this session. This open task only has the archived project tag Wikimedia-Technical-Conference-2019.
If there is nothing more to do in this very task, please change the task status to resolved via the Add Action...Change Status dropdown.
If there is more to do, then please either add appropriate non-archived project tags to this task (via the Add Action...Change Project Tags dropdown), or make sure that appropriate follow up tasks have been created and resolve this very task. Thank you for helping clean up!

brennen closed this task as Resolved.Apr 14 2020, 6:21 PM
brennen claimed this task.