
Create a separate 'mwdebug' cluster
Open, Needs Triage, Public

Description

Problem

I am running a test (T261009) on mwdebug1001 that has severely degraded performance, and this has triggered alerts. This has happened in the past as well: I am trying to test something, and we get alerts from the debug servers. So the question is: does it make sense to stop mwdebug* servers from contributing to latency metrics and error metrics, and from firing alerts of that sort?

Proposal
Create a separate "debug" mediawiki cluster, which will have its own dashboards that engineers can consult when testing changes. In other words, remove mwdebug* hosts from the "appserver" cluster and create a new cluster for them. The goal is to decouple mwdebug* metrics and alerts from production.

Pros:

  • Better visibility when testing
  • Errors on mwdebug* will not trigger alerts
  • Debug metrics will not contribute to production metrics

Cons:

  • Scap needs to adapt to this, so that when one is testing on mwdebug*, an elevated error rate there will not prevent scap from moving a deployment forward (Release-Engineering-Team)

Actionables:

  • implement X-Analytics: debug=1 on the analytics side
  • implement X-Analytics: debug=1 in VCL

TBA
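The two X-Analytics actionables boil down to tagging debug traffic with a key=value pair at the edge. A rough sketch of that logic, in Python rather than VCL, with an invented function name and a plain-dict stand-in for request headers (the real change would live in the Varnish VCL):

```python
def add_debug_to_x_analytics(headers):
    """Append debug=1 to the X-Analytics header when the request carries
    X-Wikimedia-Debug, preserving any existing key=value pairs.

    Illustrative only: header handling here mimics the semicolon-delimited
    k=v format of X-Analytics described in this task."""
    if "X-Wikimedia-Debug" not in headers:
        return headers
    existing = headers.get("X-Analytics", "")
    pairs = [p for p in existing.split(";") if p]
    # Only add the key if nothing upstream has already set it.
    if not any(p.startswith("debug=") for p in pairs):
        pairs.append("debug=1")
    headers["X-Analytics"] = ";".join(pairs)
    return headers
```

Downstream, the analytics side can then filter on `debug=1` without hardcoding hostnames.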

Event Timeline

jijiki created this task. Sep 7 2020, 12:15 PM
Restricted Application added a subscriber: Aklapper. Sep 7 2020, 12:15 PM
jijiki renamed this task from "Should mwdebug servers contribute to cluster latency?" to "Should we create a separate 'mwdebug' cluster?". Sep 8 2020, 10:11 AM
jijiki updated the task description.
lmata added a subscriber: lmata. (Edited) Sep 8 2020, 3:23 PM

@jijiki should we add this to our backlog or is this tagged mainly for our viewing benefit? A quick team conversation has determined that we too agree this is a good thing. Let me know if/how we can assist.

jijiki updated the task description.

@lmata I can start the work and ask for help from observability for reviews and questions, thank you!

lmata moved this task from Inbox to Radar on the observability board. Sep 9 2020, 3:29 PM

Sounds good; I will move this to Radar. Let me know when/if we can be of assistance :-)

Is this meant for folks deploying? Are we going to use these like we use the current mwdebug hosts? Or is this supposed to be for SRE to test changes to things like php-fpm runtime?

jijiki added a subscriber: LarsWirzenius. (Edited) Sep 9 2020, 4:51 PM

@thcipriani We will continue to use mwdebug* as we do (both for developers and SREs); the existing hosts will join the new cluster. The goal here is to remove mwdebug* hosts from production metrics and alerts.

I don't remember offhand whether scap has any dependencies on mwdebug* hosts, but it would make sense to add functionality like "deploy on mwdebug* but ignore any errors". Moreover, to my knowledge scap checks an error rate that (possibly) includes errors coming from mwdebug* hosts; I believe it makes sense to ensure mwdebug* errors are excluded. I will have a look with @LarsWirzenius, as I have not looked at the code for some time, so I may be wrong here :)
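The scap concern above (keeping mwdebug* errors out of the canary error-rate check) can be sketched roughly as follows. The function name, the per-host data shape, and the exclusion-by-hostname-prefix rule are all illustrative assumptions, not scap's actual implementation:

```python
def canary_error_rate(per_host_counts, exclude_prefix="mwdebug"):
    """Aggregate a canary error rate while ignoring debug hosts.

    per_host_counts maps hostname -> (error_count, request_count).
    Hypothetical sketch: scap's real canary check lives in the scap
    codebase and works differently."""
    errors = requests = 0
    for host, (e, r) in per_host_counts.items():
        if host.startswith(exclude_prefix):
            continue  # mwdebug* traffic must not block a deployment
        errors += e
        requests += r
    return errors / requests if requests else 0.0
```

With this shape, a test hammering mwdebug1001 cannot push the aggregate rate over whatever threshold gates the deployment.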

jijiki updated the task description. Sep 9 2020, 4:55 PM
Krinkle added a subscriber: Krinkle. (Edited) Sep 9 2020, 9:20 PM

If I understand correctly, from a non-SRE perspective, this proposal effectively means:

  • The $_SERVER['SERVERGROUP'] environment variable will be set to debug instead of appserver.
  • This means Logstash dashboards used by MediaWiki developers and RelEng (such as mediawiki-errors) will gain "debug" in the "server group" breakdown and make it easier to focus on those (in the mwdebug dashboard) or ignore them (in the new-errors dashboard) without having to hardcode individual host names.
  • This means the Grafana dashboards used by SRE for server/service health metrics will gain a new option "debug" in addition to ("appserver", "api_appserver", "jobrunner", "parsoid"), and mwdebug will no longer be mixed into appserver.
  • This means Icinga alerts based on Logstash/Prometheus that fire on raised MediaWiki exceptions, with thresholds per server group, will no longer fire for "appserver" when mwdebug servers have raised error levels.

I'm assuming the mwdebug hardware (actually, VMs in this case) won't change, and will still be accessible the same way as before, and otherwise function and behave the same.

Sounds good to me!

@jijiki I haven't ever looked at the parts of Scap that would be affected by this, but I can't imagine it's too hard to change or re-configure Scap to not use mwdebug for canaries when it looks at error rates when deploying. @thcipriani would know better, I'm sure.
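The alerting change in the bullets above amounts to evaluating thresholds per server group, so the new "debug" group can carry its own threshold or none at all. A minimal sketch, with invented function names and threshold values (real alerting is defined in Icinga/Prometheus rules, not application code):

```python
def groups_over_threshold(error_rates, thresholds, default=0.01):
    """Return the server groups whose exception rate exceeds their
    threshold. A threshold of None means 'never alert' for that group.

    Group names mirror the ones in the comment above; the numbers are
    invented for illustration."""
    firing = []
    for group, rate in error_rates.items():
        limit = thresholds.get(group, default)
        if limit is not None and rate > limit:
            firing.append(group)
    return sorted(firing)
```

With mwdebug split into its own group, `{"debug": None}` silences it entirely while "appserver" keeps its normal threshold.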

Milimetric added a subscriber: Milimetric.

x-wikimedia-debug shouldn't affect anything we do

jijiki added a comment. (Edited) Sep 10 2020, 4:17 PM

@Milimetric my question is, if I want to do a performance test from outside our network, and fire 700k requests towards our debug servers, does this affect and alter our analytics data? thank you!

Ah, I think I see. Let me describe how I understand these requests will work, which could be wrong, please correct if so:

  • requests will be to any of our domains, like ro.wikipedia.org
  • but they set x-wikimedia-debug which skips the varnishes and goes directly to the new cluster you propose

If these requests look like normal reading/editing, as in /wiki/Title, /w/index.php, /w/api.php, etc., then I think varnish still logs them as normal HTTP requests, even if it forwards them as cache misses? If that's true, then varnishkafka will put them in kafka and that means they'll end up in HDFS.

That's ok, 700k requests will be just a blip, we get that many every few seconds. So they won't upset any of our metrics, unless they meet the criteria to be pageviews. If the tests might look like real pageviews, then maybe we should also send something to ignore them on the X-Analytics header, and change the pageview definition accordingly.

If they're not pageviews, they may get loaded into turnilo for debugging or follow other pipelines we haven't yet thought of, but they'll probably be deleted after X days along with other webrequests. It might be smart to proactively put something in X-Analytics so we can filter them out later.

If the requests go to servers other than the MediaWiki app servers, then let's talk more. For example, if you're planning on testing eventlogging endpoints, EventGate endpoints, and so on.
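The pageview-definition change described above would, in essence, check X-Analytics before counting a request. A hypothetical sketch of that filtering step — the real pageview definition lives in the analytics refinery code, and both helpers here are invented for illustration:

```python
def parse_x_analytics(value):
    """Parse an X-Analytics header ('k1=v1;k2=v2') into a dict."""
    out = {}
    for pair in value.split(";"):
        if "=" in pair:
            k, _, v = pair.partition("=")
            out[k.strip()] = v.strip()
    return out

def is_countable_pageview(meets_pageview_criteria, x_analytics):
    """A request that otherwise meets the pageview criteria is skipped
    if tagged debug=1 (or explicitly pageview=0), per the discussion
    above. Sketch only; not the refinery's actual logic."""
    tags = parse_x_analytics(x_analytics)
    if tags.get("debug") == "1" or tags.get("pageview") == "0":
        return False
    return meets_pageview_criteria
```

This is why the tag matters even for low volumes: on low-traffic wikis, untagged synthetic pageviews would skew the metrics.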

jijiki moved this task from Next up 🥌 to Q2 2020 on the User-jijiki board. Sep 16 2020, 9:09 AM

@Milimetric my question is, if I want to do a performance test from outside our network, and fire 700k requests towards our debug servers, does this affect and alter our analytics data? thank you!

Ah, I think I see. Let me describe how I understand these requests will work, which could be wrong, please correct if so:

  • requests will be to any of our domains, like ro.wikipedia.org

Yes :)

  • but they set x-wikimedia-debug which skips the varnishes and goes directly to the new cluster you propose

What we are doing now on the caching layer with X-Wikimedia-Debug requests is fine.

If these requests look like normal reading/editing, as in /wiki/Title, /w/index.php, /w/api.php, etc, then I think varnish still logs them as normal HTTP requests, even if it forwards them as cache misses? If that's true, then varnishkafka will put them in kafka and that means they'll end up in HDFS. That's ok, 700k requests will be just a blip, we get that many every few seconds. So they won't upset any of our metrics, unless they meet the criteria to be pageviews. If the tests might look like real pageviews, then maybe we should also send something to ignore them on the X-Analytics header, and change the pageview definition accordingly. If they're not pageviews, they may get loaded into turnilo for debugging or follow other pipelines we haven't yet thought of, but they'll probably be deleted after X days along with other webrequests. It might be smart to pro-actively put something in X-Analytics so we can filter them out later.

Requests bearing the X-Wikimedia-Debug header pass through the caches, but they end up in varnishkafka and thus turnilo, just like you mentioned. The requests I am running have been extracted from webrequest_text, and I trust they look like pageviews. One run of 700k requests is a blip, but multiple runs over an extended period might be an issue, I believe (correct me if I am wrong). Would it be too much work to add an X-Analytics header like you propose?

Milimetric added a comment. (Edited) Sep 18 2020, 9:16 PM

Requests bearing the X-Wikimedia-Debug header pass through the caches, but they end up in varnishkafka and thus turnilo, just like you mentioned. The requests I am running have been extracted from webrequest_text, and I trust they look like pageviews. One run of 700k requests is a blip, but multiple runs over an extended period might be an issue, I believe (correct me if I am wrong). Would it be too much work to add an X-Analytics header like you propose?

Ok, if they look like pageviews then any amount of them is going to affect the metrics, especially on low-traffic wikis. So we want to exclude them. Adding a value to the X-Analytics header should be pretty easy; I think there are examples in other varnish code, and there's nothing special there, just a list of k=v. We already check for pageview=1 for mobile apps, so maybe a good idea would be to set pageview=0. It's quick to make the change, and then it would probably go out with our normal weekly deploy.

I'm not super good with varnish code, but maybe it's hard to set pageview=0 if some other code has already set pageview=1. In that case, it's totally fine to set something like debug=1; we can work off of anything like that.
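The "what if pageview=1 is already set" concern comes down to replacing, rather than appending, a key in the header. Sketched in Python (the production change would be made in VCL; this helper is invented):

```python
def set_x_analytics_key(header, key, value):
    """Set key=value in an X-Analytics-style header, replacing any
    existing value for that key. Illustrates why an unconditional
    replace sidesteps the ordering problem described above."""
    # Drop empty fragments and any prior setting of this key.
    pairs = [p for p in header.split(";") if p and not p.startswith(key + "=")]
    pairs.append(f"{key}={value}")
    return ";".join(pairs)
```

If overriding turns out to be awkward in VCL, a fresh key like `debug=1` (as suggested above) avoids the collision entirely.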

Krinkle added a comment. (Edited) Sep 18 2020, 11:15 PM

Based on what I've seen in the past, I believe local testing or bulk testing is generally done directly toward the debug or app server. Either against localhost from the same server, or from another production server. (See Debugging in production).

This does not involve the Varnish edge, and therefore does not affect analytics in any way. The X-Wikimedia-Debug header is also not needed in this case, although passing it is harmless and mostly a no-op.

When using the WikimediaDebug browser extension, or the XWD header from your own command line, for ad-hoc testing and pre-deploy testing, those requests do run against the Varnish edge and do get counted as regular page views, which seems fair and is indeed very low noise. For my deployments, I generally test functionality three times: 1) plain prod before deploy, 2) XWD against the staged change, 3) plain prod again post-deploy. The first and last would be counted either way (if they are page views).

@Milimetric that would be great, if it is not too much work, I would appreciate it. I will work on the varnish part of this. Thank you!

Based on what I've seen in the past, I believe local testing or bulk testing is generally done directly toward the debug or app server. Either against localhost from the same server, or from another production server. (See Debugging in production).

Sometimes we might need to run the test from, e.g., WMCS, which again goes through our cache layer. Since it is an easy fix, it is a nice-to-have :)

jijiki renamed this task from "Should we create a separate 'mwdebug' cluster?" to "Create a separate 'mwdebug' cluster". Sep 23 2020, 6:01 PM
jijiki updated the task description.

Change 629735 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] varnish: check for pageview=0 value in X-Analytics header

https://gerrit.wikimedia.org/r/629735

jijiki updated the task description. Fri, Oct 23, 3:56 PM
jijiki updated the task description.