Thu, Feb 13
Cumin should, for the most part, be upgradable to 10.4 with ease, as it only holds the client, not the server, which is why it is easier to upgrade (and the upgrade should be transparent).
Thanks, Jclark-ctl. No need to wait for us in this particular case, as the service was immediately moved elsewhere and the data considered irrecoverable (but the service has to return to this host eventually). As I said:
Wed, Feb 12
I am wondering whether 0-byte incremental backups are purged as an optimization, or whether another error caused those to disappear. If that is the case, we may have to change our strategy, which assumes that recent incrementals exist.
For some reason, the pool was updated, but not every volume. I ran "update pool from resource", and then "all volumes from pool", and it got applied.
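For reference, this is done interactively in bconsole; roughly the following (a sketch from memory, menu wording and numbering vary by Bacula version):

*update
  -> "Pool from resource"                               # re-reads the pool definition from bacula-dir.conf
*update
  -> "Volume parameters" -> "All Volumes from Pool"     # pushes the pool defaults onto existing volumes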
Apparently, the databases pool got enlarged, but the production one is still at 1 month to purge. It needs to be increased to 3 months as well; there is space available for that.
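The change itself should be a one-line edit of the director's Pool resource, something like this (a sketch only; the resource name and previous value are assumptions on my side):

Pool {
  Name = production
  ...
  Volume Retention = 90 days   # up from ~30 days; controls when volumes are pruned and recycled
}

followed by the update sequence above so that existing volumes pick up the new retention.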
I checked the users table and it looks good, but I will let Manuel close this.
The battery of db1095, out of warranty, is toasted. It would be nice not to throw away the whole server just for the RAID battery. Could we order one?
The eqiad backup service has been restored on a different host; now to handle the hw issues.
Sorry, I thought I had answered, but apparently I did not hit submit.
I predict s3 will take more time due to filesystem object overhead.
transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s3.2020-02-12--05-46-09.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s3
created /srv/sqldata.s2 on db1140 and ran:
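(The exact command was not captured here; presumably the s2 analogue of the s3 invocation above. The source host and snapshot timestamp below are placeholders, not the real values:)

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s2.<timestamp>.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s2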
Tue, Feb 11
While there are full logical dumps of the es hosts, those are not integrated with the general metadata and misc backups. This task, once T244884 is done, will integrate them under the same process, although with slightly different logic.
Mon, Feb 10
Just to be clear, this is not precisely a #1-priority thing; we didn't make much noise about it for a reason... but it got unearthed lately after a couple of performance-related requests from mw maintainers, as it could help as a self-service model rather than pinging DBAs every time. :-D
Given its age, is this task still relevant?
Sat, Feb 8
What about tuning HTTP frontend caching? The latest revision is very dynamic, but a hardcoded diff could perhaps be served stale more often, as I can only see it changing on deletion/page move?
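For example, something along these lines (hypothetical header values; this assumes the existing purge-on-delete/move mechanism would handle invalidation):

Cache-Control: public, s-maxage=86400   # diff between two fixed revision ids: long CDN TTL
Cache-Control: public, s-maxage=60      # diff involving the latest revision: keep it short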
Wed, Feb 5
yes, this seems to be an issue
I hope you understood this was a rant directed towards the machine/vendor only, and just for background info. I don't think we will get rid of it until we renew the hw, but of course no problem with trying an upgrade.
Tue, Feb 4
I might be able to tell you who to contact. [Disclaimer: please note DBAs take care only of server maintenance (things like "are databases up?", "do all nodes have the same data on them?", "do developers have enough servers?"), but cannot help with debugging mediawiki-related bugs (e.g. we don't handle Wikimedia-database-error tickets).]
Mon, Feb 3
We have been seeing instances of this issue on codfw (soft mobileapp endpoint timeout alerts) specifically in the last several weeks. I can file a new task if this is not the right place to follow up on this.
+1, there were Icinga errors as early as 11:15:
[2020-02-03 11:15:57] SERVICE ALERT: cp3057;Webrequests Varnishkafka log producer;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
I apologize for assuming gtid changes were to blame; in my defense, the logging was confusing to me. On the bad-news side, I believe there is still some underlying issue showing "fake" lag / creating extra waits, due to some kind of contention at the mysql or application layer. Right now this is most visible on the job queue, so I am not sure a timeout is the right way to go; it could be just some concurrency configuration to "smooth" mysql queries/connections, or maybe some migration or other ongoing issue (e.g. it could be fallout from the application-server latency issue we have observed in the past).
Fri, Jan 31
Thu, Jan 30
If I can add a 4th and 5th, with lower priority (feel free to disagree): the "Ensure acme-chief-backend is running only in the active node" check should not use the -a parameter, but should match only the first one or two arguments; see the sketch below. Also, maybe some extra monitoring for repeated failures (?) (I believe someone saw errors on execution). These are minor things compared to avoiding an outage here, which is the top priority for now.
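Something like this, using monitoring-plugins check_procs (thresholds and the exact process string are assumptions on my side):

# current style: -a matches the string anywhere in the full argument list
/usr/lib/nagios/plugins/check_procs -c 1:1 -a 'acme-chief-backend'
# suggested: anchor a regex at the start of the argument array instead
/usr/lib/nagios/plugins/check_procs -c 1:1 --ereg-argument-array='^acme-chief-backend'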
Thanks @Jdlrobson. I hoped that either Search or Readers might know a source, but it wasn't clear to me when filing.
Wed, Jan 29
Tue, Jan 28
[19:41] <icinga-wm> PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
Thanks, let us know how we can help.
Assuming no impact on actual functionality, this could be given lower priority and lose the Wikimedia-production-error tag; it would just be kept open (and eventually solved or worked around) to try to reduce root cron spam.
On clients, the randomizer fails quite frequently.
Thanks a lot for the work on this. May I suggest an extra step after rebuilding: checking that all data, old and new, is consistent? This is a lot of data, and even in well-thought-out processes missing rows have been discovered: on another carefully planned migration, a comparison I requested turned up rows missing due to mistakes/existing inconsistencies/aborts. Such a step could be as fast as a simple join query (<5 minutes to run) between old and new to check that no rows are missing or extra, and that the data is equivalent; see the sketch below.
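Roughly this kind of check (schema, table, and key names are placeholders; it assumes both copies are reachable from the same mysql instance):

mysql -e "SELECT COUNT(*) FROM old_db.tbl o LEFT JOIN new_db.tbl n USING (id) WHERE n.id IS NULL"   # rows missing from the new copy
mysql -e "SELECT COUNT(*) FROM new_db.tbl n LEFT JOIN old_db.tbl o USING (id) WHERE o.id IS NULL"   # rows only present in the new copy

Equivalence of the non-key columns can be checked the same way, joining on the key and comparing column values.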
Mon, Jan 27
API maxlag has repeatedly been declared by several people in the past to always be <5s.
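(For context, maxlag is the standard MediaWiki API parameter that bots pass so the API refuses work when replicas lag; an illustrative request, with the endpoint and other parameters chosen just for the example:)

curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&maxlag=5&meta=siteinfo'
# returns a maxlag error instead of results whenever reported replication lag exceeds 5 seconds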
I am not part of the Wikidata QS team, so I don't have answers, just questions :-D I am only chiming in because my team has been tagged on this ticket; please understand we (SREs) are not in direct charge of this service and someone else should answer with first-hand knowledge.
Thu, Jan 23
Please note that int(8) unsigned and int(10) unsigned are exactly the same actual data type (a 4-byte integer, from 0 to 4294967295). https://dev.mysql.com/doc/refman/8.0/en/numeric-type-attributes.html
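A quick way to check it (throwaway table; names are just for illustration):

mysql -e "CREATE TABLE test.widths (a int(8) unsigned, b int(10) unsigned); INSERT INTO test.widths VALUES (4294967295, 4294967295); SELECT a, b FROM test.widths"
# both columns accept the full 4-byte maximum; the (8)/(10) is only a display-width hint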
Related to T243479 ?
Tue, Jan 21
Jan 9 2020
Sadly, applying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/563241 to current 1.35 gives me:
Not the original reporter, but after applying the patch I can confirm the Turkish installation went through correctly (twice):
helium is no longer the backup server; you can go to https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=webperf1002.eqiad.wmnet-Monthly-1st-Wed-production-arclamp-application-data&from=1577127671546&to=1578586171164 to check backup behavior. Icinga will also alert if backups are not fresh and correct beyond the configured period. I can see full backups for webperf taking 215GB. As part of upcoming goal work, we will soon require assistance from service owners to automate and test the recovery of backups.
backup2001 is now down and ready for maintenance (no need to ask again). @Papaul please, when done, just boot it back up and ping here. Thanks.
As an addendum, could something be improved related to T238296#5662905? There seem to be at least 3 dashboards related to job queue health, and while https://grafana.wikimedia.org/d/000000107/job-queue-health is clearly marked as deprecated, I didn't know about T238296#5662894. I would suggest clarifying the current status on wikitech or grafana itself.
Jan 6 2020
I have forwarded the link to @herron
Jan 3 2020
I have started the mysql instances back again, plus replication, as there is low load on codfw.
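(Roughly the following, per instance; these are WMF multi-instance hosts, so the unit and socket names are from memory and the section is just an example:)

sudo systemctl start mariadb@s1
sudo mysql -S /run/mysqld/mysqld.s1.sock -e "START SLAVE"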
Jan 2 2020
Feel free to edit the body with a complete list of changes, but I believe a single task would be enough to track all the requested improvements.
I still have no answer yet :-/.
I believe this should be renamed to "Approve a policy to not allow new features to be deployed to production without a kill switch".
Dec 31 2019
I was able to reproduce this on 1.34. I wonder if the root cause is similar to T241638: the translation service not yet being available during the installer.
I agree we should keep the backup sources either on the same version as the master or on the same version as >50% of the replicas (aka "upgrade them at the same time as the master"). However, if we decide to move a whole section (e.g. s1 and s6 on codfw only) to a higher version, we could also upgrade the backup source of that dc to test the backup workflow on the new version.
I was able to reproduce it quite consistently with the given steps; it happens at the final setup step, on creation of the sysop account, see screenshot:
My guess is there may be a dependency cycle in which the translation for the installer requires a service or string that is not yet installed. I believe similar errors happened in the past with non-standard installations, but we need the mw internationalization/dependency experts to confirm.
Adding maintainers as per https://www.mediawiki.org/wiki/Developers/Maintainers#MediaWiki_extensions_deployed_on_the_Wikimedia_Cluster (please correct tags AND wiki if outdated)
I am no mediawiki developer, but normally one would ask for the MediaWiki version, the steps you were trying to accomplish (you seem to be running the installer?), and a copy of your configuration file.
Dec 30 2019
Any suggestions for troubleshooting?
See also T241534; I am not sure exactly what the detected problem was.
This being a software RAID, please coordinate with @fgiunchedi.
This is flapping very frequently, but with a 500, not a 429 (scb1002 only, for example, twice per hour). Should I close this and open a new one, or can this be handled here?
Dec 27 2019
This has continued in the last 7 days: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&from=1576888676529&to=1577452275258&var-cirrus_group=eqiad&var-cluster=eqiad&var-exported_cluster=production-search&var-smoothing=1
Dec 26 2019
@hashar, based on Dzahn's comment, is that something your team could handle, i.e. sending a puppet patch for the missing hiera keys (I can help review and deploy it)? Let me know.
Stalled based on comments, waiting for T202684#5735025 response.
@ayounsi, how high priority would you say this ticket is? Spam is annoying but shouldn't warrant High; however, the disk saturation could be dangerous (I don't have all the context to be able to decide).
This seems of high importance; feel free to tune it down if necessary.
How high a priority would you say this has, so it can be removed from the triage inbox?
I am personally not familiar with the mailman format. Maybe @herron, our mail expert, knows how to proceed with this? Even if he cannot do it himself, if he knows it is a trivial procedure ("copy the file to a path"), I could do it myself.
See also T240794, which if agreed could be done at the same time.
@leila, researchers typically have time-limited MOUs; is that true in this case? If so, could you share the time period, so I can add an expiry date to the account?