Assigning to Mathew based on above update as part of clinic duty. Feel free to revert if this is wrong.
Copy jobs are running now- we will see how long a full copy takes.
I am just here doing clinic duty for the Operations tag. Traffic should decide on this ticket, but based on my (limited) understanding of our setup, I suggest we should not do this unless there is a really good reason to.
This is not the first time this has happened: T237730. The firmware was updated at that time, too.
Assigning to @leila as per BBlack and Reedy comments, as there seems to be some additional information required. Please feel free to reassign to the right person you are in contact with, as per your original comment there may be 3rd parties involved. Other than that, I will let Traffic handle the request on their own (I am just trying to move forward tasks while on clinic duty).
Hey, @chasemp, is this on your radar (a lot of time has passed since the last update)? If yes, but "there is need of some discussion and work not involving SRE", I would remove the SRE-Access-Requests tag so it doesn't appear on the clinic duty dashboard. If no, maybe this should be closed and a different task should be opened with further actionables (technically, the title has already been fulfilled: secteam users exist on production). If yes, but SREs are blocking work, please let us know how. Cheers!
Please reassign to me when ok or if there are comments.
^I have prepared the patch so it can be merged as soon as everybody agrees.
though this case is complicated since people want their "latest views" to be immediately reflected
FWD: @Marostegui You may want to defragment the named table before answering the question.
Thu, Dec 5
we'd still probably lose the ability to reuse the same opened connection
Should we consider changing any of this?
I've not yet managed to find suitable ways to join the tables and make some query against revisions and usernames/comments.
Wed, Dec 4
I am seeing db1118 serving dumps. This is a high-throughput main-traffic enwiki replica. I thought at first this was the cause of an outage, but it was unrelated. However, it seems quite worrying.
Now it is ok:
I accidentally scheduled the migration, not the copy.
I have provided them already in
Retention change documented at: https://wikitech.wikimedia.org/wiki/Bacula#Modify_a_pool's_retention_(or_other_similar_properties)
After update, the pools seem ok, although we probably should also increase the offsite one (creating patch).
*list pool
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
| PoolId | Name       | NumVols | MaxVols | MaxVolBytes     | VolRetention | Enabled | PoolType | LabelFormat |
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
|      1 | Default    |       0 |       1 |               0 |  155,520,000 |       1 | Backup   | *           |
|      2 | production |      33 |      60 | 536,870,912,000 |    7,776,000 |       1 | Backup   | production  |
|      3 | Archive    |       2 |       5 | 536,870,912,000 |  157,680,000 |       1 | Backup   | archive     |
|      4 | offsite    |       0 |      60 | 536,870,912,000 |    2,592,000 |       1 | Backup   | offsite     |
|      5 | Databases  |       5 |      60 | 536,870,912,000 |    7,776,000 |       1 | Backup   | databases   |
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
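For quick sanity-checking, Bacula's VolRetention values are in seconds. A small sketch (illustrative only; pool names and values copied from the bconsole output above) converting them to days:

```python
# VolRetention values (seconds) as shown in the pool listing above.
retention_seconds = {
    "Default": 155_520_000,
    "production": 7_776_000,
    "Archive": 157_680_000,
    "offsite": 2_592_000,
    "Databases": 7_776_000,
}

def to_days(seconds: int) -> float:
    """Convert a retention period in seconds to days (86400 s/day)."""
    return seconds / 86_400

for pool, secs in retention_seconds.items():
    print(f"{pool}: {to_days(secs):.0f} days")
# production/Databases -> 90 days, offsite -> 30 days, Archive -> 1825 days (5 years)
```

That makes it easy to see why the 30-day offsite pool stands out next to the 90-day production and Databases pools.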
Hi, @akosiaris, thanks for the reviews and feedback. Could I have your further thoughts on T238048#5701519 and T238048#5701534? Normally I would just find a solution or workaround on my own, but archive file copy was one of the parts on which I compromised my suggested plan, because you were quite confident about its forward compatibility :-/. On the other hand, most of those files seem to be around 5 years old, which may mean some should actually be purged. Let me know your thoughts.
"Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs checking.
I wonder if some of these could be done on reimage, if/when there is one planned anyway.
Tue, Dec 3
Full Backup 10 04-Dec-19 02:05 dbprov2002.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest *unknown*
Full Backup 10 04-Dec-19 02:05 dbprov2001.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest *unknown*
3fe8d696da846e6f3be372e8bf62939242857d99 could help inspire this. This is the reference implementation: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/multiversion/MWWikiversions.php#77
Mon, Dec 2
If the expectation is that the production error tag will give this higher priority compared to other Parsoid bugs, that is not going to be the case right now because of the reality of Parsoid vs Parser.php differences. But if the tag is just an indicator that this is an exception raised on the production cluster, then that is fine.
Fri, Nov 29
Same for bast1001:
Error while trying to restore sodium contents:
Wed, Nov 27
firstname.lastname@example.org[bacula9]> UPDATE Media SET StorageId = 11 WHERE StorageId = 4;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2  Changed: 2  Warnings: 0
Please note those are for SAS disks; I believe we have more SATA ones, which are affected by https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00048133ja_jp
Tue, Nov 26
Although, checking more closely, this should be closed as invalid- those wikis don't have Wikidata enabled, so those tables don't exist. Not all wikis have the same tables; some depend on the plugin configuration.
Thank you for checking!
Batch editing the DB
The update should be:
UPDATE Media SET StorageId = 11 WHERE StorageId = 4;
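A cautious pattern for a batch edit like this is to count the matching rows first and only commit if the UPDATE affects exactly that many. A minimal sketch of the pattern (illustrative only: the real edit runs against the Bacula catalog in MariaDB; sqlite3 and the sample rows here are stand-ins):

```python
import sqlite3

# In-memory stand-in for the Bacula catalog's Media table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Media (MediaId INTEGER PRIMARY KEY, StorageId INTEGER)")
conn.executemany("INSERT INTO Media (StorageId) VALUES (?)", [(4,), (4,), (11,)])

# Count the rows we expect to change before touching anything.
expected = conn.execute(
    "SELECT COUNT(*) FROM Media WHERE StorageId = 4"
).fetchone()[0]

cur = conn.execute("UPDATE Media SET StorageId = 11 WHERE StorageId = 4")
assert cur.rowcount == expected  # bail out (rollback) on a surprising row count
conn.commit()
print(cur.rowcount)  # 2 rows changed, matching the session output above
```

The SELECT-then-UPDATE check is what the "Rows matched: 2  Changed: 2" line in the pasted session confirms after the fact.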
It was more a question for @Andrew (which he already answered)
Does this need backups? cc @jcrespo
This is ongoing, so adding production error tag:
I am seeing ATM errors:
/wiki/Special:Search?search=<search string>&ns0=1 ErrorException from line 1591 of /srv/mediawiki/php-1.35.0-wmf.5/includes/GlobalFunctions.php: PHP Notice: Array to string conversion
Mon, Nov 25
"Lowest" sounds belittling and demotivating
Fri, Nov 22
I was planning to have only one mariadb instance acting as multi-source
buster + latest version of mariadb
I will document the graph when it is "finished" (WIP), but for now:
- Backup time: end_time - start_time of the last backup
- Backup level: whether it is a Full backup (ord('F') => 70), Incremental (ord('I') => 73) or Differential (ord('D') => 68); other options may exist too.
- Backup status: terminated successfully (ord('T') => 84), still running, aborted by user, fatal error ('f'), ...
As I feared, the export gets too slow during peak hours: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-srv-gerrit-git&from=1574390374636&to=1574405281903
Thu, Nov 21
I see, thanks. Again, just to be clear, I don't need this to be an approved policy- I just need something written on mw.org to direct WMF developers & deployers to the wikitech instructions, because of ongoing coordination issues.
It is not in progress, I commented precisely there saying that.
There is a bug:
Wed, Nov 20
This is what I have so far (only per-job information):
Fri, Nov 15
I doubt labtestwiki has replicas...
If you don't plan to recover the data, and it is for archival purposes, that is ok. However, I strongly suggest using mydumper in the future, otherwise a single-threaded recovery would take around 5 days, and a partial recovery would be very difficult. We wrap backup_mariadb.py and recover_dump.py precisely so that sane defaults are used. Taking the backup would also have been 5-10 times faster.
Is this serious enough that we should halt the script and deploy a fix immediately, or can it wait until after the current run finishes?
Thu, Nov 14
Just to be clear, I wasn't suggesting removing it- mostly it was about fixing the missing metrics and making things easier to find/document.
In both cases, be it deprecated or not, we will probably want better discoverability (tags) on the new dashboards, a documentation update at https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue and potentially a link to the above dashboards on the old one (just a suggested fix).
Also happening on production (rarely though) according to Logstash.
Wed, Nov 13
I'm on the Wikibase team. Can you tell me who said it and where? Maybe I'm missing something. Technically it's not impossible; it's just a matter of passing the proper connection to the class, and that's all.
db1114 is now running percona-server 8.0, if anyone wants to test it.
On our side (which stores and uses most of the stuff SDC uses), we can safely move to another server; we don't do any joins with other tables in the code.
Tue, Nov 12
@akosiaris Could you take a quick look to see if this seems like the complete archive contents?