Mon, Nov 12
I will need some more time to digest the rest of your comment, but it seems it covers even more than I need to answer my question. What I can answer quickly:
Nuria apparently subscribed 120 cloud users to this task by mistake- please be careful when using Phabricator to not annoy (with spam) our valuable contributors just to get the desired attention.
Could you check the list of schema changes and maintenance to be run during the switchover, to see if they were also undone?
We now need to check the consistency-related configuration on all codfw hosts, which was altered to prevent them from lagging too far behind, and I would also recommend some light checking of the consistency of the affected tables.
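A quick way to eyeball the current settings, reusing the same section/mysql.py helpers as in the other comments (which variables were actually relaxed is my guess- likely the durability ones- and the section name is a placeholder):
./software/dbtools/section s1 | grep db2 | while read host; do echo $host; mysql.py -h $host -e "SELECT @@global.sync_binlog, @@global.innodb_flush_log_at_trx_commit"; done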
This is not something we handle- we don't decide on the table structure (this refactoring, comment storage, was owned by the Platform team), while the actual view structure changes are handled by Cloud. I don't even know which structure there is on the wikireplicas- we only handle the production filtering. When the views were set up we weren't very happy about it, and we only "accepted" them on the condition that Cloud would handle the wikireplicas filtering on its own. The two teams have been working together on T181650- you should comment there with your needs and one of them may serve you better :-)
Thu, Nov 8
Update from the airport for the practical bits- send a patch with a foreachwiki php script (check other past tasks with similar cases) so many people can review it and +1 it. If it takes less than 1 hour to run, !log on IRC #wikimedia-operations and run it during its own deployment window; if it is longer, write it on the [[wikitech:Deployments]] page in the "week of:" section, as recommended under "Long running tasks/scripts". Make sure no other long-running tasks such as T166733 are writing to the same tables, to prevent locking (e.g. deadlocks). Even if the reporter does not have access to production, once it is deployed into production, anyone with server access will be able to run it (not only ops- releng and several developers, too).
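For reference, a minimal sketch of what such a patch would end up running once merged- the script name and option here are invented, foreachwiki just runs the given maintenance script against every wiki:
foreachwiki maintenance/fixSomethingHypothetical.php --batch-size=500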
I was asked to participate on this task explicitly:
I have not done any research that suggests they are related, but he had Watchlist regressions in the past on certain wikis when recentchanges grew a lot on long watchlists, at T171027#3667090. I am not assuming it is Wikidata at all- it could be any other process making recentchanges grow (a thing that is easy to check or discard).
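As a sketch of the "easy to check" part, assuming direct access to a replica (host and wiki names are placeholders):
mysql.py -h db1xxx -e "SELECT COUNT(*) FROM enwiki.recentchanges"
mysql.py -h db1xxx -e "SELECT wl_user, COUNT(*) AS items FROM enwiki.watchlist GROUP BY wl_user ORDER BY items DESC LIMIT 5"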
I voted +1, but check if you can fix some minor style things I suggested so all mariadb roles look similar. Otherwise, your deploy strategy looks sane to me- disable puppet everywhere, test on the new host, then on codfw, then on eqiad, one host at a time, making sure each run is a noop. Thanks for your hard work!
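A rough sketch of that sequence, using plain puppet agent commands (the actual wrappers in use may differ):
sudo puppet agent --disable "mariadb role refactor"   # on every affected host first
sudo puppet agent --enable && sudo puppet agent --test --noop   # then one host at a time: new host, codfw, eqiad
sudo puppet agent --test   # apply for real once the noop run shows no diff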
While it looks good, please wait until you have at least one positive review unless there is an emergency- as all the work will be done on your side, you can move to another ticket while you wait. I think I will be able to review it today.
@Anomie I do strongly believe that mediawiki has recently become unsafe (a regression) in the latest releases- this is one of the many replication issues we have had recently- we should search recent commits for unsafe statements or this will keep happening.
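One cheap complementary check, assuming the default error log location (the path is a guess), is to count the server-side warnings that flag such statements:
grep -c 'Unsafe statement written to the binary log' /var/log/mysql/error.log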
Tue, Nov 6
@faidon They don't need a DBA, they are searching for someone to support them with puppetization. We already attended to the DBA tasks at T205294; the rest can be handled by anyone with production access or knowledge- they can wait for me to have the time, of course, but if they do, they may be waiting forever :-)
Sorry, I don't have the bandwidth to work on this- please search for another op.
Mon, Nov 5
it got missed somehow
hm... replication broken again, now on metawiki.archive
Sun, Nov 4
This is what I did for T208565: stop replication, add db.table to the long list of filters (carefully), let it catch up, then stop replication on its master (beware of icinga alerts- downtime all possible alerts first), truncate the table without altering its definition, reimport it logically -e.g. with mysqldump- (the triggers should take care of the sanitization), restore the replication filter to its original state and restart replication on the master.
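Very roughly, and heavily simplified (host, table and the exact filter variable are placeholders; the dump/import step in particular is only schematic):
mysql.py -h db1xxx -e "STOP SLAVE"
mysql.py -h db1xxx -e "SET GLOBAL replicate_wild_ignore_table='<existing filters>,somedb.sometable'"
mysql.py -h db1xxx -e "START SLAVE"
# once caught up: stop replication on its master (downtime icinga first), then
mysql.py -h db1xxx -e "TRUNCATE TABLE somedb.sometable"
mysqldump --single-transaction -h <source-host> somedb sometable | mysql.py -h db1xxx
# finally restore replicate_wild_ignore_table to its original value and restart replication on the master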
Fri, Nov 2
"extended 1:1" is not my words-- obviously I cannot guarantee such a thing, only a manager can do so, but I believe assistance on writing a puppet patch would be within the functions, as I understand them.
Do you know who's the right person to contact today?
This is technically fixed, but we should do a deeper check of the causes- there could be some drift on this or other nearby dbs that only manifests due to ROW-based replication.
Work by dbas here is done, except for making the 2 altered hosts fully consistent again once they catch up.
So I am going to take a shot in the dark and say that when schema changes are done under pressure and in a hurry, as this one was (only to get the feature out early and without proper checks), they are prone to errors.
Thu, Nov 1
Note I didn't ask for a delay- and neither Manuel (vacations) nor I (training) will be around on that new day either. Balasz will be, however.
Wed, Oct 31
loading a table with ROW_FORMAT=COMPRESSED- but I am not sure compression is actually the cause; it may just be (I would actually be more confident about this) a simple rebuild from scratch after many edits have accumulated (something common after >1000M edits).
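If the rebuild is the real cause, it should be reproducible on a spare host with a null alter, which forces a full rebuild of the table in place (host and table are placeholders):
mysql.py -h db1xxx -e "ALTER TABLE somedb.sometable ENGINE=InnoDB"   # repeating this with ROW_FORMAT=COMPRESSED added would separate out the effect of compression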
A larger stripe size should not be a huge issue (unlike a smaller one, which affected performance significantly and which we didn't like). We were thinking of increasing the one we use anyway, due to increased capacity, so this would be a nice test (these are 1.6TB disks anyway). Redoing the RAID and reformatting may take a long time, though, and could be a waste of time.
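For reference, assuming these are the usual MegaRAID controllers (that is a guess), the current stripe size can be checked before deciding anything with:
megacli -LDInfo -Lall -aAll | grep -i strip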
We should work on T208383 instead.
Tue, Oct 30
In particular, I see https://www.wikidata.org/wiki/Q2058295 properly merged.
There is nothing else left for DBAs here except waiting for errors.
No problem on my side- a short network outage is not a huge issue for the dbs on codfw, but I cannot guarantee they will not page, and I won't be around to attend to it- someone else will have to.
Everything at T184805#4654953 is done, except the GTID handling, which has to be checked separately for other reasons.
No filters left that I can see:
./software/dbtools/section s5 | while read host; do echo $host; mysql.py -h $host -e "SHOW ALL SLAVES STATUS\G" | grep 'Wild' ; done
./software/dbtools/section s3 | while read host; do echo $host; mysql.py -h $host -e "SHOW ALL SLAVES STATUS\G" | grep 'Wild' ; done
In theory the drops finished, but I need to do an additional pass to check for missed hosts/dbs as well as check/remove filters.
All those wikis have sanitarium set up on db2094 too.
There is also this, which I guess is related:
Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: WatchedItemStore::removeWatchBatchForUser does not have outer scope. #0 /srv/mediawiki/php-1.33.0-wmf.1/includes/watcheditem/WatchedItemStore.php(392): Wikimedia\Rdbms\LBFactory->getEmptyTransactionTic
Mon, Oct 29
Please give us the times on eqiad again so we can verify it was indeed the switch.
We should consider declining that and do the work directly on the new hardware: T207259
It is not clearly a RAM issue
https://downloads.mariadb.org/ 10.1.37 is not yet considered stable at the time of writing this. While we could deploy something from the tree, that is a big no for database code (unlike other kinds of code, where I don't have a problem doing that) unless there is an unbreak-now bug. Once it is officially released I will build it and test it.
We don't share passwords publicly, and you shouldn't need it to actually use it- you should create puppet code that reads it and writes it to a config file, or that reads it and performs the load for you. Giving you the password would not only be dangerous, it would also break things if for any reason it has to be changed, as the loading code would then stop working. Also, passwords can be stolen or lost, and I personally consider them sensitive data.
Please note that while I have been asking you to wait, I am genuinely concerned about the performance of the query- even if only a few people can run it, a GROUP BY over long periods of time can be bad for the servers- so this is not blocked just on me not responding. However, it is not easy to provide alternatives. Putting a limit on the timespan, or testing the waters by counting the number of matching entries beforehand, would help build confidence in the query- note that even if you do a LIMIT 5001, the GROUP BY forces going over all matching entries.
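As a sketch of the "test the waters" idea (table and column names are invented, only the shape matters): count the matching rows within a bounded timespan first, and only run the GROUP BY if that count is reasonable.
mysql.py -h db1xxx -e "SELECT COUNT(*) FROM somedb.change_log WHERE cl_timestamp BETWEEN '20181001000000' AND '20181101000000'"
mysql.py -h db1xxx -e "SELECT cl_ip, COUNT(*) FROM somedb.change_log WHERE cl_timestamp BETWEEN '20181001000000' AND '20181101000000' GROUP BY cl_ip ORDER BY COUNT(*) DESC LIMIT 5001"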
What I was asking for was pure SQL plus a tip such as "run this on the ip with more edits" or so (it is not easy for me to do the mw syntax -> SQL transformation; I am not that familiar with mediawiki code, sorry).
so the error happens because it is being run manually, which is not a big deal if it errors out- just delete any file you may have added. I ran systemctl disable <service manually started that failed> and then systemctl reset-failed, and it should never fail again.
I think this should be moved to the zarcillo mariadb metadata database, centralizing there the active database control (substituting tendril, puppet, the per-section lists and the prometheus mysql exporter list).
mysql-prometheus-exporter should not run on a multi-instance host; there is mysql-prometheus@m1, @m2, ... That is specified in puppet.
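A quick way to see which per-instance exporter units a host actually runs (unit name pattern taken from the comment above; the real systemd name may differ):
systemctl list-units 'mysql-prometheus@*'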
Thu, Oct 25
@Pigsonthewing I hope my comment at the Wikidata Village Pump was helpful- if you think that is ok, I would suggest closing this task and opening a different one to track the merges of old history (this one was to track the recovery from backups).
MarcoAurelio- great suggestion. I would also add checking what happens if deletion of a page is retried several times (a common occurrence in the past, due to the way deletion requests are handled)- I know there was work on preventing issues, but it would be nice to recheck on production.
T207881 is mediawiki, db1072 is m5, nothing to do.
Please reload the proxy and work with @Bstorm or whoever may help to identify next steps.
Network likely went down at 19:23 https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1540407778981&to=1540410443852, or more likely, connections reached max_connections CC cloud-services-team
The host is not up and running, it says: db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN
Wed, Oct 24
This was done a long time ago on dbstore1002, and doesn't apply anymore to the dbstores due to multi-instance.
To not just be a pain, this is how you can discover the master for a particular section automatically:
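A rough sketch of one way, reusing the same helpers as in the other comments (not necessarily the exact snippet that belonged here): pick any replica of the section and read Master_Host from its replication status.
host=$(./software/dbtools/section s5 | head -n1); mysql.py -h $host -e "SHOW SLAVE STATUS\G" | grep Master_Host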
I wouldn't use tables_to_check.txt for now
@Banyek Hardcoding the masters in configuration seems to me like a bad idea- they are already defined redundantly 4 times: on mediawiki, on puppet, on tendril and on prometheus. We should reduce the redundancy, not increase it.
Tue, Oct 23
COMMIT takes a few seconds
I believe this mostly affects commonswiki; it regularly shows:
Sorry, there is an ops clinic duty to answer these kinds of requests- I did my part, which was creating the user account on production. I am not responsible for anything else- anybody can do an RC on beta repos, and I am definitely not in charge of those.
that task is mostly append only
So this has to be done (I will check in case there is a duplicate task already)- I am not arguing against that.
We believe this to be fully fixed both on wikireplicas and on production, but will not close this until extra checks confirm so.