^So I don't need this, but people really need to think of a way to push fresh configuration to mw tasks- or establish some policy for long-running scripts- to avoid https://logstash.wikimedia.org/goto/523d9a64fb0821e25c2e84ca93502c1d
awight - do you need its contents? Maybe it was archived in the past; we would have to do some research about that.
Our backup system has a bug where empty dbs are not recovered; maybe it was empty and was deleted by accident (with no data loss?).
It should be on m2: https://wikitech.wikimedia.org/wiki/MariaDB/misc#Current_schemas_2
Thanks, yes, as Filippo said above, it has been deleted (and it is available) on swift, but not on metadata. We can do 2 things: reupload it, or perform a deletion with SQL and recover it in the normal way. I will need help from a mw developer for the second option.
switchover script works as expected (tested on db1111/db1112):
I wonder why redis- I understand the need for caching, but the x1 section was recently expanded to accommodate the reading lists needs, and 10 GB is small compared to the reading lists and cx-translation (in-progress translation) needs, which is roughly the same amount of data. I am not against using another technology- but this looks very similar to the above-mentioned features, or to the pre-cached Special:* list pages? Redis has issues with cross-dc replication, and it is slowly being removed (the jobqueue already was, sessions are next).
es1019 is back into service.
Stalling; no errors so far, but I doubt this is the last time we hear about this. Backups are on dbstore1001 just in case.
So this is solved?
Taking care of it.
Some wikipedias will be affected; if you want it, the shorter list of wikis that WON'T be affected is at https://noc.wikimedia.org/db.php :
Tue, Jan 15
es1019 is back up and mgmt is working. Not starting mysql though, until Chris confirms everything is done and ok (and he is understandably busy right now).
@Cmjohnson The most likely scenario is that we move the dimm, we keep detecting 96GB of ram, and then we will ask you to request a replacement. Otherwise we will reboot it and keep observing.
Waiting for Chris to be available to fully shut it down (as otherwise I wouldn't be able to bring it back up).
@Platonides could you try to restore it using the wiki admin tools before trying some SQL?
[10:08] <jynus> something happened yesterday at 22:40 that made things 0.5 seconds slower
[10:08] <jynus> on mobile
Let's add @Anomie here so he can verify this didn't affect the ongoing actor migration, as per https://wikitech.wikimedia.org/wiki/Deployments#Week_of_January_14th
Mon, Jan 14
I've just seen that a dashboard I use is scheduled for deletion. I don't see the replacement as particularly better, and it is lacking in some areas. Could you have a look at how other people are doing those, such as https://pmmdemo.percona.com/graph/d/qyzrQGHmk/system-overview ? They can be downloaded at https://github.com/percona/grafana-dashboards
First of all, I am only commenting because I have more information, but access handling is owned by the cloud team.
I created this to track it; it has gone up to 21 since yesterday. We have to consider the possibility of it crashing due to uncorrectable errors and be prepared for a failover.
Sun, Jan 13
Fri, Jan 11
@Zoranzoki21 and @dungodung, as well as other subscribers- Phabricator is not the place for this kind of discussion- that should go to wiki. As @MarcoAurelio said, no deletion will happen until that happens, and even if some deletion happened already, it can be recovered. Please solve disputes on wiki, and only return to this ticket when a consensus is reached with a decision we can apply.
More logs, confirming the module is probably dead:
Asking @Cmjohnson to move around this memory module- either it got disconnected or broke completely. FYI, 128 GB of memory should be detected, but only 96 GB are (aiming for maintenance on Tuesday):
Something else happened on the 17th:
Can we discuss how to implement these?
Even we roots have one for mysql administration:
root@cumin1001:~$ mysql.py --version
/usr/local/sbin/mysql.py Ver 15.1 Distrib 10.1.36-MariaDB, for Linux (x86_64) using readline 5.2
@Nuria There are apparently 2 tools (or the same one, reused), one on production, too:
Thu, Jan 10
@Cmjohnson you are the best, the worse Dell is, the more superb you are at covering for their mess. How many beers do I owe you already? XD Thanks again.
@Cmjohnson Sorry, I cannot today, for both organizational reasons (I am at a meeting today) and technical ones (cannot depool it today due to traffic without being too disruptive). Let's try Tuesday if you are ok with that?
@Volans I have no ssh, https or ipmi access, so there is nothing I can do about it. This needs a power drain.
That was the plan :-)
Sorry, I searched but I didn't find the other one; in your comment above you probably meant that one, but linked to this task itself by mistake. I am ok with any method, as long as there is at least one task open.
I will first try remote debugging techniques myself.
Failing again, acking on icinga, reopening to not forget about it.
Leaving it open and acking it on icinga so we don't forget about it.
I have modified the wording to reuse the meta task for the new goal, which has already solved the decision part, but still needs some design for the architecture, purchases and final implementation.
Wed, Jan 9
I rebuilt db1082- we are not a blocker for any maintenance on those servers, but we would prefer to stop mysql if there is a chance of the server losing power: while it would not cause any user-visible outage, it is very time consuming for us to recover a pooled server, and it takes very little time to depool it and stop it.
db1082 is fully repooled; it and db1124 had gtid re-enabled.
Looking at the logs, the issue (lock wait timeout) I see now is with User::loadFromDatabase (SELECT user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount,user_actor.actor_id FROM user LEFT JOIN actor user_actor ON ((user_actor.actor_user = user_id)) WHERE user_id = '1983946' LIMIT 1 FOR UPDATE), so maybe this can be closed as resolved?
I assume the 6 errors correlate to the 6 edits in the same minute, given the update happens post-send (it is already biased towards later). As such, it would seem that 6 out of 6 timed out.
Tue, Jan 8
This is mostly fixed, except gtid must be enabled on db1082 and db1124, plus db1082 must be repooled.
By the way, are people aware that a shard called "test-s4" has 2 dedicated large hosts and is ready to be used for production? I think it was used by Anomie and DanielK to test MCR; could it be shared for whatever testcommonswiki is being used for?
If this is a known, ongoing, in-process-of-being-decommissioned issue, you can close this ticket; no reason to keep it open. But I would suggest sending an email to ops@ linking to the above comment and saying so (I didn't know this, and probably more people didn't either, and it sends alerts to icinga).
the best way to accomplish this would probably be a library
There is already an 'sql' tool that developers who query production use without having to know the underlying mediawiki topology (100 servers)- it probably could be adapted for analytics hosts?
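To illustrate the idea only (a minimal sketch, not the real tool: the script, the host names and the hard-coded mapping below are made-up placeholders; the actual 'sql' wrapper reads the canonical mediawiki db configuration instead):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an 'sql'-like wrapper for analytics replicas."""
import argparse
import subprocess

# Made-up dbname -> section and section -> replica host mappings;
# a real tool would load these from the canonical db config, not hard-code them.
SECTION_BY_WIKI = {"enwiki": "s1", "commonswiki": "s4", "wikidatawiki": "s8"}
HOST_BY_SECTION = {
    "s1": "an-replica-s1.example.internal",
    "s4": "an-replica-s4.example.internal",
    "s8": "an-replica-s8.example.internal",
}


def main():
    parser = argparse.ArgumentParser(
        description="Open a mysql client on the replica holding a given wiki"
    )
    parser.add_argument("wiki", help="database name, e.g. enwiki")
    args = parser.parse_args()

    host = HOST_BY_SECTION[SECTION_BY_WIKI[args.wiki]]
    # Hand off to the normal mysql client, so users never need to know the topology.
    subprocess.run(["mysql", "-h", host, args.wiki], check=False)


if __name__ == "__main__":
    main()
```

That way users only ever type the wiki name, and the host/section details stay in one place.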
db1124:s5 stopped at db1082-bin.002490:667685191
Mon, Jan 7
I plan to take care of this tomorrow morning.
^CC @Marostegui so you know why db1082 + db1124 + labsdb replication (s5) are broken or stopped
I am creating a subtask to fix db1082, which may have to be reimaged because of the power loss.
I've been told there was some breakage based on assuming s4 ==> commons, or commons ==> s4. I am not too worried, as I said, about a temporary project, but that assumption in code or configuration is worrying, as it would not be unthinkable that we move commonswiki to a separate group in the future.
May I suggest a different route? Let's migrate the mediawiki replicated tables first, then migrate the staging ones on a per-case basis. After all, it makes no sense to copy them blindly- to which of the new servers? Once we check the replication works as intended, we can ask what to keep and what to remove. In some cases, users may prefer to regenerate them from "fresh data"? Just an idea.
I agree with Manuel on T197616; I would have preferred creating it on s3 for isolation reasons- enwiki, commons and wikidata require more resources than the typical project due to their high throughput, and they were set on dedicated hardware on purpose. I understand that you want a setup as similar as possible to the actual commonswiki, but from our point of view, s0 deployments are the ones most likely to create outages, and the above wikis, plus metawiki and centralauth, are on purpose kept separate from group0 ones to minimize impact. Also, the above 3 wikis have a large amount of hardware behind them, which makes testcommonswiki overprovisioned in some aspects.
I would like to insist on this issue now that the holiday is over- while the service (parsercache) is not affected at this time, we have no hw redundancy on eqiad, and after all it was the vendor that sent faulty hardware in the first place. Please escalate to us or a manager if you need help "fighting". Happy 2019 and thanks!
Regarding the second error- binary strings are not text, so they must be converted to python strings explicitly after driver execution.
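A minimal sketch of what I mean, assuming a PyMySQL-style driver that returns bytes for binary/VARBINARY columns (the values below are made up for illustration):

```python
# A row as such a driver might hand it back: bytes, not str, for binary columns
row = (b"Main_Page", b"20190115000000")

# bytes are not text: decode them explicitly (MediaWiki stores UTF-8 bytes)
title = row[0].decode("utf-8")
timestamp = row[1].decode("utf-8")

print(title, timestamp)  # -> Main_Page 20190115000000
```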
Nov 30 2018
Thank you, then I guess this can be closed as resolved, or I will let you handle it as you prefer.
Assigning it to you- unclaim it if it doesn't work and you need more help, or close it at a later time if the fix works.
You seem to be working on this; do you mind if I assign it to you (you can unclaim it later if you want)? That way it is clear someone is actively working on it, for organization purposes.
To further clarify on the "Blocked external/Not db team" set by Manuel: this seems a fairly simple and straightforward change [famous last words], and not worrying storage-wise, so you don't need any previous discussion with us to work on it- if agreed. We would like, however, to review potential new queries on implementation, to make sure indexing is used appropriately (it most likely will need a new index to filter on, and that may not be that simple except on trivial usages- e.g. T209773#4783873 will need thorough review to support large watchlists). We are obviously always open for questions (ping us)- but we are not leading this work.
Sorry, I found this very interesting case, but I don't know a better ticket to report it on. I saw in the mediawiki error logs that every minute, a request to [[User:Acer/Simple1]] on enwiki was made (oldid 844560394, in case it is edited or deleted). It returned 500 errors every time- the page is very large, containing 50000 wikilinks (I am guessing that is why it took a long time to render).
Thank you, I will ack the alerts on icinga- that was the main trigger of this.