Fri, May 29
Please don't consider dbmonitor2001 as upgraded, as the application doesn't work after the OS upgrade.
Thu, May 28
I talked to @Jclark-ctr on IRC, hw replacement will likely happen on Tuesday next week.
One thing that may be relevant here is that the query may have worked once or twice in the last 6 months due to this underlying issue. The last update that completed successfully was on 27 March. In effect, because of this bug, this wasn't officially disabled, but it was already (for the most part) not working (while still causing issues before the merge).
Let me think about it. Things are getting more and more complex, maintaining a lot of (mostly unrelated) stuff in the same repo. How would you feel about splitting transferpy out into its own separate repo? Would that make CI (testing and doc generation) as well as packaging easier for you? We can talk about this in today's meeting, but please think about whether that would simplify development for you; we can totally ask for a repo if that helps you.
Wed, May 27
Hi, @wiki_willy I just want to ping you so your team is aware that the maintenance here didn't complete correctly and that we need more onsite help (I don't need this fast, just making sure it doesn't slip under the radar).
Tue, May 26
$ ssh db2097.mgmt
User:root logged-in to ILOMXQ91304KD.(10.193.2.204 / FE80::8230:E0FF:FE3E:F9A2)
iLO Standard 1.40 at Feb 05 2019
Server Name:
Server Power: Off
host is down and ready for maintenance @Papaul.
Mon, May 25
I have created a specific column for tasks related to the GSOC, on the DBA project- I think we should use that to classify it under the DBA tag.
fuser will actually be more elegant than netstat:
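A minimal sketch of the fuser approach, run locally rather than through cumin. It assumes the psmisc `fuser` binary is installed and that it writes the PIDs to stdout and the `port/tcp:` label to stderr; the function names are hypothetical and not part of transferpy. Note that fuser reports every process using the port, not only listeners, and may need root to see processes owned by other users.

```python
import subprocess
from typing import List


def parse_fuser_pids(stdout: str) -> List[int]:
    """Turn fuser's stdout (whitespace-separated PIDs) into a list of ints."""
    return [int(token) for token in stdout.split() if token.isdigit()]


def pids_using_tcp_port(port: int) -> List[int]:
    """Return the PIDs fuser reports for a local TCP port (empty if none)."""
    # fuser exits non-zero when nothing uses the port, so the return code
    # is not treated as an error; stderr only carries the "port/tcp:" label.
    result = subprocess.run(
        ["fuser", f"{port}/tcp"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    return parse_fuser_pids(result.stdout)
```

The returned PIDs could then be passed to `os.kill`, which avoids having to match processes by command line.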
Sorry, when I said netcat before, I meant netstat. =:-D
Starting to work a bit on PIDs and ports for our library of methods would not be a waste of time, as we may want to reuse it later for concurrency handling and error states, not just the integration test. Of course, the immediate need is the error in the test.
I see- our command does some piping- this means that it generates a few subprocesses with a single command. I think the right way would be to use netcat to know which processes are listening on a specific port- something that could be a method within the Firewall class, and then send a kill to the pid obtained from netcat. netcat -tlpn should give us a numeric PID so we don't have to work with commands.
Fri, May 22
Thu, May 21
I cannot reinstall the server because the remote IPMI interface doesn't work (and the ssh and https accesses, which are enabled, don't accept my password). It looks like the password wasn't set up correctly after the reset, but common default user/password combinations don't work either.
Per my IRC chat with John
How about writing our document with Sphinx?
As a followup to T161296, we need to research the changes on buster to understand which new metrics are making scraping slower, and whether we should enable or disable more metrics for buster.
Wed, May 20
Let's document the --verbose flag before closing this ticket.
$ ./transferpy/transfer.py --no-compress --no-encrypt --no-checksum cumin2001.codfw.wmnet:/home/jynus/test_file2 backup1002.eqiad.wmnet:/home/jynus/
ERROR: The final target path /home/jynus/test_file2 already exists on backup1002.eqiad.wmnet.
Cumin execution details are not useful to the user at any time
Closely related: see comment T252171#6152787 about making sure that, before we close a ticket that adds new functionality, it is properly documented :-D on wiki and/or --help.
Aside from solving the issues I mention on the patch, the other thing we should not forget to update is the documentation. This is what it says now:
I tested this and it can go as is. Question: should we give more information on stdout, perhaps only when using the verbose mode, or do you think it is ok as it is?
No rush on our side, just the day before you are going to the DC for this, let us know so I can stop the server 24h in advance.
I don't want to write more here because it is off topic. I agree with everything you say, but let me go in a different direction:
Tue, May 19
jcrespo moved this task from Triage to Backlog on the DBA board.
Grants for labsdbuser, which is the default role on both servers for cloud users, are also (almost) the same:
$ diff <(mysql.py -h labsdb1010 -e "show grants for labsdbuser" | sort) <(mysql.py -h labsdb1011 -e "show grants for labsdbuser" | sort)
290a291
> GRANT SELECT, SHOW VIEW ON `grwikiimedia\\_p`.* TO 'labsdbuser'
One thing I can see is that labsdb1011 uses the new mysql authentication format, meaning:
Mon, May 18
I've checked, and both a manual "kill -9" and a "kill -15" should make the port available almost instantly, so it is probably not that. Maybe kill_job doesn't work properly; I will research it in my testing and report back.
The reason behind that is that remote_executor.kill_job() does not close the port instantly (it takes more than 30s on my machine).
The remote execution module of this framework has a kill_job function, and it does not instantly kill/close the port used by netcat. This ticket is to enquire whether that is the expected behaviour. If yes, could you please explain a little bit about it?
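To put a number on the delay, one option is to poll the port until it can be bound again. This is a sketch under assumptions, not transferpy code: the helper names are hypothetical, it has to run on the host that owns the port, and it treats "bind succeeds" as "port is available".

```python
import socket
import time
from typing import Optional


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # SO_REUSEADDR keeps connections lingering in TIME_WAIT from
        # counting as "busy"; an active listener still blocks the bind.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def seconds_until_free(port: int, host: str = "0.0.0.0",
                       timeout: float = 60.0,
                       interval: float = 0.1) -> Optional[float]:
    """Poll until the port can be bound; return elapsed seconds or None."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if port_is_free(port, host):
            return time.monotonic() - start
        time.sleep(interval)
    return None
```

Calling `seconds_until_free(port)` right after the kill, once for a manual kill and once for kill_job, would show whether the 30s delay comes from the kill itself or from something else holding the port.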
Of course I'm also expecting some historical context to potentially raise its head here.
For context, I was opposed to this being on icinga (NOT the concept itself) because I was worried about icinga spam and pings from other users stressing SREs. I compromised because Daniel improved (in my opinion) the proposal with the added whitelist and the promise that people were going to be "cool" about them. The whitelist was implemented; the "coolness" factor was known years ago, but not documented for newer SREs.
I made an amend to the policy:
@Dhzan I think documenting how one is supposed to use the WARNINGS (to adopt some of my feedback) and document the general idea of what not to worry about (e.g. screens running on databases) would be my criteria to resolve this. I think that is a reasonable request :-D.
I said that this was going to lead to people annoying other people over things that are non-impacting, and I agreed to the change because I was assured that this was only going to be a tool to detect bad patterns, and that SREs were never going to actively ping other people for just having things running for a few hours (it was considered an issue only if it was left like that for months).
Sat, May 16
Fri, May 15
I've added a couple of things to https://wikitech.wikimedia.org/wiki/Transfer.py#Wishlist_and_know_issues as ideas for later work (we don't have to do everything, these are just ideas for improvement)
This doesn't have to be a task (or it can be, up to you), but I realized that there are very few comments in the main files. For example, there is not a single comment in https://github.com/wikimedia/operations-software-wmfmariadbpy/blob/master/transferpy/Firewall.py
This can be resolved.
@Papaul see the FAILED above for db2140, as well as the
As per T162070#4942720.
All mysql users are system users.
Ping @BBlack to know whether you prefer to make the temporary workaround permanent or to revert as per the previous comment, so this can be closed.
There is a difference in replication "performance" (pc1010 is spikier and lower):
But the comparison is not fair for the 10.4 host, as it replicates from an intermediate master and thus it replicates serially due to using a conservative replication config.
These are the database metrics during the tests (tests were not concurrent between hosts):
- pc1009: https://grafana.wikimedia.org/d/000000273/mysql?from=1589521050221&to=1589524843408&var-dc=codfw%20prometheus%2Fops&var-server=pc2009&var-port=9104
These are my findings:
                      average latency (ms)  percentile 95 latency (ms)  read requests per second  write requests per second
VERSION               10.1       10.4       10.1          10.4          10.1         10.4         10.1          10.4
ro, low concurrency   5.23       5.83       5.65          6.38          21384.53     19195.52     0.00          0.00
ro                    15.78      15.99      17.06         17.90         56782.46     56027.81     0.00          0.00
mixed rw              19.50      19.34      21.14         21.48         45943.38     46316.48     16408.35      16541.59
rw, high concurrency  4.54       3.48       7.07          4.93          0.00         0.00         141059.31     183871.83
For memory-only, read-only traffic, the regression seems to be only about 5%, which is around what we expected. Note pc1010 had 13% extra average latency from the client, so it is within the margin of error (the test had to be done over the network to prevent software version differences).
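For reference, the relative differences between the two versions can be recomputed from the average latency columns of the table above. This is only a worked-arithmetic sketch: the numbers are copied from the table, the helper name is made up, and the source does not say which row or column the quoted regression figure was taken from.

```python
def relative_change(old: float, new: float) -> float:
    """Percentage change from old to new (positive means new is larger)."""
    return (new - old) / old * 100.0


# Average latency in ms as (10.1, 10.4) pairs, copied from the table.
avg_latency_ms = {
    "ro, low concurrency": (5.23, 5.83),
    "ro": (15.78, 15.99),
    "mixed rw": (19.50, 19.34),
    "rw, high concurrency": (4.54, 3.48),
}

for test_name, (v10_1, v10_4) in avg_latency_ms.items():
    # A positive value means 10.4 was slower than 10.1 in that test.
    print(f"{test_name}: {relative_change(v10_1, v10_4):+.1f}%")
```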
Thu, May 14
Quick stupid idea - 1) Insert hook after downtime for custom code. 2) Have a configured way to tell which hosts load which class in the hierarchy, be it an abort "this host should never be rebooted", or some other functionality "depool from pybal". 3) Start writing reboot modules for all hosts until complete coverage. 4) Profit!
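A rough sketch of what points 1) and 2) could look like, purely as an illustration: the class names, the host patterns and the `after_downtime` hook name are all hypothetical, and the depool step is a placeholder rather than a real call to any depooling tool.

```python
from fnmatch import fnmatch
from typing import List


class RebootAborted(Exception):
    """Raised when a host must not be rebooted automatically."""


class RebootHandler:
    """Default module: no custom steps after the downtime is set."""

    def after_downtime(self, host: str) -> List[str]:
        return []


class NeverReboot(RebootHandler):
    """Abort: this host should never be rebooted."""

    def after_downtime(self, host: str) -> List[str]:
        raise RebootAborted(f"{host} should never be rebooted")


class PybalDepool(RebootHandler):
    """Placeholder for a module that depools the host first."""

    def after_downtime(self, host: str) -> List[str]:
        # A real module would run the depool here; this only records it.
        return [f"depool {host} from pybal"]


# Hypothetical configuration: the first matching pattern wins.
HANDLERS = [
    ("db*", NeverReboot),
    ("mw*", PybalDepool),
    ("*", RebootHandler),
]


def handler_for(host: str) -> RebootHandler:
    """Pick the reboot module configured for a host name."""
    for pattern, handler_class in HANDLERS:
        if fnmatch(host, pattern):
            return handler_class()
    return RebootHandler()
```

Step 3) would then amount to adding one subclass per kind of host until the catch-all entry is no longer needed.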
Regarding the ss issue: I was able to reproduce this:
Not directly related to refactoring, but I thought this would be very interesting for you in general:
Maybe the sysbench result can give us a better picture of how this is affecting mysql query latency itself (if it is really doing so)
From what I see, we did IO benchmarks. I would like to know if real SQL queries are affected; maybe MariaDB, on memory-limited hosts with loose disk consistency (pc), now generates more IO (but that is ok if it means no extra client latency). I proposed to do some sysbench (SQL) runs of write-intensive queries and see if there is a difference between pc hosts with 10.1 and those with 10.4. If it is a metrics/db behaviour change but doesn't really impact queries, we can ignore it (resolve).
I don't know if I understood this correctly.
Great job here! I looked at every line of the change, and tested it on several runs and it worked nicely. This change, I think, will make further development much easier. You did a lot of work on refactoring- I liked the way you resolved the dependency inversion on the subclasses. I merged as is.
Wed, May 13
Happy to be helpful. Have a nice day!
I've reset it already, let me know when you receive it and change the password to something else (mail is not a very secure method of sending passwords).
no one willing to handle this list
Hi, @minhhuy you still have control of the email account associated with that list, right? I can force a password reset for you.
Independently of the "strength", I think it could be misunderstood, the same way many people now think "all primary keys should be autoincremental integers" instead of "if there are no good options for a PK, just add a new autoinc".
Error 'Row size too large. The maximum row size for the used table type, not counting BLOBs, is 8126. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs' on query. Default database: 'librenms'. Query: 'alter table `ports` add `ifSpeed_prev` bigint null after `ifSpeed`, add `ifHighSpeed_prev` int null after `ifHighSpeed`'
[14:57] <icinga-wm> PROBLEM - MariaDB Slave SQL: m1 on db2078 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1118, Errmsg: Error Row size too large. The maximum row size for the used table type, not counting BLOBs, is 8126. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs on query. Default database: librenms. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshoo
[14:57] <icinga-wm> a_slave