User Details
- User Since
- Jan 7 2025, 6:49 PM (48 w, 2 h)
- Availability
- Available
- IRC Nick
- federico3
- LDAP User
- Federico Ceratto
- MediaWiki User
- FCeratto-WMF
Today
Fri, Dec 5
Testing https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1215116/1
and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214083/3
using test-cookbook -c 1215116 sre.mysql.clone --source db1233 --target db1229 --nopool -t T411805
Thu, Dec 4
Added documentation as a new heading: https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host#Rolling_restarts_for_the_ES_section
I've updated the dashboard again to ignore test-s4, and it now shows zero hosts with misconfigured GTID. Next step: move the check/alarm into Prometheus/Puppet and log alerts on IRC.
Verified out-of-band on IRC: @Joe received the password and the database is ready. Closing.
Tue, Dec 2
The clone cookbook has not been changed yet. I can update it while working on T410084.
Mon, Dec 1
I can run schema_change on my side or we can run it together in a shared tmux session.
To summarize the ongoing investigation:
- There was a bug in _generate_plan, triggered only when get_columns is used in check() with --run and all_dbs=True. The bug is fixed in https://gitlab.wikimedia.org/repos/data_persistence/dbtools/auto_schema/-/merge_requests/15 by removing _generate_plan. It was introduced in add376bcd4679c41e1e060f6ff9ac145fe3f8abb
- There is a different, older bug that also occurs when get_columns is used in check(); I'm able to reproduce it in https://gitlab.wikimedia.org/repos/data_persistence/dbtools/auto_schema/-/merge_requests/16 only with all_dbs=False.
- I'm not sure whether https://gitlab.wikimedia.org/repos/data_persistence/dbtools/auto_schema/-/merge_requests/15#note_176572 fits into the first bug, the second bug, or is something different.
Thu, Nov 27
Reopening while fixing replication after host reboots
Wed, Nov 26
The host is repooled, so we can close this task. @Marostegui can you please clarify the difference around "OS errors (they aren't even on the HW logs)"? E.g. are we seeing cases where Offline_Uncorrectable counts are false positives and the drives are healthy?
As related tasks might want to address:
Repooling, as the RAID is not degraded yet, and monitoring MariaDB performance.
db2166 is not showing any metrics on https://grafana.wikimedia.org/goto/n9QcXmZvg?orgId=1 , but other hosts are showing inconsistent disk temperature readings. For example, db1178 shows temperatures between 26 and 28 for the 8 RAID drives, but smartd is logging between 72 and 74 Celsius.
ssh db1178.eqiad.wmnet -t "sudo journalctl --since '1 h ago' --identifier smartd"
Nov 26 11:43:02 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 73 to 72
Nov 26 11:43:02 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 73 to 74
Nov 26 12:13:01 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 72 to 73
Nov 26 12:13:01 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 74 to 73
Connection to db1178.eqiad.wmnet closed.
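To make the discrepancy between smartd and the dashboard easier to check across hosts, here is a minimal Python sketch that extracts the latest Temperature_Celsius value per disk from smartd journal lines like the ones above and flags disks where another source disagrees. The function names, the tolerance, and the dashboard values (27, picked from the 26 to 28 range mentioned above) are illustrative assumptions, not part of any existing tool.

```python
import re

# Matches smartd journal lines such as:
# "... [megaraid_disk_05] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 73 to 72"
TEMP_RE = re.compile(
    r"\[(?P<disk>megaraid_disk_\d+)\].*?"
    r"194 Temperature_Celsius changed from (?P<old>\d+) to (?P<new>\d+)"
)


def latest_temperatures(journal_lines):
    """Return {disk: last temperature reported by smartd}, in journal order."""
    temps = {}
    for line in journal_lines:
        match = TEMP_RE.search(line)
        if match:
            temps[match.group("disk")] = int(match.group("new"))
    return temps


def mismatches(smartd_temps, dashboard_temps, tolerance=5):
    """Hypothetical helper: disks whose two readings differ by more than `tolerance` degrees."""
    return {
        disk: (smartd_temps[disk], dashboard_temps[disk])
        for disk in smartd_temps
        if disk in dashboard_temps
        and abs(smartd_temps[disk] - dashboard_temps[disk]) > tolerance
    }


# Journal lines copied from the db1178 output above.
sample = [
    "Nov 26 11:43:02 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 73 to 72",
    "Nov 26 11:43:02 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 73 to 74",
    "Nov 26 12:13:01 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 72 to 73",
    "Nov 26 12:13:01 db1178 smartd[2802618]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 74 to 73",
]

smartd_temps = latest_temperatures(sample)

# Assumed dashboard readings: placeholder values, not real per-disk data.
dashboard_temps = {"megaraid_disk_05": 27, "megaraid_disk_08": 27}

print(mismatches(smartd_temps, dashboard_temps))
```

With the sample lines, both disks end at 73 according to smartd, so both would be flagged against a dashboard value in the high twenties.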
A summary of disk errors on the host:
for n in {0..10}; do echo $n; sudo smartctl -a /dev/bus/0 -d megaraid,$n | grep Uncorrec; done
0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 352
1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
3
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 48
4
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 240
5
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 72
6
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2896
7
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
9
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 104
10
MariaDB logged multiple slow writes: https://phabricator.wikimedia.org/P85718
The host is showing multiple bad sectors
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], 352 Offline uncorrectable sectors
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 61 to 60
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 63 to 62
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 65 to 63
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], 48 Offline uncorrectable sectors
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 63 to 62
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], 240 Offline uncorrectable sectors
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 61
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], read SMART Attribute Data worked again, warning condition reset after 1 email
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], 72 Offline uncorrectable sectors (changed +16)
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Usage Attribute: 13 Read_Soft_Error_Rate changed from 100 to 99
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Usage Attribute: 180 Unused_Rsvd_Blk_Cnt_Tot changed from 100 to 99
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_06] [SAT], 2896 Offline uncorrectable sectors
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_06] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 63 to 62
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 65 to 64
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 63 to 62
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_09] [SAT], 104 Offline uncorrectable sectors
Nov 26 11:49:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_09] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 64 to 63
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], 352 Offline uncorrectable sectors
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 60 to 61
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 63 to 64
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], 48 Offline uncorrectable sectors
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 63
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], 240 Offline uncorrectable sectors
Nov 26 12:19:18 db2166 smartd[932]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Nov 26 12:19:18 db2166 smartd[932]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], 72 Offline uncorrectable sectors
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_06] [SAT], 2896 Offline uncorrectable sectors
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 64 to 65
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 63
Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_09] [SAT], 104 Offline uncorrectable sectors
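The smartd entries above can be reduced to a per-disk count of uncorrectable sectors. The following is a rough Python sketch of that summary, assuming the journal format shown here; the function name and the regular expression are illustrative and would need adjusting for other controllers or smartd versions.

```python
import re

# Matches smartd lines such as:
# "... [megaraid_disk_06] [SAT], 2896 Offline uncorrectable sectors"
# (a trailing "(changed +16)" is tolerated because search() does not anchor the end)
SECTORS_RE = re.compile(
    r"\[(?P<disk>megaraid_disk_\d+)\] \[SAT\], (?P<count>\d+) Offline uncorrectable sectors"
)


def uncorrectable_by_disk(journal_lines):
    """Return {disk: most recent Offline uncorrectable sector count}."""
    counts = {}
    for line in journal_lines:
        match = SECTORS_RE.search(line)
        if match:
            counts[match.group("disk")] = int(match.group("count"))
    return counts


# The 12:19:18 entries from the db2166 log above.
sample = [
    "Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], 352 Offline uncorrectable sectors",
    "Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], 48 Offline uncorrectable sectors",
    "Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], 240 Offline uncorrectable sectors",
    "Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], 72 Offline uncorrectable sectors",
    "Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_06] [SAT], 2896 Offline uncorrectable sectors",
    "Nov 26 12:19:18 db2166 smartd[932]: Device: /dev/bus/0 [megaraid_disk_09] [SAT], 104 Offline uncorrectable sectors",
]

counts = uncorrectable_by_disk(sample)

# Worst disks first.
for disk, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{disk}: {count}")
```

On the sample, six of the drives report a non-zero count, and megaraid_disk_06 accounts for most of the affected sectors.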
Tue, Nov 25
The CR was approved and the cookbook was tested with a real repool. Merging and closing.
Related PR approved and merged.
Regarding "Add metrics also to single-db dashboard" I made a "demo" dashboard with the new panel on the left titled "dbctl weight": https://grafana.wikimedia.org/goto/L7_M7RWvR?orgId=1
Tue, Nov 18
The 5 VMs are showing up on zarcillo and replicating: https://zarcillo.wikimedia.org/ui/sections#test-s4 . The Prometheus metrics will start working as expected once the old metrics, created before the VMs were redeployed (with a different server_id), disappear.
Mon, Nov 17
Moving code from https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1129904 into https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy to:
- initially provide a CLI tool for the team
- later on, provide a library that can be used in a switchover cookbook by other SREs
@Marostegui I added initial documentation at https://phabricator.wikimedia.org/T384212#11378487
I'm adding documentation for the Web UI at https://doc.wikimedia.org/data_persistence/zarcillo/README.html#_web_ui as a way to share progress here. I can also paste the documentation here if desired.
Puppet is already configured to place VMs in test-s4. Deployment is tracked in https://phabricator.wikimedia.org/T400056