TZ: UTC +1/+2
User Details
- User Since: Sep 1 2016, 6:48 AM (393 w, 4 d)
- Availability: Available
- IRC Nick: marostegui
- LDAP User: Marostegui
- MediaWiki User: MArostegui (WMF)
Yesterday
We just had an email letting us know that this key is now being used in WMCS as well
Thu, Mar 14
It was: es2020 (and only that host) had a misconfigured max_allowed_packet (which is strange, as this is handled by puppet) that was way smaller than normal, even though the value on file is okay.
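In case it helps anyone hitting the same symptom, a minimal sketch of the check, assuming standard client access on the host and the usual /etc/mysql config path (both assumptions, not necessarily what was run here):

# Compare the runtime value with what puppet renders on disk
sudo mysql -e "SELECT @@GLOBAL.max_allowed_packet;"
grep -Ri 'max_allowed_packet' /etc/mysql/
# If only the runtime value is off, it can be bumped live (lost on the next restart)
sudo mysql -e "SET GLOBAL max_allowed_packet = 32 * 1024 * 1024;"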
Logstash now seems happy. Thanks a lot for reporting this!
I am checking if it might be related to multi-master, which was enabled in the morning.
We've been discussing with @Ladsgroup that we need to double check whether the IPs are really used by dbctl/MW or it is just some tech debt. dbctl isn't too old, so if they were added it was probably for a reason, but maybe they're not needed anymore. If that is the case, that'd simplify things a lot.
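A rough sketch of how that double check could look, assuming the addresses are greppable in the live dbctl config and in a mediawiki-config checkout (the IP and the checkout path below are placeholders, not the actual ones from this task):

# Does the IP still appear in the live dbctl configuration?
sudo dbctl config get | grep -F '10.64.0.1'
# Does it still appear anywhere in mediawiki-config?
grep -rF '10.64.0.1' /srv/mediawiki-config/wmf-config/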
This is all done
Wed, Mar 13
So the filesystem was totally corrupted. A full reimage (deleting all the partitions) seems to have fixed it.
I am recloning the host right now and I will leave it replicating for a few days to make sure the storage is stable.
Rebooting the server resulted in the same XFS errors and the OS doesn't boot past mounting the filesystem. I am going to reimage it in case it all got corrupted after the crash. Hopefully this will fix it; if not, it means the storage is physically broken. Will report back.
I've managed to get it to boot up past grub and it looks storage related:
Starting default.target
[85598.425324] XFS (dm-0): Metadata corruption detected at xfs_agi_verify+0x11a/0x170 [xfs], xfs_agi block 0x1e3a5e02
[85598.435932] XFS (dm-0): Unmount and run xfs_repair
[85598.440731] XFS (dm-0): First 128 bytes of corrupted metadata buffer:
[85598.447172] 00000000: 58 41 47 49 00 00 00 01 00 00 00 01 03 c7 4b c0  XAGI..........K.
[85598.455178] 00000010: 00 00 00 40 00 00 00 03 00 00 00 01 00 00 00 3b  ...@...........;
[85598.463176] 00000020: 00 00 00 c0 ff ff ff ff ff ff ff ff 00 00 00 c1  ................
[85598.471176] 00000030: 00 00 00 c2 00 00 00 c3 00 00 00 c4 ff ff ff ff  ................
[85598.479175] 00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
[85598.487177] 00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
[85598.495183] 00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
[85598.503182] 00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
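The kernel message itself suggests unmounting and running xfs_repair; a sketch of that path, assuming the filesystem can be unmounted and that dm-0 maps to the usual tank-data volume on /srv (an assumption based on the similar es hosts further down, not something verified on this host):

# Dry run first: -n reports problems without modifying anything
umount /srv
xfs_repair -n /dev/mapper/tank-data
# Actual repair; badly damaged metadata may be discarded
xfs_repair /dev/mapper/tank-data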
For what it's worth, this error is also present in the HW logs; even though it is a month old, it might be an indication of something else:
-------------------------------------------------------------------------------
Record:      26
Date/Time:   02/20/2024 06:53:35
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 4 device 0 function 0.
-------------------------------------------------------------------------------
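For cross-checking from the OS side, something like the following could pull the recent hardware event log, assuming ipmitool is available on the host (it may or may not show this exact record, which comes from the management controller's own log):

# Last few entries of the system event log, in extended format
sudo ipmitool sel elist | tail -n 20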
This has been done. Give it 30-45 minutes for the change to spread across the infra.
Tue, Mar 12
This has been deployed. Please give it 30-45 minutes for puppet to run everywhere and spread the change.
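If waiting is not an option, a puppet run can also be forced on the affected hosts; a sketch assuming cumin access and the usual wrapper script, with the host alias below being a placeholder:

# Force an immediate puppet run instead of waiting for the normal cycle
sudo cumin 'A:mw' 'run-puppet-agent'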
Mon, Mar 11
@thcipriani would you approve this request to mwmaint?
Sun, Mar 10
Yes, but on the original ticket the username listed is Himejijo, which doesn't match that email. That's why I believe you just need to confirm whether that username is really meant to be rkhan.
Fri, Mar 8
Host recloned and being slowly repooled.
Thanks @RLazarus for addressing this incident!
I could probably fix it right away, but I think I am going to quickly reclone it instead
Thu, Mar 7
I have merged and deployed the change. Please give it 30 minutes or so for the change to spread everywhere across the fleet.
Yeah, I think the whole procedure needs to be followed (using the correct template for this ticket too)
@Fabfur as an SRE I assume you'd self-serve?
Which access do you specifically need for analytics-private-users? https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_shell_(posix)_groups_explained
Closing this as the change is merged and will be deployed with the next train. Thanks everyone!
The user rkhan does exist, but it is not associated with that email. Can you post the email that is associated with that user, and confirm whether you really meant rkhan instead of Himejijo?
@Himejijo are you sure that is your wikitech user name and your email? I cannot find anything for either of those two.
Waiting for ssh out of band verification
Wed, Mar 6
@odimitrijevic I assume you are also their manager and hence approving for manager and analytics group?
This is all done
@bdgreenlee please follow the ticket template at https://phabricator.wikimedia.org/maniphest/task/edit/form/8/
@FBellamy-WMF we'd need your manager to approve this.
bdgreenlee added to WMF group.
Thanks. Forcing a table rebuild fixed db1186, but if we are adding the index, there's no point in spending time doing this anywhere else.
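For reference, one common way to force a rebuild like that is a no-op ALTER; whether that is exactly what was used here is an assumption, and the database and table names below are placeholders:

# Rebuilds the table in place, which also rebuilds its indexes
sudo mysql enwiki -e "ALTER TABLE revision ENGINE=InnoDB;"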
Tue, Mar 5
[16:55:08] marostegui@cumin1002:~$ sudo cumin es20[35-40].codfw.wmnet 'lvextend -L+1T /dev/mapper/tank-data ; xfs_growfs /srv/ ; df -hT /srv'
6 hosts will be targeted:
es[2035-2040].codfw.wmnet
OK to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit: 6
===== NODE GROUP =====
(6) es[2035-2040].codfw.wmnet
----- OUTPUT of 'lvextend -L+1T /...v/ ; df -hT /srv' -----
  Size of logical volume tank/data changed from 9.09 TiB (2384188 extents) to 10.09 TiB (2646332 extents).
  Logical volume tank/data successfully resized.
meta-data=/dev/mapper/tank-data  isize=512    agcount=32, agsize=76294016 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=2441408512, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 2441408512 to 2709843968
Filesystem             Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data  xfs    11T   73G   11T   1% /srv
Thank you so much!
Eqiad hosts are done, the following ones didn't change their query plan:
db1186
db1218
db1207
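A sketch of how a plan change could be verified on one of those hosts, assuming a plain EXPLAIN against the replica; the query below is a placeholder, not the one from this task:

# Check whether the optimizer picks the new index
sudo mysql enwiki -e "EXPLAIN SELECT rev_id FROM revision WHERE rev_page = 1234;"
# Refreshing index statistics can change the plan without a full rebuild
sudo mysql enwiki -e "ANALYZE TABLE revision;"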
Thanks! @Jhancock.wm see above, you can proceed whenever you want.