I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
On the topic of NFS mounts, as it relates to the unmounting issues we've seen, for reference: https://access.redhat.com/solutions/157873
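For my own notes, the usual diagnosis when an NFS unmount hangs goes something like this (the mount point here is hypothetical):

$ fuser -vm /mnt/nfs    # list processes still holding the mount open
$ lsof +f -- /mnt/nfs   # another view of open files on that mount
$ umount -l /mnt/nfs    # last resort: lazy unmount (detach now, clean up when it's no longer busy)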
I believe T199276#4420812 was possibly due to the NFS mount happening after package installation (assuming the package's setup runs when puppet installs it, which I haven't confirmed yet), since the package install should have created those dirs AND run the initialization script. This becomes a non-issue if puppetdb is used to get around NFS.
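A quick way to check that theory on a node would be something like this (assuming the package is gridengine-master; I haven't verified exactly which paths it ships):

$ dpkg -L gridengine-master | grep '^/var/spool/gridengine'   # dirs/files the package installs
$ findmnt /var/spool/gridengine                               # non-empty output means something is mounted directly on top
$ mount | grep gridengine                                     # or just look for NFS mounts shadowing those paths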
As a note: getting puppet to run on a new Trusty tools node requires downgrading libgdal-dev. I presume this is because Trusty isn't really supported here anymore; libgdal was upgraded beyond what the libraries needed by the grid at WMF support.
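For anyone repeating this, the downgrade is just pinning apt to the older version (I didn't record the exact version string here):

$ apt-cache policy libgdal-dev                      # see which versions are available/installed
$ sudo apt-get install libgdal-dev=<older-version>  # install the older version shown in the policy output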
Thu, Jul 12
Excellent, apparently now that the cluster is running in toolsbeta, puppet succeeds correctly.
It does not. Very interesting.
The service survives a puppet run, but puppet invariably complains about some things:
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for toolsbeta-grid-master.toolsbeta.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1531433278'
Notice: /Stage[main]/Gridengine::Master/Service[gridengine-master]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Gridengine::Master/Service[gridengine-master]: Unscheduling refresh on Service[gridengine-master]
error: commlib error: got select error (Connection refused)
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[release]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[release]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]: Could not evaluate: Field 'shortcut' is required
I think much of this was a chicken-or-egg thing. All of the above should have been created by the package on install. This suggests that puppet may be mounting NFS over the package's install locations (which can be resolved). It also suggests, yet again, that getting rid of the NFS config would be good.
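For the record, "manual creation" below means adding the complexes by hand with qconf, roughly like this (the shortcut/type values here are illustrative guesses, not necessarily what the puppet definitions use):

$ qconf -sc     # show the current complex (resource) definitions
$ qconf -mc     # opens the complex list in $EDITOR; add lines along these lines:
#name       shortcut   type      relop  requestable  consumable  default  urgency
h_vmem      h_vmem     MEMORY    <=     YES          YES         0        0
release     rel        RESTRING  ==     YES          NO          NONE     0
user_slot   u_slot     INT       <=     YES          YES         0        0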
After manual creation (for now), we are brought to:
07/12/2018 19:20:25| main|toolsbeta-grid-master|E|database directory /var/spool/gridengine/spooldb doesn't exist
07/12/2018 19:20:25| main|toolsbeta-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed
07/12/2018 19:20:25| main|toolsbeta-grid-master|C|setup failed
T199276#4420812 is one thing needed for this.
Apparently, there is a dependency on a particular directory that isn't in puppet:
06/06/2018 18:53:57| main|toolsbeta-grid-master|C|can't change to directory "/var/spool/gridengine/qmaster"
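Creating the missing spool directories by hand looks roughly like this (the sgeadmin ownership is my assumption of what gridengine expects here, not confirmed):

$ sudo mkdir -p /var/spool/gridengine/qmaster /var/spool/gridengine/spooldb
$ sudo chown -R sgeadmin:sgeadmin /var/spool/gridengine
$ sudo service gridengine-master restart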
Wed, Jul 11
Interestingly, the grid master is down in tools beta. It also fails on puppet runs. Poking at that.
This is done.
Tue, Jul 10
labsdb1006 is now also moved. Asked @akosiaris for help getting it up correctly as a master.
labsdb1004 is moved; tomorrow will be labsdb1005.
labsdb1010 is moved.
This appears to be a problem with the monitor more than the array.
- also, action item for me: check into mount option possibilities to make this work better
Well, that was silly of me. Of course there are a bunch of roles and things not created on the re-imaged server. It'll probably need some kind of dump and restore to make this easy unless there's a doc around.
Re-imaged labsdb1006 to stretch. In the process, I found that the storage is a bit odd. One of the LVs is named "_placeholder", which prevents puppet from working, and it isn't mounted. This could be by design. I renamed _placeholder to the correct volume name, matching the current master, and apparently had to create a filesystem on it. Puppet created the directory tree there once I mounted it, and I think the cron job that syncs files over from OSM should run in a few minutes (checking /var/spool). If that finishes by morning, it should actually be ready then. This was a bit heavier than I expected, but it might work out.
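For the record, the manual storage steps were roughly the following (the VG/LV names and the ext4 choice here are from memory, so treat them as illustrative):

$ sudo lvs                                   # locate the "_placeholder" LV and its VG
$ sudo lvrename <vg> _placeholder <lvname>   # rename to match the current master's layout
$ sudo mkfs.ext4 /dev/<vg>/<lvname>          # it had no filesystem yet
$ sudo mount /dev/<vg>/<lvname> /srv         # mount point per the existing fstab/puppet config
$ sudo puppet agent -tv                      # let puppet build the directory tree on it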
Mon, Jul 9
New shelf is now live and part of the /srv/dumps filesystem on labstore1006. It isn't fully restored to service yet, but everything looks good to do so.
Thu, Jul 5
Disabled the unused RAID controller in the BIOS, which addresses at least half of this alert. However, the server is also missing a battery, which HP considers an optional purchase but which we should have.
Now that the spam from the last vandalism has died down, @RobH, I am curious what can be done about the battery. There is some quirky history regarding the array here, but I figure we probably need to buy the battery either way. The RAID cards for this server and its partner were shipped without the "optional" cache backup battery. That is at least one reason there are degraded RAID alerts for them.
I went ahead and disabled the unused RAID controller in the BIOS. I have confirmed this is not enough to clear the monitor. The lack of battery still reads as "critical".
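For checking the battery from the OS side (same HPE utility under either of its names, hpssacli or ssacli):

$ sudo hpssacli ctrl all show status         # controller state plus battery/capacitor status
$ sudo hpssacli ctrl all show config detail  # per-array detail, including the cache module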
That seems fair.
@jcrespo -- With some RAID issues still giving me trouble, we could perhaps do that stretch upgrade when we move to VMs. Otherwise, would doing it now draw out the service impact a lot? The databases would need to come down during it, I presume, and perhaps we can do that.
In discussions at today's retrospective, one proposal discussed is to nice the puppet agent process on worker and exec nodes in toolforge, prioritizing user code, which could help with some things (or cause additional staleness alerts).
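If we go that route, the simplest form is just wrapping whatever kicks off the agent run (cron or otherwise) in nice/ionice, something like this (a sketch, not what's in puppet today):

$ nice -n 19 ionice -c3 puppet agent --onetime --no-daemonize --verbose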
One of the things I've found causing these failures is a condition (one that is supposedly solved in a Red Hat Network article that I cannot access, I hear) that causes Puppet to attempt to mount already-mounted NFS mounts. This throws exit code 32 (already mounted), which puppet treats as a failure and kills the whole run.
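At the shell level, the workaround is just guarding the mount so it is only attempted when nothing is mounted there (mount point is hypothetical):

$ mountpoint -q /mnt/nfs || mount /mnt/nfs   # skip the mount if something is already there
$ echo $?                                    # stays 0 instead of the 32 that kills the puppet run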
Removing this task in favor of the previous. There's a lot of work and context in that one.
Found the previous ticket for this. Adding it as a subtask in order to preserve that context, at very least.
Sorry about that!
Wed, Jul 4
So I've added the extra args and restarted the kubelet process. However, not only did this not clean up space, but running docker container prune and docker image prune also didn't help as much as they would be expected to. A bit strange.
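For reference, what I ran plus what I'd check next (all stock Docker CLI, nothing tool-specific):

$ docker system df             # where the space actually is: images, containers, volumes, build cache
$ docker container prune -f    # remove stopped containers
$ docker image prune -a -f     # remove unused images, not just dangling ones
$ du -sh /var/lib/docker/*     # see which subtree is still holding the space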
Tue, Jul 3
You should be good to go for now. I've created your project and added you to it. You should be able to access things via Horizon now to get set up.
Mon, Jul 2
Oh no, the timing here lines up the other way around: the page join removal would have put things back to the way they were before you started to see problems, and the addition of the joins is likely to still be a problem. If I put the joins back, it would crush performance for basically anything querying the page table, and I'm quite sure that has no effect here. The MCR revisions (which added lots more joins, including to the revision table), however, introduced many joins to the tables you are querying. Removing those joins might fix things, but it would also break backward compatibility with MCR. I am concerned about the overall health of the database system on that server, though, because of how much I see things in an "opening tables" state. That seems weird. It could be connected to the MCR changes, or it might be something else.
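Roughly what I'm watching on that server (these are all standard MySQL/MariaDB status variables, nothing WMF-specific):

$ mysql -e "SHOW FULL PROCESSLIST" | grep -c 'Opening tables'   # sessions currently stuck in that state
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Opened_tables'"            # climbing fast => table cache churn
$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'table_open_cache%'"     # current cache sizing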
I've rebased and set up the patch for the replicas. Is the actor table ready to go with img_actor and all that now?
Also, any comments on the patch?
Fri, Jun 29
No, that's not needed in this case, because it's a peculiar one. The servers are a mirror of OpenStreetMap, so we can just resync fully from upstream (plus some minor pg_dump+import right before the switch). But
Bump up the switch move or the stretch upgrade? The switch move is scheduled for the 10th and 11th to avoid conflicts with holiday plans and so forth as well as to coincide with two other database clusters moving on the same days.
A timeline for upgrading to stretch or this move event? The basic datacenter reconfig is just scheduled to happen on the dates in the description. To make 1006 the master, we should be mindful that replication needs to be fully set up (it is non-functional at the moment).
Should be. It's an HP Smart P420i with a RAID 10 logical disk, and this is the only failure. Unless the disk itself isn't a hot-swap form factor, it should be good, right? I'm, of course, presuming that it is a hot-swap form factor, which might be silly.
Thu, Jun 28
Cabling information grabbed from these two documents: D3600 manual: http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04219600-1.pdf
D3000 series wiring guide: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c05252635
Wed, Jun 27
@jcrespo I was going through this again today, and I noticed that the primary replica server for the web has an awful lot of this "opening tables" going on--even from 'system user' processes. I also noticed this is not happening on the analytics servers.
puppetdb-terminus appears to be installed. I don't see it configured, though.
It would appear that this was not set up on the tools puppetmaster at this time:
Jun 27 18:01:06 tools-puppetmaster-01 puppet-master: You cannot collect exported resources without storeconfigs being set; the export is ignored at /etc/puppet/modules/monitoring/
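For my own notes, wiring the terminus up would be something along these lines on the puppetmaster (standard puppetdb-terminus settings; I haven't verified the exact file layout on this host):

$ sudo puppet config set storeconfigs true --section master
$ sudo puppet config set storeconfigs_backend puppetdb --section master
# plus an /etc/puppet/puppetdb.conf pointing at the puppetdb host, e.g.:
#   [main]
#   server_urls = https://<puppetdb-host>:8081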
@RobH Any thoughts on that battery issue above? I'm going to see if the first controller that isn't being used can be disabled in the BIOS or something.
This server appears to be fully functional from all views I can see. However, the monitor for RAID would disagree and think it is critical. I believe it reports that there are no drives on one controller (which is correct!) and no batteries on the live controller (which I'm not so sure of). If the live controller really doesn't have a battery and isn't supposed to, that's probably fine. If it should be reporting a battery, then we might still have something to fix. I'll dig around a little regarding that.
Looking good! The VM is doing a puppet run. I think the network is working on these things now.
This is currently still some kind of an issue on both servers. The thing is that I'm not sure if it is a problem or just describing reality (embedded controller has no disk and installed controller doesn't report a battery).
Tue, Jun 26
Fri, Jun 22
$ systemctl status prometheus@tools
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@.service; static)
   Active: active (running) since Fri 2018-06-22 19:44:36 UTC; 7min ago
 Main PID: 12729 (prometheus)
   CGroup: /system.slice/system-prometheus.slice/prometheus@tools.service
           └─12729 /usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-a.
Dude! The answer was right in front of me. On a *nix system, prometheus tries to open the file for reading: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock_unix.go#L43
Apparently, the file's existence shouldn't matter (at least in current Prometheus). It should be able to lock it, but it cannot: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock.go#L31
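An easy way to test that theory outside of Prometheus is util-linux flock against the same storage path:

$ flock -n /srv/prometheus/tools/metrics -c 'echo flock works here'   # fails the same way if the filesystem can't take the lock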
$ systemctl status prometheus@tools
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@.service; static)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2018-06-22 19:02:37 UTC; 1s ago
  Process: 30784 ExecStart=/usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-address 127.0.0.1:9902 -web.external-url https://tools-prometheus.wmflabs.org/tools -storage.local.retention 730h0m0s -config.file /srv/prometheus/tools/prometheus.yml -storage.local.chunk-encoding-version 2 (code=exited, status=1/FAILURE)
 Main PID: 30784 (code=exited, status=1/FAILURE)
This seems pretty good at this point, so I'll close this task for now.
Taking a look at where I think the query killer lives, it seems like the comment won't have any effect. However, it could be that some exception is needed for LOAD DATA LOCAL type statements? @jcrespo Am I close to the mark here?
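I don't know our killer well enough yet, but if it behaves anything like pt-kill, the exception would be an ignore pattern on the statement text, something like this (purely illustrative; our killer may not support this at all):

$ pt-kill --busy-time 300 --match-command Query --ignore-info '^LOAD DATA' --print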