dumps distribution servers space issues
Closed, ResolvedPublic

Description

The dumps distribution servers, labstore1006.wikimedia.org and labstore1007.wikimedia.org, are filling up rather quickly. They are also out of warranty and have both acquired issues that cause random crashes (see T268280: labstore1006 spontaneous reboot and now T281045: labstore1007 crashed after storage controller errors--replace disk?).

It is the nature of the service to increase usage over time, but storage usage appears to have increased a bit faster recently. Also, the storage usage is surprisingly uneven.

  • We should take any actions we can to try to reduce space consumption while we figure out expansion
  • Also, we need to decide what "expansion" looks like, since these servers are already problematic and lack a support contract
  • Currently both are in racks with no room whatsoever. Shelves for them are 2U, and each can support 2 more shelves (doubling current space). However, there's nowhere to put them.

Event Timeline

Screen Shot 2021-04-23 at 3.33.35 PM.png (1×2 px, 1 MB)

This is the space on labstore1006 (red) and labstore1007 (green) as a percentage of space used. It was pretty level for a little while, but it has been climbing steadily since then at a rate that won't work for long (especially given labstore1007's mysteriously higher usage).
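To put a rough number on "won't work for long", a straight-line extrapolation from two readings of the graph is probably enough. This is only a sketch: the function name `days_until_full` and the sample percentages are made up for illustration and are not taken from the actual graph.

```python
from datetime import date


def days_until_full(samples, limit_pct=100.0):
    """Linearly extrapolate when usage reaches limit_pct.

    samples: list of (date, percent_used) tuples, oldest first.
    Returns the estimated number of days after the last sample,
    or None if usage is flat or falling.
    """
    (first_day, first_pct), (last_day, last_pct) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0 or last_pct <= first_pct:
        return None
    rate_per_day = (last_pct - first_pct) / elapsed_days
    return (limit_pct - last_pct) / rate_per_day


# Hypothetical readings, NOT the real values from the screenshot.
samples = [(date(2021, 3, 1), 80.0), (date(2021, 4, 1), 82.0)]
estimate = days_until_full(samples)
```

With real readings from the monitoring graph substituted in, the same call gives a first-order estimate per host. It ignores step changes such as a new dataset landing.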

Bstorm triaged this task as High priority.Apr 26 2021, 11:28 PM
Bstorm moved this task from Backlog to Dumps on the Data-Services board.
Bstorm added subscribers: nskaggs, ArielGlenn.

In the short term, fewer dumps could be kept, although that only gets us so far.
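As an illustration of what "keeping fewer dumps" could mean in practice, here is a minimal sketch that picks which runs to drop when only the newest few are retained. It assumes run directories are named by date as `YYYYMMDD`; the helper `runs_to_prune` and the example names are hypothetical, and this is not how the existing cleanup tooling is implemented.

```python
def runs_to_prune(run_names, keep):
    """Return the run directory names to delete, keeping the `keep` newest.

    Assumes names are YYYYMMDD strings, which sort chronologically
    when sorted as plain strings.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(run_names)
    return ordered[: max(0, len(ordered) - keep)]


# Hypothetical run names for one wiki.
runs = ["20210401", "20210201", "20210301", "20210420"]
to_delete = runs_to_prune(runs, keep=2)
```

The function only selects names and deletes nothing itself, so the list can be reviewed before anything is removed.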

The discrepancy on the hosts is because /srv/dumps/security has 5.5T on the one host and doesn't exist on the other. Guess their team ought to be roped in on this discussion.
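For anyone wanting to repeat that comparison, a sketch along these lines would flag directories whose size differs a lot between the two hosts. It assumes per-directory byte counts have already been collected on each host (for example from `du` output); the function name, the host labels, and every figure other than the 5.5T for /srv/dumps/security are made up.

```python
def usage_discrepancies(host_a, host_b, threshold_bytes):
    """Compare per-directory usage between two hosts.

    host_a, host_b: dicts mapping directory path -> bytes used.
    A directory missing on one host counts as 0 bytes there.
    Returns (path, bytes_a, bytes_b) tuples whose difference is at
    least threshold_bytes, largest difference first.
    """
    rows = []
    for path in sorted(set(host_a) | set(host_b)):
        a = host_a.get(path, 0)
        b = host_b.get(path, 0)
        if abs(a - b) >= threshold_bytes:
            rows.append((path, a, b))
    rows.sort(key=lambda row: abs(row[1] - row[2]), reverse=True)
    return rows


GB = 10**9
# Only the 5.5T security figure comes from this task; the rest is illustrative.
host_with_security = {"/srv/dumps/xmldatadumps": 40_000 * GB,
                      "/srv/dumps/security": 5_500 * GB}
host_without_security = {"/srv/dumps/xmldatadumps": 40_100 * GB}
report = usage_discrepancies(host_with_security, host_without_security, 1_000 * GB)
```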

Ah yeah, that's still there. I didn't see it on a very quick check of tab completing (because permissions) and dreamed it was gone :) @Reedy and @JFishback_WMF this is a public ticket, so I don't know if we need another one to discuss that material, but please note the description. This chunk of security data is a bit much for the dumps server to host if it keeps climbing as it is and the server has become more unreliable as well. I'm not sure you want potentially-unique data that matters to you stored on that storage controller.

Linking here as a related issue: T281048 (storage for security related data also under discussion there)

The discrepancy on the hosts is because /srv/dumps/security has 5.5T on the one host and doesn't exist on the other. Guess their team ought to be roped in on this discussion.

Having checked into the relevant places, it seems we do need to keep that data, and unfortunately no idea when that requirement will go away.

Do we have any other options of where we can keep this data? At least, until security potentially find a longer term, more dedicated home for this sort of stuff.

@Reedy I have found a place in the cloud universe where I could put it in CODFW on a temporary basis. That would get it off systems that have public uses as well as saving us some of the disk space problem. A new setup for storing this data really needs to be purchased outside of cloud systems, though.

@Reedy @JBennett Let's get an ask into the Hardware requests for next fiscal and target Q1 for a system that can store this data securely and meet your current and future needs. I would ask you to prioritize doing this migration in Q1 as we are unable to continue to store the data. Ping me if you'd like help on requesting.

I've brought up resurrecting T246954/T247492 on the security team channel, along with working out what to do longer term for this sort of data storage.

Obviously it'll take a little while for the whole procurement and implementation pipeline to happen for the server to be "ready for use". But the sooner we restart those discussions, the sooner that can potentially happen, giving us somewhere to actually put the data longer term that shouldn't cause issues for other teams/services, etc.

I think there's also some duplication of data (potentially a compressed and a non-compressed version). So some time spent looking at that could help slim the dataset down too, though I don't know who has access to it these days: since Chase left, no one on the team has root.

@Reedy new location for this is set up. I'll get on IRC and see how you want to move it.

server: cloudbackup2001.codfw.wmnet and directory /srv/security-temp. There's 7 TB available and it is a much more suitable temporary location.

Hey folks, we're getting ready for hosting of OKAPI (Wikimedia Enterprise) HTML dumps on here and having that space available is going to be important. We need to be able to keep several copies, as we do with all dumps, typically a few months' worth, and each run is over 1T. People planning the rollout of the HTML dumps were looking at mid-June for this. Any chance we can get that data moved by then? @Reedy can you poke someone to make it happen?
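A back-of-the-envelope check of that space requirement could look like the sketch below. The only figures from this task are "over 1T per run" and the 5.5T held by the security data; the number of retained runs is an assumption, and because 1T is a lower bound the result is optimistic.

```python
def html_dumps_fit(run_size_tb, retained_runs, freed_tb):
    """Return (needed_tb, fits) for keeping `retained_runs` copies.

    run_size_tb is treated as a lower bound per run, so needed_tb is
    a minimum and `fits` may be too optimistic.
    """
    needed_tb = run_size_tb * retained_runs
    return needed_tb, needed_tb <= freed_tb


# 1.0 TB per run (lower bound) and 5.5 TB freed come from this task;
# the retained-run counts are hypothetical.
four_runs = html_dumps_fit(1.0, 4, 5.5)
six_runs = html_dumps_fit(1.0, 6, 5.5)
```

This says nothing about growth of the existing dumps, which keep consuming space independently.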

No one on the security team can actually move it (no roots etc).

Brooke did PM me about it as she said, and I think that's fine to be moved as she suggested. That is, of course, if the host is still good for use with OKAPI incoming too, or whether it's no longer appropriate.

T246954/T247492 have been reopened, but we're not going to have that hardware in place for a while I guess

The usage data is still marching upward slowly no matter what, but this cluster is far more suitable for the incoming data if I can get this data off of the 1007 node. Refreshes are hoped for and hardware issues are being worked on as an aside.

From my perspective, I need to get that security data out of there no matter what because it is a space issue, even if the best I can get is moving to another server that is prefixed with labs or cloud. As for *how* to move it, normally, I'd use the puppet rsync convenience classes, but this is a complex rsync server as is, so I'm not sure it won't have conflicts without some careful checking. I'll take a look and see what I can do.

Ok, the security team data is now moved to another server. labstore1007 is showing percent full of 84% /srv/dumps which is far more similar to labstore1006 (83% /srv/dumps). They still don't match, but at least they are closer.

Bstorm mentioned this in Unknown Object (Task).Jul 26 2021, 9:45 PM
sbassett changed the task status from Open to Stalled.Aug 4 2021, 6:46 PM
sbassett assigned this task to Reedy.
sbassett moved this task from In Progress to Waiting on the Security-Team board.
Bstorm added a subtask: Unknown Object (Task).Aug 10 2021, 6:27 PM

If this is becoming an urgent issue, I can think of these mitigations:

T246954/T247492 have been reopened, but we're not going to have that hardware in place for a while I guess

...and declined again, apparently due to inaction on the Security-Team's part?

Ok, the security team data is now moved to another server. labstore1007 is showing percent full of 84% /srv/dumps which is far more similar to labstore1006 (83% /srv/dumps). They still don't match, but at least they are closer.

@Reedy - is there anything else #security-team-related here? Or should we untag the team for now?