User Details
- User Since
- May 11 2015, 8:31 AM (552 w, 4 d)
- Availability
- Available
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
- JCrespo (WMF) [ Global Accounts ]
Yesterday
@SEgt-WMF any update?
There is nothing else to do here for clinic duty until user gets back to us.
Access has been merged and deployed. @Leif_WMDE, please test it and reopen if you have any issue with it or any other question.
So my takeaway is (simplifying):
Alternatively, if they are "good to back up just in case", but not "critical", they should be backed up with the normal schedule (daily instead of hourly).
With the current failover process
Those files are not critical for a failover. With the current failover process, they are transferred to the new primary host. Given that Gerrit's compaction is not idempotent, I'm not sure we'd be able to use them if we just copied them without the rest of the data directory structure.
Wed, Dec 10
@Solenne_Lazare_WMDE Access has been deployed, please give it 30 minutes to an hour to propagate, and then test it and reopen if you have further requests or issues.
@KOfori could you approve my request?
@Solenne_Lazare_WMDE You have been added to the NDA and WMDE LDAP groups, which means you should already have login access to the apps, but not yet to private data (managing the access request and access to analytics_privatedata_users is next).
@WMDE-leszek May I ask for approval?
I've deployed it to bast1003, can you test?
Tue, Dec 9
With the above feedback and no issues reported since, I would consider this either resolved or invalid, but feel free to reopen if the issue still persists or you want global list admins to help you with anything.
Hi @Gnangarra, there is already a list called Wikidebate: https://lists.wikimedia.org/postorius/lists/wikidebate.lists.wikimedia.org/ It is mostly empty. Is your request for a completely different project, or are you requesting to take over that list as its owner?
Thank you!
We don't yet have the confirmation from Legal on file, waiting for that.
This looks resolved to me, please @Papaul reopen if something else is needed.
Apparently db2166 lagged again; disk performance spiked up on Saturday at 4am: https://grafana.wikimedia.org/goto/xITEqWMDg?orgId=1
Hi @KFrancis, requesting an NDA filing for the email shown in the header above for the given WMDE employee: solenne.lazare@wikimedia.de
Hi, @Lena_WMDE we don't have you on the list of approval managers for WMDE: https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMDE_Group , could you ask one of the people on this list to request to add you (so you can approve future requests)? A message here saying you can approve would be enough.
While we process your request, to speed up the current and future requests, could I ask you, @Solenne_Lazare_WMDE, to add your LDAP developer account to your Phabricator profile, here: https://phabricator.wikimedia.org/settings/user/Solenne_Lazare_WMDE/page/external/
Fri, Dec 5
I would like to mention in particular workflows like renewal/revocation of certificates on servers, paths of the private repo, etc. While in general we rely on "run the cookbook", it is precisely in edge/weird cases that we need to look at the documentation, and the commands from 5 -> 7 have subtly changed, enough to cause confusion. I would also suggest removing the documentation from so many places and replacing the copies outside of the Puppet page with links to a single central place.
Updating tags, as there is nothing for the broader team/clinic duty to do, please revert when unblocked.
Thu, Dec 4
There are more issues beyond just that, which we only learned about after years of maintenance (almost impossible upgrades).
Wed, Dec 3
Question: I don't know the details of how things are organized for resilience, but shouldn't the other host also have those files (and thus the backup would frequently be the same size) if reliability was the goal? I don't have a problem with the backup requiring more space, but if losing those files makes failover more difficult, they shouldn't be [only] on the local host, but elsewhere; otherwise the most common reason for a failover (host failure, maybe the second most common after service maintenance) will not work.
The dump grants, and thus the backups themselves, have been removed.
Tue, Dec 2
Adding backup tag for the backup side.
The reason I ask is because the other hosts' backups are very small in comparison:
Thu, Nov 27
Interestingly, if I do:
SDAddresses = {
    ipv4 = {
        addr = 0.0.0.0;
        port = 9103;
    }
    ipv6 = {
        addr = ::;
        port = 9103;
    }
}

(and removing SDPort, which is incompatible) seems to work:
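Why a single :: listener can cover both address families can be sketched outside of Bacula (a minimal Python demo of a dual-stack socket, assuming a Linux host with the IPv6 stack enabled; the ephemeral port is illustrative, the SD would use 9103):

```python
import socket

# Sketch (not Bacula itself): binding one IPv6 socket to "::" with
# IPV6_V6ONLY disabled accepts both native IPv6 and IPv4-mapped
# connections, which is what "listening on ::" buys us.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
srv.bind(("::", 0))  # port 0 = ephemeral, for the demo only
srv.listen(1)
port = srv.getsockname()[1]

# A plain IPv4 client reaches the same socket via the mapped address space.
cli = socket.create_connection(("127.0.0.1", port))
conn, peer = srv.accept()
print(peer[0])  # the IPv4 client shows up as an IPv4-mapped IPv6 address
cli.close(); conn.close(); srv.close()
```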
No idea, it is the first time I've seen this task. Do you want me to test a Bacula storage daemon listening on ::?
Wed, Nov 26
These are the top files by size:
cumin2024@db1213.eqiad.wmnet[bacula9]> select Name, lstat_size(LStat) FROM File JOIN Filename USING(FilenameId) where JobId=666670 ORDER BY lstat_size(LStat) DESC LIMIT 15;
+-----------------------------+-------------------+
| Name                        | lstat_size(LStat) |
+-----------------------------+-------------------+
| git_file_diff.h2.db         |        2925682688 |
| comment_context.h2.db       |        1925754880 |
| account_patch_reviews.h2.db |        1427044352 |
| gerrit_file_diff.h2.db      |        1319897088 |
| mergeability.h2.db          |        1156052992 |
| diff_summary.h2.db          |        1086640128 |
| diff_intraline.h2.db        |         797939712 |
| conflicts.h2.db             |         787589120 |
| web_sessions.h2.db          |         705048576 |
| change_kind.h2.db           |         637464576 |
| git_modified_files.h2.db    |         349911040 |
| modified_files.h2.db        |         238133248 |
| master                      |          69714340 |
| accounts.h2.db              |          62945280 |
| persisted_projects.h2.db    |          30328832 |
+-----------------------------+-------------------+
15 rows in set (0.071 sec)
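As a quick sanity check, the listed files alone account for the reported ~13 GB per hourly backup (a sketch; the sizes are the lstat_size values from the query above):

```python
# Sum the per-file sizes (bytes) reported by the Bacula catalog query
# to see how much of the hourly Gerrit backup they explain.
sizes = {
    "git_file_diff.h2.db": 2925682688,
    "comment_context.h2.db": 1925754880,
    "account_patch_reviews.h2.db": 1427044352,
    "gerrit_file_diff.h2.db": 1319897088,
    "mergeability.h2.db": 1156052992,
    "diff_summary.h2.db": 1086640128,
    "diff_intraline.h2.db": 797939712,
    "conflicts.h2.db": 787589120,
    "web_sessions.h2.db": 705048576,
    "change_kind.h2.db": 637464576,
    "git_modified_files.h2.db": 349911040,
    "modified_files.h2.db": 238133248,
    "master": 69714340,
    "accounts.h2.db": 62945280,
    "persisted_projects.h2.db": 30328832,
}
total = sum(sizes.values())
print(f"{total / 10**9:.1f} GB")  # ~13.5 GB, matching the hourly backup size
```

So nearly all of the hourly volume is the H2 cache files, not the git data itself.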
So is it unexpected?
gerrit1003.wikimedia.org is now backing up 13GB every hour. Is that normal?
I depooled it to avoid affecting mw performance, the rest of s8 looked ok at the time.
The only supported/working way is to stage the firmwares manually on the cumin nodes and use those :(
Tue, Nov 25
Happened to me again today.
Fri, Nov 21
I was asked by @Ladsgroup to create this task, I don't think it is high priority, but it was semi-related to his work at T410401.
Wed, Nov 19
Based on the spreadsheet, no more interruptions are expected on
Tue, Nov 18
Media backups processing on eqiad is stopped and the following hosts have been downtimed for 24 hours from now:
Mon, Nov 17
@Jclark-ctr Would it work for you if I stop backups tomorrow, Tuesday 18, before your day starts (e.g. before 11 am UTC / 6 am Eastern), so those hosts can be done at any time during your day (ideally with backups not stopped > 24 hours)?
Thu, Nov 13
Garage also doesn't support TLS/HTTPS by default; it requires a reverse proxy: https://garagehq.deuxfleurs.fr/documentation/cookbook/reverse-proxy/
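For reference, TLS termination in front of Garage could look something like the following (a minimal nginx sketch in the spirit of the cookbook page; the hostname, certificate paths, and upstream port 3900, Garage's default S3 API port, are illustrative assumptions):

```
server {
    listen 443 ssl;
    server_name garage.example.org;  # illustrative hostname

    ssl_certificate     /etc/ssl/certs/garage.example.org.pem;
    ssl_certificate_key /etc/ssl/private/garage.example.org.key;

    location / {
        # Garage's S3 API speaks plain HTTP; nginx terminates TLS in front.
        proxy_pass http://127.0.0.1:3900;
        proxy_set_header Host $host;
    }
}
```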
🤨
