The PR is at https://github.com/projectcalico/confd/pull/515, waiting for review now. It has been tested locally in a couple of BIRD containers doing a full mesh with each other.
Mon, May 17
Fri, May 14
Drive-by comments by yours truly:
The single-version image will be tagged with the train branch (e.g. wmf-1.37.0-wmf.4) and pushed to the registry, probably updating an existing tag.
Thu, May 13
Upstream calico issue at https://github.com/projectcalico/calico/issues/4607
Wed, May 12
Thu, May 6
Post-deployment, all 4 metrics (CPU/memory averages and maxes) look quite a bit better
Wed, May 5
Wed, Apr 28
Tue, Apr 27
Resolving. The child tasks are done; we are moving to decommissioning now
Servers switched over, fully functional now.
Done. All apps have been migrated and replication has been broken; rdb2009 and rdb2010 are now the canonical ones to use
Stalling until T255250 is completed.
Mon, Apr 26
Reading the answers (thanks!!!), I understand that
@AKhatun_WMF Access has been granted. I'll resolve this task; feel free to reopen it if problems arise.
The account has been added to the wmf ldap group; waiting for analytics-privatedata-users approval from Andrew before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/682601
@Ottomata, since access to analytics-privatedata-users is being requested, we need your approval on this task as well. Thanks!
Fri, Apr 23
Cool, done so, thanks!
- Simulate node failures and record/evaluate recovery times
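As an illustration of the kind of check I have in mind for that bullet (a minimal sketch, assuming kubectl access; the node name, namespace and flags are hypothetical, not the actual test plan):

```python
# Hedged sketch: drain a node to simulate a failure, then measure how long
# it takes until every pod is back in Running state elsewhere.
import json
import subprocess
import time

NODE = "kubernetes1001.example.wmnet"  # hypothetical node name

def pods_not_running(namespace="default"):
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, check=True, text=True,
    ).stdout
    pods = json.loads(out)["items"]
    return [p["metadata"]["name"] for p in pods
            if p["status"].get("phase") != "Running"]

start = time.monotonic()
subprocess.run(["kubectl", "drain", NODE, "--ignore-daemonsets", "--force"], check=True)

# Poll until everything has been rescheduled and is Running again.
while pods_not_running():
    time.sleep(5)

print(f"Recovery took {time.monotonic() - start:.0f}s")
subprocess.run(["kubectl", "uncordon", NODE], check=True)
```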
Hi Aisha. There is no such thing as "All" as far as groups go. Could you please clarify what exactly you are requesting access to?
@Silvan_WMDE, Hi! The change expanding your access has been merged. Give it 30m or so to fully propagate and try it out. I'll resolve the task, but feel free to reopen it if any issues arise. Thanks!
Wed, Apr 21
Specifically regarding https://noc.wikimedia.org/conf/fc-list, I've posted https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681665. It looks like cruft we probably should not be keeping around (happy to hear otherwise, though). So that may solve one of the bullet points of the task.
@matthiasmullie access has been granted. It will take ~30 minutes to fully propagate, but otherwise you are good on our end. I'll resolve this, but feel free to reopen if any issues arise. Thanks!
User added to wmde and nda ldap groups. @Manuel, I am resolving this task, feel free to reopen if any issues with your access arise.
Hi @Reedy, given the discussion in the task, do you reckon you still need racktables access? Or should we close this instead?
Any news on this?
@JLaytonWMF I am gonna tentatively resolve this task; it looks like the matter is out of SRE hands and MaxMind should be contacted directly at firstname.lastname@example.org. Feel free to reopen if we can somehow be of assistance.
wow, TIL. Thanks for that hint @ema.
So, a successful fetch per Safari, of 100 bytes per Content-Length. Interestingly, my tests are almost identical HTTP-header-wise: we share almost all headers, minus Date and a minor diff in x-cache (I got hit/213). And yet, from what I understand, the content is garbage/garbled or something similar. I am gonna add @ema and Traffic on this one. It brings back memories of T266373
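For reference, this is roughly how I compared (a minimal Python sketch; the URL and the set of headers printed are assumptions, not the exact request from the task):

```python
# Hedged repro sketch: fetch the resource, print the headers of interest,
# and check whether the body length actually matches Content-Length.
import requests

URL = "https://en.wikipedia.org/favicon.ico"  # hypothetical example URL

resp = requests.get(URL, headers={"User-Agent": "repro-check/0.1"})
print(resp.status_code)
for name in ("date", "content-type", "content-length", "x-cache"):
    print(f"{name}: {resp.headers.get(name)}")

body_len = len(resp.content)
declared = int(resp.headers.get("content-length", -1))
print(f"body bytes: {body_len}, declared: {declared}, match: {body_len == declared}")
```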
From what I gather, we are on board with removing it, so resolving this in favor of tracking the work in T280472. Feel free to reopen.
Triaging as low until we have an easy reproduction scenario.
Is this just Safari on iOS and macOS? This works for me (at least on one try) on:
Stalling for a couple of weeks per above comments
And with the merge and deploy of the above we got:
Tue, Apr 20
So, the crux of the issue is in the two functions below
Reopening. The bug that @jeena reported in T279100#7000270 is reproducible. It seems that safe-service-restart does not take the previous state (pooled/depooled) of a resource into account when verifying that everything is pooled.
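To illustrate what I mean, a minimal sketch of the behaviour I would expect (the function names are hypothetical, not the actual safe-service-restart/conftool API): record the pre-restart pool state and verify against that, instead of assuming everything should end up pooled.

```python
# Hedged sketch, not the real implementation: only repool resources that
# were pooled before the restart, and verify against the recorded state.
def safe_restart(resources, restart_fn, get_state, set_state):
    # Record the pre-restart state so intentionally depooled resources
    # are neither repooled nor flagged as failures.
    previous = {r: get_state(r) for r in resources}

    for r in resources:
        if previous[r] == "pooled":
            set_state(r, "depooled")   # depool, restart, repool
            restart_fn(r)
            set_state(r, "pooled")
        else:
            restart_fn(r)              # leave depooled resources depooled

    # Verify against the recorded state, not "everything must be pooled".
    return all(get_state(r) == previous[r] for r in resources)
```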
For the "main" set of clusters, we have devised a plan to adopt upstream binaries but without relying on their repos. The Policy as well as the reasoning for that (the various versioning interdepency requirements we are forced to honor) is documented at https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Infrastructure_upgrade_policy#Using_existing_upstream_binaries and applies to kubernetes as well as calico or helm. The implementation is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/kubernetes/+/refs/heads/future/debian/get-kubernetes-release.sh and it essentially fetches the .tar.gz of the binaries and stuffs them into packages fit for our infrastructure.
Mon, Apr 19
Adopting the new functionality in networkpolicy resources has indeed created some tech debt. It's tech debt we took on deliberately while devoting resources to finalizing the migration away from the old way of maintaining those networkpolicies. Now that that's gone, I want to revisit it and deduplicate as much as possible.
One thing that I forgot to point out: given that the internal and external services have different audiences, it probably makes sense to also come up with different SLOs, as the requirements will differ.