
fsero (fsero)
User

User Details

User Since
Nov 5 2018, 9:56 AM (49 w, 2 d)
Availability
Available
LDAP User
Fsero
MediaWiki User
FSelles (WMF) [ Global Accounts ]

Recent Activity

Aug 5 2019

fsero moved T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] from Backlog to Doing on the serviceops board.
Aug 5 2019, 9:11 AM · User-fsero, serviceops, Prod-Kubernetes

Jul 29 2019

fsero closed T229051: Allow eventgate-analytics service to reach schema.svc.{eqiad,codfw}.wmnet:8190, a subtask of T201068: Modern Event Platform: Stream Intake Service, as Resolved.
Jul 29 2019, 8:52 AM · Analytics, Core Platform Team Legacy (Watching / External), Services (watching), Analytics-EventLogging, Event-Platform
fsero closed T229051: Allow eventgate-analytics service to reach schema.svc.{eqiad,codfw}.wmnet:8190 as Resolved.

merged and applied

Jul 29 2019, 8:52 AM · serviceops, Analytics, Event-Platform

Jul 26 2019

fsero created T229118: create a docker_registry_codfw swift container backup.
Jul 26 2019, 2:36 PM · Release-Engineering-Team-TODO, Operations, Wikimedia-Incident, serviceops
fsero created T229117: create swift container-to-container synchronization metrics.
Jul 26 2019, 2:34 PM · Release-Engineering-Team-TODO, Operations, Wikimedia-Incident, serviceops
fsero added a comment to T229073: Staging k8s ci namespace limitranges.

@thcipriani you can launch the pipeline again and it should work; however, a better fix is to change the limits in the blubber default values in the chart, since 1m is not realistic as a CPU minimum.

Jul 26 2019, 11:09 AM · Release Pipeline, serviceops
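
For context on why 1m is such a low floor: Kubernetes expresses CPU quantities in cores, and the `m` suffix means millicores, so `1m` is one thousandth of a core. The Python sketch below illustrates that conversion. The helper names `cpu_to_cores` and `below_minimum` and the sample values are illustrative assumptions; they are not taken from the chart or the LimitRange discussed in the task.

```python
def cpu_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "1m", "250m" or "0.5" to cores.

    Hypothetical helper for illustration; it only handles plain numbers
    and the millicore ("m") suffix.
    """
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # "250m" means 250 millicores, i.e. 0.25 of a core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def below_minimum(request: str, minimum: str) -> bool:
    """Return True when a CPU request is smaller than a namespace minimum."""
    return cpu_to_cores(request) < cpu_to_cores(minimum)
```

With an assumed namespace minimum of `100m`, `below_minimum("1m", "100m")` is true, which is the kind of mismatch a chart default of 1m would run into.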
fsero added a comment to T229073: Staging k8s ci namespace limitranges.

@thcipriani it is granular per namespace; you can submit a CR with changed values anytime. I will bump those values and refer to this Phab task so you can see how it is done.

Jul 26 2019, 9:19 AM · Release Pipeline, serviceops
fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

@greg thanks for following this. I would definitely like to have a retrospective about it, and there are some leftovers, such as creating Phab tasks.

Jul 26 2019, 9:17 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops

Jul 25 2019

fsero closed T228700: helmfile apply with values.yaml file change did not deploy new k8s pods as Resolved.
Jul 25 2019, 9:26 AM · Patch-For-Review, Analytics, serviceops, Event-Platform
fsero moved T228967: Set up PodSecurityPolicies in clusters from Backlog to Doing on the serviceops board.
Jul 25 2019, 9:26 AM · Patch-For-Review, User-fsero, serviceops, Prod-Kubernetes
fsero moved T228965: set up limitranges and resourcequotas to protect the cluster from resource abuse and starvation from Backlog to Doing on the serviceops board.
Jul 25 2019, 9:26 AM · User-fsero, serviceops, Prod-Kubernetes
fsero triaged T228965: set up limitranges and resourcequotas to protect the cluster from resource abuse and starvation as Normal priority.
Jul 25 2019, 9:26 AM · User-fsero, serviceops, Prod-Kubernetes
fsero triaged T228967: Set up PodSecurityPolicies in clusters as Normal priority.
Jul 25 2019, 9:26 AM · Patch-For-Review, User-fsero, serviceops, Prod-Kubernetes
fsero triaged T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] as High priority.
Jul 25 2019, 9:25 AM · User-fsero, serviceops, Prod-Kubernetes
fsero triaged T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] as High priority.
Jul 25 2019, 9:25 AM · User-fsero, serviceops, Prod-Kubernetes
fsero created T228967: Set up PodSecurityPolicies in clusters.
Jul 25 2019, 9:15 AM · Patch-For-Review, User-fsero, serviceops, Prod-Kubernetes
fsero created T228965: set up limitranges and resourcequotas to protect the cluster from resource abuse and starvation.
Jul 25 2019, 9:13 AM · User-fsero, serviceops, Prod-Kubernetes

Jul 24 2019

fsero reopened T209271: improve docker registry architecture, a subtask of T202504: Evaluate VMWare's Harbour as a docker registry, as Open.
Jul 24 2019, 8:39 AM · Kubernetes, Operations
fsero reopened T209271: improve docker registry architecture, a subtask of T212123: Kubernetes clusters roadmap, as Open.
Jul 24 2019, 8:39 AM · User-fsero, serviceops, Prod-Kubernetes
fsero reopened T209271: improve docker registry architecture as Open.
Jul 24 2019, 8:39 AM · User-fsero, Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes, Operations
fsero closed T209271: improve docker registry architecture, a subtask of T202504: Evaluate VMWare's Harbour as a docker registry, as Resolved.
Jul 24 2019, 8:38 AM · Kubernetes, Operations
fsero closed T209271: improve docker registry architecture, a subtask of T212123: Kubernetes clusters roadmap, as Resolved.
Jul 24 2019, 8:38 AM · User-fsero, serviceops, Prod-Kubernetes
fsero closed T209271: improve docker registry architecture as Resolved.
Jul 24 2019, 8:38 AM · User-fsero, Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes, Operations
fsero updated the task description for T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME].
Jul 24 2019, 8:36 AM · User-fsero, serviceops, Prod-Kubernetes
fsero updated the task description for T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME].
Jul 24 2019, 8:36 AM · User-fsero, serviceops, Prod-Kubernetes
fsero created T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME].
Jul 24 2019, 8:22 AM · User-fsero, serviceops, Prod-Kubernetes
fsero created T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME].
Jul 24 2019, 8:21 AM · User-fsero, serviceops, Prod-Kubernetes
fsero added a comment to T209271: improve docker registry architecture.

Keeping this task open, but we can mark iteration 1 as completed, with the exception of using envoy for proxying between redis instances. Right now, if the redis server goes down, the registry will go down because healthchecks will fail.

Jul 24 2019, 8:16 AM · User-fsero, Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes, Operations
fsero closed T215810: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry, a subtask of T215809: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching, as Resolved.
Jul 24 2019, 8:15 AM · User-fsero, serviceops, Prod-Kubernetes, Kubernetes, Operations
fsero closed T215810: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry as Resolved.

The package was built and uploaded a long time ago.

Jul 24 2019, 8:15 AM · Patch-For-Review, User-fsero, serviceops, Prod-Kubernetes, Kubernetes, Operations
fsero placed T215809: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching up for grabs.
Jul 24 2019, 8:15 AM · User-fsero, serviceops, Prod-Kubernetes, Kubernetes, Operations
fsero added a comment to T227570: docker registry swift replication is not replicating content between DCs.

As a result of this issue, registries in the passive DC (eqiad now) are set to read-only mode (they accept pulls but not pushes of new images).

Jul 24 2019, 8:14 AM · serviceops

Jul 23 2019

fsero closed T226814: Create termbox release for test.wikidata.org, a subtask of T212189: New Service Request: Wikidata Termbox SSR, as Resolved.
Jul 23 2019, 10:06 AM · Core Platform Team Legacy (Later), User-Addshore, serviceops, Services (next), Wikidata-Termbox, Wikidata, Service-deployment-requests, Operations
fsero closed T226814: Create termbox release for test.wikidata.org as Resolved.

This has been deployed via the DNS artifact previously discussed.

Jul 23 2019, 10:06 AM · Wikibase-Termbox-Iteration-20, Wikidata-Termbox-Iteration-19, serviceops
fsero added a comment to T228700: helmfile apply with values.yaml file change did not deploy new k8s pods.

the main issue is in notifying changes to the deployment object department, not in helmfile. helmfile is AFAICT working as intended.

Jul 23 2019, 6:40 AM · Patch-For-Review, Analytics, serviceops, Event-Platform

Jul 22 2019

fsero moved T228196: docker-registry: some layers has been corrupted due to deleting other swift containers from Active Situation to Follow-up/Actionables on the Wikimedia-Incident board.
Jul 22 2019, 11:24 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops

Jul 19 2019

fsero closed T227775: recreate staging cluster namespaces using helmfile as Resolved.
Jul 19 2019, 3:52 PM · serviceops
fsero updated the task description for T227775: recreate staging cluster namespaces using helmfile.
Jul 19 2019, 3:51 PM · serviceops
fsero closed T227633: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki as Resolved.
Jul 19 2019, 6:02 AM · Operations, SRE-Access-Requests
fsero added a comment to T227633: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki.

@kzimmerman @Mayakp.wiki done, feel free to reopen if you find any issues.

Jul 19 2019, 6:02 AM · Operations, SRE-Access-Requests
fsero triaged T228447: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen as Normal priority.
Jul 19 2019, 5:12 AM · Operations, SRE-Access-Requests
fsero added a comment to T228447: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen.

@cchen as stated in https://wikitech.wikimedia.org/wiki/Production_shell_access, we need your public SSH key. This key shouldn't be the same one you use to access Gerrit or WMCS.

Jul 19 2019, 5:12 AM · Operations, SRE-Access-Requests
fsero lowered the priority of T228196: docker-registry: some layers has been corrupted due to deleting other swift containers from High to Normal.
Jul 19 2019, 4:52 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

I did a complete pull of all images and tags in our running registry (results are in the attached file).

Jul 19 2019, 4:52 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
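
One way such a full pull could be enumerated is through the Docker Registry HTTP API v2, which exposes `/v2/_catalog` for repositories and `/v2/<name>/tags/list` for tags. The sketch below is only an assumption about how this might be scripted, not the procedure actually used in the task. The function names are hypothetical, pagination and authentication are ignored, and it assumes the registry exposes the catalog endpoint.

```python
import json
import urllib.request


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def list_tags(registry: str) -> dict:
    """Map each repository to its tags (pagination is ignored in this sketch)."""
    catalog = fetch_json(f"https://{registry}/v2/_catalog")
    tags_by_repo = {}
    for repo in catalog.get("repositories", []):
        data = fetch_json(f"https://{registry}/v2/{repo}/tags/list")
        # The registry may return null for a repository without tags.
        tags_by_repo[repo] = data.get("tags") or []
    return tags_by_repo


def image_refs(registry: str, tags_by_repo: dict) -> list:
    """Build sorted "registry/repo:tag" references, one per image to pull."""
    return [
        f"{registry}/{repo}:{tag}"
        for repo in sorted(tags_by_repo)
        for tag in sorted(tags_by_repo[repo])
    ]
```

Each reference returned by `image_refs` could then be passed to `docker pull`, recording any failure as a candidate corrupted image.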

Jul 18 2019

fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

This also fixes docker-registry.wikimedia.org/releng/composer-test-hhvm:0.2.6-s1 @Nikerabbit

Jul 18 2019, 9:07 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

I've uploaded the missing layers from a backup; it works for me now.

Jul 18 2019, 9:05 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops

Jul 17 2019

fsero closed T228191: Add accraze to deployment and deploy-service groups. , a subtask of T226416: Onboard Andy Craze -- Accounts and access, as Resolved.
Jul 17 2019, 2:31 PM · Scoring-platform-team (Current)
fsero closed T228191: Add accraze to deployment and deploy-service groups. as Resolved.
Jul 17 2019, 2:31 PM · SRE-Access-Requests, Operations, Scoring-platform-team
fsero added a comment to T228191: Add accraze to deployment and deploy-service groups. .

@Halfak thanks for the patch

Jul 17 2019, 2:31 PM · SRE-Access-Requests, Operations, Scoring-platform-team
fsero triaged T227633: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki as Normal priority.
Jul 17 2019, 2:27 PM · Operations, SRE-Access-Requests
fsero added a comment to T227633: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki.

As long as @RStallman-legalteam comes back with a positive result, the clinic duty person will move this forward (this week I am that person).

Jul 17 2019, 2:27 PM · Operations, SRE-Access-Requests
fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

It seems that container synchronization is broken and the swift container in eqiad doesn't hold the same data as the one in codfw. Swift is eventually consistent, so let's wait and see if the sync does its job over the weekend. If it doesn't get restored, the best action plan I can think of right now is:

Jul 17 2019, 11:38 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops

Jul 16 2019

fsero moved T228196: docker-registry: some layers has been corrupted due to deleting other swift containers from To Triage to Active Situation on the Wikimedia-Incident board.
Jul 16 2019, 11:19 PM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero lowered the priority of T228196: docker-registry: some layers has been corrupted due to deleting other swift containers from Unbreak Now! to Normal.
Jul 16 2019, 11:19 PM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

After rescuing blobs from the ms-fe2005 backup, pulling images seems to be fixed. I don't see any errors doing:

Jul 16 2019, 11:18 PM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

base images wikimedia-jessie and wikimedia-stretch and affected production images

Jul 16 2019, 8:23 PM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

list of affected images

Jul 16 2019, 8:22 PM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero triaged T228196: docker-registry: some layers has been corrupted due to deleting other swift containers as Unbreak Now! priority.
Jul 16 2019, 6:05 PM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero created T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.
Jul 16 2019, 6:05 PM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
fsero moved T227775: recreate staging cluster namespaces using helmfile from Backlog to Doing on the serviceops board.
Jul 16 2019, 2:34 PM · serviceops
fsero updated the task description for T227775: recreate staging cluster namespaces using helmfile.
Jul 16 2019, 2:33 PM · serviceops
fsero triaged T227775: recreate staging cluster namespaces using helmfile as Normal priority.
Jul 16 2019, 2:32 PM · serviceops
fsero closed T227570: docker registry swift replication is not replicating content between DCs as Resolved.

Uploaded a new image today (coredns) and rechecked as @fgiunchedi did, and it seems to be working \o/ so resolving this issue.

Jul 16 2019, 2:24 PM · serviceops

Jul 12 2019

fsero updated the task description for T227775: recreate staging cluster namespaces using helmfile.
Jul 12 2019, 12:00 PM · serviceops
fsero updated the task description for T227775: recreate staging cluster namespaces using helmfile.
Jul 12 2019, 11:59 AM · serviceops

Jul 11 2019

fsero claimed T227775: recreate staging cluster namespaces using helmfile.
Jul 11 2019, 1:31 PM · serviceops
fsero created T227775: recreate staging cluster namespaces using helmfile.
Jul 11 2019, 1:30 PM · serviceops
fsero added a comment to T227570: docker registry swift replication is not replicating content between DCs.

Thanks for the audit @fgiunchedi !

Jul 11 2019, 10:01 AM · serviceops

Jul 10 2019

fsero moved T227570: docker registry swift replication is not replicating content between DCs from Backlog to Doing on the serviceops board.
Jul 10 2019, 10:37 AM · serviceops
fsero claimed T227570: docker registry swift replication is not replicating content between DCs.
Jul 10 2019, 10:36 AM · serviceops
fsero triaged T227570: docker registry swift replication is not replicating content between DCs as High priority.
Jul 10 2019, 10:36 AM · serviceops
fsero closed T212130: Helm packages deployment tool, at least for cluster applications., a subtask of T212123: Kubernetes clusters roadmap, as Resolved.
Jul 10 2019, 10:35 AM · User-fsero, serviceops, Prod-Kubernetes
fsero closed T212130: Helm packages deployment tool, at least for cluster applications. as Resolved.
Jul 10 2019, 10:35 AM · Patch-For-Review, serviceops, Prod-Kubernetes

Jul 9 2019

fsero updated the task description for T227570: docker registry swift replication is not replicating content between DCs.
Jul 9 2019, 11:19 AM · serviceops
fsero renamed T227570: docker registry swift replication is not replicating content between DCs from docker registry swift replication is not replication content between DCs to docker registry swift replication is not replicating content between DCs.
Jul 9 2019, 9:52 AM · serviceops
fsero added a project to T227570: docker registry swift replication is not replicating content between DCs: serviceops.
Jul 9 2019, 9:51 AM · serviceops
fsero created T227570: docker registry swift replication is not replicating content between DCs.
Jul 9 2019, 9:50 AM · serviceops

Jul 5 2019

fsero added a comment to T212130: Helm packages deployment tool, at least for cluster applications..

After further testing, it seems that in order to use helmfile we need to set up some environment variables, i.e. HELM_HOME=/etc/helm KUBECONFIG=/etc/kubernetes/zotero-staging.config helmfile diff

Jul 5 2019, 1:04 PM · Patch-For-Review, serviceops, Prod-Kubernetes
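
The environment setup described in that comment can be wrapped in a small script so that the variables are never forgotten. The Python sketch below is an illustration under assumptions: the function names are hypothetical, the default paths are simply the ones quoted in the comment, and it presumes a `helmfile` binary is on the `PATH`.

```python
import os
import subprocess


def helmfile_env(helm_home: str, kubeconfig: str, base=None) -> dict:
    """Return a copy of the environment with HELM_HOME and KUBECONFIG set."""
    env = dict(os.environ if base is None else base)
    env["HELM_HOME"] = helm_home
    env["KUBECONFIG"] = kubeconfig
    return env


def helmfile_diff(
    helm_home: str = "/etc/helm",
    kubeconfig: str = "/etc/kubernetes/zotero-staging.config",
) -> int:
    """Run `helmfile diff` with the variables set and return its exit code."""
    result = subprocess.run(
        ["helmfile", "diff"],
        env=helmfile_env(helm_home, kubeconfig),
        check=False,
    )
    return result.returncode
```

Building the environment in a separate function keeps the caller's own environment untouched and makes the variable handling easy to check without running helmfile.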
fsero triaged T227198: Allow service-checker to run multiple domains for RESTBase as Normal priority.
Jul 5 2019, 7:05 AM · Core Platform Team (Needs Cleaning - Security, stability, performance, and scalability (TEC1)), serviceops
fsero triaged T226642: create a public docker-registry lvs endpoint for being used behind varnish as Normal priority.
Jul 5 2019, 7:04 AM · serviceops

Jul 3 2019

fsero added a comment to T212130: Helm packages deployment tool, at least for cluster applications..

Pending some documentation to help people migrate, this is essentially done.

Jul 3 2019, 4:52 PM · Patch-For-Review, serviceops, Prod-Kubernetes
fsero committed rDEPLOYCHARTSd4124cb03f90: helmfile: added codfw current values (authored by fsero).
helmfile: added codfw current values
Jul 3 2019, 4:05 PM
fsero committed rDEPLOYCHARTSa01e2c6333ca: helmfile: added current eqiad cluster values (authored by fsero).
helmfile: added current eqiad cluster values
Jul 3 2019, 2:14 PM
fsero committed rDEPLOYCHARTS078ab1ee2621: sync current staging values with stored values on repo (authored by fsero).
sync current staging values with stored values on repo
Jul 3 2019, 1:52 PM
fsero triaged T226516: deploy CoreDNS as a in-cluster DNS service as Normal priority.
Jul 3 2019, 8:19 AM · serviceops
fsero moved T226516: deploy CoreDNS as a in-cluster DNS service from Backlog to Next up on the serviceops board.
Jul 3 2019, 8:19 AM · serviceops

Jun 28 2019

fsero committed rDEPLOYCHARTS2712f053bc6a: introducing helmfile.d values for staging cluster (authored by fsero).
introducing helmfile.d values for staging cluster
Jun 28 2019, 4:44 PM

Jun 26 2019

fsero created T226642: create a public docker-registry lvs endpoint for being used behind varnish.
Jun 26 2019, 2:29 PM · serviceops

Jun 25 2019

fsero created T226516: deploy CoreDNS as a in-cluster DNS service.
Jun 25 2019, 2:47 PM · serviceops

Jun 21 2019

fsero claimed T212130: Helm packages deployment tool, at least for cluster applications..
Jun 21 2019, 10:47 AM · Patch-For-Review, serviceops, Prod-Kubernetes
fsero moved T37611: Remove port 29418 from cloning process from Backlog to Doing on the serviceops board.
Jun 21 2019, 10:47 AM · serviceops, Developer-Advocacy, Operations, Gerrit
fsero moved T212130: Helm packages deployment tool, at least for cluster applications. from Backlog to Doing on the serviceops board.
Jun 21 2019, 10:47 AM · Patch-For-Review, serviceops, Prod-Kubernetes
fsero added a comment to T220836: Guidelines for Rust/Go tools deployment.

+1 to what @Joe said. There are some challenges with that approach because there are Go projects and libraries that would require the very latest Go version, so it could include a prerequisite of packaging golang itself to be used as a build dependency.

Jun 21 2019, 10:09 AM · serviceops-radar, Packaging

Jun 20 2019

fsero added a comment to T220085: Getting registry metadata from a public client fails on our registry.

works for me using python 2.7 and docker==3.7.2

Jun 20 2019, 2:32 PM · Traffic, docker-pkg, Operations, serviceops
fsero moved T218812: Provide the ability to have time-delayed or time-offset jobs in the job queue from Backlog to Incoming on the serviceops board.
Jun 20 2019, 2:24 PM · Core Platform Team Legacy (Watching / External), serviceops-radar, TechCom-RFC, Analytics, ChangeProp, Event-Platform, WMF-JobQueue, Community-Tech
fsero moved T218342: Our docker base images lack tags from Backlog to Incoming on the serviceops board.
Jun 20 2019, 2:24 PM · Release-Engineering-Team, Release-Engineering-Team-TODO, serviceops
fsero moved T218217: Make services swagger specs standard compliant from Backlog to Incoming on the serviceops board.
Jun 20 2019, 2:24 PM · Core Platform Team, serviceops-radar, Product-Infrastructure-Team-Backlog, Proton, Graphoid, CX-cxserver, Citoid, Mathoid, Recommendation-API, Services (later), Mobile-Content-Service, RESTBase-API, Operations
fsero moved T211139: Convert Gerrit to use H2 as the database from Backlog to Incoming on the serviceops board.
Jun 20 2019, 2:23 PM · serviceops, Patch-For-Review, Operations, Gerrit
fsero moved T146055: Improve privilege separation for phabricator's config files and mysql credentials from Backlog to Incoming on the serviceops board.
Jun 20 2019, 2:23 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops, DBA, User-MModell, Security, Phabricator
fsero moved T212935: SRE FY2019 Q3 goal: Increase reach of deployment pipeline from Backlog to Goal tasks on the serviceops board.
Jun 20 2019, 2:23 PM · Operations, serviceops