Stevemunene (Stevemunene)
User

User Details

User Since
Nov 1 2022, 1:30 PM (46 w, 6 d)
Availability
Available
LDAP User
Stevemunene
MediaWiki User
SMunene-WMF

Recent Activity

Fri, Sep 22

Stevemunene closed T327884: Datahub user records are not being created after login as Resolved.

After the switch to OIDC we can confirm that new users can log in and their records are created.

Fri, Sep 22, 11:33 AM · Data-Platform-SRE
Stevemunene closed T305874: Switch DataHub authentication to OIDC, a subtask of T299910: Data Catalog MVP, as Resolved.
Fri, Sep 22, 11:17 AM · Data-Platform-SRE, Data-Engineering, Epic
Stevemunene closed T305874: Switch DataHub authentication to OIDC, a subtask of T305518: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn, as Resolved.
Fri, Sep 22, 11:17 AM · CAS-SSO, Infrastructure-Foundations, SRE
Stevemunene closed T305874: Switch DataHub authentication to OIDC, a subtask of T311999: Enable OIDC in CAS, as Resolved.
Fri, Sep 22, 11:17 AM · CAS-SSO, Infrastructure-Foundations, SRE
Stevemunene closed T305874: Switch DataHub authentication to OIDC, a subtask of T327884: Datahub user records are not being created after login, as Resolved.
Fri, Sep 22, 11:17 AM · Data-Platform-SRE
Stevemunene closed T305874: Switch DataHub authentication to OIDC as Resolved.

Closing this task as resolved, as we met our acceptance criteria. The login/logout user experience is being tracked on T347149.

Fri, Sep 22, 11:17 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene created T347149: [DataHub] Users are redirected to the wrong screen on logout and from certain urls..
Fri, Sep 22, 11:14 AM · Data-Platform-SRE
Stevemunene updated the task description for T336042: Bring druid10[09-11] into service.
Fri, Sep 22, 11:06 AM · Patch-For-Review, Data-Platform-SRE
Stevemunene updated subscribers of T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team..

Hi @odimitrijevic ,
Requesting approval for adding the analytics-wmde user to the analytics-privatedata-users group for T340648.

Fri, Sep 22, 9:31 AM · Patch-For-Review, Data-Platform-SRE, SRE, SRE-Access-Requests

Wed, Sep 20

Stevemunene added a comment to T336042: Bring druid10[09-11] into service.

I think puppet will set up the load-balancing automatically (by virtue of the profile::lvs::realserver being applied) but it may be necessary to notify pybal of the change.
See: https://wikitech.wikimedia.org/wiki/LVS#Deploy_a_change_to_an_existing_service for more information.

I would reach out to the traffic team to verify whether this is required and if so, when would be a good time to do it.
For this reason alone, I would probably do each of the druid servers individually, as you have started to do in https://gerrit.wikimedia.org/r/959147

I think we need to add them here, which from previous patches seems to be done separately: https://github.com/wikimedia/operations-puppet/blob/production/conftool-data/node/eqiad.yaml#L394-L399

Wed, Sep 20, 11:02 AM · Patch-For-Review, Data-Platform-SRE
Stevemunene added a comment to T336042: Bring druid10[09-11] into service.

From the previous tickets, the steps are roughly

  • Create Keytabs
  • Add the hosts to the role(druid::public::worker)
    • druid1009
    • druid1010
    • druid1011
Wed, Sep 20, 4:30 AM · Patch-For-Review, Data-Platform-SRE

Tue, Sep 19

Stevemunene closed T309382: DataHub rights assignment is case-sensitive as Resolved.

This was resolved by the switch to OIDC, marking it as resolved.

Tue, Sep 19, 2:50 PM · Data-Platform-SRE, Data-Catalog
Stevemunene moved T305874: Switch DataHub authentication to OIDC from Done to In Progress on the Data-Platform-SRE board.

We successfully implemented OIDC on production datahub and auth/login seems to be working great.
However, there are some challenges with the user journey during login and logout that are specific to datahub. Due to the change, we see a couple of different login screens depending on where and how we log in: the expected SSO login page and the previous JAAS login page.
Logging in via http://datahub.wikimedia.org/login takes us to the previous user login page, i.e.:

image.png (1×1 px, 75 KB)
The "login with SSO" button works as expected, and login via the username and password fields is disabled, also as expected. However, this is not ideal and the user journey is a bit confusing.
The native login interface is disabled as implemented, following the datahub configure-oidc-react documentation: "Note that by default, enabling OIDC will not disable the dummy JAAS authentication path, which can be reached at the /login route of the React app. To disable this authentication path, additionally specify the following config: AUTH_JAAS_ENABLED=false".

Tue, Sep 19, 2:30 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene moved T343520: decommission an-test-client1001.eqiad.wmnet from In Progress to Done on the Data-Platform-SRE board.
Tue, Sep 19, 2:03 PM · Data-Platform-SRE, decommission-hardware
Stevemunene updated the task description for T343520: decommission an-test-client1001.eqiad.wmnet.
Tue, Sep 19, 2:02 PM · Data-Platform-SRE, decommission-hardware
Stevemunene updated the task description for T343520: decommission an-test-client1001.eqiad.wmnet.
Tue, Sep 19, 1:42 PM · Data-Platform-SRE, decommission-hardware
BTullis awarded T332570: Upgrade hadoop workers to bullseye a Cup of Joe token.
Tue, Sep 19, 10:57 AM · Data-Platform-SRE
Stevemunene moved T332570: Upgrade hadoop workers to bullseye from In Progress to Done on the Data-Platform-SRE board.

We have successfully completed the hadoop worker upgrades to Bullseye.

Tue, Sep 19, 10:56 AM · Data-Platform-SRE
Stevemunene claimed T336042: Bring druid10[09-11] into service.
Tue, Sep 19, 9:13 AM · Patch-For-Review, Data-Platform-SRE

Mon, Sep 18

Stevemunene moved T305874: Switch DataHub authentication to OIDC from In Progress to Done on the Data-Platform-SRE board.
Mon, Sep 18, 4:00 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene moved T309382: DataHub rights assignment is case-sensitive from Blocked / Waiting to In Progress on the Data-Platform-SRE board.
Mon, Sep 18, 1:31 PM · Data-Platform-SRE, Data-Catalog
Stevemunene claimed T309382: DataHub rights assignment is case-sensitive.
Mon, Sep 18, 1:31 PM · Data-Platform-SRE, Data-Catalog
Stevemunene moved T327884: Datahub user records are not being created after login from Blocked / Waiting to In Progress on the Data-Platform-SRE board.
Mon, Sep 18, 1:30 PM · Data-Platform-SRE
Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1140 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181210_stevemunene_3044066_an-worker1140.out
    • The reimage failed, see the cookbook logs for the details
Mon, Sep 18, 1:28 PM · Data-Platform-SRE
Stevemunene updated the task description for T305874: Switch DataHub authentication to OIDC.
Mon, Sep 18, 11:42 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations

Fri, Sep 15

Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

Following the install via IPMI with ipmitool -I lanplus -H "an-worker1138.mgmt.eqiad.wmnet" -U root -E sol activate

Fri, Sep 15, 11:16 AM · Data-Platform-SRE
Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

I managed to get access to the instance via regular ssh and confirmed that the right volumes exist.

Fri, Sep 15, 10:53 AM · Data-Platform-SRE
Stevemunene moved T343520: decommission an-test-client1001.eqiad.wmnet from Ready for Work to In Progress on the Data-Platform-SRE board.
Fri, Sep 15, 7:58 AM · Data-Platform-SRE, decommission-hardware
Stevemunene added a subtask for T329363: Upgrade Hadoop test cluster to Bullseye: T343520: decommission an-test-client1001.eqiad.wmnet.
Fri, Sep 15, 7:57 AM · Data-Platform-SRE, Patch-For-Review
Stevemunene added a parent task for T343520: decommission an-test-client1001.eqiad.wmnet: T329363: Upgrade Hadoop test cluster to Bullseye.
Fri, Sep 15, 7:57 AM · Data-Platform-SRE, decommission-hardware

Thu, Sep 14

Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

an-worker1138 is currently facing an error

image.png (1×2 px, 338 KB)

Did a powercycle in order to access the terminal, however the host does not accept the root pw.
My first thought was to check the partitions, based on the previous host's experience and as per Standard_Worker_Installation, but the host is still inaccessible.

Thu, Sep 14, 2:34 PM · Data-Platform-SRE

Thu, Sep 7

Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

@BTullis With the upcoming elevation of the analytics-wmde user to a systemwide user across nodes (airflow, stat100x, hadoop worker nodes, etc..) and membership of analytics-privatedata-users, I'm considering removing access to analytics-wmde for the general analytics-wmde-users group and having only the airflow-wmde-admins with access to the user. This shouldn't affect much since the user was only on stat1007.

Thu, Sep 7, 2:06 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

Seeing some HDFS corrupt blocks from 2023-09-07 10:03 UTC on grafana.
Did a quick check on the master nodes, which show 0 corrupt files.

Thu, Sep 7, 10:22 AM · Data-Platform-SRE
Stevemunene updated the task description for T340648: [Airflow] Setup Airflow instance for WMDE.
Thu, Sep 7, 8:06 AM · Patch-For-Review, Data-Platform-SRE

Wed, Sep 6

Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

Thank you for your response @Manuel. We shall be moving forward with the analytics-wmde user, and I have sent out the access request for this. Corresponding patches to follow.

Wed, Sep 6, 12:07 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene added a subtask for T340648: [Airflow] Setup Airflow instance for WMDE: T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team..
Wed, Sep 6, 12:01 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene added a subtask for T342331: [EPIC] Set up a sustainable tech stack for Wikidata Analytics: T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team..
Wed, Sep 6, 12:01 PM · Wikidata, Epic, Wikidata Analytics (Kanban)
Stevemunene added parent tasks for T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team.: T340648: [Airflow] Setup Airflow instance for WMDE, T342331: [EPIC] Set up a sustainable tech stack for Wikidata Analytics.
Wed, Sep 6, 12:01 PM · Patch-For-Review, Data-Platform-SRE, SRE, SRE-Access-Requests
Stevemunene renamed T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. from Requesting access to RESOURCE for USER[S] to Requesting Creation of a new POSIX group and system user for the Analytics WMDE team..
Wed, Sep 6, 12:01 PM · Patch-For-Review, Data-Platform-SRE, SRE, SRE-Access-Requests
Stevemunene created T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team..
Wed, Sep 6, 11:44 AM · Patch-For-Review, Data-Platform-SRE, SRE, SRE-Access-Requests

Tue, Sep 5

Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

Hi folks! Yes I'd follow what we did for analytics-product etc.. since we'll create the same system user (uid/gid) across nodes (airflow, stat100x, hadoop worker nodes, etc..). You can reserve a uid/gid combination in puppet admin's data.yaml file, and add the related system user to the analytics-privatedata-users group (as the others).

The only follow up that I can think of is that on stat1007, where analytics-wmde is already present IIRC, we'll have almost surely a different uid/gid, so some follow up (chmod -R etc..) will be needed.
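The follow-up described above can be planned with a dry run before anything is changed. The sketch below is only an illustration: `paths_owned_by` and `plan_reown` are hypothetical helper names, the uid/gid values are placeholders, and nothing here is taken from the actual puppet change. Note that changing ownership is normally done with chown rather than chmod.

```python
import os
import tempfile


def paths_owned_by(root, uid):
    """Return every path under root (root included) owned by the numeric uid."""
    matches = []
    if os.lstat(root).st_uid == uid:
        matches.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects symlinks themselves instead of following them.
            if os.lstat(path).st_uid == uid:
                matches.append(path)
    return matches


def plan_reown(root, old_uid, new_uid, new_gid):
    """Dry run: list (path, new_uid, new_gid) changes without applying them."""
    return [(path, new_uid, new_gid) for path in paths_owned_by(root, old_uid)]


# Demonstration on a throwaway directory owned by the current user.
# 12345 is a placeholder, not a reserved uid/gid from data.yaml.
with tempfile.TemporaryDirectory() as demo_root:
    open(os.path.join(demo_root, "report.txt"), "w").close()
    for entry in plan_reown(demo_root, os.getuid(), new_uid=12345, new_gid=12345):
        print(entry)
```

Keeping this as a dry run makes it easy to review the affected paths on the host before applying the real ownership change as root.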

Tue, Sep 5, 1:02 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

an-worker1132 seems to be stuck on the Debian install, as seen below. Power cycling the server and retrying the reimage.

image.png (1×3 px, 197 KB)

Tue, Sep 5, 12:34 PM · Data-Platform-SRE

Mon, Sep 4

Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1129 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309040929_stevemunene_2233697_an-worker1129.out
    • The reimage failed, see the cookbook logs for the details
Mon, Sep 4, 10:08 AM · Data-Platform-SRE

Fri, Sep 1

Stevemunene updated the task description for T343762: Bring Hadoop workers an-worker11[49-56] into service.
Fri, Sep 1, 12:49 PM · Data-Platform-SRE
Stevemunene added a comment to T344808: Investigate an-presto1002 failures.

an-presto1002 was showing similar memory utilisation errors on 2023-09-01 with the latest one at time of writing seen here

image.png (2×1 px, 305 KB)
with a peak usage on 2023-09-01 12:16 UTC that led to the system failure, which then raised the (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 alert. The server was initialised again as seen:

Fri, Sep 1, 12:36 PM · Data-Platform-SRE
Stevemunene triaged T345413: an-worker1145: soft lockup. as Medium priority.

The host seems to be back in service

image.png (470×2 px, 103 KB)
However, leaving this open in case it reappears within the day and for further conversations on the host.

Fri, Sep 1, 8:00 AM · Data-Platform-SRE
Stevemunene updated the task description for T345413: an-worker1145: soft lockup..
Fri, Sep 1, 7:31 AM · Data-Platform-SRE
Stevemunene created T345413: an-worker1145: soft lockup..
Fri, Sep 1, 7:28 AM · Data-Platform-SRE

Thu, Aug 31

Stevemunene updated the task description for T305874: Switch DataHub authentication to OIDC.
Thu, Aug 31, 7:55 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene updated the task description for T305874: Switch DataHub authentication to OIDC.
Thu, Aug 31, 7:50 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene updated subscribers of T340648: [Airflow] Setup Airflow instance for WMDE.

From the Create WMDE airflow admin group review, the airflow-wmde-admins group requires a system user in order to perform the "admin tasks" for the airflow instance.
Our current user analytics-wmde is not a system user, since the user was originally created by statistics::wmde, a class for running WMDE related statistics & analytics scripts on a statsd host.
The user is currently made available on the stat host via the profile profile::statistics::explorer::misc_jobs, along with the other scripts and jobs required for WMDE related statistics & analytics.

We are currently working through the procedures to add analytics-wmde as a system user, or to use a different one, considering that all the airflow system users and those who can access them are members of analytics_privatedata_users, as documented here. Given that Andrew and Manuel are already members, we would likely only need to add Kara and then proceed with the right approvals.

Thu, Aug 31, 5:06 AM · Patch-For-Review, Data-Platform-SRE

Tue, Aug 29

Stevemunene updated the task description for T305874: Switch DataHub authentication to OIDC.
Tue, Aug 29, 7:58 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

You can try switching the format to "FLAT" as with Gitlab, that might help datahub locate the attributes

Example from the IDP configuration for Gitlab:

gitlab_oidc:
  id: 31
  service_class: 'OidcRegisteredService'
  service_id: 'https://gitlab\.wikimedia\.org(/.*)?'
  profile_format: 'FLAT'

Thanks @SLyngshede-WMF, great insight. Trying this out, since the attribute preferred_username is already available and, from the docs at https://apereo.github.io/cas/6.6.x/authentication/OAuth-Authentication-UserProfiles.html#user-profiles---oauth-authentication , should be visible once we change to FLAT.

Tue, Aug 29, 7:45 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

One issue we ran into with Gitlab also involved Gitlab not being able to locate OIDC attributes. This was as a result of how CAS returns the attributes. By default CAS will return the attributes in a nested format, which almost nothing expects.

You can try switching the format to "FLAT" as with Gitlab, that might help datahub locate the attributes

Example from the IDP configuration for Gitlab:

gitlab_oidc:
  id: 31
  service_class: 'OidcRegisteredService'
  service_id: 'https://gitlab\.wikimedia\.org(/.*)?'
  profile_format: 'FLAT'
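
To illustrate why a nested profile can hide an attribute from a client, here is a minimal Python sketch. It is not CAS or DataHub code: `find_claim` and `flatten_profile` are hypothetical helpers, the sample values are made up, and the assumption is that the client only looks for claims at the top level of the profile while CAS nests them under an `attributes` key.

```python
def find_claim(profile, claim):
    """Look up a claim only at the top level, as a strict client might."""
    return profile.get(claim)


def flatten_profile(profile, nested_key="attributes"):
    """Return a copy of the profile with nested attributes lifted to the top."""
    flat = {k: v for k, v in profile.items() if k != nested_key}
    # Existing top-level keys win on a name clash (arbitrary choice for this sketch).
    for key, value in profile.get(nested_key, {}).items():
        flat.setdefault(key, value)
    return flat


# Made-up profile in the nested shape: the claim exists, but one level down.
nested = {
    "sub": "exampleuser",
    "attributes": {
        "preferred_username": "exampleuser",
        "email": "user@example.org",
    },
}

print(find_claim(nested, "preferred_username"))                   # None
print(find_claim(flatten_profile(nested), "preferred_username"))  # exampleuser
```

In practice the flattening would be done by the identity provider through the profile_format setting rather than by the client.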
Tue, Aug 29, 7:07 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

an-worker1117 is stuck at install with the error "no root filesystem is defined". Looking into this.

Tue, Aug 29, 6:10 AM · Data-Platform-SRE
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

Expanding/adding the AUTH_OIDC_SCOPE doesn't seem to have had much impact on the SSO process; we are still getting the same error.

Tue, Aug 29, 4:39 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations

Aug 24 2023

Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

Adding the AUTH_OIDC_PREFERRED_JWS_ALGORITHM worked and resolved the unsigned token error we had. Datahub can now receive the token from the idp and successfully read it.
However, we still cannot log in due to this error (some details and tokens redacted):

2023-08-24 16:34:01,040 [application-akka.actor.default-dispatcher-9] ERROR controllers.SsoCallbackController - Caught exception while attempting to handle SSO callback! It's likely that SSO integration is mis-configured.
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to resolve user name claim from profile provided by Identity Provider. Missing attribute. Attribute: 'preferred_username', Regex: '(.*)', Profile: {at_hash=w9dGLDsaQZn5Z3LCEBu4ew, sub=Stevemunene, amr=["LdapAuthenticationHandler"], id_token=very long redacted token, iss=https://idp-test.wikimedia.org/oidc, client_id=datahub_staging, sid=redacted, access_token=AT-10-redated, token_expiration_advance=-1, aud=[datahub_staging], nbf=Thu Aug 24 16:29:00 UTC 2023, service=https://datahub-frontend.k8s-staging.discovery.wmnet/callback/oidc, auth_time=Thu Aug 24 16:33:58 UTC 2023, expiration=Fri Aug 25 00:34:00 UTC 2023, attributes={"name":"Stevemunene","preferred_username":"stevemunene","email":"myemail@wikimedia.org"}, id=Stevemunene, state=e0701001d7, exp=Fri Aug 25 00:34:00 UTC 2023, iat=Thu Aug 24 16:34:00 UTC 2023, jti=TGT-9-fJmWQ1qgEBv6XELZRzreS9U6SHHBenm60AK-l8gVf3XcLW4xQDzf9pmuqumDPXL09Co-idp-test1002}

Here we see "Failed to resolve user name claim from profile provided by Identity Provider. Missing attribute. Attribute: 'preferred_username'": despite preferred_username being available in the token, it is not visible to the datahub login service. This is the data that will be part of the authenticated user's profile/details or similar.
We could solve this by

Aug 24 2023, 5:28 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

an-worker1117 is stuck at install with the error "no root filesystem is defined". Looking into this.

image.png (1×1 px, 488 KB)

Aug 24 2023, 11:41 AM · Data-Platform-SRE
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

In case it helps, I did a little digging into the CAS logs on idp-test1002 and stumbled upon this.

root@idp-test1002:/var/log/cas# grep ERROR cas.log 

2023-08-23 09:27:29,477 ERROR [org.apereo.cas.authentication.principal.WebApplicationServiceFactory] - <Unable to extract query parameters from [https://datahub-frontend\.k8s-staging\.discovery\.wmnet(/.*)?]: [java.net.URISyntaxException: Illegal character in authority at index 8: https://datahub-frontend\.k8s-staging\.discovery\.wmnet(/.*)?]>
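
The log line above shows the service_id value, which is a regular expression, being handed to a URI parser. A minimal Python sketch of the distinction follows. The `service_matches` helper is hypothetical and only illustrates how such a pattern is meant to be used (matched against a real URL); it does not reproduce how CAS evaluates registered services.

```python
import re

# The service_id pattern and callback URL as they appear in this task's logs.
SERVICE_ID = r"https://datahub-frontend\.k8s-staging\.discovery\.wmnet(/.*)?"
CALLBACK = "https://datahub-frontend.k8s-staging.discovery.wmnet/callback/oidc"


def service_matches(pattern, url):
    """Treat the service id as a regex and test it against a concrete URL."""
    return re.fullmatch(pattern, url) is not None


print(service_matches(SERVICE_ID, CALLBACK))                             # True
print(service_matches(SERVICE_ID, "https://example.org/callback/oidc"))  # False
# The pattern itself contains backslashes, so it is not a valid URL on its own.
print("\\" in SERVICE_ID)                                                # True
```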

Thanks @BTullis , I am looking into this.

Aug 24 2023, 9:51 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations

Aug 23 2023

Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

Found some info on what we might be missing.
We have so far verified that authentication on the IDP side is okay and that we do receive a signed ID token. The challenge lies in how datahub validates the token it receives: it must use the same ID token JWS algorithm that the IDP used to sign it.
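
One way to check which algorithm a token was signed with is to read the `alg` field of its header. The following Python sketch is a diagnostic aid only, under stated assumptions: `jws_algorithm` and `algorithm_matches` are hypothetical helpers, the sample token is built locally with a dummy signature, and the RS512/RS256 values are placeholders rather than the algorithms actually configured here. It reads the header without verifying the signature.

```python
import base64
import json


def jws_algorithm(token):
    """Return the 'alg' value from the unverified header of a compact JWS."""
    header_b64 = token.split(".")[0]
    # base64url segments omit padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    header = json.loads(base64.urlsafe_b64decode(padded))
    return header.get("alg")


def algorithm_matches(token, expected_alg):
    """Compare the token's declared algorithm with the one the client expects."""
    return jws_algorithm(token) == expected_alg


def _b64url(data):
    """Encode a dict as an unpadded base64url JSON segment (demo helper)."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Dummy token for demonstration; the third segment is not a real signature.
sample_token = ".".join(
    [_b64url({"alg": "RS512", "typ": "JWT"}), _b64url({"sub": "exampleuser"}), "sig"]
)

print(jws_algorithm(sample_token))               # RS512
print(algorithm_matches(sample_token, "RS256"))  # False
```

A mismatch like the one printed above is the kind of situation where the client-side algorithm setting would need to be aligned with the identity provider.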

Aug 23 2023, 2:47 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

Added the AUTH_OIDC_CLIENT_AUTHENTICATION_METHOD setting and retested. The idp seems okay with everything: my user is authenticated and provided with a token.

2023-08-23 11:35:37,980 INFO [org.apereo.inspektr.audit.support.Slf4jLoggingAuditTrailManager] - <Audit trail record BEGIN
=============================================================
WHO: stevemunene
WHAT: {access_token=AT-4-********qObqPBK38z-o01q9XsTu, scope=email openid profile, id_token=********..., token_type=Bearer, expires_in=28800}
ACTION: OAUTH2_ACCESS_TOKEN_RESPONSE_CREATED
APPLICATION: CAS
WHEN: Wed Aug 23 11:35:37 UTC 2023
CLIENT IP ADDRESS: *redacted*
SERVER IP ADDRESS: 127.0.0.1

However, I am still getting the same error from datahub.

Aug 23 2023, 11:47 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations

Aug 22 2023

Stevemunene updated the task description for T305874: Switch DataHub authentication to OIDC.
Aug 22 2023, 3:37 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

Was able to get the deployment to staging done. Login redirected to the right SSO page and I was able to enter my login details; however, authentication failed with this from the logs.

Aug 22 2023, 2:40 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

Got some errors from the first test, but they're mostly related to the current setup. Looking into this.

Aug 22 2023, 1:06 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

While reimaging an-worker1117.eqiad.wmnet we found that the server did not reimage with the right mountpoints, resulting in this puppet error:

Error while evaluating a Function Call:
Number of datanode mountpoints (0) below threshold: 10, please check.
…in /etc/puppet/modules/profile/manifests/hadoop/common.pp, line: 389, column: 9.

Using findmnt we can confirm that the required datanode mountpoints are indeed unavailable.
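
A check of this kind can be sketched in Python as below. This is not the logic of the Puppet function in common.pp: the helper names, the `/srv/example-datanode/` prefix, the sample mount table, and the "fewer than threshold" comparison are all assumptions made for illustration.

```python
def count_datanode_mounts(mounts_text, prefix):
    """Count mount points that live under the given directory prefix."""
    count = 0
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts layout: device, mount point, fs type, options, dump, pass
        if len(fields) >= 2 and fields[1].startswith(prefix):
            count += 1
    return count


def check_mounts(mounts_text, prefix, threshold):
    """Return a status message, failing when too few mount points are found."""
    found = count_datanode_mounts(mounts_text, prefix)
    if found < threshold:
        return (
            f"Number of datanode mountpoints ({found}) "
            f"below threshold: {threshold}, please check."
        )
    return f"OK: {found} datanode mountpoints found."


# Made-up mount table in /proc/mounts format; the prefix is hypothetical.
sample = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /srv/example-datanode/b ext4 rw,noatime 0 0
/dev/sdc1 /srv/example-datanode/c ext4 rw,noatime 0 0
"""

print(check_mounts(sample, "/srv/example-datanode/", threshold=10))
```

On a real host the input could come from reading /proc/mounts, which would give the same view that findmnt summarises.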

Aug 22 2023, 12:01 PM · Data-Platform-SRE

Aug 21 2023

Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

Picking this up from an-worker1108

Aug 21 2023, 7:58 AM · Data-Platform-SRE

Aug 17 2023

Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

From the Create WMDE airflow admin group review, the airflow-wmde-admins group requires a system user in order to perform the "admin tasks" for the airflow instance.
Our current user analytics-wmde is not a system user, since the user was originally created by statistics::wmde, a class for running WMDE related statistics & analytics scripts on a statsd host.
The user is currently made available on the stat host via the profile profile::statistics::explorer::misc_jobs, along with the other scripts and jobs required for WMDE related statistics & analytics.

Aug 17 2023, 3:09 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

These are the values added for our initial idp test and a brief explanation on each.

Aug 17 2023, 9:35 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations

Aug 15 2023

Stevemunene moved T343520: decommission an-test-client1001.eqiad.wmnet from Incoming to Ready for Work on the Data-Platform-SRE board.
Aug 15 2023, 3:47 PM · Data-Platform-SRE, decommission-hardware
Stevemunene moved T305874: Switch DataHub authentication to OIDC from Blocked / Waiting to In Progress on the Data-Platform-SRE board.

Actively working on this, thus moving it back in progress as we plan on implementing the solutions defined on https://phabricator.wikimedia.org/T343236#9079448

Aug 15 2023, 12:16 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations
Stevemunene updated the task description for T340648: [Airflow] Setup Airflow instance for WMDE.
Aug 15 2023, 11:24 AM · Patch-For-Review, Data-Platform-SRE
Stevemunene moved T342546: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE from In Progress to Done on the Data-Platform-SRE board.

The changes have been merged and @karapayneWMDE now has shell access and is a member of analytics-wmde-users.

Aug 15 2023, 10:00 AM · Data-Platform-SRE, SRE, SRE-Access-Requests

Aug 14 2023

Stevemunene updated the task description for T340648: [Airflow] Setup Airflow instance for WMDE.
Aug 14 2023, 2:23 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene committed rLPRIc148007bea7c: Add dummy keytabs for new an-airflow1007 (authored by Stevemunene).
Add dummy keytabs for new an-airflow1007
Aug 14 2023, 2:06 PM
Stevemunene committed rLPRI9b50500ffcf3: Dummy db for new wmde airflow (authored by Stevemunene).
Dummy db for new wmde airflow
Aug 14 2023, 2:06 PM
Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

We are unblocked on T342546. Working to merge the tasks listed as in progress and as ready to merge on the ticket.

Aug 14 2023, 1:57 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene updated the task description for T340648: [Airflow] Setup Airflow instance for WMDE.
Aug 14 2023, 1:55 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene closed T342424: eqiad: 1 VM request for WMDE Airflow, a subtask of T340648: [Airflow] Setup Airflow instance for WMDE, as Resolved.
Aug 14 2023, 1:51 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene closed T342424: eqiad: 1 VM request for WMDE Airflow as Resolved.
Aug 14 2023, 1:51 PM · Data-Platform-SRE, vm-requests, Infrastructure-Foundations, SRE
Stevemunene moved T342424: eqiad: 1 VM request for WMDE Airflow from In Progress to Done on the Data-Platform-SRE board.
Aug 14 2023, 1:50 PM · Data-Platform-SRE, vm-requests, Infrastructure-Foundations, SRE
Stevemunene added a comment to T342424: eqiad: 1 VM request for WMDE Airflow.

created the vm with
sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --network analytics --os buster --cluster eqiad --group B an-airflow1007
makevm and reimage succeeded with

Aug 14 2023, 1:50 PM · Data-Platform-SRE, vm-requests, Infrastructure-Foundations, SRE
Stevemunene updated the task description for T342546: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE.
Aug 14 2023, 1:47 PM · Data-Platform-SRE, SRE, SRE-Access-Requests
Stevemunene updated the task description for T342546: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE.
Aug 14 2023, 12:35 PM · Data-Platform-SRE, SRE, SRE-Access-Requests
Stevemunene moved T342546: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE from Blocked / Waiting to In Progress on the Data-Platform-SRE board.

hello, apologies for the delay (was on holiday)

public key is: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIN2rcD7HPKDRY5qREx2jQ7ls4McLYYCEiWS94r/GeF70 kapa@C363

Aug 14 2023, 12:35 PM · Data-Platform-SRE, SRE, SRE-Access-Requests
Stevemunene moved T342424: eqiad: 1 VM request for WMDE Airflow from Incoming to In Progress on the Data-Platform-SRE board.

Verifying the cluster availability and resources via

Aug 14 2023, 8:54 AM · Data-Platform-SRE, vm-requests, Infrastructure-Foundations, SRE
Stevemunene claimed T342424: eqiad: 1 VM request for WMDE Airflow.
Aug 14 2023, 8:33 AM · Data-Platform-SRE, vm-requests, Infrastructure-Foundations, SRE

Aug 10 2023

Stevemunene added a comment to T343236: Get datahub-staging.wikimedia.org working with the staging deployment of datahub.

Currently I would still oppose making services on wikikube-staging publicly accessible, as that is blurring a line we have right now. Also the intention of staging was/is to validate k8s deployments, not to do full fledged integration testing.

OK, I understand your concern. We'll find another way.

Maybe what we should do is:

  • Move datahub to the dse-k8s cluster, deleting the deployments on wikikube and staging
  • Create a datahub-test service (with its own namespace etc.) on dse-k8s and use this for our integration testing.
  • Add https://datahub-test.wikimedia.org to the CDN and route it to this new service.

What do you think?

Can you please elaborate on the "metadata ingestion"/GMS issue a bit? What is the struggle with testing that?

Yes, gladly. It's all about end-to-end testing.

Currently, when we log into the staging deployment of datahub there is no content to be seen. i.e. There is no metadata on hive tables, kafka topics, druid tables, superset charts etc.
This means that there is limited utility in this staging deployment for verifying that everything is working as it should.

We have scheduled jobs in Airflow that do this regular metadata ingestion for the production DataHub, using the production Hive tables etc.
We could potentially have scheduled jobs that would allow us to do end-to-end validation of the staging deployment. We have a test hadoop cluster, with test airflow, test hive, test druid, test kafka etc. so it would seem like a good idea to link up these services with a test datahub service.

However, we haven't been able to do that at the moment because the GMS port on staging isn't exposed as the one on production is (at https://datahub-gms.discovery.wmnet:30443/)

As you mention, the staging cluster wasn't really intended for this kind of integration testing, but for validating kubernetes deployments, so maybe this is the time to create a new service and to move it to dse-k8s.
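
To make the ingestion problem concrete, here is a minimal sketch of the kind of recipe a scheduled job would need to hand to DataHub's ingestion tooling. It assumes the usual recipe layout (a `source` plus a `datahub-rest` sink pointing at GMS); the `build_recipe` helper and the test Hive address are hypothetical, and only the production GMS URL comes from the comment above. A staging or test recipe would need an equivalent reachable GMS endpoint, which is exactly what is missing today.

```python
def build_recipe(gms_url: str, hive_host_port: str) -> dict:
    """Build a DataHub ingestion recipe as a plain dict (hypothetical helper).

    gms_url: the GMS endpoint the sink pushes metadata to.
    hive_host_port: "host:port" of the Hive server to read metadata from.
    """
    return {
        "source": {
            "type": "hive",
            "config": {"host_port": hive_host_port},
        },
        "sink": {
            "type": "datahub-rest",
            # Strip a trailing slash so the server URL has one canonical form.
            "config": {"server": gms_url.rstrip("/")},
        },
    }


# Production GMS URL from the comment; the Hive address is a made-up placeholder.
recipe = build_recipe(
    "https://datahub-gms.discovery.wmnet:30443/",
    "hive-test.example:10000",
)
```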

Aug 10 2023, 9:54 AM · serviceops-radar, Data-Platform-SRE, Data-Engineering

Aug 9 2023

Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

Linking this comment for transparency on what is in progress.

Aug 9 2023, 6:22 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene updated the task description for T340648: [Airflow] Setup Airflow instance for WMDE.
Aug 9 2023, 6:18 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene added a comment to T343236: Get datahub-staging.wikimedia.org working with the staging deployment of datahub.

JFTR, Puppet runs on idp-test1002 are currently failing. This is caused by the 'client_secret' that was pushed to the private repo, which needs a corresponding entry in hieradata/role/common/idp_test.yaml as well.

This was pushed together with 944231, which spawned some conversation about datahub-staging and about possibly removing the datahub prod entry from idp_test so that only staging remains.

Aug 9 2023, 6:05 PM · serviceops-radar, Data-Platform-SRE, Data-Engineering

Aug 4 2023

Stevemunene created T343520: decommission an-test-client1001.eqiad.wmnet.
Aug 4 2023, 11:53 AM · Data-Platform-SRE, decommission-hardware
Stevemunene added a comment to T343236: Get datahub-staging.wikimedia.org working with the staging deployment of datahub.

JFTR, Puppet runs on idp-test1002 are currently failing. This is caused by the 'client_secret' that was pushed to the private repo, which needs a corresponding entry in hieradata/role/common/idp_test.yaml as well.

Aug 4 2023, 9:01 AM · serviceops-radar, Data-Platform-SRE, Data-Engineering

Aug 3 2023

Stevemunene added a comment to T332570: Upgrade hadoop workers to bullseye.

I noticed the following while checking alerts:

Notice: /Stage[main]/Base::Standard_packages/Package[libpython2.7-minimal]/ensure: removed (corrective)
Notice: /Stage[main]/Base::Standard_packages/Package[python2.7]/ensure: removed (corrective)
Notice: /Stage[main]/Base::Standard_packages/Package[libpython2.7-stdlib]/ensure: removed (corrective)
Notice: /Stage[main]/Base::Standard_packages/Package[python2.7-minimal]/ensure: removed (corrective)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230803-3988867-pxe43n.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/bigtop/manifests/hive.pp, line: 164)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230803-3988867-pxe43n.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/bigtop/manifests/hive.pp, line: 164)
Wrapped exception:
No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230803-3988867-pxe43n.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Bigtop::Hive/File[/usr/lib/hive/bin/ext/hiveserver2.sh]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /usr/lib/hive/bin/ext/hiveserver2.sh20230803-3988867-pxe43n.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/bigtop/manifests/hive.pp, line: 164) (corrective)

This is from analytics1070, but the same issue seems to be present on all bullseye nodes (probably an if condition or similar).

Aug 3 2023, 10:48 AM · Data-Platform-SRE

Aug 1 2023

Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

Thanks @jbond
Adding a datahub_staging oidc entry with service_id: 'https://datahub-frontend\.k8s-staging\.discovery\.wmnet(/.*)?' which we access via tunnel and mainly use for testing.
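
As a rough illustration of what that service_id pattern accepts, here is a minimal Python sketch. It assumes the pattern is applied as a full match against the service URL; CAS evaluates it with Java's regex engine, so Python's `re` is only a stand-in, and the `matches_service` helper is hypothetical.

```python
import re

# Pattern copied verbatim from the service_id in the comment above.
SERVICE_ID = r"https://datahub-frontend\.k8s-staging\.discovery\.wmnet(/.*)?"


def matches_service(url: str) -> bool:
    """Return True if the whole URL matches the service_id pattern."""
    return re.fullmatch(SERVICE_ID, url) is not None
```

The escaped dots mean only the literal staging hostname is accepted, and the optional `(/.*)?` group allows any path under it.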

Aug 1 2023, 1:40 PM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations

Jul 31 2023

Stevemunene added a comment to T305874: Switch DataHub authentication to OIDC.

On the datahub charts, we need to replace the jaas configmap with an OIDC setup.
The env variables can be made available via:

{{- if .Values.auth.ldap.enabled }}
# Values below are placeholders; the client secret should come from secrets.
- name: AUTH_OIDC_ENABLED
  value: "true"
- name: AUTH_OIDC_CLIENT_ID
  value: "our-client-id"
- name: AUTH_OIDC_CLIENT_SECRET
  value: "getfromsecrets"
- name: AUTH_OIDC_JIT_PROVISIONING_ENABLED
  value: "true"
- name: AUTH_OIDC_EXTRACT_GROUPS_ENABLED
  value: "true"
- name: AUTH_OIDC_PRE_PROVISIONING_REQUIRED
  value: "false"
- name: AUTH_OIDC_DISCOVERY_URI
  value: "apereo_cas.production.oidc_endpoint"
- name: AUTH_OIDC_BASE_URL
  value: "wmf datahub"
- name: AUTH_OIDC_USER_NAME_CLAIM
  value: "preferred_username"
{{- end }}

For the IdP, the steps are roughly:
add a datahub OIDC entry with the required groups (wmf, nda) to idp_test.yaml, and add the corresponding secrets to labs/private.
There are also some open questions about handling existing users once OIDC comes into play.
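
A list of env entries like the one above is easy to get subtly wrong when edited by hand, so a small sanity check may help before it goes into a chart. The sketch below is only an illustration: `check_env` is a hypothetical helper, and it assumes the entries have already been loaded into Python as a list of name/value dicts.

```python
from collections import Counter


def check_env(entries):
    """Return (duplicate names, names with empty values).

    entries: list of {"name": ..., "value": ...} dicts, shaped like a
    container env list.
    """
    counts = Counter(e["name"] for e in entries)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    empty = sorted(e["name"] for e in entries if not e.get("value"))
    return duplicates, empty


# Example input with one repeated name and one empty value.
sample = [
    {"name": "AUTH_OIDC_ENABLED", "value": "true"},
    {"name": "AUTH_OIDC_CLIENT_ID", "value": ""},
    {"name": "AUTH_OIDC_ENABLED", "value": "true"},
]
duplicates, empty = check_env(sample)
```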

Jul 31 2023, 7:10 AM · Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations

Jul 25 2023

Stevemunene moved T341700: Migrate analytics_test airflow instance to bullseye an-test-client1002 from In Progress to Done on the Data-Platform-SRE board.
Jul 25 2023, 3:32 PM · Patch-For-Review, Data-Platform-SRE

Jul 24 2023

Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

Could you please add @karapayneWMDE to the parent group? If not, what would be required to do so?
(see T284308: Add Kara Payne to the ldap/wmde and ldap/nda group for reference)

Jul 24 2023, 11:56 AM · Patch-For-Review, Data-Platform-SRE

Jul 21 2023

Stevemunene added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

What Andrew said! Please also add our engineering manager @karapayneWMDE to the group.

Jul 21 2023, 1:54 PM · Patch-For-Review, Data-Platform-SRE
AndrewTavis_WMDE awarded T340648: [Airflow] Setup Airflow instance for WMDE a Like token.
Jul 21 2023, 1:15 PM · Patch-For-Review, Data-Platform-SRE
Stevemunene moved T340648: [Airflow] Setup Airflow instance for WMDE from Ready for Work to In Progress on the Data-Platform-SRE board.

Hi @AndrewTavis_WMDE, this has been picked up and is in progress.

Jul 21 2023, 12:54 PM · Patch-For-Review, Data-Platform-SRE