BTullis (Ben)
Senior SRE

User Details

User Since
Jun 29 2021, 9:56 AM (58 w, 5 h)
Availability
Available
IRC Nick
btullis
LDAP User
Btullis
MediaWiki User
BTullis (WMF)

Recent Activity

Today

BTullis added a comment to T314838: RAID battery alert in an-worker1089.

I have added 3 months of downtime on the MegaRAID check for this server, just to stop the alert flapping.

Tue, Aug 9, 8:42 AM · Data Engineering Planning
BTullis created T314838: RAID battery alert in an-worker1089.
Tue, Aug 9, 8:38 AM · Data Engineering Planning
BTullis closed T311991: RAID battery alert in an-worker1082 as Resolved.
Tue, Aug 9, 8:33 AM · Data-Engineering

Yesterday

BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

So the final build command was the following:

btullis@marlin-wsl:~/src/bigtop-bullseye$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean hadoop-pkg hive-pkg bigtop-groovy-pkg bigtop-jsvc-pkg bigtop-tomcat-pkg bigtop-utils-pkg flink-pkg hbase-pkg mahout-pkg solr-pkg spark-pkg sqoop-pkg sqoop2-pkg'

I've added the thirdparty/bigtop15 component to the wikimedia-bullseye distribution: https://gerrit.wikimedia.org/r/821223
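For reference, the build command above follows a regular pattern: one `<component>-pkg` Gradle target per component. The sketch below only prints a command of that shape for a given component list; it does not run Docker. The function name `bigtop_build_cmd` is hypothetical and not part of Bigtop.

```shell
# Hypothetical helper: print (not run) a build command of the shape used above.
# Each argument is a component name; "-pkg" is appended to form the Gradle target.
bigtop_build_cmd() {
    targets=""
    for component in "$@"; do
        targets="$targets ${component}-pkg"
    done
    # The image name and profile script are copied from the command above.
    printf "docker run --rm -v \"\$(pwd)\":/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean%s'\n" "$targets"
}
```

Calling `bigtop_build_cmd hadoop hive` prints a single command line that can be reviewed before it is pasted into a terminal.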

Mon, Aug 8, 3:03 PM · Data Engineering Planning (Sprint 02)
BTullis edited projects for T304373: Also intake Network Error Logging events into the Analytics Data Lake, added: Data Engineering Planning; removed Data-Engineering.

Yes I am still interested. Adding it to our planning board for discussion.

Mon, Aug 8, 9:18 AM · Data Engineering Planning, SRE

Thu, Aug 4

BTullis updated the task description for T314587: Q1:rack/setup/install new machine learning hosts.
Thu, Aug 4, 3:35 PM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops
BTullis updated the task description for T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
Thu, Aug 4, 9:04 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

I have rebuilt the solr, spark, and sqoop2 packages using the correct working directory and the same commands as mentioned previously.

Thu, Aug 4, 9:03 AM · Data Engineering Planning (Sprint 02)

Wed, Aug 3

BTullis added a comment to T312858: New airflow instance related to Image Suggestion Jobs.

@xcollazo - The new airflow VM is up and running now, but I have just put it into the insetup role, which means it's ready to be assigned a puppet role.

btullis@marlin-wsl:~$ ssh an-airflow1004.eqiad.wmnet
Linux an-airflow1004 4.19.0-21-amd64 #1 SMP Debian 4.19.249-2 (2022-06-30) x86_64
Debian GNU/Linux 10 (buster)
an-airflow1004 is a Host being setup for later application of a role (insetup)
The last Puppet run was at Wed Aug  3 16:32:38 UTC 2022 (1 minutes ago).
Last puppet commit: (f64a94548d) Dan Andreescu - role::common::aqs: update mw history
Debian GNU/Linux 10 auto-installed on Wed Aug 3 13:40:42 UTC 2022.
Wed, Aug 3, 4:47 PM · Patch-For-Review, Data Engineering Planning (Sprint 02), Data Pipelines
BTullis moved T314151: Metrics Platform Event custom_data field isn't refined correctly from Next Up to In progress on the Data Engineering Planning (Sprint 02) board.
Wed, Aug 3, 4:16 PM · Data Engineering Planning (Sprint 02), Metrics-Platform
BTullis edited projects for T314151: Metrics Platform Event custom_data field isn't refined correctly, added: Data Engineering Planning (Sprint 02); removed Data Engineering Planning.
Wed, Aug 3, 4:16 PM · Data Engineering Planning (Sprint 02), Metrics-Platform
BTullis set the point value for T300246: Add alert for varnishkafka low/zero messages per second to alertmanager to 1.

Setting story points to 1 for the remaining work only.

Wed, Aug 3, 4:14 PM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T314151: Metrics Platform Event custom_data field isn't refined correctly.

Thanks @Ottomata - Could you clarify for me please?

Wed, Aug 3, 4:13 PM · Data Engineering Planning (Sprint 02), Metrics-Platform
BTullis closed T298940: Reimage WMCS db proxies to Bullseye, a subtask of T298586: Upgrade all dbproxy hosts to Bullseye, as Resolved.
Wed, Aug 3, 4:03 PM · Patch-For-Review, DBA
BTullis closed T298940: Reimage WMCS db proxies to Bullseye as Resolved.
Wed, Aug 3, 4:03 PM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis updated subscribers of T314151: Metrics Platform Event custom_data field isn't refined correctly.

I think that in order to delete the existing data we should do the following:

Wed, Aug 3, 11:50 AM · Data Engineering Planning (Sprint 02), Metrics-Platform

Tue, Aug 2

BTullis closed T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team, a subtask of T312858: New airflow instance related to Image Suggestion Jobs, as Resolved.
Tue, Aug 2, 4:34 PM · Patch-For-Review, Data Engineering Planning (Sprint 02), Data Pipelines
BTullis closed T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team as Resolved.
Tue, Aug 2, 4:34 PM · SRE, vm-requests
BTullis added a comment to T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team.

Many thanks @MoritzMuehlenhoff

Tue, Aug 2, 4:06 PM · SRE, vm-requests
BTullis added a comment to T312626: Replace RAID controller battery in an-worker1082.

Many thanks indeed @Cmjohnson.

Tue, Aug 2, 3:50 PM · SRE, ops-eqiad, DC-Ops
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

Bother, I built solr, spark, and sqoop[2] in the wrong working directory, so I'm going to rebuild them.

Tue, Aug 2, 3:47 PM · Data Engineering Planning (Sprint 02)
BTullis updated the task description for T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
Tue, Aug 2, 2:22 PM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

I built sqoop2 before returning to sqoop. This was successful.

btullis@marlin-wsl:~/src/bigtop$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean sqoop2-pkg'
> Task :sqoop2-pkg
Tue, Aug 2, 2:21 PM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

The sqoop build failed.

BUILD FAILED
/ws/output/sqoop/sqoop-1.4.6/build.xml:1094: Execute failed: java.io.IOException: Cannot run program "python" (in directory "/ws/output/sqoop/sqoop-1.4.6"): error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at java.lang.Runtime.exec(Runtime.java:621)
        at org.apache.tools.ant.taskdefs.launcher.Java13CommandLauncher.exec(Java13CommandLauncher.java:58)
        at org.apache.tools.ant.taskdefs.Execute.launch(Execute.java:426)
        at org.apache.tools.ant.taskdefs.Execute.execute(Execute.java:440)
        at org.apache.tools.ant.taskdefs.ExecTask.runExecute(ExecTask.java:630)
        at org.apache.tools.ant.taskdefs.ExecTask.runExec(ExecTask.java:671)
        at org.apache.tools.ant.taskdefs.ExecTask.execute(ExecTask.java:497)
        at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:293)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
        at org.apache.tools.ant.Task.perform(Task.java:352)
        at org.apache.tools.ant.Target.execute(Target.java:437)
        at org.apache.tools.ant.Target.performTasks(Target.java:458)
        at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1406)
        at org.apache.tools.ant.Project.executeTarget(Project.java:1377)
        at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
        at org.apache.tools.ant.Project.executeTargets(Project.java:1261)
        at org.apache.tools.ant.Main.runBuild(Main.java:857)
        at org.apache.tools.ant.Main.startAnt(Main.java:236)
        at org.apache.tools.ant.launch.Launcher.run(Launcher.java:287)
        at org.apache.tools.ant.launch.Launcher.main(Launcher.java:112)
Caused by: java.io.IOException: error=2, No such file or directory
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
        at java.lang.ProcessImpl.start(ProcessImpl.java:134)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        ... 23 more
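The trace above reports that Ant could not start a program called "python". A minimal sketch for checking which interpreters are on the PATH, assuming the cause is simply a missing `python` executable in the build environment (the log does not confirm this, and the function name `check_interpreters` is hypothetical):

```shell
# Hypothetical helper: report where each named command resolves on PATH.
check_interpreters() {
    for name in "$@"; do
        if path=$(command -v "$name" 2>/dev/null); then
            echo "$name: $path"
        else
            echo "$name: not found"
        fi
    done
}
```

Run inside the build container as `check_interpreters python python2 python3` to see which of the names resolve there.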
Tue, Aug 2, 2:14 PM · Data Engineering Planning (Sprint 02)
BTullis updated the task description for T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
Tue, Aug 2, 12:52 PM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

Spark built successfully

btullis@marlin-wsl:~/src/bigtop$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean spark-pkg'
> Task :spark-pkg
Tue, Aug 2, 12:51 PM · Data Engineering Planning (Sprint 02)
BTullis updated the task description for T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
Tue, Aug 2, 11:50 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

Solr built successfully

btullis@marlin-wsl:~/src/bigtop$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean solr-pkg'
> Task :solr-pkg
Tue, Aug 2, 11:50 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T312626: Replace RAID controller battery in an-worker1082.

@Cmjohnson - Apologies for all of the delay on this, I just kept missing you. I've now downtimed an-worker1082 for 3 days and I've shut it down already.
If it's convenient you can do the battery swap whenever you like and just boot it afterwards. Feel free to ping me on IRC if you'd like me to check anything.

Tue, Aug 2, 11:15 AM · SRE, ops-eqiad, DC-Ops
BTullis moved T298940: Reimage WMCS db proxies to Bullseye from In progress to Done on the Data Engineering Planning (Sprint 02) board.

The reimage of dbproxy1019 also completed without incident. It has been put back into service and dbproxy1018 has been set back to inactive as before.
Marking this ticket as done.

Tue, Aug 2, 10:45 AM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis updated the task description for T298940: Reimage WMCS db proxies to Bullseye.
Tue, Aug 2, 10:43 AM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis added a comment to T298940: Reimage WMCS db proxies to Bullseye.

These are the confctl operations that I carried out.

Tue, Aug 2, 9:35 AM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis updated the task description for T298940: Reimage WMCS db proxies to Bullseye.
Tue, Aug 2, 9:30 AM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis added a comment to T298940: Reimage WMCS db proxies to Bullseye.

The reimage of dbproxy1018 is complete and it is back in service. I will now proceed to reimage dbproxy1019.

Tue, Aug 2, 9:29 AM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis updated subscribers of T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team.
Tue, Aug 2, 9:23 AM · SRE, vm-requests

Mon, Aug 1

BTullis updated subscribers of T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team.
Mon, Aug 1, 4:35 PM · SRE, vm-requests
BTullis added a subtask for T312858: New airflow instance related to Image Suggestion Jobs: T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team.
Mon, Aug 1, 4:35 PM · Patch-For-Review, Data Engineering Planning (Sprint 02), Data Pipelines
BTullis added a parent task for T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team: T312858: New airflow instance related to Image Suggestion Jobs.
Mon, Aug 1, 4:34 PM · SRE, vm-requests
BTullis created T314319: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team.
Mon, Aug 1, 4:34 PM · SRE, vm-requests
BTullis added a comment to T312858: New airflow instance related to Image Suggestion Jobs.

The original request for an-airflow1003 was T284225: Create airflow instances for Platform Engineering and Research, and the VM request form was submitted as T284934: Site: 2 VM request for an-airflow100{2,3}

Mon, Aug 1, 4:28 PM · Patch-For-Review, Data Engineering Planning (Sprint 02), Data Pipelines
BTullis updated subscribers of T298940: Reimage WMCS db proxies to Bullseye.

Sincere thanks to @taavi who rapidly identified the cause and provided a fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/819090

Mon, Aug 1, 2:58 PM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis added a comment to T298940: Reimage WMCS db proxies to Bullseye.

Sadly, it didn't work. When I pooled dbproxy1019 for the wikireplicas-a service it didn't respond to queries. My test command showed errors:

btullis@tools-sgebastion-10:~$ mysql -h commonswiki.analytics.db.svc.wikimedia.cloud -e "SHOW DATABASES"
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11

I'm looking into why now.

Mon, Aug 1, 2:43 PM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis added a comment to T298940: Reimage WMCS db proxies to Bullseye.

I have now joined toolforge and have verified that I have access to the wikireplicas via that interface, so when pooling dbproxy1019 and depooling dbproxy1019 I can now verify that the analytics replicas are still working by using the command:

mysql -h commonswiki.analytics.db.svc.wikimedia.cloud -e "SHOW DATABASES"

Proceeding to pool dbproxy1019 for wikireplicas-a now.
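The verification command above can be wrapped so that it reports a clear pass or fail after each pool or depool step. This is a minimal sketch, not what was actually run: the function name `check_replica` and the `MYSQL_CMD` override are hypothetical, and the override exists only so the wrapper can be tried without database access.

```shell
# Hypothetical wrapper around the verification command quoted above.
# MYSQL_CMD can be overridden, e.g. for testing without a database.
MYSQL_CMD="${MYSQL_CMD:-mysql}"

check_replica() {
    host="$1"
    # Only the exit status matters here, so the query output is discarded.
    if "$MYSQL_CMD" -h "$host" -e "SHOW DATABASES" >/dev/null 2>&1; then
        echo "OK   $host"
    else
        echo "FAIL $host"
        return 1
    fi
}
```

For example, `check_replica commonswiki.analytics.db.svc.wikimedia.cloud` prints one line and returns a non-zero status on failure, which makes it easy to use in a loop while the proxy state changes.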

Mon, Aug 1, 2:33 PM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis added a comment to T298940: Reimage WMCS db proxies to Bullseye.

I'm planning to begin work on this ticket today.

Mon, Aug 1, 2:02 PM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
BTullis moved T298940: Reimage WMCS db proxies to Bullseye from Next Up to In progress on the Data Engineering Planning (Sprint 02) board.
Mon, Aug 1, 1:34 PM · Data Engineering Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)

Sun, Jul 24

BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

The disk space is now looking much more healthy.

btullis@archiva1002:~$ df -h -t ext4
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        94G  9.5G   80G  11% /
/dev/vdb1       196G   75G  121G  39% /var/lib/archiva

I haven't updated the partman recipe yet, but at least with a second disk we can choose not to format this when we reimage the host, which will save on the rebuild time.

Sun, Jul 24, 9:21 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis moved T313386: archiva1002 is running low on space left in the root partition from Ready to Done on the Data Engineering Planning (Sprint 01) board.
Sun, Jul 24, 9:18 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis closed T313386: archiva1002 is running low on space left in the root partition as Resolved.
Sun, Jul 24, 9:18 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

The git-fat link service appears to work without errors:

Jul 24 21:15:01 archiva1002 systemd[1]: Started Archiva tool to create jar symlinks using their sha1 checksum as filename..
Jul 24 21:16:28 archiva1002 systemd[1]: archiva-gitfat-link.service: Succeeded.

I will now remove the backup directory: /var/lib/archiva-bak

Sun, Jul 24, 9:17 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

All steps above have now been followed, except the final removal of the backup in /var/lib/archiva-bak.

Sun, Jul 24, 9:14 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

I've mounted /dev/vdb1 to /mnt temporarily and started an rsync operation with:

sudo rsync -av /var/lib/archiva/ /mnt

Once this is complete I will:

  • add /dev/vdb1 to /etc/fstab as /var/lib/archiva
  • stop the archiva service
  • run the rsync command once again to make sure no changes have occurred
  • rename /var/lib/archiva to /var/lib/archiva-bak
  • mkdir /var/lib/archiva and chown it to the correct user (currently this ownership is bacula:ulog which I'm not sure about, but I can come back to double check this)
  • sudo mount -a to mount the new volume
  • start the archiva service
  • If all is well, remove /var/lib/archiva-bak
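
The steps above can be sketched as a dry-run script. This is an illustration under stated assumptions, not the commands that were actually run: the `run` helper and `DRY_RUN` switch are hypothetical, the systemd unit name `archiva`, the `archiva:archiva` owner, the ext4 filesystem type and the fstab options are all assumptions, and the correct owner should be confirmed first, as noted in the list.

```shell
#!/bin/sh
# Dry-run sketch of the migration steps listed above. Commands are only
# printed unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

migrate_archiva() {
    src=/var/lib/archiva
    dev=/dev/vdb1
    # Assumption: replace with the owner confirmed for this directory.
    owner="${ARCHIVA_OWNER:-archiva:archiva}"

    # The filesystem type and mount options here are illustrative assumptions.
    run sh -c "echo '$dev $src ext4 defaults 0 2' >> /etc/fstab"
    run systemctl stop archiva        # unit name is an assumption
    run rsync -av "$src/" /mnt        # second pass, with the service stopped
    run mv "$src" "$src-bak"
    run mkdir "$src"
    run chown "$owner" "$src"
    run mount -a
    run systemctl start archiva
}

# Kept separate so the backup is only removed after a manual check.
remove_backup() {
    run rm -rf /var/lib/archiva-bak
}
```

With the default `DRY_RUN=1`, calling `migrate_archiva` only prints the eight commands in order, so the plan can be reviewed before anything is changed.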
Sun, Jul 24, 8:59 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

I had to rename the network interface from ens5 to ens14 in /etc/network/interfaces as described here: https://wikitech.wikimedia.org/wiki/Ganeti#Adding_a_disk

Sun, Jul 24, 8:49 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

I checked that the primary (ganeti1008) and secondary (ganeti1025) nodes both have plenty of spare disk space:

Sun, Jul 24, 3:38 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED
BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

I'm looking into this issue now, since I'm on leave next week and I would rather not leave it any longer.
I will take the advice from @hashar and @elukey which is to use a separate disk for /var/lib/archiva and move the existing contents there.

Sun, Jul 24, 3:32 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED

Fri, Jul 22

BTullis added a comment to T313386: archiva1002 is running low on space left in the root partition.

I think I'd probably consider growing the disk in ganeti, which should be able to increase the headroom for us.

Fri, Jul 22, 12:22 PM · Data Engineering Planning (Sprint 01), wmde-team-b-tech, SRE, Discovery-ARCHIVED

Sat, Jul 16

BTullis updated the task description for T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
Sat, Jul 16, 9:24 PM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

Mahout built successfully.

btullis@marlin-wsl:~/src/bigtop-bullseye$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean mahout-pkg'
> Task :mahout-pkg
Sat, Jul 16, 9:24 PM · Data Engineering Planning (Sprint 02)

Fri, Jul 15

BTullis created T313130: RAID battery alert in an-worker1093.
Fri, Jul 15, 4:42 PM · Data-Engineering
BTullis closed T306903: Integrate Superset with DataHub, a subtask of T299910: Data Catalog MVP, as Resolved.
Fri, Jul 15, 4:36 PM · Data-Catalog, Data-Engineering, Epic
BTullis closed T306903: Integrate Superset with DataHub as Resolved.
Fri, Jul 15, 4:36 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban, Data-Catalog
BTullis closed T310171: Deploy (2) control-plane VMs for dse-k8s cluster as Resolved.

These servers will be configured later, once the etcd cluster is available.

Fri, Jul 15, 4:34 PM · Data Engineering Planning (Sprint 01), Shared-Data-Infrastructure
BTullis closed T310171: Deploy (2) control-plane VMs for dse-k8s cluster, a subtask of T310196: K8 DSE Kubernetes Cluster, as Resolved.
Fri, Jul 15, 4:34 PM · Epic, Shared-Data-Infrastructure, Foundational Technology Requests
BTullis closed T310170: Deploy (3) etcd cluster of VMs for dse-k8s cluster as Resolved.

Configuration of the etcd cluster itself will be done in T313129: Configure etcd for dse-k8s cluster

Fri, Jul 15, 4:34 PM · Data Engineering Planning (Sprint 01), Shared-Data-Infrastructure
BTullis closed T310170: Deploy (3) etcd cluster of VMs for dse-k8s cluster, a subtask of T310196: K8 DSE Kubernetes Cluster, as Resolved.
Fri, Jul 15, 4:34 PM · Epic, Shared-Data-Infrastructure, Foundational Technology Requests
BTullis added a subtask for T310170: Deploy (3) etcd cluster of VMs for dse-k8s cluster: T313129: Configure etcd for dse-k8s cluster.
Fri, Jul 15, 4:33 PM · Data Engineering Planning (Sprint 01), Shared-Data-Infrastructure
BTullis removed a subtask for T311131: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster: T313129: Configure etcd for dse-k8s cluster.
Fri, Jul 15, 4:33 PM · Shared-Data-Infrastructure, vm-requests, Infrastructure-Foundations, SRE
BTullis edited parent tasks for T313129: Configure etcd for dse-k8s cluster, added: T310170: Deploy (3) etcd cluster of VMs for dse-k8s cluster; removed: T311131: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster.
Fri, Jul 15, 4:33 PM · Patch-For-Review, Data Engineering Planning (Sprint 02), Shared-Data-Infrastructure
BTullis moved T310171: Deploy (2) control-plane VMs for dse-k8s cluster from In Progress before Value Stream Kickoff (before Aug 15th) to Done before Values Stream Kickoff (before Aug 15th) on the Shared-Data-Infrastructure board.
Fri, Jul 15, 4:32 PM · Data Engineering Planning (Sprint 01), Shared-Data-Infrastructure
BTullis moved T310170: Deploy (3) etcd cluster of VMs for dse-k8s cluster from In Progress before Value Stream Kickoff (before Aug 15th) to Done before Values Stream Kickoff (before Aug 15th) on the Shared-Data-Infrastructure board.
Fri, Jul 15, 4:32 PM · Data Engineering Planning (Sprint 01), Shared-Data-Infrastructure
BTullis added a subtask for T311131: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster: T313129: Configure etcd for dse-k8s cluster.
Fri, Jul 15, 4:30 PM · Shared-Data-Infrastructure, vm-requests, Infrastructure-Foundations, SRE
BTullis added a parent task for T313129: Configure etcd for dse-k8s cluster: T311131: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster.
Fri, Jul 15, 4:30 PM · Patch-For-Review, Data Engineering Planning (Sprint 02), Shared-Data-Infrastructure
BTullis created T313129: Configure etcd for dse-k8s cluster.
Fri, Jul 15, 4:30 PM · Patch-For-Review, Data Engineering Planning (Sprint 02), Shared-Data-Infrastructure
BTullis closed T310293: HDFS Namenode failover failure as Resolved.

I'm tempted to resolve this problem for now, given that we know there is a workaround and we know that we should try to avoid failing over during the busiest times for the cluster.

Fri, Jul 15, 4:14 PM · Data-Engineering
BTullis updated the task description for T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
Fri, Jul 15, 4:10 PM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

Hbase built successfully:

btullis@marlin-wsl:~/src/bigtop-bullseye$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean hbase-pkg'
> Task :hbase-pkg
Fri, Jul 15, 4:08 PM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

I had to create a patch file to update the version of the maven-shade-plugin that is in use.

btullis@marlin-wsl:~/src/bigtop-bullseye$ cat bigtop-packages/src/common/flink/patch1-fix-maven-shaded-plugin.diff
From 716b16106d889c0e462d74d6cfcbf92780e8ebfa Mon Sep 17 00:00:00 2001
From: Ben Tullis <btullis@wikimedia.org>
Date: Fri, 15 Jul 2022 13:33:28 +0100
Subject: [PATCH] Update maven-shaded-plugin
Fri, Jul 15, 3:00 PM · Data Engineering Planning (Sprint 02)
BTullis moved T312620: Security Related Task [Placeholder] from Ready to deploy to Paused/Blocked on the Data Engineering Planning (Sprint 01) board.
Fri, Jul 15, 2:53 PM · Data Engineering Planning
BTullis moved T306903: Integrate Superset with DataHub from Next Up to Done on the Data-Catalog board.
Fri, Jul 15, 2:51 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban, Data-Catalog
BTullis moved T306903: Integrate Superset with DataHub from In progress to Done on the Data Engineering Planning (Sprint 01) board.
Fri, Jul 15, 2:51 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban, Data-Catalog
BTullis added a comment to T306903: Integrate Superset with DataHub.

Looking good. After one small modification to the patch (https://gerrit.wikimedia.org/r/c/analytics/datahub/+/812023/2..1) the ingestion worked and the link to Superset is working.

image.png (693×978 px, 58 KB)

Fri, Jul 15, 2:50 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban, Data-Catalog
BTullis created P31145 Datahub Superset ingestion error.
Fri, Jul 15, 2:23 PM
BTullis added a comment to T306903: Integrate Superset with DataHub.

Deleted the entities associated with superset.

(2022-04-14T15.32.43_btullis) btullis@stat1008:~/src/datahub/ingestion$ datahub delete --entity_type dashboard --platform superset --hard
This will permanently delete data from DataHub. Do you want to continue? [y/N]: y
[2022-07-15 14:19:58,653] INFO     {datahub.cli.delete_cli:234} - datahub configured with https://datahub-gms.discovery.wmnet:30443
[2022-07-15 14:19:58,756] INFO     {datahub.cli.delete_cli:248} - Filter matched 221 entities. Sample: ['urn:li:dashboard:(superset,32)', 'urn:li:dashboard:(superset,276)', 'urn:li:dashboard:(superset,270)', 'urn:li:dashboard:(superset,115)', 'urn:li:dashboard:(superset,190)']
This will delete 221 entities. Are you sure? [y/N]: y
100% (221 of 221) |#######################################################################################################################################################| Elapsed Time: 0:00:03 Time:  0:00:03
Took 11.786 seconds to hard delete 884 rows for 221 entities
(2022-04-14T15.32.43_btullis) btullis@stat1008:~/src/datahub/ingestion$ datahub delete --entity_type chart --platform superset --hard
This will permanently delete data from DataHub. Do you want to continue? [y/N]: y
[2022-07-15 14:20:17,714] INFO     {datahub.cli.delete_cli:234} - datahub configured with https://datahub-gms.discovery.wmnet:30443
[2022-07-15 14:20:18,062] INFO     {datahub.cli.delete_cli:248} - Filter matched 2001 entities. Sample: ['urn:li:chart:(superset,1206)', 'urn:li:chart:(superset,945)', 'urn:li:chart:(superset,1171)', 'urn:li:chart:(superset,1550)', 'urn:li:chart:(superset,1502)']
This will delete 2001 entities. Are you sure? [y/N]: y
100% (2001 of 2001) |#####################################################################################################################################################| Elapsed Time: 0:00:32 Time:  0:00:32
Took 38.028 seconds to hard delete 8004 rows for 2001 entities
Fri, Jul 15, 2:21 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban, Data-Catalog
BTullis added a comment to T306903: Integrate Superset with DataHub.

I have tried a stateful ingestion but it fails validation:

btullis@stat1008:~/src/datahub/ingestion$ datahub ingest -c superset.yml
[2022-07-15 14:16:23,916] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38
1 validation error for SupersetConfig
stateful_ingestion
  extra fields not permitted (type=value_error.extra)

The recipe looked like this:

btullis@stat1008:~/src/datahub/ingestion$ cat superset.yml | grep -v password
source:
  type: "superset"
  config:
    connect_uri: "http://localhost:8088"
    base_uri: "https://superset.wikimedia.org"
    username: admin
    provider: db
    stateful_ingestion:
        enabled: True
        remove_stale_metadata: True
Fri, Jul 15, 2:18 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban, Data-Catalog
BTullis added a comment to T306903: Integrate Superset with DataHub.

I think that there is an easier way to test this patch, given that we don't need the whole toolchain. We just need to patch the python script that we are using to run the test.

Fri, Jul 15, 1:23 PM · Data Engineering Planning (Sprint 01), Patch-For-Review, Data-Engineering-Kanban, Data-Catalog
BTullis updated the task description for T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
Fri, Jul 15, 10:11 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

The flink job failed.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:3.0.0:shade (shade-hadoop) on project flink-shaded-hadoop2-uber: Error creating shaded jar: null: IllegalArgumentException -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :flink-shaded-hadoop2-uber
make[1]: *** [debian/rules:29: override_dh_auto_build] Error 1
make[1]: Leaving directory '/ws/output/flink/flink-1.6.4'
make: *** [debian/rules:26: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
debuild: fatal error at line 1182:
dpkg-buildpackage -us -uc -ui -b
 failed
Fri, Jul 15, 10:10 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
btullis@marlin-wsl:~/src/bigtop-bullseye$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean bigtop-utils-pkg'
> Task :bigtop-utils-pkg
Fri, Jul 15, 10:06 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean bigtop-tomcat-pkg'
> Task :bigtop-tomcat-pkg
Fri, Jul 15, 10:04 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
btullis@marlin-wsl:~/src/bigtop-bullseye$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean bigtop-jsvc-pkg'
> Task :bigtop-jsvc-pkg
Fri, Jul 15, 10:01 AM · Data Engineering Planning (Sprint 02)
BTullis closed T312134: Request for SQL Templating to be enabled in Superset as Resolved.

Great! Glad it works as expected.

Fri, Jul 15, 9:23 AM · Product-Analytics, Data-Engineering

Thu, Jul 14

BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.
btullis@marlin-wsl:~/src/bigtop-bullseye$ docker run --rm  -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean bigtop-groovy-pkg'
> Task :bigtop-groovy-pkg
Thu, Jul 14, 11:37 AM · Data Engineering Planning (Sprint 02)
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

The build of oozie was successful.

> Task :oozie-pkg
Thu, Jul 14, 11:19 AM · Data Engineering Planning (Sprint 02)
BTullis renamed T310177: Integrate the (8) existing dse-k8s worker nodes from Integrate the (4) existing dse-k8s worker nodes to Integrate the (8) existing dse-k8s worker nodes.
Thu, Jul 14, 11:18 AM · Data Engineering Planning, Shared-Data-Infrastructure
BTullis added a comment to T310643: Build Bigtop 1.5 Hadoop packages for Bullseye.

The build of hive was successful.

> Task :hive-pkg
Thu, Jul 14, 10:19 AM · Data Engineering Planning (Sprint 02)
BTullis closed T293399: Migrate the majority of the analytics cluster alerts from Icinga to AlertManager as Resolved.

I'm re-resolving this ticket now @fgiunchedi - as I think those checks you identified are all removed from Icinga now.
I'd like to get some time to work on the follow-up tickets myself as well, but we're also considering them as good onboarding tasks for someone.

Thu, Jul 14, 10:18 AM · SRE Observability, Data-Engineering-Kanban, Patch-For-Review, Analytics-Kanban, Data-Engineering, User-fgiunchedi
BTullis closed T293399: Migrate the majority of the analytics cluster alerts from Icinga to AlertManager, a subtask of T288622: All Prometheus based alerts move from Icinga to alert manager exclusively, as Resolved.
Thu, Jul 14, 10:17 AM · SRE Observability (FY2022/2023-Q1)
BTullis closed T304478: Move wikireplicas dbproxy haproxy config to etcd as Declined.

Thanks @Joe for your input on this ticket. I think that we've decided it's too much work for us to take on at the moment, so I'll decline this ticket.
Maybe we will come back to the question of dynamic configuration of the clouddb proxy layer at a later date. I really like the idea of being able to adjust the proxies at runtime, but it's just not high enough up the priority list at the moment.

Thu, Jul 14, 10:15 AM · Patch-For-Review, Data-Engineering, Data-Services
BTullis added a comment to T312134: Request for SQL Templating to be enabled in Superset.

OK, @EBernhardson, @mpopov - The patch has been applied and superset has been restarted with the feature flag enabled. It looks OK to me, but could you test the functionality please when it's convenient? Thanks.

Thu, Jul 14, 10:00 AM · Product-Analytics, Data-Engineering
BTullis claimed T312134: Request for SQL Templating to be enabled in Superset.

I'm happy with this and since you've already done the work in writing the patch, I'll merge and test it today.

Thu, Jul 14, 9:55 AM · Product-Analytics, Data-Engineering
BTullis added a comment to T304492: Upgrade db1108 to Bullseye.

I'll carry out this upgrade, although it might be in a few weeks' time. Hope that's OK.

Thu, Jul 14, 9:41 AM · Data Engineering Planning