
elukey (Luca Toscano)
User

User Details

User Since
Jan 5 2016, 9:54 PM (222 w, 5 h)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Yesterday

elukey added a comment to T248905: Add aklapper to analytics-privatedata-users.

Thanks everyone!

You are a user of an experimental service, welcome! :) Jokes aside, I'd be happy if you could help test the new service; for your use case it should be quicker to query data from Superset than from Hive. If you don't have time, feel free to skip!

Tue, Apr 7, 2:00 PM · Developer-Advocacy (Apr-Jun 2020), SRE-Access-Requests, Operations
elukey added a comment to T248962: Occasional NIC Tx bandwidth saturation for mc1027 .

It stores the serialized naive "top frame" (e.g. headings, paragraphs, template invocation parameters) of the wikitext of pages, as well as the "sub-frames" from template invocations, upon recursive expansion. This all happens on page parse. Note that these keys do not need purges. If template X is invoked the same way on multiple pages, then parses of those pages will reuse a common sub-frame cache key for those template invocations and likewise for templates invoked from that template. So, I suppose a popular template invoked with a low enough cardinality of parameter/context bundles would trigger traffic spikes upon invalidation. The traffic would come from either (a) refreshLinks jobs or (b) page views to backlink pages that got purged via htmlCacheUpdate jobs.
This would be even worse for a ~1Mb key (instead of just ~200Kb). Is there any ETA on 10GB link upgrades?
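
The key sharing described above can be sketched roughly as follows. This is only an illustration: the function name, the key prefix, and the hashing scheme are hypothetical and are not MediaWiki's actual key format. The point is that a key derived only from the template and its arguments is identical across pages, so many page parses can hit the same memcached entry.

```python
import hashlib
import json


def subframe_cache_key(template: str, params: dict) -> str:
    """Build a cache key that depends only on the template and its arguments.

    Hypothetical scheme for illustration; not MediaWiki's actual key format.
    """
    # Sort parameters so that argument order does not change the key.
    canonical = json.dumps({"template": template, "params": params}, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"preprocess-subframe:{digest}"


# The page title is not part of the key, so two pages that invoke the
# template identically map to the same cache entry.
key_page_a = subframe_cache_key("Infobox", {"name": "Foo", "type": "city"})
key_page_b = subframe_cache_key("Infobox", {"type": "city", "name": "Foo"})
# A different parameter bundle yields a different key.
key_other = subframe_cache_key("Infobox", {"name": "Bar", "type": "city"})
```

With few distinct parameter bundles, most parses land on a handful of keys, which is consistent with a single large key saturating one host's NIC.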

Tue, Apr 7, 9:27 AM · Performance-Team, Operations
elukey triaged T249593: Move systemd timer from an-coord1001 to an-launcher1001 as High priority.
Tue, Apr 7, 8:24 AM · Analytics-Kanban, Analytics
elukey added a comment to T240230: Run a script to check REFINE_FAILED flags daily.

Just tested the deployed refinery jars:

Tue, Apr 7, 7:55 AM · Analytics-Kanban, User-Elukey, Analytics
elukey added a comment to T246343: Service implementation on wdqs200[7-8].codfw.wmnet.
  • data+categories jnl files copied for both hosts via cookbook
  • restarted nginx to pick up new mapping for categories (after David's suggestion)
  • 2007's lag seems ok, 2008 still catching up
  • both hosts are not pooled in LVS/Pybal
Tue, Apr 7, 7:33 AM · Discovery-Search, Wikidata, Wikidata-Query-Service
elukey updated the task description for T242301: (Need by: TBD) codfw: rack/setup/install wdqs200[7-8].codfw.wmnet.
Tue, Apr 7, 7:30 AM · ops-codfw, Operations

Mon, Apr 6

elukey added a comment to T244506: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet.

These are failing during install. @elukey can you verify the raid configuration please
Failed to partition the selected disk

This probably happened because there are too many (primary) partitions in the partition table.
Mon, Apr 6, 5:09 PM · Operations, Analytics, ops-eqiad
elukey moved T244499: Upgrade the Hadoop test cluster to BigTop from Next Up to Paused on the Analytics-Kanban board.
Mon, Apr 6, 2:36 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey moved T247082: Upgrade AMD ROCm to latest upstream from Next Up to In Progress on the Analytics-Kanban board.
Mon, Apr 6, 2:36 PM · Analytics-Kanban, Analytics
elukey moved T249495: Upgrade to Superset 0.36.0 from Next Up to In Progress on the Analytics-Kanban board.
Mon, Apr 6, 2:36 PM · Patch-For-Review, Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey added a project to T247082: Upgrade AMD ROCm to latest upstream: Analytics-Kanban.
Mon, Apr 6, 2:36 PM · Analytics-Kanban, Analytics
elukey added a comment to T211706: Superset Updates .
Mon, Apr 6, 1:40 PM · Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey committed rLPRI3ab92b7fab0e: Add fake kerberos keytabs for stat1008 (authored by elukey).
Add fake kerberos keytabs for stat1008
Mon, Apr 6, 11:29 AM
elukey committed rLPRI6770593374dd: Add fake kerberos keytabs for Superset hosts (authored by elukey).
Add fake kerberos keytabs for Superset hosts
Mon, Apr 6, 11:18 AM
elukey merged T249405: Review error in table visualization with Superset 0.36.0 into T249495: Upgrade to Superset 0.36.0.
Mon, Apr 6, 10:37 AM · Patch-For-Review, Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey merged task T249405: Review error in table visualization with Superset 0.36.0 into T249495: Upgrade to Superset 0.36.0.
Mon, Apr 6, 10:37 AM · Analytics
elukey added a comment to T249495: Upgrade to Superset 0.36.0.

Opened https://github.com/apache/incubator-superset/issues/9468 to upstream to fix a bug found in 0.36.0rc3

Mon, Apr 6, 10:37 AM · Patch-For-Review, Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey created T249495: Upgrade to Superset 0.36.0.
Mon, Apr 6, 10:32 AM · Patch-For-Review, Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey added a comment to T248905: Add aklapper to analytics-privatedata-users.

@Aklapper have you tried https://turnilo.wikimedia.org/ or http://superset.wikimedia.org/ ? The latter allows you now to explore hive data via Presto (see SQLLab), I am wondering if it is sufficient for your use case. If so, it wouldn't require an explicit kerberos account (since you wouldn't access hadoop directly), let me know :)

Mon, Apr 6, 6:57 AM · Developer-Advocacy (Apr-Jun 2020), SRE-Access-Requests, Operations
elukey added a comment to T246882: commonswiki shard size grew more than 50G in eqiad and codfw.

Acked this morning warnings in icinga for commonswiki_content shard size for cloudelastic.

Mon, Apr 6, 6:37 AM · Discovery-Search (Current work), Elasticsearch, Discovery

Sat, Apr 4

elukey added a comment to T249405: Review error in table visualization with Superset 0.36.0.

Opened https://github.com/apache/incubator-superset/issues/9468

Sat, Apr 4, 5:28 PM · Analytics
elukey added a comment to T211706: Superset Updates .

https://www.preset.io/blog/2019-12-16-elasticsearch-in-superset/ looks very interesting..

Sat, Apr 4, 5:25 PM · Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey updated the task description for T249405: Review error in table visualization with Superset 0.36.0.
Sat, Apr 4, 4:42 PM · Analytics
elukey added a comment to T249405: Review error in table visualization with Superset 0.36.0.

In the 0.36.0 changelog I see https://github.com/apache/incubator-superset/pull/9006/commits that could be related, it would fit with what I am seeing.

Sat, Apr 4, 4:27 PM · Analytics
elukey added a parent task for T249405: Review error in table visualization with Superset 0.36.0: T211706: Superset Updates .
Sat, Apr 4, 4:22 PM · Analytics
elukey added a subtask for T211706: Superset Updates : T249405: Review error in table visualization with Superset 0.36.0.
Sat, Apr 4, 4:22 PM · Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey added a comment to T249405: Review error in table visualization with Superset 0.36.0.

Forgot to mention that the above is a Druid datasource. After some tests, the error seems to appear only when no data is returned for a query; I was able to successfully visualize some results. So the regression seems to be a minor one, affecting only the message displayed when no data is returned.

Sat, Apr 4, 4:21 PM · Analytics
elukey added a comment to T211706: Superset Updates .

Superset 0.36+ requires:

  • superset.app:create_app() in the ExecStart of the superset.service unit
  • DRUID_IS_ACTIVE = True in superset_config.py
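
A minimal sketch of the two settings above, assuming Superset is started through gunicorn from a systemd unit. The gunicorn path and bind address are hypothetical placeholders, and the helper only assembles the ExecStart command line as a string; it is not taken from the actual unit file.

```python
# superset_config.py fragment: keep Druid datasources enabled on 0.36+.
DRUID_IS_ACTIVE = True

# Application factory reference that gunicorn must be given on 0.36+.
SUPERSET_APP_FACTORY = "superset.app:create_app()"


def exec_start(gunicorn_path: str, bind: str) -> str:
    """Assemble an ExecStart value for superset.service.

    gunicorn_path and bind are hypothetical values chosen by the caller.
    """
    # The factory call is quoted so the parentheses reach gunicorn intact.
    return f'{gunicorn_path} --bind {bind} "{SUPERSET_APP_FACTORY}"'


unit_line = "ExecStart=" + exec_start("/usr/bin/gunicorn", "127.0.0.1:9080")
```
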
Sat, Apr 4, 10:26 AM · Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey updated the task description for T249405: Review error in table visualization with Superset 0.36.0.
Sat, Apr 4, 9:56 AM · Analytics
elukey updated the task description for T249405: Review error in table visualization with Superset 0.36.0.
Sat, Apr 4, 9:50 AM · Analytics
elukey created T249405: Review error in table visualization with Superset 0.36.0.
Sat, Apr 4, 8:43 AM · Analytics

Fri, Apr 3

elukey added a comment to T246755: Investigate Hadoop HDFS ACLs.

Another example that looks very good:

Fri, Apr 3, 9:36 AM · User-Elukey, Analytics
elukey added a comment to T246755: Investigate Hadoop HDFS ACLs.

Did a little test today:

Fri, Apr 3, 9:28 AM · User-Elukey, Analytics
elukey added a comment to T248905: Add aklapper to analytics-privatedata-users.

@Aklapper I checked T213780 and I see that the user in question doesn't have a Kerberos account, how are you guys accessing Eventlogging data?

Fri, Apr 3, 7:41 AM · Developer-Advocacy (Apr-Jun 2020), SRE-Access-Requests, Operations
elukey updated subscribers of T224454: Create an alert for high memcached bw usage.

@CDanis this is an old task that I opened, do you think that we could revamp it and use what you have in mind to detect bursts in bandwidth usage? It would make a big difference in managing memcached..

Fri, Apr 3, 7:06 AM · observability, Performance-Team (Radar), User-Elukey, serviceops, Operations
elukey closed T248962: Occasional NIC Tx bandwidth saturation for mc1027 as Resolved.
Fri, Apr 3, 7:04 AM · Performance-Team, Operations
elukey added a comment to T248962: Occasional NIC Tx bandwidth saturation for mc1027 .

@Krinkle @aaron Can you give me some hints about how to figure out the key's purpose? It feels like a popular template change in nlwiki, but I'd like to triple check. We are still seeing saturation problems.. :(

Fri, Apr 3, 6:48 AM · Performance-Team, Operations
elukey added a comment to T249309: Eqiad: C6 mgmt switch down .

Interesting that ganeti1011's mgmt interface recovered, but not the others. Adding dcops to see if we can schedule in the next days/weeks a check of msw-c6-eqiad.

Fri, Apr 3, 6:34 AM · ops-eqiad, netops, Operations
elukey added a project to T249309: Eqiad: C6 mgmt switch down : ops-eqiad.
Fri, Apr 3, 6:33 AM · ops-eqiad, netops, Operations
elukey added a comment to T249309: Eqiad: C6 mgmt switch down .

On msw1 I see all events like the following, starting at 5:40 UTC:

Fri, Apr 3, 6:14 AM · ops-eqiad, netops, Operations

Thu, Apr 2

elukey added a comment to T249059: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit.

Reason for access: As part of our work for CheckUser and IP masking projects, we need to make sure our DB queries are suitable for production, and would need a place to properly test potentially high-load queries with production data outside of the production cluster itself. It's my understanding that the place to do these queries is stat1006 which is in the analytics group.

Thu, Apr 2, 2:55 PM · Patch-For-Review, Anti-Harassment, Operations, SRE-Access-Requests
elukey moved T240230: Run a script to check REFINE_FAILED flags daily from In Code Review to In Progress on the Analytics-Kanban board.
Thu, Apr 2, 2:27 PM · Analytics-Kanban, User-Elukey, Analytics
elukey moved T248980: Move netflow to TLS encryption/authentication via librdkafka from In Progress to Done on the Analytics-Kanban board.
Thu, Apr 2, 2:27 PM · Analytics-Kanban, netops, Analytics, Operations
elukey set Final Story Points to 5 on T248980: Move netflow to TLS encryption/authentication via librdkafka.
Thu, Apr 2, 2:26 PM · Analytics-Kanban, netops, Analytics, Operations
elukey added a comment to T248980: Move netflow to TLS encryption/authentication via librdkafka.

For the moment I am happy with TLS encryption only, since we'll probably move to kerberos authentication soon and it doesn't make much sense to do TLS client auth first.
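
As a rough sketch of what "TLS encryption only" means at the librdkafka level: the client is told to speak TLS and which CA bundle to trust, while the client certificate and key properties are left unset. The CA path below is a hypothetical placeholder, and the "global, key, value" line format is my assumption about how pmacct passes properties to librdkafka, so check it against the pmacct documentation before use.

```python
def tls_encryption_only(ca_path: str) -> dict:
    """librdkafka properties for TLS encryption without client authentication."""
    return {
        # Encrypt traffic between the producer and the brokers.
        "security.protocol": "ssl",
        # CA bundle used to verify the broker certificates.
        "ssl.ca.location": ca_path,
        # No ssl.certificate.location / ssl.key.location: the client does
        # not present a certificate, so there is no TLS client auth.
    }


def render_global_lines(settings: dict) -> list:
    """Render properties as 'global, key, value' lines (assumed pmacct format)."""
    return [f"global, {key}, {value}" for key, value in settings.items()]


# Hypothetical CA bundle path, for illustration only.
lines = render_global_lines(tls_encryption_only("/etc/ssl/certs/example-ca.pem"))
```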

Thu, Apr 2, 2:26 PM · Analytics-Kanban, netops, Analytics, Operations
elukey closed T249103: Analytics Kerberos Welcome Email contains hostname typo as Resolved.

Fixed, sorry for the trouble. Really appreciated the feedback :)

Thu, Apr 2, 6:05 AM · Analytics

Wed, Apr 1

elukey moved T248980: Move netflow to TLS encryption/authentication via librdkafka from Next Up to In Progress on the Analytics-Kanban board.
Wed, Apr 1, 4:24 PM · Analytics-Kanban, netops, Analytics, Operations
elukey triaged T248980: Move netflow to TLS encryption/authentication via librdkafka as High priority.
Wed, Apr 1, 4:24 PM · Analytics-Kanban, netops, Analytics, Operations
elukey added a comment to T248482: Requesting access to analytics for ItamarWMDE.

@elukey Should this also provide me with access to hue.wikimedia.org?

Wed, Apr 1, 1:40 PM · Analytics, Operations, SRE-Access-Requests
elukey added a comment to T248980: Move netflow to TLS encryption/authentication via librdkafka.

We could start with TLS authentication only, with:

Wed, Apr 1, 11:50 AM · Analytics-Kanban, netops, Analytics, Operations
elukey closed T248482: Requesting access to analytics for ItamarWMDE as Resolved.
Wed, Apr 1, 11:45 AM · Analytics, Operations, SRE-Access-Requests
elukey added a comment to T248482: Requesting access to analytics for ItamarWMDE.
elukey@krb1001:~$ sudo manage_principals.py create itamar --email_address=itamar.givon@wikimedia.de
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to itamar.givon@wikimedia.de
Wed, Apr 1, 11:42 AM · Analytics, Operations, SRE-Access-Requests
elukey closed T248498: Requesting access to analytics for tarrow as Resolved.
Wed, Apr 1, 11:17 AM · Analytics, Patch-For-Review, SRE-Access-Requests, Operations
elukey added a comment to T248498: Requesting access to analytics for tarrow.
elukey@krb1001:~$ sudo manage_principals.py create tarrow --email_address=thomas.arrow_ext@wikimedia.de
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to thomas.arrow_ext@wikimedia.de
Wed, Apr 1, 11:07 AM · Analytics, Patch-For-Review, SRE-Access-Requests, Operations
elukey added a comment to T249078: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub).

Hey Diego, not sure if you have seen https://docs.google.com/document/d/1r-oqMXViWvQCqsYz0qzezZBWpip8LvkvCGF6GivFB_8/edit# but Andrew is working on a solution that will deploy Anaconda to all the Worker nodes, so after that we shouldn't have any more package problems. The MW-specific ones are probably simple to install; the first two should be covered by Anaconda in theory.

Wed, Apr 1, 6:07 AM · Scoring-platform-team, ORES, Analytics-Cluster, Analytics, Research

Tue, Mar 31

elukey created T248980: Move netflow to TLS encryption/authentication via librdkafka.
Tue, Mar 31, 1:36 PM · Analytics-Kanban, netops, Analytics, Operations
elukey added a comment to T248865: Move netflow data to Eventgate Analytics.

Got it, so basically do all the process of creating the schemas etc.. but push data from pmacct directly to the kafka topic bypassing Eventgate. The fact that we'll have to have something ad-hoc is not appealing to me, the goal of this task was/is to avoid any special settings for netflow, but it is probably the best compromise (and then later on it should be easy to move to Eventgate validation / kafka topic creation if we see any issue).

Tue, Mar 31, 10:01 AM · netops, Analytics, Operations
elukey updated the task description for T248962: Occasional NIC Tx bandwidth saturation for mc1027 .
Tue, Mar 31, 9:38 AM · Performance-Team, Operations
elukey updated subscribers of T248962: Occasional NIC Tx bandwidth saturation for mc1027 .
Tue, Mar 31, 9:37 AM · Performance-Team, Operations
elukey triaged T248962: Occasional NIC Tx bandwidth saturation for mc1027 as High priority.
Tue, Mar 31, 9:37 AM · Performance-Team, Operations

Mon, Mar 30

elukey created T248890: MW Memcached get hit ratio trend over the past months.
Mon, Mar 30, 5:41 PM · Performance-Team, Operations
elukey added a comment to T245569: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78].

After T245810 we don't really need a custom partman recipe anymore, let's use partman/raid10-4dev.cfg (already configured in puppet for all druid nodes).

Mon, Mar 30, 4:49 PM · Analytics, ops-eqiad, Operations, DC-Ops
elukey updated the task description for T245569: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78].
Mon, Mar 30, 4:49 PM · Analytics, ops-eqiad, Operations, DC-Ops
elukey added a comment to T248865: Move netflow data to Eventgate Analytics.

Hm, if you just want to get the Refine side working, all you really need is for the data in Kafka to look right. EventGate gets you schema validation (and maybe a couple of other things, but not much else). If you are the only producer of this data, producing directly to Kafka is totally fine too.
If it is easier to POST from pmacct than use a Kafka Producer, then by all means use eventgate-analytics. But, for the rest of the pipeline, all you need is a proper schema and the data in Kafka.

Mon, Mar 30, 3:59 PM · netops, Analytics, Operations
elukey created T248865: Move netflow data to Eventgate Analytics.
Mon, Mar 30, 2:38 PM · netops, Analytics, Operations
elukey moved T240230: Run a script to check REFINE_FAILED flags daily from Next Up to In Code Review on the Analytics-Kanban board.
Mon, Mar 30, 2:22 PM · Analytics-Kanban, User-Elukey, Analytics
elukey claimed T240230: Run a script to check REFINE_FAILED flags daily.
Mon, Mar 30, 2:22 PM · Analytics-Kanban, User-Elukey, Analytics

Fri, Mar 27

elukey added a comment to T245179: Add SWAP profile to stat1005 .

Ok so next steps are:

Fri, Mar 27, 5:22 PM · Patch-For-Review, Analytics-Kanban, User-Elukey, Research, Analytics
elukey claimed T230724: Upgrade all SWAP users to JupyterLab 1.0.
Fri, Mar 27, 5:16 PM · User-Elukey, Patch-For-Review, Analytics-SWAP, Analytics, Product-Analytics
elukey added a comment to T230724: Upgrade all SWAP users to JupyterLab 1.0.

As FYI stat100[5,8] (Buster nodes) are running with JupyterLab 1.1.0, if anybody wants to test and provide feedback I'd be happy :)

Fri, Mar 27, 5:16 PM · User-Elukey, Patch-For-Review, Analytics-SWAP, Analytics, Product-Analytics
elukey added a comment to T243521: Hadoop Hardware Orders FY2019-2020.

I had a chat with Willy about racking requirements of the new hosts (16 refreshed + GPU ones). We currently have this configuration:

Fri, Mar 27, 4:37 PM · Analytics-Kanban, Analytics-Cluster, Analytics
elukey added a subtask for T244211: Analytics Hardware for Fiscal Year 2019/2020: T243521: Hadoop Hardware Orders FY2019-2020.
Fri, Mar 27, 4:22 PM · Analytics
elukey added a parent task for T243521: Hadoop Hardware Orders FY2019-2020: T244211: Analytics Hardware for Fiscal Year 2019/2020.
Fri, Mar 27, 4:22 PM · Analytics-Kanban, Analytics-Cluster, Analytics
elukey removed a subtask for T243521: Hadoop Hardware Orders FY2019-2020: T244211: Analytics Hardware for Fiscal Year 2019/2020.
Fri, Mar 27, 4:22 PM · Analytics-Kanban, Analytics-Cluster, Analytics
elukey removed a parent task for T244211: Analytics Hardware for Fiscal Year 2019/2020: T243521: Hadoop Hardware Orders FY2019-2020.
Fri, Mar 27, 4:22 PM · Analytics
elukey added a subtask for T243521: Hadoop Hardware Orders FY2019-2020: T244211: Analytics Hardware for Fiscal Year 2019/2020.
Fri, Mar 27, 4:21 PM · Analytics-Kanban, Analytics-Cluster, Analytics
elukey added a parent task for T244211: Analytics Hardware for Fiscal Year 2019/2020: T243521: Hadoop Hardware Orders FY2019-2020.
Fri, Mar 27, 4:21 PM · Analytics
elukey added a comment to T245179: Add SWAP profile to stat1005 .

0.3.0

Removed support for PySpark and Spark R in Toree (use specific kernels)

Fri, Mar 27, 12:34 PM · Patch-For-Review, Analytics-Kanban, User-Elukey, Research, Analytics
elukey added a comment to T231517: Investigate and fix GC issues on cloudelastic machines.

The issue seems intermittent, and since there is a big re-index in progress we cannot really do much. The long-term fix should be to add more nodes to the cluster, which is waiting for a rack/setup/deploy after T233720.

Fri, Mar 27, 8:35 AM · Discovery-Search

Thu, Mar 26

elukey added a comment to T244499: Upgrade the Hadoop test cluster to BigTop.

The first attempt of rollback was a disaster, I wasn't able to restore HDFS to its previous state.

Thu, Mar 26, 6:18 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey added a comment to T248574: GPUs are not correctly handling multitasking .

We are still not sure what the issue is, but we decided to upgrade stat1008 according to T247082 to have the latest upstream ROCm version before contacting the devs.

Thu, Mar 26, 1:09 PM · Analytics
elukey added a comment to T241650: Investigate sporadic failures in oozie hive actions due to Kerberos auth.

Reopened, since today a webrequest load failed due to the same issue..

Thu, Mar 26, 9:12 AM · Patch-For-Review, Analytics-Kanban, User-Elukey, Analytics
elukey reopened T241650: Investigate sporadic failures in oozie hive actions due to Kerberos auth as "Open".
Thu, Mar 26, 9:12 AM · Patch-For-Review, Analytics-Kanban, User-Elukey, Analytics

Wed, Mar 25

elukey lowered the priority of T231517: Investigate and fix GC issues on cloudelastic machines from High to Medium.
Wed, Mar 25, 5:07 PM · Discovery-Search
elukey added a comment to T231517: Investigate and fix GC issues on cloudelastic machines.

After the above changes the GC timings are way better, no more flapping alerts.

Wed, Mar 25, 5:07 PM · Discovery-Search

Tue, Mar 24

elukey added a comment to T244499: Upgrade the Hadoop test cluster to BigTop.

Working on the rollback in https://etherpad.wikimedia.org/p/analytics-bigtop

Tue, Mar 24, 3:52 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey removed a project from T245833: Enable layered data-access and sharing for a new form of collaboration: Traffic.
Tue, Mar 24, 12:58 PM · User-Elukey, Operations, WMF-Legal, Research, Analytics

Mon, Mar 23

elukey added a comment to T247510: Refine + EventLoggingSchemaLoader should use api.svc instead of meta.wikimedia.org directly..

If we want to make this switch, api.svc.eqiad.wmnet will need to be whitelisted in the Analytics VLAN firewall rules, and we'd have a problem if/when that IP changes. This task was created after a change in Webproxies (migration to buster) caused some problems in Refine IIRC, but maybe we are not adding much resiliency by doing this switch? Nothing against it, just thinking out loud :)

Mon, Mar 23, 3:47 PM · Analytics
elukey closed T248314: SparkR Kernel not starting in Jupyter & JupyterLab as Resolved.

@EYener we'd prefer to do it ourselves, reporting the host and username is sufficient for us to find the problem usually :) thanks!

Mon, Mar 23, 3:12 PM · Analytics
elukey added a comment to T248314: SparkR Kernel not starting in Jupyter & JupyterLab.

Hi! Do you mean notebook1003? I don't see recent activities in there, but I have restarted your notebook just in case you want to retry. But maybe you are trying from a different host? Would help to check if there are meaningful logs, let me know :)

Mon, Mar 23, 2:33 PM · Analytics
elukey added a comment to T244499: Upgrade the Hadoop test cluster to BigTop.
Mon, Mar 23, 11:29 AM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics

Fri, Mar 20

elukey added a comment to T245833: Enable layered data-access and sharing for a new form of collaboration.

I had a chat with Miriam about this:

Fri, Mar 20, 3:45 PM · User-Elukey, Operations, WMF-Legal, Research, Analytics
elukey added a comment to T244499: Upgrade the Hadoop test cluster to BigTop.

Opened https://issues.apache.org/jira/browse/BIGTOP-3330

Fri, Mar 20, 2:10 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey added a comment to T244499: Upgrade the Hadoop test cluster to BigTop.

Today while re-installing the new version of oozie/hadoop packages I experienced a problem that I forgot to fix, namely:

Fri, Mar 20, 1:59 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey added a project to T244499: Upgrade the Hadoop test cluster to BigTop: Analytics-Kanban.
Fri, Mar 20, 1:15 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey lowered the priority of T244499: Upgrade the Hadoop test cluster to BigTop from High to Medium.
Fri, Mar 20, 1:15 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey added a comment to T244499: Upgrade the Hadoop test cluster to BigTop.

The rollback of HDFS at this stage should be easy; the main question mark is the oozie/hive db schemas. We have been running the Hadoop cluster with the new version of HDFS for some days, but hive and oozie were upgraded (together with their db schemas). During this timeframe oozie jobs were run, and hive changes were made to the metastore. For the hadoop test cluster it might be a simple matter of reverting to a known good db state (we have backups), but if this happens in production, what would be the strategy?

Fri, Mar 20, 1:12 PM · Analytics-Kanban, User-Elukey, Analytics-Cluster, Analytics
elukey awarded T248161: Add a deprecated flag to admin groups a Love token.
Fri, Mar 20, 10:54 AM · User-jbond, Operations, Puppet
elukey claimed T247884: Add Prometheus Presto metrics and dashboards.
Fri, Mar 20, 7:35 AM · Patch-For-Review, Analytics, Analytics-Kanban, User-Elukey
elukey moved T247055: Upgrade jupyterhub-systemdspawner from 0.9.9 to 0.13 to allow the use of systemd custom slices from Next Up to Done on the Analytics-Kanban board.
Fri, Mar 20, 7:34 AM · Analytics-Kanban, Analytics