elukey (Luca Toscano)
Site Reliability Engineer - Analytics/Data engineering

User Details

User Since
Jan 5 2016, 9:54 PM (279 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Yesterday

elukey added a project to T282589: Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17): Analytics.
Wed, May 12, 4:52 PM · Analytics, CommRel-Specialists-Support (Apr-Jun-2021), SRE, LDAP-Access-Requests
elukey created P15952 (An Untitled Masterwork).
Wed, May 12, 3:56 PM
elukey added a comment to T278723: ORES deployment - Spring 2021.

https://ores-beta.wmflabs.org/v3/scores/viwiki/123125/articletopic seems working now :)

Wed, May 12, 2:41 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
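
The check above can be repeated for other wikis, revisions and models. A minimal sketch, assuming the path layout seen in the URL from the comment (`/v3/scores/<wiki>/<rev_id>/<model>`); the helper name `score_url` is hypothetical and the sketch only builds the URL, it does not call the service:

```python
from urllib.parse import quote

# Base URL taken from the comment above (the Beta instance).
ORES_BETA = "https://ores-beta.wmflabs.org"


def score_url(wiki, rev_id, model, base=ORES_BETA):
    """Build a v3 scores URL of the form /v3/scores/<wiki>/<rev_id>/<model>.

    Hypothetical helper: it mirrors the URL pattern from the comment,
    nothing more.
    """
    return f"{base}/v3/scores/{quote(wiki)}/{int(rev_id)}/{quote(model)}"


print(score_url("viwiki", 123125, "articletopic"))
```

The resulting URL can be opened in a browser or passed to any HTTP client to re-test after a model rebuild.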
elukey updated subscribers of T282664: Add joanna_borun to WMF-NDA.

Trail followed:

Wed, May 12, 12:52 PM · WMF-NDA-Requests
elukey added a comment to T278423: Upgrade the Hadoop masters to Debian Buster.

Ok, here's my new plan, including draining the cluster and using safemode to take a stable fsimage. If this looks good to you @elukey we can pick a day at least a week away so that we can communicate the maintenance. We could do this without doing maintenance but I'd appreciate the safety and the opportunity to learn about safemode.

Wed, May 12, 9:06 AM · Analytics-Kanban, Analytics-Clusters
elukey added a comment to T278723: ORES deployment - Spring 2021.

IIUC the next steps should be to run something like T212818#4865070 for drafttopic, then updating the related submodule in the deploy repo and then re-test in Beta.

Wed, May 12, 7:02 AM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES

Tue, May 11

elukey added a comment to T280107: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia.

Hi - I am trying to make this happen.
Data for the wikidata project is very big (many edits, plus the itemquality model on top of the other ones). Do you need it, or can I skip exporting this project? (The export would then cover all models for all edits of all projects except enwiki and wikidatawiki.)
Thanks

Tue, May 11, 3:32 PM · Analytics-Kanban, artificial-intelligence, editquality-modeling, ORES, Machine-Learning-Team, Analytics
elukey added a comment to T278723: ORES deployment - Spring 2021.

@Halfak quick check in to understand the status of the fix (and if my team should follow up to fix the regression etc..) :)

Tue, May 11, 6:09 AM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES

Mon, May 10

elukey closed T276791: Configure the Hadoop cluster to use the GPUs available on some workers as Resolved.

This is done! With T277062 Aiko and Miriam were able to run tensorflow-rocm only on GPU nodes :)

Mon, May 10, 4:31 PM · Analytics, Machine-Learning-Team
elukey closed T280262: Decommission analytics-tool1001 and all the CDH leftovers as Resolved.
Mon, May 10, 3:52 PM · Analytics-Kanban, Patch-For-Review, Analytics
elukey added a comment to T278192: Install Istio on ml-serve cluster.
FROM docker-registry.wikimedia.org/golang:1.13-3 as build
Mon, May 10, 7:04 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing
elukey reopened T281327: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet as "Open".

Still reported down :(

Mon, May 10, 6:39 AM · SRE, ops-codfw, Discovery-Search, DC-Ops
elukey added a comment to T281711: Add approvals on Github for all the ORES-related repositories.

@Legoktm we just added a step for GitHub repositories that end up in production, to ensure that a member of the ML team reviews each patch. It is a compromise to avoid patches being merged without us noticing (especially self-merges).

Mon, May 10, 6:29 AM · Machine-Learning-Team, ORES

Sat, May 8

elukey added a comment to T282185: Add password reset to kerberos manage_principals.py.

Before closing - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Manage_principals_and_keytabs needs to be updated :)

Sat, May 8, 10:29 AM · Patch-For-Review, Analytics-Kanban, Analytics
elukey moved T281792: Yarn NM stopping due to failures while creating native threads from In Progress to Done on the Analytics-Kanban board.
Sat, May 8, 7:48 AM · Analytics-Kanban, Analytics
elukey added a comment to T281792: Yarn NM stopping due to failures while creating native threads.

Second day without any error!

Sat, May 8, 7:48 AM · Analytics-Kanban, Analytics
elukey moved T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop from In Code Review to Done on the Analytics-Kanban board.
Sat, May 8, 7:47 AM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund thanks a lot for all the work! Have a great weekend too :)

Sat, May 8, 7:47 AM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts

Fri, May 7

elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund +1 for the image credit thanks!

Fri, May 7, 4:58 PM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund it seems that there is nothing outstanding anymore, we can publish!!

Fri, May 7, 3:39 PM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund thanks a lot for the extra pass, I'll try to resolve the last open comments today so you'll be free to publish anytime :)

Fri, May 7, 2:59 PM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T281711: Add approvals on Github for all the ORES-related repositories.

I took some extra steps:

Fri, May 7, 8:38 AM · Machine-Learning-Team, ORES
elukey moved T281792: Yarn NM stopping due to failures while creating native threads from Next Up to In Progress on the Analytics-Kanban board.
Fri, May 7, 6:44 AM · Analytics-Kanban, Analytics
elukey claimed T281792: Yarn NM stopping due to failures while creating native threads.
Fri, May 7, 6:44 AM · Analytics-Kanban, Analytics
elukey added a comment to T281792: Yarn NM stopping due to failures while creating native threads.

No native thread errors registered in the past hours. It looks like we are out of the woods, but I'll wait until Monday before declaring victory.

Fri, May 7, 6:43 AM · Analytics-Kanban, Analytics

Thu, May 6

elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund if you have ideas we are all ears, any elephant-related image could be ok :)

Thu, May 6, 7:48 PM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@Ottomata the image looks so good! :(

Thu, May 6, 7:47 PM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund thanks a lot for the review! I accepted all the comments and left two questions for you, feel free to solve them in case everything looks good.

Thu, May 6, 7:42 PM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T281711: Add approvals on Github for all the ORES-related repositories.

I cannot see "Settings" in my GH view, so I guess I need more permissions to be able to add the rules. We should:

Thu, May 6, 5:47 PM · Machine-Learning-Team, ORES
elukey moved T280998: Scap deploy for ORES reports success even when uwsgi fails to start up from Unorganized to SRE on the Machine-Learning-Team board.
Thu, May 6, 5:45 PM · Scap, ORES, Machine-Learning-Team
elukey moved T278723: ORES deployment - Spring 2021 from Active Tasks to SRE on the Machine-Learning-Team board.
Thu, May 6, 5:45 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey moved T281713: Review pre-cached wikis for ORES from Unorganized to SRE on the Machine-Learning-Team board.
Thu, May 6, 5:45 PM · Machine-Learning-Team, ORES
elukey moved T281711: Add approvals on Github for all the ORES-related repositories from Unorganized to SRE on the Machine-Learning-Team board.
Thu, May 6, 5:45 PM · Machine-Learning-Team, ORES
elukey moved T281495: Restructure ORES labs redis puppet role from Unorganized to SRE on the Machine-Learning-Team board.
Thu, May 6, 5:45 PM · Puppet, Machine-Learning-Team, ORES
elukey added a comment to T281792: Yarn NM stopping due to failures while creating native threads.

After a chat with Joseph we decided to proceed one change at a time:

Thu, May 6, 1:49 PM · Analytics-Kanban, Analytics
elukey added a comment to T278192: Install Istio on ml-serve cluster.

More info about what binaries are executed in the minikube test that I made:

Thu, May 6, 11:43 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing
elukey added a comment to T277015: Evaluate possible solutions to backup Analytics Hadoop's HDFS data.

@jcrespo quick question - if we want to move forward with this, do we need hardware planned for next fiscal? I know that the use case is very high level and there are a lot of unclear points, so any inputs will be appreciated :)

Thu, May 6, 9:40 AM · Analytics-Clusters, Data-Persistence-Backup
elukey added a comment to T281621: elastic2033 without bootable devices available.

@RKemper I restarted the failed prometheus units on the node to clear icinga, but puppet is still disabled. Can you enable it when you have a moment, if that's ok? (I didn't want to do it in case you were working on it)

Thu, May 6, 6:13 AM · Discovery-Search (Current work), SRE, Discovery, ops-codfw
elukey added a comment to T281621: elastic2033 without bootable devices available.

@Papaul what did you do to fix it?? (curious)

Thu, May 6, 6:10 AM · Discovery-Search (Current work), SRE, Discovery, ops-codfw
elukey added a comment to T278137: Migrate eventlog1002 to buster.

+1

Thu, May 6, 6:05 AM · Analytics-Kanban, Analytics-Clusters

Wed, May 5

elukey added a comment to T278723: ORES deployment - Spring 2021.

The pipelines are documented/automated in the relevant Makefiles. E.g. if you install the dependencies for https://github.com/wikimedia/drafttopic, delete the old viwiki models and run make models it should rebuild the relevant models.

Wed, May 5, 4:07 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T278723: ORES deployment - Spring 2021.

@Halfak what is the likelihood that other models have the same issues, but we haven't seen errors yet due to not enough requests ending up in ERRORS?

Wed, May 5, 4:02 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T278723: ORES deployment - Spring 2021.

In my opinion we should roll back, work on a patch and roll out again once we are ok with it, after more testing. Thoughts?

Wed, May 5, 3:57 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T278723: ORES deployment - Spring 2021.

@Halfak I see mostly 'model_names': ['reverted', 'articletopic'] for viwiki in codfw..

Wed, May 5, 3:54 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T278723: ORES deployment - Spring 2021.
elukey@ores2001:~$ sudo journalctl -u celery-ores-worker.service  | grep Warning
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.ensemble.gradient_boosting module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   warnings.warn(message, FutureWarning)
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.20.3 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   UserWarning)
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.tree.tree module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.tree. Anything that cannot be imported from sklearn.tree is now part of the private API.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   warnings.warn(message, FutureWarning)
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.20.3 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   UserWarning)
Wed, May 5, 3:30 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
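
The `UserWarning` lines above say which estimators were pickled with one scikit-learn version and loaded with another. A minimal sketch for summarizing them from journal output, assuming the message wording stays exactly as in the log above; the function name `summarize_mismatches` is hypothetical:

```python
import re
from collections import Counter

# Matches the scikit-learn unpickle warning text as it appears in the log above.
UNPICKLE_RE = re.compile(
    r"Trying to unpickle estimator (?P<estimator>\w+) "
    r"from version (?P<pickled>\d+(?:\.\d+)*) "
    r"when using version (?P<running>\d+(?:\.\d+)*)"
)


def summarize_mismatches(lines):
    """Count (estimator, pickled_version, running_version) triples in log lines."""
    counts = Counter()
    for line in lines:
        match = UNPICKLE_RE.search(line)
        if match:
            counts[(match["estimator"], match["pickled"], match["running"])] += 1
    return counts


# Message text copied from the journal output above (log prefixes omitted).
sample = [
    "UserWarning: Trying to unpickle estimator GradientBoostingClassifier "
    "from version 0.20.3 when using version 0.22.1. This might lead to "
    "breaking code or invalid results. Use at your own risk.",
    "UserWarning: Trying to unpickle estimator DecisionTreeRegressor "
    "from version 0.20.3 when using version 0.22.1. This might lead to "
    "breaking code or invalid results. Use at your own risk.",
    "FutureWarning: The sklearn.tree.tree module is  deprecated in version 0.22",
]

for (estimator, pickled, running), n in summarize_mismatches(sample).items():
    print(f"{estimator}: pickled with {pickled}, loaded with {running} ({n}x)")
```

In practice the lines would come from the `journalctl` output rather than a hard-coded list; the sketch only reports the mismatch, it does not say whether a given model is actually broken.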
elukey added a comment to T278723: ORES deployment - Spring 2021.

There seems to be a regression in scores errored mostly in codfw (ORES is active/active), so some traffic is impacted. This is an example:

Wed, May 5, 3:27 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T278192: Install Istio on ml-serve cluster.

Joe gave me a nice pointer in production-images, namely the loki multi-stage container example. Basically the idea is to build go binaries in one container first, then use them for the official Docker image to push to the registry. If we find a way to build istio (which in theory shouldn't be super difficult) we should also be able to re-use Docker images like https://github.com/istio/istio/blob/master/pilot/docker/Dockerfile.proxyv2 relatively easily (same thing for Knative etc.)

Wed, May 5, 10:20 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing
elukey updated the task description for T281135: codfw: Relocate servers in 10G racks.
Wed, May 5, 9:02 AM · serviceops, DBA, SRE, ops-codfw
elukey added a comment to T281917: Could not find class ::profile::swap for an-test-client1001.eqiad.wmnet.

@razzi each check has its own interval, check_puppet_run_changes might run every X hours so it may be slow to update. If you want to get fresh results you can force a reschedule of the check via Icinga UI (you should find the option in the dropdown menu where Acknowledge etc.. options are).

Wed, May 5, 8:59 AM · Analytics-Clusters
elukey added a comment to T278423: Upgrade the Hadoop masters to Debian Buster.

Almost forgot - the procedure should also include T231067#6863800 :)

Wed, May 5, 8:55 AM · Analytics-Kanban, Analytics-Clusters
elukey moved T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop from In Progress to In Code Review on the Analytics-Kanban board.
Wed, May 5, 8:54 AM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund draft ready! I shared the gdoc with you and the Analytics team :)

Wed, May 5, 8:54 AM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts

Tue, May 4

elukey added a comment to T278192: Install Istio on ml-serve cluster.

Something interesting that I found today is: https://gcsweb.istio.io/gcs/istio-build/dev/1.6-alpha.3ddc57b6d1e15afebefd725e01c0dc7099f3f6dd/docker/

Tue, May 4, 5:23 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing
elukey added a comment to T278137: Migrate eventlog1002 to buster.

Yes let's fully decommission eventlog1002 once we are ok with 1003 :)

Tue, May 4, 1:30 PM · Analytics-Kanban, Analytics-Clusters
elukey updated the task description for T281792: Yarn NM stopping due to failures while creating native threads.
Tue, May 4, 6:07 AM · Analytics-Kanban, Analytics
elukey created T281792: Yarn NM stopping due to failures while creating native threads.
Tue, May 4, 6:04 AM · Analytics-Kanban, Analytics

Mon, May 3

elukey added a comment to T281713: Review pre-cached wikis for ORES.

@Pchelolo the summary is great! I can certainly work from the ML/Analytics side when needed, so let me know if we can start scoping out the problem (maybe another task?).

Mon, May 3, 5:50 PM · Machine-Learning-Team, ORES
elukey added a comment to T278723: ORES deployment - Spring 2021.

I totally agree about big deployments! It's been too long.

Mon, May 3, 4:56 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T281713: Review pre-cached wikis for ORES.

Very interestingly, the pre-caching stuff is what powers https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_revision_score. The scores are sent to kafka and then exposed, so I am not sure if we can turn this off. It is also a good thing to keep in mind for Lift Wing.

Mon, May 3, 4:21 PM · Machine-Learning-Team, ORES
elukey added a comment to T281713: Review pre-cached wikis for ORES.

+1 definitely, the right wording is popularity, I agree 100%

Mon, May 3, 3:27 PM · Machine-Learning-Team, ORES
elukey created T281713: Review pre-cached wikis for ORES.
Mon, May 3, 2:12 PM · Machine-Learning-Team, ORES
elukey created T281711: Add approvals on Github for all the ORES-related repositories.
Mon, May 3, 2:06 PM · Machine-Learning-Team, ORES
elukey added a comment to T279440: Data drifts between superset_production on an-coord1001 and db1108.

I would do it anyway since these are the dbs that we back up periodically, and it may take a while (namely months) to get everything set up and running and migrated. Since it is mostly my fault I can spend the time on it, but if the team thinks it is not worth it I can drop the ball and decline :)

Mon, May 3, 12:45 PM · Analytics-Kanban, Analytics
elukey awarded T276407: An End-to-End Image Classification Pipeline a Party Time token.
Mon, May 3, 10:29 AM · Research (FY2020-21-Research-April-June), Structured-Data-Backlog, MachineVision
elukey added a comment to T281621: elastic2033 without bootable devices available.

The other thing that may have happened is that the MBR was installed on only one of the two disks of the RAID1, so now nothing boots. IIRC PXE wasn't able to start either, otherwise I'd have proposed booting a rescue image to inspect the two disks.
All these failures smell like a major host problem..

Mon, May 3, 8:13 AM · Discovery-Search (Current work), SRE, Discovery, ops-codfw
elukey added a comment to T277133: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop.

@srodlund sorry for the lag, Joseph and I should have a draft this week :)

Mon, May 3, 7:21 AM · Analytics-Kanban, Analytics-Clusters, Technical-blog-posts
elukey added a comment to T243089: Spike. Try to ML models distributted in jupyter notebooks with dask.

To keep archives happy - we are already testing https://github.com/criteo/tf-yarn with Miriam and Aiko, which uses Skein behind the scenes.

Mon, May 3, 7:18 AM · Analytics
elukey added a comment to T278423: Upgrade the Hadoop masters to Debian Buster.

Alright, here's my plan @elukey, perhaps we can discuss this next week and if it looks good we can plan the maintenance.

Mon, May 3, 7:15 AM · Analytics-Kanban, Analytics-Clusters
elukey added a comment to T279440: Data drifts between superset_production on an-coord1001 and db1108.

I want to work on this! Is it ok to drop superset_production on db1108 in order to do this? If so, I think I'll be able to figure it out with some trial and error.

Mon, May 3, 6:35 AM · Analytics-Kanban, Analytics

Sat, May 1

elukey added a comment to T281621: elastic2033 without bootable devices available.

I left the host in the System Config panel so it will not keep trying to PXE, so it needs a power reset to start investigations :)

Sat, May 1, 7:39 AM · Discovery-Search (Current work), SRE, Discovery, ops-codfw
elukey created T281621: elastic2033 without bootable devices available.
Sat, May 1, 7:32 AM · Discovery-Search (Current work), SRE, Discovery, ops-codfw

Fri, Apr 30

elukey added a comment to T279440: Data drifts between superset_production on an-coord1001 and db1108.

@Ottomata @razzi I think that we should do this sooner rather than later, do you want me to do it or do you prefer to do it during May?

Fri, Apr 30, 3:09 PM · Analytics-Kanban, Analytics
elukey closed T278371: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned as Declined.

Let's revisit this if anything happens again, it seems a sporadic issue.

Fri, Apr 30, 3:07 PM · Analytics, SRE
elukey added a comment to T281316: WDCM_Sqoop_Clients.R fails from stat1004 (again).

I think we should be fine from now on, I wouldn't add more complexity to what we have :)

Fri, Apr 30, 2:46 PM · Analytics, User-GoranSMilovanovic, WMDE-Analytics-Engineering, Wikidata
elukey added a comment to T278723: ORES deployment - Spring 2021.

On paper we should have free memory available on Production nodes, but ideally the three changes outlined in the description could have been broken down into three separate deployments to have a better sense of what performance impact each change has. I know that there may be some interconnection between the jobs, and that now it would be a problem to break everything down, but please let's remember it next time. Big deployments are not great in general, I really prefer smaller ones :)

Fri, Apr 30, 1:35 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T281316: WDCM_Sqoop_Clients.R fails from stat1004 (again).

@GoranSMilovanovic sure! During the migration of the hosts where Hive Server/Metastore runs to Debian Buster, we encountered a lot of problems with the only available java lib for mysql, namely the one containing the org.mariadb.jdbc.Driver JDBC driver. We have now reverted to the old mysql driver, manually porting the missing debian packages from Stretch to Buster, and now sqoop needs to run without the extra --driver option. So the problems with this option came from us trying to figure out how to upgrade our systems following Debian best practices, but hopefully we should be good now (at least until Debian Bullseye, the new version, is out).

Fri, Apr 30, 10:00 AM · Analytics, User-GoranSMilovanovic, WMDE-Analytics-Engineering, Wikidata
elukey moved T280262: Decommission analytics-tool1001 and all the CDH leftovers from Next Up to Done on the Analytics-Kanban board.
Fri, Apr 30, 7:09 AM · Analytics-Kanban, Patch-For-Review, Analytics
elukey added a project to T280262: Decommission analytics-tool1001 and all the CDH leftovers: Analytics-Kanban.
Fri, Apr 30, 7:08 AM · Analytics-Kanban, Patch-For-Review, Analytics
elukey added a comment to T280262: Decommission analytics-tool1001 and all the CDH leftovers.

Everything looks good! Also dropped the hue_next database so it is less confusing when inspecting what we run on the various db nodes (basically we now have only the hue database).

Fri, Apr 30, 7:08 AM · Analytics-Kanban, Patch-For-Review, Analytics
elukey added a comment to T280262: Decommission analytics-tool1001 and all the CDH leftovers.

Plan is:

Fri, Apr 30, 6:50 AM · Analytics-Kanban, Patch-For-Review, Analytics
elukey added a comment to T276239: Try to move some new analytics worker nodes to different racks.

@Cmjohnson hi! Any news about the worker nodes?

Fri, Apr 30, 6:37 AM · Analytics-Radar, SRE, ops-eqiad

Thu, Apr 29

elukey added a comment to T281427: Re-add disk to an-worker1100.

Yep it takes a bit! If the datanode got the new config you'll see more data in the upcoming days :)

Thu, Apr 29, 6:36 PM · Analytics-Kanban, Analytics-Clusters
elukey added a comment to T278723: ORES deployment - Spring 2021.

@Halfak I'd ask, whenever you have a moment, for some details about the following points:

Thu, Apr 29, 6:33 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey closed T281316: WDCM_Sqoop_Clients.R fails from stat1004 (again) as Resolved.

No issues from our side, going to close, please reopen if necessary!

Thu, Apr 29, 4:10 PM · Analytics, User-GoranSMilovanovic, WMDE-Analytics-Engineering, Wikidata
elukey closed T281316: WDCM_Sqoop_Clients.R fails from stat1004 (again), a subtask of T281063: Wikidata Concepts Monitor: some datasets are empty, as Resolved.
Thu, Apr 29, 4:10 PM · User-GoranSMilovanovic, WMDE-Analytics-Engineering, Wikidata
elukey added a comment to T278723: ORES deployment - Spring 2021.

@Halfak beta seems unblocked for the moment, please check if there are other issues. Current problems live-patched that may require a better fix:

Thu, Apr 29, 6:00 AM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T281427: Re-add disk to an-worker1100.

@razzi you have the wrong slot, it is the 10th :)

Thu, Apr 29, 5:52 AM · Analytics-Kanban, Analytics-Clusters

Wed, Apr 28

elukey added a comment to T278723: ORES deployment - Spring 2021.

@Halfak nono I meant how to trigger the Failed to establish a new connection: [Errno 111] Connection refused problem (related IIUC to connections to localhost:6500 failing). Now that we have a proxy things should flow nicely, but not sure how to test it.

Wed, Apr 28, 4:57 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey added a comment to T278723: ORES deployment - Spring 2021.

Ok there is a basic nginx listening on localhost:6500 on deployment-ores01, @Halfak can you tell me how to repro the connection error highlighted in T278723#7031400?

Wed, Apr 28, 1:23 PM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
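
Before trying to reproduce the error end to end, it may help to confirm that something accepts TCP connections on the port at all. A minimal sketch, assuming only the `localhost:6500` address from the comment; `port_is_open` is a hypothetical helper and it checks connectivity only, not that the proxy forwards requests correctly:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    A refused connection (Errno 111 on Linux) or a timeout yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError and socket timeouts are OSError subclasses.
        return False


if __name__ == "__main__":
    # Address from the comment above; run this on the host in question.
    print(port_is_open("localhost", 6500))
```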
elukey added a comment to T278723: ORES deployment - Spring 2021.

Spent some time today trying to add the Envoy config to the ORES instance in Beta. All the production code assumes (rightfully) TLS + LVS IPs, so adapting it to Beta may not be possible without further puppet changes.

Wed, Apr 28, 10:35 AM · Patch-For-Review, Machine-Learning-Team, artificial-intelligence, drafttopic-modeling, articlequality-modeling, ORES
elukey closed T271136: Some Foundation clusters do not appear to support IPv6 as Resolved.

Added the remaining AAAA records for kafka-main200[2-5]!

Wed, Apr 28, 7:20 AM · SRE, IPv6, SRE-tools, User-jbond
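
The `kafka-main200[2-5]` shorthand above stands for four hosts. A minimal sketch that expands such a range and checks whether a name resolves over IPv6, assuming a single `[start-end]` group per pattern; `expand_hosts` and `has_aaaa` are hypothetical helpers, and the fully qualified domain to append is not given in the comment, so the lookup is left to the caller:

```python
import re
import socket

# One bracketed numeric range, e.g. "kafka-main200[2-5]".
RANGE_RE = re.compile(r"^(?P<prefix>.*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>.*)$")


def expand_hosts(pattern):
    """Expand 'name[a-b]' into the list of host names it stands for."""
    match = RANGE_RE.match(pattern)
    if not match:
        return [pattern]
    start, end = int(match["start"]), int(match["end"])
    width = len(match["start"])  # keep zero padding, if any
    return [
        f"{match['prefix']}{n:0{width}d}{match['suffix']}"
        for n in range(start, end + 1)
    ]


def has_aaaa(fqdn):
    """Return True if the resolver returns at least one IPv6 address.

    Note: this goes through the system resolver, so /etc/hosts entries
    count as well; it is not a strict DNS AAAA query.
    """
    try:
        return bool(socket.getaddrinfo(fqdn, None, family=socket.AF_INET6))
    except socket.gaierror:
        return False


print(expand_hosts("kafka-main200[2-5]"))
```

Each expanded name, once completed with the right domain, can be passed to `has_aaaa` to verify that the new records are visible.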
elukey closed T271136: Some Foundation clusters do not appear to support IPv6, a subtask of T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK), as Resolved.
Wed, Apr 28, 7:20 AM · IPv6, User-jbond, netbox
elukey updated the task description for T271136: Some Foundation clusters do not appear to support IPv6.
Wed, Apr 28, 7:19 AM · SRE, IPv6, SRE-tools, User-jbond
elukey added a comment to T281316: WDCM_Sqoop_Clients.R fails from stat1004 (again).

Nice! Let's keep it open since I want to understand if we need to use --driver com.mysql.jdbc.Driver or not, it will have some impact also for Analytics, thanks a lot for bringing this up and sorry for the trouble!

Wed, Apr 28, 6:47 AM · Analytics, User-GoranSMilovanovic, WMDE-Analytics-Engineering, Wikidata
elukey added a comment to T271136: Some Foundation clusters do not appear to support IPv6.

@crusnov we are good to deploy the other AAAA records, can we proceed?

Wed, Apr 28, 6:24 AM · SRE, IPv6, SRE-tools, User-jbond
elukey added a comment to T280132: Degraded RAID on an-worker1100.

@Ottomata @razzi this task needs some follow up :)

Wed, Apr 28, 6:09 AM · SRE, ops-eqiad
elukey added a comment to T281316: WDCM_Sqoop_Clients.R fails from stat1004 (again).

Hi Goran!

Wed, Apr 28, 6:02 AM · Analytics, User-GoranSMilovanovic, WMDE-Analytics-Engineering, Wikidata

Tue, Apr 27

elukey added a comment to T277062: Review the Yarn Capacity scheduler and see if we can move to it.

Added https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Yarn_Labels

Tue, Apr 27, 2:56 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
elukey added a comment to T257359: Update Turkish Wikipedia's labeling campaign for 2020.

@kevinbazira on ores-misc-01 the root partition is full :(

Tue, Apr 27, 11:45 AM · Turkish-Sites, artificial-intelligence, editquality-modeling, Machine-Learning-Team
elukey moved T277062: Review the Yarn Capacity scheduler and see if we can move to it from Ready to Deploy to Done on the Analytics-Kanban board.
Tue, Apr 27, 7:20 AM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters