elukey (Luca Toscano)
Site Reliability Engineer - Analytics/Data engineering

User Details

User Since
Jan 5 2016, 9:54 PM (403 w, 2 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Yesterday

elukey added a comment to T347477: eventgate: eventstreams: update nodejs and OS.

First attempt not good:

Thu, Sep 28, 1:27 PM · EventStreams, Event-Platform, Data-Engineering, Data Engineering and Event Platform Team
elukey added a comment to T347477: eventgate: eventstreams: update nodejs and OS.

In https://horizon.wikimedia.org/project/instances/beed7e0d-4e7f-446f-a73c-60dce7ecff4f/ I see the config for stream-beta.wmflabs.org. The current docker image used is: version: 2022-01-20-101239-production

Thu, Sep 28, 1:20 PM · EventStreams, Event-Platform, Data-Engineering, Data Engineering and Event Platform Team
elukey closed T347344: User-scripts running on Wikipedia can no longer use ORES (CORS issue) as Resolved.

Applied the permanent fix, all use cases work afaics! Thanks for reporting :)

Thu, Sep 28, 1:06 PM · Machine-Learning-Team, ORES
elukey closed T347214: Cannot set Api-User-Agent header when making requests to ORES from a user script - CORS as Resolved.

Applied a permanent fix, thanks for reporting!

Thu, Sep 28, 1:06 PM · Machine-Learning-Team, ORES

Wed, Sep 27

elukey added a comment to T347481: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors.
[2023-09-27 13:01:02,991] ERROR [GroupMetadataManager brokerId=1003] Appending metadata message for group kafka-mirror-main-eqiad_to_jumbo-eqiad generation 19961 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager)
Wed, Sep 27, 1:34 PM · Data-Engineering, Data-Platform-SRE
elukey added a comment to T347481: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors.

It seems that the 15th mirror maker instance triggers the issue (regardless of which instance it is: the trouble starts once we go past the 14th).

Wed, Sep 27, 1:14 PM · Data-Engineering, Data-Platform-SRE
elukey added a comment to T347481: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors.

Two theories for the moment:

Wed, Sep 27, 12:37 PM · Data-Engineering, Data-Platform-SRE

Tue, Sep 26

elukey added a comment to T347344: User-scripts running on Wikipedia can no longer use ORES (CORS issue).

I applied the following config to both ml-serve-eqiad and codfw:

Tue, Sep 26, 3:58 PM · Machine-Learning-Team, ORES
elukey updated the task description for T347278: Decommission ORES configurations and servers.
Tue, Sep 26, 3:45 PM · Machine-Learning-Team
elukey added a comment to T347214: Cannot set Api-User-Agent header when making requests to ORES from a user script - CORS.

Applied a manual fix on ml-serve-codfw, it should now work. I need to add the proper config to deployment-charts to make it permanent :)

Tue, Sep 26, 3:40 PM · Machine-Learning-Team, ORES
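
For readers unfamiliar with the failure mode in T347214: a browser sends a CORS preflight before a cross-origin request that carries a non-safelisted header such as Api-User-Agent, and the request is blocked unless the preflight response lists that header. The following is a minimal sketch of that check, not the config applied on ml-serve. The function name preflight_allows and the sample header values are hypothetical, and the logic is simplified (it ignores credentials, methods and the safelisted-header rules).

```python
def preflight_allows(response_headers, origin, requested_headers):
    """Return True if a CORS preflight response permits the given origin
    and every requested custom header (simplified, assumption-laden check)."""
    # Header names are case-insensitive, so normalize the keys first.
    normalized = {k.lower(): v for k, v in response_headers.items()}

    allow_origin = normalized.get("access-control-allow-origin", "")
    if allow_origin not in ("*", origin):
        return False

    allowed = {
        h.strip().lower()
        for h in normalized.get("access-control-allow-headers", "").split(",")
        if h.strip()
    }
    # "*" acts as a wildcard only for requests without credentials.
    if "*" in allowed:
        return True
    return all(h.lower() in allowed for h in requested_headers)


if __name__ == "__main__":
    # Hypothetical preflight response headers, for illustration only.
    sample = {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type, Api-User-Agent",
    }
    print(preflight_allows(sample, "https://en.wikipedia.org", ["Api-User-Agent"]))
```

If the server omits Api-User-Agent from Access-Control-Allow-Headers, the check returns False, which corresponds to the browser refusing to send the real request.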

Mon, Sep 25

elukey added a comment to T347263: Create external endpoint for recommendation-api-ng hosted on LiftWing.

I think that matches my understanding of our goals here. My point was just that the linked repo is not the repo for the recommendation-api that the mobile apps team leverages, so I'm assuming (perhaps incorrectly) that it has yet to be included, but that it remains the destination we are aiming for.

Mon, Sep 25, 3:22 PM · Machine-Learning-Team
elukey reopened T342266: use wikiID in inference name on LW for revscoring models as "Open".
Mon, Sep 25, 3:20 PM · Machine-Learning-Team
elukey updated the task description for T347278: Decommission ORES configurations and servers.
Mon, Sep 25, 3:16 PM · Machine-Learning-Team
elukey added a comment to T347263: Create external endpoint for recommendation-api-ng hosted on LiftWing.

@Seddon my understanding is that this version of the recommendation API is the one that we want to progress from now on (deprecating the one that the apps are using). We need to consolidate the work into one single API, and the recommendation-api that the apps are currently using is already exposed via Restbase.

Mon, Sep 25, 2:06 PM · Machine-Learning-Team
elukey added a member for WMF-NDA: gmodena.
Mon, Sep 25, 11:24 AM
elukey updated the task description for T347278: Decommission ORES configurations and servers.
Mon, Sep 25, 10:55 AM · Machine-Learning-Team
elukey created T347278: Decommission ORES configurations and servers.
Mon, Sep 25, 10:42 AM · Machine-Learning-Team

Fri, Sep 22

mforns awarded T266641: Test Alluxio as cache layer for Presto a Burninate token.
Fri, Sep 22, 4:40 PM · Data-Platform-SRE, Data-Engineering
elukey added a comment to T347193: Support for basic boolean flags in ores-legacy.

The following works nicely and it seems more precise:

Fri, Sep 22, 3:59 PM · ORES, Machine-Learning-Team
elukey added a comment to T347194: Feature injection does not appear to work in ores-legacy.

We currently don't support feature injection. What is the use case for it? From our traffic analysis this is not a feature that is really used. ORES is being deprecated, so we'd like to keep the set of features to maintain as small as possible.

Fri, Sep 22, 3:58 PM · ORES, Machine-Learning-Team
elukey updated the task description for T347193: Support for basic boolean flags in ores-legacy.
Fri, Sep 22, 3:56 PM · ORES, Machine-Learning-Team
elukey added a comment to T347136: Review Revert Risk reports from WME.

For the first use case, I tried to check logs in Lift Wing for ruwiki:133170407, and I see this:

Fri, Sep 22, 8:04 AM · Machine-Learning-Team, Research
elukey updated the task description for T347136: Review Revert Risk reports from WME.
Fri, Sep 22, 8:02 AM · Machine-Learning-Team, Research
elukey created T347136: Review Revert Risk reports from WME.
Fri, Sep 22, 8:01 AM · Machine-Learning-Team, Research
elukey added a comment to T346446: Upgrade revscoring Docker images to KServe 0.11.

kserve python package v0.11 has been upgraded for revscoring model servers and deployed to staging. The previous issue with the Content-type has been resolved by validating the inputs.

Fri, Sep 22, 6:48 AM · Patch-For-Review, Machine-Learning-Team
elukey closed T342116: Deprecate mediawiki revision-score stream as Resolved.
Fri, Sep 22, 6:46 AM · Machine-Learning-Team
elukey closed T327620: Define SLI/SLO for Lift Wing, a subtask of T333453: Lift Wing improvements to get out of MVP state, as Resolved.
Fri, Sep 22, 6:45 AM · Epic, Machine-Learning-Team
elukey closed T327620: Define SLI/SLO for Lift Wing as Resolved.
Fri, Sep 22, 6:45 AM · Machine-Learning-Team
elukey added a comment to T327241: Move the kserve custom helm chart to the upstream one.

The K8s SIG group added a new policy for upstream charts imported in our repo: https://wikitech.wikimedia.org/wiki/Kubernetes/Upstream_Helm_charts_policy

Fri, Sep 22, 6:44 AM · Machine-Learning-Team
elukey closed T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly as Resolved.

Change merged! Thanks to all for the feedback :)

Fri, Sep 22, 6:24 AM · SRE Observability (FY2023/2024-Q1), serviceops, observability
elukey added a comment to T327620: Define SLI/SLO for Lift Wing.

Final status for all the dashboards:

Fri, Sep 22, 6:22 AM · Machine-Learning-Team

Thu, Sep 21

elukey added a comment to T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly.

+1 for trying this. Thinking out loud:

  1. With something like this in place, should we worry about an alternate workflow to inspect/review a previous quarter's SLO dashboard? Or would manually "make editable" and adjust when needed be good enough?
  2. Since mostly empty panels (when rolling over to a new time window) might be understood as broken/missing data, let's include information in the dashboard header to help clarify what is being displayed.
Thu, Sep 21, 8:59 AM · SRE Observability (FY2023/2024-Q1), serviceops, observability
elukey closed T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic as Resolved.
Thu, Sep 21, 8:32 AM · Patch-For-Review, Machine-Learning-Team

Wed, Sep 20

elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Fix deployed to goodfaith/damaging pod environments. Let's double check tomorrow that the memory metrics are stable and then we can roll out the changes even further.

Wed, Sep 20, 4:30 PM · Patch-For-Review, Machine-Learning-Team
elukey committed rMLIS013b438471f7: Revert "Upgrade revscoring images to KServe 0.11" (authored by elukey).
Revert "Upgrade revscoring images to KServe 0.11"
Wed, Sep 20, 3:30 PM
elukey added a reverting change for rMLIS8f167a7af93b: Upgrade revscoring images to KServe 0.11: rMLIS013b438471f7: Revert "Upgrade revscoring images to KServe 0.11".
Wed, Sep 20, 3:30 PM
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

We set it in our requirements.txt :(

Wed, Sep 20, 3:19 PM · Patch-For-Review, Machine-Learning-Team
elukey committed rODCTW6d3a83831431: Replace yaml load() calls with safe_load() (authored by elukey).
Replace yaml load() calls with safe_load()
Wed, Sep 20, 3:13 PM
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Something really interesting: the following rev-id (https://es.wikipedia.org/w/index.php?diff=153880256) causes a big jump in the size of memory stored by mwparserfromhell:

Wed, Sep 20, 2:51 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

I left the code running for a bit, I wanted to test disabling caching in revscoring, I ended up with the following (more precise) trace:

Wed, Sep 20, 1:13 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

One trace sample from tracemalloc:

Wed, Sep 20, 10:17 AM · Patch-For-Review, Machine-Learning-Team
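
The tracemalloc traces mentioned in T346445 are not reproduced in this feed. As a rough illustration of the technique, the sketch below takes a snapshot before and after a workload and ranks source lines by how much memory they gained. The helper find_top_allocations and the leaky_workload stand-in are hypothetical and are not the code from the task; only the tracemalloc calls come from the Python standard library.

```python
import tracemalloc


def find_top_allocations(workload, limit=5):
    """Run workload() and return (its result, the largest allocation diffs per line)."""
    tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
    before = tracemalloc.take_snapshot()
    retained = workload()  # keep the result alive so it shows up in the diff
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() sorts by the absolute size difference, largest first.
    stats = after.compare_to(before, "lineno")
    return retained, stats[:limit]


def leaky_workload():
    """Hypothetical stand-in for a cache that keeps growing."""
    cache = []
    for i in range(10_000):
        cache.append("x" * 100 + str(i))
    return cache


if __name__ == "__main__":
    _, top = find_top_allocations(leaky_workload)
    for stat in top:
        print(stat)
```

Each printed entry names a file and line together with the size and count delta, which is usually enough to point at the structure that keeps growing.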
elukey added a comment to T327620: Define SLI/SLO for Lift Wing.

Last changes applied, we should be good to close!

Wed, Sep 20, 9:36 AM · Machine-Learning-Team
elukey created P52537 (An Untitled Masterwork).
Wed, Sep 20, 9:18 AM
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

I created the following test environment (locally):

Wed, Sep 20, 7:58 AM · Patch-For-Review, Machine-Learning-Team

Tue, Sep 19

elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Killed {es,ko,ru}wiki-{goodfaith,damaging}.

Tue, Sep 19, 4:19 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Using this graph to find the busiest pods to kill before leaving for EOD (to avoid issues during the EU night).

Tue, Sep 19, 4:16 PM · Patch-For-Review, Machine-Learning-Team
elukey claimed T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.
Tue, Sep 19, 2:39 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T300130: Move Kafka logging to the new intermediate PKI.

From the SAL:

Tue, Sep 19, 1:54 PM · Patch-For-Review, SRE Observability (FY2022/2023-Q2)
elukey added a comment to T341698: Support WME migration to Lift Wing - COMPLETE.

WME is going to perform some extra tests on Lift Wing this week, and they will enable the full request flow after that.

Tue, Sep 19, 8:42 AM · Goal, Machine-Learning-Team
elukey added a comment to T346446: Upgrade revscoring Docker images to KServe 0.11.

Tried to hit eswiki-damaging in staging:

Tue, Sep 19, 8:09 AM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346151: Lift Wing alerting.

@elukey We could do that, but I'll need to find the appropriate query for the alert.

Tue, Sep 19, 8:00 AM · Observability-Alerting, Machine-Learning-Team
elukey committed rMLIS8f167a7af93b: Upgrade revscoring images to KServe 0.11 (authored by elukey).
Upgrade revscoring images to KServe 0.11
Tue, Sep 19, 6:22 AM

Mon, Sep 18

Milimetric awarded T266641: Test Alluxio as cache layer for Presto a Party Time token.
Mon, Sep 18, 2:45 PM · Data-Platform-SRE, Data-Engineering
elukey created T346638: Rename the envoy's uses_ingress option to sets_sni.
Mon, Sep 18, 2:10 PM · Patch-For-Review, Machine-Learning-Team, serviceops
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Ilias rolled out https://gerrit.wikimedia.org/r/958393 to damaging/goodfaith pods in ml-serve-eqiad, so far we haven't seen any occurrence of the memory leak. Let's keep it monitored.

Mon, Sep 18, 1:37 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346151: Lift Wing alerting.

@RLazarus thanks a lot! We can wait and be the first beta-testers of the new alerts if you are ok!

Mon, Sep 18, 1:32 PM · Observability-Alerting, Machine-Learning-Team
elukey added a comment to T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly.

@RLazarus What do you think? :)

Mon, Sep 18, 1:24 PM · SRE Observability (FY2023/2024-Q1), serviceops, observability
elukey added a comment to T342116: Deprecate mediawiki revision-score stream.

Email sent to Wikitech-l, the task is completed. Let's leave it open for a couple of days to see if everything works as expected.

Mon, Sep 18, 10:34 AM · Machine-Learning-Team
elukey committed rMLISc594367b49a0: python: remove unnecessary self attributes in revscoring's model svc (authored by elukey).
python: remove unnecessary self attributes in revscoring's model svc
Mon, Sep 18, 8:03 AM
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Killed eswiki damaging/goodfaith, same pattern.

Mon, Sep 18, 7:06 AM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346446: Upgrade revscoring Docker images to KServe 0.11.

Error while uploading the new revscoring to Pypi:

Mon, Sep 18, 6:37 AM · Patch-For-Review, Machine-Learning-Team

Sun, Sep 17

elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Killed eswiki-damaging again.

Sun, Sep 17, 4:06 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

The last pod that I deleted on ml-serve-eqiad was eswiki-damaging-predictor-default-00012-deployment-754bf46tdg6p. Something interesting is that hours before an OOM event occurred:

Sun, Sep 17, 2:44 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

There seems to be a metric to look for, namely response-flags:

Sun, Sep 17, 8:36 AM · Patch-For-Review, Machine-Learning-Team

Sat, Sep 16

elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Killed eswiki goodfaith and damaging today.

Sat, Sep 16, 6:27 PM · Patch-For-Review, Machine-Learning-Team

Fri, Sep 15

elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

As a desperate attempt, we restarted all the pods in goodfaith/damaging (eqiad and codfw). It is likely not gonna help but worth trying.

Fri, Sep 15, 4:50 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.

Tried to log in to an ml-serve node running a hanging pod, and tried to run gdb (without nsenter):

Fri, Sep 15, 4:38 PM · Patch-For-Review, Machine-Learning-Team
elukey created P52514 (An Untitled Masterwork).
Fri, Sep 15, 4:19 PM
elukey added a comment to T346446: Upgrade revscoring Docker images to KServe 0.11.

Filed https://github.com/halfak/yamlconf/pull/8 for yamlconf
Filed https://github.com/wikimedia/revscoring/pull/547 for revscoring

Fri, Sep 15, 2:01 PM · Patch-For-Review, Machine-Learning-Team
elukey updated the task description for T346446: Upgrade revscoring Docker images to KServe 0.11.
Fri, Sep 15, 1:29 PM · Patch-For-Review, Machine-Learning-Team
elukey created T346446: Upgrade revscoring Docker images to KServe 0.11.
Fri, Sep 15, 1:26 PM · Patch-For-Review, Machine-Learning-Team
elukey created T346445: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic.
Fri, Sep 15, 1:23 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T346175: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared.

We found a serious bug though: sometimes the kserve container inside an isvc pod stops working for some reason, blackholing traffic. We noticed this because the retry queue in changeprop for ORESFetchScoreJob has been increasing for the past few days, and the related kafka topic was constantly getting new events inserted. The related Kafka consumer lag is decreasing, but we need to file a new task to investigate this problem (since it can happen again at any time).

Fri, Sep 15, 1:16 PM · Growth-Team, MediaWiki-Recent-changes, ORES, Machine-Learning-Team
elukey added a comment to T346175: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared.

The metrics now look better! One thing that I noticed is that we have a lot of events in the ORESFetchScoreJob retry topic.

Fri, Sep 15, 9:23 AM · Growth-Team, MediaWiki-Recent-changes, ORES, Machine-Learning-Team
elukey added a comment to T346175: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared.

The Kafka consumer lag dashboard shows what Ilias pointed out, namely changeprop is lagging in consuming (and processing) ORESFetchScore jobs.

Fri, Sep 15, 8:21 AM · Growth-Team, MediaWiki-Recent-changes, ORES, Machine-Learning-Team
elukey added a comment to T346175: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared.

Latency for enwiki damaging

Fri, Sep 15, 8:18 AM · Growth-Team, MediaWiki-Recent-changes, ORES, Machine-Learning-Team
elukey added a comment to T346175: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared.

Really nice finding! It seems to match exactly https://sal.toolforge.org/log/HIAEhIoBGiVuUzOdDi6t, that is when we moved wikidata and enwiki to Lift Wing. Maybe it is only a matter of adding more pods?

Fri, Sep 15, 8:11 AM · Growth-Team, MediaWiki-Recent-changes, ORES, Machine-Learning-Team
elukey added a comment to T334182: Deploy multilingual readability model to LiftWing.

@klausman Assigned the task to you since there are a couple of steps that are more related to SRE (lemme know if you don't have time, I'll take care of it).

Fri, Sep 15, 7:35 AM · Patch-For-Review, Research (FY2023-24-Research-July-September), Machine-Learning-Team
elukey reassigned T334182: Deploy multilingual readability model to LiftWing from achou to klausman.
Fri, Sep 15, 7:35 AM · Patch-For-Review, Research (FY2023-24-Research-July-September), Machine-Learning-Team
elukey reassigned T334182: Deploy multilingual readability model to LiftWing from MGerlach to achou.
Fri, Sep 15, 7:34 AM · Patch-For-Review, Research (FY2023-24-Research-July-September), Machine-Learning-Team
elukey closed T346032: Elevate LiftWing access to WME tier for development and production environment as Resolved.
Fri, Sep 15, 7:33 AM · Wikimedia Enterprise, Machine-Learning-Team
elukey closed T344058: Tune LiftWing autoscaling settings for Knative as Resolved.

We applied the rps strategy to all our isvcs, and re-calibrated autoscaling settings. The autoscaling graphs look much better now, I am inclined to close. Thanks Aiko for the work!

Fri, Sep 15, 7:33 AM · Machine-Learning-Team
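
As a rough intuition for the "rps strategy" mentioned in T344058: with a requests-per-second target, the number of pods the autoscaler aims for is approximately the observed request rate divided by the per-pod target, clamped to the configured bounds. The sketch below is a simplified model under that assumption, not the Knative implementation. The function name, the default bounds and the sample numbers are hypothetical, and the real autoscaler also accounts for target utilization and stable/panic windows.

```python
import math


def desired_replicas(observed_rps, target_rps_per_pod, min_replicas=1, max_replicas=10):
    """Approximate pod count for an rps-based autoscaling target (simplified model)."""
    if target_rps_per_pod <= 0:
        raise ValueError("target_rps_per_pod must be positive")
    wanted = math.ceil(observed_rps / target_rps_per_pod)
    # Clamp to the configured scale bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Hypothetical numbers: 100 rps of traffic, 15 rps target per pod.
    print(desired_replicas(100, 15))
```

Lowering the per-pod target makes the service scale out earlier, while raising it trades latency headroom for fewer pods.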
elukey updated subscribers of T346151: Lift Wing alerting.

I think that we should coordinate with SRE (@RLazarus for example) before proceeding further with SLO alarming, we don't want to deviate from the SRE recommendations :)

Fri, Sep 15, 6:55 AM · Observability-Alerting, Machine-Learning-Team

Thu, Sep 14

elukey added a comment to T342116: Deprecate mediawiki revision-score stream.

We have disabled the revision-score stream from EventStreams, so it is not published anymore in https://stream.wikimedia.org/.

Thu, Sep 14, 4:03 PM · Machine-Learning-Team
elukey added a comment to T334182: Deploy multilingual readability model to LiftWing.

The service is currently deployed to production!

Thu, Sep 14, 12:35 PM · Patch-For-Review, Research (FY2023-24-Research-July-September), Machine-Learning-Team

Wed, Sep 13

elukey updated subscribers of T344614: Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster.

@bking I had a chat with @dcausse and from https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/ha/zookeeper_ha/#example-configuration (the 1.16 docs) it seems that an example config is:

Wed, Sep 13, 10:37 AM · Discovery-Search (Current work), Data-Platform-SRE
elukey updated subscribers of T334182: Deploy multilingual readability model to LiftWing.

Thanks for the info @MGerlach!

Wed, Sep 13, 7:59 AM · Patch-For-Review, Research (FY2023-24-Research-July-September), Machine-Learning-Team
elukey added a comment to T344614: Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster.

@bking another thing to verify:

Wed, Sep 13, 7:55 AM · Discovery-Search (Current work), Data-Platform-SRE
elukey added a comment to T344614: Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster.

The flink cluster in eqiad looks healthy:

Wed, Sep 13, 7:37 AM · Discovery-Search (Current work), Data-Platform-SRE
elukey added a comment to T344614: Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster.

The flink-app in dse-k8s is healthy again, but I have no evidence that it's talking to Zookeeper. I ran the same logging command as @elukey did above, but unlike him I couldn't find any references to high availability at all.

Wed, Sep 13, 7:23 AM · Discovery-Search (Current work), Data-Platform-SRE

Tue, Sep 12

elukey added a comment to T341699: Order 1 GPU for Lift Wing.

Some info:

Tue, Sep 12, 3:37 PM · Goal, Machine-Learning-Team
elukey added a comment to T334182: Deploy multilingual readability model to LiftWing.

Great results @achou!

Tue, Sep 12, 1:58 PM · Patch-For-Review, Research (FY2023-24-Research-July-September), Machine-Learning-Team
elukey added a comment to T327620: Define SLI/SLO for Lift Wing.

Opened T346144 as related change for the dashboards :)

Tue, Sep 12, 1:41 PM · Machine-Learning-Team
elukey created T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly.
Tue, Sep 12, 1:29 PM · SRE Observability (FY2023/2024-Q1), serviceops, observability
elukey added a comment to T342116: Deprecate mediawiki revision-score stream.

From this query it seems that only one IP address has been active in the last few weeks. It is difficult to follow up with them: it looks like a stream consumer running in a cloud provider, but it does not seem possible to get the UA or similar from our metrics and logs.

Tue, Sep 12, 6:40 AM · Machine-Learning-Team
elukey added a comment to T345058: Cassandra instance with corrupted commit log after powercycle of restbase1027.

No no for the moment it is fine, what I wanted to do is to avoid single persons on-call for events like Cassandra being in trouble (namely, you :). We can slowly build knowledge over time, I am in if you want to evangelize more how to deal with Cassandra events!

Tue, Sep 12, 6:12 AM · Cassandra, serviceops
elukey added a comment to T346032: Elevate LiftWing access to WME tier for development and production environment.

@prabhat It shouldn't be a problem, so let's keep both dev and prod accounts for the moment. The total traffic per second should get up to 100 rps with both clients active, not a ton but still a sizeable one :D We can review the status down the line, maybe let's start with just one hitting at full power for the moment, would it be ok?

Tue, Sep 12, 6:10 AM · Wikimedia Enterprise, Machine-Learning-Team

Mon, Sep 11

elukey added a comment to T346032: Elevate LiftWing access to WME tier for development and production environment.
elukey@mwmaint1002:~$ mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client ee4742635eb8098fbbc5a0d3ee037251 --tier wme
Successfully added tier wme for ee4742635eb8098fbbc5a0d3ee037251. 
elukey@mwmaint1002:~$ mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client 686b05ed258704c7d71bb584cfe60865 --tier wme
Successfully added tier wme for 686b05ed258704c7d71bb584cfe60865.
Mon, Sep 11, 1:59 PM · Wikimedia Enterprise, Machine-Learning-Team
elukey added a comment to T339890: Host the recommendation-api container on LiftWing.

@elukey, on IRC you mentioned:

one quick thing - I am reading https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/types/related_articles/candidate_finder.py#L167
and it is probably something that we can improve
for example, we could do the prep work offline and force np to load from file
I have never done it but I am pretty sure it should be doable
could you please check if this is doable? And also update the task with all the findings etc.

I am not sure what kind of "prep work" you are referring to. Could you please, clarify? Thanks!

Mon, Sep 11, 1:40 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T345058: Cassandra instance with corrupted commit log after powercycle of restbase1027.

@Eevans I totally understand your point of view, but at the same time I am not clear what procedure we should follow when an issue like this one happens (while on-call etc.). Is your recommendation to just leave the instance depooled, stop puppet etc., and then ping Data Persistence for a permanent fix?

Mon, Sep 11, 1:36 PM · Cassandra, serviceops