Wed, May 22
Thanks a lot for the deploy! I checked the metrics and nothing seems to have changed :(
Tue, May 21
elukey@ganeti1003:~$ sudo gnt-instance remove analytics-tool1003.eqiad.wmnet
This will remove the volumes of the instance analytics-tool1003.eqiad.wmnet (including mirrors), thus removing all the data of the instance.
Continue?
y/[n]/?: y
Debian GNU/Linux 10 (buster)
an-tool1005 is a Superset web GUI for analytics dashboards (staging environment) (analytics_cluster::superset::staging)
After restoring traffic served by php-fpm, it is clear that the increase in GET is correlated with it. Very interesting that GETS remained stable.
Grabbed a sample of localhost traffic to port 11213 on mw1238 from 11:45:39 to 11:48:05 (146s).
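For reference, a capture along these lines would produce such a sample (a sketch, not the exact command used; interface, duration and output path are placeholders):

# Hypothetical sketch: capture ~150s of localhost traffic to port 11213 on mw1238
sudo timeout 150 tcpdump -i lo -s 0 'tcp port 11213' -w /tmp/mw1238-11213.pcap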
20s of traffic from mw1239:
Other 20s of traffic from mw1238:
Would it be possible to deploy https://gerrit.wikimedia.org/r/511612 before the weekly train?
Mon, May 20
@Krinkle the graph of ops/second clearly shows a bump from ~3k ops to ~12k ops, and from tcpdump the number of gets to the chronology protector is close to 9k per second. I'll follow up as you suggest, but it seems to me to be a call-volume problem rather than big keys hitting Redis.
@abi_ sorry, I should have been more precise - I meant to ask if we have an idea about the size of those keys etc. (you know, ops people are always complaining about network usage :P). If we don't have a clear idea yet, never mind, I was only curious :)
What results are expected after this change gets deployed? (just to summarize what to look for / check)
Grabbed a .pcap from mw1238 (09:26:41 -> 09:27:00 UTC, so ~20s of traffic) and tried to create a meaningful summary of the get traffic:
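The actual summary isn't pasted here, but a pipeline roughly like this can be used to rank the get keys in a pcap (a sketch; the pcap file name is a placeholder):

# Hypothetical sketch: extract "get" commands from the capture and count occurrences per key
tcpdump -r /tmp/mw1238-get.pcap -A 2>/dev/null \
  | grep -oE 'get [^[:space:]]+' \
  | sort | uniq -c | sort -rn | head -20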
Very good news from the task opened with the Language team: https://gerrit.wikimedia.org/r/508112 is merged and will be part of this week's mediawiki train. It may very well be the last fix needed, fingers crossed :)
Not a very lucky morning:
elukey@mc1019:~$ dpkg -l | grep stdc
ii  libstdc++6:amd64  4.9.2-10+deb8u1  amd64  GNU Standard C++ Library v3
@Addshore After writing the above entry I remembered that a while ago we discussed a difference in memcached behavior between hhvm and php7 for Wikidata traffic. Could it be related?
Sun, May 19
I zoomed to check when the increase happened, and I noticed something interesting:
Sat, May 18
Today I have seen some alarms firing for mediawiki exceptions due to this error :)
Killed all the processes on stat1007 and also commented out his crontab (so nothing should restart). Let me know if it is ok :)
Fri, May 17
Thanks a lot for the feedback, and don't worry about the delay, better to be sure and test the code! One week more doesn't make a lot of difference :)
+1, this solution is fine too; I am ok with anything that is easy and doesn't require more than a couple of hours to complete :)
Hello everybody, I am investigating https://phabricator.wikimedia.org/T223310, namely the Redis MainStash being constantly hammered by GETs for global:Wikimedia\Rdbms\ChronologyProtector since the deployment of 1.34.0-wmf.3. Is it related to https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504819/ by any chance?
New hosts should be racked this month: https://phabricator.wikimedia.org/T220687
Thu, May 16
It would be better in my opinion to keep using the mysql database in deployment-prep as an interim solution, and see if we can find an alternative in the meantime. Event Gate will eventually replace eventlogging, so it is probably better to spend time on the former rather than the latter. If we are afraid of keeping the mysql code/dependencies around, we could try to make them optional, removing them on eventlog1002 (in production) and keeping them in deployment-prep. What do you think @Ottomata ?
I grabbed a pcap (~200MB) from 11:03:25 to 11:04:41 of traffic directed to mc1033's Redis port (only one direction, to avoid having to deal with huge pcaps), grepped for GET and tried to aggregate some results:
@Krinkle @aaron I raised the priority to high since the tx network bandwidth usage grew a lot and we might end up saturating it with spikes in memcached traffic (causing timeouts etc.. like T203786). Can we check what caused the increase of traffic asap?
I have modified the ops/s panels of the Redis dashboard (https://grafana.wikimedia.org/d/000000174/redis) to show per-host metrics:
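As a rough idea of the kind of change, the per-host panels can be driven by a query like the following (a sketch assuming a redis_exporter-style metric name, not necessarily the exact expression in the dashboard):

# Hypothetical sketch: per-host Redis ops/s instead of a cluster-wide aggregate
sum by (instance) (rate(redis_commands_processed_total[5m]))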
Connections to the memcached shards dropped nicely after the nutcracker restart.
Wed, May 15
Started up just now, forgot to add the task in the SAL :)
@MoritzMuehlenhoff I am ready to make this host a proper staging environment for superset, let me know if we can proceed or not :)
@aaron hi! I am trying to figure out if Redis traffic on mc1033 is causing this increase in bandwidth usage. From tcpdump I can see a lot of traffic for GET global:Wikimedia\Rdbms\ChronologyProtector::v2. Is it something added by 1.34.0-wmf.3? If not I'll look deeper :)
Tue, May 14
From memcached's bytes_written and bytes_read metrics I don't see anything changing dramatically:
From a quick look with memkeys on mc1033, these are the top talker keys:
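For context, memkeys was run roughly like this (a sketch; the interface and port are assumptions, not taken from this log):

# Hypothetical sketch: sniff memcached traffic on mc1033 and rank keys by usage
sudo memkeys -i eth0 -p 11211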
Mon, May 13
@abi_ you guys rock, thanks a lot for this effort!
@abi_ Hi! Any news? :)
Update: we were running 1.4.1-1~stretch1; I have rolled back eventlogging to it and all the instabilities went away. Judging from the changelog, 1.4.3 seems to be a broken version: a lot of changes to the consumer code and bugs introduced.
Sun, May 12
@Ottomata sorry but 1.4.3 was not the right version to roll back to :(
Again all processors stuck, py-bt for reference:
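The backtrace itself is not reproduced here; for reference, such a py-bt can be taken roughly like this (a sketch, assuming the python gdb helpers / -dbg symbols are installed; the PID is a placeholder):

# Hypothetical sketch: attach gdb to a stuck eventlogging processor and get a Python-level backtrace
sudo gdb -p <processor_pid>
(gdb) py-bt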
Sat, May 11
A big lag accumulated during the EU nighttime; after the EL restart it seems to be working fine. The errors were all like:
Fri, May 10
Andrew deployed 1.4.3 and we are back to stable.
Rebuilt python-kafka_1.4.3-1_all.deb and uploaded it to eventlog1002 in case we decide to roll back.
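If we do roll back, a downgrade along these lines should work (a sketch; the package path is a placeholder):

# Hypothetical sketch: install the rebuilt package directly on eventlog1002
sudo dpkg -i /home/elukey/python-kafka_1.4.3-1_all.deb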
https://github.com/dpkp/kafka-python/issues/1418 seems related, some workarounds are listed.
The upgrade of python-kafka to 1.4.6 on eventlog1002 coincides very well with T222941 :(
SAL for the 04-30:
I tried with the following changes:
Thu, May 9
Hive server 2 was OOMing, still not sure why. After a restart it seems to be working fine again. Will need to dig a bit deeper into the logs!
No segfaults today, logrotate ran fine! I don't see any logs in logstash though, worth investigating? (Maybe in a separate task if confirmed by others.)
@Cmjohnson I'd need a heads-up ~15 mins before the maintenance to shut down the host properly, but we can do it anytime!
Wed, May 8
Side note: I can't find any messages in Kibana related to netmon1002 or netbox, not sure if that is normal or not.
The fix looks very promising: I have restarted uwsgi-netbox 3 times in a row with no trace of the segfault. Let's wait for tomorrow's round of logrotate restarts to confirm that we are good.
Tue, May 7
On boron I have built uwsgi-core_2.0.14+20161117-3+deb9u2~wmf1_amd64.deb with the following patch:
@Jdcc-berkman since you have access to the host, can you please clean up those files? That way I won't accidentally delete anything valuable etc. Thanks :)
@MoritzMuehlenhoff we have an interesting segfault that happens in uwsgi when systemctl restarts the netbox unit, but only in production and not in labs. Cas and I tried several times to reproduce the bug without any luck. After opening https://github.com/unbit/uwsgi/issues/2010, there is a promising patch that may fix the issue, so in order to keep mental sanity I propose the following:
@ssingh very happy to help! We should be ok to close the task right?
Mon, May 6
@Marostegui sorry, I was under the impression that we needed to wait for feedback from Chris/Rob about how to proceed. Is there anything pending on the Analytics side?
Please also restart the job as user analytics, not hdfs (the related patch is https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/508283/; if it is not deployed when you restart, please use the -D user=analytics override and start the job via sudo -u analytics oozie etc.)
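Something along these lines should do it (a sketch; the properties file path is a placeholder, not from this log):

# Hypothetical sketch: submit the job as the analytics user with the user property overridden
sudo -u analytics oozie job -D user=analytics -config /path/to/coordinator.properties -run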
A systemctl restart triggered a segfault, and a core was available under /var/tmp/core. This is what gdb says:
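For reference, a backtrace from such a core can be obtained roughly like this (a sketch; the uwsgi binary path is an assumption):

# Hypothetical sketch: open the core dump with the matching binary and print the backtrace
sudo gdb /usr/bin/uwsgi-core /var/tmp/core
(gdb) bt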
Added a couple of code reviews as an attempt to add LimitCORE to the netbox systemd unit. If this is not the idea that you guys had, please feel free to discard them :)
Sun, May 5
I am wondering if we could do the following:
Fri, May 3
Swapped all the occurrences of instance=~"$kafka_broker" with instance=~"($kafka_broker).*", and the dashboard seems to load faster now. Also removed the .* custom value from the $kafka_broker All values field.
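Concretely, the panel queries went from something like the first form to the second (a sketch; some_kafka_metric is a generic placeholder, not the actual panel expression):

# Before: only matches if the variable expands to the full instance label value
some_kafka_metric{instance=~"$kafka_broker"}
# After: matches instances whose label starts with the selected broker name(s)
some_kafka_metric{instance=~"($kafka_broker).*"}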
This works with the default all value:
It seems that the following happens when using the default all value (using the current cpu usage query, since it runs into the same problem):
Note to self: remember that doing the above breaks all the kafka graphs
@GoranSMilovanovic have you tried to use https://wikitech.wikimedia.org/wiki/SWAP and install the package there? pip is available and you could start using it straight away. The other alternative is to create your own Python virtual environment on stat1007 and pip install the package (testing and evaluating it). Packaging it for us would require a lot of time and effort, so if you only need it on one host this might be the quickest solution :)
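For example, the virtual environment route on stat1007 would look roughly like this (a sketch; the venv path and package name are placeholders):

# Hypothetical sketch: create a personal virtualenv and install the package into it
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install <package-name>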