We have restarted an associated service, and its logs show no more errors. It's not quite root-caused yet, but the functionality should be back in working order now. I have confirmed this for ruwiki.
Apr 15 2024
Mar 26 2024
Mar 25 2024
In T360446#9649946, @Jhancock.wm wrote: Found the drive as absent in iDRAC. Physically, the drive is there but is not blinking like the other drives.
For this one, the recommended remedy is to reseat the drive. Is that safe to do at this time?
Mar 22 2024
During some experimentation with various approaches of generating the Docker images differently, and stripping out unneeded information, I have tried the following things:
Mar 19 2024
Mar 7 2024
Mar 5 2024
In T358467#9582988, @kevinbazira wrote: The article-descriptions model server was firing InfServiceHighMemoryUsage alerts. This usually happens when an isvc uses >90% of its limit for 5 minutes. I have increased the memory limit used by this model server from 4Gi to 5Gi so that prod can handle processing more isvc requests without running out of memory.
This was indeed caused by using the wrong metric. We have chosen to move to using the existing k8s alerts.
And the external endpoint is live:
Feb 29 2024
Hypothesis as to why the other services never alerted: their base usage (container_memory_working_set_bytes) is much lower than the limit, and they don't do enough disk I/O to fill the page cache to the point where the combined metric (container_memory_usage_bytes) gets close to the limit.
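To illustrate the hypothesis, here is a minimal sketch of how the two metrics can diverge. It assumes the usual cAdvisor relationship, where the usage metric is roughly the working set plus inactive file-backed page cache. All byte values and the helper `would_alert` are made up for illustration; only the 90% threshold and the 4Gi limit come from the discussion in this task.

```python
GIB = 1024 ** 3

def would_alert(value_bytes, limit_bytes, threshold=0.9):
    """Return True if the metric value exceeds the given fraction of the limit."""
    return value_bytes / limit_bytes > threshold

# Hypothetical numbers, not measurements from any cluster.
limit = 4 * GIB                  # container memory limit
working_set = 2 * GIB            # container_memory_working_set_bytes
inactive_cache = 7 * GIB // 4    # page cache filled by disk I/O (1.75 GiB)

# Assumption: usage is approximately working set plus inactive file cache.
usage = working_set + inactive_cache  # container_memory_usage_bytes

print(would_alert(usage, limit))        # combined metric is above 90% of the limit
print(would_alert(working_set, limit))  # working set alone stays well below it
```

With these numbers, an alert based on the combined metric fires even though the working set is only half the limit, which matches the idea that a service with little disk I/O would never trip it.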
I have found this:
Feb 28 2024
In T356256#9583456, @elukey wrote:
- What is the schema selected for the data stored in Cassandra? We should document it in here so people can find it, and probably discuss the replication strategy etc. (for example, do we eventually want to be able to replicate a write to eqiad in codfw and vice versa?). Cassandra does a lot of things automatically, but they need to be stated.
Feb 27 2024
I've updated the partman lines. I will update modules/profile/data/profile/installserver/preseed.yaml to include the new host in a moment, so standard imaging should pick the right recipe for the host.
One addendum to the 'None has no attribute "shape"': this happened only once, the same request seconds later (and before!) worked just fine.
I just got an error when querying the service:
Feb 26 2024
I had missed pushing the admin_ng change. That is fixed now, so pushing the model server config should work.
Feb 23 2024
In T358195#9572174, @isarantopoulos wrote:
- the model takes into account articles (the first paragraphs in our case) and short descriptions in all languages where the article is available.
Feb 22 2024
Another option is using something like https://mobileapps.discovery.wmnet:4102/es.wikipedia.org/v1/page/summary/Madrid, so there would be neither RESTbase nor the REST API in the path, but I am seeing similar latencies there.
Note the wide variety of latencies, spanning from 118ms for "Coal" to more than 10x that for "Poetry". This indicates to me that any rigorous latency testing has to request summaries for a wide dataset of pages.
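A rough sketch of what such a test could look like: time one request per title and summarise the spread rather than trusting a single page. The functions `measure_latencies` and `summarize` are hypothetical helpers, and `fetch` stands for whatever callable performs the summary request; nothing here is the tooling actually used for the numbers above.

```python
import statistics
import time

def measure_latencies(fetch, titles):
    """Call fetch(title) once per title and return {title: latency in ms}."""
    results = {}
    for title in titles:
        start = time.perf_counter()
        fetch(title)
        results[title] = (time.perf_counter() - start) * 1000.0
    return results

def summarize(latencies_ms):
    """Summarise per-title latencies; 'spread' is max divided by min."""
    values = sorted(latencies_ms.values())
    spread = values[-1] / values[0] if values[0] > 0 else float("inf")
    return {
        "min": values[0],
        "median": statistics.median(values),
        "max": values[-1],
        "spread": spread,
    }

if __name__ == "__main__":
    # Stand-in fetcher that does no network I/O; replace with a real request.
    def fake_fetch(title):
        time.sleep(0.001 * len(title))

    stats = summarize(measure_latencies(fake_fetch, ["Coal", "Poetry", "Madrid"]))
    print(stats)
```

A large `spread` value is the signal that a single-page benchmark would be misleading and that percentiles over many pages should be reported instead.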
And with a variety of pages requested:
This is run from within the container article-descriptions-predictor-default-00025-deployment-5czmjql currently running on staging:
Feb 14 2024
Feb 13 2024
Feb 12 2024
Feb 8 2024
ml-serve in codfw also done, so all done for ML team
Feb 7 2024
Downtime has been added.
ml-serve1xxx are all done.
After dropping the version specifiers (/v...) at the end of the apiGroups directives, this is now working properly.
Roll-restart of the staging ML cluster is done; the eqiad and codfw prod clusters will follow today and tomorrow.
Feb 6 2024
Jan 31 2024
Jan 30 2024
One question for clarification: what piece of software would be talking to statsd? RR as it runs on LW cannot access any statsd atm, since it is mostly isolated.
Jan 25 2024
Downtime done and machine is back in service.
Nice work. On our machine (ml-serve2002), it was but four seconds:
Move complete, machine undrained.
Jan 24 2024
ml-serve2005 is back up and working fine