Yes, I think we can close this. I'd even say that having different user configs in codfw vs eqiad is an antipattern.
May 17 2023
The changes from 920208 have been deployed.
May 15 2023
Namespaces are live in both eqiad and codfw:
May 11 2023
In T333124#8843788, @elukey wrote: @klausman do you have time to work with Aiko to push this to production during the next days?
Apr 28 2023
All machines in codfw done.
Apr 19 2023
Namespace has been created on staging, and is visible:
Apr 17 2023
Started a draft doc here: https://wikitech.wikimedia.org/wiki/SLO/Lift_Wing
Yes, my plan was to elaborate on my write-up a bit (it's mostly for sorting my thoughts), and then use the template you mentioned to develop that into something like the API GW SLO (with plenty of SRE input).
Mar 30 2023
https://docs.google.com/document/d/1NspQtkfyuD_kiYCgms1gRZeFFiAaetnk/edit <- My thoughts so far, comments here or on the doc welcome.
Feb 24 2023
The current (now resolved) reason for the disk fillup was a 22G logfile:
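For reference, a generic sketch of how one can hunt down files like that (hypothetical path and threshold, not the actual culprit):

import os

# Walk /var/log and flag anything over 1 GiB; both the path and the
# threshold are placeholders, not from the actual incident.
THRESHOLD = 1 << 30  # 1 GiB

for root, _dirs, files in os.walk("/var/log"):
    for name in files:
        path = os.path.join(root, name)
        try:
            size = os.path.getsize(path)
        except OSError:
            continue  # file vanished or unreadable; skip it
        if size > THRESHOLD:
            print(f"{size / (1 << 30):6.1f} GiB  {path}")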
Feb 13 2023
I resolved this by doing the following:
Dec 13 2022
We did some more refactoring/improving of the Docker image today, and have done basic tests. The staging endpoint now uses the new image, and it looks like it's working fine. The new way of building the image has been committed to my fork of Stopes, on the usual aws_publish branch.
Nov 30 2022
ores2009 is shutting down & powering off now
Nov 23 2022
After some discussion, we have decided that the API-GW side URL scheme for LW should look like:
Nov 22 2022
I'll close this ticket for now, since the main effort is focused on NLLB200 on AWS (https://phabricator.wikimedia.org/T321781). If/when we look at MarianNMT again, we can reopen (or, more likely, make a new Task).
Nov 8 2022
This still needs a fix to https://github.com/wikimedia/wikilabels-wmflabs-deploy/blob/master/config/00-main.yaml#L20 which I have prepared in https://github.com/wikimedia/wikilabels-wmflabs-deploy/pull/57
Nov 2 2022
As just added to T307389: DBs have been migrated and docs updated. Taavi has shut down the old clouddb instances and, unless we find we still need them for some reason, will delete them in a week.
Created the VM and Puppet stuff as detailed above, migrated the data, and then switched the uwsgi applications on the main instance and staging to use said VM. Updated docs accordingly, including this new section:
Aug 22 2022
From an ML POV, the useful tiers would probably be:
Aug 17 2022
A few notes:
Jul 27 2022
Ok, the machine is booted and sitting in GRUB. @Papaul I can't seem to run memtest86+ via iDRAC (I just get a black screen). Can you check whether it works with direct access? Alternatively, do you know how to run it so that console redirection works? Thanks!
Jul 26 2022
Change 817210 actually fixes this; we now see messages in logstash again. Apparently, an unset buffer size causes JSON generation to break. The upstream bug is still open, but I doubt it will be fixed soon, especially now that a mitigation is available.
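As a toy reproduction of the failure mode (not uwsgi's actual logging code): once a fixed-size buffer truncates a record, it is no longer parseable JSON, which is why the messages vanished from logstash.

import json

# A made-up log record; the truncation length is arbitrary.
record = json.dumps({"host": "example-host", "msg": "x" * 200})
truncated = record[:64]  # what an undersized log buffer would emit

try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    print(f"unparseable after truncation: {err}")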
Jul 7 2022
Upstream issue: https://github.com/unbit/uwsgi/issues/2456
Jul 5 2022
Now also running draftquality for enwiki:
Jun 23 2022
Prometheus is now correctly set up with its own volumes (we hadn't done that yet), and I managed to save the old data.
Jun 22 2022
Add'l things done:
Jun 13 2022
Istio config and (most of) the cert-manager config have been applied. For cert-manager, I need to sync up with Luca regarding part of said config referring to the ml-serve endpoints.
Mar 18 2022
I put the smaller staging allocation at the end to avoid fragmentation (at least for now; in my experience it can't be avoided forever). Similarly, the Train/DSE range is "flipped" (/21 first) to avoid fragmentation between it and the preceding prod ranges. If sufficiently smaller ranges are needed in EQIAD for future projects, they should follow the same scheme as the staging ranges in CODFW (allocate from the end, avoiding fragmentation with the same alternating-sizes pattern as for prod/train).
I have set up IP ranges (and sliced them up for our use):
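As a rough sketch of that slicing pattern (using Python's ipaddress module; the prefixes below are placeholders, not the actual allocations):

import ipaddress

# Sketch of the pattern described above: large prod ranges grow from the
# front of the supernet, the small staging range is carved from the very
# end. All prefixes here are made up for illustration.
supernet = ipaddress.ip_network("10.192.0.0/18")  # placeholder parent range

subnets_20 = list(supernet.subnets(new_prefix=20))
prod_a, prod_b = subnets_20[0], subnets_20[1]        # allocate from the start

staging = list(supernet.subnets(new_prefix=24))[-1]  # allocate from the end

for name, net in (("prod-a", prod_a), ("prod-b", prod_b), ("staging", staging)):
    print(f"{name:8s} {net}")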
Mar 15 2022
# etcdctl -C https://ml-staging-etcd2001.codfw.wmnet:2379 cluster-health
member 493aa03d462725d1 is healthy: got healthy result from https://ml-staging-etcd2002.codfw.wmnet:2379
member b12825ca936a35a6 is healthy: got healthy result from https://ml-staging-etcd2003.codfw.wmnet:2379
member fce0f93975c27096 is healthy: got healthy result from https://ml-staging-etcd2001.codfw.wmnet:2379
cluster is healthy
#
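The same check can also be scripted against each member's HTTP /health endpoint; a minimal sketch, assuming the cluster CA is in the local trust store:

import json
import urllib.request

# Poll the /health endpoint of each member listed in the output above.
MEMBERS = [f"ml-staging-etcd200{i}.codfw.wmnet" for i in (1, 2, 3)]

for host in MEMBERS:
    url = f"https://{host}:2379/health"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(host, json.load(resp))
    except OSError as err:
        print(host, f"unreachable: {err}")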