https://ores-beta.wmflabs.org/v3/scores/viwiki/123125/articletopic seems working now :)
IIUC the next steps should be to run something like T212818#4865070 for drafttopic, then update the related submodule in the deploy repo, and then re-test in Beta.
Tue, May 11
@Halfak quick check-in to understand the status of the fix (and whether my team should follow up to fix the regression, etc.) :)
Mon, May 10
This is done! With T277062 Aiko and Miriam were able to run tensorflow-rocm only on GPU nodes :)
FROM docker-registry.wikimedia.org/golang:1.13-3 as build
Still reported down :(
@Legoktm we just added a review step for GitHub repositories that end up in production, to ensure that a member of the ML team reviews each patch. It is a compromise to avoid patches being merged without us noticing (especially self-merges).
Sat, May 8
Before closing - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Manage_principals_and_keytabs needs to be updated :)
Second day without any error!
@srodlund thanks a lot for all the work! Have a great weekend too :)
Fri, May 7
@srodlund +1 for the image credit thanks!
@srodlund it seems that there is nothing outstanding anymore, we can publish!!
@srodlund thanks a lot for the extra pass, I'll try to resolve the last open comments today so you'll be free to publish anytime :)
I took some extra steps:
No native-thread errors registered in the past few hours; it looks like we are out of the woods, but I'll wait until Monday before declaring victory.
Thu, May 6
@srodlund if you have ideas we are all ears, any elephant-related image could be ok :)
@Ottomata the image looks so good! :)
@srodlund thanks a lot for the review! I accepted all the comments and left two questions for you, feel free to solve them in case everything looks good.
I cannot see "Settings" in my GH view, so I guess I need more permissions to be able to add the rules. We should:
After a chat with Joseph we decided to proceed one change at a time:
More info about what binaries are executed in the minikube test that I made:
@jcrespo quick question - if we want to move forward with this, do we need hardware planned for the next fiscal year? I know that the use case is very high-level and there are a lot of unclear points, so any input will be appreciated :)
@RKemper I restarted the failed prometheus units on the node to clear icinga, but puppet is still disabled; can you re-enable it when you have a moment, if that's ok? (I didn't want to do it in case you were working on it.)
@Papaul what did you do to fix it?? (curious)
Wed, May 5
@Halfak what is the likelihood that other models have the same issues, but we haven't seen errors yet due to not enough requests ending up in ERRORS?
In my opinion we should roll back, work on a patch, do more testing, and re-deploy when we are confident. Thoughts?
@Halfak I see mostly 'model_names': ['reverted', 'articletopic'] for viwiki in codfw..
elukey@ores2001:~$ sudo journalctl -u celery-ores-worker.service | grep Warning
May 05 13:59:19 ores2001 celery-ores-worker: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.ensemble.gradient_boosting module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.
May 05 13:59:19 ores2001 celery-ores-worker: warnings.warn(message, FutureWarning)
May 05 13:59:19 ores2001 celery-ores-worker: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.20.3 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
May 05 13:59:19 ores2001 celery-ores-worker: UserWarning)
May 05 13:59:19 ores2001 celery-ores-worker: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.tree.tree module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.tree. Anything that cannot be imported from sklearn.tree is now part of the private API.
May 05 13:59:19 ores2001 celery-ores-worker: warnings.warn(message, FutureWarning)
May 05 13:59:19 ores2001 celery-ores-worker: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.20.3 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
May 05 13:59:19 ores2001 celery-ores-worker: UserWarning)
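For context on the UserWarning in the log above: sklearn estimators stamp the library version into the pickle at dump time and compare it at load time. The snippet below is a minimal toy reproduction of that mechanism, not sklearn's actual code; the class, field name, and helper are all hypothetical.

```python
import pickle
import warnings

LIB_VERSION = "0.22.1"  # the library version currently "installed"

class Estimator:
    """Toy stand-in for an sklearn estimator: it records the library
    version inside the pickle, similar in spirit to sklearn's version check."""

    def __init__(self, coef=1.0):
        self.coef = coef

    def __getstate__(self):
        # Stamp the pickle with the version of the library that wrote it
        return {**self.__dict__, "_lib_version": LIB_VERSION}

    def __setstate__(self, state):
        pickled_with = state.pop("_lib_version", "unknown")
        if pickled_with != LIB_VERSION:
            warnings.warn(
                f"Trying to unpickle estimator Estimator from version "
                f"{pickled_with} when using version {LIB_VERSION}. "
                "This might lead to breaking code or invalid results.",
                UserWarning,
            )
        self.__dict__.update(state)

def pickled_with(version):
    """Serialize a model pretending an older library version wrote it."""
    global LIB_VERSION
    saved, LIB_VERSION = LIB_VERSION, version
    try:
        return pickle.dumps(Estimator())
    finally:
        LIB_VERSION = saved

# Loading a "0.20.3" pickle under "0.22.1" triggers the UserWarning above,
# which is exactly the shape of the warning ORES logged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model = pickle.loads(pickled_with("0.20.3"))
```

The takeaway is that the warning fires on every unpickle of a model trained under the old sklearn, which is why retraining (or re-pickling) the models under the deployed version makes it go away.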
There seems to be a regression in scores errored mostly in codfw (ORES is active/active), so some traffic is impacted. This is an example:
Joe gave me a nice pointer in production-images, namely the loki multi-stage container example. The idea is to build the Go binaries in one container first, then copy them into the official Docker image to push to the registry. If we find a way to build istio (which in theory shouldn't be super difficult), we should also be able to reuse Docker images like https://github.com/istio/istio/blob/master/pilot/docker/Dockerfile.proxyv2 relatively easily (same for Knative, etc.).
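A minimal multi-stage sketch of the idea, in the spirit of the loki example: the base-image tags, the build target, and the paths here are all assumptions for illustration, not the actual production-images recipe.

```dockerfile
# Stage 1: build the Go binary in a full golang image
# (registry path matches the FROM line used elsewhere in these notes)
FROM docker-registry.wikimedia.org/golang:1.13-3 AS build
WORKDIR /src
COPY . .
# hypothetical build target; istio's real build entry points may differ
RUN go build -o /out/pilot-discovery ./pilot/cmd/pilot-discovery

# Stage 2: copy only the compiled binary into a slim runtime image
FROM docker-registry.wikimedia.org/buster:latest
COPY --from=build /out/pilot-discovery /usr/local/bin/pilot-discovery
ENTRYPOINT ["/usr/local/bin/pilot-discovery"]
```

The benefit is that the toolchain (compiler, sources, caches) never ends up in the image pushed to the registry, only the binaries do.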
@razzi each check has its own interval, check_puppet_run_changes might run every X hours so it may be slow to update. If you want to get fresh results you can force a reschedule of the check via Icinga UI (you should find the option in the dropdown menu where Acknowledge etc.. options are).
Almost forgot - the procedure should also include T231067#6863800 :)
@srodlund draft ready! I shared the gdoc with you and the Analytics team :)
Tue, May 4
Something interesting that I found today is: https://gcsweb.istio.io/gcs/istio-build/dev/1.6-alpha.3ddc57b6d1e15afebefd725e01c0dc7099f3f6dd/docker/
Yes let's fully decommission eventlog1002 once we are ok with 1003 :)
Mon, May 3
@Pchelolo the summary is great! I can certainly work from the ML/Analytics side when needed, so let me know if we can start scoping out the problem (maybe another task?).
Very interestingly, the pre-caching stuff is what powers https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_revision_score. The scores are sent to kafka and then exposed, so I am not sure if we can turn this off. It is also a good thing to keep in mind for Lift Wing.
+1 definitely, the right wording is popularity, I agree 100%
I would do it anyway since these are the dbs that we back up periodically, and it may take a while (namely months) to get everything set up, running, and migrated. Since it is mostly my fault I can spend the time on it, but if the team thinks it is not worth it I can drop it and decline :)
Another possibility is that the MBR was installed on only one of the two disks of the RAID1, so now nothing boots. IIRC PXE wasn't able to start either, otherwise I'd have proposed booting with a rescue image to inspect the two disks.
All these failures smell like a major host problem..
@srodlund sorry for the lag, Joseph and I should have a draft ready this week :)
Sat, May 1
I left the host in the System Config panel so it won't keep trying to PXE; it will need a power reset to start investigations :)
Fri, Apr 30
Let's revisit this if anything happens again, it seems a sporadic issue.
I think we should be fine from now on, I wouldn't add more complexity to what we have :)
On paper we should have free memory available on production nodes, but ideally the three changes outlined in the description would have been rolled out as three separate deployments, to get a better sense of the performance impact of each change. I know there may be interconnections between the jobs, and that breaking everything down now would be a problem, but please let's remember this next time. Big deployments are not great in general; I really prefer smaller ones :)
@GoranSMilovanovic sure! During the migration to Debian Buster of the hosts where the Hive Server/Metastore runs, we hit a lot of problems with the only available Java lib for MySQL, namely the one containing the org.mariadb.jdbc.Driver JDBC driver. We have now reverted to the old MySQL driver, manually porting the missing Debian packages from Stretch to Buster, and sqoop now needs to run without the extra --driver option. So this option caused problems while we were figuring out how to upgrade our systems following Debian best practices, but hopefully we should be good now (at least until Debian Bullseye, the next version, is out).
Everything looks good! Also dropped the hue_next database so it is less confusing when inspecting what we run on the various db nodes (basically we now have only the hue database).
@Cmjohnson hi! Any news about the worker nodes?
Thu, Apr 29
Yep it takes a bit! If the datanode got the new config you'll see more data in the upcoming days :)
@Halfak I'd ask, whenever you have a moment, for some details about the following points:
No issues from our side, going to close, please reopen if necessary!
@Halfak beta seems unblocked for the moment, please check if there are other issues. Current problems live-patched that may require a better fix:
@razzi you have the wrong slot, it is the 10th :)
Wed, Apr 28
@Halfak no no, I meant how to trigger the "Failed to establish a new connection: [Errno 111] Connection refused" problem (related, IIUC, to connections to localhost:6500 failing). Now that we have a proxy things should flow nicely, but I am not sure how to test it.
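For the record, Errno 111 is just what a TCP connect returns when nothing is listening on the target port. The sketch below reproduces that locally by grabbing a port that is known to be free; it does not touch ORES itself, and localhost:6500 is only mentioned in the comment as context from the thread.

```python
import errno
import socket

def connect_errno(host, port, timeout=1.0):
    """Try a TCP connection; return None on success or the errno on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return exc.errno

# Grab a port that is currently free, so nothing is listening on it:
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
free_port = probe.getsockname()[1]
probe.close()

# Connecting now fails fast with ECONNREFUSED (Errno 111 on Linux),
# which is the same failure ORES would see talking to localhost:6500
# with no backend bound to that port.
result = connect_errno("127.0.0.1", free_port)
```

So one way to trigger the error on purpose would be to stop (or firewall) whatever normally listens on the backend port and send a request through.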
Spent some time today trying to add the Envoy config to the Ores instance in Beta, and all the production code assumes (rightfully) TLS + LVS IPs, so adapting it to beta may not be possible without further puppet changes.
Added the remaining AAAA records for kafka-main200[2-5]!
Nice! Let's keep it open since I want to understand whether we need to use --driver com.mysql.jdbc.Driver or not; it will also have some impact for Analytics. Thanks a lot for bringing this up, and sorry for the trouble!
@crusnov we are good to deploy the other AAAA records, can we proceed?
Tue, Apr 27
@kevinbazira on ores-misc-01 the root partition is full :(