Sat, Nov 17
Fri, Nov 16
Thu, Nov 15
It's Varnish: https://vote.wikimedia.org/wiki/Main_Page?ll
We decided to go with redis-sentinel
This is done; we are not fixing transactions. We'll work on sentinel.
I only did the jump form; there are lots of forms in this extension that would benefit from migrating to OOUI, especially the ballot forms.
I looked at sentinel. It's a little bit complex but easily doable. We probably need to install redis on the ores nodes and have them run sentinel (Example 3 in https://redis.io/topics/sentinel). Setting up the configuration would be a little hard, and basically all of it needs to be done via puppet. One other concern is dockerizing ores; this setup is going to be even harder with docker.
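On the client side, redis-py ships Sentinel support; a minimal sketch of how a service could discover the current master (the hostnames and the "ores-master" service name are placeholders I made up):

# Minimal sketch of client-side failover via redis-py's Sentinel support.
# Hostnames and the "ores-master" service name are hypothetical.
from redis.sentinel import Sentinel

# Point the client at a few sentinel processes (26379 is the default port).
sentinel = Sentinel(
    [("ores-redis-01", 26379), ("ores-redis-02", 26379)],
    socket_timeout=0.5,
)

# Sentinel resolves the current master/replica, so the application does
# not need a config change when a failover happens.
master = sentinel.master_for("ores-master", socket_timeout=0.5)
replica = sentinel.slave_for("ores-master", socket_timeout=0.5)

master.set("ores:sentinel-test", "1")
print(replica.get("ores:sentinel-test"))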
Still errors and I can't merge anything.
Wed, Nov 14
Should I turn this into an RFC?
The backend for Special:Tags was changed a while ago, so this should not happen anymore. Feel free to reopen it if you encounter such things again.
I did it today.
Property Suggester is also failing with this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PropertySuggester/+/473545
I put up this workaround for now: https://www.wikidata.org/w/index.php?title=MediaWiki:Gadget-ProtectionIndicators.js&diff=789802852&oldid=393479817
I'm looking for a proper fix
I will work on it.
Tue, Nov 13
Fri, Nov 9
^ I made this patch but basically had no way to test it. Please double-check and, if possible, run it for a short period of time before merging.
We can't increase the number of our celery workers due to memory allocation. Maybe we need to tackle T182350: Profile ORES code memory use, optimize pre-fork allocation first, fix things, and then come back to increasing the number of celery workers. I would advise against lifting the connection limit entirely, but making it a little bit higher would be nice. Also, providing all of the ORES scores in the analytics data lake, or the brand new hadoop/presto cluster for labs (which is being built), or on dumps.wikimedia.org would remove most of the researchers' need for the API.
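For reference, the Celery knobs involved would look roughly like this (the numbers are made up, not our actual values):

# Sketch of the Celery settings that trade worker count against
# per-worker memory in the pre-fork pool. Numbers are hypothetical.
from celery import Celery

app = Celery("ores", broker="redis://localhost:6379/0")

app.conf.update(
    worker_concurrency=45,               # pre-forked worker processes
    worker_max_memory_per_child=200000,  # in KiB; recycle a worker past this
    worker_max_tasks_per_child=1000,     # also recycle after N tasks (leaks)
)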
Given that we have the limit of 4 parallel connections per IP in place, and the difficulties of implementing this, I propose we decline this ticket. We do need some better protection against malicious DDoS, but I doubt this would help.
With the new number of 9 parallel connections and 14 minutes to deploy (down from around half an hour), I think it's good now.
Deployed the change on beta and it works. Will deploy it later next week.
Thu, Nov 8
Just saying: the "ORES" service and the backend of the ORES extension are maintained by the Scoring Platform team, not Growth.
Running redis-cli monitor on deployment-ores01 shows these kinds of transactions:
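For context, these MULTI/EXEC blocks come from redis-py pipelines: Celery's redis result backend stores a result and publishes it in one pipeline, roughly like this sketch (the key and value are illustrative):

# Sketch of the client pattern that produces MULTI ... EXEC blocks in
# `redis-cli monitor`. Key and value here are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

# transaction=True (the default) wraps the queued commands in MULTI/EXEC
# on the wire, which is exactly what nutcracker/twemproxy cannot proxy.
pipe = r.pipeline(transaction=True)
pipe.set("celery-task-meta-1234", '{"status": "SUCCESS"}')
pipe.publish("celery-task-meta-1234", '{"status": "SUCCESS"}')
pipe.execute()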
Which now brings us to the question: what's the next step? Should we reinvent the wheel and build a basic nutcracker replacement that can handle transactions? Should we switch to RabbitMQ for the broker and keep redis for the other stuff? (In that case we can put the redis behind the nutcracker, but I have no idea how to handle RabbitMQ.) Fork celery and drop the transaction part? (Nooo)
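For the RabbitMQ option, the split itself would be a one-liner in the Celery config; a sketch with placeholder URLs (the operational side of RabbitMQ is the real unknown):

# Sketch of splitting the broker from the result backend, per the
# RabbitMQ option above. URLs and credentials are placeholders.
from celery import Celery

app = Celery(
    "ores",
    broker="amqp://guest:guest@rabbitmq-host:5672//",  # task queue on RabbitMQ
    backend="redis://redis-host:6379/0",               # results stay in Redis
)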
With the hotfix, it's back to normal, but we can't deploy anything because it might break things again.
Apparently in the scap deploy, git lfs pull didn't work on all of the wheels assets (only on the *.whl files and not on the corpora text files):
ladsgroup@deployment-ores01:/srv/deployment/ores/deploy/submodules/wheels$ sudo git lfs pull
Git LFS: (55 of 55 files) 96.63 KB / 96.63 KB
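One rough way to check which assets were left un-fetched is to look for files that still contain the LFS pointer text instead of the real content; a sketch (the path is just the one from the paste above):

# Rough sketch: list files that are still Git LFS pointer stubs, i.e.
# files that `git lfs pull` failed to smudge into real content.
import os

POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

top = "/srv/deployment/ores/deploy/submodules/wheels"
for root, dirs, files in os.walk(top):
    dirs[:] = [d for d in dirs if d != ".git"]  # skip git internals
    for name in files:
        path = os.path.join(root, name)
        with open(path, "rb") as f:
            if f.read(len(POINTER_PREFIX)) == POINTER_PREFIX:
                print("still a pointer:", path)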
That is broken on production now.
I can download it on my laptop (though my connection is super slow), but it's not downloading anything on labs nodes or prod stat machines :/
Git LFS is broken:
ladsgroup@stat1007:~$ git clone ssh://email@example.com:29418/research/ores/wheels
. . .
ladsgroup@stat1007:~/wheels$ git lfs pull
Git LFS: (0 of 203 files) 0 B / 212.78 MB
Git LFS: (0 of 203 files) 0 B / 212.78 MB
I have the feeling that rebuilding the wheels with git lfs caused this (that was the only thing deployed yesterday).
This is done
Wed, Nov 7
Deployment time has now been reduced from 22 minutes to 17 minutes. I will increase the number of parallel connections from 5 to 8 in the next try.