Fri, Dec 2
Adding a reminder to myself to capture dashboard links, log-diving commands, and other troubleshooting info from today's IRC log and add them to the WDQS runbook.
Follow-up email sent to Wikidata Users list:
Email sent to Wikidata Users list:
Thu, Dec 1
We changed the Java GC options above to reduce old GC alerts, but we had another one today for cloudelastic:
Wed, Nov 30
Mon, Nov 28
Per last week's conversation with @dcausse:
The Helm chart for the operator needs to be modified for the DSE environment. @BTullis's Spark Helm chart PR is probably a good template. Also note that @Ottomata is working on the Flink Docker images.
The Flink jobs are deployed. I also mirrored the Flink Helm Chart so we can see what's happening.
Tue, Nov 22
Will create a new task for
- we don't hit an OOME within 2 weeks
- gather some data
Agreed, this is not a best practice and we wouldn't make that a permanent change in the cookbook. We did look at the rsync Puppet class and saw there was an option to disable encryption, so we figured it would be OK as a one-off (also considering we were transferring publicly-available data).
I've updated my post to the exact quote. I don't think this has offended anyone who did engage, but if it did please let me know and I will remove it from this page entirely. I don't know how to read "attributing the comment to someone else," can you clarify? It sounds like you are accusing me of sockpuppeting. If you don't believe me, I can ask the person if they are OK with me revealing their personal details.
Rough way of validating that the extraneous JVM options have been removed:
Creating this ticket retroactively to link the following code changes:
Fri, Nov 18
Opened T323380 for the disk errors, closing this one for now.
Just a heads-up as I'm re-engaging. I plan to use the stream-enrichment-poc namespace within the next couple of weeks. Ping me here or on the Slack thread if this is going to be a problem and/or if you have any advice on this.
Hello DC Ops,
We looked at the rsync class in Puppet, and it didn't seem like a great fit for our use case. Instead, we simplified the cookbook (removing openssl and pigz) and were able to complete the data transfer.
The reload is complete. However, we had to reboot wcqs1003.eqiad.wmnet several times before it would actually load the OS, and the BIOS displayed disk errors every time:
I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we close it as declined.
Thu, Nov 17
Wed, Nov 16
Host deployment-flink0.deployment-prep.eqiad1.wikimedia.cloud is up and running the default Flink job as described on Flink's quickstart page. Next steps are to test a Flink Application Cluster as a k8s Deployment and as a k8s Job resource, as described here.
Tue, Nov 15
In addition to the patches listed above, we wrote a new test as well.
Mon, Nov 14
I reviewed the commit above, but I don't see any place where the actual server FQDNs are hardcoded. So I don't think a Puppet patch will be necessary, just some manual steps on the servers.
See also this commit.
wcqs1003 is the only host left that needs a reload:
Wed, Nov 9
Mon, Nov 7
- Adding a timeout to the nc command did not help.
Looking at Blazegraph Database's GitHub, there are a few issues related to the Java version.
Thanks again for your response.
Nov 4 2022
Unassigning as I'm no longer actively working on this. I do intend to finish it eventually.
Nov 3 2022
Started the Terraform repo
Nov 2 2022
Thanks jbond, these are all legitimate points and must be addressed before we start to consider Ansible. Here's what I have so far:
Oct 31 2022
Closing in favor of the original ticket
I wrote this before I realized that there is no cumin/spicerack in the deployment-prep env. I still need to get more familiar with cookbooks, but we'll close this one for now and get another ticket with some better requirements started.
Closing this out, as the original script fulfilled its purpose and we need to open a new ticket with more specific requirements before we continue.
Adding T321587 as a prereq.
Oct 27 2022
Closing, will continue similar work in https://phabricator.wikimedia.org/T321587
Oct 26 2022
Still not sure exactly why they're failing. wcqs2002 also has the new data.
wcqs2003 and wcqs1003 still need it.
The transfer cookbook continues to fail with the same errors noted above.
Oct 25 2022
We had a few transfers fail very quickly today. From #wikimedia-operations:
As of this writing, the following hosts have the updated data:
Oct 24 2022
The data reload is complete, but there were errors.
Oct 21 2022
When I copy repository_elasticsearch-oss.list into /etc/apt/sources.list.d on the test server and run `apt-cache update`, I get the following error:
Created server es-oss on the WMCS Search project to help troubleshoot
Looks like we (as in Search Platform SREs) need to cut a new package for wmf-elasticsearch-search-plugins. The general SRE teams are not needed for this work, so I'm removing the tag.