User Details
- User Since
- Oct 7 2014, 4:49 PM (492 w, 6 d)
- Availability
- Available
- LDAP User
- EBernhardson
- MediaWiki User
- EBernhardson (WMF)
Tue, Mar 5
We are in the process of deploying a new updater for CirrusSearch, with cloudelastic as the first destination cluster. Duplicates could be a result of that, and are good to report so we can get everything working great before moving on to the primary search clusters.
Thu, Feb 29
@bking this is likely related to the transition of cloudelastic to private IPs? I'll take a look later if you don't have ideas.
Mon, Feb 26
I suspect that at the time we initially set up global-search we didn't have the cloudelastic.wikimedia.org alias up and running yet, but now that it exists, global-search should certainly point at it instead of the individual servers.
Thu, Feb 22
To review the documentation changes (there are also two revisions from bking mixed in there): https://wikitech.wikimedia.org/w/index.php?title=Search&diff=2153071&oldid=2127290
Example query of the REST API (this could be nicer if we installed curl or wget, or exposed the REST API directly):
```
KUBECONFIG=/etc/kubernetes/cirrus-streaming-updater-deploy-staging.config kubectl \
    exec \
    flink-app-consumer-search-backfill-5b9f979487-dsqsb \
    -c flink-main-container \
    -- \
    python3 -c 'import urllib.request; print(urllib.request.urlopen("http://localhost:8081/v1/jobs").read().decode("utf8"))'
```
On further review, simply documenting the various commands to run seemed error-prone. The attached patch adds a python script that automates most of the reindexing and backfill process to ease the future burden.
Feb 15 2024
This was supposed to be in the backport window today, but train problems blocked that. It's a pretty safe patch though; I'll ship it a little later.
It seems the patch didn't actually make it into wmf.18 as expected; jenkins-bot never finished the merge, so this was only deployed in wmf.17. I'll get it shipped there too.
Feb 14 2024
I've been reviewing our options for backfilling and trying to come up with a plan. I think the following will work:
This looks resolved now; the bi-hourly spikes have gone away since the Monday deployment.
Feb 12 2024
Updated the mw.org page with the latest changes, so it's now in line with the repository. I think this is enough to call this ticket complete. T355267 is the task for deploying this extension to the wikis.
Released the plugin as -wmf12. The patch above updates the .deb to use the newest versions. An MR is also up on gitlab to update the dev image (for cindy/dev envs) to use the new .deb once it's available.
Feb 9 2024
If we need them silenced, the best bet is probably to re-enable the writes for these wikis. That can be done with a mediawiki-config patch.
I haven't managed to track down where the `Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic` error comes from. We recently turned off writes to this cluster from mediawiki on select wikis, but something in the codebase is still trying to create writes even though it shouldn't. Needs more investigation on our side.
Feb 5 2024
Current process (to be refined). None of this is committed anywhere yet; I'm mostly working out what is going to work.
Idea is something like:
This is a bit of a non-error. What happened is:
Feb 1 2024
Started with the ghost-page-in-index errors, since there are only a couple. We have two pages in cloudelastic for frwiki that have been correctly deleted in eqiad but still exist in cloudelastic:
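A quick way to verify one of these, as a hypothetical sketch (the index name, page id, and port below are placeholders, not the real frwiki values):
```
import json
import urllib.error
import urllib.request

CLOUDELASTIC = "https://cloudelastic.wikimedia.org:9243"  # port is a guess
INDEX = "frwiki_content"   # assumed index name
PAGE_ID = 12345            # placeholder page id

# A ghost page should 404 on the eqiad cluster but still return a
# document from cloudelastic.
try:
    resp = urllib.request.urlopen(f"{CLOUDELASTIC}/{INDEX}/_doc/{PAGE_ID}")
    doc = json.load(resp)
    print("still present in cloudelastic:", doc["_source"].get("title"))
except urllib.error.HTTPError as e:
    print("status:", e.code)  # 404 would mean the page is already gone
```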
Localization - The only localization is the extension description; unclear if it's necessary (or how).
The selected set of wikis has been enabled in production and is performing writes. Issues resulting from this deployment will be dealt with in separate tickets.
It seems like the problem here is that code which runs prior to endpoint-specific code needs a generic way to inform the output layer of what error has occurred. Today we register hook handlers for each specific output layer and throw different output-specific exceptions.
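To illustrate the direction, a minimal sketch in python rather than the actual PHP code; all names here are invented:
```
# Sketch: pre-endpoint code raises one shared error type, and each output
# layer maps it to its own response shape, instead of pre-endpoint code
# throwing output-specific exceptions.
class GenericRequestError(Exception):
    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code

def render_rest(err: GenericRequestError) -> dict:
    # The REST output layer decides how the error appears on the wire.
    return {"httpCode": 400, "error": err.code, "detail": str(err)}

def render_action_api(err: GenericRequestError) -> dict:
    # The Action API layer renders the very same error its own way.
    return {"error": {"code": err.code, "info": str(err)}}
```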
Jan 31 2024
First pass review of the administration processes listed on wikitech and which of them will be changing. This started as only being about the streaming updater, but I added a second section on outdated topics; perhaps that's another ticket?
Jan 30 2024
We followed this data over time and it seemed to stay in line. We've now progressed from relforge to a cloudelastic deployment and we can probably consider this complete.
Jan 23 2024
The patch documents changing app.config_files.app\.config\.yaml via the command line, which hopefully does what we need. It allows changing the same values, and avoids the problem described in the ticket since the passed arg never needs to be included in app.job.args, and thus doesn't need to pass the is-string check.
I believe removing cloudelastic from the list of clusters to write to should be sufficient. mediawiki-config has the appropriate bits to do this on a per-wiki basis.
The idea behind the empty dataset is that airflow looks at the hive metadata to see if it's ready for processing. The idea would be to add a partition to the hive metadata that points at nothing. Something like the following:
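A hypothetical sketch of that partition registration (the table name and HDFS path are placeholders, not the real layout):
```
# Register a partition whose location is an empty directory, so airflow's
# hive-metadata sensor considers the data "ready" even though there is
# nothing behind it. Table name and path are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    ALTER TABLE discovery.cirrus_update_events
    ADD IF NOT EXISTS PARTITION (year=2024, month=1, day=22)
    LOCATION 'hdfs:///wmf/data/placeholder/empty'
""")
```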
Jan 18 2024
Note that once deployed, this will not instantly fix the pages. The pages will be fixed on the next edit, or when the background reindexer gets to the page (once every ~16 weeks).
The extension is now written and documented, but still needs to finish code review. Perhaps, though, it would be worth talking about the appropriate level of verification that should be applied to the requests. Some options, in order of increasing complexity:
Jan 17 2024
Went through https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment to make sure we've done what's needed:
Jan 9 2024
After reviewing the options here, along with the current state of the mw k8s deployment, it looks like we can drop the requirement to execute via the job runner infrastructure. If that's the case, is running a pseudo-job still the best plan? Some considerations:
Dec 7 2023
Another option we came up with was to backfill to a null flink sink; this would allow measuring the capacity of the flink pipeline by itself, separate from the ability of the chosen elasticsearch cluster to consume those updates.
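As a rough sketch of the shape in pyflink (a toy collection stands in for the real kafka source; the actual pipeline's topology is not reproduced here):
```
# Hypothetical "null sink" backfill: every record is dropped before any
# external write, so job throughput measures the flink side alone.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Toy stand-in for the real update stream.
updates = env.from_collection(["update-1", "update-2", "update-3"])

# Discard everything; nothing ever reaches elasticsearch.
updates.filter(lambda update: False).print()

env.execute("backfill-null-sink-sketch")
```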
Dec 6 2023
In my estimation, an appropriate solution here is to move the cirrusCheckerJob back to the old job runners, and bring it back after solving the TLS issue.
Cloudelastic uses acmechief for its TLS certificates, vs most prod services which probably (?) have an internally signed certificate. It seems plausible that the problem has something to do with the certs coming from acmechief (not the certs themselves, but how envoy validates them).
Combined with another bug that doesn't correctly recognize these failures, this has resulted in an increase of cirrusSearchLinksUpdate from 300-500/s to around 800/s.
For comparison, envoy works fine from mwdeploy2002 itself:
```
deploy2002 $ mwscript shell.php testwiki
Psy Shell v0.11.21 (PHP 7.4.33 — cli) by Justin Hileman
> $ch = curl_init('http://localhost:6105')
= curl resource #1575
```
Dec 5 2023
I've run this a few times, and it claims the indices in relforge match the ones in production. I'm still a bit suspicious that it passed on the first try; maybe we could try harder to see if something is broken. But we've done the testing, and what we have so far claims to work.
Dec 4 2023
The current plan for a gradual deploy is to start with a selection of wikis that add up to ~25% of the total rate. If that's too high we can remove commonswiki from the set, which should bring it down to ~13%. Before we can turn those events on, I believe we need the topic partitioning changes applied; the topic currently has a single partition.
It looks like we are estimating the page rerender events to be at approximately the same rate as the existing cirrusSearchLinksUpdate jobs. Some related stats, estimated from one week (Nov 27-Dec 3) of Kafka history. This reuses the prior estimate of 3 copies of the data at 0.6kB per event with 7 days retention. I added the row about removing commons since it is the largest of the selected wikis, giving an option to reduce the initial rollout.
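For reference, the arithmetic behind that kind of estimate; the 500/s rate here is an assumed round number borrowed from the cirrusSearchLinksUpdate figures above, not a measured value:
```
# Back-of-the-envelope kafka storage estimate from the figures quoted above.
events_per_sec = 500      # assumed rate, comparable to cirrusSearchLinksUpdate
event_kb = 0.6            # kB per event (prior estimate)
copies = 3                # copies of the data in kafka
retention_days = 7

total_gb = events_per_sec * event_kb * copies * retention_days * 86_400 / 1e6
print(f"~{total_gb:,.0f} GB of kafka storage")  # ~544 GB at these numbers
```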