- Open firewall on cloudelastic machines to allow connections from mwmaint*, mw job runners to cloudelastic
- Add cloudelastic to wgCirrusSearchClusters on all non-private wikis. Do not add to wgCirrusSearchWriteClusters initially
- Set wgCirrusSearchDropDelayedJobsAfter to 15 minutes for cloudelastic
- Initialize all indices
- Import cirrussearch dumps from dumps.wikimedia.org
- Add cloudelastic to wgCirrusSearchWriteClusters
- Reindex updates made between when the dumps were created and when writes were enabled
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Declined | Feature | None | T71489 Expose mwgrep functionality on-wiki
Resolved | | None | T109715 Replicate production elasticsearch indices to labs
Resolved | | EBernhardson | T220625 Initialize CirrusSearch on cloudelastic
Resolved | | debt | T223519 Expose cloudelastic to wmf cloud
Resolved | | debt | T224324 LB for cloudelastic
Resolved | | EBernhardson | T230495 Partition CirrusSearch mediawiki jobs by cluster
Resolved | | EBernhardson | T236186 Move bulk content out of the ElasticaWrite job
Resolved | | Ottomata | T239135 Create partitioned CirrusSearchElasticaWrite topic
Event Timeline
Open firewall on cloudelastic machines to allow connections from mwmaint*, mw job runners and cloudelastic
A quick look into our puppet code and ferm configuration does not show an obvious variable that would identify either the mwmaint* or the job runners nodes. I'll keep digging!
Change 502829 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic
Change 502832 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [cirrus] add cloudelastic service
Change 502829 merged by Gehel:
[operations/puppet@production] cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic
Change 507219 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Add cloudelastic servers to wgCirrusSearchClusters
Change 507219 merged by jenkins-bot:
[operations/mediawiki-config@master] Add cloudelastic servers to wgCirrusSearchClusters
Mentioned in SAL (#wikimedia-operations) [2019-04-29T23:54:21Z] <ebernhardson@deploy1001> Synchronized tests/: T220625 Add cloudelastic servers to wgCirrusSearchClusters (1/5) (duration: 00m 53s)
Mentioned in SAL (#wikimedia-operations) [2019-04-29T23:55:33Z] <ebernhardson@deploy1001> Synchronized wmf-config/LabsServices.php: T220625 Add cloudelastic servers to wgCirrusSearchClusters (2/5) (duration: 00m 52s)
Mentioned in SAL (#wikimedia-operations) [2019-04-29T23:56:47Z] <ebernhardson@deploy1001> Synchronized wmf-config/ProductionServices.php: T220625 Add cloudelastic servers to wgCirrusSearchClusters (3/5) (duration: 00m 50s)
Mentioned in SAL (#wikimedia-operations) [2019-04-29T23:58:34Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T220625 Add cloudelastic servers to wgCirrusSearchClusters (4/5) (duration: 00m 52s)
Mentioned in SAL (#wikimedia-operations) [2019-04-29T23:59:46Z] <ebernhardson@deploy1001> Synchronized wmf-config/CirrusSearch-production.php: T220625 Add cloudelastic servers to wgCirrusSearchClusters (5/5) (duration: 00m 52s)
Created indices using the following. First created the index for testwiki and verified it made it into a sane state.
expanddblist private > ~/private_wikis
expanddblist all | grep -vFf ~/private_wikis | while read wiki; do
    mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=$wiki --cluster=cloudelastic
done
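One thing worth noting about the filter in that loop: `grep -vFf` matches substrings, so a private wiki name that happens to be a prefix of a public one would exclude both. Whether that occurs in the real dblists is not established in this task; the sketch below uses made-up wiki names purely to show the difference that adding `-x` (whole-line match) makes.

```shell
# Hypothetical sample lists standing in for expanddblist output.
private_list=$(mktemp)
printf '%s\n' examplewiki > "$private_list"
all_wikis() { printf '%s\n' examplewiki examplewikibooks testwiki; }

# -F alone matches substrings, so examplewikibooks is dropped as well.
substring_filtered=$(all_wikis | grep -vFf "$private_list" | tr '\n' ' ')
# Adding -x requires the whole line to match, so only examplewiki is dropped.
exact_filtered=$(all_wikis | grep -vxFf "$private_list" | tr '\n' ' ')

echo "substring match keeps: $substring_filtered"
echo "exact match keeps:     $exact_filtered"
rm -f "$private_list"
```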
It turns out we can't set the max shards per node on a per-cluster basis; we only have one value used on all clusters. Shouldn't be a big change, but this needs to be updated to take a per-cluster value, much like our other configurations.
We should additionally set the refresh interval to ~15 minutes for cloudelastic to help push back on expectations of update recency.
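For reference, a minimal sketch of what such a change looks like at the Elasticsearch level, using the standard `index.refresh_interval` setting. The helper name `refresh_settings_body` is made up for illustration, and the index name in the commented `curl` call is a placeholder. In practice this would more likely be set through CirrusSearch's own configuration; this task does not say which route was taken.

```shell
# Builds the JSON body for Elasticsearch's update-index-settings API.
# (Hypothetical helper, not part of CirrusSearch.)
refresh_settings_body() {
    interval="$1"
    printf '{"index":{"refresh_interval":"%s"}}' "$interval"
}

# Hypothetical usage against one index; port and index name are placeholders:
# curl -s -XPUT "https://cloudelastic.wikimedia.org:${CIRRUSPORT}/testwiki_content/_settings" \
#      -H 'Content-Type: application/json' -d "$(refresh_settings_body 15m)"

refresh_settings_body 15m; echo
```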
Importing group0 wikis from dumps with the following script:
DUMPDATE=20190429
DUMPURL=https://dumps.wikimedia.your.org/other/cirrussearch/${DUMPDATE}
expanddblist group1 | while read WIKI; do
    CIRRUSPORT=$(echo 'echo (new CirrusSearch\SearchConfig)->getClusterAssignment()->getServerList( "cloudelastic" )[0]["port"];' | mwscript eval.php --wiki=${WIKI})
    for INDEXTYPE in content general; do
        DUMPFILE="${WIKI}-${DUMPDATE}-cirrussearch-${INDEXTYPE}.json.gz"
        echo "Importing $DUMPFILE"
        echo
        https_proxy=http://webproxy.eqiad.wmnet:8080/ curl -s ${DUMPURL}/${DUMPFILE} | zcat | pv -c -N bytes | pv -c -N lines -l | \
            bin/parallel --blocksize 20971520 --pipe -L 2 -N 100 -j20 "https_proxy="" curl -s https://cloudelastic.wikimedia.org:${CIRRUSPORT}/${WIKI}_${INDEXTYPE}/_bulk --data-binary @- -H 'Content-Type: application/x-ndjson' >/dev/null"
    done
done
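The naming conventions the script relies on can be pulled out into small helpers, sketched below. The dumps are in Elasticsearch bulk format (an action line followed by a document line), which is why the script hands `parallel` two-line records via `-L 2`. The function names and the port number are illustrative assumptions, not anything defined in CirrusSearch or used in this task.

```shell
# Name of the dump file for one wiki, dump date and index type.
dump_file() { printf '%s-%s-cirrussearch-%s.json.gz' "$1" "$2" "$3"; }

# _bulk endpoint for the matching index; host and port are passed in.
bulk_url() { printf 'https://%s:%s/%s_%s/_bulk' "$1" "$2" "$3" "$4"; }

# A bulk-format stream is a sequence of (action line, document line)
# pairs, so a well-formed stream has an even line count. Reads stdin.
has_complete_pairs() {
    lines=$(wc -l | tr -d ' ')
    [ $((lines % 2)) -eq 0 ]
}

dump_file testwiki 20190429 content; echo
# 9243 is a placeholder port for illustration only.
bulk_url cloudelastic.wikimedia.org 9243 testwiki content; echo
```

A check like `zcat "$DUMPFILE" | has_complete_pairs` before importing would catch a truncated download, assuming the dump has been saved locally rather than streamed.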
Change 507609 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Start writing to cloudelastic from testwiki
Change 507609 merged by jenkins-bot:
[operations/mediawiki-config@master] Start writing to cloudelastic from testwiki
Mentioned in SAL (#wikimedia-operations) [2019-05-01T16:58:46Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T220625 Start writing to cloudelastic from testwiki (duration: 01m 01s)
Change 507703 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Start writing to cloudelastic for group0
Change 507703 merged by jenkins-bot:
[operations/mediawiki-config@master] Start writing to cloudelastic for group0
Mentioned in SAL (#wikimedia-operations) [2019-05-01T23:19:18Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T220625 Start writing to cloudelastic for group0 (duration: 01m 05s)
group0 indices have all been imported from dumps, live writes have been enabled, and forceSearchIndex.php has been used to catch up on any changes missed. Additionally, saneitize.php has been run over them to ensure everything is in an expected state. Not sure how practical it will be to use saneitize.php on the larger wikis.
I've additionally started importing group1 and group2 wikis from the dumps. This will likely take many days. They are all importing from the 20190429 dumps. group0 was imported from the 20190422 dumps.
@Gehel Something I haven't been able to figure out: I can't find the cloudelastic servers in the grafana 'Cluster overview' dashboard. Are they expected to show up there, and is there anything we should do to help things along?
Change 508732 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Limit the clusters archive index is written to
Change 508733 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Configure wgCirrusSearchPrivateClusters
Change 508733 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure wgCirrusSearchPrivateClusters
Mentioned in SAL (#wikimedia-operations) [2019-05-07T23:31:08Z] <ebernhardson@deploy1001> Synchronized wmf-config/CirrusSearch-production.php: T220625 Configure wgCirrusSearchPrivateClusters (duration: 00m 58s)
Change 508732 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Limit the clusters archive index is written to
Change 509110 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.34.0-wmf.4] Limit the clusters archive index is written to
Change 509110 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.34.0-wmf.4] Limit the clusters archive index is written to
Mentioned in SAL (#wikimedia-operations) [2019-05-09T23:43:36Z] <ebernhardson@deploy1001> Synchronized php-1.34.0-wmf.4/extensions/CirrusSearch/: T220625 Limit the clusters archive index is written to (duration: 00m 59s)
Mentioned in SAL (#wikimedia-operations) [2019-05-09T23:52:11Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T220625: Dont write to private wikis on cloudelastic (duration: 00m 50s)
This is still only taking group0 updates. Rolling out group1 updates is waiting on figuring out a proper inbound load balancer for job runners -> cloudelastic. Without this, a single host in cloudelastic being unavailable will result in a constant stream of errors in logstash.
Change 502832 abandoned by DCausse:
[cirrus] add cloudelastic service
Reason:
will be using lvs
Waiting on Ops to let us know about load balancing and how this should work in the future.
Change 528260 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Repoint cloudelastic at LB dns
Change 528263 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Temporarily stop writing to cloudelastic
Mentioned in SAL (#wikimedia-operations) [2019-08-05T20:49:14Z] <ebernhardson> nuke all search indices on cloudelastic preparing for fresh imports and live updates T220625
Change 528263 abandoned by EBernhardson:
Temporarily stop writing to cloudelastic
Reason:
wasn't necessary
Change 528260 merged by jenkins-bot:
[operations/mediawiki-config@master] Repoint cloudelastic at LB dns
Mentioned in SAL (#wikimedia-operations) [2019-08-05T23:03:15Z] <urbanecm@deploy1001> Synchronized wmf-config/ProductionServices.php: SWAT: 87b428d: Repoint cloudelastic at LB dns (T220625) (duration: 00m 48s)
Change 528503 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Turn on cloudelastic writes for group1
Change 528503 merged by jenkins-bot:
[operations/mediawiki-config@master] Turn on cloudelastic writes for group1
Mentioned in SAL (#wikimedia-operations) [2019-08-06T16:19:56Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T220625: Turn on cloudelastic writes for group1 (duration: 00m 47s)
Mentioned in SAL (#wikimedia-operations) [2019-08-06T16:40:37Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T220625: Re-sync enable group1 on cloudelastic, job runners are claiming its not enabled while app servers are sending jobs (duration: 00m 47s)
Mentioned in SAL (#wikimedia-operations) [2019-08-07T23:07:27Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T220625: Send writes for all non-private wikis to cloudelastic (duration: 01m 02s)
All wikis are writing to cloudelastic now. It will still be a few days to catch up on writes since July 29, the day the dump was made. Also, somehow importing commonswiki_file only imported ~25M out of 50M items. The saneitizer is working on fixing that, but it will take a bit.
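A shortfall like the commonswiki_file one can be spotted by comparing document counts after an import. The sketch below shows only the arithmetic; `pct_imported` is a made-up helper, and the commented `curl` call assumes the standard Elasticsearch `_count` endpoint with a placeholder port variable. It is not how the numbers above were obtained.

```shell
# Percentage of expected documents present, as an integer (rounded down).
# (Hypothetical helper for illustration.)
pct_imported() {
    imported="$1"; expected="$2"
    echo $(( imported * 100 / expected ))
}

# Hypothetical way to obtain the imported count (not run here):
# curl -s "https://cloudelastic.wikimedia.org:${CIRRUSPORT}/commonswiki_file/_count"

# Using the approximate figures from the comment above.
pct_imported 25000000 50000000
```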
Change 529993 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200
Change 529993 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200
Mentioned in SAL (#wikimedia-operations) [2019-08-13T19:15:46Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@f1a562e]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625
Mentioned in SAL (#wikimedia-operations) [2019-08-13T19:17:17Z] <ppchelko@deploy1001> deploy aborted: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 (duration: 01m 30s)
Mentioned in SAL (#wikimedia-operations) [2019-08-13T19:22:40Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@3882ddb]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625
Mentioned in SAL (#wikimedia-operations) [2019-08-13T19:23:38Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@3882ddb]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 (duration: 00m 58s)
Change 624237 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cloudelastic: remove temporarily increased timeout
Change 624237 merged by Ryan Kemper:
[operations/puppet@production] cloudelastic: remove temporarily increased timeout