Thu, Oct 10
This looks to be a problem with coordinate ordering: when setting the top-left and bottom-right coordinates of the bounding box, geodata needs to ensure it is actually using the leftmost/rightmost (and topmost/bottommost) values. It looks like geodata intends the input to be (lat_1,lon_1), (lat_2,lon_2), but then passes that directly as the bounding box edges without ensuring appropriate ordering.
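Not the actual GeoData code, just a sketch of the kind of normalization that seems to be missing (names are illustrative):

```php
/**
 * Illustrative sketch only: given two arbitrary corner points, produce a
 * well-ordered bounding box so callers may pass the corners in either order.
 * (Ignores the antimeridian wrap-around case for simplicity.)
 */
function normalizeBoundingBox( float $lat1, float $lon1, float $lat2, float $lon2 ): array {
	return [
		'top'    => max( $lat1, $lat2 ),
		'bottom' => min( $lat1, $lat2 ),
		'left'   => min( $lon1, $lon2 ),
		'right'  => max( $lon1, $lon2 ),
	];
}
```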
Search document building is still finding the coordinates, so geodata is appropriately parsing them and injecting them into the parser output. I also verified in the prod db that geo_tags does not contain any rows with a matching gt_page_id. That puts the error somewhere in the table updates. I'm not sure what's going on, but it seems like something that should be looked into.
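The manual check was roughly equivalent to the following (sketch using MediaWiki's DB layer, not the exact query I ran; the page id is a placeholder):

```php
// Verify whether geo_tags has any rows for a page whose parser output
// does carry coordinates.
$dbr = wfGetDB( DB_REPLICA );
$count = $dbr->selectRowCount(
	'geo_tags',
	'*',
	[ 'gt_page_id' => $pageId ], // placeholder page id
	__METHOD__
);
// $count comes back 0 even though the coordinates are parsed,
// which points at the table-update step rather than parsing.
```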
Wed, Oct 9
Tue, Oct 8
Poking through the list suggests this is mostly old stuff; only ~1k files are dated 2019.
Thu, Oct 3
The servers today will not be able to utilize 10G, so they could go in 1G racks for the time being. The cluster can't take advantage of 10G until all the nodes are on 10G.
Wed, Oct 2
Mon, Sep 30
Data looks to have backfilled appropriately, thanks!
I don't see anything in here that we would be losing, this is safe to delete.
Wed, Sep 25
Doesn't look like this is catching up. New data is arriving again from the new partitions, but the previous data does not appear to be backfilling.
Tue, Sep 24
Thu, Sep 19
Indeed, I've completely mixed up the two, sorry for the confusion!
To use https, add 'transport' => 'Https' to your configuration array. That should do the trick.
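Assuming this is the Elastica client config, something along these lines (host and port are placeholders):

```php
use Elastica\Client;

// Placeholder host/port; the relevant bit is 'transport' => 'Https'.
$client = new Client( [
	'host'      => 'search.example.org',
	'port'      => 9243,
	'transport' => 'Https',
] );
```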
I started writing a patch for this, but got stuck trying to get mw vagrant back into working order. In this patch MediaInfo essentially always provides its fields to the NS_FILE namespace, but when no mediainfo is present it provides appropriate empty values. I'm not clear on what tagging the revision would do; I assume that must trigger some other process? I've uploaded the patch as-is, but I've been unable to test it.
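To sketch the idea (the helper and field names below are made up, not the actual patch):

```php
// Illustrative only: always contribute MediaInfo's search fields for
// NS_FILE pages, using empty values when the page has no MediaInfo
// entity yet. buildFieldsFromEntity() and the field names are hypothetical.
if ( $mediaInfo !== null ) {
	$fields = $this->buildFieldsFromEntity( $mediaInfo );
} else {
	$fields = [
		'labels'       => [],
		'descriptions' => [],
		'statements'   => [],
	];
}
return $fields;
```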
Mon, Sep 16
cindy was recently upgraded (patch merged today). It still runs mwv, but has a hacked up node10+npm install. This should be unblocked now.
CirrusSearch tests upgraded in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/536653/
Sep 13 2019
Everything in search should be running on 3; sadly, that migration only happened within the last year. But it happened!
Sep 12 2019
For whatever reason the container stopped; I started it back up again. This probably needs to move to our more managed tools collection rather than being a custom container on a cloud instance.
Sep 9 2019
I only see one directory in smalyshev's hdfs home; looks safe to delete.
Sep 3 2019
Reducing replica count from 2 to 1 had a dramatic effect on the cluster. Things are generally looking happy now.
@dcausse As a percentage of requests this is exceptionally low, but it suggests we are missing some edge case. Any ideas?
Aug 29 2019
Not sure if it's helping or not, but I increased refresh_interval on all cloudelastic-chi indices to 5 minutes, and removed their index.merge.max_thread_count settings (was 1, now takes the default of 3) to see if we could cut back on the number of tiny segments. Segment count reduced from 65k to 57k, about 10%. There might be a minor memory savings, but likely very little, as the tradeoff is an increase of IndexWriter buffers from ~200M/server to ~1GB/server.
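For reference, the refresh_interval half of the change is roughly this per index (Elastica sketch; host, port, and index name are placeholders, and the real change covered every cloudelastic-chi index; the merge thread-count override was removed separately):

```php
use Elastica\Client;

// Sketch only: placeholder connection details and index name.
$client = new Client( [ 'host' => 'cloudelastic1001.wikimedia.org', 'port' => 9200 ] );
$client->getIndex( 'examplewiki_content' )
	->getSettings()
	->setRefreshInterval( '5m' );
```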
Looking into the graphs, it seems to me that the underlying problem is that the min heap keeps growing over time. When the node gets to 2/3 (~30.8GB) heap used, the old GC goes crazy. We can re-apply NewRatio with the 45G heap and the old gen will be able to grow by a few more GB, but unless we can figure out what the final steady-state value is for the old gen, we can only really keep trying larger values.
Not sure the right way to go about it, but the problem is essentially here:
Aug 28 2019
ubuntu bionic, chrome 73.0
Aug 27 2019
It is going to back off the same amount of time as the last back-off, so it will start processing messages in a timely manner, just not instantly. But I don't think that is really an issue here. If the cluster was read-only for 30 minutes, waiting for at most 10 more minutes for the processing to start is acceptable IMHO.
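Not the actual updater code, but the capped back-off behavior is roughly this, which is why the extra wait is bounded no matter how long the cluster was read-only:

```php
// Illustrative only: capped exponential back-off. clusterIsWritable() is a
// hypothetical stand-in for the real write-availability check.
$delay = 1;       // seconds
$maxDelay = 600;  // cap of ~10 minutes
while ( !clusterIsWritable() ) {
	sleep( $delay );
	$delay = min( $delay * 2, $maxDelay );
}
// Even after a 30 minute read-only window, processing resumes at most
// $maxDelay seconds after writes become possible again.
```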
Aug 22 2019
Aug 21 2019
Aug 20 2019
@mpopov Any movement here? No huge rush, but this will let us stop generating all events twice.
We need to rework our updater a little bit to share some expensive work before the partitioned jobs, but pull the ContentHandler data per-partition. It shouldn't be that much work, but it needs to be done on our end so the cirrusSearchElasticaWrite job can be partitioned.
Aug 19 2019
Try to space out elastic nodes evenly across the 1G racks in the row.
Aug 16 2019
Aug 15 2019
Aug 14 2019
Known problem related to a recent deployment; updates are currently backlogged about 12 hours. The plan to deal with this is T230495, while some short-term hacks are being worked on to alleviate the current backlog.
It looks like enwikisource has added a suggestion to Special:Search to enable subphrase completion matching. This has resulted in ~421 users who have turned it on, and we don't see any who have turned it back off to the default. This suggests we could possibly move forward with making subphrase matching the default on wikisource.
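If we do make it the default, I believe the knob is the completion subphrase config in CirrusSearch, something like the below applied to the wikisource wikis (variable name and keys from memory, worth double-checking against the extension's defaults):

```php
// Hedged sketch: the exact variable name and keys should be verified
// against CirrusSearch before relying on this.
$wgCirrusSearchCompletionSuggesterSubphrases = [
	'build' => true, // index subphrase completion data
	'use'   => true, // serve subphrase matches by default
];
```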
Aug 8 2019
All wikis are writing to cloudelastic now. It will still be a few days to catch up on writes since July 29, the day the dump was made. Also, the commonswiki_file import somehow only brought in ~25M out of 50M items. The saneitizer is working on fixing that, but it will take a bit.
Aug 7 2019
Aug 6 2019
Think i found it:
Looked into this a little bit (on cloudelastic1001.wikimedia.org), no solution yet:
Aug 5 2019
I reviewed the draft on mw.org; everything there looks accurate as far as I'm aware. I didn't realize that implicit and explicit AND behave differently. The on-wiki documentation doesn't feel scary enough for what's really going on, but I'm not sure how to make it more explicit that this thing is funny and not what people think it is.
Resolution: The ipv6 address I set, which was the ipv6 version of the ipv4 address, was incorrect. Rather, a new ipv6 address within the LVS range needed to be used. Brandon applied a fix with https://gerrit.wikimedia.org/r/#/c/528215/ and https://gerrit.wikimedia.org/r/#/c/528216/ and everything now looks to be working as expected wrt ipv6.