Sat, May 19
Wed, May 16
Tue, May 15
For splitting wikis between clusters and ensuring sister searches stay on the same cluster, I was hoping I could get by with a test case in mw-config that pokes at the SiteMatrix configuration (unfortunately without the SiteMatrix code in the mw-debug test suite) and verifies everything belongs on the "correct" clusters.
Mon, May 14
For indexing we need to be a little more involved, as there are several distinct use cases.
For search we might consider Cross Cluster Search. This was added in 5.4, came out of beta in 6.0, and is the blessed replacement for tribe nodes. It essentially allows us to query the other index as if it were local by prefixing the index with the cluster name, such as eqiad-large:commonswiki_file. This lets us ignore the question of which cluster (eqiad or codfw?) to read from in the cirrus code, relegating it to elasticsearch configuration.
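As a rough sketch of what that wiring looks like (the endpoint and seed host below are placeholders, not our real config; in the 5.x/6.x era the setting lived under search.remote.*, later renamed cluster.remote.*):

```python
import requests

ES = "http://localhost:9200"  # hypothetical local-cluster endpoint

# Register the remote cluster under the alias "eqiad-large".
settings = {
    "persistent": {
        "search.remote.eqiad-large.seeds": ["elastic-seed.example.net:9300"]
    }
}
requests.put(f"{ES}/_cluster/settings", json=settings).raise_for_status()

# Query the remote index as if it were local by prefixing the alias.
query = {"query": {"match_all": {}}, "size": 1}
resp = requests.get(f"{ES}/eqiad-large:commonswiki_file/_search", json=query)
print(resp.json()["hits"]["total"])
```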
Fri, May 11
This looks like fallout from https://gerrit.wikimedia.org/r/#/c/429815/. Will revert and reconsider how to fix this.
We've decided to split archive into its own index, and move namespace into metastore. As such only metastore needs the new type field now.
Thu, May 10
We decided to not add a new type field, and instead split archive into its own index. To handle the sharding problem we will create 2 new "tiny" clusters on the existing hardware and split all of the tiny wikis between them.
We decided to not add a new type field
We decided to split archive into its own index, removing the need for a type field
It looks like all the supporting code has shipped and been deployed; we only need to deploy the config patch now? https://gerrit.wikimedia.org/r/419367
There are also some deprecation warnings coming from the phabricator index; I pinged them in T181393.
6 months is approaching :) Still don't have a date, but am doing some preliminary work resolving current problems that would block the upgrade. Specifically, the phabricator index in elasticsearch was created with include_in_all: false set on a variety of fields. The false value is fine (it's the default), but the property itself is removed in elasticsearch 6 and needs to be dropped from the mapping.
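A minimal sketch of that cleanup, assuming a hypothetical index name and endpoint (mapping parameters can't simply be deleted in place, so the cleaned mapping goes to a new index followed by a _reindex):

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint
INDEX = "phabricator"          # hypothetical index name

def strip_include_in_all(node):
    """Recursively drop include_in_all from a mapping tree."""
    if isinstance(node, dict):
        node.pop("include_in_all", None)
        for child in node.values():
            strip_include_in_all(child)
    elif isinstance(node, list):
        for child in node:
            strip_include_in_all(child)

mappings = requests.get(f"{ES}/{INDEX}/_mapping").json()[INDEX]["mappings"]
strip_include_in_all(mappings)
# Create a clean index with the stripped mapping; a _reindex from the
# old index would follow.
requests.put(f"{ES}/{INDEX}_clean", json={"mappings": mappings}).raise_for_status()
```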
Wed, May 9
Redirects are indexed, but not as their own thing. Redirects in search are considered a property of the parent document. Because of this, the only information kept about a redirect is its namespace and title.
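For illustration, the shape is roughly this (values invented; real documents carry many more fields):

```python
# A page document with its redirects folded in as a property: only the
# namespace and title of each redirect survive into the index.
page_doc = {
    "title": "Elasticsearch",
    "namespace": 0,
    "redirect": [
        {"namespace": 0, "title": "ElasticSearch"},
        {"namespace": 0, "title": "Elastic search"},
    ],
}
```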
Potential avenues to investigate:
- The send timeout on the mediawiki kafka producer is 10ms. We could try increasing it, although 10ms should already be more than enough (see the sketch below).
- Exceptions are currently logged to the 'wfDebugLogFile' channel, but that channel looks unconfigured in our production logging, so whatever errors it's emitting are all thrown away. We could start collecting that channel, or turn on a dedicated one (there might also be some uncertainty about logging new messages while the app is shutting down and already flushing logs, hard to say).
php-rdkafka would be our best bet, but unfortunately it does not support hhvm, and we are unlikely to be rid of hhvm this calendar year.
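For reference, the kind of timeout tuning meant above looks like this in kafka-python (illustrative only: the production producer is PHP, and the broker host is a placeholder):

```python
from kafka import KafkaProducer

# A 10ms ceiling on the send path is unforgiving; raising it trades a
# little shutdown latency for fewer dropped events.
producer = KafkaProducer(
    bootstrap_servers=["kafka.example.net:9092"],  # hypothetical host
    max_block_ms=100,         # how long send() may block on metadata/buffer
    request_timeout_ms=1000,  # how long to wait for broker acknowledgement
)
producer.send("cirrus-events", b"payload")
producer.flush(timeout=1)
```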
Tue, May 8
As another option, this would be very simple to provide as an additional keyword filter in fulltext search. I'm not sure what the use cases for this tool are, or whether that would help.
Upstream opened a bug: https://github.com/elastic/elasticsearch/issues/30428 which now has a pull request attached, and the bug is tagged to be backported to 5.6.x. Probably we will skip 5.6 and go straight to 6.x, which will receive the backport as well.
We might want to add a warning when using deepcat on an unsupported wiki as well.
Fri, May 4
Well, there were some high level and low level changes visible on the elasticsearch servers.
Thu, May 3
There is an automated process that visits all pages every two weeks and verifies they contain the latest revision, so this would eventually have been fixed. But it would certainly be better if the problem never occurred in the first place.
Calling this one done. I've exported the report to pdf so it lasts longer.
After https://gerrit.wikimedia.org/r/430441 it will work fairly simply, with a single request issued against each cluster.
Wed, May 2
I think I've gone through all the deprecation warnings from the last week and either resolved them or submitted patches. There is one remaining that I haven't been able to track down.
Tue, May 1
Upgraded metastore on eqiad and codfw from 0.2 to 0.3 to fix more deprecation warnings about "index": "not_analyzed", which should be "index": "no". I'm not sure why, but the minor upgrade didn't work, so I forced a major (re-create and reindex) upgrade.
The eqiad elasticsearch cluster (not logstash) was out of sync with the apifeatureusage template in puppet, causing it to create indices with deprecation warnings. I've updated the template from the one in puppet, and new indices going forward should not log deprecation warnings. Some day we have to figure out how those templates get from logstash to elasticsearch; somehow or another it wasn't auto-magically deployed (should it be?).
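A simple drift check along these lines could catch this earlier; the template endpoint is real, but the on-disk path is a hypothetical stand-in for wherever puppet keeps its copy:

```python
import json
import requests

ES = "http://localhost:9200"  # placeholder endpoint
NAME = "apifeatureusage"
PUPPET_COPY = "/etc/elasticsearch/templates/apifeatureusage.json"  # hypothetical path

live = requests.get(f"{ES}/_template/{NAME}").json().get(NAME, {})
with open(PUPPET_COPY) as f:
    wanted = json.load(f)

# Dumb but effective: normalize both sides to sorted JSON and compare.
if json.dumps(live, sort_keys=True) != json.dumps(wanted, sort_keys=True):
    print(f"template {NAME} is out of sync with puppet")
```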
Surprisingly, some of these warnings are simply for very old indices. Somehow commonswiki_general is dated Feb 2017, although we've certainly done full reindexes since then. It's possible a reindex failed somewhere and we never noticed; the logs for a reindex are so large that we don't actually check they all succeeded.
Mon, Apr 30
Finished new eqiad test with the expected number of archives; the notebook linked above has been updated. This shows a similar problem to the 2x-archive test, in that adding new shards to the cluster typically finishes in a reasonable timeframe, but sometimes waiting for the cluster to return to green takes several minutes (up to 5 in this test).
Currently re-running the eqiad tests.
Fri, Apr 27
Thu, Apr 26
I suppose we should lower the drop timeout, $wgCirrusSearchDropDelayedJobsAfter, to something more reasonable as well. This was added to cap the amount of time we back up into the job queue, keeping its size limited, but it seems the current value of three hours created more load than the system can handle.
It looks like writes were frozen to the codfw cluster and never thawed. Moving forward we need a timestamp indexed along with the freeze. We should then alert on any freeze that has lasted more than N (60?) minutes so someone can unfreeze before we start dropping jobs on the ground.
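A minimal sketch of that check, assuming the freeze marker lives in the metastore under a hypothetical document id with an epoch-seconds timestamp field (all names here are made up for illustration):

```python
import time
import requests

ES = "http://localhost:9200"  # placeholder endpoint
# Hypothetical location of the freeze marker document.
FREEZE_DOC = f"{ES}/mw_cirrus_metastore/metastore/freeze-everything"
MAX_FREEZE_MINUTES = 60

resp = requests.get(FREEZE_DOC)
if resp.status_code == 200:
    frozen_at = resp.json()["_source"]["timestamp"]  # assumed epoch seconds
    age = (time.time() - frozen_at) / 60
    if age > MAX_FREEZE_MINUTES:
        print(f"CRITICAL: cirrus writes frozen for {age:.0f} minutes")
    else:
        print(f"OK: freeze in place for {age:.0f} minutes")
else:
    print("OK: no freeze in place")
```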
Wed, Apr 25
Tue, Apr 24
Latency numbers for p95 and lower all look great, probably the most stable they've been since bringing load back to eqiad. p99 is still a bit spiky; some quick looks through graphs suggest a correlation between high io-wait and p99, but I think investigating that will need to be prioritized separately from this ticket. This looks to most likely be resolved, although a full cluster restart should follow to bring things into a consistent state. Currently -XX:+UseNUMA has been deployed to all the machines, but only elastic1024-31 have been restarted in this configuration. The cluster restart will coincide with some plugin updates we have in the pipeline as well.
I hadn't thought about copy_to, so I tried it out; indeed elasticsearch seems to handle multiple types with varied copy_to on the same field correctly. I don't see any obvious solution to this while moving away from multiple types, short of adding a field that is only populated for archive documents. Looking around the cluster, we have at most 50k archive documents per wiki.
This could probably even use the existing metastore, but need to double check what analysis chains and query types we use.
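For reference, the copy_to behavior tested above looks roughly like this (index and field names invented; this shape needs ES 5.x, where an index may still hold two types):

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

mapping = {
    "mappings": {
        "page": {
            "properties": {
                "title": {"type": "text", "copy_to": ["all_field"]},
                "all_field": {"type": "text"},
            }
        },
        "archive": {
            "properties": {
                # Same field name, different copy_to target per type.
                "title": {"type": "text", "copy_to": ["archive_all"]},
                "archive_all": {"type": "text"},
            }
        },
    }
}
requests.put(f"{ES}/copyto_test", json=mapping).raise_for_status()
```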
It's perhaps also worth considering that GC will likely behave a little differently with numa awareness enabled. At a minimum, I've seen that GC will not compact across numa regions, which means each region is working with some fraction of the heap instead of the whole thing.
Restarted elastic1024-1031 with -XX:+UseNUMA. Inspecting the state of the jvm memory maps, it looks like this causes the jvm to allocate three separate memory regions from the kernel instead of the single allocation it used before. The allocation is split between a shared heap that is interleaved between numa nodes, and two allocations that are each pinned to a specific node. The main benefit of numa awareness over a brute force interleave would be better memory locality, so I took a look (again with intel pcm).
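The split allocations are also visible from the kernel side; a small sketch that summarizes per-node page counts from /proc/&lt;pid&gt;/numa_maps (pass the jvm pid as the argument):

```python
import re
import sys
from collections import Counter

# Summarize per-NUMA-node page counts for a process.
# Usage: python numa_summary.py <pid>
pid = sys.argv[1]
totals = Counter()
with open(f"/proc/{pid}/numa_maps") as f:
    for line in f:
        # Each mapping line carries tokens like N0=12345 N1=6789
        for node, pages in re.findall(r"N(\d+)=(\d+)", line):
            totals[f"node{node}"] += int(pages)
print(dict(totals))
```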
Mon, Apr 23
I was actually thinking namespace could be its own single index, shared between the wikis. I suppose we could use the metastore for that; it's tiny data and fits into the metastore concept.
Reviewing per-node latency graphs since rolling out the numactl --interleave=all approach, it looks like a success. Pretty much all of the minor latency spikes above the cluster baseline are coming from the older systems that have not had interleave enabled.
Apr 21 2018
If we split namespace into its own single index for all wikis, the main remaining difference may be only archive documents. Perhaps we need to revisit archive and see if it makes sense to unify the two as a single document with two different states (archived/current)? Will need to think about it.
Apr 20 2018
To clarify, index types are not being removed. What has been removed is the ability to have an index with more than one type. This means titlesuggest and ttmserver should need no direct changes, as they already meet this requirement afaict. The content/general indices contain multiple types and need this new field.
Having a hard time reproducing this directly, although I am seeing semi-regular occurrences on mediawiki.org. For reference, this isn't limited to the page type; I've seen logs for archive as well. It's some sort of generic problem, but elasticsearch isn't logging any errors, and mediawiki isn't logging any useful errors. Will need to revisit what is logged on the mediawiki side after figuring out what should have been logged here.
Apr 19 2018
The error messages all seem to be of mostly the same form.
Cluster just alerted on latency again. elastic1027 and 1025 have pushed server load above the number of cores and are now showing about 2x the latency of other servers. The initial spike started at about 21:40 UTC; I was able to start taking measurements at 21:50 UTC after it alerted. Latency was still abnormally high, resolving around 22:15.
Ori had suggested looking at the memory installed, and it brought up a pertinent point with respect to memory bandwidth.
Apr 18 2018
I was talking to someone about bot detection, and they mentioned that they have gotten good mileage in bot filtering by grading ip addresses by the ratio of html pages requested. I ran a quick query against a day's webrequest logs to get a top level idea of what's plausible.
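Not the exact query that was run, but a sketch of the idea (table and column names approximate the Hive wmf.webrequest schema; the thresholds are arbitrary):

```python
# Low html_ratio at high request volume is the bot-ish signature the
# grading approach above looks for.
QUERY = """
SELECT ip,
       SUM(IF(content_type LIKE 'text/html%', 1, 0)) / COUNT(*) AS html_ratio,
       COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2018 AND month = 4 AND day = 18
GROUP BY ip
HAVING COUNT(*) > 1000
ORDER BY html_ratio ASC
LIMIT 100
"""
print(QUERY)  # run via beeline/pyhive or similar
```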
High level theory: We are over-consuming some resource on the machines. That basically means IO (network, disk), CPU, or memory. IO was a problem in the past, but doesn't look like a problem this time around. So I grabbed intel's performance counter monitor and used it to look at some top level cpu/io stats, looking for differences between the older servers performing poorly and the newer servers that are doing well.