Fri, Feb 15
I'm not sure if it's great, but I see two possible solutions to en being the final fallback language for almost everything:
Actually I hadn't thought about inlabel:pt-br,pt|colaborativa; that might be better than what I had with successive | characters. The successive pipes can be ambiguous, but taking everything before the first pipe is very easy to reason about.
Yes, I'll go back and edit. I started with ',', then used '|', then forgot to switch them all.
Proposed syntax as follows. Note that we can have an incaption alias of inlabel, but this will be implemented in WikibaseCirrusSearch, where these are considered labels, so the code in Wikibase to filter by them should probably reference label. One potential sticking point is the syntax for specifying one or more languages. I'm not entirely convinced this is the best syntax, but I'm not sure we have anything today to draw from as an example. The pipe usage here is slightly different than how we use it in other places. We could potentially replace the pipe with a comma; I'm not sure if that's better or worse.
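To make the "everything before the first pipe" rule concrete, here's a minimal sketch of parsing the proposed syntax; the function name and overall shape are my own, not existing CirrusSearch code:

```python
def parse_inlabel(value):
    """Split an inlabel: value like 'pt-br,pt|colaborativa' into
    (languages, term). Everything before the first pipe is treated as a
    comma-separated language list; everything after, further pipes
    included, is the search term."""
    langs, sep, term = value.partition('|')
    if not sep:
        # No language prefix given; search the term in all languages.
        return [], value
    return [l.strip() for l in langs.split(',') if l.strip()], term
```

Because partition only splits on the first pipe, any later pipes are unambiguously part of the term, which is what makes this variant easy to reason about.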
Thu, Feb 14
Wow, I didn't realize we threw away so many interesting tokens. Unfortunate, but it seems this task can become a child of the other, to be considered "some day".
In particular I think this refers to the fuzzy completion suggester? We recently made it the default on mw.org and wikitech; might as well do the office wiki too.
Since then the analytics team has built up the mediawiki-history job that, essentially, exposes all of mediawiki history in a reasonably structured way. Perhaps now some job can aggregate over that to generate a list of page titles and all their contributors. If implemented this way it would not be real-time contributors though; it would be a batch job that runs monthly or something to generate a contributors list to index in the search engine. We would probably need to find some strong use cases to justify the extra moving pieces of this kind of setup as well.
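The aggregation itself is simple; a toy sketch of what the batch job would compute, assuming mediawiki-history rows expose a page title and the acting user (the field names here are invented, not the real schema):

```python
from collections import defaultdict

def aggregate_contributors(history_rows):
    """Collapse per-revision history rows into one contributor list per
    page title. In production this would be a periodic (e.g. monthly)
    job over the mediawiki-history dataset rather than an in-memory loop."""
    contributors = defaultdict(set)
    for row in history_rows:
        contributors[row['page_title']].add(row['user_text'])
    return {title: sorted(users) for title, users in contributors.items()}
```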
At a high level, this from the opening seems to capture the request:
Hastemplate can find template usage where the target is a secondary template, and this ability should be extended to find where the target template is passed as a parameter. Currently hastemplate doesn't recognize the parameter list as "a place for template names" the way it does in template code, where the target template is treated as a secondary template.
Implemented as either a filter or a boost: https://www.mediawiki.org/wiki/Help:CirrusSearch#Geo_Search
intitle now queries the redirect titles, but this bug is still not fixed. It looks like the analyzers throw away this token:
Basically we have a different opinion than Elastic on how this should work. Elastic has to support pretty generic use cases and a huge variety of ways people set things up. We only have to support ourselves, and our opinion for production puppet is that the servers should run with no swap enabled. If there is no swap then mlockall does nothing, and no additional security rights are required for the elasticsearch user either. The reasoning for this is that elasticsearch only works well when at least half the server's memory is available as a disk cache. If the servers get anywhere near needing to swap memory to disk, something else is fatally wrong.
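To illustrate the ops stance, a tiny check that a host really has no swap devices, by parsing /proc/swaps; this helper is my own sketch, not existing puppet code:

```python
def has_no_swap(proc_swaps_text):
    """/proc/swaps always starts with a header line; any further
    non-empty line is an active swap device. With none present,
    mlockall is a no-op and elasticsearch needs no extra memlock rights."""
    devices = [l for l in proc_swaps_text.splitlines()[1:] if l.strip()]
    return not devices

# e.g. has_no_swap(open('/proc/swaps').read())
```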
I believe the implementation of deepcat resolves this feature request:
In general this has been completed by the implementation of mjolnir.
Looks like one more option: a workstation card from AMD, the Vega Frontier, has 16GB of memory with very similar compute to the Vega 64. Based on my review, I think this is probably our best bet for an AMD card.
Looking back over things, it looks like stat1005 is in an R470 case; they advertise compatibility with several full-size nvidia cards, so length is probably OK. For cards the choices seem to be:
As long as it fits in the case, a high end consumer GPU from AMD should be just fine. The most important spec for choosing will probably be the amount of memory; more is (almost) always better. The sticking point might be length: consumer cards are typically 10.5" long, which might be too long for our case? Need to find out how much space is in there.
Wed, Feb 13
One possible way to test would be to drop our 2 minute master timeout back to 30s and see how the daily completion suggester builds and whatnot work. I would love to rip out all the related master timeout code in cirrus.
The numbers that seem most important are for chi-eqiad only (the primary load-bearing cluster).
Re-ran data collection and the report. Of particular interest here is going to be chi-eqiad which is serving the majority of traffic. The over-time graphs for chi-eqiad aren't great, but they are better than before. Additionally the largest spikes are directly attributable to disk space issues we are currently experiencing in eqiad. Looking at the allocation explain while running the test shows that sometimes the master decides all nodes are above the disk threshold. I ended up needing to increase the watermark from 75% to 79% for the test to even run.
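For reference, the watermark bump can be done with a transient cluster-settings update. A sketch of building the request body; the 79% value is from the test above, and whether a transient or persistent setting is appropriate is a judgment call:

```python
def watermark_settings(low="79%"):
    """Body for PUT _cluster/settings raising the low disk watermark,
    which controls when the master stops allocating new shards to a
    node that is filling up."""
    return {"transient": {
        "cluster.routing.allocation.disk.watermark.low": low,
    }}

# Applied with e.g.:
#   curl -XPUT localhost:9200/_cluster/settings \
#        -H 'Content-Type: application/json' \
#        -d '{"transient": {"cluster.routing.allocation.disk.watermark.low": "79%"}}'
```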
Tue, Feb 12
I think the main idea will be to keep all of the appropriate code for performing transformations in mjolnir, and add oozie jobs to wikimedia/search/analytics. The new jobs can be python scripts (we already build venvs with dependencies for transfer_to_es) that import mjolnir and run the appropriate transformations. The scripts would primarily be concerned with where to load data from and where to store it. The algorithms would stay in mjolnir.
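A skeleton of what one of those thin wrapper scripts might look like, keeping the algorithm in mjolnir and only the I/O concerns in the script; the mjolnir import path and transform name in the comment are hypothetical, not real mjolnir API:

```python
import argparse

def parse_args(argv=None):
    """The wrapper script only decides where data comes from and where
    results go; the transformation itself is imported from mjolnir."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-path', required=True)
    parser.add_argument('--output-path', required=True)
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Hypothetical call; the real mjolnir module layout may differ:
    # from mjolnir.transform import run_transform
    # run_transform(args.input_path, args.output_path)
    return args
```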
Mon, Feb 11
Kind of good news / bad news. The good news is the patch is merged and will deploy this week. The bad news is the bug was in the process that backfills old properties, like the somewhat recently added page creation date. It's basically going to take 2 more months before these new property sorts take into account all pages.
AB test reports published. Clicks@1 improved in all languages with no regressions in other metrics. Clicks@1 starts around 78-80% and improves by 2-5% to 83-85% depending on language. Will deploy these profiles as the new defaults.
I don't know if this meets your needs, but the cirrussearch dumps have the wikidata IDs broken out. This is the wikibase_item field of the ebernhardson.cirrus2hive table in hive. Alternatively there are full dumps with each article as a json object: https://dumps.wikimedia.your.org/other/cirrussearch/
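A sketch of pulling the wikibase_item out of those dumps, assuming the elasticsearch bulk format the dumps use (alternating index-action lines and document lines); treat the exact layout as something to verify against a real dump file:

```python
import json

def extract_wikibase_items(lines):
    """Yield (page_id, wikibase_item) pairs from cirrussearch dump lines.
    Lines alternate between {"index": {...}} action lines and full
    documents; only documents carrying a wikibase_item are yielded."""
    page_id = None
    for line in lines:
        doc = json.loads(line)
        if doc.keys() == {'index'}:
            page_id = doc['index'].get('_id')
        elif 'wikibase_item' in doc:
            yield page_id, doc['wikibase_item']
```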
Longer term search will potentially want to generate some significantly larger datasets to ship to production, but we don't yet have a concrete implementation plan so everything is a bit hand-wavy. As one example though we have looked into turning all sentences from articles on a wiki into vectors. These vectors are 4kB, and a previous estimation was 250M sentences on en.wiki, 50M sentences on fr.wiki, declining from there. Overall data size on the order of 2-10TB. This is fairly far off though; something nearer term would be more like a 1kB vector per article, which is a much more reasonable ~5GB for enwiki and declining from there. This is far enough out that I'm really not planning anything concrete yet.
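The rough sizes above check out arithmetically; assuming 4kB per sentence vector and 1kB per article vector as stated:

```python
# Far off: 250M sentence vectors on enwiki at 4 kB each.
enwiki_sentence_tib = 250_000_000 * 4 * 1024 / 1024**4   # ~0.93 TiB
# Adding fr.wiki's ~50M sentences and the tail of smaller wikis, the
# quoted 2-10 TB overall is plausible.

# Nearer term: one 1 kB vector per article. The quoted ~5GB for enwiki
# implies roughly 5M articles (an assumption, but about right):
enwiki_article_gib = 5_000_000 * 1024 / 1024**3          # ~4.8 GiB
```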
Sun, Feb 10
The servers have been purchased and racked up. Patches were going through puppet last week getting new security groups setup for accessing the cluster, installing the servers, etc. Basically, things are progressing and I'm optimistic we will have a public service ready in time for the summer hackathon.
Thu, Feb 7
The default prefix search is heavily tuned toward finding content articles. It considers a redirect and the page it redirects to to be a single entity; the version of the string chosen for display amounts to a heuristic that tries to decide between showing something closer to what you typed that exists as a redirect, or the original page title. Additionally this system considers two versions of the string with different casing to be the same string, so only one cased version (chosen fairly arbitrarily) is available to find.
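A toy illustration of the display heuristic described above; the actual scoring in CirrusSearch is more involved, so this is only the shape of the decision, with an invented prefix-match rule:

```python
def display_title(typed, canonical, redirects):
    """Pick which title string to show for a matched page: the canonical
    title if it matches what the user typed as a prefix, else a redirect
    title that does. Matching is case-insensitive, mirroring how the
    suggester collapses differently-cased variants into one string."""
    prefix = typed.lower()
    if canonical.lower().startswith(prefix):
        return canonical
    for redirect in redirects:
        if redirect.lower().startswith(prefix):
            return redirect
    return canonical
```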