Tue, Feb 25
Oozie complained about discovery-query-clicks-hourly multiple times, typical. Ran P10525 (fix-oozie.sh)
We've (@TJones) talked about this in the past, but it never made it high enough up the priority list. Essentially the existing language detection code can be re-purposed to detect the language of "Hebrew but transliterated to qwerty", after which it can transliterate and run a second-try search (the "Showing results for washington. No results found for Washingtxn" flow) if the first search has poor enough results. There is nothing groundbreaking here, but it would have to be prioritized, as it will take some time to work out properly without simply doubling the query load for certain languages.
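A minimal sketch of that second-try flow, in Python for illustration. All three helpers are stand-in stubs (none of this is existing CirrusSearch code), and the "poor results" cutoff is an arbitrary placeholder:

    POOR_RESULT_CUTOFF = 3  # assumed; "poor enough results" needs a real definition

    def detect_language(query: str) -> str:
        # Stand-in for the existing language detection, re-purposed to
        # recognize "Hebrew but transliterated to qwerty".
        return 'he-qwerty'

    def transliterate(query: str) -> str:
        # Stand-in for a qwerty -> Hebrew keyboard-layout mapping.
        return query

    def run_search(query: str) -> list:
        # Stand-in for the actual search request.
        return []

    def search_with_second_try(query: str):
        results = run_search(query)
        if len(results) >= POOR_RESULT_CUTOFF:
            return results, None
        if detect_language(query) == 'he-qwerty':
            rewritten = transliterate(query)
            second_try = run_search(rewritten)
            if len(second_try) > len(results):
                # Rendered as: Showing results for <rewritten>.
                # No results found for <query>.
                return second_try, rewritten
        return results, None

Gating the second query on both poor results and a positive detection is what keeps this from simply doubling the query load.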
Mon, Feb 24
Fri, Feb 21
It wouldn't be too hard to adjust mw_prepare_rev_score.py to source its data from elsewhere; at some point in that script we have a DataFrame containing three fields: (wikiid, page_id, dict from label to prediction probability). It sounds like we should be able to generate a dataset in that format from oresapi; for a one-time script I can hack in reading that relatively easily. The simplest format for exchange is probably gzip'd files containing a JSON row per line. If the dataset is large these can be split across multiple files.
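A sketch of reading that exchange format back in, assuming field names (wikiid, page_id, probability) matching the DataFrame described above; the real script presumably works in Spark, pandas here just keeps the sketch self-contained:

    import glob
    import gzip
    import json

    import pandas as pd

    def read_rev_scores(path_glob: str) -> pd.DataFrame:
        # One JSON object per line, gzip compressed, possibly split
        # across multiple files.
        rows = []
        for path in sorted(glob.glob(path_glob)):
            with gzip.open(path, mode='rt', encoding='utf-8') as f:
                for line in f:
                    row = json.loads(line)
                    rows.append((row['wikiid'], row['page_id'], row['probability']))
        return pd.DataFrame(rows, columns=['wikiid', 'page_id', 'probability'])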
Thu, Feb 20
The reindex won't block anything; essentially elasticsearch will store all the data we send to it, but it's only searchable once the reindex process makes it to that wiki.
The per-topic thresholding is now deployed. I ran only the thresholding and extraction portion of last week's job to see how it would look. The full run, where the predictions are also shipped to elasticsearch, will happen on Sunday (Feb 23rd).
This reindex process is running; it will probably finish late next week.
Reviewing how our job queue usage is going: the pre-partitioned queue here backlogs fairly significantly, up to ~500k messages, while the post-partitioned queue only backlogs when the consumer decides to stop reading for 10-30 minutes (separate ticket, T224425). As some approximate stats, on 2020-02-20 08:00-09:00 UTC the committed offset of the partitioner went up by ~3M, and the peak backlog over this period was ~500k jobs. This is around 850 jobs/sec, which puts 500k jobs at a 10 minute backlog. This happens for an hour every two hours when the scheduled jobs queue up.
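For reference, the arithmetic behind that estimate:

    jobs_per_hour = 3_000_000            # committed offset increment, 08:00-09:00 UTC
    jobs_per_sec = jobs_per_hour / 3600  # ~833, rounded to ~850 above
    peak_backlog = 500_000
    print(peak_backlog / jobs_per_sec / 60)  # ~10 minutes worth of jobs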
Re-deployed our glent esbulk oozie job against refinery version 2020-02-19T16.58.16+00.00--scap_sync_2020-02-19_0001. Additionally shipped an update to our airflow scheduler that changes the eventgate port used there as well.
Wed, Feb 19
@dcausse looked into the rescore building. RescoreBuilder::isProfileSyntaxSupported is rejecting the profile containing the LTR query because the query_string search syntax is reported. Essentially this is saying that the search query needs further parsing on the elasticsearch side, and since our LTR query can't be modified to apply that parsing it instead rejects it outright. Most likely we could add a regex to check how simple the query string is and allow through all queries that are plain search strings containing no syntax that would be handled by the query_string query.
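Something like the following, written in Python for illustration (the real check would live in CirrusSearch PHP, and the exact set of characters to reject would need to be verified against Lucene's query_string syntax):

    import re

    # Approximation of Lucene query_string reserved syntax:
    # + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ / and boolean operators.
    QUERY_STRING_SYNTAX = re.compile(r'[+\-=!(){}\[\]^"~*?:\\/&|<>]|\b(?:AND|OR|NOT)\b')

    def is_plain_search_string(query: str) -> bool:
        # Allow through only queries with none of the special syntax.
        return QUERY_STRING_SYNTAX.search(query) is None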
Talked about this today. Short term: Investigate why the classic query building isn't generating an LTR query, it almost certainly should. If that isn't fruitful we should ship the test without jawiki. Longer term: Elasticsearch is deprecating (turning into a noop) the phrase query we use here in 6.8, and removing it in 7.x. We need to re-evaluate the kuromoji analysis chain and hopefully move ja onto a proper analysis chain before upgrading to 6.8. The other spaceless languages are probably not big enough to need specific support, we can perhaps use shingles since the text content is minimal (outside jawiki).
Tue, Feb 18
To fake the data, there are a couple of parts:
While this workflow is deployed, it's currently flagged off in the airflow admin. My main thought there was that we are adding thresholding, and current runs aren't taking that into account. It seemed better to wait until per-wiki/topic thresholding was deployed before turning on the data shipping. This was briefly deployed and run for a week or two before I realized we needed the updated thresholding.
This is still waiting for an in-place reindex before it is queryable. We were waiting on wmf.19 and an unrelated mapping change before running that. Now that that change is deployed along with this one we should be able to run the re-index this week.
Reviewing the history, I think the primary concern related to bm25 and spaceless languages was:
Random data points so I can find them next time we have this issue:
Fri, Feb 14
As an alternate solution, we may actually be able to drop the cache and repopulate it. There is a failover cluster matching the primary cluster in the codfw datacenter, and CirrusSearch has support to redirect classes of queries to particular clusters. We can direct all of these related-articles requests to the codfw cluster, and since it isn't occupied by the typical search load, we may be able to serve the full request load as the cache refills.
Thu, Feb 13
Essentially what has happened here is that the namespace passed to elasticsearch has changed type from a string to an integer. This changed the request, which invalidates the cache. As for the best path forward, I'm not sure. The most direct fix of course is to change that integer back into a string, but there isn't an obviously great place to do that.
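An illustration (not the actual CirrusSearch code) of why the type change busts the cache: if the cache key is derived from the serialized request, "0" and 0 hash differently even though elasticsearch treats them the same:

    import hashlib
    import json

    def cache_key(request: dict) -> str:
        return hashlib.sha256(
            json.dumps(request, sort_keys=True).encode('utf-8')
        ).hexdigest()

    # Same logical request, different cache keys.
    print(cache_key({'namespace': '0'}) == cache_key({'namespace': 0}))  # False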
For the given time period the CirrusSearch more_like cache, which is a second-level cache (the http responses should be cached by traffic infra), saw its hit rate drop from ~75% to 5%, climbing back to ~12% over the ten minutes of deployment. The number of successful requests doubled from ~200 to ~400, but this was not enough to handle the dramatic drop in hit rate: going from a 75% to a 5% hit rate means the backend has to absorb roughly 0.95/0.25 ≈ 3.8x the load, not 2x.
Mon, Feb 10
This is not particularly urgent, more so surprising. I'm fixing a bug elsewhere in code that determines whether a bare title string returned from an external source needs to be namespace prefixed; it worked for local titles but not for interwiki titles, since interwiki titles report as NS_MAIN. I understand this is an assumption that is likely baked in all over the place and unlikely to be fixed.
Fri, Feb 7
In terms of actual deployment I think we can simply install the php-phpdbg package (available from our php7.2 deb component) and adjust MWScript.php to accept the 'phpdbg' SAPI in addition to the 'cli' SAPI it currently allows.
Another concern I just realized with respect to thresholds: updating the models. If a new articletopic model is released and topic A's threshold goes from 0.9 to 0.8, we will still have the old scores indexed, with no real way to distinguish which version of the model a prediction came from.
They could also live in the script that loads data from Hadoop to ES (which currently uses a cutoff of 0.5 for discarding low scores). That would reduce ES space usage, but seems like an even more unpleasant location to manage such config, especially since it will be different for each wiki (or will it? thresholds, for sure, but threshold definitions?).
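For concreteness, a sketch of what per-wiki/per-topic thresholding might look like in that loading script, replacing the flat 0.5 cutoff. The table shape (wiki -> topic -> cutoff), the topic names, and the default are all made up for illustration:

    DEFAULT_THRESHOLD = 0.5

    # Hypothetical per-wiki, per-topic thresholds.
    THRESHOLDS = {
        'enwiki': {'Culture.Media': 0.8, 'STEM.Physics': 0.64},
    }

    def keep_predictions(wikiid: str, predictions: dict) -> dict:
        # Discard predictions below the per-topic threshold for this wiki.
        wiki_thresholds = THRESHOLDS.get(wikiid, {})
        return {
            topic: prob
            for topic, prob in predictions.items()
            if prob >= wiki_thresholds.get(topic, DEFAULT_THRESHOLD)
        }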
Thu, Feb 6
This looks to be caused by the text highlighting. If we search for User: "greetings from new zealand" the interwiki result is displayed as User:Robin Patterson, but if highlighting is available for the title then that is used, and it does not include the namespace prefix. In full text search, title highlighting appropriately provides the namespace prefix.
In this case I think i18n was avoided because this api module is only included when running the integration testing suite; it should never be available on a real site. As such there would be no point in translating it. Will simply drop the function.
How do we think these thresholds should be applied? It sounds like we need to inject them prior to the indexing pipeline.
Wed, Feb 5
Tue, Feb 4
Patch is up to change the analytics side to articletopic as well. Since some ores_drafttopic data has already been shipped, we will need to remember to ask the update script to delete those fields from the source documents when doing the reindex to add ores_articletopic to the schema.
Jan 24 2020
There are two main options I can think of for merging what are essentially two separate morelikethis queries while giving them equal weight:
Jan 21 2020
Jan 17 2020
Jan 15 2020
Jan 14 2020
Jan 13 2020
Followup on how the weights got set to 0 to start with:
Looking at the HTML output for one example page we have:
<span style="display:none" class="sortkey">Durener Straße 040 '"`UNIQ--nowiki-00000009-QINU`"' </span>