Thu, Apr 15
It looks like the source of this is I2bcd7305 from T210106, which changed the logged value from '0 edits' to null for most logged requests. I don't know who is using this info downstream, so the safest approach seems to be to transform null back into '0 edits'.
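A minimal sketch of what the downstream transform could look like, assuming a Spark-side fix; the table and field names here are guesses, not the real schema:

```
# Hypothetical sketch: map null back to the old '0 edits' sentinel so
# downstream consumers see the pre-I2bcd7305 behavior. Table and column
# names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.table("event.searchsatisfaction")  # hypothetical source table
events = events.withColumn(
    "user_edit_bucket",  # hypothetical field that used to log '0 edits'
    F.coalesce(F.col("user_edit_bucket"), F.lit("0 edits")),
)
```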
Wed, Apr 14
Pondering this, first step should probably be closing rather than deleting the indices. Closed indices can be easily reopened if we start getting errors from CirrusSearch that we closed an active index. Without errors after some reasonable time period the indices can be safely deleted.
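Roughly what the close-first approach looks like with the elasticsearch Python client; the host and index names are placeholders:

```
# Sketch of close-now, delete-later. close() keeps the data on disk but
# stops serving it; if CirrusSearch starts complaining the index can be
# reopened with indices.open(). Host and index names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://cloudelastic:9443")

suspect_indices = ["acewiki_content_1234567"]  # hypothetical orphaned indices

for index in suspect_indices:
    es.indices.close(index=index)

# later, once no errors have shown up for a reasonable period:
# for index in suspect_indices:
#     es.indices.delete(index=index)
```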
Tue, Apr 13
Would putting everything in the backend solve the Vue.js problem, or would the frontend still need some tweaking to do the right thing?
Also potentially related:
Thu, Apr 1
I suppose the alternate step 1 is to restart the JVMs and see if it happens again (it usually does).
Wed, Mar 31
Realized the superset dashboard doesn't break down any stats by wiki, and adding that isn't particularly easy. The most important stat is probably the prevalence of mismatch sessions; here is a quick breakdown from hive for a single day:
This shows using a single wiki as an example, but this is repeated for all of the wikis that are split between omega and psi. Here acewiki correctly does not exist on 9243 (chi). It should not exist on 9443 (omega), but it does exist on cloudelastic:9443. It should exist on 9643 (psi) and does in all clusters.
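A quick way to double-check where a given index actually lives across the three clusters, purely for illustration (host and index names are placeholders):

```
# Illustrative check of which cloudelastic cluster actually has an index;
# HEAD on an index returns 200 when it exists and 404 when it doesn't.
# Host and index name are placeholders.
import requests

index = "acewiki_content"
for port, cluster in [(9243, "chi"), (9443, "omega"), (9643, "psi")]:
    resp = requests.head(f"https://cloudelastic:{port}/{index}", timeout=10)
    exists = resp.status_code == 200
    print(f"{cluster} ({port}): {'exists' if exists else 'missing'}")
```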
Tue, Mar 30
I don't have a solution for the autocomplete problem. Perhaps we need a hybrid solution where buckets are constantly assigned from the backend, and Special:Search auto-magically uses the bucket, but autocomplete requests will still have to include the query string parameter. This feels messy, will ponder more.
Tracing things back through git history, end up at 1fcba848 from T121542 adding the trigger functionality. The commit message justifies adding AB testing to the frontend because textcat was going to need some special query parameters, and this allowed the frontend to provide the single testing parameter instead of using various cirrus debug query params. I found this a bit unclear, so pieced together a bit more of the history.
I took a sample of 30 complete sessions, joined against cirrus backend logs for information like referers and actual query strings. I reviewed this for inconsistencies, then tried to calculate stats about how prevalent those inconsistencies are in the full dataset of sessions that are in a mismatch state.
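The join behind that review was roughly this shape; the join key and column names are approximations, not the real schemas:

```
# Sketch of joining a hand-reviewed sample of sessions against the cirrus
# backend request logs for referers and actual query strings. Join keys and
# column names are approximations rather than the real schemas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sessions = spark.table("event.searchsatisfaction")
backend = spark.table("wmf_raw.cirrussearchrequestset")

# take a handful of complete sessions to review by hand
sample_ids = [r.session_id for r in
              sessions.select("session_id").distinct().limit(30).collect()]

reviewed = (
    sessions.where(sessions.session_id.isin(sample_ids))
    .join(backend, sessions["search_id"] == backend["search_id"], "left")
    .select(sessions["session_id"], sessions["event_dt"],
            backend["referer"], backend["query"])
)
```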
Completed a number of namespaces, it's up to 14 now. Taking its time but looking good.
Fri, Mar 26
David suggested that starting a session without going through autocomplete could perhaps be a source of problems. Looking specifically into sessions starting on enwiki and dewiki, of the sessions that have mismatch events ~75% have a mismatch as the first event we see. Of the sessions that have an autocomplete dt prior to a fulltext dt (filtering ac-only), it looks like only 13% of those sessions transition into the mismatch state.
Thu, Mar 25
enwiki ns 0 has completed, ns 1 is working its way through. Optimistically, looks like this should work out and complete.
Looking at this from more of an events/stats perspective, what can we see is different in the mismatched sessions? I first noticed that for automatically rewritten queries, mismatched sessions only see 10% with interaction, but the control and test buckets are around 30%. Similarly, mismatch sessions are only rewriting 45% of zero-results queries, while the test and control buckets are seeing closer to 60% rewrite rates.
I've been able to enter a mismatched state from incognito windows multiple times now, but it's not clear what the trigger is. It seems we have two different options: we could try and fix the frontend bucketing, or we could go back to the bucketing we previously implemented in the backend. But for some reason i can't remember, we quickly transitioned from the backend doing bucketing to doing the bucketing inside the frontend browser code. Perhaps the problem was that the only way to thread arbitrary extra data like a bucket through api responses is to inject it into headers, or something like that.
Wed, Mar 24
Counting mismatches as any query in a session that contains mismatched events, we have 42% of sessions and 52% of search requests falling into the mismatched bucket. In some testing in an incognito window, by the time i figured out how to set the breakpoint inside searchSatisfaction.js my subTest was already set to mismatched. Clearly we need to dig into the search satisfaction tracking and figure out what's going on here if we want to have usable AB test results.
Data is suspicious. The mismatch bucket, which has searches where the testing bucket reported by the backend is different from what the frontend expected, is 44% of all search requests. The backend aggregation looks to be a bit optimistic here as well; the reported bucket is whichever test it saw first (unordered) on a per-session basis, rather than whether that particular search reported a mismatch. I'm currently testing a patch that will mark a full session as mismatch if any event in the session is a mismatch, which should hopefully give a better idea of the scope of the issue.
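A sketch of what that session-level aggregation looks like; the table and column names are illustrative, not the real schema:

```
# Sketch of the patched aggregation: a session counts as mismatched if any
# event in it reported a mismatch, rather than taking whichever bucket was
# seen first. Table and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.table("event.searchsatisfaction")  # illustrative source

session_buckets = events.groupBy("session_id").agg(
    F.max((F.col("bucket") == "mismatch").cast("int")).alias("is_mismatch")
)
```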
database is imported to ebernhardson.machine_vision_safe_search/date=20200323, haven't had a chance to dig into it yet.
Tue, Mar 23
Reworked exports so we can run a task per namespace. Triggered a new run of the ores_predictions_bulk_ingest dag and manually marked all the articletopic tasks as success so it skips them and only does the drafttopic. Now waiting for it to complete.
Test is started, results will be found in the superset Search Query Suggestions dashboard. Data is loaded into this dashboard daily, with the prior day's data arriving around 3:00 UTC. The test will be run for 7 days; assuming data collection looks reasonable, that means turning it off next Monday.
Mon, Mar 22
I've spent some time Friday and again today looking at the queries found in the second csv but not the first. Everything i've looked at (only a few dozen) seems reasonable on closer inspection. The only particularly suspicious thing is there is a class of queries that don't have a dym in the rerun, but when i run them through the test suite they provide the expected suggestion. Since the suggestion algo seems to still be correct i put together a test case that runs the whole suggester, simulating the input dataframes, and it still looks reasonable. I'm not really finding an answer, tempted to call it "good enough".
Mar 18 2021
First off, 俸納 doesn't get any suggestions in Chinese, but it does get 奉納 as a suggestion in Japanese. They use the same confusion tables, but different frequency tables. Which brings up a few thoughts:
Could you re-generate glent_m2_rerun_filtered.csv with the language used?
Mar 17 2021
All imports have completed. Next step is to re-run the previous work joining the datasets and verify we now have an acceptable percentage of queries with predictions.
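The verification is roughly this, with both table names being hypothetical stand-ins:

```
# Rough sketch of the coverage check: after the join, what fraction of
# queries end up with a prediction. Both table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

queries = spark.table("ebernhardson.queries")          # hypothetical
predictions = spark.table("ebernhardson.predictions")  # hypothetical

joined = queries.join(predictions, "query", "left")
total = joined.count()
covered = joined.where(joined["prediction"].isNotNull()).count()
print(f"queries with predictions: {covered / total:.1%}")
```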
I tried reviewing some of the changes, particularly the 7,870 queries that used to have suggestions but no longer do, and it's not clear to me. For example 俸納 used to suggest 奉納 but doesn't any more. This doesn't seem to match the patterns we are dealing with here, but perhaps that was a previous bug that was fixed and we are only now doing a rerun of historical m2?
Mar 16 2021
I'm not entirely sure if it's correct, but in theory ebernhardson.glent_suggestions/algo=m2run/date=20210313 should contain all the queries in the regular m2run history, but re-run with the updated algorithm. This is mostly a naive re-shaping of the historical data to look like it's a log and processing it that way.
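A very rough illustration of that re-shaping; the actual table and column names will differ:

```
# Naive re-shaping sketch: take the historical m2run output and select /
# rename columns so it looks like a query log the updated algorithm can
# consume. Table and column names are guesses, not the real schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

historical = spark.table("glent.m2run")  # stand-in for the real history

fake_log = historical.select(
    F.col("query"),
    F.col("wiki"),
    F.col("ts").alias("dt"),
)
fake_log.write.mode("overwrite").saveAsTable(
    "ebernhardson.glent_suggestions_input"  # hypothetical staging table
)
```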
Articletopic should be fully loaded into prod now, the ores_articletopics and the weighted_tags fields. We will have to decide if we are going to push through drafttopic and refactor the orchestration into smaller pieces that don't retry on a week-long window.
This seems to be a misunderstanding in review: CirrusSearch uses ContLang both for local-wiki and cross-wiki behavior. When the wiki is remote Cirrus provides ContLang, but when the wiki is local it depends on the global variable already existing. I suspect the misunderstanding revolves around how cirrus queries indices and returns results for any wiki in the cluster, and not only the local wiki being queried.
The above set of patches should start the test. The first two we should merge and deploy soon-ish, before actually starting the test we will want to run some test queries against prod and make sure it looks as we expect.
Mar 15 2021
Seems we are about ready, should i run a release on glent and update airflow with the new jar?
The following is hardly comprehensive, but i've tried to collect together information from various sources about what this dataset is. The overall goal seems to be: Provide simplified and unified access to full-text search queries issued to CirrusSearch
articletopic dumps have been processed and uploaded to swift. This includes updates for ~35M pages and will likely take a day or two to make it through the indexing pipeline.
The script finished, but the processing framework OOM'd while finishing up and putting everything where it belongs. For now I'm bypassing the drafttopic dump which will allow articletopic to ship to the cluster. To run drafttopic we will need a minor refactor of the orchestration to partition the intermediate data by namespace and re-run drafttopic one namespace at a time.
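Roughly the partitioning I have in mind, sketched with hypothetical paths and table names:

```
# Sketch of partitioning the intermediate data by namespace so each
# namespace can be exported as its own task, keeping any one task small
# enough to avoid the OOM seen above. Paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

predictions = spark.table("ebernhardson.drafttopic_predictions")  # hypothetical

(
    predictions
    .repartition("namespace")
    .write
    .partitionBy("namespace")
    .mode("overwrite")
    .parquet("hdfs:///user/ebernhardson/drafttopic_by_namespace")
)
# Downstream, one export task per namespace reads only its own partition.
```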
Reasonable chance this is because we are pulling ContLang from SearchConfig; ContLang was never a configuration variable and might no longer be accessible that way.
Mar 12 2021
It's still running. Looks like it's requested just under 34M of the expected ~40M predictions, with a current runtime of ~110 hours. Once the dump finishes it should automatically be processed and uploaded to the production clusters.
All seems reasonable to me.
classification.ores.articletopic/History and Society.Politics and government|0.85337919734571
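That value appears to follow a "prefix/topic|score" shape; a quick parse sketch, purely illustrative:

```
# Quick illustration of how that weighted_tags value breaks apart; the
# format appears to be "<prefix>/<topic name>|<score>".
tag = "classification.ores.articletopic/History and Society.Politics and government|0.85337919734571"

prefix, rest = tag.split("/", 1)
topic, score = rest.rsplit("|", 1)
print(prefix)        # classification.ores.articletopic
print(topic)         # History and Society.Politics and government
print(float(score))  # 0.85337919734571
```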
Mar 11 2021
After having sat with this for some time, the metrics look reasonably happy since mid-October when the above mitigations were applied. Since then T271493 found what was likely the underlying cause of the increase in working set size, which has also been mitigated. The only remaining thing is to re-enable sister search for commonswiki; a sub-task has been created and this can be closed.
In theory we could do something like run !xnxx ignoring syntax, although that code path doesn't exist today. I feel like that makes lots of things more complicated though, pains are taken to try and ensure the query that was submitted and is represented in the logs is what we run when suggesting it to another user. Not just the query string, but also the context of the query. Suggesting a query to a user based on historical behavior, but then running the suggestion such that it's no longer the same query as was recorded, seems incorrect.
Mar 10 2021
Looking at this it seems like setting searchEngineType so late is unintentional. Moving it to the top of load() seems the most sensible, it matches the purpose stated in the doc comment. Might be nice to also add a test case that verifies the appropriate search engine is chosen.
So far 352 files have processed, with 115 remaining.
While pondering a fix, i wonder if cirrussearch really even needs to throw an exception here. Page titles are well documented as limited to 255 bytes. In other cases where we receive queries with no possible answer we simply return no results as the correct answer. That seems sane here, and if we want to provide a better UX in this narrow case (i suspect these are mostly automated and it helps the rare human) that can be implemented in a sensible layer.
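Purely as an illustration of the idea (CirrusSearch itself is PHP; the names here are made up):

```
# Illustrative sketch, not actual CirrusSearch code: a title query over the
# documented 255-byte limit can never match a page, so return an empty
# result set instead of throwing an exception.
MAX_TITLE_BYTES = 255


def title_search(query: str) -> list:
    if len(query.encode("utf-8")) > MAX_TITLE_BYTES:
        # No page title can be this long, so "no results" is the correct answer.
        return []
    return run_backend_search(query)  # hypothetical stand-in for the real path


def run_backend_search(query: str) -> list:
    # placeholder so the sketch is self-contained
    return []
```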
Mar 9 2021
A cutoff wouldn't be too hard to add; we can accept some timestamp in the CLI arguments and then apply it. I suppose we could deploy and let it run once with a shorter timespan, then update the calling pieces to start providing -15 months.
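A sketch of the CLI side; the argument name and how it gets applied are guesses, not the real glent interface:

```
# Sketch of accepting a cutoff timestamp on the command line and applying
# it to the suggestion data. Argument name and usage are guesses.
import argparse
from datetime import datetime, timedelta

parser = argparse.ArgumentParser()
parser.add_argument(
    "--min-timestamp",
    type=datetime.fromisoformat,
    default=None,
    help="drop suggestion rows older than this timestamp",
)
args = parser.parse_args()

# default to roughly 15 months back when the caller doesn't provide one
cutoff = args.min_timestamp or (datetime.utcnow() - timedelta(days=456))
# the cutoff would then be applied to the suggestion table, e.g.
# suggestions = suggestions.where(suggestions.ts >= cutoff)
```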
The suggestion table includes a ts which is the max timestamp at which this (query, dym) pair was seen as the top pairing for the given query. Looking at only the rows that suggest '!xnxx', the most recent timestamp is 2020-04-24T23:37:47Z. Basically, yes they are old and it seems likely we are now filtering these. Two things occur to me while looking at this:
Mar 8 2021
Same thing, the node it was running on was taken down for reimage this morning. It's now running on a host that's already been reimaged, letting it try again.
I completely forgot about how it will behave while reindexing. Indeed we must account for two full copies of any one index as part of normal operations. While reviewing the previous order to see if it would make sense to add disks to these machines, I noticed the above spec for a recent server build is incorrect. Specifically, the most recent machines have 2x1.75TB disks, with 3.4T usable on /srv. That brings 10 servers up to 34TB, matching the identified need.
Mar 5 2021
Mar 4 2021
Simple summary of current relforge usage by the analytics network (and/or elsewhere if there's any other usage), and ideally a super-high level description of the flow of data
The node the job was running on was taken down for a reimage, it has restarted on another host.
Mar 3 2021
Looked into the dashboards we created to see what the state is today, and how that data has changed since late November when we deployed stats collection:
Mar 2 2021
Restarted the dump after deploying the change to error_threshold; it was only ~4 hours into the run since the last fail. The last fail was:
merge finished, minimal impact. It only triggered ~700GB of merges in 10.5T worth of shards, not a large enough % to have a significant effect.