
Enable ICUTokNorm() for Glent M0 and M1
Closed, Resolved · Public · 2 Estimated Story Points

Description

User story: As someone who has been working on this for too long, I want the new Glent normalizer to be activated so we can generate better suggestions for M0 now and for M1 in the near future.

When the last code for T238151 is merged, we need to enable icutoknorm as the normalizer for M0 and M1 via the --query-normalizer command-line option in Airflow.

We may also want to regenerate old data using the new normalizer. (I'm not sure whether it's needed; we could also decide it isn't worth it and let the older data age out, but that would take a while, and we wouldn't have the new normalizer in place for the M1 A/B test.)

Acceptance Criteria:

  • Airflow shows that M0 and M1 data prep are using --query-normalizer icutoknorm as a command line option.
  • Old data has been regenerated for M0 and M1, or we've explained why we shouldn't or don't need to.
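For illustration, the flag change in the acceptance criteria might look roughly like the following when assembling the data-prep command line. This is a hypothetical sketch: the function name and the --method flag are made up for the example; only --query-normalizer icutoknorm comes from the task itself.

```python
# Hypothetical sketch of building the Glent data-prep argument list.
# The helper name and the --method flag are invented for illustration;
# only "--query-normalizer icutoknorm" is specified by this task.

def build_glent_args(method: str, normalizer: str = "icutoknorm") -> list:
    """Build the command-line arguments for one Glent data-prep run."""
    return [
        "--method", method,                  # e.g. "m0" or "m1" (assumed)
        "--query-normalizer", normalizer,    # the new ICUTokNorm normalizer
    ]

m0_args = build_glent_args("m0")
m1_args = build_glent_args("m1")
```

In the real DAG these arguments would be passed to whatever operator launches the Glent job; the point is simply that both the M0 and M1 runs carry the same new flag.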

Event Timeline

Gehel set the point value for this task to 2. Sep 14 2020, 5:57 PM

The plan is for @EBernhardson to document the process and for me to perform the process following the docs, so that bumped the estimation for the task up to 2.

Created Discovery/Analytics/Glent on wikitech. Bare bones but will rank highly in wikitech search and covers the commands used for releasing new jars along with links to configuration and the related analytics airflow.

Updated Discovery/Analytics on wikitech. Added an "updating java jars" section with some descriptions and a link to the glent docs for the jar release process. Added a section on updating airflow dags and removed sections on deploying oozie coordinators that no longer exist. Added sections about airflow in the How to Deploy section as well.

I suspect this is still going to be a bit unclear, but please let me know what's missing and how it could make more sense.

Old data has been regenerated for M0 and M1, or we've explained why we shouldn't or don't need to.

I took a quick look and my initial impression here was incorrect. There may be changes necessary here, but it's not immediately obvious. For query similarity (m1) we can re-process the latest dataset and re-calculate the norm'd query. Session similarity (m0) isn't as straightforward, because it maintains pairs of (query, dym).

Possible concerns for session similarity:

  • We throw out pairs where query.norm == dym.norm; changing the normalizer means some existing pairs may now have matching norms. That shouldn't be a problem: they won't have new matches added and will fall out of the dataset.
  • Edit distance is calculated between the normalized values, so if we re-calculate the norm'd queries, the distance needs to be recalculated as well.
    • We may want to re-apply the max edit distance filters, but again those pairs will fall out of the dataset over time and can probably be ignored.
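The concerns above could be handled in a single re-normalization pass over the stored pairs. Here is a toy sketch under assumed names: normalize() is a stand-in for ICUTokNorm, MAX_EDIT_DISTANCE is a placeholder threshold, and none of this is Glent's actual code.

```python
# Toy sketch of re-normalizing stored (query, dym) pairs with a new
# normalizer. normalize() and MAX_EDIT_DISTANCE are placeholders; the real
# normalizer is ICUTokNorm and the real thresholds live in Glent's config.

MAX_EDIT_DISTANCE = 3  # placeholder threshold, not Glent's actual value

def normalize(text):
    # Stand-in normalizer: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def edit_distance(a, b):
    # Plain Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def renormalize_pairs(pairs):
    """Re-derive norms and distances; drop pairs whose new norms match."""
    out = []
    for query, dym in pairs:
        q_norm, d_norm = normalize(query), normalize(dym)
        if q_norm == d_norm:
            continue  # query.norm == dym.norm pairs are thrown out
        dist = edit_distance(q_norm, d_norm)
        if dist > MAX_EDIT_DISTANCE:
            continue  # re-apply the max edit distance filter
        out.append((query, dym, q_norm, d_norm, dist))
    return out
```

A pair like ("The Cat", "the cat") would be dropped because the new norms collide, while ("teh cat", "the cat") survives with a freshly computed distance.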

Rough guess:

  • We should write a process that reads in an m0 partition and re-calculates the query norm. Hive's ALTER TABLE ... RENAME PARTITION can be used to replace the existing partition.
  • Similarly, a process for m1 to re-calculate the query norm and edit distance.
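The partition-swap step mentioned above could look something like this: write the re-normalized data to a staging partition, then rename it over the live one. The table and partition names below are made up for illustration; only the ALTER TABLE ... PARTITION ... RENAME TO PARTITION syntax is standard Hive DDL, and in practice the old partition would need to be dropped before the rename.

```python
# Hypothetical sketch of building the Hive DDL for the partition swap.
# "glent.m0_prep" and the date values are invented; only the
# ALTER TABLE ... RENAME TO PARTITION form is real Hive syntax.

def rename_partition_ddl(table, old_spec, new_spec):
    """Build the Hive statement that renames one partition to another."""
    return ("ALTER TABLE {t} PARTITION ({old}) "
            "RENAME TO PARTITION ({new})").format(
                t=table, old=old_spec, new=new_spec)

ddl = rename_partition_ddl("glent.m0_prep",
                           "date='20200928_staging'",
                           "date='20200928'")
```

This only builds the statement; running it (and dropping the superseded partition first) would happen in whatever job orchestrates the backfill.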

Created Discovery/Analytics/Glent on wikitech. Bare bones but will rank highly in wikitech search and covers the commands used for releasing new jars along with links to configuration and the related analytics airflow.

It looks reasonable so far, but I'm stuck getting everything ready to release. I'm not sure which machine is "scm" or how to check what my password is there. I tried to log into archiva.wikimedia.org, but either I can't, or I can't figure out which password is the right one. It doesn't seem to be my wikitech password.

Looking ahead to deployment, I am listed in the archiva-deployers group, so that's good.

I've asked for access to archiva.wikimedia.org after talking to David this morning. Will update when that's done.

Claiming this task and moving it to "in progress", though Erik is also doing documentation work, and I'm on hold waiting for access to the right server.

Turns out there was a misunderstanding about how to access archiva.wikimedia.org, and all that is taken care of. Moved the ticket to waiting until Erik updates the filter that unintentionally filtered all of the Japanese data. When that's done, we can re-deploy the Glent jars just the once, then enable this.

Change 629787 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[search/glent@master] Repair words only query filter

https://gerrit.wikimedia.org/r/629787

Change 629787 merged by jenkins-bot:
[search/glent@master] Repair words only query filter

https://gerrit.wikimedia.org/r/629787

Change 630223 had a related patch set uploaded (by Tjones; owner: Tjones):
[wikimedia/discovery/analytics@master] Configure glent to use the new ICUTokNorm normalizer

https://gerrit.wikimedia.org/r/630223

Change 630223 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Configure glent to use the new ICUTokNorm normalizer

https://gerrit.wikimedia.org/r/630223

Today Erik and I worked through the process of getting this enabled. We ran into a few permissions and tooling problems, but we improved the docs and will look into the permissions issues to make things better for next time.

New settings are in production. Erik kicked off some re-runs to backfill available data, and the rest should be run tomorrow as part of the normal weekly glent process.