Page MenuHomePhabricator

Update Search Engine list
Closed, ResolvedPublic

Description

I try to annually take a look at our search engine regex and propose fixes based on what's missing / being grouped into "other" that's a large search engine in its own right.

Proposed fixes:

  • Separate out Brave Search as its own search engine (new-ish)
  • Separate out Petal Search as its own search engine (new-ish)
  • Baidu doesn't always start with a leading . (bug-fix)
  • Add go.mail.ru as a search engine (I just incorrectly assumed this was email not search way back when...)
  • Add suche as a generic search regex because there seem to be several German search engines that use it but don't really meet the threshold for their own

Relevant code: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/SearchEngine.java#L25

Event Timeline

I'm happy to take a pass on updating these regexes but would want someone from DE to validate / deploy

Some more notes:

  • After adding suche, the highest new search engine in the Predicted Other category is suche.t-online.de but it's below 20k per day (and from a high-traffic country) so I think no need to add at this stage. By that logic, AOL wouldn't be added today but I'll leave on for continuity.
  • Implementing the change would have the following effects:
    • Move ~400k pageviews per day from Predicted Other to Brave
    • Move ~50k pageviews per day from external referer class to external (search engine) under VK (go.mail.ru)
    • Move ~35k pageviews per day from Predicted Other to Petal
    • Move ~30k pageviews per day from external referer class to external (search engine) under Predicted Other (suche. referrals)
    • Move ~3k pageviews per day from external referer class to external (search engine) under Baidu

Code

Check for other search engines:

regexes = {"Google": "google\\.",
      "Yahoo": "search\\.yahoo\\.",
      "Bing": "\\.bing\\.",
      "Yandex": "yandex\\.",
      "Baidu": "\\.baidu\\.",
      "DuckDuckGo": "duckduckgo\\.",
      "Ecosia": "\\.ecosia\\.",
      "Startpage": "\\.(startpage|ixquick)\\.",
      "Naver": "search\\.naver\\.",
      "Docomo": "\\.docomo\\.",
      "Qwant": "qwant\\.",
      "Daum": "search\\.daum\\.",
      "MyWay": "search\\.myway\\.",
      "Seznam": "\\.seznam\\.",
      "AU": "search\\.auone\\.",
      "Ask": "\\.ask\\.",
      "Lilo": "\\.lilo\\.",
      "Coc Coc": "coccoc\\.",
      "AOL": "search\\.aol\\.",
      "Rakuten": "\\.rakuten\\.",
      "Predicted Other": "(^.?|(?<!re)|(^|\\.)(pre|secure))search",
}
for r in regexes:
    regexes[r] = re.compile(regexes[r])

def getSE(referer):
    for site, r in regexes.items():
        if re.search(r, referer):
            return site
    return 'none'
    
spark.udf.register('getSE', getSE, 'String')

query = f"""
WITH se_pageviews AS (
    SELECT
      parse_url(referer, 'HOST') AS host,
      getSE(referer) AS search_engine
    FROM wmf.pageview_actor
    WHERE
      year = {year}
      AND month = {month}
      AND day = {day}
      AND is_pageview
      AND agent_type = 'user'
      AND referer_class = 'external (search engine)'
)
SELECT SUBSTRING(host, 1, 50) AS host,
       search_engine AS search_engine,
       COUNT(1) AS num_referrals
  FROM se_pageviews
 GROUP BY
   host,
   search_engine
 ORDER BY
   num_referrals DESC
"""
spark.sql(query).show(5000, False)

Look at current list to see if any Predicted Other should move up:

query = f"""
WITH se_pageviews AS (
    SELECT
      parse_url(referer, 'HOST') AS host,
      getSE(referer) AS search_engine
    FROM wmf.pageview_actor
    WHERE
      year = {year}
      AND month = {month}
      AND day = {day}
      AND is_pageview
      AND agent_type = 'user'
      AND referer_class = 'external (search engine)'
)
SELECT SUBSTRING(host, 1, 50) AS host,
       search_engine AS search_engine,
       COUNT(1) AS num_referrals
  FROM se_pageviews
 GROUP BY
   host,
   search_engine
 HAVING
   num_referrals > 1000
 ORDER BY
   num_referrals DESC
"""
spark.sql(query).show(5000, False)

See impact of suggested changes:

updated_regexes = regexes.copy()
updated_regexes.pop('Predicted Other')  # otherwise can override anything after
updated_regexes['Brave'] = "search\\.brave\\."
updated_regexes['Petal'] = "petalsearch\\."
updated_regexes['Baidu'] = "baidu\\."
updated_regexes['VK'] = 'go\\.mail\\.ru'
updated_regexes['Predicted Other'] = "(^.?|(?<!re)|(^|\\.)(pre|secure))(search|suche)"

for r in updated_regexes:
    updated_regexes[r] = re.compile(updated_regexes[r])

def getUpdatedSE(referer):
    for site, r in updated_regexes.items():
        if re.search(r, referer):
            return site
    return 'none'
    
spark.udf.register('getUpdatedSE', getUpdatedSE, 'String')

query = f"""
WITH hosts AS (
    SELECT
      parse_url(referer, 'HOST') AS host,
      getSE(referer) AS search_engine,
      getUpdatedSE(referer) as updated_search_engine
    FROM wmf.pageview_actor
    WHERE
      year = {year}
      AND month = {month}
      AND day = {day}
      AND is_pageview
      AND agent_type = 'user'
      AND (referer_class = 'external (search engine)' OR referer_class = 'external')
)
SELECT SUBSTRING(host, 1, 50) AS host,
       search_engine,
       updated_search_engine,
       COUNT(1) AS num_referrals
  FROM hosts
 GROUP BY
   host,
   search_engine,
   updated_search_engine
 HAVING
   num_referrals > 100
   AND search_engine <> updated_search_engine
 ORDER BY
   num_referrals DESC
"""

spark.sql(query).show(5000, False)

I'll try to put together the patch but the code change at SearchEngine.java is essentially:

...
BAIDU("Baidu", "\\.baidu\\.", ""),
...
RAKUTEN("Rakuten", "\\.rakuten\\.", ""),

// Maintain PREDICTED_OTHER in last position for `SearchEngineClassifier.identifySearchEngine()`
PREDICTED_OTHER("Predicted Other", "(^.?|(?<!re)|(^|\\.)(pre|secure))search", "");
...

to

...
BAIDU("Baidu", "baidu\\.", ""),
...
RAKUTEN("Rakuten", "\\.rakuten\\.", ""),
BRAVE("Brave", "search\\.brave\\.", ""),
PETAL("Petal", "petalsearch\\.", ""),
VK("VK", "go\\.mail\\.ru", ""),

// Maintain PREDICTED_OTHER in last position for `SearchEngineClassifier.identifySearchEngine()`
PREDICTED_OTHER("Predicted Other", "(^.?|(?<!re)|(^|\\.)(pre|secure))(search|suche)", "");
...

Code and changes make sense to me @Isaac!
I'll confirm with the team, but feel free to create a patch for that if you feel like it!

Change 826581 had a related patch set uploaded (by Isaac Johnson; author: Isaac Johnson):

[analytics/refinery/source@master] Update search engine detection

https://gerrit.wikimedia.org/r/826581

Thanks -- patch uploaded! Big caveat that while I tested a Python version to make sure no adverse effects etc., I don't have a good way of testing this Java version.

Thanks a lot for the patch @Isaac , I've merged the code :)
About testing the java version, your patch will be deployed next week making the UDF the new version when using the correct jar. May I let you validate after the deploy that the UDF provides expected results in regard to your python research?
After that is done, I assume we wish to update https://github.com/wikimedia/analytics-refinery/blob/master/oozie/referrer/daily/coordinator.properties#L35 to use the new UDF.

Change 826581 merged by jenkins-bot:

[analytics/refinery/source@master] Update search engine detection

https://gerrit.wikimedia.org/r/826581

I've merged the code :) About testing the java version, your patch will be deployed next week making the UDF the new version when using the correct jar.

Yay -- many thanks @JAllemandou !

May I let you validate after the deploy that the UDF provides expected results in regard to your python research?

Happily -- I may need some guidance though as I very rarely work with Java. Mostly following this documentation, I think the way to test would be:

  • Open hive cli interface on e.g., stat1008
  • ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive-shaded.jar;
    • this is what I'm most uncertain of -- would this be the right jar?
  • CREATE TEMPORARY FUNCTION get_search_engine as 'org.wikimedia.analytics.refinery.hive.GetRefererSearchEngineUDF'; (this uses the code that I changed)
  • essentially do what I did with the Python UDF and run get_search_engine on some pageview_actor data and make sure the results match what's currently produced except for the expected changes

And the code is now released and deployed as v0.2.5 !
@Isaac your description in the previous comment of the process to follow is perfect :)
The /srv/deployment/analytics/refinery/artifacts/refinery-hive-shaded.jar will be the latest deployed version, which now contains the updated version of the code.

@JAllemandou Excellent thanks! I looked at 29 August's data and compared against wmf.referrer_daily and all the shifts make sense: no change to existing search engines; Brave/Petal/VK as new search engines and corresponding decreased in the other category as those moved out and suche search engines started to be captured; no change to Baidu for this table because the previously-missed ones didn't clear the privacy threshold. I also spot-checked all the url domains that would move from external to external (search engine) on that day and they made sense with just a very small number of url domains with suche in their names that probably aren't search engines but an acceptable level by my read (~1% of total suche domain referrals). If you're curious, you can see the updated data in isaacj.referrer_daily.

After that is done, I assume we wish to update https://github.com/wikimedia/analytics-refinery/blob/master/oozie/referrer/daily/coordinator.properties#L35 to use the new UDF.

I think I can do that but a question: the current code lists 0.1.2 as the jar in use. That's surprising to me because it looks like we made updates in 0.0.20 (per webrequest table) to add some search engines that I think are reflected in the current table. I was going to just create a patch to update refinery_hive_jar_path = ${artifacts_directory}/org/wikimedia/analytics/refinery/refinery-hive-0.1.2.jar to refinery_hive_jar_path = ${artifacts_directory}/org/wikimedia/analytics/refinery/refinery-hive-0.2.5.jar but I wasn't sure if that's what actually needed to be done. Also does the jar need updated for webrequest too when the initial external vs. external (search engine) is determined (code)?

I can also update the webrequest documentation FYI unless there are other changes in 0.2.5 that you intend to add: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Changes_and_known_problems_since_2015-03-04

About the 0.1.2 jar version: it is the version used in the original commit (see on github.) I assume it was the latest jar when the job has been devised.

Your plan for the patch is almost perfect. You need updating the refinery-hive version from 0.1.2 to 0.2.5, and you also need to add the -shaded postfix to the jar (full name: ${artifacts_directory}/org/wikimedia/analytics/refinery/refinery-hive-0.2.5-shaded.jar). The separation between shaded and non-shaded has been made explicit on version 0.1.18, so any change from before to after needs to include the postfix :)

Finally, it's indeed necessary to update the webrequest/load jar as well, as your change impacts how referer are classified in there (I had completely forgotten about that, thank yyou for pointing it out!). Both changes to refer and webrequest can be done in the same patch, and if you update the docs with it it's glorious :)

Thanks a lot for taking care of this @Isaac :)

Change 829029 had a related patch set uploaded (by Isaac Johnson; author: Isaac Johnson):

[analytics/refinery@master] Bump search engine versions.

https://gerrit.wikimedia.org/r/829029

Thanks @JAllemandou -- patch submitted! And I realize now that my confusion about the versions came from a mixture of looking at old code and confusing the record number for webrequest with the jar version numbers. That said, I bumped the record version too for webrequest but let me know if that's not desirable for such a small change. Once this is through, I'll update wikitech.

Change 829029 merged by Joal:

[analytics/refinery@master] Bump search engine versions.

https://gerrit.wikimedia.org/r/829029

Code merged! The doc change should happen after deploy - I've asked on our deployment doc to ping you @Isaac here once done :)

@JAllemandou can I mark this one as resolved once the change is done?

@JArguello-WMF - You can close this one after this weeks deploy and validation by @Isaac :) Many thanks!

Hi @Isaac: I just deployed the code and restarted the jobs:

the new referrer/daily is running as of 2022-09-06
the new webrequest load bundle is running as of 2022-09-06T21:00

Checked and data was looking as expected! Documentation updated (diff). @JArguello-WMF feel free to resolve and thanks @JAllemandou and @Milimetric!

JArguello-WMF claimed this task.