
NLP contractor set up and access
Closed, ResolvedPublic


This is a ticket to track the work that needs to be done to get our NLP contractor (Julia / @Julia.glen) everything she needs to work on T212884: [EPIC] Improve Search Suggestions with NLP.

Event Timeline

TJones triaged this task as Normal priority.Jan 3 2019, 7:51 PM
TJones created this task.
TJones added a comment.Jan 3 2019, 7:54 PM

@EBernhardson, Let me know if you want specific sub-tickets for any of these tasks. I'm okay with just checking them off as they get done and adding new ones as needed.

EBernhardson added a comment.EditedJan 3 2019, 8:31 PM

There are two primary sources of logs:

  • Frontend logs, collected by javascript running in user browsers.
  • Backend logs, generated by the mediawiki application and the web serving infrastructure.


Frontend logs are collected by javascript running in the browser. These are heavily sampled, collecting around 400-500k events per day, which adds up to 50-80MB per day. These are typically the most interpretable logs, as they are collected much closer to the user. We can raise the sampling rate to retain significantly more events when necessary, but typically do not.


Backend logs are generated by the mediawiki application and the web serving infrastructure. These are unsampled and contain a wealth of information, but also require a fairly deep understanding to work with. The data gets a bit easier to work with after aggregation into the query clicks tables.

The most raw logs are the CirrusSearchRequestSet logs. Each record represents a single external request that resulted in 1 or more requests to the elasticsearch cluster. These are about 3GB per hour.

There are also web request logs for every request made to wmf infrastructure. These are 30-50GB per hour.

An automated process joins web request and cirrus logs into hourly click through logs. These are limited to full text content search performed in the web interface. These are 50-100MB per hour.

The hourly click through logs are also available in a daily format that drops high volume (>1k req/day) users, sessionizes, and filters requests without clicks. These are around 500MB per day and are typically just shy of 1M click throughs per day.
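A rough sketch of the three daily steps described above (drop high-volume users, sessionize, keep only requests with clicks). This is not the actual job: the field names `user`, `ts` and `clicked` are hypothetical, and the 30-minute inactivity gap used for sessionizing is an assumption, not something stated here. Only the 1k requests/day cutoff comes from the description.

```python
from collections import Counter

HIGH_VOLUME_THRESHOLD = 1000    # from the ">1k req/day" figure above
SESSION_GAP_SECONDS = 30 * 60   # assumption: 30 minutes of inactivity ends a session


def build_daily(rows):
    """Reduce one day of click-through rows to the daily format.

    rows: list of dicts with hypothetical keys 'user', 'ts' (epoch seconds)
    and 'clicked' (bool).
    """
    # 1. Drop users with more than the threshold number of requests that day.
    per_user = Counter(r["user"] for r in rows)
    kept = [r for r in rows if per_user[r["user"]] <= HIGH_VOLUME_THRESHOLD]

    # 2. Sessionize: a new session starts on a new user or a long gap.
    kept.sort(key=lambda r: (r["user"], r["ts"]))
    out = []
    last_user, last_ts, session = None, None, 0
    for r in kept:
        if r["user"] != last_user or r["ts"] - last_ts > SESSION_GAP_SECONDS:
            session += 1
        last_user, last_ts = r["user"], r["ts"]
        # 3. Keep only requests that resulted in a click.
        if r["clicked"]:
            out.append({**r, "session_id": session})
    return out
```

Session ids are assigned before the click filter so that a session keeps the same id whether or not its first request had a click.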

While checking these I noticed we can't join the click logs back to the cirrus logs to grab additional information, such as suggested queries that were presented/performed. I've put up a patch in gerrit that keeps the cirrus log id in the query clicks tables and will run a rebuild soon.
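To illustrate what keeping the cirrus log id makes possible, here is a minimal sketch of joining click rows back to cirrus rows by that id. The key names `cirrus_log_id`, `id` and `suggestion` are hypothetical placeholders, not the real column names, and the real join would run in Hive or Spark rather than in memory.

```python
def attach_cirrus_fields(click_rows, cirrus_rows, fields):
    """Copy selected fields from cirrus rows onto matching click rows.

    click_rows: dicts carrying a hypothetical 'cirrus_log_id' key.
    cirrus_rows: dicts carrying a hypothetical 'id' key.
    fields: names of the cirrus fields to copy over.
    """
    by_id = {r["id"]: r for r in cirrus_rows}
    joined = []
    for click in click_rows:
        # Left join: clicks without a matching cirrus row get None values.
        source = by_id.get(click.get("cirrus_log_id"), {})
        extra = {f: source.get(f) for f in fields}
        joined.append({**click, **extra})
    return joined
```

A left join is used so that no click rows are lost if a cirrus record is missing.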

@TJones I've pulled sample datasets. They can be found at stat1007.eqiad.wmnet:/home/ebernhardson/julia-datasets. These datasets are of around 100 rows each for the tables:

  • cirrussearchrequestset
  • query_clicks_hourly
  • query_clicks_daily
  • search_satisfaction

There is a .hql and a .csv file for each, containing the query and its results. Each dataset is constructed by randomly sampling a set of session or user identifiers from the day and collecting all events for the chosen identifiers for that day. I've tried to either pull forward or duplicate the queries to the beginning of each line, as the lines are quite long.
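The sampling approach described above (pick identifiers at random, then keep every event for the chosen identifiers) can be sketched as follows. This is an illustration only: the actual datasets were pulled with the .hql queries, and the function name, the `key` parameter and the seed are hypothetical.

```python
import random


def sample_by_identifier(events, key, n_ids, seed=0):
    """Sample whole sessions/users rather than individual events.

    events: list of dicts for a single day.
    key: name of the identifier field to sample on (hypothetical).
    n_ids: number of identifiers to pick.
    """
    # Sort the distinct identifiers so the sample is reproducible for a seed.
    ids = sorted({e[key] for e in events})
    rng = random.Random(seed)
    chosen = set(rng.sample(ids, min(n_ids, len(ids))))
    # Keep every event belonging to a chosen identifier, in original order.
    return [e for e in events if e[key] in chosen]
```

Sampling on the identifier instead of on rows keeps each sampled session complete, which matters when looking at sequences of queries and clicks.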

EBernhardson updated the task description. (Show Details)Jan 7 2019, 11:53 PM
TJones added a comment.Jan 8 2019, 3:49 PM

@EBernhardson, I've reviewed all the files. I made copies (adding _edit) and deleted a few lines from three of them. I'm assuming the various id fields are MD5 hashes or similar and not something that can be decoded.

I believe last time we decided not to reconcile our edits and just take any vote to remove as a reason to remove—so when you've reviewed them, we're ready.

Thanks for pulling the data!

EBernhardson updated the task description. (Show Details)Jan 9 2019, 10:47 PM

Sent email to julia/trey with links to the reviewed dataset.

Nuria added a subscriber: Nuria.Jan 15 2019, 5:51 PM

Request access for Julia to Stats machines (after NDA)

Please let me know if there are issues with the NDA. We will need a date by which this contract is over in order to expire access. Confirming that data never leaves our systems: we expect collaborators to do all the work on our cluster/stats machines.

debt added a subscriber: debt.Jan 15 2019, 6:25 PM
TJones updated the task description. (Show Details)Jan 24 2019, 6:49 PM
TJones added a subscriber: Gehel.
TJones updated the task description. (Show Details)Jan 24 2019, 6:54 PM
Nuria added a comment.Jan 24 2019, 7:42 PM

Added a subtask with what I think is needed to start requesting the NDA. I think @TJones needs to do a bit of work to explain the project (documentation might exist on meta).

TJones updated the task description. (Show Details)Jan 24 2019, 7:45 PM
TJones updated the task description. (Show Details)Feb 5 2019, 6:46 PM
TJones updated the task description. (Show Details)Feb 8 2019, 2:49 AM
TJones updated the task description. (Show Details)Feb 11 2019, 6:38 PM
TJones moved this task from in progress to Done on the Discovery-Search (Current work) board.

I am unable to access with my LDAP account. Could you take a look? Thanks.

Double check that you are using the right username and password, and if that doesn't work we can open a ticket or contact some folks via email or IRC. Wikitech instructions:

Log in using your UNIX shell username and Wikimedia developer account (Wikitech) password. If you already have cluster access, but can't log into Hue, it is likely that your account needs to be manually synced. Ask an Analytics Opsen (ottomata (aotto at or elukey (ltoscano at ) -- or file a Phabricator task -- for help.

I am using the same username/password as dev and phabricator accounts. Thanks for looking into this.

@Julia.glen, my hue username has the same weird capitalization as Gerrit (Tjones), which I don't use elsewhere.

I've also asked on IRC, so we'll see what happens there.

Should be done now—so try again, please!

No access yet. I'll try again in an hour. TY

Nuria added a comment.Feb 11 2019, 9:17 PM

@Julia.glen Hue will work for seeing what is going on, but running queries there is probably going to be much, much slower than Hive on the command line. If you tell us what you are trying to do, we can see how best to help.

@Nuria thanks for looking into it. Two-pronged approach: 1. viewing small samples of data via Hue; 2. programmatic access via Spark.

Nuria added a comment.Feb 11 2019, 9:29 PM

Julia, let's touch base on IRC in #wikimedia-analytics. We do not use Hue for either, but rather the command line. Please see:

@Nuria I see, I am on IRC.

debt closed this task as Resolved.Feb 15 2019, 7:12 PM

Thanks to everyone who helped on this! :)