
Shiny Dashboard for usage of the AdvancedSearch extension
Closed, ResolvedPublic

Description

Show the number of searches per keyword and per wiki in a graph, and as a single number for the past day


**Possible correlations**
Create a table (or something similar) that shows

keyword combination | usage past day | usage past month | % of searches with that combination overall

sorted by usage past month, with highest one on top
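
A minimal sketch of how such a table could be computed, assuming each search is available as a dict of boolean keyword flags. The field names follow the AdvancedSearchRequest events quoted later in this task; the function names and the three input lists are hypothetical, not part of any existing code.

```python
from collections import Counter

# Tracked boolean fields, as they appear in the AdvancedSearchRequest events in this task.
KEYWORDS = ("plain", "phrase", "not", "or", "intitle", "hastemplate")


def combination(event):
    """Return the keywords that are true in one event, in KEYWORDS order."""
    return tuple(k for k in KEYWORDS if event.get(k))


def combination_table(day_events, month_events, all_events):
    """Build rows of (combination, usage past day, usage past month, % overall)."""
    day = Counter(combination(e) for e in day_events)
    month = Counter(combination(e) for e in month_events)
    overall = Counter(combination(e) for e in all_events)
    total = sum(overall.values())
    rows = [
        (combo, day[combo], month[combo], 100.0 * overall[combo] / total)
        for combo in overall
    ]
    # Sort by usage in the past month, highest first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

The empty combination `()` stands for searches with no tracked keyword; whether that row belongs in the table is a product decision, not something the sketch settles.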

Event Timeline

Lea_WMDE created this task. Nov 15 2017, 9:49 AM
Addshore changed the task status from Stalled to Open. Nov 21 2017, 10:09 AM
Addshore claimed this task.
Addshore added a subscriber: Addshore.

I am picking this up to see the current patch through to merge, after which the data will appear in EventLogging and a dashboard can be created from it.

Restricted Application added a project: User-Addshore. Nov 21 2017, 10:09 AM

@Lea_WMDE I don't think the keywords section of this will be a grafana dashboard, as that is not the kind of data grafana is designed to present.

@GoranSMilovanovic will likely have to come up with something else!

@Addshore Ok, it will be either a Shiny app or an interactive Rmarkdown hosted on labs then, which opens the following question: how frequently does the keywords section need to be updated?

@Addshore @GoranSMilovanovic no problem! Daily updates are definitely enough, and ideal, but if longer time spans would simplify things a lot, the frequency may also be lower

The EventLogging data is working:

mysql:research@analytics-slave.eqiad.wmnet [log]> select count(*) from AdvancedSearchRequest_17379859 limit 10;
+----------+
| count(*) |
+----------+
|      113 |
+----------+
1 row in set (0.00 sec)

GoranSMilovanovic renamed this task from Grafana board for usage of the AdvancedSearch extension to Shiny Dashboard for usage of the AdvancedSearch extension. Dec 6 2017, 2:12 PM

Changed the task name since this is not going to be Grafana but Shiny.

@Lea_WMDE @Addshore

Ok, I see no particular problems in relation to this task, except that I would need to learn more about the following:

how many times a search was submitted using the new advanced search form

  • as a single number for the past day
  • as a line chart comparing the history of searches where the AdvancedSearch was used and wasn't used (i.e. together with the point above)

Namely, @Addshore: where do we look for the number of searches where the new extension was not used? Nothing in the log database seems to be providing for these numbers.

Also,

Keywords:
Create a table (or something similar) that shows...

@Lea_WMDE I assume that by "keywords" here you mean things like: event_hastemplate, event_intitle, event_not, event_or, event_phrase, and event_plain? These are coded as binary features for every search recorded in log.AdvancedSearchRequest_17379859 (take a look at the excerpt of the log.AdvancedSearchRequest_17379859 data provided below).

Finally, @Addshore Do you know how to connect to the log database from Labs? I can access any replica there (e.g. dewiki_p, enwiki_p, etc.), but I don't seem to be able to find the log database. It would be much easier to keep everything on a Labs instance simply because Shiny will run from there. I remember there were some changes with respect to which host will replicate which databases in the future, but of course I didn't bookmark that info.

@Addshore solved, running hourly scheduled queries against log.AdvancedSearchRequest_17379859 in production, Labs will sync to update the dashboard.

@Lea_WMDE The dashboard to track the usage of the new extension is under development as of now. I will keep you posted and let you know when you can test.

@Addshore Please, Adam:

Namely, @Addshore: where do we look for the number of searches where the new extension was not used? Nothing in the log database seems to be providing for these numbers.

@Addshore Please, Adam:

Namely, @Addshore: where do we look for the number of searches where the new extension was not used? Nothing in the log database seems to be providing for these numbers.

Hmmm, good question.

What number do we want?

  1. The number of searches on all wikimedia projects that were made without using advanced search?
  2. The number of searches on wikimedia projects where the AdvancedSearch extension is enabled, without using advanced search?

I guess we also ONLY care about Special:Search?
We could probably get these from the web request table / page view tables.

@Addshore As for your first question, what we want is

  1. The number of searches on wikimedia projects where the AdvancedSearch extension is enabled, without using advanced search?

Now, as for

I guess we also ONLY care about Special:Search?

I think so, yes, but please consult @Lea_WMDE

Finally,

We could probably get these from the web request table / page view tables.

I wonder whether we can then get the AdvancedSearch extension numbers from Hive too and avoid working with both Hive and MariaDB? However, that is not essential.

One thing to consider: (a) I am having some trouble in running beeline from Rscript on crontab from stat1005 (see: ToDo section in T181035#3827488), so until that is solved I would prefer to put SQL from R only on crontab (however, I understand this is not really feasible, given that the numbers will be quite large, I assume).

We could probably get these from the web request table / page view tables.

I wonder whether we can then get the AdvancedSearch extension numbers from Hive too and avoid working with both Hive and MariaDB? However, that is not essential.

The event logging information should now also be accessible in hive.

https://lists.wikimedia.org/pipermail/analytics/2017-December/006074.html

Hey @GoranSMilovanovic, sorry for the delay! So I'm not interested in Searches without AdvancedSearch at all. I'm interested in comparing the number of times people with AdvancedSearch were on SpecialSearch and searched (no matter if with AdvancedSearch or not) and the number of times people with AdvancedSearch were on SpecialSearch and searched with AdvancedSearch.
And yes, only Special:Search is relevant, all other searches are not.

Concerning the keywords being event_intitle, event_hastemplate and such: I guess these are the keywords, but @kai.nissen should confirm. The keywords I am interested in are has_template:, in_title: etc, the actual keywords you can use during the search, but I am guessing the events you mentioned are the tracking code for them.

So I'm not interested in Searches without AdvancedSearch at all. I'm interested in comparing the number of times people with AdvancedSearch were on SpecialSearch and searched (no matter if with AdvancedSearch or not) and the number of times people with AdvancedSearch were on SpecialSearch and searched with AdvancedSearch.

And yes, only Special:Search is relevant, all other searches are not.

So this sounds like you are only interested in searches by users who have advanced search enabled and make a search from Special:Search, in which case all of the info should be in the event logging table, I believe.

Concerning the keywords being event_intitle, event_hastemplate and such: I guess these are the keywords, but @kai.nissen should confirm. The keywords I am interested in are has_template:, in_title: etc, the actual keywords you can use during the search, but I am guessing the events you mentioned are the tracking code for them.

The schema for a search request can be found @ https://meta.wikimedia.org/wiki/Schema:AdvancedSearchRequest

@Addshore @Lea_WMDE

I'm interested in comparing
(A) the number of times people with AdvancedSearch were on SpecialSearch and searched (no matter if with AdvancedSearch or not) and
(B) the number of times people with AdvancedSearch were on SpecialSearch and searched with AdvancedSearch.

Am I correct in the following assumption: no "keywords" used = no AdvancedSearch performed?
Because from the https://meta.wikimedia.org/wiki/Schema:AdvancedSearchRequest schema (and I've already shared an example data set with you here) there does not seem to be any other intuitively clear way of distinguishing between (A) and (B).

@GoranSMilovanovic

Am I correct in the following assumption: no "keywords" used = no AdvancedSearch performed?

Unfortunately no, there is also the special syntax "without word (-)", "exactly the text (" ")" and "one of these words (OR)".
But writing this down I realized that this (and the keywords) may all come from the full search bar as well. @Addshore do you know if the aforementioned events fire in all cases or just when the keywords were entered through the form?

@Lea_WMDE @Addshore Could you please define as precisely as possible the criteria that have to be satisfied for a row from the https://meta.wikimedia.org/wiki/Schema:AdvancedSearchRequest table to count as a case of: 'AdvancedSearch were on SpecialSearch and searched with AdvancedSearch'?

The following fields are now present in the table:

  • id
  • dt
  • timestamp
  • userAgent
  • webHost
  • wiki
  • event_hastemplate
  • event_intitle
  • event_not
  • event_or
  • event_phrase
  • event_plain

of which only the keywords seem to be relevant for the question at hand: some combination of their values must determine whether a row is, or is not, a case of 'AdvancedSearch were on SpecialSearch and searched with AdvancedSearch'.

Please take into consideration that I am not using this extension (it's available on two wikis only at the moment, as far as my understanding goes).

Thanks!

@GoranSMilovanovic thanks for the specifications!
So problem 1: We need one more event (for the file type), but I will create another ticket for that, which goes to FUN or qwerty.
As long as we don't have the file type event, we cannot really tell whether or not the AdvancedSearch form was used. If we did have the event, then I would guess the absence of all events would mean: search without AdvancedSearch used. @Addshore or @kai.nissen could you confirm that? I will update the ticket description for the current status of "no file type event" now.

Lea_WMDE updated the task description. Dec 14 2017, 2:52 PM

@Lea_WMDE I can start the development of the Dashboard without this feature and then add it once all data become available, if @Addshore and @kai.nissen can guarantee that none of the fields already existing in the https://meta.wikimedia.org/wiki/Schema:AdvancedSearchRequest schema will change.

An event will be recorded whenever the search button on Special:Search is clicked while AdvancedSearch is displayed.

For example here is a basic search and matching event:

{"event":{"plain":false,"phrase":false,"not":false,"or":false,"intitle":false,"hastemplate":false},"revision":17379859,"schema":"AdvancedSearchRequest","webHost":"test.wikipedia.org","wiki":"testwiki"}

For a request such as "foobar -notthisstring", you would get the same event as above if the user manually uses the "not" filter without using the AdvancedSearch UI.

Only when keywords are entered through the UI will you get anything other than "false" for eventlogging events.

For example:

Will result in the event:

{"event":{"plain":false,"phrase":false,"not":true,"or":false,"intitle":false,"hastemplate":false},"revision":17379859,"schema":"AdvancedSearchRequest","webHost":"test.wikipedia.org","wiki":"testwiki"};
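
As a hedged illustration of reading such an event, the sketch below parses the second example and lists the tracked keywords set to true. The function name `used_keywords` is hypothetical; only the JSON layout comes from the events quoted in this comment.

```python
import json


def used_keywords(raw_event):
    """Return the names of the tracked keywords set to true in one raw event."""
    payload = json.loads(raw_event)
    return sorted(name for name, flag in payload["event"].items() if flag is True)


example = (
    '{"event":{"plain":false,"phrase":false,"not":true,"or":false,'
    '"intitle":false,"hastemplate":false},"revision":17379859,'
    '"schema":"AdvancedSearchRequest","webHost":"test.wikipedia.org",'
    '"wiki":"testwiki"}'
)
print(used_keywords(example))  # prints ['not']
```

An empty list here only means that no tracked keyword was entered through the form; it does not by itself show that the form was unused.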

@Addshore there's something going on with the Phab servers right now so I can't see the images, but I have to say that I am still not sure how to detect the following

a case of 'AdvancedSearch were on SpecialSearch and searched with AdvancedSearch'.

from the other case from your message.

This event will go into the https://meta.wikimedia.org/wiki/Schema:AdvancedSearchRequest table, right? Exactly what combination of fields defines a case of 'AdvancedSearch were on SpecialSearch and searched with AdvancedSearch' then?

a case of 'AdvancedSearch were on SpecialSearch and searched with AdvancedSearch'.

So I guess right now we cannot determine this.
If the beta feature is enabled, but not used and the user makes a search then there will be an event logging event for the search.
If the beta feature is enabled, and expanded, but no keywords entered there will be the same event as above....
If the beta feature is enabled, and expanded, and one of the keywords is used that is not tracked, then again the same event (all with false values) will be recorded.

Looks like we need to alter the tracking and either:

  • Track the usage of each & all keywords?
  • If we don't care about ALL of the keywords then simply include a boolean in the event indicating whether the advanced search was used or expanded? (or maybe one for each case)?
  • We could also NOT track AdvancedSearch event logging events when it is not used and track an event of a different name / type with different data?

GoranSMilovanovic added a comment. Edited Dec 14 2017, 11:15 PM

@Addshore I need to think about this and probably have a chat with @Lea_WMDE to determine what we need to do.

I'm totally not into preaching anything to anyone, but when one ends up in a situation like this, that is a consequence of not having a research design for what needs to be done, which would be step 1, often implicit, in any Data Science/Analytics project.

Ok. Give me some time, I'll think through this, and then contact @Lea_WMDE and probably @Addshore to see what the necessary steps are.

Again @Lea_WMDE we can have a Dashboard that tracks everything else except this (crucial) feature, and then add it later on when we decide what we need to do first.

GoranSMilovanovic added a comment. Edited Dec 18 2017, 11:23 PM

@Addshore

  • Regular hourly update from https://meta.wikimedia.org/wiki/Schema:AdvancedSearchRequest table is prepared in production (stat1005), and
  • it can be synced anytime with the Labs instance to update the future Dashboard.

I hope that we will figure out a way to reach the log database replica from Labs somehow, because it sounds quite silly to me to have to fetch from MariaDB on one server, transfer via HTTP to another, and feed another SQL database from there. I do it for the WDCM system; however, it's a huge system and I can live with it. For a single dashboard tracking a single MediaWiki extension, listen... I was so hoping that it would be possible to do it from a single server and with no HTTP transfers.

@Lea_WMDE @Addshore

My questions for you for tomorrow:

  1. @Addshore Do we predict that this thing will hit the big data segment anytime soon? Because if we do, then you should direct the event logging procedures to Hadoop directly, not MariaDB. However, and if I understand correctly, all event logging should be now available in Hive: https://lists.wikimedia.org/pipermail/analytics/2017-December/006074.html - let's discuss this briefly tomorrow.
  2. @Addshore Of course: do you have any idea how to reach a replica of the log database from Labs?
  3. @Lea_WMDE Please try to define precisely what ratio will define the percentages that you need on the Dashboard. By "precisely" I mean: what combination of the fields available from event logging (i.e. the database) results in A, and what in B, in the ratio A/B expressed as a percentage that you would be interested in tracking. I am totally eager to help think through this question until the answer becomes crystal clear to us.
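
To make the third question concrete, here is a minimal sketch of the percentage computation with the definitions left open as predicates. The labels follow the earlier (A)/(B) wording in this task, with (B) counted among (A). The rule `any_tracked_keyword` is only a placeholder assumption: as discussed above, it cannot separate form usage without a tracked keyword from no form usage, so it would undercount (B).

```python
# Column names as listed for the AdvancedSearchRequest table in this task.
KEYWORD_FIELDS = (
    "event_hastemplate", "event_intitle", "event_not",
    "event_or", "event_phrase", "event_plain",
)


def any_tracked_keyword(row):
    """Placeholder rule for (B): at least one tracked keyword flag is set."""
    return any(row.get(field) for field in KEYWORD_FIELDS)


def percentage(rows, is_b, is_a=lambda row: True):
    """Share of rows counted as (A) that also count as (B), in percent."""
    a_rows = [row for row in rows if is_a(row)]
    if not a_rows:
        return None  # no searches in the period, so the ratio is undefined
    b_count = sum(1 for row in a_rows if is_b(row))
    return 100.0 * b_count / len(a_rows)
```

Once the criteria are agreed, only the two predicates would need to change; the arithmetic stays the same.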

Lea_WMDE updated the task description. Dec 19 2017, 9:16 AM
Lea_WMDE updated the task description. Dec 19 2017, 5:19 PM

Please correct me if I am wrong:

  • I am waiting for @Addshore to let me know whether the file_type field is forwarded to the EventLogging,
  • before I proceed to develop a Dashboard for this?

Because I think that all the data that we need for this, except that field, is already available.

  • I am waiting for @Addshore to let me know whether the file_type field is forwarded to the EventLogging,

This won't happen until the new year.

@Addshore No problem; just ping when you think the new schema is ready, please.

@Addshore Hey, do you anticipate that the inclusion of the file_type field is possible here? Thanks.

@Addshore Hey, do you anticipate that the inclusion of the file_type field is possible here? Thanks.

It looks like we missed something in the first patch, so there is a new patch now attached to T173572 that needs to be merged.
If merged today this will be deployed this week (thursday evening for dewiki) and data will start appearing in the new db table for the new version of the schema.

Looks good.
This really shouldn't be under http://wdcm.wmflabs.org, or under the wdcm project?

Do we need a new project?
Should we create a generic wmde-shiney project?

@Addshore Thanks. That's a prototype, and yet more features need to be developed. Let's keep it there until the development ends and @Lea_WMDE also reviews it.

In general, yes, I agree that we need a new project for this, wmde-tw-dashboards for example.

It would be better to have a project for wmde dashboards, within that we
can have separate subdomains for dashboards if we require, or even
different instances.

It might be good to do this and finish puppetizing the current wdcm
dashboards at once.

Agreed. Let's have one project then, wmde-dashboards, for example, and serve all (non Wikidata, non WDCM) Shiny dashboards from there.

@Lea_WMDE You can find the summary data for the (a) previous day, (b) previous week, and (c) previous month served from the dashboard now.

Next step: per Wiki statistics.

GoranSMilovanovic added a comment. Edited Jan 23 2018, 4:33 PM

@Lea_WMDE per Wiki statistics are now ready. Reminder: I am still using the old schema (no file_type field) until the new schema is populated with more data.

Check it out: http://wdcm.wmflabs.org/TW_AdvancedSearchExtension/

Next step: the Correlations tab, where the combinations of keywords will be presented.

@Lea_WMDE @Addshore

  • There are no keyword co-occurrences present in the currently available data, so the Correlations tab will be developed once we gather enough data for it.
  • There are only 324 data points in log.AdvancedSearchRequest_17621725 currently, and no attempts to use the event_filetype key in searches, so the Dashboard will include the event_filetype only once more data become available.
  • Action: lowering the priority of the task and keeping it open. The update module on the statboxes will be put on crontab. Play with the prototype in the meantime, suggest what you would like changed, etc.

@Addshore I am a little bit concerned about the moment when this starts hitting the Big Data segment. Not too much concerned, just a little bit.

GoranSMilovanovic lowered the priority of this task from Medium to Low. Jan 23 2018, 11:46 PM

There are no keyword co-occurrences present in the currently available data, so the Correlations tab will be developed once we gather enough data for it.

If you want me to go and make some tests to add some data there then I can :)

@Addshore If you decide to add some test data, make sure that you add *a lot*. If you can put the data directly into the table - bypassing the event logging - then I can generate the test data, send you the .csv file or whatever you prefer, and then play with it from SQL.

@Addshore If you decide to add some test data, make sure that you add *a lot*. If you can put the data directly into the table - bypassing the event logging - then I can generate the test data, send you the .csv file or whatever you prefer, and then play with it from SQL.

Per today's meeting I won't bother adding any :)

@GoranSMilovanovic When I try to access the dashboard today I get an nginx 502 Bad Gateway response. Is this something you can solve? Thanks!

Lea_WMDE closed this task as Resolved. Feb 15 2018, 4:45 PM

Site is live again, and the current version is enough for now. Let's create a new ticket when we feel like it's time for correlations

Lea_WMDE reopened this task as Open. Feb 15 2018, 4:46 PM
Lea_WMDE updated the task description.

sorry, reopening because the file type keyword is not supported yet.

@Lea_WMDE Thanks for re-opening, and it's not just the file type keyword, the back-end is not fully developed yet. This will be finished tomorrow.

@Lea_WMDE @Addshore

  • Back-end completed; keeping data for the last three months, hourly resolution;
  • crontab set: new update on hourly basis
  • public data set review requested in order to migrate the updates from production to CloudVPS (where the Dashboard is hosted): T187606
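
For reference, a minimal sketch of what "last three months, hourly resolution" could look like as an aggregation step. It assumes event timestamps are already parsed into `datetime` objects and approximates three months as 90 days; the function name and the constant are hypothetical and do not describe the actual back-end.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumption: "three months" approximated as a fixed 90-day window.
RETENTION = timedelta(days=90)


def hourly_counts(timestamps, now):
    """Count events per hour, dropping anything older than the retention window."""
    cutoff = now - RETENTION
    counts = Counter()
    for ts in timestamps:
        if ts >= cutoff:
            # Truncate to the start of the hour to get the hourly bucket.
            counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))
```

Running this from an hourly crontab job would recompute the whole window each time; an incremental update would be an alternative if the table grows large.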

Next step: feature correlations tab on the Dashboard.

GoranSMilovanovic added a comment. Edited Feb 17 2018, 5:03 AM
  • Feature correlations tab on the Dashboard is now implemented (screenshot attached);
  • Shiny Server problems are currently preventing the Dashboard from going live; I need to inspect this in detail;
  • We are still running manual updates for this Dashboard, but as soon as we have the public data set approved we will sync Labs and production and run on automation.

@Lea_WMDE @Addshore

@GoranSMilovanovic I can't see any file type keyword yet :/

@Lea_WMDE Let me check. It's certainly in the data, and there is a possibility that I was absent-minded enough to update the data set but not the dashboard charts themselves.

@Lea_WMDE The dashboard now shows the event_filetype data too.

Next step: we wait for a public data set review on T187606 to be completed and then (1) the update procedure goes on crontab from the statbox, (2) the dashboard gets synced with it, and (3) everything runs smoothly on a regular hourly update schedule.

Lea_WMDE closed this task as Resolved. Feb 26 2018, 9:08 AM