
Measure usefulness of special:search page with advancedSearch
Closed, ResolvedPublic

Description

Hypothesis: SpecialPage:Search is going to be used more when AdvancedSearch is available.

Task
In a graph with a granularity of one day

  • Count how many searches are submitted from the SpecialPage:Search (from any web request)
  • Count how many searches are submitted from the SpecialPage:Search (from web requests with AdvancedSearch enabled)
  • Count how many searches are submitted from the SpecialPage:Search (from web requests without AdvancedSearch but within the wikimedia cluster)

Note
The existence of AdvancedSearch can be found from the URL params, e.g. the existence of "advancedSearch-current" as in this example:
https://www.mediawiki.org/w/index.php?advancedSearchOption-original=&search=&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearchOption-filetype=&advancedSearch-current=%7B%22options%22%3A%7B%7D%2C%22namespaces%22%3A%5B%2212%22%2C%22100%22%2C%22102%22%2C%22104%22%2C%22106%22%2C%220%22%5D%7D&ns12=1&ns100=1&ns102=1&ns104=1&ns106=1&ns0=1
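
As a sketch of the detection rule described in the note, the presence of the "advancedSearch-current" URL parameter can be checked with standard URL parsing (plain Python here purely for illustration; the actual ETL uses HiveQL):

```python
from urllib.parse import urlparse, parse_qs

def uses_advanced_search(url):
    """Return True if the request URL carries the advancedSearch-current parameter."""
    # keep_blank_values=True so empty parameters like advancedSearchOption-original survive
    params = parse_qs(urlparse(url).query, keep_blank_values=True)
    return "advancedSearch-current" in params

example = ("https://www.mediawiki.org/w/index.php?advancedSearchOption-original="
           "&search=&title=Special%3ASearch&profile=advanced&fulltext=1"
           "&advancedSearch-current=%7B%22options%22%3A%7B%7D%7D&ns0=1")
print(uses_advanced_search(example))  # True
```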

Event Timeline

Restricted Application added a project: TCB-Team. · View Herald Transcript · Feb 12 2018, 10:04 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Lea_WMDE triaged this task as Medium priority. · Feb 13 2018, 11:24 AM
Lea_WMDE updated the task description.
Lea_WMDE updated the task description. · Feb 15 2018, 1:59 PM

@Lea_WMDE Please have a look and let me know whether you really want any analysis based on the following data.

These are the last 36 hours (everything starting from March 15, 2018). The SpecialSearchCount column represents the number of all Cirrus searches where source = "web" and Special:Search was used. The AdvancedExtensionCount column tells us the number of searches where the advancedSearch-current URL parameter was found in the query, as suggested by @thiemowmde and @daniel. As you can observe, the Advanced Search Extension is rarely used. However, this does not mean that it will remain so in the future, so I am primarily interested in learning whether this data set (which can then be aggregated daily, weekly, monthly, etc.) encompasses what you need.

N.B. The amount of data is massive and the ETL procedure is thus lengthy and expensive. If you want to perform any tests on a range of several days or more, I can start collecting the data for any provided date range now and you can expect the results early next week, I guess - depending on how much data we need to observe.

year	month	day	hour	SpecialSearchCount	AdvancedExtensionCount
2018	3	15	0	18490	                0
2018	3	15	1	19457	                0
2018	3	15	2	18716	                0
2018	3	15	3	17734	                0
2018	3	15	4	17506	                0
2018	3	15	5	18844	                0
2018	3	15	6	21238	                0
2018	3	15	7	22650	                0
2018	3	15	8	25692	                0
2018	3	15	9	26221	                0
2018	3	15	10	27402	                0
2018	3	15	11	27591	                0
2018	3	15	12	30793	                3
2018	3	15	13	33654	                1
2018	3	15	14	37378	                3
2018	3	15	15	37468	                2
2018	3	15	16	34070	                0
2018	3	15	17	31199	                0
2018	3	15	18	31872	                0
2018	3	15	19	31127	                0
2018	3	15	20	28722	                0
2018	3	15	21	24291	                3
2018	3	15	22	22092	                2
2018	3	15	23	18359	                0
2018	3	16	0	16608	                0
2018	3	16	1	17607	                0
2018	3	16	2	18274	                0
2018	3	16	3	18459	                0
2018	3	16	4	16931	                0
2018	3	16	5	17026	                0
2018	3	16	6	19195	                0
2018	3	16	7	21631	                1
2018	3	16	8	23479	                2
2018	3	16	9	25563	                0
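
For reference, hourly rows like the ones above can be rolled up to daily granularity with a straightforward aggregation. A minimal sketch on a small excerpt of the table (plain Python for illustration; the real procedure runs in HiveQL):

```python
from collections import defaultdict

# Hourly rows as (year, month, day, hour, SpecialSearchCount, AdvancedExtensionCount);
# a small excerpt of the table above.
rows = [
    (2018, 3, 15, 12, 30793, 3),
    (2018, 3, 15, 13, 33654, 1),
    (2018, 3, 16, 7, 21631, 1),
    (2018, 3, 16, 8, 23479, 2),
]

# Sum both counts per calendar day, discarding the hour column.
daily = defaultdict(lambda: [0, 0])
for year, month, day, hour, special, advanced in rows:
    daily[(year, month, day)][0] += special
    daily[(year, month, day)][1] += advanced

for date, (special, advanced) in sorted(daily.items()):
    print(date, special, advanced)
```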

@thiemowmde @daniel Thanks for your help with the URL params.

@GoranSMilovanovic thanks for the update! I would be interested in seeing the numbers for the AdvancedSearch people separated. So we can really have a graph that says: A search was submitted from special:search x times by anyone, and out of that a search was submitted from special:search y times by people who definitely have AdvancedSearch and z times by people who don't have AdvancedSearch enabled. So that we can compare slopes :)

I won't ever be interested in any hourly info, so if it makes it easier to store, you can handle them daily from the very beginning. I also don't think I will need any special tests, apart from the graph that slowly but surely shows me the slopes of change (if there is any). Does that help?

One question to be sure: In this data, we are not collecting searches that land on special:search, but only searches that start (and land) on the special:search form, right?

@Lea_WMDE

One question to be sure: In this data, we are not collecting searches that land on special:search, but only searches that start (and land) on the special:search form, right?

Yes, these are the searches that have originated from Special:Search.

So we can really have a graph that says: A search was submitted from special:search x times by anyone, and out of that a search was submitted from special:search y times by people who definitely have AdvancedSearch and z times by people who don't have AdvancedSearch enabled. So that we can compare slopes :)

Ok, then we are looking at the appropriate data set (which was my question: is this data set appropriate, or would you like something else, if the latter can be obtained). Namely:

"A search was submitted from special:search x times by anyone... " = SpecialSearchCount in the data set
"... and out of that a search was submitted from special:search y times by people who definitely have AdvancedSearch" = AdvancedExtensionCount in the data set, and finally
"... and z times by people who don't have AdvancedSearch enabled" = SpecialSearchCount - AdvancedExtensionCount from the data set.
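
The mapping above can be written out directly; the only derived quantity is z, obtained by subtraction (a trivial sketch, using a row from the table earlier in this thread):

```python
def split_counts(special_search_count, advanced_extension_count):
    """Split total Special:Search submissions (x) into searches with
    AdvancedSearch (y) and searches without it (z = x - y)."""
    x = special_search_count
    y = advanced_extension_count
    z = x - y
    return x, y, z

# Using the 2018-03-15 hour-12 row from the table above:
x, y, z = split_counts(30793, 3)
print(x, y, z)  # 30793 3 30790
```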

I won't ever be interested in any hourly info, so if it makes it easier to store, you can handle them daily from the very beginning. I also don't think I will need any special tests, apart from the graph that slowly but surely shows me the slopes of change (if there is any). Does that help?

It does not help :) - but there are no obstacles in principle to implementing the ETL that way. Please state the date range that you are interested in (start day, end day).

@GoranSMilovanovic what does ETL stand for?

Please state the date range that you are interested in (start day, end day).

The thing is, I am interested in the development of the graph. So my dream start day would be as far into the past as possible, and my end day would be a year into the full deployment of AdvancedSearch. However, if that is not possible, because then we would have to host the data somewhere ourselves and it would just be too much to carry along, we can also reduce the granularity from daily to one day every fortnight or so.

GoranSMilovanovic added a comment. · Edited · Mar 16 2018, 11:24 AM

@Lea_WMDE Sorry, my bad, a technical thing: ETL (abbr.) = Extract-Transform-Load, a set of procedures used to re-structure the data from a database table (typically) into another table that will be used for analytics.
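
As a hedged illustration of that Extract-Transform-Load pattern (not the actual workflow, which combines HiveQL and R), here is a minimal Python sketch with hypothetical table and column names, using an in-memory SQLite database:

```python
import sqlite3

# Hypothetical source table of raw hourly search counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hourly (day TEXT, hour INTEGER, searches INTEGER)")
conn.executemany("INSERT INTO hourly VALUES (?, ?, ?)",
                 [("2018-03-15", 0, 18490), ("2018-03-15", 1, 19457)])

# Extract: read the raw hourly rows.
rows = conn.execute("SELECT day, searches FROM hourly").fetchall()

# Transform: aggregate hourly counts to the daily granularity used for analytics.
daily = {}
for day, searches in rows:
    daily[day] = daily.get(day, 0) + searches

# Load: write the restructured data into a separate analytics table.
conn.execute("CREATE TABLE daily (day TEXT, searches INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?)", daily.items())
print(conn.execute("SELECT * FROM daily").fetchall())  # [('2018-03-15', 37947)]
```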

So my dream start day would be as much into the past as possible, and my end day would be a year into the full deployment of AdvancedSearch. However, if that is not possible...

Well... It is possible, in a relative way ("dream start day would be as far into the past as possible" ~ data are purged after 60 days from these tables, so your dream start date is today - 60 days), but it implies that we need to start storing the data on a daily basis as of today. Also, it would be rather expensive (time, resources) to get the data for the last 60 days: would you agree to start tracking the numbers as of the beginning of March 2018, say, and then do daily updates? At some point we would accumulate enough data points for your graph. I will look into the possibilities to speed up the procedure, of course.

we can also reduce the granularity from daily, to one day every fortnight or so

We can do daily granularity, no worries.

@GoranSMilovanovic sounds good, then let's start today! And as I said, if it does save us considerable resources (be it human or storage wise) we can go down with the granularity.

@Lea_WMDE Ok, the procedure will be run on a daily basis as of today, and I will develop a feature on the Advanced Search Extension dashboard for you to track the progress and visualize the data. Resources: storage is no problem, human could be (namely, me: I am doing this by combining HiveQL and R processing, and I constantly wonder whether a person with better knowledge of HiveQL than mine could improve the current workflow; that being said, I'll optimize this as much as I can).

Reporting back as soon as I have something interesting for you.

thanks @GoranSMilovanovic! Looking forward :)

@Lea_WMDE Initial data acquisition is completed, you can expect the first visualization examples later during the day.

GoranSMilovanovic added a comment. · Edited · Mar 26 2018, 9:53 PM

@Lea_WMDE @daniel @thiemowmde

The charts refer to the previous 90 days.

  1. Count how many searches are submitted from the SpecialPage:Search (from any web request)

2a. Count how many searches are submitted from the SpecialPage:Search (from web requests with AdvancedSearch enabled)

2b. Percent of SpecialPage:Search requests where the Advanced Search Extension was used

  3. Count how many searches are submitted from the SpecialPage:Search (from web requests without AdvancedSearch but within the wikimedia cluster).

The complete data set is attached:

Please let me know if these visualizations work. If yes, I will incorporate them into the Advanced Search Extension dashboard. The accumulation of daily data is already running.

@Lea_WMDE Please, you still did not clarify what you mean by ...within the wikimedia cluster.

@GoranSMilovanovic Thanks for the update! "Within the wikimedia cluster" means that we are only interested in all WMF hosted mediawiki installations (i.e. Wikipedias, Commons, Wiktionary....).
The graphs look good, but ideally I would want to see the lines in one graph, and with a more readable scale on the side, i.e. in 1000s or millions, or whatever fits best.

GoranSMilovanovic added a comment. · Edited · Mar 27 2018, 11:25 AM

@Lea_WMDE

  • A. The scale will be fixed.
  • B. Not all graphs should fit on the same panel, because the Advanced Search Extension counts are on a different scale (hundreds of observations vs. many thousands of observations for Special:Search). You won't be able to read anything from such charts. Q: did you mean: have one chart and a drop-down menu on the Dashboard to change the metric that is being presented?

Thanks for clarifying what wikimedia cluster means.

B. Not all graphs should fit on the same panel, because the Advanced Search Extension counts are on a different scale (hundreds of observations vs. many thousands of observations for Special:Search). You won't be able to read anything from such charts.

True. No, then let's keep them separate for now.

Q: did you mean: have one chart and a drop-down menu on the Dashboard to change the metric that is being presented?

No thanks, then let's stick to separate charts :)

GoranSMilovanovic closed this task as Resolved.Apr 12 2018, 8:24 AM