
Collect ideas for feature engineering of LTRank
Closed, ResolvedPublic

Description

We have various ideas for features we could use; this ticket is to collect those ideas and to create subtasks for the ones worth investigating.

Event Timeline

EBernhardson created this task. Edited Apr 5 2017, 3:49 PM

Some ideas we've kicked around:

  • One-hot encoding of popular templates/categories
  • # of tokens in query
  • # of tokens in title
  • size of the document
  • derivative features, such as wp10's # wikilinks / # content chars
  • wp10 scores from ores
  • Features used by wp10 in ores:
>>> from wikiclass.feature_lists import enwiki
>>> import pprint
>>> pprint.pprint(enwiki.wp10)
[<feature.wikitext.revision.chars>,
 <feature.wikitext.revision.content_chars>,
 <feature.wikitext.revision.ref_tags>,
 <feature.(wikitext.revision.ref_tags / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.wikilinks>,
 <feature.(wikitext.revision.wikilinks / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.external_links>,
 <feature.(wikitext.revision.external_links / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.headings_by_level(2)>,
 <feature.(wikitext.revision.headings_by_level(2) / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.headings_by_level(3)>,
 <feature.(wikitext.revision.headings_by_level(3) / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.image_links>,
 <feature.(enwiki.revision.image_links / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.category_links>,
 <feature.(enwiki.revision.category_links / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.cite_templates>,
 <feature.(enwiki.revision.cite_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.(enwiki.revision.cite_templates / max(wikitext.revision.ref_tags, 1))>,
 <feature.max((wikitext.revision.ref_tags - enwiki.revision.cite_templates), 0)>,
 <feature.(max((wikitext.revision.ref_tags - enwiki.revision.cite_templates), 0) / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.non_cite_templates>,
 <feature.(enwiki.revision.non_cite_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.infobox_templates>,
 <feature.(enwiki.revision.cn_templates + 1)>,
 <feature.(enwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.(enwiki.revision.cn_templates + 1)>,
 <feature.(enwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.main_article_templates>,
 <feature.(enwiki.main_article_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.(english.stemmed.revision.stems_length / max(wikitext.revision.content_chars, 1))>]
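Many of the wp10 features above come in pairs: a raw count plus a ratio normalized by content length, with a `max(..., 1)` guard against division by zero. A minimal sketch of computing such derivative features (the field names and input dict are illustrative, not the actual ORES API):

```python
# Sketch of wp10-style derivative features: pair each raw count with a
# ratio normalized by content length, guarding against division by zero
# with max(..., 1) exactly as in the ORES list above.
def derivative_features(doc):
    denom = max(doc.get("content_chars", 0), 1)
    feats = {}
    for name in ("ref_tags", "wikilinks", "external_links"):
        count = doc.get(name, 0)
        feats[name] = count
        feats[name + "_per_char"] = count / denom
    return feats

doc = {"content_chars": 2000, "ref_tags": 10, "wikilinks": 40, "external_links": 5}
print(derivative_features(doc))
```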
EBernhardson renamed this task from Collect ideas for feature engineering of LTR to Collect ideas for feature engineering of LTRank. Apr 5 2017, 8:13 PM

Some high-level ideas to think about from the Yahoo LTR challenge (http://www.jmlr.org/proceedings/papers/v14/chapelle11a/chapelle11a.pdf?WT.mc_id=Blog_MachLearn_General_DI):

Web graph

This type of features tries to determine the quality or the popularity of a
document based on its connectivity in the web graph. Simple features are functions
of the number of inlinks and outlinks while more complex ones involve some kind of
propagation on the graph. A famous example is PageRank (Page et al., 1999). Other
features include distance or propagation of a score from known good or bad documents
(Gyöngyi et al., 2004; Joshi et al., 2007).

Document statistics

These features compute some basic statistics of the document such
as the number of words in various fields. This category also includes characteristics
of the url, for instance the number of slashes.

Document classifier

Various classifiers are applied to the document, such as spam, adult,
language, main topic, quality, type of page (e.g. navigational destination vs informational).
In case of a binary classifier, the feature value is the real-valued output of the
classifier. In case of multiple classes, there is one feature per class.

Query

Features which help in characterizing the query type: number of terms, frequency
of the query and of its terms, click-through rate of the query. There are also result set
features, that are computed as an average of other features over the top documents
retrieved by a previous ranking function. For example, the average adult score of the
top documents retrieved for a query is a good indicator of whether the query is an
adult one or not.

Text match

The most important type of features is of course the textual similarity between
the query and the document; this is the largest category of features. The basic
features are computed from different sections of the document (title, body, abstract,
keywords) as well as from the anchor text and the url. These features are then aggregated
to form new composite features. The match score can be as simple as a count or
can be more complex such as BM25 (Robertson and Zaragoza, 2009). Counts can be
the number of occurrences in the document, the number of missing query terms or the
number of extra terms (i.e. not in the query). Some basic features are defined over
the query terms, while some others are arithmetic functions (min, max, or average)
of them. Finally, there are also proximity features which try to quantify how far in
the document are the query terms (the closer the better) (Metzler and Croft, 2005).
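The text-match category above mentions scores ranging from raw counts up to BM25. A minimal single-document BM25 sketch, assuming illustrative collection statistics (`df`, `N`, average field length) that would normally come from the index:

```python
import math

# Minimal BM25 sketch for one (query, field) pair, with the common
# defaults k1=1.2, b=0.75. df/N/avg_len are illustrative inputs here.
def bm25(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # missing query terms contribute nothing
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += idf * norm
    return score

score = bm25(["learning", "rank"], {"learning": 3, "rank": 1}, doc_len=120,
             avg_len=100, df={"learning": 50, "rank": 10}, N=1000)
print(round(score, 3))
```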

Topical matching

This type of feature tries to go beyond similarity at the word level and
compute similarity at the topic level. This can for instance be done by classifying
both the query and the document in a large topical taxonomy. In the context of
contextual advertising, details can be found in (Broder et al., 2007).

Click

These features try to incorporate the user feedback, most importantly the clicked
results (Agichtein et al., 2006). They are derived either from the search or the toolbar
logs. For a given query and document, different click probabilities can be computed:
probability of click, first click, last click, long dwell time click or only click. Also of
interest is the probability of skip (not clicked, but a document below is). If the given
query is rare, these clicks features can be computed using similar, but more frequent
queries. The average dwell time can be used as an indication of the landing page
quality. The geographic similarity of the users clicking on a page is a useful feature to
determine its localness. Finally, for a given host, the entropy of the click distribution
over queries is an indication of its specificity.
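The click and skip probabilities described above can be estimated from impression logs. A sketch under an assumed log schema (each session is a query plus its ranked results with click flags):

```python
from collections import defaultdict

# Sketch: per (query, url) click and skip probabilities from SERP logs.
# A "skip" is an impression that was not clicked while some lower-ranked
# result was clicked. The log schema here is an assumption.
def click_features(sessions):
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0, "skips": 0})
    for query, results in sessions:  # results: [(url, clicked), ...] in rank order
        lowest_click = max((i for i, (_, c) in enumerate(results) if c), default=-1)
        for i, (url, clicked) in enumerate(results):
            s = stats[(query, url)]
            s["impressions"] += 1
            s["clicks"] += clicked
            s["skips"] += (not clicked) and i < lowest_click
    return {k: {"p_click": v["clicks"] / v["impressions"],
                "p_skip": v["skips"] / v["impressions"]}
            for k, v in stats.items()}

sessions = [("q", [("a", False), ("b", True)]),
            ("q", [("a", True), ("b", False)])]
print(click_features(sessions))
```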

External references

For certain documents, some meta-information, such as Delicious
tags, is available and can be used to refine the text matching features. Also documents
from specific domains have additional information which can be used to evaluate
the quality of the page: for instance, the rating of an answer in Yahoo! Answers
documents.

Time

For time sensitive queries, the freshness of a page is important. There are several
features which measure the age of a document as well as the one of its inlinks and
outlinks. More information on such features can be found in (Dong et al., 2010,
Section 5.1).

Microsoft also has a list of 136 features, which are a subset of those used in Bing and are "widely used in the research community": https://www.microsoft.com/en-us/research/project/mslr/

These 136 features are actually a much smaller number of distinct features, applied multiple times. For example, the IDF of the query terms in the body, anchor, title, URL, and whole document makes up 5 separate features, but all are IDF. We would likely do the same, but with our fields: title, heading, opening text, etc. The deduplicated list follows; everything after the dash is my own guess about what the feature means, or data from the site describing it.

  1. covered query term number - perhaps the number of terms from the query that are matched by the document?
  2. covered query term ratio - the percentage of terms from the query that are matched by the document?
  3. stream length - size in bytes/codepoints/something?
  4. IDF
  5. Sum of TF
  6. Min of TF
  7. Max of TF
  8. Mean of TF
  9. Variance of TF
  10. sum of stream length normalized TF
  11. min of stream length normalized TF
  12. max of stream length normalized TF
  13. mean of stream length normalized TF
  14. variance of stream length normalized TF
  15. sum of tf*idf
  16. min of tf*idf
  17. max of tf*idf
  18. mean of tf*idf
  19. variance of tf*idf
  20. boolean model - ??? no clue
  21. vector space model - This may be word2vec style vector space embeddings? not sure
  22. BM25
  23. LMIR.ABS - Site says "Language model approach for information retrieval (IR) with absolute discounting smoothing"
  24. LMIR.DIR - Site says "Language model approach for IR with Bayesian smoothing using Dirichlet priors"
  25. LMIR.JM - Site says "Language model approach for IR with Jelinek-Mercer smoothing"
  26. Number of slash in URL - probably similar to a subpage depth in our case
  27. Length of URL - Probably not relevant to us, the "stream length" of title is the same
  28. Inlink number
  29. Outlink number
  30. PageRank
  31. SiteRank - Site says "Site level PageRank".
  32. QualityScore - Site says "The quality score of a web page. The score is outputted by a web page quality classifier."
  33. QualityScore2 - Site says "The quality score of a web page. The score is outputted by a web page quality classifier, which measures the badness of a web page."
  34. Query-url click count - Site says "The click count of a query-url pair at a search engine in a period"
  35. url click count - Site says " The click count of a url aggregated from user browsing data in a period"
  36. url dwell time - Site says "The average dwell time of a url aggregated from user browsing data in a period"
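Items 23-25 are query-likelihood language models with different smoothing. A sketch of the Jelinek-Mercer and Dirichlet variants, scoring log P(query|document); the collection probabilities here are illustrative inputs, not actual index statistics:

```python
import math

# Sketch of LMIR query-likelihood scoring (MSLR items 23-25 family).
def lmir_jm(query, doc_tf, doc_len, coll_prob, lam=0.1):
    # Jelinek-Mercer: linear mix of document and collection language models.
    return sum(math.log((1 - lam) * doc_tf.get(t, 0) / doc_len
                        + lam * coll_prob[t])
               for t in query)

def lmir_dir(query, doc_tf, doc_len, coll_prob, mu=2000):
    # Dirichlet prior: document counts smoothed by mu pseudo-counts
    # drawn from the collection model.
    return sum(math.log((doc_tf.get(t, 0) + mu * coll_prob[t])
                        / (doc_len + mu))
               for t in query)

q = ["learning", "rank"]
doc_tf = {"learning": 3, "rank": 1}
cp = {"learning": 0.001, "rank": 0.0005}
print(lmir_jm(q, doc_tf, 120, cp), lmir_dir(q, doc_tf, 120, cp))
```

Both are log-probabilities, so scores are negative; documents matching more query terms score closer to zero.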
EBernhardson added a subscriber: dcausse. Edited Apr 5 2017, 11:14 PM

@dcausse one thing to think about is how we could efficiently extract some of the data above from lucene for use as feature vectors. Likely we wouldn't want to have a separate query for each of the sum/min/max/mean/variance of the TF (and again for a length-normalized version). It seems like all 10 features could come out of a single query some way or another, as they all have basically the same input data: the TF for each term in the query. These queries would still have to be duplicated against the set of fields we want to use, such as title, headings, opening_text, body, etc.
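To illustrate the point above: given the per-term TF vector for one (query, field) pair, all ten aggregates fall out of a single pass. A minimal sketch:

```python
import statistics

# Sketch: all ten TF aggregate features (raw and stream-length-normalized)
# from one pass over the per-term TF vector of a single (query, field) pair.
def tf_stats(tfs, stream_length):
    norm = [tf / stream_length for tf in tfs] if stream_length else [0.0] * len(tfs)
    def agg(xs, prefix):
        return {
            prefix + "sum": sum(xs),
            prefix + "min": min(xs),
            prefix + "max": max(xs),
            prefix + "mean": statistics.mean(xs),
            prefix + "variance": statistics.pvariance(xs),
        }
    return {**agg(tfs, "tf_"), **agg(norm, "norm_tf_")}

print(tf_stats([3, 1, 0], stream_length=120))
```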

TJones added a subscriber: TJones. May 9 2017, 7:18 PM

I was thinking that some of the generic text features of the query might be useful. Mikhail has an interesting list and some data on prevalence in his Zero To Hero report. Presence of a question mark, presence of quotes, and presence of advanced syntax seem interesting—though as we discussed, these may be too low frequency to be useful.

Another one that just came to mind: the presence of Unicode character classes (e.g., writing systems, very roughly) might be interesting. I'd expect that for any given wiki, it'd come down to "the character set of this wiki" and "not the character set of this wiki", but the generic features would be easy to implement.
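A rough sketch of this idea: the standard library doesn't expose the Unicode Script property, but counting General_Category codes per query character is a cheap approximation of "what kind of characters is this query made of":

```python
import unicodedata

# Sketch: generic query features from Unicode character classes, counting
# General_Category codes (e.g. 'Lu', 'Ll', 'Nd', 'Po') per character.
# Note the stdlib has no Script property; a proper script bucket would
# need an external table or library.
def char_class_features(query):
    feats = {}
    for ch in query:
        if ch.isspace():
            continue
        cat = unicodedata.category(ch)
        feats[cat] = feats.get(cat, 0) + 1
    return feats

print(char_class_features("Тест test?"))
```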

I had the idea to take a list of popular categories and templates, train a model with them, and then use the results of that model to try to decide what to use.

I took all the categories that are on more than 1k pages, which amounts to 3166 categories, then trained a model with 1.2M samples using those 3166 categories as binary features along with our default set of features. This ends up being pretty expensive to train: memory usage per fold went from 186MB for the 9-feature model to 26GB for the 3175-feature model. Because of this, instead of running the full hyperparameter/cross-validation pipeline I trained a single model using the same parameters as the best model previously trained with the same 1.2M samples and 9 features. This ended up using 66 categories in the final model. 45 of the features were used at least twice, and 34 of them at least three times. The data below only considers features with weight >= 3.
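The binary-feature setup described above can be sketched as follows (names and base features are illustrative, not the actual training pipeline):

```python
# Sketch: turn membership in popular categories into binary (one-hot)
# features alongside the base ranking features. Names are illustrative.
def build_feature_vector(page_categories, popular_categories, base_features):
    vector = dict(base_features)
    page_set = set(page_categories)
    for cat in popular_categories:
        vector["category_" + cat] = 1 if cat in page_set else 0
    return vector

popular = ["Living_people", "Disambiguation_pages"]
vec = build_feature_vector(["Living_people", "American_films"], popular,
                           {"incoming_links": 12, "popularity_score": 0.003})
print(vec)
```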

The tree statistics in graph form. Note that I had to put all the values on a log scale, otherwise the variations are incredibly hard to see due to the difference between largest and smallest values.

feature | cover_average | cover_max | cover_min | cover_std | gain_average | gain_max | gain_min | gain_std | weight
all_near_match7856.79145577573.500000721.48400010445.911281109.3200121955.9200001.471630245.16414399.000000
auxiliary_text1903.90840110814.400000203.0110002522.6857953.64352233.6618000.5236514.125408197.000000
category2780.70441616740.400000216.2210003071.81829919.662336496.1210000.53619747.158397305.000000
category_Album_infoboxes_lacking_a_cover10150.00000010150.00000010150.0000000.0000001.4498501.4498501.4498500.0000001.000000
category_All_Wikipedia_articles_written_in_American_English7653.3250009237.4200006069.2300001584.0950001.4086161.9418300.8754010.5332152.000000
category_All_article_disambiguation_pages2213.8249476384.110000506.9820001519.81551621.26165388.3936002.06188019.53149619.000000
category_All_articles_containing_potentially_dated_statements4532.7650006131.2700003432.390000849.8929553.5795389.4666401.6845102.6590056.000000
category_All_articles_lacking_in-text_citations7789.7900007789.7900007789.7900000.0000001.7459201.7459201.7459200.0000001.000000
category_All_articles_needing_additional_references1580.8770003946.750000750.0900001008.4475273.99052510.1687001.0808502.42576011.000000
category_All_articles_that_may_contain_original_research7923.2200008012.8600007833.58000089.6400002.1261502.4311701.8211300.3050202.000000
category_All_articles_to_be_expanded7907.5700007907.5700007907.5700000.0000001.6417301.6417301.6417300.0000001.000000
category_All_articles_with_dead_external_links2816.5396366786.000000635.2160001615.6608613.7118567.0496101.4817101.90544211.000000
category_All_articles_with_unsourced_statements1129.1192503469.800000228.2760001145.3904862.4567597.3970000.5964791.9702888.000000
category_All_articles_with_vague_or_ambiguous_time9257.83000010055.2000008157.800000797.0023532.2771404.3305501.2549901.2134474.000000
category_All_pages_needing_cleanup9225.2100009225.2100009225.2100000.0000001.5160801.5160801.5160800.0000001.000000
category_All_stub_articles2226.8750004061.850000391.9000001834.9750002.3590343.7846400.9334281.4256062.000000
category_American_films3722.8250004749.5600002696.0900001026.7350005.1261806.4932203.7591401.3670402.000000
category_Articles_containing_French-language_text9322.8550009850.7000008795.010000527.8450001.3157051.3431601.2882500.0274552.000000
category_Articles_containing_Japanese-language_text7388.88142910145.8000003077.0500002362.40969713.86224041.3999002.26775014.6066137.000000
category_Articles_containing_Latin-language_text9935.5100009935.5100009935.5100000.0000001.3222201.3222201.3222200.0000001.000000
category_Articles_using_Infobox_video_game_using_locally_defined_parameters9594.3300009594.3300009594.3300000.0000001.5572901.5572901.5572900.0000001.000000
category_Articles_using_small_message_boxes4327.2600004327.2600004327.2600000.0000002.3561002.3561002.3561000.0000001.000000
category_Articles_with_Japanese-language_external_links9834.36333310312.2000008932.790000637.8910872.1986032.4983201.8134200.2860663.000000
category_Articles_with_hAudio_microformats1099.9799604453.010000218.175000880.7623203.87096313.0937001.2516102.72277625.000000
category_Articles_with_hCards2012.4790006453.3200001105.0000001517.2617286.39966021.4621001.3106606.24809910.000000
category_Articles_with_unsourced_statements_from_April_20178910.5700009082.0800008729.170000144.2445293.4512006.9216201.6051802.4556253.000000
category_Articles_with_unsourced_statements_from_March_20178858.3000009358.2000007934.000000552.6641122.3530104.8870501.0108201.4947674.000000
category_Articles_with_unsourced_statements_from_May_20179475.9700009475.9700009475.9700000.0000001.2552001.2552001.2552000.0000001.000000
category_CS1_German-language_sources_(de)3768.1440005856.3500002620.8700001253.8652992.2226462.6555901.5452900.4152915.000000
category_CS1_Italian-language_sources_(it)4861.7600004861.7600004861.7600000.0000003.1065503.1065503.1065500.0000001.000000
category_CS1_Japanese-language_sources_(ja)4447.7375005920.8900002380.5500001341.3127555.77722015.9728001.5463105.9093754.000000
category_CS1_Spanish-language_sources_(es)6485.3200007430.7600005539.880000945.4400002.9776054.0133901.9418201.0357852.000000
category_CS1_errors:_dates5369.3750005887.6800004851.070000518.3050001.7452101.7641501.7262700.0189402.000000
category_CS1_errors:_external_links8700.2000008700.2000008700.2000000.0000002.4998402.4998402.4998400.0000001.000000
category_CS1_maint:_BOT:_original-url_status_unknown5779.5400005779.5400005779.5400000.0000001.9581101.9581101.9581100.0000001.000000
category_CS1_maint:_Extra_text:_authors_list8944.0300009373.9000008514.160000429.8700001.2605701.4279101.0932300.1673402.000000
category_CS1_maint:_Multiple_names:_authors_list3390.9000003390.9000003390.9000000.0000003.0136503.0136503.0136500.0000001.000000
category_CS1_maint:_Uses_authors_parameter8893.7033339827.4900007765.980000852.7072351.0652071.0808301.0350000.0213643.000000
category_Coordinates_on_Wikidata2806.4481927392.220000524.3280001727.5143449.48114941.3582001.4312909.72099126.000000
category_Disambiguation_pages3958.7964298605.3700001199.4300002310.78150418.41898139.9347001.60244012.61232614.000000
category_English-language_films3512.3026326260.3200001211.7900001444.10917510.83608430.1185002.4427707.71981819.000000
category_English-language_television_programming6481.2975008940.2700004912.9100001455.1669235.84957811.6715001.5567603.0314218.000000
category_Featured_articles8312.3400008312.3400008312.3400000.0000001.0491601.0491601.0491600.0000001.000000
category_Good_articles4178.9600005469.3000002321.1300001346.4217423.5850404.7105502.0132001.1455223.000000
category_Grammy_Award_winners38801.70000038801.70000038801.7000000.00000011.35790011.35790011.3579000.0000001.000000
category_Living_people2421.1457786724.290000339.0750001939.6615124.82023819.9997001.2484605.5263079.000000
category_Music_infoboxes_with_deprecated_parameters6113.65000010046.6000003582.2400002819.0096762.6284003.8909601.6349500.9404653.000000
category_Official_website_different_in_Wikidata_and_Wikipedia6280.6975008192.1500004563.3400001495.1269842.1565602.6649201.4113200.4943474.000000
category_Official_website_not_in_Wikidata8961.2150009347.5100008574.920000386.2950001.9372552.3245501.5499600.3872952.000000
category_Pages_using_ISBN_magic_links5745.2450008321.6300002571.8700002125.4431982.5421833.1942401.5011500.6518614.000000
category_Pages_using_deprecated_image_syntax1418.7215003306.740000525.1600001102.8832353.5814434.9443902.1314701.0482124.000000
category_Pages_using_infobox_television_with_alias_parameters8294.4000008564.9500008023.850000270.5500001.3974801.5958301.1991300.1983502.000000
category_Pages_with_citations_lacking_titles7194.2900009303.3700006164.0800001242.4917715.79403015.1507001.6888105.4357414.000000
category_Singlechart_called_without_song1866.1600001866.1600001866.1600000.0000002.5949802.5949802.5949800.0000001.000000
category_Singlechart_usages_for_Flanders2652.4900002652.4900002652.4900000.00000013.86170013.86170013.8617000.0000001.000000
category_Singlechart_usages_for_New_Zealand2603.2900002900.0800002306.500000296.79000018.52190020.89330016.1505002.3714002.000000
category_Singlechart_usages_for_UK1754.6200001754.6200001754.6200000.0000001.3907701.3907701.3907700.0000001.000000
category_Webarchive_template_wayback_links1724.9042866912.160000460.3950001545.3286523.90778510.6181000.9750752.74931714.000000
category_Wikipedia_articles_with_BNF_identifiers1996.3695003740.620000376.1470001149.7151004.1013807.3235801.1761602.2546606.000000
category_Wikipedia_articles_with_GND_identifiers2896.4271007919.690000219.3100002149.6282983.5245289.4390001.3948202.85555810.000000
category_Wikipedia_articles_with_ISNI_identifiers3936.7885715620.1800001718.9700001397.4867525.62184712.9381001.6487503.9579137.000000
category_Wikipedia_articles_with_LCCN_identifiers4312.9738577170.670000213.6470002404.9232806.67348718.8101000.6812205.5973467.000000
category_Wikipedia_articles_with_MusicBrainz_identifiers3904.0217659730.2300001613.0600002245.9473537.33301732.3606000.8484718.17077617.000000
category_Wikipedia_articles_with_NLA_identifiers8472.7700008472.7700008472.7700000.0000001.4915401.4915401.4915400.0000001.000000
category_Wikipedia_articles_with_VIAF_identifiers2875.4181826894.480000333.6000001882.2073913.7217229.8800901.3729302.30889211.000000
category_Wikipedia_indefinitely_move-protected_pages4577.6275005664.4900004088.570000646.0474063.1169555.8028901.8474501.5737124.000000
category_Wikipedia_indefinitely_semi-protected_pages1142.5235001304.340000980.707000161.8165001.5138601.8851101.1426100.3712502.000000
category_Wikipedia_semi-protected_pages10763.40000010763.40000010763.4000000.0000003.4117803.4117803.4117800.0000001.000000
heading1895.4330739481.700000202.4800002135.4375544.09275228.3859000.6489684.419571123.000000
incoming_links1496.99145912206.600000203.3920001914.8726636.220038109.8900000.50056713.011488268.000000
popularity_score4892.19148576631.500000201.1260009847.52041382.6667125166.3000000.532374370.839558447.000000
redirect_or_suggest_dismax5064.52424093105.300000201.9330009396.001517114.85706910821.3000000.519488769.990566317.000000
text_or_opening_text_dismax6212.29901532519.200000211.8540005723.11483922.560473272.1600000.57688741.259152333.000000
title6879.432536169737.000000208.91900016408.034361359.39728445255.0000000.5453522827.596701435.000000

And same for templates:

feature | cover_average | cover_max | cover_min | cover_std | gain_average | gain_max | gain_min | gain_std | weight
Module:About3038.3350004641.6600001601.4000001116.4953324.4897008.7963702.3336302.5445664.000000
Module:Based_on5209.5980006651.4200004337.430000858.5814575.0981786.8226402.9861301.3525535.000000
Module:Category_handler625.0015001051.700000213.548000304.0355973.3493806.4396500.7537602.2492214.000000
Module:Check_for_unknown_parameters1022.7696678796.370000209.4360002346.8821652.4533005.8024701.4905701.11922912.000000
Module:Convert/data1338.3283753277.750000270.4190001116.1763608.28919522.9175002.6893406.3053208.000000
Module:Delink626.4480002146.420000217.060000760.5490312.0350434.7302100.6450841.4341425.000000
Module:Icon/data1547.0938896414.710000242.2830001346.2646447.66773051.7117000.80861911.48808518.000000
Module:If_empty1095.8879094904.810000236.7390001333.2627596.68652632.1322000.5430268.72892711.000000
Module:InfoboxImage1137.0677867016.760000267.0890001748.15163413.14006787.9075000.94049026.04251414.000000
Module:Labelled_list_hatnote1098.2512504159.650000232.0130001226.0227463.1476016.0051701.3643501.3772078.000000
Module:List785.8866671027.530000521.863000207.0433744.4396936.2422302.5820701.4947623.000000
Module:Main658.5554291360.000000217.118000465.4584202.2197215.8678000.8692461.5514197.000000
Module:Math1392.9105566232.190000506.5530001739.8685433.8529067.2147002.0374101.7660259.000000
Module:Navbar618.2554291295.480000272.240000338.8781933.6915146.7687001.3686001.8692857.000000
Module:No_globals494.1243001023.060000236.493000279.8229482.9605217.4876000.6825212.05621310.000000
Module:Portal616.2992501616.660000273.303000420.5536923.1524028.4033401.7388802.1390908.000000
Module:Protection_banner/config2339.9466673214.1600001155.290000868.7223254.0316674.9483302.2212001.2802273.000000
Module:Separated_entries1556.4896672471.830000626.919000753.24897419.51414043.1044004.35472016.9054973.000000
Module:Side_box508.7374001102.220000246.764000315.8966182.6265664.5078901.1829901.2998195.000000
Module:String656.8749091600.780000214.010000468.0574952.9946248.6053100.8710092.52117011.000000
Module:Unsubst434.491250624.366000288.299000120.8098052.9455556.0268301.6179201.8048934.000000
Module:Wikidata2502.7400004896.2600001127.1600001698.7603493.9712576.1655901.7385601.8075313.000000
Module:Yesno599.7226001147.140000263.746000327.2740018.56610035.6069001.18450013.5306695.000000
Template:Abbr2001.9600003182.6400001107.160000890.1081094.3687009.5179101.7315203.1602594.000000
Template:Age2158.3443333969.420000239.3930001524.6864654.7971406.1305003.5468601.0564013.000000
Template:Category_handler443.144800683.920000214.572000180.3450062.5152967.0057801.1745102.2504095.000000
Template:Citation4081.5510009172.760000898.0930003637.5049962.3168502.8791901.6053900.5306093.000000
Template:Cite_book1906.5846926798.710000268.5460002377.0542394.41004312.4360001.9610202.65335213.000000
Template:Colend6708.2220009776.4600005375.9600001589.0259394.8055189.2175303.1320902.2486905.000000
Template:Column-count816.731750958.538000692.485000100.8910892.9899826.5464501.6729002.0583724.000000
Template:Commons_category502.1860001164.250000256.630000382.5533834.2533187.4627702.3767701.9822764.000000
Template:Div_col_end1933.1566672546.8700001117.410000600.8061663.6496734.7043103.0290600.7496213.000000
Template:Dmbox3308.2636435950.650000990.9110001734.75087420.94698447.1743002.85705011.18513614.000000
Template:Episode_list9827.83666710570.7000008972.810000657.1395453.0784574.3215702.3333900.8847803.000000
Template:Europe_topic8981.52000010041.8000008097.790000803.3974306.7440908.9369404.8911601.6690803.000000
Template:Film_date2757.8500004808.0800001103.4500001331.5020117.00362512.0946003.5978403.1345354.000000
Template:Fix561.5626671313.360000240.393000402.9427033.0503226.3968300.9790712.2485146.000000
Template:Hlist1981.4743333661.430000203.6430001413.3311642.2969233.3963201.6776900.7794613.000000
Template:ISBN2133.2142865450.100000631.7410001646.8733734.1908037.1263402.2488501.4775107.000000
Template:ISO_639_name1653.8592502280.820000975.957000550.3175693.7095025.3052201.8393501.3750664.000000
Template:Infobox1099.2003004125.550000231.4680001100.2705173.95312710.4891000.7417402.60946210.000000
Template:Infobox_musical_artist/color2630.8240004479.5800001728.8500001017.34867112.59915417.7702003.0364705.1710085.000000
Template:Infobox_musical_artist/hCard_class1274.2733331584.210000989.860000243.3126404.5929707.3791902.4061802.0739973.000000
Template:Iso2country/article6322.4162508871.3000003451.1300001655.3062369.71225724.7199001.8738507.9115308.000000
Template:Italic_title1036.8422001348.830000621.874000241.06360717.53752259.5467003.12008021.2001415.000000
Template:Longitem3798.7060007907.630000936.9720002443.8055148.94028924.6635002.0170007.3577397.000000
Template:MONTHNUMBER1834.0517788704.590000255.9400002640.2203392.6096775.2148500.5308971.6759709.000000
Template:Main_other599.7362861234.850000281.585000337.7035923.40946710.4733001.3759502.9821837.000000
Template:Military_navigation5558.1200008326.7500001930.7700002386.2309152.9671584.0932501.2937301.0353485.000000
Template:N/a6403.83250010078.0000003349.9000002462.8344973.6102935.1433702.0129001.2070224.000000
Template:Navbox845.4385001725.500000262.409000565.1309064.78163710.2792001.5127602.7463616.000000
Template:Nihongo6581.85333310083.6000003460.4300002567.75995316.84952247.0134001.72312017.1885246.000000
Template:Nom2434.6433333420.1100001260.280000891.8063583.0238033.6061302.7078000.4122663.000000
Template:Nowrap1212.2206673171.920000222.617000841.9158626.68143625.5258001.7790506.8161899.000000
Template:PAGENAMEBASE1300.6650715353.980000250.0530001366.7441855.83053812.7488001.7108003.56812014.000000
Template:Page_needed8296.6366679744.2300007067.7900001103.5459052.3484772.8146701.7071600.4687903.000000
Template:Plainlist2607.6238335639.310000848.4730001622.07217515.58471043.4907003.09541013.2670056.000000
Template:Portal728.2722501608.660000272.174000537.2476276.15530310.8955002.6440203.0065254.000000
Template:Reflist2849.6570007129.600000555.6500002489.4511804.81960810.8294001.5524503.0619418.000000
Template:See_also3897.1240008206.590000573.1460003099.9661322.8354703.7511102.1957900.6144014.000000
Template:Sister_project405.968800702.482000233.761000173.3640493.7108467.0232601.8155401.8863005.000000
Template:Small1882.1019235461.210000289.7040001456.9594908.02343731.8329000.9569568.87771213.000000
Template:Start_date1908.1395002777.640000520.148000872.5878954.8403129.5496402.4327602.7718494.000000
Template:Str_left489.051250829.110000250.496000226.54119010.19491333.2113001.77983013.3183704.000000
Template:Title_disambig_text3680.1250005913.3700001244.4000002234.31764515.85173724.7630005.8215808.3973814.000000
Template:Trim1856.4328898440.050000221.1170002433.9060562.9215788.6786701.1969902.2464579.000000
Template:URL2419.2860004595.4200001211.5000001128.7952884.4543797.0405702.4312501.66749010.000000
Template:Use_mdy_dates662.5509001333.100000295.684000309.4911653.4671214.8743301.1657601.12929510.000000
Template:Wikiquote2683.4666675444.1700001158.6900001835.1664694.9313088.2110302.7522701.7071596.000000
Template:Wiktionary3908.0833338270.9000001448.8900002858.11032012.67631032.9740002.53940011.3209156.000000
Template:Yesno1533.8841434596.340000278.9810001378.2031642.9789435.0392401.0389001.4977697.000000
auxiliary_text2294.41782612166.000000207.4040003066.4309244.62383229.5783000.5098984.979459109.000000
category3196.50528119827.300000207.0210003660.34095823.890592449.4020000.52929050.367074217.000000
heading2055.2642189955.460000213.8400002604.1380035.16572530.9438000.5820105.16566978.000000
incoming_links1605.54501410078.300000207.9840001925.7956197.26550991.5752000.52063112.177630145.000000
redirect_or_suggest_dismax6295.66367592000.100000202.93300010959.660087154.2349679930.9100000.517249850.034527237.000000
Module:Distinguish8445.2900009813.5400006051.0500001698.7396031.7890972.1384101.4525200.2801583.000000
Module:Hatnote  666.074333  1369.950000  226.081000  434.786090  3.183713  6.624170  0.749369  2.233907  6.000000
Module:Message_box  2196.974000  5570.530000  735.492000  1852.696788  4.394183  12.442400  1.671320  3.713934  6.000000
Module:Official_website  2428.511667  6537.890000  1052.180000  1883.378661  7.196810  14.776100  1.487300  4.566128  6.000000
Module:Portal/images/a  6671.803333  9956.350000  4906.060000  2324.712161  3.929380  5.329160  2.665100  1.091810  3.000000
Module:Portal/images/e  8234.620000  8663.050000  7994.040000  303.709633  6.028453  7.697790  3.711110  1.690826  3.000000
Module:Redirect  2146.208250  4795.580000  349.802000  1440.790405  28.586876  77.163800  5.319880  27.175469  8.000000
Module:Series_overview  9699.633333  10325.700000  9264.660000  453.753947  4.698167  6.115900  3.229880  1.178755  3.000000
Module:StringReplace  1323.821500  2610.100000  704.798000  671.206265  5.247787  10.584300  1.995060  3.000373  6.000000
Module:WPMILHIST_Infobox_style  4415.930000  10044.400000  1297.880000  3011.915673  6.228968  13.035600  1.971790  3.467265  15.000000
Template:Both  1716.386417  5682.500000  436.530000  1303.609764  8.020113  18.778600  1.378780  5.316927  12.000000
Template:Cbignore  7120.640000  9154.170000  4547.250000  1712.847086  4.066368  5.893950  2.313950  1.267074  4.000000
Template:Citation_needed  361.338500  448.028000  220.694000  92.243198  2.879653  5.585180  0.834362  1.735115  4.000000
Template:Country_data_Bosnia_and_Herzegovina  8364.810000  10227.400000  6459.560000  1538.509988  2.598370  2.712980  2.422850  0.126033  3.000000
Template:Disambiguation/cat  3426.925000  7480.710000  1625.050000  2379.878417  21.204920  36.142000  8.843780  10.202672  4.000000
Template:Fix/category  470.143000  678.487000  346.986000  129.158860  2.210450  3.610620  0.760329  1.011078  4.000000
Template:Greater_color_contrast_ratio  11262.324000  38375.800000  1708.270000  13702.498422  19.179310  33.376800  8.027150  9.153371  5.000000
Template:I2c  4346.856000  6730.380000  2223.910000  1859.400014  16.037544  42.054600  4.286240  13.389019  5.000000
Template:IMDb_title  4387.217769  8185.140000  202.521000  2337.147274  19.687484  42.214400  4.215380  10.911641  13.000000
Template:Infobox_single  2526.114500  8361.000000  812.327000  2647.704836  8.902750  20.804200  2.917870  6.082021  6.000000
Template:Navbox_subgroup  754.168333  1115.410000  555.346000  255.868392  2.523997  4.161350  1.619920  1.159881  3.000000
Template:Ns0  1951.935000  2425.940000  1230.050000  472.576771  18.458360  36.183400  5.494440  11.517967  4.000000
Template:Sort  6703.841667  10285.700000  1441.340000  2870.993608  2.908305  3.985550  2.173550  0.579257  6.000000
Template:Track_listing  7828.186667  8897.180000  6078.260000  1247.485510  4.139633  5.490920  2.330950  1.329978  3.000000
Template:Use_dmy_dates  563.635250  846.771000  455.229000  163.804342  2.495427  3.250360  1.887650  0.614926  4.000000
Template:Webarchive  1562.522889  3257.900000  225.196000  868.793804  7.012448  24.834100  1.653600  6.750085  9.000000
4488  1871.496857  4500.000000  292.241000  1377.653450  5.973984  15.761300  1.184030  4.660533  14.000000
4489  6007.608000  7293.800000  3225.340000  1461.933050  5.850684  13.136900  2.706960  3.825827  5.000000
Template:Br_separated_entries  1055.068000  1675.880000  605.784000  361.925648  4.551178  5.955240  2.444260  1.393840  5.000000
Template:Refimprove  2509.490000  4888.190000  1247.260000  1426.490530  3.225167  4.365030  2.017860  0.830793  4.000000
Module:Check_isxn  893.582000  1756.730000  278.716000  574.720252  5.480575  12.156700  1.300140  4.124137  4.000000
popularity_score  5813.819768  77727.700000  201.002000  10593.805969  100.931088  5314.380000  0.678481  410.865875  353.000000
Template:As_of  3972.990000  5329.470000  1629.430000  1663.995031  3.114540  3.564300  2.787010  0.328887  3.000000
Module:Other_uses  2038.920000  3305.790000  1390.640000  895.894366  3.211587  4.703520  1.962300  1.132175  3.000000
Template:Cite_web  2048.961800  7281.990000  237.515000  2651.823850  2.593682  6.934650  0.528218  2.283810  5.000000
Template:Cite_news  1204.834091  3387.860000  273.929000  917.802503  3.949386  8.505780  1.254120  2.289441  11.000000
Module:TableTools  766.873818  3506.790000  222.391000  951.369159  3.862119  16.959500  0.919608  4.515097  11.000000
Template:(!  784.651333  1220.260000  429.543000  327.789131  6.209387  7.091400  4.985750  0.892924  3.000000
Template:Birth_date_and_age  3894.830000  5628.150000  1320.130000  1856.645597  4.014737  5.504800  3.070240  1.066147  3.000000
Template:Column-width  782.524455  1396.290000  293.064000  406.013227  3.121418  4.519570  1.747900  0.777650  11.000000
all_near_match  8173.933956  77454.200000  1014.040000  10806.929035  117.155361  1948.640000  2.942300  254.969331  91.000000
Module:Portal/images/t  6655.315000  9773.630000  5073.140000  1858.471438  2.277273  3.320680  1.501180  0.664872  4.000000
Module:Link_language  6647.216667  7141.010000  5731.910000  647.891874  5.439580  10.070500  2.481620  3.316310  3.000000
Template:ISBNT  10140.575000  10980.800000  9241.900000  624.680920  4.011893  6.360720  2.091950  1.925094  4.000000
Template:Main_article  1251.902333  3899.850000  326.975000  1226.888263  2.469382  4.107020  0.991312  1.040132  6.000000
Template:Authority_control  1381.088444  2613.950000  349.833000  642.238887  7.640454  19.093500  1.761500  6.131434  9.000000
Template:Rotten_Tomatoes  4393.315000  5954.570000  3827.080000  901.983301  8.168822  11.617000  4.446790  3.278071  4.000000
Template:Infobox_settlement/impus  5205.086667  8690.290000  1299.830000  3031.727599  62.608657  142.111000  3.043770  58.498150  3.000000
Template:Cite_journal  2051.755588  7202.770000  215.862000  1953.875946  4.685924  12.031100  0.794502  2.933385  17.000000
Template:Cite_encyclopedia  5535.030000  6552.020000  4667.740000  673.131352  4.859253  7.614620  2.456700  1.834768  4.000000
Module:Unsubst-infobox  549.736500  798.701000  314.950000  173.042546  5.998965  16.912800  1.236200  6.348363  4.000000
Template:For  2303.903333  2653.670000  1652.320000  461.154660  3.219033  4.262700  2.572470  0.744978  3.000000
title  9063.832716  169728.000000  201.354000  19086.794947  511.310996  45781.400000  0.609168  3389.279432  306.000000
Module:Namespace_detect/config  294.229667  330.378000  229.476000  45.891192  2.250953  2.668800  1.737890  0.385960  3.000000
Template:Ifempty  3992.504286  10083.800000  1493.240000  2981.665112  4.883199  11.238200  1.569390  3.093045  7.000000
Module:Hatnote_list  475.784000  944.469000  252.753000  277.628248  2.855527  4.381960  1.712960  0.992243  4.000000
Template:Ambox  738.824600  1632.240000  338.025000  467.619345  2.157544  2.952190  1.187380  0.787314  5.000000
text_or_opening_text_dismax  7309.665569  38935.200000  209.103000  6182.307296  29.835002  260.371000  0.557776  49.510657  255.000000
Module:Portal_bar  4541.746667  5796.560000  3457.200000  962.598690  2.773373  3.090570  2.484480  0.248243  3.000000

Need to decide what to do with these exactly. For English I can certainly manually review the lists, take everything that seems sane, build a combined model with all of them, and re-review what happens. I think we also might need T162711, because talk page categories are probably as interesting as the main page categories. They will probably also change things.

There are also non-exact matches. Especially for talk page categories we might want matches like 'Grade A' or some such, rather than strictly matching specific categories, to generalize things. Matching specific wikiproject tags regardless of quality might also be useful. I'm not sure how to generalize this beyond English though; our best bet might be to do something similar to ORES and work with communities to curate lists of sane things.
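
As a rough illustration of the non-exact matching idea, a sketch that counts categories containing quality-related substrings. The helper name and the pattern list are hypothetical, not a curated list:

```python
# Sketch (hypothetical helper): generalize quality categories by matching
# substrings like "good articles" instead of exact category names.
QUALITY_PATTERNS = ["good articles", "a-class", "featured"]

def quality_category_hits(categories):
    """Count categories whose normalized name contains any quality pattern."""
    hits = 0
    for cat in categories:
        name = cat.lower().replace("_", " ")
        if any(pat in name for pat in QUALITY_PATTERNS):
            hits += 1
    return hits

print(quality_category_hits(
    ["History_good_articles", "Living_people", "Military_history_A-Class_articles"]))  # prints 2
```

A real version would need per-wiki pattern lists, which is exactly the curation problem mentioned above.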

TJones added a comment. Edited Jul 6 2017, 7:28 PM

Thanks for all the data. (It imports into Libre Office very easily.) Those max/avg/min swings are huge—orders of magnitude!

Whether or not you should filter for things that seem sane isn't clear to me. Machine learning often doesn't seem sane. ;) What really matters is whether things are sufficiently stable to be predictive. I think the refresh period for the model might be more important. Something like category_All_Wikipedia_articles_written_in_American_English doesn't seem to have any theoretical value, but it might have real-world value, at least in the short term, because whoever is tagging that category is focusing on high-quality articles first, or something. If the model is refreshed often enough (for some value of "often enough") then the reason something is predictive doesn't matter too much, because the newer model will catch up with any shifts in predictive value. That also resolves the problem of dealing with languages and cultures we don't know.

I definitely agree that the talk page categories will be very valuable!

While I agree that matching on parts of categories, like "A Class" or "Good Articles", could boost the signal from the larger quality categories to the smaller ones, there's the problem of parsing them. There could be problems like "Good Articles for Deletion" (terrible category name, but you get the idea) in English, and there's the problem of languages we don't know (and some of those could have inflections of the words in titles that make parsing them even more annoying). So, maybe, it would make sense to try them without parsing and see how it goes. "History good articles" might be a significantly more valuable indicator than "Pokémon good articles". You could also try just the most obvious and easiest-to-extract non-exact matches to see how much they help over the full category names.

All very fair. I haven't automated the above selection of categories/templates, but I suppose I could. It's in code, but it's code I pasted into a REPL as opposed to building into mjolnir directly. I was thinking, though, that perhaps some of these things that were chosen are basically proxies for some more specific signal. For example:

  • category_Articles_with_unsourced_statements_from_[March|April|July]_2017 - Maybe this is a proxy for the age of articles, or for low-quality new articles? Might be worth trying to use age as a feature directly and seeing if xgboost finds less value in these particular categories. We don't currently store the age of the oldest revision in Elasticsearch, but I think it would be pretty easy to inject.
  • Template:Navbox - Almost all pages have this; pages that don't are perhaps lower-quality pages? Maybe using ORES wp10 scores would make this kind of proxy for page quality less useful?
  • Module:WPMILHIST_Infobox_style - Maybe this is a proxy for general 'military history' pages? Not sure what better signal we could come up with though, this is probably pretty reasonable.
  • category_Wikipedia_indefinitely_semi-protected_pages - This almost seems like a proxy for popularity or quality? Generally something is semi-protected due to spam, which probably happens on either controversial or popular articles.

As with anything though, I'm not sure that guessing at the actual signal behind each choice and trying to capture it more directly is going to be worthwhile. The above was relatively easy to do with just a computer stuffing numbers into a box and popping out an answer, whereas each individual signal will probably take a day or two of engineering time for an initial validation to see if it's of any use. And of course there is plenty of opportunity to be wrong about what signal it was trying to capture. Testing a signal and finding it to be useless is probably still useful information, but not as useful as a new feature that works.
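
If the unsourced-statements categories really are partly a proxy for article age, the direct feature could be sketched as days since the first revision. This assumes MediaWiki's 14-digit timestamp convention; the helper itself is hypothetical:

```python
# Sketch (assumption): article age in days, computed from the timestamp of
# the oldest revision in MediaWiki's 14-digit format, e.g. '20170405154900'.
from datetime import datetime, timezone

def page_age_days(first_revision_ts, now=None):
    created = datetime.strptime(first_revision_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - created).total_seconds() / 86400.0
```

This would only become a real feature once the first-revision timestamp is injected into the Elasticsearch docs, as suggested above.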

I think it's a great idea to look at English templates and categories like you have and try to get at some idea of a more generic signal. Article age and ORES score both seem like good ideas—but as you say, it may or may not be worth the engineering time to extract the info and test its utility in every conceivable case. OTOH, something like article age is universal across all wikis (that is, it's available and might be meaningful, unlike ORES scores, which are probably always meaningful, but not available for all wikis).

Some evaluations of current QI features, and the addition of wp10 and page_created_ts:

added / removed features              cv-test-ndcg@10  holdout-test-ndcg@10  diff from baseline  % possible improvement
baseline                              0.84791          0.84894                0.00000             0.00000%
-incoming_links                       0.84597          0.84710               -0.00184            -1.21540%
-popularity_score                     0.82103          0.82354               -0.02540            -16.81655%
-popularity_score, -incoming_links    0.81071          0.81256               -0.03638            -24.08052%
+page_created                         0.84886          0.85006                0.00112             0.74417%
+page_created, -incoming_links        0.84795          0.84874               -0.00020            -0.13183%
+page_created, -popularity_score      0.82305          0.82628               -0.02266            -15.00040%
+wp10                                 0.84865          0.84860               -0.00033            -0.22167%
+wp10, -incoming_links                0.84586          0.84734               -0.00160            -1.05778%
+wp10, -popularity_score              0.82344          0.82633               -0.02261            -14.96935%
+one_hot_wp10                         0.84803          0.84862               -0.00032            -0.20859%

Surprisingly, while both of our existing QI features are "good", the new features under evaluation have much smaller or no impact. wp10 doesn't look to be a useful feature, either as a weighted sum or as a one-hot encoding of article classes. The page-created date has some small utility, but relatively little. Adding it to our Elasticsearch docs would be relatively easy if we want to add it.
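
For reference, the "% possible improvement" column appears to be the holdout NDCG@10 delta expressed as a fraction of the headroom above the baseline. This reading is an assumption (not stated in the task), but it reproduces the table values to within rounding of the displayed numbers:

```python
# Assumption: "% possible improvement" = 100 * diff / (1 - baseline),
# where diff is the change in holdout-test-ndcg@10 relative to baseline.
def pct_possible_improvement(baseline_ndcg, test_ndcg):
    diff = test_ndcg - baseline_ndcg
    return 100.0 * diff / (1.0 - baseline_ndcg)

baseline = 0.84894  # holdout-test-ndcg@10 of the baseline model
# -popularity_score row; the table lists -16.81655% (small gap is rounding
# of the displayed ndcg values)
print(round(pct_possible_improvement(baseline, 0.82354), 2))  # prints -16.81
```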

EBernhardson added a comment. Edited Jul 12 2017, 4:18 AM

One more (or really two) new features: token counts of the query string under both the text_search and plain_search analyzers:

added / removed features              cv-test-ndcg@10  holdout-test-ndcg@10  diff from baseline  % possible improvement
baseline                              0.84791          0.84894               0.00000             0.00000%
+num_text_terms, +num_plain_terms     0.8494           0.85131               0.00237             1.568%

Will train the model with them individually to see if both are necessary.
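
For illustration, the shape of the two features, with trivial stand-in tokenizers. The real text_search and plain_search analysis chains live in Elasticsearch and also stem, fold case, and drop stopwords, which is why the two counts can differ:

```python
# Sketch only: whitespace tokenizers standing in for the real analyzers.
def num_terms(query, analyzer):
    return len(analyzer(query))

# plain_search stand-in: keeps every token
plain_search = lambda q: q.lower().split()
# text_search stand-in: also drops a toy stopword list
text_search = lambda q: [t for t in q.lower().split() if t not in {"the", "of", "a"}]

query = "the history of france"
print(num_terms(query, plain_search), num_terms(query, text_search))  # prints "4 2"
```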

Some other random ideas for features:

  • Prefix match against title/redirect. It seems likely that a prefix match would be more important than a general match against title. May want to do this per-term? As in if any of the terms in the search query match the first term of the title.
  • Age of last edit to page
  • Age of last significant content change to page (so, disregarding edits to fix typos, or bots that swap a link out to point at internet archive, etc)
  • Does the page contain multimedia, ratio of images to page length may be a quality signal (but wp10 turned out to not be useful, so maybe not as important or overshadowed by popularity?)
  • Estimated reading level of a page? Seems it would somehow need to be combined with an estimated reading level of the query as well
  • Number of sections? Average section length?
  • Could category matches be influenced by the size of the category? Matching a very narrow category might be a better match than a large category.
  • Search popularity of page (% of search clicks that go to that page)? Might provide a slightly different signal than overall popularity
  • Backlink anchor text could be useful to index. It's possible that the choices of words people use to link to an article are different from the title/redirects that already exist.
  • Popularity velocity, or change over time. A page with increasing popularity may be more important than one with decreasing popularity. Or it might all be noise.
  • Backlink co-occurrence, or the words that appear near a link to the article, may provide useful context.
  • Existence of a talk page for an article may be a quality signal (but again, wp10... hard to say)
  • Page dwell time, or how long an "average" reading session on the page is.
  • Does the page have an infobox?
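
The first idea in the list above (prefix match against the title) could be sketched roughly like this. The helper is hypothetical; a real implementation would live in an Elasticsearch query or feature definition:

```python
# Sketch: does any term in the search query prefix-match the first term
# of the title? (Per-term variant of the prefix-match idea above.)
def any_term_prefixes_title(query, title):
    title_terms = title.lower().split()
    if not title_terms:
        return False
    first = title_terms[0]
    return any(first.startswith(term) for term in query.lower().split())

print(any_term_prefixes_title("barack president", "Barack Obama"))  # prints True
```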

Perhaps also interesting are the ranking features available in the newly released Vespa, which Yahoo used for many internal ranking and recommendation tasks: http://docs.vespa.ai/documentation/reference/rank-features.html

Another idea for a feature—some similarity measure between the search term and the matching term in a document. Though after talking to @dcausse it sounds too expensive because there's no good way to map matched terms to specific query terms, and even pulling out the matched terms is a pain.

But it could help with unexpected stemmer bugs—for example, the Polish stemmer is statistical and has a few really weird errors. It might penalize some stemmings of related but dissimilar words (e.g., the English stemmer stems Dutch and Holland as dutch, and the Ukrainian stemmer groups жене and гнали)—but those are very rare.
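
If the per-term mapping problem were ever solved, the similarity measure itself could be as cheap as a surface-string ratio. A sketch using Python's difflib; the helper name is made up:

```python
# Sketch: cheap surface similarity between a query term and the indexed
# term it matched via stemming. A low score flags stemmer collisions of
# dissimilar words (e.g. "dutch" vs "holland", which the English stemmer
# collapses), which could then be penalized.
import difflib

def surface_similarity(query_term, matched_term):
    return difflib.SequenceMatcher(
        None, query_term.lower(), matched_term.lower()).ratio()

print(surface_similarity("dutch", "holland"))  # low ratio, a collision candidate
```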

EBernhardson closed this task as Resolved. Sep 10 2018, 6:06 PM

This is a pretty open-ended, never-ending ticket. The original need was met, though, and some of the above ideas made it into the production feature sets.