
Collect ideas for feature engineering of LTRank
Closed, ResolvedPublic

Description

We have various ideas for features we could use; this ticket is to collect those ideas and to create subtasks for the ones worth investigating.

Event Timeline

EBernhardson created this task. Edited Apr 5 2017, 3:49 PM

Some ideas we've kicked around:

  • One-hot encoding of popular templates/categories
  • # of tokens in query
  • # of tokens in title
  • size of the document
  • derivative features, such as wp10's # wikilinks / # content chars
  • wp10 scores from ores
  • Features used by wp10 in ores:
>>> from wikiclass.feature_lists import enwiki
>>> import pprint
>>> pprint.pprint(enwiki.wp10)
[<feature.wikitext.revision.chars>,
 <feature.wikitext.revision.content_chars>,
 <feature.wikitext.revision.ref_tags>,
 <feature.(wikitext.revision.ref_tags / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.wikilinks>,
 <feature.(wikitext.revision.wikilinks / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.external_links>,
 <feature.(wikitext.revision.external_links / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.headings_by_level(2)>,
 <feature.(wikitext.revision.headings_by_level(2) / max(wikitext.revision.content_chars, 1))>,
 <feature.wikitext.revision.headings_by_level(3)>,
 <feature.(wikitext.revision.headings_by_level(3) / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.image_links>,
 <feature.(enwiki.revision.image_links / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.category_links>,
 <feature.(enwiki.revision.category_links / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.cite_templates>,
 <feature.(enwiki.revision.cite_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.(enwiki.revision.cite_templates / max(wikitext.revision.ref_tags, 1))>,
 <feature.max((wikitext.revision.ref_tags - enwiki.revision.cite_templates), 0)>,
 <feature.(max((wikitext.revision.ref_tags - enwiki.revision.cite_templates), 0) / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.non_cite_templates>,
 <feature.(enwiki.revision.non_cite_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.revision.infobox_templates>,
 <feature.(enwiki.revision.cn_templates + 1)>,
 <feature.(enwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.(enwiki.revision.cn_templates + 1)>,
 <feature.(enwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.enwiki.main_article_templates>,
 <feature.(enwiki.main_article_templates / max(wikitext.revision.content_chars, 1))>,
 <feature.(english.stemmed.revision.stems_length / max(wikitext.revision.content_chars, 1))>]
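Many of the wp10 features above come in pairs: a raw count plus a ratio normalized by content length, with a `max(..., 1)` guard against division by zero. A minimal sketch of computing such derivative features (the field names and input dict are illustrative, not the actual ORES API):

```python
# Sketch of wp10-style derivative features: pair each raw count with a
# ratio normalized by content length, guarding against division by zero
# with max(..., 1) exactly as in the ORES list above.
def derivative_features(doc):
    denom = max(doc.get("content_chars", 0), 1)
    feats = {}
    for name in ("ref_tags", "wikilinks", "external_links"):
        count = doc.get(name, 0)
        feats[name] = count
        feats[name + "_per_char"] = count / denom
    return feats

doc = {"content_chars": 2000, "ref_tags": 10, "wikilinks": 40, "external_links": 5}
print(derivative_features(doc))
```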
EBernhardson renamed this task from Collect ideas for feature engineering of LTR to Collect ideas for feature engineering of LTRank. Apr 5 2017, 8:13 PM

Some high-level ideas to think about from the Yahoo LTR challenge (http://www.jmlr.org/proceedings/papers/v14/chapelle11a/chapelle11a.pdf?WT.mc_id=Blog_MachLearn_General_DI):

Web graph

This type of features tries to determine the quality or the popularity of a
document based on its connectivity in the web graph. Simple features are functions
of the number of inlinks and outlinks while more complex ones involve some kind of
propagation on the graph. A famous example is PageRank (Page et al., 1999). Other
features include distance or propagation of a score from known good or bad documents
(Gyöngyi et al., 2004; Joshi et al., 2007).

Document statistics

These features compute some basic statistics of the document such
as the number of words in various fields. This category also includes characteristics
of the url, for instance the number of slashes.

Document classifier

Various classifiers are applied to the document, such as spam, adult,
language, main topic, quality, type of page (e.g. navigational destination vs informational).
In case of a binary classifier, the feature value is the real-valued output of the
classifier. In case of multiple classes, there is one feature per class.

Query

Features which help in characterizing the query type: number of terms, frequency
of the query and of its terms, click-through rate of the query. There are also result set
features, that are computed as an average of other features over the top documents
retrieved by a previous ranking function. For example, the average adult score of the
top documents retrieved for a query is a good indicator of whether the query is an
adult one or not.

Text match

The most important type of features is of course the textual similarity between
the query and the document; this is the largest category of features. The basic
features are computed from different sections of the document (title, body, abstract,
keywords) as well as from the anchor text and the url. These features are then aggregated
to form new composite features. The match score can be as simple as a count or
can be more complex such as BM25 (Robertson and Zaragoza, 2009). Counts can be
the number of occurrences in the document, the number of missing query terms or the
number of extra terms (i.e. not in the query). Some basic features are defined over
the query terms, while some others are arithmetic functions (min, max, or average)
of them. Finally, there are also proximity features which try to quantify how far in
the document are the query terms (the closer the better) (Metzler and Croft, 2005).
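The text-match category above mentions scores ranging from raw counts up to BM25. A minimal single-document BM25 sketch, assuming illustrative collection statistics (`df`, `N`, average field length) that would normally come from the index:

```python
import math

# Minimal BM25 sketch for one (query, field) pair, with the common
# defaults k1=1.2, b=0.75. df/N/avg_len are illustrative inputs here.
def bm25(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # missing query terms contribute nothing
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += idf * norm
    return score

score = bm25(["learning", "rank"], {"learning": 3, "rank": 1}, doc_len=120,
             avg_len=100, df={"learning": 50, "rank": 10}, N=1000)
print(round(score, 3))
```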

Topical matching

This type of feature tries to go beyond similarity at the word level and
compute similarity at the topic level. This can for instance be done by classifying
both the query and the document in a large topical taxonomy. In the context of
contextual advertising, details can be found in (Broder et al., 2007).

Click

These features try to incorporate the user feedback, most importantly the clicked
results (Agichtein et al., 2006). They are derived either from the search or the toolbar
logs. For a given query and document, different click probabilities can be computed:
probability of click, first click, last click, long dwell time click or only click. Also of
interest is the probability of skip (not clicked, but a document below is). If the given
query is rare, these clicks features can be computed using similar, but more frequent
queries. The average dwell time can be used as an indication of the landing page
quality. The geographic similarity of the users clicking on a page is a useful feature to
determine its localness. Finally, for a given host, the entropy of the click distribution
over queries is an indication of its specificity.
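The click and skip probabilities described above can be estimated from impression logs. A sketch under an assumed log schema (each session is a query plus its ranked results with click flags):

```python
from collections import defaultdict

# Sketch: per (query, url) click and skip probabilities from SERP logs.
# A "skip" is an impression that was not clicked while some lower-ranked
# result was clicked. The log schema here is an assumption.
def click_features(sessions):
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0, "skips": 0})
    for query, results in sessions:  # results: [(url, clicked), ...] in rank order
        lowest_click = max((i for i, (_, c) in enumerate(results) if c), default=-1)
        for i, (url, clicked) in enumerate(results):
            s = stats[(query, url)]
            s["impressions"] += 1
            s["clicks"] += clicked
            s["skips"] += (not clicked) and i < lowest_click
    return {k: {"p_click": v["clicks"] / v["impressions"],
                "p_skip": v["skips"] / v["impressions"]}
            for k, v in stats.items()}

sessions = [("q", [("a", False), ("b", True)]),
            ("q", [("a", True), ("b", False)])]
print(click_features(sessions))
```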

External references

For certain documents, some meta-information, such as Delicious
tags, is available and can be used to refine the text matching features. Also documents
from specific domains have additional information which can be used to evaluate
the quality of the page: for instance, the rating of an answer in Yahoo! Answers
documents.

Time

For time sensitive queries, the freshness of a page is important. There are several
features which measure the age of a document as well as the one of its inlinks and
outlinks. More information on such features can be found in (Dong et al., 2010,
Section 5.1).

Microsoft also has a list of 136 features, which are a subset of those used in Bing and are "widely used in the research community": https://www.microsoft.com/en-us/research/project/mslr/

These 136 features are actually a much smaller number of distinct features, applied multiple times. For example, the IDF of the query terms in the body, anchor, title, URL, and whole document makes up 5 separate features, but all are IDF. We would likely do the same, but with our fields: title, heading, opening text, etc. The deduplicated list follows; everything after the dash is my own guess about what the feature means, or data from the site describing it.

  1. covered query term number - perhaps the number of terms from the query that are matched by the document?
  2. covered query term ratio - the percentage of terms from the query that are matched by the document?
  3. stream length - size in bytes/codepoints/something?
  4. IDF
  5. Sum of TF
  6. Min of TF
  7. Max of TF
  8. Mean of TF
  9. Variance of TF
  10. sum of stream length normalized TF
  11. min of stream length normalized TF
  12. max of stream length normalized TF
  13. mean of stream length normalized TF
  14. variance of stream length normalized TF
  15. sum of tf*idf
  16. min of tf*idf
  17. max of tf*idf
  18. mean of tf*idf
  19. variance of tf*idf
  20. boolean model - ??? no clue
  21. vector space model - This may be word2vec style vector space embeddings? not sure
  22. BM25
  23. LMIR.ABS - Site says "Language model approach for information retrieval (IR) with absolute discounting smoothing"
  24. LMIR.DIR - Site says "Language model approach for IR with Bayesian smoothing using Dirichlet priors"
  25. LMIR.JM - Site says "Language model approach for IR with Jelinek-Mercer smoothing"
  26. Number of slash in URL - probably similar to a subpage depth in our case
  27. Length of URL - Probably not relevant to us, the "stream length" of title is the same
  28. Inlink number
  29. Outlink number
  30. PageRank
  31. SiteRank - Site says "Site level PageRank".
  32. QualityScore - Site says "The quality score of a web page. The score is outputted by a web page quality classifier."
  33. QualityScore2 - Site says "The quality score of a web page. The score is outputted by a web page quality classifier, which measures the badness of a web page."
  34. Query-url click count - Site says "The click count of a query-url pair at a search engine in a period"
  35. url click count - Site says " The click count of a url aggregated from user browsing data in a period"
  36. url dwell time - Site says "The average dwell time of a url aggregated from user browsing data in a period"
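Items 23-25 are query-likelihood language models with different smoothing. A sketch of the Jelinek-Mercer and Dirichlet variants, scoring log P(query|document); the collection probabilities here are illustrative inputs, not actual index statistics:

```python
import math

# Sketch of LMIR query-likelihood scoring (MSLR items 23-25 family).
def lmir_jm(query, doc_tf, doc_len, coll_prob, lam=0.1):
    # Jelinek-Mercer: linear mix of document and collection language models.
    return sum(math.log((1 - lam) * doc_tf.get(t, 0) / doc_len
                        + lam * coll_prob[t])
               for t in query)

def lmir_dir(query, doc_tf, doc_len, coll_prob, mu=2000):
    # Dirichlet prior: document counts smoothed by mu pseudo-counts
    # drawn from the collection model.
    return sum(math.log((doc_tf.get(t, 0) + mu * coll_prob[t])
                        / (doc_len + mu))
               for t in query)

q = ["learning", "rank"]
doc_tf = {"learning": 3, "rank": 1}
cp = {"learning": 0.001, "rank": 0.0005}
print(lmir_jm(q, doc_tf, 120, cp), lmir_dir(q, doc_tf, 120, cp))
```

Both are log-probabilities, so scores are negative; documents matching more query terms score closer to zero.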
EBernhardson added a subscriber: dcausse. Edited Apr 5 2017, 11:14 PM

@dcausse one thing to think about is how we could efficiently extract some of the data above from lucene for use as feature vectors. Likely we wouldn't want to have a separate query for each of the sum/min/max/mean/variance of the TF (and again for a length-normalized version). It seems like all 10 features could come out of a single query some way or another, as they all have basically the same input data: the TF for each term in the query. These queries would still have to be duplicated against the set of fields we want to use, such as title, headings, opening_text, body, etc.
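To illustrate the point above: given the per-term TF vector for one (query, field) pair, all ten aggregates fall out of a single pass. A minimal sketch:

```python
import statistics

# Sketch: all ten TF aggregate features (raw and stream-length-normalized)
# from one pass over the per-term TF vector of a single (query, field) pair.
def tf_stats(tfs, stream_length):
    norm = [tf / stream_length for tf in tfs] if stream_length else [0.0] * len(tfs)
    def agg(xs, prefix):
        return {
            prefix + "sum": sum(xs),
            prefix + "min": min(xs),
            prefix + "max": max(xs),
            prefix + "mean": statistics.mean(xs),
            prefix + "variance": statistics.pvariance(xs),
        }
    return {**agg(tfs, "tf_"), **agg(norm, "norm_tf_")}

print(tf_stats([3, 1, 0], stream_length=120))
```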

TJones added a subscriber: TJones. May 9 2017, 7:18 PM

I was thinking that some of the generic text features of the query might be useful. Mikhail has an interesting list and some data on prevalence in his Zero To Hero report. Presence of a question mark, presence of quotes, and presence of advanced syntax seem interesting—though as we discussed, these may be too low frequency to be useful.

Another one that just came to mind: the presence of Unicode character classes (e.g., writing systems, very roughly) might be interesting. I'd expect that for any given wiki, it'd come down to "the character set of this wiki" and "not the character set of this wiki", but the generic features would be easy to implement.
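A rough sketch of this idea: the standard library doesn't expose the Unicode Script property, but counting General_Category codes per query character is a cheap approximation of "what kind of characters is this query made of":

```python
import unicodedata

# Sketch: generic query features from Unicode character classes, counting
# General_Category codes (e.g. 'Lu', 'Ll', 'Nd', 'Po') per character.
# Note the stdlib has no Script property; a proper script bucket would
# need an external table or library.
def char_class_features(query):
    feats = {}
    for ch in query:
        if ch.isspace():
            continue
        cat = unicodedata.category(ch)
        feats[cat] = feats.get(cat, 0) + 1
    return feats

print(char_class_features("Тест test?"))
```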

I had the idea to take a list of popular categories and templates, train a model with them, and then use the results of that model to try to decide what to use.

I took all the categories that are on more than 1k pages, which amounts to 3166 categories, then trained a model with 1.2M samples using those 3166 categories as binary features along with our default set of features. This ends up being pretty expensive to train: memory usage per fold went from 186MB for the 9-feature model to 26GB for the 3175-feature model. Because of this, instead of running the full hyperparameter/cross-validation pipeline I trained a single model using the same parameters as the best model previously trained with the same 1.2M samples and 9 features. This ended up using 66 categories in the final model. 45 of the features were used at least twice, and 34 of them at least three times. The data below only considers features with weight >= 3.
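The binary-feature setup described above can be sketched as follows (names and base features are illustrative, not the actual training pipeline):

```python
# Sketch: turn membership in popular categories into binary (one-hot)
# features alongside the base ranking features. Names are illustrative.
def build_feature_vector(page_categories, popular_categories, base_features):
    vector = dict(base_features)
    page_set = set(page_categories)
    for cat in popular_categories:
        vector["category_" + cat] = 1 if cat in page_set else 0
    return vector

popular = ["Living_people", "Disambiguation_pages"]
vec = build_feature_vector(["Living_people", "American_films"], popular,
                           {"incoming_links": 12, "popularity_score": 0.003})
print(vec)
```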

The tree statistics in graph form. Note that I had to put all the values on a log scale, otherwise the variations are incredibly hard to see due to the difference between largest and smallest values.

feature | cover_average | cover_max | cover_min | cover_std | gain_average | gain_max | gain_min | gain_std | weight
all_near_match7856.79145577573.500000721.48400010445.911281109.3200121955.9200001.471630245.16414399.000000
auxiliary_text1903.90840110814.400000203.0110002522.6857953.64352233.6618000.5236514.125408197.000000
category2780.70441616740.400000216.2210003071.81829919.662336496.1210000.53619747.158397305.000000
category_Album_infoboxes_lacking_a_cover10150.00000010150.00000010150.0000000.0000001.4498501.4498501.4498500.0000001.000000
category_All_Wikipedia_articles_written_in_American_English7653.3250009237.4200006069.2300001584.0950001.4086161.9418300.8754010.5332152.000000
category_All_article_disambiguation_pages2213.8249476384.110000506.9820001519.81551621.26165388.3936002.06188019.53149619.000000
category_All_articles_containing_potentially_dated_statements4532.7650006131.2700003432.390000849.8929553.5795389.4666401.6845102.6590056.000000
category_All_articles_lacking_in-text_citations7789.7900007789.7900007789.7900000.0000001.7459201.7459201.7459200.0000001.000000
category_All_articles_needing_additional_references1580.8770003946.750000750.0900001008.4475273.99052510.1687001.0808502.42576011.000000
category_All_articles_that_may_contain_original_research7923.2200008012.8600007833.58000089.6400002.1261502.4311701.8211300.3050202.000000
category_All_articles_to_be_expanded7907.5700007907.5700007907.5700000.0000001.6417301.6417301.6417300.0000001.000000
category_All_articles_with_dead_external_links2816.5396366786.000000635.2160001615.6608613.7118567.0496101.4817101.90544211.000000
category_All_articles_with_unsourced_statements1129.1192503469.800000228.2760001145.3904862.4567597.3970000.5964791.9702888.000000
category_All_articles_with_vague_or_ambiguous_time9257.83000010055.2000008157.800000797.0023532.2771404.3305501.2549901.2134474.000000
category_All_pages_needing_cleanup9225.2100009225.2100009225.2100000.0000001.5160801.5160801.5160800.0000001.000000
category_All_stub_articles2226.8750004061.850000391.9000001834.9750002.3590343.7846400.9334281.4256062.000000
category_American_films3722.8250004749.5600002696.0900001026.7350005.1261806.4932203.7591401.3670402.000000
category_Articles_containing_French-language_text9322.8550009850.7000008795.010000527.8450001.3157051.3431601.2882500.0274552.000000
category_Articles_containing_Japanese-language_text7388.88142910145.8000003077.0500002362.40969713.86224041.3999002.26775014.6066137.000000
category_Articles_containing_Latin-language_text9935.5100009935.5100009935.5100000.0000001.3222201.3222201.3222200.0000001.000000
category_Articles_using_Infobox_video_game_using_locally_defined_parameters9594.3300009594.3300009594.3300000.0000001.5572901.5572901.5572900.0000001.000000
category_Articles_using_small_message_boxes4327.2600004327.2600004327.2600000.0000002.3561002.3561002.3561000.0000001.000000
category_Articles_with_Japanese-language_external_links9834.36333310312.2000008932.790000637.8910872.1986032.4983201.8134200.2860663.000000
category_Articles_with_hAudio_microformats1099.9799604453.010000218.175000880.7623203.87096313.0937001.2516102.72277625.000000
category_Articles_with_hCards2012.4790006453.3200001105.0000001517.2617286.39966021.4621001.3106606.24809910.000000
category_Articles_with_unsourced_statements_from_April_20178910.5700009082.0800008729.170000144.2445293.4512006.9216201.6051802.4556253.000000
category_Articles_with_unsourced_statements_from_March_20178858.3000009358.2000007934.000000552.6641122.3530104.8870501.0108201.4947674.000000
category_Articles_with_unsourced_statements_from_May_20179475.9700009475.9700009475.9700000.0000001.2552001.2552001.2552000.0000001.000000
category_CS1_German-language_sources_(de)3768.1440005856.3500002620.8700001253.8652992.2226462.6555901.5452900.4152915.000000
category_CS1_Italian-language_sources_(it)4861.7600004861.7600004861.7600000.0000003.1065503.1065503.1065500.0000001.000000
category_CS1_Japanese-language_sources_(ja)4447.7375005920.8900002380.5500001341.3127555.77722015.9728001.5463105.9093754.000000
category_CS1_Spanish-language_sources_(es)6485.3200007430.7600005539.880000945.4400002.9776054.0133901.9418201.0357852.000000
category_CS1_errors:_dates5369.3750005887.6800004851.070000518.3050001.7452101.7641501.7262700.0189402.000000
category_CS1_errors:_external_links8700.2000008700.2000008700.2000000.0000002.4998402.4998402.4998400.0000001.000000
category_CS1_maint:_BOT:_original-url_status_unknown5779.5400005779.5400005779.5400000.0000001.9581101.9581101.9581100.0000001.000000
category_CS1_maint:_Extra_text:_authors_list8944.0300009373.9000008514.160000429.8700001.2605701.4279101.0932300.1673402.000000
category_CS1_maint:_Multiple_names:_authors_list3390.9000003390.9000003390.9000000.0000003.0136503.0136503.0136500.0000001.000000
category_CS1_maint:_Uses_authors_parameter8893.7033339827.4900007765.980000852.7072351.0652071.0808301.0350000.0213643.000000
category_Coordinates_on_Wikidata2806.4481927392.220000524.3280001727.5143449.48114941.3582001.4312909.72099126.000000
category_Disambiguation_pages3958.7964298605.3700001199.4300002310.78150418.41898139.9347001.60244012.61232614.000000
category_English-language_films3512.3026326260.3200001211.7900001444.10917510.83608430.1185002.4427707.71981819.000000
category_English-language_television_programming6481.2975008940.2700004912.9100001455.1669235.84957811.6715001.5567603.0314218.000000
category_Featured_articles8312.3400008312.3400008312.3400000.0000001.0491601.0491601.0491600.0000001.000000
category_Good_articles4178.9600005469.3000002321.1300001346.4217423.5850404.7105502.0132001.1455223.000000
category_Grammy_Award_winners38801.70000038801.70000038801.7000000.00000011.35790011.35790011.3579000.0000001.000000
category_Living_people2421.1457786724.290000339.0750001939.6615124.82023819.9997001.2484605.5263079.000000
category_Music_infoboxes_with_deprecated_parameters6113.65000010046.6000003582.2400002819.0096762.6284003.8909601.6349500.9404653.000000
category_Official_website_different_in_Wikidata_and_Wikipedia6280.6975008192.1500004563.3400001495.1269842.1565602.6649201.4113200.4943474.000000
category_Official_website_not_in_Wikidata8961.2150009347.5100008574.920000386.2950001.9372552.3245501.5499600.3872952.000000
category_Pages_using_ISBN_magic_links5745.2450008321.6300002571.8700002125.4431982.5421833.1942401.5011500.6518614.000000
category_Pages_using_deprecated_image_syntax1418.7215003306.740000525.1600001102.8832353.5814434.9443902.1314701.0482124.000000
category_Pages_using_infobox_television_with_alias_parameters8294.4000008564.9500008023.850000270.5500001.3974801.5958301.1991300.1983502.000000
category_Pages_with_citations_lacking_titles7194.2900009303.3700006164.0800001242.4917715.79403015.1507001.6888105.4357414.000000
category_Singlechart_called_without_song1866.1600001866.1600001866.1600000.0000002.5949802.5949802.5949800.0000001.000000
category_Singlechart_usages_for_Flanders2652.4900002652.4900002652.4900000.00000013.86170013.86170013.8617000.0000001.000000
category_Singlechart_usages_for_New_Zealand2603.2900002900.0800002306.500000296.79000018.52190020.89330016.1505002.3714002.000000
category_Singlechart_usages_for_UK1754.6200001754.6200001754.6200000.0000001.3907701.3907701.3907700.0000001.000000
category_Webarchive_template_wayback_links1724.9042866912.160000460.3950001545.3286523.90778510.6181000.9750752.74931714.000000
category_Wikipedia_articles_with_BNF_identifiers1996.3695003740.620000376.1470001149.7151004.1013807.3235801.1761602.2546606.000000
category_Wikipedia_articles_with_GND_identifiers2896.4271007919.690000219.3100002149.6282983.5245289.4390001.3948202.85555810.000000
category_Wikipedia_articles_with_ISNI_identifiers3936.7885715620.1800001718.9700001397.4867525.62184712.9381001.6487503.9579137.000000
category_Wikipedia_articles_with_LCCN_identifiers4312.9738577170.670000213.6470002404.9232806.67348718.8101000.6812205.5973467.000000
category_Wikipedia_articles_with_MusicBrainz_identifiers3904.0217659730.2300001613.0600002245.9473537.33301732.3606000.8484718.17077617.000000
category_Wikipedia_articles_with_NLA_identifiers8472.7700008472.7700008472.7700000.0000001.4915401.4915401.4915400.0000001.000000
category_Wikipedia_articles_with_VIAF_identifiers2875.4181826894.480000333.6000001882.2073913.7217229.8800901.3729302.30889211.000000
category_Wikipedia_indefinitely_move-protected_pages4577.6275005664.4900004088.570000646.0474063.1169555.8028901.8474501.5737124.000000
category_Wikipedia_indefinitely_semi-protected_pages1142.5235001304.340000980.707000161.8165001.5138601.8851101.1426100.3712502.000000
category_Wikipedia_semi-protected_pages10763.40000010763.40000010763.4000000.0000003.4117803.4117803.4117800.0000001.000000
heading1895.4330739481.700000202.4800002135.4375544.09275228.3859000.6489684.419571123.000000
incoming_links1496.99145912206.600000203.3920001914.8726636.220038109.8900000.50056713.011488268.000000
popularity_score4892.19148576631.500000201.1260009847.52041382.6667125166.3000000.532374370.839558447.000000
redirect_or_suggest_dismax5064.52424093105.300000201.9330009396.001517114.85706910821.3000000.519488769.990566317.000000
text_or_opening_text_dismax6212.29901532519.200000211.8540005723.11483922.560473272.1600000.57688741.259152333.000000
title6879.432536169737.000000208.91900016408.034361359.39728445255.0000000.5453522827.596701435.000000

And same for templates:

feature | cover_average | cover_max | cover_min | cover_std | gain_average | gain_max | gain_min | gain_std | weight
Module:About3038.3350004641.6600001601.4000001116.4953324.4897008.7963702.3336302.5445664.000000
Module:Based_on5209.5980006651.4200004337.430000858.5814575.0981786.8226402.9861301.3525535.000000
Module:Category_handler625.0015001051.700000213.548000304.0355973.3493806.4396500.7537602.2492214.000000
Module:Check_for_unknown_parameters1022.7696678796.370000209.4360002346.8821652.4533005.8024701.4905701.11922912.000000
Module:Convert/data1338.3283753277.750000270.4190001116.1763608.28919522.9175002.6893406.3053208.000000
Module:Delink626.4480002146.420000217.060000760.5490312.0350434.7302100.6450841.4341425.000000
Module:Icon/data1547.0938896414.710000242.2830001346.2646447.66773051.7117000.80861911.48808518.000000
Module:If_empty1095.8879094904.810000236.7390001333.2627596.68652632.1322000.5430268.72892711.000000
Module:InfoboxImage1137.0677867016.760000267.0890001748.15163413.14006787.9075000.94049026.04251414.000000
Module:Labelled_list_hatnote1098.2512504159.650000232.0130001226.0227463.1476016.0051701.3643501.3772078.000000
Module:List785.8866671027.530000521.863000207.0433744.4396936.2422302.5820701.4947623.000000
Module:Main658.5554291360.000000217.118000465.4584202.2197215.8678000.8692461.5514197.000000
Module:Math1392.9105566232.190000506.5530001739.8685433.8529067.2147002.0374101.7660259.000000
Module:Navbar618.2554291295.480000272.240000338.8781933.6915146.7687001.3686001.8692857.000000
Module:No_globals494.1243001023.060000236.493000279.8229482.9605217.4876000.6825212.05621310.000000
Module:Portal616.2992501616.660000273.303000420.5536923.1524028.4033401.7388802.1390908.000000
Module:Protection_banner/config2339.9466673214.1600001155.290000868.7223254.0316674.9483302.2212001.2802273.000000
Module:Separated_entries1556.4896672471.830000626.919000753.24897419.51414043.1044004.35472016.9054973.000000
Module:Side_box508.7374001102.220000246.764000315.8966182.6265664.5078901.1829901.2998195.000000
Module:String656.8749091600.780000214.010000468.0574952.9946248.6053100.8710092.52117011.000000
Module:Unsubst434.491250624.366000288.299000120.8098052.9455556.0268301.6179201.8048934.000000
Module:Wikidata2502.7400004896.2600001127.1600001698.7603493.9712576.1655901.7385601.8075313.000000
Module:Yesno599.7226001147.140000263.746000327.2740018.56610035.6069001.18450013.5306695.000000
Template:Abbr2001.9600003182.6400001107.160000890.1081094.3687009.5179101.7315203.1602594.000000
Template:Age2158.3443333969.420000239.3930001524.6864654.7971406.1305003.5468601.0564013.000000
Template:Category_handler443.144800683.920000214.572000180.3450062.5152967.0057801.1745102.2504095.000000
Template:Citation4081.5510009172.760000898.0930003637.5049962.3168502.8791901.6053900.5306093.000000
Template:Cite_book1906.5846926798.710000268.5460002377.0542394.41004312.4360001.9610202.65335213.000000
Template:Colend6708.2220009776.4600005375.9600001589.0259394.8055189.2175303.1320902.2486905.000000
Template:Column-count816.731750958.538000692.485000100.8910892.9899826.5464501.6729002.0583724.000000
Template:Commons_category502.1860001164.250000256.630000382.5533834.2533187.4627702.3767701.9822764.000000
Template:Div_col_end1933.1566672546.8700001117.410000600.8061663.6496734.7043103.0290600.7496213.000000
Template:Dmbox3308.2636435950.650000990.9110001734.75087420.94698447.1743002.85705011.18513614.000000
Template:Episode_list9827.83666710570.7000008972.810000657.1395453.0784574.3215702.3333900.8847803.000000
Template:Europe_topic8981.52000010041.8000008097.790000803.3974306.7440908.9369404.8911601.6690803.000000
Template:Film_date2757.8500004808.0800001103.4500001331.5020117.00362512.0946003.5978403.1345354.000000
Template:Fix561.5626671313.360000240.393000402.9427033.0503226.3968300.9790712.2485146.000000
Template:Hlist1981.4743333661.430000203.6430001413.3311642.2969233.3963201.6776900.7794613.000000
Template:ISBN2133.2142865450.100000631.7410001646.8733734.1908037.1263402.2488501.4775107.000000
Template:ISO_639_name1653.8592502280.820000975.957000550.3175693.7095025.3052201.8393501.3750664.000000
Template:Infobox1099.2003004125.550000231.4680001100.2705173.95312710.4891000.7417402.60946210.000000
Template:Infobox_musical_artist/color2630.8240004479.5800001728.8500001017.34867112.59915417.7702003.0364705.1710085.000000
Template:Infobox_musical_artist/hCard_class1274.2733331584.210000989.860000243.3126404.5929707.3791902.4061802.0739973.000000
Template:Iso2country/article6322.4162508871.3000003451.1300001655.3062369.71225724.7199001.8738507.9115308.000000
Template:Italic_title1036.8422001348.830000621.874000241.06360717.53752259.5467003.12008021.2001415.000000
Template:Longitem3798.7060007907.630000936.9720002443.8055148.94028924.6635002.0170007.3577397.000000
Template:MONTHNUMBER1834.0517788704.590000255.9400002640.2203392.6096775.2148500.5308971.6759709.000000
Template:Main_other599.7362861234.850000281.585000337.7035923.40946710.4733001.3759502.9821837.000000
Template:Military_navigation5558.1200008326.7500001930.7700002386.2309152.9671584.0932501.2937301.0353485.000000
Template:N/a6403.83250010078.0000003349.9000002462.8344973.6102935.1433702.0129001.2070224.000000
Template:Navbox845.4385001725.500000262.409000565.1309064.78163710.2792001.5127602.7463616.000000
Template:Nihongo6581.85333310083.6000003460.4300002567.75995316.84952247.0134001.72312017.1885246.000000
Template:Nom2434.6433333420.1100001260.280000891.8063583.0238033.6061302.7078000.4122663.000000
Template:Nowrap1212.2206673171.920000222.617000841.9158626.68143625.5258001.7790506.8161899.000000
Template:PAGENAMEBASE1300.6650715353.980000250.0530001366.7441855.83053812.7488001.7108003.56812014.000000
Template:Page_needed8296.6366679744.2300007067.7900001103.5459052.3484772.8146701.7071600.4687903.000000
Template:Plainlist2607.6238335639.310000848.4730001622.07217515.58471043.4907003.09541013.2670056.000000
Template:Portal728.2722501608.660000272.174000537.2476276.15530310.8955002.6440203.0065254.000000
Template:Reflist2849.6570007129.600000555.6500002489.4511804.81960810.8294001.5524503.0619418.000000
Template:See_also3897.1240008206.590000573.1460003099.9661322.8354703.7511102.1957900.6144014.000000
Template:Sister_project405.968800702.482000233.761000173.3640493.7108467.0232601.8155401.8863005.000000
Template:Small1882.1019235461.210000289.7040001456.9594908.02343731.8329000.9569568.87771213.000000
Template:Start_date1908.1395002777.640000520.148000872.5878954.8403129.5496402.4327602.7718494.000000
Template:Str_left489.051250829.110000250.496000226.54119010.19491333.2113001.77983013.3183704.000000
Template:Title_disambig_text3680.1250005913.3700001244.4000002234.31764515.85173724.7630005.8215808.3973814.000000
Template:Trim1856.4328898440.050000221.1170002433.9060562.9215788.6786701.1969902.2464579.000000
Template:URL2419.2860004595.4200001211.5000001128.7952884.4543797.0405702.4312501.66749010.000000
Template:Use_mdy_dates662.5509001333.100000295.684000309.4911653.4671214.8743301.1657601.12929510.000000
Template:Wikiquote2683.4666675444.1700001158.6900001835.1664694.9313088.2110302.7522701.7071596.000000
Template:Wiktionary3908.0833338270.9000001448.8900002858.11032012.67631032.9740002.53940011.3209156.000000
Template:Yesno1533.8841434596.340000278.9810001378.2031642.9789435.0392401.0389001.4977697.000000
auxiliary_text2294.41782612166.000000207.4040003066.4309244.62383229.5783000.5098984.979459109.000000
category3196.50528119827.300000207.0210003660.34095823.890592449.4020000.52929050.367074217.000000
heading2055.2642189955.460000213.8400002604.1380035.16572530.9438000.5820105.16566978.000000
incoming_links1605.54501410078.300000207.9840001925.7956197.26550991.5752000.52063112.177630145.000000
redirect_or_suggest_dismax6295.66367592000.100000202.93300010959.660087154.2349679930.9100000.517249850.034527237.000000
Module:Distinguish8445.2900009813.5400006051.0500001698.7396031.7890972.1384101.4525200.2801583.000000
Module:Hatnote  666.074333  1369.950000  226.081000  434.786090  3.183713  6.624170  0.749369  2.233907  6.000000
Module:Message_box  2196.974000  5570.530000  735.492000  1852.696788  4.394183  12.442400  1.671320  3.713934  6.000000
Module:Official_website  2428.511667  6537.890000  1052.180000  1883.378661  7.196810  14.776100  1.487300  4.566128  6.000000
Module:Portal/images/a  6671.803333  9956.350000  4906.060000  2324.712161  3.929380  5.329160  2.665100  1.091810  3.000000
Module:Portal/images/e  8234.620000  8663.050000  7994.040000  303.709633  6.028453  7.697790  3.711110  1.690826  3.000000
Module:Redirect  2146.208250  4795.580000  349.802000  1440.790405  28.586876  77.163800  5.319880  27.175469  8.000000
Module:Series_overview  9699.633333  10325.700000  9264.660000  453.753947  4.698167  6.115900  3.229880  1.178755  3.000000
Module:StringReplace  1323.821500  2610.100000  704.798000  671.206265  5.247787  10.584300  1.995060  3.000373  6.000000
Module:WPMILHIST_Infobox_style  4415.930000  10044.400000  1297.880000  3011.915673  6.228968  13.035600  1.971790  3.467265  15.000000
Template:Both  1716.386417  5682.500000  436.530000  1303.609764  8.020113  18.778600  1.378780  5.316927  12.000000
Template:Cbignore  7120.640000  9154.170000  4547.250000  1712.847086  4.066368  5.893950  2.313950  1.267074  4.000000
Template:Citation_needed  361.338500  448.028000  220.694000  92.243198  2.879653  5.585180  0.834362  1.735115  4.000000
Template:Country_data_Bosnia_and_Herzegovina  8364.810000  10227.400000  6459.560000  1538.509988  2.598370  2.712980  2.422850  0.126033  3.000000
Template:Disambiguation/cat  3426.925000  7480.710000  1625.050000  2379.878417  21.204920  36.142000  8.843780  10.202672  4.000000
Template:Fix/category  470.143000  678.487000  346.986000  129.158860  2.210450  3.610620  0.760329  1.011078  4.000000
Template:Greater_color_contrast_ratio  11262.324000  38375.800000  1708.270000  13702.498422  19.179310  33.376800  8.027150  9.153371  5.000000
Template:I2c  4346.856000  6730.380000  2223.910000  1859.400014  16.037544  42.054600  4.286240  13.389019  5.000000
Template:IMDb_title  4387.217769  8185.140000  202.521000  2337.147274  19.687484  42.214400  4.215380  10.911641  13.000000
Template:Infobox_single  2526.114500  8361.000000  812.327000  2647.704836  8.902750  20.804200  2.917870  6.082021  6.000000
Template:Navbox_subgroup  754.168333  1115.410000  555.346000  255.868392  2.523997  4.161350  1.619920  1.159881  3.000000
Template:Ns0  1951.935000  2425.940000  1230.050000  472.576771  18.458360  36.183400  5.494440  11.517967  4.000000
Template:Sort  6703.841667  10285.700000  1441.340000  2870.993608  2.908305  3.985550  2.173550  0.579257  6.000000
Template:Track_listing  7828.186667  8897.180000  6078.260000  1247.485510  4.139633  5.490920  2.330950  1.329978  3.000000
Template:Use_dmy_dates  563.635250  846.771000  455.229000  163.804342  2.495427  3.250360  1.887650  0.614926  4.000000
Template:Webarchive  1562.522889  3257.900000  225.196000  868.793804  7.012448  24.834100  1.653600  6.750085  9.000000
4488  1871.496857  4500.000000  292.241000  1377.653450  5.973984  15.761300  1.184030  4.660533  14.000000
4489  6007.608000  7293.800000  3225.340000  1461.933050  5.850684  13.136900  2.706960  3.825827  5.000000
Template:Br_separated_entries  1055.068000  1675.880000  605.784000  361.925648  4.551178  5.955240  2.444260  1.393840  5.000000
Template:Refimprove  2509.490000  4888.190000  1247.260000  1426.490530  3.225167  4.365030  2.017860  0.830793  4.000000
Module:Check_isxn  893.582000  1756.730000  278.716000  574.720252  5.480575  12.156700  1.300140  4.124137  4.000000
popularity_score  5813.819768  77727.700000  201.002000  10593.805969  100.931088  5314.380000  0.678481  410.865875  353.000000
Template:As_of  3972.990000  5329.470000  1629.430000  1663.995031  3.114540  3.564300  2.787010  0.328887  3.000000
Module:Other_uses  2038.920000  3305.790000  1390.640000  895.894366  3.211587  4.703520  1.962300  1.132175  3.000000
Template:Cite_web  2048.961800  7281.990000  237.515000  2651.823850  2.593682  6.934650  0.528218  2.283810  5.000000
Template:Cite_news  1204.834091  3387.860000  273.929000  917.802503  3.949386  8.505780  1.254120  2.289441  11.000000
Module:TableTools  766.873818  3506.790000  222.391000  951.369159  3.862119  16.959500  0.919608  4.515097  11.000000
Template:(!  784.651333  1220.260000  429.543000  327.789131  6.209387  7.091400  4.985750  0.892924  3.000000
Template:Birth_date_and_age  3894.830000  5628.150000  1320.130000  1856.645597  4.014737  5.504800  3.070240  1.066147  3.000000
Template:Column-width  782.524455  1396.290000  293.064000  406.013227  3.121418  4.519570  1.747900  0.777650  11.000000
all_near_match  8173.933956  77454.200000  1014.040000  10806.929035  117.155361  1948.640000  2.942300  254.969331  91.000000
Module:Portal/images/t  6655.315000  9773.630000  5073.140000  1858.471438  2.277273  3.320680  1.501180  0.664872  4.000000
Module:Link_language  6647.216667  7141.010000  5731.910000  647.891874  5.439580  10.070500  2.481620  3.316310  3.000000
Template:ISBNT  10140.575000  10980.800000  9241.900000  624.680920  4.011893  6.360720  2.091950  1.925094  4.000000
Template:Main_article  1251.902333  3899.850000  326.975000  1226.888263  2.469382  4.107020  0.991312  1.040132  6.000000
Template:Authority_control  1381.088444  2613.950000  349.833000  642.238887  7.640454  19.093500  1.761500  6.131434  9.000000
Template:Rotten_Tomatoes  4393.315000  5954.570000  3827.080000  901.983301  8.168822  11.617000  4.446790  3.278071  4.000000
Template:Infobox_settlement/impus  5205.086667  8690.290000  1299.830000  3031.727599  62.608657  142.111000  3.043770  58.498150  3.000000
Template:Cite_journal  2051.755588  7202.770000  215.862000  1953.875946  4.685924  12.031100  0.794502  2.933385  17.000000
Template:Cite_encyclopedia  5535.030000  6552.020000  4667.740000  673.131352  4.859253  7.614620  2.456700  1.834768  4.000000
Module:Unsubst-infobox  549.736500  798.701000  314.950000  173.042546  5.998965  16.912800  1.236200  6.348363  4.000000
Template:For  2303.903333  2653.670000  1652.320000  461.154660  3.219033  4.262700  2.572470  0.744978  3.000000
title  9063.832716  169728.000000  201.354000  19086.794947  511.310996  45781.400000  0.609168  3389.279432  306.000000
Module:Namespace_detect/config  294.229667  330.378000  229.476000  45.891192  2.250953  2.668800  1.737890  0.385960  3.000000
Template:Ifempty  3992.504286  10083.800000  1493.240000  2981.665112  4.883199  11.238200  1.569390  3.093045  7.000000
Module:Hatnote_list  475.784000  944.469000  252.753000  277.628248  2.855527  4.381960  1.712960  0.992243  4.000000
Template:Ambox  738.824600  1632.240000  338.025000  467.619345  2.157544  2.952190  1.187380  0.787314  5.000000
text_or_opening_text_dismax  7309.665569  38935.200000  209.103000  6182.307296  29.835002  260.371000  0.557776  49.510657  255.000000
Module:Portal_bar  4541.746667  5796.560000  3457.200000  962.598690  2.773373  3.090570  2.484480  0.248243  3.000000

Need to decide what to do with these exactly. For English I can certainly manually review the lists, take everything that seems sane, build a combined model with all of them, and re-review what happens. I think we also might need T162711, because talk page categories are probably as interesting as the main page categories. They will probably also change things.

There are also non-exact matches. Especially for talk page categories we might want matches like 'Grade A' or some such, rather than strictly matching specific categories, to generalize things. Matching specific wikiproject tags regardless of quality might also be useful. I'm not sure how to generalize this beyond English though; our best bet might be to do something similar to ORES and work with communities to curate lists of sane things.
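
As a rough illustration of the non-exact matching idea, a sketch that counts categories containing quality-related substrings. The helper name and the pattern list are hypothetical, not a curated list:

```python
# Sketch (hypothetical helper): generalize quality categories by matching
# substrings like "good articles" instead of exact category names.
QUALITY_PATTERNS = ["good articles", "a-class", "featured"]

def quality_category_hits(categories):
    """Count categories whose normalized name contains any quality pattern."""
    hits = 0
    for cat in categories:
        name = cat.lower().replace("_", " ")
        if any(pat in name for pat in QUALITY_PATTERNS):
            hits += 1
    return hits

print(quality_category_hits(
    ["History_good_articles", "Living_people", "Military_history_A-Class_articles"]))  # prints 2
```

A real version would need per-wiki pattern lists, which is exactly the curation problem mentioned above.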

TJones added a comment. Edited Jul 6 2017, 7:28 PM

Thanks for all the data. (It imports into Libre Office very easily.) Those max/avg/min swings are huge—orders of magnitude!

Whether or not you should filter for things that seem sane isn't clear to me. Machine learning often doesn't seem sane. ;) What really matters is whether things are sufficiently stable to be predictive. I think the refresh period for the model might be more important. Something like category_All_Wikipedia_articles_written_in_American_English doesn't seem to have any theoretical value, but it might have real-world value, at least in the short term, because whoever is tagging that category is focusing on high-quality articles first, or something. If the model is refreshed often enough (for some value of "often enough") then the reason something is predictive doesn't matter too much, because the newer model will catch up with any shifts in predictive value. That also resolves the problem of dealing with languages and cultures we don't know.

I definitely agree that the talk page categories will be very valuable!

While I agree that matching on parts of categories, like "A Class" or "Good Articles", could boost the signal from the larger quality categories to the smaller ones, there's the problem of parsing them. There could be problems like "Good Articles for Deletion" (terrible category name, but you get the idea) in English, and there's the problem of languages we don't know (and some of those could have inflections of the words in titles that make parsing them even more annoying). So, maybe, it would make sense to try them without parsing and see how it goes. "History good articles" might be a significantly more valuable indicator than "Pokémon good articles". You could also try just the most obvious and easiest-to-extract non-exact matches to see how much they help over the full category names.

All very fair. I haven't automated the above selection of categories/templates, but I suppose I could. It's in code, but it's code I pasted into a REPL as opposed to building into mjolnir directly. I was thinking, though, that perhaps some of these things that were chosen are basically proxies for some more specific signal. For example:

  • category_Articles_with_unsourced_statements_from_[March|April|July]_2017 - Maybe this is a proxy for the age of articles, or for low-quality new articles? Might be worth trying to use age as a feature directly and seeing if xgboost finds less value in these particular categories. We don't currently store the age of the oldest revision in Elasticsearch, but I think it would be pretty easy to inject.
  • Template:Navbox - Almost all pages have this; pages that don't are perhaps lower-quality pages? Maybe using ORES wp10 scores would make this kind of proxy for page quality less useful?
  • Module:WPMILHIST_Infobox_style - Maybe this is a proxy for general 'military history' pages? Not sure what better signal we could come up with though, this is probably pretty reasonable.
  • category_Wikipedia_indefinitely_semi-protected_pages - This almost seems like a proxy for popularity or quality? Generally something is semi-protected due to spam, which probably happens on either controversial or popular articles.

As with anything though, I'm not sure that guessing at the actual signal behind each choice and trying to capture it more directly is going to be worthwhile. The above was relatively easy to do with just a computer stuffing numbers into a box and popping out an answer, whereas each individual signal will probably take a day or two of engineering time for an initial validation to see if it's of any use. And of course there is plenty of opportunity to be wrong about what signal it was trying to capture. Testing a signal and finding it to be useless is probably still useful information, but not as useful as a new feature that works.
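
If the unsourced-statements categories really are partly a proxy for article age, the direct feature could be sketched as days since the first revision. This assumes MediaWiki's 14-digit timestamp convention; the helper itself is hypothetical:

```python
# Sketch (assumption): article age in days, computed from the timestamp of
# the oldest revision in MediaWiki's 14-digit format, e.g. '20170405154900'.
from datetime import datetime, timezone

def page_age_days(first_revision_ts, now=None):
    created = datetime.strptime(first_revision_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - created).total_seconds() / 86400.0
```

This would only become a real feature once the first-revision timestamp is injected into the Elasticsearch docs, as suggested above.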

I think it's a great idea to look at English templates and categories like you have and try to get at some idea of a more generic signal. Article age and ORES score both seem like good ideas—but as you say, it may or may not be worth the engineering time to extract the info and test its utility in every conceivable case. OTOH, something like article age is universal across all wikis (that is, it's available and might be meaningful, unlike ORES scores, which are probably always meaningful, but not available for all wikis).

Some evaluations of current QI features, and the addition of wp10 and page_created_ts:

added / removed features              cv-test-ndcg@10  holdout-test-ndcg@10  diff from baseline  % possible improvement
baseline                              0.84791          0.84894                0.00000             0.00000%
-incoming_links                       0.84597          0.84710               -0.00184            -1.21540%
-popularity_score                     0.82103          0.82354               -0.02540            -16.81655%
-popularity_score, -incoming_links    0.81071          0.81256               -0.03638            -24.08052%
+page_created                         0.84886          0.85006                0.00112             0.74417%
+page_created, -incoming_links        0.84795          0.84874               -0.00020            -0.13183%
+page_created, -popularity_score      0.82305          0.82628               -0.02266            -15.00040%
+wp10                                 0.84865          0.84860               -0.00033            -0.22167%
+wp10, -incoming_links                0.84586          0.84734               -0.00160            -1.05778%
+wp10, -popularity_score              0.82344          0.82633               -0.02261            -14.96935%
+one_hot_wp10                         0.84803          0.84862               -0.00032            -0.20859%

Surprisingly, while both of our existing QI features are "good", the new features under evaluation have much smaller or no impact. wp10 doesn't look to be a useful feature, either as a weighted sum or as a one-hot encoding of article classes. The page-created date has some small utility, but relatively little. Adding it to our Elasticsearch docs would be relatively easy if we want to add it.
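
For reference, the "% possible improvement" column appears to be the holdout NDCG@10 delta expressed as a fraction of the headroom above the baseline. This reading is an assumption (not stated in the task), but it reproduces the table values to within rounding of the displayed numbers:

```python
# Assumption: "% possible improvement" = 100 * diff / (1 - baseline),
# where diff is the change in holdout-test-ndcg@10 relative to baseline.
def pct_possible_improvement(baseline_ndcg, test_ndcg):
    diff = test_ndcg - baseline_ndcg
    return 100.0 * diff / (1.0 - baseline_ndcg)

baseline = 0.84894  # holdout-test-ndcg@10 of the baseline model
# -popularity_score row; the table lists -16.81655% (small gap is rounding
# of the displayed ndcg values)
print(round(pct_possible_improvement(baseline, 0.82354), 2))  # prints -16.81
```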

EBernhardson added a comment. Edited Jul 12 2017, 4:18 AM

One more (or really two) new features: token counts of the query string under both the text_search and plain_search analyzers:

added / removed features              cv-test-ndcg@10  holdout-test-ndcg@10  diff from baseline  % possible improvement
baseline                              0.84791          0.84894               0.00000             0.00000%
+num_text_terms, +num_plain_terms     0.8494           0.85131               0.00237             1.568%

Will train the model with them individually to see if both are necessary.
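
For illustration, the shape of the two features, with trivial stand-in tokenizers. The real text_search and plain_search analysis chains live in Elasticsearch and also stem, fold case, and drop stopwords, which is why the two counts can differ:

```python
# Sketch only: whitespace tokenizers standing in for the real analyzers.
def num_terms(query, analyzer):
    return len(analyzer(query))

# plain_search stand-in: keeps every token
plain_search = lambda q: q.lower().split()
# text_search stand-in: also drops a toy stopword list
text_search = lambda q: [t for t in q.lower().split() if t not in {"the", "of", "a"}]

query = "the history of france"
print(num_terms(query, plain_search), num_terms(query, text_search))  # prints "4 2"
```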

Some other random ideas for features:

  • Prefix match against title/redirect. It seems likely that a prefix match would be more important than a general match against title. May want to do this per-term? As in if any of the terms in the search query match the first term of the title.
  • Age of last edit to page
  • Age of last significant content change to page (so, disregarding edits to fix typos, or bots that swap a link out to point at internet archive, etc)
  • Does the page contain multimedia, ratio of images to page length may be a quality signal (but wp10 turned out to not be useful, so maybe not as important or overshadowed by popularity?)
  • Estimated reading level of a page? Seems it would somehow need to be combined with an estimated reading level of the query as well
  • Number of sections? Average section length?
  • Could category matches be influenced by the size of the category? Matching a very narrow category might be a better match than a large category.
  • Search popularity of page (% of search clicks that go to that page)? Might provide a slightly different signal than overall popularity
  • Backlink anchor text could be useful to index. It's possible that the choices of words people use to link to an article are different from the title/redirects that already exist.
  • Popularity velocity, or change over time. A page with increasing popularity may be more important than one with decreasing popularity. Or it might all be noise.
  • Backlink co-occurrence, or the words that appear near a link to the article, may provide useful context.
  • Existence of a talk page for an article may be a quality signal (but again, wp10... hard to say)
  • Page dwell time, or how long an "average" reading session on the page is.
  • Does the page have an infobox?
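
The first idea in the list above (prefix match against the title) could be sketched roughly like this. The helper is hypothetical; a real implementation would live in an Elasticsearch query or feature definition:

```python
# Sketch: does any term in the search query prefix-match the first term
# of the title? (Per-term variant of the prefix-match idea above.)
def any_term_prefixes_title(query, title):
    title_terms = title.lower().split()
    if not title_terms:
        return False
    first = title_terms[0]
    return any(first.startswith(term) for term in query.lower().split())

print(any_term_prefixes_title("barack president", "Barack Obama"))  # prints True
```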

Perhaps also interesting are the ranking features available in the newly released Vespa, which Yahoo used for many internal ranking and recommendation tasks: http://docs.vespa.ai/documentation/reference/rank-features.html

Another idea for a feature—some similarity measure between the search term and the matching term in a document. Though after talking to @dcausse it sounds too expensive because there's no good way to map matched terms to specific query terms, and even pulling out the matched terms is a pain.

But it could help with unexpected stemmer bugs—for example, the Polish stemmer is statistical and has a few really weird errors. It might penalize some stemmings of related but dissimilar words (e.g., the English stemmer stems Dutch and Holland as dutch, and the Ukrainian stemmer groups жене and гнали)—but those are very rare.
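
If the per-term mapping problem were ever solved, the similarity measure itself could be as cheap as a surface-string ratio. A sketch using Python's difflib; the helper name is made up:

```python
# Sketch: cheap surface similarity between a query term and the indexed
# term it matched via stemming. A low score flags stemmer collisions of
# dissimilar words (e.g. "dutch" vs "holland", which the English stemmer
# collapses), which could then be penalized.
import difflib

def surface_similarity(query_term, matched_term):
    return difflib.SequenceMatcher(
        None, query_term.lower(), matched_term.lower()).ratio()

print(surface_similarity("dutch", "holland"))  # low ratio, a collision candidate
```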

EBernhardson closed this task as Resolved. Sep 10 2018, 6:06 PM

This is a pretty open-ended, never-ending ticket. The original need was met, though, and some of the above ideas made it into the production feature sets.