
Benchmark old and new model accuracy on new labeled data
Closed, Resolved · Public · 3 Estimated Story Points

Description

Problem:
We would like to see whether the new model predicts the quality of Items better than the old model. To do this, we want to check how the old and new models perform on the new training data we collected.

Acceptance criteria:

  • we have an overview of how many Items the old model judges to be A/B/C/D/E class compared to the human judgement
  • we have an overview of how many Items the new model judges to be A/B/C/D/E class compared to the human judgement
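A minimal sketch of how such an overview could be computed. It assumes the human labels and each model's predictions are available as dicts mapping a revision ID to a class letter (a hypothetical input format, not necessarily what the labeling campaign exports). It counts classes on each side, builds a 5x5 human-vs-model table, and reports the share of items on the diagonal. Calling it once for the old model and once for the new model would cover both criteria.

```python
from collections import Counter

CLASSES = ["A", "B", "C", "D", "E"]


def overview(human, model):
    """Compare one model's classes against the human labels.

    `human` and `model` map a revision ID to a class letter "A".."E"
    (hypothetical input format). Only IDs present in both are compared.
    Returns (human_counts, model_counts, matrix, accuracy), where
    matrix[i][j] counts items humans put in CLASSES[i] and the model
    put in CLASSES[j].
    """
    shared = human.keys() & model.keys()
    human_counts = Counter(human[r] for r in shared)
    model_counts = Counter(model[r] for r in shared)
    pairs = Counter((human[r], model[r]) for r in shared)
    matrix = [[pairs[(h, m)] for m in CLASSES] for h in CLASSES]
    # Diagonal = items where the model agrees with the human judgement.
    agreed = sum(matrix[i][i] for i in range(len(CLASSES)))
    accuracy = agreed / len(shared) if shared else 0.0
    return human_counts, model_counts, matrix, accuracy


def print_matrix(matrix):
    """Print rows as human classes and columns as model classes."""
    print("human\\model " + " ".join(f"{c:>4}" for c in CLASSES))
    for h, row in zip(CLASSES, matrix):
        print(f"{h:<11} " + " ".join(f"{n:>4}" for n in row))
```

Items that only one side has labeled are skipped rather than counted as errors, so the counts always refer to the same set of Items.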

Event Timeline

So I took a sample of 750 current items in Wikidata. It is a stratified sample, 150 items from each size range; otherwise it would be mostly papers and similar items. The query: https://quarry.wmflabs.org/query/47988. This is one tenth of the query I ran to build a new labeling campaign (that's why it says 7500).
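The stratification can be sketched like this, assuming a list of (item ID, size) pairs and a list of size ranges. The size measure and the range boundaries below are made-up placeholders; the real ones are defined in the Quarry query. Drawing a fixed number per range keeps one very common kind of item, such as papers, from dominating the sample.

```python
import random


def stratified_sample(items, ranges, per_range, seed=0):
    """Draw `per_range` item IDs uniformly at random from each size range.

    items: list of (item_id, size) pairs (hypothetical format).
    ranges: list of half-open (low, high) size ranges.
    A fixed seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    sample = []
    for low, high in ranges:
        stratum = [item_id for item_id, size in items if low <= size < high]
        if len(stratum) < per_range:
            raise ValueError(f"only {len(stratum)} items in [{low}, {high})")
        sample.extend(rng.sample(stratum, per_range))
    return sample


# Illustrative only: five made-up size ranges x 150 items = 750 items.
RANGES = [(0, 1_000), (1_000, 5_000), (5_000, 10_000), (10_000, 50_000), (50_000, 10**9)]
```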

Then I got predictions from three different models: (1) the old model, (2) the new model with the PropertySuggester (PS) feature, and (3) the new model without PS. In 215 cases (about 29%), the three models did not all agree. Of those 215, 187 were disagreements between the old model and the new ones only (the two new models agreed regardless of the PS feature). You can find them in P12527#70026 (it's rather big), but the result looks good: for example, revision 967989666 was judged D by the old model and B by the new ones. The remaining 28 cases (3.7%) were disagreements between the new model with and without the PS feature. This is the list of those 28 cases:

| Rev ID | Old model | New model with PS | New model without PS | Class probabilities (other nerdy stuff) |
| --- | --- | --- | --- | --- |
| 1245749762 | C | B | C | {"old": {"A": 0.033922905209765215, "B": 0.3156294884755486, "C": 0.6357983498027796, "D": 0.011791632271694597, "E": 0.0028576242402119953}, "new_ps": {"A": 0.07506341154975878, "B": 0.5157792612102473, "C": 0.39929552483486047, "D": 0.00611359009974234, "E": 0.0037482123053910747}, "new_without_ps": {"A": 0.10080501376933476, "B": 0.4091825660012928, "C": 0.4778178499618486, "D": 0.007443348708474312, "E": 0.0047512215590494004}} |
| 1268403491 | A | A | B | {"old": {"A": 0.9581183283251842, "B": 0.028314702806298882, "C": 0.010957647415122833, "D": 0.0013961524223473987, "E": 0.0012131690310467787}, "new_ps": {"A": 0.47172335609710603, "B": 0.4534614216634016, "C": 0.05535875041401178, "D": 0.011371950117362214, "E": 0.008084521708118674}, "new_without_ps": {"A": 0.3623044878833989, "B": 0.5702558222515018, "C": 0.05067867816474142, "D": 0.009853985767714471, "E": 0.006907025932643356}} |
| 1272228817 | C | B | A | {"old": {"A": 0.05489149933518188, "B": 0.05466214769030068, "C": 0.8814578719623226, "D": 0.006414714358706056, "E": 0.002573766653488728}, "new_ps": {"A": 0.29595415449014023, "B": 0.35019429930247076, "C": 0.33307570056966107, "D": 0.012191086088356676, "E": 0.00858475954937127}, "new_without_ps": {"A": 0.34849674768953015, "B": 0.32551627851465303, "C": 0.30496550210767087, "D": 0.012390480540117786, "E": 0.008630991148028103}} |
| 1032620898 | E | D | E | {"old": {"A": 0.0008107712998579739, "B": 0.001213658252819447, "C": 0.011217407965526772, "D": 0.0507681680549962, "E": 0.9359899944267996}, "new_ps": {"A": 0.004039595060549861, "B": 0.004426804158760506, "C": 0.021114505241631394, "D": 0.7165181136601142, "E": 0.2539009818789441}, "new_without_ps": {"A": 0.0030297197261097805, "B": 0.0038494886036259317, "C": 0.012387228657753662, "D": 0.4631927631616361, "E": 0.5175407998508744}} |
| 954223095 | C | C | B | {"old": {"A": 0.007334705016875339, "B": 0.008463184659439892, "C": 0.8880210643634205, "D": 0.0914071181854589, "E": 0.004773927774805269}, "new_ps": {"A": 0.026242218233272874, "B": 0.3733216836274356, "C": 0.40173521200105566, "D": 0.19222363724692618, "E": 0.0064772488913097965}, "new_without_ps": {"A": 0.022162480661051785, "B": 0.39375421587841725, "C": 0.3787948617705969, "D": 0.19888459299483738, "E": 0.006403848695096642}} |
| 1272813710 | B | A | B | {"old": {"A": 0.01246413972025038, "B": 0.9294193476502699, "C": 0.051001211254504526, "D": 0.0039517721845516275, "E": 0.0031635291904234795}, "new_ps": {"A": 0.5637827527490271, "B": 0.40895553231316906, "C": 0.021699633403881608, "D": 0.0035228124820841796, "E": 0.002039269051838088}, "new_without_ps": {"A": 0.46526672188978885, "B": 0.49010554270623885, "C": 0.03529306142122966, "D": 0.006017466983745688, "E": 0.003317206998996975}} |
| 1094470411 | E | D | E | {"old": {"A": 0.001799059787524304, "B": 0.0044285456515436045, "C": 0.014795198545761869, "D": 0.21788385781578284, "E": 0.7610933381993874}, "new_ps": {"A": 0.002608884933697272, "B": 0.0038156659085305075, "C": 0.019117380902449396, "D": 0.5067512668939108, "E": 0.46770680136141196}, "new_without_ps": {"A": 0.0020935761860606398, "B": 0.0030094334832940807, "C": 0.01670644890854257, "D": 0.33417932005771844, "E": 0.6440112213643844}} |
| 1258899373 | C | B | C | {"old": {"A": 0.027592605961843895, "B": 0.24729986536546772, "C": 0.7156668028023997, "D": 0.006265339348660646, "E": 0.0031753865216278877}, "new_ps": {"A": 0.0887850929467022, "B": 0.5069119829849444, "C": 0.3910440149440535, "D": 0.008901892978921517, "E": 0.0043570161453782355}, "new_without_ps": {"A": 0.12461977688901812, "B": 0.42896504165786764, "C": 0.43263103622111726, "D": 0.009132891499186448, "E": 0.004651253732810652}} |
| 1009069533 | C | B | C | {"old": {"A": 0.010652801373842995, "B": 0.012962250794266099, "C": 0.8827737215649725, "D": 0.08877496829797407, "E": 0.004836257968944502}, "new_ps": {"A": 0.016824833645722508, "B": 0.48010818529555577, "C": 0.45313187560549095, "D": 0.04515763583422306, "E": 0.004777469619007798}, "new_without_ps": {"A": 0.018536280229151027, "B": 0.455438060812234, "C": 0.47808992978645676, "D": 0.04357186902695685, "E": 0.004363860145201423}} |
| 1264438441 | C | B | C | {"old": {"A": 0.015581587264649751, "B": 0.023813890068793222, "C": 0.7977250036891984, "D": 0.1566318318896204, "E": 0.006247687087738293}, "new_ps": {"A": 0.05428543394117957, "B": 0.49533484279979517, "C": 0.4148779979839913, "D": 0.030754548946219755, "E": 0.004747176328814198}, "new_without_ps": {"A": 0.04085257392083643, "B": 0.45309223235638896, "C": 0.4626493265729858, "D": 0.03850773325285288, "E": 0.004898133896935818}} |
| 1270796650 | E | C | E | {"old": {"A": 0.0008639944604346134, "B": 0.004217942152839602, "C": 0.11690199000605744, "D": 0.007117011592748679, "E": 0.8708990617879196}, "new_ps": {"A": 0.005275813356979018, "B": 0.01947495567613131, "C": 0.48483334779954296, "D": 0.04678105548514889, "E": 0.44363482768219786}, "new_without_ps": {"A": 0.005533287260085267, "B": 0.01749448990975197, "C": 0.4577040297258244, "D": 0.049068621566628036, "E": 0.47019957153771036}} |
| 1131017749 | E | D | E | {"old": {"A": 0.0009848502211236157, "B": 0.0006813469818803148, "C": 0.004775415297439066, "D": 0.01895671501830939, "E": 0.9746016724812476}, "new_ps": {"A": 0.004199606691032983, "B": 0.005154652328617433, "C": 0.023458153537280577, "D": 0.5999341975141927, "E": 0.3672533899288763}, "new_without_ps": {"A": 0.00335887071997231, "B": 0.0043850586484284835, "C": 0.020857878449322593, "D": 0.4395591609852192, "E": 0.5318390311970574}} |
| 1271363787 | C | A | B | {"old": {"A": 0.02942496428861146, "B": 0.05218274412657272, "C": 0.9127098367649166, "D": 0.003476874024927542, "E": 0.002205580794971511}, "new_ps": {"A": 0.3708185962706765, "B": 0.2898797059405617, "C": 0.32606039148697247, "D": 0.008031809307278825, "E": 0.005209496994510433}, "new_without_ps": {"A": 0.32952094643011637, "B": 0.34706459285310925, "C": 0.30914599745261706, "D": 0.008584917224562087, "E": 0.005683546039595178}} |
| 1264542927 | C | C | B | {"old": {"A": 0.02586796466215134, "B": 0.20357116749522028, "C": 0.7571945870114729, "D": 0.008973058550706863, "E": 0.004393222280448534}, "new_ps": {"A": 0.08708832182554961, "B": 0.327501774597967, "C": 0.5687873868471643, "D": 0.009615245950699312, "E": 0.007007270778619622}, "new_without_ps": {"A": 0.06418116636023653, "B": 0.4620929243561956, "C": 0.4608517719536428, "D": 0.007894750104224201, "E": 0.004979387225700937}} |
| 1029258443 | C | C | B | {"old": {"A": 0.010149469394569805, "B": 0.012231789345366822, "C": 0.86814651191582, "D": 0.10446830837547406, "E": 0.005003920968769144}, "new_ps": {"A": 0.014158879585372354, "B": 0.43494899101845963, "C": 0.4998663430756384, "D": 0.04576292595082224, "E": 0.005262860369707504}, "new_without_ps": {"A": 0.024742993960759994, "B": 0.47636374434061096, "C": 0.45242392837029205, "D": 0.04155625303622282, "E": 0.004913080292114336}} |
| 1149836991 | C | D | C | {"old": {"A": 0.0014008992393890138, "B": 0.0024979565023026664, "C": 0.9135074969497293, "D": 0.0797959983307796, "E": 0.002797648977799516}, "new_ps": {"A": 0.006252381988954234, "B": 0.09746723123792793, "C": 0.44169244587359746, "D": 0.4471422125926644, "E": 0.007445728306856005}, "new_without_ps": {"A": 0.006461530862610264, "B": 0.13992114818691723, "C": 0.43293633265441384, "D": 0.41420032098532567, "E": 0.006480667310733116}} |
| 1214998592 | C | C | B | {"old": {"A": 0.025664447126885103, "B": 0.13749998638684494, "C": 0.8292312393065496, "D": 0.004628384033433865, "E": 0.0029759431462865558}, "new_ps": {"A": 0.26197429404974704, "B": 0.29339928441368013, "C": 0.42935937484767434, "D": 0.009265784786554529, "E": 0.006001261902343899}, "new_without_ps": {"A": 0.2599566933104892, "B": 0.36592831382713226, "C": 0.358201539786054, "D": 0.009683009607640793, "E": 0.0062304434686837095}} |
| 1152841992 | D | C | D | {"old": {"A": 0.0011919129490775779, "B": 0.0021983989418554043, "C": 0.01824551834989107, "D": 0.9676691411557627, "E": 0.010695028603413132}, "new_ps": {"A": 0.004683427615721967, "B": 0.19375223932761254, "C": 0.39562065308898303, "D": 0.3946053033963124, "E": 0.011338376571370004}, "new_without_ps": {"A": 0.005813202998580016, "B": 0.26429252937941644, "C": 0.3249076347008479, "D": 0.3908998091856571, "E": 0.014086823735498515}} |
| 1198820250 | C | B | C | {"old": {"A": 0.012759151286699983, "B": 0.011187144049443445, "C": 0.8742128019232632, "D": 0.09695561861908604, "E": 0.0048852841215072905}, "new_ps": {"A": 0.060158910125219886, "B": 0.477214192911728, "C": 0.42524828186762326, "D": 0.032377089461557686, "E": 0.005001525633871391}, "new_without_ps": {"A": 0.04542632330570129, "B": 0.4203989834565925, "C": 0.4966423572475575, "D": 0.03296837975010725, "E": 0.004563956240041518}} |
| 1064604178 | C | C | B | {"old": {"A": 0.006077520615884296, "B": 0.007980121710391667, "C": 0.9145130881301586, "D": 0.06764298077793829, "E": 0.003786288765627031}, "new_ps": {"A": 0.009232797502505437, "B": 0.4246640835453037, "C": 0.512973503363046, "D": 0.04850367343621232, "E": 0.004625942152932565}, "new_without_ps": {"A": 0.012738436577971888, "B": 0.5166899773084483, "C": 0.42320990892764027, "D": 0.04347415624918416, "E": 0.003887520936755231}} |
| 1189352608 | C | C | D | {"old": {"A": 0.0031220152199667606, "B": 0.005285747723151848, "C": 0.5266479030425185, "D": 0.45494568426487736, "E": 0.009998649749485637}, "new_ps": {"A": 0.0044315022439300284, "B": 0.006285430209997785, "C": 0.5231689307130449, "D": 0.4533916932147463, "E": 0.01272244361828094}, "new_without_ps": {"A": 0.00476970985215096, "B": 0.006976610550126999, "C": 0.46354385782717217, "D": 0.5094034413901668, "E": 0.015306380380383025}} |
| 1228411373 | E | E | D | {"old": {"A": 0.0011212777949707094, "B": 0.0020005353942534254, "C": 0.011292140358027877, "D": 0.2131855132774196, "E": 0.7724005331753284}, "new_ps": {"A": 0.005535519248210102, "B": 0.009491700267960791, "C": 0.06241095012609311, "D": 0.43306715399659296, "E": 0.4894946763611432}, "new_without_ps": {"A": 0.0052299333368587015, "B": 0.009760535103259941, "C": 0.07410020406619343, "D": 0.4617167098522753, "E": 0.44919261764141266}} |
| 1263750365 | D | C | D | {"old": {"A": 0.0005432611445804824, "B": 0.006475163176499794, "C": 0.02312974940797309, "D": 0.9685738945714686, "E": 0.0012779316994782873}, "new_ps": {"A": 0.008685528050124628, "B": 0.02044201088817635, "C": 0.5024624799560203, "D": 0.4551612683385364, "E": 0.0132487127671423}, "new_without_ps": {"A": 0.008867357464731553, "B": 0.01745996251929092, "C": 0.44315978125332395, "D": 0.5183764373982008, "E": 0.012136461364452712}} |
| 1149663201 | D | C | D | {"old": {"A": 0.0012992975857483437, "B": 0.0025702320990851615, "C": 0.021972066628729792, "D": 0.9677441214181767, "E": 0.006414282268260026}, "new_ps": {"A": 0.004778751977293244, "B": 0.05734693063175027, "C": 0.48138667466853374, "D": 0.4456484227861222, "E": 0.010839219936300494}, "new_without_ps": {"A": 0.004650953223470606, "B": 0.0590975620334573, "C": 0.40904133813293764, "D": 0.5167026237948555, "E": 0.010507522815278877}} |
| 1194285399 | C | C | B | {"old": {"A": 0.006494333986695253, "B": 0.01040922274114759, "C": 0.8411640917990374, "D": 0.1371845452241501, "E": 0.004747806248969735}, "new_ps": {"A": 0.009845672877838623, "B": 0.38537949660764004, "C": 0.40159179531027256, "D": 0.19690351435321699, "E": 0.006279520851031891}, "new_without_ps": {"A": 0.012647386572943427, "B": 0.39516640260841585, "C": 0.383790516439133, "D": 0.20223501560421167, "E": 0.006160678775296092}} |
| 1245752935 | C | B | C | {"old": {"A": 0.03981085046905755, "B": 0.3187539891469367, "C": 0.6340017584070916, "D": 0.004915688274984957, "E": 0.002517713701929113}, "new_ps": {"A": 0.2423494890002473, "B": 0.4445594224880173, "C": 0.30239036294342864, "D": 0.00662052755322266, "E": 0.004080198015083957}, "new_without_ps": {"A": 0.2708974981945667, "B": 0.345375906458336, "C": 0.37060349633633727, "D": 0.007994018447832889, "E": 0.0051290805629272155}} |
| 1083179082 | C | C | B | {"old": {"A": 0.009159638474300078, "B": 0.011111512180356642, "C": 0.9058403125119452, "D": 0.06892238528942052, "E": 0.004966151543977758}, "new_ps": {"A": 0.020009595255595397, "B": 0.4645942756416388, "C": 0.4705818411827586, "D": 0.03908357410451043, "E": 0.005730713815496856}, "new_without_ps": {"A": 0.03346888860044858, "B": 0.5237381338488055, "C": 0.3993731142820039, "D": 0.03881257435117756, "E": 0.004607288917564308}} |
| 1271774953 | C | C | B | {"old": {"A": 0.013016157139260411, "B": 0.07666452942581528, "C": 0.8922154564064737, "D": 0.01439278685028885, "E": 0.003711070178161759}, "new_ps": {"A": 0.038098651108712814, "B": 0.36816357832250346, "C": 0.5818783389865151, "D": 0.007876755127342615, "E": 0.003982676454925909}, "new_without_ps": {"A": 0.032807692976973175, "B": 0.5035334401978344, "C": 0.4550614462507385, "D": 0.005641806092458687, "E": 0.0029556144819952233}} |
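The bucketing described above (all three agree, only the old model differs, or the two new models differ) can be reproduced from the probability JSON in the last column of the table. This is a sketch, not the code in P12528: it assumes the predicted class is simply the highest-probability class, which matches every row in the table above, and the function names are made up for illustration.

```python
import json
from collections import Counter

MODELS = ("old", "new_ps", "new_without_ps")


def predicted_class(probs):
    """The class letter with the highest probability (first one wins on ties)."""
    return max(probs, key=probs.get)


def disagreement_type(scores_json):
    """Bucket one revision from its JSON probabilities (the table's last column)."""
    scores = json.loads(scores_json)
    old, ps, no_ps = (predicted_class(scores[m]) for m in MODELS)
    if old == ps == no_ps:
        return "agree"
    if ps == no_ps:
        # Both new models agree; only the old model differs (187 cases above).
        return "old_vs_new"
    # The new models with and without PS disagree (28 cases above).
    return "ps_vs_no_ps"


def summarize(rows):
    """rows maps rev ID -> JSON string; returns bucket -> (count, percent of rows)."""
    counts = Counter(disagreement_type(s) for s in rows.values())
    return {k: (n, round(100 * n / len(rows), 1)) for k, n in counts.items()}
```

Running `summarize` over all 750 sampled revisions should give the same split as the comment (187 and 28 out of 750), provided the probabilities come from the same three models.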

The code that produced it: P12528

Current state: Lydia needs to review. Amir will explain :D