Benchmark old and new model accuracy on new labeled data
Closed, Resolved · Public · 3 Estimated Story Points

Assigned To

Authored By

	Lydia_Pintscher
	Sep 2 2020, 10:50 AM

Description

Problem:
We would like to see whether the new model is better than the old model at predicting the quality of Items. To do this, we want to check how the old and new models perform on the new training data we collected.

Acceptance criteria:

we have an overview of how many Items the old model judges to be A/B/C/D/E class compared to the human judgement
we have an overview of how many Items the new model judges to be A/B/C/D/E class compared to the human judgement
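
One way to read these criteria is as a cross-tabulation of human labels against model predictions, one table per model. The following is a minimal sketch, assuming each labeled Item is available as a (human label, predicted label) pair. The `crosstab` and `accuracy` helpers and the example pairs are hypothetical and are not taken from the actual labeling data.

```python
from collections import Counter

CLASSES = ["A", "B", "C", "D", "E"]


def crosstab(pairs):
    """Count (human, predicted) label pairs into a 5x5 table.

    Rows are human labels, columns are model predictions, both in A..E order.
    """
    counts = Counter(pairs)
    return [[counts[(human, predicted)] for predicted in CLASSES]
            for human in CLASSES]


def accuracy(pairs):
    """Fraction of Items where the model prediction equals the human label."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(human == predicted for human, predicted in pairs) / len(pairs)


# Made-up pairs, for illustration only.
example = [("A", "A"), ("B", "C"), ("C", "C"), ("E", "D")]
table = crosstab(example)
for human, row in zip(CLASSES, table):
    print(human, row)
print("accuracy:", accuracy(example))
```

Running the same two helpers once with the old model's predictions and once with the new model's would give the two overviews side by side.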

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Ladsgroup	T261849 Benchmark old and new model accuracy on new labeled data
		Resolved		Ladsgroup	T261850 compare model accuracy with and without property suggester

Event Timeline

Lydia_Pintscher created this task. Sep 2 2020, 10:50 AM

Restricted Application added a subscriber: Aklapper. Sep 2 2020, 10:50 AM

Lydia_Pintscher set the point value for this task to 3. Sep 2 2020, 10:51 AM

GoranSMilovanovic subscribed. Sep 2 2020, 10:52 AM

Lydia_Pintscher moved this task from Backlog to Item Quality Scoring Improvement - Sprint 3 on the Item Quality Scoring Improvement board. Sep 2 2020, 11:11 AM

Lydia_Pintscher edited projects, added Item Quality Scoring Improvement (Item Quality Scoring Improvement - Sprint 3); removed Item Quality Scoring Improvement.

Ladsgroup claimed this task. Sep 8 2020, 11:14 PM

Ladsgroup moved this task from To Do to Doing on the Item Quality Scoring Improvement (Item Quality Scoring Improvement - Sprint 3) board.

Restricted Application added a project: User-Ladsgroup. Sep 8 2020, 11:14 PM

So I took a sample of 750 current Items in Wikidata. The sample is stratified, roughly 150 Items from each size range, because otherwise it would be mostly papers and the like. The query: https://quarry.wmflabs.org/query/47988. It is one tenth of the query I ran to build the new labeling campaign (that's why the query says 7500).
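
The actual sampling was done with the Quarry query linked above. Purely as an illustration of the idea, here is a sketch of size-stratified sampling in Python. The `stratified_sample` and `size_bucket` helpers, the `SIZE_EDGES` thresholds, and the (revision id, size) tuple layout are all assumptions made for this example and are not taken from the query.

```python
import random
from collections import defaultdict


def stratified_sample(items, bucket_of, per_bucket, seed=0):
    """Draw up to `per_bucket` items at random from each bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for item in items:
        buckets[bucket_of(item)].append(item)
    sample = []
    for key in sorted(buckets):
        members = buckets[key]
        # A bucket smaller than `per_bucket` contributes all of its members.
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample


# Hypothetical size thresholds in bytes; the real ranges are in the query.
SIZE_EDGES = [1_000, 5_000, 20_000, 100_000]


def size_bucket(item):
    """Map a (rev_id, size_in_bytes) tuple to a bucket index 0..4."""
    _rev_id, size = item
    return sum(size >= edge for edge in SIZE_EDGES)


# Made-up Items: (rev_id, size_in_bytes).
items = [(i, i * 100) for i in range(3000)]
sample = stratified_sample(items, size_bucket, per_bucket=150)
print(len(sample))
```

Without the per-bucket cap, a uniform random draw would be dominated by whichever size range holds the most Items, which is the problem the stratification avoids.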

Then I got predictions from three different models: (1) the old model, (2) the new model with the PropertySuggester (PS) feature, and (3) the new model without PS. In 215 cases (about 29% of the 750) the three models did not all agree. In 187 of those 215, the only disagreement was between the old model and the new ones (the two new models agreed with each other regardless of the PS feature). You can find the list in P12527#70026 (it's rather big), but the result looks good: for example, 967989666 was judged D by the old model and B by the new ones. In the remaining 28 cases (3.7%), the two new models disagreed with each other, so the PS feature changed the prediction. This is the list of those 28 cases:

Rev id	Old model	New model with PS	New model without PS	Other nerdy stuff
1245749762	C	B	C	{"old": {"A": 0.033922905209765215, "B": 0.3156294884755486, "C": 0.6357983498027796, "D": 0.011791632271694597, "E": 0.0028576242402119953}, "new_ps": {"A": 0.07506341154975878, "B": 0.5157792612102473, "C": 0.39929552483486047, "D": 0.00611359009974234, "E": 0.0037482123053910747}, "new_without_ps": {"A": 0.10080501376933476, "B": 0.4091825660012928, "C": 0.4778178499618486, "D": 0.007443348708474312, "E": 0.0047512215590494004}}
1268403491	A	A	B	{"old": {"A": 0.9581183283251842, "B": 0.028314702806298882, "C": 0.010957647415122833, "D": 0.0013961524223473987, "E": 0.0012131690310467787}, "new_ps": {"A": 0.47172335609710603, "B": 0.4534614216634016, "C": 0.05535875041401178, "D": 0.011371950117362214, "E": 0.008084521708118674}, "new_without_ps": {"A": 0.3623044878833989, "B": 0.5702558222515018, "C": 0.05067867816474142, "D": 0.009853985767714471, "E": 0.006907025932643356}}
1272228817	C	B	A	{"old": {"A": 0.05489149933518188, "B": 0.05466214769030068, "C": 0.8814578719623226, "D": 0.006414714358706056, "E": 0.002573766653488728}, "new_ps": {"A": 0.29595415449014023, "B": 0.35019429930247076, "C": 0.33307570056966107, "D": 0.012191086088356676, "E": 0.00858475954937127}, "new_without_ps": {"A": 0.34849674768953015, "B": 0.32551627851465303, "C": 0.30496550210767087, "D": 0.012390480540117786, "E": 0.008630991148028103}}
1032620898	E	D	E	{"old": {"A": 0.0008107712998579739, "B": 0.001213658252819447, "C": 0.011217407965526772, "D": 0.0507681680549962, "E": 0.9359899944267996}, "new_ps": {"A": 0.004039595060549861, "B": 0.004426804158760506, "C": 0.021114505241631394, "D": 0.7165181136601142, "E": 0.2539009818789441}, "new_without_ps": {"A": 0.0030297197261097805, "B": 0.0038494886036259317, "C": 0.012387228657753662, "D": 0.4631927631616361, "E": 0.5175407998508744}}
954223095	C	C	B	{"old": {"A": 0.007334705016875339, "B": 0.008463184659439892, "C": 0.8880210643634205, "D": 0.0914071181854589, "E": 0.004773927774805269}, "new_ps": {"A": 0.026242218233272874, "B": 0.3733216836274356, "C": 0.40173521200105566, "D": 0.19222363724692618, "E": 0.0064772488913097965}, "new_without_ps": {"A": 0.022162480661051785, "B": 0.39375421587841725, "C": 0.3787948617705969, "D": 0.19888459299483738, "E": 0.006403848695096642}}
1272813710	B	A	B	{"old": {"A": 0.01246413972025038, "B": 0.9294193476502699, "C": 0.051001211254504526, "D": 0.0039517721845516275, "E": 0.0031635291904234795}, "new_ps": {"A": 0.5637827527490271, "B": 0.40895553231316906, "C": 0.021699633403881608, "D": 0.0035228124820841796, "E": 0.002039269051838088}, "new_without_ps": {"A": 0.46526672188978885, "B": 0.49010554270623885, "C": 0.03529306142122966, "D": 0.006017466983745688, "E": 0.003317206998996975}}
1094470411	E	D	E	{"old": {"A": 0.001799059787524304, "B": 0.0044285456515436045, "C": 0.014795198545761869, "D": 0.21788385781578284, "E": 0.7610933381993874}, "new_ps": {"A": 0.002608884933697272, "B": 0.0038156659085305075, "C": 0.019117380902449396, "D": 0.5067512668939108, "E": 0.46770680136141196}, "new_without_ps": {"A": 0.0020935761860606398, "B": 0.0030094334832940807, "C": 0.01670644890854257, "D": 0.33417932005771844, "E": 0.6440112213643844}}
1258899373	C	B	C	{"old": {"A": 0.027592605961843895, "B": 0.24729986536546772, "C": 0.7156668028023997, "D": 0.006265339348660646, "E": 0.0031753865216278877}, "new_ps": {"A": 0.0887850929467022, "B": 0.5069119829849444, "C": 0.3910440149440535, "D": 0.008901892978921517, "E": 0.0043570161453782355}, "new_without_ps": {"A": 0.12461977688901812, "B": 0.42896504165786764, "C": 0.43263103622111726, "D": 0.009132891499186448, "E": 0.004651253732810652}}
1009069533	C	B	C	{"old": {"A": 0.010652801373842995, "B": 0.012962250794266099, "C": 0.8827737215649725, "D": 0.08877496829797407, "E": 0.004836257968944502}, "new_ps": {"A": 0.016824833645722508, "B": 0.48010818529555577, "C": 0.45313187560549095, "D": 0.04515763583422306, "E": 0.004777469619007798}, "new_without_ps": {"A": 0.018536280229151027, "B": 0.455438060812234, "C": 0.47808992978645676, "D": 0.04357186902695685, "E": 0.004363860145201423}}
1264438441	C	B	C	{"old": {"A": 0.015581587264649751, "B": 0.023813890068793222, "C": 0.7977250036891984, "D": 0.1566318318896204, "E": 0.006247687087738293}, "new_ps": {"A": 0.05428543394117957, "B": 0.49533484279979517, "C": 0.4148779979839913, "D": 0.030754548946219755, "E": 0.004747176328814198}, "new_without_ps": {"A": 0.04085257392083643, "B": 0.45309223235638896, "C": 0.4626493265729858, "D": 0.03850773325285288, "E": 0.004898133896935818}}
1270796650	E	C	E	{"old": {"A": 0.0008639944604346134, "B": 0.004217942152839602, "C": 0.11690199000605744, "D": 0.007117011592748679, "E": 0.8708990617879196}, "new_ps": {"A": 0.005275813356979018, "B": 0.01947495567613131, "C": 0.48483334779954296, "D": 0.04678105548514889, "E": 0.44363482768219786}, "new_without_ps": {"A": 0.005533287260085267, "B": 0.01749448990975197, "C": 0.4577040297258244, "D": 0.049068621566628036, "E": 0.47019957153771036}}
1131017749	E	D	E	{"old": {"A": 0.0009848502211236157, "B": 0.0006813469818803148, "C": 0.004775415297439066, "D": 0.01895671501830939, "E": 0.9746016724812476}, "new_ps": {"A": 0.004199606691032983, "B": 0.005154652328617433, "C": 0.023458153537280577, "D": 0.5999341975141927, "E": 0.3672533899288763}, "new_without_ps": {"A": 0.00335887071997231, "B": 0.0043850586484284835, "C": 0.020857878449322593, "D": 0.4395591609852192, "E": 0.5318390311970574}}
1271363787	C	A	B	{"old": {"A": 0.02942496428861146, "B": 0.05218274412657272, "C": 0.9127098367649166, "D": 0.003476874024927542, "E": 0.002205580794971511}, "new_ps": {"A": 0.3708185962706765, "B": 0.2898797059405617, "C": 0.32606039148697247, "D": 0.008031809307278825, "E": 0.005209496994510433}, "new_without_ps": {"A": 0.32952094643011637, "B": 0.34706459285310925, "C": 0.30914599745261706, "D": 0.008584917224562087, "E": 0.005683546039595178}}
1264542927	C	C	B	{"old": {"A": 0.02586796466215134, "B": 0.20357116749522028, "C": 0.7571945870114729, "D": 0.008973058550706863, "E": 0.004393222280448534}, "new_ps": {"A": 0.08708832182554961, "B": 0.327501774597967, "C": 0.5687873868471643, "D": 0.009615245950699312, "E": 0.007007270778619622}, "new_without_ps": {"A": 0.06418116636023653, "B": 0.4620929243561956, "C": 0.4608517719536428, "D": 0.007894750104224201, "E": 0.004979387225700937}}
1029258443	C	C	B	{"old": {"A": 0.010149469394569805, "B": 0.012231789345366822, "C": 0.86814651191582, "D": 0.10446830837547406, "E": 0.005003920968769144}, "new_ps": {"A": 0.014158879585372354, "B": 0.43494899101845963, "C": 0.4998663430756384, "D": 0.04576292595082224, "E": 0.005262860369707504}, "new_without_ps": {"A": 0.024742993960759994, "B": 0.47636374434061096, "C": 0.45242392837029205, "D": 0.04155625303622282, "E": 0.004913080292114336}}
1149836991	C	D	C	{"old": {"A": 0.0014008992393890138, "B": 0.0024979565023026664, "C": 0.9135074969497293, "D": 0.0797959983307796, "E": 0.002797648977799516}, "new_ps": {"A": 0.006252381988954234, "B": 0.09746723123792793, "C": 0.44169244587359746, "D": 0.4471422125926644, "E": 0.007445728306856005}, "new_without_ps": {"A": 0.006461530862610264, "B": 0.13992114818691723, "C": 0.43293633265441384, "D": 0.41420032098532567, "E": 0.006480667310733116}}
1214998592	C	C	B	{"old": {"A": 0.025664447126885103, "B": 0.13749998638684494, "C": 0.8292312393065496, "D": 0.004628384033433865, "E": 0.0029759431462865558}, "new_ps": {"A": 0.26197429404974704, "B": 0.29339928441368013, "C": 0.42935937484767434, "D": 0.009265784786554529, "E": 0.006001261902343899}, "new_without_ps": {"A": 0.2599566933104892, "B": 0.36592831382713226, "C": 0.358201539786054, "D": 0.009683009607640793, "E": 0.0062304434686837095}}
1152841992	D	C	D	{"old": {"A": 0.0011919129490775779, "B": 0.0021983989418554043, "C": 0.01824551834989107, "D": 0.9676691411557627, "E": 0.010695028603413132}, "new_ps": {"A": 0.004683427615721967, "B": 0.19375223932761254, "C": 0.39562065308898303, "D": 0.3946053033963124, "E": 0.011338376571370004}, "new_without_ps": {"A": 0.005813202998580016, "B": 0.26429252937941644, "C": 0.3249076347008479, "D": 0.3908998091856571, "E": 0.014086823735498515}}
1198820250	C	B	C	{"old": {"A": 0.012759151286699983, "B": 0.011187144049443445, "C": 0.8742128019232632, "D": 0.09695561861908604, "E": 0.0048852841215072905}, "new_ps": {"A": 0.060158910125219886, "B": 0.477214192911728, "C": 0.42524828186762326, "D": 0.032377089461557686, "E": 0.005001525633871391}, "new_without_ps": {"A": 0.04542632330570129, "B": 0.4203989834565925, "C": 0.4966423572475575, "D": 0.03296837975010725, "E": 0.004563956240041518}}
1064604178	C	C	B	{"old": {"A": 0.006077520615884296, "B": 0.007980121710391667, "C": 0.9145130881301586, "D": 0.06764298077793829, "E": 0.003786288765627031}, "new_ps": {"A": 0.009232797502505437, "B": 0.4246640835453037, "C": 0.512973503363046, "D": 0.04850367343621232, "E": 0.004625942152932565}, "new_without_ps": {"A": 0.012738436577971888, "B": 0.5166899773084483, "C": 0.42320990892764027, "D": 0.04347415624918416, "E": 0.003887520936755231}}
1189352608	C	C	D	{"old": {"A": 0.0031220152199667606, "B": 0.005285747723151848, "C": 0.5266479030425185, "D": 0.45494568426487736, "E": 0.009998649749485637}, "new_ps": {"A": 0.0044315022439300284, "B": 0.006285430209997785, "C": 0.5231689307130449, "D": 0.4533916932147463, "E": 0.01272244361828094}, "new_without_ps": {"A": 0.00476970985215096, "B": 0.006976610550126999, "C": 0.46354385782717217, "D": 0.5094034413901668, "E": 0.015306380380383025}}
1228411373	E	E	D	{"old": {"A": 0.0011212777949707094, "B": 0.0020005353942534254, "C": 0.011292140358027877, "D": 0.2131855132774196, "E": 0.7724005331753284}, "new_ps": {"A": 0.005535519248210102, "B": 0.009491700267960791, "C": 0.06241095012609311, "D": 0.43306715399659296, "E": 0.4894946763611432}, "new_without_ps": {"A": 0.0052299333368587015, "B": 0.009760535103259941, "C": 0.07410020406619343, "D": 0.4617167098522753, "E": 0.44919261764141266}}
1263750365	D	C	D	{"old": {"A": 0.0005432611445804824, "B": 0.006475163176499794, "C": 0.02312974940797309, "D": 0.9685738945714686, "E": 0.0012779316994782873}, "new_ps": {"A": 0.008685528050124628, "B": 0.02044201088817635, "C": 0.5024624799560203, "D": 0.4551612683385364, "E": 0.0132487127671423}, "new_without_ps": {"A": 0.008867357464731553, "B": 0.01745996251929092, "C": 0.44315978125332395, "D": 0.5183764373982008, "E": 0.012136461364452712}}
1149663201	D	C	D	{"old": {"A": 0.0012992975857483437, "B": 0.0025702320990851615, "C": 0.021972066628729792, "D": 0.9677441214181767, "E": 0.006414282268260026}, "new_ps": {"A": 0.004778751977293244, "B": 0.05734693063175027, "C": 0.48138667466853374, "D": 0.4456484227861222, "E": 0.010839219936300494}, "new_without_ps": {"A": 0.004650953223470606, "B": 0.0590975620334573, "C": 0.40904133813293764, "D": 0.5167026237948555, "E": 0.010507522815278877}}
1194285399	C	C	B	{"old": {"A": 0.006494333986695253, "B": 0.01040922274114759, "C": 0.8411640917990374, "D": 0.1371845452241501, "E": 0.004747806248969735}, "new_ps": {"A": 0.009845672877838623, "B": 0.38537949660764004, "C": 0.40159179531027256, "D": 0.19690351435321699, "E": 0.006279520851031891}, "new_without_ps": {"A": 0.012647386572943427, "B": 0.39516640260841585, "C": 0.383790516439133, "D": 0.20223501560421167, "E": 0.006160678775296092}}
1245752935	C	B	C	{"old": {"A": 0.03981085046905755, "B": 0.3187539891469367, "C": 0.6340017584070916, "D": 0.004915688274984957, "E": 0.002517713701929113}, "new_ps": {"A": 0.2423494890002473, "B": 0.4445594224880173, "C": 0.30239036294342864, "D": 0.00662052755322266, "E": 0.004080198015083957}, "new_without_ps": {"A": 0.2708974981945667, "B": 0.345375906458336, "C": 0.37060349633633727, "D": 0.007994018447832889, "E": 0.0051290805629272155}}
1083179082	C	C	B	{"old": {"A": 0.009159638474300078, "B": 0.011111512180356642, "C": 0.9058403125119452, "D": 0.06892238528942052, "E": 0.004966151543977758}, "new_ps": {"A": 0.020009595255595397, "B": 0.4645942756416388, "C": 0.4705818411827586, "D": 0.03908357410451043, "E": 0.005730713815496856}, "new_without_ps": {"A": 0.03346888860044858, "B": 0.5237381338488055, "C": 0.3993731142820039, "D": 0.03881257435117756, "E": 0.004607288917564308}}
1271774953	C	C	B	{"old": {"A": 0.013016157139260411, "B": 0.07666452942581528, "C": 0.8922154564064737, "D": 0.01439278685028885, "E": 0.003711070178161759}, "new_ps": {"A": 0.038098651108712814, "B": 0.36816357832250346, "C": 0.5818783389865151, "D": 0.007876755127342615, "E": 0.003982676454925909}, "new_without_ps": {"A": 0.032807692976973175, "B": 0.5035334401978344, "C": 0.4550614462507385, "D": 0.005641806092458687, "E": 0.0029556144819952233}}

The code that produced it: P12528
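
P12528 is not reproduced here. As a rough sketch of the comparison described above, the snippet below assumes that each model's class is simply the highest-probability entry in its score dictionary, which appears consistent with the rows in the table. The `predicted_class` and `agreement` helpers are hypothetical names, and the scores are those of revision 1245749762 rounded to four decimals.

```python
import json


def predicted_class(probabilities):
    """Return the class letter with the highest probability."""
    return max(probabilities, key=probabilities.get)


def agreement(scores):
    """Classify how the three models relate for one revision."""
    old = predicted_class(scores["old"])
    new_ps = predicted_class(scores["new_ps"])
    new_without_ps = predicted_class(scores["new_without_ps"])
    if new_ps != new_without_ps:
        return "ps_disagreement"  # the PS feature changed the prediction
    if old != new_ps:
        return "old_vs_new"       # new models agree, old model differs
    return "all_agree"


# Scores for revision 1245749762 from the table, rounded to 4 decimals.
row = json.loads("""
{"old": {"A": 0.0339, "B": 0.3156, "C": 0.6358, "D": 0.0118, "E": 0.0029},
 "new_ps": {"A": 0.0751, "B": 0.5158, "C": 0.3993, "D": 0.0061, "E": 0.0037},
 "new_without_ps": {"A": 0.1008, "B": 0.4092, "C": 0.4778, "D": 0.0074, "E": 0.0048}}
""")

classes = {name: predicted_class(probs) for name, probs in row.items()}
print(classes, agreement(row))
```

Counting the `agreement` result over all 750 sampled revisions would give the kind of split reported above (all agree, old versus new only, and disagreement caused by the PS feature).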

Michael closed subtask T261850: compare model accuracy with and without property suggester as Resolved. Sep 14 2020, 9:33 AM

Current state: Lydia needs to review. Amir will explain :D

\o/
