
Articlequality model for nlwiki doesn't seem to track images correctly.
Open, Needs Triage · Public

Description

If you run this request, it shows that we only count one image link in this revision of the article, but there are actually quite a few: https://ores.wikimedia.org/v3/scores/nlwiki/60819226/articlequality?features That undercount dramatically affects the predicted quality.

E.g., compare the prediction when the feature is overridden with revision.image_links=2: https://ores.wikimedia.org/v3/scores/nlwiki/60819226/articlequality?features&feature.revision.image_links=2
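
The two request URLs in the description differ only in the injected feature value. A minimal sketch of building both, so the baseline and overridden predictions can be compared side by side. The helper name `ores_score_url` and its parameters are hypothetical and not part of ORES or any client library; only the URL shape is taken from the links above.

```python
from urllib.parse import quote

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def ores_score_url(context, rev_id, model, injected=None):
    """Build a score URL that also asks for the feature values.

    `injected` maps feature names (without the "feature." prefix) to
    override values, mirroring the second URL in the description.
    This is a hypothetical helper, not an ORES client API.
    """
    params = ["features"]
    for name, value in (injected or {}).items():
        params.append(f"feature.{quote(name)}={quote(str(value))}")
    return f"{ORES_BASE}/{context}/{rev_id}/{model}?" + "&".join(params)


# Baseline request: whatever image_links value the service extracts itself.
baseline = ores_score_url("nlwiki", 60819226, "articlequality")

# Same revision, but with the image link count overridden by hand.
overridden = ores_score_url(
    "nlwiki", 60819226, "articlequality",
    injected={"revision.image_links": 2},
)

print(baseline)
print(overridden)
```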

Event Timeline


I think the right next step is to implement some tests to see if we detect the following image links:

[[Bestand:Stevie Wonder 1967 (1).jpg|thumb|In 1967 tijdens een repetitie voor een optreden in een [[TROS]]-programma]]
[[Bestand:Burt Bacharach - jam session.jpg|thumb|Stevie Wonder tijdens een optreden met [[Burt Bacharach]] in de jaren zestig]]
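
A rough sketch of what such a test could look like, using only the standard library. The regex below is a stand-in and not the pattern the articlequality or revscoring code actually uses, and the list of namespace prefixes ("Bestand", "Afbeelding", "File", "Image") is an assumption about what nlwiki accepts. A real test would call the actual feature instead of `count_image_links`, which is a hypothetical helper.

```python
import re

# Assumed file-namespace prefixes for nlwiki; adjust to match the real feature.
IMAGE_PREFIXES = ("Bestand", "Afbeelding", "File", "Image")

# Matches only the opening of a link whose target starts with one of the
# prefixes, so nested links inside a caption (e.g. [[TROS]]) are not counted.
IMAGE_LINK_RE = re.compile(
    r"\[\[\s*(?:" + "|".join(IMAGE_PREFIXES) + r")\s*:",
    re.IGNORECASE,
)


def count_image_links(wikitext):
    """Count links that point into the file namespace (stand-in logic)."""
    return len(IMAGE_LINK_RE.findall(wikitext))


# The two image links quoted above, verbatim.
SAMPLE = "\n".join([
    "[[Bestand:Stevie Wonder 1967 (1).jpg|thumb|In 1967 tijdens een "
    "repetitie voor een optreden in een [[TROS]]-programma]]",
    "[[Bestand:Burt Bacharach - jam session.jpg|thumb|Stevie Wonder tijdens "
    "een optreden met [[Burt Bacharach]] in de jaren zestig]]",
])


def test_detects_both_image_links():
    assert count_image_links(SAMPLE) == 2


def test_ignores_plain_wikilinks():
    assert count_image_links("[[TROS]] en [[Burt Bacharach]]") == 0


test_detects_both_image_links()
test_ignores_plain_wikilinks()
```

Both samples carry a nested wikilink in the caption, so they also exercise the case where a naive "match up to the first ]]" pattern would cut the link short.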

I can't seem to replicate the issue with the current version of the feature in the articlequality repo. When I run this same revision through the feature extractor, I get 5 image links rather than 1.

$ python
Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from articlequality.feature_lists import nlwiki
>>> from revscoring.extractors import api
>>> import mwapi
>>> extractor = api.Extractor(mwapi.Session("https://nl.wikipedia.org"))
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
>>> print("\n".join(str(v) for v in list(zip(nlwiki.wp10, extractor.extract(60819226, nlwiki.wp10)))))
(<feature.dutch.stemmed.revision.stems_length>, 22163)
(<feature.(dutch.stemmed.revision.stems_length / max(wikitext.revision.content_chars, 1))>, 0.8959453450297126)
(<feature.revision.image_links>, 5.0)
(<feature.(revision.image_links / max(wikitext.revision.content_chars, 1))>, 0.00020212636940615273)
(<feature.revision.category_links>, 10.0)
(<feature.(revision.category_links / max(wikitext.revision.content_chars, 1))>, 0.00040425273881230546)
(<feature.len(<datasource.dutch.dictionary.revision.dict_words>)>, 3729.0)
(<feature.(len(<datasource.dutch.dictionary.revision.dict_words>) / max(len(<datasource.wikitext.revision.words>), 1))>, 0.802453195610071)
(<feature.enwiki.revision.paragraphs_without_refs_total_length>, 3369.0)
(<feature.(enwiki.revision.paragraphs_without_refs_total_length / max(wikitext.revision.content_chars, 1))>, 0.1361927477058657)
(<feature.nlwiki.revision.cn_templates>, 0.0)
(<feature.(nlwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))>, 0.0)
(<feature.nlwiki.revision.infobox_templates>, 1.0)
(<feature.(nlwiki.revision.infobox_templates / max(wikitext.revision.content_chars, 1))>, 4.042527388123055e-05)
(<feature.wikitext.revision.chars>, 39846.0)
(<feature.wikitext.revision.content_chars>, 24737.0)
(<feature.wikitext.revision.ref_tags>, 84.0)
(<feature.(wikitext.revision.ref_tags / max(wikitext.revision.content_chars, 1))>, 0.0033957230060233656)
(<feature.wikitext.revision.wikilinks>, 269.0)
(<feature.(wikitext.revision.wikilinks / max(wikitext.revision.content_chars, 1))>, 0.010874398674051017)
(<feature.wikitext.revision.external_links>, 48.0)
(<feature.(wikitext.revision.external_links / max(wikitext.revision.content_chars, 1))>, 0.0019404131462990662)
(<feature.wikitext.revision.headings_by_level(2)>, 6.0)
(<feature.(wikitext.revision.headings_by_level(2) / max(wikitext.revision.content_chars, 1))>, 0.00024255164328738327)
(<feature.wikitext.revision.headings_by_level(3)>, 8.0)
(<feature.(wikitext.revision.headings_by_level(3) / max(wikitext.revision.content_chars, 1))>, 0.0003234021910498444)
(<feature.wikitext.revision.list_items>, 36.0)
(<feature.(wikitext.revision.list_items / max(wikitext.revision.content_chars, 1))>, 0.0014553098597242997)
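
As a quick sanity check on the output above, the normalized feature is just the raw count divided by max(content_chars, 1). A small sketch, using only numbers from this transcript plus the single image link that the ORES response is described as reporting; `per_content_char` is a hypothetical helper name, not a revscoring function.

```python
def per_content_char(count, content_chars):
    """Mirror the `x / max(wikitext.revision.content_chars, 1)` features."""
    return count / max(content_chars, 1)


CONTENT_CHARS = 24737.0  # wikitext.revision.content_chars from the output above

# Local extraction: 5 image links.
local = per_content_char(5.0, CONTENT_CHARS)

# Value the model would see if only 1 image link were counted.
undercounted = per_content_char(1.0, CONTENT_CHARS)

print(local)         # agrees with the 0.000202... value logged above
print(undercounted)  # one fifth of the local value
```

Both the raw count and the normalized value feed the model, so an undercount skews two inputs at once.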

Could it be my parser version?

>>> import mwparserfromhell
>>> mwparserfromhell.__version__
'0.5.4'

That is old, and we specifically push a newer version to prod. After upgrading the parser, I ran the same extraction again:

$ python
Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from articlequality.feature_lists import nlwiki
>>> from revscoring.extractors import api
>>> import mwapi
>>> extractor = api.Extractor(mwapi.Session("https://nl.wikipedia.org"))
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
>>> print("\n".join(str(v) for v in list(zip(nlwiki.wp10, extractor.extract(60819226, nlwiki.wp10)))))
<snip>
(<feature.revision.image_links>, 5.0)
(<feature.(revision.image_links / max(wikitext.revision.content_chars, 1))>, 0.00020212636940615273)
<snip>
>>> import mwparserfromhell
>>> mwparserfromhell.__version__
'0.6.4'

Nope. That didn't do it.

That's all the time I have now. I'll do some more exploration later.

@Halfak: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!