
Article country model
Open, Needs Triage, Public

Description

This epic task covers the work under KR WE2.1 to develop a model for assigning countries to Wikipedia articles, as a concrete step towards improving our topic infrastructure and the recommender systems that use it. It builds on the planning that started under T361637.

Hypothesis text:

If we build a country-level inference model for Wikipedia articles, we will be able to filter lists of articles to those about a specific region with >70% precision and >50% recall.
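To make the thresholds in the hypothesis concrete, here is a minimal sketch of how precision and recall could be computed for one region's filtered list. The function names, the example page titles, and the idea of comparing against a hand-labelled set are assumptions for illustration only, not the evaluation method actually used for the model.

```python
def precision_recall(predicted: set[str], actual: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a predicted set against a labelled set."""
    true_positives = len(predicted & actual)
    # Precision: share of the filtered list that really is about the region.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the region's articles that the filter managed to find.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def meets_hypothesis(precision: float, recall: float,
                     min_precision: float = 0.70, min_recall: float = 0.50) -> bool:
    """Check the >70% precision and >50% recall targets from the hypothesis."""
    return precision > min_precision and recall > min_recall


# Hypothetical data: titles the model assigned to a region vs. a labelled ground truth.
predicted = {"A", "B", "C", "D"}
actual = {"A", "B", "C", "E", "F"}

precision, recall = precision_recall(predicted, actual)
print(precision, recall, meets_hypothesis(precision, recall))
```

In this toy case three of the four predicted titles are correct and three of the five labelled titles are found, so both targets are met.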

Documentation: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Countries

Event Timeline

Isaac moved this task from Backlog to Epics on the Research board.

@Isaac, @EBernhardson, et al.: in "EventStreamConfig: Add mediawiki.article_country_prediction_change stream" (1112451) for T382295: Create event stream for article-country model-server hosted on LiftWing, the name of the new stream is mediawiki.article_..., with the emphasis on 'article' as the entity the prediction is for.

Other prediction streams use 'page' as the entity the prediction is for, and other (non prediction) streams use 'page' as well.

From https://wikitech.wikimedia.org/wiki/Data_Platform/Data_modeling_guidelines#page_vs._article

page is what is used in most places (including MW core) to refer to a MediaWiki page. There are a few places where article is used instead, although it appears to be almost exclusively in events (mobilewikiapparticlesuggestions, relatedarticles, and some properties on other events). This Design Best Practices page talks about the difference:

The word 'article' is meaningless on various wikis and the word 'page' should be used wherever possible. That said, sometimes the word page itself is too ambiguous - for example when describing pages in the main namespace on Wikipedia the word 'articles' would be more meaningful.

Unless you have a specific reason for distinguishing between main namespace pages (e.g. main namespace article counts), you should use page instead of article.

@kevinbazira mentioned that 'article' is being used here because that is what the model is named. Was this intentional?

From reading some context in T328276: Add outlink topic model predictions to CirrusSearch indices and T328899#8846959, perhaps 'article' is indeed the better term for this use?

If that is the case, perhaps mediawiki.page_outlink_topic_prediction_change.v1 should be renamed for consistency?

Thanks @Ottomata for checking - others should chime in but I'll leave my thoughts. In general I don't have a strong opinion because Research doesn't really ever work with the stream directly: we work a lot on the model itself and then also on the use-cases that having the predictions in the Search index etc. enables. So please choose whatever makes sense for your needs, but "article" is appropriate here:

for example when describing pages in the main namespace on Wikipedia the word 'articles' would be more meaningful.

Indeed, that's what is going on here. The model card actually calls this out explicitly (that the model is intended just for article namespace): https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Article_country#Users_and_uses

If that is the case, perhaps mediawiki.page_outlink_topic_prediction_change.v1 should be renamed for consistency?

Not sure how much work that is but yes, the same limitations apply there (it was designed for article namespace explicitly and shouldn't be applied to other namespaces)

Okay thanks. Lemme see if I can get some other brains on this, because we might want to change the Data Modeling Guidelines and make this article caveat more explicit. (I am torn though, because page is just so consistently used everywhere else.)

I solicited some feedback in Slack and got a lot of interesting points. I'll reflect them here so as not to lose them.

Andrew Otto

most datasets I know of (and have cared for) use page because it applies everywhere. The existing LiftWing prediction stream (perhaps sub-optimally?) uses 'page'. A new country prediction one is proposed to use 'article'.

[...] my question is kind of: in codified modeling (table, field names, etc.) is it good to use the term article? Yes it has a clear definition, which is good, but it does mean that we have to explain the difference between page and article, and if people are looking for data about one or the other, they have to know which one to search for

Alexandra Paskulin

Ah, I see. Yes, from a documentation perspective, using "page" consistently would be preferable. The distinction between page and article, while meaningful for experts, might be confusing to people looking for data

Isaac Johnson

that feels like a good reason to stick with page even though article is appropriate for this use-case. plus it sounds easier because that means changing a proposed stream vs. changing an already-implemented one :)

Xabriel Collazo Mojica

I vote article. I think the semantics are clear: when talking about pages with namespace_id=0, we are really talking about articles. A page is a technical concept from the implementation domain. An article is a business logic artifact. Yes, we have to explain that distinction to end users, perhaps in DataHub, perhaps elsewhere. But that distinction is clear. If we used page, we could have the opposite confusion: Can I have a 'Page country model' for files (namespace_id=6)? If we named it page, then maybe, right? But it seems the folks driving this work do not want to have that.

I think we should use the one that makes sense for the use case, and in this case it seems that is article, the business logic artifact.

Andrew Otto

counterpoint.  article_* tables could have primary key page_id.  Is that weird?

David Causse

namespace_id=0 is not always an article: Wikidata items are in NS_MAIN but I would not name them articles. When I read article I imagine a Wikipedia article, but I'm not sure there's a clear definition for what it is, and it might always remain a bit ambiguous.

Andrew Otto

From https://www.mediawiki.org/wiki/Manual:Article_count

By default, a page is counted as an article when:

  1. it's in the main namespace (meaning its title doesn't have a prefix like "User:" or "Talk:"),
  2. it contains at least one internal wikilink (e.g. the text "[[Main Page]]" creates a wikilink to the page titled "Main Page"), and
  3. it isn't a redirect.
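The three default conditions above can be sketched as a small predicate. This is only an approximation for illustration: the `Page` class and its fields are hypothetical, and the regular expression is a crude stand-in for however MediaWiki actually detects internal links (it would, for example, also match category or file links).

```python
import re
from dataclasses import dataclass

# Crude wikilink detector: any "[[...]]" span with no nested brackets (assumption).
WIKILINK = re.compile(r"\[\[[^\[\]]+\]\]")

MAIN_NAMESPACE = 0


@dataclass
class Page:
    """Hypothetical minimal page record, not a real MediaWiki or dataset schema."""
    namespace_id: int
    wikitext: str
    is_redirect: bool


def is_article_by_default(page: Page) -> bool:
    """Apply the three default article-count conditions quoted above."""
    in_main_namespace = page.namespace_id == MAIN_NAMESPACE
    has_wikilink = WIKILINK.search(page.wikitext) is not None
    return in_main_namespace and has_wikilink and not page.is_redirect


print(is_article_by_default(Page(0, "See [[Main Page]] for more.", False)))
print(is_article_by_default(Page(2, "See [[Main Page]] for more.", False)))
```

Note that nothing in this check looks at which wiki the page lives on, so a main-namespace page on any project would pass it as long as it contains a link and is not a redirect.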

Andrew Otto

So I guess according to that definition, wikidata items are articles, but not really according to the design best practices one? That one mentions more specifically 'wikipedia' pages. Also some context in T49841 Message for term for main namespace page, with plural support, where they are trying to be able to vary this term based on wiki for display purposes.

It kind of seems like 'article' is not really well defined, but people know what it is when it is used.

I think I'm leaning back towards using 'page' in all technical spaces, including dataset names. In user UIs where you can translate and change what you are showing to the user, article could make sense... including perhaps when describing models in model cards? Hm.

Mikhail Popov

I agree with Xabriel's reasoning for using article where the dataset is specific to articles as we know them, even if the primary key is page_id. (Not weird at all to me.) As a user browsing the datasets, if I was looking for a dataset containing predictions from model called "Article whatever", I would look for "article_whatever_predictions", not "page_whatever_predictions"

according to that definition, wikidata items are articles

MediaWiki is not Wikibase. Wikidata items are Wikidata items, because the MediaWiki notion/definition of an article shouldn't apply to them.

From the task:

Andrew: perhaps mediawiki.page_outlink_topic_prediction_change.v1 should be renamed for consistency?

Isaac: yes, the same limitations apply there (it was designed for article namespace explicitly and shouldn't be applied to other namespaces)

I would encourage renaming to mediawiki.article_outlink_topic_prediction_change.v1

Neil Shah-Quinn

Hmm! What a fascinating rabbit hole. Thanks @otto! My thoughts in short:

  1. "Content page" is the most proper cross-project term for places like [Manual:Article count](https://www.mediawiki.org/wiki/Manual:Article_count), but for certain projects like Wikipedia, "article" is a good synonym.
  2. It's helpful to use "article" in the name to communicate that the domain is limited to Wikipedia content pages.
  3. It's more precise to specify "*Wikipedia* articles" when that's the true domain, as is the case with the link-based topic model, although that's a little less important, as "articles" already gets us partway there.

My thoughts in long:
"Article" is sort of a synonym for content page ([Manual:Article count](https://www.mediawiki.org/wiki/Manual:Article_count) is definitely using it that way), but it does feel weird to call Commons file pages, Wikidata item pages, and Wikidata functions "articles". We could improve the definition somewhat by saying articles must have the wikitext content model, but that still includes Commons file pages. Also, it includes things like Wikisource's various types of content pages (main-namespace, page, author, and index; yes, Wikisource apparently has "page pages"!), which don't all seem to quite fit as "articles".

However, there's no problem with calling Wikipedia content pages "articles", and I think using it in the title of data sets to communicate the domain is helpful (even if, say, the dataset still has fields like page_id).

However, "article" isn't clearly restricted to just Wikipedia. Wikivoyage and Wikinews pages fit quite nicely as articles. So I'm *sort of* inclined to say the most proper title is mediawiki.wikipedia_article_outlink_topic_prediction_change.v1, because the model card explicitly says it's for Wikipedia content pages only. I'm open to the argument that it's excessively pedantic, *but* we're already specifying that it's for MediaWiki only, which in my opinion is a lot more pedantic (non-MediaWiki "articles", for example, are a theoretical concern, but Wikivoyage "articles" are very real).

Andrew Otto

Huh! Interesting. I'm not sure about putting the project (family? I get these mixed up!) in the name, but maybe... How might we name a dataset that works for both wikipedia and wiktionary and wikivoyage?

Mikhail Popov

In that case maybe we could call it multiproject_article_outlink_topic_prediction_change?

Andrew Otto

^ maybe just not use project at all?

i don't mind article vs page so much. the reasons make sense, so perhaps we should go with that... i think I still prefer page, but article is okay, esp. if documented.

but putting project in the name is going to be a slippery slope i think!

what if a dataset only works on enwiki?  should we put enwiki in the name too?  or only on enwiki and english wiktionary

My takeaway from this convo is that page is preferred, and article is okay to use in dataset names, but since the definition of article is slightly loose, use at your own risk!

I'd suggest not to put domains or projects in the dataset name. Maybe, but would need more discussion.

Thank you all!