Description

This could be done either by tracking the usage of stub templates?
Categories?
Page length?
Event Timeline
Top 20 Wikipedias:
enwiki
svwiki
cebwiki
dewiki
nlwiki
frwiki
ruwiki
warwiki
itwiki
eswiki
plwiki
viwiki
jawiki
ptwiki
zhwiki
ukwiki
cawiki
fawiki
nowiki
shwiki
MediaWiki itself has a stub definition, no? It should be the pages that don't count toward the article count on Special:Version. Can we not just use that?
stubthreshold is only a user preference, which wouldn't work for this.
The page and content page counts on Special:Statistics (which I assume Lydia means) do not distinguish stubs from non-stubs.
The content page count is simply the count of pages that are not redirects and are in content namespaces.
The page count is just SELECT COUNT(*) FROM page.
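For reference, a minimal sketch of reading those counts from the action API (these are the same totals Special:Statistics shows; any wiki's api.php works):

```python
import requests

# Fetch the raw page/article totals via the API.
# Nothing here knows anything about stubs.
API = "https://nl.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "query",
    "meta": "siteinfo",
    "siprop": "statistics",
    "format": "json",
}).json()

stats = resp["query"]["statistics"]
print(stats["pages"])     # total rows in the page table
print(stats["articles"])  # "content pages": non-redirects in content namespaces
```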
Ah ok. Thanks aude!
Page length would be good then? I am not sure how reliably stub templates are used.
A stub template is added by users (by user judgment) when someone notices a page should have it. I'm sure there is plenty of content that would count as a stub when measured by bytes but doesn't have the template, as well as pages that have been expanded beyond stub status but where the stub template has not yet been removed.
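As a rough illustration of the page-length approach, a naive "all short pages" query via the API might look like this (the 2000-byte cutoff is an arbitrary assumption; list=allpages filters by size in bytes):

```python
import requests

# Naive stub detection by page length alone.
API = "https://nl.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "query",
    "list": "allpages",
    "apnamespace": 0,
    "apfilterredir": "nonredirects",
    "apmaxsize": 2000,   # pages of at most 2000 bytes (arbitrary threshold)
    "aplimit": 50,
    "format": "json",
}).json()

for page in resp["query"]["allpages"]:
    print(page["title"])
```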
Unfortunately, I have a sneaking suspicion that a number of disambiguation pages would turn up in a query for "all short pages"...
There's an AWB module that does some "stub" checking, I think.
I think disambiguation is a page_prop, so it should be possible to filter those out or count them separately.
Yes. Disambiguation is a pageprop.
Here's an example:
https://nl.wikipedia.org/wiki/Speciaal:APIZandbak#action=query&prop=pageprops&format=json&titles=Meulenhoff
{ "batchcomplete": "", "query": { "pages": { "4061655": { "pageid": 4061655, "ns": 0, "title": "Meulenhoff", "pageprops": { "disambiguation": "", "wikibase_item": "Q20967241" } } } } }
Non-disambiguation pages don't have that property set. For example, the same query for the page Nederland:
{ "batchcomplete": "", "query": { "pages": { "1146": { "pageid": 1146, "ns": 0, "title": "Nederland", "pageprops": { "page_image": "Flag_of_the_Netherlands.svg", "wikibase_item": "Q55" } } } } }
I've been thinking about this for some time already. The following idea is probably overkill, but I would say it's a way to go if we cannot establish any prima facie criteria:
(1) Preprocessing, a text-mining approach: (1a) get the dumps, (1b) search through all pages, (1c) collect metrics: page length (we can get this without actually performing a text search, I guess), number of references, number of external links, number of sections, properties of the word frequency distribution, sentiment, whatever can be measured; formal descriptions would do (page length, frequency distributions, distributions of syntactic categories used), not semantics.
(2) Machine learning: use pages that we know are stubs against a sample of pages that are certainly not stubs to train a model (binary logistic regression, decision tree, random forest, something like that); train until some acceptable classification accuracy is reached (if that is possible from the set of features produced in phase (1)); then use the model to predict which of the remaining pages are stubs.
However, this is time consuming and would take a lot of experimentation and model tuning before we figure out exactly which model would deliver a satisfactory result... The feature extraction phase (1) would be difficult and computationally intensive, while phase (2), training a predictive model on a set of several million preprocessed pages, should not be a problem for R on a single machine.
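For illustration, a very rough sketch of both phases, in Python/scikit-learn rather than R, with naive regex-based features, labels assumed to come from stub-template presence (with the caveats about template reliability noted above), and a hypothetical load_labelled_pages() helper standing in for the dump processing:

```python
import re
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def features(wikitext):
    """Phase (1): cheap formal metrics for one page (no semantics)."""
    return [
        len(wikitext),                            # page length in characters
        len(re.findall(r"<ref", wikitext)),       # rough reference count
        len(re.findall(r"\[http", wikitext)),     # rough external-link count
        len(re.findall(r"^==", wikitext, re.M)),  # rough section count
    ]

# pages: list of (wikitext, is_stub) pairs, e.g. labelled by the presence of
# a stub template; loading them from a dump is left out of this sketch.
pages = load_labelled_pages()  # hypothetical helper

X = [features(text) for text, _ in pages]
y = [label for _, label in pages]

# Phase (2): train a binary classifier and check its accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```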
Nothing stalled ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on") in this ticket, hence resetting status.
@Lydia_Pintscher @Addshore Given the status reset - T119976#6178863 - of this task, what do we say: go, no go, priority?
Whatever approach is chosen to determine stubs, it could set the status as a badge on Wikidata; this could then be queried easily.
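If such a badge existed, it could be queried from WDQS along these lines (wikibase:badge is the predicate already used for existing badges such as "featured article"; the stub badge item wd:Q000000 below is purely hypothetical):

```python
import requests

# Sketch: find items whose enwiki sitelink carries a hypothetical stub badge.
QUERY = """
SELECT ?item ?article WHERE {
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           wikibase:badge wd:Q000000 .   # hypothetical "stub" badge item
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "stub-badge-sketch/0.1"},  # WDQS asks for a custom UA
).json()

for row in resp["results"]["bindings"]:
    print(row["item"]["value"], row["article"]["value"])
```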
No priority atm. We'll look into it more when we touch the ArticlePlaceholder again and roll it out to more wikis.