
Track number of stubs on top 20 wikipedias
Open, Low, Public

Description

This could be done by tracking the usage of stub templates?
Categories?
Page length?

Event Timeline

Addshore raised the priority of this task to Needs Triage.
Addshore updated the task description. (Show Details)
Addshore added subscribers: Addshore, Lydia_Pintscher.

Top 20 Wikipedias:

enwiki
svwiki
cebwiki
dewiki
nlwiki
frwiki
ruwiki
warwiki
itwiki
eswiki
plwiki
viwiki
jawiki
ptwiki
zhwiki
ukwiki
cawiki
fawiki
nowiki
shwiki

MediaWiki itself has a stub definition, no? These should be the pages that don't count toward the article count on Special:Version. Can we not just use that?

stubthreshold is only a user preference, which wouldn't work for this.

The page and content page counts on Special:Statistics (which I assume Lydia means) do not take into account whether pages are stubs or not.

The content page count is simply the count of pages that are not redirects and are in content namespaces.

The page count is just SELECT COUNT(*) FROM page.
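
For illustration, here is a minimal sketch of those two counts as queries against a Wiki Replicas database. The host name, the enwiki_p database, the pymysql dependency, the credentials file, and treating namespace 0 as the only content namespace are all assumptions for the example, not part of this task.

import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # assumed credentials
)

with conn.cursor() as cur:
    # All pages: every namespace, redirects included.
    cur.execute("SELECT COUNT(*) FROM page")
    total_pages = cur.fetchone()[0]

    # Content pages: not redirects, in the main content namespace.
    cur.execute(
        "SELECT COUNT(*) FROM page "
        "WHERE page_is_redirect = 0 AND page_namespace = 0"
    )
    content_pages = cur.fetchone()[0]

print(total_pages, content_pages)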

Ah ok. Thanks aude!

Page length would be good then? I am not sure how reliably stub templates are used.

The options for the stub threshold preference range from disabled (the default) to 10,000 bytes.

Stub templates are added by users (by user judgment) when someone notices a page should have the template. I'm sure there is plenty of content that would count as a stub when measured in bytes but doesn't have the template, as well as pages that have been expanded beyond stub status without the stub template being removed yet.

thus, page length would be most objective imho :)

> thus, page length would be most objective imho :)

Page length would also be super easy to track ;)
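
As a sketch of what a length-based count could look like, the page_len column on the page table can be filtered directly; the 2,500-byte cutoff and the connection details below are placeholders, not agreed values.

import os
import pymysql

STUB_MAX_BYTES = 2500  # example threshold only

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

with conn.cursor() as cur:
    cur.execute(
        "SELECT COUNT(*) FROM page "
        "WHERE page_namespace = 0 "
        "AND page_is_redirect = 0 "
        "AND page_len < %s",
        (STUB_MAX_BYTES,),
    )
    print(cur.fetchone()[0])

The same query could be run per wiki, and the threshold varied to see how sensitive the count is to the cutoff.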

Unfortunately, I have a sneaking suspicion that a number of disambiguation pages would turn up in a query for "all short pages"...

There's an AWB module that does some "stub" checking, I think.

I think disambiguation is a page_prop, so it should be possible to filter them out or count them separately.

Yes. Disambiguation is a pageprop.
Here's an example:
https://nl.wikipedia.org/wiki/Speciaal:APIZandbak#action=query&prop=pageprops&format=json&titles=Meulenhoff

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "4061655": {
                "pageid": 4061655,
                "ns": 0,
                "title": "Meulenhoff",
                "pageprops": {
                    "disambiguation": "",
                    "wikibase_item": "Q20967241"
                }
            }
        }
    }
}

Non-disambiguation pages don't have that property set. For example, here is the same query for the page Nederland:

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1146": {
                "pageid": 1146,
                "ns": 0,
                "title": "Nederland",
                "pageprops": {
                    "page_image": "Flag_of_the_Netherlands.svg",
                    "wikibase_item": "Q55"
                }
            }
        }
    }
}
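
The same pageprops lookup can be scripted against the public API; this is a rough sketch using the Python requests library, with a placeholder User-Agent string and a hand-picked pair of titles.

import requests

resp = requests.get(
    "https://nl.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "pageprops",
        "titles": "Meulenhoff|Nederland",
        "format": "json",
    },
    headers={"User-Agent": "stub-count-sketch/0.1 (example)"},
)
for page in resp.json()["query"]["pages"].values():
    is_disambig = "disambiguation" in page.get("pageprops", {})
    print(page["title"], "disambiguation" if is_disambig else "article")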

That will make this easy then! :)
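
Putting the two ideas together, a length-based count that excludes disambiguation pages could look roughly like this; the LEFT JOIN against page_props is the key part, while the 2,500-byte cutoff and the connection details remain assumptions.

import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT COUNT(*)
        FROM page
        LEFT JOIN page_props
               ON pp_page = page_id AND pp_propname = 'disambiguation'
        WHERE page_namespace = 0
          AND page_is_redirect = 0
          AND page_len < %s
          AND pp_page IS NULL
        """,
        (2500,),
    )
    print(cur.fetchone()[0])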

Doesn't seem to be related to Wikidata ..

> Doesn't seem to be related to Wikidata ..

The task was created for / by the Wikidata team.

I've been thinking about this for some time already. The following idea is probably overkill, but I would say it's the way to go if we cannot establish any prima facie criteria:

(1) Preprocessing, a text-mining approach: (1a) go through the dumps, (1b) search through all pages, (1c) collect metrics: page length (we can get this without actually performing text search, I guess), number of references, number of external links, number of sections, properties of the word frequency distribution, sentiment, whatever else can be measured; formal descriptions would do (page length, frequency distributions, distributions of syntactic categories used), not semantics;

(2) Machine learning: use pages that we know are stubs against a sample of pages that are certainly not stubs to train a model (binary logistic regression, decision tree, random forest, or similar); train until some acceptable classification accuracy is reached (if that is possible from the set of features produced in phase (1)); then use the model to predict which of the remaining pages are stubs (a minimal sketch follows below).

However, this is time-consuming and would take a lot of experimentation and model tuning before we figure out exactly which model would deliver a satisfactory result... The feature extraction phase (1) would be difficult and computationally intensive, while (2) training a predictive model on a set of several million preprocessed pages should not be a problem for R on a single machine.
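
Purely as an illustration of step (2), here is a minimal classifier sketch using Python and scikit-learn rather than the R mentioned above; the feature columns and the tiny training arrays are made-up placeholders, not extracted data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder features per page: [page_len, n_references, n_external_links, n_sections]
X = np.array([
    [800, 0, 1, 1],      # short, unreferenced page, labelled as a stub
    [450, 1, 0, 1],
    [12000, 25, 10, 8],  # long, well-referenced page, labelled as not a stub
    [9000, 12, 6, 6],
])
y = np.array([1, 1, 0, 0])  # 1 = stub, 0 = not a stub

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predict for "remaining" (unlabelled) pages; with real data you would hold out
# a validation set and check classification accuracy before trusting this.
print(model.predict([[600, 0, 0, 1], [15000, 30, 12, 9]]))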

Addshore changed the task status from Open to Stalled. Feb 28 2018, 12:52 PM

Another 7 months on, do we still want this?

Lydia_Pintscher lowered the priority of this task from High to Low. Feb 28 2018, 12:53 PM

Yes but not important right now.

@Lydia_Pintscher @Addshore Given the status reset - T119976#6178863 - of this task, what do we say: go, no go, priority?

Whatever approach is chosen to determine stubs, the result could be set as a badge on Wikidata; this could then be queried easily.
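
There is no stub badge on Wikidata today, so the sketch below uses the featured article badge (Q17437796) only to show the shape a badge-based count would take on the query service; the endpoint call via Python requests is likewise just an assumption.

import requests

SPARQL = """
SELECT (COUNT(?sitelink) AS ?count) WHERE {
  ?sitelink schema:isPartOf <https://en.wikipedia.org/> ;
            wikibase:badge wd:Q17437796 .   # a stub badge would slot in here
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "stub-count-sketch/0.1 (example)"},
)
print(resp.json()["results"]["bindings"][0]["count"]["value"])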

> @Lydia_Pintscher @Addshore Given the status reset - T119976#6178863 - of this task, what do we say: go, no go, priority?

No priority atm. We'll look into it more again when we touch the ArticlePlaceholder again and roll it out to more wikis.