
Track number of stubs on top 20 wikipedias
Open, Low, Public

Description

This could be done by tracking the usage of stub templates?
Categories?
Page length?

Event Timeline

Addshore raised the priority of this task to Needs Triage.
Addshore updated the task description. (Show Details)
Addshore added subscribers: Addshore, Lydia_Pintscher.

Top 20 Wikipedias:

enwiki
svwiki
cebwiki
dewiki
nlwiki
frwiki
ruwiki
warwiki
itwiki
eswiki
plwiki
viwiki
jawiki
ptwiki
zhwiki
ukwiki
cawiki
fawiki
nowiki
shwiki

MediaWiki itself has a stub definition, no? These should be the pages that don't count toward the article count on Special:Version. Can we not just use that?

stubthreshold is only a user preference, which wouldn't work for this.

The page and content page counts on Special:Statistics (which I assume Lydia means) do not take into account whether pages are stubs or not.

The content page count is simply the count of pages that are not redirects and are in content namespaces.

The page count is just SELECT COUNT(*) FROM page.
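
For illustration, here is a minimal sketch of those two counts as queries against a Wiki Replicas database. The host name, the enwiki_p database, the pymysql dependency, the credentials file, and treating namespace 0 as the only content namespace are all assumptions for the example, not part of this task.

import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # assumed credentials
)

with conn.cursor() as cur:
    # All pages: every namespace, redirects included.
    cur.execute("SELECT COUNT(*) FROM page")
    total_pages = cur.fetchone()[0]

    # Content pages: not redirects, in the main content namespace.
    cur.execute(
        "SELECT COUNT(*) FROM page "
        "WHERE page_is_redirect = 0 AND page_namespace = 0"
    )
    content_pages = cur.fetchone()[0]

print(total_pages, content_pages)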

Ah ok. Thanks aude!

Page length would be good then? I am not sure how reliably stub templates are used.

The options for the stub threshold preference range from disabled (the default) to 10,000 bytes.

Stub templates are added by users (by user judgment) when someone notices a page should have the template. I'm sure there is plenty of content that would count as a stub when measured in bytes but doesn't have the template, as well as pages that have been expanded beyond stub status without the stub template being removed yet.

thus, page length would be most objective imho :)

> thus, page length would be most objective imho :)

Page length would also be super easy to track ;)
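
As a sketch of what a length-based count could look like, the page_len column on the page table can be filtered directly; the 2,500-byte cutoff and the connection details below are placeholders, not agreed values.

import os
import pymysql

STUB_MAX_BYTES = 2500  # example threshold only

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

with conn.cursor() as cur:
    cur.execute(
        "SELECT COUNT(*) FROM page "
        "WHERE page_namespace = 0 "
        "AND page_is_redirect = 0 "
        "AND page_len < %s",
        (STUB_MAX_BYTES,),
    )
    print(cur.fetchone()[0])

The same query could be run per wiki, and the threshold varied to see how sensitive the count is to the cutoff.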

Unfortunately, I have a sneaking suspicion that a number of disambiguation pages would turn up in a query for "all short pages"...

There's an AWB module that does some "stub" checking, I think.

I think disambiguation is a page_prop, so it should be possible to filter them out or count them separately.

Yes. Disambiguation is a pageprop.
Here's an example:
https://nl.wikipedia.org/wiki/Speciaal:APIZandbak#action=query&prop=pageprops&format=json&titles=Meulenhoff

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "4061655": {
                "pageid": 4061655,
                "ns": 0,
                "title": "Meulenhoff",
                "pageprops": {
                    "disambiguation": "",
                    "wikibase_item": "Q20967241"
                }
            }
        }
    }
}

Non-disambiguation pages don't have that property set. For example, here is the same query for the page Nederland:

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1146": {
                "pageid": 1146,
                "ns": 0,
                "title": "Nederland",
                "pageprops": {
                    "page_image": "Flag_of_the_Netherlands.svg",
                    "wikibase_item": "Q55"
                }
            }
        }
    }
}
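
The same pageprops lookup can be scripted against the public API; this is a rough sketch using the Python requests library, with a placeholder User-Agent string and a hand-picked pair of titles.

import requests

resp = requests.get(
    "https://nl.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "pageprops",
        "titles": "Meulenhoff|Nederland",
        "format": "json",
    },
    headers={"User-Agent": "stub-count-sketch/0.1 (example)"},
)
for page in resp.json()["query"]["pages"].values():
    is_disambig = "disambiguation" in page.get("pageprops", {})
    print(page["title"], "disambiguation" if is_disambig else "article")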

That will make this easy then! :)
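
Putting the two ideas together, a length-based count that excludes disambiguation pages could look roughly like this; the LEFT JOIN against page_props is the key part, while the 2,500-byte cutoff and the connection details remain assumptions.

import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT COUNT(*)
        FROM page
        LEFT JOIN page_props
               ON pp_page = page_id AND pp_propname = 'disambiguation'
        WHERE page_namespace = 0
          AND page_is_redirect = 0
          AND page_len < %s
          AND pp_page IS NULL
        """,
        (2500,),
    )
    print(cur.fetchone()[0])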

Doesn't seem to be related to Wikidata ..

> Doesn't seem to be related to Wikidata ..

The task was created for / by the Wikidata team.

I've been thinking about this for some time already. The following idea is probably overkill, but I would say it's the way to go if we cannot establish any prima facie criteria:

(1) Preprocessing, a text-mining approach: (1a) go through the dumps, (1b) search through all pages, (1c) collect metrics: page length (we can get this without actually performing text search, I guess), number of references, number of external links, number of sections, properties of the word frequency distribution, sentiment, whatever else can be measured; formal descriptions would do (page length, frequency distributions, distributions of syntactic categories used), not semantics;

(2) Machine learning: use pages that we know are stubs against a sample of pages that are certainly not stubs to train a model (binary logistic regression, decision tree, random forest, or similar); train until some acceptable classification accuracy is reached (if that is possible from the set of features produced in phase (1)); then use the model to predict which of the remaining pages are stubs (a minimal sketch follows below).

However, this is time-consuming and would take a lot of experimentation and model tuning before we figure out exactly which model would deliver a satisfactory result... The feature extraction phase (1) would be difficult and computationally intensive, while (2) training a predictive model on a set of several million preprocessed pages should not be a problem for R on a single machine.
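
Purely as an illustration of step (2), here is a minimal classifier sketch using Python and scikit-learn rather than the R mentioned above; the feature columns and the tiny training arrays are made-up placeholders, not extracted data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder features per page: [page_len, n_references, n_external_links, n_sections]
X = np.array([
    [800, 0, 1, 1],      # short, unreferenced page, labelled as a stub
    [450, 1, 0, 1],
    [12000, 25, 10, 8],  # long, well-referenced page, labelled as not a stub
    [9000, 12, 6, 6],
])
y = np.array([1, 1, 0, 0])  # 1 = stub, 0 = not a stub

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predict for "remaining" (unlabelled) pages; with real data you would hold out
# a validation set and check classification accuracy before trusting this.
print(model.predict([[600, 0, 0, 1], [15000, 30, 12, 9]]))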

Addshore changed the task status from Open to Stalled. Feb 28 2018, 12:52 PM

Another 7 months on, do we still want this?

Lydia_Pintscher lowered the priority of this task from High to Low. Feb 28 2018, 12:53 PM

Yes but not important right now.

@Lydia_Pintscher @Addshore Given the status reset - T119976#6178863 - of this task, what do we say: go, no go, priority?

Whatever approach is chosen to determine stubs, the result could be set as a badge on Wikidata; this could then be queried easily.
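
There is no stub badge on Wikidata today, so the sketch below uses the featured article badge (Q17437796) only to show the shape a badge-based count would take on the query service; the endpoint call via Python requests is likewise just an assumption.

import requests

SPARQL = """
SELECT (COUNT(?sitelink) AS ?count) WHERE {
  ?sitelink schema:isPartOf <https://en.wikipedia.org/> ;
            wikibase:badge wd:Q17437796 .   # a stub badge would slot in here
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "stub-count-sketch/0.1 (example)"},
)
print(resp.json()["results"]["bindings"][0]["count"]["value"])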

> @Lydia_Pintscher @Addshore Given the status reset - T119976#6178863 - of this task, what do we say: go, no go, priority?

No priority atm. We'll look into it more again when we touch the ArticlePlaceholder again and roll it out to more wikis.