
[SPIKE] What % of talk pages have not yet been created?
Closed, Resolved; Public

Description

This task is about uncovering the percentage of talk pages, across namespaces and Wikipedias, that have not yet been created.

Impact

Knowing the percentage of talk pages, across namespaces and Wikipedias, that have not yet been created will help us decide how highly we should prioritize work on designing the empty state experience (T252902).

Requirements

  • We are curious to know the following: Of all the pages in the subject namespace [i], what percentage of them do NOT have a corresponding talk page that's been created?
  • We would value seeing the percentage of not-yet-created talk pages grouped by Wikipedia and, within each wiki, grouped by namespace. [ii] See example tables [iii] below.

All Wikipedias

| Namespace | % yet-to-be created talk pages |
| --- | --- |
| 1 (Talk) | ___#% |
| 3 (User talk) | ___#% |
| 5 (Wikipedia talk) | ___#% |

By wiki

| Wiki | Namespace | % yet-to-be created talk pages |
| --- | --- | --- |
| es.wiki | 1 (Talk) | ___#% |
| es.wiki | 3 (User talk) | ___#% |
| es.wiki | 5 (Wikipedia talk) | ___#% |
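The requested measure can be sketched in a few lines. This is only an illustrative sketch, not the query used later in this task: the function name `percent_missing_talk` and the sample data are hypothetical, and the input is assumed to be a list of (namespace, title) pairs for one wiki. It relies on the MediaWiki convention that subject namespaces are even-numbered and the matching talk namespace is the subject number plus one.

```python
from collections import Counter


def percent_missing_talk(pages):
    """Return {talk_namespace: percent of subject pages with no talk page}.

    `pages` is an iterable of (namespace, title) tuples for a single wiki
    (hypothetical input format, chosen for illustration).
    """
    pages = set(pages)
    subject_total = Counter()
    missing = Counter()
    for ns, title in pages:
        # Skip talk namespaces (odd) and virtual namespaces (negative).
        if ns < 0 or ns % 2 == 1:
            continue
        subject_total[ns] += 1
        # The talk page shares the title and lives in namespace ns + 1.
        if (ns + 1, title) not in pages:
            missing[ns] += 1
    return {ns + 1: 100.0 * missing[ns] / total
            for ns, total in subject_total.items()}


# Made-up sample: four articles (one with a talk page), two user pages
# (one with a user talk page).
sample = [
    (0, "Apple"), (1, "Apple"),
    (0, "Pear"),
    (0, "Plum"),
    (0, "Fig"),
    (2, "Alice"), (3, "Alice"),
    (2, "Bob"),
]
result = percent_missing_talk(sample)
print(result)
```

Keying the result by talk namespace mirrors the layout of the example tables above.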

Done

  • A table that contains the data described in the Requirements section above.

i. https://en.wikipedia.org/wiki/Wikipedia:Namespace#Subject_namespaces
ii. https://en.wikipedia.org/wiki/Wikipedia:Namespace
iii. Please represent the data in the way you think will be most effective. These tables are just "sketches."

Event Timeline

About 3% of non-redirect mainspace pages on enwiki (176k out of 6.2M) have no corresponding talk page. I think smaller wikis (especially wikis without article assessment) will tend to have a higher percentage of mainspace pages without talk pages; for example, it’s about two thirds on Bavarian Wikipedia and over 99% on Hungarian Wikibooks.

Please note that this doesn’t take any article definition into account, so disambiguation pages, stubs etc. all count; only (hard) redirects are excluded. But this probably fits the purpose of this quick research better than any stricter article criterion, as people may want to talk about disambiguation pages and stubs as well.

These queries can easily be customized to get the requested numbers per wiki. (I’m not sure whether Quarry is capable of making cross-wiki queries at all given T260389.)
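To make the redirect exclusion concrete, here is a minimal sketch of the counting logic. It is not the Quarry query itself; the function name, the dict keys and the sample rows are assumptions made for illustration only.

```python
def percent_without_talk(pages, exclude_redirects=True):
    """Percent of mainspace pages with no page of the same title in Talk.

    `pages` is a list of dicts with the hypothetical keys 'namespace',
    'title' and 'is_redirect', all belonging to one wiki.
    """
    talk_titles = {p["title"] for p in pages if p["namespace"] == 1}
    articles = [
        p for p in pages
        if p["namespace"] == 0
        and not (exclude_redirects and p["is_redirect"])
    ]
    if not articles:
        return 0.0
    missing = sum(1 for p in articles if p["title"] not in talk_titles)
    return 100.0 * missing / len(articles)


# Made-up sample rows.
pages = [
    {"namespace": 0, "title": "A", "is_redirect": False},
    {"namespace": 1, "title": "A", "is_redirect": False},
    {"namespace": 0, "title": "B", "is_redirect": False},
    {"namespace": 0, "title": "C", "is_redirect": True},
    {"namespace": 0, "title": "D", "is_redirect": False},
    {"namespace": 1, "title": "D", "is_redirect": False},
]
without_redirects = percent_without_talk(pages)
with_redirects = percent_without_talk(pages, exclude_redirects=False)
print(round(without_redirects, 2), round(with_redirects, 2))
```

Disambiguation pages and stubs are deliberately not filtered here, matching the note above that only redirects are excluded.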

| Namespace | % yet-to-be created talk pages |
| --- | --- |
| 3 (User talk) | ___#% |

I don’t think this is a good measure in case of user talk pages. User talk pages are used for communicating with the corresponding user, not for communicating about the corresponding user page. Many users have redlinked (or global) user pages with active discussions on their talk pages. The relevant question is, IMO:

> Of all users of the wiki, what percentage of them do NOT have a user talk page that's been created?

Of course, the question asked by @ppelberg remains relevant for other namespaces.
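The user-based measure suggested above could be sketched as follows. The function name and the sample names are hypothetical; the only MediaWiki detail relied on is that page titles store spaces as underscores while user names are stored with spaces.

```python
def percent_users_without_talk(user_names, user_talk_titles):
    """Percent of users with no existing User talk page.

    `user_names` are registered user names (with spaces);
    `user_talk_titles` are page titles in namespace 3 (with underscores).
    Both inputs are hypothetical lists used for illustration.
    """
    users = set(user_names)
    if not users:
        return 0.0
    # Normalise titles so they can be compared with user names.
    existing = {title.replace("_", " ") for title in user_talk_titles}
    missing = users - existing
    return 100.0 * len(missing) / len(users)


users = ["Alice", "Bob Smith", "Carol", "Dave"]
# The archive subpage does not match any user name, so it is ignored.
talk_titles = ["Alice", "Bob_Smith", "Alice/Archive_1"]
pct = percent_users_without_talk(users, talk_titles)
print(pct)
```

Because the denominator is the set of users rather than the set of user pages, a user with a redlinked user page but an active talk page is counted as having a talk page.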

Some more numbers based on the query in https://phabricator.wikimedia.org/T252902#6732847:

| Wiki | Pages without talk |
| --- | --- |
| enwiki | 3% |
| frwiki | 24% |
| dewiki | 71% |
| eswiki | 52% |
| nlwiki | 92% |
| plwiki | 64% |

I've calculated the percentage of talk pages, across namespaces and Wikipedias, that have not yet been created. Data comes from the currently available snapshot of MediaWiki page history data. As mentioned in the comments above, the per-wiki numbers can also be obtained by customizing the query in T252902#6732847.

Here is a summary of the percent of talk pages yet to be created by namespace across all Wikipedias.

All Wikipedias

| Namespace | Percent of Talk Pages Yet To Be Created |
| --- | --- |
| 1 (Talk) | 57.46% |
| 3 (User talk) | 57.77% |
| 5 (Wikipedia talk) | 94.22% |
| 7 (File talk) | 78.38% |
| 9 (MediaWiki talk) | 92.67% |
| 11 (Template talk) | 84.26% |
| 13 (Help talk) | 80.27% |
| 15 (Category talk) | 72.69% |
| 101 (Portal talk) | 90.03% |

For per-Wikipedia data, I created a Superset dashboard that can be used to explore the results by namespace and Wikipedia. Below are a few highlights for some common talk namespace types on a sample of Wikipedias.

By wiki

| Wiki | Namespace | Percent of Talk Pages Yet To Be Created |
| --- | --- | --- |
| arwiki | 11 (Template talk) | 33.8% |
| arwiki | 3 (User talk) | 32.9% |
| arwiki | 15 (Category talk) | 27.9% |
| arwiki | 1 (Talk) | 12.2% |
| cswiki | 15 (Category talk) | 94.5% |
| cswiki | 1 (Talk) | 89.0% |
| cswiki | 11 (Template talk) | 88.7% |
| cswiki | 3 (User talk) | 61.5% |
| dewiki | 11 (Template talk) | 92.1% |
| dewiki | 15 (Category talk) | 85.7% |
| dewiki | 1 (Talk) | 70.0% |
| dewiki | 3 (User talk) | 56.5% |
| enwiki | 3 (User talk) | 63.9% |
| enwiki | 11 (Template talk) | 41.4% |
| enwiki | 15 (Category talk) | 15.1% |
| enwiki | 1 (Talk) | 2.7% |
| frwiki | 11 (Template talk) | 95.0% |
| frwiki | 15 (Category talk) | 82.2% |
| frwiki | 3 (User talk) | 42.5% |
| frwiki | 1 (Talk) | 23.2% |

@ppelberg - Let me know if you have any questions or have any issues accessing the superset dashboard.

MNeisler triaged this task as Medium priority. Feb 2 2021, 10:30 PM
MNeisler added a subscriber: MNeisler.

@ppelberg - re-assigning to you for sign-off. Feel free to reach out if you have any questions!

> I've calculated the percentage of talk pages, across namespaces and Wikipedias, that have not yet been created. Data comes from the currently available snapshot of data from Mediawiki page history. As mentioned in the comments above, the per wiki numbers can also be obtained by customizing the query in T252902#6732847.
> ...For per Wikipedia data, I created a superset dashboard that can be used to explore the results by namespace and Wikipedia. Here are a few highlights below for some common talk namespace types on a sample of Wikipedias.

Being able to explore this data in Superset is great, @MNeisler. In doing so, it seems to me that:

  • On ~42% of Wikipedias, at least 75% of user talk pages are empty/not created. [i]
  • On ~81% of Wikipedias, at least 75% of article talk pages are empty/not created. [ii]

I understand the above to mean that Junior Contributors at many Wikipedias are likely arriving on article talk pages and becoming confused when the software generically prompts them to "create a new page" rather than offering something more directed, such as the opportunity to start a new topic about the article from which they may be arriving.


i. See: cell D3: in the >75% user talk empty column of Talk pages project/Research.
ii. See: cell E3: in the >75% article talk empty column of Talk pages project/Research.
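The "share of Wikipedias at or above a threshold" figures above come from the spreadsheet, which is not reproduced here. As a rough sketch of the same calculation, the snippet below applies it to the five article talk (namespace 1) values from the By wiki table earlier in this task. The function name is hypothetical, and five wikis are far too few to reproduce the ~81% figure.

```python
def share_of_wikis_at_or_above(pct_by_wiki, threshold=75.0):
    """Percent of wikis whose not-yet-created share is >= threshold."""
    if not pct_by_wiki:
        return 0.0
    hits = [wiki for wiki, pct in pct_by_wiki.items() if pct >= threshold]
    return 100.0 * len(hits) / len(pct_by_wiki)


# Article talk (namespace 1) values copied from the By wiki table.
article_talk = {
    "arwiki": 12.2,
    "cswiki": 89.0,
    "dewiki": 70.0,
    "enwiki": 2.7,
    "frwiki": 23.2,
}
share = share_of_wikis_at_or_above(article_talk)
print(share)  # only cswiki is at or above 75% in this small sample
```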

COROLLARY
While it is out of scope for this task, on 10-February, @MNeisler and I talked about what the wikis with high percentages of empty talk pages [and vice versa] might have in common.

Some of the attributes we talked about investigating to identify potential commonalities were:

  • Project formation date
  • Project size (e.g. number of articles)
  • Editor base (e.g. total number of monthly active editors, total number of registered editors)

If/when prioritized, research into the above will happen in T274828.

Note: in T252902#6732847, @Tacsipacsi made a great suggestion to exclude redirected pages from this analysis. I used the query Tacsipacsi shared [i] to see if excluding redirected pages significantly impacted the results we arrived at in T272657#6797884.

Below are the percentages of article talk pages, including [ii] and excluding redirects, that have not yet been created at some of the largest Wikipedias.

Conclusion: redirects seem to have only a minimal impact on these numbers.

| Wiki | % article talk NOT created (excluding redirects) | % article talk NOT created (including redirects) |
| --- | --- | --- |
| eswiki | 51.75% | 51.94% |
| dewiki | 71.08% | 69.99% |
| huwiki | 45.47% | 45.17% |
| thwiki | 57.78% | 57.31% |
| arwiki | 12.93% | 12.22% |

i. https://quarry.wmflabs.org/query/51179
ii. I'm assuming the analysis that produced the results in T272657#6797884 included redirect pages.
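As a quick check of the conclusion that redirects matter little, the differences in the table above can be computed directly. This sketch only reuses the table's values; the variable names are arbitrary.

```python
# (excluding redirects, including redirects), copied from the table above.
rows = {
    "eswiki": (51.75, 51.94),
    "dewiki": (71.08, 69.99),
    "huwiki": (45.47, 45.17),
    "thwiki": (57.78, 57.31),
    "arwiki": (12.93, 12.22),
}

# Difference in percentage points: excluding minus including redirects.
diffs = {wiki: round(excl - incl, 2) for wiki, (excl, incl) in rows.items()}
largest = max(diffs, key=lambda wiki: abs(diffs[wiki]))

for wiki, diff in diffs.items():
    print(f"{wiki}: {diff:+.2f} percentage points")
print("largest absolute difference:", largest, diffs[largest])
```

The largest gap is on dewiki, at 1.09 percentage points, and eswiki is the only wiki in the sample where excluding redirects lowers the figure.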