
Analyze permalinks table to see how many duplicates exist
Closed, Resolved · Public

Description

T315510 will start a maintenance script to populate the talk page comment database.

This task involves analyzing that database to answer the question: what percentage of comments stored in the database are duplicates?

Knowing the answer to the above will enable us to estimate the likelihood that someone tapping/clicking a permalink will be directed to Special:GoToComment instead of being taken directly to the comment they are expecting to see.

With the probability described above in hand, we'll be able to decide whether any adjustments need to be made to:

  • A) How we're generating permalinks to lower the rate of duplicates
  • B) How the user experience looks/functions to help people develop more accurate expectations for what is likely to occur when they tap a permalink

Requirements

  • Once the talk page comment database contains a sufficiently large and representative number of comments, calculate the percentage of those comments that are duplicates of one another

Open questions

  • 1. When and how will we know the comment database contains a large and representative enough sample of comments for us to analyze its contents and draw conclusions from that analysis?

Done

  • Answers to all Open questions are documented
  • Requirements are met

Event Timeline

ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.
ppelberg moved this task from Untriaged to Upcoming on the Editing-team board.

what percentage of comments stored in the database are duplicates?

I started looking into this. It seems like it's normal for medium to large wikis to have 5%-20% duplicate comments. Most of the duplicates are "boring" bot messages or mass notifications, but some are "real" comments that we can't uniquely identify.

(Note that I'm counting each occurrence of a duplicated comment separately. If you count all of them as just one comment, the number comes out to 2-8%.)
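For illustration, the two ways of counting could look roughly like the queries below. This is a minimal sketch assuming a simplified table comment_items with an item_name column holding the generated comment identifier; the actual DiscussionTools tables and column names differ.

  -- Duplicate rate with each occurrence counted separately:
  -- rows whose identifier is shared, as a share of all rows.
  SELECT SUM(cnt) * 100.0 / (SELECT COUNT(*) FROM comment_items) AS pct_occurrences
  FROM (
    SELECT COUNT(*) AS cnt
    FROM comment_items
    GROUP BY item_name
    HAVING COUNT(*) > 1
  ) dup;

  -- Duplicate rate with each duplicated identifier counted once,
  -- as a share of all distinct identifiers.
  SELECT COUNT(*) * 100.0 / (SELECT COUNT(DISTINCT item_name) FROM comment_items) AS pct_identifiers
  FROM (
    SELECT item_name
    FROM comment_items
    GROUP BY item_name
    HAVING COUNT(*) > 1
  ) dup;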

On wikis with less human activity it can be even higher. On the Cebuano Wikipedia (infamous for its bot-created articles), it's 97.36%, because many of the bot-created articles have bot-created talk pages (example, example).

I'll run the queries on all wikis and make some kind of report later, because Special:FindComment is currently still disabled on many wikis, which makes it a real chore to spot-check the results (and to share interesting examples).

There are several kinds of comments that can't be uniquely identified by author and time in our database (not necessarily "duplicates", but let's use that as a shorthand):

  • Identical messages posted to many users at once
  • Multiple comments posted in one edit (or within a minute) on a single page
  • Similar comments posted on separate pages within a minute (e.g. closing multiple deletion discussions, or welcoming multiple users)
  • Serial comments posted by a bot (e.g. notices about broken external links)
  • Mishaps with mass replacements of signatures (example, example)
  • A particular deletion discussion on enwiktionary concerning 81 related pages, which has apparently been archived to the talk page of each of those pages (example)
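Each of the above cases produces multiple comments that share the same author and timestamp, and therefore the same generated identifier. A rough way to surface such groups for spot-checking, again against the simplified comment_items table assumed above (author and comment_timestamp are hypothetical column names), might be:

  -- List the most heavily duplicated author/timestamp pairs so that the
  -- underlying comments can be spot-checked by hand.
  SELECT author, comment_timestamp, COUNT(*) AS occurrences
  FROM comment_items
  GROUP BY author, comment_timestamp
  HAVING COUNT(*) > 1
  ORDER BY occurrences DESC
  LIMIT 50;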

Queries I used: