
Map different types of measurement for T115119's schema
Closed, Resolved · Public

Description

We suggest the following measures for events that follow T115119's schema:

  • Δ links to a domain (or specific subdomain) over time, with a measure of change (average slope, or perhaps % growth). The main use case is TWL/GLAM partnerships.
  • Comparison of change between windows of time (month-to-month change as absolute growth and % growth). This is particularly important for the business case for TWL in new partnerships; something along the lines of Magnus's Baglama 2: https://tools.wmflabs.org/glamtools/baglama2/
  • Comparison of Δ unique domains/subdomains. The main use case would be discovering new links/sources being used on Wikipedia.
  • Attribution of who is adding links in a particular time window - concatenate links by - Main use cases: identifying spammers/COI issues, identifying GLAM/TWL participants for encouragement and support, identifying potential recipients of TWL sources, or finding someone to engage in outreach. Also a good way to identify people to encourage in linking best practices (if we could find high-frequency adders of ".org" or journal namespaces).
  • Identification of the "most frequently added/changed" URLs overall and within a domain, to look for URLs that are controversial or under a lot of scrutiny. Use cases: it would also be interesting to see this at a global level and use it for vandalism tools/ratings. It might be worth giving URLs a "controversy factor", or researching these behaviours for AI.
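
To make the first two measures concrete, here is a minimal Python sketch of month-to-month change (absolute and % growth) in link additions for one domain. The event shape (timestamp, URL), the sample data, and the function names are all assumptions made for illustration; they are not taken from T115119's schema.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlsplit

# Hypothetical link-addition events: (ISO timestamp, URL added).
# This shape is an assumption, not the actual T115119 schema.
EVENTS = [
    ("2015-10-03T12:00:00", "https://www.jstor.org/stable/1"),
    ("2015-10-20T08:30:00", "https://www.jstor.org/stable/2"),
    ("2015-11-02T09:15:00", "https://www.jstor.org/stable/3"),
    ("2015-11-11T17:45:00", "https://www.jstor.org/stable/4"),
    ("2015-11-28T21:05:00", "https://www.jstor.org/stable/5"),
    ("2015-11-29T10:00:00", "https://example.org/page"),
]


def monthly_additions(events, domain):
    """Count link additions per calendar month for a domain and its subdomains."""
    counts = Counter()
    for ts, url in events:
        host = urlsplit(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            counts[datetime.fromisoformat(ts).strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


def month_to_month_change(counts):
    """Return (month, absolute change, % change) for each pair of adjacent months.

    Only months that appear in `counts` are compared, so a month with no
    additions is skipped rather than treated as zero.
    """
    months = list(counts)
    rows = []
    for prev, cur in zip(months, months[1:]):
        delta = counts[cur] - counts[prev]
        rows.append((cur, delta, 100.0 * delta / counts[prev]))
    return rows


jstor = monthly_additions(EVENTS, "jstor.org")
print(jstor)
print(month_to_month_change(jstor))
```

The same per-month counts could feed a slope estimate instead of % growth; which measure is more useful for TWL/GLAM reporting is left open here, as it is in the list above.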

Event Timeline

For anti-spam we also need the following to be at feature parity with COIBot:

  • A special page that lists, on demand, the editors who have added links to a particular domain, with the number of links each editor has added to that domain. Must be able to fetch diffs in a separate query.
  • A special page that lists, on demand, the domains to which a particular user has added links, with the number of links added by that editor to each domain. Must be able to fetch diffs in a separate query.
  • Cross-wiki statistics (difficult, we might be able to work around this).

We don't mind getting no answer when there are more than 1000 records for a particular domain or editor.
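
As a rough illustration of the two on-demand reports and the 1000-record cutoff, here is a small Python sketch. The record shape (editor, domain, revision ID), the sample data, and all function names are hypothetical; this does not describe how COIBot or any MediaWiki special page actually works.

```python
from collections import Counter

MAX_RECORDS = 1000  # cutoff mentioned above; beyond this no answer is returned

# Hypothetical link-addition records: (editor, domain, revision ID).
ADDITIONS = [
    ("Alice", "example.org", 101),
    ("Bob", "example.org", 102),
    ("Alice", "example.org", 103),
    ("Alice", "example.com", 104),
]


def editors_for_domain(additions, domain, limit=MAX_RECORDS):
    """List (editor, link count) for one domain, or None if over the limit."""
    matching = [rec for rec in additions if rec[1] == domain]
    if len(matching) > limit:
        return None
    return Counter(editor for editor, _, _ in matching).most_common()


def domains_for_editor(additions, editor, limit=MAX_RECORDS):
    """List (domain, link count) for one editor, or None if over the limit."""
    matching = [rec for rec in additions if rec[0] == editor]
    if len(matching) > limit:
        return None
    return Counter(domain for _, domain, _ in matching).most_common()


def diffs_for(additions, editor, domain):
    """Separate query: revision IDs where this editor added links to this domain."""
    return [rev for ed, dom, rev in additions if ed == editor and dom == domain]


print(editors_for_domain(ADDITIONS, "example.org"))
print(domains_for_editor(ADDITIONS, "Alice"))
print(diffs_for(ADDITIONS, "Alice", "example.org"))
```

Keeping the diff lookup as its own function mirrors the requirement that diffs be fetched in a separate query, so the summary counts stay cheap to produce.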

Contact Beetstra if you want to know about our existing link addition database.

It would be great if this could also be a real-time IRC feed, as then http://en.wikipedia.org/wiki/User:XLinkBot could hook live into the feed and revert when conditions are met.

@Beetstra: we'd have to be pretty sure we're properly sanitizing this feed then, because it falls in that uncomfortable position of exposing vandal text and other types of stuff that we normally delete from our history but serve in our real-time streams.

ggellerman set Security to None.
ggellerman removed a subscriber: Halfak.
ggellerman added subscribers: Halfak, ggellerman.

removing Research and Data backlog and @Halfak

@DarTar is still subscribed

Samwalton9-WMF claimed this task.

The above information has been taken into account and merged into T152302 and its more granular subtasks.