
Analysis of data collected from databases to identify priority modules
Closed, ResolvedPublic

Description

To determine which modules are candidates for centralizing in Abstract Wikipedia, some data analysis needs to be performed. So far, data has been collected from the API and from databases across all wikis and stored in a Toolforge user database. As the next step, priority modules need to be determined based on usage, pageviews, links, etc.

After analysis, a relative scoring metric was devised to identify important modules.

Findings

A brief compilation of all the findings from my data analysis so far is attached here.

Scoring modules

An example-based documentation of the scoring metric is attached here.


In short:

  • Get a limit value from the data distribution for each feature
  • Modify the distribution so that the limit falls at ~87% or less
  • Get the feature score as the percentile of the value within the modified distribution
  • Get the total score of a module as the weighted sum of its feature scores (see the sketch below).
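
A minimal sketch of this percentile-based scoring, assuming feature values live in plain Python lists / NumPy arrays; the limits, weights, and feature names below are illustrative, not the ones used in the notebooks.

```
import numpy as np

def feature_score(values, limit):
    """Score each value as its percentile rank within a distribution
    clipped at `limit`, so extreme outliers don't dominate."""
    clipped = np.clip(np.asarray(values, dtype=float), None, limit)
    return np.array([(clipped <= v).mean() * 100 for v in clipped])

def module_score(features, limits, weights):
    """Total score per module: weighted sum of per-feature percentile scores.
    `features` maps feature name -> one raw value per module."""
    n_modules = len(next(iter(features.values())))
    total = np.zeros(n_modules)
    for name, values in features.items():
        total += weights[name] * feature_score(values, limits[name])
    return total

# Illustrative usage with made-up limits and weights
features = {"transclusions": [10, 5_000, 2_000_000], "editors": [1, 12, 300]}
limits = {"transclusions": 1_000_000, "editors": 100}
weights = {"transclusions": 3, "editors": 1}
print(module_score(features, limits, weights))
```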

Work

Notebook for data analysis to find priority modules: Aisha's Notebook IV
Notebook to detect priority modules: Aisha's Notebook V

Event Timeline

@tanny411 I've been looking through your notebook and there are things I've seen previously too. Modules like Module:inc-ash/dial/data/?? and Module:zh/data/ltc-pron/? all refer to different translation and/or pronunciation information for a word. It's logical that they would have different source code, and I'm not sure we want to analyze them at all. At the same time, in my tests they are usually detected by Levenshtein distance quite easily, so it might not be worth the work to drop them.

The other thing is: all of them have *??* in their title. Is there some kind of naming convention? Are there any other types of functions that use this naming?

@LostEnchanter Yes, those with *??* create duplicate titles, although they are not actually the same; these are alphabets or symbols that I couldn't get rendered anywhere (web or notebook).
Also, if you were able to find certain groups/clusters of pages that go together (like pronunciation modules), then maybe we can find modules similar to them and start reducing our data for further analysis.

@tanny411 So, yes, my logic is something like this: they are all different, and it looks like all of them have "?" in the title. Can we drop them, or is there something I'm missing?

Although I haven't looked at their usage or such explicitly to identify how important they are, for analysis that requires page titles to be unique I guess we can drop them, since we know what they are for.

@tanny411 you did a great job creating this report!

I also have an idea that might be true for the analysed information: a large number of edits, especially minor ones, might reflect a large amount of data in the module, as different people would update the info they know. I think this might especially hold for the various Wiktionary modules, where different people would add pronunciation/translation for different languages and so on.

@Quiddity (Nick) see the attachment on this task for the data analysis so far. I'm out the next couple of business days, but for our next weekly meeting my goal is to have a good "paper prototype" of a simple web dashboard for drilling down to modules based on some heuristics / filters.

@tanny411 @LostEnchanter, @Quiddity was interested in seeing some of the insights and is happy to share extra context on editorial dynamics if needed. If you all would like to connect separately to discuss further, please do, or perhaps we can pose questions and insights here on the ticket as well.

Thanks! Very interesting. [Below are just my notes, in case useful. There are no questions for anyone, and no reply is needed. The TLDR is in bold. :-) ]

I'm curious about this item in the PDF:

"Conclusion: langlinks are across content models like Scribunto, wiktext. i.e. If a module has 89 language links does not mean the other89 pages are Scribunto modules"

If I understand correctly, this means that:
If I were to visit all the interwiki links from a page like w:en:Module:Convert (which contains Lua code), then there is a possibility that some of those ~130 other pages might not contain any Lua code.
Is that accurate, and if so are there any examples that could be examined?
My first guess is that these could only be "Redirect" pages. But perhaps some are written in other (non-Lua) programming languages?

Ah! I found the answer (or at least part of it) within the Notebook...

  • One example given there is w:tr:Modül:Konum haritası/veri/Polonya ("Module: Location map / data / Poland ") , and I see that wiki is using a system of JSON within subpages to store the data for specific locations. (confirmed here, in 3,171 subpages);
  • Whereas many other wikis use a system of wikitext and parser-function #switches within the Template: namespace for that data. (e.g. w:de:Vorlage:Positionskarte Polen).
    • So I guess the Trwiki page is linked to the Dewiki page just because they are roughly equivalent in result?
  • I.e. different wikis use 1 of 2 (or more) different systems to produce this type of map image within a page.
  • But then, why isn't Enwiki given as an interwiki link at Trwiki?
    • Ahah! The source of the confusion seems to be that Trwiki has been incorrectly linked to the "Template:" system, instead of being linked to the "JSON in Module: subpages" system.
    • I.e. it is attached to d:Q12274, whereas it should be attached to d:Q16783565.

I'll leave my notes here, in case they're useful to anyone. (and in case I'm mistaken about anything!)
And I won't fix that Trwiki link for now, (a) so that my explanation still makes sense, and (b) because it probably represents a much bigger problem which we will need a bot to fix!

I have no further questions at the moment. :-) (But do let me know if I can help you with non-technical questions)

@Quiddity thank you for this interesting observation! Do you know whether the initial linking to Wikidata pages was done by users or by bots? I'm curious because querying through the API correctly shows that w:tr:Modül:Konum haritası/veri/Polonya is Scribunto-type and belongs to namespace 828.

And, well, looking through d:Q12274 it's easy to notice that Trwiki is not the only one linked incorrectly; there's also at least w:pl:Moduł:Mapa/dane/Polska.

@LostEnchanter AFAIK, Both!

Note: I might have made some technical-keyword errors in my explanation here or above. I am not a dev!

Here's some historical context (and my apologies if you already know all this!)
Before Wikidata existed (i.e. ~2001 until ~2012/2013), we stored all the interwiki links in duplicated copies on every wiki.
I.e. within the Enwiki page about [[Moon]], at the bottom of the page we listed all the links to the versions in other languages. And ditto everywhere else. So if there were 100 articles about [[Moon]] in various languages, then there would be a list of 99 links in every single one of them (if everything was working properly!).
For most of the later years, there were a few bots that would detect changes to these lists of interwiki links at one wiki, and then propagate that change to all the other wikis.

Here are some example links:

(But there were a few bots working on this over the years, and I don't recall exactly how they all worked. E.g. I can see 5 different bots making interwiki edits just in that page's history! Addbot, MerlIwBot, Obersachsebot, Vagobot, Xqbot)

So, to figure out where any problem originated would require a LOT of digging through page-diffs. And there are probably many errors like this, or similar to this.

Lastly, there's a tangential complexity: we do still use this type of embedded interwiki link in a few edge-cases that Wikidata cannot deal with yet. The docs about that are at https://www.wikidata.org/wiki/Help:Handling_sitelinks_overlapping_multiple_items - that may or may not be a problem in Scribunto pages and/or Template pages.

@Quiddity thanks for the info, it was really interesting to find out how this problem was handled. It really looks like encountering errors of this type should not be unusual at all, considering the previous way of storing interwiki links.

As far as I understand, sitelink overlapping shouldn't be a problem of a noticeable scale, at least for Scribunto pages.

@Quiddity Thanks a lot. Your findings match mine about the Scribunto vs wikitext types, and I've checked from the language links table as well: enwiki is not connected to trwiki indeed!

This makes me re-think our use of only Scribunto modules. If modules and wikitext can work together, these may need to be merged and centralized in a single format.

As a side note, @LostEnchanter, Nick lists relevant modules using advanced search, such as this. It made me think about how this might be working; maybe it's something we could draw inspiration from regarding module similarity?

@Quiddity, @dr0ptp4kt, and others, we do need some help determining which features to give importance to in order to identify important modules, especially on how to combine the gathered stats on various data. Some questions I had specifically were:

  • How can we better use the number of categories a page is included in, if at all? Not all pages had categories, and some pages did have lots of categories, but they were all related. Would it make sense to prioritize modules that are included in more categories? Or at least in more distinct types of categories?
  • What kind of inter-module relationships can we identify from inter-module transclusions (i.e. one module transcluding another)? One idea is that a module that transcludes a lot of priority modules may be important itself. (A side thought: I keep picturing a bubble/tree diagram of sorts to show the module transclusion hierarchy. It might help users visualize the module dependencies.)
  • Tags: these seem to vary across wikis, so it was a bit hard to assess which tags could help us. As I mentioned in the report, would it be helpful to collect the number of revert/undo tags, blanking, mobile edit, advanced mobile edit, etc. for each page? Is there any existing pattern in how the user base applies these tags that we could use to identify important modules?

Lastly, how can we combine our heuristics properly? I would like to visualize them some more to look at how the limits I've mentioned interact with each other (for example, how many edits/editors do pages with > 1M transclusions have, etc.). We could create a scoring system or probability distribution with the values I've isolated fixed at 70-80%, for example: at 1M transclusions a module is 70% likely to be important, and as transclusions increase the probability increases (see the sketch below for one way this could look).
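
One hypothetical way to express that last idea, purely as an illustration: map a raw factor onto a probability with a logistic curve whose midpoint and steepness are chosen so that 1M transclusions maps to roughly 0.7. The anchor points and parameter names below are assumptions, not values from the analysis.

```
import math

def importance_probability(transclusions, anchor=1_000_000, p_at_anchor=0.7, scale=2.0):
    """Logistic curve over log10(transclusions): returns ~p_at_anchor at the
    anchor value and rises smoothly toward 1 as transclusions grow.
    All parameter values here are illustrative assumptions."""
    # shift the midpoint so that the anchor lands exactly at p_at_anchor
    offset = math.log(p_at_anchor / (1 - p_at_anchor)) / scale
    x = math.log10(max(transclusions, 1)) - (math.log10(anchor) - offset)
    return 1 / (1 + math.exp(-scale * x))

for n in (1_000, 100_000, 1_000_000, 10_000_000):
    print(n, round(importance_probability(n), 2))
```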

I realize these are somewhat open-ended queries; I just wanted to get some discussion going, and maybe gather some insights on the wiki projects as a whole along the way that can help our analysis.

My hesitant/uncertain thoughts on those questions:

  1. Categories are complicated. I would hesitantly suggest not focusing too much effort on those, because there are no strict rules for how thoroughly most pages are categorized, so the number of categories might just be an indication of the efforts of 1 or 2 people who enjoy that particular task. E.g. One of the most widely-used Modules on Enwiki is Module:Convert, but that is only in 4 categories (2 of which are related to page-protection status).
    • However, I'm primarily familiar with Enwiki, so there may be very useful insights that can be identified from the categories used on every other wiki! Plus I've never done much volunteer editing of the category systems, even at Enwiki.
  2. I'm not sure, but that hypothesis sounds good. I agree a visualization would help (or even just a simple structured-tree bullet-list for a few examples).
  3. I'm not sure. From what I can see in the Notebook, the tags most commonly used in Module: namespace are almost entirely just either "mobile"-related or "undo"-related types of tags.
    • However, I am slightly curious whether analyzing the quantity of "undo"-related types, might highlight modules that ought to be protected from vandalism, or some other insight I cannot predict?
  4. I don't know. Great question!

Thanks a lot, @Quiddity. These actually help!

Here's what we discussed early this morning for a web-based dashboard, in terms of the inputs and defaults. This is a rough outline, so it's expected things may diverge somewhat for practical reasons or because of insights gained in further use and development of the dashboard.

FORM

Choose project family:

  • Wikipedia
  • Wiktionary
  • Commons

...

  • select all
  • select none

Choose wikis:

  • enwiki
  • eswiki
  • bnwiki
  • ruwiki

...

  • Select all for project family
  • Select none
  • Exclude modules that look like data

Sensible preset weights applied to the following:

editors_count × x1
edits_count × x2
impacted_pageviews_count × x3
langlinks_count × x4
transclude_count × x5
similars_count × x6
closeness_to_50_lines_of_code × x7
edits_count_to_editors_count_ratio × x8
page_links_count × x9

🆗

The weights (x1..x9) should be editable, and a user can click 🆗 to regenerate results. Perhaps the simplest approach for weights involves simple integers in the presets, so that users don't have to contend with trying to make things add up to 1.00 or 100 precisely (even if the application automatically scales these variables before calculating). Most likely different log scales need to be applied to each ranking factor's underlying raw data, as the analysis so far indicates some fairly wild ranges of values for each ranking factor.
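
A rough sketch of how integer preset weights and per-factor log scaling could be combined into a single score, assuming raw values come in as one dict per module; the factor names follow the list above, but the preset weights and the normalization choice (log1p followed by min-max) are assumptions.

```
import math

# Illustrative integer preset weights keyed by ranking factor (the x1..x9 above)
WEIGHTS = {
    "editors_count": 1,
    "edits_count": 1,
    "impacted_pageviews_count": 2,
    "langlinks_count": 1,
    "transclude_count": 3,
    "similars_count": 2,
    "closeness_to_50_lines_of_code": 1,
    "edits_count_to_editors_count_ratio": 1,
    "page_links_count": 1,
}

def normalize_log(values):
    """log1p to tame the wild value ranges, then min-max scale to [0, 1]."""
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in logged]

def score_modules(rows):
    """rows: list of dicts, one raw value per ranking factor per module.
    Returns one weighted score per module."""
    columns = {f: normalize_log([r[f] for r in rows]) for f in WEIGHTS}
    return [sum(WEIGHTS[f] * columns[f][i] for f in WEIGHTS) for i in range(len(rows))]
```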

Some means of being able to express just how close similars are in vector space in order to influence the scoring would likely be useful here as well. Maybe this is a standalone field for input that lets the user set permitted minimum similarity / distance?

Perhaps the closeness to 50 lines of code (a guess hazarded about the sorts of modules ripe for standardization) should be a separate filter as well.

RESULTS

The initial result set from clicking 🆗 should list the 50 highest-scoring module entries, and there should be a pager to fetch 50 more at a time (or maybe just all of them if paging is too complicated).

A "show detail" disclosure button for each entry should show the following:

  • A direct link to the module
  • A Wikidata link
  • The source code of the module
  • Button to enumerate similars. It should make it possible to access a direct link to the module and also, ideally, show the source code for a given similar module (for side-by-side comparison).

It may be useful to let users have the result set grouped by wiki and then sorted by descending score within each wiki. It may also be worth considering how to show the top X for each wiki that's checked.

OTHER THINGS
Here are some other things that were discussed.

  1. Could simple git-diffing across the full set of modules aid in finding duplication? Might obfuscation / packing / compression be helpful in standardizing the format, to be able to look at similarity for some diffing step or vector-space location? It may turn out that a combination of such approaches helps (see the sketch after this list).
  2. Pageviews over the previous 30 days for pages using a module is fine. Some technique to keep refreshing the data should be devised. If the pageviews are lagging for some entries compared to others (because of the size of the data; the product of templates times pages is large), that's okay, as "ballpark" values are usually good enough for a useful ranking factor.
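
A minimal sketch of one such diffing step, assuming module source is normalized first (line comments and whitespace stripped) and then compared pairwise with a standard sequence matcher; the normalization rules and the 0.9 threshold are assumptions, not a decided approach.

```
import re
from difflib import SequenceMatcher

def normalize_lua(source):
    """Crude normalization: drop Lua line comments and collapse whitespace,
    so purely cosmetic differences don't hide near-duplicates."""
    source = re.sub(r"--.*", "", source)      # strip line comments
    return re.sub(r"\s+", " ", source).strip()

def similarity(source_a, source_b):
    """Similarity ratio in [0, 1] between two normalized module sources."""
    return SequenceMatcher(None, normalize_lua(source_a), normalize_lua(source_b)).ratio()

# Hypothetical usage: flag pairs above an assumed threshold
if similarity("local p = {}  -- doc\nreturn p", "local p = {}\nreturn p") > 0.9:
    print("likely duplicates")
```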

SECURITY AND PERFORMANCE

Be sure to safely encode things like page titles and source code (normally this is important to avoid XSS, but here it's principally about not breaking the UX). URL-encoding for titles in <a href>s will be important to ensure links work.
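
A small sketch of that encoding step, assuming a Python backend renders the links; the helper name and base-URL handling are made up for illustration.

```
from html import escape
from urllib.parse import quote

def module_link(wiki_base_url, title):
    """Build an <a> tag with a URL-encoded title and HTML-escaped link text."""
    href = f"{wiki_base_url}/wiki/{quote(title, safe=':/')}"
    return f'<a href="{escape(href)}">{escape(title)}</a>'

print(module_link("https://tr.wikipedia.org", "Modül:Konum haritası/veri/Polonya"))
```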

Different strategies may make it more or less possible to performantly query all the data. It may be easiest to have pre-computed JSON files for each wiki that hold all of the data, so that calculations can be done fairly quickly, mostly client side (if you want it to be mostly JavaScript). It may just as well be easy to make this queryable via the data store (just beware of any SQL injection risks).
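
If going the queryable-data-store route, the usual guard against SQL injection is parameterized queries; a minimal sketch assuming a MySQL/MariaDB connection (e.g. via pymysql), with a table name and columns that are purely illustrative.

```
import pymysql

def top_modules(conn, wiki, limit=50):
    """Fetch the highest-scoring modules for one wiki.
    Placeholders (%s) keep user input out of the SQL text itself."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT page_title, score FROM module_scores "
            "WHERE wiki = %s ORDER BY score DESC LIMIT %s",
            (wiki, int(limit)),
        )
        return cur.fetchall()
```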

If using a queryable data store, consider using GET query string parameters for the queries, with a canonical URL parameter ordering. You may want some sort of caching TTL set for these URLs (e.g., fifteen minutes). For a user's first visit to the report, it would be especially nice to show a report quickly (i.e., the result set of a query with all the defaults); caching on the tool's base URL might better guarantee it's usually fresh enough but also returned quickly when users are regularly accessing it, which is nice for UX. This whole business of caching may not even be necessary, depending on user load and data store performance, but it's something to keep in mind.
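
One hypothetical way to get the canonical parameter ordering, so that equivalent queries map to the same cache entry; the TTL constant simply mirrors the fifteen-minute example above.

```
from urllib.parse import urlencode

CACHE_TTL_SECONDS = 15 * 60  # mirrors the fifteen-minute example above

def canonical_query(params):
    """Sort parameters by name so equivalent requests share one cache key."""
    return urlencode(sorted(params.items()))

# "?x5=3&wiki=enwiki" and "?wiki=enwiki&x5=3" both map to the same key
print(canonical_query({"x5": 3, "wiki": "enwiki"}))
```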

tanny411 updated the task description.

@dr0ptp4kt @LostEnchanter @gengh

I shared a short doc on how the scoring metric works; maybe it's something we can incorporate into our final report too.