Below is a description of the project, written by Adam initially in task T272003
Here's what we discussed early this morning for a web-based dashboard, in terms of the inputs and defaults. This is a rough outline, so it's expected things may diverge somewhat for practical reasons or because of insights gained through further use and development of the dashboard.
FORM
Choose project family:
- Wikipedia
- Wiktionary
- Commons
...
- select all
- select none
Choose wikis:
- enwiki
- eswiki
- bnwiki
- ruwiki
...
- Select all for project family
- Select none
- Exclude modules that look like data
Sensible preset weights are applied to the following:
editors_count | x1 |
edits_count | x2 |
impacted_pageviews_count | x3 |
langlinks_count | x4 |
transclude_count | x5 |
similars_count | x6 |
closeness_to_50_lines_of_code | x7 |
edits_count_to_editors_count_ratio | x8 |
page_links_count | x9 |
🆗
The weights (x1..x9) should be editable, and a user can click 🆗 to regenerate results. Perhaps the simplest approach for weights is simple integers in the presets, so that users don't have to make things add up to exactly 1.00 or 100 (even if the application automatically scales these values before calculating). Most likely, different log scales will need to be applied to each ranking factor's underlying raw data, as the analysis so far indicates fairly wild ranges of values for each factor.
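The integer-weights-plus-log-scaling idea above could be sketched roughly as follows. The factor names match the form; the choice of log1p as the per-factor scale and the normalization by the weight sum are illustrative assumptions, not settled decisions.

```javascript
// Sketch: combine raw per-factor values into one score using simple
// integer weights and a log scale per factor (assumption: log1p).
function scoreModule(raw, weights) {
  let score = 0;
  for (const factor of Object.keys(weights)) {
    // log1p tames the wide value ranges seen in the raw data; missing
    // factors are treated as zero.
    score += weights[factor] * Math.log1p(raw[factor] || 0);
  }
  // Dividing by the weight sum means users never have to make the
  // integers add up to 1.00 or 100 themselves.
  const totalWeight = Object.values(weights).reduce((a, b) => a + b, 0);
  return score / totalWeight;
}
```

With this shape, the preset weights stay readable (e.g. `{edits_count: 2, editors_count: 1}`) and the 🆗 button just re-runs `scoreModule` over the current result set.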
Some means of being able to express just how close similars are in vector space in order to influence the scoring would likely be useful here as well. Maybe this is a standalone field for input that lets the user set permitted minimum similarity / distance?
Perhaps the closeness to 50 lines of code (a guess hazarded about the sorts of modules ripe for standardization) should be a separate filter as well.
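The two standalone filters suggested above (minimum similarity in vector space, and closeness to 50 lines of code) might look like this. The field names (`similarity`) and the default tolerance are assumptions for illustration.

```javascript
// Sketch: drop similars below a user-set minimum similarity before
// they influence scoring. Assumes each similar carries a `similarity`
// value in [0, 1], e.g. cosine similarity.
function filterSimilars(similars, minSimilarity) {
  return similars.filter(s => s.similarity >= minSimilarity);
}

// Sketch: a separate filter for "closeness to 50 lines of code".
// The tolerance of 20 lines is an arbitrary placeholder.
function nearFiftyLines(linesOfCode, tolerance = 20) {
  return Math.abs(linesOfCode - 50) <= tolerance;
}
```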
RESULTS
The initial result set from clicking 🆗 should list the 50 highest-scoring module entries, with a pager to fetch 50 more at a time (or maybe just all of them, if paging proves too complicated).
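If the pre-computed-JSON route is taken and ranking happens client side, the pager could be as simple as slicing the already-ranked array; a data-store-backed version would use LIMIT/OFFSET instead.

```javascript
// Sketch: page through a ranked result set 50 entries at a time.
function getPage(rankedResults, pageIndex, pageSize = 50) {
  const start = pageIndex * pageSize;
  return rankedResults.slice(start, start + pageSize);
}
```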
A "show detail" disclosure button for each entry should show the following:
- A direct link to the module
- A Wikidata link
- The source code of the module
- A button to enumerate similars. This should make it possible to access a direct link to each similar module and, ideally, show its source code (for side-by-side comparison).
It may be useful to let users order the result set by wiki, then by descending score within each wiki. It may also be worth considering how to show the top X for each wiki that's checked.
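The by-wiki-then-by-score ordering (and the top-X-per-wiki variant) could be sketched as a simple group-and-sort; the `wiki` and `score` field names are assumptions.

```javascript
// Sketch: group results by wiki, sort each group by descending score,
// and optionally keep only the top N entries per wiki.
function topPerWiki(results, topN = Infinity) {
  const byWiki = new Map();
  for (const r of results) {
    if (!byWiki.has(r.wiki)) byWiki.set(r.wiki, []);
    byWiki.get(r.wiki).push(r);
  }
  for (const [wiki, entries] of byWiki) {
    entries.sort((a, b) => b.score - a.score);
    byWiki.set(wiki, entries.slice(0, topN));
  }
  return byWiki;
}
```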
OTHER THINGS
Here were some other things discussed.
- Could simple git-diffing across the full set of modules aid in finding duplication? Might obfuscation / packing / compression help standardize the format, so similarity can be assessed via a diffing step or vector space location? A combination of such approaches may turn out to help.
- Pageviews for pages using a module for the previous 30 days is fine. Some technique to keep refreshing the data should be devised. If the pageviews are lagging for some entries as compared to others (because of the size of the data; the product of templates times pages is large), that's okay, as "ballpark" values are usually good enough for having a ranking factor that's useful.
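On the normalization idea in the git-diffing bullet above: a crude first pass might strip the cosmetic differences (comments, indentation, blank lines) from Lua module source before diffing or embedding. This regex-based sketch is an assumption about the approach; it ignores Lua block comments and comment-like text inside strings, so a real pipeline would want a proper parser.

```javascript
// Sketch: normalize Lua module source so cosmetic differences don't
// mask duplication. Crude: handles only `--` line comments, not
// `--[[ ]]` block comments or "--" inside string literals.
function normalizeLuaSource(source) {
  return source
    .split('\n')
    .map(line => line.replace(/--.*$/, '')) // drop line comments
    .map(line => line.trim())               // drop indentation
    .filter(line => line.length > 0)        // drop blank lines
    .join('\n');
}
```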
SECURITY AND PERFORMANCE
Be sure to safely encode things like page titles and source code (normally this is important to avoid XSS, but here it's principally about not breaking the UX). URL-encoding for titles in <a href>s will be important to ensure links work.
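Concretely, that means HTML-escaping anything rendered as text (titles, source code) and percent-encoding titles placed in hrefs. The URL shape below is hypothetical; the wiki-to-domain mapping isn't decided here.

```javascript
// Sketch: escape text for HTML display; covers the five characters
// that matter in text nodes and attribute values.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, c => ({
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'
  }[c]));
}

// Sketch: percent-encode a page title for use in an <a href>.
// The domain argument and URL shape are placeholder assumptions.
function moduleHref(domain, title) {
  return `https://${domain}/wiki/${encodeURIComponent(title)}`;
}
```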
Different strategies may make it more or less possible to performantly query all the data. It may be easiest to have pre-computed JSON files for each wiki that contain all of the data, so that calculations can be done fairly quickly, mostly client side (if you want it to be mostly JavaScript). It may be just as easy to make this queryable via the data store (just beware of any SQL injection risks).
If using a queryable data store, consider using GET query string parameters for the queries, with a canonical URL parameter ordering. You may want some sort of caching TTL set for these URLs (e.g., fifteen minutes). For a user's first visit to the report, it would be especially nice to have results shown quickly (i.e., the result set of a query with all the defaults); caching on the tool's base URL could serve that default view.
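The canonical parameter ordering matters because two URLs with the same parameters in different orders would otherwise produce distinct cache keys. A minimal sketch, sorting parameter names alphabetically:

```javascript
// Sketch: build a canonical query string by sorting parameter names,
// so identical queries always share one URL and hence one cache entry.
function canonicalQueryString(params) {
  return Object.keys(params)
    .sort()
    .map(k => `${encodeURIComponent(k)}=${encodeURIComponent(params[k])}`)
    .join('&');
}
```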