
Create web service for providing results of analysis work
Closed, Declined (Public)

Description

Below is a description of the project, originally written by Adam in task T272003.

Here's what we discussed early this morning for a web based dashboard, in terms of the inputs and defaults. This is a rough outline, and so it's expected things may diverge somewhat for practical reasons or because of insights gained in further use and development of the web based dashboard.

FORM

Choose project family:

  • Wikipedia
  • Wiktionary
  • Commons

...

  • select all
  • select none

Choose wikis:

  • enwiki
  • eswiki
  • bnwiki
  • ruwiki

...

  • Select all for project family
  • Select none
  • Exclude modules that look like data

Sensible preset weights applied to the following:

editors_count x1
edits_count x2
impacted_pageviews_count x3
langlinks_count x4
transclude_count x5
similars_count x6
closeness_to_50_lines_of_code x7
edits_count_to_editors_count_ratio x8
page_links_count x9

🆗

The weights (x1..x9) should be editable, and a user can click 🆗 to regenerate results. Perhaps the simplest approach for weights involves simple integers in the presets, so that users don't have to contend with trying to make things add up to 1.00 or 100 precisely (even if the application automatically scales these variables before calculating). Most likely different log scales need to be applied to each ranking factor's underlying raw data, as the analysis so far indicates some fairly wild ranges of values for each ranking factor.
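The weighting idea above can be sketched as follows. This is a minimal illustration, not the tool's actual scoring: the integer values in PRESET_WEIGHTS are made up, only three of the nine ranking factors are shown, and a single log1p transform stands in for the per-factor log scales the description anticipates.

```python
import math

# Illustrative integer presets (assumed values, not the real defaults).
PRESET_WEIGHTS = {
    "editors_count": 1,
    "edits_count": 2,
    "impacted_pageviews_count": 3,
}


def normalize_weights(weights):
    """Scale integer weights so they sum to 1.0; users never do this by hand."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: value / total for name, value in weights.items()}


def log_scale(value):
    """Compress a raw count with log1p so very large values do not dominate."""
    return math.log1p(max(value, 0))


def score(module, weights):
    """Weighted sum of log-scaled ranking factors for one module (a dict)."""
    scaled = normalize_weights(weights)
    return sum(w * log_scale(module.get(name, 0)) for name, w in scaled.items())
```

Because the weights are normalized inside score, doubling every weight leaves the ranking unchanged, which is the property that lets users enter plain integers.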

Some means of being able to express just how close similars are in vector space in order to influence the scoring would likely be useful here as well. Maybe this is a standalone field for input that lets the user set permitted minimum similarity / distance?
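One way such a minimum-similarity field could work is sketched below. The description does not say which similarity measure is used, so cosine similarity is an assumption here, and similars_above and its arguments are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similars_above(target, candidates, min_similarity):
    """Return (name, similarity) pairs at or above the threshold, best first.

    `candidates` maps a module name to its vector (hypothetical layout).
    """
    scored = [(name, cosine_similarity(target, vec)) for name, vec in candidates.items()]
    kept = [(name, sim) for name, sim in scored if sim >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The length of the returned list could then feed the similars_count ranking factor, so that raising the threshold lowers the count.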

Perhaps the closeness to 50 lines of code (a guess hazarded about the sorts of modules ripe for standardization) should be a separate filter as well.

RESULTS

The initial result set from clicking 🆗 should have a list of the first 50 module entries scoring highest. And there should be a pager to fetch 50 more at a time (or maybe just all of them if paging is too complicated).
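The paging behaviour might look roughly like this, assuming each result is a dict with a "score" key (a hypothetical shape) and that pages are numbered from zero.

```python
PAGE_SIZE = 50  # page size taken from the description above


def page_of_results(scored_modules, page, page_size=PAGE_SIZE):
    """Return one page of modules sorted by descending score (page is 0-based)."""
    ranked = sorted(scored_modules, key=lambda m: m["score"], reverse=True)
    start = page * page_size
    return ranked[start:start + page_size]
```

An empty list signals that the pager has run past the last page.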

A "show detail" disclosure button for each entry should show the following:

  • A direct link to the module
  • A Wikidata link
  • The source code of the module
  • Button to enumerate similars. Should make it possible to access direct link of module and also, ideally, show the source code for a given similar module (for side by side comparison).

It may be useful to let users be able to have the result set be by wiki then by descending score in each wiki. It may be worth considering how to show the top X for each wiki that's checked as well.
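Grouping by wiki and keeping the top X per wiki could be sketched as below. The "wiki" and "score" keys and the top_per_wiki name are assumptions made for illustration.

```python
from collections import defaultdict


def top_per_wiki(scored_modules, top_x):
    """Group modules by wiki, then keep the top_x highest scores in each wiki."""
    by_wiki = defaultdict(list)
    for module in scored_modules:
        by_wiki[module["wiki"]].append(module)

    result = {}
    for wiki in sorted(by_wiki):  # stable, alphabetical wiki order
        ranked = sorted(by_wiki[wiki], key=lambda m: m["score"], reverse=True)
        result[wiki] = ranked[:top_x]
    return result
```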

OTHER THINGS
Here were some other things discussed.

  1. Could simple git-diffing across the full set of modules aid in finding duplication? Might obfuscation / packing / compression be helpful in standardizing the format to be able to look at similarity for some diffing step or vector space location? It may turn out that a combination of such approaches helps.
  2. Pageviews for pages using a module for the previous 30 days is fine. Some technique to keep refreshing the data should be devised. If the pageviews are lagging for some entries as compared to others (because of the size of the data; the product of templates times pages is large), that's okay, as "ballpark" values are usually good enough for having a ranking factor that's useful.
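For item 1, a crude version of "standardize the format, then diff" can be tried with the standard library. This sketch swaps in Python's difflib for git diffing, and its normalization only strips Lua line comments and collapses whitespace; it does not handle multi-line block comments or "--" inside string literals.

```python
import difflib
import re


def normalize_source(source):
    """Strip Lua line comments and collapse all whitespace runs to one space."""
    without_comments = re.sub(r"--[^\n]*", "", source)
    return " ".join(without_comments.split())


def similarity_ratio(source_a, source_b):
    """Similarity in [0, 1] between two module sources after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize_source(source_a), normalize_source(source_b)
    )
    return matcher.ratio()
```

Comparing every pair this way grows quadratically with the number of modules, so it is probably only practical as a second pass over candidates already found by some cheaper method.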

SECURITY AND PERFORMANCE

Be sure to safely encode things like page titles and source code (normally this is important to avoid XSS, but here it's principally about not breaking the UX in this case). URL-encoding for titles in <a href>s will be important to ensure links work.
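A minimal sketch of both kinds of encoding using only the Python standard library. The module_link and source_block helpers are hypothetical, and the /wiki/ path prefix is an assumption about how the target wiki builds its page URLs.

```python
import html
from urllib.parse import quote


def module_link(base_url, title):
    """Build an <a> tag: URL-encode the title for href, HTML-escape it for display."""
    # MediaWiki page URLs conventionally use underscores in place of spaces.
    path_title = quote(title.replace(" ", "_"), safe=":/")
    href = f"{base_url}/wiki/{path_title}"
    return f'<a href="{html.escape(href, quote=True)}">{html.escape(title)}</a>'


def source_block(source):
    """Wrap module source in <pre>, escaping <, > and & so it renders verbatim."""
    return f"<pre>{html.escape(source)}</pre>"
```

The two encodings are not interchangeable: the title is percent-encoded where it becomes part of a URL and HTML-escaped where it becomes visible text.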

Different strategies may make it more or less possible to performantly query all the data. It may be easiest to have pre-computed JSON files for each wiki that have all of the data so that calculations can be done fairly quickly mostly client side (if you want it to be mostly JavaScript). It may just as well be easy to make this queryable via the data store (just beware of any SQL injection risks).
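If the data store route is taken, parameterized queries keep user-supplied values out of the SQL text. The sketch below uses sqlite3 purely for illustration; the scores table and its dbname, title and score columns are hypothetical, and the real tool's store may differ.

```python
import sqlite3


def top_modules(conn, dbnames, limit=50):
    """Fetch the highest-scoring modules for the chosen wikis.

    User-chosen values are passed only as bound parameters, never
    interpolated into the SQL string.
    """
    if not dbnames:
        return []
    # Only the number of "?" placeholders depends on user input, not their content.
    placeholders = ", ".join("?" for _ in dbnames)
    query = (
        "SELECT dbname, title, score FROM scores "
        f"WHERE dbname IN ({placeholders}) "
        "ORDER BY score DESC LIMIT ?"
    )
    return conn.execute(query, [*dbnames, int(limit)]).fetchall()
```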

If using a queryable data store, consider using GET query string parameters for the queries, with a canonical URL parameter ordering. You may want to set some sort of caching TTL for these URLs (e.g., fifteen minutes). For a user's first visit to the report, it might be especially nice to have a report shown quickly (i.e., the result set of a query with all the defaults), for example by caching on the tool's base URL.
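A canonical parameter ordering can be as simple as sorting keys and repeated values before encoding, so that equivalent queries map to one cacheable URL. The parameter names used in the test below (family, wiki) and the helper names are assumptions; the fifteen-minute TTL is the example value from the text, expressed here as a standard Cache-Control header.

```python
from urllib.parse import urlencode

CACHE_TTL_SECONDS = 15 * 60  # fifteen minutes, as suggested above


def canonical_query_string(params):
    """Encode params with sorted keys and sorted repeated values.

    Values are assumed to be strings or collections of strings.
    """
    pairs = []
    for key in sorted(params):
        value = params[key]
        values = sorted(value) if isinstance(value, (list, tuple, set)) else [value]
        pairs.extend((key, str(v)) for v in values)
    return urlencode(pairs)


def cache_headers(ttl=CACHE_TTL_SECONDS):
    """HTTP response headers asking caches to keep the response for ttl seconds."""
    return {"Cache-Control": f"public, max-age={ttl}"}
```

A request whose query string differs from its canonical form could be redirected to the canonical URL, so that the cache holds only one copy per distinct query.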

Event Timeline

LostEnchanter created this task.

@tanny411 @gengh @dr0ptp4kt

Today was the second day of fighting with the rollout of the production version, but I believe I have finally fixed everything for the current version, so the main functionality is working. You can test it at https://abstract-wiki-ds.toolforge.org/
I'm planning on fixing the CSS tomorrow, and maybe adding the language filter, which is left out for now because it takes too long to check all the entries.

On the filter issue: @tanny411, there's a filter_families_with_linkage function in scores_retrieval.py that uses the approach we discussed today. It uses a dataframe with database-family-language info from the meta database and checks the project family against it. Somehow, when I tested locally, the code in this function took more than 5 times longer compared to the current implementation. If you have any idea how we can make it faster, please let me know! I'm not that good with dataframes, so I might be using something improperly.

Hi @LostEnchanter

So I couldn't test it because I couldn't find where you populated the linked_df dataframe, sorry about that. But this snippet should be enough:
Get a list of all dbs with the chosen families:
dbs = linkage_df[linkage_df['family'].isin(chosen_families_list)]['database']
Filter the scores dataframe with the retrieved dbs list:
df = df[df['dbname'].isin(dbs)]

You can do the same thing for language as well in the exact same way. For example:

dbs = linkage_df[(linkage_df['family'].isin(chosen_families_list))
                 & (linkage_df['language'].isin(chosen_language_list))]['database']

A couple of things, some of which I think you may already have on your list, and some improvements for later:

  • site title
  • the vue logo in title.
  • link to the actual module page (maybe when loading its source, the page title could be the actual link)
  • list of important modules can be listed as a table with their scores and other data (features) as well. Maybe some useful information for the user to assess.
  • pagination in list of important modules
  • pagination in similar modules list (showing nearby clusters, as we discussed)

Question: Are we not including the special wikis? Instead of having a list of hardcoded families, maybe you can use the meta_table acquired data to get a list of all families as well as languages and display them. To avoid repeated calls, just saving them in an array when the app initializes should work.
You will understand the app performance issues better, so I'm leaving the decision to you, let me know where I can jump in.

> Instead of having a list of hardcoded families, maybe you can use the meta_table acquired data to get a list of all families as well as languages and display them. To avoid repeated calls, just saving them in an array when the app initializes should work.
> You will understand the app performance issues better, so I'm leaving the decision to you, let me know where I can jump in.

This one is easily doable!

I'll test how these dataframe filters perform, thanks for the input.

@tanny411 Thanks for the help with Dataframes, they are way faster when used like that.

Update with filtering by language is live.

@LostEnchanter Glad I could help! And really great work!

Aklapper subscribed.

@LostEnchanter: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

Closing this down with a Declined, to indicate no further work to be done.