Description
Phabricator: T270492
After collecting Scribunto module contents, data about these modules needs to be collected for analysis. The open question is which data can be collected to determine which modules are most used and can be centralized for the Abstract Wikipedia project. The step after that is centralizing the code and making it more modular, removing as much redundancy as possible. This issue concerns the first part: collecting relevant data.
Tasks
- Understanding how Lua functions are invoked and used in various wikis
- Going through the database schema to find ways to collect module usage data and relationships between modules, plus additional information if possible
- Page table
- title: To check code redundancy, e.g. the title changes with language but means the same thing
- length
- content: To check code similarity
- page_latest: Current revid. It may be 0 during page creation. Used to check for updates
- page_is_new
- is_redirect
- Revision table
- Number of rows (edits) per page: Identify the most edited modules
- minor_edits: Number of major vs. minor edits
- Time of first edit
- Time of last edit: Use first and last edit times to identify edit frequency and recent edits.
- Number of editors (contributors)
- Number of anonymous edits
- Page protection table
- Protection level for edit of each module
- Protection level for move of each module
- Iwlinks table (contains interwiki links only, not transclusions)
- Number of places the module page was linked (not transcluded) from other wikis.
- Pagelinks table (contains in-wiki links, en → en or bn → bn)
- Number of places in the same wiki where a module page is linked (not transcluded). Together with the number of interwiki links, this gives the total number of places a module page is linked.
- Langlinks table
- Number of languages a module is available in
- Categorylinks table
- How many categories a module belongs to (note that the category list varies across wikis; see the category table for more)
- Templatelinks table
- Number of modules transcluded in a module: To find more about inter-module relations, we should use the templatelinks table
- Number of pages that use a module (and later find page views of those pages to find usage stats)
- Change tag table: A tag is associated with every revision of a page. This section only includes some aggregate information. More analysis can be done from the tags table directly. See tag list here.
- Most common revision tag for each module.
- Database queries to replicas to collect relevant information
- Store all information in a feasible manner in the user database
- Add proper comments in code
- Set and test cronjobs
- Save missed wikis and load them after other crons are over
- Optimize cronjobs
- Add error catching and auto-retry for MySQL errors such as lost connections and deadlocks
- Solve interface errors (solved by rearranging where the cursor and connection are opened and closed, so that connections are not closed for being idle)
- Repeated failures to get tags and transclusions in certain wikis (enwiki tl, commons and frwiki tags): Fixed by using the analytics cluster to connect to the databases.
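
The Revision table aggregates listed above (edits per page, minor edits, first and last edit time, number of editors, anonymous edits) could plausibly be collected with one grouped query per wiki. The following is a minimal sketch under stated assumptions, not the query actually used in this task: it runs against a toy in-memory SQLite database that stands in for a wiki replica, the tables contain only the few `page`, `revision` and `actor` columns the query needs, and the names `module_revision_stats` and `build_demo_db` are hypothetical. Against a real replica with a MySQL driver, the `?` placeholder would typically be `%s`.

```python
import sqlite3

# Toy stand-in for the MediaWiki `page`, `revision` and `actor` tables.
# Assumption: only the columns used below are created; real replicas have more.
SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_user INTEGER, actor_name TEXT);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_actor INTEGER,
                       rev_minor_edit INTEGER, rev_timestamp TEXT);
"""

MODULE_NAMESPACE = 828  # namespace number of "Module:" pages

# One row per module page with the aggregates from the Revision table list.
# Anonymous edits are counted as revisions whose actor has no user id.
REVISION_STATS_SQL = """
SELECT p.page_id,
       p.page_title,
       COUNT(*)                    AS edits,
       SUM(r.rev_minor_edit)       AS minor_edits,
       MIN(r.rev_timestamp)        AS first_edit,
       MAX(r.rev_timestamp)        AS last_edit,
       COUNT(DISTINCT r.rev_actor) AS editors,
       SUM(a.actor_user IS NULL)   AS anonymous_edits
FROM page p
JOIN revision r ON r.rev_page = p.page_id
JOIN actor a ON a.actor_id = r.rev_actor
WHERE p.page_namespace = ?
GROUP BY p.page_id, p.page_title
ORDER BY edits DESC, p.page_id
"""


def module_revision_stats(conn):
    """Return one dict of revision aggregates per module page."""
    cur = conn.execute(REVISION_STATS_SQL, (MODULE_NAMESPACE,))
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def build_demo_db():
    """Create an in-memory database with a few made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
        (1, 828, "Foo"), (2, 828, "Bar"), (3, 0, "Article"),
    ])
    conn.executemany("INSERT INTO actor VALUES (?, ?, ?)", [
        (1, 10, "Alice"), (2, 11, "Bob"), (3, None, "192.0.2.1"),
    ])
    conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?, ?)", [
        (1, 1, 1, 0, "20200101000000"),
        (2, 1, 2, 1, "20200301000000"),
        (3, 1, 3, 0, "20200201000000"),
        (4, 2, 1, 0, "20200501000000"),
        (5, 3, 2, 0, "20200601000000"),  # article namespace, filtered out
    ])
    return conn


if __name__ == "__main__":
    for row in module_revision_stats(build_demo_db()):
        print(row)
```

Timestamps are stored as `YYYYMMDDHHMMSS` strings, so `MIN` and `MAX` give the first and last edit by plain string comparison. Edit frequency can then be derived from `edits`, `first_edit` and `last_edit` after the query returns.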
Drafts
- Doc for discussion on which data points are relevant for the analysis: google doc
- PAWS notebook with experimental analysis: Aishas notebook
- PAWS notebook with database exploration and queries: Aishas notebook II