
Collect relevant data about the Modules for analysis
Closed, Resolved (Public)



Phabricator: T270492

After collecting the Scribunto module contents, data about these modules needs to be collected for analysis. The first question is what data can be collected to determine which modules are most used and could be centralized for the Abstract Wikipedia project. Then comes the part of centralizing and making the code more modular, removing as much redundancy as possible. This task concerns the first part: collecting relevant data.

  • Understanding how Lua functions are invoked and used in various wikis
  • Going through the database schema to find ways to collect module usage data and the relationships between modules, plus additional information where possible
    • Page table
      • title: To check code redundancy, e.g. titles that change with language but mean the same thing
      • length
      • content: To check code similarity
      • page_latest: Current revid. It may be 0 during page creation. Used to check for updates
      • page_is_new
      • is_redirect
    • Revision table
      • number of rows (edits) per page: To identify the most-edited modules
      • minor_edits: number of major vs. minor edits
      • time of first edit
      • time of last edit: Use first and last edit times to identify edit frequency and recent edits.
      • Number of editors (contributors)
      • Number of anonymous edits
    • Page protection table
      • Protection level for edit of each module
      • Protection level for move of each module
    • Iwlinks table (contains interwiki links only, not transclusions)
      • Number of places the module page was linked (not transcluded) from other wikis.
    • Pagelinks table (contains in-wiki links, en → en or bn→ bn)
      • Number of places in the same wiki a module page is linked (not transcluded). This along with # of interwiki links can give us total # of places a module page is linked.
    • Langlinks table
      • Number of languages a module is available in
    • Categorylinks table
      • How many categories a module belongs to (Note that category list varies in various wikis. See more from category table)
    • Templatelinks table
      • Number of modules transcluded in a module: To find more about inter-module relations, we should use the templatelinks table
      • Number of pages that use a module (and later find page views of those pages to find usage stats)
    • Change tag table: A tag is associated with every revision of a page. This section only includes some aggregate information. More analysis can be done from the tags table directly. See tag list here.
      • Most common revision tag for each module.
  • Database queries to replicas to collect relevant information
  • Store all information in a feasible manner in user database
  • Add proper comments in code
  • Set and test cronjobs
  • Save missed wikis and load them after other crons are over
  • Optimize cronjobs
  • Add error catching and auto-retry for MySQL errors such as lost connections and deadlocks
  • Solve interface errors (solved by re-arranging the cursor and connection open/close so that connections are not closed for being idle)
  • Repeated failures to get tags and transclusions in certain wikis (enwiki tl, commons, and frwiki tags): fixed by using the analytics cluster to connect to the databases.
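The error-catching and auto-retry item above can be sketched as a small wrapper. This is a minimal sketch with placeholder names: `TransientDBError` stands in for the driver's transient errors (with pymysql these would be `OperationalError` codes such as 2013 "lost connection" and 1213 "deadlock"), and `run_with_retry` is a hypothetical helper, not the actual tool code.

```python
import time

# Placeholder exception: a real Toolforge job would catch the MySQL driver's
# transient errors instead (e.g. pymysql.err.OperationalError 2013 / 1213).
class TransientDBError(Exception):
    pass

def run_with_retry(fn, retries=3, delay=2.0, transient=(TransientDBError,)):
    """Call fn(), retrying on transient DB errors with a fixed backoff.

    Re-raises the last error once the retries are exhausted.
    """
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except transient:
            if attempt == retries:
                raise
            time.sleep(delay)

if __name__ == "__main__":
    # Usage sketch: wrap each query so a dropped connection mid-cron
    # does not kill the whole run.
    calls = {"n": 0}
    def flaky_query():
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientDBError("Lost connection to MySQL server")
        return [("Module:Citation", 42)]
    print(run_with_retry(flaky_query, delay=0))  # succeeds on the third try
```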

Event Timeline

tanny411 updated the task description.

@LostEnchanter Hi, I spent a couple of days going through the entire database layout and extracting as much information as I found relevant. I have listed it all out. Next, I will be going through how to get pageview information, as that is not in the database, and then I will start storing all the info in the user database.

For now, I'd like you to go through the info I've extracted (as listed in the task) in your free time and let me know if you think I have missed something.


@tanny411 you did a really good job!

I've looked through the list in Google Docs and this task. I think it's a really comprehensive list of the information we can get. The only question I have is: what is your idea behind working with the page protection table? I'm just not sure how its content can help our analysis.

My idea was that some pages are highly protected and this may mean they are important modules (therefore also used in a lot of places). Those can be prioritized to be centralized.

> My idea was that some pages are highly protected and this may mean they are important modules (therefore also used in a lot of places). Those can be prioritized to be centralized.

Good catch! Thank you, it might really help.

Hi @dr0ptp4kt,
So the issues I've been having involve the iwlinks table (code here) and templatelinks table (code here).

The templatelinks query is quite simple so I guess the time taken is really due to the size of the table. It takes around 9 hours.

As for the iwlinks table, I've been questioning how important this data is for us anyway, haha. It has been running for more than a day now.
Anyways, because of the way the table stores its information, with a URL instead of a page title, I needed to use the interwiki table. Unfortunately, the interwiki table is empty in the replica DBs, so I created an interwiki table in the user DB and populated it using the API (code here). Next I copied the iwlinks table over to the user DB and joined the interwiki and iwlinks tables to get the job done. I could not find a way to query across the ToolsDB and wiki-replica databases, hence this workaround.
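The interwiki-table workaround can be sketched roughly like this. The JSON shape matches what `action=query&meta=siteinfo&siprop=interwikimap` returns from a wiki's api.php; the embedded sample data and the `interwiki_rows` helper are illustrative only, not the actual code linked above.

```python
import json

# Sample of the shape returned by
# api.php?action=query&meta=siteinfo&siprop=interwikimap&format=json;
# a live job would fetch this over HTTP instead of embedding it.
SAMPLE = json.loads("""
{"query": {"interwikimap": [
    {"prefix": "en", "url": "https://en.wikipedia.org/wiki/$1"},
    {"prefix": "bn", "url": "https://bn.wikipedia.org/wiki/$1"},
    {"prefix": "commons", "url": "https://commons.wikimedia.org/wiki/$1"}
]}}
""")

def interwiki_rows(siteinfo):
    """Turn the interwikimap reply into (prefix, url) rows ready to INSERT
    into a user-db interwiki table (table name and schema are up to the tool)."""
    return [(e["prefix"], e["url"]) for e in siteinfo["query"]["interwikimap"]]

rows = interwiki_rows(SAMPLE)
# These rows can then be joined against iwlinks in the user database
# to resolve each interwiki link's target wiki.
```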

Let me know if you see how I can optimize this code; it definitely is a heavy operation.

@dr0ptp4kt It seems that, due to the running jobs, Toolforge has gotten super slow. It's really hard to continue working on other things from Toolforge. Should I stop the jobs for now (although they have been running for a long time)? Debating with myself.

@tanny411 I think your intuition is probably correct: better to skip iwlinks for now if it's causing too many problems. You can always come back later and see if there's anything of high value there. The other tables queried definitely contain some interesting stuff, we know that much!

I'm not 100% sure, but it may be possible to speed up the templatelinks query a bit by removing the DISTINCT operator without much loss of data quality. There may be a few edge cases, but most of the time COUNT(*) should yield the same result as COUNT(DISTINCT), because there ought to be just one tl_from per tl_namespace + tl_title pair (the three columns form a composite key, and I believe tl_from_namespace is merely extra detail that shouldn't produce duplicates). Even if it costs a little data quality, a savings of several hours (e.g., 9 hours with the DISTINCT operator versus 5 without it) may be worth it.
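A quick way to sanity-check this suggestion on a toy table (sqlite3 stands in for the replica here; 828 is the Module namespace, and the sample rows are made up): because (tl_from, tl_namespace, tl_title) form the key, each page appears at most once per linked title, so COUNT(*) and COUNT(DISTINCT tl_from) agree per group.

```python
import sqlite3

# Tiny in-memory model of templatelinks: the three columns form the
# primary key, so a page can link a given title only once.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE templatelinks (
    tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT,
    PRIMARY KEY (tl_from, tl_namespace, tl_title))""")
con.executemany("INSERT INTO templatelinks VALUES (?, ?, ?)", [
    (1, 828, "Citation/CS1"), (2, 828, "Citation/CS1"),
    (2, 828, "Arguments"),    (3, 828, "Arguments"),
])

# Same aggregate with and without DISTINCT (828 = Module namespace).
q = """SELECT tl_title, COUNT({}) FROM templatelinks
       WHERE tl_namespace = 828 GROUP BY tl_title ORDER BY tl_title"""
plain = con.execute(q.format("*")).fetchall()
distinct = con.execute(q.format("DISTINCT tl_from")).fetchall()
assert plain == distinct == [("Arguments", 2), ("Citation/CS1", 2)]
```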

For slow running toolforge, that may be symptomatic of the time of year, lots of people kicking off stuff either at the end of last year or the beginning of this newer year. If it's still running (slowly) when you get this, yeah, I'd stop it and try again (after first removing iwlinks stuff from the pipeline), assuming your queries and processing are dragging more than they were before and it isn't other jobs blocking execution of your jobs.

If your jobs are getting blocked by a bunch of other enqueued jobs for too long or you're just seeing things are running more slowly than before, you might try the Freenode wikimedia-cloud channel (if you haven't done so already on this item) to see if folks can help troubleshoot and determine if there's a way to get jobs prioritized/unlocked or otherwise sped up.

I'm interested to learn what you discover!

Hi, thanks for the feedback! I was working on solving some issues with the pageviews; I am going to try out your suggestion for the templatelinks table later today.

Indeed, tl_from is a unique value, as it is the page_id. Similarly, there were other instances where I could remove the DISTINCT. I still need to measure the time improvement, but there will be no data loss, that's for sure.

tanny411 renamed this task from "[Abstract Wikipedia data science] Collect relevant data about the Modules for analysis" to "Collect relevant data about the Modules for analysis". Jan 25 2021, 5:52 AM
tanny411 updated the task description.
tanny411 updated the task description.