
Detect similarity in Lua sourcecodes
Closed, Resolved · Public

Description

Some modules on different wikis always do the same thing, but with minor differences. For example, Module:Yesno always provides the same service, yet its code differs from wiki to wiki, because the words for positive and negative answers are translated into the language of the wiki where the module is used.

What can we look at?

  • Code similarity - AST comparison and other techniques - but it's important to prepare the data first and use this only as a last step, as the initial amount of data is huge and this analysis requires quite a lot of computation.
  • Title analysis - some modules adapt the source code but keep the title (like Yesno, mentioned above). These modules can be distinguished quite easily, which will lighten the further analysis; a sketch of this grouping step follows the list.
  • Function usage - if a module is highly popular, it is very unlikely to be completely unique: whatever it implements is evidently useful, so copies or variants probably exist on other wikis.
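For the title-analysis step, here is a minimal sketch of how exact-title grouping could look, assuming the module inventory has been pulled into a pandas DataFrame. The wiki, page_id, and title column names and the sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical inventory of modules harvested from all wikis.
modules = pd.DataFrame([
    {"wiki": "enwiki", "page_id": 101, "title": "Yesno"},
    {"wiki": "dewiki", "page_id": 202, "title": "Yesno"},
    {"wiki": "frwiki", "page_id": 303, "title": "Infobox"},
])

# Group modules that share an exact title across wikis; each group
# becomes one candidate cluster for the heavier similarity analysis.
title_groups = modules.groupby("title").agg(
    copies=("wiki", "nunique"),
    pages=("page_id", list),
)

# Only titles that occur on more than one wiki are worth comparing.
candidates = title_groups[title_groups["copies"] > 1]
print(candidates)
```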
Clustering Update:

A description of how similar modules were clustered is given in the document attached here.

Todo:

  • Reducing noise and merging clusters by tuning the xi parameter in OPTICS.
  • Finalizing which works better: word or document embeddings
  • Noise removal
  • Creating t-SNE plots for visualization (a tuning-and-plotting sketch follows this list)
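As a rough illustration of the tuning and plotting items above, here is a minimal sketch using scikit-learn's OPTICS and t-SNE, with a random array standing in for the real module embeddings (whether those end up word- or document-level is still open):

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # placeholder for real module embeddings

# Sweep xi: larger values demand steeper density drops, producing
# fewer, tighter clusters; smaller values merge more points.
for xi in (0.01, 0.05, 0.1):
    labels = OPTICS(min_samples=5, xi=xi).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise = np.mean(labels == -1)
    print(f"xi={xi}: {n_clusters} clusters, {noise:.0%} noise")

# 2-D t-SNE projection for visual inspection of one clustering.
coords = TSNE(n_components=2, random_state=0).fit_transform(X)
labels = OPTICS(min_samples=5, xi=0.05).fit_predict(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab20")
plt.title("Module embeddings, OPTICS labels (noise = -1)")
plt.show()
```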

Jade's Python notebook on content analysis of modules under the same title: here
Aisha's Python notebook on content similarity analysis: Aishas Notebook VI
Aisha's Python notebook on cluster analysis and tuning: Aishas Notebook VII

Event Timeline

Currently the idea is to use a metric based on the Levenshtein distance as the distance between texts (examples in this notebook, or this notebook for bigger cases). The current idea of the closeness detection algorithm looks like this:

Analysis.png (442×162 px, 20 KB)
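For reference, a minimal sketch of the normalized Levenshtein metric, written out in pure Python so it runs without dependencies (in practice a C-backed library would be far faster):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Scale the edit distance into [0, 1] so texts of very
    different lengths can be compared fairly."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```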

These modifications are considered:

  1. Normalised Levenshtein distance might work much worse on "real" input data. In that case further analysis is planned, most likely including AST construction and comparison; those steps can run in the background.
  2. Calculating the Levenshtein distance requires walking through all possible pairs, so it's quite costly, and speed is our main priority. Because of that it might be better to split the initial input group into subgroups and run the code multiple times, shuffling the subgroups between runs. The subgroup size should be constant (it is what affects computation time the most), so the number of runs should be calculated from it. It's also most likely possible to run different subgroups in different processes to speed everything up a lot; see the sketch below.
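A minimal sketch of the shuffle-and-split scheme from point 2, reusing normalized_distance from the earlier sketch. GROUP_SIZE, THRESHOLD, and the use of multiprocessing are all illustrative assumptions, not settled choices:

```python
import random
from itertools import combinations
from multiprocessing import Pool

GROUP_SIZE = 200   # fixed subgroup size keeps per-run cost bounded
THRESHOLD = 0.2    # hypothetical cutoff for "close enough"

def close_pairs(indexed_sources):
    """All-pairs comparison inside one subgroup."""
    hits = []
    for (i, a), (j, b) in combinations(indexed_sources, 2):
        if normalized_distance(a, b) <= THRESHOLD:
            hits.append((i, j))
    return hits

def one_run(sources, seed):
    """Shuffle, split into fixed-size subgroups, compare subgroups in parallel."""
    indexed = list(enumerate(sources))
    random.Random(seed).shuffle(indexed)
    chunks = [indexed[k:k + GROUP_SIZE]
              for k in range(0, len(indexed), GROUP_SIZE)]
    with Pool() as pool:  # needs a __main__ guard on spawn-based platforms
        results = pool.map(close_pairs, chunks)
    return {pair for hits in results for pair in hits}

# Repeated runs with different shuffles give cross-subgroup pairs a
# chance to land in the same subgroup at least once:
# matches = set().union(*(one_run(sources, seed) for seed in range(5)))
```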

What's left unanswered:

  1. How should we store predicted group membership and real (user-approved) group membership? The scripts table can be modified once more, but it is already huge, so maybe it would be better to create a simple new table with columns like pair_id - page_id1 - wiki1 - page_id2 - wiki2 - group_id (or predicted_group_id); see the sketch after this list.
  2. There are modules with the same titles across different wikis, and it's usually OK to group them. But should we look into similar titles too, not only identical ones?
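A minimal sketch of the pair table proposed in point 1, using sqlite3 purely for illustration. The table and column names follow the suggestion above; the production store would be something else:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE script_pairs (
    pair_id            INTEGER PRIMARY KEY,
    page_id1           INTEGER NOT NULL,
    wiki1              TEXT    NOT NULL,
    page_id2           INTEGER NOT NULL,
    wiki2              TEXT    NOT NULL,
    predicted_group_id INTEGER,            -- filled by the clustering
    group_id           INTEGER             -- filled once users approve
)
""")

# Hypothetical example row: the enwiki and dewiki copies of a module
# predicted to belong to the same group.
conn.execute(
    "INSERT INTO script_pairs (page_id1, wiki1, page_id2, wiki2, predicted_group_id) "
    "VALUES (?, ?, ?, ?, ?)",
    (101, "enwiki", 202, "dewiki", 1),
)
```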
tanny411 renamed this task from "[Abstract Wikipedia data science] Develop algorithms to detect copies in Lua sourcecodes" to "Detect similarity in Lua sourcecodes". Feb 10 2021, 8:06 AM
tanny411 claimed this task.
tanny411 updated the task description.
tanny411 updated the task description.

@LostEnchanter @gengh @dr0ptp4kt
I've attached my analysis and procedures so far. Some more tasks have been added to the todo list. Tuning takes some time because clustering runs are long, but the good thing is there's not much to tune.

tanny411 updated the task description.

@dr0ptp4kt @LostEnchanter @gengh

I've updated the pdf to include noise removal and tuning analysis. This should be it for now regarding similarity analysis. Feel free to send feedback.