
Detect similarity in Lua sourcecodes
Closed, Resolved · Public

Description

Some modules on different wikis always do the same thing, but with minor differences. For example, Module:Yesno always provides the same service, yet its code differs from wiki to wiki, because the words for positive and negative answers are translated into the language of the wiki where the module is used.

What can we look at?

  • Code similarity - AST comparison and other techniques - but it's important to prepare the data first and use this only as a last step, as the initial amount of data is huge and this analysis requires quite a lot of computation.
  • Title analysis - some modules adapt the source code but keep the title (like Yesno, mentioned above). These modules can be distinguished quite easily, which will lighten the further analysis; a sketch of this grouping step follows the list.
  • Function usage - if a module is highly popular, it is very unlikely to be completely unique: whatever it implements is evidently useful, so copies or variants probably exist on other wikis.
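For the title-analysis step, here is a minimal sketch of how exact-title grouping could look, assuming the module inventory has been pulled into a pandas DataFrame. The wiki, page_id, and title column names and the sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical inventory of modules harvested from all wikis.
modules = pd.DataFrame([
    {"wiki": "enwiki", "page_id": 101, "title": "Yesno"},
    {"wiki": "dewiki", "page_id": 202, "title": "Yesno"},
    {"wiki": "frwiki", "page_id": 303, "title": "Infobox"},
])

# Group modules that share an exact title across wikis; each group
# becomes one candidate cluster for the heavier similarity analysis.
title_groups = modules.groupby("title").agg(
    copies=("wiki", "nunique"),
    pages=("page_id", list),
)

# Only titles that occur on more than one wiki are worth comparing.
candidates = title_groups[title_groups["copies"] > 1]
print(candidates)
```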
Clustering Update:

A description of how similar modules were clustered is given in the document attached here.

Todo:

  • Reducing noise and merging clusters by tuning the xi parameter in OPTICS.
  • Finalizing which works better: word or document embeddings
  • Noise removal
  • Creating t-SNE plots for visualization (a tuning-and-plotting sketch follows this list)
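As a rough illustration of the tuning and plotting items above, here is a minimal sketch using scikit-learn's OPTICS and t-SNE, with a random array standing in for the real module embeddings (whether those end up word- or document-level is still open):

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # placeholder for real module embeddings

# Sweep xi: larger values demand steeper density drops, producing
# fewer, tighter clusters; smaller values merge more points.
for xi in (0.01, 0.05, 0.1):
    labels = OPTICS(min_samples=5, xi=xi).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise = np.mean(labels == -1)
    print(f"xi={xi}: {n_clusters} clusters, {noise:.0%} noise")

# 2-D t-SNE projection for visual inspection of one clustering.
coords = TSNE(n_components=2, random_state=0).fit_transform(X)
labels = OPTICS(min_samples=5, xi=0.05).fit_predict(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab20")
plt.title("Module embeddings, OPTICS labels (noise = -1)")
plt.show()
```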

Jade's Python notebook on content analysis of modules under the same title: here
Aisha's Python notebook on content similarity analysis: Aishas Notebook VI
Aisha's Python notebook on cluster analysis and tuning: Aishas Notebook VII

Event Timeline

Currently the idea is to use a metric based on the Levenshtein distance as the distance between texts (examples in this notebook, or this notebook for bigger cases). The current idea of the closeness detection algorithm looks like this:

Analysis.png (442×162 px, 20 KB)
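For reference, a minimal sketch of the normalized Levenshtein metric, written out in pure Python so it runs without dependencies (in practice a C-backed library would be far faster):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Scale the edit distance into [0, 1] so texts of very
    different lengths can be compared fairly."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```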

These modifications are considered:

  1. Normalised Levenshtein distance might work much worse on "real" input data. In that case further analysis is planned, most likely including AST construction and comparison; those steps can run in the background.
  2. Calculating the Levenshtein distance requires walking through all possible pairs, so it's quite costly, and speed is our main priority. Because of that it might be better to split the initial input group into subgroups and run the code multiple times, shuffling the subgroups between runs. The subgroup size should be constant (it is what affects computation time the most), so the number of runs should be calculated from it. It's also most likely possible to run different subgroups in different processes to speed everything up a lot; see the sketch below.
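A minimal sketch of the shuffle-and-split scheme from point 2, reusing normalized_distance from the earlier sketch. GROUP_SIZE, THRESHOLD, and the use of multiprocessing are all illustrative assumptions, not settled choices:

```python
import random
from itertools import combinations
from multiprocessing import Pool

GROUP_SIZE = 200   # fixed subgroup size keeps per-run cost bounded
THRESHOLD = 0.2    # hypothetical cutoff for "close enough"

def close_pairs(indexed_sources):
    """All-pairs comparison inside one subgroup."""
    hits = []
    for (i, a), (j, b) in combinations(indexed_sources, 2):
        if normalized_distance(a, b) <= THRESHOLD:
            hits.append((i, j))
    return hits

def one_run(sources, seed):
    """Shuffle, split into fixed-size subgroups, compare subgroups in parallel."""
    indexed = list(enumerate(sources))
    random.Random(seed).shuffle(indexed)
    chunks = [indexed[k:k + GROUP_SIZE]
              for k in range(0, len(indexed), GROUP_SIZE)]
    with Pool() as pool:  # needs a __main__ guard on spawn-based platforms
        results = pool.map(close_pairs, chunks)
    return {pair for hits in results for pair in hits}

# Repeated runs with different shuffles give cross-subgroup pairs a
# chance to land in the same subgroup at least once:
# matches = set().union(*(one_run(sources, seed) for seed in range(5)))
```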

What's left unanswered:

  1. How should we store predicted group membership and real (user-approved) group membership? The scripts table can be modified once more, but it is already huge, so maybe it would be better to create a simple new table with columns like pair_id - page_id1 - wiki1 - page_id2 - wiki2 - group_id (or predicted_group_id); see the sketch after this list.
  2. There are modules with the same titles across different wikis, and it's usually OK to group them. But should we look into similar titles too, not only identical ones?
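A minimal sketch of the pair table proposed in point 1, using sqlite3 purely for illustration. The table and column names follow the suggestion above; the production store would be something else:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE script_pairs (
    pair_id            INTEGER PRIMARY KEY,
    page_id1           INTEGER NOT NULL,
    wiki1              TEXT    NOT NULL,
    page_id2           INTEGER NOT NULL,
    wiki2              TEXT    NOT NULL,
    predicted_group_id INTEGER,            -- filled by the clustering
    group_id           INTEGER             -- filled once users approve
)
""")

# Hypothetical example row: the enwiki and dewiki copies of a module
# predicted to belong to the same group.
conn.execute(
    "INSERT INTO script_pairs (page_id1, wiki1, page_id2, wiki2, predicted_group_id) "
    "VALUES (?, ?, ?, ?, ?)",
    (101, "enwiki", 202, "dewiki", 1),
)
```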
tanny411 renamed this task from "[Abstract Wikipedia data science] Develop algorithms to detect copies in Lua sourcecodes" to "Detect similarity in Lua sourcecodes". Feb 10 2021, 8:06 AM
tanny411 claimed this task.
tanny411 updated the task description.
tanny411 updated the task description.

@LostEnchanter @gengh @dr0ptp4kt
I've attached my analysis and procedures so far. Some more tasks have been added to the todo list. Tuning takes some time because clustering runs are long, but the good thing is there's not much to tune.

tanny411 updated the task description.

@dr0ptp4kt @LostEnchanter @gengh

I've updated the pdf to include noise removal and tuning analysis. This should be it for now regarding similarity analysis. Feel free to send feedback.