Some of the modules through different wikis will always do the same, but will have minor differences. For example, Module:Yesno always provide the same service - but its code would be different, as words for positive and negative answers are translated to the language of the wiki, where this module is used.
What can we look at?
- Code similarity - ASL trees and other different technics - but it's important to prepare the data and use it only as a last step, as the initial amount of data is huge, and this analysis requires quite a lot of computations.
- Title analysis - some of the modules have some changes to the sourcecode to adapt it, but save the title (as Yesno, mentioned before). These modules can be distinguished quite easily, and it will lighten further analysis.
- Function usage - if some script is highly popular, it's highly unlikely that it's completely unique, as it should implement something very useful.
A description of how similar modules were clustered is given in the document attached here.
- Reducing noise and merging clusters by tuning the xi parameter in OPTICS.
- Finalizing which works better, word or document embedding
- Noise removal
- Creating TSNE plots for visualization
Jade's Python notebook on content analysis of modules under the same title: here
Aisha's Python notebook on content similarity analysis Aishas Notebook VI
Aisha's Python notebook on cluster analysis and tuning Aishas Notebook VII