Yesterday
Done.
Note that the list of articles is new (the sample used to be random; I have now fixed it so we always get the same articles, in case we need to fix something).
Please find the samples required in this link.
Fri, May 6
Updates
- We are exploring cross-lingual article quality. Details on this task: T305390
Apr 25 2022
Thanks @Isaac. I'll check this.
Apr 20 2022
pyspark2 --master yarn --deploy-mode client --executor-memory 8g --driver-memory 8g --conf spark.dynamicAllocation.maxExecutors=128
but the result is the same.
In this case fastText is running on the driver. You can increase the driver memory when launching the Spark environment.
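A minimal sketch of one way to do this (our illustration with illustrative values, not the original reply): in yarn-client mode the driver JVM is launched before any Python code runs, so the driver memory has to be set on the launcher (the pyspark2 flags above, or PYSPARK_SUBMIT_ARGS), not through SparkSession.builder.

```python
# Hedged sketch, not the original snippet: set driver memory on the launcher
# before PySpark starts the JVM; all values here are illustrative.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master yarn --deploy-mode client "
    "--driver-memory 16g --executor-memory 8g "
    "--conf spark.dynamicAllocation.maxExecutors=128 "
    "pyspark-shell"
)

import pyspark  # imported after setting the env var so the launcher sees it
```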
Apr 15 2022
- We have published the documentation about this project here.
- All code and data is available and linked on the documentation page.
- Results and dataset details have been published. You can find them here.
- I'm marking this task as resolved, and also updating it in Betterworks.
Apr 11 2022
- Last week we completed the onboarding on using PySpark and cluster data.
- As a first step, @paramita_das will be working on obtaining the article quality distribution over time for enwiki.
- Last week we completed the onboarding on using PySpark and cluster data.
- @Aitolkyn is starting to explore how to match bad references with pageviews.
Apr 8 2022
- We have published the alignments for 205 languages here.
- Each folder contains the alignments from that language to all others. For example 'enwiki' contains the alignments from English to all the other wikis.
- The format is SQLite. @santhosh, could you confirm you are able to read the files? (A minimal reading sketch follows this list.)
- We are working on the algorithm and output documentation.
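A minimal sketch of opening one of these files (the file name is an assumption; the schema should be inspected before querying):

```python
# Hedged sketch: inspect a per-language SQLite alignment file.
import sqlite3

conn = sqlite3.connect("enwiki.sqlite")  # hypothetical file name
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cur.fetchall())  # list the tables before querying alignments
conn.close()
```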
- The dataset has been released here.
- The paper will be published in May.
- We finished this project; the results can be found on Meta, and the code and models can be found on GitLab.
- I'll discuss future work with @Lydia_Pintscher.
Apr 7 2022
Thanks @Isaac for these inputs. There was a mistake in the title; this work is about article quality, not specifically about citations.
Apr 4 2022
The internship ends on June 24th.
@diego: We also need the estimated end date of the internship (you'll be contacted two weeks before it expires to decide whether to extend access or not).
Apr 2 2022
@paramita_das please update the task description with your SSH key.
Apr 1 2022
- We have tested our model on the CX dataset (section translations done using the CX Tool).
- Results show good performance. @MunizaA, please report the precision@5 for the top-100 language pairs. (A small sketch of the metric follows this list.)
- We are now running the alignments for all languages, and the results will be ready early next week.
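For reference, a minimal sketch of the metric (our illustration; the actual evaluation code is not part of this task):

```python
# Hedged sketch: precision@k over ranked alignment candidates.
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k ranked candidates that are correct alignments."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

# Example: 2 of the top-5 suggested target sections are correct -> 0.4
print(precision_at_k(["History", "Life", "Career", "Works", "Legacy"],
                     {"History", "Works"}))
```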
Mar 25 2022
- I'm preparing a short presentation with the main findings of this project. I'm planning to present these results during the next disinformation working group meeting.
Updates
- We decided to go back to the XGBoost-based model, because its results were better than those of the Spark implementation.
- We noticed a decrease in precision for under-resourced languages. Our hypothesis is that the quality of the embeddings created by mBERT is not very high for those languages. We decided to create a second, language-agnostic model and then compare the results. Our intuition is that for some languages the language-agnostic model will be better.
- We plan to release all these results at the end of next week.
Updates
- I have been comparing the results when adding anonymous edits; so far I haven't found major differences from the previous results. I'll continue working on this during the next week, before my next meeting with Lydia.
Mar 18 2022
- I've presented the main results of this work during the Tuesday Research Sessions; the slides can be found here.
- We are working on applying the model at scale. @MunizaA has been experimenting with native Spark libraries to see if it is possible to replace external dependencies. The quality of the first results is not satisfactory, so we are exploring alternatives. (An illustrative example of such a native library follows this list.)
- No updates.
- Our collaborators got their publication accepted; the dataset should be released in the following weeks.
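On the native-library point above, purely as an illustration of the kind of built-in alternative meant (an assumption, not necessarily what was tried): Spark MLlib ships its own Word2Vec estimator.

```python
# Hedged sketch: Spark MLlib's built-in Word2Vec, one example of a native
# library that could replace an external embedding dependency like fastText.
from pyspark.ml.feature import Word2Vec

w2v = Word2Vec(vectorSize=100, minCount=5,
               inputCol="tokens", outputCol="embedding")
model = w2v.fit(tokens_df)  # hypothetical DataFrame with a 'tokens' column
```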
Mar 7 2022
Updates
- No updates
- We are fine-tuning the model.
- I'm currently studying the propagation of "climate change"-related items (thanks, Isaac, for the dataset).
- We met with Lydia and discussed the current results.
- We reviewed the results, confirming that most co-edited items correspond to ongoing events, even when we change the time window considered.
- Now I'll be studying the relevance/prevalence of anonymous edits on popular content.
Feb 18 2022
Updates
- No update.
- With @MunizaA, we have annotated data for Spanish to English and Urdu to English.
- We found that the popularity of sections (the number of articles they appear in) has a huge impact on the quality of the results.
- While popular sections have multiple possible translations, the most infrequent ones usually have only one or two.
- We are trying to improve the model to address these issues.
- We are also analyzing how to use MT to improve the results.
- We are shaping the paper and checking which new experiments would be required.
- No updates
Feb 12 2022
- No updates.
- @MunizaA has uploaded these sample files covering several languages. Each of them contains the top-200 most frequent sections in the source language.
- @Pginer-WMF, please have a look at them. Keep in mind that we are focusing on recall more than precision. For now, we are showing the top-20 most similar target sections per source section. (A minimal ranking sketch follows this list.)
- I'll coordinate a meeting in the following days to discuss how to tune these results.
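A minimal sketch of the ranking step described above (our illustration; names are assumptions):

```python
# Hedged sketch: rank target-language sections by cosine similarity of
# their embeddings to a source section's embedding, keeping the top 20.
import numpy as np

def top_k_similar(source_vec, target_vecs, target_names, k=20):
    s = source_vec / np.linalg.norm(source_vec)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    scores = t @ s                   # cosine similarity per target section
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(target_names[i], float(scores[i])) for i in order]
```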
- No updates
- I'm working on identifying collaborative edits on Wikidata items that are not related to current events.
Feb 4 2022
- We are in the process of writing the paper.
- No updates
- We are preparing the data for some of the language pairs suggested by @Pginer-WMF.
- @MunizaA has uploaded the code here: https://gitlab.wikimedia.org/mnz/section-alignment/-/tree/muniza-notebooks/notebooks
- No updates
Jan 22 2022
- No updates.
- We have done manual sanity checks on the data extraction pipeline, confirming that it is working properly.
- The next step will be to run the model on 20 new languages.
- No updates
- We are now focusing on understanding collaboration patterns: when/how more than one user edits the same item in a given period of time. (A counting sketch follows this list.)
- We found that in Wikidata such collaborations are less frequent than in other Wikimedia projects.
- We also found that items edited by more than one user are usually related to ongoing events (awards, deaths, releases).
- I'll present some of these findings:
- At the research meeting (Tuesday) in March.
- And @Lydia_Pintscher will propose a date, probably in April, to present these results to the Wikidata folks.
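The counting sketch mentioned above (ours; the column names and the 30-day window are assumptions):

```python
# Hedged sketch: flag Wikidata items touched by more than one editor
# within a 30-day window. Input path and columns are illustrative.
from pyspark.sql import functions as F

edits = spark.read.parquet("wikidata_edits.parquet")  # hypothetical input
collab = (
    edits
    .groupBy("item_id", F.window("edit_timestamp", "30 days"))
    .agg(F.countDistinct("editor_id").alias("n_editors"))
    .filter(F.col("n_editors") > 1)
)
```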
Jan 17 2022
- I'm gathering all the latest results and organizing them to write the report.
- We are analyzing the results shown above before deciding on the next steps.
Updates
- No updates
- I'm organizing the new results to be discussed with the stakeholder.
Jan 13 2022
Yes.
Jan 8 2022
- @MunizaA has run the first experiments comparing the new language model with our old fastText-based model, obtaining promising results. (@MunizaA, please share the new results here.)
- The next steps are:
- Test the model for language pairs without training data.
- Estimate the time required to run the model in the 100+ languages supported by this new approach.
- I'm focusing on modeling the relationship between topics and collaborations/controversies.
- I'm working on a graph representation of these components.
- I've updated the task according to the new plan discussed with our collaborators.
- No updates
Jan 3 2022
I've updated the code here: https://github.com/digitalTranshumant/templatesAlignment/blob/master/02alignmentsSpark.ipynb
Dec 24 2021
- @MunizaA has developed the full pipeline to efficiently extract all the features used in the original model, such as link similarity and edit distance. (A sketch of one such feature follows this list.)
- We are currently preparing the experiment to validate our results using the new language model (to replace fastText).
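As an illustration of one of the named features (our sketch, standard library only; not the pipeline's actual code):

```python
# Hedged sketch: normalized edit similarity between two section titles.
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the titles are identical."""
    return SequenceMatcher(None, a, b).ratio()

print(edit_similarity("References", "Referencias"))  # ~0.86
```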
- We have received the report from our collaborators with the description of the dataset and the results of their model.
- We will coordinate the release of the dataset during this FY.
- We have seen that few items are edited by more than one user.
- We are currently researching the item and user characteristics related to collaborative work.
- We are testing a new DL model to predict content propagation, using content reliability as one of the features.
Dec 8 2021
Oh, got it! This setup was changed around one year ago. Now we all use the Spark environments provided by JupyterHub.
@KartikMistry, this looks like a PySpark configuration issue. Which kernel are you using?
Dec 3 2021
- The paper can be found here, and this task is done.
- No updates this week. I'm going to meet with the stakeholder next week.
Updates
- No updates
- The dataset has been created; I'm coordinating with our collaborators on the timing and format of the data release.
- We obtained the first results with the new language models. @MunizaA, could you please report the numbers here?
Nov 12 2021
- @MunizaA is testing new language models that could be more efficient and possibly more accurate than the fastText embeddings used in the previous experiments.
- I've been working on a classifier to predict reverts. (A minimal sketch follows this list.)
- The current classifier uses article (item), revision, and user information.
- On a balanced test set, the current model achieves over 70% accuracy.
- However, there is a set of caveats to be considered:
- 'auto-reverts': users can revert themselves, which shouldn't be considered a signal of controversy. We need to analyze this behavior further.
- power users: we need to take into account that a small set of users produces most of the edits and reverts; this could affect our results. We are working on different sampling methods to address this issue.
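A minimal sketch of the setup described above (ours; feature and column names are illustrative, and the real model may differ):

```python
# Hedged sketch: revert prediction on a balanced sample with a
# gradient-boosted classifier. All names here are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_parquet("revisions.parquet")  # hypothetical feature table

# Balance classes by downsampling non-reverted revisions.
pos = df[df["was_reverted"] == 1]
neg = df[df["was_reverted"] == 0].sample(len(pos), random_state=0)
balanced = pd.concat([pos, neg])

features = ["item_age_days", "editor_edit_count", "bytes_changed"]
X_train, X_test, y_train, y_test = train_test_split(
    balanced[features], balanced["was_reverted"],
    test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```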
- The Meta page was updated with the results from Q1 and partial results from Q2.
Updates
- Meta page was updated.
Updates
- No updates
Nov 5 2021
- The paper has been accepted at the IEEE BigData 2021 conference.
- I'll upload the paper and write documentation on Meta in the following weeks.
- No updates this week.