Mon, Jun 22
- Talk pages are now included in the data.
- I generated a new contribution graph: a bipartite graph of users and wiki pages/talk pages, with edit edges weighted by the number of edits.
- I tried multiple graph mining algorithms on the contribution graph to detect "sub-communities". So far, these techniques either didn't improve performance or didn't scale to the data.
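For reference, a minimal sketch of how a weighted bipartite contribution graph like the one above could be built. This is not the actual pipeline: the edit records and the function names (`build_contribution_graph`, `adjacency`) are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical edit log: one (user, page) pair per edit.
edits = [
    ("Alice", "Page:Graph theory"),
    ("Alice", "Page:Graph theory"),
    ("Alice", "Talk:Graph theory"),
    ("Bob", "Page:Graph theory"),
    ("Bob", "Talk:Bipartite graph"),
]


def build_contribution_graph(edits):
    """Return {(user, page): number_of_edits}, i.e. the weighted edit edges."""
    return Counter(edits)


def adjacency(weights):
    """Index the weighted edges from both sides of the bipartite graph."""
    by_user = defaultdict(dict)
    by_page = defaultdict(dict)
    for (user, page), weight in weights.items():
        by_user[user][page] = weight
        by_page[page][user] = weight
    return by_user, by_page


weights = build_contribution_graph(edits)
by_user, by_page = adjacency(weights)
print(by_page["Page:Graph theory"])
```

Keeping both indexes makes it cheap to go from a user to their pages and from a page to its editors, which is what most community-detection passes need.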
Fri, Jun 19
Wed, Jun 17
Mon, Jun 15
wiki @ Deployed a model to recommend properties for Wikidata. This could be an idea for template recommendation in wiki-pages.
- Tested a new model by adding concept-vectors and interaction graph.
- The model is now slightly more difficult to interpret but achieves a better AUC (75%), using XGBoost.
- Refactored the data preparation code in Scala. The code is much more scalable and can regenerate the necessary training data in 1 day on our analytics cluster.
- Discussed the API endpoints and the potential deployment environment (ORES?) with the product team.
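As a reminder of what the AUC figure measures, here is a small self-contained sketch that computes AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. It is independent of XGBoost; the function name `auc` and the toy labels and scores are illustrative only and unrelated to the real model.

```python
def auc(labels, scores):
    """AUC via pairwise comparison: P(score of positive > score of negative).

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, not from the actual model.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic, so it is only meant for small sanity checks; on real data a rank-based implementation from an ML library would be the practical choice.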
Mon, Jun 8
The paper was submitted to RecSys
- I was finally able to process a large enough view of Wikipedia history (from 2015 onwards). This should line up with the SSO rollout, so that user_text can be used as a unique ID across wikis.
- Transitioned to a new model based on word analysis to accommodate multiple wikis. I'll check what this is capable of. Basically, I gave up on sentiment analysis.
Jun 5 2020
Not a problem. Could you possibly point me to the source code?
Hi @ArielGlenn, do you know who maintains this dataset?
Jun 2 2020
Jun 1 2020
Hello @srijan. I didn't compute these metrics. Basically, processing only parts of enwiki creates an incomplete fingerprint for any user. Unless my current effort in making a pass on the full data succeeds, I plan on sampling users to obtain targeted full edit history.
Abstract uploaded, and work is ongoing to restructure the paper for RecSys.
Theme: Credibility and Verifiability
Two speakers confirmed: Connie Moon Sehat and Tiziano Piccardi
- The first model is ready but with relatively low performance (~60% AUC). It was trained on a subset of the English-language data. Calculating all-time edit diffs remains a challenge for such a large wiki.
- Ongoing work on tuning the model to improve the results.
May 25 2020
No updates yet, but I might submit the abstract today.
May's showcase is concluded.
- Continued progress in building the model and preparing for the demo.
- Meeting with Amir and Niharika: We discussed the potential of integrating his code, ethical considerations, and the features that can be added/hidden.
To run the query on Hive if some fields contain a newline char:
May 24 2020
I came to report this issue and found that it has existed since 2015.
My estimate is that half of the abstracts dataset does not contain any info at all, but rather a few bytes from the infoboxes. A simple parsing issue, I assume.
May 19 2020
May 15 2020
Progress in building the model and updating the code.
Set up a timeline for deployment.
No specific updates here. (note: abstract due May 25th)
Gathered the abstracts; preparing for the showcase next week.
May 10 2020
May 9 2020
I am waiting to receive the abstracts/titles; I'll update here when I receive them.
- Building a new ground truth dataset from archived SPI reports
No updates for last week. I expect to start this work and send a status email around mid-week.
May 5 2020
Step 5: Yes, we slightly tailor the communication with the speakers to clarify this. I will add a short paragraph on relevance in the email.
Step 7: Good idea.
May 4 2020
- Re-implemented most of the code; still missing the training data and the user "embedding" pipeline.
- Gathering recent sock-puppet investigation outcomes for training.
No updates (RecSys deadline June 1st)
Mar 23 2020
No news here.
Trying link-graph embedding methods for link recommendation.
Mar 16 2020
Finalize the details of the March showcase (Zoom, team meeting, moderation, etc.)
Refining the code and the documentation.
I didn't get to discuss a formal collaboration yet. The above is just building a dataset, which is handled by a student, with some ideas from me (this may still lead to a "resource" paper).
Other exploratory work that may impact this OKR is ongoing: I am working on graph embeddings, which is closely related.
Mar 9 2020
Planning a team update during the 3/18 showcase, where the theme is topic models.
Weekly update: exploring ideas on mapping Wikidata to Wikipedia sections using fuzzy title matching.
No updates. I need to finish the paper on section alignment in March, but the submission might slip to April (RecSys?)
Built the model for English and annotated the provided articles.
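To make the fuzzy title matching idea concrete, here is a rough sketch using Python's standard `difflib`. It only illustrates the general technique, not the approach actually explored: the helper names (`normalize`, `best_section_match`), the 0.6 threshold, and the sample titles are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, turn underscores into spaces, and collapse whitespace."""
    return " ".join(title.lower().replace("_", " ").split())


def best_section_match(label, section_titles, threshold=0.6):
    """Return (section_title, score) for the closest title, or None.

    `label` stands in for a Wikidata-side label; `threshold` is an
    arbitrary cut-off chosen for this sketch.
    """
    best = None
    for title in section_titles:
        score = SequenceMatcher(None, normalize(label), normalize(title)).ratio()
        if best is None or score > best[1]:
            best = (title, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


# Hypothetical section titles of an article.
sections = ["Early life", "Education", "Career", "Personal Life"]
print(best_section_match("personal_life", sections))
print(best_section_match("education", sections))
```

Character-level similarity is cheap but only catches near-identical titles; labels that are related in meaning but spelled differently would need something beyond string matching.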
Mar 2 2020
Weekly update: Nothing to report this week.
Gave a demo to the growth team, with some ambassadors present. Training the model for English articles for internal evaluation.
Weekly update: no further work on this front yet.
Feb 24 2020
Weekly update: I booked a speaker for March on multilingual NLP (Jordan Boyd-Graber).
Weekly update: no further work on this front yet. Discussing the potential for article structure recommendation with the growth team.
Finished a new version of the link rec, and sent it out for evaluation by ambassadors. Next step: discuss deployment details with the growth team.
Feb 17 2020
@PPham Yes, I understand, this is extremely helpful, thank you so much. I will incorporate this observation.
Feb 15 2020
@PPham Thanks for the feedback! I am separating words that are formatted or in quotes because I took it that the added style is meant to highlight (or single out) the word. This is more of a rule that I apply to all languages.
More articles will come. Thanks again.
Feb 10 2020
Working on an updated version of the linkrec; entity detection in text wasn't satisfactory.
Weekly update: booked a speaker for April on Human-ML, and looking for others on topic modeling.
Jan 22 2020
Compiled a list of actions from the feedback received.
Generated links for English articles for evaluation.
Weekly update: synced with Jonathan to brainstorm a list of themes for the next 6 months
- I got to a point where I can rerun Diego's code (not PSL though) and reproduce the current numbers. Modifying it to run cross-validation turned out to be more challenging than expected (not enough space, and time-consuming). I am re-engineering the data pipeline.
- Postponed the planned submission to February.
Weekly update: reviewing related work for this task, as it is also relevant to the section alignment. A first attempt is underway, leveraging research on property recommendations for Wikidata items.
Jan 13 2020
Dec 16 2019
I would like to request Kerberos credentials for the stat and notebook machines.
Nov 13 2019
Oct 4 2019
I have signed the Acknowledgement of Wikimedia Server Access Responsibilities Document (L3)