Fri, Nov 16
Thanks, @JAllemandou! The more recent dumps are very useful.
Thu, Nov 15
Just a follow-up: Python kernels won't start on stat1007 for some reason. I've given up trying to fix the issue for now.
Wed, Nov 14
@Ottomata was able to figure it out: https://gist.github.com/ottomata/7651d0f008aa18dcd948ef3636424b23
Also, I connect to notebooks using an Emacs plugin, and SWAP has authentication enabled, which prevents Emacs from connecting to Jupyter.
Tue, Nov 13
@fdans thanks for looking into this issue. I need to use Spark because my work is a little heavy on the computation side and would adversely affect other SWAP users if I ran it there. One of our researchers had to go back to a stats machine for this reason.
Thu, Nov 8
Sorry, @fdans, I won't be able to help. I no longer maintain that schema. I'll update the wiki page.
Tue, Nov 6
@faidon thanks! Next week sounds good. We're having our offsite this week, so there's no rush.
Fri, Nov 2
Fri, Oct 26
@jcrespo can you share the password for the 'recommendationapi' user so that I can load some data into the database (I don't have access to the private puppet repo)? Also can you tell me which hosts allow me to connect to the database? Thanks!
Thu, Oct 25
Just be careful not to show/store anything private to/on it as it's on a labs system.
You shouldn't need to directly interact with the password yourself, as I imagine puppet will just deploy it into the configuration for your service.
Thanks, @Krenair. This is very helpful. Where's the password stored? How can I get it? For the tools database it's stored at $HOME/replica.my.cnf, but this is presumably different?
Thanks, @Krenair. Can you also share any documentation on how to connect to the database?
We've stopped data collection as of now.
Wed, Oct 24
Tue, Oct 23
OK, removed the backup part.
@Nuria I'd appreciate your review of https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/468490/ before the branch cut today. I'd like the fix to go out this Thursday. Thanks!
Oct 18 2018
@Miriam I've submitted a patch to limit the link text to 100 characters and page title to 200 characters. Let me know if these numbers need to change. Thanks!
OK, added the link.
@leila should I turn off https://gapfinder-tools.wmflabs.org/section-alignment/ then?
Oct 17 2018
Looks like this is done, @mobrovac?
Oct 15 2018
Oct 11 2018
@leila, oops, I mixed up section with article. This task is assigned to me while @diego is working on it, which is what confused me. Diego should probably claim this task, IMO. What you're saying makes sense.
Oct 10 2018
\o/ I see you got some input from a native speaker for the remaining sections, @TJones.
Oct 9 2018
According to grafana, on average we're getting 613 events/second for the CitationUsagePageLoad schema. We're also getting about 41 client side errors/minute.
Further improvements will be done as part of T206083.
@Miriam any updates on this? Did you get a chance to talk with Michele and Tiziano?
Oct 5 2018
We discussed T187957#4146727 with Dario, and decided to keep things as is for now. Changes can be created both on Gerrit and on GitHub. Changes created on Gerrit will be merged using the Gerrit workflow (+2'ing). Changes created on GitHub will be pushed to Gerrit manually.
Resolving these conflicts is challenging and time-consuming, but it's nevertheless feasible.
Oct 4 2018
@TJones, OK, I'll wait for your reply and see what I should do differently while doing the rest. (Thanks for the compliment.)
@Amire80 thanks for chiming in. I think we'll all benefit from identifying these problematic interlanguage links and fixing them. Hopefully we can publish a list of issues.
I've left some notes on the talk page. I'll do the remaining bits as I find some spare time.
Oct 3 2018
@leila thanks for the lead. Do you remember whether, in 2015 (when the scripts were written), Neoplasm (en) was linked to Neoplasma (de) in langlinks? Right now, it seems that's not the case:
@leila, I'm not sure which slide is best for these, but we also worked on:
- Implemented the approach from the paper (Growing Wikipedia Across Languages via Recommendation);
- Generated article recommendations for the top 50 language pairs used in ContentTranslation;
- Created a morelike API for missing articles (still WIP though);
- Ongoing efforts to take the article creation API to production (sorted out the database issue).
@TJones, OK, I'll take a look. I'll leave a comment here when I'm done.
I know some Korean and I'd be happy to help with this task if you don't hear from native Korean speakers.
@Nuria that makes sense. Rather than limiting URL length (so that we don't get incomplete data), would it be a good idea not to report these errors? I'd detect long URLs and skip sending those events to EL. Would that work?
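A minimal sketch of that idea (hypothetical helper and cutoff; the real check would live in the instrumentation before the event is handed to EventLogging):

```python
# Assumed cutoff for illustration only, not EL's actual payload limit.
MAX_URL_LENGTH = 1000

def should_log(url):
    """Return False for URLs long enough to trigger a validation error."""
    return len(url) <= MAX_URL_LENGTH

print(should_log("https://en.wikipedia.org/wiki/Neoplasm"))         # True
print(should_log("https://en.wikipedia.org/wiki/X?" + "q" * 2000))  # False
```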
Oct 2 2018
For CitationUsagePageLoad we're getting about 450-800 events per second, which averages out to about 37,500 events per minute. At 200 errors per minute, we get one error every 187.5 events. @Miriam and I found this insignificant, and that's why we submitted this patch.
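The arithmetic, spelled out (using the midpoint of the 450-800 range):

```python
events_per_sec = 625            # midpoint of the observed 450-800 range
events_per_min = events_per_sec * 60
errors_per_min = 200

print(events_per_min)                    # 37500
print(events_per_min / errors_per_min)   # 187.5 events per error
```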
Oct 1 2018
@mobrovac no blockers left?
Sep 28 2018
Turns out we cannot reliably detect redirects across languages. For example, '"Them"' redirects to 'Them_(King_Diamond_album)' (Q1756739). Since we're trying to figure out the Wikidata ID of '"Them"', we can only search Wikidata items by English labels. There are many items with that label:
- Them (Q1338638)
- Them (Q37545106)
- Them (Q1112469)
- Them (Q3591139)
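The ambiguity can be sketched like this (toy data standing in for a Wikidata label search; the QIDs are the ones listed above):

```python
# Toy stand-in for a Wikidata label search: several items share the
# English label "Them", so a label lookup alone cannot pick out Q1756739.
items_by_label = {
    "Them": ["Q1338638", "Q37545106", "Q1112469", "Q3591139"],
}

def resolve_by_label(label):
    candidates = items_by_label.get(label, [])
    # Only an unambiguous (single-candidate) match can be resolved.
    return candidates[0] if len(candidates) == 1 else None

print(resolve_by_label("Them"))  # None: ambiguous, cannot resolve
```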
Sep 26 2018
@leila I've been experimenting with implementing section 2.1 of the paper. We can get redirects from Hive (prod.redirect), but I'm not sure how to retrieve interlanguage links, as they are not being used in Wikipedia according to this (see the intro). Do you know how?
Sep 25 2018
BTW, try adding ensure_ascii=False to json.dumps for easier debugging.
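For example, with and without the flag:

```python
import json

row = {"title": "später", "lang": "de"}

print(json.dumps(row))                      # {"title": "sp\u00e4ter", "lang": "de"}
print(json.dumps(row, ensure_ascii=False))  # {"title": "später", "lang": "de"}
```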
I think this is why it's happening. You'll see that articles appear in both current.xml and current[N].xml. Here's an example:
@diego I looked at your code briefly and tested it with lang=uz, and the output JSON didn't contain any duplicate rows. Can you paste one of the duplicate rows from ruwiki maybe?
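If it helps, a quick way to surface duplicates in a JSON-lines output (assuming one JSON object per line; the sample rows are made up):

```python
import json
from collections import Counter

def find_duplicate_rows(lines):
    """Return parsed JSON rows whose exact text occurs more than once."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [json.loads(line) for line, n in counts.items() if n > 1]

sample = ['{"page": "Лечение"}', '{"page": "Диагноз"}', '{"page": "Лечение"}']
print(find_duplicate_rows(sample))  # [{'page': 'Лечение'}]
```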
@jcrespo thanks! Looks like I misunderstood you. If DB creation is done, then I'll talk to the Services team about the productionizing part.
@jcrespo could you please create an account with username 'recommendationapiservice' with the 'SELECT' right only?
Sep 24 2018
@jcrespo, good call. I've updated the task description.
We're increasing the sampling rate for CitationUsagePageLoad from 10% to 33.3% in a few hours.
@jcrespo anything else blocking us from importing data to the database? Any documentation on connecting to the database from the services?
Sep 18 2018
Sep 17 2018
Apparently, there was no train last week so our changes didn't make it to production. I'm delaying data collection until Thursday.
Analytics heads up that we're deploying CitationUsage at 100%, and CitationUsagePageLoad at 10% (per our conversation with @Nuria on IRC) in about two hours. This should yield about 150 req/sec and 250 req/sec, respectively. Tomorrow, if these numbers hold, we'd like to increase the 10% to 33.3%, which would bring the 250 req/sec to around 800 req/sec.
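The expected scaling, spelled out:

```python
rate_at_10pct = 250                    # req/sec observed at 10% sampling
projected = rate_at_10pct * 333 / 100  # 10% -> 33.3% is a factor of 3.33
print(projected)                       # 832.5, i.e. "around 800 req/sec"
```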
@jcrespo 250K rows/sec sounds great. Batch import speed per se is not too important — I just don't want to wait hours to load data up like I did in a labs instance. And yes, starting with m2 section looks like a good idea.
Sep 11 2018
Good catch, @mforns!