
Add a citolytics query prefix that exposes the relatedness information from the elasticsearch index
Closed, Declined · Public

Description

Summary

Currently the CirrusSearch API provides a morelike feature that returns a list of similar pages computed by Elasticsearch's More Like This query. The goal of this task is to expose, in the same way, the relatedness information calculated by the Apache Flink-based Citolytics project. A good query prefix would be citolytics:"Pagetitle". Citolytics would be an additional source for the Read More feature (related pages project page). Because its algorithmic approach differs from the current morelike system, we hope to increase user engagement by providing better recommendations.

Implementation

  • The article recommendations can be integrated with a custom KeywordFeature, e.g. CitolyticsKeywordFeature, that is triggered by the citolytics: prefix and modifies the search query.
  • Recommendation data can be stored in an additional field of the CirrusSearch index.
  • The Flink job that generates the recommendations from a wiki XML dump can output in ES bulk format (see the sketch after this list). Its output can be used to populate the CirrusSearch index either manually or automatically from within the Oozie pipeline.
  • The pull request can be found on Gerrit: https://gerrit.wikimedia.org/r/#/c/329626/
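
As a rough illustration of the bulk-format idea above (a sketch only: the input structure and sample values are assumptions, and the citolytics field and update layout follow the format discussed in the comments below), a few lines of Python that turn per-page recommendation lists into ES bulk update actions:

import json

def to_bulk_lines(recommendations):
    """recommendations: iterable of (page_id, [(recommended_title, score), ...])."""
    for page_id, recs in recommendations:
        # Action line: partial update of an existing CirrusSearch page document.
        yield json.dumps({"update": {"_id": page_id, "_type": "page"}})
        # Source line: store the ranked recommendations as a plain, unindexed array.
        yield json.dumps({"doc": {"citolytics": [
            {"page": title, "score": score} for title, score in recs
        ]}})

if __name__ == "__main__":
    sample = [(12345, [("Technical_University_of_Berlin", 0.87),
                       ("Humboldt_University_of_Berlin", 0.55)])]
    print("\n".join(to_bulk_lines(sample)))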

Demo

Goal Visibility & Success Metrics

  • The Citolytics recommendations will be available in the Android app. See T148833.
  • The Android app's event logging data should also be used to evaluate the success. See T149682.

Event Timeline


@mschwarzer getting a new extension deployed is not realistic within the given timeframe, so I'd recommend using an existing extension such as CirrusSearch.

However, for local testing it might be easier for you to start with a standalone extension.

CirrusSearch already has a process to populate some data from the Hadoop cluster to our elasticsearch indices (for pageviews data).
If your TSV file is the end result for all possible pages:

Page_A → Recommended_Page_1, Recommended_Page_2, Recommended_Page_3, ... (how many?)

We could maybe consider loading this data into elastic and use it as a simple KV store.

Then you'll have to branch your code in the morelike special function to interrogate this new field instead of running the lucene more_like query.

This field would be a simple array and I don't think it needs to be indexed so no need to change the mapping.
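
A minimal sketch of that read path (not the actual Cirrus implementation; the index name and the citolytics field name are placeholders), fetching the pre-computed array from the page document instead of running a more_like_this query:

import requests

ES = "http://localhost:9200"  # placeholder; a test Elasticsearch instance

def related_pages(page_id, index="enwiki_content", field="citolytics"):
    # Fetch only the stored recommendation array for the given page document.
    resp = requests.get("%s/%s/page/%s" % (ES, index, page_id),
                        params={"_source": field})
    resp.raise_for_status()
    return resp.json().get("_source", {}).get(field, [])

print(related_pages(12345))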

@EBernhardson do you think that will work?

if the goal is to attach an array of strings to a page id, and then return that list of strings when queried, yea that should be reasonably doable.

Thanks! This way sounds more feasible. I'll add an extra query prefix (citolytics:) to CirrusSearch.

@dcausse, can you point me to the CirrusSearch process you mentioned for writing data from Hadoop/HDFS to elasticsearch? I could only find a class for writing to elasticsearch without HDFS.

@mschwarzer the project we use for all the discovery analytics stuff is here: https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/analytics

The script that talks to elastic is oozie/transfer_to_es/transferToES.py. It's designed to send a single score per document, but I think it can easily be adapted to send other types of data.

Aklapper renamed this task from Citolytics MediaWiki extension to Develop a Citolytics MediaWiki extension. Sep 28 2016, 12:05 PM
mschwarzer renamed this task from Develop a Citolytics MediaWiki extension to Integrate Citolytics to CirrusSearch MediaWiki extension. Oct 22 2016, 11:11 AM
mschwarzer added a project: CirrusSearch.
mschwarzer updated the task description. (Show Details)

@dcausse The transferToES.py script should work for writing the JSON data from HDFS to ES. But what would, in general, be the best approach to get the Citolytics recommendations into the CirrusSearch ES instance?

Currently the recommendations are generated with an Apache Flink job and written to HDFS. The next step would be to somehow write the data to an additional ES index on the CirrusSearch instance. Is there any way to directly write the data from an external source to CirrusSearch? Can the whole generation process be done on Wikipedia infrastructure? Or is there any other way to push the recommendations to ES?

@mschwarzer what is the size of the resulting JSON data? If I remember right, only the intermediate dataset was huge; i.e., we do not need the long tail, since only the few best recommendations should be displayed in the end.

@mschwarzer I think we can adapt transferToES.py to push Citolytics data as well.

A few constraints:

  • we are limited to content namespaces (ns0 on wikipedias) for now
  • The pageviews data is computed and loaded once a week
  • I'd suggest using an additional property in the current indices rather than creating a new index.
  • Is all the data you need available in the wmf analytics cluster?

Concerning hadoop => elasticsearch process everything is in https://phabricator.wikimedia.org/diffusion/WDAN/browse/master/

The high-level approach is:

  1. compute your stuff and store it inside a hive table
  2. run transferToES.py to push this data into elastic

Oozie is used as the orchestration tool of all of this.

What happens for pageviews is that we use a hive table named popularity_score as a pivot; this table is generated with a simple hive query and orchestrated with this coordinator. The code is located in oozie/popularity_score.

transferToES is part of another oozie workflow; it's scheduled to start when the popularity score is done (see input-events in transfer_to_es/coordinator.xml).

In your case, instead of a simple hive query, you have an Apache Flink job. What you need as a first step is to put this job inside an oozie workflow. I don't know the output of this job, but I'd suggest creating and storing it inside a hive table like:

  • page_id
  • project_id
  • suggestions: [ { page_id: 123}, { page_id: 234}, ... ] // or with titles; how many do you want to keep: top 5 (10?)
  • partition info (year, month, day...)

So the question is: in which format do you write to HDFS today? Is it possible for you to write this data into a partitioned hive table?
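
For illustration, a minimal PySpark sketch of that first step (a sketch only, using the Spark 2 API; the HDFS path and table name are placeholders, and the columns follow the layout suggested above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder HDFS path for the Flink job's output (one JSON record per line assumed).
df = spark.read.json("hdfs:///user/citolytics/output/2016-11")

(df.select("page_id", "project_id", "suggestions", "year", "month", "day")
   .write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .saveAsTable("discovery.citolytics_recommendations"))  # hypothetical table name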

The final step would be to adapt transferToES.py to take multiple input tables into account; currently it works with a single input table. We could then populate the pageviews + citolytics data in a single run.
The transferToES oozie coordinator would wait on both popularity_score and citolytics to run.

Note that this applies if you want to fully productionize your process in the wmf infrastructure.

Unfortunately I don't know the details of your apache flink job nor if it's ok to run it on the wikimedia hadoop cluster. Have you talked about it with the analytics team?


Your questions:

Currently the recommendations are generated with an Apache Flink job and written to HDFS

How is it automated today? Would a weekly refresh rate work for you?

Is there any way to directly write the data from an external source to CirrusSearch?

Yes, writing data to an elasticsearch index is trivial, and we could write a simple bash/python script to push a tsv file to elastic, but the difficulty here lies in the way you want to integrate this process into production and have your data refreshed properly. If by "external source" you mean a process outside the wmf production cluster, then no, it's not possible to write directly to elastic: you need production access for that.

Can the whole generation process be done on Wikipedia infrastructure?

I have no idea. I talked about Apache Flink with the analytics team and it would be the first job of this kind. The job seems batch oriented, and it's possible to use a plain java process with oozie, given that all the data you need is available in hive. You should probably ask them on irc (#wikimedia-analytics).

Or is there any other way to push the recommendations to ES?

Yes and no, it depends on the level of automation you want here. If you want a fully automated process from hive to elastic, then oozie must be used. If you want to run your job manually, store the output in a csv file and then push it to elastic, that's possible as well, but it would only be a POC.

@Physikerwelt The JSON output for top-10 recommendations (including scores) is around 2GB in size (without scores 1.3 GB).

@dcausse Thanks for your detailed answer. An automated process with the oozie workflow should be the long-term goal. However, a proof-of-concept should be enough for now, since at first we need to prove that the Citolytics recommendations outperform MoreLikeThis.

How would we proceed for this? I can provide the recommendations in any format (TSV/JSON/...).

@mschwarzer I think I can try to load your data into our test elastic servers. Can you make this data available somewhere on the stat1002 or stat1004 machines? A JSON file would be ideal, but I don't really care; note that page ids would be a lot easier for me.

I don't think we'll use the scores right now.

You can start to work on integrating your keyword in cirrus.

How many wikis do you plan to support?

@dcausse I currently do not have access to the analytics cluster. Is it possible to upload it somewhere else?

For the start, we would focus on the English wiki, but in general Citolytics can be used for all languages.

@mschwarzer there's something I don't get: you said that you store your data in HDFS; how can you do that without access to the stat100x machines?

Unfortunately, if you don't have the rights to access these machines, I would need someone else with sufficient authority to review and approve your data before it can be loaded into the production cluster. Sadly, I have no guarantee that this data is safe if it has not been generated within the wmf cluster.

@Physikerwelt do you have access to WMF resources where we can store the recommendations?

@dcausse Sorry for the misunderstanding. I meant that the Apache Flink job uses HDFS technology-wise to store its output. It's not yet deployed to WMF infrastructure.

@mschwarzer no, I have no special contract with WMF. I think it would be best to regenerate the data within the WMF cluster. The dump you have processed does not have the actual page and version ids. I think all that needs to be done is to deploy Flink
https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/yarn_setup.html
and to run your job, which is nicely documented at
https://github.com/wikimedia/citolytics
Maybe a good idea would be to discuss that on IRC.

I think it'd be better to clarify the context and the scope of this project; I initially thought it was an on-going WMF project.
Since it will require WMF resources to expose these recommendations to readers, I don't know the details, but I think you need to have an explicit agreement with the WMF.
I'm very interested in helping you, but I think you need to make sure that this project is viable in the WMF infrastructure. From what I understand, exposing the recommendations is not a big deal, but generating them is more problematic:

  • your job requires Flink and it's not currently deployed
  • it seems that your job requires wiki dumps accessible in hdfs, but afaik that's not the case today
  • can the wmf cluster support the extra load?
  • it seems focused on enwiki and will require some adaptations to run on other wikis

I don't think it's wise to engage with an A/B test if the WMF is not ready to support the extra cost needed to maintain your solution.

If the proposal was to offer static files generated outside the wmf cluster solely for research purposes, then I'm afraid I can't help, but again I don't know the details. I'd suggest you contact the research team.

@dcausse the first research step of the project was done and published. I think the first evaluation gives very good indications that the link-based approach provides benefits for the readers. The A/B testing idea was born to get additional validation of the suitability of that approach. We believe that especially languages which are not as widely spoken as English will benefit from it.

However, running and maintaining a Flink job on the cluster might not be the most effective thing to do if Flink is currently not used for any other task. Thus, I think we should discuss whether we really need Flink here, or whether @mschwarzer can reimplement the code in a language that can be executed with technology that is already available.

Let's have a closer look at the actual map and reduce tasks that are performed to calculate the similarity scores:
  • the mapper: https://github.com/wikimedia/citolytics/blob/master/src/main/java/org/wikipedia/processing/DocumentProcessor.java
  • the reducer: https://github.com/wikimedia/citolytics/blob/master/src/main/java/org/wikipedia/citolytics/cpa/operators/CPAReducer.java
  • a slight complication due to optional redirect resolution: https://github.com/wikimedia/citolytics/blob/master/src/main/java/org/wikipedia/citolytics/cpa/WikiSim.java#L130
If we could access the wikitext directly from a database, the mapper could be simplified (i.e. we do not need to process the dump). Moreover, if we could access the redirect table directly this would also eliminate the optional step.
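
For example, the optional redirect-resolution step could in principle be replaced by a single query against the redirect table. This is a sketch only: the replica host and credential file are placeholders, while the rd_from/rd_namespace/rd_title columns are the standard MediaWiki schema.

import pymysql

conn = pymysql.connect(read_default_file="~/.my.cnf",  # placeholder credentials
                       host="enwiki.labsdb",            # placeholder replica host
                       db="enwiki_p")

QUERY = """
SELECT src.page_title AS redirect_title, rd_title AS target_title
FROM redirect
JOIN page AS src ON src.page_id = rd_from
WHERE rd_namespace = 0 AND src.page_namespace = 0
"""

with conn.cursor() as cur:
    cur.execute(QUERY)
    # redirect title -> canonical target title
    redirect_map = dict(cur.fetchall())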

Hi all. I went through the abstract, introduction, and bits of the methodology section, as well as the discussion in this task. Here are my questions for the researchers:

  • I understand the desire to do an A/B test for validating the results of this research further. However, can we start with a cheaper/easier test model first? For example, have you considered creating a list of recommendations based on your approach versus recommendations from morelike? (The latter is currently being used to give read recommendations on certain platforms in Wikipedia.) The quality of such recommendations can be compared outside of Wikipedia, in a crowdsourcing environment. If you have considered this and decided not to go with it, can you expand why?
  • One aspect that I like about your approach is that it doesn't provide read recommendations solely based on what readers read. This helps us reinforce how an encyclopedia should be read (to some extent) as opposed to completely going with how it's read and recommending articles to be read based on readership alone.
  • One potential issue I can see down the line with your approach is when/if hyperlinks get more automatically added to Wikipedia. We've done this research https://arxiv.org/abs/1512.07258 which gives us recommendations on links to be added to Wikipedia. The strength of your approach relies on organic growth of hyperlinks on Wikipedia. (And there is the practical question that if the WMF goes with your approach, someone should own the algorithm and update it in the future. I understand this is way down the line, but we need to think about these given the limited resources/infrastructure available to us, even for just helping you to test a research idea that is developed to a good extent.).
  • I have another question for you, too: we know that 66% of the hyperlinks added in one month to enwiki are never clicked in the month after it. This means that at the moment, we're adding links that are not being used. How can this potential issue have impact on your recommendations? (Why would we recommend something that no one reads? There are strategic/pedagogical reasons we may have for doing so, something for you to think about.)

And a practical remark:
As someone who has worked on building formal collaborations for the Research team, I'd like to point out that the current request does not fall under what the Research team does in terms of Formal Collaborations (in my experience). The way I can see testing on live Wikipedia work in this case would be if someone in Discovery is willing to take this project to the next level, to improve search and recommendations. I'm happy to make myself available for consultation if we have that person in Discovery, and if needed. :)

I'm a huge fan of the idea of serving better (for some definition of better) related article suggestions through Cirrus/Elastic, possibly alongside the original recommendations. I think it makes sense given Discovery's mission to make the wealth of knowledge in our projects more discoverable. So, +1 to the idea here.

That said, there are practical scheduling concerns. The Search Team is a little behind on our work at the minute; we're finishing off our Q1 goal to enable BM25, and we're already almost one month in to Q2. It seems like this would be a sizeable amount of work for us to undertake (@dcausse can correct me if I'm wrong), so I don't see us getting to this this quarter. We'll keep it on our radar for the future. If this plan blocks someone's timeline, please let me know.

Also @Deskana: let's make sure that we quantify what "better" is. If we want to run A/B testing we need to identify the metrics we are trying to move and the effect we are hoping to detect. In the absence of this it doesn't make much sense to plan to update current extensions or, really, do any technical work quite yet.

@Nuria Agreed. The definition of "better" can come when we're closer to commencing work. Any ideas anyone has in the mean time would be welcome!

@leila Thanks for your questions!

I understand the desire to do an A/B test for validating the results of this research further. However, can we start with a cheaper/easier test model first? For example, have you considered creating a list of recommendations based on your approach versus recommendations from morelike? (The latter is currently being used to give read recommendations on certain platforms in Wikipedia.) The quality of such recommendations can be compared outside of Wikipedia, in a crowdsourcing environment. If you have considered this and decided not to go with it, can you expand why?

Yes, we considered such a user study. In our previous research we already did a manual evaluation of MLT vs CPA (the concept behind Citolytics) within our team. But the problem with these small-scale lab studies is that they carry less significance due to their size and the lack of available domain knowledge. It is not feasible to reach a user base at Wikipedia scale that can be considered domain experts - especially with respect to the various topics covered by Wikipedia. Moreover, an integrated A/B testing system would allow other researchers to evaluate their recommender systems.

One aspect that I like about your approach is that it doesn't provide read recommendations solely based on what readers read. This helps us reinforce how an encyclopedia should be read (to some extent) as opposed to completely going with how it's read and recommending articles to be read based on readership alone.

We also plan to compare our content-based approach against a user-based recommender system (in the future).

One potential issue I can see down the line with your approach is when/if hyperlinks get more automatically added to Wikipedia. We've done this research https://arxiv.org/abs/1512.07258 which gives us recommendations on links to be added to Wikipedia. The strength of your approach relies on organic growth of hyperlinks on Wikipedia. (And there is the practical question that if the WMF goes with your approach, someone should own the algorithm and update it in the future. I understand this is way down the line, but we need to think about these given the limited resources/infrastructure available to us, even for just helping you to test a research idea that is developed to a good extent.).

I'm not really aware of how links are automatically added to Wikipedia. But in general it should be possible to exclude auto-generated links if they are somehow flagged. On the other hand, clicks generated through read recommendations can also be flagged in the server logs.

I have another question for you, too: we know that 66% of the hyperlinks added in one month to enwiki are never clicked in the month after it. This means that at the moment, we're adding links that are not being used. How can this potential issue have impact on your recommendations? (Why would we recommend something that no one reads? There are strategic/pedagogical reasons we may have for doing so, something for you to think about.)

The concept behind Citolytics does not imply that documents A and B are related if there is a link from A to B. Instead, it relies on co-citations, i.e. A links to B and C, thus B and C are related. For more details see Co-Citation and Co-Citation Proximity Analysis. Hence, the number of clicks from A to B shouldn't be an issue. Except that you can, of course, argue that links which are not clicked should not be considered as relevance judgements.
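
A toy illustration of the co-citation idea (CPA additionally weights each pair by how close the two links appear in the article text, which this sketch omits; the link data is made up):

from collections import Counter
from itertools import combinations

outlinks = {
    "A": ["B", "C", "D"],  # article A links to B, C and D
    "E": ["B", "C"],       # article E links to B and C
}

cocitations = Counter()
for source, targets in outlinks.items():
    for pair in combinations(sorted(set(targets)), 2):
        cocitations[pair] += 1

# ('B', 'C') is co-cited twice, so B and C come out as the most related pair.
print(cocitations.most_common())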

And a practical remark:
As someone who has worked on building formal collaborations for the Research team, I'd like to point out that the current request does not fall under what the Research team does in terms of Formal Collaborations (in my experience). The way I can see testing on live Wikipedia work in this case would be if someone in Discovery is willing to take this project to the next level, to improve search and recommendations. I'm happy to make myself available for consultation if we have that person in Discovery, and if needed. :)

Thanks for your offer! We'll try to get somebody from Discovery on board ;)

@Nuria @Deskana The most standard way to quantify which recommendations are better would be, imho, to measure the CTR. This metric was already used in a performance test of morelike vs opentext.

See https://phabricator.wikimedia.org/T125393 and https://docs.google.com/spreadsheets/d/1BFsrAcPgexQyNVemmJ3k3IX5rtPvJ_5vdYOyGgS5R6Y/edit#gid=312723487.

@Deskana Regarding the time schedule, I would be happy to support you in speeding up the process. There is already a preliminary implementation in my CirrusSearch GitHub fork.

Thanks @leila and @Deskana for the clarification

I understand that there's a catch-22 problem here, and the proposal was to upload a dataset generated outside the wmf cluster that will be used to suggest articles to readers. @Deskana what's our policy regarding this kind of process? On my side I'm not too keen uploading this kind of data without a broader agreement.

@mschwarzer :

  • could you upload your fork into gerrit so we can discuss implementation details?
  • concerning the job itself, I understand that resolving redirects is not trivial and causes the job runtime to double. Have you considered using the Cirrus dumps? They contain only canonical articles, each with an array of its redirects. See this example: Technical_University_of_Berlin. Would that help? (A rough sketch of reading such a dump follows below.)
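
A minimal sketch of consuming such a dump: each document line already carries the canonical title plus its redirects, so no separate redirect-resolution pass would be needed. The field names ("title", "redirect") follow the linked example document but should be double-checked against a real dump, and the file name is a placeholder.

import gzip
import json

def iter_cirrus_docs(path):
    with gzip.open(path, "rt") as fh:
        for line in fh:
            doc = json.loads(line)
            if "index" in doc:  # skip the bulk action lines
                continue
            redirects = [r["title"] for r in doc.get("redirect", [])]
            yield doc["title"], redirects

for title, redirects in iter_cirrus_docs("enwiki-cirrussearch-content.json.gz"):
    pass  # feed (title, redirects) into the Flink job's preprocessing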

I understand that there's a catch-22 problem here, and the proposal was to upload a dataset generated outside the wmf cluster that will be used to suggest articles to readers. @Deskana what's our policy regarding this kind of process? On my side I'm not too keen uploading this kind of data without a broader agreement.

I'm not aware of any specific policy. Generally speaking, provided the creator of the dataset explicitly states that they release the data to us in a totally open manner (i.e. using a CC 0 declaration), there should be no problems whatsoever, but that's just my guess. Perhaps @leila can let us know if there are any best practices?

I'm not aware of any specific policy. Generally speaking, provided the creator of the dataset explicitly states that they release the data to us in a totally open manner (i.e. using a CC 0 declaration), there should be no problems whatsoever, but that's just my guess.

Thanks, I'm more concerned about the general approach here (I'm assuming good faith so don't take my words personally) and I'd like to make sure that we agree on the consequences here. Given that a dataset is generated outside the cluster we have no guarantee that the suggestions would not introduce a malicious/intentional bias. Is there any precedent or will it be the first time we do that?

Thanks, I'm more concerned about the general approach here (I'm assuming good faith so don't take my words personally) and I'd like to make sure that we agree on the consequences here. Given that a dataset is generated outside the cluster we have no guarantee that the suggestions would not introduce a malicious/intentional bias. Is there any precedent or will it be the first time we do that?

No problem! I agree we need more information on the general approach. Hopefully @leila can help us answer these questions, or point us to someone who can. :-)

Conceptually, I think it's not a good idea to import a dataset. After discussions with the analytics team, it seems doable to set up a Flink job that runs in production. However, it's not entirely clear to me which data source would provide the best access to the link positions within the text.

As a MediaWiki developer and Wikipedia enthusiast, I think it is not right to abuse humans for scientific experiments. However, if we use A/B testing (or beta features) to improve the MediaWiki platform and the associated services this can and should be reported to the scientific community.

@Physikerwelt thanks, so speeding up the process by importing the data is no longer an option. Thus marking my question as "Declined!", thanks! :)

@Nuria @Deskana The most standard way to quantify which recommendations are better would be, imho, to measure the CTR. This metric was already used in a performance test of morelike vs opentext.

Sorry, that test had several flaws of which we are aware and which we are working on with the Android team. Mainly, there was no sanity check on whether the A/B groups were of the same size and whether clickthrough (as measured for the app) has a random variability higher than what the test detected.

We need to be as thorough gathering data for testing as we are with selecting recommendation algorithms, so I would focus on making sure you have a true metric that you can move first; once that is the case you can start planning the work needed.

@leila Thanks for your questions!

I have another question for you, too: we know that 66% of the hyperlinks added in one month to enwiki are never clicked in the month after it. This means that at the moment, we're adding links that are not being used. How can this potential issue have impact on your recommendations? (Why would we recommend something that no one reads? There are strategic/pedagogical reasons we may have for doing so, something for you to think about.)

The concept behind Citolytics does not imply that documents A and B are related if there is a link from A to B. Instead, it relies on co-citations, i.e. A links to B and C, thus B and C are related. For more details see Co-Citation and Co-Citation Proximity Analysis. Hence, the number of clicks from A to B shouldn't be an issue. Except that you can, of course, argue that links which are not clicked should not be considered as relevance judgements.

Yes, your last sentence was my argument and something for you to think about.

@dcausse @Deskana I don't know of any strict policies around these kinds of requests, especially if the data is produced in the cluster itself.

However, in my experience, the only way this type of experiment can be done on live Wikipedia is if someone in the WMF "owns" it, meaning: he/she goes through the methodology, works with the researchers (if needed) to design the experiment to run (note that we don't have an A/B infrastructure in place, so designing and setting up the engineering part of running the experiment will realistically take at least a month, more realistically a few months), and walks with them step by step until the test is done and the data is analyzed. It's also worth noting that if someone steps up for this task, he/she is stepping up on behalf of at least a few teams: Analytics, Legal, and Security will be involved in at least parts of the process for getting the data, so a simple test actually ends up being a big deal in our resource constrained environment. If you want to chat more about this, please ping me off thread.

@mschwarzer (cc @Deskana) I understand that this recommendations project might be an improvement over the current recommendations status quo, but this project planning seems really "academic". There are a myriad of tickets, none of which lists the end goal. In order to put in engineering effort on our end we need you to please define 1) what the end goal is, and 2) why it adds value to our users.

1 and 2 inform our "success criteria" and thus what we would be measuring on an AB test.

Example (the end goal has to be user-focused): "increase engagement on the Android app", "increase session length on the Android app"... others?

The premise of this ticket, "Develop a MediaWiki extension that makes the citolytics recommendations accessible via API", is not an end in itself; it is the means to some objective that needs to be defined.

@Nuria Thanks for the clarification. I'll review the project and update the corresponding tickets.

@mschwarzer I guess it would be good to upload your code to gerrit for code review. Otherwise, you might get into trouble with your planned time frame.

@dcausse What needs to be done after the code review? Or what are the next steps to get the code deployed?

@mschwarzer I think we need to do some testing on a big index while the patch is still in review. The cirrus code looks good; we just need to do a quick check on a real index to be sure it works as expected.
My current understanding of the state of the project and what needs to be done before deploying is:

  • some tests need to be done
  • make sure the apps handle the new syntax with quotes and escape sequences around the title, as it is slightly different from the morelike syntax

For testing, ideally I'd like a dump of your data in the elastic bulk format so I can easily import it into a test index.
The format should be something like:

{"update": {"_id": 123, "_type": "page"}}
{"doc": {"citolytics": [{"page": "xyz", "score": 123}, {...}]}}

But please double check it works as expected on a small index.

Once this dump is available I'll cherry-pick your patch in the labs instance we use for testing. This would allow you to test the full stack with the apps.

Will this be limited to english or do you have the data for other languages?

@dcausse

  • I uploaded a result file to one of our lab instances (hadoop000.math.eqiad). What would be the best approach so you can access it for testing?
  • For the Oozie workflow integration I have already prepared a PySpark script that reads the data from HDFS and sends bulk updates to ES (https://gerrit.wikimedia.org/r/#/c/334130/4/oozie/citolytics/transferCitolyticsToES.py - it mainly reuses the code from the popularity_score script). If this script is not suitable for testing, I can also prepare data in the elastic bulk format.
  • Regarding languages, it depends on what would be the simplest way for testing. I can generate recommendations for only a single language, but also for more, or for all that are available as XML dumps.

@mschwarzer thanks for the dump. The ES bulk format would definitely help me, but I suppose I can easily apply a jq filter on your dump.
Can you make sure that I can ssh into your lab machine so I can import it, or maybe you can do the import yourself; see if you can access https://relforge1001.eqiad.wmnet:9243/ from your vm?

Note that your patch must be a Draft, I think, because I can't access it.

Concerning languages, it'd be nice to test at least one other, I think? Suggestions welcome.

@mschwarzer I think it would be good to pick at least two languages from each group on https://www.wikipedia.org/, so that we end up with between 10 and 20 languages. From the top group you should include English and maybe German; from the 100 000+ group, Simple English; from the 10 000+ group, maybe Plattdüütsch; and from 1 000+ there is also a German dialect...

By the way, how does this ticket relate to T155101?

I'll prepare the ES bulk dumps for enwiki, simplewiki and ndswiki and upload them to hadoop000.math.eqiad:/srv/wikisim/data/results/.

@dcausse I'll ping you as soon as you can import the data.

@Physikerwelt

  • Can you give SSH access to dcausse? I think I don't have the rights to do so.
  • These languages should be used to test the CirrusSearch implementation. Three languages of different sizes should be enough here.
  • In T155101 we should collect the languages that we later want to evaluate with respect to the user signals in the Android app.

@Physikerwelt

  • Can you give SSH access to dcausse? I think I don't have the rights to do so.

You have the power. It's not on Horizon; look at the math project settings on https://wikitech.wikimedia.org/.

  • These languages should be used to test the CirrusSearch implementation. Three languages of different sizes should be enough here.
  • In T155101 we should collect the languages that we later want to evaluate with respect to the user signals in the Android app.

OK. Both make sense. So my comment before was regarding T155101.

@dcausse

Now you should have SSH access to hadoop000.math.eqiad.wmflabs. The ES dumps are located in /srv/wikisim/data/results/:

enwiki_wikisim_elastic.json  
ndswiki_wikisim_elastic.json  
simplewiki_wikisim_elastic.json

@mschwarzer http://citolytics-en.wmflabs.org/ is now available with enwiki and your data; you can run some tests to make sure everything runs smoothly with the mobile apps.
I have not imported the other wikis yet; let me know if that is important for your testing.

@dcausse Thank you very much! In MediaWiki everything seems to work correctly. However, in the Android app it does not work: I cannot use citolytics-en.wmflabs.org as the mediaWikiBaseUri / API endpoint. I keep getting these error messages when opening a Wiki article from within the Android app:

E/org.wikipedia.page.PageDataClient$2: failure():277: PageLead error: 
I/org.wikipedia.concurrency.SaneAsyncTask: onPostExecute():71: 
                                           java.io.FileNotFoundException: /data/user/0/org.wikipedia.alpha/files/savedpages/b116b1dfb0a01f309eeae25e3de6132/content.json: open failed: ENOENT (No such file or directory)
                                               at libcore.io.IoBridge.open(IoBridge.java:459)
                                               at java.io.FileInputStream.<init>(FileInputStream.java:76)

Probably an additional Vagrant role needs to be enabled. Maybe @EBernhardson knows what needs to be done.

I'm a little confused what's going on here.

In T143197#2748921, I made it clear that this isn't something that Discovery can really support. @dcausse has been providing advice, but my statement from before about support hasn't changed. From a product perspective, CirrusSearch is not a dumping ground for everyone's favourite recommendation engines. I probably get a suggestion for a new one that we could add every week, and if we added them all we'd end up with hundreds. If we have different recommendation engines in CirrusSearch that are not clearly and usefully differentiated to users, then we can confuse API users quite easily.

In this task, we seem to be rushing to a solution without defining the users or the problem we're trying to solve. "Develop a MediaWiki extension that makes the citolytics recommendations accessible via API" is the very first sentence in this task's description, and it has no statement of user value in it at all. I see no such user value statement anywhere here.

I understand there's a lot of excitement here, but please let's stop and think for a minute. I don't want this to get to the stage where either the merge of this is blocked by product concerns right at the last minute, or it needs to be integrated into a fork.

Also, if we have a MediaWiki extension, I think the extension should use CirrusSearch hooks (namely, CirrusSearchAddQueryFeatures) instead of patching the Cirrus code directly. If the current extension API is lacking, please tell us how it should be improved.

Hi,

Is the remaining work to be done just testing on Android with the existing code that has already been written?

It seems reasonable, since we've come all this way, to try @Smalyshev's recommendation in T143197#3003550 and see if that fixes the issues with testing on Android.

It would be great to get these tests in front of our Discovery-Analysis team to help with determining the effectiveness of the test and the results received, to assist in planning for any next steps.

Physikerwelt renamed this task from Integrate Citolytics to CirrusSearch MediaWiki extension to Add a citolytics query prefix that exposes the relatedness information from the elasticsearch index. Feb 23 2017, 7:53 PM
Physikerwelt updated the task description. (Show Details)

@mschwarzer If you create Phabricator tasks, please make sure to stay focused on one particular task. I cannot understand how the Android app relates to the API endpoint. Would it not be better to test your patch by visiting /w/api.php (the old non-RESTBase API) or entering the query citolytics:Pagename into the MediaWiki search interface? Maybe it would be good to clean up all the Phabricator tasks you created. Make sure to create one task for each component you are touching, and only one general discussion task where people can discuss general issues.

In addition to that, please set up a public demo where people can test the implementation. I.e., it would be nice to have a demo of the citolytics api endpoint without the app.

In this task, we seem to be rushing to a solution without defining the users or the problem we're trying to solve.

+1. I have also pointed out the same thing earlier; see comments from @leila and myself on October 26/27/28 on this same ticket.

In this task, we seem to be rushing to a solution without defining the users or the problem we're trying to solve.

+1. I have also pointed out the same thing earlier; see comments from @leila and myself on October 26/27/28 on this same ticket.

Does that mean that there are unresolved general philosophical issues? If so, we should enumerate and resolve them. The last time I discussed this with @leila and @EBernhardson I had the impression that those were resolved.
However, I would be happy if we could separate this discussion from the technical discussion of how to expose the information from elasticsearch via the API, which should be discussed in this ticket.
Do you agree with that?

Does that mean that there are unresolved general philosophical issues? If so, we should enumerate and resolve them. The last time I discussed this with @leila and @EBernhardson I had the impression that those were resolved.

@Physikerwelt I just want to confirm that on my end, there are no philosophical issues with this and other related task. As I mentioned earlier, I'm outside of Discovery team who handles Search. At least one person in that team should be convinced to work with you on these tasks. If that person is there, and if the team or you need research help, I'll do my best to open up time on my end to contribute. Until then, I will have to stay as an observer on this thread.

Moving to tracking...nothing for the Analysts to do at this time.

mpopov moved this task from Later to Tracking on the Discovery-Analysis board.
mpopov changed the task status from Open to Stalled. Jun 21 2018, 8:33 PM
mpopov subscribed.

There haven't been any updates to this in over a year and it's not entirely clear what needs to be done, nor how my team should be involved (if at all).

I'll go ahead and be bold and decline/close this ticket. @mschwarzer if there is additional work / information you need, please either open a new ticket or re-open this one and let us know where you're at and what you need help with. :)