Add direct answers to Wikipedia search
Open, Needs Triage, Public

Description

Session title

Add direct answers to Wikipedia search

Main topic

Artificial Intelligence to build and navigate content

Type of activity

Unconference session

Description



=== The problem ===
Currently, Wikipedia's internal search engine offers users only plain-text search, plus a few property-value filters (for example, restricting results to pages in a given category).

Major search engines such as Google have started to offer direct answers to their users, based on the Google Knowledge Graph (e.g. when searching for "Barack Obama birth date", the actual birth date of Barack Obama is displayed alongside the other results).

The Wikimedia movement has Wikidata, a major knowledge base with a powerful query service (query.wikidata.org). It may be interesting to use it inside Wikipedia's search system to offer direct answers to web and app readers, and so increase the use of Wikipedia as an entry point to free knowledge.

Demos of such systems have already been created, such as Platypus [1] and NLQuery [2].

[1] http://askplatyp.us
[2] http://nlquery.ayoungprogrammer.com

=== Expected outcome ===
Investigate whether it is technically possible to offer such a system to Wikipedia readers.

=== Current status of the discussion ===
There is already a tool on Wikimedia Labs, PPP-SPARQL, that uses the Platypus backend to create SPARQL queries that can run on http://query.wikidata.org: http://tools.wmflabs.org/ppp-sparql/
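
As a rough illustration of what a direct answer backed by the query service could look like, here is a minimal Python sketch for the "Barack Obama birth date" example. It assumes the question has already been resolved to a Wikidata item (Q76, Barack Obama) and a property (P569, date of birth); that resolution step is the hard part and is not shown. The function names are hypothetical, and the snippet only builds the query and the request URL without sending anything.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_property_query(item_id: str, property_id: str) -> str:
    """Return a SPARQL query fetching one value of one property of one item.

    Hypothetical helper: assumes the natural-language question has
    already been mapped to an item ID and a property ID.
    """
    return (
        "SELECT ?value WHERE { "
        f"wd:{item_id} wdt:{property_id} ?value . "
        "} LIMIT 1"
    )


def build_request_url(query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# "Barack Obama birth date": Q76 is Barack Obama, P569 is date of birth.
query = build_property_query("Q76", "P569")
url = build_request_url(query)
print(query)
print(url)
```

The `wd:` and `wdt:` prefixes are predefined on the query service, so the query needs no `PREFIX` declarations there.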


== Proposed by ==
@Tpt

== Preferred group size ==
15-20

== Any supplies that you would need to run the session ==
Post-its and markers

== Interested attendees (sign up below) ==

# Tpt
# Add your name here

Event Timeline

Tpt created this task. Oct 20 2016, 1:05 PM
Restricted Application added a subscriber: Aklapper. Oct 20 2016, 1:05 PM
Tpt renamed this task from "Adds direct answers to Wikipedia search" to "Add direct answers to Wikipedia search". Oct 21 2016, 11:35 AM
debt added a subscriber: debt.
Qgil added a subscriber: Qgil. Nov 11 2016, 12:11 PM

Who would be the person facilitating this session? Please assign this task to that person if you are aiming to have this session pre-scheduled. Thank you!

Qgil assigned this task to Tpt. Nov 15 2016, 12:29 PM

Assuming @Tpt.

The task still says "Type of activity: [to be discussed]". Are you aiming to have a pre-scheduled session or an unconference session?

From my perspective, some parts of this are easy and some parts are hard.

From a high-level technical standpoint, I'm sure this is entirely doable, at least in English. When it comes to open source NLP, research into English is much more advanced than into any other language. We have generally tried not to focus our efforts on tools that help only a single language, though, aiming instead for at least all languages that each account for more than 1% of search traffic (10 to 15 languages).

The next big question is related to our SPARQL endpoint. Ops has been fairly tentative about using SPARQL as a backend for production features. The problem is that the SPARQL cluster is open to the world, and SPARQL is an arbitrary query language, no different from SQL. It is hard to provide any latency guarantees to production traffic while also allowing arbitrary, potentially very complex, queries to be submitted by the world. In my mind, this would mean we need a separation much like the one we have with SQL, with separate clusters of servers for production SPARQL traffic and for arbitrary queries from users.

Finally, there is the problem of parallelization as it relates to PHP. Search queries to multiple search engines (e.g. Elasticsearch and a question-answering API) should not be issued sequentially, because the latency of each backend would add up. The problem is that PHP doesn't offer a good way to do these tasks in parallel (as opposed to, say, Go or Node.js). There is curl_multi, but we would need a major refactoring of the Elastica library we use for talking to Elasticsearch, so that it can send a request without stalling execution until the response has arrived.
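
To make the sequential-versus-parallel point concrete, here is a minimal sketch of the fan-out pattern. It is written in Python with a thread pool rather than PHP with curl_multi, and both backends are stand-ins: `fulltext_search` and `direct_answer` are hypothetical functions that sleep for half a second and return hard-coded sample data instead of calling Elasticsearch or a question-answering service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fulltext_search(query):
    """Stand-in for a full-text search backend (simulated latency)."""
    time.sleep(0.5)
    return ["Barack Obama", "Family of Barack Obama"]


def direct_answer(query):
    """Stand-in for a question-answering backend (simulated latency)."""
    time.sleep(0.5)
    return "1961-08-04"


def search_all(query):
    """Send the query to both backends at once and wait for both."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        results_future = pool.submit(fulltext_search, query)
        answer_future = pool.submit(direct_answer, query)
        return {
            "results": results_future.result(),
            "answer": answer_future.result(),
        }


start = time.monotonic()
response = search_all("Barack Obama birth date")
elapsed = time.monotonic() - start
print(response)
print(f"took {elapsed:.2f}s")
```

Because the two calls overlap, the total wait is roughly that of the slowest backend rather than the sum of both, which is the behaviour a non-blocking request API would need to provide on the PHP side.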

Siznax added a subscriber: Siznax. Nov 15 2016, 7:36 PM
Tpt added a comment. Nov 16 2016, 10:23 AM

@Qgil

Are you aiming to have a pre-scheduled session or an unconference session?

If some people are interested in a formal presentation of which open source tools currently exist and how they work, we could probably do a pre-scheduled session (and I'll prepare this presentation). If not, the unconference format is probably better suited. What do you think? (ping @Halfak)

@EBernhardson

From a high-level technical standpoint, I'm sure this is entirely doable, at least in English.

There are now fairly good training sets for quite a lot of languages, e.g. https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md
It still requires a bit of customisation per language (but IMHO no worse than what regular search requires).

It is hard to provide any latency guarantees to production traffic while also allowing arbitrary,

The issue is that SPARQL queries built from search questions could also be more or less arbitrary (like "give me all humans"). But I hope this could be OK with LIMITs and a good query optimizer.
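
As a sketch of the LIMIT idea, assuming generated queries are plain strings and that any LIMIT clause comes last, a small Python helper could cap the number of rows before the query is sent. The names `cap_limit` and `MAX_ROWS` are hypothetical, and the regular expression is deliberately naive: it does not handle a trailing OFFSET or LIMITs inside subqueries.

```python
import re

# Hypothetical cap on the number of rows a generated query may return.
MAX_ROWS = 10

# Matches a LIMIT clause at the very end of the query.
_LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)\s*$", re.IGNORECASE)


def cap_limit(query: str, max_rows: int = MAX_ROWS) -> str:
    """Append a LIMIT if missing, or lower an existing one to max_rows."""
    query = query.strip()
    match = _LIMIT_RE.search(query)
    if match is None:
        return f"{query} LIMIT {max_rows}"
    capped = min(int(match.group(1)), max_rows)
    return f"{query[:match.start()]}LIMIT {capped}"


# "Give me all humans": P31 is "instance of", Q5 is "human".
all_humans = "SELECT ?human WHERE { ?human wdt:P31 wd:Q5 }"
print(cap_limit(all_humans))
print(cap_limit(all_humans + " LIMIT 1000"))
```

A cap like this bounds the size of the result set only; it does not bound how much work the query engine does to produce those rows, which is where the query optimizer still matters.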

Qgil added a comment. Nov 16 2016, 10:59 AM

(I just want to say that pre-scheduled sessions are also supposed to be discussions, not presentations. The pre-scheduled vs. unconference decision is more about how big you expect the participation to be and how formal or casual you want this session to be.)

Addshore added a subscriber: Addshore.
Tpt added a comment. Nov 22 2016, 6:15 PM

The pre-scheduled vs. unconference decision is more about how big you expect the participation to be and how formal or casual you want this session to be

As this project is at a very early stage, a casual discussion is probably better (unless a lot of people are interested). What do you think?

@Tpt Hey! As the Developer Summit is less than four weeks away, we are working on a plan to incorporate the 'unconference sessions' that have been proposed so far, as well as those that will be generated on the spot. Could you confirm whether you plan to facilitate this session at the summit? If your answer is 'YES', I would encourage you to update and arrange the task description fields in the following format:

Session title
Main topic
Type of activity
Description: move 'The Problem', 'Expected Outcome', 'Current status of the discussion' and 'Links' to this section
Proposed by: your name, linked to your MediaWiki URL or a profile elsewhere on the internet
Preferred group size
Any supplies that you would need to run the session: e.g. post-its
Interested attendees (sign up below)

  1. Add your name here

We will be reaching out to the summit participants next week asking them to express their interest in unconference sessions by signing up.

To maintain consistency, please refer to the template in the following task description: https://phabricator.wikimedia.org/T149564.

Tpt updated the task description. Dec 16 2016, 9:41 AM
Tpt updated the task description.
Tgr awarded a token. Dec 23 2016, 2:06 AM

To the facilitator of this session: We have updated the unconference page with more instructions and FAQs. Please review it in detail before the summit: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Unconference. If you have any questions or anything is unclear, please ask! If your session gets a spot on the schedule, we would like you to read the session guidelines in detail: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines. We would then also expect you to recruit note-takers (minimum 2, maximum 3), a remote moderator, and an advocate (optional) on the spot before the beginning of your session. Instructions for each role are outlined in the guidelines. Physical role cards will be available in all the session rooms! See you at the summit! :)

Aklapper removed Tpt as the assignee of this task. Jun 19 2020, 4:15 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)