
WikiGPT Experiment
Closed, Resolved · Public

Description

Feature ideas for "wikigpt":

  • (T329003) Each answer has a unique URL so people can share it
  • (T328638) There is a "how we got this answer" button that shows the full prompt with some explanation.
  • (T329016) The corpus includes 3-4 articles (instead of 1)
  • (T329016) Change some text to ask for encyclopedic questions, not something like "where to buy the best stove in Romania". (Change done.)
  • (T328526) We _might_ need a way to restrict access since the API we are using is on my personal credit card
  • (T329345, T328494#8605276) We need to track traffic somehow, or otherwise understand why we are getting so many 500s. [Please note: Wikimedia Cloud Services has strict terms of use on tracking users.]

Event Timeline

Also, can someone add me to the Toolforge group?

@calbon I couldn't find a user that matches you (calbon or Chris Albon). @kevinbazira any luck?

Coordinated with Kevin on this; I am working on the third bullet, "The corpus includes 3-4 articles (instead of 1)".

@isarantopoulos, I could not find Chris' username on Toolforge.

@calbon, please create a Toolforge account if you have not yet, and share your username.

This will enable us to add you as a maintainer to the WikiGPT project.

Regarding the 500 errors referred to in the ticket description: some of them are caused by failures in the OpenAI API calls, and some have already been addressed. With some more error handling I believe they can all go away.
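
A minimal sketch of the kind of error handling meant here, assuming the tool calls the OpenAI chat completions endpoint over HTTP; the retry count, backoff values, and model name are illustrative rather than what WikiGPT actually uses:

```lang=python
import time

import requests

OPENAI_URL = "https://api.openai.com/v1/chat/completions"


def ask_openai(prompt: str, api_key: str, retries: int = 3) -> str:
    """Call the OpenAI API, retrying transient failures instead of surfacing a 500."""
    payload = {
        "model": "gpt-3.5-turbo",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(retries):
        try:
            resp = requests.post(OPENAI_URL, json=payload, headers=headers, timeout=30)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except (requests.RequestException, KeyError) as err:
            # Log the failure and back off, rather than letting the exception
            # bubble up to the web app as a 500.
            print(f"OpenAI call failed (attempt {attempt + 1}): {err}")
            time.sleep(2 ** attempt)
    return "Sorry, the answer service is temporarily unavailable."
```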

If we take on this project in the future, I think the two main building blocks are the following:

  • Semantic search using embeddings: retrieving the articles most relevant to the search query can be done using article embeddings. I played a bit with this approach using the embeddings published by Cohere, and it looks like a promising way to proceed. What we would need is:
    • A batch job that creates article embeddings (and perhaps a stream that updates them as new edits/articles occur).
    • A (vector) database that allows low-latency retrieval and, ideally, supports ANN (approximate nearest neighbors) queries. There are many candidates out there, but the dense_vector field available in recent versions of Elasticsearch seems like a good option, especially since it supports ANN queries (see the sketch after this list).
  • An open source model deployed on Lift Wing with a big context window. In the current design of WikiGPT a big context window (at least 4k tokens, ideally more) is essential for passing the text of the articles.
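
As a rough sketch of how the batch job and the ANN retrieval could fit together, assuming Elasticsearch 8.x with the dense_vector field type; the index name, vector dimensions, and the embed() helper are placeholders for whatever embedding model we end up using:

```lang=python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

INDEX = "wikigpt-articles"  # hypothetical index name
DIMS = 768                  # depends on the embedding model


def embed(text: str) -> list[float]:
    """Placeholder: call the embedding model and return a DIMS-length vector."""
    raise NotImplementedError


# 1. Batch job: create the index and store article embeddings.
es.indices.create(
    index=INDEX,
    mappings={
        "properties": {
            "title": {"type": "text"},
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": DIMS,
                "index": True,          # enables ANN (HNSW) search
                "similarity": "cosine",
            },
        }
    },
)

for article in fetch_articles():  # hypothetical source of article dicts
    es.index(index=INDEX, document={**article, "embedding": embed(article["text"])})

# 2. Low-latency retrieval: approximate nearest-neighbor (kNN) query.
query = "How does a wood-burning stove work?"
resp = es.search(
    index=INDEX,
    knn={
        "field": "embedding",
        "query_vector": embed(query),
        "k": 4,                  # 3-4 articles for the WikiGPT corpus
        "num_candidates": 100,
    },
    source=["title"],
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```

The retrieved article text would then be passed, together with the user's question, into the context window of the model on Lift Wing.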

The important takeaway here is that the above can be building blocks for multiple applications at WMF (regardless of whether something like WikiGPT is ever built).