- Title of session: WikiGPT - Natural Language search results based on Wikipedia knowledge and ChatGPT
- Session description: The session will be a demo of an application named WikiGPT that the Machine Learning Team built as a proof of concept of what the future of Wikipedia could look like using natural language queries to interact with Wikipedia's content.
- Username for contact: isarantopoulos
- Session duration (25 or 50 min): 25
- Session type (presentation, workshop, discussion, etc.): presentation/demo
- Language of session (English, Arabic, etc.): English
- Prerequisites (some Python, etc.): A general idea of what Large Language Models (LLMs) do, but people without that background are welcome to attend.
- Any other details to share?:
- Interested? Add your username below:
Etherpad for this session: https://etherpad.wikimedia.org/p/wmh2023-WikiGPT_-_NL_search_results
WikiGPT - Natural Language search results based on Wikipedia knowledge and ChatGPT
Date & time: Friday, May 19th at 15:00 EEST / 12:00 UTC
- Phabricator task: https://phabricator.wikimedia.org/T333974
- Be able to find complex information in a more automated way
- Need for natural language search interfaces
- Make our search box more interactive
Problems with LLMs as a knowledge base: hallucinations, stale knowledge, ethical considerations
- Can lead to the spread of misinformation.
ChatGPT example of famous dogs
- Google can provide answers as good as ChatGPT's
- Google provides sources, while chatGPT does not
- ChatGPT also returns nonexistent statues
- When we search Wikimedia sites, we get some results, though the search is not very effective
- use Wikipedia as a knowledge base; the strength is its reliability
- use LLM as an interface to assist as a search engine
- (TBA after the session)
Question -> Large Language Model -> Knowledge base (e.g. Wikipedia) -> Large Language Model -> Answer
(somewhere in the middle: chatGPT)
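The flow above can be sketched roughly as follows. This is only an illustration, not the WikiGPT code: the toy knowledge base, the keyword-based `extract_search_terms` (standing in for the first LLM call), the prompt wording, and the `llm` callable are all assumptions made for this example.

```python
from typing import Callable

# Hypothetical toy knowledge base standing in for Wikipedia (title -> text).
KNOWLEDGE_BASE = {
    "Apollo 11": "Apollo 11 carried three astronauts: Armstrong, Aldrin and Collins.",
    "Technopolis (Gazi)": "Technopolis is an industrial museum and cultural venue in Athens.",
}

STOP_WORDS = {"who", "what", "is", "the", "in", "were", "that", "on", "a", "of"}


def extract_search_terms(question: str) -> list[str]:
    """Stand-in for the first LLM call: reduce a question to search terms."""
    words = (w.strip("?.,!").lower() for w in question.split())
    return [w for w in words if w and w not in STOP_WORDS]


def search(terms: list[str], kb: dict[str, str]) -> list[tuple[str, str]]:
    """Return (title, text) pairs that mention at least one search term."""
    hits = []
    for title, text in kb.items():
        haystack = f"{title} {text}".lower()
        if any(term in haystack for term in terms):
            hits.append((title, text))
    return hits


def build_prompt(question: str, sources: list[tuple[str, str]]) -> str:
    """Assemble the prompt for the second LLM call (wording is an assumption)."""
    context = "\n".join(f"[{title}] {text}" for title, text in sources)
    return (
        "Answer the question using only the sources below.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


def answer(question: str, llm: Callable[[str], str]) -> dict:
    """Question -> search terms -> knowledge base -> LLM -> answer plus sources."""
    terms = extract_search_terms(question)
    sources = search(terms, KNOWLEDGE_BASE)
    prompt = build_prompt(question, sources)
    return {"answer": llm(prompt), "sources": [title for title, _ in sources]}
```

Any function that maps a prompt string to a reply string can be passed as `llm`, so the same skeleton works with a hosted model or an open source one. Returning the source titles next to the answer mirrors the list of sources shown in the demo results.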
wiki-gpt.toolforge.org/search (password protected) -> results are precalculated, as normally this would be a slow process
- Who won WWII ?
- In the result there is a list of sources that led to the answer we got from chatGPT
- The list includes Call of Duty Championship 2018
- What is the Technopolis in Athens
- Sources are related to the actual topic
- Who were the 5 astronauts that landed apollo 11 on the moon
- ChatGPT replied about 5 astronauts, though there were only 3
- ChatGPT-4 actually replied that there were 3 and that 2 of them walked on the Moon
- Where can I buy a fridge at Technopolis in Athens
- It replied that this question is out of scope
- The links it provided, however, were articles about shops selling appliances
- Using a closed source application is problematic
- Search using article embeddings
Q: Can any technology for producing a list of articles related to the question be used? A: Yes, any search engine. Even something like ChatGPT could be used, but then we'd be back at the original problem (stale knowledge: results only include data from when the model was trained).
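A minimal sketch of what searching with article embeddings could look like: rank articles by cosine similarity between a query vector and each article vector. The 3-dimensional vectors and the article titles below are made up for illustration; in practice the vectors would come from an embedding model, which the notes do not specify.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this example.
# Real embeddings would be produced by an embedding model and be much larger.
ARTICLE_EMBEDDINGS = {
    "Apollo 11": [0.9, 0.1, 0.0],
    "World War II": [0.1, 0.9, 0.1],
    "Refrigerator": [0.0, 0.1, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding: list[float], embeddings: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the titles of the k articles most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, vector), title)
        for title, vector in embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]
```

Calling `top_k([0.8, 0.2, 0.0], ARTICLE_EMBEDDINGS)` ranks "Apollo 11" first because its vector points in nearly the same direction as the query. As the answer above notes, this retrieval step is interchangeable: any search engine that returns related articles could take its place.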
Open Source LLMs on Lift Wing (https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing)
- Open source models help: we can contribute to their improvement as well as have more confidence in them
- Licensing is challenging, so we are looking for one that represents our values
- Has developed a ChatGPT plugin that uses Wikipedia as its source and includes Wikipedia links in answers
- code is publicly available: https://gitlab.wikimedia.org/repos/machine-learning/chatgpt-plugin/-/tree/dev
- Example: "Who is the queen of England?" The answer contains links to the relevant Wikipedia articles from which the information was sourced
- The test page is password protected because we do not want to be responsible for the content, it could be wrong
- Do you have a way for the plugin to include updated content?
- The plugin searches Google, gets the Wikipedia results, and parses them, so the information is more up to date
- Did you consider giving only the relevant articles as a result and not a whole answer? That could improve Wikipedia search. ChatGPT can also write SPARQL queries, which could be a nice way for users to write them without having to know the language.
many possible ways to use it
- What to do with outdated content?
same issue with Wikipedia, could add time interval
- Phrasing can be really subtle, and changing small things can matter a lot.
current phrasing is basically the same as on wikipedia. Manual fixes would be needed
Could also think about asking for direct citations when using Wikipedia
- The problem we are trying to solve is not just technical
- We could end up with controversial results