
[Investigation] Check if we can use RevDoc but inject property data type from another datasource
Closed, Resolved · Public

Description

For example, get the property data types from another datasource (an API call) and inject them into the output.

Timebox: 5-8 hours
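A minimal sketch of what that could look like, assuming a plain wbgetentities datatype lookup (the helper name and the batching are illustrative, not existing ores code):

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def fetch_property_datatypes(property_ids, session=None):
    """Fetch {property id: datatype} from the Wikidata API.

    wbgetentities accepts up to 50 ids per request, so batch accordingly.
    """
    session = session or requests.Session()
    datatypes = {}
    for i in range(0, len(property_ids), 50):
        batch = property_ids[i:i + 50]
        response = session.get(WIKIDATA_API, params={
            "action": "wbgetentities",
            "ids": "|".join(batch),
            "props": "datatype",
            "format": "json",
        })
        for pid, entity in response.json().get("entities", {}).items():
            datatypes[pid] = entity.get("datatype")
    return datatypes


# e.g. fetch_property_datatypes(["P17", "P31"])
# -> {"P17": "wikibase-item", "P31": "wikibase-item"}
# The resulting mapping is what would get injected into the output.
```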

Event Timeline

This has the benefit of making dump analysis considerably simpler, but since datasources are rebuilt for each live request, it won't improve live performance (it would slightly decrease it). We could hard-code the property data types into ores, but that would bloat the model file (and maintaining the list is another headache). A great solution would be to keep the mapping in a local server cache, but that's a long shot :(

Our options and the downsides of each option:

  • Using wbgetentities/Special:EntityData (the previously suggested alternative to this approach)
    • Has the downside of basically not being operable on dumps: our entity dumps don't have histories, etc., and we can't run it on the XML dumps because we have no way to just inject the mapping.
  • The separate datasource (this suggestion)
    • Has huge performance downsides: you would have to hit something like https://www.wikidata.org/w/api.php?action=wbgetentities&ids=P17&props=datatype for every request. It also doesn't fully solve the dumps problem, because ores drops datasources after each read (that's how it handles them), so every history read in a dump pass would hit the API again; that's roughly 1B API hits just for this if we want to rebuild the history dump.
  • We can introduce the concept of a local server cache and hold the mapping there (which ores should have and use anyway)
    • That would be a lot of work
    • Also, I'm not sure how that can be wired to the model and features (maybe as a datasource? Then ores injects the datasource as an extractor? But then extractors don't have a chain to fall back to if the value isn't in the cache).
      • We could add a basic cache wrapper around the APIExtractor, so everything stays there (a rough sketch of such a wrapper follows this list), but that would drastically bloat the memory footprint. It also wouldn't fully solve the performance issue, because it would still need to hit the API whenever the hot cache expires, and again whenever it sees a new combination of properties...
  • Hard-code the mapping, which means we have to manually maintain such a list and bloat the model and its memory footprint (probably around 1 GB per node) just for this.
  • Diverge the dump-based model and the API-based model: use the first option for the API and the fourth option for dumps (the hard-coded mapping wouldn't bloat the memory in that setting).
    • We already do this because of the item completeness issue (it has to hit the property suggester API for every item). It doesn't mean we need to drop all features using data types; it just means we can either hard-code the mapping or hit the API for the first part, and then reshape the feature-processing part on top of that (same features, different ways of achieving them; sketched further below).
    • It has the downside of duplicating some effort, but not that much.
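To make the cache-wrapper idea above concrete, a rough sketch under some assumptions: `CachedDatatypeLookup` and its `fetch` callable are illustrative names, not existing ores/revscoring classes, and the real APIExtractor interface is more involved than a plain callable:

```python
import time


class CachedDatatypeLookup:
    """Wraps an API-backed lookup with a TTL'd in-process cache.

    `fetch` is any callable mapping a list of property ids to
    {pid: datatype}; in ores this would sit around the APIExtractor.
    """

    def __init__(self, fetch, ttl=24 * 3600):
        self.fetch = fetch
        self.ttl = ttl
        self.cache = {}  # pid -> (datatype, fetched_at)

    def lookup(self, property_ids):
        now = time.time()
        # Only cold or expired entries fall through to the API.
        missing = [pid for pid in property_ids
                   if pid not in self.cache
                   or now - self.cache[pid][1] > self.ttl]
        if missing:
            for pid, datatype in self.fetch(missing).items():
                self.cache[pid] = (datatype, now)
        return {pid: self.cache[pid][0]
                for pid in property_ids if pid in self.cache}
```

This is exactly where the downsides above bite: the cache is per-process (so the memory footprint multiplies across workers), and every expiry or previously unseen property still turns into an API hit.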

Honestly, the last option sounds like the least difficult to achieve. I think we should go that way.
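To make "same features, different ways of achieving them" concrete, a sketch of how the two models could share feature code behind one datatype-source shape; the class names are illustrative, not existing revscoring code:

```python
class APIDatatypeSource:
    """Live scoring: resolve datatypes per request via the API."""

    def __init__(self, fetch):
        self.fetch = fetch  # e.g. the wbgetentities helper sketched above

    def datatypes_for(self, property_ids):
        return self.fetch(property_ids)


class StaticDatatypeSource:
    """Dump scoring: resolve datatypes from a pre-built mapping,
    e.g. produced by a one-off wbgetentities sweep before the run."""

    def __init__(self, mapping):
        self.mapping = mapping

    def datatypes_for(self, property_ids):
        return {pid: self.mapping.get(pid) for pid in property_ids}


# Feature code stays identical in both models; only the injected
# source differs:
#   source.datatypes_for(properties_used_in_revision)
```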

Would the last option mean that we'd still get the same quality score for a given Item no matter if it was scored live or from the dumps?

> Would the last option mean that we'd still get the same quality score for a given Item no matter if it was scored live or from the dumps?

They already differ, because the ores dump analyzer can't hit the API for every entity to get the property suggester output, but it's not too much of a difference. We won't increase the gap, but we'll probably need to re-implement some features twice.

I also want to add that using wbgetentities is not possible in ores because it doesn't support sending revids, and Special:EntityData wouldn't work in ores because ores depends heavily on the mwapi library, which doesn't support such requests (we would need to inject a new type of session, which seems like a big overhead).
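For context on that constraint, a sketch of the two call shapes: mwapi only speaks to api.php, which is why Special:EntityData would need a different kind of session (`rev_id` is a placeholder):

```python
import mwapi
import requests

# What ores can do today: an api.php call through an mwapi session.
session = mwapi.Session("https://www.wikidata.org",
                        user_agent="datatype investigation sketch")
doc = session.get(action="wbgetentities", ids="P17", props="datatype")
# Works for the current version of an entity, but wbgetentities has
# no revids parameter, so historical revisions are out of reach.

# What Special:EntityData would need: it is a page path, not an
# api.php module, so mwapi can't reach it; a plain HTTP client can,
# and it does accept a revision parameter.
rev_id = 123456789  # placeholder: any revision id of Q42
doc = requests.get(
    "https://www.wikidata.org/wiki/Special:EntityData/Q42.json",
    params={"revision": rev_id}).json()
```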