
dumpRdf script should take the disabledRdfExportEntityTypes flag into account
Closed, Resolved · Public

Description

When I run dumpRdf on a wiki with WikibaseLexeme enabled, it generates some rudimentary Lexeme data (see below), even though the disabledRdfExportEntityTypes setting is configured to exclude lexemes from the export.

The script should take the setting into account, just as the RDF export through Special:EntityData already does.

Below is the excerpt of the RDF dump containing lexeme lemma data. This is the only information being exported. Note that this is not a final implementation or RDF mapping; the code that generates it is a proof of concept.

wd:L2 a wikibase:Lexeme ;
	rdfs:label "apple"@en ;
	skos:prefLabel "apple"@en ;
	schema:name "apple"@en ;

Event Timeline

Restricted Application added a subscriber: Aklapper.

@WMDE-leszek Will making use of T195420 fix this (if limited to items and properties)?

@hoo: that would be a workaround, yes. Not a fix really, though.

Observations:

  • SqlEntityIdPager can select entities (their IDs) from the database by page_namespace, which can be derived from the entity type
    • it currently supports this for only one entity type at a time
  • when querying for more than one entity type, filtering is performed in application code (DumpGenerator)
  • DumpEntities supports two ID stream providers: SqlEntityIdPager and a file-based EntityIdReader. The latter cannot filter by entity type on its own, so DumpGenerator has to keep its ability to filter by entity type regardless of any optimization on the SQL side
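The last observation can be illustrated with a minimal sketch. It is written in Python purely to model the logic and is not Wikibase code: the names `PREFIX_TO_TYPE`, `entity_type_of` and `filter_by_type` are hypothetical, and deriving the type from the first letter of the ID is an assumption made for the example. The point is that a filter applied to the ID stream works no matter which provider produced the IDs.

```python
from typing import Iterable, Iterator, Set

# Assumed mapping from ID prefix to entity type, for illustration only.
PREFIX_TO_TYPE = {"Q": "item", "P": "property", "L": "lexeme"}


def entity_type_of(entity_id: str) -> str:
    """Derive the entity type from the first letter of the ID."""
    return PREFIX_TO_TYPE[entity_id[0]]


def filter_by_type(ids: Iterable[str], allowed: Set[str]) -> Iterator[str]:
    """Yield only IDs whose type is allowed, whatever produced the stream."""
    for entity_id in ids:
        if entity_type_of(entity_id) in allowed:
            yield entity_id


# The same filter applies to IDs paged from SQL or read from a file.
ids_from_file = ["Q1", "P2", "L2", "Q3"]
print(list(filter_by_type(ids_from_file, {"item", "property"})))
```

Because the filter is a generator, it does not need the whole ID list in memory, which matters for a dump-sized stream.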

@WMDE-leszek Could you please confirm that https://gerrit.wikimedia.org/r/#/c/437501/ conceptually (never mind the how for now) does what this ticket tries to achieve?

This assumes that LocalSettings.php contains a line

$wgWBRepoSettings['disabledRdfExportEntityTypes'] = ['lexeme'];

@WMDE-leszek

  • dumpRdf.php allows passing the entity types to dump. Should that respect the content of disabledRdfExportEntityTypes, or override it?
  • Do you think it would be worthwhile (keeping this in mind) to teach SqlEntityIdPager to apply the entity type restriction (when there are multiple types) at the query level already? Do you remember a reason why this is not already implemented (it appears trivial at this point)?

/cc @Addshore

  • I think it should respect it. This script generates dumps that are then given out to the public, so it should not allow overriding wiki settings, e.g. when wrong parameters are passed by accident.
  • In theory it seems reasonable. If my IDE is correct, SqlEntityIdPager is only used in two places: in the RDF export script, and in ItemsPerSiteBuilder, which only cares about items. So extending the pager might not be worth the effort either. Sadly, I don't know the history of this class. The ability to pass a "list" of entity types to the RDF dump script is also really new (about two weeks old); before that it was either all types or a single type, AFAIR. I think it simply was not needed as long as there were only two entity types in play.
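The "respect it" answer amounts to subtracting the disabled types from whatever was requested on the command line. A minimal sketch follows, in Python only to model the logic. `ALL_TYPES` and `effective_types` are hypothetical names, and treating an empty request as "all types" is an assumption of this example, not something confirmed for dumpRdf.php.

```python
from typing import Iterable, List

# Assumed list of entity types enabled on the wiki, for illustration only.
ALL_TYPES = ["item", "property", "lexeme"]


def effective_types(requested: Iterable[str], disabled: Iterable[str]) -> List[str]:
    """Types to dump: the requested ones (or all, if none given) minus disabled ones."""
    requested = list(requested) or ALL_TYPES
    blocked = set(disabled)
    return [t for t in requested if t not in blocked]


# Explicitly asking for lexemes does not override the wiki setting.
print(effective_types(["item", "lexeme"], ["lexeme"]))
# With no explicit request, every type that is not disabled is dumped.
print(effective_types([], ["lexeme"]))
```

With this rule, a mistyped or overly broad parameter cannot cause a disabled entity type to end up in a public dump.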

Change 437501 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] DumpRDF: Omit entity types disabled for RDF

https://gerrit.wikimedia.org/r/437501

Change 438021 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] SqlEntityIdPager: filter entity types on DB level

https://gerrit.wikimedia.org/r/438021
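For illustration, filtering several entity types at the database level could look roughly like the sketch below. This is not the merged change: it uses Python with an in-memory SQLite table, the namespace numbers in `NAMESPACE_OF` are hypothetical, and `fetch_titles` is an invented helper. It only shows the idea of mapping each entity type to its namespace and restricting page_namespace with an IN clause.

```python
import sqlite3
from typing import List, Sequence

# Hypothetical namespace numbers; a real wiki defines its own mapping.
NAMESPACE_OF = {"item": 120, "property": 122, "lexeme": 146}


def fetch_titles(conn: sqlite3.Connection, entity_types: Sequence[str]) -> List[str]:
    """Select page titles for several entity types in a single query."""
    namespaces = [NAMESPACE_OF[t] for t in entity_types]
    if not namespaces:
        return []
    # One placeholder per namespace keeps the query parameterized.
    placeholders = ", ".join("?" for _ in namespaces)
    sql = (
        "SELECT page_title FROM page "
        f"WHERE page_namespace IN ({placeholders}) ORDER BY page_id"
    )
    return [row[0] for row in conn.execute(sql, namespaces)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    "page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT)"
)
conn.executemany(
    "INSERT INTO page (page_namespace, page_title) VALUES (?, ?)",
    [(120, "Q1"), (122, "P2"), (146, "L2"), (120, "Q3")],
)
print(fetch_titles(conn, ["item", "property"]))
```

Pushing the restriction into the query avoids paging through rows that the application code would discard anyway.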

Vvjjkkii renamed this task from dumpRdf script should take the disabledRdfExportEntityTypes flag into account to 00caaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from 00caaaaaaa to dumpRdf script should take the disabledRdfExportEntityTypes flag into account. Jul 2 2018, 4:13 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot claimed this task.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: gerritbot, Aklapper.