
Document and analyze the number of parsing errors for parsed WDQS queries
Closed, Resolved (Public)

Description

For the month of June 2021, we want to:

  • Report the number of parsing errors encountered when generating parsed-query information
  • Provide information about why the parsing errors happen

Event Timeline

CBogen triaged this task as High priority. Jun 24 2021, 1:42 PM
CBogen moved this task from All WDQS-related tasks to Analysis on the Wikidata-Query-Service board.

@JAllemandou @dcausse

  • For June, the average daily successful parsing rate was ~85%, ranging from 75% to 90%. Note that this only includes queries with status 200 and 500.
  • 11% of the distinct queries ran into errors related to prefixes. The number of distinct queries failing on each prefix is shown below. After adding the first 4 prefixes (mwapi, geof, foaf, gas) to the query processor's prefix list, the average daily successful parsing rate rose to ~95% (93% to 97%). A few prefixes were slightly off (data instead of wdata, ref instead of wdref). These account for very few queries, but I fixed them nevertheless.
prefix_name   count
mwapi         7419357
geof          54183
foaf          17198
gas           13753
wds           2761
wdv           216
fn            62
dc            50
mediawiki     23
wdref         22
wdata         3

Total distinct queries: 67467327
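
As a quick cross-check of the ~11% figure, the shares can be recomputed from the counts and the total reported here. This is only a sketch: the function names are made up for illustration, and summing the counts assumes each distinct query is counted under a single prefix.

```python
# Counts and total copied from the figures reported in this comment.
PREFIX_ERROR_COUNTS = {
    "mwapi": 7419357,
    "geof": 54183,
    "foaf": 17198,
    "gas": 13753,
    "wds": 2761,
    "wdv": 216,
    "fn": 62,
    "dc": 50,
    "mediawiki": 23,
    "wdref": 22,
    "wdata": 3,
}
TOTAL_DISTINCT_QUERIES = 67467327


def share(count, total=TOTAL_DISTINCT_QUERIES):
    """Return count as a percentage of the total number of distinct queries."""
    return 100.0 * count / total


def summarize(counts, top_n=4):
    """Summarize prefix-related failures (hypothetical helper).

    Assumes each distinct query appears under one prefix only, so the
    counts can be summed without double counting.
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:top_n]
    return {
        "all_prefixes_pct": share(sum(counts.values())),
        "top_prefixes": [name for name, _ in top],
        "top_pct": share(sum(count for _, count in top)),
    }


if __name__ == "__main__":
    summary = summarize(PREFIX_ERROR_COUNTS)
    print("top prefixes:", ", ".join(summary["top_prefixes"]))
    print("share of all prefix errors: %.2f%%" % summary["all_prefixes_pct"])
    print("share of top prefixes:      %.2f%%" % summary["top_pct"])
```

Nearly all of the prefix-related failures come from mwapi alone, which is why the discussion focuses on that prefix.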

  • Other errors included:
    • Variable used when already in-scope. This happens when the same variable is reused in a query. Such queries run fine in WDQS and return results. These account for 2% of the errors in distinct queries.
    • Another notable error is the WITH clause. Although it runs well in WDQS, the parser doesn't accept it. These account for 2.5% of the distinct queries.

It seems including the prefixes should fix things, but should we also think about fixing the other two errors (although they are small in number)? I'm not sure why Jena cannot parse them, though.

Named subqueries (WITH) are a Blazegraph extension, not part of standard SPARQL syntax.
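
For illustration, a named subquery has the shape `WITH { ... } AS %name` and is later pulled in with `INCLUDE %name`. The sketch below is a rough, regex-based heuristic for flagging such queries before they reach a standard SPARQL parser. It is not part of the actual query processor: the function name and the example query are made up, and the check does not strip comments or string literals, so it can misfire on unusual input.

```python
import re

# Blazegraph named subquery: WITH { ... } AS %name, referenced by INCLUDE %name.
# SPARQL Update's "WITH <iri>" is followed by an IRI, not "{", so it is not matched.
_WITH_BLOCK = re.compile(r"\bWITH\s*\{", re.IGNORECASE)
_AS_NAME = re.compile(r"\}\s*AS\s+%\w+", re.IGNORECASE)
_INCLUDE = re.compile(r"\bINCLUDE\s+%\w+", re.IGNORECASE)


def uses_named_subquery(query):
    """Heuristically report whether a query uses a Blazegraph named subquery."""
    has_with = _WITH_BLOCK.search(query) is not None
    has_name = _AS_NAME.search(query) is not None
    has_include = _INCLUDE.search(query) is not None
    return has_with and (has_name or has_include)


# Made-up example query, for illustration only.
EXAMPLE = """
SELECT ?item ?label
WITH {
  SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10
} AS %people
WHERE {
  INCLUDE %people .
  ?item rdfs:label ?label .
}
"""

if __name__ == "__main__":
    print(uses_named_subquery(EXAMPLE))
    print(uses_named_subquery("SELECT ?s WHERE { ?s ?p ?o }"))
```

A check like this could be used to bucket these failures separately when counting parsing errors, rather than to make the queries parseable.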

Thanks @AKhatun_WMF for the analysis.
@dcausse, @Gehel and @MPhamWMF - Do you think it's worth trying to make our parser able to process queries with the 'mwapi' prefix (it represents 10% of all requests)? Otherwise this task can be closed.

Thanks @AKhatun_WMF!

I understand that simply adding the prefix declarations to the Jena parsing context will suffice to parse them? If yes, I think this is worthwhile.
If you mean improving the AST extraction to better understand how mwapi is being used, I think it might be too early to invest time in this, but it is certainly something interesting to look into later.

@dcausse: Yes, just adding the prefix declarations to the Jena parser is what we want to do.
@JAllemandou: Should I add the other prefixes as well?

Why not add the other prefixes, if it's as simple as adding each prefix to the AQS list? I think there'll be more gotchas.
Let's try, @AKhatun_WMF :)
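
The idea discussed in this thread, making the missing prefixes known to the parser, can be sketched at the string level as below. This is not the Jena-based change itself: it prepends PREFIX declarations to the query text instead of extending the parser's prefix list. The helper names are hypothetical, and the namespace IRIs are assumptions based on the prefixes WDQS predefines, so they should be checked against the WDQS prefix list before use.

```python
import re

# Assumed namespace IRIs for the four most frequent missing prefixes.
# Verify against the WDQS predefined prefix list before relying on them.
EXTRA_PREFIXES = {
    "mwapi": "https://www.mediawiki.org/ontology#API/",
    "geof": "http://www.opengis.net/def/geosparql/function/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gas": "http://www.bigdata.com/rdf/gas#",
}

_DECLARED = re.compile(r"\bPREFIX\s+([A-Za-z][\w-]*)\s*:", re.IGNORECASE)


def declared_prefixes(query):
    """Return the prefix names already declared in the query prologue."""
    return {m.group(1) for m in _DECLARED.finditer(query)}


def used_prefixes(query, known):
    """Return the known prefixes that appear as 'name:' in the query text.

    Heuristic: it does not skip comments or string literals.
    """
    found = set()
    for name in known:
        # The lookbehind avoids matching inside longer names or IRIs.
        if re.search(r"(?<![\w:/#])" + re.escape(name) + r":", query):
            found.add(name)
    return found


def with_missing_prefixes(query, known=EXTRA_PREFIXES):
    """Prepend declarations for known prefixes that are used but not declared."""
    missing = sorted(used_prefixes(query, known) - declared_prefixes(query))
    header = "".join("PREFIX %s: <%s>\n" % (p, known[p]) for p in missing)
    return header + query


if __name__ == "__main__":
    q = ('SELECT ?title WHERE { SERVICE wikibase:mwapi { '
         'bd:serviceParam mwapi:search "cat" . '
         '?title wikibase:apiOutput mwapi:title } }')
    print(with_missing_prefixes(q))
```

Registering the prefixes in the parser's own prefix mapping, as proposed above, avoids rewriting query text and is the cleaner option; the string-level version only shows the effect on a single query.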