
[Task] add info if langlink is stored at repository or local
Open, HighPublic

Description

Add info on whether a langlink is stored at the repository or locally to langlinks/ll on api.php?action=query&prop=langlinks&titles=...

Bots need this info because they currently search for a langlink's source on local wiki pages. If they cannot find the source on the main page, they start searching for the langlink on included pages (mostly in the template namespace, where langlinks are included from a subpage). This costs many page source requests and a lot of processing time for parsers and bot frameworks.

If bots knew that langlinks are already stored at Wikidata, they would not have to request the source code of many local pages.

Example:
http://de.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=Vorlage:!

currently returns
<api>

<query>
  <pages>
    <page pageid="5327033" ns="10" title="Vorlage:!">
      <langlinks>
        <ll lang="ace" xml:space="preserve">Pola:!</ll>
        <ll lang="ar" xml:space="preserve">قالب:!</ll>
        <ll lang="as" xml:space="preserve">সাঁচ:!</ll>
      </langlinks>
    </page>
  </pages>
</query>

</api>

maybe this can be extended to
<api>

<query>
  <pages>
    <page pageid="5327033" ns="10" title="Vorlage:!">
      <langlinks>
        <ll lang="ace" storage="repository" xml:space="preserve">Pola:!</ll>
        <ll lang="ar" storage="local" xml:space="preserve">قالب:!</ll>
        <ll lang="as" storage="repository" xml:space="preserve">সাঁচ:!</ll>
      </langlinks>
    </page>
  </pages>
</query>

</api>

If querying this info takes a lot of resources, an extra parameter should be added (like llurl for the fullurl extra info) and the info should only be shown if requested.
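
From a bot's point of view, the proposed attribute could be consumed roughly as follows. This is only a minimal sketch: the storage attribute is the proposal from this task and does not exist in the API, split_langlinks is a hypothetical helper name, and the sample response is copied from the proposed format above.

```python
import xml.etree.ElementTree as ET

# Sample response in the PROPOSED format; the storage attribute is
# hypothetical and not part of the real API output.
SAMPLE = """<api>
<query>
  <pages>
    <page pageid="5327033" ns="10" title="Vorlage:!">
      <langlinks>
        <ll lang="ace" storage="repository" xml:space="preserve">Pola:!</ll>
        <ll lang="ar" storage="local" xml:space="preserve">قالب:!</ll>
        <ll lang="as" storage="repository" xml:space="preserve">সাঁচ:!</ll>
      </langlinks>
    </page>
  </pages>
</query>
</api>"""


def split_langlinks(xml_text):
    """Group langlinks by the proposed storage attribute (hypothetical helper)."""
    result = {"repository": {}, "local": {}, "unknown": {}}
    root = ET.fromstring(xml_text)
    for ll in root.iter("ll"):
        # If the attribute is missing, the origin of the link is unknown.
        storage = ll.get("storage", "unknown")
        result.setdefault(storage, {})[ll.get("lang")] = ll.text
    return result


links = split_langlinks(SAMPLE)
```

With such a split, a bot would only need to fetch page source for the links grouped under "local".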


Version: unspecified
Severity: enhancement
URL: https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2012/12#Prioritizing_Hungarian_articles
See Also:
T47511: Sitelinks should be given a class when added to the sidebar

Details

Reference
bz41345

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 1:12 AM
bzimport set Reference to bz41345.
bzimport added a subscriber: Unknown Object (MLST).
Merl created this task.Oct 24 2012, 10:58 AM
jeblad added a comment.Dec 1 2012, 2:31 PM

This bug came up in a thread on Project chat (http://www.wikidata.org/wiki/Wikidata:Project_chat#Prioritizing_Hungarian_articles) and it could be important to fix it. It does cause load issues, but these will not affect us much as it is only one bot for now.

(In reply to comment #1)


Is this bug still current? «i hope that bugzilla:41345 will be available before client extension goes live. Merlissimo (talk) 16:25, 30 November 2012 (UTC)»
Which has already happened, and bots found another way it seems?

Merl added a comment.Jan 31 2013, 3:53 PM

This is still open. There is no real solution. Because only the article namespace is imported at the moment, bots simply assume that langlinks are on Wikidata if they are not found in the main source. Handling langlinks from included subpages, as in the template namespace, will be impossible if this bug is not resolved.

Yurik added a comment.Feb 17 2013, 1:13 AM

To solve this bug, could someone comment on who creates langlinks table entries in the client DB? I might be mistaken, but it seems that the langlinks are not pulled dynamically from the repo, but rather copied in the background or on null edits. If this is the case, we might have to modify langlinks table to include an extra column for the "source".

re #4: Langlinks are pulled directly from the repo, but only when the page is re-rendered. When an item changes on wikidata.org, a background process (dispatchChanges.php) is used to invalidate the respective pages, so they get re-rendered. This may take a few minutes.

re #3: I currently see no easy way to do this. There is just no place to store this info on the client, and schema changes to large tables (like adding a field to the langlink table) are only done if absolutely necessary.

We could add a separate table to track this, but that has additional implications, needs more thought and is not trivial to code either. I'm actually quite happy that we can manage without *any* changes to the client database.

(In reply to comment #4)

To solve this bug, could someone comment on who creates langlinks table entries in the client DB? I might be mistaken, but it seems that the langlinks are not pulled dynamically from the repo, but rather copied in the background or on null edits.

When the page is parsed and a langlink is found, it calls addLanguageLink() on the ParserOutput object. The Wikidata client code hooks into the ParserAfterParse hook and does the same for all the additional language links it wants to add. The accumulated list of language links in the ParserOutput (eventually) gets saved to the langlinks table.

If this is the case, we might have to modify langlinks table to include an extra column for the "source".

Seems that way to me. ParserOutput and whatever does the actual updating of langlinks would also have to be changed to handle the extra field.

It just occurred to me that we could stuff the list of "local" links, without the ones from wikidata, into the page_props table. It would be serialized data, so we couldn't directly compare that to what's in the langlink table, but when asking for the langlinks for a specific page, it would be sufficient to provide the information which link comes from where.
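
The page_props idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the serialization format (JSON is used; the comment above does not say which format would be used), the shape of the stored data (language code mapped to title), and the helper name annotate_langlinks.

```python
import json


def annotate_langlinks(langlinks, serialized_local):
    """Mark each langlink as local or repository (hypothetical helper).

    langlinks: dict of language code -> title, as read from the langlinks table.
    serialized_local: JSON string holding only the links found in local
    wikitext, standing in for the value that would be kept in page_props.
    """
    local = json.loads(serialized_local) if serialized_local else {}
    annotated = []
    for lang, title in sorted(langlinks.items()):
        # A link counts as local only if the stored local list has the same target.
        storage = "local" if local.get(lang) == title else "repository"
        annotated.append({"lang": lang, "title": title, "storage": storage})
    return annotated


# Made-up sample data based on the example in the task description.
all_links = {"ace": "Pola:!", "ar": "قالب:!", "as": "সাঁচ:!"}
stored_prop = json.dumps({"ar": "قالب:!"})
annotated = annotate_langlinks(all_links, stored_prop)
```

The lookup happens per page when its langlinks are requested, which matches the limitation mentioned above: the serialized value cannot be compared against the langlinks table in a single query.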

Polluting page_props is just an ugly hack. The best idea would be to add a new column to langlinks, or to not store Wikidata links in langlinks at all.

(In reply to comment #0)


I'd rather suggest something like:

<api>

<query>
  <pages>
    <page pageid="5327033" ns="10" title="Vorlage:!">
      <langlinks>
        <ll lang="ace" shared="" xml:space="preserve">Pola:!</ll>
        <ll lang="ar" xml:space="preserve">قالب:!</ll>
        <ll lang="as" shared="" xml:space="preserve">সাঁচ:!</ll>
      </langlinks>
    </page>
  </pages>
</query>

</api>

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Paladox set Security to None.Apr 26 2015, 11:29 PM

But if bots would know that langlinks are already stored at wikidata they do not have to request source code of many local pages.

The langlinks that are stored on wikidata are available with a single request:

A) https://www.wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=berlin&normalize=&props=sitelinks

so would it not be easier to just solve this on the client side by comparing those links with the output from

B) https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=Berlin

and taking the intersection A ∩ B (= all langlinks from Wikidata) plus the difference B \ A (= langlinks not in Wikidata)?
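
A minimal sketch of that client-side comparison, working on already-fetched data rather than calling the APIs. The function name classify_langlinks and the sample data are made up for illustration. It also assumes that the sitelinks from A), which are keyed by site id such as "dewiki", have already been mapped to the language codes used in B).

```python
def classify_langlinks(client_links, repo_links):
    """Split client langlinks into those matching Wikidata and the rest.

    Both arguments map a language code to a page title.
    """
    a = set(repo_links.items())    # A: sitelinks from Wikidata
    b = set(client_links.items())  # B: langlinks reported by the client wiki
    return {
        "from_wikidata": dict(a & b),  # intersection of A and B
        "local_only": dict(b - a),     # in B but not in A
    }


# Made-up sample data.
repo = {"de": "Berlin", "fr": "Berlin", "pl": "Berlin"}
client = {"de": "Berlin", "fr": "Berlin", "nds": "Berlin (local)"}
result = classify_langlinks(client, repo)
```

One limitation of this approach: a link that is present both in local wikitext and on Wikidata ends up in the intersection, so it cannot be told apart from a link that exists only on Wikidata.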

That works on a small scale, but it doesn't scale. One key example of how this could be used: finding all articles whose language links are not on Wikidata. That is either a fairly simple database query or 2+ million queries to the API.
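
To illustrate what such a database query might look like, here is a sketch against an in-memory SQLite table. The columns ll_from, ll_lang and ll_title follow the existing langlinks table; the ll_source column is hypothetical (it is the extra "source" column discussed earlier in this task and does not exist), and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE langlinks ("
    " ll_from INTEGER, ll_lang TEXT, ll_title TEXT,"
    " ll_source TEXT)"  # hypothetical column: 'local' or 'repository'
)
conn.executemany(
    "INSERT INTO langlinks VALUES (?, ?, ?, ?)",
    [
        (1, "de", "Berlin", "repository"),
        (1, "fr", "Berlin", "repository"),
        (2, "de", "Vorlage:!", "local"),
        (3, "ar", "قالب:!", "local"),
        (3, "as", "সাঁচ:!", "repository"),
    ],
)

# All pages that still have at least one language link in local wikitext.
rows = conn.execute(
    "SELECT DISTINCT ll_from FROM langlinks"
    " WHERE ll_source = 'local' ORDER BY ll_from"
).fetchall()
pages_with_local_links = [row[0] for row in rows]
```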

I think we can lower the priority of this bug. Scale isn't really an issue because most language links have already been moved to Wikidata anyway.

Actually, that is a reason to do the opposite. As more and more links get moved to Wikidata, finding and resolving the remaining links increasingly becomes the primary focus.

There are still Wikipedias with hundreds of thousands of interwikis in wikitext, according to https://stats.wikimedia.org/EN/TablesDatabaseWikiLinks.htm, so there is indeed no shortage of work to do.

Jonas renamed this task from add info if langlink is stored at repository or local to [Task] add info if langlink is stored at repository or local.Aug 13 2015, 6:50 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 13 2015, 6:50 PM