Search index a limited number of article placeholders on cywiki for testing and evaluation purposes
Closed, Resolved, Public

Description

For starters we should choose a wiki (which already has placeholders) and decide on a number of placeholders we want to index. We will then make them indexable and submit them to search engines, so that we can evaluate.

We should start by indexing the first (by id) 3,000 placeholders that are notable (3+ statements, 2+ sitelinks) on cy.wikipedia. Going for the first n placeholders by id is easy to filter (no need to define a huge list of indexable placeholders) and should approximate the importance of the respective placeholders (at least in the lower id ranges).
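The selection rule described above could be sketched roughly as follows. This is only an illustration on in-memory data, not the code used for the trial: the `Item` class, the function names and the thresholds as defaults are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical, simplified view of a Wikidata item."""
    item_id: str          # e.g. "Q42"
    statements: int       # number of statements on the item
    sitelinks: frozenset  # site ids such as "enwiki"


def numeric_id(item_id: str) -> int:
    """Return the numeric part of an item id ("Q42" -> 42)."""
    return int(item_id[1:])


def is_notable(item: Item, min_statements: int = 3, min_sitelinks: int = 2) -> bool:
    """Notability as stated in the description: 3+ statements, 2+ sitelinks."""
    return item.statements >= min_statements and len(item.sitelinks) >= min_sitelinks


def first_notable_placeholders(items, wiki: str, limit: int):
    """First `limit` notable items (by numeric id) without an article on `wiki`."""
    candidates = [i for i in items if is_notable(i) and wiki not in i.sitelinks]
    candidates.sort(key=lambda i: numeric_id(i.item_id))
    return [i.item_id for i in candidates[:limit]]
```

Sorting on the numeric part matters: a plain string sort would place Q10 before Q2.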

That should get us started and enable us to see the traffic/visibility impact of this.

hoo renamed this task from Limitted trial of search indexed article placeholders to Search index a limitted number of article placeholders for testing and evaluation purposes. Sep 2 2016, 12:43 PM
hoo removed Lucie as the assignee of this task.

I suggest Esperanto and would use the items with the most sitelinks that don't have a link to Esperanto Wikipedia.

Jonas raised the priority of this task from Normal to High.

I discussed with Lydia recently that we should set a higher minimum number of statements/sitelinks that an item needs for its placeholder to be indexed.
To start, we should consider indexing no more than 5,000 placeholders and see how that works.

Nemo_bis renamed this task from Search index a limitted number of article placeholders for testing and evaluation purposes to Search index a limited number of article placeholders for testing and evaluation purposes. Oct 24 2016, 2:57 PM
Nemo_bis added a project: SEO.
hoo updated the task description. Oct 30 2016, 11:34 PM

@hoo Lucie and I discussed and I'd prefer if we can index the most important placeholders for at least some meaningful measure of importance. I'd rather not just take the first X items by Q-ID.

hoo added a comment. Oct 31 2016, 2:59 PM

@hoo Lucie and I discussed and I'd prefer if we can index the most important placeholders for at least some meaningful measure of importance. I'd rather not just take the first X items by Q-ID.

@Lucie and I also talked about this, and we decided that the first thousand notable placeholders (by entity id) are probably also pretty close to the most important thousand. This also allows us to start with a very limited trial easily, with no need to index more than a few thousand placeholders: we can only do that easily if the indexable ids (or the id range, in this case) are well known.

That makes this easier technically and also minimizes risk, which makes it safer and easier to coordinate with other teams.

Izno added a subscriber: Izno. Oct 31 2016, 3:00 PM

Maybe items with featured or good badges? There is also https://en.wikipedia.org/wiki/Wikipedia:Vital_articles (1,000 en.WP must-haves) and https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Expanded (10,000 en.WP must-haves) which might be good as well.

We would just need to extract the links from those pages and convert them to Wikidata IDs.
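As a rough sketch of the conversion step: the MediaWiki action API exposes the connected Wikidata item of a page through `prop=pageprops` with `ppprop=wikibase_item`. The helper names below are made up for illustration, the list of titles is assumed to have been extracted already, and no request is actually sent here.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_query_url(titles):
    """Build an API URL asking for the Wikidata item of each title.

    The API limits how many titles one request may carry, so a long
    list would need to be sent in batches.
    """
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),
        "format": "json",
    }
    return API + "?" + urlencode(params)


def parse_item_ids(response):
    """Map page title -> item id from a decoded JSON response.

    Pages without a connected item (or missing pages) are skipped.
    """
    pages = response.get("query", {}).get("pages", {})
    result = {}
    for page in pages.values():
        item = page.get("pageprops", {}).get("wikibase_item")
        if item:
            result[page["title"]] = item
    return result
```

The returned mapping could then be filtered against the sitelinks of the target wiki to find the items that still lack an article there.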

Jep. Or write a query for the items with most sitelinks that are not on eowiki. Then order the list by number of statements. Then take the top X.
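One possible reading of this proposal, sketched on in-memory data: take a pool of the items with the most sitelinks that are missing on the target wiki, re-order that pool by statement count, and keep the top X. The dictionary layout, the function name and the `pool_size` parameter are assumptions made for the example; the comment does not say how large the intermediate list should be.

```python
def top_missing_items(items, wiki, pool_size, top_x):
    """Rank items missing on `wiki` by sitelinks, then by statements.

    `items` is a list of dicts with the (hypothetical) keys
    "id", "sitelinks" (a set of site ids) and "statements" (an int).
    """
    missing = [i for i in items if wiki not in i["sitelinks"]]
    # Step 1: the items with the most sitelinks form the candidate pool.
    pool = sorted(missing, key=lambda i: len(i["sitelinks"]), reverse=True)[:pool_size]
    # Step 2: order the pool by number of statements.
    ranked = sorted(pool, key=lambda i: i["statements"], reverse=True)
    # Step 3: take the top X.
    return [i["id"] for i in ranked[:top_x]]
```

Unlike an id range, the result is an explicit list of ids, so the indexing rule would have to carry that list around.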

I suggest Esperanto and would use the items with the most sitelinks that don't have a link to Esperanto Wikipedia.

This is reasonable. See also https://meta.wikimedia.org/wiki/Research:Newsletter/2014/August#Wikipedia_in_all_languages_used_to_rank_global_historical_figures_of_all_time

I wouldn't go into more complex selection methods.

hoo added a comment. Oct 31 2016, 11:30 PM

Jep. Or write a query for the items with most sitelinks that are not on eowiki. Then order the list by number of statements. Then take the top X.

In that case, we won't be able to allow indexing only for those. That's probably OK, but it definitely widens the impact of this considerably.

hoo added a comment. Jan 11 2017, 9:26 PM

Query for getting the first 1,000 Items that don't have a sitelink to XXwiki:

SELECT page_title
FROM page
INNER JOIN wb_entity_per_page ON epp_page_id = page_id
INNER JOIN page_props AS pp_sl
  ON pp_sl.pp_page = page_id AND pp_sl.pp_propname = 'wb-sitelinks'
INNER JOIN page_props AS pp_st
  ON pp_st.pp_page = page_id AND pp_st.pp_propname = 'wb-claims'
WHERE page_namespace = 0
  AND pp_st.pp_value > 2
  AND pp_sl.pp_value > 3
  AND NOT EXISTS(
    SELECT 1 FROM wb_items_per_site
    WHERE ips_site_id = 'XXwiki' AND ips_item_id = epp_entity_id
  )
ORDER BY epp_entity_id ASC
LIMIT 1000;

hoo added a comment. Jan 11 2017, 9:38 PM

Cutoff for guwiki: Q1526, cutoff for htwiki: Q1666.
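The cutoffs above are presumably the highest item id among the first 1,000 rows returned for each wiki. A minimal sketch of deriving such a cutoff from a result list (the function names are made up for illustration):

```python
def numeric_id(item_id: str) -> int:
    """Return the numeric part of an item id ("Q1526" -> 1526)."""
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"not an item id: {item_id!r}")
    return int(item_id[1:])


def cutoff(result_titles):
    """Highest item id in the result list, compared numerically."""
    if not result_titles:
        raise ValueError("empty result list")
    return max(result_titles, key=numeric_id)
```

Because the query orders by entity id, the cutoff should simply be the last row; taking the numeric maximum just avoids relying on the order of the list.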

What part of that query selects for "most sitelinks"?

hoo added a comment. Jan 17 2017, 9:46 AM

What part of that query selects for "most sitelinks"?

None, that's not part of the criteria. The idea is to index the first 1000 placeholders (sorted by numeric part of the Item id) that are notable.

What part of that query selects for "most sitelinks"?

None, that's not part of the criteria.

According to previous comments, it is. Please update the task description to reflect the current thinking on what the criteria are and why the number of sitelinks was rejected.

The idea is to index the first 1000 placeholders (sorted by numeric part of the Item id) that are notable.

What definition of "notable" are you using?

hoo updated the task description. Jan 26 2017, 10:20 AM
hoo updated the task description. Jan 26 2017, 10:22 AM

What part of that query selects for "most sitelinks"?

None, that's not part of the criteria.

According to previous comments, it is. Please update the task description to reflect the current thinking on what the criteria are and why the number of sitelinks was rejected.

Added.

The idea is to index the first 1000 placeholders (sorted by numeric part of the Item id) that are notable.

What definition of "notable" are you using?

As just put in the description: 3+ statements and 2+ sitelinks are the minimum needed.

hoo renamed this task from Search index a limited number of article placeholders for testing and evaluation purposes to Search index a limited number of article placeholders on cywiki for testing and evaluation purposes. Edited Jan 31 2017, 2:14 PM

Concrete query used:

SELECT page_title
FROM page
INNER JOIN wb_entity_per_page ON epp_page_id = page_id
INNER JOIN page_props AS pp_sl
  ON pp_sl.pp_page = page_id AND pp_sl.pp_propname = 'wb-sitelinks'
INNER JOIN page_props AS pp_st
  ON pp_st.pp_page = page_id AND pp_st.pp_propname = 'wb-claims'
WHERE pp_st.pp_value > 2
  AND pp_sl.pp_value > 3
  AND NOT EXISTS(
    SELECT 1 FROM wb_items_per_site
    WHERE ips_site_id = 'cywiki' AND ips_item_id = epp_entity_id
  )
ORDER BY epp_entity_id ASC
LIMIT 1000;

Results (indexable user page): https://cy.wikipedia.org/wiki/Defnyddiwr:Hoo_man/T144592-placeholders

Note: The placeholders themselves are not indexable, yet.
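To see what the query above selects, it can be run against a toy SQLite database. This is only a sketch: the table definitions are heavily simplified stand-ins for the real schema, the rows are invented, and the site id and limit are passed as parameters instead of being written into the query.

```python
import sqlite3

QUERY = """
SELECT page_title
FROM page
INNER JOIN wb_entity_per_page ON epp_page_id = page_id
INNER JOIN page_props AS pp_sl
  ON pp_sl.pp_page = page_id AND pp_sl.pp_propname = 'wb-sitelinks'
INNER JOIN page_props AS pp_st
  ON pp_st.pp_page = page_id AND pp_st.pp_propname = 'wb-claims'
WHERE pp_st.pp_value > 2
  AND pp_sl.pp_value > 3
  AND NOT EXISTS (
    SELECT 1 FROM wb_items_per_site
    WHERE ips_site_id = ? AND ips_item_id = epp_entity_id
  )
ORDER BY epp_entity_id ASC
LIMIT ?
"""


def build_toy_db():
    """In-memory database with five invented items, Q1 to Q5."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
        CREATE TABLE wb_entity_per_page (epp_entity_id INTEGER, epp_page_id INTEGER);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value INTEGER);
        CREATE TABLE wb_items_per_site (ips_item_id INTEGER, ips_site_id TEXT);
    """)
    # (numeric id, claims, sitelinks, has a cywiki article)
    rows = [(1, 5, 10, True), (2, 3, 4, False), (3, 2, 9, False),
            (4, 8, 3, False), (5, 4, 6, False)]
    for n, claims, sitelinks, on_cywiki in rows:
        db.execute("INSERT INTO page VALUES (?, 0, ?)", (n, f"Q{n}"))
        db.execute("INSERT INTO wb_entity_per_page VALUES (?, ?)", (n, n))
        db.execute("INSERT INTO page_props VALUES (?, 'wb-sitelinks', ?)", (n, sitelinks))
        db.execute("INSERT INTO page_props VALUES (?, 'wb-claims', ?)", (n, claims))
        if on_cywiki:
            db.execute("INSERT INTO wb_items_per_site VALUES (?, 'cywiki')", (n,))
    return db


def run(db, site, limit):
    """Return the selected page titles for `site`, lowest item id first."""
    return [row[0] for row in db.execute(QUERY, (site, limit))]
```

With this data, Q3 is dropped for having only two claims and Q4 for having only three sitelinks: as written, `> 2` requires at least three statements and `> 3` requires at least four sitelinks.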

Change 336225 had a related patch set uploaded (by Hoo man):
Search index article placeholders up to Q2794

https://gerrit.wikimedia.org/r/336225

hoo claimed this task. Feb 6 2017, 3:25 PM

Scheduled this to be deployed between 14:00 and 15:00 UTC tomorrow (2017-02-07).

Change 336225 merged by jenkins-bot:
Search index article placeholders on cywiki up to Q2794

https://gerrit.wikimedia.org/r/336225

Mentioned in SAL (#wikimedia-operations) [2017-02-07T14:12:14Z] <hoo@tin> Synchronized wmf-config/: Search index article placeholders on cywiki up to Q2794 (T144592) (duration: 00m 42s)

hoo closed this task as Resolved. Feb 7 2017, 2:31 PM
hoo removed a project: Patch-For-Review.
hoo moved this task from Doing to Done on the ArticlePlaceholder board.

Placeholders up until https://cy.wikipedia.org/wiki/Arbennig:AboutTopic/Q2794 are now indexable on cywiki.

Full list of indexable placeholders: https://cy.wikipedia.org/wiki/Defnyddiwr:Hoo_man/T144592-placeholders.
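The effect of an id cutoff like this can be sketched as a simple range check. This is not the ArticlePlaceholder implementation: the function name, the returned robots values and the assumption that the cutoff itself is included are all illustrative.

```python
import re

ITEM_ID = re.compile(r"^Q([1-9][0-9]*)$")


def robots_policy(item_id: str, cutoff: str = "Q2794") -> str:
    """Return a robots policy for the placeholder of `item_id`.

    Assumes placeholders up to and including `cutoff` are indexable.
    """
    item = ITEM_ID.match(item_id)
    limit = ITEM_ID.match(cutoff)
    if item is None or limit is None:
        raise ValueError("expected item ids such as 'Q42'")
    if int(item.group(1)) <= int(limit.group(1)):
        return "index,follow"
    return "noindex,nofollow"
```

A pure range check like this would cover every item id below the cutoff, whether or not the item is on the list linked above.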

Thanks for updating the task description.

I note however that https://www.mediawiki.org/w/index.php?diff=2373589 contradicts the task description, since it says «all placeholders for Items that have an id up Q3000» (bold added).

Sorry for triple message... do I see correctly (https://archive.fo/MakpZ ) that currently https://cy.wikipedia.org/wiki/Arbennig:AboutTopic/Q272 is the only URL actually indexed by Google?

Yes, this is still the only article for now: https://www.google.com/search?q=site:cy.wikipedia.org+inurl:AboutTopic

If you click to show "duplicate" search results, you find that Google tries to index URLs like https://cy.wikipedia.org/wiki.phtml?title=Special:AboutTopic/Q2050 but, according to Google, can't because of https://cy.wikipedia.org/robots.txt. I cannot track down the rule, though. The problem here is not that it can't index these URLs; that is fine. The problem is: how does it even find these weird URLs?
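One way to track down which rule applies is to scan the robots.txt text for `Disallow` prefixes that match the URL. The sketch below is a simplification: it only does plain prefix matching for one user agent and ignores `Allow` lines and wildcards, and the function name is made up. Any rules fed to it in an example are hypothetical, not the actual contents of the cywiki robots.txt.

```python
from urllib.parse import urlsplit


def matching_disallow_rules(robots_txt: str, url: str, agent: str = "*"):
    """Return (line number, prefix) for each Disallow rule matching `url`."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    matches = []
    group_agents = set()    # user agents of the group currently being read
    last_was_agent = False  # consecutive User-agent lines share one group
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                group_agents = set()
            group_agents.add(value.lower())
            last_was_agent = True
            continue
        last_was_agent = False
        if (field == "disallow" and value
                and agent.lower() in group_agents
                and target.startswith(value)):
            matches.append((number, value))
    return matches
```

Running this over a downloaded copy of the file with the `wiki.phtml` URL would list the candidate rules together with their line numbers.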

hoo added a comment. Feb 22 2017, 6:25 PM

If you click to show "duplicate" search results, you find that Google tries to index URLs like https://cy.wikipedia.org/wiki.phtml?title=Special:AboutTopic/Q2050 but, according to Google, can't because of https://cy.wikipedia.org/robots.txt. I cannot track down the rule, though. The problem here is not that it can't index these URLs; that is fine. The problem is: how does it even find these weird URLs?

I gave some of these to Google in order to experiment a bit. These should not be ranked highly and won't appear in any real-world searches.

I guess it will take a few more weeks until Google and other search engines start picking up the other placeholders :/

Sorry for triple message... do I see correctly (https://archive.fo/MakpZ ) that currently https://cy.wikipedia.org/wiki/Arbennig:AboutTopic/Q272 is the only URL actually indexed by Google?

Now searching the localised special page name:
https://duckduckgo.com/?q="Arbennig%3AAm_y_Pwnc"+site%3Acy.wikipedia.org
DuckDuckGo shows a few mostly Welsh results: https://archive.fo/iRyCb

Google picks up results which are mostly in English such as (2nd for me):

Wanfried agreement - Wicipedia
https://cy.wikipedia.org/wiki/Arbennig:Am_y_Pwnc/Q1441
treaty transferring territory between the United States and Soviet occupation zones of Germany after World War II. Karte Wanfrieder Abkommen.png

https://archive.fo/SVu8t