Search index a limited number of article placeholders on cywiki for testing and evaluation purposes
Closed, Resolved (Public)

Description

For starters we should choose a wiki (which already has placeholders) and decide on a number of placeholders we want to index. We will then make them indexable and submit them to search engines, so that we can evaluate.

We should start by indexing the first (by id) 3,000 placeholders that are notable (3+ statements, 2+ sitelinks) on cy.wikipedia. Going for the first n placeholders by id is easy to filter (no need to define a huge list of indexable placeholders) and should approximate the importance of the respective placeholders (at least in the lower id ranges).

That should get us started and let us see the traffic/visibility impact of this.
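The selection rule described above can be sketched roughly as follows. This is only an illustration of the criteria (3+ statements, 2+ sitelinks, no local article, lowest ids first); the `Item` type, its field names, and both function names are made up for this example and are not part of Wikibase or ArticlePlaceholder.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Item:
    """Hypothetical summary of a Wikidata Item, for illustration only."""
    item_id: str             # e.g. "Q42"
    statements: int          # number of statements on the Item
    sitelinks: int           # number of sitelinks on the Item
    has_local_article: bool  # True if the target wiki already has an article


def is_notable(item: Item, min_statements: int = 3, min_sitelinks: int = 2) -> bool:
    # Thresholds taken from the task description: 3+ statements, 2+ sitelinks.
    return item.statements >= min_statements and item.sitelinks >= min_sitelinks


def first_notable_placeholders(items: Iterable[Item], limit: int) -> List[str]:
    # Keep Items without a local article that pass the notability check,
    # then order them by the numeric part of the id ("Q10" sorts after "Q5").
    candidates = [i for i in items if not i.has_local_article and is_notable(i)]
    candidates.sort(key=lambda i: int(i.item_id[1:]))
    return [i.item_id for i in candidates[:limit]]
```

Sorting on the numeric part of the id matters: a plain string sort would place Q10 before Q2.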

Event Timeline

hoo renamed this task from Limitted trial of search indexed article placeholders to Search index a limitted number of article placeholders for testing and evaluation purposes. Sep 2 2016, 12:43 PM
hoo removed Lucie as the assignee of this task.

I suggest Esperanto and would use the items with the most sitelinks that don't have a link to Esperanto Wikipedia.

Jonas raised the priority of this task from Medium to High. Oct 12 2016, 9:03 AM

I discussed with Lydia recently that we should set a higher minimum number of statements/sitelinks for an item to qualify as an indexable placeholder.
For a start, we should consider indexing no more than 5,000 placeholders and see how that works.

Nemo_bis renamed this task from Search index a limitted number of article placeholders for testing and evaluation purposes to Search index a limited number of article placeholders for testing and evaluation purposes. Oct 24 2016, 2:57 PM
Nemo_bis added a project: SEO.

@hoo Lucie and I discussed and I'd prefer if we can index the most important placeholders for at least some meaningful measure of importance. I'd rather not just take the first X items by Q-ID.

@Lucie and I also talked about this, and we decided that the first notable thousand (by entity id) are probably also pretty close to the most important thousand. Also this allows us to start with a very limited trial easily (no need to start indexing more than a few thousand placeholders: we can only do that easily if the indexable ids, or the id range in this case, are well known).

That makes this easier technically and also minimizes risk, which makes it safer and easier to coordinate with other teams.

Maybe items with featured or good badges? There is also https://en.wikipedia.org/wiki/Wikipedia:Vital_articles (1,000 en.WP must-haves) and https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Expanded (10,000 en.WP must-haves) which might be good as well.

We would just need to extract the links from those pages and convert them to Wikidata IDs.
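One possible way to do the title-to-Item conversion is the MediaWiki API's `prop=pageprops` query with `ppprop=wikibase_item`. The sketch below only builds the request URL and parses a response; the helper names are made up here, and batching (the API limits how many titles one request may carry), redirects, and the link extraction itself are left out.

```python
from typing import Dict, Iterable
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def pageprops_url(titles: Iterable[str]) -> str:
    # Build a query asking for the Wikidata Item id attached to each title.
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def item_ids_from_response(data: dict) -> Dict[str, str]:
    # Map page title -> Item id, skipping pages that are missing
    # or not connected to an Item.
    pages = data.get("query", {}).get("pages", {})
    return {
        page["title"]: page["pageprops"]["wikibase_item"]
        for page in pages.values()
        if "wikibase_item" in page.get("pageprops", {})
    }
```

The JSON fetched from `pageprops_url([...])` would be passed, once decoded, to `item_ids_from_response`.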

Jep. Or write a query for the items with most sitelinks that are not on eowiki. Then order the list by number of statements. Then take the top X.
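As a rough illustration of that alternative, here is one reading of it: drop Items already on the target wiki, rank by sitelink count, break ties by statement count, and take the top X. The `Candidate` type and `top_by_sitelinks` are hypothetical names, and using statements only as a tie-breaker is an assumption about what "then order by number of statements" means.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical per-Item summary, for illustration only."""
    item_id: str          # e.g. "Q42"
    sitelinks: int        # number of sitelinks on the Item
    statements: int       # number of statements on the Item
    on_target_wiki: bool  # True if the Item already links to e.g. eowiki


def top_by_sitelinks(candidates: Iterable[Candidate], x: int) -> List[str]:
    # Only Items without an article on the target wiki are eligible.
    missing = [c for c in candidates if not c.on_target_wiki]
    # Most sitelinks first, then most statements, then lowest numeric id
    # so that the ordering is deterministic.
    ranked = sorted(
        missing,
        key=lambda c: (-c.sitelinks, -c.statements, int(c.item_id[1:])),
    )
    return [c.item_id for c in ranked[:x]]
```

Unlike the "first n by id" rule, the result is an arbitrary list of ids rather than a contiguous range.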

I suggest Esperanto and would use the items with the most sitelinks that don't have a link to Esperanto Wikipedia.

This is reasonable. See also https://meta.wikimedia.org/wiki/Research:Newsletter/2014/August#Wikipedia_in_all_languages_used_to_rank_global_historical_figures_of_all_time

I wouldn't go into more complex selection methods.

Jep. Or write a query for the items with most sitelinks that are not on eowiki. Then order the list by number of statements. Then take the top X.

In that case, we won't be able to allow indexing only for those. That's probably OK, but it considerably increases the impact of this.

Query for getting the first 1,000 Items that don't have a sitelink to XXwiki:

SELECT page_title
FROM page
INNER JOIN wb_entity_per_page ON epp_page_id = page_id
INNER JOIN page_props AS pp_sl ON pp_sl.pp_page = page_id AND pp_sl.pp_propname = 'wb-sitelinks'
INNER JOIN page_props AS pp_st ON pp_st.pp_page = page_id AND pp_st.pp_propname = 'wb-claims'
WHERE page_namespace = 0
  AND pp_st.pp_value > 2
  AND pp_sl.pp_value > 3
  AND NOT EXISTS (SELECT 1 FROM wb_items_per_site WHERE ips_site_id = 'XXwiki' AND ips_item_id = epp_entity_id)
ORDER BY epp_entity_id ASC
LIMIT 1000;

Cutoff for guwiki: Q1526, cutoff for htwiki: Q1666.

What part of that query selects for "most sitelinks"?

None, that's not part of the criteria. The idea is to index the first 1000 placeholders (sorted by numeric part of the Item id) that are notable.

According to previous comments, it is. Please update the task description to reflect the current thinking on what the criteria are and why the number of sitelinks was rejected.

What definition of "notable" are you using?

Added.

As just put in the description: 3+ statements and 2+ sitelinks are the minimum needed.

hoo renamed this task from Search index a limited number of article placeholders for testing and evaluation purposes to Search index a limited number of article placeholders on cywiki for testing and evaluation purposes. Edited Jan 31 2017, 2:14 PM

Concrete query used:

SELECT page_title
FROM page
INNER JOIN wb_entity_per_page ON epp_page_id = page_id
INNER JOIN page_props AS pp_sl ON pp_sl.pp_page = page_id AND pp_sl.pp_propname = 'wb-sitelinks'
INNER JOIN page_props AS pp_st ON pp_st.pp_page = page_id AND pp_st.pp_propname = 'wb-claims'
WHERE pp_st.pp_value > 2
  AND pp_sl.pp_value > 3
  AND NOT EXISTS (SELECT 1 FROM wb_items_per_site WHERE ips_site_id = 'cywiki' AND ips_item_id = epp_entity_id)
ORDER BY epp_entity_id ASC
LIMIT 1000;

Results (indexable user page): https://cy.wikipedia.org/wiki/Defnyddiwr:Hoo_man/T144592-placeholders

Note: The placeholders themselves are not indexable, yet.

Change 336225 had a related patch set uploaded (by Hoo man):
Search index article placeholders up to Q2794

https://gerrit.wikimedia.org/r/336225

Scheduled this to be deployed between 14:00–15:00 UTC tomorrow (2017-02-06).

Change 336225 merged by jenkins-bot:
Search index article placeholders on cywiki up to Q2794

https://gerrit.wikimedia.org/r/336225

Mentioned in SAL (#wikimedia-operations) [2017-02-07T14:12:14Z] <hoo@tin> Synchronized wmf-config/: Search index article placeholders on cywiki up to Q2794 (T144592) (duration: 00m 42s)

hoo removed a project: Patch-For-Review.
hoo moved this task from Doing to Done on the ArticlePlaceholder board.

Placeholders up until https://cy.wikipedia.org/wiki/Arbennig:AboutTopic/Q2794 are now indexable on cywiki.
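The id-range rule itself is simple to express. The snippet below is only a sketch of that rule, not the deployed wmf-config change; `CUTOFF` and `is_indexable` are names invented for this example.

```python
import re

# Assumed cutoff, taken from the deployed change title ("up to Q2794").
CUTOFF = 2794

_ITEM_ID = re.compile(r"Q([1-9][0-9]*)")


def is_indexable(item_id: str, cutoff: int = CUTOFF) -> bool:
    # A placeholder is treated as indexable if it is for an Item (Q-id)
    # whose numeric part does not exceed the cutoff.
    match = _ITEM_ID.fullmatch(item_id)
    return match is not None and int(match.group(1)) <= cutoff
```

For example, `is_indexable("Q272")` is true while `is_indexable("Q2795")` is false. A range check like this needs no stored list of ids, which is the practical advantage argued for earlier in the thread.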

Full list of indexable placeholders: https://cy.wikipedia.org/wiki/Defnyddiwr:Hoo_man/T144592-placeholders.

Thanks for updating the task description.

I note however that https://www.mediawiki.org/w/index.php?diff=2373589 contradicts the task description, since it says «all placeholders for Items that have an id up Q3000» (bold added).

Sorry for triple message... do I see correctly (https://archive.fo/MakpZ ) that currently https://cy.wikipedia.org/wiki/Arbennig:AboutTopic/Q272 is the only URL actually indexed by Google?

Yes, this is still the only article for now: https://www.google.com/search?q=site:cy.wikipedia.org+inurl:AboutTopic

If you click to show "duplicate" search results, you find that Google tries to index URLs like https://cy.wikipedia.org/wiki.phtml?title=Special:AboutTopic/Q2050, but says it can't because of https://cy.wikipedia.org/robots.txt, though I cannot track down the rule. The problem here is not that it can't index these URLs. This is fine. The problem is: How does it even find these weird URLs?

I gave some of these to Google in order to experiment a bit. These should not be ranked highly and won't appear in any real-world searches.

I guess it will take a few more weeks until Google and other search engines start picking up the other placeholders :/

Sorry for triple message... do I see correctly (https://archive.fo/MakpZ ) that currently https://cy.wikipedia.org/wiki/Arbennig:AboutTopic/Q272 is the only URL actually indexed by Google?

Now searching the localised special page name:
https://duckduckgo.com/?q="Arbennig%3AAm_y_Pwnc"+site%3Acy.wikipedia.org
DuckDuckGo shows a few mostly Welsh results: https://archive.fo/iRyCb

Google picks up results which are mostly in English such as (2nd for me):

Wanfried agreement - Wicipedia
https://cy.wikipedia.org/wiki/Arbennig:Am_y_Pwnc/Q1441
treaty transferring territory between the United States and Soviet occupation zones of Germany after World War II. Karte Wanfrieder Abkommen.png

https://archive.fo/SVu8t