
Create labs instance for cross-wiki search results testing
Closed, Resolved · Public

Description

As part of our testing cycle for the cross-wiki search results, we'd like to set up a labs instance with indices loaded for testing purposes.

The idea is that we'll load the new front-end UI, use the new back-end search functionality, and end up with a viable testing environment that can be used without affecting any of the upcoming A/B tests that will be done on specific language wikis.

The indices for each project would also be loaded, but without the full data available (for example, any search result that comes back will look like a normal search result, but the links to the articles won't actually open anything).

The preferred Wikipedia and project indices to use are from English Wikipedia, as it has the most articles/media to search on.

Event Timeline

debt raised the priority of this task from Medium to High.Nov 22 2016, 4:57 PM
debt moved this task from Incoming to UI on the Discovery-Search (Current work) board.

We could use the relforge elastic cluster for this purpose; I can load some data there. Which wikis would you like to test first?
Concerning an instance to host a MediaWiki installation, we have rel-forge.search.eqiad.wmflabs, but it might be simpler to have a dedicated instance.

Hi @dcausse - using the relforge elastic cluster sounds great, and I think it might be easier/simpler to have a dedicated instance for this data.

According to this query, I got back these page counts for enwiki and the en projects:

| site | pages | content pages |
| enwiki | 40,814,612 | 5,291,572 |
| enwiktionary | 5,490,854 | 4,998,392 |
| enwikisource | 2,053,245 | 593,842 |
| enwikibooks | 245,959 | 55,692 |
| enwikiquote | 147,198 | 28,510 |
| enwikivoyage | 131,855 | 27,573 |
| enwikiversity | 162,520 | 22,974 |
| enwikinews | 2,734,723 | 20,997 |

I believe "content pages" is the magical number of actual pages that are shown in search queries...but I'm not positive about that. My confusion is coming from enwikinews: there are 21K-ish content pages, but over 2.7 million "pages"...how does that happen?
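For a rough sense of the skew, the table's figures can be checked with quick arithmetic (a sketch using only the numbers quoted above; the actual explanation, a large User talk namespace, comes later in the thread):

```python
# Page counts quoted in the table above: (total pages, content pages).
counts = {
    "enwiki": (40_814_612, 5_291_572),
    "enwikinews": (2_734_723, 20_997),
}

for site, (pages, content) in counts.items():
    # Fraction of all pages that are "content" pages (i.e. shown in search).
    print(f"{site}: {content / pages:.1%} of pages are content pages")
```

On enwikinews, well under 1% of pages are content pages, so almost all of its 2.7 million pages sit in non-content namespaces.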

Would we be able to add all the indices for these projects, but not the actual data? Would that overload any of the servers if we add that much to an instance? If so, we can cherry-pick the projects we want to add.

@debt ok, I'll start importing all the en indices.
I'll prefix all the index names with crosswiki to avoid confusion with other indices.
I think we can load all the data; if we do not query these indices too often, the overhead should be acceptable.
Concerning the ratio between pages and content pages on wikinews: apparently there are 2.6M pages in the User talk namespace.
If I run into space issues, I'll probably load partial data into the non-content indices, as you suggested.
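The naming scheme described here (and visible in the index listing further down) can be sketched as a small helper; the wiki list is taken from the table above, and the content/general split per wiki is an assumption based on the listing that follows:

```python
# Each wiki gets a "content" and a "general" index, and every index name is
# prefixed with "crosswiki" to keep it apart from other indices on relforge.
WIKIS = ["enwiki", "enwiktionary", "enwikisource", "enwikibooks",
         "enwikiquote", "enwikivoyage", "enwikiversity", "enwikinews"]

def crosswiki_indices(wikis):
    return [f"crosswiki_{wiki}_{suffix}"
            for wiki in wikis
            for suffix in ("content", "general")]

names = crosswiki_indices(WIKIS)
print(names[0], names[1])
```

Eight wikis with two indices each yields the sixteen crosswiki_* indices shown in the listing below.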

All English wikis are available on the relforge cluster:

green open crosswiki_enwikibooks_content                        1 0    76157      0      2gb      2gb 
green open crosswiki_enwikibooks_general                        1 0   147808      0 1011.8mb 1011.8mb 
green open crosswiki_enwiki_content                             7 0  6363500      0  131.7gb  131.7gb 
green open crosswiki_enwiki_general                             8 0 22896342      0  233.4gb  233.4gb 
green open crosswiki_enwikinews_content                         1 0    21105      0  518.2mb  518.2mb 
green open crosswiki_enwikinews_general                         4 0  2676473      0   12.5gb   12.5gb 
green open crosswiki_enwikiquote_content                        1 0    28571      0    1.1gb    1.1gb 
green open crosswiki_enwikiquote_general                        1 0    93231      0  574.6mb  574.6mb 
green open crosswiki_enwikisource_content                       7 0  1766590      0   30.5gb   30.5gb 
green open crosswiki_enwikisource_general                       1 0   161859      0    1.2gb    1.2gb 
green open crosswiki_enwikiversity_content                      1 0    33860      0  875.6mb  875.6mb 
green open crosswiki_enwikiversity_general                      1 0   113400      0    1.1gb    1.1gb 
green open crosswiki_enwikivoyage_content                       1 0   149916      0      1gb      1gb 
green open crosswiki_enwikivoyage_general                       1 0    84397      0  626.2mb  626.2mb 
green open crosswiki_enwiktionary_content                       5 0  4998355      0   14.1gb   14.1gb 
green open crosswiki_enwiktionary_general                       2 0   448670      0      5gb      5gb
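As a sanity check, the listing above can be tallied with a short script (a sketch; it assumes the standard `_cat/indices`-style columns: health, state, index name, shards, replicas, doc count, deleted docs, store size, primary store size, and shows only three of the rows for brevity):

```python
# A few rows copied verbatim from the listing above.
listing = """\
green open crosswiki_enwikibooks_content 1 0 76157 0 2gb 2gb
green open crosswiki_enwiki_content 7 0 6363500 0 131.7gb 131.7gb
green open crosswiki_enwiki_general 8 0 22896342 0 233.4gb 233.4gb
"""

def parse_size(s):
    # Convert Elasticsearch size strings like "131.7gb" or "574.6mb" to bytes.
    units = {"kb": 1024, "mb": 1024**2, "gb": 1024**3}
    for unit, factor in units.items():
        if s.endswith(unit):
            return float(s[: -len(unit)]) * factor
    return float(s)

total_docs = 0
total_bytes = 0.0
for line in listing.strip().splitlines():
    cols = line.split()
    total_docs += int(cols[5])          # doc count column
    total_bytes += parse_size(cols[7])  # store size column

print(f"{total_docs} docs, {total_bytes / 1024**3:.1f} GB")
```

The two enwiki indices alone dominate: together with enwikibooks content they come to roughly 29.3M documents and about 367 GB.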

We now need to set up a new labs VM to run MediaWiki on top of these indices.

The labs instance with the relforge indices has been set up and is available at: http://sistersearch.wmflabs.org/

Excellent! :-D