
Parallelize the theory-testing pipeline
Closed, ResolvedPublic

Description

  • Background: There are many promising theories about how the zero-results rate can be reduced, and more theories will probably emerge as the project progresses.
  • Summary: Enable simultaneous A/B testing of CirrusSearch/Elastic changes.
  • Benefit: Increase the number of theories that can be tested in Q1.

Event Timeline

Jdouglas raised the priority of this task from to High.
Jdouglas lowered the priority of this task from High to Medium.
Jdouglas updated the task description.
Jdouglas set Security to None.
Jdouglas added subscribers: Aklapper, Jdouglas.

I would like @Manybubbles, @EBernhardson, @Jdouglas and @dcausse to get their heads together and make a list of tasks about how this task could be tackled. That list will form the basis of the subtasks to attack this problem.

This one might be too complex to be worthwhile, but could we duplicate the Elastic cluster and have Cirrus search one or the other based on some client-side selection criterion?

We could have multiple indexes in Elasticsearch, selecting which to use at query time and always updating both when write operations roll around. I'm not sure how that will jibe with our IO and memory usage, though. There is also the several days to a week it takes to fill a new index (for enwiki).
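In client-code terms it would look roughly like this - sketched with the official Python Elasticsearch client rather than Cirrus itself, and with invented index names and cluster URL:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical per-theory indexes; the real naming scheme is undecided.
INDEXES = {"A": "enwiki_theory_a", "B": "enwiki_theory_b"}

def update_page(page_id, doc):
    # Every write goes to every index so the theories stay in sync.
    for index in INDEXES.values():
        es.index(index=index, id=page_id, body=doc)

def search(text, bucket):
    # Reads pick one index based on the client-side selection criterion.
    return es.search(index=INDEXES[bucket],
                     body={"query": {"match": {"text": text}}})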

There is also the several days to a week it takes to fill a new index (for enwiki)

This sounds like a strong argument in favor of enabling parallelization.

I think it depends on the test complexity. At a glance I see four levels of complexity:

1. No change in index required
When: tweaking search query parameters
If the test does not require changes on the index side, it seems relatively easy to enable simultaneous A/B testing.

  • Week 1: Test config A on frwiki and config B on dewiki
  • Week 2: Test config B on frwiki and config A on dewiki

But I don't know how we can test configs A and B simultaneously on the same wiki.
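As a rough sketch of that week-by-week rotation (plain Python; the week numbers and config names are invented):

# Crossover schedule: each wiki sees each config, one per week,
# so configs can be compared within a wiki across weeks.
SCHEDULE = {
    (1, "frwiki"): "config_a",
    (1, "dewiki"): "config_b",
    (2, "frwiki"): "config_b",
    (2, "dewiki"): "config_a",
}

def active_config(week, wiki):
    return SCHEDULE[(week, wiki)]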

2. Small "companion" indexes/services

When: adding new services that are not related to the core search query (spellchecker, suggestions)
Depending on the feature we want to test, we could also build "ad-hoc" indexes that are not "real-time"; in other words, some indexes may not have to reflect the exact state of the wiki they serve. For instance, if we want to test different setups for misspelling detection, it sounds feasible to build small indexes like "enwiki_spellchecker_index_theoryA" or "enwiki_spellchecker_index_theoryB".
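For illustration, creating the two companion indexes might look like this (Python client, recent Elasticsearch mapping format; the analyzer choices and field layout are placeholders, not a real spellchecker design):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One small companion index per theory; only the analysis chain differs.
for name, analyzer in [
    ("enwiki_spellchecker_index_theorya", "standard"),
    ("enwiki_spellchecker_index_theoryb", "simple"),
]:
    es.indices.create(index=name, body={
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": analyzer},
            }
        }
    })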

3. Index changes

When: changing the behavior of some analyzers.
This is more challenging: data inertia is what prevents us from making quick changes to the analysis chain. Having multiple indexes is a good idea. By tweaking some maintenance scripts we can create secondary indexes with different mapping configurations. I don't think we'll need to rebuild the index from the db; a new maintenance script, mostly based on the "in place reindex", could do an "in place copy".
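One possibility for the copy itself (not necessarily how the Cirrus maintenance script would do it) is the scan-and-scroll reindex helper that ships with the Python client; a rough sketch, assuming the secondary index with the experimental analyzers already exists, and with invented index names:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex

es = Elasticsearch("http://localhost:9200")

# Stream every document out of the live index and into the secondary
# one; the target's own analysis chain is applied as documents arrive.
reindex(es,
        source_index="enwiki_content",
        target_index="enwiki_content_theory_a")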

4. Model changes

When: adding a new field.
This is the most difficult, and I don't know how we can parallelize such tests. The data sent to ES will be different, so it seems pretty difficult to achieve.

Conclusion: I don't see major changes required to enable simultaneous testing; the only one needed is for complexity #3, and that feature is very similar to T86781. We should maybe think about how we could make the metrics sent to @Ironholds more digestible, so that theoryA can be properly matched with metricsY.
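On the metrics point, the simplest thing may be to stamp every logged search event with the theory that served it. A toy sketch (Python; the field names and the print stand-in are made up):

import json
import time

def log_search_event(wiki, theory, query, hits_total):
    # Carrying the theory label in the event itself lets theoryA be
    # joined against metricsY without any guesswork downstream.
    event = {
        "timestamp": time.time(),
        "wiki": wiki,
        "theory": theory,  # e.g. "theory_a"
        "query": query,
        "zero_results": hits_total == 0,
    }
    print(json.dumps(event))  # stand-in for the real metrics pipeline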

This whole ticket uses a different definition of the word "parallel" than we usually use in software. Usually we think of it as a way to get additional throughput by throwing more threads/processes/machines at the problem. By that definition there isn't a ton of extra work to do: Elasticsearch is pretty parallel, and when we rebuild all the documents for a wiki we use the job queue, which is very parallel.

The definition this bug uses is that we need to be able to arbitrarily pick a different theory based on a request parameter; in other words, all the infrastructure for all theories exists in parallel. David does a good job of enumerating the different kinds of infrastructure needed to test theories below. For the most part, existing techniques already work for this.

I think it depends on the test complexity. At a glance I see four levels of complexity

1. No change in index required
When: tweaking search query parameters
If the test does not require changes on the index side, it seems relatively easy to enable simultaneous A/B testing.

  • Week 1: Test config A on frwiki and config B on dewiki
  • Week 2: Test config B on frwiki and config A on dewiki

But I don't know how we can test configs A and B simultaneously on the same wiki.

Look at how beta features or query parameters cause configuration changes. I suspect A/B testing here would be pretty similar.
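The bucketing itself can be a deterministic hash of some per-user token, much like beta features. A sketch (Python; the experiment name and the token's origin are hypothetical):

import hashlib

def bucket(session_token, experiment="zrr_theory_test", buckets=("A", "B")):
    # Hashing the token together with the experiment name keeps
    # assignment sticky per user but independent across experiments.
    digest = hashlib.sha1(("%s:%s" % (experiment, session_token)).encode()).digest()
    return buckets[digest[0] % len(buckets)]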

2. Small "companion" indexes/services

When: adding new services that are not related to the core search query (spellchecker, suggestions)
Depending on the feature we want to test, we could also build "ad-hoc" indexes that are not "real-time"; in other words, some indexes may not have to reflect the exact state of the wiki they serve. For instance, if we want to test different setups for misspelling detection, it sounds feasible to build small indexes like "enwiki_spellchecker_index_theoryA" or "enwiki_spellchecker_index_theoryB".

We've never done this before; for the most part we just stick to the other three kinds of needs. This would require the most work, simply because we don't yet have scripts for this use case.

3. Index changes

When: changing the behavior of some analyzers.
This is more challenging: data inertia is what prevents us from making quick changes to the analysis chain. Having multiple indexes is a good idea. By tweaking some maintenance scripts we can create secondary indexes with different mapping configurations. I don't think we'll need to rebuild the index from the db; a new maintenance script, mostly based on the "in place reindex", could do an "in place copy".

I'd advise against two indexes in this case - just add more subfields:

"suggest": {
    "type": "string",
    "analyzer": "suggest",
    "fields": {
        "experiment_1":   { "type": "string", "analyzer": "suggest_experiment_1" },
        "experiment_2":   { "type": "string", "analyzer": "suggest_experiment_2" }
    }
}

These aren't free, but they are way, way cheaper than a second index. No maintenance script tweaking required. This is essentially how we support both stemmed and unstemmed search right now, and prefix search vs. regular title search. This is a tried and true technique; the only extension is using it for an experiment.
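Query-side, the experiment then just swaps which field name gets searched. Roughly (Python client; the index name and query text are invented):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def suggest_search(text, experiment=None):
    # Subfields are addressed as "field.subfield", so the experiment only
    # changes which analyzed view of the same stored data we query.
    field = "suggest.%s" % experiment if experiment else "suggest"
    return es.search(index="enwiki",
                     body={"query": {"match": {field: text}}})

# Control vs. experiment:
control = suggest_search("some user query")
variant = suggest_search("some user query", experiment="experiment_1")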

4. Model changes

When: adding a new field.
This is the most difficult, and I don't know how we can parallelize such tests. The data sent to ES will be different, so it seems pretty difficult to achieve.

New fields can (and should) just be dumped into the index with all the other fields. The trouble is that you usually have to wait until they are all built before you can really use them effectively, and building takes time. As I said, about a week for enwiki.

Removing fields isn't something that you do in an experiment, so we don't need to talk about it, but it is more complex: you have to make sure they are filtered out during the in-place reindex.

Tweaking how a field is laid out or built is the most complex thing. For that I suggest creating a new, experimental field - just call it field_name_theory_1.

One thing about adding fields: we've turned off dynamic field addition in Elasticsearch, meaning that you have to manually add new fields to the mapping or run an in-place reindex. This might deserve some extra tooling, though it might just work as is, given the way updateSearchIndexConfig.php tries to do its job. It's worth experimenting on.
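Manually adding the experimental field would be a one-shot mapping update. A sketch with the Python client (field name from above; the analyzer is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# With dynamic mapping off, the field must be declared before use;
# documents indexed earlier simply won't have it populated.
es.indices.put_mapping(index="enwiki", body={
    "properties": {
        "field_name_theory_1": {"type": "text", "analyzer": "standard"},
    }
})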

One possible simplification: What if we just test different theories on different wikis?

This would depend on being able to interpret metrics in the context of their wiki-specific usage. See T103596

It's probably true that different things work better for different wikis. I think it's right to pick a subset of wikis on which we're OK running experiments and then limit them to that list. That'd speed up lots of the deploys - instead of rolling out to all wikis, we'd only have to do 10 or so. Sadly, enwiki should be one of them, and it is one of the slowest.

Deskana claimed this task.

This was an epic we made when we started thinking about how to test theories. We've now got the tasks to test those theories in the sprint, so this is essentially resolved.