Gather the output of Elastic-based suggestions and Method 1 suggestions for a collection of data, and analyze the results.
When we did this for M0, we used 2 months of enwiki data to build the model and evaluated the results on 1 month of enwiki data. Something similar would be fine this time, too.
As with M0, the analysis will include counting how often Elastic-based suggestions are made, how often Method 1 suggestions are made, and how often both are made, plus a manual review of a sample of the cases where both are made to see which does better.
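The counting and sampling steps could be sketched roughly as follows. This is only an illustration, not the script used for M0: the record layout (a tuple of query, Elastic-based suggestion, Method 1 suggestion, with `None` or an empty string meaning "no suggestion"), the function name `summarize_suggestions`, and the default sample size are all assumptions.

```python
import random
from collections import Counter


def summarize_suggestions(records, sample_size=100, seed=0):
    """Count suggestion coverage and draw a sample for manual review.

    `records` is an iterable of (query, elastic, method1) tuples
    (hypothetical layout); a falsy suggestion means none was made.
    """
    counts = Counter()
    both = []
    for query, elastic, method1 in records:
        counts["total"] += 1
        if elastic:
            counts["elastic"] += 1
        if method1:
            counts["method1"] += 1
        if elastic and method1:
            counts["both"] += 1
            both.append((query, elastic, method1))

    # Fixed seed so the review sample can be reproduced later.
    rng = random.Random(seed)
    sample = rng.sample(both, min(sample_size, len(both)))
    return counts, sample
```

The sample returned here would still need to be reviewed by hand; the code only decides which queries get reviewed.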
There is some concern that Method 1 may give lower-quality results for shorter strings. If that looks to be a problem, because of a high volume of Method 1 suggestions for shorter queries, their lower quality, or both, we may look into shorter queries more carefully.
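One way to check the volume side of that concern would be to break down Method 1 suggestion rates by query length, as in the sketch below. It assumes records are (query, Method 1 suggestion) pairs, that length is measured in characters, and that lengths above a cutoff are lumped together; the name `method1_rate_by_length` and the `max_bucket` parameter are hypothetical. Quality would still have to be judged by manual review.

```python
from collections import defaultdict


def method1_rate_by_length(records, max_bucket=10):
    """Return {length: (n_suggested, n_queries, rate)} for Method 1.

    `records` is an iterable of (query, method1) pairs (hypothetical
    layout). Queries longer than `max_bucket` characters share the
    `max_bucket` bucket.
    """
    totals = defaultdict(int)
    suggested = defaultdict(int)
    for query, method1 in records:
        bucket = min(len(query), max_bucket)
        totals[bucket] += 1
        if method1:
            suggested[bucket] += 1
    return {
        b: (suggested[b], totals[b], suggested[b] / totals[b])
        for b in sorted(totals)
    }
```

A noticeably higher rate in the short-length buckets would be a signal to pull a separate manual-review sample from those buckets.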