
Write data coverage report for Abstracts
Closed, Resolved · Public · 5 Estimated Story Points

Description

User Story: “As a PM, I want to finish the Data Coverage document where the scope, methods and outcomes of coverage are defined and agreed on for MR, so that we have a shared understanding of what has been done, a way to gather insights and make informed steps on how to build up a larger testing and confidence framework for Abstracts”

Prep work

Agree on the sample size

Acceptance criteria

The document must have the following sections:

  • A description for Abstracts of how we get our sample sets and where we got them from (what dataset was used to determine the outcomes?)
  • A description for Abstracts of what was used to test against (the ground truth)
  • An explanation for Abstracts of what FN, FP, TN, TP mean for each feature, with examples of the four quadrants, to reach team understanding and agreement that the confusion matrix works for all features
  • Run and document the agreed method on the test sample for Abstracts
  • Document the analysis of Abstracts to explain the numbers or known caveats (e.g. findings in Reference discrepancies). Translate it to the matrix
  • All pending comments made in the Data Coverage document until Feb 28th must be marked as resolved
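To make the four quadrants concrete, here is a minimal sketch of how abstract coverage could be tallied into a confusion matrix. The field names and sample data are hypothetical, not from the actual report: "ground truth" is whether the article should have an abstract, and "prediction" is whether WME returned one.

```python
from collections import Counter

def confusion_counts(pairs):
    """Tally (ground_truth_has_abstract, wme_returned_abstract) pairs
    into TP/FP/TN/FN confusion-matrix counts."""
    labels = {
        (True, True): "TP",    # abstract exists and WME returns it
        (False, True): "FP",   # WME returns an abstract where none is expected
        (False, False): "TN",  # no abstract expected, none returned
        (True, False): "FN",   # abstract exists but WME misses it
    }
    return Counter(labels[p] for p in pairs)

# Hypothetical sample: 3 articles with abstracts (2 covered, 1 missed),
# 2 without abstracts (1 correctly empty, 1 spurious result)
sample = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(confusion_counts(sample))  # Counter({'TP': 2, 'FN': 1, 'TN': 1, 'FP': 1})
```

The same tally works for any feature once the team agrees on what "ground truth" and "returned" mean for that feature, which is exactly the agreement this acceptance criterion asks for.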
Things to consider

Please refer to the parent ticket, "things to consider section"

Event Timeline

JArguello-WMF updated the task description.
JArguello-WMF set the point value for this task to 5.
JArguello-WMF moved this task from Incoming to Sprint 56 on the Wikimedia Enterprise board.

For Abstract coverage metrics, we're comparing the WMF Summary API with the WME GetAbstract() call.

Two questions I'd like to get answers for:

  1. Is GetAbstract() a good "ground truth" for abstracts? Since Abstract does not appear in the Web Browser page, it's an "abstract/non-tangible thing". Choosing a good ground truth for our ideal state is not obvious for Abstracts. Especially because getAbstract() was a Go rewrite/port of the Summary API code, it feels like an artificial comparison.
  2. In the previous sprint, Ricardo compared the Summary API output to the Parser getAbstract() results; if they were identical he marked them as a match, and any difference, however small, made them a non-match. However, I'm seeing examples where the output of Summary and Abstract are nearly identical; see the 3 non-matching examples below (there are many more of these close mismatches). Should we change the comparison to use a threshold for character difference rather than an exact match of the two outputs?
https://en.wikipedia.org/wiki/New_York_State_Route_177
[2 character difference]
"  New York State Route 177 (NY 177) is an east–west state highway in the North Country of New York in the United States. It extends from Interstate 81 (I-81) exit 42 in the Jefferson County town of Adams to NY 12 west of the Lewis County village of Lowville. NY 177 intersects U.S. Route 11 (US 11) in Adams Center and meets Lewis County's County Route 21 (CR 21), formerly part of NY 194, at Barnes Corners. NY 177 originally began at US 11 when it was assigned in 1930. It was extended west to its present terminus in the 1950s following the construction of I-81.",

"New York State Route 177 (NY 177) is an east–west state highway in the North Country of New York in the United States. It extends from Interstate 81 (I-81) exit 42 in the Jefferson County town of Adams to NY 12 west of the Lewis County village of Lowville. NY 177 intersects U.S. Route 11 (US 11) in Adams Center and meets Lewis County's County Route 21 (CR 21), formerly part of NY 194, at Barnes Corners. NY 177 originally began at US 11 when it was assigned in 1930. It was extended west to its present terminus in the 1950s following the construction of I-81"


https://en.wikipedia.org/wiki/Verrophone
[4 character difference]
"  A verrophone (""glass-euphonium"") is a musical instrument, invented in 1983 by Sascha Reckert, which, ""uses tuned glass tubes,"" open at one end and arranged in various sizes. The sound is made by rubbing one end of one or more of the glass tubes, or also by striking them or rubbing them with a special mallet. The tubes are close together so that chords can be played by rubbing more than one at the same time. The instrument carries more acoustical volume than the glass harmonica and some other glass instruments and generally has a range from G3 to F6. Every piece composed originally for glass harmonica can be played on the verrophon",

"A verrophone (""glass-euphonium"") is a musical instrument, invented in 1983 by Sascha Reckert, which, ""uses tuned glass tubes,"" open at one end and arranged in various sizes. The sound is made by rubbing one end of one or more of the glass tubes, or also by striking them or rubbing them with a special mallet. The tubes are close together so that chords can be played by rubbing more than one at the same time. The instrument carries more acoustical volume than the glass harmonica and some other glass instruments and generally has a range from G3 to F6. Every piece composed originally for glass harmonica can be played on the verrophone."


https://en.wikipedia.org/wiki/List_of_pipeline_accidents_in_the_United_States_in_1998
[8 character difference]
" The following is a list of pipeline accidents in the United States in 1998. It is one of several lists of U.S. pipeline accidents. See also list of natural gas and oil production accidents in the United State",

" The following is a list of pipeline accidents in the United States in 1998. It is one of several lists of U.S. pipeline accidents. See also list of natural gas and oil production accidents in the United States."
ROdonnell-WMF renamed this task from Abstracts to Write data coverage report for Abstracts.Mar 5 2024, 1:42 PM

@HShaikh The team aligned on this point and made a decision. For reference, please see the MR decision log.

Unblocked. We'll use a "fuzzy match" to compare the Summary API and WME getAbstract() call output, with a threshold of 5 characters: a difference of 5 characters or fewer is considered a match, and more than 5 is a non-match. This fuzzy character matching adjusts our metrics testing so the final statistics better represent the real-world matches for the summary/abstract text.
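As a sketch of that decision (function names are illustrative, not from the WME codebase), the fuzzy match can be implemented as a Levenshtein edit distance with a cutoff of 5:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def is_fuzzy_match(summary: str, abstract: str, threshold: int = 5) -> bool:
    """Treat the two texts as a match if they differ by at most `threshold` characters."""
    return levenshtein(summary, abstract) <= threshold

# Like the NY 177 example above, these differ only by leading whitespace and a
# trailing period (3 edits), so they now count as a match instead of a non-match.
print(is_fuzzy_match("  New York State Route 177.", "New York State Route 177"))  # True
```

Under this rule the three examples above (2, 4, and 8 character differences) would flip to matches except the last, so the exact cutoff is worth revisiting against the distribution of observed differences.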