This ticket pertains to "Step 2 - Analysis" in this doc. The steps are reiterated below.
- In the "Counts" tab of this spreadsheet, calculate how many requests we'd make to the model for each subset of articles (each row).
- For one large Wikipedia (EN), and one smaller Wikipedia (CS), query the content corpus to calculate how many articles and how many paragraphs we would be passing into the model.
- For now, we’ve chosen English and Czech because we believe both languages are supported by the model (we are still evaluating the model’s performance on Czech content). Of the supported languages, English represents the biggest wiki and Czech the smallest.
- In the "Counts" tab of this spreadsheet, calculate how many structured tasks we'd generate for each subset of articles (each row).
- For each Wikipedia, take a random sample of 50 articles from each article type.
- Parse the sample articles into plain-text paragraphs and send the paragraphs to the model.
- Calculate the number of positive predictions with a probability score >= 0.8.
- Use this number to generate an estimate of the total number of high-probability positive predictions we’d expect to see if we applied the model to all articles within that article type.
- For each Wikipedia and each article type (e.g. "EN - Articles with relevant page templates" or "CS - Articles about people"), make a tab in this spreadsheet containing the sample article paragraphs that receive a positive prediction with a probability score of 0.8 or higher. Include the following metadata about each paragraph:
- [If possible] Number of pageviews to the article (all-time)
- Number of pageviews to the article (last 90 days)
- Number of edits made to the article (all-time)
- Number of edits made to the article (last 90 days)
- Section title that the paragraph is in
- Age of the article (# days)
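The extrapolation step above (sample 50 articles, count high-probability positives, scale up to the full article type) amounts to simple proportional scaling. A minimal sketch, where the function name and the example numbers are illustrative rather than anything from our codebase:

```python
def estimate_high_prob_positives(
    sample_positive_count: int,
    sample_article_count: int,
    total_article_count: int,
) -> int:
    """Scale the number of positive predictions (probability score >= 0.8)
    observed in the sampled articles up to the full article type."""
    if sample_article_count == 0:
        return 0
    rate = sample_positive_count / sample_article_count
    return round(rate * total_article_count)

# e.g. 12 high-probability positives across a 50-article sample,
# extrapolated to an article type containing 40,000 articles:
print(estimate_high_prob_positives(12, 50, 40_000))  # -> 9600
```

Note this scales by article counts; if paragraph counts per article vary a lot between types, scaling by total paragraphs instead may give a tighter estimate.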
Feel free to replace or edit any of the tabs in the spreadsheet linked above. Please keep in mind that this list of articles is strictly for analysis, and is not meant to serve as the final list we use for the structured task.
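For the last-90-days pageview metadata, the Wikimedia Pageviews REST API (per-article endpoint) should cover it; all-time pageviews are harder since that API only goes back to mid-2015, hence the "[If possible]" above. A sketch of building the request URL, assuming daily granularity, all access methods, and user (non-bot) traffic; real use would add error handling and rate limiting:

```python
from datetime import date, timedelta
from urllib.parse import quote

def pageviews_url(project: str, title: str, days: int = 90) -> str:
    """Build a Wikimedia Pageviews REST API per-article URL covering
    the last `days` days of daily, all-access, user-agent pageviews."""
    end = date.today()
    start = end - timedelta(days=days)
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/user/{quote(title, safe='')}/daily/"
        f"{start:%Y%m%d}/{end:%Y%m%d}"
    )

# Example for the Czech Wikipedia:
# pageviews_url("cs.wikipedia", "Praha")
```

Edit counts and article age are not served by this API; those would come from the revision history (e.g. the MediaWiki Action API's revisions module, or the replica databases), whichever is most convenient for the analysis.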
