| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Aklapper | T369331 [epic] WIT initial metrics research |
| Resolved | | • WMDECyn | T369335 WIT metrics research: notebooks for metrics on WD usage from page content |
Event Timeline
@AndrewTavis_WMDE I got at least part of the Spark query working--the part without the regexes.
For extracting properties, I realized that where possible, we should also extract properties referenced by label, but not extract variable names used instead of literals. So here are 4 regexes to use in the property extraction from Wikitext and Lua:
Extract literal property codes and labels accessed via parser functions in Wikitext
Regex: \{\{#property:([^|}]+)
Sample text: {{#property:date}} {{#property:P23}}
Text not to match: {{#property:{{1}}}}
Extract literal property codes in template or module parameter values in Wikitext
Regex: \|([Pp]\d+)
Sample text: {{Wikidata|P23}}
Extract literal strings of property codes in Lua
Regex: ['"]([Pp]\d+)['"]
Sample text: "P23", 'P23'
Extract properties referenced by literal string for label in Lua
Regex: resolvePropertyId\(\s*['"]([^'"]+)['"]
Sample text: resolvePropertyId( 'date' )
Text not to match: resolvePropertyId(somevariable)
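For reference, here is a minimal Python sketch of how the four patterns could be exercised against the sample texts. The function names `extract_wikitext_properties` and `extract_lua_properties` are hypothetical helpers, not part of the notebooks. One adjustment is an assumption of this sketch: the parser-function pattern also excludes `{` from its character class, because the pattern as quoted would capture `{{1` from `{{#property:{{1}}}}`.

```python
import re

# Parser-function references such as {{#property:P23}} or {{#property:date}}.
# Assumption of this sketch: "{" is added to the excluded characters so that
# nested arguments like {{#property:{{1}}}} produce no match.
PARSER_FUNCTION = re.compile(r"\{\{#property:([^|{}]+)")

# Literal property codes passed as template or module parameter values.
TEMPLATE_PARAM = re.compile(r"\|([Pp]\d+)")

# Literal property codes written as quoted strings in Lua.
LUA_LITERAL = re.compile(r"""['"]([Pp]\d+)['"]""")

# Properties referenced by a literal label string in Lua.
LUA_LABEL = re.compile(r"""resolvePropertyId\(\s*['"]([^'"]+)['"]""")


def extract_wikitext_properties(text):
    """Return property codes or labels referenced literally in Wikitext."""
    found = PARSER_FUNCTION.findall(text) + TEMPLATE_PARAM.findall(text)
    return [p.strip() for p in found]


def extract_lua_properties(source):
    """Return property codes or labels referenced literally in Lua source."""
    found = LUA_LITERAL.findall(source) + LUA_LABEL.findall(source)
    return [p.strip() for p in found]


if __name__ == "__main__":
    print(extract_wikitext_properties("{{#property:date}} {{#property:P23}}"))
    print(extract_wikitext_properties("{{Wikidata|P23}}"))
    print(extract_lua_properties("resolvePropertyId( 'date' )"))
```

Variable references such as `resolvePropertyId(somevariable)` yield an empty list, since only quoted literals are captured.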
Thanks again for the feedback!
wmde/analytics:MR#4 is up for this and will be reviewed in the coming days, so this is moved into our backlog :)
Moved to in review for Analytics as the full checks of the notebooks were completed on Monday. Minor changes are pending, but just to the output file :)
As discussed in the meeting with @AndyRussG and @Ifeatu_Nnaobi_WMDE:
- The code has been formatted and prepared for future work
- All code does run and the outputs are viable
- There are many metrics involved in these processes that might not be operationalized (put into an Airflow data pipeline)
- There are also many intermediary processes that are run to be able to create the various metrics
- Once a decision on the important metrics is made, I'll be able to go through the code and find the code for the desired metrics and all intermediary processes that are needed
- More in-depth checks for these processes will be made then, as I don't have the capacity for a full review at this time
- When we do begin to operationalize the code, I'll be able to take the subset of the code we need to WMF Data Engineering for further checks and help with optimizing it (if needed, as they have been involved in @AndyRussG's work till now as well)
Thanks all so much!
Hey Cynthia, the task is complete. Please review on your end and close it, or send it back to me if changes are needed.