track number of Lexemes, Forms and Senses
Closed, Resolved · Public

Description

When rolling out the first support for lexicographical data, we need to track how many Lexemes (equivalent to Items), Forms (sub-parts of a Lexeme, but to be treated like Items and Properties for the graph) and Senses are being created over time. These counts should be added to the "entities by type" graph here: https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel?refresh=30m&orgId=1

Event Timeline

I just looked and the Form data seems to be the same as the Property data and the Lexeme data seems to be the same as the Item data. Could you double check?

@Lydia_Pintscher The Wikidata namespaces documentation at https://www.wikidata.org/wiki/Help:Namespaces does not mention the Lexeme or Form namespaces.

The current Graphite query for Items and Properties is:

aliasSub(
    aliasSub(
        aliasByNode(
            daily.wikidata.site_stats.pages_by_namespace.{0,120}.nonredirects,
            4),
        '120', 'Properties'),
    '0', 'Items')

where 120 = Properties and 0 = Items. To extend it to Lexemes and Forms, I need to know the respective namespace IDs (one for Lexemes and one for Forms). Please advise.

https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces says Lexeme is 146. Forms are not in a separate namespace but part of the Lexeme page.
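For reference, a minimal Python sketch of how a namespace ID could be looked up from that siteinfo response. The sample JSON below is a hand-written, heavily trimmed stand-in for what the API is assumed to return with `format=json`; the helper names `siteinfo_url` and `namespace_id` are hypothetical.

```python
import json
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def siteinfo_url():
    """Build the siteinfo URL quoted above, asking for JSON output."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def namespace_id(siteinfo, canonical_name):
    """Return the numeric ID of the namespace with this canonical name, or None."""
    for ns in siteinfo["query"]["namespaces"].values():
        if ns.get("canonical") == canonical_name:
            return ns["id"]
    return None


# Hand-written sample (an assumption about the response shape, not real output).
sample = json.loads(
    '{"query": {"namespaces": {'
    '"0": {"id": 0, "*": ""},'
    '"120": {"id": 120, "canonical": "Property", "*": "Property"},'
    '"146": {"id": 146, "canonical": "Lexeme", "*": "Lexeme"}'
    '}}}'
)

lexeme_ns = namespace_id(sample, "Lexeme")
form_ns = namespace_id(sample, "Form")  # Forms have no namespace of their own
print(siteinfo_url())
print(lexeme_ns, form_ns)
```

With a real response fetched from `siteinfo_url()`, the same lookup would be expected to find Lexeme and nothing for Form, matching the statement above.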

@Lydia_Pintscher I modified the existing Graphite query for Items and Properties (see T191424#4236348) to incorporate Lexemes in the following way:

aliasSub(
   aliasSub(
      aliasSub(
         aliasByNode(daily.wikidata.site_stats.pages_by_namespace.{0,120,146}.nonredirects, 4),
      '120', 'Properties'),
   '0', 'Items'),
'146', 'Lexemes')

This still retrieves the data on Items and Properties, but returns no data points for Lexemes (namespace 146, as suggested).

The logic should be clear: if 146 is just another namespace, and we know from Graphite documentation that aliasSub does nothing else but regex gsub, then nesting another aliasSub call around the existing two simply must work. But it doesn't.
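A small Python sketch of that reasoning, simulating the two Graphite functions on plain strings. `alias_by_node` and `alias_sub` here are stand-ins written for illustration, not Graphite's own code; they assume aliasByNode picks a dot-separated node by zero-based index and aliasSub applies a regex substitution to the series name.

```python
import re


def alias_by_node(series_name, node):
    """Stand-in for aliasByNode: keep one dot-separated node of the series path."""
    return series_name.split(".")[node]


def alias_sub(name, search, replace):
    """Stand-in for aliasSub: regex substitution on the series name."""
    return re.sub(search, replace, name)


def label(series_name):
    """Apply the nested calls from the query, innermost first."""
    name = alias_by_node(series_name, 4)
    name = alias_sub(name, "120", "Properties")
    name = alias_sub(name, "0", "Items")
    name = alias_sub(name, "146", "Lexemes")
    return name


prefix = "daily.wikidata.site_stats.pages_by_namespace"
labels = {ns: label(f"{prefix}.{ns}.nonredirects") for ns in ("0", "120", "146")}
print(labels)

# Order matters: replacing '0' before '120' would also hit the '0' inside '120'.
wrong_order = alias_sub(alias_sub("120", "0", "Items"), "120", "Properties")
print(wrong_order)
```

Under these assumptions the relabelling itself behaves as expected for 146, so the nesting is unlikely to be the cause. Aliasing only renames series that already exist, which suggests that no series for namespace 146 was being stored at that point.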

@Lydia_Pintscher Who wrote the initial Graphite query? His or her usage of aliasByNode is not completely clear to me, and maybe the problem is there.

Also, it is unclear to me at this point how to fetch the Forms data from Graphite if these data are part of the content of the respective Lexeme page (and they manifestly are). Is there a separate Graphite storage for these data? Because if they are really stored only as content of the Lexeme pages, well, it sounds crazy, but web-scraping will be the only way to go :)

The following is inessential for the task at hand, but I also can't find the data that this dashboard uses on https://graphite.wikimedia.org/ - @Addshore You once mentioned another Graphite storage to me; could you please remind me where these things go and where the respective series can be browsed? Thanks.

@Lydia_Pintscher @Addshore https://gerrit.wikimedia.org/r/#/c/analytics/wmde/scripts/+/440901/

Q1. Where do you make these SQL calls from? It would help me to understand the workflow.
Q2. Where from (and when) do you send the data obtained from SQL to Graphite?
Q3. Would you consider switching this and similar event logging things to Big Data (HiveQL from the Data Lake)? Because sooner or later...
Q4. How will we track the forms data (see T191424#4236354)?

If this is schematic, in the sense that I always need to change existing SQL code or write new SQL code in analytics/wmde/scripts, I can do these things from now on. It would be enough if you let me know exactly what you need done on each new occasion.

@Lydia_Pintscher @Addshore https://gerrit.wikimedia.org/r/#/c/analytics/wmde/scripts/+/440901/

Q1. Where do you make these SQL calls from? It would help me to understand the workflow.

The scripts are puppetized and run on stat1005 currently as the user "analytics-wmde"

Q2. Where from (and when) do you send the data obtained from SQL to Graphite?

The SQL @ https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/site_stats/sql/select_pages_by_namespace.sql#L6-L8 is used in https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/site_stats/pages_by_namespace.php which makes the query and sends the data to graphite.
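As a rough illustration of the "query, then send to Graphite" step, here is a Python sketch that formats per-namespace counts as Carbon plaintext lines (`<metric> <value> <timestamp>`). This is not the PHP script linked above, which may use a different transport; the counts are placeholder values, and `graphite_line` and `send_to_graphite` are hypothetical helpers.

```python
import socket
import time

METRIC = "daily.wikidata.site_stats.pages_by_namespace.{ns}.nonredirects"


def graphite_line(metric, value, timestamp=None):
    """Format one data point in Carbon's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{metric} {value} {ts}\n"


def send_to_graphite(lines, host, port=2003):
    """Send formatted lines to a Carbon plaintext listener (2003 is its default port)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Placeholder counts keyed by namespace ID; real values would come from the SQL query.
counts = {0: 100, 120: 10, 146: 1}
timestamp = 1528675200  # fixed timestamp so the output is reproducible

lines = [
    graphite_line(METRIC.format(ns=ns), count, timestamp)
    for ns, count in counts.items()
]
print("".join(lines), end="")
# send_to_graphite(lines, "graphite.example.org")  # hypothetical host, not called here
```

The point of the sketch is that a series such as `...pages_by_namespace.146.nonredirects` only exists in Graphite once something sends data points under that name.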

Q3. Would you consider switching this and similar event logging things to Big Data (HiveQL from the Data Lake)? Because sooner or later...

Yes, this could easily be moved to the data lake and the mediawiki_history table now, although it might actually be more efficient to do it in SQL? (I haven't looked into this.)
This isn't really an "event logging" thing.

Q4. How will we track the forms data (see T191424#4236354)?

This query cannot be modified to track Forms, as Forms are not a top-level entity and have no namespace.

If this is schematic, in the sense that I always need to change existing SQL code or write new SQL code in analytics/wmde/scripts, I can do these things from now on. It would be enough if you let me know exactly what you need done on each new occasion.

It will be easier if, on occasions such as this, you ask questions and we answer them in the ticket, rather than us always explaining every step up front. Otherwise we risk spending time explaining things that are already known.

Change 440901 had a related patch set uploaded (by Addshore; owner: GoranSMilovanovic):
[analytics/wmde/scripts@master] Add Lexeme namespace, 146, to pages by namespace script

https://gerrit.wikimedia.org/r/440901

Change 441022 had a related patch set uploaded (by Addshore; owner: GoranSMilovanovic):
[analytics/wmde/scripts@production] Add Lexeme namespace, 146, to pages by namespace script

https://gerrit.wikimedia.org/r/441022

Change 441022 merged by Addshore:
[analytics/wmde/scripts@production] Add Lexeme namespace, 146, to pages by namespace script

https://gerrit.wikimedia.org/r/441022

Change 440901 merged by Addshore:
[analytics/wmde/scripts@master] Add Lexeme namespace, 146, to pages by namespace script

https://gerrit.wikimedia.org/r/440901

@Lydia_Pintscher @Addshore

  • Lexemes moved to the right Y-axis with Properties.
  • We need to understand how to track forms. Will forms have a namespace?

Forms won't have a namespace.
The best way to count them right now would be as part of the dump scanner that runs weekly.

Maybe page properties should include counts of lemmas, forms and senses, as they do for statements and identifiers.

@GoranSMilovanovic Thanks! And yay :)

I think @Esc3300's suggestion is good. We will want to show the number of Forms and later Senses on Special:Search for example and page properties is what we're using for the number of statements and sitelinks there iirc. @Addshore What do you think?

I think @Esc3300's suggestion is good. We will want to show the number of Forms and later Senses on Special:Search for example and page properties is what we're using for the number of statements and sitelinks there iirc. @Addshore What do you think?

We could put this on page info, although we don't really do it for anything else yet (such as statements).

Maybe page properties should include counts of lemmas, forms and senses, as they do for statements and identifiers.

Wait a minute. Do we mean page props the table, or on page info?
I think I commented this morning before my coffee had kicked in.

Page properties for the number of Senses and Forms of a Lexeme exist now. That hopefully makes it possible to add them to the existing graph in Grafana.

Lydia_Pintscher renamed this task from track number of Lexemes and Forms to track number of Lexemes, Forms and Senses. Jan 4 2019, 10:23 AM
Lydia_Pintscher updated the task description.
Lydia_Pintscher moved this task from incoming to ready to go on the Wikidata board.

FYI: this query, which is currently pretty fast, gives the total number of Forms and Senses, but it won't stay fast in the future:

MariaDB [wikidatawiki_p]> select pp_propname, SUM(pp_value) as total_number from page_props where pp_propname in ('wbl-senses', 'wbl-forms') group by pp_propname;
+-------------+--------------+
| pp_propname | total_number |
+-------------+--------------+
| wbl-forms   |        26390 |
| wbl-senses  |         3620 |
+-------------+--------------+
2 rows in set (0.12 sec)
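To make the aggregation concrete, here is a toy Python sketch that runs the same query against an in-memory SQLite table. The rows are invented for illustration, and the table is reduced to three columns with an integer `pp_value`, which is an assumption and not the real page_props schema.

```python
import sqlite3

QUERY = """
SELECT pp_propname, SUM(pp_value) AS total_number
FROM page_props
WHERE pp_propname IN ('wbl-senses', 'wbl-forms')
GROUP BY pp_propname
"""

conn = sqlite3.connect(":memory:")
# Simplified stand-in for page_props: one row per (page, property) pair.
conn.execute(
    "CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value INTEGER)"
)
conn.executemany(
    "INSERT INTO page_props VALUES (?, ?, ?)",
    [
        (1, "wbl-forms", 3),   # Lexeme page 1 has 3 Forms
        (1, "wbl-senses", 1),
        (2, "wbl-forms", 2),   # Lexeme page 2 has 2 Forms
        (2, "wbl-senses", 0),
        (2, "wb-claims", 5),   # unrelated property, filtered out by the WHERE clause
    ],
)

totals = dict(conn.execute(QUERY).fetchall())
print(totals)
conn.close()
```

Each Lexeme page contributes one row per property, so the sum over all rows gives the site-wide total; the query has to read every such row, which is why it is expected to slow down as the number of Lexemes grows.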

@Ladsgroup We could totally add that to the bunch of daily scripts that we now run against the SQL replicas on the analytics servers :)
But yes, one day it is going to get slow, but it's probably fine for now?

But yes, one day it is going to get slow, but it's probably fine for now?

It's definitely fine for me, it's such a low-hanging fruit.

Change 487128 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[analytics/wmde/scripts@master] Add SQL to push number of forms and senses to graphite

https://gerrit.wikimedia.org/r/487128

Change 487357 had a related patch set uploaded (by Addshore; owner: Ladsgroup):
[analytics/wmde/scripts@production] Add SQL to push number of forms and senses to graphite

https://gerrit.wikimedia.org/r/487357

Change 487128 merged by jenkins-bot:
[analytics/wmde/scripts@master] Add SQL to push number of forms and senses to graphite

https://gerrit.wikimedia.org/r/487128

Change 487357 merged by jenkins-bot:
[analytics/wmde/scripts@production] Add SQL to push number of forms and senses to graphite

https://gerrit.wikimedia.org/r/487357

Yeah, this data will be pushed to Graphite once a day, and then we can use it.

Since we had the first push last night, I added it to the graph. Does it look good to you?

Looks good to me, but I'll leave this for @Lydia_Pintscher to close!
