Christmas 2022 wbs_propertypairs table update on Wikidata
Closed, Resolved | Public | 1 Estimated Story Points

Description

Problem:
The data in Wikidata has changed significantly, and we have not updated the Property Suggester data in quite a while. This leads to outdated suggestions for new Properties to add to existing Items. We should update it based on the current statements in Wikidata.

Acceptance criteria:

  • Property Suggester data has been updated

Original report:
It looks like the wbs_propertypairs table in the Wikidata database hasn't been updated all year. I use the table to update https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Most_used_painting_properties every once in a while. Can you please run whatever script it was again to update the table? I recall Marius doing this in the past.

Event Timeline

Might I suggest that, if we're going to re-run it, we add occupation (P106) to the list of classifying properties? It would be useful if it suggested sport for sportspeople, position held for politicians, etc.

Lydia_Pintscher updated the task description.

2023 sprint 1
a/c

  • convert into a chore (meaning team-wikidata has the information available on how to do this independently)
  • update the documentation

Notes

I've just updated PropertySuggester_update with the latest instructions.

None of this is well polished, but given that we will replace this (soon?), investing further effort is probably not worth it.

Task breakdown notes: We need to follow the steps under PropertySuggester update § Each update (the latest data is from 20230102). Most of the steps need production shell access.

wbs_propertypairs updated (commit):

luwe@C380 ~/git/wbs_propertypairs (master|u=) $ mkdir 20230102 && time curl -sL https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/20230102/analyzed-out.gz | gzip -d | python ../wbs_propertypairs-refine/refine.py /dev/stdin /dev/stdout | gzip > 20230102/wbs_propertypairs.csv.gz

real    1m39,099s
user    0m15,550s
sys     0m0,731s
luwe@C380 ~/git/wbs_propertypairs (master|u=) $ git add 20230102/wbs_propertypairs.csv.gz && git commit -m 'Add propertypairs from the 20230102 dump' -m 'Bug: T325942'
[master 13b4e8e572] Add propertypairs from the 20230102 dump
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 20230102/wbs_propertypairs.csv.gz

Pulled to mwmaint:

lucaswerkmeister-wmde@mwmaint1002:~$ time https_proxy=http://webproxy.eqiad.wmnet:8080 curl -sL https://github.com/wmde/wbs_propertypairs/raw/master/20230102/wbs_propertypairs.csv.gz | gzip -d > wbs_propertypairs.csv

real	0m1.284s
user	0m0.850s
sys	0m0.107s
lucaswerkmeister-wmde@mwmaint1002:~$ wc -l wbs_propertypairs.csv
3813497 wbs_propertypairs.csv

Edit: For reference, the table currently has some 3.3M rows, so 3.8M rows in the new file seems like reasonable growth for a one-year time period IMHO.

lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql wikidatawiki
MariaDB [wikidatawiki]> SELECT COUNT(*) FROM wbs_propertypairs\G
*************************** 1. row ***************************
COUNT(*): 3300899
1 row in set (0.787 sec)
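
As a rough check of the "reasonable growth" estimate, the relative growth can be computed from the two counts quoted above. This is only a sketch: the function name is made up for illustration, and it compares the `wc -l` line count of the new file directly against the table's row count, which is close enough for an estimate.

```python
# Counts taken from the outputs above.
OLD_ROWS = 3_300_899   # SELECT COUNT(*) on the current wbs_propertypairs table
NEW_LINES = 3_813_497  # wc -l on the new wbs_propertypairs.csv


def relative_growth(old: int, new: int) -> float:
    """Return the growth from old to new as a fraction of old."""
    return (new - old) / old


growth = relative_growth(OLD_ROWS, NEW_LINES)
print(f"{growth:.1%}")  # prints "15.5%"
```

So the new file is roughly 15% larger than the current table.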

Mentioned in SAL (#wikimedia-operations) [2023-01-25T11:34:45Z] <Lucas_WMDE> Updated the Wikidata property suggester with data from 20230102's JSON dump (T325942)

lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv # T325942

*******************************************************************************
NOTE: Do not run maintenance scripts directly, use maintenance/run.php instead!
      Running scripts directly has been deprecated in MediaWiki 1.40.
      It may not work for some (or any) scripts in the future.
*******************************************************************************

Removing old entries
Deleting a batch
Deleting a batch

Deleting a batch
loading new entries from file
10000 rows inserted
20000 rows inserted

3800000 rows inserted
3810000 rows inserted
3813496 rows inserted
... Done loading

real	7m55.581s
user	0m40.128s
sys	0m2.425s
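
Note that the script reports 3813496 inserted rows, while `wc -l` counted 3813497 lines in the file. One plausible explanation, which is an assumption and not confirmed in this task, is that the CSV starts with a single header line. A minimal sketch of such a check follows; the helper name and the sample column names are hypothetical.

```python
import csv
import io


def count_data_rows(lines, has_header=True):
    """Count CSV records, optionally excluding one header row.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    total = sum(1 for _ in csv.reader(lines))
    if has_header and total > 0:
        return total - 1
    return total


# Tiny in-memory sample; the header names here are illustrative only.
sample = io.StringIO("pid1,pid2,count\n31,106,5\n31,21,7\n")
print(count_data_rows(sample))  # prints 2

# The counts from this task are consistent with exactly one header line:
WC_LINES = 3_813_497       # from wc -l
ROWS_INSERTED = 3_813_496  # from the UpdateTable.php output
print(WC_LINES - 1 == ROWS_INSERTED)  # prints True
```

To run this against the real file, one could pass `open("wbs_propertypairs.csv", newline="")` instead of the in-memory sample.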
Arian_Bozorg subscribed.

Excellent! Thank you Lucas :)

Hi @Multichill, just wanted to see what you were referring to in the link?