Page MenuHomePhabricator

Make Dispenser's principle_links table accessible in new Wiki replica cluster
Open, Stalled, MediumPublic

Description

Every month I download 280,000 disambiguation pages via the API into a 250 MB gz file. That is then transformed by a custom script to pull the first link from every list entry and loaded into a principle_links table on u2815__p. This significantly improves database query results involving disambiguation pages.

MariaDB [enwiki_p]> DESCRIBE u2815__p.principle_links;
+-----------+------------------+------+-----+---------+-------+
| Field     | Type             | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| pr_from   | int(10) unsigned | NO   | PRI | NULL    |       |
| pr_title  | varbinary(255)   | NO   | PRI | NULL    |       |
| pr_latest | int(10) unsigned | NO   |     | NULL    |       |
+-----------+------------------+------+-----+---------+-------+

Event Timeline

I guess that should go to the tools servers

@Marostegui It is needed in the middle of JOINs. Similar to the pagelinks table.

@Dispenser question- do you use the table on long-running queries (like gathering analytics) or on queries as part of a webrequest (quick ones), or both?

@jcrespo The current pipeline is:

  1. A bash/python script on ToolForge which makes 275,000 MW API requests and bundles JSON responses in a .tar.gz on the last day of the month
  2. A python script on my server downloads that on the first of the month. Processes it which spits out a Tab-Separated Value then loads it into the database.
  3. (edit) The death_dabs.sql script then for 30 minutes creating a report for users to work off of. (Currently long running query)

I think this could be added to the analytics databases without problem- but it may take some time due to other ongoing issues, plus the work needed to productionize this service, I ask you to be patient.

@Dispenser Can you point to the source code of what you use for generating such tables - I am mostly interested on the storage and data manipulation logic.

jcrespo changed the task status from Open to Stalled.Jun 15 2018, 3:13 PM
jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.

Still waiting on getting the code to be able to deploy those to the replicas...