
Make Dispenser's principle_links table accessible in new Wiki replica cluster
Open, Stalled, Medium, Public

Description

Every month I download 280,000 disambiguation pages via the API into a 250 MB .gz file. A custom script then pulls the first link from every list entry and loads the result into the principle_links table on u2815__p. Having this table available significantly improves database queries involving disambiguation pages.

MariaDB [enwiki_p]> DESCRIBE u2815__p.principle_links;
+-----------+------------------+------+-----+---------+-------+
| Field     | Type             | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| pr_from   | int(10) unsigned | NO   | PRI | NULL    |       |
| pr_title  | varbinary(255)   | NO   | PRI | NULL    |       |
| pr_latest | int(10) unsigned | NO   |     | NULL    |       |
+-----------+------------------+------+-----+---------+-------+
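A minimal sketch of how such a table could be used in the middle of a JOIN on the replicas. The column semantics (pr_from as the disambiguation page's page_id, pr_title as the first linked title of each list entry) and the example query are assumptions for illustration, not Dispenser's actual report query:

-- Hedged sketch (not the actual tool query). Assumes pr_from is the page_id
-- of a disambiguation page and pr_title is the title of the first link in
-- each of its list entries.
-- For every article linking to the disambiguation page "Mercury", list the
-- candidate targets extracted from that page's entries.
SELECT linker.page_title AS linking_article,
       pr.pr_title       AS candidate_target
FROM page AS dab
JOIN pagelinks AS pl
  ON pl.pl_namespace = dab.page_namespace
 AND pl.pl_title     = dab.page_title
JOIN page AS linker
  ON linker.page_id = pl.pl_from
JOIN u2815__p.principle_links AS pr
  ON pr.pr_from = dab.page_id
WHERE dab.page_namespace = 0
  AND dab.page_title = 'Mercury';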

Event Timeline

I guess that should go to the tools servers

@Marostegui It is needed in the middle of JOINs. Similar to the pagelinks table.

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board. Nov 28 2017, 4:51 AM

@Dispenser question: do you use the table in long-running queries (like gathering analytics), in queries made as part of a web request (quick ones), or both?

1978Gage2001 moved this task from Triage to In progress on the DBA board. Dec 11 2017, 9:45 AM
Marostegui moved this task from In progress to Triage on the DBA board. Dec 11 2017, 11:05 AM
Dispenser added a comment. Edited Dec 12 2017, 7:39 PM

@jcrespo The current pipeline is:

  1. A bash/Python script on Toolforge makes 275,000 MediaWiki API requests on the last day of the month and bundles the JSON responses into a .tar.gz.
  2. A Python script on my server downloads that archive on the first of the month, processes it into a tab-separated-value file, and loads it into the database (see the sketch after this list).
  3. (edit) The death_dabs.sql script then runs for about 30 minutes, creating a report for users to work off of. (Currently a long-running query.)
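A minimal sketch of what the storage and load step (2) might look like. The table definition follows the DESCRIBE output above, but the file name, load statement, and column comments are assumptions, not Dispenser's actual code:

-- Hedged sketch of the monthly load, not the actual tool code.
-- Table definition matches the DESCRIBE output above.
CREATE TABLE IF NOT EXISTS u2815__p.principle_links (
  pr_from   int(10) unsigned NOT NULL,  -- page_id of the disambiguation page (assumed)
  pr_title  varbinary(255)   NOT NULL,  -- first linked title of a list entry (assumed)
  pr_latest int(10) unsigned NOT NULL,  -- revision id at extraction time (assumed)
  PRIMARY KEY (pr_from, pr_title)
);

-- Refresh the table from the monthly TSV dump; the file name is hypothetical.
TRUNCATE TABLE u2815__p.principle_links;
LOAD DATA LOCAL INFILE 'principle_links-latest.tsv'
INTO TABLE u2815__p.principle_links
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(pr_from, pr_title, pr_latest);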

I think this could be added to the analytics databases without problem, but it may take some time due to other ongoing issues, plus the work needed to productionize this service, so I ask you to be patient.

@Dispenser Can you point to the source code you use for generating such tables? I am mostly interested in the storage and data manipulation logic.

jcrespo changed the task status from Open to Stalled. Jun 15 2018, 3:13 PM
jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.

Still waiting on the code before we can deploy this to the replicas...