Store article reference/citations in database
Open, Needs Triage, Public

Description

As a developer and data scientist, I need article citations saved to long-term storage so that we can perform statistical analysis of the citation quality of single articles, subject areas and projects.

Save the article and each citation. Also save the categories of each article, so we can provisionally treat them as classification classes.
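One possible storage layout is sketched below, assuming SQLite through Python's standard sqlite3 module. The table names, the column names (project, raw_text, url, parsed_ok), the save_article helper and the sample values are all hypothetical; the task does not prescribe a schema or a database engine.

```python
import sqlite3

# Hypothetical schema: one row per article, one row per citation,
# and a join table because an article can have several categories.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id      INTEGER PRIMARY KEY,
    project TEXT NOT NULL,
    title   TEXT NOT NULL,
    UNIQUE (project, title)
);
CREATE TABLE IF NOT EXISTS citations (
    id         INTEGER PRIMARY KEY,
    article_id INTEGER NOT NULL REFERENCES articles(id),
    raw_text   TEXT NOT NULL,
    url        TEXT,
    parsed_ok  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS article_categories (
    article_id  INTEGER NOT NULL REFERENCES articles(id),
    category_id INTEGER NOT NULL REFERENCES categories(id),
    PRIMARY KEY (article_id, category_id)
);
"""


def save_article(conn, project, title, citations, categories):
    """Insert one article with its citations and categories; return its id."""
    cur = conn.execute(
        "INSERT INTO articles (project, title) VALUES (?, ?)", (project, title)
    )
    article_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO citations (article_id, raw_text, url, parsed_ok) "
        "VALUES (?, ?, ?, ?)",
        [
            (article_id, c["raw_text"], c.get("url"), int(c.get("parsed_ok", False)))
            for c in citations
        ],
    )
    for name in categories:
        # Categories are shared between articles, so reuse existing rows.
        conn.execute("INSERT OR IGNORE INTO categories (name) VALUES (?)", (name,))
        (category_id,) = conn.execute(
            "SELECT id FROM categories WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO article_categories VALUES (?, ?)",
            (article_id, category_id),
        )
    conn.commit()
    return article_id


# Usage with made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
article_id = save_article(
    conn,
    "enwiki",  # hypothetical project label
    "Example article",
    [
        {"raw_text": "{{cite web |url=https://example.org}}",
         "url": "https://example.org", "parsed_ok": True},
        {"raw_text": "unparsed reference text"},
    ],
    ["Example category"],
)
```

Keeping raw_text next to the parsed fields means poorly parsed citations can be re-examined later without extracting them again.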

Once EN and Simple citations are completely extracted and saved to a database, we need to perform some validation checks:

  1. Missing citations: articles with no citations saved
  2. Poorly parsed citations: percentage per article
  3. Correctly parsed URLs: percentage per article, then the average per category
  4. Use Pandas and standard Python packages to visualize the resulting statistics
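The checks above could be computed roughly as follows. This is a minimal sketch with made-up data: the column names, the parsed_ok and url_ok flags, and the simplification of one category per article are assumptions for illustration, not something the task defines.

```python
import pandas as pd

# Made-up sample data; in practice these frames would be loaded from the database.
articles = pd.DataFrame({
    "article_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "category": ["Physics", "Physics", "History"],  # assumes one category per article
})
citations = pd.DataFrame({
    "article_id": [1, 1, 2, 2],
    "parsed_ok": [True, False, True, True],  # hypothetical flag: citation parsed cleanly
    "url_ok": [True, False, True, True],     # hypothetical flag: URL parsed correctly
})

# Check 1: articles that have no citation rows at all.
missing = articles[~articles["article_id"].isin(citations["article_id"])]

# Checks 2 and 3: per-article percentages of poorly parsed citations
# and of correctly parsed URLs.
per_article = (
    citations.groupby("article_id")
    .agg(
        pct_poorly_parsed=("parsed_ok", lambda s: 100 * (1 - s.mean())),
        pct_url_ok=("url_ok", lambda s: 100 * s.mean()),
    )
    .reset_index()
)

# Check 3, second half: average the per-article URL percentage within each category.
# The inner merge drops articles without citations, so they do not skew the average.
per_category = (
    articles.merge(per_article, on="article_id", how="inner")
    .groupby("category")["pct_url_ok"]
    .mean()
)

print(missing[["article_id", "title"]])
print(per_article)
print(per_category)
```

For step 4, per_category.plot.bar() would produce a bar chart of the category averages if matplotlib is installed.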

Deliverables

  • A database of articles, citations and categories
  • A report on the parsing data quality