Page MenuHomePhabricator

Security review of pdfrw
Closed, ResolvedPublic

Description

Project Information

Description of the tool/project

pdfrw is a Python library and utility that reads and writes PDF files

Description of how the tool will be used at WMF

We'll use the tool to post-process PDFs generated by ElectronPdfService. The tool will be used by a Python script to modify PDFs to add page numbers and table of contents.

Dependencies

None (afaik)

Has this project been reviewed before?

No (afaik)

Working test environment

There's none yet, but we can share something as part of T171960: Create a library to post-process PDF and add page numbers and table of contents

Post-deployment

Readers Web will be responsible for maintaining the library.

Event Timeline

The target date for deployment has shifted forward to EOQ Q1 (end of September). Who's the best person to review this Python library?

Who's the best person to review this Python library?

@Bawolff @dpatrick: IIRC you were the folks to ping in this context. I apologise if that's not the case.

Who's the best person to review this Python library?

@Bawolff @dpatrick: IIRC you were the folks to ping in this context. I apologise if that's not the case.

Yes. We will add it to our list of things to review.

In addition to pdfrw, it's looking increasingly likely that we're going to have to use BeautifulSoup for easy DOM querying and manipulation. At this time we won't be using any external parsers such as lxml, but we'll use Python's built in html.parser. Should I create a new task for this? Not sure if any past projects have used this library before, but ORES or Wikimetrics don't seem to use it.

In addition to pdfrw, it's looking increasingly likely that we're going to have to use BeautifulSoup for easy DOM querying and manipulation. At this time we won't be using any external parsers such as lxml, but we'll use Python's built in html.parser. Should I create a new task for this? Not sure if any past projects have used this library before, but ORES or Wikimetrics don't seem to use it.

Yes, please create a separate task requesting that review. Thanks.

I checked out the WIP ppg code in the description of T171960 and I'm wondering whether that will be invoked by the Node service (T177765), returning a ready-to-read PDF which has ToC, page numbers, etc., or will the Node service just render an article which will then be post-processed (adding ToC, page numbers, etc.) separately? I'm asking to ascertain whether the script which will use pdfrw will be firejailed. This question is not a blocker. I'm just curious.

dpatrick claimed this task.

This review is complete. Basic concerns such as system i/o (very limited) and shell execution (none) were found to be safe. Encryption implementation was not reviewed since we won't be using that portion of the code. My main concerns would lie with code execution or denial of service (a la https://github.com/pmaupin/pdfrw/issues/92) via malicious PDF input, however, the fact that this will run in a closed ecosystem (in which input originates only from a system we control) mitigates those concerns.

I checked out the WIP ppg code in the description of T171960 and I'm wondering whether that will be invoked by the Node service (T177765), returning a ready-to-read PDF which has ToC, page numbers, etc., or will the Node service just render an article which will then be post-processed (adding ToC, page numbers, etc.) separately? I'm asking to ascertain whether the script which will use pdfrw will be firejailed. This question is not a blocker. I'm just curious.

I think we'll have the Node service render an article only. We may need to build another service that takes the PDF from the Node service and post-processes it using ppg. I'm not sure if that's what we will end up doing though.