Security review of Beautiful Soup
Closed, Resolved (Public)

Description

Project Information

Description of the tool/project

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.

Description of how the tool will be used at WMF

We'll use the tool to post-process PDFs generated by ElectronPdfService / ChromiumPdfService. The tool will be used by a Python script to query / modify HTML used to generate PDFs.

Dependencies

None (afaik)

Has this project been reviewed before?

No (afaik)

Working test environment

There's none yet, but we can share something as part of T171960: Create a library to post-process PDF and add page numbers and table of contents

Post-deployment

?

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 12 2017, 3:15 PM

Readers Web will be responsible for maintaining the library.

Are you planning to fork BeautifulSoup? Or did you mean something else..?

Copy & paste fail. ;(

bmansurov updated the task description. Oct 12 2017, 6:24 PM
phuedx updated the task description. Oct 31 2017, 4:36 PM
phuedx added a subscriber: phuedx.

I can tell, but I basically looked at them all at the same time. I've found no issues with BeautifulSoup. I know that its use here is generally limited, but I assumed that some user-controlled HTML may make it through to this parser, so I tested for DoS via resource consumption, code execution via entity expansion, failure to maintain entity encoding, etc., and found no concerns. A quick question which I should have clarified before: will you be using HTMLParser, or an external parser (lxml, html5lib, etc.)?

will you be using HTMLParser, or an external parser (lxml, html5lib, etc.)?

We'll be using Python 3's default html.parser for now: https://github.com/kodchi/ppg/blob/master/src/process_toc.py#L26
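
For readers unfamiliar with the parser named above, here is a minimal sketch of the standard-library `html.parser` module that Beautiful Soup's `"html.parser"` tree builder is built on. It is not the code from the linked script: the `HeadingCollector` class and `collect_headings` function are hypothetical names, and collecting headings is only an assumed example of the kind of query a table-of-contents post-processor might run.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collect (level, text) pairs for h1-h6 elements."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self.headings = []
        self._current = None  # tag name of the heading currently open, if any
        self._buffer = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag in self.HEADING_TAGS and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            self.headings.append((int(tag[1]), text))
            self._current = None


def collect_headings(html):
    """Return a list of (level, text) tuples for the headings in `html`."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


if __name__ == "__main__":
    print(collect_headings("<h1>Intro</h1><p>x</p><h2>A &amp; B</h2>"))
```

With Beautiful Soup itself, the same parser is selected by passing `"html.parser"` as the second argument to the `BeautifulSoup` constructor, which keeps the dependency list free of external parsers such as lxml or html5lib.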

dpatrick closed this task as Resolved. Nov 21 2017, 4:56 PM
dpatrick claimed this task.

Thanks. That looks to be fine. I'll go ahead and mark this complete.

Thanks for taking the time to review this, @dpatrick!