
[Session] Web-scraping
Closed, ResolvedPublic

Description

  • Title of session: Web-scraping: pitfalls and lifehacks
  • Session description: Web-scraping (parsing a website's content page by page to fetch needed information, for example census data from a government website) is a very useful way of gathering data, but it can be quite painful for those who lack experience. There are many pitfalls, both obvious and surprising, and various lifehacks to overcome them. Different websites present data in different forms: some have APIs, some block visitors who look like bots, some have geographic restrictions, some go offline after a while, some use unorthodox ways to encode non-Latin characters, and so on. In this session participants can share their experience and try writing some web-scraping code.
  • Username for contact: @Tohaomg
  • Session duration: 50 min
  • Session type: workshop
  • Language of session: English
  • Prerequisites: something to compile code into an .exe file; examples will be given in C#, so the built-in Microsoft .NET Framework with the 'csc' Windows prompt command will do, but it is best to practice using it in advance
  • Notes: https://etherpad.wikimedia.org/p/wmh2024-Web-scraping

Notes from the session

Web-scraping: pitfalls and lifehacks

Article: https://en.wikipedia.org/wiki/Web_scraping

Web-scraping: extracting data from other websites into your wiki projects, because some websites don't make it easy to pull the data they hold.

Use cases in Wiki projects:

  • A script that goes through the references of Wikipedia articles to pull metadata from the links, such as title, publishing date, etc.
  • Scan through government websites to pull data such as election results, political positions, etc.

External use cases

  • Parse listings in an e-commerce store to generate files containing information about the products.
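Both kinds of use case come down to fetching a page and pulling a few fields out of its HTML. The session's examples were in C#, but here is a minimal sketch using Python's standard-library HTMLParser; the sample page, tag names, and meta property are invented for illustration.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the <title> text and any <meta ... content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and "content" in attrs:
            key = attrs.get("property") or attrs.get("name")
            if key:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# A made-up page standing in for a fetched reference or product listing:
page = """<html><head><title>Example article</title>
<meta property="article:published_time" content="2024-05-03"></head>
<body>...</body></html>"""

p = TitleParser()
p.feed(page)
# p.title == "Example article"
# p.meta["article:published_time"] == "2024-05-03"
```

The same parser class can be fed the HTML of each reference link or product page in turn.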

How to webscrape:

  • Many websites embed the presented data directly in the HTML source, which can be viewed with Ctrl-U (view page source) or the browser's inspection tool.
  • Some websites generate content on the fly, so the source code contains little information (Telegram, for example); instead it contains JavaScript with instructions telling the browser where to load the data from.
  • Some sites make encoding difficult, for example articles on this website: https://life.pravda.com.ua/society. In such cases, specify in the web-scraper which encoding the website uses so the text is read properly.
  • Review the extracted data to confirm the information was decoded correctly.
  • For websites with geographic blocking, use a VPN set to the required country to access the website.
  • For websites with rate limits, pause between fetches. Another option is to connect via VPN and split the work batch among different countries.
  • https://web.archive.org/ and https://archive.is/ help you see how a page looked at a given point in time; pages can be processed from these archive sites.
  • A pro of the Web Archive is its convenient API, which returns an archive link for the requested page; a downside is that the API only gives access to the latest snapshot. The Web Archive also sometimes struggles to save content generated on the fly, such as Telegram pages.
  • https://archive.is/ does not have an API and denies access to bots.
  • If a website denies access to bots, modify the User-Agent header to replicate a browser's settings so the request looks more like a human one.
  • A bot can be set up on a host machine to automatically open websites, inspect the source code, pull the required information and paste it into a file. Such a bot can be created with https://www.autohotkey.com/, which emulates user actions. Docs: https://www.autohotkey.com/docs/.
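Several of the points above (explicit encoding, pauses between fetches, a browser-like User-Agent) can be combined into one small fetch helper. The session's examples were in C#, but here is a minimal Python sketch using only the standard library; the User-Agent string and the one-second pause are illustrative assumptions, not values given in the session.

```python
import time
import urllib.request

# Browser-like headers: sites that block the default "Python-urllib"
# user agent often serve such requests normally. The exact string below
# is just an example of a common desktop browser UA.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}

def fetch(url, encoding="utf-8", pause=1.0):
    """Fetch one page politely: browser-like headers, an explicit text
    encoding (for sites with unusual encodings), and a pause between
    requests to stay under per-IP rate limits."""
    req = urllib.request.Request(url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode(encoding, errors="replace")
    time.sleep(pause)  # be gentle between page fetches
    return html
```

For a site that serves, say, windows-1251 text, you would call `fetch(url, encoding="windows-1251")` rather than rely on the default.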

Things to look out for:

  • Don't believe everything written in the source code; for example, templates may contain metadata that does not represent the actual content of the page.
  • Websites may transition to a new look, so keep track of the appearance of the websites you are parsing.
  • Some links may be in a short-form format, such as youtu.be.
  • Scraping scripts tend to contain a lot of regular expressions (https://regex101.com/), and the results may be inconsistent.
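As an illustration of the short-link point and of regex-based extraction, here is a small Python sketch that rewrites youtu.be short links to the canonical long form. The pattern is a deliberate simplification and, like any scraping regex, may miss edge cases.

```python
import re

# youtu.be short links carry the video id in the path; normalize them to
# the long youtube.com form so one pattern downstream handles both.
SHORT = re.compile(r"https?://youtu\.be/([\w-]{6,})")

def expand_short_links(text):
    return SHORT.sub(r"https://www.youtube.com/watch?v=\1", text)

expand_short_links("see https://youtu.be/dQw4w9WgXcQ for details")
# -> 'see https://www.youtube.com/watch?v=dQw4w9WgXcQ for details'
```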

Questions

Q: What libraries do you use for scraping in C#?
A: DotNetWikiBot - https://www.mediawiki.org/wiki/Manual:Creating_a_bot , https://dotnetwikibot.sourceforge.net/

Q: What software did you use to automate this?
A: A program called AutoHotKey ( https://www.autohotkey.com/) that allows you to write and compile files which will emulate user actions.

Q: How to scrape Wikipedia without getting rate limited?
A: Get bot rights (https://en.wikipedia.org/wiki/Wikipedia:Bot_policy); you can also use the API for free.
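For Wikipedia specifically, the MediaWiki Action API usually removes the need to scrape rendered HTML at all. A minimal Python sketch that only builds the query URL (the endpoint and parameters follow the documented Action API; the chosen `prop=info` query is just one example of what it can return):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def api_url(title):
    # Build a MediaWiki Action API query that returns page metadata as
    # JSON, instead of scraping the rendered article HTML.
    params = {
        "action": "query",
        "prop": "info",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)
```

`api_url("Web scraping")` yields a URL you can fetch like any other page, but the response is machine-readable JSON, and the API has documented usage rules (https://www.mediawiki.org/wiki/API:Etiquette).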

Q: Scraped openstreetmap, why don't they have an API?
A: They do (https://wiki.openstreetmap.org/wiki/API)

Event Timeline

Hello! 👋 The 2024 Hackathon Program is now open for scheduling! If you are still interested in organizing a session, you can claim a slot on a first-come, first-serve basis by adding your session to the daily program, following these instructions. We look forward to hearing your presentation!

debt triaged this task as Medium priority.Apr 17 2024, 7:25 PM