Collects data via web scraping & APIs and outputs a JSON file


Duo Scraper


📖 Description

The Duo Scraper builds a JSON file with the political leaders of each country found at this API. The Scraper performs a double scraping task, hence the name "duo":

  1. data collection from API endpoints:

    -- the Scraper first queries a sequence of API endpoints to obtain a list of countries & basic info about their past political leaders.

  2. data collection from HTML endpoints:

    -- the Scraper then uses the Wikipedia URLs retrieved from the API to extract & sanitize the leaders' short bios from the Wikipedia HTML pages

The combined information is written to an output JSON file.
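The two steps above could be sketched roughly like this. Note that the API root and endpoint paths are hypothetical placeholders, and the paragraph extraction is a crude regex stand-in for the project's actual HTML parsing:

```python
import re
import requests

API_ROOT = "https://example-leaders-api.org"  # hypothetical placeholder, not the real API

def get_leaders_per_country(session: requests.Session) -> dict:
    """Step 1: query the API for the country list, then each country's leaders."""
    countries = session.get(f"{API_ROOT}/countries").json()
    return {
        country: session.get(f"{API_ROOT}/leaders", params={"country": country}).json()
        for country in countries
    }

def first_paragraph(html: str) -> str:
    """Step 2 helper: return the first non-empty <p> paragraph, tags stripped."""
    for match in re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL):
        text = re.sub(r"<[^>]+>", "", match).strip()  # drop nested tags like <b>
        if text:
            return text
    return ""
```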

🛠️ Setup & Installation

  1. create a new virtual environment by executing this command in your terminal:
python3 -m venv wikipedia_scraper_env
  2. activate the environment by executing this command in your terminal:
source wikipedia_scraper_env/bin/activate
  3. install the required dependencies by executing this command in your terminal:
pip install -r requirements.txt

👩‍💻 Usage

To run the program, clone this repo to your local machine, navigate to its directory in your terminal, make sure you have installed the dependencies from requirements.txt, then execute:

python3 main.py

📂 Project background

This was my second solo project in the AI Bootcamp in Ghent, Belgium, 2024.

Its main goals were to practice:

  • using virtual environments
  • extracting data from APIs and from HTML
  • using exception handling
  • getting comfortable with JSON
  • using OOP
  • using regex to clean text data
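As an illustration of the last point, a bio sanitizer in the spirit of the project might look like this. The patterns are examples I chose for illustration, not the project's actual cleaning rules:

```python
import re

def clean_bio(text: str) -> str:
    """Sanitize a scraped bio (illustrative patterns, not the project's exact rules)."""
    text = re.sub(r"\[\d+\]", "", text)   # drop footnote markers like [3]
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace/newlines
    return text.strip()
```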

This project was completed over the course of 3 days in February 2024.


My main challenges and opportunities to learn while doing the project were:

  • handling cookies and sessions when performing GET requests
  • handling various tags when parsing html to get the required content
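The cookie/session challenge can be sketched with `requests.Session`, which persists cookies across calls. The refresh endpoint here is an assumed example, not the API's real one:

```python
import requests

COOKIE_URL = "https://example-leaders-api.org/cookie"  # assumed refresh endpoint

def get_with_cookie_refresh(session: requests.Session, url: str):
    """GET a URL; if the server rejects a stale cookie, refresh it once and retry."""
    response = session.get(url)
    if response.status_code in (401, 403):
        session.get(COOKIE_URL)  # the Session stores the new cookie automatically
        response = session.get(url)
    return response
```

Because the `Session` object carries the cookie jar, the retry after the refresh call is sent with the fresh cookie without any manual header handling.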

Extra

I also created a separate branch for the project called feature/o11y where I experiment with concurrency and observability (o11y) via Honeycomb.
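A concurrency experiment along those lines could be sketched with a thread pool. This is a generic sketch, not the branch's actual code, and the Honeycomb instrumentation is omitted:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers: int = 8) -> list:
    """Run `fetch` (any callable taking a URL) over all URLs concurrently,
    preserving input order in the result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Threads suit this workload because each request spends most of its time waiting on the network, so the GIL is not a bottleneck.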

Shoutout to 11011 for his advice and help with these experiments.

⚠️ Warning

All my code is currently heavily:

  • docstringed
  • commented

... and sometimes typed.

This is to help me learn and to make my sessions with our training coach more efficient.


Thanks for visiting my project page!

Connect with me on LinkedIn 🤍
