Collects data via web scraping & APIs and outputs a JSON file


Duo Scraper


📖 Description

The Duo Scraper builds a JSON file with the political leaders of each country found at this API. The Scraper performs a double scraping task, hence the name "duo":

  1. data collection from API endpoints:

    -- the Scraper first queries a sequence of API endpoints to obtain a list of countries & basic info about their past political leaders.

  2. data collection from HTML endpoints:

    -- the Scraper then uses the Wikipedia URLs retrieved from the API to extract & sanitize the leaders' short bios from the Wikipedia HTML pages

The combined information is written to an output JSON file.
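The two steps above could be sketched roughly like this. Note that the API root and endpoint paths are hypothetical placeholders, and the paragraph extraction is a crude regex stand-in for the project's actual HTML parsing:

```python
import re
import requests

API_ROOT = "https://example-leaders-api.org"  # hypothetical placeholder, not the real API

def get_leaders_per_country(session: requests.Session) -> dict:
    """Step 1: query the API for the country list, then each country's leaders."""
    countries = session.get(f"{API_ROOT}/countries").json()
    return {
        country: session.get(f"{API_ROOT}/leaders", params={"country": country}).json()
        for country in countries
    }

def first_paragraph(html: str) -> str:
    """Step 2 helper: return the first non-empty <p> paragraph, tags stripped."""
    for match in re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL):
        text = re.sub(r"<[^>]+>", "", match).strip()  # drop nested tags like <b>
        if text:
            return text
    return ""
```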

🛠️ Setup & Installation

  1. create a new virtual environment by executing this command in your terminal:
python3 -m venv wikipedia_scraper_env
  2. activate the environment by executing this command in your terminal:
source wikipedia_scraper_env/bin/activate
  3. install the required dependencies by executing this command in your terminal:
pip install -r requirements.txt

👩‍💻 Usage

To run the program, clone this repo to your local machine, navigate to its directory in your terminal, make sure you have installed the dependencies from requirements.txt, then execute:

python3 main.py

📂 Project background

This was my second solo project in the AI Bootcamp in Ghent, Belgium, 2024.

Its main goals were to practice:

  • using virtual environments
  • extracting data from APIs and from HTML
  • using exception handling
  • getting comfortable with JSON
  • using OOP
  • using regex to clean text data
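As an illustration of the last point, a bio sanitizer in the spirit of the project might look like this. The patterns are examples I chose for illustration, not the project's actual cleaning rules:

```python
import re

def clean_bio(text: str) -> str:
    """Sanitize a scraped bio (illustrative patterns, not the project's exact rules)."""
    text = re.sub(r"\[\d+\]", "", text)   # drop footnote markers like [3]
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace/newlines
    return text.strip()
```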

This project was completed over the course of 3 days in February 2024.


My main challenges and opportunities to learn while doing the project were:

  • handling cookies and sessions when performing GET requests
  • handling various tags when parsing html to get the required content
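The cookie/session challenge can be sketched with `requests.Session`, which persists cookies across calls. The refresh endpoint here is an assumed example, not the API's real one:

```python
import requests

COOKIE_URL = "https://example-leaders-api.org/cookie"  # assumed refresh endpoint

def get_with_cookie_refresh(session: requests.Session, url: str):
    """GET a URL; if the server rejects a stale cookie, refresh it once and retry."""
    response = session.get(url)
    if response.status_code in (401, 403):
        session.get(COOKIE_URL)  # the Session stores the new cookie automatically
        response = session.get(url)
    return response
```

Because the `Session` object carries the cookie jar, the retry after the refresh call is sent with the fresh cookie without any manual header handling.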

Extra

I also created a separate branch for the project called feature/o11y where I experiment with concurrency and observability (o11y) via Honeycomb.
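A concurrency experiment along those lines could be sketched with a thread pool. This is a generic sketch, not the branch's actual code, and the Honeycomb instrumentation is omitted:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers: int = 8) -> list:
    """Run `fetch` (any callable taking a URL) over all URLs concurrently,
    preserving input order in the result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Threads suit this workload because each request spends most of its time waiting on the network, so the GIL is not a bottleneck.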

Shoutout to 11011 for his advice and help with these experiments.

⚠️ Warning

All my code is currently heavily:

  • docstringed
  • commented

... and sometimes typed.

This is to help me learn and to make my sessions with our training coach more efficient.


Thanks for visiting my project page!

Connect with me on LinkedIn 🤍
