Ingestion Script
The Ingestion Script handles source data ingestion for the NBA ELT Project.
Architecture
ELT Pipeline Orchestration
How It Works
The Ingestion Script performs the following tasks:
- Retrieves feature flags from the database to determine which endpoints to scrape.
- Scrapes data from the identified endpoints.
- Stores the source data in Postgres.
- Stores the source data in S3 for backup purposes.
- Sends any errors or missing-data alerts to Slack.
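The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Endpoint` and `IngestionResult` classes, the `run_ingestion` function, and the endpoint names are all hypothetical, and the Postgres, S3, and Slack writers are passed in as plain callables so the sketch runs without any external services.

```python
from dataclasses import dataclass, field
from typing import Callable

Rows = list[dict]


@dataclass
class Endpoint:
    name: str                    # hypothetical endpoint name, e.g. "boxscores"
    flag: str                    # feature flag that gates this endpoint
    scrape: Callable[[], Rows]   # returns the scraped records


@dataclass
class IngestionResult:
    loaded: dict[str, int] = field(default_factory=dict)   # endpoint -> row count
    problems: list[str] = field(default_factory=list)      # messages for Slack


def run_ingestion(
    flags: dict[str, bool],
    endpoints: list[Endpoint],
    write_postgres: Callable[[str, Rows], None],
    write_s3: Callable[[str, Rows], None],
    send_slack: Callable[[str], None],
) -> IngestionResult:
    result = IngestionResult()
    for endpoint in endpoints:
        # Skip endpoints whose flag is missing or disabled.
        if not flags.get(endpoint.flag, False):
            continue
        try:
            rows = endpoint.scrape()
            if not rows:
                result.problems.append(f"{endpoint.name}: no data returned")
                continue
            write_postgres(endpoint.name, rows)   # primary store
            write_s3(endpoint.name, rows)         # backup copy
            result.loaded[endpoint.name] = len(rows)
        except Exception as exc:
            # One failing endpoint should not stop the remaining ones.
            result.problems.append(f"{endpoint.name}: {exc}")
    # Report all errors and missing data in a single Slack message.
    if result.problems:
        send_slack("\n".join(result.problems))
    return result
```

Passing the writers in as arguments is only a convenience for the sketch: it lets the same function be exercised in tests with in-memory stand-ins instead of real Postgres, S3, and Slack clients.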
A feature flag table in the database controls various pieces of functionality across the project. For the Ingestion Script, some of these flags determine which endpoints are scraped. For example:
- During the offseason, the `is_season` flag should be disabled because no games are being played. This stops all game-related data from being scraped.
- This makes it simple to disable the extraction of this data, instead of having to deploy code changes that comment out various functions during the offseason.
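As an illustration of the idea, the sketch below reads flags from a table and flips `is_season` with a single `UPDATE`. It makes several assumptions: the table name `feature_flags`, its `flag` and `is_enabled` columns, and the helper functions are hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs anywhere.

```python
import sqlite3

# SQLite stands in for Postgres here; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_flags (flag TEXT PRIMARY KEY, is_enabled INTEGER NOT NULL)"
)
conn.execute("INSERT INTO feature_flags (flag, is_enabled) VALUES ('is_season', 1)")


def load_flags(conn: sqlite3.Connection) -> dict[str, bool]:
    """Read every flag into a {name: enabled} mapping."""
    rows = conn.execute("SELECT flag, is_enabled FROM feature_flags").fetchall()
    return {flag: bool(is_enabled) for flag, is_enabled in rows}


def should_scrape_games(flags: dict[str, bool]) -> bool:
    # A missing flag is treated as disabled, so nothing is scraped by accident.
    return flags.get("is_season", False)


in_season = should_scrape_games(load_flags(conn))

# Offseason: flip the flag with one UPDATE, with no code deploy required.
conn.execute("UPDATE feature_flags SET is_enabled = 0 WHERE flag = 'is_season'")
in_offseason = should_scrape_games(load_flags(conn))

print(in_season, in_offseason)
```

Because the flags are re-read on each run, the next scheduled run picks up the change without a new image being built.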
Libraries
- Pandas is the primary package used throughout the Ingestion Script
- beautifulsoup4 is used to scrape various basketball-reference and DraftKings pages
- praw is used to authenticate with and pull data from the Reddit API
- nltk provides sentiment analysis functions that are applied to social media text data
- jyablonski_common_modules provides various functions to read and write data to Postgres
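To show how beautifulsoup4 and Pandas typically fit together in a scraper like this, here is a small sketch that parses an HTML table into a DataFrame. The HTML snippet, the `stats` table id, the column names, and the `parse_stats_table` function are made up for illustration and do not reflect the markup of any real page or the project's actual parsing code.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML stands in for a scraped page; the structure is hypothetical.
HTML = """
<table id="stats">
  <thead><tr><th>player</th><th>team</th><th>pts</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>BOS</td><td>31</td></tr>
    <tr><td>Player B</td><td>DEN</td><td>27</td></tr>
  </tbody>
</table>
"""


def parse_stats_table(html: str) -> pd.DataFrame:
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="stats")
    if table is None:
        # Fail loudly so a missing table can be reported instead of ignored.
        raise ValueError("stats table not found")
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    df = pd.DataFrame(rows, columns=headers)
    # Scraped cells arrive as strings; convert numeric columns explicitly.
    df["pts"] = pd.to_numeric(df["pts"])
    return df


df = parse_stats_table(HTML)
print(df)
```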
Production
The Ingestion Script runs as an ECS task kicked off by an AWS Step Functions pipeline that is triggered every day at 12 pm UTC.
- The Ingestion Script typically takes about 2 minutes to complete
As soon as the script finishes and all ingestion data has been loaded, the dbt job begins the transformation process.
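The ordering described above (ingestion first, dbt only after it completes) could be expressed in a Step Functions definition along these lines. This is a sketch under stated assumptions rather than the project's real pipeline: the state names, cluster, and task definition names are placeholders, it assumes the dbt job also runs as an ECS task, and a real definition would need further settings such as networking. The `ecs:runTask.sync` integration makes a state wait for the task to finish before moving on, and the cron expression is the EventBridge form of "every day at 12:00 UTC".

```python
import json

# EventBridge cron syntax: minute hour day-of-month month day-of-week year.
SCHEDULE_EXPRESSION = "cron(0 12 * * ? *)"

# Placeholder names throughout; only the overall shape is the point.
STATE_MACHINE = {
    "Comment": "Hypothetical sketch: run ingestion, then dbt",
    "StartAt": "RunIngestion",
    "States": {
        "RunIngestion": {
            "Type": "Task",
            # .sync waits for the ECS task to complete before continuing.
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "Cluster": "example-cluster",
                "TaskDefinition": "example-ingestion-task",
            },
            "Next": "RunDbt",
        },
        "RunDbt": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "Cluster": "example-cluster",
                "TaskDefinition": "example-dbt-task",
            },
            "End": True,
        },
    },
}


def execution_order(definition: dict) -> list[str]:
    """Follow the Next links from StartAt to list states in run order."""
    order = []
    name = definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


print(execution_order(STATE_MACHINE))
print(json.dumps(STATE_MACHINE, indent=2))
```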
CI / CD
For continuous integration (CI), the entire test suite is run on every commit in a pull request.
- It installs uv and the project dependencies on the GitHub Actions runner
- It uses Docker to spin up a Postgres database with bootstrap data
- It then runs the test suite, using the Postgres database to run integration tests
- The uv environment used by the GitHub Actions runners also gets cached for up to 7 days, enabling faster test suite executions
After a PR is merged, the continuous deployment (CD) pipeline performs the following steps:
- Builds the Docker image for the service with the updated source code and dependencies
- Pushes the Docker image to ECR
On the next NBA ELT Pipeline run, this new Docker image will be used when the Ingestion Script is scheduled in ECS.