basketball-reference
Basketball Reference is the primary source for NBA boxscores, play-by-play data, injuries, transactions, schedules, and team statistics used throughout the project.
Source System
Sports Reference is a well-known website that provides a comprehensive database of statistics, analytics, and historical information across various sports. Its NBA site, Basketball Reference, includes detailed pages on game logs, play-by-play data, injuries, team transactions, salaries, and more. It serves as the primary data source for all NBA-related information throughout this project.
The NBA offers an official API, which would normally be the preferred way of extracting this data, but the NBA blocks all AWS IP addresses from accessing it. For this reason, basketball-reference is used to pull all NBA-related data.
For the NBA Project, data is scraped from this website once a day at 12 pm UTC, which is typically after data has been updated and made available for the previous day’s games.
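As a rough illustration of the "previous day's games" rule, the sketch below derives the target game date from the time of the run. The function name `previous_game_date` is hypothetical, and the sketch assumes the game date is simply the UTC calendar day before the run; the actual ingestion script may decide this differently.

```python
from datetime import date, datetime, timedelta, timezone


def previous_game_date(run_time: datetime) -> date:
    """Return the calendar date whose games a run at `run_time` should load.

    Assumption: the target is the UTC calendar day before the run.
    Pass a timezone-aware datetime; naive values are treated as local time.
    """
    # Normalize to UTC so the result does not depend on the host's timezone.
    run_utc = run_time.astimezone(timezone.utc)
    return run_utc.date() - timedelta(days=1)


# A run at 12 pm UTC on 2025-03-16 targets games played on 2025-03-15.
run = datetime(2025, 3, 16, 12, 0, tzinfo=timezone.utc)
print(previous_game_date(run).isoformat())
```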
Data Ingestion Process
Web scraping is used to extract data from the following pages:
- Daily Boxscores -> https://www.basketball-reference.com/friv/dailyleaders.fcgi?month=03&day=15&year=2025
- Play-by-Play data -> https://www.basketball-reference.com/boxscores/pbp/202503160CLE.html
- Transactions -> https://www.basketball-reference.com/leagues/NBA_2025_transactions.html
- Injuries -> https://www.basketball-reference.com/friv/injuries.fcgi
- Team Statistics -> https://www.basketball-reference.com/leagues/NBA_2025.html
- Schedule -> https://www.basketball-reference.com/leagues/NBA_2025_games.html
- Player Shooting Statistics -> https://www.basketball-reference.com/leagues/NBA_2025_per_game.html
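The URLs above follow a few regular patterns, so they can be generated rather than hard-coded. The helpers below are a minimal sketch and not the project's actual code: the function names are hypothetical, and the play-by-play pattern (date, then `0`, then the home team's abbreviation) is inferred from the single example URL above.

```python
from datetime import date

BASE_URL = "https://www.basketball-reference.com"


def daily_boxscores_url(game_date: date) -> str:
    """Daily leaders page for one game date (month and day are zero-padded)."""
    return (
        f"{BASE_URL}/friv/dailyleaders.fcgi"
        f"?month={game_date.month:02d}&day={game_date.day:02d}&year={game_date.year}"
    )


def pbp_url(game_date: date, home_team: str) -> str:
    """Play-by-play page for one game.

    Assumption: the game ID is YYYYMMDD + "0" + home team abbreviation.
    """
    return f"{BASE_URL}/boxscores/pbp/{game_date:%Y%m%d}0{home_team}.html"


def season_page_url(season: int, page: str = "") -> str:
    """Season-level page, e.g. page="transactions", "games", or "per_game"."""
    suffix = f"_{page}" if page else ""
    return f"{BASE_URL}/leagues/NBA_{season}{suffix}.html"


print(daily_boxscores_url(date(2025, 3, 15)))
print(pbp_url(date(2025, 3, 16), "CLE"))
print(season_page_url(2025, "transactions"))
```

Calling `season_page_url(2025)` with no page name yields the team statistics URL listed above.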
After the data has been pulled, it is loaded into Pandas DataFrames and upserted into the bronze schema in Postgres.
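One common way to express an upsert in Postgres is `INSERT ... ON CONFLICT ... DO UPDATE`. The sketch below builds such a statement from a list of column names; it is an illustration only and not necessarily how the project performs its upserts. The helper `build_upsert_sql` and the column names `player`, `game_date`, and `pts` are hypothetical, and the approach assumes the target table has a unique constraint on the key columns.

```python
def build_upsert_sql(
    table: str,
    columns: list[str],
    key_columns: list[str],
    schema: str = "bronze",
) -> str:
    """Build a parameterized Postgres upsert statement.

    Identifiers are interpolated directly, so only pass trusted,
    hard-coded table and column names. Row values go in as %s parameters.
    """
    update_columns = [c for c in columns if c not in key_columns]
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    conflict_target = ", ".join(key_columns)
    if update_columns:
        # EXCLUDED refers to the row that was proposed for insertion.
        assignments = ", ".join(f"{c} = EXCLUDED.{c}" for c in update_columns)
        action = f"DO UPDATE SET {assignments}"
    else:
        # Every column is part of the key, so there is nothing to update.
        action = "DO NOTHING"
    return (
        f"INSERT INTO {schema}.{table} ({col_list}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_target}) {action}"
    )


sql = build_upsert_sql(
    "bbref_player_boxscores",
    columns=["player", "game_date", "pts"],  # hypothetical column names
    key_columns=["player", "game_date"],
)
print(sql)
```

With a DB-API driver such as psycopg2, the rows of a DataFrame `df` could then be passed as `cursor.executemany(sql, df.itertuples(index=False, name=None))`, provided the DataFrame's column order matches `columns`.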
Source Tables
bronze Schema
- bbref_player_boxscores -> Boxscore Data
- bbref_player_injuries -> Injury Data
- bbref_player_pbp -> PBP Event Data
- bbref_player_shooting_stats -> Aggregated Player Shooting Stats
- bbref_player_stats_snapshot -> Aggregated Player Stats Snapshot
- bbref_league_schedule -> Schedule Data
- bbref_league_transactions -> League Transactions such as trades, free agent signings, etc.
- bbref_team_adv_stats_snapshot -> Team Advanced Stats Snapshot
- bbref_team_opponent_shooting_stats -> Team Opponent Shooting Stats
Data Quality Considerations
- Player names have historically been changed by basketball-reference mid-season, which caused issues downstream in dbt when joining and grouping boxscore data.
  - For example, the site started removing suffixes from names, such as Robert Williams III -> Robert Williams.
  - There are also inconsistencies in how names are stored.
  - I’ve tried to clean up some of this via various helper functions, but it’s still something to watch out for.
- Boxscore and play-by-play data is sometimes not available at 12 pm UTC when the ingestion script runs.
  - In these cases, the data must be backfilled once it becomes available, and the downstream dbt models must be refreshed afterwards.
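To make the player-name issue concrete, here is a minimal sketch of the kind of helper that can keep names stable across such changes. It is not the project's actual helper: `normalize_player_name` is a hypothetical name, the suffix list is a guess, and the handling of accented characters is an assumption about one possible kind of storage inconsistency rather than something confirmed by the source.

```python
import re
import unicodedata

# Trailing generational suffixes to strip (assumed list, case-insensitive).
_SUFFIX_RE = re.compile(r"\s+(jr\.?|sr\.?|ii|iii|iv|v)$", re.IGNORECASE)


def normalize_player_name(name: str) -> str:
    """Return a join-friendly version of a player name."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse repeated whitespace, then strip one trailing suffix if present.
    collapsed = " ".join(unaccented.split())
    return _SUFFIX_RE.sub("", collapsed)


print(normalize_player_name("Robert Williams III"))
print(normalize_player_name("Robert Williams"))
```

Both calls print `Robert Williams`, so rows scraped before and after the suffix was dropped would join on the same key. Stripping suffixes can merge genuinely different players who share a name, so a stable player ID is a safer join key where one is available.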