Data Ingestion
Data ingestion follows a shared workflow for pulling source data, shaping it into DataFrames, applying light cleanup, and merging it into the Postgres bronze schema.
Background
Section titled “Background”The ingestion process is standardized across all data sources, regardless of the system they originate from. This uniform approach simplifies the onboarding of new data sources and ensures consistency across the pipeline.
By maintaining a common pattern, extending the system to handle additional sources requires minimal effort.
Workflow Overview
Section titled “Workflow Overview”- Pull data from a source system
- Turn data into a Pandas DataFrame for subsequent steps
- Perform minimal cleaning, data enrichment, and column name standardization
- Merge data from Pandas DataFrame into Postgres
bronzeschema - Write data to S3 as Parquet files for backup purposes
How It Works
Section titled “How It Works”- The DataFrame is first inserted into a temporary table in Postgres.
- Records from the temp table are upserted into the target source table:
- Inserts: New records are added.
- Updates: Existing records are updated based on matching keys.
- A
modified_attimestamp is automatically updated whenever a record is inserted or changed. - dbt leverages the
modified_attimestamp to drive incremental builds for downstream Fact and Dimension tables, which are the only tables that pull from the Source data
Details
Section titled “Details”Merge Process
Section titled “Merge Process”The merge logic predates Postgres’ native MERGE functionality introduced in version 15. Instead, a custom merge utility is provided via the internal Python library jyablonski_common_modules.
Table Naming Conventions
Section titled “Table Naming Conventions”Source tables are named using the following convention:
nba_source.<source_name>_<table_name>- Example:
nba_source.bbref_player_boxscores-> Player Boxscore records from Basketball Reference
Where possible, further naming standardization should be applied to help denote the type of data in the table:
nba_source.bbref_player_stats_snapshot-> Player Statistics Snapshotnba_source.bbref_league_transactions-> League Transactions such as trades, free agent signings etc
For tables created manually or internally, the following naming convention is used:
nba_source.internal_<table_name>- Example:
nba_source.internal_player_attributes-> Contains special attributes from various sources like player headshot PNGs, years of experience etc