Data Ingestion

Author jyablonski

Updated May 13, 2026

Tags guideeltpythondatabase

Data ingestion follows a shared workflow for pulling source data, shaping it into DataFrames, applying light cleanup, and merging it into the Postgres bronze schema.

Background

The ingestion process is standardized across all data sources, regardless of the system they originate from. This uniform approach simplifies the onboarding of new data sources and ensures consistency across the pipeline.

By maintaining a common pattern, extending the system to handle additional sources requires minimal effort.

Workflow Overview

Pull data from a source system
Turn data into a Pandas DataFrame for subsequent steps
Perform minimal cleaning, data enrichment, and column name standardization
Merge data from Pandas DataFrame into Postgres bronze schema
Write data to S3 as Parquet files for backup purposes

How It Works

The DataFrame is first inserted into a temporary table in Postgres.
Records from the temp table are upserted into the target source table:
- Inserts: New records are added.
- Updates: Existing records are updated based on matching keys.
A modified_at timestamp is automatically updated whenever a record is inserted or changed.
dbt leverages the modified_at timestamp to drive incremental builds for downstream Fact and Dimension tables, which are the only tables that pull from the Source data

Details

Merge Process

The merge logic predates Postgres’ native MERGE functionality introduced in version 15. Instead, a custom merge utility is provided via the internal Python library jyablonski_common_modules.

Table Naming Conventions

Source tables are named using the following convention:

nba_source.<source_name>_<table_name>
Example: nba_source.bbref_player_boxscores -> Player Boxscore records from Basketball Reference

Where possible, further naming standardization should be applied to help denote the type of data in the table:

nba_source.bbref_player_stats_snapshot -> Player Statistics Snapshot
nba_source.bbref_league_transactions -> League Transactions such as trades, free agent signings etc

For tables created manually or internally, the following naming convention is used:

nba_source.internal_<table_name>
Example: nba_source.internal_player_attributes -> Contains special attributes from various sources like player headshot PNGs, years of experience etc