Data Ingestion

Author jyablonski
Updated May 13, 2026
Tags guide, elt, python, database

Data ingestion follows a shared workflow for pulling source data, shaping it into DataFrames, applying light cleanup, and merging it into the Postgres bronze schema.


The ingestion process is standardized across all data sources, regardless of the system they originate from. This uniform approach simplifies the onboarding of new data sources and ensures consistency across the pipeline.

Because every source follows the same pattern, adding a new source takes minimal effort. Each ingestion run follows these steps:

  1. Pull data from a source system
  2. Turn data into a Pandas DataFrame for subsequent steps
  3. Perform minimal cleaning, data enrichment, and column name standardization
  4. Merge data from Pandas DataFrame into Postgres bronze schema
  5. Write data to S3 as Parquet files for backup purposes
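
The column name standardization in step 3 can be sketched as below. This is a minimal illustration, not the pipeline's actual cleanup code: the function name `standardize_column_name`, the snake_case target format, and the sample column labels are all assumptions made for the example.

```python
import re


def standardize_column_name(name: str) -> str:
    """Convert a raw column label to lowercase snake_case (assumed target format)."""
    # Insert an underscore at camelCase boundaries: "gameDate" -> "game_Date"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    # Drop leading/trailing underscores and lowercase the result
    return name.strip("_").lower()


# Hypothetical raw labels as they might arrive from a source system
raw_columns = ["Player Name", "Team ID", "gameDate", " Minutes Played "]
clean_columns = [standardize_column_name(c) for c in raw_columns]
print(clean_columns)
```

With Pandas, the same function can be applied to every label in one call via `df.rename(columns=standardize_column_name)`, since `rename` accepts a callable for `columns`.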

The merge into the bronze schema (step 4) works as follows:

  1. The DataFrame is first inserted into a temporary table in Postgres.
  2. Records from the temp table are upserted into the target source table:
    • Inserts: New records are added.
    • Updates: Existing records are updated based on matching keys.
  3. A modified_at timestamp is automatically updated whenever a record is inserted or changed.
  4. dbt uses the modified_at timestamp to drive incremental builds for the downstream Fact and Dimension tables, which are the only tables that read from the source data.

The merge logic predates the native MERGE command that Postgres introduced in version 15, so merges go through a custom utility from the internal Python library jyablonski_common_modules instead.
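
To make the temp-table upsert concrete, here is a hedged sketch of the kind of statement such a utility could generate. It does not show the real jyablonski_common_modules API, which is not documented here. The function `build_upsert_sql`, the staging table name, and the column names are hypothetical. It uses Postgres' `INSERT ... ON CONFLICT DO UPDATE`, and assumes the key columns carry a unique constraint and that modified_at is refreshed inside the statement (a trigger would work equally well).

```python
def build_upsert_sql(target, staging, columns, key_columns):
    """Build an INSERT ... ON CONFLICT statement that upserts staging rows into target.

    Identifiers are interpolated directly, so they must come from trusted code,
    never from user input.
    """
    column_list = ", ".join(columns)
    key_list = ", ".join(key_columns)
    # Only non-key columns are overwritten when a matching row already exists
    assignments = [f"{c} = EXCLUDED.{c}" for c in columns if c not in key_columns]
    # Assumption: updates refresh modified_at here; new rows would rely on a
    # column default such as now()
    assignments.append("modified_at = now()")
    set_clause = ", ".join(assignments)
    return (
        f"INSERT INTO {target} ({column_list})\n"
        f"SELECT {column_list} FROM {staging}\n"
        f"ON CONFLICT ({key_list}) DO UPDATE\n"
        f"SET {set_clause};"
    )


# Hypothetical columns and staging table for illustration only
sql = build_upsert_sql(
    target="nba_source.bbref_player_boxscores",
    staging="temp_bbref_player_boxscores",
    columns=["player", "game_date", "pts"],
    key_columns=["player", "game_date"],
)
print(sql)
```

Rows whose keys are absent from the target are inserted, and rows whose keys already exist have their non-key columns overwritten. This matches the insert and update branches described above.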

Source tables are named using the following convention:

  • nba_source.<source_name>_<table_name>
  • Example: nba_source.bbref_player_boxscores -> Player Boxscore records from Basketball Reference

Where possible, table names should also indicate the type of data they hold:

  • nba_source.bbref_player_stats_snapshot -> Player Statistics Snapshot
  • nba_source.bbref_league_transactions -> League Transactions such as trades and free agent signings

For tables created manually or internally, the following naming convention is used:

  • nba_source.internal_<table_name>
  • Example: nba_source.internal_player_attributes -> Contains special attributes from various sources, such as player headshot PNGs and years of experience
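
The naming conventions above could be enforced with a small helper like the one below. This is an illustrative sketch only: `source_table_name` is a hypothetical function, and the lowercase snake_case check is an assumption inferred from the examples, not a documented rule.

```python
import re

SCHEMA = "nba_source"
# Assumption: name parts are lowercase snake_case, as in the examples above
_PART = re.compile(r"^[a-z][a-z0-9_]*$")


def source_table_name(table_name: str, source_name: str = "internal") -> str:
    """Return nba_source.<source_name>_<table_name>.

    source_name defaults to "internal" for manually or internally created tables.
    """
    for part in (source_name, table_name):
        if not _PART.match(part):
            raise ValueError(f"invalid name part: {part!r}")
    return f"{SCHEMA}.{source_name}_{table_name}"


print(source_table_name("player_boxscores", source_name="bbref"))
print(source_table_name("player_attributes"))
```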