ML Pipeline

Author jyablonski
Updated May 13, 2026
Tags service, ml, python

The ML Pipeline is a Python service that generates win predictions for upcoming NBA games.


graph LR
    subgraph PostgresDB[Postgres]
        T1[ml_game_features]
        T2[ml_game_predictions]
    end
    T1 -->|Fetches upcoming games| ML[ML Pipeline]
    ML -->|Stores win predictions| T2
    T2 --> API[REST API Service]
    T2 --> DASH[Dash Frontend Service]
    style PostgresDB fill:#89888f,stroke:#444444,stroke-width:2px
    style ML fill:#d6d6d6,stroke:#444444,stroke-width:1.5px
    style API fill:#f5f5f5,stroke:#444444,stroke-width:1.5px
    style DASH fill:#f5f5f5,stroke:#444444,stroke-width:1.5px

graph LR
    A[Ingestion Script] --> B[dbt]
    B --> C[ML Pipeline]

The ML model is a Logistic Regression classifier trained on data from the 2023-24 season.

  • In production, the model achieves around 67% accuracy on win predictions.
  • The trained model is serialized with joblib and baked directly into the container image that runs in production.

Note: For more dynamic use cases, the model could be stored in S3 and pulled at runtime. This would allow the pipeline to load updated versions without requiring a container rebuild.
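
A minimal sketch of the joblib round trip and the S3 alternative mentioned in the note, assuming scikit-learn and joblib as named on this page. The `load_model` helper, the toy training data, and the `bucket`/`key` parameters are hypothetical and not taken from the project. The S3 branch uses boto3's `download_file` and only runs when a bucket and key are supplied.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression


def load_model(local_path, bucket=None, key=None):
    """Load a joblib-serialized model, optionally downloading it from S3 first.

    Hypothetical helper: the bucket and key names are assumptions.
    """
    if bucket is not None and key is not None:
        # Imported lazily so the baked-into-the-image path needs no AWS dependency.
        import boto3

        boto3.client("s3").download_file(bucket, key, str(local_path))
    return joblib.load(local_path)


# Toy data standing in for real game features (illustrative values only).
X = [[0.2, 1], [0.8, 3], [0.4, 2], [0.9, 1], [0.1, 2], [0.7, 3]]
y = [0, 1, 0, 1, 0, 1]
model = LogisticRegression().fit(X, y)

# Serialize once at build time, then load at runtime.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(model, model_path)
restored = load_model(model_path)
```

With the model baked into the image, only the local path is used. Passing `bucket` and `key` would switch to the runtime-download variant without changing the calling code.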

The pipeline pulls upcoming games from the silver.ml_game_features table generated by dbt, uses the model to generate a win prediction for each game, and then writes the results to gold.ml_game_predictions.
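
The read, score, and write flow described above can be sketched as follows. This is an illustration under stated assumptions, not the service's code: `sqlite3` stands in for Postgres so the example is self-contained, the column names are hypothetical, and `predict_home_win_probability` is a placeholder for the trained model's prediction call.

```python
import math
import sqlite3

# In-memory SQLite databases stand in for the Postgres silver and gold schemas.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS silver")
conn.execute("ATTACH DATABASE ':memory:' AS gold")

# Column names are assumptions; the real tables are defined elsewhere.
conn.execute(
    "CREATE TABLE silver.ml_game_features "
    "(game_id INTEGER, home_team TEXT, away_team TEXT, "
    "home_win_pct REAL, away_win_pct REAL)"
)
conn.execute(
    "CREATE TABLE gold.ml_game_predictions "
    "(game_id INTEGER, home_team TEXT, away_team TEXT, "
    "home_win_probability REAL)"
)
conn.executemany(
    "INSERT INTO silver.ml_game_features VALUES (?, ?, ?, ?, ?)",
    [(1, "BOS", "NYK", 0.70, 0.55), (2, "DET", "DEN", 0.30, 0.65)],
)


def predict_home_win_probability(home_win_pct, away_win_pct):
    """Placeholder scorer; the real pipeline would call the trained model."""
    return 1 / (1 + math.exp(-4 * (home_win_pct - away_win_pct)))


# 1. Fetch upcoming games, 2. score each one, 3. store the predictions.
games = conn.execute(
    "SELECT game_id, home_team, away_team, home_win_pct, away_win_pct "
    "FROM silver.ml_game_features"
).fetchall()
predictions = [
    (game_id, home, away, round(predict_home_win_probability(h_pct, a_pct), 3))
    for game_id, home, away, h_pct, a_pct in games
]
conn.executemany(
    "INSERT INTO gold.ml_game_predictions VALUES (?, ?, ?, ?)", predictions
)
conn.commit()

saved = conn.execute(
    "SELECT * FROM gold.ml_game_predictions ORDER BY game_id"
).fetchall()
```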

  • Downstream services pull the latest available data to present to end users.
  • Model accuracy is tracked by comparing each prediction to the actual result once the game has been played.
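
The accuracy tracking described above amounts to grading only the games that have finished. A minimal sketch, assuming predictions and results are keyed by a game ID; the `prediction_accuracy` function and the sample data are hypothetical:

```python
def prediction_accuracy(predicted_winners, actual_winners):
    """Return the share of correct picks among games that have been played.

    Both arguments map game_id -> winning team. Games without an actual
    result yet are ignored. Returns None when nothing can be graded.
    """
    graded = [game_id for game_id in predicted_winners if game_id in actual_winners]
    if not graded:
        return None
    correct = sum(predicted_winners[g] == actual_winners[g] for g in graded)
    return correct / len(graded)


# Illustrative data only: game 4 has not been played yet, so it is skipped.
predicted = {1: "BOS", 2: "DEN", 3: "LAL", 4: "MIL"}
actual = {1: "BOS", 2: "DEN", 3: "PHX"}

accuracy = prediction_accuracy(predicted, actual)
```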

The following features are pulled for both the home and away team in each upcoming game:

  1. Days of Rest - The number of days of rest the team has had before the game.

  2. Top Players - An ordinal ranking (0, 1, 2) representing whether the team’s top players are active or unavailable for the game.

  3. Moneyline Odds - The team’s moneyline odds for the upcoming game.

  4. Recent Team Performance - The team’s win % over its last 10 games.

  5. Overall Team Performance - The team’s win % for the season so far.
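
To make the rest and performance features concrete, here is a small sketch of how they could be derived from a team's game log. In this project the features come from dbt, so the Python below is illustrative only. The function names are hypothetical, and counting a back-to-back as 0 days of rest is an assumption rather than the project's stated definition.

```python
from datetime import date


def days_of_rest(last_game, next_game):
    """Full days off between two games (assumption: a back-to-back counts as 0)."""
    return max((next_game - last_game).days - 1, 0)


def win_pct(results, last_n=None):
    """Win percentage over a list of booleans ordered oldest to newest.

    Pass last_n=10 for recent form; leave it as None for the season to date.
    Returns None when there are no games to evaluate.
    """
    window = results[-last_n:] if last_n else results
    if not window:
        return None
    return sum(window) / len(window)


# Illustrative game log: True is a win, False is a loss.
results = [True, False, True, True, False, True,
           True, True, False, True, True, True]

rest = days_of_rest(date(2024, 1, 10), date(2024, 1, 13))
recent_form = win_pct(results, last_n=10)
season_form = win_pct(results)
```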

The service relies on two main packages:

  1. scikit-learn is the primary package used to build the model and generate predictions.
  2. jyablonski_common_modules provides various functions to read and write data to Postgres.

The ML Pipeline runs as an ECS Task following the completion of the dbt job.

  • The ML Pipeline typically completes in about 10 seconds.

The NBA ELT Pipeline is complete after the ML Pipeline finishes, with no further tasks to run.

For continuous integration (CI), the entire test suite runs in Docker on every commit to a pull request.

After a PR is merged, the continuous deployment (CD) pipeline performs the following steps:

  1. Builds the Docker image for the service with the updated source code and dependencies
  2. Pushes the Docker image to ECR

On the next NBA ELT Pipeline run, this new Docker image will be used when the ML Pipeline is scheduled in ECS.