ML Pipeline
The ML Pipeline is a Python service that generates win predictions for upcoming NBA games.
Architecture
Database Usage
ELT Pipeline Orchestration
How It Works
The ML model is a Logistic Regression classifier trained on data from the 2023-24 season.
- In production, the model achieves around 67% accuracy on win predictions.
- The trained model is serialized using joblib and stored directly in the container image that runs in production.
Note: For more dynamic use cases, the model could be stored in S3 and pulled at runtime. This would allow the pipeline to load updated versions without requiring a container rebuild.
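To illustrate what a fitted Logistic Regression classifier computes when it scores a game, here is a minimal sketch in plain Python. In the actual pipeline scikit-learn handles this internally; the function name, coefficients, intercept, and feature values below are all hypothetical and chosen only for illustration.

```python
import math


def win_probability(features, coefficients, intercept):
    """Return the logistic (sigmoid) of the weighted feature sum."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters and inputs (not taken from the real model)
coefficients = [0.8, -0.5]
intercept = 0.1
features = [1.0, 0.5]

home_win_prob = win_probability(features, coefficients, intercept)

# A probability of 0.5 or higher is treated as a predicted home win
predicted_home_win = home_win_prob >= 0.5
```

The classifier outputs a probability rather than a bare label, which is why a threshold is needed to turn it into a win/loss prediction.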
The pipeline pulls upcoming games from the silver.ml_game_features table generated by dbt, uses the ML model to generate a prediction for each upcoming game, and then saves the results to gold.ml_game_predictions.
- The downstream services pull the latest available data to present to end users
- Model accuracy is tracked by comparing the prediction results to the actual game results after the games are played
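The accuracy tracking described above can be sketched as a comparison between predicted and actual winners. This is a minimal illustration, not the pipeline's actual implementation: the function name, the dictionary layout, and the game IDs and team codes are assumptions.

```python
def prediction_accuracy(predictions, results):
    """Fraction of completed games where the predicted winner was correct.

    Both arguments map a game id to a team name. Games that have no
    result yet are skipped. Returns None when no game can be scored.
    """
    scored = [game_id for game_id in predictions if game_id in results]
    if not scored:
        return None
    correct = sum(predictions[g] == results[g] for g in scored)
    return correct / len(scored)


# Hypothetical sample data: game_4 has not been played yet
predictions = {"game_1": "BOS", "game_2": "LAL", "game_3": "DEN", "game_4": "MIA"}
results = {"game_1": "BOS", "game_2": "GSW", "game_3": "DEN"}

accuracy = prediction_accuracy(predictions, results)
```

Skipping unplayed games keeps the metric from being dragged down by predictions that cannot be checked yet.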
Features Used
The following features are pulled for each home & away team in the upcoming games:
- Days of Rest - Measures how many days of rest the team has had before the game.
- Top Players - Ordinal ranking (0, 1, 2) that represents whether the team's top players are active or unavailable for the game
- Moneyline Odds - Moneyline odds for the team for the upcoming game
- Recent Team Performance - Team Win % for the last 10 games
- Overall Team Performance - Team Win % for the entire season thus far
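As a rough sketch of how the per-team features above might be combined into a single model input for one game, the example below uses a small dataclass. The class, field, and function names are hypothetical and do not reflect the real column names in silver.ml_game_features; the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamFeatures:
    days_rest: int         # days of rest before the game
    top_players: int       # ordinal value: 0, 1, or 2
    moneyline_odds: int    # moneyline odds for the upcoming game
    win_pct_last10: float  # win % over the last 10 games
    win_pct_season: float  # win % for the season so far

    def __post_init__(self):
        # Reject values outside the ordinal range described above
        if self.top_players not in (0, 1, 2):
            raise ValueError("top_players must be 0, 1, or 2")

    def as_list(self):
        return [
            float(self.days_rest),
            float(self.top_players),
            float(self.moneyline_odds),
            self.win_pct_last10,
            self.win_pct_season,
        ]


def game_feature_vector(home, away):
    """Concatenate home and away features into one row for the model."""
    return home.as_list() + away.as_list()


# Hypothetical values for a single upcoming game
home = TeamFeatures(2, 2, -150, 0.7, 0.62)
away = TeamFeatures(1, 1, 130, 0.4, 0.48)
vector = game_feature_vector(home, away)
```

Each game then yields ten numeric inputs: five for the home team followed by five for the away team.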
Libraries
- scikit-learn is the primary package behind building the ML model & creating predictions
- jyablonski_common_modules provides various functions to read & write data to Postgres
Production
The ML Pipeline runs as an ECS Task following the completion of the dbt job.
- The ML Pipeline typically takes about 10 seconds to complete
The NBA ELT Pipeline is complete after the ML Pipeline finishes, with no further tasks to run.
CI / CD
For continuous integration (CI), the entire test suite is run on every commit in a pull request using Docker.
After a PR is merged, the continuous deployment (CD) pipeline performs the following steps:
- Builds the Docker image for the service with the updated source code and dependencies
- Pushes the Docker image to ECR
On the next NBA ELT Pipeline run, this new Docker image will be used when the ML Pipeline is scheduled in ECS.