A custom end-to-end analytics platform for customer churn

An end-to-end data pipeline that models and visualizes customer churn. The project blends concepts from data engineering, analytics engineering, and machine learning. See the README for a more detailed description of the project.

The dashboard can be accessed here.

Overview

This project was developed to:

Employ various elements of the modern data stack
Be applied to a somewhat realistic business use case
Serve as a template for others to learn from and use via extensive documentation

Project Architecture

This project uses the Telco Customer Churn data module, a sample dataset from IBM’s Cognos Analytics platform. It seemed like the most representative option, given how difficult it is to find a decent dataset for this use case.

The dataset is then used to train two models. The first is a Gaussian Copula Synthesizer, which produces synthetic data with characteristics similar to the original. Because there is not much data to go around, the synthesizer also serves as a rudimentary imitation of new data entering the database. The second is an LGBModel (a LightGBM model), a product of running FLAML’s AutoML implementation on the data; its purpose is to predict the churn status of a particular user.
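To make the idea behind a Gaussian copula more concrete, here is a toy, standard-library-only sketch for two numeric columns. It is not the synthesizer used in this project; the function names and the sample columns (`tenure`, `monthly_charges`) are made up for illustration. The sketch converts each column to normal scores by rank, estimates their correlation, draws correlated normal pairs, and maps them back through each column's empirical quantiles.

```python
import random
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()


def _normal_scores(values):
    """Map each value to a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1).
        scores[idx] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def _correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def _empirical_quantile(sorted_values, p):
    """Return the observed value at probability p (a step-function quantile)."""
    idx = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Draw synthetic (a, b) pairs that mimic the columns' rank dependence."""
    if len(col_a) != len(col_b) or len(col_a) < 2:
        raise ValueError("columns must have the same length (at least 2)")
    rng = random.Random(seed)
    rho = _correlation(_normal_scores(col_a), _normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second normal draw with correlation rho to the first.
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        samples.append((
            _empirical_quantile(sorted_a, _STD_NORMAL.cdf(z1)),
            _empirical_quantile(sorted_b, _STD_NORMAL.cdf(z2)),
        ))
    return samples


# Illustrative values only, not taken from the Telco dataset.
tenure = [1, 3, 5, 8, 12, 20, 24, 36, 48, 60]
monthly_charges = [20.0, 25.5, 31.0, 45.2, 50.0, 62.3, 70.1, 80.0, 95.5, 110.0]
synthetic = sample_gaussian_copula(tenure, monthly_charges, n_samples=5, seed=42)
```

Because the sketch samples from empirical quantiles, every synthetic value is one that already appears in the input column. A full synthesizer would also handle categorical columns and fit parametric marginals, which this sketch leaves out.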

Both models are served via FastAPI. Airflow then orchestrates pulling data from the Synthesizer, obtaining a churn prediction for that data from the classification model, generating a ULID for each customer, and writing everything to a Postgres database. Airflow also triggers dbt afterward to run tests and apply the necessary transformations. The data is modeled as a star schema and is finally visualized as a dashboard using Metabase.

Almost all of the services above run in their own Docker containers, as seen in the diagram. These containers run on a GCP VM provisioned via Terraform. Finally, GitHub Actions handles CD, so changes made to this repo are reflected on the VM.

Some points to consider

Keep in mind that predicting customer churn is a rather sophisticated use case and is difficult to get right. AutoML was employed only to obtain a viable baseline model, which is how it should be applied most of the time. Check out Fighting Churn with Data for a more comprehensive look at modelling customer churn.
The original dataset has been transformed and can be found here. This is the dataset the models have been trained on.
As mentioned earlier, an effort has been made to document the various elements that comprise this project. Most files should have accompanying comments to facilitate understanding, and higher-level system design justifications can be found in the README.