In this case study, the SmartCat team successfully addressed the challenge faced by a fintech startup in predicting cash flow and purchase patterns. Despite having limited data consisting only of invoice dates and purchase amounts, the team employed innovative strategies and models to provide valuable insights.
The classification model predicted whether a purchase would occur within a given timeframe for a specific end customer, and the regression model estimated the total amount an individual customer would spend during that timeframe. The regression model achieved a lower median error than the baseline approach, while the classification model performed best on longer timeframes.
Moving forward, further enhancements could be made to the models, such as incorporating data from multiple companies, grouping or clustering companies and their end-customers, and integrating additional data sources and purchase metadata. These improvements would likely yield even more accurate predictions and unlock new potential applications.
Overall, this case study demonstrated the ability to overcome challenges in cash flow prediction by leveraging available data, employing machine learning techniques, and conducting thorough exploratory analysis. By utilizing such approaches, businesses in the fintech industry can enhance their cash flow forecasting capabilities and make informed decisions for better financial management.
The specific pain points and obstacles faced by the client were as follows:
- Limited data: The client only had access to invoice dates and purchase amounts, which made it challenging to predict purchases accurately without any contextual information.
- Cold-start problem: The client wanted predictions even for new customers and for customers with little or no purchase history, using whatever data was available.
To address the client’s challenges, the SmartCat team adopted the following strategies:
- Data gathering and exploratory analysis: In the first week, the team collected external data, including macroeconomic data for specified geo-regions and timeframes, as well as data about holidays in designated regions. Exploratory analysis was conducted to identify buying patterns within the available data.
- Feature engineering and modeling: In the second week, the team focused on feature engineering and method research. More than 30 features were constructed for each model, primarily related to amounts per predefined periods, time periods since the last purchase, and late-paid invoices.
- Classification and regression models: Two models were developed to achieve the project goals. The classification model predicted whether a purchase would occur within a given timeframe and for a specific end customer. The regression model predicted the total amount spent during a particular timeframe by individual customers.
- Performance evaluation: Precision and recall were used as performance metrics for the classification model due to the imbalanced class distribution. Median errors were utilized to assess the regression model’s predictions.
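As an illustration of the feature engineering step, the sketch below derives a few features of the kind described above (amounts per predefined period and time since the last purchase) together with the two prediction targets. It is a minimal example, not the team's actual pipeline: the function name `build_example`, the feature names, and the 30- and 90-day lookback windows are assumptions, and the late-payment features are omitted because they would need payment dates.

```python
from datetime import date, timedelta


def build_example(invoices, cutoff, horizon_days=30, lookback_days=(30, 90)):
    """Build one feature row and its targets for a single end customer.

    invoices: list of (invoice_date, amount) tuples for one customer.
    cutoff:   features use only invoices strictly before this date;
              targets use invoices in [cutoff, cutoff + horizon_days).
    All names and window lengths here are hypothetical.
    """
    past = [(d, a) for d, a in invoices if d < cutoff]
    horizon_end = cutoff + timedelta(days=horizon_days)
    future = [a for d, a in invoices if cutoff <= d < horizon_end]

    features = {}
    for days in lookback_days:
        window_start = cutoff - timedelta(days=days)
        window = [a for d, a in past if d >= window_start]
        features[f"amount_last_{days}d"] = sum(window)
        features[f"count_last_{days}d"] = len(window)

    # None marks a cold-start customer with no history before the cutoff.
    features["days_since_last_purchase"] = (
        (cutoff - max(d for d, _ in past)).days if past else None
    )

    targets = {
        "purchased": int(bool(future)),  # classification target
        "total_amount": sum(future),     # regression target
    }
    return features, targets


# Toy data, invented for illustration only.
invoices = [
    (date(2023, 1, 5), 100.0),
    (date(2023, 2, 20), 250.0),
    (date(2023, 3, 10), 80.0),
    (date(2023, 4, 2), 40.0),
]
features, targets = build_example(invoices, cutoff=date(2023, 3, 15))
print(features)
print(targets)
```

Sliding the cutoff date across each customer's history would yield many such rows, which could then be passed to a classifier and a regressor.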
The outcomes of the solution were as follows:
- Classification model: In the dataset generated for classification, approximately 13% of instances were positive, meaning that a purchase occurred. The model's accuracy was around 86%, but with so few positives a model that never predicted a purchase would score about 87%, so accuracy says little here. Precision and recall were approximately 0.65 and 0.35, respectively. The most influential features for classification were the time elapsed since the previous purchase and the popularity of the time period. The best results were achieved for longer timeframes of 30 days or more.
- Regression model: The model's performance was compared to a baseline approach that used the mean interval amount as an estimate for future interval amounts. The baseline had a median error of approximately 60% for intervals of 30 and 90 days. In comparison, the developed model reduced the median error to around 55%, an improvement of roughly 5 to 6 percentage points. The model also provided prediction confidence intervals, capturing the volatile nature of the cash flow amounts.
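The metrics mentioned in these outcomes can be sketched in a few lines of plain Python. The example first shows why accuracy alone is uninformative when about 13% of instances are positive, then computes a median relative error for a mean-amount baseline. The function names and all numbers are invented for illustration, and the exact error definition the team used is not stated, so the relative-error formula here is an assumption.

```python
from statistics import mean, median


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = purchase)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def median_relative_error(actual, predicted):
    """Median of |predicted - actual| / actual, skipping zero actuals.

    Assumed definition; the case study does not spell out its formula.
    """
    errors = [abs(p - a) / a for a, p in zip(actual, predicted) if a != 0]
    return median(errors)


# Toy labels: 13 purchases out of 100 intervals, and a trivial model
# that never predicts a purchase.
y_true = [1] * 13 + [0] * 87
always_negative = [0] * 100
trivial = classification_metrics(y_true, always_negative)
print(trivial)  # high accuracy, but zero precision and recall

# Toy amounts: the baseline predicts the mean of past interval amounts.
past_amounts = [100.0, 120.0, 80.0, 100.0]
baseline = mean(past_amounts)
actual_amounts = [50.0, 100.0, 200.0]
baseline_error = median_relative_error(
    actual_amounts, [baseline] * len(actual_amounts)
)
print(baseline_error)
```

A trained regressor would be scored with the same `median_relative_error` function, so the two medians can be compared directly.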
The following technologies and tools were used to address the problem and implement the solution:
- Python: The programming language used for data processing, feature engineering, modeling, and evaluation.
- scikit-learn: A Python machine learning library used to develop and train the classification and regression models.
- Jupyter: An interactive computing environment used for data exploration, analysis, and model development.
About the Client
Our client is a fintech startup that develops financial products to help companies improve their cash flow forecasting. The client needed to predict whether a purchase would occur within a given timeframe and what the expected amount of that purchase would be. The available data was limited to invoice dates and purchase amounts, with no additional information about the product or service, the company issuing the invoices, or the end customers. The client also wanted to address the cold-start problem, where new customers have little or no historical data. The objective of the project was to model the available data and uncover purchase patterns for end customers, even in these limited cases.