Training never stops…

The most important task for our Data Science Team at Fable Data is to accurately determine the merchants and stock market tickers associated with anonymised consumer credit and debit card transaction strings in order to provide sophisticated insights into consumer spending patterns.

Great modelling needs a great training set, and we described in a previous post how complex building one can be. Identifying merchants in raw transaction data is a multi-class text classification problem. This is not only an interesting intellectual challenge for our Data Science team; it also has a clear commercial application.
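
To make the shape of the problem concrete, here is a minimal sketch of this kind of multi-class classifier. It uses scikit-learn with character n-gram features and a handful of made-up strings; the library choice, features and data are illustrative assumptions, not our production pipeline.

```python
# Toy multi-class text classifier for transaction strings.
# scikit-learn, the character n-gram features and the example strings
# are illustrative assumptions, not Fable Data's production pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled strings: raw transaction text -> merchant class.
train_strings = [
    "CARD PAYMENT TO NETFLIX.COM 12.99",
    "NETFLIX INTERNATIONAL B.V. LUXEMBOURG",
    "AMZN MKTP UK*AB12CD34E AMAZON.CO.UK",
    "AMAZON PRIME*XY98ZW76 LU",
    "SAINSBURYS S/MKT 0123 LONDON GB",
    "SACAT SAINSBURYS SACAT 4567",
]
train_labels = ["netflix", "netflix", "amazon", "amazon", "sainsburys", "sainsburys"]

# Character n-grams cope well with truncated and concatenated merchant names.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_strings, train_labels)

print(model.predict(["NETFLIX.COM AMSTERDAM NL"]))  # -> ['netflix']
```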

Dull work and a lack of a clear business problem are the downfalls of many technical teams. We are far from that place. We always strive to be best in class and constantly improve the quality of our work and the performance of our models.

Very early in the life of Fable Data we decided that we would not rely on external providers to classify transaction strings, for reasons including control, quality and our ability to license our classification services externally. Our diverse Data Partners are Card Issuers, Banks and Fintechs across Europe. We also quickly ruled out an approach centred on rule-based regular expressions (Regex). Regex rules take hours to maintain, strings vary for each Data Partner, and franchised businesses complicate matters further. Transactions also follow different patterns in different countries: for example, the hennes&mauritz and mcdonalds strings we see in German transactions sometimes appear as hetm and macdo in French strings. Scoring with Regex is also computationally expensive compared with machine learning models that can be deployed in Python. And whilst Regex achieves good accuracy, recall and precision for merchants such as Netflix, rule-based tagging performs badly for merchants with a diverse set of services and string patterns, such as Sainsbury’s and Amazon.
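
For illustration, here is a minimal sketch of the kind of rule-based tagging we ruled out, showing how per-country pattern variants multiply. The patterns and example strings are illustrative assumptions only.

```python
# Sketch of rule-based (regex) tagging and why it is brittle.
# The patterns and example strings are illustrative assumptions.
import re
from typing import Optional

# Each country (and each Data Partner) needs its own rules: the same
# merchant appears differently in German and French transaction strings.
RULES = {
    "DE": [
        (re.compile(r"hennes.?&.?mauritz", re.I), "h&m"),
        (re.compile(r"mcdonalds", re.I), "mcdonalds"),
    ],
    "FR": [
        (re.compile(r"\bhetm\b", re.I), "h&m"),
        (re.compile(r"\bmacdo\b", re.I), "mcdonalds"),
    ],
}

def tag(country: str, transaction_string: str) -> Optional[str]:
    """Return the first merchant whose pattern matches, else None."""
    for pattern, merchant in RULES.get(country, []):
        if pattern.search(transaction_string):
            return merchant
    return None

print(tag("DE", "HENNES & MAURITZ 0423 BERLIN"))  # h&m
print(tag("FR", "CB MACDO PARIS 15"))             # mcdonalds
print(tag("DE", "HETM FRANKFURT 07"))             # None - the rule set has to grow again
```

Every new string variant means another hand-written rule, whereas a statistical model can generalise from labelled examples.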

We have experimented, and will continue to experiment, with many modelling techniques, including neural networks and support vector machines. We have had to handle severe class imbalance, such as the large volume of transactions from South West Trains against the small volume from South West Airlines. Generating synthetic strings and borrowing true strings from other training sets have proved useful in solving this challenge. Tuning hyperparameters reduces the time it takes to train models and boosts performance. Stacking the classification results from multiple models boosts performance further and allows us to combine the best elements of multiple classifiers, because some of the smaller classes perform well with one approach and less well with another.
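
As a rough illustration of the last two points, the sketch below weights the rare class more heavily and stacks two different classifiers behind a simple meta-learner. The models, features and toy data are assumptions made for the example, not our production configuration.

```python
# Sketch: class weights for imbalance plus stacking of two classifiers.
# Models, features and the toy data are illustrative assumptions.
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Heavily imbalanced toy training set: many rail strings, few airline strings.
# In practice rare classes can also be topped up with synthetic strings or
# true strings borrowed from other training sets (not shown here).
strings = ["SOUTH WEST TRAINS TICKET %02d WATERLOO" % i for i in range(30)] + [
    "SOUTH WEST AIRLINES 5262123456789",
    "SOUTH WEST AIR DALLAS TX",
]
labels = ["south_west_trains"] * 30 + ["south_west_airlines"] * 2

# Base learners use class_weight="balanced" so the rare class is not drowned out.
svm = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(class_weight="balanced"),
)
logreg = make_pipeline(
    TfidfVectorizer(analyzer="word"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# The meta-learner combines the base predictions, so each class can lean on
# whichever base model handles it best.
stack = StackingClassifier(
    estimators=[("svm", svm), ("logreg", logreg)],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=2,
)
stack.fit(strings, labels)
print(stack.predict(["SOUTH WEST AIRLINES 5262 BOOKING REF 98"]))
```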

And because we do all of the above, our models are 99%+ correct. But training never stops, because banks, payments providers, APIs and merchants will never stop changing the variety of string patterns we need to classify. Data from each Data Partner looks different, and the format of the feed from each bank to the open banking APIs constantly shifts. We must train our models for each new partner we add to the panel, not stand still. Klarna, Amazon Pay and iZettle have jumped from a small proportion of volume a couple of years ago to commonly used payment methods today. Mothercare, Jack Wills and Virgin Trains will disappear, but new merchants will emerge. Fable Data will keep abreast of this rapidly evolving space and provide the single most accurate spending data panel to its clients.

For more information about our data and our Data Science team, you can contact Dr Mark Howland, Chief Data Scientist, mark@fabledata.com