“Data Science meets Data Engineering”
Prior to forming SmartCat, my partners and I had vast experience working with (and for) traditional web development companies. When it came time to start our own thing, we tried to apply the same processes and steps to build a data science team. It was quickly obvious that this would be a long learning process, and we needed to adjust our way of thinking if we hoped to bring AI projects successfully into production.
What made the transition even more of a challenge was the team makeup itself, consisting of engineers and scientists from different backgrounds, speaking completely different (technical) languages. This is the story of how we’ve gotten to where we are today. Having said that, this is not a silver bullet formula since the field is changing rapidly and we react to changes on a daily basis ourselves.
Initial Idea vs Reality
Our vision was to build a Data Science company, helping businesses solve their problems by using smart solutions. Big Data was gaining popularity, and everybody was reading success stories about how machine learning and AI solutions were making companies thrive.
In our early days, most of the companies that contacted us were saying the same thing: “We are collecting data and want to do something smart but we do not have a clear vision. Can you help us?”
This was the inspiration that led us to offer a data strategy workshop. Here, we aimed to help companies use their data and data specifications for exploratory analysis, yielding key insights, firm conclusions, visualizations, and ideas for the future. Unfortunately, while we saw a lot of value in helping companies optimize their business insights and decisions, most of them decided that making sense of their data was not a big enough priority to follow through.
On the other side of the coin, however, our data engineering and data ops teams were in high demand. Recognizing the power of data, companies wanted to equip themselves with the right infrastructure and tools, so we helped in building data processing pipelines and storage solutions.
For this reason, our data science and data engineering/ops teams were working on different projects without much collaboration at the time.
Elaborating on our vision, we wanted to run a company capable of offering end-to-end data products. To us, this meant working on initial ideas with the client, then running workshops with them to create a clear path and action plan, and finally handling prototyping, integration, production readiness, and maintenance. The problem was, those projects were nowhere to be found. We remained patient since we knew there was a gap in the market, and companies that were not big enough to have an internal data science team could benefit from our expertise.
Fast-forward a year, and our long-anticipated end-to-end project appeared: a system that combined user analytics and real-time offers to create highly personalized recommendations, all in a single solution.
Finally, here was a project that was the perfect task for the data science team. Real-time offers were something our engineers had to solve on a daily basis, and since the system also had to run on reliable infrastructure, it was a perfect match for our data ops team to sink their teeth into at the same time.
But things didn’t go as smoothly as we hoped.
Lesson 1: Involve the client in all phases.
As we started the project, we expected the first phase to be easy. A single data scientist started data exploration in Python Jupyter notebooks. We wanted to visualize the data, spot outliers and trends, and get a sense of the size and volume of the data. Absolutely confident in our ability to understand the problem, we hardly involved the client. We attained insights, plotted several graphs, and composed a beautiful document filled with features and ideas about the implementation phase.
Eager to present our findings, we scheduled a call, and that’s when it became abundantly clear to both parties that we had not involved the client enough.
Some findings were not relevant for their domain or their company and we missed important details. We now had to throw out half the report and go back to the drawing board. This was a serious wake-up call to get off our high horse, take notes, and learn our first lesson the hard way.
The second report we prepared was much better, and the client was very happy with the results. We moved forward with one of the proposed approaches and started with prototyping. Again, our lone data scientist got to work. After trying a few different algorithms and settling on the one that made the most sense and gave the best results, we created a web demo and the client started testing. Satisfied with the results, we all moved on to the production phase.
Lesson 2: Involve data engineers with data scientists early – even in the prototyping phase – to avoid double work.
The production phase, in our minds, should have been easy: we had the web demo, so all we needed to do was build the API and integrate it with the final solution. We severely underestimated this phase.
The prototyping code was not ready for production at all. It had been written only by data scientists (I mean no offense), which means they were focused on clustering users and showing the best possible recommendations to each user. They were all about precision, and if a test set showed above 90% accuracy, it was considered a good job. Thinking about the number of users, ways to integrate, dataset size, performance, and availability of the systems around the data science solution is not their usual strong suit, and here it was all overlooked. One example is our use of Flask as the Python web framework. It was a good fit for the prototyping phase, since putting a demo together is quick and easy, but our demo setup ran into problems as soon as more than one user was involved. If we had built it in Django from the start, we could have avoided double work. Also, working with a database dump is not the same as working with the database itself. The individual is not just a user of the system, and this must be taken into account.
Lesson 3: Do more internal knowledge-sharing sessions and adjust the vocabulary and processes.
The way we communicated about problems was not pretty. It took us a while just to explain to each other how the system would work. The data science team could not grasp how data was ingested into the system, what should be cached, or what hurt performance, while the data engineers kept asking how the whole pipeline for calculating recommendations worked. We spoke different languages. When the data science team used the word “bin,” the data engineering team used the word “bucket,” and both teams were talking about the same thing. SmartCat paid the price for not working closely together and not sharing enough knowledge early enough.
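To make the vocabulary gap concrete, here is a minimal Python sketch, assuming the shared concept is the grouping of values into ranges (a histogram bin to a scientist, a bucket to an engineer). The function name, the sample session lengths, and the range boundaries are hypothetical and are not taken from the project.

```python
from collections import Counter


def assign_bucket(value: float, edges: list[float]) -> int:
    """Return the index i of the half-open range [edges[i], edges[i+1]) holding value."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    raise ValueError(f"{value} is outside the configured ranges")


# Hypothetical session lengths in minutes and hypothetical range boundaries.
session_minutes = [1.5, 3.0, 7.2, 12.0, 4.4, 0.5, 9.9]
edges = [0, 5, 10, 15]

# A scientist reads this as a histogram with three bins;
# an engineer reads it as a count of events per bucket.
counts = Counter(assign_bucket(v, edges) for v in session_minutes)
```

Both teams would compute exactly this structure; only the name they give it differs.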
Lesson 4: Do not underestimate the complexity of the system, and speak upfront with the client about all the details needed to make it production-ready.
Another realization during this phase was that we were building a very complex system, and that can’t be taken for granted. It’s imperative to do automation and monitoring well to maintain a successful system. That’s just a fact. In addition, it’s important not only to keep track of tools and infrastructure when working on data science systems, but also to closely monitor how the algorithm works in production (click rate, goals achieved, revenue generated). These are the things we know now and say upfront to the clients, but we certainly did not know them back then.
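As a rough illustration of monitoring the algorithm itself rather than only the infrastructure, the sketch below keeps running totals for the three signals named above (click rate, goals achieved, revenue generated) and flags the algorithm when click rate drops too low. The class, the function, the thresholds, and the simulated traffic are all assumptions made for this example, not a description of the system we built.

```python
from dataclasses import dataclass


@dataclass
class RecommendationStats:
    """Running totals for one recommendation algorithm in production (hypothetical)."""
    impressions: int = 0
    clicks: int = 0
    goals: int = 0
    revenue: float = 0.0

    def record(self, clicked: bool, goal_reached: bool = False, revenue: float = 0.0) -> None:
        """Register one shown recommendation and what the user did with it."""
        self.impressions += 1
        self.clicks += int(clicked)
        self.goals += int(goal_reached)
        self.revenue += revenue

    @property
    def click_rate(self) -> float:
        """Clicks divided by impressions; 0.0 before any traffic has been seen."""
        return self.clicks / self.impressions if self.impressions else 0.0


def needs_attention(stats: RecommendationStats, min_click_rate: float,
                    min_impressions: int = 100) -> bool:
    """True once enough traffic has been seen and click rate is below the threshold."""
    return stats.impressions >= min_impressions and stats.click_rate < min_click_rate


# Simulated traffic: 200 recommendations shown, every 50th one clicked.
stats = RecommendationStats()
for i in range(200):
    stats.record(clicked=(i % 50 == 0))

alert = needs_attention(stats, min_click_rate=0.05)
```

In a real deployment these counters would more likely be fed into whatever monitoring stack the data ops team already runs; the point is only that business-level signals are tracked next to CPU and latency.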
Today we know what types of engagements exist, which steps are needed for each phase, and the team members required to successfully finish a project. The first phase is vital; make sure to spend time to understand the client’s needs and business, with communication being the key ingredient. Face-to-face workshops help build a strong relationship with the customer. Show your clear understanding of their problem, and have an implementation plan in place. SmartCat learned a lot from that first project and it led to the clearly defined process for data science projects that the team has now.
This is the foundation of all other phases, so take the time to do it right.
Today, SmartCat divides projects into three groups: data insights, prototyping, and production. For each group, it is important to have people from different teams working together to achieve a successful delivery. We’ve also developed an additional, essential role on our technical team that we call the Data Wrangler. This is not a data scientist with rich knowledge of algorithms and approaches, but someone with strong business knowledge: a professional analyst armed with Jupyter and Python skills. The Wrangler plays an important role in the initial phase, works with clients in the business domain, and extracts key findings from data. The Wrangler’s main goal in the initial phase is to come up with the best possible idea to improve our client’s business.
We have also improved our internal communication. Knowing that a bucket for an engineer is a bin to a scientist, and learning the rest of each other’s vocabulary, has made us more efficient. Being able to explain solutions to team members who don’t “speak our language” allows us to do a better job in the end. Adjusting our language to better suit business stakeholders concerned with the bottom line is not always easy, so we started practicing internally. As a result, even our diagrams and visualizations look much better now, since people from different teams can understand what is going on more naturally.
I’m definitely proud of how far SmartCat has come after the way we started. Not everybody has enough room to learn from their mistakes, so we were fortunate there, and we learned a lot. Our successes since then have been really encouraging but there’s still a long road ahead of us.
As I said at the beginning, this is a field that is changing all the time, so I expect this is not the end of our learning curve. What is certain is that we are a much better company now than we were five years ago, thanks to projects done with mixed teams.
Written by: Nenad Bozic
March 3, 2023