Effective Collaboration Between Data Engineering and Data Science Teams
DTP #8: Q&A with Aby Jerin
We spoke to Aby Jerin, Principal Data Engineer at ThoughtSpot, about how data engineering teams and data science teams can collaborate effectively.
Quotations have been lightly edited for concision and readability.
How do you work with the data scientists in your organization? Are they your main collaborators?
“[For example] in ThoughtSpot, [data engineering] is centralized, all the data comes into one single data lake and then to different teams. So each team has their own data scientists or analysts. Like [the] product team has their growth team, has their own GTM team and so on. But the raw data, data requirements are handled by [data engineering].”
On ensuring communication with distributed data scientists:
“We do have daily questionnaires [to understand] what is missing. If there's anything going wrong with the analysis, if the modelling that the data scientists do actually makes sense. Because you can do your modelling, [but] if the data in the variables are wrong, then your entire model is wrong.”
Data engineering is essential for getting actionable insights from data:
“So, if your data is 100% correct and quality is good, then 80% of the data scientist's work is done.”
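The point about wrong variables invalidating a model can be made concrete with a small pre-modelling check. The following is a minimal sketch, not a description of ThoughtSpot's actual tooling: the `check_quality` function, the `QualityReport` class, and the column names `user_id` and `sessions` are all hypothetical, chosen only for illustration. It counts missing and out-of-range values so that problems surface before a data scientist starts modelling.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    """Summary of basic quality problems found in a batch of rows (hypothetical)."""
    total_rows: int
    missing: dict       # column name -> number of rows where the value is None
    out_of_range: dict  # column name -> number of rows outside the allowed range

    @property
    def is_clean(self) -> bool:
        # Clean means no missing values and no out-of-range values anywhere.
        return not any(self.missing.values()) and not any(self.out_of_range.values())


def check_quality(rows, required, ranges):
    """Count missing required values and out-of-range numeric values.

    rows:     list of dicts, one per record
    required: column names that must not be None
    ranges:   column name -> (low, high) inclusive bounds
    """
    missing = {col: 0 for col in required}
    out_of_range = {col: 0 for col in ranges}
    for row in rows:
        for col in required:
            if row.get(col) is None:
                missing[col] += 1
        for col, (low, high) in ranges.items():
            value = row.get(col)
            # Missing values are counted above, so only range-check real values.
            if value is not None and not (low <= value <= high):
                out_of_range[col] += 1
    return QualityReport(len(rows), missing, out_of_range)


# Illustrative data only: one missing value and one impossible value.
rows = [
    {"user_id": 1, "sessions": 4},
    {"user_id": 2, "sessions": None},
    {"user_id": 3, "sessions": -2},
]
report = check_quality(
    rows,
    required=["user_id", "sessions"],
    ranges={"sessions": (0, 1000)},
)
print(report.is_clean, report.missing, report.out_of_range)
```

A report like this could be attached to the daily check-in described above, so that both teams look at the same numbers when deciding whether the data is ready for modelling.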
You mentioned data scientists distributed among different teams, like the product team. So a decentralized structure, with a data engineering team being a single unit?
“As a company grows, [it may be more ideal to] go into a data mesh strategy where every team will own their data and every team will have their own data engineers and data scientists. Yeah, for a bigger company, say 5000 plus employees, then this would be a requirement. But for mid size and small companies, startups specifically, centralized data teams are enough.”
And for a data engineering team, even if the company grows, it's OK for that particular aspect to be centralized?
“It depends. I have seen that in bigger companies, even the data engineering, the entire data team is decentralized or complete data mesh. So each team will have their own data team with data engineers. But, completely decentralized depends on the data sources.”
A larger company may be better suited to a distributed structure:
“So if you look at a bigger company’s GTM Team, the go to market team itself will have a lot of data sources. Everything will be on a much larger scale. That centralized team won't be able to handle it, and the same goes with the product. If it's a product based company, the data volume, event data that you get from different sources, your survey data, your in-house product tracking data, those all will grow for a bigger company as compared to a startup. So a decentralized data mesh will make sense then.”
What does the communication look like between the engineering team and the science team?
“We have tools like JIRA and mostly it's agile methodology. [Slack too] and then whatever is on Slack, we record it on JIRA so that there's tracking. But yeah, that's how the communication goes. Again it might be different for a company with thousands of employees.”
Starting out, how do you go about finding out what exactly the data science team needs from you?
“The first thing is to set up a call with the data science team or scientist, along with the business users, the end users who are going to use it, and try to find out the exact requirements. Then bring in the data, whether it is a new source altogether or it is adding more field calculations to the existing data. So the requirement analysis is the most important aspect. Once you nail that, you bring the correct data, then it's all smooth.”
Business users and product managers are essential in specifying use cases for data:
“Then the [business users] work is always between the data science and the data engineering team.”
What comes to mind in terms of how business leaders can better set expectations for data?
“Let's say a PLG (product-led growth) leader comes in and says I want to look at how many people clicked on a button inside a product on, say, the fourth screen. That's a US that might be a legit ask. You want to see how many people clicked on it, but if there's no tracking of that entire page and there's no UN being passed at the back end and the data science team is not aware of this. That's why the entire team has to be on the same page. Data science, engineering and the business users to set expectations.”
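The button-click scenario suggests a simple check that can run before anyone promises an analysis: compare the events the business asks about against the events the product actually emits. This is a minimal sketch under stated assumptions; the `untracked_events` function, the idea of a "tracking plan" set, and the event names are hypothetical and not taken from the interview.

```python
def untracked_events(requested, tracking_plan):
    """Return the requested events that the product does not emit yet.

    requested:     event names the business wants analyzed, in request order
    tracking_plan: set of event names that are currently instrumented
    """
    return [event for event in requested if event not in tracking_plan]


# Hypothetical tracking plan: only the first two screens are instrumented.
tracking_plan = {"screen_1.signup_click", "screen_2.next_click"}

# Hypothetical request from a business leader.
requested = ["screen_2.next_click", "screen_4.button_click"]

gaps = untracked_events(requested, tracking_plan)
print(gaps)
```

Any event that appears in `gaps` needs instrumentation work from the engineering team before the data science team can answer the question, which is the kind of expectation worth setting up front.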
Would you say the data engineering team works equally with the product and the data science teams?
“Yeah, that's our daily life. So we work with the engineering team mostly in our case, for applications which are built by engineers, especially for the product data. For other [data] the majority of work we do is for the GTM Team, [working] with third party tools like Salesforce and HubSpot. Going into third party tools, getting data from CRMs or services. But for any in-house, especially for a product where we have to analyze how the product is being used for SU, for user adoption pricing and other things, we always work with the engineering team.”
For a small company that’s just getting started with getting insights from data, what would you say the first steps would be in terms of who they should hire?
“So if you're starting a company, the first step will be to build the product, so you hire an engineering team, you build a product, right? And then [for a data driven company] if you want to start analyzing your data, [you would need] to get the data in place. You need that infra setup so that your data is flowing into a place where your data scientists can analyze the data. So, first will be data engineering. Then there is data visualization. Then there is data analytics. Then comes data science. It's a flow of events.”
This flow of events may not always be the same, depending on the needs of a business:
“[In some cases] you can go from data engineering to data science directly, you don't need visualization or analytics in between. It's all up to the [business] leader. Data engineering is the necessary step.”
In the cases where we find that businesses don't have this data or haven't set processes in place, what exactly are they doing wrong at that stage?
“Say if you don't have data right and you hire a data scientist to do most of the work: bring the data in, clean the data up, get it correct so that they can run their model on top of it, train it. You are putting an additional burden on the data scientist where they should have trained the model, made some amazing data points available for the team to analyze. But then [they are] spending most of their time cleaning the data.”
Delineating responsibilities and bringing in professionals that can set up data infrastructure is almost a necessity:
“On the other hand, if there was a data engineering team set up it would have been very easy for them.”
What do you feel about outsourcing data science and data engineering?
“[There are examples of] outsourcing companies, the major ones like Accenture and Cognizant and all those that provide data teams, data project setup. But in today's world, I think it's all in house. It's much faster. It's more dependable, less red tape in between.”
Our conversation with Aby Jerin highlights:
What effective communication between data engineering, data science and product teams looks like.
How the ideal structure of data teams can vary depending on the size of an organization and availability of data sources.
The need for business leaders to set realistic expectations and objectives for data, along with delineating data roles.
See you next week,
Do you have a unique perspective on developing and managing data science and AI talent? We want to hear from you! Reach out to us by replying to this email.