Data Talent Pulse
💻 Data Silos and Associated Problems, The Power of Network Science

DTP #20: Q&A with David Knickerbocker

Mukundan Sivaraj
October 13, 2023

“There's limited usefulness of a data scientist who can only build a model because many, many people can build models these days.”

We spoke to David Knickerbocker, Chief Engineer & Co-founder at VAST-OSINT, and author of Network Science with Python, to get his thoughts on the value network science can bring to businesses, the challenges that come with silos for data teams, bridging gaps between data science, engineering and operations, and the pitfalls of Generative AI.

Quotations have been lightly edited for concision and readability.

🌐 From the Web

Refresher: What are Data Silos and What Problems Do They Cause?

The Dangers of Data Silos: A Data Scientist’s Thoughts on Locking Away Data

“Data silos stifle innovation, locking away vital insights and complicating data science. Recognizing their impact is essential.”

To empower data scientists and unlock valuable insights, data silos must be broken down through data integration. Strategies include data governance, advanced integration tools, and cross-functional collaboration.

How Network Graphs Connect The Dots In A World Of Ever-Expanding Data

“As business leaders, we know we cannot make informed predictions without understanding the complex interactions at play.”

Traditional data analysis tools fall short in capturing complex interactions and breaking down data silos. Network graphs are emerging as a solution to visualize connections between various data dimensions, offering a comprehensive view of relationships and segments.

It’s really fascinating because I've always wondered about how things were connected to each other. Documenting that seems like what networks are about, right?

That's right, networks seem to me kind of like the hidden hand that pushes everything in the universe. I see them everywhere I look. In network science, for example, people often talk about power laws. In the last post I did for 100 days of networks, if you look at the Jupyter Notebook, there is a part where I'm looking at the degree distribution of nodes. A little further down in the code, I did the same thing for connected components, and I had a realization over this weekend that connected components also seemed to follow a power law. I will definitely be looking into that as I investigate various networks.

I used to work in data operations, and being able to see how software works through networks gives you a deeper understanding of the software too.

So, it's very fascinating and I believe that networks are just pervasive in everything in the universe, and yet it's rare when people teach how to actually analyze these networks. So, my goal is to kind of push science. Data science has so much hype and I try to just kind of push it towards usefulness.
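For readers who want to try the two measurements mentioned above, here is a minimal sketch that counts a degree distribution and connected component sizes. It uses only the Python standard library rather than any particular network library, and the edge list and function names are hypothetical, made up for illustration. It is not the code from the notebook David refers to.

```python
from collections import Counter, deque

def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_distribution(adj):
    """Map each degree value to the number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())

def component_sizes(adj):
    """Return connected component sizes, largest first, using breadth-first search."""
    seen = set()
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical toy edge list: one hub-and-spoke cluster plus two small fragments.
edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b"),
    ("x", "y"),
    ("p", "q"), ("q", "r"),
]
adj = build_adjacency(edges)
print(sorted(degree_distribution(adj).items()))  # (degree, node count) pairs
print(component_sizes(adj))                      # component sizes, largest first
```

A toy graph this small cannot show a power law; on a real network you would plot these counts on log-log axes and check whether they fall roughly on a straight line.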

I get the sense from what you're saying that people don't necessarily realize how important networks are. Do you find that that's true in the business setting?

I believe that is absolutely true, and in fact I used to get pushback. Most people don't know how to look at network visualizations because they're very complicated. If you just visualize a whole network, it looks like a spider web. You know, it's really, really complicated and there are no insights that you can pull from it because it has so much going on. But when you use the techniques that I show [in my writing], it's basically like peeling an onion and looking at the different layers.

Like when I teach network analysis, I often say start at the core because the core should be what the network is all about. It shows you what's influencing the network. Or, if you peel the network from the outside in, then you also get to see some very interesting dynamics too. You [just] start thinking differently [about everything].
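One common way to make the "peeling an onion" idea concrete is k-core decomposition, which repeatedly strips away the least-connected nodes until only the densest core remains. Whether this is the exact technique David teaches is an assumption on our part. The sketch below is a simple, unoptimized version in plain Python, and the graph and function names are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def core_numbers(adj):
    """Assign each node a core number by peeling the lowest-degree node first."""
    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the fewest remaining connections.
        node = min(remaining, key=lambda n: degree[n])
        # The core number never decreases as peeling moves inward.
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adj[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core

def layers(core):
    """Group nodes by core number, from the outer layer to the innermost core."""
    grouped = {}
    for node, k in core.items():
        grouped.setdefault(k, []).append(node)
    return {k: sorted(nodes) for k, nodes in sorted(grouped.items())}

# Hypothetical graph: a tightly knit group of four, one hanger-on, one outsider.
edges = [
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("e", "a"), ("e", "b"),
    ("f", "e"),
]
adj = build_adjacency(edges)
core = core_numbers(adj)
for k, nodes in layers(core).items():
    print(k, nodes)
```

Reading the output from the highest core number down corresponds to starting at the core; reading it from the lowest number up corresponds to peeling from the outside in.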

About your journey from working in data operations to shifting into ML and networks: was there a point where you made that decision?

When I worked in data operations, I saw many people just using simple addition to troubleshoot problems. Counting the number of occurrences of a thing happening but never actually mapping out how software itself works. I decided that I was going to just jump in and learn what data scientists had to do because we weren’t getting any [support] and this eventually became an obsession for me. As I got better and better at that, management took notice and eventually I got pulled onto their AI research team as a senior platform engineer and that position is what they nowadays call ML engineering.

Because we got very little help from data scientists (which is no fault of the data scientists; it's the fault of management), I had to go figure it out by myself [without a mentor].

But we were able to take what we learned, and we were able to migrate servers and troubleshoot problems faster because we applied data science to operations. So when you actually see how useful this can be, beyond the machine learning model and actually used for solving real operations problems, you can't go back to thinking small. And once you learn [network] science, you can't go back to thinking about lists. You're always going to think of graphs.

What do you think could have been done [by management] so that there was a shared responsibility rather than you just having to learn everything you needed?

I think that the problem is silos. Before 2013, people talked about statisticians, or they talked about ETL. They didn't really talk about data engineering, [or] about data scientists. When I moved [back] to the United States in 2013, data scientists were immediately held differently than everybody else. It's like they were this magical unicorn. They were the only people given responsibility for machine learning, and everybody else had to just kind of support it. It was very siloed. Probably even to this day there are still silos where you have the data operations people that make sure stuff works and you have the data scientists that are building models.

But there's no reason for that. Because there are smart engineers that can do data science and there are smart data scientists that can do engineering and these people should kind of serve as the bridge between the two things.

There should be no gaps. The people that are ambitious and want to learn both, make those your leaders. Because if you can do both and you can tie the two teams together, problems get solved. If you don't do that, then you've got silos and you've got a very expensive data science team and you've probably got very burnt out and angry data scientists because their models don't reach production.

In your current work, do you find that you set up things in a different way? What does it look like right now?

I can't give too many details, but I think of it as what comes after cyber security. The Internet made malware a huge threat, and so companies like Malwarebytes and McAfee and others like that were created. But those are threats of the 90s, in my opinion. You know, there are new threats that need to be addressed. And so, I started a company to do it and I take my security knowledge into my company.

But as I built this company, indeed we don't have silos. We're a very young company, so we don't always have employees, but when we do, there are no silos, there's no dead weight. Everybody has to know what they're doing.

So, when I think about [people] that I would love to eventually bring into my company, it's the data scientists that are solid with engineering or the engineers who are solid with data science. There's limited usefulness of a data scientist who can only build [a] model because many, many people can build models these days. The age of snobbery, I think, is over for data science. It's time to actually just solve problems, and silos don't help at all with that. And as I am a data scientist and a data engineer and an engineer, period, I get to not make the mistakes that we made in other companies. So far, it’s working.

It seems like if you're a smaller company that's just starting out, it's easier to keep everyone updated on what everyone else is doing. And thereby reduce silos. But if you scale, that seems like it might be more of a problem. Is that something you’re anticipating?

Yeah. What you're describing is a network effect. As you add complexity, then things become more difficult and unmanageable. If you think about my company right now, it's very few people. It’s very easy to know what's going on [as] we don't share the same strengths. [For example,] I'm not super strong in business, and the business side can't do the data science side. We know what each other is doing.

But let's say that we grew as a company. I have to be very careful about the very first managers [and] the first engineers [we hire] because that's where company culture is going to start. If I hire overly controlling managers, that's going to mess things up entirely. And if I hire narcissistic but very talented engineers, nobody's going to want to work in the company, you know? You have to be very careful to bring on people that want to build the company that you're building, but also aren't going to poison the future of the company. And so, it's a lot to think about for founders.

AI has gotten so much news attention, especially in the last year. Do you find that it's changing your work? Is it affecting how you go about things?

Yeah. It's very interesting. I've been vocal about generative models ever since I started using them [and] creating them several years ago, and to me, they're the coolest thing in the world. But then ChatGPT came out and it really kind of took me by surprise because people were not talking about it as a generative model. They were talking about it as a replacement for search.

If you use it for search, you're setting yourself up for problems because a generative model [produces output] based on seed text. If it was a terrible model (ChatGPT is probably not that bad), it's going to predict what words should appear based on probability, and they might not necessarily need to appear. Search, for instance Google, has identified the content and prioritizes pointing people to the actual piece of content rather than predicting words. ChatGPT does not do that.

My path with ChatGPT was super-duper excited and then super-duper afraid all at once because I was watching people on LinkedIn react dangerously, mistaking generative technology for a replacement for search. ChatGPT and LLMs can be a useful front end for content [and] search. But I do believe that the back end still must be some search-related technology, but I'm also accepting that I could be completely wrong.
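The contrast David draws between predicting words and pointing to existing content can be illustrated with a deliberately tiny sketch. It assumes a toy bigram model that always picks the most frequent next word, which is far simpler than how ChatGPT or any real LLM works, alongside a toy inverted index standing in for search. The documents and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus standing in for "the content".
docs = {
    "doc1": "the server crashed after the update",
    "doc2": "the server recovered after the restart",
    "doc3": "the update fixed the server",
}

# Generative side: count which word follows which across all documents.
next_words = defaultdict(Counter)
for text in docs.values():
    tokens = text.split()
    for current, following in zip(tokens, tokens[1:]):
        next_words[current][following] += 1

def generate(seed, n_words):
    """Extend the seed with the most frequent next word (ties broken alphabetically)."""
    words = [seed]
    for _ in range(n_words):
        candidates = next_words.get(words[-1])
        if not candidates:
            break
        words.append(min(candidates, key=lambda w: (-candidates[w], w)))
    return " ".join(words)

# Search side: an inverted index from each word to the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

def search(query):
    """Return the ids of documents that contain every word in the query."""
    matches = [index.get(token, set()) for token in query.split()]
    return sorted(set.intersection(*matches)) if matches else []

generated = generate("update", 4)
print(generated)                      # a probable word sequence
print(generated in docs.values())     # whether any document actually says it
print(search("update server"))        # ids of real documents on the topic
```

The generated sentence is assembled from likely word pairs and may not appear in any source document, while the search function can only return documents that actually exist. That is the gap described above, and it is one reason a generative front end paired with a search back end is a plausible design.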

It seems like a lot is changing. There's a lot happening at a very fast rate. So I guess it might be hard to predict.

Yeah, also, we're all wrong sometimes. Sometimes we give people bad information, but we do our best to not do that. And sometimes our bias gets in the way.

I do come from cyber security, so I've got a bit of that kind of pessimistic side. Some of that rubs off on you when you work in a domain for so long. But it's [with] good intentions. I want people to use the appropriate tool for the appropriate job, and a generative model is great for generating content. But ChatGPT has already shown that it has problems [such as hallucinations] when it's used for search.

💻 Platform Highlight

Fivetran: A cloud-based data integration platform that automates the movement and transformation of data between sources and destinations.

dbt (data build tool): An open-source platform for executing SQL-based data transformations. It also handles dependency mapping and schema compilation.

Denodo: A platform that connects distributed data through semantic models that decouple data from its location and physical schemas.

💼 AI in Business

Overcoming Healthcare Data Silos with AI

According to RBC Capital Markets, about 30% of the world’s data is generated by healthcare, more than any other industry. With that much data in play, weak communication and coordination are significant problems for the healthcare industry.

A recent Forbes article examined the impact AI may have in breaking down data silos in healthcare:

AI can help address these issues by making sense of fragmented information, leading to more efficient operations.
Initial AI applications may focus on areas with lower IT consequences and less daunting regulatory barriers.
The life science sales and marketing sector is a promising target for AI intervention, as it is often characterized by dysfunction and siloed teams.
AI can assist sales representatives in prioritizing efforts and enable personalized messaging based on various data sources, including scripts, social media activity, demographics, and more.
AI can handle the complexity of data analysis and help tailor messages to physicians in a modular, algorithmic way while adhering to regulatory restrictions.
By rectifying data imbalances and improving data utilization, AI has the potential to deliver significant benefits to healthcare stakeholders.

🤖 Prompt of the week

Write a brief, friendly, and easy-to-read retargeting email from [Company] for users who haven’t visited the website within the last 3 months. Include a prominent CTA that directs the user to return to the website.

Company = [Describe products/services/features here]

See you next week,

Mukundan

Do you have a unique perspective on developing and managing data science and AI talent? We want to hear from you! Reach out to us by replying to this email.
