Demystifying data science: origins of the field

Confession: In my most recent role, I was the head of Data Science for a public company. But if a stranger asked me what I do, I’d probably describe myself as an analyst rather than a data scientist. If they asked me “what is data science?”, I’d probably answer “it depends.”

Lately I’ve been reflecting on my discomfort identifying as a data scientist, and what it means to be a leader in this fast-evolving (and often misunderstood) field.

But first, let’s start at the beginning.

By the numbers, “data science” started to take off in 2012. As Google Trends shows, searches for both “data science” (“DS”) and “machine learning” (“ML”) grew in tandem with an explosion of interest in “big data” in 2012. But whereas searches for “big data” peaked in early 2015, “data science” and “machine learning” continued to grow. In the US, both disciplines surpassed “big data” in popularity in early 2016. After plateauing during the early days of COVID, searches have spiked 50% in the past year.

There’s an obvious correlation between “supply” and “demand” here.

As it became cheaper to build apps and websites, more businesses moved online. This created the opportunity to collect more data, which in turn required businesses to invest in data storage. But data infrastructure couldn’t keep up: businesses were frustrated by how costly and slow it was to query exponentially growing datasets, which created a market opportunity for cloud data warehousing solutions like Redshift, Snowflake, and BigQuery.

Unfortunately, even after all of these investments, data still couldn’t speak for itself. Humans (specifically, those with the ability to write code) were required to give it a voice. Only then could companies get a return on their investment. As a result, interest shifted from data assets to data people.

From data assets to data people

My view is that companies collect data for two reasons: to gain a competitive advantage today, or to serve as a moat for tomorrow.

How companies use data varies. Some build it into the product (think: search engines, logistics platforms, personalized feeds), whereas others use it to size opportunities or make smarter decisions (think: operational and marketing efficiency, expansion into new product verticals). Although most companies have a predisposition towards using data for products vs. decisions, more often than not, they hoover it up without specific plans for the future.

I’d argue that ML and DS align to these broader predispositions. As an established subfield of artificial intelligence, “machine learning” has the mission of helping computers detect patterns or make predictions without requiring input from a human. In contrast, “data science” is murkily defined as the process of “extracting and extrapolating knowledge from data, and then applying it to domains.” “Treasure” is “coaxed out of messy data” by people, not machines.

Initially, search trends suggested that “data products” (ML) would prevail over “data-driven decisions” (DS). On a global basis, the difference between ML and DS was even more pronounced: internationally, machine learning eclipsed “big data” nearly a year before data science reached similar popularity.

Yet today, the two disciplines capture equal mindshare globally and in the US. Why?

A human interface to data

Clues can be found in related search terms. After lagging behind “data science,” “data analytics” has regained momentum, with US searches nearly doubling in the past 18 months. Searches for newer concepts focused on improving data foundations, such as “data platform,” “data governance,” and “data product management,” are also growing. Further specialization of data roles suggests that businesses are still looking for ways to extract value from their data investments – and that neither machine learning nor data science alone has been sufficient to achieve ROI.

Another clue, albeit more puzzling: searches for “business intelligence” have remained flat throughout the advent of big data – as have searches for “metrics,” “data modeling,” and “data visualization.” If mountains of data have forced businesses to invest in expensive cloud data warehouses and specialized talent, wouldn’t employee interest in interacting with that data also increase?

SELECT DISTINCT ()

My theory is that “data science” has become a stand-in for a skillset, distinct from engineering, that manipulates big data and extracts insights from it. In other words: employers are looking for a human interface to data, either in addition to or instead of machines.

In my opinion, this is why many folks with deep stats backgrounds and/or advanced coding skills have migrated to new job titles that clearly manage expectations around their work products, such as “machine learning engineer,” “AI research engineer,” and even “statistical engineer.” It’s widely understood that engineers build products that are scalable and automated. Projects are long-term, and business requirements are mediated by PMs, EMs, or TPMs. Not everyone wants to be the human interface.

It’s also why “business intelligence” has fallen out of vogue, even though employers remain ravenous for “data insights.” The skills required to build snappy dashboards that query billions of rows are more coding-intensive than using a GUI to build visualizations with datasets that could fit into a (large) spreadsheet. Most businesses ingest data from a wide variety of sources, which means that individual datasets need to be cleaned, restructured, and joined into a single format that can be analyzed. Business analysts who need software to manipulate large datasets struggle to quickly transform such a wide variety of inputs into the outputs needed to generate insights.
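To make "cleaned, restructured, and joined" a little more concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the two sources (a web orders export and a CRM list), the column names, and the values are assumptions, not data from any real system. In practice this kind of work is more often done in SQL against a warehouse, but the shape of the problem is the same.

```python
import csv
import io

# Hypothetical raw inputs: the sources, columns, and values are made up.
ORDERS_CSV = """order_id,customer_email,amount
1001, Ana@Example.com ,$12.50
1002,bo@example.com,$7.25
1003,ANA@example.com,$30.00
"""

CRM_ROWS = [
    {"Email": "ana@example.com", "Region": "West"},
    {"Email": "bo@example.com", "Region": "East"},
]


def clean_email(raw):
    """Normalize emails so the same customer matches across sources."""
    return raw.strip().lower()


def load_orders(text):
    """Clean: trim whitespace and turn '$12.50' strings into numbers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "order_id": int(row["order_id"]),
            "email": clean_email(row["customer_email"]),
            "amount": float(row["amount"].strip().lstrip("$")),
        })
    return rows


def revenue_by_region_from(orders, crm_rows):
    """Restructure the CRM export into a lookup, then join and total."""
    region_by_email = {clean_email(r["Email"]): r["Region"] for r in crm_rows}
    totals = {}
    for order in orders:
        # Orders with no CRM match are kept under "Unknown", not dropped.
        region = region_by_email.get(order["email"], "Unknown")
        totals[region] = totals.get(region, 0.0) + order["amount"]
    return totals


revenue_by_region = revenue_by_region_from(load_orders(ORDERS_CSV), CRM_ROWS)
print(revenue_by_region)
```

Even in this toy case, most of the code deals with inconsistencies between sources (stray whitespace, mixed-case emails, currency symbols) rather than with the analysis itself, which is the part a GUI-only workflow tends to struggle with.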

This is a key point for people who are thinking about joining the field: writing code (usually SQL) and basic knowledge of statistics are necessary but not sufficient conditions to become a successful data scientist in the eyes of a modern employer.

Whether work is delivered through a predictive model, a dashboard, or a deck, the expectation is that insights from projects will be crafted or contextualized, at least in part, by a human. This is because impactful insights require business context, detailed knowledge of how data was produced, and effective communication (both to gather requirements and to share results). I’ll cover the ways that data scientists and analysts form this bridge between data and people – and how to detect an employer’s desired flavor of “data science” – in my next post.

Roody
