An Introduction to Data

The subject of data can be a bit dry, and on their own data don’t offer much. Their value lies in meaningful interdisciplinary application, whether planning finances or trying to decipher what you are being told in the age of bullshit.

But to get there, you have to begin with the basics. So, what are data?

Data can be considered as observations or measurements. They are what is processed and analysed to gain insight into a particular problem and to help develop strategies to solve it.

They can be divided into two broad types: qualitative and quantitative.

Qualitative data

Qualitative data is non-numerical, such as hair colour, species or even market sentiment; it describes rather than counts.

It can appear subjective, particularly where descriptions haven’t been standardised. Two researchers may describe the same thing differently, making it difficult to compare or analyse consistently. Standardising qualitative data is an attempt to turn descriptions into categories that can be used reliably.

Qualitative data can be divided into nominal data, which has no natural order, such as eye colour, and ordinal data, which does have a meaningful order but no consistent numerical distance between categories, as seen in satisfaction ratings like poor, fair or good.
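To make the nominal and ordinal distinction concrete, here is a minimal Python sketch. The observations and the names `eye_colours`, `ratings` and `RATING_ORDER` are invented for illustration. It shows that nominal data is naturally summarised by counting, while ordinal data can also be sorted once the order is stated explicitly.

```python
from collections import Counter

# Hypothetical observations, invented for illustration.
eye_colours = ["brown", "blue", "brown", "green", "brown"]
ratings = ["good", "poor", "fair", "good", "good"]

# Nominal: there is no order, so counting is the natural summary.
eye_counts = Counter(eye_colours)
most_common_eye = eye_counts.most_common(1)[0][0]

# Ordinal: the order is meaningful, so it has to be stated explicitly.
RATING_ORDER = ["poor", "fair", "good"]
rank = {label: position for position, label in enumerate(RATING_ORDER)}

# Sorting by rank lets us pick the middle rating (the list has odd length).
sorted_ratings = sorted(ratings, key=rank.get)
median_rating = sorted_ratings[len(sorted_ratings) // 2]

print(most_common_eye)
print(sorted_ratings)
print(median_rating)
```

Note that the sketch takes a middle value rather than an average of the rank positions. Averaging would assume that the step from poor to fair is the same size as the step from fair to good, which ordinal data does not guarantee.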

Quantitative data

Quantitative data has a numerical value that can be counted or measured.

It could be as simple as weight in kg or the frequency of observations in a local butterfly count.

Quantitative data can be further described as either continuous or discrete.

Continuous data can take any value within a range and is used for measured variables such as temperature or Formula 1 lap times. It can be recorded to any number of decimal places, creating an effectively infinite range of possible values.

Discrete data, on the other hand, takes separate, countable values that cannot be subdivided into parts, such as the number of tourists that visit a city. Counts like this are whole numbers, as it’s not possible to have half a tourist.

It is worth noting that the distinction between discrete and continuous is not always about the measurement itself but about the context. A gym weight is recorded in fixed increments, making it discrete in practice, but somebody who can lift 100 kg might in fact be able to lift 101.75 kg. The underlying physical capacity of the person is continuous.
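The gym example can be sketched in a few lines of Python. The 2.5 kg increment and the function name `recordable_lift` are assumptions made for illustration, not a standard. The sketch maps a continuous capacity onto the nearest loadable weight that does not exceed it.

```python
import math


def recordable_lift(capacity_kg: float, increment_kg: float = 2.5) -> float:
    """Return the largest loadable weight that does not exceed the true capacity.

    The default increment of 2.5 kg is an assumed value for illustration.
    """
    steps = math.floor(capacity_kg / increment_kg)
    return steps * increment_kg


# The underlying capacity is continuous...
true_capacity_kg = 101.75

# ...but what gets written down is limited to fixed increments.
recorded_kg = recordable_lift(true_capacity_kg)
print(recorded_kg)
```

Under these assumptions a capacity of 101.75 kg is recorded as 100.0 kg, so the recorded value is discrete even though the quantity behind it is not.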

Application of data

Data is a universal topic. In marketing it is used to define and measure KPIs, in ethical studies it is used to detect bias in systems, and in machine learning it can help develop algorithms to better understand species distribution.

The field in which it is applied and the problem that needs solving will determine the depth of understanding required to make the most of it, which could mean exploring algebra, statistics and probability.

But to describe, compare and draw conclusions from data requires the same underlying foundations: knowing what data is, how it is measured and classified, and how to summarise it in ways that are actually useful.
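As a small illustration of why the choice of summary matters, the sketch below uses Python’s standard `statistics` module on a made-up set of weekly butterfly counts. The numbers and variable names are hypothetical. It compares the mean with the median when one week is unusually high.

```python
import statistics

# Hypothetical weekly butterfly counts; the last week is unusually high.
weekly_counts = [12, 15, 11, 14, 58]

mean_count = statistics.mean(weekly_counts)      # pulled upwards by the 58
median_count = statistics.median(weekly_counts)  # the middle value once sorted

print(mean_count)
print(median_count)
```

Here the mean is 22 while the median is 14. Neither is wrong, but they answer different questions, and which one is useful depends on what the summary is for.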

Data quality

Making good decisions based on data requires that the data are reliable. Data can be incomplete, inconsistent, outdated or unsuitable for answering the question being asked.
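Some of these problems can be checked mechanically. The following Python sketch is one possible approach under stated assumptions: the records, the field names (`site`, `count`, `recorded`), the one-year age limit and the function `quality_issues` are all invented for illustration. It flags records that are incomplete, inconsistently labelled or outdated.

```python
from datetime import date

# Hypothetical survey records, invented for illustration.
records = [
    {"site": "meadow", "count": 14, "recorded": date(2024, 6, 1)},
    {"site": "meadow", "count": None, "recorded": date(2024, 6, 8)},
    {"site": "Meadow ", "count": 11, "recorded": date(2019, 6, 1)},
]


def quality_issues(record, today, max_age_days=365):
    """Return a list of simple quality problems found in one record."""
    issues = []
    if record["count"] is None:
        issues.append("incomplete")  # a value is missing
    if record["site"] != record["site"].strip().lower():
        issues.append("inconsistent")  # label differs in case or spacing
    if (today - record["recorded"]).days > max_age_days:
        issues.append("outdated")  # older than the assumed age limit
    return issues


# A fixed reference date keeps the example reproducible.
today = date(2024, 7, 1)
report = [quality_issues(record, today) for record in records]
print(report)
```

Checks like these only cover what can be expressed as a rule. Whether the data is suitable for the question being asked still needs human judgement.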

Perhaps more importantly, just because data has been recorded, that doesn’t mean it’s accurate. And accuracy is not the same as relevance.

Data literacy

Beyond the issues of data quality, more problems can arise in its interpretation and communication.

Human judgement is fallible. Works by Kahneman, Taleb, and Bergstrom and West offer useful insight into the kinds of errors people make, particularly when presented with numbers and statistics, and perhaps more so when those numbers are used to support a narrative.

We are capable of remarkably simple mistakes that have nothing to do with intelligence, whether the result of a misleading graph, an inherent bias, or a simple lapse in judgement. Charlie Munger’s list of psychological tendencies, from his talk “The Psychology of Human Misjudgment”, offers a useful framework for understanding why.

Understanding data can be about improving your marketing campaign flows, modelling whether a species population is likely to recover or decline based on historical count data, or simply reducing your risk of being misled in an era where data is routinely cited to support an argument.