Data Grammar, Pt. 1
While “numbers people” may assume “words people” can’t understand their data world, the truth is that its theoretical and structural roots lie in linguistic organization.
Word Nerds vs. Quants
When you tell someone you’re a data scientist, it’s not uncommon for them to respond, “I’m a word person, not a numbers person,” or “oh, that sounds really technical.” Statements like these arise from the misconception that you have to be an amazing mathematician or extreme tech nerd to be a “data person” or understand data science. Although mathematical interest and technical knowledge are obviously important if you want to build and deploy machine learning algorithms, understanding what data is, how it is structured, and how it is used is arguably more akin to understanding basic grammar. While “numbers people” may assume “words people” can’t understand their data world, the truth is that its theoretical and structural roots lie in linguistic organization. In fact, most data processing terminology is taken directly from the lexicon of language. For example, when a computer program stores new data, it is said to have “written” those data, and when a program loads existing data, that’s called a “read.” Programming language theory, the study of how computer programming languages work, discusses “syntax,” and so on.
Definitions of “data science” typically focus on the scientific methods and mathematical tools used to turn data into insights and decisions. But even good overviews, with lengthy discussions of the importance of using lots of “good data” or the art of “feature engineering,” often neglect to include a digestible introduction to data itself. This post is the first in a series aimed at helping people who don’t yet see themselves as “data people” build a foundational understanding of data, in much the same way they would learn a new language. Over the course of a few articles, we’ll learn about binary, the computer alphabet; data types, the “words” of computation; and data structures, which are analogous to sentences. We’ll end with a post on how these ideas combine to create databases, the libraries of the data world. Along the way you’ll build a foundational vocabulary and an understanding of what data is from a technical perspective, and of how it is organized and structured for data science.
First, a disclaimer. If you are a trained computer scientist or “data person,” the following articles may offend your sensibilities. While we could debate the merits of binary floating-point values versus limited-precision decimals, or delve into the finer points of when to use tree maps instead of hash maps, that’s not the purpose of this series. Too often we let the precision required by our field stand in the way of being generally understood by non-experts, even though clear communication of our processes and findings is a critical part of every data scientist’s job.
So, as Julie Andrews says, “Let’s start at the very beginning, a very good place to start.”
Binary, a Computer's ABCs
Binary is the most basic language of computers: it is to data what syllables are to speech or letters are to written text. You can think of binary as the “alphabet” of the data world.
Understanding binary is kind of like knowing Morse code. Most people never have reason to learn it, but it’s useful to have an appreciation of the precision of this form of digital communication.
Spelling, the bane of my existence until almost all of my written communication shifted to digital methods, is arguably the “beginning” of advanced written communication. Every human language has decided that a certain collection of symbols, in a set order, makes up a word. The binary alphabet has just two characters: 0 and 1. Just as you put letters together to form words, data is really just a collection of 1s and 0s in a file (a document) which, when read in a specific order, can be “understood” by a computer that has been told what language, or format, the file was written in.
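To make that concrete, here’s a minimal Python sketch (using a short byte string to stand in for a file’s contents) that peeks at the raw 1s and 0s behind ordinary text:

```python
# Every character is stored as a pattern of bits (1s and 0s).
text = "Hi"
data = text.encode("utf-8")  # the raw bytes a file would contain
bits = [format(byte, "08b") for byte in data]  # each byte as eight 0s/1s
print(bits)  # ['01001000', '01101001']
```

A program that knows the file’s format (here, UTF-8 text) can read those same bit patterns back as the letters “H” and “i”.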
Binary is actually a counting system.
Most humans now use a “base-ten” counting system (because we have 10 fingers).
1, 2, 3, 4,…,10,11
But binary is base-2, so 1, 2, 3, 4, …, 10, 11 becomes:
0001, 0010, 0011, 0100,...,1010,1011
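You don’t have to do this translation by hand. A quick Python sketch shows the same counting sequence, padded to four binary digits as above:

```python
# Translate base-ten numbers into binary strings, padded to 4 digits.
for n in [1, 2, 3, 4, 10, 11]:
    print(n, "->", format(n, "04b"))  # e.g. 11 -> 1011

# And back again: read the string "1011" as a base-2 number.
print(int("1011", 2))  # 11
```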
Don’t be intimidated by counting in a different base! You already know how to use other base systems. Every time you look at a clock you are using two different base systems (base 12 for hours and base 60 for minutes and seconds).
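The clock analogy can be made concrete: converting a raw count of minutes into hours and minutes is just reading the number in base 60. A small Python sketch:

```python
# 135 minutes, read in "base 60", is 2 hours and 15 minutes.
hours, minutes = divmod(135, 60)  # quotient and remainder in one step
print(f"{hours}h {minutes}m")  # 2h 15m
```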
Why do computers use binary? Well, just as our brains process letters and sounds into meaning, computer hardware processes electrical signals into 1s (signal on) and 0s (signal off). Software then processes those 1s and 0s into other “data types.”
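For a taste of what that software step looks like, here’s a minimal Python sketch: the very same eight on/off signals can be interpreted as a number or as a letter, depending on what the program is told they mean.

```python
bits = "01000001"      # one byte: eight on/off electrical signals
number = int(bits, 2)  # software can read it as the integer 65...
letter = chr(number)   # ...or as the character "A"
print(number, letter)  # 65 A
```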
These data types are the next level of communication, akin to parts of speech (noun, verb, adjective, etc.) in the linguistic world. We’ll dig into the most common data types in the next Data Grammar post. Stay tuned!