Data Grammar, Part 2
Data Types, the “Parts of Speech” of the Data World
In the previous “Data Grammar” post we walked through the basics of binary and presented it as “the spelling of the data world.” Today we’ll introduce five basic data types: Booleans, Integers, Floats, Characters, and Strings. You can think of these types as data’s parts of speech. Just as there are times when only a noun makes sense, there are times when only a float will do. Below, we’ll outline when and why.
Booleans, AKA True or False
Booleans are just True and False values. Sometimes booleans are presented as the words “True” and “False,” and sometimes data people use the shorthand of “1” for True and “0” for False.
A common use of booleans is to control the actions a computer program takes. For example, a boolean might be used to indicate whether or not a given user is an adult and display different content for adults and children as a result.
Another use is to indicate whether a certain feature is present for a given row in a data set. For example, in the tiny table of data below, we use booleans to indicate whether or not items have a specific quality.
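Here is a minimal Python sketch of both uses. Everything in it is invented for illustration: the content_for function, the is_adult flag, and the three rows with their is_fruit and is_red columns are hypothetical stand-ins, not data from a real system.

```python
# Hypothetical example: is_adult is an illustrative flag, not a real API.
def content_for(is_adult: bool) -> str:
    """Pick which content to show based on a boolean flag."""
    if is_adult:
        return "full catalog"
    return "kids catalog"

# Hypothetical rows: each boolean column records whether an item has a quality.
items = [
    {"item": "apple", "is_fruit": True, "is_red": True},
    {"item": "banana", "is_fruit": True, "is_red": False},
    {"item": "fire truck", "is_fruit": False, "is_red": True},
]

# Keep only the rows where the boolean column is True.
red_items = [row["item"] for row in items if row["is_red"]]

# Python treats True as 1 and False as 0, so summing booleans counts the Trues.
red_count = sum(row["is_red"] for row in items)
```

Summing the is_red column works because Python happens to use the same 1-and-0 shorthand described above.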
Integers, AKA Counting Numbers
An integer is a whole number (positive, negative, or zero) written without a decimal point.
3, 5, -64, and 34,897 are all integers. Orders and counts of indivisible things (ID numbers, your finishing place in a race, number of cans sold, day of the month, and number of whiskers on a kitten) are generally recorded as integers.
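As a rough sketch, here is how a few integers might look in Python. The variable names and the whisker list are made up for the example.

```python
# Hypothetical values chosen for illustration.
cans_sold = 34_897        # underscores play the role of the comma in 34,897
finishing_place = 3
negative_example = -64

# Each of these is stored as Python's int type.
all_integers = all(
    isinstance(n, int) for n in (cans_sold, finishing_place, negative_example)
)

# Counting indivisible things always produces an integer.
whiskers = ["w1", "w2", "w3", "w4", "w5"]   # a hypothetical kitten's whiskers
whisker_count = len(whiskers)
```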
Floats, AKA Fractions
If you need a decimal point, say, to describe how much your morning coffee cost, you need a float. Other uses for floats you may be familiar with include rates and averages (points per game, headcount growth per year). Floats can also be used to represent portions of a whole (percentage of a pumpkin pie).
Floats are numbers with decimal points. They may not all really “need” the decimal point, but if the decimal point is there, it’s a float. 10.2353246, 4.5, and 4,566.0 are all floats.
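A short Python sketch of the same idea, with hypothetical prices and scores. It assumes Python's built-in float type, which is one common way floats are implemented.

```python
coffee_cost = 4.5     # hypothetical price
total = 4566.0        # whole-valued, but the decimal point makes it a float

is_float = isinstance(total, float)        # True
same_value = (total == 4566)               # True: same value...
same_type = type(total) is type(4566)      # False: ...but a different type

# Rates and averages usually come out as floats, even from integer inputs.
points = 57           # hypothetical integer counts
games = 4
points_per_game = points / games           # 14.25
```

Note that 4566.0 and 4566 compare as equal in value, yet one is a float and the other an integer. The decimal point decides the type, not whether the number “needs” it.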
Integers and floats are the most common numerical data types. People who are really into data types might also go into the finer points of types like BigInt, Fixed Point, Int64 and a whole host of others we won’t discuss here, but suffice it to say that the differences between those types mostly boil down to the size and precision of the numbers they can represent. Those details are important for a data scientist or data engineer to understand, but should never get in the way of clear communication.
The other basic data types are all about letters! Yay letters! Or as we data people say, “characters.”
Characters, the Keys on Your Keyboard
A character is a single letter, digit, punctuation mark, symbol or space. Keep in mind that to a computer, “j” and “J” are as similar as “;” and “J”. That is to say, to a computer, they aren’t similar at all.
“A”, “$”, “9”, “}” and “c” are all characters.
You may have noticed two things:
1) “9” is a character. Yes, it can be! It’s all in how the data is stored.
2) There are quotes around the characters. This is simply a convention that helps a user know if the data is stored as a character or a number type. Numerical data doesn’t have quotes.
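To see why the quotes matter, here is a small Python sketch. One caveat: Python has no separate character type, so a character here is simply a string of length one. The variable names are illustrative.

```python
nine_as_character = "9"   # quotes: stored as text
nine_as_number = 9        # no quotes: stored as a number

type_of_character = type(nine_as_character).__name__   # "str"
type_of_number = type(nine_as_number).__name__         # "int"

# The same operator behaves differently depending on the type.
joined = nine_as_character + nine_as_character   # "99" (placed side by side)
added = nine_as_number + nine_as_number          # 18 (arithmetic)

# A digit character can be converted to a number explicitly.
converted = int(nine_as_character)               # 9

# Upper and lower case are different characters to the computer.
case_matters = ("j" == "J")                      # False
```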
Just as numbers are stored in binary, an “encoding,” like ASCII encoding, is used to store characters as specific numbers. When a computer reads a file, the file format tells the application how to distinguish between binary that should be translated to characters with ASCII and data that should be read as a numerical type.
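The character-to-number mapping can be inspected directly. This sketch uses Python's built-in ord, chr, and str.encode. The sample characters are arbitrary.

```python
letter = "A"
code = ord(letter)                  # 65: the number ASCII assigns to "A"
bits = format(code, "08b")          # "01000001": that number as 8 binary digits
back_to_letter = chr(code)          # "A": decoding reverses the mapping

# The character "9" is stored as the number 57, not as the number 9.
digit_code = ord("9")

# Encoding a run of characters yields one number (byte) per character.
stored_bytes = "J;j".encode("ascii")
codes = list(stored_bytes)          # [74, 59, 106]
```

The codes for “J” and “j” (74 and 106) are no closer in kind than the codes for “J” and “;”. They are all just different numbers.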
When you stick a bunch of characters next to each other in a specific order, you get a string.
Strings, AKA Written Text
Strings are ordered collections of characters, but be careful not to confuse strings with words. A string can be a single word, a sentence or even an entire novel!
“Agfc,” “cat,” “The quick brown fox jumped over the lazy dog,” and “g395&dkj294%” are all strings. Yep, you can put numbers in strings. When a numeral shows up in a string, it’s represented as a character in that string.
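A quick Python sketch of strings as ordered collections of characters, reusing two of the example strings above. The variable names are illustrative.

```python
text = "g395&dkj294%"

length = len(text)            # 12 characters
characters = list(text)       # the string split into its individual characters

# Numerals inside a string are characters, not numbers.
digits = [ch for ch in text if ch.isdigit()]      # ['3', '9', '5', '2', '9', '4']
digits_are_text = all(isinstance(ch, str) for ch in digits)

# Order matters: the same characters in a different order are a different string.
order_matters = ("cat" != "act")                  # True

# A string is not the same thing as a word: this one string holds nine words.
sentence = "The quick brown fox jumped over the lazy dog"
word_count = len(sentence.split())                # 9
```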
Understanding binary and the basic “data types” is like being able to write all the letters in the alphabet and read and write individual words. It’s a crucial first step, but doesn’t really provide a means of communication. The power of data comes from putting data elements in relation to one another. It’s not just what data values are stored but how those values are stored, and the structure in which they are organized. Our next Data Grammar post will dive into a few of the most common structures.