Data Grammar, Part 3
Our previous Data Grammar posts introduced you to binary and data types. Binary is the basis of computer language, akin to letters in written language. Data types are the next level of abstraction. Computers read binary and parse it into different data types just like humans read individual words and categorize them into nouns, verbs and adjectives.
The next step is to take data values, of the same or different data types, and organize them in different ways to make data structures, the same way we would combine words to make clauses and sentences.
Just as we use words to write sentences, we can combine data values to describe objects or events, organize large sets of data for fast access, and track an order or preference across data elements. This is where the real magic begins. When we combine individual data values in specific ways, we get what are called data structures. Each sentence has a structure, a pattern of words that allows us to derive complex ideas from a collection of words. Data structures are similar.
By defining relationships between individual data elements we can express complex ideas like the order in which customers signed up for your service, which books on your bookcase you have read, and which of your friends know one another. We will use these three examples as we walk through the three major data structures in this post.
Data structures, AKA sentences
There are lots of different kinds of data structures out there, just as there are lots of sentence structures. Computer Science students take entire classes in data structures. In fact, my data structures class made me fall in love with the field. Somewhere in the process of figuring out how to delete an aardvark from a binary-search tree I was hooked! But I digress…
The three data structures we are going to dive into now are lists, maps and graphs.
Good news: if you have used Excel, you've used a list. A column of data can be thought of as a list. It is a collection of data elements, possibly of various data types, that might have a specific order. There are many types of lists and lots of ways to write a list in different programming languages, but a common representation is: [12, 6, 1, 4]. That is a list of integers, but you can make lists of other data types. The data elements in a list don't even have to be of the same type. For example, ["a", 56, 3.56, "gbh", True] is a list.
You can even make lists of lists! If you've created a table in Excel that has more than one column, you've made a list of lists. Here, people who are really into data structures might add other terms like array, set, queue, stack, linked list, or doubly-linked list to the discussion. Those are all cool and useful, but for our purposes you can think of them as fancy variations on a list.
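If you'd like to see these ideas typed out, here is a small sketch in Python. The two lists are the ones from the text; the little table of names and ages is made up purely for illustration.

```python
# The list of integers from the text.
numbers = [12, 6, 1, 4]

# A list can mix data types.
mixed = ["a", 56, 3.56, "gbh", True]

# Lists keep their order, so we can ask for an element by its position.
first_number = numbers[0]   # positions start counting at 0
last_mixed = mixed[-1]      # negative positions count from the end

# A list of lists: each inner list is one column of a small table.
# The names and ages are invented for this example.
table = [
    ["Ann", "Bob", "Cy"],   # column 1: names
    [34, 27, 45],           # column 2: ages
]

# Second column, second row: Bob's age.
bob_age = table[1][1]
```

Storing the table column by column mirrors the spreadsheet picture above. Storing it row by row would work just as well; it is a design choice.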
Lists are used to keep track of a collection of data elements that are related. For example, you might have a list of customer IDs or a list of books you have read.
Data structure maps aren’t like the maps you use to find the tasty new ramen place across town. A map data structure is a way to store a collection of relationships between two data elements. You can think of it as a well-organized, labeled filing cabinet.
For example, if you pulled open the drawer to your home filing cabinet you might see labeled folders. A manila folder labeled “legal documents” is a file of…legal documents. A folder labeled “medical test results” contains…medical test results.
If you give a map data structure a key, it will give you back a value (or collection of values) that it has connected to that key. Data scientists typically describe this as “mapping from a key to a value.” There are many ways to implement a map. A simple one to explain is called a sorted map. Basically, you can picture a sorted map as a list, kept in order by key, where the first element is a key and the next element is the value associated with that key. The pattern continues for every key-value pair. Keeping the keys in order is what makes a key quick to find.
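To make the key-then-value pattern concrete, here is a rough sketch in Python. It is only one simple way to picture a sorted map, not how any particular library builds one. The folder contents and the helper name `lookup` are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical filing cabinet, stored as [key, value, key, value, ...]
# with the keys in alphabetical order.
flat_map = [
    "legal documents", ["lease", "will"],
    "medical test results", ["blood panel"],
]

def lookup(flat, key):
    """Return the value stored right after `key`, or None if the key is absent."""
    keys = flat[0::2]               # every even position holds a key
    i = bisect_left(keys, key)      # binary search works because the keys are sorted
    if i < len(keys) and keys[i] == key:
        return flat[2 * i + 1]      # the value sits right after its key
    return None

legal = lookup(flat_map, "legal documents")
missing = lookup(flat_map, "tax returns")
```

In everyday Python you would normally reach for the built-in `dict` instead, which does the same key-to-value job without you managing the list yourself.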
Maps, like lists, go by many names: dictionaries, hash maps, JSON objects, etc. Maps are useful for keeping track of a specific set of facts about an entity. For example, you could create a map for each customer that tracks the customer's address, age and number of pets. The map would have three keys: “address”, “age”, and “num_pets.”
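The customer example might look like this in Python, where a map is called a dictionary. The three keys come from the text; the values are placeholders I made up.

```python
# One map per customer, with the three keys described above.
# The values are invented for illustration.
customer = {
    "address": "12 Example Street",
    "age": 41,
    "num_pets": 2,
}

# Give the map a key and it hands back the value.
age = customer["age"]

# Values can be updated in place, e.g. when the customer adopts a kitten.
customer["num_pets"] += 1

# The keys themselves can be listed too.
keys = sorted(customer)
```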
The last data structure we are going to discuss is a graph. In the data structure world, graphs aren’t Excel line graphs and bar charts. You might be familiar with data structure graphs from making food webs in an ecology class or seeing visualizations of social networks. Graphs are sometimes called "networks," and a "tree" is just a special type of graph. Data structure graphs are collections of nodes (the things) and edges (the connections between them).
Graphs can be stored in several ways. A simple way is to construct what’s called an adjacency matrix. You can think of it as a table in Excel where each node name is listed down the first column as well as along the top row. Then, you can create an edge between two nodes by filling in the cell located at the first node’s row and the second node’s column, for example with a 1.
Below is an example of a graph with 4 nodes (animals) and 5 “directed edges” (arrows) which represent which animals are food for other animals. The table is the same graph stored as an adjacency matrix.
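For readers who prefer code, here is a stand-in sketch of the same idea in Python. The four animals and five edges are my own made-up food web, chosen only to match the counts above; they are not necessarily the ones in the figure. Each edge points from the animal that gets eaten to the animal that eats it.

```python
# Four nodes (animals). These names are assumptions for illustration.
animals = ["grasshopper", "frog", "snake", "hawk"]
index = {name: i for i, name in enumerate(animals)}

# Start with a 4 x 4 table of zeros: rows are prey, columns are predators.
matrix = [[0] * len(animals) for _ in animals]

# Five directed edges, written as (prey, predator).
eaten_by = [
    ("grasshopper", "frog"),
    ("grasshopper", "hawk"),
    ("frog", "snake"),
    ("frog", "hawk"),
    ("snake", "hawk"),
]

# Fill in the cell at the prey's row and the predator's column.
for prey, predator in eaten_by:
    matrix[index[prey]][index[predator]] = 1

def predators_of(prey):
    """Read across one row to find everything that eats `prey`."""
    row = matrix[index[prey]]
    return [animals[j] for j, cell in enumerate(row) if cell == 1]

frog_predators = predators_of("frog")
edge_count = sum(sum(row) for row in matrix)
```

Reading across a row answers "who eats this animal?", while reading down a column answers "what does this animal eat?".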
Some data structures work better for a given data set (customer sign-up order, books, friends) than others. It is the data scientist’s job to think through which data structure should be used for each scenario. I’d suggest that, for most situations, you’d use a list for the order in which customers signed up, a map for the books you have and haven't read, and a graph for which of your friends know one another.
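Here is how those three choices could be sketched in Python. All of the customer IDs, book titles and friend names are invented. Note that the friend graph is stored as a map from each person to the set of people they know, which is a different (and also common) way to store a graph than the adjacency matrix described earlier.

```python
# List: the order in which customers signed up (IDs are made up).
signup_order = ["cust_017", "cust_042", "cust_003"]
signup_order.append("cust_058")       # a new sign-up goes on the end
first_customer = signup_order[0]

# Map: each book title points to whether it has been read.
books_read = {"Dune": True, "Emma": False, "Beloved": True}
unread = [title for title, read in books_read.items() if not read]

# Graph: which friends know one another.
# Each person maps to the set of people they know.
knows = {
    "Ana": {"Ben", "Cal"},
    "Ben": {"Ana"},
    "Cal": {"Ana"},
    "Dee": set(),                     # Dee hasn't met the others yet
}

def know_each_other(a, b):
    """True if there is an edge between a and b."""
    return b in knows[a] and a in knows[b]

ana_ben = know_each_other("Ana", "Ben")
ben_cal = know_each_other("Ben", "Cal")
```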
These data structures are the basis for how data scientists organize data in order to represent complex ideas. Sometimes we use these data structures to manipulate fairly small sets of data, but the concepts can be extended to massive data sets. Usually, once we are managing very large data sets, we store the data in a database. The structures presented above are the building blocks for the three most common database types, which we'll dive into next time, in part four of this series.