
Questions to Ask Your Data, Part 1

In my previous series, Data Grammar, we walked through data basics from 1s and 0s all the way up to databases. Now that you can "speak data," let's move on to understanding what your data can tell you. If you ask a technical person what's in some data, they might give you table schemas and graph ontologies. Those documents are important, but when I want to really understand some data, I find it more useful to take a direct approach and ask a few questions. This series of posts will cover some of those key questions.

My first question is: what does each unit of data describe?

A unit of data in a relational database is a row of a table, while in a document store a unit is a single document, and in a graph it's a node or edge. Though the terminology varies depending on the database type, the objective is the same: to work out what things, relationships, and events are being described.
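To make the idea of a "unit" concrete, here is a minimal sketch in Python of one fact expressed as a row, a document, and a graph. The person, the city, and every field name are made up for illustration, and plain Python structures stand in for real database systems.

```python
# One fact, "Ada lives in Austin", expressed as three kinds of data unit.
# All names and values here are hypothetical.

# Relational: a unit is one row of a table (a tuple read against column names).
person_columns = ("person_id", "name", "city")
person_row = (1, "Ada", "Austin")

# Document store: a unit is one self-contained document.
person_document = {
    "person_id": 1,
    "name": "Ada",
    "address": {"city": "Austin"},
}

# Graph: a unit is a node or an edge.
nodes = [
    {"id": "person:1", "label": "Person", "name": "Ada"},
    {"id": "city:austin", "label": "City", "name": "Austin"},
]
edges = [
    {"from": "person:1", "to": "city:austin", "type": "LIVES_IN"},
]


def describe_row(columns, row):
    """Pair each column name with its value so a row can be read on its own."""
    return dict(zip(columns, row))


print(describe_row(person_columns, person_row))
```

In all three shapes the one-sentence answer is the same: this unit describes where a person lives. Only the container changes.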

I try to develop one-sentence summaries of what the data describe. The answers could be "each row describes facts about a person," or "each document contains details of a single financial transaction." The data may also describe a relationship, such as, "the row lists the city and state where a US zip code is located," or "the document notes whether or not two characters in a novel are friends."

Note that for basically any unit, there is always more data you could include. You could add details about the person’s favorite ice cream flavor. You could note the temperature outside at the time the transaction was made. You could note the number of dog groomers in the zip code. You could add a count of the number of times the characters speak to one another. There is always more data you could add, assuming you have access to it.

In general, I'm a proponent of the idea that more data is always better, but I'm also a realist and know some data are more useful than others. (Important: There are also significant privacy concerns when the data you are collecting describe people or their behavior.)

There are four kinds of data facts you should always consider collecting to support and contextualize your data, even if you aren’t initially sure you need them:

1. Time. When did the event happen, how long did something take, at what time was the data collected? You may only care about what the current state of the world is, but if you want to use these data to better understand something, the context of time and order of events is crucial.

2. Unique identifiers. The sooner you can come up with a good way to uniquely identify the people, places and things described in your data, the better.

3. Source. Where did each data point come from? Did a system you control generate it? Did you buy it from someone? Did you scrape it from a website? Is it a side effect from combining other data?

4. Meaning. Someone may think it is obvious that the column labeled Pers_Mon_Inc means "person’s monthly income," but to someone else it could mean "personal Monday incidental costs," or even "increase in personified monkeys." Meaning information should be stored in a data catalog and should be valued as a data source in and of itself.
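As a rough illustration of the four items above, here is a minimal Python sketch of a record that carries its own time, identifier, and source, alongside a tiny data catalog for meaning. Only the column name Pers_Mon_Inc comes from this post; the class, the other field names, the catalog wording, and the helper function are assumptions made for the example, not a prescribed design.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime, timezone
from uuid import uuid4

# Hypothetical data catalog: field name -> plain-language meaning.
DATA_CATALOG = {
    "Pers_Mon_Inc": "Person's monthly income",
    "event_time": "When the income was earned or reported",
    "source": "Where this data point came from",
    "record_id": "Unique identifier for this record",
    "collected_at": "When this record was collected (UTC)",
}


@dataclass
class IncomeRecord:
    Pers_Mon_Inc: float   # the fact itself
    event_time: datetime  # 1. Time: when the event happened
    source: str           # 3. Source: system, vendor, scrape, or derived
    # 2. Unique identifier, generated automatically for each record.
    record_id: str = field(default_factory=lambda: str(uuid4()))
    # 1. Time again: when the data were collected.
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def undocumented_fields(record, catalog):
    """4. Meaning: list any fields that have no entry in the catalog."""
    return [f.name for f in fields(record) if f.name not in catalog]


record = IncomeRecord(
    Pers_Mon_Inc=4200.0,
    event_time=datetime(2024, 1, 31, tzinfo=timezone.utc),
    source="payroll_system",  # made-up source name
)
print(undocumented_fields(record, DATA_CATALOG))
```

A check like undocumented_fields is one simple way to notice when a new column shows up without anyone writing down what it means.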

Ultimately, understanding what is being described in a data source and the context in which it is gathered and identified is the first step in understanding how you can use that data. It will also help you think through additional data that could enrich those descriptions. The next question is: how are the units I've identified related to other units in this and other data sources?

Stay tuned: we'll explore this question and more in the next Questions to Ask Your Data post!