Data Grammar, Part 4
Now that we’ve introduced binary (the computer alphabet), data types (computer words), and data structures (computer sentences), we're ready to take a leap and introduce a concept analogous to a library.
A database can be thought of as a library - it contains a multitude of large or complex data structures, like books full of sentences, stored in specific locations in some sort of system (Dewey decimal, ISBN) that is - hopefully - well-documented and set up so that you can find books that answer your questions, or in database terms, “queries.” Technically, the collection of organized data is the database, and the technology used to organize and access that data is a database management system (DBMS). I, along with most data scientists, will sometimes get sloppy and use “database” to refer to a DBMS. Forgive me.
Let’s discuss the three most popular types of database management systems: relational, document store and graph.
Relational Databases
Most people who have run some sort of business have run into a relational database management system (RDBMS). Common software implementations are MySQL, Postgres, Netezza, Oracle, SQLite, and Microsoft SQL Server. Relational databases are collections of tables tied together by something called a schema, which is really just an organizational system that tells you what information can be found in which table and how the tables relate to each other. You can think of relational databases as filing cabinets with really good folder and drawer labels.
Tables are really just lists of lists. Though large, optimized relational databases might seem complex to think about, most are pretty straightforward. For example, you might have a list of customers, a list of those customers’ orders and a list of employees who serve those customers or fill the orders. You can access data in a relational database using a query language called SQL (pronounced “sequel” or “S-Q-L,” depending on who you ask - it’s an epic battle most of us don’t like talking about). SQL is great because it is declarative: you describe the structure you'd like the data returned in, and the database engine figures out how to extract and transform the underlying tables into that format.
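To make the "describe what you want" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and rows are made up for illustration; a real schema would look different.

```python
import sqlite3

# In-memory SQLite database; every table and row here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# We describe the result we want (one row per customer with an order count
# and a total); the database engine decides how to scan and join the tables.
query = """
    SELECT c.name, COUNT(o.order_id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Notice that the query never says how to loop over the two tables. It only names the columns, the relationship between the tables, and the grouping.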
Relational databases are the go-to for most data scientists. They are fast for typical queries, relatively easy to manage, and commonly found in businesses. Also, many machine learning algorithms rely on creating “feature vectors” for each example entity. “Feature vector” is a fancy way of saying a row in a table. Since most data science models require transforming the data into a table anyway, it is often easiest to keep all the raw data stored in tables in a relational database.
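As a rough sketch of the "feature vector is a row" idea, the snippet below pulls rows out of a hypothetical customer_features table and splits each one into inputs and a label. The column names (including the churned label) are assumptions chosen for the example, not part of any real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_features (
        customer_id INTEGER PRIMARY KEY,
        n_orders INTEGER,
        total_spent REAL,
        churned INTEGER
    );
    INSERT INTO customer_features VALUES
        (1, 2, 40.0, 0),
        (2, 1, 40.0, 1),
        (3, 5, 210.5, 0);
""")

rows = conn.execute(
    "SELECT n_orders, total_spent, churned "
    "FROM customer_features ORDER BY customer_id"
).fetchall()

# Each row splits into a feature vector (the inputs) and a label
# (the value a model would try to predict).
X = [list(row[:-1]) for row in rows]
y = [row[-1] for row in rows]
print(X)
print(y)
```

Lists like X and y are the general shape most modeling libraries expect, which is part of why tables are such a convenient place to keep raw data.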
Document Store Databases
The next database management system, the document store, is in the “NoSQL” family. Unlike relational databases, which all use some form of SQL, NoSQL databases don’t always use SQL as a query language (hence the name, usually read as “Not Only SQL”). There are several types of NoSQL databases, but document store databases are probably the most common right now.
Document stores are to map data structures as relational databases are to list data structures. The filing cabinet is still a great analogy for document store databases, but imagine that instead of housing folders that each contained a bunch of documents, the cabinet was packed with individual folders for each piece of information in each type of document. For example, a doctor’s filing cabinet might contain a “Patients” folder which in turn would house folders for each patient and each patient’s folder would have individual folders for the patient’s age, gender, AND a unique folder for every lab result and every insurance claim! That many folders might be overkill for your real-world filing cabinet, but computers are REALLY good at finding the right file folder fast!
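The doctor's filing cabinet can be sketched with plain Python dictionaries, which are maps. This is only an illustration of the shape of the data, not how any particular document store is implemented; the patient IDs, field names, and the lab_values helper are all hypothetical.

```python
import json

# A hypothetical "patients" collection: each document is a nested map.
patients = {
    "p001": {
        "age": 54,
        "gender": "F",
        "lab_results": [{"test": "A1C", "value": 6.1}],
        "insurance_claims": [{"claim_id": "c-17", "amount": 120.0}],
    },
    "p002": {
        "age": 31,
        "gender": "M",
        "lab_results": [],
        # No "insurance_claims" key: documents need not share the same fields.
    },
}

def lab_values(collection, patient_id, test_name):
    """Return every recorded value of one lab test for one patient."""
    doc = collection[patient_id]
    return [r["value"] for r in doc.get("lab_results", []) if r["test"] == test_name]

print(lab_values(patients, "p001", "A1C"))
# Documents are commonly exchanged as JSON, which maps directly onto this shape.
print(json.dumps(patients["p002"]))
```

Looking up a patient is a single key lookup rather than a search through a list, which is the "finding the right folder fast" part of the analogy.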
Data scientists might use a document store database if they are dealing with “unstructured” data like large amounts of text, or if each entity in the database might have different types of data elements. Document stores aren’t usually the main storage location data scientists use, but they are really useful for managing data flow through a web application, as they are better suited to interfacing with programming languages commonly used on the web.
Graph Databases
The last database management system type is a graph database. You guessed it, it’s based on the graph data structure, which is a collection of nodes and edges. Of the three databases I've introduced, the graph database is the one that most closely resembles its underlying data structure. It’s basically just a giant graph! What makes it a database is the cataloging, or indexing, of the data, so that when you query the database it knows where in storage it wrote each piece of data.
Data scientists use graph databases primarily when the relationships between entities are important. For example, when building a recommendation engine to recommend products to customers, a data scientist could use a graph database to illuminate relationships between people who buy the same items or among products that tend to end up in the same basket.
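Here is a toy version of the "same basket" idea, using an in-memory graph rather than a real graph database. Products are nodes, and an edge between two products is weighted by how many baskets contain both. The baskets and the recommend helper are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["coffee", "milk"],
]

# graph[a][b] = number of baskets containing both a and b (the edge weight).
graph = defaultdict(lambda: defaultdict(int))
for basket in baskets:
    for a, b in combinations(sorted(set(basket)), 2):
        graph[a][b] += 1
        graph[b][a] += 1

def recommend(product, k=2):
    """Return up to k products most often bought together with `product`."""
    neighbors = graph.get(product, {})
    # Highest edge weight first; ties broken alphabetically.
    ranked = sorted(neighbors.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]

print(recommend("jam"))
print(recommend("coffee"))
```

A graph database answers the same kind of question, but it persists and indexes the nodes and edges so the lookup stays fast when the graph no longer fits in a Python dictionary.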
DBMSs are, in theory, just souped-up data structures, but in practice they are complex software (and sometimes also hardware) systems. A good data scientist should be able to interact with most databases and maybe even set up small-scale versions of these systems. However, installing, designing and managing a database of any real scale requires the skills of a data engineer or database administrator.
We’ve now walked you through how data is stored, structured, and organized. The next post will focus on helping you develop a sense of what information is really in your data.