Codd’s Relational Vision – Has NoSQL come full circle?

Doug Turnbull — December 11, 2013 | 16 Comments | Filed in: Big Data, NoSQL

Recently, I spoke at NoSQL Matters in Barcelona about database history. As somebody with a history background, I was pretty excited to dig into the past, beyond the hype and marketing fluff, and look specifically at what technical problems each generation of database solved and where they in-turn fell short.

However, I got stuck at one moment in time I found utterly fascinating: the original development of relational databases. So much of the NoSQL movement feels like a rebellion against the “old timey” feeling relational databases. So I thought it would be fascinating to be a contrarian, to dig into what value relational databases have added to the world. Something everyone thinks is obvious but nobody really understands.

It’s very easy and popular to criticize relational databases. What folks don’t seem to do is go back and appreciate how revolutionary relational databases were when they came out. We forget what problems they solved. We forget how earlier databases fell short, and how relational databases solved the problems of the first generation of databases. In short, relational databases were the noSomething, and I aimed to find out what that something was.

And from that apply those lessons to today’s NoSQL databases. Are today’s databases repeating mistakes of the past? Or are they filling an important niche (or both?).

The First Databases

To appreciate how revolutionary the relational model is, we need to appreciate what the earliest, pre-relational databases looked like. These databases largely reflect what we would build if we were tasked by our boss to “build a database”. Maybe, if we’re tracking movie rentals, in our ignorance we might start out with a line-by-line csv listing of customers and what movies they’ve rented. Maybe something silly that looks like this:

vectors are fun

Poor Man’s Database

We might also add an “index” to the back of the file to help us find specific records, in the same way we’d use an index in a book. Here the index is telling us that the “doug” record is 512 characters into the file while the “rick” record is 9212 characters into the file.

Now the earliest designers of these data storage systems would have had to face the problem, “How do I store movies as their own record?”. For example, we’re going to need to start storing how much a movie costs to rent and how many we have in stock. Should movies like “Top Gun” be treated as top-level records? Or should they be as parts-of (something owned by) the user records we already have? With movies as their own records, we don’t have to duplicate all the inventory and price data everywhere, but we’ll need to create another construct (an additional record type?) to specify the relationship.

vectors are fun

Poor Man’s Database

The databases that aggregate videos into users are considered to be following the “hierarchical” model. The databases that break out movies into their own records with the ability to link records fall into the “navigational” model.

You can imagine, if we were interacting with this database, our first reaction would be to create a rather primitive interface – create something low-level that talked only in movies and users. Luckily, others saw past this and realized that patterns like the ones in our video rental and other data stores could be generalized–that we could abstract the notion of “record” and “ownership” into something much more powerful. Eventually this is exactly what Codasyl/DBTG (Database Task Group) did, creating a standard language for creating and interacting with navigational and hierarchical databases.

So reflecting on our data model, you could create a record in one of these databases by using the RECORD command:

Record Name is USER
   Location Mode is Calc Using username
   Duplicates are not allowed
   username Type character 25
   address Type character 50
   phonenumber Type character 10

And when we want to declare a relationship between two records, we can define a set that maps them in an owner-ownee relationship.

Set Name is USER-VIDEOS
   Order is Next
   Retention is Mandatory
   Owner is USER
   Member is VIDEO

This model reflects first building many databases and then attempting to generalize important abstractions. We can think of the early history of databases like our video rental example. Since the advent of storage media, folks needed to store all kinds of data. Eventually somebody did something laudable, reflected on all the patterns being used in data storage and management and created a generalized database that was codified in the DBTG language.

Enter Codd

Whereas the first database’s abstractions grew out of patterns learned building from the bottom-up, the relational model did just the opposite. The relational model creates an amazingly powerful abstraction rooted in predicate calculus, and then expects the implementation details to follow (which as we all know they did).

When Codd wrote his paper, he criticized the DBTG databases of the day around the area of how the application interacted with the databases abstractions. Low-level abstractions leaked into user applications. Application logic became dependent on aspects of the databases: Specifically, he cites three criticisms:

  1. Access Dependencies: We often need to navigate from users -> videos to get at the videos. Application logic depends on how records are linked or aggregated. I must use one record type to get another record type.
  2. Order Dependencies: Applications depend on how data is physically stored in the database. Notice the “ORDER is NEXT” line in the Set declaration above. This specifies storage based on insert order, so retrieval will in turn happen on index order.
  3. Index Dependencies: When accessing indices, these databases required database indices be referred to explicitly by name.

Codd proposed to get around these limitations by focusing on a specific abstraction: relations. A relation is simply a tuple of elements. The ith element of each tuple is a member of some set, known as that element’s domain. Perhaps a given element’s domain is the set of users, user ids, possible movies to rent, etc. So for our videos, a tuple might look like:

 (user=u, address=a, list of movies rented=l)

Or in other words

 (doug, 1234 Bagby St, [Top Gun 3.99, Terminator 12.99])

One can interpret this relation to “mean” any number of things. We might equate this to a statement that “Doug rents Top Gun at 3.99 and Terminator at 12.99” or “Doug lives at 1234 Bagby St”.

Adding to this, Codd proposes an idea known as normalization. Codd only touches on the basics of normalization in this first paper. Suffice it to say, one important goal of normalization is to create flatter and less-redundant relations, creating simpler sentences from our relations. In Codd’s first-order normalization, what he introduces in his paper, we’d rather say “Doug rents Top Gun at 3.99” and “Doug rents Terminator at 12.99” instead of including both movies in the sentence. In relation terms, this would look like:

Users Relations
(user=doug, address=1234 BagbySt)

Movies-Rented Relations
(user=doug, movie=Top Gun, price=3.99)
(user=doug, movie=Terminator, price=12.99)

With simpler and normalized relations, we can use Codd’s new JOIN operator that can derive relations from other, simpler and normalized relations. This is extremely powerful. In other words with JOIN we can construct knowledge from smaller units of knowledge. For example, what if our database contained these relations:

Peoples Relations
(name=Socrates, Occupation=Philosopher)

Occupation-Characteristic Relations
(Occupation=Philosopher, Characteristics=Drunk)

We can do a JOIN on Occupation=Philosopher and learn that Socrates was a drunk! This is neat because, in my Artificial Intelligence class in college, this kind of “reasoning” is presented as something advanced. In fact there’s a whole language, Prolog, that’s whole job is to take assertions about the world and give you extra facts. I had to learn all that stuff in AI when mundane-old SQL was right under my nose giving me the tools to learn anything I want about the world my relations represented!

So now that we’ve presented this model, let’s reflect on how it matches up against Codd’s criticisms:

  1. Access Dependencies: As relations, data is split out into independent relations. There’s no “links” between them to follow.
  2. Order Dependencies: The order of a set of relations is not specified. Sets in math don’t have a notion of order. In practice, a relational query language, like SQL, allows you to sort by whatever criteria you want
  3. Index Dependencies: What’s an index? This data model does not mention indices. Indices are, for better-or-worse, a knob that gets tuned at a place the application does not interact with the database.

In short, Codd created a beautiful abstraction that turned out to be reasonable to implement. Instead of building on what had been done with databases, he went to the roots of predicate calculus and created a clean and beautiful way of interacting with data.

It’s a pretty singular and daring achievement in fact to assert “this is the abstraction we should use” and have it be largely successful. There was not much anticipating Codd’s invention of relational databases. It wasn’t, as is common today, a minor innovation on previous innovations. Instead it was simply Codd’s vision. Pretty cool stuff!

The wrench-in-the-spokes of-course came when we decided to build horizontally scalable systems “webscale” as all the kiddies say. Doing a JOIN over a distributed system turns out not to scale well. Because of this, we started to see old themes reintroduced, specifically the denormalized, hierarchical databases of yore in the form of many of today’s NoSQL stores.

NoSQL – Have we come full circle?

Yeah… in a way we’ve come full circle back to pre SQL days. Hierarchical data stores, once forgotten, have risen again. They certainly suffer from the same kinds of criticisms that Codd lays out. They leak lower-level details, access dependencies, order dependencies, and index dependencies up to the user application. Using Cassandra databases requires holding in high respect the primary row key, using that-and-only-that to access data. These are index and access dependencies. A database like HBase will impose order dependencies, storing your rows sorted based on the row key, allowing/requiring you to consider the order when your application scans keys.

Most importantly, when using a NoSQL store, your opportunities for ad-hoc querying becomes much more limited. Whereas relational databases hold sacred the goal of allowing you to define more complex information from more primitive information, NoSQL databases require performing massive batch processing through map-reduce to create derived information.

Moreover, the lack of normalization requires potentially large amounts of data duplication. If tracking users renting videos in a hierarchical data store, how does one update the inventory successfully? If users own videos, does this require a map reduce job to visit every user record and update the count. Will this occur atomically across all the records?

In short, denormalization stinks!

Ok Eeeyore

vectors are fun

Databases are hard

But denormalization is extremely powerful.

When we can denormalize we can do amazing things with horizontal scalability. We can simplify the task of atomically moving data around boxes. We can avoid having to worry about difficult distributed joins (all the data is in one place). Databases like Cassandra can be built that do beautiful things with distributed systems.

My point is that it’s dangerous to just grab a database off the shelf because its fun and trendy. You might end up creating a lot of technical debt for yourself. Do you know you’re handling an extremely high volume of a single somewhat-well-defined “thing”? Then the hierarchical model might be an amazing bet for you. Do you have many different kinds of records that link together, but don’t necessarily contain parts of the whole? A relational data store might be the ideal solution.

With Quepid, I started out using Redis as the primary database. Its been amazing for doing some simple storage of search query/document ratings. Once we started needing to add user accounts. With user accounts relating to sets of search problems, it became clear that Redis wasn’t good at this. My code started to look like a pseudo-relational database on top of Redis. So I’ve been working now (not a year from now) to transition to FoundationDB. FoundationDB will be great for my problem because I can mix-match different kinds of models in a single database, storing both hierarchical ratings for documents, and relational users.

In conclusion, “a foolish consistency is a hobgoblin of little minds”. Its important as a data architect to understand these tradeoffs in your bones. Its easy to get lost in hype and marketing material, but really appreciate what you’re gaining and measure if that gain is worth the potentially very real and painful cost.

Do you need help choosing the right data store for your application? . We can help you! Do you have feedback for this article? I’d love to hear what you have to say or critique and learn more about database history! So please leave a comment.

16 comments on “Codd’s Relational Vision – Has NoSQL come full circle?

  1. I am embarrassed to admit I implemented something extremely similar to the original text-based approach in middle school before even learning what a database was! I think it is still in production for the club I built it for…

    The new JSON support from postgresql looks like a possible “best of both worlds” solution I’m interested to try out.

  2. Totally great post; it very much resonates with my feelings. I shared on G+. (not sure if you prefer I share the DZone version or the o19s one).

    By the way, one very important aspect you didn’t address was the widespread and overall excellent tooling available for relational databases. It’s had many years to mature and the RDBs are very homogeneous so there’s lots of choice and wealth of options–some really excellent. I’m sure you’ll agree Rails’ ActiveRecord is one of the best. Yet…. I’m surprised that you revealed you went straight to one NoSQL store and then to another (!). Surely you don’t expect Quepid to demand the scalability characteristics to look for alternatives to an RDB; right? Or was it all a learning exercise (one you admit can be expensive — the wrong choice of DBs that is)?

  3. “[Non-relational data stores] leak lower-level details, access dependencies, order dependencies, and index dependencies up to the user application.”

    Well, that’s true … but every relational database I’ve used in the past 10 years has leaked things like lower-level details and index dependencies up to the app, too. I’m not saying relational databases can’t have an advantage, but every RDBMS I’ve used in the past decade has been so bad at these things that using a NoSQL database wasn’t actually a significant loss.

    For example, when you need a type of indexing which SQL doesn’t support (and your database can’t be extended to support), and you have to write your own index anyway, what’s the loss from just using NoSQL to begin with? Unless you’re doing sophisticated queries, probably not much.

    It’s quite true that dumb key-value stores are poor fits for user accounts, just like SQL is a poor fit for other kinds of data. But in any real project, there are other concerns to take into account. For example, setting up a new dev machine that needs 5 different services is going to significantly slow down new hires, and existing programmers who need to reload data a lot, and people wanting to set up new automatic test runners, and so on. It can be a huge advantage to have only 1 data store for your application, even if it’s sub-optimal for some tasks. (At least, until your company is big and mature enough to hire people to write automatic container installers for devs, and virtual test machines which can be created instantly. But those don’t come free.)

    I’ve never used any system that had 2 (or more) data stores where they didn’t get out of sync during development on a regular basis. Using Redis for user accounts can suck, as can using SQL for hierarchical data, but having to install and configure and maintain both in sync is worse than either of those on their own.

    Way too many projects try to use the best possible X for each requirement, and they end up with hundreds of libraries and dozens of running services, and nobody is an expert at more than 2% of the system, and whenever anything breaks for anybody, the whole team has to stop and figure it out.

  4. @Lucas thanks! I saw that when I was putting together my database history talk. BTW geez thanks for sharing on reddit/hacker news. My first day being on page of of hacker news! Fun.

  5. For example, when you need a type of indexing which SQL doesn’t support (and your database can’t be extended to support), and you have to write your own index anyway, what’s the loss from just using NoSQL to begin with?

    Well, if time travel is a choice, why not choose a relational database that can be extended?

  6. Hey @Frank, great thoughts!

    I get exactly where you’re coming from. The cost of taking on a 2nd data structure that’s a little bit more optimal often isn’t worth it. Totally agree. And I’ll shamefully admit those are the little corners of development where I feel ok being a little “clever” in how I use my data store.

    Along those lines, I’ve done some interesting things to push redis’s data model in Quepid. Hopefully I’ll blog about that soon :)

  7. Thanks @Steven Postgres is something I honestly haven’t played with, but I keep hearing neat things about it. Its definitely on the list of things to learn :)

  8. A very interesting and informative post, thank you!

    As interesting as the NoSQL wave has been, I don’t see it lasting when we have next generation RDBs like MemSQL that are designed to solve exactly the sort constraints that older RDB’s had. With easy horizontal scale-out MemSQL’s in-memory DB can run large read and write workloads concurrently, support both SQL and JSON datatypes, and provide sub-second latency to terabyte-scale workloads.

  9. @Doug: You’re welcome. Yeah, I thought that this article needed to be on Reddit / HN as a compensation for my utterly silly NoSQL vs. SQL battle post I made the other day ;-)

    I’m not sure if you’ve noticed these comments in the HN discussion, but you might also enjoy reading:

    Michael Stonebraker’s and Joseph M. Hellerstein’s “What Goes Around Comes Around”, Michael Stonebraker is the genious behind PostgreSQL and many NewSQL data stores.
    Eric Meijer and Gavin Bierman’s slightly esoteric “Contrary to popular belief, SQL and NoSQL are really just two sides of the same coin”, which is more of an advertising for Erik Meijer’s LINQ than a serious scientific publication :-)

  10. That MemSQL comment is blatant spam. Interesting to see them engaging in such tactics. I remember when a bunch of so-called NewSQL database vendors were doing that a lot on blogs I owned. None of them is in business anymore.

  11. Pingback: Auto-Creation of Indexes in RDBMS | Java, SQL and jOOQ.

  12. @Frank — You gave us a gem, applicable to many areas of development:

    Way too many projects try to use the best possible X for each requirement, and they end up with hundreds of libraries and dozens of running services, and nobody is an expert at more than 2% of the system, and whenever anything breaks for anybody, the whole team has to stop and figure it out.

  13. Pingback: Typesafe’s Slick is Not About SQL | Java, SQL and jOOQ.

  14. Pingback: Relational Data Stores Versus Hierarchical Databases : Stephen E. Arnold @ Beyond Search

Comments are closed.

Developed in Charlottesville, VA | ©2013 – OpenSource Connections, LLC