Solving data “variety” with Postgress NoSQL Extensions

Raise your hand if you’ve heard the three “Vs” of Big Data?

  • Velocity – your query/updates are exceptionally fast or large. Your processing the entire twitter feed.
  • Volume – you store a massive amount of data at rest. You’ve crawled the web and are storing the entire web in a database.
  • Variety – The structure of records varies dramatically. You need to store information about people, but on rare occasion you’d like to about their pets, their names, sometimes their pet’s vet names, sometimes their mother’s maiden name.

The three “Vs” are a big part of our job helping clients understand whether or not they need NoSQL. Many clients have problems that go like:

Help us with variety! We don’t really have volume/velocity scaling concerns, but our data model is getting weird. We think we should migrate from our relational solution to NoSQL because of our dynamic variety from record-to-record.

Switching to NoSQL for a little extra “variety” and not scaling focussed volume/velocity concerns has always felt wrongheaded to me. We present these clients with pretty daunting tradeoffs for migrating to NoSQL. Many of these tradeoffs involve forcing hierarchical, denormalized data models, that solve specific targeted query patterns at massive volume/velocity scale. Many clients don’t quite realize how much powerful ad-hoc query capability they’re losing by leaving SQL.

But how can we possibly have the best of both worlds? Well luckily for us Postgres is working on a very handy solution. In the upcoming release (9.4) Postgres is getting first-class JSON document support. What does this mean? Lets take it for a test spin:


First, go off and install postgres 9.4 beta 2 somewhere you don’t care about (I have a vagrant box).

Second, fire up psql or whatever you do your postgresing in, and lets start cracking!

Create a table that looks like any other SQL table

    create table person (      id serial,      first_name varchar(40),      last_name varchar(40),      extra_details jsonb    );

Except notice that little “jsonb” column type? That’s our handy-dandy binary JSON column type. What can we do with this table? Well we can insert a few records:

From the mundane

    insert into person (first_name, last_name, extra_details)    VALUES ('doug', 'turnbull', '{"age": 34}');    insert into person (first_name, last_name, extra_details)    VALUES ('eric', 'pugh', '{"age": 37}');

to the schemaless:

insert into person (first_name, last_name, extra_details)VALUES ('daniel', 'beach',        '{"age": 37, "cats": {"names": ["itsy", "bitsy", "cottontail"]}}');

and hey look at my JSON!

select * from person; id | first_name | last_name |                          extra_details----+------------+-----------+-----------------------------------------------------------------  1 | doug       | turnbull  | {"age": 34}  2 | eric       | pugh      | {"age": 37}  3 | daniel     | beach     | {"age": 37, "cats": {"names": ["itsy", "bitsy", "cottontail"]}}

Great, we can store and retrieve arbitrary JSON in a SQL column. We’ve solved our “variety” problem – right?

But how do we actually find things in our JSON?

Well one thing thats left – how can we query JSON documents? Postgres nicely provides some operators to apply to JSON data. First, there’s basic queries against elements within the JSON document using the -> operator:

 select * from person WHERE extra_details->'age' = '37'; id | first_name | last_name |                           extra_details----+------------+-----------+-------------------------------------------------------------------  2 | eric       | pugh      | {"age": 37}  3 | daniel     | beach     | {"age": 37, "cats": {"names": ["itsy", "bitsy", "cottontail"]}}

or select attributes within the JSON fields:

select id, extra_details->'age' as how_oldfrom personwhere extra_details->'age' = '37'; id | how_old----+---------  2 | 37  3 | 37

An important operator to know from the list is the containment operator. This operator asks the JSON data is this operand on the right a subset of the JSON data on the left? Or in other words, does the intersection of JSON documents A and B == document B?

For example, lets find all the people with cats named “itsy” and “bitsy”

select *from personwhere extra_details @> '{"cats": {"names": ["itsy", “bitsy”]}}';

Yeah but indexes?

You bet. We’re in luck, because JSON columns support postgres’s GIN (inverted index) indexes. The subset query above can be sped up by simply creating a GIN index on the entire column (indexing the entire document):

CREATE INDEX alldetails ON person USING gin (extra_details);

Or on just a specific property in the JSON:

CREATE INDEX age ON person USING gin ((extra_details->'age'));

Schemaless FTW – right? well maybe.

This is probably some of the coolest stuff I played with all month. It certainly hits the nail on the head for many client’s need for broader, unpredictable variety in their data models without leaving the ad-hoc capabilities of their SQL databases. If every person in our database has a collection of extra data that is going to be so different we cant easily anticipate creating new tables for that data, then this is a great solution.

But word to the wise: schemaless is a bit of a misnomer. You usually care somewhere about “schema” concerns. When we say “schemaless” we really just mean “we’ve pushed the problem to you, the consumer, to validate data”. Certainly its great when validation can be lighter and less opinionated. Or when validation is very application specific and possibly even more opinionated than whether an input string matches a given primitive type like string, integer, etc.

Sometimes people say “schemaless” and mean “stop making me think about a relational data model”. Its an open question that we’ll discover with these new Postgres features – is it ok to just have a database with a single table and a single JSON column that we query? Its likely equivalent to the question “should I switch to MongoDB”? Well its probably ok up until we experience the painful reasons relational databases were created in the first place – because normalization is actually a good thing(R) that models entity relationships much more richly than hiererarchical representations like JSON.

I’m just glad to now be able to straddle both worlds, and make informed engineering decisions that use the best of relational and non-relational.

Are you contemplating NoSQL technologies? Concerned that your relational database can’t meet today’s needs? Want a balanced opinion on what the best tech for your data problem is? Well contact us, and ask about one of our Data Architecture Assessments!