Understanding How CQL3 Maps to Cassandra’s Internal Data Structure: Sets, Lists, and Maps

John Berryman — July 24, 2013 | 4 Comments | Filed in: solr

Refer to my previous post for more in depth details into what I’m doing here. This is just a quick reference regarding how Cassandra sets, lists, and maps work under the hood:

Maps

Defining:

cqlsh:test> CREATE TABLE phonelists (
        ... user text PRIMARY KEY,
        ... phoneNumbers map<text,text> );

Inserting:

cqlsh:test> INSERT INTO phonelists (user, phonenumbers) 
        ... VALUES ('john',{'patricia':'555-4326','doug':'555-1579'});
cqlsh:test> INSERT INTO phonelists (user, phonenumbers) 
        ... VALUES ('scott',{'bill':'555-7382','patricia':'555-4326', 'jane':'555-8743'});

Retrieving:

cqlsh:test> SELECT * FROM phonelists;

 user  | phonenumbers
-------+------------------------------------------------------
 scott | {bill: 555-7382, jane: 555-8743, patricia: 555-4326}
  john |                 {doug: 555-1579, patricia: 555-4326}

Updating:

cqlsh:test> UPDATE phonelists  
        ... SET phonenumbers = phonenumbers + {'daniel':'555-0453'}  
        ... WHERE user='john';

The internal representation:

[default@test] list phonelists;
-------------------
RowKey: scott
=> (column=, value=, timestamp=1374684062860000)
=> (column=phonenumbers:bill, value='555-7382', timestamp=1374684062860000)
=> (column=phonenumbers:jane, value='555-8743', timestamp=1374684062860000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=1374684062860000)
-------------------
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=phonenumbers:doug, value='555-1579', timestamp=1374683971220000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=1374683971220000)

Note that the above text components, e.g. ‘doug’ and ‘555-1579′, are returned by cassandra-cli in their hex encoding. I’ve taken the liberty to decode them so that the print out is more understandable. I will do this throughout the post.

Lists

Defining:

cqlsh:test> CREATE TABLE friendlists (
        ... user text PRIMARY KEY,
        ... friends list <text>
        ... );

Inserting:

cqlsh:test> INSERT INTO friendlists (user, friends)  
        ... VALUES ('john',['doug','patricia','scott']);
cqlsh:test> INSERT INTO friendlists (user, friends)  
        ... VALUES ('patricia', ['john','lucifer']);

Retrieving:

cqlsh:test> SELECT * FROM friendlists;

 user     | friends
----------+-------------------------
     john | [doug, patricia, scott]
 patricia |         [john, lucifer]

Updating:

cqlsh:test> UPDATE friendlists  
        ... SET friends = friends + ['matt','eric']  
        ... WHERE user='john';
cqlsh:test> UPDATE friendlists  
        ... SET friends = friends - ['lucifer']  
        ... WHERE user='patricia';

The internal representation:

[default@test] list friendlists;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: john
=> (column=, value=, timestamp=1374687324950000)
=> (column=friends:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374687206993000)
=> (column=friends:26017c11f48711e2801fdf9895e5d0f8, value='patricia', timestamp=1374687206993000)
=> (column=friends:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374687206993000)
=> (column=friends:6c504b60f48711e2801fdf9895e5d0f8, value='matt', timestamp=1374687324950000)
=> (column=friends:6c504b61f48711e2801fdf9895e5d0f8, value='eric', timestamp=1374687324950000)
-------------------
RowKey: patricia
=> (column=, value=, timestamp=1374687352290000)
=> (column=friends:3b817b80f48711e2801fdf9895e5d0f8, value='john', timestamp=1374687243064000)

Here the internal column name is more complicated because a UUID is appended to the name of the CQL field “friend”. This is used to keep track of the order of items in the list.

To be determined: Does a list item delete take more time than a list insert? I suspect so – I don’t see how Cassandra can delete an element in the list without reading in all the elements in the list and then deleting column that has the value indicated in the delete. — UPDATE This is the case.

Set

Defining:

cqlsh:test> CREATE TABLE friendsets (  
        ... user text PRIMARY KEY,  
        ... friends set <text>  
        ... );

Inserting:

cqlsh:test> INSERT INTO friendsets (user, friends)  
        ... VALUES ('john',{'doug','patricia','scott'});
cqlsh:test> INSERT INTO friendsets (user, friends)  
        ... VALUES ('patricia', {'john','lucifer'});

Retrieving:

cqlsh:test> SELECT * FROM friendsets;

 user     | friends
----------+-------------------------
     john | {doug, patricia, scott}
 patricia |         {john, lucifer}

Updating:

cqlsh:test> UPDATE friendsets  
        ... SET friends = friends + {'matt','eric'}  
        ... WHERE user='john';
cqlsh:test> UPDATE friendsets  
        ... SET friends = friends - {'lucifer'}  
        ... WHERE user='patricia';

The internal representation:

[default@test] list friendsets;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: john
=> (column=, value=, timestamp=1374688135443000)
=> (column=friends:'doug', value=, timestamp=1374688108307000)
=> (column=friends:'eric', value=, timestamp=1374688135443000)
=> (column=friends:'matt', value=, timestamp=1374688135443000)
=> (column=friends:'patricia', value=, timestamp=1374688108307000)
=> (column=friends:'scott', value=, timestamp=1374688108307000)
-------------------
RowKey: patricia
=> (column=, value=, timestamp=1374688151386000)
=> (column=friends:'john', value=, timestamp=1374688116595000)

4 comments on “Understanding How CQL3 Maps to Cassandra’s Internal Data Structure: Sets, Lists, and Maps

  1. Ok, so how do you test the existence of a member of a set in CQL, or get a specific key value, or select the keys of a map in CQL?

    If these are truly mapped as individual columns for the set keys and map keys, that should be doable through CQL3, but I haven’t found any syntax that works.

  2. For example, how does one do an efficient query to check for set/map key membership existence using CQL3? That should translate to a bloom filter check if the keys/set members are actually implemented as separate columns under the hood, isn’t it?

    Or if I want to retrieve a specific map’s value for a key, that should be a single column retrieve under the hood, but where’s the CQL3 syntax for SELECT map[‘key’] FROM table WHERE rowkey = ‘rowval’ ?

  3. I think I’ve found some workaround for testing the existence of an element in the set. Moreover, it is transactional:

    Given a table:

    CREATE TABLE test ( id varchar primary key, s set<int> );
    we want to check if there is a value 5 in the set s. We need to:

    UPDATE test SET s = s + {5} WHERE id = 'some id' IF s = {5};

    Actually, you need to provide some operation in the SET clause that will not modify the row. The syntax in the IF clause is allowed. Despite it looks like we do a simple comparison, it actually checks for existence of the number 5 in the set s.

    So, if there are no number 5 in the set, the set will not be modified and we get the following result:

    applied | s
    -----------+------
    False | null

    Otherwise, the set already contains the number 5 and adding the number of to it again will not modify the set itself. We get:

    applied
    -----------
    True

    There are some drawbacks of this solution. One of them is that this is not so fast, because it involves paxos. The other one is that, if the element is present in the set, we change its timestamp and TTL.

  4. @cowardlydragon, Hmm… dunno. That’s a good question. If you’re set is of low cardinality then returning the contents of the set is not toooooo much worse than returning the true/false of set membership. If you set if of large cardinality, then you’re probably stretching what Cassandra and it’s set implementation is good at. That said, I think that you’ve got a good point and I’d be interested in seeing a set/list/map inclusion test as part of CQL. @Jacek good find. I had no idea that this syntax was available. Though it is a little weird isn’t it? I mean, we’re using the equals sign to test set inclusion. I wish it’d been something more like IF 5 in s. And it’s a shame that the syntax implies mutation for what we basically want to be a true/false for existence.

Comments are closed.

Developed in Charlottesville, VA | ©2013 – OpenSource Connections, LLC