1
Getting Started with User Profile Data Modeling
White Paper
2
What was old is new again .................................................................................. 3
Fixed schema User Data Pattern 1 ................................................................... 3
Fixed schema on a wide row User Data Pattern 2 ............................................ 3
Fixed Schema with a dynamic collection User Data Pattern 3 .......................... 5
Conclusion .......................................................................................................... 6
3
What was old is new again
Creating static schema in Cassandra may seem unusual, but is frequently done in production
data models. Not every table created needs to have dynamic columns. In some cases it's useful
to have the data stored in fixed column names with type validation. The following examples will
get you started with defining static schema tables and in some cases create hybrid data models
that add a dynamic column component for greater flexibility.
Fixed schema – User Data Pattern 1
To start, we'll work on a very basic table of data that you might find in any relational database. If
you have created a table in MySQL or Oracle, this should look very familiar to you. In the syntax
below, the username is the only primary key, so we put the declaration to the right of field type.
You can use this reference to see the different types that can be used:
http://www.datastax.com/docs/1.1/references/cql/cql_data_types
CREATE TABLE users (
username text PRIMARY KEY,
first_name text,
last_name text,
address1 text,
city text,
postal_code text,
last_login timestamp
);
Once we have our table definition, it should be as equally as familiar to insert data using the
INSERT INTO syntax.
INSERT INTO users
(username,first_name,last_name,address1,city,postal_code,last_login)
VALUES ('cstar','Cassandra','Database','123 Main St','San
Mateo','11111','2013-4-4');
Once the data is in, you can use the SELECT command.
SELECT first_name, last_name
FROM users
WHERE username = 'cstar';
One large difference between relational databases and Cassandra CQL is the WHERE clause
syntax. Because of how the data is stored, you can only use the elements declared in the
primary key. Unless you want every row, you need to specify the row key at a minimum. We'll
use the next example to dive further into this difference.
Fixed schema on a wide row – User Data
Pattern 2
For the this pattern, we'll again be storing what looks like static row oriented data. The subtle
difference here is in how the data is stored by Cassandra. By using multiple fields in the
PRIMARY KEY definition, we are specifying that this data will be stored in a wide row. Let's use a
playlist model as an example:
4
CREATE TABLE user_playlist (
username text,
playlist_name text,
song_name text,
artist text,
list_order int,
songid UUID
PRIMARY KEY (username, playlist_name, song_name)
);
At first glance, it would appear that we are creating a table that will contain multiple rows with 3
unique fields. The first element 'username', will be used as the row key. The remaining elements
will be be used by CQL to create unique columns in the underlying storage engine. The result set
presented to the user is a large table of data similar to a relational database. Let's insert some
data and see how it works:
INSERT INTO user_playlist (username, playlist_name, song_name, artist,
list_order, songid)
VALUES ('cstar','Classic Rock','Stairway to heaven','Led
Zepplin',1,c9df5407-7dd8-46a4-84ee-1c358106c926);
INSERT INTO user_playlist (username, playlist_name, song_name, artist,
list_order, songid)
VALUES ('cstar','Classic Rock','Sweet home Alabama','Lynyrd
Skynyrd',2,d2fb151c-cba8-46ff-bc03-66b32c45434a);
INSERT INTO user_playlist (username, playlist_name, song_name, artist,
list_order, songid)
VALUES ('cstar','Alternative','The Queen is dead','The
Smiths',1,24106aa0-10d5-4de1-9dab-a3a652e32c49);
INSERT INTO user_playlist (username, playlist_name, song_name, artist,
list_order, songid)
VALUES ('cstar','Alternative','Blue Monday','New Order',2,7a0b82ba-
5a24-4554-903c-1970d1e3aaea);
All of the data inserted will be on one row with the key 'cstar', however, we we access that data with CQL,
it will appear as many rows.
SELECT playlist_name,song_name,artist,list_order from user_playlist
WHERE username='cstar';
playlist_name | song_name | artist | list_order
---------------+--------------------+----------------+------------
Alternative | Blue Monday | New Order | 2
Alternative | The Queen is dead | The Smiths | 1
Classic Rock | Stairway to heaven | Led Zepplin | 1
Classic Rock | Sweet home Alabama | Lynyrd Skynyrd | 2
The output is nicely formatted and easier to read by the user. The underlying storage engine has actually
created a single row of data using unique column names derived from the PRIMARY KEY definition. The
result is a faster access pattern on reads when all data is co-located on the same node and sequentially read
from disk. You can read more in-depth here: http://www.datastax.com/docs/1.2/ddl/table
5
Fixed Schema with a dynamic collection –
User Data Pattern 3
When you need some dynamic element to the static table, this can be accomplished by using the collection
feature of CQL. A collection can be a Map, Set or List and can be read about more here:
http://www.datastax.com/dev/blog/cql3_collections
The example we'll use is a user table with an added requirement to dynamically store one or more locations
sorted by time. The map collection can be used to create an ordered pair of data where the key is a
timestamp and the value is a location stored as text.
CREATE TABLE user_with_location (
username text PRIMARY KEY,
first_name text,
last_name text,
address1 text,
city text,
postal_code text,
last_login timestamp,
location_by_date map<timestamp,text>
);
First we insert the entire user record. Note: When inserting a collection, you will overwrite any of
the previous values if the row key exists.
INSERT INTO user_with_location
(username,first_name,last_name,address1,city,postal_code,last_login,loc
ation_by_date)
VALUES ('cstar','Cassandra','Database','123 Main St','San
Mateo','11111','2013-4-4',{ '2013-4-4' : 'San Francisco'});
Then we add a couple of locations for this user. This will be an update so not overwrite the
previous location data in the collection. If you use the same map key as a previous data point,
you will overwrite just that map value.
UPDATE user_with_location SET location_by_date['2013-4-5'] = 'San
Diego'
WHERE username = 'cstar';
UPDATE user_with_location SET location_by_date['2013-4-2'] = 'Santa
Rosa'
WHERE username = 'cstar';
When we select the row of data for the user 'cstar', the result set will contain the entire map of
locations. Collections have to be returned completely, so keep that in mind as you design your
model. Too much in one collection could cause a lot of data to be returned.
SELECT first_name, last_name, location_by_date
FROM user_with_location
WHERE username = 'cstar';
first_name | last_name | location_by_date
------------+-----------+--------------------------------------
Cassandra | Database | {2013-04-02 00:00:00-0700: Santa Rosa,
2013-04-04 00:00:00-0700: San Francisco,
2013-04-05 00:00:00-0700: San Diego}
You might also notice that the data in the first element of the map is sorted. This is a
feature of the map to sort by the type of the key.
6
Conclusion
As you can see, creating user based schema can be simple and with some of the CQL
features shown, powerful in creating wide row data sets. Cassandra has had the reputation
of only fitting large dynamic data sets, but with CQL you can create fixed schema and
strong validation typing. There are many other patterns for user profile management.
These three should get you started on your user centered data model.