Java, NoSQL, SQL, REST API and other scary words

Cassandra. Introduction

NoSQL is definitely faster than SQL. This isn’t surprising; NoSQL’s simpler denormalized store allows you to retrieve all information about a specific item in a single request. There’s no need for related JOINs or complex SQL queries.

That said, your project design and data requirements will have most impact. A well-designed SQL database will almost certainly perform better than a badly designed NoSQL equivalent and vice versa.

I decided to start with Cassandra cause, from my point of view, this is the most trending NoSQL for today.

Cassandra is good for:

  • Simple setup, maintenance code
  • Fast random read/write
  • Flexible parsing/wide column requirement
  • No multiple secondary index needed

Not good for:

  • Secondary index
  • Relational data
  • Transactional operations (Rollback, Commit)
  • Primary & Financial record
  • Stringent and authorization needed on data
  • Dynamic queries/searching  on column data
  • Low latency

Key usage case: Twitter

No Joins

You cannot perform joins in Cassandra. Work on the client side or create a denormalized second table that represents the join results for you (this option is definitely preferable).

Query Model

Don’t start with the data model – start with the query model cause all of the data should be organized around queries. Design the most common query paths first and create the tables that you need to support them as a second step.

Denormalization

No joins allows to get the query result asap (all tables stores data like a classic data view). No related tables support allows you not to support related tables/ dictionaries, etc.

Order By

The sort order available on queries is fixed, and is determined entirely by the selection of clustering columns you supply in the CREATE TABLE command. The CQL SELECT statement does support ORDER BY semantics, but only in the order specified by the clustering columns.

Clustering Keys

Suppose your clustering keys are

k1 t1, k2 t2, …, kn tn

where ki is the ith key name and ti is the key type. Then the order data is stored in is lexicographic ordering where each dimension is compared using the comparator for that type.

So (a1, a2, …, an) < (b1, b2, …, bn) if a1 < b1 using t1 comparator, or a1=b1 and a2 < b2 using t2 comparator, or (a1=b1 and a2=b2) and a3 1. In fact, such a query isn’t allowed – the only clustering key constraints that are allowed specify zero or more clustering keys, starting from the first with none missing.

For example, consider the schema:

create table clustering (
x text,
k1 text,
k2 int,
k3 timestamp,
y text,
primary key (x, k1, k2, k3)
);

If you did the following inserts:

insert into clustering (x, k1, k2, k3, y) values (‘x’, ‘a’, 1, ‘2013-09-10 14:00+0000’, ‘1’);
insert into clustering (x, k1, k2, k3, y) values (‘x’, ‘b’, 1, ‘2013-09-10 13:00+0000’, ‘1’);
insert into clustering (x, k1, k2, k3, y) values (‘x’, ‘a’, 2, ‘2013-09-10 13:00+0000’, ‘1’);
insert into clustering (x, k1, k2, k3, y) values (‘x’, ‘b’, 1, ‘2013-09-10 14:00+0000’, ‘1’);

then they are stored in this order on disk (the order select * from clustering where x = ‘x’ returns):

x | k1 | k2 | k3                                                  | y
x | a   | 1  | 2013-09-10 14:00:00+0000 | 1
x | a   | 2  | 2013-09-10 13:00:00+0000 | 1
x | b   | 1  | 2013-09-10 13:00:00+0000 | 1
x | b   | 1  | 2013-09-10 14:00:00+0000 | 1

k1 ordering dominates, then k2, then k3.

ID

If the column values are unique then this column could be a primary key.

What are the advantages and disadvantages of not using a separate uuid field? UUID is good if you need a unique id that is globally unique and you don’t have to check for it’s uniqueness. If you can find a set of columns that can be granted that their combination is unique you don’t have to use UUID (assuming you don’t need an id to refer to it). But it all depends on your query pattern. if you are going to look for an object with it’s id (probably coming from another table) use UUID as primary key.

UUID

As of Cassandra 2.0.7 the best way to create id – use uuid() function, which generates data like this: 123e4567-e89b-12d3-a456-426655440000. Example:

INSERT INTO users(uid, name) VALUES(uuid(), ‘my name’);

There are also several timeuuid functions which could be very useful.

 SQL vs NoSQL code

Here is the syntaxis difference between SQL and NoSQL

Full article: https://www.sitepoint.com/sql-vs-nosql-differences/

SQL NoSQL
insert a new book record
INSERT INTO book (
   `ISBN`, `title`, `author`
)
VALUES (
   ‘9780992461256’,
   ‘Full Stack JavaScript’,
   ‘Colin Ihrig & Adam Bretz’
);
db.book.insert({
   ISBN: “9780992461256”,
   title: “Full Stack JavaScript”,
   author: “Colin Ihrig & Adam Bretz”
});
update a book record
UPDATE book
SET price = 19.99
WHERE ISBN = ‘9780992461256’
db.book.update(
   { ISBN: ‘9780992461256’ },
   { $set: { price: 19.99 } }
);
return all book titles over $10
SELECT title FROM book
WHERE price > 10;
db.book.find(
   { price: { >: 10 } },
   { _id: 0, title: 1 }
);The second JSON object is known as a projection: it sets which fields are returned (_id is returned by default so it needs to be unset).
count the number of SitePoint books
SELECT COUNT(1) FROM book
WHERE publisher_id = ‘SP001’;
db.book.count({
   “publisher.name”: “SitePoint”
});This presumes denormalized documents are used.
return the number of book format types
SELECT format, COUNT(1) AS `total`
FROM book
GROUP BY format;
db.book.aggregate([
   { $group:
      {
         _id: “$format”,
         total: { $sum: 1 }
      }
   }
]);This is known as aggregation: a new set of documents is computed from an original set.
delete all SitePoint books
DELETE FROM book
WHERE publisher_id = ‘SP001’;Alternatively, it’s possible to delete the publisher record and have this cascade to associated book records if foreign keys are specified appropriately.
db.book.remove({
   “publisher.name”: “SitePoint”
});

Security

The simplest way to implement security for my web application is to store uid for each session for each user (same way as at SQL databases).

Naming Convention

Note that keyspace names, function names and argument types are subject to the default naming conventions and case-sensitivity rules. Also, if there are attributes of other related entities by which we are querying, to append those to the name separated with by.

Storage

Each partition is a unit of storage that does not get divided across nodes, a query that searches a single partition will typically yield the best performance. So, the idea is to minimize number of partitions.

No Referential Integrity

Cassandra supports features such as lightweight transactions and batches, Cassandra itself has no concept of referential integrity across tables. In fact – you should forget about some useful things, for example – cascade deletion, explicit commit/rollback functionality.

Database Modeling

Initial MySQL DB

datast

Note: I will use underscore in table names in my Cassandra project

Cassandra Database Modeling

Let’s find important queries in the System. Our main aim is to allow user to talk to each other and see their previous conversation. Second one thing is to manage list of friends (including ignored users and friendship requests). So, here is the prioritized list of queries:

  1. Get my conversation with user
  2. View my friends list
  3. View my friendship requests
  4. View ignored users

These are only Select queries, and, of course, we will have all kinds of queries to add new friendship requests, friends, etc. So, these queries are:

  1. Send new friendship request
  2. Approve Friendship
  3. Ignore User
  4. Send text message

And other queries for the user account:

  1. Create a new account
  2. Modify account
  3. Delete an account

After that – let’s find the most recent queries and place them into workflow order:

  1. Login to the system and get current user data
  2. View list of friends
  3. View messages

Tables

User

User is the simplest table our workflow will start with.

All we need here is:

  1. id
  2. email
  3. first_name
  4. last_name
  5. user_password
  6. created_date
  7. updated_date (?)

Friends by User

To display the friendship we need the same fields as at MySQL application, but also user name to display friend full name (what to do if user will rename himself? this is a reason that I will use Spark in my final implementation):

  1. id
  2. id_user
  3. related_user
  4. link_type
  5. created_date
  6. updated_date

Messages by Friends

The main table is messages_by_friends. Because I don’t need to display my friend’s name – there is no need to store his/her name, also – I need created_date to sort messages and to make the search easier for the end user:

  1. id
  2. message_text
  3. link_id (link to the friends table)
  4. user_id (link to the user table, describes who sent this message)
  5. created_date (cause we will sort messages by date – this is the clustering column)

If I should display user name here – I will put it here as a separate table like first_name, last_name or even better – full_name.

There is another one alternative way how to store messages: separate sent messages and received messages. In fact – it will be duplicated message in our database, but it will be much less data to form sent messages and received. There is no strong need to do that because we will have sent and received messages in one list.

Security

Security should be implemented same way as for MySQL:

  1. session id
  2. user_id
  3. created_date (for statistical purposes only)

Dictionary

Dictionary uses to store different types of the user-user relations.

Install Cassandra

Just follow this link:

http://www.datastax.com/2012/01/getting-started-with-apache-cassandra-on-windows-the-easy-way or you could also use this one link: http://www.planetcassandra.org/cassandra/

In my own example I ran Cassandra CQL Shell:

keys

And made my keyspace:

create keyspace chat with replication={‘class’:’SimpleStrategy’, ‘replication_factor’:1};

Note: My local Cassandra DB starts automatically with Windows.

Create Cassandra tables

Create keyspace fisrt:

CREATE KEYSPACE chat
WITH REPLICATION = { ‘class’ : ‘SimpleStrategy’, ‘replication_factor’ : 1 };

Example: http://www.9threes.com/2014/12/crud-operations-in-cassandra-from-java.html

Dictionary

CREATE TABLE dictionary (
  value_id    text,
  d_value     text,
  d_name    text,
  PRIMARY KEY (value_id)
) WITH comment = ‘dictionary values’
AND CLUSTERING ORDER BY (value_id ASC);

User

CREATE TABLE user (
  user_id               uuid,
  email                  text,
  first_name          text,
  last_name          text,
  user_password   text,
  created_date      timestamp,
  updated_date     timestamp,
  PRIMARY KEY (user_id, email)
) WITH comment = ‘List of all registered and active users’
AND CLUSTERING ORDER BY (email ASC);
Note: System got unique emails only, but it is easier to use unique user id for references from other tables.

User_link

CREATE TABLE user_link (
  link_id                 uuid,
  id_user               uuid,
  related_user       uuid,
  link_type             text,
  created_date      timestamp,
  updated_date     timestamp,
  PRIMARY KEY (user_id, created_date)
) WITH comment = ‘List of friendship status by user’
AND CLUSTERING ORDER BY (created_date DESC);
Note: System got unique pair of id_user and related_user, but it is easier to use unique link id for references from other tables.

Message

CREATE TABLE message(
  message_id         uuid,
  link_id                  uuid,
  id_user                uuid,
  message              text,
  created_date       timestamp,
  PRIMARY KEY (message_id , created_date)
) WITH comment = ‘List of all messages by link id’
AND CLUSTERING ORDER BY (created_date  DESC);

Note: We can avoid message_id because there is no need to reference to it from any other place for today, but I’ll keep it for future improvements.

datast

Add dictionary values

INSERT INTO chat.dictionary (d_name,d_value,value_id) VALUES (‘Friendship Status’,’Requested’,’1′);
INSERT INTO chat.dictionary (d_name,d_value,value_id) VALUES (‘Friendship Status’,’Friends’,’2′);
INSERT INTO chat.dictionary (d_name,d_value,value_id) VALUES (‘Friendship Status’,’Ignored by first’,’3′);
INSERT INTO chat.dictionary (d_name,d_value,value_id) VALUES (‘Friendship Status’,’Ignored by second’,’4′);
INSERT INTO chat.dictionary (d_name,d_value,value_id) VALUES (‘Friendship Status’,’Ignored by both’,’5′);

Cassandra and Java

Configure Maven Dependencies

Let’s add a dependency to our project’s Maven configuration file (don’t forget to configure Eclipse to work with Maven projects):

com.datastax.cassandra
cassandra-driver-mapping
3.1.0

Cluster

Cluster is an main entry point of the driver: Cluster.builder().addContactPoint().build();
addContactPoint() – Adds a contact point to the cluster.
address – address the address of the node to connect to.

// Here 127.0.0.1 is the address of the node.
final Cluster cluster = Cluster.builder().addContactPoint(“127.0.0.1”).build();
// cluster.getClusterName() – To get Cluster Name
// cluster.getDriverVersion() – To get Driver Version
// cluster.getConfiguration() – To get cluster Configurations
// cluster.getMetadata() – To get Cluster Metadata
// cluster.getMetrics() – To get Metrics associated with cluster.

Session

A session (com.datastax.driver.core.Session) holds connections to a Cassandra cluster, allowing it to be queried. To create or get a Session object:
cluster.connect() – Creates a new session on this cluster and initialize it.
connect(keyspace) – Creates a new session on this cluster, initialize it and sets the keyspace to the provided one.
cluster.newSession() – Creates a new session on this cluster but does not initialize it.
How to execute DB Query on Cassandra Session object :
session.execute(<> | | | )
session.executeAsync(<> | | | )

CRUD example

This is a code from example: http://www.devjavasource.com/cassandra/cassandra-crud-operation-using-java/

//package
//imports
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
 
public class App {
    public static void main(String[] args) {
 
        // Connect to the cluster and keyspace “chatRest”
        final Cluster cluster = Cluster.builder().addContactPoint(“127.0.0.1”)
                .build();  
        final Session session = cluster.connect(“chatRest”);
        
        //Cluster Name is: cluster.getClusterName()
        //Driver Version is: cluster.getDriverVersion()
        //Cluster Configuration is: cluster.getConfiguration()
        //Cluster Metadata is: cluster.getMetadata()
        //Cluster Metrics is:  cluster.getMetrics()    
        
        // Insert new User into users table
        session.execute(“INSERT INTO users (id, address, name) VALUES (11104, ‘USA’, ‘Stuatr’)”);
        getUsersAllDetails(session);
        
        // Update user data in users table
        session.execute(“update users set address = ‘USA NEW’ where id = 11104”);
        getUsersAllDetails(session);
        
        // Delete user from users table
        session.execute(“delete FROM users where id = 11104”);
        getUsersAllDetails(session);
        
        // Close Cluster and Session objects
        //Cluster Closed: cluster.isClosed()
        //Session Closed: session.isClosed()         
        cluster.close();
        session.close();
    }
    
    private static void getUsersAllDetails(final Session inSession){      
        // Use select to get the users table data
        ResultSet results = inSession.execute(“SELECT * FROM users”);
        for (Row row : results) {
            System.out.format(“%s %d %s\n”, row.getString(“name”),
                    row.getInt(“id”), row.getString(“address”));
     }
}
}
Advertisements
Standard

One thought on “Cassandra. Introduction

  1. Pingback: Spark Introduction. RDD | Alexey Kiselev

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s