Java, NoSQL, SQL, REST API and other scary words

Cassandra Datastax and Java – best way to set up connection

I’ll research the best way to make a connection from my Java code to Cassandra here. There are a lot of examples of how to do that, but the main thing is that I’m developing a kind of chat application on my localhost (it will do single insert/update statements, etc.), while all these Spark examples are aimed at analytical workloads.
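For a chat-style workload with single-row inserts and updates, the plain DataStax Java driver may actually be a better fit than Spark. A minimal connection sketch, assuming driver 3.x on the classpath and the same localhost contact point (the class name is mine):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraConnector {
    // Builds a Cluster pointing at the local node; connect() opens a Session.
    public static Session connect() {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .build();
        return cluster.connect(); // the Session is thread-safe; reuse it application-wide
    }
}
```

Note that connect() needs a running Cassandra node; the Cluster object itself can be built offline.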

The first example is for Spark 1.6:

public static JavaSparkContext getCassandraConnector() {
    SparkConf conf = new SparkConf();
    conf.setAppName("Chat");
    conf.set("spark.driver.allowMultipleContexts", "true");
    conf.set("spark.cassandra.connection.host", "127.0.0.1");
    conf.set("spark.rpc.netty.dispatcher.numThreads", "2");
    conf.setMaster("local[2]");

    return new JavaSparkContext(conf);
}

I also found an example for Spark 2.x, where the builder automatically reuses an existing SparkContext if one exists and creates a new one if it does not. Configuration options set in the builder are automatically propagated to Spark and Hadoop during I/O.
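A Spark 2.x counterpart of the snippet above might look like this (a sketch, assuming spark-sql 2.x on the classpath; getOrCreate() is what provides the reuse behaviour just described):

```java
import org.apache.spark.sql.SparkSession;

public class SparkConnector {
    public static SparkSession getSession() {
        // getOrCreate() reuses an existing SparkSession/SparkContext if one is running
        return SparkSession.builder()
                .appName("Chat")
                .master("local[2]")
                .config("spark.cassandra.connection.host", "127.0.0.1")
                .getOrCreate();
    }
}
```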


Spark Dataset API implementation

Dataframes

Spark introduced DataFrames in the 1.3 release. The DataFrame overcomes the key challenges that RDDs had.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or an R/Python data frame. Along with the DataFrame, Spark also introduced the Catalyst optimizer, which leverages advanced programming features to build an extensible query optimizer.

Dataframe Features

  • Distributed collection of Row objects: a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
  • Data processing: processes structured and unstructured data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.). It can read from and write to all these data sources.
  • Optimization using the Catalyst optimizer: it powers both SQL queries and the DataFrame API. DataFrames use the Catalyst tree-transformation framework in four phases: 1. analyzing a logical plan to resolve references; 2. logical-plan optimization; 3. physical planning; 4. code generation to compile parts of the query to Java bytecode.
  • Hive compatibility: using Spark SQL, you can run unmodified Hive queries on your existing Hive warehouses. It reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs.
  • Tungsten: Tungsten provides a physical execution backend which explicitly manages memory and dynamically generates bytecode for expression evaluation.
  • Programming languages supported: the DataFrame API is available in Java, Scala, Python, and R.
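A toy sketch of the Java API (assumes spark-sql on the classpath; the data and the column name are invented for illustration):

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataframeDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataframeDemo")
                .master("local[2]")
                .getOrCreate();
        // A DataFrame (Dataset<Row>) with one named column, built from an in-memory list
        Dataset<Row> df = spark.createDataset(Arrays.asList("alice", "bob"), Encoders.STRING())
                .toDF("name");
        // The filter goes through Catalyst: logical plan -> optimized plan -> physical plan -> codegen
        df.filter("name = 'alice'").show();
        spark.stop();
    }
}
```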


Modification of the SQL ER model to NoSQL Cassandra

Original MySQL ER-model:

[Image: original MySQL ER model]

There are no difficulties in implementing the account management and security cases based on the UI architecture – there is no need to join tables.

The first problem appeared on the attempt to implement friendship data or messages – we need to join the users_link and users tables for the friendship data, and we need to join messages with the friendship data to filter which messages to display.

There are some shortcuts which will allow us to solve these problems.


Spark and Data Formats. Introduction

This is a pretty short compilation about data formats; I will ignore JSON, XML, and a lot of other formats here.

From my point of view, this presentation is very important for understanding the differences between the formats (I used a lot of other data sources as well, just in case):

http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015

CSV

The CSV (“Comma-Separated Values”) file format is often used to exchange data between disparate applications. The CSV format:

  • Each record is one line – the line separator may be LF (0x0A) or CRLF (0x0D0A); a line separator may also be embedded in the data (making a record span more than one line but still be acceptable).
  • Fields are separated with commas.
  • Leading and trailing whitespace is ignored – unless the field is delimited with double quotes, in which case the whitespace is preserved.
  • Embedded commas – the field must be delimited with double quotes.
  • Embedded double quotes – embedded double-quote characters must be doubled, and the field must be delimited with double quotes.
  • Embedded line breaks – fields must be surrounded by double quotes.
  • Always delimiting – fields may always be delimited with double quotes; the delimiters will be parsed and discarded by the reading applications.

Example:
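For instance, the record `1997,Ford,E350,"Super, ""luxurious"" truck"` has four fields, the last one containing an embedded comma and doubled quotes. A tiny parser sketch for single-line records following the rules above (whitespace trimming outside quotes is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLine {
    // Splits one CSV record into fields, honoring the quoting rules above:
    // embedded commas only inside double-quoted fields, embedded quotes written as "".
    public static List<String> parse(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote -> literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    field.append(c);
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(field.toString()); // field boundary
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // last field
        return fields;
    }
}
```

`CsvLine.parse("1997,Ford,E350,\"Super, \"\"luxurious\"\" truck\"")` yields four fields, the last one being `Super, "luxurious" truck`.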


Spark Introduction. RDD

Once again, the key thing about queries in Cassandra (my previous article about it) – there is no way to do joins there. So you should be very careful with the database model. Anyway, if there is a strong need to perform relational queries over data stored in Cassandra clusters – use Spark.

Let’s start with a short introduction to Spark. It is:

  • A lightning-fast cluster computing technology, designed for fast computation,
  • A unified relational query language for traversing over Spark Resilient Distributed Datasets (RDDs),
  • Support for a variation of the query language used in relational databases,
  • Not about its own database and query language – Spark is about a query language over other databases (in our case – Cassandra). You can execute Spark queries in Java applications that traverse over Cassandra column families.

One pretty nice article about Spark is here: http://spark.apache.org/docs/latest/programming-guide.html
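Before involving Cassandra at all, the core RDD API can be tried locally; a minimal sketch, assuming spark-core on the classpath:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddDemo").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD is an immutable, partitioned collection; map() is lazy, reduce() triggers the job
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));
            int sumOfSquares = numbers.map(n -> n * n).reduce(Integer::sum);
            System.out.println(sumOfSquares); // 1 + 4 + 9 + 16 = 30
        }
    }
}
```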

Spark and Hadoop difference

[Image: Spark vs. Hadoop comparison]

What to add?


Cassandra. Introduction

NoSQL is definitely faster than SQL. This isn’t surprising; NoSQL’s simpler denormalized store allows you to retrieve all information about a specific item in a single request. There’s no need for related JOINs or complex SQL queries.

That said, your project design and data requirements will have the most impact. A well-designed SQL database will almost certainly perform better than a badly designed NoSQL equivalent, and vice versa.

I decided to start with Cassandra because, from my point of view, it is the most trending NoSQL database today.

Cassandra is good for:

  • Simple setup and maintenance
  • Fast random reads/writes
  • Flexible parsing / wide-column requirements
  • Cases where multiple secondary indexes are not needed

Not good for:

  • Secondary index
  • Relational data
  • Transactional operations (Rollback, Commit)
  • Primary and financial records
  • Stringent security and authorization requirements on data
  • Dynamic queries/searching on column data
  • Low latency

Key usage case: Twitter


Simple SQL commands

SELECT
The SELECT statement is used to select data from a database. The result is stored in a result table, called the result-set.
    SELECT column_name, column_name
    FROM table_name;

DISTINCT
In a table, a column may contain many duplicate values; sometimes you only want to list the different (distinct) values. The DISTINCT keyword returns only distinct values.
    SELECT DISTINCT column_name, column_name
    FROM table_name;

WHERE
The WHERE clause is used to extract only those records that fulfill a specified criterion.
    SELECT column_name, column_name
    FROM table_name
    WHERE column_name operator value;

AND/OR
The AND operator displays a record if both conditions are true; the OR operator displays a record if either condition is true.

ORDER BY
The ORDER BY keyword is used to sort the result-set by one or more columns. It sorts the records in ascending order by default; to sort in descending order, use the DESC keyword.

INSERT INTO
The INSERT INTO statement is used to insert new records in a table.
    INSERT INTO table_name
    VALUES (value1, value2, value3, …);

    INSERT INTO table_name (column1, column2, column3, …)
    VALUES (value1, value2, value3, …);

UPDATE
The UPDATE statement is used to update existing records in a table.

DELETE
The DELETE statement is used to delete existing records in a table.

SELECT TOP
The SELECT TOP clause is used to specify the number of records to return.

LIKE
The LIKE operator is used to search for a specified pattern in a column.

IN
The IN operator allows you to specify multiple values in a WHERE clause.

BETWEEN
The BETWEEN operator is used to select values within a range.

Joins
SQL joins are used to combine rows from two or more tables:
    INNER JOIN: returns all rows when there is at least one match in BOTH tables
    LEFT JOIN: returns all rows from the left table, and the matched rows from the right table
    RIGHT JOIN: returns all rows from the right table, and the matched rows from the left table
    FULL JOIN: returns all rows when there is a match in ONE of the tables

INNER JOIN
The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables. INNER JOIN is the same as JOIN.
[Diagram: the common values of the two tables]

LEFT JOIN
The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows from the right table (table2). The result is NULL on the right side when there is no match.
    SELECT column_name(s)
    FROM table1
    LEFT JOIN table2
    ON table1.column_name = table2.column_name;

RIGHT JOIN
The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows from the left table (table1). The result is NULL on the left side when there is no match.
    SELECT column_name(s)
    FROM table1
    RIGHT OUTER JOIN table2
    ON table1.column_name = table2.column_name;

FULL JOIN
The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right table (table2); it combines the result of both LEFT and RIGHT joins.
    SELECT column_name(s)
    FROM table1
    FULL OUTER JOIN table2
    ON table1.column_name = table2.column_name;

UNION
The UNION operator combines the result of two or more SELECT statements. It selects only distinct values by default; to allow duplicate values, use UNION ALL.
    SELECT column_name(s) FROM table1
    UNION
    SELECT column_name(s) FROM table2;

    SELECT column_name(s) FROM table1
    UNION ALL
    SELECT column_name(s) FROM table2;

SELECT INTO
The SELECT INTO statement copies data from one table and inserts it into a new table.
    SELECT *
    INTO newtable [IN externaldb]
    FROM table1;

INSERT INTO SELECT
The INSERT INTO SELECT statement copies data from one table and inserts it into an existing table.
    INSERT INTO table2
    SELECT * FROM table1;

CREATE DATABASE
The CREATE DATABASE statement is used to create a database.

CREATE TABLE
The CREATE TABLE statement is used to create a table; columns can carry constraints:
    NOT NULL – indicates that a column cannot store a NULL value
    UNIQUE – ensures that each row for a column must have a unique value
    PRIMARY KEY – a combination of NOT NULL and UNIQUE; ensures that a column (or combination of columns) has a unique identity, which helps to find a particular record in a table more easily and quickly
    FOREIGN KEY – ensures the referential integrity of the data in one table to match values in another table
    CHECK – ensures that the value in a column meets a specific condition
    DEFAULT – specifies a default value when none is specified for the column
    AUTO INCREMENT – generates a unique number automatically when a new record is inserted

DROP
Indexes, tables, and databases can easily be deleted/removed with the DROP statement:
    DROP INDEX index_name ON table_name
    DROP INDEX table_name.index_name
    DROP INDEX index_name
    ALTER TABLE table_name DROP INDEX index_name
    DROP TABLE table_name
    DROP DATABASE database_name
    TRUNCATE TABLE table_name

ALTER TABLE
The ALTER TABLE statement is used to add, delete, or modify columns in an existing table.
    ALTER TABLE table_name
    ADD column_name datatype

Views
In SQL, a view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table; the fields in a view are fields from one or more real tables in the database. You can add SQL functions, WHERE, and JOIN statements to a view and present the data as if it were coming from one single table.
    CREATE VIEW view_name AS
    SELECT column_name(s)
    FROM table_name
    WHERE condition

Date Data Types
MySQL comes with the following data types for storing a date or a date/time value:
    DATE – format YYYY-MM-DD
    DATETIME – format YYYY-MM-DD HH:MM:SS
    TIMESTAMP – format YYYY-MM-DD HH:MM:SS
    YEAR – format YYYY or YY
SQL Server comes with the following data types for storing a date or a date/time value:
    DATE – format YYYY-MM-DD
    DATETIME – format YYYY-MM-DD HH:MM:SS
    SMALLDATETIME – format YYYY-MM-DD HH:MM:SS
    TIMESTAMP – format: a unique number

SQL Functions
Aggregate functions return a single value, calculated from values in a column:
    AVG() – returns the average value
    COUNT() – returns the number of rows
    FIRST() – returns the first value
    LAST() – returns the last value
    MAX() – returns the largest value
    MIN() – returns the smallest value
    SUM() – returns the sum
Scalar functions return a single value, based on the input value:
    UCASE() – converts a field to upper case
    LCASE() – converts a field to lower case
    MID() – extracts characters from a text field
    LEN() – returns the length of a text field
    ROUND() – rounds a numeric field to the specified number of decimals
    NOW() – returns the current system date and time
    FORMAT() – formats how a field is to be displayed