Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Wednesday, August 02, 2017

Working with Apache Cassandra (RHEL 7)


Introduction
Cassandra, created at Facebook for inbox search, is a NoSQL database like HBase. Generally, this means you cannot manipulate the database with SQL. However, Cassandra implements CQL (Cassandra Query Language), whose syntax is modeled after SQL and which is designed to manage extremely large data sets. Cassandra is a distributed database: clients can connect to any node in the cluster and access any data.

The primary container of data is a keyspace, which is like a database in an RDBMS. Inside a keyspace are one or more column families, which are like relational tables, but they are more fluid and dynamic in structure. Column families have from one to many thousands of columns, with both primary and secondary indexes on columns being supported.

In Cassandra, objects are created, data is inserted and manipulated, and information queried via CQL – the Cassandra Query Language, which looks nearly identical to SQL. Developers coming from the relational world will be right at home with CQL and will use standard commands (e.g., INSERT, SELECT) to interact with objects and data stored in Cassandra.

The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra uses a peer-to-peer distributed architecture across its nodes, and data is distributed among all the nodes in a cluster.

NoSQL Database

A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.

Relational Database | NoSQL Database
Supports powerful query language. | Supports very simple query language.
It has a fixed schema. | No fixed schema.
Follows ACID (Atomicity, Consistency, Isolation, and Durability). | It is only “eventually consistent”.
Supports transactions. | Does not support transactions.

Besides Cassandra, the following NoSQL databases are quite popular: Apache HBase and MongoDB.

Features of Cassandra

Cassandra has become so popular because of its outstanding technical features. Given below are some of the features of Cassandra:

Elastic scalability - Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as required.

Always on architecture - Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.

Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.

Flexible data storage - Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.

Easy data distribution - Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.

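Transaction support - Cassandra does not offer full ACID (Atomicity, Consistency, Isolation, and Durability) transactions in the relational sense. Writes are atomic, isolated, and durable at the level of a single partition, and consistency is tunable per operation.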

Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.

Cassandra - Architecture
Components of Cassandra


The key components of Cassandra are as follows

Node − It is the place where data is stored.

Data center − It is a collection of related nodes.

Cluster − A cluster is a component that contains one or more data centers.

Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.

Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, a single column family will have multiple mem-tables.

SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.

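Bloom filter − A Bloom filter is a quick, probabilistic algorithm for testing whether an element is a member of a set. It can return false positives but never false negatives, and it acts as a special kind of cache. Bloom filters are consulted on every read so that SSTables that cannot contain the requested data are skipped.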
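The membership test that a Bloom filter performs can be illustrated with a minimal sketch. This is not Cassandra's implementation: the class name, the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size, chosen for illustration
        self.num_hashes = num_hashes  # assumed number of hash functions
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    bf = BloomFilter()
    for empno in (7369, 7499, 7902):
        bf.add(empno)
    print(bf.might_contain(7369))  # True: keys that were added are always found
    print(bf.might_contain(1234))  # usually False; True would be a false positive
```

A "definitely absent" answer lets a read skip a data file entirely, which is why the filter behaves like a cheap pre-check rather than a source of truth.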


Cassandra Query Language

Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through separate application language drivers.

Clients can approach any node for their read and write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.

Write Operations
Every write is first recorded in the commit log on the node. The data is then stored in the mem-table. Whenever the mem-table is full, its data is written to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.
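The write path described above can be sketched as a toy model. Everything here is an assumption for illustration: the class name ToyNode, the entry-count flush threshold, and the simplified read that prefers the mem-table over SSTables (Cassandra itself reconciles by timestamp).

```python
class ToyNode:
    """Toy model of the write path: commit log -> mem-table -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (entries)
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # immutable snapshots, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record in the commit log
        self.memtable[key] = value            # 2. then store in the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when the mem-table is full

    def flush(self):
        # Write the mem-table out as a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

    def read(self, key):
        # Simplification: mem-table first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None

    def compact(self):
        # Consolidate all SSTables into one, keeping the newest value per key.
        if not self.sstables:
            return
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    node.write("7369", "SMITH")
    node.write("7499", "ALLEN")      # second entry triggers a flush
    node.write("7369", "SMITH-2")    # newer value stays in the mem-table
    print(node.read("7369"))         # SMITH-2
    print(len(node.sstables))        # 1
```

Keeping SSTables immutable is what makes the periodic consolidation step necessary: overwritten values are only discarded when files are merged.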

Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that holds the required data.

Snitches

A snitch determines which datacenters and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into datacenters and racks. Specifically, the replication strategy places the replicas based on the information provided by the snitch. All nodes must use the same snitch so that they agree on each node's rack and datacenter. Cassandra does its best not to have more than one replica on the same rack (which is not necessarily a physical location).

Cassandra - Data Model

Cluster: Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to them.

Keyspace:
Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace in Cassandra are


Replication factor − It is the number of machines in the cluster that will receive copies of the same data.
Replica placement strategy − It is the strategy used to place replicas in the ring. The strategies are simple strategy (rack-unaware strategy), old network topology strategy (rack-aware strategy), and network topology strategy (datacenter-shard strategy).

Column families − Keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
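How the replication factor and the simple strategy interact can be sketched as follows. This is a minimal illustration, assuming the first replica goes to the node that owns the data and the remaining replicas go to the next nodes clockwise on the ring; the function name and the node list are hypothetical, and real placement operates on token ranges rather than list indexes.

```python
def simple_strategy_replicas(ring, first_index, replication_factor):
    """Pick replicas by walking the ring clockwise from the owning node.

    ring: node names in ring order (hypothetical example data).
    first_index: position of the node that owns the row's token.
    """
    # Simplification: never return more replicas than there are nodes.
    rf = min(replication_factor, len(ring))
    return [ring[(first_index + i) % len(ring)] for i in range(rf)]


if __name__ == "__main__":
    ring = ["dn01", "dn02", "dn03", "dn04"]
    # The owner is dn03; with a replication factor of 3 the walk wraps around.
    print(simple_strategy_replicas(ring, 2, 3))  # ['dn03', 'dn04', 'dn01']
```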


Column
A column is the basic data structure of Cassandra with three values, namely key or column name, value, and a time stamp.
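The three parts of a column can be modelled in a few lines. This is a hedged sketch: the Column class and the reconcile helper are hypothetical names, and the rule shown (the version with the newer time stamp wins) ignores how ties are broken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: object
    timestamp: int  # any increasing integer works for this illustration


def reconcile(a, b):
    """Return the version of a column with the newer time stamp."""
    return a if a.timestamp >= b.timestamp else b


if __name__ == "__main__":
    old = Column("comm", 300, timestamp=1)
    new = Column("comm", 100, timestamp=2)
    print(reconcile(old, new).value)  # 100
```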

SuperColumn

A super column is a special column, therefore, it is also a key-value pair. But a super column stores a map of sub-columns. 


Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here.


Installation                                                      

You can download Cassandra from the links below. Installation from RPM packages is also possible; the steps in this post use the binary tarball.
http://cassandra.apache.org/download/
http://www.apache.org/dyn/closer.lua/cassandra/3.11.0/apache-cassandra-3.11.0-bin.tar.gz


[hdpsysuser@dn01 ~]$ cd /usr/hadoopsw/
[hdpsysuser@dn01 ~]$ tar zxvf apache-cassandra-3.11.0-bin.tar.gz

1- Untar the file somewhere
[root@dn01 hadoopsw]# tar -xvf apache-cassandra-3.11.0-bin.tar.gz

2- Start Cassandra in the foreground by invoking bin/cassandra -f from the command line. Press “Control-C” to stop Cassandra. Start Cassandra in the background by invoking bin/cassandra from the command line. Invoke kill pid or pkill -f CassandraDaemon to stop Cassandra, where pid is the Cassandra process id, which you can find for example by invoking pgrep -f CassandraDaemon.


[root@dn01 hadoopsw]# cassandra -f

Running Cassandra as root user or group is not recommended - please start Cassandra using a different system user.
If you really want to force running Cassandra as root, use -R command line option.

[root@dn01 hadoopsw]# useradd cass
[root@dn01 hadoopsw]# chown -R cass:cass /usr/hadoopsw/apache-cassandra-3.11.0
[root@dn01 hadoopsw]# su - cass
[cass@dn01 ~]$ cat ~/.bash_profile

# .bash_profile

#######Cassandra Variables##########
export CASSANDRA_HOME=/usr/hadoopsw/apache-cassandra-3.11.0
export PATH=$PATH:$CASSANDRA_HOME/bin

[cass@dn01 ~]$ source ~/.bash_profile

[cass@dn01 ~]$ cassandra -f
CTRL+C
[cass@dn01 ~]$ cassandra

3- Verify that Cassandra is running by invoking bin/nodetool status from the command line.

[cass@dn01 ~]$ nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  174.71 KiB  256          100.0%            e4382ae1-33a0-4a1d-9451-62978b9833be  rack1

4- Configuration files are located in the CASSANDRA_HOME/conf sub-directory. Since Cassandra 2.1, log and data directories are located in the CASSANDRA_HOME/logs and CASSANDRA_HOME/data sub-directories respectively.


Configure Cassandra                                                            

For running Cassandra on a single node, the steps above are enough; you don’t really need to change any configuration. However, when you deploy a cluster of nodes, or use clients that are not on the same host, then there are some parameters that must be changed.
The Cassandra configuration files can be found in the conf directory of tarballs. For packages, the configuration files will be located in /etc/cassandra.

Main runtime properties
Most configuration in Cassandra is done via YAML properties that can be set in cassandra.yaml. At a minimum you should consider setting the following properties:

cluster_name: the name of your cluster.
seeds: a comma separated list of the IP addresses of your cluster seeds.
storage_port: you don’t necessarily need to change this but make sure that there are no firewalls blocking this port.
listen_address: the IP address of your node. This is what allows other nodes to communicate with this node, so it is important that you change it. Alternatively, you can set listen_interface to tell Cassandra which interface to use, and consequently which address to use. Set only one, not both.
native_transport_port: as for storage_port, make sure this port is not blocked by firewalls as clients will communicate with Cassandra on this port.
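The properties listed above can be put together in a small helper that renders a cassandra.yaml fragment for one node. This is a sketch under assumptions: the function name render_overrides is hypothetical, the seed_provider layout follows the default cassandra.yaml shipped with Cassandra 3.11, and the port values are the defaults listed later in this post.

```python
def render_overrides(cluster_name, seeds, listen_address=None, listen_interface=None):
    """Render the minimum cassandra.yaml properties as YAML text."""
    # Exactly one of listen_address / listen_interface may be set.
    if (listen_address is None) == (listen_interface is None):
        raise ValueError("set exactly one of listen_address or listen_interface")
    if listen_address is not None:
        listen_line = f"listen_address: {listen_address}"
    else:
        listen_line = f"listen_interface: {listen_interface}"
    seed_list = ",".join(seeds)  # seeds is a comma separated list of IP addresses
    return "\n".join([
        f"cluster_name: '{cluster_name}'",
        "seed_provider:",
        "    - class_name: org.apache.cassandra.locator.SimpleSeedProvider",
        "      parameters:",
        f'          - seeds: "{seed_list}"',
        "storage_port: 7000",
        listen_line,
        "native_transport_port: 9042",
    ])


if __name__ == "__main__":
    print(render_overrides(
        "Test Cluster",
        ["192.168.49.135", "192.168.49.136"],
        listen_address="192.168.49.135",
    ))
```

The helper only prints text; copying the values into the real cassandra.yaml on each node is still a manual step.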

Changing the location of directories

The following yaml properties control the location of directories:

data_file_directories: one or more directories where data files are located.
commitlog_directory: the directory where commitlog files are located.
saved_caches_directory: the directory where saved caches are located.
hints_directory: the directory where hints are located.
For performance reasons, if you have multiple disks, consider putting commitlog and data files on different disks.

You can repeat the above steps on the other nodes if your cluster spans more than one node.

[cass@dn02 ~]$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.49.136  175.47 KiB  256          100.0%            857f41ec-2fbc-456f-a349-437a7fee7e1f  rack1
UN  192.168.49.135  337.98 KiB  256          100.0%            e4382ae1-33a0-4a1d-9451-62978b9833be  rack1


[cass@dn03 ~]$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.49.136  237.26 KiB  256          66.1%             857f41ec-2fbc-456f-a349-437a7fee7e1f  rack1
UN  192.168.49.137  103.71 KiB  256          64.6%             5f9089b2-6d0a-4660-962a-9db81887b2fd  rack1
UN  192.168.49.135  235.46 KiB  256          69.3%             e4382ae1-33a0-4a1d-9451-62978b9833be  rack1


Environment variables

JVM-level settings such as heap size can be set in cassandra-env.sh. You can add any additional JVM command line argument to the JVM_OPTS environment variable; when Cassandra starts these arguments will be passed to the JVM.

Logging

The logger in use is logback. You can change logging properties by editing logback.xml. By default it will log at INFO level into a file called system.log and at debug level into a file called debug.log. When running in the foreground, it will also log at INFO level to the console.



Internode communications (gossip)                                       

In Cassandra internode communication is performed using Gossip which is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.
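The exchange described above can be simulated in a few lines. This is a toy model, not Cassandra's gossip protocol: the function names are hypothetical, each node's state is reduced to a single version number, and peers are chosen uniformly at random.

```python
import random


def gossip_round(states, rng, fanout=3):
    """One gossip round: every node exchanges state with up to `fanout` peers.

    states maps node -> {node: version}; higher versions overwrite lower ones.
    """
    nodes = list(states)
    for node in nodes:
        peers = [n for n in nodes if n != node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            merged = dict(states[node])
            for name, version in states[peer].items():
                if version > merged.get(name, -1):
                    merged[name] = version  # newer information replaces older
            # Both sides leave the exchange with the merged, most recent view.
            states[node] = merged
            states[peer] = dict(merged)
    return states


def rounds_until_converged(num_nodes=10, seed=0, max_rounds=50):
    """Count the rounds until every node knows about every other node."""
    rng = random.Random(seed)
    # Each node starts out knowing only its own state (version 1).
    states = {f"n{i}": {f"n{i}": 1} for i in range(num_nodes)}
    for round_no in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(len(view) == num_nodes for view in states.values()):
            return round_no
    return None


if __name__ == "__main__":
    print(rounds_until_converged())
```

Because every exchange also passes on what a node has heard about others, knowledge spreads much faster than one hop per round.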

To prevent problems in gossip communications, use the same list of seed nodes for all nodes in a cluster. In multiple data-center clusters, the seed list should include at least one node from each datacenter (replication group). More than a single seed node per datacenter is recommended for fault tolerance. It is recommended to use a small seed list (approximately three nodes per datacenter).


Connect/working with Cassandra using cqlsh                  

Connect Locally

[cass@dn01 ~]$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.

cqlsh> SELECT cluster_name, listen_address FROM system.local;

 cluster_name | listen_address
--------------+----------------
 Test Cluster |      192.168.49.135

(1 rows)

cqlsh> help

Connect Remotely
[cass@dn03 ~]$ cqlsh dn03 9042
Connection error: ('Unable to connect to any servers', {'192.168.49.138': error(111, "Tried connecting to [('192.168.49.138', 9042)]. Last error: Connection refused")})

[cass@dn03 ~]$ netstat -lnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:8010            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:5901            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:9042          0.0.0.0:*               LISTEN
tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 192.168.49.135:7000     0.0.0.0:*               LISTEN
...
...

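In cassandra.yaml, change the property below from localhost to the name of the node and restart Cassandra on that node. The netstat output above shows port 9042 bound only to 127.0.0.1, which is why remote connections are refused.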
rpc_address: dn03

[cass@dn03 ~]$ netstat -lnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:8010            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:5901            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:54832         0.0.0.0:*               LISTEN
tcp        0      0 192.168.49.137:9042     0.0.0.0:*               LISTEN
tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN


[cass@dn01 ~]$ cqlsh dn03 9042
Connected to Test Cluster at dn03:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>


-- Captures the output of a command and adds it to a file
cqlsh> capture '/tmp/cass_output.txt'
Now capturing query output to '/tmp/cass_output.txt'.
cqlsh> capture off;

-- Describe the current cluster of Cassandra and its objects
cqlsh> describe cluster;
Cluster: Test Cluster
Partitioner: Murmur3Partitioner

-- List all the keyspaces in a cluster
cqlsh> describe keyspaces;

system_traces  system_schema  system_auth  system  system_distributed

-- List all the tables in a keyspace
cqlsh> describe tables;

Keyspace system_traces
----------------------
events  sessions

Keyspace system_schema
----------------------
tables     triggers    views    keyspaces  dropped_columns
functions  aggregates  indexes  types      columns

Keyspace system_auth
--------------------
resource_role_permissons_index  role_permissions  role_members  roles

Keyspace system
---------------
available_ranges          peers               batchlog        transferred_ranges
batches                   compaction_history  size_estimates  hints
prepared_statements       sstable_activity    built_views
"IndexInfo"               peer_events         range_xfers
views_builds_in_progress  paxos               local

Keyspace system_distributed
---------------------------
repair_history  view_build_status  parent_repair_history

-- Describe a Table
cqlsh> describe system_traces.sessions;

CREATE TABLE system_traces.sessions (
    session_id uuid PRIMARY KEY,
    client inet,
    command text,
    coordinator inet,
    duration int,
    parameters map<text, text>,
    request text,
    started_at timestamp
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = 'tracing sessions'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 3600000
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';


--Describe a user-defined data type

cqlsh> describe types

cqlsh> describe type <typeName>

-- To expand the output on/off

cqlsh> expand on ;

Now Expanded output is enabled

cqlsh> expand off;
cqlsh> exit

cqlsh> show host
Connected to Test Cluster at 127.0.0.1:9042.

-- Execute the commands in a file
vi /data/cass_input_file.cas
select * from system.local;

cqlsh> source '/data/cass_input_file.cas';

-- Keyspace Operation
A keyspace in Cassandra is a namespace that defines data replication on nodes. A cluster typically contains one keyspace per application.
cqlsh> CREATE KEYSPACE scott
   ... WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
cqlsh> describe keyspaces;
system_schema  system_auth  system  scott  system_distributed  system_traces

cqlsh> use scott;
cqlsh:scott>

cqlsh:scott> ALTER KEYSPACE scott
         ... WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};


cqlsh:scott> drop keyspace test;

-- Table Operations
CREATE TABLE emp(
             empno int PRIMARY KEY,
             ename text,
             job text,
             mgr int,
             hiredate text,
             sal varint,
             comm varint,
             deptno int
             );
cqlsh:scott> CREATE TABLE emp(
         ...    empno int PRIMARY KEY,
         ...    ename text,
         ...    job text,
         ...    mgr int,
         ...    hiredate text,
         ...    sal varint,
         ...    comm varint,
         ...    deptno int
         ...    );

cqlsh:scott>  DESCRIBE COLUMNFAMILIES;
emp

cqlsh:scott> select * from emp;

 empno | comm | deptno | ename | hiredate | job | mgr | sal
-------+------+--------+-------+----------+-----+-----+-----

(0 rows)

The primary key is one or more columns used to uniquely identify a row. Therefore, defining a primary key is mandatory when creating a table.



INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno) values(7369,'SMITH','CLERK',7902,'17-DEC-80',800,null,20);
INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno) values(7499,'ALLEN','SALESMAN',7698,'20-FEB-81',1600,300,30);
INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno) values(7902,'FORD','ANALYST',7566,'03-DEC-81',3000,null,20);


cqlsh:scott> INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno
         ... ) values(7369,'SMITH','CLERK',7902,'17-DEC-80',800,null,20);

cqlsh:scott> INSERT INTO emp(empno,ename,job,mgr,hiredate,sal,comm,deptno
         ... ) values(7499,'ALLEN','SALESMAN',7698,'20-FEB-81',1600,300,30);

cqlsh:scott> select * from emp;

 empno | comm | deptno | ename | hiredate  | job      | mgr  | sal
-------+------+--------+-------+-----------+----------+------+------
  7499 |  300 |     30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600
  7369 | null |     20 | SMITH | 17-DEC-80 |    CLERK | 7902 |  800

(2 rows)

cqlsh:scott> select empno, count(*) from emp group by empno;

 empno | count
-------+-------
  7499 |     1
  7369 |     1

(2 rows)

Warnings :
Aggregation query used without partition key

cqlsh:scott> update emp set comm=100 where empno=7369; --update column
cqlsh:scott> select * from emp;

 empno | comm | deptno | ename | hiredate  | job      | mgr  | sal
-------+------+--------+-------+-----------+----------+------+------
  7499 |  300 |     30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600
  7369 |  100 |     20 | SMITH | 17-DEC-80 |    CLERK | 7902 |  800

(2 rows)

cqlsh:scott> DELETE comm FROM emp WHERE empno=7369; --delete column
cqlsh:scott> select * from emp;

 empno | comm | deptno | ename | hiredate  | job      | mgr  | sal
-------+------+--------+-------+-----------+----------+------+------
  7499 |  300 |     30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600
  7369 | null |     20 | SMITH | 17-DEC-80 |    CLERK | 7902 |  800

(2 rows)

cqlsh:scott> delete from emp where empno=7369; --delete entire row
cqlsh:scott> select * from emp;

 empno | comm | deptno | ename | hiredate  | job      | mgr  | sal
-------+------+--------+-------+-----------+----------+------+------
  7499 |  300 |     30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 | 1600

(1 rows)


cqlsh:scott> truncate table emp;


cqlsh:scott> CREATE INDEX idx_ename ON emp (ename);
cqlsh:scott> drop index idx_ename;

cqlsh:scott> ALTER TABLE emp ADD email text;
cqlsh:scott> select * from emp;
 empno | comm | deptno | email | ename | hiredate | job | mgr | sal
-------+------+--------+-------+-------+----------+-----+-----+-----

(0 rows)
cqlsh:scott> ALTER TABLE emp DROP email;
cqlsh:scott> drop table emp;

-- User defined type UDT
CREATE TYPE phone (
    country_code int,
    number text
);

cqlsh:scott> CREATE TYPE phone (
         ...     country_code int,
         ...     number text
         ... );
cqlsh:scott> describe types

phone

cqlsh:scott> ALTER TABLE emp ADD phonenum phone;
cqlsh:scott> select * from emp;

 empno | comm | deptno | ename | hiredate  | job      | mgr  | phonenum | sal
-------+------+--------+-------+-----------+----------+------+----------+------
  7902 | null |     20 |  FORD | 03-DEC-81 |  ANALYST | 7566 |     null | 3000
  7499 |  300 |     30 | ALLEN | 20-FEB-81 | SALESMAN | 7698 |     null | 1600
  7369 | null |     20 | SMITH | 17-DEC-80 |    CLERK | 7902 |     null |  800

(3 rows)

cqlsh:scott> update emp set phonenum={ country_code: 1, number: '202 456-1111' } where empno=7369;

cqlsh:scott> select empno,ename,phonenum from emp;

 empno | ename | phonenum
-------+-------+-------------------------------------------
  7902 |  FORD |                                      null
  7499 | ALLEN |                                      null
  7369 | SMITH | {country_code: 1, number: '202 456-1111'}

-- Select data as JSON
cqlsh:scott> select json ename,job from emp;

 [json]
---------------------------------------
   {"ename": "FORD", "job": "ANALYST"}
 {"ename": "ALLEN", "job": "SALESMAN"}
    {"ename": "SMITH", "job": "CLERK"}

(3 rows)



Using ODBC Driver                                       

Download the Cassandra ODBC driver from the link below, then install and configure it. You can then use it in your desired application.

https://academy.datastax.com/downloads/download-drivers


If the connection test fails, investigate the reason.

Cassandra Ports
  • 7199 - JMX (was 8080 pre Cassandra 0.8.xx)
  • 7000 - Internode communication (not used if TLS enabled) (gossip/replication/proxied queries/etc)
  • 7001 - TLS Internode communication (used if TLS enabled)
  • 9160 - Thrift client API
  • 9042 - CQL native transport port
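A quick way to see which of these ports accept connections is a small TCP probe. This is a sketch rather than a replacement for netstat: the helper name is_port_open and the one-second timeout are assumptions, and a successful connection only shows that something is listening, not that it is Cassandra.

```python
import socket

# Default ports from the list above.
CASSANDRA_PORTS = {
    7199: "JMX",
    7000: "internode",
    7001: "internode (TLS)",
    9160: "Thrift",
    9042: "CQL native transport",
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for port, role in CASSANDRA_PORTS.items():
        state = "open" if is_port_open("127.0.0.1", port) else "closed"
        print(f"{port} ({role}): {state}")
```

Running the probe from a remote machine against the server name instead of 127.0.0.1 shows whether a port is reachable over the network rather than only on localhost.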


Determine which ports are listening for connections from the network
[root@dn04 ~]# netstat -tanp | grep LISTEN

tcp        0      0 0.0.0.0:5901            0.0.0.0:*               LISTEN      3256/Xvnc
tcp        0      0 127.0.0.1:9042          0.0.0.0:*               LISTEN      20153/java
...

Port 9042 is listening only on localhost (127.0.0.1). Open /etc/cassandra/default.conf/cassandra.yaml,
find "rpc_address:" and change its value to dn04 (the name of the server where Cassandra is running).

Restart the Cassandra service and try again to establish the connection.
[root@dn04 ~]# service cassandra restart
Restarting cassandra (via systemctl):                      [  OK  ]





Use the new DSN in Excel to test 






After the configuration changes, you will need to put the server name or IP while connecting with CQLSH.

[root@dn04 ~]# cqlsh dn04
cqlsh> show host
Connected to Test Cluster at dn04:9042.



How data is stored and read?                                                

At a very high level, Cassandra operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring. Nodes generally run on commodity hardware. Each node in the cluster is responsible for and assigned a token range. A token in Cassandra is a Hash value.

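When you insert data into Cassandra, the partitioner hashes the partition key (the first part of the primary key; clustering columns only order rows within a partition) to produce a token. The token range depends on the partitioner: 0 to 2^127 - 1 for the older RandomPartitioner, and -2^63 to 2^63 - 1 for Murmur3Partitioner, which is the partitioner reported by describe cluster earlier in this post. Every node in a Cassandra cluster or “ring” is given an initial token, which defines the end of the range the node is responsible for. With virtual nodes (256 tokens per node in the nodetool status output above), each node owns many smaller ranges instead of a single one.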

For example, consider a token range of 1 - 100. If you have 4 nodes in the Cassandra cluster, the nodes will have the initial tokens Node1 = 25, Node2 = 50, Node3 = 75 and Node4 = 100. Data with a hash value of 1 – 25 will be inserted in Node1, data with a hash value of 26 - 50 will be inserted in Node2, and so on.
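The four-node example can be written as a small lookup. This is a minimal sketch that uses the toy token range of 1 - 100 from the example, not real partitioner tokens; the names RING and owner are hypothetical.

```python
from bisect import bisect_left

# Initial tokens from the example: each node owns the range ending at its token.
RING = [(25, "Node1"), (50, "Node2"), (75, "Node3"), (100, "Node4")]


def owner(token, ring=RING):
    """Return the node whose range contains the given token."""
    tokens = [t for t, _ in ring]
    idx = bisect_left(tokens, token)  # first node with initial token >= token
    return ring[idx % len(ring)][1]   # wrap around past the last token


if __name__ == "__main__":
    for token in (1, 25, 26, 75, 100):
        print(token, owner(token))
```

The wrap-around in the last line of owner is what makes the layout a ring: a token beyond the highest initial token belongs to the first node.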

Client requests

Client read or write requests can go to any node in the cluster because all nodes in Cassandra are peers. When a client connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.
The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas) that own the data being requested. The coordinator determines which nodes in the ring should get the request based on the cluster's configured partitioner and replica placement strategy.

The coordinator node also has data about which nodes are responsible for each token range. You can see this information by running nodetool ring from the command line.
[cass@dn01 ~]$ nodetool ring > /tmp/cassRingToken.txt


cqlsh:scott> select token(empno), empno,ename from emp;

 system.token(empno)  | empno | ename
----------------------+-------+-------
 -8670174067668179189 |  7902 |  FORD
 -1144048224861957591 |  7499 | ALLEN
  2617034212096716347 |  7369 | SMITH

(3 rows)

Search the cassRingToken.txt for the related token to see which node is responsible for this token or even easier, you can use nodetool getendpoints to see this data:

[cass@dn01 ~]$ nodetool getendpoints
nodetool: getendpoints requires keyspace, table and partition key arguments
See 'nodetool help' or 'nodetool help <command>'.


[cass@dn01 ~]$ nodetool getendpoints scott emp 7369
192.168.49.135
192.168.49.136
