Cassandra in a Nutshell
What is Cassandra?
------------------
It's a file- and SSTable-based NoSQL database optimized for clustered environments. In terms of the CAP theorem, Cassandra favors availability and partition tolerance. It can be tuned for stronger consistency, but that comes at the cost of higher latency. It supports clustering through rings made up of many Cassandra nodes, and various configuration parameters let you take advantage of a cluster's strengths.

Cassandra is optimized for writes. Writes are very fast, while reads can become slow for various reasons. In particular, read latency can grow sharply if data is deleted frequently: Cassandra does not delete data immediately. It marks the data with tombstones and waits for a scheduled tombstone compaction, and until then read queries have to go through this ghost data, causing delays. More details in [1].

How to model your data
----------------------
Cassandra's data model works based on:
Keyspaces : Similar to a schema in a relational database. A keyspace consists of multiple column families.
Column families : Similar to tables in a relational database, but different internally. Instead of a rigid set of columns, a column family stores its data as dynamic cells, so each row carries only the columns it actually has.
Rows : Not restricted to a fixed column count within a column family.
Columns : Different from a relational column. A column is bound only to its row and holds one attribute of that row's data.
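To make the nesting above concrete, here is a minimal conceptual sketch in Python. It treats a keyspace as a dict of column families, a column family as a dict of rows, and a row as a dict of columns read back in column-name order. All names (`keyspace`, `insert`, `get_row`, "Users" and the sample values) are made up for illustration. This only mimics the shape of the model and is not how Cassandra stores data on disk.

```python
# Conceptual model only: keyspace -> column family -> row key -> {column name: value}.
# All names here are hypothetical and chosen for illustration.
keyspace = {}


def insert(column_family, row_key, column_name, value):
    """Set one column on one row, creating the column family and row on demand."""
    rows = keyspace.setdefault(column_family, {})
    rows.setdefault(row_key, {})[column_name] = value


def get_row(column_family, row_key):
    """Return the row's columns ordered by column name, or [] if the row is absent."""
    row = keyspace.get(column_family, {}).get(row_key, {})
    return sorted(row.items())


# Two rows in the same column family with different sets of columns.
insert("Users", "alice", "email", "alice@example.com")
insert("Users", "alice", "city", "Colombo")
insert("Users", "bob", "email", "bob@example.com")

print(get_row("Users", "alice"))
print(get_row("Users", "bob"))
```

Here "alice" and "bob" live in the same column family yet carry different columns, which is the point made under Rows above.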
Refer to [2] to understand more about the above elements. All of these have specific size limitations to ensure Cassandra works as expected [3]. CQL (Cassandra Query Language) is the modern query language for Cassandra, with a syntax similar to SQL [4].

Handy, in-built tools used to investigate Cassandra
----------------------------------------------
This is the most important part of this article. Since Cassandra is a very versatile DBMS, it can be tuned in many ways, and different combinations of settings can lead to very different outcomes. In such a situation, knowing how to inspect and investigate the software is crucial. To this end, I have summarized a few tools and commands I use frequently in my work.

nodetool
-------------
This is the Swiss Army knife for collecting external statistics about Cassandra. A bit of reading and the command help will give you in-depth explanations of each use of nodetool, but I will explain a few for quick reference.

Check the status of the cluster - "./nodetool -host <ip> ring". This displays the status of each Cassandra node, its token value, disk usage (load), and the share of the ring it owns. Useful for checking whether the ring is balanced.
Move a node to a different token - "./nodetool -host <ip> move <newToken>". Even though Cassandra clusters are balanced initially, they can become imbalanced when some nodes fail and shut down. This command helps you re-balance the ring.
View compaction stats - "./nodetool -host <ip> compactionstats". This shows the state of the compaction currently running on Cassandra (if any).
View SSTable access information - "./nodetool -host <ip> cfstats". This displays per-column-family statistics at a given point, such as SSTable counts and read/write counts and latencies, and much more.
Find tokens to balance your cluster - python -c 'print([str(((2**64 // <NODE_COUNT>) * i) - 2**63) for i in range(<NODE_COUNT>)])' (this assumes the Murmur3Partitioner token range of -2**63 to 2**63-1).

cqlsh
-------
This is a query tool for inspecting the actual data in Cassandra, SQL style. Start it with the command "./cqlsh <ip>".
Display all keyspaces - DESCRIBE keyspaces;
Use a specific keyspace - USE "<keyspaceName>";
Count number of rows in a column family - SELECT count(*) FROM "<keyspaceName>"."<columnFamily>";
View all meta information about a column family - DESCRIBE TABLE "<columnFamilyName>";
A sample query to alter the schema of a column family - ALTER TABLE "MessageContent" WITH compaction={'tombstone_compaction_interval':600,'class':'SizeTieredCompactionStrategy'};
Assume that a specific column has the given data type (useful if you want to inspect byte column values) - ASSUME "<keyspaceName>"."<columnFamily>"(<columnName>) VALUES ARE text;
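
The double quotes around names in the statements above matter: CQL treats unquoted identifiers as case-insensitive, so a mixed-case name such as "MessageContent" has to be quoted. If you generate such statements from a script before pasting them into cqlsh, a small helper along these lines can do the quoting. This is only a sketch. `quote_ident` and `count_rows_cql` are hypothetical names, not part of cqlsh or any driver, and "MyKeyspace" is a made-up keyspace.

```python
def quote_ident(name):
    """Wrap a CQL identifier in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def count_rows_cql(keyspace, column_family):
    """Build the row-count statement for a keyspace-qualified column family."""
    return "SELECT count(*) FROM {}.{};".format(
        quote_ident(keyspace), quote_ident(column_family)
    )


# "MyKeyspace" is a made-up keyspace name; "MessageContent" is the table from the ALTER example.
print(count_rows_cql("MyKeyspace", "MessageContent"))
```

The printed statement can be pasted into cqlsh as-is.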
Apologies if this post feels rushed :) I wanted to share the information quickly before I lose track of it. And again, there could be mistakes and misunderstandings here. Everything is open for discussion.
References :
[1] : http://wiki.apache.org/cassandra/ArchitectureOverview
[2] : http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.VFuG59a35dI
[3] : http://wiki.apache.org/cassandra/CassandraLimitations
[4] : http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure











