Cassandra

Problems With Relational Database Management Systems:

1. Scalability problems

Scalability problems appear when a relational application becomes successful and usage goes up. Joins are inherent in any relatively normalized relational database, and joins can be slow. A relational database also stays consistent through transactions, which require locking some portion of the database so that it is not available to other clients. Those clients must wait until the first one releases the lock. This can become untenable under heavy load, because the locks mean that competing users start queuing up, waiting for their turn to read or write the data.

2. RDBMS allows Vertical Scaling

Vertical scaling means adding more hardware and memory to a single machine. This increases cost, and a machine can only take a limited amount of extra hardware, so it relieves the problem only for a time.

When the problem arises again, the answer appears to be similar: now that one box is maxed out, you add more boxes in a database cluster. At that point you have the problem of data replication and consistency.

Why Do We Need a Strong Database?

1. YouTube serves 100 million videos every day.

2. Chevron accumulates 2 TB of data every day.

3. In 2006, the amount of data on the Internet was approximately 166 exabytes. By 2010, it was nearly 1,000 exabytes.

4. The movie Avatar required 1 PB of storage space, which has been described as the equivalent of a single MP3 song that is 32 years long.

5. As of May 2010, Google was provisioning 100,000 Android phones every day, all of which have Internet access as a service.

6. In 1998, the number of email accounts was approximately 253 million. By 2010, that number was closer to 2 billion.

We need an architecture that allows organizations to take advantage of this data in near real time, to support decision making and to offer new features and capabilities.


Cassandra


Cassandra was open-sourced by Facebook in July 2008. This original version of Cassandra was written primarily by an ex-employee from Amazon and one from Microsoft.

Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model.

Cassandra was accepted into the Apache Incubator and graduated in March 2010. It has become popular due to outstanding features such as durability, seamless scalability, and tuneable consistency. It is highly available and offers a schema-free data model.

Cassandra is written in Java, but clients are available for a wide variety of languages, including C#, Scala, Python and Ruby.

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, tuneably consistent, highly available, fault-tolerant, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable.

How Did Cassandra Get Its Name?

In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen, but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but she was powerless to stop it. The Cassandra distributed database is named for her.

Terms Used in Definition of Cassandra:

1. Distributed

Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole. You can confidently write data to any node in the cluster and Cassandra will get it.

2. Decentralized

Cassandra is decentralized, meaning that every node in the cluster is identical: no Cassandra node performs organizing operations distinct from any other node. Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.
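The gossip idea can be illustrated with a toy simulation. This is a minimal sketch, not Cassandra's actual protocol or failure detector: the names gossip_round, simulate and live_nodes are hypothetical, and a node is treated as dead simply when its heartbeat counter lags too far behind the newest one seen.

```python
import random


def gossip_round(views, alive, rng):
    """One round: every live node swaps heartbeat tables with one random live peer."""
    for node in alive:
        peers = [p for p in alive if p != node]
        if not peers:
            continue
        peer = rng.choice(peers)
        # Merge both tables, keeping the highest heartbeat seen for each node.
        merged = dict(views[node])
        for name, beat in views[peer].items():
            merged[name] = max(merged.get(name, 0), beat)
        views[node] = dict(merged)
        views[peer] = dict(merged)


def simulate(nodes, rounds, seed=0):
    """Run a few rounds in which every node is alive and gossiping."""
    rng = random.Random(seed)
    views = {n: {n: 0} for n in nodes}  # each node starts knowing only itself
    for _ in range(rounds):
        for n in nodes:
            views[n][n] += 1  # each node bumps its own heartbeat
        gossip_round(views, nodes, rng)
    return views


def live_nodes(view, max_lag):
    """Assumed rule: a node counts as alive if its heartbeat is recent enough."""
    latest = max(view.values())
    return {name for name, beat in view.items() if latest - beat <= max_lag}
```

No node in this sketch has a special role: each one runs the same code and learns about the others only through pairwise exchanges, which is the point of the peer-to-peer design.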

Cassandra working

Advantage of Decentralization in Cassandra

Because Cassandra is decentralized, there is no single point of failure. All of the nodes in a Cassandra cluster function exactly the same way. This is sometimes referred to as server symmetry. Because they are all doing the same thing, by definition there can't be a special host that coordinates activities, as there is in the master/slave setups found in MySQL and Bigtable.

The decentralized design is therefore one of the keys to Cassandra's high availability. Decentralization has two key advantages: it is simpler to use than a master/slave architecture, and it helps you avoid outages.


3. Elastic Scalability

Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance.

Cassandra allows horizontal scaling: it lets you add machines in parallel, so that no single machine has to bear the entire burden of serving requests.

Elastic scalability is a special property of horizontal scaling in Cassandra: the cluster can seamlessly scale up and scale back down.

4. High Availability and Fault-tolerance

The availability of a system is measured by its ability to fulfill requests.

Cassandra is highly available because:
  • We can replace failed nodes in the cluster with no downtime.
  • We can replicate data to multiple nodes to offer improved local performance and prevent downtime.
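The effect of replication on availability can be sketched in a few lines. This is a toy model under stated assumptions, not Cassandra's real placement strategy: the Cluster class and its replicas_for rule (a simple byte-sum hash, then the next nodes in the list) are hypothetical.

```python
class Cluster:
    """Toy cluster: each key is stored on `replication_factor` nodes."""

    def __init__(self, node_names, replication_factor):
        self.names = list(node_names)
        self.nodes = {name: {} for name in self.names}  # per-node storage
        self.down = set()                               # nodes currently failed
        self.rf = replication_factor

    def replicas_for(self, key):
        # Hypothetical placement: pick a start node from a stable hash of the
        # key, then take the following nodes in list order.
        start = sum(key.encode()) % len(self.names)
        return [self.names[(start + i) % len(self.names)] for i in range(self.rf)]

    def write(self, key, value):
        # Store the value on every replica that is currently up.
        for name in self.replicas_for(key):
            if name not in self.down:
                self.nodes[name][key] = value

    def read(self, key):
        # Any live replica holding the key can answer the request.
        for name in self.replicas_for(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)
```

With a replication factor of 3, a read in this sketch still succeeds after two of the three replicas for a key have failed, which is why replication prevents downtime.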

5. Tuneable Consistency

Cassandra is more accurately described as "tuneably consistent", which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.

How Does the Client Manage the Level of Consistency?

The client can control the number of replicas to block on for all updates. This is done by setting the consistency level against the replication factor.
The replication factor is the number of nodes that hold a copy of each piece of data; it lets you decide how much you want to pay in performance to gain more consistency. The consistency level is a setting that clients must specify on every operation; it determines how many replicas must acknowledge the operation before it is considered successful.
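The interplay between consistency level and replication factor comes down to simple arithmetic. The sketch below uses the level names ONE, QUORUM and ALL; the helper functions required_acks and read_sees_latest_write are hypothetical names for illustration, not part of any client library.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read set and write set must overlap in at least one replica."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor
```

For example, with a replication factor of 3, writing and reading at QUORUM gives 2 + 2 > 3, so every read overlaps the latest write. Writing and reading at ONE gives 1 + 1 = 2, which is not greater than 3, so a read may return stale data in exchange for lower latency and higher availability.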

Brewer's CAP Theorem 

The CAP theorem is sometimes called Brewer's theorem after its author, Eric Brewer of the University of California at Berkeley, who put it forward in 2000.

It states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: consistency, availability, and partition tolerance. You can strongly support only two of the three at the same time.


Three properties of CAP
1. Consistency:
All database clients read the same value for the same query, even given concurrent updates.

2. Availability:
All database clients are always able to read and write data.

3. Partition Tolerance:
The database can be split across multiple machines, and it can continue functioning in the face of network segmentation breaks.


6. Row-Oriented

Cassandra is frequently referred to as a "column-oriented" database, which is not incorrect. It is not relational, and it does represent its data structures in sparse multidimensional hashtables.

What is Sparse?

Sparse means that any given row can have one or more columns, but each row doesn't need to have the same columns as the other rows. The columns can differ from row to row. Each row has a unique key, which makes its data accessible.


Suppose the business changes and we need to change the fields of the database and its structure. With Cassandra, we can add or remove fields on the fly without disrupting service.
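A sparse row store behaves much like a dictionary of dictionaries. The following is a minimal sketch in plain Python rather than Cassandra itself; the row keys, column names and the columns_in_use helper are invented for illustration.

```python
# Each row key maps to its own set of columns; rows need not share columns.
users = {
    "alice": {"email": "alice@example.com", "city": "Paris"},
    "bob": {"email": "bob@example.com"},
    "carol": {"phone": "555-0100", "city": "Oslo", "twitter": "@carol"},
}


def columns_in_use(rows):
    """Collect every column name that appears in at least one row."""
    return sorted({column for columns in rows.values() for column in columns})


# Adding a field on the fly touches only the row that needs it.
users["bob"]["city"] = "Lima"

# Removing a field is just as local.
del users["carol"]["twitter"]
```

Rows that never set a column simply do not store it, so no placeholder values are needed and no other row is affected by the change.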


Cassandra requires a shift in how you think about your data. Instead of designing a data model and then designing queries around that model, we are free to think of our queries first and then provide the data that answers them.

7. Schema-free

Cassandra requires you to define an outer container called a keyspace, which contains column families. The keyspace is just a logical namespace that holds the column families and certain configuration properties. Beyond that, you just add data to a column family, using the columns that you need.
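The containment described here (keyspace, then column families, then rows of columns) can be modelled as nested structures. This is a conceptual sketch only: the Keyspace and ColumnFamily classes, their methods, and the replication_factor field standing in for "configuration properties" are assumptions made for illustration, not a Cassandra client API.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnFamily:
    """A named container of rows; each row is a dict of column name to value."""
    name: str
    rows: dict = field(default_factory=dict)

    def insert(self, row_key, columns):
        # No column definitions are checked: any column name is accepted.
        self.rows.setdefault(row_key, {}).update(columns)


@dataclass
class Keyspace:
    """Outer namespace holding column families plus some configuration."""
    name: str
    replication_factor: int = 1  # example of a keyspace-level property
    column_families: dict = field(default_factory=dict)

    def create_column_family(self, name):
        family = ColumnFamily(name)
        self.column_families[name] = family
        return family


blog = Keyspace("blog", replication_factor=3)
posts = blog.create_column_family("posts")
posts.insert("post-1", {"title": "Hello", "author": "alice"})
posts.insert("post-1", {"tags": "intro"})  # a new column, added later
```

Only the keyspace and the column family are declared ahead of time in this sketch; the columns of each row appear as data is inserted.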


8. High Performance

Cassandra was designed to take full advantage of multiprocessor/multicore machines, and to run across many dozens of these machines housed in multiple data centers.


Cassandra users:

  • Twitter
  • Cisco
  • Facebook
  • Rackspace
  • Reddit
  • Cloudkick

Cassandra Highlights:

High Availability
Incremental Scalability
Eventually Consistent
No SPOF (Single Point of Failure)
P2P Distributed Data Model


  