Using Apache Cassandra with Apache Hadoop

I am currently working on a data analytics website for my own educational purposes. To fulfil my hacking/learning needs, I decided to use Apache Cassandra as the input/output storage engine for an Apache Hadoop map/reduce job.

The job in question is as simple as it gets: it reads the data from a table stored in a Cassandra database and identifies the adjectives most commonly used for each of the major communication service providers (CSPs) in Brazil. After processing, the results are stored in another table in the same Cassandra database. Basically, it is a fancier version of the famous Hadoop word count example.
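To make the idea concrete, here is a minimal plain-Java sketch of the counting logic, with no Hadoop or Cassandra involved. The provider and adjective lists, the `provider:adjective` key format and the class name are assumptions made for illustration, not the ones used in the actual job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class AdjectiveCount {
    // Hypothetical sample vocabularies; a real job would load its own lists.
    static final Set<String> PROVIDERS =
            new HashSet<>(Arrays.asList("vivo", "claro", "tim", "oi"));
    static final Set<String> ADJECTIVES =
            new HashSet<>(Arrays.asList("slow", "fast", "expensive", "reliable"));

    // "Map" step: emit one "provider:adjective" key for every
    // provider/adjective pair that appears in the same text.
    static List<String> map(String text) {
        List<String> providers = new ArrayList<>();
        List<String> adjectives = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (PROVIDERS.contains(token)) providers.add(token);
            if (ADJECTIVES.contains(token)) adjectives.add(token);
        }
        List<String> keys = new ArrayList<>();
        for (String provider : providers) {
            for (String adjective : adjectives) {
                keys.add(provider + ":" + adjective);
            }
        }
        return keys;
    }

    // "Reduce" step: count how many times each key was emitted.
    static Map<String, Integer> reduce(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        emitted.addAll(map("Vivo is slow and expensive"));
        emitted.addAll(map("Claro is fast"));
        emitted.addAll(map("vivo was slow again"));
        System.out.println(reduce(emitted));
    }
}
```

In the real job, Hadoop performs the grouping between the two steps, and the input texts and output counts live in Cassandra tables instead of in-memory collections.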

Unfortunately, there seems to be a lack of modern documentation about integrating Hadoop and Cassandra. Even the official guide seems to be deficient/outdated on this subject. To add insult to injury, I also wanted to use composite keys, which complicated things further. After reading the examples shipped with the Cassandra source code, I was able to implement a working job.

Despite the lack of documentation and the hacking required to figure out how to make it work, the process is quite simple and even an inexperienced Cassandra/Hadoop developer such as myself can do it without much trouble. In the paragraphs below you will find additional details about the Hadoop and Cassandra integration and what is required to make it work.

Finally, as usual for my coding examples, the source code is available in my GitHub account under the open source Apache License v2.

1. First, we need to set up the input configuration. It should be pretty simple, as you only have to point to the Cassandra database instance, the keyspace, the input tables and the columns you’ll be working with.
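A minimal sketch of the input setup, assuming a Cassandra release that still ships the Thrift-based `org.apache.cassandra.hadoop` classes and that the Cassandra and Hadoop jars are on the classpath. The host, keyspace, table and column names below are hypothetical placeholders, and the partitioner has to match the one your cluster uses.

```java
import java.util.Arrays;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;

final class InputSetup {
    static void configureInput(Configuration conf) {
        // Where to find the cluster (hypothetical host, default Thrift port).
        ConfigHelper.setInputInitialAddress(conf, "localhost");
        ConfigHelper.setInputRpcPort(conf, "9160");
        // Must match the partitioner configured on the cluster.
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        // Keyspace and input table (hypothetical names).
        ConfigHelper.setInputColumnFamily(conf, "analytics", "messages");

        // The predicate selects which columns are handed to the mapper;
        // "text" is a hypothetical column name.
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("text")));
        ConfigHelper.setInputSlicePredicate(conf, predicate);
    }
}
```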

2. Then, we repeat the process for the output configuration. Since you don’t have to set up the predicate or the input columns, it’s even simpler than the input one.
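The output side could look roughly like this, under the same assumptions (Thrift-based Hadoop support from the Cassandra jars on the classpath; the host, keyspace and table names are hypothetical):

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;

final class OutputSetup {
    static void configureOutput(Configuration conf) {
        // Where to write the results (hypothetical host, default Thrift port).
        ConfigHelper.setOutputInitialAddress(conf, "localhost");
        ConfigHelper.setOutputRpcPort(conf, "9160");
        // Must match the partitioner configured on the cluster.
        ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner");
        // Keyspace and output table (hypothetical names).
        ConfigHelper.setOutputColumnFamily(conf, "analytics", "adjective_counts");
    }
}
```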

3. Configure the Hadoop job with the appropriate classes and the input/output configuration.
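Here is a sketch of the job wiring, assuming Hadoop 2.x and the Thrift-based input/output formats that ship with Cassandra. The job name, the `buildJob` helper and the choice of `Text`/`IntWritable` as the intermediate types are assumptions for illustration; the mapper and reducer classes are passed in so the snippet stands on its own.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

final class JobSetup {
    @SuppressWarnings("rawtypes")
    static Job buildJob(Configuration conf,
                        Class<? extends Mapper> mapper,
                        Class<? extends Reducer> reducer) throws IOException {
        Job job = Job.getInstance(conf, "adjective-count"); // hypothetical job name
        job.setJarByClass(mapper);
        job.setMapperClass(mapper);
        job.setReducerClass(reducer);

        // Intermediate (mapper output) types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final output: a row key plus the list of mutations for that row.
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);

        // Read from and write to Cassandra.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
        return job;
    }
}
```

Keep in mind that `Job` works on its own copy of the `Configuration`, so the Cassandra input/output settings have to be applied either before the job is created or directly on `job.getConfiguration()`.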

4. Create your mapper class. To do so, extend Hadoop’s Mapper class and implement the map method. It may also be useful to create a getColumnValue method to simplify reading the value from the column.
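A helper such as getColumnValue can be sketched in plain Java, assuming the column value reaches the mapper as a `ByteBuffer` holding UTF-8 text. The class name and the null handling are illustrative choices, not necessarily what the original job does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ColumnValues {
    // Decodes the remaining bytes of a column value as UTF-8 text.
    // duplicate() gives the decoder its own position, so the caller's
    // buffer is left untouched and can be read again.
    static String getColumnValue(ByteBuffer value) {
        if (value == null) {
            return "";
        }
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer raw = ByteBuffer.wrap("Claro is fast".getBytes(StandardCharsets.UTF_8));
        System.out.println(getColumnValue(raw));
    }
}
```

Working on a duplicate matters because the same buffer may be read more than once; decoding it directly would leave its position at the end.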

5. Create your reducer class. To do so, extend Hadoop’s Reducer class and write the reduce method, which will save the results to the database.

5.1 It may also be useful to write a few helper methods that build Mutation objects out of your data. The Mutation objects are the ones that will actually be saved to the database.

5.2 Finally, you can create the reduce logic in your reducer class. Basically, the process involves: a) running your reduce calculation, b) creating the mutation objects that will be saved to the database, c) creating the index to be saved to the database and d) writing all of that to the context.
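Steps 5.1 and 5.2 can be sketched together as follows, assuming the Thrift `Column`, `ColumnOrSuperColumn` and `Mutation` classes from the Cassandra jars. The class name, the "count" column name and the use of the incoming key as the row key are assumptions for illustration; in particular, reading the "index" of step c) as the row key is an interpretation, and a composite key would need its own encoding.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AdjectiveReducer
        extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    // Step 5.1: wrap a column name and an integer value in a Mutation.
    private static Mutation toMutation(String columnName, int count) {
        Column column = new Column();
        column.setName(ByteBufferUtil.bytes(columnName));
        column.setValue(ByteBufferUtil.bytes(count));
        column.setTimestamp(System.currentTimeMillis());

        ColumnOrSuperColumn holder = new ColumnOrSuperColumn();
        holder.setColumn(column);

        Mutation mutation = new Mutation();
        mutation.setColumn_or_supercolumn(holder);
        return mutation;
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // a) Reduce calculation: sum the counts emitted for this key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // b) Build the mutation that carries the result ("count" is hypothetical).
        Mutation mutation = toMutation("count", sum);

        // c) Build the row key under which the result is stored (assumed: the map key).
        ByteBuffer rowKey = ByteBufferUtil.bytes(key.toString());

        // d) Hand everything to the context; the output format writes it to Cassandra.
        context.write(rowKey, Collections.singletonList(mutation));
    }
}
```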

Published by

Otavio Piske

