Using Apache Cassandra with Apache Hadoop

I am currently working on a data analytics website for my own educational purposes. To satisfy my hacking and learning needs, I decided to use Apache Cassandra as the input/output storage engine for an Apache Hadoop MapReduce job.

The job in question is as simple as it gets: it reads data from a table stored in a Cassandra database and identifies the most commonly used adjectives for each of the major communication service providers (CSPs) in Brazil. After processing, the results are stored in another table in the same Cassandra database. Basically, it is a fancier version of the famous Hadoop word count example.
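To make the shape of the job concrete, here is a minimal plain-Java sketch of the counting logic. It is not the actual Hadoop job and uses no Hadoop or Cassandra classes. The class name, the `provider:adjective` key format, the fixed adjective set, and the sample rows are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical, in-memory imitation of a word-count style map/reduce job.
final class AdjectiveCount {

    // "Map" step: emit one "provider:adjective" key for every known
    // adjective found in the text. The key format is an assumption.
    static List<String> map(String provider, String text, Set<String> adjectives) {
        List<String> keys = new ArrayList<>();
        // Split on anything that is not a letter (Unicode-aware).
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (adjectives.contains(token)) {
                keys.add(provider + ":" + token);
            }
        }
        return keys;
    }

    // "Reduce" step: count how many times each key was emitted.
    static Map<String, Integer> reduce(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Runs both steps over rows of the form {provider, text}.
    static Map<String, Integer> run(List<String[]> rows, Set<String> adjectives) {
        List<String> emitted = new ArrayList<>();
        for (String[] row : rows) {
            emitted.addAll(map(row[0], row[1], adjectives));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        // Sample data is invented for the example.
        Set<String> adjectives = new HashSet<>(Arrays.asList("slow", "expensive", "reliable"));
        List<String[]> rows = Arrays.asList(
                new String[] {"ProviderA", "Slow and expensive service"},
                new String[] {"ProviderA", "Still slow today"},
                new String[] {"ProviderB", "Reliable, but expensive"});
        System.out.println(run(rows, adjectives));
    }
}
```

In a real job the map and reduce steps would run in separate Hadoop tasks, with Cassandra supplying the input rows and receiving the final counts.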

Unfortunately, there seems to be a lack of modern documentation about integrating Hadoop and Cassandra. Even the official guide seems to be deficient or outdated on this subject. To add insult to injury, I also wanted to use composite keys, which complicated things further. After reading the examples shipped with the Cassandra source code, I was able to implement a working job.

Despite the lack of documentation and the hacking required to figure out how to make it work, the process is quite simple, and even an inexperienced Cassandra/Hadoop developer such as myself can do it without much trouble. In the paragraphs below you will find additional details about the Hadoop and Cassandra integration and what is required to make it work.

Finally, as is usual for my coding examples, the source code is available in my GitHub account under the open source Apache License v2.

Continue reading ‘Using Apache Cassandra with Apache Hadoop’

Development Goodies

These are just some development-related links and articles I have read over the last few weeks that I think are worth mentioning:

Understanding web services specifications (and more)

We all know that JSON and RESTful web services are the new darlings of the Internet and, to some extent, of backend development these days. Their simplicity compared to other mechanisms is, undoubtedly, a good thing. However, a large amount of backend development still relies (and will continue to rely) on SOAP and other mechanisms to provide services. That’s why it’s so important to understand them. This series of articles from IBM developerWorks can help you understand them.

On the other hand, if you want to understand the RESTful side of the force, you may want to read about Developing RESTful Services using Apache CXF.

Data Structures

Data structures are a recurring topic for any software engineer, either because they come up in pretty much any interview or because you need to find the most adequate solution to a problem you are working on. Still, there is a vast number of them and it’s not always easy to remember them all. The list below contains some interesting reference material about them.

  • Data Structure Visualizations: contains animated walk-throughs of the most widely used data structures. A must-see if you are having trouble understanding any of them.
  • Algorithms + Data Structures = Programs: a book about fundamental topics in computer programming.
  • Know Thy Complexities: a Big-O cheat sheet at the click of a mouse. Bonus point: it links to the Wikipedia articles about each of the items in the cheat sheet.
  • Core Algorithms Deployed: so you want to know who uses a Radix Tree? Lots and lots of good code showing how they are used in real life.

And to add some art to the science, algorithms dance:

Bubble Sort:

Quick Sort:


MacPass: a decent, OS X native, KeePass port

A native OS X port of KeePass is something that I have been wanting for a long time. Amazingly, I found one today while browsing the web. You can download it from here, and look at the source code on the project’s GitHub.



Enterprise Integration with Apache Camel

I’ve just published a mini e-book, in Portuguese, about Enterprise Integration with Apache Camel. If you happen to speak Portuguese, you can download it here.

Quick tips for running Java applications on OpenShift

Apache Commons Configuration:

It’s pretty common to need to set a hostname or a port for your service in OpenShift. If you’re using Apache Commons Configuration, there’s a quick and easy way to access variables exported by the cartridges. You can address the environment variables using the ‘env’ prefix.
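As I understand it, this means a configuration value can reference an environment variable as `${env:NAME}` and have it substituted when the value is read. The sketch below imitates that substitution in plain Java to show the idea. It is not Commons Configuration's own code, and the variable names `APP_HOST` and `APP_PORT` are hypothetical placeholders rather than names exported by any cartridge. How unresolved references are handled here (left untouched) is a choice made for this sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mimics "${env:NAME}" substitution.
final class EnvInterpolation {

    // Matches references such as ${env:APP_HOST}.
    private static final Pattern ENV_REF =
            Pattern.compile("\\$\\{env:([A-Za-z_][A-Za-z0-9_]*)\\}");

    // Replaces each ${env:NAME} with the value found in the given map.
    // References with no matching entry are left as they are.
    static String resolve(String value, Map<String, String> env) {
        Matcher matcher = ENV_REF.matcher(value);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String found = env.get(matcher.group(1));
            String replacement = (found != null) ? found : matcher.group();
            // quoteReplacement keeps '$' and '\' in values literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Resolve against the real process environment.
        String url = "http://${env:APP_HOST}:${env:APP_PORT}/";
        System.out.println(resolve(url, System.getenv()));
    }
}
```

Passing the environment in as a map keeps the helper easy to exercise with fixed values, while `System.getenv()` supplies the real ones at runtime.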

Continue reading ‘Quick tips for running Java applications on OpenShift’

Tip: adjust sudo timeout on OS X

Find it here.

NoSQL: links for beginners

NoSQL databases are some of the hottest topics in the IT industry at the moment. A beginner can easily feel swamped by the amount of documentation available. Since I am a beginner to NoSQL as well, I have picked out two links that I return to every now and then:

A Visual Guide to NoSQL explains how the commonly used NoSQL offerings relate to the CAP theorem.

A Beginner’s Guide to NoSQL is an article, originally written for the Software Developer’s Journal, that explains the basic principles and ideas behind NoSQL databases.



Running the Simple Apache CXF Server Example on Red Hat OpenShift

Today I dedicated some time to educating myself about OpenShift, Red Hat’s Platform-as-a-Service offering. It allows us developers to quickly develop, deploy, and provide scalable applications over the web.

To learn about it, I decided to deploy a really simple web application. I thought it would be a good idea to deploy the Simple CXF Server example on my free account. You can see it in action here. Because the OpenShift documentation is quite extensive, it can be complicated for a beginner like me. So I decided to take notes of my steps while I deployed a simple Apache CXF-based application.

These are the steps I had to follow:

Continue reading ‘Running the Simple Apache CXF Server Example on Red Hat OpenShift’
