Accelerate your big data replication

Hadoop is a terrific tool. Never before has the distance been so short between zero and the ability to run massively parallel computations (brute-force searches, genetic programming, machine learning) at large scale, and really cheaply too.

One of the challenges for enterprises using Hadoop is that the analysis is often performed in a completely different location from the data. If you're tasked with replicating HDFS clusters between two sites, things get interesting fast. This is especially true if you are using the native Linux TCP stack.
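One general reason long-distance replication can be slow is that a single TCP connection cannot send more than one window of data per round trip. The sketch below estimates that ceiling. The 64 KiB window and 100 ms round-trip time are hypothetical values chosen for illustration, not measurements from any real link, and a tuned stack with larger windows would do better.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP connection: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def transfer_seconds(size_bytes: int, throughput_bps: float) -> float:
    """Time to move size_bytes at a steady throughput (bits per second)."""
    return size_bytes * 8 / throughput_bps


# Assumed values for illustration only.
window_bytes = 64 * 1024      # hypothetical 64 KiB effective window
rtt_seconds = 0.100           # hypothetical 100 ms WAN round trip
payload_bytes = 500 * 10**6   # 500MB, the size used in the example in this post

ceiling_bps = max_tcp_throughput_bps(window_bytes, rtt_seconds)
seconds = transfer_seconds(payload_bytes, ceiling_bps)

print(f"Throughput ceiling: {ceiling_bps / 1e6:.2f} Mbit/s")
print(f"Time for 500MB at that ceiling: {seconds:.0f} s")
```

Under these assumptions the ceiling does not depend on link bandwidth at all, which is why a faster pipe alone may not help.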

Let's say you're curious whether the Harlem Shake affects helmet sales in your retail stores. You might have one instance of HDFS for recording all the transaction data, and need to replicate it to another site for analysis, where you can try to correlate it with weather, YouTube, or proprietary data (like whether helmet sales have historically peaked in February). There are all kinds of use cases here: you might need to replicate to a partner site, to the cloud, or to a disaster recovery location. Try to do this over the WAN with the standard 'distcp' tool, and it can be painfully slow.

The good news is that a lot of data like this is in a text-based format, such as XML, JSON, or CSV. These types of files are very amenable to Lempel-Ziv (LZ) compression and often compress down to a fraction of their original size.
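To get a feel for how well text-based records compress, here is a minimal sketch using Python's standard `zlib` module (DEFLATE, which combines LZ77 with Huffman coding). The CSV layout, column names, and value ranges are made up for illustration, and the ratio you see on real transaction data will differ.

```python
import random
import zlib


def make_csv(rows: int, seed: int = 0) -> bytes:
    """Build synthetic CSV transaction data (hypothetical schema)."""
    rng = random.Random(seed)
    lines = ["store_id,sku,quantity,price"]
    for _ in range(rows):
        store_id = rng.randint(1, 50)
        sku = f"HELMET-{rng.randint(100, 120)}"
        quantity = rng.randint(1, 3)
        price = rng.choice(["19.99", "24.99", "39.99"])
        lines.append(f"{store_id},{sku},{quantity},{price}")
    return ("\n".join(lines) + "\n").encode("utf-8")


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, level)) / len(data)


raw = make_csv(10_000)
packed = zlib.compress(raw)

# Compression is lossless: decompressing returns the original bytes.
assert zlib.decompress(packed) == raw

ratio = compression_ratio(raw)
print(f"{len(raw)} bytes -> {len(packed)} bytes ({ratio:.0%} of original)")
```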

Another interesting feature of Hadoop input data is that it can contain a lot of redundant structures. Redundancy like this is common in unstructured data, and WAN optimization can exploit it to reduce the data even further.
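The general idea of removing redundancy can be illustrated with a toy deduplication scheme: split the data into chunks, hash each chunk, and keep only one copy of every chunk seen. This is a simplified sketch of the concept, not a description of how any particular WAN optimization product works. The fixed 64-byte chunk size, the function names, and the sample JSON record are all assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 64):
    """Split data into fixed-size chunks and keep one copy of each unique chunk.

    Returns (store, stream): store maps a SHA-256 digest to its chunk, and
    stream is the ordered list of digests needed to rebuild the data.
    """
    store = {}
    stream = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)  # store only the first occurrence
        stream.append(digest)
    return store, stream


def reassemble(store, stream) -> bytes:
    """Rebuild the original bytes from the chunk store and digest stream."""
    return b"".join(store[digest] for digest in stream)


# Hypothetical, highly repetitive input: the same JSON record many times.
record = b'{"store": 12, "sku": "HELMET-101", "qty": 1, "price": 24.99}\n'
data = record * 1000

store, stream = deduplicate(data)
unique_bytes = sum(len(chunk) for chunk in store.values())

assert reassemble(store, stream) == data
print(f"{len(data)} bytes of input, {unique_bytes} bytes of unique chunks")
```

Fixed-size chunks keep the sketch short; they are sensitive to insertions that shift the data, which is one reason practical systems tend to use more elaborate segmentation.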

Add Steelhead appliances to each end of the link, and your HDFS replication times come down fast. Here's an example of replicating 500MB of data between two instances:


Adding optimization here can make a lot of sense. You can quickly add compression to your data over the network without modifying any of your Java, Perl, or Ruby code to compress and decompress the data at each end. This means the solution can be implemented fast, and it can make a huge impact on performance. At the end of the day you get MapReduce results faster... and can do something about those helmet sales!


