Accelerate your big data replication

Hadoop is a terrific tool. Never before has the distance been so short between starting from zero and running massively parallel computations (brute-force search, genetic programming, or machine learning, for example) at large scale, and cheaply too.

One of the challenges for enterprises using Hadoop is that the analysis is often performed in a completely different location than the data. If you're tasked with replicating HDFS clusters between two sites, well, things get interesting fast. This is especially true if you are using the native Linux TCP stack.

Let's say you're curious whether the Harlem Shake affects helmet sales in your retail stores. You might have one instance of HDFS recording all the transaction data, and need to replicate it to another site for analysis, where you can try to correlate it with weather, YouTube, or proprietary data (like whether helmet sales have historically peaked in February). There are all kinds of use cases here: you might need to replicate to a partner site, to the cloud, or to a disaster recovery location. Try to do this over the WAN with the standard 'distcp' tool, and it can be painfully slow.

The good news is that a lot of data like this is in a text-based format, such as XML, JSON, or CSV files. These types of files are very amenable to LZ (Lempel-Ziv) compression and often compress down to a fraction of their original size.
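To get a feel for how well text-based records compress, here is a minimal Python sketch. It uses the standard library's zlib module, whose DEFLATE format combines LZ77 with Huffman coding, as a stand-in for LZ-style compression in general. The records, field names, and the compression_ratio helper are invented for illustration and are not taken from any real data set or product.

```python
import json
import zlib


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload, level)) / len(payload)


# Hypothetical transaction records, made up purely for this example.
records = [
    {"store_id": i % 50, "sku": "HELMET-%03d" % (i % 20), "qty": 1 + i % 3, "price": 39.99}
    for i in range(5000)
]

# One JSON document per line, similar to what a log or export file might hold.
payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")

ratio = compression_ratio(payload)
print(f"original: {len(payload)} bytes, compressed/original ratio: {ratio:.2f}")
```

The exact ratio depends heavily on the data. Synthetic records like these are more repetitive than most real files, so treat the number as an illustration rather than a benchmark.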

Another interesting feature of Hadoop input data is that it can contain a lot of redundant structures. Redundancy like this is common in unstructured data, and applying WAN optimization can reduce it even further.
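The idea of removing redundant structures can be sketched as chunk-level deduplication: split the data into chunks, fingerprint each one, and count only the chunks that have not been seen before. The Python sketch below uses fixed-size chunks and SHA-256 for simplicity. It is an assumption-laden illustration of the general technique, not a description of how any particular WAN optimization appliance works, and the dedup_stats helper and sample data are hypothetical.

```python
import hashlib


def dedup_stats(data: bytes, chunk_size: int = 4096):
    """Return (total_bytes, unique_bytes) using fixed-size chunk fingerprints."""
    seen = set()
    unique_bytes = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in seen:
            # First time this chunk appears: it would have to cross the link.
            seen.add(digest)
            unique_bytes += len(chunk)
    return len(data), unique_bytes


# Hypothetical payload: the same 4096-byte block repeated ten times, plus a short tail.
block = bytes(range(256)) * 16
data = block * 10 + b"unique tail"

total, unique = dedup_stats(data)
print(f"total: {total} bytes, unique: {unique} bytes")
```

Fixed-size chunks only catch repeats that line up on chunk boundaries, so this sketch understates what a more sophisticated scheme could find in real data.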

Add Steelhead appliances to each end of the link, and your HDFS replication times come down fast. Here's an example of replicating 500 MB of data between two instances:


Adding WAN optimization makes a lot of sense for this use case. You can quickly add compression to your data over the network without modifying any of your Java, Perl, or Ruby code to compress and decompress the data at each end. This means the solution can be implemented fast, and it can make a huge impact on performance. At the end of the day you get MapReduce results faster... and can do something about those helmet sales!


