Despite the increasing interest in unstructured data, much of the world's information still lives in some form of relational database. Firing up your first Hadoop cluster often means moving data from existing SQL tables, maybe hosted in SQL Server, MySQL or PostgreSQL, into HDFS or Hive so it can be accessed by all of that MapReduce goodness. Towards making that easier, there's a cool Apache project called Sqoop that allows you to import this data in a streamlined, flexible way that can leverage the parallelism in your cluster using mappers.
There are some factors that affect how fast Sqoop can move data across the network (as my colleague recently observed with SnapMirror). It turns out that moving data this way can really start to slog if are moving data between sites or clouds. Here are three things you can do to speed things up:
#1: Compare the use of direct import vs. JDBC. Sqoop provides two ways of importing data for most SQL implementations: using the well-known JDBC interface or doing a direct dump over a TCP connection. Looking at an example with a 450MB CSV file, If we do the transfer in PostgreSQL and add 80ms of latency on the link, the direct method is clearly faster. However, there are some situations (like with HSQLDB or with really large tables) where you're out of luck.
If we set Sqoop to use 4 mappers and do the import with JDBC, with Cascade Pilot we can see that 4 TCP connections are open and the database traffic is spread across all of them.
Instead of we use the direct method, only 1 connection is opened for the entire transfer:
but the overall throughput is still much higher with the direct method (the x-axis below is time, in seconds, to complete):JDBC:
#2: Take a close look at the number of parallel connections. If you are using the JDBC interface, adding more mappers (with the -m option) can increase throughput, but there are diminishing returns. For example if we double the mappers in the above transfer to 8, we only get a 12% reduction in the time it takes to do the transfer:
#3: Riverbed can help. If you already have Riverbed in your network or are considering adding some mojo, there are a few ways we can help:
These are just a few things you can do to turbocharge moving SQL into Hadoop. With more enterprises embracing multiple clouds and Hadoop clusters starting to get really big, network performance is a key component of getting better results.
Riverbed. WAN optimization for your network: Application acceleration, WAN bandwidth optimization, and IT consolidation. Riverbed is the IT performance company. WAN optimization solutions from Riverbed liberate businesses from common IT constraints by increasing application performance, enabling consolidation, and providing enterprise-wide network and application visibility – all while eliminating the need to increase bandwidth, storage or servers. Thousands of companies trust Riverbed to deliver greater productivity and cost savings by making their IT infrastructure faster, less expensive and more responsive. Riverbed solutions are also available as managed services through select providers.