Despite the increasing interest in unstructured data, much of the world's information still lives in some form of relational database. Firing up your first Hadoop cluster often means moving data from existing SQL tables, maybe hosted in SQL Server, MySQL or PostgreSQL, into HDFS or Hive so it can be accessed by all of that MapReduce goodness. Towards making that easier, there's a cool Apache project called Sqoop that allows you to import this data in a streamlined, flexible way that can leverage the parallelism in your cluster using mappers.
There are some factors that affect how fast Sqoop can move data across the network (as my colleague recently observed with SnapMirror). It turns out that moving data this way can really start to slog if are moving data between sites or clouds. Here are three things you can do to speed things up:
#1: Compare the use of direct import vs. JDBC. Sqoop provides two ways of importing data for most SQL implementations: using the well-known JDBC interface or doing a direct dump over a TCP connection. Looking at an example with a 450MB CSV file, If we do the transfer in PostgreSQL and add 80ms of latency on the link, the direct method is clearly faster. However, there are some situations (like with HSQLDB or with really large tables) where you're out of luck.
If we set Sqoop to use 4 mappers and do the import with JDBC, with Cascade Pilot we can see that 4 TCP connections are open and the database traffic is spread across all of them.
Instead of we use the direct method, only 1 connection is opened for the entire transfer:
but the overall throughput is still much higher with the direct method (the x-axis below is time, in seconds, to complete):JDBC:
#2: Take a close look at the number of parallel connections. If you are using the JDBC interface, adding more mappers (with the -m option) can increase throughput, but there are diminishing returns. For example if we double the mappers in the above transfer to 8, we only get a 12% reduction in the time it takes to do the transfer:
#3: Riverbed can help. If you already have Riverbed in your network or are considering adding some mojo, there are a few ways we can help:
These are just a few things you can do to turbocharge moving SQL into Hadoop. With more enterprises embracing multiple clouds and Hadoop clusters starting to get really big, network performance is a key component of getting better results.
Riverbed delivers the most complete platform for Location-Independent Computing, turning location and distance into a competitive advantage. The Riverbed Application Performance Platform™ allows IT to have the flexibility to host applications and data in the most optimal locations while ensuring applications perform as expected, data is always available when needed, and performance issues are detected and fixed before end users notice. At more than $1 billion in annual revenue, Riverbed has 25,000+ customers, including 97% of both the Fortune 100 and the Forbes Global 100.