Despite the increasing interest in unstructured data, much of the world's information still lives in some form of relational database. Firing up your first Hadoop cluster often means moving data from existing SQL tables, maybe hosted in SQL Server, MySQL or PostgreSQL, into HDFS or Hive so it can be accessed by all of that MapReduce goodness. Towards making that easier, there's a cool Apache project called Sqoop that allows you to import this data in a streamlined, flexible way that can leverage the parallelism in your cluster using mappers.
There are some factors that affect how fast Sqoop can move data across the network (as my colleague recently observed with SnapMirror). It turns out that moving data this way can really start to slog if are moving data between sites or clouds. Here are three things you can do to speed things up:
#1: Compare the use of direct import vs. JDBC. Sqoop provides two ways of importing data for most SQL implementations: using the well-known JDBC interface or doing a direct dump over a TCP connection. Looking at an example with a 450MB CSV file, If we do the transfer in PostgreSQL and add 80ms of latency on the link, the direct method is clearly faster. However, there are some situations (like with HSQLDB or with really large tables) where you're out of luck.
If we set Sqoop to use 4 mappers and do the import with JDBC, with Cascade Pilot we can see that 4 TCP connections are open and the database traffic is spread across all of them.
Instead of we use the direct method, only 1 connection is opened for the entire transfer:
but the overall throughput is still much higher with the direct method (the x-axis below is time, in seconds, to complete):JDBC:
#2: Take a close look at the number of parallel connections. If you are using the JDBC interface, adding more mappers (with the -m option) can increase throughput, but there are diminishing returns. For example if we double the mappers in the above transfer to 8, we only get a 12% reduction in the time it takes to do the transfer:
#3: Riverbed can help. If you already have Riverbed in your network or are considering adding some mojo, there are a few ways we can help:
These are just a few things you can do to turbocharge moving SQL into Hadoop. With more enterprises embracing multiple clouds and Hadoop clusters starting to get really big, network performance is a key component of getting better results.
Riverbed, at more than $1 billion in annual revenue, is the leader in application performance infrastructure, delivering the most complete platform for the hybrid enterprise to ensure applications perform as expected, data is always available when needed, and performance issues can be proactively detected and resolved before impacting business performance. Riverbed enables hybrid enterprises to transform application performance into a competitive advantage by maximizing employee productivity and leveraging IT to create new forms of operational agility. Riverbed’s 27,000+ customers include 97% of the Fortune 100 and 98% of the Forbes Global 100.