In a post that I made on my Network World blog, I spoke about the challenges of running Hadoop over the WAN. What I didn’t get a chance to cover is why Silver Peak optimization is such a great complement to big data platforms like Hadoop.
Hadoop requires a predictable, low-latency network, which is why most implementations tend to reside at a single site. The Large Hadron Collider experiment, though, generates so much data of interest to researchers all over the world that the Hadoop network spans hundreds of universities and laboratories and over 2000 WAN links.
The University of California, San Diego (UCSD) is one of those organizations, and it’s connected via dual 10 Gbps links back to the facility in Switzerland that houses the primary data. GridFTP is used to move up to 15 Gbps of data at any one time between locations.
The research uncovered a number of problems with running Hadoop over the WAN, problems that Silver Peak’s optimization technology largely addresses:
- Organizations that use multiple GridFTP streams per transfer require large buffers to hold all the out-of-order packets. This increases the cost of the hosts, as they need more system memory. Silver Peak is the only optimization provider to correct, and often eliminate, packet loss and out-of-order packets in real time using our Network Integrity features.
- HDFS today runs directly over IP, which is not optimized by most WAN optimization solutions. Silver Peak automatically optimizes any IP-based protocol, including HDFS, with no additional plug-ins or configuration.
- HDFS requires extreme scalability to function properly. Silver Peak leads the industry in scalability, delivering up to 2.5 Gbps of optimized data. (That’s approximately 10-15 Gbps of LAN data, depending on the nature of the data.)
- HDFS cannot support asynchronous writes, so a single session stream per GridFTP transfer is the best solution. Silver Peak has been architected for just this scenario and does an excellent job of maximizing the value of individual streams.
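To see why out-of-order arrival drives up host memory, consider a receive-side reorder buffer: every packet that arrives ahead of a gap must be held in memory until the missing packet fills the gap. The following is a minimal illustrative sketch (not Silver Peak’s or GridFTP’s actual implementation); the class and method names are hypothetical.

```python
# Sketch of a receive-side reorder buffer. Packets arriving ahead of a
# gap are held in memory until the gap is filled, which is why lossy,
# high-latency WAN paths inflate host buffer (RAM) requirements.
class ReorderBuffer:
    def __init__(self):
        self.next_seq = 0   # next in-order sequence number expected
        self.pending = {}   # out-of-order packets held in memory

    def receive(self, seq, payload):
        """Accept a packet; return payloads now deliverable in order."""
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

    def buffered_bytes(self):
        """Bytes held waiting for earlier packets to arrive."""
        return sum(len(p) for p in self.pending.values())
```

If packets 0, 2, and 3 arrive but packet 1 is lost or delayed, packets 2 and 3 sit in the buffer until 1 shows up; multiply that by many parallel streams and a long WAN round trip, and the memory cost adds up. Correcting loss and reordering in the network, before the packets reach the host, shrinks that buffer.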
And since Silver Peak optimization is just software, it can be downloaded in minutes from our Virtual Marketplace and deployed on most servers running a major hypervisor. As computational loads move between nodes or sites, optimization instances can follow. So while big data may require big optimization, it most definitely does not require big iron.