As more companies delve into Big Data and analytics at scale, they are discovering that data mobility is one of the biggest challenges. Hadoop clusters – whether deployed on-premises or built on cloud services such as Amazon Web Services Elastic MapReduce (EMR) – have traditionally required tight coupling of the data to the compute nodes, which makes for a sticky situation should a company want to take a hybrid cloud approach to its analytics needs. Suppose an enterprise starts its foray into analytics by deploying a small Hadoop cluster in its own data center. Most companies take this approach because learning the Hadoop ecosystem in the cloud can be costly. Once they’ve figured out how to use the Hadoop tool-chain to develop their analytics applications, they find cloud resources such as Elastic MapReduce very enticing. EMR clusters can be spun up to run a particular analysis and then torn down just as quickly to keep resource-consumption costs low. It’s a perfect case for leveraging cloud resources in the manner for which they were intended.
Unfortunately, this type of scenario doesn’t come without its challenges. Most companies leverage Hadoop clusters to analyze data from various sources in order to reveal relationships in the data that were previously unknown. In doing so, they are tasked with architecting very large ETL (extract, transform, load) workflows to ingest all of this data. NetApp has helped address this challenge previously with its NFS Connector for Hadoop, and is doing so again with the newly released CloudSync.
For those who have not heard of the NFS Connector for Hadoop, let’s quickly review. Imagine, for instance, that you are a large law firm. You have approximately 250TB of unstructured data currently hosted on a NetApp FAS disk array and accessed via CIFS (Windows file shares). One of the big challenges in the legal vertical is eDiscovery. There are software tools available to process a large corpus of documents, look for keywords, and/or find documents that reach a certain level of relevance. However, such tools are typically very expensive and only used on a case-by-case basis. What if you wanted to search across every case the firm has ever worked? Searching through 250TB would be very cumbersome, and just creating the indices would be a feat. The Apache Hadoop ecosystem provides scale-out tools such as Apache Solr and Apache Tika that can handle these types of searches with ease. Traditionally, the data would need to reside on an HDFS filesystem hosted on the same cluster where Solr and Tika were installed. With the NFS Connector for Hadoop, and NetApp’s advanced NAS capabilities within Data ONTAP, an Apache Hadoop cluster can be configured to access this data where it resides on the NetApp FAS disk array, with no need to ingest it into the Hadoop cluster. Feel free to check out NetApp’s latest Technical Report on the NFS Connector for Hadoop for more info.
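To give a feel for how that works, here is a sketch of the kind of `core-site.xml` entries used to point a Hadoop cluster at an NFS export via the NFS Connector. The host name and export path are placeholders, and the exact property names and values for your release should be taken from NetApp’s Technical Report, not from this illustration:

```xml
<!-- Illustrative core-site.xml fragment for the NFS Connector for Hadoop.
     Host name and export path below are placeholders. -->
<property>
  <name>fs.nfs.impl</name>
  <value>org.apache.hadoop.fs.nfs.NFSv3FileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.nfs.impl</name>
  <value>org.apache.hadoop.fs.nfs.NFSv3AbstractFilesystem</value>
</property>
<property>
  <!-- The NFS export served by the FAS array (placeholder host/path) -->
  <name>fs.defaultFS</name>
  <value>nfs://nfs-server.example.com:2049/</value>
</property>
```

With entries along these lines in place, MapReduce jobs (and tools like Solr and Tika running on the cluster) address the data with `nfs://` paths instead of `hdfs://` paths, which is what removes the ingest step.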
Building on what NetApp has done with the NFS Connector for Hadoop, today’s Hybrid Cloud Launch introduced a new tool called CloudSync. CloudSync is a cloud-hosted service that gives customers the ability to seamlessly move data from an on-premises NFS server to an S3 bucket hosted on AWS.
Data in an S3 bucket can then be consumed by services such as AWS Elastic MapReduce. This gives customers the option of performing analytics in their on-premises Hadoop clusters or via cloud-based Hadoop services. Data can be selected for replication at the directory level – either at the root NFS mount point or a sub-directory further down the directory structure. Replication relationships can be synced, re-synced, paused, or deleted. There is currently no facility to schedule relationships, but with NetApp’s newfound passion for DevOps it may be only a matter of time before an API is released to orchestrate such a thing (NOTE: I do not know this for sure; I’m only speculating).
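As a sketch of the spin-up/tear-down workflow described above, the AWS CLI can launch a transient EMR cluster against the bucket CloudSync has been replicating into. The bucket names below are placeholders, not real resources, and the call assumes you have AWS credentials configured:

```shell
#!/bin/sh
# Hypothetical sketch: launch a transient EMR cluster over data that
# CloudSync has replicated into S3. Bucket names are placeholders.
INPUT="s3://example-cloudsync-target/legal-corpus/"
LOGS="s3://example-cloudsync-target/emr-logs/"

if command -v aws >/dev/null 2>&1; then
  # --auto-terminate tears the cluster down when its work is done,
  # which is what keeps resource-consumption costs low.
  aws emr create-cluster \
    --name "transient-analytics" \
    --release-label emr-5.0.0 \
    --applications Name=Hadoop \
    --use-default-roles \
    --instance-type m4.large \
    --instance-count 3 \
    --log-uri "$LOGS" \
    --auto-terminate \
    || echo "EMR launch failed (no AWS credentials in this environment?)"
else
  echo "aws CLI not installed; skipping EMR launch"
fi
```

Jobs on the cluster would then read their input directly from `$INPUT`, since EMR’s Hadoop distribution can address S3 paths natively.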
In addition to all of these capabilities, CloudSync isn’t just for folks who own a NetApp FAS disk array. It can be used to migrate data hosted on any ‘ole NFS server. That gives customers running traditional HDFS clusters the ability to migrate their data to AWS using the “newish” NFS Gateway for HDFS as well.
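For those unfamiliar with that gateway, here is a rough sketch of exposing an existing HDFS cluster over NFSv3 using the NFS Gateway that ships with Apache Hadoop, so its data could then be picked up by NFS-based tooling such as CloudSync. Host names and paths are placeholders, and production deployments normally run these daemons under the cluster’s service manager rather than ad hoc:

```shell
#!/bin/sh
# Hypothetical sketch: expose HDFS over NFSv3 via the Apache Hadoop
# NFS Gateway. Host name and mount point are placeholders.
GATEWAY_HOST="gateway-host.example.com"
MOUNT_POINT="/mnt/hdfs"

if command -v hdfs >/dev/null 2>&1; then
  # On the gateway node: start the gateway's portmap and nfs3 services.
  hdfs portmap &
  hdfs nfs3 &

  # On an NFS client: mount the HDFS namespace. The gateway only
  # supports NFSv3, hence the explicit vers=3.
  sudo mount -t nfs -o vers=3,proto=tcp,nolock "$GATEWAY_HOST:/" "$MOUNT_POINT"
else
  echo "hdfs command not found; see the Hadoop NFS Gateway documentation"
fi
```

Once mounted, the HDFS namespace appears as an ordinary NFS export that an NFS-to-S3 replication tool can read from.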
A subscription to CloudSync can be established in the AWS Marketplace using the following URL: https://aws.amazon.com/marketplace/pp/B01IQ35IFE
NetApp continues to drive innovation for Big Data architects. Innovations like CloudSync are just another way NetApp has shown it is not just a storage company: it is designing meaningful solutions that solve Data Management and Data Mobility challenges and create a true Data Fabric.