Five years ago, the go-to architecture for a data lake was Hadoop, a name that was synonymous with big data. What exactly is Hadoop, and why did its rise to stardom start to decline?
Hadoop is primarily composed of three layers:
- HDFS – Storage Layer
- MapReduce – Processing Layer
- YARN (Yet Another Resource Negotiator) – Resource Management Layer
The Layered Attack by the Cloud
With the advent of the cloud, and with providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform offering cheaper, more scalable, and faster solutions than on-premise HDFS, Hadoop was dealt a heavy blow. One of the primary factors in Hadoop's adoption was its ability to scale with exponential data volume growth on cheap commodity hardware. All said and done, it was still an on-premise platform, and the maintenance overhead loomed over organizations. Cloud providers could do much more, offering the same scalability with virtually no maintenance overhead at a lower cost. A new architecture gaining momentum on the cloud separates the compute and storage layers, enabling independent scaling on both fronts. Snowflake is a good example: a cloud data warehousing architecture that can scale compute and storage dynamically and independently.
The selling point at the time of Hadoop's introduction was the MapReduce component. Built in Java, MapReduce did exactly what its name suggests: it mapped processing tasks across the nodes of a cluster and reduced the intermediate results into a final answer, allowing Hadoop to distribute and process huge amounts of data in parallel. It was a great open-source concept, and cloud vendors quickly adopted and modified it to suit their needs; Hadoop was never able to catch up with those derivatives. In addition, the complexity involved deterred developers from getting information out of Hadoop in a timely manner.
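The map–shuffle–reduce pipeline described above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Hadoop's actual Java API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are invented here for clarity, and a real cluster would run each phase distributed across nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group intermediate pairs by key (done across the
    # network in a real cluster).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

# Two "input splits" standing in for HDFS blocks.
documents = ["big data big cluster", "big data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle_phase(pairs))
# counts == {"big": 3, "data": 2, "cluster": 1}
```

The classic word count is the "hello world" of MapReduce precisely because each phase is embarrassingly parallel: mappers never need to see each other's splits, and reducers never need to see each other's keys.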
Kubernetes and Cloud Foundry have started to pull enthusiasts away at the YARN layer. YARN is an orchestration layer developed entirely in Java, which limited its ecosystem to Java-focused tools. With the rise of microservices and the enthusiasm for Python-based data mining tools, Kubernetes proved an effective alternative: the developer focuses primarily on development rather than on deployment, scaling, and management of the application. Cloud Foundry, likewise, provided serverless computing, deployment, and scaling of applications in ways that YARN could not.
Factors Beyond the Cloud
The subsequent blow came from in-memory computing. Hardware costs have fallen while memory capacity has grown into the terabytes. Spark, with its Scala-based programming model and its ability to run computations entirely in memory, could churn out results up to 100 times faster than Hadoop. Riding the same wave was real-time computation: while cloud providers and several other solutions offered real-time data processing, Hadoop was still left with batch processing.
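The in-memory advantage is easiest to see with iterative workloads. A minimal sketch in plain Python, with a simulated disk read standing in for an HDFS scan (the names and counters here are illustrative, not part of any Hadoop or Spark API): a MapReduce-style job rereads its input from storage on every pass, while a Spark-style job caches the dataset in memory once and iterates over it.

```python
disk_reads = 0

def read_from_disk():
    # Simulated HDFS read; every call counts as one full scan.
    global disk_reads
    disk_reads += 1
    return list(range(1000))

def batch_style(iterations):
    # MapReduce style: each pass rereads the input from storage.
    total = 0
    for _ in range(iterations):
        total = sum(read_from_disk())
    return total

def in_memory_style(iterations):
    # Spark style: load once, keep the dataset cached in memory.
    cached = read_from_disk()
    total = 0
    for _ in range(iterations):
        total = sum(cached)
    return total

batch_style(10)            # triggers 10 simulated disk scans
scans_batch = disk_reads
disk_reads = 0
in_memory_style(10)        # triggers only 1 simulated disk scan
scans_cached = disk_reads
```

Iterative algorithms such as machine learning training loops make dozens of passes over the same data, which is exactly where the gap between the two styles widens.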
The foreseeable future is guided by machine learning, which is driving important decisions and garnering the attention of every organization. Machine learning on Hadoop is not as easy as with the MLaaS offerings of various cloud providers, such as IBM Watson, Google Cloud Machine Learning Engine, and BigML, to name a few.
The End of an Era and a New Beginning
With the advent of cloud computing, in-memory applications, and anything-as-a-service, Hadoop is losing its luster. Architectures that scale on both the compute and storage fronts have captured markets where Hadoop was never a choice, and it increasingly belongs to a bygone era. The Cloudera and Hortonworks merger might yet revive this once-thriving big data solution. Hadoop is facing a decline; however, it is not dead.