Introduction
Hadoop is a highly scalable, open-source, distributed computing platform developed by Apache for storing, processing, and analysing large amounts of data.
- Big Data technology
- primarily Java-based
- batch/offline processing system
Its core components — HDFS for storage, MapReduce for processing, YARN for resource management, and Hadoop Common for essential utilities — continue to shape how data workflows are designed and scaled.
Hadoop Ecosystem
Hadoop is not a single tool. It’s an ecosystem that includes multiple modules that work together to manage data storage, processing, and resource coordination.
HDFS, MapReduce, YARN & Hadoop Common are the four primary pillars of the Hadoop ecosystem.
Benefits of Hadoop
Hadoop was designed to deal with large amounts of data. Here are its five primary benefits:
Benefit | Details |
---|---|
Speed | Hadoop’s concurrent processing, the MapReduce model, and HDFS together enable fast processing |
Diversity | HDFS can hold a variety of data forms, including structured, semi-structured, and unstructured data |
Cost-Effective | Open-source, and stores data on commodity hardware |
Resilient | HDFS replicates data over the network, with a default replication factor of 3 |
Scalable | Being distributed in nature, Hadoop can be scaled horizontally by quickly adding extra servers |
Challenges in Hadoop
Challenge | Details |
---|---|
Security | Hadoop clusters often hold a lot of confidential information, yet Hadoop still needs to provide suitable identity management, data encryption, provisioning, and auditing procedures. |
High Learning Curve | Running a query in Hadoop typically requires writing MapReduce functions in Java. In addition, the ecosystem is made up of many other components, each with its own learning curve. |
Not All Datasets Can Be Handled the Same | Hadoop does not provide a “one-size-fits-all” solution. Different components operate in different ways, so a lot of trial and error is needed. |
MapReduce is Limited | Although MapReduce is a fine programming model, it relies on a file-intensive approach (intermediate results are written to disk), which makes iterative and interactive workloads hard to handle. |
Components of Hadoop
Component | Details |
---|---|
HDFS | Distributed storage layer; stores data as replicated blocks across the cluster |
MapReduce | Distributed processing engine built around the Map and Reduce phases |
YARN | Resource management and job scheduling layer |
Key Features of Hadoop
These are mainly:
- Distributed storage and processing allow it to store extremely large datasets and query them.
- Horizontal scalability without requiring high-end machines.
- Data locality optimization, moving computation to where the data resides.
- Resilience to failure via replication and task re-execution.
These 4 features make Hadoop well-suited for batch data processing, log analysis, and ETL pipelines. Hadoop is a classic case study for distributed computing, where computation happens across multiple nodes to increase the overall efficiency and scale.
Hadoop Distributed File System (HDFS)
HDFS is Hadoop’s primary storage system. It is designed to reliably store the vast amounts of data across a cluster of machines. HDFS architecture is highly optimized for fault tolerance, scalability, and data locality.
Architecture & Components
HDFS follows a master-slave model.
- At the top sits the NameNode:
  - It manages metadata such as the file system’s directory tree and the location of each file’s blocks.
  - It doesn’t store the actual data.
  - It serves as the master, coordinating the DataNodes.
- The DataNodes are the workhorses, or slaves:
  - They manage storage attached to their nodes and serve read/write requests from clients.
  - Each DataNode regularly reports back to the NameNode with heartbeats and block reports to ensure consistent state tracking.
Data Storage & Fault Tolerance
When a file is written to HDFS, it’s broken into fixed-size blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later). Each block is then distributed across different DataNodes. Data in HDFS isn’t just stored once: each block is replicated (the default replication factor is three) and spread across the cluster.
- During a read operation, the system pulls data from the nearest available replica to maximize throughput and minimize latency.
- Writes go to one replica first and then propagate to the others, ensuring durability without immediate bottlenecks.
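The same write/read path can be driven programmatically through the HDFS Java client. Below is a minimal sketch, assuming a cluster whose address is picked up from core-site.xml/hdfs-site.xml on the classpath; the path /tmp/example.txt is illustrative.

```java
// Minimal HDFS write/read sketch (illustrative path and cluster configuration).
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // the client contacts the NameNode for metadata

    Path file = new Path("/tmp/example.txt");   // illustrative path

    // Write: the client streams blocks to a pipeline of DataNodes; replication happens behind the scenes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Block size and replication factor as recorded in the NameNode's metadata.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size = " + status.getBlockSize()
        + ", replication = " + status.getReplication());

    // Read: the client is served by the nearest DataNode holding a replica.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, conf, false);
    }
  }
}
```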
HDFS is built with failure in mind. HDFS ensures data availability through its replication mechanism.
- DataNode Failure: the NameNode detects the failure via missed heartbeats and schedules re-replication of the lost blocks to healthy nodes.
- NameNode Failures: To Be Researched and Added
Important HDFS Commands
Purpose | Command |
---|---|
Checking a file content | hadoop fs -text <FILE_PATH> \| head -n 1 |
Listing files | hdfs dfs -ls <FILE_PATH> |
Creating Directory | hadoop fs -mkdir -p <PATH> |
Checking storage | hdfs dfs -du -h -s <FILE_PATH> |
Merging files | hdfs dfs -getmerge <HDFS_FILE_PATH> <LOCAL_FILE_NAME> |
Moving data | hdfs dfs -mv <HDFS_SOURCE>/* <HDFS_DESTINATION> |
Copying data | hadoop fs -cp <HDFS_SOURCE>/* <HDFS_DESTINATION> |
Copying data to local | hdfs dfs -copyToLocal <HDFS_FILE_PATH> <LOCAL_FILE_NAME> |
MapReduce
MapReduce is Hadoop’s processing engine. It allows for distributed computation across large datasets by breaking tasks down into smaller, independent operations that can be executed in parallel, which is what lets Hadoop process data at scale.
Stages and Components of MapReduce
MapReduce follows a two-phase programming model consisting of a Map phase and a Reduce phase:
- The Map() function converts data blocks into tuples, which are just key-value pairs.
- The Reduce() function then groups these tuples by key (after an intermediate shuffle-and-sort step), performs operations such as sorting and summing, and sends the results to the final output.
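To make the two phases concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API; the class name and input/output paths are illustrative.

```java
// Word count: the canonical MapReduce example.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: turn each input line into (word, 1) tuples.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: after the shuffle groups values by key, sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional map-side local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar and submitted with `hadoop jar wordcount.jar WordCount <input_dir> <output_dir>`, the mappers emit (word, 1) pairs, the framework shuffles them by key, and the reducers sum the counts.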
Cascading is a popular Java framework for writing MapReduce jobs. In recent years, Spark jobs have largely replaced MapReduce/Cascading jobs in production.
Yet Another Resource Negotiator (YARN)
YARN serves as Hadoop’s resource management layer. It separates job scheduling and resource allocation from the processing model, helping Hadoop support multiple data processing engines beyond MapReduce like Apache Spark, Hive, and Tez.
There are 2 primary tasks for YARN:
- Job Scheduling
  - Breaking a large job down into smaller tasks so that each can be assigned to different nodes in the Hadoop cluster and processing is optimized.
  - It also keeps track of which jobs are more critical, which jobs have higher priority, job dependencies, and other information such as job timing.
- Resource Management
  - Managing all of the resources made available to a Hadoop cluster.
Components of YARN
Component | Details |
---|---|
ResourceManager | The master daemon that manages all cluster resources and schedules applications |
NodeManager | Per-node agent that monitors resource usage on its node and reports back to the ResourceManager |
ApplicationMaster | A per-job process that negotiates resources from the ResourceManager and coordinates execution with the NodeManagers |
Job scheduling is done by the RM, while task monitoring is done by the AM.
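As a small illustration of the RM/NM split, the sketch below uses the YarnClient API to ask the ResourceManager for the NodeManagers it currently tracks and their resource usage; it assumes yarn-site.xml on the classpath points at a running ResourceManager.

```java
// Query the ResourceManager for its running NodeManagers (assumes yarn-site.xml is configured).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnClusterInfo {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());   // reads yarn-site.xml for the ResourceManager address
    yarnClient.start();

    // The RM tracks every NodeManager; each report shows used vs. total capacity on that node.
    for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId() + " -> "
          + node.getUsed() + " used of " + node.getCapability());
    }
    yarnClient.stop();
  }
}
```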
Internals of NameNode
Containers are also referred to as executors in everyday usage.
Hadoop Common
The glue that pulls all of Hadoop’s components together is Hadoop Common. It’s a collection of Java libraries, configuration files, and utilities required by all the other Hadoop components: HDFS, YARN, and MapReduce.
- Provides the foundational code and tools that allow HDFS, YARN, and MapReduce to communicate and coordinate.
- This includes Java libraries, file system clients, and APIs used across the ecosystem.
- Ensures consistency and compatibility between modules, allowing developers to build on a shared set of primitives.
- Proper configuration of Hadoop is critical. Hadoop Common contains configuration files like core-site.xml, which define default behaviors such as the filesystem URI and I/O settings (see the sketch after this list).
- Contains CLI tools and scripts for tasks like starting services, checking node health, and managing jobs.
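As a minimal sketch of how these shared primitives are consumed, the snippet below uses Hadoop Common’s Configuration class, which loads core-default.xml and core-site.xml from the classpath; the property values shown are illustrative.

```java
// Read Hadoop settings through the shared Configuration class from Hadoop Common.
import org.apache.hadoop.conf.Configuration;

public class ShowConfiguration {
  public static void main(String[] args) {
    // Loads core-default.xml and core-site.xml from the classpath.
    Configuration conf = new Configuration();

    // fs.defaultFS defines the default filesystem URI used by HDFS clients (fallback shown is illustrative).
    String defaultFs = conf.get("fs.defaultFS", "file:///");
    System.out.println("fs.defaultFS = " + defaultFs);

    // Any property can also be overridden programmatically for a single job or client.
    conf.set("dfs.replication", "2");
    System.out.println("dfs.replication = " + conf.get("dfs.replication"));
  }
}
```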