Hadoop

  • How to store and work with huge volumes and varieties of data
  • How to analyze these vast data points and use them for competitive advantage
  • Volume: There is a massive amount of data generated every second.
  • Velocity: The speed at which data is generated, collected, and analyzed.
  • Variety: The different types of data: structured, semi-structured, and unstructured.
  • Value: The ability to turn data into useful insights for your business.
  • Veracity: Trustworthiness in terms of quality and accuracy.
  • Data is split into blocks (64 MB by default in Hadoop 1, 128 MB in Hadoop 2 and later), and each block is stored on 3 machines by default to ensure data availability & reliability
  • Many machines connected in a cluster work in parallel for faster crunching of data
  • If any one machine fails, the work is assigned to another automatically
  • MapReduce breaks complex tasks into smaller chunks to be executed in parallel
  1. Hadoop HDFS — Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
  2. Hadoop MapReduce — Hadoop MapReduce is the processing unit of Hadoop.
  3. Hadoop YARN — Hadoop YARN is a resource management unit of Hadoop.
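
The block-and-replica arithmetic described above can be sketched in a few lines of Python. This is only an illustration, assuming a 128 MB block size and a replication factor of 3; the function name plan_blocks is hypothetical and is not part of any Hadoop API.

```python
MB = 1024 * 1024

def plan_blocks(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, size of the last block, raw bytes stored cluster-wide)."""
    if file_size_bytes == 0:
        return 0, 0, 0
    # Ceiling division: a partly filled final block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    last_block = file_size_bytes - (num_blocks - 1) * block_size_bytes
    # Every block is stored `replication` times, so raw usage scales linearly.
    raw_bytes = file_size_bytes * replication
    return num_blocks, last_block, raw_bytes

blocks, last, raw = plan_blocks(300 * MB)
print(blocks, last // MB, raw // MB)  # 3 44 900
```

Under these assumptions, a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and the cluster stores 900 MB in total.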

4.1 Hadoop HDFS

  • Data is stored in a distributed manner in HDFS. There are two components of HDFS — name node and data node. While there is only one name node, there can be multiple data nodes.
  • HDFS is specially designed for storing huge datasets in commodity hardware. An enterprise version of a server costs roughly $10,000 per terabyte for the full processor. In case you need to buy 100 of these enterprise version servers, it will go up to a million dollars.
  • Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have to spend millions of dollars just on your data nodes. However, the name node is typically a more reliable, enterprise-grade server.

4.1.1 Master and Slave Nodes

Master and slave nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.

  • The name node coordinates the data nodes. It also stores the metadata.
  • The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.
  • Consider that 30TB of data is loaded into the cluster. The name node decides how the blocks are distributed across the data nodes, and the data is replicated among the data nodes. You can see in the image above that the blue, grey, and red data are replicated among the three data nodes.
  • Replication of the data is performed three times by default. It is done this way so that if a commodity machine fails, the same data is still available on other machines and the failed machine can simply be replaced.

4.2 Hadoop MapReduce

  • Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
  • Instead of moving the data to the code, a small piece of code is sent to the nodes that hold the data. This code is usually very small in comparison to the data itself. You only need to send a few kilobytes worth of code to perform a heavy-duty process on the computers that store the data.
  • The input dataset is first split into chunks of data. In this example, the input has three lines of text: “bus car train,” “ship ship train,” and “bus ship car.” The dataset is split into three chunks, one per line, and the chunks are processed in parallel.
  • In the map phase, each word is emitted as a key with a value of 1. These key-value pairs are then shuffled and sorted together based on their keys. At the reduce phase, the aggregation takes place, and the final output is obtained. In this case, the result is two buses, two cars, three ships, and two trains.
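
The map, shuffle, and reduce steps above can be imitated on a single machine with plain Python. This is a sketch of the idea only, not Hadoop's MapReduce API; the function names map_phase, shuffle, and reduce_phase are made up for illustration.

```python
from collections import defaultdict

lines = ["bus car train", "ship ship train", "bus ship car"]

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key and sort by key, as the shuffle/sort step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Aggregate the list of 1s for each key into a count."""
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'bus': 2, 'car': 2, 'ship': 3, 'train': 2}
```

In a real cluster each chunk would be mapped on a different node and the grouped keys would be spread across several reducers; here everything runs in one process to keep the data flow visible.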

4.3 Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.

  • Hadoop YARN acts like an OS to Hadoop. It is not a file system itself; it is a resource management layer that sits on top of HDFS.
  • It is responsible for managing cluster resources to make sure you don’t overload one machine.
  • It performs job scheduling to make sure that the jobs are scheduled in the right place.
  • Suppose a client machine wants to do a query or fetch some code for data analysis. This job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management. In the node section, each of the nodes has its own node manager.
  • These node managers manage the nodes and monitor the resource usage on each node. Containers hold a collection of physical resources, such as RAM, CPU, or disk. Whenever a job request comes in, the app master requests containers from the resource manager and, once they are granted, works with the node managers to launch them and reports progress back to the resource manager.
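
To make the idea of container allocation concrete, here is a toy model in Python. It is an assumption-laden sketch: the classes NodeManager and ResourceManager below are simplified stand-ins, memory is the only resource tracked, and the "most free memory first" rule is chosen purely for illustration. Real YARN schedulers are considerably more sophisticated.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    """Toy node: tracks free memory and the containers it is running."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

class ResourceManager:
    """Toy resource manager that places containers on nodes with spare memory."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        """Place a container on the node with the most free memory, or return None."""
        candidates = [n for n in self.nodes if n.free_memory_mb >= memory_mb]
        if not candidates:
            return None  # cluster is full for a request of this size
        node = max(candidates, key=lambda n: n.free_memory_mb)
        node.free_memory_mb -= memory_mb
        node.containers.append((app_id, memory_mb))
        return node.name

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 8192)])
placements = [rm.allocate("app-1", 3072) for _ in range(4)]
print(placements)  # ['node2', 'node2', 'node1', None]
```

The fourth request is refused because no node has 3072 MB left, which is the kind of overload protection described above.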

5. Features of Hadoop

5.1 Fault-Tolerant
5.2 Highly Available
5.3 Recoverable
5.4 Consistent
5.5 Scalable
5.6 Predictable Performance
5.7 Secure

5.1 Fault Tolerance

  • Fault tolerance in Hadoop HDFS refers to the working strength of a system in unfavorable conditions and how that system can handle such a situation.
  • HDFS is highly fault-tolerant. Before Hadoop 3, it handled faults through replica creation. It creates replicas of users’ data on different machines in the HDFS cluster, so whenever any machine in the cluster goes down, the data is still accessible from the other machines that hold a copy.
  • HDFS also maintains the replication factor: if a machine suddenly fails, it creates new replicas of the affected data on other available machines in the cluster.
  • Hadoop 3 introduced Erasure Coding to provide Fault Tolerance. Erasure Coding in HDFS improves storage efficiency while providing the same level of fault tolerance and data durability as traditional replication-based HDFS deployment.

5.1.1 How is HDFS Fault Tolerance achieved?

  • Prior to Hadoop 3, the Hadoop Distributed File System achieved fault tolerance through the replication mechanism. Hadoop 3 introduced Erasure Coding to achieve fault tolerance with less storage overhead.
  • Let us see both ways for achieving Fault-Tolerance in Hadoop HDFS.

5.1.1.1 Replication Mechanism

  • Before Hadoop 3, fault tolerance in Hadoop HDFS was achieved by creating replicas. HDFS creates replicas of each data block and stores them on multiple machines (Data Nodes).
  • The number of replicas created depends on the replication factor (by default 3).
  • If any of the machines fails, the data block is accessible from the other machine containing the same copy of data. Hence there is no data loss due to replicas stored on different machines.

5.1.1.2 Erasure Coding

  • Erasure coding is a method used for fault tolerance that durably stores data with significant space savings compared to replication.
  • RAID (Redundant Array of Independent Disks) uses erasure coding. Erasure coding works by striping the file into small units and storing them on various disks.
  • For each stripe of the original dataset, a certain number of parity cells are calculated and stored. If any of the machines fails, the lost block can be recovered from the remaining data cells and the parity cells. Erasure coding reduces the storage overhead to no more than 50% (for example, 3 parity cells for every 6 data cells), compared with 200% for three-way replication.
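
The parity idea can be shown with the simplest possible erasure code, a single XOR parity cell (the RAID 5 style). Note that this is a deliberate simplification: HDFS erasure coding uses Reed-Solomon codes, which tolerate more than one lost cell, and the helper names below are hypothetical. The second function just works out the storage overhead figures.

```python
from functools import reduce

def xor_cells(cells):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))

# One stripe with three data cells and one parity cell.
data_cells = [b"bus ", b"car ", b"ship"]
parity = xor_cells(data_cells)

# Suppose the machine holding the second cell fails.
survivors = [data_cells[0], data_cells[2], parity]
recovered = xor_cells(survivors)
print(recovered)  # b'car '

def storage_overhead(data_units, parity_units):
    """Extra storage as a fraction of the original data size."""
    return parity_units / data_units

print(storage_overhead(6, 3))  # 0.5 -> 50% with 6 data + 3 parity cells
print(storage_overhead(1, 2))  # 2.0 -> 200% with 3x replication (2 extra copies)
```

XOR works here because XOR-ing the parity with all surviving data cells cancels them out and leaves exactly the missing cell.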

Example of HDFS Fault Tolerance

  • Suppose the user stores a file XYZ. HDFS breaks this file into blocks, say A, B, and C. Let’s assume there are four Data Nodes, say D1, D2, D3, and D4. HDFS creates replicas of each block and stores them on different nodes to achieve fault tolerance. For each original block, there will be two additional replicas stored on different nodes (replication factor 3).
  • Let the block A be stored on Data Nodes D1, D2, and D4, block B stored on Data Nodes D2, D3, and D4, and block C stored on Data Nodes D1, D2, and D3.
  • If Data Node D1 fails, the blocks A and C present in D1 are still available to the user from Data Nodes (D2, D4 for A) and (D2, D3 for C).
  • Hence even in unfavorable conditions, there is no data loss.
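
The example above can be checked with a short Python sketch. The placement map mirrors the blocks and Data Nodes named in the example; the function surviving_replicas is a hypothetical helper, not an HDFS call.

```python
# Block -> set of Data Nodes holding a replica, as in the example above.
placement = {
    "A": {"D1", "D2", "D4"},
    "B": {"D2", "D3", "D4"},
    "C": {"D1", "D2", "D3"},
}

def surviving_replicas(placement, failed_nodes):
    """Return, for each block, the nodes that still hold a copy after the failures."""
    failed = set(failed_nodes)
    return {block: sorted(nodes - failed) for block, nodes in placement.items()}

after_failure = surviving_replicas(placement, ["D1"])
print(after_failure)
# {'A': ['D2', 'D4'], 'B': ['D2', 'D3', 'D4'], 'C': ['D2', 'D3']}

# Blocks that now have fewer than 3 copies would need to be re-replicated.
under_replicated = sorted(b for b, nodes in after_failure.items() if len(nodes) < 3)
print(under_replicated)  # ['A', 'C']
```

Every block still has at least two live copies after D1 fails, so nothing is lost; blocks A and C are merely under-replicated until new copies are made.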

5.2 Highly Available

  • High Availability was a new feature added in Hadoop 2.x to solve the single point of failure problem in the older versions of Hadoop.
  • Hadoop HDFS follows the master-slave architecture, where the Name Node is the master node and maintains the file system tree, so HDFS cannot be used without the Name Node. The Name Node therefore becomes a single point of failure. The HDFS high availability feature addresses this issue.

5.2.1 What is high availability?

High availability refers to the availability of system or data in the wake of component failure in the system.

  • The high availability feature in Hadoop ensures the availability of the Hadoop cluster without any downtime, even in unfavorable conditions like Name Node failure, Data Node failure, machine crash, etc.
  • It means if the machine crashes, data will be accessible from another path.

5.2.2 How does Hadoop HDFS achieve High Availability?

  • As we know, HDFS (Hadoop distributed file system) is a distributed file system in Hadoop. HDFS stores users’ data in files and internally, the files are split into fixed-size blocks. These blocks are stored on Data Nodes. Name Node is the master node that stores the metadata about file system i.e. block location, different blocks for each file, etc.

1. Availability if Data Node fails

  • In HDFS, replicas of files are stored on different nodes.
  • Data Nodes in HDFS continuously send heartbeat messages to the Name Node, every 3 seconds by default.
  • If the Name Node does not receive a heartbeat from a Data Node within a specified time (about 10 minutes by default), the Name Node considers the Data Node to be dead.
  • The Name Node then checks which data was stored on that Data Node and initiates data replication. It instructs the Data Nodes containing a copy of that data to replicate it to other Data Nodes.
  • Whenever a user requests access to their data, the Name Node provides the IP of the closest Data Node containing that data. If that Data Node fails in the meantime, the Name Node redirects the user to another Data Node containing a copy of the same data. The user reads the data from those other Data Nodes without any downtime. Thus the cluster remains available to the user even if any of the Data Nodes fails.
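
The "about 10 minutes" figure comes from two settings. With the default values of dfs.heartbeat.interval (3 seconds) and dfs.namenode.heartbeat.recheck-interval (5 minutes), the Name Node declares a Data Node dead after 2 × recheck interval + 10 × heartbeat interval, which is 630 seconds. The Python sketch below reproduces that arithmetic; the function names and the sample timestamps are illustrative assumptions, not Hadoop code.

```python
HEARTBEAT_INTERVAL_S = 3      # default dfs.heartbeat.interval
RECHECK_INTERVAL_S = 5 * 60   # default dfs.namenode.heartbeat.recheck-interval (300000 ms)

def dead_node_timeout(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Seconds of silence after which a Data Node is considered dead."""
    return 2 * recheck_s + 10 * heartbeat_s

def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout_s)

timeout = dead_node_timeout()
print(timeout)  # 630

# Hypothetical timestamps (in seconds) of the last heartbeat seen from each node.
last_heartbeat = {"D1": 0, "D2": 698, "D3": 700}
dead = find_dead_nodes(last_heartbeat, now=700, timeout_s=timeout)
print(dead)  # ['D1']
```

D1 has been silent for 700 seconds, which exceeds the 630 second limit, so its blocks would be scheduled for re-replication.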

2. Availability if Name Node fails

  • Name Node is the only node that knows the list of files and directories in a Hadoop cluster. “The file system cannot be used without Name Node”.
  • The addition of the High Availability feature in Hadoop 2 provides fast failover for the Hadoop cluster. The Hadoop HA cluster consists of two Name Nodes (or more as of Hadoop 3) running in an active/passive configuration with a hot standby. So, if the active Name Node fails, a passive Name Node becomes active, takes over responsibility, and serves client requests.
  • This allows for fast failover to a new Name Node when the active machine crashes.
  • Thus, data is available and accessible to the user even if the Name Node itself goes down.

5.3 Recovery

  • Many people don’t consider backups since Hadoop has 3X replication by default. Also, Hadoop is often a repository for data that resides in existing data warehouses or transactional systems, so the data can be reloaded. That is no longer the only case! Social media data, ML models, logs, third-party feeds, open APIs, IoT data, and other sources of data may not be reloadable, easily available, or in the enterprise at all. So, this is now critical single-source data that must be backed up and stored forever.
  • There are a lot of tools in the open-source space that allow you to handle most of your backup, recovery, replication, and disaster recovery needs. There are also some other enterprise hardware and software options.

Some Options

  • Replication and mirroring with Apache Falcon.
  • Dual ingest or replication via HDF.
  • WANdisco.
  • DistCp.
  • In-memory WAN replication via memory grids (Gemfire, GridGain, Redis, etc.).
  • HBase Replication.
  • Hive Replication.
  • Apache Storm, Spark, and Flink custom jobs to keep clusters in sync.

Scenarios to plan for:

  • Catastrophic failure at the data center level, requiring failover to a backup location
  • Needing to restore a previous copy of your data due to user error or accidental deletion
  • The ability to restore a point-in-time copy of your data for auditing purposes

RTO/RPO Drill Down

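RTO, or Recovery Time Objective, is the target time you set for the recovery of your IT and business activities after a disaster has struck. The goal here is to calculate how quickly you need to recover, which can then dictate the type of preparations you need to implement and the overall budget you should assign to business continuity. RPO, or Recovery Point Objective, is the maximum amount of recent data, measured in time, that you can afford to lose, which dictates how frequently data must be backed up or replicated.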

5.4 Consistent

5.4.1 Issues in maintaining consistency of the HDFS HA cluster

There are two issues in maintaining the consistency of the HDFS high availability cluster. They are:

  • The active node and the passive node should always be in sync with each other and must have the same metadata. This allows us to restore the Hadoop cluster to the same namespace where it crashed.
  • Only one Name Node in the cluster may be active at a time. If two Name Nodes are active at the same time, the cluster gets divided into smaller clusters, each one believing it is the only active cluster. This is known as the “split-brain scenario”, which leads to data loss or other incorrect results. Fencing is a process that ensures that only one Name Node is active at a time.
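
One way to picture fencing is with a counter that increases on every failover, so that writes from a previously active Name Node can be recognized as stale and refused. The Python sketch below is a toy model of that idea only; the class SharedEditLog and its methods are hypothetical and do not correspond to actual HDFS classes or to the fencing methods a real cluster is configured with.

```python
class SharedEditLog:
    """Toy stand-in for shared edit-log storage that accepts one writer at a time."""

    def __init__(self):
        self.current_epoch = 0
        self.edits = []

    def new_epoch(self):
        """Called by a Name Node when it becomes active; invalidates older writers."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch, edit):
        """Accept an edit only from the most recently activated Name Node."""
        if epoch < self.current_epoch:
            raise PermissionError(f"fenced: epoch {epoch} is stale")
        self.edits.append(edit)

log = SharedEditLog()
nn1_epoch = log.new_epoch()          # first Name Node becomes active
log.write(nn1_epoch, "mkdir /data")

nn2_epoch = log.new_epoch()          # failover: the standby becomes active
log.write(nn2_epoch, "mkdir /logs")

try:
    # The old Name Node still believes it is active and tries to write.
    log.write(nn1_epoch, "delete /data")
    stale_write_rejected = False
except PermissionError:
    stale_write_rejected = True

print(log.edits, stale_write_rejected)  # ['mkdir /data', 'mkdir /logs'] True
```

Because the stale write is refused, the two Name Nodes can never both change the namespace, which is the property that prevents split-brain.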

5.4.2 Problem

Although Apache Hadoop has support for using Amazon Simple Storage Service (S3) as a Hadoop file system, S3 behaves differently from HDFS. One of the key differences is in the level of consistency provided by the underlying file system. Unlike HDFS, S3 was historically an eventually consistent store (Amazon introduced strong read-after-write consistency in December 2020). Under eventual consistency, changes made to files on S3 may not be visible for some period of time, which causes problems such as:

  • FileNotFoundExceptions. Processes that write data to a directory and then list that directory may fail when the data they wrote is not visible in the listing. This is a big problem with Spark, for example.
  • Flaky test runs that “usually” work. For example, our root directory integration tests for Hadoop’s S3A connector occasionally fail due to eventual consistency. This is due to assertions about the directory contents failing. These failures occur more frequently when we run tests in parallel, increasing stress on the S3 service and making delayed visibility more common.
  • Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output of previous jobs may silently omit some data. This omission happens when a job chooses which files to consume based on a directory listing, which may not include recently-written items.

5.4.3 Solution

To address these issues caused by S3’s eventual consistency, we worked with the Apache Hadoop community to create a new feature, called S3Guard. S3Guard is being developed as part of the open source Apache Hadoop S3 client, S3A. S3Guard works by logging all metadata changes made to S3 to an external, consistent storage system, called a Metadata Store. Since the Metadata Store is consistent, S3A can use it to fill in missing state that may not be visible yet in S3.
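
The mechanism can be illustrated with a small Python model. This is a sketch under stated assumptions: EventuallyConsistentStore simulates a store whose listings lag behind writes, and GuardedClient imitates the idea of consulting a consistent metadata store. Neither class reflects the real S3A or S3Guard code.

```python
class EventuallyConsistentStore:
    """Toy object store: new keys appear in listings only after settle() is called."""

    def __init__(self):
        self.objects = {}
        self.visible = set()

    def put(self, key, data):
        self.objects[key] = data  # stored, but not yet visible in listings

    def settle(self):
        self.visible = set(self.objects)  # listings catch up with writes

    def list_keys(self):
        return sorted(self.visible)

class GuardedClient:
    """Toy client that records every write in a consistent metadata store."""

    def __init__(self, store):
        self.store = store
        self.metadata_store = set()

    def put(self, key, data):
        self.store.put(key, data)
        self.metadata_store.add(key)

    def list_keys(self):
        # Merge the possibly stale listing with the consistent metadata.
        return sorted(set(self.store.list_keys()) | self.metadata_store)

store = EventuallyConsistentStore()
client = GuardedClient(store)
client.put("out/part-0000", b"rows")

raw_listing = store.list_keys()
guarded_listing = client.list_keys()
print(raw_listing)      # []
print(guarded_listing)  # ['out/part-0000']
```

A job that lists the directory through the guarded client sees the file it just wrote, which avoids the silent omissions described in the problem list above.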

5.5 Scalable

Another primary benefit of Hadoop is its scalability. Rather than capping your data throughput with the capacity of a single enterprise server, Hadoop allows for the distributed processing of large data sets across clusters of computers, thereby removing the data ceiling by taking advantage of a “divide and conquer” method among multiple pieces of commodity hardware.

5.6 Predictable Performance

One of the main reasons for moving away from a traditional database solution for managing data is to increase raw performance and gain the ability to scale. It may come as a surprise to you to know that not all Hadoop distributions are created equal in this regard.

  • Speed of processing — tremendous parallel processing to reduce processing time

5.7 Security

Big data is the collection and analysis of large sets of data that hold intelligence and raw information drawn from user data, sensor data, medical data, and enterprise data. The Hadoop platform is used to store, manage, and distribute big data across several server nodes. This section focuses on the security issues that arise in the base layer of the Hadoop architecture, the Hadoop Distributed File System (HDFS). HDFS security can be enhanced using three approaches: Kerberos, an algorithm-based approach, and a name node-based approach.

  • HDFS — Hadoop distributed file system
  • Distributed computation tier using programming of MapReduce
  • Sits on low-cost commodity servers connected together, called a cluster
  • Consists of a Master Node or Name Node to control the processing
  • Data Nodes to store & process the data
  • Job Tracker & Task Tracker to manage & monitor the jobs
  • Over the last decade, data computations were scaled by increasing the computing power of a single machine, adding more processors and more RAM, but this approach had physical limitations.
  • As the data started growing beyond these capabilities, an alternative was required to handle the storage requirements of organizations like eBay (10 PB), Facebook (30 PB), Yahoo (170 PB), JPMC (150 PB)
  • With a typical disk data transfer rate of 75 MB/sec, it was impossible to process such huge data sets on a single machine
  • Scalability was limited by physical size & no or limited fault tolerance
  • Additionally various formats of data are being added to the organizations for analysis which is not possible with traditional databases
