Hadoop
Today we live in the age of Big Data, where data volumes have outgrown the storage & processing capabilities of a single machine, and the variety of data formats that must be analyzed has increased tremendously. Handling data in the past was a fairly easy task because of the limited amount of data professionals had to work with: a single processor and storage unit was enough. Data was structured and kept in a database, and SQL queries made it possible to go through giant spreadsheets with multiple rows and columns.
1. What is Hadoop?
As the years went by and data generation increased, higher volumes and more formats emerged, so multiple processors were needed to process data in a reasonable time. However, a single storage unit became the bottleneck because of the network overhead it generated. This led to giving each processor its own distributed storage, which made data access easier. This method is known as parallel processing with distributed storage: various computers run the processes against their own storage.
Hadoop is an open-source Java framework that helps store, access, and process large volumes of big data in a distributed fashion at low cost, with a high degree of fault tolerance and high scalability. Hadoop [5] handles large amounts of data from different systems: images, videos, audio, folders, files, software, sensor records, communication data, structured queries, unstructured data, email and conversations, and practically any other format. All these resources can be stored in a Hadoop cluster without imposing a schema, rather than being collected separately from different systems. The wider Hadoop ecosystem includes many components such as Avro, Chukwa, Flume, HBase, Hive, Lucene, Oozie, Pig, Sqoop, and ZooKeeper.
2. Big Data and Its Challenges
Big Data refers to massive volumes of data that cannot be stored, processed, and analyzed using traditional approaches.
2.1 This brings two fundamental challenges:
- How to store and work with huge volumes & variety of data
- How to analyze these vast data points & use them for competitive advantage
2.2 The main elements of big data are:
- Volume - There is a massive amount of data generated every second.
- Velocity - The speed at which data is generated, collected and analyzed
- Variety - The different types of data: structured, semi-structured, unstructured
- Value - The ability to turn data into useful insights for your business
- Veracity - Trustworthiness in terms of quality and accuracy
3. How Hadoop addresses these challenges:
- Data is split into blocks of 128 MB (64 MB in older versions), and each block is stored on 3 machines by default to ensure data availability & reliability
- Many machines connected in a cluster work in parallel for faster crunching of data
- If any one machine fails, the work is assigned to another automatically
- MapReduce breaks complex tasks into smaller chunks to be executed in parallel
Hadoop fills this gap by overcoming both challenges. Hadoop is based on research papers from Google & it was created by Doug Cutting, who named the framework after his son’s yellow stuffed toy elephant.
4. Components of Hadoop
Hadoop is a framework that uses distributed storage and parallel processing to store and manage big data. It is the most commonly used software to handle big data. There are three components of Hadoop.
- Hadoop HDFS — Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
- Hadoop MapReduce — Hadoop MapReduce is the processing unit of Hadoop.
- Hadoop YARN — Hadoop YARN is a resource management unit of Hadoop.
4.1 Hadoop HDFS
- Data is stored in a distributed manner in HDFS. There are two components of HDFS — name node and data node. While there is only one name node, there can be multiple data nodes.
- HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-grade server costs roughly $10,000 per terabyte of storage and processing; if you needed 100 such enterprise servers, the bill would approach a million dollars.
- Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have to spend millions of dollars just on your data nodes. However, the name node is always an enterprise server.
4.1.1 Master and Slave Nodes
Master and slave nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.
- The name node is responsible for the workings of the data nodes. It also stores the metadata.
- The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.
- Consider that 30TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes so that each block exists on more than one machine.
- Replication of the data is performed three times by default. This is done so that if a commodity machine fails, another machine still holds the same data and the failed node can simply be replaced.
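The sketch below is a minimal illustration (not production code) of how a client might write a file through Hadoop's Java FileSystem API; the name node address hdfs://namenode:9000, the file path, and the explicit replication and block-size settings are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical name node address
        conf.set("dfs.replication", "3");                 // default replication factor
        conf.set("dfs.blocksize", "134217728");           // 128 MB blocks

        // HDFS splits the written stream into blocks and replicates each block
        // across data nodes; the name node only records the block locations.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
            out.write("bus car train\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```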
4.2 Hadoop MapReduce
- Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
- Instead of moving the data to the computation, a small piece of code is sent to where the data resides and processes it locally. This code is usually only a few kilobytes, tiny compared to the data itself, so you only need to ship a few kilobytes of code to perform a heavy-duty process across many computers.
- The input dataset is first split into chunks. In this example, the input has three lines of text: “bus car train,” “ship ship train,” “bus ship car.” The dataset is split into three chunks, one per line, and the chunks are processed in parallel.
- In the map phase, each word is emitted as a key with a value of 1, for example (bus, 1), (car, 1), (train, 1). These key-value pairs are then shuffled and sorted by key. In the reduce phase, the values for each key are aggregated, and the final word counts are obtained.
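The following is a minimal word-count sketch of the map and reduce phases described above, written against the standard Hadoop MapReduce Java API; the class names WordCount, TokenizerMapper, and IntSumReducer are chosen for this example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every token, e.g. ("bus", 1), ("car", 1), ("train", 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, sum the 1s for each word,
    // e.g. ("ship", [1, 1, 1]) -> ("ship", 3).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```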
4.3 Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.
- Hadoop YARN acts like an operating system for Hadoop. It is a resource management layer that sits on top of HDFS.
- It is responsible for managing cluster resources to make sure you don’t overload one machine.
- It performs job scheduling to make sure that the jobs are scheduled in the right place.
- Suppose a client machine wants to run a query or fetch some code for data analysis. The job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management. Each node in the cluster runs its own node manager.
- These node managers manage the nodes and monitor resource usage on them. Containers hold a collection of physical resources, such as RAM, CPU, or hard drives. Whenever a job request comes in, the application master asks for containers, the node managers launch the allocated containers on their nodes, and status is reported back to the Resource Manager.
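As a rough illustration of how a client hands work to YARN, the driver below submits the word-count job from the previous sketch to the Resource Manager; the input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for input and output.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        // Submits the job to the ResourceManager, which launches an ApplicationMaster;
        // the ApplicationMaster then requests containers for the map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```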
5. Features of Hadoop
5.1 Fault-Tolerant
5.2 Highly Available
5.3 Recoverable
5.4 Consistent
5.5 Scalable
5.6 Predictable Performance
5.7 Secure
5.1 Fault Tolerance
- Fault tolerance in Hadoop HDFS refers to how well the system keeps working in unfavorable conditions and how it handles such situations.
- HDFS is highly fault-tolerant. Before Hadoop 3, it handled faults by creating replicas: a copy of users’ data is kept on different machines in the HDFS cluster, so whenever any machine in the cluster goes down, the data is still accessible from another machine holding the same copy.
- HDFS also maintains the replication factor by creating a replica of data on other available machines in the cluster if suddenly one machine fails.
- Hadoop 3 introduced Erasure Coding to provide Fault Tolerance. Erasure Coding in HDFS improves storage efficiency while providing the same level of fault tolerance and data durability as traditional replication-based HDFS deployment.
5.1.1 How is HDFS Fault Tolerance achieved?
- Prior to Hadoop 3, the Hadoop Distributed File System achieved Fault Tolerance through the replication mechanism. Hadoop 3 introduced Erasure Coding to achieve Fault Tolerance with less storage overhead.
- Let us see both ways for achieving Fault-Tolerance in Hadoop HDFS.
5.1.1.1 Replication Mechanism
- Before Hadoop 3, fault tolerance in Hadoop HDFS was achieved by creating replicas. HDFS creates replicas of each data block and stores them on multiple machines (Data Nodes).
- The number of replicas created depends on the replication factor (by default 3).
- If any of the machines fails, the data block is accessible from the other machine containing the same copy of data. Hence there is no data loss due to replicas stored on different machines.
5.1.1.2 Erasure Coding
- Erasure coding is a method used for fault tolerance that durably stores data with significant space savings compared to replication.
- RAID (Redundant Array of Independent Disks) uses Erasure Coding. Erasure coding works by striping the file into small units and storing them on various disks.
- For each stripe of the original dataset, a certain number of parity cells are calculated and stored. If any of the machines fails, the block can be recovered from the remaining data and parity cells. Erasure coding reduces the storage overhead to roughly 50%, compared with 200% for three-way replication.
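As a quick worked example of that figure, assuming the commonly used RS(6,3) policy (6 data cells plus 3 parity cells per stripe), the snippet below compares the extra storage required by three-way replication with that required by erasure coding.

```java
public class StorageOverhead {
    public static void main(String[] args) {
        int dataBlocks = 6;

        // Three-way replication: each data block is stored 3 times.
        int replicatedCells = dataBlocks * 3;                                         // 18 cells stored
        double replicationOverhead = (replicatedCells - dataBlocks) / (double) dataBlocks; // 2.0 -> 200%

        // Reed-Solomon RS(6,3): 6 data cells + 3 parity cells per stripe.
        int parityCells = 3;
        int ecCells = dataBlocks + parityCells;                                       // 9 cells stored
        double ecOverhead = parityCells / (double) dataBlocks;                        // 0.5 -> 50%

        System.out.printf("Replication: %d cells stored, overhead = %.0f%%%n",
                replicatedCells, replicationOverhead * 100);
        System.out.printf("RS(6,3):     %d cells stored, overhead = %.0f%%%n",
                ecCells, ecOverhead * 100);
    }
}
```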
Example of HDFS Fault Tolerance
- Suppose the user stores a file XYZ. HDFS breaks this file into blocks, say A, B, and C. Let’s assume there are four Data Nodes, say D1, D2, D3, and D4. HDFS creates replicas of each block and stores them on different nodes to achieve fault tolerance. For each original block, there will be two replicas stored on different nodes (replication factor 3).
- Let the block A be stored on Data Nodes D1, D2, and D4, block B stored on Data Nodes D2, D3, and D4, and block C stored on Data Nodes D1, D2, and D3.
- If Data Node D1 fails, the blocks A and C present in D1 are still available to the user from Data Nodes (D2, D4 for A) and (D2, D3 for C).
- Hence even in unfavorable conditions, there is no data loss.
Reference: https://data-flair.training/blogs/learn-hadoop-hdfs-fault-tolerance/
5.2 Highly Available
- High Availability was a new feature added to Hadoop 2.x to solve the Single point of failure problem in the older versions of Hadoop.
- Hadoop HDFS follows a master-slave architecture in which the Name Node is the master node and maintains the file system tree, so HDFS cannot be used without the Name Node. This makes the Name Node a single point of failure and a bottleneck. The HDFS high availability feature addresses this issue.
5.2.1 What is high availability?
High availability refers to the availability of system or data in the wake of component failure in the system.
- The high availability feature in Hadoop ensures the availability of the Hadoop cluster without any downtime, even in unfavorable conditions like Name Node failure, Data Node failure, machine crash, etc.
- It means if the machine crashes, data will be accessible from another path.
5.2.2 How does Hadoop HDFS achieve High Availability?
- As we know, HDFS (Hadoop Distributed File System) is the distributed file system in Hadoop. HDFS stores users’ data in files, and internally the files are split into fixed-size blocks. These blocks are stored on Data Nodes. The Name Node is the master node that stores the metadata about the file system, i.e. block locations, the blocks belonging to each file, and so on.
1. Availability if Data Node fails
- In HDFS, replicas of files are stored on different nodes.
- Data Nodes in HDFS continuously send heartbeat messages to the Name Node, every 3 seconds by default.
- If Name Node does not receive a heartbeat from Data Node within a specified time (10 minutes by default), the Name Node considers the Data Node to be dead.
- Name Node then checks for the data in Data Node and initiates data replication. Name Node instructs the Data Nodes containing a copy of that data to replicate that data on other Data Nodes.
- Whenever a user requests access to their data, the Name Node provides the IP of the closest Data Node containing it. If that Data Node fails, the Name Node redirects the user to another Data Node containing a copy of the same data, so the read is served without any downtime. Thus the cluster remains available to the user even if any of the Data Nodes fails.
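A sketch of the settings behind those intervals is shown below. The property names dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval are the standard HDFS keys; the values shown are the commonly documented defaults, and the dead-node rule of thumb (2 × recheck interval + 10 × heartbeat interval, roughly 10.5 minutes) may vary by release.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        // hdfs-site.xml settings that govern heartbeat behaviour, expressed through
        // the Configuration API. Check your release's defaults before relying on them.
        Configuration conf = new Configuration();
        conf.setLong("dfs.heartbeat.interval", 3);                       // seconds between heartbeats
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000); // ms, 5 minutes

        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        long recheckMillis = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000);

        // Widely cited rule of thumb for when a Data Node is marked dead:
        // 2 * recheck interval + 10 * heartbeat interval (~630 seconds with these values).
        long deadNodeMillis = 2 * recheckMillis + 10 * heartbeatSeconds * 1000;
        System.out.println("Data Node considered dead after ~" + deadNodeMillis / 1000 + " seconds");
    }
}
```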
2. Availability if Name Node fails
- Name Node is the only node that knows the list of files and directories in a Hadoop cluster. “The file system cannot be used without Name Node”.
- The High Availability feature added in Hadoop 2 provides fast failover for the Hadoop cluster. A Hadoop HA cluster consists of two Name Nodes (or more after Hadoop 3) running in an active/passive configuration with a hot standby. If the active node fails, the passive node becomes the active Name Node, takes over its responsibilities, and serves the client requests.
- This allows fast failover to the standby machine even if the active machine crashes.
- Thus, data is available and accessible to the user even if the Name Node itself goes down.
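A hedged sketch of the client-side configuration for such an HA pair follows. The nameservice mycluster, the Name Node IDs nn1/nn2, the host names, and the file path are placeholders; the property names are the standard HDFS HA keys.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical nameservice and its two Name Nodes (names and hosts are hypothetical).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Proxy provider that fails client calls over from the active to the standby Name Node.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // On the cluster side, dfs.ha.fencing.methods must also be configured so that
        // only one Name Node remains active (see the split-brain discussion later).
        conf.set("fs.defaultFS", "hdfs://mycluster"); // clients address the logical nameservice

        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("Exists: " + fs.exists(new Path("/data/sample.txt")));
        }
    }
}
```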
Reference: https://data-flair.training/blogs/hadoop-high-availability-tutorial/
5.3 Recovery
- Many people don’t consider backups, since Hadoop has 3X replication by default. Also, Hadoop is often a repository for data that resides in existing data warehouses or transactional systems, so the data can be reloaded. That is not the only case anymore! Social media data, ML models, logs, third-party feeds, open APIs, IoT data, and other sources of data may not be reloadable, easily available, or in the enterprise at all. So this is now critical single-source data that must be backed up and stored forever.
- There are a lot of tools in the open-source space that allow you to handle most of your backup, recovery, replication, and disaster recovery needs. There are also some other enterprise hardware and software options.
Some Options
- Replication and mirroring with Apache Falcon.
- Dual ingest or replication via HDF.
- WANdisco.
- DistCP.
- In-memory WAN replication via memory grids (Gemfire, GridGain, Redis, etc.).
- HBase Replication.
- Hive Replication.
- Apache Storm, Spark, and Flink custom jobs to keep clusters in sync.
A recovery plan is a set of well-defined processes or procedures that need to be executed so that the effects of a disaster are minimized and the organization can either maintain or quickly resume mission-critical operations.
Disasters come in several forms, and recovery needs to be planned for each accordingly:
- Catastrophic failure at the data center level, requiring fail over to a backup location
- Needing to restore a previous copy of your data due to user error or accidental deletion
- The ability to restore a point-in-time copy of your data for auditing purposes
RTO/RPO Drill Down
RTO, or Recovery Time Objective, is the target time you set for the recovery of your IT and business activities after a disaster has struck. The goal here is to calculate how quickly you need to recover, which can then dictate the type of preparations you need to implement and the overall budget you should assign to business continuity.
RPO, or Recovery Point Objective, is focused on data and your company’s loss tolerance in relation to your data. RPO is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
The major difference between these two metrics is their purpose. The RTO is usually large scale, and looks at your whole business and systems involved. RPO focuses just on data and your company’s overall resilience to the loss of it.
5.4 Consistent
5.4.1 Issues in maintaining consistency Of HDFS HA cluster:
There are two issues in maintaining the consistency of the HDFS high availability cluster. They are:
- The active node and the passive node should always be in sync with each other and must have the same metadata. This allows us to restore the Hadoop cluster to the same namespace where it crashed.
- Only one Name Node in the same cluster must be active at a time. If two Name Nodes are active at the same time, the cluster effectively splits into two smaller clusters, each believing it is the only active one. This is known as the “split-brain scenario,” and it leads to data loss or other incorrect results. Fencing is the process that ensures only one Name Node is active at a time.
5.4.2 Problem
Although Apache Hadoop has support for using Amazon Simple Storage Service (S3) as a Hadoop file system, S3 behaves differently from HDFS. One of the key differences is the level of consistency provided by the underlying file system. Unlike HDFS, S3 is an eventually consistent file system, which means that changes made to files on S3 may not be visible for some period of time.
Many Hadoop components, however, depend on HDFS consistency for correctness. While S3 usually appears to “work” with Hadoop, there are a number of failures that do sometimes occur due to inconsistency:
- FileNotFoundExceptions. Processes that write data to a directory and then list that directory may fail when the data they wrote is not visible in the listing. This is a big problem with Spark, for example.
- Flaky test runs that “usually” work. For example, our root directory integration tests for Hadoop’s S3A connector occasionally fail due to eventual consistency. This is due to assertions about the directory contents failing. These failures occur more frequently when we run tests in parallel, increasing stress on the S3 service and making delayed visibility more common.
- Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output of previous jobs may silently omit some data. This omission happens when a job chooses which files to consume based on a directory listing, which may not include recently-written items.
5.4.3 Solution
To address these issues caused by S3’s eventual consistency, we worked with the Apache Hadoop community to create a new feature, called S3 Guard. S3 Guard is being developed as part of the open source Apache Hadoop S3 client, S3A. S3 Guard works by logging all metadata changes made to S3 to an external, consistent storage system, called a Metadata Store. Since the Metadata Store is consistent, S3A can use it to fill in missing state that may not be visible yet in S3.
We currently support a DynamoDB-based Metadata Store for use in production. We also have written an in-memory Metadata Store for testing. We designed the system to allow adding additional back end databases in the future.
The initial version of S3 Guard fixed the most common case of inconsistency problems: List after create. If a Hadoop S3A client creates or moves a file, and then a client lists its directory, that file is now guaranteed to be included in the listing. This version of S3 Guard shipped in CDH 5.11.0. The next feature we added was delete tracking. Delete tracking ensures that, after S3A deletes a file, subsequent requests for that file metadata, either directly, or by listing the parent directory, will reflect the deletion of the file. We have also designed S3 Guard with performance in mind, and have seen speedups on certain workloads. Keep an eye out for more performance enhancements in the future.
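A minimal sketch of enabling the DynamoDB-backed Metadata Store through the S3A configuration is shown below; the bucket, table name, and region are placeholders, the code assumes the hadoop-aws module and AWS credentials are available, and these property names apply to the Hadoop releases that shipped S3 Guard.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3GuardConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Route S3A metadata operations through the DynamoDB-backed Metadata Store.
        conf.set("fs.s3a.metadatastore.impl",
                "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");
        conf.set("fs.s3a.s3guard.ddb.table", "my-s3guard-table"); // hypothetical table name
        conf.set("fs.s3a.s3guard.ddb.region", "us-east-1");       // hypothetical region
        conf.setBoolean("fs.s3a.s3guard.ddb.table.create", true); // create the table if missing

        // Listings of this bucket now consult the consistent Metadata Store as well as S3.
        try (FileSystem fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/output/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```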
Reference: https://blog.cloudera.com/introducing-s3guard-s3-consistency-for-apache-hadoop/
5.5 Scalable
Another primary benefit of Hadoop is its scalability. Rather than capping your data throughput with the capacity of a single enterprise server, Hadoop allows for the distributed processing of large data sets across clusters of computers, thereby removing the data ceiling by taking advantage of a “divide and conquer” method among multiple pieces of commodity hardware.
While this architecture was the beginning of the data scalability revolution, it is by no means the end. Within the Hadoop platform there are three further considerations regarding scalability.
5.6 Predictable Performance
One of the main reasons for moving away from a traditional database solution for managing data is to increase raw performance and gain the ability to scale. It may come as a surprise to you to know that not all Hadoop distributions are created equal in this regard.
Benchmarks comparing the MapR M7 Edition with another Hadoop distribution show that the difference in latency, and thus performance, between distributions can be staggering.
The need for high performance increases even further when you consider the real-time applications of Hadoop, such as that of financial security systems.
Thanks to technologies like Hadoop, financial criminals are finding it increasingly difficult to steal digital assets. Financial services firms like Zions Bank are now able to stop fraudulent financial threats before any real impact is felt by banking customers. Dependability and high performance are essential features for analyzing and reacting to real-time data in order to prevent destructive fraudulent activity.
- Speed of processing — tremendous parallel processing to reduce processing time
5.7 Security
Big data is the collection and analysis of large data sets that hold much intelligence and raw information based on user data, sensor data, medical data, and enterprise data. The Hadoop platform is used to store, manage, and distribute Big data across several server nodes. This section looks at Big data issues and focuses on the security issues that arise in the Hadoop Architecture base layer, the Hadoop Distributed File System (HDFS). HDFS security is enhanced using three approaches: Kerberos, the Bull Eye algorithm, and the Name node approach.
Security Issues
Managing a large data set in a secure manner poses several challenges: inefficient tools, public and private databases that contain many threats and vulnerabilities, voluntary and unexpected leakage of data, and a lack of public and private policy, all of which allow hackers to collect resources whenever required. In distributed programming frameworks, security issues begin when massive amounts of private data are stored in a database that is not encrypted or is kept in a regular format. Securing the data in the presence of untrusted parties is difficult, and when moving from homogeneous to heterogeneous data, the tools and technologies for massive data sets are often not developed with strong security and policy certificates.
Sometimes data hackers and system hackers collect a publicly available big data set, copy it, and store it on devices such as USB drives, hard disks, or laptops. They attack the data storage with attacks such as Denial of Service [14], spoofing attacks, and brute-force attacks [15]. If an unknown user learns the key-value pairs of the data, they can collect at least some of the information. As data storage grows from a single tier to multiple storage tiers, the security tier must grow with it. To reduce these issues, cryptographic frameworks and robust algorithms must be developed to enhance the security of data in the future. Similarly, tools such as Hadoop and NoSQL technology can be used for big data storage.
5.7.1 Security Issues in HDFS
HDFS, the base layer of the Hadoop architecture, contains different classifications of data and is therefore sensitive to security issues. It has no appropriate role-based access control for addressing security problems. There is also a risk of unauthorized data access, theft, and unwanted disclosure when data is embedded in a single Hadoop environment. The replicated data is likewise not secure and needs additional protection from breaches and vulnerabilities. Government agencies and other organizations often avoid using a Hadoop environment to store valuable data because of these security concerns, and instead provide security outside of the Hadoop environment, for example with firewalls and intrusion detection systems. Some authors have proposed protecting HDFS against theft and vulnerabilities by encrypting block levels and individual file systems in the Hadoop environment. Other authors have encrypted the blocks and nodes using encryption techniques, but no complete algorithm has been specified to maintain security in the Hadoop environment. To increase security, some approaches are described below.
5.7.2 HDFS Security Approaches
This section describes different approaches for securing data in the Hadoop Distributed File System. The first approach is based on Kerberos in HDFS.
a. Kerberos Mechanism
Kerberos [10] is a network authentication protocol that allows nodes to transfer files over a non-secure channel by using a ticket to prove their identity to one another. This Kerberos mechanism is used to enhance security in HDFS. In HDFS, the connection between the client and the Name node is made using Remote Procedure Calls (RPC) [11], while the connection from the client (over HTTP) to the Data node uses block transfer. A token or Kerberos credential is used to authenticate an RPC connection; if the client needs to obtain a token, it uses a Kerberos-authenticated connection. A Ticket Granting Ticket (TGT) or Service Ticket (ST) is used to authenticate to the Name node via Kerberos. Both the TGT and the ST can be renewed during long-running jobs; when the Kerberos credential is renewed, a new TGT and ST are issued and distributed to all tasks. The Key Distribution Center (KDC) issues the Kerberos Service Ticket using the TGT after receiving a request from a task, and network traffic to the KDC is avoided by using tokens. On the Name node, only the validity period is extended while the ticket itself remains constant; the major advantage is that even if a ticket is stolen by an attacker, it cannot be renewed.
Another method can be used to secure file access in HDFS. If a client wants to access a block on a Data node, it must first contact the Name node to identify which Data nodes hold the blocks of the file. Because only the Name node authorizes access against file permissions, it issues a token called a Block Token, which the Data node verifies. The Data node also issues a token called a Name Token, which allows the Name node to enforce permissions and correct access control on its data blocks. The Block Token lets the Data node check whether the client is authorized to access its data blocks. The Block Token and Name Token are sent back to the client together with the locations of the data blocks it is authorized to access. These two mechanisms increase security by preventing unauthorized clients from reading or writing data blocks.
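A minimal sketch of how a Java client might authenticate to a Kerberos-secured cluster using Hadoop's UserGroupInformation API follows; the principal, keytab path, and file path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Tell the Hadoop client that the cluster requires Kerberos authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");

        UserGroupInformation.setConfiguration(conf);

        // Principal and keytab path are hypothetical; in a real deployment they
        // come from your KDC administrator.
        UserGroupInformation.loginUserFromKeytab(
                "hdfsuser@EXAMPLE.COM", "/etc/security/keytabs/hdfsuser.keytab");

        // Subsequent RPC calls to the Name Node carry the Kerberos/delegation-token credentials.
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println(fs.exists(new Path("/secure/data")));
        }
    }
}
```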
b. Bull Eye Algorithm Approach
In big data, sensitive data such as credit card numbers, passwords, account numbers, and personal details are stored in a large platform such as Hadoop. To increase security in the Hadoop base layer, a new approach called the “Bull Eye Approach” has been introduced for securing sensitive information. This approach is applied to the Hadoop module to view all sensitive information in 360°, to find out whether all secured information is stored without any risk, and to allow only authorized persons to maintain personal information in the right way. Recently this approach has been used by companies such as Dataguise’s DgSecure [8] and Amazon Elastic MapReduce [9]. Dataguise, which is known for providing data-centric security and governance solutions, also provides security for Hadoop in the cloud, wherever it is located. Companies now store more sensitive data in the cloud because more breaches are taking place in traditional on-premise data stores.
To increase security in the Hadoop base layers, the Bull Eye Approach is also used in HDFS to provide security in 360° from node to node. The approach is implemented in the Data node of rack 1, where it checks that sensitive data is stored properly in the blocks without any risk and allows only the particular client to store data in the required blocks. It also bridges the gap between the data held on the original Data node and the replicated Data node. When a client wants to retrieve data from the replicated Data nodes, this is also managed by the Bull Eye Approach, which checks that there is a proper relationship between the two racks. The algorithm makes the Data nodes more secure, so that only authorized persons can read or write them. It can be implemented below the Data node, where the client reads or writes the data to store in blocks, and it is implemented not only in rack 1 but also in rack 2 in order to increase the security of the blocks inside the Data nodes in 360°. It checks for any attacks, breaches, or theft of data taking place in the blocks of the Data nodes. Sometimes data is encrypted for protection in the Data node; such encrypted data is also protected by this algorithm to maintain security. The algorithm covers everything from less than a terabyte to multiple petabytes of semi-structured, structured, and unstructured data stored in the HDFS layer, from all angles. Most encryption and wrapping of data occurs at the block level of Hadoop rather than at the level of entire files. The algorithm scans data before it is allowed to enter the blocks and again after it enters both rack 1 and rack 2. Thus, the algorithm concentrates only on the sensitive data that matters in the information stored in the Data nodes. In this work, this new algorithm is described to further enhance security in the Data nodes of HDFS.
c. Name node Approach
In HDFS, if there is any problem with the Name node and it becomes unavailable, the group of system services and the data stored in HDFS become unavailable, and it is not easy to access the data securely in this critical situation. To improve the security of data availability, two Name nodes are used. These two Name node servers run in the same cluster. The two redundant Name nodes are provided by Name Node Security Enhance (NNSE), which incorporates the Bull Eye algorithm and allows the Hadoop administrator to manage the two nodes. Of these Name nodes, one acts as the master and the other as a slave, in order to reduce unnecessary or unexpected server crashes and to guard against natural disasters. If the master Name node crashes, the administrator must request permission from NNSE to serve data from the slave node, in order to cover the time lag and data unavailability in a secure manner. Without permission from NNSE, the administrator never retrieves data from the slave node, which reduces complex retrieval issues. If both Name nodes act as masters, there is a continuous risk of reduced secure data availability and a performance bottleneck over a local or wide area network. In the future, security can be further increased with a vital configuration that ensures data is available to clients in a secured way by replicating many Name nodes via Name Node Security Enhance in HDFS blocks between many data centers and clusters.
6. So What is Hadoop? It is a framework made up of:
- HDFS — Hadoop distributed file system
- Distributed computation tier using programming of MapReduce
- Sits on low-cost commodity servers connected together, called a cluster
- Consists of a Master Node or Name Node to control the processing
- Data Nodes to store & process the data
- Job Tracker & Task Tracker to manage & monitor the jobs
7. Let us see why Hadoop has become so popular today.
- Over the last decade, data computations were handled by increasing the computing power of a single machine, adding processors & RAM, but this approach ran into physical limits.
- As data grew beyond these capabilities, an alternative was required to handle the storage requirements of organizations like eBay (10 PB), Facebook (30 PB), Yahoo (170 PB), and JPMC (150 PB)
- With a typical disk transfer rate of 75 MB/sec, it was impractical to process such huge data sets on a single machine
- Scalability was limited by physical size, and there was little or no fault tolerance
- Additionally, organizations now need to analyze data in various formats, which is not possible with traditional databases