Apache Hadoop core components

HDFS Key Features

HDFS is a fault-tolerant and self-healing distributed filesystem designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility, and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100PB and beyond.

Scalability:

HDFS is designed for massive scalability, so you can store very large volumes of data in a single platform. As your data needs grow, you can add more servers to scale storage capacity linearly with your business.

Flexibility:

Store data of any type — structured, semi-structured, unstructured — without any upfront modeling. Flexible storage means you always have access to full-fidelity data for a wide range of analytics and use cases.

Reliability:

Automatic, tunable replication means multiple copies of your data are always available for access and protection from data loss. Built-in fault tolerance means servers can fail but your system will remain available for all workloads.
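As a rough illustration of why replication protects against server failure, here is a minimal Python sketch. It is a toy model, not how HDFS actually places blocks: the helper names `place_replicas` and `readable_blocks`, the node names, and the round-robin placement are all assumptions made for this example. The replication factor of 3 mirrors the HDFS default.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes from a rotating offset, so the replicas of
        # one block never land on the same node.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
placement = place_replicas(num_blocks=10, nodes=nodes, replication=3)

# With three replicas per block, any two simultaneous node failures
# still leave at least one live copy of every block.
survivors = readable_blocks(placement, failed={"node2", "node4"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

Raising or lowering `replication` in this sketch shows the trade-off the "tunable" part refers to: more copies tolerate more failures but consume more raw storage.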

MapReduce Key Features

Accessibility:

Supports a wide range of languages for developers, including C++, Java, and Python, as well as high-level languages through Apache Hive and Apache Pig.

Flexibility:

Process any and all data, regardless of type or format — whether structured, semi-structured, or unstructured. Original data remains available even after batch processing for further analytics, all in the same platform.

Reliability:

Built-in job and task trackers allow processes to fail and restart without affecting other processes or workloads. Additional scheduling allows you to prioritize processes based on needs such as SLAs.

Scalability:

MapReduce is designed to match the massive scale of HDFS and Hadoop, so you can process unlimited amounts of data, fast, all within the same platform where it’s stored.

While MapReduce continues to be a popular batch-processing tool, Apache Spark’s flexibility and in-memory performance make it a much more powerful batch execution engine. Cloudera has been working with the community to bring the frameworks currently running on MapReduce onto Spark for faster, more robust processing.

MapReduce is designed to process unlimited amounts of data of any type that’s stored in HDFS by dividing workloads into multiple tasks across servers that are run in parallel.
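The idea of dividing a workload into parallel tasks can be sketched in a few lines of Python. This is a single-machine illustration of the map, shuffle, and reduce phases using a word count, not the Hadoop MapReduce API: the function names are hypothetical, and threads stand in for the separate servers a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce phase: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines, num_splits=2):
    # Divide the input into splits, one per map task.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    # Run the map tasks in parallel; threads stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        mapped = list(pool.map(map_task, splits))
    return dict(reduce_task(k, v) for k, v in shuffle(mapped).items())


lines = [
    "Hadoop stores data",
    "MapReduce processes data",
    "YARN schedules work",
]
counts = run_job(lines)
print(counts["data"])  # prints 2
```

Because each map task sees only its own split and each reduce call sees only one key, the result does not depend on how many splits the input is divided into.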

Learn why Apache Spark™ is the heir to MapReduce

YARN Key Features

YARN provides open source resource management for Hadoop, so you can move beyond batch processing and open up your data to a diverse set of workloads, including interactive SQL, advanced modeling, and real-time streaming.

Read the Engineering blog series: Untangling YARN

Scalability:

YARN is designed to handle scheduling for the massive scale of Hadoop so you can continue to add new and larger workloads, all within the same platform.

Dynamic Multi-tenancy:

Dynamic resource management provided by YARN supports multiple engines and workloads all sharing the same cluster resources. Open up your data to users across the entire business environment through batch, interactive, advanced, or real-time processing, all within the same platform so you can get the most value from your Hadoop platform.

Optimal Workload Management:

Fine-grained configurations allow for better cluster utilization so you can enable workload SLAs for priority workloads and group-based policies across the business. Process more data in more ways, without disrupting your most critical operations.
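One way to picture group-based policies is as weighted shares of cluster resources. The Python sketch below is only a simplified illustration, not the algorithm used by any YARN scheduler: the `allocate` function, the queue names, and the weights are assumptions invented for this example.

```python
def allocate(total_memory_gb, queue_weights):
    """Split cluster memory across queues in proportion to their weights."""
    total_weight = sum(queue_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        queue: total_memory_gb * weight / total_weight
        for queue, weight in queue_weights.items()
    }


# Hypothetical policy: production work gets the largest guaranteed share,
# so ad hoc queries cannot crowd out SLA-bound jobs.
shares = allocate(1000, {"production": 6, "analytics": 3, "adhoc": 1})
print(shares)
```

In this toy policy the production queue receives 600 GB of a 1,000 GB cluster, analytics 300 GB, and ad hoc work 100 GB. A real scheduler would additionally let idle capacity be borrowed by busy queues.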

Integrated across the platform

Core Hadoop, including HDFS, MapReduce, and YARN, is part of the foundation of Cloudera’s platform. All platform components have access to the same data stored in HDFS and participate in shared resource management via YARN. Hadoop, as part of Cloudera’s platform, also benefits from simple deployment and administration (through Cloudera Manager) and shared compliance-ready security and governance (through Apache Sentry and Cloudera Navigator) — all critical for running in production.

Cloudera’s commitment to Hadoop

Cloudera is actively involved in the Hadoop community, including having Doug Cutting, one of the co-founders of Hadoop, as its Chief Architect. As the first company to commercialize Hadoop in 2008, Cloudera has the most experience with the platform, with large-scale production customers across industries, and has the committers on staff to continuously drive innovations and improvements for our customers and the community.

Apache Hadoop project

Learn more about open source and open standards

Expert support for Core Hadoop

Cloudera has Hadoop experts available across the globe ready to deliver world-class support 24/7. With more experience across more customers, for more use cases, Cloudera is the leader in Hadoop support so you can focus on results.

Learn more about Cloudera Support