Cloudera announced the Kudu project at the Strata + Hadoop World conference in New York in September 2015. What makes Kudu important among over 250 active Apache projects? It aims to improve the core foundation of Hadoop – HDFS (Hadoop Distributed File System). Hadoop has been around for over 10 years, with HDFS as its foundational component providing effective data distribution to support scale-out processing, redundancy and failover. Hadoop was inspired by the 2003 Google papers on MapReduce and the Google File System. Among the primary objectives of Hadoop was providing a performant way to process large amounts of data by distributing data and processing across a large number of commodity servers. When Hadoop was first announced in 2005, disk was the main performance bottleneck of any data platform. Magnetic disks are relatively fast at reading and writing contiguous large blocks of data, but random read and write performance is very slow. To overcome this limitation, HDFS was designed to work with large files (gigabytes) and support only read, append and drop operations. Here is a great blog that outlines best practices for file sizing on HDFS. To summarize, HDFS works great by leveraging the distributed power of multiple nodes reading and writing sequential data blocks – on read and bulk load workloads, that is.
What if your data requires updates (a very common scenario for an analytic data platform)? On HDFS-based platforms updates are very expensive – they require dropping and recreating entire multi-gigabyte file partitions. This core inefficiency in HDFS led to workaround approaches such as the lambda architecture, where real-time data is loaded into a fast NoSQL platform like HBase and then later structured into HDFS-optimal file sizes and copied to Hadoop. This approach works well, but at the expense of significantly increased overall system complexity and the associated development and maintenance effort.
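To make the cost concrete, here is a minimal Python sketch of the drop-and-recreate update pattern described above. It uses ordinary local CSV files as a stand-in for HDFS partition files (the file names, columns and helper function are hypothetical, and real HDFS APIs are not involved); the point is that changing one record forces a rewrite of the whole partition.

```python
import csv
import os
import tempfile

def update_record_in_partition(partition_path, key, new_value):
    """Simulate an 'update' on an append-only store: to change one
    record, the entire partition file must be read, rewritten and
    swapped in (drop + recreate)."""
    rows = []
    with open(partition_path, newline="") as f:
        for row in csv.reader(f):
            if row[0] == key:
                row[1] = new_value  # the single record we actually change
            rows.append(row)
    # Every other row is copied untouched into a replacement file,
    # which then atomically replaces the original partition.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(partition_path))
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    os.replace(tmp_path, partition_path)

# Build a tiny "partition", then update one record in it.
part = os.path.join(tempfile.mkdtemp(), "part-00000.csv")
with open(part, "w", newline="") as f:
    csv.writer(f).writerows([["k1", "old"], ["k2", "old"]])

update_record_in_partition(part, "k1", "new")

with open(part, newline="") as f:
    print(list(csv.reader(f)))  # [['k1', 'new'], ['k2', 'old']]
```

With a two-row file this is trivial; with a multi-gigabyte partition, the same one-record change pays the full rewrite cost every time, which is exactly what pushes teams toward a separate fast-write tier.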
How is Kudu aiming to change this? By replacing HDFS, no less. This is a very bold move for Cloudera – HDFS has been around for over 10 years, and it has been the most stable, core, foundational component of Hadoop. The majority of innovation on the Hadoop platform has targeted MapReduce processing speed: improved cluster resource management (YARN), in-memory processing (Spark) and optimized storage formats (Parquet). Cloudera engineers realized that foundational changes to HDFS are required to support a performant, updatable data platform. Cloudera positions Kudu as a new technology that matches or beats HDFS on bulk read performance while also handling read/write workloads ranging from bulk inserts to single-record seeks and updates.
To achieve this, Kudu does not use HDFS storage but relies on its own columnar storage format. This blogpost outlines the technical implementation details. Kudu is integrated with Cloudera's Apache Impala project, one of the leaders among SQL-on-Hadoop database platforms. It brings a very important improvement to an already great platform: UPDATE and DELETE SQL statements, something that has long been standard on all major RDBMS systems. By supporting an efficient data refresh mechanism, it may now be feasible to build a single-tier analytics data platform with significantly reduced complexity.
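The statements in question are ordinary SQL. As a rough sketch of what single-record maintenance looks like once UPDATE and DELETE are available, the snippet below runs them through Python's built-in sqlite3 module as a stand-in engine (the table and column names are made up for illustration; Impala-on-Kudu would be accessed through its own connector, not sqlite3) – the SQL is the point, not the engine.

```python
import sqlite3

# In-memory SQLite stands in for a SQL engine; table and column
# names here are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (host TEXT PRIMARY KEY, cpu_pct REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("web01", 85.0), ("web02", 40.0)],
)

# Single-record update and delete: the operations that are cheap on an
# updatable store but force whole-partition rewrites on HDFS-backed tables.
conn.execute("UPDATE metrics SET cpu_pct = 55.0 WHERE host = 'web01'")
conn.execute("DELETE FROM metrics WHERE host = 'web02'")

print(conn.execute("SELECT host, cpu_pct FROM metrics ORDER BY host").fetchall())
# [('web01', 55.0)]
```

On an HDFS-only stack, each of those two statements would translate into rewriting a partition; on an updatable store they touch individual records, which is what makes the single-tier architecture plausible.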
Image and technical details credit Cloudera.