Ibis is an open-source data analysis framework that enables a 100% Python user workflow on top of big data systems. It is a new Apache-licensed project from Cloudera.
Although Python is the open-source language of choice for many data scientists, it is usually limited to local data processing and the analysis of smaller data sets. Ibis aims to remove this limitation. With Ibis, data scientists and data engineers can work with big data in Python as productively as they already do with small and medium data.
According to the Ibis developers, Python on its own cannot scale to the level of a software framework such as Hadoop. The goal of Ibis is therefore to ensure a first-class Python experience on large, scalable architectures, with full access to the variety of Python tools.
Ibis targets Impala, the leading massively parallel processing (MPP) database engine for Hadoop. This open-source, interactive SQL-on-Hadoop engine gives Ibis the scalability and interactive performance necessary for big data. Ibis supports Impala’s built-in analytic capabilities and exposes Impala through a Python API that simplifies data warehousing, data wrangling, and data analysis. Integration with Impala is meant to eliminate serialization and other interface bottlenecks, enabling high-performance Python at massive scale.
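To make the idea of exposing a SQL engine through a Python API more concrete, here is a minimal sketch of a deferred expression that is built in Python and compiled to SQL for the engine to execute. It is a toy illustration, not Ibis code: the `TableExpr` class, its methods, and the `flights` table are all hypothetical, and a real expression system is far richer.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class TableExpr:
    """Hypothetical deferred table expression; nothing runs until compile()."""

    name: str
    predicates: Tuple[str, ...] = ()
    group_keys: Tuple[str, ...] = ()
    aggregates: Tuple[str, ...] = ()
    row_limit: Optional[int] = None

    def filter(self, predicate: str) -> "TableExpr":
        # Each method returns a new expression; the original is left unchanged.
        return replace(self, predicates=self.predicates + (predicate,))

    def group_by(self, *keys: str) -> "TableExpr":
        return replace(self, group_keys=self.group_keys + keys)

    def aggregate(self, *exprs: str) -> "TableExpr":
        return replace(self, aggregates=self.aggregates + exprs)

    def limit(self, n: int) -> "TableExpr":
        return replace(self, row_limit=n)

    def compile(self) -> str:
        """Render the accumulated operations as a single SQL string."""
        select_list = ", ".join(self.group_keys + self.aggregates) or "*"
        sql = f"SELECT {select_list} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


# Build the query in Python; only the SQL text would be sent to the engine.
expr = (
    TableExpr("flights")  # hypothetical table name
    .filter("year = 2015")
    .group_by("carrier")
    .aggregate("COUNT(*) AS n_flights")
    .limit(10)
)
sql = expr.compile()
print(sql)
```

The point of the design is that the data never has to leave the engine while the query is being composed: Python only describes the computation, and the heavy lifting happens where the data lives.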
Current Ibis features:
- Comprehensive support of Impala functionality.
- Interoperability with pandas.
- Tools that simplify interactions with HDFS.
- A pandas-like, semantically complete data expression system that covers relational concepts such as self-joins, window functions, and correlated and uncorrelated subqueries.
- High-level analytics tools such as bucketing, top-k, histogram, and value_counts.
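For readers unfamiliar with the high-level analytics tools named in the list, the following plain-Python sketch shows what each operation computes on a small in-memory sample. These helper functions are illustrative assumptions written for this article; they are not the Ibis API, which would push the equivalent work down to the execution engine.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def value_counts(values: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Count how many times each distinct value occurs."""
    return dict(Counter(values))


def top_k(values: Sequence[Hashable], k: int) -> List[Tuple[Hashable, int]]:
    """Return the k most frequent values with their counts."""
    return Counter(values).most_common(k)


def bucket(value: float, edges: Sequence[float]) -> Optional[int]:
    """Return the index of the half-open bin [edges[i], edges[i+1]) holding value.

    Values outside the outermost edges map to None.
    """
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


def histogram(values: Sequence[float], nbins: int) -> List[int]:
    """Count values in nbins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * nbins
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample data.
carriers = ["AA", "DL", "AA", "UA", "AA", "DL"]
delays = [1, 2, 2, 3, 9, 10]

print(value_counts(carriers))
print(top_k(carriers, 2))
print([bucket(d, [0, 5, 10, 15]) for d in delays])
print(histogram(delays, 3))
```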
The Ibis project is still a work in progress, but users are free to try it out and see whether, with Ibis, Python can become a true first-class language for Apache Hadoop.
According to the Ibis vision and roadmap, upcoming versions may encompass backend systems other than Impala and will allow users to leverage the full range of Python libraries. Other goals include support for Impala’s forthcoming complex types (lists, maps, and structs) as first-class value types and a fast Python API for a canonical in-memory columnar data format. Once it becomes possible to run interpreted Python user-defined functions on Impala nodes and to perform computations directly on columnar data in shared memory, without any need for deserialization, users will be able to leverage the existing Python data ecosystem at its full performance and scale.
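The benefit of computing directly on columnar data without deserialization can be illustrated with the standard library alone. The sketch below is only an analogy under stated assumptions: a `bytearray` stands in for a shared-memory region, the column is a contiguous run of float64 values, and the function names are made up for this example. It does not reflect how Ibis or Impala implement the roadmap item.

```python
from array import array

# A float64 column stored contiguously, the way a columnar format lays it out.
# The bytearray is a stand-in for a region of shared memory.
column_bytes = bytearray(array("d", [1.5, 2.5, 4.0]).tobytes())


def column_sum(buffer: bytearray) -> float:
    """Sum the column by viewing the raw bytes as doubles, without copying them."""
    with memoryview(buffer).cast("d") as column:
        return sum(column)


def scale_in_place(buffer: bytearray, factor: float) -> None:
    """Multiply every value in the column by factor, writing through the view."""
    with memoryview(buffer).cast("d") as column:
        for i in range(len(column)):
            column[i] *= factor


total = column_sum(column_bytes)
scale_in_place(column_bytes, 2.0)
scaled_total = column_sum(column_bytes)
print(total, scaled_total)
```

Because the view reinterprets the existing bytes instead of unpacking them into new Python objects up front, the writer and the reader can share one copy of the data, which is the property the roadmap is after.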
Ibis is a Python big data framework that offers 100% Python end-to-end user workflows and scalability for big data. It integrates with Impala, giving users access to a powerful execution engine. With advanced data analysis and seamless analytical access to big data, Ibis lets users focus on the real-world, practical applications of data science. More details are available at ibis-project.org.