Impala

What is Impala

  • Impala is a distributed, massively parallel processing (MPP) SQL database engine.
  • Developed by Cloudera.
  • Based on Google's 2010 Dremel paper.
  • Provides direct access to HDFS/HBase data through its own engine.
  • Hive and Impala are similar. Both are meant for “SQL on Hadoop”.
  • Any table created in Impala can be accessed from both Hive and Impala (see the example after this list).
  • Impala SQL is a subset of Hive Query Language.
  • It provides high performance and low latency compared to other SQL engines for Hadoop.
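  • For example (a minimal sketch; the table name page_views is made up), a table created through Impala is registered in the shared Hive metastore, so the same statements also work from Hive:

CREATE TABLE page_views (url STRING, hits INT);

SELECT url, SUM(hits) AS total_hits FROM page_views GROUP BY url;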

Why Impala when we have Hive

  • Hive queries incur the overhead of starting MapReduce jobs:
    • Job setup and creation
    • Starting JVMs
    • Slot assignment
    • Input split creation
    • Map task generation
  • Hive therefore inherits the latency of MapReduce; the higher latency and poorer performance mean Hive cannot be used interactively.
  • Impala is not built on MapReduce.
  • It has its own execution engine that accepts SQL queries.
  • It queries HDFS and HBase directly, unlike Hive which starts MapReduce jobs and then queries the data.

Impala Architecture Explanation

    Impala Daemon

  • It is the core component of Impala.
  • An impalad daemon runs on each node of the cluster.

    State store

  • It is responsible for checking the health of each impalad daemon.
  • Only one statestore process needs to run on any node in the cluster.
  • If an Impala node goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other nodes so that future queries can avoid making requests to the unreachable node.

    Catalogd

  • It is responsible for propagating metadata changes to all the Impala daemons.
  • Impala daemons cache metadata.
  • In Impala versions prior to 1.2, metadata updates made through one Impala daemon were not automatically transmitted to the other nodes, so the cached metadata had to be refreshed manually:

INVALIDATE METADATA;   -- marks the metadata for all tables as stale so it is reloaded on next access

REFRESH <table>;       -- reloads the file and block metadata for a single table

  • Only one Catalog process is required per cluster.
  • You can run Statestore and Catalog services on the same node.
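  • A minimal usage sketch (the table name web_logs is made up): after a table is created or data is loaded outside Impala, for example through Hive, the cached metadata must be updated before Impala can see the change.

INVALIDATE METADATA web_logs;   -- pick up a table created or altered outside Impala

REFRESH web_logs;               -- pick up new data files appended to an existing table

SELECT COUNT(*) FROM web_logs;  -- the new data is now visible to queries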

    Planner

  • Parsing the query
  • Creating a Directed Acyclic Graph (DAG) of operators such as SELECT, WHERE, GROUP BY, etc.
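  • As a quick illustration (the table and column names below are hypothetical), the EXPLAIN statement prints the plan the planner builds for a query without actually running it:

EXPLAIN SELECT customer_id, COUNT(*) FROM orders WHERE order_date >= '2020-01-01' GROUP BY customer_id;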

    Coordinator

  • Coordinating execution of the entire query
  • Sends requests to various executors to read and process data.
  • Receives data back and streams it to the client.

    Executor

  • Responsible for reading data from HDFS/HBase and/or performing aggregations on data.

Impala Use cases

  • Impala is well-suited to executing SQL queries for interactive exploratory analytics on large data sets.
  • Hive and MapReduce are appropriate for very long running, batch-oriented tasks such as ETL.
  • Ad hoc analytics
  • Data Warehouse solutions

Why is Impala faster than Hive?

  • Impala performs most of its operations in memory and does not write intermediate results to disk.
  • Impala supports modern file formats such as Parquet, a columnar format.
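  • As a hedged sketch (the table and column names, including the source table sales_text, are made up), a Parquet-backed table can be created and populated directly from Impala:

CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE, sale_date STRING) STORED AS PARQUET;

INSERT INTO sales_parquet SELECT id, amount, sale_date FROM sales_text;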

Limitations of Impala

  • For complex queries, if the working set exceeds the available memory (e.g., more than 128 GB of RAM), the query terminates abruptly and no partially processed data is returned.
  • Impala does not support custom serialization and deserialization (SerDes).
  • Impala cannot read custom binary file formats; data must be stored in one of the file formats Impala supports.
  • Whenever new records/files are added to the data directory in HDFS, the table needs to be refreshed.
  • Impala also does not support XML and JSON functions.
  • Impala does not support multiple DISTINCT clauses per query, although there are workarounds for this limitation (see the sketch after this list).
  • Unlike Hive, Impala does not support NESTED SELECT statements in the WHERE clause.
  • Currently, Impala doesn't support fault tolerance within a query. If a node fails in the middle of processing, the whole query has to be re-run.
  • Impala is not a good fit for tasks that require heavy data operations, such as large joins, because everything cannot fit in memory.
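  • A commonly used workaround for the multiple-DISTINCT limitation (a sketch; the sales table and its columns are hypothetical) is to keep one exact COUNT(DISTINCT) and replace the others with NDV(), which returns an approximate distinct count:

-- Rejected: two different COUNT(DISTINCT) expressions in one query
-- SELECT COUNT(DISTINCT customer_id), COUNT(DISTINCT product_id) FROM sales;

-- Accepted: one exact DISTINCT, the other approximated with NDV()
SELECT COUNT(DISTINCT customer_id), NDV(product_id) FROM sales;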
