Impala

What is Impala?

  • Impala server is a distributed, massively parallel processing (MPP) database engine.
  • Developed by Cloudera.
  • Based on Google's 2010 Dremel paper.
  • Direct access to HDFS/HBase data via the Impala engine.
  • Hive and Impala are similar. Both are meant for “SQL on Hadoop”.
  • Any table created in Impala can be accessed from both Hive and Impala, because the two share the same metastore.
  • Impala SQL is a subset of the Hive Query Language (HiveQL).
  • It provides high performance and low latency compared to other SQL engines for Hadoop.

Why Impala when we have Hive?

  • Hive queries incur the overhead of starting MapReduce jobs:
    • Job setup and creation
    • Start JVMs
    • Slot assignment
    • Input split creation
    • Map tasks generation
  • Hive inherits the latency of MapReduce: every query pays for job creation, JVM startup, input split calculation, and so on before any data is read.
  • Because of this latency, Hive is a poor fit for interactive use.
  • Impala is not built on MapReduce.
  • It has its own execution engine that accepts queries in SQL.
  • It queries HDFS and HBase directly, unlike Hive which starts MapReduce jobs and then queries the data.

Impala Architecture Explanation

    Impala Daemon

  • It is the core component.
  • An impalad daemon runs on each node.

    State store

  • It is responsible for checking the health of each impalad
  • Only one statestore process is needed per cluster; it can run on any node.
  • If an Impala node goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other nodes so that future queries can avoid making requests to the unreachable node.
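The health-tracking idea above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not Impala's actual statestore protocol: the class name StateStoreSketch, the timeout value, and the direction of the heartbeats are all hypothetical.

```python
import time


class StateStoreSketch:
    """Toy health tracker: a daemon counts as live if it was seen recently."""

    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_seen = {}     # daemon name -> time of last heartbeat
        self.subscribers = []   # callbacks that receive the current live set

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def heartbeat(self, daemon, now=None):
        self.last_seen[daemon] = time.monotonic() if now is None else now

    def live_daemons(self, now=None):
        now = time.monotonic() if now is None else now
        return {d for d, t in self.last_seen.items() if now - t <= self.timeout_s}

    def broadcast(self, now=None):
        """Tell every subscriber which daemons are currently reachable."""
        live = self.live_daemons(now)
        for callback in self.subscribers:
            callback(live)
        return live


store = StateStoreSketch(timeout_s=10.0)
views = []                       # what a subscribed daemon would receive
store.subscribe(views.append)
store.heartbeat("node1", now=0.0)
store.heartbeat("node2", now=0.0)
store.heartbeat("node1", now=8.0)   # node2 stays silent from here on
store.broadcast(now=12.0)
print(sorted(views[-1]))            # ['node1']
```

After the broadcast, a daemon planning a new query would skip node2 because it no longer appears in the live set.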

    Catalogd

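  • It is responsible for propagating metadata changes to the Impala daemons.
  • Impala daemons cache metadata.
  • In Impala versions prior to 1.2, metadata updates made through one Impala daemon were not automatically transmitted to other nodes, so you had to run one of the following statements on the other nodes: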

INVALIDATE METADATA;

REFRESH <table>;

  • Only one Catalog process is required per cluster.
  • You can run Statestore and Catalog services on the same node.
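To make the caching behaviour concrete, here is a minimal sketch of per-daemon metadata caches fed by a central catalog. It is a toy model with hypothetical names (CatalogSketch, DaemonSketch); the refresh and invalidate_metadata methods only imitate the spirit of REFRESH and INVALIDATE METADATA and say nothing about how Impala implements them.

```python
class CatalogSketch:
    """Toy stand-in for the catalog service: holds the authoritative metadata."""

    def __init__(self):
        self.tables = {}    # table name -> list of data files
        self.daemons = []   # daemons that receive broadcasts

    def update(self, table, files, broadcast=True):
        self.tables[table] = list(files)
        if broadcast:       # broadcast=False imitates a change the daemons never hear about
            for daemon in self.daemons:
                daemon.cache[table] = list(files)


class DaemonSketch:
    """Toy daemon that answers metadata lookups from its local cache."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.cache = {}
        catalog.daemons.append(self)

    def files(self, table):
        if table not in self.cache:   # lazy reload after invalidation
            self.cache[table] = list(self.catalog.tables[table])
        return self.cache[table]

    def refresh(self, table):
        """Reload one table, in the spirit of REFRESH <table>."""
        self.cache[table] = list(self.catalog.tables[table])

    def invalidate_metadata(self):
        """Drop everything cached, in the spirit of INVALIDATE METADATA."""
        self.cache.clear()


catalog = CatalogSketch()
a, b = DaemonSketch(catalog), DaemonSketch(catalog)
catalog.update("sales", ["part-0"])                        # broadcast to a and b
catalog.update("sales", ["part-0", "part-1"], broadcast=False)
stale = list(b.files("sales"))     # b still sees only part-0
b.refresh("sales")
fresh = list(b.files("sales"))     # b now sees both files
a.invalidate_metadata()
reloaded = list(a.files("sales"))  # a reloads on next access
```

The sketch shows why a stale cache gives wrong answers until it is refreshed: the daemon never consults the catalog while it has a cached entry.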

    Planner

  • Parsing the query
  • Creating a directed acyclic graph (DAG) of operators (scans, filters, aggregations, and so on) that correspond to clauses such as SELECT, WHERE, and GROUP BY.
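As a rough illustration of what such a graph looks like, the sketch below models a plan as a dictionary and orders it with Python's standard graphlib module. The operator names and the query they stand for are made up for the example; Impala's real plan nodes are not shown here.

```python
from graphlib import TopologicalSorter

# Each operator maps to the set of operators whose output it consumes.
# Hypothetical plan for: SELECT ... FROM orders JOIN customers ... WHERE ... GROUP BY ...
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "filter_orders": {"scan_orders"},                # WHERE
    "join": {"filter_orders", "scan_customers"},     # JOIN
    "aggregate": {"join"},                           # GROUP BY
    "project": {"aggregate"},                        # SELECT list
}

# A valid execution order runs every operator after all of its inputs.
order = list(TopologicalSorter(plan).static_order())
print(order[-1])   # project
```

Because the graph is acyclic, such an order always exists; graphlib raises CycleError if an operator were made to depend on its own output.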

    Coordinator

  • Coordinating execution of the entire query
  • Sends requests to various executors to read and process data.
  • Receives data back and streams it to the client.

    Executor

  • Responsible for reading data from HDFS/HBase and/or performing aggregations on data.
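The division of labour between coordinator and executors can be sketched as a scatter-gather aggregation. This is a simplified, single-process illustration that uses threads in place of cluster nodes; the function names and the sample data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def execute_fragment(rows):
    """Executor: read local rows and compute a partial SUM per key."""
    partial = Counter()
    for key, amount in rows:
        partial[key] += amount
    return partial


def coordinate(partitions):
    """Coordinator: send one fragment per partition, then merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        for partial in pool.map(execute_fragment, partitions):
            total.update(partial)   # Counter.update adds values key by key
    return dict(total)


# Three partitions standing in for data held on three different nodes.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7)],
    [("west", 1), ("north", 3)],
]
result = coordinate(partitions)
print(result["east"])   # 17
```

Each executor only touches its own partition, so the coordinator receives small partial results instead of raw rows.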

Impala Use cases

  • Impala is well-suited to executing SQL queries for interactive exploratory analytics on large data sets.
  • Hive and MapReduce are appropriate for very long-running, batch-oriented tasks such as ETL.
  • Ad hoc analytics
  • Data Warehouse solutions

Why is Impala faster than Hive?

  • Impala performs most of its operations in memory and does not write intermediate results to disk.
  • Impala supports columnar file formats such as Parquet, so a query reads only the columns it needs.
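Why a columnar layout helps can be shown with a small count of how many values a query has to touch. The sketch below is a toy comparison using plain Python lists and dictionaries; it does not read or write Parquet, and the table contents are invented.

```python
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 5},
    {"id": 3, "region": "east", "amount": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_store(rows, column):
    """Row layout: every field of every row is fetched to reach one column."""
    values_read, total = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[column]
    return total, values_read


def sum_column_store(columns, column):
    """Column layout: only the requested column is fetched."""
    values = columns[column]
    return sum(values), len(values)


row_total, row_reads = sum_row_store(rows, "amount")
col_total, col_reads = sum_column_store(to_columns(rows), "amount")
print(row_total, row_reads)   # 22 9
print(col_total, col_reads)   # 22 3
```

Both layouts give the same answer, but the columnar one touches a third of the values for this three-column table; the gap widens as tables get wider.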

Limitations of Impala

  • When a complex query's result set exceeds the available memory (more than 128 GB of RAM), the query terminates abruptly, and you cannot see even partially processed data.
  • Impala does not support custom serializers/deserializers (SerDes).
  • Impala reads a fixed set of file formats (text, Parquet, Avro, RCFile, SequenceFile), not custom binary files.
  • Whenever new records/files are added to the data directory in HDFS, the table needs to be refreshed.
  • Impala also does not support XML and JSON functions.
  • Impala does not support multiple DISTINCT clauses per query, although Impala includes some workarounds for this limitation.
  • Impala releases before 2.0 do not support nested SELECT statements (subqueries) in the WHERE clause.
  • Currently, Impala doesn't support fault tolerance within a query. If a node fails in the middle of processing, the whole query has to be re-run.
  • Impala is not a good fit for tasks that require heavy data operations such as large joins, because the working data may not fit into memory.
