Hive File Formats - Questions & Answers

01. What is the primary use of Textfile, Sequencefile, RCfile, ORCfile formats?

  • Which of these four file formats to use depends on your data.

    For example:

    • If your data is plain text with fields separated by a delimiter (for example, commas or tabs), use the TEXTFILE format.
    • If your data is spread across many small files, each smaller than the HDFS block size, use the SEQUENCEFILE format.
    • If you want to run analytics on your data and store it efficiently for that purpose, use the RCFILE format.
    • If you want optimized storage that reduces space usage and improves query performance, use the ORCFILE format.
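As a rough illustration of the choices above, the following Python sketch builds a Hive `CREATE TABLE` statement for each format. The helper `create_table_ddl`, the table names, and the columns are hypothetical and exist only for this example; the `STORED AS` keywords and the `ROW FORMAT DELIMITED` clause are standard Hive DDL.

```python
# Hypothetical helper: builds a Hive CREATE TABLE statement as a string.
# Table and column names are made up for illustration.

# Maps a short format key to the keyword used in Hive's STORED AS clause.
STORED_AS = {
    "text": "TEXTFILE",
    "sequence": "SEQUENCEFILE",
    "rc": "RCFILE",
    "orc": "ORC",
}


def create_table_ddl(table, columns, fmt, delimiter=","):
    """Return a CREATE TABLE statement for the given storage format."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    ddl = f"CREATE TABLE {table} ({cols})"
    # Only the delimited text format needs a field delimiter.
    if fmt == "text":
        ddl += f" ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'"
    return f"{ddl} STORED AS {STORED_AS[fmt]}"


if __name__ == "__main__":
    columns = [("id", "INT"), ("name", "STRING")]
    for fmt in STORED_AS:
        print(create_table_ddl(f"events_{fmt}", columns, fmt))
```

Only the storage clause changes between the four statements, so switching formats for a new table is a one-keyword change in the DDL.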

02. What is the difference between a block and a stripe?

  • HDFS blocks are the lowest level and ORC stripes sit at a higher level. The two levels are completely independent: ORC stripes are laid out without regard to the underlying storage layer.

HDFS blocks:

  • HDFS blocks are the lowest level and are independent of the file format. HDFS splits files into blocks to optimize storage.
  • One stripe can be stored across multiple blocks, and one block can contain multiple stripes or part of a stripe. HDFS splits the file without considering the stripe layout or the file format.
  • HDFS stores the block metadata for each file. Reading and writing files is transparent to the ORC reader above it; HDFS takes care of all the blocks.

ORC stripes:

  • Stripes are the upper level of storage. A stripe knows nothing about blocks.
  • ORC is splittable at the stripe level. HDFS knows nothing about the ORC structure or how it can be split for processing; it only splits files into blocks to optimize storage. The smallest unit that can be processed in a single container is one stripe. You can configure the stripe size to fit the block size.
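To make the relationship between stripes and blocks concrete, here is a small Python sketch that works out which HDFS blocks each stripe touches. It is a simplified model, not ORC's actual layout logic: it assumes stripes are written back to back from offset 0 and ignores the ORC header, footer, and any padding. The function name `blocks_for_stripes` and the sizes used are assumptions chosen for illustration.

```python
MB = 1024 * 1024


def blocks_for_stripes(stripe_sizes, block_size):
    """Return, for each stripe, the list of block indices it touches.

    Simplified model: stripes are laid out contiguously from offset 0,
    and the file is cut into fixed-size blocks regardless of stripes.
    """
    layout = []
    offset = 0
    for size in stripe_sizes:
        first = offset // block_size               # block holding the first byte
        last = (offset + size - 1) // block_size   # block holding the last byte
        layout.append(list(range(first, last + 1)))
        offset += size
    return layout


if __name__ == "__main__":
    block = 128 * MB  # assumed block size for this example

    # Stripe size divides the block size: no stripe crosses a block boundary.
    print(blocks_for_stripes([64 * MB] * 4, block))  # [[0], [0], [1], [1]]

    # Stripe size does not divide the block size: some stripes span two blocks.
    print(blocks_for_stripes([96 * MB] * 3, block))  # [[0], [0, 1], [1, 2]]
```

In the second case, a reader processing the second or third stripe has to fetch data from two blocks, which is the motivation for fitting the stripe size to the block size.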

03. What is the difference between Parquet and RC file format?

