Serializer and Deserializer in Hive

What is SerDe?

  • The record parsing of a Hive table is handled by a serializer/deserializer, or SerDe for short.
  • Hive uses the SerDe interface for I/O.
  • The Hive deserializer converts a record (string or binary) into a Java object that Hive can process (modify).
  • The Hive serializer then takes this Java object and converts it into a format suitable for storage in HDFS.
  • So, basically, a SerDe is responsible for converting the record bytes into something that can be used by Hive.

            HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object (Java object)

            Row object (Java object) –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files
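
In the table DDL, these pieces are wired together explicitly. A minimal sketch using Hive's default text-table classes (the table name is made up for illustration):

CREATE TABLE serde_demo
    (
        id int,
        name string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';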

Why do we need SerDe?

  • If we need to handle/load semi-structured data, a SerDe covers this case; see the JSON sketch after this list.
  • If we have unstructured data, we can use the RegEx SerDe, which instructs Hive how to parse each record (an example appears in the next section).
  • A SerDe allows Hive to read data from a table, and write it back out to HDFS, in any custom format.
  • Anyone can also write their own SerDe for their own data formats.
  • A custom SerDe can be slow, but if the data contains special characters that you want to remove after loading, it is still an option.
  • Hive comes with several built-in SerDes, such as the JSON SerDe for JSON files and the CSV SerDe for CSV files.
  • To work with the SerDe framework in Hive, we use the package below:
      • org.apache.hadoop.hive.serde2
  • SerDe (Serializer/Deserializer) instructs Hive on how to process a record (row). This lets Hive handle semi-structured records (XML, email, etc.) and unstructured records (audio, video, etc.) as well. For example, if you have 1000 GB worth of RSS feeds (RSS XMLs), you can ingest them into a location in HDFS. You would then write a custom SerDe based on your XML structure so that Hive knows how to load the XML files into Hive tables, or the other way around.
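
For the semi-structured case, Hive ships a JSON SerDe in the hive-hcatalog-core jar (org.apache.hive.hcatalog.data.JsonSerDe); depending on your Hive version you may need to ADD JAR it first. A minimal sketch with made-up table and column names:

CREATE EXTERNAL TABLE json_events
    (
        id int,
        name string
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE;

-- Each line of the underlying file is one JSON object whose keys match the column names, e.g.:
-- {"id": 1, "name": "alice"}
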
How does Hive understand the different kinds of data formats?

        The answer is SerDe.

  • Serialization — Process of converting an object in memory into bytes that can be stored in a file or transmitted over a network.
  • Deserialization — Process of converting the bytes back into an object in memory.
  • Other built-in SerDes include Avro, ORC, RegEx, Parquet, CSV, JsonSerDe, etc.
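
As an example of the RegEx case mentioned earlier, the built-in org.apache.hadoop.hive.serde2.RegexSerDe maps each regex capturing group to one column, in order. A sketch for Apache-style access logs (the regex and column names are illustrative; adjust them to your data):

CREATE EXTERNAL TABLE apache_log
    (
        host string,
        identity string,
        auth_user string,
        log_time string,
        request string,
        status string,
        size string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)'
    )
    STORED AS TEXTFILE;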

Example

DROP TABLE IF EXISTS default.test_serde;

CREATE EXTERNAL TABLE default.test_serde
    (
        id int,
        name string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
        'separatorChar' = ',',
        'quoteChar"= '"',
        'escapeChar' = '\"',
        'serialization.null.format'=''
    )
    STORED AS TEXTFILE;
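
Once the table exists, loading and querying it works like any other Hive table; a usage sketch (the HDFS path is hypothetical):

LOAD DATA INPATH '/data/test_serde/sample.csv' INTO TABLE default.test_serde;

SELECT id, name FROM default.test_serde;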

separatorChar defines how fields are separated in the input data file.

quoteChar is the character that encloses column values. In the above example, the SerDe is configured to use the double quote (") as the quote character. When writing data, string literals are enclosed within double quotes. When reading data, text enclosed within double quotes is interpreted as a single string literal, even if it contains special characters or delimiters.

escapeChar is the character used to escape special characters (such as the quote or separator) that appear inside column data, irrespective of columns.

serialization.null.format defines how null values are encoded and decoded while reading and writing data. We can set it to '' or 'NULL', for example.
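
To make these properties concrete, here is a hypothetical input file for the table above, assuming the properties behave as described:

-- sample.csv (invented contents)
-- 1,"Alice"       quoted field; the quotes are stripped on read
-- 2,"Bob, Jr."    a separator inside quotes stays part of the data
-- 3,              empty value; decoded as NULL per serialization.null.format=''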

  • Below are the default separator, quote, and escape characters used when they are not specified in the SerDe properties:
      • DEFAULT_ESCAPE_CHARACTER is \ (backslash)
      • DEFAULT_QUOTE_CHARACTER is " (double quote)
      • DEFAULT_SEPARATOR is , (comma)
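
Because of these defaults, a CSV file that already uses comma, double quote, and backslash needs no SERDEPROPERTIES at all; a minimal sketch (table name invented):

CREATE EXTERNAL TABLE default.test_serde_defaults
    (
        id int,
        name string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    STORED AS TEXTFILE;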



