Hive Joins

 Map Join

  • In Apache Hive, there is a feature that we use to speed up Hive queries. Basically, that feature is what we call Map join in Hive.
  • Apache Hive Map Join is also known as Auto Map Join, or Map Side Join, or Broadcast Join.
  • Also, we use Hive Map Side Join since one of the tables in the join is a small table and can be loaded into RAM.
  • Map Join could be performed within a mapper without using a Map/Reduce step.
  • However, map joins in Hive are way faster than the regular joins since no reducers are necessary.
  • If we are joining two tables using map join then one table should have minimal data compare with the other one table. If that small table is having data upto or less than 25MB then it considered like small table. If it is exceeding 25MB but still it is small table(1TB) compare with the other one(10 TB) then we have to use some different method in map join to handle this case.
  • In a Map join no reduce task is required that is why it improves the query performance significantly in comparison of a common join/reduce join/map reduce jobs/aggregations/sorting.
            SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b ON a.key = b.key;

            b is small table. we should mention small table name in MAPJOIN select.

  • We have to set below properties for performing Map Join in Hive,

            set hive.auto.convert.join=true;

            set hive.auto.convert.join.noconditionaltask=true;

            set hive.auto.convert.join.noconditionaltask.size=10000000;

Bucket Map Join

  • Basically, while the tables are large and all the tables used in the join are bucketed on the join columns we use a Bucket Map Join in Hive.
  • For suppose if one table has 2 buckets then the other table must have either 2 buckets or a multiple of 2 buckets (2, 4, 6, and so on). 
  • Further, since the preceding condition is satisfied then the joining can be done on the mapper side only. Else a normal inner join is performed.
  • Therefore, it implies that only the required buckets are fetched on the mapper side and not the complete table.

            SELECT /*+ MAPJOIN(b2) */ b1.* FROM b1,b2 WHERE b1.col0=b2.col0;

            set hive.optimize.bucketmapjoin = true;

            b2 is the small table which is having less buckets. We should mention that table in select clause.

Skew Join in Hive

  • In table A there is 1 million data and table B has 10k only. In table A joining column has 80% values are same and rest is other. So hive reducer stuck at that value if we do Skew Join. For avoiding Skew join you can use below table creation methods, 
                A -> Large Table, B -> Small table

                create table B (c1 string, c2 string) skewed by (c1) on ('x1'); -- If single value repeated in one column.

        (OR)

                create table B (c1 string, c2 string) skewed by (c1) on ('x1', 'x2', 'x3'); -- If multiple values repeated in one column.

  • By specifying the skewed values Hive will split those out into separate files automatically and take this fact into account during queries so that it can skip (or include) whole files if possible thus enhancing the performance.
  • We can improve this further by trying to reduce the processing of skewed keys. 
  • First read small table(B) and store the rows with skewed key in an in-memory hash table. The skewed keys in large table(A) are only read and processed by the Mapper by comparing with small table skewed key which is in RAM, and not sent to the reducer. The rest of the keys in large table go through only a single Map/Reduce.
                SET hive.optimize.skewjoin=true;

                SET hive.skewjoin.key=100000;

                SELECT a.* FROM Sales a JOIN Sales_orc b ON a.id = b.id;

Skew Join – Use Case

  • Basically, on the joining column, one table has huge skew values.

Advantages of Skew Join

  • With minimal changes in the query, we can get atleast 50% performance improvement.

Disadvantages of Skew Join

  • Here, are some Limitations of Hive Skew Join are discussed:
    • So, the major disadvantage of it is One table is read twice here.
    • Moreover, it is necessary that users should be aware of the skew key.

Sort-Merge-Bucket (SMB) Map Join in Hive

  • We mainly use it when there is no limit on file or partition or table join.
  • Also, when the tables are large we can use Hive Sort Merge Bucket join. 
  • However, using the join, all the tables need to be bucketed and sorted.
  • Although, make sure in SMB join all tables should have the same number of buckets.
                SET hive.enforce.bucketing=true;

                SET hive.enforce.sorting=true;

                SET hive.auto.convert.sortmerge.join=true;

                SET hive.optimize.bucketmapjoin = true;

                SET hive.optimize.bucketmapjoin.sortedmerge = true;

                SET hive.auto.convert.sortmerge.join.noconditionaltask=true;

                SET hive.auto.convert.join=false; //if we do not do this, automatically Map-Side Join will happen.

Reduce Side Join

  • Reduce side join is useful for Very large datasets.
  • The Reduce side join is a process where the join operation is performed in the reducer phase.
  • It is normal join only but we need to make sure the dataset is large.
  • The Map side join and the reduce side join. Map side join is usually used when one data set is large and the other data set is small. Whereas the Reduce side join can join both the large data sets. The Map side join is faster as it does not have to wait for all mappers to complete as in case of reducer. Hence reduce side join is slower.
  • As the name suggests, in the reduce side join, the reducer is responsible for performing the join operation. It is comparatively simple and easier to implement than the map side join as the sorting and shuffling phase sends the values having identical keys to the same reducer and therefore, by default, the data is organized for us.
  • Map Side Join
    • Sorted by the same key.
    • Equal number of partition. 
    • All the records of the same key should be in same partition.
  • Reduce Side Join (if we need to increase performance we need to have below steps) 
    • Much flexible to implement. 
    • There has to be custom Writable, Comparable with necessary function over ridden.
    • We need a custom partitioner. 
    • Custom group comparator is required.

Left Semi Join

CREATE TABLE a(id INT, data STRING) row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/cloudera/tableA.txt' into table a;

CREATE TABLE b(id INT, data STRING) row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/cloudera/tableB.txt' into table b;

hive> SELECT * FROM a;
OK
1 DataA11
1 DataA12
1 DataA13
2 DataA21
3 DataA31
5 DataA51
Time taken: 0.412 seconds, Fetched: 5 row(s)

hive> SELECT * FROM b;
OK
1 DataB11
2 DataB21
2 DataB22
2 DataB23
4 DataB41
Time taken: 0.07 seconds, Fetched: 5 row(s)

SELECT a.id, a.data, b.id, b.data FROM a INNER JOIN b ON a.id = b.id;
OK
1 DataA11 1 DataB11
1 DataA12 1 DataB11
1 DataA13 1 DataB11
2 DataA21 2 DataB21
2 DataA21 2 DataB22
2 DataA21 2 DataB23
Time taken: 27.89 seconds, Fetched: 6 row(s)
  • inner join will return matched duplicate values from both tables.
SELECT a.id, a.data, b.id, b.data FROM a LEFT JOIN b ON a.id = b.id;
OK
1 DataA11 1 DataB11
1 DataA12 1 DataB11
1 DataA13 1 DataB11
2 DataA21 2 DataB21
2 DataA21 2 DataB22
2 DataA21 2 DataB23
3 DataA31 NULL NULL
5 DataA51 NULL NULL
Time taken: 26.955 seconds, Fetched: 7 row(s)
  • left join will return matched duplicates rows from left side table and unmatched rows from left side table. same like SQL left join.
SELECT a.id, a.data, b.id, b.data FROM a LEFT SEMI JOIN b ON a.id = b.id;

FAILED: SemanticException [Error 10004]: Line 1:21 Invalid table alias or column reference 'b': (possible column names are: id, data)
  • As per above example, while performing left semi join we should not mention any columns from RIGHT side table in SELECT clause.
SELECT a.id, a.data FROM a LEFT SEMI JOIN b ON a.id = b.id;
OK
1 DataA11
1 DataA12
1 DataA13
2 DataA21
Time taken: 28.014 seconds, Fetched: 4 row(s)
  • As per above example, while performing left semi join we should mention columns only from LEFT side table in SELECT clause.
  • As per above example, It will return matched data only from left side table.

Left Anti Join

  • This join wont work in our cloudera.
  • It will return the rows from left side table which wont match in right side table.
  • unmatched data from only left side table.
  • while performing left anti join we should mention columns only from LEFT side table in SELECT clause.
  • left anti join works using keyword of NOT EXISTS() in hive.
SELECT a.id, a.data FROM a left anti join b on a.id = b.id;

(OR)

SELECT * FROM a WHERE NOT EXISTS (Select 1 from b where a.id = b.id);
OK
3 DataA31
5 DataA51

finally,

left semi join + left anti join = left join (mention the columns only in left table)

SELECT * FROM a WHERE NOT EXISTS(SELECT 1 FROM a LEFT SEMI JOIN b ON a.id = b.id);

SELECT a.id, a.data FROM a LEFT SEMI JOIN b ON a.id = b.id;

Comments

Popular posts from this blog

Hive File Formats

HDFS Infographic

Why We Need Hadoop?