SMB (Sort-Merge-Bucket) join in Hive is a join strategy for two large tables when neither fits into memory, so an ordinary map join is not an option. It relies on both tables being bucketed and sorted on the join key, and it works in two steps: sort and merge.
The sort step happens when the data is written: each table is split into buckets by the join key, and the rows inside each bucket are stored sorted by that key. The merge step happens at query time: matching buckets from the two tables are read side by side and combined with a merge join algorithm, which walks both sorted inputs once and emits the records whose join keys match. Because the inputs are already sorted, the join itself needs no additional shuffle or sort.
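The merge step can be illustrated outside Hive. The following is a minimal Python sketch of a merge join over two inputs that are already sorted by key; it is not Hive's implementation, and the function name `merge_join` and the sample rows are made up for illustration.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # left key has no match yet, advance left
        elif lkey > rkey:
            j += 1  # right key has no match yet, advance right
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    result.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


# Hypothetical contents of one bucket from each table, sorted by customer_id.
customers_bucket = [(1, "Ann"), (5, "Bob"), (9, "Cho")]
sales_bucket = [(1, 10.0), (1, 2.5), (9, 7.0), (13, 4.0)]

joined = merge_join(customers_bucket, sales_bucket)
print(joined)
```

Each input is scanned once from start to end, which is why the approach works for data far larger than memory: only the current rows (and a run of equal keys) need to be held at a time.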
Here is a simple example of an SMB join in Hive:
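-- Create two tables, bucketed and sorted on the join key
CREATE TABLE customers (
  customer_id INT,
  customer_name STRING
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 4 BUCKETS;

CREATE TABLE sales (
  customer_id INT,
  item_id INT,
  sale_amount FLOAT
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 4 BUCKETS;

-- Load data into the tables.
-- Note: LOAD DATA only moves files; it does not bucket or sort the rows.
-- The files under these paths must already be bucketed and sorted by
-- customer_id; otherwise populate the tables with INSERT ... SELECT
-- from a staging table instead.
LOAD DATA INPATH '/data/customers' INTO TABLE customers;
LOAD DATA INPATH '/data/sales' INTO TABLE sales;

-- Allow Hive to convert the join into an SMB join
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Perform an SMB join
SELECT customers.customer_name, SUM(sales.sale_amount)
FROM customers
JOIN sales ON (customers.customer_id = sales.customer_id)
GROUP BY customers.customer_name;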
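In this example, we create two tables, “customers” and “sales,” and load data into each of them. The data in each table is clustered into 4 buckets by the “customer_id” column, which is the join key. We then join the two tables on “customer_id”. For Hive to run this query as an SMB join, the rows in each bucket must also be stored sorted by “customer_id” (declared with a SORTED BY clause in the table definition), the bucket counts of the two tables must be equal or multiples of each other, and sort-merge join conversion must be enabled in the session settings. When these conditions hold, Hive can join each pair of matching buckets independently, which lets it efficiently join two large datasets that cannot fit into memory.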
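To see why bucketing both tables the same way matters, here is a small Python sketch (not Hive code). It assumes a stand-in bucketing function, `customer_id % 4`; Hive's real hash depends on the version and column type. The helper names and sample rows are hypothetical. The point is that equal keys always land in the same bucket number, so bucket i of one table only ever needs to be joined with bucket i of the other.

```python
NUM_BUCKETS = 4


def bucket_of(customer_id):
    # Assumption: simple stand-in for Hive's bucketing hash.
    return customer_id % NUM_BUCKETS


def split_into_buckets(rows):
    """Distribute (customer_id, value) rows into buckets, each sorted by key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[0])].append(row)
    # Mimic SORTED BY (customer_id): sort the rows inside each bucket.
    return [sorted(b, key=lambda r: r[0]) for b in buckets]


# Hypothetical sample rows: (customer_id, customer_name) and (customer_id, sale_amount).
customers = [(1, "Ann"), (2, "Bob"), (5, "Cho"), (8, "Dee")]
sales = [(5, 3.0), (1, 10.0), (8, 1.5), (1, 2.5), (7, 9.0)]

customer_buckets = split_into_buckets(customers)
sales_buckets = split_into_buckets(sales)

# Join bucket i of customers only with bucket i of sales.
# A nested loop keeps the sketch short; an SMB join would merge the sorted buckets.
bucketwise = []
for c_bucket, s_bucket in zip(customer_buckets, sales_buckets):
    for cid, name in c_bucket:
        for sid, amount in s_bucket:
            if cid == sid:
                bucketwise.append((cid, name, amount))

# Reference result: compare every customer row with every sales row.
full = [(cid, name, amount)
        for cid, name in customers
        for sid, amount in sales
        if cid == sid]

print(sorted(bucketwise) == sorted(full))
```

The bucket-by-bucket result contains the same rows as the full comparison, while each bucket pair can be processed independently and in parallel.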