Hash Joins

Hash Joins:

What we have done in the last nested loop join example is that we have created a clustered hash table
and than joined it with a normal table and the plan we got from the same is as following:

---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     1 |   191 |     3   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                |              |     1 |   191 |     3   (0)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1           |     2 |   152 |     3   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_PK        |     2 |       |     2   (0)| 00:00:01 |
|*  4 |   TABLE ACCESS HASH          | HASHED_TABLE |     1 |   115 |            |          |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T1"."N1">=1 AND "T1"."N1"<=2)
   4 - access("HT"."N1"="T1"."N2")
       filter("HT"."V1"='a')

Now what if the hashed table is big table having size of each row as larger (say 600 bytes) and
we are only interested in 20 bytes while retrieving our data, and in this scenario we may find that
table is big enough to be not able to fit into memory and as the query runs oracle has to keep
repeating random I/O’s to get them. Even if we can keep the blocks in memory there are cases that
other users are modifying and retrieving the data from same table and blocks, and we can find that
we are competing for the “cache buffer chains” latches that protect the blocks in that table.

So what we can do to enhance/tune our query? we can just create a table with the required bits from
the big table before starting the join and this is called as Hash Join and the tables appears to be
“back to front”.

Here below is the setup that i have used to test on 10.2.0.1

drop table hash_target purge;
create table hash_target(
	hash_value	number
	, status	varchar2(2)
	, v1		varchar2(10)
	, v2		varchar2(100)
)
;

insert into hash_target
select
	rownum
	, dbms_random.string('u', 2)
	, lpad(rownum, 10)
	, rpad('*', 100)
from
	all_objects
where
	rownum <= 1e3
;

create index ht_i1 on hash_target(status);
create index ht_i2 on hash_target(hash_value);

drop table t1 purge;
create table t1(
	join_value	number
	, status	varchar2(1)
	, v1		varchar2(10)
	, v2		varchar2(100)
)
;

insert into t1
select
	mod(rownum, 1000)
	, dbms_random.string('u', 1)
	, lpad(rownum, 10)
	, rpad('*', 100)
from
	all_objects
where
	rownum <= 2e4
;

create index t1_i1 on t1(status);

begin
	dbms_stats.gather_table_stats(
		user
		, 't1'
		, method_opt => 'for all columns size 1'
	);
	dbms_stats.gather_table_stats(
		user
		, 'hash_target'
		, method_opt => 'for all columns size 1'
	);
end;
/

set autotrace traceonly explain

select
	t1.v1
	, ht.v1
from
	t1
	, hash_target ht
where
	t1.status = 'A'
and	ht.hash_value = t1.join_value
and	ht.status = 'AA'
;

set autotrace off

Below is the execution plan which we have got from the above

Execution Plan
----------------------------------------------------------
Plan hash value: 227304121

--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |     3 |   105 |   106   (1)| 00:00:02 |
|*  1 |  HASH JOIN                   |             |     3 |   105 |   106   (1)| 00:00:02 |
|   2 |   TABLE ACCESS BY INDEX ROWID| HASH_TARGET |     2 |    36 |     3   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | HT_I1       |     2 |       |     1   (0)| 00:00:01 |
|*  4 |   TABLE ACCESS FULL          | T1          |   769 | 13073 |   103   (1)| 00:00:02 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("HT"."HASH_VALUE"="T1"."JOIN_VALUE")
   3 - access("HT"."STATUS"='AA')
   4 - filter("T1"."STATUS"='A')

Here we can see that Oracle has first pulled out the the rowsource from hash_target table and build
a hash cluster table from it and then it has used nested loop join to retrieve the rows from t1 table
but as here it has build hash table first Oracle calls it as Hash Join. And here the first table
hash_target is called as build table and the t1 table is called as probe table as Oracle has first
build a hash table and then probed the second table to retrieve the matching rows/records.

While looking at hash join we must take into consideration whether our build table is able to fit into
memory itself or not otherwise our reading of the blocks from the disk will increase leading to increase
in physical I/O’s.

We can also use hint use_hash(table/alias_name) to guide the optimizer to use a hash join.

For more information check out Jonathan Lewis-Joins – HJ

Cheers!!!
Jagdeep Sangwan

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s