sql - Why does breaking out this correlated subquery vastly improve performance? -
i tried running query against 2 tables different sizes - #temp 15,000 rows, , member 70,000,000, 68,000,000 of not have id 307.
select count(*) #temp cast(individual_id varchar) not in ( select individualid member m inner join person p on p.personid = m.personid companyid <> 307)
this query ran 18 hours, before killed , tried else, was:
select individualid #source member m inner join person p on p.personid = m.personid companyid <> 307 select count(*) #temp cast(individual_id varchar) not in ( select individualid #source)
and ran less second before giving me result.
i pretty surprised this. i'm middle-tier developer rather sql expert , understanding of goes on under hood little murky, have presumed that, since sub-query in first attempt exact same code, asking exact same data in second attempt, these equivalent.
but that's wrong. can't @ execution plan original query see sql server trying do. can kindly explain why splitting data out temp table faster?
edit: table schemas , indexes
the #temp
table has 2 columns, individual_id int
, source_code varchar(50)
member
, person
more complex. has 29 , 13 columns respectively don't want post them in full. personid
int , pk on person , fk on member. individualid
column on person - not clear in query written.
i tried using left join
instead of not in
before asking question. performance on second query wasn't noticeably different - both sub-second. on first query let run hour before stopping it, presuming make no significant difference.
i added index on #source, on original table, performance impact should identical.
first, query has 2 faux pas's stick out. converting varchar()
, not include length argument. should not allowed! default length varies context , need explicit.
second, matching 2 keys in different tables , seemingly have different types. foreign key references should have same type. can have big impact on performance. if dealing tables have millions of rows, need pay attention data structure.
to understand difference in performance, need understand execution plans. 2 queries have different execution plans. (educated) guess first version version using nested loop join algorithm. second version using more sophisticated algorithm. in case, due ability of sql server maintain statistics on tables. so, instantiating intermediate results helps optimizer produce better query plan.
the subject of how best write logic has been investigated lot. here discussion on subject aaron bertrand.
i agree aaron on preference not exists
in case:
select count(*) #temp t not exists (select 1 member m join person p on p.personid = m.personid memberid <> 307 , individual_id = t. individual_id );
however, don't know if have better performance in particular case.
Comments
Post a Comment