sql - Why does breaking out this correlated subquery vastly improve performance?


I tried running a query against two tables which were very different in size: #temp, with 15,000 rows, and member, with 70,000,000 rows, 68,000,000 of which do not have the companyid 307.

select count(*)
from #temp
where cast(individual_id as varchar) not in (
        select individualid
        from member m
        inner join person p on p.personid = m.personid
        where companyid <> 307)

This query ran for 18 hours before I killed it and tried something else, which was:

select individualid
into #source
from member m
inner join person p on p.personid = m.personid
where companyid <> 307

select count(*)
from #temp
where cast(individual_id as varchar) not in (
        select individualid
        from #source)

and that ran in less than a second before giving me the result.

I was pretty surprised by this. I'm a middle-tier developer rather than a SQL expert and my understanding of what goes on under the hood is a little murky, but I would have presumed that, since the sub-query in my first attempt is the exact same code asking for the exact same data as in my second attempt, the two would be equivalent.

But that's wrong. I can't even look at the execution plan of my original query to see what SQL Server is trying to do. Can somebody kindly explain why splitting the data out into a temp table is so much faster?

EDIT: Table schemas and indexes

The #temp table has two columns, individual_id int and source_code varchar(50).

member and person are more complex. They have 29 and 13 columns respectively, so I don't want to post them in full. personid is an int and is the PK on person and an FK on member. individualid is a column on person - this is not clear in the query as written.
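For reference, a minimal sketch of the relevant columns (hypothetical; the real tables are much wider and the comments stand in for the columns I haven't listed):

create table #temp (
    individual_id int,
    source_code   varchar(50)
);

-- person: 13 columns in reality; personid int is the PK, individualid also lives here
-- member: 29 columns in reality; personid int is an FK to person, companyid is the filter column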

I tried using a left join instead of not in before asking the question. The performance on the second query wasn't noticeably different - both were sub-second. On the first query I let it run for an hour before stopping it, presuming it would make no significant difference.
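For clarity, the left join variant was roughly of this form (a sketch; the exact query I ran may have differed slightly):

-- Anti-join via left join / is null, same intent as the "not in" version
select count(*)
from #temp t
left join #source s on s.individualid = t.individual_id
where s.individualid is null;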

I added an index on #source, matching the one on the original table, so the performance impact should be identical.

First, the query has two faux pas that stick out. You are converting to varchar() and not including a length argument. This should not be allowed! The default length varies by context, and you need to be explicit.
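A minimal illustration of how the defaults bite in SQL Server (cast/convert default to 30 characters, while a variable declaration defaults to 1):

declare @s varchar = 'abcdef';  -- no length given: the variable is varchar(1)
select @s;                      -- returns just 'a'

select cast('abcdefghijklmnopqrstuvwxyz0123456789' as varchar);  -- cast defaults to varchar(30): silently truncated

select cast(individual_id as varchar(20)) from #temp;  -- explicit length: no surprises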

Second, you are matching two keys in different tables that seemingly have different types. Foreign key references should have the same type. This can have a big impact on performance. If you are dealing with tables that have millions of rows, you need to pay attention to the data structure.
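For example (a sketch, assuming individualid is an int like individual_id), dropping the cast keeps the comparison in the native type:

-- The cast forces a type conversion on every comparison and stops
-- any index on individual_id from being used directly:
select count(*)
from #temp
where cast(individual_id as varchar(20)) not in (select individualid from #source);

-- int compared to int: no conversion needed
select count(*)
from #temp
where individual_id not in (select individualid from #source);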

To understand the difference in performance, you need to understand the execution plans. The two queries have different execution plans. My (educated) guess is that the first version is using a nested loop join algorithm, while the second version is using a more sophisticated algorithm. In this case, that is due to the ability of SQL Server to maintain statistics on tables. So, instantiating the intermediate results helps the optimizer produce a better query plan.
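One way to see what the optimizer chose without waiting 18 hours is to request the estimated plan only (a sketch using SQL Server's SHOWPLAN option):

set showplan_text on;
go

-- Not executed: SQL Server returns the estimated plan instead, so you can
-- check whether it picked a nested loop join against the 70M-row member table.
select count(*)
from #temp
where cast(individual_id as varchar) not in (
        select individualid
        from member m
        inner join person p on p.personid = m.personid
        where companyid <> 307);
go

set showplan_text off;
go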

The subject of how best to write this logic has been investigated a lot. Here is a discussion on the subject by Aaron Bertrand.

I agree with Aaron on the preference for not exists in this case:

select count(*)
from #temp t
where not exists (select 1
                  from member m join
                       person p
                       on p.personid = m.personid
                  where companyid <> 307 and
                        p.individualid = t.individual_id
                 );

However, I don't know if it will have better performance in this particular case.

