python - Threshold value in one-dimensional data


I have a list of similarity scores (similarity_scores) between pairs of texts, computed with a string-matching method. A manually added actual_value column shows whether the texts are indeed similar. Is there a statistical way to find a threshold value on the similarity score?

similarity_scores   actual_value
1.0     1
1.0     1
1.0     1
1.0     1
0.99    1
0.99    1
0.99    1
0.989   1
0.944   1
0.944   1
0.941   1
0.941   1
0.941   1
0.941   1
0.941   0
0.934   0
0.933   0
0.933   1
0.88    1
0.784   0
0.727   0
0.727   0
0.714   0
0.714   1
0.714   0
0.714   0
0.711   0
0.711   0
0.707   0
0.707   0
0.696   0
0.696   0
0.696   0
0.696   0

A common way of evaluating a particular classification (or document retrieval) is to use precision and recall values. In your example, given a threshold [1]:

Precision tells you the percentage of documents above the threshold that were manually tagged with the value 1, or:

number of documents above threshold tagged 1
--------------------------------------------
     number of documents above threshold

Recall tells you the percentage of documents tagged 1 that are above the threshold:

number of documents above threshold tagged 1
--------------------------------------------
        number of documents tagged 1

In the example you gave, we can compute these values for each possible threshold. The relevant thresholds are the ones at transitions between sequences of zeros and ones, so I'll look at those points:

1.0     1
1.0     1
1.0     1
1.0     1
0.99    1
0.99    1
0.99    1
0.989   1
0.944   1
0.944   1    th=0.944  #1's=10; #0's=0
0.941   1
0.941   1
0.941   1
0.941   1
0.941   0    th=0.941  #1's=14; #0's=1
0.934   0
0.933   0
0.933   1    th=0.933  #1's=15; #0's=3
0.88    1    th=0.880  #1's=16; #0's=3
0.784   0
0.727   0
0.727   0
0.714   0
0.714   1
0.714   0
0.714   0    th=0.714  #1's=17; #0's=9
0.711   0
0.711   0
0.707   0
0.707   0
0.696   0
0.696   0
0.696   0
0.696   0

And the total number of documents tagged 1 is 17.

Therefore, for these 5 possible thresholds th, we have the following precision and recall:

th = 0.944     precision = 10/10 = 1.000     recall = 10/17 = 0.588
th = 0.941     precision = 14/15 = 0.933     recall = 14/17 = 0.824
th = 0.933     precision = 15/18 = 0.833     recall = 15/17 = 0.882
th = 0.880     precision = 16/19 = 0.842     recall = 16/17 = 0.941
th = 0.714     precision = 17/26 = 0.654     recall = 17/17 = 1.000
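The computation above can be automated. Here is a minimal sketch that reproduces the table from the (score, label) pairs in the question; the names `data` and `precision_recall` are my own:

```python
# (similarity_score, actual_value) pairs from the question, in order.
data = [
    (1.0, 1), (1.0, 1), (1.0, 1), (1.0, 1),
    (0.99, 1), (0.99, 1), (0.99, 1), (0.989, 1),
    (0.944, 1), (0.944, 1),
    (0.941, 1), (0.941, 1), (0.941, 1), (0.941, 1), (0.941, 0),
    (0.934, 0), (0.933, 0), (0.933, 1),
    (0.88, 1), (0.784, 0),
    (0.727, 0), (0.727, 0),
    (0.714, 0), (0.714, 1), (0.714, 0), (0.714, 0),
    (0.711, 0), (0.711, 0), (0.707, 0), (0.707, 0),
    (0.696, 0), (0.696, 0), (0.696, 0), (0.696, 0),
]

# Total number of documents tagged 1 (should be 17).
total_ones = sum(label for _, label in data)

def precision_recall(th):
    """Precision and recall when scores >= th count as 'above threshold'."""
    above = [label for score, label in data if score >= th]
    ones_above = sum(above)
    return ones_above / len(above), ones_above / total_ones

# Evaluate the five candidate thresholds at the 0/1 transitions.
for th in [0.944, 0.941, 0.933, 0.880, 0.714]:
    p, r = precision_recall(th)
    print(f"th = {th:.3f}  precision = {p:.3f}  recall = {r:.3f}")
```

In practice you could loop over every distinct score rather than the hand-picked transition points; the transitions are just where precision and recall actually change.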

What you do with these values depends a great deal on your data and on how sensitive your results should be to false negatives or false positives. For instance, if you want to make sure you have as few false positives as possible, you would go with a threshold of th = 0.941 or th = 0.944.

If you want to balance precision and recall, you might want to go with th = 0.880, because both measures increase for any threshold above it, and precision is better there than for any threshold below it. This is a rather subjective way of doing it, but you can automate it to some extent using the F-measure. In particular, I'll use the F1-measure here, but you can find one that suits your data.

The F1-measure is defined as:

     2 * precision * recall
f1 = ----------------------
       precision + recall

Using the numbers above we get:

th = 0.944   f1 = 2*1.000*0.588/(1.000+0.588) = 0.741
th = 0.941   f1 = 2*0.933*0.824/(0.933+0.824) = 0.875
th = 0.933   f1 = 2*0.833*0.882/(0.833+0.882) = 0.857
th = 0.880   f1 = 2*0.842*0.941/(0.842+0.941) = 0.889
th = 0.714   f1 = 2*0.654*1.000/(0.654+1.000) = 0.791
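Ranking the thresholds by F1 can also be done in a few lines. This sketch hard-codes the precision/recall fractions from the table above so it stands on its own; the names `pr`, `f1`, and `best_th` are my own:

```python
# precision and recall (as exact fractions) for each candidate threshold.
pr = {
    0.944: (10/10, 10/17),
    0.941: (14/15, 14/17),
    0.933: (15/18, 15/17),
    0.880: (16/19, 16/17),
    0.714: (17/26, 17/17),
}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Pick the threshold that maximizes F1.
best_th = max(pr, key=lambda th: f1(*pr[th]))

for th, (p, r) in pr.items():
    print(f"th = {th:.3f}  f1 = {f1(p, r):.3f}")
print("best threshold:", best_th)  # 0.880 maximizes F1 on this data
```

Note the parentheses around `(p + r)`: the division applies to the whole sum, as in the definition of F1 above.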

As you can see, by the F1 measure th = 0.880 comes out on top, with th = 0.941 not far behind, giving similar results to our manual inspection of the possible thresholds.

[1] To clarify, I define the threshold such that similarity scores greater than or equal to the threshold are considered above the threshold, and similarity scores strictly less than the threshold are considered below it.
