Python - Threshold value in one-dimensional data
I have a list of similarity scores, similarity_scores, between two texts, computed using a string-matching method. I manually added actual_value to show whether the texts are indeed similar. Is there a statistical way to find a threshold value on the similarity score?
similarity_scores  actual_value
1.0                1
1.0                1
1.0                1
1.0                1
0.99               1
0.99               1
0.99               1
0.989              1
0.944              1
0.944              1
0.941              1
0.941              1
0.941              1
0.941              1
0.941              0
0.934              0
0.933              0
0.933              1
0.88               1
0.784              0
0.727              0
0.727              0
0.714              0
0.714              1
0.714              0
0.714              0
0.711              0
0.711              0
0.707              0
0.707              0
0.696              0
0.696              0
0.696              0
0.696              0
A common way of determining how well a particular classification performs in document retrieval is to use the precision and recall values. In your example, for a given threshold [1]:

Precision tells you what percentage of the documents above the threshold were manually tagged with a 1 value, or:

    number of documents above the threshold tagged 1
    ------------------------------------------------
        number of documents above the threshold

Recall tells you what percentage of the documents tagged 1 were above the threshold:

    number of documents above the threshold tagged 1
    ------------------------------------------------
             number of documents tagged 1
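As a minimal sketch of these two definitions, the function below computes precision and recall for a single threshold, counting a score as "above" when it is greater than or equal to the threshold. The function name `precision_recall` is my own choice for illustration; the two lists reuse the names and values from the question.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall), treating score >= threshold as 'above'."""
    # Labels of all documents at or above the threshold
    above = [label for score, label in zip(scores, labels) if score >= threshold]
    tagged_one_above = sum(above)      # documents above the threshold tagged 1
    tagged_one_total = sum(labels)     # all documents tagged 1
    # Return 0.0 instead of dividing by zero when a denominator is empty
    precision = tagged_one_above / len(above) if above else 0.0
    recall = tagged_one_above / tagged_one_total if tagged_one_total else 0.0
    return precision, recall


# Data copied from the question
similarity_scores = [
    1.0, 1.0, 1.0, 1.0, 0.99, 0.99, 0.99, 0.989, 0.944, 0.944,
    0.941, 0.941, 0.941, 0.941, 0.941, 0.934, 0.933, 0.933, 0.88,
    0.784, 0.727, 0.727, 0.714, 0.714, 0.714, 0.714,
    0.711, 0.711, 0.707, 0.707, 0.696, 0.696, 0.696, 0.696,
]
actual_value = [
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 0, 0, 0, 1, 1,
    0, 0, 0, 0, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
]

p, r = precision_recall(similarity_scores, actual_value, 0.88)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.842 recall=0.941
```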
In the example you gave, you can compute these values for each possible threshold, but the relevant ones are those where there are transitions between sequences of zeros and ones, so I'll only look at those points:
1.0    1
1.0    1
1.0    1
1.0    1
0.99   1
0.99   1
0.99   1
0.989  1
0.944  1
0.944  1    th=0.944  #1's=10; #0's=0
0.941  1
0.941  1
0.941  1
0.941  1
0.941  0    th=0.941  #1's=14; #0's=1
0.934  0
0.933  0
0.933  1    th=0.933  #1's=15; #0's=3
0.88   1    th=0.880  #1's=16; #0's=3
0.784  0
0.727  0
0.727  0
0.714  0
0.714  1
0.714  0
0.714  0    th=0.714  #1's=17; #0's=9
0.711  0
0.711  0
0.707  0
0.707  0
0.696  0
0.696  0
0.696  0
0.696  0
The total number of documents tagged 1 is 17.
Therefore, for these 5 possible thresholds th, we have precision and recall as follows:
th = 0.944    precision = 10/10 = 1.000    recall = 10/17 = 0.588
th = 0.941    precision = 14/15 = 0.933    recall = 14/17 = 0.824
th = 0.933    precision = 15/18 = 0.833    recall = 15/17 = 0.882
th = 0.880    precision = 16/19 = 0.842    recall = 16/17 = 0.941
th = 0.714    precision = 17/26 = 0.654    recall = 17/17 = 1.000
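These counts can also be produced programmatically. The sketch below is one possible way to do it, not the only one: it sorts the data by score, walks through each distinct score from highest to lowest, and keeps running totals. It evaluates every distinct score rather than only the five transition points, so the five thresholds listed here appear as a subset of its output. The name `sweep_thresholds` is hypothetical, and the code assumes at least one document is tagged 1.

```python
from itertools import groupby

# Data copied from the question
similarity_scores = [
    1.0, 1.0, 1.0, 1.0, 0.99, 0.99, 0.99, 0.989, 0.944, 0.944,
    0.941, 0.941, 0.941, 0.941, 0.941, 0.934, 0.933, 0.933, 0.88,
    0.784, 0.727, 0.727, 0.714, 0.714, 0.714, 0.714,
    0.711, 0.711, 0.707, 0.707, 0.696, 0.696, 0.696, 0.696,
]
actual_value = [
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 0, 0, 0, 1, 1,
    0, 0, 0, 0, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
]


def sweep_thresholds(scores, labels):
    """Return (threshold, precision, recall) for each distinct score, highest first.

    Assumes at least one label equals 1, so recall is well defined.
    """
    ordered = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tagged_one_total = sum(labels)
    rows = []
    seen = 0              # documents with score >= current threshold
    tagged_one_above = 0  # of those, how many are tagged 1
    for score, group in groupby(ordered, key=lambda pair: pair[0]):
        group_labels = [label for _, label in group]
        seen += len(group_labels)
        tagged_one_above += sum(group_labels)
        rows.append((score, tagged_one_above / seen, tagged_one_above / tagged_one_total))
    return rows


for th, precision, recall in sweep_thresholds(similarity_scores, actual_value):
    print(f"th = {th:.3f}  precision = {precision:.3f}  recall = {recall:.3f}")
```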
What you do with these values from here depends a great deal on your data and on how sensitive your results should be to false negatives or false positives. For instance, if you want to make sure that you have as few false positives as possible, you would want to go with a threshold of th = 0.941 or even th = 0.944.
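If you want to balance precision and recall, you might want to go with th = 0.880, because both measures increase relative to the threshold above it (0.933), and its precision is much better than at the threshold below it (0.714). This is a rather subjective way of doing this, but you can automate it to some extent using the F-measure. In particular, I'll use the F1-measure, but you can find one that suits your data.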
The F1-measure is defined as:

    F1 = 2 * precision * recall
         ----------------------
           precision + recall
Using the numbers from above, we get:

th = 0.944    F1 = 2*1.000*0.588/(1.000+0.588) = 0.741
th = 0.941    F1 = 2*0.933*0.824/(0.933+0.824) = 0.875
th = 0.933    F1 = 2*0.833*0.882/(0.833+0.882) = 0.857
th = 0.880    F1 = 2*0.842*0.941/(0.842+0.941) = 0.889
th = 0.714    F1 = 2*0.654*1.000/(0.654+1.000) = 0.791
As you can see, with the F1 measure, th=0.880 comes out on top, with th=0.941 not far behind, giving very similar results to the manual inspection of the possible thresholds.
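The selection can be automated along these lines. This is only a sketch: it starts from the #1's and #0's counts tabulated earlier rather than from the raw scores, and the names `f_measure`, `best_threshold` and `counts` are hypothetical. The `beta` parameter is the standard F-beta generalization, which is an addition to what is described here: `beta=1` gives F1, while `beta` below 1 weights precision more heavily and `beta` above 1 weights recall more heavily.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the F1-measure."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# threshold -> (#1's above threshold, #0's above threshold), from the counts above
counts = {
    0.944: (10, 0),
    0.941: (14, 1),
    0.933: (15, 3),
    0.880: (16, 3),
    0.714: (17, 9),
}
TOTAL_TAGGED_ONE = 17  # total number of documents tagged 1


def best_threshold(counts, total_tagged_one, beta=1.0):
    """Return the candidate threshold with the highest F-beta score."""
    def score(th):
        ones, zeros = counts[th]
        precision = ones / (ones + zeros)
        recall = ones / total_tagged_one
        return f_measure(precision, recall, beta)
    return max(counts, key=score)


print(best_threshold(counts, TOTAL_TAGGED_ONE))            # 0.88
print(best_threshold(counts, TOTAL_TAGGED_ONE, beta=0.5))  # 0.941
```

With `beta=0.5` the choice moves to th=0.941, which is consistent with preferring fewer false positives.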
[1] To clarify, I define the threshold such that similarity scores greater than or equal to the threshold are considered to be above the threshold, and similarity scores strictly less than the threshold are considered to be below it.