java - Get bounding indices of non-unique words in a string
Suppose I have the following string:
(def strg "apple orange apple")
I'd like the bounding indices of each non-unique word in the string. The first occurrence of apple should have bounding indices (0, 4), while the second occurrence of apple should have bounding indices (13, 17).
One approach I've been playing with is to first store the indices of each character in the string and then, for each index n, identify word boundaries by looking for a space at n-1 (yes, this misses beginning-of-string words). If that condition is met, iterate through the next k characters until a space is hit; the character at the position before that space is the second bounding index. Here is the first part of my (failed) code:
(for [ch strg]
  (let [indx (int (.indexOf strg (str ch)))]
    (cond (= (subs ch indx-1 ) " " )
          ;; ... continue with the rest of the logic described above
Any ideas (Clojure, Java, or Python are all fine) would be appreciated.
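For reference, the scanning idea described in the question can be made to work, including for a word at the very beginning of the string. The following is only a minimal Python sketch, assuming words are separated by whitespace and using the inclusive end indices the question asks for; the function name `word_bounds` is made up for illustration.

```python
def word_bounds(s):
    """Return (start, end) pairs, end inclusive, for each whitespace-separated word."""
    bounds = []
    start = None  # index where the current word began, or None when between words
    for i, ch in enumerate(s):
        if ch.isspace():
            if start is not None:
                # The character just before this space closes the word.
                bounds.append((start, i - 1))
                start = None
        elif start is None:
            # First non-space character after a gap (or at index 0) opens a word.
            start = i
    if start is not None:
        # The string ended while still inside a word.
        bounds.append((start, len(s) - 1))
    return bounds


print(word_bounds("apple orange apple"))
# [(0, 4), (6, 11), (13, 17)]
```

Tracking "am I inside a word" rather than looking back at n-1 is what removes the special case for the first word.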
It is more typical in Clojure/Java to use the index of the starting character and the index one past the ending character, i.e. [0, 5] and [13, 18], instead. Java's Matcher returns the start and end of each match in this manner.
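(def strg "apple orange apple")

(defn re-indices
  "Lazy seq of [start end] pairs (end exclusive) for each match of re in s."
  [re s]
  (let [m (re-matcher re s)]
    ((fn step []
       (when (.find m)
         (cons [(.start m) (.end m)]
               (lazy-seq (step))))))))

;; \S+ (capital S) matches runs of non-whitespace, i.e. the words themselves
(re-indices #"\S+" strg)
;=> ([0 5] [6 12] [13 18])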
And subs takes its start and end indices in the same form, so the pairs can be used directly:
(->> (re-indices #"\S+" strg)
     (group-by (partial apply subs strg)))
;=> {"apple" [[0 5] [13 18]], "orange" [[6 12]]}
From here you can filter for the substring keys that have more than one index pair.
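Since the question also allows Python, here is a rough equivalent of the whole pipeline, including the final filtering step. It is only a sketch: it assumes "word" means a run of non-whitespace characters, it returns half-open (start, end) pairs like the Clojure version, and the name `repeated_word_spans` is made up for illustration.

```python
import re
from collections import defaultdict


def repeated_word_spans(s):
    """Map each word occurring more than once to its (start, end) spans, end exclusive."""
    spans = defaultdict(list)
    # re.finditer yields match objects; start() and end() mirror Java's Matcher.
    for m in re.finditer(r"\S+", s):
        spans[m.group()].append((m.start(), m.end()))
    # Keep only the words that have more than one span.
    return {word: pairs for word, pairs in spans.items() if len(pairs) > 1}


print(repeated_word_spans("apple orange apple"))
# {'apple': [(0, 5), (13, 18)]}
```

If the inclusive bounds from the question are needed, subtract 1 from each end value.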