R: use grep to find one or several matches in order of importance
I am using grep to tidy address data. The goal here is to identify the street / avenue / road name etc. in a given record. The column has been split on spaces into individual words, stored in the variable tempval, for example:

    > tempval
    [1] "38"     "willow" "park"

I use the following statement to spot which word is the street type:
    stid <- grep("street|\\bst\\b|avenue|\\bave\\b|\\bav\\b|way|boulevard|\\bbd\\b|road|\\brd\\b|place|\\bpl\\b|esplanade|terrace|parade|drive|\\bdr\\b|\\bpark\\b|lane|crescent|\\bcourt\\b|\\bcres\\b", tempval, ignore.case = TRUE)

    > stid
    [1] 3

This is fine: I know that "park" is the 3rd element, and that what comes before it is the street number and name.
However, a problem arises when there are several matches, i.e. length(stid) > 1, for example:

    > tempval
    [1] "38"   "park" "st"

So here, I get

    > stid
    [1] 2 3

How can I make R return only one match, in order of importance (the order in which I have placed the strings in the grep pattern)? In other words, if R finds both "st" and "park", and "st" is more important than "park", how do I get it to return stid = 3 only?
Using grep is dangerous: even when you take priority into account, grep will return "streetlife" as a street name when you try it on "streetlife park" (it will find "street" in "streetlife").
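To make the pitfall concrete, here is a small sketch comparing an unanchored grep with exact matching on whole words. The vector `words` is an illustrative input, not data from the question.

```r
# Illustrative input: "streetlife" is part of the name, "park" is the street type
words <- c("streetlife", "park")

# Unanchored pattern: "street" is found inside "streetlife"
grep("street", words)     # position 1 is a false positive

# Exact matching compares whole words, so "streetlife" is not a hit
match("street", words)    # NA
match("park", words)      # 2
```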
Hence I suggest you use match instead. Convert the words to lower case, and use a vector with the values in order of importance. You can then use match to see at which positions in x you have a match with that vector. Take the first value that is not NA and you're done:
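    checkstreet <- function(x){
        x <- tolower(x)
        # street types in order of importance
        thenames <- c("street","st","avenue","ave","av",
                      "way","boulevard", "bd", "road", "rd",
                      "place", "pl", "esplanade","terrace","parade",
                      "drive","dr","park","lane","crescent","court", "cres")
        # position in x of each street type, NA when it does not occur
        id <- match(thenames, x)
        # position of the most important type that was found
        id[!is.na(id)][1]
    }

This gives: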
    > tmpval <- c("38","park","street")
    > checkstreet(tmpval)
    [1] 3
    > tmpval <- c("44","average","esplanade")
    > checkstreet(tmpval)
    [1] 3

If you insist on using grep and want to keep using the \\b word boundaries, you can use the same logic, this time using which.min:
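    checkstreet <- function(x){
        x <- tolower(x)
        thenames <- c("street","st","avenue","ave","av",
                      "way","boulevard", "bd", "road", "rd",
                      "place", "pl", "esplanade","terrace","parade",
                      "drive","dr","park","lane","crescent","court", "cres")
        # wrap every street type in word boundaries
        patterns <- paste0("\\b", thenames, "\\b")
        # for each word, the rank of the most important pattern it matches (NA if none)
        ranks <- sapply(x, function(w) which(sapply(patterns, grepl, w))[1])
        # position of the word with the best (lowest) rank
        unname(which.min(ranks))
    }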
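Once the position is known, the record can be split into the number/name part and the street type, which was the goal stated in the question. The following is only a minimal sketch built on the match-based approach. The names `street_types`, `find_street_type` and `split_address` are hypothetical, and the list of types is shortened for brevity.

```r
# Hypothetical, shortened priority list: earlier entries win
street_types <- c("street", "st", "avenue", "ave", "road", "rd", "park")

# Position in `words` of the most important street type, NA if none is present
find_street_type <- function(words) {
  id <- match(street_types, tolower(words))
  id[!is.na(id)][1]
}

# Split a record into the leading number/name part and the street type
split_address <- function(words) {
  i <- find_street_type(words)
  if (is.na(i)) {
    # no street type found: keep everything as the name
    return(list(name = paste(words, collapse = " "), type = NA_character_))
  }
  list(name = paste(words[seq_len(i - 1)], collapse = " "), type = words[i])
}

split_address(c("38", "park", "st"))      # name "38 park",   type "st"
split_address(c("38", "willow", "park"))  # name "38 willow", type "park"
```

Anything after the street type (a suburb, for instance) is ignored in this sketch and would need its own handling.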