python - Can I control the way the CountVectorizer vectorizes the corpus in scikit-learn?
I'm working with CountVectorizer in scikit-learn, and I'm possibly attempting to do some things that the object was not made for... but I'm not sure.
In terms of getting counts of occurrences:
    from sklearn.feature_extraction.text import CountVectorizer

    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    X = cv.fit_transform(corpus)
    print(X.toarray())
gives:
    [[0 0 0]]
What I'm realizing is that CountVectorizer will break the corpus into what I believe are unigrams:
    vocabulary = ['hi', 'bye', 'run']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    X = cv.fit_transform(corpus)
    print(X.toarray())
which gives:
    [[0 0 1]]
Is there a way to tell CountVectorizer exactly how you'd like it to vectorize the corpus? Ideally I would like an outcome along the lines of the first example.
In all honesty, however, I'm wondering if it is at all possible to get an outcome along these lines:
    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['i want run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    X = cv.fit_transform(corpus)
    print(X.toarray())
    # desired output: [[0 0 1]]
I don't see much information in the documentation for the fit_transform method, which takes only one argument as it is. If anyone has any ideas I would be grateful. Thanks!
The parameter you want is called ngram_range. Pass the tuple (1, 2) to the constructor to get unigrams and bigrams. However, the vocabulary you pass in should be a dict with the n-grams as keys and integer column indices as values.
    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(vocabulary={'hi': 0, 'bye': 1, 'run away': 2}, ngram_range=(1, 2))
    print(cv.fit_transform(['i want run away!']).toarray())
    [[0 0 1]]
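To see which n-grams the vectorizer actually looks up in the vocabulary, the analyzer can be inspected directly. This is a small sketch that assumes the default tokenizer settings; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same ngram_range as above, but no fixed vocabulary: only the analyzer is needed.
cv = CountVectorizer(ngram_range=(1, 2))
analyzer = cv.build_analyzer()

# The default analyzer lowercases the text, drops punctuation and
# one-character tokens, then emits unigrams followed by bigrams.
tokens = analyzer('i want run away!')
print(tokens)
```

The bigram 'run away' shows up in the list, while 'run away!' does not, which is why the vocabulary key has to be written without the exclamation mark.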
Note that the default tokeniser removes the exclamation mark at the end, so the last token is away. If you want more control over how the string is broken into tokens, follow @BrenBarn's comment.
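One way to get that kind of control, not necessarily the approach the comment suggests, is the token_pattern parameter of CountVectorizer. The sketch below assumes that splitting on whitespace is acceptable for your data; the pattern r"\S+" is chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['i want run away!']

# Assumption for this sketch: every whitespace-separated chunk is a token,
# so 'away!' keeps its trailing '!' and the bigram 'run away!' can match.
cv = CountVectorizer(
    vocabulary=vocabulary,
    ngram_range=(1, 2),
    token_pattern=r"\S+",
)
X = cv.fit_transform(corpus)
print(X.toarray())
```

With this pattern the original vocabulary from the question can be used unchanged, at the cost of punctuation staying attached to every token (so 'away' and 'away!' become different terms).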