python - Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?

I'm working with CountVectorizer in scikit-learn, and I'm possibly attempting to do things that the object was not made for... but I'm not sure.

In terms of getting counts of occurrences:

    from sklearn.feature_extraction.text import CountVectorizer

    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    x = cv.fit_transform(corpus)
    print(x.toarray())

gives:

[[0 0 0]]

What I'm realizing is that CountVectorizer breaks the corpus into what I believe are unigrams:

    vocabulary = ['hi', 'bye', 'run']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    x = cv.fit_transform(corpus)
    print(x.toarray())

which gives:

[[0 0 1]]

Is there any way to tell CountVectorizer exactly how you'd like it to vectorize the corpus? Ideally I would like an outcome along the lines of the first example.

In all honesty, however, I'm wondering if it is at all possible to get an outcome along these lines:

    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['i want run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    x = cv.fit_transform(corpus)
    print(x.toarray())

    [[0 0 1]]

I don't see much information in the documentation for the fit_transform method, which takes only one argument as it is. If anyone has any ideas I would be grateful. Thanks!

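The parameter you want is called ngram_range. Pass the tuple (1, 2) to the constructor to get unigrams and bigrams. However, the vocabulary you pass in then needs to be expressed in terms of those n-grams, for example as a dict with the n-grams as keys and integers as values.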

    In [20]: print(CountVectorizer(vocabulary={'hi': 0, 'bye': 1, 'run away': 2}, ngram_range=(1, 2)).fit_transform(['i want run away!']).toarray())
    [[0 0 1]]
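To see which n-grams the vectorizer actually produces for a document, you can inspect the analyzer directly. This is a minimal sketch that assumes the default settings apart from ngram_range; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) makes the analyzer emit unigrams and bigrams.
cv = CountVectorizer(ngram_range=(1, 2))

# build_analyzer() returns the callable that preprocesses, tokenizes
# and builds n-grams; it does not require fitting first.
analyze = cv.build_analyzer()

# List the n-grams generated for the example document.
ngrams = analyze('i want run away!')
print(ngrams)
```

The resulting list contains 'run away' (without the exclamation mark) but not 'i', because the default token pattern only keeps tokens of two or more word characters. Any vocabulary key has to match one of these strings exactly in order to be counted.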

Note that the default tokeniser removes the exclamation mark at the end, so the last token is away. If you want more control over how the string is broken into tokens, follow @BrenBarn's comment.
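One possible way to get that control, sketched here as an assumption rather than as what the comment suggests, is to override token_pattern so that tokens are split on whitespace only. Punctuation then stays attached to the word, and the original vocabulary entry 'run away!' can match.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['i want run away!']

cv = CountVectorizer(
    vocabulary=vocabulary,
    ngram_range=(1, 2),        # unigrams and bigrams
    token_pattern=r'(?u)\S+',  # assumption: any run of non-whitespace is a token
)

# With a fixed vocabulary, fit_transform only counts the listed entries.
x = cv.fit_transform(corpus)
print(x.toarray())
```

With this pattern the document is tokenized as 'i', 'want', 'run', 'away!', so the bigram 'run away!' is generated and counted once in the third column. If a regular expression is not flexible enough, the tokenizer parameter accepts a callable that takes a string and returns a list of tokens.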
