python - Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?


I am working with the CountVectorizer in scikit-learn, and I'm possibly attempting to do things that the object was not made for... but I'm not sure.

In terms of getting counts of occurrences:

    from sklearn.feature_extraction.text import CountVectorizer

    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    x = cv.fit_transform(corpus)
    print x.toarray()

gives:

    [[0 0 0 0]]

What I'm realizing is that the CountVectorizer breaks the corpus into what I believe are unigrams:

    vocabulary = ['hi', 'bye', 'run']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    x = cv.fit_transform(corpus)
    print x.toarray()

which gives:

    [[0 0 1]]
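A quick way to see what the default analyzer is doing (a small sketch of my own; build_analyzer is the standard CountVectorizer helper that returns its tokenising callable):

    from sklearn.feature_extraction.text import CountVectorizer

    analyze = CountVectorizer().build_analyzer()
    print(analyze('run away!'))
    # ['run', 'away'] -- lowercased word unigrams, the '!' is dropped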

Is there a way to tell the CountVectorizer how you'd like it to vectorize the corpus? Ideally I'd like an outcome along the lines of the first example.

In all honesty, however, I'm wondering if it is at all possible to get an outcome along these lines:

    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['I want to run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    x = cv.fit_transform(corpus)
    print x.toarray()

    [[0 0 1]]

I don't see any information in the documentation for the fit_transform method, which only takes one argument as it is. If anyone has any ideas I would be grateful. Thanks!

The parameter you want is called ngram_range. You pass in the tuple (1, 2) to the constructor to get unigrams and bigrams. However, the vocabulary you pass in needs to be a dict with the ngrams as keys and integers as values.

in [20]: print countvectorizer(vocabulary={'hi': 0, u'bye': 1, u'run away': 2}, ngram_range=(1,2)).fit_transform(['i want run away!']).a [[0 0 1]] 

Note that the default tokeniser removes the exclamation mark at the end, so the last token is just "away". If you want more control over how the string is broken into tokens, follow @BrenBarn's comment.
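For instance, if you wanted the "!" kept so that the full phrase "run away!" could sit in the vocabulary, one option is to pass a custom token_pattern that splits only on whitespace (a sketch assuming the default word analyzer; the pattern here is my own choice, not from the original answer):

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(
        vocabulary={'hi': 0, 'bye': 1, 'run away!': 2},
        ngram_range=(1, 2),
        token_pattern=r'\S+',   # keep punctuation attached to the tokens
    )
    print(cv.fit_transform(['I want to run away!']).toarray())
    # [[0 0 1]] -- the bigram 'run away!' now matches the vocabulary entry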

