python - Error when extracting noun-phrases from the training corpus and removing stop words using NLTK
I am new to both Python and NLTK. I have to extract noun phrases from a corpus and then remove the stop words using NLTK. However, my code still has an error. Can you help me fix this problem? Or please recommend a better solution if there is one. Thank you.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

docid = '19509'
title = 'Example noun-phrase and stop words'
print('Document ID:'), docid
print('Title:'), title

# list noun phrases
content = 'This is a sample sentence, showing off the stop words filtration.'
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All noun phrases:'), nouns

# remove stop words
stop_words = set(stopwords.words("english"))
example_words = word_tokenize(nouns)
filtered_sentence = []
for w in example_words:
    if w not in stop_words:
        filtered_sentence.append(w)
print('Without stop words:'), filtered_sentence
And I got the following error:
Traceback (most recent call last):
  File "C:\Users\User\Desktop\NLP\stop_word.py", line 20, in <module>
    example_words = word_tokenize(nouns)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
You are getting this error because the function word_tokenize expects a string as its argument, and you are giving it a list of strings. As far as I understand what you are trying to achieve, you do not need to tokenize at this point. Up until print('All noun phrases:'), nouns you already have all the nouns of the sentence. To remove the stop words from them, you can use:
### Remove stop words ###
stop_words = set(stopwords.words("english"))
# find the nouns that are not in the stopwords
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
# now the sentence is clear
print('Without stop words:', nouns_without_stopwords)
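As a side note, here is a minimal, self-contained snippet (not from the question; the sample strings are made up) showing that word_tokenize accepts a string but raises the TypeError from the traceback above when handed a list. The exact wording of the message depends on your Python and NLTK versions:

from nltk.tokenize import word_tokenize

print(word_tokenize('a sample sentence'))  # OK: ['a', 'sample', 'sentence']

try:
    word_tokenize(['a', 'sample', 'sentence'])  # a list, as in the question
except TypeError as e:
    # on Python 2.7 with older NLTK this prints "expected string or buffer"
    print(e)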
Of course, with the list comprehension above you will have the same result as nouns in this case, because none of these nouns is a stopword.
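For completeness, here is a sketch of the whole corrected script. It is only an illustration: it assumes Python 3 print syntax (the traceback shows Python 2.7, where the question's print statements work as written), it assumes the punkt, averaged_perceptron_tagger and stopwords resources have already been fetched with nltk.download(), and the exact tagger output may vary between NLTK versions:

import nltk
from nltk.corpus import stopwords

docid = '19509'
title = 'Example noun-phrase and stop words'
print('Document ID:', docid)
print('Title:', title)

content = 'This is a sample sentence, showing off the stop words filtration.'

# keep only tokens whose POS tag starts with 'NN' (NN, NNS, NNP, NNPS)
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All noun phrases:', nouns)

# filter the noun list directly; no second tokenization is needed
stop_words = set(stopwords.words('english'))
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
print('Without stop words:', nouns_without_stopwords)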
I hope this helps.