python - Error when extracting noun-phrases from the training corpus and removing stop words using NLTK
I am new to both Python and NLTK. I have to extract noun phrases from a corpus and then remove the stop words using NLTK. However, my code still has an error. Can you help me fix this problem? Or please recommend a better solution if there is one. Thank you.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

docid = '19509'
title = 'Example noun-phrase and stop words'
print('Document ID:'), docid
print('Title:'), title

# list noun phrases
content = 'This is a sample sentence, showing off the stop words filtration.'
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All noun phrases:'), nouns

# remove stop words
stop_words = set(stopwords.words("english"))
example_words = word_tokenize(nouns)
filtered_sentence = []
for w in example_words:
    if w not in stop_words:
        filtered_sentence.append(w)
print('Without stop words:'), filtered_sentence
And I got the following error:
Traceback (most recent call last):
  File "C:\Users\User\Desktop\NLP\stop_word.py", line 20, in <module>
    example_words = word_tokenize(nouns)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
You are getting this error because the function word_tokenize expects a string as its argument, and you are giving it a list of strings. As far as I understand what you are trying to achieve, you do not need to tokenize at this point. Up until print('All noun phrases:'), nouns you already have all the nouns of the sentence. To remove the stop words from them, you can use:
### Remove stop words ###
stop_words = set(stopwords.words("english"))
# find the nouns that are not in the stopwords
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
# now the sentence is clear
print('Without stop words:', nouns_without_stopwords)
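As a side note, here is a minimal, self-contained snippet (not from the question; the sample strings are made up) showing that word_tokenize accepts a string but raises the TypeError from the traceback above when handed a list. The exact wording of the message depends on your Python and NLTK versions:

from nltk.tokenize import word_tokenize

print(word_tokenize('a sample sentence'))  # OK: ['a', 'sample', 'sentence']

try:
    word_tokenize(['a', 'sample', 'sentence'])  # a list, as in the question
except TypeError as e:
    # on Python 2.7 with older NLTK this prints "expected string or buffer"
    print(e)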
Of course, with the list comprehension above you will have the same result as nouns in this case, because none of these nouns is a stopword.
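For completeness, here is a sketch of the whole corrected script. It is only an illustration: it assumes Python 3 print syntax (the traceback shows Python 2.7, where the question's print statements work as written), it assumes the punkt, averaged_perceptron_tagger and stopwords resources have already been fetched with nltk.download(), and the exact tagger output may vary between NLTK versions:

import nltk
from nltk.corpus import stopwords

docid = '19509'
title = 'Example noun-phrase and stop words'
print('Document ID:', docid)
print('Title:', title)

content = 'This is a sample sentence, showing off the stop words filtration.'

# keep only tokens whose POS tag starts with 'NN' (NN, NNS, NNP, NNPS)
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All noun phrases:', nouns)

# filter the noun list directly; no second tokenization is needed
stop_words = set(stopwords.words('english'))
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
print('Without stop words:', nouns_without_stopwords)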
I hope this helps.