python - Create a matrix of tf-idf values
I have a set of documents like:

    d1 = "the sky blue."
    d2 = "the sun bright."
    d3 = "the sun in sky bright."

and a set of words like:

    "sky", "land", "sea", "water", "sun", "moon"
I want to create a matrix like this:

    x       d1      d2      d3
    sky     tf-idf  0       tf-idf
    land    0       0       0
    sea     0       0       0
    water   0       0       0
    sun     0       tf-idf  tf-idf
    moon    0       0       0
It is something like the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. The given link uses the words that occur in the documents themselves, but I need to use the set of words I have mentioned.

If a particular word is present in a document, I put its tf-idf value in the matrix; otherwise I put 0.

Any idea how I might build this sort of matrix? Python would be best, but R is also appreciated.

I am using the following code, but I am not sure whether I am doing the right thing or not. The code is:
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords

    train_set = ("the sky blue.", "the sun bright.", "the sun in sky bright.")  # documents
    test_set = ["sky", "land", "sea", "water", "sun", "moon"]  # query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    # print('fit vectorizer to train set', trainVectorizerArray)
    # print('transform vectorizer to test set', testVectorizerArray)

    transformer.fit(trainVectorizerArray)
    # print(transformer.transform(trainVectorizerArray).toarray())

    transformer.fit(testVectorizerArray)
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())
I am getting absurd results (the values are only 0 and 1, while I am expecting values between 0 and 1):
    [[ 0.  0.  1.  0.]
     [ 0.  0.  0.  0.]
     [ 0.  0.  0.  0.]
     [ 0.  0.  0.  0.]
     [ 0.  0.  0.  1.]
     [ 0.  0.  0.  0.]
     [ 1.  0.  0.  0.]]
I am open to other libraries for calculating tf-idf. I just want the correct matrix as mentioned above.
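A likely reason for the 0/1 output: the code in the question treats each query word as its own one-word document, so every row has at most one non-zero count, and the default L2 normalization in TfidfTransformer scales that single entry to 1. A minimal sketch of one alternative, assuming scikit-learn's TfidfVectorizer with its vocabulary parameter set to the word list (the variable names docs, words and matrix are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the sky blue.", "the sun bright.", "the sun in sky bright."]
words = ["sky", "land", "sea", "water", "sun", "moon"]

# Fix the columns to the given word list instead of learning them from the documents.
vectorizer = TfidfVectorizer(vocabulary=words)

# fit_transform returns a documents x words matrix; transpose it to words x documents.
matrix = vectorizer.fit_transform(docs).toarray().T

# Print one row per word, one column per document (d1, d2, d3).
for word, row in zip(words, matrix):
    print(f"{word:<6}", " ".join(f"{value:.4f}" for value in row))
```

With scikit-learn's defaults (smoothed idf and L2 normalization per document), d1 and d2 each contain a single listed word, so those entries still come out as 1, while sky and sun in d3 come out at roughly 0.71. The numbers therefore differ from a plain tf * log(N/df) weighting; only the shape of the matrix matches the one asked for.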
An R solution could look like this:
    library(tm)
    docs <- c(d1 = "the sky blue.", d2 = "the sun bright.", d3 = "the sun in sky bright.")
    dict <- c("sky", "land", "sea", "water", "sun", "moon")
    mat <- TermDocumentMatrix(Corpus(VectorSource(docs)),
                              control = list(weighting = weightTfIdf, dictionary = dict))
    as.matrix(mat)[dict, ]
    #        Docs
    # Terms          d1        d2        d3
    #   sky   0.5849625 0.0000000 0.2924813
    #   land  0.0000000 0.0000000 0.0000000
    #   sea   0.0000000 0.0000000 0.0000000
    #   water 0.0000000 0.0000000 0.0000000
    #   sun   0.0000000 0.5849625 0.2924813
    #   moon  0.0000000 0.0000000 0.0000000
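For comparison, the same numbers can be reproduced in plain Python. This is only a sketch under assumptions inferred from the R output above: term frequency is the word's count divided by the number of listed words found in the document, and idf is log2(N / df). The helper names tokenize and tfidf_matrix are hypothetical, not part of any library.

```python
import math
import re

docs = {
    "d1": "the sky blue.",
    "d2": "the sun bright.",
    "d3": "the sun in sky bright.",
}
words = ["sky", "land", "sea", "water", "sun", "moon"]


def tokenize(text):
    # Lowercase and keep alphabetic runs only, so "sky." matches "sky".
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs, words):
    # Count only the listed words in each document.
    counts = {
        name: {w: tokenize(text).count(w) for w in words}
        for name, text in docs.items()
    }
    n_docs = len(docs)
    matrix = {}
    for w in words:
        # Document frequency: number of documents that contain the word.
        df = sum(1 for name in docs if counts[name][w] > 0)
        # Assumed idf: log2(N / df); words that never occur get 0.
        idf = math.log2(n_docs / df) if df else 0.0
        row = {}
        for name in docs:
            # Assumed tf: count divided by the listed-word total of the document.
            total = sum(counts[name].values())
            tf = counts[name][w] / total if total else 0.0
            row[name] = tf * idf
        matrix[w] = row
    return matrix


matrix = tfidf_matrix(docs, words)
print("x      " + "".join(f"{name:>10}" for name in docs))
for w in words:
    print(f"{w:<7}" + "".join(f"{matrix[w][name]:>10.4f}" for name in docs))
```

Words from the list that never occur (land, sea, water, moon) produce all-zero rows, which is the layout requested in the question.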