python - Create a matrix of tf-idf values


I have a set of documents like:

    d1 = "the sky blue."
    d2 = "the sun bright."
    d3 = "the sun in sky bright."

and a set of words like:

"sky","land","sea","water","sun","moon" 

I want to create a matrix like this:

    x        d1        d2        d3
    sky      tf-idf    0         tf-idf
    land     0         0         0
    sea      0         0         0
    water    0         0         0
    sun      0         tf-idf    tf-idf
    moon     0         0         0

Something like the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. In the given link, it uses the same words that appear in the documents, but I need to use the set of words that I have mentioned.

If a particular word is present in a document, I put its tf-idf value in the matrix; otherwise I put 0.

Any idea how I might build this sort of matrix? Python would be best, but R is also appreciated.

I am using the following code, but I am not sure whether I am doing the right thing or not. My code is:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords

    train_set = ("the sky blue.", "the sun bright.", "the sun in sky bright.")  # documents
    test_set = ["sky", "land", "sea", "water", "sun", "moon"]  # query
    stop_words = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stop_words)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainvectorizerarray = vectorizer.fit_transform(train_set).toarray()
    testvectorizerarray = vectorizer.transform(test_set).toarray()
    # print('fit vectorizer train set', trainvectorizerarray)
    # print('transform vectorizer test set', testvectorizerarray)

    transformer.fit(trainvectorizerarray)
    # print(transformer.transform(trainvectorizerarray).toarray())

    transformer.fit(testvectorizerarray)

    tfidf = transformer.transform(testvectorizerarray)
    print(tfidf.todense())

I am getting absurd results (the values are only 0 and 1, while I am expecting values between 0 and 1).

    [[ 0.  0.  1.  0.]
     [ 0.  0.  0.  0.]
     [ 0.  0.  0.  0.]
     [ 0.  0.  0.  0.]
     [ 0.  0.  0.  1.]
     [ 0.  0.  0.  0.]
     [ 1.  0.  0.  0.]]

I am open to other libraries for calculating tf-idf. I just want a correct matrix like the one mentioned above.

An R solution could look like this:

    library(tm)
    docs <- c(d1 = "the sky blue.",
              d2 = "the sun bright.",
              d3 = "the sun in sky bright.")
    dict <- c("sky", "land", "sea", "water", "sun", "moon")
    mat <- TermDocumentMatrix(Corpus(VectorSource(docs)),
                              control = list(weighting = weightTfIdf,
                                             dictionary = dict))
    as.matrix(mat)[dict, ]
    #         Docs
    # Terms          d1        d2        d3
    #   sky   0.5849625 0.0000000 0.2924813
    #   land  0.0000000 0.0000000 0.0000000
    #   sea   0.0000000 0.0000000 0.0000000
    #   water 0.0000000 0.0000000 0.0000000
    #   sun   0.0000000 0.5849625 0.2924813
    #   moon  0.0000000 0.0000000 0.0000000
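Since Python was the preferred language, here is a rough plain-Python sketch of the same idea. It assumes a particular weighting, inferred from the numbers in the R output rather than taken from any documentation: term frequency is the word's count divided by the number of dictionary words in the document, and idf is log2(number of documents / number of documents containing the word), with words that appear nowhere given 0. The function name `tfidf_matrix` and the simple regex tokenizer are my own choices for illustration.

```python
import math
import re

docs = {
    "d1": "the sky blue.",
    "d2": "the sun bright.",
    "d3": "the sun in sky bright.",
}
words = ["sky", "land", "sea", "water", "sun", "moon"]


def tfidf_matrix(docs, words):
    """Return {word: {doc_name: tf-idf}} restricted to the given word list.

    Assumed weighting: tf = count / (dictionary words in the document),
    idf = log2(n_docs / document frequency), and 0 for unseen words.
    """
    vocab = set(words)

    # Count only the dictionary words in each document.
    counts = {}
    for name, text in docs.items():
        counts[name] = {w: 0 for w in words}
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in vocab:
                counts[name][token] += 1

    n_docs = len(docs)
    matrix = {}
    for w in words:
        # Document frequency: in how many documents does the word occur?
        df = sum(1 for name in docs if counts[name][w] > 0)
        idf = math.log2(n_docs / df) if df else 0.0
        row = {}
        for name in docs:
            total = sum(counts[name].values())
            tf = counts[name][w] / total if total else 0.0
            row[name] = tf * idf
        matrix[w] = row
    return matrix


matrix = tfidf_matrix(docs, words)
print("word  ", "  ".join(name.ljust(9) for name in docs))
for w in words:
    print(w.ljust(6), "  ".join(f"{matrix[w][d]:.7f}" for d in docs))
```

If you would rather stay with scikit-learn, `TfidfVectorizer` accepts a `vocabulary` argument that restricts the columns to a fixed word list. Its numbers will not match the ones above, though, because by default it uses a smoothed natural-log idf and L2-normalizes each document vector.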
