python - Count all possible 2-grams in each row


Say I have a CSV file like this (in reality there are 100+ different possible services):

    user_id, services
    user_1, "s1,s2,s1,s4,s2,s3,s2"
    user_2, "s2,s3,s2,s1,s4"

and I would like to get this, using Python, with pandas if possible:

    user_id, c12,c21,c13,c31,c14,c42,c23,c32,c14,c43,c34
    user_1, 1,1,0,0,1,1,1,1,0,0,0
    user_2, 0,1,0,0,0,0,1,1,1,0,1

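where cij = the number of times the sequence si,sj occurs for each user.

Ideally, it should be usable not only for sequences of 2 but also for sequences of 3.

What I have found so far only gives the overall count of si,sj, not the count for each user. I guess I need a pivot table at some point, and n-grams, but I don't know how to mix them :/

Thanks for your help.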

Recreating the data (but with the services column already split into separate columns):

    import pandas as pd

    df = pd.DataFrame()
    df['user_id'] = [1, 2]
    df['s1'] = [0, 1]
    df['s2'] = [1, 1]
    df['s3'] = [1, 0]

Then you can combine the columns:

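    import itertools

    # service columns only (skip user_id)
    cols = list(df)[1:]
    for c1, c2 in itertools.permutations(cols, 2):
        # 1 if the user has both services, 0 otherwise
        df[c1 + c2] = df[c1] & df[c2]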

By changing the 2 to a 3 (and unpacking three column names in the loop) you can add 3-grams instead of 2-grams.
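As a minimal sketch of that 3-column variant, assuming the same toy frame as above (the name `service_cols` is only illustrative):

```python
import itertools
import pandas as pd

# Same toy frame as above: one 0/1 indicator column per service.
df = pd.DataFrame({
    'user_id': [1, 2],
    's1': [0, 1],
    's2': [1, 1],
    's3': [1, 0],
})

# Freeze the list of service columns before new columns are added.
service_cols = list(df.columns)[1:]

# Every ordered triple of distinct service columns.
for c1, c2, c3 in itertools.permutations(service_cols, 3):
    # 1 only if the user has all three services
    df[c1 + c2 + c3] = df[c1] & df[c2] & df[c3]

print(df.filter(like='s1s2').head())
```

Note that `&` is symmetric, so this only records which services co-occur for a user; `s1s2s3` and `s3s2s1` always hold the same values and the order of the services in the original sequence is lost. In this toy frame no user has all three services, so every triple column is 0.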

Edit:

I understand the problem better now. Below is a solution that works on the strings. First create the data:

    import pandas as pd

    df = pd.DataFrame([['user1', "s1,s2,s1,s4,s2,s3,s2"],
                       ['user2', "s2,s3,s2,s1,s4"]])
    df.columns = ['userid', 'services']

For the n-grams we use a flexible function (as you indicated that you might want to use higher-order n-grams):

    def find_ngrams(input_list, n):
        # zip the list with n-1 shifted copies of itself
        return zip(*[input_list[i:] for i in range(n)])
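To see what the function yields, here is a small usage sketch on the second user's sequence from the question (the variable names are illustrative):

```python
def find_ngrams(input_list, n):
    # Zip the list with n-1 shifted copies of itself;
    # zip stops as soon as the shortest copy is exhausted.
    return zip(*[input_list[i:] for i in range(n)])

services = "s2,s3,s2,s1,s4".split(',')

bigrams = list(find_ngrams(services, 2))
trigrams = list(find_ngrams(services, 3))

print(bigrams)
# [('s2', 's3'), ('s3', 's2'), ('s2', 's1'), ('s1', 's4')]
print(trigrams)
# [('s2', 's3', 's2'), ('s3', 's2', 's1'), ('s2', 's1', 's4')]
```

A list of length m gives m - n + 1 tuples, so the same function covers 2-grams, 3-grams and beyond.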

Then we mark the pairs that occur for each user and create a dataframe:

    results = {}
    for idx, row in df.iterrows():
        list_of_services = row['services'].split(',')
        combinations = ['c_{}_{}'.format(c1, c2)
                        for c1, c2 in find_ngrams(list_of_services, 2)]
        # 1 marks that the pair occurs at least once for this user
        results[row['userid']] = {k: 1 for k in combinations}

    df2 = pd.DataFrame.from_dict(results).transpose()

For the toy example this returns:

           c_s1_s2  c_s1_s4  c_s2_s1  c_s2_s3  c_s3_s2  c_s4_s2
    user1      1.0      1.0      1.0      1.0      1.0      1.0
    user2      NaN      1.0      1.0      1.0      1.0      NaN
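The table above only flags whether a pair occurs. If actual per-user counts with 0 instead of NaN are wanted, one possible variation is to tally the labels with `collections.Counter`. This is a sketch, not part of the answer above: the helper name `ngram_counts` and the extra row `user3` are hypothetical, added only so that one pair shows up twice.

```python
from collections import Counter
import pandas as pd

def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

def ngram_counts(df, n=2):
    """Hypothetical helper: per-user n-gram counts as an integer table."""
    rows = {}
    for _, row in df.iterrows():
        services = row['services'].split(',')
        labels = ['c_' + '_'.join(gram) for gram in find_ngrams(services, n)]
        # Counter tallies repeated n-grams instead of just flagging them
        rows[row['userid']] = dict(Counter(labels))
    table = pd.DataFrame.from_dict(rows, orient='index')
    # n-grams a user never produced become 0 rather than NaN
    return table.fillna(0).astype(int).sort_index(axis=1)

df = pd.DataFrame(
    [['user1', "s1,s2,s1,s4,s2,s3,s2"],
     ['user2', "s2,s3,s2,s1,s4"],
     ['user3', "s1,s2,s1,s2"]],  # hypothetical row, not in the question
    columns=['userid', 'services'],
)

counts = ngram_counts(df, 2)
print(counts)
```

Calling `ngram_counts(df, 3)` builds the same kind of table for 3-grams, with column names such as `c_s2_s3_s2`.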
