python - count all possible 2-grams in each row
Let's say I have a CSV file like this (in reality there are 100+ different possible services):

    user_id, services
    user_1, "s1,s2,s1,s4,s2,s3,s2"
    user_2, "s2,s3,s2,s1,s4"

and I want to get this, using Python, with pandas if possible:

    user_id, c12,c21,c13,c31,c14,c42,c23,c32,c14,c43,c34
    user_1, 1,1,0,0,1,1,1,1,0,0,0
    user_2, 0,1,0,0,0,0,1,1,1,0,1

where cij = the number of times the sequence si,sj occurs for each user.
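Ideally, the approach should be usable not only for sequences of 2 but also for sequences of 3.

What I have found so far gives the overall count of si,sj, not a count for each user. I guess I need a pivot table at some point, and n-grams, but I don't know how to mix the two :/

Thanks for your help.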
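Recreating the data (but with the services column already split into separate columns):

    import itertools
    import pandas as pd

    # One 0/1 indicator column per service
    df = pd.DataFrame()
    df['user_id'] = [1, 2]
    df['s1'] = [0, 1]
    df['s2'] = [1, 1]
    df['s3'] = [1, 0]

Then you can combine them:

    cols = list(df)[1:]  # every column except user_id
    for c1, c2 in itertools.permutations(cols, 2):
        # 1 only if the user has both services
        df[c1 + c2] = df[c1] & df[c2]

By changing the 2 to a 3 (and unpacking three names in the loop) you can add 3-grams instead of 2-grams.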
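To make the switch from 2 to 3 concrete, here is a minimal sketch that wraps the loop in a helper. The name `add_combination_flags` and its parameter `n` are made up for illustration. Note that these columns only flag whether a user has every service in the combination; because `&` is symmetric, `s1s2` and `s2s1` come out identical, so the order of the services is not captured.

```python
import itertools
import operator
from functools import reduce

import pandas as pd


def add_combination_flags(df, cols, n):
    """Hypothetical helper: add one 0/1 column per ordered n-tuple of cols."""
    out = df.copy()
    for combo in itertools.permutations(cols, n):
        # AND together the indicator columns of every service in the tuple
        out["".join(combo)] = reduce(operator.and_, (df[c] for c in combo))
    return out


df = pd.DataFrame({"user_id": [1, 2], "s1": [0, 1], "s2": [1, 1], "s3": [1, 0]})
cols = list(df)[1:]  # every column except user_id

pairs = add_combination_flags(df, cols, 2)    # adds s1s2, s1s3, s2s1, ...
triples = add_combination_flags(df, cols, 3)  # adds s1s2s3, s1s3s2, ...
```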
Edit:

I understand the problem better now. The solution below works on the strings directly. First create the data:
    import pandas as pd

    df = pd.DataFrame([['user1', "s1,s2,s1,s4,s2,s3,s2"],
                       ['user2', "s2,s3,s2,s1,s4"]])
    df.columns = ['userid', 'services']

For the n-grams, use a flexible function (as you indicated, you might want to use higher-order n-grams):

    def find_ngrams(input_list, n):
        # Zip the list against copies of itself shifted by 1, ..., n-1
        return zip(*[input_list[i:] for i in range(n)])

Then we count the occurrences and create a dataframe:

    results = {}
    for idx, row in df.iterrows():
        list_of_services = row['services'].split(',')
        combinations = ['c_{}_{}'.format(c1, c2)
                        for c1, c2 in find_ngrams(list_of_services, 2)]
        # Mark each 2-gram that occurs for this user
        results[row['userid']] = {k: 1 for k in combinations}

    pd.DataFrame.from_dict(results).transpose()

For the toy example this returns:
           c_s1_s2  c_s1_s4  c_s2_s1  c_s2_s3  c_s3_s2  c_s4_s2
    user1      1.0      1.0      1.0      1.0      1.0      1.0
    user2      NaN      1.0      1.0      1.0      1.0      NaN
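The dictionary in the loop stores 1 for every 2-gram that appears, so a pair that occurs several times for one user would still show 1, and missing pairs come out as NaN. If actual counts and zeros are wanted, as in the question, one possible variation is sketched below. It assumes the same `userid` and `services` column names; the function name `ngram_counts` and its parameters are hypothetical, and `collections.Counter` is swapped in for the dict comprehension. The same function handles 3-grams by passing `n=3`.

```python
from collections import Counter

import pandas as pd


def find_ngrams(input_list, n):
    # Zip the list against copies of itself shifted by 1, ..., n-1
    return zip(*[input_list[i:] for i in range(n)])


def ngram_counts(df, n, id_col="userid", seq_col="services"):
    """Hypothetical helper: one row per user, one column per n-gram."""
    rows = {}
    for _, row in df.iterrows():
        services = row[seq_col].split(",")
        labels = ["c_" + "_".join(gram) for gram in find_ngrams(services, n)]
        # Counter keeps the number of occurrences, not just presence
        rows[row[id_col]] = dict(Counter(labels))
    out = pd.DataFrame.from_dict(rows, orient="index")
    # n-grams a user never produced become 0 instead of NaN
    return out.fillna(0).astype(int).sort_index(axis=1)


df = pd.DataFrame(
    [["user1", "s1,s2,s1,s4,s2,s3,s2"], ["user2", "s2,s3,s2,s1,s4"]],
    columns=["userid", "services"],
)

bigrams = ngram_counts(df, 2)
trigrams = ngram_counts(df, 3)
```

With 100+ services only the n-grams that actually occur get a column, which keeps the frame much smaller than one column per possible pair.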