dataframe - Create internal python loop based on index and groupings for all combinations -
i have script looks @ rows , columns headers belonging group (reg_id) , sums values. code runs on matrix (small subset) follows:
my code runs calculating sum ids based on rows , columns belonging each internal group (reg_id). example row , column ids belong reg_id 1 summed total flows between region 1 , region 1 (internal flows) calculated , on each region. wish extend code calculating (summing) flows between regions example region 1 region 2, 3, 4 ,5.... figure need include loop within existing while loop appreciate figure out should , how construct it. code runs on internal flow sum (1-1, 2-2, 3-3 etc) follows:
global index index = 1 x = index while index < len(idgroups): ward_list = idgroups[index] #select list of ward ids each region list of lists df6 = mergedcsv.loc[ward_list] #select rows values in list dfcols = mergedcsv.loc[ward_list, :] #select columns values in list ward_liststr = map(str, ward_list) #convert ward_list strings can used select columns, won't work integers. ward_listint = map(int, ward_list) #dfrowscols = mergedcsv.loc[ward_list, ward_listint] df7 = df6.loc[:, ward_liststr] print df7 regflowsum = df7.values.sum() #sum values in dataframe intflow = [regflowsum] print intflow dfintflow = pd.dataframe(intflow) dfintflow.reset_index(level=0, inplace=true) dfintflow.columns = ["regid", "regflowsum"] dfflows.set_value(index, 'regid', index) dfflows.set_value(index, 'regid2', index) dfflows.set_value(index, 'regflow', regflowsum) mergedcsv.set_value(ward_list, 'totregflows', regflowsum) index += 1 #increment index number print dfflows new_df = pd.merge(pairlist, dfflows, how='left', left_on=['origid','destid'], right_on = ['regid', 'regid2']) print new_df #useful checking dataframe merges regionflows = r"c:\temp\allni\regionflows.csv" header = ["wardid","label","reg_id","total","totregflows"] mergedcsv.to_csv(regionflows, columns = header, index=false) regregflows = r"c:\temp\allni\reg_regflows.csv" headerreg = ["reg_id_orig", "reg_id_dest", "flow"] pairlistcsv = r"c:\temp\allni\pairlist_regions.csv" new_df.to_csv(pairlistcsv) the output follows:
idgroups dataframe: (see image 1 - second part of image 1)
df7 , intflows each region reg_id:(third part of image 1 - on right)
ddflows dataframe:(fourth part of image 2)
and final output new_df:(fifth part of image 2)
i wish populate sums possible combinations of flows between regions not internal.
i figure need add loop while loop. possibly add enumerate function like:
while index < len(idgroups): #add line(s) calculate flows between regions index, item in enumerate(idgroups): ward_list = idgroups[index] print ward_list df6 = mergedcsv.loc[ward_list] #select rows values in list dfcols = mergedcsv.loc[ward_list, :] #select columns values in list ward_liststr = map(str, ward_list) #convert ward_list strings can used select columns, won't work integers. ward_listint = map(int, ward_list) #dfrowscols = mergedcsv.loc[ward_list, ward_listint] df7 = df6.loc[:, ward_liststr] print df7 regflowsum = df7.values.sum() #sum values in dataframe intflow = [regflowsum] print intflow dfintflow = pd.dataframe(intflow) dfintflow.reset_index(level=0, inplace=true) dfintflow.columns = ["regid", "regflowsum"] dfflows.set_value(index, 'regid', index) dfflows.set_value(index, 'regid2', index) dfflows.set_value(index, 'regflow', regflowsum) mergedcsv.set_value(ward_list, 'totregflows', regflowsum) index += 1 #increment index number i'm unsure how integrate item struggling extend code combinations. advice appreciated.
update based on flows function:
w=pysal.rook_from_shapefile("c:/temp/allni/niw01_sort.shp",idvariable='label') simil = pysal.open("c:/temp/allni/simni.csv") similarity = np.array(simil) db = pysal.open('c:\temp\sqlite\matrixcsv2.csv', 'r') dbf = pysal.open(r'c:\temp\allni\niw01_sortc.dbf', 'r') ids = np.array((dbf.by_col['label'])) commuters = np.array((dbf.by_col['total'],dbf.by_col['idno'])) commutersint = commuters.astype(int) comm = commutersint[0] floor = int(min_com_ct + 100) solution = pysal.region.maxp(w=w,z=similarity,floor=floor,floor_variable=comm) regions = solution.regions #print regions writecsv = r"c:\temp\allni\reg_output.csv" csv = open(writecsv,'w') csv.write('"label","reg_id"\n') in range(len(regions)): lines in regions[i]: csv.write('"' + lines + '","' + str(i+1) + '"\n') csv.close() flows = r"c:\temp\sqlite\matrixcsv2.csv" regs = r"c:\temp\allni\reg_output.csv" wardflows = pd.read_csv(flows) regoutput = pd.read_csv(regs) merged = pd.merge(wardflows, regoutput) #duplicate reg_id column index used later merged['reg_id2'] = merged['reg_id'] merged.to_csv("c:\temp\allni\merged.csv", index=false) mergedcsv = pd.read_csv("c:\temp\allni\merged.csv",index_col='wardid_1') #index dataframe using wardid_1 column flabellist = pd.read_csv("c:\temp\allni\merged.csv", usecols = ["wardid", "reg_id"]) #create list of flabel values reg_id = "reg_id" ward_flows = "regintflows" flds = [reg_id, ward_flows] #create list of fields use in search dict_ref = {} # create dictionary each reg_id list of corresponding flabel fields #group dataframe reg_id column idgroups = flabellist.groupby('reg_id')['wardid'].apply(lambda x: x.tolist()) print idgroups idgrp_df = pd.dataframe(idgroups) csvcols = mergedcsv.columns #create list of column names pass index select columns columnlist = list(mergedcsv.columns.values) mergedcsvgroup = mergedcsv.groupby('reg_id').sum() mergedcsvgroup.describe() idlist = idgroups[2] df4 = pd.dataframe() df5 = pd.dataframe() col_ids = idlist #ward id no regiddf = idgroups.index.get_values() print regiddf #total number of region ids #print regiddf #create pairlist combinations region ids #combinations replacement allows repeated items #pairs = list(itertools.combinations_with_replacement(regiddf, 2)) pairs = list(itertools.product(regiddf, repeat=2)) #print len(pairs) #create new dataframe pairlists , summed data pairlist = pd.dataframe(pairs,columns=['origid','destid']) print pairlist.tail() header_pairlist = ["origid","destid","flow"] header_intflow = ["regid", "regid2", "regflow"] dfflows = pd.dataframe(columns=header_intflow) print mergedcsv.index print mergedcsv.dtypes #mergedcsv = mergedcsv.select_dtypes(include=['int64']) #print mergedcsv.columns #mergedcsv.rename(columns = lambda x: int(x), inplace=true) def flows(): pass #def flows(mergedcsv, region_a, region_b): def flows(mergedcsv, ward_lista, ward_listb): """return sum of cells in row/column intersections of ward_lista , ward_listb.""" mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64'] regionflows = mergedcsv.loc[ward_lista, ward_listb] regionflowsum = regionflows.values.sum() #grid = [ax, bx, regflowsuma, regflowsumb] gridoutput = [ax, bx, regionflowsum] print gridoutput return regflowsuma return regflowsumb #print mergedcsv.index #mergedcsv.columns = mergedcsv.columns.str.strip() ax, group_a in enumerate(idgroups): ward_lista = map(int, group_a) print ward_lista bx, group_b in enumerate(idgroups[ax:], start=ax): ward_listb = map(int, group_b) #print ward_listb flow_ab = flows(mergedcsv, ward_lista, ward_listb) #flow_ab = flows(mergedcsv, group_a, group_b) this results in keyerror: 'none of [[189, 197, 198, 201]] in [columns]'
i have tried using ward_lista = map(str, group_a) , map(int, group_a) list objects not found in dataframe.loc. columns mixed datatypes columns containing labels should sliced of type int64. have tried many solutions around datatypes no avail. suggestions?
i can't speak computations you're doing, seems want arrange combinations of groups. question whether directed or undirected- is, need compute flows(a,b) , flows(b,a), or one?
if one, this:
for i,ward_list in enumerate(idgroups): j,ward_list2 in enumerate(idgroups[i:],start=i): this iterate on i,j pairs like:
0,0 0,1 0,2 ... 0,n 1,1 1,2 ... 1,n 2,2 ... 2,n which serve in undirected case.
if need compute both flows(a,b) , flows(b,a), can push code function called flows, , call reversed args, shown. ;-)
update
let's define function called flows:
def flows(): pass now, parameters?
well, looking @ code, gets data dataframe. , want 2 different wards, let's start those. result seems sum of resulting grid.
def flows(df, ward_a, ward_b): """return sum of cells in row/column intersections of ward_a , ward_b.""" return 0 now i'm going copy lines of code:
ward_list = idgroups[index] print ward_list df6 = mergedcsv.loc[ward_list] #select rows values in list dfcols = mergedcsv.loc[ward_list, :] #select columns values in list ward_liststr = map(str, ward_list) #convert ward_list strings can used select columns, won't work integers. ward_listint = map(int, ward_list) #dfrowscols = mergedcsv.loc[ward_list, ward_listint] df7 = df6.loc[:, ward_liststr] print df7 regflowsum = df7.values.sum() #sum values in dataframe intflow = [regflowsum] print intflow i think of flow function right here. let's look.
the
ward_listeitherward_aorward_bparameters.i'm not sure
df6is, because sort of recompute indf7. need clarified.regflowsumour desired output, think.
rewriting function:
def flows(df, ward_a, ward_b): """return sum of cells in row/column intersections of ward_a , ward_b.""" print "computing flows from:" print " ", ward_a print "" print "flows into:" print " ", ward_b # filter rows ward_a, cols ward_b: grid = df.loc[ward_a, ward_b] print "grid:" print grid flowsum = grid.values.sum() print "flows:", flowsum return flowsum now, have assumed ward_a , ward_b values in correct format. we'll have str-ify them or whatever outside function. let's that:
for ax, group_a in enumerate(idgroups): ward_a = map(str, group_a) bx, group_b in enumerate(idgroups[ax:], start=ax): ward_b = map(str, group_b) flow_ab = flows(mergedcsv, ward_a, ward_b) if ax != bx: flow_ba = flows(mergedcsv, ward_b, ward_a) else: flow_ba = flow_ab # what? at point, have 2 numbers. equal when wards same (internal flow?). @ point original code stops being helpful because deals internal flows, , not a->b flows, don't know do. values in variables, ...
Comments
Post a Comment