Python/Pandas: How to combine cumsum and cumcount with agg function? -
i have dataframe grouping internal score , issue date (by quarter). want create statistical table includes cumulative counts of number of loans (represented distinct count of loan #), cumulative sum of loan amounts, , sums of actual loss , outstanding principal. cumulative sum , cumulative count should include snapshot of first date until particular point in time. (i.e. cumulative sum of q1 2015 q2 2015, q1 2015 q3 2015, q1 2015 q4 2015, etc.)
sample dataset:
loan # amount issue date tu status list internal score last actual paid \ 0 57144 3337.76 2017-04-03 b 0.0 1 57145 5536.46 2017-04-03 b c 0.0 2 57160 3443.91 2017-04-03 b b 0.0 3 57161 1162.79 2017-04-03 b b 0.0 4 57162 3845.98 2017-04-03 b b 0.0 5 57163 3441.50 2017-04-03 b b 0.0 6 57164 2039.96 2017-04-03 b c 0.0 7 57165 4427.53 2017-04-03 b 0.0 8 57166 4427.53 2017-04-03 b 0.0 9 57167 1617.77 2017-04-03 b b 0.0 outstanding-principal actual loss 0 3337.76 0.0 1 5536.46 0.0 2 3443.91 0.0 3 1162.79 0.0 4 3845.98 0.0 5 3441.50 0.0 6 2039.96 0.0 7 4427.53 0.0 8 4427.53 0.0 9 1617.77 0.0 i tried this:
container = [] in ['a', 'b', 'c', 'd']: subdf = df[df['internal score'].str.contains(i)] # calculate quarterly vintages subdf.set_index('issue date', inplace=true) df2 = subdf.groupby(pd.timegrouper('q')).agg({'outstanding-principal': np.sum, 'actual loss': np.sum, 'amount': cumsum, 'loan #': cumcount}) df2['internal score'] = container.append(df2) ddf = pd.concat(container)
you can first use groupby , apply cumsum afterwards.
i modified dummy data while changing dates span across quarters make example more clear:
print(df) loan # amount issue date internal score outstanding principal actual loss 0 57144 3337.76 2017-04-03 3337.76 0.0 1 57145 5536.46 2017-04-03 c 5536.46 0.0 2 57160 3443.91 2017-04-03 b 3443.91 0.0 3 57161 1162.79 2017-04-03 b 1162.79 0.0 4 57162 3845.98 2017-04-03 b 3845.98 0.0 5 57163 3441.50 2017-07-03 b 3441.50 0.0 6 57164 2039.96 2017-07-03 c 2039.96 0.0 7 57165 4427.53 2017-07-03 4427.53 0.0 8 57166 4427.53 2017-07-03 4427.53 0.0 9 57167 1617.77 2017-07-03 b 1617.77 0.0 first, create column containing key identifies quarter , year of given timestamp:
# in case not timestamp df["issue date"] = pd.to_datetime(df["issue date"]) dt = df["issue date"].dt df["quarter"] = dt.strftime("%y").str.cat(dt.quarter.astype(str), " q") print(df["quarter"]) 0 2017 q2 1 2017 q2 2 2017 q2 3 2017 q2 4 2017 q2 5 2017 q3 6 2017 q3 7 2017 q3 8 2017 q3 9 2017 q3 name: quarter, dtype: object now, aggreagte:
funcs = {'outstanding principal': np.sum, 'actual loss': np.sum, 'amount': np.sum, 'loan #': len} result = df.groupby(['internal score', "quarter"]).agg(funcs) print(result) outstanding principal amount actual loss loan # internal score quarter 2017 q2 3337.76 3337.76 0.0 1 2017 q3 8855.06 8855.06 0.0 2 b 2017 q2 8452.68 8452.68 0.0 3 2017 q3 5059.27 5059.27 0.0 2 c 2017 q2 5536.46 5536.46 0.0 1 2017 q3 2039.96 2039.96 0.0 1 and use transform cumsum:
cum_cols = ["amount", "loan #"] cumsums = result.groupby(level="internal score")[cum_cols].transform(lambda x: x.cumsum()) result.loc[:, cum_cols] = cumsums print(result) outstanding principal amount actual loss loan # internal score quarter 2017 q2 3337.76 3337.76 0.0 1 2017 q3 8855.06 12192.82 0.0 3 b 2017 q2 8452.68 8452.68 0.0 3 2017 q3 5059.27 13511.95 0.0 5 c 2017 q2 5536.46 5536.46 0.0 1 2017 q3 2039.96 7576.42 0.0 2
Comments
Post a Comment