In this short blog post we are going to see how to flatten your pandas dataframe after aggregation operation.
Why flatten your columns?
Imagine working with your dataframe as you usually do on SQL Server: you apply different operations, like join, aggregate, select etc. However after running an aggregation function on your pandas dataframe, you have multilevel column headers which are difficult to manipulate. After looking through StackOverflow questions, I've finally created a simple and flexible function which allows me to convert my aggregated dataset back into the initial format. So if this code snippet saves you a couple of hours I will be completely satisfied. Up we go!
Why this happens?
Let's have a look at this example. Here we have a dataframe, which we want to aggregate.
pe_odds[ [ 'EVENT_ID', 'SELECTION_ID', 'ODDS' ] ]
-------------------------------------------------------------------
Out[67]:
EVENT_ID SELECTION_ID ODDS
0 100429300 5297529 18.001 100429300 5297529 20.002 100429300 5297529 21.003 100429300 5297529 22.004 100429300 5297529 23.005 100429300 5297529 24.006 100429300 5297529 25.00
And then when you group by Event Id and Selection Id, here's what you've got, a frame with mulit-index columns
pe_odds.groupby( [ 'EVENT_ID', 'SELECTION_ID' ] )[ 'ODDS' ].agg( [ np.min, np.max ] )
-------------------------------------------------------------------
Out[68]:
amin amax
EVENT_ID SELECTION_ID
100428417 5490293 1.71 1.715881623 1.14 1.355922296 2.00 2.005956692 2.00 2.02100428419 603721 2.44 2.904387436 4.30 6.204398859 1.23 1.354574687 1.35 1.464881396 14.50 19.006032606 2.94 4.206065580 2.70 5.806065582 2.42 3.65100428421 5911426 2.22 2.52
Pretty ugly, right?
Quote from Pandas:
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
Undoubtedly, it is exciting, but then it gets relatively complicated to manipalte my dataframe, whereas I simply want to manipulate it as I used to do via good old SQL commands (this is definitely due to the lack of my knowledge, I admit, but usually the goal of our work is to get the things done, with the least effort).
All the suggestions on StackOverflow was either irrelevant, or required too much modifications to my initial dataframe, or were not sufficiently flexible. If we make a summary of what have been suggested on the Internet, the workarounds are:
Set the columns to the top level (using DataFrame.get_level_values(0))
Use to_flat_index() which has been released starting from pandas 0.24.0 (still there may be some issues with column naming)
Converting columns to list and then joining them all together
And in my case, it was even worse as I had two aggregation operations, mean and std. So I created simple function, to get rig of this noisy Multi-index columns (again, they are great, it's just that they were useless in my project). Here you are:
def flatten_columns(columns):
return [first if second == '' else second for first, second in columns.values]
And inside the main body
# e.g construct a correlation between columns
correlation_dataframe = generate_correlation_dataframe(unpivoted_dataframe, show_status)
# then aggregate using mean and standard deviation
aggregated_dataframe = correlation_dataframe.groupby(['Year', 'Sequence', 'Variable'],as_index=False).agg({'Correlation':['mean', 'std']})
# and flatten
aggregated_dataframe.columns = flatten_columns(aggregated_dataframe.columns)
It worked out for me, but if this does not work for you, I would be very grateful if you provided your experience of flattening the pandas Multi-index rows here.
Again, hope you will find it useful
May the force be with you!
Comentarios