If you ever tried to pivot a table containing non-numeric values, you have surely been struggling with any spreadsheet app to do it easily. There is, apparently, a VBA add-in for excel. Or you have to use Access, which should be fine for these kind of operations. The problem with spreadsheets is that by default they aggregate or sum your data, and when it comes to strings there usually is no straightforward workaround.
However, if you wanna do it with 9 (nine!) lines of code, then a panda is your friend :)
In my case, the raw data was shaped like this:
*id, key, value* *id1, size, 55* *id1, material, wood* *id2, size, 100* *id2, material, iron* *id2, shape, round* *...*
And I wanted to pivot it to: id, size, material, shape id1, 55, wood, , id2, 100, iron, round …
Here’s the code, the big point is the lambda function. To return strings it’s usually set as “lambda x: ‘ ‘.join(x)”. But this will return a boolean. In the aggfunc field you’ll need to use that small loop to return every specific value.
import pandas as pd import csv df = pd.read_csv('data_raw.csv') df pivot_table = df.pivot_table(index=['id'], columns=['key'], values=['value'], aggfunc=lambda x: ' '.join(str(v) for v in x)) pivot_table.to_csv('data_pivot.csv')
I reckon this is cool (hence worth sharing) for two reasons:
- Works with big amounts of data
- You don’t have to worry about eterogenity of keys (it will just be a column more in your results!)
- It’s fast and open-source! :)
###Hanging out with the big guys?
If you’re working with large datasets this method will return a memory error. You can avoid it (I used it on a 15gb dataset) reading your dataset chunk by chunk, like this:
### Beware of the Panda. ### 'You have no idea how tidy this is gonna get' from pandas import * import csv import numpy as np df = pandas.read_csv(‘data_raw.csv’, sep=” “, chunksize=5000) df appended_data =  for chunk in df: pivot_table = chunk.pivot_table(index=[‘id’], columns=[‘key’], values=[‘value’], aggfunc=lambda x: ‘ ‘.join(str(v) for v in x)) appended_data.append(pivot_table) appended_data = pandas.concat(appended_data, axis=0).reset_index() appended_data.to_csv('data_clean.csv', sep=",") #and if you wanna clean it a little bit where the chunk trunks it: appended_data_clean = appended_data.groupby('id', sort=True).agg(np.sum) appended_data_clean.to_csv('dati_clean_crunched.csv', sep=",")