The power of the Panda
If you ever tried to pivot a table containing non-numeric values, you have surely been struggling with any spreadsheet app to do it easily. There is, apparently, a VBA add-in for excel. Or you have to use Access, which should be fine for these kind of operations. The problem with spreadsheets is that by default they aggregate or sum your data, and when it comes to strings there usually is no straightforward workaround.
However, if you wanna do it with 9 (nine!) lines of code, then a panda is your friend :)
In my case, the raw data was shaped like this:
*id, key, value* *id1, size, 55* *id1, material, wood* *id2, size, 100* *id2, material, iron* *id2, shape, round* *...*
And I wanted to pivot it to:
*id, size, material, shape* *id1, 55, wood, ,* *id2, 100, iron, round* *...*
Here’s the code, the big point is the lambda function. To return strings it’s usually set as “lambda x: ‘ ‘.join(x)”. But this will return a boolean. In the aggfunc field you’ll need to use that small loop to return every specific value.
I reckon this is cool (hence worth sharing) for two reasons:
– Works with big amounts of data – You don’t have to worry about eterogenity of keys (it will just be a column more in your results!) – It’s fast and open-source! :)
Hanging out with the big guys?
If you’re working with large datasets this method will return a memory error. You can avoid it (I used it on a 15gb dataset) reading your dataset chunk by chunk, like this: