Back to Posts

Creating non-numeric pivot tables with Python Pandas

Posted in datascience

The power of the Panda

If you ever tried to pivot a table containing non-numeric values, you have surely been struggling with any spreadsheet app to do it easily. There is, apparently, a VBA add-in for excel. Or you have to use Access, which should be fine for these kind of operations. The problem with spreadsheets is that by default they aggregate or sum your data, and when it comes to strings there usually is no straightforward workaround.

However, if you wanna do it with 9 (nine!) lines of code, then a panda is your friend :)

In my case, the raw data was shaped like this:

	*id, key, value*
	*id1, size, 55*
	*id1, material, wood*
	*id2, size, 100*
	*id2, material, iron*
	*id2, shape, round*

And I wanted to pivot it to:

	*id, size, material, shape*
	*id1, 55, wood, ,*
	*id2, 100, iron, round*

Here’s the code, the big point is the lambda function. To return strings it’s usually set as “lambda x: ‘ ‘.join(x)”. But this will return a boolean. In the aggfunc field you’ll need to use that small loop to return every specific value.

I reckon this is cool (hence worth sharing) for two reasons:

– Works with big amounts of data – You don’t have to worry about eterogenity of keys (it will just be a column more in your results!) – It’s fast and open-source! :)

Hanging out with the big guys?

If you’re working with large datasets this method will return a memory error. You can avoid it (I used it on a 15gb dataset) reading your dataset chunk by chunk, like this:

The internet is the most incredible infrastracture ever built. Human beings learn through dialogues. Open sourcing your ideas means enhancing them. I keep my brain alive reading, writing and coding. Economics student, I'm (slowly) teaching myself Data Science. Hacktivism with folks.

Read Next

Due anni fa


fights for food, women, a right governance