
Creating non-numeric pivot tables with Python Pandas

Posted in datascience

If you have ever tried to pivot a table containing non-numeric values, you have surely struggled to do it easily in a spreadsheet app. There is, apparently, a VBA add-in for Excel, or you can use Access, which should be fine for this kind of operation. The problem with spreadsheets is that by default they aggregate or sum your data, and when the values are strings there is usually no straightforward workaround.

However, if you wanna do it with fewer than ten lines of code, then a panda is your friend :)

In my case, the raw data was shaped like this:

	id, key, value
	id1, size, 55
	id1, material, wood
	id2, size, 100
	id2, material, iron
	id2, shape, round
	...

And I wanted to pivot it to:

	id, size, material, shape
	id1, 55, wood,
	id2, 100, iron, round

Here’s the code; the big point is the lambda function passed as aggfunc. To return strings it is usually written as “lambda x: ' '.join(x)”, but that breaks as soon as the column also contains non-string values (like the numeric sizes here). So in the aggfunc field you need that small loop to cast every single value to a string before joining.

	import pandas as pd

	# Load the raw long-format data (id, key, value)
	df = pd.read_csv('data_raw.csv')

	# Pivot: one row per id, one column per key.
	# Casting every value to a string lets non-numeric values survive the pivot.
	pivot_table = df.pivot_table(index=['id'],
	                             columns=['key'],
	                             values=['value'],
	                             aggfunc=lambda x: ' '.join(str(v) for v in x))
	pivot_table.to_csv('data_pivot.csv')
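
If you want to sanity-check this before pointing it at a real file, here’s a minimal sketch using the sample rows above (the DataFrame is built inline instead of being read from data_raw.csv, and the printed output is only indicative). Note how shape, which only exists for id2, simply becomes an empty cell for id1:

	import pandas as pd

	# The sample rows from above, built inline just for this demo
	df = pd.DataFrame({
	    'id':    ['id1', 'id1', 'id2', 'id2', 'id2'],
	    'key':   ['size', 'material', 'size', 'material', 'shape'],
	    'value': [55, 'wood', 100, 'iron', 'round'],
	})

	pivot_table = df.pivot_table(index=['id'],
	                             columns=['key'],
	                             values=['value'],
	                             aggfunc=lambda x: ' '.join(str(v) for v in x))
	print(pivot_table)
	# Roughly:
	#        value
	# key material  shape size
	# id
	# id1     wood    NaN   55
	# id2     iron  round  100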

I reckon this is cool (hence worth sharing) for a few reasons:

  • It works with large amounts of data
  • You don’t have to worry about heterogeneity of keys (a new key just becomes one more column in your results!)
  • It’s fast and open-source! :)

### Hanging out with the big guys?

If you’re working with really large datasets this method will raise a memory error. You can avoid it (I used it on a 15 GB dataset) by reading your dataset chunk by chunk, like this:

	### Beware of the Panda.
	### 'You have no idea how tidy this is gonna get'
	import pandas as pd

	# Read the raw file in chunks of 5000 rows instead of loading it all at once
	chunks = pd.read_csv('data_raw.csv', chunksize=5000)

	# Pivot each chunk separately and collect the partial results
	appended_data = []
	for chunk in chunks:
	    pivot_table = chunk.pivot_table(index=['id'],
	                                    columns=['key'],
	                                    values=['value'],
	                                    aggfunc=lambda x: ' '.join(str(v) for v in x))
	    appended_data.append(pivot_table)

	appended_data = pd.concat(appended_data, axis=0)
	appended_data.reset_index().to_csv('data_clean.csv', sep=",")

	# and if you wanna clean it up a little where the chunking splits an id in two:
	# merge the partial rows for each id, skipping the empty cells
	appended_data_clean = appended_data.groupby(level='id', sort=True).agg(
	    lambda x: ' '.join(str(v) for v in x.dropna()))
	appended_data_clean.to_csv('dati_clean_crunched.csv', sep=",")
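
To see why that last groupby step is there, here is a tiny sketch with two toy chunks built inline (the names chunk1 and chunk2 are just for the example): id2’s rows happen to be split across the chunk boundary, so the concatenated result holds two partial rows for it, and the groupby merges them back into one.

	import pandas as pd

	agg = lambda x: ' '.join(str(v) for v in x)

	# Toy chunks: id2's rows are split across the chunk boundary
	chunk1 = pd.DataFrame({'id': ['id1', 'id1', 'id2'],
	                       'key': ['size', 'material', 'size'],
	                       'value': [55, 'wood', 100]})
	chunk2 = pd.DataFrame({'id': ['id2', 'id2'],
	                       'key': ['material', 'shape'],
	                       'value': ['iron', 'round']})

	partial = [c.pivot_table(index=['id'], columns=['key'],
	                         values=['value'], aggfunc=agg)
	           for c in (chunk1, chunk2)]

	combined = pd.concat(partial, axis=0)   # id2 appears twice, with gaps
	clean = combined.groupby(level='id', sort=True).agg(
	    lambda x: ' '.join(str(v) for v in x.dropna()))
	print(clean)                            # one row per id, gaps filled in

That’s all there is to it: the same join-the-strings trick, just applied chunk by chunk.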
