Version 30 April 2017 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
7
Pivot Tables: working with long and wide data
These features work with and often create
hierarchical or multi-level Indexes;
(the pandas MultiIndex is powerful and complex).
Pivot, unstack, stack and melt
Pivot tables move from long format to wide format data
# Let's start with data in long format
from StringIO import StringIO # python2.7
#from io import StringIO # python 3
data = """Date,Pollster,State,Party,Est
13/03/2014, Newspoll, NSW, red, 25
13/03/2014, Newspoll, NSW, blue, 28
13/03/2014, Newspoll, Vic, red, 24
13/03/2014, Newspoll, Vic, blue, 23
13/03/2014, Galaxy, NSW, red, 23
13/03/2014, Galaxy, NSW, blue, 24
13/03/2014, Galaxy, Vic, red, 26
13/03/2014, Galaxy, Vic, blue, 25
13/03/2014, Galaxy, Qld, red, 21
13/03/2014, Galaxy, Qld, blue, 27"""
df = pd.read_csv(StringIO(data),
header=0, skipinitialspace=True)
# pivot to wide format on 'Party' column
# 1st: set up a MultiIndex for other cols
df1 = df.set_index(['Date', 'Pollster',
'State'])
# 2nd: do the pivot
wide1 = df1.pivot(columns='Party')
# unstack to wide format on State / Party
# 1st: MultiIndex all but the Values col
df2 = df.set_index(['Date', 'Pollster',
'State', 'Party'])
# 2nd: unstack a column to go wide on it
wide2 = df2.unstack('State')
wide3 = df2.unstack() # pop last index
# Use stack() to get back to long format
long1 = wide1.stack()
# Then use reset_index() to remove the
# MultiIndex.
long2 = long1.reset_index()
# Or melt() back to long format
# 1st: flatten the column index
wide1.columns = ['_'.join(col).strip()
for col in wide1.columns.values]
# 2nd: remove the MultiIndex
wdf = wide1.reset_index()
# 3rd: melt away
long3 = pd.melt(wdf, value_vars=
['Est_blue', 'Est_red'],
var_name='Party', id_vars=['Date',
'Pollster', 'State'])
Note: See documentation, there are many arguments to
these methods.
Working with dates, times and their indexes
Dates and time – points and spans
With its focus on time-series data, pandas has a suite of
tools for managing dates and time: either as a point in
time (a Timestamp) or as a span of time (a Period).
t = pd.Timestamp('2013-01-01')
t = pd.Timestamp('2013-01-01 21:15:06')
t = pd.Timestamp('2013-01-01 21:15:06.7')
p = pd.Period('2013-01-01', freq='M')
Note: Timestamps should be in range 1678 and 2261
years. (Check Timestamp.max and Timestamp.min).
A Series of Timestamps or Periods
ts = ['2015-04-01', '2014-04-02']
# Series of Timestamps (good)
s = pd.to_datetime(pd.Series(ts))
# Series of Periods (hard to make)
s = pd.Series(
[pd.Period(x, freq='M') for x in ts] )
s = pd.Series(pd.PeriodIndex(ts,freq='D'))
Note: While Periods make a very useful index; they may
be less useful in a Series.
From non-standard strings to Timestamps
t = ['09:08:55.7654-JAN092002',
'15:42:02.6589-FEB082016']
s = pd.Series(pd.to_datetime(t,
format="%H:%M:%S.%f-%b%d%Y"))
Also: %B = full month name; %m = numeric month;
%y = year without century; and more …
Dates and time – stamps and spans as indexes
An index of Timestamps is a DatetimeIndex.
An index of Periods is a PeriodIndex.
date_strs = ['2014-01-01', '2014-04-01',
'2014-07-01', '2014-10-01']
dti = pd.DatetimeIndex(date_strs)
pid = pd.PeriodIndex(date_strs, freq='D')
pim = pd.PeriodIndex(date_strs, freq='M')
piq = pd.PeriodIndex(date_strs, freq='Q')
print (pid[1] - pid[0]) # 90 days
print (pim[1] - pim[0]) # 3 months
print (piq[1] - piq[0]) # 1 quarter
time_strs = ['2015-01-01 02:10:40.12345',
'2015-01-01 02:10:50.67890']
pis = pd.PeriodIndex(time_strs, freq='U')
df.index = pd.period_range('2015-01',
periods=len(df), freq='M')
dti = pd.to_datetime(['04-01-2012'],
dayfirst=True) # Australian date format
pi = pd.period_range('1960-01-01',
'2015-12-31', freq='M')
Hint: unless you are working in less than seconds,
prefer PeriodIndex over DateTimeImdex.