Reshaping Data with R

By in Methods on January 1, 2014

RCode2One common data preparation step, particularly with time series data, is conversion from wide form to long form, and vice-versa. If the data is in a relational database we can just modify our SQL queries appropriately to get the data into the proper format. 

However, we’re frequently presented with a data set in a flat file such as a text or CSV file, with the data already in either wide or long form. Depending on the analytics software or specific algorithms we’re using, the data needs to be reshaped. We could load the file into a database and use SQL as mentioned above, but there are quicker alternatives.

As a refresher, wide form ‘denormalizes’ – or spreads – certain variables so that they have their own columns in the data set. For example, the following sports franchise value data is in wide form, with separate columns for each year of the time series data:

wide_data

The spread data relates to one or more variables that are still normalized (in this case each combination of ‘League’ and ‘Franchise’ only occurs once in the wide data set).

Long form ‘normalizes’ these spread variables so that they’re still keyed by the same variables (‘League’ and ‘Franchise’) but condensed into two columns – one for the name of the variable (in this case, ‘Year’) and one for the corresponding value. 

long_data

It’s the same data, just organized differently.

So how do we automate this transformation – especially when we have thousands (or millions) of observations? Using R, there’s a pretty easy way to convert these files from wide to long form and back again.

With base R, we can use the reshape function as described below.

Wide to Long

First, let’s read in our wide-form data:

starting.data.wide <- read.table(header=T, text='
  League Franchise 2007 2008 2009
     NFL   Atlanta  881  930  916
     NFL   Chicago 1090 1134 1158
     NBA     Miami  453  446  421
')

Now we can use the reshape function to get the data into long form. We specify the data set (line 8), the direction or shape we want (line 9), and the columns that will be collapsed (line 10). We also need to specify the name of the column for our values (line 11). Lines 12 and 13 are important as they identify the variable(s) that are keys in the dataset (idvar) and the variable that is to be collapsed in long form (timevar) as shown on the code snippet below:

reshaped.data.long <- 
  reshape(starting.data.wide, 
          direction="long", 
          varying=list(names(starting.data.wide)[3:5]),
          v.names="Value",
          idvar=c("League","Franchise"), 
          timevar="Year", times=2007:2009)

Because we assigned the function return value to the variable ‘reshaped.data.long’, we can easily view the data in long form to verify that the function worked as expected:

> reshaped.data.long
                 League Franchise Year Value
NFL.Atlanta.2007    NFL   Atlanta 2007   881
NFL.Chicago.2007    NFL   Chicago 2007  1090
NBA.Miami.2007      NBA     Miami 2007   453
NFL.Atlanta.2008    NFL   Atlanta 2008   930
NFL.Chicago.2008    NFL   Chicago 2008  1134
NBA.Miami.2008      NBA     Miami 2008   446
NFL.Atlanta.2009    NFL   Atlanta 2009   916
NFL.Chicago.2009    NFL   Chicago 2009  1158
NBA.Miami.2009      NBA     Miami 2009   421

From here we can further subset the data, output it as a new data file, use it as an input for further analysis, etc.

Long to Wide

In order to convert long form to wide, we can begin with our long form data that we generated above, and transform it to wide form. To do so we once again use the reshape function, this time setting the direction parameter to “wide”. The data is still keyed by “League” and “Franchise” so we use these for the idvar parameter. The timevar parameter is still “Year”.

reshaped.data.wide <- 
  reshape(reshaped.data.long, 
          direction = "wide", 
          idvar = c("League","Franchise"),
          timevar = "Year")

We can verify that the data is now in wide form, as shown below. From here, we can modify the column names as needed, further subset or sort the data, and use the data as an input in downstream analytics.

> reshaped.data.wide
                 League Franchise Value.2007 Value.2008 Value.2009
NFL.Atlanta.2007    NFL   Atlanta        881        930        916
NFL.Chicago.2007    NFL   Chicago       1090       1134       1158
NBA.Miami.2007      NBA     Miami        453        446        421

Other Methods

If you need additional flexibility, you may want to consider the reshape2 package by Hadley Wickham, which uses just two functions, melt and cast, to “flexibly restructure and aggregate data.” For example, the melt function would be used to transform data from wide to long, and cast (specifically dcast for data frames or acast for arrays or matrices) would be used to transform data from long to wide.

The Cookbook for R site has an excellent introduction and examples using reshape2. Between the base R reshape function and the reshape2 package, you have some nice tools in your data preparation arsenal!