Reshaping Data with R
One common data preparation step, particularly with time series data, is conversion from wide form to long form, and vice-versa. If the data is in a relational database we can just modify our SQL queries appropriately to get the data into the proper format.
However, we’re frequently presented with a data set in a flat file such as a text or CSV file, with the data already in either wide or long form. Depending on the analytics software or specific algorithms we’re using, the data needs to be reshaped. We could load the file into a database and use SQL as mentioned above, but there are quicker alternatives.
As a refresher, wide form ‘denormalizes’ – or spreads – certain variables so that they have their own columns in the data set. For example, the following sports franchise value data is in wide form, with separate columns for each year of the time series data:
The spread data relates to one or more variables that are still normalized (in this case each combination of ‘League’ and ‘Franchise’ only occurs once in the wide data set).
Long form ‘normalizes’ these spread variables so that they’re still keyed by the same variables (‘League’ and ‘Franchise’) but condensed into two columns – one for the name of the variable (in this case, ‘Year’) and one for the corresponding value.
It’s the same data, just organized differently.
So how do we automate this transformation – especially when we have thousands (or millions) of observations? Using R, there’s a pretty easy way to convert these files from wide to long form and back again.
With base R, we can use the reshape function as described below.
Wide to Long
First, let’s read in our wide-form data:
starting.data.wide <- read.table(header=T, text=' League Franchise 2007 2008 2009 NFL Atlanta 881 930 916 NFL Chicago 1090 1134 1158 NBA Miami 453 446 421 ')
Now we can use the reshape function to get the data into long form. We specify the data set (line 8), the direction or shape we want (line 9), and the columns that will be collapsed (line 10). We also need to specify the name of the column for our values (line 11). Lines 12 and 13 are important as they identify the variable(s) that are keys in the dataset (idvar) and the variable that is to be collapsed in long form (timevar) as shown on the code snippet below:
reshaped.data.long <- reshape(starting.data.wide, direction="long", varying=list(names(starting.data.wide)[3:5]), v.names="Value", idvar=c("League","Franchise"), timevar="Year", times=2007:2009)
Because we assigned the function return value to the variable ‘reshaped.data.long’, we can easily view the data in long form to verify that the function worked as expected:
> reshaped.data.long League Franchise Year Value NFL.Atlanta.2007 NFL Atlanta 2007 881 NFL.Chicago.2007 NFL Chicago 2007 1090 NBA.Miami.2007 NBA Miami 2007 453 NFL.Atlanta.2008 NFL Atlanta 2008 930 NFL.Chicago.2008 NFL Chicago 2008 1134 NBA.Miami.2008 NBA Miami 2008 446 NFL.Atlanta.2009 NFL Atlanta 2009 916 NFL.Chicago.2009 NFL Chicago 2009 1158 NBA.Miami.2009 NBA Miami 2009 421
From here we can further subset the data, output it as a new data file, use it as an input for further analysis, etc.
Long to Wide
In order to convert long form to wide, we can begin with our long form data that we generated above, and transform it to wide form. To do so we once again use the reshape function, this time setting the direction parameter to “wide”. The data is still keyed by “League” and “Franchise” so we use these for the idvar parameter. The timevar parameter is still “Year”.
reshaped.data.wide <- reshape(reshaped.data.long, direction = "wide", idvar = c("League","Franchise"), timevar = "Year")
We can verify that the data is now in wide form, as shown below. From here, we can modify the column names as needed, further subset or sort the data, and use the data as an input in downstream analytics.
> reshaped.data.wide League Franchise Value.2007 Value.2008 Value.2009 NFL.Atlanta.2007 NFL Atlanta 881 930 916 NFL.Chicago.2007 NFL Chicago 1090 1134 1158 NBA.Miami.2007 NBA Miami 453 446 421
If you need additional flexibility, you may want to consider the reshape2 package by Hadley Wickham, which uses just two functions, melt and cast, to “flexibly restructure and aggregate data.” For example, the melt function would be used to transform data from wide to long, and cast (specifically dcast for data frames or acast for arrays or matrices) would be used to transform data from long to wide.
The Cookbook for R site has an excellent introduction and examples using reshape2. Between the base R reshape function and the reshape2 package, you have some nice tools in your data preparation arsenal!