|Is your data not in the format of your favorite statistical package? Multiple files need to be merged and reformatted? You've been doing these tedious tasks by hand? There is a better way! Read about an example using a time series of Indian rainfall data.|
When your data is sourced from an online repository it can come in all shapes and sizes. Often, data will not be available in the standard format of your favorite statistical package. You may need to split and recombine rows or columns, merge multiple files together, reshape from a list into a data frame, or omit extraneous characters. These tasks are tedious and time-intensive if done by hand. There is a better solution: let the computer do it for you. We can show you how!
The Raw Data
Daily weather records for the Indian subcontinent were downloaded from the India Meteorological Department for the years 1951-2007. Three different types of data were available: rainfall, temperature, and number of recording stations. Each data type was packaged into files containing one year's worth of observations. A single day's data consisted of a matrix of latitude-longitude coordinates with the rainfall, temperature, or station data populating the cells. That makes a total of 365 matrices for one year, with 57 years, and 3 data types, all packaged in a total of 171 separate text and binary files.
The aim was to reformat the data from all 171 files into a single long-format (panel) data frame for input into Stata. The data frame was to have columns for rainfall, temperature, station number, latitude, longitude, day, month, and year. Each row would be a single set of observations. This would then be used for time-series analysis of precipitation trends in the subcontinent over the last half-century. To facilitate visualization of the data, we also wanted to create a movie of rainfall and temperature patterns over time by plotting each day's data as a colored grid and animating this over a map of India.
Since Stata is not a very flexible tool for manipulating data, the R statistical language was chosen to perform all the reformatting. The first step was to read in all of the files by creating a list of file names to loop over. Two of the file types were plain text and could be easily read. The rainfall data, however, was in binary format, which required a custom function to be written to parse the data. Upon viewing the data it became clear that standard R functions for inputing data, such as read.table(), would not suffice, since a random assortment of tab and space delimiters had been used to demarcate data cells in the files.
To overcome this complication it was necessary to read in each file one line at a time, and within each line to extract individual cell values and strip off any other formatting by using regular expression searches. This solved the problem of multiple delimiter types, but left the data without its original latitude-longitude matrix structure. To regain the lost structure, a template of the original data matrix was constructed and used as a reference to place each individual datum into its correct row-column position. For each day's observations, individual lines were read into R, stripped of all white space and formatting, and each datum was placed into a new matrix in its old position. This was then repeated for all 62,457 days of data using multiple nested loops.
Once the data had been successfully parsed and cleaned it still needed reshaping. At this point the rainfall, temperature, and station data were each in their own list structure, with each element in the lists consisting of a matrix of one day's data. Using R's data reshaping tools, the lists were split apart and all of the matrices were glued back together end on end to form three long matrices, one for each type of data. Each matrix was then unraveled column by column into separate vectors of rainfall, temperature, and station data that were combined to a final data frame. Spatial and temporal locations were then appended to this data frame as columns for latitude, longitude, day, month, and year, by striping this information from the matrix row and column names. The next step was to export this data frame from R into a Stata .dta file.
The final dataframe represented a huge volume of information that was difficult to summarize in a meaningful way. To aid analysis, we created a visualization by plotting each day's data as a colored grid, with rainfall and temperature values represented by hue and saturation. To aid interpretation of spatial context, we then superimposed a map of India on top of this grid. Finally, to provide a means of viewing temporal trends in the data, each day's grid was linked together in an animation that ran through a single year's data in 35 seconds.