Data Wrangling

Definition

Data Wrangling is a broad term for the processes involved in preparing data for analysis. It can include acquiring and enriching data, changing its format and shape, combining, subsetting, and sampling it, and cleaning it.

Some common steps involved with Data Wrangling are:

Discovering and gathering the data needed
Merging data from different sources, if necessary
Fixing flaws in the data entries
Extracting the necessary data and putting it in the proper structure
Storing it in the proper format for further use
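
As a rough illustration, the steps above might look like the following in Python with pandas. This is a minimal sketch, not a prescribed workflow: the data, the column names (patient_id, visit_date, weight_kg, site), and the helper function wrangle are all hypothetical, and the choice to drop rows with a missing weight is just one possible way to handle flawed entries.

```python
import io
import pandas as pd

# Hypothetical raw inputs standing in for two separate data sources.
visits_csv = io.StringIO(
    "patient_id,visit_date,weight_kg\n"
    "1,2023-01-05,70.5\n"
    "2,2023-01-06,\n"
    "3,2023-01-07,82.0\n"
)
patients = pd.DataFrame(
    {"patient_id": [1, 2, 3], "site": ["North", "South", "North"]}
)


def wrangle(visits_source, patients_df):
    """Hypothetical helper that walks through the common wrangling steps."""
    # 1. Gather: read the raw visit records and parse the date column.
    visits = pd.read_csv(visits_source, parse_dates=["visit_date"])
    # 2. Merge: attach site information from the second source.
    merged = visits.merge(patients_df, on="patient_id", how="left")
    # 3. Fix flaws: here, drop rows where the weight was never recorded.
    cleaned = merged.dropna(subset=["weight_kg"])
    # 4. Extract: keep only the needed columns, in a fixed order.
    columns = ["patient_id", "site", "visit_date", "weight_kg"]
    return cleaned[columns].reset_index(drop=True)


tidy = wrangle(visits_csv, patients)

# 5. Store: serialize to CSV text (pass a file path to write to disk instead).
csv_text = tidy.to_csv(index=False)
```

In practice each step is usually more involved (for example, imputing missing values rather than dropping them), but the overall order of operations tends to stay the same.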

Examples

Examples of data wrangling include merging data from different sources and fixing flaws or errors in data entries.
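
A small sketch of this kind of task in Python with pandas is shown below. The tables and column names (state, responses, region) are invented for illustration. The flaws fixed here, stray whitespace, inconsistent capitalization, a non-numeric placeholder, and a duplicated row, are assumptions about what messy entries might look like.

```python
import pandas as pd

# Hypothetical survey data with typical entry flaws.
survey = pd.DataFrame(
    {
        "state": [" ny", "CA ", "ca", "TX"],
        "responses": ["10", "7", "7", "n/a"],
    }
)
# Hypothetical lookup table from a second source.
regions = pd.DataFrame(
    {"state": ["NY", "CA"], "region": ["Northeast", "West"]}
)

# Fix flaws: trim whitespace and standardize capitalization of the key.
survey["state"] = survey["state"].str.strip().str.upper()
# Convert counts to numbers; unparseable entries such as "n/a" become NaN.
survey["responses"] = pd.to_numeric(survey["responses"], errors="coerce")
# Remove rows that are exact duplicates once the entries are standardized.
survey = survey.drop_duplicates().reset_index(drop=True)

# Merge the sources; indicator=True adds a "_merge" column showing
# whether each row found a match in the lookup table.
merged = survey.merge(regions, on="state", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "state"].tolist()
```

Cleaning the key column before merging matters here: without it, " ny" and "ca" would not match any row in the lookup table. The unmatched list gives a quick check of which entries still lack a counterpart in the second source.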

Tools

Tidyverse is a collection of open source R packages, several of which can be used for data wrangling and cleaning.

Pandas is an open source Python library for data manipulation and analysis.

OpenRefine is a user-friendly, point-and-click tool for working with messy data.

Relevant Literature

This short Coursera video (What is Data Wrangling?) provides an excellent overview of the data wrangling process and common tasks involved when preparing data for analysis and publication.

Data Science for Practicing Clinicians: Data Wrangling is a Data Carpentry lesson that provides hands-on experience with installing and using dplyr, a core Tidyverse package for the R programming language. It gives basic instructions for filtering, summarizing, parsing, and cleaning data.

The book Practical Data Wrangling (2017) by Allan Visochek provides information on data wrangling techniques in Python.
