It is often said that 80% of the time needed for data analytics is spent on data wrangling or data cleaning. Many tools and techniques for data cleaning exist, e.g., specialized workflow tools such as BioVeL and Kurator for biodiversity data, and general-purpose interactive tools such as OpenRefine and Trifacta Wrangler. Most data scientists, however, rely on script-based approaches, e.g., in R using tidyr (from the Tidyverse) or in Python using pandas.

In this project, you will implement a number of common "data wrangling" (or "data cleaning") tasks for a given scientific dataset, using OpenRefine, Python, or R, and then compare the different implementations with respect to the transparency and reproducibility of the overall data cleaning workflow. To support reproducible, interactive data cleaning with OpenRefine, a new WholeTale frontend for OpenRefine will be employed, based on existing Docker recipes. In addition, different standards (e.g., W3C PROV and ProvONE) and tools (e.g., YesWorkflow and recordr) will be employed to model and capture data lineage and provenance information.

The goal is to make data cleaning workflows easier to use (via Jupyter-based Python notebooks, RStudio notebooks, and OpenRefine workflows), more transparent (using provenance standards and tools), and more reproducible (using the WholeTale execution environment).
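To give a flavor of the kind of task involved, below is a minimal sketch of one scripted cleaning step in Python/pandas, annotated with YesWorkflow @-tags so the script's dataflow can later be extracted as a provenance graph. The file and column names (specimens_raw.csv, species, date) are hypothetical placeholders, not part of the project dataset.

    # @begin clean_specimens @desc Normalize species names and dates, drop duplicates
    # @in specimens_raw.csv
    # @out specimens_clean.csv
    import pandas as pd

    # Hypothetical raw file with "species" and "date" columns
    df = pd.read_csv("specimens_raw.csv")

    # Trim stray whitespace in species names
    df["species"] = df["species"].str.strip()

    # Parse dates; unparseable entries become NaT instead of raising an error
    df["date"] = pd.to_datetime(df["date"], errors="coerce")

    # Remove exact duplicate records introduced by repeated data entry
    df = df.drop_duplicates()

    df.to_csv("specimens_clean.csv", index=False)
    # @end clean_specimens

Running YesWorkflow over such a script would yield a prospective provenance graph linking specimens_raw.csv to specimens_clean.csv, which is exactly the kind of transparency the project aims to compare across the OpenRefine, Python, and R implementations.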
Primary Mentor: Bertram Ludäscher, iSchool & NCSA, University of Illinois, Urbana-Champaign
Secondary Mentor: Kacper Kowalik, NCSA, University of Illinois, Urbana-Champaign