3.1. Introduction

3.1.1. What is OpenRefine?

OpenRefine has been described by David Huynh as “a power tool for working with messy data”, but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you solve.

OpenRefine is most useful where you have data in a simple tabular format, such as a spreadsheet, a comma-separated values (CSV) file or a tab-delimited (TSV) file, but with internal inconsistencies in data formats, in where data appears, or in the terminology used. OpenRefine can be used to standardize and clean data across your file. It can help you:

  • Get an overview of a data set
  • Resolve inconsistencies in a data set, for example standardizing date formatting
  • Split data up into more granular parts, for example splitting cells that contain multiple authors into separate cells
  • Match local data up to other data sets, for example matching local subjects against the Library of Congress Subject Headings
  • Enhance a data set with data from other sources

Some common scenarios might be:

  • Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
  • Where you want to know how values are distributed across your whole data set
  • Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format. For example:

      Data you have       Desired data
      ----------------    ------------
      1st January 2014    2014-01-01
      01/01/2014          2014-01-01
      Jan 1 2014          2014-01-01
      2014-01-01          2014-01-01

  • Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts. For example:

      Data you have    Desired data
      -------------    ------------
      London           London
      London]          London
      London,]         London
      london           London

  • Where you have several pieces of data combined in a single column, and you want to separate them into individual columns, one for each piece. For example, going from a single address field to a separate field for each part of the address:

      Address in single field:

      1. University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom
      2. University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom
      3. University of Birmingham, Barnes Library, Medical School, Edgbaston, BIRMINGHAM, West Midlands, B15 2TT, United Kingdom
      4. University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL, United Kingdom

      Separated fields (a blank cell means that part is absent from the address):

      Row  Institution               Library name                    Address 1          Address 2  Town/City    Region         Country         Postcode
      ---  ------------------------  ------------------------------  -----------------  ---------  -----------  -------------  --------------  --------
      1    University of Wales       Llyfrgell Thomas Parry Library  Llanbadarn Fawr               Aberystwyth  Ceredigion     United Kingdom  SY23 3AS
      2    University of Aberdeen    Queen Mother Library            Meston Walk                   Aberdeen                    United Kingdom  AB24 3UE
      3    University of Birmingham  Barnes Library                  Medical School     Edgbaston  Birmingham   West Midlands  United Kingdom  B15 2TT
      4    University of Warwick     Library                         Gibbett Hill Road             Coventry                    United Kingdom  CV4 7AL

  • Where you want to add to your data from an external data source:

      Data you have                      Date of birth from VIAF    Date of death from VIAF
      -------------------------------    -----------------------    -----------------------
      Braddon, M. E. (Mary Elizabeth)    1835                       1915
      Rossetti, William Michael          1829                       1919
      Prest, Thomas Peckett              1810                       1879

      (VIAF: Virtual International Authority File)
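
To make these scenarios concrete, the clean-up steps can be sketched in a few lines of Python. This only illustrates what the transformations amount to, using the sample values from the examples above. It is not how OpenRefine itself works. The function names (normalize_date, normalize_place, split_address) and the list of date formats are assumptions made for this sketch, and the month names are assumed to be in English.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed list of formats: just enough to cover the four sample dates.
# "%d/%m/%Y" reads 01/02/2014 as 1 February, which is itself an assumption.
DATE_FORMATS = ["%d %B %Y", "%d/%m/%Y", "%b %d %Y", "%Y-%m-%d"]


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no listed format matches."""
    # Turn ordinals such as "1st" into plain numbers before parsing.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


def normalize_place(text):
    """Remove stray trailing commas and brackets, then capitalise each word."""
    return text.strip().rstrip(",]").strip().title()


def split_address(text):
    """Split a comma-separated address into its parts, without labelling them."""
    return [part.strip() for part in text.split(",")]


if __name__ == "__main__":
    places = ["London", "London]", "London,]", "london"]
    # Counting values before and after clean-up shows the effect of normalising.
    print(Counter(places))
    print(Counter(normalize_place(p) for p in places))
    for raw in ["1st January 2014", "01/01/2014", "Jan 1 2014", "2014-01-01"]:
        print(raw, "->", normalize_date(raw))
    print(split_address("University of Aberdeen, Queen Mother Library, "
                        "Meston Walk, ABERDEEN, AB24 3UE, United Kingdom"))
```

Note that the address example is harder than a plain split suggests: the sample addresses have six or seven parts, so deciding which part is the region or the second address line still needs judgement about the data.
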