3.1. Introduction

3.1.1. What is OpenRefine?

OpenRefine has been described as “a power tool for working with messy data” (David Huynh), but what does this mean? The easiest way to answer is to describe the kinds of data OpenRefine works well with and the sorts of problems it can help you solve.

OpenRefine is most useful when your data is in a simple tabular format, such as a spreadsheet, a comma-separated values (CSV) file or a tab-delimited (TSV) file, but contains internal inconsistencies in data formats, in where data appears, or in the terminology used. OpenRefine can be used to standardize and clean data across your file. It can help you:

Get an overview of a data set
Resolve inconsistencies in a data set, for example standardizing date formatting
Split data up into more granular parts, for example splitting up cells with multiple authors into separate cells
Match local data up to other data sets, for example matching local subjects against the Library of Congress Subject Headings
Enhance a data set with data from other sources

Some common scenarios might be:

Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
Where you want to know how values are distributed across your whole data set
Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format. For example:

Example table
Data you have	Desired data
1st January 2014	2014-01-01
01/01/2014	2014-01-01
Jan 1 2014	2014-01-01
2014-01-01	2014-01-01
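
OpenRefine performs this kind of clean-up through its own interface. Purely to illustrate the transformation in the table above, here is a minimal Python sketch. The function name `to_iso_date` and the list of candidate formats are assumptions made for this example, and reading `01/01/2014` as day-first is also an assumption, since the source data does not say which convention applies.

```python
import re
from datetime import datetime

# Candidate formats, tried in order. Treating slashed dates as
# day/month/year is an assumption for this illustration.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%b %d %Y"]


def to_iso_date(value):
    """Return value as YYYY-MM-DD, or None if no known format matches."""
    # Drop ordinal suffixes that follow a digit, as in "1st" or "22nd".
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", value.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match; try the next one
    return None


for raw in ["1st January 2014", "01/01/2014", "Jan 1 2014", "2014-01-01"]:
    print(raw, "->", to_iso_date(raw))
```

Values that match none of the formats come back as `None` rather than being guessed at, so they can be reviewed by hand.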

Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts. For example:

Data you have	Desired data
London	London
London]	London
London,]	London
london	London
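
As a rough illustration of what this normalization involves (and not a description of how OpenRefine does it), the variants above can be reconciled by trimming stray punctuation and applying one capitalization rule. The function name `normalize_place` is hypothetical, and title-casing is a simplifying assumption that will not suit every name.

```python
import string


def normalize_place(value):
    """Trim stray punctuation from both ends and apply title case."""
    # Remove whitespace and punctuation such as "]" or "," from the ends only.
    trimmed = value.strip(string.whitespace + string.punctuation)
    # Title case is a simple rule chosen for this sketch (an assumption).
    return trimmed.title()


for raw in ["London", "London]", "London,]", "london"]:
    print(repr(raw), "->", normalize_place(raw))
```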

Where you have several pieces of data combined in a single column and want to separate them so that each piece has its own column. For example, going from a single address field (the first column) to a separate field for each part of the address:

Address in single field	Institution	Library name	Address 1	Address 2	Town/City	Region	Country	Postcode
University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom	University of Wales	Llyfrgell Thomas Parry Library	Llanbadarn Fawr		Aberystwyth	Ceredigion	United Kingdom	SY23 3AS
University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom	University of Aberdeen	Queen Mother Library	Meston Walk		Aberdeen		United Kingdom	AB24 3UE
University of Birmingham, Barnes Library, Medical School, Edgbaston, BIRMINGHAM, West Midlands, B15 2TT, United Kingdom	University of Birmingham	Barnes Library	Medical School	Edgbaston	Birmingham	West Midlands	United Kingdom	B15 2TT
University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL, United Kingdom	University of Warwick	Library	Gibbett Hill Road		Coventry		United Kingdom	CV4 7AL
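
The addresses above do not all have the same number of comma-separated parts, so a plain split is not enough; some rule is needed to decide which part goes in which column. The Python sketch below shows one possible rule, offered only as an illustration: it assumes the first two parts are the institution and library, the last two are the postcode and country, and the town is the one remaining part written entirely in capitals. The function name `split_address` and these rules are assumptions drawn from the four sample rows, not a general address parser.

```python
def split_address(address):
    """Split one comma-separated address into named fields."""
    parts = [p.strip() for p in address.split(",")]
    institution, library = parts[0], parts[1]
    postcode, country = parts[-2], parts[-1]
    middle = parts[2:-2]
    # Assumption: the town is the only middle part written in capitals.
    # Raises StopIteration if no such part exists.
    town_index = next(i for i, p in enumerate(middle) if p.isupper())
    lines = middle[:town_index]        # address lines come before the town
    region = middle[town_index + 1:]   # an optional region comes after it
    return {
        "Institution": institution,
        "Library name": library,
        "Address 1": lines[0] if len(lines) > 0 else "",
        "Address 2": lines[1] if len(lines) > 1 else "",
        "Town/City": middle[town_index].title(),
        "Region": region[0] if region else "",
        "Country": country,
        "Postcode": postcode,
    }


row = split_address(
    "University of Birmingham, Barnes Library, Medical School, Edgbaston, "
    "BIRMINGHAM, West Midlands, B15 2TT, United Kingdom"
)
for field, value in row.items():
    print(f"{field}: {value}")
```

Real address data usually contains rows that break rules like these, which is why being able to inspect and adjust the results interactively matters.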

Where you want to add to your data from an external data source:

Data you have	Date of Birth from VIAF (Virtual International Authority File)	Date of Death from VIAF (Virtual International Authority File)
Braddon, M. E. (Mary Elizabeth)	1835	1915
Rossetti, William Michael	1829	1919
Prest, Thomas Peckett	1810	1879
