3.1. Introduction
3.1.1. What is OpenRefine?
OpenRefine has been described as “a power tool for working with messy data” (David Huynh), but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you solve.
OpenRefine is most useful where you have data in a simple tabular format, such as a spreadsheet, a comma-separated values (CSV) file or a tab-separated values (TSV) file, but with internal inconsistencies in data formats, in where data appears, or in the terminology used. OpenRefine can be used to standardize and clean data across your file. It can help you:
- Get an overview of a data set
- Resolve inconsistencies in a data set, for example standardizing date formatting
- Split data up into more granular parts, for example splitting cells with multiple authors into separate cells
- Match local data up to other data sets, for example matching local subjects against the Library of Congress Subject Headings
- Enhance a data set with data from other sources
Some common scenarios might be:
- Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
- Where you want to know how values are distributed across your whole data set
- Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format. For example:
Data you have | Desired data |
---|---|
1st January 2014 | 2014-01-01 |
01/01/2014 | 2014-01-01 |
Jan 1 2014 | 2014-01-01 |
2014-01-01 | 2014-01-01 |
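To make the date scenario concrete, here is a minimal Python sketch of the same idea, written outside OpenRefine. It is not how OpenRefine works internally. The function name `to_iso` and the list of candidate formats are assumptions chosen to cover the four examples in the table. Slash-separated dates are assumed to be day-first, and month names are assumed to be in English.

```python
import re
from datetime import datetime

# Candidate input formats (an assumption covering the table above).
# "%d/%m/%Y" treats slash dates as day-first.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%b %d %Y"]


def to_iso(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    # Drop ordinal suffixes so "1st January 2014" becomes "1 January 2014".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None


for raw in ["1st January 2014", "01/01/2014", "Jan 1 2014", "2014-01-01"]:
    print(raw, "->", to_iso(raw))
```

Returning `None` for unrecognized values, rather than guessing, keeps the unparsed rows visible so they can be reviewed by hand.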
- Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts. For example:
Data you have | Desired data |
---|---|
London | London |
London] | London |
London,] | London |
london | London |
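The variant-name scenario can be sketched in Python as follows. This is only an illustration of the general idea (reduce each value to a normalized key, then group values that share a key) and is not presented as OpenRefine's own method. The helper names `normalize_place` and `group_variants` and the set of stripped punctuation characters are assumptions.

```python
from collections import defaultdict


def normalize_place(value):
    """Trim whitespace and stray punctuation, then apply title case."""
    return value.strip().strip("[],.;: ").title()


def group_variants(values):
    """Map each normalized form to the original values that produce it."""
    groups = defaultdict(list)
    for value in values:
        groups[normalize_place(value)].append(value)
    return dict(groups)


variants = ["London", "London]", "London,]", "london"]
print(group_variants(variants))
```

Title-casing is a crude rule: it would turn a name such as "Stoke-on-Trent" into "Stoke-On-Trent", so real data usually needs the grouped variants to be reviewed before they are merged.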
- Where you have several pieces of data combined in a single column, and you want to separate them so that each piece has its own column. For example, going from a single address field (the first column) to a separate field for each part of the address:
Address in single field | Institution | Library name | Address 1 | Address 2 | Town/City | Region | Country | Postcode |
---|---|---|---|---|---|---|---|---|
University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom | University of Wales | Llyfrgell Thomas Parry Library | Llanbadarn Fawr | | Aberystwyth | Ceredigion | United Kingdom | SY23 3AS |
University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom | University of Aberdeen | Queen Mother Library | Meston Walk | | Aberdeen | | United Kingdom | AB24 3UE |
University of Birmingham, Barnes Library, Medical School, Edgbaston, BIRMINGHAM, West Midlands, B15 2TT, United Kingdom | University of Birmingham | Barnes Library | Medical School | Edgbaston | Birmingham | West Midlands | United Kingdom | B15 2TT |
University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL, United Kingdom | University of Warwick | Library | Gibbett Hill Road | | Coventry | | United Kingdom | CV4 7AL |
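As a rough illustration of what splitting the address field involves, the Python sketch below breaks one address into named parts. It relies on assumptions drawn only from the four example rows: the first two comma-separated parts are the institution and the library, the last two are the postcode and the country, and the town is the part written entirely in capitals. The function name `split_address` is hypothetical, and addresses that break these assumptions (for example, one with no capitalized town) would raise an error.

```python
def split_address(address):
    """Split a comma-separated address into named fields (heuristic)."""
    parts = [p.strip() for p in address.split(",")]
    institution, library = parts[0], parts[1]
    postcode, country = parts[-2], parts[-1]

    # Everything between the library and the postcode: address lines,
    # then the town (all capitals), then an optional region.
    middle = parts[2:-2]
    town_idx = next(i for i, p in enumerate(middle) if p.isupper())
    lines = middle[:town_idx]
    region = middle[town_idx + 1:]

    return {
        "Institution": institution,
        "Library name": library,
        "Address 1": lines[0] if len(lines) > 0 else "",
        "Address 2": lines[1] if len(lines) > 1 else "",
        "Town/City": middle[town_idx].title(),
        "Region": region[0] if region else "",
        "Country": country,
        "Postcode": postcode,
    }


row = split_address(
    "University of Birmingham, Barnes Library, Medical School, Edgbaston, "
    "BIRMINGHAM, West Midlands, B15 2TT, United Kingdom"
)
print(row)
```

The town is searched for only in the middle parts because a postcode such as "B15 2TT" also consists entirely of capital letters and digits.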
- Where you want to add to your data from an external data source:
Data you have | Date of Birth from VIAF (Virtual International Authority File) | Date of Death from VIAF (Virtual International Authority File) |
---|---|---|
Braddon, M. E. (Mary Elizabeth) | 1835 | 1915 |
Rossetti, William Michael | 1829 | 1919 |
Prest, Thomas Peckett | 1810 | 1879 |
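The enrichment scenario amounts to looking up each local value in an external source and adding the matching fields as new columns. The Python sketch below shows only that joining step. It does not call VIAF or any other service: the `AUTHORITY` dictionary is a hypothetical stand-in for data that has already been retrieved, filled with the values from the table above, and the function name `enrich` is an assumption.

```python
# Hypothetical stand-in for already-retrieved authority data,
# keyed by the name exactly as it appears in the local data.
AUTHORITY = {
    "Braddon, M. E. (Mary Elizabeth)": {"birth": "1835", "death": "1915"},
    "Rossetti, William Michael": {"birth": "1829", "death": "1919"},
    "Prest, Thomas Peckett": {"birth": "1810", "death": "1879"},
}


def enrich(names, authority):
    """Add birth and death dates to each name; blank when unmatched."""
    rows = []
    for name in names:
        match = authority.get(name, {})
        rows.append({
            "name": name,
            "birth": match.get("birth", ""),
            "death": match.get("death", ""),
        })
    return rows


for row in enrich(["Rossetti, William Michael", "Unknown, Person"], AUTHORITY):
    print(row)
```

Exact-string matching is the simplest possible rule. In practice names rarely match an external source character for character, which is why the matching step usually needs review.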