5.2. BASH SHELL - GREP¶

Consider once again the Super Bowl baby datasets. The approach taken in the previous handout was to clean up the file from each year and then append them together into a single file.

After creating a single file, this file can easily be read in Excel.

An alternative approach could be to *pull out only the information needed* for our analyses from these files.

In 2007, the Indianapolis Colts won the Super Bowl. The county associated with the Indianapolis Colts is Marion County, IN. Thus, the number of births from this county is needed for our analysis. The number of births for Marion County, IN is needed for all years.
In 2008 (and 2012), the New York Giants won the Super Bowl. Thus, the number of births from Bergen County, NJ is needed across all years.
Likewise for the remaining years.

Year	Super Bowl Winner	County
2007	Indianapolis Colts	Marion County, IN
2008	New York Giants	Bergen County, NJ
2009	Pittsburgh Steelers	Allegheny County, PA
2010	New Orleans Saints	Orleans Parish, LA
2011	Green Bay Packers	Brown County, WI
2012	New York Giants	Bergen County, NJ
2013	Baltimore Ravens	Baltimore County, MD
2014	Seattle Seahawks	King County, WA
2015	New England Patriots	Suffolk County, MA
2016	Denver Broncos	Denver County, CO
2017	New England Patriots	Suffolk County, MA

MAC users will need to use gsed whenever a sed command is used throughout.

In the previous handout, the following command was used to KEEP only the lines that started with \t from the Babies_2007.txt file.

$ sed -n ‘/^\t/p’ Babies_2007.txt

The command above can be modified as shown below to search for Marion County, IN in each line across several files. The wildcard character is used to search across years.

$ sed -n ‘/Marion County, IN/p’ Babies_20*.txt >

Colts_MarionCounty.txt

The wildcard character creates the following data stream for sed.

A quick glance of this output files reveals that the “Total” rows are included in this file. Thus, modifications will need to be made to the above command.

$ head -20 Colts_MarionCounty.txt

The following sed command will print the lines that start with \t AND contain Marion County, IN. It is necessary to use gsed on a MAC in order for this command to work.

\[\ \left\{ Start\ with\ \backslash t \right\}\ \cap \{ Contains\ Marion\ County,\ IN\}\]

$ sed -n ‘/^\t/p’ Babies_20*.txt | sed –n ‘/Marion County, IN/p’ > Colts_MarionCounty.txt

The components of this command are shown here.

The following command is an alternative to the piped command provided above. The .* can be interpreted as AND in this context. This approach appears to be somewhat limiting (I could not get compound .* commands to work). Mac users need to use gsed here.

$ sed -n ‘^\t.*Marion County, IN’ Babies_20*.txt > Colts_MarionCounty.txt

As seen below the output is as desired; however, notice that information regarding the Year is not provided. You can add year information to the file fairly easy as a pattern exists – Jan through Dec is repeated for each year.

$ head -20 Colts_MarionCounty.txt

Questions:

Your colleague suggest that Year information could be added to the dataset after bringing it into Excel. She believes the following will work. Verify whether or not this will work to label the year for each row in the dataset.

=2007+INT(ROW()/12)

Will this function work if more years are added? Discuss.

EDITING with GREP

Grep has some advantages over sed when it comes to the management of strings. Grep stands for Globally search a Regular Expression and Print and appears to be somewhat more sophisticated and flexible than sed.

Using various grep commands to search for lines that start with \t AND contain Marion County, IN.

Utility	Command Outcome
Sed	Command: $ sed -n ‘/^\t/p’ <in-file> \| sed –n ‘/Marion County, IN/p’ > <out-file> Outcome: Success – however, no information is provided regarding file origination
grep	Command: $ grep ‘^\t’ <in-file> \| grep ‘Marion County, IN’ > <out-file> Outcome: Nothing – grep does not appear to understand \t using it’s default regex interpreter
	Command: $ grep ‘^\s’ <in-file> \| grep ‘Marion County, IN’ > <out-file> Outcome: Success – grep does appear to understand the more universal \s, i.e. whitespace, character in it’s default regex interpreter
	Command: $ grep –P ‘^\t’ <in-file> \| grep –P ‘Marion County, IN’ > <out-file> Outcome: Success (on PC, not MAC) – interpret search pattern as a Perl regular expression
	Command: $ grep $’^\t’ <in-file> \| grep $‘Marion County, IN’ > <out-file> Outcome: Success – uses the ANSI C standard for the string

Consider the following command. This command is using the \s version from above. Once again the output for Marion County, IN is being saved into a file named Colts_MarionCounty.txt.

$ grep $’^\t’ Babies_20*.txt | grep $‘Marion County, IN’ > Colts_MarionCounty.txt

An alternative to the above that does not use piping.

$ grep $’^\t.*Marion County, IN’ Babies_20*.txt > Colts_MarionCounty.txt

Use the head command to look at the output file. Notice that the grep command has left an identifier for each line. This identifier corresponds to the file from which this line was found. This identifier is automatically placed in the file whenever a wildcard is being used in the in-file.

$ head -20 Colts_MarionCounty.txt

*GETTING DATA FROM ADDITIONAL COUNTIES *

The Marion County, IN data is needed for the Indianapolis Colts (2007 Super Bowl). Data from Bergen County, NJ is needed for the New York Giants (2008 / 2012 Super Bowl).

Year	Super Bowl Winner	County
2007	Indianapolis Colts	Marion County, IN
2008	New York Giants	Bergen County, NJ
:	:	:

Thus, all lines start with \t and are from {Marion County, IN or Bergen County, NJ} will be needed for our analysis.

\[\left\{ Start\ with\ \backslash t \right\}\ \cap \{ Marion\ County,\ IN \cup Contains\ Bergen\ County,\ NJ\}\]

The grep syntax for the set above is provided here. The left parenthesis, vertical bar, and right parenthesis characters must be encapsulated here, i.e. $, \|, and $. The reason is that by default grep will look for these characters; however, in this context these characters are being used to define the desired set of rows to keep.

\[\{ Marion\ County,\ IN \cup Bergen\ County,\ NJ\}\]

The following statement extends to all counties. The output is being pushed into a file named Babies_AllData.txt.

Comment: Grep is exact in its searching. For example, if a space is inadvertently placed before a county name (see below) as grep will be searching for “_King County, WA” which is different than “King County, WA”.

The following word count with line option command can be used to count the number of lines in this file. This appears to be correct.

12 months x 8 years x 7 unique counties = 672

$ wc -l Babies_AllData.txt

672 Babies_AllData.txt

ADDING A HEADER

Using sed with in-place editing to add a header row to the final dataset.

$ sed -i ‘1 i\File \t County \t County Code \t Month \t Month Code \t Births’ Babies_AllData.txt

Header row has been successfully added – verified via head and with line count.

$ head -5 Babies_AllData.txt

File County County Code Month Month Code Births

Babies_2007.txt: “Marion County, IN” “18097” “January” “1” 1320

Babies_2007.txt: “Marion County, IN” “18097” “February” “2” 1219

Babies_2007.txt: “Marion County, IN” “18097” “March” “3” 1295

Babies_2007.txt: “Marion County, IN” “18097” “April” “4” 1160

$ wc -l Babies_AllData.txt

673 Babies_AllData.txt

The Babies_AllData.txt file is ready for analysis. A snip-it of this file in Excel is shown here.

Once in Excel, the Year information can easily be extracted from the file name.

=MID(A2,FIND(“_”,A2,1)+1,4)

Tasks

Consider the following question about the use of grep.

Consider the grep command that is used to extract data from each of the seven counties for the Super Bowl babies investigation.

$ grep $’^\t’ Babies_20*.txt | grep $‘$Marion County, IN\|Bergen County, NJ\|Allegheny County, PA\|Orleans Parish, LA\|Brown County, WI\|Baltimore County, MD\|King County, WA$’ > Babies_AllData.txt

Suppose that you want grep to only search through data files from 2010 – 2014. How would this command change?
Consider the following variation to the grep command. What years are being included here?

$ grep $’^\t’ Babies_201[2-4].txt | grep $‘$Marion County, IN\|Bergen County, NJ\|Allegheny County, PA\|Orleans Parish, LA\|Brown County, WI\|Baltimore County, MD\|King County, WA$’ > Babies_AllData.txt

What would be the grep command get data for all years, from all seven counties, but only for the months October, November, and December? Hint: The most straight forward approach would be to use another pipe somehow.
What would happen to your command above if there was a county named October County, MN in this data? Discuss.

The following sed (or gsed on MAC) commands can be used to clean the 1^st column that contains the file name from which grep pulled this line of data. Excel was used above to clean up this field. The following sed commands are used to clean up this field in Bash.

Use the following substitute functionality of sed to strip off Babies_ and .txt of the 1st column of the Babies_AllData.txt dataset.

Command #1: $ sed -i ‘s/Babies_//’ Babies_AllData.txt

Command #2: $ sed -i ‘s/.txt://’ Babies_AllData.txt

What would happen if there were multiple instances of Babies_ on a line? Would all instances of Babies_ be replaced with nothing or only the first? Use a text editor, e.g. Sublime, to add a second Babies_ in a line to check your answer.
What is the impact of the following change to Command #1?

Command #3: $ sed -i ‘s/Babies_//g’ Babies_AllData.txt

Consider the following commands. What do each of these commands do? In Command 6, what is the purpose of the –r option?

Command #4: $ sed -e ‘s/.//7’ Babies_AllData.txt | head -10

Command #5: $ sed -e ‘s/.......//’ Babies_AllData.txt | head -10

Command #6: $ sed -r ‘s/.{7}//’ Babies_AllData.txt | head -10
Consider the following command. What does this command do?

Command #7: $ sed –e ‘s/\.[^:]*://’ Babies_AllData.txt | head -10

Using piping to run Command #6 and #7 from above. Save the output into a file called Babies_AllData_Final.txt. Verify that the File column includes only the Year and that Babies_ and .txt have been successfully removed.

Command #8: $ sed -r ‘s/.{7}//’ Babies_AllData.txt | sed ‘s/\.[^:]*://’ > Babies_AllData_Final.txt

Next Section - 5.3. BASH SHELL – FILE UTILITIES / AWK: RECORD & FIELD MANAGEMENT

Utility	Command Outcome
Sed	Command: $ sed -n ‘/^\t/p’ <in-file> \| sed –n ‘/Marion County, IN/p’ > <out-file> Outcome: Success – however, no information is provided regarding file origination
grep	Command: $ grep ‘^\t’ <in-file> \| grep ‘Marion County, IN’ > <out-file> Outcome: Nothing – grep does not appear to understand \t using it’s default regex interpreter
	Command: $ grep ‘^\s’ <in-file> \| grep ‘Marion County, IN’ > <out-file> Outcome: Success – grep does appear to understand the more universal \s, i.e. whitespace, character in it’s default regex interpreter
	Command: $ grep –P ‘^\t’ <in-file> \| grep –P ‘Marion County, IN’ > <out-file> Outcome: Success (on PC, not MAC) – interpret search pattern as a Perl regular expression
	Command: $ grep $’^\t’ <in-file> \| grep $‘Marion County, IN’ > <out-file> Outcome: Success – uses the ANSI C standard for the string