Lecture 14: Regular expressions in R

STAT598z: Intro. to computing for statistics

Vinayak Rao

Department of Statistics, Purdue University

options(repr.plot.width=5, repr.plot.height=3)

We have seen the print function:

x <- 1
print(x)
y <- list('Hello', TRUE, c(1,2,3))
print(y)

print is a generic function:

it looks at the class of its input and calls the appropriate method (e.g. print.data.frame for a data frame)

my_df <- data.frame(x = c(1,2), y = c(3,4))
print(my_df)

print.default(my_df)

print.data.frame(my_df)

class(my_df) <- NULL  # remove the data.frame class, leaving a plain list
print(my_df)
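A minimal sketch of the dispatch mechanism, assuming a made-up class called temperature (it is not part of the lecture): defining a function named print.temperature is enough for print to use it on objects of that class.

```r
# Hypothetical class 'temperature', introduced only for this illustration
temp <- structure(21.5, class = 'temperature')

# A print method: the function name must be print.<class>
print.temperature <- function(x, ...) {
  cat('Temperature:', unclass(x), 'degrees\n')
  invisible(x)   # print methods conventionally return their input invisibly
}

print(temp)            # dispatches to print.temperature
print(unclass(temp))   # no class attribute left, so print.default is used
```

Removing the class with unclass has the same effect here as setting class(my_df) to NULL above: print falls back to the default method.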

`print` and `cat`

print displays only its first argument; further arguments are treated as options, not as things to print

print('Right now it is', date())

For this we need the cat (concatenate) function

cat('Right now it is', date(), "in West Lafayette")

cat(..., file = '' , sep = ' ' , fill = FALSE,
labels = NULL, append = FALSE)

...: Inputs that R concatenates and prints

sep: What to append after each input (default is space)

file: Destination file (default is stdout)

Use paste() to store the concatenated output (a string)
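The difference between cat and paste can be sketched as follows. The variable names res and msg are chosen here for illustration only.

```r
# cat() prints its inputs and returns NULL, so there is nothing to store
res <- cat('Right now it is', date(), '\n')
is.null(res)                  # TRUE

# paste() returns the concatenated string instead of printing it
msg <- paste('Right now it is', date(), 'in West Lafayette')
length(msg)                   # 1: a single string

# sep goes between arguments; collapse joins the elements of one vector
paste('a', 'b', sep = '-')    # 'a-b'
paste(1:5, collapse = ',')    # '1,2,3,4,5'
```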

cat(1:5)

cat(1:5,sep= ',' )

cat(1:5,sep= '\n' )

cat('[' ,1:5, ']' ,sep=(',' ))

cat('[',1:5, ']' ,sep=c('', rep(',' ,4), '' ))

cat('Hello','World','New para',sep='\n',file='new_file.txt')
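A small sketch of the file and append arguments. It writes to a temporary file (an assumption made here so that nothing in the working directory is overwritten) and reads the result back with readLines.

```r
tmp <- tempfile(fileext = '.txt')   # temporary path, used only for this example

cat('Hello', 'World', sep = '\n', file = tmp)     # creates the file
cat('\nNew para\n', file = tmp, append = TRUE)    # adds to the end of it

lines <- readLines(tmp)
print(lines)                                      # "Hello" "World" "New para"
unlink(tmp)                                       # clean up
```

Without append = TRUE the second call would replace the contents of the file rather than extend them.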

Section 8.1.22 in The R Inferno, Patrick Burns:

print outputs all characters in the string
cat outputs what the string represents

Compare:

print('Hello\nBye')

cat('Hello\nBye')

‘\’ escapes the following character (indicating it is special)

What if we want to output ‘\n’ using cat ?

Escape \ with another \

cat('Hello\\n')
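One way to check what a string really contains is to count its characters with nchar. A short sketch (s1 and s2 are names picked for this example):

```r
s1 <- 'Hello\nBye'    # \n is one character: a newline
s2 <- 'Hello\\nBye'   # \\ is one character: a backslash, followed by n

nchar(s1)   # 9
nchar(s2)   # 10

print(s2)       # [1] "Hello\\nBye"  (the string as you would type it)
cat(s2, '\n')   # Hello\nBye         (the characters the string contains)
```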

Regular expression: representation of a collection of strings

Useful for searching and replacing patterns in strings

Composed of a grammar to build complicated patterns of strings

R has functions that, coupled with regular expressions, allow powerful string manipulation

E.g. grep, grepl, regexpr, gregexpr, sub, gsub

Matching simple patterns

cities <- c('lafayette', 'indianapolis' , 'cincinnati')
grep('in', cities)

grepl('in', cities)

Usage:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE)

grep('in',cities,value=TRUE) #Return values instead of indices
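A brief sketch of how these return values are typically used: the logical vector from grepl works directly for subsetting and counting, and ignore.case relaxes the match. The variable hits is introduced here for illustration.

```r
cities <- c('lafayette', 'indianapolis', 'cincinnati')

hits <- grepl('in', cities)   # logical vector, one entry per element
cities[hits]                  # "indianapolis" "cincinnati"
sum(hits)                     # 2 elements contain 'in'

grep('IN', cities)                       # integer(0): matching is case sensitive
grep('IN', cities, ignore.case = TRUE)   # 2 3
```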

Where in each element did the match occur?

regexpr('in', cities)

What if more than one match occurred?

gregexpr('in', cities)
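The output of regexpr and gregexpr is easier to read once the positions are pulled out. A sketch, using regmatches (a base R helper not covered above) to extract the matched text:

```r
cities <- c('lafayette', 'indianapolis', 'cincinnati')

first <- regexpr('in', cities)
as.integer(first)              # -1 1 2  (-1 means no match)
attr(first, 'match.length')    # -1 2 2

all_hits <- gregexpr('in', cities)   # a list, one entry per element of cities
as.integer(all_hits[[3]])            # 2 5: both matches in 'cincinnati'

# regmatches() returns the matched substrings themselves
regmatches(cities, all_hits)
```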

What if we want to match

any letter followed by ’n’?
any vowel followed by ’n’?
two letters followed by ’n’?
any number of letters followed by ’n’?

Regular expressions!

allow us to match much more complicated patterns
build patterns from a simple vocabulary and grammar

R supports two flavors of regular expressions; we will always use the Perl-compatible one (set the option perl = TRUE)

'.' (period) matches any single character (it does not match the empty string '')

vec<-c('ct','at', 'cat', 'caat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep('.at', vec, perl = TRUE)

grep('..t', vec, perl = TRUE)

+ represents one or more occurrences

vec<-c('ct','at', 'cat', 'caat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep( 'ca+t', vec, perl = TRUE)

grep( 'c.+t', vec, perl = TRUE)

* represents zero or more occurrences

vec<-c('ct','at', 'cat', 'caat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep('c.*t', vec, perl = TRUE)

Group terms with parentheses ’(’ and ’)’

vec<-c('ct','at', 'cat', 'caat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep('c(.r)+t', vec, perl = TRUE)

grep('c(.r)*t', vec, perl = TRUE)
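To see the effect of +, * and grouping side by side, the same patterns can be run with value = TRUE so that grep returns the matching strings rather than their indices. A sketch on the vector used above:

```r
vec <- c('ct', 'at', 'cat', 'caat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep('ca+t',    vec, perl = TRUE, value = TRUE)  # "cat" "caat"
grep('c.+t',    vec, perl = TRUE, value = TRUE)  # "cat" "caat" "cart" "carert"
grep('c.*t',    vec, perl = TRUE, value = TRUE)  # "ct" plus the four above
grep('c(.r)+t', vec, perl = TRUE, value = TRUE)  # "cart" "carert"
grep('c(.r)*t', vec, perl = TRUE, value = TRUE)  # "ct" "cart" "carert"
```

The only difference between each + and * pair is 'ct', where the quantified part occurs zero times.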

‘.’, ‘+’ and ‘*’ are all metacharacters

Other useful ones include:

^ and $ (start and end of line)

grep('e.$', vec, perl = TRUE)

| (logical OR)

grep('(c.t)|(c.rt)', vec, perl = TRUE)
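Anchors and alternation are often combined to match a whole string rather than a piece of it. A short sketch on the same vector:

```r
vec <- c('ct', 'at', 'cat', 'caat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep('at',        vec, perl = TRUE, value = TRUE)  # "at" "cat" "caat" "rat"
grep('^at$',      vec, perl = TRUE, value = TRUE)  # "at": the whole string must be 'at'
grep('^c',        vec, perl = TRUE, value = TRUE)  # everything starting with 'c'
grep('^(c|r)at$', vec, perl = TRUE, value = TRUE)  # "cat" "rat"
```

The parentheses in the last pattern limit the OR to the first letter; without them | would split the entire pattern in two.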

[ and ] (create character classes)

[0-7ivx]: any of 0 to 7, i, v, and x

[a-z]: lowercase letters

[a-zA-Z]: any letter

[0-9]: any digit

[aeiou]: any vowel

grep('[ei]t', vec, perl = TRUE)

Inside a character class, a leading ^ means "anything except the following characters". E.g.

[^0-9]: anything except a digit

grep('[^a]t', vec, perl = TRUE)
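With character classes and quantifiers in hand, the four questions asked before the regular expressions section can be answered. This is one possible set of patterns, assuming "letter" means a lowercase letter; the vector words is made up for the example.

```r
# Hypothetical test words, chosen to separate the four patterns
words <- c('lafayette', 'indianapolis', 'cincinnati', 'n', 'an', 'bnb')

# any letter followed by n
grep('[a-z]n',      words, perl = TRUE, value = TRUE)
# any vowel followed by n ('bnb' drops out)
grep('[aeiou]n',    words, perl = TRUE, value = TRUE)
# two letters followed by n ('an' and 'bnb' drop out)
grep('[a-z][a-z]n', words, perl = TRUE, value = TRUE)
# any number of letters, including none, followed by n ('n' now matches)
grep('[a-z]*n',     words, perl = TRUE, value = TRUE)
```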

What if we want to match metacharacters like . or +?

vec <- c('ct', 'cat', 'caat', 'caart', 'caaaat', 'caaraat', 
         'c.t')
grep('c.t', vec, perl = TRUE) #Is this what we want?

Escape them with \

WARNING: a single \ doesn’t work. Why?

cat('c\.t')

R treats \. as an escape sequence like \n, but \. is not one it recognizes, so the string fails to parse.

Use two \'s

cat('c\\.t')

grep('c\.t', vec, perl = TRUE)

grep('c\\.t', vec, perl = TRUE)
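Besides the double backslash, there are two other common ways to match a literal period: putting it in a character class, or switching regular expressions off with fixed = TRUE (an argument of grep not discussed above). A sketch comparing them:

```r
vec <- c('ct', 'cat', 'caat', 'caart', 'caaaat', 'caaraat', 'c.t')

grep('c.t',   vec, perl = TRUE)    # 2 7: the . also matches the 'a' in 'cat'
grep('c\\.t', vec, perl = TRUE)    # 7: escaped, so only a literal period matches
grep('c[.]t', vec, perl = TRUE)    # 7: inside [ ] a period is literal
grep('c.t',   vec, fixed = TRUE)   # 7: the pattern is treated as plain text

nchar('c\\.t')                     # 4: the string R stores is c \ . t
```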

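To match a literal \, the regular expression must be \\, which we write in R as the string '\\\\'

my_var <- '\n'                        # a single newline character
grep('\\n', my_var, perl = TRUE)      # the regex \n matches the newline

my_var <- ('\\')                      # a single backslash
grep('\\\\', my_var, perl = TRUE)     # the regex \\ matches one literal backslash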
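The two levels of escaping (once for the R string, once for the regular expression) show up most often with file paths. A sketch with a made-up Windows-style path:

```r
# Hypothetical path, for illustration only
path <- 'C:\\Users\\stat'     # the string contains two backslashes
nchar(path)                   # 13

grepl('\\\\', path, perl = TRUE)       # TRUE: the regex \\ matches one backslash
gsub('\\\\', '/', path, perl = TRUE)   # "C:/Users/stat"
gsub('\\', '/', path, fixed = TRUE)    # same result without a regular expression
```

With fixed = TRUE only the R-string level of escaping remains, so two backslashes in the source are enough.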

Search and replace

The sub function allows search and replacement:

vec <-c('ct','cat','caat','caart','caaaat','caaraaat','c.t')
sub('a+', 'A', vec, perl = TRUE)

sub replaces only the first match in each string; gsub replaces all matches
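The difference only shows up in strings with more than one match. In the vector above that is 'caaraaat', as this sketch illustrates:

```r
vec <- c('ct', 'cat', 'caat', 'caart', 'caaaat', 'caaraaat', 'c.t')

first_only <- sub('a+',  'A', vec, perl = TRUE)
every_one  <- gsub('a+', 'A', vec, perl = TRUE)

first_only[6]   # "cAraaat": only the first run of a's is replaced
every_one[6]    # "cArAt":   both runs are replaced
```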

Use backreferences \1, \2, etc. to refer to the first group, the second group, and so on

gsub('(a+)r(a+)', 'b\\1brc\\2c', vec, perl = TRUE)

Use \U and \L to convert the backreferences that follow to upper or lower case, and \E to end the conversion (these require perl = TRUE)

gsub('(a+)r(a+)', '\\U\\1r\\2', vec, perl = TRUE)

gsub('(a+)r(a+)', '\\U\\1r\\E\\2', vec, perl = TRUE)
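Two small uses of backreferences that go beyond the toy vector, sketched under the assumption that words are lowercase and separated by single spaces (the example strings are made up):

```r
# Swap two words by referring to the groups in reverse order
sub('([a-z]+) ([a-z]+)', '\\2 \\1', 'lafayette west', perl = TRUE)
# "west lafayette"

# Capitalise the first letter of each word: group 1 is the start of the
# string or a space, group 2 is the letter to convert
gsub('(^| )([a-z])', '\\1\\U\\2', 'west lafayette', perl = TRUE)
# "West Lafayette"
```

In the second call \U comes after \1, so only the letter captured in group 2 is converted.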