Lecture 15: Regular expressions in R¶

STAT598z: Intro. to computing for statistics¶

Vinayak Rao¶

Department of Statistics, Purdue University¶

We have seen the print function:

In [ ]:

x <- 1
print(x)
y <- list('Hello', TRUE, c(1,2,3))
print(y)

print is a generic function:

looks at class of input and calls appropriate function

`print` and `cat`¶

print can only print its first term

In [ ]:

print('Right now it is', date())

For this we need the cat (concatenate) function

In [ ]:

cat('Right now it is', date(), "in West Lafayette")

cat(..., file = '' , sep = ' ' , fill = FALSE,
labels = NULL, append = FALSE)

…: Inputs that R concatenates to print

sep: What to append after each input (default is space)

file: Destination file (default is stdout)

Use paste() to store the concatenated output (a string)

In [ ]:

cat(1:5)

In [ ]:

cat(1:5,sep= ',' )

In [ ]:

cat(1:5,sep= '\n' )

In [ ]:

cat('[' ,1:5, ']' ,sep=(',' ))

In [ ]:

cat('[',1:5, ']' ,sep=c('', rep(',' ,4), '' ))

In [ ]:

cat('Hello','World','New para',sep='\n',file='new_file.txt')

Section 8.1.22 in The R Inferno, Patrick Burns:

print outputs all characters in the string
cat outputs what the string represents

Compare:

In [ ]:

print('Hello\n')

In [ ]:

cat('Hello\n')

‘\’ escapes the following character (indicating it is special)

What if we want to output ‘\n’ using cat ?

Escape \ with another \

In [ ]:

cat('Hello\\n')

Regular expression: representation of a collection of strings

Useful for searching and replacing patterns in strings

Composed of a grammar to build complicated patterns of strings

R has functions, which coupled with regular expressions allow powerful string manipulation

E.g. grep, grepl, regexpr, gregexpr, sub, gsub

Matching simple patterns¶

In [ ]:

cities <- c('lafayette', 'indianapolis' , 'cincinnati')
grep('in', cities)

In [ ]:

grepl('in', cities)

Usage:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE)

In [ ]:

grep('in',cities,value=TRUE) #Return values instead of indices

Where in each element did the match occur?

In [ ]:

regexpr('in', cities)

What if more than one match occured?

In [ ]:

gregexpr('in', cities)

What if we want to match

any letter followed by ’n’?
any vowel followed by ’n’?
two letters followed by ’n’?
any number of letters followed by ’n’?

Regular expressions!¶

allow us to match much more complicated patterns
build patterns from a simple vocabulary and grammar

R supports two ﬂavors of regular expressions, we will always use perl (set option perl = TRUE )

'.' (period) represents any character except empty string '””'

In [ ]:

vec<-c('ct','at', 'cat', 'caat', 'cart', 'dog', 'rat', 'carert', 'bet')

In [ ]:

grep('.at', vec, perl = TRUE)

In [ ]:

grep('..t', vec, perl = TRUE)

+ represents one or more occurrences

In [ ]:

grep( 'ca+t', vec, perl = TRUE)

In [ ]:

grep( 'c.+t', vec, perl = TRUE)

* represents zero or more occurrences

In [ ]:

grep('c.*t', vec, perl = TRUE)

Group terms with parentheses ’(’ and ’)’

In [ ]:

grep('c(.r)+t', vec, perl = TRUE)

In [ ]:

grep('c(.r)*t', vec, perl = TRUE)

‘.’ ‘,’ ‘+’ ‘*’ are all metacharacters

Other useful ones include:

ˆ and $ (start and end of line)

In [ ]:

grep('e.$', vec, perl = TRUE)

| ( logical OR )

In [ ]:

grep('(c.t)|(c.rt)', vec, perl = TRUE)

[ and ] ( create special character classes) i [0-7ivx]: any of 0 to 7, i, v, and x

[a-z]: lowercase letters

[a-zA-Z]: any letter

[0-9]: any number

[aeiou]: any vowel

In [ ]:

grep('[ei]t', vec, perl = TRUE)

Inside a character class ˆ means "anything except the following characters". E.g.

[ˆ0-9]: anything except a digit

In [ ]:

grep('[^a]t', vec, perl = TRUE)

What if we want to match metacharacters like . or +?

In [ ]:

vec <- c('ct', 'cat', 'caat', 'caart', 'caaaat', 'caaraat', 
         'c.t')
grep('c.t', vec, perl = TRUE) #Is this what we want?

Escape them with \

WARNING: a single \ doesn’t work. Why?

In [ ]:

cat('c\.t')

R thinks \. is a special character like \n.

Use two \'s

In [ ]:

cat('c\\.t')

In [ ]:

grep('c\.t', vec, perl = TRUE)

In [ ]:

grep('c\\.t', vec, perl = TRUE)

To match a \, our pattern must represent \\

In [ ]:

my_var <- '\n'
grep('\\n', my_var)

In [ ]:

my_var <- ('\\')
grep('\\\\', my_var)

Search and replace¶

The sub function allows search and replacement:

In [ ]:

vec <-c('ct','cat','caat','caart','caaaat','caaraaat','c.t')
sub('a+', 'A', vec, perl = TRUE)

sub replaces only ﬁrst match, gsub replaces all

Use backreferences \1, \2 etc to refer to ﬁrst, second group etc

In [ ]:

gsub('(a+)r(a+)', 'b\\1brc\\2c', vec, perl = TRUE)

Use \U, \L, \E to make following backreferences upper or lower case or leave unchanged respectively

In [ ]:

gsub('(a+)r(a+)', '\\U\\1r\\2', vec, perl = TRUE)

In [ ]:

gsub('(a+)r(a+)', '\\U\\1r\\E\\2', vec, perl = TRUE)

Lecture 15: Regular expressions in R¶

STAT598z: Intro. to computing for statistics¶

Vinayak Rao¶

Department of Statistics, Purdue University¶

print and cat¶

Matching simple patterns¶

Regular expressions!¶

Search and replace¶

`print` and `cat`¶