Most of the time we work with structured data, i.e. data in tables. However, most of the available data is unstructured: images, texts, web articles, and so on. Being able to extract value from such data is something of a Holy Grail for the analysts of the near future, and the quest has already started for the internet giants.

Working with data in text form is called text mining. You can perform text mining for various reasons and with numerous goals. Semantic analysis is one possibility, and there are stunning examples from the last American presidential campaign. Another aim is to build a structured database from a text. There are as many tools as there are goals.

This post aims to help R beginners who want to use regular expressions for text mining in order to obtain structured data. But what is a regular expression (regex)? It is a pattern of characters that describes a portion of a text (a string of characters).

For more detail, see the related Wikipedia article, which gives a concise definition.

Literal characters and metacharacters

First of all, some theory is necessary, because a precise syntax comes with precise rules. Two types of characters have to be taken into account: literal characters and metacharacters.

Literal characters: characters that stand only for themselves. For instance, the literal characters g, o and d make up the word good. A literal character matches itself. In the example below, the word Nacher is present only in the first string.

Str <- c("R. Nacher-Könitz", "A.-A. $Koziv", "C. Röss",  "C. Reichl",  "W. Wheider",    "T. Ihnt",  "K. Ein" )
grep("Nacher", Str)
[1] 1

Watch out: regular expressions are case sensitive, i.e. upper-case and lower-case letters (A and a) are treated as different characters!

grep("nacher", Str)
integer(0)
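If case should not matter, base R's grep() has an ignore.case argument. The short sketch below uses a shortened, hypothetical version of the vector above to compare both behaviours.

```r
# Shortened, illustrative version of the vector used above
authors <- c("R. Nacher", "C. Reichl", "K. Ein")

# Default behaviour: case sensitive, so "nacher" is not found
hits_default <- grep("nacher", authors)

# ignore.case = TRUE makes the search case insensitive
hits_nocase <- grep("nacher", authors, ignore.case = TRUE)

print(hits_default)  # integer(0)
print(hits_nocase)   # 1
```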

Metacharacters are characters which are not used in their literal meaning. A good example is the character $, which on its own does not match the symbol $ but instead marks the end of the target string. In the example below, R finds an end in every string, so all of them match.

grep("$", Str)
[1] 1 2 3 4 5 6 7

Because it is useful to be able to search for these metacharacters themselves, we can escape them, that is, make them literal, using \. Be careful: inside an R string, the \ must itself be escaped by a second \ so that R understands it is being used to escape a metacharacter. In “\\$”, the first \ escapes the second \, which in turn escapes $ (this post is written for R users who would like to use regular expressions; the syntax varies in other scripting languages). In the example below, R finds the $ character in the second string.

grep("\\$", Str)
[1] 2

The metacharacters are: [ ] \ - ^ $ . ? * + { } ( ) |

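To conclude, searching for a “\” in a string is also possible, as in the following example. The pattern needs four backslashes: R reads “\\\\” as the two characters \\, which the regex engine then reads as one literal \.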

grep("\\\\" , "A.-A. \\Koziv")
[1] 1
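When the pattern is plain text with no regex logic at all, an alternative to escaping is the fixed = TRUE argument of grep(), which treats the whole pattern literally. A minimal sketch, using a small made-up vector:

```r
# Illustrative vector: one string with "$", one with a backslash
x <- c("R. Nacher", "A.-A. $Koziv", "C:\\temp")

# Regex version: "$" must be escaped to be literal
dollar_regex <- grep("\\$", x)

# fixed = TRUE: the pattern is taken literally, no escaping for the regex engine
dollar_fixed <- grep("$", x, fixed = TRUE)

# A single literal backslash (still written "\\" inside an R string)
backslash_fixed <- grep("\\", x, fixed = TRUE)

# "." is an ordinary dot here, not "any character"
dot_fixed <- grep(".", x, fixed = TRUE)

print(dollar_regex)     # 2
print(dollar_fixed)     # 2
print(backslash_fixed)  # 3
print(dot_fixed)        # 1 2
```

The trade-off is that with fixed = TRUE none of the metacharacters are available, so it only suits plain substring searches.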

Classes of character

We know how to search for literal characters and escaped metacharacters. It’s a good start, but the true power of regex comes from character classes combined with metacharacters. A character class is written between [ ] and lists the characters that are allowed at that position.

Here is a simple example: how do we target words containing the literal sequence “ex”, either at the beginning of the word or directly after the first letter?

library(stringr)
Str <- "En informatique, une expression régulière ou expression normale ou expression rationnelle ou motif, est une chaîne de caractères, qui décrit, selon une syntaxe précise, un ensemble de chaînes de caractères possibles. Les expressions régulières sont issues des théories mathématiques des langages formels des années 1940. Leur capacité à décrire avec concision des ensembles réguliers explique qu’elles se retrouvent dans plusieurs domaines scientifiques dans les années d’après-guerre et justifie leur adoption en informatique. Les expressions régulières sont aujourd’hui utilisées par les informaticiens dans l’édition et le contrôle de texte ainsi que dans la manipulation des langues formelles que sont les langages informatiques."
str_extract_all( Str,"[a-zA-Z]?ex[a-z]+")
[[1]]
[1] "expression"  "expression"  "expression"  "expressions" "explique"    "expressions" "texte"

The character class [a-zA-Z] is defined by [ ] and the sequence a-zA-Z. Here, we look for one letter, lower or upper case, in the range a to (-) z or A to (-) Z (remember that a regex is case sensitive). A word at the beginning of a sentence starts with a capital letter, so the range should include capitals.
However, the metacharacter “?” makes this character class optional: the letter may be present or not.
Next, whether or not a letter was found, we look for the literal sequence “ex”, not the reversed one (“xe”). On its own, “[a-zA-Z]” would match every letter encountered; adding “ex” constrains it to a letter, present or not, located immediately before “ex”.
Finally, a new character class “[a-z]” is targeted with the quantifier “+”, which means that this class must be present one or more times. In other words, there is at least one letter just after “ex”.

Caution
The range “a-z” given here is just an example; “a-d” would also work, targeting the letters a, b, c and d. The range “0-9” can be used as well.
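To make the effect of different ranges concrete, here is a small sketch on a made-up string. It uses base R's regmatches() and gregexpr(), which together play roughly the same role as str_extract_all(), so that it runs without any package.

```r
# Illustrative input, not taken from the text above
x <- "abcdef 2024 cab"

# Runs of letters limited to the range a-d
letters_a_d <- regmatches(x, gregexpr("[a-d]+", x))[[1]]

# Runs of digits
digits <- regmatches(x, gregexpr("[0-9]+", x))[[1]]

print(letters_a_d)  # "abcd" "cab"
print(digits)       # "2024"
```

Note how “ef” is left out of the first result: e and f fall outside the range a-d.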

Another example: this time we target words containing the literal sequence “re”, wherever it is located in the word.

library(stringr)
Str <- "En informatique, une expression régulière ou expression normale ou expression rationnelle ou motif, est une chaîne de caractères, qui décrit, selon une syntaxe précise, un ensemble de chaînes de caractères possibles. Les expressions régulières sont issues des théories mathématiques des langages formels des années 1940. Leur capacité à décrire avec concision des ensembles réguliers explique qu’elles se retrouvent dans plusieurs domaines scientifiques dans les années d’après-guerre et justifie leur adoption en informatique. Les expressions régulières sont aujourd’hui utilisées par les informaticiens dans l’édition et le contrôle de texte ainsi que dans la manipulation des langues formelles que sont les langages informatiques."
str_extract_all( Str,"[[:alpha:]]*re([a-z]+)?")
[[1]]
 [1] "expression"  "régulière"   "expression"  "expression"  "caractères"  "caractères"  "expressions" "régulières"  "décrire"     "retrouvent"  "guerre"      "expressions" "régulières"

The character class [[:alpha:]] matches any letter, upper or lower case.

Caution
The expression [:alpha:] on its own does not have the same meaning as the class [[:alpha:]]. With base R functions such as grep(), [:alpha:] alone targets the literals “: a l p h”. The cases below show this difference explicitly. Note that in the stringr example above the two forms would give the same result; as far as I know, this is because the regex engine behind stringr (ICU) also accepts [:alpha:] on its own as a class, rather than because of the quantifier “*” (see below). Many such predefined classes exist; they are presented here, for instance.

grep("[:alpha:]", c("Hier,", "nous", "sommes", "partis", "à", "la" ,"plage"))
[1] 4 6 7
grep("[[:alpha:]]", c("Hier,", "nous", "sommes", "partis", "à", "la" ,"plage"))
[1] 1 2 3 4 5 6 7

grep("[:punct:]", c("Hier,", "nous", "sommes", "partis", "à", "la" ,"plage"))
[1] 2 4 7
grep("[[:punct:]]", c("Hier,", "nous", "sommes", "partis", "à", "la" ,"plage"))
[1] 1
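Other predefined classes follow the same double-bracket syntax. The sketch below tries three of them, [[:digit:]], [[:space:]] and [[:punct:]], on a small made-up vector; grepl() returns TRUE or FALSE for each element instead of the matching positions.

```r
# Illustrative vector
x <- c("Room 12", "no-digits", "tab\there", "Hello, world!")

has_digit <- grepl("[[:digit:]]", x)  # any digit 0-9
has_space <- grepl("[[:space:]]", x)  # space, tab, newline, ...
has_punct <- grepl("[[:punct:]]", x)  # punctuation such as , - !

print(has_digit)  # TRUE FALSE FALSE FALSE
print(has_space)  # TRUE FALSE TRUE TRUE
print(has_punct)  # FALSE TRUE FALSE TRUE
```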

The “*” which follows [[:alpha:]] means that the class [[:alpha:]] can be absent, or present one or more times.
Next, whether or not a letter was found, we look for the literal sequence “re”. In other words, words starting with “re” are identified too.
Finally, the expression “([a-z]+)?” (which could have been simplified) helps to understand the use of some metacharacters. This expression is equivalent to “[a-z]*”. First, “[a-z]+” targets one or more consecutive letters in the range a to z. Unlike “[a-z]*”, if we stopped there we could not select a word ending in “re”, because at least one lower-case letter would have to follow “re”. A workaround consists in enclosing “[a-z]+” in ( ), which creates a syntactic group to which the quantifier “?” can be added, making the group optional.
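The quantifiers discussed above can be compared side by side. In this sketch (made-up words, base R only), “^” and “$” pin the pattern to the whole string so that only the quantifier decides whether a word is accepted.

```r
# Illustrative words
words <- c("re", "rev", "revue", "r")

# "+" : at least one letter must follow "re"
plus_q <- grepl("^re[a-z]+$", words)

# "*" : zero or more letters may follow "re"
star_q <- grepl("^re[a-z]*$", words)

# "(...)?" : the group "one or more letters" is optional, same effect as "*"
group_q <- grepl("^re([a-z]+)?$", words)

# "?" directly on the class: at most one letter may follow "re"
opt_q <- grepl("^re[a-z]?$", words)

print(plus_q)   # FALSE TRUE TRUE FALSE
print(star_q)   # TRUE TRUE TRUE FALSE
print(group_q)  # TRUE TRUE TRUE FALSE
print(opt_q)    # TRUE TRUE FALSE FALSE
```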

We can also exclude characters from the target. This is achieved by placing the metacharacter “^” at the start of a character class, as in [^….].

str_extract_all("maintenant ou jamais","[^aeiouy]")
[[1]]
 [1] "m" "n" "t" "n" "n" "t" " " " " "j" "m" "s"

str_extract_all("maintenant ou jamais","[^aeiouy ]")
[[1]]
[1] "m" "n" "t" "n" "n" "t" "j" "m" "s"

In the first example, I exclude all the vowels, which keeps the spaces between words.
In the second one, I exclude both the vowels and the spaces.
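A negated class is also handy for cleaning a string rather than extracting from it. As a small sketch with base R's gsub(), everything that is not in the class is replaced by an empty string:

```r
x <- "maintenant ou jamais"

# Remove everything that is NOT a vowel or a space
vowels_only <- gsub("[^aeiouy ]", "", x)

# Opposite operation with a plain (non-negated) class: remove the vowels
no_vowels <- gsub("[aeiouy]", "", x)

print(vowels_only)  # "aiea ou aai"
print(no_vowels)    # "mntnnt  jms"
```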

It is possible to anchor your regex at the beginning or at the end of a character string using the metacharacters “^” and “$”, respectively.

grep("^[aeiouy]", c("Hier,", "nous", "sommes", "allés", "au", "cinéma"))
[1] 4 5
grep("[aeiouy]$", c("Hier,", "nous", "sommes", "allés", "au", "cinéma"))
[1] 5 6

In the first example, we target the words starting with a vowel using the metacharacter “^” placed at the beginning of the regex.
In the second example, we target the words ending with a vowel using the metacharacter “$” placed at the end of the regex.
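Using both anchors in the same regex forces the pattern to describe the whole string, which is a common way to validate a value. A minimal sketch on made-up inputs:

```r
# Illustrative inputs
x <- c("1940", "12a", "a12", "")

starts_digit <- grepl("^[0-9]", x)    # first character is a digit
ends_digit   <- grepl("[0-9]$", x)    # last character is a digit
all_digits   <- grepl("^[0-9]+$", x)  # digits only, from start to end

print(starts_digit)  # TRUE TRUE FALSE FALSE
print(ends_digit)    # TRUE FALSE TRUE FALSE
print(all_digits)    # TRUE FALSE FALSE FALSE
```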

In terms of positioning, one last metacharacter should be discussed: the “.”. It matches any character present at a given location. This means that, after targeting a letter for instance, the “.” will match whatever character immediately follows. Here are some examples to understand the way “.” works:

str_extract_all("Regular expression",".")
[[1]]
 [1] "R" "e" "g" "u" "l" "a" "r" " " "e" "x" "p" "r" "e" "s" "s" "i" "o" "n"
str_extract_all("12.- €/kg",".")
[[1]]
[1] "1" "2" "." "-" " " "€" "/" "k" "g"
str_extract_all("Regular expression","^.")
[[1]]
[1] "R"
str_extract_all("Regular expression","\\b.")
[[1]]
[1] "R" " " "e"
str_extract_all("Regular expression","[g].")
[[1]]
[1] "gu"

In the first two examples, no expression giving a location in the string is defined before the “.”, so this metacharacter tests every character in turn. They all fit, so it returns all of them one by one.
The third example looks at the beginning of the character string (condition defined by “^”), so “.” targets only the first character found.
The fourth example makes use of a metacharacter which was not mentioned before. I invite you to check this site for more information. “\\b” targets the boundaries of words (it marks a location before the first letter and after the last letter of each word). Consequently, “.” targets the character just after each of these locations. We identify “R” and “ ” (space) for the first word; for the second word only the “e” is returned because no character is found after its last letter.
The last example targets the literal “g”, and “.” matches the character following the “g”.
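Two practical consequences of the above are worth a quick sketch (base R, made-up inputs): an unescaped “.” accepts any character, so it must be escaped to match a real dot, and a pair of “\\b” can be used to match a whole word only.

```r
# "." as a metacharacter versus an escaped, literal dot
x <- c("abc", "a.c", "ac")
any_char    <- grepl("a.c", x)    # "a", any one character, "c"
literal_dot <- grepl("a\\.c", x)  # "a", a real dot, "c"

# "\\b" on both sides: "texte" must be a word of its own
phrases <- c("texte", "contexte", "le texte ici")
whole_word <- grepl("\\btexte\\b", phrases)

print(any_char)     # TRUE TRUE FALSE
print(literal_dot)  # FALSE TRUE FALSE
print(whole_word)   # TRUE FALSE TRUE
```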

Alternative targeting (alternation) is also possible with the metacharacter “|”. To understand how it works, please look at the following example, presented step by step.

library(stringr)
Str <- "En informatique, une expression régulière ou expression normale ou expression rationnelle ou motif, est une chaîne de caractères, qui décrit, selon une syntaxe précise, un ensemble de chaînes de caractères possibles. Les expressions régulières sont issues des théories mathématiques des langages formels des années 1940. Leur capacité à décrire avec concision des ensembles réguliers explique qu’elles se retrouvent dans plusieurs domaines scientifiques dans les années d’après-guerre et justifie leur adoption en informatique. Les expressions régulières sont aujourd’hui utilisées par les informaticiens dans l’édition et le contrôle de texte ainsi que dans la manipulation des langues formelles que sont les langages informatiques."
str_extract_all( Str,"[[:lower:]]*eu[a-z]*")
[[1]]
[1] "eur"       "plusieurs" "leur" 

str_extract_all( Str,"[[:upper:]]?eu[a-z]*")
[[1]]
[1] "Leur" "eurs" "eur" 

str_extract_all( Str,"([[:upper:]]?|[[:lower:]]*)eu[a-z]*")
[[1]]
[1] "Leur"      "plusieurs" "leur"

In the first step, the regex “[[:lower:]]*eu[a-z]*” targets lower-case characters present 0 or more times (“[[:lower:]]*”), followed by the literals “eu”, followed by 0 or more literals ranging from a to z (“[a-z]*”). Note: “[a-z]” excludes language-dependent literals such as é, è, ü, ø, etc.
In the second step, “[[:upper:]]?eu[a-z]*” targets an upper-case letter present 0 or 1 time (no more), followed by the literals “eu”, followed by 0 or more literals ranging from a to z.

As you can see, neither of these two regexes can extract every word in full when used alone: the first one cannot target upper-case letters, while the second one cannot target lower-case letters at the beginning of the words. The syntactic group “([[:upper:]]?|[[:lower:]]*)”, whose two parts are separated by the metacharacter “|”, targets either upper- or lower-case letters. In other words, before “eu” we accept either one upper-case letter present 0 or 1 time, or lower-case letters present 0 or more times. Beware, this does not cover every possibility! For instance, a word starting with an upper-case letter, followed by 1 or more lower-case letters (not “eu”), followed by “eu”, is not fully matched. The following examples illustrate this.

str_extract_all(c("Lueur", "lueur"),"([[:upper:]]?|[[:lower:]]*)eu[a-z]*")
[[1]]
[1] "ueur"

[[2]]
[1] "lueur"

str_extract_all(c("Lueur", "lueur"),"([[:upper:]]?.*|[[:lower:]]*)eu[a-z]*")
[[1]]
[1] "Lueur"

[[2]]
[1] "lueur"

In the first example, the word Lueur is truncated because the upper-case letter would have to be followed directly by “eu”, which is not the case.
In the second example, we use “.*” to allow any character, present 0 to n times, between the possible upper-case letter and the “eu” string.

NB: the only goal of this example is to demonstrate the use of “|” explicitly. The same result can be obtained without alternative targeting, as you can see below.

str_extract_all(c("Lueur", "lueur"),"[[:alpha:]]*eu[a-z]*")
[[1]]
[1] "Lueur"

[[2]]
[1] "lueur"
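One point that often surprises beginners is the scope of “|”: it splits everything around it unless parentheses limit it. A short sketch with made-up words:

```r
# Inside a group, "|" only chooses between "a" and "e"
x <- c("gray", "grey", "griy", "gry")
grouped <- grepl("gr(a|e)y", x)

# Without the group, the alternatives are "gra" and "ey" as a whole
y <- c("gra", "key", "gry")
ungrouped <- grepl("gra|ey", y)

print(grouped)    # TRUE TRUE FALSE FALSE
print(ungrouped)  # TRUE TRUE FALSE
```

This is why the alternatives in the examples above are wrapped in ( ) before “eu”.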

Now, remember that one of the aims of this post was to produce structured data using regex. Below you will find a simple example. We have character strings in which family names and the initial(s) of the given names appear with or without space(s), dots, and even numbers. How can we automate the structuring of this data so that it can be reused later? The sample here is small, but with a bigger one the coding effort would be worthwhile.

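library(stringr)

STRINGS <- c("Nacher1,2 R. Koziv5 A. A. Röss1 C. Reichl3 C. Wheider1,4 W. Ihnt1 T. Ein1 K.", 
             "Nacher R. Koziv A.A. Röss C. Reichl C. Wheider W. Ihnt T. Ein K.",
             "NacherR.KozivA.A.RössC.ReichlC.WheiderW.IhntT.EinK.",
             "R. Nacher A. A. Koziv C. Röss C. Reichl W. Wheider T. Ihnt K. Ein",
             "R.A.NacherA.A.KozivC.RössC.ReichlW.WheiderT.IhntK.Ein",
             "R. A. Nacher A. A. Koziv C. Röss C. Reichl W. Wheider T. Ihnt K. Ein",
             "Nacher-Könitz R. Koziv A. A. Röss C. Reichl C. Wheider W. Ihnt T. Ein K.",
             "Nacher-Könitz R. Koziv A.-A. Röss C. Reichl C. Wheider W. Ihnt T. Ein K.",
             "R. Nacher-Könitz A.-A. Koziv C. Röss C. Reichl W. Wheider T. Ihnt K. Ein")
# INT was not defined in the original listing. It is assumed here to be STRINGS
# with all whitespace removed, so that "A. A." and "A.A." are read the same way;
# this assumption is consistent with the output shown below.
INT <- str_replace_all(STRINGS, "[[:space:]]", "")
# Family name: one capital, then lower-case letters, optionally followed by a hyphenated second part
Fam_nam <- str_extract_all(INT, "[[:upper:]]{1}[a-zšžþàáâãäåçèéêëìíîïðñòóôõöùúûüý]{1,}(-[[:upper:]]{1}[a-zšžþàáâãäåçèéêëìíîïðñòóôõöùúûüý]{1,})?")
# Initials: a capital followed by a dot, optionally followed by a second (possibly hyphenated) initial
Initials <- str_extract_all(INT, "[[:upper:]]{1}\\.(-?[[:upper:]]{1}\\.)?")
# One row per input string, one column per author
Names <- do.call(rbind, Fam_nam)
Given_N <- do.call(rbind, Initials)
# Pair each family name with the initials found at the same position
List <- lapply(seq_len(nrow(Names)), function(i) {
  sapply(seq_len(ncol(Names)), function(j) paste(Names[i, j], Given_N[i, j], sep = " "))
})
do.call(rbind, List)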
      [,1]               [,2]          [,3]      [,4]        [,5]         [,6]      [,7]    
 [1,] "Nacher R."        "Koziv A.A."  "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [2,] "Nacher R."        "Koziv A.A."  "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [3,] "Nacher R."        "Koziv A.A."  "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [4,] "Nacher R."        "Koziv A.A."  "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [5,] "Nacher R.A."      "Koziv A.A."  "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [6,] "Nacher R.A."      "Koziv A.A."  "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [7,] "Nacher-Könitz R." "Koziv A.A."  "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [8,] "Nacher-Könitz R." "Koziv A.-A." "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
 [9,] "Nacher-Könitz R." "Koziv A.-A." "Röss C." "Reichl C." "Wheider W." "Ihnt T." "Ein K."
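To go one step further towards structured data, the extracted pieces can be stored in a data frame with one row per author. The sketch below is only an illustration under a few assumptions: it uses base R instead of stringr, a made-up ASCII-only input string, and [[:lower:]] instead of the explicit list of accented letters used above.

```r
# Hypothetical input string (ASCII only, so the result does not depend on the locale)
raw <- "R. Nacher-Konitz A.-A. Koziv C. Reichl"

# Remove whitespace so that "A. A." and "A.A." would be treated alike
compact <- gsub("[[:space:]]", "", raw)

# Family names: capital + lower-case letters, optional hyphenated second part
family <- regmatches(
  compact,
  gregexpr("[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]+)?", compact)
)[[1]]

# Initials: capital + dot, optionally followed by a second initial
initials <- regmatches(
  compact,
  gregexpr("[[:upper:]]\\.(-?[[:upper:]]\\.)?", compact)
)[[1]]

# One row per author, one column per piece of information
authors <- data.frame(
  family_name = family,
  initials = initials,
  stringsAsFactors = FALSE
)
print(authors)
```

A data frame of this kind can then be filtered, sorted or merged like any other table, which is the point of structuring the text in the first place.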

See you soon,

Why and how to use regular expression (regex) with R
