Collecting data on the internet is a common task in the commercial sector. Checking your competitors’ prices, for instance, is a standard practice for staying competitive. This can be done manually, but when the task becomes repetitive and involves gathering lots of information, the time spent performing it is no longer very productive. This is why bots (scrapers, crawlers, etc.) have been created. These bots are scripts, written in various programming languages, that automatically gather information from websites.
But with their very high speed, far beyond what a human could manage, these bots can overload servers. Beyond the desire to protect its data, the “targeted” company will also try to protect its servers. The IT engineers in charge of the websites have therefore deployed protections to counter what can become a nuisance.
Here is an example. On the monetas.ch website, various pieces of information are available, such as email addresses. I suggest, for instance, that you go to this address. You will see the email address of this company, but as the following two R snippets show, the information is not “visible” in the source code of the target web page and therefore cannot be harvested.
library(rvest)

URL <- "http://www.monetas.ch/htm/660/fr/Contact-Roger-Claude-Choffat.htm?subj=2022352"
search <- read_html(URL)
A <- html_nodes(search, "div div div div ul li li a")
B <- html_text(A)

# Result:
# [1] "dcw('CQxXXlUJFyoFVV9HAQVIEA0=')"
or,
library(RCurl)
library(XML)

URL <- "http://www.monetas.ch/htm/660/fr/Contact-Roger-Claude-Choffat.htm?subj=2022352"
data <- getURL(URL, ssl.verifypeer = FALSE, encoding = "UTF-8")
data_parsed <- htmlTreeParse(data, encoding = "UTF-8")
parsed_lines <- capture.output(data_parsed)

# Look for anything shaped like an email address in the parsed source
grep("[a-zA-Z]{1,}@[a-zA-Z]{1,}\\.[a-zA-Z]{1,}", parsed_lines)

# Result:
# integer(0)
No result!
In the web page, there is actually a script that, when executed, displays the email address. The R snippets only see the script, not the result of its execution. It is a simple and relatively effective protection.
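If you want to convince yourself, a quick way (reusing the parsed_lines object from the second snippet above, and assuming the obfuscation call is still named dcw) is to search the raw source for it:

# The raw source contains the obfuscation script, not the address itself
grep("dcw", parsed_lines, value = TRUE)
# should return the line(s) containing something like dcw('CQxXXlUJFyoFVV9HAQVIEA0=')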
But, as is often the case, a solution exists; in fact, several solutions exist to circumvent this practice. The behavior of the bot (here, the R code) must be transformed into human behavior. It is possible! Whereas a snippet like the ones used above is recognized as such by the “targeted” site, the workaround presented below makes the site “see” nothing but your web browser.
Let me explain: at this point, staying within R, you have two options: use your favorite internet browser (Chrome, Firefox, Safari, etc.) or PhantomJS. PhantomJS does not ring a bell? That is normal: you are not going to use it to visit your Facebook page. PhantomJS displays nothing. It browses the web and renders an entire web page, storing that rendering in your computer’s memory, but it offers no visualization of the page. The advantage? If you are looking to save time (productivity), not having to display a web page will actually save you time. This is the major difference from other browsers.
In practice, what should you do? On a Mac, as in my case, I use Homebrew for command-line installations via the Terminal. If you have already installed Homebrew, PhantomJS is installed with the following command line:
brew install phantomjs
Then you will have to download Selenium, the tool that will allow us to pursue our quest for information (here, an email address). I use an old version: selenium-server-standalone-2.53.1.jar.
I move the .jar file wherever I want; here, the path is Users > admin.
Back in the Terminal, we launch the Selenium server. This step must be performed every time you want to harvest the internet with Selenium and R.
cd /Users/admin
java -jar selenium-server-standalone-2.53.1.jar
In the end, you should see the following information displayed in the Terminal:
09:48:11.903 INFO - Selenium Server is up and running
Back in R, we launch the data harvest:
library(RSelenium)
library(stringr)

# Instead of "phantomjs" you could use "chrome"
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
# [1] "Connecting to remote server"
# $platform
# [1] "OS X 10.11"
# $acceptSslCerts
# [1] FALSE
# $javascriptEnabled
# [1] TRUE
# $browserName
# [1] "phantomjs"
# ...

URL <- "http://www.monetas.ch/htm/660/fr/Contact-Roger-Claude-Choffat.htm?subj=2022352"
remDr$navigate(URL)

# XPath to the list item holding the email link in the rendered page
webElem <- remDr$findElement(using = 'xpath',
                             value = '//*[@id="content"]/div/div[2]/div[2]/ul/li[2]/ul/li[2]//a')
elemtxt <- unlist(webElem$getElementAttribute("outerHTML"))

# Split the HTML on "<", ">" and spaces, then keep the piece containing "@"
Splitted4Email <- unlist(str_split(elemtxt, "[<> ]"))
Splitted4Email[grep("@", Splitted4Email)]

# Result:
# [1] "choffat@valtra.ch"
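One housekeeping detail the snippet above leaves out: once the harvest is over, it is good practice to close the session so that the PhantomJS process does not keep running in memory:

remDr$close()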
The information you are looking for appears as the result of the harvest code above. Note, however, that the process is rather slow. Moreover, if the “targeted” site seeks to protect itself from overly heavy “intrusions”, you should take that as a “warning” when you interact with it. If you run many queries on a site via an R loop, you should randomly slow down your visits to its web pages so that you do not overload the server (as a bonus, your bot will behave more “humanly”, which limits the chances that your IP address will be blocked). Here is a line of code to insert at the end of a loop that issues repeated requests to the “targeted” site; it tells R to wait between 2 and 5 seconds between requests:
Sys.sleep(runif(1, 2, 5))
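To make this concrete, here is a minimal sketch of such a loop, assuming an open remDr session as above; the URL vector is a hypothetical placeholder and the extraction step is yours to fill in:

# Hypothetical list of pages to visit; replace with the real URLs
URLs <- c("http://www.example.com/page1", "http://www.example.com/page2")

for (url in URLs) {
  remDr$navigate(url)
  # ... extract what you need from the rendered page here ...
  Sys.sleep(runif(1, 2, 5))  # wait between 2 and 5 seconds before the next request
}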
Finally, one remark should not be neglected when you harvest data: read the terms and conditions of the targeted site, particularly the legal notices. They are usually found at the bottom of the page. Rebuilding a proprietary database for business purposes is not a good idea!
NB: here, Selenium is used for an extremely simple purpose. It can, however, be used for much more complex data collection, such as submitting a query as you would on Google and processing the content displayed after the search. The possibilities are huge!
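As an illustration, here is a minimal, hedged sketch of such a search. The element name "q" for Google’s search box and the 2-second pause are assumptions that may break whenever the page changes:

library(RSelenium)

remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("https://www.google.com")

# Locate the search box by its name attribute ("q" is an assumption)
searchBox <- remDr$findElement(using = "name", value = "q")
searchBox$sendKeysToElement(list("my search terms", key = "enter"))

Sys.sleep(2)                               # give the results page time to render
resultsHTML <- remDr$getPageSource()[[1]]  # full HTML of the results page, as a string

remDr$close()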
Update: if you face the following error when launching the Selenium server:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/openqa/grid/selenium/GridLauncher : Unsupported major.minor version 51.0
You should update Java: the JDK (and maybe the JRE).
Solution found here.