Web Scraping Experiment

Aug 5, 2017 7 min read

This is an attempt to collect meta-data from links to academic articles. There are several R packages for both web crawling and data extraction, including Rcrawler, rvest, and scrapeR. Among these, only RCrawler has capabilities for both data extraction and web crawling. I won’t need to make use of the latter functionality here, since I already have a list of url’s that need to be mined. Instead, I’m mostly interested in web usage mining and web content mining, the extraction of “valuable information from web content” (Khalil and Fakir 2017).

Below I attemp to extract the title of an article in The American Journal of Sociology (AJS) from both the publisher’s url and from a jstor link to the article’s abstract. I also try to extract the title from an open access jstor link to an ebook. Although I could access abstracts from public jstor links, I wasn’t able to extract data from jstor links using a proxy campus login, neither the abstracts nor the actual articles that are behind a paywall. Since I’m not on a college campus, I wasn’t be able to test whether data can be extracted from jstor links from a location that doesn’t require a proxy login.

The links I’ll be using are as follows:

The Control of Managerial Discretion: Evidence from Unionization’s Impact on Employment Segregation, American Journal of Sociology publisher’s site.
The Control of Managerial Discretion: Evidence from Unionization’s Impact on Employment Segregation, American Journal of Sociology Jstor link to abstract.
Social Media in Rural China, the jstor link to an open access ebook.

1. Storing the Links

ajs <- "http://www.journals.uchicago.edu/doi/full/10.1086/683357"
jstor <- "https://www.jstor.org/stable/10.1086/683357"
jstor_book <- "http://www.jstor.org/stable/j.ctt1g69xx3"

2. Using the Rcrawler Package

Below I use the LinkExtractor and ContentScraper functions from the Rcrawler package. The ContentScraper function takes a webpage argument, a patterns argument, and a patname argument. The webpage is a a character vector created from the LinkExtractor function. The patterns argument uses XPath patterns. Since, I didn’t know what XPaths were, I had to read a short tutorial on the web. After some tinkering, I found that the expression //*/title extracted the title I was looking for. But this would likely be different depending on the website layour of the vendor or publisher.

Also, the LinkExtractor function returns a list. It isn’t clear to me how this list is ordered but in every case, the title is contained somewhere in the 10th element of the first element, hence the brackets [[1]][[10]] are used.

## Installing the Rcrawler package if not already installed
if ("Rcrawler" %in% rownames(installed.packages()) == FALSE) {
    install.packages("Rcrawler")
}
require(Rcrawler)

pageInfo <- LinkExtractor(url = ajs)
Data <- ContentScraper(pageInfo[[1]][[10]], "//*/title", "title")
Data

## $title
## [1] "The Control of Managerial Discretion: Evidence from                     Unionizations Impact on Employment Segregation: American Journal of Sociology: Vol 121, No 3"

pageInfo2 <- LinkExtractor(url = jstor)
Data2 <- ContentScraper(pageInfo2[[1]][[10]], "//*/title", "title")
Data2

## $title
## [1] "The Control of Managerial Discretion: Evidence from Unionizations Impact on Employment Segregation on JSTOR"

pageInfo3 <- LinkExtractor(url = jstor_book)
Data3 <- ContentScraper(pageInfo3[[1]][[10]], "//*/title", "title")
Data3

## $title
## [1] "Social Media in Rural China on JSTOR"

Next, I wrote a short function to extract data from all three pages.

pages <- list(ajs, jstor, jstor_book)
getContent <- function(x){
  link_char <- character(0)
  L <- length(x)
  for(i in seq_along(x)){
    y <- LinkExtractor(x[[i]])
    y2 <- ContentScraper(y[[1]][[10]], "//*/title","Title") %>% unlist()
    link_char <- c(link_char, y2)
  }
  return(link_char)
}

y <- getContent(pages)
y

##                                                                                                                                                                   Title 
## "The Control of Managerial Discretion: Evidence from                     Unionizations Impact on Employment Segregation: American Journal of Sociology: Vol 121, No 3" 
##                                                                                                                                                                   Title 
##                                                          "The Control of Managerial Discretion: Evidence from Unionizations Impact on Employment Segregation on JSTOR" 
##                                                                                                                                                                   Title 
##                                                                                                                                  "Social Media in Rural China on JSTOR"

3. Using rvest package

Using the rvest package requires three steps. First, the read_html function from the xml2 package is used to extract the entire webpage. Second, the html_nodes function from the rvest package extracts a specific component of the webpage, using either the arguments css or xpath. The xpath argument would use XPath syntax, such as what I used above. Below I use the css selector instead. To find the css selector, I use a nice chrome plugin called Selector Gadget. A tutorial on how to use this tool and the rvest package to harvest web data can be found HERE. For the AJS publisher’s site and the Jstor link, the title can be found using the css selector .publicationContentTitle h1. For the jstor ebook, the title can be found using the css selector #content .mbs

AJS publisher’s website

## Installing the rvest package if not already installed
if ("rvest" %in% rownames(installed.packages()) == FALSE) {
    install.packages("rvest")
}
require(rvest)

## Loading required package: rvest

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

# Step 1
ajs_page <- xml2::read_html(ajs)
# Step 2 - Using CSS selectors to scrap title
ajs_css <- html_nodes(ajs_page, ".publicationContentTitle h1")
# Step 3 - Converting title data to text
ajs_title <- html_text(ajs_css)
ajs_title

## [1] "The Control of Managerial Discretion: Evidence from Unionizations Impact on Employment Segregation1"

Jstor link to abstract

The following code doesn’t work. A 403 error is returned on the first step. This is probably because it is against JSTOR’s terms of service. It is likely that an API would be required. I also tried this using a proxy login and that didn’t work either. Since I wasn’t on a college campus, I couldn’t test whether it would work without using a login from a location which has direct access to the Jstor content. Also note that the link is to an abstract only, not the material behind the paywall.

# Step 1
jstor_page <- xml2::read_html(jstor)
# Step 2 - Using CSS selectors to scrap title
jstor_css <- html_nodes(jstor_page, ".publicationContentTitle h1")
# Step 3 - Converting title data to text
jstor_title <- html_text(jstor_css)
jstor_title

Jstor link to open access e-book

The same problem occurs below. I never got past the first line of code. JSTOR doesn’t allow webscraping of their content, even if its open access and not behind a paywall. I’m not sure why the Rcrawler package worked and the rvest package did not.

# Step 1
book_page <- xml2::read_html(jstor_book)
# Step 2 - Using CSS selectors to scrap title
book_css <- html_nodes(book_page, ".mbs")
# Step 3 - Converting title data to text
book_title <- html_text(book_css)
book_title

4. Using the scrapeR package

The scrapeR package uses the function scrape in conjunction with the xml function xPathSApply. The code itself is somewhat inscrutable to me at the moment.

Also, at this point, I realized that the publisher had blocked my url. This is something to keep in mind. I had to switch to a proxy url to test the following code.

AJS publisher’s website

I could not get this to work. I tried setting the follow argument to TRUE and the parse argument to FALSE but the content extracted from the scrape function did not seem to contain any meta-data for the article. I also tried using different XPath codes, and printed out the entire content scraped the webpage.

I tried using different proxy url’s and ensured that my url was not being blocked, so I don’t know what’s going on with this.

## Installing the scrapeR package if not already installed
if ("scrapeR" %in% rownames(installed.packages()) == FALSE) {
    install.packages("scrapeR")
}
require(scrapeR)

## Loading required package: scrapeR

## Loading required package: XML

## 
## Attaching package: 'XML'

## The following object is masked from 'package:rvest':
## 
##     xml

## Loading required package: RCurl

## Loading required package: bitops

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

ajs_scrape <- scrape(url = ajs, follow = TRUE)

xpathSApply(ajs_scrape[[1]], "//*/title")

## [[1]]
## <title/>

Jstor link to abstract

Somehow, the JSTOR link DID work.

jstor_scrape <- scrape(url = jstor, follow = TRUE)

xpathSApply(jstor_scrape[[1]], "//*/title")

## [[1]]
## <title>
##     The Control of Managerial Discretion: Evidence from Unionizations Impact on Employment Segregation on JSTOR
## </title>

Jstor link to open access e-book

This also worked.

jstor_book <- scrape(url = jstor_book, follow = TRUE)

xpathSApply(jstor_book[[1]], "//*/title")

## [[1]]
## <title>
##     Social Media in Rural China on JSTOR
## </title>