Scraping html tables into R data frames using the XML package


Question

How do I scrape html tables using the XML package?

Take, for example, this wikipedia page on the Brazilian soccer team. I would like to read it in R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data.frame. How can I do this?

1
149
11/10/2016 3:40:17 PM

Accepted Answer

…or a shorter try:

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

the picked table is the longest one on the page

tables[[which.max(n.rows)]]
139
4/8/2017 1:21:04 PM

library(RCurl)
library(XML)

# Download page using RCurl
# You may need to set proxy details, etc.,  in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})

# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body 
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]

#In this case, the required table is the only one with class "wikitable sortable"  
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable  <- tables[which(tableclasses=="wikitable sortable")]$table

#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))

# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
   tablerow <- thetable$children[[i]]$children
   opponent <- tablerow[[1]]$children[[2]]$children$text$value
   others <- unname(sapply(tablerow[-1], function(x) x$children$text$value)) 
   content <- rbind(content, c(opponent, others))
}

# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)

Edited to add:

Sample output

                     Opponent Played Won Drawn Lost Goals for Goals against  % Won
    1               Argentina     94  36    24   34       148           150  38.3%
    2                Paraguay     72  44    17   11       160            61  61.1%
    3                 Uruguay     72  33    19   20       127            93  45.8%
    ...

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon