Read data from a PHP website with R


Keywords: read data, website, php | Updated: 2023-09-27

I would like to import data into R from a table like this one:

http://www.rout.gr/index.php?name=Rout&file=results&year=2011

I tried using the XML library, as suggested in the following thread, but I get nothing back.

Scraping html tables into R data frames using the XML package

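Something funky does seem to be going on with this site. It does not appear to return any data unless you spoof the user agent. Even then, readHTMLTable does not behave well: it returns an error if you pass it the whole doc. Reading the page source, you can see that the relevant table has the id table_results_r_1. Isolating that node and passing it to readHTMLTable works: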

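library(XML)
library(httr)
theurl <- "http://www.rout.gr/index.php?name=Rout&file=results&year=2011"
# The site returns no data unless a browser-like user agent is sent
resp <- GET(theurl, user_agent("Mozilla"))
doc <- htmlParse(content(resp, as = "text"), asText = TRUE)
# Isolate the results table by its id, then convert only that node
results <- getNodeSet(doc, "//table[@id='table_results_r_1']")
results <- readHTMLTable(results[[1]])
rm(doc)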

Now you need to tidy up the table column names.
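One possible way to do that tidying is sketched below. The data frame and its headers are made up for illustration (the real column names on the site are not shown here), and tidy_names is a hypothetical helper, not part of any package. It assumes ASCII headers; non-ASCII headers such as Greek text would need a different rule, because this pattern would replace them entirely.

```r
# Toy stand-in for the scraped table; these headers are invented for the example.
results <- data.frame(
  " Pos " = c("1", "2"),
  "Driver / Co-driver" = c("A", "B"),
  "Total time" = c("1:02:03", "1:02:10"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn messy headers into syntactic, lower-case names.
tidy_names <- function(x) {
  x <- trimws(x)                       # drop leading/trailing whitespace
  x <- gsub("[^A-Za-z0-9]+", "_", x)   # collapse runs of other characters to "_"
  x <- gsub("^_+|_+$", "", x)          # strip underscores left at either end
  make.names(tolower(x), unique = TRUE)
}

names(results) <- tidy_names(names(results))
print(names(results))
```

Using make.names with unique = TRUE as the last step means duplicated or empty headers still end up as distinct, valid column names.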

Further to my comment:

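library(XML)
library(httr)
theurl <- "http://www.rout.gr/index.php?name=Rout&file=results&year=2011"
# The site returns no data unless a browser-like user agent is sent
doc <- htmlParse(content(GET(theurl, user_agent("Mozilla")), as = "text"), asText = TRUE)
# Strip every HTML comment from the document before reading the tables
removeNodes(getNodeSet(doc, "//comment()"))
dum.tables <- readHTMLTable(doc)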

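So the comment in the header of the 14th table is what causes the problem. We can remove all the HTML comments, and the function will then work on every table on the page.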
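The comment-stripping step can be tried offline on a small document. The HTML below is made up for illustration and is not taken from the site. Whether a comment in a header row actually triggers an error may depend on the XML package version, so this sketch only shows that the comment nodes are removed and that the tables still parse afterwards.

```r
library(XML)

# Made-up page: the second table has an HTML comment inside its header row.
html <- paste0(
  "<html><body>",
  "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>",
  "<table><tr><th>c</th><!-- note --><th>d</th></tr><tr><td>3</td><td>4</td></tr></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Count comment nodes before and after removing them from the parsed tree.
n_before <- length(getNodeSet(doc, "//comment()"))
removeNodes(getNodeSet(doc, "//comment()"))
n_after <- length(getNodeSet(doc, "//comment()"))

# With the comments gone, read every table on the page into a list.
tables <- readHTMLTable(doc)
```

Because removeNodes edits the parsed document in place, no reassignment of doc is needed before calling readHTMLTable.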