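I'm currently converting all of my Beautiful Soup code to PHP, just to get used to PHP. I've run into a problem, though: my PHP code only works when the wiki page's HTML has an "External links" section after "Original run" (for example, the True Detective wiki page). I've just found out that this is not always the case, because a page may not have an "External links" section at all. Is there a way to convert my Beautiful Soup code to PHP using the same technique the Beautiful Soup code uses?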
import re

import requests
from bs4 import BeautifulSoup


def get_date(url):
    """Return the infobox's "Original run" value without parenthesised notes."""
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    for table in soup.find_all("table", {"class": "infobox"}):
        for header in table.find_all("th"):
            if header.text == "Original run":
                value = header.find_next("td").text
                # Remove parenthesised parts such as "(2010-10-31)".
                return re.sub(r"\([^)]*\)", "", value)
This is my current PHP code:
<?php
// Defining the basic cURL function
function curl($url) {
    $ch = curl_init(); // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
?>
<?php
// Defining the basic scraping function
function scrape_between($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
?>
<?php
$scraped_page = curl("http://en.wikipedia.org/wiki/The_Walking_Dead_(TV_series)"); // Downloading the Wikipedia page into $scraped_page
$scraped_data = scrape_between($scraped_page, "<table class=\"infobox vevent\" style=\"width:22em\">", "</table>"); // Extracting the infobox markup between the opening <table ...> tag and the next </table>
$original_run = mb_substr($scraped_data, strpos($scraped_data, "Original run")-2, strpos($scraped_data, "External links") - strpos($scraped_data, "Original run")-2);
echo $original_run;
?>
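Have you considered simply using the Wikipedia API? The automatically generated wiki markup is usually horrible to work with, and it can change at any time.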
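As a rough sketch of the API route, the snippet below builds a request to the MediaWiki `parse` action for a page's raw wikitext and pulls a single infobox parameter out of it. The helper names `wiki_api_url()` and `infobox_field()` are made up for illustration, and the parameter name `first_aired` is an assumption about how the television infobox stores the start of the original run, so check it against the actual page source. The demonstration runs on a hand-written fragment rather than a live response.

```php
<?php
// Sketch only: build a MediaWiki "parse" API URL that returns a page's wikitext.
// wiki_api_url() and infobox_field() are hypothetical helper names.
function wiki_api_url(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'        => 'parse',
        'page'          => $title,
        'prop'          => 'wikitext',
        'format'        => 'json',
        'formatversion' => 2,
    ]);
}

// Return the value of one "| name = value" line in raw wikitext, or null.
function infobox_field(string $wikitext, string $name): ?string
{
    $pattern = '/^\s*\|\s*' . preg_quote($name, '/') . '\s*=\s*(.*)$/m';
    return preg_match($pattern, $wikitext, $m) ? trim($m[1]) : null;
}

// With a live request, the wikitext would come from the decoded JSON, e.g.:
//   $json = json_decode(curl(wiki_api_url('The Walking Dead (TV series)')), true);
//   $wikitext = $json['parse']['wikitext'];

// Offline demonstration on a hand-written fragment (not real page content).
// "first_aired" is an assumed parameter name.
$sample = "{{Infobox television\n| name = Example\n| first_aired = {{Start date|2010|10|31}}\n}}";
echo infobox_field($sample, 'first_aired'), "\n";
```

The value that comes back is still wiki markup (a template call in this fragment), so it would need a little more parsing before display. The upside is that it does not depend on the rendered page's HTML layout.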
Also, rather than trying to regex-parse HTML (or whatever) in PHP, just install the phpQuery library with Composer; then you can simply search for the selector table.infobox.vevent.
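If pulling in a library is not an option, much the same selector-style lookup can be sketched with PHP's built-in DOM extension. This swaps phpQuery for `DOMDocument` and `DOMXPath`, so it is not the phpQuery approach itself. It mirrors the Beautiful Soup function from the question: find the infobox table, look for the `th` whose text is "Original run", take the next `td`, and strip the parenthesised parts. The function name `get_original_run()` and the sample markup are invented for illustration.

```php
<?php
// Sketch: mirror the Beautiful Soup logic with PHP's built-in DOM extension.
// get_original_run() is a hypothetical helper name, not part of any library.
function get_original_run(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML tends to trigger libxml warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Every <th> inside a table whose class list contains "infobox".
    $headers = $xpath->query(
        '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]//th'
    );

    foreach ($headers as $th) {
        if (trim($th->textContent) !== 'Original run') {
            continue;
        }
        // Counterpart of find_next("td"): the first <td> after this <th>.
        $cell = $xpath->query('following::td[1]', $th)->item(0);
        if ($cell === null) {
            return null;
        }
        // Drop parenthesised notes, as the re.sub() call in the question does.
        return trim(preg_replace('/\([^)]*\)/', '', $cell->textContent));
    }

    return null; // No "Original run" row found.
}

// Hand-written sample markup, not a copy of the real page.
$sample = '<table class="infobox vevent"><tr><th>Original run</th>'
        . '<td>October 31, 2010 (2010-10-31) present</td></tr></table>';
echo get_original_run($sample), "\n";
```

Because the lookup keys on the "Original run" header cell instead of on whatever text happens to follow the infobox, it no longer matters whether the page has an "External links" section. For a live page, the string returned by the question's `curl()` function would be passed in place of `$sample`.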