Fetch HTML content from a website

Keywords: content, html, fetch, website | Updated: 2023-09-27

Possible duplicate:
How do you parse and process HTML with PHP?

I am using the following code to fetch HTML content from a given website URL.

**Code:**
=================================================================
Example URL: http://www.qatarsale.com/EnMain.aspx

$regexp = '/<div id="UpdatePanel4">(.*?)<\/div>/i';
@preg_match_all($regexp, @file_get_contents('http://www.qatarsale.com/EnMain.aspx'), $matches, PREG_SET_ORDER);

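But $matches comes back as an empty array. I want to get all of the HTML content found inside the div with id="UpdatePanel4".

If anyone has a solution, please let me know.

Thanks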

First, make sure the server allows you to fetch the data.
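One way to read "allows you to fetch the data" is to check both the local PHP configuration and the HTTP status the remote server returns. The sketch below is only an illustration of that reading: the helper names `lastStatusCode` and `fetchHtml`, the timeout, and the user-agent string are all assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the status code of the last HTTP status line
// in a list of response headers (after redirects there can be several).
function lastStatusCode(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: returns the page body, or null if it cannot be fetched.
function fetchHtml(string $url): ?string
{
    // file_get_contents() can only open URLs when allow_url_fopen is enabled
    if (!ini_get('allow_url_fopen')) {
        return null;
    }
    $context = stream_context_create([
        'http' => [
            'timeout' => 10,            // assumed value
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)', // assumed value
            'ignore_errors' => true,    // still return the body on 4xx/5xx
        ],
    ]);
    $html = @file_get_contents($url, false, $context);
    // The HTTP wrapper fills $http_response_header in the local scope
    $status = lastStatusCode($http_response_header ?? []);
    if ($html === false || $status !== 200) {
        return null;
    }
    return $html;
}
```

Calling `fetchHtml('http://www.qatarsale.com/EnMain.aspx')` and getting `null` back would suggest the problem is in fetching the page, not in the parsing step.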

Second, use an HTML parser to parse the data instead.

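// Fetch the page; file_get_contents() returns false on failure
$html = @file_get_contents('http://www.qatarsale.com/EnMain.aspx');
if (!$html) {
  die('can not get the content!');
}
// Collect parse warnings internally; real-world HTML is rarely valid
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMElement for the div, or null if no element has that id
$content = $doc->getElementById('UpdatePanel4');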
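Note that `getElementById` returns a `DOMElement`, not a string. To get the HTML inside the div, each child node can be serialized with `saveHTML`. The following is a minimal sketch under that assumption; the helper name `innerHtmlById` and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the inner HTML of the element with the given
// id, or null if no such element exists.
function innerHtmlById(string $html, string $id): ?string
{
    // Collect parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $doc->getElementById($id);
    if ($element === null) {
        return null;
    }

    // saveHTML($node) serializes a single node; concatenating the children
    // gives the content without the element's own opening and closing tags
    $inner = '';
    foreach ($element->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Made-up sample markup with a nested div
$sample = '<html><body><div id="UpdatePanel4"><p>Hello</p><div>nested</div></div></body></html>';
echo innerHtmlById($sample, 'UpdatePanel4'), "\n";
```

Because the parser tracks the tree structure, the nested div is returned as part of the content and does not cut the result short.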

// Fetch the webpage
$html = @file_get_contents('http://www.qatarsale.com/EnMain.aspx');
$startingTag = '<div id="UpdatePanel4">';
// Find the position of '<div id="UpdatePanel4">' (false if it is absent)
$startPos = strpos($html, $startingTag);
if ($startPos === false) {
  die('starting tag not found!');
}
$contentStart = $startPos + strlen($startingTag);
// Find the position of the first closing div after the starting tag
$endPos = strpos($html, '</div>', $contentStart);
// substr() takes a length, not an end position, as its third argument
$contents = substr($html, $contentStart, $endPos - $contentStart);

If the UpdatePanel4 div contains more divs, you will have to do more work.
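As a rough idea of what that extra work could look like, the sketch below counts opening and closing div tags until the depth returns to zero. It is a hypothetical helper (`extractBalancedDiv` is not from the original answer) and assumes balanced markup with no `<div` text inside comments, scripts, or attribute values; a DOM parser is more robust.

```php
<?php
// Hypothetical helper: returns the content between $startingTag and its
// matching </div>, counting nested <div> elements along the way.
function extractBalancedDiv(string $html, string $startingTag): ?string
{
    $startPos = strpos($html, $startingTag);
    if ($startPos === false) {
        return null;
    }
    $contentStart = $startPos + strlen($startingTag);
    $offset = $contentStart;
    $depth = 1; // we start inside the outer div

    while ($depth > 0) {
        $nextOpen = stripos($html, '<div', $offset);
        $nextClose = stripos($html, '</div>', $offset);
        if ($nextClose === false) {
            return null; // unbalanced markup
        }
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            $depth++;                              // entered a nested div
            $offset = $nextOpen + strlen('<div');
        } else {
            $depth--;                              // left a div
            $offset = $nextClose + strlen('</div>');
        }
    }
    // $offset now points just past the matching </div>
    return substr($html, $contentStart, $offset - strlen('</div>') - $contentStart);
}

// Made-up sample: the outer div contains one nested div
$sample = '<div id="UpdatePanel4"><div>a</div>b</div><div>c</div>';
echo extractBalancedDiv($sample, '<div id="UpdatePanel4">'), "\n"; // <div>a</div>b
```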

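That won't help. Even if you manage to get the regular expression working, there are two problems with the way you are using it:

First, what if the server makes a minor change to the HTML, such as <div data-blah="blah" id="UpdatePanel4">? In that case you would have to change your regular expression as well.
Second, I assume you want the innerHTML of the div, right? In that case, the regular expression does not take nesting or the tree structure into account. The string you get runs from the text you specified up to the first </div> that is encountered.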
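Both problems can be reproduced on small, made-up strings. The sketch below uses the pattern from the question with the `s` modifier added (an assumption on my part, so that `.` also matches newlines); the sample markup is invented for illustration.

```php
<?php
$regexp = '/<div id="UpdatePanel4">(.*?)<\/div>/is';

// Problem 1: an extra attribute makes the pattern miss the div entirely
$changed = '<div data-blah="blah" id="UpdatePanel4"><p>text</p></div>';
$found = preg_match($regexp, $changed, $m1);
var_dump($found); // int(0)

// Problem 2: the lazy match stops at the first </div>, which here belongs
// to the nested div, so the rest of the panel is lost
$nested = '<div id="UpdatePanel4"><div>inner</div><p>after</p></div>';
preg_match($regexp, $nested, $m2);
var_dump($m2[1]); // string(10) "<div>inner"
```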

Solution:

Using regular expressions to parse HTML is always a bad idea. Use DOMDocument instead.
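To illustrate, here is a minimal sketch that runs DOMDocument with a DOMXPath query on a made-up string that combines both problem cases (an extra attribute and a nested div). The sample markup and variable names are assumptions for this example only.

```php
<?php
// Collect parse warnings internally instead of emitting them
libxml_use_internal_errors(true);

// Made-up markup: extra attribute on the panel plus a nested div
$html = '<html><body><div data-blah="blah" id="UpdatePanel4">'
      . '<div>inner</div><p>after</p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select by id regardless of attribute order or additional attributes
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="UpdatePanel4"]');
$panel = $nodes->item(0);

// Serialize the whole element, including its nested children
$outer = $panel !== null ? $doc->saveHTML($panel) : null;
echo $outer, "\n";
```

The query still finds the div after the attribute change, and the serialized result includes everything up to the matching closing tag, not just the first one.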