DOM PHP - Find all possible text in a webpage (title, placeholder, ...)

I have a classic HTML web page:

<html>
<head>
  <meta charset="utf-8">
  <title>Some text</title>
  <link rel="stylesheet" href="style.css">
  <script src="script.js"></script>
  <script>
      var text = "Hi guys !";
  </script>
</head>
<body>
    <h1>Hello guys</h1>
    <p>Some text <strong>is more important</strong></p>
    <input value="Here also is some text" placeholder="and here too">
    <a href="not here">here is some text</a>
</body>
</html>

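I would like to get all the text from the web page using PHP. Checking that a node's nodeType is DOMText misses attribute text such as the placeholder, for instance.

Is there a simple way to quickly get all the real text (which in my case means all the English text)?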

Assuming you only want the children of the body element…

HTML example

<html><head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title> Example</title>
</head>
<body>
  a <div>b<span>c</span></div>
</body></html>

JavaScript

var body = document.body;
var textContent = body.textContent || body.innerText;
console.log(textContent);  //   a bc

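You need the fallback because our good friend IE (version 8 and earlier) does not support textContent and uses innerText instead.

If you have a library such as jQuery, this is much easier: $('body').text().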
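
Since the question asks about PHP, the same idea can be sketched on the server side with DOMDocument. This is a minimal sketch rather than a definitive implementation: the helper name bodyText is hypothetical, and it assumes the markup can be parsed by loadHTML. Like the JavaScript version, it returns only text nodes, so attribute values such as placeholder are not included.

```php
<?php
// Hypothetical helper: rough PHP counterpart of document.body.textContent.
function bodyText(string $html): string
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    // textContent concatenates the text of all descendant nodes.
    return $body === null ? '' : trim($body->textContent);
}

// Usage with markup similar to the HTML example above.
$html = '<html><head><title> Example</title></head>'
      . '<body> a <div>b<span>c</span></div></body></html>';
echo bodyText($html), PHP_EOL;
```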

See this also:

Ref: http://www.phpro.org/examples/Get-Text-Between-Tags.html

<?php
$html='<html>
<head>
<meta charset="utf-8">
<title>Some text</title>
<link rel="stylesheet" href="style.css">
<script src="script.js"></script>
<script>
  var text = "Hi guys !";
</script>
</head>
<body>
<h1>Hello guys</h1>
<p>Some text <strong>is more important</strong></p>
<input value="Here also is some text" placeholder="and here too">
<a href="not here">here is some text</a>
</body>
</html>';
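// Print the text content of every <body> element, one per line.
$content = getTextBetweenTags('body', $html);
foreach ($content as $item) {
    echo $item . '<br />';
}

/**
 * Return the text content of every element with the given tag name.
 * With $strict = 1 the markup is parsed as XML instead of HTML.
 * Attribute values (value, placeholder, ...) are not part of the result.
 */
function getTextBetweenTags($tag, $html, $strict = 0)
{
    /*** a new dom object ***/
    $dom = new DOMDocument;
    /*** discard white space; must be set before loading to have any effect ***/
    $dom->preserveWhiteSpace = false;
    /*** load the html into the object ***/
    if ($strict == 1) {
        $dom->loadXML($html);
    } else {
        $dom->loadHTML($html);
    }
    /*** the array to return ***/
    $out = array();
    /*** the elements, selected by tag name ***/
    foreach ($dom->getElementsByTagName($tag) as $item) {
        /*** nodeValue of an element is the concatenated text of its descendants ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}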

Use the textContent property of DOMDocument:

<?php
error_reporting(-1);
// $str is assumed to hold the HTML from the question.
$dom = new DOMDocument();
$dom->loadHTML($str);
echo $dom->textContent;

Result

Some text
      var text = "Hi guys !";
    Hello guys
    Some text is more important
    here is some text
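
The result above contains the script source but still lacks the value and placeholder attributes that the question asks about. One possible way to cover both points is an XPath query that unions text nodes outside script and style elements with a chosen set of attributes. The following is only a sketch under stated assumptions: the function name collectVisibleText and the default attribute list (placeholder, value, title, alt) are hypothetical choices, not something the answers above prescribe.

```php
<?php
// Hypothetical helper, not part of the original answers.
// Returns trimmed, non-empty strings from text nodes and selected attributes.
function collectVisibleText(
    string $html,
    array $attributes = ['placeholder', 'value', 'title', 'alt']
): array {
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Text nodes that are not inside <script> or <style> ...
    $parts = ['//text()[not(ancestor::script) and not(ancestor::style)]'];
    // ... plus every attribute whose name is in the (assumed) list.
    foreach ($attributes as $name) {
        $parts[] = '//@' . $name;
    }

    $out = [];
    foreach ($xpath->query(implode(' | ', $parts)) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $out[] = $text;
        }
    }
    return $out;
}

// Usage with a shortened version of the markup from the question.
$html = '<body><h1>Hello guys</h1>'
      . '<input value="Here also is some text" placeholder="and here too">'
      . '<a href="not here">here is some text</a>'
      . '<script>var text = "Hi guys !";</script></body>';
print_r(collectVisibleText($html));
```

The href value is skipped because href is not in the attribute list. Note that //@value also matches hidden inputs and option elements, so the list may need adjusting for a real page.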