Regular expression for getting the HTML doctype

My HTML code looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Or it can look like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

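I want to get the doc type from it, which would be "XHTML 1.0 Strict" (for the first) and "HTML 4.0" (for the second). What regular expression would do that? I would like to use it with PHP's preg_match() function.

Please help me with this.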

If the doctypes always take the form shown, you can use

'#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i'

So:

preg_match('#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i', $html, $match);
echo $match[0];
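
As a minimal sketch, the lookbehind pattern above can be wrapped in a small helper and run against both doctypes from the question. The function name `extractDoctypeLabel` is hypothetical, and the sketch assumes the doctype has exactly the W3C public-identifier shape shown; anything else (for example `<!DOCTYPE html>`) yields null.

```php
<?php
// Hypothetical helper name; it only wraps the pattern from the answer above.
function extractDoctypeLabel(string $html): ?string
{
    // The lookbehind anchors on the fixed prefix; [^/]+ then reads up to the next "//".
    $pattern = '#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i';
    if (preg_match($pattern, $html, $match) === 1) {
        return trim($match[0]);
    }
    return null; // no doctype in the expected form
}

$first  = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
$second = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">';

echo extractDoctypeLabel($first), PHP_EOL;  // XHTML 1.0 Strict
echo extractDoctypeLabel($second), PHP_EOL; // HTML 4.0
```

Because the lookbehind is a fixed string, extra whitespace or a line break inside the prefix will make the match fail.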

How about using DOMDocument and DOMDocumentType?

$xml = new DOMDocument(); 
$xml->loadHTMLFile($url);
$name = $xml->doctype->publicId; // -//W3C//DTD XHTML 1.0 Strict//EN

$xml->doctype now contains the following values:

DOMDocumentType Object
(
    [name] => html
    [entities] => (object value omitted)
    [notations] => (object value omitted)
    [publicId] => -//W3C//DTD XHTML 1.0 Strict//EN
    [systemId] => http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    [internalSubset] => 
    [nodeName] => html
    [nodeValue] => 
    [nodeType] => 10
    [parentNode] => (object value omitted)
    [childNodes] => 
    [firstChild] => 
    [lastChild] => 
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => 
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => 
    [baseURI] => 
    [textContent] => 
)

So you can now easily extract the type:

$name = $xml->doctype->publicId;
// \s+ keeps the space after "DTD" out of the captured group
$name = preg_replace('~.*//DTD\s+(.*?)//.*~', '$1', $name);
echo $name;

This yields XHTML 1.0 Strict.
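
The same idea can be sketched for HTML held in a string rather than fetched from a URL. The helper name `doctypeLabelFromHtml` is hypothetical. One assumption worth stating: the `LIBXML_HTML_NODEFDTD` option is passed so that libxml does not add a default doctype when the markup has none, which would otherwise look like a real HTML 4.0 Transitional declaration.

```php
<?php
// Hypothetical helper; returns null when there is no public identifier to read.
function doctypeLabelFromHtml(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them, for this call only.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $doctype = $doc->doctype;
    if ($doctype === null || (string) $doctype->publicId === '') {
        return null; // no doctype, or one without a public identifier
    }
    // Capture the text between "//DTD " and the next "//".
    if (preg_match('~//DTD\s+(.*?)//~', $doctype->publicId, $m) === 1) {
        return $m[1];
    }
    return null;
}

$xhtmlPage = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    . '<html><head><title>t</title></head><body></body></html>';

echo doctypeLabelFromHtml($xhtmlPage), PHP_EOL; // XHTML 1.0 Strict
```

Letting the parser find the doctype means line breaks or unusual spacing inside the declaration do not need special handling.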

// Case-insensitive substring test: public identifiers are upper case ("XHTML", "Strict")
function contains($haystack, $needle) {
    return stripos($haystack, $needle) !== false;
}

$theDocType = "";
$stringWithHTML = ""; // load some HTML in here from somewhere (must not be empty)

// Create DOM from HTML
$doc = new DOMDocument();
//@$doc->loadHTMLFile("just_a_file.html");
@$doc->loadHTML($stringWithHTML);

// Grab document type (doctype is null when the document has none)
$dtName = $doc->doctype ? $doc->doctype->name : "";
$dtPublic = $doc->doctype ? $doc->doctype->publicId : "";

if (strtolower($dtName) == "html" && $dtPublic != "") { // comparison, not assignment
    // HTML or XHTML?
    $isXhtml = contains($dtPublic, "xhtml");
    $theDocType = $isXhtml ? "XHTML 1.0" : "HTML 4.01";
    // Which type?
    if (contains($dtPublic, "strict")) {
        $theDocType .= " (Strict)";
    } elseif (contains($dtPublic, "transitional")) {
        $theDocType .= " (Transitional)";
    } elseif (contains($dtPublic, "frameset")) {
        $theDocType .= " (Frameset)";
    } elseif ($isXhtml) {
        $theDocType = "XHTML 1.1"; // XHTML 1.1 has no Strict/Transitional/Frameset variants
    } else {
        $theDocType .= " (Strict)"; // "-//W3C//DTD HTML 4.01//EN" is the Strict DTD
    }
} else {
    $theDocType = "HTML 5";
}
// Result
echo $theDocType;

Depending on the input, this outputs values such as:
XHTML 1.1
HTML 5
HTML 4.01 (Strict)

Try this:

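<?php
   $html = file_get_contents("http://google.com");
   // Remove line breaks so a doctype split over several lines still matches
   $html = str_replace("\n","",$html);
   // [^>]+> stops at the end of the declaration, so <!doctype html> matches too
   $get_doctype = preg_match_all("/(<!DOCTYPE[^>]+>)\s*<html/i",$html,$matches);
   $doctype = $get_doctype ? $matches[1][0] : null;
?>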
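
'<!doctype.*?//dtd\s+([^/]*)//EN.*?dtd">'

This should be the pattern for your example. Wrap it in delimiters and add the i flag before passing it to preg_match(). Note that it requires a system identifier ending in dtd", so it matches the first doctype but not the second.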

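This regular expression extracts everything between "DTD" and "/", without any syntax checking:

.*DTD\s+([^/]+)

This regular expression extracts the doc type and checks some of the syntax in the string:

<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>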
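
To compare the two patterns above, here is a small sketch that runs both against the doctypes from the question. The `~` delimiters, the `i` flag, and the helper name `firstGroup` are assumptions added so the patterns can be passed to preg_match(); they are not part of the original answer.

```php
<?php
// Delimiters (~) and the i flag are assumptions added for use with preg_match().
$simple  = '~.*DTD\s+([^/]+)~i';
$checked = '~<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>~i';

// Hypothetical helper: returns capture group 1, or null when the pattern does not match.
function firstGroup(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$first  = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
$second = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">';

foreach ([$first, $second] as $doctype) {
    echo firstGroup($simple, $doctype), ' | ', firstGroup($checked, $doctype), PHP_EOL;
}
// XHTML 1.0 Strict | XHTML 1.0 Strict
// HTML 4.0 | HTML 4.0

// Without syntax checking, the simple pattern also matches text that is not a doctype.
var_dump(firstGroup($simple, 'See the DTD notes'));  // string(5) "notes"
var_dump(firstGroup($checked, 'See the DTD notes')); // NULL
```

The last two lines show the trade-off: the shorter pattern is easier to read, while the longer one rejects input that does not start with a doctype declaration.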

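I used the answers in this thread before, but during testing I found problems with some long doctypes. Sometimes developers split the doctype across 2 or 3 lines. In that case, a single regular expression is not the best approach.

Here is my approach, which handles a doctype written on one line or on several: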

<?php
class Doctype {
    private $html;
    private $doctype;
    private $version;

    // __construct replaces the PHP 4 style constructor, which PHP 8 no longer calls
    public function __construct($html){
        $this->html = $html;
        $this->extractDoctype();
        $this->processDoctype();
    }

    private function extractDoctype(){
        $preDoctype = "";
        $preDoctypeValid = false;
        $pos1 = false;
        // Split on any line ending, then rejoin lines until the doctype is complete
        $lines = preg_split('/\R/', $this->html);
        foreach ($lines as $line) {
            $preDoctype = $preDoctype . $line . " ";
            $pos1 = stripos($preDoctype, "<!doctype");
            // Look for ">" only after the start of the doctype
            if ($pos1 !== false && strpos($preDoctype, ">", $pos1) !== false) {
                $preDoctypeValid = true;
                break;
            }
        }
        if($preDoctypeValid){
            //Store only the pattern: <!doctype ... >
            $pos2 = strpos($preDoctype, ">", $pos1) + 1;
            // substr() takes a length, not an end position
            $preDoctype = substr($preDoctype, $pos1, $pos2 - $pos1);
        }else{
            $preDoctype = "";
        }
        $this->doctype = $preDoctype;
    }

    private function processDoctype(){
        $doctype = strtolower($this->doctype);
        $version = "";
        $pattern_html5 = '/<!doctype\s+?html\s?>/i';
        if (preg_match($pattern_html5, $doctype)) {
            $version = "HTML5";
        }else if(strpos($doctype, "xhtml") !== false){
            $version = "XHTML";
        }else if(strpos($doctype, "html") !== false){
            if(strpos($doctype, "3.2") !== false){
                $version = "HTML 3.2";
            }
            if(strpos($doctype, "4.01") !== false){
                $version = "HTML 4.01";
            }
            if(strpos($doctype, "2.0") !== false){
                $version = "HTML 2.0";
            }
        }else{
            $version = "OTHER";
        }
        $this->version = $version;
    }

    public function getDoctype(){
        return $this->doctype;
    }

    public function getDoctypeVersion(){
        return $this->version;
    }
}
?>

https://github.com/jabrena/WTAnalyzer/blob/master/r_php/document/Doctype.class.php