How to scrape Hindi text from the web using PHP

Keywords: Hindi text, PHP, web | Updated: 2023-09-27

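I am trying to scrape data from the web (from a URL), but I get a response like this:

\u093f\u0938

How can I decode this Unicode? Please suggest how to do it in my PHP script.

The script works fine for English text, so what goes wrong with Hindi? This is the script I use to scrape the data. I know the response is Devanagari Unicode, but how do I decode it?

I am new to PHP. Thanks in advance.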

$i= 1;
for($i; $i < 6; $i++)
{
    $html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $nodes = $dom->getElementsByTagName('p');
    $item = array();
    $articles = array();
    foreach ($nodes as $node) {
         $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
         $item['cat_id'] = 1;
         if($item['msg'] !="")
         $articles[] = array_unique($item);
    }
    $articles = json_encode($articles);
    print_r($articles);
}

If you are running PHP 5.4 or later, pass the JSON_UNESCAPED_UNICODE flag when calling json_encode:

$i= 1;
for($i; $i < 6; $i++)
{
    $html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $nodes = $dom->getElementsByTagName('p');
    $item = array();
    $articles = array();
    foreach ($nodes as $node) {
         $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
         $item['cat_id'] = 1;
         if($item['msg'] !="")
         $articles[] = array_unique($item);
    }
    $articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
    print_r($articles);
}
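
To see the effect of the flag in isolation, here is a minimal sketch that needs no network access. The sample string is an assumption: it is just the two code points from the question, written with the "\u{...}" string syntax, which requires PHP 7.0 or later.

```php
<?php
// The two Devanagari code points from the question: U+093F and U+0938.
$text = "\u{093F}\u{0938}";

// Default behaviour: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode(['msg' => $text]);

// With the flag (PHP 5.4+): the characters are written as raw UTF-8.
$unescaped = json_encode(['msg' => $text], JSON_UNESCAPED_UNICODE);

// Both forms are valid JSON and decode back to the same string.
$roundTrip = json_decode($escaped, true);

echo $escaped, PHP_EOL;
echo $unescaped, PHP_EOL;
var_dump($roundTrip['msg'] === $text);
```

The escaped output is not corrupted data. It is the same text in JSON's ASCII-safe notation, so any JSON parser, including json_decode, turns it back into the original characters.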

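I think PHPhil's answer is good, and I upvoted it. I edited the code because it did not work when only the PHP part was executed; what matters is adding the correct meta tag (see the code below) so that Devanagari is displayed correctly. I also wanted to fix the error with the missing "=" in the $html assignment. Unfortunately my edit was rejected, so I had to post a new answer with the corrected code.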

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<?php
$i= 1;
for($i; $i < 6; $i++)
{
    $html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $nodes = $dom->getElementsByTagName('p');
    $item = array();
    $articles = array();
    foreach ($nodes as $node) {
         $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
         $item['cat_id'] = 1;
         if($item['msg'] !="")
         $articles[] = array_unique($item);
    }
    $articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
    print_r($articles);
}
?>
</body>
</html>
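
The meta tag controls how the browser decodes the page that this script prints. A separate concern is how DOMDocument decodes the page that was fetched: loadHTML can fall back to a single-byte encoding when the markup has no charset declaration it recognises, which garbles Devanagari before json_encode ever sees it. The following is a minimal sketch of one common workaround, assuming the fetched bytes are UTF-8 and the mbstring extension is available. The inline $html string is a stand-in for the downloaded page, not real site content.

```php
<?php
// Stand-in for a downloaded page (assumption: the real page is UTF-8).
$original = "\u{0939}\u{093F}\u{0902}\u{0926}\u{0940}"; // the word "Hindi" in Devanagari
$html = '<html><body><p>' . $original . '</p></body></html>';

// Rewrite every non-ASCII character as a numeric character reference,
// so the parser no longer has to guess the byte encoding.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asciiHtml = mb_encode_numericentity($html, $convmap, 'UTF-8');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($asciiHtml);
libxml_clear_errors();

// nodeValue is returned as UTF-8 regardless of the input encoding.
$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;

var_dump($text === $original);
```

If the page already declares its charset correctly, this step changes nothing, so it is safe to leave in place.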

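You are close. The escape sequences you are getting are \u093f and \u0938.

First, you can try searching Google for these escape sequences to find out which Devanagari characters they stand for:

https://www.google.de/q=%5cu093f

https://www.google.de/q=%5cu0938

If you want to display the Unicode characters in HTML, you have to change the notation from \u0123 to &#x123;. See here: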

<html>
<body>
<p>These are two characters in Devanagari: &#x93f;&#x938;</p>
</body>
</html>
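
The same conversion can be done in PHP rather than by hand. This is a minimal sketch; unicodeEscapesToHtmlEntities is a hypothetical helper name, not something from the answers above, and it only handles four-digit escapes in the Basic Multilingual Plane, which covers Devanagari.

```php
<?php
// Hypothetical helper: rewrite \uXXXX escapes as hexadecimal HTML
// character references. Surrogate pairs (characters outside the BMP)
// are not handled by this sketch.
function unicodeEscapesToHtmlEntities(string $input): string
{
    // '\\\\u' in a single-quoted PHP string is the regex \\u,
    // which matches a literal backslash followed by "u".
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $input);
}

$escaped = '\u093f\u0938'; // single quotes: the backslashes stay literal
echo unicodeEscapesToHtmlEntities($escaped), PHP_EOL;
```

When the escaped text comes from JSON, calling json_decode and printing the result on a UTF-8 page is usually simpler than converting to entities.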

But if you want to scrape Hindi, you should start learning how to read and process Unicode. The next question is what you want to do with your results.
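
One concrete place where this matters is the strlen($node->nodeValue) > 20 check in the scripts above. strlen counts bytes, and each Devanagari character takes three bytes in UTF-8, so the check lets through paragraphs of only seven characters. The sketch below assumes the intent was a threshold in characters and that the mbstring extension is installed.

```php
<?php
// The word "Hindi" in Devanagari: five code points, three bytes each in UTF-8.
$word = "\u{0939}\u{093F}\u{0902}\u{0926}\u{0940}";

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts code points

echo "bytes: $byteLength, characters: $charLength", PHP_EOL;
```

Replacing strlen with mb_strlen(..., 'UTF-8') in the filter makes the threshold behave the same way for English and Hindi text.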