Count characters in DOC and DOCX with PHP on Linux

Keywords: characters, DOCX, PHP, Linux, count, DOC | Updated: 2023-09-27

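Added: the closest method I have found for counting lines is to run the Linux command "antiword" on DOC files, which returns a plain-text version of the DOC. For DOCX, I use a call that retrieves the content from the DOCX and pushes the data through the same text function that I use for the antiword output.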

The problem now is that when the file contains tables, antiword adds a lot of whitespace.
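
One way to keep that padding from inflating the count is to collapse every run of whitespace into a single space before counting. This is only a minimal sketch: the function name `count_chars_normalized` is hypothetical, and it assumes the text is valid UTF-8 and that the mbstring extension is installed.

```php
<?php
// Hypothetical helper: count characters after collapsing whitespace runs.
// Assumes $text is valid UTF-8 and that the mbstring extension is loaded.
function count_chars_normalized($text, $countSpaces = true)
{
    // Any run of spaces, tabs, or newlines becomes one space.
    $normalized = preg_replace('/\s+/u', ' ', $text);
    if ($normalized === null) {
        return false; // preg_replace failed, e.g. on invalid UTF-8
    }
    $normalized = trim($normalized);
    if (!$countSpaces) {
        $normalized = str_replace(' ', '', $normalized);
    }
    // mb_strlen counts characters rather than bytes.
    return mb_strlen($normalized, 'UTF-8');
}
```

Whether the remaining single spaces should count is a matter of definition, so the second parameter leaves that choice to the caller.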

===

I have a script that counts the characters in a DOCX file:

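// Returns the text of word/document.xml with the tags stripped, or false on failure.
// Note: the zip_* functions work on PHP 5.3 but are deprecated as of PHP 8.0.
function docx_to_text($filename)
{
    if (!$filename || !file_exists($filename)) return false;
    $zip = zip_open($filename);
    // zip_open() returns an integer error code on failure
    if (!$zip || is_numeric($zip)) return false;
    $content = '';
    while ($zip_entry = zip_read($zip)) {
        // check the name first so that skipped entries are never left open
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    }// end while
    zip_close($zip); // close the archive handle, not the last entry
    // table cell boundaries become spaces, paragraph ends become line breaks
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = trim(strip_tags($content));
    return $striped_content;
}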

If I have a DOC file, I basically use the LibreOffice command line to convert the file to DOCX and then run the script above.
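
For reference, the conversion step could look roughly like the sketch below. The function names are hypothetical, and it assumes the LibreOffice binary is called `soffice`, is on the PATH, and is a version that supports the `--headless` and `--convert-to` options.

```php
<?php
// Hypothetical helper: build the command for a headless DOC -> DOCX conversion.
// Assumption: the binary is named "soffice" and is on the PATH.
function build_convert_command($docPath, $outDir)
{
    return 'soffice --headless --convert-to docx --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Runs the conversion and returns the path of the new .docx, or false.
function convert_doc_to_docx($docPath, $outDir)
{
    exec(build_convert_command($docPath, $outDir), $output, $status);
    if ($status !== 0) {
        return false;
    }
    // LibreOffice names the output after the input file, with the new extension.
    $docx = rtrim($outDir, '/') . '/'
        . pathinfo($docPath, PATHINFO_FILENAME) . '.docx';
    // Check for the file as well, since the exit status alone may not be reliable.
    return file_exists($docx) ? $docx : false;
}
```

The command is built in a separate function so that it can be logged or inspected before anything is executed.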

The problem is that I cannot work out how many words the file has in the "HEADER" and "FOOTER" areas. How can I do this?

My server runs: PHP 5.3, an office suite (LibreOffice), CentOS 6.5

I am not sure what other information I need to provide. Thank you for your answers.

p.s.

I have tried converting DOC and DOCX to TXT, but the "HEADER" and "FOOTER" areas are not preserved in the resulting TXT document.

Also, the closest solution I have found is: https://github.com/nagilum/DOCx

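The library breaks the whole DOCX file apart and gives you the header, content, and footer as plain text, so I could try counting words from those. However, LibreOffice sometimes seems to struggle when converting files to DOCX, and a DOC file with only 1 page may end up with 2 pages in the DOCX after conversion.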
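
The same idea can be sketched without an external library. Inside the archive, headers and footers are stored as separate parts, typically named `word/header1.xml`, `word/footer1.xml`, and so on. The function names below are hypothetical, the name pattern is an assumption about typical files, and the sketch relies on the ZipArchive extension being available.

```php
<?php
// Hypothetical helper: collect plain text from the header and footer parts.
// Assumption: the parts are named word/header*.xml and word/footer*.xml.
function docx_header_footer_text($filename)
{
    $zip = new ZipArchive;
    if ($zip->open($filename) !== true) {
        return false;
    }
    $parts = array();
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if (!preg_match('#^word/(header|footer)\d*\.xml$#', $name)) {
            continue;
        }
        $xml = $zip->getFromIndex($i);
        // Put a space at each paragraph end so adjacent words do not merge.
        $xml = str_replace('</w:p>', ' </w:p>', $xml);
        $parts[$name] = trim(strip_tags($xml));
    }
    $zip->close();
    return $parts;
}

// Counts whitespace-separated tokens in UTF-8 text.
function count_words_utf8($text)
{
    $tokens = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? 0 : count($tokens);
}
```

Summing `count_words_utf8()` over the returned array gives a header and footer total. Fields such as page numbers may show up as stray tokens, so treat the result as approximate.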

Any *.docx file is a zip archive. It contains an app.xml file (stored as docProps/app.xml), in which you can find the node:

<Characters>8657</Characters>

and extract the value with a regular expression.
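
A minimal sketch of that approach, assuming the ZipArchive extension is available (the function name is hypothetical):

```php
<?php
// Hypothetical helper: read the stored character count from docProps/app.xml.
// Returns an integer, or false if the archive or the node cannot be read.
function docx_characters_from_app_xml($filename)
{
    $zip = new ZipArchive;
    if ($zip->open($filename) !== true) {
        return false;
    }
    $xml = $zip->getFromName('docProps/app.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }
    // The closing ">" keeps this from matching <CharactersWithSpaces>.
    if (!preg_match('#<Characters>(\d+)</Characters>#', $xml, $m)) {
        return false;
    }
    return (int) $m[1];
}
```

Keep in mind that this value is whatever the application that last saved the file wrote there, so it can be missing or out of date if the file came from a tool that does not update it.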