Is it possible to extract only the text from an HTML file?

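I'm thinking of stripping out all the formatting and keeping all the text, just as if a user went to any page of a website, pressed Ctrl+A and Ctrl+C, and then pasted everything into Notepad with Ctrl+V. That is what I mean by "extract all the text". Let me use a real page to explain it better: https://developer.palm.com/content/resources/develop/quick_start_ios.html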

What I want:

jump to navigation
jump to content
Showcase
Why webOS
The Opportunity
Innovative Platform
Cross-Platform
HP Reach
Vibrant Community
Showcase
Device Showcase
App Showcase
Developer Voices
My Apps
Resources
Design
Enyo Design Guide
Advanced Application Guidelines
webOS and Game Development
Development
Download the SDK
Enyo from the Ground Up
Enyo Tutorial
Third-party Tools
Developer Device Program
PDK Development
Unactivated Devices
Glossary
Distribution and Promotion
Distributing with HP
App Content Criteria
App Submission Checklist
International e-commerce FAQ
Submit Your Enyo App
Market Your App
Promo codes
In-App Purchase
FAQs
Developer Program FAQ
International e-commerce FAQ
PDK Technical FAQ
Videos
View All
Community
Connect
Forums
Developer Blog
Events
Twitter
IRC
RSS
Resources
Third-party Developers
webOS on github
Guide to Custom Feeds
webOS101 (external)
Community Sites
mobspot
Cyrket
PreCentral
webOS Roundup
Documentation
SDK Documentation
Index
Developer Guide
API Reference
Sign In Sign Up Search Form
Search   
HomeResourcesQuick Start iOS
Quick Start - iOS Developers
Print
Email
Share
If you've been developing for iOS® and are looking to expand your audience, we're here to help. Getting started with webOS is easy! If your current focus is OpenGL/SDL, then the transition will be simplicity itself. We have lots of great stories of developers porting their OpenGL apps very quickly. You can use the publicly available 3.0 SDK to do OpenGL/SDL development now with the included Plug-in Development Kit (PDK). Best of all, the PDK integrates nicely with  Xcode.
If your focus is web app development, you'll want to look at Enyo, our next-generation JavaScript framework, which is included in the 3.0 SDK.
Ready to get started?
Download the SDK
It's free! (While you're at it, sign up for the Developer Program.)
Try the Enyo tutorial or the OpenGL sample app
Choose the sample that's most appropriate for your skill set.
Check out our Resources pages
Get more information on developing for webOS. Or go straight to the Reference section to get all the details.
Quick Start Guide
iOS Developers
Web Developers
C/C++ Developers
Next Steps
Sign up!
Become a member of the webOS developer community
Watch Dev Day videos
See the talks from the NYC Dev Day
Find a Developer
Check out our list of third-party developers and designers
Support
We are here to help!
Why webOS
Business Case for webOS
Success Stories
App Showcase
Contact Us
Getting Started
Join the HP webOS Developer Program
Download the SDK/PDK
Developing Your First App
Videos
webOS CONNECT Events
MWC Developer Conference
NYC Developer Day
Podcasts
Support
Help
FAQs
Stay up to date
About RSS Feeds
Developer Blog
© 2011 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice. All screen images simulated. HP Pre 3 planned availability this summer.  Privacy Statement
Supported browsers: Firefox 3.6+; Google Chrome 10+; Safari 5+; Internet Explorer 8+
Palm.comLegal NoticesContact Us

This should do it:

<?php 
echo strip_tags(file_get_contents("https://developer.palm.com/content/resources/develop/quick_start_ios.html"));

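That's the gist. To format the result better, you can replace line breaks with newlines, e.g. str_replace('<br/>', "\n", $html). Note that this has to run on the HTML before strip_tags is called, because afterwards there are no <br/> tags left to replace.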
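As a rough sketch of the strip_tags idea above, the snippet below converts line breaks to newlines before the tags are removed and decodes entities afterwards. The helper name html_to_text() and the sample markup are assumptions made for illustration, not part of the original answer, and the regular expressions are only good enough for simple pages.

```php
<?php
// Hypothetical helper (name chosen for this example only).
function html_to_text(string $html): string
{
    // Drop <script> and <style> blocks; strip_tags() alone would keep their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Turn <br>, <br/> and <br /> into newlines before the tags are stripped.
    $html = preg_replace('#<br\s*/?>#i', "\n", $html);
    // Remove the remaining tags, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($text);
}

// Small inline sample instead of a network request.
echo html_to_text('<p>Hello<br/>world &amp; all</p><script>var x = 1;</script>');
```

For a real page you would pass in the string returned by file_get_contents(). A regex-based cleanup like this is fragile on messy markup, so treat it as a starting point rather than a complete solution.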

I use lynx. Try this in your terminal:

lynx -dump http://www.google.com

Another approach is to read the value of the page's body tag:

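// Parse the remote page into a DOM tree.
$html = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML often triggers parse warnings
$html->loadHTMLFile("https://developer.palm.com/content/resources/develop/quick_start_ios.html");
// nodeValue of <body> is the concatenated text of everything inside it.
$body = $html->getElementsByTagName("body")->item(0);
echo $body->nodeValue;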

You can do this with the document tree: keep all the text nodes and discard all the element nodes.
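One way to read the document-tree suggestion, sketched in PHP with DOMDocument and DOMXPath: walk every text node under the body and keep the non-empty ones. The function name collect_text_nodes(), the decision to skip script and style content, and the sample markup are all assumptions added for this example.

```php
<?php
// Hypothetical helper (name chosen for this example only).
function collect_text_nodes(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Every text node under <body>, except those inside <script> or <style>.
    $nodes = $xpath->query(
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
    );

    $lines = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $lines[] = $text; // keep only text nodes with visible content
        }
    }
    return $lines;
}

// Small inline sample instead of a network request.
echo implode("\n", collect_text_nodes(
    '<html><body><h1>Quick Start</h1><p>Download the <a href="#">SDK</a></p>'
    . '<script>var x;</script></body></html>'
));
```

Emitting one line per text node gives output close to the list in the question. Note that loadHTML() assumes ISO-8859-1 unless the markup declares a charset, so non-ASCII pages may need an explicit encoding hint.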

You can do it with JavaScript, or with C++ and WebKit.