Async HTML parser with Goutte

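I am trying to write an HTML parser with the help of Goutte. It works well, but Goutte uses blocking requests. That is fine when you deal with a single service, but it becomes a problem when I want to query many services that are independent of each other. Goutte uses BrowserKit and Guzzle. I tried to change the doRequest function, but it failed with: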

Argument 1 passed to Symfony\Component\BrowserKit\CookieJar::updateFromResponse() must be an instance of Symfony\Component\BrowserKit\Response

    protected function doRequest($request)
    {
        $headers = array();
        foreach ($request->getServer() as $key => $val) {
            $key = strtolower(str_replace('_', '-', $key));
            $contentHeaders = array('content-length' => true, 'content-md5' => true, 'content-type' => true);
            if (0 === strpos($key, 'http-')) {
                $headers[substr($key, 5)] = $val;
            }
            // CONTENT_* are not prefixed with HTTP_
            elseif (isset($contentHeaders[$key])) {
                $headers[$key] = $val;
            }
        }
        $cookies = CookieJar::fromArray(
            $this->getCookieJar()->allRawValues($request->getUri()),
            parse_url($request->getUri(), PHP_URL_HOST)
        );
        $requestOptions = array(
            'cookies' => $cookies,
            'allow_redirects' => false,
            'auth' => $this->auth,
        );
        if (!in_array($request->getMethod(), array('GET', 'HEAD'))) {
            if (null !== $content = $request->getContent()) {
                $requestOptions['body'] = $content;
            } else {
                if ($files = $request->getFiles()) {
                    $requestOptions['multipart'] = [];
                    $this->addPostFields($request->getParameters(), $requestOptions['multipart']);
                    $this->addPostFiles($files, $requestOptions['multipart']);
                } else {
                    $requestOptions['form_params'] = $request->getParameters();
                }
            }
        }
        if (!empty($headers)) {
            $requestOptions['headers'] = $headers;
        }
        $method = $request->getMethod();
        $uri = $request->getUri();
        foreach ($this->headers as $name => $value) {
            $requestOptions['headers'][$name] = $value;
        }
        // Let BrowserKit handle redirects
        $promise = $this->getClient()->requestAsync($method, $uri, $requestOptions);
        $promise->then(
            function (ResponseInterface $response) {
                return $this->createResponse($response);
            },
            function (RequestException $e) {
                $response = $e->getResponse();
                if (null === $response) {
                    throw $e;
                }
            }
        );
        $promise->wait();
    }

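How can I change Goutte\Client.php so that it makes requests asynchronously? If that is not possible, how can I run scrapers that target different endpoints simultaneously? Thanks.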

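Goutte is essentially a bridge between Guzzle and Symfony's BrowserKit and DomCrawler.

The biggest drawback of using Goutte is that all requests are made synchronously.

To do things asynchronously, you will have to forgo Goutte and use Guzzle and DomCrawler directly.

For example: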

// $uri is assumed to be an array of target URLs.
$requests = [
    new GuzzleHttp\Psr7\Request('GET', $uri[0]),
    new GuzzleHttp\Psr7\Request('GET', $uri[1]),
    new GuzzleHttp\Psr7\Request('GET', $uri[2]),
    new GuzzleHttp\Psr7\Request('GET', $uri[3]),
    new GuzzleHttp\Psr7\Request('GET', $uri[4]),
    new GuzzleHttp\Psr7\Request('GET', $uri[5]),
    new GuzzleHttp\Psr7\Request('GET', $uri[6]),
];
$client = new GuzzleHttp\Client();
$pool = new GuzzleHttp\Pool($client, $requests, [
    'concurrency' => 5, // how many concurrent requests we want active at any given time
    'fulfilled' => function ($response, $index) use ($uri) {
        // Called once per successful response; $index matches the position in $requests.
        $crawler = new Symfony\Component\DomCrawler\Crawler(null, $uri[$index]);
        $crawler->addContent(
            (string) $response->getBody(),
            $response->getHeaderLine('Content-Type')
        );
    },
    'rejected' => function ($reason, $index) {
        // do something if the request failed.
    },
]);
// Start the transfers and block until every request has completed.
$promise = $pool->promise();
$promise->wait();
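
The fulfilled callback above builds a Crawler but does not keep anything from it. Below is a minimal sketch of one way to collect parsed results by request index. It assumes Guzzle 6 or later and symfony/dom-crawler are installed through Composer (the vendor/autoload.php path is an assumption). So that it can run without network access, it uses Guzzle's MockHandler with canned responses; the URLs, page bodies, and the $titles and $errors arrays are made up for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumed Composer autoloader path

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;
use Symfony\Component\DomCrawler\Crawler;

// Hypothetical endpoints, for illustration only.
$uris = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];

// Canned responses so the sketch runs offline; the third one fails on purpose.
$mock = new MockHandler([
    new Response(200, ['Content-Type' => 'text/html'], '<html><head><title>Page A</title></head><body></body></html>'),
    new Response(200, ['Content-Type' => 'text/html'], '<html><head><title>Page B</title></head><body></body></html>'),
    new Response(500),
]);
// For real requests, construct the client without the 'handler' option.
$client = new Client(['handler' => HandlerStack::create($mock)]);

$requests = [];
foreach ($uris as $u) {
    $requests[] = new Request('GET', $u);
}

$titles = []; // page titles keyed by request index
$errors = []; // failure messages keyed by request index

$pool = new Pool($client, $requests, [
    'concurrency' => 2,
    'fulfilled' => function ($response, $index) use (&$titles, $uris) {
        $crawler = new Crawler(null, $uris[$index]);
        $crawler->addContent(
            (string) $response->getBody(),
            $response->getHeaderLine('Content-Type')
        );
        // XPath avoids the optional symfony/css-selector dependency.
        $titles[$index] = $crawler->filterXPath('//title')->text();
    },
    'rejected' => function ($reason, $index) use (&$errors) {
        // With the default handler stack, 4xx/5xx responses arrive here as exceptions.
        $errors[$index] = $reason->getMessage();
    },
]);
$pool->promise()->wait();

ksort($titles);
print_r($titles);
print_r(array_keys($errors));
```

Capturing the result arrays by reference keeps the callbacks small and lets the rest of the script work with plain arrays once wait() returns. Keying by $index ties each result back to its URL regardless of the order in which responses complete.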