Library for Rapid (Web) Crawler and Scraper Development
- New method `Json::all()`, and also allow getting the whole decoded JSON when using `Json::get()` inside a mapping, using either an empty string or `*` as target. Example: `Json::get(['all' => '*'])`. `*` only works when there is no key `*` in the decoded data (see the sketch after this list).
- If the response is an HTML document rather than plain JSON, the `Json` step now extracts the content of its `<body>` and tries to decode this instead.
- `Step::addToResult()` now supports dot notation, so you can get data from nested output, like: `$step->addToResult(['url' => 'response.url', 'status' => 'response.status', 'foo' => 'bar'])`.
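For illustration, a minimal sketch of the new `*` target in a `Json::get()` mapping, following the usual `HttpCrawler` setup from the docs; the endpoint URL and the `data.items` key are made up:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$crawler->input('https://www.example.com/api/items'); // hypothetical endpoint

$crawler
    ->addStep(Http::get())
    ->addStep(Json::get([
        'items' => 'data.items', // a specific key from the decoded JSON
        'all' => '*',            // the whole decoded JSON via the new * target
    ]));

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```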
- Objects added to the result are serialized using `__serialize()`. If you want an object to be serialized differently for that purpose, you can define a `toArrayForAddToResult()` method in that class. When that method exists, it's preferred over the `__serialize()` method (a sketch follows below).
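A minimal sketch of that hook; the `Product` class and its fields are made up for illustration:

```php
class Product
{
    public function __construct(
        private string $name,
        private float $price,
    ) {}

    public function __serialize(): array
    {
        return ['name' => $this->name, 'price' => $this->price];
    }

    // When this method exists, adding the object to the result
    // prefers it over __serialize().
    public function toArrayForAddToResult(): array
    {
        return [
            'name' => $this->name,
            'price' => number_format($this->price, 2), // formatted for the result
        ];
    }
}
```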
- Implemented the `toArrayForAddToResult()` method in the `RespondedRequest` class, so on every step that somehow yields a `RespondedRequest` object, you can use the keys `url`, `uri`, `status`, `headers` and `body` with the `addToResult()` method. Previously this only worked for `Http` steps, because the `HttpBase` step defines output key aliases (`HttpBase::outputKeyAliases()`). Now, in combination with the ability to use dot notation when adding data to the result, if your custom step returns nested output like `['response' => RespondedRequest, 'foo' => 'bar']`, you can add response data to the result like this: `$step->addToResult(['url' => 'response.url', 'body' => 'response.body'])` (see the sketch below).
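A minimal sketch of such a custom step; `MyStep` is made up, and it assumes the previous step yields a `RespondedRequest`:

```php
use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        // $input is assumed to be a RespondedRequest from the previous step.
        yield ['response' => $input, 'foo' => 'bar'];
    }
}

// Pick response data from the nested output via dot notation:
$step = (new MyStep())->addToResult(['url' => 'response.url', 'body' => 'response.body']);
```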
- Improved when the `Store` (class instance) is called by the crawler with a final crawling result. When a crawling step initiates a crawling result (i.e. `addToResult()` was called on the step instance), the crawler has to wait for all child outputs resulting from one step input before it calls the store, because the child outputs can all add data to the same final result object. Previously, this was the case not only for all child outputs starting from a step where `addToResult()` was called, but for all children of one initial crawler input. With this change, the store is called earlier with finished `Result` objects in a lot of cases, and memory usage is lowered (see the store sketch below).
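For context, a store is a class whose `store()` method the crawler calls with each finished `Result`. A minimal sketch following the custom-store pattern from the docs; `EchoStore` is made up:

```php
use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class EchoStore extends Store
{
    // Called by the crawler once per finished Result object.
    public function store(Result $result): void
    {
        var_dump($result->toArray());
    }
}

// Usage: $crawler->setStore(new EchoStore());
```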
- Renamed `HttpBaseLoader` back to `HttpLoader`. It's probably not a good idea to have multiple loaders, at least not multiple loaders just for HTTP. It should be enough to publicly expose the `HeadlessBrowserLoaderHelper` via `HttpLoader::browserHelper()` for the extension steps. The `HttpBase` step is kept, though, to share the general HTTP functionality implemented there.
- Extracted general HTTP loading functionality to a new abstract `HttpBaseLoader`, and important functionality for the headless browser loader to a new `HeadlessBrowserLoaderHelper`. Further, functionality from the `Http` steps is now shared via a new abstract `HttpBase` step. It's considered a fix, because there's no new functionality, just refactoring existing code for better extensibility.
- The `DomQuery` class (parent of `CssSelector` (`Dom::cssSelector`) and `XPathQuery` (`Dom::xPath`)) has a new method `formattedText()` that uses the new crwlr/html-2-text package to convert the HTML to formatted plain text. You can also provide a customized instance of the `Html2Text` class to the `formattedText()` method (see the sketch below).
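For illustration, a sketch of `formattedText()` inside an `Html` extraction step; the selectors are made up:

```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

$step = Html::first('article')->extract([
    'title' => Dom::cssSelector('h1')->text(),
    // Convert the selected element's HTML to formatted plain text.
    'text' => Dom::cssSelector('.content')->formattedText(),
]);
```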
- The `Http::crawl()` step won't yield a page again if a newly found URL responds with a redirect to a previously loaded URL.