XML Sitemap parser class compliant with the Sitemaps.org protocol.
An easy-to-use PHP library to parse XML Sitemaps compliant with the Sitemaps.org protocol.
The Sitemaps.org protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.
Supported formats:
- `.xml`
- `.xml.gz`
- `robots.txt`
The library is available for installation via Composer. Add this to your `composer.json` file:

```json
{
    "require": {
        "vipnytt/sitemapparser": "^1.0"
    }
}
```

Then run `composer update`.
Returns a list of URLs only.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
Returns all available tags, for both Sitemaps and URLs.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'Sitemap<br>';
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
Parses any sitemap detected while parsing, to get a complete list of URLs. Use the `url_black_list` configuration option to skip sitemaps that are part of a parent sitemap. Exact match only.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    echo '<h2>Sitemaps</h2>';
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    echo '<h2>URLs</h2>';
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
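When a sitemap index links to child sitemaps you want to skip, they can be excluded with the `url_black_list` option (exact match only). A minimal sketch; the sitemap URLs below are placeholders, not real endpoints:

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

// Placeholder URLs for illustration; replace with real sitemap locations.
$config = [
    'url_black_list' => [
        'http://www.example.com/video-sitemap.xml', // skipped: matches exactly
    ],
];

try {
    $parser = new SitemapParser('MyCustomUserAgent', $config);
    // Child sitemaps on the black list are never fetched during recursion.
    $parser->parseRecursive('http://www.example.com/sitemap-index.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```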
Note: Parsing of line-separated plain text is disabled by default, to avoid false positives when XML is expected but plain text is fetched instead. To disable strict standards, pass this configuration as constructor parameter #2: `['strict' => false]`.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent', ['strict' => false]);
    $parser->parse('https://www.xml-sitemaps.com/urllist.txt');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo $url . '<br>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
```bash
composer require hamburgscleanest/guzzle-advanced-throttle
```
```php
$rules = new RequestLimitRuleset([
    'https://www.google.com' => [
        [
            'max_requests' => 20,
            'request_interval' => 1
        ],
        [
            'max_requests' => 100,
            'request_interval' => 120
        ]
    ]
]);

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());

$throttle = new ThrottleMiddleware($rules);

// Invoke the middleware
$stack->push($throttle());

// OR: alternatively, call the handle method directly
// $stack->push($throttle->handle());

$client = new \GuzzleHttp\Client(['handler' => $stack]);
```
Then pass the Guzzle client to the parser using the `setClient` method:

```php
$parser = new SitemapParser();
$parser->setClient($client);
```
More details about this middleware are available in its documentation.
```bash
composer require caseyamcl/guzzle_retry_middleware
```
```php
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
$stack->push(GuzzleRetryMiddleware::factory());

$client = new \GuzzleHttp\Client(['handler' => $stack]);

$parser = new SitemapParser();
$parser->setClient($client);
```
More details about this middleware are available in its documentation.
```bash
composer require gmponos/guzzle_logger
```
```php
$logger = new Logger('guzzle'); // any PSR-3 logger; Monolog's Logger requires a channel name

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
$stack->push(new LogMiddleware($logger));

$client = new \GuzzleHttp\Client(['handler' => $stack]);
```
Then pass the Guzzle client to the parser using the `setClient` method:

```php
$parser = new SitemapParser();
$parser->setClient($client);
```
More details about this middleware's configuration (such as log levels, when to log, and what to log) are available in its documentation.
Even more examples are available in the examples directory.
Available configuration options, with their default values:
```php
$config = [
    'strict' => true, // (bool) Disallow parsing of line-separated plain text
    'guzzle' => [
        // GuzzleHttp request options
        // http://docs.guzzlephp.org/en/latest/request-options.html
    ],
    // Use this to ignore URLs when parsing sitemaps that contain multiple other sitemaps. Exact match only.
    'url_black_list' => []
];

$parser = new SitemapParser('MyCustomUserAgent', $config);
```
If a User-agent is also set using the GuzzleHttp request options, it takes the highest priority and replaces the User-agent set in the constructor.
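For example (a sketch, assuming the standard Guzzle `headers` request option; the User-agent strings are placeholders), the header set via the `guzzle` options wins over the constructor argument:

```php
use vipnytt\SitemapParser;

// The 'guzzle' array is passed through as GuzzleHttp request options.
$config = [
    'guzzle' => [
        'headers' => [
            'User-Agent' => 'MyCustomUserAgent/2.0', // takes priority
        ],
    ],
];

// The constructor's User-agent is replaced by the one in the Guzzle headers.
$parser = new SitemapParser('MyCustomUserAgent', $config);
```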