Ultimate Sitemap Parser Save

Ultimate Website Sitemap Parser

Project README

.. image:: https://travis-ci.org/mediacloud/ultimate-sitemap-parser.svg?branch=develop :target: https://travis-ci.org/mediacloud/ultimate-sitemap-parser :alt: Build Status

.. image:: https://readthedocs.org/projects/ultimate-sitemap-parser/badge/?version=latest :target: https://ultimate-sitemap-parser.readthedocs.io/en/latest/?badge=latest :alt: Documentation Status

.. image:: https://coveralls.io/repos/github/mediacloud/ultimate-sitemap-parser/badge.svg?branch=develop :target: https://coveralls.io/github/mediacloud/ultimate-sitemap-parser?branch=develop :alt: Coverage Status

.. image:: https://badge.fury.io/py/ultimate-sitemap-parser.svg :target: https://badge.fury.io/py/ultimate-sitemap-parser :alt: PyPI package

.. image:: https://pepy.tech/badge/ultimate-sitemap-parser :target: https://pepy.tech/project/ultimate-sitemap-parser :alt: Download stats

Website sitemap parser for Python 3.5+.

Features

Supports all sitemap formats:
- XML sitemaps <https://www.sitemaps.org/protocol.html#xmlTagDefinitions>_
- Google News sitemaps <https://support.google.com/news/publisher-center/answer/74288?hl=en>_
- plain text sitemaps <https://www.sitemaps.org/protocol.html#otherformats>_
- RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps <https://www.sitemaps.org/protocol.html#otherformats>_
- Sitemaps linked from robots.txt <https://developers.google.com/search/reference/robots_txt#sitemap>_
Field-tested with ~1 million URLs as part of the Media Cloud project <https://mediacloud.org/>_
Error-tolerant with more common sitemap bugs
Tries to find sitemaps not listed in robots.txt
Uses fast and memory efficient Expat XML parsing
Doesn't consume much memory even with massive sitemap hierarchies
Provides a generated sitemap tree as easy to use object tree
Supports using a custom web client
Uses a small number of actively maintained third-party modules
Reasonably tested

Installation

.. code:: sh

pip install ultimate-sitemap-parser

Usage

.. code:: python

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
print(tree)

sitemap_tree_for_homepage() will return a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on the website; see a reference of AbstractSitemap subclasses <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.sitemap>_.

If you'd like to just list all the pages found in all of the sitemaps within the website, consider using all_pages() method:

.. code:: python

# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)

all_pages() method will return an iterator yielding SitemapPage objects; see a reference of SitemapPage <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.page>_.

Open Source Agenda is not affiliated with "Ultimate Sitemap Parser" Project. README Source: mediacloud/ultimate-sitemap-parser

Stars

169

Open Issues

Last Commit

11 months ago

Repository

mediacloud/ultimate-sitemap-parser

Homepage

https://mediacloud.org/

Open Source Agenda Badge

<a href="https://www.opensourceagenda.com/projects/ultimate-sitemap-parser"><img src="https://www.opensourceagenda.com/projects/ultimate-sitemap-parser/reviews/badge.svg" alt="Open Source Agenda"></a>

Submit Review Review Your Favorite Project

Submit Resource Articles, Courses, Videos

Submit Article Submit a post to our blog

From the blog

Dec 11, 2022

How to Choose Which Programming Language to Learn First?

From the blog

Dec 11, 2022