πŸ•΅οΈβ€β™€οΈ website-watcher

Naively watch websites for changes on regular intervals.

πŸ—’ Summary

This script watches a website, saves its contents to a specified text file, compares this file's contents to the website contents at the next visit and sends an e-mail if there are differences.

Please note: This will only work for static websites, which are completely rendered on the server. To parse dynamic, JavaScript-powered websites, like Single Page Apps, you would need a tool like Selenium WebDriver. If you're interested, please refer to my blog article about "Building a cloud-native web scraper using 8 different AWS services".

πŸ–Š Description

I made this for repeatedly checking a specific webpage where university exam results get published, so that I get notified almost instantly. Another application could be watching the postal service's shipment tracking or the like. The script is very simple: it visits a website, saves the entire HTML code to a local file and compares its contents to the potentially new page contents at the next visit. If there is a difference, you will be notified via e-mail.

You can specify a threshold for how many single-character changes are required before a difference is actually considered a change (maybe some webpages display the current time at the bottom right, which you want to ignore; if the time is displayed like 6:45 pm, then a threshold of at least 5 would result in ignoring these changes). To save memory and CPU time while idle (although only very little), the script runs only once per execution and exits immediately after one website visit. To make it run repeatedly, you have to set up a cron job that simply executes the script.
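
For a quick, one-off test without setting up notifications, you can point the script at any page and simply print the result to the console using the stdout adapter (the URL below is just a placeholder):

# ignores changes of up to 5 characters and prints instead of sending an e-mail
python3 watcher.py -u https://example.org -t 5 --adapter stdout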

βš™οΈ Requirements

  • Python >= 3.9
  • Cron jobs

▢️ Usage

  • Clone project: git clone https://github.com/n1try/website-watcher-script
  • sudo pip3 install -r requirements.txt
  • chmod +x watcher.py
  • Create a cron job for your user account with crontab -e and add, for instance, @hourly ~/dev/watcher.py -u https://kit.edu -t 5 --adapter email -r [email protected] (see the full example after this list). This will visit kit.edu every hour and send an e-mail in case of changes, while ignoring changes of fewer than 6 characters.
  • See python3 watcher.py -h for information on all available parameters.
  • πŸ‘‰ New: See batch.sh for information on how to watch multiple websites at once

Options

  • -u URL (required): URL of the website to watch
  • -t TOLERANCE: Tolerance in characters, i.e. changes with a difference of less than or equal to TOLERANCE characters will be ignored and not trigger a notification
  • -x XPATH: An XPath query to restrict watching to certain parts of a website. Only child elements of the element matching the query will be considered while watching
  • -i XPATH_IGNORE: A list of XPath queries to exclude certain parts of a website. Multiple queries are possible by separating them with a space, e.g. -i "//script" "//style" (see the combined example after this list).
  • -ua USER_AGENT: A custom user agent header to set in requests, e.g. for pretending to be a browser. The shortcut firefox is available to mimic Firefox 84 on Windows 10.
  • --adapter ADAPTER: Which sending adapter to use (see below)
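
To illustrate how these options combine, the following call watches only one section of a page, ignores script and style tags, pretends to be Firefox and tolerates small changes; the URL and XPath expressions are placeholders you would adapt to the target site:

python3 watcher.py \
  -u https://example.org/news \
  -x "//div[@id='content']" \
  -i "//script" "//style" \
  -ua firefox \
  -t 10 \
  --adapter stdout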

πŸ‘€ Please note

When running the script for the first time, you will get a notification that there were changes, since there is a difference between the initially empty file and the entire website's HTML code.

πŸ”Œ Adapters

Multiple send methods are supported in the form of adapters. To choose one, supply --adapter (e.g. --adapter email) as an argument to watcher.py.

To write your own adapter, you need to implement the abstract SendAdapter class. See adapters/email.py for an example.

E-Mail (email)

This adapter, which is also the default one, will send an e-mail to notify about changes. It either uses local sendmail or a specified SMTP server.

Options

  -r RECIPIENT_ADDRESS          – Recipient e-mail address (required)
  -s SENDER_ADDRESS             – Sender e-mail address
  --subject SUBJECT             – E-Mail subject
  --sendmail_path SENDMAIL_PATH – Path to Sendmail binary
  --smtp                        – If set, SMTP is used instead of local Sendmail.
  --smtp_host SMTP_HOST         – SMTP server host name to send mails with – only required if --smtp is set
  --smtp_port SMTP_PORT         – SMTP server port – only required if --smtp is set
  --smtp_username SMTP_USERNAME – SMTP server login username – only required if --smtp is set
  --smtp_password SMTP_PASSWORD – SMTP server login password – only required if --smtp is set
  --disable_tls                 – If set, the SMTP connection is unencrypted (TLS disabled) – only relevant if --smtp is set
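
For instance, sending through an external SMTP server instead of local Sendmail could look like this (host, port, credentials and addresses are placeholders):

python3 watcher.py \
  -u https://example.org \
  --adapter email \
  -r [email protected] \
  -s [email protected] \
  --smtp \
  --smtp_host smtp.example.org \
  --smtp_port 587 \
  --smtp_username you \
  --smtp_password secret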

Telepush (telepush)

This adapter will send a push notification via Telegram using Telepush. You have to register with the bot first to get a token. To do so, send a message to TelepushBot (Telepush was formerly called MiddlemanBot).

Options

  -r RECIPIENT_TOKEN            – Recipient token (required)
  -s SENDER                     – Sender name
  --webhook_url WEBHOOK_URL     – URL of the Telepush bot instance
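
A typical invocation could look like this, with the token being a placeholder for the one the bot sent you:

python3 watcher.py \
  -u https://example.org \
  --adapter telepush \
  -r <YOUR_TELEPUSH_TOKEN>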

Gotify (gotify)

This adapter will send a push notification via Gotify. First, you have to register a new app in Gotify and get its key to use as an authorization token.

Options

  --gotify_key GOTIFY_KEY       – Gotify app key / token (required)
  --gotify_url GOTIFY_URL       – Gotify server instance address (required)
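
An example invocation, with the server address and app key as placeholders:

python3 watcher.py \
  -u https://example.org \
  --adapter gotify \
  --gotify_url https://gotify.example.org \
  --gotify_key <YOUR_APP_KEY>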

Ntfy.sh (ntfy)

This adapter will send a push notification via ntfy.sh.

Options

  --ntfy_topic NTFY_TOPIC       – Ntfy topic to publish to (required)
  --ntfy_url NTFY_URL           – Ntfy server instance address (optional)
  --ntfy_token NTFY_TOKEN       – Ntfy access token (if the server requires authentication) (optional)
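
An example invocation, with the topic as a placeholder (add --ntfy_url and --ntfy_token if you run your own, authenticated server):

python3 watcher.py \
  -u https://example.org \
  --adapter ntfy \
  --ntfy_topic <YOUR_TOPIC>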

WebSub (websub)

This adapter will send a ping to a WebSub hub (e.g. pubsubhubbub.superfeedr.com as a hosted service or Switchboard as a self-hosted hub). However, it does not check whether the target resource is actually a publisher for that hub; you should verify that yourself.

Options

  --hub_url HUB_URL             – URL of the WebSub hub to publish to (required)
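
An example invocation, pinging the hosted hub mentioned above; the watched URL is a placeholder:

python3 watcher.py \
  -u https://example.org/feed.xml \
  --adapter websub \
  --hub_url https://pubsubhubbub.superfeedr.com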

Sub Process (subprocess)

This adapter allows executing arbitrary shell commands with the watch result included as environment variables (WATCHER_URL and WATCHER_DIFF).

Example

# single quotes keep $WATCHER_DIFF and $WATCHER_URL from being expanded by your
# interactive shell; they are meant to be expanded when the adapter executes the command
python watcher.py \
  -u https://kit.edu \
  --adapter subprocess \
  --cmd 'echo $WATCHER_DIFF characters changed at $WATCHER_URL > /tmp/watcher.txt'

Options

  --cmd CMD                     – A shell command to execute in case of a change (required)

Stdout / Log (stdout)

This adapter simply prints a message (either as plain text or in JSON) to the console.

Options

  --log_format LOG_FORMAT       – Format of the logged message (default: 'plain')
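
An example invocation; json is assumed here to be the value that selects the JSON format, so double-check python3 watcher.py -h for the exact accepted values:

python3 watcher.py \
  -u https://example.org \
  --adapter stdout \
  --log_format json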

🧩 Website Examples

Watching ebay-kleinanzeigen.de

  1. Go to the front page
  2. Use F12 to open your browser's dev tools and switch to the Network tab
  3. Enter your search query, location and radius and hit Search
  4. Right-click the first request of type html and status code 301 and copy its URL (starts with https://www.ebay-kleinanzeigen.de/s-suchanfrage.html)
  5. Watch it: python3 watcher.py -u "<URL_FROM_STEP_4>" -ua firefox -x "//div[@id='srchrslt-content']" --adapter stdout

πŸ§‘β€πŸ’» Developer Notes

Tests

$ python3 -m unittest discover . '*_test.py'

↗️ Contributing

Feel free to contribute! All contributions that add value to the project are welcome. Please check the issues section for bug reports and feature requests.

πŸ““ License

MIT @ Ferdinand MΓΌtsch
