Page Replica – Tool for Web Scraping, Prerendering, and SEO Boost
"Page Replica" is a versatile web scraping and caching tool built with Node.js, Express, and Puppeteer. It helps prerender web app (React, Angular, Vue,...) pages, which can be served via Nginx for SEO or other purposes.
The tool lets you scrape individual web pages or entire sitemaps through an API, optionally removing JavaScript and caching the resulting HTML.
It also includes a sample Nginx configuration that handles regular user traffic and search engine bot traffic separately.
Clone the Repository:
git clone https://github.com/html5-ninja/page-replica.git
cd page-replica
Install Dependencies:
npm install
Settings:
const CONFIG = {
  baseUrl: "https://example.com",
  removeJS: true, // strip JavaScript from the cached HTML
  addBaseURL: true,
  cacheFolder: "path_to_cache_folder", // where prerendered pages are stored
};
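To illustrate what the removeJS and addBaseURL options could mean for the cached HTML, here is a minimal sketch. It assumes that removeJS strips script elements and that addBaseURL inserts a base tag pointing at baseUrl. Both the function names (stripScripts, addBaseTag) and this behavior are assumptions for illustration, not taken from the tool's source.

```javascript
// Hypothetical helpers; not taken from the Page Replica source.

// Remove inline and external <script> elements from an HTML string.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Insert a <base> tag right after the opening <head> tag so relative
// asset URLs resolve against baseUrl. The HTML is returned unchanged
// when it has no <head> tag.
function addBaseTag(html, baseUrl) {
  return html.replace(
    /<head(\s[^>]*)?>/i,
    (match) => `${match}<base href="${baseUrl}">`
  );
}

const sample =
  '<html><head><title>Demo</title></head><body><p>Hi</p>' +
  '<script src="/app.js"></script></body></html>';

const cached = addBaseTag(stripScripts(sample), "https://example.com");
console.log(cached);
```

Regular expressions are enough for a sketch like this, but a real implementation would more likely edit the DOM inside the headless browser before serializing the page.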
Start the API:
npm start
Scraping a page or a sitemap stores a copy of each prerendered page in the cache folder.
To scrape a single page, make a GET request to /page with the url query parameter:
curl "http://localhost:8080/page?url=https://example.com"
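The same request can be made from Node.js. The sketch below assumes the API listens on localhost:8080, as in the curl example, and that Node.js 18 or later is available for the built-in fetch. The helper names buildScrapeUrl and scrapePage are hypothetical. Encoding the target URL matters when it carries its own query string, which would otherwise be read as parameters of the API request.

```javascript
// Build the request URL for the scraping API. The target URL is
// percent-encoded so its own query string is not mistaken for
// parameters of the API request.
function buildScrapeUrl(apiBase, endpoint, targetUrl) {
  const url = new URL(endpoint, apiBase);
  url.searchParams.set("url", targetUrl);
  return url.toString();
}

// Request a scrape and return the HTTP status code.
// Requires Node.js 18+ for the built-in fetch.
async function scrapePage(targetUrl, apiBase = "http://localhost:8080") {
  const response = await fetch(buildScrapeUrl(apiBase, "/page", targetUrl));
  return response.status;
}

const requestUrl = buildScrapeUrl(
  "http://localhost:8080",
  "/page",
  "https://example.com/products?id=1&lang=en"
);
console.log(requestUrl);
```

With the API running, `scrapePage("https://example.com").then(console.log)` would trigger a scrape and print the response status.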
To scrape pages from a sitemap, make a GET request to /sitemap with the url query parameter:
curl "http://localhost:8080/sitemap?url=https://example.com/sitemap.xml"
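Conceptually, scraping a sitemap means reading every loc entry and scraping each one as a single page. The sketch below shows only the first step and assumes a standard sitemap.xml. The helper extractSitemapUrls is hypothetical, and the tool's own parsing may work differently.

```javascript
// Hypothetical helper; the tool's own sitemap parsing may differ.
// Collect the contents of every <loc> element in a sitemap XML string.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps escape "&" as "&amp;", so undo that entity.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/search?q=a&amp;page=2</loc></url>
</urlset>`;

const pages = extractSitemapUrls(sitemap);
console.log(pages);
```

Each URL in the resulting list could then be passed to the /page endpoint one at a time.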
In this setup, the cached pages are served using Nginx. You can adapt the configuration to your own needs and server.
The Nginx configuration in nginx_config_sample/example.com.conf routes regular users to the main application server and sends search engine bots to a dedicated server block that serves the cached HTML.
Please review the nginx_config_sample/example.com.conf file to understand how it works.
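The routing decision can be pictured in JavaScript. This sketch assumes bots are recognized by their User-Agent header, which is a common approach but is not confirmed here. The crawler list and the names isSearchBot and chooseUpstream are hypothetical, so check the sample configuration for the rules it actually applies.

```javascript
// Hypothetical bot list for illustration; check the sample Nginx
// configuration for the patterns it actually uses.
const BOT_PATTERN = /googlebot|bingbot|duckduckbot|baiduspider|yandex/i;

// Return true when the User-Agent header looks like a search engine crawler.
function isSearchBot(userAgent) {
  return BOT_PATTERN.test(userAgent || "");
}

// Decide which upstream should handle a request:
// "cache" for the cached HTML server, "app" for the main application.
function chooseUpstream(userAgent) {
  return isSearchBot(userAgent) ? "cache" : "app";
}

const botAgent =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const userAgent =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36";

console.log(chooseUpstream(botAgent));
console.log(chooseUpstream(userAgent));
```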
We welcome contributions! If you have feature requests or ideas for server/cloud configurations beyond Nginx that could enhance this tool, please open an issue to start a discussion.
nginx_config_sample: A sample Nginx configuration for redirecting bot traffic to the cached content server.
api.js: An Express application that handles web scraping requests.
index.js: The core web scraping logic, built on Puppeteer.
package.json: Node.js project configuration.