Brianmadden Krawler Versions

A web crawling framework written in Kotlin

0.4.4

4 years ago
  • Upgrade Kotlin to 1.3.61
  • Upgrade kotlinx.coroutines. This required an update to some of the places where coroutine builders were called internally.
  • Upgrade Gradle wrapper

0.4.3

6 years ago
  • Added the ability to clear crawl queues by root page and age; see Krawler#removeUrlsByRootPage and Krawler#removeUrlsByAge
  • Added a config option to prevent crawler shutdown on empty queues
  • Added a new single-byte priority field to KrawlQueueEntry. Queues will always attempt to pop the lowest-priority entry available. Priority can be assigned by overriding the Krawler#assignQueuePriorty method.
  • Update dependencies
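
The priority mechanism above can be sketched as a small function. This is an illustrative sketch only: the function name, parameters, and ranking rules here are assumptions, not Krawler's actual `assignQueuePriorty` signature.

```kotlin
// Hypothetical sketch: assign a single-byte priority to a queued URL.
// Lower values are popped first, per the release note above.
fun assignPriority(url: String, depth: Int): Byte {
    return when {
        url.contains("example.com") -> 0     // a hypothetical high-value host: crawl first
        depth <= 1 -> 1                      // then pages near the crawl root
        else -> minOf(depth, 127).toByte()   // deeper pages get lower priority
    }
}
```

A scheme like this lets a crawl favor shallow or high-value pages without any change to the queue itself, since the queue only compares the byte values.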

0.4.1 (2017-8-15)

  • Removed logging implementation from dependencies to prevent logging conflicts when used as a library.
  • Updated Kotlin version to 1.1.4
  • Updated kotlinx.coroutines to 0.17

0.4.0 (2017-5-17)

  • Rewrote the core crawl loop to use Kotlin 1.1 coroutines, effectively turning the crawl process into a multi-stage pipeline. This architectural change removed the need for some locking by eliminating resource contention between multiple threads.

  • Updated the build file to build the simple example as a runnable jar

  • Minor bug fixes in the KrawlUrl class.
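
The coroutine-based pipeline described above can be sketched with channels connecting independent stages. The stage names and string-based "fetch" stub below are assumptions for illustration, not Krawler's internals; the structural point is that each channel has a single producer and a single consumer, so no explicit locks are needed.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Illustrative multi-stage crawl pipeline built from coroutines and channels.
fun crawlPipeline(seeds: List<String>): List<String> = runBlocking {
    val urls = Channel<String>()   // stage 1 -> stage 2
    val pages = Channel<String>()  // stage 2 -> stage 3

    // Stage 1: feed seed URLs into the pipeline.
    launch {
        for (seed in seeds) urls.send(seed)
        urls.close()
    }
    // Stage 2: "fetch" each URL (stubbed here as a string transform).
    launch {
        for (url in urls) pages.send("fetched:$url")
        pages.close()
    }
    // Stage 3: collect results in the main coroutine.
    val results = mutableListOf<String>()
    for (page in pages) results.add(page)
    results
}
```

Because the stages communicate only through channels, adding parallelism later (e.g. several fetchers reading from the same channel) does not require changing the surrounding code.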

0.3.2

7 years ago
  • Fixed a number of bugs that could crash a worker thread, which led to an incorrect count of crawled pages and caused slowdowns due to the reduced number of worker threads.

  • Added a new utility function to wrap doCrawl and log any uncaught exceptions during crawling.
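
A wrapper in the spirit of the utility described above might look like the sketch below. The name `safeCrawl` and its signature are illustrative assumptions, not Krawler's actual code; the idea is simply to catch and log exceptions so they cannot kill the worker thread.

```kotlin
// Run a crawl action and log any uncaught exception instead of
// letting it propagate and crash the worker thread.
fun safeCrawl(url: String, doCrawl: (String) -> Unit): Boolean {
    return try {
        doCrawl(url)
        true
    } catch (e: Exception) {
        // A real implementation would use a logger; println stands in here.
        println("Uncaught exception while crawling $url: ${e.message}")
        false
    }
}
```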

0.3.1

7 years ago
  • Created a 1:1 mapping between threads and the number of queues used to serve URLs to visit. URLs have an affinity for a particular queue based on their domain, so all URLs from a given domain end up in the same queue. This improves parallel crawl performance by reducing how often the politeness delay affects requests. For crawls bound to fewer domains than queues, the excess queues are not used.
  • Many bug fixes, including one that eliminates accidental over-crawling.
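
The domain-to-queue affinity described above can be sketched as a hash of the URL's host. The hashing scheme below is an assumption for illustration; Krawler's actual mapping may differ. The property that matters is that every URL from the same host lands in the same queue, so a per-domain politeness delay only ever stalls one queue.

```kotlin
import java.net.URI

// Map a URL to a queue index based on its host.
fun queueFor(url: String, queueCount: Int): Int {
    val host = URI(url).host ?: url
    // Math.floorMod keeps the index non-negative for any hashCode value.
    return Math.floorMod(host.hashCode(), queueCount)
}
```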