Brianmadden Krawler Versions

A web crawling framework written in Kotlin

0.4.4

4 years ago
  • Upgrade Kotlin to 1.3.61
  • Upgrade kotlinx.coroutines. This required an update to some of the places where coroutine builders were called internally.
  • Upgrade Gradle wrapper

0.4.3

6 years ago
  • Added the ability to clear crawl queues by root page and age; see Krawler#removeUrlsByRootPage and Krawler#removeUrlsByAge
  • Added a config option to prevent crawler shutdown on empty queues
  • Added a new single-byte priority field to KrawlQueueEntry. Queues will always attempt to pop the lowest-priority entry available. Priority can be assigned by overriding the Krawler#assignQueuePriorty method.
  • Update dependencies
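
The priority mechanism above can be sketched as a small function. This is an illustrative sketch only: the function name, parameters, and ranking rules here are assumptions, not Krawler's actual `assignQueuePriorty` signature.

```kotlin
// Hypothetical sketch: assign a single-byte priority to a queued URL.
// Lower values are popped first, per the release note above.
fun assignPriority(url: String, depth: Int): Byte {
    return when {
        url.contains("example.com") -> 0     // a hypothetical high-value host: crawl first
        depth <= 1 -> 1                      // then pages near the crawl root
        else -> minOf(depth, 127).toByte()   // deeper pages get lower priority
    }
}
```

A scheme like this lets a crawl favor shallow or high-value pages without any change to the queue itself, since the queue only compares the byte values.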

0.4.1 (2017-8-15)

  • Removed logging implementation from dependencies to prevent logging conflicts when used as a library.
  • Updated Kotlin version to 1.1.4
  • Updated kotlinx.coroutines to 0.17

0.4.0 (2017-5-17)

  • Rewrote the core crawl loop to use Kotlin 1.1 coroutines, effectively turning the crawl process into a multi-stage pipeline. This architectural change removed the need for some locking by eliminating resource contention between multiple threads.

  • Updated the build file to build the simple example as a runnable jar

  • Minor bug fixes in the KrawlUrl class.
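
The coroutine-based pipeline described above can be sketched with channels connecting independent stages. The stage names and string-based "fetch" stub below are assumptions for illustration, not Krawler's internals; the structural point is that each channel has a single producer and a single consumer, so no explicit locks are needed.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Illustrative multi-stage crawl pipeline built from coroutines and channels.
fun crawlPipeline(seeds: List<String>): List<String> = runBlocking {
    val urls = Channel<String>()   // stage 1 -> stage 2
    val pages = Channel<String>()  // stage 2 -> stage 3

    // Stage 1: feed seed URLs into the pipeline.
    launch {
        for (seed in seeds) urls.send(seed)
        urls.close()
    }
    // Stage 2: "fetch" each URL (stubbed here as a string transform).
    launch {
        for (url in urls) pages.send("fetched:$url")
        pages.close()
    }
    // Stage 3: collect results in the main coroutine.
    val results = mutableListOf<String>()
    for (page in pages) results.add(page)
    results
}
```

Because the stages communicate only through channels, adding parallelism later (e.g. several fetchers reading from the same channel) does not require changing the surrounding code.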

0.3.2

7 years ago
  • Fixed a number of bugs that could crash a worker thread, which led to an incorrect count of crawled pages and caused slowdowns due to the reduced number of worker threads.

  • Added a new utility function to wrap doCrawl and log any uncaught exceptions during crawling.
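
A wrapper in the spirit of the utility described above might look like the sketch below. The name `safeCrawl` and its signature are illustrative assumptions, not Krawler's actual code; the idea is simply to catch and log exceptions so they cannot kill the worker thread.

```kotlin
// Run a crawl action and log any uncaught exception instead of
// letting it propagate and crash the worker thread.
fun safeCrawl(url: String, doCrawl: (String) -> Unit): Boolean {
    return try {
        doCrawl(url)
        true
    } catch (e: Exception) {
        // A real implementation would use a logger; println stands in here.
        println("Uncaught exception while crawling $url: ${e.message}")
        false
    }
}
```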

0.3.1

7 years ago
  • Created a 1:1 mapping between threads and the number of queues used to serve URLs to visit. URLs have an affinity for a particular queue based on their domain, so all URLs from a given domain end up in the same queue. This improves parallel crawl performance by reducing how often the politeness delay affects requests. For crawls bound to fewer domains than queues, the excess queues are not used.
  • Many bug fixes, including one that eliminates accidental over-crawling.
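
The domain-to-queue affinity described above can be sketched as a hash of the URL's host. The hashing scheme below is an assumption for illustration; Krawler's actual mapping may differ. The property that matters is that every URL from the same host lands in the same queue, so a per-domain politeness delay only ever stalls one queue.

```kotlin
import java.net.URI

// Map a URL to a queue index based on its host.
fun queueFor(url: String, queueCount: Int): Int {
    val host = URI(url).host ?: url
    // Math.floorMod keeps the index non-negative for any hashCode value.
    return Math.floorMod(host.hashCode(), queueCount)
}
```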