A web crawling framework written in Kotlin
Krawler is a web crawling framework written in Kotlin. It is heavily inspired by crawler4j by Yasser Ganjisaffar. The project is still very new, and those looking for a mature, well-tested crawler framework should likely still use crawler4j. For those who can tolerate a bit of turbulence, Krawler should serve as a replacement for crawler4j with minimal modifications to existing applications.
Some neat features and benefits of Krawler include:
- The ability to differentiate between a "check" and a "visit": pages are selected for a lightweight status check or a full content fetch by overriding shouldCheck or shouldVisit, and the responses are handled by the corresponding check and visit methods (see the sketch below).
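To illustrate the split, here is a minimal sketch. The shouldVisit and visit signatures match the SimpleExample later in this README; the shouldCheck and check signatures, and the example.org host, are assumptions for illustration and should be verified against the actual Krawler API.

// NOTE: shouldCheck/check signatures are assumed by analogy with
// shouldVisit/visit; confirm them against the Krawler API before use.
import io.thelandscape.krawler.crawler.KrawlConfig
import io.thelandscape.krawler.crawler.Krawler
import io.thelandscape.krawler.http.KrawlDocument
import io.thelandscape.krawler.http.KrawlUrl

class CheckingCrawler(config: KrawlConfig = KrawlConfig()) : Krawler(config) {

    // Fully fetch pages on our own host ("example.org" is a placeholder)
    override fun shouldVisit(url: KrawlUrl): Boolean = url.host == "example.org"

    override fun visit(url: KrawlUrl, doc: KrawlDocument) {
        println("Visited ${url.canonicalForm}")
    }

    // Only check (not fetch) links that point off-site
    override fun shouldCheck(url: KrawlUrl): Boolean = url.host != "example.org"

    // A check verifies a resource's status code without downloading its content
    override fun check(url: KrawlUrl, statusCode: Int) {
        if (statusCode >= 400) {
            println("Broken link: ${url.canonicalForm} returned $statusCode")
        }
    }
}

This split lets a crawler validate outbound links cheaply without paying the cost of downloading their content.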
Krawler is published through jitpack.io at https://jitpack.io/#brianmadden/krawler/. To use Krawler in your project, add jitpack.io as a repository and krawler as a dependency.

Gradle:
repositories {
jcenter()
maven { url "https://jitpack.io" }
}
dependencies {
compile 'com.github.brianmadden:krawler:0.4.4'
}
Maven:
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependency>
<groupId>com.github.brianmadden</groupId>
<artifactId>krawler</artifactId>
<version>0.4.4</version>
</dependency>
Using the Krawler framework is fairly simple. Minimally, there are two methods that must be overridden in order to use the framework. Overriding the shouldVisit method dictates what should be visited by the crawler, and the visit method dictates what happens once the page is visited. Overriding these two methods is sufficient for creating your own crawler; however, there are additional methods that can be overridden to provide more robust behavior.
The full code for this simple example can also be found in the example project:
import io.thelandscape.krawler.crawler.KrawlConfig
import io.thelandscape.krawler.crawler.Krawler
import io.thelandscape.krawler.http.KrawlDocument
import io.thelandscape.krawler.http.KrawlUrl
import java.time.LocalTime
import java.util.concurrent.ConcurrentSkipListSet
import java.util.concurrent.atomic.AtomicInteger

class SimpleExample(config: KrawlConfig = KrawlConfig()) : Krawler(config) {

    private val FILTERS: Regex = Regex(".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|" +
            "mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz|tar|ico))$", RegexOption.IGNORE_CASE)

    /**
     * Threadsafe whitelist of acceptable hosts to visit
     */
    val whitelist: MutableSet<String> = ConcurrentSkipListSet()

    override fun shouldVisit(url: KrawlUrl): Boolean {
        // Strip the query string before applying the extension filter
        val withoutGetParams: String = url.canonicalForm.split("?").first()
        return (!FILTERS.matches(withoutGetParams) && url.host in whitelist)
    }

    private val counter: AtomicInteger = AtomicInteger(0)

    override fun visit(url: KrawlUrl, doc: KrawlDocument) {
        println("${counter.incrementAndGet()}. Crawling ${url.canonicalForm}")
    }

    override fun onContentFetchError(url: KrawlUrl, reason: String) {
        println("${counter.incrementAndGet()}. Tried to crawl ${url.canonicalForm} but failed to read the content.")
    }

    private var startTimestamp: Long = 0
    private var endTimestamp: Long = 0

    override fun onCrawlStart() {
        startTimestamp = LocalTime.now().toNanoOfDay()
    }

    override fun onCrawlEnd() {
        endTimestamp = LocalTime.now().toNanoOfDay()
        println("Crawled $counter pages in ${(endTimestamp - startTimestamp) / 1000000000.0} seconds.")
    }
}
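To run the crawler, a minimal entry point (placed in the same file as SimpleExample above) might look like the following sketch. It assumes a start(seedUrls: List<String>) method as used in the example project, and the seed host is a placeholder:

fun main() {
    val crawler = SimpleExample(KrawlConfig())
    // Whitelist the seed host so shouldVisit will accept it ("example.org" is a placeholder)
    crawler.whitelist.add("example.org")
    // Assumed entry point: start() kicks off the crawl from the given seed URLs
    crawler.start(listOf("http://example.org"))
}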
Release Notes

0.4.4 (2020-1-29)
- Updated kotlinx.coroutines. This required an update to some of the places where coroutine builders were called internally.

0.4.3 (2017-11-20)
- Added the Krawler#removeUrlsByRootPage and Krawler#removeUrlsByAge methods.
- Added a priority to KrawlQueueEntry. Queues will always attempt to pop the lowest priority entry available. Priority can be assigned by overriding the Krawler#assignQueuePriorty method.

0.4.2 (2017-10-25)
0.4.1 (2017-8-15)
- Updated kotlinx.coroutines to .17.

0.4.0 (2017-5-17)
- Rewrote the core crawl loop to use Kotlin 1.1 coroutines, effectively turning the crawl process into a multi-stage pipeline. This architecture change removed the need for some locking by eliminating resource contention between multiple threads.
- Updated the build file to build the simple example as a runnable jar.
- Minor bug fixes in the KrawlUrl class.
0.3.2 (2017-3-3)
- Fixed a number of bugs that could crash a worker thread, which resulted in an incorrect count of crawled pages and caused slowdowns due to the reduced number of worker threads.
- Added a new utility function to wrap doCrawl and log any uncaught exceptions during crawling.
0.3.1 (2017-2-2)
- Created a 1:1 mapping between threads and the queues used to serve URLs to visit. URLs have an affinity for a particular queue based on their domain, so all URLs from a domain end up in the same queue. This improves parallel crawl performance by reducing how often the politeness delay affects requests. For crawls bound to fewer domains than queues, the excess queues are not used.
- Many bug fixes, including a fix that eliminates accidental over-crawling.
0.2.2 (2017-1-21)
- Setting useFastRedirectHandling = true (when redirects are enabled) will cause Krawler to automatically follow redirects, keeping a history of the transitions and status codes. This history is available in the KrawlDocument#redirectHistory property.
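As a rough sketch of how that history might be consumed (this assumes redirectHistory is a collection; its exact element type is not shown in this README, and imports are as in the SimpleExample above):

class RedirectLoggingCrawler(config: KrawlConfig = KrawlConfig()) : Krawler(config) {

    override fun shouldVisit(url: KrawlUrl): Boolean = true

    override fun visit(url: KrawlUrl, doc: KrawlDocument) {
        // Assumption: redirectHistory is a collection of transition records
        if (doc.redirectHistory.isNotEmpty()) {
            println("${url.canonicalForm} was reached via: ${doc.redirectHistory}")
        }
    }
}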
0.2.1 (2017-1-20)
- Redirect handling has been changed. Redirects can be followed or not via a configuration option in KrawlConfig. When redirects are enabled, the redirected-to URL will be added to the queue as part of the link harvesting phase of Krawler.
- If an anchor tag specifies rel='canonical', the canonicalForm will not be subject to further processing.
- KrawlUrl.new's implementation has been changed to prevent null from being returned in certain circumstances.
0.2.0 (2017-1-18)
- Robots.txt is now respected. Handling can be customized by supplying a RobotsConfig to your Krawler instance; by default Krawler will respect robots.txt without any additional configuration.
- Links are now harvested from the src attributes of tags in addition to the href attributes of anchor tags.