Nebula Crawler Versions

🌌 A network agnostic DHT crawler, monitor, and measurement tool that exposes timely information about DHT networks.

2.3.0

1 week ago

Changelog

0ef063b feat: add pactus network

2.2.1

1 week ago

Changelog

3fef7b2 Goreleaser (#62)
a38f2d5 Specify CCompiler binaries for each OS and Arch (#64)
60804de Update README.md
8dab945 Update README.md
7d840d9 Update list of the Polkadot bootstrap nodes (#61)
69316c1 add ghcr.io/gorelease-cross docker to cross-compile binaries (#66)
8758d96 fix: commit multi address deletion
68fa3bd install missing gcc dependencies at github actions for gorelease (#63)
17a9ff3 remove the release-draft tag from the base-example (#68)
43e95a3 swap golreleaser build for release at github actions (#67)
038116e update: Filecoin bootstrap peers

2.2.0

2 months ago

Release 2.2.0

This release adds support for the Ethereum discv5 DHT and experimental support for discv4. It also adds network configuration to crawl the Celestia network.

What's Changed

Ethereum Support by @dennis-tra in https://github.com/dennis-tra/nebula/pull/42
Celestia Mainnet Support by @dennis-tra in https://github.com/dennis-tra/nebula/pull/45
Ethereum Consensus Layer Monitoring by @dennis-tra in https://github.com/dennis-tra/nebula/pull/47
Downgrade libp2p to support /quic by @dennis-tra in https://github.com/dennis-tra/nebula/pull/48
Add Ethereum Execution Layer support (discv4) by @dennis-tra in https://github.com/dennis-tra/nebula/pull/50
Discv4: request agent information and capabilities by @dennis-tra in https://github.com/dennis-tra/nebula/pull/51
query bootstrap peers from the database to start a crawl by @dennis-tra in https://github.com/dennis-tra/nebula/pull/52
fix: store reported discv5 listen maddrs in the DB by @guillaumemichel in https://github.com/dennis-tra/nebula/pull/54
discv5: only store crawl error if no query succeeded by @guillaumemichel in https://github.com/dennis-tra/nebula/pull/55
fix: crawl_error detection by @guillaumemichel in https://github.com/dennis-tra/nebula/pull/57
feat: store additional ENR fields by @dennis-tra in https://github.com/dennis-tra/nebula/pull/58

New Contributors

@guillaumemichel made their first contribution in https://github.com/dennis-tra/nebula/pull/54

Full Changelog: https://github.com/dennis-tra/nebula/compare/2.1.2...2.2.0

Detailed Changelog

d8d051c Add Ethereum Consensus Layer Support (#42)
4573f54 Add go releaser github action
9a62827 Update Goldberg bootstrap configuration
e366e6d Update README
294ed92 Update README.md
1ef6f92 Update README.md
6f4760f Update README.md
4826287 add celestia Mainnet Support (#45)
b50f7bf add command line option to log full error
e7a845b add discv4 ethereum execution layer support
15c8dfd add goreleaser config
7a2d780 add new known errors
e610e55 add relay connection error case
78dfa48 add snap capability to devp2p client
459288c add timeout to monitor dialing task
3c7583b add: YouTube video to readme
74dbe6e add: go test status badge
4b16c1b add: health endpoint and health check
0bca573 add: keep ENR crawl option
7bc88b3 add: new known error
eceaae7 add: version column to crawls table
b4b8536 addressing review
450725a allow transient and non-direct connections
cca3cb0 bump: Dockerfile golang basea image
83ab980 bump: go-libp2p to 0.28.3
60d95ef change: go test workflow postgres port
24b7988 decrease global dial timeout to 10s
19ea797 discv4: add devp2p package
523613b discv4: add peer identification logic
5c516b8 discv4: decrease retry timeout
8bbbbca discv5: improve address sanitization
52500d1 distinguish between goldberg LC and FN
a3ffd2f downgrade Dockerfile base image Go version
c7f41cf downgrade GitHub-Actions Go version
1519259 downgrade Identify call failure log message
540bdfa downgrade go-libp2p to support /quic transport
9fa221c downgrade goleak to work with go1.19
845988c enable CGO in Dockerfile
ce78e3c extend known error strings
bc7ffb2 feat: add Avail Goldberg network
ce55b29 feat: add holesky testnet
8b6aa61 feat: discv4 prefix map generation
1f6a210 feat: enable circuit relay transport
4fddad9 feat: ethereum consensus layer monitoring
904cb5d feat: store additional ENR fields
27b6127 fix: Makefile up down migration command
df87a4f fix: NPE as metric providers are not set in monitor engine config
d9d77c7 fix: attnets ENR parsing
55a4a2e fix: crawl_error detection (#57)
55ec088 fix: database test
0e6c5f7 fix: discvx imports
f74507c fix: makefile nebula port
cdaf72e fix: maxmind db loading error handling
d84e97f fix: migration 26 replace calc max failed visits
242311a fix: return crawl error if unhandled
cea10fe fix: store connection and not crawl errors
d905f8e fix: track errorbits even after retries are exceeded
c6aed4e fix: udger db database driver
b4fedb5 fix: upper bound check
32836e1 fixed successful query detection
e201684 go mod tidy
4c7f402 handle additional timeout error
b8c6c5a handle immediately closed connection
e4aaa92 ignore maxmind database files
55d0637 import mplex separately
08a72fa improve ethereum crawling
87d1c40 improve libp2p retry logic
0578b5a increase: disv5 timeout
22f4959 init as many libp2p hosts as available CPU cores
2aeec75 keep track of unknown crawl errors
7eeabc2 let Nebula listen over TCP/UDP
6682e88 libp2p: rework retry logic
65b985a log: number of queried peers
dc8f5e6 monitor: log if there are no open sessions
2eb99f5 prevent dial backoffs in monitoring mode
1022437 query bootstrap peers from the database to start a crawl
5773da0 refactor: code orgranisation
b20ff79 refactor: use opentelemetry
8e34dc4 remove debug log message
b7cb9ae remove: .git from .dockerignore
665df75 remove: Biryani configuration
5dab308 remove: codeql analysis
7e9ce0b remove: deployment folder
1230c29 remove: goreleaser before hooks
3b8ad31 remove: maxmind database embeddings
9b1dcac remove: reporting and analysis code
c9c4546 removed discv5 crawl error if 1 query succeeds
090de44 set pi.maddrs from peerstore addresses
1063edb update Dockerfile
90f44db update multiaddress resolution logic
5d53400 update: GitHub action go version
c0593c0 update: Goldberg configuration
db9d63c update: github actions
f252f81 use original go-ethereum discv4 discv5 implementations
e0c5705 use root config struct for log configuration

2.1.2

10 months ago

What's Changed

fix: store listen addresses as reported by identify by @dennis-tra in https://github.com/dennis-tra/nebula/pull/39
fix: timeout on identify exchange

Full Changelog: https://github.com/dennis-tra/nebula/compare/2.1.1...2.1.2

2.1.1

11 months ago

What's Changed

Fixes a memory leak in the resolve command
Bumps dependencies

Full Changelog: https://github.com/dennis-tra/nebula/compare/2.1.0...2.1.1

2.1.0

1 year ago

What's Changed

Nebula can now also crawl the Kubo API (if exposed on port 5001). Just pass the command line flag --check-exposed to the crawl subcommand.
Nebula can now write its crawl results to a set of JSON files. Just pass the --json-out DIRECTORY command line flag to the crawl subcommand. This will generate four JSON files that contain all the crawl information. More usage information is in the README#Usage.

Full Changelog: https://github.com/dennis-tra/nebula/compare/2.0.0...2.1.0

2.0.0

1 year ago

What's Changed

Highlights

New database scheme (hence major version upgrade) that partitions some core tables
New reporting scripts adjusted to the new schema and extended with new graphs. Powers: https://github.com/dennis-tra/nebula-crawler-reports/

Pull Requests

Add weekly report scripts by @dennis-tra in https://github.com/dennis-tra/nebula/pull/15
postgres ssl by @coryschwartz in https://github.com/dennis-tra/nebula/pull/19
Drop raw visits by @dennis-tra in https://github.com/dennis-tra/nebula/pull/22
Nebula v2 by @dennis-tra in https://github.com/dennis-tra/nebula/pull/26
Neighbor Persistence Speed Up by @dennis-tra in https://github.com/dennis-tra/nebula/pull/27
Resolve each multiaddress in separate transaction, rollback txn on error by @iand in https://github.com/dennis-tra/nebula/pull/31
Add test workflow by @dennis-tra in https://github.com/dennis-tra/nebula/pull/32
add: new net errors by @dennis-tra in https://github.com/dennis-tra/nebula/pull/33

New Contributors

@coryschwartz made their first contribution in https://github.com/dennis-tra/nebula/pull/19
@iand made their first contribution in https://github.com/dennis-tra/nebula/pull/31

Full Changelog: https://github.com/dennis-tra/nebula/compare/1.1.0...2.0.0

1.1.0

2 years ago

Release v1.1.0

The crawler now persists every peer interaction and its associated information (protocols, agent version, multi addresses) plus timing measurements. Yet, the data generated by a crawl does not exceed ~4.5MB. This allows, for example, retrospective analyses, because the individual interactions are saved and not only aggregate information like sessions. To achieve this, the release drastically extends the database schema by normalizing many compound peer properties. For example, multi addresses, agent versions and supported protocols are now decoupled from peer IDs.

Note: The concept of sessions has not changed.

Highlights

more gathered data
more flexible database schema
ping subcommand that measures the ICMP latencies to all online peers of the most recent crawl
resolve subcommand that resolves multi addresses to their IP addresses and geolocation information

Database schema

This release aims to be fully compatible with the database schema introduced by this fork from wcgcyx. Everyone who was working with that schema should be able to just apply the migrations and benefit from the more flexible schema. The analysis scripts are also adapted to use the new database schema.

This is the list of tables:

`agent_versions`

    id            SERIAL PRIMARY KEY
    updated_at    TIMESTAMPTZ   NOT NULL
    created_at    TIMESTAMPTZ   NOT NULL
    agent_version VARCHAR(1000) NOT NULL -- needs to be so large as Filecoin does weird things with this field...

`protocols`

    id         SERIAL PRIMARY KEY
    updated_at TIMESTAMPTZ   NOT NULL
    created_at TIMESTAMPTZ   NOT NULL
    protocol   VARCHAR(1000) NOT NULL

`visits`

Every time the crawler or monitoring task tries to dial or connect to a peer the outcome of that visit is saved in the database. The following data is saved:

    id               SERIAL
    peer_id          SERIAL      NOT NULL -- this is now the internal database ID (not the peerID)
    crawl_id         INT                  -- can be null if this peer was visited from the monitoring task
    session_id       INT                  
    dial_duration    INTERVAL             -- The time it took to dial the peer or until an error occurred (NULL for crawl visits)
    connect_duration INTERVAL             -- The time it took to connect with the peer or until an error occurred (NULL for monitoring visits)
    crawl_duration   INTERVAL             -- The time it took to crawl the peer also if an error occurred (NULL for monitoring visits)
    updated_at       TIMESTAMPTZ NOT NULL 
    created_at       TIMESTAMPTZ NOT NULL 
    type             visit_type  NOT NULL -- either `dial` or `crawl`
    error            dial_error
    protocols_set_id INT                  -- a foreign key to the protocol set that this peer supported at this visit (NULL for monitoring visits as peers are just dialed)
    agent_version_id INT                  -- a foreign key to the peer's agent version at this visit (NULL for monitoring visits as peers are just dialed)
    multi_addresses_set_id INT            -- a foreign key to the multi address set that was used to connect/dial for this visit

`protocols_sets`

    id           SERIAL
    protocol_ids INT ARRAY NOT NULL -- ordered array of foreign key (not db enforced) to the protocols table

`multi_addresses_sets`

    id              SERIAL
    multi_addresses INT ARRAY NOT NULL -- ordered array of foreign key (not db enforced) to the multi_addresses table
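
The two set tables above store an ordered array of IDs per row. One way to read this is that a given combination of protocols or multi addresses maps to a single set row that many visits can reference. The following is a minimal Python sketch of that idea, not Nebula's actual implementation: the function name `get_or_create_set_id`, the in-memory dictionary, and the choice to sort and deduplicate the IDs are all assumptions made for illustration.

```python
def get_or_create_set_id(sets, member_ids):
    """Return the ID of the set row holding exactly `member_ids`.

    `sets` is a hypothetical in-memory stand-in for a set table: it maps a
    canonical (sorted, deduplicated) tuple of member IDs to a set ID.
    """
    key = tuple(sorted(set(member_ids)))
    # setdefault evaluates len(sets) + 1 before inserting, so IDs start at 1.
    return sets.setdefault(key, len(sets) + 1)


protocols_sets = {}
first = get_or_create_set_id(protocols_sets, [3, 1, 2])
second = get_or_create_set_id(protocols_sets, [2, 3, 1])  # same members, other order
third = get_or_create_set_id(protocols_sets, [1, 2])      # different members
```

Because the key is ordered, two visits that observed the same protocols in a different order resolve to the same set ID under this assumption.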

`multi_addresses`

    id             SERIAL
    maddr          VARCHAR(200) NOT NULL  -- The multi address in the form of `/ip4/123.456.789.123/tcp/4001`
    updated_at     TIMESTAMPTZ  NOT NULL
    created_at     TIMESTAMPTZ  NOT NULL

`crawl_properties`

Formerly the peers_properties table.

Used to track high-level statistics of a crawl, e.g., how many nodes were found with a specific agent version. Exactly one of protocol_id, agent_version_id or error is set per row.

    id               SERIAL PRIMARY KEY
    crawl_id         SERIAL      NOT NULL
    protocol_id      INT
    agent_version_id INT
    error            dial_error
    count            INT         NOT NULL
    created_at       TIMESTAMPTZ NOT NULL
    updated_at       TIMESTAMPTZ NOT NULL
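
To illustrate the "one of three columns is set" rule, here is a hedged Python sketch that aggregates a list of visits into rows shaped like `crawl_properties`. The input dictionaries, the function name `crawl_property_rows`, the error value `"timeout"`, and the rule that a failed visit only contributes to the error count are assumptions for this example and are not taken from the Nebula code base.

```python
from collections import Counter


def crawl_property_rows(crawl_id, visits):
    """Aggregate visits into rows shaped like the crawl_properties table.

    Each visit is a hypothetical dict with either an 'error' key or the keys
    'agent_version_id' and 'protocol_ids'.
    """
    agents, protocols, errors = Counter(), Counter(), Counter()
    for visit in visits:
        if visit.get("error") is not None:
            errors[visit["error"]] += 1
            continue
        agents[visit["agent_version_id"]] += 1
        protocols.update(visit["protocol_ids"])

    def row(count, protocol_id=None, agent_version_id=None, error=None):
        # Exactly one of the three property columns is non-null per row.
        return {"crawl_id": crawl_id, "protocol_id": protocol_id,
                "agent_version_id": agent_version_id, "error": error,
                "count": count}

    rows = [row(n, agent_version_id=a) for a, n in sorted(agents.items())]
    rows += [row(n, protocol_id=p) for p, n in sorted(protocols.items())]
    rows += [row(n, error=e) for e, n in sorted(errors.items())]
    return rows


example_rows = crawl_property_rows(1, [
    {"agent_version_id": 1, "protocol_ids": [10, 11]},
    {"agent_version_id": 1, "protocol_ids": [10]},
    {"error": "timeout"},
])
```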

`crawls`

This table received a state field of type crawl_state. At the start of a crawl an empty crawl row is written to the database. This allows the crawler to associate all subsequent data with this crawl.

    CREATE TYPE crawl_state AS ENUM (
        'started',
        'cancelled', -- if crawl is run with the --limit command line option or the user cancelled the crawl via ^C
        'failed',
        'succeeded'
    );
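
As a rough illustration of how a crawl row might move from 'started' to one of the terminal states, consider the Python sketch below. The enum mirrors the four values of crawl_state. The helper `final_state`, its parameters, and the precedence of 'failed' over 'cancelled' are assumptions, since the release notes do not spell out that logic.

```python
from enum import Enum


class CrawlState(Enum):
    STARTED = "started"
    CANCELLED = "cancelled"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


def final_state(error=None, limit_reached=False, interrupted=False):
    """Pick the terminal state for a crawl row that was written as 'started'.

    Assumption: an error wins over a cancellation. A crawl that stopped
    because of --limit or ^C is reported as cancelled.
    """
    if error is not None:
        return CrawlState.FAILED
    if limit_reached or interrupted:
        return CrawlState.CANCELLED
    return CrawlState.SUCCEEDED
```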

`ip_addresses`

    id         SERIAL
    address    INET        NOT NULL
    country    VARCHAR(2)  NOT NULL
    updated_at TIMESTAMPTZ NOT NULL
    created_at TIMESTAMPTZ NOT NULL

`multi_addresses_x_ip_addresses`

As one IP address can be derived from multiple multi addresses and one multi address can resolve to multiple IP addresses, we need a join table for this many-to-many relationship. For example:

/ip4/123.456.789.123/tcp/4001 + /ip4/123.456.789.123/tcp/4002 -> 123.456.789.123
/dnsaddr/bootstrap.libp2p.io -> around 12 IP addresses

    multi_address_id SERIAL
    ip_address_id    SERIAL
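
The many-to-many relationship can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `ip_from_maddr` and `build_join_rows` are hypothetical, the DNS resolution result for `/dnsaddr/example.org` is hard-coded rather than looked up, and the addresses come from the 192.0.2.0/24 documentation range.

```python
def ip_from_maddr(maddr):
    """Return the literal IP of an /ip4 or /ip6 multi address, else None."""
    parts = maddr.strip("/").split("/")
    if len(parts) >= 2 and parts[0] in ("ip4", "ip6"):
        return parts[1]
    return None


def build_join_rows(resolved):
    """Build ID maps and join rows from a dict of maddr -> list of IPs."""
    maddr_ids, ip_ids, join_rows = {}, {}, set()
    for maddr, ips in resolved.items():
        maddr_id = maddr_ids.setdefault(maddr, len(maddr_ids) + 1)
        for ip in ips:
            ip_id = ip_ids.setdefault(ip, len(ip_ids) + 1)
            join_rows.add((maddr_id, ip_id))
    return maddr_ids, ip_ids, sorted(join_rows)


resolved = {
    "/ip4/192.0.2.1/tcp/4001": [ip_from_maddr("/ip4/192.0.2.1/tcp/4001")],
    "/ip4/192.0.2.1/tcp/4002": [ip_from_maddr("/ip4/192.0.2.1/tcp/4002")],
    # Stand-in for a DNS lookup result; no network access happens here.
    "/dnsaddr/example.org": ["192.0.2.2", "192.0.2.3"],
}
maddr_ids, ip_ids, join_rows = build_join_rows(resolved)
```

Two multi addresses share one IP row here, and one multi address points at two IP rows, which is why neither side can simply carry a foreign key to the other.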

`latencies`

This table is populated by the new nebula ping command. This command measures the ICMP latency to all peers that were online during the most recent successful crawl and saves the results here.

    id                 SERIAL
    peer_id            SERIAL       NOT NULL
    ping_latency_s_avg FLOAT        NOT NULL -- The average round trip time (RTT) latency in seconds
    ping_latency_s_std FLOAT        NOT NULL -- The standard deviation of the RTT in seconds
    ping_latency_s_min FLOAT        NOT NULL -- The minimum observed ping RTT in seconds
    ping_latency_s_max FLOAT        NOT NULL -- The maximum observed ping RTT in seconds
    ping_packets_sent  INT          NOT NULL -- The number of sent ping packets
    ping_packets_recv  INT          NOT NULL -- The number of received ping packets
    ping_packets_dupl  INT          NOT NULL -- The number of duplicate ping packets received for one sent ping packet
    ping_packet_loss   FLOAT        NOT NULL -- The percentage of packets lost
    updated_at         TIMESTAMPTZ  NOT NULL
    created_at         TIMESTAMPTZ  NOT NULL
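
For readers who want to see how the columns relate to raw measurements, the following Python sketch derives them from a list of round trip times. It is not the code behind nebula ping. The function name `summarize_pings`, the use of the population standard deviation, and the requirement of at least one reply are assumptions.

```python
import statistics


def summarize_pings(rtts_s, packets_sent, duplicates=0):
    """Summarize ping results into fields named like the latencies columns.

    rtts_s holds one RTT in seconds per received (non-duplicate) reply.
    """
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    if not rtts_s:
        raise ValueError("need at least one reply to compute latency statistics")
    packets_recv = len(rtts_s)
    return {
        "ping_latency_s_avg": statistics.fmean(rtts_s),
        "ping_latency_s_std": statistics.pstdev(rtts_s),  # population std (assumed)
        "ping_latency_s_min": min(rtts_s),
        "ping_latency_s_max": max(rtts_s),
        "ping_packets_sent": packets_sent,
        "ping_packets_recv": packets_recv,
        "ping_packets_dupl": duplicates,
        # Percentage of sent packets that never got a reply.
        "ping_packet_loss": (packets_sent - packets_recv) / packets_sent * 100,
    }


summary = summarize_pings([0.1, 0.2, 0.3], packets_sent=4)
```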

`peers`

With this release, all peer_id references in other tables are linked to the database identifier and not to the multi hash of the peer identity.

    id               SERIAL
    multi_hash       SERIAL      NOT NULL -- the multi hash of the peer identity (the peerID)
    updated_at       TIMESTAMPTZ NOT NULL
    created_at       TIMESTAMPTZ NOT NULL
    protocols_set_id INT                  -- a foreign key to the protocol set that this peer supported at this visit (NULL for monitoring visits as peers are just dialed)
    agent_version_id INT                  -- a foreign key to the peer's agent version at this visit (NULL for monitoring visits as peers are just dialed)

`peers_x_multi_addresses`

This table holds the most recent association of a peer to its set of multi addresses.

    peer_id          SERIAL
    multi_address_id SERIAL

`raw_visits`

This table exists so that the crawl and monitor processes can dump their data into the database with very low latency. A database trigger handles the dissemination of the data into the other tables. The schema of this table is similar to the actual visits table, but it has neither indexes nor foreign key constraints. Although the trigger is executed in a transaction, so the dissemination into the other tables happens synchronously, this approach was 100x faster than preparing the transaction at the application level.

The insertion of a visit data point went from >100ms to around 2ms.
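
The staging-table-plus-trigger mechanism can be tried out locally with a small sketch. Note the assumptions: it uses SQLite through Python's standard library instead of PostgreSQL, the tables are reduced to a handful of columns, the trigger name `normalize_raw_visit` is made up, and the peer names and agent version strings are sample data. It only demonstrates the mechanism and says nothing about the performance numbers quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE agent_versions (
    id            INTEGER PRIMARY KEY,
    agent_version TEXT NOT NULL UNIQUE
);
CREATE TABLE visits (
    id               INTEGER PRIMARY KEY,
    peer_multi_hash  TEXT NOT NULL,
    agent_version_id INTEGER REFERENCES agent_versions (id)
);
-- Staging table: no indexes and no foreign keys, so inserts stay cheap.
CREATE TABLE raw_visits (
    peer_multi_hash TEXT,
    agent_version   TEXT
);
-- The trigger normalizes each raw row into the constrained tables.
CREATE TRIGGER normalize_raw_visit AFTER INSERT ON raw_visits
BEGIN
    INSERT OR IGNORE INTO agent_versions (agent_version)
    VALUES (NEW.agent_version);
    INSERT INTO visits (peer_multi_hash, agent_version_id)
    SELECT NEW.peer_multi_hash, id
    FROM agent_versions
    WHERE agent_version = NEW.agent_version;
END;
""")

# The application only ever writes to the staging table.
conn.executemany(
    "INSERT INTO raw_visits VALUES (?, ?)",
    [("peer-a", "kubo/0.18.0"), ("peer-b", "kubo/0.18.0"), ("peer-c", "lotus-1.20.0")],
)
conn.commit()

agent_versions = conn.execute(
    "SELECT id, agent_version FROM agent_versions ORDER BY id").fetchall()
visits = conn.execute(
    "SELECT peer_multi_hash, agent_version_id FROM visits ORDER BY id").fetchall()
```

After the three inserts, `agent_versions` holds two deduplicated rows and every row in `visits` points at one of them, even though the application never touched those tables directly.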

`pegasys_connections` + `pegasys_neighbors`

These tables are kept for compatibility with the analysis scripts and to prevent data loss when the migrations are applied. The crawler does not actively interact with these tables.

`neighbors`

This table should mimic the pegasys_neighbors table. However, the crawl task doesn't currently have the option to persist the neighbors.

This would lead to quite a lot of data. Back of a napkin calculation:

Each crawl finds ~7k online nodes
Each node returns roughly 200 neighbors
This will mean 1.4M database entries per crawl
The below schema consists of 4 integers with 4 bytes each -> 16 bytes
This would mean 22.4MB of extra data per crawl
Doing this 48 times a day (I'm running the crawler every 30m) would yield 1GB worth of data per day
Therefore, this should be added as a command line option and be run optionally.

    id          SERIAL
    crawl_id    INT
    peer_id     INT
    neighbor_id INT
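
The napkin calculation above can be reproduced with a few lines of Python. The constants are the figures quoted in the text. The estimate covers raw column data only and ignores per-row storage overhead and indexes, so real disk usage would be higher.

```python
ONLINE_NODES_PER_CRAWL = 7_000  # ~7k online nodes per crawl
NEIGHBORS_PER_NODE = 200        # roughly 200 neighbors per node
BYTES_PER_ROW = 4 * 4           # 4 integer columns with 4 bytes each
CRAWLS_PER_DAY = 48             # one crawl every 30 minutes

rows_per_crawl = ONLINE_NODES_PER_CRAWL * NEIGHBORS_PER_NODE
bytes_per_crawl = rows_per_crawl * BYTES_PER_ROW
bytes_per_day = bytes_per_crawl * CRAWLS_PER_DAY

print(f"{rows_per_crawl:,} rows per crawl")         # 1,400,000 rows per crawl
print(f"{bytes_per_crawl / 1e6:.1f} MB per crawl")  # 22.4 MB per crawl
print(f"{bytes_per_day / 1e9:.2f} GB per day")      # 1.08 GB per day
```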

Notes

This flexibility came with a huge performance hit. During a crawl, inserts could easily take >100ms. To work around this issue, this release adds concurrency to the persistence operations by introducing Persister workers. Further, the heavy lifting of normalizing the data is done at the database level itself. After the crawler has visited a peer, it saves the raw data to a table (raw_visits) that doesn't have any constraints or foreign keys. A database trigger handles the dissemination into the other tables. This brought the insert latency down to ~2ms while preserving the integrity guarantees of foreign keys and constraints. Don't know if this is how you do things. It's working well though 👍
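
The idea of moving persistence onto a pool of concurrent workers can be sketched as follows. Nebula itself is written in Go, so this Python version is only an illustration of the pattern. The function `run_persisters`, its parameters, and the error handling are assumptions and do not describe the actual Persister workers.

```python
import queue
import threading


def run_persisters(visits, persist, num_workers=4):
    """Feed `visits` to `num_workers` threads that each call `persist(visit)`.

    Returns a list of (visit, exception) pairs for visits that failed.
    """
    work = queue.Queue()
    errors = []

    def worker():
        while True:
            item = work.get()
            try:
                if item is None:  # sentinel: no more work for this worker
                    return
                persist(item)
            except Exception as exc:  # keep the worker alive on failures
                errors.append((item, exc))
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for visit in visits:
        work.put(visit)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    work.join()
    for thread in threads:
        thread.join()
    return errors


stored = []
failed = run_persisters(range(100), stored.append, num_workers=4)
```

In this sketch the crawl loop would only enqueue results, so a slow insert delays one worker instead of stalling the crawl.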

1.0.0

2 years ago

This is the first release of Nebula.
