Spark+scala implementation of neighborhood sampling and feature learning for large graphs:
This repository includes the implementation of node2vec.
Get the random walk application source code:
Go to the source code directory. A pre-built jar file, named randomwalk-0.0.1-SNAPSHOT.jar, is available at ./target. To run the application, you use this jar file. (If you want to build the jar file from the source code, you need to have Apache Maven installed and run: mvn clean package
)
Download Apache Spark 2.2.0 or later (e.g release 2.2.1, pre-built for apache hadoop 2.7)
To run the application on your machine, you can use spark-submit script. Go to the Apache Spark directory. Run the application with the following command:
bin/spark-submit --class au.csiro.data61.randomwalk.Main [random walk dir]/target/randomwalk-0.0.1-SNAPSHOT.jar
make sure that the prerequisites are installed:
git clone job-server
Create a .bashrc
with the paths to JVM, sbt and spark, e.g., for Mac OS users it will be the following:
export SBT=/usr/local/Cellar/sbt/1.1.0 export SPARK_HOME=~/spark-2.2.1-bin-hadoop2.7 export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk export PATH=$JAVA_HOME/bin:$SBT/bin:$PATH
Run
source .bashrc
Go to spark folder and run start-all.sh
:
cd spark-2.2.1-bin-hadoop2.7/
sbin/sbin/start-all.sh
cd spark-jobserver/job-server/
sbt
Once inside sbt shell, run reStart
.
Go to http://localhost:8090/
and make sure that Spark Job Server UI is working (Note: in Chrome binaries were not updated properly, while in Firefox it was ok)
upload randomwalk jar to the server:
curl --data-binary @randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar localhost:8090/jars/randomwalk
submit a job:
curl -d "rw.input = --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 --directed false --input [random walk dir]/src/test/resources/karate.txt --output [output dir] --partitioned false" 'localhost:8090/jobs?appName=randomwalk&classPath=au.csiro.data61.randomwalk.Main'
Check the status in the Spark Job-server UI
The following options are available:
--walkLength <value> walkLength: 80
--numWalks <value> numWalks: 10
--p <value> return parameter p: 1.0
--q <value> in-out parameter q: 1.0
--rddPartitions <value> Number of RDD partitions in running Random Walk and Word2vec: 200
--weighted <value> weighted: true
--directed <value> directed: false
--w2vPartitions <value> Number of partitions in word2vec: 10
--input <value> Input edge-file/paths-file: empty
--output <value> Output path: empty
--singleOutput <value> Whether to write the output in a single file: true
--cmd <value> command: randomwalk/embedding/node2vec (to run randomwalk + embedding)
--partitioned <value> Whether the graph is partitioned: false
--lr <value> Learning rate in word2vec: 0.025
--iter <value> Number of iterations in word2vec: 10
--dim <value> Number of dimensions in word2vec: 128
--window <value> Window size in word2vec: 10
For example:
bin/spark-submit --class au.csiro.data61.randomwalk.Main ./randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar \
--cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 \
--input [input edge list] --output [output directory] --partitioned false
You can choose the algorithm to run by the --cmd option:
The input graph must be an edge list with integer vertex IDs. For example:
src1-id dst1-id
src1-id dst2-id
...
If the graph is weighted, it must include the weight in the last column for each edge. For example:
src1-id dst1-id 1.0
If the graph is partitioned, each edge should have a partition number, i.e., should be assigned to a partition. The partition number must be in the third column of the edge list. For example:
src1-id dst1-id 1 1.0
src1-id dst2-id 1 1.0
src3-id dst1-id 2 1.0
...
The application itself will replicate (cut) those vertices that span among multiple partitions.
The application writes output to the disk in the directory given by the --output parameter. In this directory, three folders may be created:
The output of the randomwalk (in the /path directory) is in the format of tab-separated integers (vertex-ids), where each line represents random steps starting from a vertex. The number of lines are equal to the number of generated paths (--numWalks*|V|). The result is partitioned in --rddPartitions number of files as plain text unless you set --singleOutput to true.
The embeddings are also in the the format of tab-separated numbers per line, where the first number represents the vertex-id and the rest of the numbers in that line represent the vertex's vector representation. The result is partitioned in --rddPartition number of files and is written as plain text unless you set --singleOutput to true.
An example of randomwalk output:
4 2 4 2 4
32 26 32 33 16
20 2 31 2 31
...
An example of vecotor representations:
34 -0.055403516 -0.007816158 0.053844377 -0.038890388 0.13484207 -0.07744342
12 -0.025323598 0.054739047 -0.077121474 -0.016575111 -0.041371193 -0.046176624
8 -0.07687951 0.120588094 -0.11744399 -0.058044944 -0.061426725 -0.13111477
...