DGL Versions

Python package built to ease deep learning on graphs, on top of existing DL frameworks.

0.9.1

1 year ago

v0.9.1 is a minor release with the following update:

Distributed Graph Partitioning Pipeline

DGL now supports partitioning and preprocessing graph data using multiple machines. At its core is a new data format called Chunked Graph Data Format (CGDF), which stores graph data in chunks. The new pipeline processes data chunks in parallel, which not only reduces the memory requirement of each machine but also significantly accelerates the entire procedure. For a random graph with 1B nodes/5B edges, using a cluster of 8 AWS EC2 x1e.4xlarge instances (16 vCPU, 488GB RAM each), the new pipeline can reduce the running time to 2.7 hours and cut the monetary cost by 3.7x. Read the feature highlight blog for more details.

To get started with this new feature, check out the new user guide chapter.

New Additions

System Enhancement

  • Two new APIs, dgl.use_libxsmm and dgl.is_libxsmm_enabled, to enable/disable Intel LibXSMM. (#4455)
  • Added a new option exclude_self to exclude self-loop edges for dgl.knn_graph. The API now supports creating a batch of KNN graphs. (#4389) See the sketch after this list.
  • The distributed training program launched by DGL now reports an error when any trainer/server fails.
  • Speed up DataLoader by adding CPU affinity support. (#4126)
  • Enable graph partition book to support canonical edge types. (#4343)
  • Improve the performance of CUDA SpMMCSr (#4363)
  • Add CUDA Weighted Neighborhood Sampling (#4064)
  • Enable UVA for Weighted Samplers (#4314)
  • Allow adding data to self-loops created by AddSelfLoop or add_self_loop (#4261)
  • Add CUDA Weighted Randomwalk Sampling (#4243)
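
The two toggles and the new exclude_self option can be combined as in the minimal sketch below; keyword names other than those quoted in the notes are assumptions.

import torch
import dgl

# Toggle Intel LibXSMM at runtime and query the current state
# (boolean argument assumed).
dgl.use_libxsmm(False)
print(dgl.is_libxsmm_enabled())   # False
dgl.use_libxsmm(True)

# Build a batch of KNN graphs while excluding self-loop edges.
points = torch.randn(2, 100, 3)   # 2 point clouds with 100 3-D points each
g = dgl.knn_graph(points, k=5, exclude_self=True)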

Deprecation & Cleanup

  • Removed the already deprecated AsyncTransferer class. Its functionality has been incorporated into the DGL DataLoader. (#4505)
  • Removed the already deprecated num_servers and num_workers arguments of dgl.distributed.initialize. (#4284)

Dependency Update

Starting from this release, we have dropped support for CUDA 10.1 and 11.0. On Windows, we have additionally dropped support for CUDA 10.2.

Linux (CentOS 7+ / Ubuntu 18.04+): builds are provided for PyTorch 1.9-1.12 against CUDA 10.2, 11.1, 11.3, 11.5 and 11.6, though not every PyTorch/CUDA combination is available.

Windows (Windows 10+ / Windows Server 2016+): builds are provided for PyTorch 1.9-1.12 against CUDA 11.1, 11.3, 11.5 and 11.6, though not every PyTorch/CUDA combination is available.

Bugfixes

  • Fix a crash bug due to incorrect dtype in dgl.to_block() (#4487)
  • Fix a bug related to unpinning when tensoradapter is not available (#4450)
  • Fix a bug related to pinning empty tensors and graphs (#4393)
  • Remove duplicate entries of CUB submodule (#4499)
  • Fix broken static_assert (#4342)
  • A bunch of fixes in edge_softmax_hetero (#4336)
  • Fix the default value of num_bases in RelGraphConv module (#4321)
  • Fix etype check in DistGraph.edge_subgraph (#4322)
  • Fix incorrect _bias and bias usage (#4310)
  • Enable DistGraph.find_edge() to work with str or tuple of str (#4319)
  • Fix a numerical bug related to SparseAdagrad. (#4253)

0.9.0

1 year ago

This is a major update with several new features including graph prediction pipeline in DGL-Go, cuGraph support, mixed precision support, and more.

Starting from 0.9 we also ship arm64 builds for Linux and OSX.

DGL-Go

DGL-Go now supports training GNNs for graph property prediction tasks. It includes two popular GNN models: Graph Isomorphism Network (GIN) and Principal Neighborhood Aggregation (PNA). For example, to train a GIN model on the ogbg-molpcba dataset, first generate a YAML configuration file using the command:

dgl configure graphpred --data ogbg-molpcba --model gin

which generates the following configuration file. Users can then adjust it manually.

version: 0.0.2
pipeline_name: graphpred
pipeline_mode: train
device: cpu                     # Torch device name, e.g., cpu or cuda or cuda:0
data:
    name: ogbg-molpcba
    split_ratio:                # Ratio to generate data split, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
model:
     name: gin
     embed_size: 300            # Embedding size
     num_layers: 5              # Number of layers
     dropout: 0.5               # Dropout rate
     virtual_node: false        # Whether to use virtual node
general_pipeline:
    num_runs: 1                 # Number of experiments to run
    train_batch_size: 32        # Graph batch size when training
    eval_batch_size: 32         # Graph batch size when evaluating
    num_workers: 4              # Number of workers for data loading
    optimizer:
        name: Adam
        lr: 0.001
        weight_decay: 0
    lr_scheduler:
        name: StepLR
        step_size: 100
        gamma: 1
    loss: BCEWithLogitsLoss
    metric: roc_auc_score
    num_epochs: 100             # Number of training epochs
    save_path: results          # Directory to save the experiment results

Alternatively, users can fetch a model recipe with the pre-defined hyperparameters used in the original experiments.

dgl recipe get graphpred_pcba_gin.yaml

To launch training:

dgl train --cfg graphpred_ogbg-molpcba_gin.yaml

Another addition is a new command to run inference with a trained model on a different dataset. For example, the following shows how to apply the GIN model trained on ogbg-molpcba to ogbg-molhiv.

# Generate an inference configuration file from a saved experiment checkpoint
dgl configure-apply graphpred --data ogbg-molhiv --cpt results/run_0.pth

# Apply the trained model for inference
dgl apply --cfg apply_graphpred_ogbg-molhiv_pna.yaml

It will save the model predictions in a CSV file.

Mixed Precision

DGL is compatible with the PyTorch Automatic Mixed Precision (AMP) package for mixed precision training, thus saving both training time and GPU memory consumption. This feature requires PyTorch 1.6+ and Python 3.7+.

By wrapping the forward pass with torch.cuda.amp.autocast(), PyTorch automatically selects the appropriate data type for each op and tensor. Half-precision tensors are memory efficient, and most operators on them are faster because they leverage GPU tensor cores.

import torch.nn.functional as F
from torch.cuda.amp import autocast

def forward(g, feat, label, mask, model):
    with autocast(enabled=True):
        logit = model(g, feat)
        loss = F.cross_entropy(logit[mask], label[mask])
        return loss

Small gradients in float16 format can underflow (flush to zero). PyTorch provides a GradScaler module to address this issue: it multiplies the loss by a scale factor and invokes the backward pass on the scaled loss to prevent underflow, then unscales the computed gradients before the optimizer updates the parameters. The scale factor is determined automatically.

from torch.cuda.amp import GradScaler

scaler = GradScaler()

def backward(scaler, loss, optimizer):
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Putting everything together, we have the example below.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import GradScaler
from dgl.data import RedditDataset
from dgl.nn import GATConv
from dgl.transforms import AddSelfLoop

class GAT(nn.Module):
    def __init__(self, in_feats, num_classes, num_hidden=256, num_heads=2):
        super().__init__()
        self.conv1 = GATConv(in_feats, num_hidden, num_heads, activation=F.elu)
        self.conv2 = GATConv(num_hidden * num_heads, num_classes, num_heads)

    def forward(self, g, h):
        h = self.conv1(g, h).flatten(1)
        h = self.conv2(g, h).mean(1)
        return h

device = torch.device('cuda')

transform = AddSelfLoop()
data = RedditDataset(transform=transform)

g = data[0]
g = g.int().to(device)
train_mask = g.ndata['train_mask']
feat = g.ndata['feat']
label = g.ndata['label']
in_feats = feat.shape[1]

model = GAT(in_feats, data.num_classes).to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
scaler = GradScaler()

for epoch in range(100):
    optimizer.zero_grad()
    loss = forward(g, feat, label, train_mask, model)  # forward() defined above
    backward(scaler, loss, optimizer)                  # backward() defined above

Thanks to @nv-dlasalle, @ndickson-nvidia, @yaox12 and others for the support!

cuGraph Interface

The RAPIDS cuGraph library provides a collection of GPU accelerated algorithms for graph analytics, such as centrality computation and community detection. According to its documentation, “the latest NVIDIA GPUs (RAPIDS supports Pascal and later GPU architectures) make graph analytics 1000x faster on average over NetworkX”.

To install cuGraph, we recommend following the practice below.

conda install mamba -n base -c conda-forge

mamba create -n dgl_and_cugraph -c dglteam -c rapidsai-nightly -c nvidia -c pytorch -c conda-forge cugraph pytorch torchvision torchaudio cudatoolkit=11.3 dgl-cuda11.3 tqdm

conda activate dgl_and_cugraph

DGL now interoperates with cuGraph by allowing conversion between a DGLGraph object and a cuGraph graph object, making it possible for DGL users to access efficient graph analytics implementations in cuGraph. For example, users can perform community detection on a graph with the Louvain method available in cuGraph.

import cugraph

from dgl.data import CoraGraphDataset

dataset = CoraGraphDataset()
g = dataset[0].to('cuda')
cugraph_g = g.to_cugraph()
cugraph_g = cugraph_g.to_undirected()
parts, modularity_score = cugraph.louvain(cugraph_g)

The community membership of nodes from parts['partition'] can then be used as auxiliary node labels or node features.
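
For instance, a minimal sketch (assuming cuGraph's Louvain output keeps its 'vertex' and 'partition' columns) that attaches the communities back to the DGLGraph as a node feature:

import torch

# `parts` is the cuDF DataFrame returned by cugraph.louvain() above.
# Sort by vertex ID so the community labels line up with DGL node IDs,
# then store them as a node feature (column names assumed from cuGraph).
parts_sorted = parts.sort_values('vertex').to_pandas()
g.ndata['community'] = torch.as_tensor(
    parts_sorted['partition'].to_numpy(), device=g.device)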

If you have modified the structure of a cuGraph graph object or loaded graph data with cuGraph, you can also convert it to a DGLGraph object.

import dgl
g = dgl.from_cugraph(cugraph_g)

Credits to @VibhuJawa!

Arm64 builds

Linux AArch64 and OSX M1 (arm64) are now supported. One can install them as usual with pip and conda:

pip install dgl-cuXX -f https://data.dgl.ai/wheels/repo.html
conda install -c dglteam dgl-cudaXX.X   # currently not available for OSX M1

Quality-of-life updates

  • Added more missing FP16 specializations (#4140, @ndickson-nvidia )
  • Allow communicators of size one when NCCL is missing (#3713, @nv-dlasalle )
  • Automatically unpin DGL tensors when out of scope to avoid potential bugs (#4135, @yaox12 )

System optimizations

  • Enable using UVA and FP16 with SparseAdam Optimizer (#3885, @nv-dlasalle )
  • Enable USE_EPOLL by default in distributed training (#4167)
  • Optimize the use of alternative streams in dataloader (#4177, @yaox12 )
  • Redirect AllocWorkspace to PyTorch's allocator if available (#4199, @yaox12 )

Bug fixes

  • Massive refactoring of examples including GCN, GraphSAGE, PinSAGE, EGES, DGI, GATv2, and many more (#4130, #4194, #4186, #4197, #4201, #4160, #4220, #4219, #4218, #4242, #4255, huge thanks to @chang-l!)
  • Fix CareGNN example to adapt to new sampler interface (#4211, @yaox12)
  • Fix #4150 (#4164, #4198, #4212)
  • Fix etype not guaranteed to be sorted in distributed training (#4156)
  • Fix compiler warnings (#4051, @TristonC)
  • Fix the correct-and-smooth example, which used validation labels during prediction on the validation set (#4158, @LucasPrietoAl)
  • Fix build issues on mac OS (#4168, #4175)
  • Fix that pin_prefetcher is not actually enabled (#4169, @yaox12 )
  • Fix a bug related to GroupRevRes (#4181)
  • Fix deferred_dtype missing error (#4174, @nv-dlasalle )
  • Add CUDA context availability check before setting curand seed (#4223, @yaox12)
  • Fix dtype mismatch when copying a graph into shared memory and getting it back (#4222) (#4228)
  • Fix graph attribute missing in DataLoader when device is not specified (#4245)
  • Record stream when using another CUDA stream for data transfer (#4250, @yaox12 )
  • Fix Multiple Backwards Pass Error with retain_graph being set (#4078) (#4249)
  • Doc fixes (#4149, #4180, #4193, #4246, #4248, @PotatoChipsNinja @yaox12 @alxwen711 @Zhanghyi )

Misc

  • Test pipeline for distributed training (#4122 , @Kh4L)

0.8.2

2 years ago

This is a minor release with the following updates.

Test AArch64 Build

A 0.8.2 test build for AArch64 is available via:

pip install dgl -f https://data.dgl.ai/wheels-test/repo.html   # or dgl-cuXX for CUDA

New Modules

  • Graph Isomorphism Network with Edge Features (#3934)
  • dgl.transforms.FeatMask for randomly dropping out dimensions of all node/edge features (#3968, @RecLusIve-F)
  • dgl.transforms.RowFeatNormalizer for normalization of all node/edge features (#3968, @RecLusIve-F); see the sketch after this list
  • Label propagation module (#4017)
  • Directional graph network layer (#4017)
  • Datasets for developing GNN explainability approaches (#3982)
  • dgl.transforms.SIGNDiffusion for augmenting input node features (#3982)
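
A hedged sketch of composing the two new feature transforms with an existing dataset; the parameter names p and node_feat_names are assumptions based on the descriptions above.

import dgl
import dgl.transforms as T

# Randomly mask 20% of the dimensions of the node feature 'feat', then
# row-normalize it, for every graph loaded from the dataset
# (parameter names are assumptions).
transform = T.Compose([
    T.FeatMask(p=0.2, node_feat_names=['feat']),
    T.RowFeatNormalizer(node_feat_names=['feat']),
])
dataset = dgl.data.CoraGraphDataset(transform=transform)
g = dataset[0]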

Quality-of-life Updates

  • Allow HeteroLinear with/without bias (#3970, @ksadowski13)
  • Allow selection of “socket” for RPC backend in distributed training (#3951)
  • Enable specification of maximum number of trials for socket backend in DistDGL (#3977)
  • Added floating-point conversion functions to dgl.transforms.functional (#3890, @ndickson-nvidia)
  • Improve the warning message when Tensoradapter is not found (#4055)
  • Add sanity check for in_edges/out_edges on empty graphs (#4050)

System Optimization

  • Improved graph batching on GPU for Graph DataLoaders (#3895, @ayasar70)
  • CPU DataLoader affinitization (#3723 @daniil-sizov)
  • Memory consumption optimization on index shuffling in dataloader (#3980)
  • Remove unnecessary induced vertices in edge subgraph (#3978, @yaox12)
  • Change the curandState and launch dimension of GPU neighbor sampling kernel (#3990, @paoxiaode)

Bug fixes

  • Fix multi-GPU edge classification crashing with pure-GPU sampling (#3946)
  • Fixed race conditions in distributed SparseAdam and SparseAdagrad (#3971, @ndickson-nvidia)
  • Fix launch parameters of the index select kernel in sparse pull for multi-GPU sparse embedding (#3524, @nv-dlasalle)
  • Fix import error when tensorflow backend is specified (#4015)
  • Fix DistDGL crashing when sampling on bipartite graphs (#4014)
  • Prevent users from attempting to pin PyTorch non-contiguous tensors or views encompassing only part of a tensor (#3992, @nv-dlasalle)
  • Fix Cython CAPI holding GIL causes deadlock when Python callback is asynchronous (#4036)
  • Misc unit test, example, doc fixes etc. (#3947, #3941, #3928, #3944, #3505, #3953, #3983, #3996, #4009, #4010, #4016, #4022, #4023, #4027, #4030, #4034, #4038, #4053, #4058, #4060 @Kh4L, @daniil-sizov, @HenryChang213, @sharique1006, @msharmavikram, @initzhang, @yinpeiqi, @chang-l, @nv-dlasalle, @Sanzo00, @Eurus-Holmes, @xiaopqr, @decoherencer)

0.8.1

2 years ago

This is a minor release that includes the following model updates, optimizations, new features and bug fixes.

Model update

  • nn.GroupRevRes from Training Graph Neural Networks with 1000 layers [#3842]
  • transforms.LaplacianPositionalEncoding from Graph Neural Networks with Learnable Structural and Positional Representations [#3869]
  • transforms.RWPositionalEncoding from Graph Neural Networks with Learnable Structural and Positional Representations [#3869]
  • dataloading.SAINTSampler from GraphSAINT [#3879]
  • nn.EGNNConv from E(n) Equivariant Graph Neural Networks [#3901]
  • nn.PNAConv from the baselines of E(n) Equivariant Graph Neural Networks [#3901]

Example update

  • Position-aware GNN [#3823 @RecLusIve-F]
  • EGES (Enhanced Graph Embedding with Side info) [#3756 @Wang-Yu-Qing]

Feature update (new functionalities, interface changes, etc.)

  • Radius graph: construct a graph by connecting points within a given distance. [#3829 @ksadowski13]
    • It uses torch.cdist, so the space complexity is O(N^2).
  • Added a get_attention parameter to GlobalAttentionPooling. [#3837 @decoherencer] See the sketch after this list.
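
A hedged sketch combining both additions; apart from the names quoted above, the exact signatures are assumptions.

import torch
import torch.nn as nn
import dgl
from dgl.nn import GlobalAttentionPooling

# Connect every pair of points that lie within distance 0.2 of each other.
points = torch.rand(100, 3)
g = dgl.radius_graph(points, 0.2)

# Pooling that also returns the per-node attention scores.
feat = torch.rand(g.num_nodes(), 16)
pool = GlobalAttentionPooling(nn.Linear(16, 1))
readout, attn = pool(g, feat, get_attention=True)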

Quality of life update

  • Example of multi-GPU training with PyTorch Lightning. [#3863]
  • Multi-GPU inference with UVA. [#3827 @nv-dlasalle]
  • Enable UVA sampling with CPU indices to save GPU memory. [#3892]
  • Set stacklevel=2 for DGL-raised warnings. [#3816]
  • Pure GPU example of GraphSAGE, with both node classification and link prediction. [#3796 @nv-dlasalle, #3856 @Kh4L]
  • Tensoradapter DLPack 0.6 compatibility / PyTorch 1.11 support. [#3803]

System optimization

  • Enable UVA for PinSAGE and RandomWalk. [#3857 @yaox12]
  • METIS partitioning with communication volume minimization, which reduces the communication volume by 13.4% compared with edge-cut minimization on ogbn-products. [#3821 @chwan1016]
  • Change the parameters of curand_init to reduce GPU latency [#3794 @paoxiaode]

Bug fixes

  • Fix Python 3.10 import error [#3862]
  • Fix repeated 0’s in DataLoader index iteration when shuffle=False [#3892]
  • DataLoader device cannot be None [#3822 @yinpeiqi]
  • Fix device error in negative sampling with UVA [#3904 @nv-dlasalle]
  • Fix illegal instruction in ClusterGCNSampler (#3910)
  • Include pin memory status in pickling and deep copy [#3914]
  • Misc doc fixes (@lvcrek @AzureLeon1 @decoherencer @yaox12 @ketyi )

0.8.0post2

2 years ago

This is a bugfix release with the following updates:

Quality-of-life updates

  • Python 3.10 support.
  • PyTorch 1.11 support.
  • CUDA 11.5 support on Linux. Please install with
    pip install dgl-cu115 -f https://data.dgl.ai/wheels/repo.html  # if using pip
    conda install dgl-cuda11.5 -c dglteam  # if using conda
    
  • Compatibility to DLPack 0.6 in tensoradapter (#3803) for PyTorch 1.11
  • Set stacklevel=2 for dgl_warning (#3816)
  • Support custom datasets in DataLoader that are not necessarily tensors (#3810 @yinpeiqi )

Bug fixes

  • Pass ntype/etype into partition book when node/edge_split (#3828)
  • Fix multi-GPU RGCN example (#3871 @yaox12)
  • Send rpc messages blockingly in case of congestion (#3867). Note that this fix may cause a speed regression in distributed DGL training; we are still investigating the root cause of the underlying issue in #3881.
  • Fix CopyToSharedMem assuming that all relation graphs are homogeneous (#3841)
  • Fix HAN example crashing with CUDA (#3841)
  • Fix UVA sampling crash without specifying prefetching features (#3862)
  • Fix documentation display issue of node/edge_split (#3858)
  • Fix device mismatch error in GraphSAGE distributed training example under multi-node multi-GPU (#3870)
  • Use torch.distributed.algorithms.join.Join to deal with uneven training sets in distributed training (#3870)
  • Dataloader documentation fixes (#3886)
  • Remove redundant reference of networkx package in pagerank.py (#3888 @AzureLeon1 )
  • Make source build work for systems where the default is Python 2 (#3718)
  • Fix UVA sampling with partially specified node types (#3897)

0.8.0post1

2 years ago

This is a quick post-release with critical bug fixes:

  • Fix incorrect name when fetching data in the sparse optimizer #3808
  • Fix DataLoader not working with heterogeneous graphs on multiple GPUs #3801
  • Fix error in heterogeneous graph partitioning when the graph is unidirectional bipartite #3793

0.8.0

2 years ago

v0.8.0 is a major release with many new features, system improvements and fixes. Read the blog for the highlighted features.

Major features

Mini-batch Sampling Pipeline Update

Enabled CUDA UVA-based optimization and feature prefetching for all built-in graph samplers (up to 4x speedup compared to v0.7). Users can now specify the features to prefetch and turn on UVA optimization in dgl.dataloading.Sampler and dgl.dataloading.DataLoader.

g = ...                             # some DGLGraph data
train_nids = ...                    # training node IDs
sampler = dgl.dataloading.MultiLayerNeighborSampler(
    [10, 15],                       # fanouts for each layer
    prefetch_node_feats=['feat'],   # prefetch node feature 'feat'
    prefetch_labels=['label'],      # prefetch node label 'label'
)
dataloader = dgl.dataloading.DataLoader(
    g, train_nids, sampler,
    device='cuda:0',     # perform sampling on GPU 0
    batch_size=1024,
    shuffle=True,
    use_uva=True         # turn on UVA optimization
)

We have done a major refactor on the sampling components to make it easier to implement new graph samplers. Added a new base class dgl.dataloading.Sampler with one abstract method sample for overriding. Added new APIs dgl.set_src_lazy_features, dgl.set_dst_lazy_features, dgl.set_node_lazy_features, dgl.set_edge_lazy_features for customizing prefetching rules. The code below shows the new user experience.

class NeighborSampler(dgl.dataloading.Sampler):
    def __init__(self,
                 fanouts : list[int],
                 prefetch_node_feats: list[str] = None,
                 prefetch_edge_feats: list[str] = None,
                 prefetch_labels: list[str] = None):
        super().__init__()
        self.fanouts = fanouts
        self.prefetch_node_feats = prefetch_node_feats
        self.prefetch_edge_feats = prefetch_edge_feats
        self.prefetch_labels = prefetch_labels

    def sample(self, g, seed_nodes):
        output_nodes = seed_nodes
        subgs = []
        for fanout in reversed(self.fanouts):
            # Sample a fixed number of neighbors of the current seed nodes.
            sg = g.sample_neighbors(seed_nodes, fanout)
            # Convert this subgraph to a message flow graph.
            sg = dgl.to_block(sg, seed_nodes)
            seed_nodes = sg.srcdata[dgl.NID]
            subgs.insert(0, sg)
        input_nodes = seed_nodes

        # handle prefetching
        dgl.set_src_lazy_features(subgs[0], self.prefetch_node_feats)
        dgl.set_dst_lazy_features(subgs[-1], self.prefetch_labels)
        for subg in subgs:
            dgl.set_edge_lazy_features(subg, self.prefetch_edge_feats)

        return input_nodes, output_nodes, subgs

The related documentation has been updated accordingly.

We thank Xin Yao (@yaox12 ) and Dominique LaSalle (@nv-dlasalle ) from NVIDIA and David Min (@davidmin7 ) from UIUC for their contributions.

DGL-Go

DGL-Go is a new command line tool for users to get started with training, using and studying Graph Neural Networks (GNNs). Data scientists can quickly apply GNNs to their problems, whereas researchers will find it useful to customize their experiments.

The initial release includes:

  • Four commands, dgl train, dgl recipe, dgl configure and dgl export.
  • 3 training pipelines for node prediction using full graph training, link prediction using full graph training and node prediction using neighbor sampling.
  • 5 node encoding models: gat, gcn, gin, sage, sgc; 3 edge encoding models: bilinear, dot-product, element-wise.
  • 10 datasets including custom dataset in CSV format.

NN Modules

We have accelerated dgl.nn.RelGraphConv and dgl.nn.HGTConv by up to 36x and 12x, respectively, compared with the baselines from v0.7 and PyG, and shortened the implementation of dgl.nn.RelGraphConv by 3x (from ~200 lines to 64 lines).

Breaking change: dgl.nn.RelGraphConv no longer accepts a 1-D integer tensor of node IDs during the forward pass. Please switch to torch.nn.Embedding to explicitly represent trainable node embeddings.
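
A minimal migration sketch, assuming a toy random graph and arbitrary relation types (all sizes here are illustrative):

import torch
import torch.nn as nn
import dgl

# Keep trainable node embeddings in torch.nn.Embedding and pass dense
# features to RelGraphConv instead of a 1-D node ID tensor.
num_nodes, num_rels, in_feat, out_feat = 1000, 4, 16, 16
g = dgl.rand_graph(num_nodes, 5000)                      # toy graph
etypes = torch.randint(0, num_rels, (g.num_edges(),))    # random relation types

emb = nn.Embedding(num_nodes, in_feat)
conv = dgl.nn.RelGraphConv(in_feat, out_feat, num_rels)
h = conv(g, emb(g.nodes()), etypes)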

Below are the new NN modules added to v0.8:

A new edge_weight argument has been added to several GNN modules to support training on weighted graphs. A new user guide chapter 5.5 explains how to use edge weights in your GNN model.
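
For instance, a hedged sketch with GraphConv (the graph and weight values are arbitrary):

import torch
import dgl
from dgl.nn import GraphConv

g = dgl.add_self_loop(dgl.rand_graph(100, 500))   # toy graph
feat = torch.rand(g.num_nodes(), 16)
g.edata['w'] = torch.rand(g.num_edges())          # arbitrary per-edge weights

conv = GraphConv(16, 8)
h = conv(g, feat, edge_weight=g.edata['w'])       # messages scaled by edge weight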

Graph Dataset and Transforms

Renamed the old dgl.transform package to dgl.transforms to follow PyTorch’s namespace convention. All DGL datasets now accept an extra transform keyword argument for data augmentation and transformation:

import dgl
import dgl.transforms as T
t = T.Compose([
    T.AddSelfLoop(),
    T.GCNNorm(),
])
dataset = dgl.data.CoraGraphDataset(transform=t)
g = dataset[0]  # graph and features will be transformed automatically

Added 16 graph data transform modules.

Added several dataset utilities.

Model Examples

Two classical examples received a major rework, and 7 new examples were added.

GNNLens

GNNLens is an interactive visualization tool for graph neural networks (GNNs). It integrates GNN explanation models to analyze and understand graph data. See the repository here: https://github.com/dmlc/gnnlens2

Distributed Training

  • Allow launching a persistent graph server (one that will not exit even after all training workers have finished) to speed up distributed experiments on the same graph data. See the user guide chapter for more details.
  • Breaking change: the data loaders for single-device and distributed training are now separate. Passing a DistGraph to dgl.dataloading.NodeDataLoader will cause an error; please use dgl.dataloading.DistNodeDataLoader instead (see the sketch after this list).
  • Replace the low-level network communicator with pytorch/tensorpipe.
  • dgl.sample_etype_neighbors now works for DistGraph. #3558
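
A hedged migration sketch for the data loader change above; the graph name, mask and fan-outs are placeholders and assume dgl.distributed.initialize() has already been called:

import dgl
import dgl.dataloading

# Placeholder graph name and mask; assumes dgl.distributed.initialize()
# has been called and the partitioned graph is being served.
dist_g = dgl.distributed.DistGraph('mygraph')
train_nids = dgl.distributed.node_split(
    dist_g.ndata['train_mask'], dist_g.get_partition_book())
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 15])
dataloader = dgl.dataloading.DistNodeDataLoader(
    dist_g, train_nids, sampler, batch_size=1024, shuffle=True)

for input_nodes, output_nodes, blocks in dataloader:
    pass  # run the usual forward/backward on the sampled blocks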

Documentation

Other API Updates

  • dgl.ops.segment_mm: An operator to perform matrix multiplication according to segments.
  • dgl.ops.gather_mm: An operator to perform matrix multiplication according to look-up indices.
  • dgl.merge: Merge a sequence of graphs together into a single one. @noncomputable #3522
  • dgl.dataloading.GlobalUniform: A negative sampler that draws negative samples uniformly from all nodes. #3599
  • dgl.DGLGraph.pin_memory_, dgl.DGLGraph.unpin_memory_ and dgl.DGLGraph.is_pinned to pin, unpin and check whether a DGLGraph is in page-locked memory (see the sketch after this list).
  • A new CPU kernel for dgl.edge_softmax. @ranzhejiang #3650
  • New CUDA kernel implementations that accelerate dgl.node_subgraph, dgl.in_subgraph and dgl.in_edges by several orders of magnitude. @ayasar70, #3745
  • dgl.reorder_graph supports reordering edges according to user-provided permutation.
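
A small sketch of the new pinning APIs (requires a CUDA-enabled build of DGL):

import dgl

g = dgl.rand_graph(1000, 5000)

# Pin the graph structure in page-locked host memory so the GPU can access
# it directly (e.g. for UVA sampling), then release it again.
g.pin_memory_()
assert g.is_pinned()
g.unpin_memory_()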

Patch and Bugfixes

  • Fixed an off-by-one bug in GenericRandomWalk(). @erickim555, #3500
  • Cleanup codebase and remove unused third_party dependency.
  • Fixed a device error in the pytorch/MNIST example. @sinhaharsh #3527
  • Improved the speed of PinSAGESampler by fusing several operations. @lixiaobai09, #3529
  • Enable CUDA PinSAGESampler. @lixiaobai09, #3567
  • Fixed GATv2Conv residual for mini-batch. @ksadowski13, #3535
  • Fixed the output dimensions of residual connection for GATv2Conv. @schmidt-ju, #3584
  • Improved csr2coo.cu:_RepeatKernal() for more robust GPU usage. @ayasar70, #3537
  • Fixed the dimension mismatch issue in PinSAGE example. #3539
  • Fixed a bug of GinConv when used with pickle. @lizeyan #3540
  • Fixed a bug in distributed training where improper data splitting causes training hanging. #3542
  • Fixed a bunch of bugs in distributed training. @xcwanAndy #3607
  • Fix a bug in TGN example. #3543
  • Improved building by only rebuilding libxsmm if necessary. #3497
  • Improved the documentation of TUDataset on the order. @sangyx #3549
  • Fixed a bug in distributed SparseAdam optimizer. #3561
  • Fixed a bug in TWIRLS module and example. @FFTYYY #3573
  • Fixed a bug in ndata and edata where lazy copy is triggered unnecessarily. #3585
  • Fixed a bug when using int32 array. @hirayaku, #3597
  • Fixed a bug of KNN graph on TensorFlow.
  • Fixed a bug in to_bidirected where a simple graph is needed. #3630
  • Fixed a compilation bug in parallel_for.h. #3631
  • Fixed a compilation crash related to libuv-devel. #3640
  • Remove the info message of RDFLib and “using backend: xxx” when importing DGL.
  • Dataset dependencies are loaded only when the dataset object is created.
  • Fixed a bug in distributed training of conflicting ports. #3658
  • Improved the CompGCN example. @nxznm #3663
  • Improved the GIN example on reproducibility. @miziha-zp #3676
  • Improved SAGEConv by adding sanity check on aggregator type. @thatlittleboy #3691
  • Fixed a bug in launching multiple DGL programs in parallel. #3696
  • Fix the documentation of GraphSAGE normalization. @KoyamaSohei, #3711

Breaking Changes & Deprecations

  • DGL now requires PyTorch >= 1.9.0.
  • Building from source now requires compiler with c++14 support.
  • For multi-GPU training, the new strategy is to use shared memory to speed up inter-process communication. Users may sometimes experience a "not enough shared memory" error. If it happens, please increase the shared memory capacity.

0.7.2

2 years ago

0.7.2 Release Notes

This is a patch release targeting CUDA 11.3 and PyTorch 1.10. It contains (1) distributed training on heterogeneous graphs, and (2) bug fixes and code reorganization commits. The performance impact should be minimal.

To install with CUDA 11.3 support, run either

pip install dgl-cu113 -f https://data.dgl.ai/wheels/repo.html

or

conda install -c dglteam dgl-cuda11.3

Distributed Training on Heterogeneous Graphs

We have made the interface of distributed sampling on heterogeneous graphs consistent with the single-machine code. Please refer to https://github.com/dmlc/dgl/blob/0.7.x/examples/pytorch/rgcn/experimental/entity_classify_dist.py for the new code.

Other fixes

  • [Bugfix] Fix bugs of farthest_point_sampler (#3327, @sangyx)
  • [Bugfix] Fix sparse embeddings for PyTorch < 1.7 #3291 (#3333)
  • Fixes bug in hg.update_all causing crash #3312 (#3345, @sanchit-misra)
  • [Bugfix] Add PYTHONPATH in server launch. (#3352)
  • [CPU][Sampling][Performance] Improve sampling on the CPU. (#3274, @nv-dlasalle)
  • [Performance, CPU] Rewriting OpenMP pragmas into parallel_for (#3171, @tpatejko)
  • [Build] Fix OpenMP header inclusion for Mac builds (#3325)
  • [Performance] improve coo2csr space complexity when row is not sorted (#3326)
  • [BugFix] initialize data if null when converting from row sorted coo to csr (#3360)
  • fix broadcast tensor dim in dgl.broadcast_nodes (#3351, @jwyyy)
  • [BugFix] fix typo in fakenews dataset variable name (#3363, @kayzliu)
  • [Doc] Added md5sum info for OGB-LSC dataset (#3332, @msharmavikram)
  • [Feature] Graceful handling of exceptions thrown within OpenMP blocks (#3353)
  • Fix torch import in example (#3372, @jwyyy)
  • [Distributed] Allow user to pass-in extra env parameters when launching a distributed training task. (#3375)
  • [BugFix] extract gz into target dir (#3389)
  • [Model] Refine GraphSAINT (#3328 @ljh1064126026 )
  • [Bug] check dtype before convert to gk (#3414)
  • [BugFix] add count_nonzero() into SA_Client (#3417)
  • [Bug] Do not skip GraphConv even if no edge exists (#3416)
  • Fix edge ID exclusion when both g and g_sampling are specified in EdgeDataLoader (#3322)
  • [Bugfix] three bugs related to using DGL as a subdirectory(third_party) of another project. (#3379, @yuanzexi )
  • [PyTorch][Bugfix] Use uint8 instead of bool in pytorch to be compatible with nightly version (#3406, #3454, @nv-dlasalle)
  • [Fix] Use ==/!= to compare constant literals (str, bytes, int, float, tuple) (#3415, @cclauss)
  • [Bugfix][Pytorch] Fix model save and load bug of stgcn_wave (#3303, @HaoWei-TomTom )
  • [BugFix] Avoid Memory Leak Issue in PyTorch Backend (#3386, @chwan-rice )
  • [Fix] Split nccl sparse push into two groups (#3404, @nv-dlasalle )
  • [Doc] remove duplicate papers (#3393, @chwan-rice )
  • Fix GINConv backward #3437 (#3440)
  • [bugfix] Fix compilation with CUDA 11.5's CUB (#3468, @nv-dlasalle )
  • [Example][Performance] Enable faster validation for pytorch graphsage example (#3361, @nv-dlasalle )
  • [Doc] Evaluation Tutorial for Link Prediction (#3463)

0.7.1

2 years ago

0.7.1 Release Notes

0.7.1 is a minor release with multiple fixes and a few new models/features/optimizations included as follows.

Note: We noticed that 0.7.1 for Linux is unavailable on our anaconda repository. We are currently working on this issue. For now, please use pip installation instead.

New models

  • GCN-based spam review detection (#3145, @kayzliu)
  • CARE-GNN (#3187, @kayzliu)
  • GeniePath (#3199, @kayzliu)
  • EEG-GCNN (#3186, @JOHNW02)
  • EvolveGCN (#3190, @maqy1995)

New Features

  • Allows providing username in tools/launch.py (#3202, @erickim555)
  • Refactor and allows customized Python binary names in tools/launch.py (#3205, @erickim555)
  • Add support for distributed preprocessing for heterogeneous graphs (#3137, @ankit-garg)
  • Correctly pass all DGL client server environment variables for user-defined multi-command (#3245, @erickim555)
  • You can configure the DGL configuration directory with environment variable DGLDEFAULTDIR (#3277, @konstantino)

Optimizations

  • Improve usage of pinned memory in sparse optimizer (#3207, @nv-dlasalle)
  • Optimized counting of nonzero entries of DistTensor (#3203, @freeliuzc)
  • Remove activation cache if not required (#3258)
  • Edge excluding in EdgeDataLoader on GPU (#3226, @nv-dlasalle)

Fixes

  • Update numbers for HiLANDER model (#3175)
  • New training and test scripts for HiLANDER (#3180)
  • Fix potential starving in socket receiver (#3176, @JingchengYu94)
  • Fix typo in Tensorflow backend (#3182, @lululxvi)
  • Add WeightBasis documentation (#3189)
  • Default ntypes/etypes consistency between dgl.DGLGraph and dgl.graph (#3198)
  • Set sharing strategy for SEAL example (#3167, @KounianhuaDu)
  • Remove DGL_LOADALL in doc builds (#3150, @lululxvi)
  • Fix distributed training hang with multiple samplers (#3169)
  • Fix random_walk documentation inconsistency (#3188)
  • Fix curand_init() calls in rowwise sampling leading to not-so-random results (#3196, @nv-dlasalle)
  • Fix force_reload parameter of FraudDataset (#3210, @Orion-wyc)
  • Fix check for num_workers for using ScalarDataBatcher (#3219, @nv-dlasalle)
  • Tensoradapter linking issues (#3225, #3246, @nv-dlasalle)
  • Fix DiffPool loss not accounting for the loss of the first diffpooling layer (#3233, @yinpeiqi)
  • Fix CUDA 11.1 SPMM crashing with duplicate edges (#3265)
  • Fix DotGatConv attention bug when computing edge_softmax (#3272, @Flawless1202)
  • Fix incorrect reshape argument in RelGraphConv (#3256, @minchenGrab)
  • Documentation typos and fixes (#3214, #3221, #3244, #3231, #3261, #3264, #3275, #3285, @amorehead, @blokhinnv, @kalinin-sanja)

v0.7.0

2 years ago

This is a new major release with various system optimizations, new features and enhancements, new models and bugfixes.

Important: Change on PyPI Installation

DGL pip wheels are no longer shipped on PyPI. Use the following command to install DGL with pip:

  • pip install dgl -f https://data.dgl.ai/wheels/repo.html for CPU.
  • pip install dgl-cuXX -f https://data.dgl.ai/wheels/repo.html for CUDA.
  • pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html for CPU nightly builds.
  • pip install --pre dgl-cuXX -f https://data.dgl.ai/wheels-test/repo.html for CUDA nightly builds.

This does not impact conda installation.

GPU-based Neighbor Sampling

DGL now supports uniform neighbor sampling and MFG conversion on GPU, contributed by @nv-dlasalle from NVIDIA. Experiments with GraphSAGE on the ogbn-products graph show a >10x speedup (from 113s down to 11s per epoch) on a g3.16x instance. The related docs have been updated accordingly.
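
A hedged sketch of enabling GPU-based sampling through the 0.7 data loading API; the toy graph and node IDs are placeholders:

import torch
import dgl

device = torch.device('cuda')
g = dgl.rand_graph(10000, 100000).to(device)   # the graph must live on the GPU
train_nids = torch.arange(1000, device=device)

sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 15])
dataloader = dgl.dataloading.NodeDataLoader(
    g, train_nids, sampler,
    device=device,      # sample neighbors and build MFGs on the GPU
    batch_size=1024,
    shuffle=True,
    num_workers=0)      # GPU sampling requires num_workers=0

for input_nodes, output_nodes, blocks in dataloader:
    pass  # feed the blocks to the model as usual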

New Tutorials for Multi-GPU and Distributed Training

The release brings two new tutorials about multi-GPU training for node classification and graph classification, respectively. There is also a new tutorial about distributed training across multiple machines. All of them are available at https://docs.dgl.ai/.


Improved CPU Message Passing Kernel

The update includes a new CPU implementation of the core GSpMM kernel for GNN message passing, thanks to @sanchit-misra from Intel. The new kernel performs tiling on the sparse CSR matrix and leverages Intel’s LibXSMM for kernel generation, which gives up to a 4.4x speedup over the old kernel. Please read their paper https://arxiv.org/abs/2104.06700 for details.

More efficient NodeEmbedding for multi-GPU training and distributed training

DGL now utilizes NCCL to synchronize the gradients of sparse node embeddings (dgl.nn.NodeEmbedding) during training (credits to @nv-dlasalle from NVIDIA). The NCCL feature is available in both dgl.optim.SparseAdam and dgl.optim.SparseAdagrad. Experiments show a 20% speedup (from 47.2s down to 39.5s per epoch) on a g4dn.12xlarge (4 T4 GPU) instance for training RGCN on the ogbn-mag graph. The optimization is automatically turned on when NCCL backend support is detected.

The sparse optimizers for dgl.distributed.DistEmbedding now use a synchronized gradient update strategy. We add a new optimizer dgl.distributed.optim.SparseAdam. The dgl.distributed.SparseAdagrad has been moved to dgl.distributed.optim.SparseAdagrad.
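
A hedged single-process sketch of pairing dgl.nn.NodeEmbedding with dgl.optim.SparseAdam (the lookup, loss and sizes are illustrative):

import torch
import dgl

# Sparse node embeddings updated by DGL's sparse Adam optimizer; the
# NCCL-based gradient synchronization is enabled automatically when
# training on multiple GPUs. Sizes and the loss are placeholders.
num_nodes, dim = 10000, 128
emb = dgl.nn.NodeEmbedding(num_nodes, dim, name='node_emb')
optimizer = dgl.optim.SparseAdam([emb], lr=0.01)

nids = torch.arange(1024)
feat = emb(nids)              # look up a minibatch of embeddings
loss = feat.pow(2).mean()     # placeholder loss
loss.backward()
optimizer.step()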

Sparse-sparse Matrix Multiplication and Addition Support

We added two new APIs, dgl.adj_product_graph and dgl.adj_sum_graph, that perform sparse-sparse matrix multiplication and addition as graph operations, respectively. They run on both CPU and GPU with autograd support. An example usage of these functions is Graph Transformer Networks.
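
A minimal sketch, assuming the two functions take the operand graphs and an edge-weight name (here 'w'):

import torch
import dgl

# Two weighted graphs over the same node set; 'w' stores the edge weights.
A = dgl.rand_graph(100, 500)
B = dgl.rand_graph(100, 500)
A.edata['w'] = torch.rand(A.num_edges(), requires_grad=True)
B.edata['w'] = torch.rand(B.num_edges(), requires_grad=True)

C = dgl.adj_product_graph(A, B, 'w')   # adjacency matrix product as a graph
D = dgl.adj_sum_graph([A, B], 'w')     # adjacency matrix sum as a graph
C.edata['w'].sum().backward()          # gradients flow back to A.edata['w'] and B.edata['w']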

PyTorch Lightning Compatibility

DGL is now compatible with PyTorch Lightning for single-GPU training or training with DistributedDataParallel. See this example of training GraphSAGE with PyTorch Lightning.

We thank @justusschock for making DGL DataLoaders compatible with PyTorch Lightning (#2886).

New Models


A batch of 19 new model examples is added to DGL in 0.7, bringing the total number to 90+. Users can now use the search bar on https://www.dgl.ai/ to quickly locate the examples with tagged keywords. Below is the list of new models added.

  • Interaction Networks for Learning about Objects, Relations, and Physics (https://arxiv.org/abs/1612.00222.pdf) (#2794, @Ericcsr)
  • Multi-GPU RGAT for OGB-LSC Node Classification (#2835, @maqy1995)
  • Network Embedding with Completely-imbalanced Labels (https://ieeexplore.ieee.org/document/8979355) (#2813, @Fizyhsp)
  • Temporal Graph Networks improved (#2860, @Ericcsr)
  • Diffusion Convolutional Recurrent Neural Network (https://arxiv.org/abs/1707.01926) (#2858, @Ericcsr)
  • Gated Attention Networks for Learning on Large and Spatiotemporal Graphs (https://arxiv.org/abs/1803.07294) (#2858, @Ericcsr)
  • DeeperGCN (https://arxiv.org/abs/2006.07739) (#2831, @xnuohz)
  • Deep Graph Contrastive Representation Learning (https://arxiv.org/abs/2006.04131) (#2828, #3009, @hengruizhang98)
  • Graph Neural Networks Inspired by Classical Iterative Algorithms (https://arxiv.org/abs/2103.06064) (#2770, @FFTTYY)
  • GraphSAINT (#2792) (@lt610)
  • Label Propagation (#2852, @xnuohz)
  • Combining Label Propagation and Simple Models Out-performs Graph Neural Networks (https://arxiv.org/abs/2010.13993) (#2852, @xnuohz)
  • GCNII (#2874, @kyawlin)
  • Latent Dirichlet Allocation on GPU (#2883, @yifeim)
  • A Heterogeneous Information Network based Cross Domain Insurance Recommendation System for Cold Start Users (#2864, @KounianhuaDu)
  • Five heterogeneous graph models: HetGNN/GTN/HAN/NSHE/MAGNN (#2993, @Theheavens)
  • New OGB-arxiv and OGB-proteins results (#3018, @Espylapiza)
  • Heterogeneous Graph Attention Networks with minibatch sampling (#3005, @maqy1995)
  • Learning Hierarchical Graph Neural Networks for Image Clustering (https://arxiv.org/abs/2107.01319) (#3087, #3105)

New Datasets

New Functionalities

  • KD-Tree, Brute-force family, and NN-descent implementation of KNN (#2767, #2892, #2941) (@lygztq)
  • BLAS-based KNN implementation on GPU (#2868, @milesial)
  • A new API dgl.sample_neighbors_biased for biased neighbor sampling where each node has a tag, and each tag has its own (unnormalized) probability (#1665, #2987, @soodoshll). We also provide two helper functions sort_csr_by_tag and sort_csc_by_tag to sort the internal storage of a graph based on tags to allow such kind of neighbor sampling (#1664, @soodoshll).
  • Distributed sparse Adam node embedding optimizer (#2733)
  • Heterogeneous graph’s multi_update_all now supports user-defined cross-type reducers (#2891, @Secbone)
  • Add in_degrees and out_degrees supports to dgl.DistGraph (#2918)
  • A new API dgl.sampling.node2vec_random_walk for Node2vec random walks (#2992, @Smilexuhc)
  • dgl.node_subgraph, dgl.edge_subgraph, dgl.in_subgraph and dgl.out_subgraph all have a relabel_nodes argument to allow graph compaction (i.e. removing the nodes with no edges). (#2929)
  • Allow direct slicing of a batched graph without constructing a new data structure. (#2349, #2851, #2965)
  • Allow setting the distributed node embeddings with NodeEmbedding.all_set_embedding() (#3047)
  • Graphs can be directly created from CSR or CSC representations on either CPU or GPU (#3045). See the API doc of dgl.graph for more details.
  • A new dgl.reorder API to permute a graph according to RCMK, METIS or custom strategy (#3063)
  • dgl.nn.GraphConv now has a left normalization which divides the outgoing messages by out-degrees, equivalent to random-walk normalization (#3114)
  • Add a new exclude='self' option to EdgeDataLoader to exclude only the edges sampled in the current minibatch during neighbor sampling, for the case where reverse edges are not available (#3122)

Performance Optimizations

  • Check if a COO is sorted to avoid sync during forward/backward and parallelize sorted COO/CSR conversion. (#2645, @nv-dlasalle)
  • Faster uniform sampling with replacement (#2953)
  • Eliminating ctor & dtor & IsNullArray overheads in random walks (#2990, @AjayBrahmakshatriya)
  • GatedGCNConv shortcut with one edge type (#2994)
  • Hierarchical Partitioning in distributed training with 25% speedup (#3000, @soodoshll)
  • Save memory usage in node_split and edge_split during partitioning (#3132, @JingchengYu94)

Other Enhancements

  • Graph partitioning now returns ID mapping from old nodes/edges to new ones (#2857)
  • Better error message when idx_list is out of bounds (#2848)
  • Kill training jobs on remote machines in distributed training when receiving KeyboardInterrupt (#2881)
  • Provide a dgl.multiprocessing namespace for multiprocess training with fork and OpenMP (#2905)
  • GAT supports multidimensional input features (#2912)
  • Users can now specify graph format for distributed training (#2948)
  • CI now runs on Kubernetes (#2957)
  • to_heterogeneous(to_homogeneous(hg)) now returns the same hg. (#2958)
  • remove_nodes and remove_edges now preserves batch information. (#3119)

Bug Fixes

  • Multiprocessing sampling in distributed training hangs in Python 3.8 (#2315, #2826)
  • Use correct NIC for distributed training (#2798, @Tonny-Gu)
  • Fix potential TypeError in HGT example (#2830, @zhangtianle)
  • Distributed training initialization fails with graphs without node/edge data (#2366, #2838)
  • DGL Sparse Optimizer will crash when some DGL NodeEmbedding is not involved in the forward pass (#2856, #2859)
  • Fix GATConv shape issues with Residual Connections (#2867, #2921, #2922, #2947, #2962, @xieweiyi, @jxgu1016)
  • Moving a graph to GPU will change the default CUDA device (#2895, #2897)
  • Remove __len__ method to stop polluting PyCharm outputs (#2902)
  • Inconsistency in the typing of node types and edge types returned by load_partition (#2742, @chwan-rice)
  • NodeDataLoader and EdgeDataLoader now supports DistributedDataParallel with proper shuffling and batching (#2539, #2911)
  • Nonuniform sampling with replacement may dereference null pointer (#2942, #2943, @nv-dlasalle)
  • Strange behavior of bipartite_from_networkx() (#2808, #2917)
  • Make GCMC example compatible with torchtext 0.9+ (#2985, @alexpod1000)
  • dgl.to_homogeneous doesn't work correctly on graphs with 0 nodes of a given type (#2870, #3011)
  • TU regression datasets throw errors (#2952, #3010)
  • RGCN generates nan in PyTorch 1.8 but not in PyTorch 1.7.x (#2760, #3013, @nv-dlasalle)
  • Deal with situation where num_layers equals 1 for GraphSAGE (#3066, @Wang-Yu-Qing)
  • Lengthen the timeout for distributed node embedding (#2966, #2967 @sojiadeshina)
  • Misc fixes in code and documentation (#2844, #2869, #2840, #2879, #2863, #2822, #2907, #2928, #2935, #2960, #2938, #2968, #2961, #2983, #2981, #3017, #3051, #3040, #3064, #3065, #3133, #3139) (@Theheavens, @ab-10, @yunshiuan, @moritzblum, @kayzliu, @universvm, @europeanplaice, etc.)

Deprecations

  • preserve_nodes argument in dgl.edge_subgraph is deprecated and renamed to relabel_nodes.