Generating paper titles (and more!) with GPT trained on data scraped from arXiv.
Forecasting the progress of research is an elusive but important goal. Here, we take a toy step towards it by generating new scientific paper titles from past titles on arXiv:
To generate author-specific titles, we take the five most recent titles from each author with at least 3 arXiv AI papers (cs.AI, cs.LG, stat.ML). We then format the papers using the following template and query GPT-3 for a new title:
Here is a list of related machine-learning papers:
> [title 1]
> [title 2]
...
> [title 5]
> ____
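As a rough sketch, the prompt construction above can be written as a small helper (the function name `build_prompt` and the exact whitespace are illustrative, not taken from the released code; the actual GPT-3 API call is omitted):

```python
def build_prompt(titles):
    """Format an author's recent titles into the fill-in-the-blank prompt."""
    lines = ["Here is a list of related machine-learning papers:", ""]
    lines += [f"> {t}" for t in titles]
    lines.append("> ")  # the model is asked to complete this final line
    return "\n".join(lines)

prompt = build_prompt([
    "Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods",
    "Fast Interpretable Greedy-Tree Sums (FIGS)",
])
print(prompt)
```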
See the results in the demo above or the full results in this json file.
Here's a concrete example -- when prompting with these 5 recent titles:
> Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods
> Fast Interpretable Greedy-Tree Sums (FIGS)
> Adaptive wavelet distillation from neural networks through interpretations
> Emb-GAM: an Interpretable and Efficient Predictor using Pre-trained Language Models
> Explaining Patterns in Data with Language Models via Interpretable Autoprompting
> ____
We get these 5 (independent) random generations for the blank:
1. Towards Interpretable Natural Language Processing: A Survey
2. A Unified Framework for Interpretable Machine Learning
3. Compositional Attention Networks for Machine Reasoning
4. Achieving Open Vocabulary Neural Machine Translation
5. A Deep Understanding of Neural Networks through Deep Visualization
The results are often interesting, but the model falls into failure modes where it generates titles irrelevant to the author, often drifting toward popular topics such as deep learning, multi-task learning, and reinforcement learning.
Note: the model used was GPT-3 text-davinci-002 from the OpenAI API on Oct 14, 2022. It was likely not up to date with the most recent advances and could be improved by finetuning on more recent titles.
During preprocessing, paper titles with irregular formatting were removed and distinct authors with exactly the same name were not differentiated.
To improve the model's ability to generate cogent titles, we finetune it on a large corpus of titles. We start from the gpt-neo-2.7B checkpoint (see our training script for hyperparameters). We finetune on all paper titles posted to arXiv in the categories cs.AI, cs.LG, and stat.ML up to Oct 13, 2022. We hold out all papers after Apr 1, 2022 (to test the ability to forecast new papers) plus an additional random 5% of titles, and we exclude titles shorter than 6 words or longer than 20 words. This leaves 98,388 papers for finetuning:
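The filtering rules above can be sketched as a single predicate (this is illustrative; the seeding of the 5% random holdout and the exact date handling in the real preprocessing script may differ):

```python
import random
from datetime import date

CUTOFF = date(2022, 4, 1)  # papers on/after this date are held out as the test set

def keep_for_training(title, posted, rng=None):
    """Apply the filters described above: 6-20 words, pre-cutoff date,
    minus an additional random 5% holdout (seeded here for reproducibility)."""
    rng = rng or random.Random(0)
    if not (6 <= len(title.split()) <= 20):  # drop very short / very long titles
        return False
    if posted >= CUTOFF:                     # exclude post-Apr-2022 papers
        return False
    return rng.random() >= 0.05              # drop an additional random 5%
```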
**Samples.** After finetuning, here are some example titles generated by the model, conditioned on different years (see a large dump in this folder):
2022
2023 (these samples tend to just be similar to 2021/2022, where the majority of the training data lies)
2010 (seems to properly generate older-style titles)
**Inference example.** We've released our finetuned model if you want to play with it:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model = AutoModelForCausalLM.from_pretrained("csinva/gpt-neo-2.7B-titles")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
pipe('2022\n\n')
------------------
> Automating the Visualization of Convolutional Neural Networks
During finetuning, each paper title was given in the format `<year>\n\n <title>\n` (e.g. `2020\n\n Interpretations are useful: penalizing explanations to align neural networks with prior knowledge\n`). The same format should be used for inference. These samples are considerably better than the samples we made with GPT-2 back in 2019 (the good old days).
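The serialization step can be written as a tiny helper following the format just described (the function name `format_example` is illustrative):

```python
def format_example(year, title):
    """Serialize one paper into the finetuning format: <year>\n\n <title>\n"""
    return f"{year}\n\n {title}\n"

# At inference time, supply just the year prefix and let the model complete it:
inference_prompt = "2022\n\n"
```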
We now evaluate whether the generated titles for 2022 match real paper titles from the test set (Apr 1 - Oct 13, 2022). Note that the model has never seen any papers from this period, and its pre-training corpus also contained only text from before 2022. We generate 5,000 titles and find the closest match for each of them in the test set (which contains ~15,000 titles). The resulting BLEU scores are shown in this figure:
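The matching step can be sketched as below. Note this uses a crude clipped-unigram-precision score as a stand-in for BLEU (the actual evaluation presumably used a proper BLEU implementation, e.g. NLTK's):

```python
from collections import Counter

def unigram_overlap(generated, reference):
    """Clipped unigram precision -- a simplified stand-in for BLEU."""
    gen = generated.lower().split()
    ref = Counter(reference.lower().split())
    clipped = sum(min(c, ref[w]) for w, c in Counter(gen).items())
    return clipped / len(gen) if gen else 0.0

def closest_match(generated, test_titles):
    """Return the test-set title most similar to a generated title."""
    return max(test_titles, key=lambda t: unigram_overlap(generated, t))
```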
Here's a table of the first 5 matches. See if you can guess which are the real titles and which are generated (answers below):
| A | B |
|---|---|
| Understanding the effect of data augmentation in generative adversarial networks | Understanding the effect of data augmentation in self-supervised anomaly detection |
| Adversarial attacks on graph neural networks | Sparse vicious attacks on graph neural networks |
| Differentiable reinforcement learning for continuous control | Normality-guided distributional reinforcement learning for continuous control |
| Multilevel representation learning for time series forecasting | Out-of-distribution representation learning for time series classification |
| Unsupervised feature learning for medical image segmentation | Distributed contrastive learning for medical image segmentation |
**Answers.** sǝlʇᴉʇ lɐǝɹ ǝɥʇ suᴉɐʇuoɔ ꓭ uɯnloƆ

The generated titles often seem to be overly general, missing the detailed specificity of the real titles (e.g. "Sparse vicious attacks" rather than "Adversarial attacks").
**Some possible followups.** This post was very limited, but there are a bunch of directions to explore to see how well language models can really forecast scientific titles (and scientific progress in general). Here are some straightforward followups: