Awesome Scientific Language Models

A Curated List of Language Models in Scientific Domains

A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, biology, medicine, materials science, and geoscience), covering different model sizes (from <100M to 70B parameters) and modalities (e.g., language, vision, molecule, protein, graph, and table). The repository will be continuously updated.

NOTE 1: To avoid ambiguity, when we talk about the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). Other numbers will be written explicitly.

NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint (e.g., arXiv or bioRxiv) version, its publication date follows the preprint server; otherwise, it follows the conference proceedings or journal.

NOTE 3: We appreciate contributions. If you have any suggested papers, feel free to reach out to [email protected] or submit a pull request. For format consistency, we will include a paper only after (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.
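As a quick-start illustration (not part of the curated list itself), the snippet below sketches how a listed checkpoint can typically be loaded with the Hugging Face `transformers` library and how its parameter count can be checked against the "Base"/"Large" convention in NOTE 1. The model ID `allenai/specter` is used only as an assumed example; follow each entry's [Model] link for the authoritative repository name.

```python
# Minimal sketch: load a listed checkpoint from the Hugging Face Hub and
# report its parameter count. The model ID below is an assumption used for
# illustration; substitute the ID from any entry's [Model] link.
from transformers import AutoModel, AutoTokenizer

model_id = "allenai/specter"  # assumed Hub ID for SPECTER (see its [Model] link)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Relating to NOTE 1: check whether a checkpoint is "Base" (~110M) or
# "Large" (~340M) by counting its parameters.
num_params = sum(p.numel() for p in model.parameters())
print(f"{model_id}: {num_params / 1e6:.0f}M parameters")

# Encode a short piece of text; SPECTER-style document encoders are
# typically applied to a paper's title and abstract.
inputs = tokenizer("An example paper title and abstract.",
                   return_tensors="pt", truncation=True)
embedding = model(**inputs).last_hidden_state[:, 0]  # [CLS] representation
```

The same pattern applies to most encoder-style checkpoints in this list; decoder-style models (e.g., the 7B+ instruction-tuned LLMs) are usually loaded with `AutoModelForCausalLM` instead.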

Contents

  • General
  • Mathematics
  • Physics
  • Chemistry and Materials Science
  • Biology and Medicine
  • Geography, Geology, and Environmental Science

General

Language

Graph-Enhanced

  • (SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers ACL 2020
    [Paper] [GitHub] [Model (Base)]

  • (OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services KDD 2022
    [Paper] [GitHub]

  • (ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity NAACL 2022
    [Paper] [GitHub] [Model (Base)]

  • (SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings EMNLP 2022
    [Paper] [GitHub] [Model (Base)]

  • (SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations EMNLP 2023
    [Paper] [GitHub] [Model (113M)]

  • (SciPatton) Patton: Language Model Pretraining on Text-Rich Networks ACL 2023
    [Paper] [GitHub]

  • (SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding EMNLP 2023 Findings
    [Paper] [GitHub] [Model (138M)]

Mathematics

Language

  • (GenBERT) Injecting Numerical Reasoning Skills into Language Models ACL 2020
    [Paper] [GitHub]

  • (GPT-f) Generative Language Modeling for Automated Theorem Proving arXiv 2020
    [Paper]

  • (EPT) Point to the Expression: Solving Algebraic Word Problems using the Expression-Pointer Transformer Model NeurIPS 2020
    [Paper] [GitHub]

  • (MathBERT) MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education arXiv 2021
    [Paper] [GitHub] [Model (Base)]

  • (MWP-BERT) MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving NAACL 2022 Findings
    [Paper] [GitHub] [Model (Base)]

  • (BERT-TD) Seeking Patterns, Not just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems ACL 2022 Findings
    [Paper] [GitHub]

  • (GSM8K-GPT) Training Verifiers to Solve Math Word Problems arXiv 2021
    [Paper] [GitHub]

  • (DeductReasoner) Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction ACL 2022
    [Paper] [GitHub] [Model (125M)]

  • (NaturalProver) NaturalProver: Grounded Mathematical Proof Generation with Language Models NeurIPS 2022
    [Paper] [GitHub]

  • (Minerva) Solving Quantitative Reasoning Problems with Language Models NeurIPS 2022
    [Paper]

  • (Bhaskara) Lila: A Unified Benchmark for Mathematical Reasoning EMNLP 2022
    [Paper] [GitHub] [Model (2.7B)]

  • (WizardMath) WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]

  • (MAmmoTH) MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning ICLR 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]

  • (MetaMath) MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models ICLR 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]

  • (ToRA) ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving ICLR 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]

  • (Llemma) Llemma: An Open Language Model For Mathematics ICLR 2024
    [Paper] [GitHub] [Model (7B)] [Model (34B)]

  • (OVM) Outcome-supervised Verifiers for Planning in Mathematical Reasoning arXiv 2023
    [Paper] [GitHub] [Model (7B)]

Vision-Language

  • (Inter-GPS) Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning ACL 2021
    [Paper] [GitHub]

  • (Geoformer) UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression EMNLP 2022
    [Paper] [GitHub]

  • (SCA-GPS) A Symbolic Character-Aware Model for Solving Geometry Problems ACM MM 2023
    [Paper] [GitHub]

  • (UniMath-Flan-T5) UniMath: A Foundational and Multimodal Mathematical Reasoner EMNLP 2023
    [Paper] [GitHub]

  • (G-LLaVA) G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model arXiv 2023
    [Paper] [GitHub]

Other Modalities (Table)

  • (TAPAS) TAPAS: Weakly Supervised Table Parsing via Pre-training ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (TaBERT) TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (GraPPa) GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing ICLR 2021
    [Paper] [GitHub] [Model (355M)]

  • (TUTA) TUTA: Tree-based Transformers for Generally Structured Table Pre-training KDD 2021
    [Paper] [GitHub]

  • (RCI) Capturing Row and Column Semantics in Transformer Based Question Answering over Tables NAACL 2021
    [Paper] [GitHub] [Model (12M)]

  • (TABBIE) TABBIE: Pretrained Representations of Tabular Data NAACL 2021
    [Paper] [GitHub]

  • (TAPEX) TAPEX: Table Pre-training via Learning a Neural SQL Executor ICLR 2022
    [Paper] [GitHub] [Model (140M)] [Model (406M)]

  • (FORTAP) FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining ACL 2022
    [Paper] [GitHub]

  • (OmniTab) OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering NAACL 2022
    [Paper] [GitHub] [Model (406M)]

  • (ReasTAP) ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples EMNLP 2022
    [Paper] [GitHub] [Model (406M)]

  • (TableLlama) TableLlama: Towards Open Large Generalist Models for Tables NAACL 2024
    [Paper] [GitHub] [Model (7B)]

Physics

Language

  • (astroBERT) Building astroBERT, a Language Model for Astronomy & Astrophysics arXiv 2021
    [Paper] [Model (Base)]

  • (AstroLLaMA) AstroLLaMA: Towards Specialized Foundation Models in Astronomy AACL 2023 Workshop
    [Paper] [Model (7B)]

  • (AstroLLaMA-Chat) AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets Research Notes of the AAS 2024
    [Paper] [Model (7B)]

Chemistry and Materials Science

Language

  • (ChemBERT) Automated Chemical Reaction Extraction from Scientific Literature Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatSciBERT) MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction npj Computational Materials 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatBERT) Quantifying the Advantage of Domain-Specific Pre-training on Named Entity Recognition Tasks in Materials Science Patterns 2022
    [Paper] [GitHub]

  • (BatteryBERT) BatteryBERT: A Pretrained Language Model for Battery Database Enhancement Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MaterialsBERT) A General-Purpose Material Property Data Extraction Pipeline from Large Polymer Corpora using Natural Language Processing npj Computational Materials 2023
    [Paper] [Model (Base)]

  • (LLM-Prop) LLM-Prop: Predicting Physical and Electronic Properties of Crystalline Solids from Their Text Descriptions arXiv 2023
    [Paper] [GitHub]

  • (ChemDFM) ChemDFM: Dialogue Foundation Model for Chemistry arXiv 2024
    [Paper] [GitHub] [Model (13B)]

  • (CrystalLLM) Fine-Tuned Language Models Generate Stable Inorganic Materials as Text ICLR 2024
    [Paper] [GitHub]

  • (ChemLLM) ChemLLM: A Chemical Large Language Model arXiv 2024
    [Paper] [Model (7B)]

  • (LlaSMol) LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset arXiv 2024
    [Paper] [GitHub] [Model (7B)]

Vision-Language

  • (MolScribe) MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation Journal of Chemical Information and Modeling 2023
    [Paper] [GitHub] [Model (88M)]

  • (GIT-Mol) GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text Computers in Biology and Medicine 2024
    [Paper] [GitHub]

Graph-Enhanced / Other Modalities (Molecule)

  • (SMILES-BERT) SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction ACM BCB 2019
    [Paper] [GitHub]

  • (MAT) Molecule Attention Transformer arXiv 2020
    [Paper] [GitHub] [Model (125M)]

  • (ChemBERTa) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction arXiv 2020
    [Paper] [GitHub] [Model (125M)]

  • (MolBERT) Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (RXNFP) Mapping the Space of Chemical Reactions using Attention-Based Neural Networks Nature Machine Intelligence 2021
    [Paper] [GitHub] [Model (Base)]

  • (RXNMapper) Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions Science Advances 2021
    [Paper] [GitHub]

  • (MoLFormer) Large-Scale Chemical Language Representations Capture Molecular Structure and Properties Nature Machine Intelligence 2022
    [Paper] [GitHub] [Model (47M)]

  • (Chemformer) Chemformer: A Pre-trained Transformer for Computational Chemistry Machine Learning: Science and Technology 2022
    [Paper] [GitHub] [Model (45M)] [Model (230M)]

  • (MolGPT) MolGPT: Molecular Generation using a Transformer-Decoder Model Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (Text2Mol) Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries EMNLP 2021
    [Paper] [GitHub]

  • (KV-PLM) A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals Nature Communications 2022
    [Paper] [GitHub] [Model (Base)]

  • (T5Chem) Unified Deep Learning Model for Multitask Reaction Predictions with Explanation Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (MolT5) Translation between Molecules and Natural Language EMNLP 2022
    [Paper] [GitHub] [Model (77M)] [Model (250M)] [Model (800M)]

  • (ChemGPT) Neural Scaling of Deep Chemical Models Nature Machine Intelligence 2023
    [Paper] [Model (4.7M)] [Model (19M)] [Model (1.2B)]

  • (MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language arXiv 2022
    [Paper] [GitHub]

  • (MFBERT) Large-Scale Distributed Training of Transformers for Chemical Fingerprinting Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (SPMM) Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model Nature Communications 2024
    [Paper] [GitHub]

  • (MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing Nature Machine Intelligence 2023
    [Paper] [GitHub]

  • (MolGen) Domain-Agnostic Molecular Generation with Self-feedback ICLR 2024
    [Paper] [GitHub] [Model (7B)]

  • (Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling ICML 2023
    [Paper] [GitHub] [Model (60M)] [Model (220M)]

  • (CLAMP) Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language ICML 2023
    [Paper] [GitHub]

  • (GIMLET) GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning NeurIPS 2023
    [Paper] [GitHub] [Model (60M)]

  • (MolFM) MolFM: A Multimodal Molecular Foundation Model arXiv 2023
    [Paper] [GitHub]

  • (CatBERTa) Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models ACS Catalysis 2023
    [Paper] [GitHub]

  • (GPT-MolBERTa) GPT-MolBERTa: GPT Molecular Features Language Model for Molecular Property Prediction arXiv 2023
    [Paper] [GitHub] [Model (82M, BERT)] [Model (82M, RoBERTa)]

  • (MolCA) MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter EMNLP 2023
    [Paper] [GitHub]

  • (ActFound) A Foundation Model for Bioactivity Prediction using Pairwise Meta-learning bioRxiv 2023
    [Paper] [GitHub]

  • (InstructMol) InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery arXiv 2023
    [Paper] [GitHub]

  • (PolyNC) PolyNC: A Natural and Chemical Language Model for the Prediction of Unified Polymer Properties Chemical Science 2024
    [Paper] [GitHub] [Model (220M)]

  • (3D-MoLM) Towards 3D Molecule-Text Interpretation in Language Models ICLR 2024
    [Paper] [GitHub] [Model (7B)]

Biology and Medicine

Acknowledgment: We referred to Wang et al.'s survey paper Pre-trained Language Models in Biomedical Domain: A Systematic Survey when writing some parts of this section.

Language

  • (BioBERT) BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining Bioinformatics 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (BioELMo) Probing Biomedical Embeddings from Language Models NAACL 2019 Workshop
    [Paper] [GitHub] [Model (93M)]

  • (ClinicalBERT, Alsentzer et al.) Publicly Available Clinical BERT Embeddings NAACL 2019 Workshop
    [Paper] [GitHub] [Model (Base)]

  • (ClinicalBERT, Huang et al.) ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission arXiv 2019
    [Paper] [GitHub] [Model (Base)]

  • (BlueBERT, f.k.a. NCBI-BERT) Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets ACL 2019 Workshop
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (BEHRT) BEHRT: Transformer for Electronic Health Records Scientific Reports 2020
    [Paper] [GitHub]

  • (EhrBERT) Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)–Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study JMIR Medical Informatics 2019
    [Paper] [GitHub]

  • (Clinical XLNet) Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation EMNLP 2020 Workshop
    [Paper] [GitHub]

  • (ouBioBERT) Pre-training Technique to Localize Medical BERT and Enhance Biomedical BERT arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (COVID-Twitter-BERT) COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter Frontiers in Artificial Intelligence 2023
    [Paper] [GitHub] [Model (Large)]

  • (Med-BERT) Med-BERT: Pretrained Contextualized Embeddings on Large-Scale Structured Electronic Health Records for Disease Prediction npj Digital Medicine 2021
    [Paper] [GitHub]

  • (Bio-ELECTRA) On the Effectiveness of Small, Discriminatively Pre-trained Language Representation Models for Biomedical Text Mining EMNLP 2020 Workshop
    [Paper] [GitHub] [Model (Base)]

  • (BiomedBERT, f.k.a. PubMedBERT) Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing ACM Transactions on Computing for Healthcare 2021
    [Paper] [Model (Base)] [Model (Large)]

  • (MCBERT) Conceptualized Representation Learning for Chinese Biomedical Text Mining arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (BRLTM) Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression JBHI 2021
    [Paper] [GitHub]

  • (BioRedditBERT) COMETA: A Corpus for Medical Entity Linking in the Social Media EMNLP 2020
    [Paper] [GitHub] [Model (Base)]

  • (BioMegatron) BioMegatron: Larger Biomedical Domain Language Model EMNLP 2020
    [Paper] [GitHub] [Model (345M)]

  • (SapBERT) Self-Alignment Pretraining for Biomedical Entity Representations NAACL 2021
    [Paper] [GitHub] [Model (Base)]

  • (ClinicalTransformer) Clinical Concept Extraction using Transformers JAMIA 2020
    [Paper] [GitHub] [Model (Base, BERT)] [Model (125M, RoBERTa)] [Model (12M, ALBERT)] [Model (Base, ELECTRA)] [Model (117M, XLNet)] [Model (149M, Longformer)] [Model (139M, DeBERTa)]

  • (BioRoBERTa) Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art EMNLP 2020 Workshop
    [Paper] [GitHub] [Model (125M)] [Model (355M)]

  • (RAD-BERT) Highly Accurate Classification of Chest Radiographic Reports using a Deep Learning Natural Language Model Pre-trained on 3.8 Million Text Reports Bioinformatics 2020
    [Paper] [GitHub]

  • (BioMedBERT) BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR COLING 2020
    [Paper] [GitHub]

  • (LBERT) LBERT: Lexically Aware Transformer-Based Bidirectional Encoder Representation Model for Learning Universal Bio-Entity Relations Bioinformatics 2021
    [Paper] [GitHub]

  • (ELECTRAMed) ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP arXiv 2021
    [Paper] [GitHub] [Model (Base)]

  • (SciFive) SciFive: A Text-to-Text Transformer Model for Biomedical Literature arXiv 2021
    [Paper] [GitHub] [Model (220M)] [Model (770M)]

  • (BioALBERT) Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT BMC Bioinformatics 2022
    [Paper] [GitHub] [Model (12M)] [Model (18M)]

  • (Clinical-Longformer) Clinical-Longformer and Clinical-BigBird: Transformers for Long Clinical Sequences arXiv 2021
    [Paper] [GitHub] [Model (149M, Longformer)] [Model (127M, BigBird)]

  • (BioBART) BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model ACL 2022 Workshop
    [Paper] [GitHub] [Model (140M)] [Model (406M)]

  • (BioGPT) BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining Briefings in Bioinformatics 2022
    [Paper] [GitHub] [Model (355M)] [Model (1.5B)]

  • (Med-PaLM) Large Language Models Encode Clinical Knowledge Nature 2023
    [Paper]

  • (ChatDoctor) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) using Medical Domain Knowledge Cureus 2023
    [Paper] [GitHub]

  • (DoctorGLM) DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task arXiv 2023
    [Paper] [GitHub]

  • (BenTsao, f.k.a. HuaTuo) HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge arXiv 2023
    [Paper] [GitHub]

  • (MedAlpaca) MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (PMC-LLaMA) PMC-LLaMA: Towards Building Open-source Language Models for Medicine arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (Med-PaLM 2) Towards Expert-Level Medical Question Answering with Large Language Models arXiv 2023
    [Paper]

  • (HuatuoGPT) HuatuoGPT, towards Taming Language Model to Be a Doctor EMNLP 2023 Findings
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (MedCPT) MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval Bioinformatics 2023
    [Paper] [GitHub] [Model (Base)]

  • (DISC-MedLLM) DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation arXiv 2023
    [Paper] [GitHub] [Model (13B)]

  • (DRG-LLaMA) DRG-LLaMA: Tuning LLaMA Model to Predict Diagnosis-related Group for Hospitalized Patients npj Digital Medicine 2024
    [Paper] [GitHub]

  • (HuatuoGPT-II) HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (34B)]

  • (PLLaMa) PLLaMa: An Open-source Large Language Model for Plant Science arXiv 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (BioMistral) BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains arXiv 2024
    [Paper] [Model (7B)]

  • (BioMedLM, f.k.a. PubMedGPT) BioMedLM: a Domain-Specific Large Language Model for Biomedical Text arXiv 2024
    [Paper] [GitHub] [Model (2.7B)]

Graph-Enhanced

  • (G-BERT) Pre-training of Graph Augmented Transformers for Medication Recommendation IJCAI 2019
    [Paper] [GitHub]

  • (CODER) CODER: Knowledge Infused Cross-Lingual Medical Term Embedding for Term Normalization JBI 2022
    [Paper] [GitHub] [Model (Base)]

  • (KeBioLM) Improving Biomedical Pretrained Language Models with Knowledge NAACL 2021 Workshop
    [Paper] [GitHub] [Model (155M)]

  • (MoP) Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT EMNLP 2021
    [Paper] [GitHub]

  • (BioLinkBERT) LinkBERT: Pretraining Language Models with Document Links ACL 2022
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining NeurIPS 2022
    [Paper] [GitHub] [Model (360M)]

Vision-Language

  • (ConVIRT) Contrastive Learning of Medical Visual Representations from Paired Images and Text MLHC 2022
    [Paper] [GitHub]

  • (MedViLL) Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training JBHI 2022
    [Paper] [GitHub]

  • (GLoRIA) GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition ICCV 2021
    [Paper] [GitHub]

  • (LoVT) Joint Learning of Localized Representations from Medical Images and Reports ECCV 2022
    [Paper] [GitHub]

  • (CvT2DistilGPT2) Improving Chest X-Ray Report Generation by Leveraging Warm Starting Artificial Intelligence in Medicine 2023
    [Paper] [GitHub]

  • (BioViL) Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing ECCV 2022
    [Paper] [GitHub]

  • (LViT) LViT: Language meets Vision Transformer in Medical Image Segmentation TMI 2022
    [Paper] [GitHub]

  • (M3AE) Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training MICCAI 2022
    [Paper] [GitHub]

  • (ARL) Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge ACM MM 2022
    [Paper] [GitHub]

  • (CheXzero) Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning Nature Biomedical Engineering 2022
    [Paper] [GitHub]

  • (MGCA) Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning NeurIPS 2022
    [Paper] [GitHub]

  • (MedCLIP) MedCLIP: Contrastive Learning from Unpaired Medical Images and Text EMNLP 2022
    [Paper] [GitHub]

  • (BioViL-T) Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing CVPR 2023
    [Paper] [GitHub] [Model]

  • (BiomedCLIP) BiomedCLIP: A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs arXiv 2023
    [Paper] [Model]

  • (RGRG) Interactive and Explainable Region-guided Radiology Report Generation CVPR 2023
    [Paper] [GitHub]

  • (LLaVA-Med) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day NeurIPS 2023
    [Paper] [GitHub]

  • (Med-PaLM M) Towards Generalist Biomedical AI NEJM AI 2024
    [Paper] [GitHub]

  • (BioCLIP) BioCLIP: A Vision Foundation Model for the Tree of Life arXiv 2023
    [Paper] [GitHub] [Model]

Other Modalities (Protein, DNA, RNA)

  • (ProtTrans) ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing TPAMI 2021
    [Paper] [GitHub] [Model (Base, BERT)] [Model (12M, ALBERT)] [Model (117M, XLNet)] [Model (3B, T5)] [Model (11B, T5)]

  • (DNABERT) DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome Bioinformatics 2021
    [Paper] [GitHub] [Model (Base)]

  • (ESM-1b) Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences PNAS 2021
    [Paper] [GitHub] [Model (650M)]

  • (ESM-1v) Language Models Enable Zero-Shot Prediction of the Effects of Mutations on Protein Function NeurIPS 2021
    [Paper] [GitHub] [Model (650M)]

  • (RNABERT) Informative RNA-base Embedding for Functional RNA Structural Alignment and Clustering by Deep Representation Learning NAR Genomics and Bioinformatics 2022
    [Paper] [GitHub]

  • (ProteinBERT) ProteinBERT: A Universal Deep-Learning Model of Protein Sequence and Function Bioinformatics 2022
    [Paper] [GitHub] [Model (16M)]

  • (RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions arXiv 2022
    [Paper] [GitHub]

  • (ESM-IF1) Learning Inverse Folding from Millions of Predicted Structures ICML 2022
    [Paper] [GitHub] [Model (124M)]

  • (ProtGPT2) ProtGPT2 is a Deep Unsupervised Language Model for Protein Design Nature Communications 2022
    [Paper] [Model (738M)]

  • (ProGen) Large Language Models Generate Functional Protein Sequences across Diverse Families Nature Biotechnology 2023
    [Paper]

  • (ProGen2) ProGen2: Exploring the Boundaries of Protein Language Models Cell Systems 2023
    [Paper] [GitHub] [Model (151M)] [Model (764M)] [Model (2.7B)] [Model (6.4B)]

  • (ESM-2) Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model Science 2023
    [Paper] [GitHub] [Model (8M)] [Model (35M)] [Model (150M)] [Model (650M)] [Model (3B)] [Model (15B)]

  • (Ankh) Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling arXiv 2023
    [Paper] [GitHub] [Model (450M)] [Model (1.1B)]

  • (ProtST) ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts ICML 2023
    [Paper] [GitHub]

  • (RNA-MSM) Multiple Sequence-Alignment-Based RNA Language Model and its Application to Structural Inference Nucleic Acids Research 2024
    [Paper] [GitHub]

  • (Geneformer) Transfer Learning Enables Predictions in Network Biology Nature 2023
    [Paper] [Model (10M)] [Model (40M)]

  • (CellLM) Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning arXiv 2023
    [Paper] [GitHub]

  • (DNABERT-2) DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome ICLR 2024
    [Paper] [GitHub] [Model (Base)]

  • (xTrimoPGLM) xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein bioRxiv 2023
    [Paper]

  • (Prot2Text) Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers AAAI 2024
    [Paper] [GitHub] [Model (256M)] [Model (283M)] [Model (398M)] [Model (898M)]

  • (BioMedGPT) BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (10B)]

  • (CellPLM) CellPLM: Pre-training of Cell Language Model Beyond Single Cells ICLR 2024
    [Paper] [GitHub] [Model (82M)]

Geography, Geology, and Environmental Science

Language

  • (ClimateBERT) ClimateBERT: A Pretrained Language Model for Climate-Related Text arXiv 2021
    [Paper] [GitHub] [Model (82M)]

  • (SpaBERT) SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation EMNLP 2022 Findings
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (K2) K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization WSDM 2024
    [Paper] [GitHub] [Model (7B)]

  • (OceanGPT) OceanGPT: A Large Language Model for Ocean Science Tasks arXiv 2023
    [Paper] [GitHub] [Model (7B)]

  • (ClimateBERT-NetZero) ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction Targets EMNLP 2023
    [Paper] [Model (82M)]

  • (GeoGalactica) GeoGalactica: A Scientific Large Language Model in Geoscience arXiv 2024
    [Paper] [GitHub] [Model (30B)]

Graph-Enhanced

  • (ERNIE-GeoL) ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps KDD 2022
    [Paper]

  • (PK-Chat) PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue Model arXiv 2023
    [Paper] [GitHub] [Model (132M)]

Vision-Language

  • (MGeo) MGeo: Multi-Modal Geographic Pre-Training Method SIGIR 2023
    [Paper] [GitHub]