Awesome-CVPR2024-AIGC

A Collection of Papers and Codes for CVPR2024 AIGC

A curated collection of this year's CVPR AIGC-related papers and code, organized as follows.

Please feel free to star, fork or PR if helpful~

Please credit the source when referencing or reposting.

CVPR2024 official website: https://cvpr.thecvf.com/Conferences/2024

Full list of accepted CVPR2024 papers: https://cvpr.thecvf.com/Conferences/2024/AcceptedPapers

Conference dates: June 17-21, 2024

Acceptance notification: February 27, 2024

【Contents】

1. Image Generation / Image Synthesis

Accelerating Diffusion Sampling with Optimized Time Steps

Adversarial Text to Continuous Image Generation

Amodal Completion via Progressive Mixed Context Diffusion

Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder

Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion

Attention Calibration for Disentangled Text-to-Image Personalization

CapHuman: Capture Your Moments in Parallel Universes

CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation

Condition-Aware Neural Network for Controlled Image Generation

CosmicMan: A Text-to-Image Foundation Model for Humans

Countering Personalized Text-to-Image Generation with Influence Watermarks

Cross Initialization for Personalized Text-to-Image Generation

Customization Assistant for Text-to-image Generation

DeepCache: Accelerating Diffusion Models for Free

DemoFusion: Democratising High-Resolution Image Generation With No $$$

Desigen: A Pipeline for Controllable Design Template Generation

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

Diffusion-driven GAN Inversion for Multi-Modal Facial Image Generation

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Diversity-aware Channel Pruning for StyleGAN Compression

Discriminative Probing and Tuning for Text-to-Image Generation

Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

Dynamic Prompt Optimizing for Text-to-Image Generation

ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

Efficient Dataset Distillation via Minimax Diffusion

ElasticDiffusion: Training-free Arbitrary Size Image Generation

EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models

Enabling Multi-Concept Fusion in Text-to-Image Models

Exact Fusion via Feature Distribution Matching for Few-shot Image Generation

FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation

Fast ODE-based Sampling for Diffusion Models in Around 5 Steps

FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

Generalizable Tumor Synthesis

Generating Daylight-driven Architectural Design via Diffusion Models

Generative Unlearning for Any Identity

HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

High-fidelity Person-centric Subject-to-Image Synthesis

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

InstanceDiffusion: Instance-level Control for Image Generation

Instruct-Imagen: Image Generation with Multi-modal Instruction

Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models

Inversion-Free Image Editing with Natural Language

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion

Learned representation-guided diffusion models for large-image generation

Learning Continuous 3D Words for Text-to-Image Generation

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

Learning Multi-dimensional Human Preference for Text-to-Image Generation

LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model

MACE: Mass Concept Erasure in Diffusion Models

MarkovGen: Structured Prediction for Efficient Text-to-Image Generation

MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

MindBridge: A Cross-Subject Brain Decoding Framework

MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation

On the Scalability of Diffusion-based Text-to-Image Generation

Personalized Residuals for Concept-Driven Text-to-Image Generation

Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis

Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models

Taming Text-to-Image Diffusion for Accurate Prompt Following

Readout Guidance: Learning Control from Diffusion Features

Relation Rectification in Diffusion Model

Residual Denoising Diffusion Models

Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation

Rich Human Feedback for Text-to-Image Generation

SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation

Self-correcting LLM-controlled Diffusion Models

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

Shadow Generation for Composite Image Using Diffusion Model

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Structure-Guided Adversarial Training of Diffusion Models

Style Aligned Image Generation via Shared Attention

SVGDreamer: Text Guided SVG Generation with Diffusion Model

Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting

Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models

Taming Stable Diffusion for Text to 360° Panorama Image Generation

TextCraftor: Your Text Encoder Can be Image Quality Controller

Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

TokenCompose: Grounding Diffusion with Token-level Supervision

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

Towards Memorization-Free Diffusion Models

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning

UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs

UniGS: Unified Representation for Image Generation and Segmentation

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

ViewDiff: 3D-Consistent Image Generation with Text-To-Image Models

When StyleGAN Meets Stable Diffusion: a 𝒲+ Adapter for Personalized Image Generation

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

2. Image Editing

An Edit Friendly DDPM Noise Space: Inversion and Manipulations

Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth

Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

Deformable One-shot Face Stylization via DINO Semantic Guidance

DemoCaricature: Democratising Caricature Generation with a Rough Sketch

DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection

DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing

DiffusionLight: Light Probes for Free by Painting a Chrome Ball

Diffusion Models Without Attention

Doubly Abductive Counterfactual Inference for Text-based Image Editing

Edit One for All: Interactive Batch Image Editing

Face2Diffusion for Fast and Editable Face Personalization

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image

Inversion-Free Image Editing with Natural Language

PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models

Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing

PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models

FreeDrag: Feature Dragging for Reliable Point-based Image Editing

RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer

SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting

Text-Driven Image Editing via Learnable Regions

Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing

UniHuman: A Unified Model For Editing Human Images in the Wild

ZONE: Zero-Shot Instruction-Guided Local Editing

3. Video Generation / Video Synthesis

360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Delving Deep into Diffusion Transformers for Image and Video Generation

DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation

DisCo: Disentangled Control for Realistic Human Dance Generation

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Generation

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

Grid Diffusion Models for Text-to-Video Generation

Hierarchical Patch-wise Diffusion Models for High-Resolution Video Generation

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

LAMP: Learn A Motion Pattern for Few-Shot Video Generation

Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis

Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation guided by the Characteristic Dance Primitives

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Make Your Dream A Vlog

Make Pixels Dance: High-Dynamic Video Generation

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

PEEKABOO: Interactive Video Generation via Masked-Diffusion

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

SimDA: Simple Diffusion Adapter for Efficient Video Generation

Simple but Effective Text-to-Video Generation with Grid Diffusion Models

StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN

SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

VideoBooth: Diffusion-based Video Generation with Image Prompts

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Video-P2P: Video Editing with Cross-attention Control

4. Video Editing

A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing

CAMEL: Causal Motion Enhancement tailored for lifting text-driven video editing

CCEdit: Creative and Controllable Video Editing via Diffusion Models

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

VidToMe: Video Token Merging for Zero-Shot Video Editing

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

5. 3D Generation / 3D Synthesis

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling

BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation

CAD: Photorealistic 3D Generation via Adversarial Distillation

CAGE: Controllable Articulation GEneration

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior

ControlRoom3D: Room Generation using Semantic Controls

DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features

Diffusion Time-step Curriculum for One Image to 3D Generation

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior

Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

EscherNet: A Generative Model for Scalable View Synthesis

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

Gaussian Shell Maps for Efficient 3D Human Generation

HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D

HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding

Holodeck: Language Guided Generation of 3D Embodied AI Environments

HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

Interactive3D: Create What You Want by Interactive 3D Generation

InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion

Intrinsic Image Diffusion for Single-view Material Estimation

Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text

MoMask: Generative Masked Modeling of 3D Human Motions

Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

OED: Towards One-stage End-to-End Dynamic Scene Graph Generation

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering

PEGASUS: Personalized Generative 3D Avatars with Composable Attributes

PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics

RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

SceneWiz3D: Towards Text-guided 3D Scene Composition

SemCity: Semantic Scene Generation with Triplane Diffusion

Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

SIGNeRF: Scene Integrated Generation for Neural Radiance Fields

Single Mesh Diffusion Models with Field Latents for Texture Generation

SPAD: Spatially Aware Multiview Diffusers

Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors

Text-to-3D using Gaussian Splatting

Tiger: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process

Towards Realistic Scene Generation with LiDAR Diffusion Models

UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion

ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models

6. 3D Editing

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

GenN2N: Generative NeRF2NeRF Translation

Makeup Prior Models for 3D Facial Makeup Estimation and Applications

7. Multi-Modal Large Language Models

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Anchor-based Robust Finetuning of Vision-Language Models

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Compositional Chain-of-Thought Prompting for Large Multimodal Models

Describing Differences in Image Sets with Natural Language

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

Efficient Stitchable Task Adaptation

Efficient Test-Time Adaptation of Vision-Language Models

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

FairCLIP: Harnessing Fairness in Vision-Language Learning

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Generative Multimodal Models are In-Context Learners

GLaMM: Pixel Grounding Large Multimodal Model

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric

OneLLM: One Framework to Align All Modalities with Language

One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

PixelLM: Pixel Reasoning with Large Multimodal Model

PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization

Prompt Highlighter: Interactive Control for Multi-Modal LLMs

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

SEED-Bench: Benchmarking Multimodal Large Language Models

SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining

The Manga Whisperer: Automatically Generating Transcriptions for Comics

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

VBench: Comprehensive Benchmark Suite for Video Generative Models

VideoChat: Chat-Centric Video Understanding

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

ViTamin: Designing Scalable Vision Models in the Vision-language Era

ViT-Lens: Towards Omni-modal Representations

8. Other Tasks

AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error

Diff-BGM: A Diffusion Model for Video Background Music Generation

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

On the Content Bias in Fréchet Video Distance

TexTile: A Differentiable Metric for Texture Tileability

Continuously updated~

References

CVPR 2024 Papers and Open Source Projects Collection (Papers with Code)

Related Collections
