VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models

Authors anonymized

Abstract

Diffusion models have shown impressive results in text-to-image synthesis. Using massive datasets of captioned images, diffusion models learn to generate raster images of highly diverse objects and scenes. However, designers frequently use vector representations of images like Scalable Vector Graphics (SVGs) for digital icons, graphics and stickers. Vector graphics can be scaled to any size, and are compact. In this work, we show that a text-conditioned diffusion model trained on pixel representations of images can be used to generate SVG-exportable vector graphics. We do so without access to large datasets of captioned SVGs. Instead, inspired by recent work on text-to-3D synthesis, we vectorize a text-to-image diffusion sample and fine-tune with a Score Distillation Sampling loss. By optimizing a differentiable vector graphics rasterizer, our method distills abstract semantic knowledge out of a pretrained diffusion model. By constraining the vector representation, we can also generate coherent pixel art and sketches. Our approach, VectorFusion, produces more coherent graphics than prior works that optimize CLIP, a contrastive image-text model.
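For readers unfamiliar with Score Distillation Sampling, the gradient it supplies (in the generic pixel-space form introduced by the text-to-3D work this approach builds on) can be written as

\[ \nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right], \qquad x = R(\theta), \quad x_t = \alpha_t x + \sigma_t \epsilon, \]

where \(\theta\) are the SVG path and color parameters, \(R\) is the differentiable rasterizer, \(y\) is the caption, and \(\hat{\epsilon}_\phi\) is the pretrained diffusion model's noise prediction. VectorFusion applies this loss through the latent encoder of Stable Diffusion, which adds one more Jacobian factor to the chain; the expression above is the generic version, shown here for intuition.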


Example generated vectors

VectorFusion generates vector graphics from diverse captions. Search through SVGs in our gallery.


Infinitely scalable assets

Vector graphics are compact but can be scaled to arbitrary size while staying sharp. Caption: "a train. minimal flat 2d vector icon. lineal color. on a white background. trending on artstation."


Visualizing text-to-SVG generation

We generate SVGs from text in an efficient multi-stage process. First, our method samples raster images from the Stable Diffusion text-to-image diffusion model. VectorFusion then traces those samples automatically with LIVE (Layer-wise Image Vectorization). However, these samples are often difficult to convert to vector graphics, look dull once traced, or do not reflect all the details of the text. Finally, our approach refines the traced vectors with an image-text loss based on Score Distillation Sampling, improving vibrancy and consistency with the text. VectorFusion uses an inverse graphics approach, enabled by the DiffVG differentiable SVG renderer; a minimal sketch of this refinement loop follows.
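The Python sketch below illustrates one way the refinement stage could be wired up with pydiffvg and the diffusers Stable Diffusion pipeline. The model checkpoint, hyperparameters, and the init_shapes() helper (assumed to return traced Bezier paths, fill colors, and the corresponding optimizable tensors) are illustrative assumptions, not the authors' exact configuration; classifier-free guidance, the w(t) weighting, and alpha compositing onto a white background are omitted for brevity.

```python
# Hedged sketch of SDS-based SVG refinement: render paths with DiffVG, encode the
# rendering with Stable Diffusion's VAE, and push the (eps_pred - eps) residual
# back through the rasterizer into the path parameters.
import torch
import pydiffvg
from diffusers import StableDiffusionPipeline

device = "cuda"
pydiffvg.set_use_gpu(torch.cuda.is_available())
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to(device)

prompt = "a train. minimal flat 2d vector icon. lineal color. on a white background."
text_inputs = pipe.tokenizer(
    prompt, padding="max_length",
    max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
)
text_emb = pipe.text_encoder(text_inputs.input_ids.to(device))[0]

# Hypothetical helper: paths/groups traced from a diffusion sample (e.g. with LIVE),
# plus the point and color tensors to optimize (requires_grad=True).
shapes, shape_groups, params = init_shapes()
optimizer = torch.optim.Adam(params, lr=0.01)
render = pydiffvg.RenderFunction.apply

for step in range(500):
    scene_args = pydiffvg.RenderFunction.serialize_scene(512, 512, shapes, shape_groups)
    img = render(512, 512, 2, 2, step, None, *scene_args)   # (H, W, 4) RGBA in [0, 1]
    img = img[..., :3].permute(2, 0, 1).unsqueeze(0)         # white-background compositing omitted
    latents = pipe.vae.encode(img * 2 - 1).latent_dist.sample() * 0.18215

    t = torch.randint(50, 950, (1,), device=device)
    noise = torch.randn_like(latents)
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    with torch.no_grad():
        eps_pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample

    # SDS: treat (eps_pred - noise) as the gradient w.r.t. the latents and
    # backpropagate it through the VAE encoder and the differentiable rasterizer.
    grad = eps_pred - noise
    loss = (grad.detach() * latents).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The surrogate loss (grad.detach() * latents).sum() is a standard trick for injecting a precomputed gradient: its derivative with respect to the latents is exactly grad, so the optimizer updates the SVG parameters in the SDS direction without differentiating through the U-Net.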