🎨 LeX-Enhancer: Prompt Enhancer
LeX-Enhancer is a lightweight prompt enhancement model distilled from DeepSeek-R1.
We collected 60,856 caption pairs (each caption before and after DeepSeek-R1 enhancement) and fine-tuned DeepSeek-R1-Distill-Qwen-14B with LoRA to reproduce high-quality, visually rich prompts.
This makes large-scale prompt enhancement efficient, which suits high-fidelity text-to-image generation.
✍️ Example: From Simple to Enhanced Caption
🧾 Input (Simple Caption):
A thank you card with the words very much, with the text on it: "VERY" in black, "MUCH" in yellow.
🧾 Output (Enhanced Caption):
The image features a rectangular thank you card with a muted, warm-toned background in a soft gradient of creamy beige and blush pink, providing a subtle textured finish that adds depth without overwhelming the central elements. Centered near the top of the card, the word "VERY" is rendered in thick, uppercase black lettering with a clean sans-serif font, slightly curved to follow the natural flow of the composition. Directly beneath it, the word "MUCH" appears in a vibrant, sunny-yellow hue, matching the same bold font style but slightly larger in size, positioned to align neatly with the lower edge of "VERY," creating a balanced hierarchy. Both texts are outlined with faint drop shadows, enhancing readability against the softly lit backdrop. The cards’ edges are crisp, with minimalistic borders in a lighter shade of the background tones, and the overall layout is symmetrical, exuding a polished, contemporary aesthetic. Gentle ambient lighting casts soft, diffused shadows around the card’s corners, suggesting a lightly textured surface underneath, while the absence of decorative embellishments keeps the focus on the typography. The color palette harmonizes warmth and neutrality, ensuring the text remains the focal point while maintaining a serene, approachable ambiance.
🚀 Usage (Python Code)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
# System instruction for reasoning + answering
SYSTEM_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>."
)
model_path = 'X-ART/LeX-Enhancer'
# Your simple caption goes here
simple_caption = "A thank you card with the words very much, with the text on it: \"VERY\" in black, \"MUCH\" in yellow."
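def create_chat_template(user_prompt):
    # Chat-style messages; the assistant turn is pre-filled with "<think>"
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "<think>"}
    ]
def create_direct_template(user_prompt):
    # Plain-string prompt; appending "<think>" starts the reasoning section
    return user_prompt + "<think>"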
def create_user_prompt(simple_caption):
    # Wrap the simple caption in the enhancement instructions
    return (
        "Below is the simple caption of an image with text. Please deduce the detailed description of the image based on this simple caption. "
        "Note: 1. The description should only include visual elements and should not contain any extended meanings. "
        "2. The visual elements should be as rich as possible, such as the main objects in the image, their respective attributes, "
        "the spatial relationships between the objects, lighting and shadows, color style, any text in the image and its style, etc. "
        "3. The output description should be a single paragraph and should not be structured. "
        "4. The description should avoid certain situations, such as pure white or black backgrounds, blurry text, excessive rendering of text, "
        "or harsh visual styles. "
        "5. The detailed caption should be human readable and fluent. "
        "6. Avoid using vague expressions such as \"may be\" or \"might be\"; the generated caption must be in a definitive, narrative tone. "
        "7. Do not use negative sentence structures, such as \"there is nothing in the image,\" etc. The entire caption should directly describe the content of the image. "
        "8. The entire output should be limited to 200 words.\n"
        f"SIMPLE CAPTION: {simple_caption}"
    )
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)
# Prepare input prompt
prompt = create_direct_template(create_user_prompt(simple_caption))
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
# Stream output
streamer = TextStreamer(tokenizer, skip_special_tokens=True, clean_up_tokenization_spaces=True)
output = model.generate(
    input_ids,
    max_length=2048,
    num_return_sequences=1,
    do_sample=True,
    temperature=0.6,
    repetition_penalty=1.1,
    streamer=streamer
)
print("*" * 80)
# Output will stream via TextStreamer
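Because the prompt ends with `<think>`, the generated text contains a reasoning section before the enhanced caption. The snippet below is a minimal sketch of how the final caption could be separated from that reasoning. It assumes the decoded text keeps the `</think>` marker and may or may not wrap the caption in `<answer>` tags. The helper name `extract_enhanced_caption` and the sample string are hypothetical and not part of the LeX-Enhancer release.

```python
import re

def extract_enhanced_caption(generated_text: str) -> str:
    """Return the text after the last </think>, without <answer> tags."""
    # rpartition yields the whole string as the tail when "</think>" is absent.
    tail = generated_text.rpartition("</think>")[2]
    # Drop <answer> / </answer> wrappers if the model emitted them (assumption).
    tail = re.sub(r"</?answer>", "", tail)
    return tail.strip()

# Hypothetical decoded output, shortened for illustration.
sample = (
    "<think>The card needs a warm background.</think>\n"
    "<answer>The image features a rectangular thank you card.</answer>"
)
caption = extract_enhanced_caption(sample)
print(caption)
print(len(caption.split()), "words")  # rough check against the 200-word limit
```

With the usage code above, the full text could be obtained with `tokenizer.decode(output[0], skip_special_tokens=True)` and passed to this helper. Whether the reasoning tags survive decoding depends on the tokenizer configuration, so check the result on a real output before relying on it.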
GitHub repository: https://github.com/zhaoshitian/LeX-Art
@article{zhao2025lexart,
title={LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis},
author={Zhao, Shitian and Wu, Qilong and Li, Xinyue and Zhang, Bo and Li, Ming and Qin, Qi and Liu, Dongyang and Zhang, Kaipeng and Li, Hongsheng and Qiao, Yu and Gao, Peng and Fu, Bin and Li, Zhen},
journal={arXiv preprint arXiv:2503.21749},
year={2025}
}