Best open source Image to Video: CogVideoX1.5-5B-I2V is pretty decent and optimized for low VRAM machines at high resolution. Native resolution is 1360px, with up to 10 seconds (161 frames). Audio was generated with a new open source audio model.
Step 2: Data Collection
Gather high-quality photos of yourself
I used a Poco X6 Pro (mid-tier phone) with good results
Ensure good variety in poses and lighting
Step 3: Training
Use "ohwx man" as the only caption for all images
Keep it simple - no complex descriptions needed
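If your trainer reads captions from a .txt file sitting next to each image (that's an assumption here, check what your trainer actually expects), a minimal sketch for writing the same "ohwx man" caption for every photo could look like this. The function name and the extension list are mine, just for illustration:

```python
from pathlib import Path

CAPTION = "ohwx man"  # the single caption from Step 3
# Assumed set of image extensions; adjust to whatever your dataset contains.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def write_captions(dataset_dir, caption=CAPTION):
    """Write one .txt caption file next to each image and return the files written."""
    written = []
    # sorted() materializes the listing first, so new .txt files are not revisited.
    for image in sorted(Path(dataset_dir).iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTS:
            caption_file = image.with_suffix(".txt")
            caption_file.write_text(caption, encoding="utf-8")
            written.append(caption_file)
    return written
```

Calling write_captions("path/to/your/photos") leaves non-image files alone and overwrites any existing caption file with the same name.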
Step 4: Testing & Optimization
Use SwarmUI grid to find the optimal checkpoint
Test different variations to find what works best
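To get a feel for what a checkpoint grid covers, here is a rough sketch that just lists every checkpoint and prompt combination, one per grid cell. It does not call SwarmUI at all, and the checkpoint file names are made up for illustration:

```python
from itertools import product


def build_grid(checkpoints, prompts):
    """Return every (checkpoint, prompt) pair, one per grid cell."""
    return list(product(checkpoints, prompts))


# Hypothetical checkpoint names, for illustration only.
checkpoints = [
    "ohwx-epoch-050.safetensors",
    "ohwx-epoch-100.safetensors",
    "ohwx-epoch-150.safetensors",
]
prompts = [
    "photograph of ohwx man wearing an amazing ultra expensive suit on a luxury studio",
    "photograph of ohwx man",
]

grid = build_grid(checkpoints, prompts)
for checkpoint, prompt in grid:
    print(f"{checkpoint} | {prompt}")
```

Three checkpoints and two prompts give six cells, so the number of images grows quickly as you add more axes.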
Step 5: Generation Settings
Upscale Parameters:
Scale: 2x
Refiner Control: 0.6
Model: RealESRGAN_x4plus.pth
Prompt Used:
photograph of ohwx man wearing an amazing ultra expensive suit on a luxury studio<segment:yolo-face_yolov9c.pt-1,0.7,0.5>photograph of ohwx man
Note: The model naturally generated smiling expressions since the training dataset included many smiling photos.
Note: yolo-face_yolov9c.pt is used to mask the face and automatically inpaint it, which improves face quality in distant shots.
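The prompt above is really three pieces glued together: the main prompt, the segment tag, and the prompt used for the inpainted face. A small sketch of how to assemble it is below. The helper names are mine, and I'm reading the three values after the model name as detection index, creativity, and threshold, which is my assumption about the tag format, so double-check it against the SwarmUI docs:

```python
def segment_tag(yolo_model, index, creativity, threshold):
    """Build a segment tag in the same shape as the one in the prompt above.

    The parameter names are an assumed reading of the three values,
    not something confirmed by the post.
    """
    return f"<segment:yolo-{yolo_model}-{index},{creativity},{threshold}>"


def build_prompt(main_prompt, face_prompt, tag):
    """Join the main prompt, the segment tag, and the face prompt with no separators."""
    return f"{main_prompt}{tag}{face_prompt}"


tag = segment_tag("face_yolov9c.pt", 1, 0.7, 0.5)
prompt = build_prompt(
    "photograph of ohwx man wearing an amazing ultra expensive suit on a luxury studio",
    "photograph of ohwx man",
    tag,
)
print(prompt)
```

This reproduces the exact prompt string used above, which makes it easy to swap in a different face model or tweak the two numeric values while testing.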