--- name: complex-image-editing title: "Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing" version: 0.0.2 engine: skillxiv-v0.0.2-claude-opus-4.6 license: MIT url: "https://arxiv.org/abs/2507.05259" keywords: [Image Editing, Instruction Following, Multimodal Planning, Mask Generation, Object Localization] description: "Decompose complex image editing instructions into simpler sub-tasks with automatically generated control guidance. Handles multi-object edits, preserves identity of surrounding regions, and eliminates manual mask creation." --- # X-Planner: Planning-Based Image Editing from Complex Instructions Editing images based on complex instructions requires more than direct pixel manipulation. When a user says "make the building taller and the sky more dramatic," the system must understand that these are two separate edits targeting different objects, generate precise boundaries for each, and apply appropriate transformations without bleeding into adjacent regions. X-Planner solves this by decomposing complex instructions into manageable sub-tasks, automatically generating the masks and control signals that guide editing models. The core challenge is that complex instructions are indirectly specified and often target multiple objects. Current approaches either require users to manually provide masks or fail when identity preservation matters—editing one object corrupts its surroundings. ## Core Concept X-Planner operates as a three-stage pipeline that separates planning from execution: 1. **Instruction decomposition**: Parse the complex instruction into simpler, atomic sub-instructions 2. **Mask generation**: For each sub-instruction, generate precise segmentation masks tailored to the edit type 3. **Bounding box prediction**: For insertion tasks, predict spatial locations for new objects By treating masking as a learned task conditioned on edit type, the system generates tighter masks for texture edits and dilated masks for shape changes—each adapted to the specific editing goal. ## Architecture Overview - **MLLM instruction parser**: Analyzes complex instructions and produces structured sub-tasks with edit types - **Edit-specific mask generator**: Creates customized segmentation masks based on edit type (replacement, style change, insertion, etc.) - **Spatial predictor**: For insertions, predicts bounding boxes where new objects should appear - **Compatible editing backend**: Works with existing models (UltraEdit, InstructPix2Pix, etc.) - **Iterative refinement**: Applies sub-instructions sequentially, each building on previous edits ## Implementation Start by analyzing a complex instruction and decomposing it into sub-tasks: ```python from xplanner.decomposer import InstructionDecomposer from xplanner.masker import MaskGenerator decomposer = InstructionDecomposer(model="gpt-4-vision") # Complex instruction that targets multiple objects implicitly instruction = "Make the car red, remove the traffic cone, and brighten the road" # Decompose into atomic sub-instructions sub_tasks = decomposer.decompose( instruction=instruction, image=image ) # Output: # [ # {"text": "change the car color to red", "target": "car", "type": "color_change"}, # {"text": "remove the traffic cone", "target": "traffic_cone", "type": "deletion"}, # {"text": "brighten the road surface", "target": "road", "type": "lighting_change"} # ] ``` For each sub-task, generate a specialized mask conditioned on the edit type: ```python masker = MaskGenerator() for sub_task in sub_tasks: edit_type = sub_task["type"] # Generate mask adapted to edit type mask = masker.generate_mask( image=image, target_description=sub_task["text"], edit_type=edit_type, # Different masks for different edits: # - "texture" or "color_change": tight mask (exact object) # - "shape" or "size": dilated mask (include context) # - "deletion": precise boundary # - "global": full image mask ) # Validate mask covers the target assert masker.validate_coverage(mask, sub_task["target"]) sub_task["mask"] = mask ``` For insertion tasks, predict bounding boxes since existing detectors can't hallucinate objects not in the original image: ```python from xplanner.spatial import BoundingBoxPredictor predictor = BoundingBoxPredictor() insertion_tasks = [t for t in sub_tasks if t["type"] == "insertion"] for task in insertion_tasks: # Predict where new object should appear bbox = predictor.predict( image=image, instruction=task["text"], context_objects=get_visible_objects(image) ) # Bbox provides spatial guidance to editing model task["bbox"] = bbox ``` Apply the sub-tasks iteratively using a compatible editing model: ```python from xplanner.executor import ImageEditor editor = ImageEditor(backend="ultarEdit") # or InstructPix2Pix result_image = image.copy() # Apply sub-tasks sequentially for i, sub_task in enumerate(sub_tasks): # Get mask and optional spatial guidance mask = sub_task["mask"] bbox = sub_task.get("bbox", None) # Edit using specified mask and guidance result_image = editor.edit( image=result_image, instruction=sub_task["text"], mask=mask, spatial_guidance=bbox, preserve_identity=True # Keep regions outside mask unchanged ) # Validate edit quality assert editor.validate_quality(result_image, result_image_prev) return result_image ``` ## Practical Guidance ### When to Use X-Planner Use this approach for: - Complex, multi-object editing instructions - Scenarios where identity preservation is critical - User instructions with ambiguous or indirect language - Cases where manual masks would be tedious or error-prone - Applications requiring iterative refinement of edits ### When NOT to Use Avoid X-Planner for: - Simple, single-object edits (direct approaches are faster) - Fully structured instructions already decomposed by users - Style transfer or artistic transformations (doesn't require decomposition) - Real-time editing requiring immediate feedback - Highly specialized editing domains with custom models ### Edit-Type Mask Strategies | Edit Type | Mask Strategy | Example | |-----------|---------------|---------| | Color change | Tight mask (exact object boundary) | "Make the car blue" | | Shape change | Dilated mask (object + buffer) | "Make the building taller" | | Style transfer | Full region mask | "Make the road surface glossy" | | Deletion | Precise boundary | "Remove the traffic cone" | | Insertion | Bounding box guidance | "Add a tree near the building" | | Global edit | Full image mask | "Brighten the entire scene" | ### Key Hyperparameters | Parameter | Typical Range | Guidance | |-----------|---------------|----------| | Mask dilation | 0-30 pixels | Larger for shape edits, smaller for color | | Confidence threshold | 0.5-0.9 | Higher = more selective masks | | Iteration count | 1-5 steps | More iterations for complex edits, but slower | | Model backbone | GPT-4V, Claude | Larger models decompose better | ### Common Pitfalls 1. **Over-decomposing**: Not every instruction needs splitting. Keep sub-tasks atomic but not granular. 2. **Ignoring mask quality**: A good mask is 80% of the editing success. Validate carefully. 3. **Forgetting spatial context**: When inserting objects, ensure they appear in physically plausible locations. 4. **Sequential error accumulation**: Each edit can degrade the image. Monitor quality after each step. 5. **Missing identity preservation**: Ensure masks don't bleed into adjacent objects, or explicitly dilate for shape changes. ### Validation Checklist - [ ] Each sub-instruction is atomic and independent - [ ] Masks cover intended targets completely - [ ] Masks don't overlap with protected regions - [ ] Inserted objects have valid bounding boxes - [ ] Edit sequence respects dependencies - [ ] Final image preserves original identity outside edited regions ## Reference "Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing" - [arXiv:2507.05259](https://arxiv.org/abs/2507.05259)