We study text-guided geometric transformation for object-level scene editing: given an image and a text instruction, the goal is to modify an object’s position, orientation, and scale as instructed while leaving the rest of the scene unchanged. We focus on three fundamental spatial operations (translation, rotation, and resizing), which together span the core dimensions of geometric object transformation. To regularize the instruction–action space, we define a set of predefined templates for each operation. Each template specifies a target object and a spatial relation (e.g., “move the mug to the left”), providing a consistent basis for supervision and evaluation.
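
As a rough illustration of how such a template set could be organized, the sketch below maps each operation to a format string with slots for the target object and the spatial relation. Only the phrase “move the mug to the left” comes from the text; the names `TEMPLATES` and `EditInstruction`, the rotation and resizing wordings, and the example relations are assumptions made for illustration and are not the templates actually used.

```python
from dataclasses import dataclass

# Hypothetical template strings, one per operation. Only the translation
# wording is grounded in the example given in the text; the other two
# are illustrative assumptions.
TEMPLATES = {
    "translation": "move the {obj} {relation}",
    "rotation": "rotate the {obj} {relation}",
    "resizing": "make the {obj} {relation}",
}


@dataclass(frozen=True)
class EditInstruction:
    """A structured instruction: operation, target object, spatial relation."""

    operation: str
    obj: str
    relation: str

    def to_text(self) -> str:
        # Reject operations outside the predefined template set.
        if self.operation not in TEMPLATES:
            raise ValueError(f"unknown operation: {self.operation!r}")
        return TEMPLATES[self.operation].format(obj=self.obj, relation=self.relation)


instruction = EditInstruction("translation", "mug", "to the left")
print(instruction.to_text())  # move the mug to the left
```

Keeping the instruction as a structured triple rather than free text means the same record can serve both as the source of the prompt and as the label against which an edit is scored, which is one way to read the “consistent basis for supervision and evaluation” described above.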