Possible engine bug between trials

aic_controller::on_activate rebases the reference to the pre-teleport pose during home_robot(), causing an impedance lunge at trial start. @Yadunund Also mentioned in "Reset Issues - Robot moves after homing before cable is spawned".

Summary

Between trials, Engine::home_robot() deactivates the controller, calls /scoring/reset_joints, then reactivates the controller. The new trial starts with the controller’s
last_tool_reference_ pointing at the pre-teleport pose, even though the arm has been teleported to home. The impedance loop then pulls the arm back toward the stale reference at the
controller’s max_translational_velocity cap (≈ 0.25 m/s). Observed as a ~170–180 mm “lunge” at the start of every trial.

Reproduction

Run any back-to-back trials. Record /aic_controller/controller_state (tcp_pose and reference_tcp_pose) across a trial boundary.

Evidence

Two recordings around trial-start lunges (tcp_pose vs reference_tcp_pose, downsampled to ~10 Hz):

Recording 1 — single-tick teleport at t = 2.426 s

t= 2.278s cur=(-0.3702,+0.3030,+0.1971) ref=(-0.3707,+0.3032,+0.1471) gap= 50.0mm
t= 2.426s cur=(-0.3714,+0.1948,+0.3294) ref=(-0.3702,+0.3030,+0.1971) gap=171.0mm
t= 3.302s cur=(-0.3864,+0.2154,+0.2905) ref=(-0.3702,+0.3030,+0.1971) gap=129.1mm
t= 4.103s cur=(-0.3600,+0.2890,+0.2023) ref=(-0.3702,+0.3030,+0.1971) gap= 18.1mm

Recording 2 — single-tick teleport at t = 4.757 s

t= 4.655s cur=(-0.4386,+0.3524,+0.2842) ref=(-0.4391,+0.3676,+0.2922) gap= 17.1mm
t= 4.757s cur=(-0.3714,+0.1948,+0.3294) ref=(-0.4386,+0.3524,+0.2842) gap=177.2mm
t= 5.662s cur=(-0.3357,+0.1772,+0.3146) ref=(-0.4386,+0.3524,+0.2842) gap=205.4mm
t=10.611s cur=(-0.4251,+0.2814,+0.2467) ref=(-0.4386,+0.3524,+0.2842) gap= 81.5mm

Three signatures uniquely identify the issue:

  1. Δcur ≈ 170–180 mm in one 52 ms tick → physical move rate ≈ 3.4 m/s, impossible for the real arm. This is reset_joints taking effect.
  2. Post-teleport reference_tcp_pose equals the pre-teleport tcp_pose to the millimeter — the controller rebased last_tool_reference_ to a stale FK reading.
  3. Post-lunge pullback at ≈ 0.26 m/s, consistent with cartesian_limits.max_translational_velocity = 0.25 m/s from aic_ros2_controllers.yaml. Classic impedance-pulling-to-stale-reference
    behavior.
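
As a rough check of signatures 1 and 2, the arithmetic can be reproduced from the two Recording 1 rows around the teleport. This is only a sketch: Sample, distance, implied_speed and reference_is_stale are illustrative names, not part of aic_controller, and the 1 mm tolerance is an assumption based on the "to the millimeter" observation.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// One controller_state sample: time [s], measured TCP position and
// reference TCP position [m]. Hypothetical struct, for illustration only.
struct Sample {
  double t;
  Vec3 cur;
  Vec3 ref;
};

// Euclidean distance between two positions, in metres.
double distance(const Vec3& a, const Vec3& b) {
  const double dx = a[0] - b[0];
  const double dy = a[1] - b[1];
  const double dz = a[2] - b[2];
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Signature 1: speed implied by a displacement that happens within one tick.
double implied_speed(double displacement_m, double dt_s) {
  assert(dt_s > 0.0);
  return displacement_m / dt_s;
}

// Signature 2: the post-teleport reference coincides with the pre-teleport
// measured pose (tol_m is an assumed tolerance).
bool reference_is_stale(const Sample& before, const Sample& after,
                        double tol_m = 1e-3) {
  return distance(after.ref, before.cur) <= tol_m;
}

// Recording 1, rows at t = 2.278 s and t = 2.426 s.
const Sample kBefore{2.278, {-0.3702, 0.3030, 0.1971}, {-0.3707, 0.3032, 0.1471}};
const Sample kAfter{2.426, {-0.3714, 0.1948, 0.3294}, {-0.3702, 0.3030, 0.1971}};
```

With these rows, distance(kAfter.cur, kAfter.ref) comes out at about 0.171 m, implied_speed(0.175, 0.052) at roughly 3.4 m/s, and reference_is_stale(kBefore, kAfter) holds.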

Mechanism

Engine::home_robot() (aic_engine.cpp:1651–1711):

switch_controllers({}, {"aic_controller"}); // (1) deactivate
auto reset_joints_future = reset_joints_client_->async_send_request(…); // (2) request reset
wait_for_interruptible(reset_joints_future, 10s);
if (!reset_joints_future.get()->success) return false;
switch_controllers({"aic_controller"}, {}); // (3) reactivate

aic_controller::on_activate() (aic_controller.cpp:717):

last_tool_reference_ = current_tool_state_;

The engine waits on reset_joints_future and treats response->success == true as “reset applied.” But the Gazebo set_joint_positions plugin most likely returns success when the request
is queued, not when the physics step has applied the new joint positions. Between step (2)'s ack and step (3)'s reactivation, the controller reads stale joint state via
read_state_from_hardware, computes stale FK, and rebases last_tool_reference_ to the pre-teleport pose. The physics step then catches up; current_tool_state_ jumps to the home pose;
last_tool_reference_ is left behind. Impedance error becomes ≈ 170 mm; the loop pulls the arm back at the velocity cap.
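
The suspected ordering can be illustrated with a toy model. This is a sketch under the assumption stated above (the reset is acknowledged when queued and applied on a later physics step); ToySim, ToyController and error_after_homing are hypothetical stand-ins, not the real engine, plugin, or controller code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Euclidean distance between two positions, in metres.
double distance(const Vec3& a, const Vec3& b) {
  const double dx = a[0] - b[0];
  const double dy = a[1] - b[1];
  const double dz = a[2] - b[2];
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical simulator: a reset is acknowledged as soon as it is queued,
// but the new pose only becomes readable after the next physics step.
class ToySim {
 public:
  explicit ToySim(const Vec3& pose) : pose_(pose), pending_(pose) {}
  bool request_reset(const Vec3& home) {
    pending_ = home;
    return true;  // "success" here means queued, not applied
  }
  void step() { pose_ = pending_; }
  const Vec3& read_pose() const { return pose_; }

 private:
  Vec3 pose_;
  Vec3 pending_;
};

// Hypothetical controller that rebases its reference eagerly on activation.
struct ToyController {
  Vec3 reference{};
  void on_activate(const ToySim& sim) { reference = sim.read_pose(); }
  double error(const ToySim& sim) const {
    return distance(sim.read_pose(), reference);
  }
};

// Replays the reset / reactivate order and returns the tracking error [m]
// that remains once physics has caught up.
double error_after_homing(const Vec3& pre_teleport, const Vec3& home) {
  ToySim sim(pre_teleport);
  ToyController controller;
  const bool acked = sim.request_reset(home);  // (2) ack arrives before the step
  assert(acked);
  controller.on_activate(sim);  // (3) reference captured from the stale pose
  sim.step();                   // physics applies the reset afterwards
  return controller.error(sim);
}
```

Feeding in the Recording 1 pre-teleport pose and the home pose gives an error of about 0.171 m, the same size as the observed lunge.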

Why it’s not a participant-side issue

In our recordings, the policy has not yet published a single motion command when the lunge occurs — reference_tcp_pose is bit-perfectly constant for the entire pre-lunge window. The
lunge happens in the engine ↔ controller ↔ Gazebo handoff, before the policy’s first tick. No policy implementation can prevent it; a “wait until controller is settled” check at policy
start can only observe it, not fix it.

Suggested fixes (in order of preference)

  1. Engine-side: wait for joint state to match reset target before reactivating. After the reset_joints future returns and before switch_controllers({"aic_controller"}, {}), poll
    /joint_states until joint positions are within tolerance of home_reset_joints_request_->initial_positions, with a timeout. Closes the window deterministically.
  2. Controller-side: defer the reference rebase to the first update tick. Instead of setting last_tool_reference_ = current_tool_state_ in on_activate, set a needs_rebase_ flag, and
    rebase inside the first update() call after activation. The first update() runs after at least one hardware read post-activation, by which time the post-teleport joint state is in.
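
Both suggestions can be sketched without ROS dependencies. The code below is a minimal illustration, not a patch: read_joints is a hypothetical callback standing in for the latest /joint_states message, the tolerance, timeout and polling period are assumed parameters, and DeferredRebase uses a single double in place of the full tool pose.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Fix 1 helper: true when every joint is within tol_rad of its reset target.
bool joints_within_tolerance(const std::vector<double>& current,
                             const std::vector<double>& target,
                             double tol_rad) {
  if (current.size() != target.size()) return false;
  for (std::size_t i = 0; i < current.size(); ++i) {
    if (std::abs(current[i] - target[i]) > tol_rad) return false;
  }
  return true;
}

// Fix 1: poll the joint-state source until it matches the target or the
// timeout expires. Returns false on timeout.
bool wait_for_reset_applied(
    const std::function<std::vector<double>()>& read_joints,
    const std::vector<double>& target, double tol_rad,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds period = std::chrono::milliseconds(10)) {
  assert(period.count() > 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  do {
    if (joints_within_tolerance(read_joints(), target, tol_rad)) return true;
    std::this_thread::sleep_for(period);
  } while (std::chrono::steady_clock::now() < deadline);
  return false;
}

// Fix 2: activation only raises a flag; the reference is rebased on the
// first update() call, which sees a freshly read state.
struct DeferredRebase {
  bool needs_rebase = false;
  double reference = 0.0;  // 1-D stand-in for last_tool_reference_

  void on_activate() { needs_rebase = true; }

  // current is the state read from hardware for this control tick.
  void update(double current) {
    if (needs_rebase) {
      reference = current;
      needs_rebase = false;
    }
  }
};
```

In the engine, the wait would sit between the reset_joints response and the reactivating switch_controllers call, and a timeout would be treated as a failed homing. In the controller, the flag keeps on_activate free of any read of possibly stale state.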

I can confirm that this overlaps a lot with our findings from just yesterday! We were having repeated issues with that too. We noticed a much higher occurrence when the engine is doing the resets. When we use the setup, including the controller, to do our own resets (our own timing, wait-for-settling, etc.), the “deep dive” issue doesn’t happen nearly as often as when the engine does the resets, which is a problem at eval time.