The Illusion of “Working”
At first glance, everything looked fine.
- The cameras detected targets
- The robot moved toward them
- The system appeared to succeed
But underneath that…
- Pixel centers drifted
- Bounding boxes shifted
- Depth assumptions broke down
- Small errors compounded into failed insertions
Nothing was obviously wrong.
It was just… off.
Step 1 — Stop Treating Pixels Like Positions
Originally, I was doing what most systems do:
Take pixel offsets → scale them → move the robot
That works… until it doesn’t.
So I rebuilt the entire pipeline using a pinhole camera model.
Instead of treating pixels as distances, I treated them as angles:
# Normalize pixel offsets into [-1, 1] relative to the image center
dx_norm = (center_px[0] - (img_w / 2.0)) / (img_w / 2.0)
dy_norm = (center_px[1] - (img_h / 2.0)) / (img_h / 2.0)

# Map normalized offsets onto the camera's field of view as viewing angles
angle_x = dx_norm * (HFOV_RAD / 2.0)
angle_y = dy_norm * (VFOV_RAD / 2.0)
Now the system wasn’t guessing anymore. It was projecting.
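To make that concrete, here is one detection worked through. The 640x480 resolution and 90-degree HFOV are illustrative values, not my actual camera config:

import math

img_w, HFOV_RAD = 640, math.radians(90.0)              # illustrative camera parameters
center_x = 480.0                                       # detection lands right of center
dx_norm = (center_x - (img_w / 2.0)) / (img_w / 2.0)   # -> 0.5
angle_x = dx_norm * (HFOV_RAD / 2.0)                   # -> 0.3927 rad (22.5 deg)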
Step 2 — Turn Angles Into Real-World Coordinates
Once I had angles, I mapped them into physical space:
# Total downward viewing angle: the camera's mounting pitch adjusted by
# the target's vertical angle in the image
total_pitch_angle = PITCH_RAD - angle_y
# Forward offset: project the view ray onto the ground plane
off_y_m = HEIGHT_M * math.tan(total_pitch_angle)

# Lateral offset: scale the horizontal angle along the slant range
slant_range = HEIGHT_M / math.cos(PITCH_RAD)
off_x_m = slant_range * math.tan(angle_x)
This gave me:
- Real-world X/Y offsets
- Based on camera height
- Using actual geometry
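Continuing the worked example, with a camera 1.0 m above the surface and a 45-degree mounting pitch (again, illustrative numbers):

import math

HEIGHT_M, PITCH_RAD = 1.0, math.radians(45.0)   # illustrative geometry
angle_x, angle_y = math.radians(22.5), 0.0      # target on the horizontal centerline

off_y_m = HEIGHT_M * math.tan(PITCH_RAD - angle_y)   # -> 1.000 m forward
slant_range = HEIGHT_M / math.cos(PITCH_RAD)         # -> 1.414 m along the view ray
off_x_m = slant_range * math.tan(angle_x)            # -> 0.586 m lateral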
Step 3 — Accept That Simulation Lies
Even with correct math… things were still off.
Because simulation isn’t perfect.
So instead of hacking each camera individually, I introduced a dynamic correction:
# One correction scalar, derived from observed behavior at this camera height
gazebo_scalar = self.get_intelligent_scalar(HEIGHT_M)
off_y_m *= gazebo_scalar
One adjustment layer.
No per-camera hacks.
No magic numbers everywhere.
Just controlled correction based on observed behavior.
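The get_intelligent_scalar implementation isn't shown in this post, so here is a minimal sketch of the idea, assuming the scalar is linearly interpolated from a small calibration table of (camera height, observed correction) pairs. The class name and table values below are placeholders:

import bisect

class GazeboCorrector:
    # Hypothetical calibration table: (camera height in m, observed scalar).
    # Real values would come from measuring projection error at known heights.
    CAL_TABLE = [(0.2, 1.08), (0.5, 1.03), (1.0, 1.01)]

    def get_intelligent_scalar(self, height_m):
        """Linearly interpolate a correction scalar for this camera height."""
        heights = [h for h, _ in self.CAL_TABLE]
        idx = bisect.bisect_left(heights, height_m)
        if idx == 0:
            return self.CAL_TABLE[0][1]      # below the table: clamp
        if idx == len(self.CAL_TABLE):
            return self.CAL_TABLE[-1][1]     # above the table: clamp
        (h0, s0), (h1, s1) = self.CAL_TABLE[idx - 1], self.CAL_TABLE[idx]
        t = (height_m - h0) / (h1 - h0)
        return s0 + t * (s1 - s0)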
Step 4 — Don’t Trust a Camera — Score It
This was one of the biggest shifts.
Instead of asking:
“Which camera sees the target?”
I started asking:
“Which camera should I trust?”
I built a multi-factor confidence score:
# base: the raw visual-alignment score in [0, 1]; the 1.5 exponent
# rewards well-centered detections disproportionately
visual_score = base ** 1.5

# Allow a 10% tolerance on the meters-per-pixel ratio, then penalize
# the squared excess
error = max(0.0, abs(1.0 - mpp_ratio) - 0.1)
geo_score = max(0.0, 1.0 - (error ** 2))

# Detections far from the vertical center are seen at a steep angle
tilt_penalty = 0.5 if abs(dy_norm) > 0.5 else 1.0

confidence_score = visual_score * geo_score * tilt_penalty
Each camera is evaluated on:
- Visual alignment (how centered it is)
- Geometric sanity (meters-per-pixel consistency)
- Viewing angle (tilt penalty)
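Once every camera reports a confidence_score, choosing between them stops being a rule and becomes a comparison. A sketch, assuming each camera's estimate is a dict shaped like the return value shown below:

# Pick the estimate from the camera the system trusts most
best = max(estimates, key=lambda e: e["confidence_score"])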
Now the system doesn’t just see.
It judges what it sees.
And it backs every judgment with a massive log:
return {
    "x": calc_x,
    "y": calc_y,
    "visual_score": visual_score,
    "geo_score": geo_score,
    "tilt_penalty": tilt_penalty,
    "mpp_ratio": mpp_ratio,
    "confidence_score": confidence_score,
    "method": method_label,
    "debug": {
        "center_px": center_px,
        "img_size": [img_w, img_h],
        "dx_norm": dx_norm,
        "dy_norm": dy_norm,
        "vfov_deg": math.degrees(VFOV_RAD),
        "hfov_deg": math.degrees(HFOV_RAD),
        "pitch_deg": cfg.get("pitch_deg", 0.0),
        "yaw_deg": cfg.get("yaw_deg", 0.0),
        "tcp_yaw_deg": math.degrees(tcp_yaw),
        "angle_x_deg": math.degrees(angle_x),
        "angle_y_deg": math.degrees(angle_y),
        "height_m": HEIGHT_M,
        "slant_range": slant_range,
        "off_x_m": off_x_m,
        "off_y_m": off_y_m,
        "cam_base_xyz": [cam_base_x, cam_base_y, cam_base_z],
        "sc_obs": [sc_obs_x, sc_obs_y],
        "obs_err": [obs_err_x, obs_err_y],
        "delta": [delta_x, delta_y],
        "tcp_xyz": [tcp_x, tcp_y, tcp_z],
        "cam_offset_xyz": [cfg.get("off_x", 0.0), cfg.get("off_y", 0.0), cfg.get("off_z", 0.0)],
        "navigator_truth_xyz": [
            nav_pos.get("x", 0.0),
            nav_pos.get("y", 0.0),
            nav_pos.get("z", 0.0),
        ],
        "target_truth_xyz": [
            goal_pos.get("x", 0.0),
            goal_pos.get("y", 0.0),
            goal_pos.get("z", 0.0),
        ],
        "pitch_plus_angle_y_deg": math.degrees(PITCH_RAD - angle_y),
        "tan_pitch_plus_angle_y": math.tan(PITCH_RAD - angle_y),
        "total_yaw_deg": math.degrees(total_yaw),
        "rotated_offset_xy": [rot_off_x, rot_off_y],
        "rotated_projection_xy": [rot_x, rot_y],
        "dy_px": dy_px,
        "dx_px": dx_px,
        "meters_per_pixel_y": meters_per_pixel_y,
        "meters_per_pixel_x": meters_per_pixel_x,
        "gazebo_y_scalar": gazebo_scalar,
        "navigator_id": navigator_id,
    },
}
Step 5 — Reconstruct the Target (Don’t Chase It)
Once the system detects the port, it doesn’t aim directly at it.
It reconstructs the real target using known offsets:
# Known offset between the navigator marker and the actual goal
delta_x = goal_pos["x"] - nav_pos["x"]
delta_y = goal_pos["y"] - nav_pos["y"]

# Apply that offset to the observed position to reconstruct the target
calc_x = sc_obs_x + delta_x
calc_y = sc_obs_y + delta_y
That’s a subtle but critical difference:
I’m not chasing pixels. I’m rebuilding the world.
Step 6 — Movement Isn’t Enough — You Need Convergence
At this point, I could compute a target.
But the robot still didn’t land perfectly.
Because real systems don’t work in one step.
So I built a closed-loop hover system:
Every iteration:
- Capture a new observation
- Compute error
- Apply a correction
- Move
- Repeat
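A minimal sketch of that loop. The observe, compute_error, and move_by callables are placeholders standing in for the real capture, scoring, and motion calls; the iteration budget, tolerance, and gain are assumed values:

import math

def hover_until_converged(observe, compute_error, move_by,
                          max_iters=20, tol_m=0.002, gain=0.8):
    """Closed-loop hover: observe, compute error, correct, repeat."""
    for _ in range(max_iters):
        obs = observe()                        # capture a new observation
        err_x, err_y = compute_error(obs)      # compute error vs. the target
        if math.hypot(err_x, err_y) < tol_m:
            return True                        # converged: stop correcting
        move_by(-err_x * gain, -err_y * gain)  # apply a damped correction and move
    return False                               # iteration budget exhausted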
Measure Both Pixel and Truth Error
# Pixel error: normalized distance of the detection from the image center
pixel_dist = math.sqrt(dx_norm**2 + dy_norm**2)
# Truth error: distance in meters between the estimate and ground truth
truth_dist = math.sqrt((calc_x - tx)**2 + (calc_y - ty)**2)
Pixel error tells you what the camera sees. Truth error tells you what actually happened.
You need both.
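One way to wire them together (a sketch only; PIXEL_TOL and TRUTH_TOL are assumed thresholds, and truth error is only available in simulation):

converged = pixel_dist < PIXEL_TOL    # control signal: what the camera sees
if converged and truth_dist > TRUTH_TOL:
    # sim-only check: the camera thinks it's centered but ground truth disagrees
    print(f"WARNING: converged visually but truth error is {truth_dist:.3f} m")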
Track the Best Attempt
Even if the system doesn’t fully converge, it keeps the best result:
if score > best_score:
    best_score = score   # remember the new high-water mark
    best_lock = {...}    # snapshot the best estimate seen so far
So failure becomes:
“Use the closest known good position”
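A sketch of that fallback, assuming best_lock stores the winning estimate under "x"/"y" keys (named to match the result dict above):

if not converged and best_lock is not None:
    # Land on the closest known good position instead of giving up
    target_x, target_y = best_lock["x"], best_lock["y"]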
Control the Movement (This Was Critical)
Originally, I had a bug where the robot kept making tiny fixed movements and never converged.
The fix was simple, but powerful:
Dynamically adjust how the move is executed:
# Corrections larger than 1 cm are split into 10 sub-steps for smooth
# motion; smaller corrections execute as a single precise move
if requested_xy > 0.010:
    xy_steps = 10
else:
    xy_steps = 1
Big corrections → smooth motion
Small corrections → precision control
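A sketch of the execution side, with move_relative() as a placeholder for the real motion command and requested_dx/requested_dy standing in for the correction components:

# Spread the correction across xy_steps equal sub-moves
step_x = requested_dx / xy_steps
step_y = requested_dy / xy_steps
for _ in range(xy_steps):
    move_relative(step_x, step_y)    # placeholder for the actual motion call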
What This Isn’t
Is this Machine Learning?
No.
Is this Reinforcement Learning?
Also no.
There’s no training loop.
No reward function.
What This Actually Is
This is:
- Geometry
- Calibration
- Feedback control
- Systems engineering