Error Outputs on Submissions

I did; I kept some buffer time before the stated 180 s task duration.

@bha51 @Ajin_J1 When testing locally, did your runs terminate cleanly? When I run the eval manually in two terminals, neither the model script nor the eval engine exits cleanly. The same happens when I run the images jointly with docker-compose: it does not shut down. There is a spin_thread=True in aic_model.py that seems to keep the model script from terminating, but I am not able to get the engine to exit cleanly either. I asked the same in this post with additional logs.
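For reference, a spin_thread=True flag usually means the node is spun by a background executor thread, and a spin thread that is never stopped or joined will keep the process alive. Here is a minimal rclpy sketch of that pattern with an explicit shutdown; the names are illustrative, not the actual aic_model.py code:

# Illustrative sketch only; names are not taken from aic_model.py.
import threading
import rclpy
from rclpy.executors import SingleThreadedExecutor
from rclpy.node import Node

rclpy.init()
node = Node('model_node_sketch')
executor = SingleThreadedExecutor()
executor.add_node(node)

# What a spin_thread=True flag typically does: spin the executor in the
# background. If this thread is never stopped, the script never exits.
spin_thread = threading.Thread(target=executor.spin, daemon=True)
spin_thread.start()

try:
    pass  # policy work would happen here
finally:
    executor.shutdown()            # makes executor.spin() return
    spin_thread.join(timeout=5.0)  # reap the background thread
    rclpy.shutdown()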

Having the same issue: the evaluation runs for ~637 s and then fails without any logs. Team name is KawaharaLab.

What is the error you are seeing on shutdown?

I posted the error logs in the linked post; I didn’t want to derail the current thread’s discussion. However, here are the last log lines from the engine:

eval-1   | [INFO] [aic_engine-5]: process has finished cleanly [pid 992]
eval-1   | [INFO] [component_container-3]: sending signal 'SIGINT' to process[component_container-3]
eval-1   | [INFO] [aic_adapter-2]: sending signal 'SIGINT' to process[aic_adapter-2]
eval-1   | [INFO] [robot_state_publisher-1]: sending signal 'SIGINT' to process[robot_state_publisher-1]
eval-1   | [component_container-3] (2026-04-20 11:03:09.227) [debug] [SignalHandler.cc:278] Received signal[2].
eval-1   | [component_container-3] (2026-04-20 11:03:09.227) [debug] [ServerPrivate.cc:127] Server received signal[2]
eval-1   | [component_container-3] (2026-04-20 11:03:09.227) [debug] [Sensors.cc:565] SensorsPrivate::Stop
eval-1   | [aic_adapter-2] [INFO] [1776682989.227872692] [rclcpp]: signal_handler(SIGINT/SIGTERM)
eval-1   | [robot_state_publisher-1] [INFO] [1776682989.229056882] [rclcpp]: signal_handler(SIGINT/SIGTERM)
eval-1   | [INFO] [robot_state_publisher-1]: process has finished cleanly [pid 988]
eval-1   | [INFO] [aic_adapter-2]: process has finished cleanly [pid 989]
eval-1   | [component_container-3] component_container: ./OgreMain/src/Threading/OgreThreadsPThreads.cpp:61: static void Ogre::Threads::WaitForThreads(size_t, const Ogre::ThreadHandlePtr*): Assertion `numThreadHandles < 128' failed.
eval-1   | [ERROR] [component_container-3]: process has died [pid 990, exit code -6, cmd '/opt/ros/kilted/lib/rclcpp_components/component_container --ros-args -r __node:=ros_gz_container -r __ns:=/'].

After this, I have to manually kill it every time. This is well after the policy returns True.

Submission timing issue: configure_model_node sim-time sleep consuming most of the allotted time
Team: LinkedVerse

We have been debugging our submission (Team: LinkedVerse) and would like to share our findings and ask for clarification.

What we observed locally

After adding diagnostic logs and running docker compose up with both the eval and model containers, we traced the following timeline:

The 94-second delay occurs inside configure_model_node() in aic_engine, specifically at:

// Goal is sent asynchronously; the result is not awaited here.
insert_cable_action_client_->async_send_goal(goal_msg, goal_options);
// With use_sim_time:=true this blocks until the *simulation* clock advances 1 s.
node_->get_clock()->sleep_for(rclcpp::Duration(std::chrono::seconds(1)));
Because aic_engine runs with use_sim_time:=true, sleep_for(1s) waits for the Gazebo simulation clock to advance by one second. During world initialization (before the task board and cable are spawned), the Gazebo real-time factor appears to be very low, causing 1 simulation second to take ~94 real seconds.
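A quick sanity check on those numbers, treating the sleep as one simulated second stretched by the real-time factor:

# Back-of-the-envelope check using the numbers observed above.
sim_sleep_s = 1.0        # sleep_for(1s) on the simulation clock
observed_wall_s = 94.0   # measured wall-clock delay
rtf = sim_sleep_s / observed_wall_s
print(f"implied real-time factor during init: {rtf:.4f}")  # ~0.0106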

I’m running into the same issue and have reached out to the support team about it.

Team name: slobot

I had submitted a slight modification of the RunACT policy from the example policies. Instead of downloading the weights from the internet, it loads them locally, because the network is internal-only in the Docker Compose settings.
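For anyone making the same change, here is a minimal sketch of the swap; the path and names are assumptions for illustration, not the actual RunACT code:

# Illustrative sketch; the weights path is an assumption, not a RunACT default.
from pathlib import Path
import torch

WEIGHTS_PATH = Path('/opt/policy/act_weights.pt')  # shipped inside the image

# Load from disk instead of fetching over the network, since the
# submission network is internal-only in the compose setup.
state_dict = torch.load(WEIGHTS_PATH, map_location='cpu')
# policy.load_state_dict(state_dict)  # applied to whatever model the policy builds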

The cached artifacts were:

When running locally, the scoring works almost all the time.

However, I reproduced the issue on the first attempt, which typically means the Docker containers were not “warmed up” yet.

The log shows “Participant model is not ready for trial”, indicating the AIC Engine did not get a successful “handshake” with the policy ROS node.

See the logs in the attached screenshot.

Right before, we see

GetState service call timed out for node ‘aic_model’

This indicates that the root cause of the issue is the following:

The service call future timed out after 5 seconds, causing the client to give up.

One possible mitigation is to increase the 5-second timeout to give the policy initializer more time to complete successfully. How about 30 seconds?
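For illustration, a GetState handshake with a configurable timeout might look like this in rclpy; aic_engine itself is C++, so these names are assumptions rather than its real code:

# Illustrative rclpy sketch; the engine's actual C++ client may differ.
import rclpy
from rclpy.node import Node
from lifecycle_msgs.srv import GetState

GET_STATE_TIMEOUT_S = 30.0  # proposed value; the engine reportedly uses 5 s

def model_is_ready(node: Node, target: str = 'aic_model') -> bool:
    client = node.create_client(GetState, f'/{target}/get_state')
    if not client.wait_for_service(timeout_sec=GET_STATE_TIMEOUT_S):
        return False
    future = client.call_async(GetState.Request())
    rclpy.spin_until_future_complete(node, future, timeout_sec=GET_STATE_TIMEOUT_S)
    if not future.done() or future.result() is None:
        return False  # the "GetState service call timed out" case
    return future.result().current_state.label == 'active'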

The configuration is set up here:

Adding another data point to this thread — we’re hitting the same blank-logs problem.

Our model runs cleanly locally in Docker (containers come up, eval completes), but on submission it fails after ~150 seconds every time. Both stderr and stdout come back empty, so we have no way to debug.

The 150s figure is suspicious — it’s well past the 60s configure->activate timeout that @Ajin_J1 flagged, which makes us think it’s a different failure mode. Is there another timeout in the eval pipeline around that duration?

Would really appreciate:

  • Access to container stderr for failed runs
  • A list of all timeouts enforced during evaluation

Team name: Datameister

Same here. Team: ArmoByte. No logs are visible either. Also, I noticed that sometimes I get an endless stream of:

“eval-1 | [component_container-3] (2026-04-25 12:22:25.726) [error] [Physics.cc:3188] Internal error: a physics entity ptr with an ID of [929] does not exist.”

messages. I don’t know if this comes from my policy, but I don’t think so, since it seems related to spawning/killing entities in Gazebo. Did anyone get this before?

Same here.

My submission failed but I don’t have any meaningful logs to debug the issue.

  • Submissions #383, #389, #394 (Team: BartolosCrew)

In the end, thanks to this post Fixed submission failing on Portal, I managed to fix it by moving the imports inside __init__ and other functions. After that, I was finally able to get a score. Thanks a lot to @bha51 for the hint!

One thing I’d still like to know: is there any way to find out which CUDA version is installed on the AWS evaluation platform? I couldn’t find that information documented anywhere. @Yadunund

Regards!

Hi mate, so are you saying the solution for you was to put imports like import numpy as np inside __init__ instead of at the beginning of the file? And if so, are you also saying to copy the imports inside every function that uses them?

Yes, exactly. Moving the imports (like import numpy as np) inside __init__ and in some cases inside the functions that use them solved it for me.
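For anyone else hitting this, the pattern is just deferring the heavy imports until first use. A minimal sketch; the class name is made up, not the actual policy code:

# Lazy-import sketch; PolicyNode is a made-up name.
class PolicyNode:
    def __init__(self):
        # Deferred so the module itself imports instantly, which helps the
        # node come up before the engine's readiness timeout expires.
        import numpy as np
        import torch
        self._np = np
        self._torch = torch

    def act(self, observation):
        import numpy as np  # cheap after the first import (module is cached)
        return np.asarray(observation)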

Wow, it works! Thanks! But do you know why putting the imports inside functions makes it work? I don’t think it’s good practice to do so.

@jlamperez thanks for flagging. I’ve opened Specify GPU driver and CUDA versions for qualification evaluation by Yadunund · Pull Request #511 · intrinsic-dev/aic · GitHub to specify the versions.


After following the guidelines, I was able to get the ACT policy to pass on the submission portal.

Here are the modifications I added to the ACT policy to speed up the Python imports on the evaluation fleet.

I agree it’s a bit odd to import heavy libraries such as torch in the initializer. Re-importing them in other functions is fine, though, as Python caches modules after the first import.
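The caching is standard Python behavior: a module lands in sys.modules on first import, so repeated import statements are near-free. A quick way to see it:

# Demonstrates that only the first import pays the load cost.
import sys
import time

def timed_import():
    t0 = time.perf_counter()
    import numpy  # first call loads the module; later calls hit sys.modules
    return time.perf_counter() - t0

print(f"first import:  {timed_import():.4f}s")
print(f"cached import: {timed_import():.6f}s")
assert 'numpy' in sys.modules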