Error Outputs on Submissions

I did; I kept some buffer time before the stated 180 s task duration.

@bha51 @Ajin_J1 When testing locally, did your runs terminate cleanly? When I run the eval manually in two terminals, neither the model script nor the eval engine exits cleanly. The same happens when I run the images jointly with docker-compose: it does not shut down. There is a spin_thread=True in aic_model.py that seems to keep the model script from terminating, but I am not able to get the engine to exit cleanly either. I asked the same in this post with additional logs.
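For reference, a spin_thread=True flag usually means the node is spun by a background executor thread, and a spin thread that is never stopped or joined will keep the process alive. Here is a minimal rclpy sketch of that pattern with an explicit shutdown; the names are illustrative, not the actual aic_model.py code:

# Illustrative sketch only; names are not taken from aic_model.py.
import threading
import rclpy
from rclpy.executors import SingleThreadedExecutor
from rclpy.node import Node

rclpy.init()
node = Node('model_node_sketch')
executor = SingleThreadedExecutor()
executor.add_node(node)

# What a spin_thread=True flag typically does: spin the executor in the
# background. If this thread is never stopped, the script never exits.
spin_thread = threading.Thread(target=executor.spin, daemon=True)
spin_thread.start()

try:
    pass  # policy work would happen here
finally:
    executor.shutdown()            # makes executor.spin() return
    spin_thread.join(timeout=5.0)  # reap the background thread
    rclpy.shutdown()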

Having the same issue: the evaluation runs for ~637 s and then fails without any logs. Team name is KawaharaLab.

What is the error you are seeing on shutdown?

I posted the error logs in the linked post; I didn’t want to derail the current thread’s discussion. However, here are the last log lines from the engine:

eval-1   | [INFO] [aic_engine-5]: process has finished cleanly [pid 992]
eval-1   | [INFO] [component_container-3]: sending signal 'SIGINT' to process[component_container-3]
eval-1   | [INFO] [aic_adapter-2]: sending signal 'SIGINT' to process[aic_adapter-2]
eval-1   | [INFO] [robot_state_publisher-1]: sending signal 'SIGINT' to process[robot_state_publisher-1]
eval-1   | [component_container-3] (2026-04-20 11:03:09.227) [debug] [SignalHandler.cc:278] Received signal[2].
eval-1   | [component_container-3] (2026-04-20 11:03:09.227) [debug] [ServerPrivate.cc:127] Server received signal[2]
eval-1   | [component_container-3] (2026-04-20 11:03:09.227) [debug] [Sensors.cc:565] SensorsPrivate::Stop
eval-1   | [aic_adapter-2] [INFO] [1776682989.227872692] [rclcpp]: signal_handler(SIGINT/SIGTERM)
eval-1   | [robot_state_publisher-1] [INFO] [1776682989.229056882] [rclcpp]: signal_handler(SIGINT/SIGTERM)
eval-1   | [INFO] [robot_state_publisher-1]: process has finished cleanly [pid 988]
eval-1   | [INFO] [aic_adapter-2]: process has finished cleanly [pid 989]
eval-1   | [component_container-3] component_container: ./OgreMain/src/Threading/OgreThreadsPThreads.cpp:61: static void Ogre::Threads::WaitForThreads(size_t, const Ogre::ThreadHandlePtr*): Assertion `numThreadHandles < 128' failed.
eval-1   | [ERROR] [component_container-3]: process has died [pid 990, exit code -6, cmd '/opt/ros/kilted/lib/rclcpp_components/component_container --ros-args -r __node:=ros_gz_container -r __ns:=/'].

After this, I have to manually kill it every time. This is well after the policy returns True.

Submission timing issue: configure_model_node sim-time sleep consuming most of the allotted time
Team: LinkedVerse

We have been debugging our submission (Team: LinkedVerse) and would like to share our findings and ask for clarification.

What we observed locally

After adding diagnostic logs and running docker compose up with both the eval and model containers, we traced the following timeline:

The 94-second delay occurs inside configure_model_node() in aic_engine, specifically at:

// Goal is sent asynchronously; the result is not awaited here.
insert_cable_action_client_->async_send_goal(goal_msg, goal_options);
// With use_sim_time:=true this blocks until the *simulation* clock advances 1 s.
node_->get_clock()->sleep_for(rclcpp::Duration(std::chrono::seconds(1)));
Because aic_engine runs with use_sim_time:=true, sleep_for(1s) waits for the Gazebo simulation clock to advance by one second. During world initialization (before the task board and cable are spawned), the Gazebo real-time factor appears to be very low, causing 1 simulation second to take ~94 real seconds.
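A quick sanity check on those numbers, treating the sleep as one simulated second stretched by the real-time factor:

# Back-of-the-envelope check using the numbers observed above.
sim_sleep_s = 1.0        # sleep_for(1s) on the simulation clock
observed_wall_s = 94.0   # measured wall-clock delay
rtf = sim_sleep_s / observed_wall_s
print(f"implied real-time factor during init: {rtf:.4f}")  # ~0.0106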

I’m running into the same issue and have reached out to the support team about it.

Team name: slobot

I had submitted a slight modification of the RunACT policy from the example policies. Instead of downloading the weights from the internet, it loads them locally, because the network is internal-only in the Docker Compose settings.
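For anyone making the same change, here is a minimal sketch of the swap; the path and names are assumptions for illustration, not the actual RunACT code:

# Illustrative sketch; the weights path is an assumption, not a RunACT default.
from pathlib import Path
import torch

WEIGHTS_PATH = Path('/opt/policy/act_weights.pt')  # shipped inside the image

# Load from disk instead of fetching over the network, since the
# submission network is internal-only in the compose setup.
state_dict = torch.load(WEIGHTS_PATH, map_location='cpu')
# policy.load_state_dict(state_dict)  # applied to whatever model the policy builds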

The cached artifacts were:

When running locally, the scoring works almost all the time.

However, I reproduced the issue on the first attempt, which typically means the Docker containers were not “warmed up” yet.

The log shows “Participant model is not ready for trial”, indicating the AIC Engine did not get a successful “handshake” with the policy ROS node.

See the logs in the attached screenshot.

Right before, we see

GetState service call timed out for node ‘aic_model’

This indicates that the root cause of the issue is the following:

The service call future timed out after 5 seconds, causing the client to give up.

One possible mitigation is to increase the 5-second timeout to give the policy initializer more time to complete successfully. How about 30 seconds?
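For illustration, a GetState handshake with a configurable timeout might look like this in rclpy; aic_engine itself is C++, so these names are assumptions rather than its real code:

# Illustrative rclpy sketch; the engine's actual C++ client may differ.
import rclpy
from rclpy.node import Node
from lifecycle_msgs.srv import GetState

GET_STATE_TIMEOUT_S = 30.0  # proposed value; the engine reportedly uses 5 s

def model_is_ready(node: Node, target: str = 'aic_model') -> bool:
    client = node.create_client(GetState, f'/{target}/get_state')
    if not client.wait_for_service(timeout_sec=GET_STATE_TIMEOUT_S):
        return False
    future = client.call_async(GetState.Request())
    rclpy.spin_until_future_complete(node, future, timeout_sec=GET_STATE_TIMEOUT_S)
    if not future.done() or future.result() is None:
        return False  # the "GetState service call timed out" case
    return future.result().current_state.label == 'active'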

The configuration is set up here:

Adding another data point to this thread — we’re hitting the same blank-logs problem.

Our model runs cleanly locally in Docker (containers come up, eval completes), but on submission it fails after ~150 seconds every time. Both stderr and stdout come back empty, so we have no way to debug.

The 150s figure is suspicious — it’s well past the 60s configure->activate timeout that @Ajin_J1 flagged, which makes us think it’s a different failure mode. Is there another timeout in the eval pipeline around that duration?

Would really appreciate:

  • Access to container stderr for failed runs
  • A list of all timeouts enforced during evaluation

Team name: Datameister

Same here. Team: ArmoByte. No logs are visible either. Also, I noticed that sometimes I get an endless stream of:

“eval-1 | [component_container-3] (2026-04-25 12:22:25.726) [error] [Physics.cc:3188] Internal error: a physics entity ptr with an ID of [929] does not exist.”

messages. I don’t know if this comes from my policy, but I don’t think so, since it seems related to spawning/killing entities in Gazebo. Did anyone get this before?

Same here.

My submission failed but I don’t have any meaningful logs to debug the issue.

  • Submissions #383, #389, #394 (Team: BartolosCrew)

In the end, thanks to this post Fixed submission failing on Portal, I managed to fix it by moving the imports inside __init__ and other functions. After that, I was finally able to get a score. Thanks a lot to @bha51 for the hint!

One thing I’d still like to know: is there any way to find out which CUDA version is installed on the AWS evaluation platform? I couldn’t find that information documented anywhere. @Yadunund

Regards!

Hi mate, so are you saying the solution for you was to put imports like import numpy as np inside __init__ instead of at the beginning of the file? And if so, are you also saying to copy the imports inside every function that uses them?

Yes, exactly. Moving the imports (like import numpy as np) inside __init__ and in some cases inside the functions that use them solved it for me.
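For anyone else hitting this, the pattern is just deferring the heavy imports until first use. A minimal sketch; the class name is made up, not the actual policy code:

# Lazy-import sketch; PolicyNode is a made-up name.
class PolicyNode:
    def __init__(self):
        # Deferred so the module itself imports instantly, which helps the
        # node come up before the engine's readiness timeout expires.
        import numpy as np
        import torch
        self._np = np
        self._torch = torch

    def act(self, observation):
        import numpy as np  # cheap after the first import (module is cached)
        return np.asarray(observation)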

Wow, it works! Thanks! But do you know why putting the imports inside functions makes it work? I don’t think it’s good practice to do so.

@jlamperez thanks for flagging. I’ve opened Specify GPU driver and CUDA versions for qualification evaluation by Yadunund · Pull Request #511 · intrinsic-dev/aic · GitHub to specify the versions.


After following the guidelines, I was able to get the ACT policy to pass on the submission portal.

Here are the modifications I added to the ACT policy to speed up the Python imports on the evaluation fleet.

I agree it’s a bit odd to import heavy libraries such as torch in the initializer. Re-importing them in other functions is fine, though, as Python caches modules after the first import.
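The caching is standard Python behavior: a module lands in sys.modules on first import, so repeated import statements are near-free. A quick way to see it:

# Demonstrates that only the first import pays the load cost.
import sys
import time

def timed_import():
    t0 = time.perf_counter()
    import numpy  # first call loads the module; later calls hit sys.modules
    return time.perf_counter() - t0

print(f"first import:  {timed_import():.4f}s")
print(f"cached import: {timed_import():.6f}s")
assert 'numpy' in sys.modules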