Hi everyone,
At iRobot, we have been working on lifecycle node transitions and would like the community’s input on designing more flexible transition functions (e.g., on_configure). Namely, transitions that can:
- be deferred (i.e., async)
- be cancelled
Core GitHub issue describing this: rclcpp#2213
What is the current behavior / limitations of lifecycle node transitions?
Currently, lifecycle transition functions must return a CallbackReturn, which makes them synchronous (i.e., we block the executor thread until we return). This poses 2 specific issues:
- we cannot call services from within an ongoing service callback without hacky workarounds (rclcpp#773; rclcpp#2057)
- we have no way of interacting with a node externally when it is in a transition.
How to handle dependencies/interoperability in lifecycle transitions?
I’ll use examples in this section to demonstrate the limitations/needs:
1. Deferral / Async Transition Need:
In the above example, we have:
- (1) An external `SupervisorNode` that requests a change state on our `LifecycleNode`.
- (2) This is then serviced by the executor, eventually calling the `CallbackReturn on_configure(State)` function.
- (3) The `LifecycleNode`’s `on_configure` requires calling an external service - in this case a `GetParameter` where we want the `battery_thresh`.
- (4) Our `BatteryNode` responds, putting the response in the `EventsQueue` for the executor to process when ready. However, the `on_configure` is still being serviced, waiting on this response to be processed. Therefore we are deadlocked.
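The deadlock in steps (1) to (4) can be reproduced without ROS at all. The following is a minimal sketch, assuming a single executor thread and a plain event queue; `MiniExecutor`, `ParamResponse`, `request_battery_thresh` and `on_configure_blocking` are hypothetical names made up for illustration and are not rclcpp types. The wait is bounded only so that the demo terminates.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical stand-in for a single-threaded executor and its events queue.
struct MiniExecutor {
  std::queue<std::function<void()>> events;

  // Run every queued event on the calling (executor) thread.
  void spin_some() {
    while (!events.empty()) {
      std::function<void()> event = std::move(events.front());
      events.pop();
      event();
    }
  }
};

// Slot that the simulated BatteryNode reply is written into.
struct ParamResponse {
  bool ready = false;
  double battery_thresh = 0.0;
};

// (3) Send the request; (4) the reply is queued and is only delivered once
// the executor gets around to running it.
std::shared_ptr<ParamResponse> request_battery_thresh(MiniExecutor & exec) {
  auto response = std::make_shared<ParamResponse>();
  exec.events.push([response]() {
    response->ready = true;
    response->battery_thresh = 0.2;  // placeholder value
  });
  return response;
}

// Synchronous on_configure: waits for the reply while still occupying the
// executor thread. The wait is bounded so the demo terminates; an unbounded
// wait would never return.
bool on_configure_blocking(MiniExecutor & exec, std::shared_ptr<ParamResponse> & response) {
  response = request_battery_thresh(exec);
  for (int attempt = 0; attempt < 1000 && !response->ready; ++attempt) {
    // Nothing can set `ready` here: the only thread that could deliver the
    // reply is the one running this loop.
  }
  return response->ready;
}

void deadlock_demo() {
  MiniExecutor exec;
  std::shared_ptr<ParamResponse> response;
  assert(!on_configure_blocking(exec, response));  // reply never seen inside the callback
  exec.spin_some();                                // callback returned, executor runs again
  assert(response->ready);                         // reply is delivered only now
}
```

The reply only becomes visible after the callback has given the thread back, which is exactly what a synchronous transition function cannot do while it is waiting.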
A proposed alternative would be to pass some handle to the user. They can do what they wish (call services / spawn other threads etc) and send a response using the handle. Code examples for reference: rclcpp#2214.
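As a rough illustration of the handle idea (the actual proposal is in rclcpp#2214 and may differ), here is a minimal sketch in plain C++. `DeferHandle`, `MiniExecutor` and `on_configure_deferred` are hypothetical names, and the queued lambda stands in for the service response arriving later.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// Hypothetical handle given to the user instead of expecting a return value.
class DeferHandle {
public:
  void send_callback_resp(CallbackReturn result) {
    responded_ = true;
    result_ = result;
  }
  bool responded() const { return responded_; }
  CallbackReturn result() const { return result_; }

private:
  bool responded_ = false;
  CallbackReturn result_ = CallbackReturn::ERROR;  // only meaningful once responded_ is true
};

// Hypothetical stand-in for a single-threaded executor and its events queue.
struct MiniExecutor {
  std::queue<std::function<void()>> events;
  void spin_some() {
    while (!events.empty()) {
      std::function<void()> event = std::move(events.front());
      events.pop();
      event();
    }
  }
};

// Deferred on_configure: issue the request, attach a continuation that will
// answer through the handle, and return at once.
void on_configure_deferred(
  MiniExecutor & exec, std::shared_ptr<DeferHandle> handle,
  std::shared_ptr<double> battery_thresh)
{
  exec.events.push([handle, battery_thresh]() {
    *battery_thresh = 0.2;  // placeholder value from the simulated BatteryNode
    handle->send_callback_resp(CallbackReturn::SUCCESS);
  });
}

void deferral_demo() {
  MiniExecutor exec;
  auto handle = std::make_shared<DeferHandle>();
  auto battery_thresh = std::make_shared<double>(0.0);
  on_configure_deferred(exec, handle, battery_thresh);
  assert(!handle->responded());  // transition is still in flight
  exec.spin_some();              // executor is free to deliver the reply
  assert(handle->responded() && handle->result() == CallbackReturn::SUCCESS);
}
```

Because the callback returns immediately, the executor thread is free to process the reply, and the transition is completed from the continuation rather than from the return value.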
2. Cancel Need:
In the above example, we have:
- (1) A `SupervisorNode` that monitors the state of the overall system, bringing nodes up/down depending on their dependencies. It sees the dependency `LCConsumer` has on `LCProducer` and checks that `LCProducer` is `ACTIVE`.
- (2) This is true.
- (3) Therefore it tries to transition the `LCConsumer` to `ACTIVE`.
- (4) As the transition starts for `LCConsumer`, `LCProducer` raises an error (ros2_design#283) and attempts recovery by running the `on_error(State)` callback.
- (5) `LCConsumer` is now stuck in `ACTIVATING` as it depends on a `LCProducer` service. We have no way to transition out of `ACTIVATING` from our `SupervisorNode`’s point of view.
An ideal solution here would be to allow these transitions to be cancelable. The user transition code (in the above example it would be LCConsumer::on_activate(State&, DeferResponseHandle*)) would be responsible for monitoring for cancels. If the user acknowledges the cancel request, they are responsible for cleaning up the ongoing transition and responding (i.e., a cooperative cancellation approach). Exact behavior can be found in the “More Detailed Expected Behaviors” section below. Code examples for reference: rclcpp#2214.
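To make the cooperative part concrete, here is a minimal sketch of user code that polls for a cancel request between steps of its own work. `TransitionHandle`, `Outcome` and `activation_step` are hypothetical names for illustration only; the sketch assumes the activation logic is driven in small steps (e.g., from a timer) rather than in one blocking call.

```cpp
#include <atomic>
#include <cassert>

enum class Outcome { PENDING, COMPLETED, CANCEL_HANDLED };

// Hypothetical handle shared between the supervisor side and the user code.
class TransitionHandle {
public:
  void request_cancel() { cancel_requested_.store(true); }
  bool is_cancelling() const { return cancel_requested_.load(); }
  void send_success() { outcome_ = Outcome::COMPLETED; }
  void handled_cancelled() { outcome_ = Outcome::CANCEL_HANDLED; }
  Outcome outcome() const { return outcome_; }

private:
  std::atomic<bool> cancel_requested_{false};
  Outcome outcome_ = Outcome::PENDING;
};

// One polling step of LCConsumer's activation logic.
// Returns true once the transition has been answered.
bool activation_step(TransitionHandle & handle, bool producer_service_ready) {
  if (producer_service_ready) {
    handle.send_success();  // completing wins over a pending cancel
    return true;
  }
  if (handle.is_cancelling()) {
    // User-specific cleanup of the half-finished activation would go here.
    handle.handled_cancelled();
    return true;
  }
  return false;  // keep waiting, check again on the next step
}

void cancel_demo() {
  TransitionHandle handle;
  assert(!activation_step(handle, false));  // LCProducer is down, still ACTIVATING
  handle.request_cancel();                  // supervisor asks to cancel
  assert(activation_step(handle, false));   // user code notices and unwinds
  assert(handle.outcome() == Outcome::CANCEL_HANDLED);
}
```

The backend never interrupts the user code; it only raises a flag, and the user decides where it is safe to unwind.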
Should lifecycle transitions be Actions?
You may notice these deferrable + cancelable transition function needs are entirely encompassed by ROS 2 Actions. In our current proposed approach (see “Current Proposal” section below), we essentially recreate a GoalHandler, naming it a ChangeStateHandler. Further, the change_state process already publishes events on change (i.e., published feedback).
We think ideally we could replace the current ChangeState.srv and corresponding transition functions to be a ChangeState.action instead. However, this would fully break all backward compatibility across all lifecycle work (e.g., rclcpp::lifecycle, rcl::lifecycle, rclpy::lifecycle …). A possible solution we are thinking about is some form of tick-tock deprecation pattern where both co-exist but this is still very far reaching.
On a related note (but much broader than this post / may be worth discussing in another post), we have talked quite a bit about how ROS 2 Actions possibly should be thought of as Asynchronous Services that can be optionally cancelled. We have found (especially in this implementation of async lifecycle transitions) that using Async Services (introduced in rclcpp#1709) leads down a path of wanting the remaining components already implemented in an Action (accept / reject requests, cancelable, feedback on completion …).
Current Proposal
Our current approach:
- re-organizes `rclcpp` lifecycle code to better fit a model-view-controller paradigm. Note this has been separated out into its own respective issue (rclcpp#2212) and PR (rclcpp#2211) as it is a large architectural change
- is fully backward compatible by adding a `ChangeStateHandler` that allows for response deferral + cancellation monitoring with issue (rclcpp#2213) and PRs (rclcpp#2214, rcl_interfaces#157)
These are rather large code base changes, to rclcpp lifecycle in particular, so we would like community feedback before a more thorough review.
More Detailed Expected Behaviors
To be as concrete as possible, this section goes into the finer-grained details of the exact expected behavior for a deferrable + cancelable transition.
The “user” refers to the user-specific Lifecycle node code (e.g., the user’s on_configure implementation). The “Lifecycle backend” refers to the underlying rclcpp/rcl implementation. Finally, some of these behaviors are described w.r.t. ChangeStateHandler/our current proposed implementation for convenience. Higher level design language would be ideal if used for future design documents.
Deferral
- User is passed a `shared_ptr` or equivalent handle (e.g., `shared_ptr<ChangeStateHandler>`) with which they can send a `CallbackReturn` response when they wish (could be immediately as before or defer until later)
- When calling an async transition callback, the `Lifecycle` backend relinquishes control of the executor thread to the transition callback and does not expect it back until response (must wait until user sends a `change_state_hdl->send_callback_resp(CallbackReturn)` or handles a cancel request)
- The handle is only valid for 1 `send_callback_resp`; when `send_callback_resp` is called, the handle is subsequently invalidated.
- A user can check for a valid handler atomically (e.g., `change_state_hdl->is_executing()`)
- By default, Lifecycle transition callbacks remain synchronous, requiring a `register_async_on_X(function)` to override the default synchronous function
  - Only 1 transition function can be registered per transition state callback at any given time (see rclcpp#2216)
- The client of a `ChangeState.srv` will receive a `success = true` upon the successful completion.
  - “Successful completion” is defined by the underlying state machine being updated to a primary state and the change event being published. Note it does not refer to `CallbackReturn::SUCCESS`, just a full completion which can be made up of `CallbackReturn::FAILURE`/`ERROR` as well.
- At most 1 transition request can be processed at any given time on a first come, first serve basis. All transition requests made while a transition is ongoing will immediately be responded to with `success = false` with an error message indicating as such (see rclcpp#2154)
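Two of the rules above, the single-use handle and the "at most 1 transition in flight" policy, can be sketched as follows. This is not the code from rclcpp#2214: `ChangeStateHandlerSketch`, `LifecycleBackendSketch`, `try_begin_transition` and the `bool` return of `send_callback_resp` are assumptions made for illustration, and the backend is assumed to outlive the handles it creates.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>

enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// Hypothetical single-use handle.
class ChangeStateHandlerSketch {
public:
  explicit ChangeStateHandlerSketch(std::function<void(CallbackReturn)> on_response)
  : on_response_(std::move(on_response)) {}

  bool is_executing() const { return executing_.load(); }

  // Returns true for the first call only; the handle is invalid afterwards.
  bool send_callback_resp(CallbackReturn result) {
    if (!executing_.exchange(false)) {
      return false;  // already answered, ignore
    }
    on_response_(result);
    return true;
  }

private:
  std::atomic<bool> executing_{true};
  std::function<void(CallbackReturn)> on_response_;
};

struct ChangeStateResponse {
  bool success;
  std::string error_msg;
};

// Hypothetical backend that admits one transition at a time.
class LifecycleBackendSketch {
public:
  // Returns true if the request was accepted (no response is sent yet).
  // Returns false and fills `rejection` if a transition is already ongoing.
  bool try_begin_transition(
    std::shared_ptr<ChangeStateHandlerSketch> & handle_out, ChangeStateResponse & rejection)
  {
    if (current_ && current_->is_executing()) {
      handle_out = nullptr;
      rejection = {false, "a transition is already in progress"};
      return false;
    }
    current_ = std::make_shared<ChangeStateHandlerSketch>(
      [this](CallbackReturn result) { last_result_ = result; });
    handle_out = current_;
    return true;
  }

  CallbackReturn last_result() const { return last_result_; }

private:
  std::shared_ptr<ChangeStateHandlerSketch> current_;
  CallbackReturn last_result_ = CallbackReturn::ERROR;  // only meaningful after a response
};
```

Using an atomic `exchange` means that checking and invalidating the handle happen in one step, so two racing responders cannot both succeed.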
Cancellation
- A new `CancelTransition.srv` (or equivalent) & respective service allows for external node requests to cancel an ongoing transition
- A user can check for a cancel request atomically (e.g., via `change_state_hdl->is_cancelling()`)
- The handler implements a cooperative cancellation policy
  - It is up to the user to monitor for and unwind a cancelled request; this is due to the user being the only one who knows how to unwind at their defined points within a transition
  - A user can ignore a cancellation request completely:
    - A successfully completed transition request supersedes an ongoing cancel request with the state machine being updated according to the completed transition response.
    - If a user decides to ignore the cancellation request and subsequently successfully completes a transition, the cancel requester will be responded to with `success = false` and given an error reason.
  - A user can respond to a cancelled request (e.g., via `handled_cancelled(bool)`) indicating a successful handle or not
    - Upon a `change_state_hdl->handled_cancelled(true)`, the lifecycle node will follow the `CallbackReturn::FAILURE` path; this keeps the same valid state machine while often “falling back” to the prior state
    - Upon a `change_state_hdl->handled_cancelled(false)`, the lifecycle node will follow the `CallbackReturn::ERROR` path
- A `CancelTransition.srv` requires a request field indicating the desired transition to be cancelled. This is to avoid race conditions / follows RESTful concurrent PUT request design. Note this is not in the current proposed implementation but is planned to be added.
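The outcome rules above can be summarized in a small sketch: which path a transition ends on, when a completion supersedes a cancel, and why the cancel request names its target transition. `CancellableTransitionSketch`, `path()` and `cancel_superseded()` are hypothetical names, and the integer transition ids are arbitrary placeholders.

```cpp
#include <atomic>
#include <cassert>

enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

class CancellableTransitionSketch {
public:
  explicit CancellableTransitionSketch(int transition_id) : transition_id_(transition_id) {}

  // Cancel request naming the transition it targets. Rejected if it names a
  // different transition or if the transition has already been answered.
  bool request_cancel(int transition_id) {
    if (transition_id != transition_id_ || !executing_.load()) {
      return false;
    }
    cancel_requested_.store(true);
    return true;
  }

  bool is_cancelling() const { return cancel_requested_.load(); }

  // Normal completion; supersedes any cancel request still pending.
  void send_callback_resp(CallbackReturn result) {
    if (!executing_.exchange(false)) { return; }
    cancel_superseded_ = cancel_requested_.load();
    path_ = result;
  }

  // User acknowledges the cancel: true -> FAILURE path, false -> ERROR path.
  void handled_cancelled(bool cleaned_up) {
    if (!executing_.exchange(false)) { return; }
    path_ = cleaned_up ? CallbackReturn::FAILURE : CallbackReturn::ERROR;
  }

  CallbackReturn path() const { return path_; }
  bool cancel_superseded() const { return cancel_superseded_; }

private:
  const int transition_id_;
  std::atomic<bool> executing_{true};
  std::atomic<bool> cancel_requested_{false};
  bool cancel_superseded_ = false;
  CallbackReturn path_ = CallbackReturn::ERROR;  // only meaningful once answered
};

void cancellation_paths_demo() {
  CancellableTransitionSketch handled(3);
  assert(!handled.request_cancel(4));  // names a different transition: rejected
  assert(handled.request_cancel(3));
  handled.handled_cancelled(true);
  assert(handled.path() == CallbackReturn::FAILURE);

  CancellableTransitionSketch ignored(3);
  assert(ignored.request_cancel(3));
  ignored.send_callback_resp(CallbackReturn::SUCCESS);  // user ignored the cancel
  assert(ignored.path() == CallbackReturn::SUCCESS && ignored.cancel_superseded());
}
```

When `cancel_superseded()` is true, the cancel requester would be answered with `success = false`, matching the rule that a completed transition wins over a pending cancel.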

