Generalized Traffic Light Classification Architecture

@LiyouZhou - As I mentioned before, we cannot completely remove the feat_proj node. See my reasoning here: Generalized Traffic Light Classification Architecture - #9 by JWhitleyWork - Autoware - Open Robotics Discourse. The functionality of extracting which traffic light is relevant to our current lane cannot be provided by vision alone.

I’m also not sure if it is possible to train a single neural network to detect signals in multiple orientations and with different numbers of lights. This is something that requires research. If it is possible, we would also need to train it to output the locations of the bulbs and their states so that some “internationalization” node can make a decision. With the additional information in this discussion, I feel like these are our broad steps:

  1. [currently done by feat_proj] Find traffic signals in the vector map within a given distance of the vehicle in the vehicle’s forward arc (front 180 degrees).
  2. [currently done by feat_proj] Determine which signals are applicable to the current driving lane.
  3. [currently done by feat_proj] Project the position of those signals in the vector map into the 2D plane of the camera to determine ROIs (a rough sketch of this projection follows the list).
  4. [currently not done] Determine the position and state of each individual light in the signal inside the ROI.
  5. [currently not done] Determine a signal type from the number and configuration of lights in the signal.
  6. [currently not done] Determine the list of rules that the signal implies given the configuration and state of all lights contained in the signal.
  7. [currently not done] Determine the rule or rules that apply to the vehicle given the current lane and intended direction of travel.
  8. [currently done by one of several nodes in trafficlight_recognizer] Publish a final decision in the form of STOP, YIELD, or GO. YIELD is currently not implemented, but we have found it very necessary in our tests to avoid unexpected braking, so I’m recommending here that we add it.
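
For concreteness, here is a minimal sketch of what step 3 amounts to, assuming a calibrated pinhole camera and a signal pose already transformed into the camera frame. The function name, face-size parameters, and padding margin are illustrative; this is not the actual feat_proj code:

```python
import numpy as np

def signal_roi(signal_pos_cam, face_size_m, K, image_size, margin=1.5):
    """Project a signal's 3D position (camera frame) into the image and
    return a padded ROI (left, top, width, height) in pixels, or None if
    the signal is behind the camera or outside the frame."""
    x, y, z = signal_pos_cam
    if z <= 0.0:                                  # behind the image plane
        return None
    u = K[0, 0] * x / z + K[0, 2]                 # pinhole projection of the centre
    v = K[1, 1] * y / z + K[1, 2]
    w = K[0, 0] * face_size_m[0] / z * margin     # approximate projected face size
    h = K[1, 1] * face_size_m[1] / z * margin
    left, top = u - w / 2.0, v - h / 2.0
    if left + w < 0 or top + h < 0 or left > image_size[0] or top > image_size[1]:
        return None
    return left, top, w, h

# Example with a dummy intrinsic matrix and a 0.4 m x 1.2 m signal face 30 m ahead.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
print(signal_roi((0.5, -4.0, 30.0), (0.4, 1.2), K, (1280, 720)))
```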

Which node or nodes should perform the above tasks is what we’re trying to determine here. We have to keep feat_proj as mentioned before, and I think its scope is appropriate, so we can probably keep 1-3 as they are. I would argue that either just 4, or 4 and 5 together, could be the work of a neural network. Steps 6-8 could either be included as part of the NN node or live in a separate node; I would argue that a separate node is more in line with separation of concerns, and we could provide much of the “internationalization” effort in this last node.
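
To make steps 5-8 concrete, here is a minimal, purely illustrative sketch of the decision logic; the rule table, the bulb representation, and the fail-safe behaviour are assumptions, not an agreed interface:

```python
from enum import Enum

class Decision(Enum):
    STOP = 0
    YIELD = 1
    GO = 2

# Hypothetical rule table for step 6: lit bulb -> implied rule. A real
# "internationalization" node would load one of these per signal type and region.
VERTICAL_3_BULB = {"red": Decision.STOP, "amber": Decision.YIELD, "green": Decision.GO}

def decide(bulbs, intended_direction):
    """Steps 5-8 collapsed into one function for illustration.
    `bulbs` is a list of (colour, is_lit) tuples for one signal ROI;
    `intended_direction` would come from the planned path (unused here)."""
    if len(bulbs) != 3:          # step 5: only the 3-bulb vertical type handled here
        return Decision.STOP     # unknown configuration: fail safe
    lit = [colour for colour, is_lit in bulbs if is_lit]
    if not lit:
        return Decision.STOP     # nothing lit detected (or occluded): fail safe
    return VERTICAL_3_BULB.get(lit[0], Decision.STOP)   # steps 6-8

# Example: amber lit -> YIELD rather than an unexpected hard stop.
print(decide([("red", False), ("amber", True), ("green", False)], "straight"))
```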

I think the pipeline you propose makes sense.

I made a revised graph mocking up some of the things mentioned.

@startuml
digraph G {
    ROI [shape = diamond]
    raw_image [shape = diamond]
    light_bounding_boxes_and_states [shape = diamond] 
    planned_path [shape = diamond]
    STOP_YIELD_GO [shape = diamond]
    Feature_Projection -> ROI
    raw_image -> TLD_Vision_Detector
    ROI -> TLD_Vision_Detector
    ROI -> TLD_Internationalisation
    TLD_Vision_Detector -> light_bounding_boxes_and_states
    light_bounding_boxes_and_states -> TLD_Internationalisation
    light_bounding_boxes_and_states -> GUI
    planned_path -> TLD_Internationalisation
    TLD_Internationalisation -> STOP_YIELD_GO

    subgraph cluster_01 {
        label = "Legend";
        Topics [shape = diamond]
        Nodes
        Nodes -> Topics
    }
} 
@enduml

@LiyouZhou - I would request 2 changes to the above graph:

  1. Can we rename the bottom node to something like TLD_Decision instead? The internationalization part of that node is somewhat secondary to its ability to make a final decision based on the configuration of the light.
  2. To support the use of multiple detector types (both NN-based and classical), I recommend we use either a service or a topic (I prefer a service) to connect the TLD_Vision_Detector to another node doing the actual detection. This guarantees that the common code for the detector (the callback for receiving the ROIs, the callback for receiving the raw image, the publisher for the bounding boxes and states, etc.) all lives in TLD_Vision_Detector, while the actual detection is done in a separate node which receives only the cropped ROI image and returns the list of light positions and states to the TLD_Vision_Detector, which then publishes them. This way, instead of having to reproduce all of the common code in every type of detector, we can just have the actual detection node be a thin wrapper around the detection mechanism. Also, if this idea is OK with everyone, we should probably rename TLD_Vision_Detector to TLD_Interface and make the new, thin-wrapper node the actual TLD_Vision_Detector. A rough sketch of this split follows the list.
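
To illustrate that split, here is a rough rospy sketch of the TLD_Interface side only; the topic names, the DetectLights service type, and the result message are placeholders, not an existing Autoware interface:

```python
#!/usr/bin/env python
# Sketch of TLD_Interface: it owns the common code (ROI and raw-image
# subscribers, the result publisher) and delegates the actual detection
# to whichever node serves the hypothetical DetectLights service.
import rospy
from sensor_msgs.msg import Image, RegionOfInterest
# Placeholder service type (to be defined):
#   request:  sensor_msgs/Image roi_image
#   response: list of light bounding boxes and states
# from tld_msgs.srv import DetectLights

class TLDInterface(object):
    def __init__(self):
        self.latest_image = None
        rospy.Subscriber("image_raw", Image, self.image_cb)
        rospy.Subscriber("tld_roi", RegionOfInterest, self.roi_cb)
        # self.detect = rospy.ServiceProxy("detect_lights", DetectLights)
        # self.pub = rospy.Publisher("light_bboxes_and_states", ..., queue_size=1)

    def image_cb(self, msg):
        self.latest_image = msg

    def roi_cb(self, roi):
        if self.latest_image is None:
            return
        cropped = self.crop(self.latest_image, roi)   # common code lives here
        # result = self.detect(cropped)               # thin detector node does the work
        # self.pub.publish(result)

    def crop(self, image, roi):
        # Cropping the raw image to the ROI would happen here; omitted in this sketch.
        return image

if __name__ == "__main__":
    rospy.init_node("tld_interface")
    TLDInterface()
    rospy.spin()
```

Each concrete detector would then only need to implement the service handler around its particular detection mechanism.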

I have updated the graph. @JWhitleyWork does this capture what you mean?

@startuml
digraph G {
    ROI [shape = diamond]
    raw_image [shape = diamond]
    light_bounding_boxes_and_states [shape = diamond] 
    planned_path [shape = diamond]
    STOP_YIELD_GO [shape = diamond]
    TLD_Vision_Detector [shape = rectangle]
    Feature_Projection -> ROI
    raw_image -> TLD_Interface
    ROI -> TLD_Interface
    ROI -> TLD_Decision
    TLD_Interface -> light_bounding_boxes_and_states
    light_bounding_boxes_and_states -> TLD_Decision
    light_bounding_boxes_and_states -> GUI
    planned_path -> TLD_Decision
    TLD_Decision -> STOP_YIELD_GO
    TLD_Vision_Detector -> TLD_Interface [ label="bboxes & states"]
    TLD_Interface -> TLD_Vision_Detector [ label="raw img & ROI"]
    
    subgraph cluster_01 {
        label = "Legend";
        Topics [shape = diamond]
        Nodes
        Nodes -> Topics
        
        Node_W_Service_Interface [shape = rectangle]
        Nodes -> Node_W_Service_Interface
        Node_W_Service_Interface -> Nodes
    }
} 
@enduml

@LiyouZhou - The TLD_Vision_Detector is a node, which would have a service interface (which I didn’t give a name), just like a topic. The TLD_Interface would also pass a “cropped” ROI to the TLD_Vision_Detector service, not the whole raw image. Everything else looks correct.

I tweaked the above graph but kept the raw image bit. I want to propose that TLD_Vision_Detector receive the whole image, which would allow it to:

  1. perform more general traffic light detection for any traffic light that is not on the map, e.g. lights for temporary road works or newly installed lights.
  2. reduce the reliance on accurate ROIs, which, as you pointed out, have been less than reliable.

For most basic scenarios, detection can still be based on the cropped image; the cropping would just happen inside TLD_Vision_Detector.

This architecture would be more versatile if TLD_Vision_Detector received the whole image. In that case, one can imagine TLD_Vision_Detector running detection on the whole image and only applying the ROI to determine applicability. The ROI would then only need to point to the general area of the light instead of relying on the ROI containing exactly the traffic light cluster.
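
As a sketch of the “detect on the whole image, use the ROI only for applicability” idea, assuming the detector returns pixel bounding boxes with states (the box format and overlap threshold are illustrative):

```python
def overlap_ratio(box, roi):
    """Fraction of a detected box that falls inside the map-derived ROI.
    Boxes and ROIs are (x, y, w, h) in pixels."""
    bx, by, bw, bh = box
    rx, ry, rw, rh = roi
    ix = max(0.0, min(bx + bw, rx + rw) - max(bx, rx))
    iy = max(0.0, min(by + bh, ry + rh) - max(by, ry))
    return (ix * iy) / float(bw * bh) if bw * bh > 0 else 0.0

def applicable_detections(detections, roi, min_overlap=0.5):
    """Keep only whole-image detections that fall mostly inside the ROI,
    so the ROI only needs to point at the general area of the signal."""
    return [d for d in detections if overlap_ratio(d["bbox"], roi) >= min_overlap]

# Example: two detections; only the one overlapping the loose ROI is kept.
dets = [{"bbox": (100, 50, 20, 40), "state": "red"},
        {"bbox": (400, 60, 20, 40), "state": "green"}]
print(applicable_detections(dets, roi=(90, 40, 60, 80)))
```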

We can take advantage of existing research when building the neural network for the traffic light detection node, as most of it takes a vision-only approach. Research papers in this area:

  1. R-CNN on the Bosch dataset: “A Hierarchical Deep Architecture and Mini-Batch Selection Method For Joint Traffic Sign and Light Detection” (arXiv:1806.07987)
  2. TLD using YOLO: “A deep learning approach to traffic lights: Detection, tracking, and classification” (IEEE)
  3. YOLOv2 on the LISA dataset: “Evaluating State-of-the-Art Object Detector on Challenging Traffic Light Data” (IEEE)

I think this seems reasonable. If there is no further feedback from the community, we will proceed with the above design as our initial blueprints. I’ve created an issue (Implement New Traffic Light Recognition Framework (#2) · Issues · Autoware Foundation / MovedToGitHub / core_perception · GitLab) on the core_perception repository to track implementation.

Hi @JWhitleyWork

Following your discussion above, do you have a set of requirements that the Traffic Light Architecture places on the maps to support this?

If you are willing (and able), I’d like to invite you (or others working with you on this) to present the architecture and the map requirements at a meeting of the Autoware Maps Working Group where we can discuss how to support these requirements.

Regards
Brian

@Brian_Holt - The requirements shouldn’t be that different from the current architecture. We assumed that we would have the x, y, and z position and the facing orientation, in map-relative coordinates, of the 2D face of the front of the signal, which I believe is what the Aisan format currently provides.

@JWhitleyWork
As you probably know, we are currently planning to move to other mapping formats (OpenDRIVE as the physical storage format and Lanelet2 as the internal map-handling library), and they may not include facing orientation. Do you think facing orientation is required, or is it okay if we just have traffic lights associated with the appropriate lanes?


@mitsudome-r Lane association is probably OK instead of facing orientation. We may have to modify feat_proj, but it shouldn’t be a problem.