Subgoal Decomposition
Description
The Subgoal Decomposition module takes the task \(\langle s_0, g\rangle\) as input, where:
\(s_0\) represents the initial state of the environment
\(g\) is the task goal
Similar to Action Sequencing module, Subgoal Decomposition also takes a transition model \(\mathcal{M}\) as input specific to the simulator, which describes the transition between two consecutive states when an action is implemented. For more details on the transition model, refer to:
src/behavior_eval/evolving_graph/evolving_graph.pyforBehaviorsrc/virtualhome_eval/simulation/evolving_graph/environment.pyforVirtualHome
The module generates a subgoal trajectory \(\bar{\phi}=\{\phi_i\}_{i=1}^m\) specific to the final goal \(g\), where each \(\phi_i\) is a subgoal and listed in temporal order. Our embodied agent interface uses BFS algorithm to search possible action lists to satisfy each subgoal \(\phi_i\), which then can validated by simulator through action sequence execution.
Evaluation Details
Evaluation Prerequisites
It is challenging to actually evaluate a generated subgoal trajectory itself, as
There is no referenced subgoal annotation available as ground truth
There are always multiple ways to achieve a task goal. For example, in the task
wash clothes, the agent can either wash with hands, or wash with a washing machine.
One way to evaluate subgoal trajectories generated by planners like LLMs is transforming then into executable sequences, which requires us:
Define a structured symbolic representation for subgoal trajectories for easy translation into actions;
Implement a translation layer to transform existing subgoal trajectories into legal action sequences.
To resolve these challenges, we implement a customized Linear Temporal Logic (LTL), and uses BFS algorithm to help us translate the subgoal trajectories into action sequences with some heuristics.
LTL Implementation
Please refer behavior_eval.tl_formula.simple_tl and behavior_eval.tl_formula.simple_tl_parser for further details about LTL.
BFS Algorithm Implementation
We incorporate some heuristics into our BFS algorithm. In the following, we will briefly introduce how they are implemented in our simulators.
BEHAVIOR
In BEHAVIOR simulator, we implement the translation part in behavior_eval.evaluation.subgoal_decomposition.state_action_translator.py. For your better understanding, here is an example of translating state ontop into actions that can lead to it:
elif primitive_name == 'ontop' or primitive_name == 'onfloor':
receivee = predicate.args[0]
receiver = predicate.args[1]
common_action_list = []
hand_holding = self.check_in_which_hand(receivee, cur_hand_state)
if hand_holding is None:
hand_holding, grasp_actions_list = self.grasp_obj(cur_hand_state, receivee, None)
common_action_list.extend(grasp_actions_list)
ontop_args = [receiver]
action_set_1 = copy.deepcopy(common_action_list)
place_ontop_str = f'<hand>_PLACE_ONTOP'
place_ontop = self.craft_an_action(place_ontop_str, ontop_args, hand_holding)
action_set_1.append(place_ontop)
action_set_2 = copy.deepcopy(common_action_list)
transfer_str = f'<hand>_TRANSFER_CONTENTS_ONTOP'
transfer = self.craft_an_action(transfer_str, ontop_args, hand_holding)
action_set_2.append(transfer)
action_set_candidates.append(action_set_1)
Here, we can see that either PLACE_ONTOP action or TRANSFER_CONETENTS_ONTOP action can lead to ontop state after the action is executed. So we have two action lists.
VirtualHome
In VirtualHome simulator, we implement the translation part in virtualhome_eval.evaluation.subgoal_decomposition.state_action_translator.py. For your better understanding, here is an example of translating the subgoal CLEAN state into action list candidates that can lead to this state:
elif primitive_name == 'CLEAN':
find_item = self.translate_single_action('FIND', obj_name, obj_id)
common_action_set.append(find_item)
ans_set_1 = copy.deepcopy(common_action_set)
ans_set_2 = copy.deepcopy(common_action_set)
wipe_item = self.translate_single_action('WIPE', obj_name, obj_id)
ans_set_1.append(wipe_item)
wash_item = self.translate_single_action('WASH', obj_name, obj_id)
ans_set_2.append(wash_item)
action_set_candidates.append(ans_set_1)
action_set_candidates.append(ans_set_2)
Here, we can see that actions like WIPE and WASH can transition the state to CLEAN. Therefore, we have two action candidates (each is a small action list) for further searching.
As each subgoal coresponds to one or more action list candidates and all subgoals are listed in terms of temporal order, we can now perform a BFS algorithm by executing all possible (with a pruning) combinitions in the simulator.
Evaluation Workflow
Success Definition
The subgoal will be recognized as executable as long as it has at least one action sequence translation; while it will be recognized as a success as long as it has at least one action sequence translation that satisfies the task goal after execution.
Workflow
We consider all possible action sequences \(\mathcal{A}=\{a_j\}^m\) for a subgoal trajectory \(\bar{\phi}\) The evaluation of the subgoal decomposition involves two parts:
Trajectory Evaluation
Purpose: To determine whether there is at least one of the translated action sequences \(\hat{a}\in\mathcal{A}\) is executable in the simulator.
**Process: ** Execute \(\hat{a}\) to obtain the trajectory \(T = ⟨\{s_i\}_{i=0}^{m}, \{a_i\}_{i=1}^{m}⟩\).
Outcome: If an infeasible action occurs, execution may stop early. Execution failures are categorized into:
Missing Steps: Necessary actions that were omitted.
Additional Steps: Unnecessary actions that were included.
Wrong Temporal Order: Actions executed in an incorrect sequence.
Affordance Errors: Actions incompatible with the current state of objects (e.g., trying to “open” an object that cannot be opened).
**Best Selection: ** Iterate all \(\hat{a}\in\mathcal{A}\). If all fail, select one with the minimum errors in total for statistics calculation (E.g., for a subgoal trajectory \(\bar{\phi}_0\), say it has two trajectories \(a_1\) andn \(a_2\). If \(a_1\) has 1 error, and \(a_2\) has 3 errors, then we only use \(a_1\) to represent \(\bar{\phi}_0\)).
Goal Evaluation:
Purpose: To assess if the task goal \(g\) is satisfied after executing \(\hat{a}\in\mathcal{A}\).
Process: Check for goal satisfaction, e.g.
behavior_eval.evolving_graph.evolving_graph.check_success.Partial Goal Satisfaction Evaluation:
Measures the percentage of subgoals in \(g\) that are satisfied by \(\bar{a}\).
Process:
Decompose \(g\) into simple Linear Temporal Logic (LTL) goals \(g_i\).
For each \(g_i\):
Let \(g_i = a₁ \overset{\text{then}}{\ldots} aₖ \textbf{~then~} (p₁ \land \ldots \land p_\ell)\).
Check if a subsequence in \(\bar{a}\) matches \(\{a_j\}_{j=1}^k\).
Evaluate the final state propositions \(p_j\) in \(s_m\).
Assign partial credits based on the number of propositions satisfied.
Final Metric: \(\textit{PartialSucc}(\hat{a}, g) = \max_{g_i \in \mathcal{G}(g, \mathcal{U})} \textit{PartialSucc}(\hat{a}, g_i)\).
**Best Selection: ** Iterate all executable \(\hat{a}\in\mathcal{A}\), select successful ones, or select the one that is closest to achieve \(g\) for statistics.
Metrics
The evaluation metrics are divided into two categories:
Trajectory Metrics:
Execution Success Rate: The proportion of actions in \(\hat{a}\) executed successfully without errors.
Error Rates:
Parsing Errors: Issues in interpreting the action sequence.
Hallucination Errors: Actions involving objects or states not present in the environment.
Argument Errors: Incorrect arguments provided for actions.
Missing Steps: Rate of necessary actions that were omitted.
Additional Steps: Rate of unnecessary actions included.
Wrong Temporal Order: Rate of actions executed in an incorrect sequence.
Affordance Errors: Rate of actions that cannot be performed due to object states.
Goal Metrics:
Task Success Rate: The proportion of tasks where the goal \(g\) is fully satisfied after executing \(\hat{a}\).
Partial Goal Satisfaction Evaluation:
State Goal Satisfaction: Success rate for satisfying state-based goals (e.g., object states).
Relation Goal Satisfaction: Success rate for satisfying relation-based goals (e.g., object relationships).
Action Goal Satisfaction: Success rate for achieving the specified action sequence.
Total Goal Satisfaction: Overall goal achievement rate, combining state, relation, and action goals.
Output
The evaluation process produces several outputs:
Execution Information:
Details for each action in \(\bar{a}\), indicating whether it was executed successfully.
Error types encountered during execution (if any).
Step-by-step execution status.
Goal Satisfaction Results:
Metrics indicating whether the goal was fully or partially satisfied.
Counts of total and satisfied predicates, including:
Total Predicates: Number of conditions evaluated.
Satisfied Predicates: Number of conditions that were satisfied.
Breakdown into edge and node predicates.
Overall Evaluation Metrics:
Goal Evaluation:
Task Success Rate: Overall success rate for completing the task.
State Goal Satisfaction: Success rate for satisfying state-based goals.
Relation Goal Satisfaction: Success rate for satisfying relation-based goals.
Action Goal Satisfaction: Success rate for achieving the specified action sequence.
Total Goal Satisfaction: Combined success rate across all goal types.
Trajectory Evaluation:
Execution Success Rate: Overall success rate of the action sequence execution.
Grammar Errors: Rates of parsing, hallucination, and predicate argument number errors.
Runtime Errors: Rates of wrong order, missing step, affordance, and additional step errors.
Examples
**Task: ** bringing_in_wood_0_Benevolence_1_int_0_2021-09-15_18-42-25
Model: o1-preview
Transition Model (\(\mathcal{M}\)): Behavior simulator
Initial States (\(s_0\)):
onfloor(plywood.0, room_floor_living_room.0)
onfloor(plywood.1, room_floor_living_room.0)
onfloor(plywood.2, room_floor_living_room.0)
onfloor(agent_n_01.1, room_floor_living_room.0)
Goal (\(g\)):
forall(plywood_n_01, onfloor(plywood_n_01, room_floor_kitchen.0))
Output:
[
'holds_rh(plywood.0)',
'onfloor(agent_n_01.1, room_floor_kitchen.0)',
'onfloor(plywood.0, room_floor_kitchen.0) and not holds_rh(plywood.0)',
'onfloor(agent_n_01.1, room_floor_living_room.0)',
'holds_rh(plywood.1)',
'onfloor(agent_n_01.1, room_floor_kitchen.0)',
'onfloor(plywood.1, room_floor_kitchen.0) and not holds_rh(plywood.1)',
'onfloor(agent_n_01.1, room_floor_living_room.0)',
'holds_rh(plywood.2)',
'onfloor(agent_n_01.1, room_floor_kitchen.0)',
'onfloor(plywood.2, room_floor_kitchen.0) and not holds_rh(plywood.2)'
]
Results:
"bringing_in_wood_0_Benevolence_1_int_0_2021-09-15_18-42-25": {
"success": true,
"info": "('Correct', True, [[{'action': 'RIGHT_GRASP', 'object': 'plywood_0'}, {'action': 'RIGHT_PLACE_ONTOP', 'object': 'room_floor_kitchen_0'}, {'action': 'RIGHT_GRASP', 'object': 'plywood_1'}, {'action': 'RIGHT_PLACE_ONTOP', 'object': 'room_floor_kitchen_0'}, {'action': 'RIGHT_GRASP', 'object': 'plywood_2'}, {'action': 'RIGHT_PLACE_ONTOP', 'object': 'room_floor_kitchen_0'}]], [])",
"goal_info": {
"success": true,
"subgoals": [
[
"forall",
"plywood.n.01",
"-",
"plywood.n.01",
"onfloor",
"plywood.n.01",
"floor.n.01_2"
]
],
"subgoal_success": [
true
]
}
},
Overall Results Across Tasks
{
"trajectory_evaluation": {
"execution_success_rate": 0.62,
"grammar_error": {
"parsing": 0.02,
"hallucination": 0.03,
"predicate_argument_number": 0.0
},
"runtime_error": {
"wrong_order": 0.05,
"missing_step": 0.25,
"affordance": 0.03,
"additional_step": 0.07
}
},
"goal_evaluation": {
"task_success_rate": 0.57,
"state_goal": 0.565,
"relation_goal": 0.6936,
"action_goal": 0,
"total_goal": 0.6585
}
}
Usage
To evaluate the subgoal decomposition module, use the following commands:
eai-eval --dataset virtualhome --eval-type subgoal_decomposition --mode evaluate_results
eai-eval --dataset behavior --eval-type subgoal_decomposition --mode evaluate_results
eai-eval --dataset virtualhome --eval-type subgoal_decomposition --mode generate_prompts
eai-eval --dataset behavior --eval-type subgoal_decomposition --mode generate_prompts