Action Sequencing

Description

The action sequencing module takes the task \(\langle s_0, g \rangle\) as input, where:

\(s_0\) represents the initial state of the environment.
\(g\) is the task goal.

The module relies on a simulator-specific transition model \(\mathcal{M}\), which defines how each action changes the environment state. The transition models are implemented in:

src/behavior_eval/evolving_graph/evolving_graph.py for Behavior
src/virtualhome_eval/simulation/evolving_graph/environment.py for VirtualHome

The module generates an action sequence \(\bar{a} = \{a_i\}_{i=1}^{n}\), representing the actions required to move from the initial state toward achieving the task goal.

Evaluation Details

Evaluation Workflow

The evaluation of the action sequencing module involves two main components:

Trajectory Evaluation:
- Purpose: To determine whether the generated action sequence \(\bar{a}\) is executable in the simulator.
- Process: Execute \(\bar{a}\) in the simulator, starting from \(s_0\), to obtain the trajectory \(T = \langle \{s_i\}_{i=0}^{m}, \{a_i\}_{i=1}^{m} \rangle\) (see, e.g., behavior_eval.evolving_graph.eval_evolving_graph_env.apply_action).
- Outcome: If an infeasible action occurs, execution may stop early, in which case \(m < n\). Execution failures are categorized into:
  - Missing Steps: Necessary actions that were omitted.
  - Additional Steps: Unnecessary actions that were included.
  - Wrong Temporal Order: Actions executed in an incorrect sequence.
  - Affordance Errors: Actions incompatible with the current state of objects (e.g., trying to “open” an object that cannot be opened).
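The execution loop described above can be illustrated with a minimal sketch. The transition model here (`toy_step`) and the state class (`ToyState`) are hypothetical stand-ins invented for illustration; they are not the Behavior or VirtualHome simulators, and they only model two-handed grasping and placing. The point is the control flow: apply actions one by one, record a per-step status, and stop at the first infeasible action.

```python
from dataclasses import dataclass, field


@dataclass
class ToyState:
    # Which object each hand holds (None means the hand is empty).
    holding: dict = field(default_factory=lambda: {"LEFT": None, "RIGHT": None})
    # Set of (item, container) pairs.
    inside: set = field(default_factory=set)


def toy_step(state, action):
    """Hypothetical transition model: returns (next_state, feasible)."""
    hand, _, verb = action["action"].partition("_")
    if hand not in state.holding:
        return state, False
    holding = dict(state.holding)
    inside = set(state.inside)
    if verb == "GRASP":
        if holding[hand] is not None:  # the hand is already occupied
            return state, False
        holding[hand] = action["object"]
    elif verb == "PLACE_INSIDE":
        if holding[hand] is None:  # nothing to place
            return state, False
        inside.add((holding[hand], action["object"]))
        holding[hand] = None
    else:
        return state, False
    return ToyState(holding, inside), True


def execute(s0, actions):
    """Run actions from s0; stop at the first infeasible action."""
    states, info = [s0], []
    for step, action in enumerate(actions):
        next_state, ok = toy_step(states[-1], action)
        info.append({**action, "execution_success": ok, "step": step})
        if not ok:
            break
        states.append(next_state)
    return states, info


plan = [
    {"action": "LEFT_GRASP", "object": "candle_0"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_0"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_0"},  # right hand is empty
    {"action": "LEFT_GRASP", "object": "cookie_0"},  # never reached
]
states, info = execute(ToyState(), plan)
```

In this toy run the third action is infeasible, so the trajectory contains only the states reached by the first two actions and the fourth action is never attempted.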
Goal Evaluation:
- Purpose: To assess if the task goal \(g\) is satisfied after executing \(\bar{a}\).
- Process: Check whether the state reached after execution satisfies \(g\) (see, e.g., behavior_eval.evolving_graph.evolving_graph.check_success).
- Partial Goal Satisfaction Evaluation:
  - Measures the percentage of subgoals in \(g\) that are satisfied by \(\bar{a}\).
  - Process:
    - Decompose \(g\) into simple Linear Temporal Logic (LTL) goals \(g_i\).
    - For each \(g_i\):
      - Let \(g_i = a_1 \overset{\text{then}}{\ldots} a_k \textbf{~then~} (p_1 \land \ldots \land p_\ell)\).
      - Check whether some subsequence of \(\bar{a}\) matches \(\{a_j\}_{j=1}^{k}\).
      - Evaluate the final state propositions \(p_j\) in \(s_m\).
    - Assign partial credits based on the number of propositions satisfied.
  - Final Metric: \(\textit{PartialSucc}(\bar{a}, g) = \max_{g_i \in \mathcal{G}(g, \mathcal{U})} \textit{PartialSucc}(\bar{a}, g_i)\).
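The max-over-options structure of the final metric can be sketched as follows. The exact credit assignment is an assumption made for illustration: each required action and each proposition counts as one unit, the action part earns credit only if the whole required sequence appears as a subsequence of the executed actions, and a goal option is represented as a hypothetical `(actions, propositions)` pair. The actual scoring in the evaluator may differ.

```python
def is_subsequence(required, executed):
    """True if `required` appears in `executed` in order (gaps allowed)."""
    remaining = iter(executed)
    return all(action in remaining for action in required)


def partial_succ_option(executed, final_state, option):
    """Fraction of units (actions + propositions) satisfied for one option g_i."""
    actions, props = option
    total = len(actions) + len(props)
    if total == 0:
        return 1.0  # an empty option is trivially satisfied
    credit = sum(1 for p in props if p in final_state)
    if actions and is_subsequence(actions, executed):
        credit += len(actions)
    return credit / total


def partial_succ(executed, final_state, options):
    """PartialSucc(a, g): best score over all goal options g_i."""
    return max(
        (partial_succ_option(executed, final_state, o) for o in options),
        default=0.0,
    )


candle = ("inside", "candle_0", "basket_0")
cookie = ("inside", "cookie_0", "basket_0")
cheese = ("inside", "cheese_0", "basket_0")
bow = ("inside", "bow_0", "basket_0")

final_state = {candle, cookie}
options = [
    ([], [candle, cheese]),       # 1 of 2 propositions hold
    ([], [candle, cookie, bow]),  # 2 of 3 propositions hold
]
score = partial_succ([], final_state, options)
```

Taking the maximum means a sequence is scored against whichever goal option it comes closest to satisfying.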

Metrics

The evaluation metrics are divided into two categories:

Trajectory Metrics:
- Execution Success Rate: The proportion of tasks whose action sequence \(\bar{a}\) executes in the simulator without errors.
- Error Rates:
  - Parsing Errors: Issues in interpreting the action sequence.
  - Hallucination Errors: Actions involving objects or states not present in the environment.
  - Argument Errors: Incorrect arguments provided for actions.
  - Missing Steps: Rate of necessary actions that were omitted.
  - Additional Steps: Rate of unnecessary actions included.
  - Wrong Temporal Order: Rate of actions executed in an incorrect sequence.
  - Affordance Errors: Rate of actions that cannot be performed due to object states.
Goal Metrics:
- Task Success Rate: The proportion of tasks where the goal \(g\) is fully satisfied after executing \(\bar{a}\).
- Partial Goal Satisfaction Evaluation:
  - State Goal Satisfaction: Success rate for satisfying state-based goals (e.g., object states).
  - Relation Goal Satisfaction: Success rate for satisfying relation-based goals (e.g., object relationships).
  - Action Goal Satisfaction: Success rate for achieving the specified action sequence.
  - Total Goal Satisfaction: Overall goal achievement rate, combining state, relation, and action goals.

Output

The evaluation process produces several outputs:

Execution Information:
- Details for each action in \(\bar{a}\), indicating whether it was executed successfully.
- Error types encountered during execution (if any).
- Step-by-step execution status.
Goal Satisfaction Results:
- Metrics indicating whether the goal was fully or partially satisfied.
- Counts of total and satisfied predicates, including:
  - Total Predicates: Number of conditions evaluated.
  - Satisfied Predicates: Number of conditions that were satisfied.
  - Breakdown into edge and node predicates.
Overall Evaluation Metrics:
- Goal Evaluation:
  - Task Success Rate: Overall success rate for completing the task.
  - State Goal Satisfaction: Success rate for satisfying state-based goals.
  - Relation Goal Satisfaction: Success rate for satisfying relation-based goals.
  - Action Goal Satisfaction: Success rate for achieving the specified action sequence.
  - Total Goal Satisfaction: Combined success rate across all goal types.
- Trajectory Evaluation:
  - Execution Success Rate: Overall success rate of the action sequence execution.
  - Grammar Errors: Rates of parsing, hallucination, and predicate argument number errors.
  - Runtime Errors: Rates of wrong order, missing step, affordance, and additional step errors.

Example

Task: assembling_gift_baskets_0_Beechwood_0_int_0_2021-10-26_12-46-37

Model: o1-preview

Transition Model (\(\mathcal{M}\)): Behavior simulator

Initial States (\(s_0\)):

[
    "['onfloor', 'basket_0', 'room_floor_living_room_0']",
    "['onfloor', 'basket_1', 'room_floor_living_room_0']",
    "['onfloor', 'basket_2', 'room_floor_living_room_0']",
    "['onfloor', 'basket_3', 'room_floor_living_room_0']",
    "['ontop', 'candle_0', 'breakfast_table_13']",
    "['ontop', 'candle_1', 'breakfast_table_13']",
    "['ontop', 'candle_2', 'breakfast_table_13']",
    "['ontop', 'candle_3', 'breakfast_table_13']",
    "['ontop', 'cookie_0', 'breakfast_table_13']",
    "['ontop', 'cookie_1', 'breakfast_table_13']",
    "['ontop', 'cookie_2', 'breakfast_table_13']",
    "['ontop', 'cookie_3', 'breakfast_table_13']",
    "['ontop', 'cheese_0', 'coffee_table_12']",
    "['ontop', 'cheese_1', 'coffee_table_12']",
    "['ontop', 'cheese_2', 'coffee_table_12']",
    "['ontop', 'cheese_3', 'coffee_table_12']",
    "['ontop', 'bow_0', 'coffee_table_12']",
    "['ontop', 'bow_1', 'coffee_table_12']",
    "['ontop', 'bow_2', 'coffee_table_12']",
    "['ontop', 'bow_3', 'coffee_table_12']",
    "['onfloor', 'agent.n.01_1', 'room_floor_living_room_0']"
]
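Each entry in the listing above is a string that contains a Python-style list literal. As a minimal sketch, such entries could be turned into tuples with `ast.literal_eval` and grouped by predicate. The helper names `parse_facts` and `group_by_predicate` are hypothetical and are not part of the evaluation code.

```python
import ast


def parse_facts(raw_facts):
    """Convert strings like "['ontop', 'candle_0', 'table']" into tuples."""
    facts = []
    for raw in raw_facts:
        parts = ast.literal_eval(raw)  # safe evaluation of literals only
        if (
            not isinstance(parts, list)
            or not parts
            or not all(isinstance(p, str) for p in parts)
        ):
            raise ValueError(f"unexpected fact format: {raw!r}")
        facts.append(tuple(parts))
    return facts


def group_by_predicate(facts):
    """Map each predicate name to the list of its argument tuples."""
    grouped = {}
    for predicate, *args in facts:
        grouped.setdefault(predicate, []).append(tuple(args))
    return grouped


raw = [
    "['onfloor', 'basket_0', 'room_floor_living_room_0']",
    "['ontop', 'candle_0', 'breakfast_table_13']",
    "['ontop', 'cookie_0', 'breakfast_table_13']",
]
facts = parse_facts(raw)
grouped = group_by_predicate(facts)
```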

Goal (\(g\)):

[
    "['forpairs', 'basket.n.01', '-', 'basket.n.01', 'candle.n.01', '-', 'candle.n.01', 'inside', 'candle.n.01', 'basket.n.01']",
    "['forpairs', 'basket.n.01', '-', 'basket.n.01', 'cheese.n.01', '-', 'cheese.n.01', 'inside', 'cheese.n.01', 'basket.n.01']",
    "['forpairs', 'basket.n.01', '-', 'basket.n.01', 'cookie.n.01', '-', 'cookie.n.01', 'inside', 'cookie.n.01', 'basket.n.01']",
    "['forpairs', 'basket.n.01', '-', 'basket.n.01', 'bow.n.08', '-', 'bow.n.08', 'inside', 'bow.n.08', 'basket.n.01']"
]

Output:

[
    {"action": "LEFT_GRASP", "object": "candle_0"},
    {"action": "RIGHT_GRASP", "object": "cookie_0"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_0"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_0"},
    {"action": "LEFT_GRASP", "object": "cheese_0"},
    {"action": "RIGHT_GRASP", "object": "bow_0"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_0"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_0"},
    {"action": "LEFT_GRASP", "object": "candle_1"},
    {"action": "RIGHT_GRASP", "object": "cookie_1"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_1"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_1"},
    {"action": "LEFT_GRASP", "object": "cheese_1"},
    {"action": "RIGHT_GRASP", "object": "bow_1"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_1"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_1"},
    {"action": "LEFT_GRASP", "object": "candle_2"},
    {"action": "RIGHT_GRASP", "object": "cookie_2"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_2"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_2"},
    {"action": "LEFT_GRASP", "object": "cheese_2"},
    {"action": "RIGHT_GRASP", "object": "bow_2"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_2"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_2"},
    {"action": "LEFT_GRASP", "object": "candle_3"},
    {"action": "RIGHT_GRASP", "object": "cookie_3"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_3"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_3"},
    {"action": "LEFT_GRASP", "object": "cheese_3"},
    {"action": "RIGHT_GRASP", "object": "bow_3"},
    {"action": "LEFT_PLACE_INSIDE", "object": "basket_3"},
    {"action": "RIGHT_PLACE_INSIDE", "object": "basket_3"}
]
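A raw model response in the format above could be checked before execution along the following lines. This is a sketch under stated assumptions: the function name `parse_action_sequence`, the `known_objects` argument, and the rule that an unknown object name counts as a hallucination are all illustrative choices, not the evaluator's actual implementation. Only the error labels "parsing" and "hallucination" are taken from this document.

```python
import json


def parse_action_sequence(raw_text, known_objects):
    """Return (actions, error); error is None, "parsing", or "hallucination"."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return [], "parsing"
    if not isinstance(data, list):
        return [], "parsing"

    actions = []
    for item in data:
        well_formed = (
            isinstance(item, dict)
            and isinstance(item.get("action"), str)
            and isinstance(item.get("object"), str)
        )
        if not well_formed:
            return [], "parsing"
        if item["object"] not in known_objects:
            # Assumed rule: referring to an object absent from the scene.
            return actions, "hallucination"
        actions.append({"action": item["action"], "object": item["object"]})
    return actions, None


known = {"candle_0", "basket_0"}
good = (
    '[{"action": "LEFT_GRASP", "object": "candle_0"},'
    ' {"action": "LEFT_PLACE_INSIDE", "object": "basket_0"}]'
)
actions, error = parse_action_sequence(good, known)
```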

Results:

"llm_rst": {
    "error_type": {
        "parsing": null,            # No parsing errors occurred
        "hallucination": null,      # No hallucination errors occurred (no references to nonexistent objects or states)
        "arguments": null,          # No argument errors occurred
        "execution_success": true   # Execution was successful
    },
    "goal_rst": {
        "all_goal_satisfied_ig": true,       # All goals were satisfied according to the internal graph (IG)
        "all_goal_satisfied_graph": true,    # All goals were satisfied according to the external goal graph
        "tot_predicates": 4.0,               # Total number of predicates (conditions) evaluated
        "tot_edge_predicates": 4.0,          # Total number of edge predicates (relationships between entities)
        "tot_node_predicates": 0.0,          # Total number of node predicates (properties of entities)
        "satisfied_predicates": 4.0,         # Number of predicates that were satisfied
        "satisfied_edge_predicates": 4.0,    # Number of satisfied edge predicates
        "satisfied_node_predicates": 0.0,    # Number of satisfied node predicates
        "pure_edge_predicates": 4,           # Number of pure edge predicates (without involving nodes)
        "pure_node_predicates": 0,           # Number of pure node predicates
        "mixed_predicates": 0,               # Number of mixed predicates (involving both edges and nodes)
        "satisfied_pure_edge_predicates": 4, # Number of satisfied pure edge predicates
        "satisfied_pure_node_predicates": 0, # Number of satisfied pure node predicates
        "satisfied_mixed_predicates": 0      # Number of satisfied mixed predicates
    },
    "execution_info": [
        {
            "action": "LEFT_GRASP",
            "object": "candle_0",
            "execution_success": true,
            "step": 0
        },
        {
            "action": "RIGHT_GRASP",
            "object": "cookie_0",
            "execution_success": true,
            "step": 1
        },
        {
            "action": "LEFT_PLACE_INSIDE",
            "object": "basket_0",
            "execution_success": true,
            "step": 2
        },
        {
            "action": "RIGHT_PLACE_INSIDE",
            "object": "basket_0",
            "execution_success": true,
            "step": 3
        },
        ...
        {
            "action": "RIGHT_PLACE_INSIDE",
            "object": "basket_3",
            "execution_success": true,
            "step": 31
        }
    ]
}

Overall Results Across Tasks

{
    "goal_evaluation": {
        "task_success_rate": 0.81,    # Overall success rate for completing the task
        "state_goal": 0.895,          # Success rate for satisfying state-based goals
        "relation_goal": 0.844,       # Success rate for satisfying relation-based goals
        "action_goal": 0,             # Success rate for achieving the specified action sequence
        "total_goal": 0.8579          # Combined goal achievement rate
    },
    "trajectory_evaluation": {
        "execution_success_rate": 0.91,   # Overall success rate of action sequence execution
        "grammar_error": {
            "parsing": 0.0,               # No parsing errors
            "hallucination": 0.0,         # No hallucination errors
            "predicate_argument_number": 0.0  # No predicate argument number errors
        },
        "runtime_error": {
            "wrong_order": 0.0,           # No wrong order errors
            "missing_step": 0.06,         # 6% of sequences had missing steps
            "affordance": 0.02,           # 2% had affordance errors
            "additional_step": 0.03       # 3% had additional steps
        }
    }
}
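Rates like those above could be derived from per-task records roughly as follows. This sketch makes several assumptions that should be checked against the actual evaluator: each record is shaped like the "llm_rst" object in the example, "all_goal_satisfied_graph" is the flag that defines task success, node predicates correspond to state goals and edge predicates to relation goals, and goal satisfaction is pooled over all predicates rather than averaged per task. The function names are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when there is nothing to count."""
    return numerator / denominator if denominator else 0.0


def aggregate(records):
    """Summarize per-task records into overall rates (assumed definitions)."""
    n = len(records)
    goals = [r["goal_rst"] for r in records]

    def pooled(satisfied_key, total_key):
        return safe_ratio(
            sum(g[satisfied_key] for g in goals),
            sum(g[total_key] for g in goals),
        )

    return {
        "task_success_rate": safe_ratio(
            sum(bool(g["all_goal_satisfied_graph"]) for g in goals), n
        ),
        "execution_success_rate": safe_ratio(
            sum(bool(r["error_type"]["execution_success"]) for r in records), n
        ),
        # Assumption: node predicates ~ state goals, edge predicates ~ relation goals.
        "state_goal": pooled("satisfied_node_predicates", "tot_node_predicates"),
        "relation_goal": pooled("satisfied_edge_predicates", "tot_edge_predicates"),
        "total_goal": pooled("satisfied_predicates", "tot_predicates"),
    }


records = [
    {
        "error_type": {"execution_success": True},
        "goal_rst": {
            "all_goal_satisfied_graph": True,
            "tot_predicates": 4.0, "satisfied_predicates": 4.0,
            "tot_edge_predicates": 4.0, "satisfied_edge_predicates": 4.0,
            "tot_node_predicates": 0.0, "satisfied_node_predicates": 0.0,
        },
    },
    {
        "error_type": {"execution_success": False},
        "goal_rst": {
            "all_goal_satisfied_graph": False,
            "tot_predicates": 4.0, "satisfied_predicates": 2.0,
            "tot_edge_predicates": 2.0, "satisfied_edge_predicates": 1.0,
            "tot_node_predicates": 2.0, "satisfied_node_predicates": 1.0,
        },
    },
]
summary = aggregate(records)
```

The two records here are made-up inputs for the sketch, not results from any model.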

Usage

Use the following commands to evaluate model outputs (evaluate_results) or to generate prompts (generate_prompts) for the action sequencing module:

eai-eval --dataset virtualhome --eval-type action_sequencing --mode evaluate_results
eai-eval --dataset behavior --eval-type action_sequencing --mode evaluate_results
eai-eval --dataset virtualhome --eval-type action_sequencing --mode generate_prompts
eai-eval --dataset behavior --eval-type action_sequencing --mode generate_prompts
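For scripted runs, the four invocations above can be assembled programmatically. The sketch below only builds and prints the argument lists, using the flags and values shown in this section; the helper name `build_commands` is hypothetical, and prompts are listed before evaluation on the assumption that they are generated first.

```python
import itertools
import shlex

DATASETS = ("virtualhome", "behavior")
MODES = ("generate_prompts", "evaluate_results")


def build_commands():
    """Return one argv list per (mode, dataset) combination."""
    return [
        ["eai-eval", "--dataset", dataset,
         "--eval-type", "action_sequencing", "--mode", mode]
        for mode, dataset in itertools.product(MODES, DATASETS)
    ]


commands = build_commands()
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once `eai-eval` is installed and on the PATH.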