Transition Modeling

Description

Transition Modeling evaluates an LLM’s ability to predict the preconditions and effects of operators. In EAgent, the evaluation has two parts: generate_prompts and evaluate_results. Given annotated data, running EAgent with the generate_prompts option produces a JSON file of LLM prompts, each asking the LLM to predict the relevant operators for a task. You then pass these prompts to your LLM and collect its outputs. After you specify llm-response-path, running EAgent with the evaluate_results option reports results from two perspectives: logic matching score and planner success rate.
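
The step between generate_prompts and evaluate_results (feeding prompts to a model and saving its outputs) can be sketched as below. This is a minimal sketch, not the framework's own code: the field names identifier, llm_prompt, and llm_output are assumptions about the JSON layout, and echo_model is a hypothetical placeholder for a real LLM call. Check the generated prompt file for the actual keys before relying on it.

```python
import json
import tempfile
from pathlib import Path


def answer_prompts(prompt_path, response_path, query_llm):
    """Read prompts, query a model for each one, and write the responses."""
    prompts = json.loads(Path(prompt_path).read_text(encoding="utf-8"))
    responses = [
        {
            # Assumed keys: "identifier" and "llm_prompt" in, "llm_output" out.
            "identifier": item["identifier"],
            "llm_output": query_llm(item["llm_prompt"]),
        }
        for item in prompts
    ]
    Path(response_path).write_text(json.dumps(responses, indent=2), encoding="utf-8")
    return len(responses)


def echo_model(prompt):
    # Hypothetical placeholder standing in for a real LLM call.
    return "(:action placeholder)"


# Demo on a temporary directory with one made-up prompt.
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "prompts.json"
    response_file = Path(tmp) / "responses.json"
    demo_prompts = [{"identifier": "task_1", "llm_prompt": "Define the operator."}]
    prompt_file.write_text(json.dumps(demo_prompts), encoding="utf-8")
    count = answer_prompts(prompt_file, response_file, echo_model)
    saved = json.loads(response_file.read_text(encoding="utf-8"))
```

The path of the written response file is what llm-response-path would then point to.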

Evaluation Details

Metadata

The evaluation requires some metadata. The raw metadata is provided under resources/{dataset}.

  • id2action: A map from task_id to the list of operators involved in that task. Only these operator names are provided to the LLM for prediction.

  • id2category_2: A map from task_id to its categories. Each task is assigned two categories, drawn from 5 classes (VirtualHome) or 3 classes (BEHAVIOR).

  • id2task: A map from task_id to its task name.

  • gold_action: Annotated ground truth PDDL operators.

  • predicates_category: A map from predicate name to its category (e.g., object_states, spatial_relations).

  • {dataset}_pd.pddl: PDDL domain file without operator definitions; it contains only the list of predicates.
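
To make the relationship between these maps concrete, here is a small sketch of how id2action can be used to select the gold operators for one task. All values are illustrative stand-ins rather than real annotations, and the assumption that gold_action maps an operator name to its PDDL text is ours; the actual file layout may differ.

```python
# Illustrative stand-ins only; not taken from the real resources/{dataset} files.
id2action = {"task_1": ["walk_towards", "grab"]}
id2task = {"task_1": "example task"}

# Assumed shape: operator name -> PDDL operator text.
gold_action = {
    "walk_towards": "(:action walk_towards :parameters (?char ?obj) "
                    ":precondition () :effect (next_to ?char ?obj))",
    "grab": "(:action grab :parameters (?char ?obj) "
            ":precondition (next_to ?char ?obj) :effect (holds_rh ?char ?obj))",
    "switch_on": "(:action switch_on :parameters (?char ?obj) "
                 ":precondition (next_to ?char ?obj) :effect (on ?obj))",
}


def operators_for_task(task_id):
    """Return the gold definitions of only the operators relevant to a task."""
    return {
        name: gold_action[name]
        for name in id2action[task_id]
        if name in gold_action
    }


selected = operators_for_task("task_1")
print(id2task["task_1"], "->", list(selected))
```

Operators not listed for the task (switch_on here) are left out, mirroring the rule that only relevant operator names are shown to the LLM.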

Evaluation Workflow

The evaluation process is primarily handled by the evaluate_results function. Key steps include:

  1. Data Loading:

    • Load the necessary metadata and configurations.

    • Load LLM responses from the specified path.

  2. Evaluation Loop:

    • For each LLM response:

      • Extract predicted operator definitions.

      • Compare predicted preconditions and effects with ground truth.

      • Calculate logic matching scores.

      • Attempt to generate a plan using the predicted operators.

  3. Metric Calculation:

    • Compute precision, recall, and F1 scores for preconditions and effects.

    • Calculate planning success rates.

  4. Results Aggregation:

    • Aggregate results by predicate type, action type, and task type.

  5. Output Generation:

    • Generate summary statistics and save results to JSON files.
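
The comparison and aggregation steps above can be sketched as a loop that tallies matches per predicate type. This is a simplified illustration, not the evaluate_results implementation: it assumes predictions are already parsed into sets of literal strings, compares them by exact string equality, and uses a hypothetical PREDICATE_CATEGORY map in place of predicates_category.

```python
from collections import defaultdict

# Hypothetical stand-in for the predicates_category metadata.
PREDICATE_CATEGORY = {
    "next_to": "spatial_relations",
    "holds_rh": "object_states",
    "plugged_in": "object_states",
}


def tally(predicted, gold, counts):
    """Add true/false positives and false negatives to per-category counts."""
    for literal in predicted | gold:
        # The predicate name is the first token of the literal.
        category = PREDICATE_CATEGORY.get(literal.split()[0], "unknown")
        if literal in predicted and literal in gold:
            counts[category]["tp"] += 1
        elif literal in predicted:
            counts[category]["fp"] += 1
        else:
            counts[category]["fn"] += 1


counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

# One made-up response: predicted vs. gold preconditions of an operator.
responses = [
    {
        "predicted": {"next_to ?char ?obj", "plugged_in ?obj"},
        "gold": {"next_to ?char ?obj", "holds_rh ?char ?obj"},
    }
]

for response in responses:
    tally(response["predicted"], response["gold"], counts)
```

Keeping raw counts per category, rather than scores, lets precision, recall, and F1 be computed once at the end over all responses.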

Metrics

Several key metrics are used in the evaluation:

  1. Logic Matching Score:

    • Calculated for preconditions and effects separately.

    • Broken down by predicate type (e.g., object_states, spatial_relations).

    • Reported as precision, recall, and F1 score.

    • Variables: precond_predicate_type_res_dict, effect_predicate_type_res_dict, full_predicate_type_res_dict.

  2. Planning Success Rate:

    • Measures the ability to generate a valid plan using predicted operators.

    • Calculated per task type.

    • Variable: success_by_task_type_dict.

  3. Action-specific Metrics:

    • Logic matching scores calculated per action.

    • Variables: precond_action_type_dict, effect_action_type_dict, full_action_type_dict.

  4. Predicate-specific Metrics:

    • Logic matching scores calculated per predicate.

    • Variables: precond_predicate_score_dict, effect_predicate_score_dict, full_predicate_score_dict.

  5. Sensitivity Analysis:

    • Measures the impact of individual operator predictions on overall task success.

    • Variables: task_variate_control_by_type, task_variate_control_precond_by_type, task_variate_control_effect_by_type, action_variate_control.
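
For reference, the two headline metrics reduce to simple arithmetic. The sketch below uses the standard definitions of precision, recall, and F1 and a plain per-type success ratio; the function names and the input numbers are illustrative, and how the framework counts matches internally is not shown here.

```python
from collections import defaultdict


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def success_rate_by_type(outcomes):
    """outcomes: iterable of (task_type, planner_succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for task_type, succeeded in outcomes:
        totals[task_type] += 1
        successes[task_type] += int(succeeded)
    return {task_type: successes[task_type] / totals[task_type] for task_type in totals}


# Made-up inputs for illustration.
scores = precision_recall_f1(tp=3, fp=1, fn=2)
rates = success_rate_by_type([("type_a", True), ("type_a", False), ("type_b", True)])
```

With these inputs, precision is 3/4 and recall is 3/5, and type_a succeeds on one of its two tasks.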

Output

The evaluation generates a summary JSON file for each model, containing:

  • Precision, recall, and F1 scores for each predicate type.

  • Planning success rates for each task type.

  • Overall scores across all categories.

These results are saved in the specified output directory, providing a comprehensive view of the LLM’s performance in transition modeling.
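
As a rough picture of what a per-model summary might look like, the sketch below writes a nested dictionary to a JSON file. The key names, the file name pattern, and the zero values are placeholders chosen for illustration; the actual summary layout produced by the evaluation may differ.

```python
import json
import tempfile
from pathlib import Path


def save_summary(summary, output_dir, model_name):
    """Write one model's summary to <output_dir>/<model_name>_summary.json."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{model_name}_summary.json"  # hypothetical naming scheme
    path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return path


# Placeholder structure and values, not real results.
summary = {
    "logic_matching": {
        "object_states": {"precision": 0.0, "recall": 0.0, "f1": 0.0},
        "spatial_relations": {"precision": 0.0, "recall": 0.0, "f1": 0.0},
    },
    "planning_success": {"task_type_a": 0.0},
}

with tempfile.TemporaryDirectory() as tmp:
    written = save_summary(summary, Path(tmp) / "results", "my_model")
    reloaded = json.loads(written.read_text(encoding="utf-8"))
    written_name = written.name
```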

Usage

To run the evaluation:

  1. Ensure all LLM responses are in place.

  2. Run the eai-eval command with the appropriate arguments, setting --dataset to either virtualhome or behavior:

    eai-eval --dataset [virtualhome, behavior] --eval-type transition_modeling --mode evaluate_results
    
  3. The evaluate_results function then processes all LLM responses, calculates the metrics, and saves the results to the specified output directory.

Customization

The evaluation framework is designed to work with both the VirtualHome and BEHAVIOR datasets. The code adjusts automatically to the specified dataset, handling differences in categories and evaluation criteria. To add new datasets or metrics, modify the relevant sections of the evaluate_results function and provide the corresponding metadata.