Goal Interpretation
Description
Goal Interpretation grounds a natural language instruction in the environment's representations of objects, states, relations, and actions. For example, the task instruction
"Use the rag to clean the trays, the bowl, and the refrigerator. When you are done, leave the rag next to the sink..."
can be grounded to specific objects with IDs, such as fridge (ID: 97), tray (ID: 1), bowl (ID: 1), rag (ID: 0), and sink (ID: 82). Note that a single natural language description can be grounded into a set of multiple goal conditions (object states and relations).
The goal interpretation module takes the initial state **$s_0$** and a natural language instruction **$l_g$** as input, and generates a linear temporal logic (LTL) goal **$\hat{g}_l$** as a formal goal specification that a symbolic planner could take as input. In this paper, we only generate simple LTL goals, each formed by an ordered action sequence and a conjunction of propositions to be satisfied in the final state.
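To make the shape of such a simple goal concrete, here is a minimal sketch of one way to hold it in memory. The class names `Proposition` and `SimpleGoal`, the predicate names `clean` and `next_to`, and the `name.id` object naming are all assumptions made for illustration; they are not the format used by the benchmark.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proposition:
    """One goal condition: an object state (1 argument) or a relation (2 arguments)."""
    predicate: str
    args: tuple

    @property
    def is_relation(self) -> bool:
        return len(self.args) == 2


@dataclass
class SimpleGoal:
    """Hypothetical container for the two parts of a simple goal."""
    action_sequence: list = field(default_factory=list)  # ordered actions
    final_conditions: frozenset = frozenset()  # conjunction of propositions

    def satisfied_by(self, final_state) -> bool:
        # A conjunction holds when every proposition appears in the final state.
        return self.final_conditions <= set(final_state)


# Illustrative grounding of the rag instruction; predicate names are assumed.
goal = SimpleGoal(
    final_conditions=frozenset({
        Proposition("clean", ("tray.1",)),
        Proposition("clean", ("bowl.1",)),
        Proposition("clean", ("fridge.97",)),
        Proposition("next_to", ("rag.0", "sink.82")),
    })
)
```

Storing the final-state conditions as a set reflects that a conjunction is unordered, while the action sequence stays a list because its order matters.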
Evaluation Details
Evaluation Workflow
The evaluation process is primarily handled by the evaluate_results function. Key steps include:
Data Loading:
Load metadata and configurations.
Load LLM responses from the specified path.
Evaluation Loop:
For each LLM response:
Parse the response into JSON format.
Handle format errors and object/state hallucinations.
Evaluate LLM-predicted object states against the ground truth (GT).
Evaluate LLM-predicted object relations against the GT.
Metric Calculation:
Compute precision, recall, and F1 scores for object states and relations.
Calculate format error and hallucination rates.
Results Aggregation:
Aggregate results by goal type and error type.
Log Output:
Generate summary statistics and save results to summary files.
Save detailed per-sample error analysis to log files.
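The loop described above can be sketched roughly as follows. This is not the `evaluate_results` implementation; it is a simplified illustration under several assumptions: responses are JSON objects with hypothetical keys `states` (lists of `[object, state]`) and `relations` (lists of `[object, relation, object]`), error counts are taken per response, and hallucinated conditions are still scored (they end up as false positives).

```python
import json
from collections import Counter


def _well_formed(conditions, arity):
    """True when conditions is a list of string lists of the given length."""
    return isinstance(conditions, list) and all(
        isinstance(c, list) and len(c) == arity
        and all(isinstance(x, str) for x in c)
        for c in conditions
    )


def parse_response(raw):
    """Return {'states': ..., 'relations': ...}, or None on a format error."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    states = parsed.get("states", [])
    relations = parsed.get("relations", [])
    if not (_well_formed(states, 2) and _well_formed(relations, 3)):
        return None
    return {"states": states, "relations": relations}


def count_matches(predicted, ground_truth):
    """Compare conditions as sets and return TP/FP/FN counts."""
    pred = {tuple(c) for c in predicted}
    gt = {tuple(c) for c in ground_truth}
    return Counter(tp=len(pred & gt), fp=len(pred - gt), fn=len(gt - pred))


def evaluate_sketch(responses, ground_truths, known_objects, known_states):
    """Hypothetical stand-in for the evaluation loop; returns raw counts."""
    totals = {"states": Counter(), "relations": Counter()}
    errors = Counter()
    for task_id, raw in responses.items():
        parsed = parse_response(raw)
        if parsed is None:
            errors["format_error"] += 1
            continue  # nothing to score for this response
        states, relations = parsed["states"], parsed["relations"]
        objects = [c[0] for c in states]
        objects += [o for c in relations for o in (c[0], c[2])]
        if any(o not in known_objects for o in objects):
            errors["object_hallucination"] += 1
        if any(c[1] not in known_states for c in states):
            errors["state_hallucination"] += 1
        gt = ground_truths[task_id]
        totals["states"].update(count_matches(states, gt["states"]))
        totals["relations"].update(count_matches(relations, gt["relations"]))
    n = len(responses)
    rates = {name: count / n for name, count in errors.items()} if n else {}
    return totals, errors, rates
```

Aggregation by goal type and the writing of summary and log files are omitted here to keep the sketch short.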
Metrics
Our metrics are broken down into four primary categories:
Grammatical Errors:
Rate and number of state hallucinations.
Rate and number of object hallucinations.
Rate and number of output format errors.
Object States:
Number of satisfied conditions (TP).
Number of unsatisfied conditions (FN).
Number of false positive conditions (FP).
Confusion-matrix metrics (precision, recall, F1 score).
Object Relations:
Number of satisfied conditions (TP).
Number of unsatisfied conditions (FN).
Number of false positive conditions (FP).
Confusion-matrix metrics (precision, recall, F1 score).
Overall Performance:
Number of satisfied conditions (TP).
Number of unsatisfied conditions (FN).
Number of false positive conditions (FP).
Confusion-matrix metrics (precision, recall, F1 score).
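For reference, precision, recall, and F1 follow from the TP, FP, and FN counts in the standard way. The sketch below shows that arithmetic; the function names are hypothetical, returning 0.0 when a denominator is zero is an assumed convention, and computing the overall score by summing state and relation counts (micro-averaging) is also an assumption rather than something the source states.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; 0.0 is returned when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def overall_counts(state_counts, relation_counts):
    """Assumed micro-average: add the per-category counts before scoring."""
    return {
        key: state_counts[key] + relation_counts[key]
        for key in ("tp", "fp", "fn")
    }


# Example: 3 satisfied, 1 false positive, 2 unsatisfied conditions.
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
```

With these counts, precision is 3/4, recall is 3/5, and F1 is 2/3.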