(enh) StuckDetector: fix+enhance syntax error loop detection #3628

tobitege · 2024-08-28T08:48:17Z

Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

Loop detection of one type of syntax error was broken. Expanded to 4 types of syntax errors.

Give a summary of what the PR does, explaining any non-trivial design decisions

There was only one type of syntax error in the stuck detection, like unterminated string literal.
However, this was broken as the detection only checked a limited range of characters for it.

Instead, the detection is now expanded to the last 4 lines of the observation.
In recent enhancements there were potentially 2 lines added for interpreter and prompt,
making the check basically skip the detection each time.

Unit tests adapted and fixed accordingly.

List of detected errors is now this:

    SYNTAX_ERROR_MESSAGES = [
        'SyntaxError: unterminated string literal (detected at line',
        'SyntaxError: invalid syntax. Perhaps you forgot a comma?',
        'SyntaxError: incomplete input',
        "E999 SyntaxError: unmatched ')'",
    ]

The added errors is a distinct collection from 98 occurences in a bench that I ran.

tobitege · 2024-08-28T09:00:00Z

This is an excerpt from a recent Aider bench showing different kinds of syntax errors, that appeared in a loop (I've shortened the content here):

566-  Cell In[1], line 3
567-    to_replace="""# subclassing the built-in ValueError to create MeetupDayException
568-               ^
569:SyntaxError: invalid syntax. Perhaps you forgot a comma?
570-[Jupyter current working directory: /workspace]
--
4890-  Cell In[1], line 3
4891-    to_replace="""class MeteredFile(io.BufferedRandom):
4892-               ^
4893:SyntaxError: invalid syntax. Perhaps you forgot a comma?
4894-[Jupyter current working directory: /workspace]
--
14510-  Cell In[1], line 50
14511-    return f"({self.area_code}) {self.exchange_code}-{self.subscriber_number}"""")
14512-                                                                                ^
14513:SyntaxError: unterminated string literal (detected at line 50)
14514-[Jupyter current working directory: /workspace]
--
14640-05:51:13 - openhands:INFO: agent_controller.py:215 - **IPythonRunCellObservation**
14641-[Your proposed edit has introduced new syntax error(s). Please understand the errors and retry your edit command.]
14642-ERRORS:
14643:/workspace/phone_number.py:45:84: E999 SyntaxError: unmatched ')'
--
19323-  Cell In[1], line 4
19324-    return f"({self.area_code}) {self.exchange_code}-{self.subscriber_number}"""",
19325-                                                                                ^
19326:SyntaxError: unterminated string literal (detected at line 4)
19327-[Jupyter current working directory: /workspace]
--
19358-  Cell In[1], line 4
19359-    return f"({self.area_code}) {self.exchange_code}-{self.subscriber_number}"""",
19360-                                                                                ^
19361:SyntaxError: unterminated string literal (detected at line 4)
19362-[Jupyter current working directory: /workspace]
--
25908-  Cell In[1], line 30
25909-    new_content="""from collections import Counter
25910-                ^
25911:SyntaxError: incomplete input
25912-[Jupyter current working directory: /workspace]
--
25991-  Cell In[1], line 30
25992-    new_content="""from collections import Counter
25993-                ^
25994:SyntaxError: incomplete input
25995-[Jupyter current working directory: /workspace]
--
80271-  Cell In[1], line 41
80272-    return json.dumps({"users": sorted([lender_obj, borrower_obj], key=lambda x: x["name"])})""", """import json
80273-                                                                                                  ^
80274:SyntaxError: incomplete input
80275-[Jupyter current working directory: /workspace]
--
91020-  Cell In[1], line 41
91021-    new_content="""class Scale:
91022-                ^
91023:SyntaxError: incomplete input
91024-[Jupyter current working directory: /workspace]
--
96875-  Cell In[1], line 56
96876-    new_content="""def parse_properties(s):
96877-                ^
96878:SyntaxError: incomplete input
96879-[Jupyter current working directory: /workspace]
--
98626-  File "/workspace/sgf_parsing.py", line 45
98627-    if s[0] == '\' and len(s) > 1:
98628-               ^
98629:SyntaxError: unterminated string literal (detected at line 45)
98630-
--
119294-  Cell In[1], line 3
119295-    to_replace="def equilateral(sides):
119296-               ^
119297:SyntaxError: unterminated string literal (detected at line 3)
119298-[Jupyter current working directory: /workspace]
--
133225-  Cell In[1], line 57
133226-    new_content="""def answer(question):
133227-                ^
133228:SyntaxError: incomplete input
133229-[Jupyter current working directory: /workspace]
--
137464-  Cell In[1], line 51
137465-    new_content="""HOUSES = range(5)
137466-                ^
137467:SyntaxError: incomplete input
137468-[Jupyter current working directory: /workspace]
--
139180-  Cell In[1], line 59
139181-    new_content="""def is_valid_solution(solution):
139182-                ^
139183:SyntaxError: incomplete input
139184-[Jupyter current working directory: /workspace]
--
142461-  Cell In[1], line 57
142462-    return zipper.focus""", """from __future__ import annotations
142463-                            ^
142464:SyntaxError: incomplete input
142465-[Jupyter current working directory: /workspace]

openhands/controller/stuck.py

tests/unit/test_is_stuck.py

enyst · 2024-08-29T07:56:59Z

My hope for this particular stuck check was that we can get rid of it, replace the hard-coded error with figuring out there was a syntax error from the runtime, or the ast, or jupyter. I think that would be better when we can do it, much more reliable, and catching obviously more than the exact match as it was.

If we need to expand it first as this PR is doing, okay, let's expand it. I would personally be cautious of the probability that being too loose when playing with strings could catch cases that shouldn't be caught though. That said, when you're happy with it, feel free to take it in and let's try it.

tobitege · 2024-08-29T08:06:08Z

I'm having trouble imagining false positives happening explicitly for bash/ipython command executions, that have 3+ times "SyntaxError...." in their last 4 lines coming back from LLM, except for editing/execution errors? 🤔 @enyst

enyst · 2024-08-29T08:15:27Z

Working with files that include the text "Syntax error" in various contexts. Like this exact file, dunno, I don't have an exact scenario, but why wouldn't the LLM want to edit lines catching or documenting syntax errors? 🤔

tobitege · 2024-08-29T08:31:26Z

I'll try to narrow it down further, with e.g. checks for the last 2 lines coming from us, like

[Jupyter current working directory: xxx]
[Jupyter Python interpreter: xxx]

tobitege · 2024-08-29T12:35:00Z

@enyst - I made the stuck detection way more stringent and also added more unit test cases for the other errors.
One I actually had to remove again as that was a different linting error.

Example output for new test test_is_not_stuck_incomplete_input_error: different line numbers and number of lines in between (incl. bottom 2 lines as shortened test string for startswith):

   Cell In[1], line 211
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input



[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 177
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input


[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 196
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input


[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 215
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input


[Jupyter current working directory:
[Jupyter Python interpreter:
PASSED

The is stuck version of it (I know, it doesn't read nicely here 😄 ):

to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 42
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 42
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 42
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
PASSED

enyst · 2024-08-29T15:27:07Z

tests/unit/test_is_stuck.py


    def test_is_stuck_ipython_syntax_error_not_at_end(
        self, stuck_detector: StuckDetector, event_stream: EventStream
    ):
+        # this test is to make sure we don't get false positives
+        # since the "at line x" is changing in between!


🤔
FWIW, I wonder if requiring the error message to be identical isn't too restrictive now, hah (I'll see myself out lol). Please correct me if I'm not seeing right (looking at the diff), around line 190 in stuck.py it looks like the error too has to be entirely the same. It's an interesting question. Because I recall the stuck llm tries to make minor variations when it sees a syntax error, like it might try to run stuff starting with quotes or backquotes. It's still a syntax error, but it can be on another word or another line.

(FWIW I tend to think that this test was succeeding because the "syntax error" string had to be near the end of the line or something like that, not requiring an identical line number. I could be wrong though.)

Maybe that's all okay, since we're reducing from 4 to 3. 🤔 Three but identical might work well enough. The llm did seem to have a favorite thing to try lol, while variations of the code to run (e.g. when it tried running with quotes) were more rare than the main thing it wanted. If this works in the scenario you found, we can go with it.

The error doesn't have to be exactly the same, but the first line in those errors is like this:
Cell In[1], line 42
So similar to the other syntax error, where a change of the line number "breaks" the loop, I added the first line to be the same in 3 observations. The first line does not contain the actual error, but the location.
Then it is up to the bottom 3 lines to match, basically (error type and 2 lines of our prompt).

I'll try to re-run a couple of instances from the Aider bench, from where I had that error collection from originally. :)

…ds-AI#3628) * fix StuckDetector and add more errors for detection * more stringent error detection and more unit tests

fix StuckDetector and add more errors for detection

cf1307a

tobitege added enhancement New feature or request agent quality Problems with specific agents labels Aug 28, 2024

tobitege requested review from enyst and xingyaoww August 28, 2024 08:48

Merge branch 'main' into enh-stuck-detection

acccc7a

enyst reviewed Aug 29, 2024

View reviewed changes

openhands/controller/stuck.py Outdated Show resolved Hide resolved

enyst reviewed Aug 29, 2024

View reviewed changes

tests/unit/test_is_stuck.py Outdated Show resolved Hide resolved

more stringent error detection and more unit tests

9a1d8fd

Merge branch 'main' into enh-stuck-detection

14c6443

tobitege requested a review from enyst August 29, 2024 12:35

enyst reviewed Aug 29, 2024

View reviewed changes

enyst approved these changes Aug 29, 2024

View reviewed changes

tobitege merged commit a2d94c9 into main Aug 29, 2024

tobitege deleted the enh-stuck-detection branch August 29, 2024 15:33

This was referenced Aug 30, 2024

(fix) test_is_stuck.py to not fail (regression) #3662

Merged

(fix) StuckDetector: syntax error loops were not detected #3663

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(enh) StuckDetector: fix+enhance syntax error loop detection #3628

(enh) StuckDetector: fix+enhance syntax error loop detection #3628

tobitege commented Aug 28, 2024 •

edited

Loading

tobitege commented Aug 28, 2024

enyst commented Aug 29, 2024

tobitege commented Aug 29, 2024

enyst commented Aug 29, 2024

tobitege commented Aug 29, 2024

tobitege commented Aug 29, 2024 •

edited

Loading

enyst Aug 29, 2024 •

edited

Loading

tobitege Aug 29, 2024

(enh) StuckDetector: fix+enhance syntax error loop detection #3628

(enh) StuckDetector: fix+enhance syntax error loop detection #3628

Conversation

tobitege commented Aug 28, 2024 • edited Loading

tobitege commented Aug 28, 2024

enyst commented Aug 29, 2024

tobitege commented Aug 29, 2024

enyst commented Aug 29, 2024

tobitege commented Aug 29, 2024

tobitege commented Aug 29, 2024 • edited Loading

enyst Aug 29, 2024 • edited Loading

Choose a reason for hiding this comment

tobitege Aug 29, 2024

Choose a reason for hiding this comment

tobitege commented Aug 28, 2024 •

edited

Loading

tobitege commented Aug 29, 2024 •

edited

Loading

enyst Aug 29, 2024 •

edited

Loading