Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(enh) StuckDetector: fix+enhance syntax error loop detection #3628

Merged
merged 4 commits into from
Aug 29, 2024

Conversation

tobitege
Copy link
Collaborator

@tobitege tobitege commented Aug 28, 2024

Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

Loop detection of one type of syntax error was broken. Expanded to 4 types of syntax errors.


Give a summary of what the PR does, explaining any non-trivial design decisions

There was only one type of syntax error in the stuck detection, like unterminated string literal.
However, this was broken as the detection only checked a limited range of characters for it.

Instead, the detection is now expanded to the last 4 lines of the observation.
In recent enhancements there were potentially 2 lines added for interpreter and prompt,
making the check basically skip the detection each time.

Unit tests adapted and fixed accordingly.

List of detected errors is now this:

    SYNTAX_ERROR_MESSAGES = [
        'SyntaxError: unterminated string literal (detected at line',
        'SyntaxError: invalid syntax. Perhaps you forgot a comma?',
        'SyntaxError: incomplete input',
        "E999 SyntaxError: unmatched ')'",
    ]

The added errors is a distinct collection from 98 occurences in a bench that I ran.

@tobitege tobitege added enhancement New feature or request agent quality Problems with specific agents labels Aug 28, 2024
@tobitege
Copy link
Collaborator Author

This is an excerpt from a recent Aider bench showing different kinds of syntax errors, that appeared in a loop (I've shortened the content here):

566-  Cell In[1], line 3
567-    to_replace="""# subclassing the built-in ValueError to create MeetupDayException
568-               ^
569:SyntaxError: invalid syntax. Perhaps you forgot a comma?
570-[Jupyter current working directory: /workspace]
--
4890-  Cell In[1], line 3
4891-    to_replace="""class MeteredFile(io.BufferedRandom):
4892-               ^
4893:SyntaxError: invalid syntax. Perhaps you forgot a comma?
4894-[Jupyter current working directory: /workspace]
--
14510-  Cell In[1], line 50
14511-    return f"({self.area_code}) {self.exchange_code}-{self.subscriber_number}"""")
14512-                                                                                ^
14513:SyntaxError: unterminated string literal (detected at line 50)
14514-[Jupyter current working directory: /workspace]
--
14640-05:51:13 - openhands:INFO: agent_controller.py:215 - **IPythonRunCellObservation**
14641-[Your proposed edit has introduced new syntax error(s). Please understand the errors and retry your edit command.]
14642-ERRORS:
14643:/workspace/phone_number.py:45:84: E999 SyntaxError: unmatched ')'
--
19323-  Cell In[1], line 4
19324-    return f"({self.area_code}) {self.exchange_code}-{self.subscriber_number}"""",
19325-                                                                                ^
19326:SyntaxError: unterminated string literal (detected at line 4)
19327-[Jupyter current working directory: /workspace]
--
19358-  Cell In[1], line 4
19359-    return f"({self.area_code}) {self.exchange_code}-{self.subscriber_number}"""",
19360-                                                                                ^
19361:SyntaxError: unterminated string literal (detected at line 4)
19362-[Jupyter current working directory: /workspace]
--
25908-  Cell In[1], line 30
25909-    new_content="""from collections import Counter
25910-                ^
25911:SyntaxError: incomplete input
25912-[Jupyter current working directory: /workspace]
--
25991-  Cell In[1], line 30
25992-    new_content="""from collections import Counter
25993-                ^
25994:SyntaxError: incomplete input
25995-[Jupyter current working directory: /workspace]
--
80271-  Cell In[1], line 41
80272-    return json.dumps({"users": sorted([lender_obj, borrower_obj], key=lambda x: x["name"])})""", """import json
80273-                                                                                                  ^
80274:SyntaxError: incomplete input
80275-[Jupyter current working directory: /workspace]
--
91020-  Cell In[1], line 41
91021-    new_content="""class Scale:
91022-                ^
91023:SyntaxError: incomplete input
91024-[Jupyter current working directory: /workspace]
--
96875-  Cell In[1], line 56
96876-    new_content="""def parse_properties(s):
96877-                ^
96878:SyntaxError: incomplete input
96879-[Jupyter current working directory: /workspace]
--
98626-  File "/workspace/sgf_parsing.py", line 45
98627-    if s[0] == '\' and len(s) > 1:
98628-               ^
98629:SyntaxError: unterminated string literal (detected at line 45)
98630-
--
119294-  Cell In[1], line 3
119295-    to_replace="def equilateral(sides):
119296-               ^
119297:SyntaxError: unterminated string literal (detected at line 3)
119298-[Jupyter current working directory: /workspace]
--
133225-  Cell In[1], line 57
133226-    new_content="""def answer(question):
133227-                ^
133228:SyntaxError: incomplete input
133229-[Jupyter current working directory: /workspace]
--
137464-  Cell In[1], line 51
137465-    new_content="""HOUSES = range(5)
137466-                ^
137467:SyntaxError: incomplete input
137468-[Jupyter current working directory: /workspace]
--
139180-  Cell In[1], line 59
139181-    new_content="""def is_valid_solution(solution):
139182-                ^
139183:SyntaxError: incomplete input
139184-[Jupyter current working directory: /workspace]
--
142461-  Cell In[1], line 57
142462-    return zipper.focus""", """from __future__ import annotations
142463-                            ^
142464:SyntaxError: incomplete input
142465-[Jupyter current working directory: /workspace]

tests/unit/test_is_stuck.py Outdated Show resolved Hide resolved
@enyst
Copy link
Collaborator

enyst commented Aug 29, 2024

My hope for this particular stuck check was that we can get rid of it, replace the hard-coded error with figuring out there was a syntax error from the runtime, or the ast, or jupyter. I think that would be better when we can do it, much more reliable, and catching obviously more than the exact match as it was.

If we need to expand it first as this PR is doing, okay, let's expand it. I would personally be cautious of the probability that being too loose when playing with strings could catch cases that shouldn't be caught though. That said, when you're happy with it, feel free to take it in and let's try it.

@tobitege
Copy link
Collaborator Author

I'm having trouble imagining false positives happening explicitly for bash/ipython command executions, that have 3+ times "SyntaxError...." in their last 4 lines coming back from LLM, except for editing/execution errors? 🤔 @enyst

@enyst
Copy link
Collaborator

enyst commented Aug 29, 2024

Working with files that include the text "Syntax error" in various contexts. Like this exact file, dunno, I don't have an exact scenario, but why wouldn't the LLM want to edit lines catching or documenting syntax errors? 🤔

@tobitege
Copy link
Collaborator Author

I'll try to narrow it down further, with e.g. checks for the last 2 lines coming from us, like

[Jupyter current working directory: xxx]
[Jupyter Python interpreter: xxx]

@tobitege
Copy link
Collaborator Author

tobitege commented Aug 29, 2024

@enyst - I made the stuck detection way more stringent and also added more unit test cases for the other errors.
One I actually had to remove again as that was a different linting error.

Example output for new test test_is_not_stuck_incomplete_input_error: different line numbers and number of lines in between (incl. bottom 2 lines as shortened test string for startswith):

   Cell In[1], line 211
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input



[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 177
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input


[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 196
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input


[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 215
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input


[Jupyter current working directory:
[Jupyter Python interpreter:
PASSED

The is stuck version of it (I know, it doesn't read nicely here 😄 ):

to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 42
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 42
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
  Cell In[1], line 42
to_replace="""def largest(min_factor, max_factor):
            ^
SyntaxError: incomplete input
[Jupyter current working directory:
[Jupyter Python interpreter:
PASSED

@tobitege tobitege requested a review from enyst August 29, 2024 12:35

def test_is_stuck_ipython_syntax_error_not_at_end(
self, stuck_detector: StuckDetector, event_stream: EventStream
):
# this test is to make sure we don't get false positives
# since the "at line x" is changing in between!
Copy link
Collaborator

@enyst enyst Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔
FWIW, I wonder if requiring the error message to be identical isn't too restrictive now, hah (I'll see myself out lol). Please correct me if I'm not seeing right (looking at the diff), around line 190 in stuck.py it looks like the error too has to be entirely the same. It's an interesting question. Because I recall the stuck llm tries to make minor variations when it sees a syntax error, like it might try to run stuff starting with quotes or backquotes. It's still a syntax error, but it can be on another word or another line.

(FWIW I tend to think that this test was succeeding because the "syntax error" string had to be near the end of the line or something like that, not requiring an identical line number. I could be wrong though.)

Maybe that's all okay, since we're reducing from 4 to 3. 🤔 Three but identical might work well enough. The llm did seem to have a favorite thing to try lol, while variations of the code to run (e.g. when it tried running with quotes) were more rare than the main thing it wanted. If this works in the scenario you found, we can go with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error doesn't have to be exactly the same, but the first line in those errors is like this:
Cell In[1], line 42
So similar to the other syntax error, where a change of the line number "breaks" the loop, I added the first line to be the same in 3 observations. The first line does not contain the actual error, but the location.
Then it is up to the bottom 3 lines to match, basically (error type and 2 lines of our prompt).

I'll try to re-run a couple of instances from the Aider bench, from where I had that error collection from originally. :)

@tobitege tobitege merged commit a2d94c9 into main Aug 29, 2024
@tobitege tobitege deleted the enh-stuck-detection branch August 29, 2024 15:33
RajWorking pushed a commit to RajWorking/OpenHands that referenced this pull request Sep 2, 2024
…ds-AI#3628)

* fix StuckDetector and add more errors for detection

* more stringent error detection and more unit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent quality Problems with specific agents enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants