[HELP NEEDED] add vision support / multimodal image input #430

Open · wants to merge 1 commit into base: main

Conversation

thiswillbeyourgithub


Fixed #429

Signed-off-by: thiswillbeyourgithub <26625900+thiswillbeyourgithub@users.noreply.github.com>

thiswillbeyourgithub commented Apr 23, 2024

I made another commit to add support for base64-encoding local image files, but for the life of me I can't figure out how to push it to that branch again.

Here's a demo:
I added some code so that pressing Alt+V in insert mode enters [PASTEPNG]; when chatgpt.nvim finds this string, it parses the image from the clipboard:

    -- Insert the [PASTEPNG] placeholder and drop back to normal mode
    function enterPastePNG()
        vim.cmd("exe \"normal i[PASTEPNG]\\<Esc>\"")
    end
    -- Map Alt+V in insert mode to the placeholder insertion
    vim.api.nvim_set_keymap('i', '<A-v>', '<Cmd>lua enterPastePNG()<CR>', { noremap = true, silent = true })

(Two screenshots omitted: the [PASTEPNG] demo in chatgpt.nvim, and the example image that was screenshotted.)

Here's the commit:

Author: thiswillbeyourgithub <26625900+thiswillbeyourgithub@users.noreply.github.com>
Date:   Tue Apr 23 16:23:02 2024 +0200

    feat: add support for local images using base64 utility on unix

diff --git a/lua/chatgpt/flows/chat/base.lua b/lua/chatgpt/flows/chat/base.lua
index 0d167ff..8e56c5a 100644
--- a/lua/chatgpt/flows/chat/base.lua
+++ b/lua/chatgpt/flows/chat/base.lua
@@ -497,7 +497,14 @@ local function createContent(line)
   local extensions = { "%.jpeg", "%.jpg", "%.png", "%.gif", "%.bmp", "%.tif", "%.tiff", "%.webp" }
   for _, ext in ipairs(extensions) do
     if string.find(line:lower(), ext .. "$") then
-      return { type = "image_url", image_url = { url = line } }
+      if string.find(line:lower(), "^https?:") then
+        return { type = "image_url", image_url = { url = line } }
+      else
+        local base64 = io.popen("base64 -w 0 " .. line, "r")
+        local encoded = base64:read("*a")
+        base64:close()
+        return { type = "image_url", image_url = { url = "data:image/jpeg;base64," .. encoded } }
+      end
     end
   end
   return { type = "text", text = line }
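The io.popen call above hardcodes image/jpeg and passes the file path to the shell unquoted. A rough shell sketch of a more careful data-URL builder (illustrative only, not part of the commit; `make_data_url` is a made-up name):

```shell
# Build an OpenAI-style data URL for a local image, deriving the MIME
# type from the extension instead of hardcoding image/jpeg, and quoting
# the path so spaces don't break the command. GNU coreutils `base64 -w 0`
# disables line wrapping (macOS base64 has no -w flag).
make_data_url() {
  case "$1" in
    *.png)  mime="image/png" ;;
    *.gif)  mime="image/gif" ;;
    *.webp) mime="image/webp" ;;
    *)      mime="image/jpeg" ;;
  esac
  printf 'data:%s;base64,%s' "$mime" "$(base64 -w 0 -- "$1")"
}

# Demo on a throwaway file standing in for a real image:
printf 'hello' > /tmp/demo_img.png
make_data_url /tmp/demo_img.png
```

The same extension-to-MIME mapping could be done in Lua inside createContent; the point is only that the MIME type and shell quoting both matter once arbitrary local paths are accepted.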


thiswillbeyourgithub commented Apr 26, 2024

So after further testing I know this works, and I even included a way to paste images directly into chatgpt.nvim using xclip. But I sometimes ran into issues: the way curl is called cannot handle large "data" payloads, so large screenshots and the like should be sent via a slightly different curl invocation. I wasn't sure enough of myself to dive into the api file.

Here's the full diff:

--- /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua	2024-04-24 11:35:42.983303191 +0200
+++ /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua.patch	2024-04-24 11:35:42.983303191 +0200
@@ -497,18 +497,40 @@
   local extensions = { "%.jpeg", "%.jpg", "%.png", "%.gif", "%.bmp", "%.tif", "%.tiff", "%.webp" }
   for _, ext in ipairs(extensions) do
     if string.find(line:lower(), ext .. "$") then
-      return { type = "image_url", image_url = line }
+      if string.find(line:lower(), "^https?:") then
+        return { type = "image_url", image_url = { url = line } }
+      else
+        local base64 = io.popen("base64 -w 0 " .. line, "r")
+        local encoded = base64:read("*a")
+        base64:close()
+        return { type = "image_url", image_url = { url = "data:image/jpeg;base64," .. encoded } }
+      end
     end
   end
+  if string.find(line, "[PASTEPNG]", 1, true) then
+    print("pasted")
+    local base64 = io.popen("xclip -sel clipboard -o -t image/png | base64 -w 0", "r")
+    local encoded = base64:read("*a")
+    base64:close()
+    return { type = "image_url", image_url = { url = "data:image/png;base64," .. encoded } }
+  end
   return { type = "text", text = line }
 end
 
 function Chat:toMessages()
   local messages = {}
+  local use_vision = false
   if self.system_message ~= nil then
     table.insert(messages, { role = "system", content = self.system_message })
   end
 
+  if string.find(self.params.model, "vision", 1, true) or
+        string.find(self.params.model, "gpt-4-turbo", 1, true) or
+        string.find(Settings.params.model, "vision", 1, true) or
+        string.find(Settings.params.model, "gpt-4-turbo", 1, true) then
+      use_vision = true
+  end
+
   for _, msg in pairs(self.messages) do
     local role = "user"
     if msg.type == SYSTEM then
@@ -517,7 +539,7 @@
       role = "assistant"
     end
     local content = {}
-    if self.params.model == "gpt-4-vision-preview" then
+    if use_vision then
       for _, line in ipairs(msg.lines) do
         table.insert(content, createContent(line))
       end

Edit: improved it some more:

--- /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua	2024-04-24 11:35:42.983303191 +0200
+++ /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua.patch	2024-04-24 11:35:42.983303191 +0200
@@ -497,18 +497,40 @@
   local extensions = { "%.jpeg", "%.jpg", "%.png", "%.gif", "%.bmp", "%.tif", "%.tiff", "%.webp" }
   for _, ext in ipairs(extensions) do
     if string.find(line:lower(), ext .. "$") then
-      return { type = "image_url", image_url = line }
+      if string.find(line:lower(), "^https?:") then
+        return { type = "image_url", image_url = { url = line } }
+      else
+        local base64 = io.popen("base64 -w 0 " .. line, "r")
+        local encoded = base64:read("*a")
+        base64:close()
+        return { type = "image_url", image_url = { url = "data:image/jpeg;base64," .. encoded } }
+      end
     end
   end
   return { type = "text", text = line }
 end
 
 function Chat:toMessages()
   local messages = {}
+  local use_vision = false
   if self.system_message ~= nil then
     table.insert(messages, { role = "system", content = self.system_message })
   end
 
+  if string.find(self.params.model, "vision", 1, true) or
+        string.find(self.params.model, "gpt-4-turbo", 1, true) or
+        string.find(Settings.params.model, "vision", 1, true) or
+        string.find(Settings.params.model, "gpt-4-turbo", 1, true) then
+      use_vision = true
+  end
+
   for _, msg in pairs(self.messages) do
     local role = "user"
     if msg.type == SYSTEM then
@@ -517,7 +539,7 @@
       role = "assistant"
     end
     local content = {}
-    if self.params.model == "gpt-4-vision-preview" then
+    if use_vision then
       for _, line in ipairs(msg.lines) do
         table.insert(content, createContent(line))
       end

With the following shortcut:

    function pasteImage()
        -- Generate a random filename in /tmp
        local path = "/tmp/nvim_pasted_image_" .. math.random(1000000) .. ".png"
        -- Use xclip to save the clipboard image to the file
        os.execute("xclip -sel clipboard -o -t image/png > " .. path)
        -- Insert the file path into the buffer
        vim.api.nvim_exec("normal! o" .. path, false)
    end
    vim.api.nvim_set_keymap('i', '<A-v>', '<Cmd>lua pasteImage()<CR>', { noremap = true, silent = true })
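The clipboard-to-file step above can be sketched in shell with mktemp instead of math.random, so the temp name is guaranteed not to collide (illustrative only; the xclip line needs a live X11 clipboard holding a PNG, so it is commented out here):

```shell
# Collision-free temp file for the pasted image. GNU mktemp accepts a
# suffix after the X's in the template; BSD mktemp does not.
path=$(mktemp /tmp/nvim_pasted_image_XXXXXX.png)

# Dump the clipboard PNG into it (requires xclip and an X11 session):
# xclip -sel clipboard -o -t image/png > "$path"

echo "$path"
```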

@thiswillbeyourgithub

Edit: a sure but less privacy-friendly way to send images is to first upload them to litterbox:

        -- upload to litterbox (1-hour expiry) then send as a plain URL
        local handle = io.popen('curl -F "reqtype=fileupload" -F "time=1h" -F "fileToUpload=@' .. line .. '" https://litterbox.catbox.moe/resources/internals/api.php')
        local result = handle:read("*a")
        handle:close()
        return { type = "image_url", image_url = { url = result } }


thiswillbeyourgithub commented Jun 22, 2024

Update: although sending images via the shortcut can be sound, I mainly made this PR to let others easily give it a try.

  1. I really lack the skills to modify the curl API call to send the file, so I can't do it. And I really tried :(
  2. Until then, the allowed image size is in effect pretty small.
  3. Instead of using [PASTEPNG], which I chose to avoid interfering with code, it might actually be better to use markdown image syntax and only trigger the image sending if the path leads to an actual file:
    • This would be cleaner to read
    • It would allow sending multiple images at once
    • It would allow plugins that display images in vim to be used
  4. Also, the code that runs base64 currently only works on unix and needs extra dependencies.

In any case, I won't do any enhancements until someone fixes the curl invocation to send files :/
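For whoever picks up the curl fix: the usual way around the size limit, assuming the plugin currently passes the JSON body inline on the curl command line, is to write the body to a temp file and hand curl a file reference with --data-binary, which sidesteps the OS argument-length limit entirely. The endpoint and payload below are illustrative placeholders, not the plugin's actual api-file code:

```shell
# Sketch: avoid ARG_MAX by letting curl read the JSON body from a file
# instead of receiving it as a command-line argument.
body=$(mktemp)
printf '%s' '{"model":"gpt-4-turbo","messages":[{"role":"user","content":"hi"}]}' > "$body"

# The actual request (commented out here; needs network and an API key):
# curl -s https://api.openai.com/v1/chat/completions \
#   -H "Authorization: Bearer $OPENAI_API_KEY" \
#   -H "Content-Type: application/json" \
#   --data-binary @"$body"

cat "$body"
rm -f -- "$body"   # clean up; the real code would do this after the request
```

curl also accepts `--data-binary @-` to read the body from stdin, which avoids the temp file altogether.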

I can put this in draft if you want, but I would prefer the extra visibility of staying Open.

@thiswillbeyourgithub thiswillbeyourgithub changed the title fix: vision models [HELP NEEDED] add vision support / multimodal image input Jun 22, 2024

Successfully merging this pull request may close these issues.

FR: allow multimodal input / vision / images