[HELP NEEDED] add vision support / multimodal image input #430

Open · wants to merge 1 commit into base: main

Conversation

thiswillbeyourgithub


Fixed #429

Signed-off-by: thiswillbeyourgithub <26625900+thiswillbeyourgithub@users.noreply.github.com>

thiswillbeyourgithub commented Apr 23, 2024

I made another commit to add support for base64-encoding local image files, but for the life of me I can't figure out how to push it to that branch again.

Here's a demo:
I added some code so that pressing Alt+V in insert mode enters [PASTEPNG]; when chatgpt.nvim finds this string, it parses the image from the clipboard:

    -- Insert the [PASTEPNG] placeholder and drop back to normal mode
    function enterPastePNG()
        vim.cmd("exe \"normal i[PASTEPNG]\\<Esc>\"")
    end
    -- Map Alt+V in insert mode to the placeholder insertion
    vim.api.nvim_set_keymap('i', '<A-v>', '<Cmd>lua enterPastePNG()<CR>', { noremap = true, silent = true })

(Two screenshots omitted: the [PASTEPNG] demo in chatgpt.nvim, and the example image that was screenshotted.)

Here's the commit:

Author: thiswillbeyourgithub <26625900+thiswillbeyourgithub@users.noreply.github.com>
Date:   Tue Apr 23 16:23:02 2024 +0200

    feat: add support for local images using base64 utility on unix

diff --git a/lua/chatgpt/flows/chat/base.lua b/lua/chatgpt/flows/chat/base.lua
index 0d167ff..8e56c5a 100644
--- a/lua/chatgpt/flows/chat/base.lua
+++ b/lua/chatgpt/flows/chat/base.lua
@@ -497,7 +497,14 @@ local function createContent(line)
   local extensions = { "%.jpeg", "%.jpg", "%.png", "%.gif", "%.bmp", "%.tif", "%.tiff", "%.webp" }
   for _, ext in ipairs(extensions) do
     if string.find(line:lower(), ext .. "$") then
-      return { type = "image_url", image_url = { url = line } }
+      if string.find(line:lower(), "^https?:") then
+        return { type = "image_url", image_url = { url = line } }
+      else
+        local base64 = io.popen("base64 -w 0 " .. line, "r")
+        local encoded = base64:read("*a")
+        base64:close()
+        return { type = "image_url", image_url = { url = "data:image/jpeg;base64," .. encoded } }
+      end
     end
   end
   return { type = "text", text = line }
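The io.popen call above hardcodes image/jpeg and passes the file path to the shell unquoted. A rough shell sketch of a more careful data-URL builder (illustrative only, not part of the commit; `make_data_url` is a made-up name):

```shell
# Build an OpenAI-style data URL for a local image, deriving the MIME
# type from the extension instead of hardcoding image/jpeg, and quoting
# the path so spaces don't break the command. GNU coreutils `base64 -w 0`
# disables line wrapping (macOS base64 has no -w flag).
make_data_url() {
  case "$1" in
    *.png)  mime="image/png" ;;
    *.gif)  mime="image/gif" ;;
    *.webp) mime="image/webp" ;;
    *)      mime="image/jpeg" ;;
  esac
  printf 'data:%s;base64,%s' "$mime" "$(base64 -w 0 -- "$1")"
}

# Demo on a throwaway file standing in for a real image:
printf 'hello' > /tmp/demo_img.png
make_data_url /tmp/demo_img.png
```

The same extension-to-MIME mapping could be done in Lua inside createContent; the point is only that the MIME type and shell quoting both matter once arbitrary local paths are accepted.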


thiswillbeyourgithub commented Apr 26, 2024

So after further testing I know this works, and I even included a way to paste images directly into chatgpt.nvim using xclip. But I sometimes ran into issues: the way curl is called cannot handle large "data" payloads, so large screenshots and the like should be sent via a slightly different curl invocation. I wasn't sure enough of myself to dive into the api file.

Here's the full diff:

--- /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua	2024-04-24 11:35:42.983303191 +0200
+++ /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua.patch	2024-04-24 11:35:42.983303191 +0200
@@ -497,18 +497,40 @@
   local extensions = { "%.jpeg", "%.jpg", "%.png", "%.gif", "%.bmp", "%.tif", "%.tiff", "%.webp" }
   for _, ext in ipairs(extensions) do
     if string.find(line:lower(), ext .. "$") then
-      return { type = "image_url", image_url = line }
+      if string.find(line:lower(), "^https?:") then
+        return { type = "image_url", image_url = { url = line } }
+      else
+        local base64 = io.popen("base64 -w 0 " .. line, "r")
+        local encoded = base64:read("*a")
+        base64:close()
+        return { type = "image_url", image_url = { url = "data:image/jpeg;base64," .. encoded } }
+      end
     end
   end
+  if string.find(line, "[PASTEPNG]", 1, true) then
+    print("pasted")
+    local base64 = io.popen("xclip -sel clipboard -o -t image/png | base64 -w 0", "r")
+    local encoded = base64:read("*a")
+    base64:close()
+    return { type = "image_url", image_url = { url = "data:image/png;base64," .. encoded } }
+  end
   return { type = "text", text = line }
 end
 
 function Chat:toMessages()
   local messages = {}
+  local use_vision = false
   if self.system_message ~= nil then
     table.insert(messages, { role = "system", content = self.system_message })
   end
 
+  if string.find(self.params.model, "vision", 1, true) or
+        string.find(self.params.model, "gpt-4-turbo", 1, true) or
+        string.find(Settings.params.model, "vision", 1, true) or
+        string.find(Settings.params.model, "gpt-4-turbo", 1, true) then
+      use_vision = true
+  end
+
   for _, msg in pairs(self.messages) do
     local role = "user"
     if msg.type == SYSTEM then
@@ -517,7 +539,7 @@
       role = "assistant"
     end
     local content = {}
-    if self.params.model == "gpt-4-vision-preview" then
+    if use_vision then
       for _, line in ipairs(msg.lines) do
         table.insert(content, createContent(line))
       end

Edit: improved it some more:

--- /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua	2024-04-24 11:35:42.983303191 +0200
+++ /home/$USER/.local/share/nvim/lazy/ChatGPT.nvim/lua/chatgpt/flows/chat/base.lua.patch	2024-04-24 11:35:42.983303191 +0200
@@ -497,18 +497,40 @@
   local extensions = { "%.jpeg", "%.jpg", "%.png", "%.gif", "%.bmp", "%.tif", "%.tiff", "%.webp" }
   for _, ext in ipairs(extensions) do
     if string.find(line:lower(), ext .. "$") then
-      return { type = "image_url", image_url = line }
+      if string.find(line:lower(), "^https?:") then
+        return { type = "image_url", image_url = { url = line } }
+      else
+        local base64 = io.popen("base64 -w 0 " .. line, "r")
+        local encoded = base64:read("*a")
+        base64:close()
+        return { type = "image_url", image_url = { url = "data:image/jpeg;base64," .. encoded } }
+      end
     end
   end
   return { type = "text", text = line }
 end
 
 function Chat:toMessages()
   local messages = {}
+  local use_vision = false
   if self.system_message ~= nil then
     table.insert(messages, { role = "system", content = self.system_message })
   end
 
+  if string.find(self.params.model, "vision", 1, true) or
+        string.find(self.params.model, "gpt-4-turbo", 1, true) or
+        string.find(Settings.params.model, "vision", 1, true) or
+        string.find(Settings.params.model, "gpt-4-turbo", 1, true) then
+      use_vision = true
+  end
+
   for _, msg in pairs(self.messages) do
     local role = "user"
     if msg.type == SYSTEM then
@@ -517,7 +539,7 @@
       role = "assistant"
     end
     local content = {}
-    if self.params.model == "gpt-4-vision-preview" then
+    if use_vision then
       for _, line in ipairs(msg.lines) do
         table.insert(content, createContent(line))
       end

With the following shortcut:

    function pasteImage()
        -- Generate a random filename in /tmp
        local path = "/tmp/nvim_pasted_image_" .. math.random(1000000) .. ".png"
        -- Use xclip to save the clipboard image to the file
        os.execute("xclip -sel clipboard -o -t image/png > " .. path)
        -- Insert the file path into the buffer
        vim.api.nvim_exec("normal! o" .. path, false)
    end
    vim.api.nvim_set_keymap('i', '<A-v>', '<Cmd>lua pasteImage()<CR>', { noremap = true, silent = true })
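The clipboard-to-file step above can be sketched in shell with mktemp instead of math.random, so the temp name is guaranteed not to collide (illustrative only; the xclip line needs a live X11 clipboard holding a PNG, so it is commented out here):

```shell
# Collision-free temp file for the pasted image. GNU mktemp accepts a
# suffix after the X's in the template; BSD mktemp does not.
path=$(mktemp /tmp/nvim_pasted_image_XXXXXX.png)

# Dump the clipboard PNG into it (requires xclip and an X11 session):
# xclip -sel clipboard -o -t image/png > "$path"

echo "$path"
```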

@thiswillbeyourgithub

Edit: a sure but less privacy-friendly way to send images is to first upload them to litterbox:

        -- upload to litterbox (1-hour expiry) then send as a plain URL
        local handle = io.popen('curl -F "reqtype=fileupload" -F "time=1h" -F "fileToUpload=@' .. line .. '" https://litterbox.catbox.moe/resources/internals/api.php')
        local result = handle:read("*a")
        handle:close()
        return { type = "image_url", image_url = { url = result } }


thiswillbeyourgithub commented Jun 22, 2024

Update: although sending images via the shortcut can be sound, I mainly made this PR to let others easily give it a try.

  1. I really lack the skills to modify the curl API call to send the file, so I can't do it. And I really tried :(
  2. Until then, the allowed image size is in effect pretty small.
  3. Instead of using [PASTEPNG], which I chose to avoid interfering with code, it might actually be better to use markdown image syntax and only trigger the image sending if the path leads to an actual file:
    • This would be cleaner to read
    • It would allow sending multiple images at once
    • It would allow plugins that display images in vim to be used
  4. Also, the code that runs base64 currently only works on unix and needs extra dependencies.

In any case, I won't do any enhancements until someone fixes the curl invocation to send files :/
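For whoever picks up the curl fix: the usual way around the size limit, assuming the plugin currently passes the JSON body inline on the curl command line, is to write the body to a temp file and hand curl a file reference with --data-binary, which sidesteps the OS argument-length limit entirely. The endpoint and payload below are illustrative placeholders, not the plugin's actual api-file code:

```shell
# Sketch: avoid ARG_MAX by letting curl read the JSON body from a file
# instead of receiving it as a command-line argument.
body=$(mktemp)
printf '%s' '{"model":"gpt-4-turbo","messages":[{"role":"user","content":"hi"}]}' > "$body"

# The actual request (commented out here; needs network and an API key):
# curl -s https://api.openai.com/v1/chat/completions \
#   -H "Authorization: Bearer $OPENAI_API_KEY" \
#   -H "Content-Type: application/json" \
#   --data-binary @"$body"

cat "$body"
rm -f -- "$body"   # clean up; the real code would do this after the request
```

curl also accepts `--data-binary @-` to read the body from stdin, which avoids the temp file altogether.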

I can put this in draft if you want, but I would prefer the extra visibility of staying Open.

@thiswillbeyourgithub thiswillbeyourgithub changed the title fix: vision models [HELP NEEDED] add vision support / multimodal image input Jun 22, 2024

Successfully merging this pull request may close these issues.

FR: allow multimodal input / vision / images