Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IM driver crashing php process when creating many thumbs for the first time #5328

Open
iskrisis opened this issue Jun 27, 2023 · 21 comments
Open
Labels
type: regression 🚨 Is a regression between versions

Comments

@iskrisis
Copy link

iskrisis commented Jun 27, 2023

Description

I can't pin point out why or when but somewhere around v3.8 something about thumb creation changed. I know it from upgrading some image heavy sites to 3.8+. I have 4gig ram VPS and use im driver for most stuff to create avif/webp/jpg thumbs. It takes some time first time but it never was an issue. But with recent versions one can crash PHP process. Happened on both php-fpm and also clients Apache. Firefox starts to throw NS_BINDING_ABORTED erros for the images which i suspect is because error messages get served instead of image.

After asking @bastianallgeier he suspects it's
https://github.com/getkirby/kirby/blob/6b5dda600472b59a3a779f7e6b8e9e130cc414a1/src/Cms/Media.php#L56C4-L56C4 change that loads/serves original files through PHP.

I've tried to setup im security policies really thoroughly and i don't think it makes much difference. The im processes from logs seems to go fine and it never crashes so i don't think it's a problem with im driver.

Curiously it doesn't seem to be a problem with gd but maybe thats just because gd can't create .avif which is the format that takes a long time to create.

I am not sure why this change has been made but i've managed to crash two live sites already when they launched. It's a bit nerve wrecking and really hard to track down. It would be really great if it worked as it did in the past.

@bastianallgeier bastianallgeier added the type: regression 🚨 Is a regression between versions label Jun 28, 2023
@iskrisis
Copy link
Author

iskrisis commented Jul 4, 2023

Can i help this somehow? Like would it help if i created test repo with example that was crashing?

@distantnative
Copy link
Member

@iskrisis that would help indeed

@tobimori
Copy link
Contributor

tobimori commented Jul 4, 2023

I think I had experienced similiar issues in the past, but I couldn't pinpoint it (as they're always gone after the first load)

@bastianallgeier
Copy link
Member

I agree. A test repo would be great. So far, I didn't run into this issue with our sandbox environments.

@iskrisis
Copy link
Author

iskrisis commented Jul 6, 2023

Playing with it whole afternoon. I am starting to think this is not fixable unless some architectural change would happen.

What i think is happening

When you have list page with enough images each of the thumb requests will spawn convert process
iTerm2 - ploi@helpful-farm- ~-kirbyimagetest floriankarsten dev-media 2023-07-06 at 4 06 06 PM

Even when imagemagick has properly configured security policies (it will comfortably convert thumbs individually) what happens is that linux out of memory killer will kick in and at some point kills php-fpm too. This is pronounced longer the convert runs so with bigger files and .avif it's easier for this to happen. But it happens even with webp (and most likely would with jpg if pushed enough) The image conversion will fill up ram and swap and then it's matter of time.

Of course when you have server with enough ram then you won't hit this. Also i think it's a linux thing and might work differently on win/mac. Plus kinda impossible to make happen on developers machine.

Observations

  • I've looked at how to avoid this on server level but i don't think it's possible to limit how many instances of convert can be spawned at the same time. Nor is it easily possible to limit max ram consumption of convert processes together (one would have to run convert under different user and limit the user). Reading about linux OOM killer it's probably possible to discourage to ever kill php-fpm but that is not encouraged solution since you might want to kill in situations...

  • There is condition where php waits for convert that gets killed by linux OOM in the middle and it might not be handled well memory wise.

  • It seems like reading of the original backup file increases the memory consumption of php-fpm and it might be one of the reasons its get killed too.

  • I happen to bump into what i think is another issue and thats that the convert doesn't have any check if it can actually convert to .avif. This might lead to wierd stuff.

  • At some point i endedup with .avif files in media which were actually the original .jpegs. I don't understand why this happens.

  • I was logging what commands was the driver generating and it was wierd

'convert' -limit thread 1 '/home/ploi/kirbyimagetest.floriankarsten.dev/media/pages/test/a5013b8841-1688648823/fredrik-ohlander-ysqok-3eir8-unsplash-1800x.webp' -auto-orient -thumbnail '1800x2700!' -quality '90' '/home/ploi/kirbyimagetest.floriankarsten.dev/media/pages/test/a5013b8841-1688648823/fredrik-ohlander-ysqok-3eir8-unsplash-1800x.webp'

does this mean that the image gets copied then somehow converted in place?

  • I think this might be kinda problematic security concern if somebody goes to your site and requests all the wierd undreated srcsets at same time they could crash your server? I think i will be more mindful about what i put in there.

Test

Repo here https://github.com/iskrisis/test-kirby-image-crasher but it's just plainkit with loop of some hires thumbs. Tested on 2GB ram hetzner vps with ubuntu with what i think are the most common settings. Imagemagick security policies were setup and i could create each image individually with cli directly.

Logs

php-fpm logs

Jul 06 13:48:02 helpful-farm systemd[1]: Started The PHP 8.1 FastCGI Process Manager.
Jul 06 13:48:41 helpful-farm systemd[1]: php8.1-fpm.service: A process of this unit has been killed by the OOM killer.
Jul 06 13:48:42 helpful-farm systemd[1]: php8.1-fpm.service: Failed with result 'oom-kill'.
Jul 06 13:48:42 helpful-farm systemd[1]: php8.1-fpm.service: Consumed 22.965s CPU time.

digging why it was oom-killed - process convert (imagemagick)

[Thu Jul  6 13:13:54 2023] Out of memory: Killed process 46637 (convert) total-vm:1259896kB, anon-rss:327376kB, file-rss:1988kB, shmem-rss:0kB, UID:1000 pgtables:1924kB oom_score_adj:0
[Thu Jul  6 13:48:40 2023] Out of memory: Killed process 47353 (convert) total-vm:779280kB, anon-rss:381476kB, file-rss:1960kB, shmem-rss:0kB, UID:1000 pgtables:1028kB oom_score_adj:0

My takeaway

I didn't hit on this in past just because of luck we have 4gig ram server and i've never hit the right kind of load.
This doesn't seem to happen with GD probably because it is extension not driver that runs in cli. I am thinking i should really try to write driver for vips php extension because this might always.

@lukasbestle
Copy link
Member

We could solve this with a semaphore (PHP sem_acquire()). This function does not exist in all PHP installations, but we could check if it does and treat it as progressive enhancement.

It would work something like this:

  • Before Kirby spawns convert, it acquires a semaphore with a maximum limit of let's say 2 (= maximum number of simultaneous IM processes).
  • If there are already 2 (or whatever number) running, further calls block and wait for the completion of the previous calls.
  • After IM finishes, the semaphore is released.

@bastianallgeier
Copy link
Member

@lukasbestle I never heard about it before, but if this could help to solve the problem, I'm all in. We could also think about adding an imagick driver to instead of im. The PHP extension would also probably be a safer option in case it exists.

@iskrisis
Copy link
Author

I have no idea why this doesn't happen with php-gd but if its the case also for the extension version of imagemagick than that would probably be easier to make and more supported? Cpanel and most shared hostings i've seen have the extension either on by default or under option from gui. Ploi also has it as option.
Seems like semaphore is not only that well supported but also pretty complex.

@iskrisis
Copy link
Author

It's really pitty SimpleImage doesn't support more extensions. claviska/SimpleImage#308 It seems to me like this wouldn't have to be that much different from php-gd driver.

For example https://github.com/Intervention/image wraps both gd and imagemagick (even vips with additional driver). I understand why you are hesitant to include these big dependencies... maybe Kirby core should keep it gd only and it should be plugin for Intervention that supports extensions.

@lukasbestle
Copy link
Member

I think it would be worth trying the Imagick extension in a plugin. If it doesn't show this behavior, is more stable but still works well for everything we need, this could be a good option.

Semaphores are not that complex, but not many users would benefit from them.

@bastianallgeier
Copy link
Member

TBH, I wouldn't mind to move away from SimpleImage. We use it forever, but I'm sure it's not the best image library out there.

@distantnative distantnative added this to the 4.0.0 milestone Oct 19, 2023
@distantnative distantnative removed this from the 4.0 milestone Oct 29, 2023
@iskrisis
Copy link
Author

@adamkiss has made prototype with Kirby Intervention driver but it doesn't seem to solve many issues. It's slow. I've also managed to crash php-gd and thus i asum even php-im crashes. Even Vips thats order of magnitude faster is most likely in same boat (if you push it hard enough).

It doesn't solve underlying issue that if you fire too many thumb tasks you might crash php-fpm. Especially on VPSes with smaller memory.

Not sure what's the solution here. Some kind of throttle. I am not sure about semaphore @lukasbestle mentioned before. Other solution would be some kind of optional queue for .jobs with cron - thats how i've seen other projects to handle. There must be some settings to protect the processes on linux side but i couldn't find the correct ones.

But for me this is currently quite a big painpoint. We stopped using .avif and i am thinking about basically solving it by throwing everything on one big server that will survive a lot bigger bursts.

@lukasbestle
Copy link
Member

The issue with Semaphores would be that all additional calls beyond the "ImageMagick concurrency limit" would either block (= make the PHP-FPM worker and request hang) or fail (= don't return the requested thumb).

I like the idea of a job queue. Basically an independent process running on the server that processes the incoming jobs. Maybe that could be implemented as a plugin for the Kirby CLI? If it helps with handling large numbers of thumbs, we could then think about integrating it into the core somehow.

@tobimori
Copy link
Contributor

tobimori commented Dec 26, 2023

A core-supported Job/queue system would be very well appreciated. I could definitely make use of this for other plugins.

@iskrisis
Copy link
Author

iskrisis commented Jan 5, 2024

Kirby already uses .jobs folder for whitelisting the thumb creation. Is kinda similar to synchronous queue.

Worker for these .jobs would be straighforward BUT the problem is i dont want every whitelisted thumb to be created i want only those that are actualy requested by clients. Srcsets etc are very broad and it would be very wasteful to create every possible thub. On the other hand being explicit about thumbs is very hard and restricting with all the new filedormats and resolutions.

Maybe some kind of lock/semaphore/process manager for tracking number of thumb processes currently working with optional limit setting is solution fitting Kirby better.

Imho i am ok with returning placeholder image (or something) if the process limit is reached. This is rare situation on first visits - people will refresh or whatever next time the images are there. One could probably even very crude way sleep() and retry in the request.

Its much much better solution than crashing webserver along with all websites on it.

@lukasbestle
Copy link
Member

The advantage with semaphores would be that they would "just work" if they are available. Versus a queue that first needs to be set up with a cronjob or daemon process.

Maybe we could do something like "if there are no more than x running convert processes, process the thumb, otherwise wait for half a second and try again, otherwise return a placeholder image". Do you have an idea what the placeholder image could be or how it could work?

In any case, before we go for that route, we would first need to check how common the semaphore feature is in the wild. Wouldn't be worth it if only a fraction of PHP installations had it.

@iskrisis
Copy link
Author

iskrisis commented Jan 22, 2024

Placeholder - for me even 1px gray image would be fine? But i think first request would this be to have this customizable. Which is probably just matter of saying the image is always /media/_placeholder.jpg or something.

About support i've very quickly tried some hostings using

<?php

if (extension_loaded('sysvsem')) {
    echo "sysvsem/semaphores loaded";
} else {
    echo "sysvsem/semaphores unavailable ";
}
?>

To my surprise most everything i have access to supports it out of the box. Ploi includes them by default (no need to turn on the extension). Uberspace also have it on by default. Two cpanel based shared hostings have them available as extensions and were turned on by default. Out of the 5 hostings i've tried only wierd Czech one doesn't have it on by default (and they don't have gui to turn extensions on).

I thought i will have tu turn on the extension but it seems to be one of the more common ones. Laravel Valet/Herd also have it available by default.

@lukasbestle
Copy link
Member

That's great, thanks for testing!

@ovenum
Copy link
Contributor

ovenum commented Jun 26, 2024

Can confirm @iskrisis observations that on image heavy sites with many visitors oom-kill will be summoned when lots of convert process are spawned for source set generation. Even worse if the original image has a large file size.

This can eventually take a whole site offline if oom-kill decides it’s time to terminate the php-fpm process and it is not setup to automatically restart.

@nilshoerrmann
Copy link
Contributor

We seem to encounter similar issue, but with GD.

@iskrisis: What have you done do work around this issue?

@ovenum
Copy link
Contributor

ovenum commented Jul 3, 2024

We seem to encounter similar issue, but with GD.

@iskrisis: What have you done do work around this issue?

@nilshoerrmann you might be already aware of this, but leaving this here for others that require a quick fix.

You could use a hook to generate srcsets when images are uploaded to the panel. By this you can limit the amount of concurrent image creations taking place on the server.

The problem with this is that when the modified timestamp of the page the image(s) belong to changes Kirby will re-run the srcset creation on request. Just caused a couple of 503s when changing page numbers…

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: regression 🚨 Is a regression between versions
Projects
None yet
Development

No branches or pull requests

7 participants