[bug] drop/delete terribly slow #9636
It definitely shouldn't be taking that long. Can you gather some profiles while the slow delete is in progress?
The workaround I would try is to break out the regular expression, running a separate DROP MEASUREMENT for each matching measurement. If you run the DROP calls as separate statements, you may trigger a compaction on each run.
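The exact commands were lost from this thread's formatting. For profile gathering, InfluxDB 1.5's debug endpoints are likely what was being requested here, along these lines:

```bash
# Capture CPU/heap/goroutine profiles while the slow DELETE/DROP is running
# (blocks ~30s while the CPU profile samples).
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"
# Runtime counters are useful alongside the profiles.
curl -o vars.txt "http://localhost:8086/debug/vars"
```

And a sketch of breaking the regex into per-measurement DROPs with the influx CLI ('mydb' and the pattern are placeholders; assumes measurement names contain no commas):

```bash
# List measurements matching the pattern, then drop them one at a time.
influx -database 'mydb' -format csv \
  -execute 'SHOW MEASUREMENTS WITH MEASUREMENT =~ /{{regexp_pattern}}/' |
  tail -n +2 | cut -d, -f2 |
  while read -r m; do
    influx -database 'mydb' -execute "DROP MEASUREMENT \"$m\""
  done
```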
I will try to provide the details you need tomorrow. Can I somehow turn compaction off? I also have a problem with my influx db copy program (that I've written myself months ago). It now takes about 3-10x more time than it was taking on v1.3.x. I'll be investigating this tomorrow; for now the 3x-10x more time is due to multiple "timeout error" responses that I'm receiving and retrying (my copy program deals with this). I suspect the problem is due to the number of threads - I'm using 48 threads and http-timeout = 300s, and just after about 300s of data copying I'm starting to receive "timeout error". Is there a way to tell Influx to allow this? The copy program works like an influx db backup tool: it can copy one influx database to another, local/remote etc. I need such a tool because when I'm doing a full data regenerate, I need to save it to a temp database first (it takes hours), so as not to disturb the original database which is used by Grafana. The better option would be to generate the temp db and then just rename it, but unfortunately InfluxDB does not support this either :-(
Uploaded the requested profiles, captured while the delete was running.
Delete still running (already 19 minutes); it was at least 40x faster in v1.3.x.
I don't believe that is supported. Maybe setting `max-concurrent-compactions = 1` would suffice here. https://docs.influxdata.com/influxdb/v1.5/administration/config/#max-concurrent-compactions-0
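For reference, a minimal sketch of where that option lives in influxdb.conf (assuming the 1.5 config layout):

```toml
[data]
  # 0 (the default) lets InfluxDB derive a limit from the CPU count;
  # 1 serializes compactions, reducing I/O pressure during deletes.
  max-concurrent-compactions = 1
```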
Overwriting existing points is expensive and should generally be avoided; if you must overwrite, do it in bulk.
I suspect you'll get much better performance using the native backup and restore tools.
Default should be unlimited: https://docs.influxdata.com/influxdb/v1.5/administration/config/#max-connection-limit-0
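A sketch of the native tools in 1.5+ (portable online format; database names, host, and paths are placeholders):

```bash
# Portable (online) backup of one database; 8088 is the default RPC port.
influxd backup -portable -database mydb -host 127.0.0.1:8088 /tmp/mydb_backup

# Restore into a differently named database on the same or another server.
influxd restore -portable -db mydb -newdb mydb_temp /tmp/mydb_backup
```

The `-newdb` flag is what enables the "generate into a temp database, then swap" workflow discussed above, without a literal rename.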
Thanks, this will be very helpful in identifying the performance bottleneck. @rbetts care to prioritize/assign this?
@lukaszgryglicki Thanks for the very complete report. We'll triage this in an upcoming grooming session.
BTW: I've killed the query after 40 minutes and implemented an alternate solution:
https://github.com/cncf/devstats/blob/master/cmd/z2influx/z2influx.go
https://docs.influxdata.com/influxdb/v1.5/administration/backup_and_restore/
I'll double check tomorrow then, but a quick question: copying A_temp to A should be orders of magnitude faster than just generating A (to avoid downtime). Is this feature new in v1.5?
Maybe.
Seems like regexps are also very slow.
I am having a similar issue where I need to delete lots of measurements (a malformed graphite template led to 100k measurements being created). Deleting an individual measurement takes between 20s and 2m, when it finishes at all (sometimes it just hangs). I am unable to drop by regex (I get an error). The system is reasonably loaded, using 50% of system RAM and ~70% load.

This seems entirely related to the number of measurements, not the amount of data. These operations were fast yesterday (less than 5 seconds, but not instant) when I had 50 measurements. I am not writing any data to the new bugged measurements, so my data has not grown much, but the number of measurements has: I went from ~50 measurements to ~110000. Since they are raw graphite items, their patterns are easy to regex, but if I were to delete the measurements individually it would take weeks.

Requesting the debug profiles using the command listed above takes 30s to start (the curl stats show no activity for 30s, then the download happens). This feels like the kind of latency that is happening during other operations (everything just feels very slow and laggy). Due to that (and the lack of regex drop), I can't dump all three stats during the same measurement delete operation, so I just start the next one as soon as I can. As a note, the vars download is not slow.

Here is an archive of my profile data. System info: AWS EC2 m5.large.
We've replaced influx with PostgreSQL and delete is instant.
This is kind of drifting off-topic, but how do you deal with the dynamic-schema benefits of influxdb? Do you just not index the tags and store all measurements as strings? We have a fairly complex pile of pre-existing metrics, and developing a SQL schema for all of them in a reasonable way would be very difficult (unless we just crudely stored them as text or something); the benefit of influxdb (and other storage systems like it) is that it is specific to the problem of metrics storage and querying (as opposed to a general-purpose database).
I've implemented something that dynamically creates tables/columns/indices as needed. It took more than a week, but it works great and is faster.
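A minimal sketch of that dynamic-schema approach, assuming psycopg2; the table/column layout here is illustrative, not the commenter's actual code:

```python
import psycopg2

def ensure_measurement_table(cur, measurement, tags, fields):
    """Create a per-measurement table with indexed tag columns on demand."""
    tag_cols = ", ".join(f'"{t}" text' for t in tags)
    field_cols = ", ".join(f'"{f}" double precision' for f in fields)
    cur.execute(
        f'CREATE TABLE IF NOT EXISTS "{measurement}" '
        f'(time timestamptz NOT NULL, {tag_cols}, {field_cols})'
    )
    # Index each tag column so tag-filtered queries stay fast.
    for t in tags:
        cur.execute(
            f'CREATE INDEX IF NOT EXISTS "idx_{measurement}_{t}" '
            f'ON "{measurement}" ("{t}")'
        )

conn = psycopg2.connect("dbname=metrics")
with conn, conn.cursor() as cur:
    ensure_measurement_table(cur, "cpu_load", ["host", "region"], ["value"])
```

With this layout, deleting a measurement is a plain `DROP TABLE`, which Postgres executes near-instantly.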
I can reproduce this bug, but not consistently. I created a database with 100000 measurements and about 8 million series, and have twice seen it deadlock. Delete from 1 measurement:
Delete from about 9 measurements:
Delete from about 90 measurements:
Delete from about 900 measurements (gave up, ctrl-C after ~5m):
Adding some debug logging to track down where it gets stuck.
The deadlock is on a lock taken during the delete.
Opening as there is still possibly a performance issue.
Related to #10056
@jacobmarble do you have an update about this investigation/bug fix?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
don't close.
I am using 1.7.7 and the issue still persists. Deleting a single measurement with 1 million series takes forever. iostat shows disk utilization is not very high.
Still present in 1.7.8.
It's not just the delete that is slow; the garbage collection and/or compaction that happens after the delete is terrible too. I deleted a couple hundred measurements and the load jumped 6x for over an hour. Memory maxed out. IO wait is 50% or so of all CPU time (cloud resources!).
If I remember correctly, we decided in 1.5 to block on delete, rather than accept the delete request and handle it asynchronously. My earlier analysis probably reflects this. So "that's a feature, not a bug" is my offhand comment for that. @bryanspears have you tried limiting concurrent compactions? Since your bulk delete operation likely touched several shards, and those shards all require multiple levels of compaction, limiting concurrency has helped other folks in the past. Start with limiting to 1. https://docs.influxdata.com/influxdb/v1.7/administration/config/#max-concurrent-compactions-0
Possibly related to #15271
@jacobmarble your suggestion has stabilized our small influx setup. For anyone else on limited, I/O-constrained cloud resources: these configuration options helped dramatically. Compaction does take longer, but that's better than the alternative crash that was occurring for us.
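The commenter's exact settings were lost from this thread; a sketch of the kind of `[data]` options that throttle compaction on I/O-limited hosts (values are illustrative):

```toml
[data]
  # Run at most one compaction at a time.
  max-concurrent-compactions = 1
  # Cap the rate at which completed compactions are written to disk;
  # the burst value permits short spikes above the cap.
  compact-throughput = "16m"
  compact-throughput-burst = "16m"
```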
|
@bryanspears I'm glad to know that limiting concurrent compactions helped. In many situations, compaction duration doesn't cost anything, and slower compactions free resources for queries and writes. For anyone else reading: when dealing with this sort of write-vs-read contention, I suggest limiting concurrent compactions as a starting point.
Another way to help WAL I/O is to write in batches, say 5000 points per batch.
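A sketch of batched writes with the influxdb-python client (the database, measurement, and point values are placeholders):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")
points = [
    {"measurement": "cpu", "tags": {"host": f"h{i % 10}"},
     "fields": {"value": float(i)}}
    for i in range(100_000)
]
# batch_size=5000 makes the client split the write into 5000-point
# HTTP requests, which bounds WAL pressure per request.
client.write_points(points, batch_size=5000)
```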
This issue was the first hit for a search about slow DROP DATABASE. I'm optimizing for disk space, so I very much would like to delete this temporary database.
I am cleaning up some invalid values from my measurement in Python code. The table has data for a year at roughly one-minute resolution.
The query is tremendously slow. I am wondering if it is possible to pass an array of times so the index is rebuilt only once; unfortunately, that query was not working last time I tried.
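A sketch of the cleanup described above (assumes influxdb-python; the database, measurement, and timestamps are placeholders). InfluxQL's DELETE has no `time IN (...)` form, which is why each bad timestamp must be deleted separately:

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")
bad_times = ["2020-01-01T00:01:00Z", "2020-01-01T00:02:00Z"]
for t in bad_times:
    # One DELETE per timestamp; each one touches the index again.
    client.query(f"DELETE FROM \"power\" WHERE time = '{t}'")
```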
Experiencing the same difficulties here with InfluxDB v1.8: it takes more than 25 min to drop 1 measurement whose data is stored over 3 shards of 4 weeks duration. I have 42k measurements to delete; that's not acceptable for our production platform, so I'm going to need to find a trick.
Wow, over 3 years have passed since I reported this and it's not fixed...
Not sure how related, but I've got a measurement with a single point on influxdb 1.8.5, and even when I run a drop on it, I hit the same slowness.
I am also getting the same issue, even though I have only 7-8 measurements and none of them has more than 1500 records. I issue the delete command via curl -X POST; iostat output is attached (iostat.txt). I am at the default configuration, so please let me know if I need to set some configuration options.
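The full command was truncated above; a hypothetical form of a delete issued via the HTTP API ('mydb' and the measurement name are placeholders):

```bash
# DELETE statements go to the /query endpoint via POST.
curl -X POST "http://localhost:8086/query?db=mydb" \
  --data-urlencode 'q=DELETE FROM "my_measurement"'
```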
We also want to drop measurements of hosts which do not exist anymore, on influxdb v1.8.5, but after dropping a few of them the database just locks up and does not accept further drop statements. Then we have to wait up to a day until it is available again. Would be nice to get this finally solved. Also, does someone have a clue whether this issue persists in version 2.x?
Experiencing this issue as well. We had a problem a while back which caused a measurement to be created with about 300k series. The new measurement is small, only about 1m rows, but trying to drop the measurement hangs forever. The cardinality of the database is blown up now (before this measurement, it was around 20k) and is causing all kinds of problems. I'm not even able to successfully drop any series from this measurement now; for example, a drop series query filtering on just 10 series hangs forever. Nothing useful in the logs. We're on v1.8.6 with the TSI index. This issue is also causing our db to take about an hour to start up... any ideas here?
Having this issue too, influx 1.8.10. In my case I'm using a query like:
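(The exact query did not survive in this thread; a hypothetical InfluxQL delete of the shape described, with placeholder measurement, tag, and time bound:)

```sql
-- Hypothetical tag- and time-filtered delete.
DELETE FROM "my_measurement" WHERE "host" = 'web01' AND time < '2022-01-01T00:00:00Z'
```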
Bug report
Influx version: v1.5.1
OS version: Ubuntu Linux 17.04
Steps to reproduce:
1. drop series from /{{regexp_pattern}}/
2. drop measurement /{{regexp_pattern}}/
3. delete from /{{regexp_pattern}}/
4. It will take > 30 minutes.
Expected behavior:
The drop/delete should complete quickly (it was at least 40x faster in v1.3.x).