'out of resources' error returned from OpenCL code with NVIDIA cards with memory >8 GB #30
Could not reproduce this on a GeForce GTX TITAN X with 12 GB main memory (running Ubuntu 14.04.5 LTS). Maybe collecting devices/setups that do and don't work would help to narrow down the list of possible causes?
In case it helps, I was able to run […]
I spoke too soon. Although the Lorenz code did run, I'm experiencing the same issue when using my own data: an OUT_OF_RESOURCES error on the 2080 Ti (11 GB), but no problems on the Tesla K20c (5 GB) or when using CPUs.
The data structure contains 16 processes, 46 samples, 1106 replications.
I did check memory usage on the card, and it was always very low, less than 1 GB.
Hi all,
my gut feeling is that one of the buffers we allocate in the OpenCL code is too small, i.e. we use more memory than we request, but only a little. On some cards/systems/drivers, and on some smaller datasets, we do not get a problem as long as our buffer stays within the bounds of the last memory page we requested (say we requested 129 KB and the pages are 64 KB each: then we will actually get 3 pages, and can silently consume another 63 KB that we did not request).
Things that could influence this error are:
(1) Card type will influence the page size; it used to be 64 KB on AMD Hawaii with the fglrx driver, but seems to be only 4 KB on the Vega 64 with both the amdgpu-pro and the ROCm driver.
(2) Different drivers may have different ways of allocating memory as well.
(3) The interaction of dataset size and card memory size may lead to too large scheduled computations that then amplify our misallocation problem until we overrun the slack we have on the last memory page we requested.
ALTERNATIVELY, the interplay of available card memory, dataset size, and chip architecture may lead to too large OpenCL work-items (or something similar) that overtax the 'local' memory of each compute unit (i.e. we're trying to store too much information in local memory). I think this is less likely, because in that case we should see the failures align more with NVIDIA's chip architectures (Kepler/Pascal/Volta/Turing). However, we have large analyses running on Pascal GTX 1080s and failures on the Pascal Quadro P6000.
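The page-slack arithmetic behind the first hypothesis can be sketched in a few lines (the sizes are the illustrative ones from the example above, not measured values):

```python
def page_slack(requested_bytes, page_size):
    # Allocations are rounded up to whole pages; the difference is
    # slack that a small out-of-bounds write can silently consume
    # before the driver notices anything.
    pages = -(-requested_bytes // page_size)  # ceiling division
    return pages * page_size - requested_bytes

# Request 129 KB with 64 KB pages -> 3 pages, 63 KB of slack.
print(page_slack(129 * 1024, 64 * 1024))  # 64512 bytes = 63 KB

# With 4 KB pages (as reported for Vega 64 / ROCm) the slack shrinks
# to 3 KB, so the same overrun faults much sooner.
print(page_slack(129 * 1024, 4 * 1024))   # 3072 bytes = 3 KB
```

This would explain why the same code can run cleanly on one card/driver combination and crash on another: the bug is the same, only the slack differs.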
I will also forward this to Pedro, to get his input.
Best,
Michael
[Quoted from Javier G. Orlandi's report:]
I'm running the standard multivariate TE with the OpenCLKraskovCMI estimator:
from idtxl.multivariate_te import MultivariateTE
from idtxl.data import Data
from idtxl.visualise_graph import plot_network
from idtxl import idtxl_io as io
import matplotlib.pyplot as plt
import pickle
import numpy, scipy.io

data = io.import_matarray(file_name='test.mat',
                          array_name='XR',
                          dim_order='rps',
                          file_version='v7.3',
                          normalise=False)

network_analysis = MultivariateTE()
settings = {'cmi_estimator': 'OpenCLKraskovCMI',
            'max_lag_sources': 3,
            'min_lag_sources': 1}
results = network_analysis.analyse_network(settings=settings, data=data)
pickle.dump(results, open('results.p', 'wb'))
Data structure contains 16 processes, 46 samples, 1106 replications.
With 200 replications it runs fine, but with the above number it results in the following error on computing sources for the first target:
---------------------------- (2) include source candidates
candidate set: [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 3), (5, 1), (5, 2), (5, 3), (6, 1), (6, 2), (6, 3), (7, 1), (7, 2), (7, 3), (8, 1), (8, 2), (8, 3), (9, 1), (9, 2), (9, 3), (10, 1), (10, 2), (10, 3), (11, 1), (11, 2), (11, 3), (12, 1), (12, 2), (12, 3), (13, 1), (13, 2), (13, 3), (14, 1), (14, 2), (14, 3), (15, 1), (15, 2), (15, 3)]
testing candidate: (14, 1) maximum statistic, n_perm: 200
Traceback (most recent call last):
  File "multivariateTEtestR.py", line 29, in <module>
    results = network_analysis.analyse_network(settings=settings, data=data)
  File "/home/benuccilab/IDTxl/idtxl/multivariate_te.py", line 159, in analyse_network
    settings, data, targets[t], sources[t])
  File "/home/benuccilab/IDTxl/idtxl/multivariate_te.py", line 276, in analyse_single_target
    self._include_source_candidates(data)
  File "/home/benuccilab/IDTxl/idtxl/network_inference.py", line 826, in _include_source_candidates
    self._include_candidates(candidates, data)
  File "/home/benuccilab/IDTxl/idtxl/network_inference.py", line 120, in _include_candidates
    conditional=self._selected_vars_realisations)
  File "/home/benuccilab/IDTxl/idtxl/estimator.py", line 278, in estimate_parallel
    return self.estimate(n_chunks=n_chunks, **data)
  File "/home/benuccilab/IDTxl/idtxl/estimators_opencl.py", line 539, in estimate
    n_chunks_current_run)
  File "/home/benuccilab/IDTxl/idtxl/estimators_opencl.py", line 680, in _estimate_single_run
    cl.enqueue_copy(self.queue, distances, d_distances)
  File "/home/benuccilab/conda/envs/idtxl/lib/python3.7/site-packages/pyopencl/__init__.py", line 1709, in enqueue_copy
    return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
pyopencl._cl.RuntimeError: clEnqueueReadBuffer failed: OUT_OF_RESOURCES
Some more info, now that I am testing on multiple machines, including AMD ones.
(1) On two machines with Vega 64 cards and AMD ROCm's OpenCL, I get from python/pyopencl: "Memory access fault by GPU node-1 (Agent handle: 0x564110c33270) on address 0xa02a00000. Reason: Page not present or supervisor privilege." Note that the address 0xa02a00000 is identical on both systems, although the cards are slightly different (a regular Vega 64, 8 GB, and a WX9100, 16 GB, Radeon Pro model). dmesg returns: gmc_v9_0_process_interrupt: 6 callbacks suppressed
(2) On an AMD APU with the amdgpu-pro driver I simply get a system crash. This happens if I run more than 37 or 38 replications in systemtest_lorenz2_opencl.py.
(3) Using the older develop version that Aaron is running in Frankfurt (obtained from Patricia via email, I think?), I get a different error, and much earlier in the process: […]
Googling for this last error message turns up posts (https://stackoverflow.com/questions/17575032/using-clcreatesubbuffer) suggesting that memory management should be done in relation to the device property CL_DEVICE_MEM_BASE_ADDR_ALIGN.
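The alignment constraint behind CL_DEVICE_MEM_BASE_ADDR_ALIGN can be sketched as follows (a minimal sketch; the function name and the 4096-bit example value are illustrative, not queried from a real device). The OpenCL property is reported in bits, and clCreateSubBuffer rejects sub-buffer origins that are not a multiple of that value in bytes:

```python
def aligned_origin(offset_bytes, align_bits):
    """Round a sub-buffer origin up to the device's base-address alignment.

    CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits; clCreateSubBuffer
    fails with CL_MISALIGNED_SUB_BUFFER_OFFSET if the origin is not a
    multiple of this value expressed in bytes.
    """
    align_bytes = align_bits // 8
    return -(-offset_bytes // align_bytes) * align_bytes  # ceiling division

# With a (hypothetical) 4096-bit requirement, origins must be
# 512-byte aligned, so an offset of 1000 must be padded to 1024.
print(aligned_origin(1000, 4096))  # 1024
```

In pyopencl the actual value can be read from a device object as `device.mem_base_addr_align`.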
Hi, just to chip in with the same error. I am running Windows 10 on an Intel i7-5960X with two GeForce Titan X (12 GB) cards. I have a data set with 839 processes, 14422 samples, 1 replication. I run the following code: […] and I get the error: Traceback (most recent call last): […] Any help with this? Thanks
Replace unsigned int types in OpenCL/CUDA code. For very large point sets this leads to an overflow and incorrect indexing of arrays. Add test scripts. Update CUDA makefile. Fixes #30.
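The overflow this commit describes can be reproduced with plain integer arithmetic (the numbers below are illustrative, not taken from a real run): index expressions like `chunk * signallength` silently wrap once they exceed what a 32-bit `unsigned int` can hold.

```python
# Hypothetical sizes: a large point set split into chunks.
signallength = 200_000   # points per chunk (illustrative)
chunk = 25_000           # chunk index (illustrative)

full = chunk * signallength   # exact product: 5_000_000_000
idx32 = full % 2**32          # what a 32-bit `unsigned int` index stores
print(full, idx32)            # 5000000000 705032704 -- silent wraparound
```

The wrapped index still points inside the buffer, so instead of a clean error the kernel reads and writes the wrong array elements, which matches the "incorrect indexing" the commit message mentions.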
An update on this issue (after the fix with the int index that's already included in the branch fix_gpu_bug):
Unfortunately, there are still errors if the product of n_points * dim * chunks exceeds a certain threshold AND the padding is used (it is necessary), i.e. if the number of bytes (data points) that go to the GPU card is not a multiple of 1024. In that case the computation on the GPU runs (as seen by the time elapsed until the error), but there is a memory access violation when returning, leading to the following error messages: […]
This does not happen when the data that goes to the GPU is a multiple of 1024, i.e. when we pad with zero points, or when we switch off padding (this only works for NVIDIA cards, see below). Note that the padding is only necessary on cards that need manual sub-buffer alignment (AMD cards). So on NVIDIA cards a simple solution would be to detect the manufacturer and switch off the padding altogether.
On some AMD cards that only provide OpenCL 1.2 capabilities (e.g. the Lexa XT chip and the old Hawaii chips) there seems to be no problem with the padding, for reasons unknown. So for AMD cards that provide only OpenCL 1.2 capabilities, the solution could be to detect the capabilities and use the padding as is.
Btw., running the OpenCL code on a multicore CPU using Intel's OpenCL implementation also works (it's just 100x slower), so there are no really gross errors in the implementation of the actual OpenCL kernel, I guess. The remaining problems on AMD cards with the ROCm driver and OpenCL 2.0 capabilities (definitely Vega, possibly Polaris and Fiji) need to be solved in the OpenCL code. It is also possible there is an OpenCL 2.0 issue, possibly in pyopencl.
I would be very glad if someone else could confirm the above observations by: […]
and then report: […]
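The zero-padding workaround described above (pad the point set so its length is a multiple of 1024 before transfer to the GPU) can be sketched like this; the function name and array shapes are illustrative, not IDTxl's actual API:

```python
import numpy as np

def pad_signal(points, multiple=1024):
    """Zero-pad a (dim, n_points) array so n_points is a multiple of `multiple`."""
    dim, n = points.shape
    padded_n = -(-n // multiple) * multiple  # round n up to the next multiple
    padded = np.zeros((dim, padded_n), dtype=points.dtype)
    padded[:, :n] = points
    # Return the original length so results for padded points can be discarded.
    return padded, n

# e.g. the 1106 replications from the report above round up to 2048.
x = np.random.rand(3, 1106).astype(np.float32)
padded, n_orig = pad_signal(x)
print(padded.shape)  # (3, 2048)
```

Padding with zeros keeps the transfer size a multiple of the alignment unit, at the cost of a few extra (ignorable) neighbour-search results for the dummy points.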
Not sure if this is still open or under consideration, but anyway...
OS: Windows 10 (Enterprise, version 1909, build 18363.778); GPU: NVIDIA GeForce RTX 2070; pyopencl: pyopencl-2020.2+cl12-cp38-cp38-win_amd64.whl (max OpenCL 1.2 on NVIDIA, as you know)
There is no clinfo on Windows, so below is a shortened version of a GPU Caps Viewer report with more GPU and OpenCL info: […]
I have uploaded a preliminary bugfix for this problem. See branch OpenCL_bugfix. Testing is appreciated.
…riables: signallength_padded and signallength_orig, set padding default to true, made callers aware of the additional argument in the OpenCL kernels. Fixes #30.
There is an issue with NVIDIA cards with memory larger (!) than 8 GB ironically reporting an 'out of resources' error some time into the computation (e.g. when running systemtest_lorenz2_opencl.py). Cards of the same chip architecture with up to 8 GB do not seem to have that problem, e.g.:
Cards running fine: Titan 1st gen. (Kepler, 6 GB), GTX 1080 (Pascal, 8 GB)
Cards returning errors: Quadro P6000 (Pascal, 24 GB), Tesla V100 (Volta, 32 GB)