This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Bug for matrices of multiple dimension, with one dimension much larger #11495

Closed
jaanli opened this issue Jun 29, 2018 · 6 comments

Comments

@jaanli

jaanli commented Jun 29, 2018

There is a bug when working with large matrices. Although the size of each dimension is moderate, converting the matrix to NumPy with asnumpy() fails.

Minimal example on mxnet 1.3.0:

In [4]: from mxnet import gluon, nd

In [5]: m = gluon.nn.Embedding(14000, 128)

In [6]: m.initialize()

In [7]: ind = nd.zeros((700000, 128))

In [8]: x = m(ind)

In [9]: x.shape
Out[9]: (700000, 128, 128)

In [10]: test = x.asnumpy()
---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-10-3ffeab2024d8> in <module>()
----> 1 test = x.asnumpy()

/usr/local/anaconda3/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   1892             self.handle,
   1893             data.ctypes.data_as(ctypes.c_void_p),
-> 1894             ctypes.c_size_t(data.size)))
   1895         return data
   1896

/usr/local/anaconda3/lib/python3.6/site-packages/mxnet/base.py in check_call(ret)
    208     """
    209     if ret != 0:
--> 210         raise MXNetError(py_str(_LIB.MXGetLastError()))
    211
    212

MXNetError: [11:44:36] include/mxnet/./tensor_blob.h:257: Check failed: this->shape_.Size() == shape.Size() (11468800000 vs. 2878865408) TBlob.get_with_shape: new and old shape do not match total elements

Stack trace returned 8 entries:
[bt] (0) 0   libmxnet.so                         0x0000000110551eb4 libmxnet.so + 20148
[bt] (1) 1   libmxnet.so                         0x0000000110551c6f libmxnet.so + 19567
[bt] (2) 2   libmxnet.so                         0x000000011058ce59 libmxnet.so + 261721
[bt] (3) 3   libmxnet.so                         0x000000011177638e MXNDListFree + 1511918
[bt] (4) 4   libmxnet.so                         0x000000011175e538 MXNDListFree + 1414040
[bt] (5) 5   libmxnet.so                         0x00000001115c017d MXNDArraySyncCopyToCPU + 45
[bt] (6) 6   libffi.6.dylib                      0x00000001036e1884 ffi_call_unix64 + 76
[bt] (7) 7   ???                                 0x00007ffeee06ef30 0x0 + 140732891852592
@szha
Member

szha commented Jun 29, 2018

This is likely due to the type being used for TShape.

@frankfliu
Contributor

Hi @altosaar, thanks for submitting this issue. @sandeep-krishnamurthy, requesting that this be labeled.

@apeforest
Contributor

I have created a JIRA ticket to track this bug. I will work on it.

@apeforest
Contributor

The bug is due to 32-bit unsigned int overflow. The second value in the CHECK_EQ comes from Size() of mshadow::Shape<dim>, whose return type is defined as unsigned in the header file.
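As a quick sanity check of the overflow explanation, the two numbers in the error message can be reproduced with plain integer arithmetic. This is only an illustrative sketch in pure Python (no MXNet calls); the variable names are made up for the example.

```python
from functools import reduce
from operator import mul

# Shape of the embedding output reported in the issue.
shape = (700000, 128, 128)

# True element count, as a 64-bit size_t would hold it.
true_size = reduce(mul, shape, 1)

# The same count truncated to 32 bits, i.e. taken modulo 2**32.
wrapped_size = true_size % 2**32

print(true_size)     # 11468800000, the first value in the CHECK_EQ
print(wrapped_size)  # 2878865408, the second value in the CHECK_EQ
```

Both values match the ones in the "Check failed" message, which is consistent with a 32-bit wraparound of the total element count.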

@apeforest
Contributor

More details about this bug: Size() in TShape returns a size_t value (700000x128x128 in this case). However, mshadow::Shape uses unsigned int for each dimension. When an NDArray is converted to NumPy, the Copy function flattens the TShape into a 1-D mshadow::Shape, so the total element count has to fit into a single 32-bit dimension, which causes the integer overflow here.
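The flattening step described above can be imitated outside MXNet to see where the count gets narrowed. The sketch below is an assumption-laden stand-in, not MXNet or mshadow code: `flatten_to_1d_uint32` and `fits_in_uint32` are hypothetical helpers, and `ctypes.c_uint32` is used only to model a 32-bit unsigned dimension.

```python
import ctypes
from math import prod  # available in Python 3.8+

UINT32_MAX = 2**32 - 1

def flatten_to_1d_uint32(shape):
    """Hypothetical model of flattening a shape into one 32-bit dimension.

    The full element count is computed without overflow (like size_t on a
    64-bit platform) and then stored in a 32-bit unsigned integer, which
    silently keeps only the low 32 bits.
    """
    total = prod(shape)
    return ctypes.c_uint32(total).value

def fits_in_uint32(shape):
    """Return True if the total element count survives the narrowing."""
    return prod(shape) <= UINT32_MAX

big = (700000, 128, 128)    # shape from the issue
small = (14000, 128)        # embedding weight shape from the issue

print(flatten_to_1d_uint32(big), fits_in_uint32(big))
print(flatten_to_1d_uint32(small), fits_in_uint32(small))
```

For the shape in the issue the flattened dimension no longer equals the real element count, while the much smaller weight matrix is unaffected. A check like `fits_in_uint32` could be used on affected versions to decide whether an array is safe to copy in one piece.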

@apeforest
Contributor

@sandeep-krishnamurthy This issue is resolved by PR #11742. Please close it. Thx
