Coverage for /var/srv/projects/api.amasfac.comuna18.com/tmp/venv/lib/python3.9/site-packages/numpy/lib/format.py: 8%
269 statements
coverage.py v6.4.4, created at 2023-07-17 14:22 -0600
"""
Binary serialization

NPY format
==========

A simple format for saving numpy arrays to disk with the full
information about them.

The ``.npy`` format is the standard binary file format in NumPy for
persisting a *single* arbitrary NumPy array on disk. The format stores all
of the shape and dtype information necessary to reconstruct the array
correctly even on another machine with a different architecture.
The format is designed to be as simple as possible while achieving
its limited goals.

The ``.npz`` format is the standard format for persisting *multiple* NumPy
arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy``
files, one for each array.

Capabilities
------------

- Can represent all NumPy arrays including nested record arrays and
  object arrays.

- Represents the data in its native binary form.

- Supports Fortran-contiguous arrays directly.

- Stores all of the necessary information to reconstruct the array
  including shape and dtype on a machine of a different
  architecture. Both little-endian and big-endian arrays are
  supported, and a file with little-endian numbers will yield
  a little-endian array on any machine reading the file. The
  types are described in terms of their actual sizes. For example,
  if a machine with a 64-bit C "long int" writes out an array with
  "long ints", a reading machine with 32-bit C "long ints" will yield
  an array with 64-bit integers.

- Is straightforward to reverse engineer. Datasets often live longer than
  the programs that created them. A competent developer should be
  able to create a solution in their preferred programming language to
  read most ``.npy`` files that they have been given without much
  documentation.

- Allows memory-mapping of the data. See `open_memmap`.

- Can be read from a filelike stream object instead of an actual file.
- Stores object arrays, i.e. arrays containing elements that are arbitrary
  Python objects. Files with object arrays cannot be memory-mapped, but
  they can be read from and written to disk.
Limitations
-----------

- Arbitrary subclasses of numpy.ndarray are not completely preserved.
  Subclasses will be accepted for writing, but only the array data will
  be written out. A regular numpy.ndarray object will be created
  upon reading the file.

.. warning::

  Due to limitations in the interpretation of structured dtypes, dtypes
  with fields with empty names will have the names replaced by 'f0', 'f1',
  etc. Such arrays will not round-trip through the format entirely
  accurately. The data is intact; only the field names will differ. We are
  working on a fix for this. This fix will not require a change in the
  file format. The arrays with such structures can still be saved and
  restored, and the correct dtype may be restored by using the
  ``loadedarray.view(correct_dtype)`` method.

File extensions
---------------

We recommend using the ``.npy`` and ``.npz`` extensions for files saved
in this format. This is by no means a requirement; applications may wish
to use these file formats but use an extension specific to the
application. In the absence of an obvious alternative, however,
we suggest using ``.npy`` and ``.npz``.
Version numbering
-----------------

The version numbering of these formats is independent of NumPy version
numbering. If the format is upgraded, the code in `numpy.lib.format` will
still be able to read and write Version 1.0 files.
Format Version 1.0
------------------

The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.

The next 1 byte is an unsigned byte: the major version number of the file
format, e.g. ``\\x01``.

The next 1 byte is an unsigned byte: the minor version number of the file
format, e.g. ``\\x00``. Note: the version of the file format is not tied
to the version of the numpy package.

The next 2 bytes form a little-endian unsigned short int: the length of
the header data HEADER_LEN.

The next HEADER_LEN bytes form the header data describing the array's
format. It is an ASCII string which contains a Python literal expression
of a dictionary. It is terminated by a newline (``\\n``) and padded with
spaces (``\\x20``) to make the total of
``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible
by 64 for alignment purposes.

The dictionary contains three keys:

    "descr" : dtype.descr
      An object that can be passed as an argument to the `numpy.dtype`
      constructor to create the array's dtype.
    "fortran_order" : bool
      Whether the array data is Fortran-contiguous or not. Since
      Fortran-contiguous arrays are a common form of non-C-contiguity,
      we allow them to be written directly to disk for efficiency.
    "shape" : tuple of int
      The shape of the array.

For repeatability and readability, the dictionary keys are sorted in
alphabetic order. This is for convenience only. A writer SHOULD implement
this if possible. A reader MUST NOT depend on this.

Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.
Format Version 2.0
------------------

The version 1.0 format only allowed the array header to have a total size of
65535 bytes. This can be exceeded by structured arrays with a large number of
columns. The version 2.0 format extends the header size to 4 GiB.
`numpy.save` will automatically save in 2.0 format if the data requires it,
else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:
"The next 4 bytes form a little-endian unsigned int: the length of the header
data HEADER_LEN."

Format Version 3.0
------------------

This version replaces the ASCII string (which in practice was latin1) with
a utf8-encoded string, so supports structured types with any unicode field
names.
Notes
-----
The ``.npy`` format, including motivation for creating it and a comparison of
alternatives, is described in the
:doc:`"npy-format" NEP <neps:nep-0001-npy-format>`; however, details have
evolved with time and this document is more current.

"""
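The version 1.0 layout described above can be verified end to end with a short, self-contained sketch. It assumes only `numpy` and the standard library, and that the file was written by a NumPy recent enough (>= 1.14) to pad the preamble to 64 bytes:

```python
import ast
import io
import struct

import numpy as np

# Save a small array to an in-memory buffer, then walk the version 1.0
# layout by hand: magic string, version bytes, header length, header dict.
buf = io.BytesIO()
np.save(buf, np.arange(6, dtype='<i4').reshape(2, 3))
buf.seek(0)

magic_bytes = buf.read(6)                    # b'\x93NUMPY'
major, minor = buf.read(2)                   # unsigned version bytes
(hlen,) = struct.unpack('<H', buf.read(2))   # little-endian unsigned short
header = ast.literal_eval(buf.read(hlen).decode('latin1'))

assert magic_bytes == b'\x93NUMPY'
assert (major, minor) == (1, 0)
assert header == {'descr': '<i4', 'fortran_order': False, 'shape': (2, 3)}
# magic + version + length field + header is padded to a multiple of 64.
assert (6 + 2 + 2 + hlen) % 64 == 0
```

The remaining bytes of the buffer are exactly the 24 bytes of array data (6 elements times `itemsize` 4).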
import numpy
import warnings
from numpy.lib.utils import safe_eval
from numpy.compat import (
    isfileobj, os_fspath, pickle
    )


__all__ = []


EXPECTED_KEYS = {'descr', 'fortran_order', 'shape'}
MAGIC_PREFIX = b'\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2
ARRAY_ALIGN = 64  # plausible values are powers of 2 between 16 and 4096
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays
_header_size_info = {
    (1, 0): ('<H', 'latin1'),
    (2, 0): ('<I', 'latin1'),
    (3, 0): ('<I', 'utf8'),
}

# Python's literal_eval is not actually safe for large inputs, since parsing
# may become slow or even cause interpreter crashes.
# This is an arbitrary, low limit which should make it safe in practice.
_MAX_HEADER_SIZE = 10000
def _check_version(version):
    if version not in [(1, 0), (2, 0), (3, 0), None]:
        msg = "we only support format version (1,0), (2,0), and (3,0), not %s"
        raise ValueError(msg % (version,))
def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : str

    Raises
    ------
    ValueError if the version cannot be formatted.
    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    return MAGIC_PREFIX + bytes([major, minor])
def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    major, minor = magic_str[-2:]
    return major, minor
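As a quick sanity check, `magic` and `read_magic` round-trip through an in-memory stream; a minimal sketch using the public `numpy.lib.format` API:

```python
import io
from numpy.lib.format import magic, read_magic

# Build the 8-byte preamble for format version 3.0 and read it back.
buf = io.BytesIO(magic(3, 0))
assert read_magic(buf) == (3, 0)

# The prefix is always the same 6 bytes; only the version bytes vary.
assert magic(1, 0)[:6] == magic(2, 0)[:6] == b'\x93NUMPY'
```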
def _has_metadata(dt):
    if dt.metadata is not None:
        return True
    elif dt.names is not None:
        return any(_has_metadata(dt[k]) for k in dt.names)
    elif dt.subdtype is not None:
        return _has_metadata(dt.base)
    else:
        return False
def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed
    to dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.

    """
    if _has_metadata(dtype):
        warnings.warn("metadata on a dtype may be saved or ignored, but will "
                      "raise if saved when read. Use another form of storage.",
                      UserWarning, stacklevel=2)
    if dtype.names is not None:
        # This is a record array. The .descr is fine. XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    else:
        return dtype.str
def descr_to_dtype(descr):
    """
    Returns a dtype based off the given description.

    This is essentially the reverse of `dtype_to_descr()`. It will remove
    the valueless padding fields created by simple dtypes like
    dtype('float32'), and then convert the description to its corresponding
    dtype.

    Parameters
    ----------
    descr : object
        The object retrieved by dtype.descr. Can be passed to
        `numpy.dtype()` in order to replicate the input dtype.

    Returns
    -------
    dtype : dtype
        The dtype constructed by the description.

    """
    if isinstance(descr, str):
        # No padding removal needed
        return numpy.dtype(descr)
    elif isinstance(descr, tuple):
        # subtype, will always have a shape descr[1]
        dt = descr_to_dtype(descr[0])
        return numpy.dtype((dt, descr[1]))

    titles = []
    names = []
    formats = []
    offsets = []
    offset = 0
    for field in descr:
        if len(field) == 2:
            name, descr_str = field
            dt = descr_to_dtype(descr_str)
        else:
            name, descr_str, shape = field
            dt = numpy.dtype((descr_to_dtype(descr_str), shape))

        # Ignore padding bytes, which will be void bytes with '' as name.
        # Once support for blank names is removed, only "if name == ''" is
        # needed.
        is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
        if not is_pad:
            title, name = name if isinstance(name, tuple) else (None, name)
            titles.append(title)
            names.append(name)
            formats.append(dt)
            offsets.append(offset)
        offset += dt.itemsize

    return numpy.dtype({'names': names, 'formats': formats, 'titles': titles,
                        'offsets': offsets, 'itemsize': offset})
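The asymmetry that `dtype_to_descr` and `descr_to_dtype` work around can be seen directly; a small sketch (note the exact byte-order character in `.str` depends on the platform, so the checks below avoid hard-coding it):

```python
import numpy as np
from numpy.lib.format import dtype_to_descr, descr_to_dtype

# A simple dtype: .descr looks like a one-field record with '' as the
# name, which np.dtype() would rename to 'f0'. The .str form round-trips.
dt = np.dtype('float32')
assert dt.descr[0][0] == ''          # empty field name in .descr
assert dtype_to_descr(dt) == dt.str  # unstructured dtypes use .str

# A structured dtype round-trips through the descr form.
rec = np.dtype([('a', '<i4'), ('b', '<f8')])
assert descr_to_dtype(dtype_to_descr(rec)) == rec
```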
def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    """
    d = {'shape': array.shape}
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d
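The contiguity logic above can be exercised directly through the public `numpy.lib.format` API; a short sketch covering all three branches:

```python
import numpy as np
from numpy.lib.format import header_data_from_array_1_0

# Fortran-ordered 2-D data is flagged so it can be written without a copy.
f_arr = np.zeros((3, 4), order='F')
assert header_data_from_array_1_0(f_arr)['fortran_order'] is True

# A 1-D array is both C- and F-contiguous; the C branch is checked first,
# so the flag stays False.
one_d = np.zeros(5)
assert header_data_from_array_1_0(one_d)['fortran_order'] is False

# Non-contiguous data (e.g. a strided view) also reports False; it will
# be made C-contiguous before writing.
view = np.zeros((4, 4))[::2, ::2]
assert header_data_from_array_1_0(view)['fortran_order'] is False
```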
def _wrap_header(header, version):
    """
    Takes a stringified header, and attaches the prefix and padding to it
    """
    import struct
    assert version is not None
    fmt, encoding = _header_size_info[version]
    header = header.encode(encoding)
    hlen = len(header) + 1
    padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN)
    try:
        header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen)
    except struct.error:
        msg = "Header length {} too big for version={}".format(hlen, version)
        raise ValueError(msg) from None

    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length short and the header are aligned on a
    # ARRAY_ALIGN byte boundary. This supports memory mapping of dtypes
    # aligned up to ARRAY_ALIGN on systems like Linux where mmap()
    # offset must be page-aligned (i.e. the beginning of the file).
    return header_prefix + header + b' '*padlen + b'\n'
def _wrap_header_guess_version(header):
    """
    Like `_wrap_header`, but chooses an appropriate version given the contents
    """
    try:
        return _wrap_header(header, (1, 0))
    except ValueError:
        pass

    try:
        ret = _wrap_header(header, (2, 0))
    except UnicodeEncodeError:
        pass
    else:
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning, stacklevel=2)
        return ret

    header = _wrap_header(header, (3, 0))
    warnings.warn("Stored array in format 3.0. It can only be "
                  "read by NumPy >= 1.17", UserWarning, stacklevel=2)
    return header
def _write_array_header(fp, d, version=None):
    """ Write the header for an array to a filelike object.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    version : tuple or None
        None means use oldest that works. Providing an explicit version will
        raise a ValueError if the format does not allow saving this data.
        Default: None
    """
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)
    if version is None:
        header = _wrap_header_guess_version(header)
    else:
        header = _wrap_header(header, version)
    fp.write(header)
def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
        The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))
def read_array_header_1_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
        fp, version=(1, 0), max_header_size=max_header_size)
def read_array_header_2_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
        fp, version=(2, 0), max_header_size=max_header_size)
def _filter_header(s):
    """Clean up 'L' in npz header ints.

    Cleans up the 'L' in strings representing integers. Needed to allow npz
    headers produced in Python 2 to be read in Python 3.

    Parameters
    ----------
    s : string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    from io import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(s).readline):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)
def _read_array_header(fp, version, max_header_size=_MAX_HEADER_SIZE):
    """
    see read_array_header_1_0
    """
    # Read an unsigned, little-endian short int which has the length of the
    # header.
    import struct
    hinfo = _header_size_info.get(version)
    if hinfo is None:
        raise ValueError("Invalid version {!r}".format(version))
    hlength_type, encoding = hinfo

    hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length")
    header_length = struct.unpack(hlength_type, hlength_str)[0]
    header = _read_bytes(fp, header_length, "array header")
    header = header.decode(encoding)
    if len(header) > max_header_size:
        raise ValueError(
            f"Header info length ({len(header)}) is large and may not be safe "
            "to load securely.\n"
            "To allow loading, adjust `max_header_size` or fully trust "
            "the `.npy` file using `allow_pickle=True`.\n"
            "For safety against large resource use or crashes, sandboxing "
            "may be necessary.")

    # The header is a pretty-printed string representation of a literal
    # Python dictionary with trailing newlines padded to an ARRAY_ALIGN byte
    # boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    # Versions (2, 0) and (1, 0) could have been created by a Python 2
    # implementation before header filtering was implemented.
    if version <= (2, 0):
        header = _filter_header(header)
    try:
        d = safe_eval(header)
    except SyntaxError as e:
        msg = "Cannot parse header: {!r}"
        raise ValueError(msg.format(header)) from e
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: {!r}"
        raise ValueError(msg.format(d))

    if EXPECTED_KEYS != d.keys():
        keys = sorted(d.keys())
        msg = "Header does not contain the correct keys: {!r}"
        raise ValueError(msg.format(keys))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not all(isinstance(x, int) for x in d['shape'])):
        msg = "shape is not valid: {!r}"
        raise ValueError(msg.format(d['shape']))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: {!r}"
        raise ValueError(msg.format(d['fortran_order']))
    try:
        dtype = descr_to_dtype(d['descr'])
    except TypeError as e:
        msg = "descr is not a valid dtype descriptor: {!r}"
        raise ValueError(msg.format(d['descr'])) from e

    return d['shape'], d['fortran_order'], dtype
def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None):
    """
    Write an array to an NPY file, including a header.

    If the array is neither C-contiguous nor Fortran-contiguous AND the
    file_like object is not a real file object, this function will have to
    copy data in memory.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data. Default: None
    allow_pickle : bool, optional
        Whether to allow writing pickled data. Default: True
    pickle_kwargs : dict, optional
        Additional keyword arguments to pass to pickle.dump, excluding
        'protocol'. These are only useful when pickling objects in object
        arrays on Python 3 to Python 2 compatible format.

    Raises
    ------
    ValueError
        If the array cannot be persisted. This includes the case of
        allow_pickle=False and array being an object array.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.

    """
    _check_version(version)
    _write_array_header(fp, header_data_from_array_1_0(array), version)

    if array.itemsize == 0:
        buffersize = 0
    else:
        # Set buffer size to 16 MiB to hide the Python loop overhead.
        buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    if array.dtype.hasobject:
        # We contain Python objects so we cannot write out the data
        # directly. Instead, we will pickle it out
        if not allow_pickle:
            raise ValueError("Object arrays cannot be saved when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        pickle.dump(array, fp, protocol=3, **pickle_kwargs)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))
def read_array(fp, allow_pickle=False, pickle_kwargs=None, *,
               max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.
    allow_pickle : bool, optional
        Whether to allow reading pickled data. Default: False

        .. versionchanged:: 1.16.3
            Made default False in response to CVE-2019-6446.

    pickle_kwargs : dict
        Additional keyword arguments to pass to pickle.load. These are only
        useful when loading object arrays saved on Python 2 when using
        Python 3.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.
        This option is ignored when `allow_pickle` is passed. In that case
        the file is by definition trusted and the limit is unnecessary.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid, or allow_pickle=False and the file contains
        an object array.

    """
    if allow_pickle:
        # Effectively ignore max_header_size, since `allow_pickle` indicates
        # that the input is fully trusted.
        max_header_size = 2**64

    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(
        fp, version, max_header_size=max_header_size)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape, dtype=numpy.int64)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        if not allow_pickle:
            raise ValueError("Object arrays cannot be loaded when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        try:
            array = pickle.load(fp, **pickle_kwargs)
        except UnicodeError as err:
            # Friendlier error message
            raise UnicodeError("Unpickling a python object failed: %r\n"
                               "You may need to pass the encoding= option "
                               "to numpy.load" % (err,)) from err
    else:
        if isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            # Use np.ndarray instead of np.empty since the latter does
            # not correctly instantiate zero-width string dtypes; see
            # https://github.com/numpy/numpy/pull/6430
            array = numpy.ndarray(count, dtype=dtype)

            if dtype.itemsize > 0:
                # If dtype.itemsize == 0 then there's nothing more to read
                max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

                for i in range(0, count, max_read_count):
                    read_count = min(max_read_count, count - i)
                    read_size = int(read_count * dtype.itemsize)
                    data = _read_bytes(fp, read_size, "array data")
                    array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                             count=read_count)

        if fortran_order:
            array.shape = shape[::-1]
            array = array.transpose()
        else:
            array.shape = shape

    return array
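`write_array` and `read_array` round-trip through any file-like object, including an in-memory buffer; a minimal sketch that also exercises the `fortran_order` branch:

```python
import io

import numpy as np
from numpy.lib.format import read_array, write_array

arr = np.arange(12, dtype='<f8').reshape(3, 4)
buf = io.BytesIO()
write_array(buf, np.asfortranarray(arr))  # F-contiguous input
buf.seek(0)

out = read_array(buf)
assert np.array_equal(out, arr)
assert out.flags.f_contiguous  # memory order survives the round trip
```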
def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None, *,
                max_header_size=_MAX_HEADER_SIZE):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str or path-like
        The name of the file on disk. This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'. In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write." See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode; if not, `dtype` is ignored. The default value is None, which
        results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required. Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file. None means use the oldest
        supported version that is able to store the data. Default: None
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    OSError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    numpy.memmap

    """
    if isfileobj(filename):
        raise ValueError("Filename must be a string or a path-like object."
                         " Memmap cannot use existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        with open(os_fspath(filename), mode+'b') as fp:
            _write_array_header(fp, d, version)
            offset = fp.tell()
    else:
        # Read the header of the file first.
        with open(os_fspath(filename), 'rb') as fp:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(
                fp, version, max_header_size=max_header_size)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
                          mode=mode, offset=offset)

    return marray
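In a "write" mode, `open_memmap` creates the file, writes the header, and hands back a `numpy.memmap` positioned at the data offset; the result is an ordinary `.npy` file. A sketch using a temporary directory:

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.npy')

    # 'w+' creates the file and writes the header; the returned memmap
    # starts at the (64-byte aligned) data offset.
    m = open_memmap(path, mode='w+', dtype='<i4', shape=(2, 5))
    m[:] = np.arange(10, dtype='<i4').reshape(2, 5)
    m.flush()
    del m  # release the mapping before reopening

    # The file is readable with the ordinary loader.
    back = np.load(path)
    assert back.shape == (2, 5)
    assert int(back[1, 4]) == 9
```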
def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects only supported if they derive from io objects.

    Required as e.g. ZipExtFile in python 2.6 can return less data than
    requested.
    """
    data = bytes()
    while True:
        # io files (default in python3) return None or raise on
        # would-block, python2 file will truncate, probably nothing can be
        # done about that. note that regular files can't be non-blocking
        try:
            r = fp.read(size - len(data))
            data += r
            if len(r) == 0 or len(data) == size:
                break
        except BlockingIOError:
            pass
    if len(data) != size:
        msg = "EOF: reading %s, expected %d bytes got %d"
        raise ValueError(msg % (error_template, size, len(data)))
    else:
        return data