ArrayProcessing

Tools for parallel processing of large arrays.

Note

This module provides an interface for working with large numpy arrays and speeds up numpy routines that become very slow for arrays larger than roughly 100-500 GB.

The implementation builds on the buffer interface used by Cython.

apply_lut(source, lut, sink=None, blocks=None, processes=None, verbose=False)

Transforms the source via a lookup table.

Arguments

sourcearray

The source array.

lutarray

The lookup table.

sinkarray or None

The result array. If None, a new array is created.

processesNone or int

Number of processes to use, if None use number of cpus.

verbosebool

If True, print progress information.

Returns

sinkarray

The source transformed via the lookup table.
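As a point of reference, the lookup-table transform described above can be sketched in a single process with plain numpy indexing. This is only an illustration of the semantics (each source value `v` is replaced by `lut[v]`), assuming an integer-valued source; the helper name `apply_lut_serial` is hypothetical and says nothing about how the parallel implementation is organized.

```python
import numpy as np

def apply_lut_serial(source, lut):
    """Single-process reference: replace each value v in source by lut[v].

    Hypothetical helper for illustration; assumes source holds
    non-negative integers smaller than len(lut).
    """
    source = np.asarray(source)
    lut = np.asarray(lut)
    # Integer (fancy) indexing looks up every element of source in lut
    # and returns an array with the shape of source and the dtype of lut.
    return lut[source]

source = np.array([[0, 2, 1], [3, 3, 0]], dtype=np.uint8)
lut = np.array([10, 20, 30, 40], dtype=np.uint16)
result = apply_lut_serial(source, lut)
print(result)
```

For small arrays a call like this is typically sufficient; the parallel version is aimed at arrays where a single pass of this kind becomes the bottleneck.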

apply_lut_to_index(source, kernel, lut, sink=None, processes=None, verbose=False)

Correlates the source with an index kernel and returns the corresponding value of the lookup table.

Arguments

sourcearray

The source array.

kernelarray

The correlation kernel.

lutarray

The lookup table.

sinkarray or None

The result array. If None, a new array is created.

processesNone or int

Number of processes to use, if None use number of cpus.

Returns

sinkarray

The source transformed via the lookup table.

block_ranges(source, blocks=None, processes=None)

Ranges of evenly spaced blocks in array.

Arguments

sourcearray

Source to divide in blocks.

blocksint or None

Number of blocks to split array into.

processesNone or int

Number of processes, if None use number of cpus.

Returns

block_rangesarray

List of the range boundaries.
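One way to picture evenly spaced block boundaries is the following sketch. It assumes the boundaries refer to the flattened array and that `blocks + 1` boundaries are produced; both are assumptions for illustration, and `block_ranges_sketch` is a hypothetical name, not the function documented here.

```python
import numpy as np

def block_ranges_sketch(n_elements, blocks):
    """Return blocks + 1 integer boundaries covering range(n_elements).

    Hypothetical helper: block i spans [ranges[i], ranges[i + 1]).
    """
    # linspace gives evenly spaced (possibly fractional) boundaries;
    # truncating to int yields block sizes that differ by at most one.
    return np.linspace(0, n_elements, blocks + 1).astype(int)

ranges = block_ranges_sketch(10, 4)
print(ranges)
```

Consecutive pairs of boundaries can then be handed to separate workers, each processing its own contiguous slice.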

block_sums(source, blocks=None, processes=None)

Sums of evenly spaced blocks in array.

Arguments

sourcearray

Array to perform the block sums on.

blocksint or None

Number of blocks to split array into.

processesNone or int

Number of processes, if None use number of cpus.

Returns

block_sumsarray

Sums of the values in the different blocks.
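The block sums can be illustrated in a single process with `numpy.add.reduceat`. The sketch below assumes the array is flattened and split into evenly spaced, non-empty blocks; `block_sums_sketch` is a hypothetical helper and the block layout of the actual function may differ.

```python
import numpy as np

def block_sums_sketch(source, blocks):
    """Sum the values in evenly spaced blocks of the flattened source.

    Hypothetical helper; assumes every block contains at least one element.
    """
    flat = np.asarray(source).reshape(-1)
    bounds = np.linspace(0, flat.size, blocks + 1).astype(int)
    # reduceat sums flat[bounds[i]:bounds[i + 1]] for each block start,
    # with the last block running to the end of the array.
    return np.add.reduceat(flat, bounds[:-1])

sums = block_sums_sketch(np.arange(10), blocks=4)
print(sums)
```

Per-block sums like these are useful, for example, to pre-compute how many non-zero entries each block contributes before workers write their results into a shared output.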

correlate1d(source, kernel, sink=None, axis=0, processes=None, verbose=False)

Correlates the source along the given axis with a 1d kernel.

Arguments

sourcearray

The source array.

kernelarray

The 1d correlation kernel.

axisint

The axis along which to correlate.

sinkarray or None

The result array. If None, a new array is created.

processesNone or int

Number of processes to use, if None use number of cpus.

verbosebool

If True, print progress information.

Returns

sinkarray

The source correlated with the kernel.
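A single-process reference for a 1d correlation along one axis can be written with numpy alone. The sketch assumes zero padding at the borders and a centered kernel (numpy's `mode='same'`); the boundary handling of the function documented here is not stated, so treat that choice as an assumption. `correlate1d_sketch` is a hypothetical name.

```python
import numpy as np

def correlate1d_sketch(source, kernel, axis=0):
    """Correlate every 1d line of source along axis with a 1d kernel.

    Hypothetical helper; borders are zero padded (mode='same').
    """
    source = np.asarray(source, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # apply_along_axis extracts each 1d line along the given axis and
    # np.correlate slides the (unflipped) kernel over it.
    return np.apply_along_axis(
        lambda line: np.correlate(line, kernel, mode="same"), axis, source
    )

source = np.array([[1, 2, 3, 4],
                   [0, 0, 1, 0]])
kernel = np.array([1, 0, -1])
result = correlate1d_sketch(source, kernel, axis=1)
print(result)
```

Because each line along the axis is processed independently, this kind of operation splits naturally across workers.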

finalize_processing(verbose=False, function=None, timer=None)

Finalize parallel array processing.

Arguments

verbosebool

If True, print progress information.

functionstr or None

The name of the function.

timerTimer or None

A processing timer.

index_neighbours(indices, offset, processes=None)

Returns all pairs of indices that are apart by a specified offset.

Arguments

indicesarray

List of indices.

offsetint

The offset to check for.

processesNone or int

Number of processes, if None use number of cpus.

initialize_processing(processes=None, verbose=False, function=None, blocks=None, return_blocks=False)

Initialize parallel array processing.

Arguments

processesint, ‘serial’ or None

The number of processes to use. If None use number of cpus.

verbosebool

If True, print progress information.

functionstr or None

The name of the function.

Returns

processesint

The number of processes.

timerTimer

A timer for the processing.

initialize_sink(sink=None, shape=None, dtype=None, order=None, memory=None, location=None, mode=None, source=None, return_buffer=True, as_1d=False, return_shape=False, return_strides=False)

Initialize or create a sink.

Arguments

sinksink specification

The sink to initialize.

shapetuple of int

Optional shape of the sink. If None, inferred from the source.

dtypedtype

Optional dtype of the sink. If None, inferred from the source.

order‘C’, ‘F’ or None

Optional order of the sink. If None, inferred from the source.

memory‘shared’ or None

If ‘shared’ create a shared memory sink.

locationstr

Optional location specification of the sink.

sourceSource or None

Optional source to infer sink specifications from.

return_bufferbool

If True, also return a buffer compatible with Cython memory views.

return_shapebool

If True, also return shape of the sink.

return_stridesbool

If True, also return the element strides of the sink.

Returns

sinkSource

The initialized sink.

bufferarray

Buffer of the sink.

shapetuple of int

Shape of the sink.

stridestuple of int

Element strides of the sink.

initialize_source(source, return_buffer=True, as_1d=False, return_shape=False, return_strides=False, return_order=False)

Initialize a source buffer for parallel array processing.

Arguments

sourcesource specification

The source to initialize.

return_bufferbool

If True, return a buffer compatible with cython memory views.

return_shapebool

If True, also return shape of the source.

return_stridesbool

If True, also return the element strides of the source.

return_orderbool

If True, also return order of the source.

Returns

sourceSource

The initialized source.

source_buffer

The initialized source as buffer.

shapetuple of int

Shape of the source.

stridestuple of int

Element strides of the source.

neighbours(indices, offset, processes=None, verbose=False)

Returns all pairs in a list of indices that are apart by a specified offset.

Arguments

indicesarray

List of indices.

offsetint

The offset to search for.

processesNone or int

Number of processes, if None use number of cpus.

verbosebool

If True, print progress.

Returns

neighboursarray

List of pairs of neighbours.

Note

This function can be used to create graphs from binary images.
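The pair search can be illustrated with a small single-process sketch. It assumes that the result lists the index values themselves (rather than positions in the input list) and that each pair is ordered as (index, index + offset); both are assumptions, and `neighbours_sketch` is a hypothetical helper.

```python
import numpy as np

def neighbours_sketch(indices, offset):
    """Return all pairs (i, i + offset) where both values occur in indices.

    Hypothetical helper for illustration only.
    """
    indices = np.unique(indices)  # sorted, duplicates removed
    # Keep every index whose partner at distance offset is also present.
    has_partner = np.isin(indices + offset, indices)
    first = indices[has_partner]
    return np.column_stack((first, first + offset))

pairs = neighbours_sketch([0, 1, 2, 5, 6, 10], offset=1)
print(pairs)
```

For the graph use case mentioned in the note, the indices would be the flat indices of the foreground voxels of a binary image and the offset the flat-index distance to a neighbouring voxel along one direction, so that each returned pair corresponds to an edge. Handling of pairs that wrap around an image border is not covered by this sketch.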

read(source, sink=None, slicing=None, memory=None, blocks=None, processes=None, verbose=False, **kwargs)

Read a large array into memory in parallel.

Arguments

sourcestr or Source

The source on disk to load.

slicingslice, tuple, or None

Optional sub-slice to read.

memory‘shared’ or None

If ‘shared’, read into shared memory.

blocksint or None

Number of blocks to split array into for parallel processing.

processesNone or int

Number of processes, if None use number of cpus.

verbosebool

If True, print info about the file to be loaded.

Returns

sinkSource class

The read source in memory.
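The general idea of reading an array block by block can be sketched with a memory-mapped `.npy` file and a thread pool. This is a stand-in technique chosen for illustration, not a description of how `read` is implemented: the file format, the use of threads instead of processes, and the helper name `read_blocks_sketch` are all assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def read_blocks_sketch(path, blocks=4, processes=2):
    """Copy a .npy file into memory in blocks along the first axis.

    Hypothetical helper; uses threads and numpy's memory mapping.
    """
    on_disk = np.load(path, mmap_mode="r")  # no data is read yet
    sink = np.empty(on_disk.shape, dtype=on_disk.dtype)
    bounds = np.linspace(0, on_disk.shape[0], blocks + 1).astype(int)

    def copy_block(lo, hi):
        # Each worker copies its own disjoint slice into the sink.
        sink[lo:hi] = on_disk[lo:hi]

    with ThreadPoolExecutor(max_workers=processes) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(copy_block, bounds[:-1], bounds[1:]))
    return sink

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npy")
    data = np.arange(24).reshape(6, 4)
    np.save(path, data)
    loaded = read_blocks_sketch(path, blocks=3, processes=2)

print(loaded)
```

Because the slices are disjoint, the workers never write to the same memory, which is what makes a block-wise split safe without locking.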

where(source, sink=None, blocks=None, cutoff=None, processes=None, verbose=False)

Returns the indices of the non-zero entries of the array.

Arguments

sourcearray

Array to search for nonzero indices.

sinkarray or None

If not None, the result is written into this array.

blocksint or None

Number of blocks to split array into for parallel processing.

cutoffint

Number of elements below which to switch to numpy.where.

processesNone or int

Number of processes, if None use number of cpus.

Returns

wherearray

Positions of the nonzero entries of the input array.

Note

Falls back to numpy.where if no implementation exists for the dimension of the array.
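A block-wise search for non-zero entries, including the switch to plain numpy below a cutoff, might look like the following sketch. It works on the flattened array and returns flat indices; the output layout of the function documented here may differ, and `where_blocks_sketch` is a hypothetical helper. The blocks are processed in a simple loop, where a parallel version would hand each block to a worker.

```python
import numpy as np

def where_blocks_sketch(source, blocks=4, cutoff=0):
    """Return flat indices of the non-zero entries, found block by block.

    Hypothetical helper; arrays smaller than cutoff use numpy directly.
    """
    flat = np.asarray(source).reshape(-1)
    if flat.size < cutoff:
        return np.flatnonzero(flat)
    bounds = np.linspace(0, flat.size, blocks + 1).astype(int)
    # Indices found inside a block are relative to the block start,
    # so the start offset is added back before concatenating.
    parts = [np.flatnonzero(flat[lo:hi]) + lo
             for lo, hi in zip(bounds[:-1], bounds[1:])]
    return np.concatenate(parts)

source = np.array([[0, 3, 0],
                   [1, 0, 0],
                   [0, 0, 7],
                   [2, 0, 0]])
found = where_blocks_sketch(source, blocks=3)
print(found)
```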

write(sink, source, slicing=None, overwrite=True, blocks=None, processes=None, verbose=False)

Write a large array to disk in parallel.

Arguments

sinkstr or Source

The sink on disk to write to.

sourcearray or Source

The data to write to disk.

slicingslicing or None

Optional slicing for the sink to write to.

overwritebool

If True, create a new file if the source specifications do not match.

blocksint or None

Number of blocks to split array into for parallel processing.

processesNone or int

Number of processes, if None use number of cpus.

verbosebool

If True, print info about the file to be written.

Returns

sinkSource class

The sink to which the source was written.

default_blocks_per_process = 10

Default number of blocks per process to split the data.

Note

10 blocks per process is a good choice.
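How the defaults might combine can be sketched as follows. The rule used here (total blocks = processes times blocks per process, with `None` resolved to the CPU count) is an assumption inferred from the parameter descriptions, and `resolve_processes_and_blocks` is a hypothetical helper, not part of the module.

```python
import multiprocessing as mp

# Mirrors the module-level default documented above.
default_blocks_per_process = 10

def resolve_processes_and_blocks(processes=None, blocks=None):
    """Fill in missing processes/blocks values (assumed rule, see lead-in)."""
    if processes is None:
        processes = mp.cpu_count()
    if blocks is None:
        # Several blocks per worker help balance uneven block run times.
        blocks = processes * default_blocks_per_process
    return processes, blocks

processes, blocks = resolve_processes_and_blocks(processes=4)
print(processes, blocks)
```

Using more blocks than processes lets a worker that finishes early pick up another block instead of idling.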

default_cutoff = 20000000

Default size of array below which ordinary numpy is used.

Note

Ideally test this on your machine for different array sizes.

default_processes = 8

Default number of processes to use.