Histogram of xarray.DataArray can be extremely slow #6804

@maahn

Xarray is a package for labeled arrays. If you use plt.hist to make a histogram of a DataArray, the speed depends heavily on how you call it:

import xarray as xr
import numpy as np
import matplotlib.pyplot as plt

nPoints = 100000
data = xr.DataArray(np.random.random(nPoints),dims=['time'],coords=[np.arange(nPoints)])

It takes only a few milliseconds if you use either of the following:

plt.figure()
%time data.plot.hist()

plt.figure()
%time plt.hist(data.values)

However, if you omit .values and pass the DataArray directly, it takes about two minutes:

In [12]: %time plt.hist(data)
CPU times: user 2min, sys: 9.73 s, total: 2min 9s
Wall time: 2min 3s

%prun suggests that the OrderedDict class is to blame:

         145056729 function calls (144455882 primitive calls) in 198.255 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1800009   11.602    0.000   29.587    0.000 collections.py:50(__init__)
  2800014    9.536    0.000   24.279    0.000 _abcoll.py:548(update)
   400002    5.934    0.000   30.815    0.000 variable.py:334(__getitem__)
 10200067    5.499    0.000    5.499    0.000 collections.py:90(__iter__)
 13203051    5.330    0.000   13.624    0.000 {isinstance}
  6600104    5.002    0.000    5.002    0.000 {hasattr}
  2400082    4.842    0.000    7.813    0.000 abc.py:128(__instancecheck__)
  2400012    4.821    0.000    4.821    0.000 collections.py:71(__setitem__)
  6200102    4.821    0.000    4.821    0.000 _weakrefset.py:70(__contains__)
  4400022    4.449    0.000    4.449    0.000 common.py:196(__setattr__)
   200001    4.378    0.000   12.152    0.000 base.py:124(__new__)
  1000005    4.308    0.000    5.123    0.000 indexing.py:10(expanded_indexer)
   200001    4.304    0.000   51.921    0.000 dataset.py:878(isel)
   200001    3.624    0.000   15.635    0.000 alignment.py:108(align_variables)
   400002    3.434    0.000    8.928    0.000 dataset.py:66(_calculate_dims)
   200001    3.347    0.000    9.288    0.000 dataset.py:470(_construct_dataarray)
   600003    3.216    0.000    6.535    0.000 variable.py:87(as_compatible_data)
   200001    3.142    0.000   99.663    0.000 merge.py:116(merge_datasets)
   200001    3.044    0.000   12.129    0.000 coordinates.py:198(_to_dataset)
   200001    2.740    0.000   92.687    0.000 merge.py:101(_merge_dataset_with_dict)
  2800014    2.714    0.000    5.114    0.000 abc.py:148(__subclasscheck__)
   400002    2.710    0.000   34.370    0.000 variable.py:493(isel)
  3400017    2.651    0.000    4.377    0.000 collections.py:138(iteritems)
  2400018    2.266    0.000    3.617    0.000 utils.py:355(ndim)
   200001    2.028    0.000   18.420    0.000 dataset.py:221(_update_vars_and_coords)
   400002    1.954    0.000   48.424    0.000 alignment.py:37(_join_indexes)
   600003    1.792    0.000   14.645    0.000 variable.py:192(__init__)
  4000844    1.712    0.000    1.713    0.000 {getattr}
   400002    1.675    0.000    3.975    0.000 dataset.py:367(_construct_direct)
   400002    1.624    0.000   44.523    0.000 alignment.py:28(_get_all_indexes)
   200001    1.607    0.000   48.978    0.000 alignment.py:95(partial_align)
   200001    1.585    0.000   17.709    0.000 merge.py:82(_merge_expand)
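
For anyone who wants to collect a comparable profile outside IPython, here is a minimal sketch using the standard library's cProfile and pstats. The helper name profile_call is hypothetical, and the profiled function (sum) is only a stand-in; to reproduce the numbers above you would pass plt.hist and the DataArray instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report is sorted by internal time ("tottime"), the same ordering
    as the %prun output shown above. `profile_call` is a hypothetical
    helper name, not part of xarray or matplotlib.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(top)  # keep only the top entries
    return result, stream.getvalue()


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. profile_call(plt.hist, data).
    total, report = profile_call(sum, range(1000))
    print(report)
```

A large ncalls count that scales with the number of data points, as in the table above, is a hint that something is being done once per element rather than once per array.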

If one forgets to add .values or to use xarray's own plot routine, one can be stuck for a long time. I have to kill Python once or twice a day because of this issue.
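
As a stopgap, the .values workaround can be wrapped in a small helper so it is harder to forget. This is only a sketch: the names to_plain_array and hist_counts are hypothetical, and it uses np.histogram rather than plt.hist so that it runs with NumPy alone. It assumes the input either exposes its data through a .values attribute (as DataArray does) or can be converted by np.asarray.

```python
import numpy as np


def to_plain_array(obj):
    """Return the data of obj as a plain numpy.ndarray.

    Hypothetical helper: uses a non-callable `.values` attribute when one
    exists (as on a DataArray), otherwise falls back to np.asarray(obj).
    """
    values = getattr(obj, "values", obj)
    if callable(values):  # e.g. dict.values is a method, not data
        values = obj
    return np.asarray(values)


def hist_counts(obj, bins=10):
    """Compute histogram counts and bin edges from the unwrapped data."""
    arr = to_plain_array(obj).ravel()
    return np.histogram(arr, bins=bins)


if __name__ == "__main__":
    sample = np.random.random(100000)
    counts, edges = hist_counts(sample, bins=10)
    print(counts.sum())  # every point falls into exactly one bin
```

The array returned by to_plain_array can be passed to plt.hist in the same way, which is equivalent to calling plt.hist(data.values) by hand.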

I have matplotlib 1.5.1 and xarray 0.7.2 on OSX/anaconda.