
Commit 9b03947
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types. I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.

The default for `deduplicate_objects` is True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is, at which the overhead of hashing makes things slower.

Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:

```
In [50]: import pandas.util.testing as tm

In [51]: unique_values = [tm.rands(10) for i in range(1000)]

In [52]: values = unique_values * 10000

In [53]: arr = pa.array(values)

In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Almost 3 times faster in this case. The difference in memory use is even more drastic:

```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]

In [45]: values = unique_values * 10000

In [46]: arr = pa.array(values)

In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB

In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```

As you can see, this is a huge problem. If our bug reports about Parquet memory use are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, things are slower as expected, and peak memory use is higher because of the hash table:

```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]

In [18]: values = unique_values * 2

In [19]: arr = pa.array(values)

In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB

In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```

In real-world work, many duplicated strings is the most common use case. Given the massive memory savings and moderate performance improvements, it makes sense to have this enabled by default.

Author: Wes McKinney <[email protected]>

Closes apache#3257 from wesm/ARROW-3928 and squashes the following commits:

d9a8870 <Wes McKinney> Prettier output
a00b51c <Wes McKinney> Add benchmarks for object deduplication
ca88b96 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b <Wes McKinney> First working iteration of string deduplication when calling to_pandas
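The core idea behind `deduplicate_objects` — a memo table that hands back the already-created object for a repeated value — can be sketched in pure Python. This is an illustrative sketch only, not the actual C++ implementation; `convert_with_dedup` is a hypothetical name:

```python
def convert_with_dedup(values):
    """Create at most one Python object per distinct value."""
    memo = {}  # value -> the single object we already handed out for it
    out = []
    for v in values:
        if v in memo:
            obj = memo[v]  # reuse: no new object is created
        else:
            obj = v        # first sighting: materialize (here: reuse the input)
            memo[v] = obj
        out.append(obj)
    return out

# Equal-but-distinct string objects, built at runtime so CPython does not
# fold them into one interned object up front
raw = ["".join(["sp", "am"]) for _ in range(4)] + ["".join(["eg", "gs"])]
assert raw[0] is not raw[1]  # equal values, distinct objects

result = convert_with_dedup(raw)
assert result[0] is result[1] is result[2] is result[3]  # one shared object
assert len({id(x) for x in result}) == 2                 # only 2 objects total
```

With a low unique-to-total ratio, the output holds far fewer live objects than the input, which is where the memory savings in the numbers above come from.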
Parent: 0696eb5 · Commit: 9b03947

File tree

14 files changed: +409 −352 lines

cpp/src/arrow/python/arrow_to_pandas.cc

Lines changed: 156 additions & 130 deletions
Large diffs are not rendered by default.

cpp/src/arrow/python/arrow_to_pandas.h

Lines changed: 21 additions & 20 deletions
```diff
@@ -43,32 +43,32 @@ namespace py {
 
 struct PandasOptions {
   /// If true, we will convert all string columns to categoricals
-  bool strings_to_categorical;
-  bool zero_copy_only;
-  bool integer_object_nulls;
-  bool date_as_object;
-  bool use_threads;
-
-  PandasOptions()
-      : strings_to_categorical(false),
-        zero_copy_only(false),
-        integer_object_nulls(false),
-        date_as_object(false),
-        use_threads(false) {}
+  bool strings_to_categorical = false;
+  bool zero_copy_only = false;
+  bool integer_object_nulls = false;
+  bool date_as_object = false;
+  bool use_threads = false;
+
+  /// \brief If true, do not create duplicate PyObject versions of equal
+  /// objects. This only applies to immutable objects like strings or datetime
+  /// objects
+  bool deduplicate_objects = false;
 };
 
 ARROW_PYTHON_EXPORT
-Status ConvertArrayToPandas(PandasOptions options, const std::shared_ptr<Array>& arr,
-                            PyObject* py_ref, PyObject** out);
+Status ConvertArrayToPandas(const PandasOptions& options,
+                            const std::shared_ptr<Array>& arr, PyObject* py_ref,
+                            PyObject** out);
 
 ARROW_PYTHON_EXPORT
-Status ConvertChunkedArrayToPandas(PandasOptions options,
+Status ConvertChunkedArrayToPandas(const PandasOptions& options,
                                    const std::shared_ptr<ChunkedArray>& col,
                                    PyObject* py_ref, PyObject** out);
 
 ARROW_PYTHON_EXPORT
-Status ConvertColumnToPandas(PandasOptions options, const std::shared_ptr<Column>& col,
-                             PyObject* py_ref, PyObject** out);
+Status ConvertColumnToPandas(const PandasOptions& options,
+                             const std::shared_ptr<Column>& col, PyObject* py_ref,
+                             PyObject** out);
 
 // Convert a whole table as efficiently as possible to a pandas.DataFrame.
 //
@@ -77,15 +77,16 @@ Status ConvertColumnToPandas(PandasOptions options, const std::shared_ptr<Column
 //
 // tuple item: (indices: ndarray[int32], block: ndarray[TYPE, ndim=2])
 ARROW_PYTHON_EXPORT
-Status ConvertTableToPandas(PandasOptions options, const std::shared_ptr<Table>& table,
-                            MemoryPool* pool, PyObject** out);
+Status ConvertTableToPandas(const PandasOptions& options,
+                            const std::shared_ptr<Table>& table, MemoryPool* pool,
+                            PyObject** out);
 
 /// Convert a whole table as efficiently as possible to a pandas.DataFrame.
 ///
 /// Explicitly name columns that should be a categorical
 /// This option is only used on conversions that are applied to a table.
 ARROW_PYTHON_EXPORT
-Status ConvertTableToPandas(PandasOptions options,
+Status ConvertTableToPandas(const PandasOptions& options,
                             const std::unordered_set<std::string>& categorical_columns,
                             const std::shared_ptr<Table>& table, MemoryPool* pool,
                             PyObject** out);
```
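For readers more at home in Python, the shape of the refactored options struct — defaults supplied by in-class initializers plus the new flag — corresponds roughly to a dataclass. `PandasOptionsSketch` is a hypothetical mirror, not part of the pyarrow API:

```python
from dataclasses import dataclass

@dataclass
class PandasOptionsSketch:
    # Mirrors the defaults in the C++ PandasOptions struct above
    strings_to_categorical: bool = False
    zero_copy_only: bool = False
    integer_object_nulls: bool = False
    date_as_object: bool = False
    use_threads: bool = False
    # New in this commit: reuse one PyObject per distinct immutable value
    deduplicate_objects: bool = False

opts = PandasOptionsSketch(deduplicate_objects=True)
assert opts.deduplicate_objects is True
assert opts.use_threads is False  # other defaults are untouched
```

Replacing the hand-written constructor with member initializers means a new flag needs exactly one line, which is what made adding `deduplicate_objects` so small a change here.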

cpp/src/arrow/type.cc

Lines changed: 3 additions & 4 deletions
```diff
@@ -137,12 +137,11 @@ std::string FixedSizeBinaryType::ToString() const {
 // ----------------------------------------------------------------------
 // Date types
 
-DateType::DateType(Type::type type_id, DateUnit unit)
-    : FixedWidthType(type_id), unit_(unit) {}
+DateType::DateType(Type::type type_id) : FixedWidthType(type_id) {}
 
-Date32Type::Date32Type() : DateType(Type::DATE32, DateUnit::DAY) {}
+Date32Type::Date32Type() : DateType(Type::DATE32) {}
 
-Date64Type::Date64Type() : DateType(Type::DATE64, DateUnit::MILLI) {}
+Date64Type::Date64Type() : DateType(Type::DATE64) {}
 
 std::string Date64Type::ToString() const { return std::string("date64[ms]"); }
```

cpp/src/arrow/type.h

Lines changed: 6 additions & 3 deletions
```diff
@@ -600,17 +600,17 @@ enum class DateUnit : char { DAY = 0, MILLI = 1 };
 /// \brief Base type class for date data
 class ARROW_EXPORT DateType : public FixedWidthType {
  public:
-  DateUnit unit() const { return unit_; }
+  virtual DateUnit unit() const = 0;
 
  protected:
-  DateType(Type::type type_id, DateUnit unit);
-  DateUnit unit_;
+  explicit DateType(Type::type type_id);
 };
 
 /// Concrete type class for 32-bit date data (as number of days since UNIX epoch)
 class ARROW_EXPORT Date32Type : public DateType {
  public:
   static constexpr Type::type type_id = Type::DATE32;
+  static constexpr DateUnit UNIT = DateUnit::DAY;
 
   using c_type = int32_t;
 
@@ -622,12 +622,14 @@ class ARROW_EXPORT Date32Type : public DateType {
   std::string ToString() const override;
 
   std::string name() const override { return "date32"; }
+  DateUnit unit() const override { return UNIT; }
 };
 
 /// Concrete type class for 64-bit date data (as number of milliseconds since UNIX epoch)
 class ARROW_EXPORT Date64Type : public DateType {
  public:
   static constexpr Type::type type_id = Type::DATE64;
+  static constexpr DateUnit UNIT = DateUnit::MILLI;
 
   using c_type = int64_t;
 
@@ -639,6 +641,7 @@ class ARROW_EXPORT Date64Type : public DateType {
   std::string ToString() const override;
 
   std::string name() const override { return "date64"; }
+  DateUnit unit() const override { return UNIT; }
 };
 
 struct TimeUnit {
```
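The date-type change moves the unit from a stored instance member to a pure virtual accessor backed by a per-subclass constant, so each concrete type carries its unit statically. The same pattern in Python, as a sketch with hypothetical class names (not the pyarrow type classes):

```python
from abc import ABC, abstractmethod
from enum import Enum

class DateUnit(Enum):
    DAY = 0
    MILLI = 1

class DateTypeSketch(ABC):
    # The unit is now a per-subclass constant exposed through an abstract
    # accessor, rather than a member set in the base-class constructor
    @property
    @abstractmethod
    def unit(self) -> DateUnit: ...

class Date32Sketch(DateTypeSketch):
    UNIT = DateUnit.DAY
    @property
    def unit(self):
        return self.UNIT

class Date64Sketch(DateTypeSketch):
    UNIT = DateUnit.MILLI
    @property
    def unit(self):
        return self.UNIT

assert Date32Sketch().unit is DateUnit.DAY
assert Date64Sketch().unit is DateUnit.MILLI
```

The payoff in the C++ version is that `Date32Type::UNIT` is a compile-time constant usable in templated conversion code, instead of a runtime member read.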

cpp/src/arrow/type_traits.h

Lines changed: 5 additions & 0 deletions
```diff
@@ -371,6 +371,11 @@ template <typename T>
 using enable_if_boolean =
     typename std::enable_if<std::is_same<BooleanType, T>::value>::type;
 
+template <typename T>
+using enable_if_binary_like =
+    typename std::enable_if<std::is_base_of<BinaryType, T>::value ||
+                            std::is_base_of<FixedSizeBinaryType, T>::value>::type;
+
 template <typename T>
 using enable_if_fixed_size_binary =
     typename std::enable_if<std::is_base_of<FixedSizeBinaryType, T>::value>::type;
```

cpp/src/arrow/util/hashing.h

Lines changed: 18 additions & 3 deletions
```diff
@@ -102,6 +102,18 @@ struct ScalarHelper<Scalar, AlgNum,
   }
 };
 
+template <typename Scalar, uint64_t AlgNum>
+struct ScalarHelper<
+    Scalar, AlgNum,
+    typename std::enable_if<std::is_same<util::string_view, Scalar>::value>::type>
+    : public ScalarHelperBase<Scalar, AlgNum> {
+  // ScalarHelper specialization for util::string_view
+
+  static hash_t ComputeHash(const util::string_view& value) {
+    return ComputeStringHash<AlgNum>(value.data(), static_cast<int64_t>(value.size()));
+  }
+};
+
 template <typename Scalar, uint64_t AlgNum>
 struct ScalarHelper<Scalar, AlgNum,
                     typename std::enable_if<std::is_floating_point<Scalar>::value>::type>
@@ -332,7 +344,7 @@ class ScalarMemoTable {
   explicit ScalarMemoTable(int64_t entries = 0)
       : hash_table_(static_cast<uint64_t>(entries)) {}
 
-  int32_t Get(const Scalar value) const {
+  int32_t Get(const Scalar& value) const {
     auto cmp_func = [value](const Payload* payload) -> bool {
       return ScalarHelper<Scalar, 0>::CompareScalars(payload->value, value);
     };
@@ -346,7 +358,7 @@ class ScalarMemoTable {
   }
 
   template <typename Func1, typename Func2>
-  int32_t GetOrInsert(const Scalar value, Func1&& on_found, Func2&& on_not_found) {
+  int32_t GetOrInsert(const Scalar& value, Func1&& on_found, Func2&& on_not_found) {
     auto cmp_func = [value](const Payload* payload) -> bool {
       return ScalarHelper<Scalar, 0>::CompareScalars(value, payload->value);
     };
@@ -364,7 +376,7 @@ class ScalarMemoTable {
     return memo_index;
   }
 
-  int32_t GetOrInsert(const Scalar value) {
+  int32_t GetOrInsert(const Scalar& value) {
     return GetOrInsert(value, [](int32_t i) {}, [](int32_t i) {});
   }
 
@@ -389,6 +401,7 @@ class ScalarMemoTable {
     Scalar value;
     int32_t memo_index;
   };
+
   using HashTableType = HashTableTemplateType<Payload>;
   using HashTableEntry = typename HashTableType::Entry;
   HashTableType hash_table_;
@@ -621,9 +634,11 @@ class BinaryMemoTable {
   struct Payload {
     int32_t memo_index;
   };
+
   using HashTableType = HashTable<Payload>;
   using HashTableEntry = typename HashTable<Payload>::Entry;
   HashTableType hash_table_;
+
   std::vector<int32_t> offsets_;
   std::string values_;
```
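`ScalarMemoTable::Get` and `GetOrInsert` map each distinct scalar to a dense memo index; the signature change to `const Scalar&` simply avoids copying non-trivial scalars such as `string_view`. Their observable behavior, sketched as a hypothetical Python class (a dict stands in for the open-addressing hash table):

```python
class ScalarMemoSketch:
    """Sketch of ScalarMemoTable semantics: values map to dense memo indices."""

    def __init__(self):
        self._table = {}

    def get(self, value):
        # Returns the memo index, or -1 if the value has not been seen
        return self._table.get(value, -1)

    def get_or_insert(self, value, on_found=None, on_not_found=None):
        idx = self._table.get(value)
        if idx is None:
            idx = len(self._table)  # memo indices are assigned densely
            self._table[value] = idx
            if on_not_found:
                on_not_found(idx)
        elif on_found:
            on_found(idx)
        return idx

memo = ScalarMemoSketch()
assert memo.get_or_insert("a") == 0
assert memo.get_or_insert("b") == 1
assert memo.get_or_insert("a") == 0  # repeat returns the original index
assert memo.get("c") == -1
```

During conversion, the memo index is what lets a repeated string resolve to the PyObject created on its first appearance.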

python/benchmarks/convert_pandas.py

Lines changed: 22 additions & 0 deletions
```diff
@@ -17,6 +17,8 @@
 
 import numpy as np
 import pandas as pd
+import pandas.util.testing as tm
+
 import pyarrow as pa
 
 
@@ -50,6 +52,26 @@ def time_to_series(self, n, dtype):
         self.arrow_data.to_pandas()
 
 
+class ToPandasStrings(object):
+
+    param_names = ('uniqueness', 'total')
+    params = ((0.001, 0.01, 0.1, 0.5), (1000000,))
+    string_length = 25
+
+    def setup(self, uniqueness, total):
+        nunique = int(total * uniqueness)
+        unique_values = [tm.rands(self.string_length) for i in range(nunique)]
+        values = unique_values * (total // nunique)
+        self.arr = pa.array(values, type=pa.string())
+        self.table = pa.Table.from_arrays([self.arr], ['f0'])
+
+    def time_to_pandas_dedup(self, *args):
+        self.arr.to_pandas()
+
+    def time_to_pandas_no_dedup(self, *args):
+        self.arr.to_pandas(deduplicate_objects=False)
+
+
 class ZeroCopyPandasRead(object):
 
     def setup(self):
```
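The benchmark's setup builds `total` strings of which a fraction `uniqueness` are distinct. Extracted to plain Python as a sketch — using the stdlib in place of `pandas.util.testing.rands`, and without the pyarrow conversion itself:

```python
import random
import string

def make_benchmark_values(uniqueness, total, string_length=25):
    """Return `total` strings containing int(total * uniqueness) distinct values."""
    nunique = int(total * uniqueness)
    unique_values = [''.join(random.choices(string.ascii_letters, k=string_length))
                     for _ in range(nunique)]
    # Tile the unique pool so repeats dominate at low uniqueness ratios
    return unique_values * (total // nunique)

values = make_benchmark_values(uniqueness=0.01, total=10000)
assert len(values) == 10000
assert len(set(values)) == 100
```

Sweeping `uniqueness` from 0.001 to 0.5 is what locates the crossover point where hashing overhead outweighs the savings from reusing objects.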

python/pyarrow/array.pxi

Lines changed: 58 additions & 33 deletions
```diff
@@ -339,7 +339,61 @@ def _restore_array(data):
     return pyarrow_wrap_array(MakeArray(ad))
 
 
-cdef class Array:
+cdef class _PandasConvertible:
+
+    def to_pandas(self, categories=None, bint strings_to_categorical=False,
+                  bint zero_copy_only=False, bint integer_object_nulls=False,
+                  bint date_as_object=False,
+                  bint use_threads=True,
+                  bint deduplicate_objects=True,
+                  bint ignore_metadata=False):
+        """
+        Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
+
+        Parameters
+        ----------
+        strings_to_categorical : boolean, default False
+            Encode string (UTF8) and binary types to pandas.Categorical
+        categories : list, default empty
+            List of fields that should be returned as pandas.Categorical. Only
+            applies to table-like data structures
+        zero_copy_only : boolean, default False
+            Raise an ArrowException if this function call would require copying
+            the underlying data
+        integer_object_nulls : boolean, default False
+            Cast integers with nulls to objects
+        date_as_object : boolean, default False
+            Cast dates to objects
+        use_threads : boolean, default True
+            Whether to parallelize the conversion using multiple threads
+        deduplicate_objects : boolean, default True
+            Do not create multiple copies of equal Python objects, to save
+            on memory use. Conversion will be slower when most values are
+            unique
+        ignore_metadata : boolean, default False
+            If True, do not use the 'pandas' metadata to reconstruct the
+            DataFrame index, if present
+
+        Returns
+        -------
+        NumPy array or DataFrame depending on type of object
+        """
+        cdef:
+            PyObject* out
+            PandasOptions options
+
+        options = PandasOptions(
+            strings_to_categorical=strings_to_categorical,
+            zero_copy_only=zero_copy_only,
+            integer_object_nulls=integer_object_nulls,
+            date_as_object=date_as_object,
+            use_threads=use_threads,
+            deduplicate_objects=deduplicate_objects)
+
+        return self._to_pandas(options, categories=categories,
+                               ignore_metadata=ignore_metadata)
+
+
+cdef class Array(_PandasConvertible):
 
     def __init__(self):
         raise TypeError("Do not call {}'s constructor directly, use one of "
@@ -602,42 +656,13 @@ cdef class Array:
 
         return pyarrow_wrap_array(result)
 
-    def to_pandas(self, bint strings_to_categorical=False,
-                  bint zero_copy_only=False, bint integer_object_nulls=False,
-                  bint date_as_object=False):
-        """
-        Convert to a NumPy array object suitable for use in pandas.
-
-        Parameters
-        ----------
-        strings_to_categorical : boolean, default False
-            Encode string (UTF8) and binary types to pandas.Categorical
-        zero_copy_only : boolean, default False
-            Raise an ArrowException if this function call would require copying
-            the underlying data
-        integer_object_nulls : boolean, default False
-            Cast integers with nulls to objects
-        date_as_object : boolean, default False
-            Cast dates to objects
-
-        See also
-        --------
-        Column.to_pandas
-        Table.to_pandas
-        RecordBatch.to_pandas
-        """
+    def _to_pandas(self, options, **kwargs):
         cdef:
             PyObject* out
-            PandasOptions options
+            PandasOptions c_options = options
 
-        options = PandasOptions(
-            strings_to_categorical=strings_to_categorical,
-            zero_copy_only=zero_copy_only,
-            integer_object_nulls=integer_object_nulls,
-            date_as_object=date_as_object,
-            use_threads=False)
         with nogil:
-            check_status(ConvertArrayToPandas(options, self.sp_array,
+            check_status(ConvertArrayToPandas(c_options, self.sp_array,
                                               self, &out))
         return wrap_array_output(out)
```
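The new `_PandasConvertible` base centralizes option handling: the public `to_pandas` normalizes keyword arguments once, then defers to a per-class `_to_pandas` hook. The dispatch pattern, reduced to a plain-Python sketch with hypothetical names (options shown as a dict rather than the Cython `PandasOptions` struct):

```python
class PandasConvertibleSketch:
    # Public entry point: normalize keyword arguments into one options
    # bundle, then delegate to the subclass-specific conversion hook
    def to_pandas(self, use_threads=True, deduplicate_objects=True, **kwargs):
        options = {
            'use_threads': use_threads,
            'deduplicate_objects': deduplicate_objects,
        }
        return self._to_pandas(options, **kwargs)

class ArraySketch(PandasConvertibleSketch):
    def _to_pandas(self, options, **kwargs):
        return ('array', options)

class TableSketch(PandasConvertibleSketch):
    def _to_pandas(self, options, **kwargs):
        return ('table', options)

kind, opts = ArraySketch().to_pandas()
assert kind == 'array'
assert opts['deduplicate_objects'] is True
```

This is why the signature and docstring now live in one place: Array, Column, RecordBatch, and Table all inherit the same `to_pandas` and differ only in their private hook.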

python/pyarrow/compat.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -192,11 +192,15 @@ def _iterate_python_module_paths(package_name):
     for finder in sys.meta_path:
         try:
             spec = finder.find_spec(absolute_name, None)
-        except AttributeError:
+        except (AttributeError, TypeError):
             # On Travis (Python 3.5) the above produced:
             # AttributeError: 'VendorImporter' object has no
             # attribute 'find_spec'
+            #
+            # ARROW-4117: When running "asv dev", TypeError is raised
+            # due to the meta-importer
             spec = None
+
         if spec is not None:
             break
```
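The `compat.py` fix widens the exception guard because some meta-path finders either lack `find_spec` entirely or reject this calling convention. A standalone sketch of the defensive loop (`find_spec_defensively` is a hypothetical helper name):

```python
import sys

def find_spec_defensively(absolute_name):
    """Probe each meta-path finder, tolerating finders that lack find_spec
    (AttributeError) or reject the call signature (TypeError)."""
    for finder in sys.meta_path:
        try:
            spec = finder.find_spec(absolute_name, None)
        except (AttributeError, TypeError):
            spec = None
        if spec is not None:
            return spec
    return None

assert find_spec_defensively('json') is not None
assert find_spec_defensively('definitely_not_a_module_xyz') is None
```

Catching only `AttributeError` had been enough for the Travis case, but third-party meta-importers (as seen under `asv dev`) can raise `TypeError` instead, so both are swallowed and the loop moves on to the next finder.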
