Daft 0.0.24 Release Notes

The Daft 0.0.24 release includes many bug fixes and moves Daft much closer to full integration of its Rust code. The highlights are:

  • Better integration with Ray Datasets: a new DataFrame.from_ray_dataset constructor, and use of the Ray Datasets ArrowTensorExtension types when converting data.

  • A new DataFrame.into_partitions API provides a much faster way to increase or decrease the number of partitions in a DataFrame without incurring a shuffle.
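
As a rough illustration of why changing the partition count need not involve a shuffle, the sketch below resizes a list of partitions using only local operations: adjacent partitions are merged to reduce the count, and each partition is cut into contiguous slices to increase it. Plain Python lists stand in for partitions, and the function name `resize_partitions` is hypothetical. This is not Daft's API or implementation, only a minimal model of the idea under those assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def resize_partitions(partitions: Sequence[List[T]], n: int) -> List[List[T]]:
    """Hypothetical helper: return n partitions without moving rows by key.

    Row order is preserved, and every output partition is built from a
    contiguous run of input rows, so no hash or range shuffle is needed.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    k = len(partitions)
    if k == 0:
        # Nothing to distribute: return n empty partitions.
        return [[] for _ in range(n)]
    if n == k:
        return [list(p) for p in partitions]

    out: List[List[T]] = []
    if n < k:
        # Coalesce: output i takes the adjacent inputs [i*k//n, (i+1)*k//n).
        for i in range(n):
            merged: List[T] = []
            for part in partitions[i * k // n:(i + 1) * k // n]:
                merged.extend(part)
            out.append(merged)
        return out

    # Split: input i is cut into `pieces` contiguous slices; the piece
    # counts sum to n, and each input is handled independently.
    for i, part in enumerate(partitions):
        pieces = (i + 1) * n // k - i * n // k
        for j in range(pieces):
            lo = j * len(part) // pieces
            hi = (j + 1) * len(part) // pieces
            out.append(part[lo:hi])
    return out


parts = [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10]]
more = resize_partitions(parts, 5)   # split 2 partitions into 5
fewer = resize_partitions(more, 2)   # coalesce 5 partitions back into 2
```

In the splitting branch each input partition is processed on its own, which loosely mirrors the "one task per input partition" behaviour described for #726 in the Enhancements list. The trade-off in this simple model is that output partitions can be uneven in size, since rows are never rebalanced across input boundaries.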

Enhancements

  • Refactor @udf to return a UDF class instead of wrapped functions #644

  • Add DataFrame.from_ray_dataset #722

  • Use Ray Datasets ArrowTensorExtension type when handling numpy arrays #719

  • New API: DataFrame.into_partitions(n) #711

  • Issue #707: Fix join run_partial_metadata logic #708

  • Push row limit through to Parquet row group batch reading. #646

  • RayRunner: Run scheduler locally if we are not in ray client mode #643

  • Deduce, propagate, and use partial partition metadata in physical plan. #642

  • Query and store parquet metadata. #634

  • Add future-like abstraction to PartitionTask #633

  • Remove type annotation inference from UDFs #632

  • Check top-level types for public API parameters #628

  • Refactor aggregation logic into an AggregationPlanBuilder class #627

  • Set show to default to showing 8 rows #624

  • Add option to disable threadpool in PyRunner #614

  • Typecheck resource_request parameter in DataFrame.with_column() #611

  • Improve resource request settings for file reads #602

  • Add 2x partition size memory to all tasks’ ResourceRequests. #600

  • into_partitions (splitting) emits one task per input partition instead of per output partition. #726

  • Ignore multiple calls to set_runner_ray with kwarg noop_if_initialized #728

Bug Fixes

  • Fix bug with tabular file scans #674

  • Pin the version of sphinx-book-theme to <1.0.0 which fixes our docs #639

  • Fix notebooks for UDF changes #636

  • Fix serialization of ArrowDataBlocks with shared buffers #625

  • Fix df.show() always showing at most 10 rows #720

  • Fix schema inference of reads from Parquet and empty vPartitions #706

  • Add multi-partition TPC-H unit test and fixes #675

  • Fix ChunkedArray serialization logic to be compatible with PyArrow 6.0.0 #635

  • Fix 100G regression from #600 #631

  • Python version compatibility fix for #628 #630

  • Always run GPU tasks in main thread instead of threadpool. #619

  • Fix self.X == self.X typos in local_eq #618

  • Disable thread pool in MinDalle notebook #617

  • Remove aberrant print #616

  • Hotfix for unguarded pickle5 import #615

Build Changes

  • Update ray requirements to >=2.0.0 #691

  • Manually split files during TPC-H data generation #686

  • Cache TPCH data generation #683

  • Skip file_read benchmark on remote ray runner #679

  • Add debug logging to PyRunner and physical plan. #673

  • Downgrade isort 5.12.0 -> 5.11.5 for py37 compatibility. #652

  • Add Slack notification on failure of CI jobs #641

  • Fixes for Ray compatibility tests #638

  • Fix docs link for ray-init #629

Rust Integration

  • Fix Series arithmetic naming logic #725

  • Fix size_bytes for logical types #724

  • Remove use of ExpressionType.__hash__ in logicalplan #723

  • [rust] leverage getstate directly in rust wrappers #718

  • Add Rust/Series/Expressions for string .startswith and .contains #717

  • [rust] Rust Data structures pickling #716

  • Add ExpressionNamespace construction similar to SeriesNamespace #715

  • Remove ExprResolveTypeError in favor of a plain TypeError #712

  • Use ValueError instead of SchemaMismatch in FunctionExpr evaluation #710

  • [rust] Date Year and Array Slices #709

  • [rust] String multikey sorting bug fix #705

  • Implement grouped aggregations. #704

  • [rust] Simple Date operations via as_physical operations #703

  • Hook in agg expressions to the table.agg API (global mode only). #702

  • Unblocking [RUST-INT][TPCH]: Add a call to .combine_chunks() when creating a Table from arrow #700

  • Unblocking [RUST-INT][TPCH]: Remove calls to Expression._required_columns outside of optimizer #699

  • Fix Table.from_pydict and add unit tests #698

  • [rust] Table and Series Concat #697

  • [rust] Function Expressions and Abs Expression #693

  • Add endswith kernel to Series #692

  • Type-checking schema resolution of Expressions2 #690

  • [rust] Partitioning Ops (Hash, Random, Range) #685

  • Mean, Min, Max, Count global aggs for Expressions2 #684

  • Implement TableIO reads and tests #680

  • [rust] Table Sample and Quantile operations #678

  • [rust] size_bytes for Table and Series #677

  • [rust] Validate field params during expression eval #676

  • Add resolve_schema functionality to expressions2 #671

  • [rust] Single Column Naive Join #670

  • Add more Rust python binding requirements for TPC-H #669

  • Add Date DataType #668

  • Runner IO code #667

  • Expression2 optimizer methods #665

  • Add table_io.py utility module for reading and writing of tables #663

  • Sammy/rust series hash #662

  • Refactor old Expressions for compatibility with Expressions2 #659

  • Refactor old types.py to be compatible with datatypes.py #658

  • Expressions and Schema refactors for v2 compatibility #657

  • Add ExpressionsProjection to replace ExpressionList #656

  • Refactor Schema initialization in the codebase #655

  • Refactor vPartition repartitioning APIs #654

  • Global sum aggregation for expressions2 #653

  • Refactor .from_pydict() to infer schema in the method #651

  • Add Rust bindings for Schema #650

  • [rust] Table Multicolumn Sorts #649

  • Remove partition_id from vPartition #648

  • Add stub code to Table #647

  • [rust] Single Column Table Sorts #645

  • [Rust] Series Sorting and Argsort #640

  • [rust] Table, Series Filter with DaftLogical Operators #620

Deprecations

  • Deprecate type inference from Python types in @udf and .apply #661

Closed Issues

  • df.show() only shows at most 10 rows #721

  • Limit after joins are broken #707

  • [Bug] Resource Requests breaking in tutorial notebook #603

  • Improved UDF syntax #591

  • Order-preserving groupby-aggregations #623

  • Node-level initializations for UDF #622

  • Add option to disable threading in PyRunner #613

  • Typecheck all user parameters of public APIs #612

  • Text to image generation notebook failing on GPU oom #609

  • Limit .show to 10 by default #607