Daft 0.0.24 Release Notes
The Daft 0.0.24 release includes many bug fixes and moves Daft much closer to full integration of its Rust code. The highlights are:
Better integrations with Ray Datasets, adding a DataFrame.from_ray_dataset constructor and leveraging its ArrowTensorExtension types when converting data.
The DataFrame.into_partitions API adds a much faster way of increasing/decreasing the number of partitions in a DataFrame without incurring a shuffle.
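To make the "without incurring a shuffle" point concrete, here is a minimal pure-Python sketch of the idea. It is not Daft's implementation: `into_partitions_sketch` and `split_evenly` are hypothetical names, and partitions are modelled as plain lists of rows. The assumption illustrated is that partitions are only split in place or merged with their neighbours, so no row has to be redistributed by key.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], k: int) -> List[List[T]]:
    """Split a sequence into k contiguous slices whose sizes differ by at most one."""
    base, extra = divmod(len(items), k)
    out, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        out.append(list(items[start:start + size]))
        start += size
    return out


def into_partitions_sketch(parts: List[List[T]], n: int) -> List[List[T]]:
    """Hypothetical stand-in: change the partition count without a shuffle."""
    if n <= 0:
        raise ValueError("n must be positive")
    m = len(parts)
    if m == 0:
        return [[] for _ in range(n)]
    if n == m:
        return [list(p) for p in parts]
    if n < m:
        # Coalesce: merge contiguous runs of neighbouring input partitions.
        groups = split_evenly(range(m), n)
        return [[row for i in group for row in parts[i]] for group in groups]
    # Split: every input partition is split independently of the others,
    # i.e. one unit of work per input partition.
    outputs_per_input = [len(s) for s in split_evenly(range(n), m)]
    out: List[List[T]] = []
    for part, k in zip(parts, outputs_per_input):
        out.extend(split_evenly(part, k))
    return out


parts = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
fewer = into_partitions_sketch(parts, 2)
more = into_partitions_sketch(parts, 5)
```

In this sketch row order is preserved in both directions, and the resulting partitions may be uneven in size. That is the trade-off for skipping a full repartition.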
Enhancements
Refactor @udf to return a UDF class instead of wrapped functions #644
Add DataFrame.from_ray_dataset #722
Use Ray Datasets ArrowTensorExtension type when handling numpy arrays #719
New API: DataFrame.into_partitions(n) #711
Issue #707: Fix join run_partial_metadata logic #708
Push row limit through to Parquet row group batch reading. #646
RayRunner: Run scheduler locally if we are not in ray client mode #643
Deduce, propagate, and use partial partition metadata in physical plan. #642
Query and store parquet metadata. #634
Add future-like abstraction to PartitionTask #633
Remove type annotation inference from UDFs #632
Check top-level types for public API parameters #628
Refactor aggregation logic into an AggregationPlanBuilder class #627
Set show to default to showing 8 rows #624
Add option to disable threadpool in PyRunner #614
Typecheck resource_request parameter in DataFrame.with_column() #611
Improve resource request settings for file reads #602
Add 2x partition size memory to all tasks’ ResourceRequests. #600
into_partitions (splitting) emits one task per input partition instead of per output partition. #726
Ignore multiple calls to set_runner_ray with kwarg noop_if_initialized #728
Bug Fixes
Fix bug with tabular file scans #674
Pin the version of sphinx-book-theme to <1.0.0 which fixes our docs #639
Fix notebooks for UDF changes #636
Fix serialization of ArrowDataBlocks with shared buffers #625
Fix df.show() always showing at most 10 rows #720
Fix schema inference of reads from Parquet and empty vPartitions #706
Add multi-partition TPC-H unit test and fixes #675
Fix ChunkedArray serialization logic to be compatible with PyArrow 6.0.0 #635
Fix 100G regression from #600 #631
Python version compatibility fix for #628 #630
Always run GPU tasks in main thread instead of threadpool. #619
Fix self.X == self.X typos in local_eq #618
Disable thread pool in MinDalle notebook #617
Remove aberrant print #616
Hotfix for unguarded pickle5 import #615
Build Changes
Update ray requirements to >=2.0.0 #691
Manually split files during TPC-H data generation #686
Cache TPCH data generation #683
Skip file_read benchmark on remote ray runner #679
Add debug logging to PyRunner and physical plan. #673
Downgrade isort 5.12.0 -> 5.11.5 for py37 compatibility. #652
Add Slack notification on failure of CI jobs #641
Fixes for Ray compatibility tests #638
Fix docs link for ray-init #629
Rust Integration
Fix Series arithmetic naming logic #725
Fix size_bytes for logical types #724
Remove use of ExpressionType.__hash__ in logicalplan #723
[rust] leverage getstate directly in rust wrappers #718
Add Rust/Series/Expressions for string .startswith and .contains #717
[rust] Rust Data structures pickling #716
Add ExpressionNamespace construction similar to SeriesNamespace #715
Remove ExprResolveTypeError in favor of a plain TypeError #712
Use ValueError instead of SchemaMismatch in FunctionExpr evaluation #710
[rust] Date Year and Array Slices #709
[rust] String multikey sorting bug fix #705
Implement grouped aggregations. #704
[rust] Simple Date operations via as_physical operations #703
Hook in agg expressions to the table.agg API (global mode only). #702
Unblocking [RUST-INT][TPCH]: Add a call to .combine_chunks() when creating a Table from arrow #700
Unblocking [RUST-INT][TPCH]: Remove calls to Expression._required_columns outside of optimizer #699
Fix Table.from_pydict and add unit tests #698
[rust] Table and Series Concat #697
[rust] Function Expressions and Abs Expression #693
Add endswith kernel to Series #692
Type-checking schema resolution of Expressions2 #690
[rust] Partitioning Ops (Hash, Random, Range) #685
Mean, Min, Max, Count global aggs for Expressions2 #684
Implement TableIO reads and tests #680
[rust] Table Sample and Quantile operations #678
[rust] size_bytes for Table and Series #677
[rust] Validate field params during expression eval #676
Add resolve_schema functionality to expressions2 #671
[rust] Single Column Naive Join #670
Add more Rust python binding requirements for TPC-H #669
Add Date DataType #668
Runner IO code #667
Expression2 optimizer methods #665
Add table_io.py utility module for reading and writing of tables #663
Sammy/rust series hash #662
Refactor old Expressions for compatibility with Expressions2 #659
Refactor old types.py to be compatible with datatypes.py #658
Expressions and Schema refactors for v2 compatibility #657
Add ExpressionsProjection to replace ExpressionList #656
Refactor Schema initialization in the codebase #655
Refactor vPartition repartitioning APIs #654
Global sum aggregation for expressions2 #653
Refactor .from_pydict() to infer schema in the method #651
Add Rust bindings for Schema #650
[rust] Table Multicolumn Sorts #649
Remove partition_id from vPartition #648
Add stub code to Table #647
[rust] Single Column Table Sorts #645
[Rust] Series Sorting and Argsort #640
[rust] Table, Series Filter with DaftLogical Operators #620
Deprecations
Deprecate type inference from Python types in @udf and .apply #661
Closed Issues
df.show() only shows at most 10 rows #721
Limit after joins are broken #707
[Bug] Resource Requests breaking in tutorial notebook #603
Improved UDF syntax #591
Order-preserving groupby-aggregations #623
Node-level initializations for UDF #622
Add option to disable threading in PyRunner #613
Typecheck all user parameters of public APIs #612
Text to image generation notebook failing on GPU oom #609
Limit .show to 10 by default #607