{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install getdaft --pre --extra-index-url https://pypi.anaconda.org/daft-nightly/simple" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```{hint}\n", "✨✨✨ **Run this notebook on Google Colab** ✨✨✨\n", "\n", "You can [run this notebook yourself with Google Colab](https://colab.research.google.com/github/Eventual-Inc/Daft/blob/main/docs/source/learn/10-min.ipynb)!\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 10-minute Quickstart\n", "\n", "This is a short introduction to Daft's main functionality, geared towards new users.\n", "\n", "We import from daft as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from daft import DataFrame" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame creation\n", "\n", "See also: [API Reference: DataFrame Construction](df-construction)\n", "\n", "We can create a DataFrame from a dictionary of columns: a dictionary whose keys are strings naming the columns and whose values are equal-length lists holding each column's values."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-01-21 14:36:57.855 | INFO | daft.context:runner:85 - Using PyRunner\n" ] } ], "source": [ "import datetime\n", "\n", "df = DataFrame.from_pydict({\n", " \"A\": [1, 2, 3, 4],\n", " \"B\": [1.5, 2.5, 3.5, 4.5],\n", " \"C\": [True, True, False, False],\n", " \"D\": [\"a\", \"b\", \"c\", \"d\"],\n", " \"E\": [b\"a\", b\"b\", b\"c\", b\"d\"],\n", " \"F\": [datetime.date(1994, 1, 1), datetime.date(1994, 1, 2), datetime.date(1994, 1, 3), datetime.date(1994, 1, 4)],\n", " \"G\": [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]],\n", "})" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can also load DataFrames from other sources, such as:\n", "\n", "1. CSV files: `DataFrame.read_csv(\"s3://bucket/*.csv\")`\n", "2. Parquet files: `DataFrame.read_parquet(\"/path/*.parquet\")`\n", "3. Line-delimited JSON files: `DataFrame.read_json(\"/path/*.json\")`\n", "4. Files on disk: `DataFrame.from_glob_path(\"/path/*.jpeg\")`\n", "\n", "Daft supports both local paths and paths to object storage such as AWS S3." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Inspect your DataFrame by printing the `df` variable." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] |
---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] |
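The table above is the row-wise view of the column dictionary passed to `DataFrame.from_pydict`. As a plain-Python illustration only (this is not how Daft stores or builds the data, and the helper name `columns_to_rows` is hypothetical), the sketch below checks that every column has the same length and then zips the columns into rows:

```python
import datetime


def columns_to_rows(columns):
    """Hypothetical helper: turn a dict of equal-length columns into row tuples."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have unequal lengths: {sorted(lengths)}")
    # zip(*...) pairs up the i-th value of every column into one row.
    return list(zip(*columns.values()))


# Same data as the dictionary used to create `df`.
columns = {
    "A": [1, 2, 3, 4],
    "B": [1.5, 2.5, 3.5, 4.5],
    "C": [True, True, False, False],
    "D": ["a", "b", "c", "d"],
    "E": [b"a", b"b", b"c", b"d"],
    "F": [datetime.date(1994, 1, d) for d in (1, 2, 3, 4)],
    "G": [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]],
}

rows = columns_to_rows(columns)
for row in rows:
    print(row)
```

Passing columns of different lengths to this helper raises a `ValueError`, which mirrors the equal-length requirement stated for the dictionary.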
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] |
---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] |
---|---|---|---|---|---|---|
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] |
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] |
---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] |
A INTEGER | B FLOAT |
---|---|
1 | 1.5 |
2 | 2.5 |
3 | 3.5 |
4 | 4.5 |
A2 INTEGER | B FLOAT |
---|---|
1 | 1.5 |
2 | 2.5 |
3 | 3.5 |
4 | 4.5 |
B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] |
---|---|---|---|---|---|
1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] |
2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] |
3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] |
4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] | A_plus1 INTEGER |
---|---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] | 2 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] | 3 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] | 4 |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] | 5 |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] | D_length INTEGER |
---|---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] | 1 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] | 1 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] | 1 |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] | 1 |
urls STRING | image_bytes BYTES |
---|---|
http://farm9.staticflickr.com/8186/8119368305_4e622c8349_... | b'\\xff\\xd8\\xff\\xe1\\x00TExif\\x00\\x00MM\\x00*\\x00\\x00\\x00\\x0... |
http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg | b'\\xff\\xd8\\xff\\xe1\\x00(Exif\\x00\\x00MM\\x00*\\x00\\x00\\x00\\x0... |
http://farm3.staticflickr.com/2169/2118578392_1193aa04a0_... | b'\\xff\\xd8\\xff\\xe1\\x00\\x16Exif\\x00\\x00MM\\x00*\\x00\\x00\\x00... |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] | G_repeat PY[object] |
---|---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] | [1, 1, 1, 1, 1, 1, 1, 1, 1] |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] | [2, 2, 2, 2, 2, 2, 2, 2, 2] |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] | [3, 3, 3, 3, 3, 3, 3, 3, 3] |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] | [4, 4, 4, 4, 4, 4, 4, 4, 4] |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] | G_count_A PY[object] |
---|---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] | 3 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] | 3 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] | 3 |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] | 3 |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] | G_to_numpy PY[ndarray] |
---|---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] | <np.ndarray shape=(3,) dtype=int64> |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] | <np.ndarray shape=(3,) dtype=int64> |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] | <np.ndarray shape=(3,) dtype=int64> |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] | <np.ndarray shape=(3,) dtype=int64> |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[object] |
---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | 1 |
1 | 1.5 | true | a | b'a' | 1994-01-01 | 1 |
1 | 1.5 | true | a | b'a' | 1994-01-01 | 1 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | 2 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | 2 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | 2 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | 3 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | 3 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | 3 |
4 | 4.5 | false | d | b'd' | 1994-01-04 | 4 |
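In the table above, column `G` has gone from one list per row to one element per row, with the other columns repeated for each element. The following plain-Python sketch illustrates that explode step on a reduced set of columns; it is not Daft's API, and `explode_rows` is a hypothetical helper:

```python
def explode_rows(rows, column):
    """Hypothetical helper: emit one output row per element of rows[i][column]."""
    exploded = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)  # copy so the other columns are repeated
            new_row[column] = element
            exploded.append(new_row)
    return exploded


rows = [
    {"A": 1, "D": "a", "G": [1, 1, 1]},
    {"A": 2, "D": "b", "G": [2, 2, 2]},
    {"A": 3, "D": "c", "G": [3, 3, 3]},
    {"A": 4, "D": "d", "G": [4, 4, 4]},
]

exploded = explode_rows(rows, "G")
print(len(exploded))
for row in exploded[:4]:
    print(row)
```

Four input rows with three elements each give twelve output rows in this sketch. The table above lists ten, which presumably reflects a display limit rather than missing data.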
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] | F_add_A_days DATE |
---|---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] | 1994-01-02 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] | 1994-01-04 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] | 1994-01-06 |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] | 1994-01-08 |
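The `F_add_A_days` column above is each date in `F` shifted forward by the number of days in `A`. A minimal standard-library sketch of the same arithmetic (plain Python, not the Daft expression that produced the table):

```python
import datetime

a_values = [1, 2, 3, 4]
f_values = [datetime.date(1994, 1, d) for d in (1, 2, 3, 4)]

# Add A days to each date in F, row by row.
f_add_a_days = [
    date + datetime.timedelta(days=days)
    for date, days in zip(f_values, a_values)
]

for date in f_add_a_days:
    print(date.isoformat())
```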
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] | expensive_model_results FLOAT |
---|---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] | 8.07 |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] | 13.86 |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] | 19.65 |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] | 25.44 |
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] |
---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] |
floats FLOAT | floats_is_null LOGICAL | floats_is_nan LOGICAL |
---|---|---|
1.5 | false | false |
None | true | none |
nan | false | true |
floats FLOAT | floats_is_null LOGICAL | floats_is_nan LOGICAL | filled_in_floats FLOAT |
---|---|---|---|
1.5 | false | false | 1.5 |
None | true | none | 0 |
nan | false | true | nan |
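The table above distinguishes a missing value (`None`) from a floating-point NaN: filling nulls replaces only the `None` row and leaves `nan` untouched. The plain-Python sketch below rebuilds the same four columns, assuming (as the table suggests) that the NaN check on a null value is itself null. It illustrates the semantics only and does not use Daft:

```python
import math

floats = [1.5, None, float("nan")]

# A value is null only if it is missing altogether.
floats_is_null = [value is None for value in floats]

# NaN is a real float value; the check on a null stays null.
floats_is_nan = [None if value is None else math.isnan(value) for value in floats]

# Filling nulls with 0.0 touches only the missing entry, not the NaN.
filled_in_floats = [0.0 if value is None else value for value in floats]

for row in zip(floats, floats_is_null, floats_is_nan, filled_in_floats):
    print(row)
```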
A INTEGER | B FLOAT | C LOGICAL | D STRING | E BYTES | F DATE | G PY[list] |
---|---|---|---|---|---|---|
1 | 1.5 | true | a | b'a' | 1994-01-01 | [1, 1, 1] |
2 | 2.5 | true | b | b'b' | 1994-01-02 | [2, 2, 2] |
3 | 3.5 | false | c | b'c' | 1994-01-03 | [3, 3, 3] |
4 | 4.5 | false | d | b'd' | 1994-01-04 | [4, 4, 4] |
A STRING | B STRING | C INTEGER | D INTEGER |
---|---|---|---|
foo | a | 0 | 0 |
bar | a | 1 | 1 |
foo | b | 2 | 2 |
bar | c | 3 | 3 |
foo | b | 4 | 4 |
bar | b | 5 | 5 |
foo | a | 6 | 6 |
foo | c | 7 | 7 |
A STRING | C_sum INTEGER | D_mean INTEGER |
---|---|---|
bar | 9 | 3 |
foo | 19 | 3.8 |
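The `C_sum` and `D_mean` values above can be checked by hand from the eight-row table that precedes them. A standard-library sketch of the same group-by-`A` aggregation (plain Python for verification, not the Daft call that produced the output):

```python
from collections import defaultdict
from statistics import mean

# Rows (A, B, C, D) copied from the table above.
rows = [
    ("foo", "a", 0, 0),
    ("bar", "a", 1, 1),
    ("foo", "b", 2, 2),
    ("bar", "c", 3, 3),
    ("foo", "b", 4, 4),
    ("bar", "b", 5, 5),
    ("foo", "a", 6, 6),
    ("foo", "c", 7, 7),
]

# Collect the C and D values belonging to each value of A.
groups = defaultdict(lambda: {"C": [], "D": []})
for a, _b, c, d in rows:
    groups[a]["C"].append(c)
    groups[a]["D"].append(d)

# Aggregate: sum of C and mean of D per group.
result = {
    key: {"C_sum": sum(vals["C"]), "D_mean": mean(vals["D"])}
    for key, vals in groups.items()
}

for key in sorted(result):
    print(key, result[key])
```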
A STRING | B STRING | C_sum INTEGER | D_mean INTEGER |
---|---|---|---|
foo | a | 6 | 3 |
bar | b | 5 | 5 |
foo | c | 7 | 7 |
bar | a | 1 | 1 |
bar | c | 3 | 3 |
foo | b | 6 | 3 |
file_path STRING |
---|
my-dataframe.csv/c9da510b-9bef-4d4b-a3db-d40462100b52-0.csv |