Koalas: pandas API on Apache Spark
Koalas 1.8.2 is a maintenance release. Koalas is officially included in PySpark as pandas API on Spark in Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.
Although moving to pandas API on Spark is recommended, Koalas 1.8.2 still works with Spark 3.2 (#2203).
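For migration, swapping the package import is typically the only change needed. Below is a minimal sketch of a compatibility shim; the helper name `import_pandas_on_spark` is our own illustration, not part of either library:

```python
def import_pandas_on_spark():
    """Return a pandas-on-Spark module: the pandas API on Spark bundled
    with Apache Spark 3.2+ if available, otherwise the standalone Koalas
    package, otherwise None."""
    try:
        import pyspark.pandas as ps  # Apache Spark 3.2+: pandas API on Spark
        return ps
    except ImportError:
        pass
    try:
        import databricks.koalas as ks  # standalone Koalas (Spark < 3.2)
        return ks
    except ImportError:
        return None
```

Since the two APIs are largely identical, most code only needs the module alias swapped.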
Koalas 1.8.1 is a maintenance release. Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.
Along with the following fixes:
Koalas 1.8.0 is the last minor release because Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.
ExtensionDtype
We added the support of pandas' categorical type (#2064, #2106).
>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0 a
1 b
2 b
3 c
4 c
5 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> s.cat.codes
0 0
1 1
2 1
3 2
4 2
5 2
dtype: int8
>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
categories=['a', 'b', 'c'], ordered=False, dtype='category')
>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')
>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')
We also support ExtensionDtype as a type argument to annotate return types (#2120, #2123, #2132, #2127, #2126, #2125, #2124):
def func() -> ks.Series[pd.Int32Dtype()]:
...
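For reference, `pd.Int32Dtype()` here is pandas' nullable 32-bit integer extension dtype, which can hold missing values unlike plain `int32` (pandas only, no Koalas required):

```python
import pandas as pd

# a nullable Int32 Series: None becomes pd.NA instead of forcing float64
s = pd.Series([1, 2, None], dtype=pd.Int32Dtype())
print(s.dtype)            # Int32
print(s.isna().tolist())  # [False, False, True]
```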
We added the following new features:
DataFrame:
first
(#2128)at_time
(#2116)Series:
at_time
(#2130)first
(#2128)between_time
(#2129)DatetimeIndex:
indexer_between_time
(#2104)indexer_at_time
(#2109)between_time
(#2111)Along with the following fixes:
We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).
import databricks.koalas as ks

kdf = ks.DataFrame({
    'a': [1, 2, 2.5, 3, 3.5, 4, 5],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()
The plotting backend can be switched to Matplotlib by setting ks.options.plotting.backend to "matplotlib":
ks.options.plotting.backend = "matplotlib"
We added more types of Index such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).
Previously, creating an index always returned a generic Index instance regardless of the data type. Now, Int64Index, Float64Index or DatetimeIndex is returned depending on the data type of the index.
>>> type(ks.Index([1, 2, 3]))
<class 'databricks.koalas.indexes.numeric.Int64Index'>
>>> type(ks.Index([1.1, 2.5, 3.0]))
<class 'databricks.koalas.indexes.numeric.Float64Index'>
>>> type(ks.Index([datetime.datetime(2021, 3, 9)]))
<class 'databricks.koalas.indexes.datetimes.DatetimeIndex'>
In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074), and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
An Index can now be created from Series or Index objects (#2071).
>>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
>>> ks.Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Int64Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Float64Index(kser)
Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
>>> kser = ks.Series([datetime(2021, 3, 1), datetime(2021, 3, 2)], index=[10, 20])
>>> ks.Index(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
>>> ks.DatetimeIndex(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
We added basic extension dtypes support (#2039).
>>> kdf = ks.DataFrame(
...     {
...         "a": [1, 2, None, 3],
...         "b": [4, 5, 6, None],
...         "c": ["A", "B", "C", None],
...         "d": [False, None, True, False],
...     }
... ).astype({"a": "Int32", "b": "Int64", "c": "string", "d": "boolean"})
>>> kdf
      a     b     c      d
0     1     4     A  False
1     2     5     B   <NA>
2  <NA>     6     C   True
3     3  <NA>  <NA>  False
>>> kdf.dtypes
a      Int32
b      Int64
c     string
d    boolean
dtype: object
The following types are supported, depending on the installed pandas version:
Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
BooleanDtype
StringDtype
Float32Dtype
Float64Dtype
Binary operations and type casting are supported:
>>> kdf.a + kdf.b
0       5
1       7
2    <NA>
3    <NA>
dtype: Int64
>>> kdf + kdf
      a     b
0     2     8
1     4    10
2  <NA>    12
3     6  <NA>
>>> kdf.a.astype('Float64')
0     1.0
1     2.0
2    <NA>
3     3.0
Name: a, dtype: Float64
We added the following new features:
- koalas: date_range (#2081), read_orc (#2017)
- Series: align (#2019)
- DataFrame: align (#2019), to_orc (#2024)

Along with the following fixes:
We improved plotting support by implementing pie, histogram and box plots with the Plotly plotting backend. Koalas can now plot data with Plotly via:
- DataFrame.plot.pie and Series.plot.pie (#1971)
- DataFrame.plot.hist and Series.plot.hist (#1999)
- Series.plot.box (#2007)
In addition, we optimized the histogram calculation into a single pass over the DataFrame (#1997) instead of launching a separate job to calculate each Series in the DataFrame.
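Conceptually, the single-pass version computes shared bin edges once and then buckets every column against them, rather than planning a separate histogram computation per column. A local pandas/NumPy sketch of the idea (our own illustration, not the actual Koalas/Spark implementation):

```python
import numpy as np
import pandas as pd

def hist_counts_all_columns(pdf, bins=10):
    # shared bin edges over the global min/max of all columns
    edges = np.linspace(pdf.min().min(), pdf.max().max(), bins + 1)
    # one bucketing pass per column against the same edges
    return {col: np.histogram(pdf[col], bins=edges)[0] for col in pdf.columns}

pdf = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 4, 4]})
print(hist_counts_all_columns(pdf, bins=3))
```

In Spark terms, sharing the bin edges lets all columns be aggregated in one job.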
The operations between Series and Index are now supported as below (#1996):
>>> kser = ks.Series([1, 2, 3, 4, 5, 6, 7])
>>> kidx = ks.Index([0, 1, 2, 3, 4, 5, 6])
>>> (kser + 1 + 10 * kidx).sort_index()
0 2
1 13
2 24
3 35
4 46
5 57
6 68
dtype: int64
>>> (kidx + 1 + 10 * kser).sort_index()
0 11
1 22
2 33
3 44
4 55
5 66
6 77
dtype: int64
We added support for setting a column via attribute assignment in DataFrame (#1989).
>>> kdf = ks.DataFrame({'A': [1, 2, 3, None]})
>>> kdf.A = kdf.A.fillna(kdf.A.median())
>>> kdf
A
0 1.0
1 2.0
2 3.0
3 2.0
We added the following new features:
- Series: factorize (#1972), sem (#1993)
- DataFrame: insert (#1983), sem (#1993)

In addition, we also implemented new parameters:
Along with the following fixes:
We improved Index operations support (#1944, #1955).
Here are some examples:
Before
>>> kidx = ks.Index([1, 2, 3, 4, 5])
>>> kidx + kidx
Int64Index([2, 4, 6, 8, 10], dtype='int64')
>>> kidx + kidx + kidx
Traceback (most recent call last):
...
AssertionError: args should be single DataFrame or single/multiple Series
>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Traceback (most recent call last):
...
AssertionError: args should be single DataFrame or single/multiple Series
After
>>> kidx = ks.Index([1, 2, 3, 4, 5])
>>> kidx + kidx + kidx
Int64Index([3, 6, 9, 12, 15], dtype='int64')
>>> ks.options.compute.ops_on_diff_frames = True
>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Int64Index([7, 9, 11, 13, 15], dtype='int64')
We added the following new features:
- DataFrame: swaplevel (#1928), swapaxes (#1946), dot (#1945), itertuples (#1960)
- Series: swaplevel (#1919), swapaxes (#1954)
- Index: to_list (#1948)
- MultiIndex: to_list (#1948)
- GroupBy: tail (#1949), median (#1957)

We improved the type mapping between pandas and Koalas (#1870, #1903). We added more types or string expressions to specify the data type, and fixed mismatches between pandas and Koalas.
Here are some examples:
Added np.float32 and "float32" (matched to FloatType):
>>> ks.Series([10]).astype(np.float32)
0 10.0
dtype: float32
>>> ks.Series([10]).astype("float32")
0 10.0
dtype: float32
Added np.datetime64 and "datetime64[ns]" (matched to TimestampType):
>>> ks.Series(["2020-10-26"]).astype(np.datetime64)
0 2020-10-26
dtype: datetime64[ns]
>>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
0 2020-10-26
dtype: datetime64[ns]
Fixed np.int to match LongType, not IntegerType:
>>> pd.Series([100]).astype(np.int)
0    100
dtype: int64
>>> ks.Series([100]).astype(np.int)
0    100
dtype: int32  # This is fixed to `int64` now.
Fixed np.float to match DoubleType, not FloatType:
>>> pd.Series([100]).astype(np.float)
0    100.0
dtype: float64
>>> ks.Series([100]).astype(np.float)
0    100.0
dtype: float32  # This is fixed to `float64` now.
We also added a document which describes supported/unsupported pandas data types or data type mapping between pandas data types and PySpark data types. See: Type Support In Koalas.
To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, Window objects, etc. (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).
The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:
It also helps mypy enable static analysis over the method body.
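In stub form, such annotations look roughly like this (a simplified, hypothetical excerpt for illustration, not the actual Koalas source):

```python
from __future__ import annotations

class Series:
    def sum(self) -> int: ...          # simplified; actual return types vary

class DataFrame:
    # returning concrete types lets Jedi chain completions and mypy
    # type-check code that uses the results
    def head(self, n: int = 5) -> DataFrame: ...
    def sum(self) -> Series: ...
```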
We verified the behaviors of pandas 1.1.4 in Koalas. As pandas 1.1.4 introduced a behavior change related to MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), Koalas also changed its behavior (#1881).
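For reference, these properties evaluate monotonicity lexicographically over the tuples of a MultiIndex; a quick pandas check (pandas only, no Koalas required):

```python
import pandas as pd

midx = pd.MultiIndex.from_tuples([(1, "a"), (1, "b"), (2, "a")])
# (1, 'a') <= (1, 'b') <= (2, 'a') lexicographically
print(midx.is_monotonic_increasing)  # True
print(midx.is_monotonic_decreasing)  # False
```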
We added the following new features:
- DataFrame: __neg__ (#1847), rename_axis (#1843), spark.repartition (#1864), spark.coalesce (#1873), spark.checkpoint (#1877), spark.local_checkpoint (#1878), reindex_like (#1880)
- Series: rename_axis (#1843), compare (#1802), reindex_like (#1880)
- Index: intersection (#1747)
- MultiIndex: intersection (#1747)

We verified the behaviors of pandas 1.1 in Koalas. Koalas now supports pandas 1.1 officially (#1688, #1822, #1829).
Now we support non-string names (#1784). Previously, names in Koalas, e.g., df.columns, df.columns.names and df.index.names, needed to be a string or a tuple of strings, but they can now be any data type supported by Spark.
Before:
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Index(['0', '1'], dtype='object')
After:
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Int64Index([0, 1], dtype='int64')
The performance of creating a distributed-sequence default index is improved by avoiding the interaction between Python and the JVM (#1699).
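The idea behind a distributed-sequence index can be sketched in pure Python (our own illustration, not the Koalas code): count the rows in each partition, then add each partition's cumulative offset to its local positions:

```python
from itertools import accumulate

def distributed_sequence(partitions):
    # number of rows in each partition
    counts = [len(part) for part in partitions]
    # offset of each partition = total rows in all preceding partitions
    offsets = [0] + list(accumulate(counts))[:-1]
    # global sequential ids, computed independently per partition
    return [[off + i for i in range(len(part))]
            for off, part in zip(offsets, partitions)]

print(distributed_sequence([["x", "y"], ["z"], ["p", "q"]]))
# [[0, 1], [2], [3, 4]]
```

In Spark, only the per-partition counts need to be shared across partitions up front; each partition can then number its own rows independently.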
We made the behaviors of binary operations (+, -, *, /, //, %) between int and str columns consistent with the respective pandas behaviors (#1828).
It standardizes binary operations as follows:
- +: raises TypeError between an int column and a str column (or string literal).
- *: acts as Spark SQL repeat between an int column (or int literal) and a str column; raises TypeError if a string literal is involved.
- -, /, //, % (modulo): raise TypeError if a str column (or string literal) is involved.

We added the following new features:
- DataFrame: product (#1739), from_dict (#1778), pad (#1786), backfill (#1798)
- Series: reindex (#1737), explode (#1777), pad (#1786), argmin (#1790), argmax (#1790), argsort (#1793), backfill (#1798)
- Index: inferred_type (#1745), item (#1744), is_unique (#1766), asi8 (#1764), is_type_compatible (#1765), view (#1788), insert (#1804)
- MultiIndex: inferred_type (#1745), item (#1744), is_unique (#1766), asi8 (#1764), is_type_compatible (#1765), from_frame (#1762), view (#1788), insert (#1804)
- GroupBy: get_group (#1783)

We added support for non-named Series (#1712). Previously, Koalas automatically named a Series "0" if no name was specified or None was set as the name, whereas pandas allows a Series without a name.
For example:
>>> ks.__version__
'1.1.0'
>>> kser = ks.Series([1, 2, 3])
>>> kser
0 1
1 2
2 3
Name: 0, dtype: int64
>>> kser.name = None
>>> kser
0 1
1 2
2 3
Name: 0, dtype: int64
Now the Series will be non-named.
>>> ks.__version__
'1.2.0'
>>> ks.Series([1, 2, 3])
0 1
1 2
2 3
dtype: int64
>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0 1
1 2
2 3
dtype: int64
Previously, the "distributed-sequence" default index sometimes produced wrong values or even raised an exception. For example, the code below:
>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index()
failed as below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
...
File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
current_partition_offset = sums[id.iloc[0]]
KeyError: 103
We investigated and made the default index type more stable (#1701). It is now stable enough and unlikely to cause such situations.
We changed the testing infrastructure to use pandas' testing utilities for exact checks (#1722). It now compares even index/column types and names, so that we can follow pandas more strictly.
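For example, pandas' testing utilities fail on any exact-metadata difference, including Series names (pandas only, no Koalas required):

```python
import pandas as pd
from pandas.testing import assert_series_equal

# identical values, dtype, and name: passes
assert_series_equal(pd.Series([1, 2], name="a"), pd.Series([1, 2], name="a"))

# same values but a different name: the exact check fails
try:
    assert_series_equal(pd.Series([1, 2], name="a"), pd.Series([1, 2], name="b"))
    raise RuntimeError("expected the comparison to fail")
except AssertionError:
    print("name mismatch detected")
```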
We added the following new features:
- DataFrame: last_valid_index (#1705)
- Series: product (#1677), last_valid_index (#1705)
- GroupBy: cumcount (#1702)

In addition:
- Support partitionBy explicitly in to_parquet.
- Added mode and partition_cols to to_csv and to_json.
- Optional.
- PlotAccessor for DataFrame and Series (#1662).

We added support for API extensions (#1617).
You can register your custom accessors to DataFrame, Series, and Index.
For example, in your library code:
from databricks.koalas.extensions import register_dataframe_accessor

@register_dataframe_accessor("geo")
class GeoAccessor:

    def __init__(self, koalas_obj):
        self._obj = koalas_obj
        # other constructor logic

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map
        pass

    ...
Then, in a session:
>>> from my_ext_lib import GeoAccessor
>>> kdf = ks.DataFrame({"longitude": np.linspace(0,10),
... "latitude": np.linspace(0, 20)})
>>> kdf.geo.center
(5.0, 10.0)
>>> kdf.geo.plot()
...
See also: https://koalas.readthedocs.io/en/latest/reference/extensions.html
We introduced the plotting.backend configuration (#1639).
Plotly (>=4.8) or other libraries that pandas supports can be used as a plotting backend if they are installed in the environment.
>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
>>> kdf.plot(title="Example Figure") # defaults to backend="matplotlib"
>>> fig = kdf.plot(backend="plotly", title="Example Figure", height=500, width=500)
>>> ## same as:
>>> # ks.options.plotting.backend = "plotly"
>>> # fig = kdf.plot(title="Example Figure", height=500, width=500)
>>> fig.show()
Each backend returns the figure in their own format, allowing for further editing or customization if required.
>>> fig.update_layout(template="plotly_dark")
>>> fig.show()
We introduced the koalas accessor and some methods specific to Koalas (#1613, #1628).
DataFrame.apply_batch, DataFrame.transform_batch, and Series.transform_batch are deprecated and moved to the koalas accessor.
>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
... return pdf + 1 # should always return the same length as input.
...
>>> kdf.koalas.transform_batch(pandas_plus)
a b
0 2 5
1 3 6
2 4 7
>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_filter(pdf):
... return pdf[pdf.a > 1] # allow arbitrary length
...
>>> kdf.koalas.apply_batch(pandas_filter)
a b
1 2 5
2 3 6
or
>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pser):
... return pser + 1 # should always return the same length as input.
...
>>> kdf.a.koalas.transform_batch(pandas_plus)
0 2
1 3
2 4
Name: a, dtype: int64
See also: https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html
We added the following new features:
- DataFrame: tail (#1632), droplevel (#1622)
- Series: iteritems (#1603), items (#1603), tail (#1632), droplevel (#1630)