.. _missing_data: {{ header }} ************************* Working with missing data ************************* Values considered "missing" ~~~~~~~~~~~~~~~~~~~~~~~~~~~ pandas uses different sentinel values to represent a missing (also referred to as NA) depending on the data type. ``numpy.nan`` for NumPy data types. The disadvantage of using NumPy data types is that the original data type will be coerced to ``np.float64`` or ``object``. .. ipython:: python pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2]) pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2]) :class:`NaT` for NumPy ``np.datetime64``, ``np.timedelta64``, and :class:`PeriodDtype`. For typing applications, use :class:`api.typing.NaTType`. .. ipython:: python pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2]) pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2]) pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2]) :class:`NA` for :class:`StringDtype`, :class:`Int64Dtype` (and other bit widths), :class:`Float64Dtype` (and other bit widths), :class:`BooleanDtype` and :class:`ArrowDtype`. These types will maintain the original data type of the data. For typing applications, use :class:`api.typing.NAType`. .. ipython:: python pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2]) pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2]) To detect these missing value, use the :func:`isna` or :func:`notna` methods. .. ipython:: python ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT]) ser pd.isna(ser) .. note:: :func:`isna` or :func:`notna` will also consider ``None`` a missing value. .. ipython:: python ser = pd.Series([1, None], dtype=object) ser pd.isna(ser) .. warning:: Equality comparisons between ``np.nan``, :class:`NaT`, and :class:`NA` do not act like ``None`` .. ipython:: python None == None # noqa: E711 np.nan == np.nan pd.NaT == pd.NaT pd.NA == pd.NA Therefore, an equality comparison between a :class:`DataFrame` or :class:`Series` with one of these missing values does not provide the same information as :func:`isna` or :func:`notna`. .. ipython:: python ser = pd.Series([True, None], dtype="boolean[pyarrow]") ser == pd.NA pd.isna(ser) .. _missing_data.NA: :class:`NA` semantics ~~~~~~~~~~~~~~~~~~~~~ .. warning:: Experimental: the behaviour of :class:`NA` can still change without warning. Starting from pandas 1.0, an experimental :class:`NA` value (singleton) is available to represent scalar missing values. The goal of :class:`NA` is provide a "missing" indicator that can be used consistently across data types (instead of ``np.nan``, ``None`` or ``pd.NaT`` depending on the data type). For example, when having missing values in a :class:`Series` with the nullable integer dtype, it will use :class:`NA`: .. ipython:: python s = pd.Series([1, 2, None], dtype="Int64") s s[2] s[2] is pd.NA Currently, pandas does not use those data types using :class:`NA` by default in a :class:`DataFrame` or :class:`Series`, so you need to specify the dtype explicitly. An easy way to convert to those dtypes is explained in the :ref:`conversion section `. Propagation in arithmetic and comparison operations --------------------------------------------------- In general, missing values *propagate* in operations involving :class:`NA`. When one of the operands is unknown, the outcome of the operation is also unknown. For example, :class:`NA` propagates in arithmetic operations, similarly to ``np.nan``: .. ipython:: python pd.NA + 1 "a" * pd.NA There are a few special cases when the result is known, even when one of the operands is ``NA``. .. ipython:: python pd.NA ** 0 1 ** pd.NA In equality and comparison operations, :class:`NA` also propagates. This deviates from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always return ``False``. .. ipython:: python pd.NA == 1 pd.NA == pd.NA pd.NA < 2.5 To check if a value is equal to :class:`NA`, use :func:`isna` .. ipython:: python pd.isna(pd.NA) .. note:: An exception on this basic propagation rule are *reductions* (such as the mean or the minimum), where pandas defaults to skipping missing values. See the :ref:`calculation section ` for more. Logical operations ------------------ For logical operations, :class:`NA` follows the rules of the `three-valued logic `__ (or *Kleene logic*, similarly to R, SQL and Julia). This logic means to only propagate missing values when it is logically required. For example, for the logical "or" operation (``|``), if one of the operands is ``True``, we already know the result will be ``True``, regardless of the other value (so regardless the missing value would be ``True`` or ``False``). In this case, :class:`NA` does not propagate: .. ipython:: python True | False True | pd.NA pd.NA | True On the other hand, if one of the operands is ``False``, the result depends on the value of the other operand. Therefore, in this case :class:`NA` propagates: .. ipython:: python False | True False | False False | pd.NA The behaviour of the logical "and" operation (``&``) can be derived using similar logic (where now :class:`NA` will not propagate if one of the operands is already ``False``): .. ipython:: python False & True False & False False & pd.NA .. ipython:: python True & True True & False True & pd.NA ``NA`` in a boolean context --------------------------- Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value. .. ipython:: python :okexcept: bool(pd.NA) This also means that :class:`NA` cannot be used in a context where it is evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can potentially be :class:`NA`. In such cases, :func:`isna` can be used to check for :class:`NA` or ``condition`` being :class:`NA` can be avoided, for example by filling missing values beforehand. A similar situation occurs when using :class:`Series` or :class:`DataFrame` objects in ``if`` statements, see :ref:`gotchas.truth`. NumPy ufuncs ------------ :attr:`pandas.NA` implements NumPy's ``__array_ufunc__`` protocol. Most ufuncs work with ``NA``, and generally return ``NA``: .. ipython:: python np.log(pd.NA) np.add(pd.NA, 1) .. warning:: Currently, ufuncs involving an ndarray and ``NA`` will return an object-dtype filled with NA values. .. ipython:: python a = np.array([1, 2, 3]) np.greater(a, pd.NA) The return type here may change to return a different array type in the future. See :ref:`dsintro.numpy_interop` for more on ufuncs. .. _missing_data.NA.conversion: Conversion ^^^^^^^^^^ If you have a :class:`DataFrame` or :class:`Series` using ``np.nan``, :meth:`DataFrame.convert_dtypes` and :meth:`Series.convert_dtypes`, respectively, will convert your data to use the nullable data types supporting :class:`NA`, such as :class:`Int64Dtype` or :class:`ArrowDtype`. This is especially helpful after reading in data sets from IO methods where data types were inferred. .. ipython:: python import io data = io.StringIO("a,b\n,True\n2,") df = pd.read_csv(data) df.dtypes df_conv = df.convert_dtypes() df_conv df_conv.dtypes .. _missing.inserting: Inserting missing data ~~~~~~~~~~~~~~~~~~~~~~ You can insert missing values by simply assigning to a :class:`Series` or :class:`DataFrame`. The missing value sentinel used will be chosen based on the dtype. .. ipython:: python ser = pd.Series([1., 2., 3.]) ser.loc[0] = None ser ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")]) ser.iloc[0] = np.nan ser ser = pd.Series([True, False], dtype="boolean[pyarrow]") ser.iloc[0] = None ser For ``object`` types, pandas will use the value given: .. ipython:: python s = pd.Series(["a", "b", "c"], dtype=object) s.loc[0] = None s.loc[1] = np.nan s .. _missing_data.calculations: Calculations with missing data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Missing values propagate through arithmetic operations between pandas objects. .. ipython:: python ser1 = pd.Series([np.nan, np.nan, 2, 3]) ser2 = pd.Series([np.nan, 1, np.nan, 4]) ser1 ser2 ser1 + ser2 The descriptive statistics and computational methods discussed in the :ref:`data structure overview ` (and listed :ref:`here ` and :ref:`here `) all account for missing data. When summing data, NA values or empty data will be treated as zero. .. ipython:: python pd.Series([np.nan]).sum() pd.Series([], dtype="float64").sum() When taking the product, NA values or empty data will be treated as 1. .. ipython:: python pd.Series([np.nan]).prod() pd.Series([], dtype="float64").prod() Cumulative methods like :meth:`~DataFrame.cumsum` and :meth:`~DataFrame.cumprod` ignore NA values by default, but preserve them in the resulting array. To override this behaviour and include NA values in the calculation, use ``skipna=False``. .. ipython:: python ser = pd.Series([1, np.nan, 3, np.nan]) ser ser.cumsum() ser.cumsum(skipna=False) .. _missing_data.dropna: Dropping missing data ~~~~~~~~~~~~~~~~~~~~~ :meth:`~DataFrame.dropna` drops rows or columns with missing data. .. ipython:: python df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]]) df df.dropna() df.dropna(axis=1) ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]") ser.dropna() Filling missing data ~~~~~~~~~~~~~~~~~~~~ .. _missing_data.fillna: Filling by value ---------------- :meth:`~DataFrame.fillna` replaces NA values with non-NA data. Replace NA with a scalar value .. ipython:: python data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")} df = pd.DataFrame(data) df df.fillna(0) When the data has object dtype, you can control what type of NA values are present. .. ipython:: python df = pd.DataFrame({"a": [pd.NA, np.nan, None]}, dtype=object) df df.fillna(None) df.fillna(np.nan) df.fillna(pd.NA) However when the dtype is not object, these will all be replaced with the proper NA value for the dtype. .. ipython:: python data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")} df = pd.DataFrame(data) df df.fillna(None) df.fillna(np.nan) df.fillna(pd.NA) Fill gaps forward or backward .. ipython:: python df.ffill() df.bfill() .. _missing_data.fillna.limit: Limit the number of NA values filled .. ipython:: python df.ffill(limit=1) NA values can be replaced with corresponding value from a :class:`Series` or :class:`DataFrame` where the index and column aligns between the original object and the filled object. .. ipython:: python dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC")) dff.iloc[3:5, 0] = np.nan dff.iloc[4:6, 1] = np.nan dff.iloc[5:8, 2] = np.nan dff dff.fillna(dff.mean()) .. note:: :meth:`DataFrame.where` can also be used to fill NA values. Same result as above. .. ipython:: python dff.where(pd.notna(dff), dff.mean(), axis="columns") .. _missing_data.interpolate: Interpolation ------------- :meth:`DataFrame.interpolate` and :meth:`Series.interpolate` fills NA values using various interpolation methods. .. ipython:: python df = pd.DataFrame( { "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8], "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4], } ) df df.interpolate() idx = pd.date_range("2020-01-01", periods=10, freq="D") data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64) ts = pd.Series(data, index=idx) ts.iloc[[1, 2, 5, 6, 9]] = np.nan ts @savefig series_before_interpolate.png ts.plot() .. ipython:: python ts.interpolate() @savefig series_interpolate.png ts.interpolate().plot() Interpolation relative to a :class:`Timestamp` in the :class:`DatetimeIndex` is available by setting ``method="time"`` .. ipython:: python ts2 = ts.iloc[[0, 1, 3, 7, 9]] ts2 ts2.interpolate() ts2.interpolate(method="time") For a floating-point index, use ``method='values'``: .. ipython:: python idx = [0.0, 1.0, 10.0] ser = pd.Series([0.0, np.nan, 10.0], idx) ser ser.interpolate() ser.interpolate(method="values") If you have scipy_ installed, you can pass the name of a 1-d interpolation routine to ``method``. as specified in the scipy interpolation documentation_ and reference guide_. The appropriate interpolation method will depend on the data type. .. tip:: If you are dealing with a time series that is growing at an increasing rate, use ``method='barycentric'``. If you have values approximating a cumulative distribution function, use ``method='pchip'``. To fill missing values with goal of smooth plotting use ``method='akima'``. .. ipython:: python df = pd.DataFrame( { "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8], "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4], } ) df df.interpolate(method="barycentric") df.interpolate(method="pchip") df.interpolate(method="akima") When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation: .. ipython:: python df.interpolate(method="spline", order=2) df.interpolate(method="polynomial", order=2) Comparing several methods. .. ipython:: python np.random.seed(2) ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37)) missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29]) ser.iloc[missing] = np.nan methods = ["linear", "quadratic", "cubic"] df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods}) @savefig compare_interpolations.png df.plot() Interpolating new observations from expanding data with :meth:`Series.reindex`. .. ipython:: python ser = pd.Series(np.sort(np.random.uniform(size=100))) # interpolate at new_index new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75])) interp_s = ser.reindex(new_index).interpolate(method="pchip") interp_s.loc[49:51] .. _scipy: https://scipy.org/ .. _documentation: https://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation .. _guide: https://docs.scipy.org/doc/scipy/tutorial/interpolate.html .. _missing_data.interp_limits: Interpolation limits ^^^^^^^^^^^^^^^^^^^^ :meth:`~DataFrame.interpolate` accepts a ``limit`` keyword argument to limit the number of consecutive ``NaN`` values filled since the last valid observation .. ipython:: python ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan]) ser ser.interpolate() ser.interpolate(limit=1) By default, ``NaN`` values are filled in a ``forward`` direction. Use ``limit_direction`` parameter to fill ``backward`` or from ``both`` directions. .. ipython:: python ser.interpolate(limit=1, limit_direction="backward") ser.interpolate(limit=1, limit_direction="both") ser.interpolate(limit_direction="both") By default, ``NaN`` values are filled whether they are surrounded by existing valid values or outside existing valid values. The ``limit_area`` parameter restricts filling to either inside or outside values. .. ipython:: python # fill one consecutive inside value in both directions ser.interpolate(limit_direction="both", limit_area="inside", limit=1) # fill all consecutive outside values backward ser.interpolate(limit_direction="backward", limit_area="outside") # fill all consecutive outside values in both directions ser.interpolate(limit_direction="both", limit_area="outside") .. _missing_data.replace: Replacing values ---------------- :meth:`Series.replace` and :meth:`DataFrame.replace` can be used similar to :meth:`Series.fillna` and :meth:`DataFrame.fillna` to replace or insert missing values. .. ipython:: python df = pd.DataFrame(np.eye(3)) df df_missing = df.replace(0, np.nan) df_missing df_filled = df_missing.replace(np.nan, 2) df_filled Replacing more than one value is possible by passing a list. .. ipython:: python df_filled.replace([1, 44], [2, 28]) Replacing using a mapping dict. .. ipython:: python df_filled.replace({1: 44, 2: 28}) .. _missing_data.replace_expression: Regular expression replacement ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. note:: Python strings prefixed with the ``r`` character such as ``r'hello world'`` are `"raw" strings `_. They have different semantics regarding backslashes than strings without this prefix. Backslashes in raw strings will be interpreted as an escaped backslash, e.g., ``r'\' == '\\'``. Replace the '.' with ``NaN`` .. ipython:: python d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]} df = pd.DataFrame(d) df.replace(".", np.nan) Replace the '.' with ``NaN`` with regular expression that removes surrounding whitespace .. ipython:: python df.replace(r"\s*\.\s*", np.nan, regex=True) Replace with a list of regexes. .. ipython:: python df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True) Replace with a regex in a mapping dict. .. ipython:: python df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True) Pass nested dictionaries of regular expressions that use the ``regex`` keyword. .. ipython:: python df.replace({"b": {"b": r""}}, regex=True) df.replace(regex={"b": {r"\s*\.\s*": np.nan}}) df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True) Pass a list of regular expressions that will replace matches with a scalar. .. ipython:: python df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True) All of the regular expression examples can also be passed with the ``to_replace`` argument as the ``regex`` argument. In this case the ``value`` argument must be passed explicitly by name or ``regex`` must be a nested dictionary. .. ipython:: python df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder") .. note:: A regular expression object from ``re.compile`` is a valid input as well.