Sunday, August 2, 2020

How To Compare Two Dataframes with Pandas compare?

Sometimes you may have two similar dataframes and would like to know exactly where they differ. Starting with version 1.1.0, Pandas has a new function, compare(), that lets you compare two data frames or Series, identify the differences between them, and tabulate them nicely.

In this post, let us see a simple example of the Pandas compare() function on two similar data frames and summarize the differences.
Let us load Pandas and NumPy.

import pandas as pd
import numpy as np

Let us check the Pandas version and make sure we have version 1.1.0 or above. Otherwise, install the latest version of Pandas using conda or pip.

pd.__version__
'1.1.0'
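If you would rather check this in code than by eye, here is a minimal sketch. The variable name has_compare is just for illustration, and the parsing assumes the version string starts with "major.minor", which is the usual form for Pandas releases.

```python
import pandas as pd

# Parse only the leading "major.minor" part of the version string,
# so a suffix in the patch segment (for example "0rc1") does not matter.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
has_compare = (major, minor) >= (1, 1)

# An alternative check that does not rely on version parsing at all:
# look for the method itself on the DataFrame class.
method_present = hasattr(pd.DataFrame, "compare")

print(f"pandas {pd.__version__}: version ok = {has_compare}, method present = {method_present}")
```

The hasattr check is the more direct of the two, since it tests for the feature rather than the version number.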

We will first create a toy dataframe with three columns and four rows.

# create dataframe
df1 = pd.DataFrame(
    {
       "col1": ["a", "v", "x", "y"],
       "col2": [1.0, 2.0, 3.0, np.nan],
       "col3": [7.0, 8.0, 9.0, 3.0]
     },
    columns=["col1", "col2", "col3"],
)

And this is what our data looks like.

df1

  col1  col2  col3
0    a   1.0   7.0
1    v   2.0   8.0
2    x   3.0   9.0
3    y   NaN   3.0

Let us create a second dataframe by copying the first one and making two changes. In this example, we change the value in the first row of the first column and the value in the second row of the second column.

# create a copy of dataframe
df2 = df1.copy()
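# change the values of a few elements
df2.loc[0, 'col1'] = 'zz'
# note: '10' is a string, so col2 no longer holds only floats
# (this is why col2 prints as 1 and 3 rather than 1.0 and 3.0 in df2)
df2.loc[1, 'col2'] = '10'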

Compared to the first data frame, it differs in just two places, both within the first two rows.

df2

  col1 col2  col3
0   zz    1   7.0
1    v   10   8.0
2    x    3   9.0
3    y  NaN   3.0
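Before reaching for compare(), we can confirm the "two places" claim with a plain elementwise check. This is only a sketch: the names mask, n_diff and diff_cells are illustrative, and it assigns the number 10.0 rather than the string '10' (an assumption made here so that col2 stays a float column).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "col1": ["a", "v", "x", "y"],
        "col2": [1.0, 2.0, 3.0, np.nan],
        "col3": [7.0, 8.0, 9.0, 3.0],
    }
)
df2 = df1.copy()
df2.loc[0, "col1"] = "zz"
df2.loc[1, "col2"] = 10.0  # assumption: numeric 10.0 instead of the string '10'

# NaN != NaN evaluates to True, so positions where both frames
# hold NaN must be excluded explicitly.
mask = (df1 != df2) & ~(df1.isna() & df2.isna())

# Total number of differing cells
n_diff = int(mask.sum().sum())

# (row label, column name) of every differing cell
diff_cells = [
    (row, col)
    for col in mask.columns
    for row in mask.index[mask[col].to_numpy()]
]

print(n_diff)
print(diff_cells)
```

This tells us where the frames differ, but not what the differing values are, which is the part compare() tabulates for us.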

Let us use the Pandas compare() function to summarize the differences between the two dataframes. compare() outputs the columns and elements that differ between the dataframes. The "self" label in the result corresponds to the dataframe on which we call compare() (df1 here), and the "other" label corresponds to the dataframe we pass to it (df2 here).

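By default, if two corresponding values are equal, they are shown as NaN. Two NaN values in the same position are also treated as equal. Furthermore, if all values in an entire row or column are equal, that row or column is omitted from the result, which is why col3 and rows 2 and 3 do not appear in the output. The remaining differences are aligned on columns.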

df1.compare(df2)

  col1       col2
  self other self other
0    a    zz  NaN   NaN
1  NaN   NaN  2.0    10
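Both defaults described above can be switched off with the keep_shape and keep_equal arguments of compare(). Here is a small sketch of what they do. The variable names full and with_values are illustrative, and the example assigns the number 10.0 rather than the string '10' (an assumption made here to keep col2 a float column).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "col1": ["a", "v", "x", "y"],
        "col2": [1.0, 2.0, 3.0, np.nan],
        "col3": [7.0, 8.0, 9.0, 3.0],
    }
)
df2 = df1.copy()
df2.loc[0, "col1"] = "zz"
df2.loc[1, "col2"] = 10.0  # assumption: numeric 10.0 instead of the string '10'

# keep_shape=True keeps every row and column,
# even the ones that contain no differences.
full = df1.compare(df2, keep_shape=True)

# keep_equal=True shows the actual values instead of NaN
# in the positions where the two frames agree.
with_values = df1.compare(df2, keep_equal=True)

print(full)
print(with_values)
```

With keep_shape=True the result has all four rows and a self/other pair for each of the three columns. With keep_equal=True the result is still trimmed to the rows and columns that contain a difference, but the equal cells show their real values.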

We can also change how compare() displays the result. With align_axis='rows' (or align_axis=0), compare() stacks self and other in separate rows instead of placing them side by side in columns.

df1.compare(df2, align_axis='rows')

        col1 col2
0 self     a  NaN
  other   zz  NaN
1 self   NaN    2
  other  NaN   10
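As mentioned at the start, compare() is also available on Series. Here is a minimal sketch with two made-up Series (s1, s2 and the result names are illustrative) showing both the default layout and align_axis=0.

```python
import pandas as pd

s1 = pd.Series(["a", "b", "c", "d"])
s2 = pd.Series(["a", "x", "c", "y"])

# Default layout: a DataFrame with one row per differing position
# and "self" / "other" as its two columns.
diff = s1.compare(s2)

# align_axis=0 stacks self and other on the index instead,
# giving a Series with a two-level index.
stacked = s1.compare(s2, align_axis=0)

print(diff)
print(stacked)
```

Only positions 1 and 3 differ, so only those two show up in either result.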

The post How To Compare Two Dataframes with Pandas compare? appeared first on Python and R Tips.



from Python and R Tips https://ift.tt/2BOIxHI
via Gabe's Musings