Compare two DataFrames and output their differences side-by-side


162

I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

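My goal is to output an HTML table that:

  1. Identifies rows that have changed (could be int, float, boolean, string)
  2. Outputs rows with the same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between the two dataframes: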

    "StudentRoster Difference Jan-1 - Jan-2":  
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now   "On   vacation"

I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?


In pandas 1.1, you can easily do this with a single function call, df.compare.
cs95

Answers:


153

The first part is similar to Constantine's answer: you can get a boolean Series showing which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first entry is the index and the second is the column that has changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

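*Note: it is important that df1 and df2 share the same index here. To overcome this ambiguity, you can make sure you only look at the shared labels using df1.index & df2.index, but I think I'll leave that as an exercise.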
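One possible way to do the exercise from the note is sketched below. It is an illustration, not part of the original answer: the two small frames are made up, and it uses Index.intersection instead of the & operator, which in recent pandas versions no longer acts as a set intersection on indexes.

```python
import pandas as pd

# Hypothetical frames: indexes overlap only partly and are ordered differently.
df1 = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
df2 = pd.DataFrame({"score": [1.21, 2.17, 3.50]}, index=[112, 111, 114])

# Labels present in both frames, sorted so both sides use the same order.
shared = df1.index.intersection(df2.index).sort_values()

# Restrict both frames to the shared labels; they are now identically labeled,
# so the element-wise comparison is well defined.
left = df1.loc[shared]
right = df2.loc[shared]

ne = (left != right).any(axis=1)
changed_ids = ne[ne].index.tolist()
print(changed_ids)
```

Rows that exist on only one side (113 and 114 here) are dropped by this step and would have to be reported separately.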


2
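I believe "share the same index" means "make sure the index is sorted"... this will compare whatever is first in df1 with whatever is first in df2, regardless of the value of the index. JFYI, in case I'm not the only person for whom this wasn't obvious. ;D Thanks!
dmn 2015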

12
If score is nan in both df1 and df2, this function will report it as having changed from nan to nan. This is because np.nan != np.nan returns True.
James Owers

2
@kungfujam is right. Furthermore, if the values being compared are None, you will get false differences there as well.
FistOfFury

Just to be clear: I illustrate the issue with this solution and provide an easy-to-use function that fixes the problem below.
James Owers

1
['row', 'col'] is preferable to ['id', 'col'] as changed.index.names, because these are not ids but row labels.
Naoki Fujita

87

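Highlighting the difference between two DataFrames

It is possible to use the DataFrame style property to highlight the background color of the cells where there is a difference.

Using the example data from the original question

The first step is to concatenate the DataFrames horizontally with the concat function and distinguish each frame with the keys parameter: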

df_all = pd.concat([df.set_index('id'), df2.set_index('id')], 
                   axis='columns', keys=['First', 'Second'])
df_all

It's probably easier to swap the column levels and put the same column names beside each other:

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

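Now it is much easier to spot the differences between the frames. But we can go further and use the style property to highlight the cells that are different. We define a custom function to do this, similar to the examples in the styling section of the pandas documentation.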

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

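This will also highlight cells where both frames have missing values. You can either fill them or add extra logic so that they don't get highlighted.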
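The "extra logic" could look something like the sketch below. The function name highlight_diff_skip_nan and the shortened sample data are made up for illustration; the function assumes the same column layout as df_final above (column name on the outer level, 'First'/'Second' on the inner level) and only colors the 'Second' cell of a differing pair.

```python
import numpy as np
import pandas as pd


def highlight_diff_skip_nan(data, color="yellow"):
    """Return CSS strings marking 'Second' cells that differ from 'First',
    treating cells that are missing in both frames as equal."""
    attr = "background-color: {}".format(color)
    first = data.xs("First", axis="columns", level=-1)
    second = data.xs("Second", axis="columns", level=-1)
    # A cell counts as different unless it is missing on both sides.
    differs = first.ne(second) & ~(first.isna() & second.isna())
    styles = pd.DataFrame("", index=data.index, columns=data.columns)
    for name in differs.columns:
        styles.loc[differs[name], (name, "Second")] = attr
    return styles


# Shortened, hypothetical version of the roster data with a NaN score for Zoe.
df = pd.DataFrame({"id": [111, 112, 113],
                   "score": [2.17, 1.11, np.nan],
                   "Comment": ["late", "Graduated", None]})
df2 = pd.DataFrame({"id": [111, 112, 113],
                    "score": [2.17, 1.21, np.nan],
                    "Comment": ["late", "Graduated", "On vacation"]})

df_all = pd.concat([df.set_index("id"), df2.set_index("id")],
                   axis="columns", keys=["First", "Second"])
df_final = df_all.swaplevel(axis="columns")[df.columns[1:]]

styles = highlight_diff_skip_nan(df_final)
print(styles)
# In a notebook: df_final.style.apply(highlight_diff_skip_nan, axis=None)
```

Returning a plain DataFrame of CSS strings keeps the function easy to check outside a notebook.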


1
Do you know how to color both "First" and "Second" at the same time, in different colors?
aturegano

1
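Is it possible to select only the rows that differ? In this case, how do I select the second and third rows without selecting the first row (111)?
shantanuo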

1
@shantanuo, yes, just edit the final call to df_final[(df != df2).any(axis=1)].style.apply(highlight_diff, axis=None)
anmol,

3
This implementation takes much longer when comparing dataframes with 26K rows and 400 columns. Is there any way to speed it up?
codelord

42

This answer simply extends @Andy Hayden's, making it resilient when numeric fields are nan, and wraps it up in a function.

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        # Data types are different, trying to convert
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
diff_pd(df1, df2)

Output:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation

I added code to handle minor differences in data types, which would raise an error if you didn't account for them.
Roobie Nuby

What if I don't have identical rows on both sides to compare?
Kishor kumar R

@KishorkumarR, then you should even out the rows first, by detecting rows added to the new dataframe and rows removed from the old dataframe.
Sabre
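One way to detect added and removed rows before comparing, sketched under the assumption that both frames carry a unique id column (the data below is made up): an outer merge with indicator=True labels each row as present in the old frame only, the new frame only, or both.

```python
import pandas as pd

# Hypothetical old and new rosters: 112 was removed, 114 was added.
old = pd.DataFrame({"id": [111, 112, 113], "Name": ["Jack", "Nick", "Zoe"]})
new = pd.DataFrame({"id": [111, 113, 114], "Name": ["Jack", "Zoe", "Amy"]})

# indicator=True adds a "_merge" column with left_only / right_only / both.
merged = old.merge(new, on="id", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

removed_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
added_ids = merged.loc[merged["_merge"] == "right_only", "id"].tolist()
common_ids = sorted(merged.loc[merged["_merge"] == "both", "id"].tolist())

# Only the common rows can be compared cell by cell, e.g. with diff_pd.
old_common = old.set_index("id").loc[common_ids]
new_common = new.set_index("id").loc[common_ids]
print(removed_ids, added_ids, common_ids)
```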

22
import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    # x holds the two values (old, new) of one column for one (id, Name) group
    return x.iloc[0] if x.iloc[0] == x.iloc[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

prints

                score    isEnrolled               Comment
id  Name                                                 
111 Jack         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

3
Very nice solution, much more compact than mine!
Andy Hayden 2013

1
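@AndyHayden: I'm not entirely comfortable with this solution; it seems to work only when the index is a multilevel index. If I try to use only id as the index, then df.groupby(level='id') raises an error, and I'm not sure why...
unutbu 2013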
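For a plain single-level id index, one alternative that avoids groupby altogether is sketched below. It is not unutbu's method: it builds the "was ... | now ..." strings from the question with element-wise operations, assumes both frames share the same index and columns, and treats cells missing on both sides as unchanged.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", ""],
}).set_index("id")
df2 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
}).set_index("id")

# True where a cell changed; cells missing in both frames count as unchanged.
changed = df1.ne(df2) & ~(df1.isna() & df2.isna())

# Keep the old value where nothing changed, otherwise show "was X | now Y".
was_now = "was " + df1.astype(str) + " | now " + df2.astype(str)
report = df1.astype(str).where(~changed, was_now)

# Keep only rows with at least one change and render them as an HTML table.
report = report[changed.any(axis=1)]
html = report.to_html()
print(report)
```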

19

I ran into this issue myself, but found an answer before finding this post:

Based on unutbu's answer, load your data...

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                       Date
111  Jack                            True              2013-05-01 12:00:00
112  Nick   1.11                     False             2013-05-12 15:05:23
     Zoe    4.12                     True                                  ''',

         '''\
id   Name   score                    isEnrolled                       Date
111  Jack   2.17                     True              2013-05-01 12:00:00
112  Nick   1.21                     False                                
     Zoe    4.12                     False             2013-05-01 12:00:00''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

...define your diff function...

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

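Then you can simply use a Panel to conclude (this requires Python 2 and an old pandas release; Panel has since been removed from pandas):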

my_panel = pd.Panel(dict(df1=df1,df2=df2))
print my_panel.apply(report_diff, axis=0)

#          id  Name        score    isEnrolled                       Date
#0        111  Jack   nan | 2.17          True        2013-05-01 12:00:00
#1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
#2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00

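By the way, if you're in IPython Notebook, you may like to use a colored diff function that assigns colors depending on whether cells are different, equal, or null on the left or right: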

from IPython.display import HTML
pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
#                          will limit your HTML strings to 50 characters

def report_diff(x):
    if x[0]==x[1]:
        return unicode(x[0].__str__())
    elif pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
    elif pd.isnull(x[0]) and not pd.isnull(x[1]):
        return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
    elif not pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
    else:
        return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])

HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))

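(In regular Python, not IPython Notebook) is it possible to include my_panel = pd.Panel(dict(df1=df1,df2=df2)) inside the function report_diff()? I mean, is it possible to do this: print report_diff(df1,df2) and get the same output as your print statement?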
edesz

pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0) - this is awesome!!!
MaxU

5
Panel is deprecated! Any idea how to port this?
denfromufa

@denfromufa I took a swing at updating it in my answer: stackoverflow.com/a/49038417/7607701
Aaron N. Brock

9

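If your two dataframes have the same ids in them, then finding out what changed is actually pretty easy. Just doing frame1 != frame2 will give you a boolean DataFrame where each True is data that has changed. From that, you can easily get the index of each changed row by doing changedids = frame1.index[np.any(frame1 != frame2,axis=1)].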
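A small self-contained illustration of that one-liner, with made-up data; frame1 and frame2 are assumed to share exactly the same index and columns, and the row mask is converted to a NumPy array before indexing the Index:

```python
import pandas as pd

idx = pd.Index([111, 112, 113], name="id")
frame1 = pd.DataFrame({"score": [2.17, 1.11, 4.12],
                       "isEnrolled": [True, False, True]}, index=idx)
frame2 = pd.DataFrame({"score": [2.17, 1.21, 4.12],
                       "isEnrolled": [True, False, False]}, index=idx)

# Boolean DataFrame: True wherever a cell differs between the two frames.
changed_mask = frame1 != frame2

# Index labels of rows with at least one changed cell.
changedids = frame1.index[changed_mask.any(axis=1).to_numpy()]
print(changedids.tolist())
```

As noted in the comments on the first answer, NaN != NaN evaluates to True, so rows that are NaN on both sides would also be reported by this mask.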


6

Another approach using concat and drop_duplicates:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

Output:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation

3

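After fiddling around with @journois's answer, I was able to get it to work using MultiIndex instead of Panel, due to Panel's deprecation.

First, create some dummy data: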

df1 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '555'],
    'let': ['a', 'b', 'c', 'd', 'e'],
    'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '666'],
    'let': ['a', 'b', 'c', 'D', 'f'],
    'num': ['1', '2', 'Three', '4', '6'],
})

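Then define your diff function; in this case I'll use the report_diff from his answer, unchanged apart from explicit positional access:

def report_diff(x):
    return x.iloc[0] if x.iloc[0] == x.iloc[1] else '{} | {}'.format(*x)

Then I'm going to concatenate the data into a MultiIndex dataframe: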

df_all = pd.concat(
    [df1.set_index('id'), df2.set_index('id')], 
    axis='columns', 
    keys=['df1', 'df2'],
    join='outer'
)
df_all = df_all.swaplevel(axis='columns')[df1.columns[1:]]

Finally, I'm going to apply report_diff down each column group:

df_all.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

Output:

         let        num
111        a          1
222        b          2
333        c  3 | Three
444    d | D          4
555  e | nan    5 | nan
666  nan | f    nan | 6

And that is all!


3

Extending @cge's answer, which is pretty cool, to make the result more readable:

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

Full demonstration example:

import numpy as np, pandas as pd

a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

1

Here is another way, using select and merge:

In [6]: # first lets create some dummy dataframes with some column(s) different
   ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
   ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})


In [7]: df1
Out[7]:
   a   b   c
0 -5  10  20
1 -4  11  21
2 -3  12  22
3 -2  13  23
4 -1  14  24


In [8]: df2
Out[8]:
   a   b    c
0 -5  10   20
1 -4  11  101
2 -3  12  102
3 -2  13  103
4 -1  14  104


In [10]: # make condition over the columns you want to compare
    ...: condition = df1['c'] != df2['c']
    ...:
    ...: # select rows from each dataframe where the condition holds
    ...: diff1 = df1[condition]
    ...: diff2 = df2[condition]


In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
    ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
   a   b  c_before  c_after
0 -4  11        21      101
1 -3  12        22      102
2 -2  13        23      103
3 -1  14        24      104



0

pandas >= 1.1: DataFrame.compare

With pandas 1.1, you can essentially replicate Ted Petrou's output with a single function call. Example taken from the docs:

pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'

df1.compare(df2)

  score       isEnrolled       Comment             
   self other       self other    self        other
1  1.11  1.21        NaN   NaN     NaN          NaN
2   NaN   NaN        1.0   0.0     NaN  On vacation

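Here, "self" refers to the LHS DataFrame, while "other" is the RHS DataFrame. By default, equal values are replaced with NaNs so you can focus on just the diffs. If you want to show values that are equal as well, use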

df1.compare(df2, keep_equal=True, keep_shape=True) 

  score       isEnrolled           Comment             
   self other       self  other       self        other
1  1.11  1.21      False  False  Graduated    Graduated
2  4.12  4.12       True  False        NaN  On vacation

You can also change the axis of comparison using align_axis:

df1.compare(df2, align_axis='index')

         score  isEnrolled      Comment
1 self    1.11         NaN          NaN
  other   1.21         NaN          NaN
2 self     NaN         1.0          NaN
  other    NaN         0.0  On vacation

This compares values row-wise, instead of column-wise.
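Since the snippets above do not show how df1 and df2 were built, here is a minimal runnable sketch with made-up data resembling the roster. It assumes pandas 1.1 or later and that both frames have identical labels, which DataFrame.compare requires.

```python
import pandas as pd

# Hypothetical data resembling the roster; both frames share index and columns.
df1 = pd.DataFrame({"score": [2.17, 1.11, 4.12],
                    "isEnrolled": [True, False, True]})
df2 = pd.DataFrame({"score": [2.17, 1.21, 4.12],
                    "isEnrolled": [True, False, False]})

# Only rows and columns with at least one difference are kept;
# cells that are equal in both frames show up as NaN.
diff = df1.compare(df2)
print(diff)

# Same information, with self/other stacked on the row axis instead.
diff_rows = df1.compare(df2, align_axis="index")
print(diff_rows)
```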


Note: at the time of writing, pandas 1.1 was still experimental and only available by building a development sandbox.
cs95

-1

A function that finds the asymmetric difference between two data frames is implemented below (based on set difference for pandas). GIST: https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03

def diff_df(df1, df2, how="left"):
    """
      Find Difference of rows for given two dataframes
      this function is not symmetric, means
            diff(x, y) != diff(y, x)
      however
            diff(x, y, how='left') == diff(y, x, how='right')

      Ref: /programming/18180763/set-difference-for-pandas/40209800#40209800
    """
    if (df1.columns != df2.columns).any():
        raise ValueError("Two dataframe columns must match")

    if df1.equals(df2):
        return None
    elif how == 'right':
        return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
    elif how == 'left':
        return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    else:
        raise ValueError('how parameter supports only "left" or "right" keywords')

Example:

df1 = pd.DataFrame(d1)
Out[1]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1             Graduated  Nick       False   1.11
2                         Zoe        True   4.12


df2 = pd.DataFrame(d2)

Out[2]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1           On vacation   Zoe        True   4.12

diff_df(df1, df2)
Out[3]: 
     Comment  Name  isEnrolled  score
1  Graduated  Nick       False   1.11
2              Zoe        True   4.12

diff_df(df2, df1)
Out[4]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

-1

import pandas as pd
import numpy as np

df = pd.read_excel(r'D:\HARISH\DATA SCIENCE\1 MY Training\SAMPLE DATA&PROJS\CRICKET DATA\IPL PLAYER LIST\IPL PLAYER LIST _ harish.xlsx')

df1 = srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]

srh = srh.iloc[:, 0:2]
csk = csk.iloc[:, 0:2]

csk = csk.reset_index(drop=True)
csk

srh = srh.reset_index(drop=True)
srh

new = pd.concat([srh, csk], axis=1)

new.head()

               PLAYER     TYPE           PLAYER         TYPE
0        David Warner  Batsman         MS Dhoni      Captain
1  Bhuvaneshwar Kumar   Bowler  Ravindra Jadeja  All-Rounder
2       Manish Pandey  Batsman     Suresh Raina  All-Rounder
3   Rashid Khan Arman   Bowler     Kedar Jadhav  All-Rounder
4      Shikhar Dhawan  Batsman     Dwayne Bravo  All-Rounder
Harish TRASH

Hello Harish, please format your answer, otherwise it is quite hard to read :)
Markus
Licensed under cc by-sa 3.0 with attribution required.