Impute categorical missing values in scikit-learn


70

I have pandas data with some columns of text type. There are some NaN values in these text columns. What I'd like to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN by the most frequent value). The problem is in the implementation. Suppose there is a Pandas dataframe df with 30 columns, 10 of which are of categorical nature. Once I run:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df) 

Python generates an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome!


10
Imputer works on numbers, not strings. Convert to numbers, then impute, then convert back.
福雷
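A minimal sketch of that round trip on a single column, using pandas factorize to encode the strings and the modern SimpleImputer (the column values here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

s = pd.Series(['run1', 'run2', 'run1', np.nan])

# Encode strings as integer codes; factorize marks NaN as -1
codes, uniques = pd.factorize(s)
codes = codes.astype(float)
codes[codes == -1] = np.nan  # restore missingness for the imputer

# Impute with the most frequent code, then map codes back to strings
imp = SimpleImputer(strategy='most_frequent')
filled = imp.fit_transform(codes.reshape(-1, 1)).ravel().astype(int)
result = pd.Series(uniques[filled])
print(result.tolist())  # the NaN becomes 'run1', the most frequent value
```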

1
Is there any convenient way to automate this via scikit-learn?
night_bat '14

4
Why isn't a categorical variable allowed for the most_frequent strategy? Strange.
Ketan

4
You can now use from sklearn.impute import SimpleImputer and then imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
pentandrous
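A sketch of that modern replacement: SimpleImputer with strategy='most_frequent' accepts object (string) columns directly, though note the strategy then applies to every column, numeric ones included (the data here is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'cat': ['run1', 'run2', 'run1', np.nan],
                   'num': [1.0, np.nan, 3.0, 4.0]})

# most_frequent works on object dtype, unlike the old Imputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
out = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
print(out)  # 'cat' NaN is filled with 'run1'
```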

Answers:


98

To use mean for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints,

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667

2
Very nice. I'll use your snippet in xtoy :) If you have any further suggestions, I'd be happy to hear them.
PascalVKooten

1
Nice, but it won't work if any column has all NaN values. Those all-NaN columns should be dropped from the DF.
中代
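A one-line guard for that case, dropping columns with no observed values before imputing (sketch with illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', np.nan, 'x'],
                   'b': [np.nan, np.nan, np.nan]})

# Drop columns that are entirely NaN; they have no value to impute from
df = df.dropna(axis=1, how='all')
print(list(df.columns))  # only 'a' survives
```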

1
Great :) I'm going to use this, but change it a bit so that it uses mean for floats, median for ints, mode for strings
奥斯丁,
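A sketch of that variant of the fill rule (a hypothetical adaptation, not from the original answer): mode for object columns, median for integers, mean for floats. Caveat: a pandas column that already contains NaN is usually upcast to float, so the integer branch only fires for nullable integer dtypes or NaN-free columns.

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class TypedImputer(TransformerMixin):
    """Mode for object columns, median for integers, mean for floats."""
    def fit(self, X, y=None):
        def fill_value(col):
            if col.dtype == np.dtype('O'):
                return col.value_counts().index[0]   # mode
            if np.issubdtype(col.dtype, np.integer):
                return col.median()                  # median for ints
            return col.mean()                        # mean for floats
        self.fill = pd.Series({c: fill_value(X[c]) for c in X.columns})
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

X = pd.DataFrame({'s': ['a', 'b', 'b', np.nan],
                  'f': [1.0, 2.0, np.nan, 3.0]})
out = TypedImputer().fit_transform(X)
print(out)
```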

Using this in GridSearchCV raises an error because DataFrameImputer() does not have a get_params() attribute. The fix is to also inherit from sklearn.base.BaseEstimator.
Gautham Kumaran '11
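A sketch of that fix: adding BaseEstimator to the bases gives get_params()/set_params() for free, with the fill logic unchanged from the answer's class:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameImputer(BaseEstimator, TransformerMixin):
    """Same fill logic as above, now usable inside GridSearchCV."""
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
                               if X[c].dtype == np.dtype('O') else X[c].mean()
                               for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

print(DataFrameImputer().get_params())  # no longer raises
```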

4
@mamun The fit_transform method is provided by the TransformerMixin class.
sveitser '18

11

You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow), you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these sub-pipelines with sklearn.pipeline.FeatureUnion, for example:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas
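Putting the pieces together, a sketch of the two sub-pipelines on toy data. SimpleImputer from modern scikit-learn stands in both for the deprecated Imputer and for CategoricalImputer (whose availability depends on your sklearn-pandas version), since strategy='most_frequent' now handles strings:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

df = pd.DataFrame({'age': [25.0, np.nan, 30.0],
                   'city': ['tokyo', 'tokyo', np.nan]})

num_pipeline = Pipeline([('select', DataFrameSelector(['age'])),
                         ('impute', SimpleImputer(strategy='mean'))])
cat_pipeline = Pipeline([('select', DataFrameSelector(['city'])),
                         ('impute', SimpleImputer(strategy='most_frequent'))])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

out = full_pipeline.fit_transform(df)  # ndarray: mean-filled age, mode-filled city
print(out)
```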


8

There is a package sklearn-pandas which has an option for imputation of categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)

1
I second this answer; the official sklearn-pandas documentation on the pypi website mentions: "CategoricalImputer: since the scikit-learn Imputer transformer currently only works with numbers, sklearn-pandas provides an equivalent helper transformer that does work with strings, substituting null values with the most frequent value in that column." pypi.org/project/sklearn-pandas/1.5.0
Sumanth Lazarus,

It has been removed from the package.
沃尔特

4

Copying and modifying sveitser's answer, I made an imputer for pandas.Series objects

import numpy
import pandas 

from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.  

        """
    def fit(self, X, y=None):
        if   X.dtype == numpy.dtype('O'): self.fill = X.value_counts().index[0]
        else                            : self.fill = X.mean()
        return self

    def transform(self, X, y=None):
       return X.fillna(self.fill)

To use it, you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.NaN])


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series

4
  • strategy='most_frequent' can only be used with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative features. Also, with the scikit-learn imputer we can either use it for the whole data frame (if all features are quantitative), or use a 'for loop' over a list of features/columns of similar type (see the example below). A custom imputer, however, can be used with any combination.

        from sklearn.preprocessing import Imputer
        impute = Imputer(strategy='mean')
        for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
              xx[cols] = impute.fit_transform(xx[[cols]])
    
  • Custom imputer:

       from sklearn.preprocessing import Imputer
       from sklearn.base import TransformerMixin
       import numpy as np
    
       class CustomImputer(TransformerMixin):
             def __init__(self, cols=None, strategy='mean'):
                   self.cols = cols
                   self.strategy = strategy
    
             def transform(self, df):
                   X = df.copy()
                   impute = Imputer(strategy=self.strategy)
                   if self.cols is None:
                          self.cols = list(X.columns)
                   for col in self.cols:
                          if X[col].dtype == np.dtype('O'):
                                 X[col].fillna(X[col].value_counts().index[0], inplace=True)
                          else: X[col] = impute.fit_transform(X[[col]])
    
                   return X
    
             def fit(self, *_):
                   return self
    
  • Dataframe:

          X = pd.DataFrame({'city': ['tokyo', np.NaN, 'london', 'seattle',
                                     'san francisco', 'tokyo'],
              'boolean': ['yes', 'no', np.NaN, 'no', 'no', 'yes'],
              'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like',
                                 'somewhat like', 'dislike'],
              'quantitative_column': [1, 11, -.5, 10, np.NaN, 20]})
    
    
                city              boolean   ordinal_column  quantitative_column
            0   tokyo             yes       somewhat like   1.0
            1   NaN               no        like            11.0
            2   london            NaN       somewhat like   -0.5
            3   seattle           no        like            10.0
            4   san francisco     no        somewhat like   NaN
            5   tokyo             yes       dislike         20.0
    
  • 1) Can be used with a list of features of similar type.

     cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
     cci.fit_transform(X)
    
  • 2) Can be used with strategy = median

     sd = CustomImputer(['quantitative_column'], strategy = 'median')
     sd.fit_transform(X)
    
  • 3) Can be used with the whole data frame; it will use the default mean (or we can also change it to median). For qualitative features it uses strategy='most_frequent', and for quantitative ones mean/median.

     call = CustomImputer()
     call.fit_transform(X)   
    

2

Inspired by the answers here, and for want of a go-to Imputer for all use cases, I ended up writing this. It supports four strategies for imputation: mean, mode, median, fill, and works on both pd.DataFrame and pd.Series.

mean and median work only for numeric data; mode and fill work for both numeric and categorical data.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            if not all(X.dtypes == np.number):
                raise ValueError('dtypes mismatch: np.number dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

Usage

>> df   
    MasVnrArea  FireplaceQu
Id  
1   196.0   NaN
974 196.0   NaN
21  380.0   Gd
5   350.0   TA
651 NaN     Gd


>> CustomImputer(strategy='mode').fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   Gd
974 196.0   Gd
21  380.0   Gd
5   350.0   TA
651 196.0   Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   NA
974 196.0   NA
21  380.0   Gd
5   350.0   TA
651 0.0     Gd 

1

This code fills a series with the most frequent category:

import pandas as pd
import numpy as np

# create fake data 
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan

print('m = ')
print(m)

#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

new_m = m.map(replace_most_common) #apply function to original data

print('new_m = ')
print(new_m)

Output:

m =
0      a
1    NaN
2      c
3      a
dtype: object

new_m =
0    a
1    a
2    c
3    a
dtype: object

0

Along similar lines: modify Imputer to handle strategy='most_frequent':

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

Here pandas.DataFrame.mode() finds the most frequent value for each column, and then pandas.DataFrame.fillna() fills the missing values with those values. Other strategy values are still handled the same way by Imputer.


0

You could try the following:

replace = df['<yourcolumn>'].value_counts().idxmax()

df['<yourcolumn>'].fillna(replace, inplace=True) 
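A quick worked version of those two lines on an illustrative column (using plain assignment rather than inplace=True, which sidesteps pandas chained-assignment warnings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red']})

# Most frequent value of the column, then fill the NaNs with it
replace = df['color'].value_counts().idxmax()
df['color'] = df['color'].fillna(replace)
print(df['color'].tolist())  # the NaN becomes 'red'
```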


Licensed under cc by-sa 3.0 with attribution required.