Function to replace NaN values in a dataframe with the mean of the corresponding column

Question


Anonymous · June 28, 2023


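Edit: This question is not a duplicate of "pandas dataframe replace nan values with average of columns", because I want to replace the values in each column with that column's own mean, not with the mean of the whole dataframe.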

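Problem

I have a pandas dataframe (train) with a hundred columns, and I need to apply machine learning techniques to it.

Normally I would do the feature engineering by hand, but in this case there are too many columns to handle that way.

I want to write a Python function that does the following:

1) Find the NaN values in each column (I was thinking of using df.isnull().any())

2) Replace each NaN value with the mean of the column it was found in.

This is what I came up with: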

def replace(value):
    for value in train:
        if train['value'].isnull():
           train['value'] = train['value'].fillna(train['value'].mean())
train = train.apply(replace,axis=1)

But I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3063             try:
-> 3064                 return self._engine.get_loc(key)
   3065             except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'value'
During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
 in ()
----> 1 train = train.apply(replace,axis=1)
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6012                          args=args,
   6013                          kwds=kwds)
-> 6014         return op.get_result()
   6015 
   6016     def applymap(self, func):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    140             return self.apply_raw()
    141 
--> 142         return self.apply_standard()
    143 
    144     def apply_empty_result(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    246 
    247         # compute the result using the series generator
--> 248         self.apply_series_generator()
    249 
    250         # wrap results
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    275             try:
    276                 for i, v in enumerate(series_gen):
--> 277                     results[i] = self.f(v)
    278                     keys.append(v.name)
    279             except Exception as e:
 in replace(value)
      1 def replace(value):
      2     for value in train:
----> 3         if train['value'].isnull():
      4            train['value'] = train['value'].fillna(df['value'].mean())
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2686             return self._getitem_multilevel(key)
   2687         else:
-> 2688             return self._getitem_column(key)
   2689 
   2690     def _getitem_column(self, key):
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696 
   2697         # duplicate columns & possible reduce dimensionality
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   2484         res = cache.get(item)
   2485         if res is None:
-> 2486             values = self._data.get(item)
   2487             res = self._box_item_values(item, values)
   2488             cache[item] = res
/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3064                 return self._engine.get_loc(key)
   3065             except KeyError:
-> 3066                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3067 
   3068         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('value', 'occurred at index 0')

While looking for a solution, I found:

This one, but it works on txt files (not on a pandas dataframe).
This question about the df.isnull().any() method.


3 Answers

Anonymous · Answer 1 · 2023-09-18T12:38:03+00:00

Missing values (NaN) come up often when processing data. They can distort the results of analysis and modeling, so they need to be handled somehow. This answer describes one way to do that: replacing each NaN in a DataFrame with the mean of its own column.

The approach uses apply with a lambda to process the DataFrame column by column. x.mean() computes the mean of a column, x.fillna() replaces that column's NaN values with it, and df.apply() runs this on every column of the DataFrame.

The code is:

df.apply(lambda x: x.fillna(x.mean()))

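This code goes through each column of the DataFrame and replaces its NaN values using x.fillna(x.mean()). Here x is the Series for the current column, x.mean() is that column's mean (NaN values are skipped when it is computed), and x.fillna() fills the NaN values with it. apply returns a new DataFrame, so assign the result back (for example df = df.apply(...)) to keep it.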
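To make the behavior concrete, here is a minimal sketch on a small made-up frame. The column names "a" and "b" and their values are illustrative assumptions, not data from the question, and the sketch assumes every column is numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 6.0],
})

# With the default axis=0, apply() passes each column to the lambda as a Series.
filled = df.apply(lambda x: x.fillna(x.mean()))

# The NaN in "a" becomes 2.0 (mean of 1.0 and 3.0);
# the NaN in "b" becomes 5.0 (mean of 4.0 and 6.0).
print(filled)
```

Note that df itself is left unchanged here; only filled holds the imputed values.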

This is a convenient way to replace every NaN in a DataFrame with the mean of its own column, so that later analysis and modeling steps work on a table with no missing values.

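Anonymous · Answer 2 · 2023-07-06T20:56:52+00:00

The code above replaces the NaN values in a dataframe with the mean of the corresponding column. An alternative is to compute every column's mean at once with df.mean(axis=0) and pass the result to fillna, which then replaces each NaN with the mean of its column. According to this answer, that approach is about twice as fast as the apply-based one.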
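The answer does not include the code itself, so the following is only a sketch of what it describes, again on a made-up all-numeric frame (the column names and values are assumptions). It also checks that the result matches the apply-based version. It does not measure speed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 6.0],
})

# df.mean(axis=0) returns a Series of column means, indexed by column name.
col_means = df.mean(axis=0)

# fillna() accepts that Series and matches each mean to its column by label.
filled = df.fillna(col_means)

# On this frame the apply-based version gives the same result.
via_apply = df.apply(lambda x: x.fillna(x.mean()))
same = filled.equals(via_apply)
print(same)
```

If the frame also contains non-numeric columns, df.mean(numeric_only=True) restricts the means to the numeric ones, and fillna then leaves the other columns untouched.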

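Anonymous · Answer 3 · 2023-07-17T15:33:18+00:00

The error occurs because the code uses the wrong column name, so the NaN values cannot be replaced. The fix is to use the correct column name.

Here is code you can try: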

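# Assign each filled column back instead of calling fillna(inplace=True)
# on a column selection, which may not update df in recent pandas versions.
for col in df.columns:
    df[col] = df[col].fillna(df[col].mean())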

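Your approach is essentially right. The KeyError arises because the code should use

train[value]

instead of

train['value']

everywhere. The latter looks up a column literally named "value", whereas value is actually the loop variable that holds the current column name.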
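Removing the quotes fixes the KeyError, but the asker's function needs two more changes before it runs. `if train[value].isnull():` would still raise a ValueError, because the truth value of a Series is ambiguous, so the check needs `.any()`. The function should also be called once on the frame rather than passed to train.apply(..., axis=1), which calls it once per row. A minimal sketch with those changes follows; the name replace_nan_with_mean and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `train` frame.
train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [2.0, 4.0, 6.0],
})

def replace_nan_with_mean(frame):
    """Return a copy of `frame` with each column's NaNs set to that column's mean."""
    result = frame.copy()
    for col in result:  # iterating over a DataFrame yields its column labels
        # .any() reduces the boolean Series to a single True/False
        if result[col].isnull().any():
            result[col] = result[col].fillna(result[col].mean())
    return result

train = replace_nan_with_mean(train)
print(train)
```

The isnull().any() check is optional, since fillna leaves a column without NaN values unchanged. It is kept here only to mirror the structure of the original attempt.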