ML | Handling Missing Values

This article is meant to help you get comfortable with the algorithms, concepts, math, and coding used in ML.


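Libraries play a very important role when writing ML code in Python. We will study them in detail later, but first here is a very brief description of the most important ones:

NumPy (Numerical Python): one of the most widely used Python libraries for scientific and mathematical computing. Platforms such as Keras and TensorFlow embed NumPy operations on tensors. What matters to us is that it is powerful and makes arrays easy to handle and operate on.
pandas: this package is very useful when working with data. It makes manipulating, aggregating, and visualizing data much easier.
Matplotlib: this library helps produce powerful yet very simple visualizations.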
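
To see why missing values need attention before any computation, here is a minimal sketch (not part of the original code). It assumes the Age column from the example dataset shown in the output further down, with its one missing entry written as NaN:

```python
import numpy as np

# Age column from the example dataset, with one missing entry (NaN).
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0])

# A plain mean propagates NaN, so the result is not usable.
plain_mean = np.mean(ages)

# nanmean skips NaN entries and averages the remaining nine values.
clean_mean = np.nanmean(ages)

print("np.mean   :", plain_mean)
print("np.nanmean:", clean_mean)
```

The second value is the same column mean that the imputation step later writes into the missing cell.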

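There are many more libraries, but we do not need them right now. So, let's get started.

Download the dataset: go to the link and download Data_for_Missing_Values.csv.

Python: I recommend installing Anaconda on your system, then launching Spyder or Jupyter. The reason for this suggestion is that Anaconda comes with all the basic Python libraries preinstalled.

Below is the Python code: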

    # Python code explaining how to
    # handle missing values in a dataset

    """ PART 1
        Importing Libraries """

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd


    """ PART 2
        Importing Data """

    # Raw string, so the backslashes in the Windows path are not read as escapes
    data_sets = pd.read_csv(r'C:\Users\Admin\Desktop\Data_for_Missing_Values.csv')

    print("Data Head : \n", data_sets.head())

    print("Data Describe : \n", data_sets.describe())

    """ PART 3
        Input and Output Data """

    # All rows, all columns except the last
    X = data_sets.iloc[:, :-1].values

    # All rows, only the last column
    Y = data_sets.iloc[:, -1].values

    print("Input : \n", X)
    print("Output: \n", Y)


    """ PART 4
        Handling the missing values """

    # The Imputer class of sklearn.preprocessing was removed in
    # scikit-learn 0.22; SimpleImputer from sklearn.impute replaces it
    # and always works column-wise (the old axis=0 behaviour)
    from sklearn.impute import SimpleImputer

    # Replace NaN values with the mean of their column
    # (missing_values defaults to NaN)
    imputer = SimpleImputer(strategy="mean")

    # Fitting the data: the imputer learns the column means
    imputer.fit(X[:, 1:3])

    # transform() applies those means to the input, i.e. X[:, 1:3]
    X[:, 1:3] = imputer.transform(X[:, 1:3])

    # Missing values are now filled with the column mean
    print("New Input with Mean Value for NaN : \n", X)

Output:

Data Head : 
    Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


Data Describe : 
              Age        Salary
count   9.000000      9.000000
mean   38.777778  63777.777778
std     7.693793  12265.579662
min    27.000000  48000.000000
25%    35.000000  54000.000000
50%    38.000000  61000.000000
75%    44.000000  72000.000000
max    50.000000  83000.000000


Input : 
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


Output: 
 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


New Input with Mean Value for NaN : 
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
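
The filled-in values above (38.78 for Age, 63777.78 for Salary) are simply the column means reported by describe(). As a cross-check, the following sketch reproduces them with plain pandas instead of scikit-learn. The DataFrame is typed in by hand from the "Input" output above, and the variable names are illustrative rather than taken from the original code:

```python
import numpy as np
import pandas as pd

# Rows copied from the "Input" output above; NaN marks the missing cells.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
})

numeric_cols = ["Age", "Salary"]

# mean() skips NaN by default, so each mean uses the nine known values.
col_means = df[numeric_cols].mean()

# fillna with a Series fills each column with the entry of the same name.
filled = df.copy()
filled[numeric_cols] = df[numeric_cols].fillna(col_means)

print(col_means)
print(filled)
```

For a one-off mean fill this is often enough. An imputer object becomes more useful when the same learned means have to be applied to new data later.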

Code explanation:

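Part 1 - Importing libraries: the code above imports numpy, pandas, and matplotlib, but only pandas is actually used.
Part 2 - Importing data:
- Data_for_Missing_Values.csv is imported by passing its path to the pandas read_csv function. "data_sets" is now a DataFrame (a two-dimensional tabular data structure with labeled rows and columns).
- The head() function then shows the first rows, five by default. The number of entries can be changed: for the first 3 rows, for example, use dataframe.head(3). Similarly, the last rows can be shown with the tail() function.
- The describe() function then gives a statistical summary of the data: the count, minimum, maximum, percentiles (.25, .5, .75), mean, and standard deviation of each numeric column.
Part 3 - Input and output data: we split the DataFrame into input and output. Every column except the last becomes the input X, and the last column becomes the output Y.
Part 4 - Handling missing values: the NaN entries are filled in with scikit-learn's Imputer class, described below.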
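
Parts 2 and 3 can be tried without the CSV file. The sketch below is an illustration under assumptions: it replaces the file with an in-memory string holding only the first five rows from the output above, and it adds isna().sum(), which the original code does not call, to count the missing cells per column:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file: first five rows only.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
"""

# The empty Salary field in the last row is parsed as NaN.
data_sets = pd.read_csv(io.StringIO(csv_text))

print(data_sets.head(3))        # first three rows
print(data_sets.tail(2))        # last two rows
print(data_sets.isna().sum())   # number of missing values per column

# Input: every column except the last. Output: the last column only.
X = data_sets.iloc[:, :-1].values
Y = data_sets.iloc[:, -1].values
print(X.shape, Y.shape)
```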

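Imputer: Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True) is a class in the sklearn.preprocessing package. It replaces each missing value (NaN) with a value computed according to the chosen strategy. Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; newer versions provide sklearn.impute.SimpleImputer instead, which has no axis parameter and always imputes column by column.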

Syntax : sklearn.preprocessing.Imputer()

Parameters : 

-> missing_values  : integer or "NaN"
-> strategy        : What to impute - mean, median or most_frequent along axis
-> axis(default=0) : 0 means along column and 1 means along row
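
To make the strategy parameter concrete, here is a small sketch comparing the three strategies listed above. It assumes a recent scikit-learn, where SimpleImputer from sklearn.impute takes the place of Imputer, and it uses a made-up five-row matrix rather than the article's dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small hypothetical numeric matrix: columns are Age and Salary.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [40.0, np.nan],
    [np.nan, 52000.0],
    [27.0, 61000.0],
])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns one fill value per column and applies it.
    results[strategy] = imputer.fit_transform(X_num)
    print(strategy, "-> fill values per column:", imputer.statistics_)
```

The mean is sensitive to outliers, so the median can be a safer choice for skewed columns such as salaries, while most_frequent also suits categorical data.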

