Skip to content

Latest commit

 

History

History
150 lines (130 loc) · 4.91 KB

5.Pandas学习笔记.md

File metadata and controls

150 lines (130 loc) · 4.91 KB

Pandas 方法

pd.read_csv(csv_path) 读入csv文件

读入csv文件,一般用于返回~

~head() 获取前五行数据

供快速参考。

~info() 迅速获取数据描述

获取总行数、每个属性的类型、非空值的数量。

~value_counts() 获取每个值出现的次数

housing["ocean_proximity"].value_counts()

# 输出
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

pd.set_option() 设置指定的值

详细内容

设置最大输出行数

pd.set_option('max_rows', 7) 

~describe() 简要显示数据的数字特征

例如:总数、平均值、标准差、最大值最小值、25%/50%/75%值

~hist() 以直方图形式绘制所有属性

hist()方法依赖于Matplotlib,而Matplotlib又依赖于一个用户指定的图形后端去在你的屏幕上绘制。在Jupyter notebook中可用“%matplotlib inline”告诉Jupyter安装 Matplotlib使用时,使用Jupyter拥有的后端。

Jupyter中,show()方法是可选的,因为如果有图形需要输出,它会自动绘制。

%matplotlib inline  # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots")
plt.show()

~loc[] 纯粹基于标签位置的索引器

strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]

~where() 通过判断自身的值来修改自身对应的值

housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
  • cond 如果为True则保持原始值,若为False则使用第二个参数other替换值。
  • other 替换的目标值
  • inplace 是否在数据上执行操作

pandas.DataFrame() Pandas表格

具有标签轴,且算术运算在行和列标签上对齐。

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

~drop()

返回删除了请求轴标签的新对象。

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
  • labels 索引或列标签
  • axis 从索引(0),还是列(1)中删除
  • inplace 若为True则在原数据执行操作

~corr() 计算相关系数

  • method: 可选 {‘pearson’, ‘kendall’, ‘spearman’}
    • pearson : standard correlation coefficient
    • kendall : Kendall Tau correlation coefficient
    • spearman : Spearman rank correlation
  • min_periods: Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
# 计算标准相关系数
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

#输出:
# median_house_value    1.000000
# median_income         0.687160
# total_rooms           0.135097
# housing_median_age    0.114110
# households            0.064506
# total_bedrooms        0.047689
# population           -0.026920
# longitude            -0.047432
# latitude             -0.142724
# Name: median_house_value, dtype: float64

scatter_matrix() 通过绘图比较相关性

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")

~dropna() 返回略去丢失数据部分后的剩余数据

Return object with labels on given axis omitted where alternately any or all of the data are missing

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

~fillna() 用指定的方法填充

# 用中位数填充
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)

~factorize() 将数据转换为数值类型特征

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 输出
# 17606     <1H OCEAN
# 18632     <1H OCEAN
# 14650    NEAR OCEAN
# 3230         INLAND
# 3555      <1H OCEAN
# 19480        INLAND
# 8879      <1H OCEAN
# 13685        INLAND
# 4937      <1H OCEAN
# 4861      <1H OCEAN
# Name: ocean_proximity, dtype: object

housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# 输出
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)