Iterating over pandas data

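pandas handles data as column-like and table-like data structures,
and it is common to want to retrieve values from them one column or
one row at a time (iteration) and apply some operation to each.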

Iterating over a Series

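Series has the following two iteration methods. Each behaves as follows.

__iter__: iterates over the values of the Series only
(this is what happens when the Series itself is used in a for loop)
iteritems: iterates over tuples of the Series index and value (called items in current pandas; iteritems was removed in pandas 2.0)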

In [1]:

import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Iterate over the values only
for v in s:
    print(v)

# Iterate over (index, value) tuples
# (items() is the current name; iteritems() was removed in pandas 2.0)
for i, v in s.items():
    print(str(i) + ' ' + str(v))
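
As a small self-contained check of the two behaviours described above, the sketch below collects what each loop yields into a list. It uses `Series.items()`, on the assumption that a recent pandas release is installed (older releases also offered the same behaviour under the name `iteritems()`).

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Iterating the Series directly yields only the values.
values = [v for v in s]

# items() yields (index label, value) pairs.
pairs = [(i, v) for i, v in s.items()]

print(values)
print(pairs)
```

Collecting into lists makes the difference easy to see: the first list holds three values, the second holds three (label, value) pairs.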

Iterating over a DataFrame

DataFrame has the following four iteration methods. Their behavior is shown in the same way.

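__iter__: iterates over the column names (columns) of the DataFrame only
iteritems: iterates over tuples of column name and column values (Series); called items in current pandas, where iteritems was removed in 2.0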
- pandas.DataFrame.iteritems — pandas 0.21.0 documentation
iterrows: iterates over (index, Series) pairs of row label and row values
- pandas.DataFrame.iterrows — pandas 0.21.0 documentation
itertuples: iterates over tuples made up of the row label and the row values.
The first element of each tuple is the index label.
- pandas.DataFrame.itertuples — pandas 0.21.0 documentation
- By default it returns a namedtuple named Pandas.
  Because it is a namedtuple, each element can be accessed with . as well as with [].
- Passing name=None returns a plain tuple.
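
To try the four methods without the CSV file used in the examples that follow, the sketch below builds a two-row DataFrame inline. The values are copied from the output printed in this article; that the file holds exactly these columns is an assumption. `items()` is used in place of `iteritems()`, which newer pandas versions no longer provide.

```python
import pandas as pd

# Hypothetical inline stand-in for the first two rows of the CSV file.
df = pd.DataFrame(
    {'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
    index=pd.Index(['Alice', 'Bob'], name='name'),
)

# __iter__: column names only.
column_names = list(df)

# items(): (column name, column Series) pairs.
first_col_name, first_col = next(df.items())

# iterrows(): (row label, row Series) pairs.
first_label, first_row = next(df.iterrows())

# itertuples(): one namedtuple per row, index label first.
first_tuple = next(df.itertuples())

print(column_names)
print(first_col_name, list(first_col))
print(first_label, list(first_row))
print(first_tuple)
```

Taking only the first item from each iterator keeps the output short while still showing the shape of what each method yields.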

In [2]:

import pandas as pd

df = pd.read_csv('data/sample_pandas_normal.csv', index_col=0).head(2)
print(df)

# Iterate over the column names only
for column_name in df:
    print(column_name, type(column_name))

       age state  point
name                   
Alice   24    NY     64
Bob     42    CA     92
age <class 'str'>
state <class 'str'>
point <class 'str'>

In [3]:

# Iterate over (column name, column Series) tuples
# (items() is the current name of iteritems())
for column_name, item in df.items():
    print(column_name, type(column_name))
    print(item, type(item))

    # Access by label, by position, and by attribute
    print(item['Alice'], item.iloc[0], item.Alice)
    print('======\n')

age <class 'str'>
name
Alice    24
Bob      42
Name: age, dtype: int64 <class 'pandas.core.series.Series'>
24 24 24
======

state <class 'str'>
name
Alice    NY
Bob      CA
Name: state, dtype: object <class 'pandas.core.series.Series'>
NY NY NY
======

point <class 'str'>
name
Alice    64
Bob      92
Name: point, dtype: int64 <class 'pandas.core.series.Series'>
64 64 64
======

In [4]:

# Iterate over (row label, row Series) tuples
for index, row in df.iterrows():
    print(index, type(index))
    print(row, type(row))

    # Access by label, by position, and by attribute
    print(row['point'], row.iloc[2], row.point)
    print('======\n')

Alice <class 'str'>
age      24
state    NY
point    64
Name: Alice, dtype: object <class 'pandas.core.series.Series'>
64 64 64
======

Bob <class 'str'>
age      42
state    CA
point    92
Name: Bob, dtype: object <class 'pandas.core.series.Series'>
92 92 92
======
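
The `dtype: object` in the output above is worth noting: each row is packed into a single Series, so `iterrows()` cannot keep a separate dtype per column. A minimal sketch of the effect, using a hypothetical two-column frame (`df_mixed`, `n` and `x` are made-up names): the integer column is upcast to float inside the row Series, while `itertuples()` keeps it as an integer.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
df_mixed = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# iterrows(): the row Series needs one dtype for both values.
_, row = next(df_mixed.iterrows())

# itertuples(): each field keeps the type of its own column.
tup = next(df_mixed.itertuples())

print(df_mixed['n'].dtype)
print(row.dtype)
print(type(row['n']))
print(type(tup.n))
```

If per-column types matter inside the loop, this is one reason to reach for `itertuples()` rather than `iterrows()`.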

In [5]:

# Iterate over tuples made up of the row label and the row values
# By default, a namedtuple named Pandas is returned.
for row in df.itertuples():
    print(row, type(row))

    print(row[3], row.point)
    print('======')

Pandas(Index='Alice', age=24, state='NY', point=64) <class 'pandas.core.frame.Pandas'>
64 64
======
Pandas(Index='Bob', age=42, state='CA', point=92) <class 'pandas.core.frame.Pandas'>
92 92
======

In [6]:

# Passing name=None returns a plain tuple.
for row in df.itertuples(name=None):
    print(row, type(row))

    print(row[3])
    print('======\n')

('Alice', 24, 'NY', 64) <class 'tuple'>
64
======

('Bob', 42, 'CA', 92) <class 'tuple'>
92
======
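
Besides `name`, `itertuples()` accepts an `index` argument; with `index=False` the index label is left out of each tuple. A short sketch on a hypothetical inline copy of the first two rows (values taken from the output above; the class name `Person` is made up for illustration):

```python
import pandas as pd

# Hypothetical inline stand-in for the first two rows of the CSV file.
df2 = pd.DataFrame(
    {'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
    index=pd.Index(['Alice', 'Bob'], name='name'),
)

# name= sets the namedtuple's class name; index=False drops the index label.
rows = list(df2.itertuples(index=False, name='Person'))

for row in rows:
    # Without the index, position 2 is the 'point' column.
    print(row._fields, row.state, row[2])
```

Note that dropping the index shifts the positions by one, so `row[2]` here refers to `point`, whereas in the examples above `point` was at `row[3]`.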

GroupBy

GroupBy has the following iteration method.

__iter__: iterates over tuples of the group name and the group (a DataFrame or a Series)

In [7]:

df = pd.read_csv('data/sample_pandas_normal.csv', index_col=0)
print(df)

grouped = df.groupby('state')
for name, group in grouped:
    print(name)
    print(group)    
    print('======')

         age state  point
name                     
Alice     24    NY     64
Bob       42    CA     92
Charlie   18    CA     70
Dave      68    TX     70
Ellen     24    CA     88
Frank     30    NY     57
CA
         age state  point
name                     
Bob       42    CA     92
Charlie   18    CA     70
Ellen     24    CA     88
======
NY
       age state  point
name                   
Alice   24    NY     64
Frank   30    NY     57
======
TX
      age state  point
name                  
Dave   68    TX     70
======
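
Whether each group arrives as a DataFrame or as a Series depends on what was grouped. A minimal sketch, assuming an inline copy of the six rows printed above (`df_all` is a made-up name): grouping the whole frame yields DataFrames, while selecting one column before iterating yields a Series per group.

```python
import pandas as pd

# Hypothetical inline stand-in for the CSV file, copied from the printed output.
df_all = pd.DataFrame(
    {
        'age': [24, 42, 18, 68, 24, 30],
        'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
        'point': [64, 92, 70, 70, 88, 57],
    },
    index=pd.Index(
        ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'], name='name'
    ),
)

# Grouping the whole frame: each group is a DataFrame.
frame_types = {
    name: type(group).__name__ for name, group in df_all.groupby('state')
}

# Selecting a column first: each group is a Series.
point_sums = {
    name: int(group.sum()) for name, group in df_all.groupby('state')['point']
}

print(frame_types)
print(point_sums)
```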

Updating values in a loop

iterrows(), which yields one row at a time, returns a copy rather than a view,
so modifying the pandas.Series it yields does not update the original data.

In [8]:

for index, row in df.iterrows():
    row.point /= 2
print(df)

         age state  point
name                     
Alice     24    NY     64
Bob       42    CA     92
Charlie   18    CA     70
Dave      68    TX     70
Ellen     24    CA     88
Frank     30    NY     57

To update the original, select the element from the original DataFrame with at and modify it there.

In [9]:

for index, row in df.iterrows():
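    # Floor division keeps the integer column and matches the output below;
    # with /= the result depends on the pandas version (57 / 2 is 28.5).
    df.at[index, 'point'] //= 2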

print(df)

         age state  point
name                     
Alice     24    NY     32
Bob       42    CA     46
Charlie   18    CA     35
Dave      68    TX     35
Ellen     24    CA     44
Frank     30    NY     28
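
When the same operation applies to every row, a loop may not be needed at all. As an alternative to the `at`-based loop (a different technique from the one shown above), the sketch below halves the column in a single vectorized step. It assumes an inline copy of the `point` column (`df_points` is a made-up name) and uses floor division so that the results match the integers printed above.

```python
import pandas as pd

# Hypothetical inline stand-in for the 'point' column of the CSV file.
df_points = pd.DataFrame(
    {'point': [64, 92, 70, 70, 88, 57]},
    index=pd.Index(
        ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'], name='name'
    ),
)

# Floor-divide the whole column at once; the dtype stays integer.
df_points['point'] = df_points['point'] // 2

print(df_points)
```

The loop with `at` remains useful when each row needs different handling; for a uniform update like this one, the column-wise form is shorter and avoids per-row overhead.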

showery9hxn's diary
