Changing the Index of a pandas DataFrame Copy Also Affects the Original

I observed this while experimenting with the Index and MultiIndex of pandas.DataFrame objects. If I make a copy of a DataFrame and change one of the values of its index in place, the same index value in the original DataFrame also changes.

In [1]:
from pandas import DataFrame
import pandas
import numpy as np
from IPython.display import HTML

pandas.__version__
Out[1]:
'0.17.1'

1. Let’s define functions to generate and display these DataFrames together

The functions below generate the DataFrames and display them side by side for ease of viewing.

In [2]:
def showDFs(dfs):
    # Render each DataFrame as an HTML table inside its own floated <div>.
    html = ''
    for k, v in dfs.items():
        html += ('''<div style="float:left; padding:10px;">
                 DataFrame: <strong>{}</strong>{}</div>'''
                 .format(k, v.to_html()))
    return HTML('<div style="float:left; width:100%;">' + html + '</div>')

def show():
    # Display the two module-level frames side by side.
    return showDFs({'df_original': df_original, 'df_copy': df_copy})

def make2frames():
    # Build a 4x5 frame holding the integers 0..19 and return it with a copy.
    s = np.arange(20).reshape(4, 5)
    d = DataFrame(s, index=list('abcd'), columns=list('ABCDE'))
    return d, d.copy()

2. Create DataFrames from a numpy array

Now let’s create a DataFrame and make a copy of it.

In [3]:
df_original, df_copy = make2frames()
show()
Out[3]:
DataFrame: df_original

    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

DataFrame: df_copy

    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

3. Make changes to the index values of the DataFrames

Changes in the copy affect the original and vice versa.

In [4]:
df_copy.index.values[1] = 'NEW VALUE'
show()
Out[4]:
DataFrame: df_original

            A   B   C   D   E
a           0   1   2   3   4
NEW VALUE   5   6   7   8   9
c          10  11  12  13  14
d          15  16  17  18  19

DataFrame: df_copy

            A   B   C   D   E
a           0   1   2   3   4
NEW VALUE   5   6   7   8   9
c          10  11  12  13  14
d          15  16  17  18  19
In [5]:
df_original.index.values[3] = 'New Value 2'
show()
Out[5]:
DataFrame: df_original

              A   B   C   D   E
a             0   1   2   3   4
NEW VALUE     5   6   7   8   9
c            10  11  12  13  14
New Value 2  15  16  17  18  19

DataFrame: df_copy

              A   B   C   D   E
a             0   1   2   3   4
NEW VALUE     5   6   7   8   9
c            10  11  12  13  14
New Value 2  15  16  17  18  19

This was very much unexpected to me. I went on to test the following changes, both to see whether anything else would surprise me and to confirm what I had learned about pandas.
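
One way to probe this behaviour is to ask whether the two indexes are backed by the same memory. The sketch below is not part of the original notebook: the helper index_shares_memory and the frame df_detached are hypothetical names, and it assumes the index is backed by a plain NumPy array, which is why the labels are forced to object dtype. What the first two print calls show may depend on the pandas version, so no output is given for them.

```python
import numpy as np
import pandas as pd


def index_shares_memory(left, right):
    """Return True if the arrays backing the two indexes overlap in memory.

    Assumption: each index is backed by a plain NumPy array, as is the
    case for the object-dtype index built below.
    """
    return bool(np.shares_memory(left.index.values, right.index.values))


# Rebuild the frames from this post, forcing an object-dtype index so
# that .values is a NumPy array regardless of the pandas version.
labels = pd.Index(list('abcd'), dtype=object)
df_original = pd.DataFrame(np.arange(20).reshape(4, 5),
                           index=labels, columns=list('ABCDE'))
df_copy = df_original.copy()

# Is it the very same Index object, or at least the same buffer?
print(df_original.index is df_copy.index)
print(index_shares_memory(df_original, df_copy))

# For comparison: a frame whose index was explicitly deep-copied
# should not share its label buffer with the original.
df_detached = df_original.copy()
df_detached.index = df_original.index.copy(deep=True)
print(index_shares_memory(df_original, df_detached))
```

If the second check prints True, in-place writes through index.values would be expected to show up in both frames, which is consistent with what was observed above.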

4. Making changes to the values

As expected, making changes to the values in either DataFrame does not affect the other.

In [6]:
#reset both DataFrames
df_original, df_copy = make2frames()

# change the copy
df_copy['B'] = 1000
show()
Out[6]:
DataFrame: df_original

    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

DataFrame: df_copy

    A     B   C   D   E
a   0  1000   2   3   4
b   5  1000   7   8   9
c  10  1000  12  13  14
d  15  1000  17  18  19
In [7]:
# change the original
df_original['C'] = 9999
show()
Out[7]:
DataFrame: df_original

    A   B     C   D   E
a   0   1  9999   3   4
b   5   6  9999   8   9
c  10  11  9999  13  14
d  15  16  9999  18  19

DataFrame: df_copy

    A     B   C   D   E
a   0  1000   2   3   4
b   5  1000   7   8   9
c  10  1000  12  13  14
d  15  1000  17  18  19
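
The independence of the data can also be checked at the memory level. This is a minimal sketch, assuming an all-integer frame like the one used in this post, so that .values exposes a single underlying NumPy array; data_shared is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df_original = pd.DataFrame(np.arange(20).reshape(4, 5),
                           index=list('abcd'), columns=list('ABCDE'))
df_copy = df_original.copy()  # deep=True is the default

# The data of a deep copy lives in its own buffer ...
data_shared = bool(np.shares_memory(df_original.values, df_copy.values))
print(data_shared)  # False

# ... so writing to the copy leaves the original untouched.
df_copy['B'] = 1000
print(df_original['B'].tolist())  # [1, 6, 11, 16]
```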

5. Making changes to column names

In [8]:
#reset both DataFrames
df_original, df_copy = make2frames()

#change the copy
df_copy.columns = list('FGHIJ')
show()
Out[8]:
DataFrame: df_original

    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

DataFrame: df_copy

    F   G   H   I   J
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
In [9]:
#change the original
df_original.columns = list('PQRST')
show()
Out[9]:
DataFrame: df_original

    P   Q   R   S   T
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

DataFrame: df_copy

    F   G   H   I   J
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

6. Making changes to the index – the ‘conventional way’

In [10]:
#reset both DataFrames
df_original, df_copy = make2frames()

df_copy.set_index([['k', 'l', 'm', 'n']], inplace=True)
show()
Out[10]:
DataFrame: df_original

    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

DataFrame: df_copy

    A   B   C   D   E
k   0   1   2   3   4
l   5   6   7   8   9
m  10  11  12  13  14
n  15  16  17  18  19
In [11]:
df_original.set_index([['one', 'two', 'three', 'four']], inplace=True)
show()
Out[11]:
DataFrame: df_original

        A   B   C   D   E
one     0   1   2   3   4
two     5   6   7   8   9
three  10  11  12  13  14
four   15  16  17  18  19

DataFrame: df_copy

    A   B   C   D   E
k   0   1   2   3   4
l   5   6   7   8   9
m  10  11  12  13  14
n  15  16  17  18  19

7. Now that the indices are different in the two frames, let’s see whether they still affect each other.

In [12]:
df_copy.index.values[1] = 'NEW VALUE'
show()
Out[12]:
DataFrame: df_original

        A   B   C   D   E
one     0   1   2   3   4
two     5   6   7   8   9
three  10  11  12  13  14
four   15  16  17  18  19

DataFrame: df_copy

            A   B   C   D   E
k           0   1   2   3   4
NEW VALUE   5   6   7   8   9
m          10  11  12  13  14
n          15  16  17  18  19
In [13]:
df_original.index.values[1] = 'ANOTHER VALUE'
show()
Out[13]:
DataFrame: df_original

                A   B   C   D   E
one             0   1   2   3   4
ANOTHER VALUE   5   6   7   8   9
three          10  11  12  13  14
four           15  16  17  18  19

DataFrame: df_copy

            A   B   C   D   E
k           0   1   2   3   4
NEW VALUE   5   6   7   8   9
m          10  11  12  13  14
n          15  16  17  18  19

8. Thoughts

In conclusion, when a fresh copy is made from a DataFrame, altering individual values of either index in place will affect both the original and the copy. Most likely, the index of the copied DataFrame still points to the same memory location as the index of the original. This lasts until either DataFrame is given a new index, for example through set_index. From then on the two indexes point to different memory locations and no longer affect each other.
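
If the goal is simply to change one label in a copy without touching the original, it seems safer to avoid writing into index.values and to give the frame a new index instead. The sketch below shows two options under that assumption: rename, and assigning a whole new list of labels. The names df_renamed and labels are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df_original = pd.DataFrame(np.arange(20).reshape(4, 5),
                           index=list('abcd'), columns=list('ABCDE'))
df_copy = df_original.copy()

# Option 1: rename returns a new frame that carries a new index.
df_renamed = df_copy.rename(index={'b': 'NEW VALUE'})

# Option 2: build the new labels as a plain list and assign them as a whole.
labels = df_copy.index.tolist()
labels[3] = 'New Value 2'
df_copy.index = labels

print(df_original.index.tolist())  # ['a', 'b', 'c', 'd']
print(df_renamed.index.tolist())   # ['a', 'NEW VALUE', 'c', 'd']
print(df_copy.index.tolist())      # ['a', 'b', 'c', 'New Value 2']
```

Both options replace the index rather than mutating its underlying array, so the original frame keeps its labels.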