Simplifying Data Analysis: The Power of drop_duplicates()

When working with datasets, duplicate rows can be a major obstacle to accurate analysis. That's where the pandas drop_duplicates() method comes in: a tool designed to eliminate duplicate rows from a DataFrame.

Understanding the Syntax

The drop_duplicates() method takes four optional arguments:

  • subset: a column label or list of labels to consider when identifying duplicates (defaults to all columns)
  • keep: specifies which duplicates to keep ('first', 'last', or False to drop all duplicated rows)
  • inplace: if True, modifies the original DataFrame in place; if False (the default), returns a new DataFrame
  • ignore_index: if True, resets the index of the resulting DataFrame to a clean, new index starting from 0

Removing Duplicate Rows Across All Columns

In our first example, we’ll use drop_duplicates() to remove duplicate rows across all columns, keeping only the first occurrence of each unique row. The result? A simplified dataset with no duplicate rows.

import pandas as pd

# create a sample DataFrame with duplicates
data = {'Name': ['John', 'Jane', 'John', 'Jane', 'Bob'], 
        'Age': [25, 30, 25, 30, 35]}
df = pd.DataFrame(data)

# remove duplicates across all columns, keeping the first occurrence
df.drop_duplicates(inplace=True)

print(df)

Targeted Duplicate Removal

But what if we want to identify duplicates based on a specific subset of columns? That's where the subset parameter comes in. By setting subset to ['Student_ID', 'Name'], we can remove duplicates based on the combination of these two columns.

import pandas as pd

# create a sample DataFrame with duplicates
data = {'Student_ID': [1, 2, 1, 2, 3], 
        'Name': ['John', 'Jane', 'John', 'Jane', 'Bob'], 
        'Grade': [90, 85, 90, 85, 95]}
df = pd.DataFrame(data)

# remove duplicates based on 'Student_ID' and 'Name' columns
df.drop_duplicates(subset=['Student_ID', 'Name'], inplace=True)

print(df)

Customizing Duplicate Removal

The keep argument gives us even more control over duplicate removal. We can keep the first occurrence, keep the last occurrence, or pass keep=False to remove every row that has a duplicate.

import pandas as pd

# create a sample DataFrame with duplicates
data = {'Name': ['John', 'Jane', 'John', 'Jane', 'Bob'], 
        'Age': [25, 30, 25, 30, 35]}
df = pd.DataFrame(data)

# remove duplicates, keeping the last occurrence
df.drop_duplicates(keep='last', inplace=True)

print(df)
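For completeness, here is keep=False in action on the same sample data: instead of keeping one representative of each duplicate group, it drops every row that appears more than once, leaving only the rows that were unique to begin with.

```python
import pandas as pd

# create a sample DataFrame with duplicates
data = {'Name': ['John', 'Jane', 'John', 'Jane', 'Bob'],
        'Age': [25, 30, 25, 30, 35]}
df = pd.DataFrame(data)

# keep=False removes every row that appears more than once
df.drop_duplicates(keep=False, inplace=True)

print(df)  # only 'Bob' remains; 'John' and 'Jane' each appeared twice
```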

Resetting the Index

Finally, the ignore_index parameter allows us to reset the index of the resulting DataFrame. By setting ignore_index to True, we can start fresh with a new index starting from 0.

import pandas as pd

# create a sample DataFrame with duplicates
data = {'Name': ['John', 'Jane', 'John', 'Jane', 'Bob'], 
        'Age': [25, 30, 25, 30, 35]}
df = pd.DataFrame(data)

# remove duplicates, resetting the index
df.drop_duplicates(inplace=True, ignore_index=True)

print(df)
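To see exactly what ignore_index changes, compare the index with and without it: the surviving rows normally keep their original labels (0, 1, 4 for our sample data), while ignore_index=True relabels them from 0. A minimal sketch:

```python
import pandas as pd

data = {'Name': ['John', 'Jane', 'John', 'Jane', 'Bob'],
        'Age': [25, 30, 25, 30, 35]}
df = pd.DataFrame(data)

# without ignore_index, surviving rows keep their original labels
print(df.drop_duplicates().index.tolist())                   # [0, 1, 4]

# with ignore_index=True, the result is relabeled from 0
print(df.drop_duplicates(ignore_index=True).index.tolist())  # [0, 1, 2]
```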
