Unlocking Insights: Best Practices for Dealing with Missing Values along with Data Analysis in Pandas

MD TAHSEEN EQUBAL
7 min readMay 14, 2024

--

Resources used in this blog (datasets, Day 8 Data Cleaning.ipynb file)

Definition

  • The Value which is not available , Blanked Data.
  • Missing values, also known as blanked data, refer to data points that are not available within a dataset.
  • They are often denoted as “nan”, “NaN”, or null, and arise from factors like data entry errors or system malfunctions.

Impact

Missing values can hinder data analysis by obscuring insights and patterns. They can make it challenging to identify trends or draw accurate conclusions from the data.

Mechanisms

Mechanism for missing data

1. Missing Completely at Random (MCAR)

  • Missing values occur randomly without any discernible pattern or reason.
  • Examples: Include system glitches or internet connectivity issues.

2. Missing at Random (MAR)

  • Missing values depend on other observed variables but not on the missing values themselves.
  • Example: respondents may choose not to answer certain questions based on their demographic characteristics.

3. Missing not at Random (MNAR)

  • The reason for missing values is related to the missing values themselves, making them non-random.
  • Example: patients with severe health conditions may fail to report their symptoms in a medical study.

Identification

Identifying missing values is essential for effective data analysis.

  1. Using `df.isna()`: This method identifies missing values in a DataFrame and returns a Boolean mask indicating their presence.
  2. Using `df.isnull()`: Similar to `df.isna()`, this method also identifies missing values in a DataFrame and returns a boolean mask.
import pandas as pd

# Creating a sample DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
'B': [None, 2, 3, None, 5],
'C': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)

# Check for missing values using isna()
print("Using isna():")
print(df.isna())

# Check for missing values using isnull()
print("\nUsing isnull():")
print(df.isnull())
OUTPUT

Using isna():
A B C
0 False True False
1 False False False
2 True False False
3 False True False
4 False False False

Using isnull():
A B C
0 False True False
1 False False False
2 True False False
3 False True False
4 False False False

In this example, both df.isna() and df.isnull() produce identical results, as they are essentially aliases of each other and perform the same operation. Both methods identify missing values in the DataFrame and return boolean masks where True represents missing values and False represents non-missing values.

Treatment of Missing Values

Treatment of Missing Values
  1. Drop ( dropna() )

Explanation:

  • Dropping missing values involves removing rows or columns from the dataset that contain missing values.
  • This approach eliminates any observations with missing data, ensuring that only complete cases are retained for analysis.
  • dropna() is not always the best choice, especially when dealing with small datasets or when missing values are prevalent

Advantages:

  • Simplicity: Dropping missing values is a straightforward process that requires minimal preprocessing.
  • Preservation of Data Structure: By removing incomplete cases, the overall structure and integrity of the dataset are preserved.
  • Reduction of Bias: Dropping missing values can prevent biased estimates or erroneous conclusions that may arise from imputing missing data.

Disadvantages:

  • Loss of Information: Dropping missing values may lead to a reduction in the sample size and loss of valuable information, especially if missingness is not random.
  • Reduced Statistical Power: Removing observations with missing values can decrease the statistical power of analyses, potentially limiting the ability to detect significant effects.
  • Selective Bias: If missingness is related to the outcome variable or other variables of interest, dropping missing values may introduce selective bias into the analysis.

Example:

  • Consider a dataset containing information about customer demographics and their purchase behavior. If the dataset has missing values in the “Age” and “Income” columns, dropping these rows removes the incomplete cases. However, this approach may result in a smaller sample size, potentially overlooking important patterns or trends in customer behavior.

2. Fill( fillna() )

Explanation:

  • Filling missing values involves replacing them with substitute values, such as mean, median, mode, or interpolated values.
  • This approach imputes missing data to ensure that all observations in the dataset have complete information.

Advantages:

  • Retention of Sample Size: Filling missing values allows for the retention of the entire sample size, ensuring that no observations are excluded from the analysis.
  • Improved Statistical Power: By retaining all observations, filling missing values can enhance the statistical power of analyses, enabling more robust and reliable results.
  • Facilitates Analysis: Imputing missing values enables researchers to conduct comprehensive analyses without disregarding incomplete cases.

Disadvantages:

  • Introduction of Bias: Imputing missing values with substitute values may introduce bias into the dataset, particularly if the imputed values do not accurately reflect the true distribution of the data.
  • Distortion of Variability: Imputation methods such as mean or median imputation may underestimate the variability of the data, leading to inaccurate estimates of variability and dispersion.
  • Complexity of Imputation: Choosing appropriate imputation methods and parameters requires careful consideration, as different approaches may yield varying results.

Example:

  • In a healthcare dataset, if there are missing values in the “Blood Pressure” variable, filling these values with the mean blood pressure of the population may provide a reasonable estimate. However, if missingness is related to certain demographic factors or health conditions, imputing missing values with the mean may introduce bias into the analysis. In such cases, more sophisticated imputation techniques, such as predictive modeling or multiple imputation, may be warranted.

Filling Approaches for Handling Missing Value

Original dataset
  1. Statistical Method (Mean, Median, Mode)
 df['Temperature'].fillna(df['Temperature'].mean(),inplace=True) 
df
Output of Mean for the data frame

Use:

  • Impute missing values with the mean, median, or mode of the respective variable.

Advantage:

  • Simplicity: Easy to implement and understand.
  • Preservation of Sample Size: Retains the original sample size.

Disadvantage:

  • Potential Bias: May introduce bias if the imputed values do not accurately represent the underlying distribution.
  • Loss of Variability: May underestimate the variability of the data.

Example:

  • In a dataset containing student exam scores, missing values in the “Grade” column can be imputed with the mean, median, or mode grade.

2. Filling Dataset Values — bfill (Backward Fill) / ffill (Forward Fill)

df1.fillna(method='bfill')
df1.fillna(method='ffill')
df1pyp
Using bfill(), ffill()

Use:

  • Propagate non-missing values forward or backward to fill missing values.

Advantage:

  • Preservation of Temporal Order: Retains the temporal order of observations, useful for time series data.
  • Ease of Implementation: Simple and efficient for sequential data.

Disadvantage:

  • Potential Propagation of Errors: May propagate errors or outliers if adjacent observations contain them.
  • Assumption of Stationarity: Assumes that the data generating process is stationary over time.

Example:

  • In a stock price dataset, missing values in the “Closing Price” column can be filled using ffill to propagate the previous day’s closing price.

3. Constant

df1['Temperature'].fillna(27.4,inplace = True )
df1['Humidity'].fillna(75,inplace = True )
df1
using constant values

Use:

  • Impute missing values with a constant value, typically based on domain knowledge.

Advantage:

  • Flexibility: Allows for customization of imputed values based on specific domain requirements.

Disadvantage:

  • Potential Bias: May introduce bias if the chosen constant value does not accurately represent the missing values.
  • Loss of Information: Ignores any underlying patterns or relationships in the data.

Example:

  • In a dataset of weather measurements, missing values in the “Temperature” column can be imputed with a constant value representing the average temperature for the region.

4. Interpolate

df2.interpolate(method = 'linear')
df2
using interpolate
  • Linear / Polynomial

Use:

  • Predict missing values based on linear or polynomial interpolation between adjacent observations.

Advantage:

  • Preservation of Trends: Retains underlying trends and patterns in the data.
  • Advantage:Can handle irregularly spaced data points and missing values within time series or spatial datasets.

Disadvantage:

  • Sensitivity to Outliers: May be sensitive to outliers or extreme values, leading to inaccurate imputations.
  • Assumption of Linearity: Assumes a linear relationship between adjacent observations, which may not hold true for all datasets.

Example:

  • In a dataset containing monthly sales data, missing values in the “Revenue” column can be interpolated using linear interpolation based on the trend observed in neighboring months.

5. GroupBy

Original Dataset
df['Marks'].fillna(df['Marks'].mean(),inplace = True )
df
Using Groupby

Use:

  • Impute missing values based on group-specific statistics.

Advantage:

  • Customization: Allows for tailored imputations based on group-level characteristics.
  • Reduced Bias: Reduces bias by accounting for differences in distributions between groups.

Disadvantage:

  • Complexity: Requires additional preprocessing steps to identify relevant groups and calculate group-specific statistics.
  • Small Group Sizes: Less effective for groups with small sample sizes.

Example:

  • In a dataset of customer transactions, missing values in the “Purchase Amount” column can be imputed with the mean purchase amount for each customer segment identified through groupby analysis.

--

--