
# CS5481 Data Engineering: Tutorial on Data Management (Lecture 9)

## Objectives:
- Reinforce concepts from Lecture 9 through practical exercises.
- Apply data quality, security, privacy, and federated learning techniques.
- Practice problem-solving in data management scenarios.

---



## Part 1: Data Quality Practice

### Exercise 1: Identifying Data Quality Issues
- Examine a given dataset and identify data quality issues (accuracy, completeness, consistency).
#### Task:
1. Load the dataset.
2. Identify missing values, duplicates, and inconsistent formats.


In [None]:

import pandas as pd

# Load the dataset
data = pd.read_csv('data_quality.csv')
# Display first few rows
data.head()
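
The loading cell above covers step 1 only. As a minimal sketch of step 2, the helper below counts missing values, duplicate rows, and dates that do not match a single expected format. The function name `summarize_quality_issues`, the `sample` frame, and its column names are hypothetical stand-ins, since the columns of `data_quality.csv` are not shown here.

```python
import pandas as pd

def summarize_quality_issues(df, date_col=None, date_format="%Y-%m-%d"):
    """Return simple counts of common data quality problems (illustrative only)."""
    report = {
        # Completeness: number of missing values in each column
        "missing_per_column": {col: int(n) for col, n in df.isna().sum().items()},
        # Duplicates: rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
    }
    if date_col is not None:
        # Consistency: values that are present but do not match the expected format
        present = df[date_col].dropna()
        parsed = pd.to_datetime(present, format=date_format, errors="coerce")
        report["inconsistent_dates"] = int(parsed.isna().sum())
    return report

# Hypothetical sample data standing in for data_quality.csv
sample = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", None],
    "date": ["2024-01-05", "05/01/2024", "05/01/2024", "2024-02-10"],
    "amount": [10.0, None, None, 7.5],
})
report = summarize_quality_issues(sample, date_col="date")
print(report)
```

The same call could be pointed at `data` once the real column names are known.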



### Exercise 2: Data Cleaning and Standardization
- Clean and standardize the dataset.
- Tasks:
  - Remove duplicates
  - Standardize date format
  - Handle missing values


In [None]:

# Remove duplicates
data = data.drop_duplicates()

# Standardize date format
data['date'] = pd.to_datetime(data['date'], errors='coerce')

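# Fill missing values in numeric columns with the column mean
# (non-numeric columns such as text or dates have no meaningful mean)
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())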

# Display cleaned data
data.head()
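
One caveat of `errors='coerce'` is that values which cannot be parsed are silently turned into `NaT`, so they become indistinguishable from values that were missing to begin with. The sketch below, using a hypothetical `raw` series rather than the tutorial dataset, shows one way to list the entries lost during parsing so they can be reviewed.

```python
import pandas as pd

# Hypothetical raw date column with one unparseable entry and one missing value
raw = pd.Series(["2024-01-05", "not a date", None, "2024-03-01"])

parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present before parsing but are NaT afterwards failed to parse
failed = raw[raw.notna() & parsed.isna()]
print(failed.tolist())
```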



## Part 2: Data Security and Privacy

### Exercise: Data Masking
- Anonymize sensitive data by masking personal information.
- Example: Masking credit card numbers and replacing names.


In [None]:

# Mask all but the last four digits
# (assumes each card number is 16 contiguous digits with no separators)
data['card_number'] = data['card_number'].astype(str).str.replace(r'^\d{12}(?=\d{4}$)', '**** **** **** ', regex=True)

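# Replace names with pseudonyms that stay the same across runs
# (the built-in hash() is randomized per Python process for strings,
# and taking it modulo 1000 makes collisions likely)
import hashlib
data['name'] = data['name'].apply(lambda x: f'User_{hashlib.sha256(str(x).encode()).hexdigest()[:8]}' if pd.notnull(x) else 'Anonymous')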

# Display anonymized data
data.head()



## Part 3: Privacy-Preserving Techniques

### Exercise: Randomized Response Simulation
- Simulate responses using the randomized response method to protect privacy.


In [None]:

import numpy as np

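def randomized_response(true_answer, truth_prob=0.5):
    # With probability truth_prob, report the true answer
    if np.random.rand() < truth_prob:
        return true_answer
    # Otherwise, report the outcome of a fair coin flip
    return bool(np.random.choice([True, False]))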

# Simulate reported answers for 100 individuals whose true answer is True
responses = [randomized_response(True) for _ in range(100)]
print(f'Simulated Responses: {responses[:10]}')
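
Individual reported answers are noisy, but the population-level proportion can still be recovered. If each person answers truthfully with probability $p$ and otherwise reports a fair coin flip, the expected share of reported "yes" answers is $p\pi + (1 - p)/2$, where $\pi$ is the true proportion. Solving for $\pi$ gives the estimator sketched below. The function name `estimate_true_proportion`, the parameter name `truth_prob`, and the simulated population with 30% true "yes" answers are illustrative assumptions.

```python
import numpy as np

def estimate_true_proportion(responses, truth_prob=0.5):
    """Estimate the true share of 'yes' answers from randomized responses."""
    observed_yes = np.mean(responses)
    # E[observed_yes] = truth_prob * pi + (1 - truth_prob) * 0.5, solved for pi
    estimate = (observed_yes - (1 - truth_prob) * 0.5) / truth_prob
    # Sampling noise can push the raw estimate slightly outside [0, 1]
    return float(np.clip(estimate, 0.0, 1.0))

# Hypothetical population: 30% of 10,000 individuals truly answer "yes"
rng = np.random.default_rng(0)
n = 10_000
true_answers = rng.random(n) < 0.3
tell_truth = rng.random(n) < 0.5   # who reports the true answer
coin_flip = rng.random(n) < 0.5    # fair coin used by everyone else
reported = np.where(tell_truth, true_answers, coin_flip)

estimated = estimate_true_proportion(reported, truth_prob=0.5)
print(f'Estimated true proportion: {estimated:.3f}')
```

With a sample this large, the estimate should land close to 0.3, even though no single reported answer reveals what that individual's true answer was.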



## Part 4: Additional Problem Set

1. **Data Quality:**
   - a) Identify common data quality issues in a dataset.
   - b) How can data profiling help detect quality problems early?

2. **Data Security:**
   - a) What measures can be taken to prevent data breaches in a cloud environment?
   - b) Describe the role of encryption in data security.

3. **Data Privacy:**
   - a) Compare data masking and data perturbation techniques.
   - b) How does federated learning enhance data privacy while training machine learning models?

4. **Practical Problem:**
   - Write a Python function to clean a dataset by removing duplicates and filling missing numerical values with the column mean.
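
As a starting point for the practical problem, one possible solution is sketched below. The function name `clean_dataset` and the `example` frame are hypothetical, and the sketch assumes that only numeric columns should be filled and that the input frame should be left unmodified.

```python
import pandas as pd

def clean_dataset(df):
    """Return a copy of df without duplicate rows and with numeric NaNs mean-filled."""
    # Work on a copy so the caller's DataFrame is not changed
    cleaned = df.drop_duplicates().copy()
    # Only numeric columns have a meaningful mean
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
    return cleaned

# Hypothetical example data
example = pd.DataFrame({
    "city": ["A", "A", "B", "C"],
    "value": [1.0, 1.0, None, 3.0],
})
result = clean_dataset(example)
print(result)
```

Note that duplicates are dropped before the mean is computed, so repeated rows do not skew the fill value; the opposite order would be an equally valid design choice with a different result.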

