CSIP12.in
UNIT 1 : CH 2 Dec 14, 2025

📊 Data Handling Using Pandas – II

## 📚 1. Definitions
* **Iteration:** The process of accessing each element (row or column) of a DataFrame one by one.
* **Descriptive Statistics:** Statistical functions used to summarize the central tendency, dispersion, and shape of a dataset's distribution (e.g., mean, median, max).
* **Missing Data (NaN):** `NaN` stands for "Not a Number" and marks a missing or undefined value in the dataset.
* **Binary Operation:** A mathematical operation (e.g., addition, subtraction) performed between two DataFrames, or between a Series and a DataFrame.
* **Groupby:** A technique to split data into groups based on some criteria, apply a function to each group, and combine the results.

---

## 🛠️ 2.1 Introduction & Setup
Before we start, let's create a "Master DataFrame" that we will use for most examples. Imagine this is a result sheet for a class.

### 💻 Code Example: Creating the Dataset
```python
import pandas as pd
import numpy as np

# Creating a Dictionary
data = {
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],  # Chirag was absent for IP (NaN)
    'Section': ['A', 'A', 'B', 'B']
}

# Creating the DataFrame
df = pd.DataFrame(data)
print(df)

```

### 📟 Output
```text
     Name  Maths    IP Section
0   Arjun     90  88.0       A
1    Bina     85  95.0       A
2  Chirag     78   NaN       B
3   Divya     92  90.0       B

```

---

## 🔄 2.2 Iterating Over a DataFrame
Iteration means visiting the data items one by one.

### 2.2.1 Horizontal Iteration (`iterrows()`)
Traverses the DataFrame **row by row**.

### 💻 Code Example
```python
# Printing Name and Maths marks for each student
for index, row in df.iterrows():
    print(f"Index {index}: {row['Name']} scored {row['Maths']}")

```

### 📟 Output
```text
Index 0: Arjun scored 90
Index 1: Bina scored 85
Index 2: Chirag scored 78
Index 3: Divya scored 92

```
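
As a small sketch of how `iterrows()` might be used beyond printing, the loop below collects the names of students who scored above a cutoff. The cutoff of 85 and the names `cutoff` and `toppers` are assumptions made for illustration; the data repeats the Name and Maths columns of the Master DataFrame.

```python
import pandas as pd

# Same names and Maths marks as the Master DataFrame
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
})

cutoff = 85   # hypothetical cutoff, chosen only for this example
toppers = []
for index, row in df.iterrows():
    # Each row arrives as a Series, so its values are read by column label
    if row['Maths'] > cutoff:
        toppers.append(row['Name'])

print(toppers)  # ['Arjun', 'Divya']
```

For a simple filter like this, the loop-free form `df[df['Maths'] > cutoff]['Name']` is usually shorter and faster; the loop is shown only to practise row-wise iteration.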

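### 2.2.2 Vertical Iteration (`items()`)
Traverses the DataFrame **column by column**. Older books and older pandas versions call this method `iteritems()`; it was deprecated in pandas 1.5 and removed in pandas 2.0, so current code should use `items()`.

### 💻 Code Example
```python
# Printing each column name and its content
for col_name, col_data in df.items():
    print(f"--- Column: {col_name} ---")
    print(col_data.values)  # Using .values just to keep output short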

```

### 📟 Output
```text
--- Column: Name ---
['Arjun' 'Bina' 'Chirag' 'Divya']
--- Column: Maths ---
[90 85 78 92]
--- Column: IP ---
[88. 95. nan 90.]
--- Column: Section ---
['A' 'A' 'B' 'B']

```

---

## ➕ 2.3 Binary Operations
Binary operations are arithmetic operations (add, subtract, multiply, divide) applied between two DataFrames.
**🔑 Key Rule:** Pandas aligns data on **Row Index** and **Column Label** before computing. Any label present in only one DataFrame produces `NaN` in the result.

### 💻 Code Example
```python
# Let's create two small DataFrames
df1 = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=[0, 1])
df2 = pd.DataFrame({'A': [5, 5], 'B': [5, 5]}, index=[1, 2])

print("DF1:\n", df1)
print("DF2:\n", df2)

# Adding them
print("\n--- Result of DF1 + DF2 ---")
print(df1 + df2)

```

### 📟 Output
```text
DF1:
    A   B
0  10  30
1  20  40

DF2:
   A  B
1  5  5
2  5  5

--- Result of DF1 + DF2 ---
      A     B
0   NaN   NaN   <-- Index 0 exists in DF1 but not DF2
1  25.0  45.0   <-- Index 1 exists in both (Matched!)
2   NaN   NaN   <-- Index 2 exists in DF2 but not DF1

```
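
If the `NaN` rows are not wanted, one option is the method form `add()` with its `fill_value` parameter, which substitutes a value for a label that is missing from one side before adding. The sketch below reuses `df1` and `df2` from the example; treating a missing row as 0 is an assumption that suits marks or totals but not every dataset.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=[0, 1])
df2 = pd.DataFrame({'A': [5, 5], 'B': [5, 5]}, index=[1, 2])

# fill_value=0 stands in for a row that exists in only one DataFrame
result = df1.add(df2, fill_value=0)
print(result)
# Index 0 keeps DF1's values, index 1 is the true sum,
# index 2 keeps DF2's values; no NaN remains.
```

The methods `sub()`, `mul()` and `div()` accept `fill_value` in the same way.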

---

## 📊 2.4 Descriptive Statistics
Functions that summarize numeric columns. *Note: By default, these functions skip `NaN` values.*

### 2.4.1 Min, Max, Sum, Count
### 💻 Code Example
```python
print("Highest Marks (Max):\n", df[['Maths', 'IP']].max())
print("\nTotal Marks (Sum):\n", df[['Maths', 'IP']].sum())
print("\nCount of Values:\n", df[['Maths', 'IP']].count())

```

### 📟 Output
```text
Highest Marks (Max):
Maths    92.0
IP       95.0
dtype: float64

Total Marks (Sum):
Maths    345.0
IP       273.0
dtype: float64

Count of Values:
Maths    4
IP       3    <-- Note: Chirag's NaN was not counted
dtype: int64

```

### 2.4.2 Mean, Median, Mode
### 💻 Code Example
```python
print("Average (Mean):\n", df[['Maths', 'IP']].mean())
print("\nMedian (Middle Value):\n", df[['Maths', 'IP']].median())

```

### 📟 Output
```text
Average (Mean):
Maths    86.25
IP       91.00
dtype: float64

Median (Middle Value):
Maths    87.5
IP       90.0
dtype: float64

```
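
The heading also lists the mode, which the example above does not show. The sketch below uses two small made-up Series (the values are hypothetical, chosen so that a value repeats) because every mark in the Master DataFrame occurs only once.

```python
import pandas as pd

# Hypothetical marks in which 90 appears twice
marks = pd.Series([90, 85, 90, 78])
print(marks.mode().tolist())   # [90]

# When two values tie for most frequent, mode() returns both, sorted
tied = pd.Series([70, 70, 80, 80, 95])
print(tied.mode().tolist())    # [70, 80]
```

Unlike `mean()` and `median()`, `mode()` returns a Series rather than a single number, because a dataset can have more than one mode.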

### 2.4.3 The `describe()` Function
The "Swiss Army Knife" of statistics. For every numeric column it reports the count, mean, standard deviation, minimum, quartiles, and maximum in one call.

### 💻 Code Example
```python
print(df.describe())

```

### 📟 Output
```text
           Maths         IP
count   4.000000   3.000000   <-- Count
mean   86.250000  91.000000   <-- Average
std     6.238322   3.605551   <-- Standard Deviation
min    78.000000  88.000000   <-- Minimum
25%    83.250000  89.000000   <-- 25th Percentile
50%    87.500000  90.000000   <-- Median (50%)
75%    90.500000  92.500000   <-- 75th Percentile
max    92.000000  95.000000   <-- Maximum

```
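
The `std` row is the sample standard deviation, which divides by n − 1 rather than n. As a minimal check, the sketch below recomputes it by hand for the Maths marks and compares the result with `Series.std()`; the variable names are illustrative.

```python
import math
import pandas as pd

maths = pd.Series([90, 85, 78, 92])   # Maths marks from the Master DataFrame

n = len(maths)
mean = maths.mean()                          # 86.25
squared_devs = ((maths - mean) ** 2).sum()   # sum of squared deviations
sample_std = math.sqrt(squared_devs / (n - 1))

print(round(sample_std, 6))
print(round(maths.std(), 6))   # pandas uses the same n - 1 formula by default
```

Both lines print the same number, which is the value `describe()` reports in the `std` row for Maths.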

---

## 🔍 2.5 Essential Functions
### 2.5.1 Inspection (`info`, `head`, `tail`)
### 💻 Code Example
```python
# Shows structure of the DataFrame
df.info()

print("\n--- Top 2 Rows (head) ---")
print(df.head(2))

```

### 📟 Output
```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Name     4 non-null      object 
 1   Maths    4 non-null      int64  
 2   IP       3 non-null      float64
 3   Section  4 non-null      object 
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes

--- Top 2 Rows (head) ---
    Name  Maths    IP Section
0  Arjun     90  88.0       A
1   Bina     85  95.0       A

```
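
The heading also names `tail`, the mirror image of `head`. The sketch below rebuilds the Master DataFrame so that it runs on its own and prints the last two rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# Last 2 rows: Chirag (index 2) and Divya (index 3)
print(df.tail(2))

# With no argument, head() and tail() return up to 5 rows;
# this DataFrame has only 4, so all of them come back
print(len(df.tail()))   # 4
```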

---

## 🚫 2.6 Handling Missing Data
In our data, **Chirag** has `NaN` in IP. Let's fix it.

### 2.6.1 Detecting Missing Data
### 💻 Code Example
```python
print(df.isnull()) # True if data is missing

```

### 📟 Output
```text
    Name  Maths     IP  Section
0  False  False  False    False
1  False  False  False    False
2  False  False   True    False   <-- Look at index 2 (IP)
3  False  False  False    False

```

### 2.6.2 Dropping & Filling
### 💻 Code Example
```python
# Strategy 1: Delete rows with missing data
print("--- After Dropna ---")
print(df.dropna())

# Strategy 2: Fill missing data with 0
print("\n--- After Fillna(0) ---")
print(df.fillna(0))

```

### 📟 Output
```text
--- After Dropna ---
    Name  Maths    IP Section
0  Arjun     90  88.0       A
1   Bina     85  95.0       A
3  Divya     92  90.0       B   <-- Chirag is gone!

--- After Fillna(0) ---
     Name  Maths    IP Section
0   Arjun     90  88.0       A
1    Bina     85  95.0       A
2  Chirag     78   0.0       B   <-- Chirag is back with 0 marks
3   Divya     92  90.0       B

```
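
Filling with 0 treats an absent student as having scored zero, which pulls the class average down. A common alternative, sketched below as one possible choice rather than a rule, is to count the gaps with `isnull().sum()` and fill the IP column with its own mean. Passing a dictionary to `fillna()` limits the fill to the named column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# How many values are missing in each column?
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Mean of the three IP marks that are present: (88 + 95 + 90) / 3 = 91.0
ip_mean = df['IP'].mean()

# Fill only the 'IP' column; fillna() returns a new DataFrame
filled = df.fillna({'IP': ip_mean})
print(filled)
```

Both `dropna()` and `fillna()` leave the original `df` unchanged unless the result is assigned back.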

---

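## 📦 2.7 Function `groupby()`
`groupby()` splits the rows into groups that share a value. Let's group our students by **Section**.

### 💻 Code Example
```python
# Group by 'Section' and find the average marks for each section.
# Only the numeric columns are selected: in pandas 2.0+ calling mean()
# on the text column 'Name' raises a TypeError.
grouped_df = df.groupby('Section')[['Maths', 'IP']].mean()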
print(grouped_df)

```

### 📟 Output
```text
         Maths    IP
Section             
A         87.5  91.5
B         85.0  90.0

```

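* **Explanation (Maths column):**
    * **Section A:** (Arjun 90 + Bina 85)/2 = **87.5**
    * **Section B:** (Chirag 78 + Divya 92)/2 = **85.0**
* **IP column:** Section B shows **90.0** because Chirag's `NaN` is skipped, leaving only Divya's 90.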
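
The split, apply, and combine steps can also be seen separately. In the sketch below, looping over the `groupby` object shows the split, and `agg()` applies several functions and combines the results into one table. The choice of `min`, `max` and `mean` is only an example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# Split: each group is a smaller DataFrame holding one section's rows
for section, group in df.groupby('Section'):
    print(section, list(group['Name']))

# Apply + combine: several statistics per section in one table
summary = df.groupby('Section')['Maths'].agg(['min', 'max', 'mean'])
print(summary)
```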



---

## 🚀 Pro Tips for the Exam
1. **Iterrows vs Items:** Remember `iterrows()` = Horizontal (row by row, slow), `items()` = Vertical (column by column, fast). Older question papers may write `items()` as `iteritems()`.
2. **Describe:** If asked "Which function gives the summary of the dataset?", the answer is `describe()`.
3. **Axis:**
    * `axis=0` means calculate **down** the column (default).
    * `axis=1` means calculate **across** the row.
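
The axis rule can be checked with a short sketch on the two marks columns of the Master DataFrame: `axis=0` gives one total per subject, `axis=1` gives one total per student.

```python
import numpy as np
import pandas as pd

marks = pd.DataFrame({
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
}, index=['Arjun', 'Bina', 'Chirag', 'Divya'])

# axis=0 (default): down each column, one result per subject
subject_totals = marks.sum(axis=0)
print(subject_totals)   # Maths 345.0, IP 273.0

# axis=1: across each row, one result per student
student_totals = marks.sum(axis=1)
print(student_totals)   # 178.0, 180.0, 78.0, 182.0
```

Chirag's total is 78.0 because his missing IP mark is skipped, not counted as an error.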



---