# Data Visualization using Matplotlib
## 1. Definitions & Key Terminology
* **Data Visualization:** The art of translating information into a visual context (maps, graphs) to help the human brain understand data and pull insights quickly.
* **Matplotlib:** A comprehensive **Python library** for creating static, animated, and interactive visualizations. It is the foundation for many other libraries (like Seaborn) and comes pre-installed with Anaconda.
* **Pyplot:** A specific *module* within Matplotlib (imported as `plt`) that mimics the interface of MATLAB, allowing users to create 2D plots easily.
* **Figure:** The "Canvas". The top-level container holding all plot elements (axes, titles, legends).
* **Axes:** The actual region where data is plotted. A figure can have multiple axes (subplots), but an axes belongs to only one figure.
* **Axis:** The number-lines that handle scales, limits, and ticks (marks).
* **Marker:** A symbol (dot, star, square) representing a specific data point.
* **Legend:** The key that identifies what different colors or line styles represent.
* **Histogram:** A graph showing the **frequency distribution** of continuous data (grouped into "bins").
* **Box Plot:** Displays data distribution based on a five-number summary (Min, Q1, Median, Q3, Max).
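
The Figure / Axes / Axis terms above can be seen in a few lines. This is a minimal sketch with made-up data; the variable names (`ax_left`, `ax_right`) are illustrative only.

```python
import matplotlib.pyplot as plt

# One Figure (the canvas) holding two Axes (plotting regions) side by side.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Marker: a circle drawn at every data point; label feeds the legend.
ax_left.plot([1, 2, 3], [2, 4, 8], marker="o", label="growth")
ax_left.legend()                          # Legend: key for the labelled line
ax_right.hist([1, 2, 2, 3, 3, 3], bins=3)

# Each Axes owns Axis objects that manage scale, limits, and ticks.
ax_left.xaxis.set_ticks([1, 2, 3])
ax_left.set_ylim(0, 10)

num_axes = len(fig.axes)                  # Axes contained in the Figure
same_figure = ax_left.figure is fig       # an Axes belongs to one Figure
# Call plt.show() here to display the result in a script.
```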
---
## 2. Concepts & Architecture
### 2.1 Why Visualize Data?
In the era of **Big Data**, raw tables are hard to read. Visualization helps because:
1. **Better Analysis:** Reveals hidden trends and correlations.
2. **Quick Action:** The brain processes visuals faster than text.
3. **Pattern Recognition:** Identifies seasonal trends or exponential growth.
4. **Error Spotting:** Visual outliers (spikes) help find bad data.
5. **Business Insights:** Helps decision-makers grasp facts instantly.
### 2.2 The Matplotlib Architecture
Matplotlib has three layers:
1. **Backend Layer:** Renders the plot to screen or file.
2. **Artist Layer:** Contains visuals like titles, lines, and text.
3. **Scripting Layer (Pyplot):** The user-friendly interface for writing code.
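
To make the three layers concrete, the sketch below skips Pyplot and works with the Artist and Backend layers directly. It assumes the Agg (raster image) backend; in everyday code the Scripting layer does this wiring for you.

```python
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

fig = Figure(figsize=(4, 3))             # Artist layer: the top-level Figure
canvas = FigureCanvasAgg(fig)            # Backend layer: Agg renders to pixels
ax = fig.add_subplot(1, 1, 1)            # Artist layer: an Axes in the Figure
(line,) = ax.plot([0, 1, 2], [0, 1, 4])  # plot() returns Line2D artists
ax.set_title("Built without pyplot")

canvas.draw()                            # backend renders all artists
width, height = canvas.get_width_height()  # rendered size in pixels
```

With Pyplot, `plt.plot([0, 1, 2], [0, 1, 4])` creates the figure, canvas, and axes behind the scenes.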
### 2.2.1 Installation & Import
**Install:**
```bash
pip install matplotlib
```
**Standard Import:**
```python
import matplotlib.pyplot as plt
```
### 2.2.2 The Pyplot "State Machine"
Pyplot tracks the *current* figure and axes. Any command you type (like `plt.plot()`) applies to the currently active chart.
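
A short sketch of the "current figure" idea, using `plt.gcf()` (get current figure) and `plt.gca()` (get current axes). The data is arbitrary.

```python
import matplotlib.pyplot as plt

fig1 = plt.figure()              # fig1 becomes the current figure
plt.plot([1, 2, 3])              # drawn on fig1 (axes created on demand)

fig2 = plt.figure()              # a new figure is now current
plt.plot([3, 2, 1])              # drawn on fig2, not fig1

current_is_fig2 = plt.gcf() is fig2
plt.figure(fig1.number)          # make fig1 the current figure again
current_is_fig1 = plt.gcf() is fig1
ax = plt.gca()                   # current axes, which belong to fig1
```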
---
## 3. Creating Charts
### 3.1 Line & Scatter Charts
#### Line Chart (`plt.plot()`)
The default plot type. Best for showing **trends over time** (time-series).
* **Syntax:** `plt.plot(x, y)`
* **Note:** If you only provide one list `plt.plot(y)`, Matplotlib assumes they are Y-values and automatically generates X-values `[0, 1, 2...]`.
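
The auto-generated X-values can be checked by reading them back from the line object that `plt.plot()` returns. A minimal sketch with arbitrary Y-values:

```python
import matplotlib.pyplot as plt

plt.figure()
y_values = [10, 20, 15, 30]
(line,) = plt.plot(y_values)                    # only Y-values supplied
auto_x = [float(v) for v in line.get_xdata()]   # [0.0, 1.0, 2.0, 3.0]
```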
#### Scatter Chart (`plt.scatter()`)
Displays individual data points **without connecting lines**.
* **Use Case:** Observing relationships or **correlations** between two variables (e.g., Study Hours vs. Marks).
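
A minimal scatter sketch for the Study Hours vs. Marks case. The numbers are hypothetical, chosen only to show an upward trend.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration only.
study_hours = [1, 2, 3, 4, 5, 6]
marks = [35, 42, 50, 61, 68, 80]

plt.figure()
points = plt.scatter(study_hours, marks, color="g", marker="s")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs. Marks")

num_points = len(points.get_offsets())   # one marker per (x, y) pair
# Call plt.show() here to display the chart in a script.
```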
### 3.2 Bar & Pie Charts
#### Bar Chart (`plt.bar()`)
Used for comparing **categorical data** (discrete categories).
* **Vertical:** `plt.bar(x, height)`
* **Horizontal:** `plt.barh(y, width)` (Good for long category names).
* **Multiple Bars:** You must manually offset the X-coordinates so bars don't overlap.
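
The manual offset for multiple bars could look like the sketch below: each series is shifted by half a bar width around the group position. City names and sales figures are made up.

```python
import matplotlib.pyplot as plt

# Hypothetical sales for two years across three cities.
cities = ["Delhi", "Mumbai", "Pune"]
sales_2022 = [40, 55, 30]
sales_2023 = [45, 60, 38]

bar_width = 0.4
positions = list(range(len(cities)))             # group centres: 0, 1, 2
left = [p - bar_width / 2 for p in positions]    # first series shifted left
right = [p + bar_width / 2 for p in positions]   # second series shifted right

plt.figure()
bars_a = plt.bar(left, sales_2022, width=bar_width, label="2022")
bars_b = plt.bar(right, sales_2023, width=bar_width, label="2023")
plt.xticks(positions, cities)    # category names at the group centres
plt.legend()
# Call plt.show() here to display the chart in a script.
```

Because each bar is centred on its X-coordinate, the two bars in a group touch at the group centre without overlapping.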
#### Pie Chart (`plt.pie()`)
Shows numerical proportions (composition of a whole).
* **Key Params:** `autopct` (shows %), `explode` (highlights a slice).
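
A small sketch of `autopct` and `explode` together. The budget categories and amounts are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical budget split, for illustration only.
labels = ["Rent", "Food", "Travel", "Savings"]
amounts = [40, 30, 10, 20]
explode = [0, 0, 0, 0.1]     # pull the last slice out by 10% of the radius

plt.figure()
wedges, texts, autotexts = plt.pie(
    amounts, labels=labels, explode=explode, autopct="%1.1f%%"
)
plt.title("Monthly Budget")

# autopct produces one percentage label per slice.
percent_labels = [t.get_text() for t in autotexts]
# Call plt.show() here to display the chart in a script.
```

The format string `"%1.1f%%"` prints each share with one decimal place followed by a literal percent sign.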
### 3.3 Histograms & Box Plots
#### Histogram (`plt.hist()`)
Shows frequency of **continuous** data.
* **Bins:** The ranges into which data is grouped.
* **Visual Distinction:** Unlike bar charts, histograms usually have **no gaps** between bars.
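
To see what binning does, the values returned by `plt.hist()` can be inspected. A minimal sketch reusing the marks data from the examples in these notes:

```python
import matplotlib.pyplot as plt

marks = [10, 15, 20, 20, 25, 30, 35, 40, 45, 50, 50, 55]

plt.figure()
counts, bin_edges, patches = plt.hist(marks, bins=5, edgecolor="black")

# counts[i] is how many values fall between bin_edges[i] and bin_edges[i + 1];
# the last bin also includes its right edge.
bin_width = bin_edges[1] - bin_edges[0]
```

With `bins=5`, the range 10 to 55 is split into five equal-width bins of 9 marks each. Passing a list such as `bins=[0, 20, 40, 60]` sets the bin edges explicitly instead.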
#### Box Plot (`plt.boxplot()`)
Visualizes a statistical summary (IQR, Median, Outliers).
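
The sketch below computes the five-number summary with NumPy and draws the matching box plot. It assumes NumPy's default (linear interpolation) percentiles. Note that by default Matplotlib draws the whiskers to the furthest points within 1.5 × IQR of the box and plots anything beyond as outliers, so the whisker ends equal Min and Max only when there are no outliers.

```python
import matplotlib.pyplot as plt
import numpy as np

marks = [10, 15, 20, 20, 25, 30, 35, 40, 45, 50, 50, 55]

# Five-number summary: Min, Q1, Median, Q3, Max.
minimum, q1, median, q3, maximum = np.percentile(marks, [0, 25, 50, 75, 100])
iqr = q3 - q1                        # interquartile range (height of the box)

plt.figure()
result = plt.boxplot(marks)          # returns a dict of the drawn artists
plt.title("Marks Box Plot")

# The median line drawn by Matplotlib matches the NumPy median.
plotted_median = result["medians"][0].get_ydata()[0]
# Call plt.show() here to display the chart in a script.
```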
---
## 4. Customization & Syntax
### 4.1 Anatomy Customization
| Function | Description |
| --- | --- |
| `plt.figure(figsize=(w,h))` | Sets chart size in inches. |
| `plt.title("Text")` | Adds a heading. |
| `plt.xlabel("Text")` | Labels the X-axis. |
| `plt.ylabel("Text")` | Labels the Y-axis. |
| `plt.grid(True)` | Turns on grid lines. |
| `plt.legend()` | Displays the legend (requires `label=` in plot). |
| `plt.savefig("name.png")` | Saves the chart. **Call it before `plt.show()`**, otherwise the saved file may be blank. |
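
The functions in the table can be combined as in the sketch below. The output path is a temporary, illustrative location; in practice any file name such as `"name.png"` works.

```python
import os
import tempfile

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))                    # 6 x 4 inches
plt.plot([1, 2, 3], [2, 4, 6], label="y = 2x")
plt.title("Customized Chart")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.legend()                                  # uses label= from plt.plot()

# Save first; the temporary directory is only for this example.
output_path = os.path.join(tempfile.mkdtemp(), "chart.png")
plt.savefig(output_path)
# plt.show() would go here, after the file has been written.
```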
### 4.2 Style Parameters
**Common Color Codes:**
| Code | Color | Code | Color |
| :---: | :--- | :---: | :--- |
| `'b'` | 🔵 Blue | `'r'` | 🔴 Red |
| `'g'` | 🟢 Green | `'k'` | ⚫ Black |
| `'y'` | 🟡 Yellow | `'w'` | ⚪ White |
**Line Styles & Markers:**
| Style Code | Description | Marker | Description |
| :---: | :--- | :---: | :--- |
| `'-'` | Solid (Default) | `'o'` | Circle |
| `'--'` | Dashed | `'*'` | Star |
| `':'` | Dotted | `'s'` | Square |
| `'-.'` | Dash-dot | `'^'` | Triangle |
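
The codes in these tables can be passed as separate keyword arguments or combined into one short format string. A minimal sketch with arbitrary data:

```python
import matplotlib.pyplot as plt

plt.figure()
# Keyword form: color, line style, and marker spelled out separately.
(line_a,) = plt.plot([1, 2, 3], [1, 4, 9], color="g", linestyle=":", marker="^")
# Shorthand format string: 'r' (red) + '--' (dashed) + 'o' (circle).
(line_b,) = plt.plot([1, 2, 3], [1, 2, 3], "r--o")
# Call plt.show() here to display the chart in a script.
```

The shorthand is compact for quick plots; the keyword form is easier to read when several style options are set.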
### 4.3 Plotting from Pandas
```python
# x and y name two different columns of the DataFrame
dataframe.plot(kind='bar', x='x_col', y='y_col', color='red')
```
*Kinds:* `'line'`, `'bar'`, `'barh'`, `'hist'`, `'box'`, `'pie'`, `'scatter'`
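
A runnable sketch of the same call, assuming pandas is installed. The DataFrame and its column names (`city`, `sales`) are made up for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune"], "sales": [40, 55, 30]})

# DataFrame.plot() draws with Matplotlib and returns the Axes it used.
ax = df.plot(kind="bar", x="city", y="sales", color="red")
ax.set_ylabel("Sales")

num_bars = len(ax.patches)       # one rectangle per row of the DataFrame
# Call plt.show() here to display the chart in a script.
```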
---
## 5. Code Examples
### Example 1: Custom Line Chart
```python
import matplotlib.pyplot as plt
years = [2020, 2021, 2022, 2023]
sales = [5000, 7000, 6500, 8000]
# Customization: Red dashed line with circle markers
plt.plot(years, sales, color='r', linestyle='--', marker='o', label='Sales')
plt.title("Annual Sales Report")
plt.xlabel("Year")
plt.ylabel("Sales ($)")
plt.grid(True)
plt.legend()
plt.show()
```
### Example 2: Histogram (Frequency)
```python
import matplotlib.pyplot as plt
marks = [10, 15, 20, 20, 25, 30, 35, 40, 45, 50, 50, 55]
# bins=5 splits the range of the data into 5 equal-width intervals
plt.hist(marks, bins=5, edgecolor='black', color='cyan')
plt.title("Marks Distribution")
plt.ylabel("Number of Students")
plt.show()
```
---
## 6. Comparisons
### Line Chart vs. Scatter Plot
| Feature | Line Chart | Scatter Plot |
| --- | --- | --- |
| **Purpose** | Trends over time (continuity). | Correlation between variables. |
| **Connection** | Points connected by lines. | Standalone markers. |
| **Function** | `plt.plot()` | `plt.scatter()` |
### Bar Chart vs. Histogram
| Feature | Bar Chart | Histogram |
| --- | --- | --- |
| **Data Type** | Categorical / Discrete. | Continuous / Numerical. |
| **X-Axis** | Distinct Categories (e.g., Cities). | Numerical Bins (Ranges). |
| **Gaps** | Bars usually have gaps. | Bars touch (no gaps). |
| **Ordering** | Can be reordered. | Must follow numerical order. |
---
## 7. Common Errors & Troubleshooting
* **The Blank Image Error:**
  * *Mistake:* Calling `plt.savefig()` *after* `plt.show()`.
  * *Reason:* In a script, once the window opened by `show()` is closed, Pyplot typically discards that figure, so a later `savefig()` writes a new, empty one.
  * *Fix:* **Always Save First!** `savefig` -> `show`.
* **Mismatched Lists:**
  * *Error:* `ValueError: x and y must have same first dimension`.
  * *Fix:* Ensure `len(x) == len(y)`.
* **Confusing Bar Arguments:**
  * `plt.plot(y)` works (auto-generates x).
  * `plt.bar(y)` **fails** with a `TypeError` because `height` is a required argument. You must provide `plt.bar(x, height)`.
* **Style Fail:**
  * Passing `linestyle='--'` to `plt.scatter()` and expecting connected points. Scatter plots draw no connecting lines; use `plt.plot()` when the points should be joined.
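
Two of the errors above can be reproduced safely inside `try`/`except` blocks. A minimal sketch with arbitrary values:

```python
import matplotlib.pyplot as plt

plt.figure()
errors = {}

try:
    plt.plot([1, 2, 3], [10, 20])        # 3 x-values but only 2 y-values
except ValueError as exc:
    errors["mismatched"] = type(exc).__name__

try:
    plt.bar([10, 20, 30])                # height argument is missing
except TypeError as exc:
    errors["bar"] = type(exc).__name__

# Supplying the positions explicitly makes the bar call valid.
bars = plt.bar(range(3), [10, 20, 30])
```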
---
## 8. Exam-Oriented Short Notes
* **Import:** `import matplotlib.pyplot as plt`
* **Order of Ops:** Define Data -> Plot Data -> Customize (Labels/Title) -> Save -> Show.
* **Grid:** `plt.grid(True)` adds grid lines, which make values easier to read off the graph.
* **Legend:** Requires `label='Name'` inside the plot function to work.
* **Case Sensitive:** `plt.plot()` is correct; `plt.Plot()` is wrong and raises an `AttributeError`.
* **Pandas:** `df.plot(kind='...')` is usually the quickest way to plot an existing DataFrame.
---