# a. What is CSV?

- CSV (Comma-Separated Values) is a simple and widely used file format for storing structured data.
- Each row in a CSV file represents a record, and the fields within a record are separated by a delimiter (typically a comma, though semicolons, tabs, and other characters are also used).
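
To make the row-and-field structure concrete, here is a minimal sketch that parses a small in-memory CSV string with the built-in csv module. The sample data is invented for illustration; note how a field containing the delimiter is wrapped in quotes.

```python
import csv
import io

# A small in-memory CSV document: a header row followed by two records.
# The last record quotes a field because it contains the delimiter itself.
raw = 'name,city,age\nAlice,Paris,30\nBob,"Austin, TX",25\n'

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

print(header)
for record in records:
    print(record)
```

Every value comes back as a string; converting `age` to a number is left to the caller.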

## Advantages

- Lightweight and easy to create.
- Supported by almost all data tools and programming languages.

## Common File Extensions

- .csv
- .txt (sometimes used with a CSV structure).

# b. How to Read CSV Files

- In Python, CSV files can be read with the third-party pandas library or with the built-in csv module.

## 1. Using pandas:

In [None]:
import pandas as pd

# Read a CSV file
df = pd.read_csv("file.csv")
print(df.head())

# Read a CSV file with a custom delimiter
df = pd.read_csv("file.csv", sep=";")

## 2. Using Python's Built-in csv Module:

In [None]:
import csv

# newline="" lets the csv module handle line endings inside quoted fields
with open("file.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
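
When the file has a header row, `csv.DictReader` can be more convenient than `csv.reader`, because each row is returned as a mapping from column name to value. The sketch below uses an invented in-memory sample instead of a file on disk.

```python
import csv
import io

# Hypothetical sample data standing in for a file with a header row.
sample = io.StringIO("id,name\n1,Alice\n2,Bob\n")

# DictReader takes the field names from the first row.
reader = csv.DictReader(sample)
people = [row for row in reader]

# Fields are looked up by column name rather than by position.
names = [row["name"] for row in people]
print(names)
```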

## 3. Reading Large CSV Files in Chunks:

In [None]:
# Process large CSV files in smaller chunks
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    print(chunk.head())

# c. Issues Encountered When Handling CSV Files

1. Delimiter Issues:
 - Not all CSV files use commas as delimiters. Some may use semicolons, tabs, or other characters.
2. Encoding Problems:
 - Files saved in an encoding other than UTF-8 may raise errors when read with the default settings.
 - Example: `UnicodeDecodeError`.
3. Missing or Inconsistent Data:
 - Some fields may be empty, and rows may contain differing numbers of fields.
4. Header Issues:
 - Files may lack headers or have duplicate/misaligned headers.
5. Large File Sizes:
 - Processing very large CSV files can lead to memory issues.

# d. How to Overcome These Issues

1. Delimiter Issues:
 - Specify the correct delimiter while reading:

In [None]:
df = pd.read_csv("file.csv", sep=";")
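
If the delimiter is not known in advance, the built-in `csv.Sniffer` class can try to guess it from a sample of the text. This is a sketch with invented data; the sniffer is a heuristic and can guess wrong on small or irregular samples, so it is worth checking the result.

```python
import csv
import io

# Hypothetical sample that uses semicolons instead of commas.
sample_text = "name;city;age\nAlice;Paris;30\nBob;Austin;25\n"

# Restrict the guess to a few plausible delimiter characters.
dialect = csv.Sniffer().sniff(sample_text, delimiters=",;\t|")
detected = dialect.delimiter

# Parse the sample using the detected delimiter.
rows = list(csv.reader(io.StringIO(sample_text), delimiter=detected))
print(repr(detected), rows[1])
```

The detected character can then be passed to pandas as the `sep` argument.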

2. Encoding Problems:
 - Explicitly set the encoding:

In [None]:
df = pd.read_csv("file.csv", encoding="ISO-8859-1")
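
When the encoding is unknown, one possible approach is to try UTF-8 first and fall back to another encoding only if decoding fails. The helper `read_text_with_fallback` below is a hypothetical name introduced for this sketch, and the demo file is created only for illustration. ISO-8859-1 accepts any byte sequence, so it works as a last resort but may silently produce wrong characters if the file actually uses a different encoding.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding, newline="") as handle:
                return handle.read(), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo: write a small file in ISO-8859-1, which is not valid UTF-8 here.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name,city\nJosé,Málaga\n".encode("ISO-8859-1"))
    demo_path = tmp.name

text, used_encoding = read_text_with_fallback(demo_path)
os.remove(demo_path)  # clean up the temporary demo file
print(used_encoding)
```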

3. Handling Missing Data:
 - Fill missing values:

In [None]:
df.fillna("Unknown", inplace=True)

 - Drop rows/columns with missing data:

In [None]:
df.dropna(inplace=True)
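
Before filling or dropping, it often helps to count the missing values per column and choose a fill value that suits each column's type. The sketch below uses invented data; filling `age` with the median and `city` with "Unknown" is one illustrative choice, not a general recommendation.

```python
import io

import pandas as pd

# Hypothetical sample with one missing age and one missing city.
csv_text = "name,age,city\nAlice,30,Paris\nBob,,Austin\nCara,41,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill each column with a value appropriate to its type.
filled = df.fillna({"age": df["age"].median(), "city": "Unknown"})
print(filled)
```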

4. Header Issues:
 - Manually assign headers:

In [None]:
df = pd.read_csv("file.csv", header=None, names=["Col1", "Col2", "Col3"])
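
Duplicate or misaligned headers can be spotted before the data is loaded into pandas. The following sketch uses the built-in csv module on invented data: it reports column names that appear more than once and the line numbers of rows whose field count differs from the header's.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: "name" appears twice, and line 3 is one field short.
sample = io.StringIO("id,name,name,score\n1,Alice,Al,9\n2,Bob,7\n")
reader = csv.reader(sample)

header = next(reader)
duplicates = [name for name, count in Counter(header).items() if count > 1]

# reader.line_num is the number of source lines read so far.
misaligned = [reader.line_num for row in reader if len(row) != len(header)]

print("duplicate columns:", duplicates)
print("misaligned lines:", misaligned)
```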

5. Optimizing for Large Files:
 - Use chunk processing:

In [None]:
for chunk in pd.read_csv("file.csv", chunksize=5000):
    process(chunk)  # process() is a placeholder for your own per-chunk logic

 - For very large files, consider libraries designed for larger-than-memory or parallel workloads, such as dask or polars.
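
The per-chunk processing step in the chunked loop is left open. As one possible illustration, the sketch below aggregates a total per group across chunks, so that only one chunk is held in memory at a time. The column names `region` and `amount`, the sample data, and the tiny chunk size are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical data: ten rows alternating between two regions.
rows = [("north" if i % 2 == 0 else "south", i) for i in range(10)]
csv_text = "region,amount\n" + "".join(f"{region},{amount}\n" for region, amount in rows)

totals = {}
# A small chunksize is used here only so the sample spans several chunks.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Aggregate within the chunk, then merge into the running totals.
    partial = chunk.groupby("region")["amount"].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + int(amount)

print(totals)
```

This pattern works for aggregations that can be combined from partial results, such as sums and counts; statistics like the median need a different approach.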