{ "cells": [ { "cell_type": "markdown", "id": "288dc3d6-2f59-4af4-b9a0-ac11110c95a4", "metadata": {}, "source": [ "# a. What is CSV?" ] }, { "cell_type": "markdown", "id": "8a29ef9f-d2b1-44ae-aa00-b7307dc1f1fa", "metadata": {}, "source": [ "- CSV (Comma-Separated Values) is a simple and widely used file format for storing structured data.\n", "- Each row in a CSV file represents a record, and fields within a record are separated by a delimiter (typically a comma, but can also be semicolons, tabs, etc.)." ] }, { "cell_type": "markdown", "id": "aed4bbe7-49a7-44f7-a222-1dbc76b94b74", "metadata": {}, "source": [ "## Advantages" ] }, { "cell_type": "markdown", "id": "0908c962-52a0-481d-9c4a-734d0954aeb5", "metadata": {}, "source": [ "- Lightweight and easy to create.\n", "- Supported by almost all data tools and programming languages." ] }, { "cell_type": "markdown", "id": "9a3c3937-cb91-411b-8606-16728aabbbc1", "metadata": {}, "source": [ "## Common File Extensions" ] }, { "cell_type": "markdown", "id": "41bf2a14-0cc1-458b-be33-62e9431a9b31", "metadata": {}, "source": [ "- .csv\n", "- .txt (sometimes used with a CSV structure)." ] }, { "cell_type": "markdown", "id": "00250776-617f-49d9-88bb-e6cba943f599", "metadata": {}, "source": [ "# b. How to Read CSV Files" ] }, { "cell_type": "markdown", "id": "98989d08-8d4d-4a02-82b1-ba08757e71ff", "metadata": {}, "source": [ "- Using Python, CSV files can be handled with libraries such as pandas or Python's built-in csv module." ] }, { "cell_type": "markdown", "id": "6776fe4e-8155-47ff-99f4-ec26c916c45d", "metadata": {}, "source": [ "## 1. 
Using pandas:" ] }, { "cell_type": "code", "execution_count": null, "id": "a508ae8d-3a3d-43f0-9453-11c87877b2b1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Read a CSV file and preview the first five rows\n", "df = pd.read_csv(\"file.csv\")\n", "print(df.head())\n", "\n", "# Read a CSV file that uses a custom delimiter (here, a semicolon)\n", "df = pd.read_csv(\"file.csv\", sep=\";\")" ] }, { "cell_type": "markdown", "id": "7c3f7a6a-0c13-45f2-930b-2c5796985efd", "metadata": {}, "source": [ "## 2. Using Python's Built-in csv Module:" ] }, { "cell_type": "code", "execution_count": null, "id": "a33ffb8b-88b6-4061-b816-00397f2b3a3e", "metadata": {}, "outputs": [], "source": [ "import csv\n", "\n", "# newline=\"\" lets the csv module handle line endings itself\n", "with open(\"file.csv\", \"r\", newline=\"\") as file:\n", "    reader = csv.reader(file)\n", "    # Each row is returned as a list of strings\n", "    for row in reader:\n", "        print(row)" ] }, { "cell_type": "markdown", "id": "2a57c10b-51bf-4a4e-978a-51644964b856", "metadata": {}, "source": [ "## 3. Reading Large CSV Files in Chunks:" ] }, { "cell_type": "code", "execution_count": null, "id": "5a056573-3d16-400a-8ccd-a15d0398b454", "metadata": {}, "outputs": [], "source": [ "# Process a large CSV file in chunks of 1,000 rows\n", "for chunk in pd.read_csv(\"large_file.csv\", chunksize=1000):\n", "    print(chunk.head())" ] }, { "cell_type": "markdown", "id": "b52ebad6-0c64-4317-974a-3498f05feaea", "metadata": {}, "source": [ "# c. Issues Encountered When Handling CSV Files" ] }, { "cell_type": "markdown", "id": "8fb34287-7754-4170-8095-46c2a82db4ba", "metadata": {}, "source": [ "1. Delimiter Issues:\n", "   - Not all CSV files use commas as delimiters. Some use semicolons, tabs, or other characters.\n", "2. Encoding Problems:\n", "   - Files saved in an encoding other than UTF-8 may cause errors while reading.\n", "   - Example: `UnicodeDecodeError`.\n", "3. Missing or Inconsistent Data:\n", "   - Some fields may be empty, and rows may contain different numbers of fields.\n", "4. Header Issues:\n", "   - Files may lack headers or have duplicate/misaligned headers.\n", "5. 
Large File Sizes:\n", "   - Loading a very large CSV file into memory all at once can lead to memory issues." ] }, { "cell_type": "markdown", "id": "67c01a56-9b7c-46ba-8a79-9586a244978c", "metadata": {}, "source": [ "# d. How to Overcome These Issues" ] }, { "cell_type": "markdown", "id": "45564d75-7870-45e1-8d53-e78ff71ff018", "metadata": {}, "source": [ "1. Delimiter Issues:\n", "   - Specify the correct delimiter while reading:" ] }, { "cell_type": "code", "execution_count": null, "id": "36c282a6-cbdc-4a3e-933a-91080ea4dccc", "metadata": {}, "outputs": [], "source": [ "# sep sets the field delimiter (here, a semicolon)\n", "df = pd.read_csv(\"file.csv\", sep=\";\")" ] }, { "cell_type": "markdown", "id": "7b998672-5d7e-4a6a-8cc4-36b18446b9be", "metadata": {}, "source": [ "2. Encoding Problems:\n", "   - Explicitly set the encoding:" ] }, { "cell_type": "code", "execution_count": null, "id": "2657d869-a303-4e03-bc07-b15f012f76e6", "metadata": {}, "outputs": [], "source": [ "# Use the encoding the file was actually saved in (ISO-8859-1 is one example)\n", "df = pd.read_csv(\"file.csv\", encoding=\"ISO-8859-1\")" ] }, { "cell_type": "markdown", "id": "113e7e43-7031-4904-9e87-c9df4acefaff", "metadata": {}, "source": [ "3. Handling Missing Data:\n", "   - Fill missing values:" ] }, { "cell_type": "code", "execution_count": null, "id": "67ea80c7-7a86-4694-b6ff-a55ca27caad5", "metadata": {}, "outputs": [], "source": [ "# Replace every missing value in the DataFrame with the string \"Unknown\"\n", "df.fillna(\"Unknown\", inplace=True)" ] }, { "cell_type": "markdown", "id": "a6d7a40b-c495-4482-b006-767c14209bf2", "metadata": {}, "source": [ "- Drop rows with missing data (pass `axis=1` to drop columns instead):" ] }, { "cell_type": "code", "execution_count": null, "id": "c8191abd-6281-466e-aeba-5f8df351de2d", "metadata": {}, "outputs": [], "source": [ "# Drop every row that contains at least one missing value\n", "df.dropna(inplace=True)" ] }, { "cell_type": "markdown", "id": "6542d341-d38f-4c59-a5ca-d2503bd35e51", "metadata": {}, "source": [ "4. 
Header Issues:\n", "   - Manually assign headers when the file has none:" ] }, { "cell_type": "code", "execution_count": null, "id": "3f2ee8b5-c54d-4349-b473-a8c3d6230c38", "metadata": {}, "outputs": [], "source": [ "# header=None treats the first line as data; names supplies the column labels\n", "df = pd.read_csv(\"file.csv\", header=None, names=[\"Col1\", \"Col2\", \"Col3\"])" ] }, { "cell_type": "markdown", "id": "2461fa9d-02bb-4008-85d0-1cc47e412671", "metadata": {}, "source": [ "5. Optimizing for Large Files:\n", "   - Use chunk processing:" ] }, { "cell_type": "code", "execution_count": null, "id": "e684dc56-f980-4d37-affb-3d7fde7a99b0", "metadata": {}, "outputs": [], "source": [ "# process() is a placeholder for your own per-chunk logic\n", "for chunk in pd.read_csv(\"file.csv\", chunksize=5000):\n", "    process(chunk)" ] }, { "cell_type": "markdown", "id": "5a9677e2-e475-4829-9660-a2ec1674d221", "metadata": {}, "source": [ "- For very large files, consider libraries designed for parallel or larger-than-memory processing, such as dask or polars." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }