{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "288dc3d6-2f59-4af4-b9a0-ac11110c95a4",
   "metadata": {},
   "source": [
    "# a. What is CSV?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a29ef9f-d2b1-44ae-aa00-b7307dc1f1fa",
   "metadata": {},
   "source": [
    "- CSV (Comma-Separated Values) is a simple and widely used file format for storing structured data.\n",
    "- Each row in a CSV file represents a record, and the fields within a record are separated by a delimiter (typically a comma, though semicolons and tabs are also common)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aed4bbe7-49a7-44f7-a222-1dbc76b94b74",
   "metadata": {},
   "source": [
    "## Advantages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0908c962-52a0-481d-9c4a-734d0954aeb5",
   "metadata": {},
   "source": [
    "- Lightweight and easy to create.\n",
    "- Supported by almost all data tools and programming languages."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a3c3937-cb91-411b-8606-16728aabbbc1",
   "metadata": {},
   "source": [
    "## Common File Extensions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41bf2a14-0cc1-458b-be33-62e9431a9b31",
   "metadata": {},
   "source": [
    "- .csv\n",
    "- .txt (sometimes used with a CSV structure)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00250776-617f-49d9-88bb-e6cba943f599",
   "metadata": {},
   "source": [
    "# b. How to Read CSV Files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98989d08-8d4d-4a02-82b1-ba08757e71ff",
   "metadata": {},
   "source": [
    "- Using Python, CSV files can be handled with libraries such as pandas or Python's built-in csv module."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6776fe4e-8155-47ff-99f4-ec26c916c45d",
   "metadata": {},
   "source": [
    "## 1. Using pandas:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a508ae8d-3a3d-43f0-9453-11c87877b2b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Read a CSV file\n",
    "df = pd.read_csv(\"file.csv\")\n",
    "print(df.head())\n",
    "\n",
    "# Reading a CSV file with a custom delimiter\n",
    "df = pd.read_csv(\"file.csv\", sep=\";\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c3f7a6a-0c13-45f2-930b-2c5796985efd",
   "metadata": {},
   "source": [
    "## 2. Using Python's Built-in csv Module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a33ffb8b-88b6-4061-b816-00397f2b3a3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "# newline=\"\" lets the csv module handle line endings inside quoted fields itself\n",
    "with open(\"file.csv\", \"r\", newline=\"\") as file:\n",
    "    reader = csv.reader(file)\n",
    "    for row in reader:\n",
    "        print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a57c10b-51bf-4a4e-978a-51644964b856",
   "metadata": {},
   "source": [
    "## 3. Reading Large CSV Files in Chunks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a056573-3d16-400a-8ccd-a15d0398b454",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Process large CSV files in smaller chunks\n",
    "for chunk in pd.read_csv(\"large_file.csv\", chunksize=1000):\n",
    "    print(chunk.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b52ebad6-0c64-4317-974a-3498f05feaea",
   "metadata": {},
   "source": [
    "# c. Issues Encountered When Handling CSV Files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fb34287-7754-4170-8095-46c2a82db4ba",
   "metadata": {},
   "source": [
    "1. Delimiter Issues:\n",
    "    - Not all CSV files use commas as delimiters. Some may use semicolons, tabs, or other characters.\n",
    "2. Encoding Problems:\n",
    "    - Non-UTF-8 encodings may cause errors while reading files.\n",
    "    - Example: a `UnicodeDecodeError` raised while the file is being read.\n",
    "3. Missing or Inconsistent Data:\n",
    "    - Some fields may be empty, and rows may contain differing numbers of fields.\n",
    "4. Header Issues:\n",
    "    - Files may lack headers or have duplicate/misaligned headers.\n",
    "5. Large File Sizes:\n",
    "    - Processing very large CSV files can lead to memory issues."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67c01a56-9b7c-46ba-8a79-9586a244978c",
   "metadata": {},
   "source": [
    "# d. How to Overcome These Issues"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45564d75-7870-45e1-8d53-e78ff71ff018",
   "metadata": {},
   "source": [
    "1. Delimiter Issues:\n",
    "   - Specify the correct delimiter while reading:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36c282a6-cbdc-4a3e-933a-91080ea4dccc",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"file.csv\", sep=\";\")"
   ]
  },
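  {
   "cell_type": "markdown",
   "id": "sniff-delimiter-md",
   "metadata": {},
   "source": [
    "If the delimiter is not known in advance, it can often be guessed from a sample of the file. The following is a minimal sketch using `csv.Sniffer` from the standard library. The file name `file.csv`, the 4096-character sample size, and the list of candidate delimiters are assumptions for illustration. Sniffing is heuristic and raises `csv.Error` when it cannot decide, so the detected delimiter should be checked before it is trusted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sniff-delimiter-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import pandas as pd\n",
    "\n",
    "# Read a small sample and let csv.Sniffer guess the delimiter.\n",
    "# \"file.csv\" and the candidate delimiters are illustrative assumptions.\n",
    "with open(\"file.csv\", \"r\", newline=\"\") as f:\n",
    "    sample = f.read(4096)\n",
    "\n",
    "dialect = csv.Sniffer().sniff(sample, delimiters=\",;\\t|\")\n",
    "print(\"Detected delimiter:\", repr(dialect.delimiter))\n",
    "\n",
    "df = pd.read_csv(\"file.csv\", sep=dialect.delimiter)"
   ]
  },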
  {
   "cell_type": "markdown",
   "id": "7b998672-5d7e-4a6a-8cc4-36b18446b9be",
   "metadata": {},
   "source": [
    "2. Encoding Problems:\n",
    "   - Explicitly set the encoding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2657d869-a303-4e03-bc07-b15f012f76e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"file.csv\", encoding=\"ISO-8859-1\")"
   ]
  },
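  {
   "cell_type": "markdown",
   "id": "encoding-fallback-md",
   "metadata": {},
   "source": [
    "When the encoding is uncertain, one common pattern is to try UTF-8 first and fall back to another encoding only if decoding fails. This is a sketch under the assumption that the fallback encoding is ISO-8859-1; that choice is illustrative and should be replaced by the file's actual encoding when it is known."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "encoding-fallback-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Try UTF-8 first, then fall back to ISO-8859-1 (Latin-1).\n",
    "# Latin-1 accepts any byte sequence, so the fallback does not raise a\n",
    "# UnicodeDecodeError, but it can silently produce wrong characters.\n",
    "try:\n",
    "    df = pd.read_csv(\"file.csv\", encoding=\"utf-8\")\n",
    "except UnicodeDecodeError:\n",
    "    df = pd.read_csv(\"file.csv\", encoding=\"ISO-8859-1\")"
   ]
  },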
  {
   "cell_type": "markdown",
   "id": "113e7e43-7031-4904-9e87-c9df4acefaff",
   "metadata": {},
   "source": [
    "3. Handling Missing Data:\n",
    "    - Fill missing values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67ea80c7-7a86-4694-b6ff-a55ca27caad5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.fillna(\"Unknown\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6d7a40b-c495-4482-b006-767c14209bf2",
   "metadata": {},
   "source": [
    "- Drop rows that contain missing data (pass `axis=1` to `dropna` to drop columns instead):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8191abd-6281-466e-aeba-5f8df351de2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dropna(inplace=True)"
   ]
  },
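  {
   "cell_type": "markdown",
   "id": "fillna-per-column-md",
   "metadata": {},
   "source": [
    "Filling every column with the same value can turn numeric columns into object (mixed-type) columns. As a sketch of an alternative, `fillna` also accepts a dictionary that maps column names to fill values. The small DataFrame and its column names (`name`, `age`) are hypothetical and exist only to make the example self-contained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fillna-per-column-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Small illustrative frame; the column names are hypothetical.\n",
    "df = pd.DataFrame({\"name\": [\"Ana\", None], \"age\": [31.0, None]})\n",
    "\n",
    "# Fill each column with a value that matches its type:\n",
    "# a placeholder string for names, the column mean for ages.\n",
    "df = df.fillna({\"name\": \"Unknown\", \"age\": df[\"age\"].mean()})\n",
    "print(df)"
   ]
  },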
  {
   "cell_type": "markdown",
   "id": "6542d341-d38f-4c59-a5ca-d2503bd35e51",
   "metadata": {},
   "source": [
    "4. Header Issues:\n",
    "   - Manually assign headers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f2ee8b5-c54d-4349-b473-a8c3d6230c38",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"file.csv\", header=None, names=[\"Col1\", \"Col2\", \"Col3\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2461fa9d-02bb-4008-85d0-1cc47e412671",
   "metadata": {},
   "source": [
    "5. Optimizing for Large Files:\n",
    "   - Use chunk processing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e684dc56-f980-4d37-affb-3d7fde7a99b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "total_rows = 0\n",
    "for chunk in pd.read_csv(\"file.csv\", chunksize=5000):\n",
    "    total_rows += len(chunk)  # replace with your own per-chunk processing\n",
    "print(total_rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a9677e2-e475-4829-9660-a2ec1674d221",
   "metadata": {},
   "source": [
    "   - Use libraries such as dask or polars for very large files. Both offer lazy evaluation, so the whole file does not have to be loaded into memory at once."
   ]
  },
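  {
   "cell_type": "markdown",
   "id": "polars-lazy-md",
   "metadata": {},
   "source": [
    "As a rough sketch of the lazy approach with polars: `pl.scan_csv` builds a query plan without reading the file, and data is loaded only when `collect()` is called. This assumes polars is installed and that a file named `large_file.csv` exists; both are assumptions for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "polars-lazy-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl  # assumes polars is installed (pip install polars)\n",
    "\n",
    "# scan_csv builds a lazy query; nothing is read until collect() is called.\n",
    "lazy_frame = pl.scan_csv(\"large_file.csv\")\n",
    "\n",
    "# Only the first five rows are materialized here.\n",
    "print(lazy_frame.head(5).collect())"
   ]
  },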
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a88eeae-cfdf-48bd-aa05-3b0c29ff25f0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}