{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "61a28560-a233-418e-8266-442a4a0cb810",
   "metadata": {},
   "source": [
    "# a. What is XML?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88af4e03-44db-41a9-a8d7-66d4c06301d7",
   "metadata": {},
   "source": [
    "- XML (eXtensible Markup Language) is a markup language used to store and transport data in a structured format.\n",
    "- It is both human-readable and machine-readable, and it organizes data hierarchically with nested tags.\n",
    "\n",
    "## Advantages:\n",
    "- Flexible and self-descriptive.\n",
    "- Widely used for data exchange between systems (for example, web APIs) and for configuration files.\n",
    "\n",
    "## Common File Extensions:\n",
    "- .xml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d9bfd21-9483-4fc1-9a3d-9c1067f437b9",
   "metadata": {},
   "source": [
    "Example of XML Structure:\n",
    "\n",
    "    <person>\n",
    "        <name>Shweta Singh</name>\n",
    "        <age>27</age>\n",
    "        <city>Kolkata</city>\n",
    "    </person>"
   ]
  },
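  {
   "cell_type": "markdown",
   "id": "c0a1e5d2-3b47-4f19-8a6e-1d2f3a4b5c01",
   "metadata": {},
   "source": [
    "The same structure can also be built and serialized from Python. The cell below is a minimal sketch using the standard-library xml.etree.ElementTree module; the helper name build_person is an assumption made for illustration and is not part of any library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0a1e5d2-3b47-4f19-8a6e-1d2f3a4b5c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "\n",
    "def build_person(name, age, city):\n",
    "    # Hypothetical helper: returns a <person> element with three children\n",
    "    person = ET.Element(\"person\")\n",
    "    ET.SubElement(person, \"name\").text = name\n",
    "    # XML stores text only, so the number is converted to a string\n",
    "    ET.SubElement(person, \"age\").text = str(age)\n",
    "    ET.SubElement(person, \"city\").text = city\n",
    "    return person\n",
    "\n",
    "\n",
    "person = build_person(\"Shweta Singh\", 27, \"Kolkata\")\n",
    "xml_text = ET.tostring(person, encoding=\"unicode\")\n",
    "print(xml_text)"
   ]
  },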
  {
   "cell_type": "markdown",
   "id": "91ef31d0-018d-4ece-b1aa-26b9fb11cec0",
   "metadata": {},
   "source": [
    "# b. How to Read XML Files\n",
    "- XML files can be parsed in Python with the standard-library module xml.etree.ElementTree, the third-party lxml library, or pandas (pandas.read_xml)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69b8009e-71cc-49ec-a604-8f5ef329b972",
   "metadata": {},
   "source": [
    "1. Using xml.etree.ElementTree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b12e0ae-6189-48f7-98e2-33dac2f4f9f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# Parse an XML file\n",
    "tree = ET.parse(\"file.xml\")\n",
    "root = tree.getroot()\n",
    "\n",
    "# Access elements\n",
    "for child in root:\n",
    "    print(child.tag, child.text)"
   ]
  },
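  {
   "cell_type": "markdown",
   "id": "d4b7f210-6c3e-4a85-b1f9-2e7a8c9d0e01",
   "metadata": {},
   "source": [
    "When a file holds several records, each record can be turned into a dictionary. The cell below is a minimal sketch, assuming a hypothetical people root element that wraps several person elements; the sample data is invented for illustration and is parsed from a string with ET.fromstring so the cell runs without a file on disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4b7f210-6c3e-4a85-b1f9-2e7a8c9d0e02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# Hypothetical sample data, inlined so the example is self-contained\n",
    "xml_text = (\n",
    "    \"<people>\"\n",
    "    \"<person><name>Shweta Singh</name><age>27</age><city>Kolkata</city></person>\"\n",
    "    \"<person><name>Asha Rao</name><age>31</age><city>Pune</city></person>\"\n",
    "    \"</people>\"\n",
    ")\n",
    "\n",
    "root = ET.fromstring(xml_text)\n",
    "\n",
    "# One dict per <person>, keyed by the tag of each child element\n",
    "records = [{child.tag: child.text for child in person} for person in root]\n",
    "print(records)"
   ]
  },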
  {
   "cell_type": "markdown",
   "id": "65b985fa-4873-46dc-9770-8d9736547959",
   "metadata": {},
   "source": [
    "2. Using pandas for tabular data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83ad5cda-43d1-44a7-9cf3-fee7a584f5cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Read XML into a DataFrame (read_xml requires pandas 1.3 or later; its default parser uses lxml)\n",
    "df = pd.read_xml(\"file.xml\")\n",
    "print(df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5035955-6155-4a8c-8e16-8aac5f967e50",
   "metadata": {},
   "source": [
    "3. Using lxml for advanced parsing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52cb4fa5-5b98-4701-9c1f-5bb16ce56c42",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lxml import etree\n",
    "\n",
    "# Parse XML file\n",
    "tree = etree.parse(\"file.xml\")\n",
    "root = tree.getroot()\n",
    "\n",
    "# Extract specific elements\n",
    "for element in root.iter(\"name\"):\n",
    "    print(element.text)"
   ]
  },
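  {
   "cell_type": "markdown",
   "id": "e8c3a945-1f2b-4d76-a0e8-3b9c4d5e6f01",
   "metadata": {},
   "source": [
    "Specific elements in a nested document can also be selected with path expressions. The cell below is a minimal sketch that uses the limited XPath subset of the standard-library ElementTree, swapped in here so the cell has no third-party dependency (lxml offers fuller XPath support through its xpath method). The company document is hypothetical sample data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8c3a945-1f2b-4d76-a0e8-3b9c4d5e6f02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# Hypothetical nested sample data\n",
    "xml_text = (\n",
    "    \"<company>\"\n",
    "    \"<department name='Data'>\"\n",
    "    \"<employee><name>Shweta Singh</name><city>Kolkata</city></employee>\"\n",
    "    \"</department>\"\n",
    "    \"<department name='Web'>\"\n",
    "    \"<employee><name>Asha Rao</name><city>Pune</city></employee>\"\n",
    "    \"</department>\"\n",
    "    \"</company>\"\n",
    ")\n",
    "\n",
    "root = ET.fromstring(xml_text)\n",
    "\n",
    "# \".//name\" matches <name> elements at any depth below the root\n",
    "all_names = [el.text for el in root.findall(\".//name\")]\n",
    "\n",
    "# A predicate restricts the match to one department by attribute value\n",
    "data_names = [\n",
    "    el.text for el in root.findall(\"./department[@name='Data']/employee/name\")\n",
    "]\n",
    "\n",
    "print(all_names)\n",
    "print(data_names)"
   ]
  },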
  {
   "cell_type": "markdown",
   "id": "ae8fff0c-2018-4098-bfae-1dd2fe0f4db1",
   "metadata": {},
   "source": [
    "# c. Issues Encountered When Handling XML Files\n",
    "1. Complex Structures:\n",
    "- XML files can have deeply nested and complex hierarchies.\n",
    "2. Large File Sizes:\n",
    "- Loading a large XML file into a full in-memory tree can consume significant memory.\n",
    "3. Data Inconsistency:\n",
    "- Missing or unexpected tags can cause errors in code that assumes every element is present.\n",
    "4. Encoding Issues:\n",
    "- XML files whose encoding declaration is missing or does not match the actual bytes (e.g., ISO-8859-1 content with no declaration) may fail to parse."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8602e413-fbd8-4839-8eb6-440dbe6b2ae2",
   "metadata": {},
   "source": [
    "# d. How to Overcome These Issues"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e74ed8c-476f-4ceb-826f-07361f98f10a",
   "metadata": {},
   "source": [
    "1. Handle Complex Structures:\n",
    "\n",
    "- Use a library such as lxml, which supports XPath 1.0 queries, to navigate and process nested XML structures efficiently.\n",
    "  \n",
    "2. Optimize Large File Processing:\n",
    "\n",
    "- Use event-driven parsing with xml.sax or lxml.etree.iterparse to process a file incrementally instead of loading it all at once:"
   ]
  },
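  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e0a577a-8fa4-4dd2-8426-48e1422674e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lxml import etree\n",
    "\n",
    "# Stream the file: each element is handled as soon as its closing tag is read\n",
    "for event, element in etree.iterparse(\"large_file.xml\", events=(\"end\",)):\n",
    "    print(element.tag, element.text)\n",
    "    # Free the element's content, then drop siblings that were already processed\n",
    "    element.clear()\n",
    "    parent = element.getparent()\n",
    "    while parent is not None and element.getprevious() is not None:\n",
    "        del parent[0]"
   ]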
  },
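  {
   "cell_type": "markdown",
   "id": "f2d6b078-9a4c-4e31-8c57-4c0d1e2f3a01",
   "metadata": {},
   "source": [
    "The other event-driven option, xml.sax, never builds a tree at all. The cell below is a minimal sketch with the standard library; the NameCollector class and the inline sample document are hypothetical and only illustrate the handler pattern. For a file on disk, xml.sax.parse(\"large_file.xml\", handler) would take the place of parseString."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2d6b078-9a4c-4e31-8c57-4c0d1e2f3a02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.sax\n",
    "\n",
    "\n",
    "class NameCollector(xml.sax.ContentHandler):\n",
    "    # Hypothetical handler: collects the text of every <name> element\n",
    "\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.names = []\n",
    "        self._buffer = None  # None means the parser is not inside <name>\n",
    "\n",
    "    def startElement(self, tag, attrs):\n",
    "        if tag == \"name\":\n",
    "            self._buffer = []\n",
    "\n",
    "    def characters(self, content):\n",
    "        # Text may arrive in several pieces, so accumulate it\n",
    "        if self._buffer is not None:\n",
    "            self._buffer.append(content)\n",
    "\n",
    "    def endElement(self, tag):\n",
    "        if tag == \"name\" and self._buffer is not None:\n",
    "            self.names.append(\"\".join(self._buffer))\n",
    "            self._buffer = None\n",
    "\n",
    "\n",
    "# Hypothetical sample data, inlined so the example is self-contained\n",
    "xml_bytes = (\n",
    "    b\"<people>\"\n",
    "    b\"<person><name>Shweta Singh</name></person>\"\n",
    "    b\"<person><name>Asha Rao</name></person>\"\n",
    "    b\"</people>\"\n",
    ")\n",
    "\n",
    "handler = NameCollector()\n",
    "xml.sax.parseString(xml_bytes, handler)\n",
    "print(handler.names)"
   ]
  },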
  {
   "cell_type": "markdown",
   "id": "2e486525-88a1-4272-b205-9ecccd1775fe",
   "metadata": {},
   "source": [
    "3. Handle Missing or Unexpected Tags:\n",
    "\n",
    "- Use default values or conditional checks to handle missing elements:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b14f7b0-d18c-4bf5-9b28-883acde3989b",
   "metadata": {},
   "outputs": [],
   "source": [
    "for child in root:\n",
    "    name = child.find(\"name\")\n",
    "    print(name.text if name is not None else \"Unknown\")"
   ]
  },
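  {
   "cell_type": "markdown",
   "id": "a7e1c396-5d8f-4b20-9e64-5d1e2f3a4b01",
   "metadata": {},
   "source": [
    "The same check can be written with findtext, which accepts a default for missing elements. The cell below is a minimal sketch with the standard library; the helper text_or_default is hypothetical. It also treats an empty tag as missing, because findtext returns an empty string when the element exists but has no text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7e1c396-5d8f-4b20-9e64-5d1e2f3a4b02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "\n",
    "def text_or_default(parent, tag, default=\"Unknown\"):\n",
    "    # Hypothetical helper: findtext returns `default` when the tag is absent\n",
    "    # and \"\" when the tag exists but is empty, so fall back in both cases\n",
    "    value = parent.findtext(tag, default=default)\n",
    "    return value.strip() or default\n",
    "\n",
    "\n",
    "# Hypothetical sample data: one complete record, one without <name>, one with an empty <name/>\n",
    "root = ET.fromstring(\n",
    "    \"<people>\"\n",
    "    \"<person><name>Shweta Singh</name></person>\"\n",
    "    \"<person><age>31</age></person>\"\n",
    "    \"<person><name/></person>\"\n",
    "    \"</people>\"\n",
    ")\n",
    "\n",
    "names = [text_or_default(person, \"name\") for person in root]\n",
    "print(names)"
   ]
  },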
  {
   "cell_type": "markdown",
   "id": "3b53524c-7150-41d2-9bc9-c1e4dea2f1fa",
   "metadata": {},
   "source": [
    "4. Resolve Encoding Issues:\n",
    "\n",
    "- Explicitly specify the encoding when parsing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58eb5b60-9304-4929-a6aa-4c9655a9c492",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The parser's encoding argument overrides the encoding declared in the file\n",
    "tree = ET.parse(\"file.xml\", parser=ET.XMLParser(encoding=\"ISO-8859-1\"))"
   ]
  },
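  {
   "cell_type": "markdown",
   "id": "b9f4d7a1-2e6c-4c53-8f10-6e2f3a4b5c01",
   "metadata": {},
   "source": [
    "To see why the declared encoding matters, the cell below is a minimal sketch that parses in-memory bytes with the standard library instead of reading a file. The sample is hypothetical: a city element containing a u-umlaut, encoded as ISO-8859-1. It compares three cases: a matching declaration, no declaration, and an explicit override on the parser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9f4d7a1-2e6c-4c53-8f10-6e2f3a4b5c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# Hypothetical sample: in ISO-8859-1 the u-umlaut is the single byte 0xFC,\n",
    "# which is not valid UTF-8\n",
    "body = \"<city>Z\\u00fcrich</city>\".encode(\"iso-8859-1\")\n",
    "\n",
    "# 1. With a matching declaration the parser decodes the bytes correctly\n",
    "declared = b\"<?xml version='1.0' encoding='ISO-8859-1'?>\" + body\n",
    "with_declaration = ET.fromstring(declared).text\n",
    "\n",
    "# 2. Without a declaration the parser assumes UTF-8 and rejects the input\n",
    "try:\n",
    "    ET.fromstring(body)\n",
    "    failed_without_declaration = False\n",
    "except ET.ParseError:\n",
    "    failed_without_declaration = True\n",
    "\n",
    "# 3. Overriding the encoding on the parser handles the undeclared bytes\n",
    "parser = ET.XMLParser(encoding=\"iso-8859-1\")\n",
    "with_override = ET.fromstring(body, parser=parser).text\n",
    "\n",
    "print(with_declaration, failed_without_declaration, with_override)"
   ]
  },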
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e63144df-35e8-4a07-9dad-1c9466948487",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}