{ "cells": [ { "cell_type": "markdown", "id": "61a28560-a233-418e-8266-442a4a0cb810", "metadata": {}, "source": [ "# a. What is XML?" ] }, { "cell_type": "markdown", "id": "88af4e03-44db-41a9-a8d7-66d4c06301d7", "metadata": {}, "source": [ "- XML (eXtensible Markup Language) is a markup language for storing and transporting data in a structured format.\n", "- It is both human-readable and machine-readable, and it organizes data hierarchically with nested tags.\n", "\n", "## Advantages:\n", "- Flexible and self-descriptive: tag names describe the data they contain.\n", "- Widely used for data exchange between systems (for example, web APIs) and for configuration files.\n", "\n", "## Common File Extensions:\n", "- .xml" ] }, { "cell_type": "markdown", "id": "5d9bfd21-9483-4fc1-9a3d-9c1067f437b9", "metadata": {}, "source": [ "Example of XML Structure (element names are illustrative):\n", "\n", "    <person>\n", "        <name>Shweta Singh</name>\n", "        <age>27</age>\n", "        <city>Kolkata</city>\n", "    </person>" ] }, { "cell_type": "markdown", "id": "91ef31d0-018d-4ece-b1aa-26b9fb11cec0", "metadata": {}, "source": [ "# b. How to Read XML Files\n", "- XML files can be parsed and processed with Python libraries such as xml.etree.ElementTree (part of the standard library), lxml, or pandas.\n" ] }, { "cell_type": "markdown", "id": "69b8009e-71cc-49ec-a604-8f5ef329b972", "metadata": {}, "source": [ "1. Using xml.etree.ElementTree:" ] }, { "cell_type": "code", "execution_count": null, "id": "8b12e0ae-6189-48f7-98e2-33dac2f4f9f7", "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "\n", "# Parse an XML file into an element tree\n", "tree = ET.parse(\"file.xml\")\n", "root = tree.getroot()\n", "\n", "# Print the tag and text of each direct child of the root\n", "for child in root:\n", "    print(child.tag, child.text)" ] }, { "cell_type": "markdown", "id": "65b985fa-4873-46dc-9770-8d9736547959", "metadata": {}, "source": [ "2. 
Using pandas for tabular data:" ] }, { "cell_type": "code", "execution_count": null, "id": "83ad5cda-43d1-44a7-9cf3-fee7a584f5cf", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Read XML into a DataFrame (requires pandas >= 1.3; the default parser uses lxml)\n", "df = pd.read_xml(\"file.xml\")\n", "print(df.head())" ] }, { "cell_type": "markdown", "id": "e5035955-6155-4a8c-8e16-8aac5f967e50", "metadata": {}, "source": [ "3. Using lxml for advanced parsing:" ] }, { "cell_type": "code", "execution_count": null, "id": "52cb4fa5-5b98-4701-9c1f-5bb16ce56c42", "metadata": {}, "outputs": [], "source": [ "from lxml import etree\n", "\n", "# Parse XML file\n", "tree = etree.parse(\"file.xml\")\n", "root = tree.getroot()\n", "\n", "# Print the text of every <name> element, at any depth\n", "for element in root.iter(\"name\"):\n", "    print(element.text)" ] }, { "cell_type": "markdown", "id": "ae8fff0c-2018-4098-bfae-1dd2fe0f4db1", "metadata": {}, "source": [ "# c. Issues Encountered When Handling XML Files\n", "1. Complex Structures:\n", "- XML files can have deeply nested and complex hierarchies.\n", "2. Large File Sizes:\n", "- Parsers that load the whole document into memory (such as ET.parse) can consume significant memory on large files.\n", "3. Data Inconsistency:\n", "- Missing or unexpected tags can break code that assumes a fixed structure; for example, find() returns None for a missing element.\n", "4. Encoding Issues:\n", "- Files whose actual encoding does not match the declared one (for example, ISO-8859-1 bytes with no encoding declaration, which are then read as UTF-8) may fail to parse." ] }, { "cell_type": "markdown", "id": "8602e413-fbd8-4839-8eb6-440dbe6b2ae2", "metadata": {}, "source": [ "# d. How to Overcome These Issues" ] }, { "cell_type": "markdown", "id": "3e74ed8c-476f-4ceb-826f-07361f98f10a", "metadata": {}, "source": [ "1. Handle Complex Structures:\n", "\n", "- Use a library such as lxml, which supports XPath queries, to navigate and process nested XML structures efficiently.\n", "\n", "2. 
Optimize Large File Processing:\n", "\n", "- Use event-driven parsing with xml.sax or lxml.etree.iterparse to process files incrementally instead of loading them whole:" ] }, { "cell_type": "code", "execution_count": null, "id": "7e0a577a-8fa4-4dd2-8426-48e1422674e3", "metadata": {}, "outputs": [], "source": [ "from lxml import etree\n", "\n", "# Handle each element as soon as its closing tag has been read\n", "for event, element in etree.iterparse(\"large_file.xml\", events=(\"end\",)):\n", "    print(element.tag, element.text)\n", "    # Free the element's contents once it has been handled\n", "    element.clear()" ] }, { "cell_type": "markdown", "id": "2e486525-88a1-4272-b205-9ecccd1775fe", "metadata": {}, "source": [ "3. Handle Missing or Unexpected Tags:\n", "\n", "- Use default values or conditional checks to handle missing elements:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2b14f7b0-d18c-4bf5-9b28-883acde3989b", "metadata": {}, "outputs": [], "source": [ "# root is the root element obtained from a tree parsed in an earlier cell\n", "for child in root:\n", "    name = child.find(\"name\")  # None if the child has no <name> element\n", "    print(name.text if name is not None else \"Unknown\")" ] }, { "cell_type": "markdown", "id": "3b53524c-7150-41d2-9bc9-c1e4dea2f1fa", "metadata": {}, "source": [ "4. 
Resolve Encoding Issues:\n", "\n", "- If the file's encoding declaration is missing or wrong, specify the encoding explicitly when parsing:" ] }, { "cell_type": "code", "execution_count": null, "id": "58eb5b60-9304-4929-a6aa-4c9655a9c492", "metadata": {}, "outputs": [], "source": [ "# ET is xml.etree.ElementTree; the encoding given here overrides the one declared in the file\n", "tree = ET.parse(\"file.xml\", parser=ET.XMLParser(encoding=\"ISO-8859-1\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }