{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "61a28560-a233-418e-8266-442a4a0cb810",
   "metadata": {},
   "source": [
    "# a. What is XML?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88af4e03-44db-41a9-a8d7-66d4c06301d7",
   "metadata": {},
   "source": [
    "- XML (eXtensible Markup Language) is a markup language used to store and transport data in a structured format.\n",
    "- It is both human-readable and machine-readable, with a hierarchical structure built from nested tags.\n",
    "\n",
    "## Advantages:\n",
    "- Flexible and self-descriptive.\n",
    "- Widely used for data exchange between systems, such as web APIs and configuration files.\n",
    "\n",
    "## Common File Extensions:\n",
    "- .xml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d9bfd21-9483-4fc1-9a3d-9c1067f437b9",
   "metadata": {},
   "source": [
    "Example of XML Structure:\n",
    "\n",
    "    <person>\n",
    "        <name>Shweta Singh</name>\n",
    "        <age>27</age>\n",
    "        <city>Kolkata</city>\n",
    "    </person>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91ef31d0-018d-4ece-b1aa-26b9fb11cec0",
   "metadata": {},
   "source": [
    "# b. How to Read XML Files\n",
    "- XML files can be parsed and processed with Python libraries such as xml.etree.ElementTree, lxml, or pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69b8009e-71cc-49ec-a604-8f5ef329b972",
   "metadata": {},
   "source": [
    "1. Using xml.etree.ElementTree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b12e0ae-6189-48f7-98e2-33dac2f4f9f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# Parse an XML file\n",
    "tree = ET.parse(\"file.xml\")\n",
    "root = tree.getroot()\n",
    "\n",
    "# Access elements\n",
    "for child in root:\n",
    "    print(child.tag, child.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65b985fa-4873-46dc-9770-8d9736547959",
   "metadata": {},
   "source": [
    "2. Using pandas for tabular data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83ad5cda-43d1-44a7-9cf3-fee7a584f5cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Read XML into a DataFrame (pd.read_xml requires pandas 1.3+ and uses lxml as its default parser)\n",
    "df = pd.read_xml(\"file.xml\")\n",
    "print(df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5035955-6155-4a8c-8e16-8aac5f967e50",
   "metadata": {},
   "source": [
    "3. Using lxml for advanced parsing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52cb4fa5-5b98-4701-9c1f-5bb16ce56c42",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lxml import etree\n",
    "\n",
    "# Parse XML file\n",
    "tree = etree.parse(\"file.xml\")\n",
    "root = tree.getroot()\n",
    "\n",
    "# Extract specific elements\n",
    "for element in root.iter(\"name\"):\n",
    "    print(element.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae8fff0c-2018-4098-bfae-1dd2fe0f4db1",
   "metadata": {},
   "source": [
    "# c. Issues Encountered When Handling XML Files\n",
    "1. Complex Structures:\n",
    "- XML files can have deeply nested and complex hierarchies.\n",
    "2. Large File Sizes:\n",
    "- Parsing a large XML file into a full in-memory tree can consume significant memory.\n",
    "3. Data Inconsistency:\n",
    "- Missing or unexpected tags can cause errors when extracting data.\n",
    "4. Encoding Issues:\n",
    "- XML files whose encoding (e.g., ISO-8859-1) is missing from or misstated in the XML declaration may fail to parse."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8602e413-fbd8-4839-8eb6-440dbe6b2ae2",
   "metadata": {},
   "source": [
    "# d. How to Overcome These Issues"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e74ed8c-476f-4ceb-826f-07361f98f10a",
   "metadata": {},
   "source": [
    "1. Handle Complex Structures:\n",
    "\n",
    "- Use libraries like lxml, which supports XPath, for efficient navigation and processing of nested XML structures.\n",
    "  \n",
    "2. Optimize Large File Processing:\n",
    "\n",
    "- Use event-driven parsing with xml.sax or lxml.etree.iterparse to process files incrementally:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e0a577a-8fa4-4dd2-8426-48e1422674e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lxml import etree\n",
    "\n",
    "# Stream the file and handle each element as soon as its closing tag is read\n",
    "for event, element in etree.iterparse(\"large_file.xml\", events=(\"end\",)):\n",
    "    print(element.tag, element.text)\n",
    "    # Release the element's text and children once it has been processed\n",
    "    element.clear()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e486525-88a1-4272-b205-9ecccd1775fe",
   "metadata": {},
   "source": [
    "3. Handle Missing or Unexpected Tags:\n",
    "\n",
    "- Use default values or conditional checks to handle missing elements:\n",
    "\n"
   ]
  },
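  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b14f7b0-d18c-4bf5-9b28-883acde3989b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# `root` comes from one of the parsing cells above\n",
    "for child in root:\n",
    "    name = child.find(\"name\")\n",
    "    # find() returns None when the tag is absent, so fall back to a default\n",
    "    print(name.text if name is not None else \"Unknown\")"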
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b53524c-7150-41d2-9bc9-c1e4dea2f1fa",
   "metadata": {},
   "source": [
    "4. Resolve Encoding Issues:\n",
    "\n",
    "- Explicitly specify the encoding when parsing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58eb5b60-9304-4929-a6aa-4c9655a9c492",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Override the encoding declared in the file (ET is imported in the ElementTree cell above)\n",
    "tree = ET.parse(\"file.xml\", parser=ET.XMLParser(encoding=\"ISO-8859-1\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}