shwetashweta05
committed on
Delete csv_guide.ipynb
csv_guide.ipynb +0 -356
csv_guide.ipynb
DELETED
@@ -1,356 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "288dc3d6-2f59-4af4-b9a0-ac11110c95a4",
-   "metadata": {},
-   "source": [
-    "# a. What is CSV?"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "8a29ef9f-d2b1-44ae-aa00-b7307dc1f1fa",
-   "metadata": {},
-   "source": [
-    "- CSV (Comma-Separated Values) is a simple and widely used file format for storing structured data.\n",
-    "- Each row in a CSV file represents a record, and fields within a record are separated by a delimiter (typically a comma, though semicolons, tabs, and other characters are also used)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "aed4bbe7-49a7-44f7-a222-1dbc76b94b74",
-   "metadata": {},
-   "source": [
-    "## Advantages"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "0908c962-52a0-481d-9c4a-734d0954aeb5",
-   "metadata": {},
-   "source": [
-    "- Lightweight and easy to create.\n",
-    "- Supported by almost all data tools and programming languages."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9a3c3937-cb91-411b-8606-16728aabbbc1",
-   "metadata": {},
-   "source": [
-    "## Common File Extensions"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "41bf2a14-0cc1-458b-be33-62e9431a9b31",
-   "metadata": {},
-   "source": [
-    "- .csv\n",
-    "- .txt (sometimes used with a CSV structure)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "00250776-617f-49d9-88bb-e6cba943f599",
-   "metadata": {},
-   "source": [
-    "# b. How to Read CSV Files"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "98989d08-8d4d-4a02-82b1-ba08757e71ff",
-   "metadata": {},
-   "source": [
-    "- In Python, CSV files can be read with the pandas library or with the built-in csv module."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6776fe4e-8155-47ff-99f4-ec26c916c45d",
-   "metadata": {},
-   "source": [
-    "## 1. Using pandas:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a508ae8d-3a3d-43f0-9453-11c87877b2b1",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import pandas as pd\n",
-    "\n",
-    "# Read a CSV file\n",
-    "df = pd.read_csv(\"file.csv\")\n",
-    "print(df.head())\n",
-    "\n",
-    "# Reading a CSV file with a custom delimiter\n",
-    "df = pd.read_csv(\"file.csv\", sep=\";\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "7c3f7a6a-0c13-45f2-930b-2c5796985efd",
-   "metadata": {},
-   "source": [
-    "## 2. Using Python's Built-in csv Module:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a33ffb8b-88b6-4061-b816-00397f2b3a3e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import csv\n",
-    "\n",
-    "with open(\"file.csv\", \"r\", newline=\"\") as file:\n",
-    "    reader = csv.reader(file)\n",
-    "    for row in reader:\n",
-    "        print(row)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2a57c10b-51bf-4a4e-978a-51644964b856",
-   "metadata": {},
-   "source": [
-    "## 3. Reading Large CSV Files in Chunks:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5a056573-3d16-400a-8ccd-a15d0398b454",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Process large CSV files in smaller chunks\n",
-    "for chunk in pd.read_csv(\"large_file.csv\", chunksize=1000):\n",
-    "    print(chunk.head())"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b52ebad6-0c64-4317-974a-3498f05feaea",
-   "metadata": {},
-   "source": [
-    "# c. Issues Encountered When Handling CSV Files"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "8fb34287-7754-4170-8095-46c2a82db4ba",
-   "metadata": {},
-   "source": [
-    "1. Delimiter Issues:\n",
-    "   - Not all CSV files use commas as delimiters. Some use semicolons, tabs, or other characters.\n",
-    "2. Encoding Problems:\n",
-    "   - Non-UTF-8 encodings may cause errors while reading files.\n",
-    "   - Example: \"UnicodeDecodeError\".\n",
-    "3. Missing or Inconsistent Data:\n",
-    "   - Some fields may be empty, and the number of fields may vary from row to row.\n",
-    "4. Header Issues:\n",
-    "   - Files may lack headers or have duplicate/misaligned headers.\n",
-    "5. Large File Sizes:\n",
-    "   - Processing very large CSV files can lead to memory issues."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "67c01a56-9b7c-46ba-8a79-9586a244978c",
-   "metadata": {},
-   "source": [
-    "# d. How to Overcome These Issues"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "45564d75-7870-45e1-8d53-e78ff71ff018",
-   "metadata": {},
-   "source": [
-    "1. Delimiter Issues:\n",
-    "   - Specify the correct delimiter while reading:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "36c282a6-cbdc-4a3e-933a-91080ea4dccc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = pd.read_csv(\"file.csv\", sep=\";\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "7b998672-5d7e-4a6a-8cc4-36b18446b9be",
-   "metadata": {},
-   "source": [
-    "2. Encoding Problems:\n",
-    "   - Explicitly set the encoding:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "2657d869-a303-4e03-bc07-b15f012f76e6",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = pd.read_csv(\"file.csv\", encoding=\"ISO-8859-1\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "113e7e43-7031-4904-9e87-c9df4acefaff",
-   "metadata": {},
-   "source": [
-    "3. Handling Missing Data:\n",
-    "   - Fill missing values:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "67ea80c7-7a86-4694-b6ff-a55ca27caad5",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df.fillna(\"Unknown\", inplace=True)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "a6d7a40b-c495-4482-b006-767c14209bf2",
-   "metadata": {},
-   "source": [
-    "   - Drop rows with missing data (pass axis=1 to drop columns instead):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c8191abd-6281-466e-aeba-5f8df351de2d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df.dropna(inplace=True)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6542d341-d38f-4c59-a5ca-d2503bd35e51",
-   "metadata": {},
-   "source": [
-    "4. Header Issues:\n",
-    "   - Manually assign headers:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3f2ee8b5-c54d-4349-b473-a8c3d6230c38",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = pd.read_csv(\"file.csv\", header=None, names=[\"Col1\", \"Col2\", \"Col3\"])"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2461fa9d-02bb-4008-85d0-1cc47e412671",
-   "metadata": {},
-   "source": [
-    "5. Optimizing for Large Files:\n",
-    "   - Use chunk processing:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "e684dc56-f980-4d37-affb-3d7fde7a99b0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "for chunk in pd.read_csv(\"file.csv\", chunksize=5000):\n",
-    "    process(chunk)  # process is a placeholder for your own per-chunk logic"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3e3ebeb4-1758-499c-9c0a-a6389b6ed6cd",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c378769c-56a9-4675-b988-e6b57eeed54e",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "fe9e2b34-a679-4b8e-923a-f296f775a6a2",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3ece1968-048b-4337-a79e-3c9a7161231d",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "b940d8eb-c668-4553-9bb9-c1b8e39cf211",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "8a88eeae-cfdf-48bd-aa05-3b0c29ff25f0",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.11.7"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
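
The delimiter and encoding fixes in the deleted notebook rely on pandas. As a rough sketch of the same two ideas using only the standard library, the snippet below tries a short list of candidate encodings and then lets `csv.Sniffer` guess the delimiter. The helper name `read_rows`, the candidate encodings, and the sample data are hypothetical and not part of the original notebook.

```python
import csv
import io


def read_rows(raw, encodings=("utf-8", "ISO-8859-1")):
    """Decode raw CSV bytes and parse them, guessing the delimiter.

    Hypothetical helper for illustration only.
    """
    # Try each candidate encoding in order. ISO-8859-1 maps every byte to a
    # character, so it works as a last-resort fallback.
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the data")

    # Let csv.Sniffer guess the delimiter from a sample of the text,
    # restricted to a few common candidates.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t")

    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline=""), dialect))


# Assumed sample: semicolon-delimited data stored as ISO-8859-1, not UTF-8.
sample = "name;city;age\nJosé;Málaga;31\nZoë;Besançon;27\n".encode("ISO-8859-1")
rows = read_rows(sample)
for row in rows:
    print(row)
```

`csv.Sniffer` is a heuristic and can guess wrong on small or irregular samples, so passing the delimiter explicitly remains the safer option whenever it is known.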