ElisonSherton committed
Commit af50acb · 1 Parent(s): 91bf3a6

Initial Commit

.gitignore ADDED
@@ -0,0 +1,164 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ artifacts/
+ test-trainer/
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ #pdm.lock
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+ # in version control.
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+ .pdm.toml
+ .pdm-python
+ .pdm-build/
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
README.md CHANGED
@@ -1,3 +1,9 @@
 ---
 license: mit
 ---
+
+ # Hugging Face Course
+
+ This is my first model repo.
+
+ This aims to document my progress in going through the Hugging Face course and my understanding of the different libraries provided by Hugging Face.
chapter1/1-transformers-what-can-they-do.ipynb ADDED
@@ -0,0 +1,819 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Pipeline\n",
+ "\n",
+ "This is the most basic object in the Hugging Face `transformers` library. It is a one-stop object that does everything under the hood, abstracting away much of the complexity of the task at hand, like `tokenization`, `preprocessing`, `postprocessing`, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/huggingface/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n",
+ "No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import pipeline\n",
+ "classifier = pipeline(task = \"sentiment-analysis\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sentences = [\n",
+ " \"I have been sleeping a lot lately. Wish I could do more and procrastinate less\",\n",
+ " \"It is a wonderful day today\",\n",
+ " \"What the heck, this software sucks!!\"\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'label': 'NEGATIVE', 'score': 0.9991617202758789},\n",
+ " {'label': 'POSITIVE', 'score': 0.999890923500061},\n",
+ " {'label': 'NEGATIVE', 'score': 0.9995805621147156}]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "classifier(sentences)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Zero Shot Classification"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sentences = [\n",
+ " \"Rahul Dravid was a great coach and led India to win the world cup in 2024\",\n",
+ " \"What is a transformer? It is a black box neural network model which can be used to do stuff with sequences\",\n",
+ " \"How can one understand the meaning of life? It is not so simple\",\n",
+ " \"Shaun had a great insight right in the middle of a surgery\"\n",
+ "]\n",
+ "\n",
+ "labels = [\"Sports\", \"Education\", \"Other\"]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "classifier = pipeline(\"zero-shot-classification\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'sequence': 'Rahul Dravid was a great coach and led India to win the world cup in 2024',\n",
+ " 'labels': ['Sports', 'Other', 'Education'],\n",
+ " 'scores': [0.967433512210846, 0.025695420801639557, 0.006871006917208433]},\n",
+ " {'sequence': 'What is a transformer? It is a black box neural network model which can be used to do stuff with sequences',\n",
+ " 'labels': ['Other', 'Education', 'Sports'],\n",
+ " 'scores': [0.776347279548645, 0.11728236079216003, 0.10637037456035614]},\n",
+ " {'sequence': 'How can one understand the meaning of life? It is not so simple',\n",
+ " 'labels': ['Other', 'Education', 'Sports'],\n",
+ " 'scores': [0.8647233247756958, 0.08910410851240158, 0.046172577887773514]},\n",
+ " {'sequence': 'Shaun had a great insight right in the middle of a surgery',\n",
+ " 'labels': ['Other', 'Sports', 'Education'],\n",
+ " 'scores': [0.7419394850730896, 0.18247079849243164, 0.07558975368738174]}]"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "classifier(sequences = sentences, candidate_labels = labels)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Text Generation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Using the default model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "generator = pipeline(task = \"text-generation\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "seed_text = \"Dhoni finishes off in style and the entire Indian team\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/huggingface/lib/python3.10/site-packages/transformers/generation/utils.py:1201: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)\n",
+ " warnings.warn(\n",
+ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n",
+ "/home/huggingface/lib/python3.10/site-packages/transformers/generation/utils.py:1288: UserWarning: Using `max_length`'s default (50) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "[{'generated_text': 'Dhoni finishes off in style and the entire Indian team look forward to meeting him at home to continue their efforts towards an unbeaten run in this World Cup.'}]"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "generator(text_inputs = seed_text)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "[{'generated_text': \"Dhoni finishes off in style and the entire Indian team is delighted with his victory\\n\\nIndia have failed to impress Pakistan's Ranji Trophy winner\"},\n",
+ " {'generated_text': \"Dhoni finishes off in style and the entire Indian team goes to great lengths to make him comfortable. It's a very important decision for the first\"},\n",
+ " {'generated_text': 'Dhoni finishes off in style and the entire Indian team is immediately in a good position to secure victory.\\n\\nA few weeks from now,'}]"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "generator(text_inputs = seed_text, num_return_sequences = 3, max_length = 30)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Using a specific model from the Hugging Face Hub"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n",
+ "/home/huggingface/lib/python3.10/site-packages/transformers/generation/utils.py:1201: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)\n",
+ " warnings.warn(\n",
+ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "[{'generated_text': 'Dhoni finishes off in style and the entire Indian team has their legs.\\n\\n\\nThe match between the West Indian and the Americans was the'},\n",
+ " {'generated_text': 'Dhoni finishes off in style and the entire Indian team is preparing to compete on October 31st.\\n\\nThe squad of India is made up'},\n",
+ " {'generated_text': 'Dhoni finishes off in style and the entire Indian team looks happy to be back as usual this term,\" he added.'}]"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "generator = pipeline(\"text-generation\", model = \"distilgpt2\")\n",
+ "\n",
+ "generator(text_inputs = seed_text, num_return_sequences = 3, max_length = 30)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Mask Filling"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "filler = pipeline(\"fill-mask\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'score': 0.07598453760147095,\n",
+ " 'token': 6943,\n",
+ " 'token_str': ' depression',\n",
+ " 'sequence': 'How deep is your depression?'},\n",
+ " {'score': 0.035246096551418304,\n",
+ " 'token': 12172,\n",
+ " 'token_str': ' bubble',\n",
+ " 'sequence': 'How deep is your bubble?'},\n",
+ " {'score': 0.027820784598588943,\n",
+ " 'token': 7530,\n",
+ " 'token_str': ' addiction',\n",
+ " 'sequence': 'How deep is your addiction?'},\n",
+ " {'score': 0.014877567999064922,\n",
+ " 'token': 4683,\n",
+ " 'token_str': ' hole',\n",
+ " 'sequence': 'How deep is your hole?'},\n",
+ " {'score': 0.013593271374702454,\n",
+ " 'token': 1144,\n",
+ " 'token_str': ' heart',\n",
+ " 'sequence': 'How deep is your heart?'}]"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "filler(\"How deep is your <mask>?\", top_k = 5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n",
+ "Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']\n",
+ "- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
+ "- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
+ ]
+ }
+ ],
+ "source": [
+ "filler = pipeline(\"fill-mask\", model = \"bert-base-cased\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'score': 0.0551474466919899,\n",
+ " 'token': 1762,\n",
+ " 'token_str': 'heart',\n",
+ " 'sequence': 'How deep is your heart?'},\n",
+ " {'score': 0.04252220690250397,\n",
+ " 'token': 5785,\n",
+ " 'token_str': 'wound',\n",
+ " 'sequence': 'How deep is your wound?'},\n",
+ " {'score': 0.038988541811704636,\n",
+ " 'token': 3960,\n",
+ " 'token_str': 'soul',\n",
+ " 'sequence': 'How deep is your soul?'},\n",
+ " {'score': 0.03589598089456558,\n",
+ " 'token': 2922,\n",
+ " 'token_str': 'throat',\n",
+ " 'sequence': 'How deep is your throat?'},\n",
+ " {'score': 0.0302369873970747,\n",
+ " 'token': 1567,\n",
+ " 'token_str': 'love',\n",
+ " 'sequence': 'How deep is your love?'}]"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "filler(\"How deep is your [MASK]?\", top_k = 5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Named Entity Recognition (NER)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n",
+ "/home/huggingface/lib/python3.10/site-packages/transformers/pipelines/token_classification.py:157: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy=\"simple\"` instead.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "ner = pipeline(task = \"ner\", grouped_entities = True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'entity_group': 'PER',\n",
+ " 'score': 0.9884488,\n",
+ " 'word': 'Sachin Tendulkar',\n",
+ " 'start': 63,\n",
+ " 'end': 79},\n",
+ " {'entity_group': 'ORG',\n",
+ " 'score': 0.9564063,\n",
+ " 'word': 'Indian Cricket Team',\n",
+ " 'start': 89,\n",
+ " 'end': 108}]"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ner(\"Hey everyone, please welcome, the chief guest for tonight: Mr. Sachin Tendulkar from the Indian Cricket Team\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n",
+ "/home/huggingface/lib/python3.10/site-packages/transformers/pipelines/token_classification.py:157: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy=\"none\"` instead.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "[{'entity': 'I-PER',\n",
+ " 'score': 0.9995166,\n",
+ " 'index': 15,\n",
+ " 'word': 'Sa',\n",
+ " 'start': 63,\n",
+ " 'end': 65},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.9992397,\n",
+ " 'index': 16,\n",
+ " 'word': '##chin',\n",
+ " 'start': 65,\n",
+ " 'end': 69},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.99916065,\n",
+ " 'index': 17,\n",
+ " 'word': 'Ten',\n",
+ " 'start': 70,\n",
+ " 'end': 73},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.9957129,\n",
+ " 'index': 18,\n",
+ " 'word': '##du',\n",
+ " 'start': 73,\n",
+ " 'end': 75},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.9410511,\n",
+ " 'index': 19,\n",
+ " 'word': '##lk',\n",
+ " 'start': 75,\n",
+ " 'end': 77},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.99601185,\n",
+ " 'index': 20,\n",
+ " 'word': '##ar',\n",
+ " 'start': 77,\n",
+ " 'end': 79},\n",
+ " {'entity': 'I-ORG',\n",
+ " 'score': 0.9637556,\n",
+ " 'index': 23,\n",
+ " 'word': 'Indian',\n",
+ " 'start': 89,\n",
+ " 'end': 95},\n",
+ " {'entity': 'I-ORG',\n",
+ " 'score': 0.9248884,\n",
+ " 'index': 24,\n",
+ " 'word': 'Cricket',\n",
+ " 'start': 96,\n",
+ " 'end': 103},\n",
+ " {'entity': 'I-ORG',\n",
+ " 'score': 0.98057497,\n",
+ " 'index': 25,\n",
+ " 'word': 'Team',\n",
+ " 'start': 104,\n",
+ " 'end': 108}]"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ner = pipeline(task = \"ner\", grouped_entities = False)\n",
+ "ner(\"Hey everyone, please welcome, the chief guest for tonight: Mr. Sachin Tendulkar from the Indian Cricket Team\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "pos = pipeline(task = \"token-classification\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'entity': 'I-PER',\n",
+ " 'score': 0.99938285,\n",
+ " 'index': 4,\n",
+ " 'word': 'S',\n",
+ " 'start': 11,\n",
+ " 'end': 12},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.99815494,\n",
+ " 'index': 5,\n",
+ " 'word': '##yl',\n",
+ " 'start': 12,\n",
+ " 'end': 14},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.9959072,\n",
+ " 'index': 6,\n",
+ " 'word': '##va',\n",
+ " 'start': 14,\n",
+ " 'end': 16},\n",
+ " {'entity': 'I-PER',\n",
+ " 'score': 0.99923277,\n",
+ " 'index': 7,\n",
+ " 'word': '##in',\n",
+ " 'start': 16,\n",
+ " 'end': 18},\n",
+ " {'entity': 'I-ORG',\n",
+ " 'score': 0.9738931,\n",
+ " 'index': 12,\n",
+ " 'word': 'Hu',\n",
+ " 'start': 33,\n",
+ " 'end': 35},\n",
+ " {'entity': 'I-ORG',\n",
+ " 'score': 0.97611505,\n",
+ " 'index': 13,\n",
+ " 'word': '##gging',\n",
+ " 'start': 35,\n",
+ " 'end': 40},\n",
+ " {'entity': 'I-ORG',\n",
+ " 'score': 0.9887976,\n",
+ " 'index': 14,\n",
+ " 'word': 'Face',\n",
+ " 'start': 41,\n",
+ " 'end': 45},\n",
+ " {'entity': 'I-LOC',\n",
+ " 'score': 0.9932106,\n",
+ " 'index': 16,\n",
+ " 'word': 'Brooklyn',\n",
+ " 'start': 49,\n",
+ " 'end': 57}]"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pos(\"My name is Sylvain and I work at Hugging Face in Brooklyn.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Question Answering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "{'score': 0.21678458154201508,\n",
+ " 'start': 48,\n",
+ " 'end': 76,\n",
+ " 'answer': 'I wish I could get some rest'}"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "bot = pipeline(\"question-answering\")\n",
+ "bot(\n",
+ " question = \"How am I doing?\",\n",
+ " context = \"I have just came back from a very busy trip and I wish I could get some rest.\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This model extracts the span of the given context that is most likely to be the answer; it does not generate an answer from scratch."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Summarization"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "summary = pipeline(\"summarization\")\n",
+ "\n",
+ "summary(\n",
+ "\"\"\"\n",
+ " America has changed dramatically during recent years. Not only has the number of \n",
+ " graduates in traditional engineering disciplines such as mechanical, civil, \n",
+ " electrical, chemical, and aeronautical engineering declined, but in most of \n",
+ " the premier American universities engineering curricula now concentrate on \n",
+ " and encourage largely the study of engineering science. As a result, there \n",
+ " are declining offerings in engineering subjects dealing with infrastructure, \n",
+ " the environment, and related issues, and greater concentration on high \n",
+ " technology subjects, largely supporting increasingly complex scientific \n",
+ " developments. While the latter is important, it should not be at the expense \n",
+ " of more traditional engineering.\n",
+ "\n",
+ " Rapidly developing economies such as China and India, as well as other \n",
+ " industrial countries in Europe and Asia, continue to encourage and advance \n",
+ " the teaching of engineering. Both China and India, respectively, graduate \n",
+ " six and eight times as many traditional engineers as does the United States. \n",
+ " Other industrial countries at minimum maintain their output, while America \n",
+ " suffers an increasingly serious decline in the number of engineering graduates \n",
+ " and a lack of well-educated engineers.\n",
+ "\"\"\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Translation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "KeyError",
+ "evalue": "'translation'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[34], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m translator \u001b[38;5;241m=\u001b[39m \u001b[43mpipeline\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mtranslation\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mHariSekhar/Eng_Marathi_translation\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/transformers/pipelines/__init__.py:692\u001b[0m, in \u001b[0;36mpipeline\u001b[0;34m(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)\u001b[0m\n\u001b[1;32m 690\u001b[0m hub_kwargs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m_commit_hash\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m config\u001b[38;5;241m.\u001b[39m_commit_hash\n\u001b[1;32m 691\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m config \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(model, \u001b[38;5;28mstr\u001b[39m):\n\u001b[0;32m--> 692\u001b[0m config \u001b[38;5;241m=\u001b[39m \u001b[43mAutoConfig\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_pretrained\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m_from_pipeline\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtask\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mhub_kwargs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mmodel_kwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 693\u001b[0m hub_kwargs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m_commit_hash\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m config\u001b[38;5;241m.\u001b[39m_commit_hash\n\u001b[1;32m 695\u001b[0m custom_tasks \u001b[38;5;241m=\u001b[39m {}\n",
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:917\u001b[0m, in \u001b[0;36mAutoConfig.from_pretrained\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 915\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m config_class\u001b[38;5;241m.\u001b[39mfrom_pretrained(pretrained_model_name_or_path, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 916\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmodel_type\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m config_dict:\n\u001b[0;32m--> 917\u001b[0m config_class \u001b[38;5;241m=\u001b[39m \u001b[43mCONFIG_MAPPING\u001b[49m\u001b[43m[\u001b[49m\u001b[43mconfig_dict\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mmodel_type\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\n\u001b[1;32m 918\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m config_class\u001b[38;5;241m.\u001b[39mfrom_dict(config_dict, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39munused_kwargs)\n\u001b[1;32m 919\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 920\u001b[0m \u001b[38;5;66;03m# Fallback: use pattern matching on the string.\u001b[39;00m\n\u001b[1;32m 921\u001b[0m \u001b[38;5;66;03m# We go from longer names to shorter names to catch roberta before bert (for instance)\u001b[39;00m\n",
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:623\u001b[0m, in \u001b[0;36m_LazyConfigMapping.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 621\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_extra_content[key]\n\u001b[1;32m 622\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m key \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_mapping:\n\u001b[0;32m--> 623\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key)\n\u001b[1;32m 624\u001b[0m value \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_mapping[key]\n\u001b[1;32m 625\u001b[0m module_name \u001b[38;5;241m=\u001b[39m model_type_to_module_name(key)\n",
+ "\u001b[0;31mKeyError\u001b[0m: 'translation'"
+ ]
+ }
+ ],
+ "source": [
+ "translator = pipeline(\"translation\", model = \"HariSekhar/Eng_Marathi_translation\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "translator(\"\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
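
Note on the failing translation cell: the traceback shows `CONFIG_MAPPING[config_dict["model_type"]]` raising `KeyError: 'translation'`, i.e. that checkpoint's `config.json` declares a `model_type` that `AutoConfig` does not recognize, so the pipeline cannot resolve an architecture. A minimal sketch of a working setup, which also pins the checkpoint explicitly as the repeated "Using a pipeline without specifying a model name and revision in production is not recommended" warnings above suggest; the checkpoint name below is an assumption for illustration, not part of this commit:

```python
from transformers import pipeline

# Pin an explicit checkpoint instead of relying on task defaults.
# NOTE: "Helsinki-NLP/opus-mt-en-mr" (an English -> Marathi Marian model) is an
# assumed, illustrative checkpoint, not the one tried in the notebook above.
translator = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-en-mr",
)
print(translator("It is a wonderful day today"))
```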
chapter2/2-behind-the-pipeline.ipynb ADDED
@@ -0,0 +1,340 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Using the pipeline function"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).\n",
+ "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import pipeline\n",
+ "\n",
+ "classifier = pipeline(task=\"sentiment-analysis\")\n",
+ "\n",
+ "inputs = [\"This was so bad I couldn´t finish it. The actresses are so bad at acting it feels like a bad comedy from minute one. The high rated reviews is obviously from friend/family and is pure BS.\",\n",
+ " \"I thought the cast was great. Brianna and Emma were exceptionaly talented in thier characters. Fun film.\"]\n",
+ "\n",
+ "outputs = classifier(inputs)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'label': 'NEGATIVE', 'score': 0.9995231628417969},\n",
+ " {'label': 'POSITIVE', 'score': 0.9998352527618408}]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Defining tokenizer and model manually"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Tokenizer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "checkpoint = \"distilbert/distilbert-base-uncased-finetuned-sst-2-english\"\n",
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pprint import pprint\n",
+ "tokenized_inputs = tokenizer(\n",
+ " inputs, padding=True, truncation=True, return_tensors=\"pt\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "tensor([ 101, 2023, 2001, 2061, 2919, 1045, 2481, 29658, 2102, 3926,\n",
+ " 2009, 1012, 1996, 19910, 2024, 2061, 2919, 2012, 3772, 2009,\n",
+ " 5683, 2066, 1037, 2919, 4038, 2013, 3371, 2028, 1012, 1996,\n",
+ " 2152, 6758, 4391, 2003, 5525, 2013, 2767, 1013, 2155, 1998,\n",
+ " 2003, 5760, 18667, 1012, 102])\n",
+ "tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(tokenized_inputs[\"input_ids\"][0], tokenized_inputs[\"attention_mask\"][0], sep = \"\\n\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "tensor([ 101, 1045, 2245, 1996, 3459, 2001, 2307, 1012, 25558, 1998,\n",
+ " 5616, 2020, 11813, 2100, 10904, 1999, 16215, 3771, 3494, 1012,\n",
+ " 4569, 2143, 1012, 102, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0])\n",
+ "tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(tokenized_inputs[\"input_ids\"][1], tokenized_inputs[\"attention_mask\"][1], sep = \"\\n\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(45, 45)"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(tokenized_inputs[\"input_ids\"][0]), len(tokenized_inputs[\"input_ids\"][1])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import AutoModelForSequenceClassification\n",
+ "import torch\n",
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint)\n",
+ "model.eval();"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.no_grad():\n",
+ " outputs = model(**tokenized_inputs)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['__annotations__', '__class__', '__class_getitem__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__or__', '__post_init__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'attentions', 'clear', 'copy', 'fromkeys', 'get', 'hidden_states', 'items', 'keys', 'logits', 'loss', 'move_to_end', 'pop', 'popitem', 'setdefault', 'to_tuple', 'update', 'values']\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(dir(outputs))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "tensor([[ 4.2415, -3.4063],\n",
+ " [-4.1783, 4.5328]])"
+ ]
+ },
+ "execution_count": 59,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "outputs.logits"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "tensor([[9.9952e-01, 4.7686e-04],\n",
+ " [1.6471e-04, 9.9984e-01]])"
+ ]
+ },
+ "execution_count": 60,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import torch.nn.functional as F\n",
+ "F.softmax(outputs.logits, dim = -1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 66,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "predictions = outputs.logits.argmax(dim = -1)\n",
+ "pred_probas = F.softmax(outputs.logits, dim = -1).max(dim = -1).values\n",
+ "\n",
+ "preds = []\n",
+ "for p, pp in zip(predictions, pred_probas):\n",
+ " preds.append({'label': model.config.id2label[p.item()], 'score': pp.item()})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 67,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'label': 'NEGATIVE', 'score': 0.9995231628417969},\n",
+ " {'label': 'POSITIVE', 'score': 0.9998352527618408}]"
+ ]
+ },
+ "execution_count": 67,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "preds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```\n",
+ "\n",
+ "Reference Output\n",
+ "\n",
+ "---\n",
+ "\n",
+ "[{'label': 'NEGATIVE', 'score': 0.9995231628417969},\n",
+ " {'label': 'POSITIVE', 'score': 0.9998352527618408}]\n",
+ "```"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
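
The manual steps in this notebook (tokenize with padding and truncation, run a no-grad forward pass, softmax the logits, map argmax indices through `model.config.id2label`) reproduce the `pipeline` output exactly. A minimal sketch that folds those same steps, using only the calls already shown above, into one helper:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

def classify(texts):
    # Tokenize to a padded batch of tensors, exactly as the notebook does.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    # Softmax over the label dimension, then map indices to label names.
    probas = F.softmax(logits, dim=-1)
    return [
        {"label": model.config.id2label[i.item()], "score": p.item()}
        for i, p in zip(probas.argmax(dim=-1), probas.max(dim=-1).values)
    ]
```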
chapter2/2-handling-multiple-sequences.ipynb ADDED
@@ -0,0 +1,384 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/huggingface/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n",
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "import torch\n",
+ "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
+ "\n",
+ "checkpoint = \"distilbert-base-uncased-finetuned-sst-2-english\"\n",
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint)\n",
+ "model.eval();"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Try a forward pass on a single example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "tensor([2057, 2342, 2062, 3737, 7435, 1010, 6145, 1998, 9559, 1999, 2256, 3842,\n",
+ " 1012])"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sequence = \"We need more quality doctors, engineers and lawyers in our nation.\"\n",
+ "token_ids = torch.tensor(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sequence)))\n",
+ "token_ids"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "RuntimeError",
+ "evalue": "The size of tensor a (13) must match the size of tensor b (512) at non-singleton dimension 1",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[3], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m torch\u001b[38;5;241m.\u001b[39mno_grad():\n\u001b[0;32m----> 2\u001b[0m \u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtoken_ids\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/torch/nn/modules/module.py:1532\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1530\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1531\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1532\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/torch/nn/modules/module.py:1541\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1536\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1537\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1538\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1539\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1540\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1541\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1543\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 1544\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n",
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/transformers/models/distilbert/modeling_distilbert.py:763\u001b[0m, in \u001b[0;36mDistilBertForSequenceClassification.forward\u001b[0;34m(self, input_ids, attention_mask, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)\u001b[0m\n\u001b[1;32m 755\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124mr\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 756\u001b[0m \u001b[38;5;124;03mlabels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):\u001b[39;00m\n\u001b[1;32m 757\u001b[0m \u001b[38;5;124;03m Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,\u001b[39;00m\n\u001b[1;32m 758\u001b[0m \u001b[38;5;124;03m config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If\u001b[39;00m\n\u001b[1;32m 759\u001b[0m \u001b[38;5;124;03m `config.num_labels > 1` a classification loss is computed (Cross-Entropy).\u001b[39;00m\n\u001b[1;32m 760\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 761\u001b[0m return_dict \u001b[38;5;241m=\u001b[39m return_dict \u001b[38;5;28;01mif\u001b[39;00m return_dict \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconfig\u001b[38;5;241m.\u001b[39muse_return_dict\n\u001b[0;32m--> 763\u001b[0m distilbert_output \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdistilbert\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 764\u001b[0m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 765\u001b[0m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 766\u001b[0m \u001b[43m \u001b[49m\u001b[43mhead_mask\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mhead_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 767\u001b[0m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 768\u001b[0m \u001b[43m \u001b[49m\u001b[43moutput_attentions\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43moutput_attentions\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 769\u001b[0m \u001b[43m \u001b[49m\u001b[43moutput_hidden_states\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43moutput_hidden_states\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 770\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_dict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_dict\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 771\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 772\u001b[0m hidden_state \u001b[38;5;241m=\u001b[39m distilbert_output[\u001b[38;5;241m0\u001b[39m] \u001b[38;5;66;03m# (bs, seq_len, dim)\u001b[39;00m\n\u001b[1;32m 773\u001b[0m pooled_output \u001b[38;5;241m=\u001b[39m hidden_state[:, \u001b[38;5;241m0\u001b[39m] \u001b[38;5;66;03m# (bs, dim)\u001b[39;00m\n",
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/torch/nn/modules/module.py:1532\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1530\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1531\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1532\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
76
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/torch/nn/modules/module.py:1541\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1536\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1537\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1538\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1539\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1540\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1541\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1543\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 1544\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n",
77
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/transformers/models/distilbert/modeling_distilbert.py:581\u001b[0m, in \u001b[0;36mDistilBertModel.forward\u001b[0;34m(self, input_ids, attention_mask, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict)\u001b[0m\n\u001b[1;32m 578\u001b[0m \u001b[38;5;66;03m# Prepare head mask if needed\u001b[39;00m\n\u001b[1;32m 579\u001b[0m head_mask \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mget_head_mask(head_mask, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconfig\u001b[38;5;241m.\u001b[39mnum_hidden_layers)\n\u001b[0;32m--> 581\u001b[0m embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43membeddings\u001b[49m\u001b[43m(\u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# (bs, seq_length, dim)\u001b[39;00m\n\u001b[1;32m 583\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtransformer(\n\u001b[1;32m 584\u001b[0m x\u001b[38;5;241m=\u001b[39membeddings,\n\u001b[1;32m 585\u001b[0m attn_mask\u001b[38;5;241m=\u001b[39mattention_mask,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 589\u001b[0m return_dict\u001b[38;5;241m=\u001b[39mreturn_dict,\n\u001b[1;32m 590\u001b[0m )\n",
78
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/torch/nn/modules/module.py:1532\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1530\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1531\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1532\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
79
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/torch/nn/modules/module.py:1541\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1536\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1537\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1538\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1539\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1540\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1541\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1543\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 1544\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n",
80
+ "File \u001b[0;32m~/huggingface/lib/python3.10/site-packages/transformers/models/distilbert/modeling_distilbert.py:135\u001b[0m, in \u001b[0;36mEmbeddings.forward\u001b[0;34m(self, input_ids, input_embeds)\u001b[0m\n\u001b[1;32m 131\u001b[0m position_ids \u001b[38;5;241m=\u001b[39m position_ids\u001b[38;5;241m.\u001b[39munsqueeze(\u001b[38;5;241m0\u001b[39m)\u001b[38;5;241m.\u001b[39mexpand_as(input_ids) \u001b[38;5;66;03m# (bs, max_seq_length)\u001b[39;00m\n\u001b[1;32m 133\u001b[0m position_embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mposition_embeddings(position_ids) \u001b[38;5;66;03m# (bs, max_seq_length, dim)\u001b[39;00m\n\u001b[0;32m--> 135\u001b[0m embeddings \u001b[38;5;241m=\u001b[39m \u001b[43minput_embeds\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mposition_embeddings\u001b[49m \u001b[38;5;66;03m# (bs, max_seq_length, dim)\u001b[39;00m\n\u001b[1;32m 136\u001b[0m embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mLayerNorm(embeddings) \u001b[38;5;66;03m# (bs, max_seq_length, dim)\u001b[39;00m\n\u001b[1;32m 137\u001b[0m embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdropout(embeddings) \u001b[38;5;66;03m# (bs, max_seq_length, dim)\u001b[39;00m\n",
81
+ "\u001b[0;31mRuntimeError\u001b[0m: The size of tensor a (13) must match the size of tensor b (512) at non-singleton dimension 1"
82
+ ]
83
+ }
84
+ ],
85
+ "source": [
86
+ "with torch.no_grad():\n",
87
+ " model(token_ids)"
88
+ ]
89
+ },
90
+ {
91
+ "cell_type": "code",
92
+ "execution_count": null,
93
+ "metadata": {},
94
+ "outputs": [
95
+ {
96
+ "data": {
97
+ "text/plain": [
98
+ "tensor([2057, 2342, 2062, 3737, 7435, 1010, 6145, 1998, 9559, 1999, 2256, 3842,\n",
99
+ " 1012])"
100
+ ]
101
+ },
102
+ "execution_count": 4,
103
+ "metadata": {},
104
+ "output_type": "execute_result"
105
+ }
106
+ ],
107
+ "source": [
108
+ "token_ids"
109
+ ]
110
+ },
111
+ {
112
+ "cell_type": "markdown",
113
+ "metadata": {},
114
+ "source": [
115
+ "As seen above our model does not have a batch dimension because of which we are seeing this issue. Let's add a batch dimension and then pass our sequence through the model"
116
+ ]
117
+ },
118
+ {
119
+ "cell_type": "code",
120
+ "execution_count": null,
121
+ "metadata": {},
122
+ "outputs": [
123
+ {
124
+ "data": {
125
+ "text/plain": [
126
+ "SequenceClassifierOutput(loss=None, logits=tensor([[ 1.2781, -1.0656]]), hidden_states=None, attentions=None)"
127
+ ]
128
+ },
129
+ "execution_count": 5,
130
+ "metadata": {},
131
+ "output_type": "execute_result"
132
+ }
133
+ ],
134
+ "source": [
135
+ "with torch.no_grad():\n",
136
+ " out = model(token_ids.unsqueeze(0))\n",
137
+ "out"
138
+ ]
139
+ },
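+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an aside, the logits above can be turned into class probabilities with a softmax over the last dimension. A minimal sketch (assuming the checkpoint exposes the usual `id2label` mapping in its config):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Softmax the logits to get per-class probabilities, then map column index to label name\n",
+ "probs = torch.nn.functional.softmax(out.logits, dim=-1)\n",
+ "print(probs)\n",
+ "print(model.config.id2label)"
+ ]
+ },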
140
+ {
141
+ "cell_type": "markdown",
142
+ "metadata": {},
143
+ "source": [
144
+ "Let's try by duplicating the input if we get the same logits"
145
+ ]
146
+ },
147
+ {
148
+ "cell_type": "code",
149
+ "execution_count": null,
150
+ "metadata": {},
151
+ "outputs": [
152
+ {
153
+ "data": {
154
+ "text/plain": [
155
+ "tensor([[ 1.2781, -1.0656],\n",
156
+ " [ 1.2781, -1.0656]])"
157
+ ]
158
+ },
159
+ "execution_count": 6,
160
+ "metadata": {},
161
+ "output_type": "execute_result"
162
+ }
163
+ ],
164
+ "source": [
165
+ "with torch.no_grad():\n",
166
+ " inp = torch.cat([token_ids.unsqueeze(0), token_ids.unsqueeze(0)], dim = 0)\n",
167
+ " out = model(inp)\n",
168
+ "out.logits"
169
+ ]
170
+ },
171
+ {
172
+ "cell_type": "markdown",
173
+ "metadata": {},
174
+ "source": [
175
+ "# Input padding"
176
+ ]
177
+ },
178
+ {
179
+ "cell_type": "code",
180
+ "execution_count": null,
181
+ "metadata": {},
182
+ "outputs": [
183
+ {
184
+ "name": "stdout",
185
+ "output_type": "stream",
186
+ "text": [
187
+ "tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)\n",
188
+ "tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)\n",
189
+ "tensor([[ 1.5694, -1.3895],\n",
190
+ " [ 0.9907, -0.9139]], grad_fn=<AddmmBackward0>)\n"
191
+ ]
192
+ }
193
+ ],
194
+ "source": [
195
+ "padding_id = 100\n",
196
+ "\n",
197
+ "batched_ids = [\n",
198
+ " [200, 200, 200],\n",
199
+ " [200, 200, padding_id],\n",
200
+ "]\n",
201
+ "\n",
202
+ "print(model(torch.tensor([batched_ids[0]])).logits)\n",
203
+ "print(model(torch.tensor([batched_ids[1][:2]])).logits)\n",
204
+ "print(model(torch.tensor(batched_ids)).logits)"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "markdown",
209
+ "metadata": {},
210
+ "source": [
211
+ "There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!\n",
212
+ "\n",
213
+ "This is because when we add padding, we need to make sure we nullify it's impact during the attention matrix computation step. This is why we need a mask so that we can explicily shut these tokens from the attention calculation."
214
+ ]
215
+ },
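+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of the fix with the same toy `batched_ids` (the mask below is our assumption: 1 marks a real token, 0 marks padding). With the padding position masked out, the second row of the batched logits should match the logits of the unpadded second sequence:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "attention_mask = [\n",
+ "    [1, 1, 1],\n",
+ "    [1, 1, 0],\n",
+ "]\n",
+ "\n",
+ "# Pass the mask alongside the padded batch so attention ignores the pad position\n",
+ "outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))\n",
+ "print(outputs.logits)"
+ ]
+ },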
216
+ {
217
+ "cell_type": "markdown",
218
+ "metadata": {},
219
+ "source": [
220
+ "# Cross checking the working of attention masks"
221
+ ]
222
+ },
223
+ {
224
+ "cell_type": "code",
225
+ "execution_count": null,
226
+ "metadata": {},
227
+ "outputs": [
228
+ {
229
+ "data": {
230
+ "text/plain": [
231
+ "{'input_ids': tensor([[ 101, 1045, 1521, 2310, 2042, 3403, 2005, 1037, 17662, 12172,\n",
232
+ " 2607, 2026, 2878, 2166, 1012, 102],\n",
233
+ " [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0,\n",
234
+ " 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
235
+ " [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}"
236
+ ]
237
+ },
238
+ "execution_count": 11,
239
+ "metadata": {},
240
+ "output_type": "execute_result"
241
+ }
242
+ ],
243
+ "source": [
244
+ "tokens"
245
+ ]
246
+ },
247
+ {
248
+ "cell_type": "code",
249
+ "execution_count": null,
250
+ "metadata": {},
251
+ "outputs": [],
252
+ "source": [
253
+ "sentences = [\"I’ve been waiting for a HuggingFace course my whole life.\",\n",
254
+ " \"I hate this so much!\"]\n",
255
+ "tokens = tokenizer(sentences, padding=True, return_tensors=\"pt\")\n",
256
+ "with torch.no_grad():\n",
257
+ " out = model(**tokens)"
258
+ ]
259
+ },
260
+ {
261
+ "cell_type": "code",
262
+ "execution_count": null,
263
+ "metadata": {},
264
+ "outputs": [
265
+ {
266
+ "name": "stdout",
267
+ "output_type": "stream",
268
+ "text": [
269
+ "{'input_ids': tensor([[ 101, 1045, 1521, 2310, 2042, 3403, 2005, 1037, 17662, 12172,\n",
270
+ " 2607, 2026, 2878, 2166, 1012, 102],\n",
271
+ " [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0,\n",
272
+ " 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
273
+ " [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}\n"
274
+ ]
275
+ }
276
+ ],
277
+ "source": [
278
+ "print(tokens)"
279
+ ]
280
+ },
281
+ {
282
+ "cell_type": "code",
283
+ "execution_count": null,
284
+ "metadata": {},
285
+ "outputs": [
286
+ {
287
+ "data": {
288
+ "text/plain": [
289
+ "tensor([[-1.5979, 1.6390],\n",
290
+ " [ 4.1692, -3.3464]])"
291
+ ]
292
+ },
293
+ "execution_count": 32,
294
+ "metadata": {},
295
+ "output_type": "execute_result"
296
+ }
297
+ ],
298
+ "source": [
299
+ "out.logits"
300
+ ]
301
+ },
302
+ {
303
+ "cell_type": "code",
304
+ "execution_count": null,
305
+ "metadata": {},
306
+ "outputs": [
307
+ {
308
+ "name": "stdout",
309
+ "output_type": "stream",
310
+ "text": [
311
+ "tensor([[-1.5979, 1.6390]])\n"
312
+ ]
313
+ }
314
+ ],
315
+ "source": [
316
+ "# Do the entire forward pass manually for sentence 1\n",
317
+ "\n",
318
+ "# Tokenize the sentence and get the tokenids\n",
319
+ "token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0]))\n",
320
+ "\n",
321
+ "# Add the special token CLS and SEP at the start and end of the token rspectively\n",
322
+ "token_ids = [101] + token_ids + [102]\n",
323
+ "\n",
324
+ "# Perform the forward pass and print the logits\n",
325
+ "with torch.no_grad():\n",
326
+ " print(model(torch.tensor([token_ids])).logits)"
327
+ ]
328
+ },
329
+ {
330
+ "cell_type": "code",
331
+ "execution_count": null,
332
+ "metadata": {},
333
+ "outputs": [
334
+ {
335
+ "name": "stdout",
336
+ "output_type": "stream",
337
+ "text": [
338
+ "tensor([[ 4.1692, -3.3464]])\n"
339
+ ]
340
+ }
341
+ ],
342
+ "source": [
343
+ "# Do the entire forward pass manually for sentence 2\n",
344
+ "\n",
345
+ "# Tokenize the sentence and get the tokenids\n",
346
+ "s0_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0]))\n",
347
+ "token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[1]))\n",
348
+ "s1_tokens = len(token_ids)\n",
349
+ "additional_ids = len(s0_ids) - len(token_ids)\n",
350
+ "\n",
351
+ "# Add the special token CLS and SEP at the start and end of the token repectively\n",
352
+ "# Also create an attention mask here to stop the attention from considering additional padding tokens\n",
353
+ "token_ids = [101] + token_ids + [102] + [0 for _ in range(additional_ids)]\n",
354
+ "attention_mask = [1 for _ in range(s1_tokens + 2)] + [0 for _ in range(additional_ids)]\n",
355
+ "\n",
356
+ "# Perform the forward pass and print the logits\n",
357
+ "with torch.no_grad():\n",
358
+ " print(model(input_ids = torch.tensor([token_ids]),\n",
359
+ " attention_mask = torch.tensor([attention_mask])).logits)"
360
+ ]
361
+ }
362
+ ],
363
+ "metadata": {
364
+ "kernelspec": {
365
+ "display_name": "Python 3",
366
+ "language": "python",
367
+ "name": "python3"
368
+ },
369
+ "language_info": {
370
+ "codemirror_mode": {
371
+ "name": "ipython",
372
+ "version": 3
373
+ },
374
+ "file_extension": ".py",
375
+ "mimetype": "text/x-python",
376
+ "name": "python",
377
+ "nbconvert_exporter": "python",
378
+ "pygments_lexer": "ipython3",
379
+ "version": "3.10.14"
380
+ }
381
+ },
382
+ "nbformat": 4,
383
+ "nbformat_minor": 2
384
+ }
chapter2/2-tokenizers.ipynb ADDED
@@ -0,0 +1,231 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 3,
6
+ "metadata": {},
7
+ "outputs": [
8
+ {
9
+ "name": "stderr",
10
+ "output_type": "stream",
11
+ "text": [
12
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
13
+ " warnings.warn(\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "from transformers import BertTokenizer\n",
19
+ "from pprint import pprint\n",
20
+ "tokenizer = BertTokenizer.from_pretrained(\"bert-base-cased\")"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": 10,
26
+ "metadata": {},
27
+ "outputs": [
28
+ {
29
+ "name": "stdout",
30
+ "output_type": "stream",
31
+ "text": [
32
+ "{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
33
+ " 'input_ids': [101,\n",
34
+ " 1109,\n",
35
+ " 20164,\n",
36
+ " 10932,\n",
37
+ " 2271,\n",
38
+ " 7954,\n",
39
+ " 10176,\n",
40
+ " 1110,\n",
41
+ " 2385,\n",
42
+ " 1107,\n",
43
+ " 7926,\n",
44
+ " 8588,\n",
45
+ " 102],\n",
46
+ " 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}\n"
47
+ ]
48
+ }
49
+ ],
50
+ "source": [
51
+ "pprint(tokenizer(\"The HuggingFace Course is quite intuitive\"))"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "code",
56
+ "execution_count": 11,
57
+ "metadata": {},
58
+ "outputs": [
59
+ {
60
+ "data": {
61
+ "text/plain": [
62
+ "('./artifacts/tokenizer_config.json',\n",
63
+ " './artifacts/special_tokens_map.json',\n",
64
+ " './artifacts/vocab.txt',\n",
65
+ " './artifacts/added_tokens.json')"
66
+ ]
67
+ },
68
+ "execution_count": 11,
69
+ "metadata": {},
70
+ "output_type": "execute_result"
71
+ }
72
+ ],
73
+ "source": [
74
+ "tokenizer.save_pretrained(save_directory=\"./artifacts/\")"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "markdown",
79
+ "metadata": {},
80
+ "source": [
81
+ "# Breaking it down"
82
+ ]
83
+ },
84
+ {
85
+ "cell_type": "code",
86
+ "execution_count": 12,
87
+ "metadata": {},
88
+ "outputs": [],
89
+ "source": [
90
+ "sequence = \"The HuggingFace Course is quite intuitive\""
91
+ ]
92
+ },
93
+ {
94
+ "cell_type": "code",
95
+ "execution_count": 14,
96
+ "metadata": {},
97
+ "outputs": [
98
+ {
99
+ "name": "stdout",
100
+ "output_type": "stream",
101
+ "text": [
102
+ "['The', 'Hu', '##gging', '##F', '##ace', 'Course', 'is', 'quite', 'in', '##tu', '##itive']\n"
103
+ ]
104
+ }
105
+ ],
106
+ "source": [
107
+ "tokens = tokenizer.tokenize(sequence)\n",
108
+ "print(tokens)"
109
+ ]
110
+ },
111
+ {
112
+ "cell_type": "code",
113
+ "execution_count": 15,
114
+ "metadata": {},
115
+ "outputs": [
116
+ {
117
+ "data": {
118
+ "text/plain": [
119
+ "[1109, 20164, 10932, 2271, 7954, 10176, 1110, 2385, 1107, 7926, 8588]"
120
+ ]
121
+ },
122
+ "execution_count": 15,
123
+ "metadata": {},
124
+ "output_type": "execute_result"
125
+ }
126
+ ],
127
+ "source": [
128
+ "token_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
129
+ "token_ids"
130
+ ]
131
+ },
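+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The mapping also works in reverse; a quick sketch: `convert_ids_to_tokens` recovers the subword tokens from the ids, and `decode` stitches them back into a string."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Round-trip: ids -> tokens -> text\n",
+ "print(tokenizer.convert_ids_to_tokens(token_ids))\n",
+ "print(tokenizer.decode(token_ids))"
+ ]
+ },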
132
+ {
133
+ "cell_type": "markdown",
134
+ "metadata": {},
135
+ "source": [
136
+ "Try tokenization using tokenize method and the __call__ method of the tokenizer object and confirm the outputs"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": 24,
142
+ "metadata": {},
143
+ "outputs": [
144
+ {
145
+ "name": "stdout",
146
+ "output_type": "stream",
147
+ "text": [
148
+ "[101, 146, 787, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 102]\n",
149
+ "[CLS] I ’ ve been waiting for a HuggingFace course my whole life. [SEP]\n",
150
+ "\n",
151
+ "[146, 787, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119]\n",
152
+ "I ’ ve been waiting for a HuggingFace course my whole life.\n",
153
+ "====================================================================================================\n",
154
+ "[101, 146, 4819, 1142, 1177, 1277, 106, 102]\n",
155
+ "[CLS] I hate this so much! [SEP]\n",
156
+ "\n",
157
+ "[146, 4819, 1142, 1177, 1277, 106]\n",
158
+ "I hate this so much!\n",
159
+ "====================================================================================================\n"
160
+ ]
161
+ }
162
+ ],
163
+ "source": [
164
+ "sentences = [\"I’ve been waiting for a HuggingFace course my whole life.\", \"I hate this so much!\"]\n",
165
+ "\n",
166
+ "for sentence in sentences:\n",
167
+ " # 1: Perform tokenization using the default call method\n",
168
+ " token_ids = tokenizer(sentence)[\"input_ids\"]\n",
169
+ " print(token_ids)\n",
170
+ " print(tokenizer.decode(token_ids))\n",
171
+ " print()\n",
172
+ "\n",
173
+ " # 2: First tokenize and then convert to ids\n",
174
+ " tokens = tokenizer.tokenize(sentence)\n",
175
+ " token_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
176
+ " print(token_ids)\n",
177
+ " print(tokenizer.decode(token_ids))\n",
178
+ "\n",
179
+ " print(\"=\"*100)"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 22,
185
+ "metadata": {},
186
+ "outputs": [
187
+ {
188
+ "data": {
189
+ "text/plain": [
190
+ "'[CLS] [SEP]'"
191
+ ]
192
+ },
193
+ "execution_count": 22,
194
+ "metadata": {},
195
+ "output_type": "execute_result"
196
+ }
197
+ ],
198
+ "source": [
199
+ "tokenizer.decode([101, 102])"
200
+ ]
201
+ },
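+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Equivalently, the ids produced by the `__call__` method can be reconstructed by hand; a minimal sketch, assuming the tokenizer exposes `cls_token_id` and `sep_token_id` (BERT tokenizers do):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Manually wrap the ids with the special tokens and compare against __call__\n",
+ "ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0]))\n",
+ "manual_ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]\n",
+ "print(manual_ids == tokenizer(sentences[0])[\"input_ids\"])"
+ ]
+ },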
202
+ {
203
+ "cell_type": "markdown",
204
+ "metadata": {},
205
+ "source": [
206
+ "The difference in the first and last token values is because of the introduction of special tokens which is proposed in the BERT paper otherwise all the tokens are exactly the same."
207
+ ]
208
+ }
209
+ ],
210
+ "metadata": {
211
+ "kernelspec": {
212
+ "display_name": "Python 3",
213
+ "language": "python",
214
+ "name": "python3"
215
+ },
216
+ "language_info": {
217
+ "codemirror_mode": {
218
+ "name": "ipython",
219
+ "version": 3
220
+ },
221
+ "file_extension": ".py",
222
+ "mimetype": "text/x-python",
223
+ "name": "python",
224
+ "nbconvert_exporter": "python",
225
+ "pygments_lexer": "ipython3",
226
+ "version": "3.10.14"
227
+ }
228
+ },
229
+ "nbformat": 4,
230
+ "nbformat_minor": 2
231
+ }
chapter3/3-a-full-training.ipynb ADDED
@@ -0,0 +1,393 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Train with Pytorch"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 1,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stderr",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "/home/huggingface/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
20
+ " from .autonotebook import tqdm as notebook_tqdm\n",
21
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
22
+ " warnings.warn(\n",
23
+ "Map: 100%|██████████| 872/872 [00:00<00:00, 15492.15 examples/s]\n"
24
+ ]
25
+ }
26
+ ],
27
+ "source": [
28
+ "from datasets import load_dataset\n",
29
+ "from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification\n",
30
+ "\n",
31
+ "raw_dataset = load_dataset(\"glue\", \"sst2\")\n",
32
+ "checkpoint = \"bert-base-uncased\"\n",
33
+ "\n",
34
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
35
+ "\n",
36
+ "# # For MRPC\n",
37
+ "# def tokenize_function(sample):\n",
38
+ "# return tokenizer(sample[\"sentence1\"], sample[\"sentence2\"], truncation = True)\n",
39
+ "\n",
40
+ "# For SST2\n",
41
+ "def tokenize_function(sample):\n",
42
+ " return tokenizer(sample[\"sentence\"], truncation = True)\n",
43
+ "\n",
44
+ "\n",
45
+ "tokenized_dataset = raw_dataset.map(tokenize_function, batched = True)\n",
46
+ "data_collator = DataCollatorWithPadding(tokenizer = tokenizer)"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "markdown",
51
+ "metadata": {},
52
+ "source": [
53
+ "# Preprocess the dataset "
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "execution_count": 2,
59
+ "metadata": {},
60
+ "outputs": [
61
+ {
62
+ "data": {
63
+ "text/plain": [
64
+ "{'train': ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
65
+ " 'validation': ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
66
+ " 'test': ['labels', 'input_ids', 'token_type_ids', 'attention_mask']}"
67
+ ]
68
+ },
69
+ "execution_count": 2,
70
+ "metadata": {},
71
+ "output_type": "execute_result"
72
+ }
73
+ ],
74
+ "source": [
75
+ "# Remove unwanted columns which are not to be uitilized during pytorch dataloading\n",
76
+ "# # For MRPC\n",
77
+ "# tokenized_dataset = tokenized_dataset.remove_columns([\"sentence1\", \"sentence2\", \"idx\"])\n",
78
+ "\n",
79
+ "# For SST2\n",
80
+ "tokenized_dataset = tokenized_dataset.remove_columns([\"sentence\", \"idx\"])\n",
81
+ "\n",
82
+ "# Rename the target column appropriately\n",
83
+ "tokenized_dataset = tokenized_dataset.rename_column(\"label\", \"labels\")\n",
84
+ "\n",
85
+ "# Set the format to return tensors instead of lists\n",
86
+ "tokenized_dataset.set_format(\"torch\")\n",
87
+ "\n",
88
+ "tokenized_dataset.column_names"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "code",
93
+ "execution_count": 3,
94
+ "metadata": {},
95
+ "outputs": [],
96
+ "source": [
97
+ "from torch.utils.data import DataLoader\n",
98
+ "\n",
99
+ "train_dataloader = DataLoader(tokenized_dataset[\"train\"], shuffle = True, batch_size = 64, collate_fn = data_collator)\n",
100
+ "eval_dataloader = DataLoader(tokenized_dataset[\"validation\"], batch_size = 64, collate_fn= data_collator)"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": 4,
106
+ "metadata": {},
107
+ "outputs": [
108
+ {
109
+ "name": "stderr",
110
+ "output_type": "stream",
111
+ "text": [
112
+ "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
113
+ ]
114
+ },
115
+ {
116
+ "data": {
117
+ "text/plain": [
118
+ "{'labels': torch.Size([64]),\n",
119
+ " 'input_ids': torch.Size([64, 41]),\n",
120
+ " 'token_type_ids': torch.Size([64, 41]),\n",
121
+ " 'attention_mask': torch.Size([64, 41])}"
122
+ ]
123
+ },
124
+ "execution_count": 4,
125
+ "metadata": {},
126
+ "output_type": "execute_result"
127
+ }
128
+ ],
129
+ "source": [
130
+ "one_batch = next(iter(train_dataloader))\n",
131
+ "{k: v.shape for k, v in one_batch.items()}"
132
+ ]
133
+ },
134
+ {
135
+ "cell_type": "markdown",
136
+ "metadata": {},
137
+ "source": [
138
+ "# Define the model and start training"
139
+ ]
140
+ },
141
+ {
142
+ "cell_type": "code",
143
+ "execution_count": 5,
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stderr",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']\n",
151
+ "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
152
+ "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
153
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']\n",
154
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
155
+ ]
156
+ }
157
+ ],
158
+ "source": [
159
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)"
160
+ ]
161
+ },
162
+ {
163
+ "cell_type": "code",
164
+ "execution_count": 6,
165
+ "metadata": {},
166
+ "outputs": [
167
+ {
168
+ "name": "stdout",
169
+ "output_type": "stream",
170
+ "text": [
171
+ "SequenceClassifierOutput(loss=tensor(0.7528), logits=tensor([[-0.4735, 0.2345],\n",
172
+ " [-0.5462, 0.2849],\n",
173
+ " [-0.8623, 0.6073],\n",
174
+ " [-0.6334, 0.3747],\n",
175
+ " [-0.5882, 0.4656],\n",
176
+ " [-0.1711, 0.1957],\n",
177
+ " [-0.4656, 0.2387],\n",
178
+ " [-0.8434, 0.6939],\n",
179
+ " [-0.4384, 0.2810],\n",
180
+ " [-0.5239, 0.2832],\n",
181
+ " [-0.4431, 0.2877],\n",
182
+ " [-0.5974, 0.2958],\n",
183
+ " [-0.7655, 0.6273],\n",
184
+ " [-0.7656, 0.6703],\n",
185
+ " [-0.7001, 0.4183],\n",
186
+ " [-0.3617, 0.2145],\n",
187
+ " [-0.6250, 0.3684],\n",
188
+ " [-0.5722, 0.4677],\n",
189
+ " [-0.1536, 0.1978],\n",
190
+ " [-0.5606, 0.3755],\n",
191
+ " [-0.6292, 0.3662],\n",
192
+ " [-0.7420, 0.3527],\n",
193
+ " [-0.4581, 0.2733],\n",
194
+ " [-0.6560, 0.4098],\n",
195
+ " [-0.2436, 0.1589],\n",
196
+ " [-0.5316, 0.2916],\n",
197
+ " [-0.6136, 0.3340],\n",
198
+ " [-0.6650, 0.3447],\n",
199
+ " [-0.6319, 0.4982],\n",
200
+ " [-0.7093, 0.4292],\n",
201
+ " [-0.3495, 0.2136],\n",
202
+ " [-0.5344, 0.2056],\n",
203
+ " [-0.2243, 0.2376],\n",
204
+ " [-0.2150, 0.2638],\n",
205
+ " [-0.6236, 0.4449],\n",
206
+ " [-0.3363, 0.2330],\n",
207
+ " [-0.7103, 0.5592],\n",
208
+ " [-0.6709, 0.4674],\n",
209
+ " [-0.6250, 0.4823],\n",
210
+ " [-0.8934, 0.8637],\n",
211
+ " [-0.7147, 0.4695],\n",
212
+ " [-0.4029, 0.2238],\n",
213
+ " [-0.6455, 0.4327],\n",
214
+ " [-0.2547, 0.2432],\n",
215
+ " [-0.3518, 0.3581],\n",
216
+ " [-0.1312, 0.1507],\n",
217
+ " [-0.5558, 0.4219],\n",
218
+ " [-0.4881, 0.3416],\n",
219
+ " [-0.6623, 0.4497],\n",
220
+ " [-0.5963, 0.4848],\n",
221
+ " [-0.5053, 0.3500],\n",
222
+ " [-0.1152, 0.1482],\n",
223
+ " [-0.6302, 0.3531],\n",
224
+ " [-0.6268, 0.4978],\n",
225
+ " [-0.4811, 0.2927],\n",
226
+ " [ 0.0057, 0.1694],\n",
227
+ " [-0.6268, 0.3306],\n",
228
+ " [-0.5859, 0.4029],\n",
229
+ " [-0.3552, 0.2425],\n",
230
+ " [-0.5622, 0.4161],\n",
231
+ " [-0.7670, 0.5203],\n",
232
+ " [-0.6624, 0.5146],\n",
233
+ " [-0.6089, 0.4091],\n",
234
+ " [-0.4992, 0.2702]]), hidden_states=None, attentions=None)\n"
235
+ ]
236
+ }
237
+ ],
238
+ "source": [
239
+ "import torch\n",
240
+ "model.eval()\n",
241
+ "with torch.no_grad():\n",
242
+ " print(model(**one_batch))"
243
+ ]
244
+ },
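+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since this is a two-label classification head, the returned `loss` should simply be cross-entropy over the logits. A quick sanity check of that assumption:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch.nn.functional as F\n",
+ "\n",
+ "with torch.no_grad():\n",
+ "    outputs = model(**one_batch)\n",
+ "\n",
+ "# The two printed values should be (near-)identical\n",
+ "print(outputs.loss, F.cross_entropy(outputs.logits, one_batch[\"labels\"]))"
+ ]
+ },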
245
+ {
246
+ "cell_type": "code",
247
+ "execution_count": 7,
248
+ "metadata": {},
249
+ "outputs": [
250
+ {
251
+ "name": "stderr",
252
+ "output_type": "stream",
253
+ "text": [
254
+ "/home/huggingface/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
255
+ " warnings.warn(\n"
256
+ ]
257
+ },
258
+ {
259
+ "name": "stdout",
260
+ "output_type": "stream",
261
+ "text": [
262
+ "2106\n"
263
+ ]
264
+ }
265
+ ],
266
+ "source": [
267
+ "from transformers import AdamW\n",
268
+ "from transformers import get_scheduler\n",
269
+ "\n",
270
+ "# Define the optimizer here\n",
271
+ "optimizer = AdamW(model.parameters(), lr = 5e-5)\n",
272
+ "\n",
273
+ "# Define the learning rate scheduler here\n",
274
+ "num_epochs = 2\n",
275
+ "num_training_steps = num_epochs * len(train_dataloader)\n",
276
+ "lr_scheduler = get_scheduler(\n",
277
+ " \"linear\",\n",
278
+ " optimizer=optimizer,\n",
279
+ " num_warmup_steps=0,\n",
280
+ " num_training_steps=num_training_steps,\n",
281
+ ")\n",
282
+ "print(num_training_steps)\n"
283
+ ]
284
+ },
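+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see what the linear schedule does, here is a small standalone sketch with a throwaway optimizer (the names `dummy_opt` and `dummy_sched` are ours, for illustration only); the learning rate should fall linearly from 5e-5 towards 0 over the given number of steps:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustration only: watch the LR decay linearly over 10 steps\n",
+ "dummy_opt = AdamW([torch.nn.Parameter(torch.zeros(1))], lr=5e-5)\n",
+ "dummy_sched = get_scheduler(\"linear\", optimizer=dummy_opt, num_warmup_steps=0, num_training_steps=10)\n",
+ "for step in range(10):\n",
+ "    print(step, dummy_sched.get_last_lr())\n",
+ "    dummy_opt.step()\n",
+ "    dummy_sched.step()"
+ ]
+ },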
285
+ {
286
+ "cell_type": "code",
287
+ "execution_count": 8,
288
+ "metadata": {},
289
+ "outputs": [],
290
+ "source": [
291
+ "# Use GPU if available\n",
292
+ "device = torch.device(\"cuda:0\") if torch.cuda.is_available() else torch.device(\"cpu\")\n",
293
+ "model.to(device);"
294
+ ]
295
+ },
296
+ {
297
+ "cell_type": "code",
298
+ "execution_count": 9,
299
+ "metadata": {},
300
+ "outputs": [
301
+ {
302
+ "name": "stderr",
303
+ "output_type": "stream",
304
+ "text": [
305
+ " 50%|█████ | 1054/2106 [03:48<13:25, 1.31it/s]"
306
+ ]
307
+ },
308
+ {
309
+ "name": "stdout",
310
+ "output_type": "stream",
311
+ "text": [
312
+ "Metrics at end of epoch 0:\n",
313
+ "{'accuracy': 0.9288990825688074}\n"
314
+ ]
315
+ },
316
+ {
317
+ "name": "stderr",
318
+ "output_type": "stream",
319
+ "text": [
320
+ "100%|█████████▉| 2105/2106 [07:35<00:00, 4.98it/s]"
321
+ ]
322
+ },
323
+ {
324
+ "name": "stdout",
325
+ "output_type": "stream",
326
+ "text": [
327
+ "Metrics at end of epoch 1:\n",
328
+ "{'accuracy': 0.926605504587156}\n"
329
+ ]
330
+ }
331
+ ],
332
+ "source": [
333
+ "from tqdm.auto import tqdm\n",
334
+ "import evaluate\n",
335
+ "progress_bar = tqdm(range(num_training_steps))\n",
336
+ "\n",
337
+ "for epoch_id in range(num_epochs):\n",
338
+ "\n",
339
+ " # Train for one epoch\n",
340
+ " model.train()\n",
341
+ " for batch in train_dataloader:\n",
342
+ " batch = {k: v.to(device) for k, v in batch.items()}\n",
343
+ " outputs = model(**batch)\n",
344
+ " outputs.loss.backward()\n",
345
+ "\n",
346
+ " optimizer.step()\n",
347
+ " lr_scheduler.step()\n",
348
+ " optimizer.zero_grad()\n",
349
+ " progress_bar.update(1)\n",
350
+ "\n",
351
+ " # Evaluate at the end of epoch\n",
352
+ " model.eval()\n",
353
+ " # # For MRPC\n",
354
+ " # metric = evaluate.load(\"glue\", \"mrpc\")\n",
355
+ "\n",
356
+ " # For SST2\n",
357
+ " metric = evaluate.load(\"glue\", \"sst2\")\n",
358
+ "\n",
359
+ " with torch.no_grad():\n",
360
+ " for batch in eval_dataloader:\n",
361
+ " batch = {k: v.to(device) for k, v in batch.items()}\n",
362
+ " outputs = model(**batch)\n",
363
+ " logits = outputs.logits\n",
364
+ " predictions = logits.argmax(dim = -1)\n",
365
+ " metric.add_batch(predictions = predictions, references = batch[\"labels\"])\n",
366
+ " m = metric.compute()\n",
367
+ "\n",
368
+ " print(f\"Metrics at end of epoch {epoch_id}:\\n{m}\")\n"
369
+ ]
370
+ }
371
+ ],
372
+ "metadata": {
373
+ "kernelspec": {
374
+ "display_name": "Python 3",
375
+ "language": "python",
376
+ "name": "python3"
377
+ },
378
+ "language_info": {
379
+ "codemirror_mode": {
380
+ "name": "ipython",
381
+ "version": 3
382
+ },
383
+ "file_extension": ".py",
384
+ "mimetype": "text/x-python",
385
+ "name": "python",
386
+ "nbconvert_exporter": "python",
387
+ "pygments_lexer": "ipython3",
388
+ "version": "3.10.14"
389
+ }
390
+ },
391
+ "nbformat": 4,
392
+ "nbformat_minor": 2
393
+ }
chapter3/3-fine-tuning-a-model-with-the-Trainer-API.ipynb ADDED
@@ -0,0 +1,685 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Fine tuning bert base uncased for paraphrasing identification task"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 1,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stderr",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "/home/huggingface/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
20
+ " from .autonotebook import tqdm as notebook_tqdm\n",
21
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
22
+ " warnings.warn(\n",
23
+ "Map: 100%|██████████| 408/408 [00:00<00:00, 8416.87 examples/s]\n"
24
+ ]
25
+ }
26
+ ],
27
+ "source": [
28
+ "from datasets import load_dataset\n",
29
+ "from transformers import AutoTokenizer, DataCollatorWithPadding\n",
30
+ "\n",
31
+ "raw_datasets = load_dataset(\"glue\", \"mrpc\")\n",
32
+ "checkpoint = \"bert-base-uncased\"\n",
33
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
34
+ "\n",
35
+ "\n",
36
+ "def tokenize_function(example):\n",
37
+ " return tokenizer(example[\"sentence1\"], example[\"sentence2\"], truncation=True)\n",
38
+ "\n",
39
+ "\n",
40
+ "tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)\n",
41
+ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "code",
46
+ "execution_count": 11,
47
+ "metadata": {},
48
+ "outputs": [
49
+ {
50
+ "data": {
51
+ "text/plain": [
52
+ "3.0"
53
+ ]
54
+ },
55
+ "execution_count": 11,
56
+ "metadata": {},
57
+ "output_type": "execute_result"
58
+ }
59
+ ],
60
+ "source": [
61
+ "training_args.num_train_epochs = 1"
62
+ ]
63
+ },
64
+ {
65
+ "cell_type": "code",
66
+ "execution_count": 24,
67
+ "metadata": {},
68
+ "outputs": [
69
+ {
70
+ "data": {
71
+ "text/plain": [
72
+ "TrainingArguments(\n",
73
+ "_n_gpu=1,\n",
74
+ "adafactor=False,\n",
75
+ "adam_beta1=0.9,\n",
76
+ "adam_beta2=0.999,\n",
77
+ "adam_epsilon=1e-08,\n",
78
+ "auto_find_batch_size=False,\n",
79
+ "bf16=False,\n",
80
+ "bf16_full_eval=False,\n",
81
+ "data_seed=None,\n",
82
+ "dataloader_drop_last=False,\n",
83
+ "dataloader_num_workers=0,\n",
84
+ "dataloader_pin_memory=True,\n",
85
+ "ddp_bucket_cap_mb=None,\n",
86
+ "ddp_find_unused_parameters=None,\n",
87
+ "ddp_timeout=1800,\n",
88
+ "debug=[],\n",
89
+ "deepspeed=None,\n",
90
+ "disable_tqdm=False,\n",
91
+ "do_eval=False,\n",
92
+ "do_predict=False,\n",
93
+ "do_train=False,\n",
94
+ "eval_accumulation_steps=None,\n",
95
+ "eval_delay=0,\n",
96
+ "eval_steps=None,\n",
97
+ "evaluation_strategy=no,\n",
98
+ "fp16=False,\n",
99
+ "fp16_backend=auto,\n",
100
+ "fp16_full_eval=False,\n",
101
+ "fp16_opt_level=O1,\n",
102
+ "fsdp=[],\n",
103
+ "fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},\n",
104
+ "fsdp_min_num_params=0,\n",
105
+ "fsdp_transformer_layer_cls_to_wrap=None,\n",
106
+ "full_determinism=False,\n",
107
+ "gradient_accumulation_steps=1,\n",
108
+ "gradient_checkpointing=False,\n",
109
+ "greater_is_better=None,\n",
110
+ "group_by_length=False,\n",
111
+ "half_precision_backend=auto,\n",
112
+ "hub_model_id=None,\n",
113
+ "hub_private_repo=False,\n",
114
+ "hub_strategy=every_save,\n",
115
+ "hub_token=<HUB_TOKEN>,\n",
116
+ "ignore_data_skip=False,\n",
117
+ "include_inputs_for_metrics=False,\n",
118
+ "jit_mode_eval=False,\n",
119
+ "label_names=None,\n",
120
+ "label_smoothing_factor=0.0,\n",
121
+ "learning_rate=5e-05,\n",
122
+ "length_column_name=length,\n",
123
+ "load_best_model_at_end=False,\n",
124
+ "local_rank=-1,\n",
125
+ "log_level=passive,\n",
126
+ "log_level_replica=warning,\n",
127
+ "log_on_each_node=True,\n",
128
+ "logging_dir=test-trainer/runs/Jul21_12-03-08_602d65b93b25,\n",
129
+ "logging_first_step=False,\n",
130
+ "logging_nan_inf_filter=True,\n",
131
+ "logging_steps=500,\n",
132
+ "logging_strategy=steps,\n",
133
+ "lr_scheduler_type=linear,\n",
134
+ "max_grad_norm=1.0,\n",
135
+ "max_steps=-1,\n",
136
+ "metric_for_best_model=None,\n",
137
+ "mp_parameters=,\n",
138
+ "no_cuda=False,\n",
139
+ "num_train_epochs=2,\n",
140
+ "optim=adamw_hf,\n",
141
+ "optim_args=None,\n",
142
+ "output_dir=test-trainer,\n",
143
+ "overwrite_output_dir=False,\n",
144
+ "past_index=-1,\n",
145
+ "per_device_eval_batch_size=8,\n",
146
+ "per_device_train_batch_size=8,\n",
147
+ "prediction_loss_only=False,\n",
148
+ "push_to_hub=False,\n",
149
+ "push_to_hub_model_id=None,\n",
150
+ "push_to_hub_organization=None,\n",
151
+ "push_to_hub_token=<PUSH_TO_HUB_TOKEN>,\n",
152
+ "ray_scope=last,\n",
153
+ "remove_unused_columns=True,\n",
154
+ "report_to=[],\n",
155
+ "resume_from_checkpoint=None,\n",
156
+ "run_name=test-trainer,\n",
157
+ "save_on_each_node=False,\n",
158
+ "save_steps=500,\n",
159
+ "save_strategy=steps,\n",
160
+ "save_total_limit=None,\n",
161
+ "seed=42,\n",
162
+ "sharded_ddp=[],\n",
163
+ "skip_memory_metrics=True,\n",
164
+ "tf32=None,\n",
165
+ "torch_compile=False,\n",
166
+ "torch_compile_backend=None,\n",
167
+ "torch_compile_mode=None,\n",
168
+ "torchdynamo=None,\n",
169
+ "tpu_metrics_debug=False,\n",
170
+ "tpu_num_cores=None,\n",
171
+ "use_ipex=False,\n",
172
+ "use_legacy_prediction_loop=False,\n",
173
+ "use_mps_device=False,\n",
174
+ "warmup_ratio=0.0,\n",
175
+ "warmup_steps=0,\n",
176
+ "weight_decay=0.0,\n",
177
+ "xpu_backend=None,\n",
178
+ ")"
179
+ ]
180
+ },
181
+ "execution_count": 24,
182
+ "metadata": {},
183
+ "output_type": "execute_result"
184
+ }
185
+ ],
186
+ "source": [
187
+ "from transformers import TrainingArguments\n",
188
+ "training_args = TrainingArguments(\"test-trainer\")\n",
189
+ "training_args.num_train_epochs = 2\n",
190
+ "training_args"
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "code",
195
+ "execution_count": 13,
196
+ "metadata": {},
197
+ "outputs": [
198
+ {
199
+ "name": "stderr",
200
+ "output_type": "stream",
201
+ "text": [
202
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
203
+ " warnings.warn(\n",
204
+ "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']\n",
205
+ "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
206
+ "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
207
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
208
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
209
+ ]
210
+ }
211
+ ],
212
+ "source": [
213
+ "from transformers import AutoModelForSequenceClassification\n",
214
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "code",
219
+ "execution_count": 14,
220
+ "metadata": {},
221
+ "outputs": [
222
+ {
223
+ "data": {
224
+ "text/plain": [
225
+ "Linear(in_features=768, out_features=2, bias=True)"
226
+ ]
227
+ },
228
+ "execution_count": 14,
229
+ "metadata": {},
230
+ "output_type": "execute_result"
231
+ }
232
+ ],
233
+ "source": [
234
+ "model.classifier"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": 25,
240
+ "metadata": {},
241
+ "outputs": [],
242
+ "source": [
243
+ "from transformers import Trainer\n",
244
+ "trainer = Trainer(\n",
245
+ " model,\n",
246
+ " training_args,\n",
247
+ " train_dataset=tokenized_datasets[\"train\"],\n",
248
+ " eval_dataset=tokenized_datasets[\"validation\"],\n",
249
+ " data_collator=data_collator,\n",
250
+ " tokenizer=tokenizer,\n",
251
+ ")"
252
+ ]
253
+ },
254
+ {
255
+ "cell_type": "code",
256
+ "execution_count": 26,
257
+ "metadata": {},
258
+ "outputs": [
259
+ {
260
+ "name": "stderr",
261
+ "output_type": "stream",
262
+ "text": [
263
+ "/home/huggingface/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
264
+ " warnings.warn(\n"
265
+ ]
266
+ },
267
+ {
268
+ "data": {
269
+ "text/html": [
270
+ "\n",
271
+ " <div>\n",
272
+ " \n",
273
+ " <progress value='918' max='918' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
274
+ " [918/918 01:11, Epoch 2/2]\n",
275
+ " </div>\n",
276
+ " <table border=\"1\" class=\"dataframe\">\n",
277
+ " <thead>\n",
278
+ " <tr style=\"text-align: left;\">\n",
279
+ " <th>Step</th>\n",
280
+ " <th>Training Loss</th>\n",
281
+ " </tr>\n",
282
+ " </thead>\n",
283
+ " <tbody>\n",
284
+ " <tr>\n",
285
+ " <td>500</td>\n",
286
+ " <td>0.323900</td>\n",
287
+ " </tr>\n",
288
+ " </tbody>\n",
289
+ "</table><p>"
290
+ ],
291
+ "text/plain": [
292
+ "<IPython.core.display.HTML object>"
293
+ ]
294
+ },
295
+ "metadata": {},
296
+ "output_type": "display_data"
297
+ },
298
+ {
299
+ "data": {
300
+ "text/plain": [
301
+ "TrainOutput(global_step=918, training_loss=0.26239446669102756, metrics={'train_runtime': 72.0933, 'train_samples_per_second': 101.757, 'train_steps_per_second': 12.733, 'total_flos': 270693998197680.0, 'train_loss': 0.26239446669102756, 'epoch': 2.0})"
302
+ ]
303
+ },
304
+ "execution_count": 26,
305
+ "metadata": {},
306
+ "output_type": "execute_result"
307
+ }
308
+ ],
309
+ "source": [
310
+ "# Plain training\n",
311
+ "trainer.train()"
312
+ ]
313
+ },
314
+ {
315
+ "cell_type": "code",
316
+ "execution_count": 27,
317
+ "metadata": {},
318
+ "outputs": [
319
+ {
320
+ "data": {
321
+ "text/html": [
322
+ "\n",
323
+ " <div>\n",
324
+ " \n",
325
+ " <progress value='6' max='51' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
326
+ " [ 6/51 00:00 < 00:00, 50.51 it/s]\n",
327
+ " </div>\n",
328
+ " "
329
+ ],
330
+ "text/plain": [
331
+ "<IPython.core.display.HTML object>"
332
+ ]
333
+ },
334
+ "metadata": {},
335
+ "output_type": "display_data"
336
+ },
337
+ {
338
+ "name": "stdout",
339
+ "output_type": "stream",
340
+ "text": [
341
+ "(408, 2) (408,)\n"
342
+ ]
343
+ }
344
+ ],
345
+ "source": [
346
+ "predictions = trainer.predict(tokenized_datasets[\"validation\"])\n",
347
+ "print(predictions.predictions.shape, predictions.label_ids.shape)"
348
+ ]
349
+ },
350
+ {
351
+ "cell_type": "code",
352
+ "execution_count": 28,
353
+ "metadata": {},
354
+ "outputs": [],
355
+ "source": [
356
+ "import numpy as np\n",
357
+ "preds = np.argmax(predictions.predictions, axis=-1)"
358
+ ]
359
+ },
360
+ {
361
+ "cell_type": "code",
362
+ "execution_count": 29,
363
+ "metadata": {},
364
+ "outputs": [
365
+ {
366
+ "data": {
367
+ "text/plain": [
368
+ "{'accuracy': 0.8553921568627451, 'f1': 0.8963093145869947}"
369
+ ]
370
+ },
371
+ "execution_count": 29,
372
+ "metadata": {},
373
+ "output_type": "execute_result"
374
+ }
375
+ ],
376
+ "source": [
377
+ "import evaluate\n",
378
+ "metric = evaluate.load(\"glue\", \"mrpc\")\n",
379
+ "metric.compute(predictions=preds, references=predictions.label_ids)"
380
+ ]
381
+ },
382
+ {
383
+ "cell_type": "code",
384
+ "execution_count": 30,
385
+ "metadata": {},
386
+ "outputs": [],
387
+ "source": [
388
+ "def compute_metrics(eval_preds):\n",
389
+ " metric = evaluate.load(\"glue\", \"mrpc\")\n",
390
+ " logits, labels = eval_preds\n",
391
+ " predictions = np.argmax(logits, axis=-1)\n",
392
+ " return metric.compute(predictions=predictions, references=labels)"
393
+ ]
394
+ },
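+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A quick sanity check of `compute_metrics` on toy inputs (the logits and labels below are made up purely for illustration):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dummy_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])\n",
+ "dummy_labels = np.array([1, 0, 0])\n",
+ "compute_metrics((dummy_logits, dummy_labels))"
+ ]
+ },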
395
+ {
396
+ "cell_type": "code",
397
+ "execution_count": 32,
398
+ "metadata": {},
399
+ "outputs": [
400
+ {
401
+ "name": "stderr",
402
+ "output_type": "stream",
403
+ "text": [
404
+ "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']\n",
405
+ "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
406
+ "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
407
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
408
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
409
+ ]
410
+ }
411
+ ],
412
+ "source": [
413
+ "training_args = TrainingArguments(\"test-trainer\", evaluation_strategy=\"epoch\", num_train_epochs = 2)\n",
414
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)\n",
415
+ "\n",
416
+ "trainer = Trainer(\n",
417
+ " model,\n",
418
+ " training_args,\n",
419
+ " train_dataset=tokenized_datasets[\"train\"],\n",
420
+ " eval_dataset=tokenized_datasets[\"validation\"],\n",
421
+ " data_collator=data_collator,\n",
422
+ " tokenizer=tokenizer,\n",
423
+ " compute_metrics=compute_metrics,\n",
424
+ ")"
425
+ ]
426
+ },
427
+ {
428
+ "cell_type": "code",
429
+ "execution_count": 33,
430
+ "metadata": {},
431
+ "outputs": [
432
+ {
433
+ "name": "stderr",
434
+ "output_type": "stream",
435
+ "text": [
436
+ "/home/huggingface/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
437
+ " warnings.warn(\n"
438
+ ]
439
+ },
440
+ {
441
+ "data": {
442
+ "text/html": [
443
+ "\n",
444
+ " <div>\n",
445
+ " \n",
446
+ " <progress value='918' max='918' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
447
+ " [918/918 01:21, Epoch 2/2]\n",
448
+ " </div>\n",
449
+ " <table border=\"1\" class=\"dataframe\">\n",
450
+ " <thead>\n",
451
+ " <tr style=\"text-align: left;\">\n",
452
+ " <th>Epoch</th>\n",
453
+ " <th>Training Loss</th>\n",
454
+ " <th>Validation Loss</th>\n",
455
+ " <th>Accuracy</th>\n",
456
+ " <th>F1</th>\n",
457
+ " </tr>\n",
458
+ " </thead>\n",
459
+ " <tbody>\n",
460
+ " <tr>\n",
461
+ " <td>1</td>\n",
462
+ " <td>No log</td>\n",
463
+ " <td>0.418620</td>\n",
464
+ " <td>0.830882</td>\n",
465
+ " <td>0.883249</td>\n",
466
+ " </tr>\n",
467
+ " <tr>\n",
468
+ " <td>2</td>\n",
469
+ " <td>0.498100</td>\n",
470
+ " <td>0.485925</td>\n",
471
+ " <td>0.860294</td>\n",
472
+ " <td>0.903226</td>\n",
473
+ " </tr>\n",
474
+ " </tbody>\n",
475
+ "</table><p>"
476
+ ],
477
+ "text/plain": [
478
+ "<IPython.core.display.HTML object>"
479
+ ]
480
+ },
481
+ "metadata": {},
482
+ "output_type": "display_data"
483
+ },
484
+ {
485
+ "data": {
486
+ "text/plain": [
487
+ "TrainOutput(global_step=918, training_loss=0.39945665579735584, metrics={'train_runtime': 82.0502, 'train_samples_per_second': 89.409, 'train_steps_per_second': 11.188, 'total_flos': 270693998197680.0, 'train_loss': 0.39945665579735584, 'epoch': 2.0})"
488
+ ]
489
+ },
490
+ "execution_count": 33,
491
+ "metadata": {},
492
+ "output_type": "execute_result"
493
+ }
494
+ ],
495
+ "source": [
496
+ "trainer.train()"
497
+ ]
498
+ },
499
+ {
500
+ "cell_type": "markdown",
501
+ "metadata": {},
502
+ "source": [
503
+ "# Finetuning on GLUE-SST-2"
504
+ ]
505
+ },
506
+ {
507
+ "cell_type": "code",
508
+ "execution_count": 37,
509
+ "metadata": {},
510
+ "outputs": [
511
+ {
512
+ "name": "stderr",
513
+ "output_type": "stream",
514
+ "text": [
515
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
516
+ " warnings.warn(\n",
517
+ "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']\n",
518
+ "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
519
+ "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
520
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
521
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
522
+ "Map: 100%|██████████| 1821/1821 [00:00<00:00, 9257.22 examples/s]\n"
523
+ ]
524
+ }
525
+ ],
526
+ "source": [
527
+ "from datasets import load_dataset\n",
528
+ "raw_dataset = load_dataset(\"glue\", \"sst2\")\n",
529
+ "\n",
530
+ "from transformers import AutoTokenizer\n",
531
+ "from transformers import AutoModelForSequenceClassification\n",
532
+ "\n",
533
+ "checkpoint = \"bert-base-uncased\"\n",
534
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
535
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint)\n",
536
+ "\n",
537
+ "def tokenize_function(sequence):\n",
538
+ " return tokenizer(sequence[\"sentence\"], padding = True, truncation = True, return_tensors=\"pt\")\n",
539
+ "\n",
540
+ "tokenized_dataset = raw_dataset.map(tokenize_function, batched = True)\n",
541
+ "\n",
542
+ "from transformers import DataCollatorWithPadding\n",
543
+ "dc = DataCollatorWithPadding(tokenizer = tokenizer, padding = True)"
544
+ ]
545
+ },
546
+ {
547
+ "cell_type": "code",
548
+ "execution_count": 38,
549
+ "metadata": {},
550
+ "outputs": [],
551
+ "source": [
552
+ "def compute_metrics(eval_preds):\n",
553
+ " metric = evaluate.load(\"glue\", \"sst2\")\n",
554
+ " logits, labels = eval_preds\n",
555
+ " predictions = np.argmax(logits, axis=-1)\n",
556
+ " return metric.compute(predictions=predictions, references=labels)"
557
+ ]
558
+ },
559
+ {
560
+ "cell_type": "code",
561
+ "execution_count": 52,
562
+ "metadata": {},
563
+ "outputs": [
564
+ {
565
+ "name": "stderr",
566
+ "output_type": "stream",
567
+ "text": [
568
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
569
+ " warnings.warn(\n",
570
+ "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']\n",
571
+ "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
572
+ "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
573
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
574
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
575
+ ]
576
+ }
577
+ ],
578
+ "source": [
579
+ "training_args = TrainingArguments(\"test-trainer\", evaluation_strategy=\"epoch\", num_train_epochs = 2,\n",
580
+ " per_device_eval_batch_size = 32, per_device_train_batch_size = 64)\n",
581
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)\n",
582
+ "\n",
583
+ "trainer = Trainer(\n",
584
+ " model,\n",
585
+ " training_args,\n",
586
+ " train_dataset=tokenized_dataset[\"train\"],\n",
587
+ " eval_dataset=tokenized_dataset[\"validation\"],\n",
588
+ " data_collator=data_collator,\n",
589
+ " tokenizer=tokenizer,\n",
590
+ " compute_metrics=compute_metrics,\n",
591
+ ")"
592
+ ]
593
+ },
594
+ {
595
+ "cell_type": "code",
596
+ "execution_count": 53,
597
+ "metadata": {},
598
+ "outputs": [
599
+ {
600
+ "name": "stderr",
601
+ "output_type": "stream",
602
+ "text": [
603
+ "/home/huggingface/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
604
+ " warnings.warn(\n"
605
+ ]
606
+ },
607
+ {
608
+ "data": {
609
+ "text/html": [
610
+ "\n",
611
+ " <div>\n",
612
+ " \n",
613
+ " <progress value='2106' max='2106' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
614
+ " [2106/2106 11:38, Epoch 2/2]\n",
615
+ " </div>\n",
616
+ " <table border=\"1\" class=\"dataframe\">\n",
617
+ " <thead>\n",
618
+ " <tr style=\"text-align: left;\">\n",
619
+ " <th>Epoch</th>\n",
620
+ " <th>Training Loss</th>\n",
621
+ " <th>Validation Loss</th>\n",
622
+ " <th>Accuracy</th>\n",
623
+ " </tr>\n",
624
+ " </thead>\n",
625
+ " <tbody>\n",
626
+ " <tr>\n",
627
+ " <td>1</td>\n",
628
+ " <td>0.166400</td>\n",
629
+ " <td>0.236536</td>\n",
630
+ " <td>0.918578</td>\n",
631
+ " </tr>\n",
632
+ " <tr>\n",
633
+ " <td>2</td>\n",
634
+ " <td>0.089700</td>\n",
635
+ " <td>0.238962</td>\n",
636
+ " <td>0.925459</td>\n",
637
+ " </tr>\n",
638
+ " </tbody>\n",
639
+ "</table><p>"
640
+ ],
641
+ "text/plain": [
642
+ "<IPython.core.display.HTML object>"
643
+ ]
644
+ },
645
+ "metadata": {},
646
+ "output_type": "display_data"
647
+ },
648
+ {
649
+ "data": {
650
+ "text/plain": [
651
+ "TrainOutput(global_step=2106, training_loss=0.1493011535289507, metrics={'train_runtime': 698.8876, 'train_samples_per_second': 192.732, 'train_steps_per_second': 3.013, 'total_flos': 4556217062352120.0, 'train_loss': 0.1493011535289507, 'epoch': 2.0})"
652
+ ]
653
+ },
654
+ "execution_count": 53,
655
+ "metadata": {},
656
+ "output_type": "execute_result"
657
+ }
658
+ ],
659
+ "source": [
660
+ "trainer.train()"
661
+ ]
662
+ }
663
+ ],
664
+ "metadata": {
665
+ "kernelspec": {
666
+ "display_name": "Python 3",
667
+ "language": "python",
668
+ "name": "python3"
669
+ },
670
+ "language_info": {
671
+ "codemirror_mode": {
672
+ "name": "ipython",
673
+ "version": 3
674
+ },
675
+ "file_extension": ".py",
676
+ "mimetype": "text/x-python",
677
+ "name": "python",
678
+ "nbconvert_exporter": "python",
679
+ "pygments_lexer": "ipython3",
680
+ "version": "3.10.14"
681
+ }
682
+ },
683
+ "nbformat": 4,
684
+ "nbformat_minor": 2
685
+ }
chapter3/3-processing-the-data.ipynb ADDED
@@ -0,0 +1,1719 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Exploring the GLUE - MRPC dataset"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 12,
13
+ "metadata": {},
14
+ "outputs": [],
15
+ "source": [
16
+ "from datasets import load_dataset\n",
17
+ "from pprint import pprint\n",
18
+ "\n",
19
+ "raw_dataset = load_dataset(path = \"glue\", name = \"mrpc\")"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "code",
24
+ "execution_count": 7,
25
+ "metadata": {},
26
+ "outputs": [
27
+ {
28
+ "name": "stdout",
29
+ "output_type": "stream",
30
+ "text": [
31
+ "_home_.cache_huggingface_datasets_glue_mrpc_0.0.0_bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c.lock\n",
32
+ "downloads\n",
33
+ "glue\n"
34
+ ]
35
+ }
36
+ ],
37
+ "source": [
38
+ "!ls ~/.cache/huggingface/datasets/"
39
+ ]
40
+ },
41
+ {
42
+ "cell_type": "code",
43
+ "execution_count": 8,
44
+ "metadata": {},
45
+ "outputs": [
46
+ {
47
+ "data": {
48
+ "text/plain": [
49
+ "DatasetDict({\n",
50
+ " train: Dataset({\n",
51
+ " features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
52
+ " num_rows: 3668\n",
53
+ " })\n",
54
+ " validation: Dataset({\n",
55
+ " features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
56
+ " num_rows: 408\n",
57
+ " })\n",
58
+ " test: Dataset({\n",
59
+ " features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
60
+ " num_rows: 1725\n",
61
+ " })\n",
62
+ "})"
63
+ ]
64
+ },
65
+ "execution_count": 8,
66
+ "metadata": {},
67
+ "output_type": "execute_result"
68
+ }
69
+ ],
70
+ "source": [
71
+ "raw_dataset"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "code",
76
+ "execution_count": 9,
77
+ "metadata": {},
78
+ "outputs": [
79
+ {
80
+ "data": {
81
+ "text/plain": [
82
+ "{'sentence1': 'Amrozi accused his brother , whom he called \" the witness \" , of deliberately distorting his evidence .',\n",
83
+ " 'sentence2': 'Referring to him as only \" the witness \" , Amrozi accused his brother of deliberately distorting his evidence .',\n",
84
+ " 'label': 1,\n",
85
+ " 'idx': 0}"
86
+ ]
87
+ },
88
+ "execution_count": 9,
89
+ "metadata": {},
90
+ "output_type": "execute_result"
91
+ }
92
+ ],
93
+ "source": [
94
+ "raw_dataset[\"train\"][0]"
95
+ ]
96
+ },
97
+ {
98
+ "cell_type": "code",
99
+ "execution_count": 10,
100
+ "metadata": {},
101
+ "outputs": [
102
+ {
103
+ "data": {
104
+ "text/plain": [
105
+ "{'sentence1': Value(dtype='string', id=None),\n",
106
+ " 'sentence2': Value(dtype='string', id=None),\n",
107
+ " 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),\n",
108
+ " 'idx': Value(dtype='int32', id=None)}"
109
+ ]
110
+ },
111
+ "execution_count": 10,
112
+ "metadata": {},
113
+ "output_type": "execute_result"
114
+ }
115
+ ],
116
+ "source": [
117
+ "raw_dataset[\"train\"].features"
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "code",
122
+ "execution_count": 15,
123
+ "metadata": {},
124
+ "outputs": [
125
+ {
126
+ "name": "stdout",
127
+ "output_type": "stream",
128
+ "text": [
129
+ "{'idx': 16,\n",
130
+ " 'label': 0,\n",
131
+ " 'sentence1': 'Rudder was most recently senior vice president for the '\n",
132
+ " 'Developer & Platform Evangelism Business .',\n",
133
+ " 'sentence2': 'Senior Vice President Eric Rudder , formerly head of the '\n",
134
+ " 'Developer and Platform Evangelism unit , will lead the new '\n",
135
+ " 'entity .'}\n",
136
+ "{'idx': 812,\n",
137
+ " 'label': 0,\n",
138
+ " 'sentence1': 'However , EPA officials would not confirm the 20 percent figure '\n",
139
+ " '.',\n",
140
+ " 'sentence2': 'Only in the past few weeks have officials settled on the 20 '\n",
141
+ " 'percent figure .'}\n"
142
+ ]
143
+ }
144
+ ],
145
+ "source": [
146
+ "# Look at the 15th and 87th element of the train and validation datasets respectively\n",
147
+ "pprint(raw_dataset[\"train\"][15])\n",
148
+ "pprint(raw_dataset[\"validation\"][87])"
149
+ ]
150
+ },
151
+ {
152
+ "cell_type": "markdown",
153
+ "metadata": {},
154
+ "source": [
155
+ "# Tokenizer for pair processing"
156
+ ]
157
+ },
158
+ {
159
+ "cell_type": "code",
160
+ "execution_count": 16,
161
+ "metadata": {},
162
+ "outputs": [
163
+ {
164
+ "name": "stderr",
165
+ "output_type": "stream",
166
+ "text": [
167
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
168
+ " warnings.warn(\n"
169
+ ]
170
+ }
171
+ ],
172
+ "source": [
173
+ "from transformers import AutoTokenizer\n",
174
+ "\n",
175
+ "checkpoint = \"bert-base-uncased\"\n",
176
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)"
177
+ ]
178
+ },
179
+ {
180
+ "cell_type": "code",
181
+ "execution_count": 18,
182
+ "metadata": {},
183
+ "outputs": [
184
+ {
185
+ "name": "stdout",
186
+ "output_type": "stream",
187
+ "text": [
188
+ "{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
189
+ " 'input_ids': [101,\n",
190
+ " 2023,\n",
191
+ " 2003,\n",
192
+ " 1996,\n",
193
+ " 2034,\n",
194
+ " 6251,\n",
195
+ " 1012,\n",
196
+ " 102,\n",
197
+ " 2023,\n",
198
+ " 2003,\n",
199
+ " 1996,\n",
200
+ " 2117,\n",
201
+ " 2028,\n",
202
+ " 1012,\n",
203
+ " 102],\n",
204
+ " 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]}\n"
205
+ ]
206
+ }
207
+ ],
208
+ "source": [
209
+ "inputs = tokenizer(\"This is the first sentence.\", \"This is the second one.\")\n",
210
+ "pprint(inputs)"
211
+ ]
212
+ },
213
+ {
214
+ "cell_type": "code",
215
+ "execution_count": 19,
216
+ "metadata": {},
217
+ "outputs": [
218
+ {
219
+ "data": {
220
+ "text/plain": [
221
+ "'[CLS] this is the first sentence. [SEP] this is the second one. [SEP]'"
222
+ ]
223
+ },
224
+ "execution_count": 19,
225
+ "metadata": {},
226
+ "output_type": "execute_result"
227
+ }
228
+ ],
229
+ "source": [
230
+ "tokenizer.decode(inputs['input_ids'])"
231
+ ]
232
+ },
233
+ {
234
+ "cell_type": "markdown",
235
+ "metadata": {},
236
+ "source": [
237
+ "Here we can see that the tokenizer has appended the two sentences together and introduced `[CLS]` and `[SEP]` tokens specially because that's how bert was trained for next sentence prediction task."
238
+ ]
239
+ },
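+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the placement of those special tokens explicit, we can look at the individual tokens instead of the decoded string. The cell below is a small illustrative sketch (not executed here) that only reuses the `inputs` encoding from above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Inspect the token stream: [CLS] opens the pair and [SEP] closes each sentence\n",
+ "print(tokenizer.convert_ids_to_tokens(inputs[\"input_ids\"]))"
+ ]
+ },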
240
+ {
241
+ "cell_type": "code",
242
+ "execution_count": 25,
243
+ "metadata": {},
244
+ "outputs": [
245
+ {
246
+ "name": "stdout",
247
+ "output_type": "stream",
248
+ "text": [
249
+ "{'input_ids': [101, 24049, 2001, 2087, 3728, 3026, 3580, 2343, 2005, 1996, 9722, 1004, 4132, 9340, 12439, 2964, 2449, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n"
250
+ ]
251
+ }
252
+ ],
253
+ "source": [
254
+ "print(tokenizer(raw_dataset[\"train\"][15][\"sentence1\"]))"
255
+ ]
256
+ },
257
+ {
258
+ "cell_type": "code",
259
+ "execution_count": 24,
260
+ "metadata": {},
261
+ "outputs": [
262
+ {
263
+ "name": "stdout",
264
+ "output_type": "stream",
265
+ "text": [
266
+ "{'input_ids': [101, 3026, 3580, 2343, 4388, 24049, 1010, 3839, 2132, 1997, 1996, 9722, 1998, 4132, 9340, 12439, 2964, 3131, 1010, 2097, 2599, 1996, 2047, 9178, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n"
267
+ ]
268
+ }
269
+ ],
270
+ "source": [
271
+ "print(tokenizer(raw_dataset[\"train\"][15][\"sentence2\"]))"
272
+ ]
273
+ },
274
+ {
275
+ "cell_type": "code",
276
+ "execution_count": 26,
277
+ "metadata": {},
278
+ "outputs": [
279
+ {
280
+ "name": "stdout",
281
+ "output_type": "stream",
282
+ "text": [
283
+ "{'input_ids': [101, 24049, 2001, 2087, 3728, 3026, 3580, 2343, 2005, 1996, 9722, 1004, 4132, 9340, 12439, 2964, 2449, 1012, 102, 3026, 3580, 2343, 4388, 24049, 1010, 3839, 2132, 1997, 1996, 9722, 1998, 4132, 9340, 12439, 2964, 3131, 1010, 2097, 2599, 1996, 2047, 9178, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n"
284
+ ]
285
+ }
286
+ ],
287
+ "source": [
288
+ "print(tokenizer(raw_dataset[\"train\"][15][\"sentence1\"], raw_dataset[\"train\"][15][\"sentence2\"]))"
289
+ ]
290
+ },
291
+ {
292
+ "cell_type": "markdown",
293
+ "metadata": {},
294
+ "source": [
295
+ "Here we need to observe the `token_type_ids` field. It is different if we encode the two sentences at the same time vs if we do them independently. Also the `[CLS]` and `[SEP]` tokens are added differently in the two cases."
296
+ ]
297
+ },
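+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an unexecuted sketch that only reuses the variables above, we can pair each token with its type id to see exactly which segment it belongs to: everything up to and including the first `[SEP]` gets type id 0, and the rest gets 1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Pair every token with its segment id (0 = first sentence, 1 = second sentence)\n",
+ "pair = tokenizer(raw_dataset[\"train\"][15][\"sentence1\"], raw_dataset[\"train\"][15][\"sentence2\"])\n",
+ "for token, type_id in zip(tokenizer.convert_ids_to_tokens(pair[\"input_ids\"]), pair[\"token_type_ids\"]):\n",
+ "    print(f\"{token:>12} -> {type_id}\")"
+ ]
+ },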
298
+ {
299
+ "cell_type": "markdown",
300
+ "metadata": {},
301
+ "source": [
302
+ "# Dataset Map to create new datasets"
303
+ ]
304
+ },
305
+ {
306
+ "cell_type": "code",
307
+ "execution_count": 28,
308
+ "metadata": {},
309
+ "outputs": [],
310
+ "source": [
311
+ "def tokenize_function(example):\n",
312
+ " return tokenizer(example[\"sentence1\"], example[\"sentence2\"], truncation=True)"
313
+ ]
314
+ },
315
+ {
316
+ "cell_type": "code",
317
+ "execution_count": 29,
318
+ "metadata": {},
319
+ "outputs": [
320
+ {
321
+ "name": "stderr",
322
+ "output_type": "stream",
323
+ "text": [
324
+ "Map: 100%|██████████| 3668/3668 [00:00<00:00, 9953.91 examples/s] \n",
325
+ "Map: 100%|██████████| 408/408 [00:00<00:00, 9044.46 examples/s]\n",
326
+ "Map: 100%|██████████| 1725/1725 [00:00<00:00, 9891.51 examples/s] \n"
327
+ ]
328
+ },
329
+ {
330
+ "data": {
331
+ "text/plain": [
332
+ "DatasetDict({\n",
333
+ " train: Dataset({\n",
334
+ " features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
335
+ " num_rows: 3668\n",
336
+ " })\n",
337
+ " validation: Dataset({\n",
338
+ " features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
339
+ " num_rows: 408\n",
340
+ " })\n",
341
+ " test: Dataset({\n",
342
+ " features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
343
+ " num_rows: 1725\n",
344
+ " })\n",
345
+ "})"
346
+ ]
347
+ },
348
+ "execution_count": 29,
349
+ "metadata": {},
350
+ "output_type": "execute_result"
351
+ }
352
+ ],
353
+ "source": [
354
+ "tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)\n",
355
+ "tokenized_datasets"
356
+ ]
357
+ },
358
+ {
359
+ "cell_type": "code",
360
+ "execution_count": 30,
361
+ "metadata": {},
362
+ "outputs": [
363
+ {
364
+ "data": {
365
+ "text/plain": [
366
+ "DatasetDict({\n",
367
+ " train: Dataset({\n",
368
+ " features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
369
+ " num_rows: 3668\n",
370
+ " })\n",
371
+ " validation: Dataset({\n",
372
+ " features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
373
+ " num_rows: 408\n",
374
+ " })\n",
375
+ " test: Dataset({\n",
376
+ " features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
377
+ " num_rows: 1725\n",
378
+ " })\n",
379
+ "})"
380
+ ]
381
+ },
382
+ "execution_count": 30,
383
+ "metadata": {},
384
+ "output_type": "execute_result"
385
+ }
386
+ ],
387
+ "source": [
388
+ "raw_dataset"
389
+ ]
390
+ },
391
+ {
392
+ "cell_type": "markdown",
393
+ "metadata": {},
394
+ "source": [
395
+ "Here we see that as out tokenize functions returns new keys of `'input_ids', 'token_type_ids', 'attention_mask'`, those simply get added to the new tokenized_dataset Dataset and rest remains the same."
396
+ ]
397
+ },
398
+ {
399
+ "cell_type": "code",
400
+ "execution_count": 38,
401
+ "metadata": {},
402
+ "outputs": [],
403
+ "source": [
404
+ "from transformers import DataCollatorWithPadding\n",
405
+ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
406
+ ]
407
+ },
408
+ {
409
+ "cell_type": "markdown",
410
+ "metadata": {},
411
+ "source": [
412
+ "This `DataCollatorWithPadding` is meant to do dynamic collation of batches in the dataset based on the max length from among all the sequences in the batch."
413
+ ]
414
+ },
415
+ {
416
+ "cell_type": "code",
417
+ "execution_count": 42,
418
+ "metadata": {},
419
+ "outputs": [
420
+ {
421
+ "data": {
422
+ "text/plain": [
423
+ "[50, 59, 47]"
424
+ ]
425
+ },
426
+ "execution_count": 42,
427
+ "metadata": {},
428
+ "output_type": "execute_result"
429
+ }
430
+ ],
431
+ "source": [
432
+ "samples = tokenized_datasets[\"train\"][:3]\n",
433
+ "samples = {k: v for k, v in samples.items()}\n",
434
+ "[len(x) for x in samples[\"input_ids\"]]"
435
+ ]
436
+ },
437
+ {
438
+ "cell_type": "code",
439
+ "execution_count": 43,
440
+ "metadata": {},
441
+ "outputs": [
442
+ {
443
+ "data": {
444
+ "text/plain": [
445
+ "{'sentence1': ['Amrozi accused his brother , whom he called \" the witness \" , of deliberately distorting his evidence .',\n",
446
+ " \"Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .\",\n",
447
+ " 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],\n",
448
+ " 'sentence2': ['Referring to him as only \" the witness \" , Amrozi accused his brother of deliberately distorting his evidence .',\n",
449
+ " \"Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .\",\n",
450
+ " \"On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .\"],\n",
451
+ " 'label': [1, 0, 1],\n",
452
+ " 'idx': [0, 1, 2],\n",
453
+ " 'input_ids': [[101,\n",
454
+ " 2572,\n",
455
+ " 3217,\n",
456
+ " 5831,\n",
457
+ " 5496,\n",
458
+ " 2010,\n",
459
+ " 2567,\n",
460
+ " 1010,\n",
461
+ " 3183,\n",
462
+ " 2002,\n",
463
+ " 2170,\n",
464
+ " 1000,\n",
465
+ " 1996,\n",
466
+ " 7409,\n",
467
+ " 1000,\n",
468
+ " 1010,\n",
469
+ " 1997,\n",
470
+ " 9969,\n",
471
+ " 4487,\n",
472
+ " 23809,\n",
473
+ " 3436,\n",
474
+ " 2010,\n",
475
+ " 3350,\n",
476
+ " 1012,\n",
477
+ " 102,\n",
478
+ " 7727,\n",
479
+ " 2000,\n",
480
+ " 2032,\n",
481
+ " 2004,\n",
482
+ " 2069,\n",
483
+ " 1000,\n",
484
+ " 1996,\n",
485
+ " 7409,\n",
486
+ " 1000,\n",
487
+ " 1010,\n",
488
+ " 2572,\n",
489
+ " 3217,\n",
490
+ " 5831,\n",
491
+ " 5496,\n",
492
+ " 2010,\n",
493
+ " 2567,\n",
494
+ " 1997,\n",
495
+ " 9969,\n",
496
+ " 4487,\n",
497
+ " 23809,\n",
498
+ " 3436,\n",
499
+ " 2010,\n",
500
+ " 3350,\n",
501
+ " 1012,\n",
502
+ " 102],\n",
503
+ " [101,\n",
504
+ " 9805,\n",
505
+ " 3540,\n",
506
+ " 11514,\n",
507
+ " 2050,\n",
508
+ " 3079,\n",
509
+ " 11282,\n",
510
+ " 2243,\n",
511
+ " 1005,\n",
512
+ " 1055,\n",
513
+ " 2077,\n",
514
+ " 4855,\n",
515
+ " 1996,\n",
516
+ " 4677,\n",
517
+ " 2000,\n",
518
+ " 3647,\n",
519
+ " 4576,\n",
520
+ " 1999,\n",
521
+ " 2687,\n",
522
+ " 2005,\n",
523
+ " 1002,\n",
524
+ " 1016,\n",
525
+ " 1012,\n",
526
+ " 1019,\n",
527
+ " 4551,\n",
528
+ " 1012,\n",
529
+ " 102,\n",
530
+ " 9805,\n",
531
+ " 3540,\n",
532
+ " 11514,\n",
533
+ " 2050,\n",
534
+ " 4149,\n",
535
+ " 11282,\n",
536
+ " 2243,\n",
537
+ " 1005,\n",
538
+ " 1055,\n",
539
+ " 1999,\n",
540
+ " 2786,\n",
541
+ " 2005,\n",
542
+ " 1002,\n",
543
+ " 6353,\n",
544
+ " 2509,\n",
545
+ " 2454,\n",
546
+ " 1998,\n",
547
+ " 2853,\n",
548
+ " 2009,\n",
549
+ " 2000,\n",
550
+ " 3647,\n",
551
+ " 4576,\n",
552
+ " 2005,\n",
553
+ " 1002,\n",
554
+ " 1015,\n",
555
+ " 1012,\n",
556
+ " 1022,\n",
557
+ " 4551,\n",
558
+ " 1999,\n",
559
+ " 2687,\n",
560
+ " 1012,\n",
561
+ " 102],\n",
562
+ " [101,\n",
563
+ " 2027,\n",
564
+ " 2018,\n",
565
+ " 2405,\n",
566
+ " 2019,\n",
567
+ " 15147,\n",
568
+ " 2006,\n",
569
+ " 1996,\n",
570
+ " 4274,\n",
571
+ " 2006,\n",
572
+ " 2238,\n",
573
+ " 2184,\n",
574
+ " 1010,\n",
575
+ " 5378,\n",
576
+ " 1996,\n",
577
+ " 6636,\n",
578
+ " 2005,\n",
579
+ " 5096,\n",
580
+ " 1010,\n",
581
+ " 2002,\n",
582
+ " 2794,\n",
583
+ " 1012,\n",
584
+ " 102,\n",
585
+ " 2006,\n",
586
+ " 2238,\n",
587
+ " 2184,\n",
588
+ " 1010,\n",
589
+ " 1996,\n",
590
+ " 2911,\n",
591
+ " 1005,\n",
592
+ " 1055,\n",
593
+ " 5608,\n",
594
+ " 2018,\n",
595
+ " 2405,\n",
596
+ " 2019,\n",
597
+ " 15147,\n",
598
+ " 2006,\n",
599
+ " 1996,\n",
600
+ " 4274,\n",
601
+ " 1010,\n",
602
+ " 5378,\n",
603
+ " 1996,\n",
604
+ " 14792,\n",
605
+ " 2005,\n",
606
+ " 5096,\n",
607
+ " 1012,\n",
608
+ " 102]],\n",
609
+ " 'token_type_ids': [[0,\n",
610
+ " 0,\n",
611
+ " 0,\n",
612
+ " 0,\n",
613
+ " 0,\n",
614
+ " 0,\n",
615
+ " 0,\n",
616
+ " 0,\n",
617
+ " 0,\n",
618
+ " 0,\n",
619
+ " 0,\n",
620
+ " 0,\n",
621
+ " 0,\n",
622
+ " 0,\n",
623
+ " 0,\n",
624
+ " 0,\n",
625
+ " 0,\n",
626
+ " 0,\n",
627
+ " 0,\n",
628
+ " 0,\n",
629
+ " 0,\n",
630
+ " 0,\n",
631
+ " 0,\n",
632
+ " 0,\n",
633
+ " 0,\n",
634
+ " 1,\n",
635
+ " 1,\n",
636
+ " 1,\n",
637
+ " 1,\n",
638
+ " 1,\n",
639
+ " 1,\n",
640
+ " 1,\n",
641
+ " 1,\n",
642
+ " 1,\n",
643
+ " 1,\n",
644
+ " 1,\n",
645
+ " 1,\n",
646
+ " 1,\n",
647
+ " 1,\n",
648
+ " 1,\n",
649
+ " 1,\n",
650
+ " 1,\n",
651
+ " 1,\n",
652
+ " 1,\n",
653
+ " 1,\n",
654
+ " 1,\n",
655
+ " 1,\n",
656
+ " 1,\n",
657
+ " 1,\n",
658
+ " 1],\n",
659
+ " [0,\n",
660
+ " 0,\n",
661
+ " 0,\n",
662
+ " 0,\n",
663
+ " 0,\n",
664
+ " 0,\n",
665
+ " 0,\n",
666
+ " 0,\n",
667
+ " 0,\n",
668
+ " 0,\n",
669
+ " 0,\n",
670
+ " 0,\n",
671
+ " 0,\n",
672
+ " 0,\n",
673
+ " 0,\n",
674
+ " 0,\n",
675
+ " 0,\n",
676
+ " 0,\n",
677
+ " 0,\n",
678
+ " 0,\n",
679
+ " 0,\n",
680
+ " 0,\n",
681
+ " 0,\n",
682
+ " 0,\n",
683
+ " 0,\n",
684
+ " 0,\n",
685
+ " 0,\n",
686
+ " 1,\n",
687
+ " 1,\n",
688
+ " 1,\n",
689
+ " 1,\n",
690
+ " 1,\n",
691
+ " 1,\n",
692
+ " 1,\n",
693
+ " 1,\n",
694
+ " 1,\n",
695
+ " 1,\n",
696
+ " 1,\n",
697
+ " 1,\n",
698
+ " 1,\n",
699
+ " 1,\n",
700
+ " 1,\n",
701
+ " 1,\n",
702
+ " 1,\n",
703
+ " 1,\n",
704
+ " 1,\n",
705
+ " 1,\n",
706
+ " 1,\n",
707
+ " 1,\n",
708
+ " 1,\n",
709
+ " 1,\n",
710
+ " 1,\n",
711
+ " 1,\n",
712
+ " 1,\n",
713
+ " 1,\n",
714
+ " 1,\n",
715
+ " 1,\n",
716
+ " 1,\n",
717
+ " 1],\n",
718
+ " [0,\n",
719
+ " 0,\n",
720
+ " 0,\n",
721
+ " 0,\n",
722
+ " 0,\n",
723
+ " 0,\n",
724
+ " 0,\n",
725
+ " 0,\n",
726
+ " 0,\n",
727
+ " 0,\n",
728
+ " 0,\n",
729
+ " 0,\n",
730
+ " 0,\n",
731
+ " 0,\n",
732
+ " 0,\n",
733
+ " 0,\n",
734
+ " 0,\n",
735
+ " 0,\n",
736
+ " 0,\n",
737
+ " 0,\n",
738
+ " 0,\n",
739
+ " 0,\n",
740
+ " 0,\n",
741
+ " 1,\n",
742
+ " 1,\n",
743
+ " 1,\n",
744
+ " 1,\n",
745
+ " 1,\n",
746
+ " 1,\n",
747
+ " 1,\n",
748
+ " 1,\n",
749
+ " 1,\n",
750
+ " 1,\n",
751
+ " 1,\n",
752
+ " 1,\n",
753
+ " 1,\n",
754
+ " 1,\n",
755
+ " 1,\n",
756
+ " 1,\n",
757
+ " 1,\n",
758
+ " 1,\n",
759
+ " 1,\n",
760
+ " 1,\n",
761
+ " 1,\n",
762
+ " 1,\n",
763
+ " 1,\n",
764
+ " 1]],\n",
765
+ " 'attention_mask': [[1,\n",
766
+ " 1,\n",
767
+ " 1,\n",
768
+ " 1,\n",
769
+ " 1,\n",
770
+ " 1,\n",
771
+ " 1,\n",
772
+ " 1,\n",
773
+ " 1,\n",
774
+ " 1,\n",
775
+ " 1,\n",
776
+ " 1,\n",
777
+ " 1,\n",
778
+ " 1,\n",
779
+ " 1,\n",
780
+ " 1,\n",
781
+ " 1,\n",
782
+ " 1,\n",
783
+ " 1,\n",
784
+ " 1,\n",
785
+ " 1,\n",
786
+ " 1,\n",
787
+ " 1,\n",
788
+ " 1,\n",
789
+ " 1,\n",
790
+ " 1,\n",
791
+ " 1,\n",
792
+ " 1,\n",
793
+ " 1,\n",
794
+ " 1,\n",
795
+ " 1,\n",
796
+ " 1,\n",
797
+ " 1,\n",
798
+ " 1,\n",
799
+ " 1,\n",
800
+ " 1,\n",
801
+ " 1,\n",
802
+ " 1,\n",
803
+ " 1,\n",
804
+ " 1,\n",
805
+ " 1,\n",
806
+ " 1,\n",
807
+ " 1,\n",
808
+ " 1,\n",
809
+ " 1,\n",
810
+ " 1,\n",
811
+ " 1,\n",
812
+ " 1,\n",
813
+ " 1,\n",
814
+ " 1],\n",
815
+ " [1,\n",
816
+ " 1,\n",
817
+ " 1,\n",
818
+ " 1,\n",
819
+ " 1,\n",
820
+ " 1,\n",
821
+ " 1,\n",
822
+ " 1,\n",
823
+ " 1,\n",
824
+ " 1,\n",
825
+ " 1,\n",
826
+ " 1,\n",
827
+ " 1,\n",
828
+ " 1,\n",
829
+ " 1,\n",
830
+ " 1,\n",
831
+ " 1,\n",
832
+ " 1,\n",
833
+ " 1,\n",
834
+ " 1,\n",
835
+ " 1,\n",
836
+ " 1,\n",
837
+ " 1,\n",
838
+ " 1,\n",
839
+ " 1,\n",
840
+ " 1,\n",
841
+ " 1,\n",
842
+ " 1,\n",
843
+ " 1,\n",
844
+ " 1,\n",
845
+ " 1,\n",
846
+ " 1,\n",
847
+ " 1,\n",
848
+ " 1,\n",
849
+ " 1,\n",
850
+ " 1,\n",
851
+ " 1,\n",
852
+ " 1,\n",
853
+ " 1,\n",
854
+ " 1,\n",
855
+ " 1,\n",
856
+ " 1,\n",
857
+ " 1,\n",
858
+ " 1,\n",
859
+ " 1,\n",
860
+ " 1,\n",
861
+ " 1,\n",
862
+ " 1,\n",
863
+ " 1,\n",
864
+ " 1,\n",
865
+ " 1,\n",
866
+ " 1,\n",
867
+ " 1,\n",
868
+ " 1,\n",
869
+ " 1,\n",
870
+ " 1,\n",
871
+ " 1,\n",
872
+ " 1,\n",
873
+ " 1],\n",
874
+ " [1,\n",
875
+ " 1,\n",
876
+ " 1,\n",
877
+ " 1,\n",
878
+ " 1,\n",
879
+ " 1,\n",
880
+ " 1,\n",
881
+ " 1,\n",
882
+ " 1,\n",
883
+ " 1,\n",
884
+ " 1,\n",
885
+ " 1,\n",
886
+ " 1,\n",
887
+ " 1,\n",
888
+ " 1,\n",
889
+ " 1,\n",
890
+ " 1,\n",
891
+ " 1,\n",
892
+ " 1,\n",
893
+ " 1,\n",
894
+ " 1,\n",
895
+ " 1,\n",
896
+ " 1,\n",
897
+ " 1,\n",
898
+ " 1,\n",
899
+ " 1,\n",
900
+ " 1,\n",
901
+ " 1,\n",
902
+ " 1,\n",
903
+ " 1,\n",
904
+ " 1,\n",
905
+ " 1,\n",
906
+ " 1,\n",
907
+ " 1,\n",
908
+ " 1,\n",
909
+ " 1,\n",
910
+ " 1,\n",
911
+ " 1,\n",
912
+ " 1,\n",
913
+ " 1,\n",
914
+ " 1,\n",
915
+ " 1,\n",
916
+ " 1,\n",
917
+ " 1,\n",
918
+ " 1,\n",
919
+ " 1,\n",
920
+ " 1]]}"
921
+ ]
922
+ },
923
+ "execution_count": 43,
924
+ "metadata": {},
925
+ "output_type": "execute_result"
926
+ }
927
+ ],
928
+ "source": [
929
+ "samples"
930
+ ]
931
+ },
932
+ {
933
+ "cell_type": "code",
934
+ "execution_count": 40,
935
+ "metadata": {},
936
+ "outputs": [],
937
+ "source": [
938
+ "samples_to_collate = tokenized_datasets[\"train\"][:3]\n",
939
+ "samples_to_collate.pop(\"sentence1\"); samples_to_collate.pop(\"sentence2\"); samples_to_collate.pop(\"idx\");"
940
+ ]
941
+ },
942
+ {
943
+ "cell_type": "code",
944
+ "execution_count": 41,
945
+ "metadata": {},
946
+ "outputs": [
947
+ {
948
+ "name": "stderr",
949
+ "output_type": "stream",
950
+ "text": [
951
+ "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
952
+ ]
953
+ },
954
+ {
955
+ "data": {
956
+ "text/plain": [
957
+ "{'input_ids': torch.Size([3, 59]),\n",
958
+ " 'token_type_ids': torch.Size([3, 59]),\n",
959
+ " 'attention_mask': torch.Size([3, 59]),\n",
960
+ " 'labels': torch.Size([3])}"
961
+ ]
962
+ },
963
+ "execution_count": 41,
964
+ "metadata": {},
965
+ "output_type": "execute_result"
966
+ }
967
+ ],
968
+ "source": [
969
+ "batch = data_collator(samples_to_collate)\n",
970
+ "{k: v.shape for k, v in batch.items()}"
971
+ ]
972
+ },
973
+ {
974
+ "cell_type": "markdown",
975
+ "metadata": {},
976
+ "source": [
977
+ "# Replication of the above preprocessing on GLUE-SST2 dataset"
978
+ ]
979
+ },
980
+ {
981
+ "cell_type": "code",
982
+ "execution_count": 45,
983
+ "metadata": {},
984
+ "outputs": [
985
+ {
986
+ "name": "stderr",
987
+ "output_type": "stream",
988
+ "text": [
989
+ "Downloading data: 100%|██████████| 3.11M/3.11M [00:00<00:00, 4.89MB/s]\n",
990
+ "Downloading data: 100%|██████████| 72.8k/72.8k [00:00<00:00, 128kB/s]\n",
991
+ "Downloading data: 100%|██████████| 148k/148k [00:00<00:00, 260kB/s]\n",
992
+ "Generating train split: 100%|██████████| 67349/67349 [00:00<00:00, 467302.76 examples/s]\n",
993
+ "Generating validation split: 100%|████���█████| 872/872 [00:00<00:00, 137580.24 examples/s]\n",
994
+ "Generating test split: 100%|██████████| 1821/1821 [00:00<00:00, 205588.75 examples/s]\n"
995
+ ]
996
+ }
997
+ ],
998
+ "source": [
999
+ "from datasets import load_dataset\n",
1000
+ "\n",
1001
+ "raw_dataset = load_dataset(\"glue\", \"sst2\")"
1002
+ ]
1003
+ },
1004
+ {
1005
+ "cell_type": "code",
1006
+ "execution_count": 50,
1007
+ "metadata": {},
1008
+ "outputs": [
1009
+ {
1010
+ "data": {
1011
+ "text/plain": [
1012
+ "{'sentence': 'hide new secretions from the parental units ',\n",
1013
+ " 'label': 0,\n",
1014
+ " 'idx': 0}"
1015
+ ]
1016
+ },
1017
+ "execution_count": 50,
1018
+ "metadata": {},
1019
+ "output_type": "execute_result"
1020
+ }
1021
+ ],
1022
+ "source": [
1023
+ "raw_dataset[\"train\"][0]"
1024
+ ]
1025
+ },
1026
+ {
1027
+ "cell_type": "code",
1028
+ "execution_count": 46,
1029
+ "metadata": {},
1030
+ "outputs": [
1031
+ {
1032
+ "name": "stderr",
1033
+ "output_type": "stream",
1034
+ "text": [
1035
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
1036
+ " warnings.warn(\n",
1037
+ "/home/huggingface/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
1038
+ " warnings.warn(\n",
1039
+ "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']\n",
1040
+ "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
1041
+ "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
1042
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
1043
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
1044
+ ]
1045
+ }
1046
+ ],
1047
+ "source": [
1048
+ "from transformers import AutoTokenizer\n",
1049
+ "from transformers import AutoModelForSequenceClassification\n",
1050
+ "\n",
1051
+ "checkpoint = \"bert-base-uncased\"\n",
1052
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
1053
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint)"
1054
+ ]
1055
+ },
1056
+ {
1057
+ "cell_type": "code",
1058
+ "execution_count": 53,
1059
+ "metadata": {},
1060
+ "outputs": [],
1061
+ "source": [
1062
+ "def tokenize_function(sequence):\n",
1063
+ " return tokenizer(sequence[\"sentence\"], padding = True, truncation = True, return_tensors=\"pt\")"
1064
+ ]
1065
+ },
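+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Side note: since `DataCollatorWithPadding` will pad each batch dynamically anyway, `padding = True` and `return_tensors = \"pt\"` are not really needed inside the map function; the tensors are converted back to lists when the dataset is stored, and padding during `map` pads to the longest sequence of each map batch rather than each training batch. A leaner variant (an unexecuted sketch) keeps only the truncation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Leaner preprocessing: leave all padding to the data collator at batch time\n",
+ "def tokenize_function_lean(sequence):\n",
+ "    return tokenizer(sequence[\"sentence\"], truncation=True)"
+ ]
+ },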
1066
+ {
1067
+ "cell_type": "code",
1068
+ "execution_count": 54,
1069
+ "metadata": {},
1070
+ "outputs": [
1071
+ {
1072
+ "name": "stderr",
1073
+ "output_type": "stream",
1074
+ "text": [
1075
+ "Map: 0%| | 0/67349 [00:00<?, ? examples/s]"
1076
+ ]
1077
+ },
1078
+ {
1079
+ "name": "stderr",
1080
+ "output_type": "stream",
1081
+ "text": [
1082
+ "Map: 100%|██████████| 67349/67349 [00:06<00:00, 11164.26 examples/s]\n",
1083
+ "Map: 100%|██████████| 872/872 [00:00<00:00, 10952.43 examples/s]\n",
1084
+ "Map: 100%|██████████| 1821/1821 [00:00<00:00, 11315.74 examples/s]\n"
1085
+ ]
1086
+ }
1087
+ ],
1088
+ "source": [
1089
+ "tokenized_dataset = raw_dataset.map(tokenize_function, batched = True)"
1090
+ ]
1091
+ },
1092
+ {
1093
+ "cell_type": "code",
1094
+ "execution_count": 55,
1095
+ "metadata": {},
1096
+ "outputs": [
1097
+ {
1098
+ "data": {
1099
+ "text/plain": [
1100
+ "DatasetDict({\n",
1101
+ " train: Dataset({\n",
1102
+ " features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
1103
+ " num_rows: 67349\n",
1104
+ " })\n",
1105
+ " validation: Dataset({\n",
1106
+ " features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
1107
+ " num_rows: 872\n",
1108
+ " })\n",
1109
+ " test: Dataset({\n",
1110
+ " features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
1111
+ " num_rows: 1821\n",
1112
+ " })\n",
1113
+ "})"
1114
+ ]
1115
+ },
1116
+ "execution_count": 55,
1117
+ "metadata": {},
1118
+ "output_type": "execute_result"
1119
+ }
1120
+ ],
1121
+ "source": [
1122
+ "tokenized_dataset"
1123
+ ]
1124
+ },
1125
+ {
1126
+ "cell_type": "code",
1127
+ "execution_count": 56,
1128
+ "metadata": {},
1129
+ "outputs": [],
1130
+ "source": [
1131
+ "from transformers import DataCollatorWithPadding\n",
1132
+ "dc = DataCollatorWithPadding(tokenizer = tokenizer, padding = True)"
1133
+ ]
1134
+ },
1135
+ {
1136
+ "cell_type": "code",
1137
+ "execution_count": 59,
1138
+ "metadata": {},
1139
+ "outputs": [],
1140
+ "source": [
1141
+ "samples = tokenized_dataset[\"train\"][:3]\n",
1142
+ "samples = {k: v for k,v in samples.items() if k not in [\"sentence\", \"ids\"]}"
1143
+ ]
1144
+ },
1145
+ {
1146
+ "cell_type": "code",
1147
+ "execution_count": 61,
1148
+ "metadata": {},
1149
+ "outputs": [
1150
+ {
1151
+ "data": {
1152
+ "text/plain": [
1153
+ "dict_keys(['label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])"
1154
+ ]
1155
+ },
1156
+ "execution_count": 61,
1157
+ "metadata": {},
1158
+ "output_type": "execute_result"
1159
+ }
1160
+ ],
1161
+ "source": [
1162
+ "samples.keys()"
1163
+ ]
1164
+ },
1165
+ {
1166
+ "cell_type": "code",
1167
+ "execution_count": 62,
1168
+ "metadata": {},
1169
+ "outputs": [
1170
+ {
1171
+ "data": {
1172
+ "text/plain": [
1173
+ "{'label': [0, 0, 1],\n",
1174
+ " 'idx': [0, 1, 2],\n",
1175
+ " 'input_ids': [[101,\n",
1176
+ " 5342,\n",
1177
+ " 2047,\n",
1178
+ " 3595,\n",
1179
+ " 8496,\n",
1180
+ " 2013,\n",
1181
+ " 1996,\n",
1182
+ " 18643,\n",
1183
+ " 3197,\n",
1184
+ " 102,\n",
1185
+ " 0,\n",
1186
+ " 0,\n",
1187
+ " 0,\n",
1188
+ " 0,\n",
1189
+ " 0,\n",
1190
+ " 0,\n",
1191
+ " 0,\n",
1192
+ " 0,\n",
1193
+ " 0,\n",
1194
+ " 0,\n",
1195
+ " 0,\n",
1196
+ " 0,\n",
1197
+ " 0,\n",
1198
+ " 0,\n",
1199
+ " 0,\n",
1200
+ " 0,\n",
1201
+ " 0,\n",
1202
+ " 0,\n",
1203
+ " 0,\n",
1204
+ " 0,\n",
1205
+ " 0,\n",
1206
+ " 0,\n",
1207
+ " 0,\n",
1208
+ " 0,\n",
1209
+ " 0,\n",
1210
+ " 0,\n",
1211
+ " 0,\n",
1212
+ " 0,\n",
1213
+ " 0,\n",
1214
+ " 0,\n",
1215
+ " 0,\n",
1216
+ " 0,\n",
1217
+ " 0,\n",
1218
+ " 0,\n",
1219
+ " 0,\n",
1220
+ " 0,\n",
1221
+ " 0,\n",
1222
+ " 0,\n",
1223
+ " 0,\n",
1224
+ " 0,\n",
1225
+ " 0,\n",
1226
+ " 0,\n",
1227
+ " 0,\n",
1228
+ " 0,\n",
1229
+ " 0,\n",
1230
+ " 0],\n",
1231
+ " [101,\n",
1232
+ " 3397,\n",
1233
+ " 2053,\n",
1234
+ " 15966,\n",
1235
+ " 1010,\n",
1236
+ " 2069,\n",
1237
+ " 4450,\n",
1238
+ " 2098,\n",
1239
+ " 18201,\n",
1240
+ " 2015,\n",
1241
+ " 102,\n",
1242
+ " 0,\n",
1243
+ " 0,\n",
1244
+ " 0,\n",
1245
+ " 0,\n",
1246
+ " 0,\n",
1247
+ " 0,\n",
1248
+ " 0,\n",
1249
+ " 0,\n",
1250
+ " 0,\n",
1251
+ " 0,\n",
1252
+ " 0,\n",
1253
+ " 0,\n",
1254
+ " 0,\n",
1255
+ " 0,\n",
1256
+ " 0,\n",
1257
+ " 0,\n",
1258
+ " 0,\n",
1259
+ " 0,\n",
1260
+ " 0,\n",
1261
+ " 0,\n",
1262
+ " 0,\n",
1263
+ " 0,\n",
1264
+ " 0,\n",
1265
+ " 0,\n",
1266
+ " 0,\n",
1267
+ " 0,\n",
1268
+ " 0,\n",
1269
+ " 0,\n",
1270
+ " 0,\n",
1271
+ " 0,\n",
1272
+ " 0,\n",
1273
+ " 0,\n",
1274
+ " 0,\n",
1275
+ " 0,\n",
1276
+ " 0,\n",
1277
+ " 0,\n",
1278
+ " 0,\n",
1279
+ " 0,\n",
1280
+ " 0,\n",
1281
+ " 0,\n",
1282
+ " 0,\n",
1283
+ " 0,\n",
1284
+ " 0,\n",
1285
+ " 0,\n",
1286
+ " 0],\n",
1287
+ " [101,\n",
1288
+ " 2008,\n",
1289
+ " 7459,\n",
1290
+ " 2049,\n",
1291
+ " 3494,\n",
1292
+ " 1998,\n",
1293
+ " 10639,\n",
1294
+ " 2015,\n",
1295
+ " 2242,\n",
1296
+ " 2738,\n",
1297
+ " 3376,\n",
1298
+ " 2055,\n",
1299
+ " 2529,\n",
1300
+ " 3267,\n",
1301
+ " 102,\n",
1302
+ " 0,\n",
1303
+ " 0,\n",
1304
+ " 0,\n",
1305
+ " 0,\n",
1306
+ " 0,\n",
1307
+ " 0,\n",
1308
+ " 0,\n",
1309
+ " 0,\n",
1310
+ " 0,\n",
1311
+ " 0,\n",
1312
+ " 0,\n",
1313
+ " 0,\n",
1314
+ " 0,\n",
1315
+ " 0,\n",
1316
+ " 0,\n",
1317
+ " 0,\n",
1318
+ " 0,\n",
1319
+ " 0,\n",
1320
+ " 0,\n",
1321
+ " 0,\n",
1322
+ " 0,\n",
1323
+ " 0,\n",
1324
+ " 0,\n",
1325
+ " 0,\n",
1326
+ " 0,\n",
1327
+ " 0,\n",
1328
+ " 0,\n",
1329
+ " 0,\n",
1330
+ " 0,\n",
1331
+ " 0,\n",
1332
+ " 0,\n",
1333
+ " 0,\n",
1334
+ " 0,\n",
1335
+ " 0,\n",
1336
+ " 0,\n",
1337
+ " 0,\n",
1338
+ " 0,\n",
1339
+ " 0,\n",
1340
+ " 0,\n",
1341
+ " 0,\n",
1342
+ " 0]],\n",
1343
+ " 'token_type_ids': [[0,\n",
1344
+ " 0,\n",
1345
+ " 0,\n",
1346
+ " 0,\n",
1347
+ " 0,\n",
1348
+ " 0,\n",
1349
+ " 0,\n",
1350
+ " 0,\n",
1351
+ " 0,\n",
1352
+ " 0,\n",
1353
+ " 0,\n",
1354
+ " 0,\n",
1355
+ " 0,\n",
1356
+ " 0,\n",
1357
+ " 0,\n",
1358
+ " 0,\n",
1359
+ " 0,\n",
1360
+ " 0,\n",
1361
+ " 0,\n",
1362
+ " 0,\n",
1363
+ " 0,\n",
1364
+ " 0,\n",
1365
+ " 0,\n",
1366
+ " 0,\n",
1367
+ " 0,\n",
1368
+ " 0,\n",
1369
+ " 0,\n",
1370
+ " 0,\n",
1371
+ " 0,\n",
1372
+ " 0,\n",
1373
+ " 0,\n",
1374
+ " 0,\n",
1375
+ " 0,\n",
1376
+ " 0,\n",
1377
+ " 0,\n",
1378
+ " 0,\n",
1379
+ " 0,\n",
1380
+ " 0,\n",
1381
+ " 0,\n",
1382
+ " 0,\n",
1383
+ " 0,\n",
1384
+ " 0,\n",
1385
+ " 0,\n",
1386
+ " 0,\n",
1387
+ " 0,\n",
1388
+ " 0,\n",
1389
+ " 0,\n",
1390
+ " 0,\n",
1391
+ " 0,\n",
1392
+ " 0,\n",
1393
+ " 0,\n",
1394
+ " 0,\n",
1395
+ " 0,\n",
1396
+ " 0,\n",
1397
+ " 0,\n",
1398
+ " 0],\n",
1399
+ " [0,\n",
1400
+ " 0,\n",
1401
+ " 0,\n",
1402
+ " 0,\n",
1403
+ " 0,\n",
1404
+ " 0,\n",
1405
+ " 0,\n",
1406
+ " 0,\n",
1407
+ " 0,\n",
1408
+ " 0,\n",
1409
+ " 0,\n",
1410
+ " 0,\n",
1411
+ " 0,\n",
1412
+ " 0,\n",
1413
+ " 0,\n",
1414
+ " 0,\n",
1415
+ " 0,\n",
1416
+ " 0,\n",
1417
+ " 0,\n",
1418
+ " 0,\n",
1419
+ " 0,\n",
1420
+ " 0,\n",
1421
+ " 0,\n",
1422
+ " 0,\n",
1423
+ " 0,\n",
1424
+ " 0,\n",
1425
+ " 0,\n",
1426
+ " 0,\n",
1427
+ " 0,\n",
1428
+ " 0,\n",
1429
+ " 0,\n",
1430
+ " 0,\n",
1431
+ " 0,\n",
1432
+ " 0,\n",
1433
+ " 0,\n",
1434
+ " 0,\n",
1435
+ " 0,\n",
1436
+ " 0,\n",
1437
+ " 0,\n",
1438
+ " 0,\n",
1439
+ " 0,\n",
1440
+ " 0,\n",
1441
+ " 0,\n",
1442
+ " 0,\n",
1443
+ " 0,\n",
1444
+ " 0,\n",
1445
+ " 0,\n",
1446
+ " 0,\n",
1447
+ " 0,\n",
1448
+ " 0,\n",
1449
+ " 0,\n",
1450
+ " 0,\n",
1451
+ " 0,\n",
1452
+ " 0,\n",
1453
+ " 0,\n",
1454
+ " 0],\n",
1455
+ " [0,\n",
1456
+ " 0,\n",
1457
+ " 0,\n",
1458
+ " 0,\n",
1459
+ " 0,\n",
1460
+ " 0,\n",
1461
+ " 0,\n",
1462
+ " 0,\n",
1463
+ " 0,\n",
1464
+ " 0,\n",
1465
+ " 0,\n",
1466
+ " 0,\n",
1467
+ " 0,\n",
1468
+ " 0,\n",
1469
+ " 0,\n",
1470
+ " 0,\n",
1471
+ " 0,\n",
1472
+ " 0,\n",
1473
+ " 0,\n",
1474
+ " 0,\n",
1475
+ " 0,\n",
1476
+ " 0,\n",
1477
+ " 0,\n",
1478
+ " 0,\n",
1479
+ " 0,\n",
1480
+ " 0,\n",
1481
+ " 0,\n",
1482
+ " 0,\n",
1483
+ " 0,\n",
1484
+ " 0,\n",
1485
+ " 0,\n",
1486
+ " 0,\n",
1487
+ " 0,\n",
1488
+ " 0,\n",
1489
+ " 0,\n",
1490
+ " 0,\n",
1491
+ " 0,\n",
1492
+ " 0,\n",
1493
+ " 0,\n",
1494
+ " 0,\n",
1495
+ " 0,\n",
1496
+ " 0,\n",
1497
+ " 0,\n",
1498
+ " 0,\n",
1499
+ " 0,\n",
1500
+ " 0,\n",
1501
+ " 0,\n",
1502
+ " 0,\n",
1503
+ " 0,\n",
1504
+ " 0,\n",
1505
+ " 0,\n",
1506
+ " 0,\n",
1507
+ " 0,\n",
1508
+ " 0,\n",
1509
+ " 0,\n",
1510
+ " 0]],\n",
1511
+ " 'attention_mask': [[1,\n",
1512
+ " 1,\n",
1513
+ " 1,\n",
1514
+ " 1,\n",
1515
+ " 1,\n",
1516
+ " 1,\n",
1517
+ " 1,\n",
1518
+ " 1,\n",
1519
+ " 1,\n",
1520
+ " 1,\n",
1521
+ " 0,\n",
1522
+ " 0,\n",
1523
+ " 0,\n",
1524
+ " 0,\n",
1525
+ " 0,\n",
1526
+ " 0,\n",
1527
+ " 0,\n",
1528
+ " 0,\n",
1529
+ " 0,\n",
1530
+ " 0,\n",
1531
+ " 0,\n",
1532
+ " 0,\n",
1533
+ " 0,\n",
1534
+ " 0,\n",
1535
+ " 0,\n",
1536
+ " 0,\n",
1537
+ " 0,\n",
1538
+ " 0,\n",
1539
+ " 0,\n",
1540
+ " 0,\n",
1541
+ " 0,\n",
1542
+ " 0,\n",
1543
+ " 0,\n",
1544
+ " 0,\n",
1545
+ " 0,\n",
1546
+ " 0,\n",
1547
+ " 0,\n",
1548
+ " 0,\n",
1549
+ " 0,\n",
1550
+ " 0,\n",
1551
+ " 0,\n",
1552
+ " 0,\n",
1553
+ " 0,\n",
1554
+ " 0,\n",
1555
+ " 0,\n",
1556
+ " 0,\n",
1557
+ " 0,\n",
1558
+ " 0,\n",
1559
+ " 0,\n",
1560
+ " 0,\n",
1561
+ " 0,\n",
1562
+ " 0,\n",
1563
+ " 0,\n",
1564
+ " 0,\n",
1565
+ " 0,\n",
1566
+ " 0],\n",
1567
+ " [1,\n",
1568
+ " 1,\n",
1569
+ " 1,\n",
1570
+ " 1,\n",
1571
+ " 1,\n",
1572
+ " 1,\n",
1573
+ " 1,\n",
1574
+ " 1,\n",
1575
+ " 1,\n",
1576
+ " 1,\n",
1577
+ " 1,\n",
1578
+ " 0,\n",
1579
+ " 0,\n",
1580
+ " 0,\n",
1581
+ " 0,\n",
1582
+ " 0,\n",
1583
+ " 0,\n",
1584
+ " 0,\n",
1585
+ " 0,\n",
1586
+ " 0,\n",
1587
+ " 0,\n",
1588
+ " 0,\n",
1589
+ " 0,\n",
1590
+ " 0,\n",
1591
+ " 0,\n",
1592
+ " 0,\n",
1593
+ " 0,\n",
1594
+ " 0,\n",
1595
+ " 0,\n",
1596
+ " 0,\n",
1597
+ " 0,\n",
1598
+ " 0,\n",
1599
+ " 0,\n",
1600
+ " 0,\n",
1601
+ " 0,\n",
1602
+ " 0,\n",
1603
+ " 0,\n",
1604
+ " 0,\n",
1605
+ " 0,\n",
1606
+ " 0,\n",
1607
+ " 0,\n",
1608
+ " 0,\n",
1609
+ " 0,\n",
1610
+ " 0,\n",
1611
+ " 0,\n",
1612
+ " 0,\n",
1613
+ " 0,\n",
1614
+ " 0,\n",
1615
+ " 0,\n",
1616
+ " 0,\n",
1617
+ " 0,\n",
1618
+ " 0,\n",
1619
+ " 0,\n",
1620
+ " 0,\n",
1621
+ " 0,\n",
1622
+ " 0],\n",
1623
+ " [1,\n",
1624
+ " 1,\n",
1625
+ " 1,\n",
1626
+ " 1,\n",
1627
+ " 1,\n",
1628
+ " 1,\n",
1629
+ " 1,\n",
1630
+ " 1,\n",
1631
+ " 1,\n",
1632
+ " 1,\n",
1633
+ " 1,\n",
1634
+ " 1,\n",
1635
+ " 1,\n",
1636
+ " 1,\n",
1637
+ " 1,\n",
1638
+ " 0,\n",
1639
+ " 0,\n",
1640
+ " 0,\n",
1641
+ " 0,\n",
1642
+ " 0,\n",
1643
+ " 0,\n",
1644
+ " 0,\n",
1645
+ " 0,\n",
1646
+ " 0,\n",
1647
+ " 0,\n",
1648
+ " 0,\n",
1649
+ " 0,\n",
1650
+ " 0,\n",
1651
+ " 0,\n",
1652
+ " 0,\n",
1653
+ " 0,\n",
1654
+ " 0,\n",
1655
+ " 0,\n",
1656
+ " 0,\n",
1657
+ " 0,\n",
1658
+ " 0,\n",
1659
+ " 0,\n",
1660
+ " 0,\n",
1661
+ " 0,\n",
1662
+ " 0,\n",
1663
+ " 0,\n",
1664
+ " 0,\n",
1665
+ " 0,\n",
1666
+ " 0,\n",
1667
+ " 0,\n",
1668
+ " 0,\n",
1669
+ " 0,\n",
1670
+ " 0,\n",
1671
+ " 0,\n",
1672
+ " 0,\n",
1673
+ " 0,\n",
1674
+ " 0,\n",
1675
+ " 0,\n",
1676
+ " 0,\n",
1677
+ " 0,\n",
1678
+ " 0]]}"
1679
+ ]
1680
+ },
1681
+ "execution_count": 62,
1682
+ "metadata": {},
1683
+ "output_type": "execute_result"
1684
+ }
1685
+ ],
1686
+ "source": [
1687
+ "samples"
1688
+ ]
1689
+ },
1690
+ {
1691
+ "cell_type": "markdown",
1692
+ "metadata": {},
1693
+ "source": [
1694
+ "The datasets library is pretty intuitive in the way it is structured. We just need to make sure before collating, we have the necessary fields and drop the unnecessary fields from the dataset. And that we do dynamic padding based on a batch of data and not on the model dim or the max sequence length of the entire corpus. It will be economical in terms of computation and also help training."
1695
+ ]
1696
+ },
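+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To get a feel for the saving, the unexecuted sketch below compares how many tokens a batch of 32 sequences would carry under dynamic padding versus padding everything to a fixed length; the 512-token limit used here is just an illustrative assumption about the model's maximum input size."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Tokenize a small batch without padding to get the true sequence lengths\n",
+ "lengths = [len(tokenizer(s)[\"input_ids\"]) for s in raw_dataset[\"train\"][:32][\"sentence\"]]\n",
+ "dynamic_tokens = max(lengths) * len(lengths)  # pad to the longest sequence in the batch\n",
+ "fixed_tokens = 512 * len(lengths)  # pad everything to an assumed 512-token limit\n",
+ "print(f\"dynamic padding: {dynamic_tokens} tokens vs fixed padding: {fixed_tokens} tokens\")"
+ ]
+ }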
1697
+ ],
1698
+ "metadata": {
1699
+ "kernelspec": {
1700
+ "display_name": "Python 3",
1701
+ "language": "python",
1702
+ "name": "python3"
1703
+ },
1704
+ "language_info": {
1705
+ "codemirror_mode": {
1706
+ "name": "ipython",
1707
+ "version": 3
1708
+ },
1709
+ "file_extension": ".py",
1710
+ "mimetype": "text/x-python",
1711
+ "name": "python",
1712
+ "nbconvert_exporter": "python",
1713
+ "pygments_lexer": "ipython3",
1714
+ "version": "3.10.14"
1715
+ }
1716
+ },
1717
+ "nbformat": 4,
1718
+ "nbformat_minor": 2
1719
+ }