Upload ML Quickstart_ Model Training with HuggingFace (1).ipynb
Browse files
ML Quickstart_ Model Training with HuggingFace (1).ipynb
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
{"cells":[{"cell_type":"markdown","source":["# Databricks & Hugging Face ML Quickstart: Model Training\n\nThis notebook provides a quick overview of machine learning model training on Databricks using Hugging Face transformers. The notebook includes using MLflow to track the trained models.\n\nThis tutorial covers:\n- Part 1: Training a text classification transformer model with MLflow tracking\n\n### Requirements\n- Cluster running Databricks Runtime 7.5 ML or above\n- Training is super slow/unusable if there is no GPU attached to the cluster"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"528c6d2b-3c5a-4236-ac61-fa8cc8d4d323"}}},{"cell_type":"markdown","source":["### Libraries\nImport the necessary libraries. These libraries are preinstalled on Databricks Runtime for Machine Learning ([AWS](https://docs.databricks.com/runtime/mlruntime.html)|[Azure](https://docs.microsoft.com/azure/databricks/runtime/mlruntime)|[GCP](https://docs.gcp.databricks.com/runtime/mlruntime.html)) clusters and are tuned for compatibility and performance."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"8738a402-24dc-4074-bebb-b51bec8e74db"}}},{"cell_type":"code","source":["%pip install transformers datasets mlflow torch"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"7c554af6-20e3-44ec-a8e1-b1e3411ab169"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Python interpreter will be restarted.\nCollecting transformers\n Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)\nCollecting datasets\n Downloading datasets-2.3.2-py3-none-any.whl (362 kB)\nCollecting mlflow\n Downloading mlflow-1.27.0-py3-none-any.whl (17.9 MB)\nCollecting torch\n Downloading torch-1.12.0-cp38-cp38-manylinux1_x86_64.whl (776.3 MB)\nCollecting xxhash\n Downloading xxhash-3.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)\nCollecting aiohttp\n Downloading aiohttp-3.8.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.3 MB)\nCollecting fsspec[http]>=2021.05.0\n Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)\nCollecting dill<0.3.6\n Downloading dill-0.3.5.1-py2.py3-none-any.whl (95 kB)\nRequirement already satisfied: numpy>=1.17 in /databricks/python3/lib/python3.8/site-packages (from datasets) (1.20.1)\nCollecting multiprocess\n Downloading multiprocess-0.70.13-py38-none-any.whl (131 kB)\nCollecting responses<0.19\n Downloading responses-0.18.0-py3-none-any.whl (38 kB)\nCollecting huggingface-hub<1.0.0,>=0.1.0\n Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)\nCollecting pyarrow>=6.0.0\n Downloading pyarrow-8.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.4 MB)\nRequirement already satisfied: packaging in /databricks/python3/lib/python3.8/site-packages (from datasets) (20.9)\nRequirement already satisfied: pandas in /databricks/python3/lib/python3.8/site-packages (from datasets) (1.2.4)\nCollecting tqdm>=4.62.1\n Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)\nRequirement already satisfied: requests>=2.19.0 in /databricks/python3/lib/python3.8/site-packages (from datasets) (2.25.1)\nCollecting pyyaml>=5.1\n Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)\nRequirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.6.0)\nCollecting typing-extensions>=3.7.4.3\n Downloading typing_extensions-4.3.0-py3-none-any.whl (25 kB)\nRequirement already satisfied: pyparsing>=2.0.2 in /databricks/python3/lib/python3.8/site-packages (from packaging->datasets) (2.4.7)\nRequirement already satisfied: certifi>=2017.4.17 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2020.12.5)\nRequirement already satisfied: chardet<5,>=3.0.2 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (4.0.0)\nRequirement already satisfied: urllib3<1.27,>=1.21.1 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (1.25.11)\nRequirement already satisfied: idna<3,>=2.5 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2.10)\nRequirement already satisfied: pytz in /databricks/python3/lib/python3.8/site-packages (from mlflow) (2020.5)\nCollecting Flask\n Downloading Flask-2.1.3-py3-none-any.whl (95 kB)\nRequirement already satisfied: scipy in /databricks/python3/lib/python3.8/site-packages (from mlflow) (1.6.2)\nCollecting alembic\n Downloading alembic-1.8.1-py3-none-any.whl (209 kB)\nCollecting sqlparse>=0.3.1\n Downloading sqlparse-0.4.2-py3-none-any.whl (42 kB)\nRequirement already satisfied: protobuf>=3.12.0 in /databricks/python3/lib/python3.8/site-packages (from mlflow) (3.17.2)\nCollecting gitpython>=2.1.0\n Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)\nCollecting databricks-cli>=0.8.7\n Downloading databricks-cli-0.17.0.tar.gz (81 kB)\nCollecting cloudpickle\n Downloading cloudpickle-2.1.0-py3-none-any.whl (25 kB)\nCollecting click>=7.0\n Downloading click-8.1.3-py3-none-any.whl (96 kB)\nCollecting sqlalchemy>=1.4.0\n Downloading SQLAlchemy-1.4.39-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)\nCollecting gunicorn\n Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)\nCollecting docker>=4.0.0\n Downloading docker-5.0.3-py2.py3-none-any.whl (146 kB)\nCollecting querystring-parser\n Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)\nCollecting importlib-metadata!=4.7.0,>=3.7.0\n Downloading importlib_metadata-4.12.0-py3-none-any.whl (21 kB)\nRequirement already satisfied: entrypoints in /databricks/python3/lib/python3.8/site-packages (from mlflow) (0.3)\nCollecting prometheus-flask-exporter\n Downloading prometheus_flask_exporter-0.20.2-py3-none-any.whl (18 kB)\nCollecting pyjwt>=1.7.0\n Downloading PyJWT-2.4.0-py3-none-any.whl (18 kB)\nCollecting oauthlib>=3.1.0\n Downloading oauthlib-3.2.0-py3-none-any.whl (151 kB)\nCollecting tabulate>=0.7.7\n Downloading tabulate-0.8.10-py3-none-any.whl (29 kB)\nRequirement already satisfied: six>=1.10.0 in /databricks/python3/lib/python3.8/site-packages (from databricks-cli>=0.8.7->mlflow) (1.15.0)\nCollecting websocket-client>=0.32.0\n Downloading websocket_client-1.3.3-py3-none-any.whl (54 kB)\nCollecting gitdb<5,>=4.0.1\n Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)\nCollecting smmap<6,>=3.0.1\n Downloading smmap-5.0.0-py3-none-any.whl (24 kB)\nCollecting zipp>=0.5\n Downloading zipp-3.8.1-py3-none-any.whl (5.6 kB)\nCollecting greenlet!=0.4.17\n Downloading greenlet-1.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (156 kB)\nCollecting regex!=2019.12.17\n Downloading regex-2022.7.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (765 kB)\nCollecting tokenizers!=0.11.3,<0.13,>=0.11.1\n Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)\nCollecting async-timeout<5.0,>=4.0.0a3\n Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)\nCollecting charset-normalizer<3.0,>=2.0\n Downloading charset_normalizer-2.1.0-py3-none-any.whl (39 kB)\nCollecting frozenlist>=1.1.1\n Downloading frozenlist-1.3.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB)\nCollecting multidict<7.0,>=4.5\n Downloading multidict-6.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (121 kB)\nRequirement already satisfied: attrs>=17.3.0 in /databricks/python3/lib/python3.8/site-packages (from aiohttp->datasets) (20.3.0)\nCollecting yarl<2.0,>=1.0\n Downloading yarl-1.7.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (308 kB)\nCollecting aiosignal>=1.1.2\n Downloading aiosignal-1.2.0-py3-none-any.whl (8.2 kB)\nCollecting importlib-resources\n Downloading importlib_resources-5.8.0-py3-none-any.whl (28 kB)\nCollecting Mako\n Downloading Mako-1.2.1-py3-none-any.whl (78 kB)\nCollecting Werkzeug>=2.0\n Downloading Werkzeug-2.1.2-py3-none-any.whl (224 kB)\nCollecting itsdangerous>=2.0\n Downloading itsdangerous-2.1.2-py3-none-any.whl (15 kB)\nCollecting Jinja2>=3.0\n Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)\nRequirement already satisfied: MarkupSafe>=2.0 in /databricks/python3/lib/python3.8/site-packages (from Jinja2>=3.0->Flask->mlflow) (2.0.1)\nRequirement already satisfied: setuptools>=3.0 in /usr/local/lib/python3.8/dist-packages (from gunicorn->mlflow) (52.0.0)\nRequirement already satisfied: python-dateutil>=2.7.3 in /databricks/python3/lib/python3.8/site-packages (from pandas->datasets) (2.8.1)\nRequirement already satisfied: prometheus-client in /databricks/python3/lib/python3.8/site-packages (from prometheus-flask-exporter->mlflow) (0.10.1)\nBuilding wheels for collected packages: databricks-cli\n Building wheel for databricks-cli (setup.py): started\n Building wheel for databricks-cli (setup.py): finished with status 'done'\n Created wheel for databricks-cli: filename=databricks_cli-0.17.0-py3-none-any.whl size=141932 sha256=bb09e2cf09646974e0569af11512120f854d104b5284c4656f865a660e821cc9\n Stored in directory: /root/.cache/pip/wheels/bc/ef/2a/18885b70c6b78d4b9612ef2bf4bfdc7325f43db9d817d20f3f\nSuccessfully built databricks-cli\nInstalling collected packages: zipp, multidict, frozenlist, yarl, Werkzeug, smmap, Jinja2, itsdangerous, importlib-metadata, greenlet, click, charset-normalizer, async-timeout, aiosignal, websocket-client, typing-extensions, tqdm, tabulate, sqlalchemy, pyyaml, pyjwt, oauthlib, Mako, importlib-resources, gitdb, fsspec, Flask, dill, aiohttp, xxhash, tokenizers, sqlparse, responses, regex, querystring-parser, pyarrow, prometheus-flask-exporter, multiprocess, huggingface-hub, gunicorn, gitpython, docker, databricks-cli, cloudpickle, alembic, transformers, torch, mlflow, datasets\n Attempting uninstall: Jinja2\n Found existing installation: Jinja2 2.11.3\n Not uninstalling jinja2 at /databricks/python3/lib/python3.8/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-bab12bc9-22d1-4101-97b5-6ae403f8662e\n Can't uninstall 'Jinja2'. No files were found to uninstall.\n Attempting uninstall: pyarrow\n Found existing installation: pyarrow 4.0.0\n Not uninstalling pyarrow at /databricks/python3/lib/python3.8/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-bab12bc9-22d1-4101-97b5-6ae403f8662e\n Can't uninstall 'pyarrow'. No files were found to uninstall.\nSuccessfully installed Flask-2.1.3 Jinja2-3.1.2 Mako-1.2.1 Werkzeug-2.1.2 aiohttp-3.8.1 aiosignal-1.2.0 alembic-1.8.1 async-timeout-4.0.2 charset-normalizer-2.1.0 click-8.1.3 cloudpickle-2.1.0 databricks-cli-0.17.0 datasets-2.3.2 dill-0.3.5.1 docker-5.0.3 frozenlist-1.3.0 fsspec-2022.5.0 gitdb-4.0.9 gitpython-3.1.27 greenlet-1.1.2 gunicorn-20.1.0 huggingface-hub-0.8.1 importlib-metadata-4.12.0 importlib-resources-5.8.0 itsdangerous-2.1.2 mlflow-1.27.0 multidict-6.0.2 multiprocess-0.70.13 oauthlib-3.2.0 prometheus-flask-exporter-0.20.2 pyarrow-8.0.0 pyjwt-2.4.0 pyyaml-6.0 querystring-parser-1.2.4 regex-2022.7.9 responses-0.18.0 smmap-5.0.0 sqlalchemy-1.4.39 sqlparse-0.4.2 tabulate-0.8.10 tokenizers-0.12.1 torch-1.12.0 tqdm-4.64.0 transformers-4.20.1 typing-extensions-4.3.0 websocket-client-1.3.3 xxhash-3.0.0 yarl-1.7.2 zipp-3.8.1\nPython interpreter will be restarted.\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Python interpreter will be restarted.\nCollecting transformers\n Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)\nCollecting datasets\n Downloading datasets-2.3.2-py3-none-any.whl (362 kB)\nCollecting mlflow\n Downloading mlflow-1.27.0-py3-none-any.whl (17.9 MB)\nCollecting torch\n Downloading torch-1.12.0-cp38-cp38-manylinux1_x86_64.whl (776.3 MB)\nCollecting xxhash\n Downloading xxhash-3.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)\nCollecting aiohttp\n Downloading aiohttp-3.8.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.3 MB)\nCollecting fsspec[http]>=2021.05.0\n Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)\nCollecting dill<0.3.6\n Downloading dill-0.3.5.1-py2.py3-none-any.whl (95 kB)\nRequirement already satisfied: numpy>=1.17 in /databricks/python3/lib/python3.8/site-packages (from datasets) (1.20.1)\nCollecting multiprocess\n Downloading multiprocess-0.70.13-py38-none-any.whl (131 kB)\nCollecting responses<0.19\n Downloading responses-0.18.0-py3-none-any.whl (38 kB)\nCollecting huggingface-hub<1.0.0,>=0.1.0\n Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)\nCollecting pyarrow>=6.0.0\n Downloading pyarrow-8.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.4 MB)\nRequirement already satisfied: packaging in /databricks/python3/lib/python3.8/site-packages (from datasets) (20.9)\nRequirement already satisfied: pandas in /databricks/python3/lib/python3.8/site-packages (from datasets) (1.2.4)\nCollecting tqdm>=4.62.1\n Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)\nRequirement already satisfied: requests>=2.19.0 in /databricks/python3/lib/python3.8/site-packages (from datasets) (2.25.1)\nCollecting pyyaml>=5.1\n Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)\nRequirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.6.0)\nCollecting typing-extensions>=3.7.4.3\n Downloading typing_extensions-4.3.0-py3-none-any.whl (25 kB)\nRequirement already satisfied: pyparsing>=2.0.2 in /databricks/python3/lib/python3.8/site-packages (from packaging->datasets) (2.4.7)\nRequirement already satisfied: certifi>=2017.4.17 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2020.12.5)\nRequirement already satisfied: chardet<5,>=3.0.2 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (4.0.0)\nRequirement already satisfied: urllib3<1.27,>=1.21.1 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (1.25.11)\nRequirement already satisfied: idna<3,>=2.5 in /databricks/python3/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2.10)\nRequirement already satisfied: pytz in /databricks/python3/lib/python3.8/site-packages (from mlflow) (2020.5)\nCollecting Flask\n Downloading Flask-2.1.3-py3-none-any.whl (95 kB)\nRequirement already satisfied: scipy in /databricks/python3/lib/python3.8/site-packages (from mlflow) (1.6.2)\nCollecting alembic\n Downloading alembic-1.8.1-py3-none-any.whl (209 kB)\nCollecting sqlparse>=0.3.1\n Downloading sqlparse-0.4.2-py3-none-any.whl (42 kB)\nRequirement already satisfied: protobuf>=3.12.0 in /databricks/python3/lib/python3.8/site-packages (from mlflow) (3.17.2)\nCollecting gitpython>=2.1.0\n Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)\nCollecting databricks-cli>=0.8.7\n Downloading databricks-cli-0.17.0.tar.gz (81 kB)\nCollecting cloudpickle\n Downloading cloudpickle-2.1.0-py3-none-any.whl (25 kB)\nCollecting click>=7.0\n Downloading click-8.1.3-py3-none-any.whl (96 kB)\nCollecting sqlalchemy>=1.4.0\n Downloading SQLAlchemy-1.4.39-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)\nCollecting gunicorn\n Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)\nCollecting docker>=4.0.0\n Downloading docker-5.0.3-py2.py3-none-any.whl (146 kB)\nCollecting querystring-parser\n Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)\nCollecting importlib-metadata!=4.7.0,>=3.7.0\n Downloading importlib_metadata-4.12.0-py3-none-any.whl (21 kB)\nRequirement already satisfied: entrypoints in /databricks/python3/lib/python3.8/site-packages (from mlflow) (0.3)\nCollecting prometheus-flask-exporter\n Downloading prometheus_flask_exporter-0.20.2-py3-none-any.whl (18 kB)\nCollecting pyjwt>=1.7.0\n Downloading PyJWT-2.4.0-py3-none-any.whl (18 kB)\nCollecting oauthlib>=3.1.0\n Downloading oauthlib-3.2.0-py3-none-any.whl (151 kB)\nCollecting tabulate>=0.7.7\n Downloading tabulate-0.8.10-py3-none-any.whl (29 kB)\nRequirement already satisfied: six>=1.10.0 in /databricks/python3/lib/python3.8/site-packages (from databricks-cli>=0.8.7->mlflow) (1.15.0)\nCollecting websocket-client>=0.32.0\n Downloading websocket_client-1.3.3-py3-none-any.whl (54 kB)\nCollecting gitdb<5,>=4.0.1\n Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)\nCollecting smmap<6,>=3.0.1\n Downloading smmap-5.0.0-py3-none-any.whl (24 kB)\nCollecting zipp>=0.5\n Downloading zipp-3.8.1-py3-none-any.whl (5.6 kB)\nCollecting greenlet!=0.4.17\n Downloading greenlet-1.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (156 kB)\nCollecting regex!=2019.12.17\n Downloading regex-2022.7.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (765 kB)\nCollecting tokenizers!=0.11.3,<0.13,>=0.11.1\n Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)\nCollecting async-timeout<5.0,>=4.0.0a3\n Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)\nCollecting charset-normalizer<3.0,>=2.0\n Downloading charset_normalizer-2.1.0-py3-none-any.whl (39 kB)\nCollecting frozenlist>=1.1.1\n Downloading frozenlist-1.3.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB)\nCollecting multidict<7.0,>=4.5\n Downloading multidict-6.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (121 kB)\nRequirement already satisfied: attrs>=17.3.0 in /databricks/python3/lib/python3.8/site-packages (from aiohttp->datasets) (20.3.0)\nCollecting yarl<2.0,>=1.0\n Downloading yarl-1.7.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (308 kB)\nCollecting aiosignal>=1.1.2\n Downloading aiosignal-1.2.0-py3-none-any.whl (8.2 kB)\nCollecting importlib-resources\n Downloading importlib_resources-5.8.0-py3-none-any.whl (28 kB)\nCollecting Mako\n Downloading Mako-1.2.1-py3-none-any.whl (78 kB)\nCollecting Werkzeug>=2.0\n Downloading Werkzeug-2.1.2-py3-none-any.whl (224 kB)\nCollecting itsdangerous>=2.0\n Downloading itsdangerous-2.1.2-py3-none-any.whl (15 kB)\nCollecting Jinja2>=3.0\n Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)\nRequirement already satisfied: MarkupSafe>=2.0 in /databricks/python3/lib/python3.8/site-packages (from Jinja2>=3.0->Flask->mlflow) (2.0.1)\nRequirement already satisfied: setuptools>=3.0 in /usr/local/lib/python3.8/dist-packages (from gunicorn->mlflow) (52.0.0)\nRequirement already satisfied: python-dateutil>=2.7.3 in /databricks/python3/lib/python3.8/site-packages (from pandas->datasets) (2.8.1)\nRequirement already satisfied: prometheus-client in /databricks/python3/lib/python3.8/site-packages (from prometheus-flask-exporter->mlflow) (0.10.1)\nBuilding wheels for collected packages: databricks-cli\n Building wheel for databricks-cli (setup.py): started\n Building wheel for databricks-cli (setup.py): finished with status 'done'\n Created wheel for databricks-cli: filename=databricks_cli-0.17.0-py3-none-any.whl size=141932 sha256=bb09e2cf09646974e0569af11512120f854d104b5284c4656f865a660e821cc9\n Stored in directory: /root/.cache/pip/wheels/bc/ef/2a/18885b70c6b78d4b9612ef2bf4bfdc7325f43db9d817d20f3f\nSuccessfully built databricks-cli\nInstalling collected packages: zipp, multidict, frozenlist, yarl, Werkzeug, smmap, Jinja2, itsdangerous, importlib-metadata, greenlet, click, charset-normalizer, async-timeout, aiosignal, websocket-client, typing-extensions, tqdm, tabulate, sqlalchemy, pyyaml, pyjwt, oauthlib, Mako, importlib-resources, gitdb, fsspec, Flask, dill, aiohttp, xxhash, tokenizers, sqlparse, responses, regex, querystring-parser, pyarrow, prometheus-flask-exporter, multiprocess, huggingface-hub, gunicorn, gitpython, docker, databricks-cli, cloudpickle, alembic, transformers, torch, mlflow, datasets\n Attempting uninstall: Jinja2\n Found existing installation: Jinja2 2.11.3\n Not uninstalling jinja2 at /databricks/python3/lib/python3.8/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-bab12bc9-22d1-4101-97b5-6ae403f8662e\n Can't uninstall 'Jinja2'. No files were found to uninstall.\n Attempting uninstall: pyarrow\n Found existing installation: pyarrow 4.0.0\n Not uninstalling pyarrow at /databricks/python3/lib/python3.8/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-bab12bc9-22d1-4101-97b5-6ae403f8662e\n Can't uninstall 'pyarrow'. No files were found to uninstall.\nSuccessfully installed Flask-2.1.3 Jinja2-3.1.2 Mako-1.2.1 Werkzeug-2.1.2 aiohttp-3.8.1 aiosignal-1.2.0 alembic-1.8.1 async-timeout-4.0.2 charset-normalizer-2.1.0 click-8.1.3 cloudpickle-2.1.0 databricks-cli-0.17.0 datasets-2.3.2 dill-0.3.5.1 docker-5.0.3 frozenlist-1.3.0 fsspec-2022.5.0 gitdb-4.0.9 gitpython-3.1.27 greenlet-1.1.2 gunicorn-20.1.0 huggingface-hub-0.8.1 importlib-metadata-4.12.0 importlib-resources-5.8.0 itsdangerous-2.1.2 mlflow-1.27.0 multidict-6.0.2 multiprocess-0.70.13 oauthlib-3.2.0 prometheus-flask-exporter-0.20.2 pyarrow-8.0.0 pyjwt-2.4.0 pyyaml-6.0 querystring-parser-1.2.4 regex-2022.7.9 responses-0.18.0 smmap-5.0.0 sqlalchemy-1.4.39 sqlparse-0.4.2 tabulate-0.8.10 tokenizers-0.12.1 torch-1.12.0 tqdm-4.64.0 transformers-4.20.1 typing-extensions-4.3.0 websocket-client-1.3.3 xxhash-3.0.0 yarl-1.7.2 zipp-3.8.1\nPython interpreter will be restarted.\n"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Install Git LFS"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"8a18992e-ce3f-4b09-a6a1-ddd867006afa"}}},{"cell_type":"code","source":["%sh\ncurl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash\nsudo apt-get install git-lfs"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"27caf826-3804-40f1-8cd8-bc72077ceeb0"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Detected operating system as Ubuntu/focal.\nChecking for curl...\nDetected curl...\nChecking for gpg...\nDetected gpg...\nRunning apt-get update... done.\nInstalling apt-transport-https... done.\nInstalling /etc/apt/sources.list.d/github_git-lfs.list...done.\nImporting packagecloud gpg key... done.\nRunning apt-get update... done.\n\nThe repository is setup! You can now install packages.\nReading package lists...\nBuilding dependency tree...\nReading state information...\nThe following NEW packages will be installed:\n git-lfs\n0 upgraded, 1 newly installed, 0 to remove and 68 not upgraded.\nNeed to get 7,168 kB of archives.\nAfter this operation, 15.6 MB of additional disk space will be used.\nGet:1 https://packagecloud.io/github/git-lfs/ubuntu focal/main amd64 git-lfs amd64 3.2.0 [7,168 kB]\ndebconf: delaying package configuration, since apt-utils is not installed\nFetched 7,168 kB in 0s (15.4 MB/s)\nSelecting previously unselected package git-lfs.\n(Reading database ... \n(Reading database ... 5%\n(Reading database ... 10%\n(Reading database ... 15%\n(Reading database ... 20%\n(Reading database ... 25%\n(Reading database ... 30%\n(Reading database ... 35%\n(Reading database ... 40%\n(Reading database ... 45%\n(Reading database ... 50%\n(Reading database ... 55%\n(Reading database ... 60%\n(Reading database ... 65%\n(Reading database ... 70%\n(Reading database ... 75%\n(Reading database ... 80%\n(Reading database ... 85%\n(Reading database ... 90%\n(Reading database ... 95%\n(Reading database ... 100%\n(Reading database ... 92257 files and directories currently installed.)\nPreparing to unpack .../git-lfs_3.2.0_amd64.deb ...\nUnpacking git-lfs (3.2.0) ...\nSetting up git-lfs (3.2.0) ...\nGit LFS initialized.\nProcessing triggers for man-db (2.9.1-1) ...\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Detected operating system as Ubuntu/focal.\nChecking for curl...\nDetected curl...\nChecking for gpg...\nDetected gpg...\nRunning apt-get update... done.\nInstalling apt-transport-https... done.\nInstalling /etc/apt/sources.list.d/github_git-lfs.list...done.\nImporting packagecloud gpg key... done.\nRunning apt-get update... done.\n\nThe repository is setup! You can now install packages.\nReading package lists...\nBuilding dependency tree...\nReading state information...\nThe following NEW packages will be installed:\n git-lfs\n0 upgraded, 1 newly installed, 0 to remove and 68 not upgraded.\nNeed to get 7,168 kB of archives.\nAfter this operation, 15.6 MB of additional disk space will be used.\nGet:1 https://packagecloud.io/github/git-lfs/ubuntu focal/main amd64 git-lfs amd64 3.2.0 [7,168 kB]\ndebconf: delaying package configuration, since apt-utils is not installed\nFetched 7,168 kB in 0s (15.4 MB/s)\nSelecting previously unselected package git-lfs.\n(Reading database ... \n(Reading database ... 5%\n(Reading database ... 10%\n(Reading database ... 15%\n(Reading database ... 20%\n(Reading database ... 25%\n(Reading database ... 30%\n(Reading database ... 35%\n(Reading database ... 40%\n(Reading database ... 45%\n(Reading database ... 50%\n(Reading database ... 55%\n(Reading database ... 60%\n(Reading database ... 65%\n(Reading database ... 70%\n(Reading database ... 75%\n(Reading database ... 80%\n(Reading database ... 85%\n(Reading database ... 90%\n(Reading database ... 95%\n(Reading database ... 100%\n(Reading database ... 92257 files and directories currently installed.)\nPreparing to unpack .../git-lfs_3.2.0_amd64.deb ...\nUnpacking git-lfs (3.2.0) ...\nSetting up git-lfs (3.2.0) ...\nGit LFS initialized.\nProcessing triggers for man-db (2.9.1-1) ...\n"]}}],"execution_count":0},{"cell_type":"code","source":["import mlflow\nimport torch\n#from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK\n#from hyperopt.pyll import scope\nfrom datasets import load_dataset, load_metric\nfrom huggingface_hub import notebook_login\nfrom transformers import (\n AutoModelForSequenceClassification,\n AutoTokenizer,\n Trainer,\n TrainingArguments,\n)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"cec61d6d-ea1a-4d2f-9ee6-625393a24aa5"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Log into Hugging Face Hub"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"094921b5-f746-4303-af4b-7a4d61b3b48a"}}},{"cell_type":"markdown","source":["This uses the command line to login into the hugging face hub. If the Hugging Face hub is private, specify the location using the \"HF_ENDPOINT\" parameter."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"4af8d844-c5f0-45c0-80c8-c0b74e55abe9"}}},{"cell_type":"code","source":["from huggingface_hub.commands.user import _login\nfrom huggingface_hub import HfApi\napi = HfApi()\n_login(hf_api = api, token = \"API Token\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"896afdc7-85b9-4ad0-ac9e-2a68def84532"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Login successful\nYour token has been saved to /root/.huggingface/token\n\u001B[1m\u001B[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.\nYou might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default\n\ngit config --global credential.helper store\u001B[0m\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Login successful\nYour token has been saved to /root/.huggingface/token\n\u001B[1m\u001B[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.\nYou might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default\n\ngit config --global credential.helper store\u001B[0m\n"]}}],"execution_count":0},{"cell_type":"code","source":["#Verify Login\n!huggingface-cli whoami"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0a3fc9dd-7a03-41c4-a43a-6a2a3ee610bd"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"rajistics\r\n\u001B[1morgs: \u001B[0m huggingface,spaces-explorers,demo-org,HF-test-lab,qualitydatalab\r\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["rajistics\r\n\u001B[1morgs: \u001B[0m huggingface,spaces-explorers,demo-org,HF-test-lab,qualitydatalab\r\n"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Load data\nThe tutorial uses the IMDB dataset for move reviews. The complete [model card](https://huggingface.co/datasets/imdb) can be found at Hugging Face with details on the dataset. \n\nThe goal is to classify reviews as positive or negative. \n\nThe dataset is loaded using the Hugging Face datasets package."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b2f67ffe-ad1b-49a1-a7cf-603daa8c9890"}}},{"cell_type":"code","source":["# Load and preprocess data\ntrain_dataset, test_dataset = load_dataset(\"imdb\", split=[\"train\", \"test\"])"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"51fd8a9f-bef3-4fbd-90ce-8531f5f71205"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading builder script: 0%| | 0.00/1.79k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"cc725581886c4aa2a7ca07f8ba162cd8"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading builder script: 0%| | 0.00/1.79k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"cc725581886c4aa2a7ca07f8ba162cd8"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading metadata: 0%| | 0.00/1.05k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"8d14e03ef6ea4c28a9aa26c44fa5c645"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading metadata: 0%| | 0.00/1.05k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"8d14e03ef6ea4c28a9aa26c44fa5c645"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...\n"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading data: 0%| | 0.00/84.1M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"4d8c5f3bb4484599ac06a3d533526e0e"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading data: 0%| | 0.00/84.1M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"4d8c5f3bb4484599ac06a3d533526e0e"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Generating train split: 0%| | 0/25000 [00:00<?, ? examples/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"dc3e706fb4bb459f8f57a07948f588ac"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Generating train split: 0%| | 0/25000 [00:00<?, ? examples/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"dc3e706fb4bb459f8f57a07948f588ac"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Generating test split: 0%| | 0/25000 [00:00<?, ? examples/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"2b55658ba8e4440fa2a58dd9637ae836"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Generating test split: 0%| | 0/25000 [00:00<?, ? examples/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"2b55658ba8e4440fa2a58dd9637ae836"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Generating unsupervised split: 0%| | 0/50000 [00:00<?, ? examples/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"3db8eb436e764afc8178a165c079ecb8"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Generating unsupervised split: 0%| | 0/50000 [00:00<?, ? examples/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"3db8eb436e764afc8178a165c079ecb8"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.\n"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":" 0%| | 0/2 [00:00<?, ?it/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"df77373b9ede44d1a2e008cac8059abe"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":" 0%| | 0/2 [00:00<?, ?it/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"df77373b9ede44d1a2e008cac8059abe"}}}],"execution_count":0},{"cell_type":"markdown","source":["## Part 1. Train a classification model"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"8123df40-67fa-43d0-97c1-6f608c9f7d61"}}},{"cell_type":"markdown","source":["### MLflow Tracking\n[MLflow tracking](https://www.mlflow.org/docs/latest/tracking.html) allows you to organize your machine learning training code, parameters, and models. \n\nYou can enable automatic MLflow tracking by using [*autologging*](https://www.mlflow.org/docs/latest/tracking.html#automatic-logging)."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f3441df0-a7fe-4d6d-bf62-46184cf6ac2c"}}},{"cell_type":"code","source":["# Enable MLflow autologging for this notebook\nmlflow.autolog()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"18e0381e-7c02-4e57-b29b-127a7585992b"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"2022/07/20 20:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.\n2022/07/20 20:53:02 WARNING mlflow.utils.autologging_utils: Encountered unexpected error during spark autologging: Exception while attempting to initialize JVM-side state for Spark datasource autologging. Please create a new Spark session and ensure you have the mlflow-spark JAR attached to your Spark session as described in http://mlflow.org/docs/latest/tracking.html#automatic-logging-from-spark-experimental. Exception:\n'JavaPackage' object is not callable\n2022/07/20 20:53:02 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for pyspark: Exception while attempting to initialize JVM-side state for Spark datasource autologging. Please create a new Spark session and ensure you have the mlflow-spark JAR attached to your Spark session as described in http://mlflow.org/docs/latest/tracking.html#automatic-logging-from-spark-experimental. Exception:\n'JavaPackage' object is not callable\n2022/07/20 20:53:02 WARNING mlflow.utils.autologging_utils: Encountered unexpected error during spark autologging: Exception while attempting to initialize JVM-side state for Spark datasource autologging. Please create a new Spark session and ensure you have the mlflow-spark JAR attached to your Spark session as described in http://mlflow.org/docs/latest/tracking.html#automatic-logging-from-spark-experimental. Exception:\n'JavaPackage' object is not callable\n2022/07/20 20:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.ml.\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["2022/07/20 20:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.\n2022/07/20 20:53:02 WARNING mlflow.utils.autologging_utils: Encountered unexpected error during spark autologging: Exception while attempting to initialize JVM-side state for Spark datasource autologging. Please create a new Spark session and ensure you have the mlflow-spark JAR attached to your Spark session as described in http://mlflow.org/docs/latest/tracking.html#automatic-logging-from-spark-experimental. Exception:\n'JavaPackage' object is not callable\n2022/07/20 20:53:02 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for pyspark: Exception while attempting to initialize JVM-side state for Spark datasource autologging. Please create a new Spark session and ensure you have the mlflow-spark JAR attached to your Spark session as described in http://mlflow.org/docs/latest/tracking.html#automatic-logging-from-spark-experimental. Exception:\n'JavaPackage' object is not callable\n2022/07/20 20:53:02 WARNING mlflow.utils.autologging_utils: Encountered unexpected error during spark autologging: Exception while attempting to initialize JVM-side state for Spark datasource autologging. Please create a new Spark session and ensure you have the mlflow-spark JAR attached to your Spark session as described in http://mlflow.org/docs/latest/tracking.html#automatic-logging-from-spark-experimental. Exception:\n'JavaPackage' object is not callable\n2022/07/20 20:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.ml.\n"]}}],"execution_count":0},{"cell_type":"markdown","source":["Next, train a classifier within the context of an MLflow run, which automatically logs the trained model and many associated metrics and parameters. You can supplement the logging with additional metrics such as the model's AUC score on the test dataset.\nIf the model is private, another way to access the model is by using the `use_auth_token` parameter to specify the API key that has access to the model."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"88913160-61c4-4cbf-8f2d-235e86fa8768"}}},{"cell_type":"code","source":["model = AutoModelForSequenceClassification.from_pretrained(\"distilbert-base-cased\", num_labels=2)\ntokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-cased\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"70fefb47-9af8-49c8-932d-49a0727c1428"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading: 0%| | 0.00/411 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"56d2fe47cf6f401488fdb6dec517bcdf"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading: 0%| | 0.00/411 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"56d2fe47cf6f401488fdb6dec517bcdf"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading: 0%| | 0.00/251M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"641af36cd86a45ca9af71968fd2b2c0b"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading: 0%| | 0.00/251M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"641af36cd86a45ca9af71968fd2b2c0b"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']\n- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\nSome weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']\nYou should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']\n- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\nSome weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']\nYou should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading: 0%| | 0.00/29.0 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"6e5c4f05f43b4f1382965f74c4c60605"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading: 0%| | 0.00/29.0 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"6e5c4f05f43b4f1382965f74c4c60605"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading: 0%| | 0.00/208k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"1dab3342aa524a4da54e99c1af9bdd64"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading: 0%| | 0.00/208k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"1dab3342aa524a4da54e99c1af9bdd64"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading: 0%| | 0.00/426k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"a15dd86cbdad456e8ba123e52930a1c2"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading: 0%| | 0.00/426k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"a15dd86cbdad456e8ba123e52930a1c2"}}}],"execution_count":0},{"cell_type":"code","source":["def tokenize_function(examples):\n return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n\n\ntrain_dataset = train_dataset.map(tokenize_function, batched=True)\ntest_dataset = test_dataset.map(tokenize_function, batched=True)\n\nmetric = load_metric(\"accuracy\")\n\ndef compute_metrics(eval_pred):\n logits, labels = eval_pred\n predictions = np.argmax(logits, axis=-1)\n return metric.compute(predictions=predictions, references=labels)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"85fa6e22-56ab-44da-89c0-e77f883c5fcd"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Parameter 'function'=<function tokenize_function at 0x7f7040e62dc0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Parameter 'function'=<function tokenize_function at 0x7f7040e62dc0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":" 0%| | 0/25 [00:00<?, ?ba/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"e00d087294f9404b965ab1baaf5bef64"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":" 0%| | 0/25 [00:00<?, ?ba/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"e00d087294f9404b965ab1baaf5bef64"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":" 0%| | 0/25 [00:00<?, ?ba/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"5a4e5901b23d46a48b87318412aee7c0"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":" 0%| | 0/25 [00:00<?, ?ba/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"5a4e5901b23d46a48b87318412aee7c0"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading builder script: 0%| | 0.00/1.65k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"8538e58e45db4b09927cdd5d9bfdd5a2"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading builder script: 0%| | 0.00/1.65k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"8538e58e45db4b09927cdd5d9bfdd5a2"}}}],"execution_count":0},{"cell_type":"code","source":["training_args = TrainingArguments(\n hub_model_id=\"rajistics/distilbert-imdb-mlflow2\",\n num_train_epochs=1,\n output_dir=\"./output\",\n logging_steps=500,\n save_strategy=\"epoch\",\n push_to_hub=True,\n)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f1b7f9f9-fd27-497d-81f4-663c43ca5343"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["trainer = Trainer(\n model=model,\n args=training_args,\n train_dataset=train_dataset,\n eval_dataset=test_dataset,\n compute_metrics=compute_metrics,\n)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0348ee5b-bd38-440d-9b7f-1fa9150a14b1"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Cloning https://huggingface.co/rajistics/distilbert-imdb-mlflow into local empty directory.\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Cloning https://huggingface.co/rajistics/distilbert-imdb-mlflow into local empty directory.\n"]}}],"execution_count":0},{"cell_type":"code","source":["trainer.train()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2e107f16-59d1-4cda-ba24-55f271a85755"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n/local_disk0/.ephemeral_nfs/envs/pythonEnv-bab12bc9-22d1-4101-97b5-6ae403f8662e/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n warnings.warn(\n***** Running training *****\n Num examples = 25000\n Num Epochs = 1\n Instantaneous batch size per device = 8\n Total train batch size (w. parallel, distributed & accumulation) = 8\n Gradient Accumulation steps = 1\n Total optimization steps = 3125\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n/local_disk0/.ephemeral_nfs/envs/pythonEnv-bab12bc9-22d1-4101-97b5-6ae403f8662e/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n warnings.warn(\n***** Running training *****\n Num examples = 25000\n Num Epochs = 1\n Instantaneous batch size per device = 8\n Total train batch size (w. parallel, distributed & accumulation) = 8\n Gradient Accumulation steps = 1\n Total optimization steps = 3125\n"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":" 0%| | 0/3125 [00:00<?, ?it/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"d5b4c21aa57e4b748c35217eef9c6759"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":" 0%| | 0/3125 [00:00<?, ?it/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"d5b4c21aa57e4b748c35217eef9c6759"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"Cancelled","metadata":{},"errorTraceType":"html","type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>"]}}],"execution_count":0},{"cell_type":"code","source":["mlflow.end_run()\ntrainer.push_to_hub()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"12d375d9-a641-4b64-a6d1-38a01c095070"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Saving model checkpoint to ./output\nConfiguration saved in ./output/config.json\nModel weights saved in ./output/pytorch_model.bin\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Saving model checkpoint to ./output\nConfiguration saved in ./output/config.json\nModel weights saved in ./output/pytorch_model.bin\n"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Upload file pytorch_model.bin: 0%| | 32.0k/251M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"150e2a403e734dd997ae41aec0cb9785"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Upload file pytorch_model.bin: 0%| | 32.0k/251M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"150e2a403e734dd997ae41aec0cb9785"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Upload file training_args.bin: 100%|##########| 3.23k/3.23k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"7391e864663c49639744a786b51fa208"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Upload file training_args.bin: 100%|##########| 3.23k/3.23k [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"7391e864663c49639744a786b51fa208"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"To https://huggingface.co/rajistics/distilbert-imdb-mlflow\n 11f9d35..565ce9d main -> main\n\nDropping the following result as it does not have all the necessary fields:\n{'task': {'name': 'Text Classification', 'type': 'text-classification'}, 'dataset': {'name': 'imdb', 'type': 'imdb', 'args': 'plain_text'}}\nTo https://huggingface.co/rajistics/distilbert-imdb-mlflow\n 565ce9d..2e139c6 main -> main\n\nOut[12]: 'https://huggingface.co/rajistics/distilbert-imdb-mlflow/commit/565ce9de2a3bf303432d5ca277711f8237b8097c'","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["To https://huggingface.co/rajistics/distilbert-imdb-mlflow\n 11f9d35..565ce9d main -> main\n\nDropping the following result as it does not have all the necessary fields:\n{'task': {'name': 'Text Classification', 'type': 'text-classification'}, 'dataset': {'name': 'imdb', 'type': 'imdb', 'args': 'plain_text'}}\nTo https://huggingface.co/rajistics/distilbert-imdb-mlflow\n 565ce9d..2e139c6 main -> main\n\nOut[12]: 'https://huggingface.co/rajistics/distilbert-imdb-mlflow/commit/565ce9de2a3bf303432d5ca277711f8237b8097c'"]}}],"execution_count":0},{"cell_type":"code","source":["from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline\n# Model Card: https://huggingface.co/lvwerra/distilbert-imdb\ntokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-cased\")\nmodel = AutoModelForSequenceClassification.from_pretrained(\"rajistics/distilbert-imdb-mlflow\")\nmoviereview = pipeline(\"text-classification\", model = model, tokenizer = tokenizer)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5bcdb820-3b8f-43f0-8c86-ff34dc5a8bc7"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a\nModel config DistilBertConfig {\n \"_name_or_path\": \"distilbert-base-cased\",\n \"activation\": \"gelu\",\n \"attention_dropout\": 0.1,\n \"dim\": 768,\n \"dropout\": 0.1,\n \"hidden_dim\": 3072,\n \"initializer_range\": 0.02,\n \"max_position_embeddings\": 512,\n \"model_type\": \"distilbert\",\n \"n_heads\": 12,\n \"n_layers\": 6,\n \"output_past\": true,\n \"pad_token_id\": 0,\n \"qa_dropout\": 0.1,\n \"seq_classif_dropout\": 0.2,\n \"sinusoidal_pos_embds\": false,\n \"tie_weights_\": true,\n \"transformers_version\": \"4.20.1\",\n \"vocab_size\": 28996\n}\n\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/ba377304984dc63e3ede0e23a938bbbf04d5c3835b66d5bb48343aecca188429.437aa611e89f6fc6675a049d2b5545390adbc617e7d655286421c191d2be2791\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/acb5c2138c1f8c84f074b86dafce3631667fccd6efcb1a7ea1320cf75c386a36.3dab63143af66769bbb35e3811f75f7e16b2320e12b7935e216bd6159ce6d9a6\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/added_tokens.json from cache at None\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/special_tokens_map.json from cache at None\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/81e970e5e6ec68be12da0f8f3b2f2469c78d579282299a2ea65b4b7441719107.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f\nloading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a\nModel config DistilBertConfig {\n \"_name_or_path\": \"distilbert-base-cased\",\n \"activation\": \"gelu\",\n \"attention_dropout\": 0.1,\n \"dim\": 768,\n \"dropout\": 0.1,\n \"hidden_dim\": 3072,\n \"initializer_range\": 0.02,\n \"max_position_embeddings\": 512,\n \"model_type\": \"distilbert\",\n \"n_heads\": 12,\n \"n_layers\": 6,\n \"output_past\": true,\n \"pad_token_id\": 0,\n \"qa_dropout\": 0.1,\n \"seq_classif_dropout\": 0.2,\n \"sinusoidal_pos_embds\": false,\n \"tie_weights_\": true,\n \"transformers_version\": \"4.20.1\",\n \"vocab_size\": 28996\n}\n\nloading configuration file https://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ef11e290776d32ed8373fb83bdc594abae0602cc3d4a6530f3ed9533e98aac64.a8ccf646a0873f3d60805d2779a0b1caf3ca469ffd72e36224dee22566738d73\nModel config DistilBertConfig {\n \"_name_or_path\": \"rajistics/distilbert-imdb-mlflow\",\n \"activation\": \"gelu\",\n \"architectures\": [\n \"DistilBertForSequenceClassification\"\n ],\n \"attention_dropout\": 0.1,\n \"dim\": 768,\n \"dropout\": 0.1,\n \"hidden_dim\": 3072,\n \"initializer_range\": 0.02,\n \"max_position_embeddings\": 512,\n \"model_type\": \"distilbert\",\n \"n_heads\": 12,\n \"n_layers\": 6,\n \"output_past\": true,\n \"pad_token_id\": 0,\n \"problem_type\": \"single_label_classification\",\n \"qa_dropout\": 0.1,\n \"seq_classif_dropout\": 0.2,\n \"sinusoidal_pos_embds\": false,\n \"tie_weights_\": true,\n \"torch_dtype\": \"float32\",\n \"transformers_version\": \"4.20.1\",\n \"vocab_size\": 28996\n}\n\nhttps://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpmdxxho8z\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a\nModel config DistilBertConfig {\n \"_name_or_path\": \"distilbert-base-cased\",\n \"activation\": \"gelu\",\n \"attention_dropout\": 0.1,\n \"dim\": 768,\n \"dropout\": 0.1,\n \"hidden_dim\": 3072,\n \"initializer_range\": 0.02,\n \"max_position_embeddings\": 512,\n \"model_type\": \"distilbert\",\n \"n_heads\": 12,\n \"n_layers\": 6,\n \"output_past\": true,\n \"pad_token_id\": 0,\n \"qa_dropout\": 0.1,\n \"seq_classif_dropout\": 0.2,\n \"sinusoidal_pos_embds\": false,\n \"tie_weights_\": true,\n \"transformers_version\": \"4.20.1\",\n \"vocab_size\": 28996\n}\n\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/ba377304984dc63e3ede0e23a938bbbf04d5c3835b66d5bb48343aecca188429.437aa611e89f6fc6675a049d2b5545390adbc617e7d655286421c191d2be2791\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/acb5c2138c1f8c84f074b86dafce3631667fccd6efcb1a7ea1320cf75c386a36.3dab63143af66769bbb35e3811f75f7e16b2320e12b7935e216bd6159ce6d9a6\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/added_tokens.json from cache at None\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/special_tokens_map.json from cache at None\nloading file https://huggingface.co/distilbert-base-cased/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/81e970e5e6ec68be12da0f8f3b2f2469c78d579282299a2ea65b4b7441719107.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f\nloading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a\nModel config DistilBertConfig {\n \"_name_or_path\": \"distilbert-base-cased\",\n \"activation\": \"gelu\",\n \"attention_dropout\": 0.1,\n \"dim\": 768,\n \"dropout\": 0.1,\n \"hidden_dim\": 3072,\n \"initializer_range\": 0.02,\n \"max_position_embeddings\": 512,\n \"model_type\": \"distilbert\",\n \"n_heads\": 12,\n \"n_layers\": 6,\n \"output_past\": true,\n \"pad_token_id\": 0,\n \"qa_dropout\": 0.1,\n \"seq_classif_dropout\": 0.2,\n \"sinusoidal_pos_embds\": false,\n \"tie_weights_\": true,\n \"transformers_version\": \"4.20.1\",\n \"vocab_size\": 28996\n}\n\nloading configuration file https://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ef11e290776d32ed8373fb83bdc594abae0602cc3d4a6530f3ed9533e98aac64.a8ccf646a0873f3d60805d2779a0b1caf3ca469ffd72e36224dee22566738d73\nModel config DistilBertConfig {\n \"_name_or_path\": \"rajistics/distilbert-imdb-mlflow\",\n \"activation\": \"gelu\",\n \"architectures\": [\n \"DistilBertForSequenceClassification\"\n ],\n \"attention_dropout\": 0.1,\n \"dim\": 768,\n \"dropout\": 0.1,\n \"hidden_dim\": 3072,\n \"initializer_range\": 0.02,\n \"max_position_embeddings\": 512,\n \"model_type\": \"distilbert\",\n \"n_heads\": 12,\n \"n_layers\": 6,\n \"output_past\": true,\n \"pad_token_id\": 0,\n \"problem_type\": \"single_label_classification\",\n \"qa_dropout\": 0.1,\n \"seq_classif_dropout\": 0.2,\n \"sinusoidal_pos_embds\": false,\n \"tie_weights_\": true,\n \"torch_dtype\": \"float32\",\n \"transformers_version\": \"4.20.1\",\n \"vocab_size\": 28996\n}\n\nhttps://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpmdxxho8z\n"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":{"text/plain":"Downloading: 0%| | 0.00/251M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"6a353b318789489f83aa7940a506dce6"}},"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"mimeBundle","arguments":{}}},"output_type":"display_data","data":{"text/plain":"Downloading: 0%| | 0.00/251M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"6a353b318789489f83aa7940a506dce6"}}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"storing https://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a\ncreating metadata file for /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a\nloading weights file https://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a\nAll model checkpoint weights were used when initializing DistilBertForSequenceClassification.\n\nAll the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at rajistics/distilbert-imdb-mlflow.\nIf your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["storing https://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a\ncreating metadata file for /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a\nloading weights file https://huggingface.co/rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a\nAll model checkpoint weights were used when initializing DistilBertForSequenceClassification.\n\nAll the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at rajistics/distilbert-imdb-mlflow.\nIf your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.\n"]}}],"execution_count":0},{"cell_type":"code","source":["moviereview(\"This move was a bit crazy, but I liked it\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"a7455162-9175-4bfd-bd9c-fcf53b1aa8c1"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Out[25]: [{'label': 'LABEL_0', 'score': 0.5706899762153625}]","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Out[25]: [{'label': 'LABEL_0', 'score': 0.5706899762153625}]"]}}],"execution_count":0},{"cell_type":"markdown","source":["### View MLflow runs\nTo view the logged training runs, click the **Experiment** icon at the upper right of the notebook to display the experiment sidebar. If necessary, click the refresh icon to fetch and monitor the latest runs. \n\n<img width=\"350\" src=\"https://docs.databricks.com/_static/images/mlflow/quickstart/experiment-sidebar-icons.png\"/>\n\nYou can then click the experiment page icon to display the more detailed MLflow experiment page ([AWS](https://docs.databricks.com/applications/mlflow/tracking.html#notebook-experiments)|[Azure](https://docs.microsoft.com/azure/databricks/applications/mlflow/tracking#notebook-experiments)|[GCP](https://docs.gcp.databricks.com/applications/mlflow/tracking.html#notebook-experiments)). This page allows you to compare runs and view details for specific runs.\n\n<img width=\"800\" src=\"https://docs.databricks.com/_static/images/mlflow/quickstart/compare-runs.png\"/>"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"70e02a64-6878-4b9b-9297-5390c9e19ddc"}}},{"cell_type":"code","source":["runs = mlflow.search_runs(\"3759898664210413\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"00ff072f-fccd-41ae-ba16-60340e4b6379"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["import pandas\nruns.to_csv(\"output/mlflow_runs.csv\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"be4baec5-089b-4537-a53d-dfe183f2c68b"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"fatal: not a git repository (or any of the parent directories): .git\r\nfatal: not a git repository (or any of the parent directories): .git\r\nfatal: not a git repository (or any of the parent directories): .git\r\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["fatal: not a git repository (or any of the parent directories): .git\r\nfatal: not a git repository (or any of the parent directories): .git\r\nfatal: not a git repository (or any of the parent directories): .git\r\n"]}}],"execution_count":0},{"cell_type":"code","source":["%sh\ncd output\ngit add mlflow_runs.csv\ngit commit -m \"Add MLFlow results\"\ngit push"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"410156e8-69a3-4adc-84f2-72df68582c81"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"[main 7d3e3d7] Add MLFlow results\n 1 file changed, 4 insertions(+)\n create mode 100644 mlflow_runs.csv\nTo https://huggingface.co/rajistics/distilbert-imdb-mlflow\n 2e139c6..7d3e3d7 main -> main\n","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["[main 7d3e3d7] Add MLFlow results\n 1 file changed, 4 insertions(+)\n create mode 100644 mlflow_runs.csv\nTo https://huggingface.co/rajistics/distilbert-imdb-mlflow\n 2e139c6..7d3e3d7 main -> main\n"]}}],"execution_count":0}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"ML Quickstart: Model Training with HuggingFace","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2},"language":"python","widgets":{},"notebookOrigID":3759898664210413}},"nbformat":4,"nbformat_minor":0}
|