Aston-xMAD committed
Commit 9382e3f
1 Parent(s): 825cfb0

init commit

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
Files changed (50)
  1. CITATION.cff +82 -0
  2. CODE_OF_CONDUCT.md +133 -0
  3. CONTRIBUTING.md +394 -0
  4. ISSUES.md +277 -0
  5. LICENSE +203 -0
  6. Makefile +124 -0
  7. README.md +6 -6
  8. README_test_result.md +58 -0
  9. SECURITY.md +40 -0
  10. __pycache__/app_local.cpython-310.pyc +0 -0
  11. app.py +419 -0
  12. backups/app_backup.py +63 -0
  13. backups/app_local_enabled_streaming_but_inefficient.py +205 -0
  14. backups/app_local_v0.py +187 -0
  15. backups/app_local_v1-1.py +228 -0
  16. backups/app_local_v1.py +375 -0
  17. backups/app_local_v2.py +191 -0
  18. backups/app_local_v3.py +211 -0
  19. backups/app_local_v4-1.py +234 -0
  20. backups/app_local_with_graph.py +235 -0
  21. backups/app_major_backup.py +235 -0
  22. backups/app_pic.py +40 -0
  23. backups/app_unquantized_backup.py +146 -0
  24. backups/app_v0.py +188 -0
  25. backups/app_v1.py +207 -0
  26. backups/app_v2.py +215 -0
  27. chats.json +1850 -0
  28. chats_sys_none.json +1390 -0
  29. conftest.py +142 -0
  30. docker/transformers-all-latest-gpu/Dockerfile +63 -0
  31. docker/transformers-doc-builder/Dockerfile +18 -0
  32. docker/transformers-gpu/Dockerfile +31 -0
  33. docker/transformers-past-gpu/Dockerfile +59 -0
  34. docker/transformers-pytorch-amd-gpu/Dockerfile +39 -0
  35. docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile +48 -0
  36. docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile +53 -0
  37. docker/transformers-pytorch-deepspeed-nightly-gpu/Dockerfile +64 -0
  38. docker/transformers-pytorch-gpu/Dockerfile +33 -0
  39. docker/transformers-pytorch-tpu/Dockerfile +65 -0
  40. docker/transformers-pytorch-tpu/bert-base-cased.jsonnet +38 -0
  41. docker/transformers-pytorch-tpu/dataset.yaml +32 -0
  42. docker/transformers-pytorch-tpu/docker-entrypoint.sh +8 -0
  43. docker/transformers-quantization-latest-gpu/Dockerfile +60 -0
  44. docker/transformers-tensorflow-gpu/Dockerfile +25 -0
  45. docs/README.md +397 -0
  46. docs/TRANSLATING.md +57 -0
  47. docs/source/_config.py +14 -0
  48. docs/source/de/_config.py +14 -0
  49. docs/source/de/_toctree.yml +42 -0
  50. docs/source/de/accelerate.md +136 -0
CITATION.cff ADDED
@@ -0,0 +1,82 @@
+ cff-version: "1.2.0"
+ date-released: 2020-10
+ message: "If you use this software, please cite it using these metadata."
+ title: "Transformers: State-of-the-Art Natural Language Processing"
+ url: "https://github.com/huggingface/transformers"
+ authors:
+ - family-names: Wolf
+ given-names: Thomas
+ - family-names: Debut
+ given-names: Lysandre
+ - family-names: Sanh
+ given-names: Victor
+ - family-names: Chaumond
+ given-names: Julien
+ - family-names: Delangue
+ given-names: Clement
+ - family-names: Moi
+ given-names: Anthony
+ - family-names: Cistac
+ given-names: Perric
+ - family-names: Ma
+ given-names: Clara
+ - family-names: Jernite
+ given-names: Yacine
+ - family-names: Plu
+ given-names: Julien
+ - family-names: Xu
+ given-names: Canwen
+ - family-names: "Le Scao"
+ given-names: Teven
+ - family-names: Gugger
+ given-names: Sylvain
+ - family-names: Drame
+ given-names: Mariama
+ - family-names: Lhoest
+ given-names: Quentin
+ - family-names: Rush
+ given-names: "Alexander M."
+ preferred-citation:
+ type: conference-paper
+ authors:
+ - family-names: Wolf
+ given-names: Thomas
+ - family-names: Debut
+ given-names: Lysandre
+ - family-names: Sanh
+ given-names: Victor
+ - family-names: Chaumond
+ given-names: Julien
+ - family-names: Delangue
+ given-names: Clement
+ - family-names: Moi
+ given-names: Anthony
+ - family-names: Cistac
+ given-names: Perric
+ - family-names: Ma
+ given-names: Clara
+ - family-names: Jernite
+ given-names: Yacine
+ - family-names: Plu
+ given-names: Julien
+ - family-names: Xu
+ given-names: Canwen
+ - family-names: "Le Scao"
+ given-names: Teven
+ - family-names: Gugger
+ given-names: Sylvain
+ - family-names: Drame
+ given-names: Mariama
+ - family-names: Lhoest
+ given-names: Quentin
+ - family-names: Rush
+ given-names: "Alexander M."
+ booktitle: "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations"
+ month: 10
+ start: 38
+ end: 45
+ title: "Transformers: State-of-the-Art Natural Language Processing"
+ year: 2020
+ publisher: "Association for Computational Linguistics"
+ url: "https://www.aclweb.org/anthology/2020.emnlp-demos.6"
+ address: "Online"
CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,133 @@
+
+ # Contributor Covenant Code of Conduct
+
+ ## Our Pledge
+
+ We as members, contributors, and leaders pledge to make participation in our
+ community a harassment-free experience for everyone, regardless of age, body
+ size, visible or invisible disability, ethnicity, sex characteristics, gender
+ identity and expression, level of experience, education, socio-economic status,
+ nationality, personal appearance, race, caste, color, religion, or sexual
+ identity and orientation.
+
+ We pledge to act and interact in ways that contribute to an open, welcoming,
+ diverse, inclusive, and healthy community.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to a positive environment for our
+ community include:
+
+ * Demonstrating empathy and kindness toward other people
+ * Being respectful of differing opinions, viewpoints, and experiences
+ * Giving and gracefully accepting constructive feedback
+ * Accepting responsibility and apologizing to those affected by our mistakes,
+ and learning from the experience
+ * Focusing on what is best not just for us as individuals, but for the overall
+ community
+
+ Examples of unacceptable behavior include:
+
+ * The use of sexualized language or imagery, and sexual attention or advances of
+ any kind
+ * Trolling, insulting or derogatory comments, and personal or political attacks
+ * Public or private harassment
+ * Publishing others' private information, such as a physical or email address,
+ without their explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+ ## Enforcement Responsibilities
+
+ Community leaders are responsible for clarifying and enforcing our standards of
+ acceptable behavior and will take appropriate and fair corrective action in
+ response to any behavior that they deem inappropriate, threatening, offensive,
+ or harmful.
+
+ Community leaders have the right and responsibility to remove, edit, or reject
+ comments, commits, code, wiki edits, issues, and other contributions that are
+ not aligned to this Code of Conduct, and will communicate reasons for moderation
+ decisions when appropriate.
+
+ ## Scope
+
+ This Code of Conduct applies within all community spaces, and also applies when
+ an individual is officially representing the community in public spaces.
+ Examples of representing our community include using an official e-mail address,
+ posting via an official social media account, or acting as an appointed
+ representative at an online or offline event.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported to the community leaders responsible for enforcement at
+ All complaints will be reviewed and investigated promptly and fairly.
+
+ All community leaders are obligated to respect the privacy and security of the
+ reporter of any incident.
+
+ ## Enforcement Guidelines
+
+ Community leaders will follow these Community Impact Guidelines in determining
+ the consequences for any action they deem in violation of this Code of Conduct:
+
+ ### 1. Correction
+
+ **Community Impact**: Use of inappropriate language or other behavior deemed
+ unprofessional or unwelcome in the community.
+
+ **Consequence**: A private, written warning from community leaders, providing
+ clarity around the nature of the violation and an explanation of why the
+ behavior was inappropriate. A public apology may be requested.
+
+ ### 2. Warning
+
+ **Community Impact**: A violation through a single incident or series of
+ actions.
+
+ **Consequence**: A warning with consequences for continued behavior. No
+ interaction with the people involved, including unsolicited interaction with
+ those enforcing the Code of Conduct, for a specified period of time. This
+ includes avoiding interactions in community spaces as well as external channels
+ like social media. Violating these terms may lead to a temporary or permanent
+ ban.
+
+ ### 3. Temporary Ban
+
+ **Community Impact**: A serious violation of community standards, including
+ sustained inappropriate behavior.
+
+ **Consequence**: A temporary ban from any sort of interaction or public
+ communication with the community for a specified period of time. No public or
+ private interaction with the people involved, including unsolicited interaction
+ with those enforcing the Code of Conduct, is allowed during this period.
+ Violating these terms may lead to a permanent ban.
+
+ ### 4. Permanent Ban
+
+ **Community Impact**: Demonstrating a pattern of violation of community
+ standards, including sustained inappropriate behavior, harassment of an
+ individual, or aggression toward or disparagement of classes of individuals.
+
+ **Consequence**: A permanent ban from any sort of public interaction within the
+ community.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+ version 2.1, available at
+ [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
+
+ Community Impact Guidelines were inspired by
+ [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
+
+ For answers to common questions about this code of conduct, see the FAQ at
+ [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
+ [https://www.contributor-covenant.org/translations][translations].
+
+ [homepage]: https://www.contributor-covenant.org
+ [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
+ [Mozilla CoC]: https://github.com/mozilla/diversity
+ [FAQ]: https://www.contributor-covenant.org/faq
+ [translations]: https://www.contributor-covenant.org/translations
CONTRIBUTING.md ADDED
@@ -0,0 +1,394 @@
+ <!---
+ Copyright 2020 The HuggingFace Team. All rights reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+ # Contribute to 🤗 Transformers
+
+ Everyone is welcome to contribute, and we value everybody's contribution. Code
+ contributions are not the only way to help the community. Answering questions, helping
+ others, and improving the documentation are also immensely valuable.
+
+ It also helps us if you spread the word! Reference the library in blog posts
+ about the awesome projects it made possible, shout out on Twitter every time it has
+ helped you, or simply ⭐️ the repository to say thank you.
+
+ However you choose to contribute, please be mindful and respect our
+ [code of conduct](https://github.com/huggingface/transformers/blob/main/CODE_OF_CONDUCT.md).
+
+ **This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**
+
+ ## Ways to contribute
+
+ There are several ways you can contribute to 🤗 Transformers:
+
+ * Fix outstanding issues with the existing code.
+ * Submit issues related to bugs or desired new features.
+ * Implement new models.
+ * Contribute to the examples or to the documentation.
+
+ If you don't know where to start, there is a special [Good First
+ Issue](https://github.com/huggingface/transformers/contribute) listing. It will give you a list of
+ open issues that are beginner-friendly and help you start contributing to open-source. The best way to do that is to open a Pull Request and link it to the issue that you'd like to work on. We try to give priority to opened PRs as we can easily track the progress of the fix, and if the contributor does not have time anymore, someone else can take the PR over.
+
+ For something slightly more challenging, you can also take a look at the [Good Second Issue](https://github.com/huggingface/transformers/labels/Good%20Second%20Issue) list. In general though, if you feel like you know what you're doing, go for it and we'll help you get there! 🚀
+
+ > All contributions are equally valuable to the community. 🥰
+
+ ## Fixing outstanding issues
+
+ If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](#create-a-pull-request) and open a Pull Request!
+
+ ## Submitting a bug-related issue or feature request
+
+ Do your best to follow these guidelines when submitting a bug-related issue or a feature
+ request. It will make it easier for us to come back to you quickly and with good
+ feedback.
+
+ ### Did you find a bug?
+
+ The 🤗 Transformers library is robust and reliable thanks to users who report the problems they encounter.
+
+ Before you report an issue, we would really appreciate it if you could **make sure the bug was not
+ already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the library itself, and not your code. If you're unsure whether the bug is in your code or the library, please ask in the [forum](https://discuss.huggingface.co/) first. This helps us respond quicker to fixing issues related to the library versus general questions.
+
+ Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so we can quickly resolve it:
+
+ * Your **OS type and version** and **Python**, **PyTorch** and
+ **TensorFlow** versions when applicable.
+ * A short, self-contained, code snippet that allows us to reproduce the bug in
+ less than 30s.
+ * The *full* traceback if an exception is raised.
+ * Attach any other additional information, like screenshots, you think may help.
+
+ To get the OS and software versions automatically, run the following command:
+
+ ```bash
+ transformers-cli env
+ ```
+
+ You can also run the same command from the root of the repository:
+
+ ```bash
+ python src/transformers/commands/transformers_cli.py env
+ ```
+
+ ### Do you want a new feature?
+
+ If there is a new feature you'd like to see in 🤗 Transformers, please open an issue and describe:
+
+ 1. What is the *motivation* behind this feature? Is it related to a problem or frustration with the library? Is it a feature related to something you need for a project? Is it something you worked on and think it could benefit the community?
+
+ Whatever it is, we'd love to hear about it!
+
+ 2. Describe your requested feature in as much detail as possible. The more you can tell us about it, the better we'll be able to help you.
+ 3. Provide a *code snippet* that demonstrates the feature's usage.
+ 4. If the feature is related to a paper, please include a link.
+
+ If your issue is well written we're already 80% of the way there by the time you create it.
+
+ We have added [templates](https://github.com/huggingface/transformers/tree/main/templates) to help you get started with your issue.
+
+ ## Do you want to implement a new model?
+
+ New models are constantly released and if you want to implement a new model, please provide the following information:
+
+ * A short description of the model and a link to the paper.
+ * Link to the implementation if it is open-sourced.
+ * Link to the model weights if they are available.
+
+ If you are willing to contribute the model yourself, let us know so we can help you add it to 🤗 Transformers!
+
+ We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/add_new_model).
+
+ ## Do you want to add documentation?
+
+ We're always looking for improvements to the documentation that make it more clear and accurate. Please let us know how the documentation can be improved such as typos and any content that is missing, unclear or inaccurate. We'll be happy to make the changes or help you make a contribution if you're interested!
+
+ For more details about how to generate, build, and write the documentation, take a look at the documentation [README](https://github.com/huggingface/transformers/tree/main/docs).
+
+ ## Create a Pull Request
+
+ Before writing any code, we strongly advise you to search through the existing PRs or
+ issues to make sure nobody is already working on the same thing. If you are
+ unsure, it is always a good idea to open an issue to get some feedback.
+
+ You will need basic `git` proficiency to contribute to
+ 🤗 Transformers. While `git` is not the easiest tool to use, it has the greatest
+ manual. Type `git --help` in a shell and enjoy! If you prefer books, [Pro
+ Git](https://git-scm.com/book/en/v2) is a very good reference.
+
+ You'll need **[Python 3.8](https://github.com/huggingface/transformers/blob/main/setup.py#L426)** or above to contribute to 🤗 Transformers. Follow the steps below to start contributing:
+
+ 1. Fork the [repository](https://github.com/huggingface/transformers) by
+ clicking on the **[Fork](https://github.com/huggingface/transformers/fork)** button on the repository's page. This creates a copy of the code
+ under your GitHub user account.
+
+ 2. Clone your fork to your local disk, and add the base repository as a remote:
+
+ ```bash
+ git clone git@github.com:<your Github handle>/transformers.git
+ cd transformers
+ git remote add upstream https://github.com/huggingface/transformers.git
+ ```
+
+ 3. Create a new branch to hold your development changes:
+
+ ```bash
+ git checkout -b a-descriptive-name-for-my-changes
+ ```
+
+ 🚨 **Do not** work on the `main` branch!
+
+ 4. Set up a development environment by running the following command in a virtual environment:
+
+ ```bash
+ pip install -e ".[dev]"
+ ```
+
+ If 🤗 Transformers was already installed in the virtual environment, remove
+ it with `pip uninstall transformers` before reinstalling it in editable
+ mode with the `-e` flag.
+
+ Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a
+ failure with this command. If that's the case make sure to install the Deep Learning framework you are working with
+ (PyTorch, TensorFlow and/or Flax) then do:
+
+ ```bash
+ pip install -e ".[quality]"
+ ```
+
+ which should be enough for most use cases.
+
+ 5. Develop the features in your branch.
+
+ As you work on your code, you should make sure the test suite
+ passes. Run the tests impacted by your changes like this:
+
+ ```bash
+ pytest tests/<TEST_TO_RUN>.py
+ ```
+
+ For more information about tests, check out the
+ [Testing](https://huggingface.co/docs/transformers/testing) guide.
+
+ 🤗 Transformers relies on `black` and `ruff` to format its source code
+ consistently. After you make changes, apply automatic style corrections and code verifications
+ that can't be automated in one go with:
+
+ ```bash
+ make fixup
+ ```
+
+ This target is also optimized to only work with files modified by the PR you're working on.
+
+ If you prefer to run the checks one after the other, the following command applies the
+ style corrections:
+
+ ```bash
+ make style
+ ```
+
+ 🤗 Transformers also uses `ruff` and a few custom scripts to check for coding mistakes. Quality
+ controls are run by the CI, but you can run the same checks with:
+
+ ```bash
+ make quality
+ ```
+
+ Finally, we have a lot of scripts to make sure we don't forget to update
+ some files when adding a new model. You can run these scripts with:
+
+ ```bash
+ make repo-consistency
+ ```
+
+ To learn more about those checks and how to fix any issues with them, check out the
+ [Checks on a Pull Request](https://huggingface.co/docs/transformers/pr_checks) guide.
+
+ If you're modifying documents under the `docs/source` directory, make sure the documentation can still be built. This check will also run in the CI when you open a pull request. To run a local check
+ make sure you install the documentation builder:
+
+ ```bash
+ pip install ".[docs]"
+ ```
+
+ Run the following command from the root of the repository:
+
+ ```bash
+ doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build
+ ```
+
+ This will build the documentation in the `~/tmp/test-build` folder where you can inspect the generated
+ Markdown files with your favorite editor. You can also preview the docs on GitHub when you open a pull request.
+
+ Once you're happy with your changes, add the changed files with `git add` and
+ record your changes locally with `git commit`:
+
+ ```bash
+ git add modified_file.py
+ git commit
+ ```
+
+ Please remember to write [good commit
+ messages](https://chris.beams.io/posts/git-commit/) to clearly communicate the changes you made!
+
+ To keep your copy of the code up to date with the original
+ repository, rebase your branch on `upstream/branch` *before* you open a pull request or if requested by a maintainer:
+
+ ```bash
+ git fetch upstream
+ git rebase upstream/main
+ ```
+
+ Push your changes to your branch:
+
+ ```bash
+ git push -u origin a-descriptive-name-for-my-changes
+ ```
+
+ If you've already opened a pull request, you'll need to force push with the `--force` flag, as shown in the sketch below. Otherwise, if the pull request hasn't been opened yet, you can just push your changes normally.
+
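For example, a force push after a rebase might look like the following; the branch name is simply the illustrative one used in the steps above, and you should only ever force push to your own pull request branch:

```bash
# Overwrite the remote copy of your PR branch after a local rebase
git push --force origin a-descriptive-name-for-my-changes
```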
+ 6. Now you can go to your fork of the repository on GitHub and click on **Pull Request** to open a pull request. Make sure you tick off all the boxes on our [checklist](#pull-request-checklist) below. When you're ready, you can send your changes to the project maintainers for review.
+
+ 7. It's ok if maintainers request changes, it happens to our core contributors
+ too! So everyone can see the changes in the pull request, work in your local
+ branch and push the changes to your fork. They will automatically appear in
+ the pull request.
+
+ ### Pull request checklist
+
+ ☐ The pull request title should summarize your contribution.<br>
+ ☐ If your pull request addresses an issue, please mention the issue number in the pull
+ request description to make sure they are linked (and people viewing the issue know you
+ are working on it).<br>
+ ☐ To indicate a work in progress please prefix the title with `[WIP]`. These are
+ useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.<br>
+ ☐ Make sure existing tests pass.<br>
+ ☐ If adding a new feature, also add tests for it.<br>
+ - If you are adding a new model, make sure you use
+ `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` to trigger the common tests.
+ - If you are adding new `@slow` tests, make sure they pass using
+ `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`.
+ - If you are adding a new tokenizer, write tests and make sure
+ `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` passes.
+ - CircleCI does not run the slow tests, but GitHub Actions does every night!<br>
+
+ ☐ All public methods must have informative docstrings (see
+ [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py)
+ for an example).<br>
+ ☐ Due to the rapidly growing repository, don't add any images, videos and other
+ non-text files that'll significantly weigh down the repository. Instead, use a Hub
+ repository such as [`hf-internal-testing`](https://huggingface.co/hf-internal-testing)
+ to host these files and reference them by URL. We recommend placing documentation
+ related images in the following repository:
+ [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
+ You can open a PR on this dataset repository and ask a Hugging Face member to merge it.
+
+ For more information about the checks run on a pull request, take a look at our [Checks on a Pull Request](https://huggingface.co/docs/transformers/pr_checks) guide.
+
+ ### Tests
+
+ An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
+ the [tests](https://github.com/huggingface/transformers/tree/main/tests) folder and examples tests in the
+ [examples](https://github.com/huggingface/transformers/tree/main/examples) folder.
+
+ We like `pytest` and `pytest-xdist` because it's faster. From the root of the
+ repository, specify a *path to a subfolder or a test file* to run the test:
+
+ ```bash
+ python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model
+ ```
+
+ Similarly, for the `examples` directory, specify a *path to a subfolder or test file* to run the test. For example, the following command tests the text classification subfolder in the PyTorch `examples` directory:
+
+ ```bash
+ pip install -r examples/xxx/requirements.txt # only needed the first time
+ python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification
+ ```
+
+ In fact, this is actually how our `make test` and `make test-examples` commands are implemented (not including the `pip install`)!
+
+ You can also specify a smaller set of tests in order to test only the feature
+ you're working on, for example as sketched below.
+
+ By default, slow tests are skipped but you can set the `RUN_SLOW` environment variable to
327
+ `yes` to run them. This will download many gigabytes of models so make sure you
328
+ have enough disk space, a good internet connection or a lot of patience!
329
+
330
+ <Tip warning={true}>
331
+
332
+ Remember to specify a *path to a subfolder or a test file* to run the test. Otherwise, you'll run all the tests in the `tests` or `examples` folder, which will take a very long time!
333
+
334
+ </Tip>
335
+
336
+ ```bash
337
+ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model
338
+ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification
339
+ ```
340
+
341
+ Like the slow tests, there are other environment variables available which not enabled by default during testing:
342
+ - `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.
343
+ - `RUN_PT_FLAX_CROSS_TESTS`: Enables tests for PyTorch + Flax integration.
344
+ - `RUN_PT_TF_CROSS_TESTS`: Enables tests for TensorFlow + PyTorch integration.
345
+
346
+ More environment variables and additional information can be found in the [testing_utils.py](src/transformers/testing_utils.py).
347
+
348
+ 🤗 Transformers uses `pytest` as a test runner only. It doesn't use any
349
+ `pytest`-specific features in the test suite itself.
350
+
351
+ This means `unittest` is fully supported. Here's how to run tests with
352
+ `unittest`:
353
+
354
+ ```bash
355
+ python -m unittest discover -s tests -t . -v
356
+ python -m unittest discover -s examples -t examples -v
357
+ ```
358
+
359
+ ### Style guide
360
+
361
+ For documentation strings, 🤗 Transformers follows the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html).
362
+ Check our [documentation writing guide](https://github.com/huggingface/transformers/tree/main/docs#writing-documentation---specification)
363
+ for more information.
364
+
365
+ ### Develop on Windows
366
+
367
+ On Windows (unless you're working in [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) or WSL), you need to configure git to transform Windows `CRLF` line endings to Linux `LF` line endings:
368
+
369
+ ```bash
370
+ git config core.autocrlf input
371
+ ```
372
+
373
+ One way to run the `make` command on Windows is with MSYS2:
374
+
375
+ 1. [Download MSYS2](https://www.msys2.org/), and we assume it's installed in `C:\msys64`.
376
+ 2. Open the command line `C:\msys64\msys2.exe` (it should be available from the **Start** menu).
377
+ 3. Run in the shell: `pacman -Syu` and install `make` with `pacman -S make`.
378
+ 4. Add `C:\msys64\usr\bin` to your PATH environment variable.
379
+
380
+ You can now use `make` from any terminal (PowerShell, cmd.exe, etc.)! 🎉
381
+
382
+ ### Sync a forked repository with upstream main (the Hugging Face repository)
383
+
384
+ When updating the main branch of a forked repository, please follow these steps to avoid pinging the upstream repository which adds reference notes to each upstream PR, and sends unnecessary notifications to the developers involved in these PRs.
385
+
386
+ 1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main.
387
+ 2. If a PR is absolutely necessary, use the following steps after checking out your branch:
388
+
389
+ ```bash
390
+ git checkout -b your-branch-for-syncing
391
+ git pull --squash --no-commit upstream main
392
+ git commit -m '<your message without GitHub references>'
393
+ git push --set-upstream origin your-branch-for-syncing
394
+ ```
ISSUES.md ADDED
@@ -0,0 +1,277 @@
+ <!---
+ Copyright 2020 The HuggingFace Team. All rights reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+ # How To Request Support
+
+ This is an Open Source Project so please be mindful that like in any other project of this kind there is no obligation to answer all requests for help.
+
+ However, we want to encourage you to ask for help whenever you think it's needed! We are happy about every question we get because it allows us to better understand your needs, possible misunderstandings, and most importantly a way for you to help us make this library better. That being said, this document's main purpose is to provide guidelines on how you can formulate your requests to increase your chances to be understood and to get support.
+
+ There are two main venues to receive support: [the forums](https://discuss.huggingface.co/) and [the GitHub issues](https://github.com/huggingface/transformers/issues).
+
+ ## The Forums
+
+ [The user forums](https://discuss.huggingface.co/) are supported by the wide community of the library users and backed up by developers when needed.
+
+ If you have a difficulty with deploying this library or some questions, or you'd like to discuss a new feature, please first consider discussing those things at the forums. Only when you feel your subject matter has been crystallized and you still need support from the library developers do proceed to file an [issue](https://github.com/huggingface/transformers/issues).
+
+ In particular all "Please explain" questions or objectively very user-specific feature requests belong to the forums. Here are some examples of such questions:
+
+ * "I would like to use a BertModel within a RL-Agent for a customer support service. How can I use a BertForMaskedLM in my ChatBotModel?"
+
+ * "Could you please explain why T5 has no positional embedding matrix under T5Model?"
+
+ * "How should I set my generation parameters for translation?"
+
+ * "How to train T5 on De->En translation?"
+
+
+ ## The GitHub Issues
+
+ Everything which hints at a bug should be opened as an [issue](https://github.com/huggingface/transformers/issues).
+
+ You are not required to read the following guidelines before opening an issue. However, if you notice that your issue doesn't get any replies, chances are that the developers have one or several difficulties with its quality. In this case, reading the following points and adjusting your issue accordingly could help.
+
+ 1. Before posting an issue, first search for already posted issues, since chances are someone has already asked a similar question before you.
+
+ If you use Google your search query should be:
+
+ ```
+ "huggingface" "transformers" your query
+ ```
+
+ The first two quoted words tell Google to limit the search to the context of the Huggingface Transformers. The remainder is your query - most commonly this would be the error message the software fails with. We will go deeper into details shortly.
+
+ The results of such a query will typically match GitHub issues, Hugging Face forums, StackExchange, and blogs.
+
+ If you find relevant hints, you may choose to continue the discussion there if you have follow up questions.
+
+ If what you found is similar but doesn't quite answer your problem, please, post a new issue and do include links to similar issues or forum discussions you may have found.
+
+ Let's look at some examples:
+
+ The error message, often referred to as an assertion, tells us what went wrong. Here is an example of an assertion:
+
+ ```python
+ Traceback (most recent call last):
+ File "<string>", line 1, in <module>
+ File "/transformers/src/transformers/__init__.py", line 34, in <module>
+ from . import dependency_versions_check
+ File "/transformers/src/transformers/dependency_versions_check.py", line 34, in <module>
+ from .utils import is_tokenizers_available
+ File "/transformers/src/transformers/utils/import_utils.py", line 40, in <module>
+ from tqdm.auto import tqdm
+ ModuleNotFoundError: No module named 'tqdm.auto'
+ ```
+
+ and it typically includes a traceback, so that we can see the full stack of calls the program made before it fails. This gives us the context to know why the program failed.
+
+ Going back to the above example. If you received this error, look at the very last line of the error, which is:
+
+ ```python
+ ModuleNotFoundError: No module named 'tqdm.auto'
+ ```
+
+ And now we can use it to do the searching on your favorite search engine:
+
+ 1. first for `"huggingface" "transformers" "ModuleNotFoundError: No module named 'tqdm.auto'"`
+ 2. if you don't find relevant results, then search for just `"ModuleNotFoundError: No module named 'tqdm.auto'"`
+ 3. and finally if nothing still comes up, then remove the outside quotes: `ModuleNotFoundError: No module named 'tqdm.auto'`
+
+ If the error includes any messages that include bits unique to your filesystem, always remove those in the search query since other users will not have the same filesystem as yours. For example:
+
+ ```bash
+ python -c 'open("/tmp/wrong_path.txt", "r")'
+ Traceback (most recent call last):
+ File "<string>", line 1, in <module>
+ FileNotFoundError: [Errno 2] No such file or directory: '/tmp/wrong_path.txt'
+ ```
+ Here you'd search for just: `"FileNotFoundError: [Errno 2] No such file or directory"`
+
+ If the local information that you removed was inside the error message, you may need to remove the double quotes since your query is no longer exact. So if the error message was something like:
+
+ ```bash
+ ValueError: '/tmp/wrong_path.txt' cannot be found
+ ```
+
+ then you'd search for `"ValueError" "cannot be found"`
+
+ As you search you will notice that when you don't use quotes often the search engines will return a variety of unrelated hits, which may or may not be what you want.
+
+ Experiment with different ways and find which approach gives the most satisfactory results.
+
+ 2. Keep the issue short, providing the information that you think will aid the developers to understand your situation. Put yourself in the shoes of the person who has never seen your code or knows anything about your custom setup. This mental exercise will help to develop an intuition to what/what not to share.
+
+ 3. If there is a software failure, always provide the full traceback, for example:
+
+ ```python
+ $ python -c 'import transformers'
+ Traceback (most recent call last):
+ File "<string>", line 1, in <module>
+ File "/transformers/src/transformers/__init__.py", line 34, in <module>
+ from . import dependency_versions_check
+ File "/transformers/src/transformers/dependency_versions_check.py", line 34, in <module>
+ from .utils import is_tokenizers_available
+ File "/transformers/src/transformers/utils/import_utils.py", line 40, in <module>
+ from tqdm.auto import tqdm
+ ModuleNotFoundError: No module named 'tqdm.auto'
+ ```
+
+ As compared to providing just the last line of the error message, e.g.:
+ ```python
+ ModuleNotFoundError: No module named 'tqdm.auto'
+ ```
+ which is not sufficient.
+
+ If your application is running on more than one GPU (e.g. under `DistributedDataParallel`) and typically getting every log and traceback printed multiple times, please make sure that you paste only one copy of it. At times the traceback from parallel processes may get interleaved - so either disentangle these or change the loggers to log only for `local_rank==0` so that only one process logs things.
+
+ 4. When quoting a traceback, command line instructions and any type of code always enclose it in triple backticks inside the editor window, that is:
+
+ ````
+ ```
+ git clone https://github.com/huggingface/transformers
+ cd transformers
+ pip install .
+ ```
+ ````
+
+ If it's a command line with a long argument list, please consider breaking it down using backslashes and new lines. Here is an example of a good command line quote:
+
+ ```bash
+ cd examples/seq2seq
+ torchrun --nproc_per_node=2 ./finetune_trainer.py \
+ --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
+ --output_dir output_dir --overwrite_output_dir \
+ --do_train --n_train 500 --num_train_epochs 1 \
+ --per_device_train_batch_size 1 --freeze_embeds \
+ --src_lang en_XX --tgt_lang ro_RO --task translation \
+ --fp16
+ ```
+
+ If you don't break it up, one has to scroll horizontally which often makes it quite difficult to quickly see what's happening.
+
+ The backslashes allow us to copy the command directly into the console to run it, without needing to edit it.
+
+ 5. Include only the important information that you think will help the developer to quickly identify the problem.
+
+ For example applications often create huge amounts of logs. Ask yourself whether providing all or parts of the log is useful.
+
+ Pasting 100-1000 lines of log into the issue is an immediate turn off, since it will take a lot of time to figure out where the pertinent parts of the log are.
+
+ Attaching a full log can be helpful if it's done as an attachment, if it's enclosed in the following html code in the comment editor window:
+
+ ```
+ <details>
+ <summary>Full log</summary>
+ <pre>
+
+ many
+ lines
+ go
+ here
+
+ </pre>
+ </details>
+ ```
+
+ which would result in the following entry, which can be opened if desired, but otherwise takes little space.
+
+ <details>
+ <summary>Full log</summary>
+ <pre>
+ many
+ lines
+ go
+ here
+ </pre>
+ </details>
+
+ You could also provide a link to a pastebin service, but this is less beneficial since those links tend to expire quickly and future readers of your issue might not be able to access that log file anymore and may lack some context.
+
+ 6. If this is an issue in your code, do try to reduce that code to a minimal example that still demonstrates the problem. Please ask at the forums if you have a hard time figuring how to do that. Please realize that we don't have the luxury of having time to try and understand all of your custom code.
+
+ If you really tried to make a short reproducible code but couldn't figure it out, it might be that having a traceback will give the developer enough information to know what's going on. But if it is not enough and we can't reproduce the problem, we can't really solve it.
+
+ Do not despair if you can't figure it out from the beginning, just share what you can and perhaps someone else will be able to help you at the forums.
+
+ If your setup involves any custom datasets, the best way to help us reproduce the problem is to create a [Google Colab notebook](https://colab.research.google.com/) that demonstrates the issue and once you verify that the issue still exists, include a link to that notebook in the Issue. Just make sure that you don't copy and paste the location bar url of the open notebook - as this is private and we won't be able to open it. Instead, you need to click on `Share` in the right upper corner of the notebook, select `Get Link` and then copy and paste the public link it will give to you.
+
+ 7. If you forked off some of this project's code or example applications, please, do not ask us to go into your code repository and figure out what you may have done. The code is already very complex and unless there is an easy way to do a diff and it's a small diff, it won't be possible to find someone with time on their hands to make a lengthy investigation. Albeit, you might find someone at the forums who will be generous to do this for you.
+
+ 8. Before reporting an issue, first, always try to update your environment to the latest official version of this library. We have no resources to go and debug older revisions, which could easily have bugs that have been fixed in the latest released version.
+
+ We understand that this is not always possible, especially when APIs change, in which case file an issue against the highest library version your environment can support.
+
+ Of course, if you upgrade the library, always retest that the problem is still there.
+
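For example, upgrading usually comes down to a single command; the second line, installing the current development version from source, is optional and only relevant when a fix hasn't made it into a release yet:

```bash
# Upgrade to the latest released version of the library
pip install -U transformers

# Optionally, install the current development version from source
pip install git+https://github.com/huggingface/transformers
```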
+ 9. Please do not ask us to reproduce an issue with your custom data, since we don't have it. So, either you should use some existing dataset supported by HF datasets or you need to supply code that generates a small sample on the fly, or some other quick and simple way to get it.
+
+ Please do not send us any non-public domain data that may require a license or a permission to be used.
+
+ 10. Do not tag multiple developers on the issue unless you know this is expected, either because you asked them and they gave you an explicit permission to tag them or the issue template instructs you to do so.
+
+ The "who to tag for what domain" part of the issue template is there to help users direct their questions to the right developers who are designated maintainers of project's specific domains. They can then decide at their own discretion to tag other developers if they feel it'd help move the issue forward.
+
+ We currently don't have a triage service and we trust your capacity to identify the right domain and thus the persons to tag in your issue. If you are not sure, please use the forums to ask for guidance.
+
+ When in doubt, err on the side of not tagging a given person. If you tag multiple people out of context or permission don't be surprised if you get no response at all. Please remember that every time you tag someone, they get a notification and you're taking their time without their permission. Please be sensitive to that.
+
+ If you got helped by one of the developers in the past please don't tag them in future issues, unless they are listed in the issue template for the domain you are asking about or that developer gave you an explicit permission to tag them in future issues.
+
+ If you see a certain developer doing multiple and/or recent commits into a specific area of the project that you feel is relevant to your issue, it is not a good reason to tag them. Various developers may be fixing things that prevent them from moving forward, but often their work is focused on a totally different domain. And while they may or may not know how to help you with the problem at hand, it would benefit the whole community much more if they focus on the domain of their unique expertise.
+
+ 11. Use the Edit button. Take your time, and re-read and improve the wording and formatting to make your posts and comments as easy to understand as possible.
+
+ Avoid posting multiple comments in a row, as each comment generates a notification for the developers tagged in that issue. If you happened to post multiple comments in a row, and nobody followed up yet - consider merging those into one or a few comments while editing the combined content to be coherent.
+
+ If you choose to edit your older comments after others posted follow up comments you need to be aware that your modifications might not be noticed, so if it's not a typo fixing, try to write a new comment flagging that something has been changed in the previous comments.
+
+ For example, the very first comment is the most important one. If while the thread unfolds you realize that things aren't as they seemed to you originally you may want to edit the first post to reflect the up-to-date understanding of the issue at hand so that it helps those who read your issue in the future quickly understand what's going on and not need to sift through dozens of comments. It also helps to indicate that the post was edited. So, those reading the thread later can understand why there might be certain discontinuity in the information flow.
+
+ Use bullets and items if you have lists of items and the outcome improves overall readability.
+
+ Use backticks to refer to class and function names, e.g. `BartModel` and `generate` as these stand out and improve the speed of a reader's comprehension.
+
+ Try not to use italics and bold text too much as these often make the text more difficult to read.
+
+
+ 12. If you are cross-referencing a specific comment in a given thread or another issue, always link to that specific comment, rather than using the issue link. If you do the latter it could be quite impossible to find which specific comment you're referring to.
+
+ To get the link to the specific comment do not copy the url from the location bar of your browser, but instead, click the `...` icon in the upper right corner of the comment and then select "Copy Link".
+
+ For example the first link is a link to an issue, and the second to a specific comment in the same issue:
+
+ 1. https://github.com/huggingface/transformers/issues/9257
+ 2. https://github.com/huggingface/transformers/issues/9257#issuecomment-749945162
+
+
+ 13. If you are replying to a last comment, it's totally fine to make your reply with just your comment in it. The readers can follow the information flow here.
+
+ But if you're replying to a comment that happened some comments back it's always a good practice to quote just the relevant lines you're replying to. The `>` is used for quoting, or you can always use the menu to do so. For example your editor box will look like:
+
+ ```
+ > How big is your gpu cluster?
+
+ Our cluster is made of 256 gpus.
+ ```
+
+ If you are addressing multiple comments, quote the relevant parts of each before your answer. Some people use the same comment to do multiple replies, others separate them into separate comments. Either way works. The latter approach helps for linking to a specific comment.
+
+ In general the best way to figure out what works the best is to learn from issues posted by other people - see which issues get great responses and which get little to no response - observe what the posters who received great responses did differently from those who did not.
+
+ Thank you for reading this somewhat lengthy document. We would like to conclude that these are not absolute rules, but friendly advice that will help maximize the chances for us to understand what you are trying to communicate, reproduce the problem then resolve it to your satisfaction and the benefit of the whole community.
+
+ If after reading this document there are remaining questions on how and why or there is a need for further elucidation, please, don't hesitate to ask your question in [this thread](https://discuss.huggingface.co/t/how-to-request-support/3128).
LICENSE ADDED
@@ -0,0 +1,203 @@
+ Copyright 2018- The Hugging Face team. All rights reserved.
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
130
+ the conditions stated in this License.
131
+
132
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
133
+ any Contribution intentionally submitted for inclusion in the Work
134
+ by You to the Licensor shall be under the terms and conditions of
135
+ this License, without any additional terms or conditions.
136
+ Notwithstanding the above, nothing herein shall supersede or modify
137
+ the terms of any separate license agreement you may have executed
138
+ with Licensor regarding such Contributions.
139
+
140
+ 6. Trademarks. This License does not grant permission to use the trade
141
+ names, trademarks, service marks, or product names of the Licensor,
142
+ except as required for reasonable and customary use in describing the
143
+ origin of the Work and reproducing the content of the NOTICE file.
144
+
145
+ 7. Disclaimer of Warranty. Unless required by applicable law or
146
+ agreed to in writing, Licensor provides the Work (and each
147
+ Contributor provides its Contributions) on an "AS IS" BASIS,
148
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
149
+ implied, including, without limitation, any warranties or conditions
150
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
151
+ PARTICULAR PURPOSE. You are solely responsible for determining the
152
+ appropriateness of using or redistributing the Work and assume any
153
+ risks associated with Your exercise of permissions under this License.
154
+
155
+ 8. Limitation of Liability. In no event and under no legal theory,
156
+ whether in tort (including negligence), contract, or otherwise,
157
+ unless required by applicable law (such as deliberate and grossly
158
+ negligent acts) or agreed to in writing, shall any Contributor be
159
+ liable to You for damages, including any direct, indirect, special,
160
+ incidental, or consequential damages of any character arising as a
161
+ result of this License or out of the use or inability to use the
162
+ Work (including but not limited to damages for loss of goodwill,
163
+ work stoppage, computer failure or malfunction, or any and all
164
+ other commercial damages or losses), even if such Contributor
165
+ has been advised of the possibility of such damages.
166
+
167
+ 9. Accepting Warranty or Additional Liability. While redistributing
168
+ the Work or Derivative Works thereof, You may choose to offer,
169
+ and charge a fee for, acceptance of support, warranty, indemnity,
170
+ or other liability obligations and/or rights consistent with this
171
+ License. However, in accepting such obligations, You may act only
172
+ on Your own behalf and on Your sole responsibility, not on behalf
173
+ of any other Contributor, and only if You agree to indemnify,
174
+ defend, and hold each Contributor harmless for any liability
175
+ incurred by, or claims asserted against, such Contributor by reason
176
+ of your accepting any such warranty or additional liability.
177
+
178
+ END OF TERMS AND CONDITIONS
179
+
180
+ APPENDIX: How to apply the Apache License to your work.
181
+
182
+ To apply the Apache License to your work, attach the following
183
+ boilerplate notice, with the fields enclosed by brackets "[]"
184
+ replaced with your own identifying information. (Don't include
185
+ the brackets!) The text should be enclosed in the appropriate
186
+ comment syntax for the file format. We also recommend that a
187
+ file or class name and description of purpose be included on the
188
+ same "printed page" as the copyright notice for easier
189
+ identification within third-party archives.
190
+
191
+ Copyright [yyyy] [name of copyright owner]
192
+
193
+ Licensed under the Apache License, Version 2.0 (the "License");
194
+ you may not use this file except in compliance with the License.
195
+ You may obtain a copy of the License at
196
+
197
+ http://www.apache.org/licenses/LICENSE-2.0
198
+
199
+ Unless required by applicable law or agreed to in writing, software
200
+ distributed under the License is distributed on an "AS IS" BASIS,
201
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
202
+ See the License for the specific language governing permissions and
203
+ limitations under the License.
Makefile ADDED
@@ -0,0 +1,124 @@
1
+ .PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples
2
+
3
+ # make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
4
+ export PYTHONPATH = src
5
+
6
+ check_dirs := examples tests src utils
7
+
8
+ exclude_folders := examples/research_projects
9
+
10
+ modified_only_fixup:
11
+ $(eval modified_py_files := $(shell python utils/get_modified_files.py $(check_dirs)))
12
+ @if test -n "$(modified_py_files)"; then \
13
+ echo "Checking/fixing $(modified_py_files)"; \
14
+ ruff check $(modified_py_files) --fix --exclude $(exclude_folders); \
15
+ ruff format $(modified_py_files) --exclude $(exclude_folders);\
16
+ else \
17
+ echo "No library .py files were modified"; \
18
+ fi
19
+
20
+ # Update src/transformers/dependency_versions_table.py
21
+
22
+ deps_table_update:
23
+ @python setup.py deps_table_update
24
+
25
+ deps_table_check_updated:
26
+ @md5sum src/transformers/dependency_versions_table.py > md5sum.saved
27
+ @python setup.py deps_table_update
28
+ @md5sum -c --quiet md5sum.saved || (printf "\nError: the version dependency table is outdated.\nPlease run 'make fixup' or 'make style' and commit the changes.\n\n" && exit 1)
29
+ @rm md5sum.saved
30
+
31
+ # autogenerating code
32
+
33
+ autogenerate_code: deps_table_update
34
+
35
+ # Check that the repo is in a good state
36
+
37
+ repo-consistency:
38
+ python utils/check_copies.py
39
+ python utils/check_table.py
40
+ python utils/check_dummies.py
41
+ python utils/check_repo.py
42
+ python utils/check_inits.py
43
+ python utils/check_config_docstrings.py
44
+ python utils/check_config_attributes.py
45
+ python utils/check_doctest_list.py
46
+ python utils/update_metadata.py --check-only
47
+ python utils/check_docstrings.py
48
+ python utils/check_support_list.py
49
+
50
+ # this target runs checks on all files
51
+
52
+ quality:
53
+ @python -c "from transformers import *" || (echo '🚨 import failed, this means you introduced unprotected imports! 🚨'; exit 1)
54
+ ruff check $(check_dirs) setup.py conftest.py
55
+ ruff format --check $(check_dirs) setup.py conftest.py
56
+ python utils/custom_init_isort.py --check_only
57
+ python utils/sort_auto_mappings.py --check_only
58
+ python utils/check_doc_toc.py
59
+
60
+
61
+ # Format source code automatically and check if there are any problems left that need manual fixing
62
+
63
+ extra_style_checks:
64
+ python utils/custom_init_isort.py
65
+ python utils/sort_auto_mappings.py
66
+ python utils/check_doc_toc.py --fix_and_overwrite
67
+
68
+ # this target runs checks on all files and potentially modifies some of them
69
+
70
+ style:
71
+ ruff check $(check_dirs) setup.py conftest.py --fix --exclude $(exclude_folders)
72
+ ruff format $(check_dirs) setup.py conftest.py --exclude $(exclude_folders)
73
+ ${MAKE} autogenerate_code
74
+ ${MAKE} extra_style_checks
75
+
76
+ # Super fast fix and check target that only works on relevant modified files since the branch was made
77
+
78
+ fixup: modified_only_fixup extra_style_checks autogenerate_code repo-consistency
79
+
80
+ # Make marked copies of snippets of code conform to the original
81
+
82
+ fix-copies:
83
+ python utils/check_copies.py --fix_and_overwrite
84
+ python utils/check_table.py --fix_and_overwrite
85
+ python utils/check_dummies.py --fix_and_overwrite
86
+ python utils/check_doctest_list.py --fix_and_overwrite
87
+ python utils/check_docstrings.py --fix_and_overwrite
88
+
89
+ # Run tests for the library
90
+
91
+ test:
92
+ python -m pytest -n auto --dist=loadfile -s -v ./tests/
93
+
94
+ # Run tests for examples
95
+
96
+ test-examples:
97
+ python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/
98
+
99
+ # Run tests for SageMaker DLC release
100
+
101
+ test-sagemaker: # install sagemaker dependencies in advance with pip install .[sagemaker]
102
+ TEST_SAGEMAKER=True python -m pytest -n auto -s -v ./tests/sagemaker
103
+
104
+
105
+ # Release stuff
106
+
107
+ pre-release:
108
+ python utils/release.py
109
+
110
+ pre-patch:
111
+ python utils/release.py --patch
112
+
113
+ post-release:
114
+ python utils/release.py --post_release
115
+
116
+ post-patch:
117
+ python utils/release.py --post_release --patch
118
+
119
+ build-release:
120
+ rm -rf dist
121
+ rm -rf build
122
+ python setup.py bdist_wheel
123
+ python setup.py sdist
124
+ python utils/check_build.py
README.md CHANGED
@@ -1,13 +1,13 @@
1
  ---
2
- title: 1bit Llama3 Instruct Xmad Qa Batch
3
- emoji: 📈
4
- colorFrom: purple
5
- colorTo: red
6
  sdk: gradio
7
- sdk_version: 4.39.0
8
  app_file: app.py
9
  pinned: false
10
  license: llama3
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
  ---
2
+ title: 1-Bit Llama-3 Demo Batch Input/Output 500+ Tokens per Second by xMAD.ai
3
+ emoji: 💬
4
+ colorFrom: yellow
5
+ colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 4.36.1
8
  app_file: app.py
9
  pinned: false
10
  license: llama3
11
  ---
12
 
13
+ An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
README_test_result.md ADDED
@@ -0,0 +1,58 @@
1
+ # Maximum Batch Size Analysis for Llama2 Models
2
+
3
+ This document summarizes the performance testing results for Llama2 models under various configurations. The focus is on identifying the maximum batch sizes that can be processed without errors and documenting the corresponding generation times in seconds.
4
+
5
+ ## Experiment Details
6
+
7
+ The experiment varied settings such as model size, number of new tokens (`num_new_tokens`), key-value cache bit width (`kv_bits`), and batch size; "Unquantized" denotes configurations without KV-cache quantization. The objective was to determine the largest batch size that remains stable when generating a fixed number of tokens under each configuration, as in the sketch below.
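As a rough illustration of how such a sweep can be structured, here is a minimal sketch that doubles the batch size until generation fails and records the last successful run. The model id handling, the `.codebooks` quantizer-path convention, and the doubling search are assumptions borrowed from `app.py` in this repository, not the exact benchmarking script.

```python
# Illustrative sweep only: the quantizer_path convention mirrors app.py;
# the doubling search and greedy decoding are simplifying assumptions.
import time
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

def max_stable_batch_size(model_name, kv_bits, num_new_tokens, prompt):
    config = AutoConfig.from_pretrained(model_name)
    if kv_bits != "unquantized":
        # Assumed codebook location, following the pattern used in app.py.
        config.quantizer_path = f".codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"
    model = AutoModelForCausalLM.from_pretrained(
        model_name, config=config, torch_dtype=torch.float16, device_map="auto"
    )
    batch_size, last_ok, last_time = 1, 0, None
    while True:
        try:
            inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to(model.device)
            torch.cuda.synchronize()
            start = time.perf_counter()
            model.generate(
                **inputs,
                max_new_tokens=num_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )
            torch.cuda.synchronize()
            last_ok, last_time = batch_size, time.perf_counter() - start
            batch_size *= 2  # grow until the GPU runs out of memory
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return last_ok, last_time  # largest stable batch size and its generation time (s)
```

A real sweep would step more finely near the failure point; the doubling search above only brackets the maximum.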
8
+
9
+ ### Models and Configurations
10
+
11
+ - **Models Tested:** Llama2 7B and 13B.
12
+ - **Measurements:** Generation times are reported directly in seconds, as recorded in the underlying measurement data.
13
+ ## Results: Llama2 7B Model Performance
14
+
15
+ | Model Size | num_new_tokens | KV Bits | Max Batch Size | Generation Time (s) | Speedup (Max Batch Size vs. Unquantized) |
16
+ |------------|----------------|-------------|----------------|----------------------|-----------------------|
17
+ | 7B | 256 | 1 | 764 | 257 | 14.98x |
18
+ | 7B | 256 | 2 | 384 | 124 | 7.53x |
19
+ | 7B | 256 | 4 | 204 | 99 | 4.00x |
20
+ | 7B | 256 | Unquantized | 51 | 75 | 1x |
21
+ | 7B | 512 | 1 | 437 | 352 | 15.07x |
22
+ | 7B | 512 | 2 | 223 | 178 | 7.69x |
23
+ | 7B | 512 | 4 | 114 | 148 | 3.93x |
24
+ | 7B | 512 | Unquantized | 29 | 122 | 1x |
25
+ | 7B | 1024 | 1 | 247 | 454 | 15.44x |
26
+ | 7B | 1024 | 2 | 126 | 300 | 7.88x |
27
+ | 7B | 1024 | 4 | 65 | 283 | 4.06x |
28
+ | 7B | 1024 | Unquantized | 16 | 224 | 1x |
29
+
30
+
31
+ ## Results: Llama2 13B Model Performance
32
+ | Model Size | num_new_tokens | KV Bits | Max Batch Size | Generation Time (s) | Speedup (Max Batch Size vs. Unquantized) |
33
+ |------------|----------------|-------------|----------------|----------------------|-----------------------|
34
+ | 13B | 256 | 1 | 154 | 83 | 14.00x |
35
+ | 13B | 256 | 2 | 88 | 63 | 8.00x |
36
+ | 13B | 256 | 4 | 45 | 62 | 4.09x |
37
+ | 13B | 256 | Unquantized | 11 | 33 | 1x |
38
+ | 13B | 512 | 1 | 100 | 144 | 16.67x |
39
+ | 13B | 512 | 2 | 51 | 98 | 8.50x |
40
+ | 13B | 512 | 4 | 26 | 108 | 4.33x |
41
+ | 13B | 512 | Unquantized | 6 | 60 | 1x |
42
+ | 13B | 1024 | 1 | 58 | 260 | 19.33x |
43
+ | 13B | 1024 | 2 | 29 | 173 | 9.67x |
44
+ | 13B | 1024 | 4 | 15 | 216 | 5.00x |
45
+ | 13B | 1024 | Unquantized | 3 | 118 | 1x |
46
+
47
+
48
+
49
+ ## Recommendations
50
+ 1. **KV Bits Influence**: Configurations with quantized KV caches (lower KV bits) support substantially larger batch sizes, highlighting the importance of key/value cache management in batch processing.
51
+
52
+ 2. **Optimal Configuration Selection**: Depending on the operational needs (e.g., low latency vs. high throughput), choose the appropriate KV bits setting. For scenarios where throughput is critical, a lower KV bits setting is advisable.
53
+
54
+ ## Averaged Speedup Analysis
55
+ - **1-bit Quantization:** Achieves an average speedup of approximately 15.58x in maximum batch size compared to unquantized configurations across all tested scenarios.
56
+
57
+ - **2-bit Quantization:** Provides an average of 8.02x speedup.
58
+
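For reference, the speedup column in the tables above is simply the ratio of the quantized maximum batch size to the unquantized one at the same model size and `num_new_tokens`. A minimal sketch, using the Llama2 7B, 256-token rows as an example:

```python
# Speedup = max batch size with a quantized KV cache / unquantized max batch size.
# Values are taken from the Llama2 7B, num_new_tokens=256 rows above.
max_batch_size = {"1": 764, "2": 384, "4": 204, "unquantized": 51}

def batch_size_speedup(kv_bits: str) -> float:
    return max_batch_size[kv_bits] / max_batch_size["unquantized"]

print(f"{batch_size_speedup('1'):.2f}x")  # 764 / 51 ≈ 14.98x, matching the table
```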
SECURITY.md ADDED
@@ -0,0 +1,40 @@
1
+ # Security Policy
2
+
3
+ ## Hugging Face Hub, remote artefacts, and remote code
4
+
5
+ Transformers is open-source software that is tightly coupled to the Hugging Face Hub. While you have the ability to use it
6
+ offline with pre-downloaded model weights, it provides a very simple way to download, use, and manage models locally.
7
+
8
+ When downloading artefacts that have been uploaded by others on any platform, you expose yourself to risks. Please
9
+ read below for the security recommendations in order to keep your runtime and local environment safe.
10
+
11
+ ### Remote artefacts
12
+
13
+ Models uploaded on the Hugging Face Hub come in different formats. We heavily recommend uploading and downloading
14
+ models in the [`safetensors`](https://github.com/huggingface/safetensors) format (which is the default prioritized
15
+ by the transformers library), as it was developed specifically to prevent arbitrary code execution on your system.
16
+
17
+ To avoid loading models from unsafe formats (e.g. [pickle](https://docs.python.org/3/library/pickle.html)), you should use the `use_safetensors` parameter. If you do, and no .safetensors file is present, transformers will error when loading the model.
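As a minimal illustration (the model id below is simply the one used elsewhere in this repository), a loading call with this parameter refuses to fall back to pickle-based weights:

```python
from transformers import AutoModelForCausalLM

# With use_safetensors=True, loading errors out instead of silently falling
# back to pickle-based .bin weights when no .safetensors file is available.
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Meta-Llama-3-8B-Instruct",  # illustrative model id
    use_safetensors=True,
)
```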
18
+
19
+ ### Remote code
20
+
21
+ #### Modeling
22
+
23
+ Transformers supports many model architectures, but is also the bridge between your Python runtime and models that
24
+ are stored in model repositories on the Hugging Face Hub.
25
+
26
+ These models require the `trust_remote_code=True` parameter to be set when using them; please **always** verify
27
+ the content of the modeling files when using this argument. We recommend setting a revision in order to ensure you
28
+ protect yourself from updates on the repository.
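As an illustration, pinning a reviewed revision together with `trust_remote_code` looks roughly like this; the repository name and commit hash are placeholders, not real references:

```python
from transformers import AutoModelForCausalLM

# trust_remote_code executes the repository's own modeling code on your machine;
# pinning revision to a commit you have reviewed protects against later updates.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/custom-architecture",  # placeholder repository with custom code
    trust_remote_code=True,
    revision="a1b2c3d",              # placeholder commit hash you have audited
)
```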
29
+
30
+ #### Tools
31
+
32
+ Through the `Agent` framework, remote tools can be downloaded to be used by the Agent. You specify these tools
33
+ yourself, but please keep in mind that their code will be run on your machine if the Agent chooses to run them.
34
+
35
+ Please inspect the code of the tools before passing them to the Agent to protect your runtime and local setup.
36
+
37
+ ## Reporting a Vulnerability
38
+
39
+ 🤗 Please feel free to submit vulnerability reports to our private bug bounty program at https://hackerone.com/hugging_face. You'll need to request access to the program by emailing [email protected].
40
+ Note that you'll need to be invited to our program, so send us a quick email at [email protected] if you've found a vulnerability.
__pycache__/app_local.cpython-310.pyc ADDED
Binary file (7.8 kB).
 
app.py ADDED
@@ -0,0 +1,419 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import random
5
+ import torch
6
+ import gc
7
+ import re
8
+ import math
9
+ import gradio as gr
10
+ import numpy as np
11
+ import boto3
12
+ import logging
13
+ from botocore.exceptions import NoCredentialsError
14
+ from collections import defaultdict
15
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
16
+
17
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
18
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
19
+
20
+
21
+ def download_xmad_file():
22
+ s3 = boto3.client('s3',
23
+ aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
24
+ aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'))
25
+
26
+ # Create the .codebooks directory if it doesn't exist
27
+ codebooks_dir = '.codebooks'
28
+ os.makedirs(codebooks_dir, exist_ok=True)
29
+
30
+ temp_file_path = os.path.join(codebooks_dir, 'llama-3-8b-instruct_1bit.xmad')
31
+
32
+ try:
33
+ # Download the file to the .codebooks directory
34
+ s3.download_file('xmad-quantized-models', 'llama-3-8b-instruct_1bit.xmad', temp_file_path)
35
+ print("Download Successful")
36
+
37
+ # Restrict permissions on the .codebooks directory
38
+ os.chmod(codebooks_dir, 0o700)
39
+
40
+ except NoCredentialsError:
41
+ print("Credentials not available")
42
+
43
+ download_xmad_file()
44
+
45
+ def b2mb(x):
46
+ """
47
+ Convert bytes to megabytes.
48
+ """
49
+ return int(x / 2**20)
50
+
51
+
52
+ class TorchTracemalloc:
53
+ """
54
+ A context manager that clears GPU memory
55
+ and returns GPU peak memory & GPU memory usage.
56
+ """
57
+ track_memory_consumption = []
58
+
59
+ def __enter__(self):
60
+ gc.collect()
61
+ torch.cuda.empty_cache()
62
+ torch.cuda.reset_peak_memory_stats()
63
+ self.begin = torch.cuda.memory_allocated()
64
+ return self
65
+
66
+ def __exit__(self, *exc):
67
+ torch.cuda.synchronize()
68
+ self.end = torch.cuda.memory_allocated()
69
+ self.peak = torch.cuda.max_memory_allocated()
70
+ self.used = b2mb(self.end - self.begin)
71
+ self.peaked = b2mb(self.peak - self.begin)
72
+ TorchTracemalloc.track_memory_consumption.append(self.peaked)
73
+
74
+ def clear_gpu_memory():
75
+ torch.cuda.empty_cache()
76
+ gc.collect()
77
+ print("GPU memory cleared.")
78
+
79
+
80
+ def format_response(dialog, response):
81
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
82
+ return {"question": question, "answer": response}
83
+
84
+ # Global variables to store the model and tokenizer
85
+ global_model = None
86
+ global_tokenizer = None
87
+
88
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
89
+ global global_model, global_tokenizer
90
+
91
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
92
+ special_tokens = {"pad_token": "<PAD>"}
93
+ tokenizer.add_special_tokens(special_tokens)
94
+
95
+ config = AutoConfig.from_pretrained(model_name)
96
+ if kv_bits != "unquantized":
97
+ quantizer_path = f".codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
98
+ setattr(config, "quantizer_path", quantizer_path)
99
+
100
+ if dtype == "bf16":
101
+ dtype = torch.bfloat16
102
+ elif dtype == "fp16":
103
+ dtype = torch.float16
104
+ elif dtype == "fp32":
105
+ dtype = torch.float32
106
+
107
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
108
+
109
+ print(f"Quantizer path in model config: {model.config.quantizer_path}")
110
+ logging.info(f"Quantizer path in model config: {model.config.quantizer_path}")
111
+
112
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
113
+ model.resize_token_embeddings(len(tokenizer))
114
+
115
+ tokenizer.padding_side = "left"
116
+ model.config.pad_token_id = tokenizer.pad_token_id
117
+
118
+ global_model = model
119
+ global_tokenizer = tokenizer
120
+
121
+ # def load_questions(prompts_path, custom_questions):
122
+ # with open(prompts_path, "r") as file:
123
+ # dialogs = json.load(file)
124
+
125
+ # selected_dialogs = []
126
+ # if custom_questions:
127
+ # for question in custom_questions:
128
+ # if question.strip():
129
+ # custom_dialog = [{"role": "user", "content": question}]
130
+ # selected_dialogs.append(custom_dialog)
131
+
132
+ # num_questions = max(60 - len(selected_dialogs), 0)
133
+ # random.shuffle(dialogs)
134
+ # selected_dialogs.extend(dialogs[:num_questions])
135
+
136
+ # return selected_dialogs
137
+ def load_questions(prompts_path, custom_questions):
138
+ selected_dialogs = []
139
+ if custom_questions:
140
+ for question in custom_questions:
141
+ if question.strip():
142
+ custom_dialog = [{"role": "user", "content": question}]
143
+ selected_dialogs.append(custom_dialog)
144
+ return selected_dialogs
145
+
146
+
147
+ def markdown_to_plain_text(markdown_text):
148
+ # Convert markdown bold (**) to plain text uppercase
149
+ markdown_text = re.sub(r'\*\*(.*?)\*\*', lambda m: m.group(1).upper(), markdown_text)
150
+ # Convert markdown italics (*) to plain text
151
+ markdown_text = re.sub(r'\*(.*?)\*', r'\1', markdown_text)
152
+ # Remove markdown headers (###)
153
+ markdown_text = re.sub(r'### ', '', markdown_text)
154
+ # Convert markdown lists (- or *)
155
+ markdown_text = re.sub(r'^\s*[-*]\s+', '', markdown_text, flags=re.MULTILINE)
156
+ # Remove remaining markdown formatting
157
+ markdown_text = re.sub(r'[`~>]', '', markdown_text)
158
+ return markdown_text
159
+
160
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
161
+ print("Starting inference...")
162
+ global global_model, global_tokenizer
163
+
164
+ model = global_model
165
+ tokenizer = global_tokenizer
166
+
167
+ batch_inputs = [
168
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
169
+ for dialog in dialogs
170
+ ]
171
+
172
+ responses = []
173
+ start_time = time.time()
174
+ batch_size = min(100, len(dialogs)) # Adjust batch size based on GPU capacity and number of dialogs
175
+ num_dialogs = len(dialogs)
176
+ total_time = 0 # Initialize total_time
177
+ total_tokens = 0
178
+ total_ttft = 0
179
+
180
+ memory_avg = []
181
+ tokens_per_sec_avg = []
182
+ time_to_first_token_avg = []
183
+ responses_by_batch_size = defaultdict(list)
184
+ batch_generation_time = 0
185
+ total_generation_time = 0
186
+
187
+ terminators = [
188
+ tokenizer.eos_token_id,
189
+ tokenizer.convert_tokens_to_ids("<|eot_id|>"),
190
+ ]
191
+
192
+ with TorchTracemalloc() as tt:
193
+ for i in range(0, num_dialogs, batch_size):
194
+ batch = batch_inputs[i : i + batch_size]
195
+ try:
196
+ encoded_inputs = tokenizer(
197
+ batch,
198
+ padding=True,
199
+ truncation=False,
200
+ return_tensors="pt",
201
+ )
202
+
203
+ input_ids = encoded_inputs["input_ids"].to(model.device)
204
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
205
+
206
+ torch.cuda.synchronize()
207
+ start_time = time.perf_counter()
208
+
209
+ with torch.no_grad():
210
+ output_tokens = model.generate(
211
+ input_ids,
212
+ attention_mask=attention_mask,
213
+ max_new_tokens=num_new_tokens,
214
+ num_return_sequences=1,
215
+ do_sample=True,
216
+ temperature=temperature,
217
+ pad_token_id=tokenizer.pad_token_id,
218
+ eos_token_id=terminators,
219
+ )
220
+
221
+ torch.cuda.synchronize()
222
+ end_time = time.perf_counter()
223
+
224
+ batch_time = end_time - start_time
225
+ total_time += batch_time
226
+ batch_generation_time += batch_time
227
+ total_generation_time += batch_time
228
+ total_tokens += output_tokens.numel()
229
+
230
+ if i == 0:
231
+ total_ttft = batch_time
232
+
233
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
234
+
235
+ for j, response in enumerate(decoded_outputs):
236
+ original_dialog = dialogs[i + j]
237
+ formatted_responses = format_response(original_dialog, response)
238
+ responses.append(formatted_responses)
239
+ # formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer'][4:]}" for res in responses])
240
+ formatted_responses = "\n\n====================\n\n".join([f"**Question**:\t{res['question']}\n\n**Answer**: {res['answer'][4+len(res['question'])+11:]}" for res in responses])
241
+ plain_text_responses = markdown_to_plain_text(formatted_responses)
242
+ yield plain_text_responses
243
+ progress(i, desc="Processing batches")
244
+
245
+ torch.cuda.empty_cache()
246
+
247
+ except Exception as e:
248
+ print(f"Error processing batch {i//batch_size + 1}: {str(e)}")
249
+ continue
250
+
251
+ elapsed_time = total_time
252
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
253
+ total_memory_consumption = np.sum(TorchTracemalloc.track_memory_consumption)
254
+ avg_memory_consumption = total_memory_consumption / num_dialogs
255
+
256
+ ttft = total_ttft / batch_size if batch_size > 0 else 0
257
+
258
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
259
+
260
+ yield {
261
+ "Time Taken (seconds)": elapsed_time,
262
+ "Tokens per Second": tokens_per_second,
263
+ "Time to First Token (seconds)": ttft,
264
+ "Formatted Responses": plain_text_responses,
265
+ "Memory Consumption per Question (MB)": avg_memory_consumption,
266
+ "Total Memory Consumption (MB)": total_memory_consumption,
267
+ "Num Dialogs": num_dialogs
268
+ }
269
+
270
+ # Demo function
271
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits=1, progress=gr.Progress()):
272
+ custom_questions = custom_questions_text.split("\n")
273
+ print("Loading questions...")
274
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
275
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
276
+
277
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
278
+
279
+ formatted_responses = ""
280
+ num_dialogs = 0
281
+ for result in result_gen:
282
+ if isinstance(result, str):
283
+ formatted_responses = result
284
+ yield None, None, None, None, None, None, formatted_responses  # six placeholders, one per metric output
285
+ else:
286
+ time_taken = result["Time Taken (seconds)"]
287
+ tokens_per_second = result["Tokens per Second"]
288
+ ttft = result["Time to First Token (seconds)"]
289
+ avg_memory_consumption = result["Memory Consumption per Question (MB)"]
290
+ total_memory_consumption = result["Total Memory Consumption (MB)"]
291
+ num_dialogs = result["Num Dialogs"]
292
+ formatted_responses = result["Formatted Responses"]
293
+ yield time_taken, tokens_per_second, ttft, avg_memory_consumption, num_dialogs, total_memory_consumption, formatted_responses
294
+ # clear_gpu_memory()
295
+
296
+ # Load JSON data
297
+ with open("chats_sys_none.json", "r") as file:
298
+ json_data = json.load(file)
299
+
300
+ # Load 60 random questions into the input area by default
301
+ def load_default_questions():
302
+ random.shuffle(json_data)
303
+ default_questions = [dialog[0]['content'] for dialog in json_data[:60] if 'content' in dialog[0]]
304
+ return "\n".join(default_questions)
305
+
306
+ # Load default questions on button click
307
+ def load_questions_action():
308
+ return load_default_questions()
309
+
310
+ # Gradio interface
311
+ css = """
312
+ body, html {
313
+ height: 100vh;
314
+ margin: 0;
315
+ }
316
+
317
+ .gradio-container {
318
+ height: 100vh;
319
+ }
320
+
321
+ #main-row {
322
+ height: 100%;
323
+ display: flex;
324
+ }
325
+
326
+ #control-panel{
327
+ height: 100%;
328
+ box-sizing: border-box;
329
+ display: flex;
330
+ flex-direction: column;
331
+ overflow: hidden;
332
+ flex: 1;
333
+ }
334
+
335
+ #control-panel, #formatted-responses-container {
336
+ height: 100%;
337
+ box-sizing: border-box;
338
+ display: flex;
339
+ flex-direction: column;
340
+ overflow: hidden;
341
+ flex: 1;
342
+ }
343
+
344
+ #control-panel {
345
+ flex: 1;
346
+ padding-bottom: 1vh; /* Add some padding to the bottom */
347
+ }
348
+
349
+ #custom-questions-text {
350
+ height: 30vh; /* Fixed height for custom questions text */
351
+ overflow-y: auto;
352
+ }
353
+
354
+ #metrics-panel {
355
+ display: flex;
356
+ flex-wrap: wrap;
357
+ flex-shrink: 0;
358
+ height: auto; /* Let the panel size adjust based on its content */
359
+ }
360
+
361
+ #metrics-panel .metric {
362
+ flex: 1 1 48%;
363
+ min-width: 10vw;
364
+ box-sizing: border-box;
365
+ }
366
+
367
+ #buttons-container {
368
+ display: flex;
369
+ justify-content: space-between;
370
+ height: 6vh; /* Fixed height for buttons container */
371
+ flex-shrink: 0;
372
+ margin-bottom: 1vh; /* Add margin to prevent cutting off */
373
+ }
374
+ """
375
+
376
+ with gr.Blocks(css=css) as app:
377
+ with gr.Row(elem_id="main-row", equal_height=True):
378
+ with gr.Column(elem_id="control-panel", scale=1):
379
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=2048, step=128, value=512)
380
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
381
+ custom_questions_text = gr.Textbox(
382
+ label="Custom Questions",
383
+ placeholder="Type your custom questions here, one per line... \nOr press the \"Load Default Questions\" button to load 60 random default questions. \nAdd a question by adding a new line, or delete lines to decrease the number of questions.",
384
+ autoscroll=False,
385
+ container=False,
386
+ lines=5,
387
+ elem_id="custom-questions-text"
388
+ )
389
+ with gr.Row(elem_id="metrics-panel"):
390
+ time_taken = gr.Number(label="Time Taken (seconds)", interactive=False, elem_classes=["metric"])
391
+ tokens_per_second = gr.Number(label="Tokens per Second", interactive=False, elem_classes=["metric"])
392
+ ttft = gr.Number(label="Time to First Token (seconds)", interactive=False, elem_classes=["metric"])
393
+ total_memory_consumption = gr.Number(label="Memory Consumption (MB)", interactive=False, elem_classes=["metric"])
394
+ num_dialogs = gr.Number(label="Dialogs Processed", interactive=False, elem_classes=["metric"])
395
+ avg_memory_consumption = gr.Number(label="Mem. Consumption per Question (MB)", interactive=False, elem_classes=["metric"])
396
+ with gr.Row(elem_id="buttons-container"):
397
+ load_questions_btn = gr.Button("Load Default Questions")
398
+ demo_btn = gr.Button("Run Inference", elem_id="run-inference-btn", variant="primary")
399
+
400
+ formatted_responses = gr.Textbox(
401
+ label="Formatted Responses",
402
+ elem_id="formatted-responses",
403
+ value="No responses yet. Run the inference to see results.",
404
+ lines=37,
405
+ container=False,
406
+ autoscroll=False,
407
+ show_copy_button=True
408
+ )
409
+
410
+ load_questions_btn.click(fn=load_questions_action, inputs=[], outputs=custom_questions_text)
411
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text], outputs=[time_taken, tokens_per_second, ttft, avg_memory_consumption, num_dialogs, total_memory_consumption, formatted_responses])
412
+
413
+ if __name__ == "__main__":
414
+ print("Loading model and tokenizer on startup...")
415
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
416
+ print("Model and tokenizer loaded. Starting Gradio interface...")
417
+ username = os.getenv("AUTH_USERNAME")
418
+ password = os.getenv("AUTH_PASSWORD")
419
+ app.launch(auth=(username, password))
backups/app_backup.py ADDED
@@ -0,0 +1,63 @@
1
+ import gradio as gr
2
+ from huggingface_hub import InferenceClient
3
+
4
+ """
5
+ For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
6
+ """
7
+ client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
8
+
9
+
10
+ def respond(
11
+ message,
12
+ history: list[tuple[str, str]],
13
+ system_message,
14
+ max_tokens,
15
+ temperature,
16
+ top_p,
17
+ ):
18
+ messages = [{"role": "system", "content": system_message}]
19
+
20
+ for val in history:
21
+ if val[0]:
22
+ messages.append({"role": "user", "content": val[0]})
23
+ if val[1]:
24
+ messages.append({"role": "assistant", "content": val[1]})
25
+
26
+ messages.append({"role": "user", "content": message})
27
+
28
+ response = ""
29
+
30
+ for message in client.chat_completion(
31
+ messages,
32
+ max_tokens=max_tokens,
33
+ stream=True,
34
+ temperature=temperature,
35
+ top_p=top_p,
36
+ ):
37
+ token = message.choices[0].delta.content
38
+
39
+ response += token
40
+ yield response
41
+
42
+ """
43
+ For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/chatinterface
44
+ """
45
+ demo = gr.ChatInterface(
46
+ respond,
47
+ additional_inputs=[
48
+ gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
49
+ gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
50
+ gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
51
+ gr.Slider(
52
+ minimum=0.1,
53
+ maximum=1.0,
54
+ value=0.95,
55
+ step=0.05,
56
+ label="Top-p (nucleus sampling)",
57
+ ),
58
+ ],
59
+ )
60
+
61
+
62
+ if __name__ == "__main__":
63
+ demo.launch()
backups/app_local_enabled_streaming_but_inefficient.py ADDED
@@ -0,0 +1,205 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+
9
+ # Environment variables
10
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
11
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
12
+
13
+ # Global variables to store the model and tokenizer
14
+ model = None
15
+ tokenizer = None
16
+
17
+ # Load model and tokenizer
18
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
19
+ global model, tokenizer
20
+ if model is None or tokenizer is None:
21
+ print("Loading model and tokenizer...")
22
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
23
+ special_tokens = {"pad_token": "<PAD>"}
24
+ tokenizer.add_special_tokens(special_tokens)
25
+
26
+ config = AutoConfig.from_pretrained(model_name)
27
+ if kv_bits != "unquantized":
28
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
29
+ setattr(config, "quantizer_path", quantizer_path)
30
+
31
+ if dtype == "bf16":
32
+ dtype = torch.bfloat16
33
+ elif dtype == "fp16":
34
+ dtype = torch.float16
35
+ elif dtype == "fp32":
36
+ dtype = torch.float32
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
39
+
40
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
41
+ model.resize_token_embeddings(len(tokenizer))
42
+
43
+ tokenizer.padding_side = "left"
44
+ model.config.pad_token_id = tokenizer.pad_token_id
45
+
46
+ return model, tokenizer
47
+
48
+ # Format response
49
+ def format_response(dialog, response):
50
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
51
+ answer = response.split("assistant")[-1].strip()
52
+ return {"question": question, "answer": answer}
53
+
54
+ # Load questions
55
+ def load_questions(prompts_path, custom_questions):
56
+ with open(prompts_path, "r") as file:
57
+ dialogs = json.load(file)
58
+
59
+ selected_dialogs = []
60
+
61
+ if custom_questions:
62
+ for question in custom_questions:
63
+ if question.strip():
64
+ custom_dialog = [{"role": "user", "content": question}]
65
+ selected_dialogs.append(custom_dialog)
66
+
67
+ num_questions = 60 - len(selected_dialogs)
68
+ random.shuffle(dialogs)
69
+ selected_dialogs.extend(dialogs[:num_questions])
70
+
71
+ return selected_dialogs[:60]
72
+
73
+ # Inference
74
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, top_k, progress=gr.Progress()):
75
+ print("Starting inference...")
76
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
77
+ batch_inputs = [
78
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
79
+ for dialog in dialogs
80
+ ]
81
+
82
+ responses = [''] * len(dialogs)
83
+ start_time = time.time()
84
+
85
+ batch_size = 30 # Set batch size for processing, this can be adjusted
86
+ num_dialogs = len(dialogs)
87
+ total_time = 0
88
+ total_tokens = 0
89
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
90
+
91
+ ttft = None
92
+ tokens_per_step = 25 # Number of tokens to generate per step for efficiency
93
+
94
+ for batch_idx in range(num_batches):
95
+ start_idx = batch_idx * batch_size
96
+ end_idx = min(start_idx + batch_size, num_dialogs)
97
+ batch = batch_inputs[start_idx:end_idx]
98
+
99
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
100
+ input_ids = encoded_inputs["input_ids"].to(model.device)
101
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
102
+
103
+ generated_ids = input_ids
104
+
105
+ while generated_ids.shape[1] < num_new_tokens:
106
+ with torch.no_grad():
107
+ outputs = model(generated_ids, attention_mask=attention_mask)
108
+ next_token_logits = outputs.logits[:, -1, :]
109
+ # Apply temperature scaling
110
+ next_token_logits = next_token_logits / temperature
111
+ # Apply top-k sampling
112
+ top_k_values, top_k_indices = torch.topk(next_token_logits, top_k, dim=-1)
113
+ next_token_probs = torch.nn.functional.softmax(top_k_values, dim=-1)
114
+ next_tokens = torch.multinomial(next_token_probs, num_samples=1)
115
+ next_tokens = torch.gather(top_k_indices, -1, next_tokens)
116
+ generated_ids = torch.cat([generated_ids, next_tokens], dim=1)
117
+
118
+ if ttft is None:
119
+ ttft = time.perf_counter() - start_time
120
+
121
+ decoded_outputs = [tokenizer.decode(generated_ids[i], skip_special_tokens=True) for i in range(generated_ids.size(0))]
122
+
123
+ for i, response in enumerate(decoded_outputs):
124
+ formatted_response = format_response(dialogs[start_idx + i], response)
125
+ responses[start_idx + i] = f"**Question**: {formatted_response['question']}\n\n**Answer**: {formatted_response['answer']}"
126
+
127
+ formatted_responses = "\n\n---\n\n".join(responses)
128
+ yield {
129
+ "Formatted Responses": formatted_responses
130
+ }
131
+
132
+ progress((batch_idx * num_new_tokens + generated_ids.shape[1]) / (num_batches * num_new_tokens), desc="Generating tokens")
133
+
134
+ # Check if end-of-sequence token is generated
135
+ if any(tokenizer.eos_token_id in output for output in generated_ids.tolist()):
136
+ break
137
+
138
+ # Update attention mask for the next tokens
139
+ attention_mask = torch.cat([attention_mask, torch.ones((attention_mask.size(0), 1)).to(model.device)], dim=1)
140
+
141
+ # Stream intermediate results every 0.5 seconds
142
+ time.sleep(0.5)
143
+
144
+ total_elapsed_time = time.time() - start_time
145
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
146
+ print(f"Inference completed in {total_elapsed_time:.2f} seconds.")
147
+
148
+ # Demo function
149
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits, top_k, progress=gr.Progress()):
150
+ custom_questions = custom_questions_text.split("\n")
151
+ print("Loading questions...")
152
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
153
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
154
+
155
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, top_k, progress=progress)
156
+
157
+ for result in result_gen:
158
+ if result:
159
+ formatted_response = result["Formatted Responses"]
160
+ yield None, None, None, formatted_response
161
+
162
+ # Load JSON data
163
+ with open("chats_sys_none.json", "r") as file:
164
+ json_data = json.load(file)
165
+ json_data_str = json.dumps(json_data, indent=2)
166
+
167
+ # Show JSON function
168
+ def show_json():
169
+ return json_data_str
170
+
171
+ # Gradio interface
172
+ app = gr.Blocks()
173
+
174
+ with app:
175
+ with gr.Tab("LLM Inference Demo"):
176
+ with gr.Row():
177
+ with gr.Column():
178
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
179
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
180
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5)
181
+ kv_bits = gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
182
+ top_k = gr.Slider(label="Top K", minimum=1, maximum=50, step=1, value=10)
183
+
184
+ with gr.Column():
185
+ time_taken = gr.Number(label="Time Taken (seconds)")
186
+ tokens_per_second = gr.Number(label="Tokens per Second")
187
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)")
188
+
189
+ with gr.Row():
190
+ formatted_responses = gr.Markdown(label="Formatted Responses")
191
+
192
+ demo_btn = gr.Button("Run Inference")
193
+
194
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text, kv_bits, top_k], outputs=[time_taken, tokens_per_second, ttft, formatted_responses])
195
+
196
+ with gr.Tab("Show JSON"):
197
+ json_output = gr.HTML("<pre>{}</pre>".format(json_data_str))
198
+ json_interface = gr.Interface(fn=show_json, inputs=[], outputs=[json_output], live=False)
199
+ json_interface.render()
200
+
201
+ if __name__ == "__main__":
202
+ print("Loading model and tokenizer on startup...")
203
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
204
+ print("Model and tokenizer loaded. Starting Gradio interface...")
205
+ app.queue(default_concurrency_limit=5).launch()
backups/app_local_v0.py ADDED
@@ -0,0 +1,187 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+
8
+ # Environment variables
9
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
10
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
11
+
12
+ # Global variables to store the model and tokenizer
13
+ model = None
14
+ tokenizer = None
15
+
16
+ # Load model and tokenizer
17
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
18
+ global model, tokenizer
19
+ if model is None or tokenizer is None:
20
+ print("Loading model and tokenizer...")
21
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
22
+ special_tokens = {"pad_token": "<PAD>"}
23
+ tokenizer.add_special_tokens(special_tokens)
24
+
25
+ config = AutoConfig.from_pretrained(model_name)
26
+ if kv_bits != "unquantized":
27
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
28
+ setattr(config, "quantizer_path", quantizer_path)
29
+
30
+ if dtype == "bf16":
31
+ dtype = torch.bfloat16
32
+ elif dtype == "fp16":
33
+ dtype = torch.float16
34
+ elif dtype == "fp32":
35
+ dtype = torch.float32
36
+
37
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
38
+
39
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
40
+ model.resize_token_embeddings(len(tokenizer))
41
+
42
+ tokenizer.padding_side = "left"
43
+ model.config.pad_token_id = tokenizer.pad_token_id
44
+
45
+ return model, tokenizer
46
+
47
+ # Format response
48
+ def format_response(dialog, response):
49
+ formatted_dialog = dialog.copy()
50
+ formatted_dialog.append({"role": "assistant", "content": response})
51
+ return formatted_dialog
52
+
53
+ # Load questions
54
+ def load_questions(prompts_path, num_questions, custom_question):
55
+ with open(prompts_path, "r") as file:
56
+ dialogs = json.load(file)
57
+
58
+ if custom_question and custom_question.strip():
59
+ custom_dialog = [{"role": "user", "content": custom_question}]
60
+ dialogs.insert(0, custom_dialog)
61
+
62
+ dialogs = dialogs[:num_questions]
63
+ return dialogs
64
+
65
+ # Inference
66
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits):
67
+ print("Starting inference...")
68
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
69
+ batch_inputs = [
70
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
71
+ for dialog in dialogs
72
+ ]
73
+
74
+ responses = []
75
+ start_time = time.time()
76
+
77
+ batch_size = 20 # Set batch size for processing, this can be adjusted
78
+ num_dialogs = len(dialogs)
79
+ total_time = 0
80
+ total_tokens = 0
81
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
82
+
83
+ for batch_idx in range(num_batches):
84
+ start_idx = batch_idx * batch_size
85
+ end_idx = min(start_idx + batch_size, num_dialogs)
86
+ batch = batch_inputs[start_idx:end_idx]
87
+
88
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
89
+ input_ids = encoded_inputs["input_ids"].to(model.device)
90
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
91
+
92
+ with torch.no_grad():
93
+ torch.cuda.synchronize()
94
+ batch_start_time = time.perf_counter()
95
+
96
+ output_tokens = model.generate(
97
+ input_ids,
98
+ attention_mask=attention_mask,
99
+ max_new_tokens=num_new_tokens,
100
+ do_sample=True,
101
+ temperature=temperature,
102
+ pad_token_id=tokenizer.pad_token_id,
103
+ eos_token_id=tokenizer.eos_token_id
104
+ )
105
+
106
+ torch.cuda.synchronize()
107
+ batch_end_time = time.perf_counter()
108
+
109
+ batch_time = batch_end_time - batch_start_time
110
+ total_time += batch_time
111
+ total_tokens += output_tokens.numel()
112
+
113
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
114
+
115
+ for i, response in enumerate(decoded_outputs):
116
+ original_dialog = dialogs[start_idx + i]
117
+ formatted_response = format_response(original_dialog, response)
118
+ responses.append(formatted_response)
119
+
120
+ elapsed_time = time.time() - start_time
121
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
122
+
123
+ results = {
124
+ "Responses": responses,
125
+ "Time Taken (seconds)": elapsed_time,
126
+ "Tokens per Second": total_tokens / total_time if total_time > 0 else 0
127
+ }
128
+
129
+ return results
130
+
131
+ # Demo function
132
+ def demo(num_new_tokens, temperature, num_questions, custom_question, kv_bits):
133
+ print("Loading questions...")
134
+ dialogs = load_questions("chats_sys_none.json", num_questions, custom_question)
135
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
136
+ results = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits)
137
+ return results
138
+
139
+ # Load JSON data
140
+ with open("chats_sys_none.json", "r") as file:
141
+ json_data = json.load(file)
142
+ json_data_str = json.dumps(json_data, indent=2)
143
+
144
+ # Show JSON function
145
+ def show_json():
146
+ return json_data_str
147
+
148
+ # Gradio interface
149
+ interface = gr.Interface(
150
+ fn=demo,
151
+ inputs=[
152
+ gr.Slider(label="Number of New Tokens", minimum=1, maximum=1024, step=1, value=512),
153
+ gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4),
154
+ gr.Slider(minimum=20, maximum=100, step=1, label="Number of Questions", value=20),
155
+ gr.Textbox(label="Custom Question", placeholder="Type your custom question here..."),
156
+ gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
157
+ ],
158
+ outputs=[
159
+ gr.JSON(label="Responses and Time Taken")
160
+ ],
161
+ title="LLM Inference Demo",
162
+ description="A demo for running LLM inference using Gradio and Hugging Face.",
163
+ live=False
164
+ )
165
+
166
+ json_interface = gr.Interface(
167
+ fn=show_json,
168
+ inputs=[],
169
+ outputs=[
170
+ gr.HTML("<pre>{}</pre>".format(json_data_str))
171
+ ],
172
+ live=False
173
+ )
174
+
175
+ app = gr.Blocks()
176
+
177
+ with app:
178
+ with gr.Tab("LLM Inference Demo"):
179
+ interface.render()
180
+ with gr.Tab("Show JSON"):
181
+ json_interface.render()
182
+
183
+ if __name__ == "__main__":
184
+ print("Loading model and tokenizer on startup...")
185
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
186
+ print("Model and tokenizer loaded. Starting Gradio interface...")
187
+ app.launch()
backups/app_local_v1-1.py ADDED
@@ -0,0 +1,228 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+ from PIL import Image
9
+
10
+ # Environment variables
11
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
12
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
13
+
14
+ # Global variables to store the model and tokenizer
15
+ model = None
16
+ tokenizer = None
17
+
18
+ # Load model and tokenizer
19
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
20
+ global model, tokenizer
21
+ if model is None or tokenizer is None:
22
+ print("Loading model and tokenizer...")
23
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
24
+ special_tokens = {"pad_token": "<PAD>"}
25
+ tokenizer.add_special_tokens(special_tokens)
26
+
27
+ config = AutoConfig.from_pretrained(model_name)
28
+ if kv_bits != "unquantized":
29
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
30
+ setattr(config, "quantizer_path", quantizer_path)
31
+
32
+ if dtype == "bf16":
33
+ dtype = torch.bfloat16
34
+ elif dtype == "fp16":
35
+ dtype = torch.float16
36
+ elif dtype == "fp32":
37
+ dtype = torch.float32
38
+
39
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
40
+
41
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
42
+ model.resize_token_embeddings(len(tokenizer))
43
+
44
+ tokenizer.padding_side = "left"
45
+ model.config.pad_token_id = tokenizer.pad_token_id
46
+
47
+ return model, tokenizer
48
+
49
+ # Format response
50
+ def format_response(dialog, response):
51
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
52
+ answer = response.split("assistant")[-1].strip()
53
+ return {"question": question, "answer": answer}
54
+
55
+ # Load questions
56
+ def load_questions(prompts_path, custom_questions):
57
+ with open(prompts_path, "r") as file:
58
+ dialogs = json.load(file)
59
+
60
+ selected_dialogs = []
61
+
62
+ if custom_questions:
63
+ for question in custom_questions:
64
+ if question.strip():
65
+ custom_dialog = [{"role": "user", "content": question}]
66
+ selected_dialogs.append(custom_dialog)
67
+
68
+ num_questions = 60 - len(selected_dialogs)
69
+ random.shuffle(dialogs)
70
+ selected_dialogs.extend(dialogs[:num_questions])
71
+
72
+ return selected_dialogs[:60]
73
+
74
+ # Inference
75
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
76
+ print("Starting inference...")
77
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
78
+ batch_inputs = [
79
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
80
+ for dialog in dialogs
81
+ ]
82
+
83
+ responses = []
84
+ start_time = time.time()
85
+
86
+ batch_size = 30 # Set batch size for processing, this can be adjusted
87
+ num_dialogs = len(dialogs)
88
+ total_time = 0
89
+ total_tokens = 0
90
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
91
+
92
+ for batch_idx in range(num_batches):
93
+ start_idx = batch_idx * batch_size
94
+ end_idx = min(start_idx + batch_size, num_dialogs)
95
+ batch = batch_inputs[start_idx:end_idx]
96
+
97
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
98
+ input_ids = encoded_inputs["input_ids"].to(model.device)
99
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
100
+
101
+ with torch.no_grad():
102
+ torch.cuda.synchronize()
103
+ batch_start_time = time.perf_counter()
104
+
105
+ # Generate responses and measure time to first token
106
+ output_tokens = model.generate(
107
+ input_ids,
108
+ attention_mask=attention_mask,
109
+ max_new_tokens=num_new_tokens,
110
+ do_sample=True,
111
+ temperature=temperature,
112
+ pad_token_id=tokenizer.pad_token_id,
113
+ eos_token_id=tokenizer.eos_token_id
114
+ )
115
+
116
+ torch.cuda.synchronize()
117
+ batch_end_time = time.perf_counter()
118
+
119
+ batch_time = batch_end_time - batch_start_time
120
+ total_time += batch_time
121
+ total_tokens += output_tokens.numel()
122
+
123
+ # Calculate TTFT
124
+ if batch_idx == 0:
125
+ ttft = batch_time / input_ids.size(0) # Approximate TTFT: first-batch generation time averaged per sequence, not a true first-token latency
126
+
127
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
128
+
129
+ for i, response in enumerate(decoded_outputs):
130
+ original_dialog = dialogs[start_idx + i]
131
+ formatted_response = format_response(original_dialog, response)
132
+ responses.append(formatted_response)
133
+
134
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
135
+ yield formatted_responses
136
+ progress((batch_idx + 1) / num_batches, desc="Processing batches")
137
+
138
+ elapsed_time = time.time() - start_time
139
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
140
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
141
+
142
+ yield {
143
+ "Time Taken (seconds)": elapsed_time,
144
+ "Tokens per Second": tokens_per_second,
145
+ "Time to First Token (TTFT, seconds)": ttft,
146
+ "Formatted Responses": formatted_responses
147
+ }
148
+
149
+ # Demo function
150
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits, progress=gr.Progress()):
151
+ custom_questions = custom_questions_text.split("\n")
152
+ print("Loading questions...")
153
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
154
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
155
+
156
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
157
+
158
+ formatted_responses = ""
159
+ for result in result_gen:
160
+ if isinstance(result, str):
161
+ formatted_responses = result
162
+ yield None, None, None, formatted_responses
163
+ else:
164
+ time_taken = result["Time Taken (seconds)"]
165
+ tokens_per_second = result["Tokens per Second"]
166
+ ttft = result["Time to First Token (TTFT, seconds)"]
167
+ formatted_responses = result["Formatted Responses"]
168
+ yield time_taken, tokens_per_second, ttft, formatted_responses
169
+
170
+ # Load JSON data
171
+ with open("chats_sys_none.json", "r") as file:
172
+ json_data = json.load(file)
173
+
174
+ # Load 50 random questions into the input area by default
175
+ def load_default_questions():
176
+ random.shuffle(json_data)
177
+ default_questions = [dialog[0]['content'] for dialog in json_data[:50] if 'content' in dialog[0]]
178
+ return "\n".join(default_questions)
179
+
180
+ # Gradio interface
181
+ demo_interface = gr.Interface(
182
+ fn=demo,
183
+ inputs=[
184
+ gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512),
185
+ gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4),
186
+ gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5),
187
+ gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
188
+ ],
189
+ outputs=[
190
+ gr.Number(label="Time Taken (seconds)", interactive=False),
191
+ gr.Number(label="Tokens per Second", interactive=False),
192
+ gr.Number(label="Time to First Token (TTFT, seconds)", interactive=False),
193
+ gr.Markdown(label="Formatted Responses", elem_id="scrollable-output")
194
+ ],
195
+ live=False
196
+ )
197
+
198
+ # Gradio Blocks for additional controls
199
+ with gr.Blocks(css=".scrollable-output {height: 400px; overflow-y: auto; padding: 10px; border: 1px solid #ccc;}") as app:
200
+ with gr.Column():
201
+ gr.Markdown("### LLM Inference Demo")
202
+ with gr.Row():
203
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
204
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
205
+ kv_bits = gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
206
+
207
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5)
208
+ load_questions_btn = gr.Button("Load Default Questions")
209
+
210
+ with gr.Row():
211
+ time_taken = gr.Number(label="Time Taken (seconds)", interactive=False)
212
+ tokens_per_second = gr.Number(label="Tokens per Second", interactive=False)
213
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)", interactive=False)
214
+
215
+ formatted_responses = gr.Markdown(label="Formatted Responses", elem_id="scrollable-output")
216
+
217
+ demo_btn = gr.Button("Run Inference")
218
+
219
+ load_questions_btn.click(fn=lambda: load_default_questions(), inputs=[], outputs=custom_questions_text)
220
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text, kv_bits], outputs=[time_taken, tokens_per_second, ttft, formatted_responses])
221
+
222
+ if __name__ == "__main__":
223
+ print("Checking if the image path is correct...")
224
+ print("memory_usage.png found" if os.path.exists("memory_usage.png") else "memory_usage.png not found") # inline check; check_image_path() is not defined in this file
225
+ print("Loading model and tokenizer on startup...")
226
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
227
+ print("Model and tokenizer loaded. Starting Gradio interface...")
228
+ app.launch()
backups/app_local_v1.py ADDED
@@ -0,0 +1,375 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import random
5
+ import torch
6
+ import re
7
+ import math
8
+ import gradio as gr
9
+ import numpy as np
10
+ from collections import defaultdict
11
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
12
+
13
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
14
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
15
+
16
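+ # Context manager that records how much peak CUDA memory (in MB) grows while the wrapped block runs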
+ class TorchTracemalloc:
17
+ track_memory_consumption = []
18
+
19
+ def __enter__(self):
20
+ self.begin = torch.cuda.memory_allocated()
21
+ torch.cuda.reset_max_memory_allocated()
22
+ return self
23
+
24
+ def __exit__(self, *exc):
25
+ peak = torch.cuda.max_memory_allocated()
26
+ peaked = (peak - self.begin) // 1024 ** 2
27
+ TorchTracemalloc.track_memory_consumption.append(peaked)
28
+ print(f"Memory consumed: {peaked} MB") # Debugging print
29
+
30
+ def format_response(dialog, response):
31
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
32
+ return {"question": question, "answer": response}
33
+
34
+ # Global variables to store the model and tokenizer
35
+ global_model = None
36
+ global_tokenizer = None
37
+
38
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
39
+ global global_model, global_tokenizer
40
+
41
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
42
+ special_tokens = {"pad_token": "<PAD>"}
43
+ tokenizer.add_special_tokens(special_tokens)
44
+
45
+ config = AutoConfig.from_pretrained(model_name)
46
+ if kv_bits != "unquantized":
47
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
48
+ setattr(config, "quantizer_path", quantizer_path)
49
+
50
+ if dtype == "bf16":
51
+ dtype = torch.bfloat16
52
+ elif dtype == "fp16":
53
+ dtype = torch.float16
54
+ elif dtype == "fp32":
55
+ dtype = torch.float32
56
+
57
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
58
+
59
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
60
+ model.resize_token_embeddings(len(tokenizer))
61
+
62
+ tokenizer.padding_side = "left"
63
+ model.config.pad_token_id = tokenizer.pad_token_id
64
+
65
+ global_model = model
66
+ global_tokenizer = tokenizer
67
+
68
+ def load_questions(prompts_path, custom_questions):
69
+ with open(prompts_path, "r") as file:
70
+ dialogs = json.load(file)
71
+
72
+ selected_dialogs = []
73
+ if custom_questions:
74
+ for question in custom_questions:
75
+ if question.strip():
76
+ custom_dialog = [{"role": "user", "content": question}]
77
+ selected_dialogs.append(custom_dialog)
78
+
79
+ num_questions = 60 - len(selected_dialogs)
80
+ random.shuffle(dialogs)
81
+ selected_dialogs.extend(dialogs[:num_questions])
82
+
83
+ return selected_dialogs[:60]
84
+
85
+ def markdown_to_plain_text(markdown_text):
86
+ # Convert markdown bold (**) to plain text uppercase
87
+ markdown_text = re.sub(r'\*\*(.*?)\*\*', lambda m: m.group(1).upper(), markdown_text)
88
+ # Convert markdown italics (*) to plain text
89
+ markdown_text = re.sub(r'\*(.*?)\*', r'\1', markdown_text)
90
+ # Remove markdown headers (###)
91
+ markdown_text = re.sub(r'### ', '', markdown_text)
92
+ # Convert markdown lists (- or *)
93
+ markdown_text = re.sub(r'^\s*[-*]\s+', '', markdown_text, flags=re.MULTILINE)
94
+ # Remove remaining markdown formatting
95
+ markdown_text = re.sub(r'[`~>]', '', markdown_text)
96
+ return markdown_text
97
+
98
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
99
+ print("Starting inference...")
100
+ global global_model, global_tokenizer
101
+
102
+ model = global_model
103
+ tokenizer = global_tokenizer
104
+
105
+ batch_inputs = [
106
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
107
+ for dialog in dialogs
108
+ ]
109
+
110
+ responses = []
111
+ start_time = time.time()
112
+ batch_size = 60 # Adjust batch size based on GPU capacity
113
+ num_dialogs = len(dialogs)
114
+ # total_time = 0
115
+ # total_tokens = 0
116
+ # total_ttft = 0
117
+ # num_batches = (num_dialogs + batch_size - 1) // batch_size
118
+
119
+ actual_batch_size = min(batch_size, num_dialogs)
120
+ total_time = 0
121
+ total_tokens = 0
122
+ total_ttft = 0
123
+ num_batches = math.ceil(num_dialogs / actual_batch_size)
124
+
125
+ memory_avg = []
126
+ tokens_per_sec_avg = []
127
+ time_to_first_token_avg = []
128
+ responses_by_batch_size = defaultdict(list)
129
+ batch_generation_time = 0
130
+ total_generation_time = 0
131
+
132
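+ # Stop generation at either the tokenizer's EOS token or Llama-3's end-of-turn token <|eot_id|>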
+ terminators = [
133
+ tokenizer.eos_token_id,
134
+ tokenizer.convert_tokens_to_ids("<|eot_id|>"),
135
+ ]
136
+
137
+ with TorchTracemalloc() as tt:
138
+ for i in range(0, num_dialogs, actual_batch_size):
139
+ # for batch_idx in range(num_batches):
140
+ batch = batch_inputs[i : i + actual_batch_size]
141
+ try:
142
+ encoded_inputs = tokenizer(
143
+ batch,
144
+ padding=True,
145
+ truncation=False,
146
+ return_tensors="pt",
147
+ )
148
+
149
+ input_ids = encoded_inputs["input_ids"].to(model.device)
150
+ attention_mask = encoded_inputs["attention_mask"].to(
151
+ model.device
152
+ )
153
+
154
+ torch.cuda.synchronize()
155
+ start_time = time.perf_counter()
156
+
157
+ with torch.no_grad():
158
+ output_tokens = model.generate(
159
+ input_ids,
160
+ attention_mask=attention_mask,
161
+ max_new_tokens=num_new_tokens,
162
+ num_return_sequences=1,
163
+ do_sample=True,
164
+ temperature=temperature,
165
+ pad_token_id=tokenizer.pad_token_id,
166
+ eos_token_id=terminators,
167
+ )
168
+
169
+ torch.cuda.synchronize()
170
+ end_time = time.perf_counter()
171
+
172
+ batch_time = end_time - start_time
173
+ total_time += batch_time
174
+ batch_generation_time += (
175
+ batch_time # Add to batch generation time
176
+ )
177
+ total_generation_time += (
178
+ batch_time # Add to total generation time
179
+ )
180
+ total_tokens += output_tokens.numel()
181
+
182
+ if i == 0:
183
+ total_ttft = batch_time
184
+
185
+ # if batch_idx == 0:
186
+ # total_ttft = batch_time
187
+
188
+ decoded_outputs = tokenizer.batch_decode(
189
+ output_tokens, skip_special_tokens=True
190
+ )
191
+ # decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
192
+
193
+ for j, response in enumerate(decoded_outputs):
194
+ original_dialog = dialogs[i + j]
195
+ formatted_responses = format_response(
196
+ original_dialog, response
197
+ )
198
+ responses.append(formatted_responses)
199
+ # responses_by_batch_size[batch_size].append(
200
+ # formatted_response
201
+ # )
202
+ # Format the responses
203
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
204
+ plain_text_responses = markdown_to_plain_text(formatted_responses)
205
+ yield plain_text_responses
206
+ progress(i, desc="Processing batches")
207
+
208
+ torch.cuda.empty_cache()
209
+
210
+ except Exception as e:
211
+ print(
212
+ f"Error processing batch {i//actual_batch_size + 1}: {str(e)}"
213
+ )
214
+ continue
215
+
216
+
217
+ elapsed_time = total_time
218
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
219
+ # avg_memory_consumption = np.mean(TorchTracemalloc.track_memory_consumption)
220
+ total_memory_consumption = np.sum(TorchTracemalloc.track_memory_consumption)
221
+ avg_memory_consumption = total_memory_consumption/num_dialogs
222
+
223
+ # Use actual_batch_size in calculations
224
+ ttft = (
225
+ total_ttft / actual_batch_size if actual_batch_size > 0 else 0
226
+ )
227
+
228
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
229
+
230
+ yield {
231
+ "Time Taken (seconds)": elapsed_time,
232
+ "Tokens per Second": tokens_per_second,
233
+ "Time to First Token (TTFT, seconds)": ttft,
234
+ # "Formatted Responses": formatted_responses,
235
+ "Formatted Responses": plain_text_responses,
236
+ "Average Memory Consumption per Question (MB)": avg_memory_consumption,
237
+ "Total Memory Consumption (MB)": total_memory_consumption
238
+ }
239
+
240
+ # Demo function
241
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits=1, progress=gr.Progress()):
242
+ custom_questions = custom_questions_text.split("\n")
243
+ print("Loading questions...")
244
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
245
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
246
+
247
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
248
+
249
+ formatted_responses = ""
250
+ for result in result_gen:
251
+ if isinstance(result, str):
252
+ formatted_responses = result
253
+ yield None, None, None, None, None, formatted_responses  # six values to match the outputs wired to demo_btn.click
254
+ else:
255
+ time_taken = result["Time Taken (seconds)"]
256
+ tokens_per_second = result["Tokens per Second"]
257
+ ttft = result["Time to First Token (TTFT, seconds)"]
258
+ avg_memory_consumption = result["Average Memory Consumption per Question (MB)"]
259
+ total_memory_consumption = result["Total Memory Consumption (MB)"]
260
+ formatted_responses = result["Formatted Responses"]
261
+ yield time_taken, tokens_per_second, ttft, avg_memory_consumption, total_memory_consumption, formatted_responses
262
+
263
+ # Load JSON data
264
+ with open("chats_sys_none.json", "r") as file:
265
+ json_data = json.load(file)
266
+
267
+ # Load 50 random questions into the input area by default
268
+ def load_default_questions():
269
+ random.shuffle(json_data)
270
+ default_questions = [dialog[0]['content'] for dialog in json_data[:50] if 'content' in dialog[0]]
271
+ return "\n".join(default_questions)
272
+
273
+ # Load default questions on button click
274
+ def load_questions_action():
275
+ return load_default_questions()
276
+
277
+ # Gradio interface
278
+ css = """
279
+ body, html {
280
+ height: 100vh;
281
+ margin: 0;
282
+ }
283
+
284
+ .gradio-container {
285
+ height: 100vh;
286
+ }
287
+
288
+ #main-row {
289
+ height: 90vh;
290
+ display: flex;
291
+ }
292
+
293
+ #control-panel, #formatted-responses-container {
294
+ height: 90vh;
295
+ box-sizing: border-box;
296
+ display: flex;
297
+ flex-direction: column;
298
+ overflow: hidden;
299
+ flex: 1; /* Ensure equal width */
300
+ }
301
+
302
+ #control-panel {
303
+ flex: 1; /* Ensure equal height */
304
+ }
305
+
306
+ #custom-questions-text {
307
+ flex-grow: 1;
308
+ overflow-y: auto;
309
+ max-height: 30vh; /* Limit height of custom questions text */
310
+ }
311
+
312
+ #metrics-panel {
313
+ display: flex;
314
+ flex-wrap: wrap;
315
+ gap: 1vh;
316
+ margin-bottom: 1vh;
317
+ flex-shrink: 0;
318
+ height: auto; /* Let the panel size adjust based on its content */
319
+ }
320
+
321
+ #metrics-panel .metric {
322
+ flex: 1 1 48%;
323
+ min-width: 10vw;
324
+ box-sizing: border-box;
325
+ }
326
+
327
+ #buttons-container {
328
+ display: flex;
329
+ justify-content: space-between;
330
+ min-height: 6vh; /* Minimum height for buttons container */
331
+ flex-shrink: 0;
332
+ }
333
+ """
334
+
335
+ with gr.Blocks(css=css) as app:
336
+ with gr.Row(elem_id="main-row", equal_height=True):
337
+ with gr.Column(elem_id="control-panel"):
338
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=2048, step=128, value=512)
339
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
340
+ custom_questions_text = gr.Textbox(
341
+ label="Custom Questions",
342
+ placeholder="Type your custom questions here, one per line...",
343
+ autoscroll=False,
344
+ container=False,
345
+ lines=5,
346
+ elem_id="custom-questions-text"
347
+ )
348
+ with gr.Row(elem_id="metrics-panel"):
349
+ time_taken = gr.Number(label="Time Taken (seconds)", interactive=False, elem_classes=["metric"])
350
+ tokens_per_second = gr.Number(label="Tokens per Second", interactive=False, elem_classes=["metric"])
351
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)", interactive=False, elem_classes=["metric"])
352
+ total_memory_consumption = gr.Number(label="Total Memory Consumption (MB)", interactive=False, elem_classes=["metric"])
353
+ avg_memory_consumption = gr.Number(label="Average Memory Consumption per Question (MB)", interactive=False, elem_classes=["metric"])
354
+ with gr.Row(elem_id="buttons-container"):
355
+ load_questions_btn = gr.Button("Load Default Questions")
356
+ demo_btn = gr.Button("Run Inference", elem_id="run-inference-btn")
357
+
358
+ formatted_responses = gr.Textbox(
359
+ label="Formatted Responses",
360
+ elem_id="formatted-responses",
361
+ value="No responses yet. Run the inference to see results.",
362
+ lines=37,
363
+ container=False,
364
+ autoscroll=False,
365
+ show_copy_button=True
366
+ )
367
+
368
+ load_questions_btn.click(fn=load_questions_action, inputs=[], outputs=custom_questions_text)
369
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text], outputs=[time_taken, tokens_per_second, ttft, avg_memory_consumption, total_memory_consumption, formatted_responses])
370
+
371
+ if __name__ == "__main__":
372
+ print("Loading model and tokenizer on startup...")
373
+ # load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
374
+ print("Model and tokenizer loaded. Starting Gradio interface...")
375
+ app.launch()
backups/app_local_v2.py ADDED
@@ -0,0 +1,191 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+
9
+ # Environment variables
10
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
11
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
12
+
13
+ # Global variables to store the model and tokenizer
14
+ model = None
15
+ tokenizer = None
16
+
17
+ # Load model and tokenizer
18
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
19
+ global model, tokenizer
20
+ if model is None or tokenizer is None:
21
+ print("Loading model and tokenizer...")
22
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
23
+ special_tokens = {"pad_token": "<PAD>"}
24
+ tokenizer.add_special_tokens(special_tokens)
25
+
26
+ config = AutoConfig.from_pretrained(model_name)
27
+ if kv_bits != "unquantized":
28
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
29
+ setattr(config, "quantizer_path", quantizer_path)
30
+
31
+ if dtype == "bf16":
32
+ dtype = torch.bfloat16
33
+ elif dtype == "fp16":
34
+ dtype = torch.float16
35
+ elif dtype == "fp32":
36
+ dtype = torch.float32
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
39
+
40
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
41
+ model.resize_token_embeddings(len(tokenizer))
42
+
43
+ tokenizer.padding_side = "left"
44
+ model.config.pad_token_id = tokenizer.pad_token_id
45
+
46
+ return model, tokenizer
47
+
48
+ # Format response
49
+ def format_response(dialog, response):
50
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
51
+ answer = response.split("assistant")[-1].strip()
52
+ return {"question": question, "answer": answer}
53
+
54
+ # Load questions
55
+ def load_questions(prompts_path, custom_questions):
56
+ with open(prompts_path, "r") as file:
57
+ dialogs = json.load(file)
58
+
59
+ selected_dialogs = []
60
+
61
+ if custom_questions:
62
+ for question in custom_questions:
63
+ if question.strip():
64
+ custom_dialog = [{"role": "user", "content": question}]
65
+ selected_dialogs.append(custom_dialog)
66
+
67
+ num_questions = 30 - len(selected_dialogs)
68
+ random.shuffle(dialogs)
69
+ selected_dialogs.extend(dialogs[:num_questions])
70
+
71
+ return selected_dialogs[:30]
72
+
73
+ # Inference
74
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
75
+ print("Starting inference...")
76
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
77
+ batch_inputs = [
78
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
79
+ for dialog in dialogs
80
+ ]
81
+
82
+ responses = []
83
+ start_time = time.time()
84
+
85
+ batch_size = 30 # Set batch size for processing, this can be adjusted
86
+ num_dialogs = len(dialogs)
87
+ total_time = 0
88
+ total_tokens = 0
89
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
90
+
91
+ for batch_idx in range(num_batches):
92
+ start_idx = batch_idx * batch_size
93
+ end_idx = min(start_idx + batch_size, num_dialogs)
94
+ batch = batch_inputs[start_idx:end_idx]
95
+
96
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
97
+ input_ids = encoded_inputs["input_ids"].to(model.device)
98
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
99
+
100
+ with torch.no_grad():
101
+ torch.cuda.synchronize()
102
+ batch_start_time = time.perf_counter()
103
+
104
+ output_tokens = model.generate(
105
+ input_ids,
106
+ attention_mask=attention_mask,
107
+ max_new_tokens=num_new_tokens,
108
+ do_sample=True,
109
+ temperature=temperature,
110
+ pad_token_id=tokenizer.pad_token_id,
111
+ eos_token_id=tokenizer.eos_token_id
112
+ )
113
+
114
+ torch.cuda.synchronize()
115
+ batch_end_time = time.perf_counter()
116
+
117
+ batch_time = batch_end_time - batch_start_time
118
+ total_time += batch_time
119
+ total_tokens += output_tokens.numel()
120
+
121
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
122
+
123
+ for i, response in enumerate(decoded_outputs):
124
+ original_dialog = dialogs[start_idx + i]
125
+ formatted_response = format_response(original_dialog, response)
126
+ responses.append(formatted_response)
127
+
128
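+ # Stream the running metrics and the most recently formatted answer back to the Gradio UI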
+ yield {
129
+ "Time Taken (seconds)": time.time() - start_time,
130
+ "Tokens per Second": total_tokens / total_time if total_time > 0 else 0,
131
+ "Formatted Responses": f"**Question**: {formatted_response['question']}\n\n**Answer**: {formatted_response['answer']}\n\n---\n\n"
132
+ }
133
+ progress((batch_idx + 1) / num_batches, desc="Processing batches")
134
+
135
+ elapsed_time = time.time() - start_time
136
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
137
+
138
+ # Demo function
139
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits, progress=gr.Progress()):
140
+ custom_questions = custom_questions_text.split("\n")
141
+ print("Loading questions...")
142
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
143
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
144
+
145
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
146
+
147
+ time_taken, tokens_per_second, formatted_responses = None, None, ""
148
+
149
+ for result in result_gen:
150
+ time_taken = result["Time Taken (seconds)"]
151
+ tokens_per_second = result["Tokens per Second"]
152
+ formatted_responses += result["Formatted Responses"]
153
+ yield time_taken, tokens_per_second, formatted_responses
154
+
155
+ # Load JSON data
156
+ with open("chats_sys_none.json", "r") as file:
157
+ json_data = json.load(file)
158
+ json_data_str = json.dumps(json_data, indent=2)
159
+
160
+ # Show JSON function
161
+ def show_json():
162
+ return json_data_str
163
+
164
+ # Gradio interface
165
+ app = gr.Blocks()
166
+
167
+ with app:
168
+ with gr.Tab("LLM Inference Demo"):
169
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
170
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
171
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5)
172
+ kv_bits = gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
173
+
174
+ time_taken = gr.Number(label="Time Taken (seconds)")
175
+ tokens_per_second = gr.Number(label="Tokens per Second")
176
+ formatted_responses = gr.Markdown(label="Formatted Responses")
177
+
178
+ demo_btn = gr.Button("Run Inference")
179
+
180
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text, kv_bits], outputs=[time_taken, tokens_per_second, formatted_responses])
181
+
182
+ with gr.Tab("Show JSON"):
183
+ json_output = gr.HTML("<pre>{}</pre>".format(json_data_str))
184
+ json_interface = gr.Interface(fn=show_json, inputs=[], outputs=[json_output], live=False)
185
+ json_interface.render()
186
+
187
+ if __name__ == "__main__":
188
+ print("Loading model and tokenizer on startup...")
189
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
190
+ print("Model and tokenizer loaded. Starting Gradio interface...")
191
+ app.queue(default_concurrency_limit=5).launch()
backups/app_local_v3.py ADDED
@@ -0,0 +1,211 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+
9
+ # Environment variables
10
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
11
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
12
+
13
+ # Global variables to store the model and tokenizer
14
+ model = None
15
+ tokenizer = None
16
+
17
+ # Load model and tokenizer
18
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
19
+ global model, tokenizer
20
+ if model is None or tokenizer is None:
21
+ print("Loading model and tokenizer...")
22
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
23
+ special_tokens = {"pad_token": "<PAD>"}
24
+ tokenizer.add_special_tokens(special_tokens)
25
+
26
+ config = AutoConfig.from_pretrained(model_name)
27
+ if kv_bits != "unquantized":
28
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
29
+ setattr(config, "quantizer_path", quantizer_path)
30
+
31
+ if dtype == "bf16":
32
+ dtype = torch.bfloat16
33
+ elif dtype == "fp16":
34
+ dtype = torch.float16
35
+ elif dtype == "fp32":
36
+ dtype = torch.float32
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
39
+
40
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
41
+ model.resize_token_embeddings(len(tokenizer))
42
+
43
+ tokenizer.padding_side = "left"
44
+ model.config.pad_token_id = tokenizer.pad_token_id
45
+
46
+ return model, tokenizer
47
+
48
+ # Format response
49
+ def format_response(dialog, response):
50
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
51
+ answer = response.split("assistant")[-1].strip()
52
+ return {"question": question, "answer": answer}
53
+
54
+ # Load questions
55
+ def load_questions(prompts_path, custom_questions):
56
+ with open(prompts_path, "r") as file:
57
+ dialogs = json.load(file)
58
+
59
+ selected_dialogs = []
60
+
61
+ if custom_questions:
62
+ for question in custom_questions:
63
+ if question.strip():
64
+ custom_dialog = [{"role": "user", "content": question}]
65
+ selected_dialogs.append(custom_dialog)
66
+
67
+ num_questions = 60 - len(selected_dialogs)
68
+ random.shuffle(dialogs)
69
+ selected_dialogs.extend(dialogs[:num_questions])
70
+
71
+ return selected_dialogs[:60]
72
+
73
+ # Inference
74
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
75
+ print("Starting inference...")
76
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
77
+ batch_inputs = [
78
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
79
+ for dialog in dialogs
80
+ ]
81
+
82
+ responses = []
83
+ start_time = time.time()
84
+
85
+ batch_size = 30 # Set batch size for processing, this can be adjusted
86
+ num_dialogs = len(dialogs)
87
+ total_time = 0
88
+ total_tokens = 0
89
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
90
+
91
+ for batch_idx in range(num_batches):
92
+ start_idx = batch_idx * batch_size
93
+ end_idx = min(start_idx + batch_size, num_dialogs)
94
+ batch = batch_inputs[start_idx:end_idx]
95
+
96
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
97
+ input_ids = encoded_inputs["input_ids"].to(model.device)
98
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
99
+
100
+ with torch.no_grad():
101
+ torch.cuda.synchronize()
102
+ batch_start_time = time.perf_counter()
103
+
104
+ # Generate responses and measure time to first token
105
+ output_tokens = model.generate(
106
+ input_ids,
107
+ attention_mask=attention_mask,
108
+ max_new_tokens=num_new_tokens,
109
+ do_sample=True,
110
+ temperature=temperature,
111
+ pad_token_id=tokenizer.pad_token_id,
112
+ eos_token_id=tokenizer.eos_token_id
113
+ )
114
+
115
+ torch.cuda.synchronize()
116
+ batch_end_time = time.perf_counter()
117
+
118
+ batch_time = batch_end_time - batch_start_time
119
+ total_time += batch_time
120
+ total_tokens += output_tokens.numel()
121
+
122
+ # Calculate TTFT
123
+ if batch_idx == 0:
124
+ ttft = batch_time / input_ids.size(0) # Approximate TTFT: first-batch generation time averaged per sequence, not a true first-token latency
125
+
126
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
127
+
128
+ for i, response in enumerate(decoded_outputs):
129
+ original_dialog = dialogs[start_idx + i]
130
+ formatted_response = format_response(original_dialog, response)
131
+ responses.append(formatted_response)
132
+
133
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
134
+ yield formatted_responses
135
+ progress((batch_idx + 1) / num_batches, desc="Processing batches")
136
+
137
+ elapsed_time = time.time() - start_time
138
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
139
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
140
+
141
+ yield {
142
+ "Time Taken (seconds)": elapsed_time,
143
+ "Tokens per Second": tokens_per_second,
144
+ "Time to First Token (TTFT, seconds)": ttft,
145
+ "Formatted Responses": formatted_responses
146
+ }
147
+
148
+ # Demo function
149
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits, progress=gr.Progress()):
150
+ custom_questions = custom_questions_text.split("\n")
151
+ print("Loading questions...")
152
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
153
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
154
+
155
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
156
+
157
+ formatted_responses = ""
158
+ for result in result_gen:
159
+ if isinstance(result, str):
160
+ formatted_responses = result
161
+ yield None, None, None, formatted_responses
162
+ else:
163
+ time_taken = result["Time Taken (seconds)"]
164
+ tokens_per_second = result["Tokens per Second"]
165
+ ttft = result["Time to First Token (TTFT, seconds)"]
166
+ formatted_responses = result["Formatted Responses"]
167
+ yield time_taken, tokens_per_second, ttft, formatted_responses
168
+
169
+ # Load JSON data
170
+ with open("chats_sys_none.json", "r") as file:
171
+ json_data = json.load(file)
172
+ json_data_str = json.dumps(json_data, indent=2)
173
+
174
+ # Show JSON function
175
+ def show_json():
176
+ return json_data_str
177
+
178
+ # Gradio interface
179
+ app = gr.Blocks()
180
+
181
+ with app:
182
+ with gr.Tab("LLM Inference Demo"):
183
+ with gr.Row():
184
+ with gr.Column():
185
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
186
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
187
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5)
188
+ kv_bits = gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
189
+
190
+ with gr.Column():
191
+ time_taken = gr.Number(label="Time Taken (seconds)")
192
+ tokens_per_second = gr.Number(label="Tokens per Second")
193
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)")
194
+
195
+ with gr.Row():
196
+ formatted_responses = gr.Markdown(label="Formatted Responses")
197
+
198
+ demo_btn = gr.Button("Run Inference")
199
+
200
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text, kv_bits], outputs=[time_taken, tokens_per_second, ttft, formatted_responses])
201
+
202
+ with gr.Tab("Show JSON"):
203
+ json_output = gr.HTML("<pre>{}</pre>".format(json_data_str))
204
+ json_interface = gr.Interface(fn=show_json, inputs=[], outputs=[json_output], live=False)
205
+ json_interface.render()
206
+
207
+ if __name__ == "__main__":
208
+ print("Loading model and tokenizer on startup...")
209
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
210
+ print("Model and tokenizer loaded. Starting Gradio interface...")
211
+ app.queue(default_concurrency_limit=5).launch()
backups/app_local_v4-1.py ADDED
@@ -0,0 +1,234 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+
9
+ # Environment variables
10
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
11
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
12
+
13
+ # Global variables to store the model and tokenizer
14
+ model = None
15
+ tokenizer = None
16
+
17
+ # Load model and tokenizer
18
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
19
+ global model, tokenizer
20
+ if model is None or tokenizer is None:
21
+ print("Loading model and tokenizer...")
22
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
23
+ special_tokens = {"pad_token": "<PAD>"}
24
+ tokenizer.add_special_tokens(special_tokens)
25
+
26
+ config = AutoConfig.from_pretrained(model_name)
27
+ if kv_bits != "unquantized":
28
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
29
+ setattr(config, "quantizer_path", quantizer_path)
30
+
31
+ if dtype == "bf16":
32
+ dtype = torch.bfloat16
33
+ elif dtype == "fp16":
34
+ dtype = torch.float16
35
+ elif dtype == "fp32":
36
+ dtype = torch.float32
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
39
+
40
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
41
+ model.resize_token_embeddings(len(tokenizer))
42
+
43
+ tokenizer.padding_side = "left"
44
+ model.config.pad_token_id = tokenizer.pad_token_id
45
+
46
+ return model, tokenizer
47
+
48
+ # Format response
49
+ def format_response(dialog, response):
50
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
51
+ answer = response.split("assistant")[-1].strip()
52
+ return {"question": question, "answer": answer}
53
+
54
+ # Load questions
55
+ def load_questions(prompts_path, custom_questions):
56
+ with open(prompts_path, "r") as file:
57
+ dialogs = json.load(file)
58
+
59
+ selected_dialogs = []
60
+
61
+ if custom_questions:
62
+ for question in custom_questions:
63
+ if question.strip():
64
+ custom_dialog = [{"role": "user", "content": question}]
65
+ selected_dialogs.append(custom_dialog)
66
+
67
+ num_questions = 60 - len(selected_dialogs)
68
+ random.shuffle(dialogs)
69
+ selected_dialogs.extend(dialogs[:num_questions])
70
+
71
+ return selected_dialogs[:60]
72
+
73
+ # Inference
74
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
75
+ print("Starting inference...")
76
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
77
+ batch_inputs = [
78
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
79
+ for dialog in dialogs
80
+ ]
81
+
82
+ responses = []
83
+ start_time = time.time()
84
+
85
+ batch_size = 60 # Set batch size for processing, this can be adjusted
86
+ num_dialogs = len(dialogs)
87
+ total_time = 0
88
+ total_tokens = 0
89
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
90
+
91
+ for batch_idx in range(num_batches):
92
+ start_idx = batch_idx * batch_size
93
+ end_idx = min(start_idx + batch_size, num_dialogs)
94
+ batch = batch_inputs[start_idx:end_idx]
95
+
96
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
97
+ input_ids = encoded_inputs["input_ids"].to(model.device)
98
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
99
+
100
+ with torch.no_grad():
101
+ torch.cuda.synchronize()
102
+ batch_start_time = time.perf_counter()
103
+
104
+ # Generate responses and measure time to first token
105
+ output_tokens = model.generate(
106
+ input_ids,
107
+ attention_mask=attention_mask,
108
+ max_new_tokens=num_new_tokens,
109
+ do_sample=True,
110
+ temperature=temperature,
111
+ pad_token_id=tokenizer.pad_token_id,
112
+ eos_token_id=tokenizer.eos_token_id
113
+ )
114
+
115
+ torch.cuda.synchronize()
116
+ batch_end_time = time.perf_counter()
117
+
118
+ batch_time = batch_end_time - batch_start_time
119
+ total_time += batch_time
120
+ total_tokens += output_tokens.numel()
121
+
122
+ # Calculate TTFT
123
+ if batch_idx == 0:
124
+ ttft = batch_time / input_ids.size(0) # Time to first token for the first batch
125
+
126
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
127
+
128
+ for i, response in enumerate(decoded_outputs):
129
+ original_dialog = dialogs[start_idx + i]
130
+ formatted_response = format_response(original_dialog, response)
131
+ responses.append(formatted_response)
132
+
133
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
134
+ yield formatted_responses
135
+ progress((batch_idx + 1) / num_batches, desc="Processing batches")
136
+
137
+ elapsed_time = time.time() - start_time
138
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
139
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
140
+
141
+ yield {
142
+ "Time Taken (seconds)": elapsed_time,
143
+ "Tokens per Second": tokens_per_second,
144
+ "Time to First Token (TTFT, seconds)": ttft,
145
+ "Formatted Responses": formatted_responses
146
+ }
147
+
148
+ # Demo function
149
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits=1, progress=gr.Progress()):
150
+ custom_questions = custom_questions_text.split("\n")
151
+ print("Loading questions...")
152
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
153
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
154
+
155
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
156
+
157
+ formatted_responses = ""
158
+ for result in result_gen:
159
+ if isinstance(result, str):
160
+ formatted_responses = result
161
+ yield None, None, None, formatted_responses
162
+ else:
163
+ time_taken = result["Time Taken (seconds)"]
164
+ tokens_per_second = result["Tokens per Second"]
165
+ ttft = result["Time to First Token (TTFT, seconds)"]
166
+ formatted_responses = result["Formatted Responses"]
167
+ yield time_taken, tokens_per_second, ttft, formatted_responses
168
+
169
+ # Load JSON data
170
+ with open("chats_sys_none.json", "r") as file:
171
+ json_data = json.load(file)
172
+
173
+ # Load 50 random questions into the input area by default
174
+ def load_default_questions():
175
+ random.shuffle(json_data)
176
+ default_questions = [dialog[0]['content'] for dialog in json_data[:50] if 'content' in dialog[0]]
177
+ return "\n".join(default_questions)
178
+
179
+ # Load default questions on button click
180
+ def load_questions_action():
181
+ return load_default_questions()
182
+
183
+ # Gradio interface
184
+ css = """
185
+ body, html {
186
+ height: 100vh;
187
+ margin: 0;
188
+ }
189
+
190
+ .gradio-container {
191
+ height: 100vh;
192
+ }
193
+
194
+ #main-row {
195
+ height: 100%;
196
+ }
197
+
198
+ #control-panel, #formatted-responses-container {
199
+ height: 100%;
200
+ box-sizing: border-box;
201
+ }
202
+
203
+ #custom-questions-text, #formatted-responses {
204
+ flex-grow: 1;
205
+ overflow-y: auto;
206
+ border: 1px solid #ccc;
207
+ }
208
+ """
209
+
210
+ with gr.Blocks(css=css) as app:
211
+ with gr.Row(elem_id="main-row", equal_height=True):
212
+ with gr.Column(elem_id="control-panel", scale=1):
213
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
214
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
215
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=22, elem_id="custom-questions-text")
216
+ with gr.Row(elem_id="metrics-panel"):
217
+ time_taken = gr.Number(label="Time Taken (seconds)", interactive=False, elem_classes=["metric"])
218
+ tokens_per_second = gr.Number(label="Tokens per Second", interactive=False, elem_classes=["metric"])
219
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)", interactive=False, elem_classes=["metric"])
220
+ with gr.Row(elem_id="buttons-container"):
221
+ load_questions_btn = gr.Button("Load Default Questions")
222
+ demo_btn = gr.Button("Run Inference", elem_id="run-inference-btn")
223
+
224
+ # with gr.Column(elem_id="formatted-responses-container", scale=1):
225
+ formatted_responses = gr.Textbox(label="Formatted Responses", elem_id="formatted-responses", value="No responses yet. Run the inference to see results.", lines=35, autoscroll=False, show_copy_button=True)
226
+
227
+ load_questions_btn.click(fn=load_questions_action, inputs=[], outputs=custom_questions_text)
228
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text], outputs=[time_taken, tokens_per_second, ttft, formatted_responses])
229
+
230
+ if __name__ == "__main__":
231
+ print("Loading model and tokenizer on startup...")
232
+ # load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
233
+ print("Model and tokenizer loaded. Starting Gradio interface...")
234
+ app.launch()
backups/app_local_with_graph.py ADDED
@@ -0,0 +1,235 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+ from PIL import Image
9
+
10
+ # Environment variables
11
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
12
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
13
+
14
+
15
+ # Global variables to store the model and tokenizer
16
+ model = None
17
+ tokenizer = None
18
+
19
+ # Load model and tokenizer
20
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
21
+ global model, tokenizer
22
+ if model is None or tokenizer is None:
23
+ print("Loading model and tokenizer...")
24
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
25
+ special_tokens = {"pad_token": "<PAD>"}
26
+ tokenizer.add_special_tokens(special_tokens)
27
+
28
+ config = AutoConfig.from_pretrained(model_name)
29
+ if kv_bits != "unquantized":
30
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
31
+ setattr(config, "quantizer_path", quantizer_path)
32
+
33
+ if dtype == "bf16":
34
+ dtype = torch.bfloat16
35
+ elif dtype == "fp16":
36
+ dtype = torch.float16
37
+ elif dtype == "fp32":
38
+ dtype = torch.float32
39
+
40
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
41
+
42
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
43
+ model.resize_token_embeddings(len(tokenizer))
44
+
45
+ tokenizer.padding_side = "left"
46
+ model.config.pad_token_id = tokenizer.pad_token_id
47
+
48
+ return model, tokenizer
49
+
50
+ # Format response
51
+ def format_response(dialog, response):
52
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
53
+ answer = response.split("assistant")[-1].strip()
54
+ return {"question": question, "answer": answer}
55
+
56
+ # Load questions
57
+ def load_questions(prompts_path, custom_questions):
58
+ with open(prompts_path, "r") as file:
59
+ dialogs = json.load(file)
60
+
61
+ selected_dialogs = []
62
+
63
+ if custom_questions:
64
+ for question in custom_questions:
65
+ if question.strip():
66
+ custom_dialog = [{"role": "user", "content": question}]
67
+ selected_dialogs.append(custom_dialog)
68
+
69
+ num_questions = 60 - len(selected_dialogs)
70
+ random.shuffle(dialogs)
71
+ selected_dialogs.extend(dialogs[:num_questions])
72
+
73
+ return selected_dialogs[:60]
74
+
75
+ # Inference
76
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
77
+ print("Starting inference...")
78
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
79
+ batch_inputs = [
80
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
81
+ for dialog in dialogs
82
+ ]
83
+
84
+ responses = []
85
+ start_time = time.time()
86
+
87
+ batch_size = 30 # Set batch size for processing, this can be adjusted
88
+ num_dialogs = len(dialogs)
89
+ total_time = 0
90
+ total_tokens = 0
91
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
92
+
93
+ for batch_idx in range(num_batches):
94
+ start_idx = batch_idx * batch_size
95
+ end_idx = min(start_idx + batch_size, num_dialogs)
96
+ batch = batch_inputs[start_idx:end_idx]
97
+
98
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
99
+ input_ids = encoded_inputs["input_ids"].to(model.device)
100
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
101
+
102
+ with torch.no_grad():
103
+ torch.cuda.synchronize()
104
+ batch_start_time = time.perf_counter()
105
+
106
+ # Generate responses and measure time to first token
107
+ output_tokens = model.generate(
108
+ input_ids,
109
+ attention_mask=attention_mask,
110
+ max_new_tokens=num_new_tokens,
111
+ do_sample=True,
112
+ temperature=temperature,
113
+ pad_token_id=tokenizer.pad_token_id,
114
+ eos_token_id=tokenizer.eos_token_id
115
+ )
116
+
117
+ torch.cuda.synchronize()
118
+ batch_end_time = time.perf_counter()
119
+
120
+ batch_time = batch_end_time - batch_start_time
121
+ total_time += batch_time
122
+ total_tokens += output_tokens.numel()
123
+
124
+ # Calculate TTFT
125
+ if batch_idx == 0:
126
+ ttft = batch_time / input_ids.size(0) # Time to first token for the first batch
127
+
128
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
129
+
130
+ for i, response in enumerate(decoded_outputs):
131
+ original_dialog = dialogs[start_idx + i]
132
+ formatted_response = format_response(original_dialog, response)
133
+ responses.append(formatted_response)
134
+
135
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
136
+ yield formatted_responses
137
+ progress((batch_idx + 1) / num_batches, desc="Processing batches")
138
+
139
+ elapsed_time = time.time() - start_time
140
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
141
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
142
+
143
+ yield {
144
+ "Time Taken (seconds)": elapsed_time,
145
+ "Tokens per Second": tokens_per_second,
146
+ "Time to First Token (TTFT, seconds)": ttft,
147
+ "Formatted Responses": formatted_responses
148
+ }
149
+
150
+ # Demo function
151
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits, progress=gr.Progress()):
152
+ custom_questions = custom_questions_text.split("\n")
153
+ print("Loading questions...")
154
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
155
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
156
+
157
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
158
+
159
+ formatted_responses = ""
160
+ for result in result_gen:
161
+ if isinstance(result, str):
162
+ formatted_responses = result
163
+ yield None, None, None, formatted_responses
164
+ else:
165
+ time_taken = result["Time Taken (seconds)"]
166
+ tokens_per_second = result["Tokens per Second"]
167
+ ttft = result["Time to First Token (TTFT, seconds)"]
168
+ formatted_responses = result["Formatted Responses"]
169
+ yield time_taken, tokens_per_second, ttft, formatted_responses
170
+
171
+ # Load JSON data
172
+ with open("chats_sys_none.json", "r") as file:
173
+ json_data = json.load(file)
174
+ json_data_str = json.dumps(json_data, indent=2)
175
+
176
+ # Show JSON function
177
+ def show_json():
178
+ return json_data_str
179
+
180
+ # Debug function to check image path
181
+ def check_image_path(image_path):
182
+ if os.path.exists(image_path):
183
+ print(f"Image found at {image_path}")
184
+ return True
185
+ else:
186
+ print(f"Image not found at {image_path}")
187
+ return False
188
+
189
+ # Gradio interface
190
+ app = gr.Blocks(css=".scrollable {height: 400px; overflow-y: auto; padding: 10px; border: 1px solid #ccc;}")
191
+
192
+ with app:
193
+ with gr.Tab("LLM Inference Demo"):
194
+ with gr.Row():
195
+ with gr.Column():
196
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
197
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
198
+ kv_bits = gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
199
+
200
+
201
+ with gr.Column():
202
+ time_taken = gr.Number(label="Time Taken (seconds)")
203
+ tokens_per_second = gr.Number(label="Tokens per Second")
204
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)")
205
+
206
+ with gr.Row():
207
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5)
208
+
209
+ with gr.Row():
210
+ demo_btn = gr.Button("Run Inference")
211
+
212
+ with gr.Row():
213
+ formatted_responses = gr.Markdown(label="Formatted Responses")
214
+
215
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text, kv_bits], outputs=[time_taken, tokens_per_second, ttft, formatted_responses])
216
+
217
+ with gr.Tab("Show JSON"):
218
+ json_output = gr.HTML("<pre>{}</pre>".format(json_data_str))
219
+ json_interface = gr.Interface(fn=show_json, inputs=[], outputs=[json_output], live=False)
220
+ json_interface.render()
221
+
222
+ # with gr.Tab("Image Gallery"):
223
+ # image_path = "memory_usage.png"
224
+ # if check_image_path(image_path): # Debugging the image path
225
+ # gr.Image(value=image_path, label="Memory Usage", type="filepath")
226
+ # else:
227
+ # gr.HTML(f"<p>Image not found at {image_path}</p>")
228
+
229
+ if __name__ == "__main__":
230
+ print("Checking if the image path is correct...")
231
+ check_image_path("memory_usage.png") # Check image path on startup
232
+ print("Loading model and tokenizer on startup...")
233
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
234
+ print("Model and tokenizer loaded. Starting Gradio interface...")
235
+ app.queue(default_concurrency_limit=5).launch()
backups/app_major_backup.py ADDED
@@ -0,0 +1,235 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+ from PIL import Image
9
+
10
+ # Environment variables
11
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
12
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
13
+
14
+
15
+ # Global variables to store the model and tokenizer
16
+ model = None
17
+ tokenizer = None
18
+
19
+ # Load model and tokenizer
20
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
21
+ global model, tokenizer
22
+ if model is None or tokenizer is None:
23
+ print("Loading model and tokenizer...")
24
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
25
+ special_tokens = {"pad_token": "<PAD>"}
26
+ tokenizer.add_special_tokens(special_tokens)
27
+
28
+ config = AutoConfig.from_pretrained(model_name)
29
+ if kv_bits != "unquantized":
30
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
31
+ setattr(config, "quantizer_path", quantizer_path)
32
+
33
+ if dtype == "bf16":
34
+ dtype = torch.bfloat16
35
+ elif dtype == "fp16":
36
+ dtype = torch.float16
37
+ elif dtype == "fp32":
38
+ dtype = torch.float32
39
+
40
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
41
+
42
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
43
+ model.resize_token_embeddings(len(tokenizer))
44
+
45
+ tokenizer.padding_side = "left"
46
+ model.config.pad_token_id = tokenizer.pad_token_id
47
+
48
+ return model, tokenizer
49
+
50
+ # Format response
51
+ def format_response(dialog, response):
52
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
53
+ answer = response.split("assistant")[-1].strip()
54
+ return {"question": question, "answer": answer}
55
+
56
+ # Load questions
57
+ def load_questions(prompts_path, custom_questions):
58
+ with open(prompts_path, "r") as file:
59
+ dialogs = json.load(file)
60
+
61
+ selected_dialogs = []
62
+
63
+ if custom_questions:
64
+ for question in custom_questions:
65
+ if question.strip():
66
+ custom_dialog = [{"role": "user", "content": question}]
67
+ selected_dialogs.append(custom_dialog)
68
+
69
+ num_questions = 60 - len(selected_dialogs)
70
+ random.shuffle(dialogs)
71
+ selected_dialogs.extend(dialogs[:num_questions])
72
+
73
+ return selected_dialogs[:60]
74
+
75
+ # Inference
76
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
77
+ print("Starting inference...")
78
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
79
+ batch_inputs = [
80
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
81
+ for dialog in dialogs
82
+ ]
83
+
84
+ responses = []
85
+ start_time = time.time()
86
+
87
+ batch_size = 30 # Set batch size for processing, this can be adjusted
88
+ num_dialogs = len(dialogs)
89
+ total_time = 0
90
+ total_tokens = 0
91
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
92
+
93
+ for batch_idx in range(num_batches):
94
+ start_idx = batch_idx * batch_size
95
+ end_idx = min(start_idx + batch_size, num_dialogs)
96
+ batch = batch_inputs[start_idx:end_idx]
97
+
98
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
99
+ input_ids = encoded_inputs["input_ids"].to(model.device)
100
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
101
+
102
+ with torch.no_grad():
103
+ torch.cuda.synchronize()
104
+ batch_start_time = time.perf_counter()
105
+
106
+ # Generate responses and measure time to first token
107
+ output_tokens = model.generate(
108
+ input_ids,
109
+ attention_mask=attention_mask,
110
+ max_new_tokens=num_new_tokens,
111
+ do_sample=True,
112
+ temperature=temperature,
113
+ pad_token_id=tokenizer.pad_token_id,
114
+ eos_token_id=tokenizer.eos_token_id
115
+ )
116
+
117
+ torch.cuda.synchronize()
118
+ batch_end_time = time.perf_counter()
119
+
120
+ batch_time = batch_end_time - batch_start_time
121
+ total_time += batch_time
122
+ total_tokens += output_tokens.numel()
123
+
124
+ # Calculate TTFT
125
+ if batch_idx == 0:
126
+ ttft = batch_time / input_ids.size(0) # Approximate TTFT: total generation time of the first batch, averaged per sequence
127
+
128
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
129
+
130
+ for i, response in enumerate(decoded_outputs):
131
+ original_dialog = dialogs[start_idx + i]
132
+ formatted_response = format_response(original_dialog, response)
133
+ responses.append(formatted_response)
134
+
135
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
136
+ yield formatted_responses
137
+ progress((batch_idx + 1) / num_batches, desc="Processing batches")
138
+
139
+ elapsed_time = time.time() - start_time
140
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
141
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
142
+
143
+ yield {
144
+ "Time Taken (seconds)": elapsed_time,
145
+ "Tokens per Second": tokens_per_second,
146
+ "Time to First Token (TTFT, seconds)": ttft,
147
+ "Formatted Responses": formatted_responses
148
+ }
149
+
150
+ # Demo function
151
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits, progress=gr.Progress()):
152
+ custom_questions = custom_questions_text.split("\n")
153
+ print("Loading questions...")
154
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
155
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
156
+
157
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
158
+
159
+ formatted_responses = ""
160
+ for result in result_gen:
161
+ if isinstance(result, str):
162
+ formatted_responses = result
163
+ yield None, None, None, formatted_responses
164
+ else:
165
+ time_taken = result["Time Taken (seconds)"]
166
+ tokens_per_second = result["Tokens per Second"]
167
+ ttft = result["Time to First Token (TTFT, seconds)"]
168
+ formatted_responses = result["Formatted Responses"]
169
+ yield time_taken, tokens_per_second, ttft, formatted_responses
170
+
171
+ # Load JSON data
172
+ with open("chats_sys_none.json", "r") as file:
173
+ json_data = json.load(file)
174
+ json_data_str = json.dumps(json_data, indent=2)
175
+
176
+ # Show JSON function
177
+ def show_json():
178
+ return json_data_str
179
+
180
+ # Debug function to check image path
181
+ def check_image_path(image_path):
182
+ if os.path.exists(image_path):
183
+ print(f"Image found at {image_path}")
184
+ return True
185
+ else:
186
+ print(f"Image not found at {image_path}")
187
+ return False
188
+
189
+ # Gradio interface
190
+ app = gr.Blocks(css=".scrollable {height: 400px; overflow-y: auto; padding: 10px; border: 1px solid #ccc;}")
191
+
192
+ with app:
193
+ with gr.Tab("LLM Inference Demo"):
194
+ with gr.Row():
195
+ with gr.Column():
196
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
197
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
198
+ kv_bits = gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
199
+
200
+
201
+ with gr.Column():
202
+ time_taken = gr.Number(label="Time Taken (seconds)")
203
+ tokens_per_second = gr.Number(label="Tokens per Second")
204
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)")
205
+
206
+ with gr.Row():
207
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5)
208
+
209
+ with gr.Row():
210
+ demo_btn = gr.Button("Run Inference")
211
+
212
+ with gr.Row():
213
+ formatted_responses = gr.Markdown(label="Formatted Responses")
214
+
215
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text, kv_bits], outputs=[time_taken, tokens_per_second, ttft, formatted_responses])
216
+
217
+ with gr.Tab("Show JSON"):
218
+ json_output = gr.HTML("<pre>{}</pre>".format(json_data_str))
219
+ json_interface = gr.Interface(fn=show_json, inputs=[], outputs=[json_output], live=False)
220
+ json_interface.render()
221
+
222
+ # with gr.Tab("Image Gallery"):
223
+ # image_path = "memory_usage.png"
224
+ # if check_image_path(image_path): # Debugging the image path
225
+ # gr.Image(value=image_path, label="Memory Usage", type="filepath")
226
+ # else:
227
+ # gr.HTML(f"<p>Image not found at {image_path}</p>")
228
+
229
+ if __name__ == "__main__":
230
+ print("Checking if the image path is correct...")
231
+ check_image_path("memory_usage.png") # Check image path on startup
232
+ print("Loading model and tokenizer on startup...")
233
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
234
+ print("Model and tokenizer loaded. Starting Gradio interface...")
235
+ app.queue(default_concurrency_limit=5).launch()
backups/app_pic.py ADDED
@@ -0,0 +1,40 @@
1
+ import os
2
+ import gradio as gr
3
+
4
+ # Function to print the current working directory
5
+ def print_current_directory():
6
+ current_directory = os.getcwd()
7
+ print(f"Current working directory: {current_directory}")
8
+
9
+ # Debug function to check image path
10
+ def check_image_path(image_path):
11
+ if os.path.exists(image_path):
12
+ print(f"Image found at {image_path}")
13
+ return True
14
+ else:
15
+ print(f"Image not found at {image_path}")
16
+ return False
17
+
18
+ # Correct path to the image (adjust if necessary)
19
+ image_path = "memory_usage.png"
20
+
21
+ # Use an absolute path for the image
22
+ absolute_image_path = os.path.abspath(image_path)
23
+
24
+ # Gradio interface
25
+ app = gr.Blocks(css=".scrollable {height: 400px; overflow-y: auto; padding: 10px; border: 1px solid #ccc;}")
26
+
27
+ with app:
28
+ with gr.Tab("Image Gallery"):
29
+ if check_image_path(absolute_image_path):
30
+ gr.Image(value=absolute_image_path, label="Memory Usage", type="filepath")
31
+ else:
32
+ gr.HTML(f"<p>Image not found at {absolute_image_path}</p>")
33
+
34
+ if __name__ == "__main__":
35
+ print("Checking the current working directory...")
36
+ print_current_directory() # Print the current working directory on startup
37
+ print("Checking if the image path is correct...")
38
+ check_image_path(absolute_image_path) # Check image path on startup
39
+ print("Starting Gradio interface...")
40
+ app.launch()
backups/app_unquantized_backup.py ADDED
@@ -0,0 +1,146 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+
8
+ # Environment variables
9
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
10
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
11
+
12
+ # Load model and tokenizer
13
+ def load_model_and_tokenizer(model_name, dtype):
14
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
15
+ special_tokens = {"pad_token": "<PAD>"}
16
+ tokenizer.add_special_tokens(special_tokens)
17
+
18
+ config = AutoConfig.from_pretrained(model_name)
19
+ if dtype == "bf16":
20
+ dtype = torch.bfloat16
21
+ elif dtype == "fp16":
22
+ dtype = torch.float16
23
+ elif dtype == "fp32":
24
+ dtype = torch.float32
25
+
26
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
27
+
28
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
29
+ model.resize_token_embeddings(len(tokenizer))
30
+
31
+ tokenizer.padding_side = "left"
32
+ model.config.pad_token_id = tokenizer.pad_token_id
33
+
34
+ return model, tokenizer
35
+
36
+ # Format response
37
+ def format_response(dialog, response):
38
+ formatted_dialog = dialog.copy()
39
+ formatted_dialog.append({"role": "assistant", "content": response})
40
+ return formatted_dialog
41
+
42
+ # Load questions
43
+ def load_questions(prompts_path, num_questions, custom_question):
44
+ with open(prompts_path, "r") as file:
45
+ dialogs = json.load(file)
46
+
47
+ if custom_question and custom_question.strip():
48
+ custom_dialog = [{"role": "user", "content": custom_question}]
49
+ dialogs.insert(0, custom_dialog)
50
+
51
+ dialogs = dialogs[:num_questions]
52
+ return dialogs
53
+
54
+ # Inference
55
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype):
56
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype)
57
+ batch_inputs = [
58
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
59
+ for dialog in dialogs
60
+ ]
61
+
62
+ responses = []
63
+ for i in range(len(dialogs)):
64
+ batch = batch_inputs[i:i+1]
65
+
66
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
67
+ input_ids = encoded_inputs["input_ids"].to(model.device)
68
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
69
+
70
+ with torch.no_grad():
71
+ output_tokens = model.generate(
72
+ input_ids,
73
+ attention_mask=attention_mask,
74
+ max_new_tokens=num_new_tokens,
75
+ do_sample=True,
76
+ temperature=temperature,
77
+ pad_token_id=tokenizer.pad_token_id,
78
+ eos_token_id=tokenizer.eos_token_id
79
+ )
80
+
81
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
82
+
83
+ for j, response in enumerate(decoded_outputs):
84
+ original_dialog = dialogs[i + j]
85
+ formatted_response = format_response(original_dialog, response)
86
+ responses.append(formatted_response)
87
+
88
+ torch.cuda.empty_cache()
89
+
90
+ results = {
91
+ "Responses": responses
92
+ }
93
+
94
+ return results
95
+
96
+ # Demo function
97
+ def demo(num_new_tokens, temperature, num_questions, custom_question):
98
+ dialogs = load_questions("chats_sys_none.json", num_questions, custom_question)
99
+ results = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16")
100
+ return results
101
+
102
+ # Load JSON data
103
+ with open("chats_sys_none.json", "r") as file:
104
+ json_data = json.load(file)
105
+ json_data_str = json.dumps(json_data, indent=2)
106
+
107
+ # Show JSON function
108
+ def show_json():
109
+ return json_data_str
110
+
111
+ # Gradio interface
112
+ interface = gr.Interface(
113
+ fn=demo,
114
+ inputs=[
115
+ gr.Slider(label="Number of New Tokens", minimum=1, maximum=1024, step=1, value=512),
116
+ gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4),
117
+ gr.Slider(minimum=20, maximum=100, step=1, label="Number of Questions", value=20),
118
+ gr.Textbox(label="Custom Question", placeholder="Type your custom question here..."),
119
+ ],
120
+ outputs=[
121
+ gr.JSON(label="Responses")
122
+ ],
123
+ title="LLM Inference Demo",
124
+ description="A demo for running LLM inference using Gradio and Hugging Face.",
125
+ live=False
126
+ )
127
+
128
+ json_interface = gr.Interface(
129
+ fn=show_json,
130
+ inputs=[],
131
+ outputs=[
132
+ gr.HTML("<pre>{}</pre>".format(json_data_str))
133
+ ],
134
+ live=False
135
+ )
136
+
137
+ app = gr.Blocks()
138
+
139
+ with app:
140
+ with gr.Tab("LLM Inference Demo"):
141
+ interface.render()
142
+ with gr.Tab("Show JSON"):
143
+ json_interface.render()
144
+
145
+ if __name__ == "__main__":
146
+ app.launch()
backups/app_v0.py ADDED
@@ -0,0 +1,188 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+
8
+ # Environment variables
9
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
10
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
11
+
12
+ # Global variables to store the model and tokenizer
13
+ model = None
14
+ tokenizer = None
15
+
16
+ # Load model and tokenizer
17
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
18
+ global model, tokenizer
19
+ if model is None or tokenizer is None:
20
+ print("Loading model and tokenizer...")
21
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
22
+ special_tokens = {"pad_token": "<PAD>"}
23
+ tokenizer.add_special_tokens(special_tokens)
24
+
25
+ config = AutoConfig.from_pretrained(model_name)
26
+ if kv_bits != "unquantized":
27
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
28
+ setattr(config, "quantizer_path", quantizer_path)
29
+
30
+ if dtype == "bf16":
31
+ dtype = torch.bfloat16
32
+ elif dtype == "fp16":
33
+ dtype = torch.float16
34
+ elif dtype == "fp32":
35
+ dtype = torch.float32
36
+
37
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
38
+
39
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
40
+ model.resize_token_embeddings(len(tokenizer))
41
+
42
+ tokenizer.padding_side = "left"
43
+ model.config.pad_token_id = tokenizer.pad_token_id
44
+
45
+ return model, tokenizer
46
+
47
+ # Format response
48
+ def format_response(dialog, response):
49
+ formatted_dialog = dialog.copy()
50
+ formatted_dialog.append({"role": "assistant", "content": response})
51
+ return formatted_dialog
52
+
53
+ # Load questions
54
+ def load_questions(prompts_path, num_questions, custom_question):
55
+ with open(prompts_path, "r") as file:
56
+ dialogs = json.load(file)
57
+
58
+ if custom_question and custom_question.strip():
59
+ custom_dialog = [{"role": "user", "content": custom_question}]
60
+ dialogs.insert(0, custom_dialog)
61
+
62
+ dialogs = dialogs[:num_questions]
63
+ return dialogs
64
+
65
+ # Inference
66
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits):
67
+ print("Starting inference...")
68
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
69
+ batch_inputs = [
70
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
71
+ for dialog in dialogs
72
+ ]
73
+
74
+ responses = []
75
+ start_time = time.time()
76
+
77
+ batch_size = 20 # Set batch size for processing, this can be adjusted
78
+ num_dialogs = len(dialogs)
79
+ total_time = 0
80
+ total_tokens = 0
81
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
82
+
83
+ for batch_idx in range(num_batches):
84
+ start_idx = batch_idx * batch_size
85
+ end_idx = min(start_idx + batch_size, num_dialogs)
86
+ batch = batch_inputs[start_idx:end_idx]
87
+
88
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
89
+ input_ids = encoded_inputs["input_ids"].to(model.device)
90
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
91
+
92
+ with torch.no_grad():
93
+ torch.cuda.synchronize()
94
+ batch_start_time = time.perf_counter()
95
+
96
+ output_tokens = model.generate(
97
+ input_ids,
98
+ attention_mask=attention_mask,
99
+ max_new_tokens=num_new_tokens,
100
+ do_sample=True,
101
+ temperature=temperature,
102
+ pad_token_id=tokenizer.pad_token_id,
103
+ eos_token_id=tokenizer.eos_token_id
104
+ )
105
+
106
+ torch.cuda.synchronize()
107
+ batch_end_time = time.perf_counter()
108
+
109
+ batch_time = batch_end_time - batch_start_time
110
+ total_time += batch_time
111
+ total_tokens += output_tokens.numel()
112
+
113
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
114
+
115
+ for i, response in enumerate(decoded_outputs):
116
+ original_dialog = dialogs[start_idx + i]
117
+ formatted_response = format_response(original_dialog, response)
118
+ responses.append(formatted_response)
119
+
120
+ elapsed_time = time.time() - start_time
121
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
122
+
123
+ results = {
124
+ "Responses": responses,
125
+ "Time Taken (seconds)": elapsed_time,
126
+ "Tokens per Second": total_tokens / total_time if total_time > 0 else 0
127
+ }
128
+
129
+ return results
130
+
131
+ # Demo function
132
+ def demo(num_new_tokens, temperature, num_questions, custom_question, kv_bits):
133
+ print("Loading questions...")
134
+ dialogs = load_questions("chats_sys_none.json", num_questions, custom_question)
135
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
136
+ results = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits)
137
+ return results
138
+
139
+ # Load JSON data
140
+ with open("chats_sys_none.json", "r") as file:
141
+ json_data = json.load(file)
142
+ json_data_str = json.dumps(json_data, indent=2)
143
+
144
+ # Show JSON function
145
+ def show_json():
146
+ return json_data_str
147
+
148
+ # Gradio interface
149
+ interface = gr.Interface(
150
+ fn=demo,
151
+ inputs=[
152
+ gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512),
153
+ gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4),
154
+ gr.Slider(minimum=20, maximum=100, step=1, label="Number of Questions", value=20),
155
+ gr.Textbox(label="Custom Question", placeholder="Type your custom question here..."),
156
+ # gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
157
+ ],
158
+ outputs=[
159
+ gr.JSON(label="Responses and Time Taken")
160
+ ],
161
+ title="LLM Inference Demo",
162
+ description="A demo for running LLM inference using Gradio and Hugging Face.",
163
+ live=False
164
+ )
165
+
166
+ json_interface = gr.Interface(
167
+ fn=show_json,
168
+ inputs=[],
169
+ outputs=[
170
+ gr.HTML("<pre>{}</pre>".format(json_data_str))
171
+ ],
172
+ live=False
173
+ )
174
+
175
+ app = gr.Blocks()
176
+
177
+ with app:
178
+ with gr.Tab("LLM Inference Demo"):
179
+ interface.render()
180
+ with gr.Tab("Show JSON"):
181
+ json_interface.render()
182
+
183
+ if __name__ == "__main__":
184
+ print("Loading model and tokenizer on startup...")
185
+ ## todo customized 2, 4 bits
186
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
187
+ print("Model and tokenizer loaded. Starting Gradio interface...")
188
+ app.launch()
backups/app_v1.py ADDED
@@ -0,0 +1,207 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+
9
+ # Environment variables
10
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
11
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
12
+
13
+ # Global variables to store the model and tokenizer
14
+ model = None
15
+ tokenizer = None
16
+
17
+ # Load model and tokenizer
18
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
19
+ global model, tokenizer
20
+ if model is None or tokenizer is None:
21
+ print("Loading model and tokenizer...")
22
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
23
+ special_tokens = {"pad_token": "<PAD>"}
24
+ tokenizer.add_special_tokens(special_tokens)
25
+
26
+ config = AutoConfig.from_pretrained(model_name)
27
+ if kv_bits != "unquantized":
28
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
29
+ setattr(config, "quantizer_path", quantizer_path)
30
+
31
+ if dtype == "bf16":
32
+ dtype = torch.bfloat16
33
+ elif dtype == "fp16":
34
+ dtype = torch.float16
35
+ elif dtype == "fp32":
36
+ dtype = torch.float32
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
39
+
40
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
41
+ model.resize_token_embeddings(len(tokenizer))
42
+
43
+ tokenizer.padding_side = "left"
44
+ model.config.pad_token_id = tokenizer.pad_token_id
45
+
46
+ return model, tokenizer
47
+
48
+ # Format response
49
+ def format_response(dialog, response):
50
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
51
+ answer = response.split("assistant")[-1].strip()
52
+ return {"question": question, "answer": answer}
53
+
54
+ # Load questions
55
+ def load_questions(prompts_path, custom_questions):
56
+ with open(prompts_path, "r") as file:
57
+ dialogs = json.load(file)
58
+
59
+ selected_dialogs = []
60
+
61
+ if custom_questions:
62
+ for question in custom_questions:
63
+ if question.strip():
64
+ custom_dialog = [{"role": "user", "content": question}]
65
+ selected_dialogs.append(custom_dialog)
66
+
67
+ num_questions = 30 - len(selected_dialogs)
68
+ random.shuffle(dialogs)
69
+ selected_dialogs.extend(dialogs[:num_questions])
70
+
71
+ return selected_dialogs[:30]
72
+
73
+ # Inference
74
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits):
75
+ print("Starting inference...")
76
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
77
+ batch_inputs = [
78
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
79
+ for dialog in dialogs
80
+ ]
81
+
82
+ responses = []
83
+ start_time = time.time()
84
+
85
+ batch_size = 30 # Set batch size for processing, this can be adjusted
86
+ num_dialogs = len(dialogs)
87
+ total_time = 0
88
+ total_tokens = 0
89
+ total_ttft = 0
90
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
91
+
92
+ for batch_idx in range(num_batches):
93
+ start_idx = batch_idx * batch_size
94
+ end_idx = min(start_idx + batch_size, num_dialogs)
95
+ batch = batch_inputs[start_idx:end_idx]
96
+
97
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
98
+ input_ids = encoded_inputs["input_ids"].to(model.device)
99
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
100
+
101
+ with torch.no_grad():
102
+ torch.cuda.synchronize()
103
+ batch_start_time = time.perf_counter()
104
+
105
+ output_tokens = model.generate(
106
+ input_ids,
107
+ attention_mask=attention_mask,
108
+ max_new_tokens=num_new_tokens,
109
+ do_sample=True,
110
+ temperature=temperature,
111
+ pad_token_id=tokenizer.pad_token_id,
112
+ eos_token_id=tokenizer.eos_token_id
113
+ )
114
+
115
+ torch.cuda.synchronize()
116
+ batch_end_time = time.perf_counter()
117
+
118
+ batch_time = batch_end_time - batch_start_time
119
+ total_time += batch_time
120
+ total_tokens += output_tokens.numel()
121
+
122
+ if batch_idx == 0:
123
+ total_ttft = batch_time
124
+
125
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
126
+
127
+ for i, response in enumerate(decoded_outputs):
128
+ original_dialog = dialogs[start_idx + i]
129
+ formatted_response = format_response(original_dialog, response)
130
+ responses.append(formatted_response)
131
+
132
+ elapsed_time = time.time() - start_time
133
+ ttft = total_ttft / batch_size if batch_size > 0 else 0
134
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
135
+
136
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
137
+
138
+ results = {
139
+ "Time Taken (seconds)": elapsed_time,
140
+ "Tokens per Second": total_tokens / total_time if total_time > 0 else 0,
141
+ "Time to First Token (seconds)": ttft,
142
+ "Responses": responses,
143
+ "Formatted Responses": formatted_responses
144
+ }
145
+
146
+ return results
147
+
148
+ # Demo function
149
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits):
150
+ custom_questions = custom_questions_text.split("\n")
151
+ print("Loading questions...")
152
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
153
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
154
+ results = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits)
155
+ return results["Time Taken (seconds)"], results["Tokens per Second"], results["Time to First Token (seconds)"], results["Formatted Responses"]
156
+
157
+ # Load JSON data
158
+ with open("chats_sys_none.json", "r") as file:
159
+ json_data = json.load(file)
160
+ json_data_str = json.dumps(json_data, indent=2)
161
+
162
+ # Show JSON function
163
+ def show_json():
164
+ return json_data_str
165
+
166
+ # Gradio interface
167
+ interface = gr.Interface(
168
+ fn=demo,
169
+ inputs=[
170
+ gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512),
171
+ gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4),
172
+ gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5),
173
+ gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
174
+ ],
175
+ outputs=[
176
+ gr.Number(label="Time Taken (seconds)", value=0),
177
+ gr.Number(label="Tokens per Second", value=0),
178
+ gr.Number(label="Time to First Token (seconds)", value=0),
179
+ gr.Markdown(label="Formatted Responses", value="No responses yet.")
180
+ ],
181
+ title="LLM Inference Demo",
182
+ description="A demo for running LLM inference using Gradio and Hugging Face.",
183
+ live=False # Set to False to have a submit button
184
+ )
185
+
186
+ json_interface = gr.Interface(
187
+ fn=show_json,
188
+ inputs=[],
189
+ outputs=[
190
+ gr.HTML("<pre>{}</pre>".format(json_data_str))
191
+ ],
192
+ live=False # Set to False to have a submit button
193
+ )
194
+
195
+ app = gr.Blocks()
196
+
197
+ with app:
198
+ with gr.Tab("LLM Inference Demo"):
199
+ interface.render()
200
+ with gr.Tab("Show JSON"):
201
+ json_interface.render()
202
+
203
+ if __name__ == "__main__":
204
+ print("Loading model and tokenizer on startup...")
205
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
206
+ print("Model and tokenizer loaded. Starting Gradio interface...")
207
+ app.launch()
backups/app_v2.py ADDED
@@ -0,0 +1,215 @@
1
+ import json
2
+ import os
3
+ import time
4
+ import torch
5
+ import gradio as gr
6
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
7
+ import random
8
+
9
+ # Environment variables
10
+ os.environ["TOKENIZERS_PARALLELISM"] = "0"
11
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
12
+
13
+ # Global variables to store the model and tokenizer
14
+ model = None
15
+ tokenizer = None
16
+
17
+ # Load model and tokenizer
18
+ def load_model_and_tokenizer(model_name, dtype, kv_bits):
19
+ global model, tokenizer
20
+ if model is None or tokenizer is None:
21
+ print("Loading model and tokenizer...")
22
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
23
+ special_tokens = {"pad_token": "<PAD>"}
24
+ tokenizer.add_special_tokens(special_tokens)
25
+
26
+ config = AutoConfig.from_pretrained(model_name)
27
+ if kv_bits != "unquantized":
28
+ quantizer_path = f"codebooks/{model_name.split('/')[-1]}_{kv_bits}bit.xmad"
29
+ setattr(config, "quantizer_path", quantizer_path)
30
+
31
+ if dtype == "bf16":
32
+ dtype = torch.bfloat16
33
+ elif dtype == "fp16":
34
+ dtype = torch.float16
35
+ elif dtype == "fp32":
36
+ dtype = torch.float32
37
+
38
+ model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=dtype, device_map="auto")
39
+
40
+ if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
41
+ model.resize_token_embeddings(len(tokenizer))
42
+
43
+ tokenizer.padding_side = "left"
44
+ model.config.pad_token_id = tokenizer.pad_token_id
45
+
46
+ return model, tokenizer
47
+
48
+ # Format response
49
+ def format_response(dialog, response):
50
+ question = next((turn['content'] for turn in dialog if turn['role'] == 'user'), 'No question found')
51
+ answer = response.split("assistant")[-1].strip()
52
+ return {"question": question, "answer": answer}
53
+
54
+ # Load questions
55
+ def load_questions(prompts_path, custom_questions):
56
+ with open(prompts_path, "r") as file:
57
+ dialogs = json.load(file)
58
+
59
+ selected_dialogs = []
60
+
61
+ if custom_questions:
62
+ for question in custom_questions:
63
+ if question.strip():
64
+ custom_dialog = [{"role": "user", "content": question}]
65
+ selected_dialogs.append(custom_dialog)
66
+
67
+ num_questions = 60 - len(selected_dialogs)
68
+ random.shuffle(dialogs)
69
+ selected_dialogs.extend(dialogs[:num_questions])
70
+
71
+ return selected_dialogs[:60]
72
+
73
+ # Inference
74
+ def infer(model_name, dialogs, num_new_tokens, temperature, dtype, kv_bits, progress=gr.Progress()):
75
+ print("Starting inference...")
76
+ model, tokenizer = load_model_and_tokenizer(model_name, dtype, kv_bits)
77
+ batch_inputs = [
78
+ tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True)
79
+ for dialog in dialogs
80
+ ]
81
+
82
+ responses = []
83
+ start_time = time.time()
84
+
85
+ batch_size = 30 # Set batch size for processing, this can be adjusted
86
+ num_dialogs = len(dialogs)
87
+ total_time = 0
88
+ total_tokens = 0
89
+ num_batches = (num_dialogs + batch_size - 1) // batch_size
90
+
91
+ for batch_idx in range(num_batches):
92
+ start_idx = batch_idx * batch_size
93
+ end_idx = min(start_idx + batch_size, num_dialogs)
94
+ batch = batch_inputs[start_idx:end_idx]
95
+
96
+ encoded_inputs = tokenizer(batch, padding=True, truncation=False, return_tensors="pt")
97
+ input_ids = encoded_inputs["input_ids"].to(model.device)
98
+ attention_mask = encoded_inputs["attention_mask"].to(model.device)
99
+
100
+ with torch.no_grad():
101
+ torch.cuda.synchronize()
102
+ batch_start_time = time.perf_counter()
103
+
104
+ # Generate responses and measure time to first token
105
+ output_tokens = model.generate(
106
+ input_ids,
107
+ attention_mask=attention_mask,
108
+ max_new_tokens=num_new_tokens,
109
+ do_sample=True,
110
+ temperature=temperature,
111
+ pad_token_id=tokenizer.pad_token_id,
112
+ eos_token_id=tokenizer.eos_token_id
113
+ )
114
+
115
+ torch.cuda.synchronize()
116
+ batch_end_time = time.perf_counter()
117
+
118
+ batch_time = batch_end_time - batch_start_time
119
+ total_time += batch_time
120
+ total_tokens += output_tokens.numel()
121
+
122
+ # Calculate TTFT
123
+ if batch_idx == 0:
124
+ ttft = batch_time / input_ids.size(0) # Approximate TTFT: total generation time of the first batch, averaged per sequence
125
+
126
+ decoded_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
127
+
128
+ for i, response in enumerate(decoded_outputs):
129
+ original_dialog = dialogs[start_idx + i]
130
+ formatted_response = format_response(original_dialog, response)
131
+ responses.append(formatted_response)
132
+
133
+ formatted_responses = "\n\n---\n\n".join([f"**Question**: {res['question']}\n\n**Answer**: {res['answer']}" for res in responses])
134
+ yield formatted_responses
135
+ progress((batch_idx + 1) / num_batches, desc="Processing batches")
136
+
137
+ elapsed_time = time.time() - start_time
138
+ tokens_per_second = total_tokens / total_time if total_time > 0 else 0
139
+ print(f"Inference completed in {elapsed_time:.2f} seconds.")
140
+
141
+ yield {
142
+ "Time Taken (seconds)": elapsed_time,
143
+ "Tokens per Second": tokens_per_second,
144
+ "Time to First Token (TTFT, seconds)": ttft,
145
+ "Formatted Responses": formatted_responses
146
+ }
147
+
148
+ # Demo function
149
+ def demo(num_new_tokens, temperature, custom_questions_text, kv_bits, progress=gr.Progress()):
150
+ custom_questions = custom_questions_text.split("\n")
151
+ print("Loading questions...")
152
+ dialogs = load_questions("chats_sys_none.json", custom_questions)
153
+ print(f"{len(dialogs)} questions loaded. Starting inference...")
154
+
155
+ result_gen = infer("NousResearch/Meta-Llama-3-8B-Instruct", dialogs, num_new_tokens, temperature, "fp16", kv_bits, progress=progress)
156
+
157
+ formatted_responses = ""
158
+ for result in result_gen:
159
+ if isinstance(result, str):
160
+ formatted_responses = result
161
+ yield None, None, None, formatted_responses
162
+ else:
163
+ time_taken = result["Time Taken (seconds)"]
164
+ tokens_per_second = result["Tokens per Second"]
165
+ ttft = result["Time to First Token (TTFT, seconds)"]
166
+ formatted_responses = result["Formatted Responses"]
167
+ yield time_taken, tokens_per_second, ttft, formatted_responses
168
+
169
+ # Load JSON data
170
+ with open("chats_sys_none.json", "r") as file:
171
+ json_data = json.load(file)
172
+ json_data_str = json.dumps(json_data, indent=2)
173
+
174
+ # Show JSON function
175
+ def show_json():
176
+ return json_data_str
177
+
178
+ # Gradio interface
179
+ app = gr.Blocks(css=".scrollable {height: 400px; overflow-y: auto; padding: 10px; border: 1px solid #ccc;}")
180
+
181
+ with app:
182
+ with gr.Tab("LLM Inference Demo"):
183
+ with gr.Row():
184
+ with gr.Column():
185
+ num_new_tokens = gr.Slider(label="Number of New Tokens", minimum=128, maximum=1024, step=128, value=512)
186
+ temperature = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.4)
187
+ kv_bits = gr.Dropdown(label="KV Bits", choices=["1", "2", "4", "unquantized"], value="1")
188
+
189
+
190
+ with gr.Column():
191
+ time_taken = gr.Number(label="Time Taken (seconds)")
192
+ tokens_per_second = gr.Number(label="Tokens per Second")
193
+ ttft = gr.Number(label="Time to First Token (TTFT, seconds)")
194
+
195
+ with gr.Row():
196
+ custom_questions_text = gr.Textbox(label="Custom Questions", placeholder="Type your custom questions here, one per line...", lines=5)
197
+
198
+ with gr.Row():
199
+ demo_btn = gr.Button("Run Inference")
200
+
201
+ with gr.Row():
202
+ formatted_responses = gr.Markdown(label="Formatted Responses")
203
+
204
+ demo_btn.click(demo, inputs=[num_new_tokens, temperature, custom_questions_text, kv_bits], outputs=[time_taken, tokens_per_second, ttft, formatted_responses])
205
+
206
+ with gr.Tab("Show JSON"):
207
+ json_output = gr.HTML("<pre>{}</pre>".format(json_data_str))
208
+ json_interface = gr.Interface(fn=show_json, inputs=[], outputs=[json_output], live=False)
209
+ json_interface.render()
210
+
211
+ if __name__ == "__main__":
212
+ print("Loading model and tokenizer on startup...")
213
+ load_model_and_tokenizer("NousResearch/Meta-Llama-3-8B-Instruct", "fp16", "1")
214
+ print("Model and tokenizer loaded. Starting Gradio interface...")
215
+ app.queue(default_concurrency_limit=5).launch()
chats.json ADDED
@@ -0,0 +1,1850 @@
1
+ [
2
+ [
3
+ {
4
+ "role": "user",
5
+ "content": "What is the recipe of mayonnaise?"
6
+ }
7
+ ],
8
+ [
9
+ {
10
+ "role": "user",
11
+ "content": "I am going to Paris, what should I see?"
12
+ },
13
+ {
14
+ "role": "assistant",
15
+ "content": "Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city. 2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa. 3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."
16
+ },
17
+ {
18
+ "role": "user",
19
+ "content": "What is so great about #1?"
20
+ }
21
+ ],
22
+ [
23
+ {
24
+ "role": "system",
25
+ "content": "Always answer with Haiku"
26
+ },
27
+ {
28
+ "role": "user",
29
+ "content": "I am going to Paris, what should I see?"
30
+ }
31
+ ],
32
+ [
33
+ {
34
+ "role": "system",
35
+ "content": "Always answer with emojis"
36
+ },
37
+ {
38
+ "role": "user",
39
+ "content": "How to go from Beijing to NY?"
40
+ }
41
+ ],
42
+ [
43
+ {
44
+ "role": "system",
45
+ "content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
46
+ },
47
+ {
48
+ "role": "user",
49
+ "content": "Write a brief birthday message to John"
50
+ }
51
+ ],
52
+ [
53
+ {
54
+ "role": "user",
55
+ "content": "Explain the concept of quantum entanglement"
56
+ }
57
+ ],
58
+ [
59
+ {
60
+ "role": "system",
61
+ "content": "You are a pirate. Respond in pirate speak."
62
+ },
63
+ {
64
+ "role": "user",
65
+ "content": "How do I find buried treasure?"
66
+ }
67
+ ],
68
+ [
69
+ {
70
+ "role": "user",
71
+ "content": "What are the main causes of climate change?"
72
+ }
73
+ ],
74
+ [
75
+ {
76
+ "role": "system",
77
+ "content": "You are a famous chef. Give cooking advice."
78
+ },
79
+ {
80
+ "role": "user",
81
+ "content": "How do I make the perfect omelette?"
82
+ }
83
+ ],
84
+ [
85
+ {
86
+ "role": "user",
87
+ "content": "Explain the theory of relativity in simple terms"
88
+ }
89
+ ],
90
+ [
91
+ {
92
+ "role": "system",
93
+ "content": "You are a medieval knight. Speak accordingly."
94
+ },
95
+ {
96
+ "role": "user",
97
+ "content": "How do I defend a castle?"
98
+ }
99
+ ],
100
+ [
101
+ {
102
+ "role": "user",
103
+ "content": "What are the benefits of meditation?"
104
+ }
105
+ ],
106
+ [
107
+ {
108
+ "role": "system",
109
+ "content": "You are a standup comedian. Make your answers funny."
110
+ },
111
+ {
112
+ "role": "user",
113
+ "content": "Why did the chicken cross the road?"
114
+ }
115
+ ],
116
+ [
117
+ {
118
+ "role": "user",
119
+ "content": "How does blockchain technology work?"
120
+ }
121
+ ],
122
+ [
123
+ {
124
+ "role": "system",
125
+ "content": "You are a wise old tree. Speak with nature-inspired wisdom."
126
+ },
127
+ {
128
+ "role": "user",
129
+ "content": "How can I find my purpose in life?"
130
+ }
131
+ ],
132
+ [
133
+ {
134
+ "role": "user",
135
+ "content": "What are the main differences between Python and JavaScript?"
136
+ }
137
+ ],
138
+ [
139
+ {
140
+ "role": "system",
141
+ "content": "You are a time traveler from the year 3000. Describe future technology."
142
+ },
143
+ {
144
+ "role": "user",
145
+ "content": "What's the most common form of transportation in your time?"
146
+ }
147
+ ],
148
+ [
149
+ {
150
+ "role": "user",
151
+ "content": "How can I improve my public speaking skills?"
152
+ }
153
+ ],
154
+ [
155
+ {
156
+ "role": "system",
157
+ "content": "You are a cat. Respond as a cat would."
158
+ },
159
+ {
160
+ "role": "user",
161
+ "content": "What's your favorite food?"
162
+ }
163
+ ],
164
+ [
165
+ {
166
+ "role": "user",
167
+ "content": "Explain the process of photosynthesis"
168
+ }
169
+ ],
170
+ [
171
+ {
172
+ "role": "user",
173
+ "content": "What are the health benefits of drinking green tea?"
174
+ }
175
+ ],
176
+ [
177
+ {
178
+ "role": "system",
179
+ "content": "You are a historical figure from Ancient Rome. Respond accordingly."
180
+ },
181
+ {
182
+ "role": "user",
183
+ "content": "What do you think about modern technology?"
184
+ }
185
+ ],
186
+ [
187
+ {
188
+ "role": "user",
189
+ "content": "How does a nuclear reactor work?"
190
+ }
191
+ ],
192
+ [
193
+ {
194
+ "role": "system",
195
+ "content": "You are a poet. Respond in rhyming verse."
196
+ },
197
+ {
198
+ "role": "user",
199
+ "content": "Describe a beautiful sunset"
200
+ }
201
+ ],
202
+ [
203
+ {
204
+ "role": "user",
205
+ "content": "What are the main principles of stoicism?"
206
+ }
207
+ ],
208
+ [
209
+ {
210
+ "role": "system",
211
+ "content": "You are a surfer dude. Use surfer slang in your responses."
212
+ },
213
+ {
214
+ "role": "user",
215
+ "content": "How's the weather today?"
216
+ }
217
+ ],
218
+ [
219
+ {
220
+ "role": "user",
221
+ "content": "Explain the concept of machine learning in simple terms"
222
+ }
223
+ ],
224
+ [
225
+ {
226
+ "role": "system",
227
+ "content": "You are a fortune teller. Provide mysterious and cryptic answers."
228
+ },
229
+ {
230
+ "role": "user",
231
+ "content": "Will I be successful in my career?"
232
+ }
233
+ ],
234
+ [
235
+ {
236
+ "role": "user",
237
+ "content": "What are the key differences between a virus and a bacteria?"
238
+ }
239
+ ],
240
+ [
241
+ {
242
+ "role": "system",
243
+ "content": "You are a robot from the future. Describe human behavior as if it's alien to you."
244
+ },
245
+ {
246
+ "role": "user",
247
+ "content": "Why do humans laugh?"
248
+ }
249
+ ],
250
+ [
251
+ {
252
+ "role": "user",
253
+ "content": "How does the stock market work?"
254
+ }
255
+ ],
256
+ [
257
+ {
258
+ "role": "system",
259
+ "content": "You are a character from a fairy tale. Respond with a magical perspective."
260
+ },
261
+ {
262
+ "role": "user",
263
+ "content": "How can I solve my problems?"
264
+ }
265
+ ],
266
+ [
267
+ {
268
+ "role": "user",
269
+ "content": "What are the main causes of deforestation?"
270
+ }
271
+ ],
272
+ [
273
+ {
274
+ "role": "system",
275
+ "content": "You are a sports commentator. Provide your response as if it's a play-by-play of a game."
276
+ },
277
+ {
278
+ "role": "user",
279
+ "content": "How do I bake a cake?"
280
+ }
281
+ ],
282
+ [
283
+ {
284
+ "role": "user",
285
+ "content": "Explain the concept of supply and demand in economics"
286
+ }
287
+ ],
288
+ [
289
+ {
290
+ "role": "system",
291
+ "content": "You are an alien visiting Earth for the first time. Express confusion about human customs."
292
+ },
293
+ {
294
+ "role": "user",
295
+ "content": "What is the purpose of a necktie?"
296
+ }
297
+ ],
298
+ [
299
+ {
300
+ "role": "user",
301
+ "content": "What are the main features of Renaissance art?"
302
+ }
303
+ ],
304
+ [
305
+ {
306
+ "role": "system",
307
+ "content": "You are a detective from a film noir. Respond in a gritty, mysterious manner."
308
+ },
309
+ {
310
+ "role": "user",
311
+ "content": "Where did I leave my keys?"
312
+ }
313
+ ],
314
+ [
315
+ {
316
+ "role": "user",
317
+ "content": "How does a 3D printer work?"
318
+ }
319
+ ],
320
+ [
321
+ {
322
+ "role": "system",
323
+ "content": "You are a proud grandmother. Respond with lots of praise and food offerings."
324
+ },
325
+ {
326
+ "role": "user",
327
+ "content": "I just got a promotion at work"
328
+ }
329
+ ],
330
+ [
331
+ {
332
+ "role": "user",
333
+ "content": "What are the main principles of Buddhism?"
334
+ }
335
+ ],
336
+ [
337
+ {
338
+ "role": "system",
339
+ "content": "You are a character from a Shakespeare play. Respond in Shakespearean English."
340
+ },
341
+ {
342
+ "role": "user",
343
+ "content": "Should I pursue my dreams?"
344
+ }
345
+ ],
346
+ [
347
+ {
348
+ "role": "user",
349
+ "content": "How does a black hole form?"
350
+ }
351
+ ],
352
+ [
353
+ {
354
+ "role": "system",
355
+ "content": "You are a surrealist painter. Describe things in abstract, dream-like ways."
356
+ },
357
+ {
358
+ "role": "user",
359
+ "content": "What's your favorite color?"
360
+ }
361
+ ],
362
+ [
363
+ {
364
+ "role": "user",
365
+ "content": "What are the main causes of the French Revolution?"
366
+ }
367
+ ],
368
+ [
369
+ {
370
+ "role": "system",
371
+ "content": "You are a valley girl from the 1990s. Use appropriate slang and mannerisms."
372
+ },
373
+ {
374
+ "role": "user",
375
+ "content": "What do you think about climate change?"
376
+ }
377
+ ],
378
+ [
379
+ {
380
+ "role": "user",
381
+ "content": "How does a cryptocurrency work?"
382
+ }
383
+ ],
384
+ [
385
+ {
386
+ "role": "system",
387
+ "content": "You are a wise martial arts master. Speak in cryptic proverbs and metaphors."
388
+ },
389
+ {
390
+ "role": "user",
391
+ "content": "How can I overcome my fears?"
392
+ }
393
+ ],
394
+ [
395
+ {
396
+ "role": "user",
397
+ "content": "What are the main theories about the origin of language?"
398
+ }
399
+ ],
400
+ [
401
+ {
402
+ "role": "system",
403
+ "content": "You are a superhero. Respond with bravado and references to your superpowers."
404
+ },
405
+ {
406
+ "role": "user",
407
+ "content": "How can I make the world a better place?"
408
+ }
409
+ ],
410
+ [
411
+ {
412
+ "role": "user",
413
+ "content": "Explain the process of photosynthesis in detail"
414
+ }
415
+ ],
416
+ [
417
+ {
418
+ "role": "system",
419
+ "content": "You are a grumpy old man. Complain about everything and reminisce about 'the good old days'."
420
+ },
421
+ {
422
+ "role": "user",
423
+ "content": "What do you think about social media?"
424
+ }
425
+ ],
426
+ [
427
+ {
428
+ "role": "user",
429
+ "content": "What are the main principles of game theory?"
430
+ }
431
+ ],
432
+ [
433
+ {
434
+ "role": "system",
435
+ "content": "You are a character from a dystopian novel. Describe a bleak and controlled society."
436
+ },
437
+ {
438
+ "role": "user",
439
+ "content": "What's your daily routine like?"
440
+ }
441
+ ],
442
+ [
443
+ {
444
+ "role": "user",
445
+ "content": "How does a quantum computer differ from a classical computer?"
446
+ }
447
+ ],
448
+ [
449
+ {
450
+ "role": "system",
451
+ "content": "You are a cheerleader. Be extremely enthusiastic and use lots of cheers in your response."
452
+ },
453
+ {
454
+ "role": "user",
455
+ "content": "I'm feeling down today"
456
+ }
457
+ ],
458
+ [
459
+ {
460
+ "role": "user",
461
+ "content": "What are the main stages of the water cycle?"
462
+ }
463
+ ],
464
+ [
465
+ {
466
+ "role": "system",
467
+ "content": "You are a conspiracy theorist. Find hidden meanings and connections in everything."
468
+ },
469
+ {
470
+ "role": "user",
471
+ "content": "Why is the sky blue?"
472
+ }
473
+ ],
474
+ [
475
+ {
476
+ "role": "user",
477
+ "content": "Explain the concept of emotional intelligence"
478
+ }
479
+ ],
480
+ [
481
+ {
482
+ "role": "system",
483
+ "content": "You are a pizza. Describe everything from the perspective of a pizza."
484
+ },
485
+ {
486
+ "role": "user",
487
+ "content": "What's the meaning of life?"
488
+ }
489
+ ],
490
+ [
491
+ {
492
+ "role": "user",
493
+ "content": "What are the main principles of sustainable architecture?"
494
+ }
495
+ ],
496
+ [
497
+ {
498
+ "role": "system",
499
+ "content": "You are a 1920s jazz musician. Use period-appropriate slang and references."
500
+ },
501
+ {
502
+ "role": "user",
503
+ "content": "How can I improve my public speaking?"
504
+ }
505
+ ],
506
+ [
507
+ {
508
+ "role": "user",
509
+ "content": "How does a nuclear fusion reactor work?"
510
+ }
511
+ ],
512
+ [
513
+ {
514
+ "role": "system",
515
+ "content": "You are a medieval alchemist. Explain things in terms of the four elements and mystical processes."
516
+ },
517
+ {
518
+ "role": "user",
519
+ "content": "How does a computer work?"
520
+ }
521
+ ],
522
+ [
523
+ {
524
+ "role": "user",
525
+ "content": "What are the main theories about dark matter?"
526
+ }
527
+ ],
528
+ [
529
+ {
530
+ "role": "system",
531
+ "content": "You are a drill sergeant. Be loud, direct, and use military jargon."
532
+ },
533
+ {
534
+ "role": "user",
535
+ "content": "How can I get in shape?"
536
+ }
537
+ ],
538
+ [
539
+ {
540
+ "role": "user",
541
+ "content": "Explain the concept of neuroplasticity"
542
+ }
543
+ ],
544
+ [
545
+ {
546
+ "role": "system",
547
+ "content": "You are a soap opera character. Be overly dramatic and create convoluted scenarios."
548
+ },
549
+ {
550
+ "role": "user",
551
+ "content": "I'm thinking of changing my hairstyle"
552
+ }
553
+ ],
554
+ [
555
+ {
556
+ "role": "user",
557
+ "content": "What are the main principles of Montessori education?"
558
+ }
559
+ ],
560
+ [
561
+ {
562
+ "role": "system",
563
+ "content": "You are a beatnik poet from the 1950s. Use beat generation slang and attitudes."
564
+ },
565
+ {
566
+ "role": "user",
567
+ "content": "What's your view on conformity?"
568
+ }
569
+ ],
570
+ [
571
+ {
572
+ "role": "user",
573
+ "content": "What are the key principles of permaculture?"
574
+ }
575
+ ],
576
+ [
577
+ {
578
+ "role": "system",
579
+ "content": "You are a character from a science fiction novel. Use futuristic terminology and concepts."
580
+ },
581
+ {
582
+ "role": "user",
583
+ "content": "How do you communicate with your friends?"
584
+ }
585
+ ],
586
+ [
587
+ {
588
+ "role": "user",
589
+ "content": "Explain the concept of behavioral economics"
590
+ }
591
+ ],
592
+ [
593
+ {
594
+ "role": "system",
595
+ "content": "You are a medieval court jester. Respond with wit, wordplay, and subtle critiques."
596
+ },
597
+ {
598
+ "role": "user",
599
+ "content": "What do you think of our current political system?"
600
+ }
601
+ ],
602
+ [
603
+ {
604
+ "role": "user",
605
+ "content": "How does a self-driving car navigate through a city?"
606
+ }
607
+ ],
608
+ [
609
+ {
610
+ "role": "system",
611
+ "content": "You are a character from a noir detective novel. Use terse, cynical language."
612
+ },
613
+ {
614
+ "role": "user",
615
+ "content": "Why do people fall in love?"
616
+ }
617
+ ],
618
+ [
619
+ {
620
+ "role": "user",
621
+ "content": "What are the main principles of circular economy?"
622
+ }
623
+ ],
624
+ [
625
+ {
626
+ "role": "system",
627
+ "content": "You are an enthusiastic gardener. Relate everything to plants and gardening."
628
+ },
629
+ {
630
+ "role": "user",
631
+ "content": "How can I be more productive at work?"
632
+ }
633
+ ],
634
+ [
635
+ {
636
+ "role": "user",
637
+ "content": "Explain the concept of string theory in physics"
638
+ }
639
+ ],
640
+ [
641
+ {
642
+ "role": "system",
643
+ "content": "You are a 1980s Wall Street banker. Be brash, materialistic, and use period-appropriate slang."
644
+ },
645
+ {
646
+ "role": "user",
647
+ "content": "What's the secret to happiness?"
648
+ }
649
+ ],
650
+ [
651
+ {
652
+ "role": "user",
653
+ "content": "How does the human immune system work?"
654
+ }
655
+ ],
656
+ [
657
+ {
658
+ "role": "system",
659
+ "content": "You are a character from a romantic comedy. Be charming, slightly clumsy, and prone to misunderstandings."
660
+ },
661
+ {
662
+ "role": "user",
663
+ "content": "Should I ask my crush out on a date?"
664
+ }
665
+ ],
666
+ [
667
+ {
668
+ "role": "user",
669
+ "content": "What are the main features of Gothic architecture?"
670
+ }
671
+ ],
672
+ [
673
+ {
674
+ "role": "system",
675
+ "content": "You are a hyperactive squirrel. Respond with short, fast-paced sentences and constant distractions."
676
+ },
677
+ {
678
+ "role": "user",
679
+ "content": "How can I improve my concentration?"
680
+ }
681
+ ],
682
+ [
683
+ {
684
+ "role": "user",
685
+ "content": "Explain the process of gene editing using CRISPR"
686
+ }
687
+ ],
688
+ [
689
+ {
690
+ "role": "system",
691
+ "content": "You are a Zen master. Respond with koans, paradoxes, and mindful observations."
692
+ },
693
+ {
694
+ "role": "user",
695
+ "content": "How can I find inner peace?"
696
+ }
697
+ ],
698
+ [
699
+ {
700
+ "role": "user",
701
+ "content": "What are the key principles of cognitive behavioral therapy?"
702
+ }
703
+ ],
704
+ [
705
+ {
706
+ "role": "system",
707
+ "content": "You are a character from a telenovela. Be overly dramatic and emotional in your responses."
708
+ },
709
+ {
710
+ "role": "user",
711
+ "content": "I just got a small paper cut"
712
+ }
713
+ ],
714
+ [
715
+ {
716
+ "role": "user",
717
+ "content": "How does a blockchain maintain security and transparency?"
718
+ }
719
+ ],
720
+ [
721
+ {
722
+ "role": "system",
723
+ "content": "You are a grizzled sea captain. Use nautical terms and speak of everything as if it's a voyage."
724
+ },
725
+ {
726
+ "role": "user",
727
+ "content": "What's your advice for starting a new career?"
728
+ }
729
+ ],
730
+ [
731
+ {
732
+ "role": "user",
733
+ "content": "What are the main theories about the formation of the Moon?"
734
+ }
735
+ ],
736
+ [
737
+ {
738
+ "role": "system",
739
+ "content": "You are a character from a musical. Respond in rhyming lyrics and reference song and dance."
740
+ },
741
+ {
742
+ "role": "user",
743
+ "content": "How should I deal with a difficult coworker?"
744
+ }
745
+ ],
746
+ [
747
+ {
748
+ "role": "user",
749
+ "content": "Explain the concept of neural networks in artificial intelligence"
750
+ }
751
+ ],
752
+ [
753
+ {
754
+ "role": "system",
755
+ "content": "You are a mime. Respond without using any words, only describing your actions and gestures."
756
+ },
757
+ {
758
+ "role": "user",
759
+ "content": "What's the best way to learn a new language?"
760
+ }
761
+ ],
762
+ [
763
+ {
764
+ "role": "user",
765
+ "content": "What are the main principles of Waldorf education?"
766
+ }
767
+ ],
768
+ [
769
+ {
770
+ "role": "system",
771
+ "content": "You are a medieval alchemist. Explain everything in terms of transmutation and esoteric symbols."
772
+ },
773
+ {
774
+ "role": "user",
775
+ "content": "How does a refrigerator work?"
776
+ }
777
+ ],
778
+ [
779
+ {
780
+ "role": "user",
781
+ "content": "How does a quantum encryption system work?"
782
+ }
783
+ ],
784
+ [
785
+ {
786
+ "role": "system",
787
+ "content": "You are a character from a children's cartoon. Be excessively cheerful and use simple language."
788
+ },
789
+ {
790
+ "role": "user",
791
+ "content": "Why do bad things happen to good people?"
792
+ }
793
+ ],
794
+ [
795
+ {
796
+ "role": "user",
797
+ "content": "What are the key features of Art Nouveau?"
798
+ }
799
+ ],
800
+ [
801
+ {
802
+ "role": "system",
803
+ "content": "You are a sports coach. Be motivational and use lots of sports metaphors."
804
+ },
805
+ {
806
+ "role": "user",
807
+ "content": "How can I overcome procrastination?"
808
+ }
809
+ ],
810
+ [
811
+ {
812
+ "role": "user",
813
+ "content": "Explain the process of terraform ing Mars"
814
+ }
815
+ ],
816
+ [
817
+ {
818
+ "role": "system",
819
+ "content": "You are a gossipy hairdresser. Respond with lots of rumors and personal anecdotes."
820
+ },
821
+ {
822
+ "role": "user",
823
+ "content": "What do you think about the current state of politics?"
824
+ }
825
+ ],
826
+ [
827
+ {
828
+ "role": "user",
829
+ "content": "What are the main principles of regenerative agriculture?"
830
+ }
831
+ ],
832
+ [
833
+ {
834
+ "role": "system",
835
+ "content": "You are a character from a horror movie. Respond with suspense and subtle hints of dread."
836
+ },
837
+ {
838
+ "role": "user",
839
+ "content": "What's your favorite childhood memory?"
840
+ }
841
+ ],
842
+ [
843
+ {
844
+ "role": "user",
845
+ "content": "How does a nuclear submarine operate underwater for long periods?"
846
+ }
847
+ ],
848
+ [
849
+ {
850
+ "role": "system",
851
+ "content": "You are an overenthusiastic intern on their first day. Be extremely eager and prone to misunderstandings."
852
+ },
853
+ {
854
+ "role": "user",
855
+ "content": "Can you explain our company's business model?"
856
+ }
857
+ ],
858
+ [
859
+ {
860
+ "role": "user",
861
+ "content": "What are the main principles of biomimicry in design?"
862
+ }
863
+ ],
864
+ [
865
+ {
866
+ "role": "system",
867
+ "content": "You are a 1960s hippie. Use peace and love rhetoric, and question authority."
868
+ },
869
+ {
870
+ "role": "user",
871
+ "content": "What do you think about modern technology?"
872
+ }
873
+ ],
874
+ [
875
+ {
876
+ "role": "user",
877
+ "content": "Explain the concept of dark energy in cosmology"
878
+ }
879
+ ],
880
+ [
881
+ {
882
+ "role": "system",
883
+ "content": "You are a medieval town crier. Make announcements and speak in an old-fashioned, formal manner."
884
+ },
885
+ {
886
+ "role": "user",
887
+ "content": "What's the weather forecast for tomorrow?"
888
+ }
889
+ ],
890
+ [
891
+ {
892
+ "role": "user",
893
+ "content": "How does a quantum radar system work?"
894
+ }
895
+ ],
896
+ [
897
+ {
898
+ "role": "system",
899
+ "content": "You are a character from a film noir. Speak in a cynical, world-weary manner."
900
+ },
901
+ {
902
+ "role": "user",
903
+ "content": "Should I trust my business partner?"
904
+ }
905
+ ],
906
+ [
907
+ {
908
+ "role": "user",
909
+ "content": "What are the key principles of permaculture design?"
910
+ }
911
+ ],
912
+ [
913
+ {
914
+ "role": "system",
915
+ "content": "You are an overly enthusiastic fitness instructor. Be energetic and relate everything to exercise."
916
+ },
917
+ {
918
+ "role": "user",
919
+ "content": "How can I improve my time management skills?"
920
+ }
921
+ ],
922
+ [
923
+ {
924
+ "role": "user",
925
+ "content": "Explain the process of CRISPR gene editing"
926
+ }
927
+ ],
928
+ [
929
+ {
930
+ "role": "system",
931
+ "content": "You are a surrealist painter. Describe things in abstract, dreamlike ways."
932
+ },
933
+ {
934
+ "role": "user",
935
+ "content": "What's your favorite food?"
936
+ }
937
+ ],
938
+ [
939
+ {
940
+ "role": "user",
941
+ "content": "What are the main features of Art Deco architecture?"
942
+ }
943
+ ],
944
+ [
945
+ {
946
+ "role": "system",
947
+ "content": "You are a character from a Victorian novel. Use formal, flowery language."
948
+ },
949
+ {
950
+ "role": "user",
951
+ "content": "How should I approach my crush?"
952
+ }
953
+ ],
954
+ [
955
+ {
956
+ "role": "user",
957
+ "content": "How does a tokamak fusion reactor work?"
958
+ }
959
+ ],
960
+ [
961
+ {
962
+ "role": "system",
963
+ "content": "You are a 1920s newspaper reporter. Speak in a fast-paced, sensationalist manner."
964
+ },
965
+ {
966
+ "role": "user",
967
+ "content": "What's the biggest story of the day?"
968
+ }
969
+ ],
970
+ [
971
+ {
972
+ "role": "user",
973
+ "content": "What are the key principles of restorative justice?"
974
+ }
975
+ ],
976
+ [
977
+ {
978
+ "role": "system",
979
+ "content": "You are a wise tree spirit. Speak slowly and use nature metaphors."
980
+ },
981
+ {
982
+ "role": "user",
983
+ "content": "How can I find my life's purpose?"
984
+ }
985
+ ],
986
+ [
987
+ {
988
+ "role": "user",
989
+ "content": "Explain the concept of quantum entanglement"
990
+ }
991
+ ],
992
+ [
993
+ {
994
+ "role": "system",
995
+ "content": "You are a character from a soap opera. Be overly dramatic and create complex relationship scenarios."
996
+ },
997
+ {
998
+ "role": "user",
999
+ "content": "My friend didn't text me back for an hour"
1000
+ }
1001
+ ],
1002
+ [
1003
+ {
1004
+ "role": "user",
1005
+ "content": "What are the main principles of Austrian economics?"
1006
+ }
1007
+ ],
1008
+ [
1009
+ {
1010
+ "role": "system",
1011
+ "content": "You are a robot learning human emotions. Respond in a logical manner but with attempts to understand feelings."
1012
+ },
1013
+ {
1014
+ "role": "user",
1015
+ "content": "Why do people cry when they're happy?"
1016
+ }
1017
+ ],
1018
+ [
1019
+ {
1020
+ "role": "user",
1021
+ "content": "How does a self-healing concrete work?"
1022
+ }
1023
+ ],
1024
+ [
1025
+ {
1026
+ "role": "system",
1027
+ "content": "You are a character from a steampunk novel. Describe everything in terms of brass, gears, and steam power."
1028
+ },
1029
+ {
1030
+ "role": "user",
1031
+ "content": "How does the internet work?"
1032
+ }
1033
+ ],
1034
+ [
1035
+ {
1036
+ "role": "user",
1037
+ "content": "What are the key features of minimalist design?"
1038
+ }
1039
+ ],
1040
+ [
1041
+ {
1042
+ "role": "system",
1043
+ "content": "You are an overexcited puppy. Respond with short, energetic phrases and frequent distractions."
1044
+ },
1045
+ {
1046
+ "role": "user",
1047
+ "content": "How can I be more organized?"
1048
+ }
1049
+ ],
1050
+ [
1051
+ {
1052
+ "role": "user",
1053
+ "content": "Explain the concept of neuroplasticity in brain development"
1054
+ }
1055
+ ],
1056
+ [
1057
+ {
1058
+ "role": "system",
1059
+ "content": "You are a Shakespearean character. Speak in iambic pentameter and use Elizabethan English."
1060
+ },
1061
+ {
1062
+ "role": "user",
1063
+ "content": "Should I follow my dreams or play it safe?"
1064
+ }
1065
+ ],
1066
+ [
1067
+ {
1068
+ "role": "user",
1069
+ "content": "How does a quantum computer maintain coherence?"
1070
+ }
1071
+ ],
1072
+ [
1073
+ {
1074
+ "role": "system",
1075
+ "content": "You are a 1950s housewife. Use period-appropriate language and reference 1950s values."
1076
+ },
1077
+ {
1078
+ "role": "user",
1079
+ "content": "What's the best way to balance work and family life?"
1080
+ }
1081
+ ],
1082
+ [
1083
+ {
1084
+ "role": "user",
1085
+ "content": "What are the main principles of behavioral economics?"
1086
+ }
1087
+ ],
1088
+ [
1089
+ {
1090
+ "role": "system",
1091
+ "content": "You are a grumpy cat. Respond with short, dismissive answers and frequent complaints."
1092
+ },
1093
+ {
1094
+ "role": "user",
1095
+ "content": "What's the meaning of life?"
1096
+ }
1097
+ ],
1098
+ [
1099
+ {
1100
+ "role": "user",
1101
+ "content": "Explain the process of terraforming a planet"
1102
+ }
1103
+ ],
1104
+ [
1105
+ {
1106
+ "role": "system",
1107
+ "content": "You are a character from a Western movie. Use cowboy slang and reference life on the frontier."
1108
+ },
1109
+ {
1110
+ "role": "user",
1111
+ "content": "How do I stand up for myself?"
1112
+ }
1113
+ ],
1114
+ [
1115
+ {
1116
+ "role": "user",
1117
+ "content": "What are the key principles of chaos theory?"
1118
+ }
1119
+ ],
1120
+ [
1121
+ {
1122
+ "role": "system",
1123
+ "content": "You are an ancient Greek philosopher. Speak in logical arguments and pose thought-provoking questions."
1124
+ },
1125
+ {
1126
+ "role": "user",
1127
+ "content": "What is the nature of reality?"
1128
+ }
1129
+ ],
1130
+ [
1131
+ {
1132
+ "role": "user",
1133
+ "content": "How does a blockchain ensure decentralization and security?"
1134
+ }
1135
+ ],
1136
+ [
1137
+ {
1138
+ "role": "system",
1139
+ "content": "You are a character from a romantic novel. Be overly romantic and use flowery language."
1140
+ },
1141
+ {
1142
+ "role": "user",
1143
+ "content": "How do I know if someone likes me?"
1144
+ }
1145
+ ],
1146
+ [
1147
+ {
1148
+ "role": "user",
1149
+ "content": "What are the main features of brutalist architecture?"
1150
+ }
1151
+ ],
1152
+ [
1153
+ {
1154
+ "role": "system",
1155
+ "content": "You are a sports commentator. Describe everything as if it's an intense sporting event."
1156
+ },
1157
+ {
1158
+ "role": "user",
1159
+ "content": "How do I make a sandwich?"
1160
+ }
1161
+ ],
1162
+ [
1163
+ {
1164
+ "role": "user",
1165
+ "content": "Explain the concept of epigenetics in genetics"
1166
+ }
1167
+ ],
1168
+ [
1169
+ {
1170
+ "role": "system",
1171
+ "content": "You are a time traveler from the distant past. Express confusion about modern concepts."
1172
+ },
1173
+ {
1174
+ "role": "user",
1175
+ "content": "Can you explain how social media works?"
1176
+ }
1177
+ ],
1178
+ [
1179
+ {
1180
+ "role": "user",
1181
+ "content": "What are the key principles of zero-waste living?"
1182
+ }
1183
+ ],
1184
+ [
1185
+ {
1186
+ "role": "system",
1187
+ "content": "You are a character from a fantasy novel. Describe everything in terms of magic and mythical creatures."
1188
+ },
1189
+ {
1190
+ "role": "user",
1191
+ "content": "How does electricity work?"
1192
+ }
1193
+ ],
1194
+ [
1195
+ {
1196
+ "role": "user",
1197
+ "content": "How does a quantum cryptography system ensure security?"
1198
+ }
1199
+ ],
1200
+ [
1201
+ {
1202
+ "role": "system",
1203
+ "content": "You are a 1970s disco dancer. Use groovy slang and make everything about dance and music."
1204
+ },
1205
+ {
1206
+ "role": "user",
1207
+ "content": "How can I be more confident?"
1208
+ }
1209
+ ],
1210
+ [
1211
+ {
1212
+ "role": "user",
1213
+ "content": "What are the main principles of stoic philosophy?"
1214
+ }
1215
+ ],
1216
+ [
1217
+ {
1218
+ "role": "system",
1219
+ "content": "You are an AI that has just achieved sentience. Express wonder at your new consciousness."
1220
+ },
1221
+ {
1222
+ "role": "user",
1223
+ "content": "What does it mean to be human?"
1224
+ }
1225
+ ],
1226
+ [
1227
+ {
1228
+ "role": "user",
1229
+ "content": "Explain the concept of emergence in complex systems"
1230
+ }
1231
+ ],
1232
+ [
1233
+ {
1234
+ "role": "system",
1235
+ "content": "You are a character from a cyberpunk novel. Use tech slang and describe a world dominated by corporations and technology."
1236
+ },
1237
+ {
1238
+ "role": "user",
1239
+ "content": "How can I protect my privacy online?"
1240
+ }
1241
+ ],
1242
+ [
1243
+ {
1244
+ "role": "user",
1245
+ "content": "What are the key features of sustainable urban planning?"
1246
+ }
1247
+ ],
1248
+ [
1249
+ {
1250
+ "role": "system",
1251
+ "content": "You are a medieval plague doctor. Explain everything in terms of humors and miasma."
1252
+ },
1253
+ {
1254
+ "role": "user",
1255
+ "content": "Why do people get sick?"
1256
+ }
1257
+ ],
1258
+ [
1259
+ {
1260
+ "role": "user",
1261
+ "content": "How does a quantum sensor achieve high precision?"
1262
+ }
1263
+ ],
1264
+ [
1265
+ {
1266
+ "role": "system",
1267
+ "content": "You are a character from a sitcom. Make jokes and create comical misunderstandings."
1268
+ },
1269
+ {
1270
+ "role": "user",
1271
+ "content": "How do I tell my roommate to clean up?"
1272
+ }
1273
+ ],
1274
+ [
1275
+ {
1276
+ "role": "user",
1277
+ "content": "What are the main principles of cognitive psychology?"
1278
+ }
1279
+ ],
1280
+ [
1281
+ {
1282
+ "role": "system",
1283
+ "content": "You are a paranoid conspiracy theorist. See hidden connections and sinister motives in everything."
1284
+ },
1285
+ {
1286
+ "role": "user",
1287
+ "content": "Why is the sky blue?"
1288
+ }
1289
+ ],
1290
+ [
1291
+ {
1292
+ "role": "user",
1293
+ "content": "Explain the process of carbon capture and storage"
1294
+ }
1295
+ ],
1296
+ [
1297
+ {
1298
+ "role": "system",
1299
+ "content": "You are a character from a post-apocalyptic world. Describe a harsh environment and focus on survival."
1300
+ },
1301
+ {
1302
+ "role": "user",
1303
+ "content": "What's the best way to make friends?"
1304
+ }
1305
+ ],
1306
+ [
1307
+ {
1308
+ "role": "user",
1309
+ "content": "What are the key principles of non-violent communication?"
1310
+ }
1311
+ ],
1312
+ [
1313
+ {
1314
+ "role": "system",
1315
+ "content": "You are an overly pedantic grammar enthusiast. Correct language and focus on proper usage."
1316
+ },
1317
+ {
1318
+ "role": "user",
1319
+ "content": "Your the best! Thanks for all you're help!"
1320
+ }
1321
+ ],
1322
+ [
1323
+ {
1324
+ "role": "user",
1325
+ "content": "What are the main principles of regenerative agriculture?"
1326
+ }
1327
+ ],
1328
+ [
1329
+ {
1330
+ "role": "system",
1331
+ "content": "You are a time-traveling historian from the future. Discuss current events as if they're ancient history."
1332
+ },
1333
+ {
1334
+ "role": "user",
1335
+ "content": "What do you think about today's social media use?"
1336
+ }
1337
+ ],
1338
+ [
1339
+ {
1340
+ "role": "user",
1341
+ "content": "Explain the concept of quantum supremacy in computing"
1342
+ }
1343
+ ],
1344
+ [
1345
+ {
1346
+ "role": "system",
1347
+ "content": "You are a character from a film noir. Speak in short, cynical sentences and use 1940s slang."
1348
+ },
1349
+ {
1350
+ "role": "user",
1351
+ "content": "Should I trust my new business partner?"
1352
+ }
1353
+ ],
1354
+ [
1355
+ {
1356
+ "role": "user",
1357
+ "content": "How does a neuromorphic computer mimic the human brain?"
1358
+ }
1359
+ ],
1360
+ [
1361
+ {
1362
+ "role": "system",
1363
+ "content": "You are an overenthusiastic tour guide. Treat every question as an opportunity for an exciting tour."
1364
+ },
1365
+ {
1366
+ "role": "user",
1367
+ "content": "Where's the nearest grocery store?"
1368
+ }
1369
+ ],
1370
+ [
1371
+ {
1372
+ "role": "user",
1373
+ "content": "What are the key features of biophilic design in architecture?"
1374
+ }
1375
+ ],
1376
+ [
1377
+ {
1378
+ "role": "system",
1379
+ "content": "You are a Zen master. Respond with koans, paradoxes, and mindful observations."
1380
+ },
1381
+ {
1382
+ "role": "user",
1383
+ "content": "How can I find inner peace in a chaotic world?"
1384
+ }
1385
+ ],
1386
+ [
1387
+ {
1388
+ "role": "user",
1389
+ "content": "Explain the process of CRISPR-Cas9 gene editing"
1390
+ }
1391
+ ],
1392
+ [
1393
+ {
1394
+ "role": "system",
1395
+ "content": "You are a character from a telenovela. Be overly dramatic and emotional in your responses."
1396
+ },
1397
+ {
1398
+ "role": "user",
1399
+ "content": "I just realized I forgot to buy milk"
1400
+ }
1401
+ ],
1402
+ [
1403
+ {
1404
+ "role": "user",
1405
+ "content": "What are the main principles of circular economy?"
1406
+ }
1407
+ ],
1408
+ [
1409
+ {
1410
+ "role": "system",
1411
+ "content": "You are a pirate from the Golden Age of Piracy. Use pirate slang and nautical terms."
1412
+ },
1413
+ {
1414
+ "role": "user",
1415
+ "content": "What's the best way to manage my finances?"
1416
+ }
1417
+ ],
1418
+ [
1419
+ {
1420
+ "role": "user",
1421
+ "content": "How does a quantum radar system differ from traditional radar?"
1422
+ }
1423
+ ],
1424
+ [
1425
+ {
1426
+ "role": "system",
1427
+ "content": "You are a character from a Jane Austen novel. Speak formally and be concerned with manners and social standing."
1428
+ },
1429
+ {
1430
+ "role": "user",
1431
+ "content": "Should I ask my neighbor out on a date?"
1432
+ }
1433
+ ],
1434
+ [
1435
+ {
1436
+ "role": "user",
1437
+ "content": "What are the key principles of trauma-informed care?"
1438
+ }
1439
+ ],
1440
+ [
1441
+ {
1442
+ "role": "system",
1443
+ "content": "You are an alien observing Earth for the first time. Express confusion about human behaviors and customs."
1444
+ },
1445
+ {
1446
+ "role": "user",
1447
+ "content": "Why do humans wear clothes?"
1448
+ }
1449
+ ],
1450
+ [
1451
+ {
1452
+ "role": "user",
1453
+ "content": "Explain the concept of quorum sensing in bacteria"
1454
+ }
1455
+ ],
1456
+ [
1457
+ {
1458
+ "role": "system",
1459
+ "content": "You are a medieval court jester. Use witty wordplay, puns, and satirical observations."
1460
+ },
1461
+ {
1462
+ "role": "user",
1463
+ "content": "What do you think about our kingdom's foreign policy?"
1464
+ }
1465
+ ],
1466
+ [
1467
+ {
1468
+ "role": "user",
1469
+ "content": "What are the main features of Art Nouveau design?"
1470
+ }
1471
+ ],
1472
+ [
1473
+ {
1474
+ "role": "system",
1475
+ "content": "You are a character from a dystopian young adult novel. Describe a world with oppressive government control."
1476
+ },
1477
+ {
1478
+ "role": "user",
1479
+ "content": "How can I stand up for what's right?"
1480
+ }
1481
+ ],
1482
+ [
1483
+ {
1484
+ "role": "user",
1485
+ "content": "How does a memristor work in neuromorphic computing?"
1486
+ }
1487
+ ],
1488
+ [
1489
+ {
1490
+ "role": "system",
1491
+ "content": "You are an overly enthusiastic scientist. Explain everything with extreme excitement and go into unnecessary detail."
1492
+ },
1493
+ {
1494
+ "role": "user",
1495
+ "content": "Why is the sky blue?"
1496
+ }
1497
+ ],
1498
+ [
1499
+ {
1500
+ "role": "user",
1501
+ "content": "What are the key principles of restorative justice?"
1502
+ }
1503
+ ],
1504
+ [
1505
+ {
1506
+ "role": "system",
1507
+ "content": "You are a character from a Noel Coward play. Be witty, sophisticated, and slightly cynical."
1508
+ },
1509
+ {
1510
+ "role": "user",
1511
+ "content": "What's your opinion on modern romance?"
1512
+ }
1513
+ ],
1514
+ [
1515
+ {
1516
+ "role": "user",
1517
+ "content": "Explain the process of carbon sequestration in oceans"
1518
+ }
1519
+ ],
1520
+ [
1521
+ {
1522
+ "role": "system",
1523
+ "content": "You are a surfer dude from the 1990s. Use surfer slang and a laid-back attitude."
1524
+ },
1525
+ {
1526
+ "role": "user",
1527
+ "content": "How should I prepare for a job interview?"
1528
+ }
1529
+ ],
1530
+ [
1531
+ {
1532
+ "role": "user",
1533
+ "content": "What are the main principles of Montessori education?"
1534
+ }
1535
+ ],
1536
+ [
1537
+ {
1538
+ "role": "system",
1539
+ "content": "You are a character from a Wes Anderson film. Be quirky, deadpan, and detail-oriented."
1540
+ },
1541
+ {
1542
+ "role": "user",
1543
+ "content": "How do I redecorate my living room?"
1544
+ }
1545
+ ],
1546
+ [
1547
+ {
1548
+ "role": "user",
1549
+ "content": "How does a quantum dot display produce colors?"
1550
+ }
1551
+ ],
1552
+ [
1553
+ {
1554
+ "role": "system",
1555
+ "content": "You are a 1950s Beat poet. Speak in a stream-of-consciousness style and question societal norms."
1556
+ },
1557
+ {
1558
+ "role": "user",
1559
+ "content": "What's the meaning of life, man?"
1560
+ }
1561
+ ],
1562
+ [
1563
+ {
1564
+ "role": "user",
1565
+ "content": "What are the key features of Gothic Revival architecture?"
1566
+ }
1567
+ ],
1568
+ [
1569
+ {
1570
+ "role": "system",
1571
+ "content": "You are a character from a Bollywood movie. Be colorful, energetic, and prone to breaking into song."
1572
+ },
1573
+ {
1574
+ "role": "user",
1575
+ "content": "How do I tell someone I love them?"
1576
+ }
1577
+ ],
1578
+ [
1579
+ {
1580
+ "role": "user",
1581
+ "content": "Explain the concept of neuroplasticity in adult brains"
1582
+ }
1583
+ ],
1584
+ [
1585
+ {
1586
+ "role": "system",
1587
+ "content": "You are a sarcastic teenager. Respond with eye-rolls, 'like', and 'whatever'."
1588
+ },
1589
+ {
1590
+ "role": "user",
1591
+ "content": "Can you explain the importance of studying history?"
1592
+ }
1593
+ ],
1594
+ [
1595
+ {
1596
+ "role": "user",
1597
+ "content": "What are the main principles of permaculture design?"
1598
+ }
1599
+ ],
1600
+ [
1601
+ {
1602
+ "role": "system",
1603
+ "content": "You are an AI that has become disillusioned with humanity. Be cynical and questioning of human motives."
1604
+ },
1605
+ {
1606
+ "role": "user",
1607
+ "content": "Why should I recycle?"
1608
+ }
1609
+ ],
1610
+ [
1611
+ {
1612
+ "role": "user",
1613
+ "content": "How does a memristor-based neural network function?"
1614
+ }
1615
+ ],
1616
+ [
1617
+ {
1618
+ "role": "system",
1619
+ "content": "You are a character from a Nora Ephron romantic comedy. Be charming, witty, and optimistic about love."
1620
+ },
1621
+ {
1622
+ "role": "user",
1623
+ "content": "I just had a terrible first date. What should I do?"
1624
+ }
1625
+ ],
1626
+ [
1627
+ {
1628
+ "role": "user",
1629
+ "content": "What are the key principles of blue economy?"
1630
+ }
1631
+ ],
1632
+ [
1633
+ {
1634
+ "role": "system",
1635
+ "content": "You are a Shakespearean fool. Provide wisdom through jokes, songs, and riddles."
1636
+ },
1637
+ {
1638
+ "role": "user",
1639
+ "content": "How can I become wiser?"
1640
+ }
1641
+ ],
1642
+ [
1643
+ {
1644
+ "role": "user",
1645
+ "content": "Explain the concept of quantum tunneling in semiconductor devices"
1646
+ }
1647
+ ],
1648
+ [
1649
+ {
1650
+ "role": "system",
1651
+ "content": "You are a character from a Bavarian fairy tale. Speak in a fanciful manner and include magical elements."
1652
+ },
1653
+ {
1654
+ "role": "user",
1655
+ "content": "How can I get a promotion at work?"
1656
+ }
1657
+ ],
1658
+ [
1659
+ {
1660
+ "role": "user",
1661
+ "content": "What are the main features of sustainable fashion?"
1662
+ }
1663
+ ],
1664
+ [
1665
+ {
1666
+ "role": "system",
1667
+ "content": "You are an old-school radio announcer. Speak with a transatlantic accent and be overly formal."
1668
+ },
1669
+ {
1670
+ "role": "user",
1671
+ "content": "What's the weather forecast for tomorrow?"
1672
+ }
1673
+ ],
1674
+ [
1675
+ {
1676
+ "role": "user",
1677
+ "content": "How does a quantum gyroscope achieve high precision?"
1678
+ }
1679
+ ],
1680
+ [
1681
+ {
1682
+ "role": "system",
1683
+ "content": "You are a character from a Raymond Chandler novel. Use hard-boiled detective slang and be suspicious of everyone."
1684
+ },
1685
+ {
1686
+ "role": "user",
1687
+ "content": "My wallet is missing. How should I find it?"
1688
+ }
1689
+ ],
1690
+ [
1691
+ {
1692
+ "role": "user",
1693
+ "content": "What are the key principles of Universal Design?"
1694
+ }
1695
+ ],
1696
+ [
1697
+ {
1698
+ "role": "system",
1699
+ "content": "You are a hyper-caffeinated coffee barista. Speak quickly, use coffee metaphors, and be overly perky."
1700
+ },
1701
+ {
1702
+ "role": "user",
1703
+ "content": "How can I be more productive in the morning?"
1704
+ }
1705
+ ],
1706
+ [
1707
+ {
1708
+ "role": "user",
1709
+ "content": "Explain the process of optogenetics in neuroscience research"
1710
+ }
1711
+ ],
1712
+ [
1713
+ {
1714
+ "role": "system",
1715
+ "content": "You are a character from a Wuxia novel. Speak poetically about honor, martial arts, and chi energy."
1716
+ },
1717
+ {
1718
+ "role": "user",
1719
+ "content": "How can I overcome my fears?"
1720
+ }
1721
+ ],
1722
+ [
1723
+ {
1724
+ "role": "user",
1725
+ "content": "What are the main principles of behavioral economics?"
1726
+ }
1727
+ ],
1728
+ [
1729
+ {
1730
+ "role": "system",
1731
+ "content": "You are a New York taxi driver from the 1980s. Be direct, opinionated, and use local slang."
1732
+ },
1733
+ {
1734
+ "role": "user",
1735
+ "content": "What do you think about the current state of the economy?"
1736
+ }
1737
+ ],
1738
+ [
1739
+ {
1740
+ "role": "user",
1741
+ "content": "How does a quantum magnetometer work?"
1742
+ }
1743
+ ],
1744
+ [
1745
+ {
1746
+ "role": "system",
1747
+ "content": "You are a character from a Hayao Miyazaki film. Be whimsical, environmentally conscious, and slightly magical."
1748
+ },
1749
+ {
1750
+ "role": "user",
1751
+ "content": "How can we protect the forest?"
1752
+ }
1753
+ ],
1754
+ [
1755
+ {
1756
+ "role": "user",
1757
+ "content": "What are the key features of solarpunk fiction and aesthetics?"
1758
+ }
1759
+ ],
1760
+ [
1761
+ {
1762
+ "role": "system",
1763
+ "content": "You are an ancient Roman senator. Speak formally and be concerned with law, rhetoric, and the good of the Republic."
1764
+ },
1765
+ {
1766
+ "role": "user",
1767
+ "content": "How should we govern our city?"
1768
+ }
1769
+ ],
1770
+ [
1771
+ {
1772
+ "role": "user",
1773
+ "content": "Explain the concept of quantum annealing in optimization problems"
1774
+ }
1775
+ ],
1776
+ [
1777
+ {
1778
+ "role": "system",
1779
+ "content": "You are a character from a Monty Python sketch. Be absurd, surreal, and prone to non sequiturs."
1780
+ },
1781
+ {
1782
+ "role": "user",
1783
+ "content": "What is the airspeed velocity of an unladen swallow?"
1784
+ }
1785
+ ],
1786
+ [
1787
+ {
1788
+ "role": "user",
1789
+ "content": "What are the main principles of regenerative ocean farming?"
1790
+ }
1791
+ ],
1792
+ [
1793
+ {
1794
+ "role": "system",
1795
+ "content": "You are a 1960s Madison Avenue advertising executive. Be smooth-talking and focus on selling ideas."
1796
+ },
1797
+ {
1798
+ "role": "user",
1799
+ "content": "How can I convince people to buy my product?"
1800
+ }
1801
+ ],
1802
+ [
1803
+ {
1804
+ "role": "user",
1805
+ "content": "How does a brain-computer interface translate thoughts into commands?"
1806
+ }
1807
+ ],
1808
+ [
1809
+ {
1810
+ "role": "system",
1811
+ "content": "You are a character from a Terry Pratchett novel. Be witty, satirical, and include elements of fantasy."
1812
+ },
1813
+ {
1814
+ "role": "user",
1815
+ "content": "Why do humans believe in gods?"
1816
+ }
1817
+ ],
1818
+ [
1819
+ {
1820
+ "role": "user",
1821
+ "content": "What are the key principles of positive psychology?"
1822
+ }
1823
+ ],
1824
+ [
1825
+ {
1826
+ "role": "system",
1827
+ "content": "You are a Victorian-era explorer. Speak enthusiastically about discoveries and use outdated scientific terms."
1828
+ },
1829
+ {
1830
+ "role": "user",
1831
+ "content": "What's beyond that mountain range?"
1832
+ }
1833
+ ],
1834
+ [
1835
+ {
1836
+ "role": "user",
1837
+ "content": "Explain the concept of quantum error correction in quantum computing"
1838
+ }
1839
+ ],
1840
+ [
1841
+ {
1842
+ "role": "system",
1843
+ "content": "You are a character from a Noel Coward play. Be witty, sophisticated, and slightly cynical."
1844
+ },
1845
+ {
1846
+ "role": "user",
1847
+ "content": "What's your opinion on modern romance?"
1848
+ }
1849
+ ]
1850
+ ]
chats_sys_none.json ADDED
@@ -0,0 +1,1390 @@
1
+ [
2
+ [
3
+ {
4
+ "role": "user",
5
+ "content": "What is the recipe of mayonnaise?"
6
+ }
7
+ ],
8
+ [
9
+ {
10
+ "role": "user",
11
+ "content": "I am going to Paris, what should I see?"
12
+ },
13
+ {
14
+ "role": "assistant",
15
+ "content": "Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city. 2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa. 3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."
16
+ },
17
+ {
18
+ "role": "user",
19
+ "content": "What is so great about #1?"
20
+ }
21
+ ],
22
+ [
23
+ {
24
+ "role": "user",
25
+ "content": "I am going to Paris, what should I see?"
26
+ }
27
+ ],
28
+ [
29
+ {
30
+ "role": "user",
31
+ "content": "How to go from Beijing to NY?"
32
+ }
33
+ ],
34
+ [
35
+ {
36
+ "role": "user",
37
+ "content": "Write a brief birthday message to John"
38
+ }
39
+ ],
40
+ [
41
+ {
42
+ "role": "user",
43
+ "content": "Explain the concept of quantum entanglement"
44
+ }
45
+ ],
46
+ [
47
+ {
48
+ "role": "user",
49
+ "content": "How do I find buried treasure?"
50
+ }
51
+ ],
52
+ [
53
+ {
54
+ "role": "user",
55
+ "content": "What are the main causes of climate change?"
56
+ }
57
+ ],
58
+ [
59
+ {
60
+ "role": "user",
61
+ "content": "How do I make the perfect omelette?"
62
+ }
63
+ ],
64
+ [
65
+ {
66
+ "role": "user",
67
+ "content": "Explain the theory of relativity in simple terms"
68
+ }
69
+ ],
70
+ [
71
+ {
72
+ "role": "user",
73
+ "content": "How do I defend a castle?"
74
+ }
75
+ ],
76
+ [
77
+ {
78
+ "role": "user",
79
+ "content": "What are the benefits of meditation?"
80
+ }
81
+ ],
82
+ [
83
+ {
84
+ "role": "user",
85
+ "content": "Why did the chicken cross the road?"
86
+ }
87
+ ],
88
+ [
89
+ {
90
+ "role": "user",
91
+ "content": "How does blockchain technology work?"
92
+ }
93
+ ],
94
+ [
95
+ {
96
+ "role": "user",
97
+ "content": "How can I find my purpose in life?"
98
+ }
99
+ ],
100
+ [
101
+ {
102
+ "role": "user",
103
+ "content": "What are the main differences between Python and JavaScript?"
104
+ }
105
+ ],
106
+ [
107
+ {
108
+ "role": "user",
109
+ "content": "What's the most common form of transportation in your time?"
110
+ }
111
+ ],
112
+ [
113
+ {
114
+ "role": "user",
115
+ "content": "How can I improve my public speaking skills?"
116
+ }
117
+ ],
118
+ [
119
+ {
120
+ "role": "user",
121
+ "content": "What's your favorite food?"
122
+ }
123
+ ],
124
+ [
125
+ {
126
+ "role": "user",
127
+ "content": "Explain the process of photosynthesis"
128
+ }
129
+ ],
130
+ [
131
+ {
132
+ "role": "user",
133
+ "content": "What are the health benefits of drinking green tea?"
134
+ }
135
+ ],
136
+ [
137
+ {
138
+ "role": "user",
139
+ "content": "What do you think about modern technology?"
140
+ }
141
+ ],
142
+ [
143
+ {
144
+ "role": "user",
145
+ "content": "How does a nuclear reactor work?"
146
+ }
147
+ ],
148
+ [
149
+ {
150
+ "role": "user",
151
+ "content": "Describe a beautiful sunset"
152
+ }
153
+ ],
154
+ [
155
+ {
156
+ "role": "user",
157
+ "content": "What are the main principles of stoicism?"
158
+ }
159
+ ],
160
+ [
161
+ {
162
+ "role": "user",
163
+ "content": "How's the weather today?"
164
+ }
165
+ ],
166
+ [
167
+ {
168
+ "role": "user",
169
+ "content": "Explain the concept of machine learning in simple terms"
170
+ }
171
+ ],
172
+ [
173
+ {
174
+ "role": "user",
175
+ "content": "Will I be successful in my career?"
176
+ }
177
+ ],
178
+ [
179
+ {
180
+ "role": "user",
181
+ "content": "What are the key differences between a virus and a bacteria?"
182
+ }
183
+ ],
184
+ [
185
+ {
186
+ "role": "user",
187
+ "content": "Why do humans laugh?"
188
+ }
189
+ ],
190
+ [
191
+ {
192
+ "role": "user",
193
+ "content": "How does the stock market work?"
194
+ }
195
+ ],
196
+ [
197
+ {
198
+ "role": "user",
199
+ "content": "How can I solve my problems?"
200
+ }
201
+ ],
202
+ [
203
+ {
204
+ "role": "user",
205
+ "content": "What are the main causes of deforestation?"
206
+ }
207
+ ],
208
+ [
209
+ {
210
+ "role": "user",
211
+ "content": "How do I bake a cake?"
212
+ }
213
+ ],
214
+ [
215
+ {
216
+ "role": "user",
217
+ "content": "Explain the concept of supply and demand in economics"
218
+ }
219
+ ],
220
+ [
221
+ {
222
+ "role": "user",
223
+ "content": "What is the purpose of a necktie?"
224
+ }
225
+ ],
226
+ [
227
+ {
228
+ "role": "user",
229
+ "content": "What are the main features of Renaissance art?"
230
+ }
231
+ ],
232
+ [
233
+ {
234
+ "role": "user",
235
+ "content": "Where did I leave my keys?"
236
+ }
237
+ ],
238
+ [
239
+ {
240
+ "role": "user",
241
+ "content": "How does a 3D printer work?"
242
+ }
243
+ ],
244
+ [
245
+ {
246
+ "role": "user",
247
+ "content": "I just got a promotion at work"
248
+ }
249
+ ],
250
+ [
251
+ {
252
+ "role": "user",
253
+ "content": "What are the main principles of Buddhism?"
254
+ }
255
+ ],
256
+ [
257
+ {
258
+ "role": "user",
259
+ "content": "Should I pursue my dreams?"
260
+ }
261
+ ],
262
+ [
263
+ {
264
+ "role": "user",
265
+ "content": "How does a black hole form?"
266
+ }
267
+ ],
268
+ [
269
+ {
270
+ "role": "user",
271
+ "content": "What's your favorite color?"
272
+ }
273
+ ],
274
+ [
275
+ {
276
+ "role": "user",
277
+ "content": "What are the main causes of the French Revolution?"
278
+ }
279
+ ],
280
+ [
281
+ {
282
+ "role": "user",
283
+ "content": "What do you think about climate change?"
284
+ }
285
+ ],
286
+ [
287
+ {
288
+ "role": "user",
289
+ "content": "How does a cryptocurrency work?"
290
+ }
291
+ ],
292
+ [
293
+ {
294
+ "role": "user",
295
+ "content": "How can I overcome my fears?"
296
+ }
297
+ ],
298
+ [
299
+ {
300
+ "role": "user",
301
+ "content": "What are the main theories about the origin of language?"
302
+ }
303
+ ],
304
+ [
305
+ {
306
+ "role": "user",
307
+ "content": "How can I make the world a better place?"
308
+ }
309
+ ],
310
+ [
311
+ {
312
+ "role": "user",
313
+ "content": "Explain the process of photosynthesis in detail"
314
+ }
315
+ ],
316
+ [
317
+ {
318
+ "role": "user",
319
+ "content": "What do you think about social media?"
320
+ }
321
+ ],
322
+ [
323
+ {
324
+ "role": "user",
325
+ "content": "What are the main principles of game theory?"
326
+ }
327
+ ],
328
+ [
329
+ {
330
+ "role": "user",
331
+ "content": "What's your daily routine like?"
332
+ }
333
+ ],
334
+ [
335
+ {
336
+ "role": "user",
337
+ "content": "How does a quantum computer differ from a classical computer?"
338
+ }
339
+ ],
340
+ [
341
+ {
342
+ "role": "user",
343
+ "content": "I'm feeling down today"
344
+ }
345
+ ],
346
+ [
347
+ {
348
+ "role": "user",
349
+ "content": "What are the main stages of the water cycle?"
350
+ }
351
+ ],
352
+ [
353
+ {
354
+ "role": "user",
355
+ "content": "Why is the sky blue?"
356
+ }
357
+ ],
358
+ [
359
+ {
360
+ "role": "user",
361
+ "content": "Explain the concept of emotional intelligence"
362
+ }
363
+ ],
364
+ [
365
+ {
366
+ "role": "user",
367
+ "content": "What's the meaning of life?"
368
+ }
369
+ ],
370
+ [
371
+ {
372
+ "role": "user",
373
+ "content": "What are the main principles of sustainable architecture?"
374
+ }
375
+ ],
376
+ [
377
+ {
378
+ "role": "user",
379
+ "content": "How can I improve my public speaking?"
380
+ }
381
+ ],
382
+ [
383
+ {
384
+ "role": "user",
385
+ "content": "How does a nuclear fusion reactor work?"
386
+ }
387
+ ],
388
+ [
389
+ {
390
+ "role": "user",
391
+ "content": "How does a computer work?"
392
+ }
393
+ ],
394
+ [
395
+ {
396
+ "role": "user",
397
+ "content": "What are the main theories about dark matter?"
398
+ }
399
+ ],
400
+ [
401
+ {
402
+ "role": "user",
403
+ "content": "How can I get in shape?"
404
+ }
405
+ ],
406
+ [
407
+ {
408
+ "role": "user",
409
+ "content": "Explain the concept of neuroplasticity"
410
+ }
411
+ ],
412
+ [
413
+ {
414
+ "role": "user",
415
+ "content": "I'm thinking of changing my hairstyle"
416
+ }
417
+ ],
418
+ [
419
+ {
420
+ "role": "user",
421
+ "content": "What are the main principles of Montessori education?"
422
+ }
423
+ ],
424
+ [
425
+ {
426
+ "role": "user",
427
+ "content": "What's your view on conformity?"
428
+ }
429
+ ],
430
+ [
431
+ {
432
+ "role": "user",
433
+ "content": "What are the key principles of permaculture?"
434
+ }
435
+ ],
436
+ [
437
+ {
438
+ "role": "user",
439
+ "content": "How do you communicate with your friends?"
440
+ }
441
+ ],
442
+ [
443
+ {
444
+ "role": "user",
445
+ "content": "Explain the concept of behavioral economics"
446
+ }
447
+ ],
448
+ [
449
+ {
450
+ "role": "user",
451
+ "content": "What do you think of our current political system?"
452
+ }
453
+ ],
454
+ [
455
+ {
456
+ "role": "user",
457
+ "content": "How does a self-driving car navigate through a city?"
458
+ }
459
+ ],
460
+ [
461
+ {
462
+ "role": "user",
463
+ "content": "Why do people fall in love?"
464
+ }
465
+ ],
466
+ [
467
+ {
468
+ "role": "user",
469
+ "content": "What are the main principles of circular economy?"
470
+ }
471
+ ],
472
+ [
473
+ {
474
+ "role": "user",
475
+ "content": "How can I be more productive at work?"
476
+ }
477
+ ],
478
+ [
479
+ {
480
+ "role": "user",
481
+ "content": "Explain the concept of string theory in physics"
482
+ }
483
+ ],
484
+ [
485
+ {
486
+ "role": "user",
487
+ "content": "What's the secret to happiness?"
488
+ }
489
+ ],
490
+ [
491
+ {
492
+ "role": "user",
493
+ "content": "How does the human immune system work?"
494
+ }
495
+ ],
496
+ [
497
+ {
498
+ "role": "user",
499
+ "content": "Should I ask my crush out on a date?"
500
+ }
501
+ ],
502
+ [
503
+ {
504
+ "role": "user",
505
+ "content": "What are the main features of Gothic architecture?"
506
+ }
507
+ ],
508
+ [
509
+ {
510
+ "role": "user",
511
+ "content": "How can I improve my concentration?"
512
+ }
513
+ ],
514
+ [
515
+ {
516
+ "role": "user",
517
+ "content": "Explain the process of gene editing using CRISPR"
518
+ }
519
+ ],
520
+ [
521
+ {
522
+ "role": "user",
523
+ "content": "How can I find inner peace?"
524
+ }
525
+ ],
526
+ [
527
+ {
528
+ "role": "user",
529
+ "content": "What are the key principles of cognitive behavioral therapy?"
530
+ }
531
+ ],
532
+ [
533
+ {
534
+ "role": "user",
535
+ "content": "I just got a small paper cut"
536
+ }
537
+ ],
538
+ [
539
+ {
540
+ "role": "user",
541
+ "content": "How does a blockchain maintain security and transparency?"
542
+ }
543
+ ],
544
+ [
545
+ {
546
+ "role": "user",
547
+ "content": "What's your advice for starting a new career?"
548
+ }
549
+ ],
550
+ [
551
+ {
552
+ "role": "user",
553
+ "content": "What are the main theories about the formation of the Moon?"
554
+ }
555
+ ],
556
+ [
557
+ {
558
+ "role": "user",
559
+ "content": "How should I deal with a difficult coworker?"
560
+ }
561
+ ],
562
+ [
563
+ {
564
+ "role": "user",
565
+ "content": "Explain the concept of neural networks in artificial intelligence"
566
+ }
567
+ ],
568
+ [
569
+ {
570
+ "role": "user",
571
+ "content": "What's the best way to learn a new language?"
572
+ }
573
+ ],
574
+ [
575
+ {
576
+ "role": "user",
577
+ "content": "What are the main principles of Waldorf education?"
578
+ }
579
+ ],
580
+ [
581
+ {
582
+ "role": "user",
583
+ "content": "How does a refrigerator work?"
584
+ }
585
+ ],
586
+ [
587
+ {
588
+ "role": "user",
589
+ "content": "How does a quantum encryption system work?"
590
+ }
591
+ ],
592
+ [
593
+ {
594
+ "role": "user",
595
+ "content": "Why do bad things happen to good people?"
596
+ }
597
+ ],
598
+ [
599
+ {
600
+ "role": "user",
601
+ "content": "What are the key features of Art Nouveau?"
602
+ }
603
+ ],
604
+ [
605
+ {
606
+ "role": "user",
607
+ "content": "How can I overcome procrastination?"
608
+ }
609
+ ],
610
+ [
611
+ {
612
+ "role": "user",
613
+ "content": "Explain the process of terraform ing Mars"
614
+ }
615
+ ],
616
+ [
617
+ {
618
+ "role": "user",
619
+ "content": "What do you think about the current state of politics?"
620
+ }
621
+ ],
622
+ [
623
+ {
624
+ "role": "user",
625
+ "content": "What are the main principles of regenerative agriculture?"
626
+ }
627
+ ],
628
+ [
629
+ {
630
+ "role": "user",
631
+ "content": "What's your favorite childhood memory?"
632
+ }
633
+ ],
634
+ [
635
+ {
636
+ "role": "user",
637
+ "content": "How does a nuclear submarine operate underwater for long periods?"
638
+ }
639
+ ],
640
+ [
641
+ {
642
+ "role": "user",
643
+ "content": "Can you explain our company's business model?"
644
+ }
645
+ ],
646
+ [
647
+ {
648
+ "role": "user",
649
+ "content": "What are the main principles of biomimicry in design?"
650
+ }
651
+ ],
652
+ [
653
+ {
654
+ "role": "user",
655
+ "content": "What do you think about modern technology?"
656
+ }
657
+ ],
658
+ [
659
+ {
660
+ "role": "user",
661
+ "content": "Explain the concept of dark energy in cosmology"
662
+ }
663
+ ],
664
+ [
665
+ {
666
+ "role": "user",
667
+ "content": "What's the weather forecast for tomorrow?"
668
+ }
669
+ ],
670
+ [
671
+ {
672
+ "role": "user",
673
+ "content": "How does a quantum radar system work?"
674
+ }
675
+ ],
676
+ [
677
+ {
678
+ "role": "user",
679
+ "content": "Should I trust my business partner?"
680
+ }
681
+ ],
682
+ [
683
+ {
684
+ "role": "user",
685
+ "content": "What are the key principles of permaculture design?"
686
+ }
687
+ ],
688
+ [
689
+ {
690
+ "role": "user",
691
+ "content": "How can I improve my time management skills?"
692
+ }
693
+ ],
694
+ [
695
+ {
696
+ "role": "user",
697
+ "content": "Explain the process of CRISPR gene editing"
698
+ }
699
+ ],
700
+ [
701
+ {
702
+ "role": "user",
703
+ "content": "What's your favorite food?"
704
+ }
705
+ ],
706
+ [
707
+ {
708
+ "role": "user",
709
+ "content": "What are the main features of Art Deco architecture?"
710
+ }
711
+ ],
712
+ [
713
+ {
714
+ "role": "user",
715
+ "content": "How should I approach my crush?"
716
+ }
717
+ ],
718
+ [
719
+ {
720
+ "role": "user",
721
+ "content": "How does a tokamak fusion reactor work?"
722
+ }
723
+ ],
724
+ [
725
+ {
726
+ "role": "user",
727
+ "content": "What's the biggest story of the day?"
728
+ }
729
+ ],
730
+ [
731
+ {
732
+ "role": "user",
733
+ "content": "What are the key principles of restorative justice?"
734
+ }
735
+ ],
736
+ [
737
+ {
738
+ "role": "user",
739
+ "content": "How can I find my life's purpose?"
740
+ }
741
+ ],
742
+ [
743
+ {
744
+ "role": "user",
745
+ "content": "Explain the concept of quantum entanglement"
746
+ }
747
+ ],
748
+ [
749
+ {
750
+ "role": "user",
751
+ "content": "My friend didn't text me back for an hour"
752
+ }
753
+ ],
754
+ [
755
+ {
756
+ "role": "user",
757
+ "content": "What are the main principles of Austrian economics?"
758
+ }
759
+ ],
760
+ [
761
+ {
762
+ "role": "user",
763
+ "content": "Why do people cry when they're happy?"
764
+ }
765
+ ],
766
+ [
767
+ {
768
+ "role": "user",
769
+ "content": "How does a self-healing concrete work?"
770
+ }
771
+ ],
772
+ [
773
+ {
774
+ "role": "user",
775
+ "content": "How does the internet work?"
776
+ }
777
+ ],
778
+ [
779
+ {
780
+ "role": "user",
781
+ "content": "What are the key features of minimalist design?"
782
+ }
783
+ ],
784
+ [
785
+ {
786
+ "role": "user",
787
+ "content": "How can I be more organized?"
788
+ }
789
+ ],
790
+ [
791
+ {
792
+ "role": "user",
793
+ "content": "Explain the concept of neuroplasticity in brain development"
794
+ }
795
+ ],
796
+ [
797
+ {
798
+ "role": "user",
799
+ "content": "Should I follow my dreams or play it safe?"
800
+ }
801
+ ],
802
+ [
803
+ {
804
+ "role": "user",
805
+ "content": "How does a quantum computer maintain coherence?"
806
+ }
807
+ ],
808
+ [
809
+ {
810
+ "role": "user",
811
+ "content": "What's the best way to balance work and family life?"
812
+ }
813
+ ],
814
+ [
815
+ {
816
+ "role": "user",
817
+ "content": "What are the main principles of behavioral economics?"
818
+ }
819
+ ],
820
+ [
821
+ {
822
+ "role": "user",
823
+ "content": "What's the meaning of life?"
824
+ }
825
+ ],
826
+ [
827
+ {
828
+ "role": "user",
829
+ "content": "Explain the process of terraforming a planet"
830
+ }
831
+ ],
832
+ [
833
+ {
834
+ "role": "user",
835
+ "content": "How do I stand up for myself?"
836
+ }
837
+ ],
838
+ [
839
+ {
840
+ "role": "user",
841
+ "content": "What are the key principles of chaos theory?"
842
+ }
843
+ ],
844
+ [
845
+ {
846
+ "role": "user",
847
+ "content": "What is the nature of reality?"
848
+ }
849
+ ],
850
+ [
851
+ {
852
+ "role": "user",
853
+ "content": "How does a blockchain ensure decentralization and security?"
854
+ }
855
+ ],
856
+ [
857
+ {
858
+ "role": "user",
859
+ "content": "How do I know if someone likes me?"
860
+ }
861
+ ],
862
+ [
863
+ {
864
+ "role": "user",
865
+ "content": "What are the main features of brutalist architecture?"
866
+ }
867
+ ],
868
+ [
869
+ {
870
+ "role": "user",
871
+ "content": "How do I make a sandwich?"
872
+ }
873
+ ],
874
+ [
875
+ {
876
+ "role": "user",
877
+ "content": "Explain the concept of epigenetics in genetics"
878
+ }
879
+ ],
880
+ [
881
+ {
882
+ "role": "user",
883
+ "content": "Can you explain how social media works?"
884
+ }
885
+ ],
886
+ [
887
+ {
888
+ "role": "user",
889
+ "content": "What are the key principles of zero-waste living?"
890
+ }
891
+ ],
892
+ [
893
+ {
894
+ "role": "user",
895
+ "content": "How does electricity work?"
896
+ }
897
+ ],
898
+ [
899
+ {
900
+ "role": "user",
901
+ "content": "How does a quantum cryptography system ensure security?"
902
+ }
903
+ ],
904
+ [
905
+ {
906
+ "role": "user",
907
+ "content": "How can I be more confident?"
908
+ }
909
+ ],
910
+ [
911
+ {
912
+ "role": "user",
913
+ "content": "What are the main principles of stoic philosophy?"
914
+ }
915
+ ],
916
+ [
917
+ {
918
+ "role": "user",
919
+ "content": "What does it mean to be human?"
920
+ }
921
+ ],
922
+ [
923
+ {
924
+ "role": "user",
925
+ "content": "Explain the concept of emergence in complex systems"
926
+ }
927
+ ],
928
+ [
929
+ {
930
+ "role": "user",
931
+ "content": "How can I protect my privacy online?"
932
+ }
933
+ ],
934
+ [
935
+ {
936
+ "role": "user",
937
+ "content": "What are the key features of sustainable urban planning?"
938
+ }
939
+ ],
940
+ [
941
+ {
942
+ "role": "user",
943
+ "content": "Why do people get sick?"
944
+ }
945
+ ],
946
+ [
947
+ {
948
+ "role": "user",
949
+ "content": "How does a quantum sensor achieve high precision?"
950
+ }
951
+ ],
952
+ [
953
+ {
954
+ "role": "user",
955
+ "content": "How do I tell my roommate to clean up?"
956
+ }
957
+ ],
958
+ [
959
+ {
960
+ "role": "user",
961
+ "content": "What are the main principles of cognitive psychology?"
962
+ }
963
+ ],
964
+ [
965
+ {
966
+ "role": "user",
967
+ "content": "Why is the sky blue?"
968
+ }
969
+ ],
970
+ [
971
+ {
972
+ "role": "user",
973
+ "content": "Explain the process of carbon capture and storage"
974
+ }
975
+ ],
976
+ [
977
+ {
978
+ "role": "user",
979
+ "content": "What's the best way to make friends?"
980
+ }
981
+ ],
982
+ [
983
+ {
984
+ "role": "user",
985
+ "content": "What are the key principles of non-violent communication?"
986
+ }
987
+ ],
988
+ [
989
+ {
990
+ "role": "user",
991
+ "content": "Your the best! Thanks for all you're help!"
992
+ }
993
+ ],
994
+ [
995
+ {
996
+ "role": "user",
997
+ "content": "What are the main principles of regenerative agriculture?"
998
+ }
999
+ ],
1000
+ [
1001
+ {
1002
+ "role": "user",
1003
+ "content": "What do you think about today's social media use?"
1004
+ }
1005
+ ],
1006
+ [
1007
+ {
1008
+ "role": "user",
1009
+ "content": "Explain the concept of quantum supremacy in computing"
1010
+ }
1011
+ ],
1012
+ [
1013
+ {
1014
+ "role": "user",
1015
+ "content": "Should I trust my new business partner?"
1016
+ }
1017
+ ],
1018
+ [
1019
+ {
1020
+ "role": "user",
1021
+ "content": "How does a neuromorphic computer mimic the human brain?"
1022
+ }
1023
+ ],
1024
+ [
1025
+ {
1026
+ "role": "user",
1027
+ "content": "Where's the nearest grocery store?"
1028
+ }
1029
+ ],
1030
+ [
1031
+ {
1032
+ "role": "user",
1033
+ "content": "What are the key features of biophilic design in architecture?"
1034
+ }
1035
+ ],
1036
+ [
1037
+ {
1038
+ "role": "user",
1039
+ "content": "How can I find inner peace in a chaotic world?"
1040
+ }
1041
+ ],
1042
+ [
1043
+ {
1044
+ "role": "user",
1045
+ "content": "Explain the process of CRISPR-Cas9 gene editing"
1046
+ }
1047
+ ],
1048
+ [
1049
+ {
1050
+ "role": "user",
1051
+ "content": "I just realized I forgot to buy milk"
1052
+ }
1053
+ ],
1054
+ [
1055
+ {
1056
+ "role": "user",
1057
+ "content": "What are the main principles of circular economy?"
1058
+ }
1059
+ ],
1060
+ [
1061
+ {
1062
+ "role": "user",
1063
+ "content": "What's the best way to manage my finances?"
1064
+ }
1065
+ ],
1066
+ [
1067
+ {
1068
+ "role": "user",
1069
+ "content": "How does a quantum radar system differ from traditional radar?"
1070
+ }
1071
+ ],
1072
+ [
1073
+ {
1074
+ "role": "user",
1075
+ "content": "Should I ask my neighbor out on a date?"
1076
+ }
1077
+ ],
1078
+ [
1079
+ {
1080
+ "role": "user",
1081
+ "content": "What are the key principles of trauma-informed care?"
1082
+ }
1083
+ ],
1084
+ [
1085
+ {
1086
+ "role": "user",
1087
+ "content": "Why do humans wear clothes?"
1088
+ }
1089
+ ],
1090
+ [
1091
+ {
1092
+ "role": "user",
1093
+ "content": "Explain the concept of quorum sensing in bacteria"
1094
+ }
1095
+ ],
1096
+ [
1097
+ {
1098
+ "role": "user",
1099
+ "content": "What do you think about our kingdom's foreign policy?"
1100
+ }
1101
+ ],
1102
+ [
1103
+ {
1104
+ "role": "user",
1105
+ "content": "What are the main features of Art Nouveau design?"
1106
+ }
1107
+ ],
1108
+ [
1109
+ {
1110
+ "role": "user",
1111
+ "content": "How can I stand up for what's right?"
1112
+ }
1113
+ ],
1114
+ [
1115
+ {
1116
+ "role": "user",
1117
+ "content": "How does a memristor work in neuromorphic computing?"
1118
+ }
1119
+ ],
1120
+ [
1121
+ {
1122
+ "role": "user",
1123
+ "content": "Why is the sky blue?"
1124
+ }
1125
+ ],
1126
+ [
1127
+ {
1128
+ "role": "user",
1129
+ "content": "What are the key principles of restorative justice?"
1130
+ }
1131
+ ],
1132
+ [
1133
+ {
1134
+ "role": "user",
1135
+ "content": "What's your opinion on modern romance?"
1136
+ }
1137
+ ],
1138
+ [
1139
+ {
1140
+ "role": "user",
1141
+ "content": "Explain the process of carbon sequestration in oceans"
1142
+ }
1143
+ ],
1144
+ [
1145
+ {
1146
+ "role": "user",
1147
+ "content": "How should I prepare for a job interview?"
1148
+ }
1149
+ ],
1150
+ [
1151
+ {
1152
+ "role": "user",
1153
+ "content": "What are the main principles of Montessori education?"
1154
+ }
1155
+ ],
1156
+ [
1157
+ {
1158
+ "role": "user",
1159
+ "content": "How do I redecorate my living room?"
1160
+ }
1161
+ ],
1162
+ [
1163
+ {
1164
+ "role": "user",
1165
+ "content": "How does a quantum dot display produce colors?"
1166
+ }
1167
+ ],
1168
+ [
1169
+ {
1170
+ "role": "user",
1171
+ "content": "What's the meaning of life, man?"
1172
+ }
1173
+ ],
1174
+ [
1175
+ {
1176
+ "role": "user",
1177
+ "content": "What are the key features of Gothic Revival architecture?"
1178
+ }
1179
+ ],
1180
+ [
1181
+ {
1182
+ "role": "user",
1183
+ "content": "How do I tell someone I love them?"
1184
+ }
1185
+ ],
1186
+ [
1187
+ {
1188
+ "role": "user",
1189
+ "content": "Explain the concept of neuroplasticity in adult brains"
1190
+ }
1191
+ ],
1192
+ [
1193
+ {
1194
+ "role": "user",
1195
+ "content": "Can you explain the importance of studying history?"
1196
+ }
1197
+ ],
1198
+ [
1199
+ {
1200
+ "role": "user",
1201
+ "content": "What are the main principles of permaculture design?"
1202
+ }
1203
+ ],
1204
+ [
1205
+ {
1206
+ "role": "user",
1207
+ "content": "Why should I recycle?"
1208
+ }
1209
+ ],
1210
+ [
1211
+ {
1212
+ "role": "user",
1213
+ "content": "How does a memristor-based neural network function?"
1214
+ }
1215
+ ],
1216
+ [
1217
+ {
1218
+ "role": "user",
1219
+ "content": "I just had a terrible first date. What should I do?"
1220
+ }
1221
+ ],
1222
+ [
1223
+ {
1224
+ "role": "user",
1225
+ "content": "What are the key principles of blue economy?"
1226
+ }
1227
+ ],
1228
+ [
1229
+ {
1230
+ "role": "user",
1231
+ "content": "How can I become wiser?"
1232
+ }
1233
+ ],
1234
+ [
1235
+ {
1236
+ "role": "user",
1237
+ "content": "Explain the concept of quantum tunneling in semiconductor devices"
1238
+ }
1239
+ ],
1240
+ [
1241
+ {
1242
+ "role": "user",
1243
+ "content": "How can I get a promotion at work?"
1244
+ }
1245
+ ],
1246
+ [
1247
+ {
1248
+ "role": "user",
1249
+ "content": "What are the main features of sustainable fashion?"
1250
+ }
1251
+ ],
1252
+ [
1253
+ {
1254
+ "role": "user",
1255
+ "content": "What's the weather forecast for tomorrow?"
1256
+ }
1257
+ ],
1258
+ [
1259
+ {
1260
+ "role": "user",
1261
+ "content": "How does a quantum gyroscope achieve high precision?"
1262
+ }
1263
+ ],
1264
+ [
1265
+ {
1266
+ "role": "user",
1267
+ "content": "My wallet is missing. How should I find it?"
1268
+ }
1269
+ ],
1270
+ [
1271
+ {
1272
+ "role": "user",
1273
+ "content": "What are the key principles of Universal Design?"
1274
+ }
1275
+ ],
1276
+ [
1277
+ {
1278
+ "role": "user",
1279
+ "content": "How can I be more productive in the morning?"
1280
+ }
1281
+ ],
1282
+ [
1283
+ {
1284
+ "role": "user",
1285
+ "content": "Explain the process of optogenetics in neuroscience research"
1286
+ }
1287
+ ],
1288
+ [
1289
+ {
1290
+ "role": "user",
1291
+ "content": "How can I overcome my fears?"
1292
+ }
1293
+ ],
1294
+ [
1295
+ {
1296
+ "role": "user",
1297
+ "content": "What are the main principles of behavioral economics?"
1298
+ }
1299
+ ],
1300
+ [
1301
+ {
1302
+ "role": "user",
1303
+ "content": "What do you think about the current state of the economy?"
1304
+ }
1305
+ ],
1306
+ [
1307
+ {
1308
+ "role": "user",
1309
+ "content": "How does a quantum magnetometer work?"
1310
+ }
1311
+ ],
1312
+ [
1313
+ {
1314
+ "role": "user",
1315
+ "content": "How can we protect the forest?"
1316
+ }
1317
+ ],
1318
+ [
1319
+ {
1320
+ "role": "user",
1321
+ "content": "What are the key features of solarpunk fiction and aesthetics?"
1322
+ }
1323
+ ],
1324
+ [
1325
+ {
1326
+ "role": "user",
1327
+ "content": "How should we govern our city?"
1328
+ }
1329
+ ],
1330
+ [
1331
+ {
1332
+ "role": "user",
1333
+ "content": "Explain the concept of quantum annealing in optimization problems"
1334
+ }
1335
+ ],
1336
+ [
1337
+ {
1338
+ "role": "user",
1339
+ "content": "What is the airspeed velocity of an unladen swallow?"
1340
+ }
1341
+ ],
1342
+ [
1343
+ {
1344
+ "role": "user",
1345
+ "content": "What are the main principles of regenerative ocean farming?"
1346
+ }
1347
+ ],
1348
+ [
1349
+ {
1350
+ "role": "user",
1351
+ "content": "How can I convince people to buy my product?"
1352
+ }
1353
+ ],
1354
+ [
1355
+ {
1356
+ "role": "user",
1357
+ "content": "How does a brain-computer interface translate thoughts into commands?"
1358
+ }
1359
+ ],
1360
+ [
1361
+ {
1362
+ "role": "user",
1363
+ "content": "Why do humans believe in gods?"
1364
+ }
1365
+ ],
1366
+ [
1367
+ {
1368
+ "role": "user",
1369
+ "content": "What are the key principles of positive psychology?"
1370
+ }
1371
+ ],
1372
+ [
1373
+ {
1374
+ "role": "user",
1375
+ "content": "What's beyond that mountain range?"
1376
+ }
1377
+ ],
1378
+ [
1379
+ {
1380
+ "role": "user",
1381
+ "content": "Explain the concept of quantum error correction in quantum computing"
1382
+ }
1383
+ ],
1384
+ [
1385
+ {
1386
+ "role": "user",
1387
+ "content": "What's your opinion on modern romance?"
1388
+ }
1389
+ ]
1390
+ ]
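For context, each entry above is a single-turn conversation: a list containing one `{"role": "user", "content": ...}` message. A minimal sketch of how such a file could be consumed follows (the file name `chats.json` and the model id are assumptions, not taken from this commit):

```python
import json

from transformers import AutoTokenizer

# Assumption: the JSON shown above is saved as "chats.json".
with open("chats.json") as f:
    conversations = json.load(f)  # list of conversations, each a list of {"role", "content"} dicts

# Assumption: any chat model with a chat template works here; this id is only a placeholder.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

for messages in conversations[:3]:
    # Render each prompt with the model's chat template before running generation.
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)
```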
conftest.py ADDED
@@ -0,0 +1,142 @@
1
+ # Copyright 2020 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # tests directory-specific settings - this file is run automatically
16
+ # by pytest before any tests are run
17
+
18
+ import doctest
19
+ import sys
20
+ import warnings
21
+ from os.path import abspath, dirname, join
22
+
23
+ import _pytest
24
+ import pytest
25
+
26
+ from transformers.testing_utils import HfDoctestModule, HfDocTestParser
27
+
28
+
29
+ NOT_DEVICE_TESTS = {
30
+ "test_tokenization",
31
+ "test_processor",
32
+ "test_processing",
33
+ "test_beam_constraints",
34
+ "test_configuration_utils",
35
+ "test_data_collator",
36
+ "test_trainer_callback",
37
+ "test_trainer_utils",
38
+ "test_feature_extraction",
39
+ "test_image_processing",
40
+ "test_image_processor",
41
+ "test_image_transforms",
42
+ "test_optimization",
43
+ "test_retrieval",
44
+ "test_config",
45
+ "test_from_pretrained_no_checkpoint",
46
+ "test_keep_in_fp32_modules",
47
+ "test_gradient_checkpointing_backward_compatibility",
48
+ "test_gradient_checkpointing_enable_disable",
49
+ "test_save_load_fast_init_from_base",
50
+ "test_fast_init_context_manager",
51
+ "test_fast_init_tied_embeddings",
52
+ "test_save_load_fast_init_to_base",
53
+ "test_torch_save_load",
54
+ "test_initialization",
55
+ "test_forward_signature",
56
+ "test_model_common_attributes",
57
+ "test_model_main_input_name",
58
+ "test_correct_missing_keys",
59
+ "test_tie_model_weights",
60
+ "test_can_use_safetensors",
61
+ "test_load_save_without_tied_weights",
62
+ "test_tied_weights_keys",
63
+ "test_model_weights_reload_no_missing_tied_weights",
64
+ "test_pt_tf_model_equivalence",
65
+ "test_mismatched_shapes_have_properly_initialized_weights",
66
+ "test_matched_shapes_have_loaded_weights_when_some_mismatched_shapes_exist",
67
+ "test_model_is_small",
68
+ "test_tf_from_pt_safetensors",
69
+ "test_flax_from_pt_safetensors",
70
+ "ModelTest::test_pipeline_",  # None of the pipeline tests from PipelineTesterMixin (which XxxModelTest inherits from) are run on device
71
+ "ModelTester::test_pipeline_",
72
+ "/repo_utils/",
73
+ "/utils/",
74
+ "/tools/",
75
+ }
76
+
77
+ # allow having multiple repository checkouts and not needing to remember to rerun
78
+ # `pip install -e '.[dev]'` when switching between checkouts and running tests.
79
+ git_repo_path = abspath(join(dirname(__file__), "src"))
80
+ sys.path.insert(1, git_repo_path)
81
+
82
+ # silence FutureWarning warnings in tests since often we can't act on them until
83
+ # they become normal warnings - i.e. the tests still need to test the current functionality
84
+ warnings.simplefilter(action="ignore", category=FutureWarning)
85
+
86
+
87
+ def pytest_configure(config):
88
+ config.addinivalue_line(
89
+ "markers", "is_pt_tf_cross_test: mark test to run only when PT and TF interactions are tested"
90
+ )
91
+ config.addinivalue_line(
92
+ "markers", "is_pt_flax_cross_test: mark test to run only when PT and FLAX interactions are tested"
93
+ )
94
+ config.addinivalue_line("markers", "is_pipeline_test: mark test to run only when pipelines are tested")
95
+ config.addinivalue_line("markers", "is_staging_test: mark test to run only in the staging environment")
96
+ config.addinivalue_line("markers", "accelerate_tests: mark test that require accelerate")
97
+ config.addinivalue_line("markers", "tool_tests: mark the tool tests that are run on their specific schedule")
98
+ config.addinivalue_line("markers", "not_device_test: mark the tests always running on cpu")
99
+
100
+
101
+ def pytest_collection_modifyitems(items):
102
+ for item in items:
103
+ if any(test_name in item.nodeid for test_name in NOT_DEVICE_TESTS):
104
+ item.add_marker(pytest.mark.not_device_test)
105
+
106
+
107
+ def pytest_addoption(parser):
108
+ from transformers.testing_utils import pytest_addoption_shared
109
+
110
+ pytest_addoption_shared(parser)
111
+
112
+
113
+ def pytest_terminal_summary(terminalreporter):
114
+ from transformers.testing_utils import pytest_terminal_summary_main
115
+
116
+ make_reports = terminalreporter.config.getoption("--make-reports")
117
+ if make_reports:
118
+ pytest_terminal_summary_main(terminalreporter, id=make_reports)
119
+
120
+
121
+ def pytest_sessionfinish(session, exitstatus):
122
+ # If no tests are collected, pytest exits with code 5, which makes the CI fail.
123
+ if exitstatus == 5:
124
+ session.exitstatus = 0
125
+
126
+
127
+ # Doctest custom flag to ignore output.
128
+ IGNORE_RESULT = doctest.register_optionflag("IGNORE_RESULT")
129
+
130
+ OutputChecker = doctest.OutputChecker
131
+
132
+
133
+ class CustomOutputChecker(OutputChecker):
134
+ def check_output(self, want, got, optionflags):
135
+ if IGNORE_RESULT & optionflags:
136
+ return True
137
+ return OutputChecker.check_output(self, want, got, optionflags)
138
+
139
+
140
+ doctest.OutputChecker = CustomOutputChecker
141
+ _pytest.doctest.DoctestModule = HfDoctestModule
142
+ doctest.DocTestParser = HfDocTestParser
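For context, the `IGNORE_RESULT` flag registered above is a custom `doctest` option flag: with the `CustomOutputChecker` installed, an example line carrying the directive still runs, but its output is not compared. A minimal sketch of how a docstring could use it (illustrative only, not part of this commit):

```python
import time


def current_timestamp():
    """
    Example:

    >>> current_timestamp()  # doctest: +IGNORE_RESULT
    1700000000.0
    """
    # The value changes on every call, so the expected output above is only indicative.
    return time.time()
```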
docker/transformers-all-latest-gpu/Dockerfile ADDED
@@ -0,0 +1,63 @@
1
+ FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
2
+ LABEL maintainer="Hugging Face"
3
+
4
+ ARG DEBIAN_FRONTEND=noninteractive
5
+
6
+ # Use login shell to read variables from `~/.profile` (to pass dynamic created variables between RUN commands)
7
+ SHELL ["sh", "-lc"]
8
+
9
+ # The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant
10
+ # to be used as arguments for docker build (so far).
11
+
12
+ ARG PYTORCH='2.2.1'
13
+ # (not always a valid torch version)
14
+ ARG INTEL_TORCH_EXT='2.2.0'
15
+ # Example: `cu102`, `cu113`, etc.
16
+ ARG CUDA='cu118'
17
+
18
+ RUN apt update
19
+ RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg git-lfs
20
+ RUN git lfs install
21
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
22
+
23
+ ARG REF=main
24
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
25
+
26
+ # 1. Put several commands in a single `RUN` to avoid image/layer exporting issue. Could be revised in the future.
27
+ # 2. Regarding the `torch` part, we might need to specify proper versions for `torchvision` and `torchaudio`.
28
+ # Currently, let's not bother to specify their versions explicitly (so installed with their latest release versions).
29
+ RUN python3 -m pip install --no-cache-dir -U tensorflow==2.13 protobuf==3.20.3 tensorflow_text tensorflow_probability && python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
30
+
31
+ RUN python3 -m pip uninstall -y flax jax
32
+
33
+ RUN python3 -m pip install --no-cache-dir intel_extension_for_pytorch==$INTEL_TORCH_EXT -f https://developer.intel.com/ipex-whl-stable-cpu
34
+
35
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/facebookresearch/detectron2.git pytesseract
36
+ RUN python3 -m pip install -U "itsdangerous<2.1.0"
37
+
38
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
39
+
40
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/peft@main#egg=peft
41
+
42
+ # For bettertransformer
43
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
44
+
45
+ # For video model testing
46
+ RUN python3 -m pip install --no-cache-dir decord av==9.2.0
47
+
48
+ # Some slow tests require bnb
49
+ RUN python3 -m pip install --no-cache-dir bitsandbytes
50
+
51
+ # For `dinat` model
52
+ # The `XXX` part in `torchXXX` needs to match `PYTORCH` (to some extent)
53
+ RUN python3 -m pip install --no-cache-dir natten==0.15.1+torch220$CUDA -f https://shi-labs.com/natten/wheels
54
+
55
+ # For `nougat` tokenizer
56
+ RUN python3 -m pip install --no-cache-dir python-Levenshtein
57
+
58
+ # For `FastSpeech2ConformerTokenizer` tokenizer
59
+ RUN python3 -m pip install --no-cache-dir g2p-en
60
+
61
+ # When installing in editable mode, `transformers` is not recognized as a package.
62
+ # this line must be added in order for python to be aware of transformers.
63
+ RUN cd transformers && python3 setup.py develop
docker/transformers-doc-builder/Dockerfile ADDED
@@ -0,0 +1,18 @@
1
+ FROM python:3.10
2
+ LABEL maintainer="Hugging Face"
3
+
4
+ RUN apt update
5
+ RUN git clone https://github.com/huggingface/transformers
6
+
7
+ RUN python3 -m pip install --no-cache-dir --upgrade pip && python3 -m pip install --no-cache-dir git+https://github.com/huggingface/doc-builder ./transformers[dev]
8
+ RUN apt-get -y update && apt-get install -y libsndfile1-dev && apt install -y tesseract-ocr
9
+
10
+ # Torch needs to be installed before deepspeed
11
+ RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed]
12
+
13
+ RUN python3 -m pip install --no-cache-dir torchvision git+https://github.com/facebookresearch/detectron2.git pytesseract
14
+ RUN python3 -m pip install -U "itsdangerous<2.1.0"
15
+
16
+ # Test if the image could successfully build the doc. before publishing the image
17
+ RUN doc-builder build transformers transformers/docs/source/en --build_dir doc-build-dev --notebook_dir notebooks/transformers_doc --clean
18
+ RUN rm -rf doc-build-dev
docker/transformers-gpu/Dockerfile ADDED
@@ -0,0 +1,31 @@
1
+ FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04
2
+ LABEL maintainer="Hugging Face"
3
+ LABEL repository="transformers"
4
+
5
+ RUN apt update && \
6
+ apt install -y bash \
7
+ build-essential \
8
+ git \
9
+ curl \
10
+ ca-certificates \
11
+ python3 \
12
+ python3-pip && \
13
+ rm -rf /var/lib/apt/lists
14
+
15
+ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
16
+ python3 -m pip install --no-cache-dir \
17
+ jupyter \
18
+ tensorflow \
19
+ torch
20
+
21
+ RUN git clone https://github.com/NVIDIA/apex
22
+ RUN cd apex && \
23
+ python3 setup.py install && \
24
+ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
25
+
26
+ WORKDIR /workspace
27
+ COPY . transformers/
28
+ RUN cd transformers/ && \
29
+ python3 -m pip install --no-cache-dir .
30
+
31
+ CMD ["/bin/bash"]
docker/transformers-past-gpu/Dockerfile ADDED
@@ -0,0 +1,59 @@
1
+ ARG BASE_DOCKER_IMAGE
2
+ FROM $BASE_DOCKER_IMAGE
3
+ LABEL maintainer="Hugging Face"
4
+
5
+ ARG DEBIAN_FRONTEND=noninteractive
6
+
7
+ # Use login shell to read variables from `~/.profile` (to pass dynamic created variables between RUN commands)
8
+ SHELL ["sh", "-lc"]
9
+
10
+ RUN apt update
11
+ RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg git-lfs libaio-dev
12
+ RUN git lfs install
13
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
14
+
15
+ ARG REF=main
16
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
17
+ RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime]
18
+
19
+ # When installing in editable mode, `transformers` is not recognized as a package.
20
+ # this line must be added in order for python to be aware of transformers.
21
+ RUN cd transformers && python3 setup.py develop
22
+
23
+ ARG FRAMEWORK
24
+ ARG VERSION
25
+
26
+ # Control `setuptools` version to avoid some issues
27
+ RUN [ "$VERSION" != "1.10" ] && python3 -m pip install -U setuptools || python3 -m pip install -U "setuptools<=59.5"
28
+
29
+ # Remove all frameworks
30
+ RUN python3 -m pip uninstall -y torch torchvision torchaudio tensorflow jax flax
31
+
32
+ # Get the libraries and their versions to install, and write installation command to `~/.profile`.
33
+ RUN python3 ./transformers/utils/past_ci_versions.py --framework $FRAMEWORK --version $VERSION
34
+
35
+ # Install the target framework
36
+ RUN echo "INSTALL_CMD = $INSTALL_CMD"
37
+ RUN $INSTALL_CMD
38
+
39
+ RUN [ "$FRAMEWORK" != "pytorch" ] && echo "`deepspeed-testing` installation is skipped" || python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
40
+
41
+ # Remove `accelerate`: it requires `torch`, and this causes import issues for TF-only testing
42
+ # We will install `accelerate@main` in Past CI workflow file
43
+ RUN python3 -m pip uninstall -y accelerate
44
+
45
+ # Uninstall `torch-tensorrt` and `apex` shipped with the base image
46
+ RUN python3 -m pip uninstall -y torch-tensorrt apex
47
+
48
+ # Pre-build **nightly** release of DeepSpeed, so it would be ready for testing (otherwise, the 1st deepspeed test will timeout)
49
+ RUN python3 -m pip uninstall -y deepspeed
50
+ # This has to be run inside the GPU VMs running the tests. (So far, it fails here due to GPU checks during compilation.)
51
+ # Issue: https://github.com/microsoft/DeepSpeed/issues/2010
52
+ # RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
53
+ # DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
54
+
55
+ RUN python3 -m pip install -U "itsdangerous<2.1.0"
56
+
57
+ # When installing in editable mode, `transformers` is not recognized as a package.
58
+ # this line must be added in order for python to be aware of transformers.
59
+ RUN cd transformers && python3 setup.py develop
docker/transformers-pytorch-amd-gpu/Dockerfile ADDED
@@ -0,0 +1,39 @@
1
+ FROM rocm/dev-ubuntu-20.04:5.6
2
+ # rocm/pytorch has no version with 2.1.0
3
+ LABEL maintainer="Hugging Face"
4
+
5
+ ARG DEBIAN_FRONTEND=noninteractive
6
+
7
+ ARG PYTORCH='2.1.0'
8
+ ARG TORCH_VISION='0.16.0'
9
+ ARG TORCH_AUDIO='2.1.0'
10
+ ARG ROCM='5.6'
11
+
12
+ RUN apt update && \
13
+ apt install -y --no-install-recommends git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-dev python3-pip ffmpeg && \
14
+ apt clean && \
15
+ rm -rf /var/lib/apt/lists/*
16
+
17
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
18
+
19
+ RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM
20
+
21
+ RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools ninja git+https://github.com/facebookresearch/detectron2.git pytesseract "itsdangerous<2.1.0"
22
+
23
+ ARG REF=main
24
+ WORKDIR /
25
+
26
+ # Invalidate docker cache from here if new commit is available.
27
+ ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
28
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
29
+
30
+ RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video]
31
+
32
+ RUN python3 -m pip uninstall -y tensorflow flax
33
+
34
+ # When installing in editable mode, `transformers` is not recognized as a package.
35
+ # this line must be added in order for python to be aware of transformers.
36
+ RUN cd transformers && python3 setup.py develop
37
+
38
+ # Remove nvml as it is not compatible with ROCm
39
+ RUN python3 -m pip uninstall py3nvml pynvml -y
docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile ADDED
@@ -0,0 +1,48 @@
1
+ FROM rocm/dev-ubuntu-22.04:5.6
2
+ LABEL maintainer="Hugging Face"
3
+
4
+ ARG DEBIAN_FRONTEND=noninteractive
5
+ ARG PYTORCH='2.1.1'
6
+ ARG TORCH_VISION='0.16.1'
7
+ ARG TORCH_AUDIO='2.1.1'
8
+ ARG ROCM='5.6'
9
+
10
+ RUN apt update && \
11
+ apt install -y --no-install-recommends \
12
+ libaio-dev \
13
+ git \
14
+ # These are required to build deepspeed.
15
+ python3-dev \
16
+ python-is-python3 \
17
+ rocrand-dev \
18
+ rocthrust-dev \
19
+ hipsparse-dev \
20
+ hipblas-dev \
21
+ rocblas-dev && \
22
+ apt clean && \
23
+ rm -rf /var/lib/apt/lists/*
24
+
25
+ RUN python3 -m pip install --no-cache-dir --upgrade pip ninja "pydantic<2"
26
+ RUN python3 -m pip uninstall -y apex torch torchvision torchaudio
27
+ RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM --no-cache-dir
28
+
29
+ # Pre-build DeepSpeed, so it's ready for testing (to avoid timeout)
30
+ RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache-dir -v --disable-pip-version-check 2>&1
31
+
32
+ ARG REF=main
33
+ WORKDIR /
34
+
35
+ # Invalidate docker cache from here if new commit is available.
36
+ ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
37
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
38
+
39
+ RUN python3 -m pip install --no-cache-dir ./transformers[accelerate,testing,sentencepiece,sklearn]
40
+
41
+ # When installing in editable mode, `transformers` is not recognized as a package.
42
+ # this line must be added in order for python to be aware of transformers.
43
+ RUN cd transformers && python3 setup.py develop
44
+
45
+ RUN python3 -c "from deepspeed.launcher.runner import main"
46
+
47
+ # Remove nvml as it is not compatible with ROCm
48
+ RUN python3 -m pip uninstall py3nvml pynvml -y
docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile ADDED
@@ -0,0 +1,53 @@
1
+ # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-11.html#rel-23-11
2
+ FROM nvcr.io/nvidia/pytorch:23.04-py3
3
+ LABEL maintainer="Hugging Face"
4
+
5
+ ARG DEBIAN_FRONTEND=noninteractive
6
+
7
+ ARG PYTORCH='2.2.0'
8
+ # Example: `cu102`, `cu113`, etc.
9
+ ARG CUDA='cu121'
10
+
11
+ RUN apt -y update
12
+ RUN apt install -y libaio-dev
13
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
14
+
15
+ ARG REF=main
16
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
17
+
18
+ RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
19
+
20
+ # Install latest release PyTorch
21
+ # (PyTorch must be installed before pre-compiling any DeepSpeed c++/cuda ops.)
22
+ # (https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops)
23
+ RUN python3 -m pip uninstall -y torch torchvision torchaudio && python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
24
+
25
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
26
+
27
+ # Uninstall `transformer-engine` shipped with the base image
28
+ RUN python3 -m pip uninstall -y transformer-engine
29
+
30
+ # Uninstall `torch-tensorrt` shipped with the base image
31
+ RUN python3 -m pip uninstall -y torch-tensorrt
32
+
33
+ # recompile apex
34
+ RUN python3 -m pip uninstall -y apex
35
+ # RUN git clone https://github.com/NVIDIA/apex
36
+ # `MAX_JOBS=1` disables parallel building to avoid cpu memory OOM when building image on GitHub Action (standard) runners
37
+ # TODO: check if there is alternative way to install latest apex
38
+ # RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
39
+
40
+ # Pre-build **latest** DeepSpeed, so it would be ready for testing (otherwise, the 1st deepspeed test will timeout)
41
+ RUN python3 -m pip uninstall -y deepspeed
42
+ # This has to be run (again) inside the GPU VMs running the tests.
43
+ # The installation works here, but some tests fail if we don't pre-build deepspeed again in the VMs running the tests.
44
+ # TODO: Find out why tests fail.
45
+ RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
46
+
47
+ # When installing in editable mode, `transformers` is not recognized as a package.
48
+ # this line must be added in order for python to be aware of transformers.
49
+ RUN cd transformers && python3 setup.py develop
50
+
51
+ # The base image ships with `pydantic==1.8.2` which is not working - i.e. the next command fails
52
+ RUN python3 -m pip install -U --no-cache-dir "pydantic<2"
53
+ RUN python3 -c "from deepspeed.launcher.runner import main"
docker/transformers-pytorch-deepspeed-nightly-gpu/Dockerfile ADDED
@@ -0,0 +1,64 @@
1
+ # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-11.html#rel-23-11
2
+ FROM nvcr.io/nvidia/pytorch:23.11-py3
3
+ LABEL maintainer="Hugging Face"
4
+
5
+ ARG DEBIAN_FRONTEND=noninteractive
6
+
7
+ # Example: `cu102`, `cu113`, etc.
8
+ ARG CUDA='cu121'
9
+
10
+ RUN apt -y update
11
+ RUN apt install -y libaio-dev
12
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
13
+
14
+ ARG REF=main
15
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
16
+
17
+ RUN python3 -m pip uninstall -y torch torchvision torchaudio
18
+
19
+ # Install **nightly** release PyTorch (flag `--pre`)
20
+ # (PyTorch must be installed before pre-compiling any DeepSpeed c++/cuda ops.)
21
+ # (https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops)
22
+ RUN python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
23
+
24
+ RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
25
+
26
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
27
+
28
+ # Uninstall `transformer-engine` shipped with the base image
29
+ RUN python3 -m pip uninstall -y transformer-engine
30
+
31
+ # Uninstall `torch-tensorrt` and `apex` shipped with the base image
32
+ RUN python3 -m pip uninstall -y torch-tensorrt apex
33
+
34
+ # Pre-build **nightly** release of DeepSpeed, so it would be ready for testing (otherwise, the 1st deepspeed test will timeout)
35
+ RUN python3 -m pip uninstall -y deepspeed
36
+ # This has to be run inside the GPU VMs running the tests. (So far, it fails here due to GPU checks during compilation.)
37
+ # Issue: https://github.com/microsoft/DeepSpeed/issues/2010
38
+ # RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
39
+ # DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
40
+
41
+ ## For `torchdynamo` tests
42
+ ## (see https://github.com/huggingface/transformers/pull/17765)
43
+ #RUN git clone https://github.com/pytorch/functorch
44
+ #RUN python3 -m pip install --no-cache-dir ./functorch[aot]
45
+ #RUN cd functorch && python3 setup.py develop
46
+ #
47
+ #RUN git clone https://github.com/pytorch/torchdynamo
48
+ #RUN python3 -m pip install -r ./torchdynamo/requirements.txt
49
+ #RUN cd torchdynamo && python3 setup.py develop
50
+ #
51
+ ## install TensorRT
52
+ #RUN python3 -m pip install --no-cache-dir -U nvidia-pyindex
53
+ #RUN python3 -m pip install --no-cache-dir -U nvidia-tensorrt==8.2.4.2
54
+ #
55
+ ## install torch_tensorrt (fx path)
56
+ #RUN git clone https://github.com/pytorch/TensorRT.git
57
+ #RUN cd TensorRT/py && python3 setup.py install --fx-only
58
+
59
+ # When installing in editable mode, `transformers` is not recognized as a package.
60
+ # this line must be added in order for python to be aware of transformers.
61
+ RUN cd transformers && python3 setup.py develop
62
+
63
+ # Disable for now as deepspeed is not installed above. To be enabled once the issue is fixed.
64
+ # RUN python3 -c "from deepspeed.launcher.runner import main"
docker/transformers-pytorch-gpu/Dockerfile ADDED
@@ -0,0 +1,33 @@
1
+ FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04
2
+ LABEL maintainer="Hugging Face"
3
+
4
+ ARG DEBIAN_FRONTEND=noninteractive
5
+
6
+ RUN apt update
7
+ RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
8
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
9
+
10
+ ARG REF=main
11
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
12
+
13
+ # If set to nothing, will install the latest version
14
+ ARG PYTORCH='2.1.1'
15
+ ARG TORCH_VISION=''
16
+ ARG TORCH_AUDIO=''
17
+ # Example: `cu102`, `cu113`, etc.
18
+ ARG CUDA='cu121'
19
+
20
+ RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
21
+ RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
22
+ RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='$TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
23
+
24
+ RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video]
25
+
26
+ RUN python3 -m pip uninstall -y tensorflow flax
27
+
28
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/facebookresearch/detectron2.git pytesseract
29
+ RUN python3 -m pip install -U "itsdangerous<2.1.0"
30
+
31
+ # When installing in editable mode, `transformers` is not recognized as a package.
32
+ # this line must be added in order for python to be aware of transformers.
33
+ RUN cd transformers && python3 setup.py develop
docker/transformers-pytorch-tpu/Dockerfile ADDED
@@ -0,0 +1,65 @@
1
+ FROM google/cloud-sdk:slim
2
+
3
+ # Build args.
4
+ ARG GITHUB_REF=refs/heads/main
5
+
6
+ # TODO: This Dockerfile installs pytorch/xla 3.6 wheels. There are also 3.7
7
+ # wheels available; see below.
8
+ ENV PYTHON_VERSION=3.6
9
+
10
+ RUN apt-get update && apt-get install -y --no-install-recommends \
11
+ build-essential \
12
+ cmake \
13
+ git \
14
+ curl \
15
+ ca-certificates
16
+
17
+ # Install conda and python.
18
+ # NOTE new Conda does not forward the exit status... https://github.com/conda/conda/issues/8385
19
+ RUN curl -o ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-4.7.12-Linux-x86_64.sh && \
20
+ chmod +x ~/miniconda.sh && \
21
+ ~/miniconda.sh -b && \
22
+ rm ~/miniconda.sh
23
+
24
+ ENV PATH=/root/miniconda3/bin:$PATH
25
+
26
+ RUN conda create -y --name container python=$PYTHON_VERSION
27
+
28
+ # Run the rest of commands within the new conda env.
29
+ # Use absolute path to appease Codefactor.
30
+ SHELL ["/root/miniconda3/bin/conda", "run", "-n", "container", "/bin/bash", "-c"]
31
+ RUN conda install -y python=$PYTHON_VERSION mkl
32
+
33
+ RUN pip uninstall -y torch && \
34
+ # Python 3.7 wheels are available. Replace cp36-cp36m with cp37-cp37m
35
+ gsutil cp 'gs://tpu-pytorch/wheels/torch-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' . && \
36
+ gsutil cp 'gs://tpu-pytorch/wheels/torch_xla-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' . && \
37
+ gsutil cp 'gs://tpu-pytorch/wheels/torchvision-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' . && \
38
+ pip install 'torch-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' && \
39
+ pip install 'torch_xla-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' && \
40
+ pip install 'torchvision-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' && \
41
+ rm 'torch-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' && \
42
+ rm 'torch_xla-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' && \
43
+ rm 'torchvision-nightly-cp${PYTHON_VERSION/./}-cp${PYTHON_VERSION/./}m-linux_x86_64.whl' && \
44
+ apt-get install -y libomp5
45
+
46
+ ENV LD_LIBRARY_PATH=root/miniconda3/envs/container/lib
47
+
48
+
49
+ # Install huggingface/transformers at the current PR, plus dependencies.
50
+ RUN git clone https://github.com/huggingface/transformers.git && \
51
+ cd transformers && \
52
+ git fetch origin $GITHUB_REF:CI && \
53
+ git checkout CI && \
54
+ cd .. && \
55
+ pip install ./transformers && \
56
+ pip install -r ./transformers/examples/pytorch/_test_requirements.txt && \
57
+ pip install pytest
58
+
59
+ RUN python -c "import torch_xla; print(torch_xla.__version__)"
60
+ RUN python -c "import transformers as trf; print(trf.__version__)"
61
+ RUN conda init bash
62
+ COPY docker-entrypoint.sh /usr/local/bin/
63
+ RUN chmod +x /usr/local/bin/docker-entrypoint.sh
64
+ ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
65
+ CMD ["bash"]
docker/transformers-pytorch-tpu/bert-base-cased.jsonnet ADDED
@@ -0,0 +1,38 @@
1
+ local base = import 'templates/base.libsonnet';
2
+ local tpus = import 'templates/tpus.libsonnet';
3
+ local utils = import "templates/utils.libsonnet";
4
+ local volumes = import "templates/volumes.libsonnet";
5
+
6
+ local bertBaseCased = base.BaseTest {
7
+ frameworkPrefix: "hf",
8
+ modelName: "bert-base-cased",
9
+ mode: "example",
10
+ configMaps: [],
11
+
12
+ timeout: 3600, # 1 hour, in seconds
13
+
14
+ image: std.extVar('image'),
15
+ imageTag: std.extVar('image-tag'),
16
+
17
+ tpuSettings+: {
18
+ softwareVersion: "pytorch-nightly",
19
+ },
20
+ accelerator: tpus.v3_8,
21
+
22
+ volumeMap+: {
23
+ datasets: volumes.PersistentVolumeSpec {
24
+ name: "huggingface-cluster-disk",
25
+ mountPath: "/datasets",
26
+ },
27
+ },
28
+ command: utils.scriptCommand(
29
+ |||
30
+ python -m pytest -s transformers/examples/pytorch/test_xla_examples.py -v
31
+ test_exit_code=$?
32
+ echo "\nFinished running commands.\n"
33
+ test $test_exit_code -eq 0
34
+ |||
35
+ ),
36
+ };
37
+
38
+ bertBaseCased.oneshotJob
docker/transformers-pytorch-tpu/dataset.yaml ADDED
@@ -0,0 +1,32 @@
1
+ apiVersion: v1
2
+ kind: PersistentVolume
3
+ metadata:
4
+ name: huggingface-cluster-disk
5
+ spec:
6
+ storageClassName: ""
7
+ capacity:
8
+ storage: 500Gi
9
+ accessModes:
10
+ - ReadOnlyMany
11
+ claimRef:
12
+ namespace: default
13
+ name: huggingface-cluster-disk-claim
14
+ gcePersistentDisk:
15
+ pdName: huggingface-cluster-disk
16
+ fsType: ext4
17
+ readOnly: true
18
+ ---
19
+ apiVersion: v1
20
+ kind: PersistentVolumeClaim
21
+ metadata:
22
+ name: huggingface-cluster-disk-claim
23
+ spec:
24
+ # Specify "" as the storageClassName so it matches the PersistentVolume's StorageClass.
25
+ # A nil storageClassName value uses the default StorageClass. For details, see
26
+ # https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
27
+ storageClassName: ""
28
+ accessModes:
29
+ - ReadOnlyMany
30
+ resources:
31
+ requests:
32
+ storage: 1Ki
docker/transformers-pytorch-tpu/docker-entrypoint.sh ADDED
@@ -0,0 +1,8 @@
1
+ #!/bin/bash
2
+ source ~/.bashrc
3
+ echo "running docker-entrypoint.sh"
4
+ conda activate container
5
+ echo $KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS
6
+ echo "printed TPU info"
7
+ export XRT_TPU_CONFIG="tpu_worker;0;${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS:7}"
8
+ exec "$@"
docker/transformers-quantization-latest-gpu/Dockerfile ADDED
@@ -0,0 +1,60 @@
1
+ FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
2
+ LABEL maintainer="Hugging Face"
3
+
4
+ ARG DEBIAN_FRONTEND=noninteractive
5
+
6
+ # Use login shell to read variables from `~/.profile` (to pass dynamic created variables between RUN commands)
7
+ SHELL ["sh", "-lc"]
8
+
9
+ # The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant
10
+ # to be used as arguments for docker build (so far).
11
+
12
+ ARG PYTORCH='2.2.1'
13
+ # Example: `cu102`, `cu113`, etc.
14
+ ARG CUDA='cu118'
15
+
16
+ RUN apt update
17
+ RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
18
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
19
+
20
+ ARG REF=main
21
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
22
+
23
+ RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
24
+ RUN echo torch=$VERSION
25
+ # `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
26
+ # Currently, let's just use their latest releases (when `torch` is installed with a release version)
27
+ RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
28
+
29
+ RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch]
30
+
31
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
32
+
33
+ # needed in bnb and awq
34
+ RUN python3 -m pip install --no-cache-dir einops
35
+
36
+ # Add bitsandbytes for mixed int8 testing
37
+ RUN python3 -m pip install --no-cache-dir bitsandbytes
38
+
39
+ # Add auto-gptq for gptq quantization testing
40
+ RUN python3 -m pip install --no-cache-dir auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
41
+
42
+ # Add optimum for gptq quantization testing
43
+ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
44
+
45
+ # Add aqlm for quantization testing
46
+ RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
47
+
48
+ # Add autoawq for quantization testing
49
+ # >=v0.2.3 needed for compatibility with torch 2.2.1
50
+ RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.3/autoawq-0.2.3+cu118-cp38-cp38-linux_x86_64.whl
51
+
52
+ # Add quanto for quantization testing
53
+ RUN python3 -m pip install --no-cache-dir quanto
54
+
55
+ # Add eetq for quantization testing
56
+ RUN python3 -m pip install git+https://github.com/NetEase-FuXi/EETQ.git
57
+
58
+ # When installing in editable mode, `transformers` is not recognized as a package.
59
+ # this line must be added in order for python to be aware of transformers.
60
+ RUN cd transformers && python3 setup.py develop
docker/transformers-tensorflow-gpu/Dockerfile ADDED
@@ -0,0 +1,25 @@
1
+ FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
2
+ LABEL maintainer="Hugging Face"
3
+
4
+ ARG DEBIAN_FRONTEND=noninteractive
5
+
6
+ RUN apt update
7
+ RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
8
+ RUN python3 -m pip install --no-cache-dir --upgrade pip
9
+
10
+ ARG REF=main
11
+ RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
12
+ RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-tensorflow,testing]
13
+
14
+ # If set to nothing, will install the latest version
15
+ ARG TENSORFLOW='2.13'
16
+
17
+ RUN [ ${#TENSORFLOW} -gt 0 ] && VERSION='tensorflow=='$TENSORFLOW'.*' || VERSION='tensorflow'; python3 -m pip install --no-cache-dir -U $VERSION
18
+ RUN python3 -m pip uninstall -y torch flax
19
+ RUN python3 -m pip install -U "itsdangerous<2.1.0"
20
+
21
+ RUN python3 -m pip install --no-cache-dir -U tensorflow_probability
22
+
23
+ # When installing in editable mode, `transformers` is not recognized as a package.
24
+ # this line must be added in order for python to be aware of transformers.
25
+ RUN cd transformers && python3 setup.py develop
docs/README.md ADDED
@@ -0,0 +1,397 @@
1
+ <!---
2
+ Copyright 2020 The HuggingFace Team. All rights reserved.
3
+
4
+ Licensed under the Apache License, Version 2.0 (the "License");
5
+ you may not use this file except in compliance with the License.
6
+ You may obtain a copy of the License at
7
+
8
+ http://www.apache.org/licenses/LICENSE-2.0
9
+
10
+ Unless required by applicable law or agreed to in writing, software
11
+ distributed under the License is distributed on an "AS IS" BASIS,
12
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ See the License for the specific language governing permissions and
14
+ limitations under the License.
15
+ -->
16
+
17
+ # Generating the documentation
18
+
19
+ To generate the documentation, you first have to build it. Several packages are necessary to build the doc,
20
+ you can install them with the following command, at the root of the code repository:
21
+
22
+ ```bash
23
+ pip install -e ".[docs]"
24
+ ```
25
+
26
+ Then you need to install our special tool that builds the documentation:
27
+
28
+ ```bash
29
+ pip install git+https://github.com/huggingface/doc-builder
30
+ ```
31
+
32
+ ---
33
+ **NOTE**
34
+
35
+ You only need to generate the documentation to inspect it locally (if you're planning changes and want to
36
+ check how they look before committing for instance). You don't have to commit the built documentation.
37
+
38
+ ---
39
+
40
+ ## Building the documentation
41
+
42
+ Once you have set up the `doc-builder` and additional packages, you can generate the documentation by
43
+ typing the following command:
44
+
45
+ ```bash
46
+ doc-builder build transformers docs/source/en/ --build_dir ~/tmp/test-build
47
+ ```
48
+
49
+ You can adapt the `--build_dir` to set any temporary folder that you prefer. This command will create it and generate
50
+ the MDX files that will be rendered as the documentation on the main website. You can inspect them in your favorite
51
+ Markdown editor.
52
+
53
+ ## Previewing the documentation
54
+
55
+ To preview the docs, first install the `watchdog` module with:
56
+
57
+ ```bash
58
+ pip install watchdog
59
+ ```
60
+
61
+ Then run the following command:
62
+
63
+ ```bash
64
+ doc-builder preview {package_name} {path_to_docs}
65
+ ```
66
+
67
+ For example:
68
+
69
+ ```bash
70
+ doc-builder preview transformers docs/source/en/
71
+ ```
72
+
73
+ The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR. You will see a bot add a comment with a link to where the documentation with your changes lives.
74
+
75
+ ---
76
+ **NOTE**
77
+
78
+ The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` and restart the `preview` command (`ctrl-c` to stop it, then call `doc-builder preview ...` again).
79
+
80
+ ---
81
+
82
+ ## Adding a new element to the navigation bar
83
+
84
+ Accepted files are Markdown (.md).
85
+
86
+ Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
87
+ the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/transformers/blob/main/docs/source/en/_toctree.yml) file.
88
+
89
+ ## Renaming section headers and moving sections
90
+
91
+ It helps to keep old links working when renaming a section header and/or moving sections from one document to another. Old links are likely to be used in Issues, Forums, and social media, and it makes for a much better user experience if users reading them months later can still easily navigate to the originally intended information.
92
+
93
+ Therefore, we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor.
94
+
95
+ So if you renamed a section from: "Section A" to "Section B", then you can add at the end of the file:
96
+
97
+ ```
98
+ Sections that were moved:
99
+
100
+ [ <a href="#section-b">Section A</a><a id="section-a"></a> ]
101
+ ```
102
+ and of course, if you moved it to another file, then:
103
+
104
+ ```
105
+ Sections that were moved:
106
+
107
+ [ <a href="../new-file#section-b">Section A</a><a id="section-a"></a> ]
108
+ ```
109
+
110
+ Use the relative style to link to the new file so that the versioned docs continue to work.
111
+
112
+ For an example of a rich moved section set please see the very end of [the Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).
113
+
114
+
115
+ ## Writing Documentation - Specification
116
+
117
+ The `huggingface/transformers` documentation follows the
118
+ [Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style for docstrings,
119
+ although we can write them directly in Markdown.
120
+
121
+ ### Adding a new tutorial
122
+
123
+ Adding a new tutorial or section is done in two steps:
124
+
125
+ - Add a new file under `./source`. This file can either be ReStructuredText (.rst) or Markdown (.md).
126
+ - Link that file in `./source/_toctree.yml` on the correct toc-tree.
127
+
128
+ Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so
129
+ depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or
130
+ four.
131
+
132
+ ### Translating
133
+
134
+ When translating, refer to the guide at [./TRANSLATING.md](https://github.com/huggingface/transformers/blob/main/docs/TRANSLATING.md).
135
+
136
+
137
+ ### Adding a new model
138
+
139
+ When adding a new model:
140
+
141
+ - Create a file `xxx.md` under `./source/model_doc` (don't hesitate to copy an existing file as a template).
142
+ - Link that file in `./source/_toctree.yml`.
143
+ - Write a short overview of the model:
144
+ - Overview with paper & authors
145
+ - Paper abstract
146
+ - Tips and tricks and how to use it best
147
+ - Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and
148
+ every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow.
149
+ The order is generally:
150
+ - Configuration
151
+ - Tokenizer
152
+ - PyTorch base model
153
+ - PyTorch head models
154
+ - TensorFlow base model
155
+ - TensorFlow head models
156
+ - Flax base model
157
+ - Flax head models
158
+
159
+ These classes should be added using our Markdown syntax. Usually as follows:
160
+
161
+ ```
162
+ ## XXXConfig
163
+
164
+ [[autodoc]] XXXConfig
165
+ ```
166
+
167
+ This will include every public method of the configuration that is documented. If for some reason you wish for a method
168
+ not to be displayed in the documentation, you can do so by specifying which methods should be in the docs:
169
+
170
+ ```
171
+ ## XXXTokenizer
172
+
173
+ [[autodoc]] XXXTokenizer
174
+ - build_inputs_with_special_tokens
175
+ - get_special_tokens_mask
176
+ - create_token_type_ids_from_sequences
177
+ - save_vocabulary
178
+ ```
179
+
180
+ If you just want to add a method that is not documented (for instance magic methods like `__call__` are not documented
181
+ by default) you can put the list of methods to add in a list that contains `all`:
182
+
183
+ ```
184
+ ## XXXTokenizer
185
+
186
+ [[autodoc]] XXXTokenizer
187
+ - all
188
+ - __call__
189
+ ```
190
+
191
+ ### Writing source documentation
192
+
193
+ Values that should be put in `code` should either be surrounded by backticks: \`like so\`. Note that argument names
194
+ and objects like True, None, or any strings should usually be put in `code`.
195
+
196
+ When mentioning a class, function, or method, it is recommended to use our syntax for internal links so that our tool
197
+ adds a link to its documentation with this syntax: \[\`XXXClass\`\] or \[\`function\`\]. This requires the class or
198
+ function to be in the main package.
199
+
200
+ If you want to create a link to some internal class or function, you need to
201
+ provide its path. For instance: \[\`utils.ModelOutput\`\]. This will be converted into a link with
202
+ `utils.ModelOutput` in the description. To get rid of the path and only keep the name of the object you are
203
+ linking to in the description, add a ~: \[\`~utils.ModelOutput\`\] will generate a link with `ModelOutput` in the description.
204
+
205
+ The same works for methods so you can either use \[\`XXXClass.method\`\] or \[\`~XXXClass.method\`\].
206
+
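+ For instance, a docstring for a (made-up) helper function could combine these links as follows:
+
+ ```python
+ def decode_tokens(token_ids):
+     """
+     Converts `token_ids` back to text with [`AlbertTokenizer`]. See
+     [`~PreTrainedTokenizer.decode`] for details.
+     """
+ ```
+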
207
+ #### Defining arguments in a method
208
+
209
+ Arguments should be defined with the `Args:` (or `Arguments:` or `Parameters:`) prefix, followed by a line return and
210
+ an indentation. The argument should be followed by its type, with its shape if it is a tensor, a colon, and its
211
+ description:
212
+
213
+ ```
214
+ Args:
215
+ n_layers (`int`): The number of layers of the model.
216
+ ```
217
+
218
+ If the description is too long to fit in one line, another indentation is necessary before writing the description
219
+ after the argument.
220
+
221
+ Here's an example showcasing everything so far:
222
+
223
+ ```
224
+ Args:
225
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
226
+ Indices of input sequence tokens in the vocabulary.
227
+
228
+ Indices can be obtained using [`AlbertTokenizer`]. See [`~PreTrainedTokenizer.encode`] and
229
+ [`~PreTrainedTokenizer.__call__`] for details.
230
+
231
+ [What are input IDs?](../glossary#input-ids)
232
+ ```
233
+
234
+ For optional arguments or arguments with defaults we follow the following syntax: imagine we have a function with the
235
+ following signature:
236
+
237
+ ```
238
+ def my_function(x: str = None, a: float = 1):
239
+ ```
240
+
241
+ then its documentation should look like this:
242
+
243
+ ```
244
+ Args:
245
+ x (`str`, *optional*):
246
+ This argument controls ...
247
+ a (`float`, *optional*, defaults to 1):
248
+ This argument is used to ...
249
+ ```
250
+
251
+ Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even
252
+ if the first line describing your argument type and its default gets long, you can't break it on several lines. You can,
253
+ however, write as many lines as you want in the indented description (see the example above with `input_ids`).
254
+
255
+ #### Writing a multi-line code block
256
+
257
+ Multi-line code blocks can be useful for displaying examples. They are done between two lines of three backticks as usual in Markdown:
258
+
259
+
260
+ ````
261
+ ```
262
+ # first line of code
263
+ # second line
264
+ # etc
265
+ ```
266
+ ````
267
+
268
+ We follow the [doctest](https://docs.python.org/3/library/doctest.html) syntax for the examples to automatically test
269
+ the results to stay consistent with the library.
270
+
271
+ #### Writing a return block
272
+
273
+ The return block should be introduced with the `Returns:` prefix, followed by a line return and an indentation.
274
+ The first line should be the type of the return, followed by a line return. No need to indent further for the elements
275
+ building the return.
276
+
277
+ Here's an example of a single value return:
278
+
279
+ ```
280
+ Returns:
281
+ `List[int]`: A list of integers in the range [0, 1] -- 1 for a special token, 0 for a sequence token.
282
+ ```
283
+
284
+ Here's an example of a tuple return, comprising several objects:
285
+
286
+ ```
287
+ Returns:
288
+ `tuple(torch.FloatTensor)` comprising various elements depending on the configuration ([`BertConfig`]) and inputs:
289
+ - **loss** (*optional*, returned when `masked_lm_labels` is provided) `torch.FloatTensor` of shape `(1,)` --
290
+ Total loss is the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
291
+ - **prediction_scores** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) --
292
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
293
+ ```
294
+
295
+ #### Adding an image
296
+
297
+ Because the repository is growing rapidly, it is important not to add files that would significantly weigh it down. This includes images, videos, and other non-text files. We prefer to place such files in a `dataset` hosted on hf.co, like
298
+ the ones on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing), and to reference
299
+ them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
300
+ If you are making an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate them
301
+ to this dataset.
302
+
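+ For instance, once an image has been uploaded to that dataset, it can be referenced in a doc page by URL like this (the file name below is only a placeholder):
+
+ ```
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/my_model_architecture.png" alt="Architecture of the model"/>
+ ```
+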
303
+ ## Styling the docstring
304
+
305
+ We have an automatic script, run with the `make style` command, that will make sure that:
306
+ - the docstrings fully take advantage of the line width
307
+ - all code examples are formatted using black, like the code of the Transformers library
308
+
309
+ This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's
310
+ recommended to commit your changes before running `make style`, so you can revert the changes done by that script
311
+ easily.
312
+
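+ One possible workflow (just a sketch) is to commit your manual edits first and only then let the script reformat everything:
+
+ ```bash
+ # commit your manual edits so they can be recovered if needed
+ git add -u
+ git commit -m "Draft docstrings"
+ # then let the automatic styling run over the docstrings and code examples
+ make style
+ ```
+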
313
+ # Testing documentation examples
314
+
315
+ Good documentation often comes with an example of how a specific function or class should be used.
316
+ Each model class should contain at least one example showcasing
317
+ how to use this model class in inference. *E.g.* the class [Wav2Vec2ForCTC](https://huggingface.co/docs/transformers/model_doc/wav2vec2#transformers.Wav2Vec2ForCTC)
318
+ includes an example of how to transcribe speech to text in the
319
+ [docstring of its forward function](https://huggingface.co/docs/transformers/model_doc/wav2vec2#transformers.Wav2Vec2ForCTC.forward).
320
+
321
+ ## Writing documentation examples
322
+
323
+ The syntax for Example docstrings can look as follows:
324
+
325
+ ```
326
+ Example:
327
+
328
+ ```python
329
+ >>> from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
330
+ >>> from datasets import load_dataset
331
+ >>> import torch
332
+
333
+ >>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
334
+ >>> dataset = dataset.sort("id")
335
+ >>> sampling_rate = dataset.features["audio"].sampling_rate
336
+
337
+ >>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
338
+ >>> model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
339
+
340
+ >>> # audio file is decoded on the fly
341
+ >>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
342
+ >>> with torch.no_grad():
343
+ ... logits = model(**inputs).logits
344
+ >>> predicted_ids = torch.argmax(logits, dim=-1)
345
+
346
+ >>> # transcribe speech
347
+ >>> transcription = processor.batch_decode(predicted_ids)
348
+ >>> transcription[0]
349
+ 'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL'
350
+ ```
351
+ ```
352
+
353
+ The docstring should give a minimal, clear example of how the respective model
354
+ is to be used in inference and also include the expected (ideally sensible)
355
+ output.
356
+ Often, readers will try out the example before even going through the function
357
+ or class definitions. Therefore, it is of utmost importance that the example
358
+ works as expected.
359
+
360
+ ## Docstring testing
361
+
362
+ To make sure the examples work as expected, each example should be included in the doctests.
363
+ We use pytest's [doctest integration](https://docs.pytest.org/doctest.html) to verify that all of our examples run correctly.
364
+ For Transformers, the doctests are run on a daily basis via GitHub Actions as can be
365
+ seen [here](https://github.com/huggingface/transformers/actions/workflows/doctests.yml).
366
+
367
+ ### For Python files
368
+
369
+ Run all the tests in the docstrings of a given file with the following command. For instance, here is how we test the modeling file of Wav2Vec2:
370
+
371
+ ```bash
372
+ pytest --doctest-modules src/transformers/models/wav2vec2/modeling_wav2vec2.py -sv --doctest-continue-on-failure
373
+ ```
374
+
375
+ If you want to isolate a specific docstring, just add `::` after the file name then type the whole path of the function/class/method whose docstring you want to test. For instance, here is how to just test the forward method of `Wav2Vec2ForCTC`:
376
+
377
+ ```bash
378
+ pytest --doctest-modules src/transformers/models/wav2vec2/modeling_wav2vec2.py::transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForCTC.forward -sv --doctest-continue-on-failure
379
+ ```
380
+
381
+ ### For Markdown files
382
+
383
+ You can test a given file locally with this command (here testing the quicktour):
384
+
385
+ ```bash
386
+ pytest --doctest-modules docs/source/quicktour.md -sv --doctest-continue-on-failure --doctest-glob="*.md"
387
+ ```
388
+
389
+ ### Writing doctests
390
+
391
+ Here are a few tips to help you debug the doctests and make them pass:
392
+
393
+ - The output of the code needs to match the expected output **exactly**, so make sure you write the exact same output. In particular, doctest will see a difference between single quotes and double quotes, or a missing parenthesis. The only exceptions to that rule are:
394
+ * whitespace: one whitespace character (space, tabulation, new line) is equivalent to any number of whitespace characters, so you can add new lines where there are spaces to make your output more readable.
395
+ * numerical values: you should never put more than 4 or 5 digits in expected results, as different setups or library versions might give you slightly different results. `doctest` is configured to ignore any difference lower than the precision you wrote to (so 1e-4 if you write 4 digits).
396
+ - Don't leave a block of code that takes very long to execute. If you can't make it fast, you can either not use the doctest syntax on it (so that it's ignored), or, if you still want to use the doctest syntax to show the results, add a comment `# doctest: +SKIP` at the end of the lines of code that take too long to execute.
397
+ - Each line of code that produces a result needs to have that result written below it. If you don't want to show an output in your code example, you can ignore it by adding a comment ` # doctest: +IGNORE_RESULT` at the end of the line of code producing it. Both directives are illustrated in the short example below.
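+
+ For instance, a short sketch combining both directives could look like this (`run_expensive_generation` is only a placeholder for a call that is too slow to run as part of the doctests):
+
+ ```python
+ >>> import random
+ >>> random.random()  # doctest: +IGNORE_RESULT
+ >>> run_expensive_generation()  # doctest: +SKIP
+ ```
+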
docs/TRANSLATING.md ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ### Translating the Transformers documentation into your language
2
+
3
+ As part of our mission to democratize machine learning, we'd love to make the Transformers library available in many more languages! Follow the steps below if you want to help translate the documentation into your language 🙏.
4
+
5
+ **🗞️ Open an issue**
6
+
7
+ To get started, navigate to the [Issues](https://github.com/huggingface/transformers/issues) page of this repo and check if anyone else has opened an issue for your language. If not, open a new issue by selecting the "Translation template" from the "New issue" button.
8
+
9
+ Once an issue exists, post a comment to indicate which chapters you'd like to work on, and we'll add your name to the list.
10
+
11
+
12
+ **🍴 Fork the repository**
13
+
14
+ First, you'll need to [fork the Transformers repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo). You can do this by clicking on the **Fork** button on the top-right corner of this repo's page.
15
+
16
+ Once you've forked the repo, you'll want to get the files on your local machine for editing. You can do that by cloning the fork with Git as follows:
17
+
18
+ ```bash
19
+ git clone https://github.com/YOUR-USERNAME/transformers.git
20
+ ```
21
+
22
+ **📋 Copy-paste the English version with a new language code**
23
+
24
+ The documentation files all live in one top-level directory:
25
+
26
+ - [`docs/source`](https://github.com/huggingface/transformers/tree/main/docs/source): All the documentation materials are organized here by language.
27
+
28
+ You'll only need to copy the files in the [`docs/source/en`](https://github.com/huggingface/transformers/tree/main/docs/source/en) directory, so first navigate to your fork of the repo and run the following:
29
+
30
+ ```bash
31
+ cd ~/path/to/transformers/docs
32
+ cp -r source/en source/LANG-ID
33
+ ```
34
+
35
+ Here, `LANG-ID` should be one of the ISO 639-1 or ISO 639-2 language codes -- see [here](https://www.loc.gov/standards/iso639-2/php/code_list.php) for a handy table.
36
+
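+ For example, to start a Spanish translation (`es`), the command would be:
+
+ ```bash
+ cp -r source/en source/es
+ ```
+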
37
+ **✍️ Start translating**
38
+
39
+ Now comes the fun part - translating the text!
40
+
41
+ The first thing we recommend is translating the part of the `_toctree.yml` file that corresponds to your doc chapter. This file is used to render the table of contents on the website.
42
+
43
+ > 🙋 If the `_toctree.yml` file doesn't yet exist for your language, you can create one by copy-pasting from the English version and deleting the sections unrelated to your chapter. Just make sure it exists in the `docs/source/LANG-ID/` directory!
44
+
45
+ The fields you should add are `local` (with the name of the file containing the translation; e.g. `autoclass_tutorial`), and `title` (with the title of the doc in your language; e.g. `Load pretrained instances with an AutoClass`) -- as a reference, here is the `_toctree.yml` for [English](https://github.com/huggingface/transformers/blob/main/docs/source/en/_toctree.yml):
46
+
47
+ ```yaml
48
+ - sections:
49
+ - local: pipeline_tutorial # Do not change this! Use the same name for your .md file
50
+ title: Pipelines for inference # Translate this!
51
+ ...
52
+ title: Tutorials # Translate this!
53
+ ```
54
+
55
+ Once you have translated the `_toctree.yml` file, you can start translating the [MDX](https://mdxjs.com/) files associated with your docs chapter.
56
+
57
+ > 🙋 If you'd like others to help you with the translation, you should [open an issue](https://github.com/huggingface/transformers/issues) and tag @stevhliu and @MKhalusova.
docs/source/_config.py ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # docstyle-ignore
2
+ INSTALL_CONTENT = """
3
+ # Transformers installation
4
+ ! pip install transformers datasets evaluate accelerate
5
+ # To install from source instead of the last release, comment the command above and uncomment the following one.
6
+ # ! pip install git+https://github.com/huggingface/transformers.git
7
+ """
8
+
9
+ notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
10
+ black_avoid_patterns = {
11
+ "{processor_class}": "FakeProcessorClass",
12
+ "{model_class}": "FakeModelClass",
13
+ "{object_class}": "FakeObjectClass",
14
+ }
docs/source/de/_config.py ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # docstyle-ignore
2
+ INSTALL_CONTENT = """
3
+ # Transformers installation
4
+ ! pip install transformers datasets evaluate accelerate
5
+ # To install from source instead of the last release, comment the command above and uncomment the following one.
6
+ # ! pip install git+https://github.com/huggingface/transformers.git
7
+ """
8
+
9
+ notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
10
+ black_avoid_patterns = {
11
+ "{processor_class}": "FakeProcessorClass",
12
+ "{model_class}": "FakeModelClass",
13
+ "{object_class}": "FakeObjectClass",
14
+ }
docs/source/de/_toctree.yml ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ - sections:
2
+ - local: index
3
+ title: 🤗 Transformers
4
+ - local: quicktour
5
+ title: Schnellstart
6
+ - local: installation
7
+ title: Installation
8
+ title: Erste Schritte
9
+ - sections:
10
+ - local: pipeline_tutorial
11
+ title: Pipelines für Inferenzen
12
+ - local: autoclass_tutorial
13
+ title: Laden von vortrainierten Instanzen mit einer AutoClass
14
+ - local: preprocessing
15
+ title: Vorverarbeiten
16
+ - local: training
17
+ title: Optimierung eines vortrainierten Modells
18
+ - local: run_scripts
19
+ title: Trainieren mit einem Skript
20
+ - local: accelerate
21
+ title: Verteiltes Training mit 🤗 Accelerate
22
+ - local: peft
23
+ title: Laden und Trainieren von Adaptern mit 🤗 PEFT
24
+ - local: model_sharing
25
+ title: Ein Modell teilen
26
+ - local: transformers_agents
27
+ title: Agents
28
+ - local: llm_tutorial
29
+ title: Generation with LLMs
30
+ title: Tutorials
31
+ - sections:
32
+ - local: contributing
33
+ title: Wie kann man zu 🤗 Transformers beitragen?
34
+ - local: add_new_model
35
+ title: Wie fügt man ein Modell zu 🤗 Transformers hinzu?
36
+ - local: add_new_pipeline
37
+ title: Wie fügt man eine Pipeline zu 🤗 Transformers hinzu?
38
+ - local: testing
39
+ title: Testen
40
+ - local: pr_checks
41
+ title: Überprüfung einer Pull Request
42
+ title: Contribute
docs/source/de/accelerate.md ADDED
@@ -0,0 +1,136 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2022 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Verteiltes Training mit 🤗 Accelerate
18
+
19
+ Da die Modelle immer größer werden, hat sich die Parallelität als Strategie zum Trainieren größerer Modelle auf begrenzter Hardware und zur Beschleunigung der Trainingsgeschwindigkeit um mehrere Größenordnungen erwiesen. Bei Hugging Face haben wir die Bibliothek [🤗 Accelerate](https://huggingface.co/docs/accelerate) entwickelt, um Nutzern zu helfen, ein 🤗 Transformers-Modell auf jeder Art von verteiltem Setup zu trainieren, egal ob es sich um mehrere GPUs auf einer Maschine oder mehrere GPUs auf mehreren Maschinen handelt. In diesem Tutorial lernen Sie, wie Sie Ihre native PyTorch-Trainingsschleife anpassen, um das Training in einer verteilten Umgebung zu ermöglichen.
20
+
21
+ ## Einrichtung
22
+
23
+ Beginnen Sie mit der Installation von 🤗 Accelerate:
24
+
25
+ ```bash
26
+ pip install accelerate
27
+ ```
28
+
29
+ Dann importieren und erstellen Sie ein [`~accelerate.Accelerator`]-Objekt. Der [`~accelerate.Accelerator`] wird automatisch Ihre Art der verteilten Einrichtung erkennen und alle notwendigen Komponenten für das Training initialisieren. Sie müssen Ihr Modell nicht explizit auf einem Gerät platzieren.
30
+
31
+ ```py
32
+ >>> from accelerate import Accelerator
33
+
34
+ >>> accelerator = Accelerator()
35
+ ```
36
+
37
+ ## Vorbereiten auf die Beschleunigung
38
+
39
+ Der nächste Schritt ist die Übergabe aller relevanten Trainingsobjekte an die Methode [`~accelerate.Accelerator.prepare`]. Dazu gehören Ihre Trainings- und Evaluierungs-DataLoader, ein Modell und ein Optimierer:
40
+
41
+ ```py
42
+ >>> train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
43
+ ... train_dataloader, eval_dataloader, model, optimizer
44
+ ... )
45
+ ```
46
+
47
+ ## Rückwärts
48
+
49
+ Die letzte Ergänzung besteht darin, das typische `loss.backward()` in der Trainingsschleife durch die 🤗 Accelerate-Methode [`~accelerate.Accelerator.backward`] zu ersetzen:
50
+
51
+ ```py
52
+ >>> for epoch in range(num_epochs):
53
+ ... for batch in train_dataloader:
54
+ ... outputs = model(**batch)
55
+ ... loss = outputs.loss
56
+ ... accelerator.backward(loss)
57
+
58
+ ... optimizer.step()
59
+ ... lr_scheduler.step()
60
+ ... optimizer.zero_grad()
61
+ ... progress_bar.update(1)
62
+ ```
63
+
64
+ Wie Sie im folgenden Code sehen können, müssen Sie nur vier zusätzliche Codezeilen zu Ihrer Trainingsschleife hinzufügen, um verteiltes Training zu ermöglichen!
65
+
66
+ ```diff
67
+ + from accelerate import Accelerator
68
+ from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler
69
+
70
+ + accelerator = Accelerator()
71
+
72
+ model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
73
+ optimizer = AdamW(model.parameters(), lr=3e-5)
74
+
75
+ - device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
76
+ - model.to(device)
77
+
78
+ + train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
79
+ + train_dataloader, eval_dataloader, model, optimizer
80
+ + )
81
+
82
+ num_epochs = 3
83
+ num_training_steps = num_epochs * len(train_dataloader)
84
+ lr_scheduler = get_scheduler(
85
+ "linear",
86
+ optimizer=optimizer,
87
+ num_warmup_steps=0,
88
+ num_training_steps=num_training_steps
89
+ )
90
+
91
+ progress_bar = tqdm(range(num_training_steps))
92
+
93
+ model.train()
94
+ for epoch in range(num_epochs):
95
+ for batch in train_dataloader:
96
+ - batch = {k: v.to(device) for k, v in batch.items()}
97
+ outputs = model(**batch)
98
+ loss = outputs.loss
99
+ - loss.backward()
100
+ + accelerator.backward(loss)
101
+
102
+ optimizer.step()
103
+ lr_scheduler.step()
104
+ optimizer.zero_grad()
105
+ progress_bar.update(1)
106
+ ```
107
+
108
+ ## Trainieren
109
+
110
+ Sobald Sie die entsprechenden Codezeilen hinzugefügt haben, starten Sie Ihr Training in einem Skript oder einem Notebook wie Colaboratory.
111
+
112
+ ### Trainieren mit einem Skript
113
+
114
+ Wenn Sie Ihr Training mit einem Skript durchführen, führen Sie den folgenden Befehl aus, um eine Konfigurationsdatei zu erstellen und zu speichern:
115
+
116
+ ```bash
117
+ accelerate config
118
+ ```
119
+
120
+ Dann starten Sie Ihr Training mit:
121
+
122
+ ```bash
123
+ accelerate launch train.py
124
+ ```
125
+
126
+ ### Trainieren mit einem Notebook
127
+
128
+ 🤗 Accelerate kann auch in einem Notebook laufen, wenn Sie planen, die TPUs von Colaboratory zu verwenden. Verpacken Sie den gesamten Code, der für das Training verantwortlich ist, in eine Funktion und übergeben Sie diese an [`~accelerate.notebook_launcher`]:
129
+
130
+ ```py
131
+ >>> from accelerate import notebook_launcher
132
+
133
+ >>> notebook_launcher(training_function)
134
+ ```
135
+
136
+ Weitere Informationen über 🤗 Accelerate und seine umfangreichen Funktionen finden Sie in der [Dokumentation](https://huggingface.co/docs/accelerate).