Upload 13 files

Files changed (13)

README.md +71 -0
all_results.json +12 -0
config.json +360 -0
ds_config.json +50 -0
eval_results.json +8 -0
model.safetensors +3 -0
preprocessor_config.json +14 -0
pytorch_model.bin +3 -0
run.sh +31 -0
run_audio_classification.py +418 -0
train_results.json +7 -0
trainer_state.json +0 -0
training_args.bin +3 -0

README.md ADDED

	@@ -0,0 +1,71 @@

+---
+license: apache-2.0
+tags:
+- audio-classification
+- generated_from_trainer
+datasets:
+- xtreme_s
+metrics:
+- accuracy
+base_model: openai/whisper-medium
+model-index:
+- name: whisper-medium-fleurs-lang-id
+  results: []
+---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# Whisper Medium FLEURS Language Identification
+This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [FLEURS subset](https://huggingface.co/datasets/google/xtreme_s#language-identification---fleurs-langid) of the [google/xtreme_s](https://huggingface.co/datasets/google/xtreme_s) dataset.
+It achieves the following results on the evaluation set:
+- Loss: 0.8413
+- Accuracy: 0.8805
+To reproduce this run, execute the command in [`run.sh`](https://huggingface.co/sanchit-gandhi/whisper-medium-fleurs-lang-id/blob/main/run.sh).
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 3e-05
+- train_batch_size: 16
+- eval_batch_size: 32
+- seed: 0
+- distributed_type: multi-GPU
+- gradient_accumulation_steps: 2
+- total_train_batch_size: 32
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 3.0
+### Training results
+| Training Loss | Epoch | Step  | Validation Loss | Accuracy |
+|:-------------:|:-----:|:-----:|:---------------:|:--------:|
+| 0.0152        | 1.0   | 8494  | 0.9087          | 0.8431   |
+| 0.0003        | 2.0   | 16988 | 1.0059          | 0.8460   |
+| 0.0           | 3.0   | 25482 | 0.8413          | 0.8805   |
+### Framework versions
+- Transformers 4.27.0.dev0
+- Pytorch 1.13.1
+- Datasets 2.9.0
+- Tokenizers 0.13.2
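
The effective batch size listed in the model card can be cross-checked against the per-device settings. The sketch below assumes the usual relation (per-device batch size × gradient accumulation steps × number of processes); the number of processes is not stated in the card, so it is inferred here rather than known. Under that assumption the arithmetic implies a single process, even though the card lists `distributed_type: multi-GPU`.

```python
# Values copied from the model card above.
train_batch_size = 16              # per-device batch size
gradient_accumulation_steps = 2
total_train_batch_size = 32
steps_per_epoch = 8494             # step count logged at epoch 1.0

# Assumption: total = per_device * accumulation * num_processes.
implied_processes = total_train_batch_size // (
    train_batch_size * gradient_accumulation_steps
)

# Upper bound on examples consumed per epoch (the final batch may be partial).
max_examples_per_epoch = steps_per_epoch * total_train_batch_size

print(implied_processes)
print(max_examples_per_epoch)
```

The second number is only an upper bound on the size of the training split, since the last batch of an epoch need not be full.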

all_results.json ADDED

	@@ -0,0 +1,12 @@

+{
+    "epoch": 3.0,
+    "eval_accuracy": 0.8805294322535702,
+    "eval_loss": 0.84130859375,
+    "eval_runtime": 4369.2701,
+    "eval_samples_per_second": 7.885,
+    "eval_steps_per_second": 0.246,
+    "train_loss": 0.06268550049036697,
+    "train_runtime": 389325.9759,
+    "train_samples_per_second": 2.094,
+    "train_steps_per_second": 0.065
+}
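
The throughput fields above are related to the runtimes by simple multiplication, which gives a rough estimate of how many examples were processed. This is a back-of-the-envelope sketch: it assumes `*_samples_per_second` is total samples divided by runtime, and the reported values are rounded, so the results are approximate.

```python
# Values copied from all_results.json above.
epochs = 3.0
train_runtime = 389325.9759            # seconds
train_samples_per_second = 2.094
eval_runtime = 4369.2701               # seconds
eval_samples_per_second = 7.885

# Approximate number of training examples seen in total and per epoch.
train_samples_total = train_samples_per_second * train_runtime
train_samples_per_epoch = train_samples_total / epochs

# Approximate size of the evaluation split.
eval_samples = eval_samples_per_second * eval_runtime

# Wall-clock training time expressed in days.
train_days = train_runtime / 86400

print(round(train_samples_per_epoch), round(eval_samples), round(train_days, 2))
```

The per-epoch estimate is consistent with the step counts in the model card (8494 steps at a total batch size of 32).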

config.json ADDED

	@@ -0,0 +1,360 @@

+{
+  "_name_or_path": "sanchit-gandhi/whisper-medium-fleurs-lang-id",
+  "activation_dropout": 0.0,
+  "activation_function": "gelu",
+  "apply_spec_augment": false,
+  "architectures": [
+    "WhisperForAudioClassification"
+  ],
+  "attention_dropout": 0.0,
+  "begin_suppress_tokens": [
+    220,
+    50257
+  ],
+  "bos_token_id": 50257,
+  "classifier_proj_size": 256,
+  "d_model": 1024,
+  "decoder_attention_heads": 16,
+  "decoder_ffn_dim": 4096,
+  "decoder_layerdrop": 0.0,
+  "decoder_layers": 24,
+  "decoder_start_token_id": 50258,
+  "dropout": 0.0,
+  "encoder_attention_heads": 16,
+  "encoder_ffn_dim": 4096,
+  "encoder_layerdrop": 0.0,
+  "encoder_layers": 24,
+  "eos_token_id": 50257,
+  "finetuning_task": "audio-classification",
+  "forced_decoder_ids": [
+    [
+      1,
+      50259
+    ],
+    [
+      2,
+      50359
+    ],
+    [
+      3,
+      50363
+    ]
+  ],
+  "id2label": {
+    "0": "Afrikaans",
+    "1": "Amharic",
+    "2": "Arabic",
+    "3": "Assamese",
+    "4": "Asturian",
+    "5": "Azerbaijani",
+    "6": "Belarusian",
+    "7": "Bulgarian",
+    "8": "Bengali",
+    "9": "Bosnian",
+    "10": "Catalan",
+    "11": "Cebuano",
+    "12": "Sorani-Kurdish",
+    "13": "Mandarin Chinese",
+    "14": "Czech",
+    "15": "Welsh",
+    "16": "Danish",
+    "17": "German",
+    "18": "Greek",
+    "19": "English",
+    "20": "Spanish",
+    "21": "Estonian",
+    "22": "Persian",
+    "23": "Fula",
+    "24": "Finnish",
+    "25": "Filipino",
+    "26": "French",
+    "27": "Irish",
+    "28": "Galician",
+    "29": "Gujarati",
+    "30": "Hausa",
+    "31": "Hebrew",
+    "32": "Hindi",
+    "33": "Croatian",
+    "34": "Hungarian",
+    "35": "Armenian",
+    "36": "Indonesian",
+    "37": "Igbo",
+    "38": "Icelandic",
+    "39": "Italian",
+    "40": "Japanese",
+    "41": "Javanese",
+    "42": "Georgian",
+    "43": "Kamba",
+    "44": "Kabuverdianu",
+    "45": "Kazakh",
+    "46": "Khmer",
+    "47": "Kannada",
+    "48": "Korean",
+    "49": "Kyrgyz",
+    "50": "Luxembourgish",
+    "51": "Ganda",
+    "52": "Lingala",
+    "53": "Lao",
+    "54": "Lithuanian",
+    "55": "Luo",
+    "56": "Latvian",
+    "57": "Maori",
+    "58": "Macedonian",
+    "59": "Malayalam",
+    "60": "Mongolian",
+    "61": "Marathi",
+    "62": "Malay",
+    "63": "Maltese",
+    "64": "Burmese",
+    "65": "Norwegian",
+    "66": "Nepali",
+    "67": "Dutch",
+    "68": "Northern-Sotho",
+    "69": "Nyanja",
+    "70": "Occitan",
+    "71": "Oromo",
+    "72": "Oriya",
+    "73": "Punjabi",
+    "74": "Polish",
+    "75": "Pashto",
+    "76": "Portuguese",
+    "77": "Romanian",
+    "78": "Russian",
+    "79": "Sindhi",
+    "80": "Slovak",
+    "81": "Slovenian",
+    "82": "Shona",
+    "83": "Somali",
+    "84": "Serbian",
+    "85": "Swedish",
+    "86": "Swahili",
+    "87": "Tamil",
+    "88": "Telugu",
+    "89": "Tajik",
+    "90": "Thai",
+    "91": "Turkish",
+    "92": "Ukrainian",
+    "93": "Umbundu",
+    "94": "Urdu",
+    "95": "Uzbek",
+    "96": "Vietnamese",
+    "97": "Wolof",
+    "98": "Xhosa",
+    "99": "Yoruba",
+    "100": "Cantonese Chinese",
+    "101": "Zulu"
+  },
+  "init_std": 0.02,
+  "is_encoder_decoder": true,
+  "label2id": {
+    "Afrikaans": "0",
+    "Amharic": "1",
+    "Arabic": "2",
+    "Armenian": "35",
+    "Assamese": "3",
+    "Asturian": "4",
+    "Azerbaijani": "5",
+    "Belarusian": "6",
+    "Bengali": "8",
+    "Bosnian": "9",
+    "Bulgarian": "7",
+    "Burmese": "64",
+    "Cantonese Chinese": "100",
+    "Catalan": "10",
+    "Cebuano": "11",
+    "Croatian": "33",
+    "Czech": "14",
+    "Danish": "16",
+    "Dutch": "67",
+    "English": "19",
+    "Estonian": "21",
+    "Filipino": "25",
+    "Finnish": "24",
+    "French": "26",
+    "Fula": "23",
+    "Galician": "28",
+    "Ganda": "51",
+    "Georgian": "42",
+    "German": "17",
+    "Greek": "18",
+    "Gujarati": "29",
+    "Hausa": "30",
+    "Hebrew": "31",
+    "Hindi": "32",
+    "Hungarian": "34",
+    "Icelandic": "38",
+    "Igbo": "37",
+    "Indonesian": "36",
+    "Irish": "27",
+    "Italian": "39",
+    "Japanese": "40",
+    "Javanese": "41",
+    "Kabuverdianu": "44",
+    "Kamba": "43",
+    "Kannada": "47",
+    "Kazakh": "45",
+    "Khmer": "46",
+    "Korean": "48",
+    "Kyrgyz": "49",
+    "Lao": "53",
+    "Latvian": "56",
+    "Lingala": "52",
+    "Lithuanian": "54",
+    "Luo": "55",
+    "Luxembourgish": "50",
+    "Macedonian": "58",
+    "Malay": "62",
+    "Malayalam": "59",
+    "Maltese": "63",
+    "Mandarin Chinese": "13",
+    "Maori": "57",
+    "Marathi": "61",
+    "Mongolian": "60",
+    "Nepali": "66",
+    "Northern-Sotho": "68",
+    "Norwegian": "65",
+    "Nyanja": "69",
+    "Occitan": "70",
+    "Oriya": "72",
+    "Oromo": "71",
+    "Pashto": "75",
+    "Persian": "22",
+    "Polish": "74",
+    "Portuguese": "76",
+    "Punjabi": "73",
+    "Romanian": "77",
+    "Russian": "78",
+    "Serbian": "84",
+    "Shona": "82",
+    "Sindhi": "79",
+    "Slovak": "80",
+    "Slovenian": "81",
+    "Somali": "83",
+    "Sorani-Kurdish": "12",
+    "Spanish": "20",
+    "Swahili": "86",
+    "Swedish": "85",
+    "Tajik": "89",
+    "Tamil": "87",
+    "Telugu": "88",
+    "Thai": "90",
+    "Turkish": "91",
+    "Ukrainian": "92",
+    "Umbundu": "93",
+    "Urdu": "94",
+    "Uzbek": "95",
+    "Vietnamese": "96",
+    "Welsh": "15",
+    "Wolof": "97",
+    "Xhosa": "98",
+    "Yoruba": "99",
+    "Zulu": "101"
+  },
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_prob": 0.05,
+  "max_length": 448,
+  "max_source_positions": 1500,
+  "max_target_positions": 448,
+  "model_type": "whisper",
+  "num_hidden_layers": 24,
+  "num_mel_bins": 80,
+  "pad_token_id": 50257,
+  "scale_embedding": false,
+  "suppress_tokens": [
+    1,
+    2,
+    7,
+    8,
+    9,
+    10,
+    14,
+    25,
+    26,
+    27,
+    28,
+    29,
+    31,
+    58,
+    59,
+    60,
+    61,
+    62,
+    63,
+    90,
+    91,
+    92,
+    93,
+    359,
+    503,
+    522,
+    542,
+    873,
+    893,
+    902,
+    918,
+    922,
+    931,
+    1350,
+    1853,
+    1982,
+    2460,
+    2627,
+    3246,
+    3253,
+    3268,
+    3536,
+    3846,
+    3961,
+    4183,
+    4667,
+    6585,
+    6647,
+    7273,
+    9061,
+    9383,
+    10428,
+    10929,
+    11938,
+    12033,
+    12331,
+    12562,
+    13793,
+    14157,
+    14635,
+    15265,
+    15618,
+    16553,
+    16604,
+    18362,
+    18956,
+    20075,
+    21675,
+    22520,
+    26130,
+    26161,
+    26435,
+    28279,
+    29464,
+    31650,
+    32302,
+    32470,
+    36865,
+    42863,
+    47425,
+    49870,
+    50254,
+    50258,
+    50360,
+    50361,
+    50362
+  ],
+  "torch_dtype": "float16",
+  "transformers_version": "4.30.0.dev0",
+  "use_cache": true,
+  "use_weighted_layer_sum": false,
+  "vocab_size": 51865
+}
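
In the raw JSON, both `id2label` keys and `label2id` values are strings, because the training script writes them with `str(i)`. A minimal sketch of checking that the two mappings are mutual inverses, using a three-entry excerpt copied from the config; the helper names `mappings_are_inverse` and `label_for` are hypothetical and not part of any library.

```python
# Three entries excerpted from config.json above; the full file has 102.
id2label = {"0": "Afrikaans", "19": "English", "101": "Zulu"}
label2id = {"Afrikaans": "0", "English": "19", "Zulu": "101"}


def mappings_are_inverse(id2label: dict, label2id: dict) -> bool:
    """Return True if every id maps to a label that maps back to the same id."""
    return len(id2label) == len(label2id) and all(
        label2id.get(label) == idx for idx, label in id2label.items()
    )


def label_for(class_index: int) -> str:
    """Look up the language name for an integer class index (raw JSON keys are strings)."""
    return id2label[str(class_index)]


print(mappings_are_inverse(id2label, label2id))
print(label_for(19))
```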

ds_config.json ADDED

	@@ -0,0 +1,50 @@

+{
+    "fp16": {
+        "enabled": "auto",
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    },
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+   "scheduler": {
+         "type": "WarmupDecayLR",
+         "params": {
+             "last_batch_iteration": -1,
+             "total_num_steps": "auto",
+             "warmup_min_lr": "auto",
+             "warmup_max_lr": "auto",
+             "warmup_num_steps": "auto"
+         }
+     },
+    "zero_optimization": {
+        "stage": 2,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "allgather_partitions": true,
+        "allgather_bucket_size": 2e8,
+        "overlap_comm": true,
+        "reduce_scatter": true,
+        "reduce_bucket_size": 2e8,
+        "contiguous_gradients": true
+    },
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto"
+}

eval_results.json ADDED

	@@ -0,0 +1,8 @@

+{
+    "epoch": 3.0,
+    "eval_accuracy": 0.8805294322535702,
+    "eval_loss": 0.84130859375,
+    "eval_runtime": 4369.2701,
+    "eval_samples_per_second": 7.885,
+    "eval_steps_per_second": 0.246
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dfe1b47efa122ea382e06b17209f1c8b7424d39b6e3224520e601fbb9cd5aaa2
+size 615050492

preprocessor_config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "chunk_length": 30,
+  "feature_extractor_type": "WhisperFeatureExtractor",
+  "feature_size": 80,
+  "hop_length": 160,
+  "n_fft": 400,
+  "n_samples": 480000,
+  "nb_max_frames": 3000,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "processor_class": "WhisperProcessor",
+  "return_attention_mask": false,
+  "sampling_rate": 16000
+}
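
The numeric fields in this preprocessor config are not independent: several follow from the chunk length, sampling rate and hop length. The sketch below checks those relations. The factor of 2 linking `nb_max_frames` to `max_source_positions` (from `config.json`) reflects the stride-2 convolution in the Whisper encoder; that detail is not stated in this diff and is included as background knowledge.

```python
# Values copied from preprocessor_config.json above.
chunk_length = 30        # seconds
sampling_rate = 16000    # Hz
hop_length = 160         # samples between successive frames
n_fft = 400              # samples per analysis window
n_samples = 480000
nb_max_frames = 3000

# Value copied from config.json.
max_source_positions = 1500

# A 30-second chunk at 16 kHz holds this many samples.
derived_n_samples = chunk_length * sampling_rate

# One log-mel frame is produced every `hop_length` samples.
derived_nb_max_frames = derived_n_samples // hop_length

# Frame spacing and window length in milliseconds.
frame_shift_ms = 1000 * hop_length / sampling_rate
window_ms = 1000 * n_fft / sampling_rate

# Assumption: the encoder halves the time axis (stride-2 convolution).
derived_source_positions = derived_nb_max_frames // 2

print(derived_n_samples, derived_nb_max_frames, derived_source_positions)
print(frame_shift_ms, window_ms)
```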

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6ded31e35036a85fe27b810198c6f9dd332d8b506df244c156b3af8524a01bce
+size 615058493

run.sh ADDED

	@@ -0,0 +1,31 @@

+deepspeed run_audio_classification.py \
+    --deepspeed ds_config.json \
+    --model_name_or_path openai/whisper-medium \
+    --dataset_name google/xtreme_s \
+    --dataset_config_name fleurs.all \
+    --output_dir ./ \
+    --overwrite_output_dir \
+    --remove_unused_columns False \
+    --do_train \
+    --do_eval \
+    --fp16 \
+    --learning_rate 3e-5 \
+    --max_length_seconds 30 \
+    --label_column_name lang_id \
+    --attention_mask False \
+    --warmup_ratio 0.1 \
+    --num_train_epochs 3 \
+    --per_device_train_batch_size 16 \
+    --gradient_accumulation_steps 2 \
+    --gradient_checkpointing True \
+    --per_device_eval_batch_size 32 \
+    --dataloader_num_workers 8 \
+    --logging_strategy steps \
+    --logging_steps 25 \
+    --evaluation_strategy epoch \
+    --save_strategy epoch \
+    --load_best_model_at_end True \
+    --metric_for_best_model accuracy \
+    --seed 0 \
+    --freeze_feature_encoder False \
+    --push_to_hub
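
The `--learning_rate 3e-5` and `--warmup_ratio 0.1` flags, together with the linear scheduler named in the model card, describe a learning rate that ramps up and then decays. The following is an idealised sketch of that shape, not the scheduler actually executed: the run delegates scheduling to DeepSpeed via `ds_config.json`, and both the rounding of the warmup step count and the exact warmup curve are assumptions here. The function name `linear_schedule_lr` is hypothetical; the total step count is taken from the training results table.

```python
import math

peak_lr = 3e-5           # --learning_rate
warmup_ratio = 0.1       # --warmup_ratio
total_steps = 25482      # optimizer steps after 3 epochs (training results table)

# Assumption: warmup steps are the ratio times total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step: int) -> float:
    """Idealised linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step / warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * (remaining / (total_steps - warmup_steps))


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step))
```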

run_audio_classification.py ADDED

	@@ -0,0 +1,418 @@

+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import os
+import sys
+import warnings
+from dataclasses import dataclass, field
+from random import randint
+from typing import Optional
+import datasets
+import evaluate
+import numpy as np
+from datasets import DatasetDict, load_dataset
+import transformers
+from transformers import (
+    AutoConfig,
+    AutoFeatureExtractor,
+    AutoModelForAudioClassification,
+    HfArgumentParser,
+    Trainer,
+    TrainingArguments,
+    set_seed,
+)
+from transformers.trainer_utils import get_last_checkpoint
+from transformers.utils import check_min_version, send_example_telemetry
+from transformers.utils.versions import require_version
+logger = logging.getLogger(__name__)
+# Will error if the minimal version of Transformers is not installed. Remove at your own risk.
+check_min_version("4.27.0.dev0")
+require_version("datasets>=1.14.0", "To fix: pip install -r examples/pytorch/audio-classification/requirements.txt")
+def random_subsample(wav: np.ndarray, max_length: float, sample_rate: int = 16000):
+    """Randomly sample chunks of `max_length` seconds from the input audio"""
+    sample_length = int(round(sample_rate * max_length))
+    if len(wav) <= sample_length:
+        return wav
+    random_offset = randint(0, len(wav) - sample_length - 1)
+    return wav[random_offset : random_offset + sample_length]
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to the data we feed to the model for training and evaluation.
+    Using `HfArgumentParser` we can turn this class
+    into argparse arguments to be able to specify them on
+    the command line.
+    """
+    dataset_name: Optional[str] = field(default=None, metadata={"help": "Name of a dataset from the datasets package"})
+    dataset_config_name: Optional[str] = field(
+        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
+    )
+    train_file: Optional[str] = field(
+        default=None, metadata={"help": "A file containing the training audio paths and labels."}
+    )
+    eval_file: Optional[str] = field(
+        default=None, metadata={"help": "A file containing the validation audio paths and labels."}
+    )
+    train_split_name: str = field(
+        default="train",
+        metadata={
+            "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
+        },
+    )
+    eval_split_name: str = field(
+        default="validation",
+        metadata={
+            "help": (
+                "The name of the evaluation data set split to use (via the datasets library). Defaults to 'validation'"
+            )
+        },
+    )
+    audio_column_name: str = field(
+        default="audio",
+        metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
+    )
+    label_column_name: str = field(
+        default="label", metadata={"help": "The name of the dataset column containing the labels. Defaults to 'label'"}
+    )
+    max_train_samples: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": (
+                "For debugging purposes or quicker training, truncate the number of training examples to this "
+                "value if set."
+            )
+        },
+    )
+    max_eval_samples: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": (
+                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
+                "value if set."
+            )
+        },
+    )
+    max_length_seconds: float = field(
+        default=20,
+        metadata={"help": "Audio clips will be randomly cut to this length during training if the value is set."},
+    )
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+    model_name_or_path: str = field(
+        default="facebook/wav2vec2-base",
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"},
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from the Hub"}
+    )
+    model_revision: str = field(
+        default="main",
+        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
+    )
+    feature_extractor_name: Optional[str] = field(
+        default=None, metadata={"help": "Name or path of preprocessor config."}
+    )
+    freeze_feature_encoder: bool = field(
+        default=True, metadata={"help": "Whether to freeze the feature encoder layers of the model."}
+    )
+    attention_mask: bool = field(
+        default=True, metadata={"help": "Whether to generate an attention mask in the feature extractor."}
+    )
+    use_auth_token: bool = field(
+        default=False,
+        metadata={
+            "help": (
+                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
+                "with private models)."
+            )
+        },
+    )
+    freeze_feature_extractor: Optional[bool] = field(
+        default=None, metadata={"help": "Whether to freeze the feature extractor layers of the model."}
+    )
+    ignore_mismatched_sizes: bool = field(
+        default=False,
+        metadata={"help": "Allow loading a pretrained model whose head dimensions differ from the config."},
+    )
+    def __post_init__(self):
+        if not self.freeze_feature_extractor and self.freeze_feature_encoder:
+            warnings.warn(
+                "The argument `--freeze_feature_extractor` is deprecated and "
+                "will be removed in a future version. Use `--freeze_feature_encoder` "
+                "instead. Setting `freeze_feature_encoder==True`.",
+                FutureWarning,
+            )
+        if self.freeze_feature_extractor and not self.freeze_feature_encoder:
+            raise ValueError(
+                "The argument `--freeze_feature_extractor` is deprecated and "
+                "should not be used in combination with `--freeze_feature_encoder`. "
+                "Only make use of `--freeze_feature_encoder`."
+            )
+def main():
+    # See all possible arguments in src/transformers/training_args.py
+    # or by passing the --help flag to this script.
+    # We now keep distinct sets of args, for a cleaner separation of concerns.
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
+    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
+        # If we pass only one argument to the script and it's the path to a json file,
+        # let's parse it to get our arguments.
+        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
+    else:
+        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+    # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
+    # information sent is the one passed as arguments along with your Python/PyTorch versions.
+    send_example_telemetry("run_audio_classification", model_args, data_args)
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
+    )
+    if training_args.should_log:
+        # The default of training_args.log_level is passive, so we set log level at info here to have that default.
+        transformers.utils.logging.set_verbosity_info()
+    log_level = training_args.get_process_log_level()
+    logger.setLevel(log_level)
+    transformers.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.enable_default_handler()
+    transformers.utils.logging.enable_explicit_format()
+    # Log on each process the small summary:
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu} "
+        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
+    )
+    logger.info(f"Training/evaluation parameters {training_args}")
+    # Set seed before initializing model.
+    set_seed(training_args.seed)
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to train from scratch."
+            )
+        elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )
+    # Initialize our dataset and prepare it for the audio classification task.
+    raw_datasets = DatasetDict()
+    raw_datasets["train"] = load_dataset(
+        data_args.dataset_name,
+        data_args.dataset_config_name,
+        split=data_args.train_split_name,
+        use_auth_token=True if model_args.use_auth_token else None,
+    )
+    raw_datasets["eval"] = load_dataset(
+        data_args.dataset_name,
+        data_args.dataset_config_name,
+        split=data_args.eval_split_name,
+        use_auth_token=True if model_args.use_auth_token else None,
+    )
+    if data_args.audio_column_name not in raw_datasets["train"].column_names:
+        raise ValueError(
+            f"--audio_column_name {data_args.audio_column_name} not found in dataset '{data_args.dataset_name}'. "
+            "Make sure to set `--audio_column_name` to the correct audio column - one of "
+            f"{', '.join(raw_datasets['train'].column_names)}."
+        )
+    if data_args.label_column_name not in raw_datasets["train"].column_names:
+        raise ValueError(
+            f"--label_column_name {data_args.label_column_name} not found in dataset '{data_args.dataset_name}'. "
+            "Make sure to set `--label_column_name` to the correct label column - one of "
+            f"{', '.join(raw_datasets['train'].column_names)}."
+        )
+    # Setting `return_attention_mask=True` is the way to get a correctly masked mean-pooling over
+    # transformer outputs in the classifier, but it doesn't always lead to better accuracy
+    feature_extractor = AutoFeatureExtractor.from_pretrained(
+        model_args.feature_extractor_name or model_args.model_name_or_path,
+        return_attention_mask=model_args.attention_mask,
+        cache_dir=model_args.cache_dir,
+        revision=model_args.model_revision,
+        use_auth_token=True if model_args.use_auth_token else None,
+    )
+    # `datasets` takes care of automatically loading and resampling the audio,
+    # so we just need to set the correct target sampling rate.
+    raw_datasets = raw_datasets.cast_column(
+        data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate)
+    )
+    model_input_name = feature_extractor.model_input_names[0]
+    def train_transforms(batch):
+        """Apply train_transforms across a batch."""
+        subsampled_wavs = []
+        for audio in batch[data_args.audio_column_name]:
+            wav = random_subsample(
+                audio["array"], max_length=data_args.max_length_seconds, sample_rate=feature_extractor.sampling_rate
+            )
+            subsampled_wavs.append(wav)
+        inputs = feature_extractor(subsampled_wavs, sampling_rate=feature_extractor.sampling_rate)
+        output_batch = {model_input_name: inputs.get(model_input_name)}
+        output_batch["labels"] = list(batch[data_args.label_column_name])
+        return output_batch
+    def val_transforms(batch):
+        """Apply val_transforms across a batch."""
+        wavs = [audio["array"] for audio in batch[data_args.audio_column_name]]
+        inputs = feature_extractor(wavs, sampling_rate=feature_extractor.sampling_rate)
+        output_batch = {model_input_name: inputs.get(model_input_name)}
+        output_batch["labels"] = list(batch[data_args.label_column_name])
+        return output_batch
+    # Prepare label mappings.
+    # We'll include these in the model's config to get human readable labels in the Inference API.
+    labels = raw_datasets["train"].features[data_args.label_column_name].names
+    label2id, id2label = {}, {}
+    for i, label in enumerate(labels):
+        label2id[label] = str(i)
+        id2label[str(i)] = label
+    # Load the accuracy metric from the evaluate package
+    metric = evaluate.load("accuracy")
+    # Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with
+    # `predictions` and `label_ids` fields) and must return a dictionary mapping strings to floats.
+    def compute_metrics(eval_pred):
+        """Computes accuracy on a batch of predictions"""
+        predictions = np.argmax(eval_pred.predictions, axis=1)
+        return metric.compute(predictions=predictions, references=eval_pred.label_ids)
+    config = AutoConfig.from_pretrained(
+        model_args.config_name or model_args.model_name_or_path,
+        num_labels=len(labels),
+        label2id=label2id,
+        id2label=id2label,
+        finetuning_task="audio-classification",
+        cache_dir=model_args.cache_dir,
+        revision=model_args.model_revision,
+        use_auth_token=True if model_args.use_auth_token else None,
+    )
+    model = AutoModelForAudioClassification.from_pretrained(
+        model_args.model_name_or_path,
+        from_tf=bool(".ckpt" in model_args.model_name_or_path),
+        config=config,
+        cache_dir=model_args.cache_dir,
+        revision=model_args.model_revision,
+        use_auth_token=True if model_args.use_auth_token else None,
+        ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
+    )
+    # freeze the convolutional waveform encoder
+    if model_args.freeze_feature_encoder:
+        model.freeze_feature_encoder()
+    if training_args.do_train:
+        if data_args.max_train_samples is not None:
+            raw_datasets["train"] = (
+                raw_datasets["train"].shuffle(seed=training_args.seed).select(range(data_args.max_train_samples))
+            )
+        # Set the training transforms
+        raw_datasets["train"].set_transform(train_transforms, output_all_columns=False)
+    if training_args.do_eval:
+        if data_args.max_eval_samples is not None:
+            raw_datasets["eval"] = (
+                raw_datasets["eval"].shuffle(seed=training_args.seed).select(range(data_args.max_eval_samples))
+            )
+        # Set the validation transforms
+        raw_datasets["eval"].set_transform(val_transforms, output_all_columns=False)
+    # Initialize our trainer
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=raw_datasets["train"] if training_args.do_train else None,
+        eval_dataset=raw_datasets["eval"] if training_args.do_eval else None,
+        compute_metrics=compute_metrics,
+        tokenizer=feature_extractor,
+    )
+    # Training
+    if training_args.do_train:
+        checkpoint = None
+        if training_args.resume_from_checkpoint is not None:
+            checkpoint = training_args.resume_from_checkpoint
+        elif last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
+        trainer.save_model()
+        trainer.log_metrics("train", train_result.metrics)
+        trainer.save_metrics("train", train_result.metrics)
+        trainer.save_state()
+    # Evaluation
+    if training_args.do_eval:
+        metrics = trainer.evaluate()
+        trainer.log_metrics("eval", metrics)
+        trainer.save_metrics("eval", metrics)
+    # Write model card and (optionally) push to hub
+    kwargs = {
+        "finetuned_from": model_args.model_name_or_path,
+        "tasks": "audio-classification",
+        "dataset": data_args.dataset_name,
+        "tags": ["audio-classification"],
+    }
+    if training_args.push_to_hub:
+        trainer.push_to_hub(**kwargs)
+    else:
+        trainer.create_model_card(**kwargs)
+if __name__ == "__main__":
+    main()
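
The `random_subsample` helper in the script above crops a fixed-length window at a random offset. The sketch below mirrors its logic on plain Python lists so it runs without NumPy; the `rng` parameter is a hypothetical addition for reproducibility and is not in the original. Because `randint` includes both endpoints and the upper bound is `len(wav) - sample_length - 1`, the very last possible window is never selected. This has no effect on correctness, only on which windows can be drawn.

```python
from random import Random


def random_subsample(wav, max_length, sample_rate=16000, rng=None):
    """Return a random window of `max_length` seconds, or `wav` unchanged if it is short enough."""
    rng = rng or Random()  # hypothetical parameter: lets callers pass a seeded generator
    sample_length = int(round(sample_rate * max_length))
    if len(wav) <= sample_length:
        return wav
    # Same bounds as the script: randint is inclusive on both ends.
    offset = rng.randint(0, len(wav) - sample_length - 1)
    return wav[offset : offset + sample_length]


# Toy signal of 10 "samples" at 1 Hz, cropped to 4 seconds.
wav = list(range(10))
rng = Random(0)
clips = [random_subsample(wav, max_length=4, sample_rate=1, rng=rng) for _ in range(200)]

# Since wav[i] == i, the first element of each clip is its offset.
offsets_seen = sorted({clip[0] for clip in clips})
print(offsets_seen)
```

In this toy case the valid offsets are 0 through 6, but only 0 through 5 can be drawn.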

train_results.json ADDED

	@@ -0,0 +1,7 @@

+{
+    "epoch": 3.0,
+    "train_loss": 0.06268550049036697,
+    "train_runtime": 389325.9759,
+    "train_samples_per_second": 2.094,
+    "train_steps_per_second": 0.065
+}

trainer_state.json ADDED

The diff for this file is too large to render. See raw diff

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:056cee6adb1b2f1fcd2f38aa61d20cb381ad85614636d6b75ea4483a58612531
+size 4731