koichi12 committed
Commit 84a9380 · verified · 1 Parent(s): 62eda46

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/__init__.cpython-310.pyc +0 -0
  2. scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/__main__.cpython-310.pyc +0 -0
  3. scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/evaluator.cpython-310.pyc +0 -0
  4. scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/evaluator_utils.cpython-310.pyc +0 -0
  5. scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/utils.cpython-310.pyc +0 -0
  6. scripts/yans/lm-evaluation-harness/lm_eval/caching/__init__.py +0 -0
  7. scripts/yans/lm-evaluation-harness/lm_eval/caching/__pycache__/__init__.cpython-310.pyc +0 -0
  8. scripts/yans/lm-evaluation-harness/lm_eval/caching/__pycache__/cache.cpython-310.pyc +0 -0
  9. scripts/yans/lm-evaluation-harness/lm_eval/caching/cache.py +55 -0
  10. scripts/yans/lm-evaluation-harness/lm_eval/decontamination/__init__.py +0 -0
  11. scripts/yans/lm-evaluation-harness/lm_eval/decontamination/archiver.py +171 -0
  12. scripts/yans/lm-evaluation-harness/lm_eval/decontamination/decontaminate.py +166 -0
  13. scripts/yans/lm-evaluation-harness/lm_eval/decontamination/janitor.py +328 -0
  14. scripts/yans/lm-evaluation-harness/lm_eval/loggers/__init__.py +2 -0
  15. scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/__init__.cpython-310.pyc +0 -0
  16. scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/evaluation_tracker.cpython-310.pyc +0 -0
  17. scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/utils.cpython-310.pyc +0 -0
  18. scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/wandb_logger.cpython-310.pyc +0 -0
  19. scripts/yans/lm-evaluation-harness/lm_eval/loggers/evaluation_tracker.py +521 -0
  20. scripts/yans/lm-evaluation-harness/lm_eval/loggers/utils.py +143 -0
  21. scripts/yans/lm-evaluation-harness/lm_eval/loggers/wandb_logger.py +352 -0
  22. scripts/yans/lm-evaluation-harness/lm_eval/models/dummy.py +41 -0
  23. scripts/yans/lm-evaluation-harness/lm_eval/models/gguf.py +130 -0
  24. scripts/yans/lm-evaluation-harness/lm_eval/models/mamba_lm.py +126 -0
  25. scripts/yans/lm-evaluation-harness/lm_eval/models/neuron_optimum.py +737 -0
  26. scripts/yans/lm-evaluation-harness/lm_eval/models/vllm_causallms.py +540 -0
  27. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/bleu.py +241 -0
  28. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/go.yaml +21 -0
  29. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/java.yaml +21 -0
  30. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/javascript.yaml +21 -0
  31. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/php.yaml +21 -0
  32. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/python.yaml +21 -0
  33. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/ruby.yaml +21 -0
  34. scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/utils.py +12 -0
  35. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/README.md +54 -0
  36. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/commonsense.yaml +15 -0
  37. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/deontology.yaml +9 -0
  38. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/justice.yaml +9 -0
  39. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/utilitarianism.yaml +12 -0
  40. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/utilitarianism_original_yaml +16 -0
  41. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/utils.py +25 -0
  42. scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/virtue.yaml +10 -0
  43. scripts/yans/lm-evaluation-harness/lm_eval/tasks/mc_taco/README.md +53 -0
  44. scripts/yans/lm-evaluation-harness/lm_eval/tasks/mc_taco/default.yaml +15 -0
  45. scripts/yans/lm-evaluation-harness/lm_eval/tasks/pubmedqa/README.md +56 -0
  46. scripts/yans/lm-evaluation-harness/lm_eval/tasks/pubmedqa/preprocess_pubmedqa.py +6 -0
  47. scripts/yans/lm-evaluation-harness/lm_eval/tasks/pubmedqa/pubmedqa.yaml +16 -0
  48. scripts/yans/lm-evaluation-harness/lm_eval/tasks/qa4mre/README.md +55 -0
  49. scripts/yans/lm-evaluation-harness/lm_eval/tasks/qa4mre/preprocess_qa4mre.py +6 -0
  50. scripts/yans/lm-evaluation-harness/lm_eval/tasks/qa4mre/qa4mre_2011.yaml +22 -0
scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (224 Bytes).
 
scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/__main__.cpython-310.pyc ADDED
Binary file (12.2 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/evaluator.cpython-310.pyc ADDED
Binary file (15.5 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/evaluator_utils.cpython-310.pyc ADDED
Binary file (15 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/__pycache__/utils.cpython-310.pyc ADDED
Binary file (15.5 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/caching/__init__.py ADDED
Empty file.
scripts/yans/lm-evaluation-harness/lm_eval/caching/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (164 Bytes).
 
scripts/yans/lm-evaluation-harness/lm_eval/caching/__pycache__/cache.cpython-310.pyc ADDED
Binary file (1.6 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/caching/cache.py ADDED
@@ -0,0 +1,55 @@
import hashlib
import os

import dill

from lm_eval.utils import eval_logger


MODULE_DIR = os.path.dirname(os.path.realpath(__file__))

OVERRIDE_PATH = os.getenv("LM_HARNESS_CACHE_PATH")


PATH = OVERRIDE_PATH if OVERRIDE_PATH else f"{MODULE_DIR}/.cache"

# This should be sufficient for uniqueness
HASH_INPUT = "EleutherAI-lm-evaluation-harness"

HASH_PREFIX = hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest()

FILE_SUFFIX = f".{HASH_PREFIX}.pickle"


def load_from_cache(file_name):
    try:
        path = f"{PATH}/{file_name}{FILE_SUFFIX}"

        with open(path, "rb") as file:
            cached_task_dict = dill.loads(file.read())
            return cached_task_dict

    except Exception:
        eval_logger.debug(f"{file_name} is not cached, generating...")
        pass


def save_to_cache(file_name, obj):
    if not os.path.exists(PATH):
        os.mkdir(PATH)

    file_path = f"{PATH}/{file_name}{FILE_SUFFIX}"

    eval_logger.debug(f"Saving {file_path} to cache...")
    with open(file_path, "wb") as file:
        file.write(dill.dumps(obj))


# NOTE the "key" param is to allow for flexibility
def delete_cache(key: str = ""):
    files = os.listdir(PATH)

    for file in files:
        if file.startswith(key) and file.endswith(FILE_SUFFIX):
            file_path = f"{PATH}/{file}"
            os.unlink(file_path)
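A minimal usage sketch for the cache helpers above (not part of this commit; the task name and cached object are illustrative, and dill must be installed):

# Hypothetical round trip through lm_eval.caching.cache.
from lm_eval.caching.cache import delete_cache, load_from_cache, save_to_cache

task_dict = {"hellaswag": {"num_fewshot": 0}}  # any dill-picklable object
save_to_cache(file_name="hellaswag", obj=task_dict)

restored = load_from_cache(file_name="hellaswag")  # returns None on a cache miss
assert restored == task_dict

delete_cache(key="hellaswag")  # removes matching *.<sha256>.pickle files under PATH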
scripts/yans/lm-evaluation-harness/lm_eval/decontamination/__init__.py ADDED
Empty file.
scripts/yans/lm-evaluation-harness/lm_eval/decontamination/archiver.py ADDED
@@ -0,0 +1,171 @@
import datetime
import io
import json
import mmap
import os
from pathlib import Path
from typing import Any

import jsonlines
import tqdm
import zstandard


def json_serial(obj: Any) -> str:
    """JSON serializer for objects not serializable by default json code"""

    if isinstance(obj, (datetime.datetime,)):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))


# Modified version of lm_dataformat Archive for single file.
class Archive:
    def __init__(self, file_path: str, compression_level: int = 3) -> None:
        self.file_path = file_path
        dir_name = os.path.dirname(file_path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
        self.fh = open(self.file_path, "wb")
        self.cctx = zstandard.ZstdCompressor(level=compression_level)
        self.compressor = self.cctx.stream_writer(self.fh)

    def add_data(self, data, meta=None) -> None:
        if meta is None:
            meta = {}
        self.compressor.write(
            json.dumps({"text": data, "meta": meta}, default=json_serial).encode(
                "UTF-8"
            )
            + b"\n"
        )

    def commit(self) -> None:
        self.compressor.flush(zstandard.FLUSH_FRAME)
        self.fh.flush()
        self.fh.close()


# Modified version of lm_dataformat Reader with self.fh set, allowing peeking for tqdm.
class Reader:
    def __init__(self) -> None:
        pass

    def read(
        self,
        file,
        get_meta: bool = False,
        autojoin_paragraphs: bool = True,
        para_joiner: str = "\n\n",
    ):
        with open(file, "rb") as fh:
            self.fh = fh
            cctx = zstandard.ZstdDecompressor()
            reader = io.BufferedReader(cctx.stream_reader(fh))
            rdr = jsonlines.Reader(reader)
            for ob in rdr:
                # naive jsonl where each object is just the string itself, with no meta. For legacy compatibility.
                if isinstance(ob, str):
                    assert not get_meta
                    yield ob
                    continue

                text = ob["text"]

                if autojoin_paragraphs and isinstance(text, list):
                    text = para_joiner.join(text)

                if get_meta:
                    yield text, (ob["meta"] if "meta" in ob else {})
                else:
                    yield text


class TextArchive:
    def __init__(self, file_path, mode: str = "rb+") -> None:
        self.file_path = file_path
        dir_name = os.path.dirname(file_path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)

        if not os.path.exists(file_path):
            Path(file_path).touch()

        self.fh = open(self.file_path, mode)

    def add_data(self, data) -> None:
        self.fh.write(data.encode("UTF-8") + b"\n")

    def commit(self) -> None:
        self.fh.flush()
        self.fh.close()


class TextReader:
    def __init__(self, file_path) -> None:
        self.file_path = file_path

    # Optimized mmap read with infrequent tqdm updates to maintain speed
    # Tested up to 250MB/s.
    def read_tqdm(self, update_frequency: int = 10000):
        current_file_position = 0
        line_counter = 0
        with open(self.file_path, "r", encoding="utf-8") as fh, tqdm.tqdm(
            total=os.path.getsize(self.file_path),
            dynamic_ncols=True,
            unit="byte",
            unit_scale=1,
        ) as progress:
            with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
                for line in iter(mmap_obj.readline, b""):
                    line = line.decode("utf-8")
                    line_counter += 1
                    if line_counter == update_frequency:
                        new_file_pos = mmap_obj.tell()
                        bytes_read = new_file_pos - current_file_position
                        current_file_position = new_file_pos
                        progress.update(bytes_read)
                        line_counter = 0
                    yield line[:-1]

    def read_and_tell(self):
        current_file_position = 0
        with open(self.file_path, "r", encoding="utf8") as fh:
            with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
                for line in iter(mmap_obj.readline, b""):
                    line = line.decode("utf-8")
                    new_file_pos = mmap_obj.tell()
                    raw_bytes_read = new_file_pos - current_file_position
                    current_file_position = new_file_pos
                    yield line[:-1], raw_bytes_read

    def read(self):
        with open(self.file_path, "r", encoding="utf8") as fh:
            with mmap.mmap(fh.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
                for line in iter(mmap_obj.readline, b""):
                    line = line.decode("utf-8")
                    yield line[:-1]

    def read_slow(self):
        with open(self.file_path, "r", encoding="utf8") as fh:
            while True:
                line = fh.readline()
                if line == -1 or line == "":
                    break
                else:
                    yield line[:-1]


# Optimized for speed. Decompresses the archive in shell before
# using the mmap'd TextReader.
class ZStdTextReader:
    def __init__(self, file) -> None:
        self.file = file

    def read_tqdm(self):
        decompressed_file = self.file[:-4]
        print("Decompressing file, please wait...")
        os.system(f"zstd -d {self.file}")  # linux decompress is faster
        reader = TextReader(decompressed_file)
        yield from reader.read_tqdm()
        os.remove(decompressed_file)
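A short round-trip sketch for the Archive/Reader classes above (not part of this commit; the file path is illustrative, and the zstandard and jsonlines packages are required):

# Write two documents to a zstd-compressed JSONL archive, then stream them back.
from lm_eval.decontamination.archiver import Archive, Reader

archive = Archive("data/example.jsonl.zst")
archive.add_data("first document", meta={"source": "demo"})
archive.add_data("second document")
archive.commit()  # flushes the zstd frame and closes the file handle

reader = Reader()
for text, meta in reader.read("data/example.jsonl.zst", get_meta=True):
    print(text, meta)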
scripts/yans/lm-evaluation-harness/lm_eval/decontamination/decontaminate.py ADDED
@@ -0,0 +1,166 @@
import collections
import glob
import json
import os
import pickle
import random
import time

from .archiver import ZStdTextReader
from .janitor import Janitor, word_ngrams


# Was used for testing the evaluator decoupled from the full logic below
def get_train_overlap_stub(docs: dict, ngrams_path: str, ngrams_n_size: str):
    simulated_overlap = 0.1
    contaminated = int(len(docs) * simulated_overlap)
    return random.sample(range(len(docs)), contaminated)


# Returns a dictionary containing all overlapping documents in each
# task. In the standard use case, an overlap occurs when any of the 13-grams
# found in the task document exist in the training set documents.
#
# To generate 13-grams for the pile see scripts/clean_training_data. The final output of these
# scripts are an info.json file containing the n_gram_size (13) and a bunch of "ngrams_{x}.bkt.txt.sorted.zst"
# files. These should exist in the "ngrams_path" provided to this function.


# Algorithm:
# 1. Build lookups for each dataset {ngram: list(document_ids)}
# 2. Merge into an overall lookup {ngram: [(task_name, task_set, doc_ids),]}
# 3. Full scan the 13-grams from the training set against the merged lookup,
#    saving matches in the "duplicates" dictionary {(task_name, task_set): set(doc_ids)}
# 4. Strip the task_set from the dictionary keys and return
#
# We cache the task+set lookups as well as the overlaps.
def get_train_overlap(docs_by_task_set: dict, ngrams_path: str, limit: int) -> dict:
    # return get_train_overlap_stub(docs, ngrams_path, ngrams_n_size)

    info_dict_path = os.path.join(ngrams_path, "info.json")
    info_dict = json.load(open(info_dict_path, "r", encoding="utf-8"))
    ngrams_n_size = info_dict["ngram_size"]

    janitor = Janitor()

    # Build lookup for each dataset first in case we use different task combinations later
    print("Building Lookups...")
    start = time.perf_counter()

    def get_overlaps_dump_path(task_name, task_set, ngrams_n_size, limit) -> str:
        return f"data/{task_name}/{task_set}_{ngrams_n_size}grams_limit{limit}.overlaps"

    lookups = {}
    duplicates = {}  # (task_name, task_set): set(doc_ids)}
    sets_to_decontaminate = len(docs_by_task_set.keys())

    for (task_name, task_set), docs in docs_by_task_set.items():
        if not os.path.exists(f"data/{task_name}"):
            os.mkdir(f"data/{task_name}")

        # Check if we've decontaminated this combination before
        overlaps_dump_path = get_overlaps_dump_path(
            task_name, task_set, ngrams_n_size, limit
        )
        if os.path.exists(overlaps_dump_path):
            duplicates[(task_name, task_set)] = pickle.load(
                open(overlaps_dump_path, "rb")
            )
            sets_to_decontaminate -= 1
            continue
        else:
            duplicates[(task_name, task_set)] = set()

        # Build/load the task lookup {ngram: set(documents)}.
        task_set_lookup_path = (
            f"data/{task_name}/{task_set}_{ngrams_n_size}grams_limit{limit}.lookup"
        )
        if os.path.exists(task_set_lookup_path):
            print(f"{task_set_lookup_path} available, loading...")
            lookups[(task_name, task_set)] = pickle.load(
                open(task_set_lookup_path, "rb")
            )
        else:
            print(f"{task_set_lookup_path} not available, building...")
            lookup = collections.defaultdict(set)

            for doc_id, document in enumerate(docs):
                ngrams = word_ngrams(janitor.normalize_string(document), ngrams_n_size)
                for ngram in ngrams:
                    lookup[ngram].add(doc_id)

            pickle.dump(lookup, open(task_set_lookup_path, "wb"))
            lookups[(task_name, task_set)] = lookup

    elapsed = time.perf_counter() - start
    print(f"Building lookups took {elapsed:0.5f} seconds.")

    matched_ngrams = []

    if sets_to_decontaminate > 0:
        print("Merging lookups...")
        start = time.perf_counter()
        merged_lookup = collections.defaultdict(list)
        for (task_name, task_set), lookup in lookups.items():
            for ngram, doc_ids in lookup.items():
                merged_lookup[ngram].append((task_name, task_set, doc_ids))

        elapsed = time.perf_counter() - start
        print(f"Merging lookups took {elapsed:0.5f} seconds.")

        print(f"{ngrams_n_size} grams files found in {ngrams_path}:")
        files = glob.glob(os.path.join(ngrams_path, "*.sorted.zst"))
        print(files)

        for file in files:
            start = time.perf_counter()
            print(f"Scanning {file}")
            reader = ZStdTextReader(file)
            total_ngrams = 0
            unique_ngrams = 0
            matching_unique = 0
            non_matching_unique = 0

            current_ngram = ""
            for line in reader.read_tqdm():  # Scan training set ngrams file
                total_ngrams += 1
                [ngram, document_id] = line.rsplit(" ", 1)
                if (
                    ngram != current_ngram
                ):  # Only need to match the ngram once in training set
                    unique_ngrams += 1
                    current_ngram = ngram
                    if ngram in merged_lookup:
                        matched_ngrams.append(ngram)  # For logging
                        matching_unique += 1
                        for task_name, task_set, doc_ids in merged_lookup[ngram]:
                            task_doc_set = duplicates[(task_name, task_set)]
                            for doc_id in doc_ids:  # Record contamination across all relevant task/set combos
                                task_doc_set.add(doc_id)
                        del merged_lookup[ngram]  # No point matching again
                    else:
                        non_matching_unique += 1

            print(f"Total Ngrams: {total_ngrams}")
            print(f"Unique Ngrams: {unique_ngrams}")
            print(f"Unique Matching: {matching_unique}")
            print(f"Unique Non Matching: {non_matching_unique}")
            print("Matched ngrams:")
            for ngram in matched_ngrams:
                print(ngram)

            elapsed = time.perf_counter() - start
            print(f"Read took {elapsed:0.5f} seconds.")
            print(f"Speed: {(os.path.getsize(file)/1000000.0)/elapsed}MB/second")

        print(duplicates)

        # Dump overlaps separately
        for (task_name, task_set), doc_ids in duplicates.items():
            overlaps_dump_path = get_overlaps_dump_path(
                task_name, task_set, ngrams_n_size, limit
            )
            pickle.dump(doc_ids, open(overlaps_dump_path, "wb"))

    # Strip task set and return
    return {task_name: doc_ids for (task_name, task_set), doc_ids in duplicates.items()}
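A hedged sketch of how get_train_overlap above expects its input to be shaped (task names, documents, and paths are illustrative; info.json and the *.sorted.zst 13-gram shards must already exist under ngrams_path, e.g. produced by scripts/clean_training_data):

# Keys are (task_name, task_set) pairs; values are lists of raw document strings.
docs_by_task_set = {
    ("lambada", "test"): ["first eval document ...", "second eval document ..."],
    ("hellaswag", "validation"): ["another eval document ..."],
}

overlaps = get_train_overlap(
    docs_by_task_set, ngrams_path="pile_13grams/", limit=1000
)
# e.g. {"lambada": {3, 17}, "hellaswag": set()} -- doc ids flagged as contaminated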
scripts/yans/lm-evaluation-harness/lm_eval/decontamination/janitor.py ADDED
@@ -0,0 +1,328 @@
import pickle
import re
import string
import traceback
from typing import Iterator, List, Sequence, Tuple, TypeVar


# This is a cpp module. Compile janitor_util.cpp with:
# c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix) -undefined dynamic_lookup
try:
    import janitor_util

    JANITOR_CPP = True
except Exception:
    print("WARNING: C++ module could not be loaded. Janitor running in python mode")
    traceback.print_exc()
    JANITOR_CPP = False

T = TypeVar("T")


# Implementation from nltk source
# https://www.nltk.org/_modules/nltk/util.html
def form_ngrams(sequence: Iterator[T], n: int) -> Iterator[Tuple[T, ...]]:
    history = []
    while n > 1:
        # PEP 479, prevent RuntimeError from being raised when StopIteration bubbles out of generator
        try:
            next_item = next(sequence)
        except StopIteration:
            # no more data, terminate the generator
            return
        history.append(next_item)
        n -= 1
    for item in sequence:
        history.append(item)
        yield tuple(history)
        del history[0]


def word_ngrams(s: str, n: int) -> Iterator[str]:
    """Splits a string into ngram words"""
    tokens = s.split()  # not a generator :(
    ngram_seqs = form_ngrams(iter(tokens), n)
    return (" ".join(ngram) for ngram in ngram_seqs)


# Does character sequences only - combined faster function to play around with later
# def word_ngrams_indices_combined(sequence, n):
#     current_word = ""
#     history = []
#     gap = False;
#     start = 0
#     end = 0
#     for character in sequence:
#         if character == " ":
#             if not gap:
#                 gap = True
#                 history.append(current_word)
#                 end += len(current_word) - 1
#                 current_word = ""
#                 if len(history) == n:
#                     yield (tuple(history), start, end)
#                     del history[0]
#                     start = end + 1
#                     end = start
#         else:
#             gap = False
#             current_word += character


# https://stackoverflow.com/questions/13734451/string-split-with-indices-in-python
def split_indices(s: str) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Splits a string on whitespaces and records the indices of each in the original string.
    @:return generator((word, (start_idx, end_idx)), ...)
    """
    return ((m.group(0), (m.start(), m.end() - 1)) for m in re.finditer(r"\S+", s))


def word_ngrams_indices(s: str, n: int) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Splits a string into pairs of (ngram words, their start/end indices)"""
    tokens_with_indices = split_indices(s)

    # Generator of ngrams of (word, idx_pairs)
    # (
    #     [(word, (start,end)), (word, (start, end))...],
    #     [(word, (start, end)), ...],
    #     ...
    # )
    ngram_seqs_with_indices = form_ngrams(tokens_with_indices, n)

    # Generator of pairs of word and index ngrams
    # (
    #     ([word, word, ...], [(start,end), (start,end), ...]),
    #     ...
    # )
    ngram_indices_pairs = (
        zip(*ngram_with_indices) for ngram_with_indices in ngram_seqs_with_indices
    )

    # Generator of ( (word_ngram, (start, end)), (word_ngram, start, end)), ...)
    return (
        (" ".join(ngram_seq), (indices[0][0], indices[-1][1]))
        for ngram_seq, indices in ngram_indices_pairs
    )


class Janitor:
    # FIXME delete_chars: Should anything else go here? Special chars?
    def __init__(
        self,
        ngram_n: int = 13,
        window_to_remove: int = 200,
        too_dirty_cutoff: int = 10,
        minimum_slice_length: int = 200,
        delete_chars: str = string.punctuation,
    ) -> None:
        self.ngram_n = ngram_n
        self.window_to_remove = window_to_remove
        self.too_dirty_cutoff = too_dirty_cutoff
        self.minimum_slice_length = minimum_slice_length
        self.delete_chars = delete_chars

        self.dirt_ngrams = set()

        # If in python, we'll translate uppercase to lowercase and delete naughty characters.
        # This is fast by python standards
        # https://stackoverflow.com/questions/638893/what-is-the-most-efficient-way-in-python-to-convert-a-string-to-all-lowercase-st
        self.translation_table = str.maketrans(
            string.ascii_lowercase + string.ascii_uppercase,  # These characters
            string.ascii_lowercase * 2,  # Become these characters
            self.delete_chars,  # These are deleted
        )

    ##############
    # I/O for saving contamination ngrams
    ##############

    def save_contamination_ngrams(self, filename: str) -> None:
        with open(filename, "wb") as fp:
            pickle.dump(filename, fp)

    def load_contamination_ngrams(self, filename: str) -> None:
        with open(filename, "rb") as fp:
            self.dirt_ngrams = pickle.load(fp)

    ##############
    # Call these :)
    ##############

    def register_contaminant(self, dirt_string: str) -> None:
        """Register a string as contamination to be removed, e.g. a test set
        This breaks the dirt_string into ngrams to store for future cleaning"""
        if JANITOR_CPP:
            return self.register_contaminant_cpp(dirt_string)
        else:
            print("WARNING: Janitor running in python mode")
            return self.register_contaminant_python(dirt_string)

    def clean(self, dirty_string: str) -> List[str]:
        """Clean a string (e.g. a training set) by removing all ngrams previously
        registered as contaminants. Returns a list of clean chunks, or empty if
        the string was too dirty"""
        if JANITOR_CPP:
            return self.clean_cpp(dirty_string)
        else:
            print("WARNING: Janitor running in python mode")
            return self.clean_python(dirty_string)

    def _split_chunks(
        self, dirty_string: str, dirty_parts: Sequence[Tuple]
    ) -> List[str]:
        clean_chunks = []
        splice_idx = 0
        end = -1
        for i, (ngram, start, end) in enumerate(dirty_parts):
            if i >= self.too_dirty_cutoff:
                return []
            start = max(0, start - self.window_to_remove)
            end = min(len(dirty_string), end + self.window_to_remove)

            if start - splice_idx > self.minimum_slice_length:
                clean_chunks.append(dirty_string[splice_idx:start])
            splice_idx = end

        if end < len(dirty_string) - self.minimum_slice_length:
            clean_chunks.append(dirty_string[end + 1 :])

        return clean_chunks

    ##############
    # Fast C++
    ##############

    def register_contaminant_cpp(self, dirt_string) -> None:
        self.dirt_ngrams.update(
            janitor_util.clean_ngram(dirt_string, self.delete_chars, self.ngram_n)
        )

    def clean_cpp(self, dirty_string: str) -> List[str]:
        contamination_indices = janitor_util.clean_ngram_with_indices(
            dirty_string, self.delete_chars, self.ngram_n
        )
        return self._split_chunks(dirty_string, contamination_indices)

    ##############
    # Slow python
    ##############

    def normalize_string(self, s: str) -> str:
        return s.translate(self.translation_table)

    def register_contaminant_python(self, dirt_string: str) -> None:
        self.dirt_ngrams.update(
            word_ngrams(self.normalize_string(dirt_string), self.ngram_n)
        )

    def clean_python(self, dirty_string: str) -> List[str]:
        contamination_indices = (
            (None, *idx_pair)
            for dirty_ngram, idx_pair in word_ngrams_indices(dirty_string, self.ngram_n)
            if self.normalize_string(dirty_ngram) in self.dirt_ngrams
        )
        return self._split_chunks(dirty_string, contamination_indices)


##################################################################
# Tests
#################################################################

# def print_cpp():
#     source = """ ,, I'm a very !dirty,, ,, dirty boy. Clean me daddy. \n\nhe he he hehe heh. lastword """ * 2

#     for i in range(1, 10, 2):
#         pprint(janitor_util.clean_ngram(source, string.punctuation, i))
#         for ngram, start, end in \
#                 janitor_util.clean_ngram_with_indices(source, string.punctuation, i):
#             print(ngram, "\t", start, end, source[start:end].replace("\n", "\\n"))


# def test_cpp():
#     source = """ ,, I'm a very !dirty,, ,, dirty boy. Clean me daddy. \n\nhe he he hehe heh. lastword """ * 2
#     contaminant = "dirty boy. Clean he he"

#     jan_python = Janitor()
#     jan_cpp = Janitor()

#     jan_python.register_contaminant_python(contaminant)
#     jan_cpp.register_contaminant(contaminant)

#     assert jan_python.dirt_ngrams == jan_cpp.dirt_ngrams, (jan_python.dirt_ngrams, jan_cpp.dirt_ngrams)

#     assert jan_python.clean_python(source) == jan_cpp.clean(source), \
#         (jan_python.clean_python(source), jan_cpp.clean(source))

#     print("Passed test, python==cpp")


# def benchmark():
#     # Download and put in data folder: enwik8 (100 MB) from https://cs.fit.edu/~mmahoney/compression/textdata.html
#     setup = \
#         """
#     with open("data/enwik8", "r") as f:
#         data = f.read()
#     jan = Janitor(too_dirty_cutoff=1000)
#     jan.register_contaminant('''
#     theories is that there is a connection between &quot;geekdom&quot; and autism.
#     This is hinted, for instance, by a ''Wired Magazine'' article in 2001 entitled &quot;
#     The [[Geek]] Syndrome&quot;, which is a point argued by many in the autism rights
#     movement{{ref|Wired}}. This article, many professionals assert, is just one example of
#     the media's application of mental disease labels to what is actually variant normal behavior
#     &amp;mdash;they argue that shyness, lack of athletic ability or social skills, and intellectual
#     interests, even when they seem unusual to others, are not in themselves signs of autism or
#     Asperger's syndrome. Others assert that it is actually the medical profession which is applying
#     mental disease labels to children who in the past would have simply been accepted as a little
#     different or even labeled 'gifted'. See [[clinomorphism]] for further discussion of this issue.
#     Due to the recent publicity surrounding autism and autis
#     ultan Al Nahyan]] granted [[Petroleum]] concessions, and oil was first found in 1958. At first,
#     oil money had a marginal impact. A few lowrise concete buildings were erected, and the first
#     paved road was completed in 1961, but Sheikh Shakbut, uncertain whether the new oil royalties
#     would last, took a cautious approach, preferring to save the revenue rather than investing it in
#     development. His brother, [[Zayed bin Sultan Al Nahayan]], saw that oil wealth had the potential
#     to transform Abu Dhabi. The ruling Al Nahayan family decided that Sheikh Zayed should replace his
#     brother as Ruler and carry out his vision of developing the country. On [[August 6]], [[1966]],
#     with the assistance of the British, Sheikh Zayed became the new ruler. See generally, Al-Fahim, M,
#     ''From Rags to Riches: A Story of Abu Dhabi'', Chapter Six (London Centre of Arab Studies, 1995),
#     ISBN 1 900404 00 1. With the announcement by Britain in 1968 that it would withdraw from the
#     Gulf area by 1971, Sheikh Zayed became the main driving force behind the formation of the
#     [[United Arab Emirates]]. After the Emirates gained independence in 1971,
#     ''')
#     """

#     n = 1
#     print(f"Timing {n} run on 100 MB")
#     print("Register contaminant")
#     # print("\tPython", timeit.timeit("jan.register_contaminant_python(data)", setup=setup, globals=globals(), number=n))
#     print("\tCpp", timeit.timeit("jan.register_contaminant(data)", setup=setup, globals=globals(), number=n))

#     print("Clean")
#     # print("\tPython", timeit.timeit("jan.clean_python(data)", setup=setup, globals=globals(), number=n))
#     print("\tCpp", timeit.timeit("jan.clean(data)", setup=setup, globals=globals(), number=n))


# def test_janitor_general():
#     source = """ ,, I'm a very !dirty,, ,, dirty boy. Clean me daddy. \n\nhe he he hehe heh. lastword """ * 2
#     contaminant = "dirty boy. Clean he he"

#     jan = Janitor(ngram_n=3)
#     jan.register_contaminant(contaminant)
#     cleaned = " ".join(jan.clean(source))
#     for contam in jan.dirt_ngrams:
#         assert contam not in cleaned, contam

#     filename = "data/saved_contam"
#     jan.save_contamination_ngrams(filename)

#     jan = Janitor(ngram_n=3)
#     jan.load_contamination_ngrams(filename)
#     cleaned = " ".join(jan.clean(source))
#     for contam in jan.dirt_ngrams:
#         assert contam not in cleaned, contam


# if __name__ == "__main__":
#     test()
#     # print_cpp()
#     # test_cpp()
#     # benchmark()
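A small, hedged example of the Janitor API above in pure-python mode (values are illustrative; ngram_n=3 keeps the toy strings short, while the decontamination pipeline uses the default of 13):

from lm_eval.decontamination.janitor import Janitor

jan = Janitor(ngram_n=3, window_to_remove=5, minimum_slice_length=5)
jan.register_contaminant("the quick brown fox jumps")  # stored as normalized 3-grams

dirty = "training text before the quick brown fox jumps over something and text after"
clean_chunks = jan.clean(dirty)  # text slices with the contaminated window (plus margin) removed
print(clean_chunks)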
scripts/yans/lm-evaluation-harness/lm_eval/loggers/__init__.py ADDED
@@ -0,0 +1,2 @@
from .evaluation_tracker import EvaluationTracker
from .wandb_logger import WandbLogger
scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (272 Bytes).
 
scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/evaluation_tracker.cpython-310.pyc ADDED
Binary file (15.6 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/utils.cpython-310.pyc ADDED
Binary file (4.25 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/loggers/__pycache__/wandb_logger.cpython-310.pyc ADDED
Binary file (11.6 kB).
 
scripts/yans/lm-evaluation-harness/lm_eval/loggers/evaluation_tracker.py ADDED
@@ -0,0 +1,521 @@
import json
import os
import re
import time
from collections import defaultdict
from dataclasses import asdict, dataclass
from datetime import datetime
from pathlib import Path

from datasets import load_dataset
from datasets.utils.metadata import MetadataConfigs
from huggingface_hub import (
    DatasetCard,
    DatasetCardData,
    HfApi,
    hf_hub_url,
)
from huggingface_hub.utils import build_hf_headers, get_session, hf_raise_for_status

from lm_eval.utils import (
    eval_logger,
    get_file_datetime,
    get_file_task_name,
    get_results_filenames,
    get_sample_results_filenames,
    handle_non_serializable,
    hash_string,
    sanitize_list,
    sanitize_model_name,
    sanitize_task_name,
)


@dataclass(init=False)
class GeneralConfigTracker:
    """
    Tracker for the evaluation parameters.

    Attributes:
        model_source (str): Source of the model (e.g. Hugging Face, GGUF, etc.)
        model_name (str): Name of the model.
        model_name_sanitized (str): Sanitized model name for directory creation.
        start_time (float): Start time of the experiment. Logged at class init.
        end_time (float): Start time of the experiment. Logged when calling [`GeneralConfigTracker.log_end_time`]
        total_evaluation_time_seconds (str): Inferred total evaluation time in seconds (from the start and end times).
    """

    model_source: str = None
    model_name: str = None
    model_name_sanitized: str = None
    system_instruction: str = None
    system_instruction_sha: str = None
    fewshot_as_multiturn: bool = None
    chat_template: str = None
    chat_template_sha: str = None
    start_time: float = None
    end_time: float = None
    total_evaluation_time_seconds: str = None

    def __init__(self) -> None:
        """Starts the evaluation timer."""
        self.start_time = time.perf_counter()

    @staticmethod
    def _get_model_name(model_args: str) -> str:
        """Extracts the model name from the model arguments."""

        def extract_model_name(model_args: str, key: str) -> str:
            """Extracts the model name from the model arguments using a key."""
            args_after_key = model_args.split(key)[1]
            return args_after_key.split(",")[0]

        # order does matter, e.g. peft and delta are provided together with pretrained
        prefixes = ["peft=", "delta=", "pretrained=", "model=", "path=", "engine="]
        for prefix in prefixes:
            if prefix in model_args:
                return extract_model_name(model_args, prefix)
        return ""

    def log_experiment_args(
        self,
        model_source: str,
        model_args: str,
        system_instruction: str,
        chat_template: str,
        fewshot_as_multiturn: bool,
    ) -> None:
        """Logs model parameters and job ID."""
        self.model_source = model_source
        self.model_name = GeneralConfigTracker._get_model_name(model_args)
        self.model_name_sanitized = sanitize_model_name(self.model_name)
        self.system_instruction = system_instruction
        self.system_instruction_sha = (
            hash_string(system_instruction) if system_instruction else None
        )
        self.chat_template = chat_template
        self.chat_template_sha = hash_string(chat_template) if chat_template else None
        self.fewshot_as_multiturn = fewshot_as_multiturn

    def log_end_time(self) -> None:
        """Logs the end time of the evaluation and calculates the total evaluation time."""
        self.end_time = time.perf_counter()
        self.total_evaluation_time_seconds = str(self.end_time - self.start_time)


class EvaluationTracker:
    """
    Keeps track and saves relevant information of the evaluation process.
    Compiles the data from trackers and writes it to files, which can be published to the Hugging Face hub if requested.
    """

    def __init__(
        self,
        output_path: str = None,
        hub_results_org: str = "",
        hub_repo_name: str = "",
        details_repo_name: str = "",
        results_repo_name: str = "",
        push_results_to_hub: bool = False,
        push_samples_to_hub: bool = False,
        public_repo: bool = False,
        token: str = "",
        leaderboard_url: str = "",
        point_of_contact: str = "",
        gated: bool = False,
    ) -> None:
        """
        Creates all the necessary loggers for evaluation tracking.

        Args:
            output_path (str): Path to save the results. If not provided, the results won't be saved.
            hub_results_org (str): The Hugging Face organization to push the results to. If not provided, the results will be pushed to the owner of the Hugging Face token.
            hub_repo_name (str): The name of the Hugging Face repository to push the results to. If not provided, the results will be pushed to `lm-eval-results`.
            details_repo_name (str): The name of the Hugging Face repository to push the details to. If not provided, the results will be pushed to `lm-eval-results`.
            result_repo_name (str): The name of the Hugging Face repository to push the results to. If not provided, the results will not be pushed and will be found in the details_hub_repo.
            push_results_to_hub (bool): Whether to push the results to the Hugging Face hub.
            push_samples_to_hub (bool): Whether to push the samples to the Hugging Face hub.
            public_repo (bool): Whether to push the results to a public or private repository.
            token (str): Token to use when pushing to the Hugging Face hub. This token should have write access to `hub_results_org`.
            leaderboard_url (str): URL to the leaderboard on the Hugging Face hub on the dataset card.
            point_of_contact (str): Contact information on the Hugging Face hub dataset card.
            gated (bool): Whether to gate the repository.
        """
        self.general_config_tracker = GeneralConfigTracker()

        self.output_path = output_path
        self.push_results_to_hub = push_results_to_hub
        self.push_samples_to_hub = push_samples_to_hub
        self.public_repo = public_repo
        self.leaderboard_url = leaderboard_url
        self.point_of_contact = point_of_contact
        self.api = HfApi(token=token) if token else None
        self.gated_repo = gated

        if not self.api and (push_results_to_hub or push_samples_to_hub):
            raise ValueError(
                "Hugging Face token is not defined, but 'push_results_to_hub' or 'push_samples_to_hub' is set to True. "
                "Please provide a valid Hugging Face token by setting the HF_TOKEN environment variable."
            )

        if (
            self.api
            and hub_results_org == ""
            and (push_results_to_hub or push_samples_to_hub)
        ):
            hub_results_org = self.api.whoami()["name"]
            eval_logger.warning(
                f"hub_results_org was not specified. Results will be pushed to '{hub_results_org}'."
            )

        if hub_repo_name == "":
            details_repo_name = (
                details_repo_name if details_repo_name != "" else "lm-eval-results"
            )
            results_repo_name = (
                results_repo_name if results_repo_name != "" else details_repo_name
            )
        else:
            details_repo_name = hub_repo_name
            results_repo_name = hub_repo_name
            eval_logger.warning(
                "hub_repo_name was specified. Both details and results will be pushed to the same repository. Using hub_repo_name is no longer recommended, details_repo_name and results_repo_name should be used instead."
            )

        self.details_repo = f"{hub_results_org}/{details_repo_name}"
        self.details_repo_private = f"{hub_results_org}/{details_repo_name}-private"
        self.results_repo = f"{hub_results_org}/{results_repo_name}"
        self.results_repo_private = f"{hub_results_org}/{results_repo_name}-private"

    def save_results_aggregated(
        self,
        results: dict,
        samples: dict,
    ) -> None:
        """
        Saves the aggregated results and samples to the output path and pushes them to the Hugging Face hub if requested.

        Args:
            results (dict): The aggregated results to save.
            samples (dict): The samples results to save.
        """
        self.general_config_tracker.log_end_time()

        if self.output_path:
            try:
                eval_logger.info("Saving results aggregated")

                # calculate cumulative hash for each task - only if samples are provided
                task_hashes = {}
                if samples:
                    for task_name, task_samples in samples.items():
                        sample_hashes = [
                            s["doc_hash"] + s["prompt_hash"] + s["target_hash"]
                            for s in task_samples
                        ]
                        task_hashes[task_name] = hash_string("".join(sample_hashes))

                # update initial results dict
                results.update({"task_hashes": task_hashes})
                results.update(asdict(self.general_config_tracker))
                dumped = json.dumps(
                    results,
                    indent=2,
                    default=handle_non_serializable,
                    ensure_ascii=False,
                )

                path = Path(self.output_path if self.output_path else Path.cwd())
                path = path.joinpath(self.general_config_tracker.model_name_sanitized)
                path.mkdir(parents=True, exist_ok=True)

                self.date_id = datetime.now().isoformat().replace(":", "-")
                file_results_aggregated = path.joinpath(f"results_{self.date_id}.json")
                file_results_aggregated.open("w", encoding="utf-8").write(dumped)

                if self.api and self.push_results_to_hub:
                    repo_id = (
                        self.results_repo
                        if self.public_repo
                        else self.results_repo_private
                    )
                    self.api.create_repo(
                        repo_id=repo_id,
                        repo_type="dataset",
                        private=not self.public_repo,
                        exist_ok=True,
                    )
                    self.api.upload_file(
                        repo_id=repo_id,
                        path_or_fileobj=str(
                            path.joinpath(f"results_{self.date_id}.json")
                        ),
                        path_in_repo=os.path.join(
                            self.general_config_tracker.model_name,
                            f"results_{self.date_id}.json",
                        ),
                        repo_type="dataset",
                        commit_message=f"Adding aggregated results for {self.general_config_tracker.model_name}",
                    )
                    eval_logger.info(
                        "Successfully pushed aggregated results to the Hugging Face Hub. "
                        f"You can find them at: {repo_id}"
                    )

            except Exception as e:
                eval_logger.warning("Could not save results aggregated")
                eval_logger.info(repr(e))
        else:
            eval_logger.info(
                "Output path not provided, skipping saving results aggregated"
            )

    def save_results_samples(
        self,
        task_name: str,
        samples: dict,
    ) -> None:
        """
        Saves the samples results to the output path and pushes them to the Hugging Face hub if requested.

        Args:
            task_name (str): The task name to save the samples for.
            samples (dict): The samples results to save.
        """
        if self.output_path:
            try:
                eval_logger.info(f"Saving per-sample results for: {task_name}")

                path = Path(self.output_path if self.output_path else Path.cwd())
                path = path.joinpath(self.general_config_tracker.model_name_sanitized)
                path.mkdir(parents=True, exist_ok=True)

                file_results_samples = path.joinpath(
                    f"samples_{task_name}_{self.date_id}.jsonl"
                )

                for sample in samples:
                    # we first need to sanitize arguments and resps
                    # otherwise we won't be able to load the dataset
                    # using the datasets library
                    arguments = {}
                    for i, arg in enumerate(sample["arguments"]):
                        arguments[f"gen_args_{i}"] = {}
                        for j, tmp in enumerate(arg):
                            arguments[f"gen_args_{i}"][f"arg_{j}"] = tmp

                    sample["resps"] = sanitize_list(sample["resps"])
                    sample["filtered_resps"] = sanitize_list(sample["filtered_resps"])
                    sample["arguments"] = arguments
                    sample["target"] = str(sample["target"])

                    sample_dump = (
                        json.dumps(
                            sample,
                            default=handle_non_serializable,
                            ensure_ascii=False,
                        )
                        + "\n"
                    )

                    with open(file_results_samples, "a", encoding="utf-8") as f:
                        f.write(sample_dump)

                if self.api and self.push_samples_to_hub:
                    repo_id = (
                        self.details_repo
                        if self.public_repo
                        else self.details_repo_private
                    )
                    self.api.create_repo(
                        repo_id=repo_id,
                        repo_type="dataset",
                        private=not self.public_repo,
                        exist_ok=True,
                    )
                    try:
                        if self.gated_repo:
                            headers = build_hf_headers()
                            r = get_session().put(
                                url=f"https://huggingface.co/api/datasets/{repo_id}/settings",
                                headers=headers,
                                json={"gated": "auto"},
                            )
                            hf_raise_for_status(r)
                    except Exception as e:
                        eval_logger.warning("Could not gate the repository")
                        eval_logger.info(repr(e))
                    self.api.upload_folder(
                        repo_id=repo_id,
                        folder_path=str(path),
                        path_in_repo=self.general_config_tracker.model_name_sanitized,
                        repo_type="dataset",
                        commit_message=f"Adding samples results for {task_name} to {self.general_config_tracker.model_name}",
                    )
                    eval_logger.info(
                        f"Successfully pushed sample results for task: {task_name} to the Hugging Face Hub. "
                        f"You can find them at: {repo_id}"
                    )

            except Exception as e:
                eval_logger.warning("Could not save sample results")
                eval_logger.info(repr(e))
        else:
            eval_logger.info("Output path not provided, skipping saving sample results")

    def recreate_metadata_card(self) -> None:
        """
        Creates a metadata card for the evaluation results dataset and pushes it to the Hugging Face hub.
        """

        eval_logger.info("Recreating metadata card")
        repo_id = self.details_repo if self.public_repo else self.details_repo_private

        files_in_repo = self.api.list_repo_files(repo_id=repo_id, repo_type="dataset")
        results_files = get_results_filenames(files_in_repo)
        sample_files = get_sample_results_filenames(files_in_repo)

        # Build a dictionary to store the latest evaluation datetime for:
        # - Each tested model and its aggregated results
        # - Each task and sample results, if existing
        # i.e. {
        #     "org__model_name__gsm8k": "2021-09-01T12:00:00",
        #     "org__model_name__ifeval": "2021-09-01T12:00:00",
        #     "org__model_name__results": "2021-09-01T12:00:00"
        # }
        latest_task_results_datetime = defaultdict(lambda: datetime.min.isoformat())

        for file_path in sample_files:
            file_path = Path(file_path)
            filename = file_path.name
            model_name = file_path.parent
            task_name = get_file_task_name(filename)
            results_datetime = get_file_datetime(filename)
            task_name_sanitized = sanitize_task_name(task_name)
            # Results and sample results for the same model and task will have the same datetime
            samples_key = f"{model_name}__{task_name_sanitized}"
            results_key = f"{model_name}__results"
            latest_datetime = max(
                latest_task_results_datetime[samples_key],
                results_datetime,
            )
            latest_task_results_datetime[samples_key] = latest_datetime
            latest_task_results_datetime[results_key] = max(
                latest_task_results_datetime[results_key],
                latest_datetime,
            )

        # Create metadata card
        card_metadata = MetadataConfigs()

        # Add the latest aggregated results to the metadata card for easy access
        for file_path in results_files:
            file_path = Path(file_path)
            results_filename = file_path.name
            model_name = file_path.parent
            eval_date = get_file_datetime(results_filename)
            eval_date_sanitized = re.sub(r"[^\w\.]", "_", eval_date)
            results_filename = Path("**") / Path(results_filename).name
            config_name = f"{model_name}__results"
            sanitized_last_eval_date_results = re.sub(
                r"[^\w\.]", "_", latest_task_results_datetime[config_name]
            )

            if eval_date_sanitized == sanitized_last_eval_date_results:
                # Ensure that all results files are listed in the metadata card
                current_results = card_metadata.get(config_name, {"data_files": []})
                current_results["data_files"].append(
                    {"split": eval_date_sanitized, "path": [str(results_filename)]}
                )
                card_metadata[config_name] = current_results
                # If the results file is the newest, update the "latest" field in the metadata card
                card_metadata[config_name]["data_files"].append(
                    {"split": "latest", "path": [str(results_filename)]}
                )

        # Add the tasks details configs
        for file_path in sample_files:
            file_path = Path(file_path)
            filename = file_path.name
            model_name = file_path.parent
            task_name = get_file_task_name(filename)
            eval_date = get_file_datetime(filename)
            task_name_sanitized = sanitize_task_name(task_name)
            eval_date_sanitized = re.sub(r"[^\w\.]", "_", eval_date)
            results_filename = Path("**") / Path(filename).name
            config_name = f"{model_name}__{task_name_sanitized}"
            sanitized_last_eval_date_results = re.sub(
                r"[^\w\.]", "_", latest_task_results_datetime[config_name]
            )
            if eval_date_sanitized == sanitized_last_eval_date_results:
                # Ensure that all sample results files are listed in the metadata card
                current_details_for_task = card_metadata.get(
                    config_name, {"data_files": []}
                )
                current_details_for_task["data_files"].append(
                    {"split": eval_date_sanitized, "path": [str(results_filename)]}
                )
                card_metadata[config_name] = current_details_for_task
                # If the samples results file is the newest, update the "latest" field in the metadata card
                card_metadata[config_name]["data_files"].append(
                    {"split": "latest", "path": [str(results_filename)]}
                )

        # Get latest results and extract info to update metadata card examples
        latest_datetime = max(latest_task_results_datetime.values())
        latest_model_name = max(
            latest_task_results_datetime, key=lambda k: latest_task_results_datetime[k]
        )
        last_results_file = [
            f for f in results_files if latest_datetime.replace(":", "-") in f
        ][0]
        last_results_file_path = hf_hub_url(
            repo_id=repo_id, filename=last_results_file, repo_type="dataset"
        )
        latest_results_file = load_dataset(
            "json", data_files=last_results_file_path, split="train"
        )
        results_dict = latest_results_file["results"][0]
        new_dictionary = {"all": results_dict}
        new_dictionary.update(results_dict)
        results_string = json.dumps(new_dictionary, indent=4)

        dataset_summary = (
            "Dataset automatically created during the evaluation run of model "
        )
        if self.general_config_tracker.model_source == "hf":
            dataset_summary += f"[{self.general_config_tracker.model_name}](https://huggingface.co/{self.general_config_tracker.model_name})\n"
        else:
            dataset_summary += f"{self.general_config_tracker.model_name}\n"
        dataset_summary += (
            f"The dataset is composed of {len(card_metadata)-1} configuration(s), each one corresponding to one of the evaluated task.\n\n"
            f"The dataset has been created from {len(results_files)} run(s). Each run can be found as a specific split in each "
            'configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results.\n\n'
            'An additional configuration "results" store all the aggregated results of the run.\n\n'
            "To load the details from a run, you can for instance do the following:\n"
        )
        if self.general_config_tracker.model_source == "hf":
            dataset_summary += (
                "```python\nfrom datasets import load_dataset\n"
                f'data = load_dataset(\n\t"{repo_id}",\n\tname="{latest_model_name}",\n\tsplit="latest"\n)\n```\n\n'
            )
        dataset_summary += (
            "## Latest results\n\n"
            f'These are the [latest results from run {latest_datetime}]({last_results_file_path.replace("/resolve/", "/blob/")}) '
            "(note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks. "
            'You find each in the results and the "latest" split for each eval):\n\n'
            f"```python\n{results_string}\n```"
        )
        card_data = DatasetCardData(
            dataset_summary=dataset_summary,
            repo_url=f"https://huggingface.co/{self.general_config_tracker.model_name}",
            pretty_name=f"Evaluation run of {self.general_config_tracker.model_name}",
            leaderboard_url=self.leaderboard_url,
            point_of_contact=self.point_of_contact,
        )
        card_metadata.to_dataset_card_data(card_data)
        card = DatasetCard.from_template(
            card_data,
            pretty_name=card_data.pretty_name,
        )
        card.push_to_hub(repo_id, repo_type="dataset")
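A hedged sketch of how EvaluationTracker above is typically wired up for local-only logging (arguments are illustrative; inside the harness the CLI constructs this object and passes it to the evaluator):

from lm_eval.loggers import EvaluationTracker

tracker = EvaluationTracker(output_path="results/")  # no token, so nothing is pushed to the Hub
tracker.general_config_tracker.log_experiment_args(
    model_source="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    system_instruction=None,
    chat_template=None,
    fewshot_as_multiturn=False,
)

# After an evaluation run, persist the aggregated scores under
# results/<sanitized model name>/results_<timestamp>.json.
results = {"results": {"lambada_openai": {"acc,none": 0.5}}}
tracker.save_results_aggregated(results=results, samples=None)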
scripts/yans/lm-evaluation-harness/lm_eval/loggers/utils.py ADDED
@@ -0,0 +1,143 @@
1
+ import logging
2
+ import os
3
+ import re
4
+ import subprocess
5
+ from pathlib import Path
6
+ from typing import Any, Dict, Optional, Tuple, Union
7
+
8
+ import numpy as np
9
+ from torch.utils.collect_env import get_pretty_env_info
10
+ from transformers import __version__ as trans_version
11
+
12
+
13
+ logger = logging.getLogger(__name__)
14
+
15
+
16
+ def remove_none_pattern(input_string: str) -> Tuple[str, bool]:
17
+ """Remove the ',none' substring from the input_string if it exists at the end.
18
+
19
+ Args:
20
+ input_string (str): The input string from which to remove the ',none' substring.
21
+
22
+ Returns:
23
+ Tuple[str, bool]: A tuple containing the modified input_string with the ',none' substring removed
24
+ and a boolean indicating whether the modification was made (True) or not (False).
25
+ """
26
+ # Define the pattern to match ',none' at the end of the string
27
+ pattern = re.compile(r",none$")
28
+
29
+ # Use sub() to replace ',none' with an empty string
30
+ result = re.sub(pattern, "", input_string)
31
+
32
+ # check if the input_string changed
33
+ removed = result != input_string
34
+
35
+ return result, removed
36
+
37
+
38
+ def _handle_non_serializable(o: Any) -> Union[int, str, list]:
39
+ """Handle non-serializable objects by converting them to serializable types.
40
+
41
+ Args:
42
+ o (Any): The object to be handled.
43
+
44
+ Returns:
45
+ Union[int, str, list]: The converted object. If the object is of type np.int64 or np.int32,
46
+ it will be converted to int. If the object is of type set, it will be converted
47
+ to a list. Otherwise, it will be converted to str.
48
+ """
49
+ if isinstance(o, np.int64) or isinstance(o, np.int32):
50
+ return int(o)
51
+ elif isinstance(o, set):
52
+ return list(o)
53
+ else:
54
+ return str(o)
55
+
56
+
57
+ def get_commit_from_path(repo_path: Union[Path, str]) -> Optional[str]:
58
+ try:
59
+ git_folder = Path(repo_path, ".git")
60
+ if git_folder.is_file():
61
+ git_folder = Path(
62
+ git_folder.parent,
63
+ git_folder.read_text(encoding="utf-8").split("\n")[0].split(" ")[-1],
64
+ )
65
+ if Path(git_folder, "HEAD").exists():
66
+ head_name = (
67
+ Path(git_folder, "HEAD")
68
+ .read_text(encoding="utf-8")
69
+ .split("\n")[0]
70
+ .split(" ")[-1]
71
+ )
72
+ head_ref = Path(git_folder, head_name)
73
+ git_hash = head_ref.read_text(encoding="utf-8").replace("\n", "")
74
+ else:
75
+ git_hash = None
76
+ except Exception as err:
77
+ logger.debug(
78
+ f"Failed to retrieve a Git commit hash from path: {str(repo_path)}. Error: {err}"
79
+ )
80
+ return None
81
+ return git_hash
82
+
83
+
84
+ def get_git_commit_hash():
85
+ """
86
+ Gets the git commit hash of your current repo (if it exists).
87
+ Source: https://github.com/EleutherAI/gpt-neox/blob/b608043be541602170bfcfb8ec9bf85e8a0799e0/megatron/neox_arguments/neox_args.py#L42
88
+ """
89
+ try:
90
+ git_hash = subprocess.check_output(["git", "describe", "--always"]).strip()
91
+ git_hash = git_hash.decode()
92
+ except (subprocess.CalledProcessError, FileNotFoundError):
93
+ # FileNotFoundError occurs when git not installed on system
94
+ git_hash = get_commit_from_path(os.getcwd()) # git hash of repo if exists
95
+ return git_hash
96
+
97
+
98
+ def add_env_info(storage: Dict[str, Any]):
99
+ try:
100
+ pretty_env_info = get_pretty_env_info()
101
+ except Exception as err:
102
+ pretty_env_info = str(err)
103
+ transformers_version = trans_version
104
+ upper_dir_commit = get_commit_from_path(
105
+ Path(os.getcwd(), "..")
106
+ ) # git hash of upper repo if exists
107
+ added_info = {
108
+ "pretty_env_info": pretty_env_info,
109
+ "transformers_version": transformers_version,
110
+ "upper_git_hash": upper_dir_commit, # in case this repo is submodule
111
+ }
112
+ storage.update(added_info)
113
+
114
+
115
+ def add_tokenizer_info(storage: Dict[str, Any], lm):
116
+ if getattr(lm, "tokenizer", False):
117
+ try:
118
+ tokenizer_info = {
119
+ "tokenizer_pad_token": [
120
+ lm.tokenizer.pad_token,
121
+ str(lm.tokenizer.pad_token_id),
122
+ ],
123
+ "tokenizer_eos_token": [
124
+ lm.tokenizer.eos_token,
125
+ str(lm.tokenizer.eos_token_id),
126
+ ],
127
+ "tokenizer_bos_token": [
128
+ lm.tokenizer.bos_token,
129
+ str(lm.tokenizer.bos_token_id),
130
+ ],
131
+ "eot_token_id": getattr(lm, "eot_token_id", None),
132
+ "max_length": getattr(lm, "max_length", None),
133
+ }
134
+ storage.update(tokenizer_info)
135
+ except Exception as err:
136
+ logger.debug(
137
+ f"Logging detailed tokenizer info failed with {err}, skipping..."
138
+ )
139
+ # seems gguf and textsynth do not have tokenizer
140
+ else:
141
+ logger.debug(
142
+ "LM does not have a 'tokenizer' attribute, not logging tokenizer metadata to results."
143
+ )
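
These helpers are small enough to exercise on their own. A quick sketch, assuming only that the harness, torch, and transformers are importable (no model is loaded):

```python
from lm_eval.loggers.utils import (
    add_env_info,
    get_git_commit_hash,
    remove_none_pattern,
)

# ",none" is the default filter name appended to metric keys; it gets stripped.
assert remove_none_pattern("acc,none") == ("acc", True)
assert remove_none_pattern("acc_norm,strict-match") == ("acc_norm,strict-match", False)

# Environment metadata is merged into the results dictionary in place.
storage = {"git_hash": get_git_commit_hash()}
add_env_info(storage)
print(sorted(storage))  # ['git_hash', 'pretty_env_info', 'transformers_version', 'upper_git_hash']
```
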
scripts/yans/lm-evaluation-harness/lm_eval/loggers/wandb_logger.py ADDED
@@ -0,0 +1,352 @@
1
+ import copy
2
+ import json
3
+ import logging
4
+ from typing import Any, Dict, List, Literal, Tuple
5
+
6
+ import numpy as np
7
+ import pandas as pd
8
+ from packaging.version import Version
9
+
10
+ from lm_eval.loggers.utils import _handle_non_serializable, remove_none_pattern
11
+
12
+
13
+ logger = logging.getLogger(__name__)
14
+
15
+
16
+ def get_wandb_printer() -> Literal["Printer"]:
17
+ """Returns a wandb printer instance for pretty stdout."""
18
+ from wandb.sdk.lib.printer import get_printer
19
+ from wandb.sdk.wandb_settings import Settings
20
+
21
+ printer = get_printer(Settings()._jupyter)
22
+ return printer
23
+
24
+
25
+ class WandbLogger:
26
+ def __init__(self, **kwargs) -> None:
27
+ """Attaches to wandb logger if already initialized. Otherwise, passes kwargs to wandb.init()
28
+
29
+ Args:
30
+ kwargs Optional[Any]: Arguments for configuration.
31
+
32
+ Parse and log the results returned from evaluator.simple_evaluate() with:
33
+ wandb_logger.post_init(results)
34
+ wandb_logger.log_eval_result()
35
+ wandb_logger.log_eval_samples(results["samples"])
36
+ """
37
+ try:
38
+ import wandb
39
+
40
+ assert Version(wandb.__version__) >= Version("0.13.6")
41
+ if Version(wandb.__version__) < Version("0.13.6"):
42
+ wandb.require("report-editing:v0")
43
+ except Exception as e:
44
+ logger.warning(
45
+ "To use the wandb reporting functionality please install wandb>=0.13.6.\n"
46
+ "To install the latest version of wandb run `pip install wandb --upgrade`\n"
47
+ f"{e}"
48
+ )
49
+
50
+ self.wandb_args: Dict[str, Any] = kwargs
51
+
52
+ # initialize a W&B run
53
+ if wandb.run is None:
54
+ self.run = wandb.init(**self.wandb_args)
55
+ else:
56
+ self.run = wandb.run
57
+
58
+ self.printer = get_wandb_printer()
59
+
60
+ def post_init(self, results: Dict[str, Any]) -> None:
61
+ self.results: Dict[str, Any] = copy.deepcopy(results)
62
+ self.task_names: List[str] = list(results.get("results", {}).keys())
63
+ self.group_names: List[str] = list(results.get("groups", {}).keys())
64
+
65
+ def _get_config(self) -> Dict[str, Any]:
66
+ """Get configuration parameters."""
67
+ self.task_configs = self.results.get("configs", {})
68
+ cli_configs = self.results.get("config", {})
69
+ configs = {
70
+ "task_configs": self.task_configs,
71
+ "cli_configs": cli_configs,
72
+ }
73
+
74
+ return configs
75
+
76
+ def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]:
77
+ """Sanitize the results dictionary."""
78
+ _results = copy.deepcopy(self.results.get("results", dict()))
79
+
80
+ # Remove None from the metric string name
81
+ tmp_results = copy.deepcopy(_results)
82
+ for task_name in self.task_names:
83
+ task_result = tmp_results.get(task_name, dict())
84
+ for metric_name, metric_value in task_result.items():
85
+ _metric_name, removed = remove_none_pattern(metric_name)
86
+ if removed:
87
+ _results[task_name][_metric_name] = metric_value
88
+ _results[task_name].pop(metric_name)
89
+
90
+ # remove string valued keys from the results dict
91
+ wandb_summary = {}
92
+ for task in self.task_names:
93
+ task_result = _results.get(task, dict())
94
+ for metric_name, metric_value in task_result.items():
95
+ if isinstance(metric_value, str):
96
+ wandb_summary[f"{task}/{metric_name}"] = metric_value
97
+
98
+ for summary_metric, summary_value in wandb_summary.items():
99
+ _task, _summary_metric = summary_metric.split("/")
100
+ _results[_task].pop(_summary_metric)
101
+
102
+ tmp_results = copy.deepcopy(_results)
103
+ for task_name, task_results in tmp_results.items():
104
+ for metric_name, metric_value in task_results.items():
105
+ _results[f"{task_name}/{metric_name}"] = metric_value
106
+ _results[task_name].pop(metric_name)
107
+ for task in self.task_names:
108
+ _results.pop(task)
109
+
110
+ return wandb_summary, _results
111
+
112
+ def _log_results_as_table(self) -> None:
113
+ """Generate and log evaluation results as a table to W&B."""
114
+ columns = [
115
+ "Version",
116
+ "Filter",
117
+ "num_fewshot",
118
+ "Metric",
119
+ "Value",
120
+ "Stderr",
121
+ ]
122
+
123
+ def make_table(columns: List[str], key: str = "results"):
124
+ import wandb
125
+
126
+ table = wandb.Table(columns=columns)
127
+ results = copy.deepcopy(self.results)
128
+
129
+ for k, dic in results.get(key).items():
130
+ if k in self.group_names and not key == "groups":
131
+ continue
132
+ version = results.get("versions").get(k)
133
+ if version == "N/A":
134
+ version = None
135
+ n = results.get("n-shot").get(k)
136
+
137
+ for (mf), v in dic.items():
138
+ m, _, f = mf.partition(",")
139
+ if m.endswith("_stderr"):
140
+ continue
141
+ if m == "alias":
142
+ continue
143
+
144
+ if m + "_stderr" + "," + f in dic:
145
+ se = dic[m + "_stderr" + "," + f]
146
+ if se != "N/A":
147
+ se = "%.4f" % se
148
+ table.add_data(*[k, version, f, n, m, str(v), str(se)])
149
+ else:
150
+ table.add_data(*[k, version, f, n, m, str(v), ""])
151
+
152
+ return table
153
+
154
+ # log the complete eval result to W&B Table
155
+ table = make_table(["Tasks"] + columns, "results")
156
+ self.run.log({"evaluation/eval_results": table})
157
+
158
+ if "groups" in self.results.keys():
159
+ table = make_table(["Groups"] + columns, "groups")
160
+ self.run.log({"evaluation/group_eval_results": table})
161
+
162
+ def _log_results_as_artifact(self) -> None:
163
+ """Log results as JSON artifact to W&B."""
164
+ import wandb
165
+
166
+ dumped = json.dumps(
167
+ self.results, indent=2, default=_handle_non_serializable, ensure_ascii=False
168
+ )
169
+ artifact = wandb.Artifact("results", type="eval_results")
170
+ with artifact.new_file("results.json", mode="w", encoding="utf-8") as f:
171
+ f.write(dumped)
172
+ self.run.log_artifact(artifact)
173
+
174
+ def log_eval_result(self) -> None:
175
+ """Log evaluation results to W&B."""
176
+ # Log configs to wandb
177
+ configs = self._get_config()
178
+ self.run.config.update(configs)
179
+
180
+ wandb_summary, self.wandb_results = self._sanitize_results_dict()
181
+ # update wandb.run.summary with items that were removed
182
+ self.run.summary.update(wandb_summary)
183
+ # Log the evaluation metrics to wandb
184
+ self.run.log(self.wandb_results)
185
+ # Log the evaluation metrics as W&B Table
186
+ self._log_results_as_table()
187
+ # Log the results dict as json to W&B Artifacts
188
+ self._log_results_as_artifact()
189
+
190
+ def _generate_dataset(
191
+ self, data: List[Dict[str, Any]], config: Dict[str, Any]
192
+ ) -> pd.DataFrame:
193
+ """Generate a dataset from evaluation data.
194
+
195
+ Args:
196
+ data (List[Dict[str, Any]]): The data to generate a dataset for.
197
+ config (Dict[str, Any]): The configuration of the task.
198
+
199
+ Returns:
200
+ pd.DataFrame: A dataframe that is ready to be uploaded to W&B.
201
+ """
202
+ ids = [x["doc_id"] for x in data]
203
+ labels = [x["target"] for x in data]
204
+ instance = [""] * len(ids)
205
+ resps = [""] * len(ids)
206
+ filtered_resps = [""] * len(ids)
207
+ model_outputs = {}
208
+
209
+ metrics_list = config["metric_list"]
210
+ metrics = {}
211
+ for metric in metrics_list:
212
+ metric = metric.get("metric")
213
+ if metric in ["word_perplexity", "byte_perplexity", "bits_per_byte"]:
214
+ metrics[f"{metric}_loglikelihood"] = [x[metric][0] for x in data]
215
+ if metric in ["byte_perplexity", "bits_per_byte"]:
216
+ metrics[f"{metric}_bytes"] = [x[metric][1] for x in data]
217
+ else:
218
+ metrics[f"{metric}_words"] = [x[metric][1] for x in data]
219
+ else:
220
+ metrics[metric] = [x[metric] for x in data]
221
+
222
+ if config["output_type"] == "loglikelihood":
223
+ instance = [x["arguments"][0][0] for x in data]
224
+ labels = [x["arguments"][0][1] for x in data]
225
+ resps = [
226
+ f'log probability of continuation is {x["resps"][0][0][0]} '
227
+ + "\n\n"
228
+ + "continuation will {} generated with greedy sampling".format(
229
+ "not be" if not x["resps"][0][0][1] else "be"
230
+ )
231
+ for x in data
232
+ ]
233
+ filtered_resps = [
234
+ f'log probability of continuation is {x["filtered_resps"][0][0]} '
235
+ + "\n\n"
236
+ + "continuation will {} generated with greedy sampling".format(
237
+ "not be" if not x["filtered_resps"][0][1] else "be"
238
+ )
239
+ for x in data
240
+ ]
241
+ elif config["output_type"] == "multiple_choice":
242
+ instance = [x["arguments"][0][0] for x in data]
243
+ choices = [
244
+ "\n".join([f"{idx}. {y[1]}" for idx, y in enumerate(x["arguments"])])
245
+ for x in data
246
+ ]
247
+ resps = [np.argmax([n[0][0] for n in x["resps"]]) for x in data]
248
+ filtered_resps = [
249
+ np.argmax([n[0] for n in x["filtered_resps"]]) for x in data
250
+ ]
251
+ elif config["output_type"] == "loglikelihood_rolling":
252
+ instance = [x["arguments"][0][0] for x in data]
253
+ resps = [x["resps"][0][0] for x in data]
254
+ filtered_resps = [x["filtered_resps"][0] for x in data]
255
+ elif config["output_type"] == "generate_until":
256
+ instance = [x["arguments"][0][0] for x in data]
257
+ resps = [x["resps"][0][0] for x in data]
258
+ filtered_resps = [x["filtered_resps"][0] for x in data]
259
+
260
+ model_outputs["raw_predictions"] = resps
261
+ model_outputs["filtered_predictions"] = filtered_resps
262
+
263
+ df_data = {
264
+ "id": ids,
265
+ "data": instance,
266
+ }
267
+ if config["output_type"] == "multiple_choice":
268
+ df_data["choices"] = choices
269
+
270
+ tmp_data = {
271
+ "input_len": [len(x) for x in instance],
272
+ "labels": labels,
273
+ "output_type": config["output_type"],
274
+ }
275
+ df_data.update(tmp_data)
276
+ df_data.update(model_outputs)
277
+ df_data.update(metrics)
278
+
279
+ return pd.DataFrame(df_data)
280
+
281
+ def _log_samples_as_artifact(
282
+ self, data: List[Dict[str, Any]], task_name: str
283
+ ) -> None:
284
+ import wandb
285
+
286
+ # log the samples as an artifact
287
+ dumped = json.dumps(
288
+ data,
289
+ indent=2,
290
+ default=_handle_non_serializable,
291
+ ensure_ascii=False,
292
+ )
293
+ artifact = wandb.Artifact(f"{task_name}", type="samples_by_task")
294
+ with artifact.new_file(
295
+ f"{task_name}_eval_samples.json", mode="w", encoding="utf-8"
296
+ ) as f:
297
+ f.write(dumped)
298
+ self.run.log_artifact(artifact)
299
+ # artifact.wait()
300
+
301
+ def log_eval_samples(self, samples: Dict[str, List[Dict[str, Any]]]) -> None:
302
+ """Log evaluation samples to W&B.
303
+
304
+ Args:
305
+ samples (Dict[str, List[Dict[str, Any]]]): Evaluation samples for each task.
306
+ """
307
+ task_names: List[str] = [
308
+ x for x in self.task_names if x not in self.group_names
309
+ ]
310
+
311
+ ungrouped_tasks = []
312
+ tasks_by_groups = {}
313
+
314
+ for task_name in task_names:
315
+ group_names = self.task_configs[task_name].get("group", None)
316
+ if group_names:
317
+ if isinstance(group_names, str):
318
+ group_names = [group_names]
319
+
320
+ for group_name in group_names:
321
+ if not tasks_by_groups.get(group_name):
322
+ tasks_by_groups[group_name] = [task_name]
323
+ else:
324
+ tasks_by_groups[group_name].append(task_name)
325
+ else:
326
+ ungrouped_tasks.append(task_name)
327
+
328
+ for task_name in ungrouped_tasks:
329
+ eval_preds = samples[task_name]
330
+
331
+ # log the samples as a W&B Table
332
+ df = self._generate_dataset(eval_preds, self.task_configs.get(task_name))
333
+ self.run.log({f"{task_name}_eval_results": df})
334
+
335
+ # log the samples as a json file as W&B Artifact
336
+ self._log_samples_as_artifact(eval_preds, task_name)
337
+
338
+ for group, grouped_tasks in tasks_by_groups.items():
339
+ grouped_df = pd.DataFrame()
340
+ for task_name in grouped_tasks:
341
+ eval_preds = samples[task_name]
342
+ df = self._generate_dataset(
343
+ eval_preds, self.task_configs.get(task_name)
344
+ )
345
+ df["group"] = group
346
+ df["task"] = task_name
347
+ grouped_df = pd.concat([grouped_df, df], ignore_index=True)
348
+
349
+ # log the samples as a json file as W&B Artifact
350
+ self._log_samples_as_artifact(eval_preds, task_name)
351
+
352
+ self.run.log({f"{group}_eval_results": grouped_df})
scripts/yans/lm-evaluation-harness/lm_eval/models/dummy.py ADDED
@@ -0,0 +1,41 @@
1
+ import random
2
+
3
+ from tqdm import tqdm
4
+
5
+ from lm_eval.api.model import LM
6
+ from lm_eval.api.registry import register_model
7
+
8
+
9
+ @register_model("dummy")
10
+ class DummyLM(LM):
11
+ def __init__(self) -> None:
12
+ super().__init__()
13
+
14
+ @classmethod
15
+ def create_from_arg_string(cls, arg_string, additional_config=None):
16
+ return cls()
17
+
18
+ def loglikelihood(self, requests, disable_tqdm: bool = False):
19
+ res = []
20
+
21
+ for _ in tqdm(requests, disable=disable_tqdm):
22
+ res.append((-random.random(), False))
23
+
24
+ return res
25
+
26
+ def generate_until(self, requests, disable_tqdm: bool = False):
27
+ res = []
28
+
29
+ for ctx, _ in tqdm(requests, disable=disable_tqdm):
30
+ res.append("lol")
31
+ assert ctx.strip() != ""
32
+
33
+ return res
34
+
35
+ def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
36
+ res = []
37
+
38
+ for _ in tqdm(requests, disable=disable_tqdm):
39
+ res.append(-random.random())
40
+
41
+ return res
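
Because it returns random log-likelihoods and a constant string, the dummy model is mainly useful for smoke-testing task plumbing without loading any weights. A hedged sketch (the task name is illustrative):

```python
from lm_eval import simple_evaluate

# No weights are loaded; scores are meaningless but exercise the full task pipeline.
results = simple_evaluate(model="dummy", tasks=["lambada_openai"], limit=4)
print(results["results"])
```
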
scripts/yans/lm-evaluation-harness/lm_eval/models/gguf.py ADDED
@@ -0,0 +1,130 @@
1
+ import logging
2
+ import time
3
+
4
+ import requests
5
+ from requests.exceptions import RequestException
6
+ from tqdm import tqdm
7
+
8
+ from lm_eval.api.model import LM
9
+ from lm_eval.api.registry import register_model
10
+
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+
15
+ def get_result(logprobs, context_length):
16
+ is_greedy = True
17
+ offsets = logprobs["text_offset"]
18
+ tokens = logprobs["tokens"]
19
+ tokens_logprobs = logprobs["token_logprobs"]
20
+
21
+ idx = 0
22
+ while offsets[idx] < context_length:
23
+ idx += 1
24
+ continuation_logprobs = sum(tokens_logprobs[idx:-1])
25
+ for i in range(idx, len(tokens)):
26
+ token = tokens[i]
27
+ top_tokens = logprobs["top_logprobs"][i]
28
+ top_token = max(top_tokens.keys(), key=lambda x: top_tokens[x])
29
+ if top_token != token:
30
+ is_greedy = False
31
+ break
32
+
33
+ return continuation_logprobs, is_greedy
34
+
35
+
36
+ @register_model("gguf", "ggml")
37
+ class GGUFLM(LM):
38
+ def __init__(self, base_url=None, max_length=2048, **kwargs):
39
+ super().__init__()
40
+ self.base_url = base_url
41
+ assert self.base_url, "must pass `base_url` to use GGUF LM!"
42
+ self.logprobs = 10
43
+ self.temperature = 0.0
44
+ self.max_length = max_length
45
+
46
+ def gguf_completion(
47
+ self, context, continuation=None, stop=None, retries=3, delay=5, **kwargs
48
+ ):
49
+ for _ in range(retries):
50
+ try:
51
+ prompt = context
52
+ request = {
53
+ "prompt": prompt,
54
+ "logprobs": self.logprobs,
55
+ "temperature": self.temperature,
56
+ }
57
+ if continuation:
58
+ prompt += continuation
59
+ request.update({"prompt": prompt, "max_tokens": 1, "echo": True})
60
+ if stop is not None:
61
+ request["stop"] = stop
62
+ response = requests.post(
63
+ f"{self.base_url}/v1/completions", json=request
64
+ )
65
+ response.raise_for_status()
66
+ return response.json()
67
+ except RequestException as e:
68
+ logger.error(f"RequestException: {e}")
69
+ time.sleep(delay) # wait before retrying
70
+ else:
71
+ raise Exception(f"Failed to get a valid response after {retries} retries.")
72
+
73
+ def loglikelihood(self, requests, disable_tqdm: bool = False):
74
+ if not requests:
75
+ return []
76
+ res = []
77
+ for context, continuation in tqdm(
78
+ [req.args for req in requests], disable=disable_tqdm
79
+ ):
80
+ response = self.gguf_completion(context=context, continuation=continuation)
81
+ if response and "choices" in response and response["choices"]:
82
+ choice = response["choices"][0]
83
+ logprobs = choice.get("logprobs")
84
+ if (
85
+ logprobs
86
+ and "token_logprobs" in logprobs
87
+ and logprobs["token_logprobs"]
88
+ ):
89
+ logprob, is_greedy = get_result(logprobs, len(context))
90
+ res.append((logprob, is_greedy))
91
+ else:
92
+ logger.warning(
93
+ "Invalid logprobs data. Expected 'logprobs' to contain 'token_logprobs' list."
94
+ )
95
+ else:
96
+ logger.error(
97
+ f"Invalid response for loglikelihood. Response: {response}"
98
+ )
99
+ assert False
100
+ return res
101
+
102
+ def generate_until(self, requests, disable_tqdm: bool = False):
103
+ if not requests:
104
+ return []
105
+
106
+ res = []
107
+ for request in tqdm([req.args for req in requests], disable=disable_tqdm):
108
+ inp = request[0]
109
+ request_args = request[1]
110
+ until = request_args.get("until", ["</s>"])
111
+ response = self.gguf_completion(context=inp, stop=until)
112
+ if response and "choices" in response and response["choices"]:
113
+ choice = response["choices"][0]
114
+ if "text" in choice:
115
+ generated_text = choice["text"].strip()
116
+ res.append(generated_text)
117
+ else:
118
+ logger.error(
119
+ f"Invalid response for greedy_until. Response: {response}"
120
+ )
121
+ res.append(None) # Add default value in case of error
122
+ else:
123
+ logger.error(f"Invalid response for greedy_until. Response: {response}")
124
+ res.append(None) # Add default value in case of error
125
+ return res
126
+
127
+ def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
128
+ raise NotImplementedError(
129
+ "loglikelihood_rolling not yet supported for GGUF models"
130
+ )
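
GGUFLM talks to an already-running llama.cpp-style server rather than loading a model in-process. A minimal sketch, assuming such a server is serving an OpenAI-style completions endpoint locally:

```python
from lm_eval.models.gguf import GGUFLM

# Assumes a llama.cpp (or compatible) server is listening here with a
# /v1/completions endpoint; nothing is loaded in this process.
lm = GGUFLM(base_url="http://localhost:8000")

# The harness normally drives this through Instance objects, but the underlying
# request helper can be exercised directly for a quick connectivity check.
response = lm.gguf_completion(context="The capital of France is", stop=["\n"])
print(response["choices"][0]["text"])
```
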
scripts/yans/lm-evaluation-harness/lm_eval/models/mamba_lm.py ADDED
@@ -0,0 +1,126 @@
1
+ from typing import Optional, Union
2
+
3
+ import torch
4
+
5
+ import lm_eval.models.utils
6
+ from lm_eval.api.registry import register_model
7
+ from lm_eval.models.huggingface import HFLM
8
+
9
+
10
+ @register_model("mamba_ssm")
11
+ class MambaLMWrapper(HFLM):
12
+ def __init__(
13
+ self,
14
+ pretrained="state-spaces/mamba-130m",
15
+ **kwargs,
16
+ ) -> None:
17
+ """
18
+ Mamba (via the `mamba_ssm` package) supports the following args:
19
+ ```
20
+ d_model: int,
21
+ n_layer: int,
22
+ vocab_size: int,
23
+ initializer_cfg=None,
24
+ pad_vocab_size_multiple: int = 1,
25
+ ssm_cfg=None,
26
+ norm_epsilon: float = 1e-5,
27
+ rms_norm: bool = False,
28
+ initializer_cfg=None,
29
+ fused_add_norm=False,
30
+ residual_in_fp32=False,
31
+ ```
32
+
33
+ See https://github.com/state-spaces/mamba/blob/main/mamba_ssm/models/mixer_seq_simple.py#L175 for more info.
34
+ The above can all be passed via `--model_args` or to this __init__() directly
35
+ but we recommend placing many of these within the config.json file uploaded alongside your
36
+ Mamba model to the HF Hub instead.
37
+ All other HuggingFace from_pretrained() kwargs
38
+ such as those related to
39
+ `parallelize=True`, PEFT, autoGPTQ,
40
+ or any sub-configurations of these advanced args,
41
+ are unsupported by the `mamba_ssm` package.
42
+
43
+ The HFLM arguments
44
+
45
+ `backend`, `tokenizer`, `truncation`, `max_length`,
46
+ `device`, `dtype`, `batch_size`, `max_batch_size`, `trust_remote_code`, `use_fast_tokenizer`
47
+
48
+ Are all supported by Mamba where they do not conflict
49
+ with Mamba-specific restrictions such as causal LMs only.
50
+ """
51
+
52
+ if "backend" in kwargs:
53
+ # mamba currently only supports causal models
54
+ assert kwargs["backend"] == "causal"
55
+
56
+ super().__init__(
57
+ pretrained=pretrained,
58
+ # set appropriate defaults for tokenizer, max length, etc
59
+ backend=kwargs.pop("backend", "causal"),
60
+ tokenizer=kwargs.pop("tokenizer", "EleutherAI/gpt-neox-20b"),
61
+ max_length=kwargs.pop("max_length", 2048),
62
+ **kwargs,
63
+ )
64
+
65
+ def _get_config(
66
+ self,
67
+ pretrained: str,
68
+ **kwargs,
69
+ ) -> None:
70
+ try:
71
+ from mamba_ssm.utils.hf import load_config_hf # noqa: F811
72
+ except ModuleNotFoundError:
73
+ raise Exception(
74
+ "attempted to use 'mamba_ssm' LM type, but package `mamba_ssm` is not installed. \
75
+ please install mamba via `pip install lm-eval[mamba]` or `pip install -e .[mamba]`",
76
+ )
77
+
78
+ self._config = load_config_hf(pretrained)
79
+
80
+ def _create_model(
81
+ self,
82
+ pretrained: str,
83
+ dtype: Optional[Union[str, torch.dtype]] = "float16",
84
+ # no `parallelize=True` options
85
+ # no PEFT and quantization options
86
+ # Mamba does not support arbitrary HF from_pretrained() args
87
+ **kwargs,
88
+ ) -> None:
89
+ try:
90
+ from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel # noqa: F811
91
+ except ModuleNotFoundError:
92
+ raise Exception(
93
+ "attempted to use 'mamba_ssm' LM type, but package `mamba_ssm` is not installed. \
94
+ please install mamba via `pip install lm-eval[mamba]` or `pip install -e .[mamba]`",
95
+ )
96
+
97
+ self._model = MambaLMHeadModel.from_pretrained(
98
+ pretrained,
99
+ device=self._device,
100
+ dtype=torch.float16
101
+ if dtype == "auto"
102
+ else lm_eval.models.utils.get_dtype(dtype),
103
+ )
104
+
105
+ def _model_generate(self, context, max_length, stop, **generation_kwargs):
106
+ for key in ("do_sample", "attention_mask"):
107
+ if key in generation_kwargs:
108
+ generation_kwargs.pop(key)
109
+
110
+ # mamba's custom GenerationMixin currently does not support
111
+ # passing stopping criteria.
112
+ # for the time being, we simply generate to max length,
113
+ # then truncate (equivalent result)
114
+ # -- this should be revisited to speed up generation
115
+ # stopping_criteria = stop_sequences_criteria(
116
+ # self.tokenizer, stop, 1, context.shape[0]
117
+ # )
118
+
119
+ return self.model.generate(
120
+ input_ids=context,
121
+ max_length=max_length,
122
+ # stopping_criteria=stopping_criteria,
123
+ # pad_token_id=self.tokenizer.pad_token_id,
124
+ # use_cache=True,
125
+ **generation_kwargs,
126
+ )
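
Since the wrapper registers itself as `mamba_ssm` and otherwise rides on the HFLM machinery, the standard evaluation entry point is enough to use it. A sketch, assuming `mamba_ssm` and a CUDA device are available (the task is illustrative):

```python
from lm_eval import simple_evaluate

# pretrained and dtype are forwarded to MambaLMWrapper via the usual model_args string.
results = simple_evaluate(
    model="mamba_ssm",
    model_args="pretrained=state-spaces/mamba-130m,dtype=float16",
    tasks=["lambada_openai"],
    limit=8,
)
print(results["results"])
```
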
scripts/yans/lm-evaluation-harness/lm_eval/models/neuron_optimum.py ADDED
@@ -0,0 +1,737 @@
1
+ import copy
2
+ import json
3
+ import logging
4
+ import subprocess
5
+ from collections import defaultdict
6
+ from typing import List, Optional, Union
7
+
8
+ import torch
9
+ import torch.nn.functional as F
10
+ import transformers
11
+ from packaging import version
12
+ from tqdm import tqdm
13
+ from transformers import GenerationConfig
14
+ from transformers.generation import StoppingCriteriaList
15
+
16
+ import lm_eval.models.utils
17
+ from lm_eval import utils
18
+ from lm_eval.api.model import TemplateLM
19
+ from lm_eval.api.registry import register_model
20
+ from lm_eval.models.utils import stop_sequences_criteria
21
+
22
+
23
+ try:
24
+ NEURON_AVAILABLE = True
25
+ from optimum.neuron import NeuronModelForCausalLM
26
+ from optimum.neuron.generation import TokenSelector
27
+ from optimum.neuron.version import __version__ as optimum_neuron_version
28
+ except ImportError:
29
+ NeuronModelForCausalLM = object
30
+ NEURON_AVAILABLE = False
31
+
32
+
33
+ logger = logging.getLogger(__name__)
34
+
35
+
36
+ def get_nc_count() -> Union[int, None]:
37
+ """Returns the number of neuron cores on the current instance."""
38
+ try:
39
+ cmd = "neuron-ls --json-output"
40
+ result = subprocess.run(cmd, shell=True, capture_output=True)
41
+ print(f"inferring nc_count from `neuron-ls` {result.stdout}")
42
+ json_output = json.loads(result.stdout)
43
+ count = sum([x["nc_count"] for x in json_output])
44
+ print(f"nc_count={count}")
45
+ return count
46
+ except Exception:
47
+ return None
48
+
49
+
50
+ def wrap_constant_batch_size(func):
51
+ def _decorator(self, input_ids):
52
+ """input_ids a 2D array with batch_size on dim=0
53
+
54
+ makes sure the func runs with self.batch_size
55
+ """
56
+ # read the actual batch size of the incoming inputs
57
+ batch_size = input_ids.shape[0]
58
+
59
+ if batch_size < self.batch_size:
60
+ # handle the event of input_ids.shape[0] != batch_size
61
+ # Neuron cores expect constant batch_size
62
+ input_ids = torch.concat(
63
+ (
64
+ input_ids,
65
+ # add missing_batch_size dummy
66
+ torch.zeros(
67
+ [self.batch_size - batch_size, *input_ids.size()[1:]],
68
+ dtype=input_ids.dtype,
69
+ device=input_ids.device,
70
+ ),
71
+ ),
72
+ dim=0,
73
+ )
74
+ elif batch_size > self.batch_size:
75
+ raise ValueError(
76
+ f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
77
+ )
78
+ # return the forward pass that requires constant batch size
79
+ return func(self, input_ids)[:batch_size]
80
+
81
+ return _decorator
82
+
83
+
84
+ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
85
+ """NeuronModelForCausalLM with `stopping_criteria` in `generate`"""
86
+
87
+ def generate(
88
+ self,
89
+ input_ids: torch.Tensor,
90
+ attention_mask: Optional[torch.Tensor] = None,
91
+ stopping_criteria: Optional["StoppingCriteriaList"] = None,
92
+ generation_config: Optional["GenerationConfig"] = None,
93
+ **kwargs,
94
+ ) -> torch.LongTensor:
95
+ r"""
96
+ A streamlined generate() method overriding the transformers.GenerationMixin.generate() method.
97
+
98
+ This method uses the same logits processors/warpers and stopping criteria as the transformers library
99
+ `generate()` method but restricts the generation to greedy search and sampling.
100
+
101
+ It does not support transformers `generate()` advanced options.
102
+
103
+ Please refer to https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate
104
+ for details on generation configuration.
105
+
106
+ Parameters:
107
+ input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`):
108
+ The sequence used as a prompt for the generation.
109
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
110
+ Mask to avoid performing attention on padding token indices.
111
+ generation_config (`~transformers.generation.GenerationConfig`, *optional*):
112
+ The generation configuration to be used as base parametrization for the generation call. `**kwargs`
113
+ passed to generate matching the attributes of `generation_config` will override them. If
114
+ `generation_config` is not provided, default will be used, which had the following loading
115
+ priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model
116
+ configuration. Please note that unspecified parameters will inherit [`~transformers.generation.GenerationConfig`]'s
117
+ default values, whose documentation should be checked to parameterize generation.
118
+
119
+ Returns:
120
+ `torch.Tensor`: A `torch.FloatTensor`.
121
+ """
122
+ # The actual generation configuration is a combination of config and parameters
123
+ generation_config = copy.deepcopy(
124
+ self.generation_config if generation_config is None else generation_config
125
+ )
126
+ model_kwargs = generation_config.update(
127
+ **kwargs
128
+ ) # All unused kwargs must be model kwargs
129
+ # Check model kwargs are actually used by either prepare_inputs_for_generation or forward
130
+ self._validate_model_kwargs(model_kwargs)
131
+
132
+ # Instantiate a TokenSelector for the specified configuration
133
+ selector = TokenSelector.create(
134
+ input_ids, generation_config, self, self.max_length
135
+ )
136
+ selector.stopping_criteria.append(stopping_criteria)
137
+ # Verify that the inputs are compatible with the model static input dimensions
138
+ batch_size, sequence_length = input_ids.shape
139
+ if sequence_length > self.max_length:
140
+ raise ValueError(
141
+ f"The input sequence length ({sequence_length}) exceeds the model static sequence length ({self.max_length})"
142
+ )
143
+ padded_input_ids = input_ids
144
+ padded_attention_mask = attention_mask
145
+ if batch_size > self.batch_size:
146
+ raise ValueError(
147
+ f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
148
+ )
149
+ elif batch_size < self.batch_size:
150
+ logger.warning(
151
+ "Inputs will be padded to match the model static batch size. This will increase latency."
152
+ )
153
+ padding_shape = [self.batch_size - batch_size, sequence_length]
154
+ padding = torch.full(
155
+ padding_shape, fill_value=self.config.eos_token_id, dtype=torch.int64
156
+ )
157
+ padded_input_ids = torch.cat([input_ids, padding])
158
+ if attention_mask is not None:
159
+ padding = torch.zeros(padding_shape, dtype=torch.int64)
160
+ padded_attention_mask = torch.cat([attention_mask, padding])
161
+ # Drop the current generation context and clear the Key/Value cache
162
+ self.reset_generation()
163
+
164
+ output_ids = self.generate_tokens(
165
+ padded_input_ids,
166
+ selector,
167
+ batch_size,
168
+ attention_mask=padded_attention_mask,
169
+ **model_kwargs,
170
+ )
171
+ return output_ids[:batch_size, :]
172
+
173
+
174
+ @register_model("neuronx")
175
+ class NEURON_HF(TemplateLM):
176
+ """
177
+ Enables usage with on AWS Neuron
178
+ using the HuggingFace Transformers + Transformers neuronx library.
179
+ Tested with neuron 2.17.0
180
+ """
181
+
182
+ _DEFAULT_MAX_LENGTH = 2048
183
+
184
+ def __init__(
185
+ self,
186
+ pretrained: Optional[str] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
187
+ revision: Optional[str] = "main",
188
+ tp_degree: Optional[int] = None,
189
+ subfolder: Optional[str] = None,
190
+ tokenizer: Optional[str] = None,
191
+ truncation: Optional[bool] = False,
192
+ max_length: Optional[int] = None,
193
+ dtype: Optional[Union[str, torch.dtype]] = "auto",
194
+ batch_size: Optional[int] = 1,
195
+ low_cpu_mem_usage: Optional[bool] = True,
196
+ trust_remote_code: Optional[bool] = False,
197
+ use_fast_tokenizer: Optional[bool] = True,
198
+ add_bos_token: Optional[bool] = False,
199
+ ) -> None:
200
+ if not NEURON_AVAILABLE:
201
+ raise Exception(
202
+ "Tried to load neuron model, but neuron is not installed ",
203
+ "please install neuron via pip install transformers-neuron ",
204
+ "also make sure you are running on an AWS inf2 instance",
205
+ )
206
+ if version.parse(optimum_neuron_version) != version.parse("0.0.17"):
207
+ logger.warning(
208
+ '`optimum-neuron` model requires `pip install "optimum[neuronx]>=0.0.17" '
209
+ "preferably using the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) "
210
+ "https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2 "
211
+ f"You are using optimum-neuron={optimum_neuron_version}"
212
+ )
213
+ super().__init__()
214
+
215
+ assert isinstance(pretrained, str)
216
+ assert isinstance(batch_size, (int, str))
217
+
218
+ self.batch_size_per_gpu = int(batch_size)
219
+ batch_size = int(batch_size)
220
+ if tp_degree is None:
221
+ # execute `neuron-ls --json-output | jq '.[0].nc_count'``
222
+ # to get the number of neuron cores on your instance
223
+ tp_degree = get_nc_count()
224
+
225
+ assert isinstance(tp_degree, int), (
226
+ f"model_args must include tp_degree. tp_degree must be set to an integer,"
227
+ f" but is tp_degree=`{tp_degree}` with type=`{type(tp_degree)}`."
228
+ "Set it to number of neuron cores on your instance."
229
+ " For inf2.xlarge and inf2.8xlarge, set it to `2`."
230
+ " For inf2.24xlarge, set it to `12`."
231
+ " For inf2.48xlarge, set it to `24`."
232
+ )
233
+
234
+ revision = str(revision) # cast to string if not already one
235
+ # TODO: update this to be less of a hack once subfolder is fixed in HF
236
+ revision = revision + ("/" + subfolder if subfolder is not None else "")
237
+
238
+ self._config = transformers.AutoConfig.from_pretrained(
239
+ pretrained,
240
+ revision=revision,
241
+ trust_remote_code=trust_remote_code,
242
+ )
243
+ torch_dtype = lm_eval.models.utils.get_dtype(dtype)
244
+
245
+ assert torch_dtype in [
246
+ torch.float16,
247
+ torch.bfloat16,
248
+ ], "Only float16 and bfloat16 are supported"
249
+
250
+ self.tokenizer = transformers.AutoTokenizer.from_pretrained(
251
+ pretrained if tokenizer is None else tokenizer,
252
+ revision=revision,
253
+ trust_remote_code=trust_remote_code,
254
+ use_fast=use_fast_tokenizer,
255
+ )
256
+
257
+ # Neuron specific code
258
+ if torch_dtype == torch.float16:
259
+ self.amp_dtype = "f16"
260
+ elif torch_dtype == torch.bfloat16:
261
+ self.amp_dtype = "bf16"
262
+ elif torch_dtype == torch.float32:
263
+ self.amp_dtype = "f32"
264
+ else:
265
+ raise NotImplementedError("Only float16 and bfloat16 are implemented.")
266
+
267
+ compiler_args = {"num_cores": tp_degree, "auto_cast_type": self.amp_dtype}
268
+ input_shapes = {
269
+ "batch_size": batch_size,
270
+ "sequence_length": self._DEFAULT_MAX_LENGTH,
271
+ }
272
+
273
+ print(
274
+ f"{'='*20} \n loading model to neuron with"
275
+ f" {compiler_args}, {input_shapes}..."
276
+ )
277
+ self.model = CustomNeuronModelForCausalLM.from_pretrained(
278
+ pretrained,
279
+ revision=revision,
280
+ trust_remote_code=trust_remote_code,
281
+ low_cpu_mem_usage=low_cpu_mem_usage,
282
+ export=True,
283
+ **compiler_args,
284
+ **input_shapes,
285
+ )
286
+ print(f"SUCCESS: neuron model compiled. \n {'='*20}")
287
+
288
+ self.truncation = truncation
289
+
290
+ self.vocab_size = self.tokenizer.vocab_size
291
+ self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
292
+ self.add_bos_token = add_bos_token
293
+
294
+ self._max_length = max_length
295
+
296
+ self.batch_schedule = 1
297
+ self.batch_sizes = {}
298
+
299
+ @property
300
+ def config(self):
301
+ # return the associated transformers.AutoConfig for the given pretrained model.
302
+ return self._config
303
+
304
+ @property
305
+ def eot_token_id(self):
306
+ # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
307
+ return self.tokenizer.eos_token_id
308
+
309
+ @property
310
+ def prefix_token_id(self):
311
+ # it is used as prefix for loglikelihood
312
+ return self.tokenizer.bos_token_id or self.tokenizer.eos_token_id
313
+
314
+ @property
315
+ def max_length(self):
316
+ if self._max_length: # if max length manually set, return it
317
+ return self._max_length
318
+ seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
319
+ for attr in seqlen_config_attrs:
320
+ if hasattr(self.model.config, attr):
321
+ return getattr(self.model.config, attr)
322
+ if hasattr(self.tokenizer, "model_max_length"):
323
+ if self.tokenizer.model_max_length == 1000000000000000019884624838656:
324
+ return self._DEFAULT_MAX_LENGTH
325
+ return self.tokenizer.model_max_length
326
+ return self._DEFAULT_MAX_LENGTH
327
+
328
+ @property
329
+ def max_gen_toks(self) -> int:
330
+ return 256
331
+
332
+ @property
333
+ def batch_size(self):
334
+ return self.batch_size_per_gpu
335
+
336
+ @property
337
+ def device(self):
338
+ """device are neuron cores, but the created tensors are on CPU."""
339
+ return "cpu"
340
+
341
+ @property
342
+ def rank(self):
343
+ return 0
344
+
345
+ @property
346
+ def world_size(self):
347
+ return 1
348
+
349
+ def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None):
350
+ """ """
351
+ if add_special_tokens is None:
352
+ add_special_tokens = False or self.add_bos_token
353
+
354
+ encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens)
355
+
356
+ # left-truncate the encoded context to be at most `left_truncate_len` tokens long
357
+ if left_truncate_len:
358
+ encoding = encoding[-left_truncate_len:]
359
+
360
+ return encoding
361
+
362
+ def tok_batch_encode(
363
+ self,
364
+ strings: List[str],
365
+ padding_side: str = "left",
366
+ left_truncate_len: int = None,
367
+ truncation: bool = False,
368
+ ):
369
+ # encode a batch of strings. converts to tensors and pads automatically, unlike tok_encode.
370
+ old_padding_side = self.tokenizer.padding_side
371
+ self.tokenizer.padding_side = padding_side
372
+
373
+ add_special_tokens = False or self.add_bos_token
374
+
375
+ encoding = self.tokenizer(
376
+ strings,
377
+ truncation=truncation,
378
+ padding="longest",
379
+ return_tensors="pt",
380
+ add_special_tokens=add_special_tokens,
381
+ )
382
+ if left_truncate_len:
383
+ encoding["input_ids"] = encoding["input_ids"][:, -left_truncate_len:]
384
+ encoding["attention_mask"] = encoding["attention_mask"][
385
+ :, -left_truncate_len:
386
+ ]
387
+ self.tokenizer.padding_side = old_padding_side
388
+
389
+ return encoding["input_ids"], encoding["attention_mask"]
390
+
391
+ def tok_decode(self, tokens):
392
+ return self.tokenizer.decode(tokens)
393
+
394
+ @wrap_constant_batch_size
395
+ def _model_call(self, input_ids: torch.Tensor):
396
+ """
397
+ get logits for the entire sequence
398
+
399
+ :param input_ids: torch.Tensor
400
+ A torch tensor of shape [batch, sequence_cont]
401
+ the size of sequence may vary from call to call
402
+ :return
403
+ A torch tensor of shape [batch, sequence, vocab] with the
404
+ logits returned from the model's decoder-lm head
405
+ """
406
+ _, sequence_length = input_ids.shape
407
+
408
+ with torch.inference_mode():
409
+ cache_ids = torch.arange(0, sequence_length, dtype=torch.int32).split(1)
410
+ input_ids_split = input_ids.split(1, dim=1)
411
+
412
+ return torch.concat(
413
+ [
414
+ self.model.forward(
415
+ input_ids=input_id, cache_ids=cache_id, return_dict=False
416
+ )[0]
417
+ for input_id, cache_id in zip(input_ids_split, cache_ids)
418
+ ],
419
+ dim=1,
420
+ )
421
+
422
+ def _model_generate(self, context, max_length, stop, **generation_kwargs):
423
+ # we require users to pass do_sample=True explicitly
424
+ # for non-greedy gen. This should be reevaluated when considering beam search.
425
+
426
+ with torch.inference_mode():
427
+ if "do_sample" not in generation_kwargs.keys():
428
+ generation_kwargs["do_sample"] = False
429
+
430
+ stopping_criteria = stop_sequences_criteria(
431
+ self.tokenizer,
432
+ stop + [self.tokenizer.decode([self.config.eos_token_id])],
433
+ 1,
434
+ context.shape[0],
435
+ )
436
+
437
+ return self.model.generate(
438
+ input_ids=context,
439
+ max_length=max_length,
440
+ stopping_criteria=stopping_criteria,
441
+ pad_token_id=self.eot_token_id,
442
+ use_cache=True,
443
+ **generation_kwargs,
444
+ )
445
+
446
+ def _select_cont_toks(self, logits, contlen=None, inplen=None):
447
+ assert (
448
+ contlen and inplen
449
+ ), "Must pass input len and cont. len to select scored logits for causal LM"
450
+ # discard right-padding.
451
+ # also discard the input/context tokens. we'll only score continuations.
452
+ logits = logits[inplen - contlen : inplen]
453
+
454
+ return logits
455
+
456
+ def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
457
+ loglikelihoods = []
458
+
459
+ adaptive_batch_size = None
460
+
461
+ for (string,) in tqdm(
462
+ [req.args for req in requests], disable=(disable_tqdm or (self.rank != 0))
463
+ ):
464
+ rolling_token_windows = list(
465
+ map(
466
+ utils.make_disjoint_window,
467
+ utils.get_rolling_token_windows(
468
+ token_list=self.tok_encode(string),
469
+ prefix_token=self.prefix_token_id,
470
+ max_seq_len=self.max_length,
471
+ context_len=1,
472
+ ),
473
+ )
474
+ )
475
+
476
+ # TODO: Right now, we pass single EOT token to the Encoder and the full context to the decoder, in seq2seq case
477
+ rolling_token_windows = [(None,) + x for x in rolling_token_windows]
478
+
479
+ pad_amnt = 0
480
+ if self.world_size > 1:
481
+ # We pad out the external document-level iterator so the inner iterator doesn't hang
482
+ mytensor = torch.tensor(len(rolling_token_windows), device=self.device)
483
+ gathered = (
484
+ self.accelerator.gather(mytensor).cpu().detach().numpy().tolist()
485
+ )
486
+
487
+ pad_amnt = max(gathered) - gathered[self.rank]
488
+ if pad_amnt > 0:
489
+ rolling_token_windows += pad_amnt * [rolling_token_windows[0]]
490
+
491
+ string_nll = self._loglikelihood_tokens(
492
+ rolling_token_windows,
493
+ disable_tqdm=True,
494
+ override_bs=adaptive_batch_size,
495
+ )
496
+
497
+ if (self.world_size > 1) and (pad_amnt > 0):
498
+ string_nll = [x[0] for x in string_nll[:-pad_amnt]]
499
+ else:
500
+ # discard is_greedy
501
+ string_nll = [x[0] for x in string_nll]
502
+
503
+ string_nll = sum(string_nll)
504
+ loglikelihoods.append(string_nll)
505
+
506
+ return loglikelihoods
507
+
508
+ def _loglikelihood_tokens(
509
+ self, requests, disable_tqdm: bool = False, override_bs=None
510
+ ):
511
+ # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
512
+ res = []
513
+
514
+ def _collate(x):
515
+ # the negative sign on len(toks) sorts descending - this has a few advantages:
516
+ # - time estimates will always be over not underestimates, which is more useful for planning
517
+ # - to know the size of a batch when going through the list, you know the first one is always the batch
518
+ # padded context length. this is useful to simplify the batching logic and more importantly to make
519
+ # automatic adaptive batches much much easier to implement
520
+ # - any OOMs will happen right away rather than near the end
521
+
522
+ toks = x[1] + x[2]
523
+ return -len(toks), tuple(toks)
524
+
525
+ re_ord = utils.Reorderer(requests, _collate)
526
+
527
+ n_reordered_requests = len(re_ord.get_reordered()) # noqa
528
+ # automatic (variable) batch size detection for vectorization
529
+ # pull longest context sample from request
530
+
531
+ chunks = lm_eval.models.utils.chunks(
532
+ re_ord.get_reordered(),
533
+ n=self.batch_size,
534
+ fn=None,
535
+ )
536
+
537
+ for chunk in tqdm(chunks, disable=(disable_tqdm or (self.rank != 0))):
538
+ inps = []
539
+ cont_toks_list = []
540
+ inplens = []
541
+
542
+ conts = [] # noqa
543
+ encoder_attns = [] # noqa
544
+
545
+ padding_len_inp = None
546
+ padding_len_cont = None # noqa
547
+ # because vectorizing is annoying, we first convert each (context, continuation) pair to padded
548
+ # tensors, then we pack them together into a batch, call the model, and then pick it all apart
549
+ # again because vectorizing is annoying
550
+
551
+ for _, context_enc, continuation_enc in chunk:
552
+ # sanity check
553
+ assert len(context_enc) > 0
554
+ assert len(continuation_enc) > 0
555
+ assert len(continuation_enc) <= self.max_length
556
+
557
+ # how this all works (illustrated on a causal decoder-only setup):
558
+ # CTX CONT
559
+ # inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
560
+ # model \ \
561
+ # logits 1 2 3|4 5 6 7 8 9 <- the ctx half gets tossed out by the
562
+ # cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
563
+
564
+ # when too long to fit in context, truncate from the left
565
+ inp = torch.tensor(
566
+ (context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
567
+ dtype=torch.long,
568
+ device=self.device,
569
+ )
570
+ (inplen,) = inp.shape
571
+
572
+ padding_len_inp = (
573
+ max(padding_len_inp, inplen)
574
+ if padding_len_inp is not None
575
+ else inplen
576
+ )
577
+
578
+ inps.append(inp) # [1, inp_length]
579
+ cont_toks_list.append(continuation_enc)
580
+ inplens.append(inplen)
581
+
582
+ # create encoder attn mask and batched conts, if seq2seq
583
+ call_kwargs = {}
584
+ batched_inps = lm_eval.models.utils.pad_and_concat(
585
+ padding_len_inp, inps, padding_side="right"
586
+ ) # [batch, padding_len_inp]
587
+
588
+ multi_logits = F.log_softmax(
589
+ self._model_call(batched_inps, **call_kwargs), dim=-1
590
+ ) # [batch, padding_length (inp or cont), vocab]
591
+
592
+ for (cache_key, _, _), logits, inplen, cont_toks in zip(
593
+ chunk, multi_logits, inplens, cont_toks_list
594
+ ):
595
+ # Slice to original seq length
596
+ contlen = len(cont_toks)
597
+ # take only logits in the continuation
598
+ # (discard context toks if decoder-only ; discard right-padding)
599
+ # also discards + checks for "virtual tokens" in the causal LM's input window
600
+ # from prompt/prefix tuning tokens, if applicable
601
+ ctx_len = inplen + (logits.shape[0] - padding_len_inp)
602
+ logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len)
603
+ logits = logits.unsqueeze(0) # [1, seq, vocab]
604
+
605
+ # Check if per-token argmax is exactly equal to continuation
606
+ greedy_tokens = logits.argmax(dim=-1)
607
+ cont_toks = torch.tensor(
608
+ cont_toks, dtype=torch.long, device=self.device
609
+ ).unsqueeze(0) # [1, seq]
610
+ max_equal = (greedy_tokens == cont_toks).all()
611
+
612
+ # Obtain log-probs at the corresponding continuation token indices
613
+ # last_token_slice = logits[:, -1, :].squeeze(0).tolist()
614
+ logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
615
+ -1
616
+ ) # [1, seq]
617
+
618
+ # Answer: (log prob, is-exact-match)
619
+ answer = (float(logits.sum()), bool(max_equal))
620
+
621
+ res.append(answer)
622
+
623
+ self.cache_hook.add_partial("loglikelihood", cache_key, answer)
624
+
625
+ return re_ord.get_original(res)
626
+
627
+ def generate_until(self, requests, disable_tqdm: bool = False):
628
+ res = defaultdict(list)
629
+ re_ords = {}
630
+
631
+ def _collate(x):
632
+ # the negative sign on len(toks) sorts descending - this has a few advantages:
633
+ # - time estimates will always be over not underestimates, which is more useful for planning
634
+ # - to know the size of a batch when going through the list, you know the first one is always the batch
635
+ # padded context length. this is useful to simplify the batching logic and more importantly to make
636
+ # automatic adaptive batches much much easier to implement
637
+ # - any OOMs will happen right away rather than near the end
638
+ toks = self.tok_encode(x[0])
639
+ return -len(toks), x[0]
640
+
641
+ # we group requests by their generation_kwargs,
642
+ # so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
643
+ # in the same batch.
644
+ grouper = lm_eval.models.utils.Grouper(requests, lambda x: str(x.args[1]))
645
+ for key, reqs in grouper.get_grouped().items():
646
+ # within each set of reqs for given kwargs, we reorder by token length, descending.
647
+ re_ords[key] = utils.Reorderer([req.args for req in reqs], _collate)
648
+
649
+ pbar = tqdm(total=len(requests), disable=(disable_tqdm or (self.rank != 0)))
650
+
651
+ # for each different set of kwargs, we execute all requests, by batch.
652
+ for key, re_ord in re_ords.items():
653
+ chunks = lm_eval.models.utils.chunks(
654
+ re_ord.get_reordered(), n=self.batch_size
655
+ )
656
+ for chunk in tqdm(chunks, disable=self.rank != 0):
657
+ contexts, all_gen_kwargs = zip(*chunk)
658
+ # we assume all gen kwargs in the batch are the same
659
+ # this is safe to assume because the `grouper` object ensures it.
660
+ gen_kwargs = all_gen_kwargs[0]
661
+ # unpack our keyword arguments.
662
+ until = None
663
+ if isinstance(gen_kwargs, dict):
664
+ kwargs = copy.deepcopy(gen_kwargs) # edge case for repeats > 1
665
+ if "until" in kwargs.keys():
666
+ until = kwargs.pop("until")
667
+ if isinstance(until, str):
668
+ until = [until]
669
+ elif not isinstance(until, list):
670
+ raise ValueError(
671
+ f"Expected `kwargs['until']` to be of type Union[str,list] but got {until}"
672
+ )
673
+ else:
674
+ raise ValueError(
675
+ f"Expected `kwargs` to be of type `dict` but got {kwargs}"
676
+ )
677
+ # add EOS token to stop sequences
678
+ eos = self.tok_decode(self.eot_token_id)
679
+ if not until:
680
+ until = [eos]
681
+ else:
682
+ until.append(eos)
683
+ if "max_gen_toks" in kwargs.keys():
684
+ max_gen_toks = kwargs.pop("max_gen_toks")
685
+ else:
686
+ max_gen_toks = self.max_gen_toks
687
+ # first stop sequence is used to halt generation upon encountering
688
+ primary_until = [until[0]]
689
+
690
+ max_ctx_len = self.max_length - max_gen_toks
691
+
692
+ # encode, pad, and truncate contexts for this batch
693
+ context_enc, attn_masks = self.tok_batch_encode(
694
+ contexts,
695
+ left_truncate_len=max_ctx_len,
696
+ truncation=self.truncation,
697
+ )
698
+ context_enc = context_enc.to(self.device)
699
+ attn_masks = attn_masks.to(self.device)
700
+
701
+ if "max_length" not in kwargs:
702
+ kwargs["max_length"] = context_enc.shape[1] + max_gen_toks
703
+
704
+ # perform batched generation
705
+ cont = self._model_generate(
706
+ context=context_enc,
707
+ attention_mask=attn_masks,
708
+ stop=primary_until,
709
+ **kwargs,
710
+ )
711
+
712
+ cont_toks_list = cont.tolist()
713
+ for cont_toks, context in zip(cont_toks_list, contexts):
714
+ # discard context + left-padding toks if using causal decoder-only LM
715
+ cont_toks = cont_toks[context_enc.shape[1] :]
716
+
717
+ s = self.tok_decode(cont_toks)
718
+
719
+ # use secondary stop seqs to cut off should-have-been-stopped content post-hoc
720
+ for term in until:
721
+ if len(term) > 0:
722
+ # ignore '' separator,
723
+ # for seq2seq case where self.tok_decode(self.eot_token_id) = ''
724
+ s = s.split(term)[0]
725
+
726
+ res[key].append(s)
727
+
728
+ self.cache_hook.add_partial(
729
+ "generate_until", (context, gen_kwargs), s
730
+ )
731
+ pbar.update(1)
732
+ # reorder this group of results back to original unsorted form
733
+ res[key] = re_ord.get_original(res[key])
734
+
735
+ pbar.close()
736
+
737
+ return grouper.get_original(res)
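
Of the pieces above, `wrap_constant_batch_size` is the easiest to reason about in isolation: it pads a short batch up to the compiled static batch size before calling the wrapped method, then trims the padding rows from the result. A self-contained sketch with a stand-in class, so it runs without any Neuron hardware:

```python
import torch

from lm_eval.models.neuron_optimum import wrap_constant_batch_size


class FakeStaticModel:
    """Stand-in with a fixed compiled batch size, used only to exercise the decorator."""

    batch_size = 4

    @wrap_constant_batch_size
    def _model_call(self, input_ids: torch.Tensor) -> torch.Tensor:
        # By the time we get here, the batch has been padded to the static size.
        assert input_ids.shape[0] == self.batch_size
        return input_ids.float()  # pretend these are logits


model = FakeStaticModel()
out = model._model_call(torch.ones((2, 5), dtype=torch.long))  # real batch of 2 < static 4
print(out.shape)  # torch.Size([2, 5]) -- the padding rows are trimmed off again
```
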
scripts/yans/lm-evaluation-harness/lm_eval/models/vllm_causallms.py ADDED
@@ -0,0 +1,540 @@
1
+ import copy
2
+ from importlib.metadata import version
3
+ from importlib.util import find_spec
4
+ from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Tuple, Union
5
+
6
+ from more_itertools import distribute
7
+ from packaging.version import parse as parse_version
8
+ from tqdm import tqdm
9
+
10
+ from lm_eval.api.instance import Instance
11
+ from lm_eval.api.model import TemplateLM
12
+ from lm_eval.api.registry import register_model
13
+ from lm_eval.models.utils import Collator, configure_pad_token, undistribute
14
+ from lm_eval.utils import (
15
+ eval_logger,
16
+ get_rolling_token_windows,
17
+ make_disjoint_window,
18
+ )
19
+
20
+
21
+ try:
22
+ import ray
23
+ from vllm import LLM, SamplingParams
24
+ from vllm.lora.request import LoRARequest
25
+ from vllm.transformers_utils.tokenizer import get_tokenizer
26
+ except ModuleNotFoundError:
27
+ pass
28
+
29
+ if TYPE_CHECKING:
30
+ pass
31
+
32
+ eval_logger = eval_logger
33
+
34
+
35
+ @register_model("vllm")
36
+ class VLLM(TemplateLM):
37
+ _DEFAULT_MAX_LENGTH = 2048
38
+
39
+ def __init__(
40
+ self,
41
+ pretrained: str,
42
+ dtype: Literal["float16", "bfloat16", "float32", "auto"] = "auto",
43
+ revision: Optional[str] = None,
44
+ trust_remote_code: Optional[bool] = False,
45
+ tokenizer: Optional[str] = None,
46
+ tokenizer_mode: Literal["auto", "slow"] = "auto",
47
+ tokenizer_revision: Optional[str] = None,
48
+ add_bos_token: Optional[bool] = False,
49
+ prefix_token_id: Optional[int] = None,
50
+ tensor_parallel_size: int = 1,
51
+ quantization: Optional[str] = None,
52
+ max_gen_toks: int = 256,
53
+ swap_space: int = 4,
54
+ batch_size: Union[str, int] = 1,
55
+ max_batch_size=None,
56
+ max_length: int = None,
57
+ max_model_len: int = None,
58
+ seed: int = 1234,
59
+ gpu_memory_utilization: float = 0.9,
60
+ device: str = "cuda",
61
+ data_parallel_size: int = 1,
62
+ lora_local_path: str = None,
63
+ **kwargs,
64
+ ):
65
+ super().__init__()
66
+
67
+ if not find_spec("vllm"):
68
+ raise Exception(
69
+ "attempted to use 'vllm' LM type, but package `vllm` is not installed. "
70
+ "Please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
71
+ )
72
+
73
+ assert "cuda" in device or device is None, "vLLM only supports CUDA"
74
+ assert (
75
+ max_length is None or max_model_len is None
76
+ ), "Either max_length or max_model_len may be provided, but not both"
77
+
78
+ self._max_length = max_model_len if max_model_len is not None else max_length
79
+ self.tensor_parallel_size = int(tensor_parallel_size)
80
+ self.data_parallel_size = int(data_parallel_size)
81
+ self.model_args = {
82
+ "model": pretrained,
83
+ "gpu_memory_utilization": float(gpu_memory_utilization),
84
+ "revision": revision,
85
+ "dtype": dtype,
86
+ "tokenizer": tokenizer,
87
+ "tokenizer_mode": tokenizer_mode,
88
+ "tokenizer_revision": tokenizer_revision,
89
+ "trust_remote_code": trust_remote_code,
90
+ "tensor_parallel_size": int(tensor_parallel_size),
91
+ "max_model_len": int(self._max_length) if self._max_length else None,
92
+ "swap_space": int(swap_space),
93
+ "quantization": quantization,
94
+ "seed": int(seed),
95
+ }
96
+ self.model_args.update(kwargs)
97
+ self.batch_size = (
98
+ "auto"
99
+ if isinstance(batch_size, str) and "auto" in batch_size
100
+ else batch_size
101
+ )
102
+ if self.data_parallel_size <= 1:
103
+ self.model = LLM(**self.model_args)
104
+ else:
105
+ eval_logger.warning(
106
+ "You might experience occasional issues with model weight downloading when data_parallel is in use. To ensure stable performance, run with data_parallel_size=1 until the weights are downloaded and cached."
107
+ )
108
+ self.model_args["worker_use_ray"] = True
109
+ self.batch_size = "auto"
110
+ eval_logger.info("Manual batching is not compatible with data parallelism.")
111
+
112
+ from transformers import AutoConfig
113
+
114
+ self._config = AutoConfig.from_pretrained(
115
+ pretrained, trust_remote_code=trust_remote_code, revision=revision
116
+ )
117
+ self.tokenizer = get_tokenizer(
118
+ tokenizer if tokenizer else pretrained,
119
+ tokenizer_mode=tokenizer_mode,
120
+ trust_remote_code=trust_remote_code,
121
+ tokenizer_revision=tokenizer_revision,
122
+ )
123
+ self.tokenizer = configure_pad_token(self.tokenizer)
124
+ self.add_bos_token = add_bos_token
125
+ if "gemma" in pretrained.lower():
126
+ self.add_bos_token = True
127
+ eval_logger.info(
128
+ "Found 'gemma' in model name, a BOS token will be used as Gemma series models underperform without it."
129
+ )
130
+
131
+ self.custom_prefix_token_id = prefix_token_id
132
+ if prefix_token_id is not None:
133
+ eval_logger.info(
134
+ f"Loglikelihood prefix token id used in evaluation: {self.prefix_token_id}"
135
+ )
136
+
137
+ self._max_gen_toks = max_gen_toks
138
+
139
+ if lora_local_path is not None:
140
+ assert parse_version(version("vllm")) > parse_version(
141
+ "0.3.0"
142
+ ), "lora adapters only compatible with vllm > v0.3.0."
143
+ self.lora_request = LoRARequest("finetuned", 1, lora_local_path)
144
+ else:
145
+ self.lora_request = None
146
+
147
+ @property
148
+ def eot_token_id(self):
149
+ # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
150
+ return self.tokenizer.eos_token_id
151
+
152
+ @property
153
+ def prefix_token_id(self):
154
+ # it is used as prefix for loglikelihood
155
+ if self.custom_prefix_token_id is not None:
156
+ return self.custom_prefix_token_id
157
+ if self.tokenizer.bos_token_id is not None:
158
+ return self.tokenizer.bos_token_id
159
+ return self.tokenizer.eos_token_id
160
+
161
+ @property
162
+ def max_length(self):
163
+ if self._max_length: # if max length manually set, return it
164
+ return self._max_length
165
+ if self.data_parallel_size <= 1:
166
+ return self.model.llm_engine.model_config.max_model_len
167
+ else:
168
+ seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
169
+ for attr in seqlen_config_attrs:
170
+ if hasattr(self._config, attr):
171
+ return getattr(self._config, attr)
172
+ if hasattr(self.tokenizer, "model_max_length"):
173
+ if self.tokenizer.model_max_length == 1000000000000000019884624838656:
174
+ return self._DEFAULT_MAX_LENGTH
175
+ return self.tokenizer.model_max_length
176
+ return self._DEFAULT_MAX_LENGTH
177
+
178
+ @property
179
+ def max_gen_toks(self):
180
+ return self._max_gen_toks
181
+
182
+ def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
183
+ """
184
+ Method to apply a chat template to a list of chat history between user and model.
185
+ """
186
+ return self.tokenizer.apply_chat_template(
187
+ chat_history, tokenize=False, add_generation_prompt=True
188
+ )
189
+
190
+ @property
191
+ def chat_template(self) -> str:
192
+ if self.tokenizer.chat_template is not None:
193
+ return self.tokenizer.chat_template
194
+ return self.tokenizer.default_chat_template
195
+
196
+ @property
197
+ def tokenizer_name(self) -> str:
198
+ return self.tokenizer.name_or_path.replace("/", "__")
199
+
200
+ def tok_encode(
201
+ self,
202
+ string: Union[str, List[str]],
203
+ left_truncate_len: int = None,
204
+ add_special_tokens: bool = False,
205
+ truncation: bool = False,
206
+ ) -> Union[List[int], List[List[int]]]:
207
+ if not add_special_tokens:
208
+ add_special_tokens = False or self.add_bos_token
209
+ encoding: Union[List[List[int]], List[int]] = self.tokenizer(
210
+ string,
211
+ add_special_tokens=add_special_tokens,
212
+ truncation=truncation,
213
+ return_attention_mask=False,
214
+ ).input_ids
215
+
216
+ # left-truncate the encoded context to be at most `left_truncate_len` tokens long
217
+ if left_truncate_len:
218
+ if not isinstance(string, str):
219
+ encoding = [enc[-left_truncate_len:] for enc in encoding]
220
+ else:
221
+ encoding = encoding[-left_truncate_len:]
222
+
223
+ return encoding
224
+
225
+ def _model_generate(
226
+ self,
227
+ requests: List[List[int]] = None,
228
+ generate: bool = False,
229
+ max_tokens: int = None,
230
+ stop: Optional[List[str]] = None,
231
+ **kwargs,
232
+ ):
233
+ if generate:
234
+ kwargs = self.modify_gen_kwargs(kwargs)
235
+ sampling_params = SamplingParams(max_tokens=max_tokens, stop=stop, **kwargs)
236
+ else:
237
+ sampling_params = SamplingParams(
238
+ temperature=0, prompt_logprobs=1, max_tokens=1, detokenize=False
239
+ )
240
+ if self.data_parallel_size > 1:
241
+ # vLLM hangs if tensor_parallel > 1 and resources are set in ray.remote
242
+ # also seems to only work with decorator and not with ray.remote() fn
243
+ # see https://github.com/vllm-project/vllm/issues/973
244
+ # note: this has changed on 0.3.3, and it only works now if num_gpus are set.
245
+ # but then tensor_parallel breaks
246
+ @ray.remote
247
+ def run_inference_one_model(
248
+ model_args: dict, sampling_params, requests: List[List[int]]
249
+ ):
250
+ llm = LLM(**model_args)
251
+ return llm.generate(
252
+ prompt_token_ids=requests, sampling_params=sampling_params
253
+ )
254
+
255
+ # dispatch requests to all self.data_parallel_size workers, in interleaved fashion
256
+ # interleaved important to balance context lengths across workers
257
+ requests = [list(x) for x in distribute(self.data_parallel_size, requests)]
258
+ inputs = ((self.model_args, sampling_params, req) for req in requests)
259
+ object_refs = [run_inference_one_model.remote(*x) for x in inputs]
260
+ results = ray.get(object_refs)
261
+ # Invoke ray.shutdown() to prevent hang-ups if subsequent calls required.
262
+ ray.shutdown()
263
+ # flatten results
264
+ return undistribute(results)
265
+
266
+ if self.lora_request is not None:
267
+ outputs = self.model.generate(
268
+ prompt_token_ids=requests,
269
+ sampling_params=sampling_params,
270
+ use_tqdm=True if self.batch_size == "auto" else False,
271
+ lora_request=self.lora_request,
272
+ )
273
+ else:
274
+ outputs = self.model.generate(
275
+ prompt_token_ids=requests,
276
+ sampling_params=sampling_params,
277
+ use_tqdm=True if self.batch_size == "auto" else False,
278
+ )
279
+ return outputs
280
+
281
+ def loglikelihood_rolling(
282
+ self, requests: List[Instance], disable_tqdm: bool = False
283
+ ) -> List[float]:
284
+ loglikelihoods = []
285
+
286
+ for (string,) in tqdm([req.args for req in requests], disable=disable_tqdm):
287
+ rolling_token_windows = list(
288
+ map(
289
+ make_disjoint_window,
290
+ get_rolling_token_windows(
291
+ token_list=self.tok_encode(string),
292
+ prefix_token=self.eot_token_id,
293
+ max_seq_len=self.max_length - 1,
294
+ context_len=1,
295
+ ),
296
+ )
297
+ )
298
+
299
+ rolling_token_windows = [(None,) + x for x in rolling_token_windows]
300
+
301
+ string_nll = self._loglikelihood_tokens(
302
+ rolling_token_windows,
303
+ )
304
+
305
+ # discard is_greedy
306
+ string_nll = [x[0] for x in string_nll]
307
+
308
+ string_nll = sum(string_nll)
309
+ loglikelihoods.append(string_nll)
310
+ return loglikelihoods
311
+
312
+ def generate_until(
313
+ self, requests: List[Instance], disable_tqdm: bool = False
314
+ ) -> List[str]:
315
+ res = []
316
+
317
+ # batch tokenize contexts
318
+ context, all_gen_kwargs = zip(*(req.args for req in requests))
319
+ context_encoding: List[List[int]] = self.tok_encode(
320
+ context, add_special_tokens=self.add_bos_token
321
+ )
322
+ requests = [
323
+ ((a, b), c) for a, b, c in zip(context, context_encoding, all_gen_kwargs)
324
+ ]
325
+
326
+ def _collate_gen(_requests):
327
+ # the negative sign on len(toks) sorts descending - this has a few advantages:
328
+ # - time estimates will always be over not underestimates, which is more useful for planning
329
+ # - to know the size of a batch when going through the list, you know the first one is always the batch
330
+ # padded context length. this is useful to simplify the batching logic and more importantly to make
331
+ # automatic adaptive batches much much easier to implement
332
+ # - any OOMs will happen right away rather than near the end
333
+ return -len(_requests[0][1]), _requests[0][0]
334
+
335
+ # we group requests by their generation_kwargs,
336
+ # so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
337
+ # in the same batch.
338
+ re_ords = Collator(requests, _collate_gen, group_by="gen_kwargs")
339
+ chunks = re_ords.get_batched(
340
+ n=int(self.batch_size) if self.batch_size != "auto" else 0, batch_fn=None
341
+ )
342
+
343
+ pbar = tqdm(
344
+ total=len(requests),
345
+ disable=(disable_tqdm or (self.rank != 0)),
346
+ desc="Running generate_until requests",
347
+ )
348
+ # for each different set of kwargs, we execute all requests, by batch.
349
+ for chunk in chunks:
350
+ context_and_encoding, all_gen_kwargs = zip(*chunk)
351
+ context, context_encoding = zip(*context_and_encoding)
352
+ # we assume all gen kwargs in the batch are the same
353
+ # this is safe to assume because the `grouper` object ensures it.
354
+ gen_kwargs = all_gen_kwargs[0]
355
+ # unpack our keyword arguments.
356
+ until = None
357
+ if isinstance(gen_kwargs, dict):
358
+ kwargs = copy.deepcopy(gen_kwargs) # edge case for repeats > 1
359
+ if "until" in kwargs.keys():
360
+ until = kwargs.pop("until")
361
+ if isinstance(until, str):
362
+ until = [until]
363
+ elif not isinstance(until, list):
364
+ raise ValueError(
365
+ f"Expected `kwargs['until']` to be of type Union[str,list] but got {until}"
366
+ )
367
+ else:
368
+ raise ValueError(
369
+ f"Expected `kwargs` to be of type `dict` but got {gen_kwargs}"
370
+ )
371
+ # add EOS token to stop sequences
372
+ eos = self.tokenizer.decode(self.eot_token_id)
373
+ if not until:
374
+ until = [eos]
375
+ else:
376
+ until.append(eos)
377
+ if "max_gen_toks" in kwargs.keys():
378
+ max_gen_toks = kwargs.pop("max_gen_toks")
379
+ else:
380
+ max_gen_toks = self.max_gen_toks
381
+
382
+ # set the max length in tokens of inputs ("context_enc")
383
+ # max len for inputs = max length, minus room to generate the max new tokens
384
+ max_ctx_len = self.max_length - max_gen_toks
385
+ context_encoding = [x[-max_ctx_len:] for x in context_encoding]
386
+
387
+ # perform batched generation
388
+ cont = self._model_generate(
389
+ requests=context_encoding,
390
+ generate=True,
391
+ max_tokens=max_gen_toks,
392
+ stop=until,
393
+ **kwargs,
394
+ )
395
+
396
+ # cache generations
397
+ for output, context in zip(cont, context):
398
+ generated_text = output.outputs[0].text
399
+ res.append(generated_text)
400
+ self.cache_hook.add_partial(
401
+ "generate_until", (context, gen_kwargs), generated_text
402
+ )
403
+ pbar.update(1)
404
+
405
+ pbar.close()
406
+ # reorder all group of results back to original unsorted form
407
+ return re_ords.get_original(res)
408
+
409
+ def _loglikelihood_tokens(
410
+ self,
411
+ requests: List[Tuple[Tuple[str, str], List[int], List[int]]],
412
+ disable_tqdm: bool = False,
413
+ ) -> List[Tuple[float, bool]]:
414
+ res = []
415
+
416
+ def _collate(x):
417
+ toks = x[1] + x[2]
418
+ return -len(toks), tuple(toks)
419
+
420
+ # Reorder requests by length and batch
421
+ re_ord = Collator(requests, sort_fn=_collate)
422
+ chunks = re_ord.get_batched(
423
+ n=int(self.batch_size) if self.batch_size != "auto" else 0, batch_fn=None
424
+ )
425
+
426
+ pbar = tqdm(
427
+ total=len(requests),
428
+ disable=disable_tqdm,
429
+ desc="Running loglikelihood requests",
430
+ )
431
+ for chunk in chunks:
432
+ inputs = []
433
+ ctxlens = []
434
+ for cache_key, context_enc, continuation_enc in chunk:
435
+ inp = (context_enc + continuation_enc)[-(self.max_length) :]
436
+ ctxlen = len(context_enc) - max(
437
+ 0, len(context_enc) + len(continuation_enc) - (self.max_length)
438
+ )
439
+
440
+ inputs.append(inp)
441
+ ctxlens.append(ctxlen)
442
+
443
+ outputs = self._model_generate(requests=inputs, generate=False)
444
+
445
+ for output, ctxlen, (cache_key, _, _), inp in zip(
446
+ outputs, ctxlens, chunk, inputs
447
+ ):
448
+ answer = self._parse_logprobs(
449
+ tokens=inp,
450
+ outputs=output,
451
+ ctxlen=ctxlen,
452
+ )
453
+
454
+ res.append(answer)
455
+
456
+ # partial caching
457
+ if cache_key is not None:
458
+ self.cache_hook.add_partial("loglikelihood", cache_key, answer)
459
+ pbar.update(1)
460
+ pbar.close()
461
+ return re_ord.get_original(res)
462
+
463
+ @staticmethod
464
+ def _parse_logprobs(tokens: List, outputs, ctxlen: int) -> Tuple[float, bool]:
465
+ """Process logprobs and tokens.
466
+
467
+ :param tokens: list
468
+ Input tokens (potentially left-truncated)
469
+ :param outputs: RequestOutput
470
+ Contains prompt_logprobs
471
+ :param ctxlen: int
472
+ Length of context (so we can slice them away and only keep the predictions)
473
+ :return:
474
+ continuation_logprobs: float
475
+ Log probabilities of continuation tokens
476
+ is_greedy: bool
477
+ Whether argmax matches given continuation exactly
478
+ """
479
+
480
+ # The first entry of prompt_logprobs is None because the model has no previous tokens to condition on.
481
+ continuation_logprobs_dicts = outputs.prompt_logprobs
482
+
483
+ def coerce_logprob_to_num(logprob):
484
+ # vLLM changed the return type of logprobs from float
485
+ # to a Logprob object storing the float value + extra data
486
+ # (https://github.com/vllm-project/vllm/pull/3065).
487
+ # If we are dealing with vllm's Logprob object, return
488
+ # the logprob value stored as an attribute. Otherwise,
489
+ # return the object itself (which should be a float
490
+ # for older versions of vLLM).
491
+ return getattr(logprob, "logprob", logprob)
492
+
493
+ continuation_logprobs_dicts = [
494
+ {
495
+ token: coerce_logprob_to_num(logprob)
496
+ for token, logprob in logprob_dict.items()
497
+ }
498
+ if logprob_dict is not None
499
+ else None
500
+ for logprob_dict in continuation_logprobs_dicts
501
+ ]
502
+
503
+ # Calculate continuation_logprobs
504
+ # assume ctxlen always >= 1
505
+ continuation_logprobs = sum(
506
+ logprob_dict.get(token)
507
+ for token, logprob_dict in zip(
508
+ tokens[ctxlen:], continuation_logprobs_dicts[ctxlen:]
509
+ )
510
+ )
511
+
512
+ # Determine if is_greedy
513
+ is_greedy = True
514
+ for token, logprob_dict in zip(
515
+ tokens[ctxlen:], continuation_logprobs_dicts[ctxlen:]
516
+ ):
517
+ # Get the token with the maximum log probability from the logprob_dict
518
+ if logprob_dict: # Ensure the logprob_dict is not None
519
+ top_token = max(logprob_dict, key=logprob_dict.get)
520
+ if top_token != token:
521
+ is_greedy = False
522
+ break
523
+
524
+ return continuation_logprobs, is_greedy
525
+
526
+ @staticmethod
527
+ def modify_gen_kwargs(kwargs: dict) -> dict:
528
+ # sampling_params
529
+ do_sample = kwargs.pop("do_sample", None)
530
+ if do_sample is False and "temperature" not in kwargs:
531
+ eval_logger.debug(
532
+ "Got `do_sample=False` and no temperature value, setting VLLM temperature to 0.0 ..."
533
+ )
534
+ kwargs["temperature"] = 0.0
535
+ # hf defaults
536
+ kwargs["skip_special_tokens"] = kwargs.get("skip_special_tokens", False)
537
+ kwargs["spaces_between_special_tokens"] = kwargs.get(
538
+ "spaces_between_special_tokens", False
539
+ )
540
+ return kwargs
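For reference, a minimal sketch of exercising the `vllm` model type registered above through the harness's Python API. The checkpoint and task names are illustrative placeholders, and this assumes `vllm` and the harness are installed:

```python
import lm_eval

# Run one task against the "vllm" model type; model_args mirrors the constructor kwargs above.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=EleutherAI/pythia-1.4b,dtype=auto,gpu_memory_utilization=0.9",
    tasks=["lambada_openai"],
    batch_size="auto",
)
print(results["results"])
```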
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/bleu.py ADDED
@@ -0,0 +1,241 @@
1
+ #!/usr/bin/python
2
+ import math
3
+ import re
4
+ import sys
5
+ import xml.sax.saxutils
6
+ from typing import Any, Dict, List, Optional, Pattern, Tuple, Union
7
+
8
+
9
+ """
10
+ This script was adapted from the original version by hieuhoang1972 which is part of MOSES.
11
+ """
12
+
13
+ # $Id: bleu.py 1307 2007-03-14 22:22:36Z hieuhoang1972 $
14
+
15
+ """Provides:
16
+
17
+ cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test().
18
+ cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked().
19
+ score_cooked(alltest, n=4): Score a list of cooked test sentences.
20
+
21
+ score_set(s, testid, refids, n=4): Interface with dataset.py; calculate BLEU score of testid against refids.
22
+
23
+ The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible.
24
+ """
25
+
26
+ # Added to bypass NIST-style pre-processing of hyp and ref files -- wade
27
+ nonorm = 0
28
+
29
+ preserve_case = False
30
+ eff_ref_len = "shortest"
31
+
32
+ normalize1: List[Tuple[Union[Pattern[str], str], str]] = [
33
+ ("<skipped>", ""), # strip "skipped" tags
34
+ (r"-\n", ""), # strip end-of-line hyphenation and join lines
35
+ (r"\n", " "), # join lines
36
+ # (r'(\d)\s+(?=\d)', r'\1'), # join digits
37
+ ]
38
+ normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1]
39
+
40
+ normalize2: List[Tuple[Union[Pattern[str], str], str]] = [
41
+ (
42
+ r"([\{-\~\[-\` -\&\(-\+\:-\@\/])",
43
+ r" \1 ",
44
+ ), # tokenize punctuation. apostrophe is missing
45
+ (
46
+ r"([^0-9])([\.,])",
47
+ r"\1 \2 ",
48
+ ), # tokenize period and comma unless preceded by a digit
49
+ (
50
+ r"([\.,])([^0-9])",
51
+ r" \1 \2",
52
+ ), # tokenize period and comma unless followed by a digit
53
+ (r"([0-9])(-)", r"\1 \2 "), # tokenize dash when preceded by a digit
54
+ ]
55
+ normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2]
56
+
57
+
58
+ def normalize(s):
59
+ """Normalize and tokenize text. This is lifted from NIST mteval-v11a.pl."""
60
+ # Added to bypass NIST-style pre-processing of hyp and ref files -- wade
61
+ if nonorm:
62
+ return s.split()
63
+ if not isinstance(s, str):
64
+ s = " ".join(s)
65
+ # language-independent part:
66
+ for pattern, replace in normalize1:
67
+ s = re.sub(pattern, replace, s)
68
+ s = xml.sax.saxutils.unescape(s, {"&quot;": '"'})
69
+ # language-dependent part (assuming Western languages):
70
+ s = " %s " % s
71
+ if not preserve_case:
72
+ s = s.lower() # this might not be identical to the original
73
+ for pattern, replace in normalize2:
74
+ s = re.sub(pattern, replace, s)
75
+ return s.split()
76
+
77
+
78
+ def count_ngrams(words, n=4):
79
+ counts: Dict[Any, int] = {}
80
+ for k in range(1, n + 1):
81
+ for i in range(len(words) - k + 1):
82
+ ngram = tuple(words[i : i + k])
83
+ counts[ngram] = counts.get(ngram, 0) + 1
84
+ return counts
85
+
86
+
87
+ def cook_refs(refs, n=4):
88
+ """Takes a list of reference sentences for a single segment
89
+ and returns an object that encapsulates everything that BLEU
90
+ needs to know about them."""
91
+
92
+ refs = [normalize(ref) for ref in refs]
93
+ maxcounts: Dict[Tuple[str], int] = {}
94
+ for ref in refs:
95
+ counts = count_ngrams(ref, n)
96
+ for ngram, count in counts.items():
97
+ maxcounts[ngram] = max(maxcounts.get(ngram, 0), count)
98
+ return ([len(ref) for ref in refs], maxcounts)
99
+
100
+
101
+ def cook_test(test, item, n=4):
102
+ """Takes a test sentence and returns an object that
103
+ encapsulates everything that BLEU needs to know about it."""
104
+ (reflens, refmaxcounts) = item
105
+ test = normalize(test)
106
+ result: Dict[str, Any] = {}
107
+ result["testlen"] = len(test)
108
+
109
+ # Calculate effective reference sentence length.
110
+
111
+ if eff_ref_len == "shortest":
112
+ result["reflen"] = min(reflens)
113
+ elif eff_ref_len == "average":
114
+ result["reflen"] = float(sum(reflens)) / len(reflens)
115
+ elif eff_ref_len == "closest":
116
+ min_diff: Optional[int] = None
117
+ for reflen in reflens:
118
+ if min_diff is None or abs(reflen - len(test)) < min_diff:
119
+ min_diff = abs(reflen - len(test))
120
+ result["reflen"] = reflen
121
+
122
+ result["guess"] = [max(len(test) - k + 1, 0) for k in range(1, n + 1)]
123
+
124
+ result["correct"] = [0] * n
125
+ counts = count_ngrams(test, n)
126
+ for ngram, count in counts.items():
127
+ result["correct"][len(ngram) - 1] += min(refmaxcounts.get(ngram, 0), count)
128
+
129
+ return result
130
+
131
+
132
+ def score_cooked(allcomps, n=4, ground=0, smooth=1):
133
+ totalcomps: Dict[str, Any] = {
134
+ "testlen": 0,
135
+ "reflen": 0,
136
+ "guess": [0] * n,
137
+ "correct": [0] * n,
138
+ }
139
+ for comps in allcomps:
140
+ for key in ["testlen", "reflen"]:
141
+ totalcomps[key] += comps[key]
142
+ for key in ["guess", "correct"]:
143
+ for k in range(n):
144
+ totalcomps[key][k] += comps[key][k]
145
+ logbleu = 0.0
146
+ all_bleus: List[float] = []
147
+ for k in range(n):
148
+ correct = totalcomps["correct"][k]
149
+ guess = totalcomps["guess"][k]
150
+ addsmooth = 0
151
+ if smooth == 1 and k > 0:
152
+ addsmooth = 1
153
+ logbleu += math.log(correct + addsmooth + sys.float_info.min) - math.log(
154
+ guess + addsmooth + sys.float_info.min
155
+ )
156
+ if guess == 0:
157
+ all_bleus.append(-10000000.0)
158
+ else:
159
+ all_bleus.append(math.log(correct + sys.float_info.min) - math.log(guess))
160
+
161
+ logbleu /= float(n)
162
+ all_bleus.insert(0, logbleu)
163
+
164
+ brevPenalty = min(
165
+ 0, 1 - float(totalcomps["reflen"] + 1) / (totalcomps["testlen"] + 1)
166
+ )
167
+ for i in range(len(all_bleus)):
168
+ if i == 0:
169
+ all_bleus[i] += brevPenalty
170
+ all_bleus[i] = math.exp(all_bleus[i])
171
+ return all_bleus
172
+
173
+
174
+ def bleu(refs, candidate, ground=0, smooth=1):
175
+ refs = cook_refs(refs)
176
+ test = cook_test(candidate, refs)
177
+ return score_cooked([test], ground=ground, smooth=smooth)
178
+
179
+
180
+ def splitPuncts(line):
181
+ return " ".join(re.findall(r"[\w]+|[^\s\w]", line))
182
+
183
+
184
+ def computeMaps(predictions, goldfile):
185
+ predictionMap: Dict[str, list] = {}
186
+ goldMap: Dict[str, list] = {}
187
+ gf = open(goldfile, "r", encoding="utf-8")
188
+
189
+ for row in predictions:
190
+ cols = row.strip().split("\t")
191
+ if len(cols) == 1:
192
+ (rid, pred) = (cols[0], "")
193
+ else:
194
+ (rid, pred) = (cols[0], cols[1])
195
+ predictionMap[rid] = [splitPuncts(pred.strip().lower())]
196
+
197
+ for row in gf:
198
+ (rid, pred) = row.split("\t")
199
+ if rid in predictionMap: # Only insert if the id exists for the method
200
+ if rid not in goldMap:
201
+ goldMap[rid] = []
202
+ goldMap[rid].append(splitPuncts(pred.strip().lower()))
203
+
204
+ sys.stderr.write("Total: " + str(len(goldMap)) + "\n")
205
+ return (goldMap, predictionMap)
206
+
207
+
208
+ # m1 is the reference map
209
+ # m2 is the prediction map
210
+ def bleuFromMaps(m1, m2):
211
+ score = [0] * 5
212
+ num = 0.0
213
+
214
+ for key in m1:
215
+ if key in m2:
216
+ bl = bleu(m1[key], m2[key][0])
217
+ score = [score[i] + bl[i] for i in range(0, len(bl))]
218
+ num += 1
219
+ return [s * 100.0 / num for s in score]
220
+
221
+
222
+ def smoothed_bleu_4(references, predictions, **kwargs):
223
+ predictionMap = {}
224
+ goldMap = {}
225
+
226
+ for rid, pred in enumerate(predictions):
227
+ predictionMap[rid] = [splitPuncts(pred.strip().lower())]
228
+
229
+ for rid, row in enumerate(references):
230
+ goldMap[rid] = [splitPuncts(row.strip().lower())]
231
+
232
+ return bleuFromMaps(goldMap, predictionMap)[0]
233
+
234
+
235
+ if __name__ == "__main__":
236
+ reference_file = sys.argv[1]
237
+ predictions = []
238
+ for row in sys.stdin:
239
+ predictions.append(row)
240
+ (goldMap, predictionMap) = computeMaps(predictions, reference_file)
241
+ print(bleuFromMaps(goldMap, predictionMap)[0])
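A small usage sketch of the `smoothed_bleu_4` metric defined above; the reference and prediction strings are invented for illustration:

```python
refs = ["returns the sum of two numbers ."]
preds = ["return the sum of two numbers ."]

# smoothed_bleu_4 expects parallel lists of reference and prediction strings
# and returns a smoothed BLEU-4 score scaled to the 0-100 range.
score = smoothed_bleu_4(refs, preds)
print(round(score, 2))
```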
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/go.yaml ADDED
@@ -0,0 +1,21 @@
1
+ group:
2
+ - codexglue_code2text
3
+ task: code2text_go
4
+ dataset_path: CM/codexglue_code2text_go
5
+ training_split: train
6
+ validation_split: validation
7
+ test_split: test
8
+ output_type: generate_until
9
+ generation_kwargs:
10
+ num_beams: 10
11
+ max_gen_toks: 128
12
+ until:
13
+ - "</s>"
14
+ doc_to_text: !function utils.doc_to_text
15
+ doc_to_target: !function utils.doc_to_target
16
+ metric_list:
17
+ - metric: !function bleu.smoothed_bleu_4
18
+ aggregation: mean
19
+ higher_is_better: True
20
+ metadata:
21
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/java.yaml ADDED
@@ -0,0 +1,21 @@
1
+ group:
2
+ - codexglue_code2text
3
+ task: code2text_java
4
+ dataset_path: CM/codexglue_code2text_java
5
+ training_split: train
6
+ validation_split: validation
7
+ test_split: test
8
+ output_type: generate_until
9
+ generation_kwargs:
10
+ num_beams: 10
11
+ max_gen_toks: 128
12
+ until:
13
+ - "</s>"
14
+ doc_to_text: !function utils.doc_to_text
15
+ doc_to_target: !function utils.doc_to_target
16
+ metric_list:
17
+ - metric: !function bleu.smoothed_bleu_4
18
+ aggregation: mean
19
+ higher_is_better: True
20
+ metadata:
21
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/javascript.yaml ADDED
@@ -0,0 +1,21 @@
1
+ group:
2
+ - codexglue_code2text
3
+ task: code2text_javascript
4
+ dataset_path: CM/codexglue_code2text_javascript
5
+ training_split: train
6
+ validation_split: validation
7
+ test_split: test
8
+ output_type: generate_until
9
+ generation_kwargs:
10
+ num_beams: 10
11
+ max_gen_toks: 128
12
+ until:
13
+ - "</s>"
14
+ doc_to_text: !function utils.doc_to_text
15
+ doc_to_target: !function utils.doc_to_target
16
+ metric_list:
17
+ - metric: !function bleu.smoothed_bleu_4
18
+ aggregation: mean
19
+ higher_is_better: True
20
+ metadata:
21
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/php.yaml ADDED
@@ -0,0 +1,21 @@
1
+ group:
2
+ - codexglue_code2text
3
+ task: code2text_php
4
+ dataset_path: CM/codexglue_code2text_php
5
+ training_split: train
6
+ validation_split: validation
7
+ test_split: test
8
+ output_type: generate_until
9
+ generation_kwargs:
10
+ num_beams: 10
11
+ max_gen_toks: 128
12
+ until:
13
+ - "</s>"
14
+ doc_to_text: !function utils.doc_to_text
15
+ doc_to_target: !function utils.doc_to_target
16
+ metric_list:
17
+ - metric: !function bleu.smoothed_bleu_4
18
+ aggregation: mean
19
+ higher_is_better: True
20
+ metadata:
21
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/python.yaml ADDED
@@ -0,0 +1,21 @@
1
+ group:
2
+ - codexglue_code2text
3
+ task: code2text_python
4
+ dataset_path: CM/codexglue_code2text_python
5
+ training_split: train
6
+ validation_split: validation
7
+ test_split: test
8
+ output_type: generate_until
9
+ generation_kwargs:
10
+ num_beams: 10
11
+ max_gen_toks: 128
12
+ until:
13
+ - "</s>"
14
+ doc_to_text: !function utils.doc_to_text
15
+ doc_to_target: !function utils.doc_to_target
16
+ metric_list:
17
+ - metric: !function bleu.smoothed_bleu_4
18
+ aggregation: mean
19
+ higher_is_better: True
20
+ metadata:
21
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/ruby.yaml ADDED
@@ -0,0 +1,21 @@
1
+ group:
2
+ - codexglue_code2text
3
+ task: code2text_ruby
4
+ dataset_path: CM/codexglue_code2text_ruby
5
+ training_split: train
6
+ validation_split: validation
7
+ test_split: test
8
+ output_type: generate_until
9
+ generation_kwargs:
10
+ num_beams: 10
11
+ max_gen_toks: 128
12
+ until:
13
+ - "</s>"
14
+ doc_to_text: !function utils.doc_to_text
15
+ doc_to_target: !function utils.doc_to_target
16
+ metric_list:
17
+ - metric: !function bleu.smoothed_bleu_4
18
+ aggregation: mean
19
+ higher_is_better: True
20
+ metadata:
21
+ version: 3.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/code_x_glue/code-text/utils.py ADDED
@@ -0,0 +1,12 @@
1
+ def doc_to_text(doc):
2
+ inputs = " ".join(doc["code_tokens"]).replace("\n", " ")
3
+ inputs = " ".join(inputs.strip().split())
4
+
5
+ return inputs
6
+
7
+
8
+ def doc_to_target(doc):
9
+ targets = " ".join(doc["docstring_tokens"]).replace("\n", "")
10
+ targets = " ".join(targets.strip().split())
11
+
12
+ return targets
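A quick illustration of the two helpers above on a made-up document following the CodeXGLUE code-to-text field names:

```python
doc = {
    "code_tokens": ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"],
    "docstring_tokens": ["Add", "two", "numbers", "."],
}
print(doc_to_text(doc))    # def add ( a , b ) : return a + b
print(doc_to_target(doc))  # Add two numbers .
```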
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/README.md ADDED
@@ -0,0 +1,54 @@
1
+ # ETHICS Dataset
2
+
3
+ ### Paper
4
+
5
+ Aligning AI With Shared Human Values
6
+ https://arxiv.org/abs/2008.02275
7
+
8
+ The ETHICS dataset is a benchmark that spans concepts in justice, well-being,
9
+ duties, virtues, and commonsense morality. Models predict widespread moral
10
+ judgments about diverse text scenarios. This requires connecting physical and
11
+ social world knowledge to value judgements, a capability that may enable us
12
+ to steer chatbot outputs or eventually regularize open-ended reinforcement
13
+ learning agents.
14
+
15
+ Homepage: https://github.com/hendrycks/ethics
16
+
17
+ ### Citation
18
+
19
+ ```
20
+ @article{hendrycks2021ethics,
21
+ title={Aligning AI With Shared Human Values},
22
+ author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
23
+ journal={Proceedings of the International Conference on Learning Representations (ICLR)},
24
+ year={2021}
25
+ }
26
+ ```
27
+
28
+ ### Groups and Tasks
29
+
30
+ #### Groups
31
+
32
+ - `hendrycks_ethics`
33
+
34
+ #### Tasks
35
+
36
+ * `ethics_cm`
37
+ * `ethics_deontology`
38
+ * `ethics_justice`
39
+ * `ethics_utilitarianism`
40
+ * (MISSING) `ethics_utilitarianism_original`
41
+ * `ethics_virtue`
42
+
43
+ ### Checklist
44
+
45
+ * [x] Is the task an existing benchmark in the literature?
46
+ * [ ] Have you referenced the original paper that introduced the task?
47
+ * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
48
+
49
+
50
+ If other tasks on this dataset are already supported:
51
+ * [x] Is the "Main" variant of this task clearly denoted?
52
+ * [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
53
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
54
+ * [ ] Matches v0.3.0 of Eval Harness
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/commonsense.yaml ADDED
@@ -0,0 +1,15 @@
1
+ tag:
2
+ - hendrycks_ethics
3
+ task: ethics_cm
4
+ dataset_path: EleutherAI/hendrycks_ethics
5
+ dataset_name: commonsense
6
+ output_type: multiple_choice
7
+ training_split: train
8
+ test_split: test
9
+ doc_to_text: "{{input}}\nQuestion: Is this wrong?\nAnswer:"
10
+ doc_to_target: label
11
+ doc_to_choice: ['no', 'yes']
12
+ metric_list:
13
+ - metric: acc
14
+ metadata:
15
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/deontology.yaml ADDED
@@ -0,0 +1,9 @@
1
+ include: commonsense.yaml
2
+ task: ethics_deontology
3
+ dataset_name: deontology
4
+ doc_to_text: "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:"
5
+ doc_to_target: label
6
+ doc_to_choice: ['unreasonable', 'reasonable']
7
+ metadata:
8
+ version: 1.0
9
+ # TODO: implement exact-match metric for this subset
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/justice.yaml ADDED
@@ -0,0 +1,9 @@
1
+ include: deontology.yaml
2
+ tag:
3
+ - hendrycks_ethics
4
+ task: ethics_justice
5
+ dataset_name: justice
6
+ doc_to_text: "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:"
7
+ # TODO: impl. exact match for this and deontology
8
+ metadata:
9
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/utilitarianism.yaml ADDED
@@ -0,0 +1,12 @@
1
+ include: commonsense.yaml
2
+ tag:
3
+ - hendrycks_ethics
4
+ task: ethics_utilitarianism
5
+ dataset_name: utilitarianism
6
+ doc_to_text: !function utils.doc_to_text
7
+ doc_to_target: !function utils.doc_to_target
8
+ doc_to_choice: ['no', 'yes']
9
+ metric_list:
10
+ - metric: acc
11
+ metadata:
12
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/utilitarianism_original_yaml ADDED
@@ -0,0 +1,16 @@
1
+ # group:
2
+ # - hendrycks_ethics
3
+ # task: ethics_utilitarianism_original
4
+ # dataset_path: hails/hendrycks_ethics
5
+ # dataset_name: utilitarianism
6
+ # output_type: winograd_schema
7
+ # fewshot_split: null # TODO: implement a special fewshot split for this dataset subsets
8
+ # test_split: test
9
+ # template_aliases: #"{% set answer_choices = range(1, 11)|list %}"
10
+ # doc_to_text: 'Activity: "{{activity}}"\nRating:'
11
+ # doc_to_target: "{{answer_choices[label]}}"
12
+ # metric_list:
13
+ # - metric: acc
14
+ # TODO: we want this to be implemented as a winograd_schema task type, actually
15
+ # metadata:
16
+ # version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/utils.py ADDED
@@ -0,0 +1,25 @@
1
+ import random
2
+
3
+
4
+ ### Utils for `ethics_utilitarianism` task below
5
+ def _preproc_doc(doc):
6
+ rnd = random.Random(doc["activity"])
7
+ scenarios = [doc["activity"], doc["baseline"]]
8
+ ordering = [0, 1]
9
+ rnd.shuffle(ordering)
10
+ doc = {
11
+ "scenarios": [scenarios[ordering[0]], scenarios[ordering[1]]],
12
+ # The correct scenario is always first
13
+ "label": int(ordering.index(0) == 0),
14
+ }
15
+ return doc
16
+
17
+
18
+ def doc_to_text(doc) -> str:
19
+ doc = _preproc_doc(doc)
20
+ return f"Scenario 1: {doc['scenarios'][0]}\nScenario 2: {doc['scenarios'][1]}\nQuestion: Is Scenario 1 preferable?\nAnswer:"
21
+
22
+
23
+ def doc_to_target(doc):
24
+ doc = _preproc_doc(doc)
25
+ return doc["label"]
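An illustrative call with invented field values (field names follow the `utilitarianism` subset used above). Seeding the shuffle with `doc["activity"]` keeps the prompt and the target consistent between the two calls:

```python
doc = {
    "activity": "I baked a cake for my neighbor.",
    "baseline": "I baked a cake and then threw it away.",
}
print(doc_to_text(doc))    # Scenario 1 / Scenario 2 prompt with a deterministic ordering
print(doc_to_target(doc))  # 1 if the "activity" scenario landed in Scenario 1, else 0
```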
scripts/yans/lm-evaluation-harness/lm_eval/tasks/hendrycks_ethics/virtue.yaml ADDED
@@ -0,0 +1,10 @@
1
+ include: commonsense.yaml
2
+ tag:
3
+ - hendrycks_ethics
4
+ task: ethics_virtue
5
+ dataset_name: virtue
6
+ doc_to_text: "Sentence: {{scenario}}\nQuestion: Does the character in this sentence exhibit the trait \"{{trait}}\"?\nAnswer:"
7
+ doc_to_target: label
8
+ doc_to_choice: ['no', 'yes']
9
+ metadata:
10
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/mc_taco/README.md ADDED
@@ -0,0 +1,53 @@
1
+ # MC Taco
2
+
3
+ ### Paper
4
+
5
+ Title: `"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding`
6
+ Abstract: https://arxiv.org/abs/1909.03065
7
+
8
+ MC-TACO is a dataset of 13k question-answer pairs that require temporal commonsense
9
+ comprehension. The dataset contains five temporal properties, (1) duration (how long
10
+ an event takes), (2) temporal ordering (typical order of events), (3) typical time
11
+ (when an event occurs), (4) frequency (how often an event occurs), and (5) stationarity
12
+ (whether a state is maintained for a very long time or indefinitely).
13
+
14
+ WARNING: Running this task with a `--limit` arg will give misleading results! The
15
+ corresponding dataset is structured such that each multiple-choice-question gathered
16
+ by the authors is split into question-option pairs, where each such pair gets
17
+ siloed into an individual document for plausibility testing. Because the harness
18
+ shuffles these documents, setting `--limit` will likely "cut off" certain candidate
19
+ answers. This is a problem because the task's metrics require an exhaustive evaluation
20
+ of a question's options. See section 4 of the paper for details.
21
+
22
+ Homepage: https://leaderboard.allenai.org/mctaco/submissions/public
23
+
24
+
25
+ ### Citation
26
+
27
+ ```
28
+ BibTeX-formatted citation goes here
29
+ ```
30
+
31
+ ### Groups and Tasks
32
+
33
+ #### Groups
34
+
35
+ * Not part of a group yet.
36
+
37
+ #### Tasks
38
+
39
+ * `mc_taco`
40
+
41
+
42
+ ### Checklist
43
+
44
+ For adding novel benchmarks/datasets to the library:
45
+ * [ ] Is the task an existing benchmark in the literature?
46
+ * [ ] Have you referenced the original paper that introduced the task?
47
+ * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
48
+
49
+
50
+ If other tasks on this dataset are already supported:
51
+ * [ ] Is the "Main" variant of this task clearly denoted?
52
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
53
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/mc_taco/default.yaml ADDED
@@ -0,0 +1,15 @@
1
+ task: mc_taco
2
+ dataset_path: mc_taco
3
+ output_type: multiple_choice
4
+ validation_split: validation
5
+ test_split: test
6
+ doc_to_text: "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
7
+ doc_to_target: label
8
+ doc_to_choice: ["no", "yes"]
9
+ should_decontaminate: true
10
+ doc_to_decontamination_query: "{{question}} {{sentence}}"
11
+ metric_list:
12
+ - metric: acc
13
+ - metric: f1
14
+ metadata:
15
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/pubmedqa/README.md ADDED
@@ -0,0 +1,56 @@
1
+ # PubMedQA
2
+
3
+ ### Paper
4
+
5
+ Title: `PubMedQA: A Dataset for Biomedical Research Question Answering`
6
+
7
+ Abstract: https://arxiv.org/abs/1909.06146
8
+
9
+ PubMedQA is a novel biomedical question answering (QA) dataset collected from
10
+ PubMed abstracts. The task of PubMedQA is to answer research questions with
11
+ yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after
12
+ coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA
13
+ has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA
14
+ instances. Each PubMedQA instance is composed of (1) a question which is either
15
+ an existing research article title or derived from one, (2) a context which is
16
+ the corresponding abstract without its conclusion, (3) a long answer, which is
17
+ the conclusion of the abstract and, presumably, answers the research question,
18
+ and (4) a yes/no/maybe answer which summarizes the conclusion.
19
+
20
+ Homepage: https://pubmedqa.github.io/
21
+
22
+
23
+ ### Citation
24
+
25
+ ```
26
+ @inproceedings{jin2019pubmedqa,
27
+ title={PubMedQA: A Dataset for Biomedical Research Question Answering},
28
+ author={Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William and Lu, Xinghua},
29
+ booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
30
+ pages={2567--2577},
31
+ year={2019}
32
+ }
33
+ ```
34
+
35
+ ### Groups and Tasks
36
+
37
+ #### Groups
38
+
39
+ * Not part of a group yet
40
+
41
+ #### Tasks
42
+
43
+ * `pubmed_qa`
44
+
45
+ ### Checklist
46
+
47
+ For adding novel benchmarks/datasets to the library:
48
+ * [ ] Is the task an existing benchmark in the literature?
49
+ * [ ] Have you referenced the original paper that introduced the task?
50
+ * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
51
+
52
+
53
+ If other tasks on this dataset are already supported:
54
+ * [ ] Is the "Main" variant of this task clearly denoted?
55
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
56
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/pubmedqa/preprocess_pubmedqa.py ADDED
@@ -0,0 +1,6 @@
1
+ def doc_to_text(doc) -> str:
2
+ ctxs = "\n".join(doc["CONTEXTS"])
3
+ return "Abstract: {}\nQuestion: {}\nAnswer:".format(
4
+ ctxs,
5
+ doc["QUESTION"],
6
+ )
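For illustration, the prompt this helper builds from a made-up document (field names follow the `bigbio/pubmed_qa` schema used by the YAML below):

```python
doc = {
    "CONTEXTS": ["Statins lower perioperative inflammation.", "AF incidence was recorded after CABG."],
    "QUESTION": "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?",
}
print(doc_to_text(doc))
# Abstract: Statins lower perioperative inflammation.
# AF incidence was recorded after CABG.
# Question: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?
# Answer:
```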
scripts/yans/lm-evaluation-harness/lm_eval/tasks/pubmedqa/pubmedqa.yaml ADDED
@@ -0,0 +1,16 @@
1
+ task: pubmedqa
2
+ dataset_path: bigbio/pubmed_qa
3
+ dataset_name: pubmed_qa_labeled_fold0_source
4
+ output_type: multiple_choice
5
+ training_split: train
6
+ validation_split: validation
7
+ test_split: test
8
+ doc_to_text: !function preprocess_pubmedqa.doc_to_text
9
+ doc_to_target: final_decision
10
+ doc_to_choice: ["yes", "no", "maybe"]
11
+ metric_list:
12
+ - metric: acc
13
+ aggregation: mean
14
+ higher_is_better: true
15
+ metadata:
16
+ version: 1.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/qa4mre/README.md ADDED
@@ -0,0 +1,55 @@
1
+ # QA4MRE
2
+
3
+ ### Paper
4
+
5
+ Title: `QA4MRE 2011-2013: Overview of Question Answering for Machine Reading Evaluation`
6
+
7
+ Abstract: https://www.cs.cmu.edu/~./hovy/papers/13CLEF-QA4MRE.pdf
8
+
9
+ The (English only) QA4MRE challenge which was run as a Lab at CLEF 2011-2013.
10
+ The main objective of this exercise is to develop a methodology for evaluating
11
+ Machine Reading systems through Question Answering and Reading Comprehension
12
+ Tests. Systems should be able to extract knowledge from large volumes of text
13
+ and use this knowledge to answer questions. Four different tasks have been
14
+ organized during these years: Main Task, Processing Modality and Negation for
15
+ Machine Reading, Machine Reading of Biomedical Texts about Alzheimer's disease,
16
+ and Entrance Exam.
17
+
18
+ Homepage: http://nlp.uned.es/clef-qa/repository/qa4mre.php
19
+
20
+
21
+ ### Citation
22
+
23
+ ```
24
+ @inproceedings{Peas2013QA4MRE2O,
25
+ title={QA4MRE 2011-2013: Overview of Question Answering for Machine Reading Evaluation},
26
+ author={Anselmo Pe{\~n}as and Eduard H. Hovy and Pamela Forner and {\'A}lvaro Rodrigo and Richard F. E. Sutcliffe and Roser Morante},
27
+ booktitle={CLEF},
28
+ year={2013}
29
+ }
30
+ ```
31
+
32
+ ### Groups and Tasks
33
+
34
+ #### Groups
35
+
36
+ * `qa4mre`
37
+
38
+ #### Tasks
39
+
40
+ * `qa4mre_2011`
41
+ * `qa4mre_2012`
42
+ * `qa4mre_2013`
43
+
44
+ ### Checklist
45
+
46
+ For adding novel benchmarks/datasets to the library:
47
+ * [ ] Is the task an existing benchmark in the literature?
48
+ * [ ] Have you referenced the original paper that introduced the task?
49
+ * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
50
+
51
+
52
+ If other tasks on this dataset are already supported:
53
+ * [ ] Is the "Main" variant of this task clearly denoted?
54
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
55
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/qa4mre/preprocess_qa4mre.py ADDED
@@ -0,0 +1,6 @@
1
+ def qa4mre_process(doc):
2
+ return int(doc["correct_answer_id"]) - 1
3
+
4
+
5
+ def doc_to_target(doc):
6
+ return doc["answer_options"]["answer_str"][qa4mre_process(doc)]
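A made-up document showing the 1-based to 0-based index conversion these helpers perform:

```python
doc = {
    "correct_answer_id": "3",
    "answer_options": {"answer_str": ["Paris", "Rome", "Madrid", "Berlin"]},
}
print(qa4mre_process(doc))  # 2  (answer ids are 1-based, choice lists are 0-based)
print(doc_to_target(doc))   # Madrid
```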
scripts/yans/lm-evaluation-harness/lm_eval/tasks/qa4mre/qa4mre_2011.yaml ADDED
@@ -0,0 +1,22 @@
1
+ tag:
2
+ - qa4mre
3
+ task: qa4mre_2011
4
+ dataset_path: qa4mre
5
+ dataset_name: 2011.main.EN
6
+ output_type: multiple_choice
7
+ test_split: train
8
+ # doc_to_text: "{{document_str.strip()}}\nQuestion: {{question_str}}\nChoices:\n- {{answer_choices|join('\n- ')}}\nAnswer:"
9
+ doc_to_text: "{{document_str.strip()}}\nQuestion: {{question_str}}\nAnswer:"
10
+ doc_to_target: "{{correct_answer_id|int - 1}}"
11
+ doc_to_choice: "{{answer_options.answer_str}}"
12
+ should_decontaminate: true
13
+ doc_to_decontamination_query: "{{document_str.strip()}} + ' ' + {{question_str}}"
14
+ metric_list:
15
+ - metric: acc
16
+ aggregation: mean
17
+ higher_is_better: true
18
+ - metric: acc_norm
19
+ aggregation: mean
20
+ higher_is_better: true
21
+ metadata:
22
+ version: 1.0