Hello instructlab,

I am very interested in using InstructLab for my project. I have tried to use the instructlab-training library (https://pypi.org/project/instructlab-training/) on Kaggle, following the directions in the documentation.

I have tried the library on Kaggle with both the CPU and the GPU, with no success. Can someone answer this question: IS IT POSSIBLE TO RUN https://pypi.org/project/instructlab-training/ ON KAGGLE?

-----------------------DETAILS----------------------------------------------

I have successfully installed the library in the Kaggle notebook:

!pip install instructlab-training

I have successfully imported the necessary tools without error:

from instructlab.training import run_training, TrainingArgs, TorchrunArgs

I am getting an ERROR saying that neither 'torchrun_args' nor 'training_args' is a valid keyword argument for 'run_training':

# Run the training

run_training(
    torchrun_args=TorchrunArgs(
        nnodes=1,
        nproc_per_node=1,
        node_rank=0,  # Node rank
        rdzv_id=0,  # Changed rdzv_id to an integer
        rdzv_endpoint="localhost:29500",  # Endpoint
    ),
    training_args=training_args
)

TypeError Traceback (most recent call last)
Cell In[7], line 46
43 os.makedirs(training_args.data_output_dir, exist_ok=True)
45 # Run the training
---> 46 run_training(
47 torchrun_args=TorchrunArgs(
48 nnodes=1,
49 nproc_per_node=1,
50 node_rank=0, # Node rank
51 rdzv_id=0, # Changed rdzv_id to an integer
52 rdzv_endpoint="localhost:29500", # Endpoint
53 ),
54 training_args=training_args
55 )
57 print("Training completed successfully.")

TypeError: run_training() got an unexpected keyword argument 'torchrun_args'
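
One way to narrow down a TypeError like this is to ask Python which keyword names the installed function actually accepts. The following is a minimal sketch using the standard `inspect` module. The helper `accepted_kwargs` and the stand-in `run_training_stub` are hypothetical names used only for illustration, and the stub's parameter names are an assumption, not confirmed against the installed package.

```python
import inspect


def accepted_kwargs(func):
    """Return the parameter names that `func` accepts by keyword."""
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return [
        param.name
        for param in inspect.signature(func).parameters.values()
        if param.kind in keyword_kinds
    ]


# Hypothetical stand-in so this sketch runs without the library installed.
# Its parameter names are an assumption; check the real function instead.
def run_training_stub(torch_args, train_args):
    pass


print(accepted_kwargs(run_training_stub))
```

In the notebook, passing the imported `run_training` instead of the stub would list the names that the installed version expects, which can then be compared with 'torchrun_args' and 'training_args'.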

Here is my basic test code, which tries to use the Python library to train an IBM model with InstructLab:

-------------------------------------------------CODE--------------------------------------------------------------------------------------------------------------
!pip install instructlab-training
import json
import os
from instructlab.training import run_training, TrainingArgs, TorchrunArgs

# Step 1: Create a small hardcoded synthetic dataset in JSONL format

def create_synthetic_data(output_file="dataset.jsonl"):
    examples = [
        {"instruction": "Translate 'Hello' to Spanish.", "response": "Hola"},
        {"instruction": "What is the capital of France?", "response": "Paris"},
        {"instruction": "Solve 5 + 3.", "response": "8"},
        {"instruction": "Provide a synonym for 'happy'.", "response": "Joyful"},
        {"instruction": "List three primary colors.", "response": "Red, Blue, Yellow"}
    ]

    with open(output_file, 'w') as f:
        for example in examples:
            f.write(json.dumps(example) + '\n')
    print(f"Synthetic dataset created at {output_file}")

# Generate dataset

create_synthetic_data()

# Step 2: Define training arguments with all required fields

training_args = TrainingArgs(
    model_path="ibm-granite/granite-3.0-1b-a400m-instruct",
    data_path="dataset.jsonl",
    ckpt_output_dir="data/saved_checkpoints",
    data_output_dir="data/outputs",
    max_seq_len=512,
    max_batch_len=64,  # Added max_batch_len
    num_epochs=1,
    effective_batch_size=8,
    save_samples=1000,  # Added save_samples
    learning_rate=2e-6,
    warmup_steps=100,  # Added warmup_steps
    is_padding_free=True,  # Added is_padding_free
    random_seed=42,
)

# Ensure output directories exist

os.makedirs(training_args.ckpt_output_dir, exist_ok=True)
os.makedirs(training_args.data_output_dir, exist_ok=True)

# Run the training

print("Training test complete.")
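
As a sanity check on Step 1, the dataset file can be read back line by line before it is handed to the trainer. This is a minimal sketch, assuming only that each line should be a JSON object with "instruction" and "response" keys, as in the code above. The helper `load_jsonl` is a hypothetical name, and the check says nothing about whether the training library accepts this schema.

```python
import json
import os
import tempfile


def load_jsonl(path, required_keys=("instruction", "response")):
    """Read a JSONL file and verify every record has the required keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = set(required_keys) - record.keys()
            if missing:
                raise ValueError(
                    f"line {line_number} is missing keys: {sorted(missing)}"
                )
            records.append(record)
    return records


# Two records copied from the post, written to a temporary file for the demo.
sample = [
    {"instruction": "Translate 'Hello' to Spanish.", "response": "Hola"},
    {"instruction": "Solve 5 + 3.", "response": "8"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for example in sample:
            f.write(json.dumps(example) + "\n")
    loaded = load_jsonl(path)

print(f"Loaded {len(loaded)} records")
```

In the notebook, calling `load_jsonl("dataset.jsonl")` after `create_synthetic_data()` would confirm that the file is well-formed before training starts.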

THANK YOU! JEFF

Error running InstructLab Training Python Library on Kaggle