Added documentation

#10

by Ananthakrishnan12 - opened Oct 9, 2024

base: refs/heads/main

←

from: refs/pr/10


+283

-412

Files changed (15)

.gitattributes +0 -1
.gitignore +1 -6
About.md +0 -64
{data_set → Dataset}/transaction_data.csv +1 -1
LSTM_model.py +0 -62
README.md +1 -59
__pycache__/data_preprocessing.cpython-312.pyc +0 -0
__pycache__/inference.cpython-312.pyc +0 -0
bert_model.py +134 -0
config.json +0 -35
data_preprocessing.py +84 -50
main.py +0 -57
model.py +0 -25
requirements.txt +4 -0
setup.md +58 -52

.gitattributes CHANGED

@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-*.mp4 filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

.gitignore CHANGED

@@ -1,6 +1 @@
-transactify_venv
-tokenizer.joblib
-label_encoder.joblib
-transactify.h5
-venv
-.venv


+transactify_venv

About.md DELETED

@@ -1,64 +0,0 @@
-Abstract for Transactify......
-Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions.
-By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining."
-This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting.
-Transactify is trained on real-world transaction data for improved accuracy and generalization.
-Table of contents....
-1.Data Collection:
-The dataset consists of 5,000 transaction records generated using ChatGPT, each containing a transaction description and its corresponding category.
-Example entries include descriptions like "Live concert stream on YouTube" (Movies & Entertainment) and "Coffee at Starbucks" (Food & Dining).
-These records cover various spending categories such as Lifestyle, Movies & Entertainment, Food & Dining, and others.
-2.Data Preprocessing:
-The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training.
-These include:
-Lowercasing all text.
-Removing digits and punctuation using regular expressions (regex).
-Tokenizing the cleaned text to convert it into a sequence of tokens.
-Applying text_to_sequences to transform the tokenized words into numerical sequences.
-Using pad_sequences to ensure all sequences have the same length for input into the LSTM model.
-Label encoding the target categories to convert them into numerical labels.
-After preprocessing, the data is split into training and testing sets to build and validate the model.
-3.Model Building:
-Embedding Layer: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
-LSTM Layer: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.
-Dropout Layer: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.
-Dense Layer with Softmax Activation: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
-Model Compilation: Compiled with the Adam optimizer for efficient learning, sparse categorical cross-entropy loss for multi-class classification, and accuracy as the evaluation metric.
-Model Training: The model is trained for 50 epochs with a batch size of 8, using a validation set to monitor performance and adjust during training.
-Saving the Model and Preprocessing Objects:
-The trained model is saved as transactify.h5 for future use.
-The tokenizer and label encoder used during preprocessing are saved using joblib as tokenizer.joblib and label_encoder.joblib, respectively,
-ensuring they can be reused for consistent tokenization and label encoding when making predictions on new data.
-4.Prediction:
-Once trained, the model is used to predict the category of new transaction descriptions.
-The output provides the category label, enabling users to classify their spending based on transaction descriptions.
-5.Conclusion:
-The Transactify model effectively categorizes transaction descriptions using LSTM networks.
-However, to improve the accuracy and reliability of predictions, a larger and more diverse dataset is necessary.
-Expanding the dataset will help the model generalize better across various spending behaviors and conditions.
-This enhancement will lead to more precise predictions, enabling users to gain deeper insights into their spending patterns.
-Future work should focus on collecting additional data to refine the model's performance and applicability in real-world scenarios.
-![Excepted Output:](result.gif)

{data_set → Dataset}/transaction_data.csv RENAMED

@@ -4998,4 +4998,4 @@ Google Play Music,Online Payment
 Yoga class at HealthFit Studio,Lifestyle
 Doctor's appointment payment,Health & Wellness
 New sneakers from Nike,Lifestyle
-Breakfast at Denny's,Food & Dining

 Yoga class at HealthFit Studio,Lifestyle
 Doctor's appointment payment,Health & Wellness
 New sneakers from Nike,Lifestyle
+Breakfast at Denny's,Food & Dining

LSTM_model.py DELETED

@@ -1,62 +0,0 @@
-# LSTM_model.py
-import numpy as np
-from tensorflow.keras.models import Sequential
-from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
-from data_preprocessing import preprocess_data, split_data
-import joblib  # To save the tokenizer and label encoder
-# Define the LSTM model
-def build_lstm_model(vocab_size, embedding_dim=64, max_len=10, lstm_units=128, dropout_rate=0.2, output_units=6):
-    model = Sequential()
-    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))
-    model.add(LSTM(units=lstm_units, return_sequences=False))
-    model.add(Dropout(dropout_rate))
-    model.add(Dense(units=output_units, activation='softmax'))
-    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
-    return model
-# Main function to execute the training process
-def main():
-    # Path to your data file
-    data_path = r"E:\transactify\transactify\transactify\transactify\transactify\data_set\transaction_data.csv"
-    # Preprocess the data
-    sequences, labels, tokenizer, label_encoder = preprocess_data(data_path)
-    # Check if preprocessing succeeded
-    if sequences is not None:
-        print("Data preprocessing successful!")
-        # Split the data into training and testing sets
-        X_train, X_test, y_train, y_test = split_data(sequences, labels)
-        print(f"Training data shape: {X_train.shape}, Training labels shape: {y_train.shape}")
-        print(f"Testing data shape: {X_test.shape}, Testing labels shape: {y_test.shape}")
-        # Build the LSTM model
-        vocab_size = tokenizer.num_words + 1  # +1 for padding token
-        model = build_lstm_model(vocab_size, max_len=10, output_units=len(label_encoder.classes_))
-        # Train the model
-        model.fit(X_train, y_train, epochs=50, batch_size=8, validation_data=(X_test, y_test))
-        # Evaluate the model
-        loss, accuracy = model.evaluate(X_test, y_test)
-        print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")
-        # Save the model
-        model.save('transactify.h5')
-        print("Model saved as 'transactify.h5'")
-        # Save the tokenizer and label encoder
-        joblib.dump(tokenizer, 'tokenizer.joblib')
-        joblib.dump(label_encoder, 'label_encoder.joblib')
-        print("Tokenizer and LabelEncoder saved as 'tokenizer.joblib' and 'label_encoder.joblib'")
-    else:
-        print("Data preprocessing failed.")
-# Execute the main function
-if __name__ == "__main__":
-    main()

README.md CHANGED

@@ -2,62 +2,4 @@
 license: mit
 language:
 - en
----
-## What is Transactify?
-Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions.
-By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining."
-This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting.
-Transactify is trained on real-world transaction data for improved accuracy and generalization.
-## Table of contents
-## 1. Data Collection
-The dataset consists of **5,000 transaction records** generated using ChatGPT, each containing a transaction description and its corresponding category.
-Example entries include:
-- "Live concert stream on YouTube" (Movies & Entertainment)
-- "Coffee at Starbucks" (Food & Dining)
-These records cover various spending categories such as **Lifestyle**, **Movies & Entertainment**, **Food & Dining**, and others.
----
-## 2. Data Preprocessing
-The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training. These include:
-- Lowercasing all text.
-- Removing digits and punctuation using regular expressions (regex).
-- Tokenizing the cleaned text to convert it into a sequence of tokens.
-- Applying `text_to_sequences` to transform the tokenized words into numerical sequences.
-- Using `pad_sequences` to ensure all sequences have the same length for input into the LSTM model.
-- Label encoding the target categories to convert them into numerical labels.
-After preprocessing, the data is split into training and testing sets to build and validate the model.
----
-## 3. Model Building
-- **Embedding Layer**: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
-- **LSTM Layer**: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.
-- **Dropout Layer**: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.
-- **Dense Layer with Softmax Activation**: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
-### Model Compilation
-- Compiled with the Adam optimizer for efficient learning.
-- Sparse categorical cross-entropy loss for multi-class classification.
-- Accuracy as the evaluation metric.
-### Model Training
-The model is trained for **50 epochs** with a batch size of **8**, using a validation set to monitor performance and adjust during training.
-### Saving the Model and Preprocessing Objects
-- The trained model is saved as `transactify.h5` for future use.
-- The tokenizer and label encoder used during preprocessing are saved using joblib as `tokenizer.joblib` and `label_encoder.joblib`, respectively, ensuring they can be reused for consistent tokenization and label encoding when making predictions on new data.
----
-## 4. Prediction
-Once trained

 license: mit
 language:
 - en
+---

__pycache__/data_preprocessing.cpython-312.pyc DELETED

Binary file (3.55 kB)

__pycache__/inference.cpython-312.pyc DELETED

Binary file (2.21 kB)

bert_model.py ADDED

	@@ -0,0 +1,134 @@

+# Import Required Libraries
+import torch
+import torch.nn as nn
+from torch.optim import AdamW  # The AdamW class shipped with transformers is deprecated
+from torch.utils.data import DataLoader, TensorDataset
+from transformers import BertModel
+from sklearn.metrics import accuracy_score
+import numpy as np
+# Import functions from the preprocessing module
+from transactify.data_preprocessing import preprocessing_data, split_data, read_data
+# Define a BERT-based classification model
+class BertClassifier(nn.Module):
+    def __init__(self, num_labels, dropout_rate=0.3):
+        super(BertClassifier, self).__init__()
+        self.bert = BertModel.from_pretrained("bert-base-uncased")
+        self.dropout = nn.Dropout(dropout_rate)
+        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
+    def forward(self, input_ids, attention_mask):
+        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
+        pooled_output = outputs.pooler_output  # [CLS] hidden state after BERT's pooling layer
+        output = self.dropout(pooled_output)
+        logits = self.classifier(output)
+        return logits
+# Training the model
+def train_model(model, train_dataloader, val_dataloader, device, epochs=3, lr=2e-5):
+    optimizer = AdamW(model.parameters(), lr=lr)
+    loss_fn = nn.CrossEntropyLoss()
+    for epoch in range(epochs):
+        model.train()
+        total_train_loss = 0
+        for step, batch in enumerate(train_dataloader):
+            b_input_ids, b_input_mask, b_labels = batch
+            b_input_ids = b_input_ids.to(device)
+            b_input_mask = b_input_mask.to(device)
+            b_labels = b_labels.to(device).long()  # Ensure labels are LongTensor
+            model.zero_grad()
+            outputs = model(b_input_ids, b_input_mask)
+            loss = loss_fn(outputs, b_labels)
+            total_train_loss += loss.item()
+            loss.backward()
+            optimizer.step()
+        avg_train_loss = total_train_loss / len(train_dataloader)
+        print(f"Epoch {epoch+1}, Training Loss: {avg_train_loss}")
+        model.eval()
+        total_val_accuracy = 0
+        total_val_loss = 0
+        with torch.no_grad():
+            for batch in val_dataloader:
+                b_input_ids, b_input_mask, b_labels = batch
+                b_input_ids = b_input_ids.to(device)
+                b_input_mask = b_input_mask.to(device)
+                b_labels = b_labels.to(device)
+                outputs = model(b_input_ids, b_input_mask)
+                loss = loss_fn(outputs, b_labels)
+                total_val_loss += loss.item()
+                preds = torch.argmax(outputs, dim=1)
+                total_val_accuracy += (preds == b_labels).sum().item()
+        avg_val_accuracy = total_val_accuracy / len(val_dataloader.dataset)
+        avg_val_loss = total_val_loss / len(val_dataloader)
+        print(f"Validation Loss: {avg_val_loss}, Validation Accuracy: {avg_val_accuracy}")
+# Testing the model
+def test_model(model, test_dataloader, device):
+    model.eval()
+    all_preds = []
+    all_labels = []
+    with torch.no_grad():
+        for batch in test_dataloader:
+            b_input_ids, b_input_mask, b_labels = batch
+            b_input_ids = b_input_ids.to(device)
+            b_input_mask = b_input_mask.to(device)
+            b_labels = b_labels.to(device)
+            outputs = model(b_input_ids, b_input_mask)
+            preds = torch.argmax(outputs, dim=1)
+            all_preds.append(preds.cpu().numpy())
+            all_labels.append(b_labels.cpu().numpy())
+    all_preds = np.concatenate(all_preds)
+    all_labels = np.concatenate(all_labels)
+    accuracy = accuracy_score(all_labels, all_preds)
+    print(f"Test Accuracy: {accuracy}")
+# Main function to train, validate, and test the model
+def main(data_path, epochs=3, batch_size=16):
+    # Read and preprocess data
+    data = read_data(data_path)
+    if data is None:
+        return
+    input_ids, attention_masks, labels, labelencoder = preprocessing_data(data)
+    X_train_ids, X_test_ids, X_train_masks, X_test_masks, y_train, y_test = split_data(input_ids, attention_masks, labels)
+    # Determine the number of labels
+    num_labels = len(labelencoder.classes_)
+    # Create the model
+    model = BertClassifier(num_labels)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    model.to(device)
+    # Create dataloaders
+    train_dataset = TensorDataset(X_train_ids, X_train_masks, y_train)
+    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)  # Reshuffle every epoch
+    val_dataset = TensorDataset(X_test_ids, X_test_masks, y_test)
+    val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
+    # Train the model
+    train_model(model, train_dataloader, val_dataloader, device, epochs=epochs)
+    # Test the model (this reuses the validation split; there is no separate held-out test set)
+    test_dataloader = DataLoader(val_dataset, batch_size=batch_size)
+    test_model(model, test_dataloader, device)
+if __name__ == "__main__":
+    data_path = r"E:\transactify\transactify\Dataset\transaction_data.csv"
+    main(data_path)
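
The classifier head returns one raw logit per category, and the training and test loops above take `torch.argmax` over them. The PR does not include a prediction script, so here is a minimal pure-Python sketch of how a predicted index could be mapped back to a category name. The category list is an assumption built only from labels visible in the dataset diff, and `softmax` and `predict_label` are hypothetical helpers, not part of this PR. `LabelEncoder` stores its classes in sorted order, which the sketch imitates with `sorted`.

```python
import math

# Assumed category names, taken from rows visible in transaction_data.csv;
# the real dataset may contain more.
CLASSES = sorted([
    "Food & Dining",
    "Health & Wellness",
    "Lifestyle",
    "Movies & Entertainment",
    "Online Payment",
])


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, classes=CLASSES):
    """Hypothetical helper: return the highest-scoring class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]


label, confidence = predict_label([0.1, 0.2, 2.5, -1.0, 0.0])
print(label, round(confidence, 3))
```

In the real model the logits would come from `BertClassifier.forward`, and `labelencoder.inverse_transform` would do the index-to-name lookup.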

config.json DELETED

@@ -1,35 +0,0 @@
-{
-  "model_type": "custom",
-  "architectures": ["LSTM"],
-  "library_name": "tensorflow",
-  "task_specific_params": {
-    "text-classification": {
-      "vocab_size": 500,
-      "embedding_dim": 64,
-      "hidden_size": 64,
-      "num_layers": 2,
-      "dropout_rate": 0.2,
-      "max_sequence_length": 10
-    }
-  },
-  "training_params": {
-    "batch_size": 8,
-    "epochs": 50,
-    "loss_function": "sparse_categorical_crossentropy",
-    "optimizer": "adam",
-    "metrics": ["accuracy"]
-  },
-  "train_data_size": 5000,
-  "id2label": {
-    "0": "Lifestyle",
-    "1": "Movies & Entertainment",
-    "2": "Food & Dining",
-    "3": "Others"
-  },
-  "label2id": {
-    "Lifestyle": 0,
-    "Movies & Entertainment": 1,
-    "Food & Dining": 2,
-    "Others": 3
-  }
-}

data_preprocessing.py CHANGED

@@ -1,11 +1,12 @@
-# data_preprocessing.py
 import numpy as np
 import pandas as pd
-import re
 from sklearn.preprocessing import LabelEncoder
 from sklearn.model_selection import train_test_split
-from tensorflow.keras.preprocessing.text import Tokenizer
-from tensorflow.keras.preprocessing.sequence import pad_sequences
 # Read the data
 def read_data(path):
@@ -22,62 +23,95 @@ def read_data(path):
         print(f"An error occurred: {e}")
         return None
 # Cleaning the text
 def clean_text(text):
-    text = text.lower()                    # Convert uppercase to lowercase
-    text = re.sub(r"\d+", " ", text)       # Remove digits
-    text = re.sub(r"[^\w\s]", " ", text)   # Remove punctuations
     text = text.strip()                    # Remove extra spaces
     return text
-# Main preprocessing function
-def preprocess_data(file_path, max_len=10, vocab_size=250):
-    # Read the data
-    df = read_data(file_path)
-    if df is None:
-        print("Data loading failed.")
-        return None, None, None, None
-    # Clean the text
-    df['Transaction Description'] = df['Transaction Description'].apply(clean_text)
-    # Initialize the tokenizer
-    tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
-    tokenizer.fit_on_texts(df['Transaction Description'])
-    # Convert texts to sequences and pad them
-    sequences = tokenizer.texts_to_sequences(df['Transaction Description'])
-    padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
-    # Initialize LabelEncoder and encode the labels
-    label_encoder = LabelEncoder()
-    labels = label_encoder.fit_transform(df['Category'])
-    return padded_sequences, labels, tokenizer, label_encoder
-# Train-test split function
-def split_data(sequences, labels, test_size=0.2, random_state=42):
-    X_train, X_test, y_train, y_test = train_test_split(sequences, labels, test_size=test_size, random_state=random_state)
-    return X_train, X_test, y_train, y_test
-# Main function to execute preprocessing
-def main():
-    # Path to your data file
-    data_path = r"E:\transactify\transactify\Dataset\transaction_data.csv"
-    # Preprocess the data
-    sequences, labels, tokenizer, label_encoder = preprocess_data(data_path)
-    # Check if preprocessing succeeded
-    if sequences is not None:
-        print("Data preprocessing successful!")
-        # Split the data into training and testing sets
-        X_train, X_test, y_train, y_test = split_data(sequences, labels)
-        print(f"Training data shape: {X_train.shape}, Training labels shape: {y_train.shape}")
-        print(f"Testing data shape: {X_test.shape}, Testing labels shape: {y_test.shape}")
-    else:
-        print("Data preprocessing failed.")
-# Execute the main function
-if __name__ == "__main__":
-    main()

+# Import Required Libraries:
 import numpy as np
 import pandas as pd
+import torch
+from transformers import BertTokenizer
 from sklearn.preprocessing import LabelEncoder
 from sklearn.model_selection import train_test_split
+import re
 # Read the data
 def read_data(path):
         print(f"An error occurred: {e}")
         return None
+# Path to your data file
+data_path = r"E:\transactify\transactify\Dataset\transaction_data.csv"
+# Read the data and check if it was loaded successfully
+data = read_data(data_path)
+if data is not None:
+    print("Data loaded successfully:")
+    print(data.head(15))
+else:
+    raise SystemExit("Data loading failed. Exiting...")  # Exit without relying on the interactive exit() helper
 # Cleaning the text
 def clean_text(text):
+    text = text.lower()                    # Converting uppercase to lowercase
+    text = re.sub(r"\d+", " ", text)       # Removing digits in the text
+    text = re.sub(r"[^\w\s]", " ", text)   # Removing punctuations
     text = text.strip()                    # Remove extra spaces
     return text
+# Preprocessing the data
+def preprocessing_data(df, max_length=20):
+    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+    input_ids = []
+    attention_masks = []
+    kept_categories = []  # Categories of the rows that survive cleaning
+    # Ensure the dataframe has the required columns
+    if "Transaction Description" not in df.columns or "Category" not in df.columns:
+        raise ValueError("The required columns 'Transaction Description' and 'Category' are missing from the dataset.")
+    for description, category in zip(df["Transaction Description"], df["Category"]):
+        cleaned_text = clean_text(description)
+        # Debugging print statements
+        # print(f"Original Description: {description}")
+        # print(f"Cleaned Text: {cleaned_text}")
+        # Only tokenize if the cleaned text is not empty
+        if cleaned_text:
+            encoded_dict = tokenizer.encode_plus(
+                cleaned_text,
+                add_special_tokens=True,  # Add the [CLS] and [SEP] tokens BERT expects
+                max_length=max_length,
+                padding="max_length",     # Replaces the deprecated pad_to_max_length=True
+                truncation=True,
+                return_attention_mask=True,
+                return_tensors="pt"
+            )
+            input_ids.append(encoded_dict['input_ids'])  # Append input IDs
+            attention_masks.append(encoded_dict['attention_mask'])  # Append attention masks
+            kept_categories.append(category)  # Keep the label aligned with the kept row
+        else:
+            print("Cleaned text is empty, skipping...")
+    # Debugging output to check sizes
+    print(f"Total input_ids collected: {len(input_ids)}")
+    print(f"Total attention_masks collected: {len(attention_masks)}")
+    if not input_ids:
+        raise ValueError("No input_ids were collected. Check the cleaning process.")
+    # Concatenating the list of tensors to form a single tensor
+    input_ids = torch.cat(input_ids, dim=0)
+    attention_masks = torch.cat(attention_masks, dim=0)
+    # Encoding the labels of the kept rows only, so labels and inputs have the same length
+    labelencoder = LabelEncoder()
+    labels = labelencoder.fit_transform(kept_categories)
+    labels = torch.tensor(labels, dtype=torch.long)  # Convert labels to LongTensor
+    return input_ids, attention_masks, labels, labelencoder
+# Split the data into train and test sets
+def split_data(input_ids, attention_masks, labels, test_size=0.2, random_state=42):
+    # A single call applies the same shuffle to all three arrays
+    X_train_ids, X_test_ids, X_train_masks, X_test_masks, y_train, y_test = train_test_split(
+        input_ids, attention_masks, labels, test_size=test_size, random_state=random_state
+    )
+    return X_train_ids, X_test_ids, X_train_masks, X_test_masks, y_train, y_test
+# Preprocess and split only when run as a script, so importing this module does not tokenize the whole dataset
+if __name__ == "__main__":
+    input_ids, attention_masks, labels, labelencoder = preprocessing_data(data)
+    X_train_ids, X_test_ids, X_train_masks, X_test_masks, y_train, y_test = split_data(input_ids, attention_masks, labels)
+    # Output the sizes of the splits for confirmation
+    print(f"Training set size: {X_train_ids.shape[0]}")
+    print(f"Test set size: {X_test_ids.shape[0]}")
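
Because `preprocessing_data` skips descriptions that become empty after cleaning, the labels have to be filtered by the same rule for the input and label tensors to line up. Here is a minimal standard-library sketch of that idea. `clean_text` repeats the cleaning steps from the diff; `clean_and_filter` and the sample rows are hypothetical and only illustrate the alignment.

```python
import re


def clean_text(text):
    """Same cleaning steps as the diff: lowercase, drop digits and punctuation."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return text.strip()


def clean_and_filter(descriptions, categories):
    """Hypothetical helper: drop rows whose cleaned text is empty,
    keeping descriptions and categories aligned."""
    kept_texts, kept_categories = [], []
    for description, category in zip(descriptions, categories):
        cleaned = clean_text(description)
        if cleaned:
            kept_texts.append(cleaned)
            kept_categories.append(category)
    return kept_texts, kept_categories


# The second row contains only digits, so it is empty after cleaning.
texts, cats = clean_and_filter(
    ["Coffee at Starbucks #42!", "12345", "New sneakers from Nike"],
    ["Food & Dining", "Online Payment", "Lifestyle"],
)
print(texts)
print(cats)
```

Both returned lists always have the same length, which is what `TensorDataset` requires of the tensors built from them.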

main.py DELETED

@@ -1,57 +0,0 @@
-# main.py
-import numpy as np
-import pandas as pd
-from tensorflow.keras.models import load_model
-from tensorflow.keras.preprocessing.text import Tokenizer
-from tensorflow.keras.preprocessing.sequence import pad_sequences
-import joblib
-import re
-# Function to clean the input text
-def clean_text(text):
-    text = text.lower()
-    text = re.sub(r"\d+", " ", text)
-    text = re.sub(r"[^\w\s]", " ", text)
-    text = text.strip()
-    return text
-# Load the model, tokenizer, and label encoder
-def load_resources(model_path, tokenizer_path, label_encoder_path):
-    model = load_model(model_path)
-    tokenizer = joblib.load(tokenizer_path)
-    label_encoder = joblib.load(label_encoder_path)
-    return model, tokenizer, label_encoder
-# Function to make predictions
-def predict(model, tokenizer, label_encoder, input_text, max_len=50):
-    cleaned_text = clean_text(input_text)
-    sequence = tokenizer.texts_to_sequences([cleaned_text])
-    padded_sequence = pad_sequences(sequence, maxlen=max_len, padding='post', truncating='post')
-    # Make prediction
-    prediction = model.predict(padded_sequence)
-    predicted_class = np.argmax(prediction, axis=1)
-    # Decode the label
-    predicted_label = label_encoder.inverse_transform(predicted_class)
-    return predicted_label[0]
-# Main function for running predictions
-def main():
-    # Paths to your resources
-    model_path = 'transactify.h5'  # Update with the correct path if needed
-    tokenizer_path = 'tokenizer.joblib'  # Update with the correct path if needed
-    label_encoder_path = 'label_encoder.joblib'  # Update with the correct path if needed
-    # Load resources
-    model, tokenizer, label_encoder = load_resources(model_path, tokenizer_path, label_encoder_path)
-    # Input for prediction
-    input_text = input("Enter a transaction description for prediction: ")
-    predicted_category = predict(model, tokenizer, label_encoder, input_text)
-    print(f"The predicted category is: {predicted_category}")
-# Execute the main function
-if __name__ == "__main__":
-    main()

model.py DELETED

@@ -1,25 +0,0 @@
-from tensorflow.keras.models import load_model
-import joblib
-from tensorflow.keras.preprocessing.sequence import pad_sequences
-import numpy as np
-import re
-# Load the model, tokenizer, and label encoder
-model = load_model("transactify.h5")
-tokenizer = joblib.load("tokenizer.joblib")
-label_encoder = joblib.load("label_encoder.joblib")
-def clean_text(text):
-    text = text.lower()
-    text = re.sub(r"\d+", "", text)
-    text = re.sub(r"[^\w\s]", "", text)
-    return text.strip()
-def predict(text):
-    cleaned_text = clean_text(text)
-    sequence = tokenizer.texts_to_sequences([cleaned_text])
-    padded_sequence = pad_sequences(sequence, maxlen=100)
-    prediction = model.predict(padded_sequence)
-    predicted_label = np.argmax(prediction, axis=1)
-    category = label_encoder.inverse_transform(predicted_label)
-    return {"category": category[0]}

requirements.txt CHANGED

@@ -1,4 +1,8 @@
 numpy
 pandas
 tensorflow
 scikit-learn

 numpy
 pandas
 tensorflow
+transformers
 scikit-learn
+torch
+torchvision
+torchaudio

setup.md CHANGED

@@ -1,53 +1,59 @@
-# Steps to Run the Model
-1. **Clone the Repository**:
-   Open your command line interface (CLI) and clone the repository using:
-   ```bash
-   git clone https://huggingface.co/webslate/transactify
-   ```
-2. **Create the Virtual Environment**:
-   Navigate to the project directory and create a virtual environment:
-   ```bash
-   python -m venv transactify_venv
-   ```
-3. **Activate the Virtual Environment**:
-   To activate the virtual environment, follow these steps:
-   - Open your command line interface (CLI).
-   - Type the following commands:
-     ```bash
-     cd transactify_venv
-     cd Scripts
-     activate
-     ```
-4. **Install Required Libraries**:
-   After activating the virtual environment, install the necessary libraries by typing:
-   ```bash
-   pip install -r requirements.txt
-   ```
-5. **Run the Data Preprocessing Code**:
-   Execute the data preprocessing script by typing:
-   ```bash
-   python data_preprocessing.py
-   ```
-6. **Run the LSTM Model Code**:
-   Train the LSTM model by executing:
-   ```bash
-   python LSTM_model.py
-   ```
-7. **Generate the H5 File**:
-   After training, you can generate the model file (`transactify.h5`).
-8. **Run the Prediction Code**:
-   To make predictions using the trained model, type:
-   ```bash
-   python main.py
-   ```
-Following these steps will set up and run the Transactify model for predicting transaction categories based on descriptions.

+## Install Git LFS
+```
+brew install git-lfs
+```
+or download from https://git-lfs.github.com/
+## Update global git config
+```
+$ git lfs install
+```
+## Update system git config
+```
+$ git lfs install --system
+```
+## Clone the Repo
+### Entire Clone
+```
+git clone https://huggingface.co/webslate/transactify
+```
+### Light Clone
+```
+GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/webslate/transactify
+```
+## For Pushing the Code
+> Refer to https://huggingface.co/blog/password-git-deprecation
+### Set the Remote URL
+```
+$ git remote set-url origin https://<user_name>:<token>@huggingface.co/<repo_path>
+```
+### Token Creation
+> Go to Settings > Access Tokens > Create new token, then choose the Write tab (the third one),
+> or go directly to https://huggingface.co/settings/tokens/new?tokenType=write
+## Create Virtual Environment
+Create a virtual environment for the Transactify project, then activate it from cmd (Windows):
+```
+python -m venv transactify_venv
+transactify_venv\Scripts\activate
+```
+## Install Required Libraries
+With the environment active, run `pip install -r requirements.txt` from the repository root in cmd.
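
After installing, it can help to confirm that the environment can actually find every package. This is a minimal sketch using only the standard library. The module list is an assumption derived from `requirements.txt` (note that `scikit-learn` is imported as `sklearn`), and `missing_modules` is a hypothetical helper, not part of this PR.

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages in requirements.txt.
REQUIRED_MODULES = [
    "numpy",
    "pandas",
    "tensorflow",
    "transformers",
    "sklearn",  # installed as scikit-learn
    "torch",
    "torchvision",
    "torchaudio",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as `torch`.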