tech: transactify base changes

#1
.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-*.mp4 filter=lfs diff=lfs merge=lfs -text

.gitignore DELETED
@@ -1,6 +0,0 @@
-transactify_venv
-tokenizer.joblib
-label_encoder.joblib
-transactify.h5
-venv
-.venv

About.md DELETED
@@ -1,64 +0,0 @@
-Abstract for Transactify
-
-Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions.
-By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining."
-This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting.
-Transactify is trained on realistic transaction data for improved accuracy and generalization.
-
-Table of Contents
-
-1. Data Collection:
-The dataset consists of 5,000 transaction records generated using ChatGPT, each containing a transaction description and its corresponding category.
-Example entries include descriptions like "Live concert stream on YouTube" (Movies & Entertainment) and "Coffee at Starbucks" (Food & Dining).
-These records cover various spending categories such as Lifestyle, Movies & Entertainment, Food & Dining, and others.
-
-2. Data Preprocessing:
-The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training.
-These include:
-Lowercasing all text.
-Removing digits and punctuation using regular expressions (regex).
-Tokenizing the cleaned text to convert it into a sequence of tokens.
-Applying texts_to_sequences to transform the tokenized words into numerical sequences.
-Using pad_sequences to ensure all sequences have the same length for input into the LSTM model.
-Label encoding the target categories to convert them into numerical labels.
-After preprocessing, the data is split into training and testing sets to build and validate the model.
-
-3. Model Building:
-Embedding Layer: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
-
-LSTM Layer: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.
-
-Dropout Layer: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.
-
-Dense Layer with Softmax Activation: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
-
-Model Compilation: Compiled with the Adam optimizer for efficient learning, sparse categorical cross-entropy loss for multi-class classification, and accuracy as the evaluation metric.
-
-Model Training: The model is trained for 50 epochs with a batch size of 8, using a validation set to monitor performance during training.
-
-Saving the Model and Preprocessing Objects:
-
-The trained model is saved as transactify.h5 for future use.
-The tokenizer and label encoder used during preprocessing are saved using joblib as tokenizer.joblib and label_encoder.joblib, respectively,
-ensuring they can be reused for consistent tokenization and label encoding when making predictions on new data.
-
-4. Prediction:
-Once trained, the model is used to predict the category of new transaction descriptions.
-The output provides the category label, enabling users to classify their spending based on transaction descriptions.
-
-5. Conclusion:
-The Transactify model effectively categorizes transaction descriptions using LSTM networks.
-However, to improve the accuracy and reliability of predictions, a larger and more diverse dataset is necessary.
-Expanding the dataset will help the model generalize better across various spending behaviors and conditions.
-This enhancement will lead to more precise predictions, enabling users to gain deeper insights into their spending patterns.
-Future work should focus on collecting additional data to refine the model's performance and applicability in real-world scenarios.
-
-![Expected Output](result.gif)

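For illustration, here is a minimal, self-contained sketch of the preprocessing pipeline described in section 2 above, using the Keras `Tokenizer` and `pad_sequences` utilities the project relies on; the sample descriptions are from the dataset examples, and the `num_words=250` / `maxlen=10` values mirror the defaults used elsewhere in this diff:

```python
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def clean_text(text):
    text = text.lower()                   # lowercase everything
    text = re.sub(r"\d+", " ", text)      # strip digits
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
    return text.strip()

descriptions = ["Live concert stream on YouTube", "Coffee at Starbucks #42"]
cleaned = [clean_text(d) for d in descriptions]

tokenizer = Tokenizer(num_words=250, oov_token="<OOV>")
tokenizer.fit_on_texts(cleaned)
sequences = tokenizer.texts_to_sequences(cleaned)  # words -> integer ids
padded = pad_sequences(sequences, maxlen=10, padding="post", truncating="post")
print(padded.shape)  # (2, 10): one fixed-length row per description
```
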
LSTM_model.py DELETED
@@ -1,62 +0,0 @@
-# LSTM_model.py
-import numpy as np
-from tensorflow.keras.models import Sequential
-from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
-from data_preprocessing import preprocess_data, split_data
-import joblib  # To save the tokenizer and label encoder
-
-# Define the LSTM model
-def build_lstm_model(vocab_size, embedding_dim=64, max_len=10, lstm_units=128, dropout_rate=0.2, output_units=6):
-    model = Sequential()
-    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))
-    model.add(LSTM(units=lstm_units, return_sequences=False))
-    model.add(Dropout(dropout_rate))
-    model.add(Dense(units=output_units, activation='softmax'))
-
-    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
-
-    return model
-
-# Main function to execute the training process
-def main():
-    # Path to your data file
-    data_path = r"E:\transactify\transactify\transactify\transactify\transactify\data_set\transaction_data.csv"
-
-    # Preprocess the data
-    sequences, labels, tokenizer, label_encoder = preprocess_data(data_path)
-
-    # Check if preprocessing succeeded
-    if sequences is not None:
-        print("Data preprocessing successful!")
-
-        # Split the data into training and testing sets
-        X_train, X_test, y_train, y_test = split_data(sequences, labels)
-        print(f"Training data shape: {X_train.shape}, Training labels shape: {y_train.shape}")
-        print(f"Testing data shape: {X_test.shape}, Testing labels shape: {y_test.shape}")
-
-        # Build the LSTM model
-        vocab_size = tokenizer.num_words + 1  # +1 for the padding token
-        model = build_lstm_model(vocab_size, max_len=10, output_units=len(label_encoder.classes_))
-
-        # Train the model
-        model.fit(X_train, y_train, epochs=50, batch_size=8, validation_data=(X_test, y_test))
-
-        # Evaluate the model
-        loss, accuracy = model.evaluate(X_test, y_test)
-        print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")
-
-        # Save the model
-        model.save('transactify.h5')
-        print("Model saved as 'transactify.h5'")
-
-        # Save the tokenizer and label encoder
-        joblib.dump(tokenizer, 'tokenizer.joblib')
-        joblib.dump(label_encoder, 'label_encoder.joblib')
-        print("Tokenizer and LabelEncoder saved as 'tokenizer.joblib' and 'label_encoder.joblib'")
-
-    else:
-        print("Data preprocessing failed.")
-
-# Execute the main function
-if __name__ == "__main__":
-    main()

README.md CHANGED
@@ -1,63 +1,3 @@
----
-license: mit
-language:
-- en
----
-
-## What is Transactify?
-Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions.
-By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining."
-This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting.
-Transactify is trained on realistic transaction data for improved accuracy and generalization.
-
-## Table of contents
-## 1. Data Collection
-The dataset consists of **5,000 transaction records** generated using ChatGPT, each containing a transaction description and its corresponding category.
-
-Example entries include:
-- "Live concert stream on YouTube" (Movies & Entertainment)
-- "Coffee at Starbucks" (Food & Dining)
-
-These records cover various spending categories such as **Lifestyle**, **Movies & Entertainment**, **Food & Dining**, and others.
-
----
-
-## 2. Data Preprocessing
-The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training. These include:
-
-- Lowercasing all text.
-- Removing digits and punctuation using regular expressions (regex).
-- Tokenizing the cleaned text to convert it into a sequence of tokens.
-- Applying `texts_to_sequences` to transform the tokenized words into numerical sequences.
-- Using `pad_sequences` to ensure all sequences have the same length for input into the LSTM model.
-- Label encoding the target categories to convert them into numerical labels.
-
-After preprocessing, the data is split into training and testing sets to build and validate the model.
-
----
-
-## 3. Model Building
-- **Embedding Layer**: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
-
-- **LSTM Layer**: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.
-
-- **Dropout Layer**: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.
-
-- **Dense Layer with Softmax Activation**: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
-
-### Model Compilation
-- Compiled with the Adam optimizer for efficient learning.
-- Sparse categorical cross-entropy loss for multi-class classification.
-- Accuracy as the evaluation metric.
-
-### Model Training
-The model is trained for **50 epochs** with a batch size of **8**, using a validation set to monitor performance during training.
-
-### Saving the Model and Preprocessing Objects
-- The trained model is saved as `transactify.h5` for future use.
-- The tokenizer and label encoder used during preprocessing are saved using joblib as `tokenizer.joblib` and `label_encoder.joblib`, respectively, ensuring they can be reused for consistent tokenization and label encoding when making predictions on new data.
-
----
-
-## 4. Prediction
-Once trained

+---
+license: mit
+---

__pycache__/data_preprocessing.cpython-312.pyc DELETED
Binary file (3.55 kB)
 
__pycache__/datapreprocessing.cpython-312.pyc DELETED
Binary file (4.3 kB)
 
__pycache__/inference.cpython-312.pyc DELETED
Binary file (2.21 kB)
 
config.json DELETED
@@ -1,35 +0,0 @@
-{
-  "model_type": "custom",
-  "architectures": ["LSTM"],
-  "library_name": "tensorflow",
-  "task_specific_params": {
-    "text-classification": {
-      "vocab_size": 500,
-      "embedding_dim": 64,
-      "hidden_size": 64,
-      "num_layers": 2,
-      "dropout_rate": 0.2,
-      "max_sequence_length": 10
-    }
-  },
-  "training_params": {
-    "batch_size": 8,
-    "epochs": 50,
-    "loss_function": "sparse_categorical_crossentropy",
-    "optimizer": "adam",
-    "metrics": ["accuracy"]
-  },
-  "train_data_size": 5000,
-  "id2label": {
-    "0": "Lifestyle",
-    "1": "Movies & Entertainment",
-    "2": "Food & Dining",
-    "3": "Others"
-  },
-  "label2id": {
-    "Lifestyle": 0,
-    "Movies & Entertainment": 1,
-    "Food & Dining": 2,
-    "Others": 3
-  }
-}

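The `id2label` table above is what turns the network's integer output back into a category name. A minimal sketch of that lookup, assuming `config.json` sits in the working directory and `probs` stands in for one softmax row from `model.predict`:

```python
import json
import numpy as np

with open("config.json") as f:
    config = json.load(f)

probs = np.array([0.10, 0.70, 0.15, 0.05])  # illustrative softmax output
idx = int(np.argmax(probs))                 # predicted class index
print(config["id2label"][str(idx)])         # -> "Movies & Entertainment"
```
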
data_preprocessing.py DELETED
@@ -1,83 +0,0 @@
-# data_preprocessing.py
-import numpy as np
-import pandas as pd
-import re
-from sklearn.preprocessing import LabelEncoder
-from sklearn.model_selection import train_test_split
-from tensorflow.keras.preprocessing.text import Tokenizer
-from tensorflow.keras.preprocessing.sequence import pad_sequences
-
-# Read the data
-def read_data(path):
-    try:
-        df = pd.read_csv(path)
-        if df.empty:
-            print("The file is empty.")
-            return None
-        return df
-    except FileNotFoundError:
-        print(f"File not found at: {path}")
-        return None
-    except Exception as e:
-        print(f"An error occurred: {e}")
-        return None
-
-# Clean the text
-def clean_text(text):
-    text = text.lower()                   # Convert uppercase to lowercase
-    text = re.sub(r"\d+", " ", text)      # Remove digits
-    text = re.sub(r"[^\w\s]", " ", text)  # Remove punctuation
-    text = text.strip()                   # Remove extra spaces
-    return text
-
-# Main preprocessing function
-def preprocess_data(file_path, max_len=10, vocab_size=250):
-    # Read the data
-    df = read_data(file_path)
-    if df is None:
-        print("Data loading failed.")
-        return None, None, None, None
-
-    # Clean the text
-    df['Transaction Description'] = df['Transaction Description'].apply(clean_text)
-
-    # Initialize the tokenizer
-    tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
-    tokenizer.fit_on_texts(df['Transaction Description'])
-
-    # Convert texts to sequences and pad them
-    sequences = tokenizer.texts_to_sequences(df['Transaction Description'])
-    padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
-
-    # Initialize LabelEncoder and encode the labels
-    label_encoder = LabelEncoder()
-    labels = label_encoder.fit_transform(df['Category'])
-
-    return padded_sequences, labels, tokenizer, label_encoder
-
-# Train-test split function
-def split_data(sequences, labels, test_size=0.2, random_state=42):
-    X_train, X_test, y_train, y_test = train_test_split(sequences, labels, test_size=test_size, random_state=random_state)
-    return X_train, X_test, y_train, y_test
-
-# Main function to execute preprocessing
-def main():
-    # Path to your data file
-    data_path = r"E:\transactify\transactify\Dataset\transaction_data.csv"
-
-    # Preprocess the data
-    sequences, labels, tokenizer, label_encoder = preprocess_data(data_path)
-
-    # Check if preprocessing succeeded
-    if sequences is not None:
-        print("Data preprocessing successful!")
-        # Split the data into training and testing sets
-        X_train, X_test, y_train, y_test = split_data(sequences, labels)
-        print(f"Training data shape: {X_train.shape}, Training labels shape: {y_train.shape}")
-        print(f"Testing data shape: {X_test.shape}, Testing labels shape: {y_test.shape}")
-    else:
-        print("Data preprocessing failed.")
-
-# Execute the main function
-if __name__ == "__main__":
-    main()

data_set/transaction_data.csv DELETED
The diff for this file is too large to render. See raw diff
 
main.py DELETED
@@ -1,57 +0,0 @@
-# main.py
-import numpy as np
-import pandas as pd
-from tensorflow.keras.models import load_model
-from tensorflow.keras.preprocessing.text import Tokenizer
-from tensorflow.keras.preprocessing.sequence import pad_sequences
-import joblib
-import re
-
-# Function to clean the input text
-def clean_text(text):
-    text = text.lower()
-    text = re.sub(r"\d+", " ", text)
-    text = re.sub(r"[^\w\s]", " ", text)
-    text = text.strip()
-    return text
-
-# Load the model, tokenizer, and label encoder
-def load_resources(model_path, tokenizer_path, label_encoder_path):
-    model = load_model(model_path)
-    tokenizer = joblib.load(tokenizer_path)
-    label_encoder = joblib.load(label_encoder_path)
-    return model, tokenizer, label_encoder
-
-# Function to make predictions
-def predict(model, tokenizer, label_encoder, input_text, max_len=10):  # max_len must match the value used during training (10)
-    cleaned_text = clean_text(input_text)
-    sequence = tokenizer.texts_to_sequences([cleaned_text])
-    padded_sequence = pad_sequences(sequence, maxlen=max_len, padding='post', truncating='post')
-
-    # Make prediction
-    prediction = model.predict(padded_sequence)
-    predicted_class = np.argmax(prediction, axis=1)
-
-    # Decode the label
-    predicted_label = label_encoder.inverse_transform(predicted_class)
-
-    return predicted_label[0]
-
-# Main function for running predictions
-def main():
-    # Paths to your resources
-    model_path = 'transactify.h5'                # Update with the correct path if needed
-    tokenizer_path = 'tokenizer.joblib'          # Update with the correct path if needed
-    label_encoder_path = 'label_encoder.joblib'  # Update with the correct path if needed
-
-    # Load resources
-    model, tokenizer, label_encoder = load_resources(model_path, tokenizer_path, label_encoder_path)
-
-    # Input for prediction
-    input_text = input("Enter a transaction description for prediction: ")
-    predicted_category = predict(model, tokenizer, label_encoder, input_text)
-    print(f"The predicted category is: {predicted_category}")
-
-# Execute the main function
-if __name__ == "__main__":
-    main()

model.py DELETED
@@ -1,25 +0,0 @@
-from tensorflow.keras.models import load_model
-import joblib
-from tensorflow.keras.preprocessing.sequence import pad_sequences
-import numpy as np
-import re
-
-# Load the model, tokenizer, and label encoder
-model = load_model("transactify.h5")
-tokenizer = joblib.load("tokenizer.joblib")
-label_encoder = joblib.load("label_encoder.joblib")
-
-def clean_text(text):
-    text = text.lower()
-    text = re.sub(r"\d+", "", text)
-    text = re.sub(r"[^\w\s]", "", text)
-    return text.strip()
-
-def predict(text):
-    cleaned_text = clean_text(text)
-    sequence = tokenizer.texts_to_sequences([cleaned_text])
-    # Pad to the same length and scheme used in training (max_len=10, post-padding)
-    padded_sequence = pad_sequences(sequence, maxlen=10, padding='post', truncating='post')
-    prediction = model.predict(padded_sequence)
-    predicted_label = np.argmax(prediction, axis=1)
-    category = label_encoder.inverse_transform(predicted_label)
-    return {"category": category[0]}

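For reference, a hypothetical call into `model.py`'s `predict` helper; it assumes `transactify.h5`, `tokenizer.joblib`, and `label_encoder.joblib` are present in the working directory (they are loaded at import time), and the printed category is illustrative:

```python
from model import predict

print(predict("Coffee at Starbucks"))  # e.g. {"category": "Food & Dining"}
```
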
requirements.txt DELETED
@@ -1,4 +0,0 @@
-numpy
-pandas
-tensorflow
-scikit-learn

setup.md CHANGED
@@ -1,53 +1,40 @@
-
-# Steps to Run the Model
-
-1. **Clone the Repository**:
-   Open your command line interface (CLI) and clone the repository using:
-   ```bash
-   git clone https://huggingface.co/webslate/transactify
-   ```
-
-2. **Create the Virtual Environment**:
-   Navigate to the project directory and create a virtual environment:
-   ```bash
-   python -m venv transactify_venv
-   ```
-
-3. **Activate the Virtual Environment**:
-   Open your command line interface (CLI) and type the following commands:
-   ```bash
-   cd transactify_venv
-   cd Scripts
-   activate
-   ```
-
-4. **Install Required Libraries**:
-   After activating the virtual environment, install the necessary libraries:
-   ```bash
-   pip install -r requirements.txt
-   ```
-
-5. **Run the Data Preprocessing Code**:
-   Execute the data preprocessing script:
-   ```bash
-   python data_preprocessing.py
-   ```
-
-6. **Run the LSTM Model Code**:
-   Train the LSTM model by executing:
-   ```bash
-   python LSTM_model.py
-   ```
-
-7. **Generate the H5 File**:
-   After training completes, the model file (`transactify.h5`) is generated.
-
-8. **Run the Prediction Code**:
-   To make predictions using the trained model, type:
-   ```bash
-   python main.py
-   ```
-
-Following these steps will set up and run the Transactify model for predicting transaction categories based on descriptions.

+## Install Git LFS
+```
+brew install git-lfs
+```
+or download from https://git-lfs.github.com/
+
+## Update global git config
+```
+$ git lfs install
+```
+
+## Update system git config
+```
+$ git lfs install --system
+```
+
+## Clone the Repo
+
+### Entire Clone
+```
+git clone https://huggingface.co/webslate/transactify
+```
+
+### Light Clone (skips LFS files)
+```
+GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/webslate/transactify
+```
+
+## For Pushing the Code
+
+> Refer to https://huggingface.co/blog/password-git-deprecation
+
+### Set the Remote URL
+```
+$ git remote set-url origin https://<user_name>:<token>@huggingface.co/<repo_path>
+```
+
+### Token Creation
+
+> Go to Settings > Access Tokens > Create new token, choose the Write tab (3rd one), or go directly to https://huggingface.co/settings/tokens/new?tokenType=write