---
license: mit
language:
- en
---
|
|
|
## What is Transactify?
|
Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions.

By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining."

This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting.
|
Transactify is trained on a ChatGPT-generated dataset of labeled transaction descriptions (see *Data Collection* below) for improved accuracy and generalization.
|
|
|
## Table of contents

1. Data Collection
2. Data Preprocessing
3. Model Building
4. Prediction
|
## 1. Data Collection
|
The dataset consists of **5,000 transaction records** generated using ChatGPT, each containing a transaction description and its corresponding category.
|
|
|
Example entries include:

- "Live concert stream on YouTube" (Movies & Entertainment)
- "Coffee at Starbucks" (Food & Dining)

These records cover various spending categories such as **Lifestyle**, **Movies & Entertainment**, **Food & Dining**, and others.
|
|
|
---
|
|
|
## 2. Data Preprocessing

The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training. These include:
|
|
|
- Lowercasing all text.
- Removing digits and punctuation using regular expressions (regex).
- Tokenizing the cleaned text to convert it into a sequence of tokens.
- Applying `texts_to_sequences` to transform the tokenized words into numerical sequences.
- Using `pad_sequences` to ensure all sequences have the same length for input into the LSTM model.
- Label encoding the target categories to convert them into numerical labels.
|
|
|
After preprocessing, the data is split into training and testing sets to build and validate the model.
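The sketch below illustrates this pipeline with Keras and scikit-learn. The file name, column names (`description`, `category`), vocabulary size, and sequence length are illustrative assumptions, not confirmed values from the training script.

```python
import re

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df = pd.read_csv("transactions.csv")  # assumed file name

def clean_text(text):
    text = text.lower()                  # lowercase all text
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    return text.strip()

df["description"] = df["description"].apply(clean_text)

# Tokenize and convert descriptions to padded numerical sequences.
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")  # assumed vocab size
tokenizer.fit_on_texts(df["description"])
sequences = tokenizer.texts_to_sequences(df["description"])
X = pad_sequences(sequences, maxlen=20, padding="post")    # assumed max length

# Encode the category labels as integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["category"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```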
|
|
|
---
|
|
|
## 3. Model Building
|
- **Embedding Layer**: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
- **LSTM Layer**: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.
- **Dropout Layer**: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.
- **Dense Layer with Softmax Activation**: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
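A minimal Keras sketch of this stack, continuing from the preprocessing snippet above; the vocabulary size, embedding dimension, LSTM units, and dropout rate are illustrative assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),  # dense word vectors (assumed sizes)
    LSTM(64),                                   # sequential patterns in the text
    Dropout(0.5),                               # regularization against overfitting
    Dense(len(label_encoder.classes_),          # one output per category
          activation="softmax"),                # probability distribution
])
```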
|
|
|
### Model Compilation

- Compiled with the Adam optimizer for efficient learning.
- Sparse categorical cross-entropy loss for multi-class classification.
- Accuracy as the evaluation metric.
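In code, these settings translate directly to:

```python
model.compile(
    optimizer="adam",                        # Adam optimizer
    loss="sparse_categorical_crossentropy",  # integer (label-encoded) targets
    metrics=["accuracy"],                    # evaluation metric
)
```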
|
|
|
### Model Training

The model is trained for **50 epochs** with a batch size of **8**, using a validation set to monitor performance during training.
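A sketch of the training call with the stated hyperparameters; using the held-out test split for validation is an assumption.

```python
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),  # assumed: test split doubles as validation
    epochs=50,
    batch_size=8,
)
```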
|
|
|
### Saving the Model and Preprocessing Objects

- The trained model is saved as `transactify.h5` for future use.
- The tokenizer and label encoder used during preprocessing are saved with joblib as `tokenizer.joblib` and `label_encoder.joblib`, so they can be reused for consistent tokenization and label encoding when making predictions on new data.
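These artifacts can be persisted as follows:

```python
import joblib

model.save("transactify.h5")                        # trained Keras model
joblib.dump(tokenizer, "tokenizer.joblib")          # fitted tokenizer
joblib.dump(label_encoder, "label_encoder.joblib")  # fitted label encoder
```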
|
|
|
---
|
|
|
## 4. Prediction

Once trained, the model can classify new transaction descriptions. The saved model, tokenizer, and label encoder are loaded, the input text is cleaned and converted to a padded sequence exactly as during training, and the category with the highest softmax probability is returned.
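A minimal inference sketch under those assumptions; the cleaning steps and `maxlen` must mirror the values used during training (assumed here), and `predict_category` is an illustrative helper, not a function from the repository.

```python
import re

import joblib
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model("transactify.h5")
tokenizer = joblib.load("tokenizer.joblib")
label_encoder = joblib.load("label_encoder.joblib")

def predict_category(description):
    # Mirror the training-time cleaning: lowercase, strip digits and punctuation.
    text = re.sub(r"[^\w\s]", "", re.sub(r"\d+", "", description.lower()))
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=20, padding="post")  # assumed maxlen
    probs = model.predict(padded)
    return label_encoder.inverse_transform([probs.argmax()])[0]

print(predict_category("Coffee at Starbucks"))  # e.g. "Food & Dining"
```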