---
license: mit
language:
- en
---

## What is Transactify?
Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions. 
By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining." 
This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting. 
Transactify is trained on a labeled set of ChatGPT-generated transaction descriptions (see Data Collection).

## 1. Data Collection
The dataset consists of **5,000 transaction records** generated using ChatGPT, each containing a transaction description and its corresponding category. 

Example entries include:
- "Live concert stream on YouTube" (Movies & Entertainment)
- "Coffee at Starbucks" (Food & Dining)

These records cover various spending categories such as **Lifestyle**, **Movies & Entertainment**, **Food & Dining**, and others.

---

## 2. Data Preprocessing
The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training. These include:

- Lowercasing all text.
- Removing digits and punctuation using regular expressions (regex).
- Tokenizing the cleaned text to convert it into a sequence of tokens.
- Applying the tokenizer's `texts_to_sequences` method to transform the tokenized words into numerical sequences.
- Using `pad_sequences` to ensure all sequences have the same length for input into the LSTM model.
- Label encoding the target categories to convert them into numerical labels.

After preprocessing, the data is split into training and testing sets to build and validate the model.
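
The steps above can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: the helper names (`clean`, `fit_vocab`, `to_sequence`, `pad`) and the `max_len` value are assumptions, and the helpers only approximate what the Keras tokenizer and `pad_sequences` do (word indices start at 1, with 0 reserved for padding on the left).

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, then drop digits and punctuation with a regex."""
    text = text.lower()
    # Keeps only a-z and whitespace, so non-ASCII letters are dropped too.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def fit_vocab(texts):
    """Map each word to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(word for t in texts for word in t.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, vocab):
    """Replace each known word by its index; unknown words are skipped."""
    return [vocab[w] for w in text.split() if w in vocab]

def pad(seq, max_len):
    """Keep the last max_len tokens and left-pad with zeros to a fixed length."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

# Two toy records in the style of the dataset
descriptions = ["Live concert stream on YouTube", "Coffee at Starbucks #2"]
categories = ["Movies & Entertainment", "Food & Dining"]

cleaned = [clean(d) for d in descriptions]
vocab = fit_vocab(cleaned)
max_len = 6  # assumed value; the real sequence length is not stated
padded = [pad(to_sequence(c, vocab), max_len) for c in cleaned]

# Label encoding: sorted class names mapped to 0..n-1
label_ids = {name: i for i, name in enumerate(sorted(set(categories)))}
labels = [label_ids[c] for c in categories]
```

Fitting the vocabulary and the label mapping once, and reusing them at prediction time, keeps the integer ids consistent between training and inference.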

---

## 3. Model Building
- **Embedding Layer**: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
  
- **LSTM Layer**: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.

- **Dropout Layer**: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.

- **Dense Layer with Softmax Activation**: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
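
To make the Dropout bullet concrete, here is a small plain-Python sketch of inverted dropout, the variant commonly used by deep learning frameworks. The function name and the rate of 0.5 are assumptions for illustration; the rate actually used by Transactify is not stated.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout: zero each unit with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(activations)  # inference: values pass through unchanged
    keep = 1.0 - rate
    # Surviving units are scaled by 1/keep so the expected value is preserved.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
hidden = [0.5, -1.2, 0.8, 2.0]  # made-up activations
train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.5, training=False)
```

Because the scaling happens during training, nothing needs to change at prediction time: the layer simply returns its input.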

### Model Compilation
- Compiled with the Adam optimizer for efficient learning.
- Sparse categorical cross-entropy loss for multi-class classification.
- Accuracy as the evaluation metric.
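
As a rough illustration of what the softmax output and the sparse categorical cross-entropy loss compute for a single example, consider the sketch below. The logits are made-up numbers and the function names are illustrative; in practice the framework computes this over whole batches.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_index, probs):
    """Negative log-probability assigned to the true class (an integer label)."""
    return -math.log(probs[true_index])

logits = [2.0, 0.5, -1.0]  # hypothetical scores for three categories
probs = softmax(logits)
loss = sparse_categorical_crossentropy(0, probs)  # true class is index 0
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The "sparse" variant takes integer class ids directly, which is why label encoding (rather than one-hot encoding) is enough in the preprocessing step.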

### Model Training
The model is trained for **50 epochs** with a batch size of **8**, with a validation set used to monitor performance during training.
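
For a sense of scale, the number of weight updates implied by these settings can be estimated as follows. The 80% training fraction is an assumption, since the actual split ratio is not given here.

```python
import math

n_records = 5000      # dataset size from the Data Collection section
train_fraction = 0.8  # assumption: the split ratio is not stated
batch_size = 8
epochs = 50

n_train = round(n_records * train_fraction)
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
```

Under that assumption, each epoch runs 500 batches, for 25,000 updates over the full training run.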

### Saving the Model and Preprocessing Objects
- The trained model is saved as `transactify.h5` for future use.
- The tokenizer and label encoder used during preprocessing are saved using joblib as `tokenizer.joblib` and `label_encoder.joblib`, respectively, ensuring they can be reused for consistent tokenization and label encoding when making predictions on new data.

---

## 4. Prediction
Once trained