_\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\_
_\\----------- **Resume Parser** ----------\\_
_\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\_

# Overview:
This project is a comprehensive Resume Parsing tool built using Python,
integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
If Mistral fails or encounters issues,
the system falls back to a custom-trained spaCy model to ensure continued functionality.
The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.


# Installation Guide:

1. Create and Activate a Virtual Environment
    python -m venv venv

    source venv/bin/activate  # For Linux/Mac

    # or

    venv\Scripts\activate  # For Windows


    # NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.

        - For Linux/Mac:

            source venv/bin/activate

        - For Windows:

            venv\Scripts\activate


2. Install Required Libraries
    pip install -r requirements.txt


    # Ensure the following dependencies are included:

    - Flask

    - spaCy

    - huggingface_hub

    - PyMuPDF

    - python-docx

    - odfpy (for ODT files)

    - Tesseract-OCR (for image-based parsing; the OCR engine itself is a separate program and is not installed by pip)


3. Set up Hugging Face Token
    - Add your Hugging Face token to the .env file as:
    HF_TOKEN=<your_huggingface_token>
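
A minimal sketch of how the token could be read at startup, using only the standard library. The helper names `load_env_file` and `get_hf_token` are hypothetical and are not taken from the project code; the project may well use a dedicated package for this instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE pairs from a .env-style file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an '=' sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return token
```

Checking the process environment first means a token exported in the shell takes precedence over the one stored in the file.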



# File Structure Overview:
    Mistral_With_Spacy/
    │
    ├── Spacy_Models/
    │   └── ner_model_05_3  # Pretrained spaCy model directory for resume parsing
    │
    ├── templates/
    │   ├── index.html  # UI for file upload
    │   └── result.html  # Display parsed results in structured JSON
    │
    ├── uploads/  # Directory for uploaded resume files
    │
    ├── utils/
    │   ├── mistral.py  # Code for calling Mistral API and handling responses
    │   ├── spacy.py  # spaCy fallback model for parsing resumes
    │   ├── error.py  # Error handling utilities
    │   └── fileTotext.py  # Functions to extract text from different file formats (PDF, DOCX, etc.)
    │
    ├── venv/  # Virtual environment
    │
    ├── .env  # Environment variables file (contains Hugging Face token)
    │
    ├── main.py  # Flask app handling API routes for uploading and processing resumes
    │
    └── requirements.txt  # Dependencies required for the project



# Program Overview:

    # Mistral Integration (utils/mistral.py)

        - Mistral API Calls: Uses Hugging Face’s Mistral-Nemo-Instruct-2407 model to parse resumes.

        - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.

        - Fallback Mechanism: If Mistral fails, the spaCy NER model is used as a fallback.
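
The fallback behaviour described above can be sketched roughly as follows. This is an illustration only: `parse_with_fallback`, the two stub parsers, and the `"parser"` key are hypothetical names, and the sketch assumes the primary parser returns a JSON string while the fallback returns a dict.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(resume_text, primary, fallback):
    """Try the primary parser; use the fallback if it raises or returns bad JSON."""
    try:
        raw = primary(resume_text)
        # Assumption: the primary parser hands back a JSON string.
        result = json.loads(raw)
        if not isinstance(result, dict):
            raise ValueError("expected a JSON object")
        result["parser"] = "mistral"
        return result
    except Exception as exc:  # network errors, timeouts, malformed JSON, ...
        logger.warning("Primary parser failed (%s); using fallback", exc)
        result = dict(fallback(resume_text))
        result["parser"] = "spacy"
        return result


# Stand-ins for the real Mistral and spaCy calls, for demonstration only.
def broken_primary(text):
    raise TimeoutError("model did not respond")


def stub_fallback(text):
    return {"name": "Jane Doe"}


print(parse_with_fallback("Jane Doe ...", broken_primary, stub_fallback))
# -> {'name': 'Jane Doe', 'parser': 'spacy'}
```

Recording which parser produced the result is a design choice of this sketch; it makes it easier to see how often the fallback is actually used.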


    # SpaCy Integration (utils/spacy.py)

        - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.

        - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.

        - Validation: Includes validation for extracted emails and contacts.
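
As an illustration of the validation step, the sketch below checks an email with a deliberately simple regular expression and normalizes a phone number to its digits. The function names, the pattern, and the 7 to 15 digit range are assumptions made for this example, not the rules used in `utils/spacy.py`.

```python
import re

# Simple pattern: local part, "@", one or more domain labels, alphabetic TLD.
EMAIL_RE = re.compile(
    r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}$"
)


def is_valid_email(value):
    """Return True if the value looks like an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def normalize_contact(value, min_digits=7, max_digits=15):
    """Return the digits (with an optional leading '+'), or None if implausible."""
    stripped = value.strip()
    digits = re.sub(r"\D", "", stripped)
    if not (min_digits <= len(digits) <= max_digits):
        return None
    return ("+" if stripped.startswith("+") else "") + digits


print(is_valid_email("jane.doe@example.com"))    # True
print(is_valid_email("jane@localhost"))          # False
print(normalize_contact("+1 (555) 010-9999"))    # +15550109999
print(normalize_contact("12345"))                # None
```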


    # File Conversion (utils/fileTotext.py)

       - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.

          - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.

          - DOCX Files: Uses `python-docx` to extract structured text from Word documents.

          - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.

          - RTF Files: Reads plain text from RTF files.

          - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.


       - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
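
One way to organize format-specific extraction is a lookup table keyed by file extension, sketched below. The function names and the `HANDLERS` table are hypothetical, and the use of `pytesseract` with Pillow as the Python bridge to Tesseract-OCR is an assumption; ODT and RTF handlers are left out to keep the example short. Third-party libraries are imported inside each handler so that the dispatch logic runs even when a library is missing.

```python
from pathlib import Path


def pdf_to_text(path):
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def pdf_links(path):
    """Collect the URLs of hyperlinks embedded in a PDF."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [
            link["uri"]
            for page in doc
            for link in page.get_links()
            if link.get("uri")
        ]


def docx_to_text(path):
    import docx  # python-docx

    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def image_to_text(path):
    # Assumption: pytesseract + Pillow wrap the Tesseract-OCR binary.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))


HANDLERS = {
    ".pdf": pdf_to_text,
    ".docx": docx_to_text,
    ".png": image_to_text,
    ".jpg": image_to_text,
    ".jpeg": image_to_text,
}


def pick_handler(path):
    """Choose an extraction function from the (case-insensitive) extension."""
    suffix = Path(path).suffix.lower()
    try:
        return HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


def extract_text(path):
    return pick_handler(path)(path)
```

Adding a new format then only requires writing one function and registering its extension in the table.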


    # Error Handling (utils/error.py)

        - Handles API response errors and file format errors, and ensures that fallbacks happen without crashing the app.


    # Flask API (main.py)

        - Endpoints: /upload for uploading resumes.

        - Results: Displays parsed results in JSON format on the results page.

        - UI: Simple interface for uploading resumes and viewing the parsing results.



# Tree map of the program:

    main.py
    ├── Handles API side
    ├── File upload/remove
    ├── Process resumes
    └── Show result

    utils
    ├── fileTotext.py
    │   └── Converts files to text
    │       ├── PDF
    │       ├── DOCX
    │       ├── RTF
    │       ├── ODT
    │       ├── PNG
    │       ├── JPG
    │       └── JPEG
    ├── mistral.py
    │   ├── Mistral API Calls
    │   │   └── Uses Mistral-Nemo-Instruct-2407 model
    │   ├── Personal and Professional Extraction
    │   │   ├── Extracts personal information
    │   │   └── Extracts professional information
    │   └── Fallback Mechanism
    │       └── Uses spaCy NER model if Mistral fails
    └── spacy.py
        ├── Custom Trained Model
        │   └── Uses spaCy model (ner_model_05_3)
        ├── Named Entity Recognition
        │   └── Extracts key information (Name, Email, Contact, etc.)
        └── Validation
            └── Validates emails and contacts



# References:

- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/tesseract-ocr/tesseract)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)