import streamlit as st
import numpy as np
import pandas as pd
st.header(":red[**Life Cycle Of Machine Learning Project**]")
st.write(":blue[Click the button below to explore detailed steps involved in an ML project:]")
if st.button("**Problem Statement**"):
    st.switch_page("pages/5.Problem Statement.py")
st.write("""
**A problem statement in machine learning defines the specific issue you want to solve using data and machine learning techniques. It should clearly explain:**
- What the problem is
- Why solving it is important
- What data is available
- What the expected outcome will look like
""")
st.write("""
**Example of an ML Problem Statement:**
- **Predicting House Prices:**
  - Problem: We want to predict the price of a house based on features like size, location, and number of bedrooms.
  - Why: This helps buyers make informed decisions and real estate agents price houses correctly.
  - Data: Historical data about house prices and their features.
  - Expected Outcome: A model that predicts the price of a house given its features.
""")
if st.button("**Data Collection**"):
    st.switch_page("pages/6.Data Collection.py")
st.write("""
**Collecting data is one of the earliest and most important steps in any machine learning project. This is where you gather the information needed to train your machine learning model.**

**Steps to Collect Data:**
- **1. Define the Problem**
  - Understand what kind of data you need to solve the problem.
  - Example: If you're predicting house prices, you need data on house size, location, number of rooms, etc.
- **2. Identify Sources of Data**
  - Existing Datasets: Use publicly available datasets from sources like Kaggle, UCI Machine Learning Repository, or government portals.
  - Databases: Access company or organization databases.
  - Manual Collection: Collect data through surveys, experiments, or observations.
  - APIs and Web Scraping: Use online APIs or scrape websites for specific information.
  - Example: For a weather prediction project, you can collect data from weather station APIs.
- **3. Organize the Data**
  - Make sure the data is in a usable format like spreadsheets (CSV), databases, or JSON.
  - Example: A dataset with columns like Date, Temperature, Humidity, and Rainfall.
- **4. Ensure Quality**
  - Data should be accurate, relevant, and complete. Find and fix any errors or inconsistencies.
  - Example: For customer churn prediction, check for missing customer details like age or usage data.
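
Steps 3 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a tiny made-up weather dataset written inline; the values and the `raw_lines` name are hypothetical, and a real project would read a file or an API response instead.

```python
import csv

# Hypothetical weather records, written inline so the example is self-contained.
# In a real project these lines would come from a CSV file or an API response.
raw_lines = [
    "Date,Temperature,Humidity,Rainfall",
    "2024-01-01,21.5,60,0.0",
    "2024-01-02,,65,1.2",
    "2024-01-03,19.0,70,",
]

# Organize the data: parse the CSV text into one dictionary per row.
rows = list(csv.DictReader(raw_lines))

# Ensure quality: find the rows where any column is empty.
incomplete = [row["Date"] for row in rows if any(value == "" for value in row.values())]

print(len(rows), "rows collected")
print("Rows with missing values:", incomplete)
```

Listing the incomplete rows first, instead of deleting them right away, lets you decide later whether to fill or drop them.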
""")
if st.button("**Simple EDA**"):
    st.switch_page("pages/7.Simple EDA.py")
st.write("""
**EDA (Exploratory Data Analysis) is the process of exploring your data to understand its structure, patterns, and insights before building a machine learning model. Think of it as getting to know your data better!**

**Steps for Simple EDA:**
- **1. Understand the Data**
  Look at the data to understand its structure and contents.
  - Example: If you have a dataset of students' marks, check columns like Name, Math Marks, Science Marks, and Grade.
- **2. Check the Size of the Data**
  Find out how many rows (data points) and columns (features) are in the dataset.
  - Example: Your student dataset might have 500 rows (students) and 5 columns (attributes).
- **3. View the First Few Rows**
  Look at the top 5-10 rows to get a snapshot of the data.
  - Example: Check if the columns contain relevant information like scores and grades.
- **4. Summarize the Data**
  Generate basic statistics for numerical data, such as:
  - Mean: Average of a column (e.g., average math marks).
  - Minimum and Maximum: Lowest and highest values (e.g., lowest and highest scores).
  - Count: Number of non-missing values (e.g., total students who took the test).
- **5. Handle Missing Data**
  Identify and deal with missing or incomplete values.
  - Example: If some students are missing marks, decide whether to fill them with an average or remove those rows.
- **6. Check Data Distribution**
  Visualize how data is spread using graphs:
  - Histograms: Show the distribution of scores (e.g., most students scored 70-80 in math).
  - Boxplots: Highlight outliers and data spread.
- **7. Identify Relationships**
  Check how different features relate to each other using scatter plots or correlation matrices.
  - Example: Do students with high math marks also score high in science?
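
Here is a minimal sketch of steps 2 to 5 in plain Python, assuming a small made-up student table where `None` marks a missing value. The names and marks are hypothetical; libraries such as pandas offer ready-made versions of these checks.

```python
from statistics import mean

# Hypothetical student marks; None marks a missing value.
students = [
    {"Name": "Asha", "Math": 78, "Science": 82},
    {"Name": "Ben", "Math": 65, "Science": None},
    {"Name": "Chen", "Math": 91, "Science": 88},
    {"Name": "Dia", "Math": 72, "Science": 75},
]

# Size of the data: rows are students, columns are attributes.
n_rows, n_cols = len(students), len(students[0])

# First few rows: a quick snapshot of the data.
first_rows = students[:2]

# Summary statistics for one numerical column, skipping missing values.
def summarize(column):
    values = [s[column] for s in students if s[column] is not None]
    return {"count": len(values), "mean": mean(values), "min": min(values), "max": max(values)}

math_summary = summarize("Math")
science_summary = summarize("Science")

# Missing data: how many values are missing in each column?
missing = {col: sum(s[col] is None for s in students) for col in ("Math", "Science")}

print(n_rows, n_cols)
print(first_rows)
print(math_summary)
print(missing)
```

Note that the count for Science is 3, not 4, because one student has no science mark.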
""")
if st.button("**Data Pre-processing**"):
    st.switch_page("pages/8.Data Pre-processing.py")
st.write("**Data preprocessing is the process of cleaning and preparing raw data so it can be used by a machine learning model. It ensures that the data is in the right format, free from errors, and ready for analysis.**")
st.write("""
**Why is Data Preprocessing Important?**
- Raw data often contains errors, missing values, or irrelevant information.
- Clean and processed data improves the accuracy and performance of the model.
""")
st.write("""
**Steps in Data Preprocessing:**
- **1. Collect the Data**
  Gather data from sources like CSV files, databases, or APIs.
  - Example: A dataset of house prices with columns like Size, Location, Price, and Year Built.
- **2. Handle Missing Data**
  Replace or remove missing values so the model doesn't face errors.
  - **Methods:**
    - Fill with mean, median, or mode.
    - Remove rows or columns with too many missing values.
  - Example: If Size is missing for some houses, replace it with the average size.
- **3. Remove Outliers**
  Outliers are extreme values that can distort the model. Use methods like z-score or interquartile range (IQR) to identify and handle them.
  - Example: A house with a price 10x higher than similar houses might be an outlier.
- **4. Convert Categorical Data to Numbers**
  Most machine learning models work with numbers, so categorical data must be converted.
  - **Techniques:**
    - Label Encoding: Assign a number to each category (e.g., Male = 0, Female = 1).
    - One-Hot Encoding: Create new columns for each category with binary values (0 or 1).
  - Example: Convert Location (e.g., "City A", "City B") into numerical values.
- **5. Scale Features**
  Ensure all numerical values are on the same scale so that no feature dominates.
  - **Techniques:**
    - Normalization: Rescale values to be between 0 and 1.
    - Standardization: Scale data to have a mean of 0 and standard deviation of 1.
  - Example: House sizes (in square feet) might range from 500 to 5,000, while prices range in millions; scaling puts both on a comparable range.
- **6. Split the Data**
  Divide the data into training and testing sets.
  - Training set: Used to train the model.
  - Testing set: Used to evaluate the model’s performance.
  - Example: Split 80% of the data for training and 20% for testing.
  - Note: In practice, compute fill values and scaling statistics from the training set only, so no information from the testing set leaks into training.
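
Steps 2 to 5 can be sketched in plain Python as follows. This is a minimal illustration on made-up house records, not a definitive pipeline: the median is used as the fill value (one of the options listed above), the 1.5 * IQR rule is one common outlier convention, and quartile values can differ slightly between libraries.

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical house records; None marks a missing size.
houses = [
    {"Size": 850, "Location": "City A"},
    {"Size": 1200, "Location": "City B"},
    {"Size": None, "Location": "City A"},
    {"Size": 1500, "Location": "City B"},
    {"Size": 1100, "Location": "City A"},
    {"Size": 1300, "Location": "City B"},
    {"Size": 950, "Location": "City A"},
    {"Size": 1400, "Location": "City B"},
    {"Size": 9000, "Location": "City A"},
]

# Handle missing data: fill gaps with the median of the known sizes.
fill_value = median(h["Size"] for h in houses if h["Size"] is not None)
for h in houses:
    if h["Size"] is None:
        h["Size"] = fill_value

# Remove outliers: keep sizes within 1.5 * IQR of the quartiles.
q1, _, q3 = quantiles([h["Size"] for h in houses], n=4)
iqr = q3 - q1
houses = [h for h in houses if q1 - 1.5 * iqr <= h["Size"] <= q3 + 1.5 * iqr]

# Convert categorical data: one-hot encode Location into 0/1 columns.
for h in houses:
    for city in ("City A", "City B"):
        h["Is_" + city.replace(" ", "_")] = int(h["Location"] == city)

# Scale features: normalization (0 to 1) and standardization (mean 0, std 1).
sizes = [h["Size"] for h in houses]
lo, hi = min(sizes), max(sizes)
normalized = [(s - lo) / (hi - lo) for s in sizes]
mu, sigma = mean(sizes), pstdev(sizes)
standardized = [(s - mu) / sigma for s in sizes]

print(len(houses), "houses kept")
print(houses[0])
print([round(v, 2) for v in normalized])
```

The 9000 square foot house is dropped as an outlier, so it no longer stretches the 0 to 1 range used for normalization.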
""")
if st.button("**Exploratory Data Analysis (EDA)**"):
    st.switch_page("pages/9.Exploratory Data Analysis (EDA).py")
st.write("**EDA (Exploratory Data Analysis) is like getting to know your dataset before using it in a machine learning model. It helps you understand the data's structure, patterns, and relationships to decide how to process and use it effectively.**")
st.write("""
**Why is EDA Important?**
- Identifies errors, missing values, or outliers.
- Helps understand data distribution and trends.
- Guides feature selection and engineering.
- Gives insights for choosing the right ML model.
""")
st.write("""
**Steps in EDA:**
- **Understand the Dataset**
  - Look at the structure of your data (rows, columns, and types of values).
  - Example: In a student dataset, check if columns include Name, Math Marks, and Grade.
- **Summarize the Data**
  - Generate statistics like mean, median, minimum, maximum, and standard deviation.
  - Example: For math scores, check the average, highest, and lowest scores.
- **Handle Missing Values**
  - Identify any missing data and decide how to fix it (e.g., fill with average values or remove).
  - Example: If a student is missing Science Marks, fill it with the average science score.
- **Visualize the Data**
  - Create plots to understand data distributions and relationships:
    - Histograms: Show how data is spread across a range (e.g., how many students scored between 70-80).
    - Boxplots: Highlight outliers and data spread.
    - Scatter Plots: Show relationships between two variables (e.g., Attendance vs. Marks).
- **Check Relationships**
  - Use a correlation matrix to see how features relate to each other.
  - Example: See if Attendance has a strong positive correlation with Math Marks.
- **Identify Outliers**
  - Look for extreme values that might distort the analysis.
  - Example: A student with Marks = 0 when others scored 70-100 could be an error.
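
The last three steps can be sketched in plain Python. This is a minimal illustration, assuming made-up attendance and marks for eight students; the text histogram stands in for a real plot, and the cutoff of 2 standard deviations for outliers is an arbitrary choice for this sketch.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev

# Hypothetical attendance (%) and math marks for eight students.
attendance = [95, 80, 60, 90, 70, 85, 50, 75]
marks = [88, 76, 58, 85, 69, 80, 0, 72]

# Visualize the data (text version of a histogram): count marks per 10-point bin.
bins = Counter((m // 10) * 10 for m in marks)
for start in sorted(bins):
    print(start, "#" * bins[start])

# Check relationships: Pearson correlation between two columns.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

r = pearson(attendance, marks)
print("Correlation:", round(r, 2))

# Identify outliers: flag marks more than 2 standard deviations from the mean.
mu, sigma = mean(marks), pstdev(marks)
outliers = [m for m in marks if abs(m - mu) / sigma > 2]
print("Outliers:", outliers)
```

The mark of 0 is flagged as an outlier; whether it is a data entry error or a real result still needs a human check.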
""")
if st.button("**Feature Engineering**"):
    st.switch_page("pages/10.Feature Engineering.py")
st.write("**Feature engineering is the process of creating, modifying, or selecting features (columns) in your dataset to make machine learning models work better. Features are the input data that the model uses to learn and make predictions.**")
st.write("""
**Why is Feature Engineering Important?**
- Improves model accuracy and performance.
- Helps the model understand the data better.
- Reduces noise and irrelevant information.
""")
st.write("""
**Steps in Feature Engineering:**
- **1. Select Relevant Features**
  Keep only the columns that are important for the problem.
  - Example: If you’re predicting house prices, keep features like Size, Location, and Year Built, but remove irrelevant ones like Owner's Name.
- **2. Handle Missing Values**
  Fill or remove missing data to ensure clean features.
  - Example: Fill missing Age values with the average age.
- **3. Create New Features**
  Combine or transform existing columns to make new useful ones.
  - Example: If you have Date of Birth, create a new feature called Age.
- **4. Transform Features**
  Modify features to improve their scale or distribution.
  - Normalize or standardize numerical features.
  - Example: Rescale house prices in millions to a range between 0 and 1.
- **5. Encode Categorical Data**
  Convert non-numeric (categorical) data into numbers.
  - One-Hot Encoding: Create new binary columns for each category.
  - Label Encoding: Assign numbers to categories.
  - Example: Convert Color (Red, Blue, Green) into binary columns Is_Red, Is_Blue, and Is_Green.
- **6. Feature Scaling**
  Ensure all numerical features are on the same scale so one doesn’t dominate the others.
  - Example: Scale features like Salary (in thousands) and Experience (in years) to a similar range.
- **7. Feature Selection**
  Choose only the most important features to avoid overloading the model.
  - Use methods like correlation analysis or feature importance scores. PCA (Principal Component Analysis) is a related technique that reduces the number of features by combining them into new ones rather than selecting from the originals.
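
Steps 3 and 5 can be sketched in plain Python. This is a minimal illustration, assuming three made-up records and a fixed reference date so the ages are reproducible; the `age_on` helper is a hypothetical name introduced for this example.

```python
from datetime import date

# Hypothetical records; the reference date is fixed so the result is reproducible.
people = [
    {"Date of Birth": date(1990, 5, 17), "Color": "Red"},
    {"Date of Birth": date(2001, 11, 3), "Color": "Blue"},
    {"Date of Birth": date(1985, 1, 30), "Color": "Green"},
]
today = date(2024, 6, 1)

def age_on(day, born):
    # Subtract one year if the birthday has not happened yet this year.
    return day.year - born.year - ((day.month, day.day) < (born.month, born.day))

for person in people:
    # Create a new feature: Age derived from Date of Birth.
    person["Age"] = age_on(today, person["Date of Birth"])
    # Encode categorical data: one binary column per color.
    for color in ("Red", "Blue", "Green"):
        person["Is_" + color] = int(person["Color"] == color)

print([p["Age"] for p in people])
print(people[0])
```

Each person ends up with exactly one of Is_Red, Is_Blue, and Is_Green set to 1, which is what one-hot encoding means.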
""")
if st.button("**Training**"):
    st.switch_page("pages/11.Training.py")
st.write("**Training a machine learning model is the process of teaching the model to make predictions by learning patterns in the data. This is done by showing the model examples (training data) and adjusting it so it performs well.**")
st.write("""
**Steps in Training a Model:**
- **1. Prepare the Data**
  Split your data into:
  - Training Set: Used to train the model (usually 70-80% of the data).
  - Testing Set: Used to check how well the model performs on unseen data.
  - Example: If you have 100 rows of student data, use 80 rows for training and 20 rows for testing.
- **2. Choose a Model**
  Select the algorithm or method to use for predictions. Common models include:
  - Linear Regression (for predicting numbers).
  - Decision Trees (for classification or regression).
  - K-Nearest Neighbors (KNN) (for classification or regression based on the most similar training examples).
- **3. Train the Model**
  Show the training data to the model so it can learn the patterns.
  - During this process, the model adjusts its internal parameters to minimize errors.
  - Example: A student performance prediction model might learn that Attendance and Study Hours are important for predicting grades.
- **4. Test the Model**
  Check the model's performance by giving it the testing data (data it hasn't seen before).
  - The model makes predictions, and you compare them to the actual values.
  - Example: If the model predicts a student's grade as A and the actual grade is also A, the prediction is correct.
- **5. Evaluate the Model**
  Measure how well the model is performing using metrics like:
  - Accuracy: Percentage of correct predictions.
  - Mean Squared Error (MSE): Average of the squared differences between predicted and actual values, used for numerical predictions.
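
The five steps can be sketched end to end in plain Python. This is a minimal illustration, assuming made-up data where the score is exactly 5 * hours + 40, so a fitted line should recover those numbers; real data is noisy and the error would not be zero.

```python
import random
from statistics import mean

# Hypothetical data: study hours and exam scores that follow score = 5 * hours + 40.
hours = list(range(1, 11))
scores = [5 * h + 40 for h in hours]
data = list(zip(hours, scores))

# Prepare the data: shuffle, then use 80% for training and 20% for testing.
random.seed(0)
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Choose and train the model: fit a line y = slope * x + intercept by least squares.
mean_x = mean(x for x, _ in train)
mean_y = mean(y for _, y in train)
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum((x - mean_x) ** 2 for x, _ in train)
intercept = mean_y - slope * mean_x

# Test and evaluate: Mean Squared Error on rows the model has not seen.
mse = mean((slope * x + intercept - y) ** 2 for x, y in test)

print(len(train), len(test))
print(round(slope, 2), round(intercept, 2), round(mse, 4))
```

Here "adjusting internal parameters" means finding the slope and intercept; larger models do the same thing with many more parameters.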
""")
if st.button("**Testing**"):
    st.switch_page("pages/12.Testing.py")
st.write("**Testing a machine learning model is the process of checking how well the model works on new, unseen data. This step helps you understand if the model can make accurate predictions or decisions when applied to real-world scenarios.**")
st.write("""
**Why is Testing Important?**
- Ensures the model doesn’t just memorize the training data but can generalize to new situations.
- Identifies if the model needs improvement.
- Measures the model's accuracy, precision, or error rate.
""")
st.write("""
**Steps in Testing a Machine Learning Model:**
- **1. Prepare the Test Data**
  - Use a separate dataset (called the testing set) that the model hasn’t seen during training.
  - Example: If you’re predicting student grades, the test data could include students whose information was not part of the training.
- **2. Run the Model on Test Data**
  - Use the model to predict outcomes based on the test data's input features.
  - Example: For a grade prediction model, test data might have Study Hours and Attendance as inputs. The model predicts the grade.
- **3. Compare Predictions to Actual Outcomes**
  - Check how close the predictions are to the real values.
  - Example: If the model predicts Grade = B and the actual grade is also B, it’s correct.
- **4. Evaluate Performance**
  - Use metrics to measure how well the model is performing:
    - Accuracy: How many predictions were correct?
    - Precision/Recall: Useful for classification problems.
    - Mean Squared Error (MSE): Measures error in numerical predictions.
  - Example: If the model predicts grades for 10 students and gets 9 right, the accuracy is 90%.
- **5. Analyze Errors**
  - Understand where the model made mistakes to identify areas for improvement.
  - Example: If the model struggles with students with low attendance, you might need more training data for that group.
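
Steps 3 to 5 can be sketched in plain Python. This is a minimal illustration, assuming made-up pass (1) or fail (0) results for 10 students with 9 correct predictions, which matches the 90% accuracy example above.

```python
# Hypothetical test results for 10 students: did each one pass (1) or fail (0)?
actual = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # correctly predicted passes
fp = sum(a == 0 and p == 1 for a, p in pairs)  # predicted pass, actually fail
fn = sum(a == 1 and p == 0 for a, p in pairs)  # predicted fail, actually pass

# Evaluate performance.
accuracy = sum(a == p for a, p in pairs) / len(pairs)
precision = tp / (tp + fp)  # of the predicted passes, how many were real?
recall = tp / (tp + fn)     # of the real passes, how many were found?

# Analyze errors: list the positions where the model was wrong.
errors = [i for i, (a, p) in enumerate(pairs) if a != p]

print(accuracy, round(precision, 3), recall, errors)
```

Accuracy is 0.9, but precision is lower (6 out of 7) because the single mistake was a student wrongly predicted to pass.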
""")
if st.button("**Deployment**"):
    st.switch_page("pages/13.Deployment.py")
st.write("**Deployment is the process of making a trained machine learning model available for real-world use. It allows people or systems to use the model to make predictions or decisions on new data.**")
st.write("""
**Why Deployment is Important:**
- To apply the model to solve real-world problems.
- To provide predictions or insights for users, apps, or businesses.
- To continuously monitor and improve the model over time.
""")
st.write("""
**Steps in Deployment:**
- **1. Prepare the Model**
  - Train and test your model until it performs well.
  - Save the final version of the model.
  - Example: Use Python libraries like joblib or pickle to save the trained model to a file.
- **2. Set Up a Deployment Environment**
  - Decide where the model will run:
    - On a Cloud Server: For large-scale use (e.g., AWS, Google Cloud, Azure).
    - On a Local System: For small or private applications.
- **3. Create a User Interface (Optional)**
  - Build an application that users can interact with.
  - Example: Use a web app (like Streamlit or Flask) to let users input data and get predictions.
- **4. Serve the Model**
  - Set up an API (Application Programming Interface) so the model can receive input and return output.
  - Example: Use Flask or FastAPI to create an API endpoint that the model responds to.
- **5. Monitor Performance**
  - Continuously track the model's accuracy and performance in the real world.
  - Example: If the model starts making more mistakes, it may need retraining.
- **6. Update the Model**
  - Retrain the model with new data as the problem or environment evolves.
  - Example: A house price prediction model might need updates as market trends change.
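
Saving and reloading a model (step 1) can be sketched with the standard pickle module. This is a minimal illustration: the "model" is a made-up dictionary holding a slope and an intercept, and `predict` is a hypothetical helper; a real project would pickle its trained model object the same way.

```python
import os
import pickle
import tempfile

# A stand-in for a trained model: a line y = slope * x + intercept.
# The parameter values are made up for illustration.
model = {"slope": 5.0, "intercept": 40.0}

def predict(model, hours):
    return model["slope"] * hours + model["intercept"]

# Prepare the model: save the final version to a file.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Serve the model: the deployed app loads the file once, then answers requests.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(predict(loaded, 6))
```

Only load pickle files that you created or trust, because loading a pickle can run arbitrary code.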
""")
if st.button("**Monitoring**"):
    st.switch_page("pages/14.Monitoring.py")
st.write("**Monitoring a machine learning model means keeping track of how well it performs after it has been deployed. It helps you make sure the model continues to give accurate predictions when used in the real world.**")
st.write("""
**Why Monitoring is Important:**
- Ensure model accuracy: The model might perform well initially but could start making mistakes over time.
- Detect problems: For example, if the data changes, the model might need retraining.
- Keep the model updated: Regular monitoring helps decide when to update or retrain the model.
""")
st.write("""
**Steps in Monitoring a Machine Learning Model:**
- **1. Track Model Performance**
  - Measure how well the model is doing after deployment using metrics like:
    - Accuracy: How often the model is correct.
    - Precision and Recall: Important for classification problems.
    - Mean Squared Error (MSE): Useful for regression models.
    - AUC-ROC: Measures how well a classification model separates the classes across all decision thresholds.
- **2. Monitor for Data Drift**
  - Data drift happens when the patterns in the new data are different from the data used to train the model.
  - Example: A house price prediction model trained on old data might perform poorly if the market changes.
- **3. Track Prediction Errors**
  - Look for situations where the model makes large or consistent errors.
  - Example: A fraud detection model might fail to catch new types of fraud.
- **4. Monitor Model Latency and Speed**
  - Ensure that the model is making predictions quickly enough for real-time use.
  - Example: A recommendation system in an online store needs to suggest products in a few seconds.
- **5. Check Resource Usage**
  - Keep track of how much computing power and memory the model is using.
  - Example: A large deep learning model might use a lot of resources and slow down a website.
- **6. Update or Retrain the Model**
  - If the model’s performance drops, you may need to retrain it with new data.
  - Example: If new features (like a new product category) become available, the model should be retrained.
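
Steps 2 and 4 can be sketched in plain Python. This is a minimal illustration, assuming made-up house sizes: `drift_score` is a hypothetical helper that measures how far the average of new data has moved from the training average, and the threshold of 2 is an arbitrary choice. Real monitoring tools use more thorough statistical tests.

```python
import time
from statistics import mean, pstdev

# Hypothetical house sizes seen during training and after deployment.
training_sizes = [850, 1200, 1500, 1100, 1300, 950, 1400, 1250]
live_sizes = [2100, 2400, 1900, 2600, 2300, 2200, 2500, 2000]

def drift_score(train, live):
    # How many training standard deviations the live mean has moved.
    return abs(mean(live) - mean(train)) / pstdev(train)

# Monitor for data drift: the threshold of 2 is an arbitrary choice for this sketch.
score = drift_score(training_sizes, live_sizes)
needs_retraining = score > 2

# Monitor latency: time a single prediction from a stand-in model.
def predict(size):
    return 150 * size + 20000

start = time.perf_counter()
predict(1200)
latency_seconds = time.perf_counter() - start

print(round(score, 2), needs_retraining)
print(f"{latency_seconds:.6f} seconds")
```

The new houses are much larger than anything in the training data, so the drift check suggests retraining even before prediction errors are measured.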
""")