Commit 69408b5 by tmlinhdinh: "add nb" (parent: b8e2ff5)

Files changed:
- report/report.ipynb: ADDED (+0 -0); the diff for this file is too large to render, see the raw diff
- report/report.md: CHANGED (+0 -34)

report/report.md:
```diff
@@ -19,11 +19,7 @@ I would also like to test out **Semantic Chunking** because it focuses on dividi
 # Task 2: Building a Quick End-to-End Prototype
 
 ### 1. Build a prototype and deploy to a Hugging Face Space, and create a short (< 2 min) loom video demonstrating some initial testing inputs and outputs.
-<<<<<<< HEAD
 Loom link: https://www.loom.com/share/2ee6825aed60424da6aadc414bbc800a?sid=1994ecda-c52e-4e71-afe9-166e862296e4
-=======
-Link:
->>>>>>> e107018 (add report)
 
 ### 2. How did you choose your stack, and why did you select each tool the way you did?
 I built a Retrieval-Augmented Generation (RAG) system with:
```
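The stack list itself falls outside this hunk, but the retrieve-then-generate flow a RAG prototype like this follows can be sketched in a few lines. Everything below is illustrative: the bag-of-words `embed` function and the tiny corpus are stand-ins for a real embedding model and vector store, and the prompt would normally be sent to an LLM rather than just constructed.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    # The retrieved chunks become grounding context for the LLM call.
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The executive order covers safe and trustworthy AI.",
    "Semantic chunking splits documents at meaning boundaries.",
    "Qdrant stores embeddings for similarity search.",
]
prompt = build_prompt("What does the executive order cover?", corpus)
```

The same retrieve/augment/generate shape holds whatever the concrete embedding model, vector store, and LLM are.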
```diff
@@ -44,10 +40,6 @@ Each tool was selected to balance performance, scalability, and efficiency for t
 ![](./ragas_prototype.png)
 
 ### 2. What conclusions can you draw about performance and effectiveness of your pipeline with this information?
-The pipeline shows strong retrieval performance but needs improvement in generating accurate answers:
-
-<<<<<<< HEAD
-### What conclusions can you draw about performance and effectiveness of your pipeline with this information?
 
 The pipeline demonstrates reasonably strong retrieval capabilities but reveals areas for improvement in the accuracy and relevance of generated answers:
 
```
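The conclusions above rest on RAGAS retrieval metrics. The real RAGAS implementation scores these with LLM judgments, but the set-based intuition behind context precision and recall can be sketched with hypothetical retrieved and ground-truth chunk IDs:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved chunks that are actually relevant.
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of the relevant chunks that made it into the context.
    return sum(c in set(retrieved) for c in relevant) / len(relevant)

retrieved = ["c1", "c2", "c3", "c7"]   # hypothetical chunk IDs
relevant = {"c1", "c2", "c3"}
# High recall with slightly lower precision mirrors the pattern the
# report describes: retrieval is strong, generation lags behind.
```

Generation-side metrics like faithfulness and answer correctness have no equally simple set formula, which is why RAGAS delegates them to an LLM judge.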
```diff
@@ -58,13 +50,3 @@ The pipeline demonstrates reasonably strong retrieval capabilities but reveals a
 
 ### Summary:
 - The pipeline is doing well at retrieving relevant context, but the generation of accurate and faithful answers needs refinement. This suggests a potential need for either a more advanced QA model or improvements in how retrieved chunks are passed to the answer generation process.
-=======
-### Strengths:
-- **Context Recall (92.2%)** and **Precision (91.9%)** are high, meaning the system effectively retrieves relevant information.
-- **Faithfulness (75.9%)** indicates that the generated answers are mostly grounded in retrieved data.
-
-### Weaknesses:
-- **Answer Correctness (52.6%)** and **Relevancy (67.6%)** need improvement, as the system struggles to generate consistently correct and relevant responses.
-
-**Summary**: Retrieval is excellent, but the QA generation needs refinement for more accurate answers.
->>>>>>> e107018 (add report)
```
```diff
@@ -71,13 +53,8 @@
 
 # Task 4: Fine-Tuning Open-Source Embeddings
 
 ### 1. Swap out your existing embedding model for the new fine-tuned version. Provide a link to your fine-tuned embedding model on the Hugging Face Hub.
-<<<<<<< HEAD
 Fine-tuning model link: https://huggingface.co/ldldld/snowflake-arctic-embed-m-finetuned
 
 ### 2. How did you choose the embedding model for this application?
 I selected `Snowflake/snowflake-arctic-embed-m` as the model for fine-tuning. To make this choice, I referred to the `mteb/leaderboard`, filtered for models with fewer than 250M parameters. Then I looked at all the top ranking models, filtered out models from personal accounts and models that require me to execute some suspicious executable. That ultimately left me with `Snowflake/snowflake-arctic-embed-m`, which is actually the one we used in class.
-=======
-
-### 2. How did you choose the embedding model for this application?
->>>>>>> e107018 (add report)
```
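The selection process the report describes, filtering the MTEB leaderboard by parameter count and trust criteria and then ranking by score, is mechanical enough to sketch. The rows and scores below are made up for illustration, not real leaderboard numbers:

```python
# Hypothetical (model, params_in_millions, avg_score) leaderboard rows.
leaderboard = [
    ("org/big-embed", 1300, 66.2),
    ("Snowflake/snowflake-arctic-embed-m", 109, 64.7),
    ("personal/risky-embed", 120, 65.0),
    ("org/tiny-embed", 33, 60.1),
]
# Models excluded on trust grounds (personal accounts, untrusted code).
blocked = {"personal/risky-embed"}

candidates = sorted(
    (row for row in leaderboard
     if row[1] < 250 and row[0] not in blocked),
    key=lambda row: row[2],
    reverse=True,
)
best = candidates[0][0]
```

With these placeholder scores the filter leaves `Snowflake/snowflake-arctic-embed-m` on top, matching the choice made in the report.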
```diff
@@ -84,10 +61,7 @@
 
 # Task 5: Assessing Performance
 
 ### 1. Test the fine-tuned embedding model using the RAGAS frameworks to quantify any improvements. Provide results in a table.
-<<<<<<< HEAD
 ![](./ragas_finetune.png)
 
 It seems that off-the-shelve embedding model from OpenAI `text-embedding-3-small` is still better for our RAG, which honestly isn't too surprising.
-=======
->>>>>>> e107018 (add report)
```
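When comparing the fine-tuned model against `text-embedding-3-small`, it helps to diff the two RAGAS metric tables directly rather than eyeball two screenshots. The numbers below are placeholders loosely shaped like the report's results, not the actual run outputs:

```python
# Placeholder RAGAS scores: baseline = text-embedding-3-small run,
# finetuned = the fine-tuned arctic-embed run (both hypothetical).
baseline = {"faithfulness": 0.76, "answer_correctness": 0.53,
            "context_recall": 0.92, "context_precision": 0.92}
finetuned = {"faithfulness": 0.72, "answer_correctness": 0.50,
             "context_recall": 0.90, "context_precision": 0.91}

# Per-metric deltas; negative means the fine-tuned model regressed.
deltas = {m: round(finetuned[m] - baseline[m], 3) for m in baseline}
regressions = [m for m, d in deltas.items() if d < 0]
```

A delta table like this makes "the off-the-shelf model is still better" a per-metric statement instead of an overall impression.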
```diff
@@ -94,8 +68,5 @@
 
 ### 2. Test the two chunking strategies using the RAGAS frameworks to quantify any improvements. Provide results in a table.
 
 ### 3. The AI Solutions Engineer asks you “Which one is the best to test with internal stakeholders next week, and why?”
-<<<<<<< HEAD
 The original prototype is ideal for testing with internal stakeholders next week: it offers strong performance and is straightforward to implement. The only drawback is that it's not open-sourced. If this is a critical requirement, we can confirm with stakeholders and then explore the fine-tuning path. Based on initial results, we could likely fine-tune open-source models to achieve performance similar to that of OpenAI's `text-embedding-3-small`.
-=======
->>>>>>> e107018 (add report)
```
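The two chunking strategies under test, fixed-size and semantic, differ mainly in where they cut. A minimal contrast, with sentence boundaries standing in for the embedding-similarity boundaries a real semantic chunker would compute:

```python
import re

def fixed_size_chunks(text: str, size: int = 40) -> list[str]:
    # Cut every `size` characters, ignoring meaning entirely.
    return [text[i:i + size] for i in range(0, len(text), size)]

def boundary_chunks(text: str) -> list[str]:
    # Cheap stand-in for semantic chunking: cut at sentence boundaries.
    # (Real semantic chunkers compare embedding similarity between
    # adjacent sentences and split where similarity drops.)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "RAG needs good chunks. Bad splits hurt retrieval. Boundaries matter."
```

Fixed-size chunking can slice mid-sentence, which is exactly the failure mode semantic chunking is meant to avoid; RAGAS scores over both variants quantify how much that matters here.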
```diff
@@ -102,13 +73,8 @@
 
 # Task 6: Managing Your Boss and User Expectations
 
 ### 1. What is the story that you will give to the CEO to tell the whole company at the launch next month?
-<<<<<<< HEAD
 We're excited to introduce our **AI Industry Insights chatbot**, designed to provide real-time, nuanced guidance on the rapidly evolving impact of AI—especially in the context of politics and ethical enterprise applications. As we move through an election cycle and navigate the uncertainties around AI regulations, our chatbot empowers users to stay informed and make confident decisions. The tool leverages cutting-edge technology, offering insightful, up-to-date information on how AI is shaping industries and government policies. It’s a reliable companion for anyone looking to understand the future of AI in business and governance.
 
 ### 2. There appears to be important information not included in our build, for instance, the 270-day update on the 2023 executive order on Safe, Secure, and Trustworthy AI. How might you incorporate relevant white-house briefing information into future versions?
 I'd add the new relevant white-house briefing information into the QDrant vectorstore. Then depending on if we use an open-sourced model or not, I'd proceed with re-finetuning the embedding model and evaluate with RAGAS.
-=======
-
-### 2. There appears to be important information not included in our build, for instance, the 270-day update on the 2023 executive order on Safe, Secure, and Trustworthy AI. How might you incorporate relevant white-house briefing information into future versions?
->>>>>>> e107018 (add report)
```
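Folding the 270-day briefing into the build is, at minimum, an upsert into the vector store followed by re-evaluation. The class below is a dict-backed stand-in that mimics the upsert-then-search shape of a Qdrant collection; it is not the Qdrant client API, and the document IDs and two-dimensional vectors are invented for illustration:

```python
import math

class TinyVectorStore:
    """Dict-backed stand-in for a vector store collection (e.g. Qdrant)."""
    def __init__(self):
        self.points: dict[str, list[float]] = {}

    def upsert(self, doc_id: str, vector: list[float]) -> None:
        # New briefings land here without rebuilding the collection.
        self.points[doc_id] = vector

    def search(self, query: list[float], limit: int = 1) -> list[str]:
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na, nb = math.hypot(*a), math.hypot(*b)
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.points, key=lambda d: cos(query, self.points[d]),
                        reverse=True)
        return ranked[:limit]

store = TinyVectorStore()
store.upsert("eo-2023", [1.0, 0.0])
store.upsert("briefing-270-day", [0.9, 0.1])  # the newly added update
```

Because upsert is incremental, adding new White House briefings does not require reindexing existing documents; only the optional embedding-model re-finetune and RAGAS re-run are batch steps.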