---
library_name: transformers
tags: []
---

# Geraldine/msmarco-distilbert-base-v4-ead

## Model Details

- Model Name: Geraldine/msmarco-distilbert-base-v4-ead
- Base Model: sentence-transformers/msmarco-distilbert-base-v4
- Intended Use: This model is optimized for creating text embeddings with specific handling of XML/EAD elements.
- Architecture: DistilBERT-based sentence-transformer model, fine-tuned for MSMARCO and adapted to recognize XML/EAD elements.

## Model Description

This model is built on top of sentence-transformers/msmarco-distilbert-base-v4 and enhanced with two key modifications:

1. Special Tokens for XML/EAD Elements: The tokenizer includes additional tokens to handle EAD (Encoded Archival Description) and XML elements and attributes. This allows the model to generate embeddings that capture structural metadata commonly used in archival contexts.
2. Dimensionality Reduction with PCA: A PCA model is applied to reduce the dimensionality of embeddings from 768 to 128. This makes the embeddings more compact while preserving essential semantic information, which is useful for downstream tasks requiring lower-dimensional representations.

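
To illustrate the PCA step described above, here is a minimal NumPy sketch of fitting a projection and reducing 768-dimensional vectors to 128. It is not the code used to build `pca_model.joblib`: the helper names `fit_pca` and `apply_pca` are hypothetical, the input is random data standing in for real embeddings, and it assumes the shipped PCA performs a plain center-and-project transform (as scikit-learn's `PCA.transform` does when whitening is off).

```python
import numpy as np


def fit_pca(X, n_components):
    """Return (mean, components) for a PCA fitted on the rows of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    """Center the rows of X and project them onto the retained directions."""
    return (X - mean) @ components.T


# Synthetic stand-ins for 768-dimensional model outputs (not real embeddings)
rng = np.random.default_rng(0)
full = rng.normal(size=(500, 768))

mean, components = fit_pca(full, n_components=128)
reduced = apply_pca(full, mean, components)
print(full.shape, reduced.shape)
```

Each reduced vector needs one sixth of the storage of the original, at the cost of discarding the variance carried by the 640 dropped directions.
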
## Model Usage

### Installation and Setup

```python
from transformers import AutoModel, AutoTokenizer
import joblib
from huggingface_hub import hf_hub_download

# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")

# Load the PCA model
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")
pca = joblib.load(pca_path)
```

### Encoding Text and Reducing Dimensionality

To generate 128-dimensional embeddings, encode the text, select one 768-dimensional vector per input, and apply the PCA model. The model returns one vector per token, so the example keeps the first-token vector, the same choice made by the `[0][0]` indexing in the full example below.

```python
# Encode text using the model and tokenizer
text = "Your EAD/XML text goes here"
inputs = tokenizer(text, return_tensors="pt")
hidden_states = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)

# Keep the first-token vector so PCA receives a 2D array of shape (1, 768)
embeddings = hidden_states[:, 0, :]

# Apply PCA to reduce dimensionality
reduced_embeddings = pca.transform(embeddings.detach().numpy())
```

### Full example to use with LangChain or LlamaIndex

```python
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer, pipeline

# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")

# Download the PCA model
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")

feature_extraction_pipeline = pipeline("feature-extraction", model=model, tokenizer=tokenizer)


class HuggingFaceEmbeddingFunction:
    def __init__(self, pipeline, pca_model_path):
        self.pipeline = pipeline
        self.pca = joblib.load(pca_model_path)

    # Function for embedding documents (lists of text)
    def embed_documents(self, texts):
        # Each pipeline output has shape (1, seq_len, 768); keep the first-token vector
        embeddings = self.pipeline(texts)
        embeddings = [embedding[0][0] for embedding in embeddings]
        embeddings = np.array(embeddings)
        # Transform embeddings using PCA
        reduced_embeddings = self.pca.transform(embeddings)
        return reduced_embeddings.tolist()

    # Function for embedding individual queries
    def embed_query(self, text):
        embedding = self.pipeline(text)
        embedding = np.array(embedding[0][0]).reshape(1, -1)
        # Transform embedding using PCA
        reduced_embedding = self.pca.transform(embedding)
        return reduced_embedding.flatten().tolist()


# Pass the downloaded path so the PCA file is found regardless of the working directory
embeddings = HuggingFaceEmbeddingFunction(feature_extraction_pipeline, pca_model_path=pca_path)
```

### Intended Use Cases

This model is well-suited for:

- **Archival Data Embeddings**: Generate embeddings for texts containing EAD/XML elements, making it ideal for digital archives and library sciences.
- **Semantic Search**: Improve search results for content with complex metadata or hierarchical data, like archival records or digital collections.
- **Information Retrieval**: Use embeddings to power retrieval tasks where reducing storage and maintaining relevance in the embeddings are essential.

## Training Data

The base model was fine-tuned on MSMARCO data by sentence-transformers. Additional training or fine-tuning with EAD/XML-specific tokens was not required; instead, the tokenizer was adapted to recognize XML/EAD elements and attributes as distinct tokens.

## Limitations and Considerations

- **Domain-Specific Tokenization**: The model's tokenizer recognizes EAD/XML tokens, making it particularly useful in contexts where such elements are frequently used. However, this specialization may not be necessary for general NLP tasks.
- **Dimensionality Reduction Trade-Off**: PCA reduces the embedding dimensions from 768 to 128, which can introduce minor losses in the information encoded in embeddings. This trade-off is balanced to retain essential semantic information.

## Evaluation

The base model has been evaluated on MSMARCO, and the added tokenization aligns it for use in XML/EAD contexts. Further evaluation can be conducted on EAD-specific datasets or tasks to ensure model effectiveness in domain-specific applications.

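
One simple check for the further evaluation suggested above is to measure how often the nearest neighbours found with the 128-dimensional vectors match those found with the full 768-dimensional ones. The sketch below is only an illustration under stated assumptions: `cosine_top_k` and `mean_top_k_overlap` are hypothetical helpers, the data is synthetic rather than EAD text, and a NumPy projection stands in for `pca.transform`.

```python
import numpy as np


def cosine_top_k(queries, corpus, k):
    """Indices of the k most cosine-similar corpus rows for each query row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = q @ c.T
    return np.argsort(-scores, axis=1)[:, :k]


def mean_top_k_overlap(top_a, top_b):
    """Average fraction of shared neighbours between two top-k index arrays."""
    k = top_a.shape[1]
    shared = [len(set(a) & set(b)) for a, b in zip(top_a, top_b)]
    return float(np.mean(shared)) / k


# Synthetic 768-dimensional vectors with low-rank structure (not real embeddings)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 32))
mixing = rng.normal(size=(32, 768))
corpus_full = latent @ mixing + 0.01 * rng.normal(size=(200, 768))
queries_full = corpus_full[:20] + 0.01 * rng.normal(size=(20, 768))

# Stand-in for pca.transform: center, then project onto the top 128 directions
mean = corpus_full.mean(axis=0)
_, _, vt = np.linalg.svd(corpus_full - mean, full_matrices=False)


def project(X):
    return (X - mean) @ vt[:128].T


overlap = mean_top_k_overlap(
    cosine_top_k(queries_full, corpus_full, k=5),
    cosine_top_k(project(queries_full), project(corpus_full), k=5),
)
print(f"mean top-5 overlap: {overlap:.2f}")
```

To run the same check on real data, replace the synthetic arrays with first-token vectors from the model and `project` with the loaded PCA model's `transform`. An overlap close to 1 suggests the reduction changes retrieval results very little for that dataset.
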
## Citation

If you use this model, please cite it as follows:

```bibtex
@misc{geraldine2024eadxml,
  author = {Géraldine Geoffroy},
  title = {Geraldine/msmarco-distilbert-base-v4-ead: A DistilBERT Embedding Model for EAD/XML Text},
  year = {2024},
  howpublished = {\url{https://huggingface.co/Geraldine/msmarco-distilbert-base-v4-ead}},
}
```

## Model Card Authors

Géraldine Geoffroy

## Model Card Contact

grldn.geoffroy@gmail.com