brunokreiner/lyrics-bert

+---
+language:
+- en
+tags:
+- music
+---
+# Model Card for lyrics-bert
+<!-- Provide a quick summary of what the model is/does. -->
+Embeds song lyrics into 300-dimensional vectors.
+# Model Details
+## Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** bert-base-uncased trained with contrastive learning
+- **Language(s) (NLP):** English
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+## Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+# Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+## Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+## Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+## Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+# Bias, Risks, and Limitations
+Ultimately, lyrics are available for 606,255 songs. To simplify further processing, these songs were filtered with the Python port [langdetectpy] of a Google language detector originally implemented in Java [nakatani2010langdetect], and only the remaining 480,964 English lyrics are considered further.
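+The language filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: `keep_english` and `toy_detect` are hypothetical names, and the detector is passed in as a callable so the sketch runs without extra packages. With the `langdetect` package installed, `langdetect.detect` could be passed as `detect`, since it takes a text and returns a language code such as `"en"`.

```python
from typing import Callable, Iterable, List


def keep_english(lyrics: Iterable[str], detect: Callable[[str], str]) -> List[str]:
    """Return only the lyrics whose detected language code is 'en'.

    `detect` is any callable mapping a text to a language code
    (hypothetical interface chosen for this sketch).
    """
    english = []
    for text in lyrics:
        try:
            if detect(text) == "en":
                english.append(text)
        except Exception:
            # Detectors may fail on empty or symbol-only input; drop such lyrics.
            continue
    return english


def toy_detect(text: str) -> str:
    """Stand-in detector for illustration only: 'en' if the text is pure ASCII."""
    return "en" if text.isascii() else "xx"


sample = ["hello darkness my old friend", "こんにちは世界"]
english_only = keep_english(sample, toy_detect)
```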
+## Further Issues
+In retrospect, 109 lyrics were found to contain special characters that the cleanup had not caught. These were matched with the regex `'[a-zA-Z|\'|0-9]'` and ignored during training. Training nevertheless included some lyrics that contain no special characters but are not entirely English. As a result, the language model also encodes Japanese / Korean / Chinese / Russian / Greek text as well as special characters from Latin-based languages, although with very little training data. The Google language detection model did not classify these lyrics as "not English" because they contain enough English words. We assume that these lyrics do not substantially affect training; roughly 500 such songs can be expected.
+Some lyrics are also romanized versions of Japanese / Korean / Chinese songs (verified manually). Further edge cases are lyrics with emphasized, spelled-out words such as:
+
+> let your fists swang k i c k y o a s s oh yes k i c k y o a s s oh yes i say beat you say that ass
+
+An analysis of what exactly happens to these words in the embedding space is still missing.
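+To make the special-character check concrete, here is a minimal sketch built on the character class quoted in this section. How the regex was actually applied is not documented, so the per-character test and the tolerance for whitespace are assumptions, and `has_special_characters` is a hypothetical helper name.

```python
import re

# Character class quoted in the text. Inside [...] the '|' is a literal pipe,
# so the class matches ASCII letters, digits, apostrophes, and '|'.
ALLOWED_CHAR = re.compile(r"[a-zA-Z|\'|0-9]")


def has_special_characters(lyric: str) -> bool:
    """True if any non-whitespace character falls outside the allowed class.

    Assumption: whitespace is tolerated and the check is done per character.
    """
    return any(
        not ch.isspace() and ALLOWED_CHAR.fullmatch(ch) is None
        for ch in lyric
    )


lyrics = [
    "let your fists swang k i c k y o a s s oh yes",
    "don't stop believin'",
    "voilà c'est fini",
    "愛してる forever",
]
flagged = [text for text in lyrics if has_special_characters(text)]
kept = [text for text in lyrics if not has_special_characters(text)]
```

+Romanized lyrics and spelled-out words consist only of allowed characters, so a check of this kind does not flag them.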
+## Recommendations
+bias, risk, technical limitations...
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+# Training Details
+## Training Data
+<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+## Training Procedure [optional]
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+### Preprocessing
+[More Information Needed]
+### Speeds, Sizes, Times
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+# Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+## Testing Data, Factors & Metrics
+### Testing Data
+<!-- This should link to a Data Card if possible. -->
+[More Information Needed]
+### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+## Results
+[More Information Needed]
+### Summary
+# Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+# Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+# Technical Specifications [optional]
+## Model Architecture and Objective
+[More Information Needed]
+## Compute Infrastructure
+[More Information Needed]
+### Hardware
+[More Information Needed]
+### Software
+[More Information Needed]
+# Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+# Model Card Contact
+For more information, contact [email protected]