|
--- |
|
language: |
|
- en |
|
|
tags: |
|
- toponym detection |
|
- language model |
|
- geospatial understanding |
|
- geolm |
|
license: cc-by-nc-2.0 |
|
datasets: |
|
- GeoWebNews |
|
metrics: |
|
- f1 |
|
pipeline_tag: token-classification |
|
widget: |
|
- text: >- |
|
Minneapolis, officially the City of Minneapolis, is a city in the state of Minnesota |
|
and the county seat of Hennepin County. As of the 2020 census the population was |
|
429,954, making it the largest city in Minnesota and the 46th-most-populous in the |
|
United States. Nicknamed the "City of Lakes", Minneapolis is abundant in water, |
|
with thirteen lakes, wetlands, the Mississippi River, creeks, and waterfalls. |
|
- text: >- |
|
Los Angeles, often referred to by its initials L.A., is the most populous |
|
city in California, the most populous U.S. state. It is the commercial, financial, |
|
and cultural center of Southern California. Los Angeles is the second-most populous |
|
city in the United States after New York City, with a population of roughly 3.9 |
|
million residents within the city limits as of 2020. |
|
--- |
|
|
|
# Model Card for GeoLM model for Toponym Recognition |
|
|
|
<!-- Provide a quick summary of what the model is/does. [Optional] --> |
|
A language model for detecting toponyms (i.e., place names) in sentences. The GeoLM model is pretrained on worldwide OpenStreetMap (OSM), WikiData, and Wikipedia data, then fine-tuned for the toponym recognition task on the GeoWebNews dataset.
|
|
|
|
|
|
|
|
|
# Table of Contents |
|
|
|
- [Model Details](#model-details) |
|
- [Model Description](#model-description) |
|
- [Uses](#uses) |
|
- [Training Details](#training-details) |
|
- [Training Data](#training-data) |
|
- [Training Procedure](#training-procedure) |
|
- [Preprocessing](#preprocessing) |
|
- [Speeds, Sizes, Times](#speeds-sizes-times) |
|
- [Evaluation](#evaluation) |
|
- [Testing Data & Metrics & Results](#testing-data--metrics--results)
|
- [Testing Data](#testing-data) |
|
- [Metrics](#metrics) |
|
- [Results](#results) |
|
- [Technical Specifications [optional]](#technical-specifications-optional) |
|
- [Model Architecture and Objective](#model-architecture-and-objective) |
|
- [Compute Infrastructure](#compute-infrastructure) |
|
- [Bias, Risks, and Limitations](#bias-risks-and-limitations) |
|
- [Citation](#citation) |
|
- [Model Card Author [optional]](#model-card-author-optional)
|
- [Model Card Contact](#model-card-contact) |
|
- [How to Get Started with the Model](#how-to-get-started-with-the-model) |
|
|
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
<!-- Provide a longer summary of what this model is/does. --> |
|
GeoLM is pretrained on worldwide OpenStreetMap (OSM), WikiData, and Wikipedia data, then fine-tuned for toponym recognition on the GeoWebNews dataset.
|
|
|
- **Developed by:** Zekun Li |
|
- **Model type:** Language model for geospatial understanding |
|
- **Language(s) (NLP):** en |
|
- **License:** cc-by-nc-2.0 |
|
- **Parent Model:** https://huggingface.co/bert-base-cased |
|
- **Resources for more information:** li002666[Shift+2]umn.edu |
|
|
|
|
|
|
|
# Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
This is a GeoLM model fine-tuned for the toponym detection task. It takes sentences as input and outputs the detected toponyms.
|
|
|
|
|
|
|
|
|
|
|
|
To use this model, please refer to the code below. |
|
|
|
* Option 1: Load the weights into a BERT model (the same procedure as the demo in the right-side panel)
|
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Model name on the Hugging Face model hub
model_name = "zekun-li/geolm-base-toponym-recognition"

# Load tokenizer and model, and switch the model to inference mode
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example input sentence
input_sentence = "Minneapolis, officially the City of Minneapolis, is a city in the state of Minnesota and the county seat of Hennepin County."

# Tokenize the input sentence (returns input_ids and attention_mask as PyTorch tensors)
inputs = tokenizer(input_sentence, truncation=True, return_tensors="pt")

# Pass the tokens through the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label id for each token
predicted_ids = torch.argmax(outputs.logits, dim=2)[0].tolist()

# Map label ids to label names
predicted_labels = [model.config.id2label[label_id] for label_id in predicted_ids]

# Print one predicted label per token (including special tokens)
print(predicted_labels)
```
|
* Option 2: Load the weights into a GeoLM model

Coming soon.
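The Option 1 snippet prints one label per word-piece token, not the place names themselves. The following is a minimal sketch of how such token-level labels could be grouped into toponym strings. It assumes a BIO-style tagging scheme (labels beginning with `B-` and `I-`, plus `O`) and BERT WordPiece tokens with `##` continuation markers. The function name `merge_toponyms` and the label names `B-Topo` and `I-Topo` are hypothetical, chosen for illustration; check `model.config.id2label` for the labels the model actually uses.

```python
from typing import List


def merge_toponyms(tokens: List[str], labels: List[str]) -> List[str]:
    """Group WordPiece tokens tagged B-*/I-* into place-name strings.

    Assumes a BIO-style scheme; the exact label names are not checked,
    only their "B-" / "I-" prefixes.
    """
    toponyms: List[str] = []
    current: List[str] = []  # words of the toponym currently being built

    for token, label in zip(tokens, labels):
        # Skip BERT special tokens.
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue
        if token.startswith("##"):
            # Word-piece continuation: glue it onto the previous word,
            # but only if we are inside a toponym.
            if current:
                current[-1] += token[2:]
            continue
        if label.startswith("B-"):
            # A new toponym starts; close the previous one if any.
            if current:
                toponyms.append(" ".join(current))
            current = [token]
        elif label.startswith("I-") and current:
            # Continue the toponym with a new word.
            current.append(token)
        else:
            # Outside any toponym: close the one being built, if any.
            if current:
                toponyms.append(" ".join(current))
            current = []

    if current:
        toponyms.append(" ".join(current))
    return toponyms


# Hand-written example tokens and labels (not actual model output).
example_tokens = ["[CLS]", "Minneapolis", ",", "a", "city", "in",
                  "Hen", "##ne", "##pin", "County", ".", "[SEP]"]
example_labels = ["O", "B-Topo", "O", "O", "O", "O",
                  "B-Topo", "I-Topo", "I-Topo", "I-Topo", "O", "O"]
print(merge_toponyms(example_tokens, example_labels))
```

With real model output, the token strings can be obtained by passing the input ids to `tokenizer.convert_ids_to_tokens`, and the labels are the `predicted_labels` list from Option 1. A continuation piece is merged according to the label of the first piece of its word, which is a design choice of this sketch rather than something the model card specifies.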
|
|
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
GeoWebNews (Credit to Gritta et al.) |
|
Download link: https://github.com/milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation/blob/master/data/GWN.xml |
|
|
|
## Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
|
|
|
|
### Speeds, Sizes, Times |
|
|
|
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
|
|
|
More information needed |
|
|
|
# Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
## Testing Data & Metrics & Results |
|
|
|
### Testing Data |
|
|
|
<!-- This should link to a Data Card if possible. --> |
|
|
|
More information needed |
|
|
|
|
|
### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
More information needed |
|
|
|
### Results |
|
|
|
More information needed |
|
|
|
|
|
|
|
# Technical Specifications [optional] |
|
|
|
## Model Architecture and Objective |
|
|
|
More information needed |
|
|
|
## Compute Infrastructure |
|
|
|
More information needed |
|
|
|
|
|
|
|
|
|
# Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. |
|
|
|
|
|
|
|
# Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
More information needed |
|
|
|
**APA:** |
|
|
|
More information needed |
|
|
|
|
|
|
|
# Model Card Author [optional] |
|
|
|
<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. --> |
|
|
|
Zekun Li (li002666[Shift+2]umn.edu) |
|
|
|
|