gbpatentdata/lt-patent-inventor-linking
This is a LinkTransformer model. At its core, it is a sentence-transformers model; LinkTransformer simply wraps around that class.
It was fine-tuned from the base model sentence-transformers/all-mpnet-base-v2 and is pretrained for the language: en.
Usage (Sentence-Transformers)
To load this model with sentence-transformers:
from sentence_transformers import SentenceTransformer
# load
model = SentenceTransformer("matthewleechen/lt-patent-inventor-linking")
Usage (LinkTransformer)
To use this model for clustering with LinkTransformer installed:
import linktransformer as lt
import pandas as pd

# df should be a DataFrame of unique patent-inventor rows
df_lm_matched = lt.cluster_rows(
    df,
    model='matthewleechen/lt-patent-inventor-linking',
    on=['name', 'occupation', 'year', 'address', 'firm', 'patent_title'],  # cluster on these variables
    cluster_type='SLINK',  # use SLINK algorithm
    cluster_params={  # default params
        'threshold': 0.1,
        'min cluster size': 1,
        'metric': 'cosine'
    }
)
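To give a feel for what the clustering parameters above mean, here is a minimal pure-Python sketch of single-linkage clustering cut at a cosine-distance threshold. It is not LinkTransformer's implementation: it assumes that 'threshold' is a cosine distance cutoff and relies on the fact that cutting a single-linkage hierarchy at a distance t yields the connected components of the graph whose edges join points at distance t or less. The helper names and the toy 2-dimensional vectors are hypothetical; real embeddings from this model have 768 dimensions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def single_linkage_clusters(vectors, threshold=0.1):
    """Label vectors so that any pair within `threshold` ends up in one cluster.

    Hypothetical helper: uses union-find to compute connected components
    of the graph with an edge wherever cosine distance <= threshold.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(j)] = find(i)

    # Relabel roots as consecutive integers in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(vectors))]


# Toy stand-ins for embeddings (assumption: made-up values for illustration).
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.05, 0.99],
    [0.7, 0.7],
]
cluster_ids = single_linkage_clusters(toy_embeddings, threshold=0.1)
print(cluster_ids)
```

With a minimum cluster size of 1, every row receives a cluster label, so an inventor who appears only once forms a singleton cluster.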
Evaluation
We evaluate using the standard LinkTransformer information retrieval metrics. Our test set evaluations are available here.
Training
The model was trained with the parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader of length 31 with parameters:
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
linktransformer.modified_sbert.losses.SupConLoss_wandb
Parameters of the fit()-Method:
{
    "epochs": 100,
    "evaluation_steps": 16,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 3100,
    "weight_decay": 0.01
}
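The schedule implied by these parameters can be sketched in a few lines. This is an illustrative reimplementation of a linear warmup followed by linear decay, not the library's own code, and it assumes that a null steps_per_epoch means one full pass over the 31-batch DataLoader per epoch. Under that assumption the run has 31 × 100 = 3100 optimizer steps, which equals warmup_steps, so the learning rate would rise linearly towards 2e-05 for the whole run and never enter the decay phase. The function name is hypothetical.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=3100, total_steps=3100):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 31  # DataLoader length reported above
epochs = 100
# Assumption: steps_per_epoch=None means one pass over the DataLoader per epoch.
total_steps = steps_per_epoch * epochs

for step in (0, total_steps // 2, total_steps - 1):
    print(step, warmup_linear_lr(step, total_steps=total_steps))
```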
LinkTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)
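The last two modules in this architecture, mean pooling and normalization, can be illustrated with a small pure-Python sketch. It is not the sentence-transformers implementation: the function names and the toy 4-dimensional token vectors are hypothetical (the real model produces 768-dimensional token embeddings), and the attention mask is assumed to mark padding tokens with 0.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Toy token vectors (assumption: made-up values for illustration).
token_embeddings = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 4.0],
    [9.0, 9.0, 9.0, 9.0],  # padding token, excluded by the mask
]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)
sentence_embedding = l2_normalize(pooled)
print(sentence_embedding)
```

Because the final vectors have unit length, the dot product of two embeddings equals their cosine similarity.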
Citation
If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:
@article{bct2025,
title = {300 Years of British Patents},
author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
journal = {arXiv preprint arXiv:2401.12345},
year = {2025},
url = {https://arxiv.org/abs/2401.12345}
}
Please also cite the original LinkTransformer authors:
@misc{arora2023linktransformer,
title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
author={Abhishek Arora and Melissa Dell},
year={2023},
eprint={2309.00789},
archivePrefix={arXiv},
primaryClass={cs.CL}
}