---
license: apache-2.0
language:
- en
- la
base_model:
- distilbert/distilbert-base-multilingual-cased
library_name: transformers
datasets:
- sjhuskey/latin_author_dll_id
metrics:
- f1
- accuracy
---

# DLL Catalog Author Reconciliation Model

The purpose of this model is to automate the reconciliation of bibliographic metadata with records in the [DLL Catalog](https://catalog.digitallatin.org/).

The DLL Catalog maintains [authority records for authors and work records for works](https://catalog.digitallatin.org/authority-records). Each work is linked to its author (if known), and each [individual item record](https://catalog.digitallatin.org/individual-items-catalog) must be linked to the relevant authority records and work records.

## The Problem

The problem is linking incoming metadata for individual items to their corresponding author and work records in the DLL Catalog: the metadata we acquire from other sites arrives in different forms, with varying spellings of authors' names and of the titles of their works. Reconciling one or two records does not take much time, but more often than not we acquire many thousands of records at once, which creates a significant logjam in the process of publishing records in the catalog.

## The Proposed Solution

The authority and work records in the DLL Catalog contain multiple variant spellings of author names and work titles, and new variant spellings are added to the records as we encounter them. This means that we already have a labeled set of data that could be used to train a model to identify names and titles and match them with the unique identifiers of the DLL Catalog's authority and work records.
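
In practice, that labeled data takes the form of variant name spellings paired with DLL author identifiers. Below is a minimal sketch of inspecting the dataset from the Hugging Face Hub; the split and column names are assumptions for illustration, not guaranteed to match the actual configuration.

```python
# Sketch: load and inspect the labeled author data.
# The "train" split and the column layout shown here are assumptions.
from datasets import load_dataset

dataset = load_dataset("sjhuskey/latin_author_dll_id")
print(dataset)  # show available splits and column names

# Each row pairs a variant spelling of an author's name with the
# DLL identifier of the corresponding authority record.
for example in list(dataset["train"])[:5]:
    print(example)
```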

Achieving accuracy and reliability in author reconciliation will also make the second goal, reconciling titles with their corresponding work records, easier: the author's name can be used to narrow the field of potential matches to that author's works, reducing the chance of false positives on works with the same or similar titles. For example, both Julius Caesar and Lucan wrote works called _Bellum Civile_, and several authors wrote works known generically as _Carmina_.

## The Model

After preliminary experiments with sequential neural network models using [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model), [term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (tf-idf), and custom word embeddings, I settled on using a pretrained BERT model of the kind introduced by [Devlin et al. 2018](https://arxiv.org/abs/1810.04805v2). Specifically, I'm using [Hugging Face's DistilBERT base multilingual (cased) model](https://huggingface.co/distilbert/distilbert-base-multilingual-cased), which is based on work by [Sanh et al. 2020](https://doi.org/10.48550/arXiv.1910.01108).
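
The sketch below shows one way such a model can be fine-tuned: variant name strings feed a sequence-classification head on top of the DistilBERT base model, with one class per DLL author identifier. The label count, column name, and hyperparameters are illustrative assumptions, not the exact recipe used for this model.

```python
# Sketch: fine-tune DistilBERT multilingual (cased) as a name classifier.
# num_labels, the "text" column, and the hyperparameters are assumptions.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

base = "distilbert/distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=3000,  # hypothetical: one class per DLL author identifier
)

def tokenize(batch):
    # "text" is a placeholder for the column holding the variant name strings
    return tokenizer(batch["text"], truncation=True, max_length=64)

# train_dataset = dataset["train"].map(tokenize, batched=True)  # see dataset sketch above

args = TrainingArguments(
    output_dir="dll-author-reconciliation",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
# trainer.train()
```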

## Emissions

Here is the `codecarbon` output from training on Google Colab with an A100 runtime:

```properties
timestamp: 2024-12-23T17:37:16
project_name: codecarbon
run_id: a2b8975b-512b-4158-b41f-2a00d1d6fb39
experiment_id: 5b0fa12a-3dd7-45bb-9766-cc326314d9f1
duration (seconds): 877.531339527
emissions (kg CO2eq): 0.0260658391490936
emissions_rate (kg/sec): 2.970359914797282e-05
cpu_power (average in watts): 42.5
gpu_power (average in watts): 71.5115170414632
ram_power (average in watts): 31.30389261245728
cpu_energy (kWh): 0.0103517333061409
gpu_energy (kWh): 0.03961337474623
ram_energy (kWh): 0.007623585574942
energy_consumed (kWh): 0.057588693627313
os: Linux-6.1.85+-x86_64-with-glibc2.35
python_version: 3.10.12
codecarbon_version: 2.8.2
cpu_count: 12
cpu_model: Intel(R) Xeon(R) CPU @ 2.20GHz
gpu_count: 1
gpu_model: 1 x NVIDIA A100-SXM4-40GB
ram_total_size: 83.47704696655273
tracking_mode: machine
on_cloud: N
pue: 1.0
```
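
Output like the block above can be produced by wrapping a training run in codecarbon's `EmissionsTracker`. The sketch below is a generic example under that assumption, not the exact script used for this training run.

```python
# Generic sketch of measuring a training run with codecarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="codecarbon", tracking_mode="machine")
tracker.start()
try:
    # trainer.train()  # placeholder for the actual fine-tuning run
    pass
finally:
    emissions_kg = tracker.stop()  # kg CO2eq; full details go to emissions.csv
    print(f"Estimated emissions: {emissions_kg} kg CO2eq")
```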