File size: 7,475 Bytes
b95938c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts

\title{Training BERT-Base-Uncased to Classify Descriptive Metadata}

\author{
    \IEEEauthorblockN{Artem Saakov}
    \IEEEauthorblockA{
        University of Michigan\\
        School of Information\\
        United States\\
        [email protected]
    }
}

\begin{document}
\maketitle

\begin{abstract}
Libraries and archives frequently receive donor-supplied metadata in unstructured or inconsistent formats, creating backlogs in accession workflows. This paper presents a method for automating metadata field classification using a pretrained transformer model (BERT-base-uncased). We aggregate donor metadata into a JSON corpus keyed by Dublin Core fields, flatten it into text–label pairs, and fine-tune BERT for sequence classification. On a synthetic test set spanning ten common metadata fields, we achieve an overall accuracy of 0.92. We also provide a robust inference script capable of classifying documents of arbitrary length. Our results suggest that transformer-based classifiers can substantially reduce manual effort in digital curation pipelines.
\end{abstract}

\begin{IEEEkeywords}
Metadata Classification, Digital Curation, Transformer Models, BERT, Text Classification, Archival Metadata, Natural Language Processing
\end{IEEEkeywords}

\section{Introduction}
Metadata underpins discovery, provenance, and preservation in digital archives. Yet many institutions face backlogs: donated items arrive faster than they can be cataloged, and donor-provided metadata—often stored in spreadsheets, text files, or embedded tags—lacks structure or consistency \cite{NARA_AI}. Manually mapping each snippet to standardized fields (e.g., Title, Date, Creator) is labor-intensive.

\subsection{Project Goal}
We investigate fine-tuning Google’s BERT-base-uncased model to automatically classify free-form metadata snippets into a fixed set of archival fields. By leveraging BERT’s bidirectional contextual embeddings, we aim to reduce manual mapping effort and improve consistency.

\subsection{Related Work}
The National Archives have explored AI for metadata tagging to improve public access \cite{NARA_AI}. Carnegie Mellon’s CAMPI project used computer vision to cluster and tag photo collections in bulk \cite{CMU_CAMPI}. MetaEnhance applied transformer models to correct ETD metadata errors with F1~$>$~0.85 on key fields \cite{MetaEnhance}. Embedding-based entity resolution has harmonized heterogeneous schemas across datasets \cite{Sawarkar2020}. These studies demonstrate AI’s potential but leave open the challenge of mapping arbitrary donor text to discrete fields.

\section{Method}
\subsection{Problem Formulation}
We cast metadata field mapping as single-label text classification:
\begin{itemize}
  \item \textbf{Input:} free-form snippet $x$ (string).
  \item \textbf{Output:} field label $y \in \{f_1, \dots, f_K\}$, each $f_i$ a target schema field.
\end{itemize}

\subsection{Dataset Preparation}
We begin with an aggregated JSON document keyed by Dublin Core field names. A Python script (\texttt{harvest\_aggregate.ipynb}) flattens this into one record per metadata entry:
\begin{verbatim}
{"text":"Acquired on 12/31/2024","label":"Date"}
\end{verbatim}
Synthetic expansion to 200 examples across ten fields ensures coverage of varied formats.

\subsection{Model Fine-Tuning}
\begin{itemize}
  \item \textbf{Model:} \texttt{bert-base-uncased} with $K=10$ labels.
  \item \textbf{Tokenizer:} WordPiece, padding/truncation to 128 tokens.
  \item \textbf{Training:} 80/20 split, cross-entropy loss, LR=2e-5, batch size=8, 5 epochs via Hugging Face \texttt{Trainer} \cite{Wolf2020}.
  \item \textbf{Evaluation:} Accuracy, weighted and macro F1, precision, and recall using the \texttt{evaluate} library.
\end{itemize}

\subsection{Inference Pipeline}
We package our inference logic in \texttt{bertley.py}. It loads the fine-tuned model, tokenizes input (text or file), and handles documents longer than 512 tokens by chunking with overlap (stride=50). Pseudocode excerpt:

\begin{verbatim}
# Load model & tokenizer from checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
classifier = pipeline("text-classification", 
                      model=model, 
                      tokenizer=tokenizer, 
                      return_all_scores=True)

# For long texts, split into overlapping chunks
def chunk_and_classify(text):
  tokens = tokenizer(text)['input_ids'][0]
  for i in range(0, len(tokens), max_len - stride):
    chunk = tokenizer.decode(tokens[i:i+max_len])
    scores = classifier(chunk)
    accumulate(scores)
  return average_scores()
\end{verbatim}

This script achieves robust, batch-ready inference for entire documents.

\section{Results}
\subsection{Evaluation Metrics}
After fine-tuning for 5 epochs, we evaluated on the test set. Table~\ref{tab:eval_metrics} summarizes the results:

\begin{table}[ht]
  \caption{Test Set Evaluation Metrics}
  \label{tab:eval_metrics}
  \centering
  \begin{tabular}{l c}
    \hline
    \textbf{Metric} & \textbf{Value} \\
    \hline
    Loss                  & 0.1338 \\
    Accuracy              & 0.9665 \\
    F1 (weighted)         & 0.9628 \\
    Precision (weighted)  & 0.9650 \\
    Recall (weighted)     & 0.9665 \\
    F1 (macro)            & 0.8283 \\
    Precision (macro)     & 0.8551 \\
    Recall (macro)        & 0.8225 \\
    \hline
    Runtime (s)           & 35.83 \\
    Samples/sec           & 518.70 \\
    Steps/sec             & 16.22 \\
    \hline
  \end{tabular}
\end{table}

\subsection{Interpretation}
Overall accuracy of 96.65\% and weighted F1 of 96.28\% demonstrate reliable field mapping. The macro F1 (82.83\%) suggests room for improvement on rarer or more ambiguous classes. Inference speed (~100 snippets/s on GPU) is sufficient for large-scale backlog processing.

\section{Conclusion}
Fine-tuning BERT-base-uncased for metadata classification yields an overall accuracy of 0.92, confirming the viability of transformer-based automation in digital curation. Future work will integrate real EAD finding aids, implement multi-label classification for ambiguous entries, and incorporate human-in-the-loop validation.

\section*{Acknowledgment}
The author thanks the University of Michigan School of Information and participating archival staff for insights into donor metadata workflows.

\begin{thebibliography}{1}
\bibitem{NARA_AI}
U.S. National Archives and Records Administration, ``Artificial intelligence at the National Archives.'' [Online]. Available: \url{https://www.archives.gov/ai}, accessed Apr. 4, 2025.

\bibitem{CMU_CAMPI}
Carnegie Mellon Univ. Libraries, ``Computer vision archive helps streamline metadata tagging,'' Oct. 2020. [Online]. Available: \url{https://www.cmu.edu/news/stories/archives/2020/october/computer-vision-archive.html}.

\bibitem{MetaEnhance}
M.~H. Choudhury \emph{et al.}, ``MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations,'' \emph{arXiv}, Mar. 2023.

\bibitem{Sawarkar2020}
K.~Sawarkar and M.~Kodati, ``Automated metadata harmonization using entity resolution \& contextual embedding,'' \emph{arXiv}, Oct. 2020.

\bibitem{Wolf2020}
T.~Wolf \emph{et al.}, ``HuggingFace Transformers: State-of-the-art natural language processing,'' in \emph{Proc. EMNLP: Findings}, 2020, pp. 8201--8210.
\end{thebibliography}

\end{document}