File size: 7,475 Bytes
b95938c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 |
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
\title{Training BERT-Base-Uncased to Classify Descriptive Metadata}
\author{
\IEEEauthorblockN{Artem Saakov}
\IEEEauthorblockA{
University of Michigan\\
School of Information\\
United States\\
[email protected]
}
}
\begin{document}
\maketitle
\begin{abstract}
Libraries and archives frequently receive donor-supplied metadata in unstructured or inconsistent formats, creating backlogs in accession workflows. This paper presents a method for automating metadata field classification using a pretrained transformer model (BERT-base-uncased). We aggregate donor metadata into a JSON corpus keyed by Dublin Core fields, flatten it into text–label pairs, and fine-tune BERT for sequence classification. On a synthetic test set spanning ten common metadata fields, we achieve an overall accuracy of 0.92. We also provide a robust inference script capable of classifying documents of arbitrary length. Our results suggest that transformer-based classifiers can substantially reduce manual effort in digital curation pipelines.
\end{abstract}
\begin{IEEEkeywords}
Metadata Classification, Digital Curation, Transformer Models, BERT, Text Classification, Archival Metadata, Natural Language Processing
\end{IEEEkeywords}
\section{Introduction}
Metadata underpins discovery, provenance, and preservation in digital archives. Yet many institutions face backlogs: donated items arrive faster than they can be cataloged, and donor-provided metadata—often stored in spreadsheets, text files, or embedded tags—lacks structure or consistency \cite{NARA_AI}. Manually mapping each snippet to standardized fields (e.g., Title, Date, Creator) is labor-intensive.
\subsection{Project Goal}
We investigate fine-tuning Google’s BERT-base-uncased model to automatically classify free-form metadata snippets into a fixed set of archival fields. By leveraging BERT’s bidirectional contextual embeddings, we aim to reduce manual mapping effort and improve consistency.
\subsection{Related Work}
The National Archives have explored AI for metadata tagging to improve public access \cite{NARA_AI}. Carnegie Mellon’s CAMPI project used computer vision to cluster and tag photo collections in bulk \cite{CMU_CAMPI}. MetaEnhance applied transformer models to correct ETD metadata errors with F1~$>$~0.85 on key fields \cite{MetaEnhance}. Embedding-based entity resolution has harmonized heterogeneous schemas across datasets \cite{Sawarkar2020}. These studies demonstrate AI’s potential but leave open the challenge of mapping arbitrary donor text to discrete fields.
\section{Method}
\subsection{Problem Formulation}
We cast metadata field mapping as single-label text classification:
\begin{itemize}
\item \textbf{Input:} free-form snippet $x$ (string).
\item \textbf{Output:} field label $y \in \{f_1, \dots, f_K\}$, each $f_i$ a target schema field.
\end{itemize}
\subsection{Dataset Preparation}
We begin with an aggregated JSON document keyed by Dublin Core field names. A Python script (\texttt{harvest\_aggregate.ipynb}) flattens this into one record per metadata entry:
\begin{verbatim}
{"text":"Acquired on 12/31/2024","label":"Date"}
\end{verbatim}
Synthetic expansion to 200 examples across ten fields ensures coverage of varied formats.
\subsection{Model Fine-Tuning}
\begin{itemize}
\item \textbf{Model:} \texttt{bert-base-uncased} with $K=10$ labels.
\item \textbf{Tokenizer:} WordPiece, padding/truncation to 128 tokens.
\item \textbf{Training:} 80/20 split, cross-entropy loss, LR=2e-5, batch size=8, 5 epochs via Hugging Face \texttt{Trainer} \cite{Wolf2020}.
\item \textbf{Evaluation:} Accuracy, weighted and macro F1, precision, and recall using the \texttt{evaluate} library.
\end{itemize}
\subsection{Inference Pipeline}
We package our inference logic in \texttt{bertley.py}. It loads the fine-tuned model, tokenizes input (text or file), and handles documents longer than 512 tokens by chunking with overlap (stride=50). Pseudocode excerpt:
\begin{verbatim}
# Load model & tokenizer from checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
classifier = pipeline("text-classification",
model=model,
tokenizer=tokenizer,
return_all_scores=True)
# For long texts, split into overlapping chunks
def chunk_and_classify(text):
tokens = tokenizer(text)['input_ids'][0]
for i in range(0, len(tokens), max_len - stride):
chunk = tokenizer.decode(tokens[i:i+max_len])
scores = classifier(chunk)
accumulate(scores)
return average_scores()
\end{verbatim}
This script achieves robust, batch-ready inference for entire documents.
\section{Results}
\subsection{Evaluation Metrics}
After fine-tuning for 5 epochs, we evaluated on the test set. Table~\ref{tab:eval_metrics} summarizes the results:
\begin{table}[ht]
\caption{Test Set Evaluation Metrics}
\label{tab:eval_metrics}
\centering
\begin{tabular}{l c}
\hline
\textbf{Metric} & \textbf{Value} \\
\hline
Loss & 0.1338 \\
Accuracy & 0.9665 \\
F1 (weighted) & 0.9628 \\
Precision (weighted) & 0.9650 \\
Recall (weighted) & 0.9665 \\
F1 (macro) & 0.8283 \\
Precision (macro) & 0.8551 \\
Recall (macro) & 0.8225 \\
\hline
Runtime (s) & 35.83 \\
Samples/sec & 518.70 \\
Steps/sec & 16.22 \\
\hline
\end{tabular}
\end{table}
\subsection{Interpretation}
Overall accuracy of 96.65\% and weighted F1 of 96.28\% demonstrate reliable field mapping. The macro F1 (82.83\%) suggests room for improvement on rarer or more ambiguous classes. Inference speed (~100 snippets/s on GPU) is sufficient for large-scale backlog processing.
\section{Conclusion}
Fine-tuning BERT-base-uncased for metadata classification yields an overall accuracy of 0.92, confirming the viability of transformer-based automation in digital curation. Future work will integrate real EAD finding aids, implement multi-label classification for ambiguous entries, and incorporate human-in-the-loop validation.
\section*{Acknowledgment}
The author thanks the University of Michigan School of Information and participating archival staff for insights into donor metadata workflows.
\begin{thebibliography}{1}
\bibitem{NARA_AI}
U.S. National Archives and Records Administration, ``Artificial intelligence at the National Archives.'' [Online]. Available: \url{https://www.archives.gov/ai}, accessed Apr. 4, 2025.
\bibitem{CMU_CAMPI}
Carnegie Mellon Univ. Libraries, ``Computer vision archive helps streamline metadata tagging,'' Oct. 2020. [Online]. Available: \url{https://www.cmu.edu/news/stories/archives/2020/october/computer-vision-archive.html}.
\bibitem{MetaEnhance}
M.~H. Choudhury \emph{et al.}, ``MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations,'' \emph{arXiv}, Mar. 2023.
\bibitem{Sawarkar2020}
K.~Sawarkar and M.~Kodati, ``Automated metadata harmonization using entity resolution \& contextual embedding,'' \emph{arXiv}, Oct. 2020.
\bibitem{Wolf2020}
T.~Wolf \emph{et al.}, ``HuggingFace Transformers: State-of-the-art natural language processing,'' in \emph{Proc. EMNLP: Findings}, 2020, pp. 8201--8210.
\end{thebibliography}
\end{document} |