Full Paper

Synthetic Signal Identification in LLM Datasets

Jim Hahn

DOI: 10.23106/dcmi.952493684

Abstract

This research addresses the quality of training data in large language models (LLMs) using methods from signaling theory applied to the talk page metadata of Wikipedia articles. The significance of the method lies in lowering the cost of information quality assessment in datasets. Natural language processing of the metadata text generated sentiment, reading complexity, and self-reference scores, which contribute to the computationally derived signals. Results showed that indicators of information quality can be surfaced through textual computation over article talk page metadata.
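As a concrete illustration of the kind of scoring the abstract describes, the sketch below computes sentiment, reading complexity, and self-reference scores from a talk page excerpt. It is not the author's published pipeline; the library choices (NLTK's VADER for sentiment, textstat for readability) and the first-person-pronoun proxy for self-reference are assumptions made for illustration.

```python
# Minimal sketch: derive sentiment, reading-complexity, and self-reference
# scores from talk page text. Library choices and the pronoun-based
# self-reference proxy are illustrative assumptions, not the paper's method.
import re

import nltk
import textstat
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon fetch

FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def talk_page_signals(text: str) -> dict:
    """Return three computationally derived signals for a talk page excerpt."""
    # Sentiment: VADER compound score in [-1, 1]
    sentiment = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]

    # Reading complexity: Flesch-Kincaid grade level (higher = more complex)
    complexity = textstat.flesch_kincaid_grade(text)

    # Self-reference: share of tokens that are first-person pronouns
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    self_ref = sum(t in FIRST_PERSON for t in tokens) / max(len(tokens), 1)

    return {
        "sentiment": sentiment,
        "reading_complexity": complexity,
        "self_reference": self_ref,
    }


if __name__ == "__main__":
    sample = ("I think we should revise this section; the sourcing is weak "
              "and the claims need verification before inclusion.")
    print(talk_page_signals(sample))
```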

Author information

Jim Hahn

Penn Libraries, US

Cite this article

Hahn, J. (2024). Synthetic Signal Identification in LLM Datasets. Proceedings of the International Conference on Dublin Core and Metadata Applications, 2024. https://doi.org/10.23106/dcmi.952493684

Issue

DCMI-2024 Toronto, Canada Proceedings
Location:
University of Toronto, Toronto, Ontario, Canada
Dates:
October 20-23, 2024
Metadata and citations of this article are published under the Creative Commons Zero Universal Public Domain Dedication (CC0), allowing unrestricted reuse. Anyone may freely use the metadata from DCPapers articles for any purpose without limitation.
The full text of this article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license allows use, sharing, adaptation, distribution, and reproduction in any medium or format, provided that appropriate credit is given to the original author(s) and the source is cited.