Abstract
This research addresses the quality of training data in LLMs using methods from signaling theory and the talk page metadata of Wikipedia articles. The significance of the method lies in lowering the cost of information quality assessment in datasets. Natural language processing applied to the metadata text produced sentiment, reading complexity, and self-reference scores, which serve as the computationally derived signals. Results showed that indicators of information quality can be identified through textual computation over the talk page metadata associated with articles.
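The abstract does not specify the NLP toolchain used to derive the three scores, so the following is only a minimal sketch of how sentiment, reading complexity, and self-reference signals might be computed from a talk page comment. The tiny sentiment lexicon, the pronoun list, and the function names (e.g., `talk_page_signals`) are illustrative assumptions, not the authors' implementation; the reading-complexity proxy uses the standard Flesch Reading Ease formula.

```python
import re

# Illustrative placeholder lexicons; the paper does not name its sentiment model
# or self-reference markers, so these stand in for whatever was actually used.
POSITIVE = {"good", "great", "helpful", "clear", "thanks", "agree", "improve"}
NEGATIVE = {"bad", "wrong", "dispute", "vandalism", "bias", "remove", "disagree"}
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def tokenize(text: str) -> list[str]:
    """Lowercased word tokens (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", text.lower())


def count_syllables(word: str) -> int:
    """Rough syllable count: contiguous vowel groups, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def sentiment_score(tokens: list[str]) -> float:
    """Net lexicon hits normalized by token count, in [-1, 1]."""
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease as a reading-complexity proxy (higher = easier)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    tokens = tokenize(text)
    words = max(1, len(tokens))
    syllables = sum(count_syllables(t) for t in tokens)
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def self_reference_score(tokens: list[str]) -> float:
    """Share of first-person pronouns among all tokens."""
    if not tokens:
        return 0.0
    return sum(t in FIRST_PERSON for t in tokens) / len(tokens)


def talk_page_signals(talk_text: str) -> dict:
    """Bundle the three signals derived from one talk page comment."""
    tokens = tokenize(talk_text)
    return {
        "sentiment": sentiment_score(tokens),
        "reading_ease": flesch_reading_ease(talk_text),
        "self_reference": self_reference_score(tokens),
    }


if __name__ == "__main__":
    comment = "I think we should remove the biased paragraph; my sources disagree with it."
    print(talk_page_signals(comment))
```

In practice the placeholder lexicons would be replaced by an established sentiment scorer and the scores aggregated across all comments on an article's talk page before being used as quality signals for the corresponding training data.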
Cite this article
Hahn, J. (2024). Synthetic Signal Identification in LLM Datasets. International Conference on Dublin Core and Metadata Applications, 2024. https://doi.org/10.23106/dcmi.952493684