
Automatic classification of construction accident reports using BERTopic-GLDA approach

  • Jun Wang*
  • Ziyi Qu
  • Shujie Wu
  • Martin Skitmore
  • Leyuan Ma
  *Corresponding author for this work

Research output: Contribution to journal › Article › Research › peer-review


Abstract

Purpose:
This study proposes a semi-supervised classification framework that reduces reliance on labeled data, addresses class imbalance and improves the interpretability of classification results.

Design/methodology/approach:
A semi-supervised BERTopic-Guided Latent Dirichlet Allocation (GLDA) framework is introduced, integrating BERTopic's contextual keyword extraction with the semi-supervised capabilities of GLDA. BERTopic uses context-aware language embeddings to generate semantically rich, domain-specific seed words. These seed words guide GLDA in defining topics a priori, thereby enabling robust semi-supervised classification. The framework is evaluated on two OSHA datasets and benchmarked against statistical keyword-based methods, including YAKE-GLDA. Its performance is also compared with traditional supervised models such as SVM and CNN.
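The abstract does not say how the seed words enter GLDA. One common way to guide a topic model, shown here only as a minimal sketch and not as the authors' implementation, is to raise the topic-word prior for each topic's seed words. The function name, the `base` and `boost` parameters, and the toy vocabulary and seed words are all hypothetical.

```python
from typing import Dict, List


def build_seeded_prior(
    vocab: List[str],
    seed_words: Dict[str, List[str]],
    base: float = 0.01,
    boost: float = 1.0,
) -> List[List[float]]:
    """Build one topic-word prior row per topic.

    Every word starts at `base`; each seed word of a topic gets an extra
    `boost` in that topic's row. Seed words missing from the vocabulary
    are ignored. Rows follow the insertion order of `seed_words`.
    """
    index = {word: i for i, word in enumerate(vocab)}
    prior = []
    for seeds in seed_words.values():
        row = [base] * len(vocab)
        for word in seeds:
            if word in index:
                row[index[word]] += boost
        prior.append(row)
    return prior


# Hypothetical toy inputs, invented for illustration only.
toy_vocab = ["fall", "ladder", "shock", "wire"]
toy_seeds = {
    "falls": ["fall", "ladder"],
    "electrical": ["shock", "wire"],
}
toy_prior = build_seeded_prior(toy_vocab, toy_seeds)
```

A matrix like this would then be passed as the topic-word prior of an LDA-style sampler, so that each topic is biased toward its seed words before any document is seen.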

Findings:
The BERTopic-GLDA framework demonstrates superior performance across all evaluation metrics. For Dataset 1, it achieves a macro F1 score of 0.64, outperforming YAKE-GLDA (0.53, a 20.8% improvement), support vector machine (SVM) (0.33, a 93.9% improvement) and convolutional neural network (CNN) (0.30, a 113% improvement). For Dataset 2, it achieves a macro F1 score of 0.73, surpassing YAKE-GLDA (0.43, a 69.8% improvement), SVM (0.55, a 32.7% improvement), and CNN (0.41, a 78.0% improvement). The framework performs particularly well in classifying minority classes, where traditional supervised models often fail and YAKE-GLDA performs poorly. This capability effectively mitigates class imbalance. Additionally, the method reduces dependence on pre-labeled data and improves interpretability, providing a scalable solution for real-world construction safety applications.
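The percentage improvements quoted above follow from the reported macro F1 scores as (new − baseline) / baseline. The short sketch below reproduces them; the function name and dictionary layout are illustrative, and the scores are the ones stated in the abstract.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Macro F1 scores as reported in the abstract.
reported = {
    "Dataset 1": {"BERTopic-GLDA": 0.64, "YAKE-GLDA": 0.53, "SVM": 0.33, "CNN": 0.30},
    "Dataset 2": {"BERTopic-GLDA": 0.73, "YAKE-GLDA": 0.43, "SVM": 0.55, "CNN": 0.41},
}

for dataset, scores in reported.items():
    ours = scores["BERTopic-GLDA"]
    for model, f1 in scores.items():
        if model != "BERTopic-GLDA":
            gain = relative_improvement(ours, f1)
            print(f"{dataset}: vs {model} = {gain:.1f}%")
```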

Originality/value:
A novel semi-supervised approach is introduced for classifying construction accident reports, achieving higher classification accuracy while overcoming challenges posed by imbalanced datasets. Unlike conventional supervised methods, the framework does not require extensive pre-labeled datasets, reducing resource demands. Linking classification outcomes to meaningful topic keywords ensures interpretability, allowing practitioners to trace predictions to underlying linguistic patterns. Integrating BERTopic and GLDA significantly advances semi-supervised learning for construction accident classification, providing a practical tool for enhanced risk assessment and decision-making. In practice, the framework can be integrated into safety dashboards to categorize new reports automatically, visualize emerging trends and highlight high-risk categories. This integration facilitates faster incident response and more targeted safety management.
Original language: English
Pages (from-to): 1-30
Number of pages: 30
Journal: Engineering, Construction and Architectural Management
Publication status: E-pub ahead of print - 25 Dec 2025
