This study proposes an automated method for manifesting construction activity scenes through image captioning, an approach rooted in computer vision and natural language generation. A linguistic description schema for manifesting the scenes is first developed, and two dedicated image captioning datasets are created for method validation. A general image captioning model architecture is then established by combining an encoder-decoder framework with deep neural networks, followed by three experimental tests involving the selection of model learning strategies and performance evaluation metrics. The results demonstrate that the method's performance is comparable with that of state-of-the-art computer vision methods in general. The paper concludes with a discussion of the feasibility of applying the proposed approach in practice at the current technical level.