Detecting non-natural language artifacts for de-noising bug reports

检测非自然语言成分以减少错误报告中的噪声

阅读:1

Abstract

Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate the issue ticket sizes, but also can this noise constitute a real problem for some NLP approaches, and therefore has to be removed in the pre-processing of some approaches. In this paper, we present a machine learning based approach to classify textual content into natural language and non-natural language artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom preprocessing approach for the task of artifact removal. The training sets are automatically created from Markdown annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise reduction pre-processing steps for NLP and IR approaches working on issue tickets.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。