Andrian Marcus

The Use of Text Retrieval and Natural Language Processing in Software Engineering

Abstract
During software evolution, many related artifacts are created or modified. Some of these are composed of structured data (e.g., analysis data), some contain semi-structured information (e.g., source code), and many include unstructured information (e.g., natural language text). Software artifacts written in natural language (e.g., requirements, design documents, user manuals, scenarios, bug reports, and developers’ messages), together with the comments and identifiers in the source code, encode to a large degree the domain of the software and the developers’ knowledge about the system, and capture design decisions, developer information, etc. In many software projects, the amount of unstructured information exceeds the size of the source code by an order of magnitude. Retrieving and analyzing the textual information in software is extremely important in supporting program comprehension and a variety of software evolution tasks. Moreover, many researchers now focus on mining and analyzing textual information from internet-based sources such as Stack Overflow and app markets. This information is then used to support processes and development activities. Text retrieval (TR) is a branch of information retrieval (IR) where the information is stored primarily in the form of text. TR methods are suitable candidates to help in the retrieval and analysis of textual data embedded in software or present in other sources. This tutorial presents some of the most popular TR methods and their applications in software engineering (SE). In most SE applications, TR techniques are used in conjunction with natural language processing (NLP) tools. The main NLP techniques used by software engineering researchers will also be presented.
More than 20 different SE tasks are being addressed through TR and NLP of software documents, such as traceability link recovery, concern/concept/feature/bug location, software search, change impact analysis, requirements analysis, bug triage, refactoring, and defect prediction. The tutorial will focus on the tasks that are most popular among researchers and where the use of TR and NLP methods has proved very successful.


Speaker's Bio
Andrian Marcus is Associate Professor in the Department of Computer Science at The University of Texas at Dallas. From 2003 to 2014 he was a faculty member in the Department of Computer Science at Wayne State University (Detroit, MI). He obtained his Ph.D. in Computer Science from Kent State University (USA) and has prior degrees from The University of Memphis (Memphis, TN) and Babes-Bolyai University (Cluj-Napoca, Romania). His current research interests are in software engineering, with a focus on software evolution and program comprehension. He is best known for his work on using information retrieval and text mining techniques for software analysis to support comprehension during software evolution. He received several Best Paper Awards and a Most Influential Paper Award, and his research was funded by NSF, NIH, IBM, etc. He is also a former junior Fulbright Scholar (1997-98). He served on the Steering Committee of the IEEE International Conference on Software Maintenance and Evolution (ICSME) from 2005 to 2008 and from 2011 to 2014, and on the Steering Committee of the IEEE International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT) from 2005 to 2009. He was the General Chair and the Program Co-chair of ICSME in 2011 and 2010, respectively, and served on the program and organizing committees of many other software engineering conferences. He currently serves on the editorial boards of the IEEE Transactions on Software Engineering, the Empirical Software Engineering Journal (Springer), and the Journal of Software: Evolution and Process (John Wiley and Sons).


SEschool@unibz