Dave Binkley, Matthew Hearn, Dawn Lawrie
Loyola University Maryland
Baltimore MD 21210-2699, USA
{binkley, lawrie}@cs.loyola.edu, mthearn@loyola.edu
Keywords: source code analysis tools, natural language processing, program comprehension, identifier analysis
Abstract
Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags, several rules emerge that aim to improve structure-field naming.
Source Code ⇒ Source Code Mark-up ⇒ Extract Field Names ⇒ Split Field Names ⇒ Apply Template ⇒ Part of Speech Tagging
Figure 1. Process for POS tagging of field names.

The text available in source-code artifacts, in particular a program's identifiers, has a very different structure. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval.
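As a rough illustration of the pipeline in Figure 1, the sketch below splits a field name, embeds the resulting words in a sentence-like template, and hands them to an off-the-shelf tagger. The splitter, the template wording, and the use of NLTK's tagger (as a stand-in for the tagger of [12]) are assumptions made for illustration only, not the exact components used in this work.

```python
# Minimal sketch of the Figure 1 pipeline: split a field name, wrap the words
# in a sentence-like template, and POS-tag them with an off-the-shelf tagger.
# NLTK and the "please show the" template are illustrative assumptions.
import re
import nltk  # may require: nltk.download('averaged_perceptron_tagger')

def split_identifier(name: str) -> list[str]:
    """Split a field name on underscores and camelCase boundaries."""
    words = []
    for part in re.split(r'_+', name):
        words.extend(re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+', part))
    return [w.lower() for w in words if w]

def tag_field_name(name: str) -> list[tuple[str, str]]:
    """Embed the split words in a template sentence and POS-tag them."""
    words = split_identifier(name)
    template = ['please', 'show', 'the'] + words  # hypothetical template
    tagged = nltk.pos_tag(template)
    return tagged[3:]  # drop the template words, keep tags for the identifier

if __name__ == '__main__':
    print(tag_field_name('oldest_borrowed_book_count'))
    # e.g. [('oldest', 'JJS'), ('borrowed', 'VBN'), ('book', 'NN'), ('count', 'NN')]
```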
4 Related Work

This section briefly reviews three projects that use POS information. Each uses an off-the-shelf POS tagger or a lookup table. First, Høst and Østvold study the naming of Java methods, using a lookup table to assign POS tags [7]. Their aim is to find what they call "naming bugs" by checking whether a method's name properly reflects its implementation (a hedged sketch of such a check appears after the reference list). Second, Abebe and Tonella study class, method, and attribute names using a POS tagger based on a modification of minipar to formulate domain concepts [1]. Nouns in the identifiers are examined to form ontological relations between concepts. Based on a case study, their approach improved concept searching. Finally, Shepherd et al. consider finding concepts in code using natural language information [11]. The resulting Find-Concept tool locates action-oriented concerns more effectively than competing tools and with less user effort. This is made possible by applying POS information to source code.

References

[1] S. L. Abebe and P. Tonella. Natural language parsing of program element names for concept extraction. In 18th IEEE International Conference on Program Comprehension. IEEE, 2010.
[2] K. Atkinson. Spell checking oriented word lists (SCOWL).
[3] E. Boschee, R. Weischedel, and A. Zamanian. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis, 2005.
[4] B. Caprile and P. Tonella. Restructuring program identifier names. In ICSM, 2000.
[5] M. L. Collard, H. H. Kagdi, and J. I. Maletic. An XML-based lightweight C++ fact extractor. In 11th IEEE International Workshop on Program Comprehension, pages 134–143, 2003.
[6] E. Høst and B. Østvold. The programmer's lexicon, volume I: The verbs. In International Working Conference on Source Code Analysis and Manipulation, Beijing, China, September 2008.
[7] E. W. Høst and B. M. Østvold. Debugging method names. In ECOOP 09. Springer Berlin / Heidelberg, 2009.
[8] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL 2007, 2007.
[9] D. Lawrie, D. Binkley, and C. Morrell. Normalizing source code vocabulary. In Proceedings of the 17th Working Conference on Reverse Engineering, 2010.
[10] L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In ACL 07. ACL, June 2007.
[11] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In AOSD 07. ACM, March 2007.
[12] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL 2003, 2003.
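The sketch below is a hedged illustration in the spirit of the naming-bug checks of Høst and Østvold [7]: it flags a method whose name does not begin with a verb according to a small lookup table. The lookup table, the splitter, and the verb-first rule are illustrative assumptions; the actual analysis in [7] compares name phrases against attributes of the method body.

```python
# Illustrative, simplified naming-bug check: does a Java-style method name
# start with a word found in a (hypothetical) verb lookup table?
import re

VERB_LOOKUP = {'get', 'set', 'is', 'has', 'add', 'remove', 'create',
               'find', 'compute', 'read', 'write', 'update'}

def first_word(method_name: str) -> str:
    """Return the first camelCase word of a method name, lower-cased."""
    match = re.match(r'[a-z]+|[A-Z][a-z]*', method_name)
    return match.group(0).lower() if match else method_name.lower()

def looks_like_naming_bug(method_name: str) -> bool:
    """Heuristic: a method name is suspicious if it does not start with a known verb."""
    return first_word(method_name) not in VERB_LOOKUP

for name in ['getBalance', 'balanceOf', 'updateTotal']:
    print(name, '->', 'suspicious' if looks_like_naming_bug(name) else 'ok')
```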