The Natural Language Toolkit (NLTK) is a suite of libraries and programs for building Python programs that work with symbolic and statistical natural language processing (NLP). Its development began in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has continued to develop as a free, open-source, community-driven project.
“Natural” language means human language as used in everyday communication. Natural language processing covers a broad range of topics, including interaction between computers and human languages, natural language understanding by computers, and natural language generation. Language processing plays an important role in our multilingual information society, and NLP-based technologies are steadily expanding their reach: predictive text and handwriting recognition, robust web search engines, machine translation, text analysis, and more.
NLTK is a broad-coverage natural language toolkit that provides easy-to-use interfaces to over 50 corpora (large bodies of linguistic data) and lexical resources, as well as a suite of text-processing libraries for tokenization, classification, tagging, stemming, parsing, and semantic reasoning. NLTK was designed as a uniform framework with consistent interfaces, consistent data structures, and easily recognised method names. It is also modular, so its building blocks and components can be used separately or in different combinations, enabling alternative approaches to the same task.
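As a small illustration of these easy-to-use interfaces, the sketch below tokenizes a sentence with NLTK's regular-expression tokenizer. It assumes NLTK is installed (`pip install nltk`); the sample sentence is invented for illustration.

```python
# A minimal sketch of NLTK's tokenization interface.
# Assumes NLTK is installed; the sample text is invented.
from nltk.tokenize import RegexpTokenizer

# Split the text on runs of word characters, dropping punctuation.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize("NLTK makes text processing easy, even for beginners!")
print(tokens)
# ['NLTK', 'makes', 'text', 'processing', 'easy', 'even', 'for', 'beginners']
```

The same tokenizer object can be reused on any string, which reflects the consistent-interface design mentioned above: other tokenizers in `nltk.tokenize` expose the same `tokenize()` method.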
Python was chosen for NLP work for a number of reasons. It supports interactive exploration and re-use of data and methods, and it has transparent syntax and semantics. Natural language data can be represented with basic built-in classes, and Python implementations of individual tasks can be combined to solve complex problems. Moreover, the language has extensive libraries for graphical programming, numerical processing, and web connectivity. Python software is known for its productivity, quality, and maintainability.
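To illustrate the point about basic classes, the sketch below represents a tiny invented corpus using only plain Python strings, lists, and a dictionary-like counter, with no external dependencies:

```python
# Representing language data with Python's built-in classes.
# The toy corpus below is invented for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat slept"
tokens = corpus.split()       # a list of word strings
counts = Counter(tokens)      # a dict-like mapping of word -> frequency

print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```

Strings, lists, and dictionaries are exactly the structures NLTK itself builds on, which is why its data flows naturally into ordinary Python code.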
NLTK contains modules for different language processing tasks and functionalities:
- Accessing corpora (standardized interfaces to corpora and lexicons)
- String processing (tokenizers, stemmers)
- Collocation discovery (t-test, chi-squared, point-wise mutual information)
- Part-of-speech tagging (n-gram, backoff, Brill, HMM, TnT)
- Chunking (regular expression, n-gram, named-entity)
- Parsing (chart, feature-based, unification, probabilistic, dependency)
- Machine learning (decision tree, maximum entropy, naive Bayes, EM, k-means)
- Linguistic fieldwork (manipulate data in SIL Toolbox format)
- Applications (graphical concordancer, parsers, WordNet browser, chatbots)
- Semantic interpretation (lambda calculus, first-order logic, model checking)
- Evaluation metrics (precision, recall, agreement coefficients)
- Probability and estimation (frequency distributions, smoothed probability distributions)
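As a small taste of the string-processing and frequency-distribution modules listed above, the hedged sketch below stems a few words with NLTK's Porter stemmer and counts the resulting stems. It assumes NLTK is installed; the word list is invented for illustration.

```python
# Stemming and frequency counting with NLTK.
# Assumes NLTK is installed; the word list is invented.
from nltk.stem import PorterStemmer
from nltk import FreqDist

stemmer = PorterStemmer()
words = ["running", "runs", "run", "cats", "cat"]
stems = [stemmer.stem(w) for w in words]
print(stems)                  # ['run', 'run', 'run', 'cat', 'cat']

# FreqDist counts occurrences, here of each stem.
fdist = FreqDist(stems)
print(fdist.most_common(1))   # [('run', 3)]
```

Because the modules are independent, the stemmer could just as easily feed a tagger, a classifier, or a collocation finder, which is the modularity described above.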
NLTK is thoroughly documented and simple to use, so its user base is broad: linguists, engineers, students, educators, and researchers. This uniform toolkit can serve as a platform for prototyping and building research systems, and as a tool for teaching or individual study. NLTK is available for Windows, Mac OS X, and Linux, together with a programming fundamentals guide and comprehensive API documentation. For more information, visit the NLTK website.