Scikit-learn is a free, open source machine learning library for Python. It offers efficient, easy-to-use tools for data mining and data analysis, and provides a large number of machine learning algorithms for supervised and unsupervised problems.
New libraries, extensions, and modules for SciPy (an open source library of scientific tools) are called SciKits. This particular library focuses on learning algorithms, hence the name scikit-learn (previously scikits.learn). It was started in 2007 as a Google Summer of Code project and publicly released in 2010. The project was designed to be robust and emphasizes ease of use, code quality, collaboration, API consistency, and performance.
Scikit-learn is written mostly in Python, with some parts in Cython to improve performance. Its required dependencies include the numerical library NumPy and the scientific library SciPy; a working C/C++ compiler is needed only when building the library from source. Unlike SciPy or pandas, scikit-learn is not concerned with loading, manipulating, and summarizing data; it focuses on modeling data instead.
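A quick way to confirm that the library and the dependencies named above are importable is to print their versions. This is only a minimal sketch; the `versions` dictionary is an illustrative name, not part of any library.

```python
import numpy
import scipy
import sklearn

# Collect the version string each package reports about itself.
versions = {
    "scikit-learn": sklearn.__version__,
    "NumPy": numpy.__version__,
    "SciPy": scipy.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails, the corresponding package is missing from the current Python environment.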
Scikit-learn exposes a set of supervised and unsupervised learning algorithms through a consistent Python interface. In supervised learning, the incoming data comes with additional attributes (targets) that need to be predicted, via either classification or regression. In unsupervised learning, the incoming data is a set of input vectors x without corresponding target values, so the goals vary: density estimation (determining the distribution of data within the input space), clustering (finding groups of similar objects within the data), visualization (projecting the data from a high-dimensional space down to two or three dimensions), and so on.
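The consistent interface can be illustrated with a short sketch. The choice of the built-in iris dataset, the specific estimators (KNeighborsClassifier, KMeans, PCA), and all parameter values are assumptions made for illustration only; other estimators follow the same fit/predict/transform pattern.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Supervised: fit on labeled data, then score on held-out samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct predictions

# Unsupervised clustering: only the input vectors are passed, no targets.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Unsupervised visualization: project the 4 features down to 2 dimensions.
X_2d = PCA(n_components=2).fit_transform(X)

print(f"test accuracy: {accuracy:.2f}")
print("cluster ids shape:", cluster_ids.shape)
print("projected shape:", X_2d.shape)
```

Note that the supervised estimator receives both X and y in fit, while the unsupervised ones receive X alone; otherwise the calls look the same.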
Scikit-learn offers a wide range of models, grouped by objective:
- Data preprocessing - normalization, changing raw data into suitable representations.
- Clustering - automatic grouping of similar objects into sets (algorithms: KMeans, mean-shift, spectral clustering).
- Classification - identifying which category an object belongs to (algorithms: SVM, random forest, nearest neighbors, etc.).
- Regression - predicting a continuous-valued variable from existing values and related attributes (algorithms: ridge regression, SVR, Lasso).
- Cross Validation - estimating the performance of estimators (supervised models) on unseen data.
- Dimensionality Reduction - reducing the number of random variables in data for summarization, visualization and feature selection (algorithms: feature selection, PCA, non-negative matrix factorization).
- Ensemble methods - combining the predictions of multiple estimators (supervised models).
- Feature extraction - representing attributes of image and text data in a format supported by machine learning algorithms.
- Feature selection - identifying meaningful attributes to improve accuracy scores or performance.
- Manifold Learning (an approach to nonlinear dimensionality reduction) - summarizing and representing complex multi-dimensional data.
- Model selection - comparing models, parameter tuning (modules: grid search, metrics, cross validation).
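Several of the groups listed above can be combined in a few lines. The following is a minimal sketch, assuming the built-in iris dataset, a StandardScaler for preprocessing, an SVC for classification, and an arbitrary illustrative grid of C values; none of these choices is prescribed by the library.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Data preprocessing + classification chained into a single estimator.
model = make_pipeline(StandardScaler(), SVC())

# Cross validation: accuracy on each of 5 held-out folds.
scores = cross_val_score(model, X, y, cv=5)

# Model selection: grid search over the SVC regularization parameter C.
# make_pipeline names each step after its lowercased class name ("svc").
param_grid = {"svc__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

print("mean CV accuracy:", scores.mean())
print("best parameters:", search.best_params_)
```

Placing the scaler inside the pipeline means it is refit on the training portion of each fold, so the held-out data does not leak into preprocessing.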
Scikit-learn is BSD-licensed machine learning software for Python that provides classification, clustering, and regression algorithms. Because it is built on a general-purpose, high-level language, it is accessible even to non-specialists. For more information, visit the scikit-learn website.