# Scikit Learn (Beginners) — Part 1

This is part one of the Scikit-learn series, which is as follows:

- Part 1 — Introduction (this article)
- Part 2 — Supervised Learning in Scikit-Learn
- Part 3 — Unsupervised Learning in Scikit-Learn

Link to part two: https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-2-ca78a51803a8

Link to part three: https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-3-6fb05798acb1

**Introduction**

New to machine learning? Don’t know how to get started with this amazing library? Then hang on, because you are about to get started with a free library that will help you boost your skills.

Before moving on to the different features it offers, let us understand what scikit-learn actually is!

So, scikit-learn is a machine learning library for the Python programming language. It offers various important machine learning features such as classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and it is designed to interoperate with Python's numerical and scientific libraries, **NumPy** and **SciPy**.

**We will discuss each algorithm and its implementation with code in detail in the second part of this series.**
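As a quick taste of what is coming, here is the basic workflow shared by nearly every scikit-learn estimator: create it, `fit` it on data, then `predict`. This is a minimal sketch on made-up toy data, not code from the official documentation:

```python
# Minimal scikit-learn workflow: create an estimator, fit, predict.
# The tiny dataset below is invented purely for illustration.
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]   # four samples with one feature each
y = [0, 0, 1, 1]           # their class labels

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                  # learn from the labeled data
print(model.predict([[0.2]]))    # predict the class of a new sample
```

Every supervised estimator in the library follows this same `fit`/`predict` pattern, which is what makes scikit-learn so approachable.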

**Supervised Algorithms In Scikit-Learn**

Since you are familiar with machine learning, you already know that there are two types of algorithms, i.e. **supervised** and **unsupervised**.

So, let us see what scikit-learn offers us in supervised algorithms.

The problem of supervised learning can be broken into two parts:

**Classification:** Samples belong to two or more classes, and we want to learn from already-labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning, where one has a limited number of categories and, for each of the n samples provided, one tries to label it with the correct category or class.

**Regression:** If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem is predicting the yield of a chemical manufacturing process, where the inputs consist of the concentrations of the reactants, the temperature, and the pressure.
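A regression sketch in the same spirit, with made-up numbers standing in for the chemical process (the data is illustrative, not real chemistry):

```python
# Regression sketch: predict a continuous target from one input feature.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[100], [150], [200], [250]])  # e.g. process temperature
y = np.array([10.0, 15.0, 20.0, 25.0])      # continuous target, e.g. yield

reg = LinearRegression().fit(X, y)
print(reg.predict([[175]]))   # → 17.5, since the toy data is exactly linear
```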

Scikit-learn supports the following models:

1. Generalized Linear Models

2. Linear and Quadratic Discriminant Analysis

3. Kernel ridge regression

4. Support Vector Machines

5. Stochastic Gradient Descent

6. Nearest Neighbors

7. Gaussian Processes

8. Cross decomposition

9. Naive Bayes

10. Decision Trees

11. Ensemble methods

12. Multiclass and multilabel algorithms

13. Feature selection

14. Semi-Supervised

15. Isotonic regression

16. Probability calibration

17. Neural network models (supervised)

**Unsupervised Algorithms In Scikit-Learn**

Now, let us see what scikit-learn offers us in unsupervised algorithms.

Here the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data (clustering), to determine the distribution of data within the input space (density estimation), or to project the data from a high-dimensional space down to two or three dimensions for visualization.
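The clustering case can be sketched with k-means on a handful of made-up, unlabeled points; note that we never pass any target values to `fit`:

```python
# Clustering sketch: group unlabeled points with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Six unlabeled points forming two visibly separate groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no y at all
print(km.labels_)   # a cluster index for each point
```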

Scikit-learn supports these models:

1. Gaussian mixture models

2. Manifold learning

3. Clustering

4. Biclustering

5. Decomposing signals in components (matrix factorization problems)

6. Covariance estimation

7. Novelty and Outlier Detection

8. Density Estimation

9. Neural network models (unsupervised)

# Model Selection and Evaluation

As we know, learning the parameters of a prediction function and testing it on the same data is a methodological mistake (you could even call it cheating): a model that simply repeated the labels of the samples it had just seen would achieve a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called **overfitting**. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.
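Holding out a test set is a one-liner with `train_test_split`; here is a sketch on the built-in iris dataset, reserving 25% of the samples for testing:

```python
# Hold out part of the data as a test set to guard against overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # 25% held out for evaluation

print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```

Train only on `X_train, y_train`, and report scores on `X_test, y_test`, which the model has never seen.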

Model selection covers the following:

1. Cross-validation: evaluating estimator performance

2. Tuning the hyper-parameters of an estimator

3. Model evaluation: quantifying the quality of predictions

4. Model persistence

5. Validation curves: plotting scores to evaluate models

*Note: How these are implemented will be discussed later in this series.*
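As a small preview of item 1, cross-validation can be sketched in a few lines: `cross_val_score` repeatedly splits the data into folds, trains on some folds, and scores on the rest (the linear-kernel SVM here is just an example estimator, not a recommendation):

```python
# Cross-validation sketch: score an estimator on 5 train/test splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)  # 5 folds

print(scores)         # one accuracy score per fold
print(scores.mean())  # the average accuracy across folds
```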

# Dataset transformations

These are represented by classes with a fit method, which learns model parameters (e.g. the mean and standard deviation for normalization) from a training set, and a transform method, which applies this transformation model to unseen data. **fit_transform** may be more convenient and efficient when fitting and transforming the training data simultaneously.
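The fit/transform pattern is easiest to see with `StandardScaler`, which learns per-feature means and standard deviations and then standardizes the data; the toy matrix below is made up for illustration:

```python
# Transformer sketch: fit learns the statistics, transform applies them.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # fit + transform in one call

print(scaler.mean_)           # learned per-feature means: [ 2. 20.]
print(X_scaled.mean(axis=0))  # the scaled data has zero mean per feature
```

Crucially, new data (e.g. a test set) should go through `scaler.transform` only, so it is standardized with the statistics learned from the training set.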

It has the following sub-categories:

1. Pipeline and FeatureUnion: combining estimators

2. Feature extraction

3. Preprocessing data

4. Unsupervised dimensionality reduction

5. Random Projection

6. Kernel Approximation

7. Pairwise metrics, Affinities and Kernels

8. Transforming the prediction target (y)

# Dataset Loading Utilities

The **sklearn.datasets** package embeds some small yet useful datasets.

Scikit-learn offers various datasets; some of them are:

1. The Olivetti faces dataset

2. The 20 newsgroups text dataset

3. Downloading datasets from the mldata.org repository

4. The Labeled Faces in the Wild face recognition dataset

5. Forest covertypes

6. RCV1 dataset

7. Boston House Prices dataset

8. Breast Cancer Wisconsin (Diagnostic) Database

9. Diabetes dataset

10. Optical Recognition of Handwritten Digits Data Set

11. Iris Plants Database

*and many more…..*
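Loading one of these bundled datasets takes a single call; here is a sketch with the Iris Plants Database from the list above:

```python
# Dataset loading sketch: the iris dataset ships with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features each
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```

Each loader returns a Bunch object with `data`, `target`, and descriptive attributes, so you can start experimenting without downloading anything.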

# Strategies to scale computationally: bigger data

For some applications, the number of examples, the number of features, and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases, scikit-learn has a number of options you can consider to make your system scale better.
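One such option is out-of-core (incremental) learning: estimators that implement `partial_fit` can be trained on mini-batches instead of loading the full dataset into memory. The streaming loop below is a sketch on randomly generated data, not a real pipeline:

```python
# Out-of-core sketch: train incrementally on mini-batches via partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])   # all classes must be declared up front

rng = np.random.RandomState(0)
for _ in range(5):                              # simulate a data stream
    X_batch = rng.rand(20, 3)                   # a mini-batch of 20 samples
    y_batch = (X_batch[:, 0] > 0.5).astype(int) # toy rule: feature 0 > 0.5
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict([[0.9, 0.2, 0.2]]))  # classify a new sample
```

Because each call to `partial_fit` only sees one batch, memory usage stays constant no matter how large the stream grows.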

# Computational Performance

For some applications the performance (mainly latency and throughput at prediction time) of estimators is crucial.

So, scikit-learn offers guidance that makes it easier to measure and improve this performance.

Its documentation covers the following topics:

- Prediction Latency
- Prediction Throughput

# More coming soon….

We have now covered almost every feature that scikit-learn offers, and by this point you should understand the importance of this wonderful, easy-to-use yet powerful library. It’s easy to get intimidated by seeing so much at one glance, but don’t worry: you will learn it in the easiest way possible, so stick with me on this journey.

Get ready to dive deep into implementations of these features with code.