2.3. Getting datasets in code

2.3.1. Generate a synthetic dataset with outliers

Anomaly detection is a fascinating unsupervised problem. To practice solving it, you can use the PyOD (Python Outlier Detection) library’s generate_data function.

Its features are:

  1. Controlling the proportion of outliers in the data (contamination)

  2. Choosing the number of features (n_features)

  3. Returning the inlier/outlier labels alongside the data

Here is an example 2-dimensional dataset generated with the function and visualized with Seaborn:

2.3.2. Generate synthetic datasets with Sklearn

You can generate synthetic data in a wide range of shapes and patterns with Sklearn's datasets module, controlling the number of samples, features, and classes. The most basic functions are make_classification and make_regression.

These datasets are great for proof-of-concepts or just simple practice.
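The two functions can be sketched as follows. The sample counts, feature counts, noise level, and seed are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification, make_regression

# Binary classification problem: 10 features, 5 of which carry signal
X_clf, y_clf = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,  # features that actually determine the class
    n_redundant=2,    # linear combinations of the informative features
    n_classes=2,
    random_state=42,
)

# Regression problem with Gaussian noise added to the target
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    noise=10.0,       # standard deviation of the noise on the target
    random_state=42,
)

print(X_clf.shape, y_clf.shape)
print(X_reg.shape, y_reg.shape)
```

Fixing `random_state` makes the generated data reproducible, which helps when comparing models on the same synthetic problem.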

2.3.3. Faker - generate fake data

As if all the data in the world were not enough, you can generate synthetic datasets as well. Faker is a widely used Python library for this.

Each call to a Faker method returns a new random value: a name, address, email, phone number, or one of dozens of other fake attributes. Below is a sample banking dataset with 10k records.

Link to the library: https://faker.readthedocs.io/