2.1. Importing data

2.1.1. Reading large CSV files with Datatable

If you load a giant CSV file with Pandas, the agonizing wait that follows is entirely on you.

The Pandas documentation itself warns you to stay far away from Pandas when importing large files, and suggests alternatives like Dask or Vaex.

I have a better suggestion: use datatable.

The R-inspired package is popular on Kaggle and it is darn fast.
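A minimal sketch of what that looks like (the file name is a placeholder; converting to pandas at the end is optional):

```python
import datatable as dt

# fread is multi-threaded and dramatically faster than pd.read_csv on big files
frame = dt.fread("large_dataset.csv")

# convert to a pandas DataFrame only if you need the pandas API afterwards
df = frame.to_pandas()
```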

2.1.2. Reading MATLAB, SAS, Stata and HDF5 files in Python

The world isn’t full of CSVs and Parquets. There are still a few stubborn people churning out data in weird formats like old MATLAB, SAS, Stata, or massive HDF5 files.

Here is how you can read them in Python👇
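A rough sketch, with placeholder file names, showing one common reader per format:

```python
import pandas as pd
import h5py                     # HDF5 files (also MATLAB v7.3 .mat files)
from scipy.io import loadmat    # legacy MATLAB .mat files (v7.2 and earlier)

mat = loadmat("data.mat")                  # dict of variable name -> NumPy array
sas_df = pd.read_sas("data.sas7bdat")      # SAS dataset
stata_df = pd.read_stata("data.dta")       # Stata dataset

with h5py.File("data.h5", "r") as f:       # browse a massive HDF5 file lazily
    print(list(f.keys()))
```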

2.1.3. Saving to Parquet is much faster

Saving and loading Parquet files is much faster and less painful. Here is a comparison of how long it takes to save an 11GB dataframe to Parquet versus CSV👇
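Here is a sketch of how you could run that comparison yourself (the DataFrame below is a small stand-in for the 11GB one, and to_parquet assumes pyarrow or fastparquet is installed):

```python
import time
import numpy as np
import pandas as pd

# stand-in frame; swap in your own big DataFrame to reproduce the comparison
df = pd.DataFrame(np.random.rand(1_000_000, 10),
                  columns=[f"col{i}" for i in range(10)])

start = time.perf_counter()
df.to_parquet("data.parquet")
print(f"Parquet: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
df.to_csv("data.csv", index=False)
print(f"CSV:     {time.perf_counter() - start:.2f}s")
```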

2.1.4. Read na_values

People denote missing values on a whim. It might be 0, -9999, # or any other symbol or word that comes to their mind. You can catch those values and encode them properly as NaN while reading the data with read_csv. Just pass the custom missing values to the na_values parameter.
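For example (the file name and sentinel list are placeholders for whatever your data uses):

```python
import pandas as pd

# every value listed in na_values is parsed straight to NaN
df = pd.read_csv("data.csv", na_values=[0, -9999, "#", "missing"])
```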

2.1.5. Parquet vs. Feather in terms of memory

Parquet is twice as memory-efficient as Feather.

Because the Parquet file format uses dictionary and RLE (run-length) encoding plus data page compression, it takes up far less disk space than Feather.

If you want to learn more about the differences, I dropped a link to a SO discussion below👇

Link to SO thread: https://bit.ly/3yuDYvx
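If you want to check the difference yourself, here is a sketch with a synthetic frame (real, repetitive data shows an even bigger gap, since that is where dictionary and RLE encoding shine):

```python
import os
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": np.random.choice(["a", "b", "c"], size=1_000_000),  # low-cardinality column
    "value": np.random.rand(1_000_000),
})

df.to_parquet("data.parquet")
df.to_feather("data.feather")

print("Parquet:", os.path.getsize("data.parquet") / 1e6, "MB")
print("Feather:", os.path.getsize("data.feather") / 1e6, "MB")
```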

2.1.6. SQLGlot for changing SQL dialects

Do you know how to translate between all the SQL dialects - Hive, Presto, Spark, MySQL, PostgreSQL, DuckDB, BigQuery, etc.?

With SQLGlot you don’t have to. It is a Python library that has the following features:

  • Written in pure Python

  • Prettify complex SQL queries

  • Translate queries between dialects

  • Rewrite queries into optimized form

  • Parse errors

  • Build and modify queries with Python functions


Link to the library: https://github.com/tobymao/sqlglot
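As a small taste of the translation feature, here is a sketch that transpiles a DuckDB expression into Hive SQL (the example is adapted from the SQLGlot README):

```python
import sqlglot

# translate DuckDB's EPOCH_MS into the Hive equivalent
sql = sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0]
print(sql)  # SELECT FROM_UNIXTIME(1618088028295 / 1000)
```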

2.1.7. More compressed file saving with Joblib

You are wasting precious disk space if you are still saving files with vanilla Joblib.

The dump function of the Joblib library has a compress parameter that lets you specify nine levels of file compression. The higher the number, the more compressed the file and the less disk space it takes up.

However, as you increase compression, read and write times increase accordingly. A common middle ground is level 3 or 4; the default is 0 (no compression).

Below is an example of how you can cut file size roughly in half by going from level 0 to level 4 compression in Joblib.
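Something along these lines (a sketch with a synthetic array; the actual savings depend on how compressible your data is):

```python
import os
import numpy as np
import joblib

# low-cardinality integers compress well; your mileage varies with real data
data = np.random.randint(0, 10, size=5_000_000)

joblib.dump(data, "data_raw.joblib")                  # compress=0, the default
joblib.dump(data, "data_small.joblib", compress=4)    # middle-ground compression

print("Level 0:", os.path.getsize("data_raw.joblib") / 1e6, "MB")
print("Level 4:", os.path.getsize("data_small.joblib") / 1e6, "MB")
```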