Many statistical tests require that the sample values be normally distributed. Unfortunately, real data often is not, but there is a useful workaround: the central limit theorem.
The central limit theorem states that even if the underlying data is not normally distributed, the means of many random samples drawn from it with replacement will be approximately normally distributed. Statistical tests that assume normality can therefore be applied to these sample means, and the results used to draw conclusions about the original data set.
To make use of the central limit theorem in Python, use numpy's sampling functions. Try a few hundred samples, each about one third the size of the data set, and record the mean of each sample in an array. If the resulting means still aren't normally distributed, try changing the number of samples and the sample size. Here is an example of what your code may look like.
import numpy as np

n_samples = 500
sample_size = 800
sample_means = []
for n in range(n_samples):
    # data is the original (non-normal) array of observations
    random_sample = np.random.choice(data, size=sample_size, replace=True)
    sample_means.append(random_sample.mean())
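As a quick sanity check of the idea, here is a self-contained sketch. The exponential data set, seed, and sample counts below are illustrative choices, not part of the original example: the point is that even though the raw data is strongly skewed, the recorded sample means cluster symmetrically around the data's true mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly right-skewed data: an exponential distribution (mean 2.0).
data = rng.exponential(scale=2.0, size=10_000)

n_samples = 500
sample_size = len(data) // 3  # one third of the data set, as suggested above

# Record the mean of each random sample drawn with replacement.
sample_means = np.array([
    rng.choice(data, size=sample_size, replace=True).mean()
    for _ in range(n_samples)
])

# By the central limit theorem, the sample means are approximately normal,
# centered on the mean of the original data, with a much smaller spread
# (roughly data.std() / sqrt(sample_size)).
print(data.mean())
print(sample_means.mean())
print(sample_means.std())
```

A histogram of sample_means (e.g. with matplotlib) will look bell-shaped even though a histogram of data does not.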
Take note that the central limit theorem should not be used to prepare inputs for machine learning algorithms. Almost all machine learning algorithms assume that each observation is independent, but the means of overlapping samples are inherently correlated, and training on them may cause your models to give misleading results.