11 min read

Exploratory data analysis in Python 学习笔记

Statistical Thinking in Python (Part 1)

有部分重复内容,因此可以很快的完成。 介绍了很多基本统计的知识,如mean、median、variance。 numpy包除了处理矩阵,还可以simulate和random假设。 Justin Bois | DataCamp 是他讲的。

Plotting a histogram of iris data | Python

sns.set()改变图像的风格。

Justin assigned his plotting statements (except for plt.show()) to the dummy variable _. This is to prevent unnecessary output from being displayed. It is not required for your solutions to these exercises, however it is good practice to use it.

xplt.hist()的参数之一。

# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Set default Seaborn style
sns.set()

# Plot histogram of versicolor petal lengths
plt.hist(x = versicolor_petal_length)

# Show histogram
plt.show()

Adjusting the number of bins in a histogram | Python

The “square root rule” is a commonly-used rule of thumb for choosing number of bins: choose the number of bins to be the square root of the number of samples.

In [4]: help(len)
Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.

int(n_bins)是对的,但是n_bins.int()是错的。

# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
_ = plt.hist(x = versicolor_petal_length, bins = n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()

Plotting all of your data: Empirical cumulative distribution functions | Python

bee swarm plot 对于相似程度比较高的,很难区别,需要更加精细的方法。 引入ECDF图。 ECDF是empirical cumulative density function.

np.arange()是Array of evenly spaced values. 也就是说,

    >>> np.arange(3)
    array([0, 1, 2])

plt.margins()描述间距,对于ECDF函数,使得y轴为\([0.00,0.20,...,1.00]\)

Computing the ECDF | Python

Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1.

因为np.arange()函数不包括最后的结束值,因此使用n+1

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""

    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

Plotting the ECDF | Python

# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
plt.plot(x_vers, y_vers, marker = '.', linestyle = 'none')

# Make the margins nice
plt.margins(0.02)

# Label the axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')


# Display the plot
plt.show()

Set the margins of the plot with plt.margins() so that no data points are cut off. Use a 2% margin.

Comparison of ECDFs | Python

# Compute ECDFs
x_set, y_set   = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length) 
x_virg, y_virg = ecdf(virginica_petal_length) 

# Plot all ECDFs on the same plot
_ = plt.plot(x_set, y_set, marker='.', linestyle='none')
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')
_ = plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Make nice margins
plt.margins(0.02)

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

The ECDFs expose clear differences among the species. Setosa is much shorter, also with less absolute variability in petal length than versicolor and virginica.

Onward toward the whole story | Python

单独看和比较看,重心刚好相反。 单独看,\(histogram > swarm > ecdf\), 比较看,\(histogram < swarm < ecdf\)

Introduction to summary statistics: The sample mean and median | Python

\(median < mean\)说明,有正向的outliers。

Computing percentiles | Python

# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)

# Print the result
print(ptiles_vers)

np.percentile()两项都要是矩阵。

In [6]: type(versicolor_petal_length)
Out[6]: numpy.ndarray

In [7]: type(percentiles)
Out[7]: numpy.ndarray

Comparing percentiles to ECDF | Python

在ECDF曲线上打标签。

# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
plt.margins(0.02)
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Overlay percentiles as red diamonds.
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
        linestyle='none')

# Show the plot
plt.show()

percentiles/100是因为percentiles = np.array([2.5, 25, 50, 75, 97.5])ptiles_versptiles_vers = np.percentile(versicolor_petal_length, percentiles)

Making a box plot for the petal lengths is unnecessary because the iris data set is not too large and the bee swarm plot works fine.

Box-and-whisker plot | Python

可能箱形图给人感觉太简单,但是确实的,swarm图只是多少描述简单一点的图。

Computing the variance | Python

所以np.var是population定义。

\[Var(x) = \frac{\sum_{i=1}^n(x_i-\bar x)^2}{n}\]

而不是,

\[Var(x) = \frac{\sum_{i=1}^n(x_i-\bar x)^2}{n-1}\]

In [1]: differences = versicolor_petal_length - np.mean(versicolor_petal_length)

In [2]: diff_sq = differences**2

In [3]: variance_explicit = np.mean(diff_sq)

In [4]: variance_np = np.var(versicolor_petal_length)

In [5]: print(variance_explicit, variance_np)
0.2164 0.2164

Covariance and Pearson correlation coefficient | Python

\[covariance = \frac{1}{n}\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y)\]

在直角坐标系中,表达了面积。

Pearson correlation:

\[\rho = \frac{Cov(x,y)}{\sigma_x \cdot \sigma_y}=\frac{variability \space due \space to \space codependence}{independence \space variability }\]

Probabilistic logic and statistical inference | Python

这个老师的统计教的很好。

Hacker statistics: Uses simulated repeated measurements to compute probabilities.

np.random.random(): draw a number between 0 and 1

Bernoulli trial: An experiment that has two options, “success” (True) and “failure” (False).

Generating random numbers using the np.random module | Python

np.random.seed()和R中的set.seed()很像。 np.empty(10000)产生空的向量,如10000个。 np.random.random()产生均匀分布\(U[0,1]\)的数。

# Seed the random number generator
np.random.seed(42)

# Initialize random numbers: random_numbers
random_numbers = np.empty(10000)

# Generate random numbers by looping over range(100000)
for i in range(10000):
    random_numbers[i] = np.random.random()

# Plot a histogram
_ = plt.hist(random_numbers)

# Show the plot
plt.show()

这里都是均匀分布的产生概率了,下面是伯努利分布了。

The np.random module and Bernoulli trials | Python

伯努利实验(Bernoulli trials)需要知道np

returns the number of successes out of n Bernoulli trials, each of which has probability p of success.

这里可以自主产生。

def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p
    and return number of successes."""
    # Initialize number of successes: n_success
    n_success = 0


    # Perform trials
    for i in range(n):
        # Choose random number between zero and one: random_number
        random_number = np.random.random()

        # If less than p, it's a success so add one to n_success
        if random_number < p:
            n_success += 1

    return n_success

random_number < p的设定满足,\(F(x < p) = p\)

How many defaults might we expect? | Python

这里normed = True使得图片中\(y\)轴满足[0,1]

# Seed random number generator
np.random.seed(42)

# Initialize the number of defaults: n_defaults
n_defaults = np.empty(1000)

# Compute the number of defaults
for i in range(1000):
    n_defaults[i] = perform_bernoulli_trials(100,0.05)


# Plot the histogram with default number of bins; label your axes
_ = plt.hist(n_defaults, normed = True)
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('probability')

# Show the plot
plt.show()

Will the bank fail? | Python

# Compute ECDF: x, y
x, y = ecdf(n_defaults)

# Plot the ECDF with labeled axes
plt.plot(x,y,marker='.',linestyle="none")
plt.xlabel('')
plt.ylabel('')

# Show the plot
plt.show()

# Compute the number of 100-loan simulations with 10 or more defaults: n_lose_money
n_lose_money = np.sum(n_defaults <= 5)

# Compute and print probability of losing money
print('Probability of losing money =', n_lose_money / len(n_defaults))

这个地方np.sum(n_defaults <= 5)其中\(n \cdot p=100 \times 0.05=5\),因此应该是50%的概率哈哈。

Probability distributions and stories: The Binomial distribution | Python

PMF: Probability mass function, The set of probabilities of discrete outcomes.

类似于PDF,但是不是density,因此图像是棒棒糖图。

Probability distribution: A mathematical description of outcomes.

Binomial distribution: The number \(r\) of successes in \(n\) Bernoulli trials with probability \(p\) of success, is Binomially distributed

\(r\)的定义是理解这个概念的核心。

np.random.binormial(n,p,size=10)其中,np不需要解释了,size的默认值是1,因此这个函数默认只返回一个结果,但是如果是10,那么返回一个有10个值的行向量。

In [1]: np.random.binomial(4,0.5)
Out[1]: 2

In [2]: np.random.binomial(4,0.5,size=10)
Out[2]: array([4, 3, 2, 1, 1, 0, 3, 2, 3, 0])

Plotting the Binomial PMF | Python

# Compute bin edges: bins
bins = np.arange(0, max(n_defaults) + 1.5) - 0.5

# Generate histogram
_ = plt.hist(n_defaults, normed=True, bins=bins)

# Set margins
plt.margins(0.02)

# Label axes
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('PMF')

# Show the plot
plt.show()

max(n_defaults)是这个分布中的最大值。


Poisson processes and the Poisson distribution | Python

开始泊松分布

The number \(r\) of arrivals of a Poisson process in a given time interval with average rate of \(\lambda\) arrivals per interval is Poisson distributed.

The number \(r\) of hits on a website in one hour with an average hit rate of 6 hits per hour is Poisson distributed.

所以说,规定了时间和平均速度,路程的分布? 但是Binomial没有时间限制,只给出了np,但是泊松分布给出了vt

Limit of the Binomial distribution for low probability of success and large number of trials.

That is, for rare events.

所以这是一种极少数情况。

多说一句,这些出现还都是相互独立的。

Relationship between Binomial and Poisson distributions | Python

You just heard that the Poisson distribution is a limit of the Binomial distribution for rare events. This makes sense if you think about the stories. Say we do a Bernoulli trial every minute for an hour, each with a success probability of \(p = 0.1\). We would do \(n = 60\) trials, and the number of successes is Binomially distributed, and we would expect to get about 6 successes. This is just like the Poisson story we discussed in the video, where we get on average 6 hits on a website per hour. So, the Poisson distribution with arrival rate equal to \(np\) approximates a Binomial distribution for nn Bernoulli trials with probability \(p\) of success (with \(n\) large and \(p\) small). Importantly, the Poisson distribution is often simpler to work with because it has only one parameter instead of two for the Binomial distribution.

这个例子中,\(p=0.1\)\(n=60\),这里的\(n\)就是trials的次数,因此单位时间60次试验。

所以Binormial向Poisson的转化, 首先先假定完成一次B(n,p)的时间,比如是1min, 然后假定一个小时是一个单位时间, 那么自然在一个小时是60次 (trials),那么平均速度为\(np\)。 这里假设\(n\)大,\(p\)小. 例如,流量来的可能性小\(p\),来的人\(n\)很多。

# Draw 10,000 samples out of Poisson distribution: samples_poisson
samples_poisson = np.random.poisson(lam = 10, size = 10000)

# Print the mean and standard deviation
print('Poisson:     ', np.mean(samples_poisson),
                       np.std(samples_poisson))

# Specify values of n and p to consider for Binomial: n, p
n = [20, 100, 1000] 
p = [0.5, 0.1, 0.01]


# Draw 10,000 samples for each n,p pair: samples_binomial
for i in range(3):
    samples_binomial = np.random.binomial(n = n[i], p = p[i],size = 10000)

    # Print results
    print('n =', n[i], 'Binom:', np.mean(samples_binomial),
                                 np.std(samples_binomial))
<script.py> output:
    Poisson:      10.0186 3.14481383233
    n = 20 Binom: 9.9637 2.21634435727
    n = 100 Binom: 9.9947 3.01358124331
    n = 1000 Binom: 9.9985 3.13937856112

The standard deviation of the Binomial distribution gets closer and closer to that of the Poisson distribution as the probability p gets lower and lower.

Was 2015 anomalous? | Python

这里直接假设了np

# Draw 10,000 samples out of Poisson distribution: n_nohitters
n_nohitters = np.random.poisson(lam = 251/115, size = 10000)

# Compute number of samples that are seven or greater: n_large
n_large = np.sum(n_nohitters >= 7)

# Compute probability of getting seven or more: p_large
p_large = n_large / 10000

# Print the result
print('Probability of seven or more no-hitters:', p_large)
<script.py> output:
    Probability of seven or more no-hitters: 0.0067

The result is about 0.007. This means that it is not that improbable to see a 7-or-more no-hitter season in a century. We have seen two in a century and a half, so it is not unreasonable.

大于7的概率肯定小于0.0067了,这些数据是选了一个世纪的数据来测试的。 因此,1个世纪就是一个单位时间,那么在这个单位时间内,大于7的概率肯定小于0.0067了。

Probability density functions | Python

一般来说,看了下目录,对离散型变量讨论Binormial和Poisson,但是连续型变量讨论正态分布。

The Normal PDF | Python

# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20, 1, size=100000)
samples_std3 = np.random.normal(20, 3, size=100000)
samples_std10 = np.random.normal(20, 10, size=100000)

# Make histograms
_ = plt.hist(samples_std1, bins=100, normed=True, histtype='step')
_ = plt.hist(samples_std3, bins=100, normed=True, histtype='step')
_ = plt.hist(samples_std10, bins=100, normed=True, histtype='step')

# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()

The Normal CDF | Python

# Generate CDFs
x_std1, y_std1 = ecdf(samples_std1)
x_std3, y_std3 = ecdf(samples_std3)
x_std10, y_std10 = ecdf(samples_std10)

# Plot CDFs
_ = plt.plot(x_std1, y_std1, marker = '.', linestyle = 'none')
_ = plt.plot(x_std3, y_std3, marker = '.', linestyle = 'none')
_ = plt.plot(x_std10, y_std10, marker = '.', linestyle = 'none')

# Make 2% margin
plt.margins(0.02)

# Make a legend and show the plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.show()

std越小,越陡峭。

# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)

# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu,sigma,10000)

# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)

# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()

The theoretical CDF and the ECDF of the data suggest that the winning Belmont times are, indeed, Normally distributed. This also suggests that in the last 100 years or so, there have not been major technological or training advances that have significantly affected the speed at which horses can run this race.

因为正态分布是随机变量所致。

因此一组数,或者一个分布,可求\(\mu\)\(\sigma\)来simulate正态分布图,用cdfecdf来比较,来看这组数是否随机。

这个是一个有用的idea。

# Take a million samples out of the Normal distribution: samples
samples = np.random.normal(mu,sigma,1000000)

# Compute the fraction that are faster than 144 seconds: prob
prob = np.sum(samples <= 144)/len(samples)

# Print the result
print('Probability of besting Secretariat:', prob)

Great work! We had to take a million samples because the probability of a fast time is very low and we had to be sure to sample enough. We get that there is only a 0.06% chance of a horse running the Belmont as fast as Secretariat.

这是一种说明方法。 看分布,算\(\mu\)\(\sigma\), 假设正态分布,带入\(\mu\)\(\sigma\), 算具体值的p value。

The Exponential distribution | Python

The Exponential distribution: The waiting time between arrivals of a Poisson process is Exponentially distributed. Timing of one is independent of all others.

If you have a story, you can simulate it! | Python

Now, you’ll use your sampling function to compute the waiting time to observer a no-hitter and hitting of the cycle. The mean waiting time for a no-hitter is 764 games, and the mean waiting time for hitting the cycle is 715 games.

# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, 100000)

# Make the histogram
_ = plt.hist(waiting_times, bins=100, histtype='step',
             normed=True)

# Label axes
_ = plt.xlabel('total waiting time (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Great work! Notice that the PDF is peaked, unlike the waiting time for a single Poisson process. For fun (and enlightenment), I encourage you to also plot the CDF.

重新理解了

  • Binormial
  • Poisson
  • Exponential