Data Visualization with Python: Study Notes

Introduction to Data Visualization with Python

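For visualization I have always felt that the R package ggplot2 and its extension packages are enough, so if this part feels like a heavy workload, you can give it a lower priority and come back to it later.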

Bryan Van de Ven | DataCamp. The instructor explains every argument in great detail.

Move on!!!

Plotting multiple graphs | Python

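plt.axes([x_lo, y_lo, width, height]) is different: it lets a single figure hold several plots, similar to the arrange(p1,p2) function used with ggplot2. All four values are fractions of the figure size, between 0 and 1.

x_lo: \(\min(x)\), the left edge of the axes
y_lo: \(\min(y)\), the bottom edge of the axes
width: \(\Delta x\)
height: \(\Delta y\)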

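plt.subplot(nrows, ncols, nsubplot) takes:

nrows: the number of rows in the whole figure
ncols: the number of columns in the whole figure
nsubplot: which subplot to activate (counted from 1, row by row)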
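To make the difference concrete, here is a minimal sketch that builds the same two-panel figure both ways. The data and the variable names (`fig_manual`, `fig_grid`) are made up for illustration, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

# Manual layout: each axes is [x_lo, y_lo, width, height] in figure fractions (0 to 1)
fig_manual = plt.figure()
left_ax = plt.axes([0.05, 0.05, 0.425, 0.9])
left_ax.plot(x, [v ** 2 for v in x])
right_ax = plt.axes([0.525, 0.05, 0.425, 0.9])
right_ax.plot(x, [v ** 3 for v in x])

# Automatic layout: 1 row, 2 columns; the third argument selects the panel (1-based)
fig_grid = plt.figure()
plt.subplot(1, 2, 1)
plt.plot(x, [v ** 2 for v in x])
plt.subplot(1, 2, 2)
plt.plot(x, [v ** 3 for v in x])
plt.tight_layout()
```

With plt.axes() the coordinates have to be worked out by hand; with plt.subplot() only the grid shape and the panel index are needed.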

Using subplot() (1) | Python

The command plt.axes() requires a lot of effort to use well because the coordinates of the axes need to be set manually. A better alternative is to use plt.subplot() to determine the layout automatically.

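plt.tight_layout() adjusts the padding¹ between the edge of the whole figure and the edges of the subplots. The default is pad=1.08, i.e. 1.08 times the font size.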

In [1]: help(plt.tight_layout)
Help on function tight_layout in module matplotlib.pyplot:

tight_layout(pad=1.08, h_pad=None, w_pad=None, rect=None)
    Automatically adjust subplot parameters to give specified padding.
    
    Parameters:
    
    pad : float
        padding between the figure edge and the edges of subplots, as a fraction of the font-size.
    h_pad, w_pad : float
        padding (height/width) between edges of adjacent subplots.
        Defaults to `pad_inches`.
    rect : if rect is given, it is interpreted as a rectangle
        (left, bottom, right, top) in the normalized figure
        coordinate that the whole subplots area (including
        labels) will fit into. Default is (0, 0, 1, 1).

Customizing axes | Python

xlim and ylim set the minimum and maximum of each axis. plt.savefig('') is much like ggsave in R.

Using axis() | Python

plt.axis([x_min,x_max,y_min,y_max]) = plt.xlim(x_min,x_max) + plt.ylim(y_min,y_max)

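From the docstring of axis('equal'): changes limits of x or y axis so that equal increments of x and y have the same length; a circle is circular.

axis('equal') gives \(\Delta x\) and \(\Delta y\) the same scale, so something that used to be drawn as an ellipse is now drawn as a circle.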
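A small sketch of how plt.axis() relates to plt.xlim() and plt.ylim(). The plotted data are arbitrary, and the `Agg` backend is an assumption made so the code runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import matplotlib.pyplot as plt

plt.figure()
plt.plot([0, 1, 2], [0, 10, 20])

# One call sets all four limits at once
plt.axis([0, 2, 0, 20])
limits_from_axis = (plt.xlim(), plt.ylim())

# Setting x and y separately gives the same result
plt.xlim(0, 2)
plt.ylim(0, 20)
limits_from_lim = (plt.xlim(), plt.ylim())

# 'equal' makes one unit along x as long on screen as one unit along y
plt.axis('equal')
```

Calling plt.xlim() or plt.ylim() without arguments returns the current limits, which is how the two approaches are compared here.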

In [2]: help(plt.axis)
Help on function axis in module matplotlib.pyplot:

axis(*v, **kwargs)
    Convenience method to get or set axis properties.
    
    Calling with no arguments::
    
      >>> axis()
    
    returns the current axes limits ``[xmin, xmax, ymin, ymax]``.::
    
      >>> axis(v)
    
    sets the min and max of the x and y axes, with
    ``v = [xmin, xmax, ymin, ymax]``.::
    
      >>> axis('off')
    
    turns off the axis lines and labels.::
    
      >>> axis('equal')
    
    changes limits of *x* or *y* axis so that equal increments of *x*
    and *y* have the same length; a circle is circular.::
    
      >>> axis('scaled')
    
    achieves the same result by changing the dimensions of the plot box instead
    of the axis data limits.::
    
      >>> axis('tight')
    
    changes *x* and *y* axis limits such that all data is shown. If
    all data is already shown, it will move it to the center of the
    figure without modifying (*xmax* - *xmin*) or (*ymax* -
    *ymin*). Note this is slightly different than in MATLAB.::
    
      >>> axis('image')
    
    is 'scaled' with the axis limits equal to the data limits.::
    
      >>> axis('auto')
    
    and::
    
      >>> axis('normal')
    
    are deprecated. They restore default behavior; axis limits are automatically
    scaled to make the data fit comfortably within the plot box.
    
    if ``len(*v)==0``, you can pass in *xmin*, *xmax*, *ymin*, *ymax*
    as kwargs selectively to alter just those limits without changing
    the others.
    
      >>> axis('square')
    
    changes the limit ranges (*xmax*-*xmin*) and (*ymax*-*ymin*) of
    the *x* and *y* axes to be the same, and have the same scaling,
    resulting in a square plot.
    
    The xmin, xmax, ymin, ymax tuple is returned
    
    .. seealso::
    
        :func:`xlim`, :func:`ylim`
           For setting the x- and y-limits individually.

Using legend() | Python

When plt.plot is given label=, the later plt.legend() call uses that text to label the line in the figure.

# Specify the label 'Computer Science'
plt.plot(year, computer_science, color='red', label='Computer Science') 

# Specify the label 'Physical Sciences' 
plt.plot(year, physical_sciences, color='blue', label='Physical Sciences')

# Add a legend at the lower center
plt.legend(loc='lower center')

# Add axis labels and title
plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.title('Undergraduate enrollment of women')
plt.show()

Using annotate() | Python

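Suppose

\[y=f(t)\]

\(\max f(t)\) is the largest output among all values of the function f(t). \(\arg\max_t f(t)\) is the \(t\) at which f(t) attains \(\max f(t)\). Source: max和argmax的区别_考研数学笔记_新浪博客

the arguments of the maxima (abbreviated arg max or argmax).

.argmax() comes from the numpy package, and it is used a lot in places such as gradient descent.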
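The distinction between max and argmax can be sketched with two short arrays. The numbers below are hypothetical and are not the course's enrollment data.

```python
import numpy as np

# Hypothetical data: one enrollment percentage per year
year = np.array([1970, 1980, 1990, 2000, 2010])
enrollment = np.array([13.6, 32.5, 29.4, 28.1, 17.6])

peak_value = enrollment.max()      # max f(t): the largest output
peak_index = enrollment.argmax()   # position of that output inside the array
peak_year = year[peak_index]       # argmax f(t): the t that produces the maximum

print(peak_value, peak_index, peak_year)
```

Note that .argmax() returns a position, so it has to be used as an index into `year` to recover the actual \(t\).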

In [5]: import numpy as np

In [6]: help(np.argmax)
Help on function argmax in module numpy.core.fromnumeric:

argmax(a, axis=None, out=None)
    Returns the indices of the maximum values along an axis.
    
    Parameters
    ----------
    a : array_like
        Input array.
    axis : int, optional
        By default, the index is into the flattened array, otherwise
        along the specified axis.
    out : array, optional
        If provided, the result will be inserted into this array. It should
        be of the appropriate shape and dtype.
    
    Returns
    -------
    index_array : ndarray of ints
        Array of indices into the array. It has the same shape as `a.shape`
        with the dimension along `axis` removed.
    
    See Also
    --------
    ndarray.argmax, argmin
    amax : The maximum value along a given axis.
    unravel_index : Convert a flat index into an index tuple.
    
    Notes
    -----
    In case of multiple occurrences of the maximum values, the indices
    corresponding to the first occurrence are returned.
    
    Examples
    --------
    >>> a = np.arange(6).reshape(2,3)
    >>> a
    array([[0, 1, 2],
           [3, 4, 5]])
    >>> np.argmax(a)
    5
    >>> np.argmax(a, axis=0)
    array([1, 1, 1])
    >>> np.argmax(a, axis=1)
    array([2, 2])
    
    >>> b = np.arange(6)
    >>> b[1] = 5
    >>> b
    array([0, 5, 2, 3, 4, 5])
    >>> np.argmax(b) # Only the first occurrence is returned.
    1

# Plot with legend as before
plt.plot(year, computer_science, color='red', label='Computer Science') 
plt.plot(year, physical_sciences, color='blue', label='Physical Sciences')
plt.legend(loc='lower right')

# Compute the maximum enrollment of women in Computer Science: cs_max
cs_max = computer_science.max()

# Calculate the year in which there was maximum enrollment of women in Computer Science: yr_max
yr_max = year[computer_science.argmax()]

# Add a black arrow annotation
plt.annotate('Maximum', 
             xy = (yr_max, cs_max), 
             xytext = (yr_max+5, cs_max+5),
             arrowprops=dict(facecolor='black'))


# Add axis labels and title
plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.title('Undergraduate enrollment of women')
plt.show()

The single letter shortcut for 'black' is 'k'.

Modifying styles | Python

plt.style.use('ggplot') gives a look that is very close to ggplot2 in R.

Working with 2D arrays | Python

NumPy² is mainly used for matrix computation.

The course explains slicing very clearly.

Slicing:

1D arrays: A[slice],
2D arrays: A[slice0, slice1]

Slicing:

slice = start:stop:stride

Indexes from start to stop-1 in steps of stride
Missing start: implicitly at beginning of array
Missing stop: implicitly at end of array
Missing stride: implicitly stride 1

Negative indexes/slices: count from end of array
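The slicing rules above can be checked on two small arrays. This is only an illustrative sketch; the array contents are arbitrary.

```python
import numpy as np

a = np.arange(10)            # [0 1 2 3 4 5 6 7 8 9]
strided = a[2:7:2]           # start=2, stop=7 (exclusive), stride=2
no_start = a[:3]             # missing start: begin at index 0
no_stop = a[7:]              # missing stop: run to the end
tail = a[-3:]                # negative index: count from the end

m = np.arange(12).reshape(3, 4)
block = m[0:2, 1:3]          # rows 0..1, columns 1..2
every_other_col = m[:, ::2]  # all rows, every second column
```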

Generating meshes | Python

In [6]: help(np.linspace)
Help on function linspace in module numpy.core.function_base:

linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
    Return evenly spaced numbers over a specified interval.
    
    Returns `num` evenly spaced samples, calculated over the
    interval [`start`, `stop`].
    
    The endpoint of the interval can optionally be excluded.
    
    Parameters
    ----------
    start : scalar
        The starting value of the sequence.
    stop : scalar
        The end value of the sequence, unless `endpoint` is set to False.
        In that case, the sequence consists of all but the last of ``num + 1``
        evenly spaced samples, so that `stop` is excluded.  Note that the step
        size changes when `endpoint` is False.
    num : int, optional
        Number of samples to generate. Default is 50. Must be non-negative.
    endpoint : bool, optional
        If True, `stop` is the last sample. Otherwise, it is not included.
        Default is True.
    retstep : bool, optional
        If True, return (`samples`, `step`), where `step` is the spacing
        between samples.
    dtype : dtype, optional
        The type of the output array.  If `dtype` is not given, infer the data
        type from the other input arguments.
    
        .. versionadded:: 1.9.0
    
    Returns
    -------
    samples : ndarray
        There are `num` equally spaced samples in the closed interval
        ``[start, stop]`` or the half-open interval ``[start, stop)``
        (depending on whether `endpoint` is True or False).
    step : float, optional
        Only returned if `retstep` is True
    
        Size of spacing between samples.
    
    
    See Also
    --------
    arange : Similar to `linspace`, but uses a step size (instead of the
             number of samples).
    logspace : Samples uniformly distributed in log space.
    
    Examples
    --------
    >>> np.linspace(2.0, 3.0, num=5)
    array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ])
    >>> np.linspace(2.0, 3.0, num=5, endpoint=False)
    array([ 2. ,  2.2,  2.4,  2.6,  2.8])
    >>> np.linspace(2.0, 3.0, num=5, retstep=True)
    (array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ]), 0.25)
    
    Graphical illustration:
    
    >>> import matplotlib.pyplot as plt
    >>> N = 8
    >>> y = np.zeros(N)
    >>> x1 = np.linspace(0, 10, N, endpoint=True)
    >>> x2 = np.linspace(0, 10, N, endpoint=False)
    >>> plt.plot(x1, y, 'o')
    [<matplotlib.lines.Line2D object at 0x...>]
    >>> plt.plot(x2, y + 0.5, 'o')
    [<matplotlib.lines.Line2D object at 0x...>]
    >>> plt.ylim([-0.5, 1])
    (-0.5, 1)
    >>> plt.show()

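np.linspace(start of the sequence, end of the sequence, number of samples)

meshgrid is the MATLAB function for generating grid sample points. It is widely used when drawing 3-D graphics in MATLAB.

NumPy's version was presumably borrowed from there.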
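A quick sketch of the three arguments, plus the endpoint and retstep options described in the help text. The values are chosen only for illustration.

```python
import numpy as np

closed = np.linspace(0.0, 1.0, 5)                      # includes the stop value
half_open = np.linspace(0.0, 1.0, 5, endpoint=False)   # excludes the stop value
samples, step = np.linspace(-2, 2, 41, retstep=True)   # also return the spacing
```

The third call uses the same start, stop and sample count as the `u` array built later in these notes.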

In [8]: help(np.meshgrid)
Help on function meshgrid in module numpy.lib.function_base:

meshgrid(*xi, **kwargs)
    Return coordinate matrices from coordinate vectors.
    
    Make N-D coordinate arrays for vectorized evaluations of
    N-D scalar/vector fields over N-D grids, given
    one-dimensional coordinate arrays x1, x2,..., xn.
    
    .. versionchanged:: 1.9
       1-D and 0-D cases are allowed.
    
    Parameters
    ----------
    x1, x2,..., xn : array_like
        1-D arrays representing the coordinates of a grid.
    indexing : {'xy', 'ij'}, optional
        Cartesian ('xy', default) or matrix ('ij') indexing of output.
        See Notes for more details.
    
        .. versionadded:: 1.7.0
    sparse : bool, optional
        If True a sparse grid is returned in order to conserve memory.
        Default is False.
    
        .. versionadded:: 1.7.0
    copy : bool, optional
        If False, a view into the original arrays are returned in order to
        conserve memory.  Default is True.  Please note that
        ``sparse=False, copy=False`` will likely return non-contiguous
        arrays.  Furthermore, more than one element of a broadcast array
        may refer to a single memory location.  If you need to write to the
        arrays, make copies first.
    
        .. versionadded:: 1.7.0
    
    Returns
    -------
    X1, X2,..., XN : ndarray
        For vectors `x1`, `x2`,..., 'xn' with lengths ``Ni=len(xi)`` ,
        return ``(N1, N2, N3,...Nn)`` shaped arrays if indexing='ij'
        or ``(N2, N1, N3,...Nn)`` shaped arrays if indexing='xy'
        with the elements of `xi` repeated to fill the matrix along
        the first dimension for `x1`, the second for `x2` and so on.
    
    Notes
    -----
    This function supports both indexing conventions through the indexing
    keyword argument.  Giving the string 'ij' returns a meshgrid with
    matrix indexing, while 'xy' returns a meshgrid with Cartesian indexing.
    In the 2-D case with inputs of length M and N, the outputs are of shape
    (N, M) for 'xy' indexing and (M, N) for 'ij' indexing.  In the 3-D case
    with inputs of length M, N and P, outputs are of shape (N, M, P) for
    'xy' indexing and (M, N, P) for 'ij' indexing.  The difference is
    illustrated by the following code snippet::
    
        xv, yv = np.meshgrid(x, y, sparse=False, indexing='ij')
        for i in range(nx):
            for j in range(ny):
                # treat xv[i,j], yv[i,j]
    
        xv, yv = np.meshgrid(x, y, sparse=False, indexing='xy')
        for i in range(nx):
            for j in range(ny):
                # treat xv[j,i], yv[j,i]
    
    In the 1-D and 0-D case, the indexing and sparse keywords have no effect.
    
    See Also
    --------
    index_tricks.mgrid : Construct a multi-dimensional "meshgrid"
                     using indexing notation.
    index_tricks.ogrid : Construct an open multi-dimensional "meshgrid"
                     using indexing notation.
    
    Examples
    --------
    >>> nx, ny = (3, 2)
    >>> x = np.linspace(0, 1, nx)
    >>> y = np.linspace(0, 1, ny)
    >>> xv, yv = np.meshgrid(x, y)
    >>> xv
    array([[ 0. ,  0.5,  1. ],
           [ 0. ,  0.5,  1. ]])
    >>> yv
    array([[ 0.,  0.,  0.],
           [ 1.,  1.,  1.]])
    >>> xv, yv = np.meshgrid(x, y, sparse=True)  # make sparse output arrays
    >>> xv
    array([[ 0. ,  0.5,  1. ]])
    >>> yv
    array([[ 0.],
           [ 1.]])
    
    `meshgrid` is very useful to evaluate functions on a grid.
    
    >>> x = np.arange(-5, 5, 0.1)
    >>> y = np.arange(-5, 5, 0.1)
    >>> xx, yy = np.meshgrid(x, y, sparse=True)
    >>> z = np.sin(xx**2 + yy**2) / (xx**2 + yy**2)
    >>> h = plt.contourf(x,y,z)

Source: 对MATLAB中meshgrid的理解 - CSDN博客

Suppose we want to draw a 3-D figure over the region 3<=x<=5, 6<=y<=9, with z unrestricted. The grid points are:

(3,9),(4,9),(5,9);
(3,8),(4,8),(5,8);
(3,7),(4,7),(5,7);
(3,6),(4,6),(5,6);

Pulling out the x coordinate of every point gives:

3,4,5;
3,4,5;
3,4,5;
3,4,5;

Pulling out the y coordinate of every point as well:

9,9,9;
8,8,8;
7,7,7;
6,6,6;

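So x corresponds to the row vector \([3,4,5]\) and y corresponds to the column vector \(\begin{bmatrix}9 \\ 8 \\ 7 \\ 6\end{bmatrix}\). These are exactly the kind of 1-D inputs that np.linspace³ produces.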
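The grid from this example can be reproduced directly with np.meshgrid. This is a minimal sketch; the y values are listed from 9 down to 6 only to match the layout written out above.

```python
import numpy as np

x = np.array([3, 4, 5])
y = np.array([9, 8, 7, 6])

# X repeats x along every row; Y repeats y down every column
X, Y = np.meshgrid(x, y)
print(X)
print(Y)
```

Both outputs have shape (len(y), len(x)), so each position (i, j) pairs one x coordinate with one y coordinate.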

# Generate two 1-D arrays: u, v
u = np.linspace(-2, +2, 41)
v = np.linspace(-1,+1,21)

# Generate 2-D arrays from u and v: X, Y
X,Y = np.meshgrid(u,v)

# Compute Z based on X and Y
Z = np.sin(3*np.sqrt(X**2 + Y**2))

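Correspondingly, u and v play the roles of the row vector and the column vector, and X and Y are matrices. Every row of X is identical (equal to u) and the number of rows is the length of v; every column of Y is identical (equal to v) and the number of columns is the length of u.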

In [16]: X.shape
Out[16]: (21, 41)

In [17]: Y.shape
Out[17]: (21, 41)

Note that X**2 is an element-wise square, \((X^{2})_{ij} = X_{ij}^{2}\), not the matrix product \(X X^T\), so it has the same shape as X.

Therefore X**2 + Y**2 is also (21, 41).

Therefore np.sqrt(X**2 + Y**2).shape is also (21, 41).
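The shapes can be verified with a short sketch that contrasts the element-wise power with a true matrix product. The names `squared` and `product` are introduced here purely for illustration.

```python
import numpy as np

u = np.linspace(-2, 2, 41)
v = np.linspace(-1, 1, 21)
X, Y = np.meshgrid(u, v)

squared = X ** 2      # element-wise: same shape as X
product = X @ X.T     # matrix product: (21, 41) times (41, 21)

Z = np.sin(3 * np.sqrt(X ** 2 + Y ** 2))
print(squared.shape, product.shape, Z.shape)
```

Only the element-wise version keeps the (21, 41) grid shape that Z needs.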

The sine is one of the fundamental functions of trigonometry (the mathematical study of triangles).

np.sin is the sine function.

In [25]: np.sin(np.pi/2)
Out[25]: 1.0

In [26]: np.sin(np.array((0., 30., 45., 60., 90.)) * np.pi / 180. )
Out[26]: array([ 0.        ,  0.5       ,  0.70710678,  0.8660254 ,  1.        ])

# Import numpy and matplotlib.pyplot
import numpy as np
import matplotlib.pyplot as plt

# Generate two 1-D arrays: u, v
u = np.linspace(-2, +2, 41)
v = np.linspace(-1,+1,21)

# Generate 2-D arrays from u and v: X, Y
X,Y = np.meshgrid(u,v)

# Compute Z based on X and Y
Z = np.sin(3*np.sqrt(X**2 + Y**2)) 

# Display the resulting image with pcolor()
plt.pcolor(Z)

# Save the figure to 'sine_mesh.png' before plt.show(), which may leave an empty figure behind
plt.savefig('sine_mesh.png')
plt.show()

The resulting figure is quite impressive; try it yourself.

Array orientation | Python

The matrix we create is

In [6]: np.array([[1, 0, -1], [2, 0, 1], [1, 1, 1]])
Out[6]: 
array([[ 1,  0, -1],
       [ 2,  0,  1],
       [ 1,  1,  1]])

But the matrix as it appears in the figure is

\[\begin{bmatrix} 1 & 1 & 1 \\ 2 & 0 & 1 \\ 1 & 0 & -1 \\ \end{bmatrix}\]

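Clearly the rows are drawn in reverse order: the first row vector \([ 1, 0, -1]\) is placed along the x axis, and the later rows are stacked on top of it.

This is consistent with how we normally build a plot.

For example, take a matrix

\[\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix}\]

When we plot, we start from the origin \((0,0)\) of the \(xOy\) plane, so the first row ends up at the bottom. That is exactly what the heat map shows.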
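The reversal can be mimicked with np.flipud, which flips an array top to bottom. This is only a sketch of how the picture relates to the printed array, not a description of what pcolor does internally.

```python
import numpy as np

A = np.array([[1, 0, -1],
              [2, 0, 1],
              [1, 1, 1]])

# Row 0 of A sits at the bottom of the plot, so the picture reads like A flipped vertically
as_drawn = np.flipud(A)
print(as_drawn)
```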

Visualizing bivariate functions | Python

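Pseudo-color (false color?): false color means depicting an object in an image with colors that differ from its true colors.

This makes it possible to show a third dimension on a flat plot.

plt.contour() draws contour lines, like the elevation lines on a map, and is a variation on plt.pcolor(). In plt.contour(Z, 30), Z holds the third-dimension value at every grid point and 30 is the number of contour levels.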

Contour & filled contour plots | Python

# Generate a default contour map of the array Z
plt.subplot(2,2,1)
plt.contour(Z)

# Generate a contour map with 20 contours
plt.subplot(2,2,2)
plt.contour(Z,20)

# Generate a default filled contour map of the array Z
plt.subplot(2,2,3)
plt.contourf(Z)

# Generate a default filled contour map with 20 contours
plt.subplot(2,2,4)
plt.contourf(Z,20)

# Improve the spacing between subplots
plt.tight_layout()

# Display the figure
plt.show()

plt.contourf is simply plt.contour with the regions filled in.

Modifying colormaps | Python

# Create a filled contour plot with a color map of 'viridis'
plt.subplot(2,2,1)
plt.contourf(X,Y,Z,20, cmap='viridis')
plt.colorbar()
plt.title('Viridis')

# Create a filled contour plot with a color map of 'gray'
plt.subplot(2,2,2)
plt.contourf(X,Y,Z,20, cmap='gray')
plt.colorbar()
plt.title('Gray')

# Create a filled contour plot with a color map of 'autumn'
plt.subplot(2,2,3)
plt.contourf(X,Y,Z,20, cmap='autumn')
plt.colorbar()
plt.title('Autumn')

# Create a filled contour plot with a color map of 'winter'
plt.subplot(2,2,4)
plt.contourf(X,Y,Z,20, cmap='winter')
plt.colorbar()
plt.title('Winter')

# Improve the spacing between subplots and display them
plt.tight_layout()
plt.show()

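Notes about plotting contain a lot of code because the code has to describe the picture. That makes the notes long, but the actual content is not much. None of the four plots looks particularly good to me; they are just rather "science" in style.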

Visualizing bivariate distributions | Python

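These are again plots of three dimensions.

Using hist2d() | Python

The histogram here is two-dimensional and describes three dimensions, \((x,y,count(x,y))\). The main arguments are plt.hist2d(x,y), bins=(nx, ny), and range=((xmin, xmax), (ymin, ymax)). Their meanings are easy to understand.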
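To see the \(count(x,y)\) values themselves, the sketch below uses np.histogram2d, which accepts the same bins and range arguments. The five sample points are made up for illustration.

```python
import numpy as np

# Hypothetical sample points
x = np.array([0.5, 0.5, 1.5, 1.5, 1.5])
y = np.array([0.5, 1.5, 0.5, 0.5, 1.5])

# 2 bins along x and 2 along y, each covering the interval [0, 2]
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=(2, 2), range=((0, 2), (0, 2)))

print(counts)   # counts[i, j] is the number of points in x-bin i and y-bin j
```

plt.hist2d draws these counts as colors, which is why a colorbar is usually added next to it.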

# Generate a 2-D histogram
plt.hist2d(
    hp, mpg,
    bins = (20,20),
    range=((40,235),(8,48))
    )

# Add a color bar to the histogram
plt.colorbar()

# Add labels, title, and display the plot
plt.xlabel('Horse power [hp]')
plt.ylabel('Miles per gallon [mpg]')
plt.title('hist2d() plot')
plt.show()

Using hexbin() | Python

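gridsize defaults to 100 and is the number of hexagons along the x axis; a tuple (nx, ny) sets the number along x and y separately. extent=(xmin, xmax, ymin, ymax) needs no explanation.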

# Generate a 2d histogram with hexagonal bins
plt.hexbin(
    hp,mpg,
    gridsize = (15,12),
    extent = (40,235,8,48)
    )

           
# Add a color bar to the histogram
plt.colorbar()

# Add labels, title, and display the plot
plt.xlabel('Horse power [hp]')
plt.ylabel('Miles per gallon [mpg]')
plt.title('hexbin() plot')
plt.show()

Working with images | Python

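We have now reached the point of modifying images. It is fun, but of little use at work, so thumbs down.

The following exercises can be skipped, because modifying images has little to do with presenting data.

Of course, if you find it fun, feel free to keep reading.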

Loading, examining images | Python

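Red, green and blue are the three additive primary colors (the primaries of light, as used on screens). Any point of an image can therefore be described by \([red\%,green\%,blue\%]\). Once the image is converted to an np.array, it is an \(M \times N\) grid in which every entry is a \([red\%,green\%,blue\%]\) triple, i.e. an array of shape (M, N, 3).

plt.axis('off') hides the x and y axes. plt.imread('480px-Astronaut-EVA.jpg') converts the .jpg into an array. plt.imshow(img) turns the array back into a picture.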

# Load the image into an array: img
img = plt.imread('480px-Astronaut-EVA.jpg')

# Print the shape of the image
print(img.shape)

# Display the image
plt.imshow(img)

# Hide the axes
plt.axis('off')
plt.show()

Pseudocolor plot from image data | Python

Here .sum(axis=2) adds up the values within each pixel, \([red\%,green\%,blue\%]\), which turns the (M, N, 3) array into an (M, N) array.
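A tiny hand-made image shows what summing over axis 2 does. The 2 x 2 array below is hypothetical and stands in for the astronaut photo.

```python
import numpy as np

# Hypothetical 2 x 2 RGB image, shape (2, 2, 3)
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [10, 20, 30]]], dtype=np.uint8)

# Add the three channel values of every pixel; the channel axis disappears
intensity = img.sum(axis=2)

print(img.shape, intensity.shape)
print(intensity)
```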

# Load the image into an array: img
img = plt.imread('480px-Astronaut-EVA.jpg')

# Print the shape of the image
print(img.shape)

# Compute the sum of the red, green and blue channels: intensity
intensity = img.sum(axis=2)

# Print the shape of the intensity
print(intensity.shape)

# Display the intensity with a colormap of 'gray'
plt.imshow(intensity, cmap='gray')

# Add a colorbar
plt.colorbar()

# Hide the axes and show the figure
plt.axis('off')
plt.show()

Extent and aspect | Python

\(aspect = \frac{height}{width}\)

aspect is the aspect ratio, i.e. vertical over horizontal.

extent = (xmin,xmax,ymin,ymax) needs no explanation.

# Load the image into an array: img
img = plt.imread('480px-Astronaut-EVA.jpg')

# Specify the extent and aspect ratio of the top left subplot
plt.subplot(2,2,1)
plt.title('extent=(-1,1,-1,1),\naspect=0.5') 
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.imshow(img, extent =(-1,1,-1,1), aspect=0.5)

# Specify the extent and aspect ratio of the top right subplot
plt.subplot(2,2,2)
plt.title('extent=(-1,1,-1,1),\naspect=1')
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.imshow(img, extent =(-1,1,-1,1), aspect=1)

# Specify the extent and aspect ratio of the bottom left subplot
plt.subplot(2,2,3)
plt.title('extent=(-1,1,-1,1),\naspect=2')
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.imshow(img, extent =(-1,1,-1,1), aspect=2)

# Specify the extent and aspect ratio of the bottom right subplot
plt.subplot(2,2,4)
plt.title('extent=(-2,2,-1,1),\naspect=2')
plt.xticks([-2,-1,0,1,2])
plt.yticks([-1,0,1])
plt.imshow(img, extent =(-2,2,-1,1), aspect=2)

# Improve spacing and display the figure
plt.tight_layout()
plt.show()

Rescaling pixel intensities | Python

\(intensities = [red\%,green\%,blue\%]\) was already defined earlier.

# Load the image into an array: image
image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Extract minimum and maximum values from the image: pmin, pmax
pmin, pmax = image.min(), image.max()
print("The smallest & largest pixel intensities are %d & %d." % (pmin, pmax))

# Rescale the pixels: rescaled_image
rescaled_image = 256*(image - pmin) / (pmax - pmin)
print("The rescaled smallest & largest pixel intensities are %.1f & %.1f." % 
      (rescaled_image.min(), rescaled_image.max()))

# Display the original image in the top subplot
plt.subplot(2,1,1)
plt.title('original image')
plt.axis('off')
plt.imshow(image)

# Display the rescaled image in the bottom subplot
plt.subplot(2,1,2)
plt.title('rescaled image')
plt.axis('off')
plt.imshow(rescaled_image)

plt.show()

The image has been min-max rescaled:

\[\tilde x = 256 \cdot \frac{x-\min(x)}{\max(x)-\min(x)}\]

It feels like a flashy trick, but I do not really see what the benefit is?
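One way to see what the rescaling buys: if the original intensities occupy only part of the available range, the formula stretches them across all of it, which increases contrast. A minimal sketch with a hypothetical 2 x 2 image:

```python
import numpy as np

# Hypothetical low-contrast image: intensities only span 50..200
image = np.array([[50.0, 100.0],
                  [150.0, 200.0]])

pmin, pmax = image.min(), image.max()

# Same formula as in the exercise: map [pmin, pmax] onto [0, 256]
rescaled = 256 * (image - pmin) / (pmax - pmin)

print(rescaled.min(), rescaled.max())
```

The darkest pixel becomes 0 and the brightest becomes 256, so differences between pixels are spread over a wider range.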

Visualizing regressions | Python

hue='sex' is for factor (categorical) variables. palette='Set1'⁴ controls the colors. col='sex' creates subplots. sns.residplot() plots residuals.

Higher-order regressions | Python

order=1 is the default, so it does not need to be given.

In [4]: help(sns.regplot)
Help on function regplot in module seaborn.linearmodels:

regplot(x, y, data=None, x_estimator=None, x_bins=None, x_ci='ci', scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, order=1, logistic=False, lowess=False, robust=False, logx=False, x_partial=None, y_partial=None, truncate=False, dropna=True, x_jitter=None, y_jitter=None, label=None, color=None, marker='o', scatter_kws=None, line_kws=None, ax=None)

In plt.legend(loc = 'upper right'), do not forget the loc = keyword.

loc : int or string or pair of floats, default: 'upper right'

Also, 'upper right' is the default according to this docstring.

Grouping linear regressions by row or column | Python

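Rather than overlaying linear regressions of grouped data in the same plot, we may want to use a grid of subplots.

Use hue to overlay regressions; use row and col to build a grid and decide how its panels are arranged. This is much like R, where ggplot2 uses a and b in facet_grid(a ~ b) for the same purpose.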

Visualizing univariate distributions | Python

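jitter spreads the points of a strip plot sideways so that it looks more like a swarm plot; in effect it adds a horizontal offset (like hjust) to each point. The ggplot2 counterpart is geom_jitter().

To change the orientation use orient = 'h' or orient = 'v', which is equivalent to coord_flip⁵ in R's ggplot2.

A violin plot clearly feels better than a box plot, but a violin plot does not show outliers. This can be solved by combining a violin plot with a jittered strip plot: .violinplot(inner = None) and .stripplot(jitter = True).

Constructing violin plots | Python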

# Generate a violin plot of 'hp' grouped horizontally by 'cyl'
plt.subplot(2,1,1)
sns.violinplot(y='hp', x='cyl', data=auto)

# Generate the same violin plot again with a color of 'lightgray' and without inner annotations
plt.subplot(2,1,2)
sns.violinplot(y='hp', x='cyl', data=auto, inner=None, color = 'lightgray')

# Overlay a strip plot on the violin plot
sns.stripplot(y='hp', x='cyl', data=auto, size = 1.5, jitter = True)

# Display the plot
plt.show()

Visualizing multivariate distributions | Python

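Joint plot: I rarely use it in R's ggplot2 either. It shows the distributions of two continuous variables, their scatter plot, the correlation coefficient, and the significance level of that coefficient against 0. kind = 'kde' makes the plot smooth.

A pair plot⁶ is an \(N \times N\) grid of plots: the diagonal holds a histogram of each variable's distribution, and all the other cells are scatter plots. hue can be added to compare groups. sns.pairplot(auto) is an example. kind = 'scatter' is the default; kind = 'reg' can be specified instead.

A heat map is mainly used to look at correlations.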

Plotting joint distributions (2) | Python

In sns.jointplot(), kind has several important options:

kind='scatter' uses a scatter plot of the data points
kind='reg' uses a regression plot (default order 1)
kind='resid' uses a residual plot
kind='kde' uses a kernel density estimate of the joint distribution
kind='hex' uses a hexbin plot of the joint distribution

Visualizing correlations with a heatmap | Python

# Print the covariance matrix
print(cov_matrix)

# Visualize the covariance matrix using a heatmap
sns.heatmap(cov_matrix)

# Display the heatmap
plt.show()

.heatmap() takes a matrix (a 2-D array of numbers) as its input, not a raw data table.
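The notes do not show how `cov_matrix` was built, so the following is only a sketch under the assumption that it comes from a table with one column per variable. np.cov and np.corrcoef turn such a table into the square matrix that a heat map expects; the data are invented.

```python
import numpy as np

# Hypothetical table: 5 observations (rows) of 3 variables (columns)
data = np.array([[1.0, 2.0, 0.5],
                 [2.0, 4.1, 0.4],
                 [3.0, 6.2, 0.9],
                 [4.0, 7.9, 0.1],
                 [5.0, 10.0, 0.6]])

# rowvar=False tells NumPy that each column is a variable
cov_matrix = np.cov(data, rowvar=False)
corr_matrix = np.corrcoef(data, rowvar=False)

print(cov_matrix.shape)
```

Either matrix is 3 x 3 and symmetric, which is the shape of input that sns.heatmap(cov_matrix) visualizes.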

Visualizing time series | Python

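plt.xticks(rotation = 60) is similar to theme(axis.text.x = element_text(angle = 70, hjust = 1)) in R.

.index[::96]: why does this mean every four days? Presumably because the data are hourly, so 96 samples span 96 / 24 = 4 days.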
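Assuming the time series is sampled once per hour (an assumption; the notes do not state the sampling rate), a stride of 96 picks one timestamp every four days. The sketch uses a plain list of datetimes with a made-up start date rather than the course's DataFrame index.

```python
from datetime import datetime, timedelta

# Hypothetical hourly index covering 12 days
start = datetime(2010, 1, 1)
hourly_index = [start + timedelta(hours=h) for h in range(24 * 12)]

# Every 96th entry: 96 hours / 24 hours per day = 4 days apart
every_96th = hourly_index[::96]
gap = every_96th[1] - every_96th[0]

print(every_96th)
print(gap)
```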

Multiple time series on common axes | Python

Four colors⁷ are used here.

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the aapl time series in blue
plt.plot(aapl, color='blue', label='AAPL')

# Plot the ibm time series in green
plt.plot(ibm, color='green', label='IBM')

# Plot the csco time series in red
plt.plot(csco, color='red', label='CSCO')

# Plot the msft time series in magenta
plt.plot(msft, color='magenta', label='MSFT')

# Add a legend in the top left corner of the plot
plt.legend(loc='upper left')

# Specify the orientation of the xticks
plt.xticks(rotation = 60)

# Display the plot
plt.show()

Time series with moving windows | Python

Moving windows can compute several kinds of statistics:

Averages
Medians
Standard deviations
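The notes do not show how series such as `mean_30` were computed, so the following is only a sketch of the idea in plain NumPy: slide a fixed-size window along the series and compute one statistic per window. The prices and the window length are hypothetical.

```python
import numpy as np

prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical series
window = 3

# One row per window position: [1,2,3], [2,3,4], [3,4,5], [4,5,6]
windows = np.array([prices[i:i + window]
                    for i in range(len(prices) - window + 1)])

moving_mean = windows.mean(axis=1)
moving_median = np.median(windows, axis=1)
moving_std = windows.std(axis=1)

print(moving_mean, moving_median, moving_std)
```

A longer window smooths more, which is why the 250-day curve in the plots looks flatter than the 30-day one.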

# Plot the 30-day moving average in the top left subplot in green
plt.subplot(2,2,1)
plt.plot(mean_30, color = 'green')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('30d averages')

# Plot the 75-day moving average in the top right subplot in red
plt.subplot(2,2,2)
plt.plot(mean_75, 'red')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('75d averages')

# Plot the 125-day moving average in the bottom left subplot in magenta
plt.subplot(2, 2, 3)
plt.plot(mean_125, 'magenta')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('125d averages')

# Plot the 250-day moving average in the bottom right subplot in cyan
plt.subplot(2,2,4)
plt.plot(mean_250, 'cyan')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('250d averages')

# Display the plot
plt.show()

Histogram equalization in images | Python

.flatten() is a method of numpy.ndarray; the official documentation says: ndarray.flatten(order='C') Return a copy of the array collapsed into one dimension.

However, this method only works on numpy objects, i.e. array or mat; an ordinary Python list does not have it. Source: python numpy库中flatten()函数用法 - taotiezhengfeng的博客 - CSDN博客

Why can .flatten() be used here?

An image histogram, then, is computed by counting the occurrences of distinct pixel intensities over all the pixels in the image.

Because we want a histogram, we are analysing one-dimensional data. So we take the matrix of \(intensities = [red\%,green\%,blue\%]\), combine the values, and turn the matrix into a single row vector, i.e. a one-dimensional array.
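A small sketch of the flatten-then-count idea, using a hypothetical 2 x 2 grayscale array in place of the real image:

```python
import numpy as np

# Hypothetical 2 x 2 grayscale image
gray = np.array([[0, 64],
                 [128, 255]], dtype=np.uint8)

# Collapse the 2-D grid into one long 1-D array of pixel intensities
pixels = gray.flatten()

# Count how many pixels fall into each of 4 equal-width intensity bins
counts, edges = np.histogram(pixels, bins=4, range=(0, 256))

# A plain nested list has no flatten() method
list_has_flatten = hasattr([[0, 64], [128, 255]], 'flatten')

print(pixels, counts, list_has_flatten)
```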

Extracting a histogram from a grayscale image | Python

# Load the image into an array: image
image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Display image in top subplot using color map 'gray'
plt.subplot(2,1,1)
plt.title('Original image')
plt.axis('off')
plt.imshow(image, cmap = 'gray')

# Flatten the image into 1 dimension: pixels
pixels = image.flatten()

# Display a histogram of the pixels in the bottom subplot
plt.subplot(2,1,2)
plt.xlim((0,255))
plt.title('Normalized histogram')
plt.hist(
    pixels, 
    bins=64, 
    color='red', 
    alpha=0.4, 
    range=(0,256) , 
    density=True)  # density= replaces normed=, which newer Matplotlib no longer accepts

# Display the plot
plt.show()

Cumulative Distribution Function from an image histogram | Python

The command plt.twinx() allows two plots to be overlayed sharing the x-axis but with different scales on the y-axis.

# Load the image into an array: image
image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Display image in top subplot using color map 'gray'
plt.subplot(2,1,1)
plt.imshow(image, cmap='gray')
plt.title('Original image')
plt.axis('off')

# Flatten the image into 1 dimension: pixels
pixels = image.flatten()

# Display a histogram of the pixels in the bottom subplot
plt.subplot(2,1,2)
pdf = plt.hist(pixels, bins=64, range=(0,256), density=False,
               color='red', alpha=0.4)
plt.grid(False)

# Use plt.twinx() to overlay the CDF in the bottom subplot
plt.twinx()

# Display a cumulative histogram of the pixels
cdf = plt.hist(pixels, bins=64, range=(0,256),
               density=True, cumulative=True,
               color='blue', alpha=0.4)
               
# Specify x-axis range, hide axes, add title and display plot
plt.xlim((0,256))
plt.grid(False)
plt.title('PDF & CDF (original image)')
plt.show()

Equalizing an image histogram | Python

The basic idea is to use interpolation to map the original CDF of pixel intensities to a CDF that is almost a straight line. In essence, the pixel intensities are spread out and this has the practical effect of making a sharper, contrast-enhanced image. This is particularly useful in astronomy and medical imaging to help us see more features.

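Equalization turns the cumulative distribution function into roughly a straight line, and the image ends up with sharper contrast.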
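The interpolation idea can be sketched without any plotting: build the cumulative histogram with np.histogram, then map every pixel through it with np.interp. The random low-contrast image and the names below are assumptions made for illustration; the course code builds the CDF with plt.hist instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical low-contrast image: integer intensities squeezed into 100..139
image = rng.integers(100, 140, size=(64, 64)).astype(float)
pixels = image.flatten()

# Cumulative distribution of intensities, one bin per integer level
counts, bin_edges = np.histogram(pixels, bins=256, range=(0, 256))
cdf = counts.cumsum() / counts.sum()

# Map each original intensity to 255 times its cumulative probability
new_pixels = np.interp(pixels, bin_edges[:-1], cdf * 255)
new_image = new_pixels.reshape(image.shape)

print(pixels.min(), pixels.max())
print(new_pixels.min(), new_pixels.max())
```

The original values span fewer than 40 levels, while the mapped values reach up to 255, which is the "spreading out" that the course text describes.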

# Load the image into an array: image
image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Flatten the image into 1 dimension: pixels
pixels = image.flatten()

# Generate a cumulative histogram
cdf, bins, patches = plt.hist(pixels, bins=256, range=(0,256), density=True, cumulative=True)
new_pixels = np.interp(pixels, bins[:-1], cdf*255)

# Reshape new_pixels as a 2-D array: new_image
new_image = new_pixels.reshape(image.shape)

# Display the new image with 'gray' color map
plt.subplot(2,1,1)
plt.title('Equalized image')
plt.axis('off')
plt.imshow(new_image, cmap='gray')

# Generate a histogram of the new pixels
plt.subplot(2,1,2)
pdf = plt.hist(new_pixels, bins=64, range=(0,256), density=False,
               color='red', alpha=0.4)
plt.grid(False)

# Use plt.twinx() to overlay the CDF in the bottom subplot
plt.twinx()
plt.xlim((0,256))
plt.grid(False)

# Add title
plt.title('PDF & CDF (equalized image)')

# Generate a cumulative histogram of the new pixels
cdf = plt.hist(new_pixels, bins=64, range=(0,256),
               cumulative=True, density=True,
               color='blue', alpha=0.4)
plt.show()

Extracting histograms from a color image | Python

# Load the image into an array: image
image = plt.imread('hs-2004-32-b-small_web.jpg')

# Display image in top subplot
plt.subplot(2,1,1)
plt.title('Original image')
plt.axis('off')
plt.imshow(image)

# Extract 2-D arrays of the RGB channels: red, green, blue
red, green, blue = image[:,:,0], image[:,:,1], image[:,:,2]

# Flatten the 2-D arrays of the RGB channels into 1-D
red_pixels = red.flatten()
green_pixels = green.flatten()
blue_pixels = blue.flatten()

# Overlay histograms of the pixels of each color in the bottom subplot
plt.subplot(2,1,2)
plt.title('Histograms from color image')
plt.xlim((0,256))
plt.hist(red_pixels, bins=64, density=True, color='red', alpha=0.2)
plt.hist(blue_pixels, bins=64, density=True, color='blue', alpha=0.2)
plt.hist(green_pixels, bins=64, density=True, color='green', alpha=0.2)

# Display the plot
plt.show()

There is so much material; it is getting tiresome.

Extracting bivariate histograms from a color image | Python

# Load the image into an array: image
image = plt.imread('hs-2004-32-b-small_web.jpg')

# Extract RGB channels (index 0 = red, 1 = green, 2 = blue) and flatten into 1-D arrays
red, green, blue = image[:,:,0], image[:,:,1], image[:,:,2]
red_pixels = red.flatten()
blue_pixels = blue.flatten()
green_pixels = green.flatten()

# Generate a 2-D histogram of the red and green pixels
plt.subplot(2,2,1)
plt.grid(False)
plt.xticks(rotation=60)
plt.xlabel('red')
plt.ylabel('green')
plt.hist2d(red_pixels,green_pixels,bins=(32,32))


# Generate a 2-D histogram of the green and blue pixels
plt.subplot(2,2,2)
plt.grid(False)
plt.xticks(rotation=60)
plt.xlabel('green')
plt.ylabel('blue')
plt.hist2d(green_pixels,blue_pixels,bins=(32,32))

# Generate a 2-D histogram of the blue and red pixels
plt.subplot(2,2,3)
plt.grid(False)
plt.xticks(rotation=60)
plt.xlabel('blue')
plt.ylabel('red')
plt.hist2d(blue_pixels,red_pixels,bins=(32,32))

# Display the plot
plt.show()

Statement of Accomplishment

Certificate

Happy plotting!

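¹ the padding between each item: the margin between items.
² NumPy is pronounced "Numb Pie".
³ linspace: abbreviation of linear space, a linearly spaced vector.
⁴ palette: American pronunciation /ˈpælət/, n. paint; a painter's palette.
⁵ coord_flip: flip Cartesian coordinates, i.e. swap the x and y axes.
⁶ Plotting distributions pairwise, i.e. drawing distribution plots for each pair of variables.
⁷ magenta: American pronunciation /məˈdʒɛntə/, n. magenta, a purplish red.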