Sun Haozhe's Blog


Keyboard layout

Posted on 2020-05-27 | In Misc | Words count in article 129

The keyboard layout used in mainland China is the same as in the United States, i.e., the US QWERTY keyboard.

Chinese keyboard layout QWERTY [1]


French keyboard layout AZERTY [1]


French keyboard layout AZERTY (MacBook or Apple keyboard) [2]


References

[1] 键盘布局 (Keyboard layout). (2009, June 7). Wikipedia, the free encyclopedia. https://zh.wikipedia.org/wiki/%E9%94%AE%E7%9B%98%E5%B8%83%E5%B1%80

[2] Keyshorts. (2015, July 17). How to identify MacBook keyboard layout? https://keyshorts.com/blogs/blog/37615873-how-to-identify-macbook-keyboard-localization#french


Generative models and discriminative models

Posted on 2020-05-25 | In Mathematics | Words count in article 322

Let’s say we have input data $X$ and we want to classify the data into labels $Y$.

  • A generative model (生成模型) learns the joint probability distribution $\mathbb{P}(X, Y)$
  • A discriminative model (判别模型) learns the conditional probability distribution $\mathbb{P}(Y | X)$

Discriminative models learn the (hard or soft) boundary between classes, whereas generative models model the distribution of the individual classes.

Examples of generative models:

  • HMM (hidden Markov model)
  • Naive Bayes
  • Gaussian mixture model and other types of mixture model
  • GAN (generative adversarial network)
  • VAE (variational autoencoder)
  • PCFG (probabilistic context-free grammar)
  • Latent Dirichlet allocation
  • Boltzmann machine

Examples of discriminative models:

  • CRF (conditional random field)
  • Logistic regression
  • SVM
  • Decision tree
  • Random forest
  • KNN (k-nearest neighbors)
  • Neural networks
  • Classical (sparse, denoising, etc.) autoencoders are generally classified as discriminative models, although denoising autoencoders can be used as generative models [1].

For sequence labeling tasks (such as Part-of-Speech tagging and Named Entity Recognition), discriminative models are said to be preferred to generative models [2].

There is a widely-held belief that discriminative models are almost always to be preferred in classification tasks [3].

One of the advantages of generative models is that they can be used to generate new data similar to existing data.

References

[1] Bengio, Y., Yao, L., Alain, G., & Vincent, P. (2013). Generalized Denoising Auto-Encoders as Generative Models. arXiv e-prints, arXiv:1305.6663.

[2] Why discriminative models are preferred to generative models for sequence labeling tasks? (n.d.). Cross Validated. https://stats.stackexchange.com/questions/73015/why-discriminative-models-are-preferred-to-generative-models-for-sequence-labeli

[3] Andrew Y. Ng and Michael I. Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes, Advances in Neural Information Processing Systems 14 (T. G. Dietterich, S. Becker, and Z. Ghahramani, eds.), MIT Press, 2002, pp. 841–848.


Classification metrics

Posted on 2020-05-10 | In Mathematics | Words count in article 783
  • true positive (TP)
  • true negative (TN)
  • false positive (FP)
    • type I error: rejection of a true null hypothesis
  • false negative (FN)
    • type II error: non-rejection of a false null hypothesis

Confusion matrix:

                      actual positive    actual negative
  predicted positive  TP                 FP (type I error)
  predicted negative  FN                 TN
\[\text{precision} = \frac{\text{TP}}{|\text{predicted as positive}|} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]

\[\text{recall} = \text{sensitivity} = \text{true positive rate} = \text{probability of detection} = \frac{\text{TP}}{|\text{real positive}|} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]

\[\text{specificity} = \text{true negative rate} = \frac{\text{TN}}{|\text{real negative}|} = \frac{\text{TN}}{\text{TN} + \text{FP}}\]

\[\text{accuracy} = \frac{|\text{correct prediction}|}{|\text{all samples}|} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}\]

\[\text{F1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\]
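If scikit-learn is available, these quantities can be computed directly from predictions; a minimal sketch with hypothetical labels:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score

# hypothetical ground-truth labels and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # note: scikit-learn uses rows = true class, columns = predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred))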

Accuracy can be a misleading metric for imbalanced data sets because it can be biased by the most common target class.

A high-sensitivity test is reliable when its result is negative, since it rarely misdiagnoses the cases which are indeed positive; it is not that reliable when its result is positive.

A high-specificity test is reliable when its result is positive, since it rarely gives positive results in negative cases; it is not that reliable when its result is negative. A test with a higher specificity has a lower type I error rate.

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. True positive rate (TPR) is also known as recall, sensitivity or probability of detection. False positive rate (FPR) can be calculated as $1 - \text{specificity}$.


The precision-recall curve (PR curve) shows the tradeoff between precision and recall for different thresholds.

According to a post on Cross-Validated, the key difference is that ROC curves will be the same no matter what the baseline probability is, but PR curves may be more useful in practice for problems where the positive class is more interesting than the negative class (imbalanced problems). Recall (sensitivity) and specificity, which make up the ROC curve, are probabilities conditioned on the true class label. Precision is a probability conditioned on the estimate of the class label. If the question is: “How meaningful is a positive result from the classifier given the baseline probabilities of the problem?”, use a PR curve. If the question is, “How well can this classifier be expected to perform in general, at a variety of different baseline probabilities?”, go with a ROC curve.
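scikit-learn can also compute the points of both curves; a minimal sketch with hypothetical labels and scores:

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc, average_precision_score

# hypothetical binary labels and predicted scores for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)                        # points of the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)  # points of the PR curve

print(auc(fpr, tpr))                             # area under the ROC curve
print(average_precision_score(y_true, y_score))  # area under the PR curve (AP)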



The general definition of Average Precision (AP) is the area under the precision-recall curve. Before calculating AP for object detection, we often smooth out the zig-zag pattern first. Graphically, at each recall level, we replace each precision value with the maximum precision value to the right of that recall level.

mAP (mean average precision) is the average of AP. In some contexts, we compute the AP for each class and average them to get mAP. But in other contexts, for example under the COCO evaluation protocol, AP is averaged over all classes (and also over several IoU thresholds), which is traditionally called mAP.

How is the precision-recall curve drawn?

  • In the context of object detection, we first rank the predicted bounding boxes by their confidence scores. Following this ranking, we cumulatively (considering only the current bounding box and all previous bounding boxes) calculate pairs of precision and recall. For each precision-recall pair, we draw a point on the precision-recall curve. Following this ranking, the recall of successive pairs is non-decreasing by nature; on the other hand, the precision can have a zig-zag pattern because of newly encountered false-positive predicted bounding boxes. This is why the precision-recall curve is usually smoothed.


Comparison of F1 score and AP:

  • F1 score is sensitive to the trade-off between precision and recall, so careful tuning is often needed to get the best combination of precision and recall. AP, on the other hand, is computed by iterating over all pairs of precision and recall, so it is invariant to that trade-off.
  • For each point of the precision-recall curve (each pair of precision and recall), one can obtain an F1 score. One can then choose to keep the maximum of the obtained F1 scores.
English            Chinese
precision          精确率,查准率
recall             召回率,查全率
sensitivity        灵敏度
specificity        特异度
accuracy           准确率
confusion matrix   混淆矩阵
mAP                平均精度均值

Time series

Posted on 2020-04-12 | In Mathematics | Words count in article 8272

Correlation

To compute the correlation between two time series, we can use the .corr() method of pandas.Series.

To visualize the scatter plot of two time series:

import matplotlib.pyplot as plt

def correlation_between_2_ts(s1, s2, figsize=(5, 5), plot_scatter=True):
    """
    Computes correlation between two time series
    
    s1, s2: named pandas.Series 
    """
    correlation = s1.corr(s2)
    if plot_scatter:
        plt.figure(figsize=figsize)
        plt.scatter(s1, s2)
        plt.grid()
        title = "Scatter plot of {} and {}".format(s1.name, s2.name) 
        title += "\ncorrelation = {:.4f}".format(correlation)
        plt.title(title) 
        plt.show()
    return correlation

correlation = correlation_between_2_ts(s1, s2)

Autocorrelation function (ACF)

The ACF gives the autocorrelation as a function of the lag; it equals one at lag zero.

from statsmodels.tsa.stattools import acf

acf(data)
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(data, alpha=0.05, lags=30)
plt.show()

To compute a single lag-N autocorrelation, we can also use the .autocorr(lag=N) method of pandas.Series.
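For example, a minimal sketch (assuming s1 is a named pandas.Series as above):

# lag-1 and lag-12 autocorrelations of a single series
print(s1.autocorr(lag=1))
print(s1.autocorr(lag=12))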

Stationarity

  • Strong stationarity: joint distribution of the observations is time-invariant.
  • Weak stationarity: mean, variance and autocorrelation of the observations are time-invariant. For autocorrelation, $\text{corr}(X_t, X_{t-\tau})$ is only a function of $\tau$, not $t$.
  • Weak stationarity is easier to test than strong stationarity.
  • If a process is not stationary, it becomes difficult to model. Modeling involves estimating a set of parameters. If a process is not stationary and the parameters are different at each point in time, then there are too many parameters to estimate (maybe more parameters than data).
  • A random walk is a common type of non-stationary series; its variance grows with time. Seasonal series are also non-stationary: their mean varies with time. A white noise series with varying mean is non-stationary.
  • A white noise series (with constant mean and variance) is stationary.
  • Many non-stationary series can be made stationary through a simple transformation. For example, if we take the differences (with a lag of $1$) of a random walk series, we get a white noise series. A seasonal series can be made stationary by seasonal adjustments (taking the difference with a lag corresponding to the periodicity). Most economic data published by the government is seasonally adjusted. Sometimes we may need more than one transformation: for a series that grows exponentially and shows a strong seasonal pattern, we can first take the log of the series to eliminate the exponential growth and then take a seasonal difference. These operations often involve df.diff(), np.log(), np.sqrt(), etc.; a small sketch follows this list.
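A minimal sketch of such transformations, on a hypothetical monthly pandas.Series (all names are illustrative):

import numpy as np
import pandas as pd

# hypothetical monthly series with exponential growth and a yearly seasonal pattern
t = np.arange(120)
series = pd.Series(np.exp(0.02 * t) * (10 + np.sin(2 * np.pi * t / 12)))

# the log removes the exponential growth, a seasonal difference (lag 12) removes the yearly pattern
stationary = np.log(series).diff(12).dropna()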

White noise

White noise is a series with:

  • constant mean
  • constant variance
  • zero autocorrelations at all lags
  • special case: if data has normal distribution, then Gaussian white noise
gaussian_white_noise = np.random.normal(loc=0.02, scale=0.05, size=1000)

Random walk

\[X_t = X_{t-1} + \epsilon_t\]

In other words, the change is white noise:

\[X_t - X_{t-1} = \epsilon_t\]

One cannot forecast a random walk: the best forecast for the next value is the current value.

An extension, random walk with drift:

\[X_t = \mu + X_{t-1} + \epsilon_t\] \[X_t - X_{t-1} = \mu + \epsilon_t\]
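A minimal simulation sketch of a random walk with drift (all names are illustrative):

import numpy as np

np.random.seed(0)
mu = 0.1                                                # drift
epsilon = np.random.normal(loc=0, scale=1, size=1000)   # white noise
random_walk_with_drift = 100 + np.cumsum(mu + epsilon)  # start the walk at 100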

Augmented Dickey-Fuller test

The Dickey-Fuller test is the following:

\[X_t = \mu + \phi X_{t-1} + \epsilon_t\]

Null hypothesis: $H_0 : \phi = 1$

Subtracting $X_{t-1}$ from both sides and writing $\delta = \phi - 1$, this is equivalent to:

\[X_t - X_{t-1} = \mu + \delta X_{t-1} + \epsilon_t\]

Null hypothesis: $H_0 : \delta = 0$

If we add more lagged changes on the right hand side, it’s the Augmented Dickey-Fuller test.

The null hypothesis for both Dickey-Fuller test and Augmented Dickey-Fuller test is that the data are non-stationary.

If we use $5\%$ as the threshold, a p-value less than $5\%$ indicates that:

  • we can reject the null hypothesis that the series is non-stationary with $95\%$ confidence.
  • this series is stationary.

If we use $5\%$ as the threshold, a p-value larger than $5\%$ indicates that:

  • we cannot reject the null hypothesis.
  • this series is non-stationary.
from statsmodels.tsa.stattools import adfuller

# p-value
adfuller(p1)[1]

Autoregressive model (AR model)

AR model of order $1$, or simply AR$(1)$ model (there is only one lagged value on right hand side):

\[X_t = \mu + \phi X_{t-1} + \epsilon_t\]

If AR parameter $\phi = 1$, then this is a random walk. If $\phi = 0$, then this is white noise.

In order for the process to be stable and stationary, $\phi$ has to satisfy $-1 < \phi < 1$.

Interpretation of AR$(1)$ parameter:

  • negative $\phi$: mean reversion
  • positive $\phi$: momentum.
  • the autocorrelation function decays exponentially for an AR$(1)$ time series at a rate of the AR$(1)$ parameter $\phi$.

The model can be extended to include more lagged values and more $\phi$ parameters:

  • AR$(1)$: $X_t = \mu + \phi_1 X_{t-1} + \epsilon_t$
  • AR$(2)$: $X_t = \mu + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \epsilon_t$
  • etc.

To simulate AR time series, take care of the convention: we must include the zero-lag coefficient of $1$ and the sign of the other coefficients is opposite what we have been using above.

from statsmodels.tsa.arima_process import ArmaProcess

# phi = + 0.9
ar = [1, - 0.9]
ma = [1]
arma_process = ArmaProcess(ar, ma)
simulated_data = arma_process.generate_sample(nsample=1000)

# phi = - 0.9
ar = [1, 0.9]
ma = [1]
arma_process = ArmaProcess(ar, ma)
simulated_data = arma_process.generate_sample(nsample=1000)

To estimate an AR model (estimate parameters from data):

from statsmodels.tsa.arima_model import ARMA

# Fit an AR(1) model to the data, true phi = + 0.9
arma = ARMA(data, order=(1, 0))
arma_results = arma.fit()

print(arma_results.summary())
                              ARMA Model Results                              
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                     ARMA(1, 0)   Log Likelihood               -1455.318
Method:                       css-mle   S.D. of innovations              1.036
Date:                Sun, 12 Apr 2020   AIC                           2916.636
Time:                        20:45:48   BIC                           2931.360
Sample:                             0   HQIC                          2922.232
                                                                              
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0938      0.323     -0.290      0.772      -0.727       0.540
ar.L1.y        0.8995      0.014     65.170      0.000       0.872       0.927
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1117           +0.0000j            1.1117            0.0000
-----------------------------------------------------------------------------

We can do forecasting, both in-sample and out-of-sample using statsmodels. The in-sample is a forecast of the next data point using the data up to that point, and the out-of-sample forecasts any number of data points in the future. These forecasts can be made using either the predict() method if we want the forecasts in the form of a series of data, or using the plot_predict() method if we want a plot of the forecasted data. We supply the starting point for forecasting and the ending point, which can be any number of data points after the data set ends. plot_predict() also gives confidence intervals around the out-of-sample forecasts.

# Plot the original series and the forecasted series
arma_results.plot_predict(start=0, end="2030")
plt.show()

To identify the order of an AR model:

  • The partial autocorrelation function (PACF) measures the incremental benefit of adding another lag. Each PACF coefficient represents how significant adding the corresponding lag is when all the previous lags are already included.
  • Information criteria compute the goodness-of-fit with the estimated parameters, but apply a penalty on the number of parameters in the model. In general, the more parameters in a model, the better the model will fit the data, but this can lead to overfitting. There are two popular information criteria:
    • AIC (Akaike Information Criterion)
    • BIC (Bayesian Information Criterion)

In practice, the way to use the information criteria is to fit several models, each with a different number of parameters, and choose the one with the lowest information criteria. Compared to AIC, BIC penalizes additional model orders more than AIC and so the BIC will sometimes suggest a simpler model. The AIC and BIC will often choose the same model, but when they don’t we have to make a choice. If our goal is to identify good predictive models, we should use AIC. If our goal is to identify a good explanatory model, we should use BIC.

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(data, lags=20)
plt.show()
from statsmodels.tsa.arima_model import ARMA

bic = []
range_ = range(1, 7)
for p in range_:
    arma = ARMA(simulated_data, order=(p, 0))
    arma_results = arma.fit()  
    bic.append(arma_results.bic)

plt.plot(range_, bic, marker="o")
plt.xlabel("Order of AR Model")
plt.ylabel("BIC")
plt.show()

Moving average model (MA model)

MA model of order $1$, or simply MA$(1)$ model (there is only one lagged error on right hand side):

\[X_t = \mu + \epsilon_t + \theta \epsilon_{t-1}\]

If MA parameter $\theta = 0$, then this is white noise.

MA models are stationary for all values of $\theta$.

Interpretation of MA$(1)$ parameter:

  • negative $\theta$: one-period mean reversion
  • positive $\theta$: one-period momentum
  • the one-period autocorrelation (lag-$1$ autocorrelation) is not $\theta$; it is $\frac{\theta}{1 + \theta^2}$ (a numerical check is sketched after this list). There is zero autocorrelation for an MA$(1)$ beyond lag-$1$.
  • higher frequency stock data is well modeled by an MA$(1)$ process. For example, one day’s prices (on September 1, 2017) for Sprint stock (ticker symbol “S”) sampled at a frequency of one minute. The stock market is open for 6.5 hours (390 minutes), from 9:30am to 4:00pm. Such data can be accessed via Google Finance [3]. Stocks trade at discrete one-cent increments (although a small percentage of trades occur in between the one-cent increments) rather than at continuous prices, and when we plot the data we should observe that there are long periods when the stock bounces back and forth over a one cent range. This is sometimes referred to as “bid/ask bounce”. The bouncing of the stock price between bid and ask induces a negative first order autocorrelation, but no autocorrelations at lags higher than $1$.
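A minimal numerical check of the lag-$1$ autocorrelation formula above (simulation as in the snippets below):

from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf

theta = 0.9
simulated = ArmaProcess([1], [1, theta]).generate_sample(nsample=100000)
print(theta / (1 + theta ** 2))    # theoretical lag-1 autocorrelation, about 0.497
print(acf(simulated, nlags=1)[1])  # estimated lag-1 autocorrelation, should be close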

The model can be extended to include more lagged errors and more $\theta$ parameters:

  • MA$(1)$: $X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1}$
  • MA$(2)$: $X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}$
  • etc.

An MA$(q)$ model has no autocorrelation beyond lag-$q$.

To simulate MA time series, take care of the convention: we must include the zero-lag coefficient of $1$. However, unlike with the AR simulation, we don’t need to reverse the sign of $\theta$, its sign is what we have been using above.

from statsmodels.tsa.arima_process import ArmaProcess

# theta = + 0.9
ar = [1]
ma = [1, 0.9]
arma_process = ArmaProcess(ar, ma)
simulated_data = arma_process.generate_sample(nsample=1000)

# theta = - 0.9
ar = [1]
ma = [1, - 0.9]
arma_process = ArmaProcess(ar, ma)
simulated_data = arma_process.generate_sample(nsample=1000)

To estimate an MA model (estimate parameters from data):

from statsmodels.tsa.arima_model import ARMA

# Fit an MA(1) model to the data, true theta = + 0.9
arma = ARMA(data, order=(0, 1))
arma_results = arma.fit()

print(arma_results.summary())
                              ARMA Model Results                              
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                     ARMA(0, 1)   Log Likelihood               -1404.378
Method:                       css-mle   S.D. of innovations              0.985
Date:                Mon, 13 Apr 2020   AIC                           2814.756
Time:                        16:10:50   BIC                           2829.479
Sample:                             0   HQIC                          2820.352
                                                                              
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0664      0.060     -1.112      0.266      -0.184       0.051
ma.L1.y        0.9203      0.012     75.882      0.000       0.896       0.944
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
MA.1           -1.0867           +0.0000j            1.0867            0.5000
-----------------------------------------------------------------------------

Just as with AR models, we can use MA models to do forecasting, both in-sample and out-of-sample using statsmodels. This can be done via either the predict() method if we want the forecasts in the form of a series of data, or the plot_predict() method if we want a plot of the forecasted data. One big difference between out-of-sample forecasts with an MA$(1)$ model and an AR$(1)$ model is that the MA$(1)$ forecasts more than one period in the future are simply the mean of the sample.

# Plot the original series and the forecasted series
arma_results.plot_predict(start=0, end="2030")
plt.show()

Autoregressive moving average model (ARMA model)

An ARMA model is a combination of an AR model and an MA model.

ARMA$(1, 1)$ model:

\[X_t = \mu + \phi X_{t-1} + \epsilon_t + \theta \epsilon_{t-1}\]

ARMA models are generally denoted ARMA$(p, q)$ where parameters $p$ and $q$ are non-negative integers, $p$ is the order of the AR model, $q$ is the order of the MA model.

ARMA models can be converted to pure AR or pure MA models. Here is an example of converting an AR$(1)$ model into an MA$(\infty)$:

\[X_t = \mu + \phi X_{t-1} + \epsilon_t\]

Substituting the expression for $X_{t-1}$:

\[X_t = \mu + \phi (\mu + \phi X_{t-2} + \epsilon_{t-1}) + \epsilon_t\]

Repeating this substitution indefinitely (the $\phi^k X_{t-k}$ term vanishes since $-1 < \phi < 1$):

\[X_t = \frac{\mu}{1-\phi} + \epsilon_t + \phi \epsilon_{t-1} + \phi^2 \epsilon_{t-2} + \phi^3 \epsilon_{t-3} + ...\]

This demonstrates that an AR$(1)$ model is equivalent to an MA$(\infty)$ model with the appropriate parameters.

Using ACF and PACF to choose model order

         AR$(p)$                  MA$(q)$                  ARMA$(p, q)$
ACF      tails off                cuts off after lag-$q$   tails off
PACF     cuts off after lag-$p$   tails off                tails off

The time series must be made stationary before making these plots. If the ACF values are high and tail off very slowly, this is a sign that the data is non-stationary, so it needs to be differenced. If the autocorrelation at lag-$1$ is very negative, this is a sign that we have taken the difference too many times.

In the case of ARMA$(p, q)$, we cannot deduce the model orders $p$ and $q$ from the plots. However, we can use AIC and BIC to find the most appropriate $p$ and $q$. Sometimes when searching over model orders we will attempt to fit an order that leads to an error, for example, ValueError: Non-stationary starting autoregressive parameters found with enforce_stationarity set to True. This ValueError tells us that we have tried to fit a model which would result in a non-stationary set of AR coefficients. We can use try/except blocks to skip such orders.
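A minimal sketch of such a search (assuming data holds the stationary series, and using the same ARMA class as in the rest of this post):

import numpy as np
from statsmodels.tsa.arima_model import ARMA

best_aic, best_order = np.inf, None
for p in range(3):
    for q in range(3):
        try:
            arma_results = ARMA(data, order=(p, q)).fit()
            if arma_results.aic < best_aic:
                best_aic, best_order = arma_results.aic, (p, q)
        except ValueError:
            # non-stationary / non-invertible starting parameters: skip this order
            continue

print(best_order)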

Cointegration model

The idea behind cointegration (协整) is that even if the prices of two different assets both follow random walks, it is still possible that a linear combination of them is not a random walk. If that's true, then even though these two assets are not forecastable because they are random walks, the linear combination is forecastable. We say that these two assets are cointegrated. One analogy is that of a dog owner walking his dog with a retractable leash: the position of the dog owner and the position of the dog may both follow random walks, but the distance between them may very well be mean reverting. Examples can be found by looking at economic substitutes: heating oil and natural gas, platinum and palladium, corn and wheat, corn and sugar, Bitcoin and Ethereum, etc. For stocks, a natural starting point for identifying cointegrated pairs is stocks in the same industry. However, competitors are not necessarily economic substitutes; think of Apple and Blackberry.

Two steps to test for cointegration:

  • regress the level of one series $P_t$ on the level of the other series $Q_t$ to get the slope coefficient $c$ or the cointegration vector (coefficients of the linear combination).
  • run Augmented Dickey-Fuller test on $P_t - c Q_t$ or on the residuals of the regression.

Alternatively, statsmodels has a function coint that combines both steps, its null hypothesis is no cointegration.

from statsmodels.tsa.stattools import coint

# p-value
coint(s1, s2)[1]
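For reference, a minimal sketch of the two-step procedure itself (assuming s1 and s2 are the two price series as pandas.Series):

import numpy as np
from statsmodels.tsa.stattools import adfuller

# step 1: estimate the slope c of the regression of s1 on s2 (with intercept)
c, intercept = np.polyfit(s2.values, s1.values, deg=1)

# step 2: augmented Dickey-Fuller test on the spread s1 - c * s2
spread = s1 - c * s2
print(adfuller(spread)[1])  # p-value; a small value suggests the spread is stationary, i.e. cointegration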

Autoregressive integrated moving average model (ARIMA model)

An ARIMA model is a generalization of an ARMA model. ARIMA models are applied in some cases where data show evidence of non-stationarity, where an initial differencing step (corresponding to the “integrated” part of the model) can be applied one or more times to eliminate the non-stationarity.

ARIMA models are generally denoted ARIMA$(p, d, q)$ where parameters $p$, $d$, and $q$ are non-negative integers, $p$ is the order of the AR model, $d$ is the degree of differencing, $q$ is the order of the MA model.

Using the ARIMA module on a random walk series is identical to using the ARMA module on the first-order difference of that series followed by taking cumulative sums of these differences to get the original series forecast.

from statsmodels.tsa.arima_model import ARIMA

arima = ARIMA(data, order=(1,1,1))
arima_results = arima.fit()

arima_results.plot_predict(start="1872-01-01", end="2046-01-01")
plt.show()
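For comparison, a minimal sketch of the equivalent ARMA-on-differences approach (assuming data is a random-walk-like pandas.Series):

import numpy as np
from statsmodels.tsa.arima_model import ARMA

# fit an ARMA model on the first difference ...
arma_results = ARMA(data.diff().dropna(), order=(1, 1)).fit()

# ... then cumulatively sum the forecast differences and add back the last observed level
forecast_diff = arma_results.forecast(steps=10)[0]
forecast_level = data.iloc[-1] + np.cumsum(forecast_diff)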

In terms of forecasting, sometimes the estimate of the drift will have a much bigger impact on long range forecasts than the ARMA parameters.

ARMAX model

An ARMAX model is an exogenous ARMA model: it uses external variables in addition to the time series itself.

\[\text{ARMAX} = \text{ARMA} + \text{linear regression}\]

ARMAX$(1, 1)$ model:

\[X_t = \mu + \phi X_{t-1} + \epsilon_t + \theta \epsilon_{t-1} + \eta Z_t\]

where $Z_t$ is the exogenous input.

from statsmodels.tsa.arima_model import ARMA

arma = ARMA(data, order=(p, q), exog=df["..."])
arma_results = arma.fit()

print(arma_results.summary())

SARIMAX model

Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors model

from statsmodels.tsa.statespace.sarimax import SARIMAX

sarimax = SARIMAX(df, order=(p, d, q)) # trend="c"

sarimax_results = sarimax.fit()

# make in-sample prediction for last 25 values  
prediction_results = sarimax_results.get_prediction(start=-25)
# the central value of the forecast 
predicted_mean = prediction_results.predicted_mean
# confidence interval of forecasts, 
# use plt.fill_between to visualize 
confidence_intervals = prediction_results.conf_int()

# dynamic prediction (predict one step ahead and then use 
# this predicted value to forecast the next value after that, 
# and so on)
prediction_results = sarimax_results.get_prediction(start=-25, dynamic=True)

# make out-of-sample prediction (forecast future), 
# this is also a dynamic forecast 
prediction_results = sarimax_results.get_forecast(steps=5)
predicted_mean = prediction_results.predicted_mean
confidence_intervals = prediction_results.conf_int()

For an ideal model, the residuals (the differences between one-step-ahead predictions and the real values) should be uncorrelated white Gaussian noise centered on zero. We can use the plot_diagnostics() method to evaluate this; it generates 4 plots. The first plot shows the standardized residuals: if our model is working correctly, there should be no obvious structure in them. Another of the four plots shows the distribution of the residuals, where the histogram shows the measured distribution, the orange line shows a smoothed version of this histogram and the green line shows a normal distribution. If our model is good, these two lines should be almost the same. The normal Q-Q plot is another way to show how the distribution of the model residuals compares to a normal distribution: if the residuals are normally distributed then all the points should lie along the red line, except perhaps some values at either end. The last plot is the correlogram, which is just an ACF plot of the residuals rather than the data. $95\%$ of the correlations for lags greater than zero should not be significant. If there is significant correlation in the residuals, it means that there is information in the data that our model has not captured.

In the output of the summary() method, Prob(Q) is the p-value associated with the null hypothesis that the residuals have no correlation structure. Prob(JB) is the p-value associated with the null hypothesis that the residuals are Gaussian normally distributed.

sarimax_results.plot_diagnostics(figsize=(16, 8))
plt.show()

For seasonal data, the full time series can be decomposed into 3 parts: the trend, the seasonal component and the residual. We can use statsmodels.tsa.seasonal.seasonal_decompose to separate out any time series into these three components.

from statsmodels.tsa.seasonal import seasonal_decompose

# freq corresponds the number of data points in each repeated cycle 
decompose_result = seasonal_decompose(df["..."], freq=12) 

decompose_result.plot()
plt.show()

We can use the ACF to identify the frequency/period. In the case of a seasonal time series, the ACF will show a periodic correlation pattern. To find the frequency, we look for a lag greater than one at which there is a peak in the ACF plot. In order to use the ACF to identify the period of a non-stationary time series, it might be useful to detrend it first; we can subtract a long rolling average over $N$ steps. Any large window size $N$ will work for this; we could use a window size of any value bigger than the likely period.

df = (df - df.rolling(N).mean()).dropna()
\[\text{SARIMA} = \text{Seasonal ARIMA}\]

Fitting a SARIMA model is like fitting two different ARIMA models at once, one to the seasonal part and another to the non-seasonal part. Since we have these two models, we will have two sets of orders:

\[\text{SARIMA}(p, d, q)(P, D, Q)_S\]
  • Non-seasonal orders:
    • $p$: autoregressive order
    • $d$: differencing order
    • $q$: moving average order
  • Seasonal orders:
    • $P$: seasonal autoregressive order
    • $D$: seasonal differencing order
    • $Q$: seasonal moving average order

There is also a new order $S$ which is the length of the seasonal cycle.

Comparison of ARIMA and SARIMA models:

  • ARIMA$(2, 0, 1)$:
    • $X_t = \mu + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \epsilon_t + \theta_1 \epsilon_{t-1}$
  • SARIMA$(0, 0, 0)(2, 0, 1)_{7}$:
    • $X_t = \mu + \phi_7 X_{t-7} + \phi_{14} X_{t-14} + \epsilon_t + \theta_7 \epsilon_{t-7}$

The SARIMA$(0, 0, 0)(2, 0, 1)_{7}$ model will be able to capture seasonal, weekly patterns, but won’t be able to capture local, day to day patterns. If we construct a SARIMA model and include non-seasonal orders as well, then we can capture both of these patterns.

Fitting a SARIMA model:

from statsmodels.tsa.statespace.sarimax import SARIMAX

sarimax = SARIMAX(df, order=(p, d, q), seasonal_order=(P, D, Q, S))
sarimax_results = sarimax.fit()

The value of $S$ can be found by using the ACF. The next task is to find the order of differencing. To make a time series stationary we may need to apply seasonal differencing. In seasonal differencing, instead of subtracting the most recent time series value, we subtract the time series value from one cycle ago.

# take the seasonal difference 
df = df.diff(S)

For example, if the time series shows a trend then we take the (normal) difference. If there is a strong seasonal cycle, then we will also take the seasonal difference. Once we have found the two orders of differencing and made the time series stationary, we need to find the other model orders. To find the non-seasonal orders, we plot the ACF and the PACF of the differenced time series. To find the seasonal orders, we plot the ACF and the PACF of the differenced time series at multiple seasonal steps; this can be done as follows:

# plot seasonal ACF and PACF 
# S is the length of the seasonal cycle 
plt.figure(figsize=(15, 8))
plt.subplot(211)
plot_acf(df, lags=[S, 2*S, 3*S, 4*S, 5*S, 6*S], ax=plt.gca())
plt.subplot(212)
plot_pacf(df, lags=[S, 2*S, 3*S, 4*S, 5*S, 6*S], ax=plt.gca())
plt.show()

Here we set the lags parameter to a list of lags instead of a maximum, this plots the ACF and PACF at these specific lags only.

Sometimes we will have the choice of whether to apply seasonal differencing, non-seasonal differencing or both to make a time series stationary. Some good rules of thumb are:

  • Never use more than one order of seasonal differencing, $0 \leqslant D \leqslant 1$
  • Never use more than two orders of differencing in total, $0 \leqslant d + D \leqslant 2$

Sometimes we will be able to make a time series stationary by using either one seasonal differencing or one non-seasonal differencing, we might build models for each in this case and see which one makes better predictions.

A time series is said to have weak seasonality when the seasonal oscillations don't always look the same and are harder to identify. When we have a time series that has a strong seasonality, we should always use one order of seasonal differencing. This will ensure that the seasonal oscillation will remain in our dynamic predictions far into the future without fading out.

  • Weak seasonal pattern
    • Use seasonal differencing if necessary
  • Strong seasonal pattern
    • Always use seasonal differencing

Just like in ARIMA modeling sometimes we need to use other transformations on our time series before fitting. Whenever the seasonality is additive we should not need to apply any transforms except for differencing. Additive seasonality is where the seasonal pattern just adds or takes away a little from the trend. When seasonality is multiplicative, the SARIMA model cannot fit this without extra transforms. If the seasonality is multiplicative the amplitude of the seasonal oscillations will get larger as the data trends up or smaller as it trends down.

  • Additive seasonality $=$ trend $+$ season
    • Proceeds as usual with differencing
  • Multiplicative seasonality $=$ trend $\times$ season
    • Apply a log transform first with np.log() (see the sketch after this list)
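A minimal sketch for the multiplicative case (with df holding a strictly positive series and the orders p, d, q, P, D, Q, S chosen as before):

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# fit on the log of the series, so that the seasonality becomes (approximately) additive
sarimax = SARIMAX(np.log(df), order=(p, d, q), seasonal_order=(P, D, Q, S))
sarimax_results = sarimax.fit()

# undo the log transform on the forecasts with np.exp()
forecast = np.exp(sarimax_results.get_forecast(steps=12).predicted_mean)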

Searching over SARIMA model orders using for loops may be complex; there is a package that will do most of this work for us. The auto_arima function from the pmdarima package loops over model orders to find the best one. The object returned by the function is the results object of the best model found by the search; this object is almost exactly a statsmodels SARIMAXResults object and has the summary() and plot_diagnostics() methods.

import pmdarima as pm

results1 = pm.auto_arima(df)

# create model for SARIMAX(p,1,q)(P,1,Q)_{7} 
results2 = pm.auto_arima(df,
                         seasonal=True, m=7,
                         d=1, D=1, 
                         start_p=1, start_q=1,
                         max_p=3, max_q=3,
                         max_P=2, max_Q=2,
                         trace=True,
                         error_action='ignore',
                         suppress_warnings=True) 

print(results2.summary())

The pmdarima package can also be used to update the model (incorporate data we have collected since then):

# add new observations and update the model parameters
results.update(df_new)

However, this does not choose the model orders again, so if we are updating with a large amount of new data, it might be best to go back to the start of the Box-Jenkins method.

Box-Jenkins method

From raw data $\Rightarrow$ production model:

  • identification
    • Determine whether it is seasonal. If so, find its seasonal period.
    • Is it stationary? What (seasonal and non-seasonal) differencing will make it stationary? What transforms will make it stationary?
    • What values of $p$ and $q$ are the most promising?
    • Tools: plot the time series, augmented Dickey-Fuller test, ACF, PACF, etc.
  • estimation
    • Use the fit() method.
    • Choose between models using AIC and BIC
  • model diagnostics
    • Evaluate the best fitting model
    • Are the residuals uncorrelated? Are the residuals normally distributed?
    • Tools: plot_diagnostics(), summary()
    • Is this model good enough or do we need to go back and rework it?

GARCH model

Generalized AutoRegressive Conditional Heteroskedasticity model

This is a popular approach to model volatility. A common assumption in time series modeling is that volatility (standard deviation, variance) remains constant over time. However, financial return data frequently exhibit volatility that varies systematically over time; this is called heteroskedasticity.

from arch import arch_model

# GARCH(1, 1) model 
model = arch_model(data, p=1, q=1, 
                   mean="constant", 
                   vol="GARCH", 
                   dist="normal")
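A minimal usage sketch (assuming data holds a series of returns, as above):

# fit the GARCH(1, 1) model and inspect the estimated parameters and conditional volatility
results = model.fit(disp="off")
print(results.summary())
conditional_volatility = results.conditional_volatility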

Machine learning for time series

Unpivot a DataFrame from wide to long format (melt):

df = data.melt(id_vars="Product_Code", var_name="Week", value_name="Sales")
              W0  W1  W2  W3  W4  W5  W6  W7  W8  W9  ...  W42  W43  W44  W45  \
Product_Code                                          ...                       
P1            11  12  10   8  13  12  14  21   6  14  ...    4    7    8   10   
P2             7   6   3   2   7   1   6   3   3   3  ...    2    4    5    1   
P3             7  11   8   9  10   8   7  13  12   6  ...    6   14    5    5   
                   Sales
Product_Code Week       
P1           0        11
             1        12
             2        10
             3         8
             4        13
...                  ...
P99          47        8

Adding lags as basic feature engineering:

def add_lag(original_df, lag=3, groupby_col="Product_Code", col="Sales", diff=False, dropna=True):
    df = original_df.copy()
    columns = []
    for i in range(1, lag + 1):
        c = "lag_" + str(i)
        columns.append(c)
        df[c] = df.groupby(groupby_col)[col].shift(i)
    if diff: # pay attention for multicollinearity
        for i in range(1, lag + 1):
            c = "diff_" + str(i)
            columns.append(c)
            df[c] = df.groupby(groupby_col)["lag_" + str(i)].diff(1)
    if dropna:
        df.dropna(inplace=True, axis=0, how="any")
        df[columns] = df[columns].astype("int")
    return df
    
df = add_lag(df, 3)

Time series cross validation:

from sklearn.model_selection import TimeSeriesSplit

def split_test_ts(X, y, test_size=1):
    """
    X: pd.DataFrame
    y: pd.Series
    
    reserves the last test_size elements of the time series dataset 
    as the test set of the time series problem. 
    """
    assert test_size > 0, "test_size must be larger than 0."
    X_tmp, X_test = X.iloc[: - test_size, :], X.iloc[- test_size :, :]
    y_tmp, y_test = y.iloc[: - test_size], y.iloc[- test_size :]
    return X_tmp, X_test, y_tmp, y_test 

def split_train_validation_ts(X, y, n_splits=3, max_train_size=None):
    """
    X: pd.DataFrame
    y: pd.Series
    
    returns a generator that generates training sets and 
    validation sets for time series cross validation. 
    
    If a separate test set is needed, call split_test_ts() 
    before calling this function. 
    """
    tss = TimeSeriesSplit(n_splits=n_splits, max_train_size=max_train_size)
    for train_idx, validation_idx in tss.split(X):
        X_train, X_validation = X.iloc[train_idx, :], X.iloc[validation_idx, :]
        y_train, y_validation = y.iloc[train_idx], y.iloc[validation_idx]
        yield X_train, X_validation, y_train, y_validation 
        
def get_dataset_with_specific_lag(df, lag, target="Sales", test_size=3600):
    df = add_lag(df, lag)
    y = df.loc[:, target]
    X = df.drop(target, axis=1)
    X_train, X_test, y_train, y_test = split_test_ts(X, y, test_size=test_size)
    return X_train, X_test, y_train, y_test
X, X_test, y, y_test = split_test_ts(X, y, test_size=3600)

train_validation_generator = split_train_validation_ts(X, y, n_splits=10)
for X_train, X_validation, y_train, y_validation in train_validation_generator:
    pass 

Time series feature extraction with tsfresh

tsfresh is a Python package which automatically calculates a large number of time series features. Further, the package contains methods to evaluate the explaining power and importance of such features for regression or classification tasks.

Let’s look at one example. Suppose we have the following dataframe df where the length of A’s time series is 1000, the length of B’s time series is 1200:

     id     value
0     A  0.008801
1     A  0.008801
2     A  0.008801
3     A  0.008803
4     A  0.008808
...  ..       ...
2195  B  0.001495
2196  B  0.001498
2197  B  0.001500
2198  B  0.001503
2199  B  0.001506

[2200 rows x 2 columns]

tsfresh can extract 756 features as of version 0.15.1. To extract all features, we use tsfresh.extract_features(). This function can be very time-consuming if the input's size is large. The returned feature_matrix has the shape:

\[(\text{nb_ids}, \text{nb_extracted_features_per_time_series} \times \text{nb_time_series})\]
import tsfresh

feature_matrix = tsfresh.extract_features(df, column_id="id", column_value="value")

We end up with a DataFrame feature_matrix with all possible features. Then we should:

  • first, remove all NaN values that were created by feature calculators that cannot be used on the given data, e.g. because it has too low statistics.
    • tsfresh.utilities.dataframe_functions.impute() replaces all NaNs and infs from the input DataFrame with average/extreme values from the same columns.
      • -inf -> min
      • +inf -> max
      • NaN -> median
    • tsfresh.utilities.dataframe_functions.impute() is in-place.
  • second, select only the relevant features
    • tsfresh.feature_selection.selection.select_features() checks the significance of all features (columns) with respect to target vector y and return a possibly reduced feature matrix only containing relevant features.
tsfresh.utilities.dataframe_functions.impute(feature_matrix)
filtered_feature_matrix = tsfresh.select_features(feature_matrix, y)

We can also perform the extraction, imputing and filtering at the same time with the tsfresh.extract_relevant_features() function.

filtered_feature_matrix = tsfresh.extract_relevant_features(df, y, column_id="id", column_value="value")

Analyzing time series motifs and anomalies with stumpy

For the stumpy package, time series motifs are approximately repeated subsequences found within a longer time series.

stumpy compares all subsequences within a time series by computing the pairwise z-normalized Euclidean distances and then storing only the index to its nearest neighbor. This nearest neighbor distance is referred to as the matrix profile and the index to each nearest neighbor within the time series is referred to as the matrix profile index.

The global minima of the matrix profile correspond to the locations of the two subsequences that make up the motif pair. The matrix profile index also tells us which subsequence within the time series does not have a nearest neighbor that closely resembles it; such a subsequence is referred to as a discord, novelty, or anomaly.

import stumpy

window_size = 640  
matrix_profile = stumpy.stump(data, m=window_size)

The returned matrix_profile is a numpy array of shape (data.size, 4) (suppose data is 1-D). The first column consists of the matrix profile (the distance), the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices.

def ts_stumpy_motif(data, matrix_profile, window_size, plot=True):
    """Time series motif discovery (stumpy package)"""
    from matplotlib.patches import Rectangle as patch_rect_
    idx = matrix_profile[:, 0].argpartition(0)[0]
    indices = np.array([idx, matrix_profile[idx, 1]]) 
    if plot:
        fig = plt.figure(figsize=(14, 5))
        plt.suptitle("Time series motif discovery - window size = {}".format(window_size), 
                     fontsize="xx-large")
        ax = plt.subplot(211)
        plt.plot(data)
        plt.gca().set_ylabel("Time series", fontsize="xx-large")
        plt.xticks(fontsize="xx-large")
        plt.yticks(fontsize="xx-large")
        bottom_, top_ = plt.ylim()
        rect_height = top_ - bottom_
        for idx in indices:
            rect = patch_rect_((idx, 0), window_size, rect_height, facecolor="lightgrey")
            plt.gca().add_patch(rect)
        plt.subplot(212, sharex=ax)
        plt.plot(matrix_profile[:, 0])
        plt.gca().set_ylabel("Matrix profile", fontsize="xx-large")
        plt.xticks(fontsize="xx-large")
        plt.yticks(fontsize="xx-large")
        for idx in indices:
            plt.axvline(x=idx, linestyle="dashed")
        plt.show()
    return indices

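A minimal usage sketch (with data, matrix_profile and window_size as computed above):

motif_indices = ts_stumpy_motif(data, matrix_profile, window_size)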

def ts_stumpy_anomaly(data, matrix_profile, window_size, plot=True):
    """Time series anomaly discovery (stumpy package)"""
    from matplotlib.patches import Rectangle as patch_rect_
    idx = matrix_profile[:, 0].argpartition(matrix_profile[:, 0].size - 1)[-1]
    if plot:
        plt.figure(figsize=(14, 5))
        plt.suptitle("Time series anomaly discovery - window size = {}".format(window_size), 
                     fontsize="xx-large")
        ax = plt.subplot(211)
        plt.plot(data)
        plt.gca().set_ylabel("Time series", fontsize="xx-large")
        bottom_, top_ = plt.ylim() 
        rect = patch_rect_((idx, 0), window_size, top_-bottom_, facecolor='lightgrey')
        plt.gca().add_patch(rect)
        plt.xticks(fontsize="xx-large")
        plt.yticks(fontsize="xx-large")
        plt.subplot(212, sharex=ax)
        plt.plot(matrix_profile[:, 0])
        plt.gca().set_ylabel('Matrix profile', fontsize="xx-large")
        plt.axvline(x=idx, linestyle="dashed")
        plt.xticks(fontsize="xx-large")
        plt.yticks(fontsize="xx-large")
        plt.show()
    return idx

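A minimal usage sketch (same inputs as for the motif discovery above):

anomaly_index = ts_stumpy_anomaly(data, matrix_profile, window_size)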

Machine learning for audio signals

For a classification problem whose inputs are audio data (for example, heartbeat time series), one can extract summary statistics from the envelope. The envelope is obtained by smoothing the absolute value of the original centered data so that the total amount of sound energy over time is distinguishable. This is done by first applying np.abs() to the original centered audio waveform, then by using the .rolling() method. This is shown as follows:

# suppose that audio1, audio2 are both pandas.Series (1-D)

# smooth by applying a rolling mean
audio_envelope1 = audio1.apply(np.abs).rolling(20).mean()
audio_envelope2 = audio2.apply(np.abs).rolling(window=50).mean()

Then we can use np.mean(), np.std(), np.max(), etc. to extract statistics from the obtained envelope.
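A minimal sketch of turning the envelope into features (using audio_envelope1 from above; the feature vector is illustrative):

import numpy as np

envelope_features = np.array([
    np.mean(audio_envelope1),  # average loudness of the envelope
    np.std(audio_envelope1),
    np.max(audio_envelope1),
])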

The envelope calculation is also a common technique in computing tempo (每分钟节拍数,beats per minute) and rhythm (节奏,韵律) features. The tempogram (which estimates the tempo of a sound over time) can be calculated by using the librosa package.

Some basics of the librosa package:

  • The sampling rate, sometimes denoted by sr, is a positive integer which indicates the number of samples per second of a time series.
  • A frame is a short slice of a time series used for analysis purposes. This usually corresponds to a single column of a spectrogram matrix.
  • A window is a vector or function used to weight samples within a frame when computing a spectrogram.
  • n_fft is a positive integer which indicates the number of samples in an analysis window (or frame), this can be called the frame length.
  • hop_length is a positive integer which indicates the number of samples between successive frames, e.g., the columns of a spectrogram.
  • librosa.core.stft() returns a complex-valued matrix D such that:
    • np.abs(D) is the magnitude
    • np.angle(D) is the phase
  • Spectral centroid and spectral bandwidth are only defined with real-valued non-negative input. magnitude, phase = librosa.magphase(D) separates a complex-valued STFT D into its magnitude and phase components.
  • librosa functions tend to only operate on numpy arrays instead of DataFrames, so we’ll access pandas data as a numpy array with the .values attribute.
import librosa

tempo = librosa.beat.tempo(audio, sr=sampling_rate, hop_length=2**6, aggregate=None)

tempo and audio are both 1-D arrays; their sizes satisfy the following relation:

\[\text{tempo.size} = \text{nb_frames} \approx \frac{\text{audio.size}}{\text{hop_length}}\]

Then we can extract statistics from the tempogram by using np.mean(tempo) (the average tempo of this particular audio signal), np.std(tempo), np.max(tempo), etc.

Spectrograms (时频谱) are common in time-series analysis. By definition, the spectrogram is the squared magnitude of the short-time Fourier transform (STFT). The Fourier transform (computed in practice via the FFT) describes, for a window of time, the presence of fast and slow oscillations in a time series. We can do spectral feature engineering by using the spectrogram as a base. For example, we can calculate the spectral centroids and spectral bandwidth, which describe where most of the spectral energy is at each moment of time. One way to do this is to use the librosa package (other packages can also be used, for example scipy.signal.spectrogram). Under the assumption that the temporal features and spectral features provide independent information, we can combine them to train our machine learning model.

import librosa

audio, sampling_rate = librosa.load(audio_file_path)
time_axis = np.arange(0, audio.size) / sampling_rate # seconds since the beginning 

n_fft = 2 ** 7
hop_length = 2 ** 4
stft = librosa.stft(audio, hop_length=hop_length, n_fft=n_fft)

magnitude, phase = librosa.magphase(stft)
import librosa.display

# convert into decibels
spec_db = librosa.amplitude_to_db(magnitude)

# compare the raw audio to the spectrogram of the audio
plt.figure(figsize=(10, 10))
ax = plt.subplot(211)
plt.plot(time_axis, audio)
plt.subplot(212, sharex=ax)
librosa.display.specshow(spec_db, sr=sampling_rate, x_axis="time", 
                         y_axis="hz", hop_length=hop_length)
plt.show()


# calculate the spectral centroid and bandwidth for the spectrogram from time-series input
centroids = librosa.feature.spectral_centroid(y=audio, sr=sampling_rate, 
                                              hop_length=hop_length, n_fft=n_fft)[0]
bandwidths = librosa.feature.spectral_bandwidth(y=audio, sr=sampling_rate, 
                                                hop_length=hop_length, n_fft=n_fft)[0]

# calculate the spectral centroid and bandwidth for the spectrogram from spectrogram input
centroids = librosa.feature.spectral_centroid(S=magnitude)[0]
bandwidths = librosa.feature.spectral_bandwidth(S=magnitude)[0]

The obtained centroids and bandwidths are both 1-D arrays and their sizes satisfy the following relations:

\[\text{centroids.size} = \text{nb_frames} \approx \frac{\text{audio.size}}{\text{hop_length}}\] \[\text{bandwidths.size} = \text{nb_frames} \approx \frac{\text{audio.size}}{\text{hop_length}}\]
# visualize spectral centroid and bandwidth on top of the spectrogram
plt.figure(figsize=(10, 5))
librosa.display.specshow(spec_db, x_axis='time', y_axis='hz', hop_length=hop_length)
times_spec = np.arange(0, spec_db.shape[1])
plt.plot(times_spec, centroids)
plt.fill_between(times_spec, centroids - bandwidths / 2, centroids + bandwidths / 2, alpha=.5)
plt.show()


To use the spectral features as machine learning features, we can use np.mean(centroids) (the average spectral centroid of this particular audio signal, which is a frequency), np.std(centroids), np.max(centroids), np.mean(bandwidths) (the average spectral bandwidth of this particular audio signal, which is a frequency range), np.std(bandwidths), np.max(bandwidths), etc.

References

[1] DataCamp. (n.d.). Time Series Analysis in Python. https://learn.datacamp.com/courses/time-series-analysis-in-python

[2] E. E. Holmes, M. D. Scheuerell, and E. J. Ward. (2020, February 3). 5.3 Dickey-Fuller and augmented Dickey-Fuller tests - Applied time series analysis for fisheries and environmental sciences. NWFSC Time-Series Analysis. https://nwfsc-timeseries.github.io/atsa-labs/sec-boxjenkins-aug-dickey-fuller.html

[3] 6 ways to download free intraday and tick data for the U.S. stock market. (n.d.). QuantShare Trading Software. https://www.quantshare.com/sa-426-6-ways-to-download-free-intraday-and-tick-data-for-the-us-stock-market

[4] Autoregressive integrated moving average. (2005, February 15). Wikipedia, the free encyclopedia. Retrieved April 14, 2020, from https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average

[5] DataCamp. (n.d.). ARIMA Models in Python. https://learn.datacamp.com/courses/arima-models-in-python

[6] DataCamp. (n.d.). GARCH Models in Python. https://learn.datacamp.com/courses/garch-models-in-python

[7] DataCamp. (n.d.). Machine Learning for Time Series Data in Python. https://learn.datacamp.com/courses/machine-learning-for-time-series-data-in-python


Python iterable iterator generator yield

Posted on 2020-04-08 | In Python | Words count in article 855

Iterable:

  • It is an object that contains a countable number of values.
  • It is an object that can be iterated upon, meaning that you can traverse through all the values.
  • It is an object that has an __iter__() method or defines a __getitem__() method.
  • It generates an iterator when passed to the iter() function.
  • Typical iterables such as lists store all their values in memory (this is not required of all iterables, e.g. a range object does not).
  • It is any object you can get an iterator from.

Iterator:

  • It is an object that lets you iterate on an iterable.
  • It must implement the __iter__() method and the __next__() method.
  • __next__() method returns the next item of the object.
  • It is an object with state that remembers where it is during iteration.
  • It is self-iterable (meaning that it has an __iter__() method that returns self).

Generator:

  • It is an iterator, a kind of iterable you can only iterate over once.
  • It does not store all the values in memory, it generates the values on the fly.
  • yield is a keyword that is used like return. By using yield, the function will return a generator.
  • When you call a function which uses yield, the code you have written in that function body does not run. The function only returns the generator object.

Every iterator is also an iterable, but not vice versa. For example, a list is iterable but it is not an iterator. Compared to a generator, an iterator is a more general concept: any object whose class has a __next__() method and an __iter__() method that returns self is an iterator. Every generator is an iterator, but not vice versa. A generator is an object that meets the previous definition of an iterator and can be built by calling a function that contains one or more yield expressions. A function with yield in it is still a function; when called, it returns a generator object.

\[\text{generator} \subsetneq \text{iterator} \subsetneq \text{iterable}\]
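
These inclusions can be checked with the abstract base classes in collections.abc (a quick sketch):

from collections.abc import Iterable, Iterator, Generator

gen = (x * x for x in range(4))
lst = [1, 2, 3]

print(isinstance(gen, Generator), isinstance(gen, Iterator), isinstance(gen, Iterable))  # True True True
print(isinstance(iter(lst), Generator), isinstance(iter(lst), Iterator))                 # False True
print(isinstance(lst, Iterator), isinstance(lst, Iterable))                              # False True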

When a for loop is executed, the for statement calls iter() on the iterable object. If successful, the iter() call will return an iterator object that defines the __next__() method, which accesses elements of the object one at a time. The __next__() method will raise a StopIteration exception, if there are no further elements available. The for loop will terminate as soon as it catches a StopIteration exception.

my_iterable = ["A", "B", "C"] 
iterator_obj = iter(my_iterable) 

print(next(iterator_obj)) # A
print(next(iterator_obj)) # B
print(next(iterator_obj)) # C
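
A for loop over my_iterable is therefore roughly equivalent to the following manual loop (a sketch of what the for statement does internally):

my_iterable = ["A", "B", "C"]  # same list as above
iterator_obj = iter(my_iterable)
while True:
    try:
        item = next(iterator_obj)  # calls iterator_obj.__next__()
    except StopIteration:
        break                      # the for loop terminates here
    print(item)                    # prints A, then B, then C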
# this is called generator expression
my_generator = (x * x for x in range(4))

for i in my_generator:
    print(i, end =" ") # 0 1 4 9 

print("\n")

for i in my_generator:
    print(i)           #                   

It is just the same except you used () instead of []. But you cannot perform for i in my_generator a second time since generators can only be used once: they calculate the first element, then forget about it and calculate the second element, then forget about it, and so on.

def createGenerator():
    my_iterable = range(4)
    for i in my_iterable:
        yield i * i

my_generator = createGenerator() 
print(my_generator) # <generator object createGenerator at 0x11cd234c0>

for i in my_generator:
    print(i, end =" ") # 0 1 4 9 

print("\n")

for i in my_generator:
    print(i)           #     

yield can be used when you know your function will return a huge set of values that you will only need to read once.

The first time the for loop calls the generator object created from your function, it runs the code in your function from the beginning until it hits yield, then returns the first value of the loop. Each subsequent call runs another iteration of the loop you have written in the function and returns the next value. This continues until the generator is considered empty, which happens when the function body finishes without hitting yield. That can be because the loop has come to an end, or because the code no longer reaches a yield (for example, an if condition is no longer satisfied).

References

[1] Glossary — Python 3.8.2 documentation. (n.d.). 3.8.2 Documentation. https://docs.python.org/3/glossary.html

[2] Python Iterators. (n.d.). W3Schools Online Web Tutorials. https://www.w3schools.com/python/python_iterators.asp

[3] Python - Difference between iterable and iterator. (2019, August 23). GeeksforGeeks. https://www.geeksforgeeks.org/python-difference-iterable-iterator/

[4] What exactly are iterator, iterable, and iteration? (n.d.). Stack Overflow. https://stackoverflow.com/questions/9884132/what-exactly-are-iterator-iterable-and-iteration

[5] Python Iterators (iter and next): How to use it and why? (n.d.). Programiz: Learn to Code for Free. https://www.programiz.com/python-programming/iterator

[6] What does the “yield” keyword do? (n.d.). Stack Overflow. https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do

[7] Difference between Python’s generators and Iterators. (n.d.). Stack Overflow. https://stackoverflow.com/questions/2776829/difference-between-pythons-generators-and-iterators

Read more »

Matplotlib memo

Posted on 2020-04-03 | In Python
| Words count in article 588

matplotlib is a plotting library for the Python programming language and its numerical mathematics extension numpy. matplotlib.pyplot provides a MATLAB-like plotting framework. pylab combines pyplot with numpy into a single namespace. pylab is now discouraged. [1, 2, 3]

import matplotlib.pyplot as plt
params = {'legend.fontsize': 'xx-large',
         'axes.labelsize': 'xx-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'xx-large',
         'ytick.labelsize':'xx-large'}
plt.rcParams.update(params)
params = {'image.interpolation': 'nearest',
         'image.cmap': 'gray', 
         'figure.dpi': 72 * 1} 
plt.rcParams.update(params)

How to use plt.subplots()?

fig, axes = plt.subplots(nrows=nb_rows, ncols=nb_cols, figsize=(10, 6))
for i, c in enumerate(iterable_):
    # if nb_cols = 2, the following can be used:  
    #axes[2*i, 0], axes[2*i, 1], axes[2*i+1, 0], axes[2*i+1, 1]
    pass 
#plt.tight_layout()
plt.show()
fig, axes = plt.subplots(nrows=nb_rows, ncols=nb_cols, figsize=(10, 6))
axes = axes.ravel() 
for i, c in enumerate(iterable_):
    #axes[i].set_title("blablabla")
    pass 
#plt.tight_layout()
#plt.suptitle("blablabla")
plt.show() 

In matplotlib, a figure’s size is given as (width, height) in inches; in PIL, an image’s size is given as (width, height) in pixels. In both cases, width comes first.

However, height comes first in numpy (thus OpenCV), Tensorflow, PyTorch (the image convention is (N, C, H, W) ) and conventional matrix notation in mathematics.
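
A small sketch of this ordering difference using Pillow and numpy:

import numpy as np
from PIL import Image

img = Image.new("RGB", (640, 480))  # PIL takes (width, height)
print(img.size)                     # (640, 480): width first

arr = np.array(img)
print(arr.shape)                    # (480, 640, 3): height first in numpy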

Figure size (figsize) determines the size of the figure in inches. This gives the amount of space the axes (and other elements) have inside the figure. The default figure size is (6.4, 4.8) inches in matplotlib 2.

Dots per inches (dpi) determines how many pixels the figure comprises. The default dpi in matplotlib is 100.

For example, a figure of figsize = (w, h) will have

px, py = w * dpi, h * dpi  # pixels
# e.g.
# 6.4 inches * 100 dpi = 640 pixels

In matplotlib, the image size (in pixels) cannot be too large: it must be less than $2^{16}=65536$ pixels in each direction.

seaborn

For seaborn version 0.10.1,

# This import will not change the style for the rest of the session
import seaborn as sns

# This command will change the style for the rest of the session  
sns.set()

# This command will restore the style  
sns.reset_orig()

One can use the style context manager which sets a style temporarily:

style_name = "seaborn-darkgrid"

with plt.style.context(style_name):
    pass 

Valid style names include: "seaborn-darkgrid", "seaborn-whitegrid", "seaborn-dark", "seaborn-white", "seaborn-ticks", "seaborn-bright", "seaborn-colorblind", "seaborn-dark-palette", "seaborn-paper", "seaborn-poster", "seaborn-talk", "bmh", "classic", "dark_background", "fivethirtyeight", "ggplot", "grayscale", etc.

Set font size for seaborn. The value of font_scale should be proportional to the figure size.

# sns.set(...) will also influence matplotlib's behavior

sns.set(font_scale=6)
...
sns.reset_orig()

Plot pairplot

# height plays the role of figsize

sns.pairplot(df, diag_kind="hist", kind="kde", height=15)

pairplot = sns.pairplot(df, diag_kind="hist", kind="scatter", height=15)

Save figure

fig = plt.figure(figsize=(xxx, yyy))
...
fig.savefig(image_path, dpi=fig.dpi)

For seaborn:

pairplot = sns.pairplot(...)
pairplot.fig.savefig(image_path, dpi=pairplot.fig.dpi)

References

[1] Matplotlib. (2005, October 14). Wikipedia, the free encyclopedia. Retrieved April 3, 2020, from https://en.wikipedia.org/wiki/Matplotlib

[2] Pyplot — Matplotlib 2.0.2 documentation. (n.d.). Matplotlib: Python plotting — Matplotlib 3.2.1 documentation. https://matplotlib.org/api/pyplot_api.html

[3] Matplotlib, Pyplot, Pylab etc: What’s the difference between these and when to use each? (2019, August 31). queirozf.com. https://queirozf.com/entries/matplotlib-pylab-pyplot-etc-what-s-the-different-between-thes

Read more »

Remove unused categories in Pandas series

Posted on 2020-03-25 | In Python
| Words count in article 687
df = pd.read_csv("train.csv")
df.set_index("id", inplace=True)

categorical_cols = ["province", "country"]
df.loc[:, categorical_cols] = df.loc[:, categorical_cols].astype("category")

print(df.info())
Int64Index: 17040 entries, 1 to 26379
Data columns (total 3 columns):
province    7800 non-null category
country     17040 non-null category
cases    		17040 non-null float64
dtypes: category(2), float64(1)
memory usage: ...
None
df["country"]
Id
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
5        Afghanistan
            ...     
26375         Zambia
26376         Zambia
26377         Zambia
26378         Zambia
26379         Zambia
Name: country, Length: 17040, dtype: category
Categories (163, object): [Afghanistan, Albania, Algeria, Andorra, ..., Uzbekistan, Venezuela, Vietnam, Zambia]

For the country column, there are 17040 entries. Each entry is one of the 163 categories.

Let’s keep only 3 categories which interest us:

countries = ["China", "France", "Italy"]
df2 = df.loc[df["country"].isin(countries), :]
print(df2.info())
Int64Index: 42 entries, 4431 to 12429
Data columns (total 3 columns):
province    41 non-null category
country    	42 non-null category
cases    		42 non-null float64
dtypes: category(2), float64(1)
memory usage: 13.4 KB
None
df2["country"]
Id
4431      China
4524      China
4617      China
4710      China
4803      China
4896      China
4989      China
5082      China
5175      China
5268      China
5361      China
5454      China
5547      China
5640      China
5733      China
5826      China
5919      China
6012      China
6105      China
6198      China
6291      China
6384      China
6477      China
6570      China
6663      China
6756      China
6849      China
6942      China
7035      China
7128      China
7221      China
7314      China
7407      China
9453     France
9546     France
9639     France
9732     France
9825     France
9918     France
10011    France
10104    France
12429     Italy
Name: country, dtype: category
Categories (163, object): [Afghanistan, Albania, Algeria, Andorra, ..., Uzbekistan, Venezuela, Vietnam, Zambia]
df2["country"].unique()
[China, France, Italy]
Categories (3, object): [China, France, Italy]
df2["country"].value_counts()
China          33
France          8
Italy           1
Zambia          0
Greece          0
               ..
Netherlands     0
Nepal           0
Namibia         0
Morocco         0
Afghanistan     0
Name: country, Length: 163, dtype: int64
df2.groupby(["country"])["cases"].sum()
country
Afghanistan            0.0
Albania                0.0
Algeria                0.0
Andorra                0.0
Antigua and Barbuda    0.0
                      ... 
Uruguay                0.0
Uzbekistan             0.0
Venezuela              0.0
Vietnam                0.0
Zambia                 0.0
Name: cases, Length: 163, dtype: float64

Now we observe that even though only 42 entries remain, there are still 163 categories, as before.

To solve this, we can do the following:

df2.groupby(df2["country"].cat.remove_unused_categories())["cases"].sum()
country
China     81305.0
France    14427.0
Italy     53578.0
Name: cases, dtype: float64

Let’s see what’s going on.

df2["country"].cat is a pandas.core.arrays.categorical.CategoricalAccessor object. The method Series.cat.remove_unused_categories() removes categories which are not used and returns a Series (DataFrame column).

The above can also be done by:

df2.assign(country=df2["country"].cat.remove_unused_categories()).groupby("country")["cases"].sum()

or

df2["country"] = df2["country"].cat.remove_unused_categories()
df2.groupby("country")["cases"].sum()
Read more »

statsmodels memo

Posted on 2020-03-23 | In Python
| Words count in article 2635

Import statsmodels:

import statsmodels.api as sm

Ordinary Least Squares

Fit OLS:

regression_results = sm.OLS(y_train, X_train).fit()
# regression_results is a 
# statsmodels.regression.linear_model.RegressionResultsWrapper
# Replace X_train by X_train.assign(const=1) if needed

Summarize the regression results:

print(regression_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                revenue   R-squared:                       0.604
Model:                            OLS   Adj. R-squared:                  0.604
Method:                 Least Squares   F-statistic:                     2140.
Date:                Tue, 24 Mar 2020   Prob (F-statistic):               0.00
Time:                        17:29:10   Log-Likelihood:                -55237.
No. Observations:                2804   AIC:                         1.105e+05
Df Residuals:                    2801   BIC:                         1.105e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
budget         2.4760      0.047     52.976      0.000       2.384       2.568
popularity    2.9e+06   1.57e+05     18.433      0.000    2.59e+06    3.21e+06
const      -1.415e+07   2.19e+06     -6.467      0.000   -1.84e+07   -9.86e+06
==============================================================================
Omnibus:                     1912.035   Durbin-Watson:                   2.027
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            57009.692
Skew:                           2.809   Prob(JB):                         0.00
Kurtosis:                      24.364   Cond. No.                     5.88e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.88e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Forward variable selection

def forward_variable_selection(X, y, early_stop_pvalue=None, k=None, 
                               const_col_name="const"):
    """
    Forward variable selection for regression problems
    
    X, y: pandas DataFrame
    
    If X has already been augmented by a constant column, 
    that column must be named by const_col_name. Otherwise 
    no columns should be named by const_col_name. 
    """
    if k is None: 
        k = X.shape[1]
        if const_col_name in X.columns:
            k -= 1 
    used = {const_col_name: True} 
    # initial target: residuals of y regressed on a constant column
    target = sm.OLS(y, pd.DataFrame(1, index=X.index, columns=[const_col_name])).fit().resid
    selected = []
    for _ in range(k):
        tuples = [] 
        for feature in X.columns:
            if not used.get(feature, False): 
                regression_results = sm.OLS(target, X.loc[:, [feature]]).fit()
                tuples.append((feature, regression_results.resid, 
                               regression_results.tvalues.iloc[0], 
                               regression_results.pvalues.iloc[0])) 
        chosen_tuple = max(tuples, key=lambda x: x[2]) 
        if early_stop_pvalue is not None:
            if chosen_tuple[3] > early_stop_pvalue:
                break 
        selected.append(chosen_tuple[0])
        used[chosen_tuple[0]] = True
        target = chosen_tuple[1]
    return selected

selected = forward_variable_selection(X_train, y_train)

Leverage statistics

High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.

In the linear regression model, the leverage score for the i-th observation is defined as:

\[h_{ii} = [H]_{ii}\]

the i-th diagonal element of the projection matrix $H$ where

\[H = X(X^{\intercal}X)^{-1}X^{\intercal}\]
leverage = regression_results.get_influence().hat_matrix_diag

As leverage is independent of the label vector y, one can instead do the following:

def build_leverage(df, visualize=True, xticks_labels=None, rotation=85, 
                   title=None, augment_const=True):
    """
    If constant columns has not already been augmented, use augment_const=True, 
    otherwise use augment_const=False. 
    """
    if augment_const:
        X = df.assign(const=1)
    else:
        X = df
    # leverage depends only on X, so any target works; regress a constant vector on X
    regression_results = sm.OLS(pd.Series(1.0, index=X.index), X).fit()
    leverage = regression_results.get_influence().hat_matrix_diag
    if visualize:
        plt.plot(leverage)
        plt.grid()
        if title is not None:
            plt.title(title)
        if xticks_labels is not None:
            plt.xticks(range(len(xticks_labels)), xticks_labels, rotation=rotation)
        plt.show()
    return leverage

leverage = build_leverage(X_train)

Variance inflation factor (VIF) for multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor

def build_vif(df):
    vif = pd.DataFrame()
    vif["feature"] = df.columns
    vif["vif"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    vif.set_index("feature", inplace=True)
    return vif
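
A possible usage sketch (assuming X_train is the training DataFrame used above; a VIF above roughly 5–10 is a common rule-of-thumb sign of multicollinearity):

vif = build_vif(X_train.assign(const=1))
print(vif.sort_values("vif", ascending=False))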

Visualize long table (of vif) in Jupyter notebook:

with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(vif)

Logistic Regression

Fit logistic regression:

binary_results = sm.Logit(y_train , X_train).fit(method="bfgs", maxiter=1000)
# binary_results is a 
# statsmodels.discrete.discrete_model.BinaryResultsWrapper
# Replace X_train by X_train.assign(const=1) if needed

Check all available output which is dependent on the solver:

binary_results.mle_retvals

Summarize the regression results:

print(binary_results.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:               Survived   No. Observations:                  623
Model:                          Logit   Df Residuals:                      612
Method:                           MLE   Df Model:                           10
Date:                Sat, 04 Jan 2020   Pseudo R-squ.:                  0.3209
Time:                        15:19:25   Log-Likelihood:                -278.95
converged:                       True   LL-Null:                       -410.79
Covariance Type:            nonrobust   LLR p-value:                 7.167e-51
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Age              -0.0324      0.009     -3.494      0.000      -0.051      -0.014
SibSp            -0.2659      0.118     -2.244      0.025      -0.498      -0.034
Parch            -0.1157      0.144     -0.802      0.422      -0.398       0.167
Fare              0.0038      0.003      1.246      0.213      -0.002       0.010
const             3.5595      0.549      6.486      0.000       2.484       4.635
Pclass_2         -0.3538      0.356     -0.993      0.321      -1.052       0.345
Pclass_3         -1.6995      0.357     -4.757      0.000      -2.400      -0.999
Sex_male         -2.6076      0.236    -11.040      0.000      -3.071      -2.145
Embarked_Q       -0.1484      0.451     -0.329      0.742      -1.032       0.735
Embarked_S       -0.6639      0.285     -2.326      0.020      -1.223      -0.105
Embarked_null     9.3897    407.838      0.023      0.982    -789.958     808.738
=================================================================================
print(binary_results.summary2())
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.321     
Dependent Variable: Survived         AIC:              579.8962  
Date:               2020-01-04 15:19 BIC:              628.6762  
No. Observations:   623              Log-Likelihood:   -278.95   
Df Model:           10               LL-Null:          -410.79   
Df Residuals:       612              LLR p-value:      7.1671e-51
Converged:          1.0000           Scale:            1.0000    
-----------------------------------------------------------------
               Coef.  Std.Err.    z     P>|z|    [0.025   0.975] 
-----------------------------------------------------------------
Age           -0.0324   0.0093  -3.4939 0.0005   -0.0505  -0.0142
SibSp         -0.2659   0.1185  -2.2438 0.0248   -0.4981  -0.0336
Parch         -0.1157   0.1442  -0.8024 0.4223   -0.3982   0.1669
Fare           0.0038   0.0030   1.2464 0.2126   -0.0022   0.0097
const          3.5595   0.5488   6.4864 0.0000    2.4839   4.6351
Pclass_2      -0.3538   0.3564  -0.9926 0.3209   -1.0524   0.3448
Pclass_3      -1.6995   0.3573  -4.7565 0.0000   -2.3998  -0.9992
Sex_male      -2.6076   0.2362 -11.0399 0.0000   -3.0705  -2.1447
Embarked_Q    -0.1484   0.4509  -0.3292 0.7420   -1.0322   0.7353
Embarked_S    -0.6639   0.2854  -2.3261 0.0200   -1.2233  -0.1045
Embarked_null  9.3897 407.8381   0.0230 0.9816 -789.9584 808.7378
=================================================================

Test classification performance:

from sklearn.metrics import accuracy_score, roc_auc_score

def test_perf(X_test, y_test, y_score):
    accuracy = accuracy_score(y_test, y_score > 0.5)
    roc_auc = roc_auc_score(y_test, y_score)
    print("Accuracy: {:.4}\nROC_AUC: {:.4}".format(accuracy, roc_auc))

test_perf(X_test, y_test, binary_results.predict(X_test))

Regularized Logistic Regression

Fit regularized logistic regression:

l1binary_results = sm.Logit(y_train, X_train)\
.fit_regularized(method="l1_cvxopt_cp", alpha=1e-2, maxiter=1000)
# l1binary_results is a 
# statsmodels.discrete.discrete_model.L1BinaryResultsWrapper
# Replace X_train by X_train.assign(const=1) if needed

Summarize the regression results:

print(l1binary_results.summary())
print(l1binary_results.summary2())
Read more »

Pandas memo

Posted on 2020-03-23 | In Python
| Words count in article 1722

Read a dataset and set the index in place:

df = pd.read_csv("train.csv")
df.set_index("id", inplace=True)

Save a dataset as a csv file:

df.to_csv("blabla.csv")
df.to_csv("blabla.csv", sep="\t", encoding="utf-8")

Visualize DataFrame:

print(df.info())
df

Convert object (string) columns to Categorical columns:

categorical_cols = ["...", "...", "..."]
df.loc[:, categorical_cols] = df.loc[:, categorical_cols].astype("category")

Convert float columns to int columns (as long as there are no missing values):

int_cols = ["...", "...", "..."]
df.loc[:, int_cols] = df.loc[:, int_cols].astype("int")

Convert Categorical columns into dummy columns:

df = pd.get_dummies(df, drop_first=True)
# drop_first: whether to get k-1 dummies out of k categorical 
# levels by removing the first level.

df = pd.get_dummies(df, columns=["...", "..."])
# If columns is None (default) then all the columns with 
# object or category dtype will be converted.

Count null values of each column:

df.isnull().sum() 

Select rows with one or more null values:

df.loc[df.isnull().any(axis=1)] 

Fill missing values:

df["name"].fillna("unknown", inplace=True)
df["age"].fillna(df["age"].mean(), inplace=True)
df["A"].fillna(method="ffill", inplace=True)

Remove missing values:

df.dropna(axis=0, how="any", inplace=True) # drop rows
df.dropna(axis=1, how="any", inplace=True) # drop columns

Remove columns/features:

df.drop(["...", "...", "..."], axis=1, inplace=True)

Remove duplicate rows of DataFrame:

df.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True) 

Add interaction term:

df["C"] = df["A"] * df["B"]

Reset the index to the default integer index:

df.reset_index(inplace=True) # the old index is added as a column
df.reset_index(drop=True, inplace=True) # drop the old index 

Convert pandas column to DateTime:

df["date"] =  pd.to_datetime(df["date"])

Select rows between two dates:

df.loc[(df["date"] >= "2020-03-17") & (df["date"] < "2020-03-20"), :]

Select rows if value in column is in a list of values:

df.loc[df["country"].isin(["China", "France"]), :]

# to get the opposite, use ~
df.loc[~df["country"].isin(["China", "France"]), :]

Sort by the values along either axis:

df.sort_values("A", ascending=True, inplace=False, axis=0)

Sort object by labels (along an axis):

df.sort_index(ascending=True, inplace=True, axis=0)

Replace values:

# replaces 0 by 5
df.replace(0, 5, inplace=True)    
# replaces Inf by NaN 
df.replace([np.inf, - np.inf], np.nan, inplace=True)  

Normalize features:

# normalize features to a given range
from sklearn.preprocessing import MinMaxScaler
# normalize features by removing the mean and scaling to unit variance
from sklearn.preprocessing import StandardScaler

scaler = MinMaxScaler() # StandardScaler()
# Let's assume df has not been augmented 
# by the constant column yet.
df = pd.DataFrame(scaler.fit_transform(df), 
columns=df.columns, index=df.index)
df = df.assign(const=1)

Standard database join operations between DataFrame or named Series objects:

df3 = df1.merge(df2, on=["..."], how="inner") # SQL inner join
df3 = df1.merge(df2, on=["..."], how="left")  # SQL left outer join
df3 = df1.merge(df2, on=["..."], how="right") # SQL right outer join
df3 = df1.merge(df2, on=["..."], how="outer") # SQL full outer join
df3 = df1.merge(df2, left_index=True, right_index=True)

# If both objects are Series or named Series, consider:
df = pd.merge(s1, s2, left_index=True, right_index=True, how="outer")

Concatenate DataFrame objects:

# vertically concatenate two DataFrames
df = pd.concat([df1, df2], axis=0, ignore_index=True) 

Split features X and labels y:

y = df.loc[:, "y"]
X = df.drop("y", axis=1)

Split training set and test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
test_size=0.3, random_state=42)

Add a column of constant 1’s:

X = X.assign(const=1)
# Used for statsmodels
# The added column's name is "const"

Swap/reorder columns of DataFrame objects:

df = df.reindex(columns=["...", "..."])

Visualize the number of distinct values that each feature can take:

df.nunique()

Visualize the number of distinct values that each feature can take and the corresponding data type:

unique_counts = pd.DataFrame([(col, df[col].dtype, df[col].nunique()) \
                              for col in df.columns], columns=["col_name", "dtype", "nunique"])\
.set_index("col_name").sort_values(by=['nunique'])

unique_counts

Visualize counts of unique values in descending order of frequency:

df["A"].value_counts()

For a certain column, parse string representations to lists of elements:

import ast
col_names = ["genres", "spoken_languages"]
for col in col_names:
    df[col] = df[col].apply(lambda x: ast.literal_eval(x)) 

For a certain column, convert lists of strings to dummies:

for col in ["genres", "production_countries", "spoken_languages"]:
    df = df.join(df[col].str.join('|').str.get_dummies()\
    .add_prefix(col + "_")).drop(col, axis=1)
# before: the genres column contains lists, e.g. [Comedy, Drama, Family, Romance], [Drama, Thriller], ...
# after:  dummy columns genres_Comedy, genres_Drama, genres_Family, genres_Romance, ...

Get a subset of the DataFrame’s columns based on the column dtypes:

df.select_dtypes(include=["object", "bool", "float64", "int"])

Make plots of Series or DataFrame (matplotlib is used by default):

df.plot(x="...", y="...", kind="line")
plt.show()

Make histograms:

df["age"].hist(bins=10)
plt.show()

Generate descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max for each feature) of a GroupBy object:

df.groupby(["country"]).describe()
df.groupby(["country"])["cases"].describe()
df.groupby(["country"])[["cases", "fatalities"]].describe()

MultiIndex indexing:

                                    cases  fatalities
country 			 date                                  
China          2020-01-22             548          17
               2020-01-23             643          18
...                                   ...         ...
Italy          2020-03-17           31506        2503
               2020-03-18           35713        2978
df.loc["China", :]
df.loc[(slice(None), "2020-01-23"), :]
df.loc[(slice(None), slice("2020-01-23", "2020-01-25")), :]

Revert from MultiIndex to single index dataframe:

# level: only remove the given levels from the index

# integer position
df.reset_index(level=[1, 3], inplace=True) 
# the name of the level
df.reset_index(level=["...", "...", "..."], inplace=True) 

Transform multiple columns to MultiIndex:

df.set_index(["A", "B"], inplace=True)

Slice a MultiIndex DataFrame with a condition based on the index:

l0 = df.index.get_level_values(0) # level 0 index
l1 = df.index.get_level_values(1) # level 1 index 
df.loc[(l0 == "foo") | ((l0=="bar") & (l1=="two")), :]

Exchange index level of a MultiIndex DataFrame:

df = df.swaplevel(0, 1)
df = df.swaplevel("...", "...")
df = df.swaplevel("...", 1)
df = df.swaplevel(0, "...")

# However, calling this method does not 
# change the ordering of the values.

# To solve this, do the following: 
# "..." stands for the index level name (str)
df.sort_values("...", ascending=True, inplace=True, axis=0)

Rename a Series:

s.rename("...", inplace=True)

Convert Series to DataFrame:

df = s.to_frame()

Build a pd.Timestamp (datetime) object:

time_stamp = pd.to_datetime("2020-03-21")

Add date offset to a pd.Timestamp object:

time_stamp = pd.to_datetime("2019-02-19")  # 2019-02-19 00:00:00
time_stamp += pd.DateOffset(days=54)       # 2019-04-14 00:00:00
time_stamp += pd.DateOffset(minutes=23849) # 2019-04-30 13:29:00

Slice a dataframe based on datetime index (if datetime column has been set as the index):

df.loc["2009-05-01" : "2010-03-01", :]
df.loc["2012" : "2012-02-18", :]

Conversion of pd.Timedelta Series, pd.TimedeltaIndex, and pd.Timedelta scalars:

# All the following conversions output float64

# to days 
td / np.timedelta64(1, 'D')
(pd.to_datetime("2020-03-21") - df["date"]) / np.timedelta64(1, 'D')
td.astype('timedelta64[D]')

# to seconds 
td / np.timedelta64(1, 's')
td.astype('timedelta64[s]')

# to months (these are constant months)
td / np.timedelta64(1, 'M')
time_delta = pd.to_datetime("2021-02-19") - pd.to_datetime('2020-06-18')
print(time_delta) # 246 days 00:00:00
print(time_delta.days) # 246
print(time_delta.components) 
# Components(days=246, hours=0, minutes=0, seconds=0, 
# milliseconds=0, microseconds=0, nanoseconds=0)

Rolling window calculations:

# suppose df is the following DataFrame with shape (5, 1)

# rolling window calculations return DataFrames with 
# exactly the same shape

   column_1
0         0
1         1
2         2
3         3
4         4
df.rolling(3).sum()
   column_1
0       NaN
1       NaN
2       3.0
3       6.0
4       9.0
df.rolling(3, min_periods=1).sum()
   column_1
0       0.0
1       1.0
2       3.0
3       6.0
4       9.0
df.rolling(3, center=True).sum()
   column_1
0       NaN
1       3.0
2       6.0
3       9.0
4       NaN
df.rolling(3, min_periods=1, center=True).sum()
   column_1
0       1.0
1       3.0
2       6.0
3       9.0
4       7.0

Count the number of True/False in each row/column:

# suppose df is a DataFrame that has either 
# True or False as value

# For pd.DataFrame.apply(), 
# axis=0/"index", apply function to each column
# axis=1/"columns", apply function to each row 

# find out how many True/False are there in each column 
df.apply(pd.Series.value_counts, axis=0)

# find out how many True/False are there in each row 
df.apply(pd.Series.value_counts, axis=1)

Print all rows and all columns of a DataFrame

with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
    print(df)
Read more »

C++ ternary conditional

Posted on 2020-02-07 | In C++
| Words count in article 122

In C++, ternary conditional/operator ?: can be understood by the following:

int a = 2, b = 1;
int c = a > b ? a-- : b--;
cout << a << ", " << b << ", " << c << endl; // 1, 1, 2
int a = 2, b = 1;
int c = a <= b ? a-- : b--;
cout << a << ", " << b << ", " << c << endl; // 2, 0, 1
Read more »

C++ universal/uniform/brace initialization

Posted on 2020-01-21 | In C++
| Words count in article 271

Universal/uniform/brace initialization introduced in C++11

vector<vector<int>> res;
res.push_back({0, 1});
// res[0] is a vector
vector<vector<int>> res;
res.push_back(vector<int>{0, 1});
// res[0] is a vector
vector<list<int>> res;
res.push_back(list<int>{0, 1});
// res[0] is a list
struct Student{
    int id, grade;
};

int main(){
    Student s{1, 68};
    
    vector<Student> v;
    v.push_back(Student{2, 90});
    v.push_back({3, 56});
    cout << v.size() << endl; // 2
    for(auto e: v) cout << e.id << ", "; cout << endl; // 2, 3,
}

One trap of universal initialization is shown here:

vector<int> v1(4, 100); // [100, 100, 100, 100], fill constructor
vector<int> v2{4, 100}; // [4, 100], equivalent to vector<int> v = {4, 100};

vector<vector<int>> vv;
vv.push_back(vector<int>(4, 100));
vv.push_back(vector<int>{4, 100});
for(auto e: vv[0]) cout << e << ", "; cout << endl; // 100, 100, 100, 100, 
for(auto e: vv[1]) cout << e << ", "; cout << endl; // 4, 100, 

Read more »

Probability

Posted on 2019-12-21 | In Mathematics
| Words count in article 1071

Gaussian distribution

One-dimensional case:

\[\mathbb{P}_{\mu, \sigma}(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp(-\frac{(x-\mu)^2}{2\sigma^2})\]

$d$-dimensional case:

\[\mathbb{P}_{\mu, \Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp(-\frac{1}{2}(x-\mu)^\intercal \Sigma^{-1} (x-\mu))\]

68–95–99.7 rule

The 68–95–99.7 rule is a shorthand used to remember the percentage of values that lie within one, two and three standard deviations of the mean in a normal distribution.

68_95_997_rule

The three-sigma rule states that even for non-normally distributed variables, at least 88.8% of cases should fall within properly calculated three-sigma intervals. It follows from Chebyshev’s inequality. For unimodal distributions the probability of being within the interval is at least 95% by the Vysochanskij–Petunin inequality. There may be certain assumptions for a distribution that force this probability to be at least 98%.
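For example, Chebyshev’s inequality with $k = 3$ gives this bound directly:

\[\mathbb{P}(|X - \mu| \geqslant k\sigma) \leqslant \frac{1}{k^2} \;\Rightarrow\; \mathbb{P}(|X - \mu| < 3\sigma) \geqslant 1 - \frac{1}{9} \approx 88.9\%\]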

Laplace distribution

One-dimensional case:

\[\mathbb{P}_{\mu, b}(x) = \frac{1}{2b} \exp(-\frac{|x - \mu|}{b})\]

Mean: $\mu$

Variance: $2b^2$

Chi-squared distribution

$\chi^2$ distribution with $k$ degrees of freedom is denoted by $\chi_k^2$.

Mean: $k$

Variance: $2k$

Let $X_i \sim \mathcal{N}(0, 1)$, $X_i$’s are independent. $\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i$.

\[\sum_{i=1}^n X_i^2 \sim \chi_n^2\] \[\sum_{i=1}^n (X_i - \overline{X})^2 \sim \chi_{n-1}^2\]

Student’s t-distribution

Let $X_i \sim \mathcal{N}(\mu, \sigma^2)$, $X_i$’s are i.i.d.. $\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i$, $S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline{X})^2$.

\[\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \sim \mathcal{N}(0, 1)\] \[t = \frac{\overline{X} - \mu}{\frac{S}{\sqrt{n}}}\]

$t$ follows Student’s t-distribution with $n-1$ degrees of freedom.

Mean: $0$ for $\nu > 1$ degrees of freedom, otherwise undefined

F-distribution

F-distribution with parameters $d_1$ and $d_2$ is denoted by $F(d_1, d_2)$.

If $X_1 \sim \chi_{d_1}^2$ and $X_2 \sim \chi_{d_2}^2$ are independent, then

\[\frac{\frac{X_1}{d_1}}{\frac{X_2}{d_2}} \sim F(d_1, d_2)\]

Mean: $\frac{d_2}{d_2 - 2}$ for $d_2 > 2$

Poisson distribution

Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

A discrete random variable $X$ is said to have a Poisson distribution with parameter $\lambda > 0$, if $\forall k = 0, 1, 2, …,$

\[\mathbb{P}_{\lambda}(X = k) = \frac{\lambda^k e^{- \lambda} }{k!}\]

Mean: $\lambda$

Variance: $\lambda$

Bernoulli distribution

\[\mathbb{P}_{p}(X=1) = 1 - \mathbb{P}_{p}(X=0) = p\]

Mean: $p$

Variance: $p (1 - p)$

A Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted.

Binomial distribution 二项分布

\[\mathbb{P}_{n, p}(X = k) = {n \choose k} p^k (1 - p)^{n - k}\]

Mean: $n p$

Variance: $n p (1 - p)$

Geometric distribution

The geometric distribution is either one of two discrete probability distributions:

  • The probability distribution of the number $X$ of Bernoulli trials needed to get one success, supported on the set ${1, 2, 3, …}$.
  • The probability distribution of the number $Y = X − 1$ of failures before the first success, supported on the set ${0, 1, 2, …}$.

Which of these is called “the” geometric distribution is a matter of convention and convenience. These two different geometric distributions should not be confused with each other.

The probability that the first occurrence of success requires $k$ independent trials, each with success probability $p$ (i.e. the probability that the $k$-th trial is the first success):

\[\mathbb{P}(X = k) = (1-p)^{k-1} p\]

Mean: $\frac{1}{p}$

Variance: $\frac{1-p}{p^2}$

The number of failures $Y$ until the first success:

\[\mathbb{P}(Y = k) = (1-p)^{k} p\]

Mean: $\frac{1 - p}{p}$

Variance: $\frac{1-p}{p^2}$

Laws

Law of total probability

Also known as:

  • 全概率定理

law_of_total_probability.png

Law of total expectation

Also known as:

  • 全期望值定理
  • law of iterated expectations
  • tower rule
  • Adam’s law
  • smoothing theorem
  • 双重期望值定理
  • 重叠期望值定理

If $X$ is a random variable whose expected value $\mathbb{E}[X]$ is defined, and $Y$ is any random variable on the same probability space, then

\[\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X | Y]]\]

i.e., the expected value of the conditional expectation of $X$ given $Y$ is the same as the expected value of $X$.

Law of total variance

Also known as:

  • variance decomposition formula
  • conditional variance formulas
  • law of iterated variances
  • Eve’s law

If $X$ and $Y$ are random variables on the same probability space, and the variance of $Y$ is finite, then

\[\mathbb{Var}[Y] = \mathbb{E}[\mathbb{Var}[Y | X]] + \mathbb{Var}[\mathbb{E}[Y | X]]\]

For statisticians, the two terms are the “unexplained” and the “explained” components of the variance respectively. These two components are also the source of the term “Eve’s law”, from the initials EV VE for “expectation of variance” and “variance of expectation”.

Read more »

Machine Learning quick notes

Posted on 2019-12-21 | In Mathematics
| Words count in article 4145

By convention, $X \in \mathbb{R}^{n\times p}$ where $n$ denotes the number of samples, $p$ denotes the number of features. $x = (x_1, x_2, …, x_n)$.

PCA

Recall SVD, for any matrix $X\in \mathbb{R}^{n\times p}$,

\[X = U\Sigma V^\intercal\]

where $U\in \mathbb{R}^{n\times n}$, $V \in \mathbb{R}^{p\times p}$, $V^\intercal$ is the transpose of $V$. The columns of $U$ are called left singular vectors, the columns of $V$ are called right singular vectors. $U$ and $V$ are orthogonal matrices.

To compute PCA for $X\in \mathbb{R}^{n\times p}$,

  • For each column/feature of $X$, subtract the mean. Optionally, one can also divide each column by its standard deviation, which seems to be recommended. Call the resulting matrix $\tilde{X}\in \mathbb{R}^{n\times p}$.
  • Compute SVD of $\tilde{X}$, one obtains $\tilde{X} = U\Sigma V^\intercal$. The columns of $V$ (right singular vectors) are called principal components. Equivalently, this is to compute eigendecomposition of $\tilde{X}^\intercal \tilde{X} \in \mathbb{R}^{p\times p}$, i.e. $\tilde{X}^\intercal \tilde{X} = P S P^{-1}$. $\tilde{X}^\intercal \tilde{X}$ is the empirical covariance matrix between features up to a normalization constant ($\frac{1}{n}$ or $\frac{1}{n - 1}$ if using Bessel’s correction). The right singular vectors of $\tilde{X}$ are the eigenvectors of $\tilde{X}^\intercal \tilde{X}$, that is, $\tilde{X}^\intercal \tilde{X} = V\Sigma U^\intercal U\Sigma V^\intercal = V\Sigma^2 V^\intercal = V S V^\intercal = V S V^{-1} = P S P^{-1}$. By convention, the singular values are sorted in descending order, singular vectors are sorted accordingly.
  • $X^* = \tilde{X} V = U\Sigma \in \mathbb{R}^{n\times p}$ is the transformed version of $X$ by PCA. So far, one uses a new orthogonal basis to represent data $X$. If one seeks for dimensionality reduction, one only keeps the $k$ principal components which correspond to the $k$ largest singular values. That is, $V_k \in \mathbb{R}^{p\times k}$, $X_k^* = \tilde{X} V_k \in \mathbb{R}^{n\times k}$.
  • For unseen test data $Z\in \mathbb{R}^{m\times p}$, normalize it with the statistics (means, standard deviations) of training set and then do the matrix multiplication $Z_k^* = \tilde{Z}V_k \in \mathbb{R}^{m\times k}$. One should neither use statistics of test set to normalize nor compute the statistics using the whole dataset (training + test). The principal components are of course those computed from training set.

For symmetric positive semidefinite matrices such as the empirical covariance matrix $\tilde{X}^\intercal \tilde{X}$, the SVD and the eigendecomposition give the same result, they are mathematically equivalent. Numerically, this might not be the case because they can use different algorithms. Andrew Ng says that SVD is numerically more stable than eigendecomposition. An attempt to explain this is that the SVD implementation seems to be using a divide-and-conquer approach (LAPACK), while the eigendecomposition uses a plain QR algorithm. This explanation remains to be confirmed.
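
A minimal numpy sketch of these steps (the variable names and the random data below are mine, purely for illustration):

import numpy as np

def pca_fit_transform(X_train, X_test, k):
    # center and scale using training statistics only
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    X_tilde = (X_train - mean) / std

    # SVD of the centered training data; the rows of Vt are the principal components
    U, S, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    V_k = Vt[:k].T                             # shape (p, k)

    X_train_k = X_tilde @ V_k                  # shape (n, k)
    X_test_k = ((X_test - mean) / std) @ V_k   # test data uses the training statistics
    return X_train_k, X_test_k, V_k

rng = np.random.default_rng(0)
A, B = rng.normal(size=(100, 5)), rng.normal(size=(20, 5))
A_k, B_k, components = pca_fit_transform(A, B, k=2)
print(A_k.shape, B_k.shape, components.shape)  # (100, 2) (20, 2) (5, 2)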

MLE & MAP

Maximum likelihood estimation (MLE):

\[\begin{equation} \begin{split} \hat{\theta}_{\text{MLE}} &= \underset{\theta}{\operatorname{argmax}} \log \mathbb{P}_\theta (x) = \underset{\theta}{\operatorname{argmax}} \log \mathbb{P} (x | \theta) \\ &= \underset{\theta}{\operatorname{argmax}} \sum_i \log \mathbb{P} (x_i | \theta) \end{split} \end{equation}\]

Maximum a posteriori (MAP):

\[\begin{equation} \begin{split} \hat{\theta}_{\text{MAP}} &= \underset{\theta}{\operatorname{argmax}} \log \mathbb{P} (\theta | x) \\ &= \underset{\theta}{\operatorname{argmax}} \log \frac{\mathbb{P} (x | \theta) \mathbb{P}(\theta)}{\mathbb{P}(x)} \\ &= \underset{\theta}{\operatorname{argmax}} \log \mathbb{P} (x | \theta) \mathbb{P}(\theta) \\ &= \underset{\theta}{\operatorname{argmax}} \sum_i \log \mathbb{P} (x_i | \theta) + \log \mathbb{P}(\theta) \end{split} \end{equation}\]

If the prior $\mathbb{P}(\theta)$ is uniform, then MAP reduces to MLE. MAP can be seen as MLE augmented with a regularization term. If the prior is assumed to be centered Gaussian distribution $\mathcal{N}(\boldsymbol{0}, \frac{1}{\lambda} I)$, then this is $L2$ regularization, that is, $-\log \mathbb{P}(\theta) = \frac{\lambda}{2}\vert\vert\theta\vert\vert_2^2$. If the prior is assumed to be centered Laplace distribution, then this is $L1$ regularization.
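
For instance, with the Gaussian prior $\mathcal{N}(\boldsymbol{0}, \frac{1}{\lambda} I)$ in dimension $d$:

\[-\log \mathbb{P}(\theta) = -\log \left( \frac{1}{(2\pi / \lambda)^{d/2}} \exp(-\frac{\lambda}{2} \vert\vert\theta\vert\vert_2^2) \right) = \frac{\lambda}{2} \vert\vert\theta\vert\vert_2^2 + \text{const}\]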

Linear regression

There are 4 principal assumptions which justify the use of linear regression models for purposes of inference or prediction [1]:

  • linearity and additivity of the relationship between dependent variables (target variables) and independent variables (features)
    • the expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed
    • the slope of that line does not depend on the values of the other variables
    • the effects of different independent variables on the expected value of the dependent variable are additive
  • statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)
  • homoscedasticity (constant variance, 同方差性) of the errors
    • versus time (in the case of time series data)
    • versus the predictions
    • versus any independent variable
  • normality of the error distribution

If any of these assumptions is violated, then the forecasts, confidence intervals, and scientific insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.

\[y = X\theta^* + \epsilon\]

Ordinary least squares (OLS)

\[\hat{\theta} \in \underset{\theta}{\operatorname{argmin}} \frac{1}{2} \vert\vert X\theta - y \vert\vert_2^2\]

Method 1:

\[\nabla_\theta \frac{1}{2} \vert\vert X\theta - y \vert\vert_2^2 = X^\intercal (X\theta - y) = 0\]

Method 2:

\[f(\theta) = \frac{1}{2} \vert\vert X\theta - y \vert\vert_2^2 = \frac{1}{2} \theta^\intercal X^\intercal X\theta - \langle \theta, X^\intercal y \rangle + \frac{1}{2} \vert\vert y \vert\vert^2\] \[f(\theta + h) = f(\theta) + \langle h, X^\intercal X\theta - X^\intercal y \rangle + \frac{1}{2} h^\intercal X^\intercal Xh\]

According to the first-order Taylor series approximation $f(\theta + h) = f(\theta) + \langle h, \nabla f(\theta) \rangle + o(h)$, $\nabla f(\theta) = X^\intercal X\theta - X^\intercal y $.

Method 3:

One aims to find $\theta$ such that $y = X\theta$. However, as $y \notin \text{im}(X)$, where $\text{im}(X)$ denotes the column space of $X$, one instead solves $\text{Proj}_{\text{im}(X)}(y) = X\theta$.

\[\begin{equation} \begin{split} & X\theta - y = \text{Proj}_{\text{im}(X)}(y) - y \\ & \text{Proj}_{\text{im}(X)}(y) - y \in \text{im}(X)^\perp \end{split} \end{equation}\] \[\Downarrow\] \[X\theta - y \in \text{im}(X)^\perp\]

As $\text{im}(X)^\perp = \ker(X^\intercal)$, $X^\intercal (X\theta - y) = 0$.

projection_onto_space

Theorem: If the columns of $X$ are linearly independent, then $X^\intercal X$ is invertible and $X^\intercal X \theta = X^\intercal y$ has a unique solution $\hat{\theta} = (X^\intercal X)^{-1}X^\intercal y$.

Proof: Let $\theta \in \ker(X^\intercal X)$, that is, $X^\intercal X \theta = 0$. Then,

\[\begin{equation}\begin{split} & \theta^\intercal X^\intercal X\theta = \vert\vert X\theta \vert\vert^2 = 0 \\ \Rightarrow & X\theta = \theta_1 \text{col}_1(X) + \theta_2 \text{col}_2(X) + ... = 0 \end{split}\end{equation}\]

As the columns of $X$ are linearly independent, $\theta = 0$, $\ker(X^\intercal X) = {0}$.

Under the Gauss–Markov assumptions, the OLS estimator is the best linear unbiased estimator (BLUE). Ridge regression has higher bias and lower variance than OLS.

Ridge regression

\[\hat{\theta} \in \underset{\theta}{\operatorname{argmin}} \frac{1}{2} \vert\vert X\theta - y \vert\vert_2^2 + \frac{\lambda}{2} \vert\vert \theta \vert\vert_2^2\] \[\nabla_\theta \frac{1}{2} \vert\vert X\theta - y \vert\vert_2^2 + \frac{\lambda}{2} \vert\vert \theta \vert\vert_2^2 = 0 \Rightarrow \hat{\theta} = (X^\intercal X + \lambda I)^{-1}X^\intercal y\]

Coefficient of determination

The most general definition of the coefficient of determination is

\[R^2 = 1 - \frac{\text{residual sum of squares}}{\text{total sum of squares}} = 1 - \frac{\text{残差平方和}}{\text{总平方和}} = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \overline{y})^2} = 1 - \frac{\frac{1}{n} \sum_i (y_i - f_i)^2}{\frac{1}{n} \sum_i (y_i - \overline{y})^2} = 1 - \frac{\text{mean squared error}}{\text{variance}}\]

where

\[\overline{y} = \frac{1}{n} \sum_i^n y_i\]

$y_1$, $y_2$, …, $y_n$ denotes the $n$ target variables, $f_1$, $f_2$, …, $f_n$ denotes the $n$ predicted values by our model.

Values of $R^2$ outside the range $0$ to $1$ can occur when the model fits the data worse than a horizontal hyperplane.

$R^2$ (R squared) can be interpreted as the proportion of the variance in the target variable (label) that is predictable from the features. It is $1$ minus the “normalized” mean squared error (normalized by the variance of the target/label variables).
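
A small numerical check of this identity (random data, purely for illustration):

import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y = rng.normal(size=200)                  # targets
f = y + rng.normal(scale=0.5, size=200)   # hypothetical predictions

r2_manual = 1 - np.mean((y - f) ** 2) / np.var(y)
print(np.isclose(r2_manual, r2_score(y, f)))  # True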

Logistic regression

Suppose $w, x \in \mathbb{R}^{p+1}$, the bias term is included. Capital letters indicate random variables.

\[\mathbb{P}(Y=1 | x) = \frac{1}{1 + \exp{(-w^\intercal x)} }\] \[w^\intercal x = \log \frac{\mathbb{P}(Y=1 | x)}{1 - \mathbb{P}(Y=1 | x)} = \text{logit}(\;\mathbb{P}(Y=1 | x)\;)\]

The function $\text{logit}(\cdot)$ is called log-odds or logit function.

Use MLE to estimate the parameter $w$.

If $y\in {0, 1}$,

\[\begin{equation} \begin{split} \underset{w}{\mathrm{argmax}}\; L(w) &= \underset{w}{\mathrm{argmax}}\; \log \prod_{i} \mathbb{P}(Y=1 | x_i)^{y_i} (1 - \mathbb{P}(Y=1 | x_i)\;)^{1-y_i} \\ &= \underset{w}{\mathrm{argmax}}\; \sum_i y_i \log \mathbb{P}(Y=1 | x_i) + (1-y_i) \log(1 - \mathbb{P}(Y=1 | x_i) \;) \\ &= \underset{w}{\mathrm{argmax}}\; \sum_i y_i \log \frac{\mathbb{P}(Y=1 | x_i)}{1 - \mathbb{P}(Y=1 | x_i)} + \log(1 - \mathbb{P}(Y=1 | x_i) \;) \\ &= \underset{w}{\mathrm{argmin}}\; \sum_i \log(1 + \exp(w^\intercal x_i)\;) - y_i w^\intercal x_i \end{split} \end{equation}\] \[\begin{equation} \begin{split} \nabla_w L(w) &= \sum_i \frac{1}{1 + \exp(- w^\intercal x_i)\;} x_i - y_i x_i \\ &= \sum_i (\frac{1}{1 + \exp(- w^\intercal x_i)\;} - y_i) x_i \end{split} \end{equation}\]

If $y\in {-1, 1}$,

\[\begin{equation} \begin{split} \underset{w}{\mathrm{argmax}}\; L(w) &= \underset{w}{\mathrm{argmax}}\; \log \prod_i \mathbb{P}(Y=y_i | x_i) \\ &= \underset{w}{\mathrm{argmax}}\; \log \prod_i \frac{1}{1 + \exp(-y_i w^\intercal x_i)} \\ &= \underset{w}{\mathrm{argmin}}\; \sum_i \log(1 + \exp(-y_i w^\intercal x_i)\;) \end{split} \end{equation}\] \[\begin{equation} \begin{split} \nabla_w L(w) &= \sum_i \frac{- y_i}{1 + \exp(y_i w^\intercal x_i)} x_i \end{split} \end{equation}\]

SVM

Suppose $w, x \in \mathbb{R}^{p}$, the bias term is not included. $y\in {-1, 1}$.

If the training data is linearly separable, hard-margin SVM.

The primal form:

\[\begin{equation}\begin{split} &\min_{w,b} \quad \quad \quad \frac{1}{2} ||w||^2 \\ &\text{s.t.} \quad 1 - y_i (w^\intercal x_i + b) \leqslant 0, \forall i \end{split}\end{equation}\]

The dual form:

\[\begin{equation}\begin{split} &\min_{\alpha} \quad \quad \quad \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\intercal x_j - \sum_i \alpha_i \\ &\text{s.t.} \quad \alpha_i \geqslant 0, \forall i \\ & \quad \quad \sum_i \alpha_i y_i = 0, \forall i \end{split}\end{equation}\]

There exists $j$ such that $\alpha_j^* > 0$,

\[w^* = \sum_i \alpha_i^* y_i x_i\] \[b^* = y_j - \sum_i \alpha_i^* y_i x_i^\intercal x_j\]

svm

If the training data is not linearly separable, soft-margin SVM.

The primal form:

\[\begin{equation}\begin{split} &\min_{w,b,\xi} \quad \quad \quad \frac{1}{2} ||w||^2 + C \sum_i \xi_i \\ &\text{s.t.} \quad 1 - y_i (w^\intercal x_i + b) \leqslant \xi_i, \forall i \\ & \quad \quad \xi_i \geqslant 0, \forall i \end{split}\end{equation}\]

The dual form:

\[\begin{equation}\begin{split} &\min_{\alpha} \quad \quad \quad \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\intercal x_j - \sum_i \alpha_i \\ &\text{s.t.} \quad 0 \leqslant \alpha_i \leqslant C, \forall i \\ & \quad \quad \sum_i \alpha_i y_i = 0, \forall i \end{split}\end{equation}\]

There exists $j$ such that $0 < \alpha_j^* < C$,

\[w^* = \sum_i \alpha_i^* y_i x_i\] \[b^* = y_j - \sum_i \alpha_i^* y_i x_i^\intercal x_j\]

Soft-margin SVM can be written as the hinge loss with regularization term.

\[\begin{equation}\begin{split} \min_{w,b} \sum_i \max(1-y_i (w^\intercal x_i + b), 0) + \lambda ||w||^2 \end{split}\end{equation}\]

Lagrangian function

\[\begin{equation}\begin{split} &\min_{x} \quad \quad \quad f(x) \\ &\text{s.t.} \quad g_i(x) \leqslant 0, \forall i \\ & \quad \quad h_j(x) = 0, \forall j \end{split}\end{equation}\]

The Lagrangian function is formed as follows:

\[\mathcal{L}(x,\mu,\lambda) = f(x) + \mu^\intercal g(x) + \lambda^\intercal h(x)\]

KKT conditions

  • First order primal optimality
\[\nabla_x \mathcal{L}(x, \mu, \lambda) = 0\]
  • Primal feasibility
\[g_i (x^*) \leqslant 0, \forall i\] \[h_j(x^*) = 0, \forall j\]
  • Dual feasibility
\[\mu_i \geqslant 0, \forall i\]
  • Complementary slackness
\[\mu_i g_i (x^*) = 0, \forall i\]

Tf–idf

One way to define tf–idf (term frequency–inverse document frequency) is the following:

\[w_{i,j} = \text{t}_{i,j} \log \frac{D}{\text{d}_i}\]

where $w_{i,j}$ is the tf-idf weight for token $i$ in document $j$, $t_{i,j}$ is the number of occurrences of token $i$ in document $j$, $d_{i}$ is the number of documents that contain token $i$, $D$ is the total number of documents.

However other ways of defining the term frequency and the document frequency exist, for example, normalization, smoothing, etc.
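
A minimal sketch of this particular definition (the toy corpus is made up; libraries such as scikit-learn use slightly different smoothing and normalization):

import math
from collections import Counter

documents = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["a", "cat"]]
D = len(documents)

# d_i: number of documents that contain token i
doc_freq = Counter(token for doc in documents for token in set(doc))

def tf_idf(doc):
    term_freq = Counter(doc)  # t_{i,j}: raw counts in this document
    return {token: count * math.log(D / doc_freq[token]) for token, count in term_freq.items()}

print(tf_idf(documents[1]))  # "dog" (rare) gets a higher weight than "the" (common)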

Convolutional neural networks

To compute the output width (resp. height) of a 2D convolutional layer, suppose the kernel size $F$, the stride $S$, the amount of zero padding $P$, the input width (resp. height) $I$, the output width (resp. height) $O$:

\[O = \frac{(I - F + 2P)}{S} + 1\]

To compute the output width (resp. height) of a 2D pooling layer, suppose the kernel size $F$, the stride $S$, the input width (resp. height) $I$, the output width (resp. height) $O$:

\[O = \frac{I - F}{S} + 1\]

A 2D convolutional layer with $F=3, P=1, S=1$ will not change the width (resp. height): $O=I$, a 2D pooling layer with $F=2, S=2$ will produce $O = \frac{I}{2}$ (without remainder).
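
A small helper implementing these two formulas (the function names are mine):

def conv2d_output_size(i, f, s=1, p=0):
    # convolutional layer: O = (I - F + 2P) / S + 1
    return (i - f + 2 * p) // s + 1

def pool2d_output_size(i, f, s):
    # pooling layer: O = (I - F) / S + 1
    return (i - f) // s + 1

print(conv2d_output_size(32, f=3, s=1, p=1))  # 32: F=3, P=1, S=1 keeps the width/height
print(pool2d_output_size(32, f=2, s=2))       # 16: F=2, S=2 halves the width/height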

Reinforcement learning (RL)

In applications where there is a natural notion of final time step, an episode is a (sub)sequence of agent–environment interactions, such as plays of a game, trips through a maze, or any sort of repeated interactions. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Tasks with episodes of this kind are called episodic tasks. On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks.

General definition of return:

return_rl.png

Definition of state value function for policy $\pi$:

definition_state_value_function.png

Definition of action value function for policy $\pi$:

definition_action_value_function.png

Bellman equations refer to a set of equations that decompose the value function into the immediate reward plus the discounted future values.

Bellman equation for state value functions (relationship between the value of a state and the values of its successor states):

bellman_equation_v.png
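
For reference, this equation in its standard form (notation may differ slightly from the figure) is:

\[v_\pi(s) = \sum_{a} \pi(a | s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right]\]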

Bellman optimality equation for state value functions:

bellman_optimality_equation_v.png

Bellman optimality equation for action value functions:

bellman_optimality_equation_q.png

Bellman equations according to Lilian Weng:

bellman_equations.png

The n-step return can be computed by truncating the sum of returns after n steps, and approximating the return from the remaining steps using a value function V(s):

n_step_return_rl.png

1-step return:

one_step_return_rl.png

In n-step returns, $n$ acts as a trade-off between bias and variance for the value estimator. While $R_t^{(∞)}$ provides an unbiased but high variance estimator, $R_t^{(1)}$ provides a biased but low variance estimator.

Another method to trade off between the bias and variance of the estimator is to use a λ-return, calculated as an exponentially-weighted average of n-step returns with decay parameter λ:

lambda_return_rl.png

λ = 0 reduces to the single-step return $R_t^{(1)}$, and λ = 1 recovers the Monte-Carlo return $R_t^{(∞)}$. Intermediate values of $\lambda \in (0, 1)$ produce interpolants that can be used to balance the bias and variance of the value estimator.
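For reference, the λ-return is commonly written as

\[R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}\]

so that the weights $(1 - \lambda) \lambda^{n-1}$ sum to 1.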

When the model is fully known, following Bellman equations, we can use Dynamic Programming (DP) to iteratively evaluate value functions and improve policy.

  • Policy Evaluation computes the state-value function $V_\pi$ for a given policy $\pi$.
  • Based on the value functions, Policy Improvement generates a better policy by acting greedily.
  • The Generalized Policy Iteration (GPI) algorithm refers to an iterative procedure to improve the policy when combining policy evaluation and improvement.

Monte-Carlo (MC) methods need to learn from complete episodes to compute returns and all the episodes must eventually terminate.

Similar to Monte-Carlo methods, Temporal-Difference (TD) Learning is model-free. However, TD learning can learn from incomplete episodes. TD learning methods update targets with regard to existing estimates rather than exclusively relying on actual rewards and complete returns as in MC methods. This approach is known as bootstrapping. Temporal difference (TD) methods learn a guess from a guess, they bootstrap.

TD(0), which is the simplest TD method:

td0_update_rl.png
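Written out (standard form, cf. [2]), the TD(0) update is

\[V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]\]

where $\alpha$ is the learning rate and the bracketed quantity is the TD error.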

Methods in which the temporal difference extends over $n$ steps are called n-step TD methods. We are free to pick any $n$ in TD learning as we like. The question then becomes: what is the best $n$? A common yet smart solution is to apply a weighted sum of all possible n-step TD targets rather than to pick a single best $n$. The weights decay by a factor of λ with $n$: the n-step target is weighted proportionally to $\lambda^{n-1}$. The intuition is similar to why we discount future rewards when computing the return: the further into the future we look, the less confident we are. This weighted sum of many n-step returns is called the λ-return.

TD(λ):

  • Updating the value function using the temporal difference computed with the λ-return results in the TD(λ) algorithm.
  • TD learning that adopts λ-return for value updating is labeled as TD(λ).
  • The TD(λ) algorithm can be understood as one particular way of averaging n-step TD targets. This average contains all the n-step TD targets, each weighted proportional to $\lambda^{n-1}$, and normalized by a factor of (1 − λ) to ensure that the weights sum to 1.

In the TD(λ) algorithm, the λ refers to the use of an eligibility trace. There are two ways to view eligibility traces:

  • The more theoretical view (the forward view) is that they are a bridge from TD to Monte Carlo methods. When TD methods are augmented with eligibility traces, they produce a family of methods spanning a spectrum that has Monte Carlo methods at one end and one-step TD methods at the other. In between are intermediate methods that are often better than either extreme method. The forward view is most useful for understanding what is computed by methods using eligibility traces.
  • The other way to view eligibility traces (the backward view) is more mechanistic. From this perspective, an eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. The trace marks the memory parameters associated with the event as eligible for undergoing learning changes. When a TD error occurs, only the eligible states or actions are assigned credit or blame for the error. Thus, eligibility traces help bridge the gap between events and training information. The backward view is more appropriate for developing intuition about the algorithms themselves.

The Generalized Advantage Estimator GAE(γ, λ) is defined as the exponentially-weighted average of the k-step advantage estimators (discounted by $\lambda$). It is also the infinite sum of future TD errors, exponentially discounted by $(\gamma \lambda)$.

  • Each k-step advantage estimator is defined to be the exponentially discounted sum of the next k TD errors (discounted by $\gamma$).
  • Each k-step advantage estimator is also a k-step estimate of the returns, minus a baseline term $V(s_t)$ (the value of the current state).

Estimating the advantage with the λ-return yields the Generalized Advantage Estimator (GAE). TD(λ) is an estimator of the value function, whereas GAE is estimating the advantage function.

Generalized Advantage Estimator GAE(γ, λ):

gae_rl.png

gae_rl_2.png

gae_special_cases_rl.png
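Below is a minimal C++ sketch (written for this note, not taken from [3]) of how the GAE advantages could be computed from per-step rewards and value estimates; it assumes a finite trajectory whose tail is bootstrapped with one extra value estimate, and the names computeGAE, rewards, values, gamma, lambda are illustrative.

#include <iostream>
#include <vector>
using namespace std;

// rewards[t]: reward received at step t (t = 0 .. T-1)
// values[t]:  value estimate V(s_t) (t = 0 .. T); values[T] bootstraps the tail
// Returns advantages[t] = sum_{l >= 0} (gamma * lambda)^l * delta_{t+l},
// where delta_t = rewards[t] + gamma * values[t+1] - values[t].
vector<double> computeGAE(const vector<double>& rewards, const vector<double>& values,
                          double gamma, double lambda){
  int T = rewards.size();
  vector<double> advantages(T, 0.0);
  double gae = 0.0;
  // Work backwards: A_t = delta_t + gamma * lambda * A_{t+1}
  for(int t = T - 1; t >= 0; t--){
    double delta = rewards[t] + gamma * values[t + 1] - values[t];
    gae = delta + gamma * lambda * gae;
    advantages[t] = gae;
  }
  return advantages;
}

int main(){
  vector<double> rewards = {1.0, 0.0, 1.0};
  vector<double> values  = {0.5, 0.4, 0.6, 0.0}; // length T + 1
  vector<double> adv = computeGAE(rewards, values, 0.99, 0.95);
  for(double a : adv) cout << a << " ";
  cout << endl;
}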

In GAE, both $\gamma$ and $\lambda$ contribute to the bias-variance tradeoff when using an approximate value function. However, they serve different purposes and work best with different ranges of values.

  • $\gamma$ most importantly determines the scale of the value function $V^{\pi, \gamma}$, which does not depend on $\lambda$.
  • Taking $\gamma < 1$ introduces bias into the policy gradient estimate, regardless of the value function’s accuracy.
  • Taking $\lambda < 1$ introduces bias only when the value function is inaccurate.
  • Empirically, we find that the best value of $\lambda$ is much lower than the best value of $\gamma$, likely because $\lambda$ introduces far less bias than $\gamma$ for a reasonably accurate value function.
  • In the original GAE paper, which mainly considers simulated robotic locomotion tasks, $\lambda$ in the range [0.9, 0.99] usually results in the best performance.

TD control:

  • SARSA: On-policy TD control. “SARSA” refers to the procedure of updating the Q-value by following a sequence of state, action, reward, state, action, etc. In each step of SARSA, we need to choose the next action according to the current policy.
  • Q-Learning: Off-policy TD control. The key difference from SARSA is that Q-learning does not follow the current policy to pick the second action (the update rules are sketched below).
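For reference, the corresponding one-step update rules (standard forms, cf. [2]) are:

\[\text{SARSA}: \; Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]\]

\[\text{Q-learning}: \; Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]\]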

References

[1] Testing the assumptions of linear regression. (n.d.). people.duke.edu. https://people.duke.edu/~rnau/testing.htm

[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second. The MIT Press, 2018. http://incompleteideas.net/book/the-book-2nd.html.

[3] Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” arXiv E-Prints, June 2015, arXiv:1506.02438.

[4] Weng, Lilian. A (Long) Peek into Reinforcement Learning. 2018. https://lilianweng.github.io/posts/2018-02-19-rl-overview/

[5] Peng, Xue Bin, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. “DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills.” arXiv, July 27, 2018. http://arxiv.org/abs/1804.02717.

Read more »

Venn diagrams

Posted on 2019-12-18 | In Mathematics
| Words count in article 25

types_matrices

positive definite $\subset$ positive semi-definite $\subset$ symmetric / Hermitian

The identity matrix is the only real matrix that is both positive definite and orthogonal.

correlation_dependence

chomsky_hierarchy

morphism

entropy_venn

conditional_mutual_information_venn

Read more »

Leetcode my posts

Posted on 2019-12-08 | In Misc
| Words count in article 474

274. H-Index

2 lines C++ solution, easy to understand

https://leetcode.com/problems/h-index/discuss/359325/2-lines-C%2B%2B-solution-easy-to-understand


46. Permutations

C++ clear general backtracking, beats 98.38% in time, 100.00% in memory

https://leetcode.com/problems/permutations/discuss/359476/C%2B%2B-clear-general-backtracking-beats-98.38-in-time-100.00-in-memory


47. Permutations II

Explaining the “used” array trick

https://leetcode.com/problems/permutations-ii/discuss/360071/Explaining-the-%22used%22-array-trick


200. Number of Islands

C++ DFS solution & 2D Union Find solution

https://leetcode.com/problems/number-of-islands/discuss/448901/C%2B%2B-DFS-solution-and-2D-Union-Find-solution


912. Sort an Array

C++ merge, insertion, bubble, quick, heap

https://leetcode.com/problems/sort-an-array/discuss/448903/C%2B%2B-merge-insertion-bubble-quick-heap


111. Minimum Depth of Binary Tree

C++ BFS & DFS

https://leetcode.com/problems/minimum-depth-of-binary-tree/discuss/448904/C%2B%2B-BFS-and-DFS


113. Path Sum II

Passing current path by reference or not

https://leetcode.com/problems/path-sum-ii/discuss/448905/Passing-current-path-by-reference-or-not


114. Flatten Binary Tree to Linked List

comparison of traversal order dfs/bfs on binary tree

https://leetcode.com/problems/flatten-binary-tree-to-linked-list/discuss/448906/comparison-of-traversal-order-dfsbfs-on-binary-tree


General discussion

Can I use non-const references in C++ during interviews (onsite or not) ?

https://leetcode.com/discuss/general-discussion/471408/Can-I-use-non-const-references-in-C%2B%2B-during-interviews-(onsite-or-not)


98. Validate Binary Search Tree

C++ 3 approaches: inorder traversal, DFS range check, BFS range check

https://leetcode.com/problems/validate-binary-search-tree/discuss/485751/c-3-approaches-inorder-traversal-dfs-range-check-bfs-range-check


69. Sqrt(x)

C++ binary search solution (two versions)

https://leetcode.com/problems/sqrtx/discuss/493465/C%2B%2B-binary-search-solution-(two-versions)


75. Sort Colors

C++ 6 approaches to solve the dutch national flag problem

https://leetcode.com/problems/sort-colors/discuss/499396/C%2B%2B-six-approaches-to-solve-the-dutch-national-flag-problem


146. LRU Cache

C++ brute force solution vs doubly linked list solution

https://leetcode.com/problems/lru-cache/discuss/500336/C%2B%2B-brute-force-solution-vs-double-linked-list-solution


37. Sudoku Solver

Why isn’t this C++ code working?

https://leetcode.com/problems/sudoku-solver/discuss/505166/Why-isn’t-this-C%2B%2B-code-working


18. 4Sum

C++ generalized to kSum

https://leetcode.com/problems/4sum/discuss/517139/C%2B%2B-generalized-to-kSum


23. Merge k Sorted Lists

C++ using priority queue

https://leetcode.com/problems/merge-k-sorted-lists/discuss/522375/C%2B%2B-using-priority-queue


25. Reverse Nodes in k-Group

C++ iterative solution & recursive solution

https://leetcode.com/problems/reverse-nodes-in-k-group/discuss/523641/C%2B%2B-iterative-solution-and-recursive-solution


127. Word Ladder

C++ fast one-end & two-end BFS

https://leetcode.com/problems/word-ladder/discuss/538333/C%2B%2B-fast-one-end-and-two-end-BFS


126. Word Ladder II

C++ two approaches

https://leetcode.com/problems/word-ladder-ii/discuss/539440/C%2B%2B-two-approaches


399. Evaluate Division

Python: weighted Union Find & BFS

https://leetcode.com/problems/evaluate-division/solutions/4688642/python-weighted-union-find-bfs/


Read more »

Matrix calculus, gradients

Posted on 2019-10-03 | In Mathematics
| Words count in article 411

Rules

\[\nabla (u^\intercal A) = \nabla (u^\intercal) A\] \[\nabla (u^\intercal v) = \nabla (u^\intercal) v + \nabla (v^\intercal) u\]

Derivative of cross entropy loss with softmax function

For a single data point, cross entropy loss:

\[L = - \sum_i y_i \log p_i\]

Softmax function:

\[p_i = \frac{ \exp (o_i) }{\sum_j \exp(o_j) }\] \[\begin{equation}\begin{split} \frac{\partial p_i}{\partial o_k} = \frac{ \partial \frac{ \exp (o_i) }{\sum_j \exp(o_j)} }{\partial o_k} = \frac{ \frac{\partial \exp(o_i)}{\partial o_k} \sum_j \exp(o_j) - \frac{\partial \sum_j \exp(o_j)}{\partial o_k} \exp (o_i) }{(\sum_j \exp(o_j))^2} \end{split}\end{equation}\]

If $i = k$,

\[\begin{equation}\begin{split} \frac{\partial p_i}{\partial o_k} &= \frac{ \exp (o_k) \sum_j \exp(o_j) - \exp (o_k) \exp (o_k) }{(\sum_j \exp(o_j))^2} \\ &= \frac{\exp (o_k)}{\sum_j \exp(o_j)} \frac{ \sum_j \exp(o_j) - \exp (o_k) }{\sum_j \exp(o_j)} \\ &= p_k (1 - p_k) \end{split}\end{equation}\]

If $i \neq k$,

\[\begin{equation}\begin{split} \frac{\partial p_i}{\partial o_k} &= \frac{ \frac{\partial \exp(o_i)}{\partial o_k} \sum_j \exp(o_j) - \frac{\partial \sum_j \exp(o_j)}{\partial o_k} \exp (o_i) }{(\sum_j \exp(o_j))^2} \\ &= \frac{ - \exp (o_k) \exp (o_i) }{(\sum_j \exp(o_j))(\sum_j \exp(o_j))} \\ &= - p_k p_i \end{split}\end{equation}\] \[\begin{equation}\begin{split} \frac{\partial p_i}{\partial o_k} = \begin{cases} p_k (1 - p_k) & \text{if }\; i=k \\ - p_k p_i & \text{if }\; i\neq k \end{cases} \end{split}\end{equation}\]

In consequence,

\[\begin{equation}\begin{split} \frac{\partial L}{\partial o_k} &= - \sum_i y_i \frac{\partial \log p_i}{\partial o_k} \\ &= - \sum_i y_i \frac{\partial \log p_i}{\partial p_i} \frac{\partial p_i}{\partial o_k} \\ &= - \sum_i \frac{y_i}{p_i} \frac{\partial p_i}{\partial o_k} \\ &= - y_k (1 - p_k) + \sum_{i\neq k} y_i p_k \\ &= - y_k + \sum_i y_i p_k \\ &= p_k - y_k \end{split}\end{equation}\]

$\sum_i y_i = 1$ because $y$ is a one-hot vector. $o$ is the vector of raw outputs of the neural network (the logits, before softmax activation).
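As a quick numerical sanity check of the result $\frac{\partial L}{\partial o_k} = p_k - y_k$, here is a minimal C++ sketch (written for this note, not part of the original derivation) that computes the softmax probabilities and this gradient for a small example:

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
using namespace std;

int main(){
  vector<double> o = {2.0, 1.0, 0.1};  // raw outputs (logits)
  vector<double> y = {0.0, 1.0, 0.0};  // one-hot label

  // softmax with max-subtraction for numerical stability
  double m = *max_element(o.begin(), o.end());
  vector<double> p(o.size());
  double Z = 0.0;
  for(size_t i = 0; i < o.size(); i++){ p[i] = exp(o[i] - m); Z += p[i]; }
  for(size_t i = 0; i < o.size(); i++) p[i] /= Z;

  // gradient of the cross entropy loss w.r.t. the logits: p_k - y_k
  for(size_t i = 0; i < o.size(); i++)
    cout << "dL/do_" << i << " = " << p[i] - y[i] << endl;
}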

Read more »

Useful websites

Posted on 2019-09-16 | In Misc
| Words count in article 400

Draw sequence diagrams in seconds using this free online tool.

https://www.websequencediagrams.com/


Tutorial about ray tracing. After this, look at Fundamentals of Computer Graphics

https://raytracing.github.io/


Pan Xiao’s summary of algorithms

https://panxiao.gitbook.io/lc/


Space and time Big-O complexities of common algorithms used in Computer Science

https://www.bigocheatsheet.com/


git - the simple guide

http://rogerdudler.github.io/git-guide/


Probability! An interactive introduction.

https://bookdown.org/probability/beta/


The GDELT Project:

monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies…

https://www.gdeltproject.org/


Introduction to Conditional Random Fields (this is a tech blog with many other interesting articles)

http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/


Quantopian’s lectures

https://www.quantopian.com/lectures


Stack Exchange - Sun Haozhe (Cross Validated, Stack Overflow, Mathematics, Robotics, etc.)

https://stats.stackexchange.com/users/239861/sun-haozhe


Dive into Deep Learning - An interactive deep learning book with code, math, and discussions - 李沐

https://d2l.ai/index.html


Preprocessing for deep learning: from covariance matrix to image whitening

https://hadrienj.github.io/posts/Preprocessing-for-deep-learning/


JPEG Image Quality in PIL

https://jdhao.github.io/2019/07/20/pil_jpeg_image_quality/


Lilian Weng’s blog

https://lilianweng.github.io/lil-log/archive.html


Notes which accompany the Stanford class CS231n: Convolutional Neural Networks for Visual Recognition

https://cs231n.github.io/


Stanford class CS224n: Natural Language Processing with Deep Learning

http://web.stanford.edu/class/cs224n/


What Every Computer Scientist Should Know About Floating-Point Arithmetic

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html


lossfunctions - They are a window to your model’s heart.

https://lossfunctions.tumblr.com/


Madcola blog (OCR, OpenCV, CUDA, etc.)

https://www.cnblogs.com/skyfsm/


CTC Algorithm Explained - Yudong’s blog

https://xiaodu.io/ctc-explained/


A website to create Venn diagram (visme)

https://dashboard.visme.co/v2/projects/own


Math 4120 (Modern Algebra), Summer I 2020 (online)

http://www.math.clemson.edu/~macaule/classes/m20_math4120/


Simply enter your words in the above box and press search to generate a list of possible acronyms.

http://acronymify.com/


.gitignore cheat sheet

https://github.com/kenmueller/gitignore


Free Mind

A nice blog (covers everything from mathematical theory to various code implementations and applications)

https://blog.pluskid.org/


Fan’s Farm

ZHENG Fan's (郑帆) blog (computer vision, SLAM, robotics, mobile robot navigation)

https://fzheng.me/


10000 Scientific Problems, Mathematics volume (10000 个科学难题 数学卷)

http://video.moe.gov.cn/kejisi/shuxue.pdf


10000 Scientific Problems, Information Science volume (10000 个科学难题 信息科学卷)

http://video.moe.gov.cn/kejisi/xinxikexue.pdf


PyPI Download Stats (Analytics for PyPI packages)

https://pypistats.org/


BibTeX Entry and Field Types - Bib-it A BibTEX manager.

http://bib-it.sourceforge.net/help/fieldsAndEntryTypes.php


Hiplot, a scientific plotting platform (科研绘图平台)

https://hiplot.com.cn/


Enabling multi-finger touchpad gestures and swipe-to-switch-apps support on Ubuntu

https://blog.csdn.net/xiaoduanayu/article/details/105779245


Curly Braces in Markdown with Jekyll

https://www.tomordonez.com/curly-braces-markdown-jekyll/


Read more »

Useful unix commands

Posted on 2019-09-14 | In Misc
| Words count in article 589

https://stackoverflow.com/questions/43702546/tensorboard-doesnt-show-all-data-points

tensorboard --samples_per_plugin scalars=0 --logdir xxx/

For Mac OS, the caffeinate command is used to prevent a Mac from going to sleep.

# stay until you tell the command to stop running 
# or close the terminal
caffeinate 

# wait for the process with ID 41734 to finish
caffeinate -w 41734  

# 18000 is the timeout value in seconds 
caffeinate -t 18000  

# Option -d prevents the display from sleeping.
# Option -i prevents the system from idle sleeping.
# Option -s prevents the system from sleeping. 
# Option -s is valid only when system is running on AC power.
# Some other options also exist. 

To get a grand-total summary of a directory's disk usage in human-readable format:

du -sh *
# 1.8G	XXX

Report file system disk space usage, for example to see the remaining disk space in human-readable format:

# information of the system
df -h

# information of a directory
df -h *

On Mac OS, check the CPU temperature (this tool needs to be installed first: sudo gem install istats)

istats 

On Mac OS, change the default shell to Zsh

chsh -s /bin/zsh

On Mac OS, change the default shell to Bash

chsh -s /bin/bash

Checksum on Mac OS

# user manual
shasum --help

# check sha1 sum 
shasum -a 1 [file_name]

zip using command line

# zip a directory
zip -r new_zip_file.zip target_directory/

# zip all the content in the current directory 
zip -r new_zip_file.zip . 

# -X excludes extra file attributes
zip -r -X new_zip_file.zip target_directory/

# zip several files 
zip new_zip_file.zip target_file_1 target_file_2

unzip using command line

# unzip into a separate new directory 
unzip -d name_of_a_new_directory blabla.zip

# unzip and put all the files directly 
# into the current directory
unzip blabla.zip

NVIDIA System Management Interface (nvidia-smi)

nvidia-smi

# If we just want the GPU name
nvidia-smi --query-gpu=name --format=csv,noheader

# The following will loop and call 
# the view at every second. 
nvidia-smi -l 1

Print newline, word, and byte counts for each file

wc

# count the number of lines
wc -l

# count the number of files in [name_of_directory]
ls [name_of_directory] | wc -l
man wc

-c, --bytes
    print the byte counts

-m, --chars
    print the character counts

-l, --lines
    print the newline counts
    
-w, --words
    print the word counts

Search pattern (regular expression)

grep

# print the line containing the pattern [pattern]
ls [name_of_directory] | grep [pattern]

Search for the occurrences of some text in all files in a folder.

grep -rl "blabla" [folder_path]

grep is a command that searches for text inside files. The -r option searches subdirectories recursively. The -l option only lists the names of the files that contain the matching text. "blabla" is the specific word to search for. [folder_path] is the path of the directory to search in.

To search in the current directory and its subdirectories:

grep -rl "blabla" .

--include="*.py" 选项表示只搜索扩展名为 .py 的文件。

grep -rl --include="*.py" "blabla" [folder_path]

Read more »

C++ code common algorithms

Posted on 2019-08-30 | In C++
| Words count in article 1016

C++ code common algorithms

Dijkstra

Assume N nodes, the source node is indexed by root, and the graph is stored as an adjacency list adj, where adj[u] contains pairs (weight, neighbor).

#include <climits> // INT_MAX
#include <vector>
#include <queue>   // priority_queue
#include <utility> // pair
typedef pair<int, int> pii; // weight goes first

// adj is the adjacency list: adj[u] holds pairs (edge weight, neighbor) for each edge from u
vector<int> dist(N, INT_MAX);
dist[root] = 0;
priority_queue<pii, vector<pii>, greater<pii>> q;
q.push(make_pair(0, root));
while(!q.empty()){
  pii current = q.top();
  q.pop();
  for(pii neighbor : adj[current.second]){
    if(dist[neighbor.second] > dist[current.second] + neighbor.first){
      dist[neighbor.second] = dist[current.second] + neighbor.first;
      q.push(make_pair(dist[neighbor.second], neighbor.second));
    }
  }
}

To recover the actual shortest path, introduce an array int Predecessor[N] and update it whenever dist[neighbor.second] changes, simply setting it to the new predecessor current.second.

Union Find

#include <map>
#include <utility> // pair

map<int, pair<int, int>> m; // map to parent & rank

void makeSet(int x, map<int, pair<int, int>>& m){
  m.insert(make_pair(x, make_pair(x, 0)));
}
    
int findSet(int x, map<int, pair<int, int>>& m){
  if(m[x].first == x) return x;
  else{
    m[x].first = findSet(m[x].first, m);
    return m[x].first; 
  }
}
    
void unionSet(int x, int y, map<int, pair<int, int>>& m){
  int parX = findSet(x, m);
  int parY = findSet(y, m);
  if(parX == parY) return;
  
  int rankX = m[parX].second; 
  int rankY = m[parY].second;
  if(rankX < rankY) m[parX].first = parY;
  else m[parY].first = parX;
  
  if(rankX == rankY) m[parX].second += 1; 
}

Catalan number

The first Catalan numbers for n = 0, 1, 2, 3, … are

1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452, …

catalan_formula

catalan_recursion
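A minimal C++ sketch (added here for illustration) that computes the first Catalan numbers using the recursion $C_0 = 1$, $C_{m} = \sum_{i=0}^{m-1} C_i \, C_{m-1-i}$:

#include <iostream>
#include <vector>
using namespace std;

int main(){
  int n = 15;
  vector<long long> catalan(n + 1, 0);
  catalan[0] = 1;
  // C_m = sum_{i=0}^{m-1} C_i * C_{m-1-i}
  for(int m = 1; m <= n; m++)
    for(int i = 0; i < m; i++)
      catalan[m] += catalan[i] * catalan[m - 1 - i];
  for(int m = 0; m <= n; m++) cout << catalan[m] << ", ";
  cout << endl; // 1, 1, 2, 5, 14, 42, 132, ...
}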

Fibonacci number

The first Fibonacci numbers for n = 0, 1, 2, 3, … are

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, …

fibonacci_formula

KMP

#include <iostream> // cout, endl
#include <vector>
#include <string>
using namespace std;

vector<int> buildKmpArray(string& pattern){
  vector<int> kmp(pattern.size() + 1, 0); // failure function (temporary array for KMP)
  kmp[0] = -1;
  int j = 0;
  for(int i = 1; i <= pattern.size(); i++){
    kmp[i] = j;
    while(j >= 0 && pattern[j] != pattern[i])
      j = kmp[j];
    j++;
  }
  return kmp;
}

// prints the starting index of every occurrence of pattern in txt
void KMP(string txt, string pattern) {
  vector<int> kmp = buildKmpArray(pattern);
  // search for pattern
  int j = 0;
  for(int i = 0; i < txt.size(); i++){
    while(j >= 0 && pattern[j] != txt[i])
      j = kmp[j];
    j++;
    if(j == pattern.size()){
      cout << "match found at " << i - pattern.size() + 1 << endl;
      j = kmp[j];
    }
  }
}

Binary search

int binary_search(vector<int>& nums, int target){
  int l = 0, r = nums.size() - 1, m = 0;
  while(l <= r){
    m = l + (r - l) / 2;
    if(nums[m] < target) l = m + 1;
    else if(nums[m] > target) r = m - 1;
    else return m;
  }
  return - 1;
}

argsort in C++

#include <iostream>
#include <vector>
#include <numeric>   // std::iota
#include <algorithm> // std::stable_sort
using namespace std;

vector<int> argsort(vector<int>& v){
  vector<int> idx(v.size());       // idx = {0, 0, 0, ..., 0};
  iota(idx.begin(), idx.end(), 0); // idx = {0, 1, 2, ..., v.size() - 1};
  stable_sort(idx.begin(), idx.end(), [&v](int i1, int i2){return v[i1] < v[i2];});
  return idx;
}

int main(){
  vector<int> v = {23, 40, 10, 15};
  vector<int> idx = argsort(v); 
  for(auto i: idx) cout << i << ", "; // 2, 3, 0, 1, 
}
Read more »

C++11 memo (headers, APIs)

Posted on 2019-08-23 | In C++
| Words count in article 4673

using namespace std;


typedef vector<vector<int>> graph; typedef pair<int, int> pii;


According to https://www.geeksforgeeks.org/c-data-types/

char 1 byte (8 bits)

short int 2 bytes (16 bits)

int 4 bytes (32 bits)

unsigned int 4 bytes (32 bits)

long long int 8 bytes (64 bits)

float 4 bytes (32 bits)

double 8 bytes (64 bits)


INT_MIN, INT_MAX: #include <climits>

INT_MIN -2147483648 = - (2^31)

INT_MAX 2147483647 = (2^31) − 1 ≈ 2.15 × 10^9 (about 2.15 billion)

65535 = (2^16) - 1


#include <cfloat>

cout << FLT_MIN << endl; // 1.17549e-38
cout << FLT_MAX << endl; // 3.40282e+38
cout << DBL_MIN << endl; // 2.22507e-308
cout << DBL_MAX << endl; // 1.79769e+308

FLT_MIN and the like belong to <cfloat>, not to <climits>.


numeric_limits<int>::min(), numeric_limits<int>::max(): #include <limits>

numeric_limits<int>::min() is the same as INT_MIN

numeric_limits<int>::max() is the same as INT_MAX

numeric_limits<float>::min() is the same as FLT_MIN

numeric_limits<double>::min() is the same as DBL_MIN

INT_MIN and the like belong to <climits>, not to <limits>.

numeric_limits<int>::min() and the like belong to <limits>, not to <climits>.


memset: #include <cstring>

int a[5]; 
   
// all elements of A are zero 
memset(a, 0, sizeof(a));

According to https://www.geeksforgeeks.org/memset-in-cpp/

We can use memset() to set all values as 0 or -1 for integral data types also. It will not work if we use it to set as other values. The reason is simple, memset works byte by byte.


fill_n: #include <algorithm>

#include <algorithm>    // std::fill_n
#include <vector>       // std::vector

std::vector<int> myvector (8, 10);        // myvector: 10 10 10 10 10 10 10 10
std::fill_n (myvector.begin(), 4, 20);     // myvector: 20 20 20 20 10 10 10 10

cout: #include <iostream> cin: #include <iostream> scanf: #include <cstdio> printf: #include <cstdio> (in practice, #include <iostream> often suffices for scanf and printf as well, see the note about rand() below)


priority_queue: #include <queue>

empty(), size(), top(), push(), pop()

int myints[]= {10,60,50,20};
std::priority_queue<int> first;
std::priority_queue<int> second (myints,myints+4);
std::priority_queue<int, std::vector<int>, std::greater<int> > third (myints,myints+4);

In C++, the default STL (Standard Template Library) priority queue is a max priority queue (it returns the largest element). Use greater as the comparison function to get a min priority queue, for example:

priority_queue<int, vector<int>, greater<int> >pq;

To use lambda expressions:

// Max priority queue, similar to "priority_queue<int> pq;"
auto cmp = [](ListNode* a, ListNode* b){return a->val < b->val;};
priority_queue<ListNode*, vector<ListNode*>, decltype(cmp)> max_pq(cmp);

// Min priority queue, similar to "priority_queue<int, vector<int>, greater<int>> pq;"
auto cmp = [](ListNode* a, ListNode* b){return a->val > b->val;};
priority_queue<ListNode*, vector<ListNode*>, decltype(cmp)> min_pq(cmp);

queue: #include <queue>

empty(), size(), front(), back(), push(), pop()


stack: #include <stack> empty(), size(), top(), push(), pop()


vector: #include <vector> size(), empty(), front(), back(), push_back(), pop_back()

// Returns an iterator pointing to the first element in the vector.
begin()

// Returns an iterator referring to the past-the-end element in the vector container. 
// The past-the-end element is the theoretical element that would follow the last element 
// in the vector. It does not point to any element, and thus shall not be dereferenced.
end() 

// the maximum number of elements that the vector can hold.
// This is the maximum potential size the container can reach due to 
// known system or library implementation limitations, but the container 
// is by no means guaranteed to be able to reach that size: it can still 
// fail to allocate storage at any point before that size is reached.
max_size()

// Resizes the container so that it contains n elements.
resize()

// Returns the size of the storage space currently allocated for the vector, expressed in terms of elements.
// This capacity is not necessarily equal to the vector size. It can be equal or greater, with the extra space 
// allowing to accommodate for growth without the need to reallocate on each insertion.
capacity()

// Requests that the vector capacity be at least enough to contain n elements.
reserve() 

// Returns a reference to the element at position n in the vector container.
[] 

// Returns a reference to the element at position n in the vector. 
// It automatically checks whether n is within the bounds of valid elements in the vector, 
//throwing an out_of_range exception if it is not
at() 

clear() // Removes all elements from the vector (which are destroyed), leaving the container with a size of 0.

// The vector is extended by inserting new elements before the element at the specified position, 
// effectively increasing the container size by the number of elements inserted.
// Because vectors use an array as their underlying storage, inserting elements in positions other 
// than the vector end causes the container to relocate all the elements that were after position to 
// their new positions. This is generally an inefficient operation compared to the one performed for 
// the same operation by other kinds of sequence containers (such as list or forward_list).
insert()

std::vector<int> myvector (3,100);
std::vector<int>::iterator it;
it = myvector.begin();
it = myvector.insert ( it , 200 );

// Removes from the vector either a single element (position) or a range of elements ([first,last)).
// Because vectors use an array as their underlying storage, erasing elements in positions other than the 
// vector end causes the container to relocate all the elements after the segment erased to their new positions. 
// This is generally an inefficient operation compared to the one performed for the same operation by other kinds 
// of sequence containers (such as list or forward_list).
erase() 

// erase the 6th element
myvector.erase (myvector.begin()+5);

// erase the first 3 elements:
myvector.erase (myvector.begin(), myvector.begin()+3);

Doubly-linked lists

list: #include <list>

begin(), end() // iterators
empty(), size(), max_size(), front(), back(), push_front(), pop_front(), push_back(), pop_back(), resize(), clear()

insert()

std::list<int> mylist;
std::list<int>::iterator it;
// set some initial values:
for (int i=1; i<=5; ++i) mylist.push_back(i); // 1 2 3 4 5
it = mylist.begin();
++it;       // it points now to number 2           ^
mylist.insert (it,10);                        // 1 10 2 3 4 5


erase()

int main ()
{
  std::list<int> mylist;
  std::list<int>::iterator it1,it2;

  // set some values:
  for (int i=1; i<10; ++i) mylist.push_back(i*10);

                              // 10 20 30 40 50 60 70 80 90
  it1 = it2 = mylist.begin(); // ^^
  advance (it2,6);            // ^                 ^
  ++it1;                      //    ^              ^

  it1 = mylist.erase (it1);   // 10 30 40 50 60 70 80 90
                              //    ^           ^

  it2 = mylist.erase (it2);   // 10 30 40 50 60 80 90
                              //    ^           ^

  ++it1;                      //       ^        ^
  --it2;                      //       ^     ^

  mylist.erase (it1,it2);     // 10 30 60 80 90
                              //        ^

  return 0;
}

Singly-linked lists forward_list: #include <forward_list>

begin(), end() // iterators
empty(), max_size(), front(), push_front(), pop_front(), insert_after(), erase_after(), resize(), clear()

pair: #include <utility> // std::pair, std::make_pair

std::pair <std::string, double> product1;                     // default constructor
std::pair <std::string, double> product2 ("tomatoes", 2.30);   // value init
std::pair <std::string, double> product3 (product2);          // copy constructor
product1 = std::make_pair(std::string("lightbulbs"), 0.99);   // using make_pair (move)
product2.first = "shoes";                  // the type of first is string
product2.second = 39.90;                   // the type of second is double

Sets are typically implemented as binary search trees. set: #include <set> begin(), end(), empty(), size(), max_size(), clear()

insert()
std::set<int> myset;
for (int i=1; i<=5; ++i) myset.insert(i*10);    // set: 10 20 30 40 50

erase()
std::set<int> myset;
myset.erase (40);
for(int i=1; i<10; i++) myset.insert(i*10);  // 10 20 30 40 50 60 70 80 90

find()
std::set<int> myset;
std::set<int>::iterator it;
for (int i=1; i<=5; i++) myset.insert(i*10);    // set: 10 20 30 40 50
myset.erase (myset.find(40));

count()
if (myset.count(x)) {... // x is in the set, count is 1 } 
else {... // count zero, i.e. x not in the set }

unordered_set: #include <unordered_set> begin(), end(), find(), count(), insert(), erase(), clear()


Maps are typically implemented as binary search trees. map: #include <map> empty(), size(), max_size(), [], at(), insert(), erase(), clear(), find(), count()

[]: If k matches the key of an element in the container, the function returns a reference to its mapped value. If k does not match the key of any element in the container, the function inserts a new element with that key and returns a reference to its mapped value. Notice that this always increases the container size by one, even if no mapped value is assigned to the element (the element is constructed using its default constructor).


unordered_map: #include <unordered_map> begin(), end(), [], at(), find(), count(), insert(), erase(), clear()


library

#include <algorithm>

find()
// using std::find with array and pointer:
int myints[] = { 10, 20, 30, 40 };
int * p;
p = std::find (myints, myints+4, 30);
if (p != myints+4) std::cout << "Element found in myints: " << *p << '\n';
else std::cout << "Element not found in myints\n";

count()
int myints[] = {10,20,30,30,20,10,10,20};   // 8 elements
int mycount = std::count (myints, myints+8, 10);
std::cout << "10 appears " << mycount << " times.\n";

swap()
#include <algorithm>    // std::swap
int x=10, y=20;                              // x:10 y:20
std::swap(x,y);                              // x:20 y:10
std::vector<int> foo (4,x), bar (6,y);       // foo:4x20 bar:6x10
std::swap(foo,bar);                          // foo:6x10 bar:4x20

reverse()
std::reverse(myvector.begin(),myvector.end());

rotate()
std::vector<int> myvector;
for (int i=1; i<10; ++i) myvector.push_back(i); // 1 2 3 4 5 6 7 8 9
std::rotate(myvector.begin(),myvector.begin()+3,myvector.end()); // 4 5 6 7 8 9 1 2 3

random_shuffle() 
// using built-in random generator:
std::random_shuffle ( myvector.begin(), myvector.end() );

sort()
sort(myvector.begin(), myvector.begin()+4);
bool myfunction (int i,int j) { return (i<j); }
std::sort (myvector.begin()+4, myvector.end(), myfunction);

stable_sort()
std::stable_sort (myvector.begin(), myvector.end());

is_sorted()
std::is_sorted(foo.begin(),foo.end())

lower_bound(), upper_bound()
int myints[] = {10,20,30,30,20,10,10,20};
std::vector<int> v(myints,myints+8);           // 10 20 30 30 20 10 10 20
std::sort (v.begin(), v.end());                // 10 10 10 20 20 20 30 30
std::vector<int>::iterator low,up;
low=std::lower_bound (v.begin(), v.end(), 20); //          ^
up= std::upper_bound (v.begin(), v.end(), 20); //                   ^
std::cout << "lower_bound at position " << (low- v.begin()) << '\n';
std::cout << "upper_bound at position " << (up - v.begin()) << '\n';

binary_search()
int myints[] = {1,2,3,4,5,4,3,2,1};
std::vector<int> v(myints,myints+9);                         // 1 2 3 4 5 4 3 2 1
std::sort (v.begin(), v.end());
std::cout << "looking for a 3... ";
if (std::binary_search (v.begin(), v.end(), 3))
    std::cout << "found!\n"; 
    else std::cout << "not found.\n";
    
merge()
int first[] = {5,10,15,20,25};
int second[] = {50,40,30,20,10};
std::vector<int> v(10);
std::sort (first,first+5);
std::sort (second,second+5);
std::merge (first,first+5,second,second+5,v.begin());
std::cout << "The resulting vector contains:";
for (std::vector<int>::iterator it=v.begin(); it!=v.end(); ++it)
    std::cout << ' ' << *it;
std::cout << '\n';

make_heap()
int myints[] = {10,20,30,5,15};
std::vector<int> v(myints,myints+5);
std::make_heap (v.begin(),v.end());
std::cout << "initial max heap   : " << v.front() << '\n'; // 30

min(), max()
std::cout << "min(1,2)==" << std::min(1,2) << '\n';
std::cout << "max(1,2)==" << std::max(1,2) << '\n';
  
// Returns an iterator pointing to the element with the smallest value in the range [first,last).
min_element(), max_element()
int myints[] = {3,7,2,5,6,4,9};
std::cout << "The smallest element is " << *std::min_element(myints,myints+7) << '\n';
std::cout << "The largest element is "  << *std::max_element(myints,myints+7) << '\n';

lexicographical_compare() 
char foo[]="Apple";
char bar[]="apartment";
std::cout << std::lexicographical_compare(foo,foo+5,bar,bar+9); // true

next_permutation()
// next_permutation example
#include <iostream>     // std::cout
#include <algorithm>    // std::next_permutation, std::sort
int main () {
  int myints[] = {1,2,3};

  std::sort (myints,myints+3);

  std::cout << "The 3! possible permutations with 3 elements:\n";
  do {
    std::cout << myints[0] << ' ' << myints[1] << ' ' << myints[2] << '\n';
  } while ( std::next_permutation(myints,myints+3) );

  std::cout << "After loop: " << myints[0] << ' ' << myints[1] << ' ' << myints[2] << '\n';

  return 0;
}

prev_permutation()
#include <iostream>     // std::cout
#include <algorithm> 
int main () {
  int myints[] = {1,2,3};

  std::sort (myints,myints+3);
  std::reverse (myints,myints+3);

  std::cout << "The 3! possible permutations with 3 elements:\n";
  do {
    std::cout << myints[0] << ' ' << myints[1] << ' ' << myints[2] << '\n';
  } while ( std::prev_permutation(myints,myints+3) );

  std::cout << "After loop: " << myints[0] << ' ' << myints[1] << ' ' << myints[2] << '\n';

  return 0;
}

less<T>(): #include <functional>

int foo[]={10,20,5,15,25};
std::sort (foo, foo+5, std::less<int>());  // 5 10 15 20 25

greater<T>(): #include <functional>

int numbers[]={20,40,50,10,30};
std::sort (numbers, numbers+5, std::greater<int>());

string: #include <string>

int main ()
{
  std::string str ("Test string");
  std::cout << "The size of str is " << str.length() << " bytes.\n";
  return 0;
}

Lambda function in C++11

vector<int> vec = { 1,2,3,4,5,6,7,8,9 }; // 1 2 3 4 5 6 7 8 9 
sort(vec.begin(), vec.end(), [](int a, int b){return a > b;}); // 9 8 7 6 5 4 3 2 1 

Macro definitions: #define

#define TABLE_SIZE 100 #define int32 uint32_t


const int MAXN = 21;


bool adj[MAXN][MAXN]; bool visited[51];


C++ coding competition style IO:

typedef long long ll;
ll N;
scanf("%lld", & N);
int scanf_ret, T;
int a[5];
while(true){
  scanf_ret = scanf("%d", &T);
  if(scanf_ret == EOF) break;
  for(int i = 0; i < 5; i++) scanf("%d", &a[i]);
}
scanf("%c", &c); // reads one character, won't ignore leading whitespace
scanf(" %c", &c); // The blank in the format string tells scanf to skip leading whitespace
printf("c = %c\n", c);
putchar('\n'); 
int T;
scanf("%d", &T);
printf("Network %d: %s\n\n", T, "DISD");

rand()

Returns a pseudo-random integral number in the range between 0 and RAND_MAX.

RAND_MAX is a constant defined in <cstdlib>. Test with http://cpp.sh/ shows that: cout << RAND_MAX << endl; // 2147483647

v1 = rand() % 100;         // v1 in the range 0 to 99
v2 = rand() % 100 + 1;     // v2 in the range 1 to 100
v3 = rand() % 30 + 1985;   // v3 in the range 1985-2014 

According to https://stackoverflow.com/questions/35910043/why-does-rand-compile-without-including-cstdlib-or-using-namespace-std :

rand() is defined in <cstdlib>. However, standard library header files may include other standard library header files. So iostream may include cstdlib directly or indirectly. In practice, it suffices to use #include<iostream>.

Header files with C-standard library equivalents (e.g. cstdlib) are allowed to bring C standard library names into the global namespace, that is, outside of the std namespace (e.g. rand). This is formally allowed since C++11, and was largely tolerated before.


abs()

abs() is defined in <cmath>. However, in practice, it suffices to use #include<iostream>.


Library bitset

#include <iostream>       // std::cout
#include <bitset>         // std::bitset
int main (){
  std::bitset<4> foo; // 4 bits init to 0 
  foo[1]=1;             // 0010
  foo[2]=foo[1];        // 0110
  std::cout << "foo: " << foo << '\n'; // foo: 0110 
  return 0;
}

Modulo 10^9+7 (1000000007):

https://www.geeksforgeeks.org/modulo-1097-1000000007/

  • prevent integer overflows
  • In some of the problems, to compute the result modulo inverse is needed and this number helps a lot because it is prime.

A few distributive properties of modulo are as follows:

  • ( a + b) % c = ( ( a % c ) + ( b % c ) ) % c
  • ( a * b) % c = ( ( a % c ) * ( b % c ) ) % c
  • ( a – b) % c = ( ( a % c ) – ( b % c ) ) % c

( a / b ) % c = ( ( a % c ) / ( b % c ) ) % c is not true!!!!
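A minimal C++ sketch (illustrative; the helper names addmod and mulmod are chosen for this example) showing how the distributive properties above are typically used with long long to avoid overflow:

#include <iostream>
using namespace std;

const long long MOD = 1000000007;

// (a + b) % MOD
long long addmod(long long a, long long b){
  return ((a % MOD) + (b % MOD)) % MOD;
}

// (a * b) % MOD; reducing the operands first keeps the product within long long range
long long mulmod(long long a, long long b){
  return ((a % MOD) * (b % MOD)) % MOD;
}

int main(){
  long long a = 123456789012345LL, b = 987654321098765LL;
  cout << addmod(a, b) << endl;
  cout << mulmod(a, b) << endl;
}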


struct in C++

struct TreeNode {
      int val;
      TreeNode* left;
      TreeNode* right;
      TreeNode(int x) : val(x), left(nullptr), right(nullptr) {}
 };

struct Tuple{
    TreeNode* node;
    int a, id;
    Tuple(int id, TreeNode* node, int a): id(id), node(node), a(a){}
};

struct Student{
    int id, grade;
    Student(int id, int grade): id(id), grade(grade){}
    
    bool operator<(Student other){
        return grade < other.grade;
    }
};

int main(){
    Tuple t1(1, new TreeNode(0), 10);
    Tuple* t2 = new Tuple(2, new TreeNode(5), 25);
    vector<Tuple> v = {t1, *t2, Tuple(3, new TreeNode(7), 15)};
    for(auto e: v) cout << e.id << ", "; cout << endl; // 1, 2, 3, 
    sort(v.begin(), v.end(), [](Tuple t1, Tuple t2){return t1.a < t2.a;});
    for(auto e: v) cout << e.id << ", "; cout << endl; // 1, 3, 2, 
    
    Student s1(1, 56);
    vector<Student> vv = {s1, Student(2, 30), Student(3, 99)};
    for(auto e: vv) cout << e.id << ", "; cout << endl; // 1, 2, 3,  
    sort(vv.begin(), vv.end());
    for(auto e: vv) cout << e.id << ", "; cout << endl; // 2, 1, 3, 
}

Array in C++ (raw array, as opposed to STL std::array in <array>)

int a[5];
int b[5] = {16, 2, 77, 40, 12071}; 
int c[] = {4, 57, 9, 2};
int d[]{10, 20, 30}; 

The compiler uses pointer arithmetic to access array elements: arr[i] is treated as *(arr + i) by the compiler. However, arrays and pointers are two different things in general.

int arr[] = {10, 20, 30, 40, 50, 60}; 
int *ptr = arr; 

cout << arr[2] << endl;    // 30
cout << *(arr + 2)<< endl; // 30
cout << ptr[2]<< endl;     // 30
cout << *(ptr + 2)<< endl; // 30

// sizof(int) * (number of element in arr[]) is printed 
cout << sizeof(arr) << endl; // 24

// sizeof a pointer is printed which is same for all type  
// of pointers (char *, void *, etc) 
cout << sizeof(ptr) << endl; // 8 

hash: #include <functional>

char a[] = "test";
string b(a);
int c = 0, d = 1, e = 100, f = - 23;
float g = 17.5;
double h = 17.5;

std::hash<char*> hashc;
std::hash<string> hashs;
std::hash<int> hashi;
std::hash<float> hashf;
std::hash<double> hashd;

cout << hashc(a) << endl; // 140723202892979
cout << hashs(a) << endl; // 15118982290295364091
cout << hashi(c) << endl; // 0
cout << hashi(d) << endl; // 1
cout << hashi(e) << endl; // 100
cout << hashi(f) << endl; // 18446744073709551593
cout << hashf(g) << endl; // 3757725912424151043  
cout << hashd(h) << endl; // 4693821169703498693

To create an unordered_map of user-defined class in C++:

  • implement const method operator== in the user-defined class
  • create a struct/class that contains const method operator()
struct Coordinate {
    int x, y;
    
    bool operator==(Coordinate other) const{
        return x == other.x && y == other.y;
    }
};

struct myHash{
    size_t operator()(Coordinate c) const{
      return std::hash<int>()(c.x + 1000 + c.y);
    }
};

int main(){
    unordered_map<Coordinate, Coordinate, myHash> pred;
    unordered_map<Coordinate, bool, myHash> used;
    
    Coordinate c{0, 1};
    cout << used[c] << endl; // 0
    used[c] = true;
    cout << used[c] << endl; // 1
}

enum Color{white, black, red, green, blue};

int main(){
    cout << white << endl; // 0
    cout << black << endl; // 1
    
    Color g = green;
    cout << g << endl; // 3
}

numeric_limits<double>::epsilon():

Machine epsilon (the difference between 1 and the least value greater than 1 that is representable). See also FLT_EPSILON, DBL_EPSILON.

double SquareRoot(double x) {
    double l = 1, r = x, m = 0;
    if(x < 1) swap(l, r);
    while( (r - l) / max( abs(l), abs(r) ) > numeric_limits<double>::epsilon() ){
      m = l + (r - l) / 2;
      if(m * m <= x) l = m;
      else r = m;
    }
    return l;
}

Read more »