CSML

Observation on some similarity between Riesz-Markov Theorem and Birkhoff theorem

2019-12-20T00:00:00-08:00

Two theorems at a glance.

Riesz-Markov Theorem

Theorem: Let $X$ be a locally compact Hausdorff space. For any positive linear functional $\psi$ on $C_c(X)$, the space of continuous functions with compact support (which coincides with $C(X)$ when $X$ is compact), there is a unique regular Borel measure $\mu$ on $X$ such that

\[\begin{equation} \forall f \in C_c(X), \quad \psi(f) = \int_X f(x) d\mu(x). \end{equation}\]

Birkhoff’s ergodic theorem

Theorem: Let $(X, \mathcal{B}, \mu, T)$ be an ergodic measure-preserving system, where $\mu$ is a probability measure and $T: X \to X$ is a measure-preserving transformation. For any $f \in \mathcal{L}_{\mu}^1$,

\[\begin{equation} \lim_{n\rightarrow \infty} \frac{1}{n} \sum_{i=0}^{n-1} f\circ T^i(x) = \int_X f d\mu \end{equation}\]

holds for $\mu$-almost every $x \in X$. Without ergodicity the limit still exists almost everywhere, but it need not equal the space average.
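A quick numerical check can make the statement concrete. The sketch below is only an illustration and not part of the original note: it assumes the irrational rotation $T(x) = x + \alpha \bmod 1$ on $[0,1)$ with Lebesgue measure (a standard ergodic example) and the test function $f(x) = \cos^2(2\pi x)$, whose space average is $1/2$. The names `birkhoff_average`, `rotation`, and `f` are hypothetical.

```python
import math

def birkhoff_average(f, T, x0, n):
    """Time average (1/n) * sum_{i=0}^{n-1} f(T^i(x0))."""
    total = 0.0
    x = x0
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n

ALPHA = (math.sqrt(5.0) - 1.0) / 2.0  # irrational rotation angle (assumed for this example)

def rotation(x):
    """Rotation of the circle [0, 1) by ALPHA; preserves Lebesgue measure."""
    return (x + ALPHA) % 1.0

def f(x):
    """Test function with space average 1/2 over [0, 1)."""
    return math.cos(2.0 * math.pi * x) ** 2

time_average = birkhoff_average(f, rotation, x0=0.1, n=100_000)
space_average = 0.5
print(time_average, space_average)
```

For this particular choice the time average should approach the space average from any starting point; in general the theorem only promises this for almost every $x$.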

Discussions

First, both right hand sides are the same.

Take the finite-$n$ average on the left-hand side of the second equation and let $\mathcal{K}$ denote the Koopman operator associated with $T$ on the measure-preserving system, i.e., $\mathcal{K} f = f \circ T$. Then we have

\[\begin{equation} \frac{1}{n} \sum_{i=0}^{n-1} f \circ T^i(x) = \left(\frac{1}{n} \sum_{i=0}^{n-1} \mathcal{K}^i\right) f \triangleq \bar{\mathcal{K}}_n f. \end{equation}\]

Since $\bar{\mathcal{K}}_n$ is a linear operator (and so is the corresponding limit) rather than a positive linear functional, one cannot directly apply the Riesz-Markov theorem to obtain Birkhoff’s theorem. However, I guess that ergodicity makes the linear operator, when evaluated pointwise at almost every $x$, resemble a linear functional. There are still differences that keep the two statements distinct.

I am noting this here so as not to confuse one with the other.

Sensitivity of warm-start in computing MultiTaskElasticNet path by coordinate descent in Sklearn

2019-01-26T00:00:00-08:00

Abstract

This post describes a phenomenon that we encountered in computing the MultiTaskElasticNet path, i.e., computing the coefficients of the MultiTaskElasticNet model while sweeping the sparsity regularization parameter $\alpha$. We solve the problem by manually setting up the warm starts, after which everything works as expected.

What kind of problem does MultiTaskElasticNet solve?

Linear regression

Many useful scientific problems encountered in practice can be cast as linear regression problems if the features are well designed. So let’s begin with a standard linear regression problem, with $N$ the number of data points, $M$ the dimension of the targets, and $P$ the number of features used. To find the $W$ that minimizes the sum of squared residuals, we simply solve the following problem,

\[\begin{equation} \min \lVert Y - XW \rVert^2_{F}, \end{equation}\]

where $X \in \mathbb{R}^{N \times P}$ are features, $Y \in \mathbb{R}^{N \times M}$ are targets, $W \in \mathbb{R}^{P \times M}$ are model coefficients.
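As a minimal sketch of this least-squares problem, the snippet below solves it with NumPy on synthetic data. The sizes, the random data, and the names `W_true` and `W_hat` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, M = 100, 5, 2                      # data points, features, targets (assumed sizes)

X = rng.standard_normal((N, P))          # features, shape (N, P)
W_true = rng.standard_normal((P, M))     # hypothetical ground-truth coefficients
Y = X @ W_true + 0.01 * rng.standard_normal((N, M))  # targets, shape (N, M)

# Minimize ||Y - X W||_F^2 over W; lstsq handles all M targets at once.
W_hat, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)

print(W_hat.shape)
print(np.abs(W_hat - W_true).max())
```

With many more observations than features and uncorrelated columns, the solution is unique and close to `W_true`; the difficulties discussed next appear when `X` is ill-conditioned.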

Unique and sparse solutions are preferred in modeling scientific problems

In general, the above problem can be viewed as a linear inverse problem, especially when it is ill-conditioned, either because observations are scarce or because the features are heavily correlated. In that case, simply solving the least-squares problem will not give a unique solution. However, the ground truth behind most inverse problems in the scientific community is unique. So what can we do? The standard procedure is regularization: let the model prefer a certain type of solution, for example a sparse one, which is ideal in most of our cases. This naturally leads to LASSO. It also helps with noisy data, since the model considers not only the MSE but also the sparsity of the solution $W$.

Further, when features are correlated, the optimization problem may still have no unique solution even with the $L_1$ sparsity penalty of LASSO, although the underlying truth is unique. To determine the solution uniquely, ElasticNet extends LASSO by adding an $L_2$ regularization term. One needs to tune the $L_2$ regularization carefully, though.

MultiTask learning: dominant features are shared across different tasks

Most of the time, in a multi-output (multi-task) linear regression problem, besides sparsity and uniqueness of the solution, another desired property that is often overlooked is that dominant features are shared across different tasks. As before, one can design a loss function that expresses a preference for this property. For example, consider the following penalty on $W$:

\[\begin{equation} \lVert W \rVert_{21} = \sum_{i} \sqrt{\sum_{j} W_{ij}^2 }. \end{equation}\]

This $L_{2,1}$ norm was proposed by Argyriou et al. in 2008. It can be thought of as first computing the 2-norm of each row and then computing the 1-norm of the resulting vector of row norms. In the same spirit as LASSO, the second step encourages zero rows in the solution $W$, i.e., it selects a small subset of features that is common to all tasks. Now let’s upgrade the previous ElasticNet problem into the following MultiTaskElasticNet, with the new objective function

\[\begin{equation} \frac{1}{2N} \lVert Y - XW \rVert_{F}^2 + \alpha c \lVert W \rVert_{21} + 0.5 \alpha (1 - c) \lVert W \rVert_{F}^2. \end{equation}\]
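To make the penalty and the objective concrete, here is a small NumPy sketch. It uses the $\frac{1}{2N}$ scaling of the data-fit term that the scikit-learn documentation states for MultiTaskElasticNet, with $c$ playing the role of `l1_ratio`. The helper names `l21_norm` and `mtenet_objective` are hypothetical and are not part of any library.

```python
import numpy as np

def l21_norm(W):
    """Sum over rows (features) of the Euclidean norm of each row."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def mtenet_objective(X, Y, W, alpha, c):
    """Data fit / (2N) + alpha*c*||W||_21 + 0.5*alpha*(1-c)*||W||_F^2."""
    n_samples = X.shape[0]
    data_fit = ((Y - X @ W) ** 2).sum() / (2.0 * n_samples)
    group_sparsity = alpha * c * l21_norm(W)
    ridge = 0.5 * alpha * (1.0 - c) * (W ** 2).sum()
    return data_fit + group_sparsity + ridge

# Three features, two tasks: row norms are 5, 0 and 1.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 1.0]])
print(l21_norm(W))

# With X = I and Y = W the data-fit term vanishes and only the penalties remain.
print(mtenet_objective(np.eye(3), W, W, alpha=2.0, c=0.5))
```

A zero row contributes nothing to the $L_{2,1}$ term, which is why the penalty favors dropping a feature for all tasks at once.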

Does it work? A toy example shows the sensitivity to warm start.

Download the case:

git clone https://github.com/pswpswpsw/example_sensetivity_initial_guess_MultiTaskElasticNet.git

where I have prepared a case with 1600 data points for a two-task regression with 14 features.

Our goal is to draw the coefficient path, i.e., to find the optimal $W$ for each value of the regularization coefficient $\alpha$ while keeping $c$ fixed at 0.5. To optimize the MultiTaskElasticNet loss function above, we simply use the Sklearn implementation of MultiTaskElasticNet. Alternatively, one can call enet_path.

Then run

python test.py

It is surprising to see that the results of MultiTaskElasticNet and enet_path differ, even with tuned optimization hyperparameters such as an increased number of iterations; the latter reaches a better optimum than the former. Note that MultiTaskElasticNet itself calls enet_path. So there must be something weird!

Recall that the algorithm in Sklearn is a simple coordinate descent, which is usually sufficient and fast for linear models. The issue turns out to be whether or not the previous solution is reused as the initial condition. By default, MultiTaskElasticNet calls enet_path on every fit but explicitly disables reusing the coefficients. Calling enet_path directly, however, enables the reuse of the coefficients along the path.
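To illustrate what reusing the previous solution means along a path, here is a NumPy-only sketch of block coordinate descent for the multi-task elastic net objective (with the $\frac{1}{2N}$ data-fit scaling). It is not the scikit-learn implementation and does not use the data of the repository above: the solver `mtenet_bcd`, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def mtenet_bcd(X, Y, alpha, c=0.5, W0=None, tol=1e-8, max_sweeps=10_000):
    """Block coordinate descent over the rows of W; W0 is an optional warm start."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, Y.shape[1])) if W0 is None else W0.copy()
    R = Y - X @ W                                   # current residual
    col_sq = (X ** 2).sum(axis=0) / n_samples       # ||x_j||^2 / N for each feature
    for sweep in range(1, max_sweeps + 1):
        max_change = 0.0
        for j in range(n_features):
            w_old = W[j].copy()
            R += np.outer(X[:, j], w_old)           # residual without feature j
            z = X[:, j] @ R / n_samples
            z_norm = np.linalg.norm(z)
            # Group soft-thresholding from the L21 term, then ridge shrinkage.
            shrink = max(0.0, 1.0 - alpha * c / z_norm) if z_norm > 0.0 else 0.0
            W[j] = shrink * z / (col_sq[j] + alpha * (1.0 - c))
            R -= np.outer(X[:, j], W[j])
            max_change = max(max_change, np.abs(W[j] - w_old).max())
        if max_change < tol:
            break
    return W, sweep

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 14))
W_true = np.zeros((14, 2))
W_true[[0, 3, 7]] = rng.standard_normal((3, 2))     # three features shared by both tasks
Y = X @ W_true + 0.01 * rng.standard_normal((200, 2))

alphas = np.logspace(0, -3, 10)                     # from strong to weak regularization
W_prev, warm_sweeps, cold_sweeps, max_gap = None, 0, 0, 0.0
for alpha in alphas:
    W_warm, n_warm = mtenet_bcd(X, Y, alpha, W0=W_prev)   # start from previous solution
    W_cold, n_cold = mtenet_bcd(X, Y, alpha)              # start from zeros
    warm_sweeps += n_warm
    cold_sweeps += n_cold
    max_gap = max(max_gap, np.abs(W_warm - W_cold).max())
    W_prev = W_warm

print(warm_sweeps, cold_sweeps, max_gap)
```

On this well-conditioned toy problem both starts are run to convergence and reach the same minimizer, and one would typically expect the warm-started path to need fewer sweeps in total. In scikit-learn the corresponding switch on the estimator is the `warm_start` constructor argument, which defaults to `False`.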

Acknowledgement

I thank Alex Sun for debugging to figure out the issue of initial condition and Alexandre Gramfort for noticing the warm_start could be an issue.

Can we use coefficient of determination for nonlinear regression?

2018-09-27T00:00:00-07:00

Abstract

This is a note of my thoughts on $R^2$ after taking the Time Series Analysis class by Prof. Byon. I make the following assumptions.

scalar target: $y \in \mathbb{R}^1$
data is sampled i.i.d.

Introduction

Coefficient of determination arose from the observation in linear regression that

\[\begin{equation} SST = SSR + SSE, \end{equation}\]

where

\[\begin{equation} \begin{aligned} \textrm{total sum of squares: } SST &= \sum_{i} (y_i - \overline{y})^2, \\ \textrm{sum of squared errors: } SSE &= \sum_{i} (y_i - \hat{y}_i)^2, \\ \textrm{regression sum of squares: } SSR &= \sum_{i} (\hat{y}_i - \overline{y})^2. \end{aligned} \end{equation}\]

The proof can be found in most textbooks and online materials. Dividing this equality by $SST$ gives a nondimensional version, the coefficient of determination $R^2$:

\[\begin{equation} R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \end{equation}\]

Note that the above equation contains two equality signs, and it is not really settled which of them is the definition of $R^2$. Wiki says it is the second one, while Jim claims the first one is more natural. Both are well defined and equal in the context of linear regression. In general, we believe that $R^2$ is a statistic that measures what proportion of the variance of the target is explained by the predictor variables, excluding the constant. In that sense, the first one is more natural.
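The decomposition and the agreement of the two definitions can be checked numerically. The following is a minimal sketch on synthetic data, assuming an ordinary least-squares fit with an intercept; the data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + 0.3 * rng.standard_normal(n)

A = np.column_stack([np.ones(n), x])        # design matrix with a constant column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

sst = ((y - y.mean()) ** 2).sum()           # total sum of squares
sse = ((y - y_hat) ** 2).sum()              # sum of squared errors
ssr = ((y_hat - y.mean()) ** 2).sum()       # regression sum of squares

r2_from_ssr = ssr / sst
r2_from_sse = 1.0 - sse / sst
print(sst, ssr + sse)
print(r2_from_ssr, r2_from_sse)
```

Dropping the constant column from `A` would in general break the equality, which is one reason the intercept matters in what follows.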

Appearance in nonlinear regression

From my viewpoint, there are mainly two aspects of $R^2$ in the context of linear regression that make it popular.

The nondimensional property, i.e., we don’t have to worry about getting different performance measures over different datasets. In general, an $R^2$ above 0.9 is a good indicator of a well-performing model. If the nondimensional property is not needed, for example when we are interested in a single dataset and do not compare models across datasets with different scales, one can simply use the $RMSE$, as is common in the machine learning and fluid dynamics communities, or the so-called standard error of regression.
The variance explanation property. In the context of linear regression, another equivalent phrase for explaining variance is the correlation coefficient: it can be shown that the square of the Pearson correlation coefficient between $y$ and $\hat{y}$ is essentially $R^2$. There are alternative correlation coefficients out there, but none of them really satisfies me. I will write another post about a potentially dumb/workable nonlinear coefficient.

However, as has been pointed out many times, in the context of nonlinear modeling, in general

\[\begin{equation} SST \neq SSR + SSE. \end{equation}\]

Therefore, one needs to choose a definition for $R^2$. Most of the time, people use the one based on SSE, since minimizing SSE is what we do in training and a smaller SSE gives a higher $R^2$. This is the form implemented in Scikit-learn, as the SSE is well defined for both linear and nonlinear regression. However, the variance explanation property might no longer hold. Because of this issue, there are some negative views on the use of $R^2$.

The difference might be small for well-trained nonlinear models

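Writing $y_i - \overline{y} = \epsilon_i + (\hat{y}_i - \overline{y})$ with the residual $\epsilon_i = y_i - \hat{y}_i$ and expanding the square gives $SST = SSE + SSR + 2\sum_{i}\epsilon_i (\hat{y}_i - \overline{y})$. The equality therefore holds exactly when the cross term vanishes:

\[\begin{equation} \sum_{i}(y_i - \hat{y}_i) (\hat{y}_i - \overline{y}) = 0 \quad \Longleftrightarrow \quad \sum_{i}\epsilon_i (\hat{y}_i - \overline{y}) = 0. \end{equation}\]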

Note that the following two conditions together are sufficient for the above,

\[\begin{equation} \sum_{i}\epsilon_i \hat{y}_i = 0, \qquad \sum_{i}\epsilon_i \overline{y} = 0. \end{equation}\]

The second one is easy, as long as the model contains a constant term that enters linearly,

\[\begin{equation} \hat{y} = \alpha + g_{\beta}(x). \end{equation}\]

One can show that stationarity of the least-squares objective with respect to $\alpha$ leads to

\[\begin{equation} \sum_{i} \epsilon_i = 0. \end{equation}\]

For the first one, we notice the following

\[\begin{equation} 0 = \sum_i \epsilon_i \hat{y}_i = \sum_i \epsilon_i\left(\hat{y}_i - \frac{1}{N}\sum_j \hat{y}_j\right) \sim \mathrm{Cor}(\epsilon, \hat{y}). \end{equation}\]

Note that $\epsilon$ is the residual and has zero mean by the previous condition, which justifies the second equality. Since the data is sampled i.i.d., the unweighted sum above is proportional to the sample correlation between residual and prediction.

For well-trained models, Billings et al. derived several criteria for determining whether a neural network model is well trained or not. Equation 21 in their paper shows that the above uncorrelatedness, in the linear sense, is a required condition, which leads to

\[\begin{equation} SST \approx SSR + SSE. \end{equation}\]

In this sense, we should expect to be able to use $R^2$ for nonlinear regression with both the variance explanation property and the nondimensional property.
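The role of the cross term can be seen in a small experiment. The sketch below is an assumption-laden illustration: the data, the model family $\hat{y} = a \sin(3x)$, and the helper `r2_gap` are made up for this note and are unrelated to the criteria of Billings et al. It compares a deliberately under-fitted prediction with one close to the truth.

```python
import numpy as np

def r2_gap(y, y_hat):
    """Return (SSR/SST - (1 - SSE/SST), cross term) for given predictions."""
    y_bar = y.mean()
    sst = ((y - y_bar) ** 2).sum()
    sse = ((y - y_hat) ** 2).sum()
    ssr = ((y_hat - y_bar) ** 2).sum()
    cross = ((y - y_hat) * (y_hat - y_bar)).sum()
    # Exact identity for any prediction: SST = SSE + SSR + 2 * cross.
    assert np.isclose(sst, sse + ssr + 2.0 * cross)
    return ssr / sst - (1.0 - sse / sst), cross

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(200)

gap_under, cross_under = r2_gap(y, 0.8 * np.sin(3.0 * x))  # deliberately under-fitted
gap_well, cross_well = r2_gap(y, np.sin(3.0 * x))          # close to the truth
print(gap_under, cross_under)
print(gap_well, cross_well)
```

The gap between the two definitions equals $-2\,\textrm{cross}/SST$, so it shrinks as the residual becomes uncorrelated with the prediction.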

Adjusted $R^2$ is a weaker penalty than AIC

2018-09-27T00:00:00-07:00

Abstract

This is a note of thought that I bumped into randomly.

Model selection criteria that penalize a growing number of parameters have a long history in statistics. AIC, i.e., the Akaike information criterion, is perhaps the most famous one due to its simplicity and generality. Meanwhile, when $R^2$ is introduced in class, adjusted $R^2$ is introduced immediately afterwards. The latter does not follow the “variance explanation” interpretation per se, since there is no guarantee that it stays in $[0,1]$. But it is supposed to penalize a large number of parameters by showing a lower $R^2$. In this post, I will show that adjusted $R^2$ imposes a weaker penalty than the AIC/BIC criteria.

Introduction

Before the discussion, let’s make the definitions clear: $n \in \mathbb{N}$ is the total number of samples and $p$ is the number of predictors (excluding the constant term).

Adjusted $R^2$

The difference between the common $R^2$ and adjusted $R^2$ is that adjusted $R^2$ takes the degrees of freedom into account when measuring the explained variance. The more parameters you have, the less independent the residuals become, simply because the OLS formulation imposes more and more constraints on them.

The expression for adjusted $R^2$ is

\[\begin{equation} \overline{R}^2 \triangleq 1 - \dfrac{SSE}{SST} \dfrac{n-1}{n-p-1}. \end{equation}\]
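As a direct transcription of this formula, a minimal sketch could look as follows. The function name `adjusted_r2` and the example numbers are hypothetical.

```python
def adjusted_r2(sse, sst, n, p):
    """Adjusted R^2 = 1 - (SSE / SST) * (n - 1) / (n - p - 1).

    n is the number of samples, p the number of predictors without the constant.
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1.0 - (sse / sst) * (n - 1) / (n - p - 1)

# A reasonable fit: SSE/SST = 0.1 with 21 samples and 10 predictors.
good_fit = adjusted_r2(sse=10.0, sst=100.0, n=21, p=10)

# A poor fit with almost as many predictors as samples can fall below zero.
poor_fit = adjusted_r2(sse=90.0, sst=100.0, n=12, p=10)

print(good_fit, poor_fit)
```

The second call shows that the value is not confined to $[0,1]$, so it cannot be read as a proportion of explained variance.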

Minimum Description Length (MDL)

AIC originates in information theory. For Bayesian belief networks there is a criterion called Minimum Description Length (MDL): one would like to choose the Bayesian belief network with the shortest description length.

In general, for a given training set $D = \{ {x}_1,\ldots,{x}_m \}$, the scoring function of a Bayes net $B = \langle G, \Theta \rangle$ on the training set $D$ is

\[\begin{equation} s(B|D) = f(\theta)|B| - LL(B|D), \end{equation}\]

where $f(\theta)$ is the bits required to describe each parameter while $\vert B \vert$ is the number of parameters in the Bayes net.

\[\begin{equation} LL(B|D) = \sum_{i=1}^m \log P_B({x}_i) \sim -\frac{m}{2} \log(\sigma^2) -\frac{1}{2\sigma^2} SSE \sim -\frac{m}{2}\log(SSE/m), \end{equation}\]

which is the log-likelihood of all data, and $\sigma^2$ is the noise variance in the likelihood model. Note that $\sigma^2 = SSE/m$ is the maximum likelihood estimate (MLE) of the residual variance. Remember that this assumes the samples are i.i.d., i.e., independent and identically distributed.

MDL induces several concepts which are shown below without proof. Note that the sources are from here.

\[\begin{equation} AIC(B|D) = |B| - LL(B|D) \sim p + \frac{m}{2} \log(SSE/m), \end{equation}\]

which assumes each parameter costs $1$ bit for description.

\[\begin{equation} BIC(B|D) = \frac{\log m}{2}|B| - LL(B|D) \sim \frac{\log m}{2}p + \frac{m}{2} \log(SSE/m), \end{equation}\]

which assumes each parameter costs $\log m /2$ bits for description.

Adjusted $R^2$ penalizes more weakly than AIC/BIC

Note that for model selection we select the model that maximizes adjusted $R^2$, or equivalently the one that minimizes AIC/BIC.

Start with $\overline{R}^2$. Maximizing it is equivalent to minimizing $\dfrac{SSE}{SST} \dfrac{n-1}{n-p-1}$. Since $SST$ is fixed by the given data and $n$ is also fixed, the selected model is simply the one that minimizes

\[\begin{equation} SSE/(n-p-1). \end{equation}\]

Since $\log$ is monotone, it does not hurt to take the $\log$ of this quantity, which gives

\[\begin{equation} \log \frac{SSE}{n-p-1} = \log \frac{SSE}{n} + \log \frac{1}{1-(p+1)/n}. \end{equation}\]

Second, for AIC/BIC, the equivalent quantity to minimize (with $m = n$ samples and the criteria multiplied by 2) is

\[\begin{equation} 2 C p + n \log (SSE/n), \end{equation}\]

where $C = 1$ for AIC and $C = \frac{\log n}{2}$ for BIC.

Note that $n$ is a constant, therefore it is equivalent to minimize

\[\begin{equation} 2 C p/n + \log (SSE/n) \sim 2 C (p+1)/n + \log (SSE/n). \end{equation}\]

Clearly, the ratio between the $\log \frac{1}{1-(p+1)/n}$ and $2 C (p+1)/n$ determines the relative penalization between adjusted $R^2$ and AIC/BIC.

First, let’s investigate this ratio as follows

\[\begin{equation} f(x) = \frac{\log(\frac{1}{1-x})}{x} \ge 1, \forall x \in (0,1). \end{equation}\]

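To see this, note $f(0^+) = 1$ and compute $f'(x)$:

\[\begin{equation} \begin{aligned} f'(x) &= \frac{\frac{x}{1-x} - \log \frac{1}{1-x}}{x^2} \\ &= \frac{ \frac{1}{1-x} - \log \frac{1}{1-x} - 1}{x^2} \ge 0, \quad \forall x \in (0,1), \end{aligned} \end{equation}\]

where the inequality follows from $u - 1 - \log u \ge 0$ for $u > 0$, applied to $u = \frac{1}{1-x}$.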

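Therefore, with $x = \frac{p+1}{n}$, $f$ is a monotonically increasing function of $(p+1)/n$. When $p \ll n$ we have $f(x) \approx 1$, so the ratio between the penalties of adjusted $R^2$ and AIC/BIC is

\[\begin{equation} \frac{\log(1/(1-(p+1)/n))}{2C(p+1)/n} \approx \frac{1}{2C} < 1, \end{equation}\]

where $C = 1$ for AIC and $C = \frac{\log n}{2}$ for BIC (for BIC the inequality requires $n \ge 3$). In other words, for $p \ll n$ the penalty of adjusted $R^2$ is roughly $1/2$ of the AIC penalty and $1/\log n$ of the BIC penalty. As $(p+1)/n$ approaches $1$ the ratio grows without bound, so the conclusion is restricted to the regime $p \ll n$.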
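The size of this ratio is easy to evaluate numerically. The sketch below simply codes the two penalty terms derived above; the function name `penalty_ratio` and the sample values of $n$ and $p$ are assumptions for illustration.

```python
import math

def penalty_ratio(n, p, criterion="aic"):
    """Ratio of the adjusted-R^2 penalty log(1/(1-x)) to the AIC/BIC penalty 2*C*x,
    with x = (p + 1) / n, C = 1 for AIC and C = log(n)/2 for BIC."""
    x = (p + 1) / n
    adjusted_penalty = math.log(1.0 / (1.0 - x))
    c = 1.0 if criterion == "aic" else 0.5 * math.log(n)
    return adjusted_penalty / (2.0 * c * x)

ratio_aic = penalty_ratio(n=1000, p=9, criterion="aic")
ratio_bic = penalty_ratio(n=1000, p=9, criterion="bic")
print(ratio_aic, ratio_bic)

# The ratio grows with (p + 1) / n, so the comparison reverses for very large p.
ratio_aic_large_p = penalty_ratio(n=1000, p=949, criterion="aic")
print(ratio_aic_large_p)
```

For $p \ll n$ the values sit near $1/(2C)$, i.e., about one half for AIC and about $1/\log n$ for BIC.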

The ultimate way to postprocess OpenFoam data in Python (updated to Pyvista)

2018-09-27T00:00:00-07:00

Abstract

In this post, I use foamToVTK in OpenFoam to convert OpenFoam data into the legacy VTK (The Visualization ToolKit) format, then use Pyvista (previously vtkInterface) for data manipulation in Python under Ubuntu.

Introduction

OpenFoam is a popular open source code for computational fluid dynamics (CFD). Although it contains various helpful command-line postprocessing modules such as postProcess, these are designed for convenience rather than flexibility. For example, they only provide operations that are very common in fluid mechanics or vector mathematics, and they hide details of the operation such as the numerical scheme used to approximate derivatives. Most of the time, OpenFoam saves the data in folders named by the current time, and each folder contains data in a special text-like OpenFoam format, which is also designed for convenience so that one can directly read the field data. However, if one wants to manipulate the data more flexibly in the modern data-driven era, a Python-script-driven manipulation of the data is extremely favorable. Also, to avoid dealing with the mesh in the script, it would be great if one could simply add the modified field to the original mesh.

Fortunately, with the help of an awesome Python package on Github, currently Pyvista and previously vtkInterface, originally by Alex Kaszynski, one can easily leverage the powerful libraries of the Python environment to postprocess traditional, mature, standard and specialized scientific computing data, and immediately put the results back so that one can again leverage the existing powerful visualization software of the scientific computing community.

Using pip to install Pyvista

Note that it works well with Python 3.5+.

sudo pip install pyvista

Tutorial: 2D Flow past cylinder

The material can be obtained from Wolf Dynamics at this link.

Prepare data

untar the .tar file
```
tar -zxvf vortex_shedding.tar.gz
```

go to c1 directory for running a standard case

cd c1
blockMesh
checkMesh
icoFoam > log &

Convert OpenFoam default format to VTK

foamToVTK

Using Python to manipulate VTK data

import numpy as np
import pyvista as pv

## grid is the central object in VTK; every field is attached to the grid
grid = pv.UnstructuredGrid('./VTK/c1_1000.vtk')

## point-wise information of the geometry (node coordinates)
print(grid.points)

## dictionary-like containers holding all cell/point fields
## (these were called cell_arrays/point_arrays in older Pyvista releases)
print(grid.cell_data)   # note that cell-based and point-based arrays differ in size
print(grid.point_data)

## get a field as a numpy array
p_cell = np.asarray(grid.cell_data['p'])

## create a new cell field of pressure^2
p2_cell = p_cell**2
grid.cell_data['p2'] = p2_cell

## remember to save the modified vtk
grid.save('./VTK/c1_1000_shaowu.vtk')

Visualize the new field in ParaView

paraview

Reading List From M.Jordan

2018-09-26T00:00:00-07:00

Note on M. Jordan’s ML reading list

Elementary

Casella, G. and Berger, R.L. (2001). “Statistical Inference” Duxbury Press.

For a slightly more advanced book that’s quite clear on mathematical techniques, the following book is quite good:

Ferguson, T. (1996). “A Course in Large Sample Theory” Chapman & Hall/CRC.

You’ll need to learn something about asymptotics at some point, and a good starting place is:

Lehmann, E. (2004). “Elements of Large-Sample Theory” Springer.

Those are all frequentist books. You should also read something Bayesian:

Gelman, A. et al. (2003). “Bayesian Data Analysis” Chapman & Hall/CRC.

and you should start to read about Bayesian computation:

Robert, C. and Casella, G. (2005). “Monte Carlo Statistical Methods” Springer.

On the probability front, a good intermediate text is:

Grimmett, G. and Stirzaker, D. (2001). “Probability and Random Processes” Oxford.

At a more advanced level, a very good text is the following:

Pollard, D. (2001). “A User’s Guide to Measure Theoretic Probability” Cambridge.

The standard advanced textbook is Durrett, R. (2005). “Probability: Theory and Examples” Duxbury.

Machine learning research also reposes on optimization theory. A good starting book on linear optimization that will prepare you for convex optimization:

Bertsimas, D. and Tsitsiklis, J. (1997). “Introduction to Linear Optimization” Athena.

Advanced

And then you can graduate to:

Boyd, S. and Vandenberghe, L. (2004). “Convex Optimization” Cambridge.

Linear Algebra

Getting a full understanding of algorithmic linear algebra is also important. At some point you should feel familiar with most of the material in

Golub, G., and Van Loan, C. (1996). “Matrix Computations” Johns Hopkins.

It’s good to know some information theory. The classic is:

Cover, T. and Thomas, J. “Elements of Information Theory” Wiley.

Functional Analysis

Finally, if you want to start to learn some more abstract math, you might want to start to learn some functional analysis (if you haven’t already). Functional analysis is essentially linear algebra in infinite dimensions, and it’s necessary for kernel methods, for nonparametric Bayesian methods, and for various other topics. Here’s a book that I find very readable:

Kreyszig, E. (1989). “Introductory Functional Analysis with Applications” Wiley.