
The statistical data analysis framework of Meta.Numerics is organized into classes that represent different kinds of data collections:

**Sample**: Each observation consists of a single number drawn from a population. For example: weight, income, or lifetime is measured for a study group.

**MultivariateSample**: Each observation consists of a set of numbers drawn from a population. For example: weight, income, and lifetime are measured for a study group.

**DataSet**: Each observation consists of a measured value and error bar associated with a single independent variable. For example: the solubility of a substance in water is measured as a function of temperature.

**DataSet<T>**: Each observation consists of a measured value and error bar associated with an arbitrary independent variable.

**BinaryContingencyTable**: Observations count the subjects that fall into each combination of two binary categories. For example: a study group is classified into subjects that were treated or not treated, and that survived or did not survive.

**ContingencyTable**: Observations count the subjects that fall into each combination of two sets of categories. For example: a study group is classified by subject grade level and by whether the subject passed or failed the grade.

Each kind of data collection allows for its own set of descriptive statistics, statistical tests, and model fitting procedures. They are described in detail below.

In addition to the data collection classes, information is available on Distributions.

## (Univariate) Samples

The Sample class is a data container for experiments consisting of independent measurements of a single variable. Suppose, for example, that you have time-to-failure measurements of a random sample of components produced by some manufacturing process. The time-to-failure values are: 1.0, 1.3, 1.5, 1.7, and 1.9. This data can be submitted to a Sample class as follows:

```csharp
Sample sample = new Sample();
sample.Add(new double[] { 1.0, 1.3, 1.5, 1.7, 1.9 });
```

To control the contents of the sample, you can also use the Add overload that accepts a single value, the Remove method to remove a single value, and the Clear method to remove all values.

The Sample class provides methods for obtaining summary statistics for the sample and estimates of summary statistics for the underlying population. The relevant methods are summarized in the following table:

Statistic | Sample | Population |
---|---|---|
mean | Mean | PopulationMean |
standard deviation | StandardDeviation | PopulationStandardDeviation |
raw moment | Moment | PopulationMoment |
central moment | MomentAboutMean | PopulationMomentAboutMean |

Note that estimates of population parameters are UncertainValues, which contain both a best estimate and a standard error estimate.

The Sample class supports several statistical tests appropriate to univariate samples. For example, you can perform a Student t-test to determine whether you can state, at a chosen confidence level, that the sample was drawn from a population with a mean above or below some reference value. Suppose, for example, that our customer wishes us to guarantee, with 95% confidence, that the mean time-to-failure of our component is above 1.25.

```csharp
// test whether mu > 1.25 with 95% confidence
TestResult t = sample.StudentTTest(1.25);
if (t.LeftProbability > 0.95) {
    Console.WriteLine("95% confident that mu > 1.25");
} else {
    Console.WriteLine("not 95% confident that mu > 1.25");
}
```

The Sample class also supports a t-test against another Sample to test for the equality of means, and a Kolmogorov-Smirnov test against a Distribution or another Sample to test for the equality of distributions.

Finally, you may want to fit your sample to a parameterized distribution. Do not do this by binning the data and doing a least-squares fit to the bin counts: binning throws away information, the bin count variances may not satisfy the assumptions underlying a least-squares fit, and there is a better way. Do a maximum likelihood fit directly on the sample data, using the Sample class's MaximumLikelihoodFit method.

```csharp
// the distribution passed in serves as the initial guess
IParameterizedDistribution distribution = new WeibullDistribution(sample.Mean, 1.0);
FitResult fit = sample.MaximumLikelihoodFit(distribution);
// report each best-fit parameter with its uncertainty
for (int i = 0; i < fit.Dimension; i++) {
    UncertainValue parameter = fit.Parameter(i);
    Console.WriteLine("a[{0}] = {1}", i, parameter);
}
```

The IParameterizedDistribution passed into the method is just an initial guess. When the method returns, the resultant FitResult will contain the best-fit parameters and the input distribution will have been changed to the best-fit one.

## Data Sets

The DataSet class describes a type of experiment very common in the physical sciences: a measurement, with uncertainty, is recorded for various values of an independent variable. For example, we might measure the intensity of scattered light as a function of angle, or the concentration of a chemical in a reaction chamber as a function of time, or the temperature of the ocean surface as a function of location.

Note that the last example differs from the first two in an important way: the independent variable is not a single real number, but rather a point in some more complicated parameter space (in this case, one likely specified by giving latitude and longitude). For data sets like these, you should use the generic DataSet<T> class, with T set to a type that describes your independent variable. The non-generic DataSet class is essentially DataSet<double>, and it defines several methods, such as fitting to a polynomial, that only make sense for functions of a single, real variable.

Individual measurements in a data set are represented by DataPoint<T> instances. Each data point consists of a value for the independent variable (X, of type T) and a value and error bar for the dependent variable (Y, of type UncertainValue). You can add data points to your data set using any of several overloads of the Add method.

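```csharp
DataSet data = new DataSet();
// build a point from an x value and an UncertainValue for y
DataPoint<double> point1 = new DataPoint<double>(1.0, new UncertainValue(2.0, 3.0));
data.Add(point1);
// build a point from plain numbers
DataPoint<double> point2 = new DataPoint<double>(2.0, 1.0, 3.0);
data.Add(point2);
// add a point without constructing a DataPoint first
data.Add(3.0, 2.0, 1.0);
```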

You can remove one or all data points from your set using the Remove or Clear methods.

DataSet allows you to fit your data to various model forms. For example, the following code computes the line that best fits the data.

```csharp
// fit a line to the data
FitResult line = data.FitToLine();
// write the parameters, with 1-sigma error bars
Console.WriteLine("intercept = {0}", line.Parameter(0));
Console.WriteLine("slope = {0}", line.Parameter(1));
// or loop over all parameters
for (int i = 0; i < line.Dimension; i++) {
    Console.WriteLine(line.Parameter(i));
}
```
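The page does not say what "best fits" means numerically. Assuming FitToLine performs a conventional weighted least-squares fit (an assumption, not something the text confirms), the reported intercept $a_0$ and slope $a_1$ would be the values that minimize the following sum over the data points $(x_i, y_i \pm \sigma_i)$:

```latex
% Assumed objective: weighted least squares for a straight line.
% a_0 = intercept, a_1 = slope, sigma_i = error bar on y_i.
\chi^2(a_0, a_1) = \sum_{i=1}^{n} \frac{\left( y_i - a_0 - a_1 x_i \right)^2}{\sigma_i^2}
```

Under this objective, points with large error bars contribute little to the sum, so they constrain the line only weakly.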

DataSet also defines methods for fitting data to a constant, a proportionality relationship (a line with its intercept fixed to zero), or a polynomial of arbitrary degree.
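Written out, the three model forms named above are as follows. The parameter symbols $a_k$ and the degree $m$ are notation chosen here for illustration, not names taken from the library.

```latex
\begin{align*}
y &= a_0                     && \text{(constant)} \\
y &= a_1 x                   && \text{(proportionality: a line through the origin)} \\
y &= \sum_{k=0}^{m} a_k x^k  && \text{(polynomial of degree } m \text{)}
\end{align*}
```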

You can also fit data sets to any parameterized model you can write down. For example, the following sample defines a 3-parameter oscillatory model function and fits the data to it.

```csharp
// define an oscillatory model
// a[0] ~ amplitude, a[1] ~ period, a[2] ~ phase
Function<double[], double, double> model = delegate (double[] a, double x) {
    // x must enter the sine argument; otherwise the model is constant in x
    double y = a[0] * Math.Sin(2.0 * Math.PI * x / a[1] + a[2]);
    return (y);
};
// fit the data, providing an initial guess for the model parameters
FitResult oscillation = data.FitToFunction(model, new double[] { 1.0, 1.0, 0.0 });
// report the result
for (int i = 0; i < oscillation.Dimension; i++) {
    Console.WriteLine("a[{0}] = {1}", i, oscillation.Parameter(i));
}
```
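In equation form, a sinusoid with amplitude $a_0$, period $a_1$, and phase $a_2$ (the roles given in the code comments) is:

```latex
% a_0 = amplitude, a_1 = period, a_2 = phase
y(x) = a_0 \sin\!\left( \frac{2 \pi x}{a_1} + a_2 \right)
```

Read this way, the initial guess `{ 1.0, 1.0, 0.0 }` in the code corresponds to unit amplitude, unit period, and zero phase.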

If your model is a linear function of the model parameters, you can use the FitToLinearFunction method for a faster fit. Note that the FitToFunction and FitToLinearFunction methods are defined for the generic DataSet<T>. To fit a data set on a field of type T, your model function should accept an input argument of type T.
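As a reminder of what "linear in the model parameters" means here: the model may depend on $x$ in any way, as long as each parameter enters only as a multiplying coefficient. The basis functions $f_k$ below are notation assumed for illustration.

```latex
% General form of a model that is linear in its parameters a_k
y(x) = \sum_{k} a_k \, f_k(x)
```

A polynomial fits this pattern with $f_k(x) = x^k$. The oscillatory model above does not, since its period and phase appear inside the sine.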

Last edited Aug 4, 2009 at 8:15 PM by ichbin, version 6