We often need to calculate averages of a group of data with associated uncertainties (e.g. a group of cosmogenic ages from erratics on the same moraine). The individual ages are usually in the form of *AGE*±*UNCERTAINTY*. This represents a most likely *AGE* and the age range that corresponds to “one sigma”, meaning that there is a 68% probability that our “true age” lies between *AGE*–*UNCERTAINTY* and *AGE*+*UNCERTAINTY*.
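As a quick check of the 68% figure, here is a minimal Python sketch (Python is used only for illustration) that computes the probability of a value falling within a given number of sigmas of the mean. It assumes the age uncertainty is normally distributed; the function name `prob_within` is made up for this example.

```python
import math

def prob_within(k_sigma):
    """Probability that a normally distributed value lies within k_sigma of its mean."""
    return math.erf(k_sigma / math.sqrt(2))

# One sigma corresponds to the AGE±UNCERTAINTY range; two sigma is shown for comparison.
print(f"within 1 sigma: {prob_within(1):.4f}")
print(f"within 2 sigma: {prob_within(2):.4f}")
```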

This way of expressing data is easy to read, allowing the reader to get a clear idea of the age and the precision using only two figures. Also, for most cosmogenic ages, this form accurately represents the distribution of the data, which is required when these ages are later used to calculate something else.

That is why we tend to represent the average of several ages using the form *AGE*±*UNCERTAINTY*. However, there are several ways of calculating this average, and the accuracy of the calculated average in representing the original dataset depends on both a) the characteristics of our dataset, and b) the method used to calculate the average.

**Surface exposure ages from cosmonuclide concentrations**

Online calculators are typically used to calculate surface exposure ages from the concentration of cosmogenic nuclides in surface samples. Some of the most used online calculators are:

- The online calculators formerly known as the CRONUS-Earth online calculators
- The CRONUS Earth Web Calculators
- CREp (Cosmic Ray Exposure program)

All these calculators require inputting data about the sampling site, the characteristics of the reference material used in the concentration measurements, and the measured concentrations with their uncertainties. The measurement uncertainty should include the laboratory and analytical uncertainties, usually including the scatter of the spectrometer measurements, weighing uncertainties, and the nominal error of the concentration of the carrier used, if any. You can find more information on how to calculate concentration uncertainties here:

- Converting Al and Be isotope ratio measurements to nuclide concentrations in quartz. (Greg Balco, 2006)
- Basic error propagation and cosmogenic isotopes.

The concentration uncertainty is directly transmitted to the *internal error* of the final age.

The online calculators always report at least three values per sample: the apparent surface exposure age and two uncertainties, *internal* and *external*. The **internal uncertainty** includes only the propagation of the concentration uncertainty, and this is the figure we should use **when comparing ages from samples collected nearby and prepared in the same way**, which is usually our whole dataset. The external uncertainty contains the internal uncertainty plus the uncertainty of the method used to calculate the age, typically the uncertainty of the scaling method and the uncertainty of the reference production rate, i.e. the scatter of the calibration data. The **external uncertainty** is the figure we should use when **comparing our data with ages from other sites or ages calculated using other methods**.

The **C**alibration and **S**caling **U**ncertainty (CSU) is usually a fixed percentage of our ages. You can easily check this by subtracting the two errors (σext. and σint.) *in quadrature* and dividing the result by the age (μ):

$$\mathrm{CSU} = \frac{\sqrt{\sigma_{ext}^2 - \sigma_{int}^2}}{\mu}$$

The result is usually a constant percentage for all your ages. E.g. the calibration uncertainty of the Be-10 ages calculated using the *online-calculators-formerly-known-as-the-CRONUS-Earth-online-calculators* v.3 using the LSDn scaling scheme is typically 8.2%.

As we can always calculate the external error by adding *in quadrature* the CSU to our internal uncertainties, we can just forget about the external uncertainties when operating with our ages (e.g. calculating averages) and add the CSU at the end of our calculations.
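The two operations described above (extracting the CSU and adding it back at the end) can be sketched in Python as follows. This is only an illustration: the function names are made up, and the 8.2% value is the typical figure quoted above. Reading the bracketed values in the results further down as external uncertainties is my interpretation, which the numbers appear to support.

```python
import math

def csu_fraction(age, internal, external):
    """CSU as a fraction of the age: errors subtracted in quadrature, divided by the age."""
    return math.sqrt(external**2 - internal**2) / age

def external_from_internal(age, internal, csu):
    """External uncertainty: internal uncertainty plus CSU (a fraction of the age), in quadrature."""
    return math.sqrt(internal**2 + (csu * age) ** 2)

# Example: an average of 17.90 ± 0.36 ka with a CSU of 8.2%
print(f"external uncertainty: {external_from_internal(17.90, 0.36, 0.082):.2f} ka")
```

Applying `csu_fraction` to every age in a calculator output should return nearly the same number for all samples if the CSU is indeed a constant percentage.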

**Types of surface exposure age datasets**

The preparation and measurement of samples for cosmogenic exposure dating is time-consuming and expensive. Therefore, the samples that are finally measured are thoroughly selected and the final datasets contain a small number of samples, typically 4-6 per geologic landform.

Despite all the care put into sample selection, several natural processes cause the apparent surface exposure ages to deviate from the true landform age. This *natural noise* could be caused by previous exposure of the sampled surfaces or non-constant exposure since the landform formation (e.g. boulder rotation). Therefore, we should expect outliers in our dataset or at least some scatter of our ages.

Also, inhomogeneities during sample preparation and analysis (e.g. different sample sizes, different AMS currents, etc.) sometimes yield datasets with mixed precisions, even from identical geological samples.

All this makes typical surface exposure age datasets 1) small in terms of the number of data, 2) scattered due to the *natural noise*, and 3) often a mix of precise and imprecise data. Here we can see 4 synthetic examples of typical surface exposure age datasets with data obtained from samples from erratics on an LGM moraine (~18 ka):

The precise dataset. | The scattered dataset | A dataset with outliers | Mixed precisions dataset |

When calculating the average of one of these datasets, we normally use the average or the weighted average.

**Average and standard deviation**

The simplest way of averaging ages is using the arithmetic mean, which is the sum of the ages ($\mu_i$) divided by the number of ages ($n$):

$$AV = \frac{\sum_{i=1}^{n} \mu_i}{n}$$

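The uncertainty associated with the average is the standard deviation, which is typically calculated as:

$$SD = \sqrt{\frac{\sum_{i=1}^{n} (\mu_i - AV)^2}{n-1}}$$

where $\mu_i$ are the individual ages, $AV$ is their average, and $n$ is the number of ages.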

If we apply these formulas to the previous datasets, we obtain the following AV±SD:

The precise dataset. AV: 17.90 ± 0.36 (1.51) ka 🙂 | The scattered dataset AV: 18.4 ± 1.2 (1.9) ka 🙂 | A dataset with outliers AV: 16.8 ± 2.2 (2.6) ka 😦 | Mixed precisions dataset AV: 19 ± 2.1 (2.6) ka 😦 |

In a *good* set of ages, as in the first case (the precise dataset), the error bars of the individual ages correspond to the scatter of the dataset. In the other cases, the main problem with this approach is that we are **ignoring the uncertainties of the individual ages**. This is not a big problem when the individual uncertainties are negligible compared to the scatter of the data, as in the second example (the scattered dataset). However, in the third example (a dataset with outliers) the outliers *pull* the average toward them. The same happens in the last example (mixed precisions dataset): even though all individual age ranges overlap at 18 ka, the average yields 19 ka.
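A minimal Python sketch of the average and standard deviation, using only the standard library. The ages are made-up values, not one of the synthetic datasets shown here, and `statistics.stdev` uses the n−1 form of the standard deviation, which is an assumption about the formula intended. The helper name `average_and_sd` is hypothetical.

```python
import statistics

def average_and_sd(ages):
    """Arithmetic mean and sample standard deviation (n-1) of a list of ages."""
    return statistics.mean(ages), statistics.stdev(ages)

# Hypothetical ages in ka; note that the individual uncertainties play no role here
ages = [17.6, 18.1, 17.9, 18.3, 17.7]
av, sd = average_and_sd(ages)
print(f"AV: {av:.2f} ± {sd:.2f} ka")
```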

**Weighted average**

The weighted average, or weighted arithmetic mean, is a way of calculating the average that increases the importance of the individual ages that are known more precisely. That means that the *ages with smaller error bars* will contribute more to the average than the *ages with bigger uncertainties*. This is typically calculated as:

$$WA = \frac{\sum_{i=1}^{n} \mu_i / \sigma_i^2}{\sum_{i=1}^{n} 1 / \sigma_i^2}$$

where $\mu_i$ are the individual ages and $\sigma_i$ their uncertainties.

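The uncertainty of the weighted average is sometimes calculated as the standard error of the weighted mean (SEWM), where $\sigma_i$ is the uncertainty of each individual age, using this formula:

$$SEWM = \sqrt{\frac{1}{\sum_{i=1}^{n} 1 / \sigma_i^2}}$$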

which is a good representation of the effect of all analytical uncertainties of the individual ages on the weighted average. However, this method ignores the scatter of the data, which is usually bigger than the individual uncertainties. To take into account both sources of uncertainty in the weighted average, we should use the square root of the weighted sample variance to calculate this uncertainty. Thus, the Deviation of our Weighted Average (DWA) will be:

$$DWA = \sqrt{\frac{\sum_{i=1}^{n} (\mu_i - WA)^2 / \sigma_i^2}{\sum_{i=1}^{n} 1 / \sigma_i^2}}$$

where $\mu_i$ are the individual ages, $\sigma_i$ their uncertainties, and $WA$ the weighted average.

If we apply these formulas to the previous datasets, we obtain the following WA±DWA:

The precise dataset. WA: 17.90 ± 0.33 (1.50) ka 🙂 | The scattered dataset WA: 18.4 ± 1.1 (1.9) ka 🙂 | A dataset with outliers WA: 16.2 ± 2.1 (2.5) ka 😦 | Mixed precisions dataset WA: 17.87 ± 0.30 (1.47) ka 🙂 |

The weighted average does not solve the problem with the outliers *pulling* the average towards younger ages in the third example. Actually, in the first 3 examples, this method produces a very similar result to the simple arithmetic mean. However, the weighted average is successful at *ignoring* the effect of the imprecise data in the last example. **A weighted average is a good option for filtering poor analytical data without discarding it**.
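The weighted average and its two candidate uncertainties can be sketched in Python as below. This assumes inverse-variance weights (1/σ²) and the simple, uncorrected form of the weighted sample variance; both are common choices but are assumptions here. The ages and the function name are made up for illustration.

```python
import math

def weighted_average(ages, errors):
    """Return (WA, SEWM, DWA) using inverse-variance weights."""
    weights = [1 / e**2 for e in errors]
    wsum = sum(weights)
    wa = sum(w * a for w, a in zip(weights, ages)) / wsum
    # Standard error of the weighted mean: reflects the analytical uncertainties only
    sewm = math.sqrt(1 / wsum)
    # Square root of the weighted sample variance: reflects the scatter of the ages
    dwa = math.sqrt(sum(w * (a - wa) ** 2 for w, a in zip(weights, ages)) / wsum)
    return wa, sewm, dwa

# Hypothetical mixed-precision dataset (ka): the last age is old but very imprecise
ages = [17.8, 18.0, 18.1, 22.0]
errors = [0.4, 0.5, 0.4, 4.0]
wa, sewm, dwa = weighted_average(ages, errors)
print(f"WA: {wa:.2f} ka, SEWM: {sewm:.2f} ka, DWA: {dwa:.2f} ka")
```

The imprecise age carries a weight roughly a hundred times smaller than the others, so it barely moves the weighted average.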

##### Filtering data

When we look at a set of ages and error bars, we can *intuitively* guess the right age of the unit we are trying to date.

The precise dataset. | The scattered dataset | A dataset with outliers | Mixed precisions dataset |

The weighted average will match our guess in the first, second, and fourth examples. However, to get rid of the effect of the obvious outliers in the third case we might need to discard data.

###### Outliers

We can just **remove the odd ages manually**. This might seem obvious looking at the third example, but it is less evident if we look at the second one. In the second example, and in many real datasets, **removing outliers manually is arbitrary** and makes it difficult to compare averages of different datasets that were manually trimmed. The selection of outliers and the number of data we discard manually is often driven by the age we were primarily expecting and by our hope of getting a final age with a small error bar. Too human.

There are many *automatic* mathematical methods to discard outliers. We could stick to one method, discard the ages that *seem not to fit*, and apply the average, or the weighted average, to the mutilated dataset. However, this also brings some problems:

- The surface exposure datasets usually contain 4-6 ages. Most methods for removing outliers are designed to be used in groups with much more data. It is very difficult to justify statistically the removal of 2 data from a group of 6 ages.
- Usually, the geological samples have been chosen carefully to avoid samples that are not optimum for the method, especially when using expensive and time-consuming dating methods, such as cosmogenic surface exposure dating. When rejecting outliers, the scientist should provide a geological interpretation of the outlier. This interpretation usually involves rejuvenating or ageing processes that have been systematically avoided during the sample selection.

Is there a method that is *not manual*, and allows us to calculate a *realistic* average of our ages *without discarding outliers*?

##### The Best Gaussian Fit

A good candidate to automatically get an average of our data that is similar to our intuitive age is the Best Gaussian Fit (BGF).

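We can calculate the probability corresponding to each time ($t$) for each age ($\mu_i \pm \sigma_i$), assuming that it is normally distributed:

$$P_i(t) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(t - \mu_i)^2}{2\sigma_i^2}\right)$$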

Then we can sum up all the probability distributions and find the Gaussian curve that best fits the resulting camelplot.

The precise dataset. | The scattered dataset | A dataset with outliers | Mixed precisions dataset |

The μ and σ values corresponding to the best-fitting curve are our BGF:

The precise dataset. BGF: 17.85 ± 0.52 (1.55) ka 🙂 | The scattered dataset BGF: 18.4 ± 1.7 (2.3) ka 🙂 | A dataset with outliers BGF: 18.13 ± 0.76 (1.66) ka 🙂 | Mixed precisions dataset BGF: 17.87 ± 0.54 (1.54) ka 🙂 |

As we can see, this method mimics quite well our *intuitive age interpretation* using a mathematical algorithm that **takes into account all ages and their uncertainties**. However, the process of finding our BGF requires goal fitting methods that are slightly beyond the capacity of our favourite calculator, Microsoft Excel®.
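To give an idea of what such a fit involves, here is a rough Python sketch that builds the camelplot and searches a grid of μ and σ values for the Gaussian that fits it best. It is not the CEAA code: the least-squares misfit, the freely scaled height of the fitted curve, the grid search, and all function names are assumptions made for this illustration, and the ages are made up.

```python
import math

def gaussian(t, mu, sigma):
    """Normal probability density at time t."""
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def camelplot(ages, errors, times):
    """Sum of the individual Gaussian densities on a time grid."""
    return [sum(gaussian(t, a, e) for a, e in zip(ages, errors)) for t in times]

def best_gaussian_fit(ages, errors, n_t=150, n_mu=81, n_sigma=30):
    """Grid search for the (mu, sigma) whose scaled Gaussian best fits the camelplot."""
    lo = min(a - 4 * e for a, e in zip(ages, errors))
    hi = max(a + 4 * e for a, e in zip(ages, errors))
    times = [lo + (hi - lo) * k / (n_t - 1) for k in range(n_t)]
    target = camelplot(ages, errors, times)
    a_min, a_max = min(ages), max(ages)
    mu_grid = [a_min + (a_max - a_min) * k / (n_mu - 1) for k in range(n_mu)]
    # Geometrically spaced widths, from half the smallest error to half the time span
    s_min, s_max = min(errors) / 2, (hi - lo) / 2
    sigma_grid = [s_min * (s_max / s_min) ** (k / (n_sigma - 1)) for k in range(n_sigma)]
    best_misfit, best_mu, best_sigma = math.inf, None, None
    for mu in mu_grid:
        for sigma in sigma_grid:
            g = [gaussian(t, mu, sigma) for t in times]
            norm = sum(gi * gi for gi in g)
            if norm == 0:
                continue
            # Height of the curve that minimises the squared misfit for this (mu, sigma)
            amp = sum(f * gi for f, gi in zip(target, g)) / norm
            misfit = sum((f - amp * gi) ** 2 for f, gi in zip(target, g))
            if misfit < best_misfit:
                best_misfit, best_mu, best_sigma = misfit, mu, sigma
    return best_mu, best_sigma

# Hypothetical dataset (ka) with one young outlier
ages = [18.0, 18.2, 17.9, 18.1, 12.0]
errors = [0.5, 0.5, 0.5, 0.5, 0.5]
mu, sigma = best_gaussian_fit(ages, errors)
print(f"BGF: {mu:.2f} ± {sigma:.2f} ka")
```

Because the height of the fitted curve is free, the search settles on the main cluster of ages rather than being pulled toward the outlier. The result is only as fine as the grid; a proper optimiser would refine it.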

I did all these graphs using Octave. Below you can find a link to my GitHub repository with the code needed to perform all these calculations and plots at once (Average, Weighted Average, and Best Gaussian Fit with internal and external uncertainties). This code works well in both **MATLAB® and Octave**.

Additionally, I tried to make an **XLSX file** that calculates the same, but with no plots. However, the BGF calculations are based on a set of 1000 random curves, and this is often not enough to get the best fit. If you download this version, remember that **the BGF approach might not be accurate!**

For bigger datasets, such as calibrated Schmidt Hammer ages, it can be interesting to fit multiple Gaussian curves to our camelplot if we suspect that our ages reflect more than one event. Jason Dortch developed P-CAAT for fitting several Gaussian curves to big datasets.

#### Cosmogenic Exposure Age Averages (CEAA) calculators:

###### MATLAB/Octave program:

How-to with your data:

- If you don’t have MATLAB or Octave, you can freely download and install Octave from here.
- Organize your ages in four columns: Sample-name, age, internal-error, and external-error.
- Save your ages in a **CSV file** (comma separated values). You can save the file straight from Excel, LibreOffice, etc., or populate it using a text editor: just separate the numbers using commas.
- Download the CEAA code from my **GitHub repository https://github.com/angelrodes/CEAA**
- Unzip the file `CEAA-master.zip`. It includes a folder with the examples shown here.
- Run the script `start.m` using Octave or MATLAB.
- A dialogue box will ask you to select your CSV file.
- You will get an output like this:

###### CEAA Excel spreadsheet:

The CEAA.xlsx spreadsheet performs the same calculations as the scripts above, except for **the BGF**, which **is approximated based on 1000 random curves**. Also, this spreadsheet does not output any plot.

How-to with your data:

- Just paste your data in the first 4 columns (tab: `CalculatorIO`). Up to 1000 ages are accepted.
- Results will appear on the right.
- Press **F9** to recalculate the random curves and check the accuracy of the BGF results.

#### Citation

If you use this code for your research, you can cite it as:

Ángel Rodés (2020) *Cosmogenic Exposure Age Averages.* github.com/angelrodes/CEAA *doi:*10.5281/zenodo.4024909