This forum is about wrong numbers in science, politics and the media. It respects good science and good English.
In regard to JEB's piece "Curiouser and curiouser", the strange correlation between rate of sea level rise and sunspot numbers was actually first identified by Steve McIntyre of the Climate Audit blog, as I remember it. I managed to track down the relevant Climate Audit blog post, written in Feb 2007:
Willis Eschenbach of WUWT embarks on a strange grandstanding-type attempt to debunk the comparison when it appears years later in the Solheim paper. I think most people would be impressed by the correlation, even if it cannot currently be explained. McIntyre's comments on the correlation back in 2007 were "The maxima and minima of the solar cycles seem to match the fluctuations in sea level rise rather uncannily... offhand, I can’t think of any two climate series with better decadal matching".
The Eschenbach analysis, which tries to prove there is no correlation, is I think pretty poor. He is using the R squared test, which as I've always understood it is used to check how well experimental data matches some curvefit to the data, where the curvefit might be a simple straight line, a polynomial formula, or some more complicated mathematical formula. But he seems to be applying the R squared idea directly to the y coordinate values of the two experimental data sets, without there being any curvefit in the analysis.
To give a historical example of the value of publishing correlations which aren't currently understood, and may even just be coincidences, consider the pioneer Victorian epidemiologist William Farr, who wrote a report called "The Mortality of Cholera in England, 1848-49". The only parameter that Farr could find which seemed to be connected to mortality rate was the height above sea level at which the cholera victims lived, and he worked out a mathematical formula to express the elevation-related result. This correlation didn't make a lot of sense at the time, and Farr interpreted it as meaning that 'miasma' or 'bad air' (which was assumed to be the cause of disease at the time) must be more prevalent at low elevation. If cholera is a water-borne disease, which it was eventually shown to be, the correlation makes a lot more sense, and it suggests that the closer you are to the River Thames, the more likely you are to get cholera through contaminated drinking water.
Nice one Dave.
I've long since developed the view that if the statistical test of choice denies the conclusion that is blatantly obvious upon eyeballing, then the statistical test is wrong.
This kind of data defies statistical analysis. We use statistics on data from repeated treatments under controlled conditions (and only then because we can't afford to do enough repeats to put the result entirely beyond reproach); we can't do it on one-off results pulled out of the aether of the real world, particularly data (sea level changes) that are bound to be affected by hundreds of factors, known and unknown, some of them cyclical. There are so many confounders out there that the bar of statistical significance is simply too high, and in this case has resulted in the writing off of a correlation that looks genuine to the eyeball. The fact that we cannot robustly (by statistical standards anyway) draw a positive conclusion does not force us to draw the negative conclusion.
If there's a criticism of the method to make, it's that picking sunspots as the one potential cause to look at was probably data-driven. That still doesn't mean it's not a factor, or a Thames-distance/height proxy for something that is a factor.
R² is for curve fits, but I (admittedly rusty) can't think of a good reason it couldn't be applied to this kind of data. Ideally you would make some kind of rank correlation out of it first; doing it on sinusoidal stuff is likely to go badly wrong.
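For anyone wondering what "some kind of rank correlation" might look like in practice, here is a minimal sketch of a Spearman-style rank correlation: each series is replaced by its ranks, and the ordinary correlation coefficient is then taken of the ranks. The function names and the numbers are made up for illustration (they are not the sea level or sunspot data), and this is only one of several possible rank-based measures.

```python
import math

def pearson_r(x, y):
    """Ordinary (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_r(x, y):
    """Rank correlation: the Pearson coefficient of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Made-up example: y rises steadily with x, but not in a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [v ** 3 for v in x]

print(pearson_r(x, y))   # noticeably below 1
print(spearman_r(x, y))  # 1 to within rounding, since the ordering matches exactly
```

The point of the example is that the rank version only cares about ordering, so it is not thrown by a relationship that is consistent but not a straight line.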
On the issue of checking whether two time series are correlated or not, my understanding of the basic statistical test to do this is that you calculate a parameter called the "Pearson correlation coefficient". This coefficient is described in this Wikipedia article:
This coefficient is usually defined in terms of a quantity called 'covariance'. However, looking at the section 'Mathematical properties' in the Wikipedia article, it appears that after some manipulation it does tie up with the maths used for the R squared test in curvefitting. So the R in the R squared test is actually Pearson's correlation coefficient (at least when the curvefit is a least-squares straight line).
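A quick numerical check of that tie-up, as a sketch only: the covariance definition of R is computed for one series against another, then a least-squares straight line is fitted and its R squared is worked out as 1 minus the ratio of residual to total sum of squares. The function names and data values are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from the covariance definition."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def r_squared_of_line_fit(x, y):
    """R squared (1 - SS_res / SS_tot) of the least-squares line y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Made-up illustrative numbers, not the sea level or sunspot data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.7]

r = pearson_r(x, y)
r2 = r_squared_of_line_fit(x, y)
print(r * r, r2)  # the two values agree to rounding error
```

So applying "R squared" to two data series amounts to fitting one as a straight-line function of the other.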
The Wikipedia article does not state a criterion for what is an acceptable value for R. However there is a criterion used in civil engineering for "statistically independent time histories" that the absolute value of R should be less than 0.3. ['Time history' is the rather ugly engineering equivalent term to 'time series'.] Willis Eschenbach has calculated an R² of 0.13 for the rise of sea level and sunspot numbers time series, which corresponds to an absolute R value of 0.36. Using the civil engineering criterion, the R value for the two time series is not actually low enough to claim they are uncorrelated, but the correlation does appear to be pretty weak.
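The arithmetic behind that comparison is just a square root, sketched below. The variable names are mine; the 0.13 and 0.3 figures are the ones quoted above. Note that R² only gives the magnitude of R, not its sign.

```python
import math

r_squared = 0.13   # value quoted from the Eschenbach analysis
threshold = 0.3    # civil engineering criterion for "statistically independent"

r = math.sqrt(r_squared)              # magnitude of R only; the sign is lost
is_below_threshold = r < threshold

print(round(r, 2))          # 0.36
print(is_below_threshold)   # False: not low enough to call the series independent
```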
If you go back to the covariance definition of R, the covariance is itself defined in terms of the mean values of the time series over the time range. That suggests to me that to work out an accurate value for the correlation coefficient you need to calculate accurate mean values of the two time series, which is I think likely to depend on the number of time points used in the digitisation of the graph curves. If I was digitising the graphs myself I would use a lot of time points, and have the time points at equal time intervals, to make sure the means were accurately calculated. Eschenbach may have just used the key points in the graphs at non-uniform time intervals for the digitisation to reduce the amount of work, particularly if he is carrying out the digitisation by hand with the values estimated using a ruler from a printed-out copy of the graphs. If I was doing the digitisation myself I would use a free program called "Plot Digitizer", which should be much faster than trying to do it by hand (http://plotdigitizer.sourceforge.net/), and would enable an ample number of time points to be used.
However I think the main problem with the Eschenbach analysis, as JamesV alludes to, is that he is acting as though the correlation coefficient calculation is what is known as a 'robust' statistical test. In the Wikipedia article it is pointed out in the 'Robustness' section that the test is not robust, so you can't really assume it is reliable enough to override a visual inspection of the graphs. There may be a robust statistical test in existence for checking whether a given pair of time series are correlated or not, but if there is I would imagine it would be a much more complicated test and would probably require some qualified statistician to apply it.
I found an old book on the subject of time series analysis from my undergraduate days. It was not on the syllabus, which was all historical engineering rather than the electronic engineering that I hoped to pursue. The bookmark in it was a piece of paper, on which I had typed the following:
Time series analysis
Is a compound of fallacies
By subtle catalysis
Causing mental paralysis.
GRRR This fits into our author's poem. The linked video caused my blood pressure to increase. It wouldn't take much to cause me to throw a tantrum. I plot the raw data for the red regions and can't find jack @()#$ in the way of increasing temperatures. There are folks out there with real hats on that are also saying the same thing, but how can these people be so arrogant?
I thought I'd update this thread as I have finally got round to calculating the correlation coefficient myself for the sea level change and sunspot number pair of time series.
The method I used was:
a) digitise the graphs from a screen capture of Fig 2 in the Solheim paper using 'PlotDigitizer'. PlotDigitizer only takes about 5 minutes to produce pretty accurate x, y coordinates for the two graphs, whereas it would take at least several hours if you tried to estimate the coordinate values by scaling from an enlarged printed-out copy of Fig 2 using a ruler.
b) interpolate the points at uniform time intervals using a short Fortran program. With this I turned each time series into 911 data points at 0.1 year intervals for the year range 1909 to 2000.
c) run the two 911 point time series through another short Fortran program called 'corrcheck', which I wrote about twenty years ago and which calculates the absolute value of the correlation coefficient. This program uses the textbook definition of R, the same one as in the Wikipedia article.
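For anyone who wants to repeat steps b) and c) without Fortran, here is a rough Python sketch of the same idea. It is not the 'corrcheck' program or the interpolation program mentioned above: the function names are hypothetical, linear interpolation is an assumption (the method actually used is not stated), and the digitised points are made-up placeholders rather than values from Fig 2 of the Solheim paper.

```python
import bisect
import math

def resample(points, grid):
    """Linearly interpolate sorted (time, value) pairs onto the times in grid.

    Times outside the digitised range are clamped to the end values.
    """
    times = [t for t, _ in points]
    out = []
    for t in grid:
        if t <= times[0]:
            out.append(points[0][1])
        elif t >= times[-1]:
            out.append(points[-1][1])
        else:
            j = bisect.bisect_right(times, t)
            t0, v0 = points[j - 1]
            t1, v1 = points[j]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out

def pearson_r(x, y):
    """Textbook correlation coefficient from the covariance definition."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Placeholder digitised points at non-uniform times (NOT the real Fig 2 data).
sea_level_points = [(1909.0, 1.2), (1917.3, 2.9), (1923.8, 0.7), (1928.4, 2.5),
                    (1938.1, 3.1), (1944.6, 1.0), (1958.9, 3.8), (1965.0, 1.1),
                    (1980.5, 3.5), (1986.4, 0.9), (1990.0, 3.3), (2000.0, 2.6)]
sunspot_points = [(1909.0, 40.0), (1913.5, 5.0), (1917.6, 105.0), (1923.4, 6.0),
                  (1937.4, 115.0), (1944.2, 10.0), (1957.9, 190.0), (1964.7, 10.0),
                  (1979.9, 155.0), (1986.7, 13.0), (1989.6, 158.0), (2000.0, 120.0)]

# 911 points at 0.1 year intervals covering 1909 to 2000 inclusive.
grid = [1909 + i / 10 for i in range(911)]

sea_level = resample(sea_level_points, grid)
sunspots = resample(sunspot_points, grid)

r_abs = abs(pearson_r(sea_level, sunspots))
print(len(grid), r_abs)
```

With real digitised coordinates pasted in place of the placeholder lists, the printed value would be the figure to compare against the 0.3 criterion.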
The value for the correlation coefficient I worked out was 0.35842, or 0.36 to two decimal places. This agrees very well with the value quoted by Willis Eschenbach [I was expecting to prove that he had significantly underestimated it].
So this example does raise an interesting point: two time series which look pretty well correlated by eyeball have quite a poor R value, only just above the acceptance criterion of 0.3 for them being taken as correlated.
Yes. I think your example illustrates the fragility of trying to use just one number to summarise the relationship between two waveforms. Consider a sine and a cosine (possibly derived in a physical system from the same source). The cross-correlation function is clearly periodic, but its average is zero. Even more difficult is the case where there is clearly oscillatory behaviour, but the periodic time is wandering, as happens in many common physical processes. This is clearly visible to the eye, but not to the single number. Even the cross-correlation function might blur the effect to relative invisibility.
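The sine and cosine case is easy to try numerically. The sketch below (function and variable names are mine, and the sampling choices are arbitrary) computes the single-number correlation coefficient between a sine and a cosine over a whole number of periods, then repeats it with the sine shifted by a quarter of a period.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

samples_per_period = 100
n_periods = 4
n = samples_per_period * n_periods

phase = [2.0 * math.pi * i / samples_per_period for i in range(n)]
sine = [math.sin(p) for p in phase]
cosine = [math.cos(p) for p in phase]

# Same time points: the single number says "no relationship".
r_zero_lag = pearson_r(sine, cosine)

# Pair each cosine sample with the sine sample a quarter period later.
lag = samples_per_period // 4
r_quarter_lag = pearson_r(sine[lag:], cosine[:n - lag])

print(r_zero_lag)     # essentially zero
print(r_quarter_lag)  # essentially one
```

Two waveforms from the same source can therefore score near zero or near one depending only on the lag at which they are compared, which is why a single coefficient can miss what the eye sees.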