PHYS291 exercise

William Sigurdsson
Starting the project

I've chosen to do some statistical analysis of the annual Stoltzekleiven race. I know they keep statistics online, so finding something to work with shouldn't be too hard.

Data mining

The statistics provided online come in a few different varieties. I'll have to make 28 different data files if I want to get all the data I can. I messed around with ROOT a bit and threw together a quick histogram of just the women's scores (no mucking about with Trees or anything). This used the simple list with just the full times.

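#include <fstream>
#include <iostream>
#include <string>
#include "TCanvas.h"
#include "TH1F.h"
#include "TString.h"

void stoltz() {
	using namespace std;
	// Histogram of finishing times in minutes: 100 bins from 0 to 40
	TH1F *h1 = new TH1F("h1", "h1 title", 100, 0, 40);
	string line;
	ifstream myfile("datadamer.txt");
	if (myfile.is_open()) {
		while (getline(myfile, line)) {
			// Locate the first colon; the time is written as mm:ss
			size_t found = line.find_first_of(":");
			if (found == string::npos || found < 2) continue;  // no usable time on this line
			TString s_min = line.substr(found - 2, 2);
			TString s_sec = line.substr(found + 1, 2);
			Double_t f_min = s_min.Atof();
			Double_t f_sec = s_sec.Atof();
			// Convert to decimal minutes and add the entry to the histogram
			Double_t tid = f_min + f_sec / 60;
			h1->Fill(tid);
		}
	} else {
		cout << "Unable to open file" << endl;
	}

	TCanvas *c = new TCanvas();
	h1->Draw();
}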

Result is this:

This approach won't work for the full data files, though. If there's a colon in a club name or something, the program would read the wrong characters as the time, or crash.
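One way around the stray-colon problem, as a minimal sketch: instead of grabbing whatever sits around the first colon, check that a whole field really looks like mm:ss before converting it. This assumes every time is zero-padded to exactly five characters, as in the lines I've looked at. parseTime and checkParseTime are hypothetical helper names, not part of the macro above, and the code is plain C++ so it runs without ROOT.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns the time in decimal minutes if `field` has exactly the form "mm:ss"
// (assumed zero-padded), or -1.0 if it does not.
double parseTime(const std::string &field) {
    if (field.size() != 5 || field[2] != ':') return -1.0;
    const int digitPositions[] = {0, 1, 3, 4};
    for (int i : digitPositions) {
        if (!std::isdigit(static_cast<unsigned char>(field[i]))) return -1.0;
    }
    int minutes = (field[0] - '0') * 10 + (field[1] - '0');
    int seconds = (field[3] - '0') * 10 + (field[4] - '0');
    if (seconds >= 60) return -1.0;  // not a valid clock time
    return minutes + seconds / 60.0;
}

// Small usage check: a real time is accepted, a club name is rejected.
void checkParseTime() {
    assert(parseTime("19:43") > 19.7 && parseTime("19:43") < 19.8);
    assert(parseTime("Universitetet i Bergen") < 0.0);
}
```

A field that fails the check can simply be skipped, so a colon in a name never reaches the histogram.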

Parsing lines

A random line in one of the data files looks like this:

40.	Ladislav Kocbach	19:43	Universitetet i Bergen	03:09	09:00 (05:51)	17:05 (08:05)	19:43 (02:38)

Hopefully the weird long spaces work as delimiters, or this is going to get annoying.
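If the long spaces turn out to be tab characters (an assumption, I haven't confirmed it for every file), splitting a line into fields can be sketched with std::string::find alone, no stringstreams needed. splitFields and checkSplitFields are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits `line` on a single delimiter character (tab by default),
// keeping empty fields so that column positions stay fixed.
std::vector<std::string> splitFields(const std::string &line, char delim = '\t') {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t end = line.find(delim, start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Usage check on the sample line, assuming its separators are tabs.
void checkSplitFields() {
    const std::string line =
        "40.\tLadislav Kocbach\t19:43\tUniversitetet i Bergen\t03:09\t"
        "09:00 (05:51)\t17:05 (08:05)\t19:43 (02:38)";
    std::vector<std::string> f = splitFields(line);
    assert(f.size() == 8);
    assert(f[2] == "19:43");                   // full time
    assert(f[3] == "Universitetet i Bergen");  // club, spaces preserved
}
```

Because only the tab counts as a separator, names and clubs containing ordinary spaces stay in one piece.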


Tried using stringstreams. Gave up using stringstreams.

Minor breakthrough

Finally got the parsing to work, and after a long time struggling, the tree is behaving like a good tree. Running out of time, though.


Okay. I got the histogram factory up and running, and I've got a few results. First of all, let's take a look at the overall race scores.

That's 4782 different runners altogether, with a mean time of 15.86 minutes. This includes men and women, adults and children.

Let's see how the results are distributed by gender.

I had to weight the female population since there were more than twice as many men. The average man scores a couple of minutes better than the average woman.
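The weighting idea can be sketched roughly like this: rescale one histogram's bin contents so its total matches the other's, which makes the shapes comparable. This is an illustration of the general technique, not necessarily the exact weighting used for the plot. scaleToTotal is a hypothetical helper working on a plain vector of bin contents. In ROOT the same effect could presumably be had with TH1::Scale and a factor built from TH1::Integral.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Returns a copy of `bins` rescaled so that its entries sum to `targetTotal`.
// The relative shape of the distribution is unchanged.
std::vector<double> scaleToTotal(const std::vector<double> &bins, double targetTotal) {
    double total = std::accumulate(bins.begin(), bins.end(), 0.0);
    assert(total > 0.0);  // an empty histogram cannot be rescaled
    std::vector<double> scaled;
    scaled.reserve(bins.size());
    for (double b : bins) {
        scaled.push_back(b * targetTotal / total);
    }
    return scaled;
}
```

Passing the men's total as targetTotal puts both histograms on the same vertical scale, and passing 1.0 turns the bins into fractions of the population.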

I wanted to take a look at how the race times varied with how you chose to pace your run. I had the data for three split times, and chose to see how the ratio of the split time to the full time compared to your final score.

For the first split most people had used 16-18% of their full time, and that's also the range where the fastest runners were. It could seem like giving too much on the first split results in some really bad times, which kind of makes sense.
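The ratio itself is simple arithmetic, sketched here under the assumption that both times are written as mm:ss. toSeconds and splitFraction are hypothetical helper names. For the sample line quoted earlier, a first split of 03:09 out of a full time of 19:43 is 189 s out of 1183 s, or about 16%.

```cpp
#include <cassert>
#include <cstdio>

// Converts text starting with "mm:ss" to seconds; returns -1 if it does not match.
int toSeconds(const char *text) {
    int minutes = 0, seconds = 0;
    if (std::sscanf(text, "%d:%d", &minutes, &seconds) != 2) return -1;
    return minutes * 60 + seconds;
}

// Fraction of the full time used to reach the split, or -1.0 on bad input.
double splitFraction(const char *split, const char *full) {
    int s = toSeconds(split);
    int f = toSeconds(full);
    if (s < 0 || f <= 0) return -1.0;
    return static_cast<double>(s) / f;
}

// Usage check with the sample line's first split and full time.
void checkSplitFraction() {
    double r = splitFraction("03:09", "19:43");
    assert(r > 0.159 && r < 0.160);
}
```

Working in whole seconds avoids rounding the times before dividing.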

Things I should have done

The statistical analysis of this exercise was extremely non-rigorous, although I did get a lot more familiar with ROOT. I would have liked to attempt to fit some of my histograms to maybe a beta distribution, but I didn't have the time. I estimate approximately 95% of my time was spent yelling at the screen trying to get my Tree to do as it was told.

I should also have done something with the distribution of ages. Unfortunately, the runners' ages were sorted into nonuniform groups rather than given as actual ages, which made working with them bothersome and unproductive. Additionally, I wanted to somehow include a lego plot, because I think they look kind of cool.