This Sage worksheet was developed for the MAA PREP Workshop "Sage: Using Open-Source Mathematics Software with Undergraduates" (funding provided by NSF DUE 0817071).
Although Sage began as a project in algebra and geometry, it has many functions for statistics and finance. Particularly due to the R project being a component of Sage, we have very powerful statistical techniques at our disposal. However, some basic things are built right in.
{{{id=1| mean([1,2,3,5]) /// 11/4 }}} {{{id=3| std([1,2,2,4,5,6,8]) # The standard deviation /// sqrt(19/3) }}}Once we get beyond such things, it will be slightly more complicated. Even if you are just trying to generate a random sample from a given type of continuous distribution, there are several ways to do this. In the first example, we use the simplest method of generating random elements from a log normal distribution with (normal) mean 2 and $\sigma=3$. Notice that there is really no way around making some kind of loop.
{{{id=6| my_data=[lognormvariate(2,3) for i in range(10)] my_data /// [1.2200073608568685, 0.99262658944848614, 3.4397547697767741, 3.4838891195212311, 19654.894101139962, 26.815549450474581, 164.14210210663543, 774.31598178441175, 57.1938092064524, 11.286595257004107] }}}We can check whether the mean of the log of the data is close to 2.
{{{id=19| mean([log(item) for item in my_data]) /// 3.4072843137524691 }}}In the following example, we let 'dist' be the variable assigned to a continuous Gaussian/normal distribution with standard deviation of 3, and then use the '.get_random_element()' method ten times, adding 2 each time so that the mean is equal to 2. This uses the Gnu scientific library under the hood.
{{{id=8| dist=RealDistribution('gaussian',3) my_data=[dist.get_random_element()+2 for _ in range(10)] my_data /// [2.91919853699, 0.726840961836, 1.38635997794, 0.713787150357, 5.9164163059, 3.94242954162, 5.58270406261, -3.22142134694, 2.78213058546, 1.19652246063] }}}Unfortunately, we don't yet have as easy access to discrete distributions. To find these, we access one of several other pieces of Sage which have statistics built into them. One of these is Scipy, which we have to 'import'.
Here, we are calling 'binom_dist' the binomial distribution with 20 trials and 5% expected failure rate. The '.pmf(x)' method gives the probability of $x$ failures, which we then plot in a bar chart for $x$ from 0 to 20. (Note that 'range(21)' means all integers from zero to twenty!)
{{{id=11| import scipy.stats binom_dist = scipy.stats.binom(20,.05) bar_chart([binom_dist.pmf(x) for x in range(21)]) ///Scipy's statistics can do other things too. Here, we find the median (as the fiftieth percentile) of an earlier data set.
{{{id=12| scipy.stats.scoreatpercentile(my_data, 50) /// 2.0842452816975765 }}}The key thing to remember here is to look at the documentation! Particularly for Scipy, not everything in Sage is "wrapped" with an easy command, so you may have to do some experimentation. This is also a place where we would be looking for volunteers to help improve things!
There are several other pieces of Sage that have statistical capabilities, but by far the most important is the R project, which is the industry and academic standard for statistical analysis of all kinds. We'll finish our 'Quickstart' with this.
Even here, there are several ways to access R. One of the easiest is to just put 'r()' around things you want to make into statistical objects, and then use R commands via 'r.method()' to pass them on to Sage for further processing. This is particularly nice if you will use the results in something else. However, it isn't always easy to do very complicated commands using this interface.
The following example of the Kruskal-Wallis test comes directly from the examples in 'r.kruskal_test?' in the notebook.
{{{id=14| x=r([2.9, 3.0, 2.5, 2.6, 3.2]) # normal subjects y=r([3.8, 2.7, 4.0, 2.4]) # with obstructive airway disease z=r([2.8, 3.4, 3.7, 2.2, 2.0]) # with asbestosis a = r([x,y,z]) # make a long R vector of all the data b = r.factor(5*[1]+4*[2]+5*[3]) # create something for R to tell which subjects are which a; b # show them /// [1] 2.9 3.0 2.5 2.6 3.2 3.8 2.7 4.0 2.4 2.8 3.4 3.7 2.2 2.0 [1] 1 1 1 1 1 2 2 2 2 3 3 3 3 3 Levels: 1 2 3 }}} {{{id=18| r.kruskal_test(a,b) # do the KW test! /// Kruskal-Wallis rank sum test data: sage36 and sage2 Kruskal-Wallis chi-squared = 0.7714, df = 2, p-value = 0.68 }}}Looks like we can't reject the null hypothesis here.
Finally, the best way to use R seriously is to simply ask each individual cell to evaluate completely in R, using a so-called "percent directive". Here is a sample linear regression from John Verzani's simpleR text. Notice that R also uses the '#' symbol to indicate comments.
{{{id=16| %r x = c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37) # ages of individuals y = c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178) # maximum heart rate of each one png() # turn on plotting plot(x,y) # make a plot lm(y ~ x) # do the linear regression abline(lm(y ~ x)) # plot the regression line dev.off() # turn off the device so it plots /// Call: lm(formula = y ~ x) Coefficients: (Intercept) x 210.0485 -0.7977 null device 1 }}}To get a whole worksheet to evaluate in R (and be able to ignore the '%'), you could also drop down the 'r' option in the menu close to the top which currently has 'sage' in it.
There are many texts (and even another PREP workshop this summer) which discuss using R in introductory statistics, so this 'Quickstart' should be just the beginning if you are interested!
{{{id=17| /// }}}