-
And that's all she wrote, folks. I've moved Distributed Ecology to GitHub; you can find all my new posts here: http://emhart.github.com/ I've imported all the old posts to the new blog, but much of the formatting was lost, so I'll leave these here as an archive. Be sure to update your RSS readers, etc. I'm excited to start using Jekyll as a new blogging platform.
-
So last week I wrote up a set of functions to visualize the spread of data on Twitter called "impacTwit" (yes, that is a ridiculous name). This has proven to be an interesting side project, so when I saw this paper by UBC's own Rosie Redfield, "'Why Do We Have to Learn This Stuff?'—A New Genetics for 21st Century Students", come across my feed a few times, I decided to run impacTwit on it. So here are a couple of graphs. First is the cumulative impact plot.
In less than 2 hours it crossed the 150k-viewer threshold. That spread fast! Who was the most influential in spreading the word?
Not surprisingly, if you want to be big on Twitter, get Carl Zimmer's attention. The cumulative impact of all the other major sources of retweets pales in comparison to Carl's. This is interesting to me, and in the future I hope to integrate these functions with the altmetrics API; I've also found another API for searching archived tweets. I also think I'll take these figures and maybe convert them into network diagrams as another kind of graphical output. Until then I guess I'll work on becoming Carl Zimmer's friend.
-
Remember 1992? I had just turned 13 and was still a year away from high school, when my true descent into nerdom and the internet would begin. Back then it was with a local BBS (Bulletin Board System) run by a guy named Charlie in a trailer park, on a 1200-baud modem and an old Compaq 386. But there were some intrepid ecologists who were already forging ahead with what we take for granted 20 years later: Ecolog. Last September, Ecolog hit 13,000 subscribers. I took the data that David mentioned in his post, plotted it out, and fitted a curve to it. The fit was based on an assumption of there being 100 subscribers when Ecolog first started. On Sunday he posted that there were now 15,000 subscribers, so I decided to see how good my fit was. Here's the original fit with the new data point and the prediction (in red). I fit this exponential growth equation:
N(d) = N0 * exp(r * d)
where r is the growth rate, d is the number of days since Sept. 1, 1992, and N0 is the initial number of subscribers.
The solid black dot is the true value; the red dot is the fitted value for the new data point.
What should that initial starting value be? Clearly it's not 100. Let's begin by assuming that Ecolog grew exponentially. We're missing a huge amount of data, so I'll say up front that yes, I know there could be any number of weird things that happened between the start year of 1992 and the first year we have data, 2006. But exponential growth isn't an unreasonable assumption. We know it's not linear; a linear fit gives you ~ -15,000 subscribers at time 0! We could make guesses and keep plotting to eyeball our fit, but the real problem to solve is how to figure out the best-fitting value for N0. It's a perfect problem for R's built-in optimization function, optim(). First we define a function that we want to minimize. Here we fit nls() with a given value of N0, return the deviance, and plug that into optim().
# Fit the exponential model with N0 fixed at `param` (dat[,1] is days,
# dat[,2] is subscribers) and return the deviance of the fit
fp.est <- function(param, dat) {
  z <- nls(dat[, 2] ~ param * exp(r * dat[, 1]), start = list(r = 0.001))
  return(deviance(z))
}

# Minimize the deviance over N0, constrained between 1 and 6000
tmp <- optim(1, fp.est, dat = dat, method = "L-BFGS-B", lower = 1, upper = 6000)
That gives us a perfectly reasonable point estimate of 545, and vastly improves our fit as you can see.
We can even get a little creative with the data and come up with a measure of standard deviation. We do this with a jackknife-style estimate of variance. Jackknife estimates are a type of resampling that works by calculating some statistic repeatedly while removing one or more of the data points. All we do is come up with all possible combinations of our data that have 8 out of 9 points, and run our optimization routine on each. This yields a final estimate, with uncertainty, for the number of subscribers of Ecolog at the very beginning.
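A minimal sketch of the leave-one-out version, assuming the dat and fp.est() objects from above (the full code is in the gist linked below):

# Drop one point at a time, re-run the optimization, and look at the
# spread of the resulting N0 estimates
n <- nrow(dat)
jack <- sapply(seq_len(n), function(i) {
  optim(1, fp.est, dat = dat[-i, ], method = "L-BFGS-B",
        lower = 1, upper = 6000)$par
})
mean(jack)  # jackknife point estimate of N0
sd(jack)    # and its standard deviation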
I have no idea if this is right, but it's a neat little example of using optim() and jackknifing to solve a problem where we have good intuition (that exponential growth is the right function) and not very much data. Here's a gist of all the code that this post used.
-
So I came across this post by Cath Ennis (@enniscath) about Einstein's riddle. I had never heard of the riddle, but the apocryphal story that goes with it is that only 2% of people in the world can solve it. Well, never wanting to miss out on a chance to prove that I'm special, I sat down and decided I should solve it. Cath did it in 18 minutes. I didn't time myself, but I'm pretty sure I didn't do it in less than 18 minutes; I'd peg my time at around 25 minutes. It reminded me of doing Sudoku. In any case, here's my answer; you can see that I solved it in PowerPoint, of all things. I find it hard to believe that 98% of people couldn't solve this. It seemed pretty easy to me, just a lot of variables to keep track of. I'd reinterpret it as: only 2% of people in the world are really nerdy enough to give a shit about logic puzzles. Here's my solution!
-
Old timey dog treadmill
Endurance athletes sometimes say they're "addicted" to exercise. In fact, scientists have shown that rhythmic, continuous exercise — aerobic exercise — can in fact produce narcotic-like chemicals in the body. Now researchers suggest that those chemicals may have helped turn humans, as well as other animals, into long-distance runners.
At this point feel free to be skeptical. The phrase is carefully worded to only imply causality, but the authors and NPR push the envelope a bit later in the piece.
"Wired to run, meaning that our brains are probably, have been sort of rewired from an evolutionary sense to encourage these running and high aerobic-activity behaviors," Raichlen says....
Meat is one payoff for runners. But Raichlen thinks there may have been another reward: a runner's high. He designed an experiment to test this idea.
Now you should be waiting for some elegantly designed experiment to test all this. The experiment that comes, though, is really not at all what one might expect. Essentially the authors took two non-human mammals, dogs and ferrets, measured levels of eCBs after exercise, and did the same with humans. Dogs and humans are "cursorial" mammals, but ferrets aren't; therefore, if dogs and humans run for the reward, they are expected to have elevated eCBs. Lo and behold, they do. The chain of causality is that eCBs make us feel good when we run, so we evolved to run, the reward being the runner's high. Anyone who's read Gould and Lewontin's "Spandrels of San Marco" paper ought to have caught the flaw in the reasoning here (I think it's still valid here despite criticisms of Gould and Lewontin). Jeremy Yoder wrote a great post about how they didn't demonstrate anything to do with evolution at all. He of course has read Endler's "Natural Selection in the Wild," as he astutely points out on his blog:
To really show that the runner's high is an adaptation, of course, we'd need data that showed (1) observed variation in the runner's high response has a genetic basis, and (2) people who get stronger runner's highs have more babies.
The logic of Raichlen is a bit like saying birds fly because it feels good to be up so high in the sky. Birds are a good example here because flightlessness has evolved multiple times. Just because the physiological structures exist that allow a behavior (running, flight) does not mean the organism engages in said behavior because of that trait. Raichlen says in his conclusion:
It is possible that neurobiological rewards induced by eCB signaling are an ancient human trait that evolved to encourage aerobic activity (Bramble and Lieberman, 2004; Carrier, 1984; Malina and Little, 2008), and that the rewards explain the evolution of differences in voluntary locomotor activity more broadly across mammals.
I highly doubt that eCBs "evolved to encourage aerobic activity." If we were to engage in "just-so" storytelling about this system, you'd say that humans who produced eCBs were more effective at capturing resources and could therefore have more children. Raichlen doesn't mention reproduction at all. But I'll leave the Endleresque criticism to Jeremy; I have another problem with this study.
Raichlen chose just two taxa besides humans. The authors are very open about this, but what they totally neglect is the underlying phylogeny of their taxonomic choices. According to recent mammalian phylogenies, ferrets and dogs are actually sister taxa. So I mapped out two possible pathways based on what I know about cursorial vs. non-cursorial mammals. We already know that mice have elevated eCBs during exercise. By the authors' own definition, let's assume that sedentary mammals like cows and sheep don't have elevated eCBs. By examining the tree, it can be seen that eCB stimulation was either lost at least twice or gained at least three times (red circles are losses and green ones gains). What would be really interesting is to see where this trait maps across the phylogeny. Is it a conserved trait that was selected for in some ancestor? That would point to the possibility that it has nothing to do with running. The authors are mute about phylogeny, but eCBs could alternatively be the ancestral character state, and then the really interesting question is why ferrets evolved the loss of this state. On the other hand, maybe the trait evolved multiple times, and then it is really interesting to ask how that happened. But either phylogenetic scenario undermines the central thesis of Raichlen: either it evolved in a common ancestor, in which case it may have nothing to do with running, or it evolved independently in dogs and humans, meaning that there may have been different selective pressures at work in each species (or the same, we don't really know).
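If you want to play with this yourself, here's a toy sketch of the idea using the ape package (the tree topology is a rough stand-in and the trait values are the assumptions above, not data from the paper): even on this six-taxon tree, elevated eCBs can't be mapped as a single gain or a single loss.

library(ape)

# Rough mammal topology: (dog, ferret) are sister taxa within Laurasiatheria,
# (human, mouse) sit together in Euarchontoglires
tree <- read.tree(text = "(((dog,ferret),(cow,sheep)),(human,mouse));")

# 1 = elevated eCBs with exercise, 0 = not (assumed values, per the post)
ecb <- c(dog = 1, ferret = 0, cow = 0, sheep = 0, human = 1, mouse = 1)

plot(tree)
tiplabels(pch = 21, bg = ifelse(ecb[tree$tip.label] == 1, "green", "red"))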
Raichlen is an anthropologist and something of a media-darling scientist who studies a topic near and dear to many people. It seems irresponsible that he was able to make so many assertions about evolution with zero evidence for it, and to make assumptions about the conservation of traits across a phylogeny without even mentioning phylogeny. So what's the role of the media in all of this? The story was picked up by lots of media sources saying things like "Yet whether other mammals are also driven to exercise by endocannabinoids has remained a mystery" (from the Economist). Unfortunately, there's no way to undo that misplaced causality.
Raichlen DA, Foster AD, Gerdeman GL, Seillier A, & Giuffrida A (2012). Wired to run: exercise-induced endocannabinoid signaling in humans and cursorial mammals with implications for the 'runner's high'. The Journal of Experimental Biology, 215(Pt 8), 1331-6. PMID: 22442371
-
Le Grand Casino of Monte Carlo
A simple integration example
Let's start with a trivial example, integrating a function we use all the time as ecologists: the normal distribution. Maybe you want to integrate the normal probability density function (PDF) from -1 to 1, because you're curious about how likely an event within 1 standard deviation is. That probability is just the area under the PDF between -1 and 1.
MC integration of the normal PDF between -1 and 1
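Here's a minimal sketch of the idea in R (my own code, not the post's): sample uniformly over the interval and multiply the interval width by the average height of the PDF at those samples.

set.seed(42)

n <- 100000
x <- runif(n, min = -1, max = 1)     # uniform samples over the interval
area <- (1 - (-1)) * mean(dnorm(x))  # width * average height of the PDF
area                                 # ~0.683, cf. pnorm(1) - pnorm(-1)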
A simple statistical example
Another place we can use these methods is statistical hypothesis testing. The simplest case is as an alternative to a t-test. Imagine you have a data set with measurements of plant height for the same species in shaded and unshaded conditions. Your data might look like this:
Treatment   Height
Shaded      13.3
Shaded      12.1
Shaded      14.7
Shaded      12.8
Unshaded    17.8
Unshaded    19.4
Unshaded    18.5
Unshaded    18.5
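A minimal sketch of the randomization test this calls for (my own code, not the post's): shuffle the group labels many times and ask how often chance alone produces a difference in means as big as the observed one.

shaded   <- c(13.3, 12.1, 14.7, 12.8)
unshaded <- c(17.8, 19.4, 18.5, 18.5)
obs.diff <- mean(unshaded) - mean(shaded)

heights <- c(shaded, unshaded)
null.diff <- replicate(10000, {
  shuffled <- sample(heights)                    # randomize the labels
  mean(shuffled[5:8]) - mean(shuffled[1:4])
})
mean(abs(null.diff) >= abs(obs.diff))            # two-tailed p-value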
"All that code for a t-test?"
That must be what you're thinking, and you're right; it's certainly unwieldy to write all that code for something so simple. But it's a good starting point for when we begin talking about null models. You may not have realized it, but in the previous example there are two assumptions behind our inference. The first is that some process or mechanism has caused a difference between our groups. The fact that plants grow to different heights in shaded and unshaded conditions says something about the way plants use light, or the way they compete for light, or maybe some other mechanism I haven't thought about. The second is that by randomizing our existing data, we can simulate a situation where we have collected the data under completely stochastic conditions, i.e. the process causing plant height is random. So there are our two assumptions: A) our data set represents the outcome of some process, and B) by randomizing we can create a null model without that process to test our own data against.

Here's where things become a bit more gray. In our example above the null hypothesis is pretty clear, and we can all agree on it, but problems arise with more complicated questions. Traditionally null models have been used to make inferences about community assembly rules. The use of these models was prevalent during the "battles" constituting what is tongue-in-cheek called the "null model wars". I won't take up any space rehashing the null model wars of the 70's and 80's, but links to good resources can be found on Jeremy Fox's Oikos blog and his post about refighting the null model wars. Suffice to say, careful attention needs to be paid to the selection of the proper null model. Nick Gotelli has lots of good papers about null models if you peruse his work. I've worked up several examples from his 2000 and 2010 papers, and sometimes the algorithms can be challenging. I'll cover more advanced methods in a future post going over some methods from Nick's papers.
-
I've been following with interest Bob O'Hara's recent post over at his blog, "Deep thoughts and silliness". In it he is highly critical of a piece that came out in Science earlier this year by Fernando Maestre, Nick Gotelli, and many others (full disclosure: I was Nick's graduate student up until last year and have known Fernando for several years). Bob's problems seem to be twofold, stemming from an assertion by Fernando that "significant relationships would indicate potentially strong effects of richness on multifunctionality".
Red lines are from errors in x, blue from errors in y. As errors in x increase, slopes tend to 0; as they increase in y, slopes vary randomly. The true slope is a thick dashed black line.
This is of course no surprise; it has certainly been published in ecology before. A quick search brought me to this 1994 paper by Steve Carpenter in Ecology, and I'd bet it came up before that. I would argue that the fundamental importance of Fernando's work is not the magnitude of the slope, but that the relationship is present and holds up across a huge observational data set. The true measure of interest is how much of the variation can be explained by species richness, and how do errors in variables affect that? So I quickly redid Bob's simulations as best I could, inferring his code from the text.
Errors in y produce almost no bias in slope, but errors in x always underestimate the slope. The black line is the true value.
Errors in x and y produce similar bias in estimates of R2.
You can see my full code over at my GitHub page. I first created a multiple regression with four variables, all equal in their effects. I chose one variable at random and imagined that I only measured my response and that one variable, yielding an R2 of ~25%. I then ran two simulations, one with measurement error in y and another with error in x, such that the correlation between the true value and "corrupted" value of x and y was between 0 and 1, the same as Bob. It's clear that measurement error in x has a different bias than error in y. Bob's problem, as I understand it, is that when measuring species richness there is surely measurement error in that variable, and this error will necessarily bias the slope. I would argue that Fernando's central point was that species richness explained some of the variance. So when I ran my simulation, I asked: how do errors in variables affect values of R2? Plotting that reveals a non-linear increase for both error types as error decreases. So at worst they are underestimating the amount of variance explained in multifunctionality (which is quite low at 4%). Therefore, if the quantity of interest is the slope, then errors in variables matter, but if the quantity of interest is R2, then the location of the error is less relevant. I've already made my case that it's the latter, so Bob's technical objections seem moot.
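The full simulation is on GitHub, but the core of the errors-in-variables effect fits in a few lines (a simplified sketch, not the code from the post):

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)          # true slope of 2

# Corrupt x or y with measurement error, then refit the regression
sim <- function(err.sd, where = c("x", "y")) {
  where <- match.arg(where)
  x.obs <- if (where == "x") x + rnorm(n, sd = err.sd) else x
  y.obs <- if (where == "y") y + rnorm(n, sd = err.sd) else y
  fit <- lm(y.obs ~ x.obs)
  c(slope = unname(coef(fit)[2]), r2 = summary(fit)$r.squared)
}

sim(1, "x")  # slope attenuated toward 0, R2 drops
sim(1, "y")  # slope unbiased on average, but R2 still drops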
So what does it all mean?
The next objection he raises is of a much more general nature, about observational studies, and I agree with it in part. One of his last sentences sums it up:
If we estimate an effect of species richness (say), it could be that the true effect of biodiversity is larger and we are only measuring an indicator, or it could be that there is no effect, but both ecosystem function and species richness are affected by the same abiotic drivers.
The former point is quite astute. Most ecological studies do measure some dependent variable of interest (multifunctionality in this case), and several independent variables that may be indicative of some process or mechanism but are not the actual mechanism. Bob's example is that an investigator may measure winter temperature and water bird abundance, but the true mechanism may be pond ice-out date. His point: mechanisms are complicated, and observational studies often miss these nuances. The second point falls into my bin of "pointless criticisms of *study type*". Yes, observational studies can never show causality with complete confidence, but does that mean we should dismiss them altogether? In ecology almost any study falls somewhere in the range between observational and microcosm. Each experiment type, from large-scale macroecology studies to mesocosms to microcosms, is open to some sort of criticism along these lines. For instance, microcosms are often criticized as being unrealistic because of their small scale; should we stop doing them? (If you say yes, Jeremy Fox has an earful for you.)
I think that most ecologists would agree that hypotheses within our field need a robust series of results from different experiment types. We need good theory and good mathematical underpinnings, but also evidence from large-scale surveys, manipulative field experiments, microcosms, etc., to have strong support for a given hypothesis. In the end, Bob's problem with errors in variables seems irrelevant given that the magnitude of the slope was not the quantity of interest. His point about how mechanisms are complicated is valid, but he follows it up by essentially dismissing all observational studies as incapable of demonstrating causality. Is Fernando's paper the alpha and omega on biodiversity and ecosystem function? Of course not, but is it "utter rubbish" as Bob says? I'd hardly say the paper warrants the vitriol.
-
Duke blew it this year
Null Model results
The vertical lines all represent different numbers of correct game picks. The dashed black line is the 95% confidence interval. It's drawn at 31, so if you had 31 or fewer picks correct on your bracket, it basically means a monkey could have done better than you, or really any animal, since you did no better than randomly choosing teams. The solid black line is 40, the number of correct picks you would have had if you simply followed the seedings of the teams. In my bracket pool most people did worse than this. My number of correct picks is the blue line, 43, achieved with the model you can see in my previous post. Finally, the red line is the best bracket chosen from the ESPN bracket challenge (it's 51). So there you have it: a way you can statistically prove that your friends are no better than monkeys and that you are a genius when it comes to NCAA brackets!
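For the curious, here's a rough sketch of how a null model like this can be built (my reconstruction with a made-up "true" tournament, not the code behind the figure). The key detail is that random picks propagate through the bracket, so a wrong early pick also costs you in later rounds.

set.seed(7)

score.random.bracket <- function() {
  truth <- 1:64   # pretend the lower-numbered team won every actual game
  picks <- 1:64   # the random bracket starts from the same field
  correct <- 0
  for (round in 1:6) {
    # Actual winners of this round
    idx <- seq(1, length(truth), by = 2)
    truth <- pmin(truth[idx], truth[idx + 1])
    # Random picks: advance one team from each pair we previously advanced
    pairs <- matrix(picks, ncol = 2, byrow = TRUE)
    picks <- pairs[cbind(seq_len(nrow(pairs)),
                         sample(1:2, nrow(pairs), replace = TRUE))]
    correct <- correct + sum(picks == truth)
  }
  correct
}

scores <- replicate(10000, score.random.bracket())
quantile(scores, 0.95)   # the cutoff a random picker rarely beats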
-
Well, it's come to that time of the year again, a moment that brings me back to the first brackets I filled out in 7th grade. Back then I didn't care that much about college basketball, so I just made guesses. Not much has changed in 20 years. I still don't care that much about college basketball, but how I guess has gotten much better. These days I use a pretty simple simulation model with a tried-and-true Bayesian core. I have an old post on this model from two years ago, but I think it's time for an update. Lots of people have fun modeling the tournament: Nate Silver at the New York Times has a good model on his 538 blog, as does a group over at Georgia Tech headed by Joel Sokol, with a model called LRMC. My own model is probably the simplest of all of these, but it shares its DNA with Nate Silver's. Here's the overview.
Model description
The guts of my model are a simple Bayesian Beta-Binomial posterior distribution. This is composed of two parts. One is a Binomial distribution, which is simply the probability that team A beats team B. The other is the prior distribution. Most often in ecology we would use a flat prior, but because this is sports we have lots of background data. Already it should be clear we have a problem: oftentimes team A has never played team B, and certainly not enough times to have any data. So the first thing we need to do is get some data. My way around this is to use Ken Pomeroy's ratings. He gives each team a "Pythagorean" ranking. What's convenient about his rankings is that you can take any two teams and convert their rankings into a probability, let's call it p, that team A beats team B. We do this using a simple formula called the log5 rule. Here's how it's implemented in the code:

# log5 rule: win probability for team A (rating a) over team B (rating b)
l5rule <- function(ratings) {
  a <- ratings[1]
  b <- ratings[2]
  wpct <- (a - (a * b)) / ((a + b) - (2 * a * b))
  return(wpct)
}
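For example, with made-up ratings, a team with a Pythagorean rating of 0.9 should beat a 0.7 team about 79% of the time:

l5rule(c(0.9, 0.7))  # 0.794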
Based on this win percentage, I simulate a 34-game season between the teams using that p. That's the binomial part of my posterior. Next I want to include prior information with that simulated data; otherwise there's no real need for simulation, since I could just use the log5 rule to calculate everything. I do this by creating a Beta distribution based on power rankings from two other experts, Jeff Sagarin and Sonny Moore. I use their rankings to create a Beta distribution centered around the p that team A beats team B. So now I have my "data" based on the Pomeroy rankings, and the prior. I combine these into my posterior and generate a single draw from that posterior distribution. I work through the entire tournament, simulating it 10,000 times, and generate a table of probabilities that each team makes it to each stage in the tournament. You can see my results here. Predictions are blue highlighted cells, and actual outcomes are red text. So far I've had about 80% accuracy, on par with the LRMC model. You can see all the code on GitHub.
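In code, a single game works out to something like this (a stripped-down sketch with a hypothetical prior weight; the real model builds the prior from the Sagarin and Moore rankings):

simulate.game <- function(p.pomeroy, p.prior, prior.weight = 20) {
  # Binomial "data": a simulated 34-game season at the log5 win probability
  wins <- rbinom(1, size = 34, prob = p.pomeroy)
  # Beta prior centered on the win probability from the other rankings
  a <- p.prior * prior.weight
  b <- (1 - p.prior) * prior.weight
  # One draw from the Beta-Binomial posterior decides the game
  p.post <- rbeta(1, a + wins, b + (34 - wins))
  runif(1) < p.post   # TRUE if team A advances
}

simulate.game(l5rule(c(0.9, 0.7)), p.prior = 0.75)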
Last thoughts
My personal bracket in my pool is outperforming my model. I sort of cheated on myself filling it out. One of the great things about the tournament is that you know there will be upsets... but which ones will they be? My model doesn't really account for that, but there is a model that does. Over at the Harvard sports analysis group they have a great model, and based on it I threw some upsets into my own picks. I'll be back later with more updates on how it performs.