Critique of chipping sparrow coalition paper

Our critique of a recent paper on chipping sparrow territorial coalitions was published in Biology Letters. The original paper, by Sarah Goodwin and Jeff Podos of UMass, suggested that male sparrows form territorial alliances in response to simulated intrusions, and that these alliances depend on the vocal performance of the songs sung by the allies and the intruder. We found that the paper had several serious flaws, including:

  1. all subjects and their presumed neighbors were unbanded, and most neighbors were not even recorded before or after the experiment, complicating the conclusions about their identities;
  2. the authors did not rule out alternative non-cooperative scenarios that may have led to the presence of extra birds during some of the simulated intrusions;
  3. analyses were carried out on an inadequate metric of trill performance (namely trill rate), while analyses on better metrics such as vocal deviation were not reported, apparently because they yielded no significant effect;
  4. there were numerous problems with the analyses on coalition forming, which led to artificially low p-values; at least some of the results can be shown to be non-significant once corrected.

You can read our critique and the reply by Goodwin and Podos by clicking the respective links. Based on the points above and others, we argued that the authors' conclusions had no empirical footing. In their reply, the authors declare that they stand by their original methods, design, and analyses, rather unconvincingly in my opinion. As far as I can tell, their main defense is that they can identify individuals from song (but given that most neighbors were not recorded before or after the experiment, this would not help them ascertain that the extra birds were neighbors, or that those birds were there to "cooperate"), along with some rather weak arguments regarding the statistics.

In particular, although they grant that all of their original binomial tests yielded artificially low p-values, they manage to squeeze out a p-value smaller than 0.05 for one of them, and that seems to be the last thread they hang on to. Ironically, this made-to-order test (see the definition of p-hacking) is also incorrect: the authors use the "observed" chance levels for 3 subjects and 0.74 for the rest of the subjects, but to be correct they would actually have to re-calculate the chance levels for the remaining six (in reality five) subjects as well. If the 3 subjects for which they have trill-rate information on presumed neighbors are the ones with the highest trill rates, then the chance level for the remaining subjects would be 0.90, and the binomial test would no longer give a significant effect.
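The dependence of the binomial test on the assumed chance level can be sketched in a few lines. The subject and success counts below are hypothetical placeholders (the paper's exact numbers are not reproduced here); the point is only that a one-sided binomial p-value grows as the assumed chance level grows, so testing against a chance level of 0.74 where 0.90 is appropriate understates the p-value.

```python
from math import comb

def binom_sf(k, n, p):
    """One-sided binomial p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical counts for illustration only: 9 subjects,
# 8 "successes" (a higher-performing bird joining the subject).
n, k = 9, 8
p_low = binom_sf(k, n, 0.74)   # chance level used for most subjects in the reply
p_high = binom_sf(k, n, 0.90)  # recalculated chance level suggested in our critique

# The p-value is strictly larger under the higher chance level, so using
# 0.74 where 0.90 is correct makes the result look more significant.
assert p_low < p_high
print(f"p (chance=0.74) = {p_low:.3f}, p (chance=0.90) = {p_high:.3f}")
```

Whatever the true counts, the one-sided tail probability is monotonically increasing in the chance level, so the direction of the error is not in doubt; only its size depends on the actual data.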

My hope is that our critique will illustrate the pitfalls of studying social behavior in unmarked individuals, as well as establish more rigorous standards for future claims of complex social strategies.