Certainty score

Back in June, I wrote about algorithms I was working on to take the volunteer data and spit out the “correct” classification for each image. First, I made a simple majority-rules algorithm and compared its results to several thousand classifications done by experts. Then, when the algorithm came up with no answer for some of the images (because no species had a majority of the votes), I tried a plurality algorithm, which just looked to see which species got the most votes, even if it didn’t get more than half the votes. It worked well, so I’m using the plurality algorithm going forward.
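To make the difference between the two rules concrete, here is a minimal Python sketch. It is not the code I actually ran; the function names are made up for illustration, and returning no answer when the top species are tied is an assumption, since the tie case isn't discussed above.

```python
from collections import Counter


def majority_vote(votes):
    """Return the species with more than half the votes, else None."""
    if not votes:
        return None
    species, n = Counter(votes).most_common(1)[0]
    return species if n > len(votes) / 2 else None


def plurality_vote(votes):
    """Return the species with the most votes, even without a majority."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    # Assumption: an exact tie for first place is left unanswered.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


votes = ["zebra", "zebra", "wildebeest", "gazelle", "buffalo"]
print(majority_vote(votes))   # no species has more than half the votes
print(plurality_vote(votes))  # zebra still has the most votes
```

With these five votes, the majority rule gives no answer while the plurality rule picks zebra, which is exactly the gap the plurality algorithm was meant to fill.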

One of the things I’ve been curious about is whether we can detect when particular images are “hard.” You know what I mean by hard: animals smack up in front of the camera lens, animals way back on the horizon, animals with just a tip of the ear or a tuft of tail peeking onto the image from one side, animals obscured by trees or the dark of night.

So how can we judge “hard”? One way is to look at the “evenness” of the volunteer votes. Luckily, in ecology, we deal with evenness a lot. We frequently want to know what species are present in a given area. But we also want to know more than that. We want to know if some species are very dominant in that area or if species are fairly evenly distributed. For example, in a famous agricultural ecology paper*, Cornell entomologist Richard Root found that insect herbivore (pest) species were less even on collards grown in a big plot with only other collards around than on those grown in a row surrounded by meadow plants. In other words, the insect species in the big plot were skewed toward many individuals of just a few species, whereas in the meadow rows, there were a lot more species with fewer individuals of each species.

We can adopt a species evenness metric called “Pielou’s evenness index” (which, for you information theorists, is closely related to Shannon entropy).

[An aside: I was surprised to learn that this index is named for a woman: Dr. Evelyn Chrystalla Pielou. Upon reflection, this is the first time in my 22 years of formal education (in math, computer science, and ecology) that I have come across a mathematical term named for a woman. Jacqueline Gill, who writes a great paleo-ecology blog, has a nice piece honoring Dr. Pielou and her accomplishments.]

Okay, back to the Pielou index: we can use it to judge how even the votes are. If all the votes are for the same species, we can have high confidence. But if we have 3 votes for elephant and 3 votes for rhino and 3 votes for wildebeest and 3 votes for hippo, then we have very low confidence. The way the Pielou index works out, a 0 means all the votes are for the same species (high skew, high confidence) and a 1 means there are at least two species and they all got the same number of votes (high evenness, low confidence). Numbers in between 0 and 1 are somewhere between highly skewed (e.g. 0.2) and really even (e.g. 0.9).
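For the curious, here is a rough Python sketch of the calculation. The standard definition is the Shannon entropy of the vote proportions divided by the natural log of the number of species that received votes. With only one species that ratio is 0/0, so scoring it as 0 follows the convention described above rather than the formula itself. The function name is hypothetical.

```python
import math
from collections import Counter


def pielou_evenness(votes):
    """Pielou's evenness index J = H / ln(S) for a list of species votes."""
    counts = Counter(votes)
    n_species = len(counts)
    # Convention from the post: a single species scores 0.
    # Simplification: an empty list of votes also returns 0 here.
    if n_species <= 1:
        return 0.0
    total = sum(counts.values())
    # Shannon entropy H of the vote proportions.
    shannon = -sum((c / total) * math.log(c / total) for c in counts.values())
    return shannon / math.log(n_species)


unanimous = ["wildebeest"] * 12
four_way = ["elephant"] * 3 + ["rhino"] * 3 + ["wildebeest"] * 3 + ["hippo"] * 3
skewed = ["zebra"] * 9 + ["wildebeest"]

print(pielou_evenness(unanimous))  # all votes for one species
print(pielou_evenness(four_way))   # perfectly even four-way split
print(pielou_evenness(skewed))     # somewhere in between
```

The unanimous image scores 0, the four-way split scores 1 (up to floating-point rounding), and the 9-to-1 image lands between the two.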

Another way we could measure the difficulty of an image is to look at how many people click “nothing here.” I don’t like it, but I suspect that some people use “nothing here” as an “I don’t know” button. Alternatively, if animals are really far away, “nothing here” is a reasonable choice. We might assume that the percentage of “nothing here” votes correlates with the difficulty of the image.

I calculated the Pielou evenness index (after excluding “nothing here” votes) and the fraction of “nothing here” votes for the single-species images that were classified by experts. And then I plotted them. Here I have the Pielou index on the x-axis and the fraction of “nothing here” votes on the y-axis. The small pink dots are the 3,775 images that the algorithm and the experts agreed on, the big blue dots are the 84 images that the plurality algorithm got wrong, and the open circles are the 29 images that the experts marked as “impossible.”
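Here is a sketch of how the two numbers for one image might be computed together, one (x, y) point per image. This is an illustration rather than my actual script: the `BLANK` label, the function name, and the choice to score an image as 0 when zero or one species remain after dropping the blanks are all assumptions.

```python
import math
from collections import Counter

BLANK = "nothing here"  # hypothetical label for a "nothing here" vote


def difficulty_scores(votes):
    """Return (pielou_index, blank_fraction) for one image's votes."""
    if not votes:
        raise ValueError("need at least one vote")
    # y-axis: share of all votes that were "nothing here".
    blank_fraction = votes.count(BLANK) / len(votes)
    # x-axis: Pielou evenness over the remaining species votes only.
    counts = Counter(v for v in votes if v != BLANK)
    total = sum(counts.values())
    if len(counts) <= 1:
        pielou = 0.0  # assumption: zero or one species left scores 0
    else:
        shannon = -sum((c / total) * math.log(c / total)
                       for c in counts.values())
        pielou = shannon / math.log(len(counts))
    return pielou, blank_fraction


votes = ["wildebeest"] * 6 + [BLANK] * 2
print(difficulty_scores(votes))
```

For this example, all six species votes agree, so the Pielou index is 0, and two of the eight votes are blanks, so the “nothing here” fraction is 0.25.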

[Figure: Pielou-and-blanks]

And sure enough, we see that the images the algorithm got wrong had relatively high Pielou scores. And the images that were “impossible” had either high Pielou scores or a high fraction of “nothing here” votes (or both). I checked out the four anomalies over on the left with a Pielou score of zero. All four were unanimously voted as wildebeest. For the three “impossibles,” both Ali and I agree that wildebeest is a reasonable answer. But Ali contends that the image the algorithm got wrong is almost certainly a buffalo. (It IS a hard image, though: right up near the camera, and at night.)

So we do seem to be able to get an idea of which images are hardest. But note that there are a lot more correct answers with high Pielou scores and high “nothing here” fractions than errors or “impossibles”. We don’t want to throw out good data, so we can’t just ignore the high-scorers. But we can attach a measure of certainty to each of our algorithm’s answers.

* Richard B. Root. 1973. Organization of a Plant-Arthropod Association in Simple and Diverse Habitats: The Fauna of Collards (Brassica oleracea). Ecological Monographs, Vol. 43, No. 1, pp. 95-124.



About Margaret Kosmala

I am an ecologist exploring the complex dynamics of plant and animal systems. I am especially interested in understanding how species communities change over time and how humans impact them.

6 responses to “Certainty score”

  1. Tali Hoffman says :

    Margaret I am enjoying your blogs so much. You make science so accessible and understandable to me, and as a scientist trying to do the same I greatly admire what you are doing. I would really like to talk to you about MammalMAP – the Africa-wide mammal atlasing project that I manage. Have a look here (http://mammalmap.blog.com/) to get a brief idea of what MammalMAP is about, and if you are willing to chat to me about this please do get in touch. Thanks! Tali Hoffman

  2. william garner says :

    Maybe adding an ‘I don’t know’ button would help sort out the true ‘nothing here’ from the other ‘hard’ ones. Having classified a few thousand of these images I must confess that at times there was something in the image but not identifiable with any kind of certainty so they were classified as ‘nothing here’. Other times just a wild guess was entered. Neither helps your statistics.

    Just a simple calculation shows that your algorithm correctly classifies the ‘hard’ images to about 97% which is quite impressive. So how good is good enough? It seems that dropping the really hard images would barely budge the overall results, especially since these ‘hard’ images are a small fraction of the overall total of usable images.

    • Margaret Kosmala says :

      Great observations, william. I’m going to talk more about “how good is good enough” this coming week. Stay tuned!
