Summary of the Experts
Last week, william garner asked me in the comments to my post ‘Better with experience’ how well the experts did on the roughly 4,000 images that I’ve been using as the expert-identified data set. How do we know that those expert identifications are correct?
Here’s how I put together that expert data set. I asked a set of experts to classify images on snapshotserengeti.org — just like you do — but I asked them to keep track of how many they had done and any that they found particularly difficult. When I had reports back that we had 4,000 done, I told them that they could stop. Since the experts were reporting back at different times, we actually ended up doing more than 4,000. In fact, we’d done 4,149 sets of images (captures), and we had 4,428 total classifications of those 4,149 captures. This is because some experts got the same capture.
Once I had those expert classifications, I compared them with the majority algorithm. (I hadn’t yet figured out the plurality algorithm.) Then I marked (1) those captures where experts and the algorithm disagreed, and (2) those captures that experts had said were particularly tricky. For these marked captures, I went through to catch any obvious blunders. For example, in one expert-classified capture, the expert classified the otherBirds in the images, but forgot to classify the giraffe the birds were on! The rest of these marked images I sent to Ali to look at. I didn’t tell her what the expert had marked or what the algorithm said. I just asked her to give me a new classification. If Ali’s classification matched with either the algorithm or the expert, I set hers as the official classification. If it didn’t, then she, and Craig, and I examined the capture further together — there were very few of these.
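For readers who think in code, the review procedure above can be sketched roughly as a small decision function. This is only an illustration under assumptions: the name `official_classification`, the representation of a classification as a set of species labels, and the `NEEDS_GROUP_REVIEW` marker are all hypothetical, and the manual pass for obvious blunders is left out. It is not the code that was actually used to build the data set.

```python
from typing import FrozenSet, Optional, Union

# Hypothetical representation: one classification = the set of species in a capture.
Label = FrozenSet[str]

# Hypothetical marker for captures that need to be examined together.
NEEDS_GROUP_REVIEW = "needs_group_review"


def official_classification(
    expert: Label,
    algorithm: Label,
    flagged_tricky: bool,
    second_opinion: Optional[Label] = None,
) -> Union[Label, str]:
    """Sketch of the adjudication steps described in the post."""
    # Unmarked capture: expert and algorithm agree, and nobody called it tricky.
    if expert == algorithm and not flagged_tricky:
        return expert

    # Marked capture: a blind second classification is required.
    if second_opinion is None:
        raise ValueError("marked capture needs a blind second classification")

    # If the second opinion matches either source, it becomes the official answer.
    if second_opinion in (expert, algorithm):
        return second_opinion

    # Otherwise the capture is examined further by several people together.
    return NEEDS_GROUP_REVIEW
```

In this sketch, the giraffe example would be a capture where the expert gave `{"otherBird"}`, the algorithm gave `{"giraffe", "otherBird"}`, and a matching second opinion settles it in favour of the algorithm.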
And that is how I came up with the expert data set. I went back this week to tally how the experts did on their first attempt versus the final expert data set. Out of the 4,428 classifications, 30 were marked as ‘impossible’ by Ali, 1 was the duiker (which the experts couldn’t get right by using the website), and 101 mistakes were made. That makes for a 97.7% rate of success for the experts. (If you look at last week’s graph, you can see that some of you qualify as experts too!)
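The arithmetic behind that success rate can be checked in a few lines. The counts come straight from the tally above; the variable names are just for illustration. The sketch tries two denominators, since it is an assumption either way whether the ‘impossible’ and duiker classifications are counted in the total, and both round to the same figure.

```python
total_classifications = 4428
impossible = 30   # marked 'impossible'
duiker = 1        # could not be classified correctly on the website
mistakes = 101

# Option 1: mistakes as a share of all expert classifications.
success_all = 1 - mistakes / total_classifications

# Option 2 (assumption): leave out the captures that could not be scored.
scorable = total_classifications - impossible - duiker
success_scorable = 1 - mistakes / scorable

print(f"{success_all:.1%}")       # 97.7%
print(f"{success_scorable:.1%}")  # 97.7%
```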
Okay, and what did the experts get wrong? About 30% of the mistakes were what I call wildebeest-zebra errors. That is, there are wildebeest and zebra, but someone just marks the wildebeest. Or there are only zebra, and someone marks both wildebeest and zebra. Many of the wildebeest and zebra herd pictures are plain difficult to figure out, especially if animals are in the distance. Another 10% of the mistakes were otherBird errors — either someone marked an otherBird when there wasn’t really one there, or (more commonly) forgot to note an otherBird. About 10% of the time, experts listed an extra animal that wasn’t there. And another 10% of the time, they missed an animal that was there. Some of these were obvious blunders, like missing a giraffe or eland; other times it was more subtle, like a bird or rodent hidden in the grass.
The other 40% of the mistakes were mis-identifications of the species. I didn’t find any obvious patterns to where the mistakes were; here are the species that were mis-identified:
| Species | Count | Confused with |
|---|---|---|
| wildebeest | 6 | buffalo, hartebeest, elephant, lionFemale |
| hartebeest | 5 | gazelleThomsons, impala, topi, lionFemale |
| gazelleGrants | 4 | impala, gazelleThomsons, hartebeest |
| reedbuck | 3 | dikDik, gazelleThomsons, impala |
4 responses to “Summary of the Experts”
The biggest challenge for amateurs is when only a small part of an animal is shown. If each volunteer were fed a chronological sequence from only a single camera, they could use time data to infer the ID of questionable images if there were adjacent better images of the same animal. Better still if they could rewind and correct if a sequence improved through time. I assume you do not do this because some cameras are unproductive and volunteers would tire and drop out?
Part of it is because we didn’t want volunteers to spend too long on each image. The other reason is that by classifying each image without seeing the ones before and after, all the images have independent classifications. This allows us to do some statistics that would become much more complicated otherwise.
I agree, though, that those captures showing only small parts of animals are really hard and could benefit from information about the images before and after. I’m planning to modify the plurality algorithm to take into account classifications of the captures before and after to see if that helps improve its accuracy.
Margaret, do you allow for error as a percentage? What are your expectations of non-experts? What is a good return for us?