If I tossed a coin five times and it came up heads every time, would you think it was a fair coin? "Probably not", you might say, and you probably wouldn't take a bet from me that the next toss comes up tails.
Now, my coin might be fair. It could well be that I have an equal chance of tossing a head or a tail, but in this particular case I happened to get five heads in a row. With a fair coin, the chance of that happening is about 3%: unlikely, but still possible. But you might think that likelihood is small enough -- that is, the outcome was surprising enough -- to suspect my motives and conclude instead that the coin isn't fair.
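That 3% figure comes from multiplying independent probabilities: five fair tosses, each with a 1/2 chance of heads. A quick sketch (in Python, purely for illustration):

```python
from fractions import Fraction

# Probability of five heads in a row with a fair coin:
# the tosses are independent, so multiply 1/2 by itself five times.
p_five_heads = Fraction(1, 2) ** 5
print(p_five_heads)         # 1/32
print(float(p_five_heads))  # 0.03125, i.e. about 3%
```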
Your intuitive response is essentially a significance test. A statistician would say that your working (or "null") hypothesis is that the coin is fair, but that if you observe an event that's really unlikely -- say, with a probability of less than 5% -- you'll instead accept the alternative hypothesis that I'm cheating. In the example we opened with, you accepted the alternative hypothesis that the coin was biased at the 5% significance level (i.e. with a P-value < 0.05).
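That reasoning fits in a few lines of code. This is a Python sketch for the specific "all heads" outcome only; for general counts of heads and tails you'd reach for a proper binomial test, such as `scipy.stats.binomtest`:

```python
ALPHA = 0.05  # the conventional 5% significance level

def p_value_all_heads(n_tosses: int) -> float:
    # Under the null hypothesis (a fair coin), each toss is heads with
    # probability 1/2, so n heads in a row has probability (1/2)^n.
    # That is the P-value: the chance of a result this extreme arising
    # by chance alone.
    return 0.5 ** n_tosses

p = p_value_all_heads(5)
print(p)          # 0.03125
print(p < ALPHA)  # True: reject the null hypothesis, suspect the coin
```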
P-values are a staple of introductory statistics courses, and are commonly used in the scientific literature to decide whether an effect is "significant": whether a drug improves survival rates more than could be explained by chance alone, for example. But P-values and significance tests are also easily abused, in more ways than we have time to get into here. One danger in particular is that of "multiple comparisons": applying a significance test to many experiments at once.
Let's suppose you did take me up on that bet, and you lost. You're so angry that you pocket my coin and proceed to sue me in court for fraud. Before the trial, you toss my coin five times to prove it's biased ... but you get three heads and two tails: not exactly compelling evidence of fraud. So you try again with another five tosses. And again. And again. Twenty-five times you repeat the sequence of five tosses, until eventually it happens to come up all heads. In court, you present as evidence that you tossed my coin five times and got five heads -- "significant evidence of fraud, at the 5% level!", you say. But by not mentioning the other 24 trials where no such significant result occurred, you're not really being fair.
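The courtroom strategy is easy to quantify: each five-toss run of a fair coin has a 1/32 chance of coming up all heads, so across 25 runs the chance of at least one "significant" result is 1 - (31/32)^25, roughly 55%. A simulation (again a Python sketch) bears this out:

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

def five_tosses_all_heads() -> bool:
    # One five-toss trial with a genuinely fair coin.
    return all(random.random() < 0.5 for _ in range(5))

n_repeats = 25

# Exact chance that at least one of 25 trials comes up all heads:
chance = 1 - (31 / 32) ** n_repeats
print(round(chance, 2))  # 0.55

# Simulate many plaintiffs, each repeating the five-toss trial 25 times
# and claiming fraud if any single run comes up all heads:
plaintiffs = 10_000
wins = sum(
    any(five_tosses_all_heads() for _ in range(n_repeats))
    for _ in range(plaintiffs)
)
print(wins / plaintiffs)  # close to 0.55
```

Over half the plaintiffs suing over a perfectly fair coin would find "significant evidence of fraud, at the 5% level".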
The same considerations apply when using significance tests in science: if you plan to do many such tests, you need to adjust for the fact that a "significant" result is more likely to occur by chance alone. (The R language has functions and packages for making such "multiple comparison" corrections.) But xkcd explains the issue much better than I can, by explaining why you might want to be skeptical when seeing headlines like this:
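For readers outside R, here is a minimal sketch of the simplest such correction, the Bonferroni adjustment (analogous to R's `p.adjust` with `method = "bonferroni"`). The raw P-values below are hypothetical stand-ins for the 25 courtroom trials, with 0.5 representing the unremarkable runs:

```python
def bonferroni(p_values):
    # Bonferroni correction: multiply each raw P-value by the number
    # of tests performed, capping the result at 1.
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# 25 trials, one of which happened to come up all heads (P = 1/32):
raw = [0.03125] + [0.5] * 24
adjusted = bonferroni(raw)
print(adjusted[0])  # 0.78125: no longer significant at the 5% level
```

Once the other 24 trials are accounted for, the lone all-heads run stops looking like evidence of anything.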