Lament for the “non-reward marker”, the underdog in dog training

I never thought I would revisit “quadrant hell”, as we used to lovingly call it at The Academy for Dog Trainers. My venture into this pesky topic was triggered by a lack of consensus regarding a humble training aid called the “non-reward marker” (NRM).

While its more famous “reward marker” counterpart – the clicker – enjoys a cult-like following, the NRM has an image problem. It seems the poor NRM has become the nerdy kid that few people want to be seen socialising with. Always rooting for the underdog, I’ll try to jump to the little fella’s defence.

The context: The four quadrants.

In short, the four quadrants are used to explain the processes of learning via consequences.

REINFORCEMENT

R+

Your dog comes to you when you call her and as a consequence you give her a piece of chicken. If, as a result, your dog comes more often when called in the future, the behaviour will have been ‘positively’ reinforced. ‘Reinforced’ because the behaviour of coming when called increased and ‘positively’ because you added something to make this happen (you gave the dog chicken).

R-

You say ‘sit’ and put pressure on your dog’s rear end until he sits. If, as a result, your dog sits more often when you say ‘sit’ in the future, the behaviour of sitting will have been ‘negatively’ reinforced. ‘Reinforced’ because the behaviour of sitting increased and ‘negatively’ because you subtracted something to make this happen (you stopped putting pressure on the dog’s butt).

PUNISHMENT

P+

Your dog jumps on the couch and you hit him over the head with a rolled up newspaper. If, as a result, the dog jumps on the couch less often, the behaviour will have been ‘positively’ punished. ‘Punished’ because the behaviour of jumping on the couch decreased and ‘positively’ because you added something to make this happen (you whacked the dog).

P-

You play tug with your dog and she nips your hand. As a consequence you stop playing with her for a minute. If, as a result, your dog grabs your hand less often, the behaviour of grabbing your hand will have been ‘negatively’ punished. ‘Punished’ because the behaviour decreased and ‘negatively’ because you subtracted something to make this happen (you stopped playing with your dog).

What are the quadrants for?

Animal trainers use the quadrants in two ways. First, to choose which of the processes to use, based on what we hope to achieve and our individual philosophy. And second, to assess the outcome of our training.

INTENTIONS

You may wonder why I have colour-coded the quadrants and, yes, your suspicion is correct. I prefer to use R+ and P- because I consider P+ and R- to be offensive and fraught with danger. If you read through the four examples, you will notice that there is a common factor in R+ and P-: they deal with good things (stuff that the dog likes such as food, play etc.). P+ and R- on the other hand deal with bad things (anything the dog wants to avoid such as fear, pain, discomfort etc.).

Since my intention is to keep the dog happy when I train, I avoid the use of bad things. So it is my choice to use R+ and P-. But do my intentions lead to the desired outcomes?

OUTCOMES

We cannot know for sure what the dog experiences as punishing or reinforcing. So, when I choose a certain type of food for training, I’m simply guessing that it will work as a reinforcer. Once I witness the dog doing the target behaviour more often in the future, I can be confident my approach has worked.

Equally, if I remove myself from the room whenever the dog chews my clothes, I cannot claim I am punishing the dog (by removing myself) until I have firmly established that the behaviour of chewing on my clothes has indeed decreased because of my actions.

But wait, there’s more

DIFFERENTIAL REINFORCEMENT

So, one way to tell if reinforcement or punishment has occurred is to monitor the frequency of the dog’s behaviour we are concerned with. However, behaviour doesn’t only decrease when it is punished. It can also decrease because it is never reinforced.

A training technique called “differential reinforcement” strengthens one behaviour over others. If we reward one behaviour but not others that occur in the same context, the dog will likely perform the rewarded behaviour more often and the other behaviours less often. Although the other behaviours are decreasing in frequency, they have not been punished. The dog neither lost something they value nor received bad consequences in response to engaging in those behaviours. They simply weren’t reinforced. If a behaviour never pays off, it is not worth engaging in.

EXTINCTION

We use the term “extinction” to describe non-punishment-based behaviour reductions. If a behaviour has a strong history of reinforcement, extinction can be very frustrating for a dog because it leaves a “behaviour vacuum”, unless we direct the dog towards alternative behaviours to build a new reinforcement history.

What are markers and why is one the superstar and another the villain?

REWARD MARKER

Sometimes it is helpful to “mark” the target behaviour we want to reinforce with a “yes” or a “good” or a click with the clicker. This is especially recommended when we are not quick enough to deliver the reward and the dog may therefore not make the connection between a specific behaviour, e.g. sitting, and the reward that follows. The marker stands for “your reward is coming” and therefore becomes a reward marker, a type of conditioned reinforcer. Reward markers, and in particular the clicker, have proved to be incredibly useful training aids and have been closely linked to the rise of reward-based animal training.

PUNISHMENT MARKER

Do markers also work for punishment? Yes, we can condition a sound, word or phrase to mean “you are about to lose something”. Usually that something is the company of people or other dogs, for example when giving a timeout for bullying another dog at the dog park. The marker is useful to inform the dog of the precise moment they “stuffed up”, particularly when getting hold of the dog may not be instantaneous and the implementation of the timeout is therefore delayed. After several repetitions of preceding the capture and timeout with a marker, e.g. “you’re gone”, the dog will learn that the marker predicts being removed from the fun. The marker becomes a punishment marker, or conditioned punisher.

NON-REWARD MARKER

And finally there is this thing called a non-reward marker. It is easy to see how it could be interpreted as being the same as a punishment marker, but we’ve already established that the absence of reinforcement does not automatically mean punishment. If I let the dog know that they just made the “wrong” move, for example moving out of the position they were asked to stay in, am I telling them they are being punished? Am I telling them they are about to lose something? Or am I telling them the opportunity to earn a reward has just been delayed? I think the latter. We are not taking anything away from the dog that they are currently in possession of or enjoying, so there’s no case for P-. And, as long as the non-reward marker is given in a friendly, non-threatening tone, there is also no risk of it tipping over into P+.

The training scenarios that have been presented as evidence that NRMs are “punishing” and decrease a dog’s enthusiasm for training were – to my knowledge – cases of using NRMs without a clearly defined instance of the “wrong move”. The NRM was given when the dog did not perform the desired behaviour. The NRM was apparently given to mark a whole range of behaviours, except the target one, as “wrong”. That would seem to me like a bad application of the NRM. From the dog’s perspective, there is no clear instance of behaviour to attach the NRM to. The risk here is that – after enough repetitions – the dog may well make an association between the NRM and a frustrating experience (“What am I supposed to do?!”), which could make the dog reluctant to take part in future training.

Compare this with a very clearly defined instance of “getting out of position” in the case of a stay exercise. The dog is already doing the “right” behaviour but then breaks before the trial is complete. The information given to the dog by marking the break of stay is unambiguous. And if the break of stay doesn’t resolve itself quickly, we go back to a shorter duration to help the dog succeed.

Conclusion

A dog who has learned what to do by being rewarded for it (R+) will eagerly take part in training as training itself becomes rewarding. They want to be there and share the fun with you. If using a non-reward marker makes this less enjoyable for the dog, the problem is not the marker. If we leave the dog guessing as to what exactly triggered the NRM or if we use the NRM too frequently because our training setup is too difficult for the dog, we have only ourselves to blame if the dog quits.

RATE OF REINFORCEMENT

In my opinion, as long as the dog wins a lot, there is usually no need to aim for errorless learning. If the dog sometimes doesn’t get a treat because they haven’t done the requested behaviour, that’s perfectly OK. The anticipation of earning rewards and the social aspect of training are already reinforcing for the dog. Not always getting a treat is part of the game and raises the anticipation (this is exactly what we do to maintain behaviour once it is learned: we put it on a variable reinforcement schedule). There is no risk of frustration if the rate of reinforcement remains high enough for the dog’s level of skill in the required task, training experience and sensitivity. It’s the trainer’s challenge to take care of that and to know their students.

Phew! Now back to more entertaining things.