The Things That Need Not Be Said: The Redundancy of Languages
Fields covered: Linguistics, Information Theory, Probability, Statistics
Related Articles:
§ Entropy and Redundancy in English § Prediction and Entropy of Printed English
Language is part of our daily lives. We use language to communicate ideas and concepts to each other. We hold conversations with people we meet, send text messages, and watch the news to keep up to date on what’s happening around us. It’s an irreplaceable part of our lives.
But do you ever wonder how much of the language we use actually carries meaning? Consider the times we read an email and just go, ugh, can’t this fellow get to the point? Language helps us get our point across, but could it be more succinct?
Claude Shannon, a mathematician, looked into this. He defined a way to measure the information content of a language, by quantifying the amount of “surprise” each successive letter has. We can think of this in terms of probability.
Consider a randomly generated text written with the English alphabet, with each letter having an equal probability of appearing. What can the text so far tell us about the next character? Nothing at all: every letter has an equal chance of appearing, so each new character is as surprising as it can possibly be.
Fortunately, this is not how languages work. In English, some letters appear more frequently than others. We see vowels fairly often, while more elusive consonants are hardly seen at all. Counted across dictionary entries, the five most common letters are “e”, “a”, “r”, “i”, “o”; in running text the ranking is usually closer to “e”, “t”, “a”, “o”, “i”. Given a random English book, could you predict a letter arbitrarily chosen from it? You are more likely to see an ‘e’ than a ‘z’. This imbalance of probabilities encodes information into our languages.
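To make the idea of “surprise” concrete, here is a minimal Python sketch that measures the average surprise per letter (the Shannon entropy) of a piece of text, using only how often each letter occurs. The function name letter_entropy and the choice to ignore everything except the letters a to z are assumptions made for illustration, not part of Shannon's own procedure.

```python
import math
from collections import Counter

def letter_entropy(text: str) -> float:
    """Shannon entropy, in bits per letter, of the letter frequencies in text."""
    # Assumption: keep only the 26 letters a-z, ignoring case, spaces and punctuation.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    total = len(letters)
    # Each letter contributes p * log2(1/p): rare letters are more surprising.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Upper bound: 26 letters that are all equally likely.
MAX_ENTROPY = math.log2(26)  # about 4.70 bits per letter
```

Text in which all 26 letters are equally frequent would score MAX_ENTROPY. Ordinary English text scores lower, because letters such as ‘e’ dominate while letters such as ‘z’ are rare.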
We can also consider how each letter depends on the ones before it. Given that you know the previous letter is ‘a’, could you guess the next letter? It seems more likely to be a consonant than yet another vowel. How about if the previous letter was ‘q’? The probabilities immediately skew heavily towards one particular vowel coming next. This can be applied successively, such that when you see ‘languag_’, you can fill in the blank easily.
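The dependence on the previous letter can be estimated by counting adjacent letter pairs. The sketch below is one simple way to do that; the function name next_letter_probabilities and the tiny sample sentence are made up for illustration, and a real estimate would need a much larger body of text.

```python
from collections import Counter, defaultdict

def next_letter_probabilities(text: str) -> dict:
    """Estimate P(next letter | previous letter) from adjacent letter pairs."""
    pair_counts = defaultdict(Counter)
    # Assumption: only pairs inside a word count, so non-letters act as word breaks.
    cleaned = "".join(ch if "a" <= ch <= "z" else " " for ch in text.lower())
    for word in cleaned.split():
        for prev, nxt in zip(word, word[1:]):
            pair_counts[prev][nxt] += 1
    # Turn the raw counts for each previous letter into probabilities.
    return {
        prev: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for prev, counts in pair_counts.items()
    }

# Hypothetical sample text, chosen to contain several words with 'q'.
sample = "the queen quietly questioned the quality of the quilt"
probs = next_letter_probabilities(sample)
print(probs["q"])  # prints {'u': 1.0}
```

In this sample every ‘q’ is followed by ‘u’, so seeing ‘q’ leaves no surprise about the next letter, while the letters following ‘t’ are spread over several possibilities.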
This measure of information content applies not just to letters, but to higher-order linguistic structures as well. Some words appear more frequently than others, such as ‘the’, ‘she’, ‘an’, while individual nouns in general barely make an appearance. When you see an adjective, a noun or another adjective should follow. When you read a sentence about the life cycle of frogs, the next sentence is expected to continue on that same topic. When you wrap up a book about black holes, a paragraph regarding black holes should be its conclusion.
There are redundancies encoded into every level of our language. Shannon himself estimated that at least 50% of English is redundant: it is fixed by the structure of the language rather than freely chosen by the writer. Just think about it! At least half of every book read, every speech heard, of this article even, need not be said!
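Redundancy in Shannon's sense is one minus the ratio of the actual entropy to the largest entropy the alphabet allows. The following sketch shows only the arithmetic; the function name redundancy is hypothetical, and the input value is chosen purely to illustrate what a 50% figure means, not taken from a measurement.

```python
import math

def redundancy(entropy_bits: float, alphabet_size: int = 26) -> float:
    """Fraction of the maximum possible entropy that the text does not use."""
    max_entropy = math.log2(alphabet_size)  # all symbols equally likely
    return 1.0 - entropy_bits / max_entropy

# Illustrative input only: an entropy of half the maximum gives 50% redundancy.
half_max = math.log2(26) / 2  # about 2.35 bits per letter
print(f"{redundancy(half_max):.0%}")  # prints 50%
```

In other words, a redundancy of 50% corresponds to each letter carrying about 2.35 bits of information instead of the 4.70 bits it could carry if all 26 letters were equally likely and independent.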
Next time you get stuck in that dull meeting, or trapped reading that monotonous report, remember: you are justified in thinking that some of it is meaningless.