[_ Old Earth _] How to measure information content of a genome?

jwu · Jun 12, 2010

Pard wrote this:

Pard said:
logical bob said:

Pard said:

Macro-evolution requires a complete mutation of genome to occur.

Sorry if I'm being slow, but could you explain what "complete mutation of genome" means?

Mmm, to add information to the genetic sequence. There is no way a mutation can add data to the genetic sequence, and this is required for a jump from one group to another (like reptile to mammal).

viewtopic.php?f=19&t=48203&st=0&sk=t&sd=a#p586052

So...how can we objectively measure the information content of a genome? What is the unit of measure?

Crying Rock · Jun 12, 2010

Information is always a measure of the decrease of uncertainty at a receiver (or molecular machine).

R = H(x) - Hy(x)

R = Hbefore - Hafter.

A way to see this is to work out the information in a bunch of DNA binding sites.

Definition of "binding": many proteins stick to certain special spots on DNA to control genes by turning them on or off. The only thing that distinguishes one spot from another spot is the pattern of letters (nucleotide bases) there. How much information is required to define this pattern?

Here is an aligned listing of the binding sites for the cI and cro proteins of the bacteriophage (i.e., virus) named lambda:

alist 5.66 aligned listing of:
* 96/10/08 19:47:44, 96/10/08 19:31:56, lambda cI/cro sites
piece names from:
* 96/10/08 19:47:44, 96/10/08 19:31:56, lambda cI/cro sites
The alignment is by delila instructions
The book is from: -101 to 100
This alist list is from: -15 to 15

                      ------                   ++++++
                      111111--------- +++++++++111111
                      5432109876543210123456789012345
                      ...............................
OL1 J02459 35599 +  1 tgctcagtatcaccgccagtggtatttatgt
    J02459 35599 -  2 acataaataccactggcggtgatactgagca
OL2 J02459 35623 +  3 tttatgtcaacaccgccagagataatttatc
    J02459 35623 -  4 gataaattatctctggcggtgttgacataaa
OL3 J02459 35643 +  5 gataatttatcaccgcagatggttatctgta
    J02459 35643 -  6 tacagataaccatctgcggtgataaattatc
OR3 J02459 37959 +  7 ttaaatctatcaccgcaagggataaatatct
    J02459 37959 -  8 agatatttatcccttgcggtgatagatttaa
OR2 J02459 37982 +  9 aaatatctaacaccgtgcgtgttgactattt
    J02459 37982 - 10 aaatagtcaacacgcacggtgttagatattt
OR1 J02459 38006 + 11 actattttacctctggcggtgataatggttg
    J02459 38006 - 12 caaccattatcaccgccagaggtaaaatagt
                                            ^

Each horizontal line represents a DNA sequence, starting with the 5' end on the left, and proceeding to the 3' end on the right. The first sequence begins with: 5' tgctcag ... and ends with ... tttatgt 3'. Each of these twelve sequences is recognized by the lambda repressor protein (called cI) and also by the lambda cro protein.

What makes these sequences special so that these proteins like to stick to them? Clearly there must be a pattern of some kind.

Read the numbers on the top vertically. This is called a "numbar". Notice that position +7 always has a T (marked with the ^). That is, according to this rather limited data set, one or both of the proteins that bind here always require a T at that spot. Since the frequency of T is 1 and the frequencies of other bases there are 0, H(+7) = 0 bits. But that makes no sense whatsoever! This is a position where the protein requires information to be there.

That is, what is really happening is that the protein has two states. In the BEFORE state, it is somewhere on the DNA, and is able to probe all 4 possible bases. Thus the uncertainty before binding is Hbefore = log2(4) = 2 bits. In the AFTER state, the protein has bound and the uncertainty is lower: Hafter(+7) = 0 bits. The information content, or sequence conservation, of the position is Rsequence(+7) = Hbefore - Hafter = 2 bits. That is a sensible answer. Notice that this gives Rsequence close to zero outside the sites.
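
A minimal sketch of the calculation just described, applied to the twelve aligned sequences in the listing. It assumes Hbefore = 2 bits at every position (all four bases equally likely before binding) and applies no correction for the small sample of twelve sequences. The names `SITES`, `h_after`, and `r_sequence` are made up for this illustration.

```python
from collections import Counter
from math import log2

# The twelve aligned cI/cro binding-site sequences from the listing (positions -15..+15).
SITES = [
    "tgctcagtatcaccgccagtggtatttatgt",
    "acataaataccactggcggtgatactgagca",
    "tttatgtcaacaccgccagagataatttatc",
    "gataaattatctctggcggtgttgacataaa",
    "gataatttatcaccgcagatggttatctgta",
    "tacagataaccatctgcggtgataaattatc",
    "ttaaatctatcaccgcaagggataaatatct",
    "agatatttatcccttgcggtgatagatttaa",
    "aaatatctaacaccgtgcgtgttgactattt",
    "aaatagtcaacacgcacggtgttagatattt",
    "actattttacctctggcggtgataatggttg",
    "caaccattatcaccgccagaggtaaaatagt",
]


def h_after(column):
    """Shannon uncertainty (bits) of the bases observed at one aligned position."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def r_sequence(sites, h_before=2.0):
    """Map each position (-15..+15) to Hbefore - Hafter, in bits."""
    half = len(sites[0]) // 2
    return {i - half: h_before - h_after(col) for i, col in enumerate(zip(*sites))}


r = r_sequence(SITES)
print(f"Rsequence(+7) = {r[7]:.2f} bits")  # every sequence has t here, so 2.00
print(f"Sum over all positions = {sum(r.values()):.2f} bits")
```

Summing the per-position values gives one number for the whole site, but only under the assumption that positions are independent.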

http://www.lecb.ncifcrf.gov/~toms/infor ... ainty.html

http://www.ccrnp.ncifcrf.gov/~toms/bion ... al.Entropy

Barbarian · Jun 12, 2010

The problem is that this can't be applied to populations, and hence can't measure whether or not a mutation adds information.

You'd have to use the allele frequencies in the population. But it can be done. And Doctor Tom has it wrong. Information is the measure of uncertainty in a message.

In information theory, entropy is a measure of the uncertainty associated with a random variable. The term by itself in this context usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information contained in a message, usually in units such as bits. Equivalently, the Shannon entropy is a measure of the average information content one is missing when one does not know the value of the random variable. The concept was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication".
http://en.wikipedia.org/wiki/Entropy_%2 ... _theory%29
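
To make the allele-frequency idea concrete, here is a minimal sketch that applies the Shannon entropy quoted above to the allele frequencies at a single locus. The frequencies are invented for illustration and do not come from any real population; `allele_entropy` is a hypothetical helper name.

```python
from math import log2


def allele_entropy(freqs):
    """Shannon entropy (bits) of allele frequencies at one locus; frequencies must sum to 1."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return sum(-p * log2(p) for p in freqs if p > 0)


# Hypothetical population: two alleles at equal frequency.
before = allele_entropy([0.5, 0.5])
# Hypothetical later state: a mutation has produced a third allele that reached 25%.
after = allele_entropy([0.5, 0.25, 0.25])

print(f"before: {before:.2f} bits, after: {after:.2f} bits, change: {after - before:+.2f} bits")
```

Under this measure a new allele spreading in the population raises the value, and a locus fixed for a single allele scores 0 bits.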

jwu · Jun 13, 2010

Crying Rock said:
Information is always a measure of the decrease of uncertainty at a receiver (or molecular machine).

R = H(x) - Hy(x)

R = Hbefore - Hafter.

[...]

So you're saying that the information that is required to define protein binding sites relates to the overall information content of a genome, correct?

Please elaborate on that.
Would you say that a genetic change that makes one protein that used to bind in 10 places only bind in 5 places afterwards is an increase of information of one bit? That'd be a measure of a change of information though, not the total content.
If so, how would you derive the overall, absolute information content of a genome?
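
One hedged way to put a number on the 10-versus-5 example: count the bits needed to pick out n binding sites among G possible positions in the genome, assuming every position is equally likely beforehand. G and n are symbols introduced here for illustration only.

$$
\log_2\frac{G}{5} - \log_2\frac{G}{10} = \log_2\frac{10}{5} = 1 \text{ bit}
$$

The genome size G cancels, which is why this gives a change in information rather than an absolute content.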

Crying Rock · Jun 13, 2010

If so, how would you derive the overall, absolute information content of a genome?

An example: If, after replication of DNA that had never been altered, R = H(x) - Hy(x) = 100.

Crying Rock · Jun 13, 2010

The Barbarian said:
The problem is that this can't be applied to populations, and hence can't measure whether or not a mutation adds information.

You'd have to use the allele frequencies in the population. But it can be done. And Doctor Tom has it wrong. Information is the measure of uncertainty in a message.

In information theory, entropy is a measure of the uncertainty associated with a random variable. The term by itself in this context usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information contained in a message, usually in units such as bits. Equivalently, the Shannon entropy is a measure of the average information content one is missing when one does not know the value of the random variable. The concept was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication".
http://en.wikipedia.org/wiki/Entropy_%2 ... _theory%29

Primary source:

Shannon (1948) said on page 20:

R = H(x) - Hy(x)

"The conditional entropy Hy(x) will, for convenience, be called the equivocation. It measures the average ambiguity of the received signal."

jasoncran · Jun 13, 2010

this argument sounds all too familiar. :yes

Barbarian · Jun 14, 2010

Yep. CR has confused the ambiguity with information.

jwu · Jun 14, 2010

Crying Rock said:
If so, how would you derive the overall, absolute information content of a genome?

An example: If, after replication of DNA that had never been altered, R = H(x) - Hy(x) = 100.

That's not an example though, just a generic theorem without any other context. The relevant question here is how to apply it to biology/genetics in particular. In other words, the relevant question is how to calculate H(x) and Hy(x) in the context of genetics. Without a schematic for that we cannot calculate R.

Anyway, you said that in case of a replication of a genome that has never been altered the decrease of uncertainty is 100. What is the unit of measure of that "100"? What would that value be in case of a genome that has been altered?
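
One possible way to make H(x) and Hy(x) concrete for replication is to treat copying a single base as a noisy channel. The sketch below is a toy model, not something taken from Shannon or from the linked pages: it assumes all four bases are equally likely in the template, so H(x) = 2 bits, and that a copying error replaces a base with each of the other three equally often. The function name and the error rates are hypothetical.

```python
from math import log2


def replication_information(error_rate):
    """R = H(x) - Hy(x) in bits per base for a symmetric four-letter channel.

    Assumes equally likely template bases (H(x) = 2 bits) and errors that
    land on each of the three wrong bases with equal probability.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie between 0 and 1")
    h_x = 2.0
    # Given one received base: probability the template matched it, or was one of 3 others.
    probs = [1.0 - error_rate] + [error_rate / 3.0] * 3
    equivocation = sum(-p * log2(p) for p in probs if p > 0)
    return h_x - equivocation


for rate in (0.0, 1e-9, 0.01, 0.75):
    print(f"error rate {rate:g}: R = {replication_information(rate):.6f} bits per base")
```

In this model a perfect copy yields the full 2 bits per base, and R falls to 0 when the copy is statistically independent of the template (error rate 0.75).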

Crying Rock · Jun 14, 2010

jwu said:
Crying Rock said:

If so, how would you derive the overall, absolute information content of a genome?

An example: If, after replication of DNA that had never been altered, R = H(x) - Hy(x) = 100.

That's not an example though, just a generic theorem without any other context. The relevant question here is how to apply it to biology/genetics in particular. In other words, the relevant question is how to calculate H(x) and Hy(x) in the context of genetics. Without a schematic for that we cannot calculate R.

Anyway, you said that in case of a replication of a genome that has never been altered the decrease of uncertainty is 100. What is the unit of measure of that "100"? What would that value be in case of a genome that has been altered?

Shannon surely addresses units of measure? I'll look it up when I get some time. In the meantime, if you have some time on your hands, please share what you come up with.

Crying Rock · Jun 14, 2010

The Barbarian said:
Yep. CR has confused the ambiguity with information.

Barb, why don't you ask jwu what Shannon information is. He seems to have a firm grasp of the matter.

jwu · Jun 15, 2010

Crying Rock said:
Shannon surely addresses units of measure? I'll look it up when I get some time. In the meantime, if you have some time on your hands, please share what you come up with.

Bits work well in general. I was wondering about the 100 that you posted though - as bits it seemed rather odd to me, you might have meant "percent of original" or something like that. Hence the request for clarification.

Barb, why don't you ask jwu what Shannon information is. He seems to have a firm grasp of the matter.

Well, I would prefer if we could just go on the way we're heading for now a little bit further. We'll get to an interesting point very soon.

Barbarian · Jun 15, 2010

Genetics
Vol. 159, 915-917, November 2001
Shannon's Brief Foray into Genetics
James F. Crow
Laboratory of Genetics, University of Wisconsin, Madison, Wisconsin 53706
Yet, what is not generally known is that Shannon's Ph.D. thesis dealt with population genetics. Immediately after receiving the degree, he went to work for the Bell Telephone Laboratories and began his path-breaking studies of communication. He never returned to genetics and the thesis was never published. After half a century it was finally reprinted along with most of Shannon's major papers (SLOANE and WYNER 1993)... In this paper Shannon showed that, with the proper definition of information, all information sources have a source rate, measured in bits per second. The measure of information was $\sum P \log P$, in which P is the probability of choosing a particular message from among the alternatives, which is of the same form as entropy, long used as a measure of disorder in physical systems. For information theory it is natural to measure information in logs to the base 2. Thus, a simple system with two equally likely alternatives has $\log_2 2 = 1$ bit of information...

Just saying...

Barbarian · Jun 15, 2010

Shannon surely addresses units of measure?

Invented the unit, in fact. The bit.

I found a very basic site for understanding information:

Measuring Information Content
In the preceding example we used a die with eight faces. Since eight is a power of two, the optimal code for a uniform probability distribution is easy to calculate: log 8 = 3 bits (logarithms here are base 2). For the variable length code, we wrote out the specific bit pattern to be transmitted for each face A-H, and were thus able to directly count the number of bits required.

Information theory provides us with a formula for determining the number of bits required in an optimal code even when we don't know the code. Let's first consider uniform probability distributions where the number of possible outcomes is not a power of two. Suppose we had a conventional die with six faces. The number of bits required to transmit one throw of a fair six-sided die is: log 6 = 2.58. Once again, we can't really transmit a single throw in less than 3 bits, but a sequence of such throws can be transmitted using 2.58 bits on average. The optimal code in this case is complicated, but here's an approach that's fairly simple and yet does better than 3 bits/throw. Instead of treating throws individually, consider them three at a time. The number of possible three-throw sequences is 6^3 = 216. Using 8 bits we can encode a number between 0 and 255, so a three-throw sequence can be encoded in 8 bits with a little to spare; this is better than the 9 bits we'd need if we encoded each of the three throws separately.
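
The three-throws-in-8-bits idea can be sketched as follows. This is only one simple way to do the packing (treating the throws as a three-digit base-6 number); the tutorial does not specify an encoding, and the function names are made up for this example.

```python
def pack_throws(throws):
    """Pack three die throws (each 1..6) into a single integer in 0..215."""
    if len(throws) != 3 or any(t not in range(1, 7) for t in throws):
        raise ValueError("expected three throws, each between 1 and 6")
    a, b, c = (t - 1 for t in throws)
    return a * 36 + b * 6 + c


def unpack_throws(code):
    """Recover the three throws from a code produced by pack_throws."""
    a, rest = divmod(code, 36)
    b, c = divmod(rest, 6)
    return [a + 1, b + 1, c + 1]


code = pack_throws([6, 6, 6])  # the largest possible code
print(code, format(code, "08b"))
print(unpack_throws(code))
```

Eight bits per three throws works out to about 2.67 bits per throw, between the 3 bits of the naive code and the 2.58-bit limit.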

In probability terms, each possible value of the six-sided die occurs with equal probability P=1/6. Information theory tells us that the minimum number of bits required to encode a throw is -log P = 2.58. If you look back at the eight-sided die example, you'll see that in the optimal code that was described, every message had a length exactly equal to -log P bits. Now let's look at how to apply the formula to biased (non-uniform) probability distributions. Let the variable x range over the values to be encoded, and let P(x) denote the probability of that value occurring. The expected number of bits required to encode one value is the weighted average of the number of bits required to encode each possible value, where the weight is the probability of that value:

$$\sum_x P(x) \times \bigl(-\log P(x)\bigr)$$

Now we can revisit the case of the biased coin. Here the variable ranges over two outcomes: heads and tails. If heads occur only 1/4 of the time and tails 3/4 of the time, then the number of bits required to transmit the outcome of one coin toss is:

$$\tfrac{1}{4} \times \bigl(-\log \tfrac{1}{4}\bigr) + \tfrac{3}{4} \times \bigl(-\log \tfrac{3}{4}\bigr) = 0.8113 \text{ bits}$$

A fair coin is said to produce more "information" because it takes an entire bit to transmit the result of the toss:

$$\tfrac{1}{2} \times \bigl(-\log \tfrac{1}{2}\bigr) + \tfrac{1}{2} \times \bigl(-\log \tfrac{1}{2}\bigr) = 1 \text{ bit}$$

The Intuition Behind the -P log P Formula

The key to gaining an intuitive understanding of the -P log P formula for calculating information content is to see the duality between the number of messages to be encoded and their probabilities. If we want to encode any of eight possible messages, we need 3 bits, because log 8 = 3. We are implicitly assuming that the messages are drawn from a uniform distribution.

The alternate way to express this is: the probability of a particular message occurring is 1/8, and -log(1/8) = 3, so we need 3 bits to transmit any of these messages. Algebraically, log n = -log (1/n), so the two approaches are equivalent when the probability distribution is uniform. The advantage of using the probability approach is that when the distribution is non-uniform, and we can't simply count the number of messages, the information content can still be expressed in terms of probabilities.

Sometimes we write about rare events as carrying a high number of bits of information. For example, in the case where a coin comes up heads only once in every 1,000 tosses, the signal that a heads has occurred is said to carry 10 bits of information. How is that possible, since the result of any particular coin toss takes 1 bit to describe? Transmitting when a rare event occurs, if it happens only about once in a thousand trials, will take 10 bits. Using our message counting approach, if a value occurs only 1/1000 of the time in a uniform distribution, there will be 999 other possible values, all equally likely, so transmitting any one value would indeed take 10 bits.

With a coin there are only two possible values. What information theory says we can do is consider each value separately. If a particular value occurs with probability P, we assume that it is drawn from a uniformly distributed set of values when calculating its information content. The size of this set would be 1/P elements. Thus, the number of bits required to encode one value from this hypothetical set is -log P. Since the actual distribution we're trying to encode is not uniform, we take the weighted average of the estimated information content of each value (heads or tails, in the case of a coin), weighted by the probability P of that value occurring. Information theory tells us that an optimal encoding can do no better than this. Thus, with the heavily biased coin we have the following:

P(heads) = 1/1000, so heads takes -log(1/1000) = 9.96578 bits to encode

P(tails) = 999/1000, so tails takes -log(999/1000) = 0.00144 bits to encode

$$\text{Avg. bits required} = \sum_x -P(x) \log P(x) = \tfrac{1}{1000} \times 9.96578 + \tfrac{999}{1000} \times 0.00144 = 0.01141 \text{ bits per coin toss}$$
http://www.cs.cmu.edu/~dst/Tutorials/Info-Theory/
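
The figures in the quoted tutorial can be checked with a few lines of code. This is a minimal sketch that takes all logarithms as base 2; `entropy_bits` is a name chosen here, not one used by the tutorial.

```python
from math import log2


def entropy_bits(probs):
    """Expected number of bits per outcome: the sum of -P(x) * log2 P(x)."""
    return sum(-p * log2(p) for p in probs if p > 0)


print(f"fair 8-sided die:        {entropy_bits([1/8] * 8):.4f}")
print(f"fair 6-sided die:        {entropy_bits([1/6] * 6):.4f}")
print(f"coin, P(heads) = 1/4:    {entropy_bits([1/4, 3/4]):.4f}")
print(f"fair coin:               {entropy_bits([1/2, 1/2]):.4f}")
print(f"coin, P(heads) = 1/1000: {entropy_bits([1/1000, 999/1000]):.5f}")
```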

Does that help?

Crying Rock · Jun 16, 2010

C:

Shannon surely addresses units of measure?

B:

Does that help?

What part?

Crying Rock · Jun 16, 2010

Bits work well in general. I was wondering about the 100 that you posted though - as bits it seemed rather odd to me, you might have meant "percent of original" or something like that. Hence the request for clarification.

"percent of original"

Percent of the uncertainty of the message after receipt by the molecular machine.

Barbarian · Jun 16, 2010

Shannon surely addresses units of measure?

Barbarian observes:
Invented the unit, in fact. The bit.

Barbarian cites Shannon's thesis, in which he states the formula he developed for information in population genetics:

The Intuition Behind the -P log P Formula
The key to gaining an intuitive understanding of the -P log P formula for calculating information content is to see the duality between the number of messages to be encoded and their probabilities. If we want to encode any of eight possible messages, we need 3 bits, because log 8 = 3. We are implicitly assuming that the messages are drawn from a uniform distribution.

The alternate way to express this is: the probability of a particular message occurring is 1/8, and -log(1/8) = 3, so we need 3 bits to transmit any of these messages. Algebraically, log n = -log (1/n), so the two approaches are equivalent when the probability distribution is uniform. The advantage of using the probability approach is that when the distribution is non-uniform, and we can't simply count the number of messages, the information content can still be expressed in terms of probabilities...

$$\text{Avg. bits required} = \sum_x -P(x) \log P(x) = \tfrac{1}{1000} \times 9.96578 + \tfrac{999}{1000} \times 0.00144 = 0.01141 \text{ bits per coin toss}$$
http://www.cs.cmu.edu/~dst/Tutorials/Info-Theory/

Barbarian asks:
Does that help?

What part?

All of it, but particularly the part that says that the Shannon formula for information content of the genome of a population is $\sum_x -P(x) \log P(x)$.

jwu · Jun 16, 2010

Crying Rock said:
Bits work well in general. I was wondering about the 100 that you posted though - as bits it seemed rather odd to me, you might have meant "percent of original" or something like that. Hence the request for clarification.

"percent of original"

Percent of the uncertainty of the message after receipt by the molecular machine.

Yup! That'd be 100%, if we assume that the integrity of the message could be confirmed, that is.
If the receiver simply cannot be sure if the message has been altered or not, then the receiver would have to take that possibility into account. Then even in case of an unaltered message his difference in uncertainty would be less than 100%, for he knows that he cannot necessarily rely on the veracity of the message. There would be, quite literally, remaining uncertainty about the integrity of the message. Thus the uncertainty wouldn't be 100% gone but only something like 99.9999%.

Are you with me so far? If yes, then let's continue with this question:
What would be the difference in uncertainty in case of a slightly altered instead of an unaltered replication?

Crying Rock · Jun 17, 2010

Thus the uncertainty wouldn't be 100% gone but only something like 99.9999%.

Of course. I've never read a report about a totally noiseless channel. Just as there is entropy in physical systems, there is entropy of information. I remember the old drill in school where one person would transmit an original message, the message would be relayed through a number of people, and then you would compare the original message to the final message.

What would be the difference in uncertainty in case of a slightly altered instead of an unaltered replication?

There would be a slight decrease in information. Our views may differ because I think that our creator transmitted the original message.

Barbarian · Jun 17, 2010

One consequence of Shannon's equation is that you can get as accurate a transmission as you want, by increasing redundancy.

This is how tiny transmitters can send practically loss-free signals over many millions of kilometers.
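
A rough illustration of the redundancy point, using the simplest possible scheme: send each bit an odd number of times and take a majority vote. This repetition code and the 10% flip probability are assumptions chosen for the example. It buys accuracy only by slowing the transmission down, whereas Shannon's result is the stronger claim that errors can be made arbitrarily rare at any rate below the channel capacity.

```python
from math import comb


def majority_vote_error(repeats, flip_prob):
    """Probability that a bit sent `repeats` times (odd) is decoded wrongly by majority vote."""
    if repeats % 2 == 0:
        raise ValueError("use an odd number of repeats so the vote cannot tie")
    # The vote fails when more than half of the copies are flipped.
    return sum(
        comb(repeats, k) * flip_prob**k * (1 - flip_prob) ** (repeats - k)
        for k in range(repeats // 2 + 1, repeats + 1)
    )


for n in (1, 3, 5, 11, 21):
    print(f"{n:2d} repeats: error probability {majority_vote_error(n, 0.1):.2e}")
```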
