What is Statistical Genomics? – Bioinformatics Zone

Question: What is Statistical Genomics?
- Keywords:
  genomics,
  statistic,
  work
Asked to Ian on 17 Jun 2014.
- Ian Simpson answered on 17 Jun 2014:
  
  Statistical Genomics is a research area that uses some advanced mathematical methods to make sense of data gathered from the genomes of organisms.
  
  One of the most widely performed types of experiment that uses statistical genomics methods is the “Genome Wide Association Study”, or GWAS. It’s a method that has been used for many years to try to find out which genes are responsible for diseases.
  
  I’m going to try to explain how it works. It’s complicated and the explanation is LONG (which is why I couldn’t answer in the chat), so please ask me if you need me to try again!
  
  1. Variation between people
  Every person has millions of small changes in their DNA; we are all unique. Most of these changes are neutral: they normally don’t do anything bad. I’m going to call these SNPs (which stands for Single Nucleotide Polymorphisms). There are a couple of terms there you might not be familiar with :-
  
  nucleotide – a fancy word for a base in DNA, i.e. A, C, G or T
  polymorphism – another fancy word; it means many or multiple forms.
  
  So by SNP we mean a position in the human genome where the base found is different between people. So if I sequenced a little bit of 4 people’s DNA and found :-
  
  person1 AACTGCTGTTCTGT
  person2 AACAGCTGTTCTGT
  person3 AACCGCTGTTCTGT
  person4 AACGGCTGTTCTGT
  
  you can see that at position 4 the base can be any of A,C,G,T but at all other locations they’re the same. Position 4 is a SNP.
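
  If you like code, here is a rough Python sketch of the same check: it walks along the four sequences above and reports every position where the people don’t all share the same base. The function name `find_snp_positions` is just something I made up for this illustration, and it assumes the sequences are already lined up and the same length.

```python
def find_snp_positions(sequences):
    """Return 1-based positions where the aligned sequences are not all identical."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must all be the same length")
    positions = []
    for i in range(length):
        # Collect the distinct bases seen at this position across all people.
        bases_here = {seq[i] for seq in sequences}
        if len(bases_here) > 1:
            positions.append(i + 1)  # report 1-based, like the text
    return positions


people = {
    "person1": "AACTGCTGTTCTGT",
    "person2": "AACAGCTGTTCTGT",
    "person3": "AACCGCTGTTCTGT",
    "person4": "AACGGCTGTTCTGT",
}

snps = find_snp_positions(list(people.values()))
print(snps)
```

  For these four people the only position reported is 4.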
  
  Next up:-
  
  2. Recombination
  As I suspect you know from Biology classes, human chromosomes are found in pairs, one inherited from each parent. During the formation of sperm and egg (gametes) the chromosomes do something special. At a particular point during the process the two chromosomes of a pair can break and then each rejoin onto the other one. I’ve tried to show this below :-
  
  XXXXXXXXXXXX chromosome 19 from Mum
  ------------ chromosome 19 from Dad
  
  goes to :-
  
  XXXXXX------
  ------XXXXXX
  
  the two are now mixed. Note that each chromosome is still intact; it’s just that the two chromosomes have swapped the material on one side of the break point. This process is called a cross-over and it’s nature’s way of shuffling the genetic material from your parents. Typically around 2-3 of these cross-overs occur on each chromosome during the formation of gametes.
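
  The picture above can also be written as a few lines of Python. This is only a toy sketch: I’m treating each chromosome as a string (with "X" for Mum’s material and "-" for Dad’s), and `cross_over` and `break_point` are names I’ve invented for the example.

```python
def cross_over(chrom_a, chrom_b, break_point):
    """Swap the material after break_point between two equal-length chromosomes."""
    if len(chrom_a) != len(chrom_b):
        raise ValueError("chromosomes must be the same length")
    if not 0 <= break_point <= len(chrom_a):
        raise ValueError("break_point must lie within the chromosome")
    # Each new chromosome keeps its own left part and takes the other's right part.
    new_a = chrom_a[:break_point] + chrom_b[break_point:]
    new_b = chrom_b[:break_point] + chrom_a[break_point:]
    return new_a, new_b


mum = "X" * 12  # toy chromosome 19 from Mum
dad = "-" * 12  # toy chromosome 19 from Dad

new_1, new_2 = cross_over(mum, dad, 6)
print(new_1)
print(new_2)
```

  Both results are still 12 characters long, which is the point about the chromosome staying intact: nothing is lost, the material is just swapped.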
  
  3. Linkage
  Now imagine that we have 2 SNPs that are very close together and 2 SNPs that are very far apart on the same chromosome. We know that there are only 2-3 cross-overs per chromosome, and let’s assume that the places where they happen are random. Then it is much more likely that the 2 SNPs that are close together will stay together, and conversely that the SNPs that are far apart will be shuffled, because a cross-over is likely to happen between them. Positions that are close together are said to be in linkage, i.e. they tend to be inherited together because they are close together on the chromosome and a cross-over is unlikely to separate them.
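
  To put rough numbers on that, here is a small Python sketch. It makes some big simplifying assumptions that are mine, not real biology: the chromosome has length 1, there are exactly 2 cross-overs, and each one lands at a uniformly random place. It then asks how often at least one cross-over falls between two SNPs a given distance apart, once with a formula and once by simulation. (It ignores the fact that two cross-overs between the same pair of SNPs would put them back together.)

```python
import random


def prob_crossover_between(distance, n_crossovers=2):
    """Chance that at least one of n uniformly placed cross-overs lands between two SNPs.

    `distance` is the gap between the SNPs as a fraction of the chromosome (0 to 1).
    """
    return 1 - (1 - distance) ** n_crossovers


def simulate(distance, n_crossovers=2, trials=100_000, seed=0):
    """Estimate the same chance by randomly placing cross-overs many times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Put the first SNP at position 0 and the second at `distance`;
        # a cross-over at position u separates them when u < distance.
        if any(rng.random() < distance for _ in range(n_crossovers)):
            hits += 1
    return hits / trials


for label, gap in [("close together", 0.01), ("far apart", 0.8)]:
    print(label, prob_crossover_between(gap), simulate(gap))
```

  With these toy settings the close pair is split only about 2% of the time, while the far-apart pair has a cross-over between it about 96% of the time.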
  
  4. Disease association
  Now let’s take 500 unaffected people and 500 people with a disease and sequence not 1 or 2 but 10s or 100s of thousands of SNPs spread across the whole human genome. The only consistent difference between the two groups is presence or absence of this particular disease. How can we use what we know from above to help us find gene(s) associated with the disease?
  
  OK. To make things simpler let’s say the disease is caused by a mutation in a single gene on chromosome 19. If there was a SNP that was physically close to this gene we would expect the SNP to be inherited with that gene most of the time in both controls and the disease population (because it is just as close in both).
  
  We now need to step back in time. At some point in the past, probably many generations ago, the mutation appeared for the first time and that person was the first to have the disease. What they also had was one particular form of the close SNP. So let’s say that in that pioneer individual the SNP close to the new disease gene was an ‘A’.
  
  Fast forward to our massive SNP experiment. We want to ask a question of every SNP, comparing each one between the control group and the disease group:-
  
  “Is the frequency of the different bases that are possible at this SNP different between the control group and the disease group ?”
  
  So for a SNP that is nowhere near the disease gene you would expect cross-overs to shuffle it independently of the disease gene, so the frequency of A, C, G and T would be the same in both control and disease groups. BUT when we come to our SNP close to the disease gene, let’s say we only ever find ‘A’ (the original version of the SNP that was found in the pioneer) in the disease group and find an even distribution in the control group :-
  
  control A 25%,  C 25%, G 25%, T 25%
  disease A 100%, C 0%,  G 0%,  T 0%
  
  we can see that there is something odd happening here. This particular SNP is associated with the disease group, and this means that it is linked (remember linkage, above?) to the disease gene. In this case scientists would look very closely at the human genome sequence and work out which genes are close to this SNP; these become candidate genes likely to be responsible for the disease.
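
  One simple way to turn “something odd is happening” into a number is a chi-square statistic, which measures how far the observed counts are from what you’d expect if both groups had the same base frequencies. The sketch below is my own illustration rather than a description of any particular GWAS tool. I’m assuming 500 people per group with 2 copies of the chromosome each, so 1000 bases counted per group, and turning the percentages above into counts.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a table of counts (rows = groups, columns = bases).

    Assumes every column has at least one count overall, otherwise the
    expected count would be zero and the division below would fail.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Count we would expect if the base frequencies were the same in both groups.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Counts of A, C, G, T (assumed: 1000 bases counted per group).
control = [250, 250, 250, 250]
disease = [1000, 0, 0, 0]

stat = chi_square_statistic([control, disease])
print(stat)
```

  For this extreme made-up example the statistic comes out at 1200. With four possible bases and two groups (3 degrees of freedom), anything above about 7.8 would already be unusual at the 5% level, so this SNP stands out enormously.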
  
  5. Polygenic disease
  The reason why GWAS is such a popular method is that it can look for associations between a disease and many genes. The complex diseases that scientists are studying, such as heart disease, cancer, diabetes and neurological disease, usually involve mutations in more than one gene. GWAS studies comparing patients with these diseases to controls reveal many SNPs that are associated with the disease and many candidate genes. These genes then become the focus of further work to determine what their role in the disease is.
  
  So, statistical genomics encompasses the mathematical methodologies used to analyse the data from experiments such as GWAS. As you can imagine, you have base frequencies from thousands of SNPs in two populations and you need to work out which SNPs are likely to be linked to the disease. In reality it’s not 100% of one base in disease against an even distribution in control, but a much closer mix of numbers. We build statistical models of these SNPs to allow us to decide with confidence which are the ones of interest.
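
  Here is a rough sketch of what a “much closer mix of numbers” might look like in practice. Everything in it is made up for illustration: the SNP names, the counts of ‘A’ (out of an assumed 1000 bases per group), and the choice to divide the usual 0.05 cut-off by the number of SNPs tested (a simple correction known as Bonferroni, which is my assumption here and only one of several approaches). For each SNP it compares ‘A’ versus not-‘A’ between the groups with a chi-square test on 1 degree of freedom.

```python
import math


def allele_test(disease_count, disease_total, control_count, control_total):
    """Chi-square test (1 degree of freedom) on a 2x2 table of 'A' vs not-'A' counts."""
    table = [
        [disease_count, disease_total - disease_count],
        [control_count, control_total - control_count],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in col_totals:
        raise ValueError("the base must be seen, and not seen, at least once overall")
    grand_total = sum(row_totals)
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / grand_total
            chi2 += (table[r][c] - expected) ** 2 / expected
    # For 1 degree of freedom the upper tail of chi-square is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts of 'A': (disease group, control group), out of 1000 each.
snps = {
    "snp_near_gene": (380, 300),
    "snp_elsewhere_1": (300, 300),
    "snp_elsewhere_2": (510, 495),
}
alleles_per_group = 1000
threshold = 0.05 / len(snps)  # assumed Bonferroni-style cut-off

results = {}
for name, (in_disease, in_control) in snps.items():
    chi2, p = allele_test(in_disease, alleles_per_group, in_control, alleles_per_group)
    results[name] = (chi2, p, p < threshold)
    print(name, round(chi2, 2), p, p < threshold)
```

  With only three SNPs, the 38% versus 30% difference is flagged and the other two are not. In a real study with 100s of thousands of SNPs the cut-off has to be far stricter, which is a big part of why the statistics matter.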
  
  Principles like this are used right across biology, for example in plant breeding, where similar methods are used to find which genes are responsible for which traits (fruit colour, taste, yield).
  
  So sorry for the very long answer, but I hoped you’d be interested in getting a full answer rather than a not very informative one-liner.
