Tuesday, September 19, 2017

Accurate Genomic Prediction Of Human Height

I've been posting preprints on arXiv since its beginning ~25 years ago, and I like to share research results as soon as they are written up. Science functions best through open discussion of new results! After some internal discussion, my research group decided to post our new paper on genomic prediction of human height on bioRxiv and arXiv.

But the preprint culture is nascent in many areas of science (e.g., biology), and it seems to me that some journals are not yet fully comfortable with the idea. I was pleasantly surprised to learn, just in the last day or two, that most journals now have official policies that allow online distribution of preprints prior to publication. (This has been the case in theoretical physics since before I entered the field!) Let's hope that progress continues.

The work presented below applies ideas from compressed sensing, L1-penalized regression, etc. to genomic prediction. We exploit the phase transition behavior of the LASSO algorithm to construct a good genomic predictor for human height. We applied these methods ("compressed sensing") to ~500k genomes from UK Biobank, resulting in an accurate predictor which uses information from thousands of SNPs. The results are significant for the following reasons:

1. The actual heights of most individuals in our replication tests are within a few cm of their predicted height.

2. The variance captured by the predictor is similar to the estimated GCTA-GREML SNP heritability. Thus, our results resolve the missing heritability problem for common SNPs.

3. Out-of-sample validation on ARIC individuals (a US cohort) shows the predictor works on that population as well. The SNPs activated in the predictor overlap with previous GWAS hits from GIANT.
The scatterplot figure below gives an immediate feel for the accuracy of the predictor.
Accurate Genomic Prediction Of Human Height

Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos, and Stephen D.H. Hsu

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ∼40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
This figure compares predicted and actual height on a validation set of 2000 individuals not used in training (males + females; actual heights uncorrected for gender). We have also tested validity on a population of US individuals (i.e., out of sample; not from UKBB).

This figure illustrates the phase transition behavior at fixed sample size n and varying penalization lambda.

These are the SNPs activated in the predictor -- about 20k in total, uniformly distributed across all chromosomes; vertical axis is effect size of minor allele:

The big picture implication is that heritable complex traits controlled by thousands of genetic loci can, with enough data and analysis, be predicted from DNA. I expect that with good genotype | phenotype data from a million individuals we could achieve similar success with cognitive ability. We've also analyzed the sample size requirements for disease risk prediction, and they are similar (i.e., ~100 times sparsity of the effects vector; so ~100k cases + controls for a condition affected by ~1000 loci).

Note Added: Further comments in response to various questions about the paper.

1) We have tested the predictor on other ethnic groups and there is an (expected) decrease in correlation that is roughly proportional to the "genetic distance" between the test population and the white/British training population. This is likely due to different LD structure (SNP correlations) in different populations. A SNP which tags the true causal genetic variation in the Euro population may not be a good tag in, e.g., the Chinese population. We may report more on this in the future. Note, despite the reduction in power our predictor still captures more height variance than any other existing model for S. Asians, Chinese, Africans, etc.

2) We did not explore the biology of the activated SNPs because that is not our expertise. GWAS hits found by SSGAC, GIANT, etc. have already been connected to biological processes such as neuronal growth, bone development, etc. Plenty of follow up work remains to be done on the SNPs we discovered.

3) Our initial reduction of candidate SNPs to the top 50k or 100k is simply to save computational resources. The L1 algorithms can handle much larger values of p, but keeping all of those SNPs in the calculation is extremely expensive in CPU time, memory, etc.

4) We will have more to say about nonlinear effects, additional out-of-sample tests, other phenotypes, etc. in future work.

5) Perhaps most importantly, we have a useful theoretical framework (compressed sensing) within which to think about complex trait prediction. We can make quantitative estimates for the sample size required to "solve" a particular trait.

I leave you with some remarks from Francis Crick:
Crick had to adjust from the "elegance and deep simplicity" of physics to the "elaborate chemical mechanisms that natural selection had evolved over billions of years." He described this transition as, "almost as if one had to be born again." According to Crick, the experience of learning physics had taught him something important — hubris — and the conviction that since physics was already a success, great advances should also be possible in other sciences such as biology. Crick felt that this attitude encouraged him to be more daring than typical biologists who tended to concern themselves with the daunting problems of biology and not the past successes of physics.

Friday, September 15, 2017

Phase Transitions and Genomic Prediction of Cognitive Ability

James Thompson (University College London) recently blogged about my prediction that with sample size of order a million genotypes|phenotypes, one could construct a good genomic predictor for cognitive ability and identify most of the associated common SNPs.
The Hsu Boundary

... The “Hsu boundary” is Steve Hsu’s estimate that a sample size of roughly 1 million people may be required to reliably identify the genetic signals of intelligence.

... the behaviour of an optimization algorithm involving a million variables can change suddenly as the amount of data available increases. We see this behavior in the case of Compressed Sensing applied to genomes, and it allows us to predict that something interesting will happen with complex traits like cognitive ability at a sample size of the order of a million individuals.

Machine learning is now providing new methods of data analysis, and this may eventually simplify the search for the genes which underpin intelligence.
There are many comments on Thompson's blog post, some of them confused. Comments from a user "Donoho-Student" are mostly correct -- he or she seems to understand the subject. (The phase transition discussed is related to the Donoho-Tanner phase transition. More from Igor Carron.)

The chain of logic leading to this prediction has been discussed here before. The excerpt below is from a 2013 post The human genome as a compressed sensor:

Compressed sensing (see also here) is a method for efficient solution of underdetermined linear systems: y = Ax + noise , using a form of penalized regression (L1 penalization, or LASSO). In the context of genomics, y is the phenotype, A is a matrix of genotypes, x a vector of effect sizes, and the noise is due to nonlinear gene-gene interactions and the effect of the environment. (Note the figure above, which I found on the web, uses different notation than the discussion here and the paper below.)

Let p be the number of variables (i.e., genetic loci = dimensionality of x), s the sparsity (number of variables or loci with nonzero effect on the phenotype = nonzero entries in x) and n the number of measurements of the phenotype (i.e., the number of individuals in the sample = dimensionality of y). Then A is an n x p dimensional matrix. Traditional statistical thinking suggests that n > p is required to fully reconstruct the solution x (i.e., reconstruct the effect sizes of each of the loci). But recent theorems in compressed sensing show that n > C s log p is sufficient if the matrix A has the right properties (is a good compressed sensor). These theorems guarantee that the performance of a compressed sensor is nearly optimal -- within an overall constant of what is possible if an oracle were to reveal in advance which s loci out of p have nonzero effect. In fact, one expects a phase transition in the behavior of the method as n crosses a critical threshold given by the inequality. In the good phase, full recovery of x is possible.
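As a toy illustration of this setup (a synthetic sketch only: a random Gaussian matrix stands in for a real genotype matrix, scikit-learn's Lasso is the L1 solver, and all numbers are illustrative, not from the paper):

```python
# Sparse recovery via L1-penalized regression (LASSO) on a synthetic
# underdetermined system y = Ax + noise, with n << p but n > C s log p.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s, n = 1000, 10, 300                 # loci, sparsity, sample size
A = rng.standard_normal((n, p))         # stand-in for a genotype matrix
support = rng.choice(p, size=s, replace=False)
x_true = np.zeros(p)
x_true[support] = rng.choice([-1.0, 1.0], size=s)  # nonzero effect sizes
y = A @ x_true + 0.1 * rng.standard_normal(n)      # phenotype + noise

fit = Lasso(alpha=0.05).fit(A, y)
recovered = np.flatnonzero(np.abs(fit.coef_) > 0.1)
# In the good phase (n comfortably above the threshold), the recovered
# support coincides with the true support.
print(sorted(support), sorted(recovered))
```

Shrinking n well below s log p in this toy problem pushes it into the bad phase, where support recovery degrades abruptly rather than gracefully.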

In the paper below, available on arxiv, we show that

1. Matrices of human SNP genotypes are good compressed sensors and are in the universality class of random matrices. The phase behavior is controlled by scaling variables such as rho = s/n, and our simulation results predict the sample size threshold for future genomic analyses.

2. In applications with real data the phase transition can be detected from the behavior of the algorithm as the amount of data n is varied. A priori knowledge of s is not required; in fact one deduces the value of s this way.

3. For heritability h2 = 0.5 and p ~ 1E06 SNPs, the value of C log p is ~ 30. For example, a trait which is controlled by s = 10k loci would require a sample size of n ~ 300k individuals to determine the (linear) genetic architecture.
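The arithmetic behind that estimate, as a sketch (the coefficient is the empirical, simulation-derived value quoted above, not a derived constant):

```python
# Threshold sample size n ~ (C log p) * s, using the simulation-derived
# value C log p ~ 30 for h2 = 0.5 and p ~ 1E06 SNPs.
C_LOG_P = 30        # empirical coefficient at h2 = 0.5
s = 10_000          # assumed number of loci with nonzero effect

n_threshold = C_LOG_P * s
print(n_threshold)  # 300000 individuals for the h2 = 0.5 case
```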
For more posts on compressed sensing, L1-penalized optimization, etc. see here. Because s could be larger than 10k, because the common SNP heritability of cognitive ability might be less than 0.5, because the phenotype measurements are noisy, and because a million is a nice round figure, I usually give that as my rough estimate of the critical sample size for good results. The estimate that s ~ 10k for cognitive ability and height originates here, but is now supported by other work: see, e.g., Estimation of genetic architecture for complex traits using GWAS data.

We have recently finished analyzing height using L1-penalization and the phase transition technique on a very large data set (many hundreds of thousands of individuals). The paper has been submitted for review, and the results support the claims made above with s ~ 10k, h2 ~ 0.5 for height.

Added: Here are comments from "Donoho-Student":
Donoho-Student says:
September 14, 2017 at 8:27 pm GMT

The Donoho-Tanner transition describes the noise-free (h2=1) case, which has a direct analog in the geometry of polytopes.

The n = 30s result from Hsu et al. (specifically the value of the coefficient, 30, when p is the appropriate number of SNPs on an array and h2 = 0.5) is obtained via simulation using actual genome matrices, and is original to them. (There is no simple formula that gives this number.) The D-T transition had only been established in the past for certain classes of matrices, like random matrices with specific distributions. Those results cannot be immediately applied to genomes.

The estimate that s is (order of magnitude) 10k is also a key input.

I think Hsu refers to n = 1 million instead of 30 * 10k = 300k because the effective SNP heritability of IQ might be less than h2 = 0.5 — there is noise in the phenotype measurement, etc.

Donoho-Student says:
September 15, 2017 at 11:27 am GMT

Lasso is a common statistical method but most people who use it are not familiar with the mathematical theorems from compressed sensing. These results give performance guarantees and describe phase transition behavior, but because they are rigorous theorems they only apply to specific classes of sensor matrices, such as simple random matrices. Genomes have correlation structure, so the theorems do not directly apply to the real world case of interest, as is often true.

What the Hsu paper shows is that the exact D-T phase transition appears in the noiseless (h2 = 1) problem using genome matrices, and a smoothed version appears in the problem with realistic h2. These are new results, as is the prediction for how much data is required to cross the boundary. I don’t think most gwas people are familiar with these results. If they did understand the results they would fund/design adequately powered studies capable of solving lots of complex phenotypes, medical conditions as well as IQ, that have significant h2.

Most people who use lasso, as opposed to people who prove theorems, are not even aware of the D-T transition. Even most people who prove theorems have followed the Candes-Tao line of attack (restricted isometry property) and don’t think much about D-T. Although D eventually proved some things about the phase transition using high dimensional geometry, it was initially discovered via simulation using simple random matrices.

Wednesday, September 13, 2017

"Helicopter parents produce bubble wrapped kids"

Heterodox Academy. In my opinion these are reasonable center-left (Haidt characterizes himself as "liberal left") people whose views would have been completely acceptable on campus just 10 or 20 years ago. Today they are under attack for standing up for freedom of speech and diversity of thought.

Sunday, September 10, 2017

Bannon Unleashed

Most of this short segment was edited out of the long interview shown on 60 Minutes (see video below).

Bannon denounces racism and endorses Citizenism. See also The Bannon Channel.

Paraphrasing slightly:
Economic nationalism inclusive of all races, religions, and sexual preferences -- as long as you're a citizen of our country.

The smart Democrats are trying to get the identity politics out of the party. The winning strategy will be populism -- the only question is whether it will be a left-wing populism or right-wing populism. We'll see in 2020.
This is the longer interview, with no quarter given to a dumbfounded Charlie Rose:

This is a 1 hour video that aired on PBS. There are amazing details about the 2016 campaign from Bannon the deep insider. If you followed the election closely you will be very interested in this interview. (In case this video is taken down you might find the content here.)

Varieties of Snowflakes

I was pleasantly surprised that New Yorker editor David Remnick and Berkeley law professor Melissa Murray continue to support the First Amendment, even if some of her students do not. Remnick gives Historian Mark Bray (author of Antifa: The Anti-Fascist Handbook) a tough time about the role of violence in political movements.

After Charlottesville, the Limits of Free Speech

David Remnick speaks with the author of a new and sympathetic book about Antifa, a law professor at University of California, Berkeley, and a legal analyst for Slate, to look at how leftist protests at Berkeley, right-wing violence in Charlottesville, and open-carry laws around the country are testing the traditional liberal consensus on freedom of expression.

Thursday, September 07, 2017

BENEFICIAL AI 2017 (Asilomar meeting)

AI researcher Yoshua Bengio gives a nice overview of recent progress in Deep Learning, and provides some perspective on challenges that must be overcome to achieve AGI (i.e., human-level general intelligence). I agree with Bengio that the goal is farther than the recent wave of excitement might lead one to believe.

There were many other interesting talks at the BENEFICIAL AI 2017 meeting held in Asilomar CA. (Some may remember the famous Asilomar meeting on recombinant DNA in 1975.)

Here's a panel discussion Creating Human-level AI: How and When?

If you like speculative discussion, this panel on Superintelligence should be of interest:

Tuesday, September 05, 2017

DeepMind and StarCraft II Learning Environment

This Learning Environment will enable researchers to attack the problem of building an AI that plays StarCraft II at a high level. As observed in the video, this infrastructure development required significant investment of resources by DeepMind / Alphabet. Now, researchers in academia and elsewhere have a platform from which to explore an important class of AI problems that are related to real world strategic planning. Although StarCraft is "just" a video game, it provides a rich virtual laboratory for machine learning.
StarCraft II: A New Challenge for Reinforcement Learning

This paper introduces SC2LE (StarCraft II Learning Environment), a reinforcement learning environment based on the StarCraft II game. This domain poses a new grand challenge for reinforcement learning, representing a more difficult class of problems than considered in most prior work. It is a multi-agent problem with multiple players interacting; there is imperfect information due to a partially observed map; it has a large action space involving the selection and control of hundreds of units; it has a large state space that must be observed solely from raw input feature planes; and it has delayed credit assignment requiring long-term strategies over thousands of steps. We describe the observation, action, and reward specification for the StarCraft II domain and provide an open source Python-based interface for communicating with the game engine. In addition to the main game maps, we provide a suite of mini-games focusing on different elements of StarCraft II gameplay. For the main game maps, we also provide an accompanying dataset of game replay data from human expert players. We give initial baseline results for neural networks trained from this data to predict game outcomes and player actions. Finally, we present initial baseline results for canonical deep reinforcement learning agents applied to the StarCraft II domain. On the mini-games, these agents learn to achieve a level of play that is comparable to a novice player. However, when trained on the main game, these agents are unable to make significant progress. Thus, SC2LE offers a new and challenging environment for exploring deep reinforcement learning algorithms and architectures.

Friday, September 01, 2017

Lax on vN: "He understood in an instant"

Mathematician Peter Lax (awarded National Medal of Science, Wolf and Abel prizes), interviewed about his work on the Manhattan Project. His comments on von Neumann and Feynman:
Lax: ... Von Neumann was very deeply involved in Los Alamos. He realized that computers would be needed to carry out the calculations needed. So that was, I think, his initial impulse in developing computers. Of course, he realized that computing would be important for every highly technical project, not just atomic energy. He was the most remarkable man. I’m always utterly surprised that his name is not common, household.

It is a name that should be known to every American—in fact, every person in the world, just as the name of Einstein is. I am always utterly surprised how come he’s almost totally unknown. ... All people who had met him and interacted with him realized that his brain was more powerful than anyone’s they have ever encountered. I remember Hans Bethe even said, only half in jest, that von Neumann’s brain was a new development of the human brain. Only a slight exaggeration.

... People today have a hard time to imagine how brilliant von Neumann was. If you talked to him, after three words, he took over. He understood in an instant what the problem was and had ideas. Everybody wanted to talk to him.


Kelly: I think another person that you mention is Richard Feynman?

Lax: Yes, yes, he was perhaps the most brilliant of the people there. He was also somewhat eccentric. He played the bongo drums. But everybody admired his brilliance. [ vN was a consultant and only visited Los Alamos occasionally. ]
Full transcript. See also Another species, an evolution beyond man.
