Genetics meets proteomics: perspectives for large population-based studies

Suhre, Karsten; McCarthy, Mark I.; Schwenk, Jochen M.

doi:10.1038/s41576-020-0268-2

Review Article
Published: 28 August 2020

Nature Reviews Genetics volume 22, pages 19–37 (2021)

Abstract

Proteomic analysis of cells, tissues and body fluids has generated valuable insights into the complex processes influencing human biology. Proteins represent intermediate phenotypes for disease and provide insight into how genetic and non-genetic risk factors are mechanistically linked to clinical outcomes. Associations between protein levels and DNA sequence variants that colocalize with risk alleles for common diseases can expose disease-associated pathways, revealing novel drug targets and translational biomarkers. However, genome-wide, population-scale analyses of proteomic data are only now emerging. Here, we review current findings from studies of the plasma proteome and discuss their potential for advancing biomedical translation through the interpretation of genome-wide association analyses. We highlight the challenges faced by currently available technologies and provide perspectives relevant to their future application in large-scale biobank studies.

Introduction

Genome-wide association studies (GWAS) for human diseases have robustly identified large numbers of risk-associated genetic variants implicated in disease susceptibility¹. The colocalization of genetic associations for disease traits with those for intermediate molecular phenotypes, such as gene expression² and metabolomics³, provides powerful evidence to advance hypotheses regarding the genes and pathways through which these disease-associated variants mediate their effects⁴. Proteins can appear in the blood circulation owing to active secretion or cellular leakage and thereby provide a window into the current state of human health⁵. Quantitative trait loci (QTLs) for circulating proteins can therefore complement these analyses: collectively, colocalized associations with multiple omic phenotypes offer a route to a comprehensive molecular interpretation of disease GWAS hits⁶. These analyses have the potential to expose disease-causing pathways, uncover new drug targets, highlight novel therapeutic indications and identify clinically relevant biomarkers⁷.

However, despite the central importance of proteins in disease pathogenesis, large-scale studies of protein QTLs (pQTLs) have only recently become feasible^{8,9,10,11,12,13}. For most biomedical and clinical applications conducted at a population scale, whole blood and its cell-free derivatives are clinically accessible and minimally invasive specimens that are well suited to the assessment of human health and disease states. Studies of blood-based pQTLs have identified hundreds of associations between single-nucleotide polymorphisms (SNPs) and protein levels^{6,8,9,10,11,14,15,16,17}. Many of these pQTLs colocalize with association signals for common human diseases. Nevertheless, the scale of the largest pQTL studies conducted to date remains limited to the examination of a few thousand proteins across a few thousand individuals. There remains considerable potential in scaling these studies up, in terms of both sample size and proteome coverage.

Beyond their value to genetic studies, large-scale proteomics data collected in population studies provide opportunities to study non-genetic associations, making it possible to capture markers of lifestyle and environmental exposure, to stratify individuals according to their state of health (or disease) and to monitor the longitudinal progression of disease. The development of increasingly massive biobanks and population cohorts allows us to envisage deep molecular phenotyping performed across hundreds of thousands of individuals^18,19,20,21. These opportunities inevitably raise questions regarding which technologies to prioritize, what outcomes to expect and whether it is worth the investment (given current costs) to characterize the proteomes of entire populations.

In this Review, we focus on large-scale proteomic technologies currently capable of profiling the blood circulating proteome in large population studies (Fig. 1). We start by reviewing the current state of the field and describe the proteomic composition of serum and plasma. Next, we turn to the experimental challenges related to measuring the circulating proteins, focusing on technologies that are applicable in high throughput to blood-derived samples collected from large population studies. Then we provide an overview of the pQTL studies conducted to date and discuss current challenges and opportunities faced by the field. Finally, we showcase selected applications of pQTL studies in the context of disease GWAS. We close with perspectives for future large-scale studies with proteomics and how the information they provide can be translated into clinical and biomedical applications.

**Fig. 1: Key concepts of pQTL studies.**

The human plasma proteome

The human proteome is defined by analysis and annotation of all potentially protein-coding genes in the human genome. There are approximately 20,000 genes in humans that provide the blueprint for all proteins to be expressed and processed at any given point in time and in any tissue or organ of the human body. However, translated proteins often undergo further post-translational processing: for example, particular amino acid residues can be modified to modulate their physiological properties through phosphorylation or glycosylation. Most proteins then execute their function in interaction with other partners, either locally in their cells and tissues of origin, or after being transported to act in some distant process. In the following section, we introduce the current efforts of the proteomics community with a focus on approaches and technologies to profile proteins in blood.

Experimental coverage of the human proteome

Critical for any proteomic studies is the definition of the set of proteins that one might expect to detect. As of today, the Human Proteome Project has collected mass spectrometry (MS) data that provide experimental evidence for almost 90% of the approximately 20,000 canonical proteins predicted from genomic open reading frames²². One of the current main efforts of the Human Proteome Project is to find experimental evidence for the predicted — but so far undetected — ‘missing proteins’ and to determine their function. One strategy involves expanding proteomic analyses to rare tissues and cell types under a range of conditions²³.

MS remains the most commonly used technology²⁴ in proteomics, alongside complementary approaches using affinity-based assays^25,26 or recently emerging protein sequencing methods²⁷. Large-scale MS-based studies have captured nearly 80% of the human proteins observed across several human tissues^28,29 and provide access to browse data built from more than 250,000 peptides (see PeptideAtlas, Human Proteome Map and ProteomicsDB in Related links). In parallel, the Human Protein Atlas (HPA) is dedicated to mapping human protein expression across organs and tissues³⁰, including blood³¹, and to mapping their subcellular localization³². The HPA uses antibodies and transcriptomics to annotate tissue-specific protein expression on a gene-centric level and currently hosts data from 26,000 antibodies targeting proteins from more than 17,000 protein-coding genes (approximately 87% of the human proteome). The HPA has demonstrated that half of all proteins are expressed in all tissues and that instances of tightly restricted protein expression are far less common, except for proteins involved in specialized processes such as spermatogenesis. In terms of their cellular localizations, the HPA found that the products of approximately 13,000 protein-coding genes reside inside cells and the products of approximately 5,500 protein-coding genes reside in membranes, while the products of approximately 2,600 protein-coding genes are secreted into the extracellular space³⁰. Although the proteomic analysis of tissues or cells is less relevant to large-scale population studies, such analyses are central to efforts to understand the mechanisms involved in disease pathogenesis, as well as to put findings from population studies into context.

The plasma proteome

On collection, blood is often centrifuged to generate fractions containing blood cells and a cell-free fluid, which accommodates the circulating proteome. The fluid is called ‘serum’ or ‘plasma’ depending on whether blood clotting is permitted (serum) or inhibited by anticoagulants (plasma) during the blood draw. The key challenges in analysing blood samples are the broad range of concentrations at which circulating proteins appear and the fact that some proteins fluctuate strongly in their abundance in response to disease or other physiological reasons. Collectively, MS or immunoassays have detected about 5,000 circulating proteins, which represent approximately 25% of the human proteome³³. Whereas more than 2,300 proteins account for 75% of the overall protein mass in cells, only 20 account for around 90% of the total protein mass in plasma²⁸.

Proteins circulating in plasma can theoretically originate from any organ or cell, or can even pass through the placenta to be exchanged between mother and child³⁴. However, the most abundant proteins, such as albumin, apolipoproteins and components of the complement system, are primarily produced and secreted by the liver. A recent reannotation of the human secretome found that about 730 of the estimated 2,600 secreted proteins have the bloodstream as their primary destination: other secreted proteins can reside locally in the extracellular matrix or in other body fluids³⁵. More information about the human secretome annotation is accessible via the secreted protein section of the HPA portal (see Related links). Proteins that are secreted or that leak into the circulation are found in a wide range of concentrations and are involved in processes related to blood homeostasis, transportation, defence, signalling, digestion and inhibition of other proteins, inflammation and mechanisms of wound healing³⁵.

Although secretion is the active process directing proteins towards the extracellular space, the presence of cellular proteins in the bloodstream may be a consequence of a variety of different and complex processes, including response to damage to cells and organs (Fig. 1). The detection of leakage proteins, such as cardiac troponin, can provide biomarkers for tissue damage. Knowledge of the tissue and cellular origin of the proteins that leak into the blood can reveal insights into the characteristics of the biosample: proteins leaking from blood cells can be used to assess the quality of sample preparation³⁶.

Current limitations in the knowledge of the circulating proteome

There has been substantial progress in studying the characteristics of proteins across tissues and determining their structure–function relationship in biological systems³⁷. More recently, there have been moves towards single-cell proteomics³⁸. Blood-based proteomics has focused mainly on expanding measures of protein abundance: insights into interactions, structures, isoforms and post-translational modifications of circulating proteins have not been addressed in great detail. Even though these factors contribute to the complexity of identifying and precisely measuring the components of the plasma proteome, current technologies are unable to characterize these at the required throughput, precision and sensitivity. When alternative splicing of transcripts from the approximately 20,000 protein-coding genes is accounted for, the estimated number of possible ‘proteoforms’ reaches 70,000 (ref.³⁹), a number further magnified when all possible combinations of post-translational modifications are considered. Capturing this diversity poses enormous technical challenges considering that different proteoforms of the same protein may coexist in the same sample, and that one of them may be pathogenic, while another may be protective⁴⁰.

Preanalytical variables can also influence the quantification of circulating proteins, and to minimize possible effects, stringent, consistent and short sample processing procedures are recommended^36,41. Variables include the blood collection type (serum or plasma), collection tubes (type of anticoagulants), preservatives (protease inhibitors), separation of blood cells (centrifugation speed), times for preparation (needle to centrifugation) and until storage (needle to freeze), storage format (aliquotation) and biobanking (freezer temperature). Besides inconsistency in plasma purity³⁶, protein degradation occurring after blood draw and over the years of storage can influence the detectable protein levels⁴². Effects of preanalytical variables have been demonstrated for the processing steps from needle to freezer, such as precentrifugation or postcentrifugation delay^43,44,45. In multicentre studies, differences in institutional practices in sample processing, shipping and storage are to be considered. For each proteomics technology or assay used, the influence of preanalytical variables on the protein detectability needs to be considered and assessed^46,47,48,49 so that data from different studies can be robustly compared⁴¹.

Probing the plasma proteome in high throughput

The ultimate aim of population proteomic analysis is to be able to survey the circulating plasma proteome of individuals at high sensitivity and high specificity combined with a high degree of multiplexing, massive sample throughput and low cost. The challenge lies in applying precise protein-detection methods that are able to quantify thousands of proteins across a wide dynamic range in many thousands of samples, while minimizing the amount of starting material required and the analysis time needed. In the following discussions, we introduce and compare the main technologies that allow the capture of circulating proteins in the context of large-scale population studies; that is, MS and affinity-based methods (for further methodological information, see Box 1).

Box 1 The principal technologies used to characterize the blood proteome

Mass spectrometry

Mass spectrometry (MS) is the most commonly used technology in proteomics. Predominantly, enzymatic digestion of samples is used to generate peptides for the identification and quantification of proteins. Analytically, the peptides are separated into fractions by chromatographic methods and sequentially injected into the MS instrument, where the peptides are scanned, selected, fragmented and detected. In data-dependent acquisition (DDA), a selected set of prominent peptides is chosen for analysis, whereas data-independent acquisition (DIA) scans for all peptides within a mass-to-charge window. The aim is to increase the analytical coverage and depth without the need for extensive removal or separation of abundant proteins and peptides.

In either setting, a bioinformatics pipeline is applied to match the detected peptides with peptide libraries to identify the respective protein (preferably using several unique peptides containing more than nine amino acids). MS-based proteomics can further determine protein modifications, structures and interactions, or can even detect the intact protein instead of its peptides. Quantification in MS-based analysis is usually achieved via isotope-labelled standards that have a definable shift in mass compared with their endogenous counterparts. Current MS workflows are capable of detecting 500–1,000 plasma proteins in tens of samples at a time.

As depicted in the figure (part a), a protein of interest (in yellow) is cleaved into peptides by enzymatic digestion, separated by chromatography and subsequently injected into the mass analyser. Per analysis, several thousand peptides can be detected. The identification of the protein is based on matching several peptides from the experimental data with a reference library. The peak intensity of the peptide spectra is used as a measure of protein abundance.

Multiplexed antibody-based immunoassays (Luminex and Olink)

Sandwich immunoassays are a widely applied technique to quantify analytes in clinical settings. They are based on the concept of dual antigen recognition using a matched pair of binders, most often antibodies, where the concurrent binding of two antibodies to a specific protein enables its detection. These immunoassays are considered to provide the most sensitive methods for the detection of low-abundance analytes.

Miniaturization of such assays has also made it possible to analyse several proteins in parallel. Such systems are based on planar microarray chips or, more frequently, on barcoded particles (Luminex and Quanterix), and most often use fluorescent reporter molecules on the secondary binding molecule for readout. These immunoassays quantify up to 30 proteins in multiplex through calibration of each analyte assay via protein standards and have been established in formats allowing up to 384 samples to be profiled at a time.

A version of the multiplexed immunoassay makes use of dual recognition by modifying antibodies with complementary DNA tags (Olink). Upon binding in close proximity to a common target, the tags of the two binders can be extended and the DNA is amplified. Followed by a readout based on quantitative PCR, this solution-phase assay reports protein abundance via the number of amplification cycles. From a portfolio of ~1,100 targets, the current format of this proximity extension assay allows the detection of approximately 90 proteins in 90 samples at a time using a microfluidic readout platform.

As visualized in the figure (part b), first, protein-specific antibodies (Y) are generated, selected and used to develop an assay for one respective target protein at a time. In dual-binder assays, several pairs of antibodies are then combined for multiplexed analysis. Exemplified for Olink, the readout is based on proximity extension assays of complementary primer pairs of two matching antibodies. Using fluorescent molecules as the reporters, the assay delivers protein abundance as the inverse number of quantitative PCR cycles.

Multiplexed aptamer-based immunoassay (SomaScan)

Alternative methods have emerged that use DNA or RNA scaffolds as binding reagents. The versatility of oligonucleotides, chemical modifications and combinatorial designs has enabled an aptamer-based proteomics platform. The assay is based on in-solution protein capture followed by stages of labelling and purification. The selectivity of the assay is enhanced by selecting binding molecules with extended off-rate binding kinetics. The fluorescent DNA aptamers are then hybridized to a DNA chip for detection. Today, aptamers for up to 5,000 different proteins have been implemented into this platform to profile approximately 90 samples at a time.

As shown in the figure (part c), first, protein-specific aptamers (A) are selected from libraries of designed DNA molecules. The SomaScan approach selects modified aptamers that bind to the target for the longest time (slow off rate). Multiplexing is achieved by combining several hundred aptamer–protein interactions, followed by biotinylation of protein–aptamer complexes and release of bead-bound aptamers. The free aptamers are then hybridized to a DNA microarray, where the numbers of remaining and labelled aptamers are the measure of protein abundance.

MS-based proteomics

Since the early 2000s, MS-based analysis has been applied to measure circulating proteins^5,50. At its core, MS measures the mass-to-charge ratio (m/z) of ionized molecules, such as proteins or peptides, within a gas phase. Although many variations and instruments are used for the ionization, detection, identification and quantification of the ions, plasma proteomics workflows are typically based on an enzymatic digestion of the circulating proteins into peptides. Depletion of highly abundant proteins⁵¹ or enrichment of proteins present at low concentrations⁵² is sometimes applied to address the challenge imposed by the concentration range and dynamics of proteins in plasma.

There are two complementary approaches to peptide measurement in MS. Targeted MS uses stable-isotope standard peptides as reference points to provide absolute quantification of the peptides in the sample. By contrast, untargeted MS uses the intensity of the peptide ions as a semiquantitative readout of peptide abundance. Particularly for plasma, as peptides representing low-abundance proteins can be masked by peptides originating from more highly abundant proteins, accurate measurement across a wider range of protein abundance concentrations has generally required extensive prefractionation of the sample, limiting MS-based analyses to smaller-scale studies⁵³. Efforts to increase the scale and scope of MS-based assays have revived interest in application to plasma, with recent studies demonstrating reliable detection of approximately 500 abundant proteins in nearly 1,000 samples^41,54. In MS studies focusing on detecting as many proteins as possible, up to 4,000 proteins can be identified in a handful of samples⁵⁵. Although general concepts, advances and the versatility of MS have been reviewed extensively elsewhere^37,56, we describe here some recent studies that indicate the growing potential of MS for population-wide studies of the plasma proteome.

For untargeted MS, data-dependent acquisition (DDA) MS methods, which focus on analysing the more prominent peptide ions, have been widely used⁵⁷. Recent efforts to shorten the analysis time and peptide scanning schemes allow measurement of nearly 300 proteins in 10 plasma samples within 3 hours⁵⁸, and one recent study included 1,300 sample assays for 450 proteins⁵⁹. Data-independent acquisition (DIA) approaches consider all peptide ions and are better able to detect less-abundant proteins. DIA requires less complex preanalytical sample handling than DDA but at the cost of increased time for data collection and subsequent bioinformatic analyses to match the extended list of targets with comprehensive protein libraries. One recent DIA-based study described the quantification of 340 proteins in 200 plasma samples from twins⁶⁰ and revealed that 50% of all detected peptides (so-called peptidoforms) had undergone post-translational modification⁶¹. The largest DIA MS proteomics study to date included 1,500 plasma samples, using improved chromatography and increased reproducibility to detect a common set of 320 proteins in more than 90% of the samples⁶².

For more focused studies, where a smaller number of preselected proteins are of interest, targeted MS workflows can be preferred. In such a setting, targeted MS provides a more sensitive and absolute quantification of protein concentrations. These assays use predetermined analysis conditions and addition of known amounts of synthetic, isotopically labelled peptides (or proteins) to act as references for quantification. This approach facilitates comparison of protein quantities across studies^51,63,64. However, the scope of targeted MS is constrained by the predefined sets of labelled canonical peptides offered by companies such as Biognosys (Zurich, Switzerland) or MRM Proteomics (Montreal, Canada).
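
The quantification step described above can be illustrated with a minimal sketch. It assumes the simplest single-point case, in which the labelled reference peptide and its endogenous counterpart give the same signal per unit amount, so that the ratio of their peak areas scales the known spiked concentration. The function and parameter names are hypothetical and do not correspond to any vendor software.

```python
def endogenous_concentration(light_area, heavy_area, spiked_concentration):
    """Estimate the endogenous peptide concentration from one labelled standard.

    light_area: peak area of the endogenous (unlabelled) peptide
    heavy_area: peak area of the isotopically labelled reference peptide
    spiked_concentration: known concentration of the labelled reference
    Assumption: both forms respond identically in the instrument.
    """
    if heavy_area <= 0:
        raise ValueError("the labelled reference peptide was not detected")
    return (light_area / heavy_area) * spiked_concentration


# Illustrative numbers only: the endogenous peak is twice the reference peak,
# so the estimate is twice the spiked concentration (same units as the input).
estimate = endogenous_concentration(2.0e6, 1.0e6, 50.0)
```

Because the result is expressed relative to a known spiked amount rather than to raw signal intensity, values obtained this way can in principle be compared between runs and between studies.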

Affinity-based proteomics

Affinity proteomics has emerged as an attractive alternative to MS-based identification of proteins²⁶ by conducting classical immunoassays at higher throughput and higher sensitivity and in multiplex format²⁵. Immunoassays such as ELISA, immunohistochemistry, immunofluorescence, flow cytometry and western blots all mainly use classic antibodies, but may also use alternative binders, to detect endogenous proteins in a variety of sample types⁶⁵. The basic concept for these multianalyte assays⁶⁶, often making use of advances in DNA technologies (such as DNA microarrays), relies on measuring multiple proteins simultaneously in one sample by miniaturizing the assays. A reduced analytical surface area increases the signal-to-noise ratio by increasing the number of occupied binding sites and avoids the depletion of the analyte in solution.

Over the past two decades, methods for multiplexed protein measurement have emerged to enable analyses across a wide range of concentrations, in thousands of samples. These developments have made these assays attractive for population-wide analyses, and they have been adopted in most recent genome-wide pQTL analyses. These advances in affinity assay methods rely on the improved performance of the binding reagents (antibodies or aptamers) used to detect antigens with high selectivity and high binding affinity⁶⁷. For analyses of plasma proteins, available platforms range from ultrasensitive single-analyte assays for the detection of circulating proteins at less than 1 pg ml⁻¹ (ref.⁶⁸) to multiplexed assays that capture more than 4,000 proteins by designed protein-binding DNA aptamers⁹.

The suspension bead array technology of Luminex (Austin, TX, USA) supports multiplexing via combining different colour-coded microparticles followed by flow cytometry analysis⁶⁹: each colour code represents one immunoassay for a given protein target. These assays can typically be multiplexed to quantify 30 analytes per sample, processed in batches of up to 384 samples, and have been applied to studies of more than 8,000 samples and up to 50 proteins⁷⁰. Myriad RBM (Austin, TX, USA) offers services based on a set of more than 300 proteins on the Luminex technology, whereas a variety of predefined and/or customizable panels of sandwich immunoassays are available from providers such as MilliporeSigma, R&D Systems, Bio-Rad and Thermo Fisher. Alternative platforms for multiplexed immunoassays are offered by Quanterix (Billerica, MA, USA), ProteinSimple (San Jose, CA, USA) and Meso Scale Diagnostics (Rockville, MD, USA), and are reviewed elsewhere⁷¹.

By contrast, the technology implemented by Olink (Uppsala, Sweden) uses proximity extension assays. Detection of a given protein requires binding of two separate antibodies that carry complementary oligonucleotide tags: when two antibodies bind to the target protein, these oligonucleotides can be hybridized and extended by a DNA polymerase⁷². This assay therefore uses protein-specific binding properties to generate a readout that then relies on DNA concentrations more easily quantifiable by quantitative PCR. Presently, the technology enables the analysis of up to 1,164 proteins, distributed across 14 themed protein panels, each panel analysing 92 proteins in 90 samples at a time. Recently, one of these panels was deployed across 3,400 individuals to map genetic loci for plasma protein biomarkers in cardiovascular disease⁷³, followed by a meta-analysis in more than 21,000 individuals⁷⁴.

SomaLogic (Boulder, CO, USA) uses an in-house-developed library of modified aptamers for highly multiplexed protein profiling⁷⁵. In recent years, its aptamer library has grown from approximately 800 (ref.⁷⁵) to more than 4,000; in parallel the sample sizes investigated have grown from hundreds to tens of thousands⁷⁶. The proprietary assay processes 90 samples per batch, and its readout is on DNA arrays. Instead of matching two binding reagents for increased specificity, the selectivity of the platform is built around specific aptamers selected for their capability to bind to their target protein for an extended time (slow off rate).

Comparative discussion of the available methods

Several criteria influence decisions about which technology to deploy for a given large-scale plasma proteome analysis. Although no single method is a universal solution for all analytical aspects, each has particular strengths. There has been no systematic and direct comparison of all available methods, but a starting point for future work could be the ‘popular proteins’⁷⁷, which is a set of approximately 1,000 circulating proteins that can currently be detected by both MS-based and affinity assay approaches³³. Analytical performance criteria for the main technologies are summarized in Table 1, including specificity, reproducibility, sensitivity, degree of multiplexing, sample throughput, quantification and translatability, costs and accessibility of the technology and the derived data.

Table 1 Features of plasma proteomics technologies

MS-based approaches benefit from a large community of users and companies. The mode of MS-based protein identification and absolute target quantification as well as future opportunities to capture several post-translational modifications⁶¹ are noticeable benefits compared with the affinity assays. However, sample throughput and analytical sensitivity are relatively low (there is no amplification method available), limiting the potential to apply MS-based approaches to large plasma studies. Peptide sequence variants and modifications may further increase the number of missing data points. Newer approaches — such as DIA and variations thereof⁵⁶ — enable the protein-level data to be digitally stored, so improvements in matching tools to detect peptides together with growing libraries may allow reanalysis of the data in the future.

Affinity-based methods clearly lead on sample throughput (N > 1,000) and analytical sensitivity (below 1 ng ml⁻¹), which can be enhanced by use of signal amplification methods such as are possible with some DNA-based readouts. The number of analytes that can be measured differs between technologies, and several panels may be used to detect large numbers of target proteins. Although the number of reagents to target the human proteome is increasing⁷⁸, the susceptibility to off-target binding and the lack of reproducibility of assays have raised concerns about the quality of some affinity-based measures⁷⁹. To achieve higher-quality data, guidelines on the validation of affinity capture-based data have been established⁸⁰ and systematically applied to a variety of common techniques and samples^81,82.

GWAS with proteins circulating in plasma

The goal of proteomic GWAS is to identify genetic sequence variants associated with proteomic features in a given cell or biofluid. Those features are typically protein abundance levels but could extend to other characteristics, such as isoform diversity or post-translational modification. GWAS analysis conducted across the large number of proteomic phenotypes now becoming available for many cohorts presents challenges, not only in terms of computational efforts and data management but also with respect to the quality control of the phenotypic measurements, automation of data analyses, and integration and interpretation of the results.

Identification of pQTL signals

Preprocessing and quality control of proteomic data to avoid spurious association from outliers, or deviations of the protein abundance distributions from assumptions made in standard statistical models, may be cumbersome when one is working with thousands of quantitative traits. Inverse normal scaling of protein levels is often used as a simple and conservative approach to deal with distributional issues and appears to lead to robust associations, as shown by the generally good replication between studies¹⁰; however, this is achieved at the cost of some loss in statistical power. Log scaling and/or winsorization — that is, moving extreme values closer to the normal distribution of the proteomic data — offer alternative approaches to inverse normal scaling, and computation of P values based on data preprocessed by multiple methods can be used to obtain consensus associations⁸. Associations can be computed with the same tools and models as deployed in other GWAS of quantitative traits^83,84: batch effects and covariates can be managed by a combination of direct inclusion in analytical models and the use of latent variable approaches, such as PEER⁸⁵, to ‘find’ the covariates that matter (at some price of removing real effects). Adding specific principal components of the proteomic data as covariates may increase statistical power, for instance where principal components reflect variation induced by processes related to sample handling and storage.
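
The two preprocessing options named above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the procedure of any cited study: the rank-based inverse normal transformation uses the Blom offset of 3/8, and winsorization clamps values to simple nearest-rank quantiles. Both choices, and all function names, are assumptions made for the example.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def inverse_normal_transform(values, offset=3 / 8):
    """Map ranks to standard normal quantiles (offset 3/8 is an assumed choice)."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
            for r in average_ranks(values)]


def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clamp values below/above the chosen nearest-rank quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_q * (n - 1))]
    high = ordered[int(upper_q * (n - 1))]
    return [min(max(v, low), high) for v in values]


# Toy protein levels with one extreme outlier (illustrative values only).
levels = [1.0, 2.0, 3.0, 4.0, 1000.0]
z_scores = inverse_normal_transform(levels)
clamped = winsorize(levels, lower_q=0.0, upper_q=0.75)
```

The transformation keeps only the rank order of the measurements, which is why it is robust to outliers but discards information about the size of differences between individuals, consistent with the loss of power noted above. Winsorization keeps the original scale and alters only the tails.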

For pQTL discovery studies, the gold standard involves application of a combined genome-wide and proteome-wide significance level together with independent replication. However, this can be a high bar, since a study-wide significance threshold designed to minimize false-positive associations may require researchers to account for the billions of tests performed (a million or more DNA variants for each of several thousand proteins). This approach leaves many studies underpowered to detect genuine but weaker associations. Limiting association testing to variants located in the immediate vicinity of the respective protein-encoding genes can increase statistical power for the detection of cis-pQTLs but restricts the breadth of the study. Novel methods to better incorporate correlative structures between proteins and genetic variants, such as use of sparse multivariate regression models⁸⁶, may improve this trade-off between power and robustness.
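The scale of the multiple testing burden can be made concrete with a short calculation. The sketch below assumes a simple Bonferroni correction at a family-wise error rate of 0.05, which treats all tests as independent and is therefore conservative; the study dimensions and the number of variants per cis window are illustrative values, not figures from a specific cohort.

```python
def total_tests(n_variants, n_proteins):
    """Number of variant-protein association tests in a scan."""
    return n_variants * n_proteins

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# Illustrative study dimensions (assumed for this example).
n_variants = 1_000_000
n_proteins = 3_000
cis_variants_per_protein = 5_000  # hypothetical number of variants near each gene

full_scan = total_tests(n_variants, n_proteins)
cis_scan = total_tests(cis_variants_per_protein, n_proteins)

print(f"full scan: {full_scan:.1e} tests, threshold {bonferroni_threshold(full_scan):.2e}")
print(f"cis only:  {cis_scan:.1e} tests, threshold {bonferroni_threshold(cis_scan):.2e}")
```

Under these assumptions, the full-scan threshold equals the conventional genome-wide level of 5 × 10⁻⁸ divided by the number of proteins, whereas restricting the scan to cis windows relaxes the threshold by the ratio of the two test counts.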

Extending pQTL analyses to include rarer alleles (rather than the common alleles that have been the primary focus of GWAS) may increase opportunities to detect variants with particularly large effects on protein expression and disease risk, since rare alleles are often of more recent origin and may not yet have been subject to negative selection. However, as with disease association studies, the robust detection of rare variant associations can be troublesome, particularly if the rarity of the alleles themselves requires the use of ‘variant aggregation’ tests to provide a signal. Such approaches depend for their power on the ability to combine disparate alleles that have similar broad functional impact, while avoiding dilution of the test by including neutral variants. Such an approach can be difficult enough for coding variants, where the gene provides an obvious unit for aggregation, but is some way from being solved for variants that fall into intergenic and regulatory regions.

pQTL studies to date

Over the past decade, there has been a profusion of GWAS analyses, many with a limited proteomic scope^1,6, including several large-scale studies with multiplexed immunoassays that cover tens to a few hundred proteins^{12,13,17,70,73,87}. We provide a comprehensive list of all pQTL studies of which we are currently aware in Table 2. To date, only two MS-based studies in blood have combined proteomics with GWAS, one identifying approximately 160 proteins in the plasma of approximately 1,000 individuals⁵³ and the other analysing the small peptide subset of a non-targeted metabolomics platform⁸⁸. Otherwise, large-scale pQTL studies have made use of affinity proteomics approaches^8,9,10,11.

Table 2 Published GWAS with plasma proteomics

Several of the affinity proteomics studies have used the aptamer approach developed by SomaLogic. The KORA study analysed more than 1,100 proteins in 1,000 participants from a German cohort⁸ and identified 540 pQTLs that connected 450 independent genetic variants with 280 proteins. This analysis was extended within the INTERVAL cohort of UK blood donors¹⁰: an expanded SomaScan panel of almost 3,000 proteins was deployed across 3,300 individuals, raising the pQTL count to 1,930, involving 1,480 proteins. The largest published SomaScan-based pQTL study analysed more than 4,000 proteins across 5,500 Icelanders from the AGES Reykjavik study⁹ and reported more than 3,130 pQTLs influencing the abundance of 1,800 proteins.

Studies using the Olink platform have tended to include smaller numbers of proteins but larger sample sizes. The most recent analysis, conducted by the SCALLOP consortium, analysed 90 cardiovascular proteins in more than 21,000 individuals⁷⁴: this study yielded a total of 467 pQTLs influencing 94% of these protein targets.

These pQTL studies have provided confidence in the reproducibility of the underlying methods. The three SomaScan studies displayed good agreement: the AGES study confirmed more than 84% of the pQTLs found in the KORA study and more than 72% of those from the INTERVAL study. The INTERVAL study assayed participants with both the SomaScan platform and the Olink platform and found that 65% of the SomaScan-detected pQTLs for overlapping targets were replicated with Olink (with a correlation of 0.95 in effect-size estimates). Some small-scale studies provide data measured in parallel on two platforms^89,90. Further comparative studies are needed to fully delineate the consistency between platforms, particularly between MS-based and affinity-based analyses: such analyses have been limited by the sparse overlap between proteins jointly detected on both platforms⁹¹.

Cis-pQTLs and trans-pQTLs

Genetic variants that associate with protein levels are generally classified into one of two categories. They are located either in cis, close to the gene that encodes the associated protein, or in trans, at greater distance, typically on a different chromosome (Fig. 2). Between 18% and 25% of the proteins assayed by aptamer assays have been found, at current sample sizes, to have a significant cis-pQTL. Twenty-seven per cent of the pQTLs reported by the KORA study⁸ were trans-pQTLs, rising to 70% in the larger INTERVAL study¹⁰. The greater multiple testing burden inherent in trans-pQTL analysis, together with the increased potential for non-genetic phenotypic variability (Fig. 2), makes sample size a critical factor in trans-pQTL detection.
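In practice, the cis or trans label is usually assigned with a distance rule. The sketch below assumes a window of 1 Mb on either side of the encoding gene; this cut-off is an assumption for illustration (published studies differ in the window they use), and the function name and coordinates are hypothetical.

```python
def classify_pqtl(variant_chrom, variant_pos, gene_chrom, gene_start, gene_end,
                  window=1_000_000):
    """Label a variant-protein association as 'cis' or 'trans'.

    A variant is called cis when it lies on the same chromosome as the gene
    encoding the protein and within `window` base pairs of the gene body.
    The 1 Mb default is an assumed cut-off, not a fixed standard.
    """
    if variant_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= variant_pos <= gene_end + window:
        return "cis"
    return "trans"

# Hypothetical coordinates for a gene on chromosome 1.
print(classify_pqtl("1", 1_500_000, "1", 2_000_000, 2_050_000))  # inside the window
print(classify_pqtl("9", 2_010_000, "1", 2_000_000, 2_050_000))  # other chromosome
```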

**Fig. 2: Ways a genetic variant can lead to a pQTL.**

Cis-pQTLs indicate the presence of a variant that is likely to have a direct and causal effect on the observed protein levels at that locus. If the causal variant of a cis-pQTL acts primarily through mRNA expression or turnover, an (RNA) expression QTL (eQTL) may also be found in a relevant tissue or cell type. If instead, the cis variant alters protein abundance through an impact on protein translation or turnover, a cis-eQTL is less likely, but it may still be observed if expression is upregulated through compensatory feedback to maintain protein levels. Although the presence of a nearby eQTL generally supports the identification of a causal pQTL variant, it is also possible that the two signals map close together by chance or that the eQTL and pQTL arise from different variants that are in high linkage disequilibrium. A variant in a shared promoter of two genes may further complicate the picture, giving rise to more complex hypotheses regarding the link between the variant and the associated protein level.

Trans-pQTLs are of particular inferential value because a trans-pQTL implies an interaction between an — often still to be identified — causal gene at the pQTL locus and the associated protein encoded at the trans position, pointing to novel pathways of protein regulation or interaction. Sun et al.¹⁰ present several examples including the identification of PRDM1 as the probable causal gene at an inflammatory bowel disease GWAS locus, and the dissection of complex association signals for antineutrophil cytoplasmic antibody-associated vasculitis at the SERPINA1 locus. Orthogonal evidence based on pharmacological intervention and transgenic mice can support hypothesized causal gene-to-protein relationships from trans-pQTLs, as exemplified by the recent study by the SCALLOP consortium on eight gene products targeted by compounds or antibodies in clinical development⁷⁴. Genetic loci involving multiple cis-pQTLs and trans-pQTLs are particularly informative, as they link multiple proteins in putative gene–protein networks that can suggest new hypotheses as to their possible interactions and functions. A genetic variant associated with a trans-pQTL can further change protein levels in trans through regulation or modification of cis-encoded microRNAs or epigenetic marks. The study by the SCALLOP consortium identified 30 trans-pQTLs that involved two or more proteins, with ABO, ST3GAL4, JMJD1C, SH2B3 and ZFPM2 showing association with the levels of 5 or more of the 90 proteins analysed by one panel of the Olink platform⁷⁴.

Colocalization with eQTLs

Most disease GWAS signals map to regulatory rather than coding sequences, and the downstream effectors through which they operate are typically unclear⁹². Evidence of enrichment between patterns of disease association and the location of tissue-specific regulatory sequences (such as enhancers), as well as for the sites of tissue-specific eQTLs, indicates that the identification of colocalizing cis-eQTLs has value in highlighting those effectors and reconstructing disease mechanisms⁹³. However, it is also clear that this approach is not as robust as has often been assumed: at many loci, there are multiple colocalizing cis-eQTLs, implicating several candidate effectors, and it seems intrinsically unlikely that all are directly mediating disease risk⁹⁴. pQTL analysis provides a complementary approach to reconstructing links between genetic variation, molecular processes and disease predisposition. In recent studies^8,9,10, only 40% of detected cis-pQTLs could be shown to colocalize with cis-eQTLs detected in projects such as Genotype–Tissue Expression (GTEx)², which generated cis-eQTL data for approximately 40 tissues in up to 900 individuals. It remains to be seen how the recently reported contamination of GTEx data with highly expressed, tissue-enriched genes impacts on this observation⁹⁵.

The discrepancy between pQTL and eQTL discovery has led to questions about their relative merits, questions which, by and large, ignore the fact that these data types are quite distinct. Most obviously, large-scale pQTL data are almost entirely restricted to the circulating proteome, whereas projects such as GTEx have provided eQTL maps for multiple tissues: there will be many loci where the downstream effectors (for example, transcription factors) are not represented in the circulation and where any contribution from pQTL studies will need to await technologies that provide tissue-specific, and even cell-specific, proteomic readouts. On the other hand, there will be other loci where the mediating mechanisms can be detected only through pQTL analyses: where the genetic variants exert their effects through an impact on protein stability or modification, for example, or where the protein half-life in the circulation far exceeds that of the RNA.

Data sharing

Sharing of full summary statistics — that is, reporting all pairwise variant–protein associations with effect size and standard error regardless of their significance levels — can be data storage intensive but is an essential component of scientific discourse. The GWAS catalogue¹ now accepts and maintains full summary association statistics data sets, and summary statistics for some of the larger pQTL studies are already freely available^8,10. Access to summary data enables immediate incorporation into online tools for causal inference that use Mendelian randomization approaches, such as EpiGraphDB⁹⁶ and MRbase⁹⁷. Whereas sharing of summary-level data should be a prerequisite for publication, proactive sharing of individual-level data (both raw and processed genotypes and protein measurements) renders data sets far more valuable facilitators of scientific discovery, fostering development of novel statistical approaches and maximizing the opportunities for integration with other complementary sources of data. Controlled-access repositories, such as EGA and dbGAP, provide a mechanism to support the sharing of sensitive individual-level data with bona fide investigators.
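To illustrate why effect sizes and standard errors are the essential quantities to share, the sketch below computes a single-variant Wald ratio, one of the simplest Mendelian randomization estimators, from summary statistics alone. It is a minimal sketch, not the method implemented by the tools named above: the function name and input values are hypothetical, and the standard error uses a first-order approximation that ignores uncertainty in the variant-protein effect.

```python
from statistics import NormalDist

def wald_ratio(beta_outcome, se_outcome, beta_exposure):
    """Single-variant Wald ratio estimate of the effect of a protein on an outcome.

    beta_exposure: effect of the variant on the protein level (from a pQTL study).
    beta_outcome, se_outcome: effect of the same variant on the outcome and its
    standard error (from a disease GWAS).
    The standard error is the first-order approximation se_outcome / |beta_exposure|.
    """
    if beta_exposure == 0:
        raise ValueError("the variant must be associated with the exposure")
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    z = estimate / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided normal P value
    return estimate, se, p_value

# Made-up summary statistics for one cis-pQTL variant.
estimate, se, p_value = wald_ratio(beta_outcome=0.1, se_outcome=0.02, beta_exposure=0.5)
print(estimate, se, p_value)
```

Because the calculation needs only the per-variant effect and standard error from each study, it can be run without access to individual-level data, provided both effects are reported for the same allele.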

Working with ratios

It is possible to conduct association analyses focused not only on the expression levels of individual proteins but also on the ratios between them, an approach inspired by the use of hypothesis-free testing of metabolite ratios in metabolomic GWAS⁹⁸. The challenge with ratio-based analyses is the massive explosion in the number of statistical tests that could be performed and the consequent need to allow for the inflation in type 1 error that could result. However, as with metabolomic data, where known relationships between products and substrates can be used to constrain the scope of ratio testing (avoiding testing of all pairwise combinations), there may be opportunities to perform protein-ratio QTL analyses in subsets of functionally connected proteins. For example, the strength of the pQTL association of rs41341749 with CCL14 and CCL23 increased by 35 orders of magnitude when the ratio of their abundances was used, an effect replicated in an independent cohort⁸. A gain in the strength of association when ratios are used may also result when one of the proteins concerned is a proxy for variability induced by sample handling and storage, or any other common normalizing factor.
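The growth in the number of tests, and the construction of ratio traits for a restricted set of protein pairs, can be sketched as follows. The example assumes that ratios are analysed on the log scale, so that the ratio a/b and its inverse b/a differ only in sign and each unordered pair needs to be tested once; the abundance values are made up, and the panel size of 1,100 simply mirrors the order of magnitude of the panels discussed above.

```python
import math
from itertools import combinations

def n_pairwise_ratios(n_proteins):
    """Number of unordered protein pairs, i.e. distinct log-ratio traits."""
    return n_proteins * (n_proteins - 1) // 2

def log_ratio_traits(levels, pairs):
    """Build one log-ratio trait per requested protein pair.

    levels: dict mapping protein name to a list of positive abundances,
            one value per sample, in the same sample order for all proteins.
    pairs:  iterable of (numerator, denominator) protein names.
    """
    traits = {}
    for a, b in pairs:
        traits[(a, b)] = [math.log(x) - math.log(y)
                          for x, y in zip(levels[a], levels[b])]
    return traits

print(n_pairwise_ratios(1_100))  # all-against-all ratios for a 1,100-protein panel

# Made-up abundances for two samples; only one pre-selected pair is tested.
levels = {"CCL14": [2.0, 4.0], "CCL23": [1.0, 4.0]}
print(log_ratio_traits(levels, [("CCL14", "CCL23")]))
```

Restricting `pairs` to functionally connected proteins, rather than passing every combination, is what keeps the number of additional tests manageable.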

Network integration

A broader view of disease biology can be enabled by analyses that present the relationships between proteins as networks, especially when those networks can be integrated with additional information such as that provided by genetic associations and colocalization of those associations with clinical end points. Network relationships between proteins can be derived from existing biological knowledge, using databases such as WikiPathways⁹⁹ and STRING¹⁰⁰, or they can emerge from data-driven integration in multidimensional omics data sets. For example, Gaussian graphical models (GGMs) have been shown to reconstruct pathway reactions from high-throughput metabolomics data¹⁰¹, and, when integrated with metabolic and disease GWAS associations, are powerful tools to mine high-dimensional GWAS data sets for biomedically relevant associations¹⁰². GGMs that include proteomic data recently revealed a genome–proteome–disease subnetwork implicated in Crohn’s disease⁸.
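The quantity underlying a GGM is the partial correlation: the correlation between two variables that remains after conditioning on all others. The sketch below illustrates the idea for the smallest possible case of three variables, where the partial correlation has a closed form; real applications estimate it for thousands of variables from the inverse covariance matrix. The toy data and variable names are invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_correlation(x, y, z):
    """Correlation between x and y after conditioning on a single variable z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: two proteins that are each driven by the same upstream factor
# plus independent variation of their own.
shared = [1, 1, -1, -1, 1, 1, -1, -1]
noise_a = [1, -1, 1, -1, 1, -1, 1, -1]
noise_b = [1, 1, 1, 1, -1, -1, -1, -1]
protein_a = [s + e for s, e in zip(shared, noise_a)]
protein_b = [s + e for s, e in zip(shared, noise_b)]

print(pearson(protein_a, protein_b))                      # marginal correlation
print(partial_correlation(protein_a, protein_b, shared))  # after conditioning
```

In this toy setting the two proteins are clearly correlated, yet their partial correlation is essentially zero once the shared factor is accounted for, so a GGM would draw no direct edge between them. This removal of indirect links is what makes such networks easier to interpret than plain correlation networks.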

One illustrative example involves the pleiotropic ABO locus. In GWAS, this locus has been associated with numerous disease outcomes, including coronary artery disease¹⁰³ and venous thromboembolism¹⁰⁴. This locus includes at least three independent, widely replicated genetic signals that influence the plasma abundance of multiple proteins in trans^{8,9,10,14,91,105}. Two of these signals tag the blood group B and O alleles, and the third lies upstream of the ABO gene. A network based around these trans proteins, assembled from literature findings, experimental protein–protein interactions and the partial correlations between their abundances, indicates that these proteins play a joint role in angiogenesis and that this angiogenic function is modulated by variation at the three ABO signals⁸. Further analyses of such network models are required to establish the functional relevance of observed links between GGM network nodes, possibly using statistical approaches to objectively define functional modules in such networks¹⁰⁶.

Analytical considerations

As with all high-throughput omics technologies, proteomic data are subject to multiple potential sources of error and bias, which the user must consider to avoid misinterpretation or overinterpretation of the results (Table 1).

Epitope effects

One challenge in interpreting cis-pQTL associations from affinity assay-based GWAS is the possibility that a genetic variant in the protein-coding sequence modifies the binding epitope of the protein that is recognized by an assay’s antibody or aptamer, in the absence of other biological consequences¹⁴. Such an epitope effect can be introduced by a non-synonymous mutation within the target epitope. Alternatively, it may reflect larger-scale changes in the protein structure arising from variants (including frameshift insertions or deletions) that modify the protein’s overall structure or folding, or its potential for post-translational modification. Linkage disequilibrium (that is, correlation) between such a variant influencing epitope binding and otherwise entirely unrelated variants within adjacent non-coding sequence can result in a pattern of association that is indistinguishable from a genuine cis-pQTL.

There are several ways to establish whether epitope effects could be driving an observed cis-pQTL signal^8,9,10,14. Evidence that the gene implicated in a cis-pQTL contains a coding variant (which may require interrogation of imputed or sequenced whole-genome sequence data) raises this as a concern: the coding variant needs to be evaluated for its impact on protein sequence and structure. Experimentally, one could produce both protein variants and quantify differential recognition by directly comparing assay performance. The presence of a colocalizing eQTL can be reassuring, serving as a pointer that the observed differences in protein abundance are likely to result from altered protein expression rather than differential recognition. However, this is not fail-safe: the eQTL could itself be an artefact if the coding variant leads to allelic bias in the mapping of RNA sequencing reads. Where pQTLs are the consequence of SNP-associated allele-specific gene expression, RNA sequencing can be used to provide evidence that only one protein isoform is expressed. Replication of a pQTL on multiple platforms, or when a range of binders and assays are used, provides confidence that the association reflects changes in protein levels rather than binding affinity. pQTLs that display exceptionally large effect sizes and that show no colocalization with other GWAS phenotypes are more likely to represent epitope effects, as large differences in protein levels might be expected to have measurable biological impact.

Trans-acting epitope effects are also possible. This can happen when a coding variant at a pQTL signal influences the properties of a protein-modifying enzyme or an interaction partner in cis: this can affect the epitopes of proteins modified by this enzyme or those which interact in a genotype-dependent manner with the binding partner⁸. In some relevant loci, binders have been selected explicitly to bind different epitopes of the same target: the use of multiple SomaScan aptamers to target the products of the various APOE alleles is one such example⁸.

Protein-altering variants can also influence MS-derived data¹⁰⁷. If a SNP introduces a new protein cleavage site (or eliminates an expected one), the peptides generated by enzymatic digestion may differ from those listed in the canonical reference peptide library. A possible workaround for capturing these instances is to apply a ‘proteogenomics’ approach¹⁰⁸, in which peptide variants are imputed from the genetic data and then used to build a variant-aware peptide reference¹⁰⁹.

Taken together, these findings make clear that epitope effects represent a challenge to the interpretation of pQTL studies and require careful annotation of the assays used. However, these effects also constitute opportunities for further analyses of biological variation in protein sequences that has no direct impact on biological function (until shown otherwise). It is important to bear in mind that pQTLs identified with affinity assays ultimately represent genetic differences in the amount of affinity-captured protein, which is then interpreted as representing protein abundance.

Binding specificity and cross-reactivity

Affinity-based assays rely on the correct identification of their intended targets, but cross-reactivity and lack of specificity have been a concern⁸¹. The selectivity and suitability of affinity reagents are context and technology dependent and hence must be evaluated for each sample type and application⁷¹. For example, a SomaScan-based proteome analysis of blood from young and old parabiotic mice reported an age-related reduction in circulating levels of growth/differentiation factor 11 (GDF11), raising the possibility that restoring normal levels of GDF11 could reverse ageing. However, it was subsequently established that the assay used to identify and isolate GDF11 had low specificity¹¹⁰. Cross-reactivity to similar epitopes, such as those encoded by paralogous genes, or domains shared across different proteins may confound an association signal¹⁰. Cross-target binding may also occur when binders designed to target non-human proteins are included in the panel, such as viral or bacterial proteins on the SomaScan platform³. In the absence of the intended target, pQTLs identified with such binders may correspond to genuine genetic signals of another, yet unidentified target protein. Finally, genuine errors in binder selection, for instance due to misidentification of the intended target protein, constitute another potential source of error.

Binding specificity has been addressed by several studies^8,9,10,14. For instance, Sun et al.¹⁰ assessed potential off-target cross-reactivity for 920 aptamers and found that 14% showed comparable binding with a homologous protein, nearly half of which were alternative forms of the same protein. Emilsson et al.⁹ provided evidence for target specificity for 773 affinity binders for a panel of 4,137 aptamers by using affinity pull-down followed by MS-based target identification, although only some of the experiments were conducted with blood serum or plasma. Recently, SomaLogic conducted a systematic analysis of the reliability and specificity of SOMAmer protein-affinity reagents in the SomaScan assay. They found that out of 1,612 tested SOMAmer reagents, 73% did not bind to any related proteins, 14% bound to related proteins with at least tenfold weaker affinity and 13% bound to other unrelated proteins with similar affinity. They further confirmed specific target enrichment by pull-downs from human plasma for 123 of the SOMAmer reagents¹¹¹. A recent article from SomaLogic also provides updated information on MS-target confirmation and characterization of possible cross-binding and off-target effects for many of the proteins on its in-house platform⁷⁶.

The use of community standards for validation⁸⁰ and open access to validation data will help to drive improvements: it should be possible to understand, for any given binder, not only that it captures the intended target but also which other circulating proteins can be enriched. Sharing raw data from multiplatform studies may allow independent evaluation of platform performance⁸⁹. The integration of such information into public protein databases with cross-referencing identifiers to affinity proteomics platforms should increase the accuracy of pQTL inference.

Clinical and biomedical applications

The characterization of pQTLs drives various biomedical and pharmaceutical applications. pQTLs provide intermediate phenotypes to interpret the findings of disease GWAS, clues to genes that are causal for disease biology, opportunities to discover clinical biomarkers, matches between existing drugs and new disease indications, pointers to potential safety concerns for drugs in development, insights into protein–protein interaction networks and much more (Box 2). Here, we discuss some of the key applications in further detail.

Box 2 Examples for key pharmaceutical applications of pQTLs

Providing confidence in disease mechanisms

Genome-wide association studies (GWAS) with disease end points are increasingly powerful for identifying genes and gene variants that modify disease risk. Colocalized genetic associations with circulating protein levels as intermediate phenotypic traits can help to rationalize potential disease pathways. Combined with an evaluation of causality between protein levels and disease risk using Mendelian randomization, protein quantitative trait loci (pQTLs) can be used in support of drug target validation and provide evidence to explain the mode of action of a drug. For instance, in a pQTL study with selected proteins, Yao et al.¹³ identified pQTLs for cystatin C and PON1 that overlapped with cardiovascular disease risk loci. Mendelian randomization and prospective associations of cystatin C and PON1 levels with cardiovascular disease suggested that these proteins are potential targets for cardiovascular disease prevention and treatment.

Identification of causal genes in disease GWAS

A major obstacle in translating genetic disease associations to clinical application is to identify the causal gene within a range of potential candidate genes. Although colocalized expression QTLs (eQTLs) are increasingly used to prioritize candidates, pQTLs provide independent and often complementary evidence for the identification of causal genes. For instance, a GWAS of inflammatory bowel disease identified a risk variant in the intergenic region near two genes, PRDM1 and ATG5. However, the causal gene could not be unequivocally resolved, as both genes were plausible candidates. A colocalized cis-pQTL with BLIMP1, which is the gene product of PRDM1, reported in ref.¹⁰ provided support for PRDM1 as the causal gene, thereby increasing its priority as a potential therapeutic target.

Genotype-dependent variation in clinical biomarker proteins

Protein readouts are increasingly used as diagnostic biomarkers, to guide therapy decisions and to evaluate the efficacy of treatment in clinical trials. However, genetic variance that confounds these biomarker readouts can weaken the reliability and statistical power of such tests. Ten years ago, Anderson¹¹³ identified 109 US Food and Drug Administration (FDA)-cleared or FDA-approved protein analytes assayed in serum or plasma. For the purpose of this Review, we updated this list, which now includes 199 unique analytes, and asked which of them have a pQTL (Supplementary Table 1). We found that about one-third of these FDA-cleared or FDA-approved protein analytes have a cis-coding pQTL (Table 3). Accounting for this genetic variance could potentially increase the statistical power of these biomarkers, especially in situations where they are used as outcomes in clinical trials.

Matching existing drugs with new disease indications

When pQTLs of proteins that are drug targets colocalize with GWAS hits for disease outcomes that are not the current indication for that drug, this can suggest possible new uses for that drug. For instance, Sun et al.¹⁰ identified a cis-pQTL associated with both higher GP1BA abundance and higher platelet count, and a trans-pQTL for GP1BA that colocalized with associations with platelet count, myocardial infarction and stroke. This observation suggests that GP1BA influences vascular risk via platelets. Drugs targeting GP1BA, a receptor for von Willebrand factor, are currently under development as antithrombotic agents and for the treatment of thrombotic thrombocytopenic purpura. On the basis of their findings, Sun et al.¹⁰ argue that GP1BA could also be targeted in conditions characterized by platelet aggregation, such as arterial thrombosis.

Identification of potential safety concerns for drugs under development

Genetic associations with protein levels can provide instruments to establish causal relationships between protein levels and adverse outcomes. This knowledge can then be used to predict the potential effect of a drug targeting that protein in a given medical condition, thereby making clinical trials safer. For instance, MMP12 inhibitors are under development for treatment of chronic obstructive pulmonary disease. Observational data show an association between high levels of plasma MMP12 and recurrent cardiovascular disease, which suggests that MMP12 inhibitors might have a role in the treatment or prevention of cardiovascular disease. However, Mendelian randomization analysis instead shows that genetic variants associated with higher MMP12 levels are associated with decreased risks of coronary artery disease and large artery atherosclerotic stroke, discouraging potential clinical trials of MMP12 inhibitors¹⁰.

Genetic variance in clinical biomarker proteins

The measurement of protein levels in accessible biofluids (for example, plasma, urine or cerebrospinal fluid) is one of the mainstays of clinical medicine, providing many biomarkers with diagnostic or prognostic value. The ability to generate robust pQTL associations at scale could not only provide mechanistic insights into disease biology that prioritize targets for therapeutic development but could also promote the discovery of novel clinical biomarkers. Some of these biomarkers will support more accurate diagnosis of disease, whereas others will increase the detection of drug side effects, stratify disease risk or act as surrogates of developing disease that are valuable in clinical trials.

Many frequently used clinical tests are blood based; nearly half involve the measurement of proteins⁵⁴. Historically, these tests have involved measurement of the abundance of single proteins, but multiprotein biomarkers are emerging, one example being the profiling of cardiovascular risk in patients with coronary heart disease¹¹². The current set of clinically used proteins represents only a fraction of the circulating proteome, suggesting considerable opportunities to develop improved protein signatures for monitoring human health and disease states⁷⁶. Ten years ago, Anderson¹¹³ identified 109 US Food and Drug Administration (FDA)-approved protein analytes that can be assayed in serum or plasma. An update of this list (Supplementary Table 1) now includes 199 unique analytes. Among the subset of these FDA-approved protein analytes for which assays were available in at least one of the three recent large-scale SomaScan studies^8,9,10, almost one-third have pQTLs with an effect size large enough to be detected (Table 3). This observation, which is in agreement with recent reports of the impact of ancestry on protein biomarker levels¹¹⁴, indicates that some of the observed differences in disease prevalence between ethnicities may have a genetic basis reflected in blood protein abundances and that reference intervals for those biomarkers should be tailored to ancestry.

Table 3 Established blood-based protein markers with cis-pQTLs

Interpretation of the findings of disease GWAS

One of the key motivations behind large-scale pQTL analyses is to support efforts to relate protein abundance levels to the growing inventory of disease-associated genetic variants and thereby to accelerate the identification of potentially translatable biomarkers. As the scale and scope of pQTL studies have expanded, so too has the list of pQTLs coincident with disease risk variants^8,9,10. In recent SomaScan articles^8,9, between 11.5% and 20.7% of cis-pQTLs were in high linkage disequilibrium (r² ≥ 0.8) with sentinel disease-associated variants. Webservers such as PhenoScanner¹¹⁵ and SNiPA¹¹⁶ integrate GWAS hits from multiple sources and can be used to identify and interpret such overlapping signals. By addition of data from large-scale studies of RNA expression, such as those provided by GTEx, it becomes possible to construct causal chains leading from DNA to RNA and on to protein and disease². Tissue-specific protein expression data gathered from the HPA³⁰ can further illuminate these pathways. One important caveat is that the rapidly growing numbers of GWAS associations and pQTLs mean that some of the apparent overlap in genomic locations is merely the consequence of two distinct signals — driven by entirely different sequence variants — which happen to map to the same stretch of DNA but which have no mechanistic relationship. It is therefore essential to confirm that coincident signals are driven by the same genetic variants (that is, that they colocalize) before assuming any biological connection¹¹⁷.

Polygenic scores

One powerful strategy for biomarker discovery is emerging from recent advances in the use of polygenic risk scores to stratify disease risk across populations¹¹⁸. These scores aggregate the information on individual genetic predisposition for a given disease across many thousands of risk variants. For many common conditions, these scores can identify individuals who, on the basis of their patterns of shared common sequence variation, are at substantially increased (or decreased) future risk of disease¹¹⁸. Proteomic analyses conducted in individuals from these extremes of disease risk, ideally using plasma samples collected from healthy individuals many years ahead of clinical disease onset, offer a route to accelerate the discovery of prognostic biomarkers. It is possible to directly evaluate relationships between disease polygenic risk scores and protein expression levels in cross-sectional cohorts using a Mendelian randomization framework¹¹⁹. As an example of the power of such an approach, Mosley et al.¹²⁰ intersected coronary artery disease risk score data with an aptamer-based pQTL study of 759 individuals, computing a ‘virtual proteome’ for each individual, which was then evaluated for its association with clinical end points in a much larger cohort.
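The aggregation step at the heart of a polygenic score is a weighted sum of risk-allele counts. The sketch below shows only that step, under stated assumptions: the variant identifiers and weights are hypothetical, and variants missing from an individual's genotype data are treated as contributing zero, which is one of several possible conventions.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: dict mapping variant ID to the number of effect alleles carried
             (0, 1 or 2, or a fractional imputed dosage).
    weights: dict mapping variant ID to its per-allele effect size, typically
             taken from GWAS summary statistics.
    Variants absent from `dosages` contribute zero (an assumed convention).
    """
    return sum(weight * dosages.get(variant, 0.0)
               for variant, weight in weights.items())

# Hypothetical weights and genotypes.
weights = {"rsA": 0.2, "rsB": -0.1, "rsC": 0.05}
individual = {"rsA": 2, "rsB": 1}
print(polygenic_score(individual, weights))
```

Scores computed in this way for every member of a cohort can then be ranked to select the individuals at the extremes of predicted risk for proteomic profiling.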

To the extent that these biomarkers capture processes fundamental to disease risk, they have the potential to profile individual risk irrespective of its basis — genetic or otherwise — in the same way that cholesterol levels synthesize both genetic and lifestyle-related contributions to the risk of coronary disease. There are opportunities to extend these approaches beyond measures of overall disease risk to characterize biomarkers that capture the relative contribution of multiple disease-associated processes to disease progression in a given individual. This information can be parlayed into improved tools for prognostication and therapeutic optimization. In the case of type 2 diabetes, for example, it is possible to construct a set of ‘partitioned’ risk scores, each made up of variants that influence one of several pathways contributing to disease risk (such as obesity, fat distribution, insulin resistance and insulin secretion)¹²¹ and then to use them in the search for process-specific biomarkers.

Inferring causality

Although pQTL and eQTL analyses can generate important hypotheses linking a disease risk variant to a plausible downstream effector, these inferences are correlative rather than causal. Even in the ideal setting, whereby a disease risk variant is shown to colocalize with a pQTL for a protein with a biologically plausible link to disease, there is no formal proof that the protein lies on the causal pathway to disease. Instead, variation in the circulating levels of that protein might be just one of several molecular events attributable to the risk variant, only one of which is causally linked to disease development. Alternatively, variation in the biomarker may be secondary to, rather than causal for, the disease process (‘reverse causation’). Such pQTLs may still point to useful biomarkers of disease predisposition or severity, but interventions that restore the biomarker to normal levels will not necessarily be disease modifying. In the case of coronary artery disease, for example, the evidence indicates that triglycerides and LDL cholesterol play a causal role, but the levels of HDL cholesterol and C-reactive protein do not.

Proof of a causal connection ultimately relies on demonstrating, in a suitable system and using a disease-relevant readout, that perturbation of the expression or function of the protein of interest has a material effect on the development of disease. Fortunately, there are a host of approaches for examining the consequences of protein perturbation, ranging from cellular screens, through manipulation in animal models, to the detection, in isolated populations or those with high levels of consanguinity, of individuals who have inherited genotypes that result in extreme (high or low) levels of the protein. Here too, the integration of genetic and proteomic data can be extremely valuable. Finding that coding variants in the gene encoding a protein of interest are causally implicated in disease provides reassurance that a regulatory pQTL involving the same gene is also likely to be causal. Similarly, evidence from Mendelian randomization studies that a given protein is the target of several distinct cis-pQTLs and trans-pQTLs, each of which also colocalizes with a disease GWAS signal, represents consistent evidence that the protein is mediating disease risk rather than simply reflecting the activity of another causal process with which it happens to share some regulatory overlap¹²².

Mendelian randomization analyses are central to many of these approaches, providing what can be considered the genetic equivalent of a randomized clinical trial. For example, large-scale SomaScan analyses have implicated several protein pathways as causal for disease risk, including a protective role for prostate secretory protein of 94 amino acids (PSP94) in prostate cancer, and associations between raised IL1RL2 and IL18R1 levels and atopic dermatitis risk, and between higher MMP12 levels and decreased risk of coronary disease and stroke⁹. Yao and colleagues¹³ used pQTL data to identify six proteins likely to be causal for coronary heart disease, two of which, cystatin C and PON1, were also associated prospectively with death from new-onset coronary heart disease or cardiovascular disease in the Framingham Heart Study. Chong and colleagues¹²³ conducted a systematic Mendelian randomization meta-analysis of the circulating proteome based on pQTL data from more than 20,000 individuals^8,9,10,70,73 and disease GWAS data from up to 400,000 participants. This study revealed seven causal mediators for ischaemic stroke, including a protective role for SCARA5 and TNFSF12, highlighting important roles for stroke protein biomarkers in cardiovascular and non-cardiovascular diseases and singling out LPA, F11 and SCARA5 as particularly attractive drug targets. Many recent association studies with intermediate proteomics traits include Mendelian randomization analyses^11,74,87,124, and new methods are emerging to cope with the inherent challenges of a multi-omics Mendelian randomization approach, such as pleiotropy between the traits.
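
In its simplest summary-statistic form, a Mendelian randomization estimate is often obtained as a Wald ratio per genetic instrument, with several instruments combined by inverse-variance weighting. The sketch below is illustrative only and is not the method of any study cited here. It assumes independent pQTL instruments, effect sizes on compatible scales, negligible uncertainty in the variant-protein effects and no horizontal pleiotropy. The function names and numbers are hypothetical.

```python
from math import sqrt


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Causal effect estimate from one instrument (first-order standard error)."""
    if beta_exposure == 0:
        raise ValueError("instrument has no effect on the exposure")
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def ivw_estimate(instruments):
    """Fixed-effect inverse-variance-weighted combination of Wald ratios.

    instruments: list of (beta_exposure, beta_outcome, se_outcome) tuples,
    one per independent variant.
    """
    ratios = [wald_ratio(*inst) for inst in instruments]
    weights = [1.0 / se ** 2 for _, se in ratios]
    total = sum(weights)
    estimate = sum(w * est for w, (est, _) in zip(weights, ratios)) / total
    return estimate, sqrt(1.0 / total)


# Hypothetical instruments: (effect on protein level, effect on disease, SE of the latter)
instruments = [
    (0.5, -0.10, 0.02),
    (0.2, -0.04, 0.02),
    (0.4, -0.08, 0.04),
]

effect, effect_se = ivw_estimate(instruments)
```

A negative combined estimate in this toy setting would correspond to higher genetically predicted protein levels being associated with lower disease risk. Agreement across several independent instruments is what lends the inference weight, and pleiotropy-robust methods would be needed to test the assumptions.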

Conclusions and perspectives

Genetic association studies using intermediate phenotypes represent analyses of experiments conducted by nature, able to provide a valuable resource of biological information. Proteomic data at scale have only recently become accessible to large-scale GWAS approaches, largely due to analytical limitations that are specific to the composition of the plasma proteome. Advances in affinity proteomics technologies are rapidly remedying this shortfall, with recent studies reporting the parallel measurement of 5,000 proteins in almost 17,000 blood samples⁷⁶, and many substantially larger studies are under way⁷⁴. Currently, between 10% and 20% of all pQTLs discovered have been found to colocalize with clinical GWAS loci. This overlap can be expected to increase as the scale and scope of both pQTL studies and clinical GWAS continue to increase. This will add to the value of plasma proteomic data, especially for the biomedical and pharmaceutical fields, by providing more and better instruments for drug target validation⁷ and causal inference⁹⁶. Crucially, large-scale pQTL studies in large healthy cohorts complement the growing use of plasma proteomic data in the context of case–control datasets and in clinical trials, improving inference in study designs that are subject to reverse causation and confounding.

Although affinity assays lead the way in pQTL mapping, it is wise to remain aware of their limitations. Associations of interest should be supported by independent validation of target specificity and should include consideration of possible cross-reactivity and epitope effects. As none of the current methods is the optimal or sole solution, validation across multiple technologies will be key, with the most important findings further supported by cellular and other functional studies. It remains to be shown how feasible MS-based approaches will be for large-scale GWAS projects. Although low sample throughput and low analytical sensitivity currently limit the use of MS-based techniques in population studies, MS has the potential to identify a broader range of peptides and proteins, as well as a variety of post-translational modifications and splice forms. GWAS of IgG glycosylation¹²⁵ and total N-glycans¹²⁶ have already reported biologically relevant associations, most of them with proteins that are involved in protein glycosylation. Other approaches, known as ‘glycoproteogenomics’, seek to understand the differential glycosylation of plasma protein biomarkers¹²⁷. Future efforts should consequently also attempt to capture the different modification characteristics of proteins by high-throughput and quantitative technologies and possibly determine tissue-specific isoforms that are of relevance for these traits. In parallel, the prospect of combining data from large-scale proteomics with molecular readouts, such as DNA methylation^87,124,128,129, metabolomics¹³⁰ and glycomics¹³¹, opens new avenues for studying human health at a molecular level in population-scale studies.

Recent advances in the scale and scope with which it is possible to survey the proteomic content of plasma and other biofluids are allowing proteomics to take its place alongside the comprehensive characterization possible for other omics approaches, such as those focused on genetic variation and RNA expression. These advances open new opportunities to use proteomics to deliver improved understanding of the mechanistic basis of disease^132,133 and to promote novel translational strategies through target and biomarker identification^74,76.

References

MacArthur, J. et al. The new NHGRI-EBI catalog of published genome-wide association studies (GWAS catalog). Nucleic Acids Res. 45, D896–D901 (2017).
Lonsdale, J. et al. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Suhre, K. et al. Human metabolic individuality in biomedical and pharmaceutical research. Nature 477, 54–60 (2011).
Kastenmuller, G., Raffler, J., Gieger, C. & Suhre, K. Genetics of human metabolism: an update. Hum. Mol. Genet. 24, R93–R101 (2015).
Anderson, N. L. & Anderson, N. G. The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–867 (2002).
Melzer, D. et al. A genome-wide association study identifies protein quantitative trait loci (pQTLs). PLoS Genet. 4, e1000072 (2008).
Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).
Suhre, K. et al. Connecting genetic risk to disease end points through the human blood plasma proteome. Nat. Commun. 8, 14357 (2017). This is one of the first GWAS using the SomaScan platform for 1,100 proteins.
Emilsson, V. et al. Co-regulatory networks of human serum proteins link genetics to disease. Science 361, 769–773 (2018). This is currently the largest GWAS using the updated SomaScan platform for 4,000 proteins and 4,000 samples.
Sun, B. B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018). This is a recent GWAS using the SomaScan platform with 3,000 proteins on 3,000 samples.
Benson, M. D. et al. Genetic architecture of the cardiovascular risk proteome. Circulation 137, 1158–1172 (2018).
Zhernakova, D. V. et al. Individual variations in cardiovascular-disease-related protein levels are driven by genetics and gut microbiome. Nat. Genet. 50, 1524–1532 (2018).
Yao, C. et al. Genome-wide mapping of plasma protein QTLs identifies putatively causal genes and pathways for cardiovascular disease. Nat. Commun. 9, 3268 (2018).
Enroth, S., Johansson, A., Enroth, S. B. & Gyllensten, U. Strong effects of genetic and lifestyle factors on biomarker variation and use of personalized cutoffs. Nat. Commun. 5, 4684 (2014). This is an early GWAS using the Olink platform; the study highlights the potential impact of epitope effects on protein readouts.
Lourdusamy, A. et al. Identification of cis-regulatory variation influencing protein abundance levels in human plasma. Hum. Mol. Genet. 21, 3719–26 (2012).
Sasayama, D. et al. Genome-wide quantitative trait loci mapping of the human cerebrospinal fluid proteome. Hum. Mol. Genet. 26, 44–51 (2017).
Sun, W. et al. Common genetic polymorphisms influence blood biomarker measurements in COPD. PLoS Genet. 12, e1006011 (2016).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). This study highlights the potential of large biobanks.
German National Cohort (GNC) Consortium. The German National Cohort: aims, study design and organization. Eur. J. Epidemiol. 29, 371–82 (2014).
Precision Medicine Initiative (PMI) Working Group Report to the Advisory Committee to the Director, NIH. The Precision Medicine Initiative Cohort Program – Building a Research Foundation for 21st Century Medicine (National Institutes of Health, 2015).
Chen, Z. et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).
Omenn, G. S. et al. Progress on identifying and characterizing the human proteome: 2018 metrics from the HUPO Human Proteome Project. J. Proteome Res. 17, 4031–4041 (2018).
Baker, M. S. et al. Accelerating the search for the missing proteins in the human proteome. Nat. Commun. 8, 14271 (2017).
Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
Stoevesandt, O. & Taussig, M. J. Affinity proteomics: the role of specific binding reagents in human proteome analysis. Expert Rev. Proteom. 9, 401–14 (2012).
Smith, J. G. & Gerszten, R. E. Emerging affinity-based proteomic technologies for large-scale plasma profiling in cardiovascular disease. Circulation 135, 1651–1664 (2017).
Timp, W. & Timp, G. Beyond mass spectrometry, the next step in proteomics. Sci. Adv. 6, eaax8978 (2020).
Kim, M. S. et al. A draft map of the human proteome. Nature 509, 575–81 (2014).
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Uhlen, M. et al. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Uhlen, M. et al. A genome-wide transcriptomic analysis of protein-coding genes in human blood cells. Science 366, eaax9198 (2019).
Thul, P. J. et al. A subcellular map of the human proteome. Science 356, eaal3321 (2017).
Schwenk, J. M. et al. The human plasma proteome draft of 2017: building on the Human Plasma PeptideAtlas from mass spectrometry and complementary assays. J. Proteome Res. 16, 4299–4310 (2017). This article reviews recent advances in plasma proteomics and uses data from the community to summarize the circulating proteins detected by MS.
Pernemalm, M. et al. In-depth human plasma proteome analysis captures tissue proteins and transfer of protein variants across the placenta. Elife 8, e41608 (2019).
Uhlen, M. et al. The human secretome. Sci. Signal. 12, eaaz0274 (2019). This article reviews the actively secreted proteins of the human proteome for their destination and reveals that only approximately 730 proteins are secreted into the circulation.
Geyer, P. E. et al. Plasma proteome profiling to detect and avoid sample-related biases in biomarker studies. EMBO Mol. Med. 11, e10427 (2019).
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–55 (2016).
Marx, V. A dream of single-cell proteomics. Nat. Methods 16, 809–812 (2019).
Aebersold, R. et al. How many human proteoforms are there? Nat. Chem. Biol. 14, 206–214 (2018).
Theodoratou, E. et al. The role of glycosylation in IBD. Nat. Rev. Gastroenterol. Hepatol. 11, 588–600 (2014).
Ignjatovic, V. et al. Mass spectrometry-based plasma proteomics: considerations from sample collection to achieving translational data. J. Proteome Res. 18, 4085–4097 (2019).
Enroth, S., Hallmans, G., Grankvist, K. & Gyllensten, U. Effects of long-term storage time and original sampling month on biobank plasma protein concentrations. EBioMedicine 12, 309–314 (2016).
Kofanova, O. et al. IL8 and IL16 levels indicate serum and plasma quality. Clin. Chem. Lab. Med. 56, 1054–1062 (2018).
Qundos, U. et al. Profiling post-centrifugation delay of serum and plasma with antibody bead arrays. J. Proteom. 95, 46–54 (2013).
Daniels, J. R. et al. Stability of the human plasma proteome to pre-analytical variability as assessed by an aptamer-based approach. J. Proteome Res. 18, 3661–3670 (2019).
Kim, C. H. et al. Stability and reproducibility of proteomic profiles measured with an aptamer-based platform. Sci. Rep. 8, 8382 (2018).
Shen, Q. et al. Strong impact on plasma protein profiles by precentrifugation delay but not by repeated freeze-thaw cycles, as analyzed using multiplex proximity extension assays. Clin. Chem. Lab. Med. 56, 582–594 (2018).
Di Girolamo, F., Alessandroni, J., Somma, P. & Guadagni, F. Pre-analytical operating procedures for serum low molecular weight protein profiling. J. Proteom. 73, 667–77 (2010).
Zimmerman, L. J., Li, M., Yarbrough, W. G., Slebos, R. J. & Liebler, D. C. Global stability of plasma proteomes for mass spectrometry-based analyses. Mol. Cell. Proteomics 11, M111.014340 (2012).
Shen, Y. et al. Characterization of the human blood plasma proteome. Proteomics 5, 4034–45 (2005).
Abbatiello, S. E. et al. Large-scale interlaboratory study to develop, analytically validate and apply highly multiplexed, quantitative peptide assays to measure cancer-relevant proteins in plasma. Mol. Cell. Proteomics 14, 2357–74 (2015).
Harney, D. J. et al. Small-protein enrichment assay enables the rapid, unbiased analysis of over 100 low abundance factors from human plasma. Mol. Cell. Proteomics 18, 1899–1915 (2019).
Johansson, A. et al. Identification of genetic variants influencing the human plasma proteome. Proc. Natl Acad. Sci. USA 110, 4673–8 (2013).
Geyer, P. E., Holdt, L. M., Teupser, D. & Mann, M. Revisiting biomarker discovery by plasma proteomics. Mol. Syst. Biol. 13, 942 (2017).
Keshishian, H. et al. Multiplexed, quantitative workflow for sensitive biomarker discovery in plasma yields novel candidates for early myocardial injury. Mol. Cell. Proteomics 14, 2375–93 (2015).
Ludwig, C. et al. Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutorial. Mol. Syst. Biol. 14, e8126 (2018).
Doerr, A. Mass spectrometry-based targeted proteomics. Nat. Methods 10, 23 (2013).
Geyer, P. E. et al. Plasma proteome profiling to assess human health and disease. Cell Syst. 2, 185–95 (2016).
Geyer, P. E. et al. Proteomics reveals the effects of sustained weight loss on the human plasma proteome. Mol. Syst. Biol. 12, 901 (2016).
Liu, Y. et al. Quantitative variability of 342 plasma proteins in a human twin population. Mol. Syst. Biol. 11, 786 (2015).
Rosenberger, G. et al. Inference and quantification of peptidoforms in large sample cohorts by SWATH-MS. Nat. Biotechnol. 35, 781–788 (2017).
Bruderer, R. et al. Analysis of 1508 plasma samples by capillary-flow data-independent acquisition profiles proteomics of weight loss and maintenance. Mol. Cell. Proteomics 18, 1242–1254 (2019).
Addona, T. A. et al. Multi-site assessment of the precision and reproducibility of multiple reaction monitoring-based measurements of proteins in plasma. Nat. Biotechnol. 27, 633–41 (2009).
Percy, A. J. et al. Method and platform standardization in MRM-based quantitative plasma proteomics. J. Proteom. 95, 66–76 (2013).
Stoevesandt, O. & Taussig, M. J. Affinity reagent resources for human proteome detection: initiatives and perspectives. Proteomics 7, 2738–50 (2007).
Ekins, R. P. Multi-analyte immunoassay. J. Pharm. Biomed. Anal. 7, 155–68 (1989).
Ayoglu, B. et al. Systematic antibody and antigen-based proteomic profiling with microarrays. Expert Rev. Mol. Diagn. 11, 219–34 (2011).
Rissin, D. M. et al. Single-molecule enzyme-linked immunosorbent assay detects serum proteins at subfemtomolar concentrations. Nat. Biotechnol. 28, 595–9 (2010).
Fulton, R. J., McDade, R. L., Smith, P. L., Kienker, L. J. & Kettman, J. R. Jr. Advanced multiplexed analysis with the FlowMetrix system. Clin. Chem. 43, 1749–56 (1997).
Ahola-Olli, A. V. et al. Genome-wide association study identifies 27 loci influencing concentrations of circulating cytokines and growth factors. Am. J. Hum. Genet. 100, 40–50 (2017).
Fredolini, C. et al. Immunocapture strategies in translational proteomics. Expert Rev. Proteom. 13, 83–98 (2016).
Assarsson, E. et al. Homogenous 96-plex PEA immunoassay exhibiting high sensitivity, specificity, and excellent scalability. PLoS ONE 9, e95192 (2014).
Folkersen, L. et al. Mapping of 79 loci for 83 plasma protein biomarkers in cardiovascular disease. PLoS Genet. 13, e1006706 (2017).
Folkersen, L. et al. Genomic evaluation of circulating proteins for drug target characterisation and precision medicine. Preprint at bioRxiv https://doi.org/10.1101/2020.04.03.023804 (2020). This is currently one of the largest pQTL studies, with more than 21,000 samples on a 92-protein panel from the Olink platform.
Gold, L. et al. Aptamer-based multiplexed proteomic technology for biomarker discovery. PLoS ONE 5, e15004 (2010).
Williams, S. A. et al. Plasma protein patterns as comprehensive indicators of health. Nat. Med. 25, 1851–1857 (2019).
Lam, M. P. et al. Data-driven approach to determine popular proteins for targeted proteomics translation of six organ systems. J. Proteome Res. 15, 4126–4134 (2016).
Colwill, K. & Graslund, S. A roadmap to generate renewable protein binders to the human proteome. Nat. Methods 8, 551–8 (2011).
Baker, M. Reproducibility crisis: blame it on the antibodies. Nature 521, 274–6 (2015).
Uhlen, M. et al. A proposal for validation of antibodies. Nat. Methods 13, 823–7 (2016).
Fredolini, C. et al. Systematic assessment of antibody selectivity in plasma based on a resource of enrichment profiles. Sci. Rep. 9, 8324 (2019).
Edfors, F. et al. Enhanced validation of antibodies for research applications. Nat. Commun. 9, 4130 (2018).
Aulchenko, Y. S., Ripke, S., Isaacs, A. & van Duijn, C. M. GenABEL: an R library for genome-wide association analysis. Bioinformatics 23, 1294–6 (2007).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–75 (2007).
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–7 (2012).
Ruffieux, H., Davison, A. C., Hager, J. & Irincheeva, I. Efficient inference for genetic association studies with multiple outcomes. Biostatistics 18, 618–636 (2017).
Ahsan, M. et al. The relative contribution of DNA methylation and genetic variants on protein biomarkers for human diseases. PLoS Genet. 13, e1007005 (2017).
de Vries, P. S. et al. Whole-genome sequencing study of serum peptide levels: the Atherosclerosis Risk in Communities study. Hum. Mol. Genet. 26, 3442–3450 (2017).
Graumann, J. et al. Multi-platform affinity proteomics identify proteins linked to metastasis and immune suppression in ovarian cancer plasma. Front. Oncol. 9, 1150 (2019).
Billing, A. M. et al. Complementarity of SOMAscan to LC-MS/MS and RNA-seq for quantitative profiling of human embryonic and mesenchymal stem cells. J. Proteom. 150, 86–97 (2017).
Ruffieux, H. et al. A Bayesian joint pQTL study sheds light on the genetic architecture of obesity. Preprint at bioRxiv https://doi.org/10.1101/524405 (2019).
Freedman, M. L. et al. Principles for the post-GWAS functional characterization of cancer risk loci. Nat. Genet. 43, 513–8 (2011).
Gamazon, E. R. et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat. Genet. 50, 956–967 (2018).
Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet. 51, 592–599 (2019).
Nieuwenhuis, T. O. et al. Consistent RNA sequencing contamination in GTEx and other data sets. Nat. Commun. 11, 1933 (2020).
Zheng, J. et al. Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Preprint at bioRxiv https://doi.org/10.1101/627398 (2019).
Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. Elife 7, e34408 (2018).
Petersen, A. K. et al. On the hypothesis-free testing of metabolite ratios in genome-wide and metabolome-wide association studies. BMC Bioinformatics 13, 120 (2012).
Slenter, D. N. et al. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 46, D661–D667 (2018).
Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017).
Krumsiek, J., Suhre, K., Illig, T., Adamski, J. & Theis, F. J. Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Syst. Biol. 5, 21 (2011).
Shin, S. Y. et al. An atlas of genetic influences on human blood metabolites. Nat. Genet. 46, 543–550 (2014).
van der Harst, P. & Verweij, N. Identification of 64 novel genetic loci provides an expanded view on the genetic architecture of coronary artery disease. Circ. Res. 122, 433–443 (2018).
Klarin, D., Emdin, C. A., Natarajan, P., Conrad, M. F. & Kathiresan, S. Genetic analysis of venous thromboembolism in UK Biobank identifies the ZFPM2 locus and implicates obesity as a causal risk factor. Circ. Cardiovasc. Genet. 10, e001643 (2017).
Nath, A. P. et al. Multivariate genome-wide association analysis of a cytokine network reveals variants with widespread immune, haematological, and cardiometabolic pleiotropy. Am. J. Hum. Genet. 105, 1076–1090 (2019).
Do, K. T., Rasp, D. J. N., Kastenmuller, G., Suhre, K. & Krumsiek, J. MoDentify: phenotype-driven module identification in metabolomics networks at different resolutions. Bioinformatics 35, 532–534 (2019).
Zhang, B. et al. Proteogenomic characterization of human colon and rectal cancer. Nature 513, 382–7 (2014).
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–25 (2014).
Ting, Y. S. et al. PECAN: library-free peptide detection for data-independent acquisition tandem mass spectrometry data. Nat. Methods 14, 903–908 (2017).
Harper, S. C. et al. Is growth differentiation factor 11 a realistic therapeutic for aging-dependent muscle defects? Circ. Res. 118, 1143–50 (2016).
SomaLogic. Short Technical Note: Characterization of the Binding Specificity of SOMAmer Reagents in the SomaScan Assay (2019).
Ganz, P. et al. Development and validation of a protein-based risk score for cardiovascular outcomes among patients with stable coronary heart disease. JAMA 315, 2532–41 (2016).
Anderson, N. L. The clinical plasma proteome: a survey of clinical assays for proteins in plasma and serum. Clin. Chem. 56, 177–85 (2010). This is an early survey that lists the FDA-approved plasma biomarkers (an update of this list is provided in Supplementary Table 1).
Sjaarda, J. et al. Influence of genetic ancestry on human serum proteome. Am. J. Hum. Genet. 106, 303–314 (2020).
Staley, J. R. et al. PhenoScanner: a database of human genotype-phenotype associations. Bioinformatics 32, 3207–3209 (2016).
Arnold, M., Raffler, J., Pfeufer, A., Suhre, K. & Kastenmuller, G. SNiPA: an interactive, genetic variant-centered annotation browser. Bioinformatics 31, 1334–6 (2015).
He, X. et al. Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am. J. Hum. Genet. 92, 667–80 (2013).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Richardson, T. G., Harrison, S., Hemani, G. & Davey Smith, G. An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. Elife 8, e43657 (2019).
Mosley, J. D. et al. Probing the virtual proteome to identify novel disease biomarkers. Circulation 138, 2469–2481 (2018).
Udler, M. S. et al. Type 2 diabetes genetic loci informed by multi-trait associations point to disease mechanisms and subtypes: A soft clustering analysis. PLoS Med. 15, e1002654 (2018).
Plump, A. & Davey Smith, G. Identifying and validating new drug targets for stroke and beyond. Circulation 140, 831–835 (2019).
Chong, M. et al. Novel drug targets for ischemic stroke identified through mendelian randomization analysis of the blood proteome. Circulation 140, 819–830 (2019).
Hillary, R. F. et al. Genome and epigenome wide studies of neurological protein biomarkers in the Lothian Birth Cohort 1936. Nat. Commun. 10, 3160 (2019).
Shen, X. et al. Multivariate discovery and replication of five novel loci associated with immunoglobulin G N-glycosylation. Nat. Commun. 8, 447 (2017).
Sharapov, S. Z. et al. Defining the genetic control of human blood plasma N-glycome using genome-wide association study. Hum. Mol. Genet. 28, 2062–2077 (2019).
Lin, Y. H., Zhu, J., Meijer, S., Franc, V. & Heck, A. J. R. Glycoproteogenomics: a frequent gene polymorphism affects the glycosylation pattern of the human serum fetuin/alpha-2-HS-glycoprotein. Mol. Cell. Proteomics 18, 1479–1490 (2019).
Zaghlool, S. B. et al. Epigenetics meets proteomics in an epigenome-wide association study with circulating blood plasma protein traits. Nat. Commun. 11, 15 (2020).
Huan, T. et al. Genome-wide identification of DNA methylation QTLs in whole blood highlights pathways for cardiovascular disease. Nat. Commun. 10, 4267 (2019).
Zaghlool, S. B. et al. Deep molecular phenotypes link complex disorders and physiological insult to CpG methylation. Hum. Mol. Genet. 27, 1106–1121 (2018).
Suhre, K. et al. Fine-mapping of the human blood plasma N-glycome onto its proteome. Metabolites 9 (2019).
Gudmundsdottir, V. et al. Circulating protein signatures and causal candidates for type 2 diabetes. Diabetes https://doi.org/10.2337/db19-1070 (2020).
Lehallier, B. et al. Undulating changes in human plasma proteome profiles across the lifespan. Nat. Med. 25, 1843–1850 (2019).
Kim, S. et al. Influence of genetic variation on plasma protein levels in older adults using a multi-analyte panel. PLoS ONE 8, e70269 (2013).
Kauwe, J. S. et al. Genome-wide association study of CSF levels of 59 Alzheimer’s disease candidate proteins: significant associations with proteins involved in amyloid processing and inflammation. PLoS Genet. 10, e1004758 (2014).
Deming, Y. et al. Genetic studies of plasma analytes identify novel potential biomarkers for several complex traits. Sci. Rep. 6, 18092 (2016).
Solomon, T. et al. Associations between common and rare exonic genetic variants and serum levels of 20 cardiovascular-related proteins: the Tromso study. Circ. Cardiovasc. Genet. 9, 375–383 (2016).
Di Narzo, A. F. et al. High-throughput characterization of blood serum proteomics of IBD patients with respect to aging and genetic factors. PLoS Genet. 13, e1006565 (2017).
Carayol, J. et al. Protein quantitative trait locus study in obesity during weight-loss identifies a leptin regulator. Nat. Commun. 8, 2084 (2017).
Solomon, T. et al. Identification of common and rare genetic variation associated with plasma protein levels using whole-exome sequencing and mass spectrometry. Circ. Genom. Precis. Med. 11, e002170 (2018).
Sliz, E. et al. Genome-wide association study identifies seven novel loci associating with circulating cytokines and cell adhesion molecules in Finns. J. Med. Genet. 56, 607–616 (2019).
Gilly, A. et al. Whole genome sequencing analysis of the cardiometabolic proteome. Preprint at bioRxiv https://doi.org/10.1101/854752 (2020).
Orru, V. et al. Genetic variants regulating immune cell levels in health and disease. Cell 155, 242–256 (2013).
Patin, E. et al. Natural variation in the parameters of innate immune cells is preferentially driven by genetic factors. Nat. Immunol. 19, 302–314 (2018).

Acknowledgements

K.S. is supported by the Biomedical Research Program at Weill Cornell Medicine in Qatar, a programme funded by the Qatar Foundation. J.M.S. is supported by the KTH Center for Applied Precision Medicine funded by the Erling Persson Family Foundation and acknowledges the Knut and Alice Wallenberg Foundation for funding the Human Protein Atlas. J.M.S. and M.I.M. acknowledge the Innovative Medicines Initiative Joint Undertaking under grant agreement no. 115317 (DIRECT), the resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme and an EFPIA companies’ in kind contribution. The views expressed in this article are those of the authors and not necessarily those of the UK NHS, the UK NIHR, the UK Department of Health or the Qatar Foundation.

Author information

Authors and Affiliations

Department of Biophysics and Physiology, Weill Cornell Medicine-Qatar, Doha, Qatar
Karsten Suhre
Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital, University of Oxford, Oxford, UK
Mark I. McCarthy
Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
Mark I. McCarthy
Oxford NIHR Biomedical Research Centre, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, UK
Mark I. McCarthy
Genentech, South San Francisco, CA, USA
Mark I. McCarthy
Affinity Proteomics, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
Jochen M. Schwenk

Authors

Karsten Suhre
Mark I. McCarthy
Jochen M. Schwenk

Contributions

K.S. and J.M.S. researched data for the article. All authors contributed to the discussion of content, writing the article and reviewing/editing the manuscript before submission.

Corresponding authors

Correspondence to Karsten Suhre or Jochen M. Schwenk.

Ethics declarations

Competing interests

M.I.M. has served on advisory panels for Pfizer, NovoNordisk and Zoe Global, has received honoraria from Merck, Pfizer, NovoNordisk and Eli Lilly, has stock options in Zoe Global and has received research funding from AbbVie, AstraZeneca, Boehringer Ingelheim, Eli Lilly, Janssen, Merck, NovoNordisk, Pfizer, Roche, Sanofi Aventis, Servier and Takeda. As of June 2019, M.I.M. is an employee of Genentech and holds stock in Roche. K.S. and J.M.S. declare no competing interests.

Additional information

Peer review information

Nature Reviews Genetics thanks M. Altelaar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Glossary

Colocalization: Two genetic associations are said to be colocalized if the strengths of their statistical associations covary at a genetic locus, suggesting a shared genetic causal variant for the observed associations.
Protein QTLs: (pQTLs). A protein quantitative trait locus (pQTL) is an association of protein levels at a genetic locus; it is often represented by the strongest associating single-nucleotide polymorphism.
pQTL studies: Genome-wide association studies where the dependent variables are the levels of proteins measured using a proteomics approach. The identified loci that associate with protein levels are termed ‘protein quantitative trait loci’ (pQTLs).
Open reading frames: Portions of DNA that can be translated into protein and that are terminated by a stop codon.
Post-translational modifications: Biochemical modification of the primary peptide sequence, typically by covalent addition of a chemical group, such as for phosphorylation and glycosylation. Post-translational modifications can change the accessibility to a protein epitope and potentially influence the binding of affinity reagents.
Data-dependent acquisition: (DDA). A data acquisition mode used in mass spectrometry analysis in which only a selected set of peptides, those with the most intense peptide ions, is fragmented and analysed.
Data-independent acquisition: (DIA). A data acquisition mode used in mass spectrometry analysis in which all peptides detected within a particular window of the mass-to-charge ratio are fragmented and analysed.
Aptamers: Short single-stranded (and possibly modified) nucleotides that are selected from a synthetic library of sequences to recognize a specific target protein (for example, via structural elements) with high affinity.
cis-pQTLs: When a protein quantitative trait locus (pQTL) is at or near the genetic locus that encodes the associated protein; often an ad hoc distance cut-off is used to differentiate cis-pQTLs from trans-pQTLs. A cis-pQTL suggests a direct influence of a genetic variant at that locus on protein expression or turnover.
trans-pQTLs: When a protein quantitative trait locus (pQTL) is distant from the protein-coding gene or on another chromosome. A trans-pQTL indicates an indirect link between the genetic locus and protein expression or turnover.
Linkage disequilibrium: Two genetic loci are in linkage disequilibrium if their genotypes correlate within a population. Lack of recombination between loci results in them commonly being co-inherited as a haplotype.
Mendelian randomization: A method to estimate the unconfounded effect of an exposure (for example, protein level) on an outcome (for example, disease risk) using genetic variation.
Gaussian graphical models: (GGMs). Network representations of the partial correlations between a set of quantitative variables, here the protein levels. Partial correlations used in a protein GGM can be viewed as the amount of pairwise correlation between the levels of two proteins that remains when the contributions of all other proteins are accounted for.
Pleiotropic: A genetic locus is pleiotropic when one or more of its variants is associated with two or more seemingly unrelated phenotypic traits.
Epitope effect: An effect of an epitope-changing variant on the binding properties of affinity reagents with regard to their antigens. A difference in reported antigen recognition may be mistaken for a difference in protein abundance.
Polygenic risk scores: Combined risk scores derived from a weighted combination of genetic associations, possibly including millions of associations.
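
Three of the glossary terms above (cis-pQTLs versus trans-pQTLs, polygenic risk scores and the partial correlations underlying Gaussian graphical models) can be illustrated with a minimal Python sketch. It is not taken from the article: the function names, the 1 Mb distance cut-off and all input values are assumptions chosen for illustration only. The partial correlation shown conditions on a single third variable, whereas a full protein GGM would condition on all other measured proteins.

```python
from math import sqrt


def classify_pqtl(variant_chrom, variant_pos, gene_chrom, gene_start, gene_end,
                  window=1_000_000):
    """Return 'cis' if the variant lies within `window` bp of the gene, else 'trans'.

    The 1 Mb default is an assumed, ad hoc cut-off used here for illustration.
    """
    if variant_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= variant_pos <= gene_end + window:
        return "cis"
    return "trans"


def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1 or 2) for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y that remains after accounting for one variable z.

    Undefined (division by zero) if x or y is perfectly correlated with z.
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    # Hypothetical variant 0.5 Mb upstream of a hypothetical gene on chromosome 1.
    print(classify_pqtl("1", 1_500_000, "1", 2_000_000, 2_050_000))
    # Hypothetical dosages and per-variant weights for one individual.
    print(polygenic_score([0, 1, 2], [0.5, -0.25, 0.125]))
    # Toy protein levels: x and y correlate only through the shared driver z.
    z = [-1, -1, 1, 1]
    x = [-2, 0, 0, 2]
    y = [0, -2, 0, 2]
    print(pearson(x, y), partial_correlation(x, y, z))
```

In the toy data, the two protein levels are correlated only because both depend on the third variable, so the partial correlation falls to approximately zero while the plain correlation does not. This is the property that lets a GGM separate direct from indirect associations.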

About this article

Cite this article

Suhre, K., McCarthy, M.I. & Schwenk, J.M. Genetics meets proteomics: perspectives for large population-based studies. Nat Rev Genet 22, 19–37 (2021). https://doi.org/10.1038/s41576-020-0268-2

Accepted: 14 July 2020
Published: 28 August 2020
Issue Date: January 2021
DOI: https://doi.org/10.1038/s41576-020-0268-2
