Introduction

Genome-wide association studies (GWAS) for human diseases have robustly identified large numbers of risk-associated genetic variants implicated in disease susceptibility1. The colocalization of genetic associations for disease traits with those for intermediate molecular phenotypes, such as gene expression2 and metabolomics3, provides powerful evidence to advance hypotheses regarding the genes and pathways through which these disease-associated variants mediate their effects4. Proteins can appear in the blood circulation owing to active secretion or cellular leakage and thereby provide a window into the current state of human health5. Quantitative trait loci (QTLs) for circulating proteins can therefore complement these analyses: collectively, colocalized associations with multiple omic phenotypes offer a route to a comprehensive molecular interpretation of disease GWAS hits6. These analyses have the potential to expose disease-causing pathways, uncover new drug targets, highlight novel therapeutic indications and identify clinically relevant biomarkers7.

However, despite the central importance of proteins in disease pathogenesis, large-scale studies of protein QTLs (pQTLs) have only recently become feasible8,9,10,11,12,13. For most biomedical and clinical applications conducted at a population scale, whole blood and its cell-free derivatives are clinically accessible and minimally invasive specimens that are well suited to the assessment of human health and disease states. Studies of blood-based pQTLs have identified hundreds of associations between single-nucleotide polymorphisms (SNPs) and protein levels6,8,9,10,11,14,15,16,17. Many of these pQTLs colocalize with association signals for common human diseases. Nevertheless, the scale of the largest pQTL studies conducted to date remains limited to the examination of a few thousand proteins across a few thousand individuals. There remains considerable potential in scaling these studies up, in terms of both sample size and proteome coverage.

Beyond the value of large-scale proteomics data to genetic studies, collecting large-scale proteomics data in population studies provides opportunities to study non-genetic associations, making it possible to capture markers of lifestyle and environmental exposure, to stratify individuals according to their state of health (or disease) and to monitor the longitudinal progression of disease. The development of increasingly massive biobanks and population cohorts allows us to envisage deep molecular phenotyping performed across hundreds of thousands of individuals18,19,20,21. These opportunities inevitably raise questions regarding which technologies to prioritize, what outcomes to expect and whether it is worth the investment (given current costs) to characterize the proteomes of entire populations.

In this Review, we focus on large-scale proteomic technologies currently capable of profiling the blood circulating proteome in large population studies (Fig. 1). We start by reviewing the current state of the field and describe the proteomic composition of serum and plasma. Next, we turn to the experimental challenges related to measuring the circulating proteins, focusing on technologies that are applicable in high throughput to blood-derived samples collected from large population studies. Then we provide an overview of the pQTL studies conducted to date and discuss current challenges and opportunities faced by the field. Finally, we showcase selected applications of pQTL studies in the context of disease GWAS. We close with perspectives for future large-scale studies with proteomics and how the information they provide can be translated into clinical and biomedical applications.

Fig. 1: Key concepts of pQTL studies.
figure 1

This figure sketches the central steps and challenges encountered when one is conducting a genome-wide association study (GWAS) with proteomics in large population cohorts. a | The blood proteome is a composite of many proteins from different cells and tissues. Some are secreted into the bloodstream to perform some signalling or enzymatic actions, while others may leak from damaged cells and reflect processes occurring elsewhere in the body. b | Mass spectrometry and affinity (antibody-based or aptamer-based) proteomics are the key methods available to determine blood circulating protein levels (Box 1). Each method has its inherent advantages and challenges (Table 1). c | Many approaches used in the statistical analysis of protein quantitative trait locus (pQTL) studies are similar to those applied in other GWAS with high data dimensionality, such as metabolomics. Shown on the left is the minor-allele frequency (MAF), as well as the resultant percentages of homozygous and heterozygous genotypes involving the major (A) and minor (B) alleles. These genotypes are tested for statistical association with protein levels, as shown in the box-and-whisker plot below. As in expression QTL (eQTL) studies, cis and trans associations can be distinguished (middle panel), the latter suggesting possible functional links between the protein encoded at the pQTL locus and the trans-associated proteins (Fig. 2). The resultant pQTLs can be tested for whether they colocalize with disease-associated loci (right panel). d | Newly identified pQTLs should be investigated for potential artefacts, such as epitope effects, cross-reactivity and genotype-dependent binding specificity, cleavage site and peptide mass (see the main text). e | pQTLs have multiple biomedical applications. For instance, pQTLs can be subjected to Mendelian randomization (MR) analysis using proteins as exposures to identify proteins that are causal for disease outcomes (see also Box 2); support drug target selection and validation, as well as drug safety and repurposing; indicate genetic confounders of clinical protein biomarkers (see also Supplementary Table 1); and be integrated in larger systems biology networks to help interpret complex biological processes. The graphic at the top right illustrates the strategy for MR analyses. The graphic at the bottom right shows a hypothetical network where blue nodes are proteins, green nodes are single-nucleotide polymorphisms and red nodes are disease traits. Links between blue and green nodes represent pQTLs, links between red and green nodes represent a disease/trait GWAS hit and links between blue nodes represent correlations between protein levels. FDA, US Food and Drug Administration.

The human plasma proteome

The human proteome is defined by analysis and annotation of all potentially protein-coding genes in the human genome. There are approximately 20,000 genes in humans that provide the blueprint for all proteins to be expressed and processed at any given point in time and in any tissue or organ of the human body. However, translated proteins often undergo further post-translational processing: for example, particular amino acid residues can be modified to modulate their physiological properties through phosphorylation or glycosylation. Most proteins then execute their function in interaction with other partners, either locally in their cells and tissues of origin, or after being transported to act in some distant process. In the following section, we introduce the current efforts of the proteomics community with a focus on approaches and technologies to profile proteins in blood.

Experimental coverage of the human proteome

Critical for any proteomic studies is the definition of the set of proteins that one might expect to detect. As of today, the Human Proteome Project has collected mass spectrometry (MS) data that provide experimental evidence for almost 90% of the approximately 20,000 canonical proteins predicted from genomic open reading frames22. One of the current main efforts of the Human Proteome Project is to find experimental evidence for the predicted — but so far undetected — ‘missing proteins’ and to determine their function. One strategy involves expanding proteomic analyses to rare tissues and cell types under a range of conditions23.

MS remains the most commonly used technology24 in proteomics, alongside complementary approaches using affinity-based assays25,26 or recently emerging protein sequencing methods27. Large-scale MS-based studies have captured nearly 80% of the human proteins observed across several human tissues28,29 and provide access to browse data built from more than 250,000 peptides (see PeptideAtlas, Human Proteome Map and ProteomicsDB in Related links). In parallel, the Human Protein Atlas (HPA) is dedicated to mapping human protein expression across organs and tissues30, including blood31, and to mapping their subcellular localization32. The HPA uses antibodies and transcriptomics to annotate tissue-specific protein expression on a gene-centric level and currently hosts data from 26,000 antibodies targeting proteins from more than 17,000 protein-coding genes (approximately 87% of the human proteome). The HPA has demonstrated that half of all proteins are expressed in all tissues and that instances of tightly restricted protein expression are far less common, except for proteins involved in specialized processes such as spermatogenesis. In terms of their cellular localizations, the HPA found that the products of approximately 13,000 protein-coding genes reside inside cells and the products of approximately 5,500 protein-coding genes reside in membranes, while the products of approximately 2,600 protein-coding genes are secreted into the extracellular space30. Although the proteomic analysis of tissues or cells is less relevant to large-scale population studies, such analyses are central to efforts to understand the mechanisms involved in disease pathogenesis, as well as to put findings from population studies into context.

The plasma proteome

On collection, blood is often centrifuged to generate fractions containing blood cells and a cell-free fluid, which accommodates the circulating proteome. The fluid is called ‘serum’ or ‘plasma’ depending on whether blood clotting is permitted (serum) or inhibited by anticoagulants (plasma) during the blood draw. The key challenges in analysing blood samples are the broad range of concentrations at which circulating proteins appear and the fact that some proteins fluctuate strongly in their abundance in response to disease or other physiological reasons. Collectively, MS or immunoassays have detected about 5,000 circulating proteins, which represent approximately 25% of the human proteome33. Whereas more than 2,300 proteins account for 75% of the overall protein mass in cells, only 20 account for around 90% of the total protein mass in plasma28.

Proteins circulating in plasma can theoretically originate from any organ or cell, or can even pass through the placenta to be exchanged between mother and child34. However, the most abundant proteins, such as albumin, apolipoproteins and components of the complement system, are primarily produced and secreted by the liver. A recent reannotation of the human secretome found that about 730 of the estimated 2,600 secreted proteins have the bloodstream as their primary destination: other secreted proteins can reside locally in the extracellular matrix or in other body fluids35. More information about the human secretome annotation is accessible via the secreted protein section of the HPA portal (see Related links). Proteins secreted or that leak into the circulation are found in a wide range of concentrations and are involved in processes related to blood homeostasis, transportation, defence, signalling, digestion and inhibition of other proteins, inflammation and mechanisms of wound healing35.

Although secretion is the active process directing proteins towards the extracellular space, the presence of cellular proteins in the bloodstream may be a consequence of a variety of different and complex processes, including response to damage to cells and organs (Fig. 1). The detection of leakage proteins, such as cardiac troponin, can provide biomarkers for tissue damage. Knowledge of the tissue and cellular origin of the proteins that leak into the blood can reveal insights into the characteristics of the biosample: proteins leaking from blood cells can be used to assess the quality of sample preparation36.

Current limitations in the knowledge of the circulating proteome

There has been substantial progress in studying the characteristics of proteins across tissues and determining their structure–function relationship in biological systems37. More recently, there have been moves towards single-cell proteomics38. Blood-based proteomics has focused mainly on expanding measures of protein abundance: insights into interactions, structures, isoforms and post-translational modifications of circulating proteins have not been addressed in great detail. Even though these factors contribute to the complexity of identifying and precisely measuring the components of the plasma proteome, current technologies are unable to characterize these at the required throughput, precision and sensitivity. When alternative splicing of transcripts from the approximately 20,000 protein-coding genes is accounted for, the estimated number of possible ‘proteoforms’ reaches 70,000 (ref.39), a number further magnified when all possible combinations of post-translational modifications are considered. Capturing this diversity poses enormous technical challenges considering that different proteoforms of the same protein may coexist in the same sample, and that one of them may be pathogenic, while another may be protective40.

Preanalytical variables can also influence the quantification of circulating proteins, and to minimize possible effects, stringent, consistent and short sample processing procedures are recommended36,41. Variables include the blood collection type (serum or plasma), collection tubes (type of anticoagulants), preservatives (protease inhibitors), separation of blood cells (centrifugation speed), times for preparation (needle to centrifugation) and until storage (needle to freeze), storage format (aliquotation) and biobanking (freezer temperature). Besides inconsistency in plasma purity36, protein degradation occurring after blood draw and over the years of storage can influence the detectable protein levels42. Effects of preanalytical variables have been demonstrated for the processing steps from needle to freezer, such as precentrifugation or postcentrifugation delay43,44,45. In multicentre studies, differences in institutional practices in sample processing, shipping and storage are to be considered. For each proteomics technology or assay used, the influence of preanalytical variables on the protein detectability needs to be considered and assessed46,47,48,49 so that data from different studies can be robustly compared41.

Probing the plasma proteome in high throughput

The ultimate aim of population proteomic analysis is to be able to survey the circulating plasma proteome of individuals at high sensitivity and high specificity combined with a high degree of multiplexing, massive sample throughput and low cost. The challenge lies in applying precise protein-detection methods that are able to quantify thousands of proteins across a wide dynamic range in many thousands of samples, while minimizing the amount of starting material required and the analysis time needed. In the following discussions, we introduce and compare the main technologies that allow the capture of circulating proteins in the context of large-scale population studies; that is, MS and affinity-based methods (for further methodological information, see Box 1).

MS-based proteomics

Since the early years of the first decade of this century, MS-based analysis has been applied to measure circulating proteins5,50. At its core, MS measures the mass-to-charge ratio (m/z) of ionized molecules, such as proteins or peptides, within a gas phase. Although many variations and instruments are used for the ionization, detection, identification and quantification of the ions, plasma proteomics workflows are typically based on an enzymatic digestion of the circulating proteins into peptides. Depletion of highly abundant proteins51 or enrichment of proteins present at low concentrations52 is sometimes applied to address the challenge imposed by the concentration range and dynamics of proteins of the plasma.

There are two complementary approaches to peptide measurement in MS. Targeted MS uses stable-isotope standard peptides as reference points to provide absolute quantification of the peptides in the sample. By contrast, untargeted MS uses the intensity of the peptide ions as a semiquantitative readout of peptide abundance. Particularly for plasma, as peptides representing low-abundance proteins can be masked by peptides originating from more highly abundant proteins, accurate measurement across a wider range of protein abundance concentrations has generally required extensive prefractionation of the sample, limiting MS-based analyses to smaller-scale studies53. Efforts to increase the scale and scope of MS-based assays have revived interest in application to plasma, with recent studies demonstrating reliable detection of approximately 500 abundant proteins in nearly 1,000 samples41,54. In MS studies focusing on detecting as many proteins as possible, up to 4,000 proteins can be identified in a handful of samples55. Although general concepts, advances and the versatility of MS have been reviewed extensively elsewhere37,56, we describe here some recent studies that indicate the growing potential of MS for population-wide studies of the plasma proteome.

For untargeted MS, data-dependent acquisition (DDA) MS methods, which focus on analysing the more prominent peptide ions, have been widely used57. Recent efforts to shorten the analysis time and peptide scanning schemes allow measurement of nearly 300 proteins in 10 plasma samples within 3 hours58, and one recent study included 1,300 sample assays for 450 proteins59. Data-independent acquisition (DIA) approaches consider all peptide ions and are better able to detect less-abundant proteins. DIA requires less complex preanalytical sample handling than DDA but at the cost of increased time for data collection and subsequent bioinformatic analyses to match the extended list of targets with comprehensive protein libraries. One recent DIA-based study described the quantification of 340 proteins in 200 plasma samples from twins60 and revealed that 50% of all detected peptides (so-called peptidoforms) had undergone post-translational modification61. The largest DIA MS proteomics study to date included 1,500 plasma samples, using improved chromatography and increased reproducibility to detect a common set of 320 proteins in more than 90% of the samples62.

For more focused studies, where a smaller number of preselected proteins are of interest, targeted MS workflows can be preferred. In such a setting, targeted MS provides a more sensitive and absolute quantification of protein concentrations. These assays use predetermined analysis conditions and addition of known amounts of synthetic, isotopically labelled peptides (or proteins) to act as references for quantification. This approach facilitates comparison of protein quantities across studies51,63,64. However, the scope of targeted MS is constrained by the predefined sets of labelled canonical peptides offered by companies such as Biognosys (Zurich, Switzerland) or MRM Proteomics (Montreal, Canada).

Affinity-based proteomics

Affinity proteomics has emerged as an attractive alternative to MS-based identification of proteins26 by conducting classical immunoassays at higher throughput and higher sensitivity and in multiplex format25. Among the variety of immunoassays such as ELISA, immunohistochemistry, immunofluorescence, flow cytometry and western blots, all mainly use classic antibodies but may also use alternative binders to detect endogenous proteins in a variety of sample types65. The basic concept for these multianalyte assays66, often making use of advances in DNA technologies (such as DNA microarrays), relies on measuring multiple proteins simultaneously in one sample by miniaturizing the assays. A reduced analytical surface area increases the signal-to-noise ratios by increasing the number of occupied binding sites and avoids the depletion of the analyte in solution.

Over the past two decades, methods for multiplexed protein measurement have emerged to enable analyses across a wide range of concentrations, in thousands of samples. These developments have made these assays attractive for population-wide analyses, and they have been adopted in most recent genome-wide pQTL analyses. These advances in affinity assay methods rely on the improved performance of the binding reagents (antibodies or aptamers) used to detect antigens with high selectivity and high binding affinity67. For analyses of plasma proteins, available platforms range from ultrasensitive single-analyte assays for the detection of circulating proteins at less than 1 pg ml−1 (ref.68) to multiplexed assays that capture more than 4,000 proteins by designed protein-binding DNA aptamers9.

The suspension bead array technology of Luminex (Austin, TX, USA) supports multiplexing via combining different colour-coded microparticles followed by flow cytometry analysis69: each colour code represents one immunoassay for a given protein target. These assays can typically be multiplexed to quantify 30 analytes per sample, processed in batches of up to 384 samples, and have been applied to studies of more than 8,000 samples and up to 50 proteins70. Myriad RBM (Austin, TX, USA) offers services based on a set of more than 300 proteins on the Luminex technology, whereas a variety of predefined and/or customizable panels of sandwich immunoassays are available from providers such as MilliporeSigma, R&D Systems, Bio-Rad and Thermo Fisher. Alternative platforms for multiplexed immunoassays are offered by Quanterix (Billerica, MA, USA), ProteinSimple (San Jose, CA, USA) and Meso Scale Diagnostics (Rockville, MD, USA), and are reviewed elsewhere71.

By contrast, the technology implemented by Olink (Uppsala, Sweden) uses proximity extension assays. Detection of a given protein requires binding of two separate antibodies that carry complementary oligonucleotide tags: when two antibodies bind to the target protein, these oligonucleotides can be hybridized and extended by a DNA polymerase72. This assay therefore uses protein-specific binding properties to generate a readout that then relies on DNA concentrations more easily quantifiable by quantitative PCR. Presently, the technology enables the analysis of up to 1,164 proteins, distributed across 14 themed protein panels, each panel analysing 92 proteins in 90 samples at a time. Recently, one of these panels was deployed across 3,400 individuals to map genetic loci for plasma protein biomarkers in cardiovascular disease73, followed by a meta-analysis in more than 21,000 individuals74.

SomaLogic (Boulder, CO, USA) uses an in-house-developed library of modified aptamers for highly multiplexed protein profiling75. In recent years, its aptamer library has grown from approximately 800 (ref.75) to more than 4,000; in parallel the sample sizes investigated have grown from hundreds to tens of thousands76. The proprietary assay processes 90 samples per batch, and its readout is on DNA arrays. Instead of matching two binding reagents for increased specificity, the selectivity of the platform is built around specific aptamers selected for their capability to bind to their target protein for an extended time (slow off rate).

Comparative discussion of the available methods

Several criteria influence decisions about which technology to deploy for a given large-scale plasma proteome analysis. Although no single method is a universal solution for all analytical aspects, each has particular strengths. There has been no systematic and direct comparison of all available methods, but a starting point for future work could be the ‘popular proteins’77, which is a set of approximately 1,000 circulating proteins that can currently be detected by both MS-based and affinity assay approaches33. Analytical performance criteria for the main technologies are summarized in Table 1, including specificity, reproducibility, sensitivity, degree of multiplexing, sample throughput, quantification and translatability, costs and accessibility of the technology and the derived data.

Table 1 Features of plasma proteomics technologies

MS-based approaches benefit from a large community of users and companies. The mode of MS-based protein identification and absolute target quantification as well as future opportunities to capture several post-translational modifications61 are noticeable benefits compared with the affinity assays. However, sample throughput and analytical sensitivity are relatively low (there is no amplification method available), limiting the potential to apply MS-based approaches to large plasma studies. Peptide sequence variants and modifications may further increase the number of missing data points. Newer approaches — such as DIA and variations thereof56 — enable the protein-level data to be digitally stored, so improvements in matching tools to detect peptides together with growing libraries may allow reanalysis of the data in the future.

Affinity-based methods clearly lead on sample throughput (N > 1,000) and analytical sensitivity (below 1 ng ml−1), which can be enhanced by use of signal amplification methods such as are possible with some DNA-based readouts. The number of analytes that can be measured differs between technologies, and several panels may be used to detect large numbers of target proteins. Although the number of reagents to target the human proteome is increasing78, the susceptibility to off-target binding and the lack of reproducibility of assays have raised concerns about the quality of some affinity-based measures79. To achieve higher-quality data, guidelines on the validation of affinity capture-based data have been established80 and systematically applied to a variety of common techniques and samples81,82.

GWAS with proteins circulating in plasma

The goals of proteomic GWAS are to identify genetic sequence variants associated with proteomic features in a given cell or biofluid. Those features are typically protein abundance levels but could extend to other characteristics, such as isoform diversity or post-translational modification. GWAS analysis conducted across the large number of proteomic phenotypes now becoming available for many cohorts presents challenges, not only in terms of computational efforts and data management but also with respect to the quality control of the phenotypic measurements, automation of data analyses, and integration and interpretation of the results.

Identification of pQTL signals

Preprocessing and quality control of proteomic data to avoid spurious association from outliers, or deviations of the protein abundance distributions from assumptions made in standard statistical models, may be cumbersome when one is working with thousands of quantitative traits. Inverse normal scaling of protein levels is often used as a simple and conservative approach to deal with distributional issues and appears to lead to robust associations, as shown by the generally good replication between studies10; however, this is achieved at the cost of some loss in statistical power. Log scaling and/or winsorization — that is, moving extreme values closer to the normal distribution of the proteomic data — offer alternative approaches to inverse normal scaling, and computation of P values based on data preprocessed by multiple methods can be used to obtain consensus associations8. Associations can be computed with the same tools and models as deployed in other GWAS of quantitative traits83,84: batch effects and covariates can be managed by a combination of direct inclusion in analytical models and the use of latent variable approaches, such as PEER85, to ‘find’ the covariates that matter (at some price of removing real effects). Adding specific principal components of the proteomic data as covariates may increase statistical power, for instance where principal components reflect variation induced by processes related to sample handling and storage.

For pQTL discovery studies, the gold standard involves application of a combined genome-wide and proteome-wide significance level together with independent replication. However, this can be a high bar, since a study-wide significance threshold designed to minimize false-positive associations may require researchers to account for the billions of tests performed (a million or more DNA variants for each of several thousand proteins). This approach leaves many studies underpowered to detect genuine but weaker associations. Limiting association testing to variants located in the immediate vicinity of the respective protein-encoding genes can increase statistical power for the detection of cis-pQTLs but restricts the breadth of the study. Novel methods to better incorporate correlative structures between proteins and genetic variants, such as use of sparse multivariate regression models86, may improve this trade-off between power and robustness.

Extending pQTL analyses to include rarer alleles (rather than the common alleles that have been the primary focus of GWAS) may increase opportunities to detect variants with particularly large effects on protein expression and disease risk, since rare alleles are often of more recent origin and may not yet have been subject to negative selection. However, as with disease association studies, the robust detection of rare variant associations can be troublesome, particularly if the rarity of the alleles themselves requires the use of ‘variant aggregation’ tests to provide a signal. Such approaches depend for their power on the ability to combine disparate alleles that have similar broad functional impact, while avoiding dilution of the test by including neutral variants. Such an approach can be difficult enough for coding variants, where the gene provides an obvious unit for aggregation, but is some way from being solved for variants that fall into intergenic and regulatory regions.

pQTL studies to date

Over the past decade, there have been a profusion of GWAS analyses, many also with a limited proteomic scope1,6, including several large-scale studies with multiplexed immunoassays that cover tens of proteins to a few hundred proteins12,13,17,70,73,87. We provide a comprehensive list of all pQTL studies of which we are currently aware in Table 2. To date, only two MS-based studies in blood have combined proteomics with GWAS, one identifying approximately 160 proteins in the plasma of approximately 1,000 individuals53 and the other analysing the small peptide subset of a non-targeted metabolomics platform88. Otherwise, large-scale pQTL studies have made use of affinity proteomics approaches8,9,10,11.

Table 2 Published GWAS with plasma proteomics

Several of the affinity proteomics studies have used the aptamer approach developed by SomaLogic. The KORA study analysed more than 1,100 proteins in 1,000 participants from a German cohort8 and identified 540 pQTLs that connected 450 independent genetic variants with 280 proteins. This analysis was extended within the INTERVAL cohort of UK blood donors10: an expanded SomaScan panel of almost 3,000 proteins was deployed across 3,300 individuals, raising the pQTL count to 1,930, involving 1,480 proteins. The largest published SomaScan-based pQTL study analysed more than 4,000 proteins across 5,500 Icelanders from the AGES Reykjavik study9 and reported more than 3,130 pQTLs influencing the abundance of 1,800 proteins.

Studies using the Olink platform have tended to include smaller numbers of proteins but larger sample sizes. The most recent analysis, conducted by the SCALLOP consortium, analysed 90 cardiovascular proteins in more than 21,000 individuals74: this study yielded a total of 467 pQTLs influencing 94% of these protein targets.

These pQTL studies have provided confidence in the reproducibility of the underlying methods. The three SomaScan studies displayed good agreement: the AGES study confirmed more than 84% of the pQTLs found in the KORA study and more than 72% of those from the INTERVAL study. The INTERVAL study assayed participants with both the SomaScan platform and the Olink platform and found that 65% of the SomaScan-detected pQTLs for overlapping targets were replicated with Olink (with a correlation of 0.95 in effect-size estimates). Some small-scale studies provide data measured in parallel on two platforms89,90. Further comparative studies are needed to fully delineate the consistency between platforms, particularly between MS-based and affinity-based analyses: such analyses have been limited by the sparse overlap between proteins jointly detected on both platforms91.

Cis-pQTLs and trans-pQTLs

Genetic variants that associate with protein levels are generally classified into one of two categories. They are located either in cis, close to the gene that encodes the associated protein, or in trans, at greater distance, typically on a different chromosome (Fig. 2). Between 18% and 25% of the proteins assayed by aptamer assays have been found, at current sample sizes, to have a significant cis-pQTL. Twenty-seven per cent of the pQTLs reported by the KORA study8 were trans-pQTLs, rising to 70% in the larger INTERVAL study10. The greater multiple testing burden inherent in trans-pQTL analysis, together with the increased potential for non-genetic phenotypic variability (Fig. 2), makes sample size a critical factor in trans-pQTL detection.

Fig. 2: Ways a genetic variant can lead to a pQTL.
figure 2

A protein quantitative trait locus (pQTL) is a locus containing one or more genetic variants that associate with protein levels. Genetic variants can affect many biological processes in the life cycle of a protein, which eventually leads to their association with protein levels measured in a blood sample. Genetic variants can act in cis, that is, affect processes involved in the expression of the associated protein itself (blue), or act in trans, and modify the associated protein through various indirect processes (red). Variants can directly change the properties of a protein or mediate their effects through regulatory feedback loops that affect protein levels. Variants can also exert regulatory effects at the RNA level, such as by altering transcriptional output of mRNAs (for instance by changing DNA methylation sites or altering transcription factor binding), modifying RNA structure, altering mRNA splicing, impacting RNA turnover or affecting ribosomal binding. pQTLs may also act through more complex processes, such as affecting co-translational protein folding or microRNA (miRNA) expression or target specificity. Some of the variation in protein levels may also reflect differential experimental response to changes in protein epitopes. Identification of the functional variants of a pQTL can therefore be challenging.

Cis-pQTLs indicate the presence of a variant that is likely to have a direct and causal effect on the observed protein levels at that locus. If the causal variant of a cis-pQTL acts primarily through mRNA expression or turnover, an (RNA) expression QTL (eQTL) may also be found in a relevant tissue or cell type. If instead, the cis variant alters protein abundance through an impact on protein translation or turnover, a cis-eQTL is less likely, but it may still be observed if expression is upregulated through compensatory feedback to maintain protein levels. Although the presence of a nearby eQTL generally supports the identification of a causal pQTL variant, it is also possible that the two signals map close together by chance or that the eQTL and pQTL arise from different variants that are in high linkage disequilibrium. A variant in a shared promoter of two genes may further complicate the picture, giving rise to more complex hypotheses regarding the link between the variant and the associated protein level.

Trans-pQTLs are of particular inferential value because a trans-pQTL implies an interaction between an — often still to be identified — causal gene at the pQTL locus and the associated protein encoded at the trans position, pointing to novel pathways of protein regulation or interaction. Sun et al.10 present several examples including the identification of PRDM1 as the probable causal gene at an inflammatory bowel disease GWAS locus, and the dissection of complex association signals for antineutrophil cytoplasmic antibody-associated vasculitis at the SERPINA1 locus. Orthogonal evidence based on pharmacological intervention and transgenic mice can support hypothesized causal gene-to-protein relationships from trans-pQTLs, as exemplified by the recent study by the SCALLOP consortium on eight gene products targeted by compounds or antibodies in clinical development74. Genetic loci involving multiple cis-pQTLs and trans-pQTLs are particularly informative, as they link multiple proteins in putative gene–protein networks that can suggest new hypotheses as to their possible interactions and functions. A genetic variant associated with a trans-pQTL can further change protein levels in trans through regulation or modification of cis-encoded microRNAs or epigenetic marks. The study by the SCALLOP consortium identified 30 trans-pQTLs that involved two or more proteins, with ABO, ST3GAL4, JMJD1C, SH2B3 and ZFPM2 showing association with the levels of 5 or more of the 90 proteins analysed by one panel of the Olink platform74.

Colocalization with eQTLs

Most disease GWAS signals map to regulatory rather than coding sequences, and the downstream effectors through which they operate are typically unclear92. Evidence of enrichment between patterns of disease association and the location of tissue-specific regulatory sequences (such as enhancers), as well as for the sites of tissue-specific eQTLs, indicates that the identification of colocalizing cis-eQTLs has value in highlighting those effectors and reconstructing disease mechanisms93. However, it is also clear that this approach is not as robust as has often been assumed: at many loci, there are multiple colocalizing cis-eQTLs, implicating several candidate effectors, and it seems intrinsically unlikely that all are directly mediating disease risk94. pQTL analysis provides a complementary approach to reconstructing links between genetic variation, molecular processes and disease predisposition. In recent studies8,9,10, only 40% of detected cis-pQTLs could be shown to colocalize with cis-eQTLs detected in projects such as Genotype–Tissue Expression (GTEx)2, which generated cis-eQTL data for approximately 40 tissues in up to 900 individuals. It remains to be seen how the recently reported contamination of GTEx data with highly expressed, tissue-enriched genes impacts on this observation95.

The discrepancy between pQTL and eQTL discovery has led to questions about their relative merits, questions which, by and large, ignore the fact that these data types are quite distinct. Most obviously, large-scale pQTL data are almost entirely restricted to the circulating proteome, whereas projects such as GTEx have provided eQTL maps for multiple tissues: there will be many loci where the downstream effectors (for example, transcription factors) are not represented in the circulation and where any contribution from pQTL studies will need to await technologies that provide tissue-specific, and even cell-specific, proteomic readouts. On the other hand, there will be other loci where the mediating mechanisms can be detected only through pQTL analyses: where the genetic variants exert their effects through an impact on protein stability or modification, for example, or where the protein half-life in the circulation far exceeds that of the RNA.

Data sharing

Sharing of full summary statistics — that is, reporting all pairwise variant–protein associations with effect size and standard error regardless of their significance levels — can be data storage intensive but is an essential component of scientific discourse. The GWAS catalogue1 now accepts and maintains full summary association statistics data sets, and summary statistics for some of the larger pQTL studies are already freely available8,10. Access to summary data enables immediate incorporation into online tools for causal inference that use Mendelian randomization approaches, such as EpiGraphDB96 and MRbase97. Whereas sharing of summary-level data should be a prerequisite for publication, proactive sharing of individual-level data (both raw and processed genotypes and protein measurements) renders data sets far more valuable facilitators of scientific discovery, fostering development of novel statistical approaches and maximizing the opportunities for integration with other complementary sources of data. Controlled-access repositories, such as EGA and dbGAP, provide a mechanism to support the sharing of sensitive individual-level data with bona fide investigators.

Working with ratios

It is possible to conduct association analyses focused not only on the expression levels of individual proteins but also on the ratios between them. Inspired by the use of hypothesis-free testing of metabolite ratios in metabolomic GWAS98, the challenge with ratio-based analyses is the massive explosion in the number of statistical tests that could be performed and the consequent need to allow for the inflation in type 1 error that could result. However, as with metabolomic data, where known relationships between products and substrates can be used to constrain the scope of ratio testing (avoiding testing of all pairwise combinations), there may be opportunities to perform protein-ratio QTL analyses in subsets of functionally connected proteins. For example, the strength of the pQTL association of rs41341749 with CCL14 and CCL23 increased by 35 orders of magnitude when the ratios of their abundance were used, an effect replicated in an independent cohort8. A gain in the strength of association when ratios are used may also result when one of the proteins concerned is a proxy for variability induced by sample handling and storage, or any other common normalizing factor.

Network integration

A broader view of disease biology can be enabled by analyses that present the relationships between proteins as networks, especially when those networks can be integrated with additional information such as that provided by genetic associations and colocalization of those associations with clinical end points. Network relationships between proteins can be derived from existing biological knowledge, using databases such as WikiPathways99 and STRING100, or they can emerge from data-driven integration in multidimensional omics data sets. For example, Gaussian graphical models (GGMs) have been shown to reconstruct pathway reactions from high-throughput metabolomics data101, and, when integrated with metabolic and disease GWAS associations, are powerful tools to mine high-dimensional GWAS data sets for biomedically relevant associations102. GGMs that include proteomic data recently revealed a genome–proteome–disease subnetwork implicated in Crohn’s disease8.

One illustrative example involves the pleiotropicABO locus. In GWAS, this locus has been associated with numerous disease outcomes, including coronary artery disease103 and venous thromboembolism104. This locus includes at least three independent, widely replicated genetic signals that influence the plasma abundance of multiple proteins in trans8,9,10,14,91,105. Two of these signals tag the blood group B and O alleles, and the third lies upstream of the ABO gene. A network based around these trans proteins, assembled from literature findings, experimental protein–protein interactions and the partial correlations between their abundances, indicates that these proteins play a joint role in angiogenesis and that this angiogenic function is modulated by variation at the three ABO signals8. Further analyses of such network models are required to establish the functional relevance of observed links between GGM network nodes, possibly using statistical approaches to objectively define functional modules in such networks106.

Analytical considerations

As with all high-throughput omics technologies, proteomic data are subject to multiple potential sources of error and bias, which the user must consider to avoid misinterpretation or overinterpretation of the results (Table 1).

Epitope effects

One challenge in interpreting cis-pQTL associations from affinity assay-based GWAS is the possibility that a genetic variant in the protein-coding sequence modifies the binding epitope of the protein that is recognized by an assay’s antibody or aptamer, in the absence of other biological consequences14. Such an epitope effect can be introduced by a non-synonymous mutation within the target epitope. Alternatively, it may reflect larger-scale changes in the protein structure arising from variants (including frameshift insertions or deletions) that modify the protein’s overall structure or folding, or its potential for post-translational modification. Linkage disequilibrium (that is, correlation) between such a variant influencing epitope binding and otherwise entirely unrelated variants within adjacent non-coding sequence can result in a pattern of association that is indistinguishable from a genuine cis-pQTL.

There are several ways to establish whether epitope effects could be driving an observed cis-pQTL signal8,9,10,14. Evidence that the gene implicated in a cis-pQTL contains a coding variant (which may require interrogation of imputed or sequenced whole-genome sequence data) raises this as a concern: the coding variant needs to be evaluated for its impact on protein sequence and structure. Experimentally, one could produce both protein variants and quantify differential recognition by directly comparing assay performance. The presence of a colocalizing eQTL can be reassuring, serving as a pointer that the observed differences in protein abundance are likely to result from altered protein expression rather than differential recognition. However, this is not fail-safe: the eQTL could itself be an artefact if the coding variant leads to allelic bias in the mapping of RNA sequencing reads. Where pQTLs are the consequence of SNP-associated allele-specific gene expression, RNA sequencing can be used to provide evidence that only one protein isoform is expressed. Replication of a pQTL on multiple platforms, or when a range of binders and assays are used, provides confidence that the association reflects changes in protein levels rather than binding affinity. pQTLs that display exceptionally large effect sizes and that show no colocalization with other GWAS phenotypes are more likely to represent epitope effects, as large differences in protein levels might be expected to have measurable biological impact.

Trans-acting epitope effects are also possible. This can happen when a coding variant at a pQTL signal influences the properties of a protein-modifying enzyme or an interaction partner in cis: this can affect the epitopes of proteins modified by this enzyme or those which interact in a genotype-dependent manner with the binding partner8. In some relevant loci, binders have been selected explicitly to bind different epitopes of the same target: the use of multiple SomaScan aptamers to target the products of the various APOE alleles is one such example8.

Protein-altering variants can also influence MS-derived data107. If a SNP introduces a new protein cleavage site (or eliminates an expected one), the peptides generated by enzymatic digestion may differ from those listed in the canonical reference peptide library. A possible workaround for capturing these instances is to apply a ‘proteogenomics’ approach108, in which peptide variants are imputed from the genetic data and then used to build a variant-aware peptide reference109.

From the findings taken together, it is clear that epitope effects represent a challenge to the interpretation of pQTL studies and require careful annotation of the assays used. However, these effects also constitute opportunities for further analyses of biological variation in the protein sequences that have no direct impact on the biological function (until shown otherwise). It is important to bear in mind that pQTLs identified with affinity assays after all represent genetic differences in the amount of affinity-captured proteins, which is then merely interpreted as representing protein abundances.

Binding specificity and cross-reactivity

Affinity-based assays rely on the correct identification of their intended targets, but cross-reactivity and lack of specificity have been a concern81. The selectivity and suitability of affinity reagents are context and technology dependent and hence must be evaluated for each sample type and application71. For example, a SomaScan-based proteome analysis of blood from young and old parabiotic mice reported an age-related reduction in circulating levels of growth/differentiation factor 11 (GDF11), raising the possibility that restoring normal levels of GDF11 could reverse ageing. However, it was subsequently established that the assay used to identify and isolate GDF11 had low specificity110. Cross-reactivity to similar epitopes, such as those encoded by paralogous genes, or domains shared across different proteins may confound an association signal10. Cross-target binding may also occur when binders designed to target non-human proteins are included in the panel, such as viral or bacterial proteins on the SomaScan platform3. In the absence of the intended target, pQTLs identified with such binders may correspond to genuine genetic signals of another, yet unidentified target protein. Finally, genuine errors in binder selection, for instance due to misidentification of the intended target protein, constitute another potential source of error.

Binding specificity has been addressed by several studies8,9,10,14. For instance, Sun et al.10 assessed potential off-target cross-reactivity for 920 aptamers and found that 14% showed comparable binding with a homologous protein, nearly half of which were alternative forms of the same protein. Emilsson et al.9 provided evidence for target specificity for 773 affinity binders for a panel of 4,137 aptamers by using affinity pull-down followed by MS-based target identification, although only some of the experiments were conducted with blood serum or plasma. Recently, SomaLogic conducted a systematic analysis of the reliability and specificity of SOMAmer protein-affinity reagents in the SomaScan assay. They found that out of 1,612 tested SOMAmer reagents, 73% did not bind to any related proteins, 14% bound to related proteins with at least tenfold weaker affinity and 13% bound to other unrelated proteins with similar affinity. They further confirmed specific target enrichment by pull-downs from human plasma for 123 of the SOMAmer reagents111. A recent article from SomaLogic also provides updated information on MS-target confirmation and characterization of possible cross-binding and off-target effects for many of the proteins on its in-house platform76.

The use of community standards for validation80 and open access to validation data will help to drive improvements: it should be possible to understand, for any given binder, not only that it captures the intended target but also which other circulating proteins can be enriched. Sharing raw data from multiplatform studies may allow independent evaluation of platform performance89. The integration of such information into public protein databases with cross-referencing identifiers to affinity proteomics platforms should increase the accuracy of pQTL inference.

Clinical and biomedical applications

The characterization of pQTLs drives various biomedical and pharmaceutical applications. pQTLs provide intermediate phenotypes to interpret the findings of disease GWAS, clues to genes that are causal for disease biology, opportunities to discover clinical biomarkers, matches between existing drugs and new disease indications, pointers to potential safety concerns for drugs in development, insights into protein–protein interaction networks and much more (Box 2). Here, we discuss some of the key applications in further detail.

Genetic variance in clinical biomarker proteins

The measurement of protein levels in accessible biofluids (for example, plasma, urine or cerebrospinal fluid) is one of the mainstays of clinical medicine, providing many biomarkers with diagnostic or prognostic value. The ability to generate robust pQTL associations at scale could not only provide mechanistic insights into disease biology that prioritize targets for therapeutic development but could also promote the discovery of novel clinical biomarkers. Some of these biomarkers will support more accurate diagnosis of disease, whereas others will increase the detection of drug side effects, stratify disease risk or act as surrogates of developing disease that are valuable in clinical trials.

Many frequently used clinical tests are blood based; nearly half involve the measurement of proteins54. Historically, these tests have involved measurement of the abundance of single proteins, but multiprotein biomarkers are emerging, one example being the profiling of cardiovascular risk in patients with coronary heart disease112. The current set of clinically used proteins represents only a fraction of the circulating proteome, suggesting considerable opportunities to develop improved protein signatures for monitoring human health and disease states76. Ten years ago, Anderson113 identified 109 US Food and Drug Administration (FDA)-approved protein analytes that can be assayed in serum or plasma. An update of this list (Supplementary Table 1) now includes 199 unique analytes. Among the subset of these FDA-approved protein analytes for which assays were available in at least one of the three recent large-scale SomaScan studies8,9,10, almost one-third have pQTLs with an effect size large enough to be detected (Table 3). This observation, which is in agreement with recent reports of the impact of ancestry on protein biomarker levels114, indicates that some of the observed differences in disease prevalence between ethnicities may have a genetic basis reflected in blood protein abundances and that reference intervals for those biomarkers should be tailored to ancestry.

Table 3 Established blood-based protein markers with cis-pQTLs

Interpretation of the findings of disease GWAS

One of the key motivations behind large-scale pQTL analyses is to support efforts to relate protein abundance levels to the growing inventory of disease-associated genetic variants and thereby to accelerate the identification of potentially translatable biomarkers. As the scale and scope of pQTL studies have expanded, so too has the list of pQTLs coincident with disease risk variants8,9,10. In recent SomaScan articles8,9, between 11.5% and 20.7% of cis-pQTLs were in high linkage disequilibrium (r2 ≥ 0.8) with sentinel disease-associated variants. Webservers such as PhenoScanner115 and SNiPA116 integrate GWAS hits from multiple sources and can be used to identify and interpret such overlapping signals. By addition of data from large-scale studies of RNA expression, such as those provided by GTEx, it becomes possible to construct causal chains leading from DNA to RNA and on to protein and disease2. Tissue-specific protein expression data gathered from the HPA30 can further illuminate these pathways. One important caveat is that the rapidly growing numbers of GWAS associations and pQTLs mean that some of the apparent overlap in genomic locations is merely the consequence of two distinct signals — driven by entirely different sequence variants — which happen to map to the same stretch of DNA but which have no mechanistic relationship. It is therefore essential to confirm that coincident signals are driven by the same genetic variants (that is, that they colocalize) before assuming any biological connection117.

Polygenic scores

One powerful strategy for biomarker discovery is emerging from recent advances in the use of polygenic risk scores to stratify disease risk across populations118. These scores aggregate the information on individual genetic predisposition for a given disease across many thousands of risk variants. For many common conditions, these scores can identify individuals who, on the basis of their patterns of shared common sequence variation, are at substantially increased (or decreased) future risk of disease118. Proteomic analyses conducted in individuals from these extremes of disease risk, ideally using plasma samples collected from healthy individuals many years ahead of clinical disease onset, offer a route to accelerate the discovery of prognostic biomarkers. It is possible to directly evaluate relationships between disease polygenic risk scores and protein expression levels in cross-sectional cohorts using a Mendelian randomization framework119. As an example of the power of such an approach, Mosley et al.120 intersected coronary artery disease risk score data with an aptamer-based pQTL study of 759 individuals, computing a ‘virtual proteome’ for each individual, which was then evaluated for its association with clinical end points in a much larger cohort.

To the extent that these biomarkers capture processes fundamental to disease risk, they have the potential to profile individual risk irrespective of its basis — genetic or otherwise — in the same way that cholesterol levels synthesize both genetic and lifestyle-related contributions to the risk of coronary disease. There are opportunities to extend these approaches beyond measures of overall disease risk to characterize biomarkers that capture the relative contribution of multiple disease-associated processes to disease progression in a given individual. This information can be parlayed into improved tools for prognostication and therapeutic optimization. In the case of type 2 diabetes, for example, it is possible to construct a set of ‘partitioned’ risk scores, each made up of variants that influence one of several pathways contributing to disease risk (such as obesity, fat distribution, insulin resistance and insulin secretion)121 and then to use them in the search for process-specific biomarkers.

Inferring causality

Although pQTL and eQTL analyses can generate important hypotheses linking a disease risk variant to a plausible downstream effector, these inferences are correlative rather than causal. Even in the ideal setting, whereby a disease risk variant is shown to colocalize with a pQTL for a protein with a biologically plausible link to disease, there is no formal proof that the protein lies on the causal pathway to disease. Instead, variation in the circulating levels of that protein might be just one of several molecular events attributable to the risk variant, only one of which is causally linked to disease development. Alternatively, variation in the biomarker may be secondary to, rather than causal for, the disease process (‘reverse causation’). Such pQTLs may still point to useful biomarkers of disease predisposition or severity, but interventions that restore levels of the biomarker to normal levels will not necessarily be disease modifying. In the case of coronary artery disease, for example, the evidence indicates that triglycerides and LDL cholesterol play a causal role, but the levels of HDL cholesterol and C-reactive protein do not.

Proof of a causal connection ultimately relies on demonstrating — in a suitable system and using a disease-relevant readout — that perturbation of the expression or function of the protein of interest has a material effect on the development of disease. Fortunately, there are a host of approaches for examining the consequences of protein perturbation, ranging from cellular screens, through manipulation in animal models, to the detection, in isolated populations or those with high levels of consanguinity, of individuals who have inherited genotypes that result in extreme (high or low) levels of the protein. Here too, the integration of genetic and proteomic data can be extremely valuable. Finding that coding variants in the gene encoding a protein of interest are causally implicated with disease provides reassurance that a regulatory pQTL involving the same gene is also likely to be causal. Similarly, evidence from Mendelian randomization studies that a given protein is the target of several distinct cis-pQTLs and trans-pQTLs, each of which also colocalizes with a disease GWAS signal, represents consistent evidence that the protein is mediating disease risk rather than simply reflecting the activity of another causal process with which it happens to share some regulatory overlap122.

Mendelian randomization analyses are central to many of these approaches, providing what can be considered the genetic equivalent of a randomized clinical trial. For example, large-scale SomaScan analyses have implicated several protein pathways as causal for disease risk, including a protective role for prostate secretory protein of 94 amino acids (PSP94) in prostate cancer, and associations between raised IL1RL2 and IL18R1 levels and atopic dermatitis risk, and between higher MMP12 levels and decreased risk of coronary disease risk and stroke9. Yao and colleagues13 used pQTL data to identify six proteins likely to be causal for coronary heart disease, two of which, cystatin C and PON1, were also associated prospectively with death from new-onset coronary heart disease or cardiovascular disease in the Framingham Heart Study. Chong and colleagues123 conducted a systematic Mendelian randomization meta-analysis of the circulating proteome based on pQTL data from more than 20,000 individuals8,9,10,70,73 and disease GWAS data from up to 400,000 participants. This study revealed seven causal mediators for ischaemic stroke, including a protective role for SCARA5 and TNFSF12, highlighting important roles for stroke protein biomarkers in cardiovascular and non-cardiovascular diseases and singling out LPA, F11 and SCARA5 as particularly attractive drug targets. Many recent association studies with intermediate proteomics traits include Mendelian randomization analyses11,74,87,124, and new methods are emerging to cope with the inherent challenges of a multi-omics Mendelian randomization approach, such as pleiotropy between the traits.

Conclusions and perspectives

Genetic association studies using intermediate phenotypes represent analyses of experiments conducted by nature, able to provide a valuable resource of biological information. Proteomic data at scale have only recently become accessible to large-scale GWAS approaches, largely due to analytical limitations that are specific to the composition of the plasma proteome. Advances in affinity proteomics technologies are rapidly remedying this shortfall, with recent studies reporting the parallel measurement of 5,000 proteins in almost 17,000 blood samples76, and many substantially larger studies are under way74. Currently, between 10% and 20% of all pQTLs discovered were found to colocalize with clinical GWAS loci. This overlap can be expected to increase as the scale and scope of both pQTL studies and clinical GWAS continues to increase. This will add to the value of plasma proteomic data especially for the biomedical and pharmaceutical field by providing more and better instruments for drug target validation7 and causal inference96. Crucially, large-scale pQTL studies in large healthy cohorts complement the growing use of plasma proteomic data in the context of case–control datasets and in clinical trials, improving inference in study designs that are subject to reverse causation and confounding.

Although affinity assays lead the way in pQTL mapping, it is wise to remain aware of their limitations. Associations of interest should be supported by independent validation of target specificity and should include consideration of possible cross-reactivity and epitope effects. As none of the current methods is the optimal or sole solution, validation across multiple technologies will be key, with the most important findings further supported by cellular and other functional studies. It remains to be shown how feasible MS-based approaches will be for large-scale GWAS projects. Although low sample throughout and low analytical sensitivity currently limit the use of MS-based techniques in population studies, MS has the potential to identify a broader range of peptides and proteins, as well as a variety of post-translational modifications and splice forms. GWAS of IgG glycosylation125 and total N-glycans126 have already reported biologically relevant associations, most of them with proteins that are involved in protein glycosylation. Other approaches, known as ‘glycoproteogenomics’, seek to understand the differential glycosylation of plasma protein biomarkers127. Future efforts should consequently also attempt to capture the different modification characteristics of proteins by high-throughput and quantitative technologies and possibly determine tissue-specific isoforms that are of relevance for these traits. In parallel, the prospects of combining data from large-scale proteomics with molecular readouts, such as DNA methylation87,124,128,129, metabolomics130 and glycomics131, opens new avenues for studying human health at a molecular level in population-scale studies.

Recent advances in the scale and scope with which it is possible to survey the proteomic content of plasma and other biofluids are allowing proteomics to take its place alongside the comprehensive characterization possible for other omics approaches, such as those focused on genetic variation and RNA expression. These advances open new opportunities to use proteomics to deliver improved understanding of the mechanistic basis of disease132,133 and to promote novel translational strategies through target and biomarker identification74,76.