Assignment Bias Definition


What is Assignment Bias?

Assignment bias happens when experimental groups have significantly different characteristics due to a faulty assignment process. For example, if you’re performing a set of intelligence tests, one group might have more people who are significantly smarter. Although this type of bias is usually associated with non-random sampling and assignment, it can occasionally be an issue with random techniques.

Controlling Assignment Bias

Random assignment can help to control assignment bias by ensuring that treatment groups and control groups have an equal spread of characteristics. That said, random assignment is not always possible, especially in the medical fields where it may be unethical to assign patients to control groups. If you are unable to use random sampling and random assignment methods to select participants, alternative methods include:

  • Instrumental Variables: a third variable used in regression that is related to the treatment but affects the outcome only through it, which helps you account for “hidden” variables (other than the independent variables) that cause results.
  • Propensity Score Matching: a matching technique that accounts for covariates in the experiment.
  • Purposive Sampling: selecting samples based on your knowledge about the population and the study.
  • Randomization Tests: an approach that considers all of the possible ways experimental values could be assigned to all groups.
  • Sequential Assignment (assigning the first patient to the first group, the second patient to the second group, the third to the first group… and so on), followed by Treatment-as-Usual (accepted protocols for treatment).
  • Sequential Sampling.
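
To make the randomization test in the list above concrete, here is a minimal sketch in Python. It assumes two groups and uses the absolute difference in group means as the test statistic; the function name and the example scores are hypothetical, and other statistics could be used instead.

```python
from itertools import combinations
from statistics import mean


def randomization_test(group_a, group_b):
    """Exact two-sided randomization test on the difference in means.

    Enumerates every way the pooled values could have been split into
    groups of the original sizes, and returns the share of splits whose
    mean difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point ties count as "as extreme".
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical test scores for two small groups.
p_value = randomization_test([1, 2, 3], [4, 5, 6])
print(p_value)
```

Full enumeration is only practical for small groups, because the number of possible splits grows very quickly; for larger studies a random sample of splits is commonly used instead.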

Threats to Validity

Assignment bias can be a threat to internal validity, because it allows two different explanations for differences in treatment results. For example, if you find that a weight loss procedure results in weight loss of more than 50 lbs, it could be that the treatment is actually effective, or it could be that the differences are because people in the experimental group weigh more at the outset (and therefore, have more to lose). In more technical terms, unmeasured extraneous variables (e.g. extra weight) might be interfering with the relationship between the independent variable and dependent variable.
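
The weight loss example can be illustrated with a small simulation. This is a sketch under stated assumptions, not real data: baseline weights are drawn from a made-up normal distribution, and everyone loses the same fraction of their starting weight whether or not they are treated, so the treatment has no true effect. All names and numbers are hypothetical.

```python
import random
from statistics import mean


def simulate_weight_loss(baselines, loss_fraction=0.10):
    """Pounds lost when everyone loses the same fraction of baseline weight.

    The loss does not depend on treatment, so any gap between groups
    comes from how people were assigned, not from the treatment.
    """
    return [weight * loss_fraction for weight in baselines]


def apparent_effect(treatment, control):
    """Difference in mean weight loss between the two groups."""
    return mean(simulate_weight_loss(treatment)) - mean(simulate_weight_loss(control))


rng = random.Random(0)
baseline = [rng.gauss(200, 30) for _ in range(200)]  # hypothetical weights in lbs

# Biased assignment: the heaviest half goes to the treatment group.
ordered = sorted(baseline)
biased_control, biased_treatment = ordered[:100], ordered[100:]

# Random assignment: shuffle, then split in half.
shuffled = baseline[:]
rng.shuffle(shuffled)
random_control, random_treatment = shuffled[:100], shuffled[100:]

print("biased assignment:", apparent_effect(biased_treatment, biased_control))
print("random assignment:", apparent_effect(random_treatment, random_control))
```

Under the biased split, the treatment group appears to lose more weight even though the treatment does nothing; under random assignment the gap should be much smaller.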

Assignment bias can also be a threat to external validity “…if it affects study results, leading to inaccurate estimates of the relationships between variables in a population” (Dattalo, 2010). In other words, if you take your questionable experimental results and apply them to the broader population, this results in issues with external validity.

Dattalo, P. (2010). Strategies to Approximate Random Sampling and Assignment. Oxford University Press, USA.



Some biases are universally abundant in metazoans

Over 40,000 compositionally biased (CB) regions in thirteen metazoan proteomes were assigned using the procedures described in Methods. Briefly, protein sequences are initially scanned for the lowest-probability subsequences (LPSs) for single amino-acid types; subsequently, an exhaustive search for LPSs for multiple residue types is performed iteratively until convergence, to define CB region boundaries. A CB region is labelled with a CB signature (denoted {abc...}, where a, b, c, ... are the residue types that it comprises, in decreasing order of significance). Each CB region has an associated Pmin value. Any region with an initial strong bias for residue type a, and any number of other subsidiary biases, is denoted {a(X)n}. It is important to note that these P-values are only meaningful in a relative sense; the process of probability minimization provides a way to define boundaries for regions comprising complex compositional biases that are distributed or mingled over the length of a particular subsequence.
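
The single-residue scan can be sketched roughly as follows. The Methods are not reproduced here, so this is only an illustration under assumptions: it scores each window with a binomial tail probability for one residue type, takes the background frequency as a given parameter, and searches all windows exhaustively. The function names, the `min_len` parameter and the example sequence are hypothetical, and the iterative multi-residue step is not shown.

```python
from math import comb


def binomial_tail(k, n, f):
    """P(X >= k) for X ~ Binomial(n, f)."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


def lowest_probability_subsequence(seq, residue, background_freq, min_len=5):
    """Return (p_min, start, end) for the window least likely by chance.

    `end` is exclusive. Every window of at least `min_len` residues is
    scored by the probability of seeing at least as many copies of
    `residue` as it contains, given the background frequency.
    """
    # Prefix counts give the number of `residue` in any window in O(1).
    prefix = [0]
    for ch in seq:
        prefix.append(prefix[-1] + (1 if ch == residue else 0))

    best = (1.0, 0, 0)
    for start in range(len(seq)):
        for end in range(start + min_len, len(seq) + 1):
            count = prefix[end] - prefix[start]
            p = binomial_tail(count, end - start, background_freq)
            if p < best[0]:
                best = (p, start, end)
    return best


# Hypothetical sequence with a run of ten Q residues.
sequence = "MKT" + "Q" * 10 + "LSA"
p_min, start, end = lowest_probability_subsequence(sequence, "Q", 0.05)
print(p_min, sequence[start:end])
```

The exhaustive double loop is quadratic in sequence length, which is fine for a toy example but would need optimizing at proteome scale.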

What are the most consistently abundant biases across all of the metazoan proteomes? To answer this question, each bias type was ranked in decreasing order of abundance within each proteome. Then, across all of the proteomes, the mean of this ranking was calculated, as well as the number of times each bias type occurred in the top ten of the rankings. The twenty-five bias types with the smallest mean ranking values are listed in Table 1. Strikingly, ten single- and double-residue biases are consistently highly ranked in these proteomes: {C}, {P}, {GP}, {Q}, {ED}, {G}, {E}, {S}, {H} and {T} occur in the top ten of at least six species, both vertebrate and invertebrate (Tables 1 and 2).
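
The ranking procedure just described can be sketched as below. The counts are invented for illustration, and two details are assumptions rather than anything stated in the text: ties are broken alphabetically, and a bias type absent from a proteome is given a rank one worse than that proteome's last rank.

```python
from statistics import mean


def rank_biases(counts):
    """Map each bias type to its 1-based rank by decreasing abundance."""
    ordered = sorted(counts, key=lambda bias: (-counts[bias], bias))
    return {bias: position + 1 for position, bias in enumerate(ordered)}


def summarize(proteomes, top_n=10):
    """Return [(bias, (mean_rank, times_in_top_n)), ...] sorted by mean rank."""
    per_proteome = {name: rank_biases(c) for name, c in proteomes.items()}
    all_biases = sorted({b for ranks in per_proteome.values() for b in ranks})
    summary = {}
    for bias in all_biases:
        # Assumption: a missing bias type gets the worst rank plus one.
        ranks = [r.get(bias, len(r) + 1) for r in per_proteome.values()]
        summary[bias] = (mean(ranks), sum(rank <= top_n for rank in ranks))
    return sorted(summary.items(), key=lambda item: item[1][0])


# Hypothetical region counts per bias type in two proteomes.
toy_proteomes = {
    "human": {"{Q}": 50, "{P}": 80, "{C}": 90},
    "fruitfly": {"{Q}": 120, "{P}": 60, "{C}": 70},
}
for bias, (mean_rank, top_hits) in summarize(toy_proteomes, top_n=2):
    print(bias, mean_rank, top_hits)
```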

Table 1

Universally abundant compositional biases ***

Table 2

Top biases for the thirteen metazoan proteomes (*)

Some abundant species-specific biases stand out. For example, {Q} regions are more abundant in the fruitfly than in any of the other proteomes (Table 2) and, in combination with {QH} regions (the second most prevalent bias in fruitfly) and {QPH} regions, comprise 13% of all the CB regions in that organism. These CB regions will be discussed in more detail below.

Other examples of abundant species-specific biases may be indicative of spurious gene predictions. Examination of examples of the many {HT} and {CV} regions found in the two puffer-fish proteomes (Table 2) indicates that they arise from genome regions with simple repeats and typically have poorly predicted introns; they may thus arise from systematic errors in gene prediction.

Although many of the most abundant biases across the metazoans are made from either one or two residue types, most biased regions comprise a larger number of residue types, with a broad mode from about 3 to 5 residue types. This is illustrated for the human proteome (Figure 1). More than a quarter (~27%) of the human CB regions have signatures of ≥ 6 residue types; this is because the bias assignment algorithm can detect CB regions that are composed of multiple milder single-residue biases. (An example of such a region is given in Figure 7(C) below.)

Figure 1

Number of bias residue types per CB region in the human proteome. The number of bias residue types per CB region is binned in a bar chart (x-axis). The total occurrences for each 'number of bias residue types' is on the y-axis.

Figure 7

Examples of assigned CB regions. In each case, the name of the protein, its current Ensembl identifier, its CB signature and Pmin value are indicated. The CB region is in bold and underlined; the rest of the sequence is in plain text. The proteins are...

Functional biases and predicted protein disorder content of the top ten biases in human and Drosophila

Obviously, these bias prevalences represent many diverse types of protein subsequence; therefore, to pick out specific subpopulations that are of interest, we need to perform some further characterizations. To this end, for the CB regions in both the human and Drosophila proteomes, after filtering for coiled coils and known protein structures, we examined: (i) significant functional associations based on Gene Ontology (GO) categories and terms; (ii) predicted protein disorder content (using the program DISOPRED [12]); (iii) CB region length; (iv) CB region conservation. We focus specifically on Q-based and E-based biases, as specific examples.

Tables 3 and 4 show that most of the top ten biases (6/10 for both human and Drosophila) come from the 'universally prevalent' list; some of these have significant associations with transcriptional functional categories and with nuclear localization. These CB regions also have moderate to high predicted protein disorder contents (D value ~0.4–0.8) (Tables 3 and 4). The D value is the fraction of the CB region that is predicted to be disordered by the program DISOPRED [12].

Table 3

Most abundant CB regions in Human and their significant functional associations and predicted protein disorder (*)

Table 4

Top Ten Biases for Fruitfly, and their significant functional associations and protein disorder values (*)

For example, {ED} regions in human have significant associations with 'nucleus' and 'DNA-dependent regulation of transcription', and are on average predicted to be moderately disordered (mean D value of 0.56) (Table 3). {Q} regions (in both Drosophila and human) and {QH} regions (in Drosophila only) have similar functional associations, and are predicted to be moderately to highly disordered (D ~0.4–0.8) (Tables 3 and 4).

Additionally, we separated GO terms into those that are transcription-associated and those that are not (see Methods for details). Then, using these two 'supercategories', we tested for significant association with the transcription supercategory for each CB region type. For both human and Drosophila, the CB regions that demonstrate such a significant association with the transcription supercategory also have significant associations with individual GO terms linked to transcription (Tables 3 and 4).
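
The statistical test used for the supercategory association is given in Methods, which is not reproduced here. As a minimal sketch, assuming a one-sided hypergeometric enrichment test (the same calculation that underlies a one-sided Fisher's exact test), the association could be checked as below. The function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn without replacement from N,
    of which K belong to the category of interest.

    Here: N annotated proteins in total, K of them in the transcription
    supercategory, n proteins carrying a given CB region type, and k of
    those n falling in the transcription supercategory.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 1000 annotated proteins, 100 transcription-related,
# 50 proteins with the bias of interest, 15 of which are transcription-related.
p_value = hypergeom_upper_tail(15, 100, 50, 1000)
print(p_value)
```

When many CB region types are tested, some correction for multiple testing would normally be applied on top of these raw probabilities.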

Further analysis of nuclear-/transcription-related biases

GO and protein domain associations for the largest CB region grouping, {Q(X)n}

Since {Q} regions, and {Q(X)n} in general, represent the most numerous CB region grouping in either human or Drosophila, we examined the top twenty significant GO assignments for {Q(X)n} regions in more detail for Drosophila and Human, as well as for Rat and Mouse (Table 5). Noticeably, across Drosophila and the three mammals, 'DNA-dependent regulation of transcription', 'transcription factor activity' and 'nucleus' are all highly-ranked functional associations. Similar prevalences are observed for abundant GO terms, if all {Q}+{QH}+{QPH} regions are analyzed in the same way (not shown).

Table 5

Most abundant GO terms for {Q(X)n} CB regions in the fruitfly, mouse, rat and human proteomes *

The {Q(X)n} grouping is also sufficiently numerous that we can count up the most frequently associated globular domains (i.e., domains that are in the same sequences) (Table 6). The most commonly associated domain in both Human and Drosophila is the 'DNA/RNA-binding three-helical bundle', chiefly arising from the 'Homeodomain-like' superfamily. This domain was first found in Drosophila homeotic genes, and occurs widely in transcription factors; related domains are also used in other DNA-binding proteins, such as telomeric proteins, recombinases, etc.

Table 6

Associated SCOP domains for {Q(X)n} regions in Human and Fruitfly (*)

CB region length

In general, the nuclear-/transcription-related biases show a mode in region length at 20–40 residues. This is shown specifically for {QH} regions in Figure 2. A similar fall-off is observed for the distribution for the subset of {QH} regions that are labelled in the GO classification as associated with 'transcription' or localization in the 'nucleus'. A 'blow-up' of the overall {QH} histogram (Figure 3) demonstrates that these regions are not adequately analysed simply as homopolymeric tracts. The subsidiary nature of the H component of the bias is evident, as it is interspersed with longer homopolymeric runs of Q.

Figure 2

Distribution of lengths of {QH} regions in D. melanogaster. There are two histograms: the overall distribution (red bars), and the nuclear- or transcription-related proteins (blue bars). The nuclear- and transcription-related proteins have been compiled...

Figure 3

A 'blow-up' of the overall distribution of {QH} region lengths. The {QH} regions are listed horizontally in order of increasing length; Q residues are coloured red and H residues green, with other residues in black.

CB region conservation

As case studies, we examined the conservation of {Q(X)n} and {E(X)n} regions in other metazoans, relative to human. Orthologs of proteins were determined with the bi-directional best hits approach, using BLASTP [13] (e-value ≤ 0.0001 with alignment over 0.6 of the length of both sequences, both with and without masking compositionally biased parts). We analysed the fraction of orthologs that maintain a biased region of the same character ({Q(X)n} or {E(X)n}) (Table 7). Generally, these regions (filtered for coiled coils) show high conservation in orthologs from other mammals (60–80% depending on criteria), and low conservation in invertebrates (0–50%) (Table 7). These numbers broadly cover a diverse set of CB regions; visual curation reveals that shorter {Q(X)n} and {E(X)n} CB regions consisting of short homopolymeric runs of {Q} are not conserved from human to invertebrates, and that all of the regions that are conserved are longer (> ~90 residues). Indeed, this lack of conservation in invertebrates is also evident when one examines specifically the {Q}+{QH}+{QPH} and {ED}+{E} subsets (Table 7). A multiple alignment of FOXP2, a gene important in language in humans, is illustrated as an example of conservation of a {Q} region defined in vertebrate proteomes (Figure 4).
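
The bi-directional best hits step can be sketched as follows. This assumes the BLASTP results have already been parsed into tuples of (query, subject, e-value, query coverage, subject coverage), and it picks each query's best hit by lowest e-value; both the tuple layout and that tie-breaking rule are assumptions, as are the function names and the toy identifiers. The e-value and coverage thresholds are the ones quoted above.

```python
def best_hits(hits, max_evalue=1e-4, min_coverage=0.6):
    """Return {query: best subject} among hits that pass the thresholds.

    `hits` is an iterable of (query, subject, evalue, query_cov, subject_cov),
    where the coverages are the aligned fraction of each sequence.
    """
    best = {}
    for query, subject, evalue, query_cov, subject_cov in hits:
        if evalue > max_evalue or query_cov < min_coverage or subject_cov < min_coverage:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **thresholds):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b, **thresholds)
    reverse = best_hits(b_vs_a, **thresholds)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical parsed hits between two proteomes.
human_vs_fly = [
    ("H1", "F1", 1e-50, 0.9, 0.8),
    ("H1", "F2", 1e-20, 0.9, 0.9),
    ("H2", "F2", 1e-30, 0.7, 0.7),
    ("H3", "F3", 1e-3, 0.9, 0.9),   # fails the e-value threshold
]
fly_vs_human = [
    ("F1", "H1", 1e-48, 0.8, 0.9),
    ("F2", "H1", 1e-25, 0.9, 0.9),
    ("F2", "H2", 1e-28, 0.7, 0.7),
    ("F3", "H3", 1e-3, 0.9, 0.9),   # fails the e-value threshold
]
orthologs = reciprocal_best_hits(human_vs_fly, fly_vs_human)
print(orthologs)
```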

Table 7

Conservation of {Q(X)n} and {E(X)n} biased regions (*)

Figure 4

Example of conservation of a {Q} region in vertebrates: FOXP2 and its orthologs. A multiple alignment is shown for FOXP2 and its orthologs in other vertebrates, made using the MUSCLE program [21]; the {Q} region is highlighted in red if its P-value was...

Predicted protein disorder – general observations

Prediction of protein disorder has recently been the focus of much research activity [1,12,14]. Such regions present a challenge for further proteome-scale experimental characterization. We analyzed the predicted protein disorder content of the human and Drosophila CB regions, using the program DISOPRED [12]. In summed total (simply adding up the total amounts of residues), the human CB region data is predicted to be ~42% disordered, with a similar value observed for the fruitfly (45%). This compares to 17% (human) and 15% (fruitfly) for the whole proteomes of these organisms, indicating a strong relationship between the defined CB regions and predicted protein disorder. However, most predicted protein disorder is not defined as compositionally biased (67% of predicted protein disorder regions ≥ 20 residues in human, and 72% in fruitfly). Figure 5 shows that the distribution of the fraction of disorder (denoted D) predicted for each CB region, for human and fruitfly, is approximately uniform; a wide diversity of predicted protein disorder contents is also illustrated by plots of D versus CB region length (shown for human in Figure 6).
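
The two disorder measures used here, the per-region D value and the summed total, can differ noticeably, as the sketch below shows. It assumes the disorder predictions are available as one True/False flag per residue; that representation, the function names and the toy regions are hypothetical and do not reflect the actual output format of DISOPRED.

```python
def disorder_fraction(disorder_mask, start, end):
    """D value: fraction of residues in [start, end) predicted disordered."""
    region = disorder_mask[start:end]
    if not region:
        raise ValueError("empty region")
    return sum(region) / len(region)


def summed_disorder_fraction(regions):
    """Summed total: disordered residues over all residues in all regions.

    `regions` is an iterable of (disorder_mask, start, end) triples.
    """
    disordered = 0
    total = 0
    for disorder_mask, start, end in regions:
        segment = disorder_mask[start:end]
        disordered += sum(segment)
        total += len(segment)
    return disordered / total


# Hypothetical regions: a short fully disordered one and a long ordered one.
toy_regions = [
    ([True] * 10, 0, 10),
    ([False] * 30, 0, 30),
]
d_values = [disorder_fraction(mask, s, e) for mask, s, e in toy_regions]
print(d_values, summed_disorder_fraction(toy_regions))
```

Because the summed total weights each region by its length, long regions dominate it, whereas a histogram of D values counts every region equally.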

Figure 5

The fraction of predicted disorder (denoted D in the text) is binned as a bar chart for both the human and fruitfly proteomes. The bin p-q contains all values D such that p ≤ D < q. The proportion of occurrences in each bin is given on...

Figure 6

Plot of the D value versus the length of a CB region for the human proteome.

We examined the inferred cellular compartment for the CB regions, divided into four groupings according to their D values, and then calculated the propensity for each compartment within each disorder grouping (Table 8). In human, biased regions have a propensity to be nuclear if D > 0.25; in the fruitfly, they have a propensity to be nuclear regardless of D value. Also, for very high disorder values (D > 0.75), there is significant linkage to both nuclear and cytoplasmic compartments in both human and fruitfly.

Table 8

Cellular compartments for proteins with CB regions with different D values (*)
