| Title: | Codon Usage Bias Analysis |
|---|---|
| Description: | A suite of functions for rapid and flexible analysis of codon usage bias. It provides in-depth analysis at the codon level, including relative synonymous codon usage (RSCU), tRNA weight calculations, machine learning predictions for optimal or preferred codons, and visualization of codon-anticodon pairing. Additionally, it can calculate various gene- specific codon indices such as codon adaptation index (CAI), effective number of codons (ENC), fraction of optimal codons (Fop), tRNA adaptation index (tAI), mean codon stabilization coefficients (CSCg), and GC contents (GC/GC3s/GC4d). It also supports both standard and non-standard genetic code tables found in NCBI, as well as custom genetic code tables. |
| Authors: | Hong Zhang [aut, cre] (ORCID: <https://orcid.org/0000-0002-4064-9432>), Mengyue Liu [aut], Bu Zi [aut] |
| Maintainer: | Hong Zhang <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.2.0.9000 |
| Built: | 2026-05-27 10:20:38 UTC |
| Source: | https://github.com/mt1022/cubar |
A data.frame of mapping from amino acids to codons
aa2codonaa2codon
a data.frame with two columns: amino_acid, and codon.
amino acid corresponding to the codon
codon identity
It is actually the standard genetic code.
aa2codonaa2codon
ca_pairs show possible codon-anticodons pairings
ca_pairs(codon_table = get_codon_table(), domain = "Eukarya", plot = FALSE)ca_pairs(codon_table = get_codon_table(), domain = "Eukarya", plot = FALSE)
codon_table |
a table of genetic code derived from |
domain |
The taxonomic domain of interest. "Eukarya" (default), "Bacteria" or "Archaea". |
plot |
FALSE (default) or TRUE. Whether to keep the columns required for plotting. |
a data.table of codon-anticodon pairing information. The columns represent the pairing type, codon, corresponding anticodon, and the encoded amino acid when the argument "plot" is FALSE.
# get possible codon and anticodon pairings for the vertebrate mitochondrial genetic code ctab <- get_codon_table(gcid = '2') pairing <- ca_pairs(ctab) head(pairing)# get possible codon and anticodon pairings for the vertebrate mitochondrial genetic code ctab <- get_codon_table(gcid = '2') pairing <- ca_pairs(ctab) head(pairing)
check_cds performs comprehensive quality control on coding sequences (CDS)
by filtering sequences based on various criteria and optionally removing start
or stop codons. This function ensures that sequences meet the requirements for
downstream codon usage analysis.
check_cds( seqs, codon_table = get_codon_table(), min_len = 6, check_len = TRUE, check_start = TRUE, check_stop = TRUE, check_istop = TRUE, rm_start = TRUE, rm_stop = TRUE, start_codons = c("ATG") )check_cds( seqs, codon_table = get_codon_table(), min_len = 6, check_len = TRUE, check_start = TRUE, check_stop = TRUE, check_istop = TRUE, rm_start = TRUE, rm_stop = TRUE, start_codons = c("ATG") )
seqs |
Input CDS sequences as a DNAStringSet or compatible object. |
codon_table |
Codon table matching the genetic code of the input sequences.
Generated using |
min_len |
Minimum CDS length in nucleotides (default: 6). |
check_len |
Logical. Check whether CDS length is divisible by 3 (default: TRUE). |
check_start |
Logical. Check whether CDSs begin with valid start codons (default: TRUE). |
check_stop |
Logical. Check whether CDSs end with valid stop codons (default: TRUE). |
check_istop |
Logical. Check for internal stop codons (default: TRUE). |
rm_start |
Logical. Remove start codons from the sequences (default: TRUE). |
rm_stop |
Logical. Remove stop codons from the sequences (default: TRUE). |
start_codons |
Character vector specifying valid start codons (default: "ATG"). |
A DNAStringSet containing filtered and optionally trimmed CDS sequences that pass all quality control checks.
# Perform CDS sequence quality control for a sample of yeast genes s <- head(yeast_cds, 10) print(s) check_cds(s)# Perform CDS sequence quality control for a sample of yeast genes s <- head(yeast_cds, 10) print(s) check_cds(s)
codon_diff takes two set of coding sequences and
perform differential codon usage analysis.
codon_diff(seqs1, seqs2, codon_table = get_codon_table())codon_diff(seqs1, seqs2, codon_table = get_codon_table())
seqs1 |
DNAStringSet, or an object that can be coerced to a DNAStringSet |
seqs2 |
DNAStringSet, or an object that can be coerced to a DNAStringSet |
codon_table |
a table of genetic code derived from |
a data.table of the differential codon usage analysis. Global tests examine wthether a codon
is used differently relative to all the other codons. Family tests examine whether a codon is used
differently relative to other codons that encode the same amino acid. Subfamily tests examine whether
a codon is used differently relative to other synonymous codons that share the same first two nucleotides.
Odds ratio > 1 suggests a codon is used at higher frequency in seqs1 than in seqs2.
yeast_exp_sorted <- yeast_exp[order(yeast_exp$fpkm),] seqs1 <- yeast_cds[names(yeast_cds) %in% head(yeast_exp_sorted$gene_id, 1000)] seqs2 <- yeast_cds[names(yeast_cds) %in% tail(yeast_exp_sorted$gene_id, 1000)] cudiff <- codon_diff(seqs1, seqs2)yeast_exp_sorted <- yeast_exp[order(yeast_exp$fpkm),] seqs1 <- yeast_cds[names(yeast_cds) %in% head(yeast_exp_sorted$gene_id, 1000)] seqs2 <- yeast_cds[names(yeast_cds) %in% tail(yeast_exp_sorted$gene_id, 1000)] cudiff <- codon_diff(seqs1, seqs2)
codon_optimize redesigns a coding sequence by replacing each codon
with its optimal synonymous alternative, while maintaining the same amino
acid sequence. This function supports multiple optimization methods and is
useful for improving protein expression in heterologous systems.
codon_optimize( seq, optimal_codons = optimal_codons, cf = NULL, codon_table = get_codon_table(), level = "subfam", method = "naive", num_sequences = 1, organism = NULL, envname = "cubar_env", attention_type = "original_full", deterministic = TRUE, temperature = 0.2, top_p = 0.95, match_protein = FALSE, spliceai = FALSE )codon_optimize( seq, optimal_codons = optimal_codons, cf = NULL, codon_table = get_codon_table(), level = "subfam", method = "naive", num_sequences = 1, organism = NULL, envname = "cubar_env", attention_type = "original_full", deterministic = TRUE, temperature = 0.2, top_p = 0.95, match_protein = FALSE, spliceai = FALSE )
seq |
A coding sequence as a DNAString object or any object that can be coerced to DNAString. The sequence should not include stop codons. |
optimal_codons |
A table of optimal codons as generated by
|
cf |
Matrix of codon frequencies from |
codon_table |
A codon table defining the genetic code, derived from
|
level |
Character string specifying optimization level: "subfam" (default, within codon subfamilies) or "amino_acid" (within amino acid groups). Required for "naive" and "IDT" methods. |
method |
Character string specifying the optimization algorithm:
|
num_sequences |
Integer. Number of different optimized sequences to generate (default: 1). For "CodonTransformer" with deterministic=FALSE, each sequence is independently sampled. |
organism |
Organism identifier (integer ID or string name) for "CodonTransformer" method. Must be from ORGANISM2ID in CodonUtils (e.g., "Escherichia coli general"). |
envname |
Environment name for "CodonTransformer" method. Should match your conda environment name (default: "cubar_env"). |
attention_type |
Attention mechanism type for "CodonTransformer":
|
deterministic |
Logical. For "CodonTransformer" method:
|
temperature |
Numeric. Controls randomness in non-deterministic mode for "CodonTransformer". Lower values (0.2, default) are conservative; higher values (0.8) increase diversity. Must be positive. |
top_p |
Numeric. Nucleus sampling threshold (0-1) for "CodonTransformer". Only tokens with cumulative probability up to this value are considered. Default: 0.95. |
match_protein |
Logical. For "CodonTransformer", constrains predictions to maintain exact amino acid sequence. Recommended for unusual proteins or high temperature settings (default: FALSE). |
spliceai |
Logical. Whether to predict splice sites using SpliceAI (default: FALSE). Requires appropriate environment setup. |
The return type depends on parameters:
Single DNAString: When num_sequences=1 and spliceai=FALSE
DNAStringSet: When num_sequences>1 and spliceai=FALSE
data.table: When spliceai=TRUE (includes sequences and splice predictions)
Fallahpour A, Gureghian V, Filion GJ, Lindner AB, Pandi A. CodonTransformer: a multispecies codon optimizer using context-aware neural networks. Nat Commun. 2025 Apr 3;16(1):3205.
Jaganathan K, Panagiotopoulou S K, McRae J F, et al. Predicting splicing from primary sequence with deep learning. Cell, 2019, 176(3): 535-548.e24.
cf_all <- count_codons(yeast_cds) optimal_codons <- est_optimal_codons(cf_all) seq <- 'ATGCTACGA' # method "naive": codon_optimize(seq, optimal_codons) # method "IDT": codon_optimize(seq, cf = cf_all, method = "IDT") codon_optimize(seq, cf = cf_all, method = "IDT", num_sequences = 10) # # The following examples requires pre-installation of python package SpliceAI or Codon # # Transformer. see the codon optimization vignette for further details. # seq_opt <- codon_optimize(seq, method = "CodonTransformer", # organism = "Saccharomyces cerevisiae") # seqs_opt <- codon_optimize(seq, method = "CodonTransformer", # organism = "Saccharomyces cerevisiae", num_sequences = 10, # deterministic =FALSE, temperature = 0.4) # seqs_opt <- codon_optimize(seq, cf = cf_all, method = "IDT", # num_sequences = 10, spliceai = TRUE) # seq_opt <- codon_optimize(seq, method = "CodonTransformer", # organism = "Saccharomyces cerevisiae", spliceai = TRUE)cf_all <- count_codons(yeast_cds) optimal_codons <- est_optimal_codons(cf_all) seq <- 'ATGCTACGA' # method "naive": codon_optimize(seq, optimal_codons) # method "IDT": codon_optimize(seq, cf = cf_all, method = "IDT") codon_optimize(seq, cf = cf_all, method = "IDT", num_sequences = 10) # # The following examples requires pre-installation of python package SpliceAI or Codon # # Transformer. see the codon optimization vignette for further details. # seq_opt <- codon_optimize(seq, method = "CodonTransformer", # organism = "Saccharomyces cerevisiae") # seqs_opt <- codon_optimize(seq, method = "CodonTransformer", # organism = "Saccharomyces cerevisiae", num_sequences = 10, # deterministic =FALSE, temperature = 0.4) # seqs_opt <- codon_optimize(seq, cf = cf_all, method = "IDT", # num_sequences = 10, spliceai = TRUE) # seq_opt <- codon_optimize(seq, method = "CodonTransformer", # organism = "Saccharomyces cerevisiae", spliceai = TRUE)
count_codons tabulates the frequency of all 64 possible codons across
input coding sequences. This function provides the foundation for most codon
usage bias analyses in the cubar package.
count_codons(seqs, ...)count_codons(seqs, ...)
seqs |
Coding sequences as a DNAStringSet object, or compatible input that can be coerced to DNAStringSet. |
... |
Additional arguments passed to |
A matrix where rows represent individual CDS sequences and columns represent the 64 possible codons. Each cell contains the frequency count of the corresponding codon in the respective sequence.
# Count codon frequencies across all yeast CDS sequences cf_all <- count_codons(yeast_cds) dim(cf_all) cf_all[1:5, 1:5] # Count codons for a single sequence count_codons(yeast_cds[1])# Count codon frequencies across all yeast CDS sequences cf_all <- count_codons(yeast_cds) dim(cf_all) cf_all[1:5, 1:5] # Count codons for a single sequence count_codons(yeast_cds[1])
create_codon_table generates a codon table from a user-defined data
frame that maps codons to their corresponding amino acids. This function
enables analysis of non-standard or artificial genetic codes not available
in the NCBI genetic code collection.
create_codon_table(aa2codon)create_codon_table(aa2codon)
aa2codon |
A data frame with two required columns:
|
A data.table with four columns:
aa_code: Single-letter amino acid code
amino_acid: Three-letter amino acid abbreviation
codon: Three-nucleotide codon sequence
subfam: Codon subfamily identifier (amino_acid_XY format)
# View the example amino acid to codon mapping head(aa2codon) # Create a custom codon table custom_table <- create_codon_table(aa2codon = aa2codon) head(custom_table)# View the example amino acid to codon mapping head(aa2codon) # Create a custom codon table custom_table <- create_codon_table(aa2codon = aa2codon) head(custom_table)
Estimate Amino Acid Usage Frequencies of CDSs.
est_aau(cf, codon_table = get_codon_table())est_aau(cf, codon_table = get_codon_table())
cf |
matrix of codon frequencies as calculated by 'count_codons()'. |
codon_table |
codon_table a table of genetic code derived from |
a data.table with amino acid frequencies of CDSs. The columns include three-letter abbreviation of the amino acid, single-letter abbreviation, usage frequency of the amino acid in all sequences, and usage frequency proportion.
# estimate amino acid frequencies of yeast genes cf_all <- count_codons(yeast_cds) aau <- est_aau(cf_all) print(aau)# estimate amino acid frequencies of yeast genes cf_all <- count_codons(yeast_cds) aau <- est_aau(cf_all) print(aau)
get_csc calculate codon occurrence to mRNA stability correlation coefficients (Default to Pearson's).
est_csc( seqs, half_life, codon_table = get_codon_table(), cor_method = "pearson" )est_csc( seqs, half_life, codon_table = get_codon_table(), cor_method = "pearson" )
seqs |
CDS sequences of all protein-coding genes. One for each gene. |
half_life |
data.frame of mRNA half life (gene_id & half_life are column names). |
codon_table |
a table of genetic code derived from |
cor_method |
method name passed to 'cor.test' used for calculating correlation coefficients. |
a data.table of codons and their CSCs. The columns include codon, codon stability coefficient, and correlation P-value.
Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker KE, Graveley BR, et al. 2015. Codon optimality is a major determinant of mRNA stability. Cell 160:1111-1124.
# estimate yeast mRNA CSC est_csc(yeast_cds, yeast_half_life)# estimate yeast mRNA CSC est_csc(yeast_cds, yeast_half_life)
est_optimal_codons identifies optimal codons within each codon family
or amino acid group using binomial regression. Optimal codons are those whose
usage correlates positively with high gene expression or negatively with
codon usage bias (ENC), suggesting they are preferred for efficient translation.
est_optimal_codons( cf, codon_table = get_codon_table(), level = "subfam", gene_score = NULL, fdr = 0.001 )est_optimal_codons( cf, codon_table = get_codon_table(), level = "subfam", gene_score = NULL, fdr = 0.001 )
cf |
A matrix of codon frequencies as calculated by |
codon_table |
A codon table defining the genetic code, derived from
|
level |
Character string specifying the analysis level: "subfam" (default, analyzes codon subfamilies) or "amino_acid" (analyzes at amino acid level). |
gene_score |
A numeric vector of gene-level scores used to identify
optimal codons. Length must equal the number of rows in
If not provided, the negative of ENC values will be used (lower ENC = higher bias). |
fdr |
Numeric value specifying the false discovery rate threshold for determining statistical significance of codon optimality (default depends on method). |
A data.table containing the input codon table with additional columns indicating codon optimality status, statistical significance, and effect sizes from the regression analysis. The columns include single-letter abbreviation of the amino acid, three-letter abbreviation, codon, codon subfamily, regression coefficient, regression P-value, Benjamini and Hochberg corrected Q-value, and indication of whether the codon is optimal.
Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker KE, Graveley BR, et al. 2015. Codon optimality is a major determinant of mRNA stability. Cell 160:1111-1124.
# perform binomial regression for optimal codon estimation cf_all <- count_codons(yeast_cds) codons_opt <- est_optimal_codons(cf_all) codons_opt <- codons_opt[optimal == TRUE] codons_opt# perform binomial regression for optimal codon estimation cf_all <- count_codons(yeast_cds) codons_opt <- est_optimal_codons(cf_all) codons_opt <- codons_opt[optimal == TRUE] codons_opt
est_rscu calculates the Relative Synonymous Codon Usage (RSCU) values
for codons, which quantify the bias in synonymous codon usage. RSCU values
indicate whether a codon is used more (>1) or less (<1) frequently than
expected under uniform usage within its synonymous group.
est_rscu( cf, weight = 1, pseudo_cnt = 1, codon_table = get_codon_table(), level = "subfam", incl_stop = FALSE )est_rscu( cf, weight = 1, pseudo_cnt = 1, codon_table = get_codon_table(), level = "subfam", incl_stop = FALSE )
cf |
A matrix of codon frequencies as calculated by |
weight |
A numeric vector of the same length as the number of sequences
in |
pseudo_cnt |
Numeric pseudo count added to avoid division by zero when few sequences are available for RSCU calculation (default: 1). |
codon_table |
A codon table defining the genetic code, derived from
|
level |
Character string specifying the analysis level: "subfam" (default, analyzes codon subfamilies) or "amino_acid" (analyzes at amino acid level). |
incl_stop |
Logical. Whether to include RSCU values for stop codons in the output (default: FALSE). |
A data.table containing the codon table with additional columns for RSCU analysis: usage frequency counts (cts), frequency proportions (prop), CAI weights (w_cai), and RSCU values (rscu). The table includes amino acid codes, full amino acid names, codons, and subfamily classifications.
Sharp PM, Tuohy TM, Mosurski KR. 1986. Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res 14:5125-5143.
# Calculate RSCU for all yeast genes cf_all <- count_codons(yeast_cds) rscu_all <- est_rscu(cf_all) head(rscu_all) # Calculate RSCU for highly expressed genes (top 500) heg <- head(yeast_exp[order(-yeast_exp$fpkm), ], n = 500) cf_heg <- count_codons(yeast_cds[heg$gene_id]) rscu_heg <- est_rscu(cf_heg) head(rscu_heg)# Calculate RSCU for all yeast genes cf_all <- count_codons(yeast_cds) rscu_all <- est_rscu(cf_all) head(rscu_all) # Calculate RSCU for highly expressed genes (top 500) heg <- head(yeast_exp[order(-yeast_exp$fpkm), ], n = 500) cf_heg <- count_codons(yeast_cds[heg$gene_id]) rscu_heg <- est_rscu(cf_heg) head(rscu_heg)
est_trna_weight calculates tRNA weights for each codon based on tRNA
availability and codon-anticodon pairing efficiency. These weights are used
in tRNA Adaptation Index (TAI) calculations and reflect how well each codon
is supported by the cellular tRNA pool.
est_trna_weight( trna_level, codon_table = get_codon_table(), domain = "Eukarya", s = NULL )est_trna_weight( trna_level, codon_table = get_codon_table(), domain = "Eukarya", s = NULL )
trna_level |
A named numeric vector of tRNA expression levels or gene copy numbers. Names should be in the format "AminoAcid-Anticodon" (e.g., "Ala-GCA"). Each value represents the abundance of that tRNA species. |
codon_table |
A codon table defining the genetic code, derived from
|
domain |
Character string specifying the taxonomic domain: "Eukarya" (default), "Bacteria", or "Archaea". This determines the codon-anticodon pairing rules and selection penalties. Specify either "domain" or "s". |
s |
A named list of selection penalties for non-Watson-Crick pairings. If provided, overrides the default domain-specific penalties. Specify either "domain" or "s". |
A data.table containing comprehensive tRNA weight information with columns:
aa_code: Single-letter amino acid code
amino_acid: Three-letter amino acid abbreviation
codon: Codon sequence
subfam: Codon subfamily identifier
anticodon: Corresponding anticodon sequence
trna_id: tRNA identifier (amino_acid-anticodon)
ac_level: tRNA abundance level
W: Absolute adaptiveness value
w: Relative adaptiveness (normalized weight for TAI)
dos Reis M, Savva R, Wernisch L. 2004. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res 32:5036-5044.
Sabi R, Tuller T. 2014. Modelling the efficiency of codon-tRNA interactions based on codon usage bias. DNA Res 21:511-526.
# Calculate tRNA weights for yeast using gene copy numbers yeast_trna_w <- est_trna_weight(yeast_trna_gcn) head(yeast_trna_w) # View the weight distribution hist(yeast_trna_w$w, main = "Distribution of tRNA weights")# Calculate tRNA weights for yeast using gene copy numbers yeast_trna_w <- est_trna_weight(yeast_trna_gcn) head(yeast_trna_w) # View the weight distribution hist(yeast_trna_w$w, main = "Distribution of tRNA weights")
extract_trna_gcn processes tRNA sequence data from GtRNADB to
extract gene copy numbers for each tRNA type. This information is essential
for calculating tRNA availability weights used in TAI analysis.
extract_trna_gcn(trna_seq)extract_trna_gcn(trna_seq)
trna_seq |
A named vector or DNAStringSet of tRNA sequences, typically from GtRNADB. Sequence names should follow the standard format containing amino acid and anticodon information (e.g., "tRNA-Ala-AGC-1-1"). |
A named table of tRNA gene copy numbers. Names are in the format "AminoAcid-Anticodon" (e.g., "Ala-AGC") and values represent the count of genes encoding each tRNA type. Initiator tRNAs (iMet, fMet) and undetermined tRNAs (Und-NNN) are automatically excluded as they serve specialized functions in translation initiation.
# Extract tRNA gene copy numbers for yeast trna_gcn <- extract_trna_gcn(yeast_trna) head(trna_gcn) # View the distribution of tRNA gene copies hist(trna_gcn, main = "Distribution of tRNA gene copy numbers")# Extract tRNA gene copy numbers for yeast trna_gcn <- extract_trna_gcn(yeast_trna) head(trna_gcn) # View the distribution of tRNA gene copies hist(trna_gcn, main = "Distribution of tRNA gene copy numbers")
Calculate Amino Acid Usage Frequencies of each CDS.
get_aau(cf, codon_table = get_codon_table())get_aau(cf, codon_table = get_codon_table())
cf |
matrix of codon frequencies as calculated by 'count_codons()'. |
codon_table |
a table of genetic code derived from |
a matrix of amino acid frequencies for each CDS. Each row corresponds to a sequence, and each column represents an amino acid.
# estimate amino acid frequencies of yeast CDSs cf_all <- count_codons(yeast_cds) aau_gene <- get_aau(cf_all) head(aau_gene)# estimate amino acid frequencies of yeast CDSs cf_all <- count_codons(yeast_cds) aau_gene <- get_aau(cf_all) head(aau_gene)
get_cai calculates the Codon Adaptation Index (CAI) for each input
coding sequence. CAI measures how similar the codon usage of a gene is to
that of highly expressed genes, serving as an indicator of translational
efficiency. Higher CAI values suggest better adaptation to the translational
machinery.
get_cai(cf, rscu, level = "subfam")get_cai(cf, rscu, level = "subfam")
cf |
A matrix of codon frequencies as calculated by |
rscu |
An RSCU table containing CAI weights for each codon. This table
should be generated using |
level |
Character string specifying the analysis level: "subfam" (default, analyzes codon subfamilies) or "amino_acid" (analyzes at amino acid level). |
A named numeric vector of CAI values ranging from 0 to 1. Names correspond to sequence identifiers from the input matrix. Values closer to 1 indicate higher similarity to highly expressed genes.
Sharp PM, Li WH. 1987. The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15:1281-1295.
# Calculate CAI for yeast genes based on RSCU of highly expressed genes heg <- head(yeast_exp[order(-yeast_exp$fpkm), ], n = 500) cf_all <- count_codons(yeast_cds) cf_heg <- cf_all[heg$gene_id, ] rscu_heg <- est_rscu(cf_heg) cai <- get_cai(cf_all, rscu_heg) head(cai) hist(cai, main = "Distribution of CAI values")# Calculate CAI for yeast genes based on RSCU of highly expressed genes heg <- head(yeast_exp[order(-yeast_exp$fpkm), ], n = 500) cf_all <- count_codons(yeast_cds) cf_heg <- cf_all[heg$gene_id, ] rscu_heg <- est_rscu(cf_heg) cai <- get_cai(cf_all, rscu_heg) head(cai) hist(cai, main = "Distribution of CAI values")
get_codon_table creates a standardized codon table based on genetic
codes cataloged by NCBI. This function provides the mapping between codons
and amino acids for different organisms and organelles, which is essential
for accurate codon usage analysis.
get_codon_table(gcid = "1")get_codon_table(gcid = "1")
gcid |
A character string specifying the NCBI genetic code ID. Use
|
A data.table with four columns:
aa_code: Single-letter amino acid code
amino_acid: Three-letter amino acid abbreviation
codon: Three-nucleotide codon sequence
subfam: Codon subfamily identifier (amino_acid_XY format)
# Standard genetic code (used by most organisms) standard_code <- get_codon_table() head(standard_code) # Vertebrate mitochondrial genetic code mito_code <- get_codon_table(gcid = '2') head(mito_code)# Standard genetic code (used by most organisms) standard_code <- get_codon_table() head(standard_code) # Vertebrate mitochondrial genetic code mito_code <- get_codon_table(gcid = '2') head(mito_code)
get_cscg calculates Mean Codon Stabilization Coefficients of each CDS.
get_cscg(cf, csc)get_cscg(cf, csc)
cf |
matrix of codon frequencies as calculated by |
csc |
table of Codon Stabilization Coefficients as calculated by |
a named vector of cscg values. The names of the elements correspond to the sequence names.
Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker KE, Graveley BR, et al. 2015. Codon optimality is a major determinant of mRNA stability. Cell 160:1111-1124.
# estimate CSCg of yeast genes yeast_csc <- est_csc(yeast_cds, yeast_half_life) cf_all <- count_codons(yeast_cds) cscg <- get_cscg(cf_all, csc = yeast_csc) head(cscg) hist(cscg)# estimate CSCg of yeast genes yeast_csc <- est_csc(yeast_cds, yeast_half_life) cf_all <- count_codons(yeast_cds) cscg <- get_cscg(cf_all, csc = yeast_csc) head(cscg) hist(cscg)
get_dp calculates Deviation from Proportionality of each CDS.
get_dp( cf, host_weights, codon_table = get_codon_table(), level = "subfam", missing_action = "ignore" )get_dp( cf, host_weights, codon_table = get_codon_table(), level = "subfam", missing_action = "ignore" )
cf |
matrix of codon frequencies as calculated by |
host_weights |
a named vector of tRNA weights for each codon that reflects the relative availability of tRNAs in the host organism. |
codon_table |
a table of genetic code derived from |
level |
"subfam" (default) or "amino_acid". If "subfam", the deviation is calculated at the codon subfamily level. Otherwise, the deviation is calculated at the amino acid level. |
missing_action |
Actions to take when no codon of a group were found in a CDS. Options are "ignore" (default), or "zero" (set codon proportions to 0). |
a named vector of dp values. The names of the elements correspond to the sequence names.
Chen F, Wu P, Deng S, Zhang H, Hou Y, Hu Z, Zhang J, Chen X, Yang JR. 2020. Dissimilation of synonymous codon usage bias in virus-host coevolution due to translational selection. Nat Ecol Evol 4:589-600.
# estimate DP of yeast genes cf_all <- count_codons(yeast_cds) trna_weight <- est_trna_weight(yeast_trna_gcn) trna_weight <- setNames(trna_weight$w, trna_weight$codon) dp <- get_dp(cf_all, host_weights = trna_weight) head(dp) hist(dp)# estimate DP of yeast genes cf_all <- count_codons(yeast_cds) trna_weight <- est_trna_weight(yeast_trna_gcn) trna_weight <- setNames(trna_weight$w, trna_weight$codon) dp <- get_dp(cf_all, host_weights = trna_weight) head(dp) hist(dp)
get_enc computes the effective number of codons (ENC) for each coding
sequence, which quantifies the degree of codon usage bias. Lower ENC values
indicate stronger bias (fewer codons are used), while higher values indicate
more uniform codon usage.
get_enc(cf, codon_table = get_codon_table(), level = "subfam")get_enc(cf, codon_table = get_codon_table(), level = "subfam")
cf |
A matrix of codon frequencies as calculated by |
codon_table |
A codon table defining the genetic code, derived from
|
level |
Character string specifying the analysis level: "subfam" (default, analyzes codon subfamilies) or "amino_acid" (analyzes at amino acid level). |
A named numeric vector of ENC values. Names correspond to sequence identifiers from the input matrix. ENC values typically range from 20 (maximum bias) to 61 (uniform usage).
Wright F. 1990. The 'effective number of codons' used in a gene. Gene 87:23-29.
Sun X, Yang Q, Xia X. 2013. An improved implementation of effective number of codons (NC). Mol Biol Evol 30:191-196.
# Calculate ENC for yeast genes cf_all <- count_codons(yeast_cds) enc <- get_enc(cf_all) head(enc) hist(enc, main = "Distribution of ENC values")# Calculate ENC for yeast genes cf_all <- count_codons(yeast_cds) enc <- get_enc(cf_all) head(enc) hist(enc, main = "Distribution of ENC values")
get_fop calculates the fraction of optimal codons (Fop) for each
coding sequence, which represents the proportion of codons that are
considered optimal for translation efficiency. Higher Fop values suggest
stronger selection for optimal codon usage.
get_fop(cf, op = NULL, codon_table = get_codon_table(), ...)get_fop(cf, op = NULL, codon_table = get_codon_table(), ...)
cf |
A matrix of codon frequencies as calculated by |
op |
A character vector specifying which codons are considered optimal.
If not provided, optimal codons will be determined automatically using
|
codon_table |
A codon table defining the genetic code, derived from
|
... |
Additional arguments passed to |
A named numeric vector of Fop values (ranging from 0 to 1). Names correspond to sequence identifiers from the input matrix. Higher values indicate greater usage of optimal codons.
Ikemura T. 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151:389-409.
# Calculate Fop for yeast genes (optimal codons determined automatically) cf_all <- count_codons(yeast_cds) fop <- get_fop(cf_all) head(fop) hist(fop, main = "Distribution of Fop values")# Calculate Fop for yeast genes (optimal codons determined automatically) cf_all <- count_codons(yeast_cds) fop <- get_fop(cf_all) head(fop) hist(fop, main = "Distribution of Fop values")
get_gc calculates the overall GC content (percentage of guanine and
cytosine nucleotides) for each coding sequence.
get_gc(cf)get_gc(cf)
cf |
A matrix of codon frequencies as calculated by |
A named numeric vector of GC content values (ranging from 0 to 1). Names correspond to sequence identifiers from the input matrix.
# Calculate GC content for yeast genes cf_all <- count_codons(yeast_cds) gc <- get_gc(cf_all) head(gc) hist(gc, main = "Distribution of GC content")# Calculate GC content for yeast genes cf_all <- count_codons(yeast_cds) gc <- get_gc(cf_all) head(gc) hist(gc, main = "Distribution of GC content")
Calculate GC content at synonymous 3rd codon positions.
get_gc3s(cf, codon_table = get_codon_table(), level = "subfam")get_gc3s(cf, codon_table = get_codon_table(), level = "subfam")
cf |
matrix of codon frequencies as calculated by |
codon_table |
a table of genetic code derived from |
level |
"subfam" (default) or "amino_acid". For which level to determine GC content at synonymous 3rd codon positions. |
a named vector of GC3s values. The names of the elements correspond to the sequence names.
Peden JF. 2000. Analysis of codon usage.
# estimate GC3s of yeast genes cf_all <- count_codons(yeast_cds) gc3s <- get_gc3s(cf_all) head(gc3s) hist(gc3s)# estimate GC3s of yeast genes cf_all <- count_codons(yeast_cds) gc3s <- get_gc3s(cf_all) head(gc3s) hist(gc3s)
Calculate GC content at synonymous position of codons (using four-fold degenerate sites only).
get_gc4d(cf, codon_table = get_codon_table(), level = "subfam")get_gc4d(cf, codon_table = get_codon_table(), level = "subfam")
cf |
matrix of codon frequencies as calculated by |
codon_table |
a table of genetic code derived from |
level |
"subfam" (default) or "amino_acid". For which level to determine GC contents at 4-fold degenerate sites. |
a named vector of GC4d values. The names of the elements correspond to the sequence names.
# estimate GC4d of yeast genes cf_all <- count_codons(yeast_cds) gc4d <- get_gc4d(cf_all) head(gc4d) hist(gc4d)# estimate GC4d of yeast genes cf_all <- count_codons(yeast_cds) gc4d <- get_gc4d(cf_all) head(gc4d) hist(gc4d)
get_tai calculates the tRNA Adaptation Index (TAI) for each coding
sequence, which measures how well codon usage matches tRNA availability in
the cell. Higher TAI values indicate better adaptation to the tRNA pool,
suggesting more efficient translation.
get_tai(cf, trna_w, w_format = "cubar")get_tai(cf, trna_w, w_format = "cubar")
cf |
A matrix of codon frequencies as calculated by |
trna_w |
A table of tRNA weights for each codon, generated using
|
w_format |
Character string specifying the format of tRNA weights: "cubar" (default, weights from cubar package) or "tAI" (weights from the tAI package format). |
A named numeric vector of TAI values. Names correspond to sequence identifiers from the input matrix. Values range from 0 to 1, with higher values indicating better adaptation to tRNA availability. It should be noted that the tAI package does not include codons that correspond to methionine in its calculation. In contrast, cubar requires removing the start codon and takes into account codons encoding methionine within the sequence. Correspondingly, cubar excludes initiator tRNA (iMet in eukaryotes or fMet in prokaryotes), and focuses on the anticodons that correspond to regular methionine.
dos Reis M, Savva R, Wernisch L. 2004. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res 32:5036-5044.
# calculate TAI of yeast genes based on genomic tRNA copy numbers w <- est_trna_weight(yeast_trna_gcn) # note: check_istop is suppressed to facilitate package development. # We suggest enable this option for real sequence analyses. yeast_cds_qc <- check_cds(yeast_cds, check_istop = FALSE) cf <- count_codons(yeast_cds_qc) tai <- get_tai(cf, w) head(tai) hist(tai)# calculate TAI of yeast genes based on genomic tRNA copy numbers w <- est_trna_weight(yeast_trna_gcn) # note: check_istop is suppressed to facilitate package development. # We suggest enable this option for real sequence analyses. yeast_cds_qc <- check_cds(yeast_cds, check_istop = FALSE) cf <- count_codons(yeast_cds_qc) tai <- get_tai(cf, w) head(tai) hist(tai)
CDSs of 13 protein-coding genes in the human mitochondrial genome extracted from ENSEMBL Biomart
human_mthuman_mt
a DNAStringSet of 13 sequences
<https://www.ensembl.org/index.html>
head(human_mt)head(human_mt)
plot_ca_pairs show possible codon-anticodons pairings
plot_ca_pairs(codon_table = get_codon_table(), pairs = pairs)plot_ca_pairs(codon_table = get_codon_table(), pairs = pairs)
codon_table |
a table of genetic code derived from |
pairs |
a table of codon-anticodon pairing derived from |
a plot on possible codon-anticodons pairings
# plot possible codon and anticodon pairings for the vertebrate mitochondrial genetic code ctab <- get_codon_table(gcid = '2') pairs <- ca_pairs(ctab, plot = TRUE) plot_ca_pairs(ctab, pairs) # plot possible codon and anticodon pairings for the standard genetic code in bacteria plot_ca_pairs(pairs = ca_pairs(domain = "Bacteria", plot = TRUE))# plot possible codon and anticodon pairings for the vertebrate mitochondrial genetic code ctab <- get_codon_table(gcid = '2') pairs <- ca_pairs(ctab, plot = TRUE) plot_ca_pairs(ctab, pairs) # plot possible codon and anticodon pairings for the standard genetic code in bacteria plot_ca_pairs(pairs = ca_pairs(domain = "Bacteria", plot = TRUE))
rev_comp generates the reverse complement of input DNA sequences.
This is commonly used for analyzing complementary strands or anticodon sequences.
rev_comp(seqs)rev_comp(seqs)
seqs |
Input DNA sequences as a DNAStringSet object, or a named vector of sequences that can be coerced to DNAStringSet. |
A DNAStringSet object containing the reverse complemented sequences.
# Reverse complement of codons rev_comp(Biostrings::DNAStringSet(c('TAA', 'TAG')))# Reverse complement of codons rev_comp(Biostrings::DNAStringSet(c('TAA', 'TAG')))
seq_to_codons converts a coding sequence (CDS) into a vector of codons by
splitting the sequence into non-overlapping triplets starting from the first position.
seq_to_codons(seq)seq_to_codons(seq)
seq |
A coding sequence as a DNAString object, or any object that can be coerced to a DNAString (e.g., character string). |
A character vector where each element represents a codon (3-nucleotide sequence).
# Convert a CDS sequence to a sequence of codons seq_to_codons('ATGTGGTAG') seq_to_codons(yeast_cds[[1]])# Convert a CDS sequence to a sequence of codons seq_to_codons('ATGTGGTAG') seq_to_codons(yeast_cds[[1]])
show_codon_tables displays a formatted list of all genetic code tables
available from NCBI, showing their ID numbers and descriptive names. This
function helps users identify the appropriate genetic code ID to use with
get_codon_table().
show_codon_tables()show_codon_tables()
No return value (called for side effects). The function prints a formatted table of available genetic codes to the console, with each line showing the numeric ID and corresponding organism/organelle description.
# Display all available NCBI genetic code tables show_codon_tables()# Display all available NCBI genetic code tables show_codon_tables()
slide creates a data.table defining sliding window positions for
analyzing sequences or data along a continuous range. This function provides
the foundation for positional analyses of codon usage patterns within genes.
slide(from, to, step = 1, before = 0, after = 0)slide(from, to, step = 1, before = 0, after = 0)
from |
Integer specifying the start position of the analysis range. |
to |
Integer specifying the end position of the analysis range. |
step |
Integer specifying the step size between consecutive window centers (default: 1). Larger values create non-overlapping or less overlapping windows. |
before |
Integer specifying the number of positions to include before the window center (default: 0). Determines the left boundary of each window. |
after |
Integer specifying the number of positions to include after the window center (default: 0). Determines the right boundary of each window. |
A data.table with three columns:
start: Start position of each window
center: Center position of each window
end: End position of each window
# Create sliding windows with step size 2 and window size 3 slide(1, 10, step = 2, before = 1, after = 1)# Create sliding windows with step size 2 and window size 3 slide(1, 10, step = 2, before = 1, after = 1)
slide_apply applies a function to a sliding window of codons.
slide_apply(seq, .f, step = 1, before = 0, after = 0, ...)slide_apply(seq, .f, step = 1, before = 0, after = 0, ...)
seq |
DNAString, the sequence |
.f |
function, the codon index calculation function to apply, for
example, |
step |
integer, the step size in number of codons |
before |
integer, the number of codons before the center of a window |
after |
integer, the number of codons after the center of a window |
... |
additional arguments to pass to the function |
data.table with start, center, end, and codon usage index columns
slide_apply(yeast_cds[[1]], get_enc, step = 1, before = 10, after = 10)slide_apply(yeast_cds[[1]], get_enc, step = 1, before = 10, after = 10)
slide_codon creates sliding window intervals specifically designed
for codon-based analysis of DNA sequences. This function automatically
handles codon boundaries and is useful for studying positional effects
in codon usage within genes.
slide_codon(seq, step = 1, before = 0, after = 0)slide_codon(seq, step = 1, before = 0, after = 0)
seq |
A DNA sequence as a DNAString object, or any object that can be coerced to DNAString. |
step |
Integer specifying the step size between consecutive window centers in codons (default: 1). A step of 3 creates non-overlapping windows. |
before |
Integer specifying the number of codons to include before the window center (default: 0). |
after |
Integer specifying the number of codons to include after the window center (default: 0). |
A data.table with three columns containing nucleotide positions:
start: Start nucleotide position of each window
center: Center nucleotide position of each window
end: End nucleotide position of each window
# Create sliding windows for codon analysis x <- Biostrings::DNAString('ATCTACATAGCTACGTAGCTCGATGCTAGCATGCATCGTACGATCGTCGATCGTAG') slide_codon(x, step = 3, before = 1, after = 1)# Create sliding windows for codon analysis x <- Biostrings::DNAString('ATCTACATAGCTACGTAGCTCGATGCTAGCATGCATCGTACGATCGTCGATCGTAG') slide_codon(x, step = 3, before = 1, after = 1)
slide_plot visualizes codon usage in sliding window.
slide_plot(windt, index_name = "Index")slide_plot(windt, index_name = "Index")
windt |
data.table, the sliding window codon usage
generated by |
index_name |
character, the name of the index to display. |
ggplot2 plot.
sw <- slide_apply(yeast_cds[[1]], get_enc, step = 1, before = 10, after = 10) slide_plot(sw)sw <- slide_apply(yeast_cds[[1]], get_enc, step = 1, before = 10, after = 10) slide_plot(sw)
CDSs of all protein-coding genes in Saccharomyces_cerevisiae
yeast_cdsyeast_cds
a DNAStringSet of 6600 sequences
<https://ftp.ensembl.org/pub/release-107/fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz>
head(yeast_cds)head(yeast_cds)
Yeast mRNA FPKM determined from rRNA-depleted (RiboZero) total RNA-Seq libraries. RUN1_0_WT and RUN2_0_WT (0 min after RNA Pol II repression) were averaged and used here.
yeast_expyeast_exp
a data.frame with 6717 rows and three columns:
gene ID
gene name
mRNA expression level in Fragments per kilobase per million reads
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57385>
Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker KE, Graveley BR, et al. 2015. Codon optimality is a major determinant of mRNA stability. Cell 160:1111-1124.
head(yeast_exp)head(yeast_exp)
Half life of yeast mRNAs in Saccharomyces_cerevisiae calculated from rRNA-deleted total RNAs by Presnyak et al.
yeast_half_lifeyeast_half_life
a data.frame with 3888 rows and three columns:
gene id
gene name
mRNA half life in minutes
<https://doi.org/10.1016/j.cell.2015.02.029>
Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker KE, Graveley BR, et al. 2015. Codon optimality is a major determinant of mRNA stability. Cell 160:1111-1124.
head(yeast_half_life)head(yeast_half_life)
Yeast tRNA sequences obtained from gtRNAdb.
yeast_trnayeast_trna
a RNAStringSet with a length of 275.
<http://gtrnadb.ucsc.edu/genomes/eukaryota/Scere3/sacCer3-mature-tRNAs.fa>
Chan PP, Lowe TM. 2016. GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes. Nucleic Acids Res 44:D184-189.
yeast_trnayeast_trna
Yeast tRNA gene copy numbers (GCN) by anticodon obtained from gtRNAdb.
yeast_trna_gcnyeast_trna_gcn
a named vector with a length of 41. Value names are anticodons.
<http://gtrnadb.ucsc.edu/genomes/eukaryota/Scere3/sacCer3-mature-tRNAs.fa>
Chan PP, Lowe TM. 2016. GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes. Nucleic Acids Res 44:D184-189.
yeast_trna_gcnyeast_trna_gcn