Annotations

Genomic features relevant to R-loops. Both mm10 and hg38 annotations are available.

annots_primary_hg38(quiet = FALSE)

annots_full_hg38(quiet = FALSE)

annots_primary_mm10(quiet = FALSE)

annots_full_mm10(quiet = FALSE)

Arguments

quiet	If TRUE, messages are suppressed. Default: FALSE.

Value

A list of tbl objects. See details.

Details

The returned list holds tbl objects (tidyverse-style data frames), each storing annotations as genomic ranges. The primary annotations (e.g., annots_primary_hg38()) are an abbreviated version of the full annotations (e.g., annots_full_hg38()). The sections below describe the available databases, which databases each accessor provides, and the structure of the returned objects.

Databases available

This section details the annotation databases available in RLHub. See the following section ("Objects available based on accessor function") for a list of which databases are available within each function. All processing was performed using this script as part of the RLBase-data processing protocol.

Centromeres:
- Description: Centromere locations within the genome.
- Source: UCSC table centromeres.
CNA:
- Description: Copy-number alterations found in inherited disorder cell lines. See source for full description. CNV states (0-4) are represented in the data as separate types. For example, Deep deletion (0) sites are accessed with annots_full_hg38()$CNA__0.
- Source: UCSC table coriellDelDup.
Cohesin:
- Description: Manually curated STAG1 and STAG2 ChIP-Seq data, reprocessed by the RLHub authors with chip-r to find consensus STAG1 and STAG2 sites across cell lines.
- Source: Processed data from Pan et al., further processed as part of the RLBase-data pipeline here.
CpG_Islands:
- Description: CpG island predicted locations throughout the genome.
- Source: UCSC table cpgIslandExt.
Encode_CREs:
- Description: The UCSC Encode_CREs table contains putative promoter-like ("prom"), promoter-enhancer-like ("enhP"), distal-enhancer-like ("enhD"), H3K4me3 ("K4me3"), and CTCF ("CTCF") chromatin states across the genome.
- Source: UCSC table encodeCcreCombined.
Encode_Histone:
- Description: Consensus peaks from histone ChIP experiments downloaded from Encode. Biological replicates were summarized with chip-r.
- Source: Encode v121. Manifest of samples downloaded is here.
encodeTFBS:
- Description: The collection of curated transcription-factor binding profiles from Encode, made available through the UCSC table browser.
- Source: UCSC table encRegTfbsClustered.
G4Qexp:
- Description: G4-Quadruplex ChIP-Seq data.
- Source: GEO accession GSE63874.
G4Qpred:
- Description: Re-processed and binned G4-Quadruplex Predictions. The type names for this database are the G4Q prediction classes and follow the pattern tl:N_nl:N_gn:N. tl: the length of guanine tracts in region; nl: number of locations for G4 formation; gn: the number of possible simultaneous G4 structures. For more information, see the source publication here. Due to the large number of possible configurations of tl, nl, and gn, they were binned based on frequency.
- Source: Figshare Rouchka et al. and direct download link.
knownGene_RNAs:
- Description: RNA species provided by UCSC KnownGene, split up by the "transcriptType" column from the source table.
- Source: UCSC table knownGene.
Microsatellite:
- Description: Microsatellite DNA regions predicted based on motif.
- Source: UCSC table microsat.
PolyA:
- Description: List of predicted poly-A sites, split by the "name2" column of the source table.
- Source: UCSC table wgEncodeGencodePolyaV38.
RBP_ChIP:
- Description: ChIP-Seq data sets for RNA-binding proteins (RBPs) generated by Van Nostrand et al. for Encode. Data are split by ChIP target.
- Source: Encode v121. Manifest of samples here, from source study Van Nostrand et al.
RBP_eCLiP:
- Description: eCLiP-Seq data sets for RNA-binding proteins (RBPs) generated by Van Nostrand et al. for Encode. Data are split by eCLiP target.
- Source: Encode v121. Manifest of samples here, from source study Van Nostrand et al.
Repeat_Masker:
- Description: Repeat masker table from UCSC containing genomic annotations for predicted repetitive elements, split by class of repetitive element ("repClass").
- Source: UCSC table rmsk.
skewr:
- Description: Regions of G- or C-skew profiled using the skewr program. See the RLBase-data README.md for steps.
- Source: hg38 and mm10 genomes and gene GTF files from UCSC goldenPath. CpG islands for mm10 and hg38 were obtained as described in the CpG_Islands entry above. Processing used skewr with StochHMM v0.38.
snoRNA_miRNA_scaRNA:
- Description: snoRNA, miRNA, and scaRNA species provided by UCSC table browser and split by the "type" column.
- Source: UCSC table wgRna.
Splice_Events:
- Description: UCSC table of alternative splice events predicted from transcriptome data sets. Split by "name" column.
- Source: UCSC table knownAlt.
Transcript_Features:
- Description: Transcript features (e.g., "exon", "intron", etc) provided by Bioconductor TxDb packages. Split based on the following features: "Exon", "Intron", "fiveUTR", "threeUTR", "TSS", "TTS", "Intergenic".
- Source: TxDb for hg38 and mm10.
tRNAs:
- Description: UCSC table containing predicted tRNA genes.
- Source: UCSC table tRNAs.
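
Databases that are split into several types contribute one list element per type, so all types from one database can be gathered by matching on the name prefix. The following is a minimal base-R sketch of that idea. It uses a small hand-made list (`mock_annots`, hypothetical) in place of the real output of annots_full_hg38(), and every range in it is invented. Only the name CNA__0 is taken from the CNA entry above; the other names are assumed to follow the same pattern.

```r
# Hypothetical stand-in for annots_full_hg38(); all ranges are invented.
make_tbl <- function(chrom, start, end, strand = "+") {
  data.frame(chrom = chrom, start = start, end = end, strand = strand,
             id = seq_along(start), stringsAsFactors = FALSE)
}

mock_annots <- list(
  CNA__0 = make_tbl("chr1", c(100L, 500L), c(200L, 650L)),
  CNA__1 = make_tbl("chr2", 1000L, 1500L),       # assumed name
  CNA__4 = make_tbl("chr3", 42L, 99L),           # assumed name
  Centromeres__Centromeres = make_tbl("chr1", 5000L, 9000L)  # hypothetical name
)

# Keep every element whose name starts with the database prefix "CNA__".
cna <- mock_annots[startsWith(names(mock_annots), "CNA__")]

# Number of ranges recorded for each CNV state present in the mock list.
cna_counts <- vapply(cna, nrow, integer(1))
print(cna_counts)
```

With the real package, the same subsetting should apply to the list returned by annots_full_hg38(), assuming its names follow the pattern described for CNA.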

Objects available based on accessor function

Here, we show which objects are available with each accessor function:

DataBase name	`annots_primary_hg38()`	`annots_primary_mm10()`	`annots_full_hg38()`	`annots_full_mm10()`
Centromeres	FALSE	FALSE	TRUE	FALSE
CNA	FALSE	FALSE	TRUE	FALSE
Cohesin	FALSE	FALSE	TRUE	FALSE
CpG_Islands	TRUE	TRUE	TRUE	TRUE
Encode_CREs	TRUE	TRUE	TRUE	TRUE
Encode_Histone	FALSE	FALSE	TRUE	FALSE
encodeTFBS	FALSE	FALSE	TRUE	FALSE
G4Qexp	FALSE	FALSE	TRUE	FALSE
G4Qpred	TRUE	FALSE	TRUE	FALSE
knownGene_RNAs	TRUE	FALSE	TRUE	FALSE
Microsatellite	FALSE	FALSE	TRUE	TRUE
PolyA	TRUE	FALSE	TRUE	FALSE
RBP_ChIP	FALSE	FALSE	TRUE	FALSE
RBP_eCLiP	FALSE	FALSE	TRUE	FALSE
Repeat_Masker	TRUE	TRUE	TRUE	TRUE
skewr	TRUE	TRUE	TRUE	TRUE
snoRNA_miRNA_scaRNA	TRUE	FALSE	TRUE	FALSE
Splice_Events	FALSE	FALSE	TRUE	TRUE
Transcript_Features	TRUE	TRUE	TRUE	TRUE
tRNAs	TRUE	TRUE	TRUE	TRUE
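
The availability table above can also be queried programmatically. The sketch below transcribes it into a base-R data frame and asks two example questions. The shortened column names (`primary_hg38` and so on) are labels chosen for this illustration and are not part of RLHub.

```r
# Availability table transcribed from the documentation above.
# Column names are shortened labels chosen for this sketch.
avail <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
database             primary_hg38 primary_mm10 full_hg38 full_mm10
Centromeres          FALSE FALSE TRUE FALSE
CNA                  FALSE FALSE TRUE FALSE
Cohesin              FALSE FALSE TRUE FALSE
CpG_Islands          TRUE  TRUE  TRUE TRUE
Encode_CREs          TRUE  TRUE  TRUE TRUE
Encode_Histone       FALSE FALSE TRUE FALSE
encodeTFBS           FALSE FALSE TRUE FALSE
G4Qexp               FALSE FALSE TRUE FALSE
G4Qpred              TRUE  FALSE TRUE FALSE
knownGene_RNAs       TRUE  FALSE TRUE FALSE
Microsatellite       FALSE FALSE TRUE TRUE
PolyA                TRUE  FALSE TRUE FALSE
RBP_ChIP             FALSE FALSE TRUE FALSE
RBP_eCLiP            FALSE FALSE TRUE FALSE
Repeat_Masker        TRUE  TRUE  TRUE TRUE
skewr                TRUE  TRUE  TRUE TRUE
snoRNA_miRNA_scaRNA  TRUE  FALSE TRUE FALSE
Splice_Events        FALSE FALSE TRUE TRUE
Transcript_Features  TRUE  TRUE  TRUE TRUE
tRNAs                TRUE  TRUE  TRUE TRUE
")

# Databases present in the full mm10 set but absent from the primary mm10 set.
mm10_full_only <- avail$database[avail$full_mm10 & !avail$primary_mm10]
print(mm10_full_only)

# Databases available for hg38 (full) that have no mm10 counterpart.
hg38_only <- avail$database[avail$full_hg38 & !avail$full_mm10]
print(hg38_only)
```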

Object structure

Accessor functions (e.g., annots_primary_hg38()) return a named list of tbl objects that specify feature ranges. Below, we detail the naming and structure of each.

List names

The names in the list objects provided by each accessor function (e.g., annots_primary_hg38()) follow the structure DataBase__Type. DataBase is the database from which the annotations were derived, and Type indicates which annotations from that database are included in the tbl. This is necessary because some databases produce more than one type of annotation (e.g., Transcript_Features contains "Exon" (Transcript_Features__Exon) and "Intron" (Transcript_Features__Intron)).
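
As a minimal sketch of working with this naming scheme, the base-R snippet below splits a few example names into their database and type parts. The first two names come from the text above; the third is assumed from the CNA description. The sketch also assumes that the double underscore appears exactly once per name.

```r
# Example names following the DataBase__Type pattern.
nms <- c("Transcript_Features__Exon", "Transcript_Features__Intron", "CNA__0")

# Split on the literal double underscore. Single underscores inside the
# database name (e.g. "Transcript_Features") are left intact.
parts <- strsplit(nms, "__", fixed = TRUE)

name_index <- data.frame(
  name     = nms,
  database = vapply(parts, `[`, character(1), 1),
  type     = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(name_index)

# Types grouped by the database they belong to.
types_by_db <- split(name_index$type, name_index$database)
```

Applied to names(annots_primary_hg38()), the same steps would give an index of the databases and types present in the returned list.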

`tbl` structure

Each tbl returned has the following structure:

chrom	start	end	strand	id
chr1	10015	10498	+	1
chr1	10614	11380	+	2
...

Columns:

"chrom" - the chromosome of the feature range (UCSC style).
"start" - the starting position of the feature range.
"end" - the end position of the feature range.
"strand" - the strand of the feature range.
"id" - a unique ID for the feature range.

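Because every tbl shares the same columns, the list can be stacked into a single long table when that is more convenient. The sketch below shows one way to do this in base R with hand-made data frames (`mock_annots`, hypothetical). The first table reuses the example rows shown above under an arbitrary name, and the second is invented. An `annotation` column records the list name of origin, since the sketch does not assume that id values are unique across tables.

```r
# Two hand-made tables with the documented columns.
mock_annots <- list(
  Transcript_Features__Exon = data.frame(
    chrom = c("chr1", "chr1"), start = c(10015L, 10614L),
    end = c(10498L, 11380L), strand = c("+", "+"), id = 1:2,
    stringsAsFactors = FALSE),
  Transcript_Features__Intron = data.frame(
    chrom = "chr2", start = 300L, end = 900L, strand = "-", id = 1L,
    stringsAsFactors = FALSE)
)

# Check that each table carries the documented columns.
expected_cols <- c("chrom", "start", "end", "strand", "id")
has_cols <- vapply(mock_annots,
                   function(x) all(expected_cols %in% names(x)),
                   logical(1))

# Add the list name as a column, then stack all tables row-wise.
add_name <- function(tbl, nm) {
  tbl$annotation <- nm
  tbl
}
long <- do.call(rbind, unname(Map(add_name, mock_annots, names(mock_annots))))
rownames(long) <- NULL
print(long)
```
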
Examples

annos <- annots_primary_hg38()

annos <- annots_full_hg38()

annos <- annots_primary_mm10()

annos <- annots_full_mm10()