Genomic features relevant to R-loops. Both mm10 and hg38 annotations are available.
annots_primary_hg38(quiet = FALSE)
annots_full_hg38(quiet = FALSE)
annots_primary_mm10(quiet = FALSE)
annots_full_mm10(quiet = FALSE)
quiet | If TRUE, messages are suppressed. Default: FALSE. |
A list of tbl objects. See details.
The list contains tbl objects (tidyverse-style data frames) containing annotations as genomic ranges. The primary annotations (e.g., annots_primary_hg38()) are an abbreviated version of the full annotations (e.g., annots_full_hg38()). See the description below for further details:
This section details the annotation databases which are available in RLHub. See the succeeding section ("Objects available based on accessor") for a list of which databases are available within each function. All processing was performed using this script as part of the RLBase-data processing protocol.
Centromeres
Description: Centromere locations within the genome.
Source: UCSC table centromeres.
CNA
Description: Copy-number alterations found in inherited disorder cell lines. See source for full description. CNV states (0-4) are represented in the data as separate types. For example, Deep deletion (0) sites are accessed with annots_full_hg38()$CNA__0.
Source: UCSC table coriellDelDup.
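As a minimal sketch of the access pattern described for CNA, the snippet below selects the per-state tables by name. It uses a small mock list in place of annots_full_hg38(), since the real accessor needs RLHub and downloads data. The mock rows are invented for illustration only and do not reflect real coordinates.

```r
# Mock stand-in for annots_full_hg38(); rows are illustrative, not real data.
mock_annots <- list(
  CNA__0 = data.frame(chrom = "chr1", start = 100L, end = 200L,
                      strand = "+", id = 1L, stringsAsFactors = FALSE),
  CNA__1 = data.frame(chrom = "chr2", start = 300L, end = 400L,
                      strand = "+", id = 1L, stringsAsFactors = FALSE),
  Transcript_Features__Exon = data.frame(chrom = "chr1", start = 10L, end = 20L,
                                         strand = "+", id = 1L,
                                         stringsAsFactors = FALSE)
)

# Access a single CNV state directly, as in annots_full_hg38()$CNA__0.
deep_deletions <- mock_annots$CNA__0

# Collect every CNV-state table by matching the "CNA__" name prefix.
cna_names  <- grep("^CNA__", names(mock_annots), value = TRUE)
cna_states <- sub("^CNA__", "", cna_names)
cna_tables <- mock_annots[cna_names]
```

With the real data, replacing mock_annots with the result of annots_full_hg38() should give the same kind of selection, assuming the CNA tables are present as listed for that accessor.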
Cohesin
CpG_Islands:
Description: CpG island predicted locations throughout the genome.
Source: UCSC table cpgIslandExt.
Encode_CREs:
Description: The UCSC Encode_CREs table contains putative promoter-like ("prom"), promoter-enhancer-like ("enhP"), distal-enhancer-like ("enhD"), H3K4me3 ("K4me3"), and CTCF ("CTCF") chromatin states across the genome.
Source: UCSC table encodeCcreCombined.
Encode_Histone:
encodeTFBS:
Description: The collection of curated transcription-factor binding profiles from ENCODE, made available through the UCSC Table Browser.
Source: UCSC table encRegTfbsClustered.
G4Qexp:
Description: G4-Quadruplex ChIP-Seq data
Source: GEO accession GSE63874.
G4Qpred:
Description: Re-processed and binned G4-Quadruplex Predictions. The type names for this database are the G4Q prediction classes and follow the pattern tl:N_nl:N_gn:N. tl: the length of guanine tracts in region; nl: number of locations for G4 formation; gn: the number of possible simultaneous G4 structures. For more information, see the source publication here. Due to the large number of possible configurations of tl, nl, and gn, they were binned based on frequency.
Source: Figshare Rouchka et al. and direct download link
knownGene_RNAs:
Description: RNA species provided by UCSC KnownGene, split up by the "transcriptType" column from the source table.
Source: UCSC table knownGene.
Microsatellite:
Description: Microsatellite DNA regions predicted based on motif.
Source: UCSC table microsat.
PolyA:
Description: List of predicted poly-A sites, split by the "name2" column of the source table.
Source: UCSC table wgEncodeGencodePolyaV38.
RBP_ChIP:
Description: ChIP-Seq data sets for RNA-binding proteins (RBPs) generated by Van Nostrand et al. for ENCODE. Data are split by ChIP target.
Source: ENCODE v121. Manifest of samples here from source study Van Nostrand et al.
RBP_eCLiP:
Description: eCLiP-Seq data sets for RNA-binding proteins (RBPs) generated by Van Nostrand et al. for ENCODE. Data are split by eCLiP target.
Source: ENCODE v121. Manifest of samples here from source study Van Nostrand et al.
Repeat_Masker:
Description: Repeat masker table from UCSC containing genomic annotations for predicted repetitive elements, split by class of repetitive element ("repClass").
Source: UCSC table rmsk.
skewr:
Description: Regions of G or C-skew profiled using the skewr program. See the RLBase-data README.md for steps.
Source: From UCSC goldenPath, hg38 and mm10 genomes. hg38 and mm10 gene GTF. CpG islands for mm10 and hg38 provided as described in the CpG_Islands entry above. Processing proceeded using skewr with stochHMM v0.38.
snoRNA_miRNA_scaRNA:
Description: snoRNA, miRNA, and scaRNA species provided by UCSC table browser and split by the "type" column.
Source: UCSC table wgRna.
Splice_Events:
Description: UCSC table of alternative splice events predicted from transcriptome data sets. Split by "name" column.
Source: UCSC table knownAlt.
Transcript_Features:
tRNAs:
Description: UCSC table containing predicted tRNA genes.
Source: UCSC table tRNAs.
Here, we show which objects are available with each accessor function:
DataBase name | annots_primary_hg38() | annots_primary_mm10() | annots_full_hg38() | annots_full_mm10() |
Centromeres | FALSE | FALSE | TRUE | FALSE |
CNA | FALSE | FALSE | TRUE | FALSE |
Cohesin | FALSE | FALSE | TRUE | FALSE |
CpG_Islands | TRUE | TRUE | TRUE | TRUE |
Encode_CREs | TRUE | TRUE | TRUE | TRUE |
Encode_Histone | FALSE | FALSE | TRUE | FALSE |
encodeTFBS | FALSE | FALSE | TRUE | FALSE |
G4Qexp | FALSE | FALSE | TRUE | FALSE |
G4Qpred | TRUE | FALSE | TRUE | FALSE |
knownGene_RNAs | TRUE | FALSE | TRUE | FALSE |
Microsatellite | FALSE | FALSE | TRUE | TRUE |
PolyA | TRUE | FALSE | TRUE | FALSE |
RBP_ChIP | FALSE | FALSE | TRUE | FALSE |
RBP_eCLiP | FALSE | FALSE | TRUE | FALSE |
Repeat_Masker | TRUE | TRUE | TRUE | TRUE |
skewr | TRUE | TRUE | TRUE | TRUE |
snoRNA_miRNA_scaRNA | TRUE | FALSE | TRUE | FALSE |
Splice_Events | FALSE | FALSE | TRUE | TRUE |
Transcript_Features | TRUE | TRUE | TRUE | TRUE |
tRNAs | TRUE | TRUE | TRUE | TRUE |
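The availability table can also be checked programmatically against a returned list. The sketch below assumes list names begin with the database name followed by a double underscore, and uses a mock list rather than a real accessor call. The helper has_database() is hypothetical and not part of RLHub, and the name tRNAs__tRNAs is an invented placeholder.

```r
# Hypothetical helper (not part of RLHub): TRUE if any list element
# belongs to the given database, judged by the "DataBase__" name prefix.
has_database <- function(annots, database) {
  any(startsWith(names(annots), paste0(database, "__")))
}

# Mock stand-in for annots_primary_mm10(); empty tables, names only.
mock_primary_mm10 <- list(
  Transcript_Features__Exon   = data.frame(),
  Transcript_Features__Intron = data.frame(),
  tRNAs__tRNAs                = data.frame()  # placeholder type name
)

has_tf  <- has_database(mock_primary_mm10, "Transcript_Features")
has_cna <- has_database(mock_primary_mm10, "CNA")
```

Matching on the full "DataBase__" prefix, rather than the bare database name, avoids false positives when one database name is a prefix of another.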
Accessor functions (e.g., annots_primary_hg38()) return a named list of tbl objects that specify feature ranges. Below, we detail the naming and structure of each.
The names in the list objects provided by each accessor function (e.g., annots_primary_hg38()) follow this structure: DataBase__Type. DataBase is the database from which annotations were derived and Type indicates the specific annotations from the database which are included in the tbl. This is required as some databases produce > 1 type of annotation (e.g., Transcript_Features contains "Exon" (Transcript_Features__Exon) and "Intron" (Transcript_Features__Intron)).
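To illustrate the DataBase__Type naming scheme, the following sketch splits list names at the first double underscore. It works on a plain character vector of names taken from this page, so it runs without RLHub. With real data, names(annots_full_hg38()) could be used instead, assuming every name contains the separator.

```r
# Example names taken from this page.
nms <- c("Transcript_Features__Exon", "Transcript_Features__Intron", "CNA__0")

# Split each name at the FIRST "__": the left side is the database,
# the right side is the annotation type.
name_table <- data.frame(
  name     = nms,
  database = sub("__.*$", "", nms),
  type     = sub("^.*?__", "", nms, perl = TRUE),
  stringsAsFactors = FALSE
)

# Group the available types by database.
types_by_db <- split(name_table$type, name_table$database)
```

Splitting at the first separator only is a deliberate choice here: it keeps single underscores inside database names such as Transcript_Features intact.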
tbl structure
Each tbl returned has the following structure:
chrom | start | end | strand | id |
chr1 | 10015 | 10498 | + | 1 |
chr1 | 10614 | 11380 | + | 2 |
... |
Columns:
"chrom" - the chromosome of the feature range (UCSC style)
"start" - the starting position of the feature range.
"end" - the end position of the feature range.
"strand" - the strand of the feature range.
"id" - A unique ID for the feature range.
annos <- annots_primary_hg38()
annos <- annots_full_hg38()
annos <- annots_primary_mm10()
annos <- annots_full_mm10()
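Once a list has been retrieved, each element can be handled like an ordinary data frame. The sketch below rebuilds the two example rows from the tbl structure table as a base R data frame, checks the documented columns, and keeps the ranges that overlap a query interval. The query coordinates are arbitrary, and treating both ends as inclusive is an assumption, since the coordinate convention is not stated on this page.

```r
# The two example rows from the tbl structure table, as a plain data frame.
feature_tbl <- data.frame(
  chrom  = c("chr1", "chr1"),
  start  = c(10015L, 10614L),
  end    = c(10498L, 11380L),
  strand = c("+", "+"),
  id     = 1:2,
  stringsAsFactors = FALSE
)

# Confirm the documented column layout before relying on it.
expected_cols <- c("chrom", "start", "end", "strand", "id")
stopifnot(identical(names(feature_tbl), expected_cols))

# Arbitrary query interval; both ends treated as inclusive (assumption).
query_chrom <- "chr1"
query_start <- 10450L
query_end   <- 10550L

# A range overlaps the query if it starts before the query ends
# and ends after the query starts, on the same chromosome.
hits <- feature_tbl[
  feature_tbl$chrom == query_chrom &
    feature_tbl$start <= query_end &
    feature_tbl$end >= query_start,
]
```

For larger tables, a dedicated genomic-ranges package would likely be a better fit than row-wise comparison; this version only shows the shape of the data.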