R-loop regions (RL regions) are consensus sites of R-loop formation derived from a meta-analysis of the R-loop mapping experiments in RLBase. RLHub includes information about these regions (rlregions_meta), the overlap of genomic features with RL regions (rlregions_annot), and the read count matrices for each sample in RLBase within each RL region (rlregions_counts).

rlregions_annot(quiet = FALSE)

rlregions_meta(quiet = FALSE)

rlregions_counts(quiet = FALSE)

Arguments

quiet

If TRUE, messages are suppressed. Default: FALSE.

Value

A list of tbl objects. See details.

Details

R-loop regions (RL regions) are consensus sites of R-loop formation derived from a meta-analysis of the R-loop mapping experiments in RLBase. In RLHub, we provide access to:

  • rlregions_meta - The metadata describing these RL regions.

  • rlregions_annot - RL regions overlapped with genome features (see annotations).

  • rlregions_counts - RL region read counts across RLBase samples.

Source

RL Regions - rlregions_meta

RL Regions were derived during this step of the RLBase-data protocol.

Here is a brief summary of the key processing steps.

  1. Human RLBase samples which were labeled as "POS", classified as "POS" (see rlfs_res), and which have at least 5000 peaks were selected.

  2. Then, these peaks were then randomly downsampled to 5000 ranges each and aggregated.

  3. Then, the hg38 genome was binned into 10bp bins and the number of aggregated peaks was overlapped with each bin and counted, producing a bedGraph.

  4. Then, macs3 was implemented to call peaks on the bedGraph file.

  5. Finally, the resulting ranges were combined to create summary ranges.

Note: steps 2-4 were performed separately on datasets from dRNH (catalytically-dead RNaseH1) R-loop mapping experiments and S9.6-based experiments due to the noted differences in R-loop mapping between these modalities. In step 5, they were combined to create summary ranges, and the original source (dRNH or S9.6) was noted in the resulting metadata.

RL Regions anno - rlregions_annot

Once RL regions were generated, they were overlapped with the annotations found in annots_full_hg38(). This was done separately for dRNH-derived, S9.6-derived, and combined RL regions.

RL Regions Counts - rlregions_counts

Once RL regions were generated, Rsubread::featureCounts() was used to count the number of overlapping reads from the .bam alignment files for each sample in RLBase (rlbase_samples) within each RL region (only "combined" regions used).

These data were then VST-transformed and TPM was also calculated.

Structure

RL Regions - rlregions_meta

rlregions_meta() is a tbl with the following structure:

rlregionlocationis_rlfssourceconfidence_levelmedSignalValmedPValmedQValavgSignalValavgPValavgQValnumPeaksnSamplesnStudiesnModesnTissuesip_typesnIPTypespct_caseavgNumPeaksmedNumPeaksgeneIDsallGenesmainGenesis_repeat
All_RL64926chrM:0-16569:.TRUEdRNH S9653.211.96288120.9242017.017822.563848653.89461649.800152373298441636S9.6,dRNH,RNA-m6A,None475.1677980117.7945063NANANATRUE
All_RL19159chr5:49656610-49661950:.FALSEdRNH S9636.552.82661019.1616015.240253.282115108.41481104.6716107393261431534S9.6,RNA-m6A,None,dRNH477.0114973058.9643815NANANATRUE
All_RL50251chr16:46386300-46391280:.FALSEdRNH S9631.683.25856025.2196521.004183.42875587.3323383.3639133750243431633RNA-m6A,S9.6,dRNH,None476.1316965721.6038878NANANATRUE
...........................................................................

Column description:

  • rlregion - The ID of the RL Region.

  • location - The location of the RL Region. This column follows the pattern "chrom:start-stop:strand"

  • is_rlfs - Logical indicating whether the RL region overlaps with an R-loop forming sequence.

  • source - Indicates whether the RL region was discovered from catalytically-dead RNaseH1-based mapping (dRNH), S9.6-based mapping (S96), or both (dRNH S96).

  • conservation_score - The percent of samples in which this range was found (calculated separately for dRNH and S96 sites and then averaged together for sites from dRNH S96 source).

  • medSignalVal - The median broadPeak "signalVal" (see UCSC specification) of sample peaks in the RL region.

  • medPVal - The median broadPeak -log10 pval (see UCSC specification) of sample peaks in the RL region.

  • medQVal - The median broadPeak -log10 qval (adjusted pval) (see UCSC specification) of sample peaks in the RL region.

  • avgSignalVal/avgPVal/avgQVal - same as above, but with mean instead of median.

  • medNumPeaks - The median number of peaks for samples which overlapped with this RL Region.

  • avgNumPeaks - Same as above with mean instead of median.

  • nSamples - The number of samples which overlap with this peak.

  • nModes - The number of different R-loop mapping methods in which this RL region was found.

  • nTissues - The number of different tissues in which this RL region was found.

  • ip_types - The "IP types" (S9.6, dRNH, or None) in which this RL Region was found.

  • nIPType - The number of different "IP types" in which this RL region was found.

  • pct_case - The percentage of samples which overlapped with this RL region that were predicted "POS" by RLSeq::predictCondition().

  • geneIDs - The IDs of all genes overlapping with this RL Region.

  • allGenes - The Gene Symbols of all genes overlapping with this RL Region.

  • mainGenes - The Gene Symbols of all genes (which appear in pathway databases) overlapping with this RL Region.

  • is_repeat - A logical indicating whether the RL Region occurs in a repetitive genomic region as determined by repeat masker (and also peri-centromeric regions were included.)

  • confidence_score - A confidence score for ranking RL Regions.

    • Method:

      • Summary: confidence_score is the geometric mean of features relevant to RL Region robustness with a penalty for regions found only in dRNH and S96 alone.

      • Calculation: The confidence_score is calculated as following: C = m * (cons * ns * mq * ms)^(1/4); where m is a penalty (of 1.25 weight) for RL Regions not found in both S96 and dRNH sets. cons is the standardized log2 transform of the conservation score. ns is the standardization of the nSamples column. mq is the standardized log2 transform of the medQVal column. ms is the standardized log2 transform of the medSignalVal column.

  • corr(R/PVal/PAdj) - Results of correlation analysis between R-loop abundance and expression within each RL Region. (see also gene_exp)

    • Method: R-loop alignment files were summarized within RL Regions to make a count matrix, which was then log2(tpm + 1) normalized. Then, the expression log2(tpm + 1) matrix was summarized to the RL Regions level as the mean expression of genes overlapping with the R-loop. Then, R-loop and expression samples were paired based on the study of origin, sample condition, and other factors (73 sample pairs in total) (see the description of the exp_matchCond column in rlbase_samples). Within each pairing, there were some many-to-many mappings between R-loop and expression samples, due to multiple samples having the same exp_matchCond. In these cases, RL region expression and R-loop abundance was summarized as the mean across redundant samples. Finally, the Spearman correlation between R-loop abundance and expression was calculated within each R-loop region (see cor.test), yielding Rho and the p.value. See also the relevant processing script (buildExpression.R) from the RLBase-data repo.

    • corrR - Spearman's Rho for the correlation between expression and R-loop abundance within each RL Region (see gene_exp).

    • corrPVal - The pvalue from running cor.test as described above.

    • corrPAdj - The result of multiple testing correction on corrPVal using p.adjust with default arguments. (See also p.adjust)

RL Regions anno - rlregions_annot

rlregions_annot() is a tbl with the following structure:

rlregionannotation
All_RL4615CNA__3__4
All_RL4615CNA__1__8
All_RL4615G4Qexp__G4Q_PDS_NaK_GSE63874__95848
......

Column description:

  • rlregion - The RL Region ID corresponding to the rlregions_meta table.

  • annotation - The annotation which overlaps with the query RL region. Annotation names follow the convention "DataBase__Type__ID" (see also annotations).

RL Regions Counts - rlregions_counts

rlregions_counts() is a SummarizedExperiment with the following structure:

  • dim: 64418 532

  • assays

    • cts - Raw read counts of each RLBase sample within each RL Region.

    • tpm - The TPM normalization of cts

    • vst - The VST transform of cts (see also DESeq2::vst()).

  • colData - A DataFrame with structure:

experimentlabelstrand_specificpaired_endmodepredictiondiscardednumPeaksbambam_avail
SRX113814POSFALSEFALSEDRIVENEGTRUE337../RLBase-data/rlbase-data/rlpipes-out/bam/SRX113814/SRX113814_hg38.bamTRUE
SRX113812POSFALSEFALSEDRIPNEGTRUE78../RLBase-data/rlbase-data/rlpipes-out/bam/SRX113812/SRX113812_hg38.bamTRUE
SRX2675003POSTRUEFALSER-ChIPPOSFALSE10331../RLBase-data/rlbase-data/rlpipes-out/bam/SRX2675003/SRX2675003_hg38.bamTRUE
..............................

Column description:

  • experiment - The RLBase sample ID

  • label - The label provided by the data submitters ("POS" or "NEG")

  • strand_specific - A logical indicating whether the data is stranded or not.

  • paired_end - A logical indicating whether the data is paired-end or not.

  • prediction - The predicted label from RLSeq::predictCondition(). Can be "POS" or "NEG".

  • discarded - A logical indicating whether the sample was discarded during model building due to a mismatch with its label.

  • numPeaks - The number of peaks called from this sample.

  • bam - The relative path of the bam file for this sample.

  • bam_avail - A logical indicating whether the bam file was available.

Examples

rlregions <- rlregions_meta()

rlregionsAnno <- rlregions_annot()

rlregionsCounts <- rlregions_counts()