RLBase Sample Manifest — rlbase

A tbl containing metadata about each sample in RLBase.

rlbase_samples(quiet = FALSE)

Arguments

quiet	If TRUE, messages are suppressed. Default: FALSE.

Value

A tbl.

Details

Source

RLBase samples were curated by hand in Excel from searching keywords such as "R-loops" and "RNA:DNA hybrids" in GEO, SRA, and PubMed. Where R-loop mapping data was publically available, entries were added to the Excel spreadsheet such that every sample (SRX.../ERX.../GSM...) had it's own line. Information was noted for each sample, such as the "mode" (the type of R-loop mapping it was) and the "Condition" (e.g., "RNaseH1", "WKKD", etc). When genomic input controls were available, they were manually matched to the experimental samples for which they could serve as a background control during peak calling.

The up-to-date excel sheet is found here.

Throughout the process of analyzing the data (see RLBase-data), additional metadata was added to the sample sheet (see structure for full account).

Structure

rlbase_samples is a tbl with the structure:

rlsample

label

condition

mode

lab

tissue

genotype

other

PMID

group

family

ip_type

strand_specific

moeity

bisulfite_seq

file_type

experiment_original

control_original

study

name

paired_end

read_length

control

eff_genome_size

genome

prediction

discarded

numPeaks

expsamples

exp_matchCond

coverage_s3

peaks_s3

fastq_stats_s3

bam_stats_s3

report_html_s3

rlranges_rds_s3

rlfs_rda_s3

SRX1070676

POS

S96

DRIPc

Fred Chedin

NT2

27373332

DRIP

S9.6

TRUE

RNA

FALSE

public

GSM1720613

SRP059800

GSM1720613: NT2 DRIPc-seq, rep 1; Homo sapiens; OTHER

FALSE

2706186140

hg38

POS

FALSE

34092

SRX1070685,SRX1070686

WT_NT_SRP059800_NT2

coverage/SRX1070676_hg38.bw

peaks/SRX1070676_hg38.broadPeak

fastq_stats/SRX1070676_hg38__fastq_stats.json

bam_stats/SRX1070676_hg38__bam_stats.txt

reports/SRX1070676_hg38.html

rlranges/SRX1070676_hg38.rds

rlfs_rda/SRX1070676_hg38.rlfs.rda

SRX1070677

POS

S96

DRIPc

Fred Chedin

NT2

27373332

DRIP

S9.6

TRUE

RNA

FALSE

public

GSM1720614

SRP059800

GSM1720614: NT2 DRIPc-seq, rep 2; Homo sapiens; OTHER

FALSE

2706186140

hg38

POS

FALSE

22117

SRX1070685,SRX1070686

WT_NT_SRP059800_NT2

coverage/SRX1070677_hg38.bw

peaks/SRX1070677_hg38.broadPeak

fastq_stats/SRX1070677_hg38__fastq_stats.json

bam_stats/SRX1070677_hg38__bam_stats.txt

reports/SRX1070677_hg38.html

rlranges/SRX1070677_hg38.rds

rlfs_rda/SRX1070677_hg38.rlfs.rda

SRX1070678

POS

S96

DRIP

Fred Chedin

NT2

27373332

DRIP

S9.6

FALSE

DNA

FALSE

public

GSM1720615

SRP059800

GSM1720615: NT2 DRIP-seq, 1; Homo sapiens; OTHER

FALSE

2706186140

hg38

POS

FALSE

73924

SRX1070685,SRX1070686

WT_NT_SRP059800_NT2

coverage/SRX1070678_hg38.bw

peaks/SRX1070678_hg38.broadPeak

fastq_stats/SRX1070678_hg38__fastq_stats.json

bam_stats/SRX1070678_hg38__bam_stats.txt

reports/SRX1070678_hg38.html

rlranges/SRX1070678_hg38.rds

rlfs_rda/SRX1070678_hg38.rlfs.rda

...

Column description:

rlsample - The unique ID of the sample, same as in the SRA.
label - Label corresponding to the author-supplied condition of the sample. "POS" indicates the sample should robustly map R-loops, "NEG" indicates the opposite.
condition - The specific condition for each sample.
mode - The type of R-loop mapping for each sample.
lab - The senior author on the publication from which the data was provided.
tissue - The tissue condition for the sample.
genotype - The sample's genotype.
other - A column for other pertinent metadata provided by the authors.
PMID - The PMID associated with the sample
group - One of "rl" (R-loop mapping) or "exp" (Expression data/RNA-Seq).
family - The family of the "mode" (e.g., "DRIP" includes "sDRIP", "DRIPc", "qDRIP", etc)
ip_type - The IP type of the "mode" for each sample. One of "S9.6", "dRNH" (dead RNaseH1), or "None".
strand_specific - Whether the sample is stranded.
moeity - The moeity which was IP'd (if applicable)
bisulfite_seq - Whether the data uses bisulfite conversion sequencing (e.g., "BisDRIP-Seq" samples)
file_type - The type of data (always "public" for RLBase samples).
experiment_original - The original name of this sample as entered by hand in the curated Excel spreadsheet (usually converted from GSM to SRX).
control_original - Same as above for the accompanying control sample (if applicable.)
study - The SRA study accession for this sample.
name - The sample's name as entered in SRA.
paired_end - A logical indicating whether the data is paired end.
read_length - The read length for the sample.
control - The RLBase ID of the genomic input control sample corresponding to this sample (if applicable)
eff_genome_size - The effective genome size based on read length and genome (calculated with the khmer package) see relevant portion of RLBase-data protocol here.
genome - The UCSC genome ID for this sample.
prediction - The prediction from running RLSeq::predictCondition().
discarded - A logical indicating whether this sample was discarded during model building for a mismatch with its "label" (see models).
numPeaks - The number of peaks called for this sample.
expsamples - The IDs of any corresponding expression samples.
exp_matchCond - The meta data used to match this sample to any corresponding expression samples (if applicable).
- Method: Some R-loop mapping studies also had matched RNA-Seq data. In these cases, they were also recorded with the same metadata (where applicable) as R-loop mapping samples. To match expression and R-loop samples, the study, tissue, genotype, and other columns were compared iteratively for each R-loop sample. If all four were a match with at least one expression samples, then those four columns would be assigned as the exp_matchCond. If only three were available, then they would become the exp_matchCond. To see the order in which columns were checked for possible matches, view the buildExpression.R script in the RLBase-data repo. See also the section on corr(R/PVal/PAdj) column in rlregions.
coverage_s3 - The location of the coverage tracks (.bw) in the AWS S3 bucket for RLBase data ('s3://rlbase_data/').
peaks_s3 - Same as above for peak files (.broadPeak)
fastq_stats_s3 - Same as above for fastq QC statistics data (.json).
bam_stats_s3 - Same as above for BAM QC statistics data (.txt).
report_html_s3 - Same as above for reports from RLSeq::report() (.html).
rlranges_rds_s3 - Same as above for RLRanges R objects, as in RLSeq::RLRanges() (.rds)
rlfs_rda_s3 - Same as above for rlfs_res objects generated by RLSeq::analyzeRLFS() (.rda).

Examples

rlsamples <- rlbase_samples()