A tbl containing metadata about each sample in RLBase.

rlbase_samples(quiet = FALSE)

Arguments

quiet

If TRUE, messages are suppressed. Default: FALSE.

Value

A tbl.

Details

Source

RLBase samples were curated by hand in Excel by searching GEO, SRA, and PubMed for keywords such as "R-loops" and "RNA:DNA hybrids". Where R-loop mapping data was publicly available, entries were added to the Excel spreadsheet such that every sample (SRX.../ERX.../GSM...) had its own line. Information was noted for each sample, such as the "mode" (the type of R-loop mapping used) and the "Condition" (e.g., "RNaseH1", "WKKD", etc.). When genomic input controls were available, they were manually matched to the experimental samples for which they could serve as a background control during peak calling.

The up-to-date Excel sheet is found here.

Throughout the process of analyzing the data (see RLBase-data), additional metadata was added to the sample sheet (see Structure for a full account).

Structure

rlbase_samples is a tbl with the structure:

| rlsample | label | condition | mode | lab | tissue | genotype | other | PMID | group | family | ip_type | strand_specific | moeity | bisulfite_seq | file_type | experiment_original | control_original | study | name | paired_end | read_length | control | eff_genome_size | genome | prediction | discarded | numPeaks | expsamples | exp_matchCond | coverage_s3 | peaks_s3 | fastq_stats_s3 | bam_stats_s3 | report_html_s3 | rlranges_rds_s3 | rlfs_rda_s3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SRX1070676 | POS | S96 | DRIPc | Fred Chedin | NT2 | WT | NT | 27373332 | rl | DRIP | S9.6 | TRUE | RNA | FALSE | public | GSM1720613 | NA | SRP059800 | GSM1720613: NT2 DRIPc-seq, rep 1; Homo sapiens; OTHER | FALSE | 50 | NA | 2706186140 | hg38 | POS | FALSE | 34092 | SRX1070685,SRX1070686 | WT_NT_SRP059800_NT2 | coverage/SRX1070676_hg38.bw | peaks/SRX1070676_hg38.broadPeak | fastq_stats/SRX1070676_hg38__fastq_stats.json | bam_stats/SRX1070676_hg38__bam_stats.txt | reports/SRX1070676_hg38.html | rlranges/SRX1070676_hg38.rds | rlfs_rda/SRX1070676_hg38.rlfs.rda |
| SRX1070677 | POS | S96 | DRIPc | Fred Chedin | NT2 | WT | NT | 27373332 | rl | DRIP | S9.6 | TRUE | RNA | FALSE | public | GSM1720614 | NA | SRP059800 | GSM1720614: NT2 DRIPc-seq, rep 2; Homo sapiens; OTHER | FALSE | 50 | NA | 2706186140 | hg38 | POS | FALSE | 22117 | SRX1070685,SRX1070686 | WT_NT_SRP059800_NT2 | coverage/SRX1070677_hg38.bw | peaks/SRX1070677_hg38.broadPeak | fastq_stats/SRX1070677_hg38__fastq_stats.json | bam_stats/SRX1070677_hg38__bam_stats.txt | reports/SRX1070677_hg38.html | rlranges/SRX1070677_hg38.rds | rlfs_rda/SRX1070677_hg38.rlfs.rda |
| SRX1070678 | POS | S96 | DRIP | Fred Chedin | NT2 | WT | NT | 27373332 | rl | DRIP | S9.6 | FALSE | DNA | FALSE | public | GSM1720615 | NA | SRP059800 | GSM1720615: NT2 DRIP-seq, 1; Homo sapiens; OTHER | FALSE | 50 | NA | 2706186140 | hg38 | POS | FALSE | 73924 | SRX1070685,SRX1070686 | WT_NT_SRP059800_NT2 | coverage/SRX1070678_hg38.bw | peaks/SRX1070678_hg38.broadPeak | fastq_stats/SRX1070678_hg38__fastq_stats.json | bam_stats/SRX1070678_hg38__bam_stats.txt | reports/SRX1070678_hg38.html | rlranges/SRX1070678_hg38.rds | rlfs_rda/SRX1070678_hg38.rlfs.rda |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Column description:

  • rlsample - The unique ID of the sample, same as in the SRA.

  • label - Label corresponding to the author-supplied condition of the sample. "POS" indicates the sample should robustly map R-loops, "NEG" indicates the opposite.

  • condition - The specific condition for each sample.

  • mode - The type of R-loop mapping for each sample.

  • lab - The senior author of the publication that provided the data.

  • tissue - The tissue condition for the sample.

  • genotype - The sample's genotype.

  • other - A column for other pertinent metadata provided by the authors.

  • PMID - The PMID associated with the sample.

  • group - One of "rl" (R-loop mapping) or "exp" (Expression data/RNA-Seq).

  • family - The family of the "mode" (e.g., "DRIP" includes "sDRIP", "DRIPc", "qDRIP", etc.).

  • ip_type - The IP type of the "mode" for each sample. One of "S9.6", "dRNH" (dead RNaseH1), or "None".

  • strand_specific - Whether the sample is stranded.

  • moeity - The moiety that was immunoprecipitated (if applicable).

  • bisulfite_seq - Whether the data uses bisulfite conversion sequencing (e.g., "BisDRIP-Seq" samples).

  • file_type - The type of data (always "public" for RLBase samples).

  • experiment_original - The original name of this sample as entered by hand in the curated Excel spreadsheet (usually converted from GSM to SRX).

  • control_original - Same as above for the accompanying control sample (if applicable).

  • study - The SRA study accession for this sample.

  • name - The sample's name as entered in SRA.

  • paired_end - A logical indicating whether the data is paired end.

  • read_length - The read length for the sample.

  • control - The RLBase ID of the genomic input control sample corresponding to this sample (if applicable).

  • eff_genome_size - The effective genome size based on read length and genome (calculated with the khmer package). See the relevant portion of the RLBase-data protocol here.

  • genome - The UCSC genome ID for this sample.

  • prediction - The prediction from running RLSeq::predictCondition().

  • discarded - A logical indicating whether this sample was discarded during model building for a mismatch with its "label" (see models).

  • numPeaks - The number of peaks called for this sample.

  • expsamples - The IDs of any corresponding expression samples.

  • exp_matchCond - The metadata used to match this sample to any corresponding expression samples (if applicable).

    • Method: Some R-loop mapping studies also had matched RNA-Seq data. These expression samples were recorded with the same metadata (where applicable) as the R-loop mapping samples. To match expression and R-loop samples, the study, tissue, genotype, and other columns were compared iteratively for each R-loop sample. If all four matched at least one expression sample, those four columns were assigned as the exp_matchCond. If only three were available, those three became the exp_matchCond. To see the order in which columns were checked for possible matches, view the buildExpression.R script in the RLBase-data repo. See also the section on the corr(R/PVal/PAdj) columns in rlregions.

  • coverage_s3 - The location of the coverage tracks (.bw) in the AWS S3 bucket for RLBase data ('s3://rlbase_data/').

  • peaks_s3 - Same as above for peak files (.broadPeak).

  • fastq_stats_s3 - Same as above for fastq QC statistics data (.json).

  • bam_stats_s3 - Same as above for BAM QC statistics data (.txt).

  • report_html_s3 - Same as above for reports from RLSeq::report() (.html).

  • rlranges_rds_s3 - Same as above for RLRanges R objects, as in RLSeq::RLRanges() (.rds).

  • rlfs_rda_s3 - Same as above for rlfs_res objects generated by RLSeq::analyzeRLFS() (.rda).
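
The matching method described for exp_matchCond can be sketched roughly as follows. This is a minimal base R illustration on made-up rows, not the buildExpression.R script: the helper match_expression(), the toy sample IDs, the order in which columns are dropped, the order of values in the resulting string, and the reading of "available" as "matched" are all assumptions made for the example.

```r
# Toy R-loop and expression sample sheets (illustrative values only).
rl <- data.frame(
  rlsample = c("RL1", "RL2"),
  study    = c("S1", "S1"),
  tissue   = c("NT2", "HeLa"),
  genotype = c("WT", "WT"),
  other    = c("NT", "NT"),
  stringsAsFactors = FALSE
)
expr <- data.frame(
  rlsample = c("EX1", "EX2"),
  study    = c("S1", "S1"),
  tissue   = c("NT2", "HeLa"),
  genotype = c("WT", "WT"),
  other    = c("NT", "DMSO"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try all four columns first, then only the first three.
# Dropping "other" first is an assumption, not the documented order.
match_expression <- function(rl_row, expr,
                             cols = c("study", "tissue", "genotype", "other")) {
  for (n in c(4, 3)) {
    use <- cols[seq_len(n)]
    hit <- rep(TRUE, nrow(expr))
    for (col in use) {
      hit <- hit & !is.na(expr[[col]]) & expr[[col]] == rl_row[[col]]
    }
    if (any(hit)) {
      return(list(
        expsamples    = paste(expr$rlsample[hit], collapse = ","),
        exp_matchCond = paste(unlist(rl_row[use]), collapse = "_")
      ))
    }
  }
  # No expression sample matched on at least three columns.
  list(expsamples = NA_character_, exp_matchCond = NA_character_)
}

res <- lapply(seq_len(nrow(rl)), function(i) match_expression(rl[i, ], expr))
rl$expsamples    <- vapply(res, `[[`, character(1), "expsamples")
rl$exp_matchCond <- vapply(res, `[[`, character(1), "exp_matchCond")
rl[, c("rlsample", "expsamples", "exp_matchCond")]
```

In this toy data, RL1 matches EX1 on all four columns, while RL2 only matches EX2 once the other column is dropped, so its exp_matchCond is built from three values.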

Examples

rlsamples <- rlbase_samples()
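
A slightly longer sketch of how the sample sheet might be used once loaded. So that it runs without RLHub installed, it builds a small stand-in data frame from a few columns of the three rows shown under Structure; in practice rlsamples would come from rlbase_samples(). The bucket prefix is copied from the coverage_s3 column description, and the coverage_uri column name is hypothetical.

```r
# Stand-in for rlbase_samples(): three rows and seven columns taken from the
# Structure table.
rlsamples <- data.frame(
  rlsample    = c("SRX1070676", "SRX1070677", "SRX1070678"),
  label       = c("POS", "POS", "POS"),
  mode        = c("DRIPc", "DRIPc", "DRIP"),
  prediction  = c("POS", "POS", "POS"),
  discarded   = c(FALSE, FALSE, FALSE),
  numPeaks    = c(34092, 22117, 73924),
  coverage_s3 = c("coverage/SRX1070676_hg38.bw",
                  "coverage/SRX1070677_hg38.bw",
                  "coverage/SRX1070678_hg38.bw"),
  stringsAsFactors = FALSE
)

# Keep samples whose author-supplied label agrees with the model prediction
# and that were not discarded during model building.
ok <- rlsamples[rlsamples$label == rlsamples$prediction & !rlsamples$discarded, ]

# Mean number of peaks per mapping mode.
peaks_by_mode <- aggregate(numPeaks ~ mode, data = ok, FUN = mean)
peaks_by_mode

# Prefix the relative keys with the bucket to get full S3 URIs.
bucket <- "s3://rlbase_data/"
ok$coverage_uri <- paste0(bucket, ok$coverage_s3)
ok$coverage_uri
```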