Home: {generatervis}: An R package to synthesise, visualise clinical data, and create a workflow - Part 2

Jyoti Bhogal

*Yellow Allamanda Cathartica flowers clicked by me.*

This is the second blog in the series of blogs about my first open-source R package {generatervis}. In this blog, I share how I convert the domain knowledge into a flowchart for developing an MVP of the package.

📝 Figuring out what we really need

Over the past several weeks, I have been discussing with my cohort fellows and our mentor Rowland Mosbergen. The domain of Whole Genome Sequencing (WGS) is very vast, and the first lesson we have learnt is to narrow down the problem space we are trying to solve and create a roadmap for what we want to implement. This gives us a structure for our Minimum Viable Product (MVP). Based on the requirements that we have gathered so far, I have created the following maturity model of a research data management workflow for creating and ingesting the WGS dataset into REDMANE.

*Maturity model to create and ingest data into REDMANE.*

🧠 Creating a flowchart: How am I approaching this?

As a next step, I divided the different stages of the maturity model into further granular and achievable chunks. These steps help to create a basic workflow that starts with creating empty files in the .fastq format for the WGS dataset and eventually uploading the modified file(s) to the GitHub repository of a data portal, similar to the cBioportal, which is an open-source platform to explore and analyse large-scale cancer genomic datasets.

🌼 Bringing the mental model to life, step by step

*The maturity is refined into a flowchart.*

When creating this flowchart, I made the assumption that there is only one sample per patient.

Once I created the flowchart, it became much easier for me to start implementing the steps into the functions of the package. With the goal of developing a functional MVP version of the package, I have written an R function in the R package {generatervis} for each of the above steps. Follow along the blog to see the usage of each of these functions!

⬇️ Installing the package {generatervis}:

The development version of the {generatervis} R package can be installed from GitHub using:

# install.packages("pak")  
pak::pak("Clinical-Informatics-Collaborative/generatervis")

Once the package is installed, it can be used to perform the workflow steps as detailed below.

First let’s create an empty .fastq file by specifying a patient ID (multiple patient IDs can be provided as a vector) using the create_empty_fastq() function.

As a demo let us consider the patient_id to be ”patient_123”.

generatervis::create_empty_fastq(patient_id)

Empty FASTQ file created at: /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.fastq

Now let’s generate a random sample of n (say 2) reads, for ”patient_123” using the rreads() function. Read length is the length of the sequence of nucleotides to be generated for each read. As an example, consider a .fastq file with read length 8.

set.seed(1067)  
generatervis::rreads(patient_id, n, read_length)

[1] "@patient_123_read1" "ACACGGCG"           "+"                  "IIIIIIII"            
[5] "@patient_123_read2" "CCATTTTT"           "+"                  "IIIIIIII"

Next, we fill the empty .fastq file with the random sample of n reads, for ”patient_123”, using the fill_fastq() function.

generatervis::fill_fastq(patient_id, output_dir, n, read_length)

File already exists. Appending reads to the existing file.  
Populated /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.fastq with 2 reads.

At this stage, you can plot a heatmap of the sequences of the nucleotides using the fastq_plot() function. For the time being, I have hard-coded the sequences of nucleotides. As I develop the package further, I will modify this function so that it can be utilised.

generatervis::fastq_plot(patient_id, output_dir, n, read_length)

Plot saved to: /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/fastq_plot_patient_123.png

Once the .fastq file is created, it needs to be processed to be converted into a .bam file. This can be done using the fastq_to_bam() function.

reference <- "chr1"  
fastq_file <- file.path(output_dir, paste0(patient_id, ".fastq"))  
sam_file <- paste0(output_dir, "/", patient_id, ".sam")  
generatervis::fill_fastq(patient_id, output_dir, n, read_length)  
generatervis::fastq_to_bam(fastq_file, patient_id, output_dir, sam_file, reference)

Dummy SAM file written to: /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.sam

The .fastq is actually first converted into a .sam file, which you need to further convert into a .bam file, using the samtools command-line tool.

# samtools view -Sb path_to/file_name.sam > path_to/file_name.bam

Following this conversion, the .bam file is summarised into a .vcf file using bam_to_vcf().

vcf_file <- paste0(output_dir, "/", patient_id, ".vcf")  
generatervis::bam_to_vcf(patient_id, output_dir, vcf_file)

Dummy .vcf written to /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.vcf

Finally we create the metadata .txt files using create_metadata().

generatervis::create_metadata(patient_id, output_dir)

These files can then be uploaded to a data storage portal. Inspired by the cBioportal/datahub repository, I have created the Clinical-Informatics-Collaborative/data_storage_portal repository for uploading the data.

🎯 Next steps

Some of the potential next steps include:

Implement the workflow for real WGS dataset files.
For other dataset types, like Tabular Circular data, Imaging Dataset, etc., implement the process of creating the dataset and uploading it to cBioportal.

P.S.: The project development is done using CI/CD. To know further, keep an eye on future blog posts in this series.

🔑 Resources created

{generatervis} website: https://clinical-informatics-collaborative.github.io/generatervis/
data_storage_repo GitHub repository to store the data from the R package {generatervis}: https://github.com/Clinical-Informatics-Collaborative/data_storage_portal

💡References

R Packages (2e) by Hadley Wickham and Jennifer Bryan, CC BY-NC-ND 4.0 License
For more reading about Bioconductor R packages for WGS data, see: https://www.bioconductor.org/packages/release/BiocViews.html#___Software

Get In Touch:

Email: bhogaljyoti1@gmail.com
LinkedIn: jyoti-bhogal
GitHub: jyoti-bhogal
Mastodon: jyoti_bhogal

Bluesky: jyoti-bhogal.bsky.social

Website: https://jyoti-bhogal.github.io/about-me/index.html

{generatervis}: An R package to synthesise, visualise clinical data, and create a workflow - Part 2