I have been developing an R package, {generatervis}, which helps to synthesise and visualise clinical data and to create a data workflow. This is the second post in a series about building my first open-source R package, from ideation to implementation. In this post, I share how I converted domain knowledge into a flowchart for developing an MVP of the package.
Over the past several weeks, I have been discussing the project with my cohort fellows and our mentor, Rowland Mosbergen. The domain of Whole Genome Sequencing (WGS) is vast, and the first lesson we learnt was to narrow down the problem space we are trying to solve and create a roadmap for what we want to implement. This gives us a structure for our Minimum Viable Product (MVP). Based on the requirements gathered so far, I have created the following maturity model of a research data management workflow for creating and ingesting the WGS dataset into REDMANE.
As a next step, I divided the stages of the maturity model into more granular, achievable chunks. These steps form a basic workflow that starts with creating empty files in the .fastq format for the WGS dataset and ends with uploading the modified file(s) to the GitHub repository of a data portal similar to cBioPortal, an open-source platform for exploring and analysing large-scale cancer genomic datasets.
When creating this flowchart, I made the assumption that there is only one sample per patient.
Once I had created the flowchart, it became much easier to start implementing the steps as functions of the package. With the goal of developing a functional MVP, I have written one R function in {generatervis} for each of the above steps. Follow along to see how each of these functions is used!
The development version of the {generatervis} R package can be installed from GitHub using:
# install.packages("pak")
pak::pak("Clinical-Informatics-Collaborative/generatervis")
Once the package is installed, it can be used to perform the workflow steps as detailed below.
First, let's create an empty .fastq file by specifying a patient ID (multiple patient IDs can be provided as a vector) using the create_empty_fastq() function.
As a demo, let us consider the patient_id to be "patient_123".
patient_id <- "patient_123"
generatervis::create_empty_fastq(patient_id)
Empty FASTQ file created at: /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.fastq
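To make this step concrete, here is a minimal base R sketch of what creating empty .fastq files for a vector of patient IDs could look like. The helper name create_empty_fastq_sketch() and its output_dir argument are hypothetical and only for illustration; this is not the implementation used in {generatervis}.

```r
# Hypothetical helper (not the package's code): create one empty
# <patient_id>.fastq file per ID inside `output_dir`.
create_empty_fastq_sketch <- function(patient_id, output_dir = tempdir()) {
  paths <- file.path(output_dir, paste0(patient_id, ".fastq"))
  # file.create() creates (or truncates) each file and returns TRUE on success
  ok <- file.create(paths)
  if (!all(ok)) {
    stop("Could not create: ", paste(paths[!ok], collapse = ", "))
  }
  invisible(paths)
}

# A vector of IDs yields one empty file per patient
paths <- create_empty_fastq_sketch(c("patient_123", "patient_456"))
file.size(paths)
```

Because the assumption in this post is one sample per patient, one file per patient ID is enough here.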
Now let's generate a random sample of n (say 2) reads for "patient_123" using the rreads() function. Read length is the number of nucleotides generated for each read. As an example, consider a .fastq file with read length 8.
n <- 2
read_length <- 8
set.seed(1067)
generatervis::rreads(patient_id, n, read_length)
[1] "@patient_123_read1" "ACACGGCG" "+" "IIIIIIII"
[5] "@patient_123_read2" "CCATTTTT" "+" "IIIIIIII"
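Each read in the output above occupies four lines: an identifier starting with "@", the nucleotide sequence, a "+" separator, and a quality string with one character per base. The sketch below shows one way such reads could be generated in base R. The function random_reads_sketch() is a hypothetical stand-in, not the code behind rreads(), so it is not expected to reproduce the exact sequences shown above even with the same seed.

```r
# Hypothetical sketch (not the package's rreads()): build n random
# FASTQ records, four lines each, as a character vector.
random_reads_sketch <- function(patient_id, n, read_length) {
  bases <- c("A", "C", "G", "T")
  records <- lapply(seq_len(n), function(i) {
    read_seq <- paste(sample(bases, read_length, replace = TRUE), collapse = "")
    c(
      paste0("@", patient_id, "_read", i), # line 1: read identifier
      read_seq,                            # line 2: nucleotide sequence
      "+",                                 # line 3: separator
      strrep("I", read_length)             # line 4: one quality character per base
    )
  })
  unlist(records)
}

set.seed(1067)
reads <- random_reads_sketch("patient_123", n = 2, read_length = 8)
reads
```

In the common Phred+33 encoding, the character "I" corresponds to a quality score of 40, which is why a constant string of "I" is a convenient placeholder for synthetic reads.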
Next, we fill the empty .fastq file with the random sample of n reads for "patient_123" using the fill_fastq() function.
output_dir <- tempdir()  # assumed: the paths printed in the output point to a temporary directory
generatervis::fill_fastq(patient_id, output_dir, n, read_length)
File already exists. Appending reads to the existing file.
Populated /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.fastq with 2 reads.
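The "appending" behaviour reported in the message above can be illustrated with a file connection opened in append mode. This is only a sketch under my own assumptions: append_reads_sketch() is a hypothetical helper, and the package may write the file differently.

```r
# Hypothetical helper (not the package's fill_fastq()): append FASTQ
# lines to an existing file without overwriting its contents.
append_reads_sketch <- function(reads, fastq_path) {
  con <- file(fastq_path, open = "a") # "a" = append; creates the file if missing
  on.exit(close(con))
  writeLines(reads, con)
  invisible(length(reads) / 4)        # number of four-line records written
}

fastq_path <- tempfile(fileext = ".fastq")
file.create(fastq_path)

# One record taken from the rreads() output shown earlier
reads <- c("@patient_123_read1", "ACACGGCG", "+", "IIIIIIII")
append_reads_sketch(reads, fastq_path)
append_reads_sketch(reads, fastq_path) # second call adds to the same file

n_lines <- length(readLines(fastq_path))
n_lines
```

Calling the helper twice leaves both records in the file, which is the behaviour to keep in mind when re-running a fill step on an existing file.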
At this stage, you can plot a heatmap of the nucleotide sequences using the fastq_plot() function. For the time being, I have hard-coded the sequences of nucleotides. As I develop the package further, I will modify this function so that it is no longer limited to the hard-coded sequences.
generatervis::fastq_plot(patient_id, output_dir, n, read_length)
Plot saved to: /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/fastq_plot_patient_123.png
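As an illustration of the idea behind a nucleotide heatmap, the sketch below encodes each base as a number and draws the resulting matrix with base R graphics. It uses the two sequences from the rreads() output above. The colour choices and the PDF device are my own assumptions for a self-contained example (the package itself saves a PNG), and this is not the code inside fastq_plot().

```r
# Sequences taken from the rreads() output shown earlier
seqs <- c(read1 = "ACACGGCG", read2 = "CCATTTTT")
bases <- c("A", "C", "G", "T")

# One row per read, one column per position; each base is coded 1..4
base_matrix <- t(vapply(
  strsplit(seqs, ""),
  function(x) match(x, bases),
  integer(nchar(seqs[[1]]))
))

# Draw the matrix as a heatmap (assumed colours: one per base)
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
image(
  x = seq_len(ncol(base_matrix)),
  y = seq_len(nrow(base_matrix)),
  z = t(base_matrix),          # image() expects z as length(x) by length(y)
  zlim = c(1, 4),
  col = c("green3", "blue", "orange", "red"),
  xlab = "Position in read", ylab = "Read", axes = FALSE
)
axis(1, at = seq_len(ncol(base_matrix)))
axis(2, at = seq_len(nrow(base_matrix)), labels = rownames(base_matrix))
dev.off()
```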
Once the .fastq file is created, it needs to be processed and converted into a .bam file. This can be done using the fastq_to_bam() function.
reference <- "chr1"
fastq_file <- file.path(output_dir, paste0(patient_id, ".fastq"))
sam_file <- paste0(output_dir, "/", patient_id, ".sam")
generatervis::fill_fastq(patient_id, output_dir, n, read_length)
generatervis::fastq_to_bam(fastq_file, patient_id, output_dir, sam_file, reference)
Dummy SAM file written to: /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.sam
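For readers unfamiliar with the SAM format, the sketch below shows how FASTQ records could be turned into dummy SAM lines. Every alignment record has eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The function fastq_to_sam_sketch(), the reference length of 1000, and the placeholder position and mapping quality are all assumptions for illustration; no real alignment takes place, and the package's dummy file may look different.

```r
# Hypothetical sketch (not the package's fastq_to_bam()): build dummy
# SAM lines from FASTQ lines, pretending every read maps to position 1.
fastq_to_sam_sketch <- function(fastq_lines, reference = "chr1",
                                reference_length = 1000L) {
  idx <- seq(1, length(fastq_lines), by = 4) # first line of each record
  qname <- sub("^@", "", fastq_lines[idx])   # read name without the "@"
  seqs <- fastq_lines[idx + 1]
  qual <- fastq_lines[idx + 3]

  header <- c(
    "@HD\tVN:1.6\tSO:unsorted",
    paste0("@SQ\tSN:", reference, "\tLN:", reference_length)
  )
  # Placeholder values: FLAG 0, POS 1, MAPQ 60, CIGAR "<length>M"
  records <- paste(qname, 0, reference, 1, 60, paste0(nchar(seqs), "M"),
                   "*", 0, 0, seqs, qual, sep = "\t")
  c(header, records)
}

fastq_lines <- c("@patient_123_read1", "ACACGGCG", "+", "IIIIIIII",
                 "@patient_123_read2", "CCATTTTT", "+", "IIIIIIII")
sam_lines <- fastq_to_sam_sketch(fastq_lines)
writeLines(sam_lines)
```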
The .fastq file is actually first converted into a .sam file, which you then need to convert into a .bam file using the samtools command-line tool.
# samtools view -Sb path_to/file_name.sam > path_to/file_name.bam
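If you prefer to stay inside R, the same conversion can be triggered with system2(). The sketch below assumes samtools is installed and on your PATH; sam_to_bam_sketch() is a hypothetical wrapper, not a function of {generatervis}. It uses the -o option to name the output file instead of shell redirection.

```r
# Hypothetical wrapper: call samtools from R if it is available.
sam_to_bam_sketch <- function(sam_file, bam_file,
                              samtools = Sys.which("samtools")) {
  if (!nzchar(samtools)) {
    message("samtools not found on PATH; skipping conversion.")
    return(invisible(FALSE))
  }
  # Equivalent to: samtools view -b -o <bam_file> <sam_file>
  status <- system2(
    samtools,
    c("view", "-b", "-o", shQuote(bam_file), shQuote(sam_file))
  )
  invisible(status == 0) # TRUE when samtools exits successfully
}

# Example call (paths are placeholders):
# sam_to_bam_sketch("path_to/file_name.sam", "path_to/file_name.bam")
```

Returning FALSE instead of stopping when samtools is missing keeps the rest of a demo workflow running on machines without the tool.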
Following this conversion, the .bam file is summarised into a .vcf file using bam_to_vcf().
vcf_file <- paste0(output_dir, "/", patient_id, ".vcf")
generatervis::bam_to_vcf(patient_id, output_dir, vcf_file)
Dummy .vcf written to /var/folders/7k/kpyh33yd4mlbp_p2j8m4810m0000gn/T//RtmpLlrqm6/patient_123.vcf
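To give a feel for what a dummy .vcf file contains, here is a minimal sketch that writes a VCF 4.2 header and a single placeholder variant. The helper write_dummy_vcf_sketch() and the variant itself (an A to C change at position 1 with genotype 0/1) are invented purely for illustration; the file produced by bam_to_vcf() may differ.

```r
# Hypothetical sketch (not the package's bam_to_vcf()): write a minimal
# VCF file with one placeholder variant and one sample column.
write_dummy_vcf_sketch <- function(patient_id, vcf_file, reference = "chr1") {
  header <- c(
    "##fileformat=VCFv4.2",
    paste0("##contig=<ID=", reference, ">"),
    "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
    paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
            "INFO", "FORMAT", patient_id), collapse = "\t")
  )
  # Placeholder record: A>C at position 1, heterozygous genotype
  record <- paste(c(reference, 1, ".", "A", "C", ".", "PASS", ".",
                    "GT", "0/1"), collapse = "\t")
  writeLines(c(header, record), vcf_file)
  invisible(vcf_file)
}

vcf_file <- tempfile(fileext = ".vcf")
write_dummy_vcf_sketch("patient_123", vcf_file)
vcf_lines <- readLines(vcf_file)
writeLines(vcf_lines)
```

The single sample column named after the patient reflects the one-sample-per-patient assumption made earlier in this post.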
Finally, we create the metadata .txt files using create_metadata().
generatervis::create_metadata(patient_id, output_dir)
These files can then be uploaded to a data storage portal. Inspired by the cBioPortal/datahub repository, I have created the Clinical-Informatics-Collaborative/data_storage_portal repository for uploading the data.
Some of the potential next steps include:
P.S.: The project is being developed using CI/CD. To learn more, keep an eye on future blog posts in this series.
data_storage_repo
GitHub repository to store the data from the R package {generatervis}: https://github.com/Clinical-Informatics-Collaborative/data_storage_portal
Email: bhogaljyoti1@gmail.com
LinkedIn: jyoti-bhogal
GitHub: jyoti-bhogal
Mastodon: jyoti_bhogal
Bluesky: jyoti-bhogal.bsky.social
Website: https://jyoti-bhogal.github.io/about-me/index.html
For attribution, please cite this work as
Bhogal (2025, May 5). Home: {generatervis}: An R package to synthesise, visualise clinical data, and create a workflow - Part 2. Retrieved from https://jyoti-bhogal.github.io/about-me/
BibTeX citation
@misc{bhogal2025{generatervis}:, author = {Bhogal, Jyoti}, title = {Home: {generatervis}: An R package to synthesise, visualise clinical data, and create a workflow - Part 2}, url = {https://jyoti-bhogal.github.io/about-me/}, year = {2025} }