
Engagements tagged biology

Prediction of Polymerization of the Yersinia Pestis Type III Secretion System
Nova Southeastern University

Yersinia pestis, the bacterium that causes the bubonic plague, uses a type III secretion system (T3SS) to inject toxins into host cells. The structure of the Y. pestis T3SS needle has not been modeled using AI or solved by cryo-EM, although T3SS structures in homologous bacteria have been solved by cryo-EM. Previously, we created possible hexamers of the Y. pestis T3SS needle protein, YscF, using ColabFold and AlphaFold2 Colab on Google Colab in an effort to better understand the needle structure and calcium regulation of secretion. Hexamers and mutated hexamers were designed using data from a wet-lab experiment by Torruellas et al. (2005). T3SS structures in homologous organisms show a 22- or 23-mer architecture in which rings of hexamers interlock in layers. When folding was attempted with more than six monomers, we instead observed larger single rings of monomers, revealing the limitations of these online systems. Creating a more accurate complete needle structure requires software capable of building a helically polymerized needle. The number of atoms in the predicted final needle far exceeds what our computational infrastructure can handle, so we need the resources of a supercomputer. We have hypothesized two ways to direct the folding that have the potential to produce a more accurate needle structure. The first option is to fuse the current hexamer structure into one protein chain so that the software recognizes the hexamer as a single protein, making it easier to connect multiple hexamers together. Alternatively, or additionally, the cryo-EM structures of the T3SS of Shigella flexneri and Salmonella enterica Typhimurium can be used as templates to guide construction of the Y. pestis T3SS needle. The full AlphaFold library or a program such as RoseTTAFold could help us predict protein-protein interactions more accurately for large structures. Based on our needs, we have identified TAMU ACES, Rockfish, and Stampede-2 as promising resources for this project. The generated model of the Y. pestis T3SS YscF needle will provide insight into a possible structure of the needle.
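If helpful, here is a minimal Python sketch of how the two input strategies described above could be prepared for a local ColabFold run (colabfold_batch treats ':'-separated sequences in a FASTA entry as chains of a complex); the YscF sequence shown is a placeholder, and the glycine-serine linker length is an untested assumption, not a validated design.

    # Sketch: build ColabFold inputs for (a) a YscF hexamer predicted as six chains
    # and (b) the same hexamer fused into one chain with flexible linkers.
    # YSCF_SEQ is a placeholder -- substitute the real YscF sequence.

    YSCF_SEQ = "MSNFSGFTKG"           # placeholder, not the real sequence
    N_COPIES = 6
    LINKER = "GGGGS" * 4             # assumed flexible linker; length untested

    # (a) Six separate chains: ColabFold reads ':'-separated sequences as a complex.
    hexamer_entry = ":".join([YSCF_SEQ] * N_COPIES)
    with open("yscf_hexamer.fasta", "w") as fh:
        fh.write(">yscf_hexamer\n" + hexamer_entry + "\n")

    # (b) One fused chain, so downstream tools see the hexamer as a single protein.
    fused_entry = LINKER.join([YSCF_SEQ] * N_COPIES)
    with open("yscf_fused.fasta", "w") as fh:
        fh.write(">yscf_fused\n" + fused_entry + "\n")

    # Both files can then be passed to a local ColabFold installation, e.g.:
    #   colabfold_batch yscf_hexamer.fasta output_hexamer/
    #   colabfold_batch yscf_fused.fasta output_fused/

In principle, the fused-chain entry could then be duplicated to request stacks of hexamers, which is the step whose atom count appears to require supercomputer-scale resources.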

Status: Complete
Run Markov Chain Monte Carlo (MCMC) in Parallel for Evolutionary Study
Texas Tech University

My ongoing project is focused on using species trait values (as data matrices) and their corresponding phylogenetic relationships (as a distance matrix) to reconstruct the evolutionary history of the smoke-induced seed germination trait. The results are expected to improve predictions of which untested species could benefit from smoke treatment, which could promote germination success of native species in ecological restoration. The computational resources allocated for this project come from the high-memory partition of our HPCC Ivy cluster (CentOS 8, Slurm 20.11, 1.5 TB memory/node, 20 cores/node, 4 nodes). However, given that I have over 1,300 species to analyze, using the maximum amount of resources to speed up the analysis is a challenge for two reasons: (1) the ancestral state reconstruction (the evolutionary history of plant traits) uses Markov chain Monte Carlo (MCMC) in a Bayesian framework, which runs more than 10 million steps and, according to experienced evolutionary biologists, could take a traditional single-core simulation up to 6 months to run; and (2) my data contain over 1,300 native species with about 500 polymorphic points (phylogenetic uncertainty), which require large-scale random simulation to provide statistical power. For instance, if I use 100 simulations for each of the 500 uncertainty points, I would have 50,000 simulated trees. Based on my previous experience with simulations, I could write code to analyze the 50,000 simulated trees in parallel, but even with this parallelization the long MCMC runs would still require 50,000 cores running for up to 6 months. Given this computational and evolutionary research challenge, my current work is focused on finding a suitable parallelization method for the MCMC steps. I hope to discuss my project with computational experts.
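For the across-tree part of the workload, which is embarrassingly parallel, a minimal Python sketch of distributing independent analyses over a process pool is shown below; run_mcmc_on_tree is a hypothetical stand-in for the real ancestral-state MCMC routine, and the worker count simply matches one 20-core Ivy node.

    # Sketch: distribute independent MCMC analyses of simulated trees across cores.
    # Each tree analysis is independent, so they can be farmed out with a process pool.
    # run_mcmc_on_tree() is a hypothetical placeholder for the real analysis routine.

    from multiprocessing import Pool
    import random

    N_TREES = 50_000          # 500 uncertainty points x 100 simulations each
    N_WORKERS = 20            # e.g., one Ivy node with 20 cores

    def run_mcmc_on_tree(tree_id: int) -> float:
        """Placeholder for a per-tree MCMC run; returns a dummy summary statistic."""
        rng = random.Random(tree_id)
        total = 0.0
        for _ in range(10_000):   # stand-in for the real 10-million-step chain
            total += rng.random()
        return total / 10_000

    if __name__ == "__main__":
        with Pool(processes=N_WORKERS) as pool:
            # Trees are processed independently; results return in submission order.
            results = pool.map(run_mcmc_on_tree, range(N_TREES), chunksize=100)
        print(f"Finished {len(results)} tree analyses")

Within a single chain the MCMC steps themselves remain sequential, so this only addresses the across-tree parallelism; speeding up the chain itself is the part that still needs a suitable method.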

Status: On Hold
Modelling conformational heterogeneity for human glutamine synthetase variants
San Francisco State University

Hi!

I want to use EMMIVox molecular-dynamics-based ensemble refinement to model into multiple cryo-EM maps that my lab has generated. I am a collaborator of Max Bonomi (the PI behind EMMIVox) but also want to be able to do this independently. We have pre-processed single-refinement models but need multiple GPUs to run the ensemble refinement. I am new to NSF ACCESS and also new to ensemble refinement and molecular dynamics in general. Specifically, I need help with:

  • Allocating resources (which resource is best and how much is needed)
  • Software installation (PLUMED dependencies and the software in the GitHub link)
  • Job running and management

I am running this part of the project myself and have not yet staffed it with a student, so a mentor would be working directly with me. I would like assistance running this procedure through at least once so that I can work independently and start training my students to implement it.

Thank you so much in advance for any help and for considering whether the MATCH+ program would be suitable for this project.

All the best,

Eric

Status: On Hold
Conservation Stewardship Legacy for Porcupines and Chipmunks
Missouri Botanical Garden

I am, for the first time, transitioning workflows from lab computers to HPC via ACCESS. While I have advanced knowledge of R and command-line operations, I have not used HPC before. I have navigated the ACCESS portals without difficulty; my challenge is getting jobs running on the machines.

The project I am focusing on uses hierarchical Bayesian models written in R to (1) evaluate the effects of weather and climate on a vulnerable species of chipmunk and (2) assess whether the North American porcupine has experienced an otherwise unnoticed range-wide decline. I have working R code written and the data structures I need to run it. Runtime and memory requirements are beyond a typical desktop's capacity, though. Since this is a Bayesian model, I expect I will need just 4 cores at a time, but each running for many hours (possibly days or weeks), with "modest" memory requirements (100-200 GB). Data storage is likely < 1 TB.
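Purely as an illustration of what such a job request might look like, here is a minimal Python sketch that writes a Slurm batch script matching the estimate above (4 cores, ~200 GB memory, a long wall time); the script name chipmunk_model.R, the module name, and the 7-day time limit are hypothetical placeholders to adjust for the actual cluster.

    # Sketch: generate a Slurm batch script matching the resource estimate above.
    # Names and limits below are assumptions, not cluster-specific recommendations.

    from pathlib import Path

    batch_script = """#!/bin/bash
    #SBATCH --job-name=chipmunk_bayes
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=200G
    #SBATCH --time=7-00:00:00
    #SBATCH --output=chipmunk_bayes_%j.log

    module load r               # module name varies by cluster
    Rscript chipmunk_model.R    # hypothetical name for the existing R script
    """

    Path("run_chipmunk_model.sbatch").write_text(batch_script)
    # Submit with:  sbatch run_chipmunk_model.sbatch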

Status: Declined