Current Research Topics
Compiler: LLVM
Python: Python Engine
Networking: FRR
@hpcjourney
GPFS - Investigating - RPC Wait
I am looking into the 'RPC wait' message for getData that names a specific node. I am trying to understand how this message is produced and what it means internally.
Update 10/13/25:
Well, it's kinda opaque how this all works. There also seem to be a couple of possible reasons why this could be happening, and it was not clear to me whether there is a way to tell, in the abstract, which one applies without looking at the node logs.
The most likely reason seems to be RDMA issues, probably caused by network problems where communications are not getting through as they should, which then shows up as 'RPC wait'.
It all makes sense, just different IT sense
Something that this move into HPC has further confirmed to me: technology has its own logic, and if you can wrap your mind around it, then it's all a matter of different-looking buttons and procedures, but it's all familiar at a fundamental level.
Technology problems all decompose in a similar way when you really get after solving them, and that is the beauty and the boredom of technology work. It's all the same pile; it just sometimes looks different and may smell more or less.
Finally Feeling like a Storage Administrator
It has taken a few months, but I am starting to feel like a storage administrator, which is exciting. Obviously, there is still more to learn, and that's a good thing. However, it is nice to finally be able to contribute!
Map to Understand GPFS
If you are like me and are trying to learn GPFS (IBM Spectrum Scale, now called IBM Storage Scale), the best advice I can give is to get yourself a copy and then pull it apart. Seriously, you will spend hours reading documentation, and you will still miss a lot of the details just because the documentation is so dense. Having the documentation open alongside the binaries and scripts, so you can disassemble and read them, will make the learning process more efficient.
Also don't forget your notes. Take notes in your own words and try not to quote. With these three things:
Documentation
Binaries and Scripts to read and disassemble
Good notes
You at least have a fighting chance of understanding what is going on, even if you are going in blind.
GPFS - Notes - Install Internals
You can actually try IBM Storage Scale for free. Obviously there are limits, but you can certainly get a sense of how it all works. To that end, I was pulling apart the installer the other day and noticed that I couldn't find mmdiag.
I was confused by this until I unpacked the .deb file and looked at the post-install script. As it turns out, IBM makes liberal use of symlinks, and mmdiag is just a symlink to tsdiag. I wanted to document that here so I don't forget, and in case anyone else finds themselves needlessly puzzled. I also posted a list of the symlinks I found in the GPFS Base .deb file here.
I am currently trying to understand the various diagnostic tools built into GPFS, and I wanted to look at how things work internally.
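If you want to reproduce that kind of symlink list yourself, a small script can walk a directory and report every link with its target. This is only a minimal sketch: the function name find_symlinks is made up for illustration, and the directory in the usage part (/usr/lpp/mmfs/bin, where GPFS commands normally live once installed) is an assumption about where the links end up on your system. It is not how I produced my own list.

```python
import os


def find_symlinks(root):
    """Return {link_path: link_target} for every symlink found under root."""
    links = {}
    # os.walk does not follow symlinked directories by default, so links to
    # directories show up in dirnames and links to files in filenames.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links[path] = os.readlink(path)
    return links


if __name__ == "__main__":
    # Assumed location of the installed GPFS commands; adjust as needed.
    for link, target in sorted(find_symlinks("/usr/lpp/mmfs/bin").items()):
        print(f"{link} -> {target}")
```

If the directory does not exist, the walk simply finds nothing and the script prints nothing.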
Research Topic - DAOS
You can download the source here:
DAOS Storage Stack (client libraries, storage engine, control plane) - daos-stack/daos
Just recording this here to research later.
GPFS - Notes - Quorum Options
Two types of quorum:
Minority Quorum (IBM's documentation calls this node quorum with tiebreaker disks)
Uses both nodes and tiebreaker drives to achieve quorum, even if there is only one surviving quorum node. Each of the quorum nodes needs to have access to all tiebreaker drives. You can have up to three tiebreaker drives.
Majority Quorum (IBM's documentation calls this node quorum)
Requires a majority of the quorum nodes to be up, which is why you use an odd number of quorum nodes. It is the only viable option if you want to have distributed quorum. For example: if you have three GPFS clusters, you may want to distribute your quorum over the three clusters. In this case you can't have minority quorum, because there is no way to have each quorum node access each tiebreaker drive without going through another cluster's nodes.
Just documenting a summary of the quorum types for future reference, and to put it in my own terms for comprehension and retention.
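To make the two rules concrete for myself, here is a rough sketch of them as plain arithmetic. The function names are made up, and this is a simplified model under my own reading of the rules (more than half of the quorum nodes for majority quorum; at least one quorum node plus more than half of the tiebreaker drives for minority quorum). It is not how GPFS actually implements quorum.

```python
def has_majority_quorum(quorum_nodes_total, quorum_nodes_up):
    """Majority quorum: more than half of the defined quorum nodes are up."""
    return quorum_nodes_up >= quorum_nodes_total // 2 + 1


def has_tiebreaker_quorum(quorum_nodes_up, tiebreaker_total, tiebreaker_reachable):
    """Minority quorum (simplified): at least one quorum node is up and it can
    reach more than half of the tiebreaker drives."""
    if not 1 <= tiebreaker_total <= 3:
        raise ValueError("expected between one and three tiebreaker drives")
    return quorum_nodes_up >= 1 and tiebreaker_reachable >= tiebreaker_total // 2 + 1


# Three quorum nodes, only one still up:
print(has_majority_quorum(3, 1))       # False
# Same single survivor, but it can reach two of three tiebreaker drives:
print(has_tiebreaker_quorum(1, 3, 2))  # True
```

The second case is the whole point of tiebreaker drives: a single surviving quorum node can keep the cluster up, which plain majority quorum would not allow.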
Current Area of Investigation - GPFS
My current area of investigation is the General Parallel File System (GPFS). Up until now I have had almost no exposure to this sort of filesystem beyond being aware that it exists. In my previous role we didn't use GPFS and had a single-level file system with archive.
I will post more about GPFS as I dig into it, but for now it suffices to say that GPFS seems excessively fragile and I can understand why people say troubleshooting it is as much art as it is science.
Hello World
In this blog I want to document to varying extents my journey into HPC. I don't know where it is going to go or how long it will last. I want to document this journey in case it proves useful to anyone out there. If nothing else it will hopefully have something insightful, amusing or otherwise worth reading.