TACC - Storage Administrator
Texas Advanced Computing Center, Austin
Session: Job Fair
Event Type: Job Posting
Time: Monday, 9 November 2020, 9am - 8pm EDT
Description
The Texas Advanced Computing Center (TACC) at the University of Texas at Austin seeks an experienced Storage Administrator with a strong storage background.
TACC is looking for personnel familiar with HPC storage hardware and parallel filesystems for a senior storage administrator position. We host around 60PB total of non-archive storage, divided into separate subsystems attached to our various HPC clusters. Most of our storage is presented to clients using Lustre, a popular open-source parallel filesystem. Our storage backends are provided by vendors such as DDN, with their 14K and 18K controller solutions; ClusterStor (via Cray, via HPE), with the ClusterStor E1000 storage platform; and IBM. All of these provide solutions for presenting redundant pools of disks via redundant controllers, leveraging multipath to our Lustre servers. We maintain the majority of the stack and work closely with vendors on upgrades to low-level core infrastructure. We're interested in talented staff with substantial experience in any of the above layers of our storage solutions. We also require strong Linux/Unix skills, due to the nature of the environment in which we operate, and familiarity with careful troubleshooting and documentation processes.
Must be eligible to work in the US on a full-time basis for any employer without sponsorship.
Responsibilities
• Leverage Linux expertise to monitor and troubleshoot interconnects in a SAN/cluster environment, drawing on an in-depth understanding and working knowledge of each layer to correlate error messages and identify component failures.
• Assist in the operation of production parallel file systems at maximal availability while maintaining expected performance levels.
• Use scripting skills to proactively identify performance issues and resolve errors promptly, preventing file system outages or reducing their duration.
• Respond to emergency situations involving system problems, downtime and security breaches.
• Create and maintain documentation aligned with TACC's workflow tracking methodology.
Required Qualifications
• Bachelor’s degree in engineering, computer and information science, or other applied science.
• 4+ years of experience developing, debugging, and administering Linux/Unix systems, including network-exported filesystems.
• Demonstrated expertise with standard concepts, practices, and procedures in Linux systems administration.
• Proficient in network configuration of Ethernet and Fibre Channel/SAN environments.
• Experience with Bash scripting to automate system tasks, or with programming languages such as Perl or Python.
• Excellent troubleshooting skills including the ability to quickly recognize diverse failure modes and corresponding symptoms.
• Demonstrates excellent verbal/written communication skills
• Works well independently and in a team environment
Relevant education and experience may be substituted as appropriate.
Preferred Qualifications
• Bachelor's degree in computer science or related field
• 4+ years of experience in administering Lustre or GPFS parallel file systems
• 3+ years of high-speed interconnect experience with InfiniBand (IB), Omni-Path (OPA), or 40GbE
• 2+ years of experience with storage subsystems, including conventional RAID, declustered RAID, or erasure coding, with ClusterStor or DDN
• Proficiency in Red Hat/CentOS system software package management, software/kernel patching/compilation, hardware diagnostics, backups, and failure-recovery procedures
• Experience configuring, operating, and maintaining HPC/Cloud Linux clusters and PXE-based bare-metal provisioning.
• Ability to contribute to the evaluation of new storage architectures, systems, and software tools.
• Experience in an academic research environment and with large-scale storage deployments of more than 16 servers and 10PB
Salary Range
$90,000+, depending on qualifications