
NCP-AII NVIDIA AI Infrastructure Questions and Answers

Questions 4

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.

Navigate to "Devices" > select a switch > "Cables" tab to see ASIC firmware and transceiver versions.

B.

Use the "Topology" view to visually inspect cable icons.

C.

Run mlxlink -d lid-<LID> -m on each port manually.

D.

Export all switch logs and grep for "FW Version".

Questions 5

What is the primary purpose of performing a NeMo burn-in on a new AI infrastructure?

Options:

A.

To benchmark production training speed and ensure all GPUs are running at identical clock speeds.

B.

To stress test the hardware and software stack with representative NeMo workloads, ensuring reliability.

C.

To tune NeMo model hyperparameters for maximum accuracy on user datasets during cluster deployment.

Questions 6

You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?

Options:

A.

Increase the SSD RAID-0 local cache size in each node so it can absorb most training data, making network storage type and speed less important for performance.

B.

Implement a standard NFS server on a 10GbE network because the cluster can access the export and job performance will not be impacted.

C.

Deploy a high-performance parallel file system across InfiniBand or 40/100GbE, ensuring at least 3 GB/s per node and scalable aggregate bandwidth for all cluster workloads.

D.

Recommend general-purpose object storage for all training data because it is optimized for deep learning workloads and distributed data access at any scale.

Questions 7

You are preparing a Spectrum-based NVIDIA switch for integration into a production AI cluster. To confirm that all modules are running approved firmware versions, you must use the appropriate command from the switch CLI. Which step most accurately meets best practices for ensuring firmware version consistency and cluster compliance?

Options:

A.

Use the show version command to check the overall system version and confirm all modules are updated if the system version matches the documentation.

B.

Use the show interfaces status command to verify all ports are up, and proceed with integration if no interface errors are shown.

C.

Use the show asic-version command to review firmware versions for all modules, then compare these against the documented approved versions.

D.

Use the show inventory command to display component details and serial numbers before proceeding, as this output will include all firmware versions for review.

Questions 8

An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?

Options:

A.

cmsh status to check HA status and active/standby roles.

B.

nvsm show health to validate GPU status on both head nodes.

C.

systemctl restart cmdaemon to force a failover test.

D.

ping <secondary-head-node-ip> to test basic connectivity.

Questions 9

An administrator is configuring node categories in BCM for a DGX BasePOD cluster. They need to group all NVIDIA DGX H200 nodes under a dedicated category for GPU-accelerated workloads. Which approach aligns with NVIDIA's recommended BCM practices?

Options:

A.

Assign nodes to the "login" category to simplify Slurm integration.

B.

Create a new "dgx-h200" category and assign all DGX H200 nodes to it.

C.

Use the existing "dgxnodes" category without modification, as it is preconfigured for all DGX systems.

D.

Avoid categories and configure each DGX node individually via CLI.

Questions 10

After a recent OS upgrade, you need to reinstall NVIDIA GPU and DOCA drivers to support both AI training and accelerated networking. What best practice ensures successful installation and full hardware capability?

Options:

A.

Download and install only the specific versions of GPU and DOCA drivers listed as compatible with the current OS and hardware.

B.

Apply legacy drivers for hardware released within the last two years to maintain maximum compatibility across versions.

C.

Install the latest available drivers directly from the NVIDIA website.

D.

Use the default drivers provided by the Linux distribution, unless an installation fails during system boot.

Questions 11

An infrastructure engineer is preparing a new AI cluster for production use, relying on NVIDIA switches and high-speed optical transceivers for node connectivity. The team is finalizing network validation before launching large-scale training jobs. Why is it critical to confirm and align the firmware version on all switch transceivers prior to production?

Options:

A.

To guarantee that hardware inventory tools can report serial numbers and manufacturer codes for asset management, which is critical for future support and troubleshooting.

B.

To ensure stability, bandwidth, and compatibility across the cluster, avoiding link issues and performance loss.

C.

To allow the network operating system to automatically discover all connected transceivers with heterogeneous firmware.

D.

To reduce GPU memory consumption during distributed training jobs.

Questions 12

For an NVIDIA Enterprise AI Factory with 256 GPUs, which storage solution characteristic is most critical to validate during scaling tests?

Options:

A.

Consistent per-node throughput > 8 GiB/s.

B.

Single-node write performance during idle clusters.

C.

RAID rebuild times under disk failure.

D.

Maximum 4K random read IOPS exceeding 1 million.

Questions 13

To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?

Options:

A.

NCCL_TESTS_SPLIT="OR 0x7" ./all_reduce_perf -g 8

B.

Run without splits and analyze per-rack averages.

C.

NCCL_TESTS_SPLIT="MOD 2" ./all_reduce_perf -g 8

D.

NCCL_TESTS_SPLIT="DIV 8" ./all_reduce_perf -g 1
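
To see how split expressions like the ones in these options partition ranks, the sketch below may help. It assumes the semantics described in the nccl-tests README, where each rank gets a color computed as `rank <op> value` and ranks with the same color form one group. The helper names `split_color` and `groups`, and the layout of 16 ranks across two 8-GPU nodes, are hypothetical and chosen only for illustration.

```python
def split_color(rank: int, spec: str) -> int:
    """Return the group color for a rank under a NCCL_TESTS_SPLIT-style spec.

    Assumption: color = rank <op> value, with op one of AND, OR, MOD, DIV.
    """
    op, value_text = spec.split()
    value = int(value_text, 0)  # base 0 accepts decimal and 0x-prefixed hex
    if op == "AND":
        return rank & value
    if op == "OR":
        return rank | value
    if op == "MOD":
        return rank % value
    if op == "DIV":
        return rank // value
    raise ValueError(f"unsupported operation: {op}")


def groups(num_ranks: int, spec: str) -> dict:
    """Map each color to the ascending list of ranks that share it."""
    result = {}
    for rank in range(num_ranks):
        result.setdefault(split_color(rank, spec), []).append(rank)
    return result


if __name__ == "__main__":
    # Hypothetical layout: ranks 0-7 on node 0, ranks 8-15 on node 1.
    for spec in ("MOD 2", "DIV 8", "OR 0x7", "AND 0x7"):
        print(spec, groups(16, spec))
```

Under that assumed layout, "DIV 8" and "OR 0x7" each produce one group per node, while "MOD 2" produces two groups that both span the two nodes.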

Questions 14

A system administrator needs to install a container toolkit and successfully run the following commands:

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime docker

What step should be taken next to finish the installation?

Options:

A.

dpkg -i doca-host-repo-ubuntu<version>_amd64.deb

B.

apt-get install cuda-drivers

C.

systemctl restart docker

D.

apt-get remove nvidia-container-toolkit

Questions 15

During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?

Options:

A.

nvidia-smi --query-disk-space

B.

du -sh /home/*

C.

df -h | grep '/var'

D.

lsof +L1
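
For the alerting scenario in this question, a threshold check could also be scripted in Python instead of parsing command output. The following is a minimal sketch using the standard-library `shutil.disk_usage`. The function names, the `/` path, and the 10% threshold are hypothetical choices, not values taken from the question.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the free share (0.0 to 1.0) of the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def below_threshold(path: str, min_free_fraction: float) -> bool:
    """True when the free share of the filesystem is under the given minimum."""
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    path = "/"           # hypothetical mount point to watch
    min_free = 0.10      # hypothetical threshold: alert under 10% free
    if below_threshold(path, min_free):
        print(f"ALERT: {path} has less than {min_free:.0%} free space")
    else:
        print(f"OK: {path} has {free_fraction(path):.0%} free space")
```

A real maintenance script would replace the `print` calls with whatever notification channel the team uses.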

Questions 16

The system administrator plans to use Multi-Instance GPU profiles. What command should be used to verify that the GPU has this mode enabled?

Options:

A.

nvidia-mode

B.

nvidia-mig

C.

nvidia-enable

D.

nvidia-smi
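
As an illustration of checking MIG mode programmatically, the sketch below parses the CSV output of `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader`. The helper names are hypothetical, and the sample values in the comments are assumed output, not captured from a real system.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,mig.mode.current",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text: str) -> dict:
    """Map GPU index to its reported MIG mode from 'index, mode' CSV lines."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index_text, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index_text)] = mode
    return modes


def query_mig_modes() -> dict:
    """Run nvidia-smi and parse its output; raises if the command fails."""
    done = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_mig_modes(done.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        for index, mode in sorted(query_mig_modes().items()):
            print(f"GPU {index}: MIG mode {mode}")
```

GPUs that do not support MIG typically report a placeholder such as `[N/A]` rather than `Enabled` or `Disabled`, so the parser keeps the mode as a plain string.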

Questions 17

An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?

Options:

A.

Lane power variance < 3dB across all transceivers.

B.

Transceiver model matching QSFP-DD specifications.

C.

Temperature fluctuations > 5°C during validation.

D.

Effective BER > 1.5E-254 during a < 6-hour monitoring window.

Questions 18

An engineer wants to verify that an NVIDIA GPU is accessible inside a Docker container for running deep learning workloads. The NVIDIA Container Toolkit is installed on a machine with working NVIDIA drivers. Which command demonstrates the correct way to run a container that can access all available GPUs?

Options:

A.

docker run --rm --runtime=docker nvidia/cuda nvidia-smi

B.

docker run --rm -it ubuntu:22.04 nvidia-smi

C.

docker run --rm --gpus all nvidia/cuda:12.4.6-base-ubuntu22.04 nvidia-smi

D.

docker run --rm nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Questions 19

After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes “TaskState: Running” but no progress after 15 minutes. How should they track the update’s completion status?

Options:

A.

Check /var/log/messages on the DPU operating system for update logs.

B.

Query the DPU BMC with the Task ID of the installation process.

C.

Power cycle the DPU immediately to force a rollback.

D.

Run bfrec --status on the DPU to view flash progress.
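
To make the idea of tracking a Redfish task concrete, the sketch below builds a task URL and interprets a task payload. It assumes the standard DMTF Redfish layout, where tasks live under `/redfish/v1/TaskService/Tasks/<id>` and expose `TaskState` and `PercentComplete`. The host name, task ID, and sample payload are hypothetical, and authentication and the HTTPS request itself are left out.

```python
import json
from typing import Optional, Tuple

# Task states after which a Redfish task will not make further progress.
TERMINAL_STATES = {"Completed", "Exception", "Killed", "Cancelled"}


def task_url(bmc_host: str, task_id: str) -> str:
    """Build the URL of a task resource under the Redfish TaskService."""
    return f"https://{bmc_host}/redfish/v1/TaskService/Tasks/{task_id}"


def summarize_task(body: str) -> Tuple[str, Optional[int], bool]:
    """Return (state, percent complete or None, finished?) from a task JSON body."""
    task = json.loads(body)
    state = task.get("TaskState", "Unknown")
    percent = task.get("PercentComplete")
    return state, percent, state in TERMINAL_STATES


if __name__ == "__main__":
    # Hypothetical BMC address and task ID, plus an assumed sample response.
    print(task_url("dpu-bmc.example.net", "0"))
    sample = '{"Id": "0", "TaskState": "Running", "PercentComplete": 40}'
    print(summarize_task(sample))
```

A polling loop would fetch the URL at intervals and stop once the third element of the returned tuple is true.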

Questions 20

A system administrator has upgraded the firmware of the DPU. What will be the state of the firmware after the upgrade?

Options:

A.

The firmware is installed on the DPU.

B.

The firmware is deleted from the DPU.

C.

The firmware is copied to the DPU but not installed.

D.

The firmware is waiting on reboot to become active.

Questions 21

An engineer needs to completely remove NVIDIA GPU drivers from an Ubuntu 22.04 system to troubleshoot conflicts. Which command sequence ensures all driver components are purged?

Options:

A.

sudo ubuntu-drivers uninstall

B.

sudo rm -rf /usr/lib/nvidia

C.

sudo apt-get remove nvidia-driver-550

D.

sudo apt-get purge nvidia-* && sudo apt-get autoremove

Questions 22

You are an infrastructure engineer tasked with validating a new AI training cluster before releasing it to users. Your team wants to perform a NeMo burn-in to ensure both hardware and software are reliable and ready for production workloads. Which of the following actions are required as part of a proper NeMo burn-in process?

Pick the 2 correct responses below.

Options:

A.

Download a pre-trained NeMo model and use it for a quick accuracy check on a user dataset, then consider the burn-in complete if results are reasonable.

B.

Test inference using the NeMo API and approve the environment if the model outputs valid predictions.

C.

Run the configured NeMo training job repeatedly or for an extended duration, monitoring for errors, stalls, or performance drops across all GPUs and nodes.

D.

Configure a representative NeMo training or pretraining recipe and set up an executor to launch the job across intended nodes and GPUs.

Questions 23

An engineer needs to validate NVLink Switch functionality on a DGX H100 system with 8 GPUs. Which NCCL command verifies intra-node NVLink bandwidth?

Options:

A.

broadcast_perf -b 8 -e 16G -f 2 -g 8 without split configuration

B.

all_reduce_perf -b 8 -e 16G -f 2 -g 4 with NCCL_TESTS_SPLIT="MOD 2"

C.

all_reduce_perf -b 8 -e 16G -f 2 -g 1 repeated 8 times

D.

all_reduce_perf -b 8 -e 16G -f 2 -g 8 with NCCL_TESTS_SPLIT="OR 0x7"

Questions 24

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

Options:

A.

The BCM license expired after HA configuration.

B.

Network connectivity issues between the primary and secondary head nodes.

C.

The secondary head node lacks NVIDIA GPU drivers.

D.

The cluster nodes are powered on during the HA configuration.

Questions 25

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

Options:

A.

export HPL_OOC_SAFE_SIZE=4.0

B.

export HPL_OOC_MODE=0

C.

export HPL_OOC_NUM_STREAMS=8

D.

export HPL_OOC_MAX_GPU_MEM=90

Questions 26

When verifying network cable signal integrity during cluster deployment, which measurement result most strongly indicates a cable signal problem?

Options:

A.

Repeated CRC errors and intermittent port flapping reported by switch counters.

B.

Output of ifconfig showing link speed at the expected rate on both ends of the cable.

C.

Network pings between all cluster nodes return responses with delays under 2 ms on a 100Gb network.

Questions 27

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.

B.

Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.

Questions 28

A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?

Options:

A.

esxcli system module parameters set -m nvidia -p

B.

esxcli -i 0 -mig 18

C.

nvidia-smi -i 0 -mig 1

D.

mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2

Questions 29

After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?

Options:

A.

Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration

B.

nvidia-smi topo -m to inspect GPU topology connections

C.

DCGM diagnostics: dcgmi diag -r 2

D.

ib_write_bw to measure InfiniBand bandwidth between nodes

Questions 30

A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)

Options:

A.

Helm is installed on the installer machine.

B.

Ensure Kubernetes is running on the cluster.

C.

All cluster nodes have NVIDIA GPUs installed.

D.

NTP is disabled to simplify time synchronization.

Questions 31

After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?

Options:

A.

Review firmware update logs and run nvsm show health to check for hardware or firmware errors on the affected GPU.

B.

Remove the GPU from the system and replace it with a new one before any diagnostics.

C.

Ignore the issue and proceed with production workloads if the other GPUs are operational.

D.

Immediately re-run the firmware upgrade on all system components.

Questions 32

A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?

Options:

A.

The network card has no link / connection.

B.

A boot disk has failed.

C.

Multiple GPUs have failed.

D.

There are more than two failed power supplies.

Questions 33

An engineer is reimaging a DGX system in a large cluster. Which method ensures the most efficient and secure remote installation without physical access?

Options:

A.

Use apt-get to upgrade the operating system without rebooting the system.

B.

Create a USB drive with the ISO and manually boot from it on the DGX system.

C.

Build a software image on Base Command Manager and then reimage the system.

D.

Skip ISO verification and directly flash the operating system to the disk via SSH.

Questions 34

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?

Options:

A.

Reboot all leaf switches to force LLDP rediscovery.

B.

Replace all affected cables with higher-grade OM5 fiber optics.

C.

Verify LLDP data against topology files and remediate.

D.

Disable FEC on all switches to bypass neighbor validation.

Questions 35

A system administrator needs to validate a GPU-based server and ensure that no errors occur under load. What command should be used?

Options:

A.

nvsm dump health

B.

stress-test --usage

C.

nvsm show health

D.

nvsm stress-test

Questions 36

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

Options:

A.

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B.

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C.

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D.

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.

Exam Code: NCP-AII
Exam Name: NVIDIA AI Infrastructure
Last Update: May 29, 2026
Questions: 71