Spring Sale 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: clap70

NCP-AII NVIDIA AI Infrastructure Questions and Answers

Questions 4

You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?

Options:

A.

Reboot the host system to apply the repository changes and proceed.

B.

Install the nvidia-container-toolkit package using your package manager.

C.

Format the disk to clear any existing NVIDIA-related dependencies first.

D.

Download the CUDA toolkit installer from NVIDIA'S official website.

Buy Now
Questions 5

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

Options:

A.

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

B.

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

C.

Power drain then restart the DGX and check if the performance degradation resolves.

D.

Increase the fan speed to maximum and check whether the performance improves.

Buy Now
Questions 6

An InfiniBand server stops working, and a system administrator runs the "ibstat" command that provides the following output:

CA 'mlx5_1'

CA type: MT4115

Number of ports: 2

Firmware version: 10.20.1010

Hardware version: 0

Node GUID: 0x0002c90300002f78

System image GUID: 0x0002c90300002f7b

Port 1:

State: Initializing

Physical state: Linkup

Rate: 100

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x0251086a

Port GUID: 0x0002c90300002f79

Link layer: InfiniBand

What is the cause of the issue?

Options:

A.

The HCA port is faulty.

B.

There is no running SM in the fabric.

C.

The neighboring switch port is faulty.

D.

The cable is disconnected.

Buy Now
Questions 7

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Critical failure; expected is >390 GB/s for HDR InfiniBand.

D.

Inconclusive; rerun with --stress=cpu to validate.

Buy Now
Questions 8

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Options:

A.

Reduction of problem size (N) to accelerate computation.

B.

MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

C.

Doubling of GPU clock speeds through firmware updates and relevant configuration.

D.

Automatic NVLink bandwidth doubling via driver updates.

Buy Now
Questions 9

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?

Options:

A.

Reboot all leaf switches to force LLDP rediscovery.

B.

Replace all affected cables with higher-grade OM5 fiber optics.

C.

Verify LLDP data against topology files and remediate.

D.

Disable FEC on all switches to bypass neighbor validation.

Buy Now
Questions 10

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?

Options:

A.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

B.

Use watts used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

Buy Now
Questions 11

One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?

Options:

A.

lspci | grep NVIDIA

B.

nvidia-smi

C.

nvidia-gpu-status

D.

iblinkinfo

Buy Now
Questions 12

During multi-node HPL burn-in, GPUs show uneven utilization. Which configuration ensures balanced workload distribution?

Options:

A.

Enable HPL_USE_NVSHMEM=1 for shared memory acceleration

B.

HPL_RUN_GEMM_TESTS to skip validation

C.

Set --gpu-affinity and --cpu-affinity to align GPU and NUMA nodes

D.

HPL_OOC_TILE_M to 8192 for larger blocks

Buy Now
Questions 13

You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?

Options:

A.

Confirm the system fans are running at 100% under all workloads to prevent overheating.

B.

Review the system BIOS to ensure GPU overclocking is enabled for maximum performance.

C.

Use NVSM to disable unused PCIe devices to reduce overall system heat output.

D.

Verify that each DGX system is connected to redundant, properly rated PDUs and that all power supplies are reporting nominal input.

Buy Now
Questions 14

During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?

Options:

A.

device use dgx001 ; interfaces add vlan vlan100 ; set parent bond0 ; set mode 1 ; set network internalnet

B.

device use dgx001 ; interfaces add bond bond0 ; append interfaces enp225s0f1np1 enp97s0f1np1 ; set mode 1 ; set network internalnet

C.

device use dgx001 ; interfaces set enp225s0f1np1 network internalnet ; interfaces set enp97s0f1np1 network internalnet

D.

device use dgx001 ; interfaces delete enp225s0f1np1 ; interfaces delete enp97s0f1np1

Buy Now
Questions 15

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

Options:

A.

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B.

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C.

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D.

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.

Buy Now
Questions 16

A leaf switch shows "FW Version Mismatch" alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?

Options:

A.

flint

B.

iblinkinfo

C.

mlxconfig

D.

ethtool

Buy Now
Questions 17

A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?

Options:

A.

Multi-protocol data access with low latency.

B.

High capacity with moderate speed.

C.

Tape backup systems.

D.

Low-cost HDD solutions.

Buy Now
Questions 18

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A.

Inconclusive; rerun with point-to-point tests.

B.

Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.

C.

Critical failure; bus bandwidth exceeds hardware capabilities.

D.

Suboptimal performance; algorithm bandwidth should match bus bandwidth.

Buy Now
Questions 19

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

Options:

A.

ngc config get

B.

ngc init

C.

ngc config set

D.

ngc config update

Buy Now
Questions 20

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?

Options:

A.

Switch from FP64 to FP32 precision.

B.

Disable GPU affinity.

C.

Reduce test duration to 12 hours.

D.

Verify the matrix size is divisible by block size.

Buy Now
Questions 21

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

Options:

A.

Implement redundant switches with spanning tree protocol.

B.

MLAG for bonded interfaces across redundant switches.

C.

Use only one switch for all management and storage traffic.

D.

Disable VLANs and use unmanaged switches.

Buy Now
Exam Code: NCP-AII
Exam Name: NVIDIA AI Infrastructure
Last Update: Feb 28, 2026
Questions: 71
NCP-AII pdf

NCP-AII PDF

$25.5  $84.99
NCP-AII Engine

NCP-AII Testing Engine

$30  $99.99
NCP-AII PDF + Engine

NCP-AII PDF + Testing Engine

$40.5  $134.99