

This document describes changes between PGI Workstation 3.1 and previous releases, as well as late-breaking information not included in the current printing of the PGI User's Guide.
PGI Workstation 3.1 includes the following components:
Depending on the product you purchased, you may not have received all of the above components.
PGI Workstation 3.1 is supported on systems using Intel Pentium, Pentium II or Pentium III processors running NT 4.0 or higher, Solaris86 2.4 or higher, or Linux with a kernel version of 2.0 or above. This includes newer versions of Linux that use glibc 2.1.1, such as Red Hat 6.0 and SuSE 6.x.
The PGI compilers and tools are license-managed. For PGI Workstation products using PGI-style licensing (the default), a single user can run as many simultaneous copies of the compiler as desired, on a single system, and no license daemon or Ethernet card is required. However, usage of the compilers and tools is restricted to a pre-specified username. If you would like the PGI compilers and tools to be usable under any username, you must request FLEXlm-style license keys and use FLEXlm-style licensing. See section 1, PGI Workstation 3.1 Installation Notes, for a more detailed description of licensing options.
Following are the new features included in PGI Workstation 3.1:
Six new or updated generic compiler options (options which apply to all of the PGI compilers) have been added in release 3.1:
* -Mchkfpstk - check for internal consistency of the IA32 floating-point stack in the prologue of a function and after returning from a function or subroutine call. Floating-point stack corruption may occur in many ways, one of which is Fortran code calling floating-point functions as subroutines (i.e. with the CALL statement). If the PGI_CONTINUE environment variable is set upon execution of a program compiled with -Mchkfpstk, the stack will be automatically cleaned up and execution will continue. There is a performance penalty associated with the stack cleanup. If PGI_CONTINUE is set to verbose (must be all lower case), the stack will be automatically cleaned up and execution will continue after a warning message is printed.
* -Mvect=sse - search for vectorizable loops and, where possible, generate calls to equivalent hand-tuned functions which use Pentium III SSE instructions. Using this switch, it is possible to automatically use the Pentium III SSE instructions without making alterations in your source code.
* -Mcache_align - instructs the compiler to align unconstrained data objects of length greater than or equal to 16 bytes on cache-line boundaries. An unconstrained data object is a data object that is not a member of an aggregate structure or common block. This option does not affect the alignment of automatic or allocatable arrays. NOTE: to effect cache-line alignment for stack-based local variables, the C/C++ main() function or Fortran main PROGRAM must be compiled using a PGI compiler.
* -mp - interpret and process user-inserted OpenMP or SGI parallelization directives or pragmas. The OpenMP shared-memory parallel programming API was supported in PGF77 and PGF90 in a previous release, and is now supported by the PGCC ANSI C and C++ compilers as well. See the PGI User's Guide for a complete list of supported directives and pragmas.
* -Mnoopenmp - when used in conjunction with -mp, causes the PGI compilers to ignore OpenMP directives or pragmas but still process SGI-style parallelization directives or pragmas.
* -Mnosgimp - when used in conjunction with -mp, causes the PGI compilers to ignore SGI-style parallelization directives or pragmas but still process OpenMP directives or pragmas.
Two new NT-specific compiler options have been added in release 3.1:
* -g - Compile/link for debugging using the gdb debugger. You must have the full Cygwin32 environment installed in order to use this switch. You must use this switch in combination with -Mstabs.
* -Mstabs - Generate GNU STABS symbol information so that resulting executables can be debugged using gdb. You must have the full Cygwin32 environment installed in order to use this switch. You must use this switch in combination with -g.
Release 3.1 of the PGCC ANSI C and C++ compilers fully supports the OpenMP C/C++ Application Programming Interface, Version 1.0. The PGI User's Guide, Chapter 11, contains a complete description of the OpenMP pragmas, functions, and environment variables supported by the PGCC compilers. For more information on the OpenMP programming model or to obtain copies of the OpenMP API specifications, see the URL http://www.openmp.org.
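As an illustration of the kind of pragma the -mp switch processes, the following is a minimal sketch of a parallel reduction in C. The function name parallel_sum is hypothetical and does not come from the PGI documentation. The sketch uses only the parallel for construct and the reduction clause.

```c
/* Sum the first n elements of v.
 * With an OpenMP-aware compiler, the pragma splits the loop across
 * threads and combines the per-thread partial sums. A compiler that
 * does not process OpenMP pragmas ignores it and runs the loop
 * serially, giving the same result up to floating-point rounding. */
double parallel_sum(const float *v, int n)
{
    double sum = 0.0;
    int i;  /* declared outside the loop for ANSI C compatibility */

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum;
}
```

Without the reduction clause, every thread would update sum concurrently and the result would be unreliable, which is why the clause is part of the sketch rather than an optional extra.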
When the compiler switch -Mvect=sse is used, the vectorizer in release 3.1 of the PGI Workstation compilers automatically uses Pentium III SSE instructions where possible. This capability is supported by all of the PGI Fortran, C and C++ compilers, and is accomplished by replacing vectorizable loops with calls to optimized vector intrinsics (a modification in the generated assembly code - your source code remains unaltered). Using -Mvect=sse, performance improvements of up to two times over equivalent scalar code sequences are possible. However, the Pentium III SSE instructions apply only to 32-bit floating-point data, and meaningful performance improvements occur only for unit-stride vector operations on data that is aligned on a cache-line boundary.
Executables compiled using -Mvect=sse must be executed on a Pentium III system with an SSE-enabled operating system (NT 4.0 Service Pack 4, or Linux kernel 2.2.10 or higher with the appropriate kernel patches).
In the following program, the vectorizer recognizes the vector-vector addition in subroutine 'loop' when the compiler switch -Mvect=sse is used. This example shows the compilation, informational messages, and runtime results using the SSE instructions, along with some of the issues which greatly affect SSE performance.
      program vector_add
      parameter (n = 99999)
      real*4 x(n),y(n),z(n)
      do i = 1,n
         y(i) = i
         z(i) = 2*i
      enddo
      do j = 1, 10000
         call loop(x,y,z,n)
      enddo
      print*,x(1),x(771),x(3618),x(23498),x(99999)
      end

      subroutine loop(a,b,c,n)
      integer i,n
      real*4 a(n),b(n),c(n)
      do i = 1,n
         a(i) = b(i) + c(i)
      enddo
      end
First note that the arrays are single-precision. SSE instructions operate only on single-precision data, and they deliver meaningful speed-ups only when that data is aligned on cache-line boundaries. You can guarantee that unconstrained local arrays (such as x, y and z defined in the program above) are aligned on cache-line boundaries by compiling with the -Mcache_align switch.
NOTE: Fortran common blocks are also aligned on cache-line boundaries when -Mcache_align is used. However, only the start of each common block is aligned, not the individual arrays within it. If you have arrays in common blocks on which you'd like to invoke SSE vectorization, you must pad the common blocks explicitly to ensure all arrays contained in the common blocks are properly aligned.
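The amount of padding to insert is simple arithmetic, sketched here in C. The helper name pad_to_line is hypothetical, and the cache-line size is passed as a parameter. The 32-byte figure used in the usage note is an assumption about the Pentium III, not something these notes state.

```c
#include <stddef.h>

/* Bytes of padding needed after byte offset `offset` so that the next
 * object starts on a multiple of `line` bytes.
 * `line` must be nonzero. Returns 0 when `offset` is already aligned. */
size_t pad_to_line(size_t offset, size_t line)
{
    return (line - offset % line) % line;
}
```

For example, assuming a 32-byte line, a common block that starts with a 10-element real*4 array (40 bytes) needs pad_to_line(40, 32) = 24 bytes, or six real*4 elements, of padding before the next array.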
The examples below show results of compiling the example code above with and without -Mcache_align. Assume the program is compiled as follows:
% pgf90 -fast -Mvect -Minfo vadd.f
No compile-time informational messages are emitted, which indicates that no loops were optimized by the vectorizer. Following is the result if the generated executable is run and timed on a standalone Pentium III 450 MHz system:
% /bin/time a.out
3.000000 2313.000 10854.00 70494.00 299997.0
47.52user 0.07system 0:47.57elapsed 100%CPU
Now, recompile with SSE vector idiom recognition enabled:
% pgf90 -fast -Mvect=sse -Minfo vadd.f
loop:
22, Call to __pgi_add4 generated
Unvectorized altcode loop generated for
count < 12
Note the informational message indicating that the loop has been vectorized and a call to the SSE-optimized intrinsic __pgi_add4 has been generated. The second part of the informational message notes that the scalar (i.e. non-SSE) version of the loop will be executed if the loop count is less than 12.
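The structure described by this message can be pictured roughly as follows. This is only a sketch in C of the dispatch logic, not the code the compiler actually generates. The names add_with_altcode and vector_add4 are hypothetical, and vector_add4 is a plain C stand-in because the real interface of __pgi_add4 is not described in these notes.

```c
/* Loop counts below this value take the scalar path. The value 12 is
 * taken from the "count < 12" informational message. */
#define ALTCODE_THRESHOLD 12

/* Hypothetical stand-in for the SSE-optimized routine (plain C here). */
static void vector_add4(float *a, const float *b, const float *c, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Compute a[i] = b[i] + c[i] for i in [0, n).
 * Returns 1 if the vector routine was used, 0 if the scalar loop ran. */
int add_with_altcode(float *a, const float *b, const float *c, int n)
{
    int i;

    if (n < ALTCODE_THRESHOLD) {
        /* Short loops: the unvectorized alternate code. */
        for (i = 0; i < n; i++) {
            a[i] = b[i] + c[i];
        }
        return 0;
    }
    vector_add4(a, b, c, n);
    return 1;
}
```

Short loops take the scalar path, presumably because the overhead of calling the vector routine would outweigh any gain.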
Executing again, you should see results similar to the following:
% /bin/time a.out
3.000000 2313.000 10854.00 70494.00 299997.0
47.85user 0.00system 0:47.84elapsed 100%CPU
So, the numerical results are identical but there is no performance improvement. That's because the starting addresses of the vectors operated on by SSE instructions must be aligned to cache-line boundaries to obtain meaningful performance enhancements. Unconstrained local arrays and common blocks are aligned on cache-line boundaries when the -Mcache_align switch is used. Using this switch combined with those used previously results in the following:
% pgf90 -fast -Mvect=sse -Mcache_align -Minfo vadd.f
loop:
22, Call to __pgi_add4 generated
Unvectorized altcode loop generated for
count < 12
So, the same informational messages are emitted. Executing this version of the code, you should see results similar to the following:
% /bin/time a.out
3.000000 2313.000 10854.00 70494.00 299997.0
20.51user 0.06system 0:20.56elapsed 100%CPU
The result is a speed-up of more than 2 times over the equivalent scalar (i.e. non-SSE) version of the program: 20.51 seconds of user time versus 47.52 seconds, a factor of roughly 2.3.
By careful coding in combination with the -Mvect=sse and -Mcache_align switches, it is possible to get substantial speed-ups on programs which operate on 32-bit stride-one floating point vectors. However, in some cases, codes which operate on unaligned or strided data can see performance degradations when compiling with -Mvect=sse. For this reason, PGI recommends that you always measure the performance of codes with and without -Mvect=sse rather than using this switch as a default for optimization.
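When measuring, it can help to confirm that the vectors really do start on cache-line boundaries. The following C sketch shows one way to check an address at run time. The helper name is_line_aligned is hypothetical, and the line size is a parameter because these notes do not state the cache-line size. The sketch also assumes a compiler that provides <stdint.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if address p is a multiple of `line` bytes, 0 otherwise.
 * `line` must be nonzero. */
int is_line_aligned(const void *p, size_t line)
{
    return ((uintptr_t)p % line) == 0;
}
```

A test program might call this on the starting address of each array before timing, so that an unexpectedly slow run can be traced to misaligned data rather than to the switch itself.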
PGDBG can be used to debug F77, F90, C, C++, and assembly-language programs. It is not HPF-aware. PGDBG is currently available only on Linux and Solaris86. To use the command-level version of PGDBG, compile and link your program using the -g option and invoke the debugger as follows:
% pgdbg a.out
If you wish to use the graphical user interface (GUI), it is invoked using the command Xpgdbg:
% Xpgdbg a.out
Chapter 15 of the PGI User's Guide contains a complete description of PGDBG and how it is used, including an overview of the GUI.
Version 3.1 of PGDBG has the following limitations:
1. It cannot evaluate functions. If you specify "print" with an argument equivalent to a function call from the source code, you will see the error message:
INTERNAL ERROR: tgt_callfunc: not implemented
Program terminated, exit code is 0
However, it is generally safe to continue debugging your program even after this error message is printed.
2. It cannot process core files.
3. When referencing members of Fortran 90 user-defined types, you must use a C-like syntax rather than the Fortran syntax (e.g. "print x.y" rather than "print x%y"). Also, when specifying types for printing (e.g. in a memory window) you must use C-like data type identifiers.
4. When debugging a program parallelized using either OpenMP directives or auto-parallelization, step or next into a parallel region is unreliable. If you need to debug within a parallel region, set a break point inside the parallel region and run to it. You can then alter thread contexts, examine the values of variables for each thread, and step the threads individually.
5. Correlation between the GUI and the underlying debugger is not reliable within parallel regions of code. For example, if you break within a parallel region and step or next an individual thread, the source correlation marker will track the instructions as they are executed by the active thread. However, when you switch contexts to a thread that is still stopped at the breakpoint, the source correlation marker will not repaint the screen to show the location of the new thread.
The PGI Workstation 3.1 compilers for NT support generation of GNU STABS format debug information under control of the -g and -Mstabs compile/link switches. For NT users who have installed the full Cygwin32 environment, this enables debugging of PGI-compiled programs using gdb 4.18, the version included in Cygwin32. See http://www.cygnus.com for more information on how to obtain the full Cygwin32 environment. The -g and -Mstabs switches enable STABS generation and invoke the GNU assembler included with the full Cygwin32 environment rather than the assembler shipped by default with PGI Workstation 3.1.
Once you have created an executable (for example a.out) using the above switches, simply invoke gdb from within a Cygwin32 shell window as follows:
% gdb a.out
Note that there are shortcomings in gdb with respect to its ability to debug Fortran - in particular it doesn't support COMPLEX data types and cannot examine data included in Fortran COMMON blocks. Also, on NT gdb doesn't understand the 'drive' (C:\) syntax of path names, so you must use gdb commands to set the source directory paths. The NT version of gdb does allow you to set and run to function and line breakpoints, examine variables, list source lines, and examine stack traces.
Precompiled versions of the BLAS and LAPACK math libraries are included in the files $PGI/<target>/lib/libblas.a and $PGI/<target>/lib/liblapack.a. These can be linked in to your applications by simply placing the -llapack -lblas options on the link line:
% pgf77 myprog.F -llapack -lblas
On NT, assembly-coded BLAS and FFT routines are included in the file $PGI/<target>/lib/libmkl.a. You can specify that these should be linked in place of the standard (compiled Fortran) version of the BLAS using the -lmkl link time option:
% pgf90 myprog.F -lmkl -llapack
For more information about this library, see the URL:
http://support.intel.com/support/performancetools/libraries
A similar library is available for Linux systems, but cannot be shipped with PGI Workstation for legal reasons. However, you may obtain it at no cost at the following URL:
http://www.cs.utk.edu/~ghenry/distrib/index.htm
Follow the instructions for obtaining the software, install it in the file $PGI/linux86/lib/libmkl.a, and compile/link as above for NT. NOTE: The contents of this library are similar but not identical to libmkl.a for NT. Also, you must link with -g77libs when using this library.
C++ static constructors in shared libraries are now supported.
All Microsoft calling conventions including Fortran STDCALL are supported by the PGI Fortran compilers. In addition, the PGI Fortran compilers support UNIX-style calling conventions on NT. This allows simple porting of mixed Fortran/C applications from UNIX to NT.
IMPORTANT: Object files compiled using release 1.7-6 or prior of the PGI Fortran compilers for NT are not compatible with object files compiled using releases 3.0 or 3.1. The default (as of release 3.0) is compatible with Microsoft PowerStation 4.0 and Digital Visual Fortran.
Section 6.14 of the PGI User's Guide contains a detailed description of all supported Fortran calling conventions under NT.
A self-guided online tutorial is available to help you become familiar with how OpenMP parallelization directives are used. In particular, the tutorial takes the user step by step through the process of parallelizing the NAS FT benchmark using OpenMP directives. The tutorial can be found at:
ftp://ftp.pgroup.com/pub/SMP
You can download the file fftpde.tar.gz from this location using a web browser, and unpack it using the following commands:
% gunzip fftpde.tar.gz
% tar xvf fftpde.tar
Change directories to the fftpde sub-directory, and follow the instructions in the README file.
This release contains the EDG 2.40 C++ front-end.
Release 3.1 of PGI Workstation is built and validated under both the Linux 2.0.36 and 2.2.x kernels. Newer distributions of Linux, such as Red Hat 6.0 and SuSE 6.x, incorporate revision 2.2.x of the Linux kernel and glibc 2.1.1. If you are using a revision of Linux that includes the 2.2.x kernel and glibc 2.1.1, it will be detected automatically by the PGI Workstation installation script. Your installation will be modified as appropriate for these systems.
On NT, a UNIX-like shell environment is bundled with PGI Workstation. After installation, a double-left-click on the PGI Workstation icon on your desktop will launch a bash shell command window with pre-initialized environment settings. Most familiar UNIX commands are available (vi, emacs, sed, grep, awk, make, etc). If you are unfamiliar with the bash shell, refer to the user's guide included with the online HTML documentation.
Alternatively, you can launch a standard NT command window pre-initialized for usage of the PGI compilers by selecting the appropriate option from the PGI Workstation program group accessed in the usual way through the "Start" button.
Except where noted in the PGI User's Guide, the command-level PGI compilers and tools on NT function identically to their UNIX counterparts. You can customize your command window (white background with black text, add a scroll bar, etc.) by right-clicking on the top border of the PGI Workstation command window, selecting "Properties", and making the appropriate modifications. When the changes are complete, NT will allow you to apply the modifications globally to any command window launched using the PGI Workstation desktop icon.
After installing the standard NT version of PGI Workstation as outlined in section 1, it is possible to invoke the PGI compilers from an Interix 2.2 command window (for more information about Interix from Softway Systems, see http://www.interix.com). However, note that you will be using the standard NT versions of the PGI Workstation compilers to produce standard NT executables. Essentially, you will simply be using the Interix shell window as a command-level user interface to the PGI compilers for NT.
Issue the following commands from within an Interix shell to initialize your environment and path.
Assuming csh:
% setenv PGI C:/pgi
% set path=(//C/pgi/nt86/bin $path)
Or, assuming sh or ksh:
% PGI=C:/pgi
% export PGI
% PATH=//C/pgi/nt86/bin:$PATH
The UNIX-style manual pages must be viewed in their HTML form on NT. See section 3 for information on how to view the HTML documentation.
Note that the standard NT versions of the PGI Workstation 3.1 compilers themselves have not been built under Interix, and do not link against the Interix libraries. Thus the executables produced by these compilers are standard NT executables rather than Interix executables. In particular, this means that Interix programs which rely on UNIX-specific system calls (e.g. fork()) cannot be built using the standard NT installation of the PGI Workstation 3.1 compilers.
NOTE 1: PGHPF-compiled programs sometimes fail to execute when invoked from an Interix shell. You may see the error message:
% a.out
1: CreateProcess: no error
and a temporary file named pghpf_map_nnn may be left in your working directory. You will want to delete this temporary file manually, as it will be very large. This behavior has not been observed for executables compiled using PGF77 or PGF90, and PGHPF programs compiled within an Interix shell should operate correctly when invoked from within either a BASH command window or a standard MS-DOS command window.
NOTE 2: Even though your environment is initialized correctly within an Interix shell, the pgprof command won't be found in your path. You can either invoke PGPROF using the full executable name of pgprof.exe, or you can issue the following alias command:
% alias pgprof pgprof.exe
The PGI Workstation 3.1 command-level compilers can be used from within an MKS Toolkit shell window (for more information on the MKS Toolkit from Mortice Kern Systems, see http://www.mks.com).
After installing PGI Workstation as outlined in section 1, issue the following commands from within an MKS Korn shell to initialize your environment and path:
% PGI=C:/pgi
% export PGI
% PATH="C:\PGI\nt86\bin;$PATH"
The UNIX-style manual pages must be viewed in their HTML form on NT. See section 4 for information on how to view the HTML documentation.
To create dynamically linked libraries (DLLs) using the PGI compilers for NT, you must use the utilities dlltool and dllwrap which are included as part of the PGI Workstation for NT command environment. Here are the steps in the process.
STEP 1) Use dlltool to create a .def file from the object file(s) you wish to have included in the DLL. The .def file includes entry points and intermediate code for all of the functions/subroutines in the DLL. This intermediate code replaces the actual objects in an executable that references the DLL, and causes the objects to be loaded from the static .a library file at runtime. Only the objects that are to be included in the DLL are entered here.
To create a DLL from the object code in files object1.o and object2.o, create a file obj12.def as follows:
% dlltool --export-all --output-def obj12.def \
object1.o object2.o
STEP 2) Create the intermediate DLL file using dllwrap. This step requires a complete linking of the objects declared previously, ensuring that any DLL entries referenced in the target DLLs have all of their symbols resolved by the linker (the resolved symbols can also be DLLs).
Assuming the objects object1.o and object2.o are compiled by PGF90, do the following to create obj12.dll from the objects and the required PGF90 libraries:
% dllwrap --def obj12.def -o obj12.dll \
--driver-name pgcc object1.o object2.o \
-L. -dll -cyglibs -lpgf90 -lpgf90_rpm1 -lpgf902 \
-lpgf90rtl -lpgftnrtl
If the objects are compiled using PGF77, you need only include the reference to -lpgftnrtl (i.e. you can omit the references to -lpgf90, -lpgf90_rpm1, -lpgf902 and -lpgf90rtl). If the objects are compiled using PGCC, you need not include any of the PGI Fortran runtime library references.
The dllwrap command creates a series of commands to send to the linker, among which is -nostartfiles, which directs pgcc to not load various startup files into the list of object files sent to the linker.
STEP 3) Use dlltool again to create the libobj12dll.a library file from obj12.dll and obj12.def.
% dlltool --dllname obj12.dll --def obj12.def \
--output-lib libobj12dll.a
As an example, consider the following source files, object1.f:
      subroutine subf1 (n)
      integer n
      n=1
      print *,"n=",n
      return
      end
and object2.f:
      function funf2 ()
      real funf2
      funf2 = 2.0
      return
      end
and prog.f:
      program test
      external subf1
      real funf2, val
      integer n
      call subf1(n)
      val = funf2()
      write (*,*) 'val = ', val
      stop
      end
Create the DLL libobj12dll.a using the steps above. To create the test program using libobj12dll.a, do the following:
% pgf90 -o test prog.f -L. -lobj12dll
Should you wish to change libobj12dll.a without changing the subroutine or function interfaces, no rebuilding of test is necessary. Just recreate libobj12dll.a, and it will be loaded at runtime.
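As noted above, objects compiled using PGCC do not need the PGI Fortran runtime libraries on the dllwrap line. For illustration, here is a sketch of what C counterparts to object1.f and object2.f might look like. The names subc1 and func2 are hypothetical. The sketch does not address the naming or calling conventions needed to call these routines from Fortran, so prog.f would not link against them unchanged.

```c
#include <stdio.h>

/* C counterpart of subroutine subf1: sets *n to 1 and prints it. */
void subc1(int *n)
{
    *n = 1;
    printf("n= %d\n", *n);
}

/* C counterpart of function funf2: returns 2.0. */
float func2(void)
{
    return 2.0f;
}
```

After compiling these to object files with pgcc, the same three steps would apply, with the Fortran runtime library references omitted in STEP 2.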

