GCC front-end whitepaper

July 28th, 2011 Andi Hellmund

Over the last couple of months, I have written a white paper about GCC front-end internals. This white paper is not yet complete, and there are many other areas of the compiler that are worth describing.

So, if you have any feedback on the white paper, or if there are areas for which you would like some internal documentation, please let me know and I will think about adding more sections.

Just drop me a comment or an email (andi@mail.lxgcc.net) …

The white paper can be downloaded here!

Categories: Compilers

Simplistic GNU makefile for lazy programmers

June 26th, 2011 Andi Hellmund

A friend of mine and I were recently discussing the layout of a GNU makefile that allows you to add source files to your source tree in an arbitrary directory hierarchy without having to modify the makefile. As further requirements, we wanted only a single top-level makefile in the build directory, with no cascading sub-directory makefiles, and the makefile should be able to handle multiple source files with the same name.

Besides the usual make logic, GNU make provides the makefile writer with a good set of helper functions, e.g. for text replacement or directory/file operations. For a full reference of GNU make functions, please check out the official documentation.

As mentioned in the title, this makefile is very simplistic. It makes the following assumptions about the software project. For now it is oriented towards software development in C, but it should be easy to extend it to other languages.

  • all the source files are located in a single directory, e.g. src
  • there is a separate build directory containing the created object files, e.g. build (this is not a hard requirement, but I like to separate the sources from the object and executable files)
  • all source files having the same file type will be built with the same compiler options

So, here it is:

source_files = $(shell find ../src -type f -iname '*.c' | sed 's/^\.\.\/src\///')
obj_files = $(patsubst %.c,%.o,$(source_files))

LDFLAGS =
CFLAGS = -g

vpath %.c ../src/

all: bin

bin: $(obj_files)
    gcc $(LDFLAGS) $^ -o $@

%.o: %.c
    -mkdir -p $(dir $@)
    gcc $(CFLAGS) $< -c -o $@

.PHONY: all clean

clean:
    rm -rf $(obj_files) $(filter-out ./,$(sort $(dir $(obj_files)))) bin

So, what is this makefile doing?

The first line collects all the source files, in this case all C source files, from the source directory. The $(shell …) function executes a shell command from within the makefile and returns its output. The sed part of the command strips the leading ../src/ so that only the file path relative to the source directory is recorded. I will explain below why this is useful. The second line is a text replacement that derives the object file names from the source file names.

The next interesting line is the vpath directive. It allows the makefile writer to specify directories where make should look for dependencies in addition to the local directory, and the directories can be chosen per file name pattern. Here, we say that make should also look for .c files in the source directory.

Finally, the most important part of the makefile is the generic rule to translate .c files into .o files (%.o: %.c). This rule is used for each object file requested as a dependency of the final executable (bin). For example, assume that the final executable only depends on the object file generic/a/bin.o; GNU make will then internally handle the generic rule as:
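To make the first two makefile lines more tangible, here is a small sketch that replays them in plain shell on a throwaway tree. The directory names (generic/a, generic/b) and the helper names list_sources and to_objects are made up for illustration. Note that to_objects only replaces a trailing .c, which is slightly stricter than a plain text substitution.

```shell
#!/bin/sh
# Sketch only: the demo tree and the function names are made up.

list_sources() {
    # $1: build directory with the sources in ../src next to it.
    # Prints the .c files relative to the source directory, sorted.
    ( cd "$1" && find ../src -type f -iname '*.c' | sed 's/^\.\.\/src\///' | sort )
}

to_objects() {
    # Read file names on stdin and turn a trailing .c into .o.
    sed 's/\.c$/.o/'
}

tmp=$(mktemp -d)
mkdir -p "$tmp/build" "$tmp/src/generic/a" "$tmp/src/generic/b"
: > "$tmp/src/generic/a/bin.c"
: > "$tmp/src/generic/b/bin.c"      # same file name, different directory

list_sources "$tmp/build"               # what source_files would hold
list_sources "$tmp/build" | to_objects  # what obj_files would hold
rm -rf "$tmp"
```

Because the directory part is kept in both lists, the two bin.c files end up as two distinct object files below the build directory.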

generic/a/bin.o: generic/a/bin.c
    -mkdir -p $(dir $@)
    gcc $(CFLAGS) $< -c -o $@

The next important aspect here is that we use the automatic make variables $< (referring to the first dependency) and $@ (referring to the target). This rule is also where the vpath directive from the beginning of the makefile comes into play. When searching for the file generic/a/bin.c as a dependency, GNU make first tries to find this file relative to the current directory. Since it does not exist there, make then searches relative to the vpath directory, i.e. the source directory. There, GNU make finds the file and uses it. As a short side note about this rule: the $(dir …) make function extracts the directory part of a file name, and the dash (-) in front of the mkdir command tells make to ignore the exit status of that command. This is a somewhat nasty hack, but it simplifies the makefile by avoiding checks for directory existence.
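The lookup described here can be modelled in a few lines of shell. This is only a simplified sketch of the behaviour, not how make is implemented, and the function name find_prereq is made up: it first checks the name relative to the current directory and then relative to each vpath directory.

```shell
#!/bin/sh
# Simplified model of the prerequisite lookup; not make's real implementation.

find_prereq() {
    # $1: prerequisite as written in the rule, e.g. generic/a/bin.c
    # remaining arguments: vpath directories, searched in order
    name=$1
    shift
    if [ -e "$name" ]; then
        printf '%s\n' "$name"
        return 0
    fi
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

tmp=$(mktemp -d)
mkdir -p "$tmp/build" "$tmp/src/generic/a"
: > "$tmp/src/generic/a/bin.c"
# Run the lookup from the build directory, as make would.
( cd "$tmp/build" && find_prereq generic/a/bin.c ../src )
rm -rf "$tmp"
```

The printed path is the one that $< stands for in the compile command, which is why the compiler finds the source file although the rule only names generic/a/bin.c.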

As said, this makefile is very simplistic and might not match all requirements. However, it at least shows some interesting aspects and functions of GNU make.

Categories: Programming, Toolchain

GCC front-end (last): official GCC internal documentation

June 28th, 2010 Andi Hellmund

This post is part of a series about GCC internals and specifically about how to create a new language front-end for GCC. For a list of related posts, please check this page.

As part of his Google Summer of Code project, one of the future GCC contributors (redbrain) decided to extend the currently available internal GCC documentation with a detailed tutorial about GCC front-ends. I finally decided that the GCC internals document is the right place for such a tutorial, so I’ll contribute my findings and a properly supported front-end skeleton including IR generation there.

For this reason, I’ll stop this still very young series about GCC front-ends. Instead, I’m hopefully going to talk a bit about the new feature in GCC 4.5.0 named link-time optimization (LTO). As I’m currently working on a dumping tool for LTO intermediate files (more about that later), I’ll give some insights into the details of LTO’s implementation …

Categories: Compilers

GCC front-end (3): makefile

May 23rd, 2010 Andi Hellmund

This post is part of a series about GCC internals and specifically about how to create a new language front-end for GCC. For a list of related posts, please check this page.

As described in the last post, each language front-end has its own makefile or makefile fragment, named Make-lang.in (located in the front-end directory), which gets called by the main makefiles in the toplevel-build directory (build-x.y.z) and the gcc sub-directory (build-x.y.z/gcc). This post will only go through the major make targets and assumes a fundamental understanding of the make utility. As a reference for the rest of this post, please check out the Make-lang.in file from the GCC front-end skeleton here.

As an entry point to the GCC makefile hierarchy, let’s consider which targets are called when building GCC and specifically a GCC front-end. Assuming a bootstrap build with the C++ front-end as an example, whenever you do a ‘make’ (which is basically ‘make all’) or a ‘make bootstrap’, the following targets get called. The targets which must be available in the language makefile are c++, c++.start.encap and c++.rest.encap.

[Makefile in the toplevel build directory]
bootstrap
   -> stage3-bubble
     -> all-stage3
       -> all-stage3-gcc

The all-stage3-gcc target changes into the gcc directory and calls make all:

[Makefile in gcc subdirectory]
all
   -> all.internal
     -> native
       -> c++
     -> start.encap
       -> lang.start.encap
         -> c++.start.encap
     -> rest.encap
       -> lang.rest.encap
         -> c++.rest.encap

These three language targets are the main targets for building the compilation driver and the compiler. Other targets, such as those for building the documentation, are not discussed here, while the installation target will be discussed in the final section of this post. The following rules and common practices apply to these three targets:

  • c++ (or <lang>): This target is usually used to build the core compiler, e.g. cc1plus.
  • c++.start.encap (or <lang>.start.encap): This target is the place for all those parts which don’t rely on a working gcc-driver version. A working gcc-driver version in this context just means a gcc-driver created by this build, because the gcc-driver (usually called xgcc before the installation) is also built by this target. Therefore, this target is usually used to build the compilation drivers, e.g. g++.
  • c++.rest.encap (or <lang>.rest.encap): This target is the place for all those parts which rely on a working gcc-driver version. So if your front-end requires any parts to be built by the newly created gcc (not the host gcc generally used for the build), put those targets here. I checked several GCC front-ends and none of them uses this target.

Next, I’ll go through a sample make rule to show how to include dependent libraries and how to get the GCC back-end integrated into your compiler:

sfe1$(exeext): sample_fe/sfe1.o $(BACKEND) $(LIBSDEPS) attribs.o
        $(CC) $(ALL_CFLAGS) $(LDFLAGS) -o $@ sample_fe/sfe1.o \
        $(BACKEND) $(LIBS) attribs.o $(GMPLIBS) $(BACKENDLIBS)

So, the GCC infrastructure provides a lot of variables to simplify the dependency notation and the build commands. Because there are so many variables defined by the infrastructure, I won’t list them here, except for the variable BACKEND. The BACKEND variable lists all the object files provided by the GCC infrastructure to connect your front-end to the middle-end and back-end of GCC.
For a reference of the most commonly used variables, please check the makefile of the front-end skeleton and the makefiles of GCC-integrated front-ends (e.g. c++, java, fortran, etc.). If you would like to know which value a specific variable has, I can only recommend grepping the makefile in the gcc subdirectory.

Front-end installation

For the installation of the front-end executables, the language front-end needs to define a separate installation target, named <lang>.install-common:

EXES = g++

sample_fe.install-common:  installdirs
   for name in $(EXES); \
   do \
      if [ -f $$name ] ; then \
       name2="`echo \`basename $$name\` | sed -e '$(program_transform_name)'`"; \
       rm -f $(DESTDIR)$(bindir)/$$name2$(exeext); \
       $(INSTALL_PROGRAM) $$name$(exeext) $(DESTDIR)$(bindir)/$$name2$(exeext); \
       chmod a+x $(DESTDIR)$(bindir)/$$name2$(exeext); \
     fi ; \
  done

When reading through the above makefile extract, you will notice that the installation target only installs the compilation driver and not the compiler. The compiler gets installed automatically by the GCC makefile. If you are interested in the details, the compiler gets installed by the target install-common of the gcc makefile. While the compilation driver is installed in the {prefix}/bin directory, the compiler is put into the {prefix}/libexec/gcc/<target_noncanonical>/<version> directory. As a note, {prefix} is the directory specified for the --prefix option of the configure script, <target_noncanonical> is a string like x86_64-unknown-linux-gnu (YMMV) and <version> usually has a form like x.y.z.
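The name2 line of the install rule is easier to follow in isolation. The sketch below repeats the basename and sed steps in a small helper. The helper name install_name and the sed programs passed to it are illustrative stand-ins; the real content of $(program_transform_name) depends on how GCC was configured.

```shell
#!/bin/sh
# Sketch: the sed programs passed below are illustrative stand-ins for
# whatever $(program_transform_name) holds in a real build.

install_name() {
    # $1: path of the built executable, $2: sed program
    basename "$1" | sed -e "$2"
}

install_name ./g++ 's/^/x86_64-unknown-linux-gnu-/'   # add a prefix
install_name ./g++ 's/$/-4.5/'                        # add a suffix
install_name ./g++ 's,x,x,'                           # leave the name unchanged
```

The result is the file name under which the driver ends up in $(DESTDIR)$(bindir), with $(exeext) appended by the rule.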

GCC front-end (2): language-specific files

April 7th, 2010 Andi Hellmund

This post is part of a series about GCC internals and specifically about how to create a new language front-end for GCC. For a list of related posts, please check this page.

The last post gave a short introduction to the differences between compiler and (compilation) driver and how to control the different phases of the gcc/g++ drivers. This post now explains the basic file and directory structure of a GCC front-end (the file and directory structure of the GCC project in general will be shown where appropriate in the context of the front-end explanation).

Except for the C compiler, the source code for each GCC front-end is located in a separate directory, namely gcc-x.y.z/gcc/any_name, e.g. gcc-x.y.z/gcc/any_sample_fe. Just as a note in this context: to configure GCC for a new front-end with the --enable-languages configure option, don’t use the directory name, but the language name configured in gcc-x.y.z/gcc/any_sample_fe/config-lang.in as described below.

After extracting the minimal sample front-end as described here, the following files should typically be found in the front-end directory:

config-lang.in
General front-end language configuration used by the configure/make build process, like the name of the language (parameter: language) or the file name of the compiler (parameter: compilers) among others. This file gets read by the configure scripts (gcc-x.y.z/configure, gcc-x.y.z/gcc/configure, etc.) and the contents get incorporated into the generated makefiles. For a detailed description of the individual parameters, please check the GCC internals manual (Chapter 6.3.8.1) available here.

lang.opt
Language-specific driver and compiler options which are automatically parsed by the GCC common code and passed to a front-end specific function for final analysis and internal processing. For a sample file layout, check out the files gcc-x.y.z/gcc/c.opt and gcc-x.y.z/gcc/common.opt

lang-specs.h
This is the specification used by the GCC driver infrastructure to handle the specific phases of the compilation process. As described in the last post, GCC selects the phases and corresponding tools (e.g. compiler) based on file extensions. So assume, as an example, that you want to create a new language which shouldn’t be translated directly into machine code, but first into C/C++ code. Assume furthermore that your new language files end in ‘.my_c_ext’. By using the lang-specs.h file, you could then instruct your language driver to pass your new language file to your compiler, which creates a ‘.c’ file. This file is then processed further by the common tools, e.g. cc1, as, ld, etc. For a sample file layout, please check the file gcc-x.y.z/gcc/gcc.c, which explains the syntax of the lang-specs.h file in detail.

Make-lang.in
Language-specific makefile fragment. This fragment gets included into the GCC Makefile (builddir/gcc/Makefile). It should contain the instructions to build and install the language-specific driver, compiler, man pages and documentation.

lang-tree.def, e.g. sample_fe-tree.def
To simplify the creation of new front-ends and the interaction of the front-end with the middle-end/back-end, GCC provides a language-independent abstract-syntax tree (AST) named GENERIC. GENERIC is a tree-based representation in which each tree node has a unique tree code. All the tree codes provided by GCC are listed in the file gcc-x.y.z/gcc/tree.def. Most of these tree codes suffice for the purposes of a new language front-end, but you might need additional ones for specific language constructs. These extra tree codes are put into a language-specific tree definition file. The naming convention for this file is to use the front-end directory as the first part of the name instead of the language parameter in the config-lang.in file. For example, the C++ front-end, located in the directory gcc-x.y.z/gcc/cp, defines the file gcc-x.y.z/gcc/cp/cp-tree.def. For a sample layout, please check the file gcc-x.y.z/gcc/tree.def. But please keep in mind that the GCC middle-end expects GIMPLE (three-address code tuples) as input (in earlier versions of GCC, GIMPLE was a tree-based intermediate representation based on GENERIC). The tree codes in gcc-x.y.z/gcc/tree.def can be transformed into GIMPLE automatically by the GCC gimplifier, but additional tree codes must be transformed into GIMPLE code manually.

driver-specific source files
compiler-specific files
All the source files for the driver, compiler or whatever tools are required for the new language front-end.

GCC front-end (1): driver vs. compiler

March 26th, 2010 Andi Hellmund

This post is part of a series about GCC internals and specifically about how to create a new language front-end for GCC. For a list of related posts, please check this page.

If I asked many people what the executable gcc does, most of them would answer, “Well, it’s a compiler, so … it compiles the source file into a target file”. But that is NOT correct. The executable gcc is not a compiler, although the abbreviation means GNU C compiler. gcc is what is generally called a compiler driver or, more generically, a compilation driver, in the following just called driver. If you now think this guy is completely insane, please add the -v option to one of your gcc commands and check what gcc is really doing.

While a compiler really is only responsible for transforming the source file into (possibly optimized) target machine code, the driver is the high-level organizer of the overall compilation process, creating an object/shared/executable file from one or more source files. To do so, the driver divides the compilation process into several phases which greatly depend on the capabilities of the programs used. gcc in newer versions (for example 4.4.0 and newer) uses the following phases, assuming that an executable is created from a single source file (gcc source.c -o exec):

1. compiler (cc1)
2. assembler (as; from GNU binutils)
3. collect2 (collect2; part of GCC)
   3.1. linker (ld; from GNU binutils)

Just a note: in earlier versions of gcc, the driver added a separate pre-processing phase before the compiler phase, but in the recent versions the pre-processor is omitted, because the compiler (cc1) incorporates the pre-processor, at least when using the default options. By using the -no-integrated-cpp option, you instruct the driver to split the compiler phase into 2 distinct phases: pre-processing and compiler.

Each of the phases above produces intermediate or temporary files (assuming that the -pipe option is not specified) as output files which then serve as input files for the subsequent phase, in detail:

  • pre-processor (if separate phase)
    • input: *.c
    • output: *.i
  • compiler
    • input: *.c & *.i files
    • output: *.s
  • assembler
    • input: *.s
    • output: *.o
  • collect2/linker
    • input: *.o
    • output: self-defined pattern, e.g. *.exe

Based on the file extension, the driver knows which phase to start with when processing a file. So, if a *.s file is given as input along with a *.o and a *.c file to produce an executable in a single gcc command, gcc runs the *.s file through the assembler phase, the *.c file through the compiler and assembler phases, and finally all three *.o files through the collect2/linker phase. By the way, if an input file extension matches none of the defined extensions, the file is taken as collect2/linker input!
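As a rough illustration of this selection, the following sketch maps a file name to the phases it would run through. It is only a toy model of the behaviour described above for the C driver, not code taken from gcc, and the function name phases_for and the sample file names are made up.

```shell
#!/bin/sh
# Toy model of the driver's phase selection by file extension (C only).

phases_for() {
    case "$1" in
        *.c) echo "compiler assembler linker" ;;
        *.i) echo "compiler assembler linker" ;;   # already pre-processed
        *.s) echo "assembler linker" ;;
        *)   echo "linker" ;;                      # *.o and unknown extensions
    esac
}

for f in start.s helper.o main.c notes.txt; do
    printf '%-10s %s\n' "$f" "$(phases_for "$f")"
done
```

The last input, notes.txt, shows the fallback: an unknown extension is simply handed to the collect2/linker phase.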

The default behavior of gcc is to try to produce an executable file from the input files. But gcc can be instructed to stop after any of the above phases by using specific driver options:

-E : stop after pre-processing; the pre-processed source (*.i content) is written to standard output unless -o is given
-S : stop after compiler, produce a *.s file
-c : stop after assembler, produce a *.o file
none : stop after collect2/linker

Hint: If you want to run through all the phases with a single gcc command but nevertheless keep the intermediate files, use the -save-temps option.

You may ask why all this is important for a new language front-end in gcc. Right! Because your new language might require different or additional phases than the ones described, and then you should know where to start in order to make your new language front-end driver execute your specific phases. For this purpose, GCC (capital letters are used to distinguish the complete compiler project from the C-specific driver) is designed in a modular way to allow the front-ends to “register” new phases, but the addition/modification of the phases will be discussed in a later post. However, for the already interested reader, please take a look at the file gcc-x.y.z/gcc/gcc.c and one of the files of an already existing front-end, for example of C++ in gcc-x.y.z/gcc/cp/lang-specs.h.
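To try the individual phases by hand, a minimal sketch could look as follows. It assumes that a working gcc is on the PATH; the file names (hello.c and friends) and the function name run_phases are made up. Each call uses -o explicitly so that the output file of one phase is the input file of the next one.

```shell
#!/bin/sh
# Sketch: run the four phases one by one. Assumes gcc is installed.

run_phases() {
    # $1: scratch directory for the source and all intermediate files
    dir=$1
    printf 'int main(void) { return 0; }\n' > "$dir/hello.c"
    gcc -E "$dir/hello.c" -o "$dir/hello.i" &&   # pre-processor only
    gcc -S "$dir/hello.i" -o "$dir/hello.s" &&   # compiler only
    gcc -c "$dir/hello.s" -o "$dir/hello.o" &&   # assembler only
    gcc "$dir/hello.o" -o "$dir/hello"           # collect2/linker only
}

if command -v gcc >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    run_phases "$tmp" && ls "$tmp"
    rm -rf "$tmp"
fi
```

Adding -v to any of the four gcc calls shows which program the driver actually starts for that phase.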

GNU Dynamic Loader search directories

July 5th, 2009 Andi Hellmund

The GNU dynamic loader is one of the main components of the user space in Linux based systems (file name: /lib/ld-linux.so.2). Whenever a program is executed, the dynamic loader is loaded into the process’ address space and called by the kernel before control is passed to the program’s “main” function (strictly speaking, it is not the “main” function that is initially called, but this knowledge is sufficient for a high-level understanding and is not part of the dynamic loader internals). The main task of the dynamic loader is to handle the interaction between the program and the system’s shared libraries by relocating unresolved symbols. To keep programs as portable as possible across different GNU/Linux based systems, a program usually only records the shared library name (in fact the shared library’s internal soname, to enable library versioning) while omitting the absolute path to the shared library. Before relocating unresolved symbols, the dynamic loader needs to find the appropriate shared library by searching different directories, which is the focus of this short and high-level post.

The man page ld.so(8) [1] serves as an entry point here and describes which directories are searched in which order:

  1. the DT_RPATH value of the program’s .dynamic ELF section (colon separated list of directories)
  2. the LD_LIBRARY_PATH environment variable (colon separated list of directories)
  3. the DT_RUNPATH value of the program’s .dynamic ELF section (colon separated list of directories)
  4. the dynamic loader cache file, usually /etc/ld.so.cache
  5. the default system library directories configured at compile-time, usually /lib and /usr/lib (skipped if the binary is linked with -z nodefaultlib)

(Note: the LD_PRELOAD environment variable can be used to specify shared libraries to be loaded before any other shared libraries, but you need to specify the absolute library path instead of just a search directory as in the cases above)

Search steps 2, 4 and 5 are easy if you ignore the finer details, but how do you set the DT_RPATH and DT_RUNPATH values of the .dynamic ELF section? For demonstration purposes, assume the following simple test program (main.c) and two shared library source files (lib.c and alt/lib.c) as reference for the subsequent examples. The base directory is ~/rpath_example.
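The search order can be written down as a small model. The sketch below is not the loader's real algorithm; it only prints the five steps in the order they are consulted and drops DT_RPATH as soon as a DT_RUNPATH value is present, which is the behaviour this post relies on. The function name search_order, the <ld.so.cache> placeholder and the example directory are made up; /lib and /usr/lib are the usual defaults mentioned in step 5.

```shell
#!/bin/sh
# Simplified model of the search order; not the loader's real algorithm.

search_order() {
    # $1: DT_RPATH, $2: LD_LIBRARY_PATH, $3: DT_RUNPATH
    # Each argument is a colon-separated list and may be empty.
    rpath=$1
    libpath=$2
    runpath=$3
    if [ -n "$runpath" ]; then
        rpath=      # DT_RPATH is ignored when DT_RUNPATH is present
    fi
    printf '%s\n' "$rpath" "$libpath" "$runpath" '<ld.so.cache>' '/lib:/usr/lib' |
        tr ':' '\n' | sed '/^$/d'
}

echo "DT_RPATH only:"
search_order '$ORIGIN' '/home/user/rpath_example/alt' ''
echo "DT_RPATH and DT_RUNPATH:"
search_order '$ORIGIN' '/home/user/rpath_example/alt' '$ORIGIN'
```

In the first call the embedded path comes first, in the second call the LD_LIBRARY_PATH entry does.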

/* Source file: ~/rpath_example/main.c */
extern void call_into_library ();

int main (int argc, char *argv[])
{
    call_into_library ();
    return 0;
}

Shared library source files lib.c and alt/lib.c:

/* Source file: ~/rpath_example/lib.c */
#include <stdio.h>

/* Return type matches the declaration used in main.c. */
void call_into_library()
{
    printf("lib\n");
}

/* Source file: ~/rpath_example/alt/lib.c */
#include <stdio.h>

void call_into_library()
{
    printf("alt-lib\n");
}

Refer to this page for a GCC shared library tutorial. The following commands will finally create the shared libraries libtest.so and alt/libtest.so:

andi@roma:~/rpath_example$ ls -R
.:
alt  lib.c

./alt:
lib.c

andi@roma:~/rpath_example$ gcc -fPIC lib.c -c
andi@roma:~/rpath_example$ gcc -shared lib.o -o libtest.so
andi@roma:~/rpath_example$ cd alt
andi@roma:~/rpath_example/alt$ gcc -fPIC lib.c -c
andi@roma:~/rpath_example/alt$ gcc -shared lib.o -o libtest.so

Common Use

The common way to link the main program against the shared library libtest.so requires the environment variable LD_LIBRARY_PATH to be set at run time, because the dynamic loader would otherwise not find the library:

andi@roma:~/rpath_example$ gcc main.c -o main -L. -ltest
andi@roma:~/rpath_example$ ./main
./main: error while loading shared libraries: libtest.so: cannot open shared object file: No such file or directory
andi@roma:~/rpath_example$ export LD_LIBRARY_PATH=.
andi@roma:~/rpath_example$ ./main
lib

The .dynamic ELF section of the main program then yields:

andi@roma:~/rpath_example$ readelf -d main
[...]
0x00000001 (NEEDED)                     Shared library: [libtest.so]
0x00000001 (NEEDED)                     Shared library: [libc.so.6]

DT_RPATH

The DT_RPATH feature allows you to embed the search path for the dynamic loader into the executable:

andi@roma:~/rpath_example$ gcc main.c -o main -L. -ltest -Wl,-rpath,\$ORIGIN
andi@roma:~/rpath_example$ ./main
lib
andi@roma:~/rpath_example$ readelf -d main
[...]
0x00000001 (NEEDED)                     Shared library: [libtest.so]
0x00000001 (NEEDED)                     Shared library: [libc.so.6]
0x0000000f (RPATH)                      Library rpath: [$ORIGIN]

andi@roma:~/rpath_example$ ./main
lib
andi@roma:~/rpath_example$ export LD_LIBRARY_PATH=~/rpath_example/alt
andi@roma:~/rpath_example$ ./main
lib

The $ORIGIN variable within the DT_RPATH value refers to the directory containing the main program’s executable, not to the current working directory (in the example above both happen to be ~/rpath_example, because the program is started from its own directory). As an alternative to the linker option -rpath, you could also set the LD_RUN_PATH environment variable.

The general drawback of using DT_RPATH is that you cannot override this setting with LD_LIBRARY_PATH, which limits the flexibility of using shared libraries (the only way to get around this limitation without re-linking the application is to remove the shared library from the specified paths). This drawback is solved by DT_RUNPATH.
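How an $ORIGIN entry turns into a real directory can be emulated roughly in shell. This is a sketch of the idea, not what the loader executes: it takes the directory of the executable file and puts it in place of a leading $ORIGIN. The function names origin_of and expand_origin are made up, and readlink -f is assumed to be available (GNU coreutils and BusyBox provide it).

```shell
#!/bin/sh
# Rough emulation of $ORIGIN expansion; assumes readlink -f is available.

origin_of() {
    # Directory that contains the given executable, with symlinks resolved.
    dirname "$(readlink -f "$1")"
}

expand_origin() {
    # $1: one rpath entry, $2: path of the executable
    case "$1" in
        '$ORIGIN'*) printf '%s%s\n' "$(origin_of "$2")" "${1#\$ORIGIN}" ;;
        *)          printf '%s\n' "$1" ;;
    esac
}

tmp=$(mktemp -d)
: > "$tmp/main"                     # stand-in for the linked executable
expand_origin '$ORIGIN' "$tmp/main"
expand_origin '$ORIGIN/alt' "$tmp/main"
expand_origin '/usr/local/lib' "$tmp/main"
rm -rf "$tmp"
```

The result only depends on where the executable file lives, so it stays the same no matter from which directory the program is started.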

DT_RUNPATH

The DT_RPATH limitation is solved by the dynamic loader ignoring the DT_RPATH value if the DT_RUNPATH value is set. In this case, the dynamic loader searches LD_LIBRARY_PATH before the path embedded in the executable. The DT_RUNPATH value is set with the linker options -rpath (or LD_RUN_PATH) and --enable-new-dtags.

andi@roma:~/rpath_example$ gcc main.c -o main -L. -ltest -Wl,-rpath,\$ORIGIN,--enable-new-dtags
andi@roma:~/rpath_example$ readelf -d main
[...]
0x00000001 (NEEDED)                     Shared library: [libtest.so]
0x00000001 (NEEDED)                     Shared library: [libc.so.6]
0x0000000f (RPATH)                      Library rpath: [$ORIGIN]
0x0000001d (RUNPATH)                    Library runpath: [$ORIGIN]

andi@roma:~/rpath_example$ ./main
lib
andi@roma:~/rpath_example$ export LD_LIBRARY_PATH=~/rpath_example/alt
andi@roma:~/rpath_example$ ./main
alt-lib

References
[1] http://linux.die.net/man/8/ld-linux