# vim:wrap:spell:spelllang=en,en
#
# DO NOT EDIT! This file is generated automatically!
#
# __SOURCE([intro.mc], [20201127-21:01:29], [95cfba0], [e652fd5])

Generating code in M4: introduction

The M4 macro processor is used to generate arbitrarily complex code from simple source code. The introductory part of the series covers its history, the basic principles of the language, examples of usage and the prerequisites for mastering it.

Introduction

Readers of this series will learn how to write scripts that generate code by machine. The generated code can be arbitrarily complex and can contain internal dependencies. Humans can hardly keep interdependent files with complex code in a consistent state, so a code generation mechanism is necessary. The code is generated by a text transformation tool – a macro processor.

The series focuses on the practical use of the universal macro processor M4 (hereafter M4), illustrated by small examples. It also describes the theory shared by all its implementations. The aim of the series is to acquaint the reader with both the tool (m4 is a command-line program) and the programming language (M4 is a programming language): what it is used for, how to program in it, and what its advantages and disadvantages are.

[note]
The multilingual series “Generating code in M4” is itself generated by M4 scripts, which will (maybe) make it easier for other authors to write articles for www.root.cz (Root.cz – information not only from the Linux world). The series also results in a set of sample scripts for generating code.

The introductory part describes the basic principles of the language with simple examples of use. All examples use the rewriting rules of a context-free grammar. Later we will learn how to use output queues, automata, associative memories, stacks and pushdown automata.
We will also learn how to write testing automata to test input data.

Examples for readers

The examples are a complementary part of the series and will be based to some extent on the discussion below the article. At the beginning of each episode, some parts of the M4 language will be described and supplemented with a set of examples at the end. Each part can be read in any order.

- Code generation examples
- Preprocessor examples
- M4: examples
- Why to use M4 and why not?
- http://github.com/jkubin/m4root (Generating code in M4) – project generating this series

History of macro languages

Macro languages were invented when assembly language (ASM) dominated. ASM source code usually contains identical instruction sequences that differ only in operand values. Such sequences can be grouped under one word, a macro instruction, whose name usually describes the purpose of the hidden sequence of instructions. The macro processor translates these macro instructions back into the original instruction sequences, which are then translated into executable machine code. Programming in ASM using macro instructions is simpler, faster and less prone to human error.

Later, macro languages were used to extend compiled programming languages because they made it possible to write source code at a higher level of abstraction than the programming language itself offers. Macro languages preserve the speed, performance and efficiency of the complex lower-level programming language. However, it is important to understand all layers of the code well.

GPM (General Purpose Macro-generator)

Christopher Strachey introduced the basic idea of rewritable strings with arguments, which recursively rewrite to other strings, in his GPM in 1965. The next generation of macro processors, M3 and M4, basically just expanded the original GPM.
The basic idea of the original proposal remained the same.

M3

Dennis Ritchie took over the basic idea of GPM and wrote an improved macro processor for generating source code in the C language (1972), which he himself designed. The new macro processor was written for the AP-3 minicomputer, hence the name M3. This direct ancestor of the current M4 saved a significant amount of heavy and time-consuming work and attracted developers programming in other languages (FORTRAN (FORmula TRANslation), COBOL (COmmon Business-Oriented Language), PL/I (Programming Language One), …). Developers customized M3 for these languages, turning it into the universally usable M4 macro processor.

[m4 ∈ {set of UNIX tools}]
Dennis Ritchie was also a co-creator of UNIX and therefore:
- M4 is minimalist and fast; it does one thing and does it well (UNIX philosophy)
- it relies solely on the non-interactive command line interface
- parameters and dependencies of M4 scripts are described by a Makefile
- the # character begins a one-line comment, as in a UNIX shell
- the variables $@, $*, $#, $0, $1, $2, … have meanings similar to those in a UNIX shell
- the argument delimiter is a comma

The M3 macro processor was also extended by Jim E. Weythman, the author of a program construction that is used in almost every M4 script.

[note]
The divert keyword (divert(-1), divert(0), divert(1), …, divert(2147483647)) switches output queues. The argument -1 completely disables any text output. The argument 0 switches output to stdout (standard output).

M4

Brian Kernighan enhanced the M3 macro processor into a FORTRAN 66 preprocessor to create a hybrid language extension named RATFOR (RATional FORtran). The basic program constructions of this extension (conditions, cycles) are the same as in the C language. Programming in RATFOR is similar to C programming.
The macro processor converts the source code back to FORTRAN, and the compiler then performs the usual compilation to machine code.

[M4 language complements C language]
Note the almost perfect symbiosis with the C language:
- CPP (C preprocessor) directives such as #define, #include and #ifdef are comments for M4
- most keywords separated from their parentheses by whitespace lose their meaning; for example, M4 ignores define (…)
- macro arguments are separated by commas, just like the arguments of C functions; inside a defined macro they are available as the variables $0 (the macro name), $1, $2, …
- the left control character ` is not part of the C family syntax
- the right control character ' does not matter unless it is part of a macro; both control characters can be hidden in user-defined macros
- macros are written in capital letters, just like nonterminal symbols, which delimits their namespace

The user manual mentions other co-authors not mentioned here, so it would be unfair to write that the M4 macro processor (1977) has only two authors.

[Christopher Strachey, Dennis Ritchie, Brian Kernighan]

GNU M4

Today there are several implementations that differ from the original only in small details. The most common implementation is GNU M4, which is used by Autotools and for translating the simple sendmail.mc configuration file into the complex sendmail.cf. The author of this implementation (1990) is René Seindal. To install m4 (with a lowercase “m”), type the following command:

[the command also installs other important packages]

A detailed description of the keywords can be found in the GNU M4 documentation.

Basics of M4

M4 is based on context-free grammar, automata, stacks and output queues. To understand M4, it is therefore crucial to understand the basic concepts of formal language theory: terminal symbols (terminals for short) and nonterminal symbols (nonterminals for short). These terms will be explained in more detail later.
The objective is to show the basic practical use of the M4 language through examples.

Context-free grammar

A context-free grammar (CFG for short) is a formal grammar in which all rewriting rules have the form A → β. The nonterminal A is rewritten to an arbitrarily long string β (the right side of the rewriting rule) composed of nonterminals from N or terminals from Σ, that is, β ∈ (N ∪ Σ)*. The Kleene star * means that the nonterminal A can also be rewritten to ε (epsilon, the empty symbol) by the rewriting rule A → ε.

[context-free grammar rewriting rules]

M4 rewriting rules

The rewriting rules are the same for context-free grammars and M4.

[M4 rewriting rules]

All M4 keywords are nonterminals (macros), which take an action and are rewritten to ε (the empty symbol) or another symbol. All keywords can be renamed or turned off completely. This feature is crucial for the preprocessor mode.

[M4 keywords are nonterminals]

Nonterminal expansion control

The default character pair `' in M4 controls the expansion of nonterminals. The changequote keyword can change the pair to other characters, for example {[] (square brackets), ␂␆ (nonprintable characters), ⟦⟧ (UTF-8 characters)}. Nonterminals that we do not want to expand (immediately) are surrounded by this pair of characters. When they pass through the macro processor, all symbols between the pair are treated as terminal symbols and the outer pair is removed. The next pass expands the originally protected nonterminals. The control character pair is set at the beginning of the root file.

Automata

Automata serve as “switches” of grammar rules. They use the rewriting rules as nodes and change their states according to input symbols. The currently used rule produces specific code to the output queue (or several output queues) until the automaton moves to another node with a different rule. Examples of generating automata are in the appendix.
Output queues

The output queues temporarily store portions of the resulting code. These portions are formed by the rewriting rules, which rewrite the input symbols. The divert keyword sets the output queue. Finally, all non-empty queues are dumped to standard output in ascending order and compose the final code. Examples of output queues are in the appendix.

[for information]
Stacks will be described later.

Main uses of M4

M4 is used to generate source code in any programming language or as a preprocessor for any source code.

The code generation

M4 transforms input data from .mc (Macro Configuration) files to output data with the following command:

[← the_most_general.m4 … the_most_special.m4 →]

Two basic operations are performed during file loading:
- reading the transformation rules from files with the .m4 extension
- expanding the macros inside the .mc files

The .mc files contain the input data in a format that allows them to be transformed into output data according to the rules in the preceding files. The data files usually do not contain any transformation rules.

The input data may also come from a pipeline:

[input code → source code generation → file]
[input code → source code generation → program]

Try: Code generation examples

The preprocessor

M4 can operate in preprocessor mode and can also be part of a pipeline. The input source code passes through unchanged except for nonterminal symbols. The nonterminals found are expanded to terminals and sent to the output along with the source code. M4 can extend any other language whose preprocessor is insufficient (no recursion) or missing. It is important to select a left character for nonterminal expansion control that does not collide with any character of the input source code. However, a character collision is easily solved by a regex.
[M4 as preprocessor – in general]
[M4 as preprocessor – without intermediate file]

Default characters

The conflicting character from the input source code is hidden in a macro. An empty pair of control characters before the macro serves as a symbol separator. When the source code passes through the macro processor, the macro is rewritten back to the original character and the empty pair is removed.

[M4 as preprocessor with control characters: `']

If there are # or dnl comments in the source code, they must be hidden first. The `' characters turn off the comments' original meaning and are removed by the macro processor. The M4 # and dnl comments are hidden between the default characters:

[M4 as preprocessor with control characters: `']
[M4 as preprocessor with control characters differently: `']

Square brackets

If square brackets are used to control the expansion of nonterminals, the left square bracket is hidden in the same way. Everything else applies as for the default characters `'.

[M4 as preprocessor with control characters: []]

The M4 # and dnl comments are hidden between square brackets:

[M4 as preprocessor with control characters: []]
[M4 as preprocessor with control characters differently: []]

Non printable characters

The nonprintable characters ␂ (0x02) and ␆ (0x06) can be used to control the expansion of nonterminals. These characters cannot collide with printable source code characters.

[M4 as preprocessor with control characters: ␂␆]

The M4 # and dnl comments are hidden between the nonprintable characters:

[M4 as preprocessor with control characters: ␂␆]
[M4 as preprocessor with control characters differently: ␂␆]

UTF-8 characters

The expansion of nonterminals can also be controlled by a suitably selected pair of UTF-8 characters. Ordinary source code does not contain such characters, so we do not have to solve the collision of the left bracket. UTF-8 characters offer advantages similar to those of nonprintable characters.
[M4 as preprocessor with control characters: ⟦⟧]

The M4 # and dnl comments are hidden between the UTF-8 characters:

[M4 as preprocessor with control characters: ⟦⟧]
[M4 as preprocessor with control characters differently: ⟦⟧]

Try: Preprocessor examples

Mixed mode

The mixed mode combines the previous modes and is mainly used for experiments. The data is not separated from the rules for its transformation: the leaf file contains the transformation rule definitions along with the input data.

[how to learn M4]

Try: M4: examples

Prerequisites for mastering M4

To master this macro language successfully, it is important to fulfill several prerequisites. M4 is not a simple language because it is not possible to think and program in it as in an ordinary programming language. The most important thing to realize is that it is used to program the rewriting rules of a grammar. Each string is either a terminal or a nonterminal symbol, including all language keywords (a couple of symbols are special cases of nonterminals).

[note]
M4 intentionally has no keywords for cycles (for/while) because its basis is quite different from that of procedural or functional languages.
- loops are only left-recursive or right-recursive
- branching is done by symbol concatenation or by the ifelse and ifdef keywords

Fundamentals of grammars

All grammars are based on rewriting rules, whose forms are described in general by:

[Formal grammar (Chomsky type)]

Formal grammar describes the subsets (the Chomsky hierarchy) of formal language rewriting rules; one of these subsets is called context-free grammar, CFG for short. As mentioned earlier, the CFG rewriting rules work the same way as the M4 rewriting rules. Some of the following episodes of this series will focus on formal grammar in detail.
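For reference, the general form of a grammar and its context-free restriction can be written in the usual textbook notation. This is an assumption about the notation the series intends, not a copy of its own listing; N, Σ, P and S are the conventional names for the nonterminals, the terminals, the rewriting rules and the start symbol.

```latex
\begin{align*}
&G = (N, \Sigma, P, S), \qquad N \cap \Sigma = \emptyset, \qquad S \in N \\
&\text{general rule: } \alpha \to \beta, \quad
  \alpha \in (N \cup \Sigma)^{*} N (N \cup \Sigma)^{*}, \quad
  \beta \in (N \cup \Sigma)^{*} \\
&\text{context-free rule: } A \to \beta, \quad
  A \in N, \quad \beta \in (N \cup \Sigma)^{*}
\end{align*}
```

The context-free rule is the special case in which the left side is a single nonterminal, which is exactly the shape of an M4 macro definition.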
Fundamentals of automata

The ability to use mostly two-state automata is essential for writing simple M4 scripts because the vast majority of scripts use small automata.

Testing automaton

The order of input symbols or their context can be tested by an automaton. If the input symbols meet the required properties, the automaton ends up in a double-ring node, which indicates the accepting state.

[deterministic finite automaton (DFA)]
Example of an automaton that accepts an even number of occurrences of one symbol (zero occurrences count as even) while ignoring the other symbols. The automaton is equivalent to a regular expression.

The previous automaton can be drawn as ASCII art accompanying the M4 script:

[ASCII art for M4 code documentation]

Generating automaton

Input symbols change the nodes of the automaton, thereby changing the rewriting rules for code generation. See the appendix for this generating automaton:

[ASCII art of generating automaton]

(GNU) make

A well-designed code generator usually consists of several smaller files whose order, dependencies and parameters are written in the Makefile. Good knowledge of writing Makefiles is therefore a prerequisite for mastering M4. Reading and maintaining source code generally takes more time than creating it. A well-structured Makefile therefore contributes significantly to the overall clarity of the resulting code generator.

[we will deal with this topic in another part]
Executing make from the code editor with a shortcut key significantly speeds up M4 code development.

Vim

Mastering the Vim editor is an important prerequisite for the convenience and speed of writing M4 code. Vim shortcuts save large amounts of unnecessary typing. They also significantly reduce the occurrence of almost invisible errors caused by an unpaired bracket, thus saving time otherwise lost on debugging.
Talent and time

M4 usually cannot be mastered over a weekend, especially when the fundamentals of automata theory and formal grammars are lacking. To master the M4 language, it is necessary to program in it over a longer period of time and to write a good amount of bad (complex) M4 code that you later rewrite once you have a better idea. In this way it is possible to gain practice gradually.

Code generation examples

[for information]
The characters {`' (quotation marks), [] (square brackets), ␂␆ (nonprintable characters), ⟦⟧ (UTF-8 characters)} in the name show which pair controls the expansion of nonterminals.

[note]
The examples in this appendix are more complex and are intended to demonstrate the practical use of M4. They will be explained in detail later.

Input source code

The input source code is similar to CSV (Comma Separated Values) and is converted to arbitrarily complex target code in another language using CFG (Context-Free Grammar), automata and output queues. Stacks are not used in the examples. The input source code contains special characters that must be hidden.

[note]
The input file may also contain notes, which need not be hidden in comments.

CSV: simplest example

This example does not use output queues; it only prints the CSV (Comma Separated Values) items, separated by a delimiter, to standard output.

CSV: counter

The example uses a counter macro from a separate file whose β (the right side of the rewriting rule) is copied to the right side of the example's own macro. The first expansion of the counter initializes its starting value. Each further expansion returns a numeric terminal symbol and increases an inner auxiliary (global) symbol by one. The counter is a small automaton.

(how to do it) Modification of special characters

Each type of output code requires its own modification of the special characters. A built-in M4 keyword is not appropriate for this type of task.
First, we hide all special characters of the input file in appropriately named macros using regular expressions.

Modified input code

[all special characters are hidden into macros]

We create several conversion files according to the target code type; the macros for square brackets are already defined in the root file.

Conversion file for XML, XSLT, HTML:
[conversion file for markup languages]

Conversion file for C, JSON, INI:
[conversion file for a source code]

Conversion file for Bash:
[conversion file for Bash "strings in quotation marks"]

Conversion file for Bash:
[conversion file for Bash 'strings in apostrophes']

Conversion file for CSV, M4 (returns all characters):
[the conversion file puts all special characters back]

C: output queue

The example uses one output queue for the characters that close the array at the end.

INI: an external command

The example runs an external command and places its output in square brackets. The output of the external command consists of two comma-separated items. A macro selects the first item because the second item contains an unwanted newline character.

.h: hex counter

The example uses a counter macro to number the resulting CPP (C preprocessor) macros and one output queue. The queue contains the preprocessor directive that terminates the header file. The decimal value of the counter is converted to a two-digit hex number by an M4 keyword.

C: small automaton

The example uses a small automaton to generate a newline character and one output queue that contains the characters terminating the resulting string. The first time the automaton macro is run, it is rewritten to ε (epsilon, the empty symbol); in all following runs it is rewritten to the generated newline character.

C: small automaton 2

This example is similar to the previous one, but each string is on a new line.

HTML: output queues

The example uses two output queues.
One queue contains the paragraphs; the other contains the closing HTML tags. Navigation links do not have to be stored anywhere; they go straight to the output. The other message types are processed in the same way as the first one.

Branching by grammar

The example shows branching by grammar; the macro arguments are ignored. The input nonterminals are rewritten to the terminals (🐛), (🐜), (🐝).

Branching by grammar – basic principle

The variable $0 is replaced by the name of the macro and concatenated with another symbol. The newly formed nonterminal is rewritten to the corresponding terminal symbol (queue number or name).

[grammar branching in M4]

JSON: generating automaton

The example uses two output queues and one generating automaton. The first error message, in the initial state, generates a header with brackets and outputs the first record. The automaton then moves to the next state, which is a β rule (a rule used as the right side of another rewriting rule). The following error messages, in this state, only output individual records. At the end, the two output queues print the closing characters that complete the resulting JSON.

JSON: named queues

The example also processes the other message types. It uses three automata and six output queues. If we generate more complex source code, we will soon encounter the problem of keeping output queue indexes consistent. To avoid confusion, we use queue names instead of numbers. To avoid having to define similar rules, we copy the right side of the first rule (which is also a β rule) to the right sides of the other two rules.

JSON: generated queue indexes

During development, the order and number of output queues often change, which also requires frequent changes to their indexes. It is therefore appropriate to generate the indexes. We can then use a virtually unlimited number of queues. The following example shows how these indexes are generated.
INI: discontinuous queue index

The example uses three automata and two output queues defined in a separate file. INI section names are generated by symbol chaining (see branching). The example uses the same output queue file as the JSON example.

XML: mixed messages

The example uses one output queue for the closing tag.

XML: separated messages

The example groups messages by their type using output queues.

Bash

Bash

Preprocessor examples

[for information]
The characters {`' (quotation marks), [] (square brackets), ␂␆ (nonprintable characters), ⟦⟧ (UTF-8 characters)} in the name show which pair controls the expansion of nonterminals.

C preprocessor and M4

The CPP (C preprocessor) directives are one-line comments for M4, which prevents unwanted expansion of macros with the same names. If we define a safer macro, the similarly named CPP macro will not be overwritten. Thus, the CPP namespace can be completely separated from the M4 namespace. The problematic ` (backquote) character is hidden in a macro. An apostrophe in the source code does not matter; an apostrophe inside a macro is hidden in a macro. Note the function names and the places where the macro is expanded.

CSS: file inclusion, comment

CSS uses the # character for color codes, which is also the beginning of a one-line M4 comment. The changecom keyword sets a multiline comment and rewrites itself to ε (epsilon, the empty symbol). Comments can be turned off with the same keyword without parameters.

[file embedded by the macro processor]

Bash: nonprintable characters

Bash uses both ` and ' characters. If we do not want to hide them in macros, we can use nonprintable characters (displayed as UTF-8 characters) for expansion control.

M4: examples

[for information]
The characters
{`' (quotation marks), [] (square brackets), ␂␆ (nonprintable characters), ⟦⟧ (UTF-8 characters)} in the name show which pair controls the expansion of nonterminals.

JSON: left bracket

The example uses a left square bracket inside square brackets. Therefore, the left square bracket is replaced by a macro defined in the root file.

Bash: counters

The counters are defined in a separate file. The nonterminals in brackets will not be expanded; only the outer brackets will be removed. A macro defined in the root file must be used.

.h: brackets

The empty pair (or the empty symbol in brackets) serves as a symbol separator. Brackets around the # comment character turn off its original meaning as well as the meaning of the more powerful M4 comment dnl. They also turn off the original meaning of the comma as a macro argument delimiter. These symbols become ordinary terminal symbols without any side effect.

AWK: examples of safer macros

A safer macro is ignored when it is written without parentheses. Such macros are explicitly created by the script developer; see the root file.

Why to use M4 and why not?

[for information]
The characters {`' (quotation marks), [] (square brackets), ␂␆ (nonprintable characters), ⟦⟧ (UTF-8 characters)} in the name show which pair controls the expansion of nonterminals.
Why generate code in M4

- direct use of context-free grammar (recursion for free): minimum M4 code is required for data transformation
- direct use of automata: possibility to model the necessary algorithms (M4 does not need versions)
- direct use of stacks: stacks connected to automata extend the capabilities of the code generator
- direct use of output queues to temporarily store the resulting pieces of code: individual queues are finally dumped to the output in ascending order
- significantly faster code generation (compared to XSLT): low demands on computing resources

Why avoid M4

- low-level universal language (similar to the C language): in return it provides tremendous flexibility, like UNIX
- almost nonexistent developer community (as of Autumn 2019): M4 is a nearly forgotten language with a small number of existing projects
- unusual programming paradigm requiring several prerequisites: that is why M4 can be considered a challenging language
- productivity greatly depends on experience (a problem with short-term deadlines): writing M4 scripts requires basic knowledge of automata and grammars
- maintaining badly written M4 code is not easy: existing M4 code is easily thrown into confusion (supervision required!)

---

- Generating code in M4, a template with examples for www.root.cz (Root.cz – information not only from the Linux world)
- A General Purpose Macro-generator, Computer Journal 8, 3 (1965), 225–41
- RATFOR - A Preprocessor for a Rational Fortran, Brian W.
Kernighan
- The M4 Macro Processor, Bell Laboratories (1977)
- Christopher Strachey, Computer Hope – Free computer help since 1998
- Dennis Ritchie, Zomrel tvorca Unixu a jazyka C (The creator of Unix and the C language has died)
- Brian Kernighan, An Interview with Brian Kernighan
- GNU M4 - GNU macro processor, Free Software Foundation
- Automata theory, From Wikipedia, the free encyclopedia
- GNU Make Manual, Free Software Foundation
- Vim – the ubiquitous text editor that edits text at the speed of thought
- Automaty a formální jazyky I (Automata and Formal Languages I), lecture notes, FI MU
- Automaty a gramatiky (Automata and Grammars), Michal Chytil, 1st edition, Prague, 331 pp., 1984.