(* Awsum surface grammar (EBNF). Applies-to: Awsum 0.0.6 *)

Program = [ ModuleComment , BlankLine ] ,
          [ FirstImport , { Import } ] ,
          { TopItem , Terminator } ,
          [ Terminator ] ;

(* Optional single block comment at the very top of the file, separated
   from the rest by exactly one blank line. Line comments are rejected in
   this position; multiple block comments are rejected because the second
   one lacks its own preceding blank line. *)
ModuleComment = BlockComment ;
BlankLine     = NEWLINE , NEWLINE ;

(* The first import (if any) must not carry leading comments; those would
   be module-comment material. Subsequent imports may have leading comments
   (a commented-out "import IO.X" between live imports is the motivating
   case). *)
FirstImport = "import" , ModulePath , [ TrailingLineComment ] ;
Import      = { TopComment , Terminator } ,
              "import" , ModulePath , [ TrailingLineComment ] ;
ModulePath  = UIdent , { "." , UIdent } ;

TopItem = TopComment | TypeDecl | EmptyTypeDecl | Sig | FunDef ;

TypeDecl = "type" , UIdent , { LIdent } ,
           [ "=" , ConDef , { "|" , ConDef } ] ,
           [ TrailingLineComment ] ;

(* `empty type X` declares the row identity: an uninhabited type whose
   value (none exist) subsumes into any expected position via row
   subsumption. Any two `empty type` declarations are interchangeable in
   row positions. The form forbids type parameters and constructors (the
   parser rejects `empty type X a` and `empty type X = …`); a plain
   `type X` with zero constructors is also uninhabited but remains a
   distinct row label. *)
EmptyTypeDecl = "empty" , "type" , UIdent , [ TrailingLineComment ] ;

ConDef = UIdent , { TypeAtom } ;

Sig    = DeclName , ":" , Type , [ TrailingLineComment ] ;
FunDef = DeclName , { ParamBinder } , "=" , Expr , [ TrailingLineComment ] ;

(* Function parameters: either a simple name, or a parens-wrapped
   destructuring pattern (single-constructor types pass exhaustiveness
   trivially; refutable patterns from multi-constructor types raise
   NonExhaustiveCase).
   The parens around a pattern are syntactically required to distinguish
   a destructuring parameter from a sequence of bare-name parameters:
   `f Tuple3 a b c` is four parameters, not one pattern with three
   fields. *)
ParamBinder = LIdent | "(" , Pattern , ")" ;

(* Name of a top-level declaration. The parenthesised "++" form is how the
   bundled prelude spells the concatenation operator itself
   (`(++) = BuiltIn.concatString`), so hover and go-to-definition on
   `a ++ b` land on the same binding. The form is syntactically allowed in
   user files too but is practically prelude-only: only the literal "++"
   token is accepted, `a ++ b` always compiles to the prelude binding
   regardless of any user-level redefinition, and a user declaration would
   collide with the prelude's. User-defined infix operators are not a
   language feature. *)
DeclName = LIdent | "(" , "++" , ")" ;

TopComment          = LineComment | BlockComment ;
LineComment         = "--" , LineText ;
BlockComment        = "{-" , BlockText , "-}" ;
TrailingLineComment = "--" , LineText ;

(* Types: '|' < '->' < TypeApp < TypeAtom (precedence increases left to
   right). '|' is right-associative; '->' is right-associative; TypeApp is
   left-associative. So 'A | B -> C' parses as 'A | (B -> C)' and 'A B C'
   parses as '(A B) C'. To put a union on the LHS of an arrow, write
   '(A | B) -> C' with explicit parens. *)
Type      = TypeArrow , { "|" , Type } ;
TypeArrow = TypeApp , [ "->" , TypeArrow ] ;
TypeApp   = TypeAtom , { TypeAtom } ;
TypeAtom  = UIdent | LIdent | "(" , Type , ")" ;

(* Expressions: lambda / let / do / case / |> chain / ++ chain /
   application / atoms. Lambda, let, do-block and case forms are kept at
   the lowest precedence (same level) so their bodies extend as far right
   as possible. The infix operator precedence (lowest to highest):
     1) |> (left-assoc): pure syntactic rewrite to application;
        `(|>)` is /not/ a referenceable name.
     2) ++ (left-assoc): string concatenation; sugar for the prelude
        binding `(++)`.
     3) application (left-assoc)
     4) atoms *)
Expr = Lambda | LetIn | DoBlock | CaseExpr | Pipe ;
Pipe = PipeOp , { "|>" , PipeOp } ;

(* Right-hand side of '|>' is every Expr alternative /except/ Pipe itself;
   that's what makes '|>' left-associative without giving up the
   right-greedy reach of Lambda / LetIn / DoBlock / CaseExpr. *)
PipeOp = Lambda | LetIn | DoBlock | CaseExpr | Concat ;

Lambda = "\\" , LIdent , { LIdent } , "->" , Expr ;

(* Two surface shapes are accepted for 'let':
     - Inline single binding: let n = e in body
     - Layout multi-binding:  let n1 = e1 (NEWLINE ni = ei)* (NEWLINE)? in body
   In the layout form, all 'ni' must align at the column where 'n1'
   started; 'in' must appear at a column strictly less than the bindings'
   column ('in' is dedented relative to the bindings, Haskell-style). Both
   shapes parse to a chain of nested 'ELet' nodes: the renderer collapses
   chains visually but the AST is single-binding throughout. *)
LetIn = "let" , LetBinding , { NEWLINE , LetBinding } , [ NEWLINE ] ,
        "in" , Expr ;

(* The let-binding's LHS is a pattern, so destructuring forms like
   `let (Tuple3 a b c) = e` work alongside the simple `let n = e` shape.
   The optional ":" Type ascription only applies to single-name (`PVar`)
   binders; ascribing a destructuring pattern as a whole is rejected with
   `PatternLetAscription`, and the user should ascribe the right-hand side
   instead. Without an ascription the typechecker tries to synthesise the
   right-hand side; on synthesis failure (e.g. a do-block whose `<-` steps
   return Either with different error labels and the row-union can't be
   reconciled bottom-up) it reports `MissingLetAnnotation`, requiring the
   user to spell out the expected type. The same rules apply to the
   LetBindingDoStmt form below.
*)
LetBinding = Pattern , [ ":" , Type ] , "=" , Expr ;

DoBlock = "do" , NEWLINE , DoStmt , { Terminator , DoStmt } ;
DoStmt  = Pattern , "<-" , Expr
        | "let" , LetBindingDoStmt
        | Expr ;
LetBindingDoStmt = Pattern , [ ":" , Type ] , "=" , Expr ;

Concat = App , { "++" , App } ;
App    = Atom , { Atom } ;
Atom   = QName | UIdent | "(" , Expr , ")" | String | Int ;

(* Case: arms at reference indent (set by first arm). Comments accepted at
   any column > 1 (lenient so misaligned comments don't break compilation;
   the formatter normalizes). A blank line terminates the case block. *)
CaseExpr = "case" , Expr , "of" , NEWLINE , { CaseItem } ;
CaseItem = ( LineComment | BlockComment ) , Terminator
         | Pattern , "->" , Expr , [ TrailingLineComment ] , Terminator ;

Pattern     = UIdent , { PatternAtom }
            | LIdent
            | "(" , Pattern , ":" , Type , ")" ;
PatternAtom = LIdent
            | UIdent
            | "(" , Pattern , ")"
            | "(" , Pattern , ":" , Type , ")" ;

QName = { UIdent , "." } , LIdent ;

(* Lexical sketch (exact rules implemented by the lexer):
     - String escapes: \n \t \r \" \\ \0
     - Block comment nesting is handled lexically
   Identifiers accept an optional leading underscore. A bare "_" is also
   accepted (both for LIdent and UIdent) because the typechecker, not the
   parser, is responsible for rejecting it in positions where a wildcard
   isn't allowed (type names, constructor names, top-level defs, type
   parameters). Letting the parser accept it gives friendlier error
   messages than Megaparsec's "unexpected character".
*)
LIdent    = ( lower | "_" ) , { identTail } ;
UIdent    = upper , { identTail }
          | "_" , [ upper , { identTail } ] ;
identTail = lower | upper | digit | "_" | "'" ;

String     = '"' , { stringChar } , '"' ;
stringChar = escape | ( any - ['"' | '\\'] ) ;
escape     = "\\" , ( "n" | "t" | "r" | "\"" | "\\" | "0" ) ;

(* Integer literal: optional leading '-' adjacent to digits, plus optional
   '_' separators *between* digits as a readability affordance.
   '1_000_000' parses to the same Integer as '1000000'; likewise '1_000',
   '10_00', and '1_0_0_0' all parse to the same Integer as '1000'. The
   formatter canonicalises every literal to one form: groups of 3 digits
   from the right, separator starting at 4 digits (so '42' stays '42',
   '1234567' becomes '1_234_567', '-1000000' becomes '-1_000_000').
   Forbidden positions for '_': leading ('_1'), trailing ('1_'), doubled
   ('1__2'), and immediately after the sign ('-_1'). Range validation is
   performed at typecheck time against the declared type (e.g. Int32 or
   UInt8); there is no defaulting. *)
Int = [ "-" ] , digit , { [ "_" ] , digit } ;

LineText   = { any - ['\n'] } ;
BlockText  = { any } ;
Terminator = NEWLINE | EOF ;

lower = 'a'…'z' ;
upper = 'A'…'Z' ;
digit = '0'…'9' ;
any   = U+0000…U+10FFFF ;

(* Semantic rules layered on top of the grammar (enforced by the parser or
   typechecker, as noted) *)

(* 1. The "_" prefix marks a binding as intentionally unused. Referencing
      any identifier whose name begins with "_" is a compile error
      (whether it's a value, type parameter, type constructor, or data
      constructor). Binding such a name is allowed; the compiler never
      warns on it.
   2. A bare "_" is a wildcard: allowed as a function parameter or a
      pattern (PWild). It introduces no binding, so multiple "_" in the
      same scope don't collide. It is rejected in positions where the
      language requires a referenceable name: top-level definition names,
      type names, constructor names, and type parameter names.
   3. Unused bindings that are NOT "_"-prefixed produce warnings with a
      rename-to-"_name" quick-fix.
      Run `awsum check --strict` to escalate warnings to errors
      (CI-friendly).
   4. No shadowing, ever. A fresh binder must not reuse any name already
      visible in its scope. Applies to every binding form (function
      params, pattern vars, top-level defs, type params).
   5. Reserved words (parser). The identifiers `import`, `type`, `case`,
      `of`, `do`, `let`, and `in` are reserved by the grammar and cannot
      be used as `LIdent` (function name, parameter, pattern variable,
      type variable). They appear as literals in `Import`, `TypeDecl`,
      `CaseExpr`, `DoBlock`, `DoStmt`, and `LetIn`; the parser rejects
      them in any other position with a parse error. *)
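
(* Illustration only, not part of the grammar. The Int production and the
   formatter's canonical form described above can be sketched in Python as
   follows. This is a minimal sketch under stated assumptions, not the
   Awsum lexer or formatter: the names `INT_LITERAL`, `parse_int`, and
   `canonical` are hypothetical, and the handling of leading zeros and of
   '-0' is an assumption the grammar does not settle.

```python
import re

# Mirrors the production: Int = [ "-" ] , digit , { [ "_" ] , digit } ;
# At most one '_' may sit between two digits, so '_1', '1_', '1__2' and
# '-_1' all fail to match.
INT_LITERAL = re.compile(r"-?[0-9](?:_?[0-9])*")


def parse_int(text: str) -> int:
    """Return the integer value of a literal, or raise ValueError."""
    if INT_LITERAL.fullmatch(text) is None:
        raise ValueError(f"invalid integer literal: {text!r}")
    # Separators carry no meaning, so drop them before converting.
    return int(text.replace("_", ""))


def canonical(text: str) -> str:
    """Re-render a literal with '_' between groups of 3 digits."""
    # Assumption: leading zeros and '-0' are normalised away here; the
    # source does not say what the real formatter does with them.
    return f"{parse_int(text):_}"
```

   With these definitions, `parse_int` accepts '1_000', '10_00' and
   '1_0_0_0' as the same value and rejects the four forbidden separator
   positions, while `canonical` reproduces the formatter examples given in
   the Int comment. Range validation against a declared type is left out
   because it happens at typecheck time. *)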