User:Sinthorion/drafts/Unsafe
Unsafe is a programming language designed with two main considerations:
- extremely lightweight runtime (no data types, no GC, not even a proper stack...)
- extremely flexible syntax (should be able to reconstruct its own syntax at compile time to the syntax of almost any other language)
Together, these unintentionally make Unsafe the least safe language you could imagine. That is: malloc, pointer arithmetic, overridable built-ins and operators, no typing at all, and even regex-based preprocessor instructions that add custom literals to the compilation process.
Overview
Unsafe code (.u files) is first compiled into Unsafe bytecode (.uc), which can then be executed using an Unsafe runtime. Code is compiled for a specific word size (e.g. 64 bit) and is expected to run on an architecture with that word size.
The entire syntax consists of 2 types of statements: Preprocessor instructions and code expressions. Expressions have 3 different elements: variables (includes function references), operators and literals. Functions, operators and literals can all be either built-in or custom (generated from the code at compile time). The compilation simplifies the code until all of those are replaced with just data word values and a few meta codes.
Default Syntax and Built-ins
The available built-in variables, functions, operators and literals are compiler/runtime dependent. The compiler must compile with the same built-ins that the runtime supports, or it must include any missing built-ins in the compiled bytecode.
Excluding preprocessor directives, any symbol in the source code is either a variable, an operator, or part of a literal. A variable is a reference into the variable register and holds a value. References to functions or objects are usually also held in variables.
Each literal is defined by a regex that matches part of the code. If multiple literals match the same part of the code, the one defined last is tried first.
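The "defined last goes first" rule can be illustrated with a minimal Python sketch. The `LiteralTable` class, its method names and the two example literals are hypothetical and only serve the illustration; the draft does not specify how a compiler stores its literals.

```python
import re


class LiteralTable:
    """Hypothetical registry of literal definitions (regex plus handler)."""

    def __init__(self):
        self._literals = []  # kept in definition order

    def define(self, pattern, handler):
        self._literals.append((re.compile(pattern), handler))

    def match_at(self, source, pos):
        """Try all literals at `pos`, most recently defined first."""
        for regex, handler in reversed(self._literals):
            m = regex.match(source, pos)
            if m:
                return handler(m), m.end()
        return None  # no literal starts at this position


table = LiteralTable()
# Decimal literal, defined first.
table.define(r"[0-9]+", lambda m: ("int", int(m.group(0))))
# Hex literal, defined later, so it is tried before the decimal one.
table.define(r"0x([0-9a-fA-F]+)", lambda m: ("int", int(m.group(1), 16)))

print(table.match_at("0x1F + 2", 0))
print(table.match_at("42 + 2", 0))
```

If the decimal literal were tried first, it would consume only the leading `0` of `0x1F`, which is why the order of definition matters here.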
Compilation Process
- Prepare built-ins.
- Parse and interpret all preprocessor directives.
- Compile all literals from start to end.
- Parse all operators and replace them with functions.
- Build bytecode, procedure by procedure, expression by expression:
- Replace all variables with the variable index and a call to an internal function that returns the value of the variable, except in assignments.
- Keep in mind that function names *are* variables.
- Previously unknown variables in a subprocedure should be initialised with the top value of the value stack.
- Adjust the order of values to the stack-oriented structure of the bytecode, inserting the function call meta code where needed: `1(2, 3(4, 5))` is transformed to `2 4 5 3 call 1 call`
- Group expressions together with the expression end meta code.
- Group procedures together with the procedure end meta code, with the main procedure at the start.
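The reordering step can be sketched in Python as a post-order walk over the call tree. The tuple representation of a call and the `CALL` marker are assumptions made for this example; the real compiler would emit the call meta code instead.

```python
CALL = "call"  # stand-in for the function call meta code (assumed marker)


def flatten(expr):
    """Flatten a nested call into stack order.

    `expr` is either a plain value or a tuple (function, [arguments]).
    Arguments are emitted first, then the function, then the call marker.
    """
    if isinstance(expr, tuple):
        func, args = expr
        out = []
        for arg in args:
            out.extend(flatten(arg))
        out.extend(flatten(func))
        out.append(CALL)
        return out
    return [expr]


# 1(2, 3(4, 5)) written as nested tuples
expr = (1, [2, (3, [4, 5])])
print(" ".join(str(word) for word in flatten(expr)))  # 2 4 5 3 call 1 call
```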
Runtime and Memory
The entire memory at runtime is considered as a single large block of data words. The size of the word is the word size of the processor. It is not specified how a data word is to be interpreted.
At the start of the program, the memory already contains a few sections for internal runtime functionality (and ideally the runtime should not use any memory outside this virtual memory, except what is needed to manage the memory):
- A register of all procedures.
- A register of all variables.
- The call stack.
- The value stack for the current expression and calls.
- Some special counters used by the runtime (e.g. instruction pointer, value stack size before last expression, ...).
All of these blocks have a static size. The procedure register and the variable register are created at compile time or during the initialisation phase; the stacks have a fixed size (and will error when overflowing). The call stack only contains return addresses; function parameters are passed over the value stack.
New memory blocks can be allocated with the `malloc` built-in and freed with the `free` built-in. The runtime has to keep track of all allocated memory and must not hand out already allocated memory again. However, since the program can access any memory address by interpreting any value as a pointer, this does not guarantee that unallocated memory is really unused. It also makes it possible to access and modify the internal memory sections. Abusing this is at the programmer's own risk.
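A minimal Python sketch of this memory model follows. The `Memory` class, the first-fit strategy and the sizes are assumptions chosen for the example; the draft only requires that the runtime tracks allocations, not how. Addresses are assumed to be non-negative word indices.

```python
class Memory:
    """One flat block of words; the first `reserved` words are internal."""

    def __init__(self, size, reserved):
        self.words = [0] * size
        self.reserved = reserved  # internal sections live in [0, reserved)
        self.blocks = {}          # start address -> length of live blocks

    def malloc(self, n):
        """First-fit search for `n` free words after the internal sections."""
        if n <= 0:
            raise ValueError("block size must be positive")
        addr = self.reserved
        for start in sorted(self.blocks):
            if start - addr >= n:
                break             # the gap before this block is big enough
            addr = start + self.blocks[start]
        if addr + n > len(self.words):
            raise MemoryError("out of memory")
        self.blocks[addr] = n
        return addr

    def free(self, addr):
        del self.blocks[addr]

    # Loads and stores are deliberately unchecked, as in the text above.
    def load(self, addr):
        return self.words[addr]

    def store(self, addr, value):
        self.words[addr] = value


mem = Memory(size=64, reserved=16)
block = mem.malloc(4)
mem.store(40, 5)       # writes to memory that was never allocated
mem.store(3, 99)       # overwrites part of an internal section
later = mem.malloc(30) # this block covers address 40
print(block, later, mem.load(40))
```

The block returned by the last `malloc` already contains the stray value written earlier, which is the "unallocated does not mean unused" situation described above.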
Bytecode
The bytecode consists of some metadata followed by a sequence of values (to be interpreted as their bit sequence) and meta codes. When executed, each value is pushed on the value stack until a meta code is found. Meta codes are escaped with an escape character (e.g. the null byte). Every value has a size in bits equal to the size of the data word the code is compiled for. For example, in 64-bit bytecode, the escape character plus meta code takes 128 bits, or 16 bytes of bytecode.
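The size arithmetic can be checked with a small Python sketch of a 64-bit encoder. The escape word value `0`, the little-endian byte order and the `("meta", code)` tuple convention are assumptions for this example; the metadata header is left out because the draft does not describe it.

```python
import struct

ESC = 0  # assumed escape word (an all-zero word)


def encode(items):
    """Encode values and meta codes as 64-bit little-endian words.

    `items` contains ints (plain values) or ("meta", code) tuples.
    A value equal to the escape word is written as a repeated escape.
    """
    words = []
    for item in items:
        if isinstance(item, tuple):
            words += [ESC, item[1]]   # escape word plus meta code
        elif item == ESC:
            words += [ESC, ESC]       # escaped escape word
        else:
            words.append(item)
    return struct.pack(f"<{len(words)}Q", *words)


print(len(encode([7])))            # one value: 8 bytes
print(len(encode([("meta", 1)])))  # escape plus meta code: 16 bytes
```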
Standard meta codes:
- Repeated escape character: Use the escape character as value.
- Call (or end of function): Pops the last value, looks it up in the procedure register and executes it. Errors if the procedure register does not contain this index.
- End of expression: Reduces the value stack back to its size before the last expression.
- End of procedure: If the call stack is empty, exit the program. Else, pop the call stack and move the instruction pointer to the bytecode position pointed at by the popped value.
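The meta codes above can be put together in a minimal Python interpreter loop. This is a sketch under several assumptions that are not part of the draft: the escape word is `0`, the meta codes are numbered 1 to 3, built-ins are Python callables that operate on the value stack, and the expression base of a caller is saved in a separate `saved_bases` list (the draft only says the call stack holds return addresses and does not say where this counter is kept across calls).

```python
ESC = 0                                   # assumed escape word
M_CALL, M_END_EXPR, M_END_PROC = 1, 2, 3  # assumed meta code numbering


def run(code, procedures):
    """Execute a list of words; return the value stack at program exit."""
    stack, call_stack, saved_bases = [], [], []
    base = 0  # value stack size before the current expression
    ip = 0
    while True:
        word = code[ip]
        ip += 1
        if word != ESC:
            stack.append(word)            # plain value
            continue
        meta = code[ip]
        ip += 1
        if meta == ESC:                   # repeated escape: push it as value
            stack.append(ESC)
        elif meta == M_CALL:
            index = stack.pop()
            if index not in procedures:
                raise RuntimeError(f"unknown procedure index {index}")
            target = procedures[index]
            if callable(target):          # built-in
                target(stack)
            else:                         # bytecode position
                call_stack.append(ip)
                saved_bases.append(base)
                base = len(stack)
                ip = target
        elif meta == M_END_EXPR:
            del stack[base:]
        elif meta == M_END_PROC:
            if not call_stack:
                return stack
            ip = call_stack.pop()
            base = saved_bases.pop()
        else:
            raise RuntimeError(f"unknown meta code {meta}")


output = []
ADD, OUT, INC = 100, 101, 102             # hypothetical procedure indices
main = [5, INC, ESC, M_CALL, OUT, ESC, M_CALL,
        ESC, M_END_EXPR, ESC, M_END_PROC]
inc = [1, ADD, ESC, M_CALL, ESC, M_END_PROC]
procedures = {
    ADD: lambda s: s.append(s.pop() + s.pop()),
    OUT: lambda s: output.append(s.pop()),
    INC: len(main),                       # inc starts right after main
}
result = run(main + inc, procedures)
print(output)
```

The main procedure sits at the start of the bytecode and `inc` follows it, matching the grouping described in the compilation process. Parameters and return values travel only over the value stack.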
Standard Preprocessor Directives
define <NAME> <P_EXPRESSION>
: Sets a constant, which can be used in other preprocessor directives.
assign <NAME> <NAME_OR_P_EXPRESSION>
: The program will run with the specified variable predefined to the value of the second parameter. Useful in combination with #extern.
include <FILEPATH>
: Includes the specified file directly in place of this include directive.
extern <NAME> <COMMAND>
: Defines a procedure, which, when run, will execute the specified command. This function is available in the preprocessor under the given name.
literal <REGEX> <PROCREF_OR_LITERALREF>
: Defines a new literal, which uses the specified regex for parsing. If the second parameter is the code name of an existing literal, the regex is remapped to the definition of that literal. If the second parameter is the name of a function (either built-in or extern), this function will be called with the matching groups of the regex as parameters. It is expected to return a string as replacement for the literal.
operator <SYMBOL> <PROCREF> <TYPE>
: Defines a runtime operator. During compilation it will be replaced with the specified procedure. Its type can be UNARY_LEFT, UNARY_RIGHT, or BINARY, which determines which of its left and right sides are used as parameters of the procedure.
Standard Built-ins
A compiler or runtime should minimally support most of these, with any syntax that is appropriate.
Variables
One variable for each built-in function (except functions for internal use).
Operators or Functions
- Dereference (*var): Returns the value of the memory at the specified address.
- Call (proc()): Calls a function with parameters.
- Dereferencing assignment: Sets the memory at the specified address to the specified value.
- Basic arithmetic (+, -, *, /) and bit operations (&, |, ^, !)
- Basic control structures (if, while)
- Return/exit: Leave the current function, optionally push values to return.
Additionally, since operators don't exist in the bytecode, there must be one (internal) function for each operator.
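One way to picture the "one internal function per operator" requirement is a lookup table, sketched here in Python. The draft leaves the interpretation of a data word open, so treating words as unsigned 64-bit integers with wrap-around, and reading `!` as bitwise NOT, are assumptions of this example; the `INTERNAL` table itself is hypothetical.

```python
WORD_BITS = 64                  # assumed word size
MASK = (1 << WORD_BITS) - 1

# Hypothetical table: operator symbol -> internal function on data words.
INTERNAL = {
    "+": lambda a, b: (a + b) & MASK,
    "-": lambda a, b: (a - b) & MASK,
    "*": lambda a, b: (a * b) & MASK,
    "/": lambda a, b: a // b,   # raises ZeroDivisionError for b == 0
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
    "!": lambda a: ~a & MASK,   # assumed to mean bitwise NOT
}

print(INTERNAL["+"](MASK, 1))   # wraps around to 0
print(hex(INTERNAL["!"](0)))    # all bits set
```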
Compile-time Operators
These cannot be executed at runtime, so the compiler must already know about them. Technically, they have to be implemented like literals.
- Direct assignment
- Get variable index
- Get variable pointer
- Lazy expressions?? (eg. loop conditions)
Literals
- Number literal: Inserts the number literally as integer.
- Code block literal: Creates a new procedure that contains the code inside it. Returns the procedure index.