Inverse lexing construction
I don't know if this kind of thing is already in the repository, but I thought up a way to inverse lex an arbitrary tokenized program into a string of 0s and 1s. Let T be the number of unique tokens, numbered 0 to T-1. Then
- Prefix the program with 04T 1 04T+1 1 04T+2 1 04T 1 04T+1 1 04T+2 1 04T.
- Encode the ith copy of token t as 1 04t+i 1 04t 1 04t+i for i = 1,2,3.
The prefix ensures that every substring of form 0m 1 0n for m,n < 4T occurs at least 4 times, thus ensuring every token in the rest of the program must have at least two 1s; the rest of its complexity is to prevent it adding tokens itself.
The actual repeated token strings should be 04t+1 1 04t 1 04t+1.
- Oh good, I was pretty sure something like this was possible (adding extra tokens to prevent accidental ones forming) but I couldn't see how. It's nice to have a definitive construction that works, even if it does use O(n) space. (This would also incidentally prove the language Turing-complete, because an inverse lex was the last piece of the puzzle that was missing.) --ais523 21:14, 22 November 2016 (UTC)
- That's nice. I made the construction because I thought it was a missing piece but then I started worrying if I'd misunderstood (since you also need to encode the AFM program as Incident tokens).
- It may be clear enough to you, but perhaps I should expand a bit: The construction also makes sure that for any substring of the form 1 0m 1 0n 1, at least one of 1 0m 1 and 1 0n 1 occurs at most twice in the program, thus ensuring no token can contain more than two 1s. This is one of the main purposes of the 1 04t+i 1 substrings, the other being to prevent actual tokens overlapping. --Ørjan (talk) 23:59, 22 November 2016 (UTC)
My argument that the outermost block lengths of the prefix cannot be among the four largest apparently breaks when I allow three equally sized blocks (which must include the leftmost one). So in fact the prefix can be simplified to:
- 04T 1 04T+1 1 04T 1 04T+1 1 04T.
O(log n) version
After more thought, I hope I have managed to create a delexing construction with O(log n) per token overhead (which is asymptotically optimal).
In the below, concatenation means string concatenation, even if digits are involved, and spaces have been added for readability.
- Insert dummy tokens into the program to ensure that:
- The start and end tokens are dummies (they will not have complete delimiters, so would not work.)
- No consecutive pair of tokens occurs more than once.
- Choose a number N ≥ 1 such that the number of distinct tokens (including new dummies) ≤ fib(N). For each token a (numbered 0,1...), define an id string A of length N that
- contains no 11s;
- begins and ends with 0.
- Can use formula: 1 A = fibbase(a+fib(N+1)) 0.
- Encode each token a as 1 A 1.
- Between consecutive tokens a and b with ids A and B respectively, pad with this monster (neighboring token encodings in parentheses):
(1 A 1) 1 B1A 111 B1A 11 B1A 111 B1A 111 B1A 111 B1A 11 B1A 111 B1A 111 B1A 111 B1A 111 B1A 11 B1A 111 B1A 111 B1A 11 B1A 11 B1A 111 B1A 11 B1A 111 B1A 111 B1A 111 B1A 11 B1A 111 B1A 111 B1A 111 B1A 111 B1A 11 B1A 111 B1A 111 B1A 1 (1 B 1)
There will be no substrings occuring exactly three times in the above (including what's in parentheses), and every substring short enough to be found in a different pair is copied many times.
The "monster" works analogously to the prefix of the O(n) construction if you interpret 111 and 11 as replacing the original's 0 and 1. Unfortunately the working of the prefix depended on having the left side free, so I had to adjust the details.
In the resulting Incident program, 11 A 11 will be a tripled string if the original a was a tripled token. It may not be a token though, as the actual token might include a larger subpart of the surrounding B 11 A 11 C. --Ørjan (talk) 19:08, 26 November 2016 (UTC)
I did some calculations, and the overhead is quite a bit: to beat the O(n) construction, the O(log n) one needs at least 135 distinct tokens in the best case where only the end dummies are needed (and due to a bitsize jump 143 and 144 are briefly worse again.) For the (really not possible) worst case where you add dummies around everything, it only takes over at 369 tokens. Sample sizes (EDIT: see below):
|Dist. tokens||O(n)||O(log n) min||O(log n) max|
- O(n): 18*T^2 + 31*T + 12
- O(log n): 59*M*N + 109*M - 58*N - 107
- M is the total number of tokens, with repeats and dummies.
- With only end dummies: M = 3*T + 2, T+1 <= fib(N)
- Maximal dummies: M = 5*T + 1, 2*T + 1 <= fib(N)
EDIT: Updated table and formulas with more reasonable assumptions (dummy tokens can be used twice, and you don't need a dummy between the first occurrence of a pair.) As expected this gets rid of the 143 exception (but not 144) and reduces the worst case to 220. --Ørjan (talk) 18:46, 27 November 2016 (UTC)
If you're not restricting yourself to characters 0 and 1, the following should be a version where the delimiters are different characters disjoint from those used in the token ids:
(A)! BA,BA! BA,BA,BA,BA! BA,BA,BA,BA,BA! BA,BA,BA! BA! BA,BA! BA,BA,BA,BA! BA,BA,BA,BA,BA! BA,BA,BA!(B)
I realized the dummies in the O(log n) version aren't necessary: the problem they solve can be fixed by removing padding instead. That is:
- Pad at the beginning and end of the program with single 1s, to complete the tokens there.
If a consecutive pair (a,b) of tokens occurs more than once, put padding only between one occurrence of them.
Oops, scratch that second point. I'd forgotten that the padding is also used to keep the final tokens from overlapping...
Fortunately I have a different scheme:
- If a consecutive pair (a,b) of tokens occurs more than once, replace the B1A substrings to make each padding unique. The string used for a padding should start with B1 and end with 1A, and contain no 11s in it. The simplest options are B101A and B1001A.
Using overlap rejection
The above paddings all make sure to have no substrings repeating exactly 3 times. It seems that they can be improved quite a bit by allowing some such substrings which are then rejected due to overlap, as long as we avoid the intended token substrings. The O(n) prefix can be shortened to
- 04T 1 04T 1 04T 1 04T+1
while the O(log n) padding can be shortened to
- (1 A 1)1 B1A 111 B1A 11 B1A 11 B1A 11 B1A 111 B1A 1(1 B 1)
The underlined region contains overlap, so we must ensure that the intended tokens don't expand into it. Normally the 11 A 11 token at the beginning cannot expand beyond the first B, except if all three a's are followed by b. But if we use the B1A, B101A and B1001A variations mentioned in the previous section, the token cannot expand beyond the common B10 prefix of them, so we should be fine. --Ørjan (talk) 04:26, 29 November 2016 (UTC)