Lexer



Just give me the codez

What is a lexer? #

The lexer[1], also known as the tokenizer or the scanner, is the first stage in the compilation process. It takes source code (NTLC source code in our case) and converts it into a stream of “tokens”.


First stage, you say? Are there more? Definitely!


Actually, the compiler is more like a pipeline. Each stage takes the output of the previous one and transforms it into something that is ready for the next stage to consume. The output of the lexer is the input of the parser, and so on.
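To make the pipeline idea concrete, here is a minimal sketch of it in Rust. The stage names and types below are illustrative stand-ins, not the actual NTLC compiler API: each stage is just a function, and the compiler is their composition.

// Stand-in types so the sketch compiles; the real stages produce
// real tokens and syntax trees.
struct Tokens;
struct Ast;

fn lex(_source: &str) -> Tokens { Tokens }     // text -> tokens
fn parse(_tokens: Tokens) -> Ast { Ast }       // tokens -> syntax tree
fn emit(_ast: Ast) -> String { String::new() } // syntax tree -> output

// The compiler is just the composition of its stages.
fn compile(source: &str) -> String {
    emit(parse(lex(source)))
}

fn main() {
    let _output = compile("if true then 0 else 0");
}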


Here is a very typical compiler pipeline[2]:

[Figure: Compiler Pipeline]

Why do we need a lexer? #

Strictly speaking, we don’t. It just allows us to work with the source code on a higher conceptual level. Later stages, namely the parser, will have a much easier time reasoning about the structure of the source code if they don’t have to worry about string indices, whitespace characters, and/or the nuances of character sets and text encoding.


Actually, this is true of every stage in the pipeline. If the source language were simple enough, you probably would not even need a compiler. You could get away with a few regular expressions and a bunch of string manipulation functions.
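To illustrate, here is a toy example of that idea (hypothetical, and it assumes the regex crate is added as a dependency): a one-rule “compiler” that rewrites succ(N) into N + 1 with a single regular expression, no lexer or parser in sight.

use regex::Regex;

fn main() {
    // The whole "compiler" is one regular expression with a capture group.
    let re = Regex::new(r"succ\((\d+)\)").unwrap();
    let out = re.replace_all("succ(0)", "$1 + 1");
    assert_eq!(out, "0 + 1");
}

This obviously falls apart the moment the language grows nesting or recursion, which is exactly why real compilers split the work into stages.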


The compiler is essentially a systematic and deterministic way to convert a program written in one language into another. Remember the good old divide-and-conquer technique that advises you to break a big problem into smaller, manageable ones? Well, that’s what the compiler really does with all its stages.

Lexer output #

Remember when we said earlier that the lexer converts source code into tokens? It is time we talked about those tokens.

The lexer typically outputs an array/vector/list of tokens (technically speaking, that’s only if it is not a streaming lexer, but let’s ignore that for now).
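As an aside, the interface of a streaming lexer might look something like this sketch (hypothetical; the NTLC lexer below returns all of its tokens at once instead):

// The parser pulls tokens one at a time instead of receiving them all upfront.
trait TokenStream {
    type Token;
    type Error;

    // Returns the next token, `None` when the input is exhausted,
    // or an error if the input cannot be tokenized.
    fn next_token(&mut self) -> Result<Option<Self::Token>, Self::Error>;
}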


Now, most of the time a token[3] is a structure that has a type and a value. Here is an example to illustrate how our NTLC snippet from the previous post would map to a list of tokens:

[Figure: Lexer Output]

Notice that because our language is very simple, things almost map one-to-one without any additional data.


But say, for example, our language were to support arbitrary numbers or strings; then we would probably extend our token type to include a value field that would hold the actual value of the token.


For instance, a string token might look like this:

pub enum Token {
    StringLiteral(String),
    //            ^^^^^^
    //            This is the value field
    //
    //
    // More token types ...
}

or in TypeScript:

class StringLiteral {
  constructor(public readonly value: string) {}
}

type Token = StringLiteral | ...


When the lexer scans an input that has a string literal like “Hello World”, it would emit a token of type StringLiteral with a value of “Hello World” like so:

Token::StringLiteral("Hello World".to_string())
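And downstream, the consumer of the token stream (the parser) can pattern match on the token to get at that value. A minimal sketch, reusing the hypothetical StringLiteral variant from above:

enum Token {
    StringLiteral(String),
}

fn describe(token: &Token) -> String {
    match token {
        // The value travels inside the token, so the parser never has to
        // touch the raw source string again.
        Token::StringLiteral(s) => format!("a string literal: {}", s),
    }
}

fn main() {
    let token = Token::StringLiteral("Hello World".to_string());
    println!("{}", describe(&token));
}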


One final thing, you might be thinking: wait, where did we get our token types from? Did we just make them up?


Actually, sort of. Remember, this is our language and we get to define the rules, but we won’t get into those until the next article on parsers.

How do lexers work? #

Before we describe the workings of the lexer, and the parser for that matter, it is worth mentioning that there exist tools known as Parser Generators (yacc and Tree Sitter are such tools) that can actually generate the code that implements the lexer and parser for you, given that you provide them with your programming language definition in some form or another. But hey, where is the fun in that!


Okay, getting back to how lexers work. They basically scan the input character by character[4] (and most of the time, they take a sneak peek ahead at the next character or characters) and try to match those characters against a set of rules. If a rule matches, the lexer emits a token and moves on to the next character. If it comes across an unexpected character, it throws an error.
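Here is a tiny sketch of that scanning loop, deliberately simplified to just parentheses and whitespace (the full NTLC lexer below follows the same shape, with one match arm per rule):

// A toy scanner: peek at the current character, decide which rule applies,
// then either emit a token and advance, skip, or error out.
fn toy_scan(input: &str) -> Result<Vec<char>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();

    while let Some(&c) = chars.peek() {
        match c {
            '(' | ')' => {
                tokens.push(c); // a rule matched: emit a token...
                chars.next();   // ...and advance past it
            }
            ' ' | '\t' => {
                chars.next(); // whitespace: skip it
            }
            other => return Err(format!("unexpected character '{}'", other)),
        }
    }

    Ok(tokens)
}

fn main() {
    assert_eq!(toy_scan("( ( ) )"), Ok(vec!['(', '(', ')', ')']));
}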


It’s good to know that in NTLC we have the following token types:

pub enum Token {
    True, // Literal true
    False, // Literal false
    If, // Keyword if
    Then, // Keyword then
    Else, // Keyword else
    Zero, // Literal zero
    Succ, // builtin function succ()
    Pred, // builtin function pred()
    IsZero, // builtin function iszero()
    LeftParenthesis, // (
    RightParenthesis, // )
    EOF, // End of file (this is not something that one types in but it's a handy way to represent
         // the end of the input)
}
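For example, scanning the program `if true then 0 else succ(0)` would yield the tokens IF, TRUE, THEN, 0, ELSE, SUCC, (, 0, ) followed by <EOF> (written here in the notation of the Debug implementation you will see in the full listing below).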


Okay, enough with words. I feel that describing the workings of a compiler makes things sound more complicated than they actually are. So, let’s just look at the code. I have left comments all over the place to explain what is going on.



  1. The word "lexer" is derived from the Greek word "lexikon" which refers to the vocabulary of a language, including its words and their meanings.
  2. Actually, there is something quite satisfying about compilers: they are as "pure" as they can be. The compiler is essentially a function that takes a string and returns an alternative representation of that string -- no network calls, no database transactions, no file storage etc.
  3. Sometimes the token encodes additional information such as the line number and column of the corresponding source code. Choosing what to store in a token depends on your application and requirements.
  4. Most of the time, lexers are defined in terms of one or more regular expressions and you will see why soon!

Full Lexer code listing #

HINT: check the tests at the bottom of the file for more examples on how the lexer works


Again, the full source code for the project is available here!


The scan() function at line 279 is the entry point of the lexer module and the place where the main logic resides, so make sure you start there first, as everything else is more or less helper utilities.
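If you want to play with it, here is one hypothetical way to drive the lexer, assuming the module's scan function is in scope (the Debug implementation in the listing prints tokens in the IF/TRUE/<EOF> notation):

fn main() {
    match scan("if true then 0 else 0") {
        Ok(tokens) => println!("{:?}", tokens), // [IF, TRUE, THEN, 0, ELSE, 0, <EOF>]
        Err(e) => eprintln!("{}", e),
    }
}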

  1 use std::error::Error;
  2 use std::fmt::Debug;
  3 use std::vec::Vec;
  4
  5 // I keep a copy of the grammar here for reference
  6 /*
  7  * t ::=                  // terms
  8  *     true               // constant true
  9  *     false              // constant false
 10  *     if t then t else t // conditional
 11  *     0                  // constant zero
 12  *     succ t             // successor
 13  *     pred t             // predecessor
 14  *     iszero t           // zero test
 15  */
 16
 17 static KEYWORD_TRUE: &str = "true";
 18 static KEYWORD_FALSE: &str = "false";
 19 static KEYWORD_IF: &str = "if";
 20 static KEYWORD_THEN: &str = "then";
 21 static KEYWORD_ELSE: &str = "else";
 22 static KEYWORD_IS_ZERO: &str = "iszero";
 23 static KEYWORD_ZERO: &str = "0";
 24 static KEYWORD_SUCC: &str = "succ";
 25 static KEYWORD_PRED: &str = "pred";
 26
 27 // These are directives. Similar to annotations or decorators in other programming languages.
 28 // clippy is Rust's linter and we are just telling it to be quiet about EOF's casing.
 29 #[allow(clippy::upper_case_acronyms)]
 30 // In Rust one can ask the compiler to implement certain interfaces for
 31 // a type by using the `derive` keyword. In this case we are asking the
 32 // compiler to implement the `PartialEq` (Partial Equality) interface for the `Token` type
 33 // which is needed/used when we compare two values of the same type for equality using ==.
 34 #[derive(PartialEq)]
 35 /*
 36  `enum` is, roughly speaking, similar to how unions work in TypeScript where you can say:
 37  `type Token = True | False`
 38  And so a token can either be True or False.
 39  Notice that `True` and `False` are not the literal boolean values `true` and `false`
 40  but rather the names of the variants of the enum. They are types in their own right.
 41  Tokens are the output of the lexer and are the input of the parser.
 42 */
 43 pub enum Token {
 44     True,
 45     False,
 46     If,
 47     Then,
 48     Else,
 49     Zero,
 50     Succ,
 51     Pred,
 52     IsZero,
 53     LeftParenthesis,
 54     RightParenthesis,
 55     EOF,
 56 }
 57
 58 /**
 59  * Rust doesn't have classes but structs, which are similar to classes in other languages, just without "inline" methods.
 60  * This is how one implements "traits" (which roughly correspond to interfaces in other languages)
 61  * for a struct.
 62  * Essentially we are saying the Token type implements the Debug trait.
 63  * The Debug trait allows us to print the value of a Token when used in a format string (akin to string interpolation).
 64  */
 65 impl Debug for Token {
 66     fn fmt(
 67         &self, /* similar to `this` in object oriented languages */
 68         f: &mut std::fmt::Formatter,
 69     ) -> std::fmt::Result {
 70         /*
 71          * I hope this is very readable and self-explanatory
 72          */
 73         match self {
 74             Token::True => write!(f, "TRUE"),
 75             Token::False => write!(f, "FALSE"),
 76             Token::If => write!(f, "IF"),
 77             Token::Then => write!(f, "THEN"),
 78             Token::Else => write!(f, "ELSE"),
 79             Token::Zero => write!(f, "0"),
 80             Token::Succ => write!(f, "SUCC"),
 81             Token::Pred => write!(f, "PRED"),
 82             Token::IsZero => write!(f, "IS_ZERO"),
 83             Token::LeftParenthesis => write!(f, "("),
 84             Token::RightParenthesis => write!(f, ")"),
 85             Token::EOF => write!(f, "<EOF>"),
 86         }
 87     }
 88 }
 89
 90 /**
 91  * While the `Debug` trait is used for debugging purposes, the `Display` trait is used for printing
 92  * a human readable or nice string representation of a value.
 93  * It is similar in that regard to how one might override `toString()` in Java.
 94  */
 95 impl std::fmt::Display for Token {
 96     fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
 97         match self {
 98             Token::True => write!(f, "TRUE"),
 99             Token::False => write!(f, "FALSE"),
100             Token::If => write!(f, "IF"),
101             Token::Then => write!(f, "THEN"),
102             Token::Else => write!(f, "ELSE"),
103             Token::Zero => write!(f, "0"),
104             Token::Succ => write!(f, "SUCC"),
105             Token::Pred => write!(f, "PRED"),
106             Token::IsZero => write!(f, "IS_ZERO"),
107             Token::LeftParenthesis => write!(f, "("),
108             Token::RightParenthesis => write!(f, ")"),
109             Token::EOF => write!(f, "<EOF>"),
110         }
111     }
112 }
113
114 #[derive(PartialEq)]
115 /*
116  * We use an enum to represent the different kinds of errors the lexer could run into
117  */
118 pub enum LexingError {
119     UnrecognizedToken {
120         found: String,
121         column: usize,
122         line: usize,
123     },
124     UnexpectedCharacter {
125         found: char,
126         expected: Vec<String>,
127         column: usize,
128         line: usize,
129     },
130     UnexpectedEOF {
131         expected: Vec<String>,
132         column: usize,
133         line: usize,
134     },
135 }
136
137 /**
138  * This is a helper function that is used to format lexer error messages
139  */
140 fn format_expectation_error(expected: &Vec<String>) -> String {
141     let mut result: Vec<String> = Vec::new();
142
143     for e in expected {
144         match e {
145             s if s == " " => result.push("<SPACE>".to_string()),
146             s if s == "\t" => result.push("<TAB>".to_string()),
147             s if s == "\0" => result.push("<EOF>".to_string()),
148             s => result.push(format!("'{}'", s)),
149         }
150     }
151
152     result.join(", ")
153 }
154
155 impl std::fmt::Display for LexingError {
156     fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
157         match self {
158             LexingError::UnrecognizedToken {
159                 found,
160                 column,
161                 line,
162             } => {
163                 write!(
164                     f,
165                     "Unrecognized token '{}' at line {}, column {}",
166                     found,
167                     line,
168                     column + 1
169                 )
170             }
171             LexingError::UnexpectedCharacter {
172                 found,
173                 expected,
174                 column,
175                 line,
176             } => {
177                 write!(
178                     f,
179                     "Found '{}' at line {}, column {} when perhaps you meant '{}'?",
180                     found,
181                     line,
182                     column + 1,
183                     format_expectation_error(expected)
184                 )
185             }
186             LexingError::UnexpectedEOF {
187                 expected,
188                 column,
189                 line,
190             } => {
191                 write!(
192                     f,
193                     "Unexpected end of file at line {}, column {} when you perhaps meant '{}'?",
194                     line,
195                     column + 1,
196                     format_expectation_error(expected)
197                 )
198             }
199         }
200     }
201 }
202
203 impl std::fmt::Debug for LexingError {
204     fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
205         match self {
206             LexingError::UnrecognizedToken {
207                 found,
208                 column,
209                 line,
210             } => {
211                 write!(f, "Unknown token '{}' at ({}, {})", found, line, column)
212             }
213             LexingError::UnexpectedCharacter {
214                 found,
215                 expected,
216                 column,
217                 line,
218             } => {
219                 write!(
220                     f,
221                     "Unexpected token '{}' at ({}, {}). Expectation context {}",
222                     found,
223                     line,
224                     column,
225                     format_expectation_error(expected)
226                 )
227             }
228             LexingError::UnexpectedEOF {
229                 expected,
230                 column,
231                 line,
232             } => {
233                 write!(
234                     f,
235                     "Unexpected EOF looking for lexeme '{}' at ({}, {})",
236                     format_expectation_error(expected),
237                     line,
238                     column
239                 )
240             }
241         }
242     }
243 }
244
245 /**
246  * The `Error` trait is like a marker interface.
247  * It just tells the compiler that `LexingError` is, well, kind of an error
248  */
249 impl Error for LexingError {}
250
251 /**
252  * This is the heart of the lexer. It basically takes NTLC program source code
253  * as input and returns a vector/list of tokens of type `Token` as output.
254  *
255  * It iterates the input string character by character from left to right and tries to match them
256  * against the known tokens of the language.
257  *
258  * Most of the time, the lexer is actually made part of the parser module and is implemented
259  * as a streaming lexer. That is, it is implemented as a function that returns the next
260  * token only when asked for explicitly by the parser and not all at once.
261  * This is done to avoid having to read the entire input into memory at once.
262  * Also, when working within a text editor or IDE, those tools can send you incomplete
263  * pieces of code to avoid resending the entire source file over and over again.
264  * We will see this when we are working on the language server.
265  *
266  *
267  * You might see the `&str` type and wonder what it is. It is just a reference to a string.
268  * Or to be more technically correct, a readonly reference to a string slice (what a mouthful).
269  * This avoids copying the string around in memory. Don't worry about it too much.
270  *
271  * The `Result<>` type is probably implemented as an `enum` under the hood. It says this function
272  * either returns a vector of tokens or an error.
273  * This is a concept from functional programming and it's very idiomatic to see the Result<>
274  * type all over the place in Rust code.
275  * If you worked with a language that uses exceptions, this might feel strange at first
276  * but, I personally think it is way cleaner to explicitly state that your function might
277  * return an error. This reminds me to some extent of checked exceptions in Java (at least last time I used it).
278  */
279 pub fn scan(input: &str) -> Result<Vec<Token>, LexingError> {
280     let mut tokens: Vec<Token> = Vec::new();
281
282     /*
283      This line might seem to be doing a bit of magic but essentially it is just a convenient
284      way to line up the input with itself shifted by one character, giving us each character
285      and the one following it as a pair (char, next_char). Let me demonstrate:
286
287      let's say that the input is "succ(0)"; this will give us the following:
288      s u c c ( 0 )
289      u c c ( 0 ) \0
290
291      and as pairs it comes out as:
292      [('s', 'u'), ('u', 'c'), ('c', 'c'), ('c', '('), ('(', '0'), ('0', ')'), (')', '\0')]
293
294      This is handy because a lot of the time we need to "lookahead" when we are scanning the input
295      to decide what the next token is. Keep reading.
296     */
297     let mut chars = input
298         .chars()
299         .zip(input.chars().skip(1).chain(['\0'])) // we are adding a null character at the end of the shifted string so both strings have the same length, this prevents the iterator from stopping prematurely
300         .peekable();
301
302     // We are tracking the line number although our lexer doesn't support multiple lines of input.
303     // This is useful for error reporting if we were to support new lines.
304     // The same goes for `index`, which is the column number within the input string.
305     let line = 1usize;
306     let mut index = 0usize;
307
308     // This is the main loop of the lexer. It iterates over the input characters
309     // and tries to match them against known tokens of the language.
310     // The `let Some(&(c, next)) = chars.peek()` is known as a pattern match expression.
311     // It is also something that is very common in functional languages.
312     // It is similar to destructuring in JavaScript or Python but more powerful.
313     // Here we are saying, if there is a next character in the input, assign it to `c`
314     // and assign the character after it to `next`
315     while let Some(&(c, next)) = chars.peek() {
316         // This is the main match statement. It is similar to a switch statement in other languages
317         // but more powerful because of pattern matching.
318         // Now, the code basically does the following:
319         match c {
320             // if the current character is 't' then check the next character
321             // because in our grammar, there are two tokens that start with 't'
322             // `true` and `then` and we need to know which one it is
323             // if the `t` is followed by an `r` then it is `true` otherwise it is `then`
324             // and we can move on to the next token
325             // the rest of the rules work in a similar fashion
326             // now you see why lexers typically use regular expressions to match tokens
327             't' => match next {
328                 'r' => {
329                     /*
330                      We use the consume function to "consume" the rest of the keyword
331                      and advance the iterator to the next character
332                     */
333                     index += consume(&mut chars, KEYWORD_TRUE, line, index)?;
334                     tokens.push(Token::True);
335                     /*
336                      We use the match_one_of function to check if the next character is
337                      one of the characters we expect to see after the keyword.
338                      Imagine that the input is the invalid program `truefalse`.
339                      This is incorrect and we use match_one_of to detect that.
340                      Here we are saying we expect to see either a space, a tab, or a right parenthesis.
341                      Notice that this would accept the input `true)` which is also incorrect
342                      but that would be detected by the parser.
343                      Choosing which errors to catch where is a matter of taste and maybe
344                      a bit of performance
345                     */
346                     match_one_of(
347                         chars.peek().map(|&(x, _)| x),
348                         &[' ', '\t', ')'],
349                         true,
350                         line,
351                         index,
352                     )?;
353                 }
354                 'h' => {
355                     index += consume(&mut chars, KEYWORD_THEN, line, index)?;
356                     tokens.push(Token::Then);
357                     match_one_of(
358                         chars.peek().map(|&(x, _)| x),
359                         &[' ', '\t'],
360                         false,
361                         line,
362                         index,
363                     )?;
364                 }
365                 '\0' => {
366                     return Err(LexingError::UnexpectedEOF {
367                         expected: vec!["true".to_string(), "then".to_string()],
368                         column: index,
369                         line,
370                     })
371                 }
372                 c => {
373                     return Err(LexingError::UnexpectedCharacter {
374                         found: c,
375                         expected: vec!["true".to_string(), "then".to_string()],
376                         column: index,
377                         line,
378                     })
379                 }
380             },
381             'f' => {
382                 index += consume(&mut chars, KEYWORD_FALSE, line, index)?;
383                 tokens.push(Token::False);
384                 match_one_of(
385                     chars.peek().map(|&(x, _)| x),
386                     &[' ', '\t', ')'],
387                     true,
388                     line,
389                     index,
390                 )?;
391             }
392             'i' => match next {
393                 'f' => {
394                     index += consume(&mut chars, KEYWORD_IF, line, index)?;
395                     tokens.push(Token::If);
396                     match_one_of(
397                         chars.peek().map(|&(x, _)| x),
398                         &[' ', '\t'],
399                         false,
400                         line,
401                         index,
402                     )?;
403                 }
404                 's' => {
405                     index += consume(&mut chars, KEYWORD_IS_ZERO, line, index)?;
406                     tokens.push(Token::IsZero);
407                     match_one_of(
408                         chars.peek().map(|&(x, _)| x),
409                         &[' ', '\t', '('],
410                         false,
411                         line,
412                         index,
413                     )?;
414                 }
415                 '\0' => {
416                     return Err(LexingError::UnexpectedEOF {
417                         expected: vec!["if".to_string(), "iszero".to_string()],
418                         column: index,
419                         line,
420                     })
421                 }
422                 c => {
423                     return Err(LexingError::UnexpectedCharacter {
424                         found: c,
425                         expected: vec!["if".to_string(), "iszero".to_string()],
426                         column: index,
427                         line,
428                     })
429                 }
430             },
431             'e' => {
432                 index += consume(&mut chars, KEYWORD_ELSE, line, index)?;
433                 tokens.push(Token::Else);
434                 match_one_of(
435                     chars.peek().map(|&(x, _)| x),
436                     &[' ', '\t'],
437                     false,
438                     line,
439                     index,
440                 )?;
441             }
442             '0' => {
443                 index += consume(&mut chars, KEYWORD_ZERO, line, index)?;
444                 tokens.push(Token::Zero);
445                 match_one_of(
446                     chars.peek().map(|&(x, _)| x),
447                     &[' ', '\t', ')'],
448                     true,
449                     line,
450                     index,
451                 )?;
452             }
453             's' => {
454                 index += consume(&mut chars, KEYWORD_SUCC, line, index)?;
455                 tokens.push(Token::Succ);
456                 match_one_of(
457                     chars.peek().map(|&(x, _)| x),
458                     &[' ', '\t', '('],
459                     false,
460                     line,
461                     index,
462                 )?;
463             }
464             'p' => {
465                 index += consume(&mut chars, KEYWORD_PRED, line, index)?;
466                 tokens.push(Token::Pred);
467                 match_one_of(
468                     chars.peek().map(|&(x, _)| x),
469                     &[' ', '\t', '('],
470                     false,
471                     line,
472                     index,
473                 )?;
474             }
475             '(' => {
476                 chars.next();
477                 index += 1;
478                 tokens.push(Token::LeftParenthesis);
479             }
480             ')' => {
481                 chars.next();
482                 index += 1;
483                 tokens.push(Token::RightParenthesis);
484             }
485             /*
486              We don't care about whitespace characters so we just skip them
487             */
488             '\t' | ' ' => {
489                 chars.next();
490                 index += 1;
491             }
492             /*
493              This is the case where we don't recognize the current character
494              as part of the language so we just error
495             */
496             ut => {
497                 return Err(LexingError::UnrecognizedToken {
498                     found: ut.to_string(),
499                     column: index,
500                     line,
501                 })
502             }
503         }
504     }
505
506     // We add an EOF token at the end of the token stream to indicate we are done.
507     // This is used later by the parser to know when to stop or error.
508     tokens.push(Token::EOF);
509
510     Ok(tokens)
511 }
512
513 /*
514  This is a helper function that consumes a literal from the input.
515  It is used to match keywords and other literals in the language.
516 */
517 fn consume<T>(
518     input: &mut T,
519     literal: &'static str,
520     line: usize,
521     index: usize,
522 ) -> Result<usize, LexingError>
523 where
524     T: std::iter::Iterator<Item = (char, char)>,
525 {
526     let mut i = index;
527
528     for expected in literal.chars() {
529         match input.next() {
530             Some((c, _)) => {
531                 if c != expected {
532                     return Err(LexingError::UnexpectedCharacter {
533                         found: c,
534                         expected: vec![literal.to_string()],
535                         column: i,
536                         line,
537                     });
538                 } else {
539                     i += 1;
540                 }
541             }
542             None => {
543                 return Err(LexingError::UnexpectedEOF {
544                     expected: vec![literal.to_string()],
545                     column: i,
546                     line,
547                 });
548             }
549         }
550     }
551
552     Ok(literal.len())
553 }
554
555 /**
556  * This is another helper function that is used to match a character against a list of characters.
557  * Sometimes we might be expecting one of several characters, so this function tries to match that,
558  * and if it cannot, it errors.
559  * For example, let's say our input program is `succ0`.
560  * After `scan` consumes the `succ` keyword, it expects a space, a tab, or a left
561  * parenthesis to follow, but it finds `0`, so it errors.
562  */
563 fn match_one_of(
564     input: Option<char>,
565     literals: &[char],
566     allow_eof: bool,
567     line: usize,
568     index: usize,
569 ) -> Result<usize, LexingError> {
570     match input {
571         Some(next_next) => {
572             let eof_expectation = if allow_eof {
573                 vec!["<EOF>".to_string()]
574             } else {
575                 vec![]
576             };
577
578             if !literals.contains(&next_next) {
579                 return Err(LexingError::UnexpectedCharacter {
580                     found: next_next,
581                     expected: literals
582                         .iter()
583                         .map(|c| format!("{}", c))
584                         .chain(eof_expectation)
585                         .collect(),
586                     column: index,
587                     line,
588                 });
589             }
590
591             Ok(1)
592         }
593         None => {
594             if !allow_eof {
595                 return Err(LexingError::UnexpectedEOF {
596                     expected: literals.iter().map(|c| format!("{}", c)).collect(),
597                     column: index,
598                     line,
599                 });
600             }
601
602             Ok(0)
603         }
604     }
605 }
606
607 #[cfg(test)]
608 mod lexer_tests_happy_path {
609     use super::*;
610
611     #[test]
612     fn test_scan_empty_string() {
613         let input = "";
614
615         let tokens = scan(input).unwrap();
616
617         assert_eq!(tokens[0], Token::EOF);
618     }
619
620     #[test]
621     fn test_scan_basic() {
622         let input = "true false if then else 0 succ pred iszero ()";
623
624         let tokens = scan(input).unwrap();
625
626         assert_eq!(tokens[0], Token::True);
627         assert_eq!(tokens[1], Token::False);
628         assert_eq!(tokens[2], Token::If);
629         assert_eq!(tokens[3], Token::Then);
630         assert_eq!(tokens[4], Token::Else);
631         assert_eq!(tokens[5], Token::Zero);
632         assert_eq!(tokens[6], Token::Succ);
633         assert_eq!(tokens[7], Token::Pred);
634         assert_eq!(tokens[8], Token::IsZero);
635         assert_eq!(tokens[9], Token::LeftParenthesis);
636         assert_eq!(tokens[10], Token::RightParenthesis);
637     }
638
639     #[test]
640     fn test_scan_mixture_of_whitespaces() {
641         let input = "if\ttrue\t\t then 0 else succ\t0";
642
643         let tokens = scan(input).unwrap();
644
645         assert_eq!(tokens[0], Token::If);
646         assert_eq!(tokens[1], Token::True);
647         assert_eq!(tokens[2], Token::Then);
648         assert_eq!(tokens[3], Token::Zero);
649         assert_eq!(tokens[4], Token::Else);
650         assert_eq!(tokens[5], Token::Succ);
651         assert_eq!(tokens[6], Token::Zero);
652         assert_eq!(tokens[7], Token::EOF);
653     }
654
655     #[test]
656     fn test_scan_parentheses() {
657         let input = "if (true) then 0 else (succ (pred(0))";
658
659         let tokens = scan(input).unwrap();
660
661         assert_eq!(tokens[0], Token::If);
662         assert_eq!(tokens[1], Token::LeftParenthesis);
663         assert_eq!(tokens[2], Token::True);
664         assert_eq!(tokens[3], Token::RightParenthesis);
665         assert_eq!(tokens[4], Token::Then);
666         assert_eq!(tokens[5], Token::Zero);
667         assert_eq!(tokens[6], Token::Else);
668         assert_eq!(tokens[7], Token::LeftParenthesis);
669         assert_eq!(tokens[8], Token::Succ);
670         assert_eq!(tokens[9], Token::LeftParenthesis);
671         assert_eq!(tokens[10], Token::Pred);
672         assert_eq!(tokens[11], Token::LeftParenthesis);
673         assert_eq!(tokens[12], Token::Zero);
674         assert_eq!(tokens[13], Token::RightParenthesis);
675         assert_eq!(tokens[14], Token::RightParenthesis);
676         assert_eq!(tokens[15], Token::EOF);
677     }
678
679     #[test]
680     fn test_scan_valid_tokens_bad_syntax() {
681         let input = "true)false)if then 0";
682
683         let tokens = scan(input).unwrap();
684
685         assert_eq!(tokens[0], Token::True);
686         assert_eq!(tokens[1], Token::RightParenthesis);
687         assert_eq!(tokens[2], Token::False);
688         assert_eq!(tokens[3], Token::RightParenthesis);
689         assert_eq!(tokens[4], Token::If);
690         assert_eq!(tokens[5], Token::Then);
691         assert_eq!(tokens[6], Token::Zero);
692         assert_eq!(tokens[7], Token::EOF);
693     }
694 }
695
696 #[cfg(test)]
697 mod lexer_tests_sad_path {
698     use super::*;
699
700     #[test]
701     fn test_scan_unrecognized_token_error() {
702         let input = "if true then 0 else succ 0 1";
703
704         let tokens = scan(input);
705
706         assert_eq!(
707             tokens.err(),
708             Some(LexingError::UnrecognizedToken {
709                 found: String::from("1"),
710                 column: 27, // columns are 0-based; '1' is the 28th character
711                 line: 1,
712             })
713         );
714     }
715
716     #[test]
717     fn test_scan_unexpected_token_error() {
718         let input = "succ( suc ( 0 ) )";
719
720         let tokens = scan(input);
721
722         assert_eq!(
723             tokens.err(),
724             Some(LexingError::UnexpectedCharacter {
725                 found: ' ',
726                 expected: vec!["succ".to_string()],
727                 column: 9,
728                 line: 1,
729             })
730         );
731     }
732
733     #[test]
734     fn test_scan_unexpected_eof_error() {
735         let input = "tr";
736
737         let tokens = scan(input);
738
739         assert_eq!(
740             tokens.err(),
741             Some(LexingError::UnexpectedEOF {
742                 expected: vec!["true".to_string()],
743                 column: 2,
744                 line: 1,
745             })
746         );
747     }
748
749     #[test]
750     fn test_scan_literals_no_spaces_unexpected_token_error() {
751         let input = "truefalse";
752
753         let tokens = scan(input);
754
755         assert_eq!(
756             tokens.err(),
757             Some(LexingError::UnexpectedCharacter {
758                 found: 'f',
759                 expected: vec![
760                     " ".to_string(),
761                     "\t".to_string(),
762                     ")".to_string(),
763                     "<EOF>".to_string()
764                 ],
765                 column: 4,
766                 line: 1,
767             })
768         );
769     }
770 }
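If you have the project checked out, you can run all of the tests above with `cargo test`.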