Lexer



Just give me the codez

What is a lexer? #

The lexer[1], also known as the tokenizer or the scanner, is the first stage in the compilation process. It takes source code (NTLC source code in our case) and converts it into a stream of “tokens”.


First stage, you say? Are there more? Definitely!


Actually, the compiler is more like a pipeline. Each stage takes the output of the previous one and transforms it into something that is ready for the next stage to consume. The output of the lexer is the input of the parser, and so on.
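If it helps to see that idea in types, here is a minimal sketch of such a pipeline (the stage names and types below are placeholders for illustration, not the actual NTLC definitions):

struct Token;
struct Ast;
struct TypedAst;

// Each stage is a function from the previous stage's output to the
// next stage's input, and the compiler is just their composition.
fn lex(_source: &str) -> Vec<Token> {
    Vec::new() // the real lexer is the subject of this article
}

fn parse(_tokens: Vec<Token>) -> Ast {
    Ast // covered in the next article
}

fn typecheck(_ast: Ast) -> TypedAst {
    TypedAst // a later stage of the pipeline
}

fn compile(source: &str) -> TypedAst {
    typecheck(parse(lex(source)))
}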


Here is a very typical compiler pipeline[2]:

[Figure: Compiler Pipeline]

Why do we need a lexer? #

Strictly speaking, we don’t. It just allows us to work with the source code at a higher conceptual level. Later stages, namely the parser, will have a much easier time reasoning about the structure of the source code if they don’t have to worry about string indices, whitespace characters, and/or the nuances of character sets and text encoding.


Actually, this is true of every stage in the pipeline. If the source language were simple enough, you probably would not even need a compiler; you could get away with a few regular expressions and a bunch of string manipulation functions.


The compiler is essentially a systematic and deterministic way to convert a program written in one language to another. Remember the good old divide and conquer technique: it advises you to break a big problem into smaller manageable ones! Well, that’s what the compiler really does with all its stages.

Lexer output #

Remember when we said earlier that the lexer converts source code to tokens? It is time we talked about those tokens.

The lexer typically outputs an array/vector/list of tokens (technically speaking, that’s if it is not a streaming lexer, but let’s ignore that for now).
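Still, to make the distinction concrete, here is a hypothetical sketch of what a streaming lexer’s interface might look like (the trait and method names are made up for illustration; this is not the NTLC code):

/// Hypothetical interface of a streaming lexer (illustration only).
/// The parser pulls tokens one at a time instead of receiving
/// the whole vector up front.
pub trait TokenStream {
    type Token;

    /// Returns the next token, or None once the input is exhausted.
    fn next_token(&mut self) -> Option<Self::Token>;
}

The lexer we build in this article is of the simpler, non-streaming kind: it returns all of its tokens in one Vec.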


Now, most of the time a token[3] is a structure that has a type and a value. Here is an example to illustrate how our NTLC snippet from the previous post would map to a list of tokens:

[Figure: Lexer Output]

Notice that because our language is very simple, things almost map one-to-one without any additional data.
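In code form, and assuming the snippet were, say, `if iszero(0) then true else false` (any NTLC program would do), that mapping looks like this:

// Tokens for the input "if iszero(0) then true else false"
vec![
    Token::If,
    Token::IsZero,
    Token::LeftParenthesis,
    Token::Zero,
    Token::RightParenthesis,
    Token::Then,
    Token::True,
    Token::Else,
    Token::False,
    Token::EOF,
]

Notice that whitespace doesn’t survive tokenization; it only separates tokens. The trailing EOF token is something we append ourselves, and we will get to it below.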


But say, for example, our language were to support arbitrary numbers or strings. Then we would probably extend our token type to include a value field that would hold the actual value of the token.


For instance, a string token might look like this:

pub enum Token {
    StringLiteral(String),
    //            ^^^^^^
    //            This is the value field
    //
    //
    // More token types ...
}

or in TypeScript:

class StringLiteral {
  constructor(public readonly value: string) {}
}

type Token = StringLiteral | ...


When the lexer scans an input that has a string literal like “Hello World”, it would emit a token of type StringLiteral with a value of “Hello World” like so:

Token::StringLiteral("Hello World")
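And here is a minimal sketch of a scanning rule that could produce such a token, assuming double-quoted strings with no escape sequences (illustrative only; NTLC itself has no strings, and this helper is not part of its lexer):

/// Scans a double-quoted string literal (no escape sequences) and
/// returns its contents, or None if the literal is missing/unterminated.
fn scan_string_literal(chars: &mut std::iter::Peekable<std::str::Chars<'_>>) -> Option<String> {
    // Expect and consume the opening quote.
    if chars.peek() != Some(&'"') {
        return None;
    }
    chars.next();

    let mut value = String::new();
    // Accumulate characters until the closing quote.
    while let Some(c) = chars.next() {
        if c == '"' {
            // This is where we would emit Token::StringLiteral(value).
            return Some(value);
        }
        value.push(c);
    }

    None // the input ended before the closing quote
}

For example, calling it on the input `"Hello World"` (quotes included) returns Some("Hello World").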


One final thing, you might be thinking: wait, where did we get our token types from? Did we just make them up?


Actually, sort of. Remember, this is our language and we get to define the rules, but we won’t get into those until the next article on parsers.

How do lexers work? #

Before we describe the workings of the lexer, and the parser for that matter, it is worth mentioning that there exist tools known as Parser Generators (yacc and Tree-sitter are two such tools) that can generate the code implementing the lexer and parser for you, provided you give them your programming language definition in some form or another. But hey, where is the fun in that!


Okay, getting back to how lexers work. They basically scan the input character by character[4] (and, most of the time, they take a sneak peek ahead at the next character or characters) and try to match those characters against a set of rules. If a rule matches, the lexer emits a token and moves on to the next character. If it comes across an unexpected character, it throws an error.
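Here is that loop in miniature (a toy sketch, not the NTLC lexer; it only knows parentheses and spaces, and it returns raw characters instead of proper tokens):

// Toy scanner: matches each character against two "rules" and
// errors on anything it doesn't recognize.
fn toy_scan(input: &str) -> Result<Vec<char>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();

    while let Some(&c) = chars.peek() {
        match c {
            '(' | ')' => {
                tokens.push(c); // a rule matched: emit a token...
                chars.next(); // ...and advance to the next character
            }
            ' ' => {
                chars.next(); // skip whitespace
            }
            unexpected => return Err(format!("unexpected character '{}'", unexpected)),
        }
    }

    Ok(tokens)
}

The real NTLC lexer below has exactly this shape, just with more rules and with one character of lookahead.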


It’s good to know that in NTLC we have the following token types:

pub enum Token {
    True,             // Literal true
    False,            // Literal false
    If,               // Keyword if
    Then,             // Keyword then
    Else,             // Keyword else
    Zero,             // Literal zero
    Succ,             // builtin function succ()
    Pred,             // builtin function pred()
    IsZero,           // builtin function iszero()
    LeftParenthesis,  // (
    RightParenthesis, // )
    EOF,              // End of file (this is not something that one types in but it's a handy way
                      // to represent the end of the input)
}
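To make that concrete, here is what the lexer produces for a couple of inputs (assuming the scan() function and the Token enum from the full listing below):

// Assumes `scan` and `Token` from the listing below.
fn demo() {
    // Even the empty program yields a single token: EOF.
    assert_eq!(scan("").unwrap(), vec![Token::EOF]);

    // And "succ(0)" scans to:
    assert_eq!(
        scan("succ(0)").unwrap(),
        vec![
            Token::Succ,
            Token::LeftParenthesis,
            Token::Zero,
            Token::RightParenthesis,
            Token::EOF,
        ]
    );
}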


Okay, enough with words. I feel that describing the workings of a compiler makes things sound more complicated than they actually are. So, let’s just look at the code. I have left comments all over the place to explain what is going on.

  1. The word "lexer" comes from "lexical analysis"; "lexicon" is a Greek word that refers to the vocabulary of a language, including its words and their meanings.
  2. Actually, there is something quite satisfying about compilers: they are as "pure" as they can be. The compiler is essentially a function that takes a string and returns an alternative representation of that string -- no network calls, no database transactions, no file storage etc.
  3. Sometimes the token encodes additional information such as the line number and column of the corresponding source code. Choosing what to store in a token depends on your application and requirements.
  4. Most of the time, lexers are defined in terms of one or more regular expressions, and you will see why soon!

Full Lexer code listing #

HINT: check the tests at the bottom of the file for more examples of how the lexer works


Again, the full source code for the project is available here!


The scan() function is the entry point of the lexer module and the place where the main logic resides, so make sure you start there first, as everything else is more or less helper utilities.

use std::error::Error;
use std::fmt::Debug;
use std::vec::Vec;

// I keep a copy of the grammar here for reference
/*
 * t ::=                  // terms
 *     true               // constant true
 *     false              // constant false
 *     if t then t else t // conditional
 *     0                  // constant zero
 *     succ t             // successor
 *     pred t             // predecessor
 *     iszero t           // zero test
 */

static KEYWORD_TRUE: &str = "true";
static KEYWORD_FALSE: &str = "false";
static KEYWORD_IF: &str = "if";
static KEYWORD_THEN: &str = "then";
static KEYWORD_ELSE: &str = "else";
static KEYWORD_IS_ZERO: &str = "iszero";
static KEYWORD_ZERO: &str = "0";
static KEYWORD_SUCC: &str = "succ";
static KEYWORD_PRED: &str = "pred";

// These are directives, similar to annotations or decorators in other programming languages.
// clippy is Rust's linter and we are just telling it to be quiet about the EOF casing.
#[allow(clippy::upper_case_acronyms)]
// In Rust one can ask the compiler to implement certain interfaces for
// a type by using the `derive` keyword. In this case we are asking the
// compiler to implement the `PartialEq` (partial equality) interface for the `Token` type,
// which is needed/used when we compare two values of the same type for equality using ==.
#[derive(PartialEq)]
/*
 `enum` is, roughly speaking, similar to how unions work in TypeScript where you can say:
 `type Token = True | False`
 And so a token can either be True or False.
 Notice that `True` and `False` are not the literal boolean values `true` and `false`
 but rather the names of the variants of the enum. They are types in their own right.
 Tokens are the output of the lexer and the input of the parser.
*/
pub enum Token {
    True,
    False,
    If,
    Then,
    Else,
    Zero,
    Succ,
    Pred,
    IsZero,
    LeftParenthesis,
    RightParenthesis,
    EOF,
}

/**
 * Rust doesn't have classes but structs, which are similar to classes in other languages, just without "inline" methods.
 * This is how one implements "traits" (which roughly correspond to interfaces in other languages)
 * for a struct.
 * Essentially we are saying the Token type implements the Debug trait.
 * The Debug trait allows us to print the value of a Token when used in a format string (akin to string interpolation).
 */
impl Debug for Token {
    fn fmt(
        &self, /* similar to `this` in object oriented languages */
        f: &mut std::fmt::Formatter,
    ) -> std::fmt::Result {
        /*
         * I hope this is very readable and self-explanatory
         */
        match self {
            Token::True => write!(f, "TRUE"),
            Token::False => write!(f, "FALSE"),
            Token::If => write!(f, "IF"),
            Token::Then => write!(f, "THEN"),
            Token::Else => write!(f, "ELSE"),
            Token::Zero => write!(f, "0"),
            Token::Succ => write!(f, "SUCC"),
            Token::Pred => write!(f, "PRED"),
            Token::IsZero => write!(f, "IS_ZERO"),
            Token::LeftParenthesis => write!(f, "("),
            Token::RightParenthesis => write!(f, ")"),
            Token::EOF => write!(f, "<EOF>"),
        }
    }
}

/**
 * While the `Debug` trait is used for debugging purposes, the `Display` trait is used for printing
 * a human readable or nice string representation of a value.
 * It is similar in that regard to how one might override `toString()` in Java.
 */
impl std::fmt::Display for Token {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        match self {
            Token::True => write!(f, "TRUE"),
            Token::False => write!(f, "FALSE"),
            Token::If => write!(f, "IF"),
            Token::Then => write!(f, "THEN"),
            Token::Else => write!(f, "ELSE"),
            Token::Zero => write!(f, "0"),
            Token::Succ => write!(f, "SUCC"),
            Token::Pred => write!(f, "PRED"),
            Token::IsZero => write!(f, "IS_ZERO"),
            Token::LeftParenthesis => write!(f, "("),
            Token::RightParenthesis => write!(f, ")"),
            Token::EOF => write!(f, "<EOF>"),
        }
    }
}

#[derive(PartialEq)]
/*
 * We use an enum to represent the different kinds of errors the lexer could run into
 */
pub enum LexingError {
    UnrecognizedToken {
        found: String,
        column: usize,
        line: usize,
    },
    UnexpectedCharacter {
        found: char,
        expected: Vec<String>,
        column: usize,
        line: usize,
    },
    UnexpectedEOF {
        expected: Vec<String>,
        column: usize,
        line: usize,
    },
}

/**
 * This is a helper function that is used to format lexer error messages
 */
fn format_expectation_error(expected: &Vec<String>) -> String {
    let mut result: Vec<String> = Vec::new();

    for e in expected {
        match e {
            s if s == " " => result.push("<SPACE>".to_string()),
            s if s == "\t" => result.push("<TAB>".to_string()),
            s if s == "\0" => result.push("<EOF>".to_string()),
            s => result.push(format!("'{}'", s)),
        }
    }

    result.join(", ")
}

impl std::fmt::Display for LexingError {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        match self {
            LexingError::UnrecognizedToken {
                found,
                column,
                line,
            } => {
                write!(
                    f,
                    "Unrecognized token '{}' at line {}, column {}",
                    found,
                    line + 1,
                    column
                )
            }
            LexingError::UnexpectedCharacter {
                found,
                expected,
                column,
                line,
            } => {
                write!(
                    f,
                    "Found '{}' at line {}, column {} when perhaps you meant '{}'?",
                    found,
                    line,
                    column + 1,
                    format_expectation_error(expected)
                )
            }
            LexingError::UnexpectedEOF {
                expected,
                column,
                line,
            } => {
                write!(
                    f,
                    "Unexpected end of file at line {}, column {} when you perhaps meant '{}'?",
                    line,
                    column + 1,
                    format_expectation_error(expected)
                )
            }
        }
    }
}

impl std::fmt::Debug for LexingError {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        match self {
            LexingError::UnrecognizedToken {
                found,
                column,
                line,
            } => {
                write!(f, "Unknown token '{}' at ({}, {})", found, line, column)
            }
            LexingError::UnexpectedCharacter {
                found,
                expected,
                column,
                line,
            } => {
                write!(
                    f,
                    "Unexpected token '{}' at ({}, {}). Expectation context {}",
                    found,
                    line,
                    column,
                    format_expectation_error(expected)
                )
            }
            LexingError::UnexpectedEOF {
                expected,
                column,
                line,
            } => {
                write!(
                    f,
                    "Unexpected EOF looking for lexeme '{}' at ({}, {})",
                    format_expectation_error(expected),
                    line,
                    column
                )
            }
        }
    }
}

/**
 * The `Error` trait is like a marker interface.
 * It just tells the compiler that `LexingError` is, well, kind of an error.
 */
impl Error for LexingError {}

/**
 * This is the heart of the lexer. It basically takes NTLC program source code
 * as input and returns a vector/list of tokens of type `Token` as output.
 *
 * It iterates over the input string character by character from left to right and tries to match them
 * against the known tokens of the language.
 *
 * Most of the time, the lexer is actually made part of the parser module and is implemented
 * as a streaming lexer. That is, it is implemented as a function that returns the next
 * token only when asked for explicitly by the parser and not all at once.
 * This is done to avoid having to read the entire input into memory at once.
 * Also, when working within a text editor or IDE, those tools can send you incomplete
 * pieces of code to avoid resending the entire source file over and over again.
 * We will see this when we are working on the language server.
 *
 *
 * You might see the `&str` type and wonder what it is. It is just a reference to a string.
 * Or to be more technically correct, a readonly reference to a string slice (what a mouthful).
 * This avoids copying the string around in memory. Don't worry about it too much.
 *
 * The `Result<>` type is probably implemented as an `enum` under the hood. It says this function
 * either returns a vector of tokens or an error.
 * This is a concept from functional programming and it's very idiomatic to see the Result<>
 * type all over the place in Rust code.
 * If you have worked with a language that uses exceptions, this might feel strange at first
 * but I personally think it is way cleaner to explicitly state that your function might
 * return an error. This reminds me to some extent of checked exceptions in Java (at least the last time I used it).
 */
pub fn scan(input: &str) -> Result<Vec<Token>, LexingError> {
    let mut tokens: Vec<Token> = Vec::new();

    /*
    This line might seem to be doing a bit of magic but essentially it is just a convenient
    way to line up the input with itself shifted by one character, essentially giving us each character
    and the one following it as a pair (char, next_char). Let me demonstrate:

    let's say that the input is "succ(0)"; this will give us the following:
    s u c c ( 0 )
    u c c ( 0 ) \0

    and as pairs it comes out as:
    [('s', 'u'), ('u', 'c'), ('c', 'c'), ('c', '('), ('(', '0'), ('0', ')'), (')', '\0')]

    This is handy because a lot of the time we need to "lookahead" when we are scanning the input
    to decide what the next token is. Keep reading.
    */
    let mut chars = input
        .chars()
        .zip(input.chars().skip(1).chain(['\0'])) // we are adding a null character at the end of the shifted string so both strings have the same length; this prevents the iterator from stopping prematurely
        .peekable();

    // We are tracking the line number although our lexer doesn't support multiple lines of input.
    // This is useful for error reporting if we were to support new lines.
    // The same goes for `index`, which is the column number within the input string.
    let line = 1usize;
    let mut index = 0usize;

    // This is the main loop of the lexer. It iterates over the input characters
    // and tries to match them against known tokens of the language.
    // The `let Some(&(c, next)) = chars.peek()` is known as a pattern match expression.
    // It is also something that is very common in functional languages.
    // It is similar to destructuring in JavaScript or Python but more powerful.
    // Here we are saying: if there is a next character in the input, assign it to `c`
    // and assign the character after it to `next`.
    while let Some(&(c, next)) = chars.peek() {
        // This is the main match statement. It is similar to a switch statement in other languages
        // but more powerful because of pattern matching.
        // Now, the code basically does the following:
        match c {
            // if the current character is 't' then check the next character
            // because in our grammar, there are two tokens that start with 't',
            // `true` and `then`, and we need to know which one it is.
            // If the `t` is followed by an `r` then it is `true`, and if it is
            // followed by an `h` it is `then`, and we can move on to the next token.
            // The rest of the rules work in a similar fashion.
            // Now you see why lexers typically use regular expressions to match tokens.
            't' => match next {
                'r' => {
                    /*
                    We use the consume function to "consume" the rest of the keyword
                    and advance the iterator to the next character
                    */
                    index += consume(&mut chars, KEYWORD_TRUE, line, index)?;
                    tokens.push(Token::True);
                    /*
                    We use the match_one_of function to check if the next character is
                    one of the characters we expect to see after the keyword.
                    Imagine that the input might be the invalid program `truefalse`.
                    This is incorrect and we use match_one_of to detect that.
                    Here we are saying we expect to see either a space, a tab, or a right parenthesis.
                    Notice that this would accept the input `true)` which is also incorrect
                    but that would be detected by the parser.
                    Choosing which errors to catch where is a matter of taste and maybe
                    a bit of performance.
                    */
                    match_one_of(
                        chars.peek().map(|&(x, _)| x),
                        &[' ', '\t', ')'],
                        true,
                        line,
                        index,
                    )?;
                }
                'h' => {
                    index += consume(&mut chars, KEYWORD_THEN, line, index)?;
                    tokens.push(Token::Then);
                    match_one_of(
                        chars.peek().map(|&(x, _)| x),
                        &[' ', '\t'],
                        false,
                        line,
                        index,
                    )?;
                }
                '\0' => {
                    return Err(LexingError::UnexpectedEOF {
                        expected: vec!["true".to_string(), "then".to_string()],
                        column: index,
                        line,
                    })
                }
                c => {
                    return Err(LexingError::UnexpectedCharacter {
                        found: c,
                        expected: vec!["true".to_string(), "then".to_string()],
                        column: index,
                        line,
                    })
                }
            },
            'f' => {
                index += consume(&mut chars, KEYWORD_FALSE, line, index)?;
                tokens.push(Token::False);
                match_one_of(
                    chars.peek().map(|&(x, _)| x),
                    &[' ', '\t', ')'],
                    true,
                    line,
                    index,
                )?;
            }
            'i' => match next {
                'f' => {
                    index += consume(&mut chars, KEYWORD_IF, line, index)?;
                    tokens.push(Token::If);
                    match_one_of(
                        chars.peek().map(|&(x, _)| x),
                        &[' ', '\t'],
                        false,
                        line,
                        index,
                    )?;
                }
                's' => {
                    index += consume(&mut chars, KEYWORD_IS_ZERO, line, index)?;
                    tokens.push(Token::IsZero);
                    match_one_of(
                        chars.peek().map(|&(x, _)| x),
                        &[' ', '\t', '('],
                        false,
                        line,
                        index,
                    )?;
                }
                '\0' => {
                    return Err(LexingError::UnexpectedEOF {
                        expected: vec!["if".to_string(), "iszero".to_string()],
                        column: index,
                        line,
                    })
                }
                c => {
                    return Err(LexingError::UnexpectedCharacter {
                        found: c,
                        expected: vec!["if".to_string(), "iszero".to_string()],
                        column: index,
                        line,
                    })
                }
            },
            'e' => {
                index += consume(&mut chars, KEYWORD_ELSE, line, index)?;
                tokens.push(Token::Else);
                match_one_of(
                    chars.peek().map(|&(x, _)| x),
                    &[' ', '\t'],
                    false,
                    line,
                    index,
                )?;
            }
            '0' => {
                index += consume(&mut chars, KEYWORD_ZERO, line, index)?;
                tokens.push(Token::Zero);
                match_one_of(
                    chars.peek().map(|&(x, _)| x),
                    &[' ', '\t', ')'],
                    true,
                    line,
                    index,
                )?;
            }
            's' => {
                index += consume(&mut chars, KEYWORD_SUCC, line, index)?;
                tokens.push(Token::Succ);
                match_one_of(
                    chars.peek().map(|&(x, _)| x),
                    &[' ', '\t', '('],
                    false,
                    line,
                    index,
                )?;
            }
            'p' => {
                index += consume(&mut chars, KEYWORD_PRED, line, index)?;
                tokens.push(Token::Pred);
                match_one_of(
                    chars.peek().map(|&(x, _)| x),
                    &[' ', '\t', '('],
                    false,
                    line,
                    index,
                )?;
            }
            '(' => {
                chars.next();
                index += 1;
                tokens.push(Token::LeftParenthesis);
            }
            ')' => {
                chars.next();
                index += 1;
                tokens.push(Token::RightParenthesis);
            }
            /*
            We don't care about whitespace characters so we just skip them
            */
            '\t' | ' ' => {
                chars.next();
                index += 1;
            }
            /*
            This is the case where we don't recognize the current character
            as part of the language so we just error
            */
            ut => {
                return Err(LexingError::UnrecognizedToken {
                    found: ut.to_string(),
                    column: index,
                    line,
                })
            }
        }
    }

    // We add an EOF token at the end of the token stream to indicate we are done.
    // This is used later by the parser to know when to stop or error.
    tokens.push(Token::EOF);

    Ok(tokens)
}

/*
This is a helper function that consumes a literal from the input.
It is used to match keywords and other literals in the language.
*/
fn consume<T>(
    input: &mut T,
    literal: &'static str,
    line: usize,
    index: usize,
) -> Result<usize, LexingError>
where
    T: std::iter::Iterator<Item = (char, char)>,
{
    let mut i = index;

    for expected in literal.chars() {
        match input.next() {
            Some((c, _)) => {
                if c != expected {
                    return Err(LexingError::UnexpectedCharacter {
                        found: c,
                        expected: vec![literal.to_string()],
                        column: i,
                        line,
                    });
                } else {
                    i += 1;
                }
            }
            None => {
                return Err(LexingError::UnexpectedEOF {
                    expected: vec![literal.to_string()],
                    column: i,
                    line,
                });
            }
        }
    }

    Ok(literal.len())
}

/**
 * This is another helper function that is used to match a character against a list of characters.
 * Sometimes we might be expecting one character or another, so this function tries to do that
 * and if it cannot, it errors.
 * For example, let's say our input program is `pre(0)`.
 * The `scan` function sees a `p` and tries to match it against the `pred` keyword,
 * but after `pre` it finds `(` where it expected a `d`, so it errors.
 */
fn match_one_of(
    input: Option<char>,
    literals: &[char],
    allow_eof: bool,
    line: usize,
    index: usize,
) -> Result<usize, LexingError> {
    match input {
        Some(next_next) => {
            let eof_expectation = if allow_eof {
                vec!["<EOF>".to_string()]
            } else {
                vec![]
            };

            if !literals.contains(&next_next) {
                return Err(LexingError::UnexpectedCharacter {
                    found: next_next,
                    expected: literals
                        .iter()
                        .map(|c| format!("{}", c))
                        .chain(eof_expectation)
                        .collect(),
                    column: index,
                    line,
                });
            }

            Ok(1)
        }
        None => {
            if !allow_eof {
                return Err(LexingError::UnexpectedEOF {
                    expected: literals.iter().map(|c| format!("{}", c)).collect(),
                    column: index,
                    line,
                });
            }

            Ok(0)
        }
    }
}

#[cfg(test)]
mod lexer_tests_happy_path {
    use super::*;

    #[test]
    fn test_scan_empty_string() {
        let input = "";

        let tokens = scan(input).unwrap();

        assert_eq!(tokens[0], Token::EOF);
    }

    #[test]
    fn test_scan_basic() {
        let input = "true false if then else 0 succ pred iszero ()";

        let tokens = scan(input).unwrap();

        assert_eq!(tokens[0], Token::True);
        assert_eq!(tokens[1], Token::False);
        assert_eq!(tokens[2], Token::If);
        assert_eq!(tokens[3], Token::Then);
        assert_eq!(tokens[4], Token::Else);
        assert_eq!(tokens[5], Token::Zero);
        assert_eq!(tokens[6], Token::Succ);
        assert_eq!(tokens[7], Token::Pred);
        assert_eq!(tokens[8], Token::IsZero);
        assert_eq!(tokens[9], Token::LeftParenthesis);
        assert_eq!(tokens[10], Token::RightParenthesis);
    }

    #[test]
    fn test_scan_mixture_of_whitespaces() {
        let input = "if\ttrue\t\t then 0 else succ\t0";

        let tokens = scan(input).unwrap();

        assert_eq!(tokens[0], Token::If);
        assert_eq!(tokens[1], Token::True);
        assert_eq!(tokens[2], Token::Then);
        assert_eq!(tokens[3], Token::Zero);
        assert_eq!(tokens[4], Token::Else);
        assert_eq!(tokens[5], Token::Succ);
        assert_eq!(tokens[6], Token::Zero);
        assert_eq!(tokens[7], Token::EOF);
    }

    #[test]
    fn test_scan_parentheses() {
        let input = "if (true) then 0 else (succ (pred(0))";

        let tokens = scan(input).unwrap();

        assert_eq!(tokens[0], Token::If);
        assert_eq!(tokens[1], Token::LeftParenthesis);
        assert_eq!(tokens[2], Token::True);
        assert_eq!(tokens[3], Token::RightParenthesis);
        assert_eq!(tokens[4], Token::Then);
        assert_eq!(tokens[5], Token::Zero);
        assert_eq!(tokens[6], Token::Else);
        assert_eq!(tokens[7], Token::LeftParenthesis);
        assert_eq!(tokens[8], Token::Succ);
        assert_eq!(tokens[9], Token::LeftParenthesis);
        assert_eq!(tokens[10], Token::Pred);
        assert_eq!(tokens[11], Token::LeftParenthesis);
        assert_eq!(tokens[12], Token::Zero);
        assert_eq!(tokens[13], Token::RightParenthesis);
        assert_eq!(tokens[14], Token::RightParenthesis);
        assert_eq!(tokens[15], Token::EOF);
    }

    #[test]
    fn test_scan_valid_tokens_bad_syntax() {
        let input = "true)false)if then 0";

        let tokens = scan(input).unwrap();

        assert_eq!(tokens[0], Token::True);
        assert_eq!(tokens[1], Token::RightParenthesis);
        assert_eq!(tokens[2], Token::False);
        assert_eq!(tokens[3], Token::RightParenthesis);
        assert_eq!(tokens[4], Token::If);
        assert_eq!(tokens[5], Token::Then);
        assert_eq!(tokens[6], Token::Zero);
        assert_eq!(tokens[7], Token::EOF);
    }
}

#[cfg(test)]
mod lexer_tests_sad_path {
    use super::*;

    #[test]
    fn test_scan_unrecognized_token_error() {
        let input = "if true then 0 else succ 0 1";

        let tokens = scan(input);

        assert_eq!(
            tokens.err(),
            Some(LexingError::UnrecognizedToken {
                found: String::from("1"),
                column: 35,
                line: 1,
            })
        );
    }

    #[test]
    fn test_scan_unexpected_token_error() {
        let input = "succ( suc ( 0 ) )";

        let tokens = scan(input);

        assert_eq!(
            tokens.err(),
            Some(LexingError::UnexpectedCharacter {
                found: ' ',
                expected: vec!["succ".to_string()],
                column: 9,
                line: 1,
            })
        );
    }

    #[test]
    fn test_scan_unexpected_eof_error() {
        let input = "tr";

        let tokens = scan(input);

        assert_eq!(
            tokens.err(),
            Some(LexingError::UnexpectedEOF {
                expected: vec!["true".to_string()],
                column: 2,
                line: 1,
            })
        );
    }

    #[test]
    fn test_scan_literals_no_spaces_unexpected_token_error() {
        let input = "truefalse";

        let tokens = scan(input);

        assert_eq!(
            tokens.err(),
            Some(LexingError::UnexpectedCharacter {
                found: 'f',
                expected: vec![
                    " ".to_string(),
                    "\t".to_string(),
                    ")".to_string(),
                    "<EOF>".to_string()
                ],
                column: 4,
                line: 1,
            })
        );
    }
}
770}