Length of Longest Substring

Find the longest non-repeating substring of an input string.

Author
Published

September 17, 2024

The problem may be found on leetcode.

Analysis

A basic example of a search with linear time complexity is

abcabcbb -> abcabcbb --> abcabcbb --> abcabcbb --> abcabcbb --> abcabcbb
^           ^^           ^^^           ^^^           ^^^           ^^^ 
k=0         k=1          k=2          k=3          k=4          k=5
start=0     start=0      start=0      start=1      start=2      start=3

It would appear that

  1. the end of the range should be incremented on every loop,
  2. when a character in the selected range is encountered again, the start of the range should be incremented from its last occurrence.

However, with strings with many repeats, it is apparent that there are some subtleties to this approach:

foofoofiifii --> foofoofiifii --> foofoofiifii --> foofoofiifii --> foofoofiifii 
^                ^^                 ^               ^^^^              ^^^^
k=0              k=1              k=2              k=3              k=4
start=0          start=0          start=2          start=1          start=4

lastseen:        lastseen:        lastseen:
  f: 0             f: 0             f: 0
                   o: 1             o: 2

The step where k=3 makes it clear that backtracking start is not an option, and that

foofoofiifii --> foofoofiifii --> foofoofiifii --> foofoofiifii --> foofoofiifii 
^                ^^                 ^                ^^                 ^
k=0              k=1              k=2              k=3              k=4
start=0          start=0          start=2          start=2          start=4

A modification to this fixes the problem:

  1. when a character in the selected range is encoutered again, the start of the range should be incremented from its last occurrence unless this new value is less than the current value of start, in which case start should be maintained.

Implementation

My initial solution was

class SolutionInitial:
    def longestSubstring(self, s: str) -> tuple[str, int]:

        seen: dict[str, int] = dict()

        start = 0
        best_start, best_stop = 0, 0
        best_size = -1

        for k, char in enumerate(s):

            if (char_last_seen := seen.get(char)) is not None:
                start = max(start, char_last_seen + 1)

            if (diff := k - start) > best_size:
                # print("-->", "new best")
                best_start, best_stop = start, k
                best_size = diff

            seen[char] = k
            # print(s)
            # print(((start) * " ") + ((diff + 1) * "^"))

        if best_size == -1:
            return "", 0

        return s[best_start : best_stop + 1], 1 + best_stop - best_start

    def lengthOfLongestSubstring(self, s: str) -> int:
        _, best_size = self.longestSubstring(s)
        return best_size

which performed reasonably well in runtime (\(75\%\) quantile) and an abysmal \(8\%\) quantile in memory. Some small modifications made my solution much more performant, with \(93\%\) quantile runtime and \(39\%\) runtime:

class SolutionInitial:
    def longestSubstring(self, s: str) -> tuple[str, int]:

        seen: dict[str, int] = dict()

        start = 0
        best_start, best_stop = 0, 0
        best_size = -1

        for k, char in enumerate(s):

            if (char_last_seen := seen.get(char)) is not None:
                start = max(start, char_last_seen + 1)

            if (diff := k - start) > best_size:
                # print("-->", "new best")
                best_start, best_stop = start, k
                best_size = diff

            seen[char] = k
            # print(s)
            # print(((start) * " ") + ((diff + 1) * "^"))

        if best_size == -1:
            return "", 0

        return s[best_start : best_stop + 1], 1 + best_stop - best_start

    def lengthOfLongestSubstring(self, s: str) -> int:
        _, best_size = self.longestSubstring(s)
        return best_size

It would appear that seen is the main issue with memory usage, however other options probably have performance costs for searching (for instance maintaining a copy of the longest nonrepeating substring instead of its bounds).