- Last update: June 8, 2022 Translated From: e-maxx.ru

Hashing algorithms are helpful in solving a lot of problems.

We want to solve the problem of comparing strings efficiently.The brute force way of doing so is just to compare the letters of both strings, which has a time complexity of \(O(\min(n_1, n_2))\) if \(n_1\) and \(n_2\) are the sizes of the two strings.We want to do better.The idea behind the string hashing is the following: we map each string into an integer and compare those instead of the strings.Doing this allows us to reduce the execution time of the string comparison to \(O(1)\).

For the conversion, we need a so-called **hash function**.The goal of it is to convert a string into an integer, the so-called **hash** of the string.The following condition has to hold: if two strings \(s\) and \(t\) are equal (\(s = t\)), then also their hashes have to be equal (\(\text{hash}(s) = \text{hash}(t)\)).Otherwise, we will not be able to compare strings.

Notice, the opposite direction doesn't have to hold.If the hashes are equal (\(\text{hash}(s) = \text{hash}(t)\)), then the strings do not necessarily have to be equal.E.g. a valid hash function would be simply \(\text{hash}(s) = 0\) for each \(s\).Now, this is just a stupid example, because this function will be completely useless, but it is a valid hash function.The reason why the opposite direction doesn't have to hold, is because there are exponentially many strings.If we only want this hash function to distinguish between all strings consisting of lowercase characters of length smaller than 15, then already the hash wouldn't fit into a 64-bit integer (e.g. unsigned long long) any more, because there are so many of them.And of course, we don't want to compare arbitrary long integers, because this will also have the complexity \(O(n)\).

So usually we want the hash function to map strings onto numbers of a fixed range \([0, m)\), then comparing strings is just a comparison of two integers with a fixed length.And of course, we want \(\text{hash}(s) \neq \text{hash}(t)\) to be very likely if \(s \neq t\).

That's the important part that you have to keep in mind.Using hashing will not be 100% deterministically correct, because two complete different strings might have the same hash (the hashes collide).However, in a wide majority of tasks, this can be safely ignored as the probability of the hashes of two different strings colliding is still very small.And we will discuss some techniques in this article how to keep the probability of collisions very low.

## Calculation of the hash of a string

The good and widely used way to define the hash of a string \(s\) of length \(n\) is

\[\begin{align}\text{hash}(s) &= s[0] + s[1] \cdot p + s[2] \cdot p^2 + ... + s[n-1] \cdot p^{n-1} \mod m \\&= \sum_{i=0}^{n-1} s[i] \cdot p^i \mod m,\end{align}\]

where \(p\) and \(m\) are some chosen, positive numbers.It is called a **polynomial rolling hash function**.

It is reasonable to make \(p\) a prime number roughly equal to the number of characters in the input alphabet.For example, if the input is composed of only lowercase letters of the English alphabet, \(p = 31\) is a good choice.If the input may contain both uppercase and lowercase letters, then \(p = 53\) is a possible choice.The code in this article will use \(p = 31\).

Obviously \(m\) should be a large number since the probability of two random strings colliding is about \(\approx \frac{1}{m}\).Sometimes \(m = 2^{64}\) is chosen, since then the integer overflows of 64-bit integers work exactly like the modulo operation.However, there exists a method, which generates colliding strings (which work independently from the choice of \(p\)).So in practice, \(m = 2^{64}\) is not recommended.A good choice for \(m\) is some large prime number.The code in this article will just use \(m = 10^9+9\).This is a large number, but still small enough so that we can perform multiplication of two values using 64-bit integers.

Here is an example of calculating the hash of a string \(s\), which contains only lowercase letters.We convert each character of \(s\) to an integer.Here we use the conversion \(a \rightarrow 1\), \(b \rightarrow 2\), \(\dots\), \(z \rightarrow 26\).Converting \(a \rightarrow 0\) is not a good idea, because then the hashes of the strings \(a\), \(aa\), \(aaa\), \(\dots\) all evaluate to \(0\).

`long long compute_hash(string const& s) { const int p = 31; const int m = 1e9 + 9; long long hash_value = 0; long long p_pow = 1; for (char c : s) { hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m; p_pow = (p_pow * p) % m; } return hash_value;}`

Precomputing the powers of \(p\) might give a performance boost.

## Example tasks

### Search for duplicate strings in an array of strings

Problem: Given a list of \(n\) strings \(s_i\), each no longer than \(m\) characters, find all the duplicate strings and divide them into groups.

From the obvious algorithm involving sorting the strings, we would get a time complexity of \(O(n m \log n)\) where the sorting requires \(O(n \log n)\) comparisons and each comparison take \(O(m)\) time.However, by using hashes, we reduce the comparison time to \(O(1)\), giving us an algorithm that runs in \(O(n m + n \log n)\) time.

We calculate the hash for each string, sort the hashes together with the indices, and then group the indices by identical hashes.

`vector<vector<int>> group_identical_strings(vector<string> const& s) { int n = s.size(); vector<pair<long long, int>> hashes(n); for (int i = 0; i < n; i++) hashes[i] = {compute_hash(s[i]), i}; sort(hashes.begin(), hashes.end()); vector<vector<int>> groups; for (int i = 0; i < n; i++) { if (i == 0 || hashes[i].first != hashes[i-1].first) groups.emplace_back(); groups.back().push_back(hashes[i].second); } return groups;}`

### Fast hash calculation of substrings of given string

Problem: Given a string \(s\) and indices \(i\) and \(j\), find the hash of the substring \(s [i \dots j]\).

By definition, we have:

\[\text{hash}(s[i \dots j]) = \sum_{k = i}^j s[k] \cdot p^{k-i} \mod m\]

Multiplying by \(p^i\) gives:

\[\begin{align}\text{hash}(s[i \dots j]) \cdot p^i &= \sum_{k = i}^j s[k] \cdot p^k \mod m \\&= \text{hash}(s[0 \dots j]) - \text{hash}(s[0 \dots i-1]) \mod m\end{align}\]

So by knowing the hash value of each prefix of the string \(s\), we can compute the hash of any substring directly using this formula.The only problem that we face in calculating it is that we must be able to divide \(\text{hash}(s[0 \dots j]) - \text{hash}(s[0 \dots i-1])\) by \(p^i\).Therefore we need to find the modular multiplicative inverse of \(p^i\) and then perform multiplication with this inverse.We can precompute the inverse of every \(p^i\), which allows computing the hash of any substring of \(s\) in \(O(1)\) time.

However, there does exist an easier way.In most cases, rather than calculating the hashes of substring exactly, it is enough to compute the hash multiplied by some power of \(p\).Suppose we have two hashes of two substrings, one multiplied by \(p^i\) and the other by \(p^j\).If \(i < j\) then we multiply the first hash by \(p^{j-i}\), otherwise, we multiply the second hash by \(p^{i-j}\).By doing this, we get both the hashes multiplied by the same power of \(p\) (which is the maximum of \(i\) and \(j\)) and now these hashes can be compared easily with no need for any division.

## Applications of Hashing

Here are some typical applications of Hashing:

- Rabin-Karp algorithm for pattern matching in a string in \(O(n)\) time
- Calculating the number of different substrings of a string in \(O(n^2 \log n)\) (see below)
- Calculating the number of palindromic substrings in a string.

### Determine the number of different substrings in a string

Problem: Given a string \(s\) of length \(n\), consisting only of lowercase English letters, find the number of different substrings in this string.

To solve this problem, we iterate over all substring lengths \(l = 1 \dots n\).For every substring length \(l\) we construct an array of hashes of all substrings of length \(l\) multiplied by the same power of \(p\).The number of different elements in the array is equal to the number of distinct substrings of length \(l\) in the string.This number is added to the final answer.

For convenience, we will use \(h[i]\) as the hash of the prefix with \(i\) characters, and define \(h[0] = 0\).

`int count_unique_substrings(string const& s) { int n = s.size(); const int p = 31; const int m = 1e9 + 9; vector<long long> p_pow(n); p_pow[0] = 1; for (int i = 1; i < n; i++) p_pow[i] = (p_pow[i-1] * p) % m; vector<long long> h(n + 1, 0); for (int i = 0; i < n; i++) h[i+1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % m; int cnt = 0; for (int l = 1; l <= n; l++) { set<long long> hs; for (int i = 0; i <= n - l; i++) { long long cur_h = (h[i + l] + m - h[i]) % m; cur_h = (cur_h * p_pow[n-i-1]) % m; hs.insert(cur_h); } cnt += hs.size(); } return cnt;}`

## Improve no-collision probability

Quite often the above mentioned polynomial hash is good enough, and no collisions will happen during tests.Remember, the probability that collision happens is only \(\approx \frac{1}{m}\).For \(m = 10^9 + 9\) the probability is \(\approx 10^{-9}\) which is quite low.But notice, that we only did one comparison.What if we compared a string \(s\) with \(10^6\) different strings.The probability that at least one collision happens is now \(\approx 10^{-3}\).And if we want to compare \(10^6\) different strings with each other (e.g. by counting how many unique strings exists), then the probability of at least one collision happening is already \(\approx 1\).It is pretty much guaranteed that this task will end with a collision and returns the wrong result.

There is a really easy trick to get better probabilities.We can just compute two different hashes for each string (by using two different \(p\), and/or different \(m\), and compare these pairs instead.If \(m\) is about \(10^9\) for each of the two hash functions than this is more or less equivalent as having one hash function with \(m \approx 10^{18}\).When comparing \(10^6\) strings with each other, the probability that at least one collision happens is now reduced to \(\approx 10^{-6}\).

## Practice Problems

- Good Substrings - Codeforces
- A Needle in the Haystack - SPOJ
- Double Profiles - Codeforces
- Password - Codeforces
- SUB_PROB - SPOJ
- INSQ15_A
- SPOJ - Ada and Spring Cleaning
- GYM - Text Editor
- 12012 - Detection of Extraterrestrial
- Codeforces - Games on a CD
- UVA 11855 - Buzzwords
- Codeforces - Santa Claus and a Palindrome
- Codeforces - String Compression
- Codeforces - Palindromic Characteristics
- SPOJ - Test
- Codeforces - Palindrome Degree
- Codeforces - Deletion of Repeats
- HackerRank - Gift Boxes

- Contributors:
- jakobkogler (60.71%)
- paramsingh (12.05%)
- gaurav-singh1998 (8.48%)
- Morass (5.8%)
- likecs (4.46%)
- adamant-pwn (4.02%)
- minhazmiraz (0.89%)
- tcNickolas (0.89%)
- MayankPratap (0.89%)
- hanzala-sohrab (0.45%)
- Tahanima (0.45%)
- Siddharths8212376 (0.45%)
- wiktorkuchta (0.45%)

## FAQs

### Is hashing important for competitive programming? ›

**Hashing gives a more secure and adjustable method of retrieving data compared to any other data structure**. It is quicker than searching for lists and arrays.

**Which string hashing is best? ›**

If you just want to have a good hash function, and cannot wait, djb2 is one of the best string hash functions i know. it has excellent distribution and speed on many different sets of keys and table sizes. you are not likely to do better with one of the "well known" functions such as PJW, K&R[1], etc.

**Does use of hash function reduce the time complexity of string matching algorithm? ›**

Comparing strings of length N takes O(N) time complexity but comparing integers take O(1) time complexity. Hence, **comparing hash of strings take O(1) time complexity**.

**What will be the time complexity to find out the hash value for all the strings? ›**

The complexity of a hashing function is never **O(1)**. If the length of the string is n then the complexity is surely O(n). However, if you compute all hashes in a given array, you won't have to calculate for the second time and you can always compare two strings in O(1) time by comparing the precalculated hashes.

**Is hash code tough? ›**

How hard is hashcode? Google HashCode is **not an easy competition but it is also not that hard if you are familiar with the programming language you've chosen and can come up with an efficient solution to problems within a small window**.

**What is the most efficient hashing algorithm? ›**

Probably the one most commonly used is **SHA-256**, which the National Institute of Standards and Technology (NIST) recommends using instead of MD5 or SHA-1. The SHA-256 algorithm returns hash value of 256-bits, or 64 hexadecimal digits.

**Which is the most efficient string matching algorithm? ›**

The so-called **naive or brute force algorithm** is the most intuitive approach to the string pattern-matching problem. This algorithm attempts simply to match the pattern in the target at successive positions from left to right.

**Should I use SHA512 or SHA-256? ›**

The reason why **SHA-512 is faster than SHA-256 on 64-bit machines** is that has 37.5% less rounds per byte (80 rounds operating on 128 byte blocks) compared to SHA- 256 (64 rounds operating on 64 byte blocks), where the operations use 64-bit integer arithmetic.

**What are two most popular hashing algorithms? ›**

There are many different types of hash algorithms such as RipeMD, Tiger, xxhash and more, but the most common type of hashing used for file integrity checks are MD5, SHA-2 and CRC32. MD5 - An MD5 hash function encodes a string of information and encodes it into a 128-bit fingerprint.

**What is the time complexity of string matching algorithm? ›**

It is an efficient algorithm as it has linear time complexity. It has a time complexity of **O(m+n)**, where m is the length of the string and n is the length of the pattern to be searched.

### How does string hash work? ›

Hashing is **the process of transforming any given key or a string of characters into another value**. This is usually represented by a shorter, fixed-length value or key that represents and makes it easier to find or employ the original string. The most popular use for hashing is the implementation of hash tables.

**In which string algorithm hash function is used? ›**

One of the most famous applications of hashing is the **Rabin-Karp algorithm**. This is basically a string-searching algorithm which uses hashing to find any one set of patterns in a string.

**Which hash is $6$? ›**

Passwords starting with “$5$” or “$6$” are interpreted as hashed with **Linux SHA256 or SHA512** password hashing, respectively.

**What hash is 40 characters long? ›**

**SHA-1** is a hash function designed by the N.S.A. ("SHA" = "Secure Hash Algorithm"). It returns a 160-bit hexadecimal string which is 40 characters long. It was used for secure encryption from 1996 to 2010, largely as a replacement for MD5, but now it is used mostly for data integrity.

**How many hashes are in 8 bit output? ›**

#3 If you have an 8 bit hash output, how many possible hashes are there? There are **28** possibles hashes.

**What is the hardest type of coding? ›**

**Haskell**. The language is named after a mathematician and is usually described to be one of the hardest programming languages to learn. It is a completely functional language built on lambda calculus.

**What is the hardest form of coding? ›**

**Malbolge**. Malbolge was invented in 1998 by Ben Olmstead. This esolang is considered to be the most complicated programming language.

**What is the hardest coding? ›**

**7 Hardest Programming Languages to Learn for FAANG Interviews**

- C++ C++ is an object-oriented programming language and is considered the fastest language out there. ...
- Prolog. Prolog stands for Logic Programming. ...
- LISP. LISP stands for List Processing. ...
- Haskell. ...
- Assembly Language (ASM) ...
- Rust. ...
- Esoteric Languages.

**What is the weakest hashing algorithm? ›**

The simplest hashing algorithm is parity, which with a single bit of output can't do miracles. But algorithms that are designed as **cryptographic algorithms** are usually not broken in the sense that all the expected properties are violated.

**Which hashing technique is best in data structure? ›**

**Open Hashing (Separate Chaining)**: It is the most commonly used collision hashing technique implemented using Lined List. When any two or more elements collide at the same location, these elements are chained into a single-linked list called a chain.

### Is SHA-256 secure enough? ›

SHA-256 is **one of the most secure hashing functions on the market**. The US government requires its agencies to protect certain sensitive information using SHA-256.

**What is the fastest algorithm in string matching field? ›**

1. Which of the following is the fastest algorithm in string matching field? Explanation: **Quick search algorithm** is the fastest algorithm in string matching field whereas Linear search algorithm searches for an element in an array of elements.

**Which algorithm is most effective? ›**

**Quicksort**. Quicksort is one of the most efficient sorting algorithms, and this makes of it one of the most used as well. The first thing to do is to select a pivot number, this number will separate the data, on its left are the numbers smaller than it and the greater numbers on the right.

**Which is better Rabin Karp or KMP algorithm? ›**

The most important difference between them is how reliable they are in finding a match. **KMP guarantees 100% reliability**. You cannot guarantee 100% with Rabin Karp because of a chance of collision during hash table lookup.

**Is SHA512 secure enough? ›**

The first element is the hash function. The MD5 function is now considered very insecure: it is easy to reverse with current processing power. The SHA1, SHA256, and SHA512 functions are **no longer considered secure**, either, and PBKDF2 is considered acceptable.

**What are the weakness of SHA-512? ›**

There are no known attacks against SHA-512, not against **collision resistance** and certainly not against first or second preimage resistance.

**How hard is it to crack SHA512? ›**

SHA-512 is a fast hash not well suited to storing passwords, hashcat can do 414 million SHA-512 hashes per second, so **if a password is common it will be broken in less then a second**. If you are wanting to store passwords please consider designed slow hash like PBKDF2 or bcrypt. See CrackStation for more details.

**What is the latest hashing algorithm? ›**

...

SHA-3.

General | |
---|---|

Designers | Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles van Assche. |

First published | 2016 |

Series | (SHA-0), SHA-1, SHA-2, SHA-3 |

Certification | FIPS PUB 202 |

**What hash does Instagram use? ›**

Instagram use **AES256-GCM** to encrypt the password in this with an 12 byte IV and a timestamp as AD.

**Why SHA-256 is the best algorithm? ›**

The main reason technology leaders use SHA-256 is that **it doesn't have any known vulnerabilities that make it insecure and it has not been “broken” unlike some other popular hashing algorithms**. To better understand what this means, we need to look at the history of some other popular hashing algorithms.

### How many string matching algorithms are there? ›

Exact String Matching Algorithms:

These are further classified into **four categories**: Algorithms based on character comparison: Naive Algorithm: It slides the pattern over text one by one and check for a match.

**What is the problem of string matching? ›**

A shift is valid if P occurs with shift s in T and invalid otherwise. The string-matching problem is the problem of **finding all valid shifts for a given choice of P and T**. P ≡ dada Valid shifts are two, twelve and fourteen.

**Why is Rabin Karp better than naive algorithm? ›**

Like the Naive Algorithm, the Rabin-Karp algorithm also slides the pattern one by one. But unlike the Naive algorithm, the Rabin Karp algorithm matches the hash value of the pattern with the hash value of the current substring of text, and if the hash values match then only it starts matching individual characters.

**What are the string 5 methods? ›**

**String Methods**

- length() String Method in Java.
- concat() String Method in Java.
- String Method equals() in Java.
- indexOf() String Method in Java.
- charAt() String Method in Java.
- toUpperCase() and toLowerCase() String Methods.

**How do you make a string hash? ›**

Getting the hash code of a string is simple in C#. We **use the GetHashCode() method**. A hash code is a uniquely identified numerical value. Note that strings that have the same value have the same hash code.

**What is the purpose of string method? ›**

This method is used **to convert all the characters of a string into a Character Array**. This is widely used in the String manipulation programs.

**How do you check if a string is a hash? ›**

To check if a string is a valid MD5 hash in JavaScript, we can **use a regex expression to match for the 32 consecutive hexadecimal digits which are characters from a-f and numbers from 0-9** .

**How do you find the hash value of a string? ›**

...

- s[i] represents the ith character of the string.
- ^ refers to the exponential operand.
- n represents the length of the string.

**How many hashing algorithms are there? ›**

FIPS 180-4 specifies **seven** hash algorithms: SHA-1 (Secure Hash Algorithm-1), and the. SHA-2 family of hash algorithms: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256.

**What hash is $1$? ›**

$1$ is the prefix used to identify **md5-crypt hashes**, following the Modular Crypt Format. salt is 0-8 characters drawn from the regexp range [./0-9A-Za-z] ; providing a 48-bit salt ( 5pZSV9va in the example).

### What hash is $2 $? ›

The hashes start with a $2y$ identifier, this indicates **the hash has been created using BCrypt** ( $2*$ usually indicates BCrypt).

**What hash is $2$? ›**

The prefix "$2a$" or "$2b$" (or "$2y$") in a hash string in a shadow password file indicates that hash string is a **bcrypt hash in modular crypt format**. This is an unfortunate hash format to have to 'crack'.

**Is SHA-256 always 64 characters? ›**

**Yes, a SHA256 is always 256 bits long, equivalent to 32 bytes, or 64 bytes in an hexadecimal string format**. You can even use char(64) instead of varchar(64) since the size won't change. Save this answer.

**How many characters is a 512 hash? ›**

A: A SHA512 hash value is 512 bits long. The Department represents a SHA512 hash value as a sequence of **128 hexadecimal digits**.

**How long is a 256-bit hash? ›**

It's always **64 characters**, which can be determined by running anything into one of the online SHA-256 calculators.

**How long is a 128-bit hash? ›**

MD5 hashes are 128 bits in length and generally represented by **32 hex digits**. SHA-1 hashes are 160 bits in length and generally represented by 40 hex digits.

**How much data can 8 bits hold? ›**

With 8 bits, the maximum number of values is **256 or 0 through 255**.

**Why is 8bit 256? ›**

Since computers work with binary numbers, all powers of two are important. **8bit numbers are able to represent 256 (2^8) distinct values, enough for all characters of English and quite a few extra ones**. That made the numbers 8 and 256 quite important.

**When should you not use hash? ›**

There are some operations which are not efficiently supported by hash tables, such as **iterating over all the elements whose keys are within a certain range, finding the element with the largest key or smallest key**, and so on. The O(n) complexity is on average.

**Is hashing used for performance improvement? ›**

Hashing as we know **it is used for performance improvement**, error checking, and authentication. One example of a performance improvement is the common hash table, which uses a hash function to index into the correct bucket in the hash table, followed by comparing each element in the bucket to find a match.

### Why is hashing algorithm important? ›

Hash functions are important to data security because they **allow users to verify the integrity of the files or online communications they receive**. Hashing helps you protect your files against unauthorized changes and alterations so that your users can realize that they've been changed in some way.

**Why hash is used in programming? ›**

Hashing is the process of transforming any given key or a string of characters into another value. This is usually represented by a shorter, fixed-length value or key that represents and **makes it easier to find or employ the original string**. The most popular use for hashing is the implementation of hash tables.

**What is the safest hash? ›**

Common attacks like brute force attacks can take years or even decades to crack the hash digest, so **SHA-2** is considered the most secure hash algorithm.

**What are the disadvantages of hashing? ›**

**Disadvantages of Hash:**

- Hash is inefficient when there are many collisions.
- Hash collisions are practically not be avoided for large set of possible keys.
- Hash does not allow null values.

**Which two algorithms are used for hash? ›**

There are multiple types of hashing algorithms, but the most common are **Message Digest 5 (MD5) and Secure Hashing Algorithm (SHA) 1 and 2**.

**What are the 3 types of hashing? ›**

Types of Hashing

There are many different types of hash algorithms such as RipeMD, Tiger, xxhash and more, but the most common type of hashing used for file integrity checks are **MD5, SHA-2 and CRC32**. MD5 - An MD5 hash function encodes a string of information and encodes it into a 128-bit fingerprint.

**Where hashing is used in real life? ›**

One of the most famous applications of hashing is the Rabin-Karp algorithm. This is basically a string-searching algorithm which uses hashing to find any one set of patterns in a string. A practical application of this algorithm is **detecting plagiarism**.

**What is a main advantage of hash functions? ›**

Advantages of Hashing

The main advantage of hash tables over other data structures is speed . The access time of an element is on average O(1), therefore lookup could be performed very fast. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance.

**What are 4 characteristics a strong hashing algorithm should possess? ›**

The Characteristics of Cryptographic Hash Functions

**It accepts a message of any length**. It produces a fixed-length message digest. It is easy (and therefore fast) to compute the message digest for any given message. The hash is irreversible – it is not possible to generate a message from its message digest.

**What are 5 common uses applications of hash functions? ›**

**Applications of Hashing Algorithms**

- Verifying the integrity of messages and files. An important application of secure hashes is the verification of message integrity. ...
- Signature generation and verification. ...
- Password verification. ...
- Proof-of-work. ...
- File or data identifier.

### What are the 3 main properties of hash function? ›

In particular, cryptographic hash functions exhibit these three properties: They are “**collision-free**.” This means that no two input hashes should map to the same output hash. They can be hidden. It should be difficult to guess the input value for a hash function from its output.

**How does hashing algorithm work? ›**

Hashing is one of the algorithms which **calculates a string value from a file, which is of a fixed size**. Basically, it contains blocks of data, which is transformed into a short fixed-length key or value from the original string. Usually, a summary of the information or data within that sent file.