Experimental library for a simple text similarity measure.
The CRI distance is a measure for superficial text similarity. "CRI" stands for character repetition intervals. The idea behind it is to gauge a text by capturing the numbers of symbols it takes for each character to repeat.
The normalized CRI distance of two texts is a value between 0 and 1, where 0 means the texts are identical and 0 means they have nothing in common at all.
Given an alphabet consisting of the letters "a" and "b". Let's index them with 0 and 1. Let's define an "interval" simply as the difference between the character indexes in the string. The length of the first interval is the position of the respective character. Because it takes that many symbols until the character "repeats" for the first time.
So much for the interval.
Now, let's look at the string "abba". It takes 0 positions to increment for the "a" to appear for the first time and one for the "b". From the first "b" to the second "b", the position increments by one, so the interval is one. From the first "a" to the next, the position increments by 3.
Now, let's put this in a matrix where the first dimension is the alphabet indexes and the second are the frequencies (counts) of the respective repetition interval (CRI) lengths:
String "abba":
Position | 0 | 1 | 2 | 3 |
Character | a | b | b | a |
CRI counts of "abba":
⬐ Symbol / CRI length → | 0 | 1 | 2 | 3 |
0 ("a") | 1 | 0 | 0 | 1 |
1 ("b") | 0 | 2 | 0 | 0 |
Let's compare this with the matrix we get for the string "baba":
String "baba":
0 | 1 | 2 | 3 |
b | a | b | a |
CRI counts of "baba":
⬐ Symbol / CRI length → | 0 | 1 | 2 | 3 |
0 ("a") | 0 | 1 | 1 | 0 |
1 ("b") | 1 | 0 | 1 | 0 |
Now, let's calculate a naïve distance between those two matrixes simply by adding up the absolute differences between the individual cell values.
The distance between "abba" and "baba" is 8.
Let's compare the strings "abbababa" and "babaabba" in the same way. The two strings have been concatenated and concatenated in reverse order. The distance is still 8.
The algorithm honors identical subsequences that can appear in different orders and in different locations.
Finally, it would be nice to have a normalized value which is easier to interpret than the raw distance. We divide the distance by the sum of the two text's lengths and we get a result between 0 and 1, 0 meaning identical and 1 meaning nothing in common at all.
Let si be a unique symbol where 0 ≤ i ≤ n .
Let A be an alphabet of n symbols:
A = {s0, ... sn}
Let S be a sequence of length L of symbols from A.
Let j,k (0≤j<;k, k<L) be the positions of two occurrences of si in S where no other occurrence of si intervenes.
Then a CRI interval is just the range of positions from j to k, not including k:
CRIi,j,k = [j..k)
For the first occurrence of si we define j=0 and set k to the position of the first occurrence of si (j≤k).
The length lk,j of the CRI is k - j.
The CRI count ci,l is the count of all repetition intervals of si of length l.
Let M1i⨯l be a matrix of CRI counts c of S1 by symbol index i and CRI length l.
Let M2i⨯l be a matrix of CRI counts c of S2 by symbol index i and CRI length l.
Let Mdiff be the subtraction of the two matrixes: Mdiff = M1 - M2
Then the absolute CRI distance D is just the sum over the absolute values of d of Mdiff: D = ∑|di,l|
The normalized CRI distance is D/(L1+L2).
I have no idea whether this algorithm has any academic or whatsoever discourse. Didn't find it among the top 10 on Google. (Or googled the wrong buzzwords.) And didn't bother to do any research on the field. (Actually, I was just thinking about a pragmatic solution for an imminent task.) Giving the algorithm is so trivial, I'm sure it already exists. Or otherwise it's just crap. I'm implementing it anyway as a nice playground to learn Kotlin.
So, forgive me and let me know if I (unintentionally) plagiarized.
🙈 And condone the amateurish Kotlin...
I have a weakness for the builder pattern, so the API looks like this:
val nextText = NextText.Builder()
.withMinCodePoint(0)
.withMaxCodePoint(127)
.build()
val criDistance = nextText.criDistance("text 1", "text 2")
println("The normalized CRI distance is $criDistance.")