Translated from the post, originally written in Chinese by Su, Jianlin
Translated by Norm Inui
TL;DR
 Interpret RoPE from the perspective of a βbased encoding.
 Introduce recent developments in the opensource community regarding long contexts.
 Some approaches, such as NTKaware Scale RoPE, can extend context length without finetuning.
For developers who are interested in how to extend the context length of LLMs (Large Language Models), the opensource community has continuously presented us with fascinating methods in the past few weeks. First, @kaiokendev experimented with a “positional linear interpolation” approach in his project SuperHOT.
He demonstrated that with minimal finetuning on long texts, existing LLMs can be easily adapted to handle contexts over their pretraining context length. Almost simultaneously, Meta proposed the same idea in the paper titled “Extending Context Window of Large Language Models via Positional Interpolation.”
Shortly after the paper was published, @bloc97 introduced the NTKaware Scaled RoPE, enabling LLM to extend its context length without finetuning! With all these methods, especially the NTKaware Scaled RoPE, it persuades me to revisit the idea behind RoPE from a \(\beta\)based encoding perspective.
Encoding the Number
Suppose we have an integer \(n\) within \(1,000\) (not including \(1,000\)) that we want to input into a model. What would be the best way to do this?
The most intuitive idea is to input it directly as a onedimensional vector. However, the value of this vector has a large range from \(0\) to \(999\), which is not easy to optimize for gradientbased optimizers. What if we scale it between 0 and 1? That’s not good either, because the difference between adjacent numbers changes from \(1\) to \(0.001\), making it challenging for both the model and optimizer to distinguish between the numbers. In general, gradientbased optimizers are a bit “vulnerable” and can only handle inputs that aren’t too large or too small.
To solve this problem, it is necessary to find a smart way to represent the input. We might think about how we humans do. For an integer, like \(759\), it’s a threedigit number in decimal, with each digit ranging from 0 to 9. This inspires me to represent the input in decimal directly as a vector. That is, we transform the integer \(n\) as a threedimensional vector \([a,b,c]\), where \(a\), \(b\), and \(c\) represent the hundreds, tens, and units of \(n\) respectively. By increasing the input dimension, we can both reduce the range of each digit and get rid of small resolution between numbers. Luckily, neural networks are good at handling highdimensional vectors.
If we want to further reduce the value span of each dimension, we can simply decrease the base, like using 8, 6, or even 2 as the encoding base at a cost of an increase in input vector dimensions.
Direct Extrapolation
Let’s say we have trained a model with the input ranging from \(0\) to \(999\) represented as a threedimensional vector in decimal. Then we want to enlarge the upper bound of \(n\) from \(1,000\) to \(2,000\). How can we realize this?
If we follow the same idea discussed above, the input will now be a \(4\)dimensional vector. However, the model was trained for a \(3\)dimensional vector. Therefore, the input with an extra dimension may confuse the model. Some might wonder why can’t we reserve extra dimensions in advance? Indeed, we can preallocate a few more dimensions. During the training phase with the upper bound as \(1,000\), they can be set to \(0\), but during the inference phase with an upper bound as \(2,000\), the preallocate dimension has to be set to value besides \(0\). This is what we call Extrapolation.
However, the dimensions reserved during the training phase have always been set to \(0\). If these dimensions were changed to other values during the inference phase, the performance might be significantly harmed. This is because the model is never trained to adapt to those preallocated dimension in different values.
Linear Interpolation
Considering the challenges above, some proposed interpolation instead of extrapolation to compress the value upper bound of \(2,000\) down to \(1,000\). For example, the number \(1749\) can simply compress to \(874.5\) by dividing \(2\). Thus, \(874.5\) can be converted into a threedimensional vector \([8,7,4.5]\). Following this double mapping strategy, the vector \([7,4,9]\) now corresponds to \(1498\). However, the difference between adjacent numbers was \(1\), but now it is \(0.5\), making the last dimension more “crowded”. Therefore, interpolation usually needs to finetune the model to make it adapt to the “crowded” dimensions.
You might notice that the extrapolation can also be finetuned to adapt to the preallocated dimension. Well, that is correct, but the interpolation requires far fewer steps compared to extrapolation. It is because position encoding (both absolute and relative) has taught the model to understand the relative concept rather than precisely knowing what the number it is, that is: the model knows \(875\) is greater than \(874\), but it doesn’t know what is \(875\). Given the generalized ability of LLM, injecting the concept that \(874.5\) is greater than \(874\) is not particularly challenging.
However, this interpolation approach is not without flaws. The broader the range we want to expand, the smaller the unit dimension resolution becomes, while the hundreds and tens dimensions still remain at a resolution of 1. This means that the interpolation implicitly introduces resolution inconsistency across dimensions. Each dimension is not equally interpolated, making finetuning/continuing learning challenging.
Base Conversion
Is there a method that doesn’t require adding extra dimensions and can still maintain resolution across dimensions? The answer is YES.
It is a solution that we are all familiar with, base conversion. A threedigit decimal number can represent \(0\) to \(999\). What if it’s in hexadecimal? Its maximum value in 16base can be \(16^31=4095 > 1999\). So, by simply converting to hexadecimal, with a base of 16, such as turning \(1749\) into \([6,13,5]\), a threedimensional vector can represent a larger range. The cost? Each dimension’s value changes from a range of \(09\) to \(015\).
It’s indeed a clever idea. As mentioned earlier, what we care about is relative position information in the sequence since a trained model has already learned \(875 > 874\). This holds true in hexadecimal as well. The model doesn’t care what base you use to represent the input, but only the relative information. Now, the only problem would be if the value of each dimension exceeds \(9\) (values between \(1015\)), will the model still know how to compare accurately? Luckily, most LLMs have such capability. Based on this hypothesis, we can now extend the range without finetuning! Furthermore, to make interpolation more robust, we can use a smaller base like \(\lceil\sqrt[3]{2000}\rceil = 13\) instead of \(16\) to limit the value range of each dimension.
This idea of base conversion finally leads us to the NTKaware scaled RoPE that was mentioned at the beginning.
Positional Encoding
Based on the explanation above, we can claim:
The Rotational Positional Encoding (RoPE) at position \(n\) is the \(\beta\)based encoding of the number \(n\)
This might surprise you at first glance, however, it does hold true.
Proof:
Suppose we have a decimal number \(n\). To calculate the digit at position \(m\) (counting from right to left) in its βbased encoding, we have:
\[\begin{equation}\lfloor\dfrac{n}{\beta^{m1}}\rfloor \mod \beta \end{equation}\]As for RoPE, which is adapted from Sinusoidal Position Embedding
\[\begin{equation}[\text{cos}(\dfrac{n}{\beta^0}), \text{sin}(\dfrac{n}{\beta^0}), \text{cos}(\dfrac{n}{\beta^1}), \text{sin}(\dfrac{n}{\beta^1}), …, \text{cos}(\dfrac{n}{\beta^{d/21}}), \text{sin}(\dfrac{n}{\beta^{d/21}})]\end{equation}\]where \(\beta = 10000^{2/d}\)
we can notice that:
1) eq1 and eq2 share the same component \(\frac n {\beta^{m1}}\);
2) \(\text{mod}\) introduces periodicity, while \(\text{sin}\) and \(\text{cos}\) are also periodical functions.
Therefore, if we ignore the ceiling operation, we can say RoPE (or Sinusoidal Position Embedding) is a kind of βbased encoding.
With this property, we can now apply extrapolation on \(n\) by simply replacing \(n\) as \(n/k\), \(k\) is the scale we want to enlarge. This is the Positional Interpolation proposed in Meta’s paper, and the experimental results show that extrapolation indeed requires more finetuning steps than interpolation.
Regarding numeral base conversion, the objective is to expand the representation range by \(k\). Therefore, the βbase should be converted to at least \(β(k^{2/d})\) (according to eq2, \(\text{cos}\) and \(\text{sin}\) appear in pairs. This can be regarded as a βbase representation with \(d/2\) bits, not \(d\)). Alternatively, the original base \(10000\) can be replaced with \(10000k\), which is the NTKaware Scaled RoPE. As discussed earlier, since positional embedding has taught the model the sequence relative information, NTKaware Scaled RoPE can achieve good performance in longer contexts without finetuning.
Let’s dig further
You might wonder why we call it NTK (Neural Tangent Kernel). In fact, it is the academic background of @bloc97 that makes him use this term to name it.
In “Fourier Features Let Networks Learn HighFrequency Functions in LowDimensional Domains”, authors use NTK methods to demonstrate that neural networks cannot learn highfrequency signals efficiently. Instead, their solution is to transform it into Fourier features, which share the same idea with the Sinusoidal position encoding we mention in eq1.
Thus, based on the findings from this NTK paper, @bloc97 proposed the NTKaware Scaled RoPE. I ask him about how he derived it. Surprisingly, his derivation is quite straightforward. The main idea is to combine extrapolation with interpolation:
extrapolation in highfrequency and interpolation in lowfrequency
According to eq2, the lowest frequency in each element of the position features is \(\dfrac{n}{\beta^{d/21}}\) Here we introduce a factor \(\lambda\) in base, now we have: \(\dfrac{n}{(\beta\lambda)^{d/21}}\)
We expect that scaling the rotation base \(\beta\) can work as interpolation, therefore
\[\begin{equation}\dfrac{n}{(\beta\lambda)^{d/21}} = \dfrac{n/k}{\beta^{d/21}}\end{equation}\]We can solve from eq3:
\[\lambda = k^{2/(d2)}\]The same idea for the highest frequency in the RoPE feature:
\(\dfrac{n}{\beta}\) now becomes \(\dfrac{n}{\lambda\beta}\).
Let’s insert the value \(\lambda\) we solve from eq3, which allows a low frequency represented as interpolation, into \(\dfrac{n}{\lambda\beta}\). Since \(d\) is very large ( 64 for BERT, 128 for LLAMA1), \(\lambda \to 1\), thus, we can conclude from eq3:
\[\begin{equation}\dfrac{n}{\beta\lambda}\simeq \dfrac{n}{\beta}\end{equation}\]You can see the frequency remains relatively stable w.r.t \(\lambda\), indicating that the corresponding dimension doesn’t become too crowded. Therefore, to represent a larger number, a highfrequency dimension is more likely to extrapolate by using an additional dimension rather than expanding the value range one dimension can hold. This is what we call: extrapolation in highfrequency.
From the derivation, we can see that NTKaware Scaled RoPE cleverly combines interpolation and extrapolation together. Besides scaling the base, I believe any transformations on the frequencies will be also effective as long as it ensures the extrapolation in high frequencies and interpolation in lowfrequencies.
from translator: We can actually inteprete the extrapolation in highfrequency and interpolation in lowfrequency by considering from a wavelength perspective in RoPE. Specifically, the wavelength in RoPE is used to define the length of the token sequence required for the encoding at dimension \(d\) to complete a full rotation, \(2\pi\). The higher the frequency, the smaller the wavelength and vice verse. Therefore, a longer wavelength can hold more interpolated tokens, while a shorter one cannot. That is why we employ interpolation in lowfrequency.
please refer to YaRN: Efficient Context Window Extension of Large Language Models for more details
Experiments
from translator: the table shows: the average accuracy of predicting next token to match the groundtruth next token given previous context. The experiment is based on a hybrid TransformerGAU (Gated Attention Unit) model with a size of 100M parameters.
For more details on the GAU, please refer to: https://arxiv.org/abs/2202.10447
When \(k=8\)
context length  512(trained)  4096 (repeated text)  4096 (nonrepeated text) 

Baseline
Baseline\(\log n\) 
49.41%
49.40% 
24.17%
24.60% 
23.16%
24.02% 
PIRoPE
PIRoPE\(\log n\) 
49.41%
49.40% 
15.04%
14.99% 
13.54%
16.51% 
NTKRoPE
NTKRoPE\(\log n\) 
49.41%
49.40% 
51.28%
61.71% 
39.27%
43.75% 
No finetuning is applied on all tests. Baseline: use extrapolation; PI（Positional Interpolation): replaces extrapolation in Baseline with interpolation; NTKRoPE: replace extrapolation in Baseline with NTKaware Scaled RoPE; \(\log n\): apply a scale to optimize selfattention for long context ref_1
Conclusion

Direct extrapolation doesn’t work effectively on extension.

Interpolation yields poor results without finetuning.

NTKRoPE achieves promising (though slightly reduced) results in extended context even without finetuning.

A \(\log n\) factor indeed optimize selfattention for long context.

What’s even more encouraging is that NTKRoPE performs significantly better in ‘repeated’ extrapolation compared to ‘nonrepeated’ one, suggesting that LLM with NTKRoPE still retain the global attention ability across the expanded context, rather than confining its attention to a limited scope.
In just a few weeks, the opensource community concerning long contexts totally blows our minds. OpenClosedAI, you better watch out.
Future Research
please check part2