In this session, we introduce learning to rank (LTR), a machine learning sub-field applicable to a variety of real world problems that are related to ranking prediction or candidate recommendation. We will walk through the evolution of LTR research in the past two decades, illustrate the very basic concept behind the theory. In the end we will also live-demo how we can quickly build a proof-of-concept LTR model using real world data and two of the state-of-the-art machine learning open source libraries.

- Introduction to Learning-to-Rank (LTR)
- What is LTR and what's the difference between it and other ML models?
- The classical problem (And also the non-classical ones)
- Different types of LTR modeling approach

- How to Evaluate a Ranking Model?
- The Evolution of mainstream LTR
- RankNet -> LambdaNet -> LambdaMART -> LambdaLoss

- Demo with the go-to open source libraries
- LambdaMART with
`lightgbm`

(Gradient Boosting Trees) - Listwise LTR with
`tensorflow`

(Deep Neural Nets)

- LambdaMART with

Learning to rank refers to machine learning techniques for training a model to solve a ranking task. Usually it is a supervised task and sometimes semi-supervised.

## Regression vs Classification vs LTR
### Regression¶

### Classification¶

### Learning to Rank¶

They are all supervised learning. But the target variables differ. The learning algorithm will differ in how we mathematically formulate the learning objective.

We try to learn a function $f(x)$ given feature $x$ to predict a real-valued $y \in \mathbb{R}$.

**Example:** *I want to predict the average CASA balance in the next 7 days for an account.*

We try to learn a function $f(x)$ given feature $x$ to predict a set of discrete integer label $y \in \{1, 2, ..., N\}$ with $N$-classes.

**Example:** *I want to predict if a given FD (Fixed Deposit) will be renewed at its current expire.*

We try to learn a function $f(q, D)$, given a query $q$ and a relevant list of items $D$, to predict the order (ranking) of all items within list.

**Example:** *See the next slide.*

*Given a search query, rank the relevance of the resulting matched document URLs, such that more relevant document should be presented first to the user.*

More formally, we depict the above problem as the following task:

Given a query $q$, and the resulting $n$ documents $D = {d_1, d_2, ..., d_n}$, we'd like to learn a function $f$ such that $f(q, D)$ will predict the relevance of any given document associated with a query. Ideally, $f(q, D)$ should return an ordered list of documents $D^*$, ranked from the most to least relevant to the given query $q$.

Popular web search datasets for large-scale LTR benchmark:

LTR is a general approach for solving ranking task. Here are some examples other than just web search ranking. Note that not all of them are obviously a ranking task in the first glance.

**Recommender system**(Solve personal product perference ranking)**Stock portfolio selection**(Solving equity return ranking)**Message auto reply**(Solving best-candidate ranking in email/message reply recommendation)**Image to text**(Solving best-candidate contextual feature)

- Pointwise
- Pairwise
- Listwise

They are distinguished by how we formulate the **loss function** in the underlying machine learning task.

**The Ranking Task**

Given a query $q$, and the resulting $n$ document $D = {d_1, d_2, ..., d_n}$, we'd like to learn a function $f$ such that $f(q, D)$ will predict the relevance of any given document associated with a query.

In pointwise approach, the above ranking task is re-formulated as a regression (or classification) task. The function to be learned $f(q, D)$ is simplied as $f(q, d_i)$. That is, the relevance of each document given a query is scored independently. In recent literature $f(q, d_i)$ is called pointwise scoring function, while $f(q, D)$ is refered to as groupwise scoring function.

If we have two queries associated with 2 and 3 resulting matching documents, respectively:

$ \begin{align} q_1 & \rightarrow d_1, d_2 \\ q_2 & \rightarrow d_3, d_4, d_5 \end{align} $

Then the training examples $x_i$ in a pointwise framework will decouple them into every query-document pair:

$ x_1: q_1, d_1 \\ x_2: q_1, d_2 \\ x_3: q_2, d_3 \\ x_4: q_2, d_4 \\ x_5: q_2, d_5 $

Since each document is indeed scored independently with the absolute relevance as the target label (could be real-valued in order or simply binary),
**the task is entirely no difference than a traditional regression or classification task.**
Any such machine learning algorithm can be applied to pointwise solution.

- Pros:
- Simplicity. Existing ML models are ready to apply.

- Cons:
- The result is usually sub-optimal due to not utilizing the full information in the entire list of matching documents for each query.
- Explicit pointwise labels are required to constitute the training dataset.

**The Ranking Task**

Given a query $q$, and the resulting $n$ document $D = {d_1, d_2, ..., d_n}$, we'd like to learn a function $f$ such that $f(q, D)$ will predict the relevance of any given document associated with a query.

In pairwise approach, we are still trying to learn the *pointwise* scoring function $f(q, d_i)$,
however, our training examples are now consructed by pairs of documents within the same query:

$ x_1: q_1, (d_1, d_2) \\ x_2: q_2, (d_3, d_4) \\ x_3: q_2, (d_3, d_5) \\ x_4: q_2, (d_4, d_5) $

Given such setup, a new set of pairwise BINARY labels can be derived, by simply comapring the individual relevance score in each pair.
For example, given the first query $q_1$, if $y_1 = 0$ (totally irrelevant) for $d_1$ and $y_2 = 3$ (highly relevant) for $d_2$, then we have a new label $y_1 < y_2$ for the document pair $(d_1, d_2)$.
**Now the problem has become a binary classification learning task.**

In order to learn the still-pointwise function $f(q, d_i)$ in a pairwise manner, we model the score difference probablistically:

$$ Pr(i \succ j) \equiv \frac{1}{1 + exp^{-(s_i - s_j)}} $$

In plain words, if document $i$ is better matched than document $j$ (which we denote as $i \succ j$), then the probability of the scoring function to have scored $f(q, d_i) = s_i$ higher than $f(q, d_j) = s_j$ should be close to 1. Put it differnetly, the model is trying to learn, given a query, how to score a pair of document such that a more relevant document should be scored higher.

- Pros:
- The model is learning how to rank directly, even though only in a pairwise manner, but in theory it can approximate the performance of a general ranking task given N document in a matched list.
- We don't need explicit pointwise labels. Only pairwise preferences are required. This is an advantage because sometimes we are only able to infer the pairwise preference from collected user behavior.

- Cons:
- Scoring function itself is still pointwise, meaning that relative information in the feature space among different documents given the same query is still not fully exploited.

**The Ranking Task**

Given a query $q$, and the resulting $n$ document $D = {d_1, d_2, ..., d_n}$, we'd like to learn a function $f$ such that $f(q, D)$ will predict the relevance of any given document associated with a query.

The first ever proposed listwise approach is ListNet. Here we explain how it approach the ranking task.

ListNet is based on the concept of permutation probability given a ranking list.
Again we assume there is a *pointwise* scoring function $f(q, d_i)$ used to score and hence rank a given list of items.
But instead of modeling the probability of a pairwise comparison using scoring difference,
now we'd like to model the probability of the entire ranking results.

$ x_1: q_1, (d_1, d_2) \\ x_2: q_2, (d_3, d_4, d_5) $

Let's denote $\pi$ as a particular permutation of a given list of length $n$, $\phi(s_i) = f(q, d_i)$ as any increasing function of scoring $s_i$ given query $q$ and document $i$. The probability of having a permutation $\pi$ can be written as:

$$ Pr(\pi) = \prod_{i=1}^n \frac{\phi(s_i)}{\sum_{k=i}^n\phi(s_k)} $$

To illustrate, given a list of 3 items, the probability of returning the permutation ${s_1, s_2, s_3}$ is calculated as: $Pr(\pi = \{s_1, s_2, s_3\}) = \frac{\phi(s_1)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_3)}{\phi(s_3)}$.

Due to computational complexity, ListNet simplies the problem by looking at only the top-one probability of a given item. The top-one probability of object $i$ equals the sum of the permutation probabilities of permutations in which object $i$ is ranked on the top. Indeed, the top-one probability of object $i$ can be written as:

$$ Pr(i) = \frac{\phi(s_i)}{\sum_{k=1}^n \phi(s_k)} $$

Now given any two ranking list represented by top-one probabilities, we are able to measure their difference using cross entropy. Then we can build a machine learning algorithm that minimize that cross entropy.

For the choice of function $\phi(\cdot)$, it can be as simple as just an exponential function. Indeed, when $\phi(\cdot)$ is expotential and list length is two, the solution will reduce to a pairwise model we just depicted in the previos section.

- Pros:
- Theoretically sound solution to approach a ranking task.

- Cons:
- Costly to compute in its theoretical form and hence several approximations are used in practice. (For example the use of top-one probability.)
- Scoring function is still pointwise, which could be sub-optimal.

How to we evaluate a result of ranking prediction?

Several metrics have been proposed and commonly used in the evaluation of a ranking model:

- Binary Relevance
- Mean Average Precision (MAP)
- Mean Reciprocal Rank (MRR)

- Graded Relevance
- Normalized Discounted Cumulative Gain (NDCG)
- Expected Reciprocal Rank (ERR)

In general, binary measures only consider relevant v.s. irrelevant, while graded measures will also consider the ranking among relevant items. The degree of relevancy matters in this case when scoring a ranking list.

MAP is a measure based on binary label of relevancy.
The calculation of MAP is indeed NOT that straightforward.
First we need to define *precision at k given a query* $P@k(q)$ as:

$$ P@k(q) \equiv \frac{\sum_{i=1}^k r_i}{k} $$

for an ordered list of prediction $r_i$ for all $k$ items. $r_i = 1$ if it is relevant and $0$ otherwise.

Then we define the *average precision given a query* $AP(q)$ at $k$ items as:

$$ AP(q)@k \equiv \frac{1}{\sum_{i=1}^k r_i} \sum_{i=1}^k P@i(q) \times r_i $$

Mean Average Precision is just the mean of $AP(q)$ for all queries:

$$ MAP \equiv \frac{\sum_{q=1}^Q AP(q)}{Q} $$

Note that MAP is order-sensitive due to the introduction of the term $r_i$ in the calculation of AP. Intuitively, it is doing the average of precision at each ranking position, but penalizing the precision at positions with irrelevant item by strcitly setting them to zeroes.

**Example:**

$ \begin{align} q_1 & \rightarrow d_1, d_2 \\ q_2 & \rightarrow d_3, d_4, d_5 \end{align} $

Assuming only $d_2, d_3, d_5$ are relevant document given their corresponding query.

AP of query 1: $\frac{1}{1} \times (\frac{0}{1} \times 0 + \frac{1}{2} \times 1) = \frac{1}{2}$

AP of query 2: $\frac{1}{2} \times (\frac{1}{1} \times 1 + \frac{1}{2} \times 0 + \frac{2}{3} \times 1) = \frac{5}{6}$

MAP: $\frac{1}{2} \times (\frac{1}{2} + \frac{5}{6}) \approx 67\%$

**Caveat:**

MAP is order-sensitive, but only in a binary context: relevant items should come first than irrelevant ones. It does not take into account the optimal ranking among only relevant items. In the above example, even if $d_5$ is prefered than $d_3$ (and both are relevant), average precision of the query is the same for $d_3, d_4, d_5$ and $d_5, d_4, d_3$. But ideally the latter should be scored higher.

RR focuses on the first correctly predicted relevant item in a list. Given a ranking list, assume $r_i$ is the rank of the highest ranking relevant item. Say, if the the 2nd item is the first relevant item in the list, RR is $\frac{1}{2}$ for this query.

By definition, each query will have a reciprocal rank. MRR is simply the average of RR for all queries:

$$ MRR \equiv \frac{1}{Q} \sum_{i=1}^Q\frac{1}{r_i} $$

The underlying rationale is the empirical finding in the web search problem that *the likelihood a user examines the document at rank i is dependent on how satisfied the user was with previously observed documents in the list*.
ERR hence tries to quantify the usefulness of a document at rank $i$ conditioned on the degree of relevance of documents at rank less than $i$.

Assume the probability of a user finding the result is satisfied at position $i$ in a given list is denoted as $R_i$. The likelihood of a session for which the user is satisfied and stops at position $r$ is: $\prod_{i=1}^{r-1}(1 - R_i)R_r$.

Now we model $R_i$ such that it is an increasing function of relevance:

$$ R = R(g) \equiv \frac{2^g - 1}{2^{g_{max}}} $$

where $g$ is the labeled (graded) relevance such that $g \in \{0, 1, ..., g_{max}\}$. $g = 0$ suggests irrelevant and $g = g_{max}$ perfectly relevant.

Finally, ERR is defined as:

$$ ERR \equiv \sum_{r=1}^n\frac{1}{r}R_r\prod_{i=1}^{r-1}(1-R_i) $$

Here $\frac{1}{r}$ can be considered as a utility function $\tau(r)$ that satisfies $\tau(1) = 1$ and $\tau(r) \rightarrow 0$ as $r \rightarrow \infty$.

Note that ERR is a measure on a list with a single query, so the corresponding de-generated measure is RR instead of MRR. To evaluate on results from multiple queries, we will need to further average ERRs among queries.

**Example:**

$ \begin{align} q_1 & \rightarrow d_1, d_2 \\ q_2 & \rightarrow d_3, d_4, d_5 \end{align} $

Assuming only $d_2, d_3, d_5$ are relevant document given their corresponding query.

MRR: $(\frac{1}{2} + \frac{1}{1}) \times \frac{1}{2} = \frac{3}{4}$

ERR of $q_1$: $0 + \frac{1}{2} \times \frac{2^1 - 1}{2^1} \times (1 - \frac{2^0 - 1}{2^1}) = \frac{1}{4}$

ERR of $q_2$: $\frac{1}{1} \times \frac{2^1 - 1}{2^1} + 0 + \frac{1}{3} \times \frac{2^1 - 1}{2^1} \times (1 - \frac{2^0 - 1}{2^1}) \times (1 - \frac{2^1 - 1}{2^1}) = \frac{7}{12}$

**Caveat:**

MRR is a binary measure, while ERR is a graded measure. Also MRR and ERR is NOT directly comparable between each other. Both being graded measure, ERR is less popular than NDCG in empirical works due to its computational complexity.

First we define Discounted Cumulative Gain at position $k$ as:

$$ DCG@k \equiv \sum_{i=1}^k\frac{2^{l_i} - 1}{log_2(i + 1)} $$

where $l_i$ is the grading of relevance at rank $i$.

To intuitively understand the metric,
the numerator is simply an increasing function of relevance, the more relevant the higher.
This is the *gain* from each item.
The denominator is a decreasing function of ranking position, this is the *discounted* component of the metric.
Put together, higher relevance gains more points, but the lower it is ranked the higher also the discount.
In the end the metric will prefer higher relevant item to be ranked higher, which is exactly the desired ranking property we'd like to pursue.

Normalized DCG is then defined as:

$$ NDCG@k = \frac{DCG@k}{IDCG@k} $$

where $IDCG@k$ is the *Ideal* DCG@k given the result. It is the DCG@k calculated by re-sorting the given list by its true relevance labels.
That is, IDCG@k is the maximum possible DCG@K value one can get given a ranking list.

**Example:**

$ \begin{align} q_1 & \rightarrow d_1, d_2 \\ q_2 & \rightarrow d_3, d_4, d_5 \end{align} $

Assuming only $d_2, d_3, d_5$ are relevant document given their corresponding query.

NDCG of $q_1$: $\frac{0 + \frac{2^1 - 1}{log_23}}{\frac{2^1 - 1}{log_22} + 0} = \frac{1}{log_23} \approx 0.631$

NDCG of $q_2$: $\frac{\frac{2^1 - 1}{log_22} + 0 + \frac{2^1 - 1}{log_24}}{\frac{2^1 - 1}{log_22} + \frac{2^1 - 1}{log_23} + 0} = \frac{1.5}{1 + \frac{1}{log_23}} \approx 0.92$

Broadly speaking there are two approaches to label a ranking dataset:

- Human judgement
- Derivation from user behavior log

For the 1st approach, massive manpower is required to label the relevance of each item given a query. In real world lots of dataset cannnot be labeled in such way, so we rely on the 2nd approach which indirectly infer user preference among different items.

Usually pairwise preference can be infered from user interaction with the query result. For example, use click data to infer web search relevance. This is also why pairwise approach in LTR can gain much more attention than the pointwise method: due to data availability.

In the literature of LTR, a set of very important theoretical and also empirical works were done by Chris Burges from MicroSoft Research, who have established the very foundation of the pairwise approach in LTR.

Those important pairwise LTR models include:

- RankNet (2005)
- LambdaNet (2006)
- LambdaMART (2007); high quality implementation available in the library
`lightgbm`

Recently, researchers from Google generalize the LambdaMART framework to provide a theoretical background of the ranking model of all 3 types of loss function (pointwise, pairwise, listwise) and the direct optimization of all the popular ranking metrics (NDCG, MAP, ...). The framework is called LambdaLoss (2018).

A production-ready implementation of such framework is also open-sourced as a ranking module under the popular library `TensorFlow`

.
A *groupwise* scoring function is also proposed and can be implemented in the library.

The reason why we choose specifically to elaborate the above mentioned models is because they are the very foundation of LTR literature, cited more than a thousand times.

And the reason why the libraries are chosen is basically the same: they are the state-of-the-art popular open source go-to frameworks in this field.

ML is nothing more than *mathematical optimization*.
But in plain words, what is ML?

Here is a metaphor of ML:

- There is a function $f(\cdot)$, or an estimator, also called a
**learner** - The job of a learner is to
*learn*, and then to*guess*the correct anwser to a given question - Learner is characterized by model parameters $W$; value of $W$ shapes learner's opinion about the answer
- Here is how learner guess:
- Setup a value for $W$
- Give out a guess based on that $W$
- Calculate how far the guess is from the truth, this is measured by
**loss**

- When loss is high (meaning the guess is not even close), learner repeatedly tries the above steps until loss cannot be further reduced
- Learner stops at $W^*$ which can give out the best guess

ML is a science of how we should efficiently perform the above exercise.

Real-life analogy? **Number Guessing**.

**Interviewer:** What's your biggest strength?

**Me:** I'm an machine learning expert.

**Interviewer:** What's 9 + 10?

**Me:** It's 3.

**Interviewer:** Not even close. It's 19.

**Me:** It's 16.

**Interviewer:** It's still 19.

**Me:** It's 18.

**Interviewer:** No. It's 19.

**Me:** It's 19.

**Interviewer:** *You're hired.*

Remember that we model the score difference between a given pair $(i, j)$ as a probability based on the sigmoid function:

$$ Pr(i \succ j) = P_{ij} \equiv \frac{1}{1 + exp^{-(s_i - s_j)}} $$

where

$$ s_i = f(q, d_i) $$

is the pointwise score output by our underlying learner $f(q, d)$, which in RankNet is formulated as a *2-layer neural network* parameterized by a set of $w_k$.
(Or even think simplier, let $f(q, d_i) = wx_i$ as a linear learner.)

Given a probability distribution $p$, the entropy is defined as: $p \cdot log_2\frac{1}{p}$.
Now let $y_{ij} \in \{0, 1\}$ be the actual label of the given pair $(i, j)$,
The loss function of the above setup will be the *cross entropy*:

$$ loss = -\sum_{i \neq j}{y_{ij}log_2P_{ij} + (1-y_{ij})log_2(1-P_{ij})} $$

The cross entropy measures how close two probability distribution are to each other. So naturally it is a good objective function for a machine learning model that models probability to optimize. Using backprop techinque we can numerically find the model weights in $f(q, d)$ that minimize the cross entropy loss.

Note that the above loss is very general: it is just the expected log-loss, or the sum of cross entropy from each training record, used to measure how good the model distribution is approximating the empirical distribution of the traing data (which in turn serves as an approximation to the unknown true distribution generating the training data). We can easily swap the neural network with other learners, resulting in a variety of different pairwise LTR models.

Two important enhancements have been achieved from RankNet to LambdaNet.

- Training speed-up thanks to factorization of gradient calculation
- Optimization towards a ranking metric

For the first point, LambdaNet is a **mathematically improved version of RankNet**.
The improvement is based on a factorization of the calculation of gradient of the cross entropy loss, under its pairwise update context.

Given the point cross entropy loss as $L$:

$$ L = y_{ij}log_2P_{ij} + (1-y_{ij})log_2(1-P_{ij}) $$

The gradient (the 1st-order derivative of the loss w.r.t. a model parameter $w_k$) can be written as:

$$ \frac{\partial L}{\partial w_k} = \frac{\partial L}{\partial s_i} \frac{\partial s_i}{\partial w_k} + \frac{\partial L}{\partial s_j} \frac{\partial s_j}{\partial w_k} $$

In plain words, the impact of a change in model parameter $w_k$ will go through the resulting changes in the model scores and then the changes in loss. Now rewrite the gradient in total losses for all training pairs $\{i, j\}$ that satisfied $i \succ j$:

$$ \begin{align} \frac{\partial L_T}{\partial w_k} &= \sum_{\{i, j\}} \bigg[ \frac{\partial L}{\partial s_i} \frac{\partial s_i}{\partial w_k} + \frac{\partial L}{\partial s_j} \frac{\partial s_j}{\partial w_k} \bigg] \\ &= \sum_i \frac{\partial s_i}{\partial w_k} \bigg( \sum_{\forall j \prec i} \frac{\partial L(s_i, s_j)}{\partial s_i} \bigg) + \sum_j \frac{\partial s_j}{\partial w_k} \bigg( \sum_{\forall i \succ j} \frac{\partial L(s_i, s_j)}{\partial s_j} \bigg) \end{align} $$

with the fact that:

$$ \frac{\partial L(s_i, s_j)}{\partial s_i} = - \frac{\partial L(s_i, s_j)}{\partial s_j} = log_2e\big[(1 - y_{ij}) - \frac{1}{1 + e^{s_i - s_j}}\big], $$

and a re-indexing of the second-term, we end up with:

$$ \begin{align} \frac{\partial L_T}{\partial w_k} &= \sum_i \frac{\partial s_i}{\partial w_k} \bigg[ \sum_{\forall j \prec i} \frac{\partial L(s_i, s_j)}{\partial s_i} + \sum_{\forall j \prec i} \frac{\partial L(s_j, s_i)}{\partial s_i} \bigg] \\ &= \sum_i \frac{\partial s_i}{\partial w_k} \bigg[ \sum_{\forall j \prec i} \frac{\partial L(s_i, s_j)}{\partial s_i} - \sum_{\forall j \succ i} \frac{\partial L(s_j, s_i)}{\partial s_j} \bigg] \\ &= \sum_i \frac{\partial s_i}{\partial w_k} \lambda_i \end{align} $$

The intuition behind the above gradient:

For each document in a given query, there is a gradient component we denoted as lambda, which is calculated by considering all the superior and inferior documents comparing to it. A relatively worse document will push the current document up, and a relatively better one will push it down.

The implication of the above factorization is that during the learning process, instead of doing update by each pair of documents, we can update on a per-query basis. And since lambda is by far cheaper to calculate, the entire training process can speed up considerably.

Since we model the score difference of a pair of documents in a query as a probability measure, the model is optimizing the pairwise correctness of ranking, which may not be the ultimately desirable objective.

Remember that the ranking objective is indeed measured by (ideally) a position-sensitive graded measure such as NDCG. But in the above setup NDCG is not directly linked to the minimization of cross entropy. A straightforward and also simple solution is to use NDCG as an early stop criteria and determine by using a validation dataset.

LambdaRank proposes yet another solution. The researcher found that during the gradient update using the lambda notion, for each pair instead of calculating just the lambda, we can adjusted lambda by the change in NDCG for that pair provided that the position of the two item swaped with each other.

The lambda of a given document is:

$$ \begin{align} \lambda_i &= \bigg[ \sum_{\forall j \prec i} \frac{\partial L(s_i, s_j)}{\partial s_i} - \sum_{\forall j \succ i} \frac{\partial L(s_j, s_i)}{\partial s_j} \bigg] \\ &= \bigg[ \sum_{\forall j \prec i} \lambda_{ij} - \sum_{\forall j \succ i} \lambda_{ij} \bigg] \end{align} $$

The proposed method is to adjust the pairwise lambda $\lambda_{ij}$ such that:

$$ \lambda_{ij} \equiv \frac{\partial L(s_i, s_j)}{\partial s_i} \cdot |\Delta NDCG_{ij}| $$

where $\Delta NDCG_{ij}$ is the change in NDCG when the position of $i$ and $j$ are swapped.

The researcher found that by such adjustment, without theoretical proof, the model is empirically optimizing NDCG, and hence yield better overall results.

LambdaMART is simply a LambdaNet but replaces the underlying neural network model with **gradient boosting regression trees** (or more general, gradient boosting machines, GBM).
GBM is proven to be very robust and performant in handling real world problem.

The model wins several real-world large-scale LTR contests.

In the original LambdaRank and LambdaMART framework, no theoretical work has been done to mathematically prove that ranking metric is being optimized after the adjustment of the lambda calculation. The finding is purely based on empirical works, i.e., by observing the results from varying dataset and simulation with experiments.

Researchers from Google recently (2018) published a generalized framework called LambdaLoss, which serves as an extension of the original ranking model and comes with a thorough theoretical groundwork to justify that the model is indeed optimizing a ranking metric.

In [1]:

```
# Example of one training instance in the Yahoo LTR dataset.
!head -n1 data/ltrc_yahoo/set1.train.txt
```

In [2]:

```
import lightgbm as lgb
# Note that we have convert the original raw data into a pure libsvm format.
# For more details, pls refer to: https://github.com/guolinke/boosting_tree_benchmarks/tree/master/data
infile_train = "data/ltrc_yahoo/yahoo.train"
infile_valid = "data/ltrc_yahoo/yahoo.test"
train_data = lgb.Dataset(infile_train)
valid_data = lgb.Dataset(infile_valid)
# Set group info.
# We can igonre the step if *.query files exist with input files in the same dir.
train_group_size = [l.strip("\n") for l in open(infile_train + ".query")]
valid_group_size = [l.strip("\n") for l in open(infile_valid + ".query")]
train_data.set_group(train_group_size)
valid_data.set_group(valid_group_size)
# Parameters are borrowed from the official experiment doc:
# https://lightgbm.readthedocs.io/en/latest/Experiments.html
param = {
"task": "train",
"num_leaves": 255,
"min_data_in_leaf": 1,
"min_sum_hessian_in_leaf": 100,
"objective": "lambdarank",
"metric": "ndcg",
"ndcg_eval_at": [1, 3, 5, 10],
"learning_rate": .1,
"num_threads": 2
}
res = {}
bst = lgb.train(
param, train_data,
valid_sets=[valid_data], valid_names=["valid"],
num_boost_round=50, evals_result=res, verbose_eval=10)
```

In [3]:

```
# Show the eval metric in tabular format.
import pandas as pd
pd.DataFrame(res["valid"]).tail()
```

Out[3]:

`tensorflow`

¶We will use `tensorflow`

along with `tensorflow_ranking`

to demonstrate how we can build a PoC LTR model within 200 lines of Python code.

Note that usually using `tensorflow`

involves much more effort since it is a lower-level framework for machine learning modeling.

In [4]:

```
import tensorflow as tf
import tensorflow_ranking as tfr
import warnings
warnings.filterwarnings("ignore")
tf.logging.set_verbosity(tf.logging.ERROR)
# The code here largely borrows from:
# https://github.com/tensorflow/ranking/blob/master/tensorflow_ranking/examples/tf_ranking_libsvm.ipynb
# Comparing to lightgbm, we need much more effort to build a model using tf.
# This is because tf is in general a lower-level framework for machine learning modeling,
# particularly deep neural nets.
tf.enable_eager_execution()
assert tf.executing_eagerly()
_NUM_FEATURES = 699
_LIST_SIZE = 99
_BATCH_SIZE = 32
_HIDDEN_LAYER_DIMS=["20", "10"]
_LOSS = tfr.losses.RankingLossKey.APPROX_NDCG_LOSS
# Input reader.
def input_fn(path):
train_dataset = tf.data.Dataset.from_generator(
tfr.data.libsvm_generator(path, _NUM_FEATURES, _LIST_SIZE, seed=777),
output_types = (
{str(k): tf.float32 for k in range(1, _NUM_FEATURES + 1)},
tf.float32
),
output_shapes = (
{str(k): tf.TensorShape([_LIST_SIZE, 1]) for k in range(1, _NUM_FEATURES + 1)},
tf.TensorShape([_LIST_SIZE])
))
train_dataset = train_dataset.shuffle(1000).repeat().batch(_BATCH_SIZE)
return train_dataset.make_one_shot_iterator().get_next()
# Test the reader.
infile_train_svm = "data/ltrc_yahoo/set1.train.txt"
infile_valid_svm = "data/ltrc_yahoo/set1.test.txt"
d = input_fn(infile_train_svm)
print(d[1]) # A label matrix of batch size X list size.
```

In [5]:

```
# Let's print the first feature tensor.
print(d[0]["1"])
```

In [6]:

```
# Define feature column, which serves as an api for adding features into estimators.
def example_feature_columns():
feature_names = [
"%d" % (i + 1) for i in range(0, _NUM_FEATURES)
]
return {
name: tf.feature_column.numeric_column(
name, shape=(1,), default_value=0.0) for name in feature_names
}
# Define the network structure.
# It is simply a fully-connected multi-layer perceptron.
def make_score_fn():
"""Returns a scoring function to build `EstimatorSpec`."""
def _score_fn(context_features, group_features, mode, params, config):
"""Defines the network to score a documents."""
del params
del config
# Define input layer.
example_input = [
tf.layers.flatten(group_features[name])
for name in sorted(example_feature_columns())
]
input_layer = tf.concat(example_input, 1)
cur_layer = input_layer
for i, layer_width in enumerate(int(d) for d in _HIDDEN_LAYER_DIMS):
cur_layer = tf.layers.dense(
cur_layer,
units=layer_width,
activation="tanh")
logits = tf.layers.dense(cur_layer, units=1)
return logits
return _score_fn
def eval_metric_fns():
"""Returns a dict from name to metric functions."""
metric_fns = {}
metric_fns.update({
"metric/ndcg@%d" % topn: tfr.metrics.make_ranking_metric_fn(
tfr.metrics.RankingMetricKey.NDCG, topn=topn)
for topn in [1, 3, 5, 10]
})
return metric_fns
def get_estimator(hparams):
"""Create a ranking estimator.
Args:
hparams: (tf.contrib.training.HParams) a hyperparameters object.
Returns:
tf.learn `Estimator`.
"""
def _train_op_fn(loss):
"""Defines train op used in ranking head."""
return tf.contrib.layers.optimize_loss(
loss=loss,
global_step=tf.train.get_global_step(),
learning_rate=hparams.learning_rate,
optimizer="Adagrad")
ranking_head = tfr.head.create_ranking_head(
loss_fn=tfr.losses.make_loss_fn(_LOSS),
eval_metric_fns=eval_metric_fns(),
train_op_fn=_train_op_fn)
return tf.estimator.Estimator(
model_fn=tfr.model.make_groupwise_ranking_fn(
group_score_fn=make_score_fn(),
group_size=1,
transform_fn=None,
ranking_head=ranking_head),
params=hparams)
```

In [7]:

```
# Initialize the model.
hparams = tf.contrib.training.HParams(learning_rate=0.1)
ranker = get_estimator(hparams)
# Train.
ranker.train(input_fn=lambda: input_fn(infile_train_svm), steps=300)
# Evaluate.
ranker.evaluate(input_fn=lambda: input_fn(infile_valid_svm), steps=100)
# Monitor the tensorboard:
# tensorboard --logdir=<ranker.model_dir output>
```

Out[7]:

- Chapelle, Olivier, et al. "Expected reciprocal rank for graded relevance." Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 2009.

- Ai, Qingyao, et al. "Learning groupwise scoring functions using deep neural networks." arXiv preprint arXiv:1811.04415 (2018).
- Burges, Christopher, et al. "Learning to rank using gradient descent." Proceedings of the 22nd International Conference on Machine learning (ICML-05). 2005.
- Burges, Christopher J., Robert Ragno, and Quoc V. Le. "Learning to rank with nonsmooth cost functions." Advances in neural information processing systems. 2007.
- Burges, Christopher JC. "From ranknet to lambdarank to lambdamart: An overview." Learning 11.23-581 (2010): 81.
- Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th international conference on Machine learning. ACM, 2007.
- Li, Hang. "A short introduction to learning to rank." IEICE TRANSACTIONS on Information and Systems 94.10 (2011): 1854-1862.
- Wang, Xuanhui, et al. "The lambdaloss framework for ranking metric optimization." Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 2018.