Why is only the element at row 0, column 1 picked as the result of the rank score calculation in the following function from the Advanced EDA Notebook?

```
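import numpy as np

def get_rank_corr_score(y_preds, y_trues):
    # Rank predictions within each date as percentiles; method="first" breaks ties by order of appearance
    rank_pred = y_preds.groupby('date', group_keys=True).apply(lambda x: x.rank(pct=True, method="first"))
    correlation_score = np.corrcoef(rank_pred['y'], y_trues['y'])[0, 1]
    return correlation_score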
```

Thank you.

np.corrcoef() returns the correlation matrix of its inputs. With 2 input vectors that matrix is 2x2, and the function takes the [0, 1] element, which is the correlation between the two vectors. The [1, 0] element has the same value because the matrix is symmetric. [0, 0] and [1, 1] are both 1 because the correlation of a vector with itself is 1.
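A minimal sketch of that behavior, using two short made-up vectors as stand-ins for `rank_pred['y']` and `y_trues['y']` (the values are illustrative assumptions, not data from the notebook):

```python
import numpy as np

# Hypothetical 1-D vectors standing in for rank_pred['y'] and y_trues['y'].
x = np.array([0.1, 0.4, 0.35, 0.8, 0.6])
y = np.array([1.0, 2.0, 1.5, 4.0, 2.5])

# With two 1-D inputs the result is 2x2, whatever the length of x and y.
corr = np.corrcoef(x, y)
print(corr.shape)
print(corr)

# Diagonal entries: each vector correlated with itself.
# Off-diagonal entries: correlation between x and y, the same in both positions.
score = corr[0, 1]
print(score)
```

Picking `[1, 0]` instead of `[0, 1]` would give the same number.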

I was under the impression that the 2 input vectors are each 460 features long. I am not 100% sure, but it does not look like we are calculating the correlation coefficient of a 2x2 matrix, are we?

Yes. But we are building a 2x2 correlation matrix from 2 features, not 460 (in the above setting, rank_pred['y'] and y_trues['y']).

I have just run the following code with 1 model in List_models.

```
def get_rank_corr_score(y_preds, y_trues):
    rank_pred = y_preds.groupby('date', group_keys=True).apply(lambda x: x.rank(pct=True, method="first"))
    correlation_score = np.corrcoef(rank_pred['y'], y_trues['y'])[0, 1]
    print("rank_pred['y'],y_trues['y']", rank_pred['y'].shape, y_trues['y'].shape)
    return correlation_score
```

```
List_models = [
    sklearn.linear_model.LinearRegression(),
]
```

and

`statsCV = TemporalCV(List_models=List_models,X_data=X_train_2, y_data = y_train,n_splits=10)`

The following is the print out.

```
rank_pred['y'],y_trues['y'] (35607,) (35607,)
rank_pred['y'],y_trues['y'] (46696,) (46696,)
rank_pred['y'],y_trues['y'] (50612,) (50612,)
```

So I don't understand what you mean by the inputs to get_rank_corr_score() being 2x2 in shape?

The values 35607, 46696, and 50612 are sample counts, not feature counts.

Think of the formula for the correlation coefficient.

$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

Then,

x: rank_pred['y']

y: y_trues['y']

(this assignment is just an example; x and y can be swapped)

N: 35607 or 46696 or 50612

x and y are the 2 features here, each a vector of N samples, so the correlation is calculated between those 2 features and comes out as a single number.
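As a rough check of that reading, the sketch below computes the Pearson correlation by hand for two vectors of N samples and compares it with `np.corrcoef(...)[0, 1]`. The data are randomly generated, and `pearson_r` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # hypothetical sample count; the thread's values were 35607, 46696, 50612

# Two synthetic vectors of N samples each (stand-ins for x and y above).
x = rng.normal(size=N)
y = 0.5 * x + rng.normal(size=N)

def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations over the product of deviation norms."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

manual = pearson_r(x, y)
from_numpy = np.corrcoef(x, y)[0, 1]
print(manual, from_numpy)  # the two values should agree up to floating-point error
```

N only sets how many terms are summed; the result is one number for the pair (x, y).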

I guess my original question is: why `[0,1]`?

Why are you only interested in those 2 features, when there are 460 features we start with?

What am I missing here?

OK, I think I finally get it after reviewing the code again and again.

Of course, you are using the predicted y values to compare against the true y values, after all the 460 features have been used in fitting a model.
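A small sketch of that picture, under several assumptions: the data are synthetic, ordinary least squares via `np.linalg.lstsq` stands in for the notebook's models, the fit is in-sample, and the per-date ranking step is left out. It is only meant to show how the 460 feature columns collapse into one prediction vector before the correlation is taken.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 460  # 460 matches the thread; n_samples is made up

# Synthetic feature matrix and target (illustrative only).
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y_true = X @ true_w + rng.normal(size=n_samples)

# Fit a linear model by least squares, then predict one value per sample.
w, *_ = np.linalg.lstsq(X, y_true, rcond=None)
y_pred = X @ w

print(X.shape, y_pred.shape, y_true.shape)

# The correlation is between two 1-D vectors, so the matrix is 2x2.
score = np.corrcoef(y_pred, y_true)[0, 1]
print(score)
```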

Thank you.