Why is only the element on row 0 and column 1 picked as the answer to rank score calculation in the following function found in Advanced EDA Notebook?
def get_rank_corr_score(y_preds, y_trues):
    rank_pred = y_preds.groupby('date', group_keys=True).apply(lambda x: x.rank(pct=True, method="first"))
    correlation_score = np.corrcoef(rank_pred['y'], y_trues['y'])[0, 1]
    return correlation_score
Thank you.
np.corrcoef() returns the correlation matrix of its 2 input vectors, which is a 2x2 matrix, so the function takes the element at [0, 1]: the correlation between the first vector and the second. The [1, 0] element has the same value because the matrix is symmetric. [0, 0] and [1, 1] are both 1 because the correlation of a vector with itself is 1.
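A minimal sketch of that layout, using two short made-up vectors (the names `a` and `b` and their values are illustrative assumptions, not data from the notebook):

```python
import numpy as np

# Two illustrative 1-D vectors of equal length (hypothetical values).
a = np.array([0.1, 0.4, 0.35, 0.8, 0.6])
b = np.array([0.2, 0.3, 0.5, 0.9, 0.4])

# Two input vectors always give a 2x2 matrix, whatever their length.
corr_matrix = np.corrcoef(a, b)
print(corr_matrix.shape)  # (2, 2)

# Diagonal: each vector against itself. Off-diagonal: a against b.
print(corr_matrix)

# [0, 1] and [1, 0] hold the same number, so either one is "the" correlation.
score = corr_matrix[0, 1]
print(score)
```

Making `a` and `b` longer changes the value of `score`, but the matrix stays 2x2.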
I am under the impression that the 2 input vectors are 460 features long. I am not 100% sure, but it does not look like we are calculating the correlation coefficient from a 2x2 matrix, are we?
Yes. But we are building a 2x2 correlation matrix that consists not of 460 features but of 2 variables (in the above setting, rank_pred['y'] and y_trues['y']).
I have just run the following code with 1 model in List_models.
def get_rank_corr_score(y_preds, y_trues):
    rank_pred = y_preds.groupby('date', group_keys=True).apply(lambda x: x.rank(pct=True, method="first"))
    correlation_score = np.corrcoef(rank_pred['y'], y_trues['y'])[0, 1]
    # Debug print: shapes of the two vectors passed to np.corrcoef
    print("rank_pred['y'],y_trues['y']", rank_pred['y'].shape, y_trues['y'].shape)
    return correlation_score

List_models = [
    sklearn.linear_model.LinearRegression(),
]
and
statsCV = TemporalCV(List_models=List_models, X_data=X_train_2, y_data=y_train, n_splits=10)
The following is the print out.
rank_pred['y'],y_trues['y'] (35607,) (35607,)
rank_pred['y'],y_trues['y'] (46696,) (46696,)
rank_pred['y'],y_trues['y'] (50612,) (50612,)
So I don't know what you meant by saying the inputs to get_rank_corr_score() are 2x2 in shape.
The values of 35607, 46696, 50612 are not in the feature dimension but in the sample size dimension.
Think of the formula of correlation coefficient.
$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
Then,
x: rank_pred['y']
y: y_trues['y']
(it is just an example, we can exchange x and y)
N: 35607 or 46696 or 50612
x and y are the only two variables being compared, each a vector of length N, so the correlation is computed between 2 vectors and the resulting matrix is 2x2.
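As a rough check, the Pearson correlation can be computed by hand (sum of the products of deviations, divided by the square root of the product of the summed squared deviations) and compared with np.corrcoef(). The data below is synthetic and the names `x_c`, `y_c`, `r_manual` and `r_numpy` are made up for this sketch:

```python
import numpy as np

# Synthetic stand-ins for rank_pred['y'] and y_trues['y'] (assumed data).
rng = np.random.default_rng(0)
N = 1000  # sample size, playing the role of 35607 / 46696 / 50612
x = rng.random(N)
y = 0.5 * x + rng.random(N)

# Deviations from the mean.
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson correlation computed directly from its definition.
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same quantity read off the 2x2 matrix returned by np.corrcoef.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
print(np.isclose(r_manual, r_numpy))  # True
```

N only sets how many terms go into each sum. It does not change the size of the correlation matrix.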
I guess my original question is: why [0,1]?
Why are you only interested in those 2 features, when there are 460 features we start with?
What am I missing here?
OK, I think I finally get it after reviewing the code again and again.
Of course, you are using the predicted y values to compare against the true y values, after all the 460 features have been used in fitting a model.
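To make that concrete, here is a small sketch on synthetic data. It assumes a plain least-squares fit via np.linalg.lstsq as a stand-in for sklearn's LinearRegression, and it skips the per-date ranking step from the notebook. All names and sizes other than the 460 features are made up:

```python
import numpy as np

# Synthetic data: 460 features per sample (sizes and values are assumptions).
rng = np.random.default_rng(42)
n_samples, n_features = 2000, 460
X = rng.normal(size=(n_samples, n_features))
true_coef = rng.normal(size=n_features)
y_true = X @ true_coef + rng.normal(scale=5.0, size=n_samples)

# Fit a linear model on all 460 features (least squares, no intercept).
coef, *_ = np.linalg.lstsq(X, y_true, rcond=None)

# The model collapses the 460 features into ONE prediction per sample.
y_pred = X @ coef
print(X.shape)       # (2000, 460)
print(y_pred.shape)  # (2000,)

# Only two vectors are left to compare: predictions and targets.
score = np.corrcoef(y_pred, y_true)[0, 1]
print(score)
```

The 460 features only matter up to the point where the model produces `y_pred`. After that, the score is the correlation between two vectors.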
Thank you.