Feature Engineering code in sample notebook

In the Advanced Exploratory Data Analysis notebook, I am having difficulty understanding how the ranking of features works in the code below.

def compute_features_of_interest_local(data): #This is per date, will be called once per existing date
    n,d = data.shape        
    feats = list(data.columns)[1:]    
    centroid = []
    sds = []
    data.loc[:,'sum_rank'] = 0
    data.loc[:,'sum_vals'] = 0
    for feat in feats:
        df = data[feat]
        dfs = np.array(sorted(enumerate(df),key= lambda x: x[1],reverse=True))[:,0] #rankings of each feature w.r.t the others (low rank higher score) 
        data.loc[:,'sum_rank'] = dfs + data.loc[:,'sum_rank']     
    data.loc[:,'centroid_l2'] = data.loc[:,feats].apply(lambda x: calc_dist(2,centroid,x),axis=1)
    data.loc[:,'centroid_l1'] = data.loc[:,feats].apply(lambda x: calc_dist(1,centroid,x),axis=1)
    data.loc[:,'centroid_linf'] = data.loc[:,feats].apply(lambda x: calc_dist(0,centroid,x),axis=1)
    data.loc[:,'sum_vals'] = data.apply(lambda x: sum(x[1:max_feats]),axis=1) #We quickly can add another feature to summarize the overall ranking
    return data

Let's zoom into the following line of code.
data.loc[:,'sum_rank'] = dfs + data.loc[:,'sum_rank']

By adding dfs to the ‘sum_rank’ column, we would be assigning values to the wrong assets in the data df, wouldn’t we?

For example, if the asset in row 0 has some feature values and a corresponding rank, then after dfs is added it would be assigned a rank that does not come from its own feature values, yes?

This is the value of dfs after 1 iteration.

[[  3.           4.1196394 ]
 [522.           4.06719589]
 [403.           2.95877194]
 [246.          -2.28429198]
 [294.          -2.54271579]
 [251.          -2.62656593]]

After assigning dfs to the ‘sum_rank’ column, row 0 now has a rank of 3. Is that right?

Thank you.

Hi @simplexity, what you call “ranking of features” is actually a script using the cross-sectional rank of each feature to generate a new feature, which is proportional to the average rank of all the features of a certain entry of that moon. This new feature, sum_rank, is not related to a specific existing feature in the dataset, but makes use of all of them to build a new (hopefully) informative descriptor of the cross-sectional state of the system. Does this clarify?

In other words, your understanding is right after 1 iteration. After 1 iteration, you will have in ‘sum_rank’ just the rankings of the first column, so unique values from 1 to N.

The sum_rank takes the rankings that we compute for each individual feature and sums them together. If we have D columns and the first row contains the highest value in every column, its final ‘sum_rank’ will be D (rank 1, summed D times). At the other extreme, a row with the lowest value in every column ends up with a ‘sum_rank’ of D*N (rank N, summed D times).
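
As a small sanity check of those two bounds, here is a sketch on a toy frame. It uses pandas' `rank` rather than the notebook's sorting expression, and the column names and values are made up for illustration only.

```python
import pandas as pd

# Toy cross-section: N = 4 rows, D = 3 feature columns (made-up values).
toy = pd.DataFrame({
    "f1": [0.9, 0.1, 0.5, 0.3],
    "f2": [2.0, -1.0, 1.0, 0.0],
    "f3": [7.0, 1.0, 5.0, 3.0],
})
n, d = toy.shape

# Rank each column so the largest value gets rank 1, then sum across columns.
sum_rank = toy.rank(ascending=False, method="first").sum(axis=1)

print(sum_rank.tolist())  # [3.0, 12.0, 6.0, 9.0]
```

Row 0 is the highest in every column and gets D = 3, while row 1 is the lowest in every column and gets D*N = 12.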

The way ‘sum_rank’ is computed is by adding the new ranks obtained from exploring a new column to the already existing sum of rankings, hence the line:

data.loc[:,'sum_rank'] = dfs + data.loc[:,'sum_rank']
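
If it helps to check this on a small case, the sketch below runs the notebook's sorting expression on a made-up column and prints the result next to a per-row rank from pandas' `rank`. The toy values and the names `order` and `rank` are assumptions for illustration. Note that the sorting expression returns zero-based row positions in sorted order, which is not necessarily the same as each row's own rank, so it may be worth comparing the two on your own data.

```python
import numpy as np
import pandas as pd

# Made-up values for one feature column, with the default index 0..3.
df = pd.Series([10.0, 40.0, 20.0, 30.0])

# The expression from the notebook: row positions listed in descending order of value.
order = np.array(sorted(enumerate(df), key=lambda x: x[1], reverse=True))[:, 0]

# Per-row rank (1 = largest value), computed with pandas for comparison.
rank = df.rank(ascending=False, method="first")

print(order.tolist())  # [1.0, 3.0, 2.0, 0.0]
print(rank.tolist())   # [4.0, 1.0, 3.0, 2.0]
```

In this toy case, position 0 of `order` holds 1.0 because row 1 has the largest value, whereas row 0 itself has rank 4.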

Intuitively, ‘sum_rank’ captures the overall position in the rankings of each sample. If one visualizes the rows of data as points in D-dimensional space, points towards the edges of the cloud that makes up a single date would show some interesting values for this feature, and perhaps this carries predictive power for our problem.