I am trying to reproduce how the normalized data (sdata['anucleus'].X) is calculated. I am applying the log1p_normalization function defined at the end of the basic-EDA notebook to the raw count data:
import numpy as np
import pandas as pd

def log1p_normalization(arr):
    # Scale each row (cell) to a total of 100, then apply log1p
    return np.log1p((arr / np.sum(arr, axis=1)) * 100)

gene_name_list = sdata['anucleus'].var['gene_symbols'].values
x_count = pd.DataFrame(sdata['anucleus'].layers['counts'],
                       columns=gene_name_list)  # raw counts
x_count_norm_using_def = log1p_normalization(x_count)
The above code returns NaN results. Could anyone help me with this? In addition, what is the rationale behind this specific normalization method? Is it better to analyze the raw count data or the normalized data?
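In case it helps narrow things down, here is a minimal variant that operates on the underlying NumPy array instead of the DataFrame. log1p_normalization_np is my own name and keepdims=True is my addition; note that dividing a DataFrame by a Series makes pandas align on column labels, which can yield all-NaN output, so that may be where the NaNs come from.

def log1p_normalization_np(arr):
    # Per-row sums, kept 2D so the division broadcasts across columns
    row_sums = np.sum(arr, axis=1, keepdims=True)
    return np.log1p((arr / row_sums) * 100)

x_count_norm_np = log1p_normalization_np(x_count.values)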
Thank you!
I checked the log1p-normalized target; it indeed gives the same output as sdata['anucleus'].X:
gene_name_list = sdata['anucleus'].var['gene_symbols'].values

# Cell IDs in the train group that also appear in the anucleus table
cell_id_train = sdata['cell_id-group'].obs[sdata['cell_id-group'].obs['group'] == 'train']['cell_id'].to_numpy()
cell_id_train = list(set(cell_id_train).intersection(set(sdata['anucleus'].obs['cell_id'].unique())))

ground_truth_example = sdata['anucleus'].layers['counts'][sdata['anucleus'].obs['cell_id'].isin(cell_id_train), :]
y = pd.DataFrame(ground_truth_example, columns=gene_name_list, index=cell_id_train)
def log1p_normalization1(arr):
    arr_sum = np.sum(arr)  # sum over all elements (arr is a single cell's counts)
    if arr_sum == 0:  # avoid division by zero
        return np.zeros_like(arr)  # an all-zero row normalizes to zeros
    return np.log1p((arr / arr_sum) * 100)
print(sdata['anucleus'].X[0], log1p_normalization1(y.iloc[0, :].values))
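To check the match over the whole matrix rather than one row at a time, here is a vectorized sketch, assuming the counts layer and X are dense arrays (if they are sparse, convert with .toarray() first):

counts = np.asarray(sdata['anucleus'].layers['counts'], dtype=float)
row_sums = counts.sum(axis=1, keepdims=True)

# Guard all-zero rows so they come out as zeros instead of NaN
scaled = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
normalized = np.log1p(scaled * 100)

print(np.allclose(normalized, sdata['anucleus'].X))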