Pretraining Dataset

#2
by BerkIGuler - opened

Is there a way to reproduce the exact dataset used for pre-training? I noticed that you shared the scenario files for some cities that were not used for pre-training, and as far as I understand, the channel data is generated by the DeepMIMO package based on the provided scenario. Did you generate the training data on the fly during pre-training, as in the examples? I believe it would be very helpful to open-source the pre-training data for researchers.

Owner

Thank you for your interest in our work! Everything here is open-source, but due to the large size of the data, you will need to generate it yourself. All the necessary functions for reproducing the pre-training dataset are provided in the input_preprocess module.

To get started, download the 15 pre-training DeepMIMO scenarios mentioned in the paper from the DeepMIMO website, and place them in the scenarios folder within the LWM directory.

The only step you need to complete is adding the number of rows and columns for each scenario to the row_column_users dictionary inside the get_parameters function in the input_preprocess module. You can find this information in the "Key Components" section of each scenario’s page on the DeepMIMO website, under "Users".
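
For reference, an entry in that dictionary might look like the following sketch (the scenario name and numbers here are placeholders, not values from the paper; take the real ones from each scenario's DeepMIMO page):

'some_city_scenario': {
    'n_rows': 100,      # number of user rows in the grid
    'n_per_row': 200    # number of users per row
}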

Once you have added this information, everything will be ready for generating the pre-training data.

If you have any questions, feel free to reach out!

Thanks for your quick response! For the O1 scenario, grid 3 has a different number of users per row. What should the corresponding entry in row_column_users look like in that case? For the Boston5G scenario, since the number of users per row is the same for both grids, I guess we can simply sum the numbers of rows, leading to:

'Boston5G_3p5': {
        'n_rows': 1622,
        'n_per_row': 595
    }

Upon reading the rest of the code in input_preprocess.py, I realized that you're selecting certain user row indices for Boston5G but not for the other scenarios:

    if scenario == 'Boston5G_3p5':
        parameters['user_rows'] = np.arange(row_column_users[scenario]['n_rows'][0],
                                            row_column_users[scenario]['n_rows'][1])
    else:
        parameters['user_rows'] = np.arange(row_column_users[scenario]['n_rows'])

I believe for Boston5G you had something like

'Boston5G_3p5': {
        'n_rows': (811, 1622),
        'n_per_row': 595
    }

and picked the second grid with the if-else block for some reason.

As for the O1 scenario, I see that the uniform_sampling function accepts an integer argument users_per_row, but there are two possible values (361 or 181) depending on which grid is picked.

Overall, I just want to make sure that I'm using the same users for all scenarios, so that I can generate the same dataset for reproducibility purposes. That said, I believe generating the EXACT same pre-training dataset is not possible, since we are sampling users randomly?

Owner

In the Boston scenario, the first grid has very few non-zero channels, which is why we initially excluded it. However, there is no issue with using it, and the first 811 rows can be incorporated into your analysis.

The O1 scenario can be divided into two sub-scenarios, distinguished by the _v1 and _v2 suffixes in the following dictionary. Each part has a different number of rows and users per row:

row_column_users = {
    'Boston5G_3p5': {
        'n_rows': [812, 1622],
        'n_per_row': 595 
    },
    'O1_3p5_v1': {
        'n_rows': 3852,
        'n_per_row': 181
    },
    'O1_3p5_v2': {
        'n_rows': [3853, 5203],
        'n_per_row': 361
    }
}

To divide the O1 scenario into two sub-scenarios, use the suffixed names (_v1, _v2) in your code to distinguish the two parts, but strip the suffix before feeding the scenario name to the channel generator, so that DeepMIMO still receives the original O1_3p5 name. For example:

parameters['scenario'] = scenario.split("_v")[0]

Since the n_rows key in the dictionary is a list for Boston5G_3p5 and O1_3p5_v2, you'll need to handle these cases differently. Here's an example:

if scenario in ['Boston5G_3p5', 'O1_3p5_v2']:
    parameters['user_rows'] = np.arange(row_column_users[scenario]['n_rows'][0],
                                        row_column_users[scenario]['n_rows'][1])
else:
    parameters['user_rows'] = np.arange(row_column_users[scenario]['n_rows'])

This ensures that the correct rows are assigned for each scenario, accommodating the division of O1 and handling the unique case of Boston5G_3p5.
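
Putting these pieces together, a minimal sketch of the row-selection logic could look like this (the function name and structure are illustrative, not the exact code in input_preprocess):

import numpy as np

row_column_users = {
    'Boston5G_3p5': {'n_rows': [812, 1622], 'n_per_row': 595},
    'O1_3p5_v1': {'n_rows': 3852, 'n_per_row': 181},
    'O1_3p5_v2': {'n_rows': [3853, 5203], 'n_per_row': 361}
}

def get_user_rows(scenario):
    parameters = {}
    # Strip the _v1/_v2 suffix so DeepMIMO sees the original scenario name.
    parameters['scenario'] = scenario.split("_v")[0]
    n_rows = row_column_users[scenario]['n_rows']
    if scenario in ['Boston5G_3p5', 'O1_3p5_v2']:
        # n_rows is a [start, stop] pair: take only that range of rows.
        parameters['user_rows'] = np.arange(n_rows[0], n_rows[1])
    else:
        # n_rows is a scalar: take all rows from 0 up to n_rows.
        parameters['user_rows'] = np.arange(n_rows)
    return parameters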

The dataset can be exactly reproduced. There is no randomness.

Please let us know if further clarification is needed!

I understand your points and can definitely modify it in the way you described, but my only concern is using the same user grids as in the pre-training data. So my final understanding is that you picked the SECOND user grid for the Boston scenario and used ALL grids for O1?

Hello, any updates on this?

Owner

Hello, we used only the second grid in the Boston scenario. For the O1 scenario, we included all 181 users in each row of grids one and two, but only the first 181 users in the rows of the third grid. Note that since base station 4 was used in this scenario, the channels in the third grid are relatively weak.

Hello, and thank you for sharing this very good work.
I want to use the LWM project for downstream tasks. I tried this on the RML2018 dataset with a few samples, using the extracted embeddings and the model, but I am running into overfitting even though I used a lightweight network. My question is: has the LWM framework been tested only for MIMO applications, or has it also been tested for applications such as AMR? If it has, please guide me. Thank you.

@MojtabaHajiloo Hello, I have the same problem as you. When I used the embeddings output by the LWM model to predict the UE position, I also ran into overfitting.

Hello @MojtabaHajiloo and thank you for your interest in LWM!

  • LWM has primarily been tested on MIMO-based applications. The model was pre-trained on wireless MIMO channels, and while it is designed to extract universal channel features, it has not been explicitly tested on Automatic Modulation Recognition (AMR).

  • Overfitting with few samples is expected, especially if the dataset distribution is very different from LWM’s training data. Since LWM was trained on MIMO channels, its learned representations may not directly generalize to AMR without fine-tuning. If your dataset is small, embeddings might still help, but a lightweight network alone may not prevent overfitting.

  • To improve generalization, consider fine-tuning LWM on AMR data or using augmentation techniques. You can fine-tune LWM’s final layers while keeping the lower layers frozen to help it adapt to AMR tasks (see the sketch after this list). Alternatively, increasing the number of training samples through data augmentation or transfer learning might help mitigate overfitting.

  • Most importantly, apply proper data normalization. Even if LWM was not explicitly designed for AMR, its learned features might still capture useful signal properties, but differences in input distributions can strongly impact performance, so make sure your dataset is normalized correctly before feeding it to LWM. We also encourage comparing results with traditional AMR approaches to see whether the embeddings provide any advantage.
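
As a rough illustration of the freezing strategy mentioned above, here is a minimal PyTorch sketch using a toy stand-in model (in practice you would load the pre-trained LWM and unfreeze its actual final layers; the module structure here is a placeholder):

import torch
import torch.nn as nn

# Toy stand-in for a pre-trained backbone; load the real LWM here instead.
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 16)   # final layer to adapt to the downstream task
)

# Freeze everything, then unfreeze only the final layer.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# Give the optimizer only the trainable (unfrozen) parameters.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)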

We will release the next version of LWM next week, allowing you to easily fine-tune it for your specific input data or task. We will also provide some videos that address your question.

Let us know if you need further guidance!

Hello @Markydh7, thank you for your feedback. Normalizing your channels will likely solve the overfitting issue. Without proper normalization, channel patches tend to map to similar spots in the embedding space, causing overfitting. Please try it and let us know the outcome!
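
For example, one common normalization (an illustrative choice, not necessarily the exact scheme used in LWM's preprocessing) scales the channels so that the average power per element is one:

import numpy as np

def normalize_channels(channels):
    # channels: complex array, e.g. shape (num_users, n_ant, n_subcarriers).
    # Scale so that the mean power per channel element is 1 over the dataset.
    mean_power = np.mean(np.abs(channels) ** 2)
    return channels / np.sqrt(mean_power)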
