The PSP dataset (Additional Parts) is publicly available under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

The original PSP dataset, which was used to train the MEGAFold-monomer model, can be found at: [http://ftp.cbi.pku.edu.cn/psp/](http://ftp.cbi.pku.edu.cn/psp/).  

In addition, we provide paired MSA files, which are essential for training models on protein complexes. These files were used to train both the GRASP and MEGAFold-multimer models.  

For use of the original monomer PSP dataset, please cite:
Liu, S. et al., PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction. arXiv:2206.12240 (2022). doi: 10.48550/arXiv.2206.12240.

For use of the PSP dataset (Additional Parts) in this directory, please also cite:
Xie, Y. et al., Integrating various Experimental Information to Assist Protein Complex Structure Prediction by GRASP. bioRxiv:2024.09.16.613256 (2024). doi: 10.1101/2024.09.16.613256.

The dataset is organized as follows:  

```
PSP/
├── paired_msa_tar/    # 177 GB of .pkl packages containing paired MSA data  
│  
└── sample_data/       # A sample .pkl file from the paired_msa_tar package  
                       # Each .pkl file contains paired MSA and deletion_matrix  
                       # for all unique sequences of a given protein complex  
```
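To get a first look at a file from `sample_data/`, the sketch below loads a `.pkl` file and prints a summary of what it contains. It makes no assumption about the internal layout (key names, array types), since that is not documented here; the helper names `describe` and `summarize_pickle` are hypothetical and not part of any released tooling. If the file stores NumPy arrays, NumPy must be installed for unpickling to succeed. Only unpickle files from a source you trust, because `pickle.load` can execute arbitrary code.

```python
import pickle
import sys
from pathlib import Path


def describe(obj, depth=0, max_depth=2):
    """Return a short description of a loaded object (hypothetical helper)."""
    # Array-like objects (e.g. NumPy arrays) expose a `shape` attribute.
    shape = getattr(obj, "shape", None)
    if shape is not None:
        return f"{type(obj).__name__} shape={tuple(shape)}"
    if isinstance(obj, dict):
        if depth >= max_depth:
            return f"dict with {len(obj)} keys"
        # Recurse into nested dictionaries up to max_depth levels.
        return {str(k): describe(v, depth + 1, max_depth) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, str, bytes)):
        return f"{type(obj).__name__} len={len(obj)}"
    return type(obj).__name__


def summarize_pickle(path):
    """Load a .pkl file and summarize its top-level structure."""
    with Path(path).open("rb") as fh:
        data = pickle.load(fh)  # trusted files only
    return describe(data)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_pkl.py <path to a .pkl file from sample_data/>
    print(summarize_pickle(sys.argv[1]))
```

Run it with the path to the sample file as the only argument. The printed summary should show where the paired MSA and deletion matrix are stored before you write a full data loader.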