Posted by John Cottrell (November 14, 2015)

Shooting fish in a barrel

We sometimes get asked about searching a sequence database with just one entry, or maybe a small database where the entries are variants of the same protein. Details tend to be sketchy or confidential, so we assume this is usually QA of a recombinant protein, rather than protein identification in the conventional sense. Maybe Mascot Server is just a convenient tool to see whether there are spectra that don’t match to the intended product.

Since you can never exclude the possibility of contaminants, we advise including common contaminants and the host cell proteome in the search. This doesn’t require you to edit any FASTA files. From Mascot 2.3 onwards, you’ve been able to select multiple databases for a search, so the recombinant protein or its variants can be one database that is selected for the search alongside a contaminants database and the host cell proteome from UniProt.

Including the host cell proteome has the additional benefit of giving the statistics some traction. If you search a single entry, it is hard to decide whether a low-scoring match is correct or simply a chance peptide molecular mass match. Even so, you are unlikely to get a data set from a single protein that is large enough for a meaningful decoy search; there will be too few matches. You depend on the significance threshold calculated by Mascot being acceptably accurate.

An error tolerant search is ideal for picking up peptides modified by artefacts, such as oxidation or over-alkylation, not to mention non-specific cleavage and the occasional post-translational modification or SNP. Don’t be tempted to select anything as a variable modification unless a test search has shown it to be very abundant, so that there is a high probability of a single peptide having the selected modification plus a second, unanticipated one.

If the database contains variants of the same protein, the Protein Family Summary report is likely to group them all into a single family, which can become ugly if there are hundreds or thousands of them. One option is to choose the earlier report format, Select Summary. Another is to turn off grouping by adding &group_family=0 to the report URL in the browser address bar.
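If you would rather script the URL change than edit the address bar by hand, a minimal Python sketch might look like this. The function name and the example URL are made up for illustration; only the group_family=0 parameter comes from the text above.

```python
from urllib.parse import urlsplit, urlunsplit


def disable_family_grouping(report_url):
    """Return report_url with group_family=0 appended to its query string."""
    parts = urlsplit(report_url)
    # Append to the existing query rather than re-encoding it, so that
    # whatever the report URL already contains is left untouched.
    if parts.query:
        query = parts.query + "&group_family=0"
    else:
        query = "group_family=0"
    return urlunsplit(parts._replace(query=query))


# Placeholder URL, not a real Mascot report address.
example = disable_family_grouping("http://mascot.example/report?file=F001.dat")
```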

Better still, assuming that the protein is digested with trypsin, add sequences for the individual tryptic peptides that span each mutation to the entry for the consensus sequence. You could copy the MSIPI approach, and use J or O to create unconditional cleavage sites between each peptide, but this isn’t strictly necessary. Concatenating the mutant peptide sequences before the start of the consensus sequence works fine. It creates a few non-existent sequences if the missed cleavage setting is greater than 0, but this will have a negligible effect on the search space.
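As a rough illustration of the concatenation approach, here is a Python sketch that builds such an entry. Everything in it is an assumption made for the example: the function names, the toy sequence, the restriction to single-residue substitutions, and the simple trypsin rule (cleave after K or R unless the next residue is P). It treats any peptide that appears in the digest of the mutant but not in the digest of the consensus as a peptide to add, which also covers mutations that create or remove a cleavage site.

```python
def tryptic_digest(sequence):
    """Split after K or R unless followed by P; no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def mutant_peptides(consensus, mutations):
    """mutations: iterable of (1-based position, replacement residue)."""
    known = set(tryptic_digest(consensus))
    extra = []
    for position, residue in mutations:
        mutant = consensus[:position - 1] + residue + consensus[position:]
        for peptide in tryptic_digest(mutant):
            # Keep only peptides the consensus digest does not already give.
            if peptide not in known and peptide not in extra:
                extra.append(peptide)
    return extra


def build_entry(accession, consensus, mutations):
    """FASTA entry with the mutant peptides placed before the consensus."""
    sequence = "".join(mutant_peptides(consensus, mutations)) + consensus
    return f">{accession}\n{sequence}\n"


# Toy sequence and accession, purely for illustration.
entry = build_entry("consensus_plus_variants", "MKTAYIAKQRQISFVK", [(5, "C")])
```

Note that a mutant peptide from the C-terminus of the protein will not end in K or R, so when concatenated it runs on into whatever follows it. That is one case where the J or O separators mentioned above would be worth the extra effort.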
