Data file format
A Mascot data file is a plain text (ASCII) file containing peak list information and, optionally, search parameters.
For a Peptide Mass Fingerprint, the file should contain a list of peptide mass values, one per line, optionally followed by white space and a peak area or intensity value. The peak list formats of a wide range of instrument data systems are directly compatible with these requirements. In addition, Mascot will automatically recognise the following formats:
For an MS/MS Ions Search, the data file must contain one or more MS/MS peak lists. In the Mascot generic format (MGF), each MS/MS dataset is a list of pairs of mass and intensity values, delimited by BEGIN IONS and END IONS statements. The following formats are also supported for MS/MS data:
A data file may include embedded search parameters. Most embedded parameters can only appear once, at the head of the data file. In a Mascot generic format file, certain parameters can appear within an MS/MS dataset.
If there is a conflict between the values of the embedded parameters and values entered into search form fields, the embedded parameters always take precedence. The search form fields are essentially defaults for values missing from the data file.
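The precedence rule can be pictured as a dictionary merge. This is a minimal sketch, not Mascot's own code: the function name and the sample values are hypothetical, and only the ordering (form fields as defaults, embedded values winning) comes from the text.

```python
def effective_parameters(form_fields, embedded):
    """Combine search form values with parameters embedded in the data file."""
    merged = dict(form_fields)  # form fields act as defaults
    merged.update(embedded)     # embedded parameters win on conflict
    return merged


# Hypothetical values, for illustration only
form = {"CLE": "Trypsin", "PFA": "1", "TOLU": "%"}
embedded = {"PFA": "2", "COM": "10 pmol digest of Sample X15"}
params = effective_parameters(form, embedded)
```

Here PFA resolves to the embedded value, while CLE and TOLU fall back to the form settings.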
The following paragraphs illustrate the data file formats by means of examples. The rules which Mascot follows when parsing a data file provide an alternative description of what is and is not acceptable.
The Mascot generic format for a data file submitted to Mascot is as follows (square brackets indicate optional elements; they should not be included in an actual data file):
Blank lines can be used anywhere, to improve readability.
Comment lines beginning with one of the symbols #;!/ can be included, but only outside of the BEGIN IONS and END IONS statements that delimit an MS/MS dataset.
Peptide Mass Fingerprint
In the case of a Peptide Mass Fingerprint, each query is just a single peptide m/z value, with an optional second value for peak area or intensity. For example:
If your MS data system outputs additional values on each line, these will be ignored.
There are two ways to change default search parameters. One way is using the search form fields. The other is to place embedded parameters at the beginning of the data file. For example:
The embedded parameters (COM, CLE, CHARGE, PFA) over-ride the entries in the corresponding form fields, if any. All of the other search parameters default to the search form settings.
A peptide mass fingerprint data file can only contain peptide mass fingerprint queries. Sequence queries or MS/MS datasets are not permitted.
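The rules for a peptide mass fingerprint file can be sketched as a small reader. This is an illustrative approximation rather than Mascot's actual parser: the function name is hypothetical, and it assumes that any line whose first field is not a number and which contains "=" is an embedded parameter.

```python
COMMENT_CHARS = "#;!/"


def parse_pmf(text):
    """Return (embedded parameters, list of (m/z, intensity or None))."""
    params, peaks = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in COMMENT_CHARS:
            continue  # blank lines and comment lines carry no data
        fields = line.split()
        try:
            mz = float(fields[0])
        except ValueError:
            # Not a peak line; treat LABEL=value as an embedded parameter
            if "=" in line:
                key, value = line.split("=", 1)
                params[key.strip()] = value.strip()
            continue
        intensity = None
        if len(fields) > 1:
            try:
                intensity = float(fields[1])
            except ValueError:
                intensity = None
        # Any values after the second one are ignored
        peaks.append((mz, intensity))
    return params, peaks
```

A line such as `1024.5 3200 7` yields the pair (1024.5, 3200.0), with the trailing value dropped.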
MS/MS Ions Search
For an MS/MS Ions Search, each query represents a complete MS/MS spectrum, and is delimited by a pair of statements: BEGIN IONS and END IONS.
The search form defaults can be over-ridden by including embedded parameters at the beginning of the data file. Parameters specified in the search form or the data file header apply to the entire search. Within each MS/MS query, the mass of the precursor peptide(s) must be specified using one or more PEPMASS parameters. Precursor intensity and charge can be specified by including additional values on the PEPMASS line, delimited by white space.
Certain additional parameters can be specified at query level, between BEGIN IONS and END IONS, as shown in the table below. Parameters within an MS/MS query only apply locally, to the one spectrum. In the case of the CHARGE parameter, this means that you can have a global CHARGE setting, either from the search form or from a parameter at the head of the data file, as well as a local setting in one or more of the MS/MS queries.
This can be useful if the mass spectrometer data system cannot always determine precursor charge state correctly. For example, the global setting could be 2+ and 3+. When an unambiguous charge state can be determined, the correct charge is written to the local CHARGE parameter.
Parameters within an MS/MS query must always be at the beginning, immediately following the BEGIN IONS tag. They cannot appear within or following the fragment ion list. For example:
COM=10 pmol digest of Sample X15
In the fragment ion list, the first value is fragment m/z and the second intensity. The third place is reserved for fragment charge, but this is not currently used by Mascot, and will be ignored.
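The layout of an MS/MS query can be sketched as a reader that separates header parameters, local parameters, and fragment ions. This is a simplified illustration under stated assumptions, not Mascot's parser: all function names are hypothetical, repeated PEPMASS lines overwrite each other, and lines outside BEGIN IONS / END IONS that are not parameters are skipped.

```python
COMMENT_CHARS = "#;!/"


def split_param(line):
    key, value = line.split("=", 1)
    return key.strip(), value.strip()


def parse_mgf(text):
    """Return (header parameters, list of queries)."""
    header, queries, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if current is None:
            # Outside a query: comments are allowed, parameters are global
            if line.upper() == "BEGIN IONS":
                current = {"params": {}, "peaks": []}
            elif line[0] not in COMMENT_CHARS and "=" in line:
                key, value = split_param(line)
                header[key] = value
        elif line.upper() == "END IONS":
            queries.append(current)
            current = None
        elif line[0].isalpha():
            # Local parameters must precede the fragment ion list
            if current["peaks"]:
                raise ValueError(f"parameter after fragment ions: {line}")
            key, value = split_param(line)
            current["params"][key] = value
        else:
            fields = line.split()
            # Third value (fragment charge) is ignored
            current["peaks"].append((float(fields[0]), float(fields[1])))
    return header, queries


def query_charge(header, query):
    """Local CHARGE wins; otherwise fall back to the header value, if any."""
    return query["params"].get("CHARGE", header.get("CHARGE"))
```

A fuller reader would also fall back to the search form setting when neither a local nor a header CHARGE is present.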
Fragment ion intensity information is very important. Mascot will iteratively select sub-sets of the most intense peaks, looking for the group which most clearly discriminates the score of the top matched protein. There is an upper limit of 10,000 peaks per individual MS/MS spectrum. If you see an error message reporting that this limit has been exceeded, it almost certainly means that your data are profile data, and not peak lists. It is very unlikely that a single MS/MS spectrum could ever contain more than 1000 genuine peaks, never mind 10,000.
It is possible for an MS/MS ions search data file in the Mascot generic format to include sequence queries and peptide mass fingerprint queries. This is not allowed if the file contains proprietary format MS/MS data, and neither is mixing proprietary formats.
Here is a rather baroque example:
# following lines define parameters.
Search parameters can be embedded into the data file or entered in the search form query window using the following parameter labels. In the absence of an embedded parameter, the default value is the setting of the corresponding search form field.
The FORMAT parameter is used to identify proprietary MS/MS dataset formats. It can appear once only, at the start of the file. If there is no FORMAT parameter, the default is Mascot generic format (MGF).
If the peak list format is not MGF, then parameters can only appear once, in the data file header, before the peak list begins.
For an MGF peak list, parameters with a tick in the Header column of the table below can appear in the header and those with a tick in the Local column can appear in the local scope of a single MS/MS query (spectrum). That is, after the BEGIN IONS line and before the fragment mass and intensity values.
|ACCESSION||Database entries to be searched||List of double quoted, comma separated values|
|CHARGE||Peptide charge||1-||M-H- on PMF form|
|1+||MH+ on PMF form|
|N- to N+ where N is an integer and combinations||Not PMF|
|CLE||Enzyme||Trypsin etc., as defined in enzymes file||No default, so must be specified|
|COM||Search title||Applies to the whole search|
|CUTOUT||Precursor removal||Pair of comma separated integers||MIS only|
|COMP||Amino acid composition|
|DB||Database||As defined in mascot.dat|
|DECOY||Perform decoy search||0 (false)||Default|
|ERRORTOLERANT||Error tolerant||0 (false)||Default|
|1 (true)||Not PMF|
|ETAG||Error tolerant sequence tag||A single query can have multiple ETAGs|
|FORMAT||MS/MS data file||Mascot generic||Default|
|Sciex API III|
|FRAMES||NA translation||Comma separated list of frames||Default is 1,2,3,4,5,6|
|INSTRUMENT||MS/MS ion series||Default||Default|
|ESI-QUAD-TOF etc., as defined in fragmentation_rules|
|IT_MODS||Variable Mods||As defined in unimod.xml|
|ITOL||Fragment ion tol.||Unit dependent|
|ITOLU||Units for ITOL||ppm|
|LOCUS||Hierarchical scan range identifier||string||MIS only|
|MASS||Mono. or average||Monoisotopic|
|MODS||Fixed Mods||As defined in unimod.xml|
|MULTI_SITE_MODS||Allow two modifications at a single site||0 (false) or 1 (true)||default 0|
|PEP_ISOTOPE_ERROR||Misassigned 13C||0 to 2||MIS only|
|PEPMASS||Peptide mass||>100||optionally followed by intensity and charge|
|PFA||Partials||integer, 0 to 9||default 1|
|QUANTITATION||Quantitation method||as defined in quantitation.xml||MIS only|
|RAWFILE||Raw file identifier||string||MIS only|
|RAWSCANS||Native scan range identifiers||a[:b]||MIS only|
|REPORT||Maximum hits||AUTO or integer|
|REPTYPE||Type of report||protein|
|peptide||Default for MIS|
|concise||Default for PMF|
|RTINSECONDS||Retention time or range (in seconds)||a[-b]||MIS only|
|SCANS||Scan number or range||v[-w]||MIS only|
|SEARCH||Type of search||PMF|
|SEG||Protein mass (kDa)||Empty or >0|
|SEQ||Amino acid sequence||A single query can have multiple SEQs|
|TAG||Sequence tag||A single query can have multiple TAGs|
|TAXONOMY||Taxonomy||As defined in taxonomy file|
|TITLE||Query title||Applies to a single spectrum|
|TOL||Peptide mass tol.||Unit dependent|
|TOLU||Units for TOL||%|
|USER00 to USER12||Uncommitted parameters|
Although scan and retention time information is not used directly in the Mascot search, it can be very useful for applications that import the Mascot search results. Two obvious cases are quantitation and Percolator. If a peak list contains data from multiple raw files, annotating scan and retention time information in a structured and non-verbose manner can become complicated. The MGF format includes a choice of parameters for this purpose:
RTINSECONDS Anything from a single retention time to a complex list of retention time ranges. This parameter is for passing machine readable information, not for display, so there is no RTINMINUTES, etc. When there are multiple raw files, there can be multiple RTINSECONDS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RTINSECONDS
SCANS Anything from a single scan number to a complex list of ranges, e.g. SCANS=1278,1280-1284,1290-1294,1298. When there are multiple raw files, there can be multiple SCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. SCANS[3
RAWSCANS Identifiers corresponding to the data structure in the raw file. A two letter abbreviation followed by a number for each level of the hierarchy and a colon is used to delimit the start and end of a range. When there are multiple raw files, there can be multiple RAWSCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RAWSCANS.
For example, AB Sciex Analyst scans are characterised by a triplet of period, experiment, and cycle, which is represented as pd1cy2ex3.
RAWFILE An identifier to relate a query back to one or more raw files. Can be a file name or file path or anything else that is meaningful to the downstream application.
LOCUS A hierarchical identifier used mainly by AB Sciex software. A unique combination of file, sample, period, cycle, and experiment might be represented as 188.8.131.52.1. Mascot treats this as a string and simply passes it through to the result file, so the content can be anything meaningful to the downstream application.
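A downstream application that reads the SCANS value back from a result file might expand it as in the following sketch. The function name is hypothetical, and the code assumes only the comma and hyphen syntax shown in the SCANS example above; it does not handle the indexed multi-file form.

```python
def expand_scans(value):
    """Expand a SCANS value such as '1278,1280-1284' into a list of scan numbers."""
    scans = []
    for part in value.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            scans.extend(range(int(first), int(last) + 1))  # ranges are inclusive
        else:
            scans.append(int(part))
    return scans


scans = expand_scans("1278,1280-1284,1290-1294,1298")
```

For the example value this gives twelve scan numbers, from 1278 to 1298.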
When there are multiple raw files, adding an index is the most concise way of connecting queries and raw file(s). For example, an MGF peak list from Distiller for a multi-file project might look like this:
This is fine if a single application creates the merged peak list. It cannot be used so easily when one application creates a peak list from each file and a second application independently merges these peak lists into a single search. In such cases, the RAWFILE or LOCUS parameter can be used to embed an identifier into each query as the peak list is created. This identifier then travels with the query as the peak lists are merged and is written to the search result file by Mascot.
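The two-application case can be sketched as follows: each query is tagged with a RAWFILE line as its peak list is produced, and the tagged lists are then joined. This is only an illustration. The function names are hypothetical, and the sketch assumes the individual peak lists carry no header parameters, since those would otherwise end up in the middle of the merged file.

```python
def tag_with_rawfile(mgf_text, raw_name):
    """Insert RAWFILE=<raw_name> immediately after every BEGIN IONS line."""
    out = []
    for line in mgf_text.splitlines():
        out.append(line)
        if line.strip().upper() == "BEGIN IONS":
            # Local parameters must directly follow BEGIN IONS
            out.append(f"RAWFILE={raw_name}")
    return "\n".join(out) + "\n"


def merge_peak_lists(peak_lists):
    """peak_lists maps a raw file identifier to the MGF text created from it."""
    return "\n".join(tag_with_rawfile(text, name) for name, text in peak_lists.items())
```

Because the identifier sits inside each query, it survives any later reordering or merging of the queries.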
The MGF format allows intensity information to be associated with peptide and fragment m/z values. It doesn’t specify what these values represent, which is determined by the peak picking software. They could be peak height or peak area and they could be for the 12C peak or for the complete isotope distribution. Units are generally arbitrary and absolute values have no meaning.
During a Mascot search, subsets of the most intense peaks are selected and scored iteratively, looking for the best score, which presumably corresponds to an optimum separation of signal peaks from noise peaks. In the result report for an MS/MS search, the spectra in the unassigned list can be sorted by precursor intensity (in case it is of interest to see which are the strongest spectra that failed to get a significant match). For these purposes, as long as the intensity values are derived in a consistent manner, it doesn’t greatly matter what they represent.
If the peak list is being used for quantitation, then the origin of the intensity values will be of greater interest. If Mascot Distiller is being used for peak picking, a setting in preferences can be used to choose between S/N, which behaves like height, or area under the complete isotope distribution. However, Distiller can also be configured to pass through centroid values direct from the raw data, in which case the intensity will be whatever value was assigned by the instrument data system. This is only relevant for MS2 quantitation (iTRAQ / TMT). Distiller MS1 quantitation is always based on integrating survey scan intensity across the elution profile of the precursor, and this information is not present in the peak list used for the search.
Finnigan ASC files are created by the LIST command on the ICIS data system. The header block for each MS/MS dataset begins with a “LIST:” field. The text in this field is used by Mascot to identify the query, equivalent to an embedded TITLE parameter.
The ASC file header does not specify a charge state for the precursor peptide. This can be specified (globally) on the search form, or by an embedded CHARGE parameter at the head of the data file.
The precursor peptide m/z value is parsed from the “Mode:” field. Mascot uses the prevailing CHARGE value to calculate Mr from the observed m/z.
A blank line to delimit MS/MS datasets is optional.
Example of Finnigan ASC format:
LIST: dp210198b 21-Jan-98 DERIVED SPECTRUM #9
Sequest users can create DTA files from Finnigan LCQ data using the lcq_dta.exe or extract_msn.exe utilities. Further information can be found here.
The DTA format is very simple. The first line contains the singly protonated peptide mass (MH+) and the peptide charge state as a pair of space separated values. Subsequent lines contain space separated pairs of fragment ion m/z and intensity values.
N.B. In a DTA file, the precursor peptide mass is an MH+ value, independent of the charge state. In Mascot generic format, the precursor peptide mass is an observed m/z value, from which Mr or the multiply protonated mass (MHn, charge n+) is calculated using the prevailing charge state. For example, in Mascot:
… means that the relative molecular mass Mr is 1998. This is equivalent to a DTA file which starts:
The DTA format uses the file name to identify the dataset. An example of a file name would be “Myoglobin_digest.0012.0015.3.dta”. This corresponds to scans 12 to 15 of an LC-MS run, averaged together, and a peptide charge state of 3+.
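The relationship between the DTA MH+ value and the observed m/z used in Mascot generic format, and the file naming convention, can be sketched as below. The function names are hypothetical and the proton mass constant is an assumption that does not come from the text. With a nominal proton mass of 1, an observed m/z of 1000 at 2+ would give an Mr of 1998 and an MH+ of 1999.

```python
PROTON = 1.007276  # approximate proton mass in Da (assumption, not from the text)


def mz_to_mr(mz, charge):
    """Relative molecular mass from an observed m/z and a positive charge state."""
    return (mz - PROTON) * charge


def mr_to_mh(mr):
    """Singly protonated mass (MH+), as written on the first line of a DTA file."""
    return mr + PROTON


def dta_mh_to_mz(mh, charge):
    """Observed m/z corresponding to a DTA MH+ value at the given charge state."""
    return (mh - PROTON) / charge + PROTON


def parse_dta_name(name):
    """Split a name like 'Myoglobin_digest.0012.0015.3.dta' into its parts."""
    base, first, last, charge, _ext = name.rsplit(".", 4)
    return {
        "base": base,
        "first_scan": int(first),
        "last_scan": int(last),
        "charge": int(charge),
    }


info = parse_dta_name("Myoglobin_digest.0012.0015.3.dta")
```

A converter from DTA to Mascot generic format could use `dta_mh_to_mz` to write PEPMASS and take CHARGE from the first line of the file.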
While it is perfectly possible to submit a native DTA file to Mascot, each file contains only a single MS/MS data set. If you have a series of related datasets, such as from an LC-MS experiment, it is much better to concatenate the DTA files into a single data file so that the queries can be scored and reported collectively.
Remember to include at least one blank line between each MS/MS dataset. A delimiter between datasets is essential because the DTA format is relatively unstructured. Without a delimiter, the first line of a new dataset (peptide mass, charge) might be just another line from the previous dataset (fragment ion mass, intensity).
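Concatenation with a blank line delimiter can be sketched as follows. This is a minimal illustration with a hypothetical function name. It removes any blank lines inside each dataset so that blank lines occur only between datasets.

```python
def concatenate_dta(datasets):
    """Join DTA file contents into one data file, one blank line between datasets."""
    cleaned = []
    for text in datasets:
        # Drop stray blank lines inside a dataset; they would look like delimiters
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if lines:
            cleaned.append("\n".join(lines))
    return "\n\n".join(cleaned) + "\n"


combined = concatenate_dta(["1999.0 2\n100.1 5\n", "1500.2 3\n\n200.2 7\n"])
```

Note that the scan and charge information held in each DTA file name is lost in this sketch; a fuller version might carry it along in some other way.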
Utilities to concatenate DTA files automatically can be downloaded from the Xcalibur help page.
QTof users can export peak list data in either DTA or PKL format using the Micromass ProteinLynx package. Further information can be found here.
The PKL format is similar to the DTA file format, but supports multiple MS/MS datasets in a single file. The first line of a PKL dataset contains the observed m/z, intensity, and charge state of the precursor peptide as a triplet of space separated values. Subsequent lines contain space separated pairs of fragment ion m/z and intensity values.
Multiple MS/MS datasets are delimited by at least one blank line.
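The PKL layout described above can be read with a short sketch like the one below. The function name and the dictionary keys are hypothetical, and the code assumes well-formed input with a positive integer charge state.

```python
def parse_pkl(text):
    """Return one dict per MS/MS dataset in a PKL file."""
    datasets, block = [], []
    # The extra empty string flushes the final dataset
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line:
            block.append(line.split())
            continue
        if block:
            mz, intensity, charge = block[0][:3]  # precursor triplet
            datasets.append({
                "mz": float(mz),
                "intensity": float(intensity),
                "charge": int(float(charge)),
                "peaks": [(float(f[0]), float(f[1])) for f in block[1:]],
            })
            block = []
    return datasets
```

Consecutive blank lines are treated as a single delimiter, matching the "at least one blank line" rule.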
PSD peak lists exported from Grams as .PKS files contain data from a single PSD spectrum. Since the .PKS format does not include details of the precursor peptide m/z, this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.
Example of the .PKS file format:
Peak lists exported from PE Sciex API III contain data from a single MS/MS spectrum. Since the file format does not include details of the precursor peptide m/z, this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.
Example of PE Sciex peak list format:
287.50 650 287.5
Bruker XMASS and flexAnalysis save peak lists in a simple XML format. A DTD or XSD for the format is not publicly available. For each peak, Mascot takes the m/z value from the <mass> element and the intensity from the <absi> element.
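Reading the two elements named above might look like the following sketch, which uses Python's standard xml.etree.ElementTree module. Because no schema is published, the enclosing element names in the sample (`pklist`, `pk`) are hypothetical. The code only assumes that each peak is an element with `mass` and `absi` children.

```python
import xml.etree.ElementTree as ET


def parse_bruker_peaks(xml_text):
    """Return (m/z, intensity) pairs from elements that have mass and absi children."""
    root = ET.fromstring(xml_text)
    peaks = []
    for elem in root.iter():
        mass = elem.findtext("mass")
        absi = elem.findtext("absi")
        if mass is not None and absi is not None:
            peaks.append((float(mass), float(absi)))
    return peaks


# Hypothetical sample; real files contain further elements and attributes
sample = (
    "<pklist>"
    "<pk><mass>842.51</mass><absi>1200</absi></pk>"
    "<pk><mass>1045.56</mass><absi>830.5</absi></pk>"
    "</pklist>"
)
peaks = parse_bruker_peaks(sample)
```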
The file format for MS/MS does not include details of the precursor peptide m/z, so this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.
Mascot supports mzData version 1.05. Follow the link for a schema document and further information.
Mascot supports mzML version 1.1.0. Follow the link for a schema document and further information.