Data file format

A Mascot data file is a plain text (ASCII) file containing peak list information and, optionally, search parameters.

For a Peptide Mass Fingerprint, the file should contain a list of peptide mass values, one per line, optionally followed by white space and a peak area or intensity value. The peak list formats of a wide range of instrument data systems are directly compatible with these requirements. In addition, Mascot will automatically recognise the following formats:

For an MS/MS Ions Search, the data file must contain one or more MS/MS peak lists. In the Mascot generic format (MGF), each MS/MS dataset is a list of pairs of mass and intensity values, delimited by BEGIN IONS and END IONS statements. The following formats are also supported for MS/MS data: Sequest (.DTA), Finnigan (.ASC), Micromass (.PKL), PerSeptive (.PKS), Sciex API III, Bruker (.XML), mzData (.XML), and mzML (.mzML).

A data file may include embedded search parameters. Most embedded parameters can only appear once, at the head of the data file. In a Mascot generic format file, certain parameters can appear within an MS/MS dataset.

If there is a conflict between the values of the embedded parameters and values entered into search form fields, the embedded parameters always take precedence. The search form fields are essentially defaults for values missing from the data file.
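
The precedence rule can be sketched in a few lines of Python. This is only an illustration, not Mascot's own code: the function name effective_parameters and both dictionaries are hypothetical, and the form field values are invented for the example.

```python
def effective_parameters(form_fields, embedded):
    """Combine search form values with embedded parameters.

    Labels are upper-cased because parameter labels are not case
    sensitive. Embedded values win whenever both sources define a label.
    """
    merged = {label.upper(): value for label, value in form_fields.items()}
    merged.update({label.upper(): value for label, value in embedded.items()})
    return merged


# Hypothetical form settings and embedded parameters.
form_fields = {"CLE": "Trypsin", "PFA": "1", "CHARGE": "1+"}
embedded = {"cle": "Lys-C"}

print(effective_parameters(form_fields, embedded))
```

Here the embedded CLE value replaces the form value, while PFA and CHARGE fall back to the form settings.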

The following paragraphs illustrate the data file formats by means of examples. The rules which Mascot follows when parsing a data file provide an alternative description of what is and is not acceptable.

Mascot Generic Format

The Mascot generic format for a data file submitted to Mascot is as follows. Square brackets indicate optional elements and should not be included in an actual data file:

[Embedded Parameter(s)]
Query 1
[Query 2]
.
.
.
[Query N]

Blank lines can be used anywhere, to improve readability.

Comment lines beginning with one of the symbols #;!/ can be included, but only outside of the BEGIN IONS and END IONS statements that delimit an MS/MS dataset.

Peptide Mass Fingerprint

In the case of a Peptide Mass Fingerprint, each query is just a single peptide m/z value, with an optional second value for peak area or intensity. For example:

764.2
1231.0
1284
1944.8
2020.2
2100.35

Or

764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566

If your MS data system outputs additional values on each line, these will be ignored.

There are two ways to change default search parameters. One way is using the search form fields. The other is to place embedded parameters at the beginning of the data file. For example:

COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566

The embedded parameters (COM, CLE, CHARGE, PFA) over-ride the entries in the corresponding form fields, if any. All of the other search parameters default to the search form settings.

A peptide mass fingerprint data file can only contain peptide mass fingerprint queries. Sequence queries or MS/MS datasets are not permitted.
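
For readers who want to check a PMF file before submitting it, the layout described above can be read with a short Python sketch. The function name parse_pmf is hypothetical, and the sketch is more lenient than Mascot (for instance, it tolerates leading white space before a parameter label), so treat it as an illustration of the layout rather than a reference parser.

```python
COMMENT_CHARS = "#;!/"


def parse_pmf(text):
    """Split a PMF data file into embedded parameters and queries."""
    params = {}
    queries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in COMMENT_CHARS:
            continue  # blank line or comment
        if line[0].isdigit():
            fields = line.split()
            mz = float(fields[0])
            intensity = None
            if len(fields) > 1:
                try:
                    intensity = float(fields[1])
                except ValueError:
                    pass  # second field is not a number, so no intensity
            queries.append((mz, intensity))  # further fields are ignored
        elif "=" in line:
            label, value = line.split("=", 1)
            params[label.upper()] = value
    return params, queries


example = """COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
1284 456
"""
params, queries = parse_pmf(example)
print(params["CLE"], queries[0])
```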

MS/MS Ions Search

For an MS/MS Ions Search, each query represents a complete MS/MS spectrum, and is delimited by a pair of statements: BEGIN IONS and END IONS.

The search form defaults can be over-ridden by including embedded parameters at the beginning of the data file. Parameters specified in the search form or the data file header apply to the entire search. Within each MS/MS query, the mass of the precursor peptide(s) must be specified using one or more PEPMASS parameters. Precursor intensity and charge can be specified by including additional values on the PEPMASS line, delimited by white space.

Certain additional parameters can be specified at query level, between BEGIN IONS and END IONS, as shown in the table below. Parameters within an MS/MS query only apply locally, to the one spectrum. In the case of the CHARGE parameter, this means that you can have a global CHARGE setting, either from the search form or from a parameter at the head of the data file, as well as a local setting in one or more of the MS/MS queries.

This can be useful if the mass spectrometer data system cannot always determine precursor charge state correctly. For example, the global setting could be 2+ and 3+. When an unambiguous charge state can be determined, the correct charge is written to the local CHARGE parameter. Parameters within an MS/MS query must always be at the beginning, immediately following the BEGIN IONS tag. They cannot appear within or following the fragment ion list. For example:

COM=10 pmol digest of Sample X15
ITOL=1
ITOLU=Da
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
MASS=Monoisotopic
USERNAME=Lou Scene
USEREMAIL=leu@altered-state.edu
CHARGE=2+ and 3+
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.60 73
846.80 44
847.60 67
.
.
.
1640.10 291
1640.60 54
1895.50 49
END IONS
 
BEGIN IONS
TITLE=Spectrum 2
PEPMASS=1084.9
SCANS=3
RTINSECONDS=25
345.10 237
370.20 128
460.20 108
.
.
.
1673.30 1007
1674.00 974
1675.30 79
END IONS
 
BEGIN IONS
TITLE=Spectrum 3
PEPMASS=1244.7
SCANS=29-34
RTINSECONDS=95-97
.
.
.

In the fragment ion list, the first value is fragment m/z and the second intensity. The third place is reserved for fragment charge, but this is not currently used by Mascot, and will be ignored.
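
As an illustration of the structure shown above, the following Python sketch separates header parameters from the MS/MS queries and collects each query's local parameters and peaks. The name parse_mgf is hypothetical. The sketch keeps only one value per label (so a repeated PEPMASS would be overwritten), skips non-parameter lines in the header, and does not enforce that local parameters precede the fragment ions.

```python
def parse_mgf(text):
    """Parse MGF text into (header parameters, list of MS/MS queries)."""
    header = {}
    queries = []
    current = None  # query being read, or None when outside BEGIN/END IONS
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword = line.upper()
        if keyword == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif keyword == "END IONS":
            if current is not None:
                queries.append(current)
            current = None
        elif current is None:
            # Header scope: comments and mass or sequence queries are skipped,
            # NAME=value lines are kept as global parameters.
            if line[0] not in "#;!/" and not line[0].isdigit() and "=" in line:
                label, value = line.split("=", 1)
                header[label.upper()] = value
        elif line[0].isdigit():
            fields = line.split()
            if len(fields) >= 2:
                # First value is m/z, second is intensity, the rest is ignored.
                current["peaks"].append((float(fields[0]), float(fields[1])))
        elif "=" in line:
            label, value = line.split("=", 1)
            current["params"][label.upper()] = value
    return header, queries


sample = """CHARGE=2+ and 3+
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.60 73
846.80 44
END IONS

BEGIN IONS
TITLE=Spectrum 2
PEPMASS=1084.9
SCANS=3
345.10 237
END IONS
"""
header, queries = parse_mgf(sample)
for query in queries:
    print(query["params"]["TITLE"], len(query["peaks"]))
```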

Fragment ion intensity information is very important. Mascot will iteratively select sub-sets of the most intense peaks, looking for the group which most clearly discriminates the score of the top matched protein. There is an upper limit of 10,000 peaks per individual MS/MS spectrum. If you see an error message reporting that this limit has been exceeded, it almost certainly means that your data are profile data, and not peak lists. It is very unlikely that a single MS/MS spectrum could ever contain more than 1000 genuine peaks, never mind 10,000.

It is possible for an MS/MS ions search data file in the Mascot generic format to include sequence queries and peptide mass fingerprint queries. This is not allowed if the file contains proprietary format MS/MS data, and neither is mixing proprietary formats.

Here is a rather baroque example:

# following lines define parameters.
# NB no spaces allowed on either side of the = symbol
COM=My favourite protein has been eaten by an enzyme
CLE=Trypsin
CHARGE=2+
# following line will be treated as a peptide mass
1024.6
# following line is a sequence query, which must
# conform precisely to sequence query syntax rules
2321 seq(n-ACTL) comp(2[C])
# so is this
1896 ions(345.6:24.7,347.8:45.4, ... ,1024.7:18.7)
# An MS/MS ions query is delimited by the tags
# BEGIN IONS and END IONS. Space(s)
# are used to separate mass and intensity values
BEGIN IONS
TITLE=The first peptide - dodgy peak detection, so extra wide tolerance
PEPMASS=896.05 25674.3
CHARGE=3+
TOL=3
TOLU=Da
SEQ=n-AC[DHK]
COMP=2[H]0[M]3[DE]*[K]
240.1 3
242.1 12
245.2 32
.
.
.
1623.7 55
1624.7 23
END IONS

Embedded Search Parameters

Search parameters can be embedded into the data file or entered in the search form query window using the following parameter labels. In the absence of an embedded parameter, the default value is the setting of the corresponding search form field.

The FORMAT parameter is used to identify proprietary MS/MS dataset formats. It can appear once only, at the start of the file. If there is no FORMAT parameter, the default is Mascot generic format (MGF).

If the peak list format is not MGF, then parameters can only appear once, in the data file header, before the peak list begins.

For an MGF peak list, parameters with a tick in the Header column of the table below can appear in the header and those with a tick in the Local column can appear in the local scope of a single MS/MS query (spectrum). That is, after the BEGIN IONS line and before the fragment mass and intensity values.

Name               Description                               Header  Local  Choices/Range                                         Notes
ACCESSION          Database entries to be searched           ✓              List of double quoted, comma separated values
CHARGE             Peptide charge                            ✓       ✓      1-                                                    M-H- on PMF form
                                                                            Mr
                                                                            1+                                                    MH+ on PMF form
                                                                            N- to N+ where N is an integer and combinations       Not PMF
CLE                Enzyme                                    ✓              Trypsin etc., as defined in enzymes file              No default, so must be specified
COM                Search title                              ✓                                                                    Applies to the whole search
CUTOUT             Precursor removal                         ✓              Pair of comma separated integers                      MIS only
COMP               Amino acid composition                            ✓
DB                 Database                                  ✓              As defined in mascot.dat
DECOY              Perform decoy search                      ✓              0 (false)                                             Default
                                                                            1 (true)
ERRORTOLERANT      Error tolerant                            ✓              0 (false)                                             Default
                                                                            1 (true)                                              Not PMF
ETAG               Error tolerant sequence tag                       ✓                                                            A single query can have multiple ETAGs
FORMAT             MS/MS data file                           ✓              Mascot generic                                        Default
                                                                            Sequest (.DTA)
                                                                            Finnigan (.ASC)
                                                                            Micromass (.PKL)
                                                                            PerSeptive (.PKS)
                                                                            Sciex API III
                                                                            Bruker (.XML)
                                                                            mzData (.XML)
                                                                            mzML (.mzML)
FRAMES             NA translation                            ✓              Comma separated list of frames                        Default is 1,2,3,4,5,6
INSTRUMENT         MS/MS ion series                          ✓       ✓      Default                                               Default
                                                                            ESI-QUAD-TOF etc., as defined in fragmentation_rules
IT_MODS            Variable Mods                             ✓       ✓      As defined in unimod.xml
ITOL               Fragment ion tol.                         ✓              Unit dependent
ITOLU              Units for ITOL                            ✓              ppm
                                                                            Da
                                                                            mmu
LOCUS              Hierarchical scan range identifier                ✓      string                                                MIS only
MASS               Mono. or average                          ✓              Monoisotopic
                                                                            Average
MODS               Fixed Mods                                ✓              As defined in unimod.xml
MULTI_SITE_MODS    Allow two modifications at a single site  ✓              0 (false) or 1 (true)                                 default 0
PEP_ISOTOPE_ERROR  Misassigned 13C                           ✓              0 to 2                                                MIS only
PEPMASS            Peptide mass                                      ✓      >100                                                  optionally followed by intensity and charge
PFA                Partials                                  ✓              integer, 0 to 9                                       default 1
PRECURSOR          Precursor m/z                             ✓              >100
QUANTITATION       Quantitation method                       ✓              as defined in quantitation.xml                        MIS only
RAWFILE            Raw file identifier                               ✓      string                                                MIS only
RAWSCANS           Native scan range identifiers                     ✓      a[:b]                                                 MIS only
REPORT             Maximum hits                              ✓              AUTO or integer
REPTYPE            Type of report                            ✓              protein
                                                                            peptide                                               Default for MIS
                                                                            archive                                               MIS only
                                                                            concise                                               Default for PMF
                                                                            select                                                MIS only
                                                                            unassigned                                            MIS only
RTINSECONDS        Retention time or range (in seconds)              ✓      a[-b]                                                 MIS only
SCANS              Scan number or range                              ✓      v[-w]                                                 MIS only
SEARCH             Type of search                            ✓              PMF
                                                                            SQ                                                    = MIS
                                                                            MIS                                                   = SQ
SEG                Protein mass (kDa)                        ✓              Empty or >0
SEQ                Amino acid sequence                               ✓                                                            A single query can have multiple SEQs
TAG                Sequence tag                                      ✓                                                            A single query can have multiple TAGs
TAXONOMY           Taxonomy                                  ✓              As defined in taxonomy file
TITLE              Query title                                       ✓                                                            Applies to a single spectrum
TOL                Peptide mass tol.                         ✓       ✓      Unit dependent
TOLU               Units for TOL                             ✓       ✓      %
                                                                            ppm
                                                                            mmu
                                                                            Da
USER00 to USER12   Uncommitted parameters                    ✓
USEREMAIL          User email                                ✓
USERNAME           User name                                 ✓

Specifying a scan or time range

Although scan and retention time information is not used directly in the Mascot search, it can be very useful for applications that import the Mascot search results. Two obvious cases are quantitation and Percolator. If a peak list contains data from multiple raw files, annotating scan and retention time information in a structured and non-verbose manner can become complicated. The MGF format includes a choice of parameters for this purpose:

RTINSECONDS Anything from a single retention time to a complex list of retention time ranges. This parameter is for passing machine readable information, not for display, so there is no RTINMINUTES, etc. When there are multiple raw files, there can be multiple RTINSECONDS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RTINSECONDS[0].

SCANS Anything from a single scan number to a complex list of ranges, e.g. SCANS=1278,1280-1284,1290-1294,1298. When there are multiple raw files, there can be multiple SCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. SCANS[3].

RAWSCANS Identifiers corresponding to the data structure in the raw file. A two letter abbreviation followed by a number for each level of the hierarchy and a colon is used to delimit the start and end of a range. When there are multiple raw files, there can be multiple RAWSCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RAWSCANS[1].

For example, AB Sciex Analyst scans are characterised by a triplet of period, cycle, and experiment, which is represented as pd1cy2ex3.

  • Analyst pd1cy2ex3
  • Masslynx fn2ix1
  • LCMS Solution sg1ev4sn53
  • Kratos Axima wlJ5
  • Generic (scan number) – Xcalibur, mzXML, Bruker .yep/.baf, Agilent QTOF sn492

RAWFILE An identifier to relate a query back to one or more raw files. Can be a file name or file path or anything else that is meaningful to the downstream application.

LOCUS A hierarchical identifier used mainly by AB Sciex software. A unique combination of file, sample, period, cycle, and experiment might be represented as 2.1.1.24.1. Mascot treats this as a string and simply passes it through to the result file, so the content can be anything meaningful to the downstream application.
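
A downstream application that reads these values back might decode them along the following lines. This is a minimal Python sketch: the function names parse_scans and parse_rawscan_id are hypothetical, and the sketch assumes that every level of a RAWSCANS identifier is exactly two letters followed by digits, as in the Analyst, Masslynx and LCMS Solution examples above.

```python
import re


def parse_scans(value):
    """Turn a SCANS value such as '1278,1280-1284' into (first, last) pairs."""
    ranges = []
    for part in value.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            ranges.append((int(first), int(last)))
        else:
            ranges.append((int(part), int(part)))
    return ranges


def parse_rawscan_id(identifier):
    """Split an identifier such as 'pd1cy2ex3' into (level, number) pairs."""
    pattern = re.compile(r"([A-Za-z]{2})(\d+)")
    return [(m.group(1), int(m.group(2))) for m in pattern.finditer(identifier)]


print(parse_scans("1278,1280-1284,1290-1294,1298"))
print(parse_rawscan_id("pd1cy2ex3"))
```

A RAWSCANS range of the form a:b could be handled by splitting on the colon first and passing each half to parse_rawscan_id.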

When there are multiple raw files, adding an index is the most concise way of connecting queries and raw file(s). For example, an MGF peak list from Distiller for a multi-file project might look like this:

_DISTILLER_RAWFILE[0]={1}C:\data\replicate\Orbi_0319_01.RAW
_DISTILLER_RAWFILE[1]={1}C:\data\replicate\Orbi_0319_02.RAW
_DISTILLER_RAWFILE[2]={1}C:\data\replicate\Orbi_0319_08.RAW
.
.
.
BEGIN IONS
TITLE=22927: Scan rt=4669.74 from file [2]
PEPMASS=797.36086 89994.258
CHARGE=2+
SCANS[2]=48055
RAWSCANS[2]=sn8964
RTINSECONDS[2]=4669.736
227.05463 199.54773
242.21568 120.42233
.
.
.

This is fine if a single application creates the merged peak list. It cannot be used so easily when one application creates a peak list from each file and a second application independently merges these peak lists into a single search. In such cases, the RAWFILE or LOCUS parameter can be used to embed an identifier into each query as the peak list is created. This identifier then travels with the query as the peak lists are merged and is written to the search result file by Mascot.
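
When reading an indexed peak list such as the Distiller example above, the index can be separated from the parameter name with a small helper. The Python sketch below is illustrative only. The name split_indexed_label is hypothetical, and the sketch does not attempt to interpret the content of the _DISTILLER_RAWFILE values.

```python
import re

INDEXED_LABEL = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\[(\d+)\]$")


def split_indexed_label(label):
    """Return (name, index) for 'SCANS[2]', or (name, None) if unindexed."""
    match = INDEXED_LABEL.match(label)
    if match:
        return match.group(1).upper(), int(match.group(2))
    return label.upper(), None


# Local parameters of the query shown above, grouped by raw file index.
local_params = {
    "CHARGE": "2+",
    "SCANS[2]": "48055",
    "RAWSCANS[2]": "sn8964",
    "RTINSECONDS[2]": "4669.736",
}
by_file = {}
for label, value in local_params.items():
    name, index = split_indexed_label(label)
    by_file.setdefault(index, {})[name] = value

print(by_file)
```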

Intensity values

The MGF format allows intensity information to be associated with peptide and fragment m/z values. It doesn’t specify what these values represent, which is determined by the peak picking software. They could be peak height or peak area and they could be for the 12C peak or for the complete isotope distribution. Units are generally arbitrary and absolute values have no meaning.

During a Mascot search, subsets of the most intense peaks are selected and scored iteratively, looking for the best score, which presumably corresponds to an optimum separation of signal peaks from noise peaks. In the result report for an MS/MS search, the spectra in the unassigned list can be sorted by precursor intensity (in case it is of interest to see which are the strongest spectra that failed to get a significant match). For these purposes, as long as the intensity values are derived in a consistent manner, it doesn’t greatly matter what they represent.
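
The idea of working through subsets of the most intense peaks can be pictured with the Python sketch below. It only shows how such subsets could be generated. The subset sizes, the function name most_intense_subsets, and the absence of any scoring step are all assumptions, and nothing here reflects how Mascot actually selects or scores peaks.

```python
def most_intense_subsets(peaks, sizes):
    """Yield, for each size n, the n most intense peaks in m/z order.

    peaks is a list of (mz, intensity) pairs.
    """
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    for n in sizes:
        yield sorted(ranked[:n])


peaks = [(240.1, 3), (242.1, 12), (245.2, 32), (1623.7, 55), (1624.7, 23)]
for subset in most_intense_subsets(peaks, sizes=[2, 3]):
    print(subset)
```

Because only the ranking of intensities matters in this sketch, the units of the intensity values are irrelevant as long as they are consistent within a spectrum.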

If the peak list is being used for quantitation, then the origin of the intensity values will be of greater interest. If Mascot Distiller is being used for peak picking, a setting in preferences can be used to choose between S/N, which behaves like height, or area under the complete isotope distribution. However, Distiller can also be configured to pass through centroid values direct from the raw data, in which case the intensity will be whatever value was assigned by the instrument data system. This is only relevant for MS2 quantitation (iTRAQ / TMT). Distiller MS1 quantitation is always based on integrating survey scan intensity across the elution profile of the precursor, and this information is not present in the peak list used for the search.

Proprietary MS/MS Peak List Formats

Finnigan (ASC) Files

Files in this format are created by the LIST command on the ICIS data system. The header block for each MS/MS dataset begins with a “LIST:” field. The text in this field is used by Mascot to identify the query, equivalent to an embedded TITLE parameter.

The ASC file header does not specify a charge state for the precursor peptide. This can be specified (globally) on the search form, or by an embedded CHARGE parameter at the head of the data file.

The precursor peptide m/z value is parsed from the “Mode:” field. Mascot uses the prevailing CHARGE value to calculate Mr from the observed m/z.

A blank line to delimit MS/MS datasets is optional.

Example of Finnigan ASC format:

LIST: dp210198b 21-Jan-98 DERIVED SPECTRUM #9
Samp: Spot 6483 from Gel 29A44 Start : 18:37:54 100
Mode: ESI +DAU 808.3 @ 25eV UP LR
Oper: Administrator Inlet :
Base: 798.9 Inten : 25525 Masses: 225 > 2000
Norm: 798.9 RIC : 181489 #peaks: 586
Peak: 1000.00 mmu
Data: +/1>99
0
No. Mass Intensity %RA %RIC Flags
1 229.3 8 0.03 0.00 #
2 230.3 9 0.04 0.00 #
3 259.9 8 0.03 0.00 #
.
.
.
583 1831.0 5 0.02 0.00 #
584 1878.3 5 0.02 0.00 #
585 1881.8 8 0.03 0.00 #  
LIST: dp210198a 21-Jan-98 DERIVED SPECTRUM #9
Samp: Spot 6483 from Gel 29A44 Start : 18:27:30 95
Mode: ESI +DAU 973.9 @ 25eV AVER UP LR
Oper: Administrator Inlet :
Base: 974.5 Inten : 191564 Masses: 270 > 1800
Norm: 974.5 RIC : 341387 #peaks: 593
Peak: 1000.00 mmu
Data: +/1>95
0
No. Mass Intensity %RA %RIC Flags
1 297.9 10 0.01 0.00 #
2 326.7 8 0.00 0.00 #
3 345.1 237 0.12 0.07 #
.
.
.
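
To illustrate how the title and precursor m/z might be picked out of the header blocks above, here is a minimal Python sketch. The name asc_headers is hypothetical. The text states only that the m/z is parsed from the "Mode:" field, so taking the number immediately before the @ symbol is an assumption based on the two example headers.

```python
import re

# Assumption: the precursor m/z is the number just before '@' on the Mode line.
MODE_MZ = re.compile(r"^Mode:.*?(\d+(?:\.\d+)?)\s*@")


def asc_headers(text):
    """Return one dict per dataset with its title and precursor m/z."""
    spectra = []
    for line in text.splitlines():
        if line.startswith("LIST:"):
            spectra.append({"title": line[5:].strip(), "precursor_mz": None})
        elif line.startswith("Mode:") and spectra:
            match = MODE_MZ.match(line)
            if match:
                spectra[-1]["precursor_mz"] = float(match.group(1))
    return spectra


example = """LIST: dp210198b 21-Jan-98 DERIVED SPECTRUM #9
Mode: ESI +DAU 808.3 @ 25eV UP LR
LIST: dp210198a 21-Jan-98 DERIVED SPECTRUM #9
Mode: ESI +DAU 973.9 @ 25eV AVER UP LR
"""
for spectrum in asc_headers(example):
    print(spectrum["title"], spectrum["precursor_mz"])
```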

Sequest (DTA) Files

Sequest users can create these files from Finnigan LCQ data using the lcq_dta.exe or extract_msn.exe utilities.

The DTA format is very simple. The first line contains the singly protonated peptide mass (MH+) and the peptide charge state as a pair of space separated values. Subsequent lines contain space separated pairs of fragment ion m/z and intensity values.

N.B. In a DTA file, the precursor peptide mass is an MH+ value independent of the charge state. In Mascot generic format, the precursor peptide mass is an observed m/z value, from which Mr or MH_n^(n+) is calculated using the prevailing charge state. For example, in Mascot:

PEPMASS=1000
CHARGE=2+

… means that the relative molecular mass Mr is 1998. This is equivalent to a DTA file which starts:

1999 2
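
The arithmetic behind this equivalence can be checked with a short Python sketch. The function names are hypothetical, the proton mass is rounded to six decimal places, and only positive charge states are considered. The values 1998 and 1999 quoted above are the results rounded to the nearest integer.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da, rounded


def mr_from_mz(mz, charge):
    """Relative molecular mass Mr from an observed m/z and positive charge."""
    return (mz - PROTON_MASS) * charge


def mh_from_mz(mz, charge):
    """Singly protonated mass MH+ (the DTA convention) from an observed m/z."""
    return mr_from_mz(mz, charge) + PROTON_MASS


print(round(mr_from_mz(1000, 2), 3))
print(round(mh_from_mz(1000, 2), 3))
```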

The DTA format uses the file name to identify the dataset. An example of a file name would be “Myoglobin_digest.0012.0015.3.dta”. This corresponds to scans 12 to 15 of an LC-MS run, averaged together, and a peptide charge state of 3+.

While it is perfectly possible to submit a native DTA file to Mascot, each file contains only a single MS/MS data set. If you have a series of related datasets, such as from an LC-MS experiment, it is much better to concatenate the DTA files into a single data file so that the queries can be scored and reported collectively.

Remember to include at least one blank line between each MS/MS dataset. A delimiter between datasets is essential because the DTA format is relatively unstructured. Without a delimiter, the first line of a new dataset (peptide mass, charge) might be just another line from the previous dataset (fragment ion mass, intensity).

Utilities to concatenate DTA files automatically can be downloaded from the Xcalibur help page.
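
If no such utility is to hand, the concatenation itself is simple. The following Python sketch joins the contents of several DTA files with a single blank line between datasets. The function name concatenate_dta is hypothetical, and the sketch works on text already read into memory, so reading the files and adding any embedded parameters at the head is left to the caller.

```python
def concatenate_dta(dta_texts):
    """Join DTA file contents, separated by exactly one blank line."""
    blocks = [text.strip() for text in dta_texts if text.strip()]
    return "\n\n".join(blocks) + "\n"


# Two made-up DTA datasets: first line is MH+ and charge, then peaks.
first = "1999 2\n500.1 10\n612.3 25\n"
second = "1500.5 1\n300.2 5\n"
print(concatenate_dta([first, second]))
```

Stripping each block first means that trailing blank lines inside an individual file cannot produce an empty dataset in the merged file.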

Micromass (PKL) Files

QTof users can export peak list data in either DTA or PKL format using the Micromass ProteinLynx package.

The PKL format is similar to the DTA file format, but supports multiple MS/MS datasets in a single file. The first line of a PKL dataset contains the observed m/z, intensity, and charge state of the precursor peptide as a triplet of space separated values. Subsequent lines contain space separated pairs of fragment ion m/z and intensity values.

Multiple MS/MS datasets are delimited by at least one blank line.
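
The PKL layout described above can be read with a sketch like the following. The name parse_pkl is hypothetical and the sample values are invented. The sketch assumes every dataset starts with a complete triplet and that all fields are plain numbers, so it raises an error on anything else.

```python
def parse_pkl(text):
    """Parse PKL text into a list of datasets with precursor and peaks."""
    spectra = []
    current = None  # dataset being read, or None after a blank line
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            current = None  # a blank line ends the current dataset
            continue
        values = [float(field) for field in fields]
        if current is None:
            mz, intensity, charge = values[:3]
            current = {
                "mz": mz,
                "intensity": intensity,
                "charge": int(charge),
                "peaks": [],
            }
            spectra.append(current)
        else:
            current["peaks"].append((values[0], values[1]))
    return spectra


sample = """896.05 25674.3 3
240.1 3
242.1 12

1084.9 1000 2
345.10 237
"""
for spectrum in parse_pkl(sample):
    print(spectrum["mz"], spectrum["charge"], len(spectrum["peaks"]))
```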

PerSeptive (.PKS)

PSD peak lists exported from Grams as .PKS files contain data from a single PSD spectrum. Since the .PKS format does not include details of the precursor peptide m/z, this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.

Example of the .PKS file format:

"Peak Table"
OP=0
Center X Peak Y Left X Right X Time X Mass Difference Name
STD.Misc Height Left Y Right Y %Height,Width,%Area,%Quan,H/A
818.39992 4265.0000 818.39992 818.39992 81554.550 0 818.3999
C 0.? 0 4265.0000 4265.0000
820.42154 3765.0000 820.42154 820.42154 81616.547 0 820.4215
C 0.? 0 3765.0000 3765.0000
842.38252 2571.0000 842.10681 842.62999 82290.021 0 842.3825
C 0.? 0 1800.0000 1800.0000
.
.
.

Sciex API III

Peak lists exported from PE Sciex API III contain data from a single MS/MS spectrum. Since the file format does not include details of the precursor peptide m/z, this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.

Example of PE Sciex peak list format:

287.50 650 287.5
301.00 1150 301.0
305.00 1150 305.0
315.00 6550 315.0
321.00 16,000 321.0
333.00 3050 333.0
333.50 1800 333.5
370.00 1550 370.0
.
.
.

Bruker (.XML)

Bruker XMASS and flexAnalysis save peak lists in a simple XML format. A DTD or XSD for the format is not publicly available. For each peak, Mascot takes the m/z value from the <mass> element and the intensity from the <absi> element.

The file format for MS/MS does not include details of the precursor peptide m/z, so this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.
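
Since no schema is published, the Python sketch below makes as few assumptions about the file as possible: it collects a peak from any element that has both a <mass> and an <absi> child. The function name bruker_peaks is hypothetical, and the element names pklist and pk in the sample are placeholders invented for the example rather than a statement about the real format.

```python
import xml.etree.ElementTree as ET


def bruker_peaks(xml_text):
    """Return (m/z, intensity) pairs from <mass> and <absi> child elements."""
    root = ET.fromstring(xml_text)
    peaks = []
    for element in root.iter():
        mass = element.find("mass")
        absi = element.find("absi")
        if mass is not None and absi is not None:
            peaks.append((float(mass.text), float(absi.text)))
    return peaks


# Hypothetical wrapper elements; only <mass> and <absi> come from the text.
sample = """<pklist>
  <pk><mass>818.39992</mass><absi>4265</absi></pk>
  <pk><mass>820.42154</mass><absi>3765</absi></pk>
</pklist>"""
print(bruker_peaks(sample))
```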

mzData (.XML)

Mascot supports mzData version 1.05.

mzML (.mzML)

Mascot supports mzML version 1.1.0.

The Rules

  1. Filename extensions are not significant.
  2. Numeric values must be non-localised US ASCII. That is, the decimal separator must be a period and the thousands separator, if any, must be a comma. Leading white space is acceptable on lines that start with a number.
  3. Parameter labels are not case sensitive. Parameter values may be case sensitive. Case is preserved for parameter values which are free text strings. There must be no leading space before a parameter label and no space either side of the = symbol.
  4. Parameters at the head of the data file apply to the entire search and over-ride the default settings provided by the search form fields.
  5. In the absence of a FORMAT parameter, the default format is Mascot generic.
  6. Mascot generic format permits an MS/MS search to include peptide mass fingerprint queries and sequence queries.
  7. In Mascot generic format, each MS/MS spectrum is delimited by BEGIN IONS and END IONS statements. There is a line for each fragment ion peak, containing an m/z and intensity value, separated by white space. Fragment ion m/z values must be positive, non-zero values. Intensities must be positive values. Any additional values or text are ignored, although the third value is reserved for future use as fragment charge.
  8. Parameters between the BEGIN IONS and END IONS statements only apply to the local MS/MS query. At least one PEPMASS parameter is required, all others are optional. Parameters within an MS/MS query must appear before the fragment ion data. If an MS/MS query has no fragment ions, it is treated as a PMF query.
  9. Most parameters can only appear at the head of the file, prior to any query data. The exceptions are PEPMASS, TITLE, SCANS, RTINSECONDS, RAWFILE, LOCUS, and RAWSCANS which can only appear within an MS/MS query block, and CHARGE, INSTRUMENT, IT_MODS, TOL, and TOLU, which can appear in either place. SEQ, COMP, TAG and ETAG can appear within an MS/MS query block or as qualifiers to a mass value using the Sequence Query syntax. When IT_MODS are specified within an MS/MS query block, they are appended to any IT_MODS specified at the head of the file or in the search form.
  10. Blank lines can be used anywhere to improve readability.
  11. Lines that start with one of the symbols # ; ! / are comment lines and are ignored. Comments cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
  12. A SEARCH type must be defined, (PMF, SQ or MIS). The default is determined by the search form used to upload the file. Like any other parameter, this can be over-ridden by including a SEARCH parameter in the file header.
  13. A peptide mass fingerprint (PMF) search can only contain PMF queries. This allows for a relaxed syntax in which any line starting with a number is assumed to be a query. The first number is parsed as a peptide m/z value and the second number, if any, is parsed as a peak area or intensity. The rest of the line is ignored. Peptide m/z values must be equivalent to 100 <= Mr <= 16000.
  14. MS/MS searches can contain MS/MS data in proprietary formats only if this is declared with a FORMAT parameter. Mixing proprietary formats, or including non-MS/MS queries in a proprietary format file, is not allowed.
  15. User parameters are any parameters named USER\d\d (where \d is a digit) or any name beginning with an underscore except for the following, which are reserved:
    _INSIGHT_*
    _INTEGRA_*
    _DAEMON_*
    _DISTILLER_*
    _SERVER_*
    User parameters cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
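
Rules 2, 3 and 11 lend themselves to a small line classifier, sketched here in Python for lines outside an MS/MS query block. The names parse_number and classify_line, and the category strings they return, are hypothetical. The sketch covers only these three rules and makes no attempt to validate parameter names or values.

```python
import re

COMMENT_CHARS = "#;!/"
# Rule 3: label at the start of the line, no space either side of '='.
PARAMETER = re.compile(r"^([A-Za-z_]\w*(?:\[\d+\])?)=(?!\s)(.*)$")


def parse_number(token):
    """Rule 2: period as decimal separator, comma as thousands separator."""
    return float(token.replace(",", ""))


def classify_line(line):
    """Return (kind, payload) for one line outside an MS/MS query block."""
    stripped = line.strip()
    if not stripped:
        return ("blank", None)
    if line[0] in COMMENT_CHARS:
        return ("comment", None)
    if stripped[0].isdigit():
        # Leading white space is acceptable before a number.
        return ("query", parse_number(stripped.split()[0]))
    match = PARAMETER.match(line)
    if match:
        return ("parameter", (match.group(1).upper(), match.group(2).rstrip()))
    return ("other", None)


for text in ["CLE=Trypsin", "cle = Trypsin", "  764.2 2010", "# a comment"]:
    print(repr(text), classify_line(text))
```

Lines reported as "other", such as a parameter written with spaces around the = symbol, are the ones worth checking by hand before submitting a file.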