Data file format

A Mascot data file is a plain text (ASCII) file containing peak list information and, optionally, search parameters.

For a Peptide Mass Fingerprint, the file should contain a list of peptide mass values, one per line, optionally followed by white space and a peak area or intensity value. The Mascot generic format (MGF) is recommended for PMF searches.

For an MS/MS Ions Search, the data file must contain one or more MS/MS peak lists. The recommended format is the Mascot generic format (MGF). In MGF, each MS/MS dataset is a list of pairs of mass and intensity values, delimited by BEGIN IONS and END IONS statements. Mascot also supports mzML (.mzML).

Earlier versions of Mascot Server supported a range of proprietary file formats. These obsolete data file formats are still available in Mascot Server 3.0, but they are hidden by default. Support for the obsolete formats may be removed in a future release.

Mascot Generic Format

The following paragraphs illustrate the data file formats by means of examples. The rules which Mascot follows when parsing a data file provide an alternative description of what is and is not acceptable.

The Mascot generic format for a data file submitted to Mascot is:

[Embedded Parameter(s)]
Query 1
[Query 2]
. . .
[Query N]

Blank lines can be used anywhere to improve readability. Square brackets indicate optional elements; they should not be included in an actual data file.

Comment lines beginning with one of the symbols # ; ! / can be included, but only outside of the BEGIN IONS and END IONS statements that delimit an MS/MS dataset.

A data file may include embedded search parameters. Most embedded parameters can only appear once, at the head of the data file. Certain parameters can appear within an MS/MS dataset.

If there is a conflict between the values of the embedded parameters and values entered into search form fields, the embedded parameters always take precedence. The search form fields are essentially defaults for values missing from the data file.
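The precedence rule can be pictured as a dictionary merge. The following is a minimal sketch, not Mascot's implementation: the function name and the sample values are hypothetical, and it ignores special cases such as IT_MODS inside an MS/MS query, which are appended rather than replaced.

```python
def effective_parameters(form_fields, embedded):
    """Return the settings a search would use under the precedence rule above.

    Form field values act as defaults; embedded data file parameters win
    whenever both supply a value for the same label.
    """
    # Parameter labels are not case sensitive, so normalise before merging.
    merged = {label.upper(): value for label, value in form_fields.items()}
    merged.update({label.upper(): value for label, value in embedded.items()})
    return merged


# Hypothetical form settings and embedded parameters
form = {"CLE": "Trypsin", "PFA": "1", "TOL": "1.2", "TOLU": "Da"}
embedded = {"cle": "Lys-C", "CHARGE": "1+"}
settings = effective_parameters(form, embedded)
print(settings["CLE"])  # Lys-C
```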

Peptide Mass Fingerprint

In the case of a Peptide Mass Fingerprint, each query is just a single peptide m/z value, with an optional second value for peak area or intensity. For example:

764.2
1231.0
1284
1944.8
2020.2
2100.35

764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566

If your MS data system outputs additional values on each line, these will be ignored.

There are two ways to change default search parameters. One way is using the search form fields. The other is to place embedded parameters at the beginning of the data file. For example:

COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566

The embedded parameters (COM, CLE, CHARGE, PFA) over-ride the entries in the corresponding form fields, if any. All of the other search parameters default to the search form settings.

A peptide mass fingerprint data file can only contain peptide mass fingerprint queries. Sequence queries or MS/MS datasets are not permitted.
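As an illustration of the relaxed PMF syntax, here is a small Python sketch that separates embedded parameters from peptide mass queries. It is a reading of the description on this page rather than Mascot's own parser. The names parse_pmf and to_float are hypothetical, and lines that are neither parameters nor numbers are silently skipped, which is an assumption.

```python
import re

COMMENT_CHARS = "#;!/"
# No leading space before the label and no space either side of "="
PARAM_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def to_float(token):
    """Parse a number that may use a comma as thousands separator."""
    return float(token.replace(",", ""))


def parse_pmf(text):
    """Split a PMF data file into embedded parameters and (m/z, intensity) queries."""
    params, queries = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in COMMENT_CHARS:
            continue  # blank lines and comment lines carry no data
        match = PARAM_RE.match(raw.rstrip())
        if match:
            params[match.group(1).upper()] = match.group(2)
            continue
        tokens = line.split()
        try:
            mz = to_float(tokens[0])
        except ValueError:
            continue  # assumption: anything else is skipped in this sketch
        intensity = None
        if len(tokens) > 1:
            try:
                intensity = to_float(tokens[1])
            except ValueError:
                pass  # second value is optional; extra text is ignored
        queries.append((mz, intensity))
    return params, queries


example = """COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
1284
"""
params, queries = parse_pmf(example)
```

In this example, params holds the four embedded parameters and queries holds three tuples, the last with an intensity of None.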

MS/MS Ions Search

For an MS/MS Ions Search, each query represents a complete MS/MS spectrum, and is delimited by a pair of statements: BEGIN IONS and END IONS.

The search form defaults can be over-ridden by including embedded parameters at the beginning of the data file. Parameters specified in the search form or the data file header apply to the entire search. Within each MS/MS query, the mass of the precursor peptide(s) must be specified using one or more PEPMASS parameters. Precursor intensity and charge can be specified by including additional values on the PEPMASS line, delimited by white space. Specifying multiple PEPMASS lines for a query is useful with chimeric spectra.

Certain additional parameters can be specified at query level, between BEGIN IONS and END IONS, as shown in the table below. Parameters within an MS/MS query only apply locally, to the one spectrum. In the case of the CHARGE parameter, this means that you can have a global CHARGE setting, either from the search form or from a parameter at the head of the data file, as well as a local setting in one or more of the MS/MS queries.

This can be useful if the mass spectrometer data system cannot always determine precursor charge state correctly. For example, the global setting could be 2+ and 3+. When an unambiguous charge state can be determined, the correct charge is written to the local CHARGE parameter. Parameters within an MS/MS query must always be at the beginning, immediately following the BEGIN IONS tag. They cannot appear within or following the fragment ion list. For example:

COM=10 pmol digest of Sample X15
ITOL=1
ITOLU=Da
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
MASS=Monoisotopic
USERNAME=Lou Scene
USEREMAIL=leu@altered-state.edu
CHARGE=2+ and 3+
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.60 73
846.80 44
847.60 67
. . .
1640.10 291
1640.60 54
1895.50 49
END IONS
BEGIN IONS
TITLE=Spectrum 2
PEPMASS=1084.9
SCANS=3
RTINSECONDS=25
345.10 237
370.20 128
460.20 108
. . .
1673.30 1007
1674.00 974
1675.30 79
END IONS
BEGIN IONS
TITLE=Spectrum 3
PEPMASS=1244.7
SCANS=29-34
RTINSECONDS=95-97
. . .

In the fragment ion list, the first value on each line is the fragment m/z, the second is the intensity, and the optional third value is the fragment charge.
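The layout of an MS/MS peak list can be made concrete with a short reader. This is a minimal sketch based only on the description above. The function name parse_mgf and the returned structure are hypothetical, the fragment charge is kept as an unparsed string because its notation is not specified here, and non-MS/MS queries outside a query block are ignored.

```python
def parse_mgf(text):
    """Return (header_params, spectra) for an MGF peak list.

    Each spectrum is a dict with 'params', a list of (label, value) pairs
    (a list because PEPMASS may repeat), and 'peaks', a list of
    (m/z, intensity, charge-or-None) tuples.
    """
    header, spectra, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are allowed anywhere
        if line == "BEGIN IONS":
            current = {"params": [], "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None:
            if line[0] in "#;!/":
                continue  # comments are only valid outside a query block
            if "=" in line and not line[0].isdigit():
                label, value = line.split("=", 1)
                header[label.upper()] = value
        elif "=" in line and not line[0].isdigit():
            label, value = line.split("=", 1)
            current["params"].append((label.upper(), value))
        else:
            tokens = line.split()
            try:
                peak = (float(tokens[0]), float(tokens[1]),
                        tokens[2] if len(tokens) > 2 else None)
            except (ValueError, IndexError):
                continue  # assumption: malformed peak lines are skipped
            current["peaks"].append(peak)
    return header, spectra


example = """COM=10 pmol digest of Sample X15
CHARGE=2+ and 3+
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.60 73
846.80 44
END IONS
BEGIN IONS
TITLE=Spectrum 2
PEPMASS=1084.9
SCANS=3
345.10 237 1+
END IONS
"""
header, spectra = parse_mgf(example)
```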

Fragment ion intensity information is very important. Mascot will iteratively select sub-sets of the most intense peaks, looking for the group which most clearly discriminates the score of the top matched protein. There is an upper limit of 10,000 peaks per individual MS/MS spectrum. If you see an error message reporting that this limit has been exceeded, it almost certainly means that your data are profile data, and not peak lists. It is very unlikely that a single MS/MS spectrum could ever contain more than 1000 genuine peaks, never mind 10,000.
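A quick check before submitting a file can catch profile data early. The sketch below assumes the spectra have already been read into (title, peaks) pairs by some other means; the function name and the synthetic data are hypothetical.

```python
MAX_PEAKS = 10_000  # per-spectrum limit quoted above


def oversized_spectra(spectra):
    """Yield (title, peak_count) for spectra with more than MAX_PEAKS peaks."""
    for title, peaks in spectra:
        if len(peaks) > MAX_PEAKS:
            yield title, len(peaks)


# Synthetic data for illustration only
spectra = [
    ("centroided", [(100.0 + i, 1.0) for i in range(250)]),
    ("probably profile", [(100.0 + 0.01 * i, 1.0) for i in range(12_000)]),
]
flagged = list(oversized_spectra(spectra))
print(flagged)  # [('probably profile', 12000)]
```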

It is possible for an MS/MS ions search data file in the Mascot generic format to include sequence queries and peptide mass fingerprint queries.

Here is a rather baroque example:

# following lines define parameters.
# NB no spaces allowed on either side of the = symbol
COM=My favourite protein has been eaten by an enzyme
CLE=Trypsin
CHARGE=2+
# following line will be treated as a peptide mass
1024.6
# following line is a sequence query, which must
# conform precisely to sequence query syntax rules
2321 seq(n-ACTL) comp(2[C])
# so is this
1896 ions(345.6:24.7,347.8:45.4, ... ,1024.7:18.7)
# An MS/MS ions query is delimited by the tags
# BEGIN IONS and END IONS. Space(s)
# are used to separate mass and intensity values
BEGIN IONS
TITLE=The first peptide - dodgy peak detection, so extra wide tolerance
PEPMASS=896.05 25674.3
CHARGE=3+
TOL=3
TOLU=Da
SEQ=n-AC[DHK]
COMP=2[H]0[M]3[DE]*[K]
240.1 3
242.1 12
245.2 32
. . .
1623.7 55
1624.7 23
END IONS
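To show how the different line types in a mixed file can be told apart, here is a rough classifier sketch. The category names are hypothetical, and recognising a sequence query by the qualifiers seq(, comp(, tag(, etag( and ions( is an assumption drawn from the example above, not from the full sequence query syntax rules.

```python
import re

# Assumed qualifier names, taken from the example and the parameter table
QUALIFIER_RE = re.compile(r"\b(seq|comp|tag|etag|ions)\(", re.IGNORECASE)


def classify_line(line, inside_ions=False):
    """Return a rough category for one line of a Mascot generic format file."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped == "BEGIN IONS":
        return "begin"
    if stripped == "END IONS":
        return "end"
    if not inside_ions and stripped[0] in "#;!/":
        return "comment"  # comments are only valid outside a query block
    if stripped[0].isdigit():
        if inside_ions:
            return "fragment"
        if QUALIFIER_RE.search(stripped):
            return "sequence_query"
        return "peptide_mass"
    if "=" in stripped:
        return "parameter"
    return "unknown"


def classify_file(text):
    """Classify every line, tracking whether we are inside a query block."""
    kinds, inside = [], False
    for line in text.splitlines():
        kind = classify_line(line, inside)
        if kind == "begin":
            inside = True
        elif kind == "end":
            inside = False
        kinds.append(kind)
    return kinds


example = """# following lines define parameters.
CLE=Trypsin
1024.6
2321 seq(n-ACTL) comp(2[C])
BEGIN IONS
PEPMASS=896.05 25674.3
240.1 3
END IONS
"""
kinds = classify_file(example)
```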

Embedded Search Parameters

Search parameters can be embedded into the data file or entered in the search form query window using the following parameter labels. In the absence of an embedded parameter, the default value is the setting of the corresponding search form field.

The FORMAT parameter is used to identify obsolete MS/MS dataset formats. It can appear once only, at the start of the file. If there is no FORMAT parameter, the default is Mascot generic format (MGF).

If the peak list format is not MGF, then parameters can only appear once, in the data file header, before the peak list begins.

For an MGF peak list, parameters with a tick in the Header column of the table below can appear in the header, and those with a tick in the Local column can appear in the local scope of a single MS/MS query (spectrum), that is, after the BEGIN IONS line and before the fragment mass and intensity values.

Name	Description	Choices/Range	Notes
ACCESSION	Database entries to be searched	List of double quoted, comma separated values
CHARGE	Peptide charge	1-	M-H- on PMF form
		Mr
		1+	MH+ on PMF form
		N- to N+ where N is an integer and combinations	Not PMF
CLE	Enzyme	Trypsin etc., as defined in enzymes file	No default, so must be specified
COM	Search title		Applies to the whole search
CUTOUT	Precursor removal	Pair of comma separated integers	MIS only
COMP	Amino acid composition
CROSSLINKING	Crosslinking method	as defined in crosslinking.xml	MIS only
DB	Database	As defined in mascot.dat
DECOY	Perform decoy search	0 (false)	Default
DECOY	Perform decoy search	1 (true)
ERRORTOLERANT	Automatic second pass search of selected modification classes	0 (false)	Default
ERRORTOLERANT		1 (true)	Not PMF
ET_CLASSIFICATIONS	Restrict error tolerant search space	Zero or more classifications as defined in unimod_2.xsd
ETAG	Error tolerant sequence tag		A single query can have multiple ETAGs
FORMAT	MS/MS data file	Mascot generic	Default
		mzML (.mzML)
		Sequest (.DTA)	Obsolete
		Finnigan (.ASC)	Obsolete
		Micromass (.PKL)	Obsolete
		PerSeptive (.PKS)	Obsolete
		Sciex API III	Obsolete
		Bruker (.XML)	Obsolete
		mzData (.XML)	Obsolete
FRAMES	NA translation	Comma separated list of frames	Default is 1,2,3,4,5,6
INSTRUMENT	MS/MS ion series	Default	Default
INSTRUMENT	MS/MS ion series	ESI-QUAD-TOF etc., as defined in fragmentation_rules
ION_MOBILITY	Drift time	floating point number
IT_MODS	Variable Mods	As defined in unimod.xml
ITOL	Fragment ion tol.	Unit dependent
ITOLU	Units for ITOL	ppm
		Da
		mmu
LIBRARY_SEARCH	Allow search to include spectral libraries	0 (false)	Default
LIBRARY_SEARCH	Allow search to include spectral libraries	1 (true)
LOCUS	Hierarchical scan range identifier	string	MIS only
MASS	Mono. or average	Monoisotopic
MASS	Mono. or average	Average
ML_ADAPTER_PARAM	Parameters to pass to machine learning adapters	May appear zero or more times. Allowed values are defined in ML_adapters.toml config file.
MODS	Fixed Mods	As defined in unimod.xml
MULTI_SITE_MODS	Allow two modifications at a single site	0 (false) or 1 (true)	default 0
PEP_ISOTOPE_ERROR	Misassigned ¹³C	0 to 2	MIS only
PEPMASS	Peptide mass	>100	optionally followed by intensity and charge; multiple lines allowed if chimeric spectrum
PERCOLATE	Refine results with machine learning	0 (false)	Default
PERCOLATE	Refine results with machine learning	1 (true)
PFA	Partials	integer, 0 to 9	default 1
PRECURSOR	Precursor m/z	>100
QUANTITATION	Quantitation method	as defined in quantitation.xml	MIS only
RAWFILE	Raw file identifier	string	MIS only
RAWSCANS	Native scan range identifiers	a[:b]	MIS only
REPORT			Obsolete
REPTYPE			Obsolete
RTINSECONDS	Retention time or range (in seconds)	a[-b]	MIS only
SCANS	Scan number or range	v[-w]	MIS only
SEARCH	Type of search	PMF
		SQ	= MIS
		MIS	= SQ
SEG	Protein mass (kDa)	Empty or >0
SEQ	Amino acid sequence		A single query can have multiple SEQs
TAG	Sequence tag		A single query can have multiple TAGs
TARGET_FDR_PERCENT	Target FDR	Floating point number between 0 and 100	default 0
TAXONOMY	Taxonomy	As defined in taxonomy file
TITLE	Query title		Applies to a single spectrum
TOL	Peptide mass tol.	Unit dependent
TOLU	Units for TOL	%
		ppm
		mmu
		Da
USER00 to USER12		Uncommitted parameters
USEREMAIL	User email
USERNAME	User name

Search parameters that override global defaults set in the Options section of mascot.dat are prefixed OPTION_. These parameters can only appear in the peak list header.

OPTION_DechargeFragmentPeaks overrides DechargeFragmentPeaks. If the MGF peak list contains charge information for fragments, this positive integer is the maximum absolute charge state to be decharged, default 10. A value of 0 means ignore the fragment charge state. Peaks will be decharged to MH⁺ or MH⁻ values when three conditions are satisfied: (i) fragment charge information is present in the peak list, (ii) the MH⁺ or MH⁻ value will be less than 16384, (iii) the MH⁺ or MH⁻ value will be less than that of the precursor.
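The three conditions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Mascot's code: the function names are hypothetical, the proton mass is the standard value rather than something given on this page, and leaving a peak unchanged when a condition fails is an assumption, since the text does not say what happens to such peaks.

```python
PROTON = 1.007276  # proton mass in Da (standard value, not from this page)


def singly_charged(mz, charge):
    """Convert m/z at a signed, non-zero charge to its singly charged equivalent."""
    n = abs(charge)
    sign = 1 if charge > 0 else -1
    # Positive ions lose (n - 1) protons; negative ions regain them
    return mz * n - sign * (n - 1) * PROTON


def decharge(mz, charge, precursor_singly_charged, max_charge=10):
    """Return the decharged m/z, or the original m/z if a condition fails."""
    # Condition (i): charge information must be present and within the limit
    if charge is None or charge == 0 or max_charge == 0 or abs(charge) > max_charge:
        return mz
    candidate = singly_charged(mz, charge)
    # Conditions (ii) and (iii)
    if candidate < 16384 and candidate < precursor_singly_charged:
        return candidate
    return mz  # assumption: the peak is left as it is


precursor = singly_charged(797.36086, 2)   # precursor m/z 797.36086, 2+
fragment = decharge(500.5, 2, precursor)   # 2+ fragment, decharged
too_heavy = decharge(900.0, 2, precursor)  # would exceed the precursor
```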

OPTION_MaxPepNumVarMods overrides MaxPepNumVarMods. The maximum number of different variable mods allowed in a single peptide match.

OPTION_MaxPepNumModifiedSites overrides MaxPepNumModifiedSites. The maximum number of sites carrying variable mods allowed in a single peptide match.

OPTION_MaxPepModArrangements overrides MaxPepModArrangements. The maximum number of arrangements of variable mods tested to obtain a single peptide match.

Specifying a scan or time range

Although scan and retention time information is not used directly in the Mascot search, it can be very useful for applications that import the Mascot search results. Two obvious cases are quantitation and refining results using machine learning. If a peak list contains data from multiple raw files, annotating scan and retention time information in a structured and non-verbose manner can become complicated. The MGF format includes a choice of parameters for this purpose:

RTINSECONDS Anything from a single retention time to a complex list of retention time ranges. This parameter is for passing machine readable information, not for display, so there is no RTINMINUTES, etc. When there are multiple raw files, there can be multiple RTINSECONDS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RTINSECONDS[0].

SCANS Anything from a single scan number to a complex list of ranges, e.g. SCANS=1278,1280-1284,1290-1294,1298. When there are multiple raw files, there can be multiple SCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. SCANS[3].

RAWSCANS Identifiers corresponding to the data structure in the raw file. A two letter abbreviation followed by a number for each level of the hierarchy and a colon is used to delimit the start and end of a range. When there are multiple raw files, there can be multiple RAWSCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RAWSCANS[1].

For example, AB Sciex Analyst scans are characterised by a triplet of period, cycle, and experiment, which is represented as pd1cy2ex3.

Analyst	pd1cy2ex3
Masslynx	fn2ix1
LCMS Solution	sg1ev4sn53
Kratos Axima	wlJ5
Generic (scan number) – Xcalibur, mzXML, Bruker .yep/.baf, Agilent QTOF	sn492

RAWFILE An identifier to relate a query back to one or more raw files. Can be a file name or file path or anything else that is meaningful to the downstream application.

LOCUS A hierarchical identifier used mainly by AB Sciex software. A unique combination of file, sample, period, cycle, and experiment might be represented as 2.1.1.24.1. Mascot treats this as a string and simply passes it through to the result file, so the content can be anything meaningful to the downstream application.
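A downstream application that reads these values back might unpack them along the following lines. This is a sketch with hypothetical function names. It covers comma separated lists of single values and ranges for SCANS and RTINSECONDS, and identifiers built from two-letter abbreviations followed by numbers for RAWSCANS. The range sn10:sn12 in the usage lines is an invented illustration of the a[:b] form.

```python
import re


def parse_ranges(value, number=int):
    """Parse '1278,1280-1284,1298' into a list of (start, end) tuples.

    Use number=float for RTINSECONDS values such as '95-97'.
    """
    ranges = []
    for part in value.split(","):
        start, _, end = part.strip().partition("-")
        ranges.append((number(start), number(end or start)))
    return ranges


LEVEL_RE = re.compile(r"([A-Za-z]{2})(\d+)")


def parse_rawscan(identifier):
    """Split 'pd1cy2ex3' into [('pd', 1), ('cy', 2), ('ex', 3)]."""
    return [(abbr, int(num)) for abbr, num in LEVEL_RE.findall(identifier)]


def parse_rawscans(value):
    """Parse a RAWSCANS value of the form a[:b] into (start, end) identifiers."""
    start, _, end = value.partition(":")
    first = parse_rawscan(start)
    return first, (parse_rawscan(end) if end else first)


scans = parse_ranges("1278,1280-1284,1290-1294,1298")
times = parse_ranges("95-97", number=float)
analyst = parse_rawscan("pd1cy2ex3")
span = parse_rawscans("sn10:sn12")  # hypothetical range
```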

When there are multiple raw files, adding an index is the most concise way of connecting queries and raw file(s). For example, an MGF peak list from Distiller for a multi-file project might look like this:

_DISTILLER_RAWFILE[0]={1}C:\data\replicate\Orbi_0319_01.RAW
_DISTILLER_RAWFILE[1]={1}C:\data\replicate\Orbi_0319_02.RAW
_DISTILLER_RAWFILE[2]={1}C:\data\replicate\Orbi_0319_08.RAW
. . .
BEGIN IONS
TITLE=22927: Scan rt=4669.74 from file [2]
PEPMASS=797.36086 89994.258
CHARGE=2+
SCANS[2]=48055
RAWSCANS[2]=sn8964
RTINSECONDS[2]=4669.736
227.05463 199.54773
242.21568 120.42233
. . .

This is fine if a single application creates the merged peak list. It cannot be used so easily when one application creates a peak list from each file and a second application independently merges these peak lists into a single search. In such cases, the RAWFILE or LOCUS parameter can be used to embed an identifier into each query as the peak list is created. This identifier then travels with the query as the peak lists are merged and is written to the search result file by Mascot.
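For the indexed form, a reader could group parameters by raw file index as sketched below. The function name is hypothetical, and the value of _DISTILLER_RAWFILE, including its {1} prefix, is kept as an opaque string because its meaning is not described here.

```python
import re
from collections import defaultdict

# LABEL=value or LABEL[index]=value
INDEXED_RE = re.compile(r"^(\w+?)(?:\[(\d+)\])?=(.*)$")


def by_raw_file(lines):
    """Group indexed parameters by their zero-based raw file index."""
    grouped = defaultdict(dict)
    for line in lines:
        match = INDEXED_RE.match(line)
        if match and match.group(2) is not None:
            label, index, value = match.groups()
            grouped[int(index)][label.upper()] = value
    return dict(grouped)


lines = [
    r"_DISTILLER_RAWFILE[2]={1}C:\data\replicate\Orbi_0319_08.RAW",
    "CHARGE=2+",  # not indexed, so not grouped
    "SCANS[2]=48055",
    "RAWSCANS[2]=sn8964",
    "RTINSECONDS[2]=4669.736",
]
grouped = by_raw_file(lines)
```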

Ion mobility

The argument of the ION_MOBILITY parameter has two parts. The first part must be an exact copy of the value of an existing PEPMASS line in the same query, while the second part is the drift time as a floating point number.

This is OK:
PEPMASS=498.34 25674.3 2+
ION_MOBILITY=498.34 25674.3 2+ 1.5

This is also OK:
PEPMASS=498.34 25674.3
ION_MOBILITY=498.34 25674.3 1.5

This is an error because PEPMASS includes charge but ION_MOBILITY does not:
PEPMASS=498.34 25674.3 2+
ION_MOBILITY=498.34 25674.3 1.5

Mascot doesn’t use the value in matching, but just passes it through to the query section in the results file.
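The matching rule can be checked before a file is submitted. The sketch below uses a hypothetical function name and compares the values token by token, so differences in the amount of white space are tolerated. That tolerance is an assumption, since the text asks for an exact copy.

```python
def check_ion_mobility(ion_mobility, pepmass_values):
    """Return (matching PEPMASS value, drift time) or raise ValueError.

    ion_mobility is the text after 'ION_MOBILITY='; pepmass_values holds
    the text after 'PEPMASS=' for every PEPMASS line in the same query.
    """
    tokens = ion_mobility.split()
    if len(tokens) < 2:
        raise ValueError("expected a PEPMASS copy followed by a drift time")
    drift_time = float(tokens[-1])  # raises ValueError if not a number
    copied = tokens[:-1]
    for pepmass in pepmass_values:
        if pepmass.split() == copied:
            return pepmass, drift_time
    raise ValueError("ION_MOBILITY does not repeat any PEPMASS line")


ok = check_ion_mobility("498.34 25674.3 2+ 1.5", ["498.34 25674.3 2+"])
also_ok = check_ion_mobility("498.34 25674.3 1.5", ["498.34 25674.3"])
try:
    check_ion_mobility("498.34 25674.3 1.5", ["498.34 25674.3 2+"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```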

Intensity values

The MGF format allows intensity information to be associated with peptide and fragment m/z values. It doesn’t specify what these values represent, which is determined by the peak picking software. They could be peak height or peak area and they could be for the ¹²C peak or for the complete isotope distribution. Units are generally arbitrary and absolute values have no meaning.

During a Mascot search, subsets of the most intense peaks are selected and scored iteratively, looking for the best score, which presumably corresponds to an optimum separation of signal peaks from noise peaks. In the result report for an MS/MS search, the spectra in the unassigned list can be sorted by precursor intensity (in case it is of interest to see which are the strongest spectra that failed to get a significant match). For these purposes, as long as the intensity values are derived in a consistent manner, it doesn’t greatly matter what they represent.

If the peak list is being used for quantitation, then the origin of the intensity values will be of greater interest. If Mascot Distiller is being used for peak picking, a setting in preferences can be used to choose between S/N, which behaves like height, or area under the complete isotope distribution. However, Distiller can also be configured to pass through centroid values direct from the raw data, in which case the intensity will be whatever value was assigned by the instrument data system. This is only relevant for MS2 quantitation (iTRAQ / TMT). Distiller MS1 quantitation is always based on integrating survey scan intensity across the elution profile of the precursor, and this information is not present in the peak list used for the search.

The Rules

Filename extensions are not significant.
Numeric values must be non-localised US ASCII. That is, the decimal separator must be a period and the thousands separator, if any, must be a comma. Leading white space is acceptable on lines that start with a number.
Parameter labels are not case sensitive. Parameter values may be case sensitive. Case is preserved for parameter values which are free text strings. There must be no leading space before a parameter label and no space either side of the = symbol.
Parameters at the head of the data file apply to the entire search and over-ride the default settings provided by the search form fields.
In the absence of a FORMAT parameter, the default format is Mascot generic.
Mascot generic format permits an MS/MS search to include peptide mass fingerprint queries and sequence queries.
In Mascot generic format, each MS/MS spectrum is delimited by BEGIN IONS and END IONS statements. There is a line for each fragment ion peak, containing an m/z and intensity value, separated by white space. Fragment ion m/z values must be positive, non-zero values. Intensities must be positive values. The third value is fragment charge, which is optional. Any additional values or text are ignored.
Parameters between the BEGIN IONS and END IONS statements only apply to the local MS/MS query. At least one PEPMASS parameter is required, all others are optional. Parameters within an MS/MS query must appear before the fragment ion data. If an MS/MS query has no fragment ions, it is treated as a PMF query.
Most parameters can only appear at the head of the file, prior to any query data. The exceptions are PEPMASS, TITLE, SCANS, RTINSECONDS, RAWFILE, LOCUS, and RAWSCANS which can only appear within an MS/MS query block, and CHARGE, INSTRUMENT, IT_MODS, TOL, and TOLU, which can appear in either place. SEQ, COMP, TAG and ETAG can appear within an MS/MS query block or as qualifiers to a mass value using the Sequence Query syntax. When IT_MODS are specified within an MS/MS query block, they are appended to any IT_MODS specified at the head of the file or in the search form.
Blank lines can be used anywhere to improve readability.
Lines that start with one of the symbols # ; ! / are comment lines and are ignored. Comments cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
A SEARCH type must be defined (PMF, SQ or MIS). The default is determined by the search form used to upload the file. Like any other parameter, this can be over-ridden by including a SEARCH parameter in the file header.
A peptide mass fingerprint (PMF) search can only contain PMF queries. This allows for a relaxed syntax in which any line starting with a number is assumed to be a query. The first number is parsed as a peptide m/z value and the second number, if any, is parsed as a peak area or intensity. The rest of the line is ignored. Peptide m/z values must be equivalent to 100 <= Mr <= 16000.
MS/MS searches can contain MS/MS data in proprietary formats only if this is declared with a FORMAT parameter. Mixing proprietary formats, or including non-MS/MS queries in a proprietary format file, is not allowed.
User parameters are any parameters named USER\d\d (where \d is a digit) or any name beginning with an underscore except for the following, which are reserved:
_INSIGHT_*
_INTEGRA_*
_DAEMON_*
_DISTILLER_*
_SERVER_*
User parameters cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
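The naming rule for user parameters translates directly into a small check. This is a sketch with a hypothetical function name. It treats labels as case insensitive, in line with the rule on parameter labels, and does not check where in the file the parameter appears.

```python
import re

RESERVED_PREFIXES = ("_INSIGHT_", "_INTEGRA_", "_DAEMON_", "_DISTILLER_", "_SERVER_")
USER_RE = re.compile(r"USER[0-9]{2}")


def is_user_parameter(label):
    """Return True if label is a valid user parameter name under the rule above."""
    name = label.upper()  # parameter labels are not case sensitive
    if USER_RE.fullmatch(name):
        return True
    return name.startswith("_") and not name.startswith(RESERVED_PREFIXES)


print(is_user_parameter("USER05"))              # True
print(is_user_parameter("_MYLAB_BATCH"))        # True (hypothetical name)
print(is_user_parameter("_DISTILLER_RAWFILE"))  # False, reserved prefix
```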

mzML (.mzML)

Mascot supports mzML version 1.1.0.

mzML format can contain centroided spectra or profile data. Mascot only supports centroided spectra. If you submit profile data, you will get very poor results. If any peak list has more than 10,000 masses, the search may terminate with an error. Check your peak picking settings carefully. If in doubt, try processing the file with Mascot Distiller.
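One way to check a file before submission is to count how many spectra are annotated as centroid or profile. The sketch below relies on the PSI-MS controlled vocabulary accessions MS:1000127 (centroid spectrum) and MS:1000128 (profile spectrum). It only inspects cvParam elements placed directly inside each spectrum element and does not resolve referenceable parameter groups, so treat "unknown" as a prompt for a closer look. The function name and the toy document are hypothetical, and the toy document is not a schema-valid mzML file.

```python
import xml.etree.ElementTree as ET

MZML_NS = "http://psi.hupo.org/ms/mzml"
CENTROID = "MS:1000127"  # PSI-MS accession: centroid spectrum
PROFILE = "MS:1000128"   # PSI-MS accession: profile spectrum


def count_spectrum_types(mzml_text):
    """Count spectra annotated as centroid or profile in an mzML document."""
    root = ET.fromstring(mzml_text)
    counts = {"centroid": 0, "profile": 0, "unknown": 0}
    for spectrum in root.iter(f"{{{MZML_NS}}}spectrum"):
        accessions = {
            cv.get("accession")
            for cv in spectrum.findall(f"{{{MZML_NS}}}cvParam")
        }
        if CENTROID in accessions:
            counts["centroid"] += 1
        elif PROFILE in accessions:
            counts["profile"] += 1
        else:
            counts["unknown"] += 1
    return counts


# Toy document for illustration only
sample = """<mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
  <run id="r1">
    <spectrumList count="2">
      <spectrum index="0" id="scan=1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum"/>
      </spectrum>
      <spectrum index="1" id="scan=2" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000128" name="profile spectrum"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""
counts = count_spectrum_types(sample)
```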

Matrix Science