Practical 06: Principal component analysis (PCA)

A. Introduction

In practical 2, we’ve simulated the dynamics of a small protein. Although a protein’s function is often intimately linked to its conformational dynamics, it is not always straightforward to extract the functionally relevant motions from a simulated trajectory. To illustrate this, we will focus today on the dynamics of a small, flexible protein: bacteriophage T4 lysozyme. Many organisms, including humans, have a form of lysozyme, for instance in tears and human milk. Lysozyme degrades polysaccharides (sugar molecules) that are part of bacterial cell walls and as such it has an antibacterial function. Click here for more background information on lysozyme.

Download the structure of T4 lysozyme complexed to a substrate analog (PDB code 148L) from the Protein Data Bank, and view the structure in PyMol:

pymol 148L.pdb

Alternatively, just start PyMol and type fetch 148L into the PyMol console. Do you see the substrate? Depending on your PyMol version you see lysozyme in a useful representation or not. You can further highlight the substrate by typing (on the PyMol prompt or in the graphics window):

select sub, (resn nag,mub)
color red, sub
show sticks, sub

(These commands show the molecules with residue names NAG or MUB as red sticks.) Now select the catalytic residues:

select cat, (resi 11,20,26)
color yellow,cat
show sticks,cat

Now highlight the secondary structure:

show cartoon

Let’s also try the surface representation of the protein.

select protein, (polymer)
show surface, protein

As you can see, the substrate is almost enclosed by the protein, with the active site (catalytic) residues buried in a groove on the protein surface.

Question: Speculate, why many enzymes have a deep binding cleft to which the substrate must bind?

💡 To catalyze a chemical reaction, the catalytic residues must be placed exactly at the right position relative to the substrate. Defining and maintaining such an arrangement (for sufficient time to allow the reaction to happen) is often only possible when the substrate is tightly bound in a well-defined geometry.

Since it would take too long to generate a well-converged MD simulation during the practical, we have prepared a simulation of the (substrate-free) enzyme. Now download this trajectory fragment and view it in PyMol:

pymol t4l.pdb

Try out different visualization modes in PyMol to view as clearly as possible the structure and the dynamics of the protein. Use dss command to show the secondary structure. Use the same selections as above to highlight the catalytic residues and the secondary structure and press the play button at the bottom right of the main PyMol window to see an animation of the simulation.

You also can load the 148L structure and align both structures. This will give you some insight on the function of the observed motions.
Question: Which motions can you identify? How can you imagine those motions to be crucial for the protein’s function?

💡 Think about the previous Question and hint…

B. Principal component analysis (PCA) of an MD simulation

Probably, you will appreciate the difficulty of identifying functionally relevant motions. One of the main reasons for this is that we see everything moving at the same time. Both local fluctuations and collective motions occur simultaneously, which makes it hard to distinguish the two types of motion from each other. A principal component analysis (PCA) can help in such cases, as it can filter global, collective (often slow) motions from local, fast motions.

Download ref.pdb and md1_backbone.xtc. To reduce the size of the analysis, we will concentrate on the backbone only for the analysis.

First, get GROMACS into your path using

source /home/gromacs/softwarerc

Now, have a look at the raw (backbone) trajectory:

gmx trjconv -s ref.pdb -f md1_backbone.xtc -o md1.pdb -e 1000

(answer 0 when asked for a group).

pymol md1.pdb

Press the play button at the bottom right of the main PyMol window. If it is too fast, you can move from frame to frame using the next and previous buttons or drag the movie bar at the bottom of the PyMol window.

Now let’s start the principal component analysis of this trajectory. This is done by building a so-called covariance matrix of the atomic fluctuations. Diagonalisation of this matrix yields a set of eigenvectors and eigenvalues, that describe collective modes of fluctuations of the protein. Recall:

  • The eigenvectors corresponding to the largest eigenvalues are called “principal components”. They represent the largest amplitude of collective motions.
  • The motions along the principal components are obtained by orthogonally projecting the protein coordinates onto the eigenvectors.
  • The eigenvalues give the variance of the principal components, i.e., the contribution of the principal component to the overall fluctuation of the protein.
gmx covar -s ref.pdb -f md1_backbone.xtc

Answer 0 twice when asked for a group. If you issue a

ls -lrt

You’ll see that gmx covar has generated both eigenvalues and eigenvectors, in files called eigenval.xvg and eigenvec.trr. Recall that the eigenvalues indicate the contributions of the eigenvectors to the mean square fluctuations in the given ensemble. View the eigenvalue spectrum with:

xmgrace eigenval.xvg

The eigenvalues are sorted according to their size. Use the zoom bottom of xmgrace 🔍 to zoom to the largest eigenvalues (with the smallest eigenvector index). As you can see, there are only a few large eigenvalues, all others are relatively small, suggesting that a large fraction of the total fluctuations are explained by only a few principal components.

To see what type of motion the individual eigenvectors correspond to, we filter the original trajectory and project on a selected eigenvector. For example, along the first eigenvector:

gmx anaeig -s ref.pdb -f md1_backbone.xtc -filt filter1.pdb -first 1 -last 1 -skip 100

View the animation with:

pymol filter1.pdb

To see the movie slower you can in the PyMol tools bar: Movie → Frame Rate → 5 FPS

Question: How would you describe this type of motion? How does it compare to the raw trajectory?

💡 The structure is now fluctuating along a single collective vector, namely along the first PCA vector. You see the first principal component as a function of simulation time. Because the input ensemble contained relatively few frames (large time step), the structure seems to “jump”.

As you can see, the animation is kind of jerky. This is because even such large-scale motions do not occur smoothly, but stochastically. To see a smooth animation of the motion along the first eigenvector, we can artificially interpolate between the extreme conformations sampled during the simulation along this eigenvector:

gmx anaeig -s ref.pdb -f md1_backbone.xtc -extr extreme1.pdb -first 1 -last 1 -nframes 30

View with:

pymol extreme1.pdb

Now repeat the last two commands for the second eigenvector.

Question: How would you describe the motion along the first and second eigenvectors? What happens with the substrate-binding cleft in these two motions?

💡 You may see a hinge-bending motion and a twisting motion. Try to imagine how these motions allow lysozyme to arrange the catalytic residues relative to the substrate.

To see the contrast with the other eigenvectors, now repeat the above procedure for eigenvector 20 (or 53, or any other number of your choice).

Question: The backbone of T4 lysozyme contains 486 atoms (162 residues and 3 backbone atoms per residue). How many eigenvectors/values would you expect? How many eigenvalues should be exactly zero, and why? How many should be approximately zero?

💡 With N atoms, how many degrees of freedom do you have? So how many basis vectors are needed to describe the conformational space?

💡 Before gmx covar computed the PCA, the translation and rotation of the overall lysozyme were removed by a least-square fit. This is done because we are mainly interested in the internal motions of the enzyme. So many eigenvalues should be (nearly) zero. Check your guess at the last lines of the eigenvalue file (tail -n20 eigenval.xvg, but consider rounding errors).

For a more quantitative analysis, we can project the trajectory onto individual eigenvectors (giving the “principal components”), and display the projections as a function of time. Let us look at the projection onto the first 10 eigenvectors:

gmx anaeig -s ref.pdb -f md1_backbone.xtc -proj -first 1 -last 10
xmgrace -nxy proj.xvg

Question: Which eigenvector displays the slowest motion? Would you consider the simulation long enough to have sampled all relevant configurations? (one criterion for this is that all sub-conformations should have been visited multiple times).

What about the sampling of the higher eigenvectors? For instance, take a look at the projections onto eigenvectors 60 through 69:

gmx anaeig -s ref.pdb -f md1_backbone.xtc -proj -first 60 -last 69
xmgrace -nxy proj.xvg
Question: Would you consider the sampling of the higher eigenvectors as converged? Are these motions particularly interesting?

💡 The principal components with small eigenvalues describe local vibrations, which are sampled quickly but are not very interesting. The interesting large-scale motions (with large eigenvalues) are sampled slowly. Therefore, the distributions of the first few principal components are often not fully converged.

Another way to visualize the sampled conformations in the subspace spanned by the eigenvectors is a so-called two-dimensional projection:

gmx anaeig -s ref.pdb -f md1_backbone.xtc -2d -first 1 -last 2
xmgrace 2dproj.xvg

In this graph, each point represents a snapshot from the simulation, and the distribution shows the sampled region along the first two eigenvectors during the simulation.

C. Comparison to experimental structural data

We can use a PCA not only to analyze an MD simulation and extract the main concerted motions from a trajectory, but also for a comparison of multiple simulations or for the comparison to an ensemble of experimental structures, in terms of dominant, collective motions. For T4 lysozyme, a large number of x-ray structures has been determined (if you search the PDB for “T4 lysozyme”, you’ll find more than 400 structures). We have collected 38 non-redundant x-ray structures of T4 lysozyme, which are all together collected in allpdb_bb.xtc. Just like on the simulated trajectory, we can perform a PCA on the collection of experimental structures:

gmx covar -s ref.pdb -f allpdb_bb.xtc

Again, choose 0 twice when asked for a group. Use the gmx anaeig program like above to analyze the results.

Question: What kind of motion is represented by the first and second eigenvector? How does this compare to the results from the simulation?

💡 Do you see again hinge-bending and twisting motions? Which motions correspond to the largest eigenvalue (largest variance in the ensemble)?

For a quantitative comparison of the simulation and the experimental structures, we’ll project both the experimental structures and the simulated structures onto the eigenvectors extracted from the x-ray structures:

gmx anaeig -s ref.pdb -f md1_backbone.xtc -2d md_2d.xvg -first 1 -last 2
gmx anaeig -s ref.pdb -f allpdb_bb.xtc -2d xray_2d.xvg -first 1 -last 2
xmgrace md_2d.xvg xray_2d.xvg

You now see the MD simulation in black and the x-ray structures in red. Since the x-ray structures are independent, it is confusing to have them connected by lines, which suggests a continuous path between them. In xmgrace, double click on one of the data points inside the plot, select the second set (G0.S1), Main → Line properties → Type → None, then Main → Symbol properties → Type → Circle. Finally, prsee Accept.

As you will have seen, eigenvector 1 of the x-ray ensemble describes an opening motion similar to the animation on the top-right of this screen, whereas the second eigenvector describes a twisting motion of both domains concerning each other.

Question: How do the x-ray and MD ensemble compare to each other? Why do you think the MD simulation does not sample the x-ray configurations on the extreme left of the plot (the most open x-ray conformations)? And why do you think the twisting mode is sampled with a larger amplitude in the simulation than in the ensemble of x-ray structures?

💡 There are several possible explanations: The MD simulation was probably too short to visit all possible configurations along the opening motion. However, differences might also arise due to crystal contacts in the x-ray structures. For instance, the crystal might block certain conformations along the twisting mode, or certain twisting modes might not crystallize. In addition, some x-ray structures might be stabilized by substrates, which were not present in the MD simulation.

Two additional simulations are available for T4 lysozyme, that started from different x-ray structures. Download the two backbone trajectories to your local disk: md2_backbone.xtc and md3_backbone.xtc. Cross-project these two trajectories onto the x-ray eigenvectors and visualize all three simulations and the ensemble of x-ray structures projected onto the first two eigenvectors of the x-ray ensemble.

Question: Which of the three simulations covers most of the crystallographic structures? Do you consider the simulations long enough to sample the most relevant conformations of the protein?

Question: All three simulations were of the substrate-free protein. Would you expect a more “open” or a more “closed” state as the most probable configuration in this situation?

💡 Many enzymes (but not all) take mostly an open conformation in the absence of the substrate, and then close upon substrate binding. In such cases, substrate-protein contacts stabilize a closed, compacted state.

Question: Alternatively, we could have carried out a Normal Modes analysis of this protein, approximating the energy landscape by a multidimensional harmonic energy minimum. Does the motion along the first PCA eigenvectors appear (quasi)-harmonic?

💡 An approximately harmonic (quasi-harmonic) potential would imply a Gaussian distribution along the first PCA vector.

D. Optional

Think back to the first tutorial, where we simulated the noble gas Argon.

Question: For Argon in the gas phase (without periodic boundary conditions), what kind of results would you expect from a PCA analysis? Particularly, what would you expect the eigenvalue spectrum to look like?

Carry out a short simulation of Argon in the gas phase, using this topology, these starting coordinates, and these MD parameters, and using the gmx grompp and the gmx mdrun modules. Now carry out a PCA over the Argon trajectory:

gmx covar -s argon_gas.pdb -f traj_comp.xtc -nofit

Question: Is the eigenvalue spectrum as expected?

Have a look at the projections of the principal eigenvectors with:

gmx anaeig -s argon_gas.pdb -f traj_comp.xtc -proj -first 1 -last 6


xmgrace proj.xvg
Question: Do you note the similarity of these projections to cosines? What implications does this have for protein trajectories, where the principal mode projections appear as cosines?

💡 The observation of the cosine-like appearance of the principal mode projections indicates that, as in the case of the Argon gas, the protein motion is primarily diffusive. What does that mean? That most likely means that the protein is exploring a rather flat energy landscape, as a multidimensional random walk, without observing any major barriers. This is a strong indication that the trajectory has not converged, i.e. has not reached equilibrium. The “cosine content” can be obtained from gmx analyze -cc, yielding a powerful convergence criterion. Note that the fact that a high cosine content in a principal mode from a protein trajectory does not mean that the direction corresponding to that mode is not meaningful, it just means that the motion has not yet converged. Whether a direction is meaningful or not can be analyzed by checking how similar the principal modes from one part of the trajectory are to those obtained from another part. Such a comparison can be carried out using the overlap measure activated by gmx anaeig -over.

Further references and advanced reading

  • A Kitao and N Go. Investigating protein dynamics in collective coordinate space Current Opinion in Structural Biology 9: 164-169 (1999). [link]
  • A Amadei, ABM Linssen and HJC Berendsen. Essential Dynamics of Proteins PROTEINS: Structure, Function, and Genetics 17:412-425 (1993). [link]
  • Herman JC Berendsen and Steven Hayward Collective protein dynamics about function Current Opinion in Structural Biology 10:165-169 (2000). [link]