This tutorial is designed to explain how to use Magpie to make a machine-learning-based model for the formation energy of crystalline compounds. It covers installation, launching Magpie from the command-line, preparing data in a Magpie-friendly format, and the basics of creating and using models.
Before installing Magpie, make sure your system has the Java Runtime Environment, version 7 or greater. To check, open your computer's command-line prompt and call java -version. If the first line of the output does not report version 1.7 or later (e.g., java version "1.7.0_71"), go to Java.com and download the latest version.
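The version check can also be scripted. The Python sketch below is not part of Magpie; it assumes the java executable is on your PATH, and the helper names parse_java_major and installed_java_major are hypothetical. It reads the first line of the version banner and handles both the older "1.7.0_71" numbering and the newer "11.0.2" numbering.

```python
import re
import subprocess


def parse_java_major(first_line):
    """Return the major Java version from a version banner line, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', first_line)
    if match is None:
        return None
    major = int(match.group(1))
    # Releases before Java 9 report themselves as 1.x (e.g., 1.7 means Java 7).
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and parse the first line of its output."""
    # The JVM prints its version banner to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    lines = (result.stderr or result.stdout).splitlines()
    return parse_java_major(lines[0]) if lines else None
```

If installed_java_major() returns a number of 7 or more, the Java requirement described above is met. The call raises FileNotFoundError when no java executable can be found.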
Once your computer has the correct version of Java, download the latest version of Magpie from OQMD and extract it. The ZIP file available from this link is updated nightly. The extracted folder includes a compiled version of Magpie, this documentation, and a few example scripts. To verify that everything works, open a command prompt, navigate to your new Magpie folder, and call
java -jar dist/Magpie.jar
(Note: If you are using the Windows Command Prompt, the command to launch Magpie is java -jar dist\Magpie.jar.)
This should open an interactive prompt for Magpie. Press "Enter" or type "exit" to close this prompt.
As a more advanced starting test, launch Magpie with the example input script examples/simple-model.in (e.g., by calling java -jar dist/Magpie.jar examples/simple-model.in). You should see the commands from the input file echoed, and their output printed, to the screen. If so, Magpie is ready to run on your system.
In general, Magpie expects whitespace-delimited input files where the first line is a header describing the data (e.g., property names) and the first column is a string describing the material. For example, a dataset containing the composition and formation energy of materials could look like:
composition delta_e stability{Yes,No}
NaCl -5 Yes
Fe2O3 -4.2 Yes
Ni3.00Al1 -0.4 Yes
Ni3F None No
A few key things to note about this example: the acceptable format for the composition is broad, "None" can be listed if a measurement is not available, and categorical properties can be defined by listing the category names in braces after the property name. Further details of input file formats are described in the Javadoc for Magpie (e.g., see CompositionDataset).
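To make the file layout concrete, here is a minimal Python sketch that reads text in the format shown above. It is not Magpie's own parser, and the function name parse_dataset is hypothetical. It assumes that category names contain no spaces and that every non-categorical property is numeric or "None".

```python
def parse_dataset(text):
    """Parse whitespace-delimited text into (header, categories, rows)."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, categories = [], {}
    for token in lines[0].split():
        # A header such as stability{Yes,No} declares a categorical property.
        if token.endswith("}") and "{" in token:
            name, _, options = token[:-1].partition("{")
            categories[name] = options.split(",")
        else:
            name = token
        header.append(name)
    rows = []
    for line in lines[1:]:
        fields = line.split()
        row = {header[0]: fields[0]}  # first column: string describing the material
        for name, value in zip(header[1:], fields[1:]):
            if value == "None":
                row[name] = None          # missing measurement
            elif name in categories:
                row[name] = value         # categorical label, kept as text
            else:
                row[name] = float(value)  # numeric property
        rows.append(row)
    return header, categories, rows


sample = """composition delta_e stability{Yes,No}
NaCl -5 Yes
Fe2O3 -4.2 Yes
Ni3.00Al1 -0.4 Yes
Ni3F None No
"""
header, categories, rows = parse_dataset(sample)
```

A check like this can catch a misaligned column or a misspelled category before the file is imported into Magpie.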
Load data into Magpie by first creating a variable to store the data and then calling the "import" command for that variable. In the examples/simple-model.in example (which is the basis of this tutorial), composition data is loaded from a sample dataset using the commands:
data = new data.materials.CompositionDataset
data import ./datasets/small_set.txt
The first command creates a variable representing a CompositionDataset object and names it "data." All of the available commands for this variable are listed here. In general, you can find the available commands for any variable type from the Variables documentation page. As described in the referenced documentation pages, the "import" command of data is called with the path of the dataset file as an argument, as shown in the second command.
After running these commands, the composition and measured properties of each of the materials described in "small_set.txt" are stored in the data variable. To specify that the formation energy (which is named "delta_e" in the data file) is the desired class variable, run the "target" command:
data target delta_e
The next step towards building a model is to generate attributes. By default, the CompositionDataset variable will compute the attributes described in a recent (as yet unpublished) paper by Ward et al. Some of these attributes are based on the properties of the constituent elements. So, to compute attributes, it is first necessary to define where the elemental property lookup tables are located and then define the elemental properties to be considered. That is accomplished by calling two "attributes" commands of the data variable:
data attributes properties directory ./Lookup Data/
data attributes properties add set general
Once these settings are defined, attributes are computed by calling:
data attributes generate
The "data" variable now contains 145 attributes describing the composition and the measured formation energy as the class variable. This information can be saved to disk using the save command. As described in the documentation for the text interface, this command takes the name of the variable as the first argument, the desired filename as the second, and (optionally) the desired format as the third. To save in CSV format, the command is
save data delta_e csv
If you have done the previous parts of the tutorial, you now have a file named "delta_e.csv" that contains the attributes and formation energy of a few hundred crystalline compounds. In this part of the tutorial, we will describe one method for finding a suitable machine learning algorithm for this data and creating a model in Magpie.
Most of the machine learning algorithms available through Magpie are provided by other software packages, such as Weka and scikit-learn. For simplicity, this tutorial only describes how to use Weka and will only briefly skim over the features of Weka. If you want to learn more about these packages, it is strongly recommended that you read their associated documentation.
Weka provides an excellent graphical interface for testing the performance of the variety of algorithms available through it. Again, for the purpose of brevity, this tutorial assumes that you have learned how to import data and run models in Weka (e.g., by reading the textbook). Once you have settled on the algorithm that works best for your materials problem, the only information you need to save is the name of the algorithm and the desired settings. Luckily, Weka makes this easy. Simply right-click on the name of the model and select "Copy configuration to clipboard".
The appropriate variable type for regression models using Weka is WekaRegression. As described in the variable description page for WekaRegression (see here), the "Usage" for this variable type is the name of a Weka algorithm followed by the settings for the algorithm. Broadly, "Usage" statements describe the options for creating a certain object (e.g., the parameters for a model). To create a Weka model, paste the configuration from Weka as these options:
model = new models.regression.WekaRegression weka.classifiers.trees.REPTree &
-M 2 -V 0.001 -N 3 -S 1 -L -1 -I 0.0
Note: To make the command look cleaner, it is split onto two lines, using the "&" to mark that the lines should be combined.
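The effect of the "&" marker can be illustrated with a short Python sketch. This only mimics the behavior described in the note above; it is not Magpie's actual parser, and the function name join_continuations is hypothetical.

```python
def join_continuations(lines, marker="&"):
    """Merge each line ending in `marker` with the line(s) that follow it."""
    joined, pending = [], ""
    for line in lines:
        stripped = line.strip()
        if stripped.endswith(marker):
            # Drop the marker and keep accumulating the command.
            pending += stripped[:-len(marker)].rstrip() + " "
        else:
            joined.append(pending + stripped)
            pending = ""
    if pending:
        # A trailing marker with nothing after it: keep what was collected.
        joined.append(pending.rstrip())
    return joined


script = [
    "model = new models.regression.WekaRegression weka.classifiers.trees.REPTree &",
    "-M 2 -V 0.001 -N 3 -S 1 -L -1 -I 0.0",
]
commands = join_continuations(script)
```

Here commands holds a single entry: the model definition followed by its Weka options on one line.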
To train this model, first create an input file containing all of the commands from the previous parts of this example (or just copy them from examples/simple-model.in). The file extension does not matter. Then, add the above command to the end of the input file and the following command for training the model:
model train $data
The train command for variables that represent models will train the model using a dataset stored in another variable. When a variable is used in a command from another variable, it is accessed by putting a "$" in front of the name of the variable (e.g., "$data" to access the variable data).
Once the model is trained, you can print out the training statistics (which are automatically computed) using the print command. Like the save command, the first argument is a variable name, which is followed by the desired print command. To print training statistics of the variable "model", this is:
print model training stats
Likewise, one can perform 10-fold cross-validation and print the validation statistics by the two commands:
model crossvalidate $data 10
print model validation stats
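For readers unfamiliar with k-fold cross-validation, the Python sketch below shows the general idea: the entries are split into k groups, and each group is held out once while the model is trained on the rest. This illustrates the technique only and is not Magpie's implementation. The function name k_fold_indices is hypothetical, and the entries are not shuffled here, although real implementations usually shuffle them first.

```python
def k_fold_indices(n_entries, k=10):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_entries))
    # Distribute the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_entries, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example: 25 entries split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

Every entry appears in exactly one test fold, so the validation statistics are computed from predictions on data the model did not see during training.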
This model and all of the associated statistics can be saved in a system-independent format by calling the save command without any format argument. It is also prudent to save an empty copy of the training data object, which can be used to compute attributes for a new dataset. The commands for this are:
save model delta_e-model
template = data clone -empty
save template delta_e-data
After running this script, Magpie will save two files, delta_e-model.obj and delta_e-data.obj, that contain all of the information necessary to use your model and can be run on any system with Magpie installed.
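If you prefer to drive Magpie from another program, the commands from this part of the tutorial can be written to an input file and passed to Magpie on the command line. The Python sketch below is one way to do that under stated assumptions: the helper names write_script and run_magpie are hypothetical, the Magpie commands and relative paths are copied verbatim from this tutorial and may need adjusting for your folder layout, and run_magpie requires Java and a Magpie folder containing dist/Magpie.jar.

```python
import subprocess
from pathlib import Path

# Commands copied from this tutorial, in the order they were introduced.
SCRIPT_LINES = [
    "data = new data.materials.CompositionDataset",
    "data import ./datasets/small_set.txt",
    "data target delta_e",
    "data attributes properties directory ./Lookup Data/",
    "data attributes properties add set general",
    "data attributes generate",
    "model = new models.regression.WekaRegression weka.classifiers.trees.REPTree &",
    "-M 2 -V 0.001 -N 3 -S 1 -L -1 -I 0.0",
    "model train $data",
    "print model training stats",
    "model crossvalidate $data 10",
    "print model validation stats",
    "save model delta_e-model",
    "template = data clone -empty",
    "save template delta_e-data",
]


def write_script(path):
    """Write the tutorial commands to `path`, one command per line."""
    script = Path(path)
    script.write_text("\n".join(SCRIPT_LINES) + "\n")
    return script


def run_magpie(script_path, magpie_dir="."):
    """Launch Magpie on the script from inside the Magpie folder."""
    return subprocess.run(
        # Resolve the script path so it stays valid after changing directory.
        ["java", "-jar", "dist/Magpie.jar", str(Path(script_path).resolve())],
        cwd=magpie_dir,
        check=True,
    )
```

Calling run_magpie(write_script("delta_e.in"), magpie_dir=...) with the location of your Magpie folder is equivalent to running java -jar dist/Magpie.jar on the script by hand.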
If you have completed the other steps of this tutorial, you now have two Magpie object files (named delta_e-model.obj and delta_e-data.obj) that can be used to compute the formation energy of crystalline compounds. If you would prefer to skip those steps, simply add the save commands described at the end of the "Building a Model" section to the end of the examples/simple-model.in script, and run it. This portion of the tutorial shows how to use those components on a new dataset.
The first step is to create a dataset in which to store the search space. To do so, first launch Magpie and load the object stored in delta_e-data.obj using the load command:
search = load delta_e-data.obj
This command creates an empty dataset named "search." While this dataset does not contain any entries, it does contain all of the settings necessary to compute the same attributes used to train the model.
The next step is to generate a search space using the IonicCompoundGenerator entry generator. As described in the "Usage" statement in the Javadoc, this entry generator takes 4 arguments: the minimum and maximum number of constituents, the maximum number of atoms per formula unit, and a list of elements to use. The appropriate command to generate all ternary ionic compounds with a maximum of 10 atoms per formula unit composed of Li, Fe, Ni, Zr, O, or S is:
search generate IonicCompoundGenerator 3 3 10 Li Fe Ni Zr O S
This should have generated a search space of 154 compounds. Since the search variable contains all of the settings for computing the attributes used when creating the formation energy model, you can call the "attributes generate" command without first specifying those options. Now, load and run the model by calling:
model = load delta_e-model.obj
model run $search
At this point, you can save the results of the model by saving the search variable in "stats" format or use the "rank" command of the search variable to print out the entries with the lowest formation energy.