Microsoft Academic Graph: practical experience
Updated: Apr 11
In this article we briefly summarize the steps we followed while testing Microsoft Academic Graph (MAG). Although the official Microsoft documentation is quite clear, we find it important to share our practical experience, along with some tips and tricks that helped us cope with certain challenges.
Get Microsoft Academic Graph on Azure storage
Unlike classical Azure Services this resource requires a written request for access. Here are the steps that we followed to get academic data on our storage.
1. Create an Azure Storage account to contain the MAG data. It is recommended to create a separate resource group or even a separate subscription.
a. It is important to note Azure storage name and primary access key for later use
2. Sign up for MAG provisioning by sending an email to firstname.lastname@example.org
a. All the required items to be covered in an email are listed in the official documentation
b. We provided all the needed information and sent an email with the subject "Access request to Microsoft Academic Graph (MAG) on Azure Storage (AS) distribution preview"
So, what comes next?
It took less than a day to receive the data, together with an email containing a detailed explanation of how to get started, useful links, and the analytics and visualization samples that the Microsoft Academic API Team used for their WWW Conference Analytics blog post.
After Microsoft approved our request, the Azure Storage account we had created earlier was set up to receive MAG updates through Azure Data Factory. Each MAG dataset is provisioned to a separate container named "mag-yyyy-mm-dd" and is pushed to the developer's Azure Storage.
MAG comes with the ODC-BY license. Users are granted the right to add value and redistribute derivatives. Important: these derivatives must comply with the terms of the open data license.
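Since every release lands in its own "mag-yyyy-mm-dd" container, a script that consumes MAG usually needs to find the most recent one. The following is a minimal Python sketch of that selection, assuming the container names have already been listed by some other means; the function names and the sample names in the usage note are hypothetical and are not part of any Azure SDK.

```python
from datetime import datetime

PREFIX = "mag-"


def parse_mag_container(name):
    """Return the release date encoded in a 'mag-yyyy-mm-dd' name, or None."""
    if not name.startswith(PREFIX):
        return None
    try:
        return datetime.strptime(name[len(PREFIX):], "%Y-%m-%d").date()
    except ValueError:
        # Starts with 'mag-' but is not followed by a valid date.
        return None


def latest_mag_container(names):
    """Pick the container with the most recent release date, or None."""
    dated = []
    for name in names:
        release = parse_mag_container(name)
        if release is not None:
            dated.append((release, name))
    return max(dated)[1] if dated else None
```

For example, given the made-up names "mag-2019-03-02", "logs" and "mag-2019-04-11", `latest_mag_container` returns "mag-2019-04-11" and ignores the container that does not follow the naming pattern.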
The list of folders in the container:
Set up Azure Data Lake Analytics for Microsoft Academic Graph
According to the documentation Azure Data Lake Analytics is needed to run U-SQL scripts on Microsoft Academic Graph. Here are the steps we followed to prepare this part.
1. Create Azure Data Lake Analytics account
a. Important: it is also necessary to create a new Data Lake Storage Gen1 account.
b. Tip: Azure recommends applying a specific naming convention to easily distinguish different resources, e.g. resourcename-adla for data lake analytics and resourcename-adls for the corresponding data lake storage
c. Both accounts require a globally unique name
2. Add new data source to the Data Lake Analytics account
a. (Settings) Data sources > add data sources
b. This is needed to adapt the Data Lake for MAG usage
c. Use the storage created for MAG provisioning
At this step the following data sources should be present on the Data Lake Analytics Account
resourcename-adls (default) - Azure Data Lake Storage Gen1
magascompanyname - Azure Storage
In order to practice U-SQL before setting up Azure Search, we decided to implement author h-index computing on Azure Data Lake Analytics.
Compute author h-index using Azure Data Lake Analytics (U-SQL)
At this step the goal was to:
extract data from Azure Storage
compute h-index
save the result in Azure Data Lake Storage
1. Define functions extracting MAG data from Azure Storage
a. Go to ADLA and add new job
b. Copy code from samples/CreateFunctions.usql
The samples folder is in the blob container holding the MAG data: mag-yyyy-mm-dd/samples
These functions read the txt files and parse them as TSV, because the columns are separated by tabs. Each function takes the base path to the Azure Storage as a parameter and appends the file name to it. Each file name corresponds to a separate table. For instance, here the Affiliations.txt file is parsed. For those who have not received the MAG data yet, here's the function code.
N.B. float? means “nullable float”. These types are inherited from C#.
Quote from the official documentation
Nullable types are instances of the System.Nullable<T> struct. Nullable types can represent all the values of an underlying type T, and an additional null value. The underlying type T can be any non-nullable value type. T cannot be a reference type.
c. Set job name to CreateFunctions and click on submit
d. This job contains no graphs as there is no data flow at this step
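To make the parsing idea concrete outside of U-SQL, here is a minimal Python sketch of reading one tab-separated line into a typed row, with empty fields mapped to None in the same spirit as float? in U-SQL. It is not the code from samples/CreateFunctions.usql, and the four-column layout is a hypothetical subset chosen for illustration, not the actual schema of Affiliations.txt.

```python
from typing import NamedTuple, Optional


class AffiliationRow(NamedTuple):
    # Hypothetical column subset, for illustration only.
    affiliation_id: int
    display_name: str
    latitude: Optional[float]   # plays the role of float? in U-SQL
    longitude: Optional[float]


def parse_nullable_float(text):
    """Map an empty field to None, anything else to a float."""
    return float(text) if text != "" else None


def parse_affiliation_line(line):
    """Split one tab-separated line and convert each field to its type."""
    fields = line.rstrip("\r\n").split("\t")
    return AffiliationRow(
        affiliation_id=int(fields[0]),
        display_name=fields[1],
        latitude=parse_nullable_float(fields[2]),
        longitude=parse_nullable_float(fields[3]),
    )
```

Only the line terminator is stripped before splitting, so a trailing empty column is preserved and becomes None instead of shifting the remaining fields.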
Compute author h-index
The h-index is an author-level metric that attempts to measure both the productivity and citation impact of the publications of a scientist or scholar. The index is based on the set of the scientist's most cited papers and the number of citations that they have received in other publications. The index can also be applied to the productivity and impact of a scholarly journal as well as a group of scientists, such as a department or university or country. The index was suggested in 2005 by Jorge E. Hirsch, a physicist at UC San Diego, as a tool for determining theoretical physicists' relative quality and is sometimes called the Hirsch index or Hirsch number.
(quote from Wikipedia )
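In other words, an author has index h when h of their papers have at least h citations each, and h is the largest number for which that holds. A minimal Python sketch of this definition, for a single author whose per-paper citation counts are already known, could look as follows; it illustrates the metric only and is not the U-SQL script from the MAG samples.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            # Counts are sorted in descending order, so no later paper can qualify.
            break
    return h
```

For citation counts of 10, 8, 5, 4 and 3 the function returns 4: four papers have at least four citations, but there are not five papers with at least five.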
Again, at this step the goal is to create an ADLA job. However, this time the job computes the author h-index and saves the output to Azure Data Lake Storage (ADLS).
1. Launch the script to perform the computing
a. Go to ADLA and add new job
Here's the code
N.B.
<AzureStorageAccount> = name of the Azure Storage (AS) account containing the MAG dataset. This value is used as a base file path in the previous section.
<MagContainer> = the container name in the Azure Storage (AS) account containing the MAG dataset, usually in the form of mag-yyyy-mm-dd.
Outputters.Tsv saves the generated table in TSV format.
Important: for this job Microsoft recommends increasing the number of Analytics Units (AUs) to 5. However, more AUs are not always more efficient. To learn how AUs relate to the performance and cost of your job read the following article.
During the job execution a graph on the right panel demonstrates the MAG data flow.
At this step the following jobs are present on the ADLA account
The output of the job goes to "/Output/AuthorHIndex.tsv" in the Azure Data Lake Storage (ADLS).
N.B. AuthorHIndex.tsv contains no column names: the columns are not labeled with the corresponding field names and simply appear as 1, 2, etc.
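Because the file has no header row, any downstream consumer has to supply the column names itself. The Python sketch below shows one way to do that with the standard csv module. The column names in ASSUMED_COLUMNS and the sample rows are hypothetical; the real names and their order should be taken from the SELECT list of the script that produced the file.

```python
import csv
import io

# Hypothetical names: replace with the columns your script actually outputs.
ASSUMED_COLUMNS = ["AuthorId", "DisplayName", "HIndex"]


def read_author_h_index(tsv_file, columns=ASSUMED_COLUMNS):
    """Read a header-less TSV file into a list of dicts keyed by column name."""
    reader = csv.DictReader(tsv_file, fieldnames=columns, delimiter="\t")
    return list(reader)


# Made-up sample standing in for a downloaded copy of AuthorHIndex.tsv.
sample = io.StringIO("1\tAda Example\t12\n2\tBob Example\t7\n")
rows = read_author_h_index(sample)
```

All values are returned as strings, so numeric columns such as the h-index still need an explicit conversion (for example with int) before sorting or aggregating.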
In the next article we will summarize full-text search using Azure Search. We hope this was helpful.
Special thanks go to Tasnime Omrani, who actively participated in the experiment.