Document Types in Microsoft Academic Graph

I’ve had a chance to play with Microsoft Academic and Microsoft Academic Graph (MAG) for a while and wanted to document some baby steps here. Describing the two “services” in detail is stuff for another post; suffice it to say that Microsoft Academic is like Google Scholar, just richer (semantic search, annotation with topics inferred with NLP techniques, etc.), and MAG is the underlying data. The important point is that the MAG raw data is freely accessible via Azure, whereas Google Scholar is closed.

To get started with MAG, follow the steps on how to deploy the data on Azure storage. After that you will have roughly 500 GB of raw MAG data in your storage account. Next, set up Azure Data Lake Analytics and you’re good to go. In the following I describe how to summarize document types in MAG, first with U-SQL and then with Azure Databricks. If you are looking for something more sophisticated, there are great tutorials on how to compute the h-index with MAG on the official website. For more examples and a comparison of MAG with other offerings, have a look at this awesome list.

Count Document Types with U-SQL

U-SQL is an SQL variant developed by Microsoft that can efficiently analyze data across relational stores, including data lakes, SQL databases etc. A nice U-SQL introduction can be found here. Using the Azure portal you can submit U-SQL batch jobs that scale instantly without having to manage any infrastructure.

To submit a job that summarizes document types in MAG, simply:

1. Head over to the Azure Data Lake Analytics account you just created
2. Click on new job
3. Give the job a name
4. Copy and paste the code from this Gist
5. Replace blobAccount and dataVersion values with your account details
6. Change AU to, say, 2 (see here for details on AUs)
7. Hit submit

That’s it. To view the created output file, click on Data/Outputs and then the file itself (TypeCounts.csv). It will look roughly as follows:

Screenshot for DLA Data Outputs

Type	Counts
Book	1095185
Journal	82164726
	78297141
Patent	45486148
BookChapter	2544813
Conference	4385543
Dataset	39424
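To put these counts into perspective, here is a minimal Python sketch that turns them into percentage shares. The numbers are copied from the table above; the tab-separated layout, the "(none)" label for the row without a type, and the helper names load_type_counts and shares are my own assumptions for illustration, not part of the Gist or the actual file format of TypeCounts.csv.

```python
import csv
import io

# Counts copied from the TypeCounts output shown above. The tab-separated
# layout is an assumption; adjust the delimiter to match your actual file.
TYPE_COUNTS_TSV = """Type\tCounts
Book\t1095185
Journal\t82164726
\t78297141
Patent\t45486148
BookChapter\t2544813
Conference\t4385543
Dataset\t39424
"""


def load_type_counts(text):
    """Parse 'Type<TAB>Counts' rows into a dict of type -> count."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # An empty Type means the paper has no document type; label it explicitly.
    return {row["Type"] or "(none)": int(row["Counts"]) for row in reader}


def shares(counts):
    """Return each type's share of all papers, in percent."""
    total = sum(counts.values())
    return {doc_type: 100.0 * n / total for doc_type, n in counts.items()}


counts = load_type_counts(TYPE_COUNTS_TSV)
percentages = shares(counts)

# Print the types from largest to smallest share.
for doc_type in sorted(counts, key=counts.get, reverse=True):
    print(f"{doc_type:<12}{counts[doc_type]:>12,}{percentages[doc_type]:>7.1f}%")
```

The row without a type is worth a second look: it is the second largest group, so a large part of MAG has no document type assigned at all.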

Count Document Types with Azure Databricks

While the above is for batch jobs, you would use Azure Databricks for more interactive work on Data Lake. Azure Databricks is an Apache Spark based analytics platform optimized for Azure. It features interactive workspaces with Jupyter-style notebooks, automated Spark cluster management and effortless integration with a wide variety of data stores and services. Follow these simple steps to create a workspace and an autoscaling Spark cluster. Below is a screenshot showing how effortless the cluster creation is:

Screenshot of Databricks cluster creation

That’s really all there is to it.

With this in place, we’ll repeat the same analysis as above, this time in Python/PySpark. First, import PySparkMagClass.py (this provides a convenience class called MicrosoftAcademicGraph) into a newly created notebook (attached directly to the Spark cluster you just created) and follow the steps there down to the section called Define configuration variables.

Now create a MAG instance, load the ‘Papers’ dataframe and list the first entries:

mag = MicrosoftAcademicGraph(container=MagContainer, account=AzureStorageAccount, key=AzureStorageAccessKey)
papers = mag.getDataframe('Papers')
papers.show(10)

Screenshot of loading of Papers Dataframe

Now extract the document types:

docTypeCounts = papers.select(papers.DocType).groupBy(papers.DocType).count()
display(docTypeCounts.na.fill('Others').orderBy("count"))
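If you don’t have a cluster at hand, the logic of the two lines above can be mimicked in plain Python. This is only a sketch of what the chain computes (count per type, relabel the missing type as 'Others', sort ascending by count), not how Spark executes it; the sample values and the function name count_doc_types are made up for illustration.

```python
from collections import Counter

# Hypothetical sample of DocType values; None stands for a paper without a type.
sample_doc_types = ["Journal", None, "Patent", "Journal", None, "Book", "Journal"]


def count_doc_types(doc_types, fill="Others"):
    """Count each type, relabel the missing type, and sort ascending by count."""
    counts = Counter(doc_types)  # corresponds to groupBy(...).count()
    # Corresponds to na.fill('Others'): replace the missing label after counting.
    filled = [(fill if t is None else t, n) for t, n in counts.items()]
    # Corresponds to orderBy("count"), which sorts in ascending order.
    return sorted(filled, key=lambda pair: pair[1])


result = count_doc_types(sample_doc_types)
for doc_type, n in result:
    print(doc_type, n)
```

As in the Spark version, the fill happens after counting, so all papers without a type end up in a single 'Others' row.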

The data is displayed as a table by default:

Screenshot of the table

But the cool thing is that you can immediately convert it into a variety of plots by clicking on the plot icon (second from the left at the bottom):

Screenshot of the pie chart

And that’s all I wanted to show.