• Alibek Jakupov

Microsoft Academic Graph: practical experience, step 2

Updated: May 1





Here are the steps implemented in the previous article:

  • Get Microsoft Academic Graph on Azure storage

  • Set up Azure Data Lake Analytics for Microsoft Academic Graph

  • Compute author h-index using Azure Data Lake Analytics (U-SQL)

In today's article we are going to add Azure Search service to implement full-text search on the MAG data.



Define functions to extract MAG data


1.      Add a new job to Azure Data Lake Analytics (ADLA)

a.      Copy the code from samples/CreateFunctions.usql


The code can be found in the previous article as well as on GitHub.



Generate text documents for academic papers


The goal of this step is to submit an ADLA job to generate text files containing academic data that will be used to create an Azure Search service.


Before creating the job, it is necessary to create a Blob storage account to receive the output files.

Here is the Storage Account configuration:


  • Subscription: Your-subscription

  • Resource group: Your-resource-group

  • Location: e.g. (Europe) West Europe

  • Storage account name: academicoutput (this value will be used in a future job)

  • Deployment model: Resource manager

  • Account kind: StorageV2 (general purpose v2)

  • Replication: Read-access geo-redundant storage (RA-GRS)

  • Performance: Standard

  • Access tier (default): Hot

  • Secure transfer required: Enabled

  • Allow access from: All networks

  • Hierarchical namespace: Disabled

  • Blob soft delete: Disabled


Next, it is necessary to create an output container. From the newly created resource, go to Blobs and click on +Container. The name of the output container defined in this experiment is “academic-output-container”.


Finally, it is necessary to add this blob as a data source.

Go to the ADLA account, click on Data Sources, then click Add Data Source.



In the dialog menu provide the following information:


  • Storage type: Azure Storage

  • Selection method: Select Account

  • Azure storage: the newly created storage account (academicoutput in this case).

Important: if you don’t add the Azure Storage account as a data source, an exception will be thrown during each execution.


At this step, the following data sources should be present in the ADLA account:

  • Blob with MAG data

  • ADLS (Azure Data Lake Storage) account

  • Blob for output data



At this step everything is ready to add a new job to ADLA service.

Add a new job with the following parameters:

<MagAzureStorageAccount> = Blob storage Account containing mag data (same as in the previous article)

<MagContainer> = mag-yyyy-mm-dd (same as in the previous section)

<OutputAzureStorageAccount> = name of the newly created Blob storage account (academicoutput in this experiment)

<OutputContainer> = newly created container in the output storage (academic-output-container in this experiment)


N.B. It is recommended to adjust the AUs (Analytics Units) value before launching the job.


Here is the execution summary.

AUs: 32, input: 61.8 GB, output: 38 GB, estimated cost: EUR 9.14, efficiency: 67%, preparing: 49s, running: 13m 25s, duration: 14m 14s.



Create Azure Search service 


1.      From the Azure Portal, create a resource -> Azure Search

Important: Create a new resource group for the service with the same name as the service. In this experiment we called the newly created resource ‘academic-search’.

Important: to ensure the best performance, use the same location as the Azure storage account containing the Microsoft Academic Graph data.

In this experiment we chose the Free tier; note that only ONE Free tier service is available per subscription.

2.      Once the new service has been created, navigate to the overview section of the service and get the URL

3.      Navigate to the keys section of the service and get the primary admin key



Configure initial Postman request and create data source


In Postman (to download the application follow the link: https://www.getpostman.com/downloads/) provide the following information:


url: url-obtained-from-previous-section/datasources?api-version=2019-05-06


N.B. API versions may be found in the “Search Explorer” menu of Azure Search



method: post


Headers

api-key: primary-admin-key from previous section

Content-type: application/json


Body

{

   "name" : "azure-search-data",

   "type" : "azureblob",

   "credentials" : { "connectionString" : "<AzureStorageAccountConnectionString>" },

   "container" : { "name" : "<MagContainer>", "query" : "azure-search-data" }
}


The connection string should point to the output Blob storage account (academicoutput in this experiment).

The <MagContainer> value here is the output container created earlier (academic-output-container in this experiment).

You should receive a "201 created" response.
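The same data-source request can also be scripted instead of issued from Postman. Here is a minimal sketch using only the Python standard library; the service URL and admin key are placeholders you would substitute with the values obtained earlier:

```python
import json
import urllib.request

# Placeholders - substitute with your own values from the previous sections.
SERVICE_URL = "https://<your-service>.search.windows.net"
API_KEY = "<primary-admin-key>"
API_VERSION = "2019-05-06"

def build_datasource_payload(storage_connection_string, container_name):
    """Build the JSON body for the data-source creation request."""
    return {
        "name": "azure-search-data",
        "type": "azureblob",
        "credentials": {"connectionString": storage_connection_string},
        "container": {"name": container_name, "query": "azure-search-data"},
    }

def create_datasource(payload):
    """POST the payload; an HTTP 201 response means the data source was created."""
    req = urllib.request.Request(
        f"{SERVICE_URL}/datasources?api-version={API_VERSION}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"api-key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

The function names here are illustrative, not part of any SDK; the REST endpoint and headers match the Postman request above.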



Create index


Here are the request details to create the index:


url: url-obtained-from-previous-section/indexes?api-version=2019-05-06

method: post


Headers

api-key: primary-admin-key from previous section

Content-type: application/json


Body

{

   "name": "mag-index", 

   "fields": [

       {"name": "id", "type": "Edm.String", "key": true, "filterable": false, "searchable": false, "sortable": false, "facetable": false},

       {"name": "rank", "type": "Edm.Int32", "filterable": true, "searchable": false, "facetable": false, "sortable": true},

       {"name": "year", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},

       {"name": "journal", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},

       {"name": "conference", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},

       {"name": "authors", "type": "Collection(Edm.String)", "filterable": true, "searchable": true, "facetable": false, "sortable": false},

       {"name": "volume", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},

       {"name": "issue", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},

       {"name": "first_page", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},

       {"name": "last_page", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},

       {"name": "title", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},

       {"name": "doi", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false}

   ]
}
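Since most of the text fields above share the same flags, the schema can also be generated programmatically rather than typed by hand. A sketch (the helper names are hypothetical, only the resulting JSON matters):

```python
def make_field(name, filterable=False, ftype="Edm.String"):
    """One searchable, non-sortable, non-facetable field for the mag-index schema."""
    return {"name": name, "type": ftype, "searchable": True,
            "filterable": filterable, "sortable": False, "facetable": False}

def build_index_schema():
    """Reproduce the mag-index schema from the request body above."""
    fields = [
        # The key and rank fields have unique flag combinations.
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False,
         "filterable": False, "sortable": False, "facetable": False},
        {"name": "rank", "type": "Edm.Int32", "searchable": False,
         "filterable": True, "sortable": True, "facetable": False},
        make_field("year", filterable=True),
        make_field("journal", filterable=True),
        make_field("conference", filterable=True),
        make_field("authors", filterable=True, ftype="Collection(Edm.String)"),
    ]
    # The remaining fields are plain searchable strings.
    fields += [make_field(n) for n in
               ("volume", "issue", "first_page", "last_page", "title", "doi")]
    return {"name": "mag-index", "fields": fields}
```

Serializing `build_index_schema()` to JSON yields the same body as the Postman request above.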



Create indexers


Here are the request details to create an indexer:


url: url-obtained-from-previous-section/indexers?api-version=2019-05-06

method: post


Headers

api-key: primary-admin-key from previous section

Content-type: application/json


Body

{

   "name" : "mag-indexer-1",

   "dataSourceName" : "azure-search-data",

   "targetIndexName" : "mag-index",

   "schedule" : {

       "interval" : "PT5M"

   },

   "parameters" : {

       "configuration" : {

           "parsingMode" : "delimitedText",

           "delimitedTextHeaders" : "id,rank,year,journal,conference,authors,volume,issue,first_page,last_page,title,doi",

       "delimitedTextDelimiter": "\t",

           "firstLineContainsHeaders": false,

           "indexedFileNameExtensions": ".0"

       }

   }
}

This will create one indexer for files whose names end in the .0 extension. It is recommended to create six indexers, each targeting a specific subset of the text documents generated earlier.



Thus, it is necessary to repeat the procedure 5 more times, changing the indexedFileNameExtensions value ([.0, .1, .2, .3, .4, .5]) and the indexer name each time.
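Because the six indexer bodies differ only in the name and extension, they can be generated in a loop rather than edited by hand, for example (a sketch; the helper name is illustrative):

```python
HEADERS = ("id,rank,year,journal,conference,authors,"
           "volume,issue,first_page,last_page,title,doi")

def build_indexer_payload(i):
    """Indexer i+1 targets the text files whose names end in .i (i = 0..5)."""
    return {
        "name": f"mag-indexer-{i + 1}",
        "dataSourceName": "azure-search-data",
        "targetIndexName": "mag-index",
        "schedule": {"interval": "PT5M"},
        "parameters": {
            "configuration": {
                "parsingMode": "delimitedText",
                "delimitedTextHeaders": HEADERS,
                "delimitedTextDelimiter": "\t",
                "firstLineContainsHeaders": False,
                "indexedFileNameExtensions": f".{i}",
            }
        },
    }

# One payload per file-name extension .0 through .5.
indexers = [build_indexer_payload(i) for i in range(6)]
```

Each payload is then POSTed to the same /indexers endpoint as above.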


Important: as we are using the Free tier, only 3 indexers are available. Attempting to create a fourth one returns the following error:


"Indexer quota of 3 has been exceeded for this service. You currently have 3 indexers. You must either delete unused indexers first, or upgrade the service for higher limits."



Scale up the service


This step is needed to scale up the service's search units (SUs) so that all indexers can run concurrently. To do this, go to the Scale section of the service and change the number of partitions and the number of replicas.


Important: this is not available in the Free tier. Create a Standard search service for scalability and greater performance. The indexing operation can take a long time to complete, likely between 16 and 24 hours.


N.B. To see the indexer status, run the following command from Postman:


url: url-obtained-from-previous-section/indexers/[indexer name]/status?api-version=2019-05-06

method: get


Headers


api-key: primary-admin-key from previous section

Content-type: application/json


Body

None
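When monitoring six indexers, building the status URLs by hand gets tedious; a small helper (hypothetical name, standard library only) can construct them:

```python
def status_url(service_url, indexer_name, api_version="2019-05-06"):
    """Build the indexer-status URL for a GET request."""
    return f"{service_url}/indexers/{indexer_name}/status?api-version={api_version}"

# One status URL per indexer created earlier.
urls = [status_url("https://<your-service>.search.windows.net", f"mag-indexer-{i}")
        for i in range(1, 7)]
```

Each URL is requested with the same api-key header as the Postman call above.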



Reference parsing with search explorer


Here is the information that should be provided to perform reference parsing with the Azure Search REST API:


url: url-obtained-from-previous-section/indexes/mag-index/docs/search?api-version=2019-05-06

method: post


Headers

api-key: primary-admin-key from previous section

Content-type: application/json


Body

{ 

    "highlight": "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi", 

    "highlightPreTag": "<q>", 

    "highlightPostTag": "</q>", 

    "search": "Lloyd, K., Wright, S., Suchet-Pearson, S., Burarrwanga, L., Hodge, & P. (2012). Weaving lives together: collaborative fieldwork in North East Arnhem Land, Australia. Annales de Géographie, 121(5), 513–524.", 

    "searchFields": "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi", 

    "select": "id,rank,year,title",

    "top": 2
}
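The same query body can be built programmatically, so that any raw reference string can be parsed without re-editing the Postman request. A sketch (the function name is illustrative):

```python
def build_search_payload(reference_string, top=2):
    """Body for a reference-parsing query against mag-index."""
    fields = ("year,journal,conference,authors,volume,issue,"
              "first_page,last_page,title,doi")
    return {
        # Highlight matched fragments with <q>...</q> tags.
        "highlight": fields,
        "highlightPreTag": "<q>",
        "highlightPostTag": "</q>",
        "search": reference_string,
        "searchFields": fields,
        "select": "id,rank,year,title",
        "top": top,
    }
```

The payload is POSTed to /indexes/mag-index/docs/search with the same headers as the previous requests; the raw citation text goes in as the "search" value.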

Hope this was helpful.

©2018 by macnabbs.