AppNexus, later renamed Xandr, is an American cloud-based software platform that enables and optimizes programmatic online advertising. The company was founded in 2007. In 2018 it was sold to AT&T for a reported $1.6 billion. The platform provides online auction services, infrastructure, and technology for data management, optimization, financial clearing, and support for directly negotiated ad campaigns. It also has demand-side (DSP) and supply-side (SSP) platforms, and ad-serving functionality. Basically, it helps you make money off your site if you have enough visitors.
As data engineers or Python developers, we are interested in the API-related side of things. What if our site's ads go through the AppNexus SSP and we want to set up reporting? Right, we need to connect to the API, pull the data, and set up a reporting job.
Let's start coding the API query job.
When you copy the code, PLEASE make sure the code blocks' indentation carries over correctly, otherwise the code won't run. All of the code below lives inside a single function, so every block after the function definition is indented accordingly.
This code can be easily converted to an Airflow DAG and I will explain how in a separate article.
First, we will start with the imports:
import os
import time
import datetime
import subprocess
import json
import boto3
from analytics_library import get_secret, get_bucket
Let’s talk about each import separately.
import os – The os library will be used to remove the temporary files created during code execution.
import datetime – The datetime library will be used to determine today's date and the report's start and end dates.
import time – The time library is used to put the code to "sleep", that is, to wait for a certain period of time. We need that functionality because the server side takes some time to generate the report, so we have to wait.
import subprocess – The subprocess library will be used to make the API calls. We could use the requests module for this purpose as well, but for this example we will use subprocess.
import json – The json library is used to convert the API response, a byte string, into a dictionary.
import boto3 – The boto3 library (named after the boto, an Amazon River dolphin) is the official AWS library for interacting with AWS resources like S3. Because this code can be converted into an Airflow DAG as part of an Extract-Load-Transform (ELT) flow, I decided to keep the library here to show you how you can extract data from the API and load it into an S3 bucket for further processing.
from analytics_library import get_secret, get_bucket – The analytics_library is a custom library that I wrote so I don't have to write the same code repeatedly. The library is an analytics_library.py file with multiple functions. For example, the get_secret function pulls secret values from a password repository, in my case AWS Secrets Manager, and the get_bucket function returns the S3 bucket name based on the environment. There will be a separate article on how to create your own library file.
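For context, here is a minimal sketch of what such a file might look like, assuming AWS Secrets Manager for secrets and an environment variable for the environment name. The bucket names and the ENV variable are made up for illustration; your own helpers may look quite different.

# analytics_library.py -- a minimal sketch; all names here are illustrative
import os
import boto3

def get_secret(secret_name):
    # Pull a secret string from AWS Secrets Manager and return it as-is.
    client = boto3.client('secretsmanager')
    return client.get_secret_value(SecretId=secret_name)['SecretString']

def get_bucket():
    # Return an S3 bucket name based on a hypothetical ENV environment variable.
    env = os.environ.get('ENV', 'dev')
    return {'dev': 'analytics-bucket-dev', 'prod': 'analytics-bucket-prod'}[env]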
That will be it for the imports; let's continue with the code implementation.
After the imports I usually start with the function definition. If possible, I try to keep one function per .py file, and that file should perform one main task. In this article, the task is "get the data". Getting the data involves querying the API, converting the response to a CSV report, and sending the report to S3.
def api_query(**context):
    try:
        # Pull today's date from Airflow XCom if it was pushed upstream.
        today = context['ti'].xcom_pull(key='init_today')
        if today is None:
            raise KeyError('init_today was not found in XCom')
    except (KeyError, AttributeError):
        # Fall back to the current date when running outside Airflow.
        today = str(datetime.date.today())
    end = datetime.datetime.strptime(today, '%Y-%m-%d').date() - datetime.timedelta(days=1)
    start = end - datetime.timedelta(days=6)
    print(f'\nToday: {today}, start: {start}, end: {end}\n')
def api_query(**context): – is the function definition. I like to give a function a descriptive name, for example, api_query or query_api, but you can call it anything you want. The **context parameter collects any keyword arguments passed to the function into a dictionary called context. We will see the use of the context dictionary shortly.
The try-except block goes next. Inside the try block, we define a variable called today. The variable holds today's date as a string in YYYY-MM-DD format, for example '2024-01-21'. The value is pulled from the context dictionary supplied by the Airflow DAG. The context dictionary has an object called 'ti', which stands for Task Instance. The task instance object has a method called .xcom_pull (xcom stands for cross-communication) that returns the value for the supplied key, in our case init_today. If the 'init_today' key was not xcom_pushed to the 'ti' beforehand, xcom_pull returns None, so we raise a KeyError ourselves to trigger the except block. In a separate article, I will go over how to create a simple Airflow DAG (job) and how to use xcoms.
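To make the xcom part concrete, here is a rough sketch of how a hypothetical upstream Airflow task might push init_today; the task name is made up, and I will cover DAGs properly in the separate article.

def init(**context):
    # An illustrative upstream task that pushes today's date for later tasks.
    today = str(datetime.date.today())
    context['ti'].xcom_push(key='init_today', value=today)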
If the try block fails, the except block is invoked and today's date is pulled from the datetime library. The date object is then converted to a string for consistency.
After that, we define the start and end dates for the API query report. The end date is defined first by converting the today string back to a date object and subtracting one day, so yesterday's date becomes the end date. The start date is the end date minus 6 days. So if today='2024-01-21', then end='2024-01-20' and start='2024-01-14' (a 7-day range, inclusive).
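You can sanity-check the date arithmetic in a Python shell:

>>> import datetime
>>> end = datetime.datetime.strptime('2024-01-21', '%Y-%m-%d').date() - datetime.timedelta(days=1)
>>> start = end - datetime.timedelta(days=6)
>>> str(start), str(end)
('2024-01-14', '2024-01-20')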
    js = get_secret('appnexus-api-creds')
On the next line we pull login credentials from AWS Secrets Manager, the repository that holds login credentials. The get_secret is a custom function and we will discuss it in a separate article. These credentials are stored as a dictionary string. In this particular case, the get_secret function will return the following string: '{"auth":{"username":"the_user","password":"PassW0rd!"}}', where the username and password are provided by the AppNexus team. If you don't use a password manager service, you can assign the js variable directly in the code (note that it must stay a string, since it is interpolated into the cURL command below):
js = {"auth":{"username":"the_user","password":"PassW0rd!"}}
We are done with the initial setup and can continue with the authentication and report downloading.
The process consists of the following steps:
- Authenticate user
- Request the report
- Check if the report is ready
- Download the report
STEP 1. Authenticate the user
    cookie_path = '/tmp/cookies'
    crl = f"curl -b {cookie_path} -c {cookie_path} -X POST -d "
    crl += f"'{js}' 'https://api.appnexus.com/auth'"
    res = subprocess.check_output(crl, shell=True)
    res = json.loads(res)
    print('\nCompleted authentication.\n')
cookie_path is the variable that holds the path for the temporary cookie file needed for authentication. In this case, the cookies will be stored in the /tmp directory, in a file called cookies without an extension.
crl is a string variable that holds the cURL command. The command is executed by the subprocess module right after the crl variable is built. The cURL command is split across two lines for readability, and it does the following: "send a POST request with a payload (user/password) to the provided URL and read/write the cookies."
subprocess – the module executes the curl command using the check_output function. The shell=True parameter is needed because we pass the command as a single string.
res – subprocess returns the command output into the variable res. You don't need to do anything with this value, but if you want to check whether the call succeeded, convert res to a dictionary using the json library: res = json.loads(res). In other words, the json.loads call is optional here. The last line notifies us that authentication is complete.
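As mentioned in the imports section, you could use the requests module instead of subprocess and cURL. Here is a minimal sketch of the same authentication step, assuming the same js credentials string and an added import requests at the top; a requests.Session plays the role of the cookie file.

import requests

session = requests.Session()  # keeps cookies between calls, like the -b/-c flags
res = session.post('https://api.appnexus.com/auth', data=js).json()
print('\nCompleted authentication.\n')

Subsequent session.post and session.get calls would then reuse the authentication cookies automatically.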
STEP 2. Report Request
First, we are going to put together the body of the report request:
rreq = '{"report": {"report_type": "network_analytics",'
rreq += f'"start_date": "{start} 00:00:00",'
rreq += f'"end_date": "{end} 00:00:00",'
rreq += '"timezone": "UTC",'
rreq += '"columns": ["day","placement_name","site_name","publisher_name",'
rreq += '"mediatype","imps","clicks","revenue"],'
rreq += '"group_filters": [{"imps": {"value": 0,"operator": ">"}}],'
rreq += '"format": "csv"}}'
rreq – is a string variable that holds the string version of a dictionary used as the request payload. The dictionary has one key, report. Its value is another dictionary with report parameters such as report_type, filters, and the start/end dates, which we pass dynamically via f-strings. The group filter discards rows with zero impressions, since they add nothing to the report.
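A quick side note before the next request: instead of concatenating strings, you could build the same payload as a regular Python dictionary and serialize it with json.dumps (we already import json). A sketch:

report = {
    "report": {
        "report_type": "network_analytics",
        "start_date": f"{start} 00:00:00",
        "end_date": f"{end} 00:00:00",
        "timezone": "UTC",
        "columns": ["day", "placement_name", "site_name", "publisher_name",
                    "mediatype", "imps", "clicks", "revenue"],
        "group_filters": [{"imps": {"value": 0, "operator": ">"}}],
        "format": "csv",
    }
}
rreq = json.dumps(report)  # produces the same JSON string, but is harder to get wrong

Either way, rreq ends up holding the same JSON string.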
crl = f"curl -b {cookie_path} -c {cookie_path} -X POST "
crl += f"-d '{rreq}' 'https://api.appnexus.com/report'"
res = subprocess.check_output(crl, shell=True)
res = json.loads(res)
rid = res['response']['report_id']
print(f'\nReport {rid} has been requested\n')
time.sleep(2)
Now that we have created the request body, we can build the actual cURL request, just like we did in the authentication step above.
The first two lines assign a string value to the crl variable. The string means: "send a POST request with a data payload to the AppNexus endpoint and save/read cookies."
The next two lines do the same as in the authentication step – send the request and convert the response to a dictionary.
The rid = res['response']['report_id'] line extracts the report ID from the response dictionary. Then we print a message that the report with this ID has been requested.
time.sleep(2) – waits 2 seconds before the next line is executed. We need that delay to give the AppNexus server time to process the report request.
STEP 3. Check if the report is ready
sts = ""
ct = 0
while sts != "ready" and ct < 5:
crl = f"curl -b {cookie_path} -c {cookie_path} "
crl += f"'https://api.appnexus.com/report?id={rid}'"
res = subprocess.check_output(crl, shell=True)
res = json.loads(res)
sts = res['response']['execution_status']
print(f'\nReport Status: {sts}\n')
time.sleep(5)
ct += 1
if ct >= 5:
raise Exception (f'failed to pull API report: "{sts}".')
In the first two lines, we declare the variables sts and ct, followed by the while loop. The sts variable holds the report's status and ct counts the attempts. The while loop keeps running while the status is not "ready", for at most 5 attempts.
Inside the loop we poll the status endpoint, incrementing the counter by 1 on each iteration.
After the loop, we check whether the status ever reached "ready". If it did not within 5 attempts, we raise an exception because something must be wrong; the report should be ready well within 5 polls.
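One optional tweak, in case your reports regularly take longer than the roughly 25 seconds this loop allows: instead of raising the attempt count, you could grow the delay between polls. This is my own suggestion, not part of the AppNexus API; a standalone sketch of the same loop with a doubling wait:

    sts = ""
    ct = 0
    delay = 5
    while sts != "ready" and ct < 5:
        crl = f"curl -b {cookie_path} -c {cookie_path} "
        crl += f"'https://api.appnexus.com/report?id={rid}'"
        res = json.loads(subprocess.check_output(crl, shell=True))
        sts = res['response']['execution_status']
        print(f'\nReport Status: {sts}\n')
        time.sleep(delay)
        delay *= 2  # wait 5, 10, 20, 40, 80 seconds between attempts
        ct += 1
    if sts != "ready":
        raise Exception(f'failed to pull API report, last status: "{sts}".')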
This concludes the first part of the article. I am trying to keep these at a reasonable length. The next part is coming soon!